Rust-GPU/rust-cuda
Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
Healthy across all four use cases
- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license and CI — clean foundation to fork and modify.
- Documented and popular — useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 1w ago
- ✓9 active contributors
- ✓Apache-2.0 licensed
- ✓CI configured
- ⚠Concentrated ownership — top contributor handles 61% of recent commits
- ⚠No test directory detected
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/rust-gpu/rust-cuda)
Paste at the top of your README.md — renders inline like a shields.io badge.
Preview social card (1200×630): this card auto-renders when someone shares https://repopilot.app/r/rust-gpu/rust-cuda on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: Rust-GPU/rust-cuda
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Rust-GPU/rust-cuda shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit 1w ago
- 9 active contributors
- Apache-2.0 licensed
- CI configured
- ⚠ Concentrated ownership — top contributor handles 61% of recent commits
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live Rust-GPU/rust-cuda
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/Rust-GPU/rust-cuda.
What it runs against: a local clone of Rust-GPU/rust-cuda — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Rust-GPU/rust-cuda | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 39 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Rust-GPU/rust-cuda. If you don't
# have one yet, run these first:
#
# git clone https://github.com/Rust-GPU/rust-cuda.git
# cd rust-cuda
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of Rust-GPU/rust-cuda and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "Rust-GPU/rust-cuda(\.git)?\b" \
&& ok "origin remote is Rust-GPU/rust-cuda" \
|| miss "origin remote is not Rust-GPU/rust-cuda (artifact may be from a fork)"
# 2. License matches what RepoPilot saw (the repo is dual Apache/MIT,
# so also check LICENSE-APACHE)
(grep -qiE "(Apache License|Apache-2\.0)" LICENSE LICENSE-APACHE 2>/dev/null \
|| grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
&& ok "license is Apache-2.0" \
|| miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
&& ok "default branch main exists" \
|| miss "default branch main no longer exists"
# 4. Critical files exist
test -f "Cargo.toml" \
&& ok "Cargo.toml" \
|| miss "missing critical file: Cargo.toml"
test -f "crates/cuda_std/src/lib.rs" \
&& ok "crates/cuda_std/src/lib.rs" \
|| miss "missing critical file: crates/cuda_std/src/lib.rs"
test -d "crates/rustc_codegen_nvvm/src" \
&& ok "crates/rustc_codegen_nvvm/src" \
|| miss "missing critical directory: crates/rustc_codegen_nvvm/src"
test -f "crates/cust/src/lib.rs" \
&& ok "crates/cust/src/lib.rs" \
|| miss "missing critical file: crates/cust/src/lib.rs"
test -f "crates/cuda_std/src/thread.rs" \
&& ok "crates/cuda_std/src/thread.rs" \
|| miss "missing critical file: crates/cuda_std/src/thread.rs"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 39 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~9d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Rust-GPU/rust-cuda"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Rust-CUDA is an ecosystem for writing GPU kernels entirely in Rust and executing them on NVIDIA GPUs, eliminating the need to write CUDA C++. It provides a Rust compiler backend (rustc_codegen_nvvm) that targets NVIDIA's NVVM IR, standard-library support for device code (cuda_std), and high-level abstractions (cust, blastoff) for kernel compilation, device memory management, and GPU compute. It is a monorepo using a Cargo workspace (15+ crates). Core structure: crates/cust_raw provides raw CUDA driver bindings; crates/rustc_codegen_nvvm implements the Rust compiler backend; crates/cuda_std provides the GPU-side standard library; crates/blastoff offers high-level BLAS-like kernels; and examples/ (vecadd, gemm, sha2, i128_demo) demonstrates usage patterns.
👥Who it's for
Systems engineers and GPU-accelerated application developers who want type-safe, memory-safe GPU code without learning CUDA C++; contributors building Rust-first GPU infrastructure; researchers prototyping numerical algorithms on GPUs in pure Rust.
🌱Maturity & risk
Early but actively rebooting (as of January 2025). The project was dormant but is undergoing revival with recent blog updates and contributions. Expect significant gaps: no stable release version, early development stage with known bugs and safety issues. CI exists (Linux/Windows workflows) but coverage is incomplete.
High risk: this is pre-1.0 with acknowledged bugs and safety issues. The NVIDIA LLVM backend (NVVM) is unstable, and the Rust-to-PTX compilation pipeline is nascent. Single point of failure: rustc_codegen_nvvm requires maintaining parity with rustc versions. Multiple disabled features (OptiX support commented out) suggest incomplete infrastructure.
Active areas of work
Active reboot phase with focus on stabilizing the NVVM backend and documenting the compilation pipeline. Recent work visible in CI workflows targeting CUDA 12 and 13, containerization of build environments (RockyLinux 9, Ubuntu 24 Dockerfiles), and efforts to enable more complex examples (path_tracer, SHA2).
🚀Get running
git clone https://github.com/Rust-GPU/rust-cuda.git
cd rust-cuda
cargo build --workspace
# Run a basic example
cargo run --example vecadd
Requires CUDA Toolkit 12+ and rustc nightly with LLVM backend support. See .devcontainer.json for containerized setup.
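To see what running a compiled kernel from the host looks like, here is a sketch modeled on cust's documented examples — the PTX path, kernel name, and launch geometry are placeholders, and an NVIDIA GPU plus CUDA driver are required to actually execute it:

```rust
// Host-side launch sketch using cust. Placeholder PTX path and kernel name.
use cust::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize CUDA and load a PTX module produced earlier by cuda_builder.
    let _ctx = cust::quick_init()?;
    let module = Module::from_ptx(include_str!("../resources/my_kernel.ptx"), &[])?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    let a = DeviceBuffer::from_slice(&[1.0f32, 2.0, 3.0, 4.0])?;
    let b = DeviceBuffer::from_slice(&[10.0f32, 20.0, 30.0, 40.0])?;
    let out = DeviceBuffer::from_slice(&[0.0f32; 4])?;

    // One block of four threads covers four elements; slices are passed
    // to the kernel as (pointer, length) pairs.
    unsafe {
        launch!(module.my_add<<<1, 4, 0, stream>>>(
            a.as_device_ptr(), a.len(),
            b.as_device_ptr(), b.len(),
            out.as_device_ptr()
        ))?;
    }
    stream.synchronize()?;

    // Copy the result back to host memory.
    let mut host = [0.0f32; 4];
    out.copy_to(&mut host)?;
    println!("{host:?}");
    Ok(())
}
```

Exact API details (module loading, the `launch!` macro's argument marshalling) may shift between cust versions; check the crate docs against the version pinned in the workspace.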
Daily commands:
For examples: cargo run --example vecadd --release. For kernel compilation: use cuda_builder::CudaBuilder in build.rs to compile .rs kernel files to .ptx at build time. Full workflow documented in The Rust CUDA Guide (https://rust-gpu.github.io/rust-cuda/).
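The build-script half of that workflow can be sketched as follows, based on cuda_builder's published usage — the kernel-crate path and PTX output location here are placeholders, and a working CUDA toolchain is assumed:

```rust
// build.rs — sketch only; adjust the paths to your workspace layout.
use cuda_builder::CudaBuilder;

fn main() {
    CudaBuilder::new("../my_kernel")           // GPU-side crate to compile
        .copy_to("../resources/my_kernel.ptx") // emitted PTX, for include_str!
        .build()
        .unwrap();
}
```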
🗺️Map of the codebase
- Cargo.toml — Workspace root defining all 16 crates; essential for understanding project structure and dependency graph
- crates/cuda_std/src/lib.rs — Core GPU standard library exposing thread, warp, atomic, and runtime abstractions that all GPU kernels depend on
- crates/rustc_codegen_nvvm/src — LLVM-to-PTX compiler backend; the engine that transforms Rust code into executable GPU instructions
- crates/cust/src/lib.rs — Host-side CUDA runtime wrapper providing device memory management, kernel launching, and stream handling
- crates/cuda_std/src/thread.rs — GPU thread intrinsics (threadIdx, blockIdx, blockDim) that form the foundation of kernel coordinate mapping
- crates/cuda_builder/src/lib.rs — Build-time GPU code compilation pipeline; integrates with rustc_codegen_nvvm to compile kernels for CUDA targets
- README.md — Project overview confirming ecosystem status, reboot announcement, and dual Apache/MIT licensing
🛠️How to make changes
Add a new GPU kernel using cuda_std
- Create a new library crate (e.g., cargo new --lib crates/my_kernel) that depends on cuda_std (crates/my_kernel/Cargo.toml)
- Write kernel functions using thread coordinate intrinsics from crates/cuda_std/src/thread.rs (threadIdx, blockIdx, blockDim, sync_threads) (crates/my_kernel/src/lib.rs)
- Compile with cuda_builder by creating a build script that invokes the codegen backend (crates/my_kernel/build.rs)
- Call compiled kernels from host code using cust launch_on_stream with device memory from cust_core::DeviceBuffer (examples/my_kernel_example.rs)
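The device-side half of those steps boils down to something like the crate below, modeled on rust-cuda's vecadd example — the crate name and exact cuda_std item paths are assumptions, so check the current cuda_std docs:

```rust
// crates/my_kernel/src/lib.rs — sketch of a GPU kernel written in pure Rust.
#![cfg_attr(target_os = "cuda", no_std)]

use cuda_std::prelude::*;

#[kernel]
pub unsafe fn my_add(a: &[f32], b: &[f32], c: *mut f32) {
    // One thread handles one element; index_1d folds block and thread
    // coordinates into a single linear index.
    let idx = thread::index_1d() as usize;
    if idx < a.len() {
        // Caller must guarantee `c` points to at least a.len() floats.
        *c.add(idx) = a[idx] + b[idx];
    }
}
```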
Add a new cuDNN operation binding
- Define a descriptor struct and mode enum in crates/cudnn/src/{new_op}/mod.rs, following the pattern of crates/cudnn/src/activation/ (crates/cudnn/src/new_op/mod.rs)
- Add raw FFI bindings in crates/cudnn-sys/build/wrapper.h and regenerate via build.rs (crates/cudnn-sys/build/wrapper.h)
- Expose a high-level safe API in crates/cudnn/src/lib.rs with error handling using cudnn::Result (crates/cudnn/src/lib.rs)
Add GPU-side intrinsic or helper function
- Implement the function in the appropriate crates/cuda_std/src/*.rs module (e.g., thread.rs, warp.rs, atomic.rs) with inline assembly via #[inline] and asm! macros (crates/cuda_std/src/thread.rs)
- Re-export it in the crates/cuda_std/src/lib.rs public API (crates/cuda_std/src/lib.rs)
- Document it with examples showing the GPU-side usage pattern (e.g., crates/cuda_std/src/thread.rs for sync_threads usage) (crates/cuda_std/src/thread.rs)
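As a sketch of what such a helper looks like — the function is hypothetical, and the %laneid special register and reg32 operand class come from NVIDIA's PTX ISA and Rust's nvptx inline-asm support, so verify both against the current target before relying on this:

```rust
// Hypothetical addition to crates/cuda_std/src/warp.rs: read the calling
// thread's lane id (0..32 within its warp) via a single PTX instruction.
use core::arch::asm;

#[inline(always)]
pub fn lane_id() -> u32 {
    let out: u32;
    // SAFETY: mov.u32 from the %laneid special register has no side effects.
    unsafe {
        asm!("mov.u32 {}, %laneid;", out(reg32) out);
    }
    out
}
```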
Add BLAS operation to blastoff
- Wrap the cuBLAS Level-1, Level-2, or Level-3 operation in crates/blastoff/src/level{1,2,3}.rs following existing patterns (crates/blastoff/src/level1.rs)
- Add raw bindings in crates/blastoff/src/raw/level{1,2,3}.rs with FFI calls to cublas* (crates/blastoff/src/raw/level1.rs)
- Expose a safe wrapper in crates/blastoff/src/lib.rs using BlastoffContext for handle management (crates/blastoff/src/lib.rs)
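The safe-wrapper step follows a pattern worth seeing in isolation. This plain-Rust sketch mimics it with a fake "raw" function standing in for a cuBLAS FFI call — all names are illustrative, not blastoff's API. Validation happens in the safe layer; the unsafe call is confined to one audited spot:

```rust
#[derive(Debug, PartialEq)]
enum BlasError {
    BadLength,
}

// Stand-in for an unsafe FFI symbol such as a cublas axpy call: y += alpha * x.
unsafe fn raw_axpy(n: usize, alpha: f32, x: *const f32, y: *mut f32) {
    for i in 0..n {
        *y.add(i) += alpha * *x.add(i);
    }
}

// Safe wrapper: length validation up front, raw pointers only afterwards.
fn axpy(alpha: f32, x: &[f32], y: &mut [f32]) -> Result<(), BlasError> {
    if x.len() != y.len() {
        return Err(BlasError::BadLength);
    }
    // SAFETY: both slices are live and exactly x.len() elements long.
    unsafe { raw_axpy(x.len(), alpha, x.as_ptr(), y.as_mut_ptr()) };
    Ok(())
}

fn main() {
    let x = [1.0f32, 2.0, 3.0];
    let mut y = [1.0f32, 1.0, 1.0];
    axpy(2.0, &x, &mut y).unwrap();
    assert_eq!(y, [3.0, 5.0, 7.0]);
    assert_eq!(axpy(1.0, &x, &mut [0.0; 2]), Err(BlasError::BadLength));
    println!("ok");
}
```

The real wrappers additionally thread a context handle through every call, but the shape — check invariants, then make one well-scoped unsafe call — is the same.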
🔧Why these technologies
- LLVM/NVVM — Enables high-level Rust code to be compiled to GPU targets; NVVM is NVIDIA's LLVM dialect for PTX generation
- CUDA Driver API (via cust_raw) — Provides lowest-level control over device memory, streams, and kernel launch without requiring CUDA Runtime
- cuBLAS (blastoff) and cuDNN (cudnn) — Exposes optimized vendor-provided GPU libraries for linear algebra and deep learning; critical for performance-sensitive workloads
- Rust proc macros (cuda_std_macros, cust_derive) — Eliminates boilerplate for kernel argument marshalling and GPU memory lifetime tracking at compile-time
⚖️Trade-offs already made
- Custom rustc_codegen_nvvm backend instead of GPU-agnostic IR
  - Why: enables direct targeting of NVIDIA's LLVM/NVVM without intermediate passes; NVIDIA-only but optimized for CUDA
  - Consequence: not portable to AMD, Intel, or other GPU vendors; tight coupling to the NVIDIA toolchain
- Dual host (cust) and device (cuda_std) APIs instead of a unified abstraction
  - Why: host and device code execute in fundamentally different contexts; separate APIs prevent accidents like GPU code running on the CPU
  - Consequence: developers must explicitly manage data movement between CPU and GPU; more verbose but safer
- Unsafe blocks required for kernel writing and FFI (cust_raw, cuda_std intrinsics)
  - Why: CUDA fundamentals (thread coordinates, shared memory, atomics) require low-level operations; FFI to C libraries is inherently unsafe
  - Consequence: the burden is on the developer to validate thread safety and correctness; the compiler cannot guarantee memory safety in kernels
🚫Non-goals (don't propose these)
- Multi-vendor GPU support (only NVIDIA CUDA)
- Real-time GPU debugging or profiling tools (relies on external NVIDIA tools)
- Automatic memory optimization or kernel fusion
- Support for dynamic GPU kernel compilation at runtime
🪤Traps & gotchas
- CUDA Toolkit version pinning: the codebase targets CUDA 12+ (see container/rockylinux9-cuda12); older CUDA versions will fail.
- Nightly Rust required: only recent nightly compilers with LLVM backend support work; stable Rust cannot compile GPU code.
- LLVM version mismatch: rustc_codegen_nvvm is sensitive to rustc's internal LLVM version; CI failures may indicate version incompatibility.
- OptiX disabled: OptiX code is entirely commented out (see OPTIX_DISABLED comments); using OptiX features requires manually uncommenting it.
- No Windows CUDA testing in CI: ci_windows.yml exists but may have gaps; primary testing is on Linux.
- Devcontainer expected for some workflows: .devcontainer.json suggests the project expects containerized builds for consistency.
🏗️Architecture
💡Concepts to learn
- NVVM (NVIDIA Virtual Machine IR) — the intermediate representation that rustc_codegen_nvvm targets; understanding NVVM is essential for debugging GPU code generation and contributing to the compiler backend
- PTX (Parallel Thread Execution) — The low-level assembly language that NVVM IR compiles to before GPU execution; PTX is the actual code running on NVIDIA hardware, so understanding it helps debug perf issues
- Rustc Compiler Backend (Codegen) — rust-cuda implements a custom rustc codegen backend (replacing the compiler's default LLVM-based one) that emits NVVM IR; this is unusual and requires understanding rustc's pluggable codegen-backend interface
- Memory Coalescing & Warp Execution — CUDA GPU performance depends critically on memory access patterns (coalescing) and warp-level parallelism; cuda_std abstracts these but you must understand them to write efficient kernels
- BLAS Level 1/3 Operations — blastoff implements Level 1 (vector ops) and Level 3 (matrix multiply) kernels; understanding BLAS terminology and algorithms is essential for contributing numerical compute to the library
- Procedural Macros for Kernel Attributes — cuda_std_macros and rustc_codegen_nvvm_macros use #[kernel] and similar attributes to mark device code; understanding proc macro expansion is required to extend kernel annotation syntax
- Type-State Pattern (Device Memory Lifetimes) — cust uses type-state (CudaSlice, DeviceBuffer) to encode memory safety as Rust types; this pattern prevents use-after-free on GPU memory at compile time
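The type-state idea can be demonstrated in plain Rust without any CUDA dependency — the names below (DevBuf, Uninit, Ready) are invented for illustration and are not cust's actual types:

```rust
use std::marker::PhantomData;

// Compile-time state tags: a buffer is either freshly allocated or filled.
struct Uninit;
struct Ready;

// Vec<f32> stands in for device memory in this CPU-only sketch.
struct DevBuf<State> {
    data: Vec<f32>,
    _state: PhantomData<State>,
}

impl DevBuf<Uninit> {
    fn alloc(len: usize) -> Self {
        DevBuf { data: vec![0.0; len], _state: PhantomData }
    }

    // Consumes the Uninit handle, so a stale handle cannot be reused.
    fn upload(mut self, host: &[f32]) -> DevBuf<Ready> {
        self.data.copy_from_slice(host);
        DevBuf { data: self.data, _state: PhantomData }
    }
}

impl DevBuf<Ready> {
    // download() exists only on Ready — reading an Uninit buffer is a
    // compile error, not a runtime bug.
    fn download(&self) -> Vec<f32> {
        self.data.clone()
    }
}

fn main() {
    let buf = DevBuf::<Uninit>::alloc(3).upload(&[1.0, 2.0, 3.0]);
    assert_eq!(buf.download(), vec![1.0, 2.0, 3.0]);
    println!("ok");
}
```

cust's real DeviceBuffer differs in detail, but the principle is the same: encode lifecycle stages in the type so that misuse fails at compile time rather than as GPU memory corruption.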
🔗Related repos
- gpuwasm/gpuwasm — Alternative GPU language targeting WebGPU/WGSL instead of CUDA, solving the same "Rust on GPU" problem for web and portable targets
- EmbarkStudios/rust-gpu — Predecessor/inspiration: EmbarkStudios' earlier Rust-GPU project for shader compilation; rust-cuda borrowed architectural patterns but focuses on CUDA instead of graphics shaders
- NVIDIA/cuda-samples — Official CUDA C++ reference implementations of algorithms like GEMM and vector addition; rust-cuda examples (vecadd, gemm) are direct Rust ports of these
- rapidsai/cuml — CUDA-accelerated machine learning library in C++/Python; rust-cuda can eventually target the same compute kernels, providing a Rust-native alternative
- pytorch/pytorch — PyTorch uses CUDA extensively; rust-cuda could eventually provide Rust-native GPU operations as a PyTorch backend alternative
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add integration tests for cuda_builder crate with multiple CUDA SDK versions
The cuda_builder crate (crates/cuda_builder/src/lib.rs) is critical for the ecosystem but lacks integration tests that verify it works across different CUDA versions (12, 13). The CI workflows (ci_linux.yml, ci_windows.yml) show multiple CUDA versions are tested, but there are no explicit integration tests validating the builder's correctness. This would catch regressions and ensure reliability across SDK versions.
- [ ] Create crates/cuda_builder/tests/integration_tests.rs with tests for kernel compilation
- [ ] Add test cases for CUDA 12 and CUDA 13 compilation paths
- [ ] Verify output PTX/cubin files are generated correctly
- [ ] Test error handling for invalid kernel code
- [ ] Document expected behavior in tests for future maintainers
Add missing safety documentation and examples for cuda_std unsafe APIs
The cuda_std crate (crates/cuda_std/src/) exposes low-level unsafe APIs (atomic.rs, intrinsics.rs, ptr.rs) that lack Safety sections in their rustdoc comments. Given the importance of GPU memory safety and the note in README that 'Expect bugs, safety issues', documenting unsafe invariants is critical. The CONTRIBUTING.md file exists but doesn't specify safety documentation requirements.
- [ ] Review crates/cuda_std/src/atomic/intrinsics.rs and add /// Safety sections to all unsafe functions
- [ ] Review crates/cuda_std/src/intrinsics.rs and document thread-safety guarantees and memory ordering
- [ ] Review crates/cuda_std/src/ptr.rs and document valid pointer lifecycle requirements
- [ ] Add examples to crates/cuda_std/CHANGELOG.md showing safe usage patterns
- [ ] Update CONTRIBUTING.md to require Safety documentation for unsafe GPU APIs
Create comprehensive CI workflow for testing blastoff BLAS bindings across platforms
The blastoff crate (crates/blastoff/src/) provides BLAS level1 and level3 bindings but lacks dedicated CI validation. Unlike cuda_std which is implicitly tested, blastoff's raw FFI bindings (crates/blastoff/src/raw/) need explicit platform-specific testing to catch linkage issues and API mismatches. Current CI only shows generic Linux/Windows workflows without BLAS-specific validation.
- [ ] Create .github/workflows/ci_blastoff.yml with steps to verify cuBLAS/rocBLAS linkage
- [ ] Add test cases in crates/blastoff/tests/ for level1 operations (dot, axpy) on sample matrices
- [ ] Add test cases for level3 operations (gemm) with various data types
- [ ] Configure matrix testing for both Linux (cuBLAS) and rocBLAS compatible systems
- [ ] Document expected BLAS library availability in CONTRIBUTING.md
🌿Good first issues
- Add comprehensive device-memory safety tests to crates/cust_core/tests/ for buffer aliasing and use-after-free scenarios; currently only basic tests exist
- Document the rustc_codegen_nvvm compilation pipeline with inline code comments in crates/rustc_codegen_nvvm/src/lib.rs explaining MIR-to-LLVM IR mapping; the critical transformation is underdocumented
- Create a 'hello-world' kernel example in examples/hello_world demonstrating minimal cuda_std usage (thread printing, sync barriers); would onboard new contributors faster
⭐Top contributors
- @nnethercote — 61 commits
- @CharryWu — 12 commits
- @LegNeato — 9 commits
- @Snehal-Reddy — 5 commits
- @niklebedenko — 5 commits
📝Recent commits
- 103a8d5 — ci: split docs job to skip llvm19-feature crates with --all-features (brandonros)
- 6a0bf22 — refactor(cuda_builder): gate Blackwell default arch on llvm19 cargo feature (brandonros)
- 2dec3ab — feat(llvm): add llvm19 support (brandonros)
- 946c91f — fix(cuda_std): use correct PTX scope suffix in block acqrel fence (Snehal-Reddy)
- 60b86e1 — test: cudnn code map (CharryWu)
- 4731399 — fix(cudnn): implement cuDNN 9 error codes, replace todo!() with proper mapping (CharryWu)
- 0bc607f — Fix f16/f128 TypeKind UB causing SIGTRAP on aarch64 (LegNeato)
- a5894ad — Align compiletests with immediate-abort panic strategy (LegNeato)
- 40cb4eb — Fix clippy warnings on latest nightly (LegNeato)
- 4fca66f — Adapt NVVM calls and intrinsics for nightly-2026-04-02 (LegNeato)
🔒Security observations
- High · Early Development Stage with Known Safety Issues — README.md, project documentation. The project explicitly states in the README: "The project is still in early development, however. Expect bugs, safety issues, and things that don't work." This indicates unresolved safety issues and instability in the codebase. Fix: implement comprehensive security audits, add safety validation tests, and establish a security disclosure policy before using in production environments. Consider marking the project as pre-release/experimental.
- High · GPU Memory Safety Risks — crates/cuda_std/src, crates/cust_raw, crates/cudnn-sys. The codebase deals with GPU memory management and CUDA operations (cuda_std, cust_core). GPU memory operations can lead to unsafe pointer dereferences, buffer overflows, and use-after-free vulnerabilities if not properly validated; the presence of raw FFI bindings (cudnn-sys, cust_raw) increases the risk. Fix: implement rigorous bounds checking on all GPU memory operations, add memory validation layers, fuzz memory-related functions thoroughly, and require unsafe-code review for all GPU operations.
- High · Unsafe FFI Bindings Without Clear Documentation — crates/cudnn-sys, crates/cust_raw, crates/nvvm, crates/optix-sys (disabled). Multiple crates contain raw FFI bindings to external C libraries (CUDA, cuDNN) with minimal visible safety documentation. The *-sys crates typically contain unsafe code with potential for memory corruption, null-pointer dereferences, or undefined behavior. Fix: document safety invariants for all FFI bindings, add comprehensive safety documentation in lib.rs files, implement wrapper types that enforce safety properties, and consider safer FFI alternatives where available.
- Medium · Disabled OptiX Support Creates Maintenance Risk — Cargo.toml workspace members section, crates/optix directory. OptiX support is commented out throughout the codebase rather than disabled via feature flags. This creates technical debt, increases the surface area for bugs, and makes the disabled functionality hard to maintain securely. Fix: either fully remove the OptiX code or gate it behind Cargo features so disabled code paths are never accidentally compiled.
- Medium · Build Scripts with External Tool Invocation — crates/cuda_builder/src/lib.rs, crates/cudnn/build.rs, crates/cudnn-sys/build/main.rs, crates/ptx_compiler. Multiple crates use build.rs files that invoke external tools and potentially download SDKs, introducing risks of supply-chain attacks, command injection, and man-in-the-middle attacks on SDK downloads. Fix: verify checksums for all downloaded SDKs, use HTTPS exclusively, restrict build-script permissions, document all external dependencies, and cache verified SDKs securely.
- Medium · Code Generation Safety (Macros and Procedural Macros) — crates/cuda_std_macros, crates/rustc_codegen_nvvm_macros, crates/cust_derive. The codebase contains procedural macros that generate unsafe code; errors in macro expansion could lead to widespread unsafe code generation. Fix: test macro expansion output comprehensively, validate generated code, document the safety guarantees of macro-generated code, and consider tools such as cargo-miri to validate macro safety.
- Medium · Atomic Operations Implementation — crates/cuda_std/src/atomic/intrinsics.rs, crates/cuda_std/src/atomic/mid.rs. Custom atomic operations in cuda_std/src/atomic suggest non-standard synchronization primitives that may have race conditions or memory-ordering issues specific to GPU architectures. Fix: review the atomic operations for correctness, document memory-ordering semantics in detail, stress-test with concurrent GPU kernels, and consider formal verification where feasible.
- Low · Missing SECURITY.md File — repository root. No SECURITY.md or vulnerability disclosure policy was detected. Fix: add a SECURITY.md documenting how to report security issues.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.