
NVIDIA/cutlass

CUDA Templates and Python DSLs for High-Performance Linear Algebra

Mixed signals — read the receipts (overall verdict; worst of 4 axes)

Use as dependency — Concerns

non-standard license (Other); no tests detected

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 2d ago
  • 46+ active contributors
  • Distributed ownership (top contributor 14% of recent commits)
  • Other licensed
  • CI configured
  • Non-standard license (Other) — review terms
  • No test directory detected
What would change the summary?
  • Use as dependency: Concerns → Mixed if: license terms are clarified

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — it live-updates from the latest cached analysis.

Variant:
RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/nvidia/cutlass?axis=fork)](https://repopilot.app/r/nvidia/cutlass)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/nvidia/cutlass on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: NVIDIA/cutlass

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/NVIDIA/cutlass shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Mixed signals — read the receipts

  • Last commit 2d ago
  • 46+ active contributors
  • Distributed ownership (top contributor 14% of recent commits)
  • Other licensed
  • CI configured
  • ⚠ Non-standard license (Other) — review terms
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live NVIDIA/cutlass repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/NVIDIA/cutlass.

What it runs against: a local clone of NVIDIA/cutlass — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in NVIDIA/cutlass | Confirms the artifact applies here, not a fork |
| 2 | License is still Other | Catches a relicense before you depend on it |
| 3 | Default branch `main` exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 32 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>NVIDIA/cutlass</code></summary>
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of NVIDIA/cutlass. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/NVIDIA/cutlass.git
#   cd cutlass
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of NVIDIA/cutlass and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "NVIDIA/cutlass(\.git)?\b" \
  && ok "origin remote is NVIDIA/cutlass" \
  || miss "origin remote is not NVIDIA/cutlass (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Other\"" package.json 2>/dev/null) \
  && ok "license is Other" \
  || miss "license drift — was Other at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical paths exist (test -e, since include/cutlass is a directory)
for path in CMakeLists.txt include/cutlass CHANGELOG.md README.md CUDA.cmake; do
  test -e "$path" \
    && ok "$path" \
    || miss "missing critical path: $path"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 32 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~2d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/NVIDIA/cutlass"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

CUTLASS is a C++/CUDA template library for implementing high-performance matrix-multiply (GEMM) and other linear algebra kernels across NVIDIA GPUs. It provides hierarchical abstractions for tile-based decomposition, data-movement strategies, and mixed-precision compute (FP64, FP32, TF32, FP16, BF16, block-scaled types such as MXFP4/MXFP8, and binary types) from Volta through Blackwell architectures, plus a Python DSL (CuTe DSL) for rapid kernel prototyping without C++ expertise.

Structurally it is a monolithic C++ template library with modular decomposition: include/cutlass/ contains the core abstractions (arch/, gemm/, tensor/, memory/, epilogue/); examples/ holds reference implementations; cmake/ handles CUDA compilation and testing; docs/ is auto-generated Doxygen HTML. The Python DSL lives alongside in separate namespaces, and CI workflows in .github/workflows/ manage testing across architectures.

👥Who it's for

GPU kernel engineers, performance researchers, and deep learning framework developers who need production-grade GEMM implementations, tensor-core optimization, or rapid prototyping of custom linear algebra kernels. Users range from CUDA beginners using CuTe DSL Python bindings to advanced performance engineers tuning memory hierarchies and data layouts.

🌱Maturity & risk

Highly mature and production-ready: CUTLASS 4.5.0 released May 2026, active CI/CD via GitHub workflows (blossom-ci, auto-labeling, stale-issue management), 30M+ lines of C++ (an extensive codebase), and comprehensive test infrastructure via CMake/CTest. Actively maintained with regular version bumps and architectural evolution (CuTe DSL introduced in 4.0).

Low risk for stable kernel abstractions, but moderate risk for DSL beta features: CuTe DSL is explicitly 'in public beta' per the README, so breaking API changes are possible. GPU architecture coupling (Volta/Turing/Ampere/Ada/Hopper/Blackwell specifics) means code may require updates with each new architecture release. The installed NVIDIA CUDA Toolkit version must match what the library expects; no npm/pip lock files are visible at the top level.

Active areas of work

Active development on CUTLASS 4.5.0 with CuTe DSL stabilization and new data type support (e5m2/e4m3 FP8, MXFP4/MXFP6/MXFP8 block-scaled). GitHub workflows show automated issue triage (new-issues-to-triage-projects.yml), stale-issue management, and continuous Blossom CI testing. CHANGELOG.md and version tracking in cmake/version_extended.h.in indicate structured release cadence.

🚀Get running

```bash
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass
mkdir build && cd build
cmake .. -DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda
make -j$(nproc)
ctest
```

Requires the NVIDIA CUDA Toolkit and CMake 3.12+. Use the CMake-based build (no pip/npm).

Daily commands — there is no single 'dev server'; this is a template library:

  • Build examples: `cd examples && cmake . && make` (builds individual CUDA kernels).
  • Run unit tests: `cd build && ctest --verbose`.
  • Python DSL examples would execute as `python cutlass_py_example.py` (if included in examples/).
  • Compute-capability targeting: pass `-DCUTLASS_ENABLE_TENSOR_CORE_MMA=ON` to CMake for tensor-core specific kernels.

🗺️Map of the codebase

  • CMakeLists.txt — Root build configuration orchestrating CUTLASS compilation, CUDA architecture support, and dependency management; every contributor must understand the build system.
  • include/cutlass — Core CUTLASS header directory containing fundamental template abstractions for GEMM, tensor operations, and data movement primitives (inferred from structure).
  • CHANGELOG.md — Official change history documenting API evolution, feature additions, and breaking changes; essential for understanding codebase versioning and stability.
  • README.md — High-level overview of CUTLASS capabilities, design philosophy, and hierarchy decomposition strategy; entry point for all contributors.
  • CUDA.cmake — CUDA-specific build configuration handling architecture detection, compute capability targeting, and compiler flags; critical for GPU compatibility.
  • cmake/NvidiaCutlassConfig.cmake.in — CMake package configuration template for CUTLASS library installation and integration into external projects.
  • CONTRIBUTORS.md — Contribution guidelines and community attribution; required reading for anyone submitting changes.

🛠️How to make changes

Add Support for a New GPU Architecture (e.g., SM90)

  1. Create architecture-specific MMA header in include/cutlass/arch/ following SM60/SM61 patterns (docs/arch_2mma__sm61_8h.html)
  2. Define architecture capability constants and tensor core dimensions in CUDA.cmake (CUDA.cmake)
  3. Update CMakeLists.txt to compile kernels targeting the new architecture (CMakeLists.txt)
  4. Add architecture-specific tests and benchmarks following existing test patterns (cmake/CTestTestfile.configure.cmake)
  5. Document new architecture in CHANGELOG.md with version bumps and migration notes (CHANGELOG.md)

Extend CUTLASS for New Data Types (e.g., FP8)

  1. Define FP8 type specialization in include/cutlass/numeric_types.h with conversion operators (docs/array_8h.html)
  2. Create sub-byte packing utilities if needed, extending array__subbyte utilities (docs/array__subbyte_8h.html)
  3. Add MMA specializations for FP8 in architecture-specific headers (SM60+) (docs/arch_2mma_8h.html)
  4. Update CMakeLists.txt with FP8 test compilation flags and enable conditional compilation (CMakeLists.txt)

Add a New Reduction Operation Type

  1. Extend batched__reduction.h with new reduction operation functor following Sum/Max patterns (docs/batched__reduction_8h.html)
  2. Define corresponding traits in batched__reduction__traits.h for algorithm configuration (docs/batched__reduction__traits_8h.html)
  3. Add unit tests in tests/ directory with multiple architecture targets (cmake/CTestTestfile.configure.cmake)
  4. Update CHANGELOG.md documenting the new operation and usage examples (CHANGELOG.md)

Integrate CUTLASS into External CUDA Project

  1. Build and install CUTLASS using CMakeLists.txt which generates NvidiaCutlassConfig.cmake (CMakeLists.txt)
  2. Reference cmake/NvidiaCutlassConfig.cmake.in in external project's CMakeLists.txt via find_package(NvidiaCutlass) (cmake/NvidiaCutlassConfig.cmake.in)
  3. Link against installed CUTLASS headers and apply CUDA.cmake compiler flags (CUDA.cmake)

🔧Why these technologies

  • C++ Template Metaprogramming — Enables compile-time specialization for different GPU architectures, data types, and tile sizes without runtime overhead; zero-cost abstractions for performance-critical linear algebra kernels.
  • CUDA C++ — Direct GPU programming access required to implement hierarchical parallelization patterns and leverage specialized hardware features (tensor cores, async memory operations).
  • CMake Build System — Handles complex multi-architecture compilation (SM50-SM90), conditional compilation for different precision types, and package generation for cross-project integration.
  • Doxygen Documentation — Automatically generates detailed API documentation from C++ template headers; critical for large template-heavy codebase with hundreds of generic components.

⚖️Trade-offs already made

  • Template-based abstractions over dynamic polymorphism

    • Why: Maximizes compile-time optimization opportunities and eliminates virtual function overhead in performance-critical kernels.
    • Consequence: Significantly increases compile time and binary size; higher barriers to entry for developers unfamiliar with template metaprogramming; complex error messages.
  • Architecture-specific specializations (SM50, SM60, SM61, SM80, etc.)

    • Why: Different GPU generations have fundamentally different tensor core capabilities and memory hierarchies requiring tailored implementations for peak performance.
    • Consequence: Code duplication across architecture files; maintenance burden increases with each new GPU generation; binary complexity.
  • Header-only library design with inline implementations

    • Why: Enables user code to specialize templates directly; avoids binary compatibility issues across CUDA versions.
    • Consequence: Recompilation required for every consuming project; slower incremental builds; larger code bloat in applications linking CUTLASS.
  • Support for mixed-precision operations (FP64, FP32, TF32, FP16, BF16)

    • Why: Allows applications to trade precision for throughput/memory bandwidth; critical for modern deep learning workloads.
    • Consequence: Increased template instantiation complexity; requires architecture-specific MMA implementations for each precision; larger test matrix.

🚫Non-goals (don't propose these)

  • Does not provide CPU implementations or host-side GEMM fallbacks — GPU-only library
  • Does not include automatic performance tuning or autotuning framework — users must manually select tiling sizes and policies
  • Does not handle distributed/multi-GPU GEMM orchestration — single-GPU kernel library
  • Does not provide Python bindings or high-level ML framework integration — low-level C++ template abstraction layer
  • Does not include I/O, serialization, or checkpointing — pure compute kernels

🪤Traps & gotchas

  • CUDA toolkit version must match: a mismatched CUDA_TOOLKIT_ROOT_DIR in CMake causes cryptic compilation errors.
  • Tensor-core operations require compute capability ≥ 7.0 (Volta+); SM 5.0/6.x kernels fall back to slower SIMD paths.
  • CMake must locate CUTLASS headers before system CUDA headers — include order matters in consuming projects.
  • The Python DSL requires separate Python environment setup (not bundled with the C++ build).
  • GPU memory-layout assumptions are non-obvious: CUTLASS uses 'K-major' vs 'N-major' tiling internally, which conflicts with naive transposition.
  • No lock file pins the CUDA version; rely on environment variables or explicit CMake flags.

🏗️Architecture

💡Concepts to learn

  • Tensor Core / Warp-level Matrix Multiply — CUTLASS's entire design revolves around hierarchical abstraction of Tensor Cores (architecture/mma.h); understanding tensor-core atom instruction sets and shape constraints (e.g., 16×16×16 FP16 on Ampere) is essential to writing efficient CUTLASS kernels
  • CUDA Thread Block Hierarchy & Tile Decomposition — CUTLASS abstracts matrix-multiply as nested tile loops (block tile → warp tile → thread tile), each mapping to hardware hierarchy; understanding grid/block/warp/thread partitioning of work is fundamental to all CUTLASS policies
  • Layout Abstraction & Memory Access Patterns — cutlass/layout/ provides generalized 2D/3D data layout representations (row-major, column-major, swizzled, tensor layouts); efficient GEMM kernels depend on correct layout composition to avoid memory bank conflicts and achieve coalesced access
  • Cooperative Matrix Load/Store (SMEM & GMEM) — CUTLASS provides async copy primitives and software pipelining abstractions for overlapping compute with memory moves; epilogue/ operations batch post-processing to amortize DRAM bandwidth across tiles
  • Mixed-Precision Compute & Type Promotion — CUTLASS supports compute in one precision (e.g., TF32, FP16) while accumulating in another (FP32) and outputting a third (BF16); numeric_types.h defines type conversion policies critical for numerical stability in deep learning
  • NVIDIA Hopper Tensor Float-32 (TF32) & Block-Scaled Types — CUTLASS 4.5 adds support for block-scaled quantized types (MXFP4, MXFP6, MXFP8) and FP8 variants (e5m2, e4m3); these require specialized accumulation rules and dequantization logic in epilogue/ abstractions
  • CUDA C++ Template Metaprogramming & Policy-Based Design — CUTLASS is built on heavy C++ template metaprogramming (traits, policy structs, SFINAE); understanding template specialization patterns and compile-time type erasure is required to extend CUTLASS with custom kernels
  • openai/triton — Alternative Python-first GPU kernel DSL with similar goal (rapid GEMM/ML kernel prototyping) but different IR; complementary rather than competing approach
  • pytorch/pytorch — Major consumer of CUTLASS; PyTorch delegates to CUTLASS GEMMs in aten/src/ATen/native/cuda/gemm/ for fp16/tf32 tensor-core operations
  • jax-ml/jax — JAX uses CuBLAS and custom GEMM kernels; CUTLASS would be natural replacement for custom high-performance LA ops in JAX backends
  • NVIDIA/CuTe — CuTe (CUTLASS Tensor) is the lower-level C++ abstraction library that CUTLASS kernels build atop; understanding CuTe concepts (layouts, tensors, atoms) is prerequisite for CUTLASS template mastery
  • NVIDIA/cuDNN — Companion NVIDIA library for neural-network primitives; cuDNN delegates matrix-multiply to CUTLASS for optimal Tensor Core utilization on recent architectures

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive Python DSL documentation and examples for cutlass module

The README mentions 'Python DSLs for High-Performance Linear Algebra' as a core feature, but the docs/ directory contains only Doxygen HTML output for C++ components. There's no dedicated documentation for the Python API, which creates a barrier for Python users. This is a high-value contribution that directly addresses the stated dual nature of the project.

  • [ ] Create docs/python-dsl/ directory with markdown files documenting Python DSL features
  • [ ] Add docs/python-dsl/getting-started.md with Python environment setup and first GEMM example
  • [ ] Add docs/python-dsl/api-reference.md documenting major Python classes and functions
  • [ ] Create docs/python-dsl/examples/ with 3-5 working Python example scripts (basic GEMM, mixed-precision, custom kernels)
  • [ ] Update README.md with Python DSL section linking to new documentation

Implement GitHub Actions workflow for Python package testing

The .github/workflows/ directory has CI for C++ components (blossom-ci.yml, labeler.yml) but no dedicated Python testing workflow. Given that CUTLASS advertises Python DSL support, there should be automated testing for Python code quality, unit tests, and package installation across Python versions (3.8-3.12) and CUDA versions.

  • [ ] Create .github/workflows/python-tests.yml that runs on PR and push to main
  • [ ] Configure matrix testing for Python 3.8, 3.9, 3.10, 3.11, 3.12
  • [ ] Add pytest execution for any python/ or test/python/ directories with coverage reporting
  • [ ] Include pip package build and installation test steps
  • [ ] Add linting checks (flake8/pylint) and type checking (mypy) for Python code

Add benchmark/performance regression tests for GEMM kernels across SM architectures

The file structure shows multiple SM-specific architecture headers (arch_2mma_sm50.h, arch_2mma_sm60.h, arch_2mma_sm61.h), but there's no visible performance benchmarking suite. Adding a standardized benchmark workflow would help catch performance regressions across GPU architectures and serve as reference material for contributors.

  • [ ] Create benchmarks/gemm/ directory with performance test harness
  • [ ] Implement benchmark runner that tests core GEMM operations across SM50, SM60, SM61+ architectures
  • [ ] Add benchmarks/results/ with baseline performance data (TFLOPS, bandwidth utilization) for reference
  • [ ] Create .github/workflows/benchmark-pr-check.yml to run benchmarks on PRs and report relative performance changes
  • [ ] Document benchmark methodology in benchmarks/README.md with instructions for local execution

🌿Good first issues

  • Add example demonstrating MXFP4/MXFP6/MXFP8 block-scaled type GEMMs (examples/ lacks concrete block-scaled tutorials despite README advertising MXFP4/MXFP8 support)
  • Expand test/ coverage for int4/int8 narrow-integer GEMM kernels on Hopper+ architectures (repo advertises 4/8-bit support but test matrix may be incomplete)
  • Write CuTe DSL Python tutorial notebook covering layout composition and tensor slicing (Python DSL is public beta; learning curve docs would help adoption)

Top contributors


📝Recent commits

  • ae6bccf — [CuTeDSL] Update atomic_max_float32 to atomic_fmax in blockscaled GEMM example (#3206) (questa-wang)
  • cb37157 — v4.5 tag update (#3202) (Junkai-Wu)
  • f74fea9 — [Hopper CuTeDSL] Add FP8 GEMM with 2xAcc (#3149) (Johnsonms)
  • 7a9fe05 — fix: Add missing kElementsPerAccess division in RegularTileIterator store (#3049) (bledden)
  • 9135a9b — Replace std::min with cute::min in sm120 blockwise scaling device functions (#3055) (vruga)
  • b46b16d — Small Tile N BlockScaled GEMM + Grouped GEMM (#3176) (depaulmillz)
  • aeba0d3 — correct BLayout stride in SM80 m16n8k32 int4 MMA traits (#3140) (zfmmmm)
  • ea46e27 — Add absf and floor to cute.math (#3156) (nandor)
  • 3f3db08 — Add support for empty dataclass arguments (#3152) (nandor)
  • 08185b9 — Update blackwell tutorial to be compatible with 4.5-dev version (#3130) (LongshengDu)

🔒Security observations

The CUTLASS repository demonstrates a generally secure foundation typical of a mature NVIDIA open-source project. No critical vulnerabilities were identified from the static file structure analysis. The codebase appears to follow good practices with proper licensing (EULA.txt, LICENSE.txt), CITATION.cff for attribution, and organized CI/CD workflows. Minor recommendations include: (1) formalizing dependency management with lock files, (2) implementing explicit compiler hardening flags in build configurations, (3) excluding auto-generated documentation from version control, and (4) ensuring git submodules are properly audited and pinned to specific commits. The project would benefit from a SECURITY.md file documenting security practices and vulnerability disclosure procedures.

  • Low · Git submodules without pinned versions — .gitmodules. The presence of .gitmodules file indicates external dependencies are included via git submodules. Without proper version pinning and security audits, this could introduce vulnerabilities from third-party repositories. Fix: Review all git submodules, ensure they point to trusted sources, pin to specific commit hashes rather than branches, and regularly audit submodule updates for security issues.
  • Low · Documentation files may contain sensitive information — docs/ directory (*.html files). The presence of extensive HTML documentation files in the docs/ directory could potentially expose implementation details or configuration information that could aid attackers. Fix: Review documentation for sensitive information, ensure documentation build artifacts are not committed to source control if they contain sensitive details, and use .gitignore to exclude auto-generated docs.
  • Low · CMake configuration without explicit security settings — CMakeLists.txt, CUDA.cmake, cmake/ directory. Multiple CMake configuration files (CMakeLists.txt, CUDA.cmake, etc.) are present but no evidence of security-focused compiler flags or hardening options visible in file structure. Fix: Add compiler hardening flags (-fstack-protector-strong, -D_FORTIFY_SOURCE=2), enable position-independent code, use compiler warnings as errors, and implement security sanitizers in CI/CD pipelines.
  • Low · No visible dependency lock files — Root directory. No dependency lock files (package-lock.json, requirements.lock, vcpkg.lock, etc.) are evident in the file structure, making reproducible builds and vulnerability tracking difficult. Fix: Implement and commit dependency lock files to ensure reproducible builds. Use tools like vcpkg, conan, or similar for C++ dependency management with locked versions.
  • Low · Potential exposure of internal paths in documentation — docs/ directory (*__dep__incl.md5, *__incl.md5 files). The presence of auto-generated HTML documentation with dependency graph MD5 files suggests internal project structure is documented in version control. Fix: Consider excluding auto-generated documentation from version control using .gitignore. Generate documentation as part of CI/CD instead of committing artifacts.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
