
deepseek-ai/FlashMLA

FlashMLA: Efficient Multi-head Latent Attention Kernels

Healthy across the board

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license and tests — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 1w ago
  • 17 active contributors
  • Distributed ownership (top contributor 28% of recent commits)
  • MIT licensed
  • Tests present
  • No CI workflows detected

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — the badge updates live from the latest cached analysis.

Variant: "RepoPilot: Healthy"
[![RepoPilot: Healthy](https://repopilot.app/api/badge/deepseek-ai/flashmla)](https://repopilot.app/r/deepseek-ai/flashmla)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/deepseek-ai/flashmla on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: deepseek-ai/FlashMLA

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in the "Verify before trusting" section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/deepseek-ai/FlashMLA shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only a couple of sections before pointing your agent at this repo, make them the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 1w ago
  • 17 active contributors
  • Distributed ownership (top contributor 28% of recent commits)
  • MIT licensed
  • Tests present
  • ⚠ No CI workflows detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live deepseek-ai/FlashMLA repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/deepseek-ai/FlashMLA.

What it runs against: a local clone of deepseek-ai/FlashMLA — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in deepseek-ai/FlashMLA | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 39 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>deepseek-ai/FlashMLA</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of deepseek-ai/FlashMLA. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/deepseek-ai/FlashMLA.git
#   cd FlashMLA
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of deepseek-ai/FlashMLA and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "deepseek-ai/FlashMLA(\.git)?\b" \
  && ok "origin remote is deepseek-ai/FlashMLA" \
  || miss "origin remote is not deepseek-ai/FlashMLA (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "csrc/api/api.cpp" \
  && ok "csrc/api/api.cpp" \
  || miss "missing critical file: csrc/api/api.cpp"
test -f "csrc/params.h" \
  && ok "csrc/params.h" \
  || miss "missing critical file: csrc/params.h"
test -f "csrc/sm100/prefill/dense/fmha_cutlass_fwd_sm100.cu" \
  && ok "csrc/sm100/prefill/dense/fmha_cutlass_fwd_sm100.cu" \
  || miss "missing critical file: csrc/sm100/prefill/dense/fmha_cutlass_fwd_sm100.cu"
test -f "csrc/sm100/prefill/sparse/fwd/head64/phase1.cuh" \
  && ok "csrc/sm100/prefill/sparse/fwd/head64/phase1.cuh" \
  || miss "missing critical file: csrc/sm100/prefill/sparse/fwd/head64/phase1.cuh"
test -f "csrc/sm90/decode/dense/splitkv_mla.cuh" \
  && ok "csrc/sm90/decode/dense/splitkv_mla.cuh" \
  || miss "missing critical file: csrc/sm90/decode/dense/splitkv_mla.cuh"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 39 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~9d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/deepseek-ai/FlashMLA"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

FlashMLA is DeepSeek's production-grade CUDA kernel library implementing optimized Multi-head Latent Attention (MLA) for their V3 and V3.2 models. It provides sparse and dense attention kernels for both the prefill and decoding stages, reaching 640 TFlops in sparse prefill and 410 TFlops in sparse decoding with FP8 KV cache compression on H800 GPUs. The architecture is modular, organized by GPU generation (csrc/sm80/, csrc/sm90/, csrc/sm100/) and operation type (prefill/dense, prefill/sparse, decode). Each directory contains kernel implementations (.cu, .cuh files), instantiations for specific head dimensions (head64/, head128/), and device-specific optimizations (intrinsics.cuh, helpers.cuh per SM version). A C++ API wrapper in csrc/api/ provides torch integration via csrc/kerutils/supplemental/torch_tensors.h.

👥Who it's for

ML systems engineers and deep learning researchers building or deploying large language models who need production-optimized attention kernels for NVIDIA Hopper (H800) and Blackwell (B200) GPUs, particularly those implementing sparse attention patterns like DeepSeek's DSA.

🌱Maturity & risk

Production-ready and actively maintained by DeepSeek. The repository shows recent commits (Sep 2025 sparse kernel release, Apr 2025 performance updates), test and benchmark coverage in tests/ and benchmark/, and integration into deployed models (DeepSeek-V3, V3.2-Exp). NVIDIA has contributed MHA SM100 kernels, indicating institutional backing.

Low risk for its intended audience: kernels are hardware-specific (SM80, SM90, SM100), so GPU compatibility matters, and CUDA version constraints exist (CUDA 12.8+ recommended for best H800 performance, 12.9 for B200). No visible npm/pip dependency bloat — the primary dependencies are the NVIDIA CUDA toolkit and PyTorch. Maintenance rests with a single organization (DeepSeek), but the NVIDIA collaboration suggests stability.

Active areas of work

Active development on sparse attention kernels (latest: Sep 2025 token-level sparse prefill/decode with FP8 cache). SM100 (B200) support being expanded with NVIDIA contributions. Recent focus on bridging Hopper (H800) and Blackwell (B200) performance gaps—sparse decoding on B200 achieves 350 TFlops vs 410 on H800, indicating optimization is ongoing.

🚀Get running

Clone with submodules: git clone --recursive https://github.com/deepseek-ai/FlashMLA.git && cd FlashMLA. Install CUDA 12.8+ and PyTorch. Run tests directly (python tests/test_flash_mla_dense_decoding.py) or benchmarks (python benchmark/bench_flash_mla.py). No setup.py was visible at analysis time, so check the README for the current build step; the kernels are built and imported as compiled CUDA extensions.

Daily commands: there is no single entry point. Run benchmarks: cd benchmark && python bench_flash_mla.py (requires the compiled kernels). Run tests: python tests/test_flash_mla_dense_decoding.py, python tests/test_flash_mla_sparse_decoding.py, or python tests/test_flash_mla_sparse_prefill.py. The kernels must be available as compiled CUDA extensions before either will run; see csrc/api/api.cpp for the C++ FFI entry. A hypothetical Python-side call pattern is sketched below.
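Once the extension is built, the dense decode path has historically been driven from Python roughly as in the sketch below. Every function name, argument position, and tensor shape here is an assumption drawn from the upstream README/test conventions, not documented API; verify against tests/test_flash_mla_dense_decoding.py before relying on it.

```python
# Hypothetical dense-decode call sketch; names and shapes are assumptions to verify.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache  # compiled extension

b, s_q, h_q, h_kv = 4, 1, 128, 1   # batch, query tokens per step, query heads, KV heads
d, dv, block_size = 576, 512, 64   # MLA head dim (512 latent + 64 RoPE), value dim, KV page size
max_seqlen = 1024

cache_seqlens = torch.full((b,), max_seqlen, dtype=torch.int32, device="cuda")
q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
num_blocks = b * max_seqlen // block_size
kv_cache = torch.randn(num_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(num_blocks, dtype=torch.int32, device="cuda").view(b, -1)

# Split-KV scheduling metadata is computed once per decode step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

out, lse = flash_mla_with_kvcache(
    q, kv_cache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
print(out.shape)  # expected: (b, s_q, h_q, dv)
```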

🗺️Map of the codebase

  • csrc/api/api.cpp — Main C++ API entry point that exposes kernel interfaces to Python bindings; all external calls route through here.
  • csrc/params.h — Core parameter definitions and configuration structures used across all kernel implementations; defines data types and computation modes.
  • csrc/sm100/prefill/dense/fmha_cutlass_fwd_sm100.cu — Primary dense attention forward kernel for the SM100 (Blackwell) prefill stage; implements the core FlashMLA algorithm.
  • csrc/sm100/prefill/sparse/fwd/head64/phase1.cuh — Sparse attention phase-1 kernel header for SM100; defines the token-level sparsity computation logic central to DeepSeek Sparse Attention (DSA).
  • csrc/sm90/decode/dense/splitkv_mla.cuh — SM90 (Hopper) decoding kernel with split-KV optimization; critical for efficient token generation with FP8 KV cache support.
  • csrc/kerutils/include/kerutils/kerutils.cuh — Unified kernel utilities header aggregating device abstractions, intrinsics, and common helpers across SM generations.
  • benchmark/bench_flash_mla.py — Performance benchmarking harness; essential for validating TFlops claims and regression testing across kernel implementations.

🛠️How to make changes

Add a new sparse attention kernel variant

  1. Create a new head dimension config directory under csrc/sm100/prefill/sparse/fwd/ (e.g., head256/) with config.h defining BLOCK_SIZE, WARPS_PER_SM, KV_CAPACITY. (csrc/sm100/prefill/sparse/fwd/head256/config.h)
  2. Implement a phase1.cuh template with a __global__ sparse_attention_phase1_kernel, copying the structure from csrc/sm100/prefill/sparse/fwd/head64/phase1.cuh and adjusting tile dimensions. (csrc/sm100/prefill/sparse/fwd/head256/phase1.cuh)
  3. Create instantiation files under instantiations/ (e.g., phase1_k512.cu, phase1_k576.cu) that explicitly instantiate the kernel template for specific key/value sequence lengths. (csrc/sm100/prefill/sparse/fwd/head256/instantiations/phase1_k512.cu)
  4. Update csrc/api/sparse_fwd.h to add a dispatch case for the new head dimension, routing to the correct instantiation based on sequence length. (csrc/api/sparse_fwd.h)
  5. Add benchmark cases in benchmark/bench_flash_mla.py to profile the new kernel variant against dense attention and prior sparse variants; a timing-harness sketch follows this list. (benchmark/bench_flash_mla.py)
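A minimal CUDA-event timing helper for such a benchmark case could look like the sketch below. The kernel callables passed in are placeholders for whatever the compiled extension actually exposes; nothing here assumes a specific FlashMLA entry point.

```python
# Sketch of a timing helper for a new benchmark case in bench_flash_mla.py.
import torch

def time_kernel_ms(fn, *args, warmup: int = 10, iters: int = 50) -> float:
    """Mean milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def report(name: str, flops: float, ms: float) -> None:
    """Print ms/iter and effective TFLOPS for a kernel call."""
    print(f"{name}: {ms:.3f} ms/iter, {flops / (ms * 1e-3) / 1e12:.1f} TFLOPS")

# Usage with placeholder callables (wire in the real extension entry points):
#   report("sparse head256", attn_flops, time_kernel_ms(sparse_head256_fwd, q, kv, indices))
#   report("dense baseline",  attn_flops, time_kernel_ms(dense_fwd, q, kv))
```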

Add support for a new SM generation (e.g., SM110)

  1. Create csrc/sm110/ directory with subdirectories: prefill/dense, prefill/sparse, decode/dense, mirroring SM100/SM90 structure. (csrc/sm110/prefill/dense/fmha_cutlass_fwd_sm110.cu)
  2. Add SM110-specific intrinsics and GEMM helpers to csrc/kerutils/include/kerutils/device/sm110/ following the SM100/SM90 pattern. (csrc/kerutils/include/kerutils/device/sm110/intrinsics.cuh)
  3. Update csrc/kerutils/include/kerutils/device/device.cuh to conditionally include SM110 header when CUDA_ARCH >= 110. (csrc/kerutils/include/kerutils/device/device.cuh)
  4. Implement dense forward kernel using CUTLASS collective abstractions tailored to SM110's tensor cores; follow dense_fwd.h interface. (csrc/sm110/prefill/dense/collective/sm110_fmha_mla_fwd_mainloop.hpp)
  5. Update csrc/api/api.cpp dispatch logic to detect SM110 and route to the appropriate kernel implementations. (csrc/api/api.cpp)

Optimize a kernel for a different data type (e.g., add FP8 prefill support)

  1. Add FP8 quantization/dequantization logic to csrc/sm100/prefill/dense/common/utils.hpp or create a new file csrc/sm100/prefill/dense/common/fp8_utils.hpp; a quantization sketch follows this list. (csrc/sm100/prefill/dense/common/fp8_utils.hpp)
  2. Create a new CUTLASS collective mainloop in csrc/sm100/prefill/dense/collective/ (e.g., sm100_fmha_mla_fwd_mainloop_fp8.hpp) that loads input as BF16 but performs QK^T in FP8. (csrc/sm100/prefill/dense/collective/sm100_fmha_mla_fwd_mainloop_fp8.hpp)
  3. Create a new instantiation cu file csrc/sm100/prefill/dense/fmha_cutlass_fwd_fp8_sm100.cu that compiles with FP8_ENABLED macro. (csrc/sm100/prefill/dense/fmha_cutlass_fwd_fp8_sm100.cu)
  4. Update csrc/params.h to add FP8_PREFILL enum to the dtype field and update csrc/api/dense_fwd.h dispatch to select FP8 kernel when requested. (csrc/params.h)
  5. Add benchmark comparisons in benchmark/bench_flash_mla.py to validate FP8 prefill speedup vs. BF16 baseline. (benchmark/bench_flash_mla.py)
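To illustrate step 1, a per-tensor FP8 (e4m3) round trip can be sketched in PyTorch as below. The real kernels manage scales on-device inside the mainloop; the per-tensor scale choice and the fp8_utils.hpp file name above are assumptions, not FlashMLA's actual scheme.

```python
# Minimal per-tensor FP8 (e4m3) quantize/dequantize sketch for a KV-cache block.
# Assumes PyTorch >= 2.1 for the float8_e4m3fn dtype.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_kv_fp8(kv: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize a bf16/fp32 KV block to FP8 with one scale per tensor."""
    scale = kv.abs().amax().float().clamp(min=1e-12) / FP8_E4M3_MAX
    kv_fp8 = (kv.float() / scale).to(torch.float8_e4m3fn)
    return kv_fp8, scale

def dequantize_kv_fp8(kv_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a bf16 approximation for use in the attention mainloop."""
    return (kv_fp8.to(torch.float32) * scale).to(torch.bfloat16)

kv = torch.randn(64, 1, 576, dtype=torch.bfloat16)
kv_q, s = quantize_kv_fp8(kv)
err = (dequantize_kv_fp8(kv_q, s) - kv).abs().max()
print(f"max abs error after FP8 round-trip: {err.item():.4f}")
```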

🪤Traps & gotchas

  • CUDA version sensitivity: kernels require CUDA 12.8+ for H800 optimization and 12.9 for B200 features; older versions will compile but perform poorly.
  • SM architecture specificity: kernels are hand-tuned per GPU generation (SM80 vs SM90 vs SM100), so running on the wrong hardware silently degrades performance (a short guard sketch follows this list).
  • FP8 KV cache in sparse decode requires careful scale management, which is not obviously documented in the README.
  • Instantiation explosion: head-dimension-specific instantiations (head64/, head128/) mean adding new dimensions requires new .cu files with explicit template instantiations.
  • No visible Docker/containerized build; the project assumes a compatible host CUDA toolkit is present.
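A quick runtime guard against the silent-mismatch traps above, checking the GPU's SM version and the CUDA toolkit PyTorch was built with. The supported-SM set below is an assumption inferred from the csrc/ layout (sm80/sm90/sm100), not an official compatibility matrix.

```python
# Guard sketch: fail fast on unsupported SM versions, warn on old CUDA toolkits.
# Assumes a CUDA build of PyTorch with at least one visible GPU.
import torch

SUPPORTED_SM = {(8, 0), (9, 0), (10, 0)}  # Ampere (SM80), Hopper (SM90), Blackwell (SM100)

major, minor = torch.cuda.get_device_capability()
if (major, minor) not in SUPPORTED_SM:
    raise RuntimeError(
        f"GPU reports SM{major}{minor}; FlashMLA's tuned kernels target SM80/SM90/SM100"
    )

cuda_version = tuple(int(x) for x in torch.version.cuda.split("."))
if cuda_version < (12, 8):
    print(f"warning: CUDA {torch.version.cuda} detected; 12.8+ is recommended for Hopper, 12.9 for Blackwell")
```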

🏗️Architecture

💡Concepts to learn

  • Sparse Attention (Token-level) — FlashMLA's sparse kernels achieve 640 TFlops via DSA (DeepSeek Sparse Attention), which limits attention computation to selected token pairs rather than all-to-all—understanding sparsity patterns is key to using these kernels correctly
  • FP8 Quantization for KV Cache — Sparse decode kernels compress KV cache to FP8 while computing attention in bfloat16, reducing memory bandwidth and enabling 410 TFlops performance—critical for understanding kernel efficiency gains
  • Warp-Specialized Collective Operations — Kernels use TMA (Tensor Memory Accelerator) with warp-group specialization (see tma_cta_group2_nosplit.cuh) to hide global memory latency—essential Hopper architecture pattern for achieving compute saturation
  • Latent Attention / Multi-head Latent Attention (MLA) — FlashMLA optimizes MLA, a variant of multi-head attention that compresses KV projections into a lower-dimensional latent space — understanding this attention variant explains why these kernels differ from standard Flash Attention (a toy sketch follows this list)
  • Device-Specific Intrinsics (SM80/SM90/SM100) — FlashMLA abstracts GPU-specific hardware (Ampere vs Hopper vs Blackwell) through modular intrinsic headers; each SM version has different async copy, matrix instruction, and memory patterns you must respect when extending
  • Prefill vs Decode Kernel Distinction — FlashMLA maintains separate kernel implementations for prefill (long sequences, compute-bound) and decode (single token, memory-bound) because optimization strategies differ drastically—prefill emphasizes arithmetic intensity, decode maximizes memory bandwidth
  • CUTLASS Collective Primitives — Kernels use CUTLASS-style collective GEMM and epilogue patterns (see sm100/prefill/dense/collective/*.hpp) for modularity—understanding these abstractions helps modify operations or add new head dimensions
  • Dao-AILab/flash-attention — Original Flash Attention implementation that inspired optimized fused attention kernels; FlashMLA extends the approach with sparse patterns and latent attention
  • NVIDIA/cutlass — NVIDIA's CUDA template library underlying modern GEMM/tensor operations that FlashMLA kernels build upon for collective operations
  • deepseek-ai/DeepSeek-V3 — Flagship model that deploys FlashMLA kernels in production; primary consumer and performance target of this library
  • vllm-project/vllm — LLM serving framework that could integrate FlashMLA kernels as a backend for DeepSeek model inference optimization
  • NVIDIA/Megatron-LM — Large-scale LLM training framework where sparse/dense attention kernels like FlashMLA are integrated for efficient multi-GPU training
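To make the MLA concept above concrete, here is a toy PyTorch sketch of the latent-KV idea: the cache stores one small latent vector per token, and per-head K/V are re-expanded from it at attention time. The dimensions are illustrative only, and the sketch omits RoPE decoupling and the fused projections the real kernels exploit.

```python
# Toy Multi-head Latent Attention: cache a small shared latent instead of full per-head K/V.
import torch

d_model, n_heads, d_head, d_latent, seq = 1024, 8, 128, 64, 16

w_dkv = torch.randn(d_model, d_latent) / d_model ** 0.5          # shared KV down-projection
w_uk = torch.randn(n_heads, d_latent, d_head) / d_latent ** 0.5  # per-head K up-projection
w_uv = torch.randn(n_heads, d_latent, d_head) / d_latent ** 0.5  # per-head V up-projection

x = torch.randn(seq, d_model)
latent_kv = x @ w_dkv                               # (seq, d_latent): what the KV cache stores
k = torch.einsum("sl,hld->hsd", latent_kv, w_uk)    # (heads, seq, d_head)
v = torch.einsum("sl,hld->hsd", latent_kv, w_uv)

q = torch.randn(n_heads, seq, d_head)
attn = torch.softmax(q @ k.transpose(-1, -2) / d_head ** 0.5, dim=-1)
out = attn @ v                                      # (heads, seq, d_head)
print(latent_kv.shape, out.shape)  # cache holds d_latent per token, not 2 * heads * d_head
```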

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive kernel instantiation tests for SM100/SM90 sparse attention

The repo ships sparse attention kernels for both prefill and decode stages across the SM100 and SM90 architectures, but beyond the Python scripts in tests/ there are no visible unit tests validating per-instantiation kernel correctness, numerical accuracy, or regression detection. The instantiation files (e.g., csrc/sm100/decode/head64/instantiations/*.cu) need tests verifying they produce correct outputs for various input shapes and configurations. This is critical for maintaining kernel quality as the library evolves; a reference-comparison sketch follows the checklist.

  • [ ] Create test/sparse_attention_tests.cu with fixtures for SM100 sparse prefill/decode kernels
  • [ ] Add numerical validation tests comparing sparse attention outputs against reference dense attention implementations
  • [ ] Create test/instantiation_regression_tests.py to verify SM100 head64/head128 and SM90 instantiation variants produce consistent results
  • [ ] Add benchmark/test_correctness.py that runs the existing bench_flash_mla.py with assertion checks for correctness alongside performance metrics
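A sketch of the reference-comparison pattern for the numerical-validation item: a slow fp32 attention as ground truth plus an assert helper. The FlashMLA kernel call itself is left as a placeholder to be wired in; the tolerances are illustrative, not tuned.

```python
# Reference dense attention plus a correctness-check helper for kernel outputs.
import torch

def reference_attention(q, k, v, causal=True):
    """Naive fp32 attention used as ground truth."""
    q, k, v = (t.float() for t in (q, k, v))
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    if causal:
        mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def check_kernel(kernel_fn, q, k, v, rtol=2e-2, atol=2e-3):
    """Compare a kernel wrapper's output against the fp32 reference."""
    ref = reference_attention(q, k, v)
    out = kernel_fn(q, k, v)  # placeholder: call the compiled FlashMLA kernel here
    torch.testing.assert_close(out.float(), ref, rtol=rtol, atol=atol)

# Example flow with the reference itself standing in for the kernel:
q = k = v = torch.randn(8, 64, 128, dtype=torch.bfloat16)
check_kernel(lambda *t: reference_attention(*t), q, k, v)
```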

Document SM100 head128 implementation gap and add baseline kernel

csrc/sm100/decode/head128/README.md exists but is empty. Given that head64 has complete implementations (kernel.cuh, config.h, multiple instantiations), the head128 variant appears incomplete. This needs either: (1) documentation explaining why head128 is not yet implemented, (2) a roadmap for implementation, or (3) a baseline kernel implementation. This unblocks users trying to deploy MLA attention with larger head dimensions.

  • [ ] Review csrc/sm100/decode/head64/kernel.cuh and understand the pattern for head dimension parameterization
  • [ ] Populate csrc/sm100/decode/head128/README.md with either: implementation status, architectural constraints, or a task breakdown for adding support
  • [ ] If feasible, add csrc/sm100/decode/head128/config.h and csrc/sm100/decode/head128/kernel.cuh as a template/baseline implementation
  • [ ] Update main README.md to clarify supported head dimensions for each SM version

Add GPU memory profiling and performance regression CI workflow

The benchmark/ directory exists with bench_flash_mla.py and visualize.py, but there is no GitHub Actions workflow to automatically run benchmarks and detect performance or memory regressions on kernel changes. For a high-performance kernel library this is critical to prevent unintended slowdowns from merged PRs. The workflow should track metrics such as throughput (tokens/sec), memory bandwidth, and peak memory usage across SM architectures; a baseline-comparison sketch follows the checklist.

  • [ ] Create .github/workflows/benchmark.yml that runs on PR pushes to main/release branches with GPU runners (CUDA 12.1+)
  • [ ] Extend benchmark/bench_flash_mla.py to output JSON results with timing and memory stats for dense/sparse prefill/decode kernels
  • [ ] Add benchmark/compare_baseline.py to load baseline JSON results and flag any >5% regression in throughput or >10% increase in peak memory
  • [ ] Document GPU runner setup requirements in CONTRIBUTING.md (e.g., SM100/SM90 GPU availability, CUDA version, driver version)
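A sketch of what the proposed benchmark/compare_baseline.py (a hypothetical file) could look like. The JSON schema (name, tflops, peak_mem_mb per entry) is an assumption that would need to match whatever bench_flash_mla.py is extended to emit.

```python
# Compare a candidate benchmark JSON against a baseline and flag regressions
# (>5% throughput drop, >10% peak-memory growth). Schema is assumed, not existing.
import json
import sys

def load(path):
    with open(path) as f:
        return {row["name"]: row for row in json.load(f)}

def main(baseline_path, candidate_path):
    base, cand = load(baseline_path), load(candidate_path)
    failures = []
    for name, b in base.items():
        c = cand.get(name)
        if c is None:
            failures.append(f"{name}: missing from candidate results")
            continue
        if c["tflops"] < b["tflops"] * 0.95:
            failures.append(f"{name}: throughput {c['tflops']:.1f} < 95% of baseline {b['tflops']:.1f}")
        if c["peak_mem_mb"] > b["peak_mem_mb"] * 1.10:
            failures.append(f"{name}: peak memory {c['peak_mem_mb']:.0f} MB > 110% of baseline")
    for line in failures:
        print("REGRESSION:", line)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```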

🌿Good first issues

  • Add benchmarks for SM80 (A100) sparse attention in benchmark/bench_flash_mla.py to match coverage of Hopper/Blackwell—currently only H800 and B200 are highlighted
  • Document the FP8 scale factor computation and KV cache quantization scheme in a new docs/ file or README section, since csrc/sm90/decode/ uses FP8 but the math is not explained
  • Add unit tests under tests/ for head dimension 32 and 96 instantiations (only 64 and 128 appear in csrc/sm90/decode/head64/ and head128/)—requires creating head32.cu and head96.cu instantiation files


📝Recent commits

  • 9241ae3 — Swap FlashMLA combine grid dimensions (#182) (PerkzZheng)
  • 71c7379 — Change the order of grid dim in bwd convert kernel to avoid overlimit when sequence length is very large(>1M) (#173) (uchihatmtkinu)
  • 47c35a7 — Add CUDAGuard and device id assignment in sm100 dense fmha (#160) (uchihatmtkinu)
  • 48c6dc4 — nits (interestingLSY)
  • c741387 — Add missing include<span> (Jiashi Li)
  • 082094b — Multiple updates and refactorings (#150) (interestingLSY)
  • 1408756 — Update README (Jiashi Li)
  • 1858932 — Code format (Jiashi Li)
  • 7f55c71 — Fix error message (Jiashi Li)
  • e9b6732 — Update blog and README (interestingLSY)

🔒Security observations

FlashMLA is primarily a CUDA kernel library with a moderate security posture. No critical vulnerabilities are visible, but there are several medium-risk areas involving unsafe memory operations, type casting, and integer arithmetic in performance-critical kernels. The absence of dependency manifests suggests a low-level library with minimal external dependencies, reducing supply chain risk. The primary concerns are buffer safety, integer overflow, and unsafe type conversions in the kernel implementations. Recommended: comprehensive input validation at the API layer and static analysis tooling for memory safety. The code is well structured but would benefit from explicit safety documentation and hardened error handling.

  • Medium · CUDA Kernel Safety - Potential Buffer Overflow in Custom Kernels — csrc/sm100/prefill/sparse/*, csrc/kerutils/include/kerutils/device/*. The codebase contains custom CUDA kernels with complex memory management (TMA, shared memory operations) across multiple SM architectures (SM80, SM90, SM100). Without visible bounds checking in kernel headers, there is potential risk of buffer overflows, especially in sparse attention implementations where dynamic indexing is common. Fix: Implement comprehensive bounds checking in all kernel implementations. Add static analysis tools to detect potential out-of-bounds access patterns. Use CUDA memory protection mechanisms where available.
  • Medium · Unsafe Integer Arithmetic in Tensor Operations — csrc/api/dense_fwd.h, csrc/api/sparse_fwd.h, csrc/sm100/prefill/. The codebase performs tensor shape calculations and indexing computations (evident from file names like 'head64', 'head128', and sparse attention patterns). Integer overflow in dimension calculations could lead to memory corruption or incorrect attention computations. Fix: Use checked arithmetic operations for all tensor dimension and stride calculations. Implement maximum size limits for tensor operations. Add static analysis for integer overflow detection.
  • Medium · Unsafe Type Casting in Kernel Utilities — csrc/kerutils/include/kerutils/supplemental/torch_tensors.h, csrc/kerutils/include/kerutils/device/*/intrinsics.cuh. The kerutils library (csrc/kerutils/) contains device-specific intrinsics and helpers that may involve unsafe pointer casting and reinterpretation between different numeric types (FP8, FP32, etc.), particularly in the torch tensor interface. Fix: Replace unsafe casts with safe type conversion functions. Add compile-time type checking. Implement runtime type validation before pointer reinterpretation.
  • Low · Missing Input Validation in API Layer — csrc/api/api.cpp. The API entry points (csrc/api/api.cpp) may lack comprehensive input validation for tensor shapes, batch sizes, and sequence lengths before passing them to CUDA kernels. Fix: Implement strict input validation, including range checks, shape-compatibility verification, and memory-availability checks before kernel dispatch (a Python-level validation sketch follows this list).
  • Low · Potential Race Conditions in Concurrent Kernel Execution — csrc/api/sparse_decode.h, csrc/api/dense_decode.h. The sparse and dense kernel implementations may have thread safety issues if multiple kernel invocations share memory resources or if stream synchronization is insufficient. Fix: Ensure proper stream management and synchronization. Add explicit dependency tracking between kernel launches. Document thread safety assumptions.
  • Low · Missing Error Handling in CUDA API Calls — csrc/api/api.cpp, csrc/sm100/prefill/dense/fmha_cutlass_fwd_sm100.cu. The codebase likely contains CUDA API calls without visible error checking (kernel launches, memory allocations). Unchecked CUDA errors could lead to undefined behavior. Fix: Wrap all CUDA API calls with error checking macros. Implement proper error propagation and logging. Use CUDA error handling best practices.
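To make the input-validation point concrete, a Python-boundary validation shim might look like the following. The expected shapes, dtypes, and the block_size default are assumptions to adapt to the real kernel signatures; the actual fix belongs in or alongside csrc/api/api.cpp.

```python
# Illustrative pre-dispatch validation at the Python boundary (assumed layouts).
import torch

def validate_decode_inputs(q, kv_cache, block_table, cache_seqlens, block_size=64):
    """Reject obviously malformed inputs before handing tensors to the CUDA extension."""
    if q.dtype not in (torch.bfloat16, torch.float16):
        raise TypeError(f"q must be bf16/fp16, got {q.dtype}")
    if q.dim() != 4 or kv_cache.dim() != 4:
        raise ValueError("q and kv_cache must be 4-D (batched / paged layouts)")
    if not (q.is_cuda and kv_cache.is_cuda and block_table.is_cuda and cache_seqlens.is_cuda):
        raise ValueError("all tensors must live on a CUDA device")
    if cache_seqlens.dtype != torch.int32 or block_table.dtype != torch.int32:
        raise TypeError("cache_seqlens and block_table must be int32")
    if int(cache_seqlens.max()) > block_table.shape[1] * block_size:
        raise ValueError("cache_seqlens exceeds the capacity addressable by block_table")
```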

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
