# google/gemma.cpp

A lightweight, standalone C++ inference engine for Google's Gemma models.
**Healthy across the board**

- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — a clean foundation to fork and modify.
- Documented and popular — a useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
### Embed the "Healthy" badge

Paste at the top of your README.md — it renders inline like a shields.io badge and live-updates from the latest cached analysis:

[](https://repopilot.app/r/google/gemma.cpp)
A social card (1200×630) auto-renders when someone shares https://repopilot.app/r/google/gemma.cpp on X, Slack, or LinkedIn.
# Onboarding: google/gemma.cpp
Generated by RepoPilot · 2026-05-09 · Source
## 🤖 Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- **Verify the contract.** Run the bash script in *Verify before trusting* below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- **Treat the "AI · unverified" sections as hypotheses, not facts.** Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- **Cite source on changes.** When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/google/gemma.cpp shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
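For example, an agent wrapper can gate edits on the verification script (`verify.sh` here is whatever filename you save the Verify block under):

```bash
bash verify.sh || {
  echo "RepoPilot artifact is stale; ask the user to regenerate it."
  exit 1
}
```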
## 🎯 Verdict
GO — Healthy across the board
- Last commit 1d ago
- 14 active contributors
- Apache-2.0 licensed
- CI configured
- Tests present
- ⚠ Concentrated ownership — top contributor handles 59% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
## ✅ Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live google/gemma.cpp repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/google/gemma.cpp.

What it runs against: a local clone of google/gemma.cpp. The script inspects the git remote, the LICENSE file, file paths in the working tree, and the git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in google/gemma.cpp | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of google/gemma.cpp. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/google/gemma.cpp.git
#   cd gemma.cpp
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok()   { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of google/gemma.cpp and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "google/gemma\.cpp(\.git)?\b" \
  && ok "origin remote is google/gemma.cpp" \
  || miss "origin remote is not google/gemma.cpp (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. The standard Apache-2.0 LICENSE text
# opens with "Apache License" / "Version 2.0", so match that rather than the
# bare SPDX id, which never appears at the start of the file.
{ grep -qi "Apache License" LICENSE 2>/dev/null \
    && grep -qi "Version 2.0" LICENSE 2>/dev/null; } \
  || grep -qiE '"license"\s*:\s*"Apache-2\.0"' package.json 2>/dev/null \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in gemma/gemma.h gemma/gemma.cc gemma/configs.h gemma/tokenizer.h gemma/attention.h; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/google/gemma.cpp"
  exit 1
fi
```
Each check prints `ok:` or `FAIL:`. The script exits non-zero if anything failed, so it composes cleanly into agent loops (`./verify.sh || regenerate-and-retry`).
## ⚡ TL;DR

gemma.cpp is a lightweight, CPU-only C++ inference engine for Google's Gemma 2/3 and PaliGemma 2 models, featuring mixed-precision (fp8/bf16/fp32) weight compression and SIMD optimization via Google Highway. It delivers standalone LLM inference with a core implementation under 2K LoC, targeting research and embedded deployment where Python frameworks are too heavyweight.

Monorepo structure: gemma/ contains the core inference code (.cc/.h files); compression/ isolates the weight-quantization logic, with both a C++ implementation (-inl.h headers) and Python bindings (compression/python/); evals/ provides benchmarking and cross-entropy evaluation utilities; bazel/ holds build configuration. There is no single main.cc; instead, API_SERVER_README.md and the examples guide integration.
## 👥 Who it's for
ML researchers and embedded systems engineers who need to run Gemma models on CPUs with full control over low-level optimization and model internals; developers building edge LLM applications that can't depend on PyTorch/JAX ecosystems; teams doing LLM co-design research combining high-level algorithms with SIMD acceleration.
## 🌱 Maturity & risk

Actively developed (dev-branch model; the community Discord is active). Tests are present (compression/ contains _test.cc files; evals/ has benchmark.cc) and Bazel/CMake CI is configured (.github/workflows/build.yml), but the project is positioned for research and experimentation rather than production. The repository structure suggests a solid foundation with early-stage feature completeness (PaliGemma 2 support was added recently).

External dependency risk is low (minimalist philosophy; only Google Highway is required for SIMD), and the single-organization maintainer (Google) mitigates the bus factor. Main risks: research-grade code with active development on the dev branch means main may lag; the compression/ module's complexity (a custom fp8 format with 2-3 mantissa bits) requires deep domain knowledge; and no breaking-change policy is stated.
### Active areas of work

Active work covers mixed-precision GEMM with 7-parameter autotuning, expanded Gemma-3 model support, and a backward pass (VJP) plus Adam optimizer for research workflows. The dev branch is primary: pull requests target dev, not main. Community engagement via Discord suggests regular feature discussion and bug triage.
## 🚀 Get running

Clone with `git clone https://github.com/google/gemma.cpp.git && cd gemma.cpp`, then build with CMake (`./cmake.sh`) or Bazel (`bazel build //gemma:all`). For the Python compression bindings: `pip install -r compression/python/requirements.txt`. See DEVELOPERS.md for detailed contributor setup and API_SERVER_README.md for integration examples. The steps are collected below.
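The same steps as one block (script and target names come from this artifact and may have drifted; check DEVELOPERS.md if a step fails):

```bash
git clone https://github.com/google/gemma.cpp.git
cd gemma.cpp

# Build with CMake (this artifact points at a ./cmake.sh wrapper)...
./cmake.sh
# ...or with Bazel.
bazel build //gemma:all

# Optional: Python compression bindings.
pip install -r compression/python/requirements.txt
```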
Daily commands:
For inference: build with `./cmake.sh` or `bazel build`, then integrate via the `gemma/` headers into your C++ project (see API_SERVER_README.md for code patterns). For benchmarks: `bazel run //evals:benchmarks`. For the compression Python tests: `python compression/python/compression_test.py`. There is no standalone server binary in the default build; the API server lives in a separate deployment layer. A short cheat-sheet follows.
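The daily commands in one place (Bazel target names are this artifact's claims; confirm them against the BUILD files):

```bash
bazel run //evals:benchmarks                    # benchmark suite
python compression/python/compression_test.py   # Python compression tests
```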
## 🗺️ Map of the codebase

- `gemma/gemma.h` — Core inference-engine interface defining the Gemma model class and the primary API for loading and running inference.
- `gemma/gemma.cc` — Main implementation of the Gemma model inference logic, including forward passes and token generation.
- `gemma/configs.h` — Model-configuration definitions for all supported Gemma variants (Gemma-2, Gemma-3, PaliGemma-2).
- `gemma/tokenizer.h` — Tokenizer interface for encoding/decoding text to/from token IDs, critical for input/output processing.
- `gemma/attention.h` — Attention-mechanism implementation including flash-attention optimizations, core to model computation.
- `gemma/kv_cache.h` — Key-value cache management for efficient autoregressive generation.
- `compression/compress.h` — Weight-compression utilities (NUQ, SFP) for reducing model size while maintaining inference quality.
## 🛠️ How to make changes

### Add support for a new Gemma model variant

- Define the model configuration with layer counts, head dimensions, and other hyperparameters (`gemma/configs.h`).
- Add the instantiation in the config factory or enum (`gemma/configs.cc`).
- If the variant has a unique architecture (e.g., different attention), modify the forward pass (`gemma/gemma.cc`).
- Add tests validating that the new configuration loads and generates tokens correctly (`evals/gemma_test.cc`); a sample invocation follows.
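For that last step, a minimal invocation, assuming the test file maps to a Bazel target of the same name and uses Googletest filters (both assumptions; confirm against the BUILD files):

```bash
# Run only the new variant's test case; 'NewVariant' is a placeholder name.
bazel test //evals:gemma_test --test_output=errors \
  --test_arg=--gtest_filter='*NewVariant*'
```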
### Implement a new weight compression format

- Define the compression-algorithm header, similar to nuq-inl.h or sfp-inl.h (`compression/compress.h`).
- Implement the compression and decompression logic as inline functions (`compression/compress-inl.h`).
- Add encode/decode entry points in the main compression interface (`compression/compress.cc`).
- Write compression tests including round-trip and distortion analysis (`compression/compress_test.cc`); see the invocation below.
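To exercise the round-trip tests, assuming compress_test.cc has a same-named Bazel target (unverified):

```bash
bazel test //compression:compress_test --test_output=errors
```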
### Add a new language binding or API interface

- Define the C API interface following the existing c_api.h pattern (`gemma/bindings/c_api.h`).
- Implement the C API wrapper converting language calls into C++ model calls (`gemma/bindings/c_api.cc`).
- For managed languages (e.g., C#), add a language-specific bindings file (`gemma/bindings/GemmaInterop.cs`).
- Create an integration example demonstrating language-specific usage (`examples/hello_world/run.cc`).
### Optimize a performance bottleneck

- Profile the operation using evals/benchmark.cc to establish a baseline (`evals/benchmark.cc`).
- Implement the optimized kernel or algorithm, e.g. in attention.cc or flash_attention.cc (`gemma/attention.cc`).
- Add micro-benchmarks validating correctness and the performance gain (`evals/gemma_test.cc`).
- Document the optimization rationale and trade-offs in DEVELOPERS.md (`DEVELOPERS.md`). A before/after loop is sketched below.
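A simple before/after loop, reusing the benchmark target named in "Get running" (the target name is this artifact's claim, not verified):

```bash
bazel run -c opt //evals:benchmarks | tee /tmp/baseline.txt
# ...apply the kernel change, e.g. in gemma/attention.cc...
bazel run -c opt //evals:benchmarks | tee /tmp/after.txt
diff /tmp/baseline.txt /tmp/after.txt    # inspect the deltas by hand
```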
## 🔧 Why these technologies
- C++ — Enables standalone lightweight inference without runtime dependencies, direct hardware access, and minimal memory overhead—critical for embedded and resource-constrained deployment.
- SentencePiece tokenizer — Industry-standard, language-agnostic tokenization matching Gemma's training setup, enabling accurate prompt encoding/decoding.
- Weight compression (NUQ, SFP) — Reduces model footprint 2-4x while preserving inference quality, enabling deployment on memory-limited devices without accuracy loss.
- Flash Attention — Reduces the attention mechanism's memory complexity from O(n²) to O(n), dramatically improving inference speed and supporting longer context windows.
- Bazel & CMake dual build systems — Accommodates both monorepo (Google) and open-source standalone use cases, enabling flexible dependency and compilation management.
## ⚖️ Trade-offs already made

- **Minimalist implementation over full generality**
  - Why: Simplicity enables research experimentation and direct inspection of the computation graph, rather than abstraction through complex compilation.
  - Consequence: Does not support dynamic shapes, ONNX export, or model surgery — the trade-off is intentional per the README.
- **Single-threaded inference loop with optional batch support**
  - Why: Simplifies synchronization logic and reduces memory fragmentation for autoregressive token generation.
  - Consequence: Lower throughput on many-token batches compared to vLLM; optimized for latency-sensitive single-instance serving.
- **Embedded weight format (compressed tensors) rather than industry standards like ONNX or SafeTensors**
  - Why: The custom compression formats (NUQ, SFP) are co-designed with the inference kernels for optimal hardware utilization.
  - Consequence: Weights are not portable to other runtimes; weight export/import requires custom tooling.
- **KV cache managed explicitly per model rather than transparently for the user**
  - Why: Exposes the cache structure for research modifications (e.g., cache pruning, mixture-of-experts routing).
  - Consequence: Users must manage the cache lifecycle; higher complexity than opaque auto-caching frameworks.
## 🚫 Non-goals (don't propose these)
- Does not support distributed inference across multiple devices or machines
- Does not provide real-time streaming output—tokens are generated in-process and returned in batches
- Does not include model training or fine-tuning capabilities
- Does not guarantee bit-exact reproducibility across different hardware/compiler combinations
- Does not support dynamic shape changes or graph compilation
- Not intended as a general-purpose neural network runtime (focused only on Gemma-family models)
## 🪤 Traps & gotchas

Bazel builds require a matching .bazelversion and MODULE.bazel dependency resolution, and can fail silently under the wrong Bazel version. Google Highway's SIMD dispatch is compile-time, based on target-architecture flags; cross-compilation requires an explicit --config. The weight-compression formats (fp8 with custom mantissa bits) are gemma.cpp-specific, so standard int8-quantized models cannot be used without conversion. There is no official Docker image, and CMakePresets.json has hard-coded paths that may break in non-standard environments. compression_test.py depends on numpy>=1.26.4 (see requirements.txt); the constraint has no upper bound, so pin a known-good version if a newer numpy misbehaves. Two quick guardrails follow.
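Guardrails for two of those gotchas (the version numbers are this artifact's claims; the upper bound echoes the security note later in this document):

```bash
# Bazel: confirm the pinned version matches what you have installed.
cat .bazelversion
bazel --version

# numpy: install within a tested range instead of an open-ended floor.
pip install 'numpy>=1.26.4,<2.0.0'
```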
## 🏗️ Architecture

### 💡 Concepts to learn

- Mixed-precision GEMM (General Matrix Multiply) — the core performance optimization in gemma.cpp: weights are stored in fp8/bf16, accumulation happens in fp32, and seven parameters are autotuned per matrix shape to balance speed against accuracy (sketched in the formula below).
- SIMD (Single Instruction, Multiple Data) via Google Highway — the CPU optimization technique used throughout the gemma/ core to parallelize tensor operations; Highway abstracts away x86 AVX-512/AVX2 and Arm NEON/SVE differences, which is critical for portable performance.
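In symbols, the mixed-precision scheme from the first bullet stores narrow operands but accumulates in full precision (a schematic only; the notation is ours, not the codebase's):

$$
C_{ij} \leftarrow C_{ij} + \sum_{k} \mathrm{fp32}\big(\hat{A}_{ik}\big)\,\mathrm{fp32}\big(\hat{B}_{kj}\big), \qquad \hat{A}, \hat{B} \text{ stored in fp8/bf16}, \quad C \text{ in fp32}.
$$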
## 🔗 Related repos

- ggerganov/ggml — cited inspiration (the foundation of llama.cpp); a similar minimalist C inference approach with broader model coverage and more mature quantization formats.
- karpathy/llama2.c — explicit ancestor referenced in the README; pioneered the "small is beautiful" C LLM-inference philosophy that gemma.cpp follows.
- google/highway — required dependency for SIMD acceleration; gemma.cpp is a primary user of Highway for portable x86/Arm tensor operations.
- google/jax — complementary deployment target; the README explicitly recommends JAX for production and gemma.cpp for research and embedded use.
- huggingface/transformers — model distribution and fine-tuning companion; Kaggle hosts Gemma weights compatible with transformers, and gemma.cpp consumes converted artifacts.
## 🪄 PR ideas

To work on one of these in Claude Code or Cursor, paste: `Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.`

### Add comprehensive CMake/Bazel parity tests in the CI/CD pipeline

The repo supports both Bazel and CMake build systems (BUILD.bazel, CMakeLists.txt, CMakePresets.json, .bazelrc, and WORKSPACE exist in parallel), but .github/workflows/build.yml likely tests only one. This creates a risk of build failures when switching systems. A contributor could add parallel CI jobs testing both build systems across platforms to ensure feature parity; a local dry run is sketched after the checklist.
- [ ] Examine .github/workflows/build.yml to identify current build coverage
- [ ] Add separate CI job matrix entries for Bazel and CMake builds
- [ ] Test compression/, evals/, and examples/ directories with both systems
- [ ] Document expected build outputs in DEVELOPERS.md if missing
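A local dry run of the parity idea before touching CI (assumes the top-level targets build on your platform; adjust flags per DEVELOPERS.md):

```bash
set -e
bazel build //... && echo "bazel: OK"
cmake -B build -S . && cmake --build build -j && echo "cmake: OK"
```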
### Create integration tests for the compression module's Python bindings

compression/python/ exists with compression_extension.cc and compression_test.py, but the repo lacks cross-language integration tests. The Python compression module is critical for model preparation, yet there is no test coverage ensuring the C++ compression code produces identical results when called from Python versus directly; a starting harness is sketched after the checklist.
- [ ] Review compression/python/compression_test.py for current scope
- [ ] Add test cases in compression_test.cc that compare C++ API output with Python bindings output
- [ ] Add fixture data files for reproducible compression testing
- [ ] Update compression/python/requirements.txt with any test dependencies (pytest, etc.)
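A starting harness that runs both suites back-to-back (target names are this artifact's claims; the actual output comparison is the piece this PR would add):

```bash
python compression/python/compression_test.py
bazel test //compression:compress_test --test_output=errors
# TODO(pr): dump compressed tensors from both paths to files and diff them.
```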
### Add GPU/SIMD backend benchmark comparison tests in evals/

The repo contains evals/benchmark.cc and evals/benchmark_helper.cc but likely lacks systematic comparison tests between computational backends (scalar, SIMD, potential GPU paths). Given that gemma.cpp targets "lightweight standalone" inference with co-design opportunities, contributors could add repeatable benchmark suites measuring performance across hardware configurations; a runner sketch follows the checklist.
- [ ] Extend evals/benchmark.cc with parameterized tests for different kernel implementations
- [ ] Add baseline golden files in evals/ for performance regression detection
- [ ] Create benchmark runner script that outputs JSON for CI trend analysis
- [ ] Document benchmark methodology in docs/CONTRIBUTING.md (e.g., warm-up iterations, device state)
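A sketch of the JSON-emitting runner from the checklist (it assumes the benchmark prints "name value" pairs, which is a guess about the output format):

```bash
bazel run -c opt //evals:benchmarks | \
  awk 'BEGIN { printf "{" }
       NF == 2 { if (n++) printf ","; printf "\"%s\":%s", $1, $2 }
       END { print "}" }' > bench.json
```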
## 🌿 Good first issues

- Audit unit-test coverage for compression/distortion.h (a distortion_test.cc already exists, so verify the coverage is complete) — helps validate quantization quality metrics before integration
- Document SIMD dispatching strategy in DEVELOPERS.md: explain how Google Highway target choices affect binary size and performance for ARM64 vs x86-64 builds
- Create a minimal C++ example program (e.g., examples/inference_simple.cc) that demonstrates loading a Gemma model and generating 10 tokens, filling the gap between API_SERVER_README.md and full integration
## ⭐ Top contributors
- @jan-wassenberg — 59 commits
- @theraysmith — 11 commits
- @mariecwhite — 8 commits
- @niting — 4 commits
- @Hitesh-ed — 3 commits
## 📝 Recent commits
- `3ed403e` — Major cleanup of profiler zones, add Caller annotation for all pool.Run (jan-wassenberg)
- `9e8ac7e` — Use correct offsets in BlobWriter. (niting)
- `64a82ed` — Merge pull request #735 from Hitesh-ed:gemma.cpp-windows-build-fix (copybara-github)
- `027288b` — Merge branch 'dev' into gemma.cpp-windows-build-fix (Hitesh-ed)
- `acede9d` — Warning fix (unused var), Windows build fix (missing member variable) (jan-wassenberg)
- `c55120f` — Merge branch 'dev' into gemma.cpp-windows-build-fix (Hitesh-ed)
- `f59eb2e` — Remove multi-package support from topology (jan-wassenberg)
- `cc1d256` — Update CMakePresets.json (Hitesh-ed)
- `9b6ed1a` — gemma_batch_bench: generate more unique prompts (jan-wassenberg)
- `503aadd` — Add 8-bit integer quantization (I8Stream) to Gemma.cpp. (pculliton)
## 🔒 Security observations
The gemma.cpp codebase appears to be a well-structured C++ inference engine with generally good security practices. No critical vulnerabilities such as hardcoded credentials or obvious injection flaws were identified. However, several medium-severity concerns exist: (1) dependency management lacks version bounds, (2) low-level compression implementations require thorough bounds checking review, and (3) external model weight loading lacks visible verification mechanisms. The API server and C bindings would benefit from enhanced input validation. Overall security posture is moderate-to-good, but runtime safety and supply chain security measures should be strengthened.
- **Medium · Broad dependency version specification** — `compression/python/requirements.txt`. The numpy dependency uses a minimum-version constraint (>=1.26.4) without an upper bound, allowing installation of any future numpy version, including ones with security vulnerabilities or breaking changes that could affect the compression module. Fix: use bounded constraints such as `numpy>=1.26.4,<2.0.0`, or pin a specific tested version, to keep dependencies predictable and secure.
- **Medium · Potential buffer overflow in compression module** — `compression/` (compress-inl.h, int-inl.h, nuq-inl.h, sfp-inl.h). These inline header files implement low-level data manipulation; without visible bounds checking, there is a risk of buffer-overflow vulnerabilities in compression/decompression operations. Fix: review the inline implementations thoroughly, ensure all buffer operations are bounds-checked, add fuzzing tests for the compression functions, and prefer safe C++ constructs (bounds-checked std::vector over raw pointers).
- **Medium · Unvalidated model weight loading** — `gemma/` (model-loading components). The project loads external model weights from Kaggle; without visible verification mechanisms (checksums, signatures), there is a risk of loading compromised or tampered model files. Fix: implement cryptographic verification of model weights using SHA-256 checksums or signatures, and document the secure process for obtaining and verifying artifacts from Kaggle (a verification sketch follows this section).
- **Low · C++ API surface exposure** — `gemma/bindings/c_api.cc` and `c_api.h`. The C API bindings expose C++ functionality through C interfaces, which could be misused when the library is consumed as a shared library without proper bounds checking. Fix: validate all input parameters in the C API functions, document memory ownership and lifecycle clearly, add input-validation tests, and consider runtime guards against common C API misuse.
- **Low · Potential integer overflow in numeric operations** — `compression/int-inl.h`, `compression/types.h`. These files perform numeric conversions and integer operations that could be vulnerable to overflow on specially crafted inputs. Fix: add overflow checks for integer operations, use checked arithmetic where available, fuzz-test with extreme values, and consider a safe-integer library.
- **Low · No visible input sanitization in API server** — `gemma/api_server.cc`. The API server accepts prompts and parameters, but validation mechanisms are not visible in the file listing. Fix: implement comprehensive input validation for all endpoints, sanitize prompt inputs, enforce request-size limits, add rate limiting, and range-check all numeric parameters.
LLM-derived; treat as a starting point, not a security audit.
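For the weight-verification fix above, a minimal sketch (the filename is a placeholder; use whatever artifact you actually downloaded from Kaggle):

```bash
# Record a checksum once, from a fetch you trust...
sha256sum gemma-weights.sbs > weights.sha256
# ...then re-check before every load.
sha256sum -c weights.sha256
```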
## 👉 Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.