ggml-org/llama.cpp
LLM inference in C/C++
Healthy across the board
- ✓ Last commit today
- ✓ 5 active contributors
- ✓ Distributed ownership (top contributor 47%)
- ✓ MIT licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Small team — 5 top contributors
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Embed this verdict
[https://repopilot.app/r/ggml-org/llama.cpp](https://repopilot.app/r/ggml-org/llama.cpp)
Paste into your README — the badge live-updates from the latest cached analysis.
Onboarding doc
Onboarding: ggml-org/llama.cpp
Generated by RepoPilot · 2026-05-05 · Source
Verdict
GO — Healthy across the board
- Last commit today
- 5 active contributors
- Distributed ownership (top contributor 47%)
- MIT licensed
- CI configured
- Tests present
- ⚠ Small team — 5 top contributors
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
TL;DR
llama.cpp is a high-performance LLM inference engine written in C/C++ that runs large language models (like LLaMA, Mistral, Qwen, GPT-NeoX, etc.) locally on consumer hardware without Python dependencies. It uses the GGUF model format with aggressive quantization (2-bit through 8-bit) to minimize memory footprint, and exposes a REST API server (llama-server), a CLI (llama-cli), and a C library (libllama) for embedding inference into other applications.

Flat-ish monorepo: src/ and include/ hold the core libllama C++ library, examples/ contains standalone programs (llama-cli, llama-server, llama-bench, etc.), ggml/ is a vendored/submodule copy of the ggml tensor backend, and tools/ holds conversion and quantization Python scripts. GPU backends live under ggml/src/ (e.g., CUDA kernels in .cu files, Metal shaders as .metal).
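For a concrete feel of the server surface described above, here is a minimal request against llama-server's OpenAI-compatible chat endpoint. This is a sketch: it assumes a server already running locally on port 8080 (see "Get running" below).

```bash
# Query the OpenAI-compatible chat completions endpoint of a locally running llama-server
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```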
Who it's for
ML engineers and systems developers who need to run LLMs on-premise or on edge hardware (Mac M-series, NVIDIA/AMD GPUs, or CPU-only machines) without cloud API costs. Also targeted at app developers embedding LLM inference via the libllama C API, and researchers who need fast, hackable inference without PyTorch overhead.
Maturity & risk
Extremely mature and production-ready: the repo has a formal release pipeline (.github/actions/get-tag-name), extensive CI across CUDA, ROCm, Vulkan, OpenVINO, and CPU targets (multiple Dockerfiles in .devops/), and a tracked API changelog (issues #9289 and #9291). Active development is confirmed by recent hot-topic features like multimodal support, Hugging Face cache integration, and MXFP4 format. One of the most starred C++ ML repos on GitHub.
Low risk for core inference usage: the primary dependency is the companion ggml tensor library (ggml-org/ggml), which is co-developed by the same org. The main risk is API churn — libllama and llama-server REST API both have active changelogs tracking breaking changes (#9289, #9291), so downstream consumers must pin versions. GPU backend fragmentation (CUDA, ROCm, Metal, Vulkan, CANN, MUSA, OpenVINO) means hardware-specific code paths can diverge.
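In practice, pinning means building from a fixed tag or commit rather than tracking master. A minimal sketch; the tag name is a placeholder, pick a real one from the project's releases:

```bash
# Clone and build at a pinned release tag or commit (placeholder tag name)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout <release-tag-or-commit>
cmake -B build && cmake --build build --config Release -j"$(nproc)"
```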
Active areas of work
Active work includes: multimodal (vision) support in llama-server (PR #12898 recently merged), native MXFP4 format support for NVIDIA gpt-oss models (PR #15091), a new WebUI for llama-server (discussion #16938), Hugging Face cache directory migration for -hf flag downloads, and ongoing work on the VS Code / Vim plugin ecosystem (llama.vscode, llama.vim).
Get running
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # omit -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j$(nproc)
./build/bin/llama-cli -m /path/to/model.gguf -p 'Hello' -n 128
# or pull a model: ./build/bin/llama-cli -hf bartowski/Llama-3.2-1B-Instruct-GGUF
Daily commands:
CPU only
cmake -B build && cmake --build build -j$(nproc)
./build/bin/llama-server -m model.gguf --port 8080
With CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build -j$(nproc)
Docker (CUDA)
docker build -f .devops/cuda.Dockerfile -t llama-cpp-cuda .
docker run --gpus all -p 8080:8080 llama-cpp-cuda -m /models/model.gguf
Map of the codebase
- CMakeLists.txt — Root build configuration that wires together all targets, backend flags (CUDA, Metal, Vulkan, etc.), and compile-time options — must be understood before adding any new component or backend.
- README.md — Primary documentation covering build instructions, supported models, quantization formats, and server usage — the canonical entry point for all contributors and users.
- CLAUDE.md — AI-agent contribution guide that defines code style, PR conventions, and architectural decisions specifically governing how changes should be made in this repo.
- AGENTS.md — Defines agent-facing conventions and automated tooling rules that govern CI bot behavior and agentic contribution workflows.
- CONTRIBUTING.md — Formal contributor guidelines covering commit hygiene, testing expectations, and review process — mandatory reading before submitting a PR.
- .github/workflows/build.yml — Primary CI workflow that validates all platform builds (Linux, macOS, Windows, CUDA, Vulkan) — understanding this is critical for knowing what must pass before merge.
- .github/workflows/server.yml — CI workflow specifically for the llama-server REST API, covering integration tests that gate server-side changes.
How to make changes
Add a new GPU/hardware backend
- Create a new Dockerfile for the backend following the pattern of existing ones (e.g., cuda.Dockerfile). Name it after the backend. (.devops/cuda.Dockerfile)
- Add a new CMake option (e.g., LLAMA_NEWBACKEND=ON) and corresponding target/source linkage in the root CMakeLists.txt; a configure sketch follows this list. (CMakeLists.txt)
- Add a build preset for the new backend in CMakePresets.json so developers can easily configure it. (CMakePresets.json)
- Add a new GitHub Actions workflow (or extend build.yml) with a job that builds and tests the new backend in CI. (.github/workflows/build.yml)
- Add a Nix package derivation for the backend if Nix support is desired, following the pattern of package.nix. (.devops/nix/package.nix)
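To sanity-check the wiring locally, configure a build with the new option enabled and run a short generation. The flag name below reuses the LLAMA_NEWBACKEND placeholder from the checklist above; it is not a real option:

```bash
# Configure with the hypothetical new backend flag, then build and smoke-test
cmake -B build -DLLAMA_NEWBACKEND=ON
cmake --build build --config Release -j"$(nproc)"
./build/bin/llama-cli -m /path/to/model.gguf -p "Hello" -n 16
```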
Add a new model architecture
- Review CONTRIBUTING.md and CLAUDE.md to understand the expected code style, testing requirements, and where model-specific code lives. (CONTRIBUTING.md)
- Update CMakeLists.txt to include any new source files required for the model architecture. (CMakeLists.txt)
- Add an issue or PR using the enhancement template to track the feature and get early feedback from CODEOWNERS. (.github/ISSUE_TEMPLATE/020-enhancement.yml)
- Ensure the new architecture is covered in the server integration tests by extending or adding test cases in the server workflow; a conversion sketch follows this list. (.github/workflows/server.yml)
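New architectures are usually exercised end-to-end by converting a Hugging Face checkpoint to GGUF and running it. A sketch, assuming the convert_hf_to_gguf.py script and requirements.txt at the repo root; the model paths are illustrative:

```bash
# Install the conversion script's Python dependencies
pip install -r requirements.txt

# Convert a Hugging Face checkpoint to GGUF (paths are illustrative)
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf

# Verify the converted model loads and generates with the new architecture code
./build/bin/llama-cli -m model-f16.gguf -p "Hello" -n 16
```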
Add a new quantization format
- Consult CLAUDE.md for conventions around quantization naming, format versioning (GGUF), and backward compatibility requirements. (CLAUDE.md)
- Add the new quantization type and any compile-time flags to CMakeLists.txt. (CMakeLists.txt)
- Update the GGUF publish workflow if the new format needs to be distributed via Hugging Face or package releases. (.github/workflows/gguf-publish.yml)
- Verify pre-tokenizer hashes are not broken by running or updating the hash check workflow; a quantization sketch follows this list. (.github/workflows/pre-tokenizer-hashes.yml)
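For reference, this is how an existing format is produced and evaluated with the repo's llama-quantize and llama-perplexity tools; a new format would add its own type name to this flow. A sketch, assuming the binaries built earlier and an f16 GGUF on disk:

```bash
# Quantize an f16 GGUF to an existing k-quant type (Q4_K_M shown as an example)
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# Compare perplexity against the unquantized model to gauge quality loss
# (the wikitext path assumes the dataset has already been downloaded locally)
./build/bin/llama-perplexity -m model-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
```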
Add a new REST API endpoint to llama-server
- Review the server workflow to understand which integration tests cover the REST API and what format test cases follow. (.github/workflows/server.yml)
- Review the server sanitizer workflow to understand memory-safety requirements for new endpoint code. (.github/workflows/server-sanitize.yml)
- Add the new endpoint implementation following existing patterns and update CMakeLists.txt if new source files are needed; a manual check is sketched after this list. (CMakeLists.txt)
- Update the PR template checklist to confirm the new endpoint is documented and tested before merge. (.github/pull_request_template.md)
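Before wiring up integration tests, a quick manual check against a locally running server helps. A sketch: the /health probe exists in current llama-server builds, while /v1/my-endpoint is a placeholder for whatever route you add:

```bash
# Start the server locally
./build/bin/llama-server -m model.gguf --port 8080 &

# Existing health probe, useful as a pattern for asserting responses
curl -s http://localhost:8080/health

# Placeholder for the new endpoint under development
curl -s http://localhost:8080/v1/my-endpoint -H "Content-Type: application/json" -d '{}'
```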
Why these technologies
- C/C++ (C++17) — Maximum portability and performance with zero runtime overhead; allows direct memory management critical for fitting large models in RAM/VRAM.
- GGUF file format — Self-describing binary format for quantized model weights that supports memory-mapping, enabling models larger than RAM to run via OS-managed paging.
- ggml tensor library — Custom low-level tensor computation library (also by ggml-org) that provides backend-agnostic op definitions and quantized kernel implementations.
- CMake — Industry-standard cross-platform build system that handles the complex matrix of optional backends (CUDA, Metal, Vulkan, SYCL, CANN, OpenVINO) and platforms.
- CUDA / Metal / Vulkan / ROCm — Multiple GPU backends ensure broad hardware coverage; each backend accelerates the same ggml ops with platform-native kernels without changing the model logic.
- Nix — Provides fully reproducible development environments and Docker/SIF image builds, eliminating 'works on my machine' issues across CI and contributor setups.
Traps & gotchas
1. GGUF format versions change — models converted with an old convert_hf_to_gguf.py may fail to load in newer builds; always reconvert after major updates.
2. GGML_CUDA=ON requires a matching CUDA toolkit version; mismatches cause silent compute errors, not build failures.
3. The Metal backend on macOS requires the Xcode command-line tools, not just the compiler — xcode-select --install must be run.
4. The -hf flag now stores models in ~/.cache/huggingface/ (recently migrated), which breaks scripts that assumed the old local directory.
5. Context length (-c) above the model's trained maximum silently degrades quality — there is no hard error.
Architecture
Concepts to learn
- GGUF model format — All models must be in GGUF format to run in this repo — it encodes quantized weights, tokenizer vocab, and metadata in a single portable binary file.
- Post-training quantization (k-quants) — llama.cpp's performance advantage comes from running Q4_K_M, IQ2_XXS, and similar quantization schemes that reduce model size 4-8x with minimal quality loss.
- KV cache — The key-value cache stores intermediate attention states to avoid recomputing past tokens — its size limits maximum context length and its management is central to src/llama.cpp.
- Flash Attention — The memory-efficient attention algorithm used in llama.cpp's CUDA and Metal backends to handle long contexts without OOM — toggled via build flags.
- Compute graph (lazy evaluation) — ggml builds a DAG of tensor operations before executing them, enabling backend-agnostic dispatch to CPU, CUDA, Metal, or Vulkan without changing model code.
- Speculative decoding — llama.cpp supports draft-model speculative decoding to increase token throughput — implemented in examples/speculative/ and requires understanding of how token acceptance works; see the sketch after this list.
- Grouped Query Attention (GQA) — Modern models like LLaMA-3 use GQA to reduce KV cache size by sharing key/value heads across query heads — llama.cpp must handle this in its attention kernel implementations.
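A minimal way to try speculative decoding from the CLI, sketched under the assumption that the llama-speculative example binary accepts a target model (-m) and a smaller draft model (-md); both model paths are illustrative:

```bash
# Target model verifies tokens proposed by the smaller draft model
./build/bin/llama-speculative \
  -m  /path/to/target-model.gguf \
  -md /path/to/small-draft-model.gguf \
  -p "Explain speculative decoding in one paragraph." \
  -n 128
```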
Related repos
- ggml-org/ggml — The underlying tensor computation library that llama.cpp vendors — understanding ggml ops is required for adding new model architectures or backends.
- ollama/ollama — Higher-level LLM serving tool built on top of llama.cpp — useful reference for how libllama is consumed via Go bindings.
- ggerganov/whisper.cpp — Sister project by the same author using the same ggml backend for speech-to-text — shares architectural patterns and build system conventions.
- huggingface/transformers — The source of model weights in HuggingFace format that must be converted to GGUF via tools/ before llama.cpp can run them.
- Mozilla-Ocho/llamafile — Alternative distribution approach for llama.cpp that bundles model + binary into a single executable — same inference core, different packaging.
PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add a GitHub Actions workflow for OpenVINO backend CI (build-openvino.yml is missing test steps)
The file .github/workflows/build-openvino.yml exists alongside .devops/openvino.Dockerfile and .github/actions/linux-setup-openvino/, suggesting OpenVINO is a supported backend. However, there is no evidence of inference/smoke-test steps in the workflow — only build verification. Adding actual runtime inference tests (e.g. running llama-cli with a small GGUF model on the OpenVINO backend) would catch regressions that pure build checks miss, which is especially important for a less-mainstream backend.
- [ ] Review .github/workflows/build-openvino.yml and .github/actions/linux-setup-openvino/action.yml to understand current setup steps
- [ ] Identify a minimal quantized GGUF model (e.g. a tiny LLaMA or Qwen model) that can be downloaded in CI via the -hf flag or a direct URL
- [ ] Add a test job step in build-openvino.yml that runs ./llama-cli -m <model> -p 'Hello' -n 16 --device openvino and asserts a zero exit code (a step sketch follows this checklist)
- [ ] Add a negative test that verifies graceful error output when an unsupported model type is passed to the OpenVINO backend
- [ ] Update .github/actions/linux-setup-openvino/action.yml if any additional runtime dependencies (e.g. OpenVINO runtime shared libs) are needed for the test step
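A sketch of what that test step's shell body could run. The -hf download and the --device openvino flag are taken from the checklist above; the tiny model repo name is a placeholder:

```bash
# Fail the CI step on any non-zero exit code
set -e

# Run a short generation on the OpenVINO backend using a tiny model pulled via -hf
./build/bin/llama-cli \
  -hf <tiny-gguf-model-repo> \
  -p "Hello" -n 16 \
  --device openvino
```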
Add a dedicated GitHub Actions workflow for s390x cross-compilation and smoke-test (.devops/s390x.Dockerfile is untested in CI)
The file .devops/s390x.Dockerfile exists, indicating llama.cpp targets IBM Z (s390x) architecture, but there is no corresponding .github/workflows/build-s390x.yml or build-cross.yml entry visibly covering s390x. By contrast, RISC-V has .github/workflows/build-riscv.yml. Adding an s390x workflow using QEMU emulation (as is common in GitHub Actions for non-native architectures) would ensure s390x builds do not silently regress, which matters for enterprise Linux users.
- [ ] Examine .github/workflows/build-riscv.yml as a reference template for QEMU-based cross-compilation workflows
- [ ] Create .github/workflows/build-s390x.yml mirroring the RISC-V workflow structure but targeting s390x via docker run --platform linux/s390x or qemu-user-static
- [ ] Reference .devops/s390x.Dockerfile in the new workflow's Docker build step
- [ ] Add a basic smoke-test step that runs ./llama-cli --version inside the s390x container to confirm the binary executes correctly under emulation (a sketch follows this checklist)
- [ ] Add the new workflow to the labeler in .github/labeler.yml so PRs touching s390x-related files are auto-labeled
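A sketch of the emulated build-and-smoke-test commands such a workflow could run. Assumptions: QEMU binfmt handlers are installable on the runner via the tonistiigi/binfmt image, and the image built from .devops/s390x.Dockerfile exposes llama-cli in its working directory:

```bash
# Register QEMU binfmt handlers so the x86_64 runner can execute s390x binaries
docker run --privileged --rm tonistiigi/binfmt --install s390x

# Build the s390x image from the existing Dockerfile
docker build --platform linux/s390x -f .devops/s390x.Dockerfile -t llama-cpp-s390x .

# Smoke-test under emulation; assumes llama-cli is on PATH or in the image workdir
docker run --rm --platform linux/s390x llama-cpp-s390x ./llama-cli --version
```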
Split .github/workflows/build-self-hosted.yml into per-backend self-hosted workflows to reduce blast radius and improve debuggability
A single monolithic self-hosted runner workflow (build-self-hosted.yml) that covers multiple hardware backends (e.g. CUDA, ROCm, CANN, Snapdragon — evidenced by the existence of build-and-test-snapdragon.yml as a separate file) is hard to maintain: a failure in one backend's job blocks visibility into others, and reviewer diffs are large. Given that Snapdragon already has its own file, the same split should be applied to the remaining backends in build-self-hosted.yml, each in a focused workflow file with clear on.workflow_dispatch and on.push triggers scoped to relevant paths.
- [ ] Open build-self-hosted.yml and enumerate each distinct hardware-backend job defined inside it (e
Good first issues
1. Add missing man pages for the llama-bench and llama-perplexity binaries — examples/ has these tools but .devops/ and docs/ lack corresponding usage documentation.
2. The .devops/nix/ package definitions may be missing the newly added MXFP4/gpt-oss backend flags — audit package.nix against current CMakeLists.txt feature flags.
3. Add integration tests for the OpenAI-compatible /v1/chat/completions streaming endpoint in examples/server/ — the CI checks build correctness but functional API response tests are sparse; a streaming request sketch follows this list.
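For the third item, a streaming request against a locally running llama-server looks roughly like this. A sketch that assumes the server from "Daily commands" is listening on port 8080; each data: line in the response is one streamed chunk:

```bash
# Request server-sent events from the OpenAI-compatible chat endpoint
curl -sN http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Count to five."}],
        "stream": true,
        "max_tokens": 32
      }'
```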
Top contributors
- @ggerganov — 16 commits
- @JohannesGaessler — 5 commits
- @angt — 5 commits
- @reeselevine — 4 commits
- @SharmaRithik — 4 commits
Recent commits
- eff0670 — kleidiai : update to v1.24.0 and use release archive (#22549) (chaxu01)
- e77056f — CUDA: use fastdiv for batch index split in get_rows (#22650) (leonardHONG)
- 935a340 — server: implement /models?reload=1 (#21848) (ngxson)
- d8794ee — examples: refactor diffusion generation (#22590) (Sailaukan)
- 36a694c — webui : fix circular dependency between chat.service.ts and models.svelte.ts (#22625) (Juste-Leo2)
- a4701c9 — common/autoparser: fixes for newline handling / forced tool calls (#22654) (pwilkin)
- 994118a — model: move load_hparams and load_tensors to per-model definition (#22004) (ngxson)
- c84e6d6 — server: Add a simple get_datetime server tool (#22649) (eapache)
- fa8feae — webui: restore missing settings (#22666) (ntowle)
- 846262d — docs : update speculative decoding parameters after refactor (#22397) (#22539) (ggerganov)
Security observations
- High · Outdated/Unverified Third-Party Dependency (rotate-bits) — package.json (rotate-bits dependency). The dependency file references 'rotate-bits' version 0.1.1 from 'jb55/rotate-bits.h' and a development dependency 'thlorenz/tap.c' with a wildcard version. Wildcard versioning allows any version to be resolved, including potentially malicious or broken future releases. There is no integrity hash or checksum specified, making supply-chain attacks possible if the upstream repository is compromised. Fix: Pin all dependencies to specific, audited versions with integrity checksums (e.g., SRI hashes or lockfiles). Avoid wildcard versioning for any dependency, including development ones. Periodically audit third-party C header-only libraries for security issues.
- High · Untrusted Model Loading Risk — src/ (model loading code, GGUF parsing logic). As noted in SECURITY.md under 'Untrusted models', loading arbitrary GGUF model files from untrusted sources poses a significant risk. GGUF parsing involves reading binary data with complex structure handling in C/C++, which is historically prone to buffer overflows, integer overflows, and heap corruption vulnerabilities during deserialization of maliciously crafted files. Fix: Implement strict input validation and bounds checking during GGUF file parsing. Fuzz the model loading code paths continuously (e.g., libFuzzer/AFL++). Do not load GGUF files from untrusted sources without sandboxing (e.g., seccomp, namespaces). Consider memory-safe wrappers for critical parsing routines.
- High · Prompt Injection Risk in LLM Server API — .github/workflows/server.yml, llama-server (server component). The llama-server exposes a REST API (tracked in .github/workflows/server.yml). LLM inference servers are susceptible to prompt injection attacks where untrusted user input manipulates the model's behavior, potentially leading to data exfiltration, jailbreaks, or unintended system interactions if the server is used in an agentic context. Fix: Implement input sanitization and length limits on all user-supplied prompts. Document clearly that the server should not be exposed publicly without authentication. Add rate limiting and authentication middleware. Warn users about prompt injection risks in multi-tenant deployments.
- High · Missing Authentication on llama-server REST API — .github/workflows/server.yml, server/ component. Based on the project structure and known llama.cpp server behavior, the REST API server (llama-server) does not enforce authentication by default. If deployed and exposed on a network, any client can submit inference requests, potentially leading to resource exhaustion, unauthorized data access, or abuse of the server. Fix: Enforce API key authentication by default; require explicit opt-in to disable it. Provide clear documentation warning against exposing the server to untrusted networks without authentication. Consider adding TLS support or requiring a reverse proxy with auth. See the sketch after this list.
- Medium · Docker Images May Use Broad Base Images Without Hardening — .devops/cuda.Dockerfile, .devops/rocm.Dockerfile, .devops/vulkan.Dockerfile, .devops/cann.Dockerfile, .devops/musa.Dockerfile. Multiple Dockerfiles are present. Without reviewing their contents, common risks include running as the root user, unpinned base images (e.g., 'ubuntu:latest'), inclusion of unnecessary build tools in runtime images, and no explicit USER directive to drop privileges. Fix: Pin base image digests using SHA256. Use multi-stage builds to separate build and runtime environments. Add a non-root USER directive in all Dockerfiles. Remove build tools and unnecessary packages from final images. Scan images with tools like Trivy or Snyk.
- Medium · Potential Buffer Overflow in C/C++ Inference Code — no specific file cited. The codebase is written primarily in C/C++ and performs low-level memory operations for LLM tensor computations. Codebases of this nature are inherently susceptible to memory safety issues: buffer overflows, use-after-free, integer overflows in size calculations, and out-of-bounds memory access. These risks increase significantly when processing untrusted inputs such as model files and user-supplied prompts.
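Relating to the authentication observation above, llama-server can be started with an API key so unauthenticated requests are rejected. A sketch: the --api-key flag exists in current builds, while the key value, host binding, and client call are illustrative:

```bash
# Bind to localhost and require a bearer token for API requests (key value is illustrative)
./build/bin/llama-server -m model.gguf --host 127.0.0.1 --port 8080 --api-key "change-me"

# Clients must then send the key with every request
curl -s http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer change-me" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}]}'
```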
LLM-derived; treat as a starting point, not a security audit.
Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.