Tiiny-AI/PowerInfer

Item: Tiiny-AI/PowerInfer
Rating: 5
Author: RepoPilot

High-speed Large Language Model Serving for Local Deployment

Healthy

Healthy across all four use cases

Use as dependencyHealthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modifyHealthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isHealthy

No critical CVEs, sane security posture — runnable as-is.

✓Last commit 4mo ago
✓30+ active contributors
✓Distributed ownership (top contributor 38% of recent commits)

Show 4 more →

✓MIT licensed
✓CI configured
✓Tests present
⚠Slowing — last commit 4mo ago

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:

[![RepoPilot: Healthy](https://repopilot.app/api/badge/tiiny-ai/powerinfer)](https://repopilot.app/r/tiiny-ai/powerinfer)

Paste at the top of your README.md — renders inline like a shields.io badge.

▸Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/tiiny-ai/powerinfer on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: Tiiny-AI/PowerInfer

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Tiiny-AI/PowerInfer shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across all four use cases

Last commit 4mo ago
30+ active contributors
Distributed ownership (top contributor 38% of recent commits)
MIT licensed
CI configured
Tests present
⚠ Slowing — last commit 4mo ago

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live Tiiny-AI/PowerInfer repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/Tiiny-AI/PowerInfer.

What it runs against: a local clone of Tiiny-AI/PowerInfer — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in Tiiny-AI/PowerInfer | Confirms the artifact applies here, not a fork | | 2 | License is still MIT | Catches relicense before you depend on it | | 3 | Default branch main exists | Catches branch renames | | 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 5 | Last commit ≤ 135 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>Tiiny-AI/PowerInfer</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Tiiny-AI/PowerInfer. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/Tiiny-AI/PowerInfer.git
#   cd PowerInfer
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of Tiiny-AI/PowerInfer and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "Tiiny-AI/PowerInfer(\\.git)?\\b" \\
  && ok "origin remote is Tiiny-AI/PowerInfer" \\
  || miss "origin remote is not Tiiny-AI/PowerInfer (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \\
   || grep -qiE "\"license\"\\s*:\\s*\"MIT\"" package.json 2>/dev/null) \\
  && ok "license is MIT" \\
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \\
  && ok "default branch main exists" \\
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "CMakeLists.txt" \\
  && ok "CMakeLists.txt" \\
  || miss "missing critical file: CMakeLists.txt"
test -f "common/common.h" \\
  && ok "common/common.h" \\
  || miss "missing critical file: common/common.h"
test -f "convert.py" \\
  && ok "convert.py" \\
  || miss "missing critical file: convert.py"
test -f "examples/batched/batched.cpp" \\
  && ok "examples/batched/batched.cpp" \\
  || miss "missing critical file: examples/batched/batched.cpp"
test -f "README.md" \\
  && ok "README.md" \\
  || miss "missing critical file: README.md"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 135 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~105d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Tiiny-AI/PowerInfer"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

⚡TL;DR

PowerInfer is a CPU/GPU hybrid LLM inference engine that accelerates large language model serving on consumer-grade hardware by exploiting activation sparsity patterns—only computing and loading the sparse subset of model parameters needed for each token. It achieves 11–22× speedup over competitors on sparse models like TurboSparse-Mixtral-47B (4B active params) by dynamically routing computation to GPU for dense layers and CPU for sparse layers. Monorepo with core inference engine in C++ (CMakeLists.txt, llama.cpp heritage), Python bindings in powerinfer-py/ for model loading, DevOps layer in .devops/ (Dockerfiles for CUDA/ROCm/Metal backends), and common utilities (common/common.cpp, common/sampling.cpp). GPU kernels in cuda/ directory; sparse matrix handling via custom activation routing logic. Build system: CMake + Zig support (build.zig).

👥Who it's for

ML engineers and researchers who deploy open-source LLMs locally on modest GPUs (RTX 3060, M-series Macs) and need production-grade inference speed without enterprise cloud costs; also systems researchers optimizing sparse neural network execution on heterogeneous hardware.

🌱Maturity & risk

Actively maintained and production-ready. The project has significant industry traction (featured at CES 2026 with Tiiny AI Pocket Lab, 90%+ sparse models published to HuggingFace), comprehensive CI/CD workflows (build.yml, docker.yml, tidy-review.yml), and Docker support for CUDA/ROCm/CPU. Recent commits and feature releases (SmallThinker models in 2025, PowerInfer-2 in 2024) indicate ongoing active development.

Moderate complexity risk: the codebase is large (10M+ LoC C++, 2.8M C, multi-GPU backend support) and requires careful CUDA/ROCm/Metal integration testing. Dependencies are minimal but GPU-specific (CUDA Toolkit, cuDNN implied by full-cuda.Dockerfile), creating platform lock-in. Single inference kernel bugs could silently produce wrong results; test coverage in common/ and test directories appears present but sparse model validation is critical. Breaking API changes may occur as the sparse tensor format evolves.

Active areas of work

Active focus on smartphone inference (PowerInfer-2), sparse model optimization (SmallThinker-21B and 4.6B variants released Q3 2025), and ROCm/AMD hardware support (added 2024). Recent workflows show emphasis on code coverage (code-coverage.yml) and multi-platform Docker builds. Competition engagement via CCF-TCArch challenge suggests optimization of sparse kernels is ongoing work.

🚀Get running

git clone https://github.com/Tiiny-AI/PowerInfer.git
cd PowerInfer
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# For Python bindings:
pip install -e ./powerinfer-py
# For CUDA support (requires CUDA Toolkit):
cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUDA=ON

Daily commands: Build and run default CPU inference: cmake --build build && ./main -m model.gguf -p 'Hello'. GPU (CUDA): rebuild with -DLLAMA_CUDA=ON, same command runs sparse kernels on GPU. Docker: docker build -f .devops/full-cuda.Dockerfile -t powerinfer . && docker run -it powerinfer ./main .... Python: from powerinfer import PowerInfer; model = PowerInfer('model.gguf'); print(model.generate('Hello')) after pip-installing powerinfer-py.

🗺️Map of the codebase

CMakeLists.txt — Root build configuration that orchestrates compilation of the entire PowerInfer engine; essential for setting up development environments and understanding the build dependency graph.
common/common.h — Core header defining fundamental data structures and utilities shared across the inference engine; a load-bearing abstraction every contributor must understand.
convert.py — Primary model conversion pipeline that transforms HuggingFace models into GGUF format; critical for understanding the model ingestion workflow.
examples/batched/batched.cpp — Reference implementation of batched inference execution; demonstrates the primary inference API pattern used throughout the codebase.
README.md — Project overview explaining the activation locality optimization concept and PowerInfer's value proposition; required context for all contributions.
.github/workflows/build.yml — CI/CD pipeline defining how code is validated, tested, and built across platforms; shows automation expectations for contributors.
common/CMakeLists.txt — Build configuration for common utilities library that all other modules depend on; a critical linking point in the build system.

🛠️How to make changes

Add a new inference backend or optimization

Create new backend module in CMakeLists.txt under appropriate target_sources() section (CMakeLists.txt)
Add backend-specific SIMD/GPU detection to cmake/FindSIMD.cmake if needed (cmake/FindSIMD.cmake)
Implement backend logic using common inference context from common/common.h (common/common.h)
Create example usage in examples/ directory following batched.cpp pattern (examples/batched/batched.cpp)
Add backend-specific Docker image variant in .devops/ (.devops/main.Dockerfile)

Add a new model conversion feature

Extend convert.py with new model architecture support in conversion pipeline (convert.py)
Add HuggingFace-specific conversion in convert-hf-to-powerinfer-gguf.py (convert-hf-to-powerinfer-gguf.py)
Define new model tensors and metadata structures compatible with common/common.h types (common/common.h)
Test conversion with example in examples/ and document in README.md (README.md)

Add a new sampling or decoding strategy

Implement sampling logic in common/sampling.h with new strategy class (common/sampling.h)
Integrate into common/sampling.cpp for context initialization (common/sampling.cpp)
Create example demonstrating new strategy in examples/beam-search/ or similar (examples/beam-search/beam-search.cpp)
Add tests via .github/workflows/build.yml if complex logic introduced (.github/workflows/build.yml)

Support a new hardware platform

Update cmake/FindSIMD.cmake to detect new CPU ISA or GPU vendor (cmake/FindSIMD.cmake)
Add platform-specific build flags in CMakeLists.txt target configuration (CMakeLists.txt)
Create Dockerfile variant in .devops/ with platform-specific dependencies (.devops/main.Dockerfile)
Add platform to GitHub Actions matrix in .github/workflows/build.yml (.github/workflows/build.yml)

🔧Why these technologies

C++ with CMake — Provides fine-grained control over GPU/CPU memory and compute, essential for optimizing activation locality; cross-platform build system for diverse hardware targets.
CUDA/ROCm/Metal support — PowerInfer's core innovation is leveraging GPU acceleration via activation sparsity; multi-backend support allows deployment across consumer GPUs (NVIDIA, AMD, Apple).
GGUF model format — Efficient quantized model serialization enabling fast loading and reduced memory footprint; industry standard for local LLM deployment.
Python model conversion pipeline — HuggingFace ecosystem integration allows converting any modern LLM architecture; separate from C++ runtime for ease of iteration.
Batched inference design — Multi-request batching amortizes activation locality benefits across parallel sequences; critical for throughput on consumer hardware.

⚖️Trade-offs already made

Activation locality optimization requires profiling phase to identify sparse compute patterns
- Why: Reduces runtime memory bandwidth and GPU utilization by selectively activating only necessary layers per token
- Consequence: Model loading becomes two-phase (profile + inference); adds initial latency but dramatically improves throughput and memory efficiency on consumer GPUs
C++ inference engine vs. Python-only implementation
- Why: Fine-grained memory management and GPU scheduling needed for activation locality to work efficiently
- Consequence: Steeper deployment complexity than pure Python; faster runtime but harder to modify without recompilation
GGUF quantization support vs. full-precision models
- Why: Quantization reduces model size and memory bandwidth to fit consumer GPUs (e.g., 8GB VRAM for 13B models)
- Consequence: Small accuracy loss (~1-3% on typical benchmarks) but enables practical local deployment; cannot serve full-precision models efficiently
Modular backend system (CUDA/ROCm/Metal/CPU) vs. single GPU vendor lock-in
- Why: Maximize addressable hardware market for local LLM inference
- Consequence: Higher maintenance burden across platform variants; inconsistent performance characteristics require per-backend tuning

🚫Non-goals (don't propose these)

Not a production-grade serving framework (no authentication, load balancing, or multi-tenancy)
Does not handle distributed inference across multiple machines
Does not include automatic model optimization or quantization (external tools like llama.cpp's quantizers required)
Not focused on training or fine-tuning at scale (examples provide lightweight fine-tuning only)
Does not support real-time model updates without reloading inference context

🪤Traps & gotchas

GGUF format lock-in: Models must be converted to GGUF format (not HuggingFace safetensors by default); conversion requires llama.cpp's convert script. 2. Sparse metadata dependency: Activation sparsity is pre-baked into GGUF; models without sparsity annotations fall back to dense mode, negating speedup—verify with PowerInfer/tools/sparse_checker.py (if exists). 3. GPU memory pinning: CUDA kernels assume page-locked host memory for transfers; insufficient pinned memory causes silent slowdown, not errors. 4. CMake in-tree build required: Out-of-tree builds may fail due to relative paths in .devops/ Dockerfiles; always use cmake -B build pattern. 5. ROCm version fragility: ROCm APIs break between minor versions; full-rocm.Dockerfile pins a specific version, deviating breaks compilation.

🏗️Architecture

💡Concepts to learn

Activation Sparsity / Sparse Neural Networks — Core efficiency mechanism in PowerInfer—ReLU-based and routed sparse models activate only 10% of parameters per forward pass, enabling the engine to skip 90% of computation. Understanding which layers are dense vs sparse determines the GPU/CPU routing strategy.
Heterogeneous Computation (CPU/GPU Task Dispatch) — PowerInfer's distinguishing feature: dense layers run on GPU (high throughput), sparse layers run on CPU (low latency); requires runtime kernel selection and memory coherency. Developers must understand when each processor type is optimal.
GGUF Quantization Format — All PowerInfer models are stored and loaded in GGUF (GPT-Generated Unified Format); underpins model serialization, sparsity metadata embedding, and cross-platform portability. Essential for model preparation and custom model integration.
Token-by-Token Autoregressive Generation — LLM inference generates one token at a time; PowerInfer's sparse routing is latency-critical per-token since sparse layer selection must happen dynamically. Understanding the batch-size-1 execution model is crucial for performance analysis.
CUDA/ROCm Heterogeneous Kernel Programming — PowerInfer supports NVIDIA (CUDA), AMD (ROCm), and Apple (Metal) GPU backends; each requires backend-specific sparse matrix kernels in cuda/, rocm/, metal/ directories. Code must handle triple-path GPU dispatch and fallback logic.
Sparse Matrix Formats (CSR, COO, blocking) — Efficient sparse layer storage and multiplication require structured storage (Compressed Sparse Row or custom blocking patterns). PowerInfer's performance depends on choosing the right sparsity structure that aligns with hardware (NVIDIA Tensor Cores, CPU cache lines).
Model Sparsification / Pruning — PowerInfer achieves 90% sparsity via offline model training or sparsification (e.g., TurboSparse-Mixtral); understanding which sparsification techniques preserve model quality while maximizing compute savings is essential for preparing new models.

ggerganov/llama.cpp — Direct ancestor and foundation—PowerInfer forks llama.cpp and extends it with sparse activation routing and multi-GPU scheduling.
vllm-project/vllm — Alternative high-speed LLM serving framework with PagedAttention; targets data-center GPUs whereas PowerInfer targets consumer hardware with sparse models.
SJTU-IPADS/Bamboo — LLM model (7B) optimized for PowerInfer inference; demonstrates production-ready sparse models and achieves both high quality and speed with the engine.
OpenSpiel/open_spiel — Not directly related but demonstrates sparse neural network research from SJTU-IPADS lab; provides context for activation sparsity techniques PowerInfer exploits.
huggingface/transformers — Model loading and tokenization dependency (transformers>=4.33.2 in dependencies); PowerInfer models are exported to HuggingFace Hub as GGUF artifacts.

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for Python conversion scripts (convert.py, convert-dense.py, convert-hf-to-powerinfer-gguf.py)

The repo has critical Python conversion utilities but no visible test files in the structure. These scripts handle model conversion to GGUF format—a core operation for users. Testing edge cases (malformed models, missing layers, quantization parameters) would prevent silent conversion failures and improve reliability for contributors modifying these scripts.

[ ] Create tests/python/test_convert.py with fixtures for sample HF models
[ ] Add tests for convert-hf-to-powerinfer-gguf.py covering activation sparsity extraction
[ ] Add tests for convert-dense.py with various tensor shapes and data types
[ ] Integrate pytest into .github/workflows/build.yml with Python test stage
[ ] Document test setup in ci/README.md for running locally

Add GPU memory profiling and benchmark CI workflow for CUDA/ROCm inference

The repo has multiple Dockerfile variants (full-cuda.Dockerfile, full-rocm.Dockerfile, main-cuda.Dockerfile) and a batched-bench example, but no dedicated CI workflow measuring memory usage or throughput across GPU targets. This would catch regressions in the core inference engine and help validate activation locality optimizations.

[ ] Create .github/workflows/gpu-benchmark.yml with matrix for CUDA and ROCm containers
[ ] Extend examples/batched-bench/batched-bench.cpp to output peak memory and tokens/sec metrics
[ ] Add workflow step to run benchmarks against a fixed model checkpoint and report results as workflow artifacts
[ ] Document expected performance baselines in docs/token_generation_performance_tips.md
[ ] Reference the workflow in the main README under performance expectations

Create missing CMake integration tests for SIMD/architecture detection (cmake/FindSIMD.cmake)

The FindSIMD.cmake module handles CPU capability detection (critical for CPU inference), but there are no visible tests validating it works across architectures (x86-64, ARM, AVX-512 variants). Contributors may break SIMD feature detection or CMake configuration without realizing it. Unit tests would ensure portable builds.

[ ] Create tests/cmake/test_find_simd.cmake with mock CPU capability scenarios
[ ] Add tests validating correct flags for AVX2, SSE4.2, NEON, SVE detection
[ ] Integrate CMake tests into .github/workflows/build.yml on multiple runner architectures (ubuntu-latest, arm-based if available)
[ ] Document SIMD requirements and testing in docs/BLIS.md
[ ] Add fallback validation to ensure generic builds work when SIMD is unavailable

🌿Good first issues

Add unit tests for sparse activation routing logic: Currently no explicit test file for the core sparse-dispatch mechanism in llama.cpp. Create tests/test_sparse_routing.cpp with fixtures testing CPU vs GPU kernel selection for known sparsity patterns.: Medium: Prevents silent bugs in sparse-path execution and documents the routing contract.
Document PowerInfer-specific GGUF extensions in common/: The sparse metadata format (activation masks, layer sparsity levels) is not formally documented. Add comments to the GGUF reader explaining how sparse tensors are tagged and loaded.: Easy: Unblocks contributors attempting custom sparse model formats and model optimization research.
Implement Metal (Apple Silicon) sparse kernels: Metal backend in common/ is CPU-only fallback; add GPU-accelerated sparse matrix ops in Metal Shading Language for M-series Macs (similar to cuda/sparse_ops.cu pattern).: Hard: Enables 10x+ speedup on MacBook inference, major missing platform optimization.

⭐Top contributors

Click to expand

@hodlen — 38 commits
@jeremyyx — 17 commits
@ZeyuMi — 6 commits
@wisman-tccr — 4 commits
@ggerganov — 4 commits

📝Recent commits