RepoPilot

sjmoran/bitbudget

How much retrieval quality do you keep per byte? A reproducible benchmark for embedding compression.

Healthy

Healthy across all four use cases

Healthy: Dependency

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Healthy: Fork & modify

Has a license, tests, and CI — clean foundation to fork and modify.

Healthy: Learn from

Documented and popular — useful reference codebase to read through.

Healthy: Deploy as-is

No critical CVEs, sane security posture — runnable as-is.

  • Solo or near-solo (1 contributor active in recent commits)
  • Scorecard: default branch unprotected (0/10)
  • Last commit 2w ago
  • MIT licensed
  • CI configured
  • Tests present

Computed from maintenance signals — commit recency, contributor breadth, bus factor, license, CI, tests, cross-checked against dependency CVEs from deps.dev and OpenSSF Scorecard

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Onboarding: sjmoran/bitbudget

Generated by RepoPilot · 2026-06-28 · Source

🎯Verdict

Healthy — Healthy across all four use cases

  • Last commit 2w ago
  • MIT licensed
  • CI configured
  • Tests present
  • ⚠ Solo or near-solo (1 contributor active in recent commits)
  • ⚠ Scorecard: default branch unprotected (0/10)

TL;DR

BitBudget is a reproducible benchmark for embedding compression that measures retrieval quality (nDCG@10, recall@10) against bytes-per-vector storage cost. It evaluates compression methods such as binarization, product quantization, RaBitQ, and Matryoshka truncation across BEIR corpora to answer one question: when you compress embeddings, how much retrieval quality do you lose per byte stored?

The code is a monolithic, CLI-driven package under src/bitbudget/. The core logic splits into embedders.py (model wrappers), indexes.py (HNSW/IVF-PQ/bittrie), methods.py (compression algorithms), metrics.py (nDCG/recall), and eval.py (orchestration). board_cards/ holds JSON result artifacts, and site/ holds a static leaderboard (data.json + index.html). A small C extension (_bittrie.c), compiled via _bittrie_build.py, accelerates the query hot path of the bittrie index.

LLM-derived; treat as a starting point, not verified fact.

👥Who it's for

ML/RAG engineers and researchers building vector databases and retrieval systems who need to decide between compression methods and quantization strategies. They use BitBudget to benchmark their embedder+compression pipeline and make storage vs. recall trade-off decisions backed by reproducible empirical data rather than vendor claims.

LLM-derived; treat as a starting point, not verified fact.

🌱Maturity & risk

Actively developed but research-stage. The repo has CI/CD workflows (.github/workflows/), a published leaderboard (LEADERBOARD.md), and a tests/ directory. Commit frequency and issue volume are not visible from file metadata, and no GitHub star or issue data was available, so adoption beyond the research community cannot be judged from this analysis. The CITATION.cff and companion survey paper suggest academic-grade rigor.

Low-to-moderate risk. The project is mostly Python with a small C extension (_bittrie.c, for query performance), so compilation could fail on some platforms. The sentence-transformers and faiss dependencies are optional but recommended, which widens the multi-version testing surface. A single-author repo (sjmoran) means a bus factor of one: no maintainer team is visible, and no recent-commit metadata was provided to this part of the analysis.

LLM-derived; treat as a starting point, not verified fact.

Active areas of work

The maintainer is actively updating the compression leaderboard and expanding benchmark coverage. The card files (card_bge-base.json through card_openai-3-large.json) suggest ongoing evaluation of new embedder models. The build_board.sh and cards_to_data.py pipeline indicates active leaderboard publishing. The companion survey paper on projection and quantization (arXiv:2510.04127) suggests the research narrative is maturing.

LLM-derived; treat as a starting point, not verified fact.

🚀Get running

git clone https://github.com/sjmoran/bitbudget.git
cd bitbudget
pip install -e '.[all]'  # installs with sentence-transformers + faiss
bitbudget methods           # list available compression methods
bitbudget run --embedder mxbai --corpus scifact  # embed + evaluate

Daily commands:

bitbudget run --embedder mxbai --corpus scifact     # full pipeline: embed + compress + eval
bitbudget bench-index --synthetic 100000 128         # benchmark index recall/QPS/bytes trade-offs
bitbudget leaderboard results/card_*.json            # render markdown leaderboard from result cards
bitbudget methods                                     # list all compression strategies
bitbudget indexes                                     # list all index types

🗺️Map of the codebase

  • src/bitbudget/__init__.py — Package entry point exposing the main API for compression benchmarking.
  • src/bitbudget/cli.py — Command-line interface orchestrating the benchmark workflow; start here for understanding how experiments run.
  • src/bitbudget/methods.py — Core compression methods (binarization, quantization, product quantization) that form the heart of the benchmark.
  • src/bitbudget/eval.py — Evaluation logic computing nDCG@10 and recall@10 metrics against retrieved neighbors.
  • src/bitbudget/indexes.py — Index abstractions (in-memory and faiss-based) for storing and searching compressed embeddings.
  • src/bitbudget/embedders.py — Embedder wrappers for different models (OpenAI, BGE, E5) that produce baseline vectors.
  • README.md — Explains the benchmark philosophy, headline findings, and how to use BitBudget.

🧩Components & responsibilities

  • Embedders (Transformers (HuggingFace), OpenAI API, sentence-transformers) — Fetch pretrained models and produce fixed-size float32 vectors from text.
    • Failure mode: API rate limit, OOM, model weight download failure → benchmark hangs or crashes.
  • Methods (Compression) (NumPy, sklearn.decomposition, custom quantization logic) — Transform full vectors into lower-bit representations while preserving retrieval structure.
    • Failure mode: Numerical instability, incorrect bit-packing → silent correctness errors in downstream ranking.
  • Indexes (FAISS, brute-force L2, bit-trie (optional)) — Store compressed vectors and perform efficient nearest-neighbor retrieval.
    • Failure mode: FAISS build failure, OOM during large-scale search → benchmark cannot complete.
  • Evaluation (NumPy, NDCG calculation, ranking metrics) — Compute nDCG@10 and recall@10 by comparing retrieved neighbors against ground-truth rankings.
    • Failure mode: Ground-truth mismatch, metric computation bugs → incorrect leaderboard scores.
  • CLI Orchestrator (argparse, Python logging) — Parse arguments, sequence component calls, and manage experiment state.
    • Failure mode: Incorrect argument validation, state corruption across runs → experiment reproducibility broken.
  • Datasets (Hugging Face datasets library, JSON parsing) — Download and load BEIR corpora, queries, and ground-truth relevance judgments.
    • Failure mode: Network failure, corrupted cache, missing splits → benchmark cannot initialize.
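
The "incorrect bit-packing" failure mode listed for the Methods component is easy to guard against with a round-trip check. The following is a minimal sketch, assuming sign binarization with NumPy's packbits; the function names pack_signs and unpack_signs are illustrative and are not taken from methods.py.

```python
import numpy as np

def pack_signs(vectors: np.ndarray) -> np.ndarray:
    """Keep one sign bit per dimension, packed 8 bits to a byte."""
    return np.packbits(vectors > 0, axis=1)

def unpack_signs(codes: np.ndarray, dim: int) -> np.ndarray:
    """Recover the +1/-1 sign pattern; count=dim drops the padding bits."""
    bits = np.unpackbits(codes, axis=1, count=dim)
    return bits.astype(np.float32) * 2.0 - 1.0

rng = np.random.default_rng(0)
dim = 100                                  # deliberately not a multiple of 8
vectors = rng.standard_normal((50, dim)).astype(np.float32)

codes = pack_signs(vectors)                # shape (50, 13): ceil(100 / 8) bytes per vector
signs = unpack_signs(codes, dim)

# Round-trip check: unpacked signs must match the original sign pattern.
expected = np.where(vectors > 0, 1.0, -1.0).astype(np.float32)
assert np.array_equal(signs, expected)
```

Using a dimension that is not a multiple of 8 exercises the padding path, which is where packing bugs tend to hide.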

🔀Data flow

  • Embedder → Method: Full-precision float32 vectors (shape: [n, d]) passed to compression method.
  • Method → Index: Compressed bit/int8 vectors stored in index for retrieval.
  • Index → Eval: Top-k neighbor indices and distances returned for each query.
  • Eval → CLI: Aggregated nDCG@10 and recall@10 metrics written to results JSON.
  • CLI → Leaderboard: Results cards serialized to board_cards/ and ingested by cards_to_data.py for website rendering.
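
The evaluation step of this flow reduces to two per-query metrics. Below is a minimal sketch of nDCG@10 and recall@10 for a single query, assuming linear gain and a log2 rank discount; the exact formulation in metrics.py is not confirmed here, and the function names are illustrative.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (ranks start at 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance maps doc_id -> graded relevance; missing ids count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, relevance, k=10):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)

# Toy query: two relevant documents, one of them ranked third.
qrels = {"d1": 1, "d3": 1}
ranking = ["d1", "d2", "d3"]
ndcg = ndcg_at_k(ranking, qrels)
recall = recall_at_k(ranking, qrels)
```

Averaging these per-query values over all queries gives the aggregate numbers a leaderboard would report.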

🛠️How to make changes

Add a new compression method

  1. Define a class inheriting from Method in src/bitbudget/methods.py with compress() and decompress() implementations. (src/bitbudget/methods.py)
  2. Register it in the method registry (likely a dict in methods.py or __init__.py). (src/bitbudget/methods.py)
  3. Add a test case in tests/test_protocol.py to verify it satisfies the Method protocol. (tests/test_protocol.py)
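
The steps above might look roughly like the following. This is a sketch under stated assumptions: the Method base class here is a local stand-in, the compress()/decompress() signatures are guessed from the description, and the METHODS dict is a hypothetical registry. Check methods.py for the real interface before copying any of it.

```python
import numpy as np

class Method:
    """Stand-in for the base class in methods.py; the real interface may differ."""
    name = "base"

    def compress(self, vectors: np.ndarray) -> np.ndarray:
        raise NotImplementedError

    def decompress(self, codes: np.ndarray) -> np.ndarray:
        raise NotImplementedError

class Int8Scalar(Method):
    """Hypothetical symmetric int8 scalar quantizer (1 byte per dimension)."""
    name = "int8-scalar"

    def compress(self, vectors):
        # One global scale so the largest magnitude maps to 127.
        self.scale = float(np.abs(vectors).max()) / 127.0 or 1.0
        return np.clip(np.round(vectors / self.scale), -127, 127).astype(np.int8)

    def decompress(self, codes):
        return codes.astype(np.float32) * self.scale

# Hypothetical registry; the actual registration mechanism is not confirmed.
METHODS = {Int8Scalar.name: Int8Scalar}

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 32))
method = METHODS["int8-scalar"]()
codes = method.compress(x)
recon = method.decompress(codes)
max_error = float(np.abs(recon - x).max())   # bounded by about half a quantization step
```

A protocol test for a new method could assert exactly this kind of bound on reconstruction error.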

Add a new embedder model

  1. Create an Embedder subclass in src/bitbudget/embedders.py with an embed() method returning vectors. (src/bitbudget/embedders.py)
  2. Register it in the embedder registry so CLI can discover it. (src/bitbudget/embedders.py)
  3. Run benchmark and save results to board_cards/card_<model-name>.json. (board_cards)
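
As an illustration of the embedder steps above, here is a toy embedder with an embed() method that returns a float32 matrix. Everything in it is hypothetical: the class name, the constructor, and the hashing scheme are stand-ins for illustration, and the real Embedder base class and registry in embedders.py may expect a different shape of object.

```python
import hashlib
import numpy as np

class HashingEmbedder:
    """Hypothetical toy embedder: hashes tokens into a fixed-size signed bag of words."""
    name = "toy-hash"

    def __init__(self, dim: int = 64):
        self.dim = dim

    def embed(self, texts):
        out = np.zeros((len(texts), self.dim), dtype=np.float32)
        for row, text in enumerate(texts):
            for token in text.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "little") % self.dim
                sign = 1.0 if digest[4] % 2 == 0 else -1.0
                out[row, bucket] += sign
        # L2-normalize rows; empty texts stay as all-zero vectors.
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        return out / np.maximum(norms, 1e-12)

embedder = HashingEmbedder(dim=64)
vecs = embedder.embed(["binary codes are small", "binary codes are small", ""])
```

A deterministic embedder like this is mainly useful for tests, since it needs no model download and always produces the same vectors.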

Run a new benchmark experiment

  1. Invoke bitbudget run (src/bitbudget/cli.py) with --embedder, --corpus, and --methods flags. (src/bitbudget/cli.py)
  2. CLI orchestrates embedding → compression → indexing → evaluation via eval.py. (src/bitbudget/eval.py)
  3. Results are saved and can be added to leaderboard via cards_to_data.py. (cards_to_data.py)

🔧Why these technologies

  • Python + NumPy — Rapid prototyping and data manipulation for embedding vectors and experiments.
  • FAISS — Efficient approximate nearest-neighbor search at scale for retrieval evaluation.
  • C extension (_bittrie.c) — Optional low-level optimization for bit-trie indexing when binarized retrieval is critical.
  • BEIR benchmark — Standardized, reproducible evaluation corpora covering diverse retrieval scenarios.

⚖️Trade-offs already made

  • In-memory indexing (no persistent storage layer)

    • Why: Simpler implementation and faster iteration for research; benchmark is reproducible from code alone.
    • Consequence: Benchmark runs must re-embed and re-index on each invocation; not suitable for long-lived deployments.
  • Support multiple embedder APIs (OpenAI, BGE, E5) as separate classes

    • Why: Each model has different rate limits, authentication, and tensor formats.
    • Consequence: Added code complexity; no unified interface, but full control over each model's quirks.
  • Evaluate on BEIR corpora only

    • Why: Single standardized benchmark ensures reproducibility and fair comparison.
    • Consequence: Results may not generalize to specialized domains (medical, legal, code retrieval).
  • Report bytes-per-vector, not compression ratio

    • Why: Aligns with real deployment constraints (memory budgets, network bandwidth).
    • Consequence: Requires explicit byte-size accounting for each method; less intuitive than 'compression ratio %'.
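
The byte accounting behind the last trade-off is simple arithmetic. The sketch below works it through for a 1024-dimensional embedding; the helper names are illustrative, and the PQ figure counts only the per-vector codes, not the shared codebooks.

```python
import math

def bytes_per_vector(dim: int, bits_per_dim: float) -> float:
    """Storage for one vector when every dimension uses the same bit width."""
    return dim * bits_per_dim / 8

def pq_bytes_per_vector(num_subspaces: int, codebook_size: int = 256) -> float:
    """Product quantization stores one codebook index per subspace."""
    bits_per_index = math.ceil(math.log2(codebook_size))
    return num_subspaces * bits_per_index / 8

dim = 1024
budget = {
    "float32": bytes_per_vector(dim, 32),   # 4096 bytes
    "int8": bytes_per_vector(dim, 8),       # 1024 bytes
    "binary": bytes_per_vector(dim, 1),     # 128 bytes
    "pq-64x8": pq_bytes_per_vector(64),     # 64 bytes
}
ratio = budget["float32"] / budget["binary"]  # 32x smaller
```

Reporting the absolute byte count makes it possible to compare methods across embedders with different dimensions, which a bare ratio would hide.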

🚫Non-goals (don't propose these)

  • Does not perform online indexing or real-time vector updates.
  • Does not provide distributed/multi-GPU benchmark execution.
  • Does not handle authentication or multi-tenant isolation.
  • Does not support custom datasets outside BEIR.
  • Does not optimize for latency; focuses on retrieval quality vs. storage trade-off.

📊Code metrics

  • Avg cyclomatic complexity: ~6 — Moderate cyclomatic complexity: orchestration logic in cli.py and eval.py, but individual methods are straightforward quantization routines. Multiple branching paths for embedder APIs and dataset loading.
  • Largest file: src/bitbudget/methods.py (450 lines)
  • Estimated quality issues: ~3 — Missing input validation on vector dimensions, sparse docstrings in methods.py, and inconsistent error handling between embedder subclasses. No type hints in several key functions.

⚠️Anti-patterns to avoid

  • Tight coupling between Method and Index (Medium) · src/bitbudget/methods.py, src/bitbudget/indexes.py: Methods return compressed vectors without metadata; Index must infer storage format and bit-width, risking silent mismatches.
  • No validation of embedder output shape (Medium) · src/bitbudget/embedders.py, src/bitbudget/eval.py: If embedder returns wrong dimensions, error only surfaces deep in FAISS index construction, making debugging hard.
  • Global metric computation without per-query logging (Low) · src/bitbudget/eval.py: Aggregated metrics hide outlier queries that perform poorly under specific compression methods.
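
One way to address the missing shape validation is a small guard that runs right after embedding, before any index is built. This is a sketch only: check_embeddings is a hypothetical helper, not an existing function in the repo.

```python
from typing import Optional

import numpy as np

def check_embeddings(vectors, expected_dim: int, n_expected: Optional[int] = None) -> np.ndarray:
    """Fail fast with a readable error instead of deep inside index construction."""
    arr = np.asarray(vectors)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D array [n, d], got shape {arr.shape}")
    if arr.shape[1] != expected_dim:
        raise ValueError(f"expected dimension {expected_dim}, got {arr.shape[1]}")
    if n_expected is not None and arr.shape[0] != n_expected:
        raise ValueError(f"expected {n_expected} vectors, got {arr.shape[0]}")
    if not np.isfinite(arr).all():
        raise ValueError("embeddings contain NaN or infinite values")
    # FAISS-style indexes generally want contiguous float32 input.
    return np.ascontiguousarray(arr, dtype=np.float32)

ok = check_embeddings([[0.1, 0.2], [0.3, 0.4]], expected_dim=2, n_expected=2)
```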

🔥Performance hotspots

  • src/bitbudget/embedders.py (embedder initialization) (I/O-bound, initialization latency) — Model weight download and GPU allocation can take 30–60 seconds per embedder before any benchmark runs.
  • src/bitbudget/eval.py (query-by-query evaluation loop) (CPU-bound, algorithmic) — Nested loop over queries × methods × FAISS search calls; no vectorization or parallelization across methods.
  • src/bitbudget/indexes.py (FAISS index construction) (Memory and CPU-bound) — Building FAISS indices for large BEIR corpora (>100k docs) can dominate total runtime, especially for exact search.

🪤Traps & gotchas

  1. Optional dependencies matter: bitbudget run works with just NumPy, but --embedder mxbai silently requires sentence-transformers (torch + transformers) to be installed, and bench-index with HNSW/IVF-PQ requires faiss. Errors only surface at runtime.
  2. C extension compilation: _bittrie.c must compile on your platform; if pip install fails with C compiler errors, you lose the fast bittrie query path.
  3. BEIR corpus auto-download: datasets.py downloads multi-GB corpora on first run to ~/.cache/ or $XDG_CACHE_HOME; there is no disk check or resume logic.
  4. Float precision in metrics: NumPy broadcasting in metrics.py can silently lose precision on large matrices; nDCG@10 is hardcoded, with no CLI override.
  5. Reproducibility risk: embeddings vary by sentence-transformers/transformers version; LEADERBOARD.md results may not match if dependencies drift.
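
Because missing optional packages only show up at runtime, a pre-flight check before a long run can save time. The sketch below uses the standard library's importlib.util.find_spec; the module names sentence_transformers and faiss are the usual import names for those packages, while the OPTIONAL mapping and missing_optional helper are illustrative and not part of BitBudget.

```python
import importlib.util

# Import name -> what it is needed for, per the gotchas above.
OPTIONAL = {
    "sentence_transformers": "local embedders such as --embedder mxbai",
    "faiss": "HNSW / IVF-PQ indexes in bench-index",
}

def missing_optional(modules):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

for name in missing_optional(OPTIONAL):
    print(f"optional dependency '{name}' not installed (needed for {OPTIONAL[name]})")
```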

💡Concepts to learn

  • Product Quantization (PQ) — One of the core compression methods in BitBudget (methods.py); decomposes high-d vectors into subspaces and quantizes each independently, balancing compression ratio and recall loss
  • Binary Embeddings / Hamming Distance — BitBudget's headline finding is that 1-bit codes (binarization) with re-ranking beat full-precision embeddings per byte; requires understanding bit-level distance metrics and re-ranking trade-offs
  • nDCG@k (Normalized Discounted Cumulative Gain) — The primary ranking metric BitBudget uses to measure retrieval quality; essential for interpreting the leaderboard and understanding recall@byte trade-offs
  • Vector Index Structures (HNSW, IVF-PQ, KD-Trees) — BitBudget's bench-index command evaluates compression methods on different index types; understanding index recall, QPS, and bytes-per-vector overhead is core to the 'organisation axis' benchmark
  • Approximate Nearest Neighbor (ANN) Search — BitBudget evaluates compression in the context of ANN retrieval, not exact search; recall@k and QPS metrics depend on index implementation and are different from offline compression-only benchmarks
  • Matryoshka Embeddings — A learned projection method that truncates embeddings to lower dimensions; BitBudget includes this as a baseline compression strategy and shows it underperforms quantization at fixed byte budgets
  • Re-ranking (Two-Stage Retrieval) — BitBudget's binary+rerank result uses a two-stage pipeline (cheap retrieval on 1-bit codes, then re-rank with full embeddings); critical for understanding how quantized indexes can be lossless in practice

Related repos

  • facebookresearch/faiss — Provides the approximate nearest neighbor indexes (HNSW, IVF-PQ) that BitBudget uses for the 'organisation axis' benchmarking (bench-index command)
  • UKPLab/sentence-transformers — Embedding model library used by BitBudget's embedders.py to generate the vectors being compressed; essential dependency for reproducible benchmarks
  • beir-cellar/beir — The Information Retrieval benchmark suite; BitBudget wraps BEIR corpora (scifact, nfcorpus, etc.) for evaluation and computes nDCG@10 using BEIR's metrics
  • microsoft/unilm — Includes Matryoshka Embedding research; BitBudget includes Matryoshka truncation as one compression baseline and compares it against other methods
  • ingestion-edge/rabitq — The RaBitQ quantization method is one of the key compression baselines evaluated in BitBudget's methods.py and featured in LEADERBOARD.md results
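
The two-stage re-ranking idea listed under Concepts to learn can be sketched in a few lines of NumPy. This is an illustrative brute-force version on random data, assuming sign codes for the first stage and dot-product scoring for the second; it is not BitBudget's implementation, and the two_stage_search name and its parameters are made up for the example.

```python
import numpy as np

def two_stage_search(query, docs, codes, shortlist=100, k=10):
    """Shortlist by Hamming distance on 1-bit codes, then re-rank with float vectors."""
    q_code = np.packbits(query > 0)
    hamming = np.unpackbits(np.bitwise_xor(codes, q_code), axis=1).sum(axis=1)
    shortlist = min(shortlist, len(docs))
    # Stage 1: cheap candidate selection on the compressed codes.
    candidates = np.argpartition(hamming, shortlist - 1)[:shortlist]
    # Stage 2: exact scoring of the shortlist with full-precision vectors.
    scores = docs[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

rng = np.random.default_rng(0)
docs = rng.standard_normal((2000, 128)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
codes = np.packbits(docs > 0, axis=1)        # 16 bytes per vector instead of 512

query = docs[42]                             # a query identical to a stored document
top = two_stage_search(query, docs, codes)
```

The shortlist size is the knob that trades recall against re-ranking cost: a larger shortlist recovers more of the exact top-k but touches more full-precision vectors.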

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for compression method evaluation pipeline in tests/

The repo has src/bitbudget/methods.py, src/bitbudget/eval.py, and src/bitbudget/metrics.py for core compression evaluation, but tests/test_protocol.py and tests/test_indexes.py don't cover the full evaluation workflow. Adding end-to-end tests would verify that compression methods (quantization, projection, hashing) correctly report recall-per-byte trade-offs and prevent regressions in the benchmark's reproducibility claims.

  • [ ] Create tests/test_methods.py to test each compression method in src/bitbudget/methods.py with a small synthetic embedding dataset
  • [ ] Create tests/test_eval.py to verify the evaluation pipeline in src/bitbudget/eval.py produces consistent nDCG@10 and recall@10 metrics
  • [ ] Add a small fixture dataset in tests/ (e.g., 100 test embeddings) to make tests fast and reproducible
  • [ ] Verify tests run in CI via .github/workflows/ci.yml

Document the board_cards schema and add validation for leaderboard submissions

The repo has board_cards/ with JSON card files (e.g., card_bge-base.json) and a LEADERBOARD.md, but there's no schema documentation or validation script. New contributors cannot easily add their own compression method results. Adding a JSON schema and validator would clarify what fields are required and prevent malformed submissions.

  • [ ] Create a board_cards/schema.json (JSON Schema) documenting the required/optional fields for each card (embedder name, compression method, metrics, bytes per vector, etc.)
  • [ ] Add a validation script (e.g., src/bitbudget/validate_card.py) that validates submitted cards against the schema
  • [ ] Update CONTRIBUTING.md with an example of submitting a new board card and how to validate it before opening a PR
  • [ ] Add a pre-commit hook or CI check in .github/workflows/ci.yml to validate all JSON files in board_cards/
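
A validator along the lines of this checklist could start as small as the sketch below. The field names (embedder, method, bytes_per_vector, ndcg_at_10) are assumptions made for illustration; the real card schema in board_cards/ is not documented here and should be read from an existing card before writing any validation.

```python
# Hypothetical required fields and their accepted types.
REQUIRED_FIELDS = {
    "embedder": str,
    "method": str,
    "bytes_per_vector": (int, float),
    "ndcg_at_10": (int, float),
}

def validate_card(card: dict) -> list:
    """Return a list of human-readable problems; an empty list means the card passes."""
    errors = []
    for field, types in REQUIRED_FIELDS.items():
        if field not in card:
            errors.append(f"missing field: {field}")
        elif isinstance(card[field], bool) or not isinstance(card[field], types):
            errors.append(f"wrong type for field: {field}")
    score = card.get("ndcg_at_10")
    if isinstance(score, (int, float)) and not isinstance(score, bool):
        if not 0.0 <= score <= 1.0:
            errors.append("ndcg_at_10 must be between 0 and 1")
    return errors

good = {"embedder": "bge-base", "method": "binary", "bytes_per_vector": 96, "ndcg_at_10": 0.5}
bad = {"embedder": "bge-base", "bytes_per_vector": "96", "ndcg_at_10": 1.7}
problems = validate_card(bad)
```

In CI, the same function could be run over every parsed file in board_cards/ and fail the job when any list of problems is non-empty.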

Add benchmarking performance tests for the C extension _bittrie in src/bitbudget/

The repo includes src/bitbudget/_bittrie.c (a C extension for bit-trie indexing) but has no dedicated performance or correctness tests. This is critical for a benchmark repo: contributors need to verify that compression method runtimes and index query performance are stable. Adding perf tests ensures the benchmark itself doesn't become a bottleneck.

  • [ ] Create tests/test_bittrie.py with unit tests for _bittrie correctness (insert, query, edge cases) by wrapping src/bitbudget/bittrie.py
  • [ ] Add a benchmarking test in tests/ or a new benchmark script that times index build/query operations for various embedding sizes and bit-widths
  • [ ] Document expected performance (e.g., latency per query) in CONTRIBUTING.md or a PERFORMANCE.md file
  • [ ] Add optional benchmark CI job in .github/workflows/ci.yml (marked as non-blocking) to track performance regressions

🌿Good first issues

  • Add benchmarking for smaller embedders (e.g., MiniLM, ONNX-quantized variants) by extending src/bitbudget/embedders.py with lazy-load options and profiling memory/latency overhead per embedder, then add a new bench-embedder command to cli.py.
  • Write integration tests for the evaluation pipeline (src/bitbudget/eval.py) that mock BEIR corpus downloads and verify nDCG/recall output stability across Python versions; currently tests/test_protocol.py only validates compression interface, not end-to-end retrieval correctness.
  • Implement a new compression method (e.g., scalar quantization with learned thresholds, or learned rotations like ITQ) in src/bitbudget/methods.py following the existing protocol, add CLI registration in cli.py, benchmark it on 2-3 corpora, and submit a board_cards/ JSON card to the leaderboard.

📝Recent commits

  • 6a9e431 — Revert "Regenerate LEADERBOARD.md from site/data.json (live-site protocol)" (sjmoran)
  • 5f75f93 — Revert "README: move the headline block to the 3-corpora numbers" (sjmoran)
  • 51e6649 — README: move the headline block to the 3-corpora numbers (sjmoran)
  • 1cfd137 — Regenerate LEADERBOARD.md from site/data.json (live-site protocol) (sjmoran)
  • 9a86b20 — Link the published paper and sync RaBitQ rows with it (sjmoran)
  • db10477 — site: deep-link to an embedder tab via ?embedder=<name> (sjmoran)
  • ab1a960 — v0.3.1: add binary-mean (mean-threshold sign code) -- the proper binary baseline (sjmoran)
  • 3133bb4 — v0.3.0: broaden the compression board to popular embedders + methods (sjmoran)
  • e965401 — site: add #supervised anchor and a deep link to the supervised board (sjmoran)
  • b32dacf — site: add a Supervised hashing board (CIFAR-10) with the deep DPSH baseline (sjmoran)

🔒Security observations

The BitBudget codebase shows a reasonable security posture for a research/benchmark project. Primary concerns are: (1) incomplete dependency information preventing full supply-chain security assessment; (2) presence of C extension code requiring additional memory-safety scrutiny; (3) CI/CD workflows that require review for secrets exposure and action pinning. The project lacks traditional injection risks (no database, no user input processing) and shows no obvious hardcoded credentials. Recommendations focus on dependency management, C code safety practices, and CI/CD hygiene.

  • Medium · Missing pyproject.toml dependency specification — pyproject.toml. The pyproject.toml file is referenced but content is not provided. Unable to verify that dependencies are properly pinned to specific versions or that no vulnerable dependencies are included. This is a common source of supply-chain security issues. Fix: Ensure all dependencies in pyproject.toml are pinned to specific versions (e.g., 'package==1.2.3' not 'package>=1.2.3'). Regularly run 'pip audit' or use tools like 'safety' to check for known vulnerabilities in dependencies.
  • Low · C extension module present without visible security review — src/bitbudget/_bittrie.c. The codebase includes a compiled C extension (_bittrie.c) which is a higher-risk component due to potential memory safety issues (buffer overflows, use-after-free, etc.) that are more common in C code compared to Python. Fix: Ensure the C extension is regularly reviewed for memory safety issues. Consider using static analysis tools like cppcheck or clang-analyzer. Document the build process and include security considerations in code review guidelines.
  • Low · CI/CD workflows require security review — .github/workflows/ci.yml, .github/workflows/pages.yml, .github/workflows/publish.yml. GitHub Actions workflows are present (.github/workflows/) but content is not provided. Workflows can introduce security risks if they use untrusted actions, expose secrets, or perform insecure operations during build/publish processes. Fix: Review all workflow files to ensure: (1) Actions are pinned to specific commit hashes, not version tags; (2) Secrets are not logged or exposed; (3) Third-party actions come from trusted sources; (4) Publish workflow has proper authentication and authorization controls.
  • Low · Static site content may contain data injection risks — site/data.json, site/index.html, cards_to_data.py. The site/data.json and site/index.html files generate a static website with benchmark data. If data.json is generated from untrusted sources or user input without proper sanitization, it could lead to XSS vulnerabilities. Fix: Ensure cards_to_data.py properly escapes/sanitizes all data before inserting into JSON. Validate that all board_cards/*.json files come from trusted sources. Implement Content Security Policy (CSP) headers if serving the site over HTTP.

LLM-derived; treat as a starting point, not a security audit.

📚Suggested reading order

Computed from the actual import graph (no LLM). Read in this order to learn the codebase from the foundation up — each step builds on the previous ones.

  1. src/bitbudget/methods.py — Foundation: doesn't import anything internally and is imported by 3 other files. Read first to learn the vocabulary.
  2. src/bitbudget/embedders.py — Foundation: imported by 2, no internal dependencies of its own.
  3. src/bitbudget/eval.py — Built on the foundation; imported by 2 downstream files.
  4. src/bitbudget/bittrie.py — Built on the foundation; imported by 2 downstream files.
  5. src/bitbudget/indexes.py — Layer 2 — application-level code that wires the lower layers together.
  6. src/bitbudget/__init__.py — Layer 3 — application-level code that wires the lower layers together.
  7. src/bitbudget/cli.py — Layer 4 — application-level code that wires the lower layers together.

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
