spotify/annoy

Item: spotify/annoy
Rating: 5
Author: RepoPilot

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Healthy

Healthy across all four use cases

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

✓Last commit 6mo ago
✓21+ active contributors
✓Distributed ownership (top contributor 39% of recent commits)

✓Apache-2.0 licensed
✓CI configured
✓Tests present
⚠Slowing — last commit 6mo ago

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/spotify/annoy)](https://repopilot.app/r/spotify/annoy)

Paste at the top of your README.md — renders inline like a shields.io badge.

This card auto-renders when someone shares https://repopilot.app/r/spotify/annoy on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: spotify/annoy

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/spotify/annoy shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across all four use cases

Last commit 6mo ago
21+ active contributors
Distributed ownership (top contributor 39% of recent commits)
Apache-2.0 licensed
CI configured
Tests present
⚠ Slowing — last commit 6mo ago

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live spotify/annoy repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/spotify/annoy.

What it runs against: a local clone of spotify/annoy — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in spotify/annoy | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | Last commit ≤ 222 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>spotify/annoy</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of spotify/annoy. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/spotify/annoy.git
#   cd annoy
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of spotify/annoy and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "spotify/annoy(\.git)?\b" \
  && ok "origin remote is spotify/annoy" \
  || miss "origin remote is not spotify/annoy (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
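# The standard Apache license text opens with an indented "Apache License"
# line rather than the SPDX id, so accept either form.
(grep -qiE "^(Apache-2\.0)" LICENSE 2>/dev/null \
   || grep -qiE "Apache License" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift: was Apache-2.0 at generation time"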

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 222 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~192d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/spotify/annoy"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>
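
For agent loops that prefer to inspect the result rather than rely only on the exit code, the script's `ok:` and `FAIL:` lines can be parsed directly. The following is a minimal sketch; the function name `summarize_verify_output` and the sample text are assumptions made for illustration and are not part of RepoPilot.

```python
from typing import List, Tuple


def summarize_verify_output(text: str) -> Tuple[int, List[str]]:
    """Return (number of passing checks, list of failure messages)."""
    passed = 0
    failures: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("ok:"):
            passed += 1
        elif line.startswith("FAIL:"):
            # Keep only the human-readable message after the prefix.
            failures.append(line[len("FAIL:"):].strip())
    return passed, failures


# Illustrative output in the format the verification script prints.
sample = """ok:   origin remote is spotify/annoy
ok:   license is Apache-2.0
FAIL: default branch main no longer exists
"""
passed, failures = summarize_verify_output(sample)
print(passed, failures)
```

An agent could treat a non-empty failure list as the signal to stop and ask for regeneration, which mirrors the protocol described earlier.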

⚡TL;DR

Annoy is a C++ library with Python, Go, and Lua bindings that builds approximate nearest neighbor search indexes optimized for memory efficiency and disk persistence. It enables fast similarity searches in high-dimensional spaces (tested up to 1,000 dimensions) and supports five distance metrics (Euclidean, Manhattan, cosine, Hamming, and dot product). Its distinguishing capability is memory-mapping static index files, so several processes can share one copy of the data without each loading it.

Architecture: a monolithic core with multi-language bindings. src/annoylib.h contains the core C++ template library. Language-specific modules wrap it: src/annoymodule.cc (Python), src/annoygomodule.i (Go), and src/annoyluamodule.cc (Lua). Of these, only the Go module is a SWIG interface file; the Python and Lua modules are C++ sources written against each language's C API. The Python entry point is annoy/__init__.py, with type stubs in annoy/__init__.pyi. Tests are organized by metric type (test/angular_index_test.py, test/euclidean_index_test.py, etc.) and by feature (test/on_disk_build_test.py for mmap behavior).
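
To make the memory-mapping idea concrete, here is a small sketch using only Python's standard library. It does not reproduce Annoy's file format: the record layout, `DIM`, and the helper names `write_vectors` and `read_vector` are assumptions chosen for illustration. It shows only the general pattern of writing fixed-size records once and reading them through a read-only map.

```python
import mmap
import os
import struct
import tempfile

DIM = 4  # floats per vector (illustrative value)
RECORD = struct.Struct(f"<{DIM}f")  # one little-endian float32 record


def write_vectors(path, vectors):
    """Write fixed-size float32 records back to back."""
    with open(path, "wb") as fh:
        for vec in vectors:
            fh.write(RECORD.pack(*vec))


def read_vector(path, item):
    """Read one record through a read-only memory map.

    The OS pages data in on demand, and processes mapping the same file
    can share those pages instead of holding private copies.
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return RECORD.unpack_from(view, item * RECORD.size)


path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(path, [(0.0, 1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0)])
print(read_vector(path, 1))
```

Because records have a fixed size, looking up item `i` is a single offset computation, which is one reason a static, read-only layout suits memory mapping.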

👥Who it's for

Machine learning engineers and data scientists at scale (Spotify uses it for music recommendations) who need to find similar items in millions of high-dimensional vectors and want to distribute read-only indexes across multiple processes or production servers without rebuilding or duplicating data.

🌱Maturity & risk

Production-ready, with maintenance slowing (the last commit was about six months before this analysis). The project has CI/CD via GitHub Actions (.github/workflows/ci.yml and publish.yml) and tests covering the five distance metrics as well as the Go, Lua, and C++ interfaces. It was created by Erik Bernhardsson at Spotify and still receives incremental updates; the codebase is stable.

Low risk for core functionality, with some integration concerns. The language bindings (2,975 lines) add maintenance surface area, and the codebase relies heavily on mmap behavior, which can vary subtly across operating systems (note mman.h for Windows support). Some key-person risk remains, since the top contributor accounts for 39% of recent commits, though the simplicity of the core C++ algorithm (annoylib.h) mitigates this. No external dependencies are visible, which reduces supply chain risk.

Active areas of work

Maintenance mode, with recent work concentrated on packaging and CI rather than the core algorithm: the latest commits add Python 3.13 support, bump cibuildwheel, add aarch64 Linux wheels, and adjust .github/workflows/publish.yml, which automates PyPI releases. No breaking changes are evident from the file list. Dedicated tests for threading (test/multithreaded_build_test.py) and memory leaks (test/memory_leak_test.py) suggest earlier hardening work.

🚀Get running

git clone https://github.com/spotify/annoy.git
cd annoy
pip install -e .
python examples/simple_test.py

Daily commands: For Python, run python examples/simple_test.py as a smoke test or python -m pytest test/ for the full test suite. For C++, include src/annoylib.h in your code and compile with C++11 or later; see examples/s_compile_cpp.sh for a standalone compilation example. To build from source, pip install -e . compiles the C++ extension through setup.py.
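
Once the package is installed, a first session in Python typically looks like the sketch below, which follows the usage pattern in the project README (`AnnoyIndex`, `add_item`, `build`, `save`, `load`, `get_nns_by_item`). The wrapper function `build_and_query`, its default values, and the temporary file location are assumptions made for illustration.

```python
import os
import random
import tempfile


def build_and_query(dim=40, n_items=1000, n_trees=10):
    """Build a small angular index, save it, reload it, and query it."""
    # Imported lazily so this file can be read where annoy is not installed.
    from annoy import AnnoyIndex

    rng = random.Random(0)  # fixed seed for the illustrative data only
    index = AnnoyIndex(dim, "angular")
    for i in range(n_items):
        index.add_item(i, [rng.gauss(0, 1) for _ in range(dim)])
    index.build(n_trees)  # more trees: better recall, larger index

    path = os.path.join(tempfile.mkdtemp(), "test.ann")
    index.save(path)

    loaded = AnnoyIndex(dim, "angular")  # dimension and metric must match
    loaded.load(path)  # the saved file is memory-mapped
    return loaded.get_nns_by_item(0, 10)  # ids of the 10 nearest items


if __name__ == "__main__":
    try:
        print(build_and_query())
    except ImportError:
        print("annoy is not installed; run `pip install annoy` first")
```

The number of trees is fixed at build time, so experimenting with it means rebuilding the index.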

🗺️Map of the codebase

src/annoylib.h: The entire core algorithm: random projection tree construction, index building, and all five distance metric implementations as C++ templates.
src/annoymodule.cc: Python extension module (C++ written against the CPython C API) that exposes annoylib.h as the annoy.AnnoyIndex class users interact with via pip install.
annoy/__init__.py: Python package entry point that imports the compiled C extension and provides the public API (AnnoyIndex class, load/save/query methods).
test/index_test.py: Core integration test covering build, save, load, query, and mmap functionality—failure here blocks all releases.
examples/mmap_test.py: Demonstrates the flagship feature (memory-mapped file sharing across processes) with concrete code.
CMakeLists.txt: Build configuration for compiling the C++ core and Python extension, ensuring portability across Windows/Mac/Linux.
.github/workflows/ci.yml: CI pipeline that runs all tests on multiple Python versions and OS combinations before merge.

🛠️How to make changes

To add a new distance metric, modify src/annoylib.h, where the metrics are supplied to the index as a Distance template parameter. To fix Python-specific bugs, edit src/annoymodule.cc and test with test/index_test.py. To add a SWIG-based language binding, create a new .i interface file (model: src/annoygomodule.i) and a corresponding test file. Core algorithm changes touch src/annoylib.h only. Performance regressions are caught by test/accuracy_test.py and test/precision_test.cpp.
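
Before touching the C++ metric code, it can help to have reference definitions to compare against. The sketch below gives textbook Python versions of the five metrics named in this document. The angular form follows the README's description (Euclidean distance of normalized vectors, equal to sqrt(2(1 - cos(u, v)))); the other functions are standard definitions, and the implementations in src/annoylib.h may differ in normalization or sign conventions. All function names here are illustrative.

```python
import math


def dot(u, v):
    """Inner product; larger values mean more similar vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def angular(u, v):
    """sqrt(2 * (1 - cos(u, v))), assuming neither vector is all zeros."""
    cos = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))  # clamp rounding error


def hamming(u, v):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


print(euclidean([0, 0], [3, 4]), manhattan([0, 0], [3, 4]))
print(angular([1, 0], [0, 1]), hamming([1, 0, 1], [1, 1, 0]), dot([1, 2], [3, 4]))
```

A reference like this can be used in a test to cross-check distances returned by a compiled index on a handful of small vectors.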

🪤Traps & gotchas

1. Index files are not version-safe: an index built by annoy v1.16 may not load in v1.15; always rebuild indexes when upgrading.
2. mmap behavior differs across OSes: Linux allows mmap of sparse files, Windows requires contiguous allocation; test on the target platform.
3. Building from source requires a C++11 compiler; pip install will fail silently on very old toolchains.
4. The random seed (test/seed_test.py exists for this reason) affects tree construction; set a seed for reproducible results across runs.
5. Memory usage is proportional to (num_trees × dimensionality × item_count) due to tree replication; there is no tuning knob to reduce this beyond lowering num_trees.
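
One way to guard against the first trap (an index file loaded by a library version or with settings other than the ones it was built with) is to store a small sidecar manifest next to the index and check it before loading. This is not an Annoy feature; the `.meta.json` convention, the helper names, and the version strings below are all assumptions made for illustration.

```python
import json
import os
import tempfile


def manifest_path(index_path):
    """Hypothetical convention: the manifest sits next to the index file."""
    return index_path + ".meta.json"


def write_manifest(index_path, dim, metric, library_version):
    """Record the settings the index was built with."""
    meta = {"dim": dim, "metric": metric, "library_version": library_version}
    with open(manifest_path(index_path), "w", encoding="utf-8") as fh:
        json.dump(meta, fh)


def check_manifest(index_path, dim, metric, library_version):
    """Return a list of mismatch messages; an empty list means compatible."""
    try:
        with open(manifest_path(index_path), encoding="utf-8") as fh:
            meta = json.load(fh)
    except FileNotFoundError:
        return ["manifest missing; rebuild the index"]
    expected = {"dim": dim, "metric": metric, "library_version": library_version}
    return [
        f"{key}: index has {meta.get(key)!r}, caller expects {value!r}"
        for key, value in expected.items()
        if meta.get(key) != value
    ]


index_path = os.path.join(tempfile.mkdtemp(), "items.ann")
write_manifest(index_path, dim=40, metric="angular", library_version="1.17.3")
print(check_manifest(index_path, 40, "angular", "1.17.3"))
print(check_manifest(index_path, 40, "angular", "1.16.0"))
```

A deployment script could refuse to load the index, and trigger a rebuild, whenever the returned list is non-empty.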

💡Concepts to learn

Random Projection Trees — Core algorithm in annoylib.h—understanding how trees partition space via random hyperplanes is essential to tuning num_trees for accuracy vs. memory trade-offs.
Memory-Mapped I/O (mmap) — The flagship feature that makes Annoy unique: indexes are stored as binary files and mapped directly into virtual memory, allowing zero-copy sharing across processes without deserialization overhead.
Approximate Nearest Neighbor Search — Annoy trades exactness for speed and memory—understanding the approximation guarantees and how num_trees / search_k parameters affect recall is critical for production use.
SWIG (Simplified Wrapper and Interface Generator) — Used to generate the Go binding from src/annoygomodule.i; the Python (src/annoymodule.cc) and Lua (src/annoyluamodule.cc) modules are C++ sources rather than SWIG interface files. Understanding SWIG's .i file syntax is necessary to add another SWIG-based binding.
Distance Metrics (Euclidean, Cosine, Manhattan, Hamming, Dot Product) — Annoy's template specialization pattern in annoylib.h supports five distance functions; choosing the right metric for your embedding space directly impacts both accuracy and performance.
Binary Index Serialization — Indexes are stored as compact binary files (load/save methods in annoy/__init__.py); understanding the format is essential for debugging, cross-platform compatibility, and implementing language-specific loaders.
Thread-Safe Index Building — Covered by test/multithreaded_build_test.py and test/threading_test.py. The tree-building phase is parallelizable, and querying a built index is read-only and thread-safe, which shapes deployment patterns.
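
The random projection tree idea from the list above can be sketched in a few lines of pure Python. This is a simplified teaching version, not Annoy's implementation: it splits each node with the hyperplane equidistant from two sampled points, and it queries by visiting one leaf per tree instead of using a priority queue bounded by search_k. All names are illustrative.

```python
import random


def _side(normal, offset, x):
    """Signed position of x relative to the hyperplane normal . x = offset."""
    return sum(n * xi for n, xi in zip(normal, x)) - offset


def build_tree(points, ids, leaf_size, rng):
    """Recursively split ids with hyperplanes between two sampled points."""
    if len(ids) <= leaf_size:
        return {"leaf": list(ids)}
    a, b = (points[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    midpoint = [(x + y) / 2 for x, y in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    left = [i for i in ids if _side(normal, offset, points[i]) > 0]
    right = [i for i in ids if _side(normal, offset, points[i]) <= 0]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": list(ids)}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points, left, leaf_size, rng),
            "right": build_tree(points, right, leaf_size, rng)}


def find_leaf(tree, query):
    """Follow the query down one tree to the leaf it falls into."""
    while "leaf" not in tree:
        go_left = _side(tree["normal"], tree["offset"], query) > 0
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]


def query_forest(trees, points, query, k):
    """Union the candidate leaves from every tree, then rank exactly."""
    candidates = set()
    for tree in trees:
        candidates.update(find_leaf(tree, query))
    def sq_dist(i):
        return sum((p - q) ** 2 for p, q in zip(points[i], query))
    return sorted(candidates, key=sq_dist)[:k]


rng = random.Random(0)
points = [[rng.random() for _ in range(5)] for _ in range(200)]
trees = [build_tree(points, list(range(200)), 10, rng) for _ in range(8)]
print(query_forest(trees, points, points[7], 5))
```

Adding trees enlarges the candidate set for each query, which is the intuition behind the accuracy versus memory trade-off of num_trees.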

🔗Related repos

nmslib/hnswlib — Direct competitor offering the HNSW algorithm for approximate nearest neighbors, with similar multi-language support and lower query latency but higher memory usage than Annoy.
facebookresearch/faiss — Facebook's nearest neighbor search library optimized for GPU acceleration and billion-scale indexes; a candidate when query latency or scale matters more than Annoy's small memory footprint.
lmdb/lmdb — Companion library for key-value storage with mmap semantics similar to Annoy's disk persistence; sometimes used together for storing metadata alongside Annoy indexes.
chrislit/annoy-python — Minimal fork/example project demonstrating how to wrap annoylib.h for custom Python extensions, useful for understanding the binding layer.

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive type hints and update annoy/__init__.pyi with full API coverage

The repo has an annoy/__init__.pyi stub file, but it is likely incomplete given the library's rich API (multiple distance metrics, build modes, mmap support). Type hints are critical for Python users and enable better IDE support. This is evident from the test files showing various index types (angular, euclidean, manhattan, hamming, dot) and parameters like on_disk_build that need proper type annotations.

[ ] Review annoy/__init__.pyi and compare against the actual Python API in annoy/__init__.py
[ ] Add complete type hints for all AnnoyIndex methods including __init__, add_item, build, save, load, get_nns_by_item, get_nns_by_vector, etc.
[ ] Add type hints for all distance metric options (angular, euclidean, manhattan, hamming, dot)
[ ] Add type hints for parameters like on_disk_build, search_k, include_distances
[ ] Run mypy against test/ directory to validate completeness
[ ] Update any docstrings in annoy/__init__.py to match type signatures

Add integration tests for Go bindings (test/annoy_test.go) to CI pipeline

The repo has Go bindings (src/annoygomodule.h, src/annoygomodule.i) and a test file (test/annoy_test.go), but looking at .github/workflows/ci.yml, there's no evidence Go tests are being run in the CI pipeline. This leaves Go bindings untested in automated builds. Given the multi-language nature of Annoy, Go tests should run alongside Python tests.

[ ] Review .github/workflows/ci.yml to confirm Go tests aren't currently run
[ ] Add a new job or step to ci.yml to compile and run Go tests using 'go test ./test'
[ ] Ensure proper Go environment setup (version specification in workflow)
[ ] Test against test/annoy_test.go to verify compilation and test execution
[ ] Document Go testing requirements in README.rst or a new CONTRIBUTING.md

Add memory profiling and benchmark tests for on-disk mmap functionality

The repo has on_disk_build_test.py and mmap_test.py examples, but these appear to be one-off validation tests rather than automated benchmarks. Given that memory efficiency and mmap performance are core value propositions of Annoy (mentioned in README), there should be continuous benchmarking to catch regressions. Currently only memory_leak_test.py exists for memory concerns.

[ ] Create test/mmap_benchmark_test.py with benchmarks for: mmap file loading time, memory usage comparison (mmap vs in-memory), concurrent access patterns
[ ] Add benchmarks comparing on-disk vs in-memory build performance for various index sizes
[ ] Integrate benchmarks into CI (can be optional/slower job in ci.yml) or create separate benchmark workflow
[ ] Measure and document baseline memory usage for standard test datasets to catch regressions
[ ] Add benchmark results tracking or comparison against previous runs

🌿Good first issues

Bring annoy/__init__.pyi in sync with the methods implemented in src/annoymodule.cc. The stub is maintained separately from the implementation, so the two can drift and cause doc/IDE mismatches.
Write integration tests for Go bindings (test/annoy_test.go exists but is minimal) that mirror the Python accuracy_test.py suite, ensuring feature parity across languages.
Document the exact binary format of .ann index files in README.rst with a hex dump example and spec, enabling third-party implementations and debugging—currently only implicitly specified in annoylib.h comments.

⭐Top contributors

@erikbern — 39 commits
@mathematicalmichael — 10 commits
@pkorobov — 9 commits
Erik Bernhardsson — 9 commits
@LTLA — 7 commits

📝Recent commits

379f744 — Merge pull request #680 from benglewis/add-python-3.13-support (erikbern)
3f32f02 — Merge pull request #1 from mathematicalmichael/add-python-3.13-support (benglewis)
ecf3ab7 — bump cibuildwheel (mathematicalmichael)
4fb7771 — secrets (mathematicalmichael)
3f87e65 — figure back to image (original) (mathematicalmichael)
d48b274 — address feedback + bump cibuildwheel (mathematicalmichael)
6528894 — aarch64 wheels on linux - tested (mathematicalmichael)
3373daa — version bump back down (mathematicalmichael)
f552a1e — move permissions into publish step (mathematicalmichael)
8f440e8 — Update .github/workflows/publish.yml (mathematicalmichael)

🔒Security observations

The Annoy codebase demonstrates generally good security practices for an open-source C++ library with Python bindings. No critical vulnerabilities were identified from the visible file structure. Primary concerns are around dependency management, C++ memory safety (which requires code review of implementation), and lack of formal security disclosure policy. The library's core functionality (nearest neighbor search) is mathematically focused with lower injection attack surface compared to web applications. Recommend implementing security testing in CI/CD, adding a SECURITY.md policy, and conducting periodic security audits of C++ implementations for memory safety issues.

Medium · Missing dependency security pinning in setup.py — setup.py. The setup.py file lacks explicit dependency version pinning, which could allow installation of vulnerable transitive dependencies. Without version constraints, future dependency updates could introduce security vulnerabilities. Fix: Add explicit version constraints for all dependencies in setup.py. Use tools like pip-audit to regularly scan for known vulnerabilities in dependencies.
Low · Missing security policy documentation — Repository root. No SECURITY.md or security policy file is present in the repository root. This makes it unclear how users should report security vulnerabilities responsibly. Fix: Create a SECURITY.md file documenting the security vulnerability disclosure process and providing contact information for security reports.
Low · C++ memory safety concerns in annoylib.h — src/annoylib.h, src/annoymodule.cc. The codebase contains C++ code with manual memory management and pointer operations. Without access to the full code, potential buffer overflows, use-after-free, or other memory safety issues could exist in the core C++ library. Fix: Conduct thorough code review focusing on buffer management, bounds checking, and use of safe memory allocation patterns. Consider using AddressSanitizer and MemorySanitizer in CI/CD pipeline.
Low · Missing binding security considerations — src/annoygomodule.i, src/annoyluamodule.cc. The non-Python bindings (the SWIG interface src/annoygomodule.i for Go and src/annoyluamodule.cc for Lua) may have language-specific security issues that are not visible without examining the binding implementations. Fix: Ensure every language binding validates input types and performs bounds checking. Review language-specific security best practices.
Low · Mmap security implications — src/mman.h, examples/mmap_test.py. The library uses memory-mapped file I/O for data structures shared between processes. This introduces potential race conditions and privilege escalation risks if file permissions are misconfigured. Fix: Ensure mmap files are created with restrictive permissions (0600). Implement proper file locking mechanisms and document secure usage patterns in documentation.

LLM-derived; treat as a starting point, not a security audit.

👉Where to read next

Open issues — current backlog
Recent PRs — what's actively shipping
Source on GitHub

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
