hora-search/hora
🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀.
Healthy across all four use cases
Permissive license, no critical CVEs, actively maintained: safe to depend on.
Has a license, tests, and CI: clean foundation to fork and modify.
Documented and popular: useful reference codebase to read through.
No critical CVEs, sane security posture: runnable as-is.
- Last commit 3mo ago
- 6 active contributors
- Apache-2.0 licensed
- CI configured
- ⚠ Single-maintainer risk: top contributor has 88% of recent commits
- ⚠ No test directory detected
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding doc
Onboarding: hora-search/hora
Generated by RepoPilot · 2026-05-09 · Source
Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/hora-search/hora shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything, but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
Verdict
GO: Healthy across all four use cases
- Last commit 3mo ago
- 6 active contributors
- Apache-2.0 licensed
- CI configured
- ⚠ Single-maintainer risk: top contributor has 88% of recent commits
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live hora-search/hora
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale; regenerate it at
repopilot.app/r/hora-search/hora.
What it runs against: a local clone of hora-search/hora. The script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in hora-search/hora | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 110 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of hora-search/hora. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/hora-search/hora.git
#   cd hora
#
# Then paste this script. Every check is read-only; no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of hora-search/hora and re-run."
  exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "hora-search/hora(\.git)?\b" \
  && ok "origin remote is hora-search/hora" \
  || miss "origin remote is not hora-search/hora (artifact may be from a fork)"
# 2. License matches what RepoPilot saw.
# The Apache LICENSE text opens with "Apache License" and "Version 2.0" rather
# than the SPDX id, so match that text or the SPDX id declared in Cargo.toml.
( { grep -qi "Apache License" LICENSE && grep -qi "Version 2\.0" LICENSE; } 2>/dev/null \
  || grep -qE "^license[[:space:]]*=[[:space:]]*\"Apache-2\.0\"" Cargo.toml 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift: was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
test -f "src/lib.rs" \
  && ok "src/lib.rs" \
  || miss "missing critical file: src/lib.rs"
test -f "src/core/ann_index.rs" \
  && ok "src/core/ann_index.rs" \
  || miss "missing critical file: src/core/ann_index.rs"
test -f "src/core/metrics.rs" \
  && ok "src/core/metrics.rs" \
  || miss "missing critical file: src/core/metrics.rs"
test -f "src/index/mod.rs" \
  && ok "src/index/mod.rs" \
  || miss "missing critical file: src/index/mod.rs"
test -f "src/core/simd_metrics.rs" \
  && ok "src/core/simd_metrics.rs" \
  || miss "missing critical file: src/core/simd_metrics.rs"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 110 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~80d)"
else
  miss "last commit was $days_since_last days ago; artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures); safe to trust"
else
  echo "artifact has $fail stale claim(s); regenerate at https://repopilot.app/r/hora-search/hora"
  exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
TL;DR
Hora is a Rust library implementing multiple approximate nearest neighbor (ANN) search algorithms, including HNSW, SSG, and Product Quantization, with SIMD acceleration and a multi-threaded design. It finds similar vectors in high-dimensional spaces without exhaustive comparison, enabling real-time similarity search at scale (as demonstrated in the face-matching and wine-review search demos).
The crate is monolithic: src/core/ contains the foundational algorithms (metrics, heap, KNN, K-means) and traits (ann_index.rs); src/index/ contains five concrete index implementations (hnsw_idx.rs, ssg_idx.rs, pq_idx.rs, bpt_idx.rs, bruteforce_idx.rs), each with a matching params file; examples/src/ provides runnable demos (demo.rs, ann_bench.rs); and benches/ contains criterion benchmarks.
Who it's for
Machine learning engineers and data scientists building similarity search systems (recommendation engines, vector databases, content retrieval) who need production-grade ANN implementations with multi-language bindings; Rust systems developers wanting to call sophisticated search algorithms from Python, JavaScript, or Java without reinventing indexing.
Maturity & risk
Actively maintained with version 0.1.1 (pre-1.0, indicating API stability is not yet guaranteed). The project has CI/CD via GitHub Actions (.github/workflows/rust.yml), comprehensive multi-language README documentation, and example code in examples/src/, but the WIP status on several languages (Go, Ruby, Swift, R) and features (no_std, RPTIndex) suggests ongoing development rather than feature-complete production readiness.
Moderate risk: pre-1.0 versioning means breaking API changes are possible; the optional 'simd' feature depends on packed_simd, an unstable crate that may limit platform compatibility; and the small core team (authors: aljun, moonlight) creates single-maintainer risk. The 158K lines of Rust code are substantial, but the lack of a visible issue count in the provided data obscures backlog health.
Active areas of work
The CHANGELOG.md and file structure suggest active work on language bindings (Python, JavaScript, Java are working; Go, Ruby, Swift, R marked WIP). Release 0.1.1 indicates incremental progress. The presence of multiple feature flags (simd, no_thread, no_std) and the WIP no_std partial support suggest optimization and portability expansion are ongoing priorities.
Get running
git clone https://github.com/hora-search/hora.git
cd hora
cargo build --release
cargo run --example demo
Daily commands:
cargo build
cargo run --example demo
cargo bench --bench bench_metrics
Map of the codebase
- src/lib.rs: Root library entry point exposing all public APIs; every contributor must understand the public module structure and re-exports
- src/core/ann_index.rs: Core trait definition for approximate nearest neighbor indexes; all index implementations implement this abstraction
- src/core/metrics.rs: Distance metric implementations (Euclidean, cosine, etc.); critical for all ANN search correctness and performance
- src/index/mod.rs: Index module facade unifying the index types (HNSW, SSG, BPT, PQ, and brute force); contributors must know which index to extend
- src/core/simd_metrics.rs: SIMD-optimized distance calculations; load-bearing for distance-computation performance on x86/WASM targets
- Cargo.toml: Rust edition 2018, release profile with LTO and opt-level 3; necessary to understand build configuration and dependency constraints
- examples/src/main.rs: Primary example showing ANN index usage patterns; reference for API conventions and expected workflows
How to make changes
Add a New ANN Index Algorithm
- Create the index implementation file (src/index/my_algo_idx.rs)
- Create a parameters struct for the new algorithm (src/index/my_algo_params.rs)
- Implement the ANNIndex trait from src/core/ann_index.rs with build() and search() methods (src/index/my_algo_idx.rs)
- Add pub use re-exports in src/index/mod.rs to expose the new index type (src/index/mod.rs)
- Export from src/lib.rs if intended as part of the public API (src/lib.rs)
- Add a benchmark in benches/bench_metrics.rs or examples/src/ann_bench.rs (benches/bench_metrics.rs)
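As a rough illustration of the steps above, the sketch below defines a hypothetical `SimpleAnnIndex` trait with `build()` and `search()` and implements it for an exhaustive index. The trait, the `MyAlgoIndex` struct, and every signature here are assumptions made for illustration; the real trait lives in src/core/ann_index.rs and its signatures should be checked there before copying anything.

```rust
/// Hypothetical trait for illustration only; not Hora's actual trait.
pub trait SimpleAnnIndex {
    /// Prepare the index for searching.
    fn build(&mut self) -> Result<(), &'static str>;
    /// Return up to `k` (id, squared distance) pairs, nearest first.
    fn search(&self, query: &[f32], k: usize) -> Vec<(usize, f32)>;
}

/// Hypothetical exhaustive index: compares the query to every stored vector.
pub struct MyAlgoIndex {
    dimension: usize,
    vectors: Vec<(usize, Vec<f32>)>,
    built: bool,
}

impl MyAlgoIndex {
    pub fn new(dimension: usize) -> Self {
        MyAlgoIndex { dimension, vectors: Vec::new(), built: false }
    }

    /// Store a vector under `id`, rejecting a wrong dimension.
    pub fn add(&mut self, vector: &[f32], id: usize) -> Result<(), &'static str> {
        if vector.len() != self.dimension {
            return Err("dimension mismatch");
        }
        self.vectors.push((id, vector.to_vec()));
        // New data invalidates a previous build.
        self.built = false;
        Ok(())
    }
}

fn squared_euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

impl SimpleAnnIndex for MyAlgoIndex {
    fn build(&mut self) -> Result<(), &'static str> {
        if self.vectors.is_empty() {
            return Err("no vectors added");
        }
        self.built = true;
        Ok(())
    }

    fn search(&self, query: &[f32], k: usize) -> Vec<(usize, f32)> {
        // An unbuilt index or a wrong-sized query yields no results.
        if !self.built || query.len() != self.dimension {
            return Vec::new();
        }
        let mut scored: Vec<(usize, f32)> = self
            .vectors
            .iter()
            .map(|(id, v)| (*id, squared_euclidean(query, v)))
            .collect();
        scored.sort_by(|a, b| a.1.total_cmp(&b.1));
        scored.truncate(k);
        scored
    }
}
```

A real index would replace the linear scan in `search()` with its own traversal; the exhaustive version is mainly useful as a correctness baseline when testing an approximate one.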
Add a New Distance Metric
- Define the metric function and tests in src/core/metrics.rs (src/core/metrics.rs)
- For SIMD acceleration, add an optimized variant in src/core/simd_metrics.rs using platform-specific intrinsics (src/core/simd_metrics.rs)
- Add a benchmark in benches/bench_metrics.rs to validate performance (benches/bench_metrics.rs)
- Update examples to demonstrate the new metric (examples/src/demo.rs)
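For the first step, a scalar metric can be as small as the sketch below, which computes Manhattan (L1) distance. The function name, the error type, and the choice of Manhattan distance are assumptions for illustration; how metrics are actually declared and dispatched should be read from src/core/metrics.rs.

```rust
/// Hypothetical metric function; not taken from src/core/metrics.rs.
/// Returns the Manhattan (L1) distance, or an error if the lengths differ.
pub fn manhattan_distance(a: &[f32], b: &[f32]) -> Result<f32, &'static str> {
    if a.len() != b.len() {
        return Err("vectors have different dimensions");
    }
    // Sum of absolute element-wise differences.
    Ok(a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum())
}
```

Returning a `Result` instead of panicking keeps a dimension mismatch visible to the caller, which matters when the metric is called from a language binding.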
Extend an Existing Index with New Parameters
- Add new fields to the params struct (e.g., src/index/hnsw_params.rs) (src/index/hnsw_params.rs)
- Update the index implementation (e.g., src/index/hnsw_idx.rs) to use the new parameters in build() (src/index/hnsw_idx.rs)
- Add example usage in examples/src/demo.rs or examples/src/main.rs (examples/src/main.rs)
- Benchmark the impact in benches/bench_metrics.rs or ann_bench.rs (benches/bench_metrics.rs)
Optimize a Hot Path with SIMD
- Profile the bottleneck using a release build and benchmarks (benches/bench_metrics.rs)
- Add a SIMD intrinsic implementation in src/core/simd_metrics.rs (conditional on target_arch) (src/core/simd_metrics.rs)
- Ensure a fallback scalar path exists in src/core/metrics.rs for unsupported platforms (src/core/metrics.rs)
- Re-run the benches to validate the improvement (benches/bench_metrics.rs)
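The pattern of a `target_arch`-gated fast path with a scalar fallback might look like the sketch below. It uses the standard library's `std::arch` SSE intrinsics rather than the packed_simd crate that the simd feature is reported to depend on, and the function names are hypothetical, so treat it as an illustration of the structure only.

```rust
/// Scalar reference implementation: sum of squared differences.
pub fn squared_euclidean_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// SSE version for x86_64 (SSE is part of the x86_64 baseline).
#[cfg(target_arch = "x86_64")]
pub fn squared_euclidean_fast(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let chunks = a.len() / 4;
    let mut lanes = [0.0f32; 4];
    // SAFETY: each load reads 4 floats starting at index i * 4, and
    // i * 4 + 3 < chunks * 4 <= a.len() == b.len(). The store writes
    // exactly 4 floats into the 4-element `lanes` array.
    unsafe {
        let mut acc = _mm_setzero_ps();
        for i in 0..chunks {
            let va = _mm_loadu_ps(a.as_ptr().add(i * 4));
            let vb = _mm_loadu_ps(b.as_ptr().add(i * 4));
            let diff = _mm_sub_ps(va, vb);
            acc = _mm_add_ps(acc, _mm_mul_ps(diff, diff));
        }
        _mm_storeu_ps(lanes.as_mut_ptr(), acc);
    }
    let head: f32 = lanes.iter().sum();
    // The tail that does not fill a 4-lane register goes through the scalar path.
    head + squared_euclidean_scalar(&a[chunks * 4..], &b[chunks * 4..])
}

/// Fallback for every other architecture: use the scalar path.
#[cfg(not(target_arch = "x86_64"))]
pub fn squared_euclidean_fast(a: &[f32], b: &[f32]) -> f32 {
    squared_euclidean_scalar(a, b)
}
```

Because the vectorized version adds terms in a different order, its result can differ from the scalar one in the last few bits, so tests comparing the two should use a tolerance rather than exact equality.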
Why these technologies
- Rust: memory safety without GC, performance competitive with C++, SIMD intrinsics for distance optimization, and multi-threading via rayon
- SIMD (x86/WASM): distance metric computation is O(dimension × k × candidates); vectorization can provide a 5 to 10x speedup on typical ANN workloads
- Graph-based indexes (HNSW, SSG): hierarchical search with O(log n) layer navigation; a practical recall/throughput tradeoff for billion-scale datasets
- Product Quantization: reduces the memory footprint to 1 to 4% of the original for billion-vector indexes while maintaining recall >90%
- K-means clustering: enables BPT hierarchical partitioning and PQ codebook construction; amortizes high construction cost with fast searches
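To make the Product Quantization memory claim concrete, the sketch below works through the arithmetic for one assumed configuration. The parameters (128-dimensional f32 vectors, 8 sub-quantizers, 256 centroids each) are illustrative assumptions, not Hora defaults. The helper names are hypothetical, centroid ids are assumed to be stored in whole bytes, and codebook overhead is ignored.

```rust
/// Bytes needed to store one raw f32 vector.
pub fn raw_bytes(dimension: usize) -> usize {
    dimension * std::mem::size_of::<f32>()
}

/// Bytes needed to store one PQ code: one centroid id per sub-quantizer.
/// Each id needs ceil(log2(centroids)) bits, rounded up to whole bytes.
pub fn pq_code_bytes(sub_quantizers: usize, centroids: usize) -> usize {
    assert!(centroids >= 2, "need at least two centroids per sub-quantizer");
    let bits_per_id = (usize::BITS - (centroids - 1).leading_zeros()) as usize;
    let bytes_per_id = (bits_per_id + 7) / 8;
    sub_quantizers * bytes_per_id
}

/// PQ code size as a percentage of the raw vector size.
pub fn compression_percent(dimension: usize, sub_quantizers: usize, centroids: usize) -> f64 {
    100.0 * pq_code_bytes(sub_quantizers, centroids) as f64 / raw_bytes(dimension) as f64
}
```

Under these assumptions a 512-byte vector becomes an 8-byte code, about 1.56% of the original, which sits inside the quoted range; other configurations will land elsewhere.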
Trade-offs already made
- Four separate index implementations rather than a unified parameterizable index
  - Why: Each algorithm (HNSW, SSG, BPT, PQ) has a fundamentally different memory layout and traversal pattern; unification would impose 5 to 20% overhead
  - Consequence: More code duplication (~200 LOC per index), but clearer per-algorithm optimization and easier independent hyperparameter tuning
- No built-in persistence or distributed support
  - Why: Scope is in-memory single
Traps & gotchas
No hardcoded environment variables or service dependencies are visible, but: (1) SIMD is optional (feature 'simd') and requires packed_simd, which may not compile on all platforms; the default build is scalar; (2) rayon multi-threading is hard-wired (a no_thread feature exists but is not enforced); (3) the examples/src/load_dataset.sh script appears in the file list but is not detailed, so verify that it downloads datasets before running ann_bench; (4) the bincode serialization format for indices may not be stable across Hora versions (pre-1.0 risk).
Concepts to learn
- Hierarchical Navigable Small World (HNSW): HNSW is Hora's primary index and the modern standard for ANN; understanding its layer-based graph structure and greedy search is essential to using and extending the library effectively.
- Product Quantization (PQ): PQ trades recall for memory efficiency by encoding vectors as products of quantized subvectors; Hora's pq_idx.rs implements this for memory-constrained deployments, a key use-case differentiator.
- SIMD (Single Instruction Multiple Data): Hora's packed_simd feature accelerates distance calculations (Euclidean, Dot Product) by computing multiple vector elements in parallel; understanding SIMD is critical to performance profiling and the 'simd' feature flag's impact.
- K-d tree and spatial indexing: While Hora uses graph-based indices (HNSW, SSG) rather than trees, understanding spatial partitioning and tree-based ANN is foundational context for why graph indices outperform them in high dimensions.
- Satellite System Graph (SSG): SSG is an alternative to HNSW implemented in ssg_idx.rs with different trade-offs in construction time and query latency; understanding when to use SSG vs HNSW requires knowing their algorithmic differences.
- Distance metrics (Euclidean, Cosine, Dot Product): Hora implements multiple distance metrics in src/core/metrics.rs and simd_metrics.rs; choosing the right metric (normalized vs unnormalized, Euclidean vs Dot Product) directly impacts search quality and performance.
- Approximate Nearest Neighbor (ANN) and recall-speed tradeoff: ANN accepts approximate results to achieve sub-linear query time; Hora's entire design is built around this tradeoff, so understanding recall metrics (how many true nearest neighbors are found) vs query latency is essential to deploying Hora correctly.
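The recall metric mentioned in the last item can be computed with a few lines. The sketch below is a generic recall@k helper, not a function from Hora; it assumes neighbor ids are `usize` and that the exact result comes from an exhaustive search over the same data.

```rust
use std::collections::HashSet;

/// Fraction of the true top-`k` neighbor ids that also appear in the
/// approximate top-`k` result. Returns 0.0 when there is no ground truth.
pub fn recall_at_k(approximate: &[usize], exact: &[usize], k: usize) -> f64 {
    let truth: HashSet<usize> = exact.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 0.0;
    }
    let found: HashSet<usize> = approximate.iter().take(k).copied().collect();
    let hits = found.intersection(&truth).count();
    hits as f64 / truth.len() as f64
}
```

Recall ignores the order of the returned ids, so it is usually reported alongside query latency when tuning index parameters.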
Related repos
- nmslib/hnswlib: Reference C++ implementation of the HNSW algorithm that Hora's hnsw_idx.rs is based on; useful for cross-validating correctness and performance.
- erikbern/ann-benchmarks: Independent benchmark suite for comparing ANN algorithms; validates whether Hora's performance claims hold against competing libraries.
- milvus-io/milvus: Vector database that likely uses algorithms similar to Hora's indices (HNSW, PQ); demonstrates production use-case integration patterns.
- spotify/annoy: Alternative ANN library (Python-first, C++ backend) implementing Random Projection Trees; Hora's RPTIndex (WIP) competes in the same space.
- facebookresearch/faiss: Meta's vector similarity search library with aggressive SIMD optimization; Hora's packed_simd integration aims for comparable performance.
PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive integration tests for all index implementations
The repo has 5 index implementations (HNSW, SSG, BPT, PQ, BruteForce) in src/index/ but there are no visible integration tests comparing their correctness, performance consistency, or edge case handling. This is critical for an ANN search library where correctness is paramount. Tests should verify that all indexes produce semantically equivalent results and handle various data distributions.
- [ ] Create tests/integration_tests.rs with test cases for each index type in src/index/
- [ ] Test all distance metrics (euclidean, cosine, etc.) from src/core/metrics.rs against each index
- [ ] Add edge case tests: empty vectors, single element, duplicate vectors, high-dimensional data
- [ ] Verify all 5 indexes return same nearest neighbors (within tolerance for approximate algorithms)
- [ ] Test serialization/deserialization using bincode dependency for each index type
Add benchmarking suite comparing all index types across different datasets
The repo has benches/bench_metrics.rs but only benchmarks distance metrics, not the actual index implementations. Given that Hora markets itself on performance comparable to C++, a comprehensive benchmark suite comparing HNSW vs SSG vs PQ vs BPT vs BruteForce across multiple datasets (SIFT, Fashion-MNIST referenced in assets/) would be valuable for users and contributors.
- [ ] Extend benches/bench_metrics.rs or create benches/bench_indexes.rs to benchmark all 5 index types
- [ ] Add benchmarks for build time, query time, and memory usage for each index
- [ ] Test across at least 3 datasets: synthetic random, SIFT-like, Fashion-MNIST (referenced in assets/)
- [ ] Vary parameters (K values, index params from *_params.rs files) to show trade-offs
- [ ] Generate comparison charts/reports consumable in CI (integrate with examples/src/ann_bench.rs)
Add SIMD optimization tests and documentation for the simd feature
The Cargo.toml declares a 'simd' feature with packed_simd_2 dependency, and src/core/simd_metrics.rs exists, but there's no documentation explaining when/how to enable SIMD, no tests verifying SIMD code paths are actually used, and no CI configuration to test both simd and non-simd builds. This is a critical performance feature that should be validated.
- [ ] Add doc comments to src/core/simd_metrics.rs explaining SIMD optimization strategy and when it activates
- [ ] Create tests/simd_tests.rs that verify SIMD implementations produce identical results to scalar versions
- [ ] Update .github/workflows/rust.yml to test both 'cargo test' and 'cargo test --features simd'
- [ ] Add a section to README.md documenting SIMD feature usage and performance benefits (cross-reference assets/fashion-mnist*.png benchmarks)
- [ ] Document in CONTRIBUTING.md that simd-related changes must pass both feature-gated test runs
Good first issues
- Add benchmark coverage for the SSG (src/index/ssg_idx.rs) and PQ (src/index/pq_idx.rs) indices in benches/bench_metrics.rs; currently only hnsw_idx appears heavily tested, based on the file structure.
- Write integration tests for the no_std feature flag (currently marked WIP in Cargo.toml features) by creating tests/no_std_test.rs that validates that the core distance metrics compile without std.
- Implement missing language binding examples: add examples/src/python_call.rs and examples/src/js_call.rs showing how to instantiate each of the five indices from a non-Rust caller, documenting the serialization contract.
Top contributors
- @salamer: 57 commits
- @lsalkeld: 4 commits
- @WhiteWorld: 1 commit
- @btv: 1 commit
- @Lesmiscore: 1 commit
Recent commits
- 239bd36 fix: fix typo (#35) (WhiteWorld)
- de4b2c4 chore: update dependencies (#32) (salamer)
- a6759f8 Code clean up (#29) (btv)
- 0f6de48 update: improve JP version of README (#26) (Lesmiscore)
- 00fe211 add cn readme (#24) (salamer)
- 48186c1 update: add multiples language readmes (#23) (salamer)
- c234cea add: add no_thread (#14) (salamer)
- dfa3ea1 update: update readme (salamer)
- 46b190a Merge pull request #8 from syndek/patch-2 (salamer)
- 5f5b75e Bring README 'Contribute' section in line with CONTRIBUTING.md (lsalkeld)
Security observations
The Hora codebase is a Rust library with generally good security due to language safety features. However, there are notable concerns: (1) Overflow checks are disabled in production builds, which is problematic for a numerical algorithm library; (2) Aggressive LTO optimization reduces debuggability; (3) Dependencies could benefit from more frequent updates and vulnerability scanning. The library has no obvious injection vulnerabilities (no database/web operations), no hardcoded secrets, and no infrastructure exposure. The main risks are arithmetic safety in performance-critical paths and dependency maintenance. Recommendations: enable overflow checks, use regular vulnerability audits, update dependencies, and add extensive test coverage for numerical correctness.
- Medium · Overflow Checks Disabled in Production (Cargo.toml, [profile.release]). The release profile has overflow-checks = false, which disables runtime checks for integer overflow/underflow. In a numerical algorithm library dealing with distance calculations and indexing, arithmetic overflow could lead to incorrect results or memory safety issues. Fix: Enable overflow checks by setting overflow-checks = true in the release profile, or at minimum use checked arithmetic in security-critical paths. Test the performance impact before deployment.
- Medium · LTO Enabled with Fat Link-Time Optimization (Cargo.toml, [profile.release]). The release profile uses lto = "fat", which performs aggressive link-time optimization. While generally safe, this can make debugging security issues harder and may interact unpredictably with SIMD code. Combined with disabled overflow checks, this increases risk. Fix: Use lto = "thin" instead of lto = "fat" for a balance between performance and safety. Verify SIMD behavior with comprehensive testing.
- Low · Outdated Dependency Versions (Cargo.toml, [dependencies]). Several dependencies use caret version constraints (e.g., log = "^0.4", rayon = "^1.5"), which may allow pulling in newer minor/patch versions with potential vulnerabilities. Some pinned versions are relatively old (rand 0.8.4 from 2021, fixedbitset 0.4.0, etc.). Fix: Run cargo audit regularly to check for known vulnerabilities. Consider updating to the latest stable versions and re-testing. Use lock files in production for reproducibility.
- Low · Optional Feature: no_std Without Safety Review (Cargo.toml, [features]). The no_std feature depends on hashbrown, but there's no evidence of comprehensive safety audits for no_std environments. Fallback memory allocators in no_std contexts could hide issues. Fix: Document the security model for no_std usage. Ensure memory allocation strategies are well-tested. Consider marking the feature as experimental if it is not production-ready.
- Low · SIMD Feature Uses External Packed SIMD (Cargo.toml, [features] and src/core/simd_metrics.rs). The SIMD feature depends on packed_simd_2 (version 0.3.6), an optional dependency. SIMD code can have platform-specific behavior and may bypass certain safety checks. There is no clear documentation about platform support or validation. Fix: Document SIMD platform requirements and validation procedures. Add comprehensive tests for SIMD correctness across target platforms. Consider the security implications of performance optimizations.
LLM-derived; treat as a starting point, not a security audit.
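On the overflow observation: with overflow checks off, integer overflow wraps silently in release builds, so one mitigation for index arithmetic is to use explicitly checked operations, which behave the same in every profile. The sketch below shows the idea with a hypothetical helper that locates one vector inside a flat buffer; the function name and buffer layout are assumptions, not Hora code.

```rust
use std::ops::Range;

/// Hypothetical helper: element range of vector `id` inside a flat buffer
/// that stores `dimension` values per vector. Returns None on arithmetic
/// overflow or when the range would run past `buffer_len`.
pub fn vector_range(id: usize, dimension: usize, buffer_len: usize) -> Option<Range<usize>> {
    // checked_* return None instead of wrapping, regardless of build profile.
    let start = id.checked_mul(dimension)?;
    let end = start.checked_add(dimension)?;
    if end <= buffer_len {
        Some(start..end)
    } else {
        None
    }
}
```

A caller can then slice with the returned range and treat `None` as an invalid id, instead of relying on a wrapped offset that happens to be in bounds.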
Where to read next
- Open issues: current backlog
- Recent PRs: what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals; see the live page for receipts. Re-run on a new commit to refresh.