hora-search/hora

Item: hora-search/hora
Rating: 5
Author: RepoPilot

🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

Healthy

Healthy across all four use cases

weakest axis

Use as dependencyHealthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modifyHealthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isHealthy

No critical CVEs, sane security posture — runnable as-is.

✓Last commit 3mo ago
✓6 active contributors
✓Apache-2.0 licensed

Show all 6 evidence items →

✓CI configured
⚠Single-maintainer risk — top contributor 88% of recent commits
⚠No test directory detected

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:

[![RepoPilot: Healthy](https://repopilot.app/api/badge/hora-search/hora)](https://repopilot.app/r/hora-search/hora)

Paste at the top of your README.md — renders inline like a shields.io badge.

▸Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/hora-search/hora on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: hora-search/hora

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/hora-search/hora shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across all four use cases

Last commit 3mo ago
6 active contributors
Apache-2.0 licensed
CI configured
⚠ Single-maintainer risk — top contributor 88% of recent commits
⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live hora-search/hora repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/hora-search/hora.

What it runs against: a local clone of hora-search/hora — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in hora-search/hora | Confirms the artifact applies here, not a fork | | 2 | License is still Apache-2.0 | Catches relicense before you depend on it | | 3 | Default branch main exists | Catches branch renames | | 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 5 | Last commit ≤ 110 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>hora-search/hora</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of hora-search/hora. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/hora-search/hora.git
#   cd hora
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of hora-search/hora and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "hora-search/hora(\\.git)?\\b" \\
  && ok "origin remote is hora-search/hora" \\
  || miss "origin remote is not hora-search/hora (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Apache-2\\.0)" LICENSE 2>/dev/null \\
   || grep -qiE "\"license\"\\s*:\\s*\"Apache-2\\.0\"" package.json 2>/dev/null) \\
  && ok "license is Apache-2.0" \\
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \\
  && ok "default branch main exists" \\
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "src/lib.rs" \\
  && ok "src/lib.rs" \\
  || miss "missing critical file: src/lib.rs"
test -f "src/core/ann_index.rs" \\
  && ok "src/core/ann_index.rs" \\
  || miss "missing critical file: src/core/ann_index.rs"
test -f "src/core/metrics.rs" \\
  && ok "src/core/metrics.rs" \\
  || miss "missing critical file: src/core/metrics.rs"
test -f "src/index/mod.rs" \\
  && ok "src/index/mod.rs" \\
  || miss "missing critical file: src/index/mod.rs"
test -f "src/core/simd_metrics.rs" \\
  && ok "src/core/simd_metrics.rs" \\
  || miss "missing critical file: src/core/simd_metrics.rs"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 110 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~80d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/hora-search/hora"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

⚡TL;DR

Hora is a Rust library implementing multiple approximate nearest neighbor search (ANN) algorithms including HNSW, SSG, and Product Quantization, optimized with SIMD acceleration and multi-threaded design. It solves the problem of efficiently finding similar vectors in high-dimensional spaces without exhaustive comparison, enabling real-time similarity search at scale (as demonstrated in face-matching and wine-review search demos). Monolithic crate structure: src/core/ contains foundational algorithms (metrics, heap, KNN, K-means) and traits (ann_index.rs), src/index/ contains five concrete algorithm implementations (hnsw_idx.rs, ssg_idx.rs, pq_idx.rs, bpt_idx.rs, bruteforce_idx.rs) each with matching params files, examples/src/ provides runnable demos (demo.rs, ann_bench.rs), and benches/ contains criterion benchmarks.

👥Who it's for

Machine learning engineers and data scientists building similarity search systems (recommendation engines, vector databases, content retrieval) who need production-grade ANN implementations with multi-language bindings; Rust systems developers wanting to call sophisticated search algorithms from Python, JavaScript, or Java without reinventing indexing.

🌱Maturity & risk

Actively maintained with version 0.1.1 (pre-1.0, indicating API stability is not yet guaranteed). The project has CI/CD via GitHub Actions (.github/workflows/rust.yml), comprehensive multi-language README documentation, and example code in examples/src/, but the WIP status on several languages (Go, Ruby, Swift, R) and features (no_std, RPTIndex) suggests ongoing development rather than feature-complete production readiness.

Moderate risk: pre-1.0 versioning means breaking API changes are possible; dependency on packed_simd (optional feature 'simd') is an unstable Rust crate which may limit platform compatibility; small core team (authors: aljun, moonlight) creates single-maintainer risk. The 158K lines of Rust code is substantial but the lack of visible issue count in provided data obscures backlog health.

Active areas of work

The CHANGELOG.md and file structure suggest active work on language bindings (Python, JavaScript, Java are working; Go, Ruby, Swift, R marked WIP). Release 0.1.1 indicates incremental progress. The presence of multiple feature flags (simd, no_thread, no_std) and the WIP no_std partial support suggest optimization and portability expansion are ongoing priorities.

🚀Get running

git clone https://github.com/hora-search/hora.git
cd hora
cargo build --release
cargo run --example demo

Daily commands:

cargo build
cargo run --example demo
cargo bench --bench bench_metrics

🗺️Map of the codebase

src/lib.rs — Root library entry point exposing all public APIs; every contributor must understand the public module structure and re-exports
src/core/ann_index.rs — Core trait definition for approximate nearest neighbor indexes; all index implementations inherit from this abstraction
src/core/metrics.rs — Distance metric implementations (Euclidean, cosine, etc.); critical for all ANN search correctness and performance
src/index/mod.rs — Index module facade unifying all four index types (HNSW, SSG, BPT, PQ); contributors must know which index to extend
src/core/simd_metrics.rs — SIMD-optimized distance calculations; load-bearing for 10x+ performance gains on x86/WASM targets
Cargo.toml — Rust edition 2018, release profile with LTO and opt-level 3; necessary to understand build configuration and dependency constraints
examples/src/main.rs — Primary example showing ANN index usage patterns; reference for API conventions and expected workflows

🛠️How to make changes

Add a New ANN Index Algorithm

Create index implementation file (src/index/my_algo_idx.rs)
Create parameters struct for the new algorithm (src/index/my_algo_params.rs)
Implement the AnnIndex trait from src/core/ann_index.rs with build() and search() methods (src/index/my_algo_idx.rs)
Add pub use re-exports in src/index/mod.rs to expose the new index type (src/index/mod.rs)
Export from src/lib.rs if intended as part of public API (src/lib.rs)
Add benchmark test in benches/bench_metrics.rs or examples/src/ann_bench.rs (benches/bench_metrics.rs)

Add a New Distance Metric

Define metric function and tests in src/core/metrics.rs (src/core/metrics.rs)
For SIMD acceleration, add optimized variant in src/core/simd_metrics.rs using platform-specific intrinsics (src/core/simd_metrics.rs)
Add benchmark in benches/bench_metrics.rs to validate performance (benches/bench_metrics.rs)
Update examples to demonstrate the new metric (examples/src/demo.rs)

Extend an Existing Index with New Parameters

Add new fields to the params struct (e.g., src/index/hnsw_params.rs) (src/index/hnsw_params.rs)
Update the index implementation (e.g., src/index/hnsw_idx.rs) to use new parameters in build() (src/index/hnsw_idx.rs)
Add example usage in examples/src/demo.rs or examples/src/main.rs (examples/src/main.rs)
Benchmark impact in benches/bench_metrics.rs or ann_bench.rs (benches/bench_metrics.rs)

Optimize a Hot Path with SIMD

Profile the bottleneck using release build and benchmarks (benches/bench_metrics.rs)
Add SIMD intrinsic implementation in src/core/simd_metrics.rs (conditional on target_arch) (src/core/simd_metrics.rs)
Ensure fallback scalar path exists in src/core/metrics.rs for unsupported platforms (src/core/metrics.rs)
Re-run benches to validate improvement (benches/bench_metrics.rs)

🔧Why these technologies

Rust — Memory safety without GC, competitive performance with C++, SIMD intrinsics for distance optimization, single-threaded predictability for ANN algorithms
SIMD (x86/WASM) — Distance metric computation is O(dimension × k × candidates); vectorization provides 5–10x speedup on typical ANN workloads
Graph-based indexes (HNSW, SSG) — Hierarchical search with O(log n) layer navigation; practical recall/throughput tradeoff for billion-scale datasets
Product Quantization — Reduces memory footprint to 1–4% of original for billion-vector indexes while maintaining recall >90%
K-means clustering — Enables BPT hierarchical partitioning and PQ codebook construction; amortizes high construction cost with fast searches

⚖️Trade-offs already made

Four separate index implementations rather than unified parameterizable index
- Why: Each algorithm (HNSW, SSG, BPT, PQ) has fundamentally different memory layout and traversal patterns; unification would impose 5–20% overhead
- Consequence: More code duplication (~200 LOC per index), but clearer per-algorithm optimization and easier to tune hyperparameters independently
No built-in persistence or distributed support
- Why: Scope is in-memory single
- Consequence: undefined

🪤Traps & gotchas

No hardcoded environment variables or service dependencies visible, but: (1) SIMD features are optional (feature 'simd') and require packed_simd which may not compile on all platforms — default build is scalar, (2) rayon multi-threading is hard-wired (no_thread feature exists but not enforced), (3) the examples/src/load_dataset.sh script is referenced in file list but not detailed — verify it downloads datasets before running ann_bench, (4) bincode serialization format for indices may not be stable across Hora versions (pre-1.0 risk).

🏗️Architecture

💡Concepts to learn

Hierarchical Navigable Small World (HNSW) — HNSW is Hora's primary index and the modern standard for ANN; understanding its layer-based graph structure and greedy search is essential to using and extending the library effectively.
Product Quantization (PQ) — PQ trades recall for memory efficiency by encoding vectors as products of quantized subvectors; Hora's pq_idx.rs implements this for memory-constrained deployments, a key use-case differentiator.
SIMD (Single Instruction Multiple Data) — Hora's packed_simd feature accelerates distance calculations (Euclidean, Dot Product) by computing multiple vector elements in parallel; understanding SIMD is critical to performance profiling and the 'simd' feature flag's impact.
K-d tree and spatial indexing — While Hora uses graph-based indices (HNSW, SSG) rather than trees, understanding spatial partitioning and tree-based ANN is foundational context for why graph indices outperform them in high dimensions.
Satellite System Graph (SSG) — SSG is an alternative to HNSW implemented in ssg_idx.rs with different trade-offs in construction time and query latency; understanding when to use SSG vs HNSW requires knowing their algorithmic differences.
Distance metrics (Euclidean, Cosine, Dot Product) — Hora implements multiple distance metrics in src/core/metrics.rs and simd_metrics.rs; choosing the right metric (normalized vs unnormalized, Euclidean vs Dot Product) directly impacts search quality and performance.
Approximate Nearest Neighbor (ANN) and recall-speed tradeoff — ANN accepts approximate results to achieve sub-linear query time; Hora's entire design is around this tradeoff — understanding recall metrics (how many true nearest neighbors are found) vs query latency is essential to deploying Hora correctly.

nmslib/hnswlib — Reference C++ implementation of HNSW algorithm that Hora's hnsw_idx.rs is based on; useful for cross-validating correctness and performance.
jina-ai/ann-benchmarks — Independent benchmark suite for comparing ANN algorithms; validates whether Hora's performance claims hold against competing libraries.
milvus-io/milvus — Vector database that likely uses algorithms similar to Hora's indices (HNSW, PQ); demonstrates production use-case integration patterns.
spotify/annoy — Alternative ANN library (Python-first, C++ backend) implementing Random Projection Trees; Hora's RPTIndex (WIP) competes in the same space.
facebookresearch/faiss — Meta's vector similarity search library with aggressive SIMD optimization; Hora's packed_simd integration aims for comparable performance.

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive integration tests for all index implementations

The repo has 5 index implementations (HNSW, SSG, BPT, PQ, BruteForce) in src/index/ but there are no visible integration tests comparing their correctness, performance consistency, or edge case handling. This is critical for an ANN search library where correctness is paramount. Tests should verify that all indexes produce semantically equivalent results and handle various data distributions.

[ ] Create tests/integration_tests.rs with test cases for each index type in src/index/
[ ] Test all distance metrics (euclidean, cosine, etc.) from src/core/metrics.rs against each index
[ ] Add edge case tests: empty vectors, single element, duplicate vectors, high-dimensional data
[ ] Verify all 5 indexes return same nearest neighbors (within tolerance for approximate algorithms)
[ ] Test serialization/deserialization using bincode dependency for each index type

Add benchmarking suite comparing all index types across different datasets

The repo has benches/bench_metrics.rs but only benchmarks distance metrics, not the actual index implementations. Given that Hora markets itself on performance comparable to C++, a comprehensive benchmark suite comparing HNSW vs SSG vs PQ vs BPT vs BruteForce across multiple datasets (SIFT, Fashion-MNIST referenced in assets/) would be valuable for users and contributors.

[ ] Extend benches/bench_metrics.rs or create benches/bench_indexes.rs to benchmark all 5 index types
[ ] Add benchmarks for build time, query time, and memory usage for each index
[ ] Test across at least 3 datasets: synthetic random, SIFT-like, Fashion-MNIST (referenced in assets/)
[ ] Vary parameters (K values, index params from *_params.rs files) to show trade-offs
[ ] Generate comparison charts/reports consumable in CI (integrate with examples/src/ann_bench.rs)

Add SIMD optimization tests and documentation for the simd feature

The Cargo.toml declares a 'simd' feature with packed_simd_2 dependency, and src/core/simd_metrics.rs exists, but there's no documentation explaining when/how to enable SIMD, no tests verifying SIMD code paths are actually used, and no CI configuration to test both simd and non-simd builds. This is a critical performance feature that should be validated.

[ ] Add doc comments to src/core/simd_metrics.rs explaining SIMD optimization strategy and when it activates
[ ] Create tests/simd_tests.rs that verify SIMD implementations produce identical results to scalar versions
[ ] Update .github/workflows/rust.yml to test both 'cargo test' and 'cargo test --features simd'
[ ] Add a section to README.md documenting SIMD feature usage and performance benefits (cross-reference assets/fashion-mnist*.png benchmarks)
[ ] Document in CONTRIBUTING.md that simd-related changes must pass both feature-gated test runs

🌿Good first issues

Add benchmark coverage for SSG (src/index/ssg_idx.rs) and PQ (src/index/pq_idx.rs) indices in benches/bench_metrics.rs — currently only hnsw_idx appears heavily tested based on file structure.
Write integration tests for the no_std feature flag (currently marked WIP in Cargo.toml features) by creating tests/no_std_test.rs that validates core distance metrics compile without std::.
Implement missing language binding examples: add examples/src/python_call.rs and examples/src/js_call.rs showing how to instantiate each of the five indices from a non-Rust caller, documenting the serialization contract.

⭐Top contributors

Click to expand

@salamer — 57 commits
@lsalkeld — 4 commits
@WhiteWorld — 1 commits
@btv — 1 commits
@Lesmiscore — 1 commits

📝Recent commits

Click to expand

239bd36 — fix: fix typo (#35) (WhiteWorld)
de4b2c4 — chore: update dependencies (#32) (salamer)
a6759f8 — Code clean up (#29) (btv)
0f6de48 — update: improve JP version of README (#26) (Lesmiscore)
00fe211 — add cn readme (#24) (salamer)
48186c1 — update: add multiples language readmes (#23) (salamer)
c234cea — add: add no_thread (#14) (salamer)
dfa3ea1 — update: update readme (salamer)
46b190a — Merge pull request #8 from syndek/patch-2 (salamer)
5f5b75e — Bring README 'Contribute' section in line with CONTRIBUTING.md (lsalkeld)

🔒Security observations

The Hora codebase is a Rust library with generally good security due to language safety features. However, there are notable concerns: (1) Overflow checks are disabled in production builds, which is problematic for a numerical algorithm library; (2) Aggressive LTO optimization reduces debuggability; (3) Dependencies could benefit from more frequent updates and vulnerability scanning. The library has no obvious injection vulnerabilities (no database/web operations), no hardcoded secrets, and no infrastructure exposure. The main risks are arithmetic safety in performance-critical paths and dependency maintenance. Recommendations: enable overflow checks, use regular vulnerability audits, update dependencies, and add extensive test coverage for numerical correctness.

Medium · Overflow Checks Disabled in Production — Cargo.toml - [profile.release]. The release profile has overflow-checks = false, which disables runtime checks for integer overflow/underflow. In a numerical algorithm library dealing with distance calculations and indexing, arithmetic overflow could lead to incorrect results or memory safety issues. Fix: Enable overflow checks by setting overflow-checks = true in the release profile, or at minimum for security-critical paths. Test performance impact before deployment.
Medium · LTO Enabled with Fat Link-Time Optimization — Cargo.toml - [profile.release]. The release profile uses lto = "fat" which performs aggressive link-time optimization. While generally safe, this can make debugging security issues harder and may interact unpredictably with SIMD code. Combined with disabled overflow checks, this increases risk. Fix: Use lto = true (thin LTO) instead of lto = "fat" for a balance between performance and safety. Verify SIMD behavior with comprehensive testing.
Low · Outdated Dependency Versions — Cargo.toml - [dependencies]. Several dependencies use caret version constraints (e.g., log = "^0.4", rayon = "^1.5"), which may allow pulling in newer minor/patch versions with potential vulnerabilities. Some pinned versions are relatively old (rand 0.8.4 from 2021, fixedbitset 0.4.0, etc.). Fix: Run cargo audit regularly to check for known vulnerabilities. Consider updating to latest stable versions and re-test. Use lock files in production for reproducibility.
Low · Optional Feature: no_std Without Safety Review — Cargo.toml - [features]. The no_std feature depends on hashbrown but there's no evidence of comprehensive safety audits for no_std environments. Fallback memory allocators in no_std contexts could hide issues. Fix: Document the security model for no_std usage. Ensure memory allocation strategies are well-tested. Consider marking as experimental if not production-ready.
Low · SIMD Feature Uses External Packed SIMD — Cargo.toml - [features] and src/core/simd_metrics.rs. The SIMD feature depends on packed_simd_2 (version 0.3.6), an optional dependency. SIMD code can have platform-specific behavior and may bypass certain safety checks. No clear documentation about platform support or validation. Fix: Document SIMD platform requirements and validation procedures. Add comprehensive tests for SIMD correctness across target platforms. Consider security implications of performance optimizations.

LLM-derived; treat as a starting point, not a security audit.

👉Where to read next

Open issues — current backlog
Recent PRs — what's actively shipping
Source on GitHub

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

hora-search/hora

Embed the "Healthy" badge

Onboarding doc

Onboarding: hora-search/hora

🤖Agent protocol

🎯Verdict

✅Verify before trusting

⚡TL;DR

👥Who it's for

🌱Maturity & risk

Active areas of work

🚀Get running

🗺️Map of the codebase

🛠️How to make changes

Add a New ANN Index Algorithm

Add a New Distance Metric

Extend an Existing Index with New Parameters

Optimize a Hot Path with SIMD

🔧Why these technologies

⚖️Trade-offs already made

🪤Traps & gotchas

🏗️Architecture

💡Concepts to learn

🔗Related repos

🪄PR ideas

Add comprehensive integration tests for all index implementations

Add benchmarking suite comparing all index types across different datasets

Add SIMD optimization tests and documentation for the simd feature

🌿Good first issues

⭐Top contributors

Top contributors

📝Recent commits

Recent commits

🔒Security observations

👉Where to read next