RepoPilot

quickwit-oss/tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Healthy

Healthy across the board

Use as dependency — Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit today
  • 25+ active contributors
  • Distributed ownership (top contributor 19% of recent commits)
  • MIT licensed
  • CI configured
  • Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — the badge updates automatically from the latest cached analysis.

Variant: RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/quickwit-oss/tantivy)](https://repopilot.app/r/quickwit-oss/tantivy)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/quickwit-oss/tantivy on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: quickwit-oss/tantivy

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in the "Verify before trusting" section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/quickwit-oss/tantivy shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit today
  • 25+ active contributors
  • Distributed ownership (top contributor 19% of recent commits)
  • MIT licensed
  • CI configured
  • Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live quickwit-oss/tantivy repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/quickwit-oss/tantivy.

What it runs against: a local clone of quickwit-oss/tantivy — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in quickwit-oss/tantivy | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>quickwit-oss/tantivy</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of quickwit-oss/tantivy. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/quickwit-oss/tantivy.git
#   cd tantivy
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of quickwit-oss/tantivy and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "quickwit-oss/tantivy(\.git)?\b" \
  && ok "origin remote is quickwit-oss/tantivy" \
  || miss "origin remote is not quickwit-oss/tantivy (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "^license\s*=\s*\"MIT\"" Cargo.toml 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "Cargo.toml" \\
  && ok "Cargo.toml" \\
  || miss "missing critical file: Cargo.toml"
test -f "ARCHITECTURE.md" \\
  && ok "ARCHITECTURE.md" \\
  || miss "missing critical file: ARCHITECTURE.md"
test -f "columnar/src/lib.rs" \\
  && ok "columnar/src/lib.rs" \\
  || miss "missing critical file: columnar/src/lib.rs"
test -f "bitpacker/src/lib.rs" \\
  && ok "bitpacker/src/lib.rs" \\
  || miss "missing critical file: bitpacker/src/lib.rs"
test -f "src/lib.rs" \\
  && ok "src/lib.rs" \\
  || miss "missing critical file: src/lib.rs"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/quickwit-oss/tantivy"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Tantivy is a full-text search engine library written in Rust, inspired by Apache Lucene, that provides indexing, tokenization, BM25 scoring, and natural query language support for building search applications. It is designed as a library (not a server) for embedding search capabilities into Rust applications, with sub-100ms startup time and support for phrase queries, range queries, and faceted search. The repository is a monorepo: the core library lives at the root (src/) alongside specialized crates: bitpacker/ (SIMD integer compression), columnar/ (columnar storage via tantivy-columnar), and sstable/ (sorted string tables via tantivy-sstable). Benchmarks sit in benches/ with real datasets (alice.txt, wiki.json, hdfs.json); CI workflows and GitHub issue templates are in .github/.

👥Who it's for

Rust developers and search engine builders who need to embed fast, configurable full-text search into applications without running a separate Elasticsearch or Solr instance. Also used by Quickwit (a distributed search engine built on Tantivy) and developers building CLI tools or embedded search features.

🌱Maturity & risk

Production-ready and actively maintained. The codebase is substantial (4.8M lines of Rust), has comprehensive CI/CD via GitHub Actions (test.yml, coverage.yml, long_running.yml), and follows semantic versioning (currently 0.26.0 with MSRV 1.86). Regular releases and active community (Discord chat linked in README) indicate ongoing development and support.

Low risk for a mature library. Dependency surface is well-managed with explicit feature flags (memmap2, lz4_flex, zstd optional). The codebase has strong test coverage (coverage.yml workflow present) and security scanning (OpenSSF Scorecard badge). Single-author origins (Paul Masurel) but now community-driven under quickwit-oss organization, reducing single-maintainer risk.

Active areas of work

Active development with recent version 0.26.0. The repository includes a .claude/skills/ directory, suggesting AI-assisted development workflows (rationalize-deps, simple-pr, update-changelog skills). Long-running tests and coverage workflows indicate a focus on quality and performance-regression detection.

🚀Get running

git clone https://github.com/quickwit-oss/tantivy.git
cd tantivy
cargo build
cargo test

Daily commands:

cargo build --release
# Run benchmarks:
cargo bench
# Run tests:
cargo test --all
# Build documentation:
cargo doc --open
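
Once it builds, embedding the library follows the basic-usage pattern from tantivy's documentation. A minimal sketch (signatures drift across 0.x releases; cross-check the examples shipped in the repo for the version you pin):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Schema with a single indexed + stored text field.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    // In-RAM index for the sketch; Index::create_in_dir persists to disk.
    let index = Index::create_in_ram(schema);

    // Writer with a ~50 MB indexing budget; commit publishes a
    // searchable segment.
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(title => "The Old Man and the Sea"))?;
    writer.commit()?;

    // Parse a query against the title field and collect the top 10 hits.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query = QueryParser::for_index(&index, vec![title]).parse_query("sea")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("{} hit(s)", top_docs.len());
    Ok(())
}
```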

🗺️Map of the codebase

  • Cargo.toml — Root manifest defining tantivy as a Rust search engine library with 0.26.0 version and all core dependencies (regex, memmap2, tantivy-fst); essential for understanding the project scope and external dependencies.
  • ARCHITECTURE.md — High-level architectural overview documenting design decisions, data structures, and component relationships; required reading for understanding how the search engine is organized.
  • columnar/src/lib.rs — Columnar storage abstraction layer that manages column-oriented data structures for efficient aggregations and analytics; core to tantivy's performance characteristics.
  • bitpacker/src/lib.rs — Bit-packing compression library with SIMD optimizations (AVX2) for efficient integer encoding; critical for index storage efficiency and query performance.
  • src/lib.rs — Main tantivy library entry point exposing the public API for indexing and searching; defines the primary integration surface for users.
  • README.md — Project overview positioning tantivy as an Elasticsearch/Solr alternative written in Rust with links to Quickwit; essential context for contributor motivation.

🛠️How to make changes

Add a New Column Encoding Strategy

  1. Define the new column encoding trait in columnar/src/column/mod.rs matching the Column trait interface (columnar/src/column/mod.rs)
  2. Implement encoding/decoding logic in a new file under columnar/src/column/ (e.g., columnar/src/column/my_encoding.rs) (columnar/src/column/dictionary_encoded.rs)
  3. Register the new encoding in the column factory/dispatcher within the same module (columnar/src/column/mod.rs)
  4. Add serialization support in columnar/src/column/serialize.rs for persistence (columnar/src/column/serialize.rs)
  5. Add benchmark in columnar/benches/ to measure performance (columnar/benches/bench_access.rs)
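
Steps 1–2 boil down to writing an encode/decode pair behind a common interface. A purely illustrative sketch; `ColumnCodec` here is a hypothetical stand-in for the real trait in columnar/src/column/mod.rs, which you should read first:

```rust
// Hypothetical trait for illustration only; mirror the real Column
// interface from columnar/src/column/mod.rs instead.
trait ColumnCodec {
    fn encode(&self, values: &[u64], out: &mut Vec<u8>);
    fn decode(&self, bytes: &[u8]) -> Vec<u64>;
}

/// Delta encoding: store the first value and successive differences.
/// Sorted or slowly-varying columns yield small deltas that downstream
/// bit-packing compresses well.
struct DeltaCodec;

impl ColumnCodec for DeltaCodec {
    fn encode(&self, values: &[u64], out: &mut Vec<u8>) {
        let mut prev = 0u64;
        for &v in values {
            out.extend_from_slice(&v.wrapping_sub(prev).to_le_bytes());
            prev = v;
        }
    }

    fn decode(&self, bytes: &[u8]) -> Vec<u64> {
        let mut acc = 0u64;
        bytes
            .chunks_exact(8)
            .map(|chunk| {
                acc = acc.wrapping_add(u64::from_le_bytes(chunk.try_into().unwrap()));
                acc
            })
            .collect()
    }
}
```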

Add a New Query Type

  1. Create query struct in src/query/ or update existing query module with new variant (src/lib.rs)
  2. Implement the Query trait methods (explain, weight) for the new type (src/lib.rs)
  3. Add benchmark in benches/ (e.g., benches/my_query.rs) for performance validation (benches/and_or_queries.rs)
  4. Update query parser (if applicable) to support syntax for the new query type (src/lib.rs)

Optimize a Hot Path with Bitpacking

  1. Profile the target operation using existing benchmarks in benches/ (benches/intersection_bench.rs)
  2. Implement scalar version in bitpacker/src/blocked_bitpacker.rs or filter_vec/scalar.rs (bitpacker/src/filter_vec/scalar.rs)
  3. Add SIMD-optimized variant in bitpacker/src/filter_vec/avx2.rs with CPU feature guards (bitpacker/src/filter_vec/avx2.rs)
  4. Update bitpacker/src/lib.rs to expose the new optimization with runtime CPU feature detection (bitpacker/src/lib.rs)
  5. Add benchmark in bitpacker/benches/bench.rs to validate improvement (bitpacker/benches/bench.rs)
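
Steps 3–4 hinge on the standard runtime-dispatch pattern: compile the AVX2 body behind #[target_feature] and select it only when the running CPU reports support. A sketch; `filter_avx2` and `filter_scalar` are hypothetical stand-ins for the real entry points in bitpacker/src/filter_vec/:

```rust
use std::ops::RangeInclusive;

pub fn filter_vec(values: &mut Vec<u32>, range: RangeInclusive<u32>) {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check: the binary may be built without AVX2 yet run
        // on a CPU that has it, so detection happens at call time.
        if is_x86_feature_detected!("avx2") {
            // Safety: guarded by the feature check above.
            return unsafe { filter_avx2(values, range) };
        }
    }
    filter_scalar(values, range)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn filter_avx2(values: &mut Vec<u32>, range: RangeInclusive<u32>) {
    // The real SIMD body goes here; the sketch reuses the scalar logic.
    filter_scalar(values, range)
}

fn filter_scalar(values: &mut Vec<u32>, range: RangeInclusive<u32>) {
    values.retain(|v| range.contains(v));
}
```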

🔧Why these technologies

  • Rust — Memory safety without garbage collection, enabling the predictable performance a search engine at Lucene-equivalent scale demands; SIMD vectorization capabilities for compression and filtering.
  • Column-oriented storage (columnar crate) — Efficient aggregations and bulk analytics; dramatically faster for range queries and faceting compared to row-oriented inverted index alone.
  • Bit-packing with SIMD (bitpacker crate) — Integer compression reduces memory footprint and I/O, with AVX2 vectorization enabling 4–8x faster integer filtering than scalar CPU code.
  • Memory-mapped I/O (memmap2) — Zero-copy access to large indexes already on disk; OS handles paging transparently, avoiding explicit file read syscalls.
  • FST (Finite State Transducer) for term dictionaries (tantivy-fst) — Compact storage and fast prefix/fuzzy term lookups; more space-efficient than hash tables for millions of terms.
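
The memmap2 bullet in concrete terms: opening a segment file as a byte slice, per memmap2's documented API (Mmap::map is unsafe because the mapping is only sound while the underlying file stays unmodified, which tantivy's immutable segment files guarantee):

```rust
use memmap2::Mmap;
use std::fs::File;

fn map_segment(path: &str) -> std::io::Result<()> {
    let file = File::open(path)?;
    // Safety: sound only if no other process truncates or rewrites the
    // file while mapped; immutable segment files satisfy this.
    let mmap = unsafe { Mmap::map(&file)? };
    // The whole file is addressable as &[u8] with no read() syscalls;
    // the OS pages bytes in lazily on first access.
    let bytes: &[u8] = &mmap;
    println!("mapped {} bytes", bytes.len());
    Ok(())
}
```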

⚖️Trade-offs already made

  • In-process library vs. standalone server

    • Why: Tantivy is embedded (like Lucene); Quickwit is the distributed server wrapper. This allows bare-metal performance and simpler deployment for embedded use cases.
    • Consequence: Users must manage index distribution, replication, and scaling themselves; no built-in cluster support in tantivy itself.
  • Columnar storage layer alongside inverted index

    • Why: Inverted index excels at boolean queries; columns excel at aggregations and range filters. Both needed for comprehensive search+analytics.
    • Consequence: Slightly larger index size due to dual representation; more complex merge and serialization logic (see columnar/src/column_index/merge/).
  • Bit-packing with SIMD optimizations

    • Why: Dramatic CPU time savings (4–8x on integer filtering); essential for query performance at scale.
    • Consequence: Requires platform-specific SIMD code (AVX2, scalar fallback); CPU feature detection adds complexity; not all postings benefit equally.
  • Segment-based indexing (append-only segments + merges)

    • Why: Lucene-inspired model; allows near-real-time search (publish segments as they fill) and efficient deletion via tombstones.
    • Consequence: Merge overhead during heavy indexing; query must search all segments; segment explosion without disciplined merging.

🚫Non-goals (don't propose these)

  • Does not provide distributed search or clustering (that is Quickwit's role).
  • Not a database engine; no ACID transactions, no multi-statement consistency.
  • Not a document store; indexes are optimized for search, not full document retrieval (though snippets/highlighting are supported).
  • No real-time replication or failover; each index instance is standalone.
  • Does not handle authentication or multi-tenancy; that is the responsibility of the embedding application or wrapper (e.g., Quickwit).

🪤Traps & gotchas

No major hidden traps. Key points:

  • Tantivy is a library, not a server; it must be embedded into an application (see Quickwit for a server example).
  • Segment merging is automatic but can be tuned via IndexWriter settings; understand the segment-based architecture before optimizing.
  • Memory-mapped I/O is optional (memmap2 feature); choose based on your deployment target.
  • Tokenizers require explicit language selection (e.g., Language::English for stemming); see the sketch below.
  • Some features, such as columnar storage and sstable, sit behind optional feature flags in separate workspace crates.
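
A minimal sketch of the tokenizer trap, using tantivy's tokenizer-chaining API (builder-style chaining landed around 0.20; method names may differ in older versions):

```rust
use tantivy::tokenizer::{Language, LowerCaser, SimpleTokenizer, Stemmer, TextAnalyzer};
use tantivy::Index;

fn register_english_stemming(index: &Index) {
    // Stemming is opt-in and per-language: nothing infers the language
    // of your corpus for you.
    let en_stem = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::English))
        .build();
    // Fields opt in by naming this tokenizer in their TextOptions.
    index.tokenizers().register("custom_en_stem", en_stem);
}
```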

🏗️Architecture

💡Concepts to learn

  • Segment-based indexing — Tantivy divides the index into immutable segments like Lucene; understanding how segments are created, merged, and searched is essential for using the IndexWriter and Searcher APIs correctly
  • BM25 scoring — Tantivy's default relevance function; you need to understand BM25 parameters (k1, b) when tuning search quality and interpreting score values (a worked formula follows this list)
  • Finite-state transducers (FST) — Tantivy uses tantivy-fst for memory-efficient term dictionaries and prefix queries; understanding FSTs explains why term lookups are fast and why term enumeration is supported
  • Inverted index — Core data structure in Tantivy for mapping terms to documents; the entire library is built around efficient inverted index construction and querying
  • Columnar storage — Tantivy's columnar crate (tantivy-columnar) enables fast faceting, filtering, and aggregations by storing fields column-wise; critical for analytics queries
  • Stemming and tokenization — Tantivy includes rust-stemmers for 17 Latin languages; proper tokenization strategy (choice of stemmer, case handling, stopwords) directly impacts search recall and precision
  • Integer compression (bitpacking) — Tantivy compresses posting lists and columnar data using bitpacking; this is why Tantivy achieves fast search with low memory footprint despite large indexes
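
Pinning down the BM25 bullet: tantivy follows the Lucene-style formulation. With $f(q_i, D)$ the frequency of term $q_i$ in document $D$, $|D|$ the document length in tokens, and $\mathrm{avgdl}$ the mean document length, the score is

$$\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

$k_1$ caps term-frequency saturation and $b$ scales length normalization; $k_1 = 1.2$, $b = 0.75$ are the conventional Lucene defaults (confirm the constants tantivy actually pins in its BM25 implementation under src/query/ before tuning).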

Related repos to study:

  • quickwit-oss/quickwit — Distributed search engine built on top of Tantivy; shows production use of Tantivy at scale with cloud storage and multi-node coordination
  • BurntSushi/ripgrep — Uses regex-automata and finite-state machines for fast text search; demonstrates Rust patterns Tantivy also employs for performance
  • tantivy-search/tantivy-jieba — Official Chinese tokenizer plugin for Tantivy; shows how to extend the tokenizer ecosystem beyond built-in stemmers
  • tantivy-search/tantivy-tokenizer-tiny-segmenter — Official Japanese tokenizer plugin; another language-extension example for CJK text support
  • apache/lucene — Original inspiration and design reference for Tantivy's segment-based index structure and BM25 scoring implementation

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive benchmark suite for columnar storage performance

The columnar/ subdirectory has multiple benches (bench_access.rs, bench_column_values_get.rs, etc.) but they're not integrated into the main CI/CD pipeline. Adding a GitHub Actions workflow to track columnar performance regressions would prevent performance degradation and help contributors understand the impact of their changes on the core data structure performance.

  • [ ] Create .github/workflows/bench-columnar.yml to run columnar/benches/* benchmarks on PR submissions
  • [ ] Set up baseline tracking using criterion or similar to compare against main branch
  • [ ] Document expected performance ranges in columnar/README.md with instructions for contributors
  • [ ] Ensure bench results are reported as workflow artifacts or comments on PRs

Add integration tests for bitpacker SIMD implementations across platforms

The bitpacker/src/filter_vec/ contains platform-specific implementations (avx2.rs, scalar.rs) but there's no evidence of cross-platform integration tests verifying correctness equivalence between scalar and SIMD paths. This is critical for correctness and would prevent subtle bugs from being introduced.

  • [ ] Create bitpacker/tests/filter_vec_equivalence.rs with test cases that verify AVX2 and scalar implementations produce identical results
  • [ ] Add parameterized tests covering edge cases (empty vectors, single elements, large datasets, various bit widths)
  • [ ] Update .github/workflows/test.yml to run bitpacker tests on both x86_64 and non-SIMD compatible runners
  • [ ] Document the test coverage in bitpacker/README.md (if it doesn't exist, create it)
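
A hedged sketch of the equivalence test (proptest-based; `bitpacker::filter_vec_in_place` is a hypothetical name, so point it at whatever public entry the crate's dispatcher actually exposes, ensuring the AVX2 path runs on capable CPUs):

```rust
// Hypothetical test file: bitpacker/tests/filter_vec_equivalence.rs.
use proptest::prelude::*;

/// Straight-line reference model: keep values inside [lo, hi].
fn reference_filter(values: &[u32], lo: u32, hi: u32) -> Vec<u32> {
    values.iter().copied().filter(|v| (lo..=hi).contains(v)).collect()
}

proptest! {
    #[test]
    fn simd_matches_reference(
        values in prop::collection::vec(any::<u32>(), 0..4096),
        bounds in (any::<u32>(), any::<u32>()),
    ) {
        let (lo, hi) = (bounds.0.min(bounds.1), bounds.0.max(bounds.1));
        let mut candidate = values.clone();
        // Hypothetical entry point; substitute the crate's real API.
        bitpacker::filter_vec_in_place(&mut candidate, lo..=hi);
        prop_assert_eq!(candidate, reference_filter(&values, lo, hi));
    }
}
```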

Create missing documentation for .claude/skills and establish contributor workflow guide

The .claude/skills/ directory contains three skill modules (rationalize-deps, simple-pr, update-changelog) with SKILL.md files, but there's no root documentation explaining how contributors should use these skills or what they're for. This creates friction for new contributors who don't understand the development workflow.

  • [ ] Create .claude/README.md documenting the purpose of Claude-assisted development tasks and how to trigger/use each skill
  • [ ] Update CONTRIBUTING.md (or create it if missing) with a 'Development Workflow' section referencing the Claude skills
  • [ ] Add examples to each SKILL.md showing when/why a contributor would use that skill
  • [ ] Reference the skills in the main README.md's 'Contributing' section

🌿Good first issues

  • Add integration tests for the tantivy-columnar crate (columnar/) to document column-oriented query filtering patterns—currently sparse test coverage for this key subcomponent.
  • Expand query_parser benchmarks (benches/query_parser_nested.rs) to cover edge cases like deeply nested boolean queries and add regression tests to prevent parser performance degradation.
  • Document tokenizer customization examples in a new examples/custom_analyzer.rs showing how to chain stemming + filtering + edge-ngram generation for autocomplete, filling gap between README and API docs.


📝Recent commits

  • edfb02b — switch to enum, fix mixed types for cardinality agg (PSeitz)
  • d0fad88 — use bitsets for card agg (PSeitz)
  • 351280c — add card bench for high card (PSeitz)
  • 4480cf0 — Enable BMW for single-scorer boolean queries by removing early return in scorer_union (#2915) (jamessewell)
  • d47abdf — early cut off for order by sub agg in term agg (PSeitz)
  • c11952e — add order by agg benchmark (PSeitz)
  • 09667ee — Merge pull request #2909 from osyniakov/claude/add-ossf-scorecard-1z6Vn (trinity-1686a)
  • 333ccf5 — Merge pull request #2896 from osyniakov/claude/fix-issues-5945-5937-eQm1Q (trinity-1686a)
  • 60a39a4 — Merge branch 'main' into claude/fix-issues-5945-5937-eQm1Q (osyniakov)
  • f8f3e42 — remove not neeeded permissions for the public repo (osyniakov)

🔒Security observations

The tantivy codebase demonstrates good security practices overall. It is written in Rust, which provides memory safety guarantees against common vulnerability classes (buffer overflows, use-after-free). The project maintains active CI/CD workflows including security scorecard validation. Dependency versions are reasonably current, though some minor updates are recommended. The main security considerations are: (1) ensuring optional test-focused dependencies like 'fail' are never enabled in production, (2) documenting proper file permission requirements for index storage, and (3) establishing a formal vulnerability disclosure policy. No critical security vulnerabilities were identified in the static analysis of the provided files and dependency manifest.

  • Low · Outdated dependency: regex — Cargo.toml - dependency: regex = 1.5.5. The regex crate is pinned to version 1.5.5, which is significantly outdated (current stable is 1.10+). Rust's regex crate is generally immune to ReDoS because it avoids backtracking entirely (patterns compile to finite automata), but newer versions carry performance improvements and bug fixes. Fix: Update regex to a recent stable version (1.10+) and test compatibility with the codebase.
  • Low · Outdated dependency: bitpacking — Cargo.toml - dependency: bitpacking = 0.9.3. The bitpacking crate is pinned to version 0.9.3. While this is a relatively recent version, checking for any known vulnerabilities or updates would be prudent for a security-sensitive search engine library. Fix: Verify the latest version of bitpacking (currently 0.10+) and evaluate migration if security patches are available.
  • Low · Optional feature security consideration: fail crate — Cargo.toml - optional dependency: fail = 0.5.0. The 'fail' crate (v0.5.0) is included as an optional dependency for failure injection testing. While useful for testing, ensure this feature is never enabled in production builds, as it could allow injected failures. Fix: Verify that the 'fail' feature is not included in production releases. Add CI/CD checks to ensure production builds exclude this feature.
  • Low · Memory-mapped file usage without explicit permissions verification — Cargo.toml - optional dependency: memmap2 = 0.9.0 and likely used in various modules. The crate uses memmap2 for memory-mapped file I/O (optional feature). While generally safe in Rust, ensure that index files are stored with appropriate filesystem permissions to prevent unauthorized access to search indexes. Fix: Document security requirements for index file permissions and validate file access controls in deployment documentation.
  • Low · No explicit security policy or vulnerability disclosure process visible — Repository root - missing SECURITY.md or similar. While the repository has security workflows (scorecard.yml), there is no visible SECURITY.md file or vulnerability disclosure policy mentioned in the provided structure. Fix: Create a SECURITY.md file with a vulnerability disclosure policy and contact information for security researchers.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
