blevesearch/bleve

Item: blevesearch/bleve
Rating: 5
Author: RepoPilot

A modern text/numeric/geo-spatial/vector indexing library for go

Healthy

Healthy across the board

weakest axis

Use as dependencyHealthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modifyHealthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isHealthy

No critical CVEs, sane security posture — runnable as-is.

✓Last commit 2d ago
✓11 active contributors
✓Distributed ownership (top contributor 32% of recent commits)

Show all 6 evidence items →

✓Apache-2.0 licensed
✓CI configured
✓Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:

[![RepoPilot: Healthy](https://repopilot.app/api/badge/blevesearch/bleve)](https://repopilot.app/r/blevesearch/bleve)

Paste at the top of your README.md — renders inline like a shields.io badge.

▸Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/blevesearch/bleve on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: blevesearch/bleve

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/blevesearch/bleve shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

Last commit 2d ago
11 active contributors
Distributed ownership (top contributor 32% of recent commits)
Apache-2.0 licensed
CI configured
Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live blevesearch/bleve repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/blevesearch/bleve.

What it runs against: a local clone of blevesearch/bleve — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in blevesearch/bleve | Confirms the artifact applies here, not a fork | | 2 | License is still Apache-2.0 | Catches relicense before you depend on it | | 3 | Default branch master exists | Catches branch renames | | 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 5 | Last commit ≤ 32 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>blevesearch/bleve</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of blevesearch/bleve. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/blevesearch/bleve.git
#   cd bleve
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of blevesearch/bleve and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "blevesearch/bleve(\\.git)?\\b" \\
  && ok "origin remote is blevesearch/bleve" \\
  || miss "origin remote is not blevesearch/bleve (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Apache-2\\.0)" LICENSE 2>/dev/null \\
   || grep -qiE "\"license\"\\s*:\\s*\"Apache-2\\.0\"" package.json 2>/dev/null) \\
  && ok "license is Apache-2.0" \\
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \\
  && ok "default branch master exists" \\
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "search.go" \\
  && ok "search.go" \\
  || miss "missing critical file: search.go"
test -f "index.go" \\
  && ok "index.go" \\
  || miss "missing critical file: index.go"
test -f "analysis/analyzer/standard/standard.go" \\
  && ok "analysis/analyzer/standard/standard.go" \\
  || miss "missing critical file: analysis/analyzer/standard/standard.go"
test -f "analysis/freq.go" \\
  && ok "analysis/freq.go" \\
  || miss "missing critical file: analysis/freq.go"
test -f "go.mod" \\
  && ok "go.mod" \\
  || miss "missing critical file: go.mod"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 32 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~2d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/blevesearch/bleve"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

⚡TL;DR

Bleve is a production-grade Go library for building full-text search, numeric range, geospatial, and vector similarity indices. It supports indexing JSON/Go structs with pluggable analyzers (text, datetime, geo), multiple field types (text, number, geopoint, vector), and sophisticated queries (phrase, fuzzy, wildcard, k-NN), using the Scorch index format as its default storage backend for performance and flexibility. Monolithic Go library organized by feature domain: analysis/ (tokenizers, stemmers, char filters, datetime formats), index/ (storage backends like Scorch), search/ (query types and execution), mapping/ (schema definitions), geo/ and vector support integrated at top level. No separate CLI or examples dirs visible in file list; primarily a library consumed as github.com/blevesearch/bleve/v2.

👥Who it's for

Go developers building search-enabled applications (e.g., document management systems, product catalogs, log search platforms) who need embedded or standalone search without operating Elasticsearch or Solr. Also used by projects like Couchbase Server that require fast, configurable indexing.

🌱Maturity & risk

Production-ready and actively developed. The codebase is substantial (~4.2MB Go code), has comprehensive CI/CD via GitHub Actions (cover.yml, tests.yml), dense test coverage across analysis and index subsystems, and imports from stable ecosystem packages (RoaringBitmap, etcd/bbolt, FAISS). Recent commits visible; no signs of abandonment.

Moderate complexity with 20+ blevesearch-owned dependencies (zapx v11-v17, scorch_segment_api, go-faiss, segment, vellum) creating potential supply-chain risk if any are compromised. Large API surface (multiple query types, field types, analyzers) means breaking changes are possible between major versions. Single organization (blevesearch) maintains most deps, so organizational changes could impact maintenance.

Active areas of work

Active development focused on vector search (go-faiss integration), hybrid search (RRF/RSF score fusion per docs), and maintaining Scorch as the primary index backend. CI workflows (tests.yml, cover.yml) indicate continuous integration is enabled. Recent work on datetime analysis flexibility and multiple Zap versions (v11-v17) suggests ongoing index format compatibility.

🚀Get running

go get github.com/blevesearch/bleve/v2
cd /path/to/your/go/project
# Then: import "github.com/blevesearch/bleve/v2" and use in your code

For local development: git clone https://github.com/blevesearch/bleve.git && cd bleve && go test ./... (no explicit make/npm steps visible; uses native Go toolchain).

Daily commands: Not a runnable service; a library. Import and use in Go code. Example (from README): mapping := bleve.NewIndexMapping(); index, err := bleve.New("example.bleve", mapping); index.Index("doc_id", data). For testing: go test -v ./... or go test -cover ./... (per cover.yml CI workflow).

🗺️Map of the codebase

search.go — Core search interface and entry point; defines the primary API for indexing and querying
index.go — Index interface and management; orchestrates all indexing operations and segment coordination
analysis/analyzer/standard/standard.go — Default text analyzer; critical for understanding how documents are tokenized and analyzed
analysis/freq.go — Token frequency analysis; fundamental to scoring and relevance calculations
go.mod — Dependency manifest; lists all core indexing, storage, and analysis libraries that underpin bleve
README.md — Project overview and feature description; essential context for understanding scope and capabilities

🧩Components & responsibilities

Text Analyzer (Snowball, Porter Stemmer, Unicode handlers, language-specific filters) — Converts unstructured text into searchable tokens; applies language rules (stemming, stopwords, normalization)
- Failure mode: Incorrect tokenization leads to missing or spurious search results; unsupported language degrades to simple whitespace split
Index Manager (scorch segment API, concurrent data structures, batch operations) — Orchestrates document indexing, segment lifecycle, and query routing across segments
- Failure mode: Segment corruption causes index unavailability; improper merging leads to bloated index or query slowdown
Storage Layer (LevelDB, RoaringBitmap, memory-mapped files, write-ahead logging) — Persists inverted indexes, posting lists, and field data to disk; handles concurrent read/write access
- Failure mode: Disk full halts indexing; corrupted segments require index rebuild; concurrent writes without locking cause data loss
Query Executor (Roaring Bitmaps for set operations, BM25 scoring formula, term dictionary (V) — Parses queries, locates matching documents, ranks by relevance (BM25), returns results

🛠️How to make changes

Add a new language analyzer

Create analyzer directory structure following pattern in analysis/lang/{language}/ (analysis/lang/{language}/analyzer_{language}.go)
Define language-specific stemmer or normalization rules (analysis/lang/{language}/stemmer_{language}.go or normalize_{language}.go)
Create stop words list for language filtering (analysis/lang/{language}/stop_words_{language}.go)
Compose analyzer from tokenizer, stop filter, and stemmer using custom analyzer pattern (analysis/analyzer/custom/custom.go)
Add tests following existing language test patterns (analysis/lang/{language}/analyzer_{language}_test.go)

Add a new character filter

Create filter directory in analysis/char/{filter_name}/ (analysis/char/{filter_name}/{filter_name}.go)
Implement character-level text transformation logic (analysis/char/{filter_name}/{filter_name}.go)
Add unit tests for the character filter (analysis/char/{filter_name}/{filter_name}_test.go)
Register filter in custom analyzer composition (analysis/analyzer/custom/custom.go)

Add a new date/time format parser

Create parser directory in analysis/datetime/{format_type}/ (analysis/datetime/{format_type}/{format_type}.go)
Implement parsing logic converting datetime strings to indexed format (analysis/datetime/{format_type}/{format_type}.go)
Create comprehensive tests covering format variations (analysis/datetime/{format_type}/{format_type}_test.go)
Integrate into analyzer field mapping configuration (analysis/analyzer/custom/custom.go)

🔧Why these technologies

Go — Efficient memory usage, built-in concurrency for parallel indexing/search, fast compilation, static binaries
scorch (segment-based indexing) — Modern immutable segment design enables concurrent indexing, efficient incremental updates, and snapshot consistency
Roaring Bitmaps (RoaringBitmap/roaring) — Fast set operations for posting lists, efficient memory usage compared to traditional bitsets
LevelDB (goleveldb) — Persistent key-value storage for index segments, write-ahead logging for durability
Snowball/Porter Stemming — Language-agnostic stemming algorithm with implementations in 15+ languages; proven effectiveness for English and European languages
Vellum (go-vellum) — Efficient finite-state automaton library for term dictionaries and prefix queries

⚖️Trade-offs already made

Immutable segment architecture with periodic merging
- Why: Enables lock-free concurrent reads and consistent snapshots
- Consequence: Write latency slightly higher due to segment creation; query must search multiple segments until merged
Pluggable analyzer pipeline vs. monolithic tokenizer
- Why: Flexibility for custom tokenization, filtering, and language-specific rules
- Consequence: Higher complexity in analyzer composition; users must understand filter ordering and interactions
Support for 20+ languages with specific analyzers
- Why: Enable effective search in non-English languages with linguistic rules
- Consequence: Larger binary size and increased code maintenance burden
Vector search via FAISS integration
- Why: Enable semantic/similarity search alongside keyword matching
- Consequence: Additional dependency; separate vector pipeline requires embedding generation upstream

🚫Non-goals (don't propose these)

Does not provide distributed/clustered search out-of-the-box (single-machine library)
Does not handle authentication or access control
Does not provide SQL-like query language (uses domain-specific query DSL)
Does not automatically sync index files across nodes
Not a full-featured search appliance like Elasticsearch (library-only, no REST API)
Does not support real-time replication to replicas

🪤Traps & gotchas

No required env vars or external service dependencies documented. Traps: (1) Field names in mappings must be tagged explicitly; silent no-op if not; (2) Scorch is the default backend but legacy UpsideDown indices still supported—migration not automatic; (3) Vector search requires FAISS compiled bindings (go-faiss wraps native C++ lib via CGO); build failures possible on platforms without proper C++ toolchain; (4) Analysis chains are immutable after index creation; changing analyzers requires re-indexing.

🏗️Architecture

💡Concepts to learn

Inverted Index (Multi-segment Scorch format) — Bleve's core data structure for fast text search; understanding segment merging, posting lists, and term dictionaries explains query performance and index sizing
Term Frequency-Inverse Document Frequency (TF-IDF) and BM25 Scoring — Bleve implements both scoring models (configurable per index); you need to understand these to tune relevance and interpret why certain results rank higher
Finite-State Transducers (Vellum term dictionaries) — Bleve uses FSTs for compact storage and fast prefix/wildcard matching; affects memory footprint and query speed for phrase/fuzzy/wildcard queries
Text Analysis Pipelines (Tokenizers, Stemming, Character Filters) — Bleve's pluggable analysis chain transforms raw text into indexable tokens; understanding this pipeline (char filters → tokenizer → token filters → stemmer) is essential for tuning search quality
Roaring Bitmaps (Set operations for posting lists) — Bleve uses RoaringBitmap/roaring v2 for efficient boolean operations (AND, OR, NOT) across posting lists; critical for compound queries and memory efficiency
Approximate Nearest Neighbor Search (via FAISS) — Bleve's vector search leverages FAISS for k-NN queries on high-dimensional embeddings; understanding indexing methods (HNSW, IVF) explains tradeoffs between speed and accuracy
Schema Mapping (Field type inference and analysis binding) — Bleve's mapping system determines how struct fields are indexed (text with stemming vs. keyword vs. numeric range, etc.); misconfigurations silently degrade search quality

olivere/elastic — Go client library for Elasticsearch; if you want to replace bleve's embedded search with a server-based solution, this is the standard Go driver
blevesearch/bleve_index_api — Core API contract (interfaces and protobuf messages) that bleve and external index implementations (Scorch, UpsideDown) must satisfy
blevesearch/scorch_segment_api — Segment-level API that Scorch index backend implements; defines how inverted index segments are structured and queried
couchbase/moss — Concurrent key-value store used by bleve as an optional storage backend; relevant if you need MVCC or concurrent access patterns
blevesearch/go-faiss — CGO wrapper around Meta's FAISS library enabling vector similarity search in bleve; essential for k-NN queries

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive vector search benchmarks and tests

The repo includes go-faiss dependency for vector indexing but analysis/benchmark_test.go only benchmarks text analysis. With vector search becoming critical, adding dedicated benchmarks for vector similarity search, HNSW operations, and batch vector indexing would help contributors understand performance characteristics and catch regressions. This is especially valuable given the scorch_segment_api dependency.

[ ] Create search/vector/benchmark_test.go with benchmarks for vector Add, Search, and batch operations
[ ] Add tests in search/vector/vector_test.go covering edge cases (empty vectors, dimension mismatches, large batches)
[ ] Document vector search performance expectations in docs/vector-search.md

Add unit tests for all analysis/lang/* language-specific components

The file structure shows 10+ language analyzers (ar, bg, ca, cjk, etc.) but only ar and ca have visible *_test.go files. Language analyzers are critical for correctness and internationalization. Systematic test coverage gaps mean bugs in stemming, stop word filtering, or normalization could slip through for unsupported languages.

[ ] Add analysis/lang/bg/analyzer_bg_test.go and analysis/lang/bg/stemmer_bg_test.go (Bulgarian has only stop filter tests)
[ ] Add analysis/lang/cjk/analyzer_cjk_test.go with tests for CJK tokenization and segmentation
[ ] Add analysis/lang/{de,en,es,fr,it,pt,ro,ru,sv}/analyzer_*_test.go for any missing language test files in the full repo

Add GitHub Action workflow for dependency security scanning

The repo has cover.yml and tests.yml workflows but no security scanning despite maintaining 50+ dependencies. Given this is a widely-used indexing library, adding dependabot configuration and/or a Snyk/Trivy scanning workflow would catch vulnerable transitive dependencies (e.g., in RoaringBitmap, zapx, or protobuf) before releases.

[ ] Create .github/dependabot.yml to enable automated dependency updates with Go, GitHub Actions grouping
[ ] Add .github/workflows/security.yml that runs 'go mod tidy && go mod verify' and trivy scanning on pull requests
[ ] Update SECURITY.md with policy for reporting vulnerabilities and SLA for patching critical issues

🌿Good first issues

Add missing unit tests for analysis/char/html/html.go—currently present but no corresponding *_test.go file visible in the file list, unlike most other char filters which have test coverage.
Expand analysis/datetime/flexible/flexible_test.go to cover edge cases around timezone handling and malformed input (flexible datetime parsing is notoriously error-prone) following patterns in optional_test.go and sanitized_test.go.
Create end-to-end examples in a new examples/ directory (currently missing from file list) showing: (1) basic indexing + search, (2) custom analyzer setup, (3) vector similarity search with FAISS, (4) geospatial queries—reference the README snippets but make them runnable Go programs.

⭐Top contributors

Click to expand

@CascadingRadium — 32 commits
@abhinavdangeti — 27 commits
@Thejas-bhat — 12 commits
@Likith101 — 12 commits
@capemox — 10 commits

📝Recent commits

Click to expand

f7e4c92 — MB-71397: Upgrade zapx, go-faiss (#2331) (CascadingRadium)
aac22e0 — Upgrade faiss in docs/vectors.md (#2330) (CascadingRadium)
3ce9f4c — MB-62182: Disable training once its been marked complete (#2329) (Thejas-bhat)
d8f2ab9 — Upgrade to go-faiss@v1.1.0; Fix formatting, typos, etc. in docs/ (#2328) (abhinavdangeti)
71b13fe — go fmt ./... (#2327) (CascadingRadium)
2a48049 — MB-71216, MB-71650: Implement fast merge over binary index classes (#2326) (Thejas-bhat)
a9e101a — Fix metrics involving NestedDocuments (#2325) (CascadingRadium)
2c7269a — v2.6.0 doc fixes (#2323) (CascadingRadium)
e5e7e9e — MB-71607: Fixed data corruption in bolt (#2324) (Likith101)
08e551f — Updates to docs/vectors.md for v2.6.0 (#2320) (abhinavdangeti)

🔒Security observations

The blevesearch/bleve project demonstrates generally good security practices with a documented security policy, clear vulnerability reporting procedures, and use of well-maintained dependencies. However, the codebase has several areas needing attention: the go.mod file contains an invalid/future Go version that should be corrected, multiple dependencies are using older versions that may contain known vulnerabilities and should be updated to latest stable releases, and the security policy could be more comprehensive regarding support timelines and patch release procedures. No hardcoded secrets, injection vulnerabilities, or obvious misconfigurations were identified in the provided file structure. The project's focus on text/numeric/geo-spatial/vector indexing doesn't inherently introduce common web application vulnerabilities like SQL injection or XSS, though any downstream integration should validate security assumptions.

Medium · Outdated Go Version — go.mod. The go.mod file specifies 'go 1.25.0', which appears to be a future/invalid version. Go's latest stable version is 1.23.x. This may indicate a misconfiguration or typo in the go.mod file, which could lead to unexpected build behavior or version resolution issues. Fix: Update the Go version to a valid, stable release such as 'go 1.23' or the minimum supported version for the project. Verify the version aligns with the project's actual development and CI/CD pipeline.
Medium · Older Dependency Versions — go.mod. Several dependencies use older versions that may contain known vulnerabilities: golang.org/x/net v0.51.0 (released ~2024), golang.org/x/text v0.35.0, and golang.org/x/sys v0.42.0. These versions are significantly behind the latest releases and may have unpatched security issues. Fix: Run 'go get -u' to update all dependencies to their latest versions, particularly the golang.org/x packages which frequently receive security updates. Review the CHANGELOG for each dependency for any breaking changes before upgrading.
Low · Missing CHANGELOG or Version Information — Repository root. The repository structure doesn't show a CHANGELOG or VERSION file in the provided file listing. This makes it difficult for users to understand what security fixes are included in each release. Fix: Maintain a detailed CHANGELOG documenting security fixes, bug fixes, and feature updates. Include CVE references when addressing known vulnerabilities. Follow semantic versioning clearly.
Low · Limited Security Policy Details — SECURITY.md. The SECURITY.md only mentions supporting the latest release (v2.5.x) but doesn't specify security patch timelines, vulnerability disclosure timelines, or detail which versions receive security updates. Fix: Expand SECURITY.md to include: (1) Clear timeline for security patch releases, (2) Definition of what constitutes a 'supported' version, (3) Timeline for vulnerability acknowledgment and patching, (4) Preferred contact method and expected response times.

LLM-derived; treat as a starting point, not a security audit.

👉Where to read next

Open issues — current backlog
Recent PRs — what's actively shipping
Source on GitHub

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

blevesearch/bleve

Embed the "Healthy" badge

Onboarding doc

Onboarding: blevesearch/bleve

🤖Agent protocol

🎯Verdict

✅Verify before trusting

⚡TL;DR

👥Who it's for

🌱Maturity & risk

Active areas of work

🚀Get running

🗺️Map of the codebase

🧩Components & responsibilities

🛠️How to make changes

Add a new language analyzer

Add a new character filter

Add a new date/time format parser

🔧Why these technologies

⚖️Trade-offs already made

🚫Non-goals (don't propose these)

🪤Traps & gotchas

🏗️Architecture

💡Concepts to learn

🔗Related repos

🪄PR ideas

Add comprehensive vector search benchmarks and tests

Add unit tests for all analysis/lang/* language-specific components

Add GitHub Action workflow for dependency security scanning

🌿Good first issues

⭐Top contributors

Top contributors

📝Recent commits

Recent commits

🔒Security observations

👉Where to read next