RepoPilot

timescale/pgvectorscale

Postgres extension for vector search (DiskANN); complements pgvector for performance and scale. PostgreSQL-licensed open source.

Overall: Healthy across the board.

Weakest axis: Use as dependency (Concerns: non-standard license, PostgreSQL).

Fork & modify: Healthy. Has a license, tests, and CI; a clean foundation to fork and modify.

Learn from: Healthy. Documented and popular; a useful reference codebase to read through.

Deploy as-is: Healthy. No critical CVEs and a sane security posture; runnable as-is.

  • Last commit 1w ago
  • 12 active contributors
  • Distributed ownership (top contributor 31% of recent commits)
  • PostgreSQL licensed
  • CI configured
  • Tests present
  • Non-standard license (PostgreSQL) — review terms
What would change the summary?
  • Use as dependency: currently Concerns; would become Mixed if the license terms were clarified

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/timescale/pgvectorscale)](https://repopilot.app/r/timescale/pgvectorscale)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/timescale/pgvectorscale on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: timescale/pgvectorscale

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/timescale/pgvectorscale shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 1w ago
  • 12 active contributors
  • Distributed ownership (top contributor 31% of recent commits)
  • PostgreSQL licensed
  • CI configured
  • Tests present
  • ⚠ Non-standard license (PostgreSQL) — review terms

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live timescale/pgvectorscale repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/timescale/pgvectorscale.

What it runs against: a local clone of timescale/pgvectorscale — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in timescale/pgvectorscale | Confirms the artifact applies here, not a fork |
| 2 | License is still PostgreSQL | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 38 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>timescale/pgvectorscale</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of timescale/pgvectorscale. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/timescale/pgvectorscale.git
#   cd pgvectorscale
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of timescale/pgvectorscale and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "timescale/pgvectorscale(\.git)?\b" \
  && ok "origin remote is timescale/pgvectorscale" \
  || miss "origin remote is not timescale/pgvectorscale (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(PostgreSQL)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"PostgreSQL\"" package.json 2>/dev/null) \
  && ok "license is PostgreSQL" \
  || miss "license drift — was PostgreSQL at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in \
  "pgvectorscale/src/access_method/build.rs" \
  "pgvectorscale/src/access_method/distance/mod.rs" \
  "pgvectorscale/src/access_method/graph/mod.rs" \
  "pgvectorscale/Cargo.toml" \
  "pgvectorscale/pgvectorscale_derive/src/lib.rs"; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 38 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~8d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/timescale/pgvectorscale"
  exit 1
fi
```

Each check prints `ok:` or `FAIL:`. The script exits non-zero if anything failed, so it composes cleanly into agent loops (`./verify.sh || regenerate-and-retry`).

</details>

TL;DR

pgvectorscale is a Rust-based PostgreSQL extension that extends pgvector with StreamingDiskANN (Microsoft DiskANN-inspired) approximate nearest neighbor indexing, Statistical Binary Quantization compression, and label-filtered vector search. It solves the performance and cost problem of vector similarity search at scale, delivering 28x lower p95 latency and 16x higher throughput than Pinecone on 50M vector datasets. Monorepo structure: pgvectorscale/ contains the main Rust extension using PGRX, with pgvectorscale/pgvectorscale_derive/ providing procedural macros for SQL bindings. SQL migrations live in pgvectorscale/sql/ (versioned migrations 0.0.2→0.9.0). Benchmarks in pgvectorscale/benches/ (distance.rs, lsr.rs). GitHub Actions workflows orchestrate testing across PostgreSQL versions.

👥Who it's for

Database administrators and application developers building AI-powered PostgreSQL systems who need efficient vector similarity search on million+ embedding datasets without the cost of managed vector databases like Pinecone. Rust contributors interested in extending PostgreSQL via the PGRX framework.

🌱Maturity & risk

Actively developed but pre-production: currently at version 0.9.0 (migrations exist from 0.0.2 through 0.9.0 in pgvectorscale/sql/), CI/CD pipelines run (pgrx_test.yaml, python_tests.yml, code_checks.yml), and benchmarks against production systems exist. However, the 0.x version number and 'private beta access' messaging in README indicate this is not yet GA-ready despite solid engineering practices.

Moderate risk: PGRX framework is the single point of compatibility with PostgreSQL versions, and the extension must maintain binary compatibility with pgvector types. Rust's strict type system mitigates memory safety issues, but any breaking changes to the SQL migration chain (pgvectorscale--0.0.2--0.9.0.sql) could impact existing installations. The distributed codebase split between Rust extension and Python testing infrastructure adds integration risk.

Active areas of work

Extension is approaching production stability (0.9.0) with incremental feature and version migrations tracked. SQL upgrade paths span from 0.0.2 through 0.9.0, suggesting active refinement of storage formats and index algorithms. CI testing includes pgrx, Python integration tests, and shell checks—indicating focus on stability.

🚀Get running

git clone https://github.com/timescale/pgvectorscale && cd pgvectorscale && make install (Makefile present). DEVELOPMENT.md contains build-from-source instructions including Rust toolchain setup and pgrx initialization.

Daily commands: make install builds and installs the extension to PostgreSQL. See DEVELOPMENT.md for Docker setup (Dockerfile implied by .dockerignore) or local pg_config-based builds. Tests run via pgrx test framework (workflows use pgrx_test.yaml).

🗺️Map of the codebase

  • pgvectorscale/src/access_method/build.rs — Entry point for index construction; defines the build algorithm that creates StreamingDiskANN indexes and handles parallel index building.
  • pgvectorscale/src/access_method/distance/mod.rs — Core distance computation module; abstracts SIMD distance calculations (x86/aarch64) critical for search performance.
  • pgvectorscale/src/access_method/graph/mod.rs — Graph structure and navigation logic; implements the DiskANN proximity graph traversal for vector search.
  • pgvectorscale/Cargo.toml — Workspace and dependency manifest; defines all Rust crate dependencies, version requirements, and build profiles.
  • pgvectorscale/pgvectorscale_derive/src/lib.rs — Procedural macro for deriving Postgres-compatible types; reduces boilerplate for FFI bindings to pgvector and Postgres types.
  • pgvectorscale/src/access_method/cost_estimate.rs — Query planner integration; provides cost estimation for index scans to guide Postgres optimizer decisions.
  • Makefile — Build orchestration; defines compilation targets, test execution, and release packaging workflows.

🛠️How to make changes

Add a new distance metric to the SIMD kernels

  1. Define the metric computation in the distance abstraction (pgvectorscale/src/access_method/distance/mod.rs)
  2. Implement x86 SIMD version using AVX2/SSE intrinsics (pgvectorscale/src/access_method/distance/distance_x86.rs)
  3. Implement aarch64 NEON version for ARM compatibility (pgvectorscale/src/access_method/distance/distance_aarch64.rs)
  4. Add benchmark to measure throughput across architectures (pgvectorscale/benches/distance.rs)
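The scalar reference implementation in step 1 is usually just a fold over paired elements, which the per-architecture SIMD files then specialize. A minimal sketch in plain Rust of what such a kernel looks like — the function name and signature are illustrative assumptions, not the crate's real API:

```rust
// Illustrative scalar kernel for a hypothetical new metric (L1 / Manhattan
// distance). The real kernels live in src/access_method/distance/ and have
// x86 and aarch64 SIMD specializations; this only shows the shape.
pub fn distance_l1_scalar(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

fn main() {
    // |1-2| + |2-0| + |3-3| = 3
    println!("{}", distance_l1_scalar(&[1.0, 2.0, 3.0], &[2.0, 0.0, 3.0]));
}
```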

Modify graph construction algorithm to support label filtering

  1. Update neighbor selection criteria to filter by labels (pgvectorscale/src/access_method/graph/mod.rs)
  2. Extend start node selection to respect label constraints (pgvectorscale/src/access_method/graph/start_nodes.rs)
  3. Update parallel build to partition by label groups (pgvectorscale/src/access_method/build/parallel.rs)
  4. Add SQL UDFs for filtered search in migration script (pgvectorscale/sql/vectorscale--0.8.0--0.9.0.sql)
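Conceptually, step 1's filtered neighbor selection comes down to checking each candidate's labels against the query's allowed set before it can enter the result list. A hedged sketch — the extension's actual label representation and integration point differ:

```rust
// Hypothetical sketch of label-aware candidate filtering during graph
// search: only candidates carrying the query's label survive. The real
// code in src/access_method/graph/ folds this into neighbor selection
// rather than filtering a finished candidate list.
fn filter_by_label(candidates: &[(usize, u16)], query_label: u16) -> Vec<usize> {
    candidates
        .iter()
        .filter(|(_, label)| *label == query_label)
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    let candidates = [(10, 1), (11, 2), (12, 1)]; // (node id, label)
    println!("{:?}", filter_by_label(&candidates, 1)); // [10, 12]
}
```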

Integrate Binary Quantization compression into index storage

  1. Add quantization logic to neighbor distance caching (pgvectorscale/src/access_method/graph/neighbor_with_distance.rs)
  2. Update neighbor store serialization to compress distances (pgvectorscale/src/access_method/graph/neighbor_store.rs)
  3. Adjust distance computation to decompress on-the-fly during search (pgvectorscale/src/access_method/distance/mod.rs)
  4. Add cost estimation adjustments for quantized indexes (pgvectorscale/src/access_method/cost_estimate.rs)
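The core of binary quantization is thresholding each dimension to one bit and comparing codes by Hamming distance; the "statistical" part of SBQ is choosing per-dimension thresholds well. A simplified sketch under stated assumptions (fixed zero thresholds, `Vec<bool>` instead of packed bits):

```rust
// Simplified 1-bit quantization: each dimension becomes a single bit, and
// similarity is approximated by Hamming distance between codes. The real
// SBQ derives per-dimension thresholds statistically and packs bits into
// machine words for fast popcount comparison.
fn quantize(v: &[f32], thresholds: &[f32]) -> Vec<bool> {
    v.iter().zip(thresholds).map(|(x, t)| x > t).collect()
}

fn hamming(a: &[bool], b: &[bool]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

fn main() {
    let t = [0.0_f32; 4];
    let a = quantize(&[0.5, -0.2, 0.7, -0.9], &t);
    let b = quantize(&[0.4, 0.3, 0.6, -0.1], &t);
    println!("{}", hamming(&a, &b)); // signs differ only in dimension 1 -> 1
}
```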

Add a new SQL diagnostic function for index inspection

  1. Export diagnostic data structure and accessor functions (pgvectorscale/src/access_method/debugging.rs)
  2. Derive Postgres type mapping for the diagnostic struct (pgvectorscale/pgvectorscale_derive/src/lib.rs)
  3. Register the UDF in the latest migration script (pgvectorscale/sql/vectorscale--0.8.0--0.9.0.sql)

🔧Why these technologies

  • Rust + pgrx — Memory-safe systems language with zero-cost abstractions; pgrx provides Postgres C-API bindings without FFI overhead. Critical for index performance and safe concurrent access.
  • SIMD intrinsics (AVX2, NEON) — Vector distance computation is the bottleneck; explicit SIMD achieves 10–100× throughput over scalar code. Dual-architecture support (x86/aarch64) ensures portability.
  • DiskANN algorithm (Microsoft research) — Logarithmic search complexity O(log N) with linear space O(N); enables 50M+ vector searches on commodity hardware without GPU or external vector DB.
  • Postgres extension (not separate service) — Eliminates network round-trips, co-locates compute with storage, enables transactional consistency within SQL. Tight integration with planner for cost-aware query optimization.

⚖️Trade-offs already made

  • Index built in-place during table scan vs. external build service

    • Why: Simplifies deployment and guarantees consistency; avoids distributed system complexity.
    • Consequence: Build time blocks schema mutations; parallel build mitigates but remains slower than external indexing on very large datasets (100M+ vectors).
  • Greedy best-first graph search vs. exact k-NN

    • Why: Greedy search is 1000× faster; DiskANN research shows ≥99% recall with proper tuning.
    • Consequence: Recall is probabilistic and depends on graph degree, start nodes, and search parameters; users must tune trade-off between speed and accuracy.
  • Distance caching in neighbor lists (with optional BQ compression)

    • Why: Avoids redundant distance recomputation during search; BQ reduces memory footprint by 8–16×.
    • Consequence: Index size grows with distance precision; BQ introduces quantization error (~1–2% recall loss); uncompressed indexes are faster but larger.
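The greedy-vs-exact trade-off above can be made concrete: graph search repeatedly hops to whichever neighbor is closer to the query and stops at a local minimum, which is exactly why recall depends on graph quality and tuning. A toy sketch, assuming a plain adjacency list — real DiskANN-style search keeps a beam of candidates, not a single node:

```rust
use std::collections::HashSet;

// Toy greedy best-first walk over a proximity graph: hop to any neighbor
// closer to the query, stop at a local minimum. The real search in
// src/access_method/graph/ keeps a priority queue of candidates and
// revisits them, which is what buys the high recall.
fn greedy_search(
    neighbors: &[Vec<usize>], // adjacency list: node id -> neighbor ids
    vectors: &[Vec<f32>],     // node id -> embedding
    query: &[f32],
    start: usize,
) -> usize {
    let dist = |v: &[f32]| -> f32 {
        v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum()
    };
    let mut current = start;
    let mut visited = HashSet::from([start]);
    loop {
        let mut improved = false;
        for &n in &neighbors[current] {
            if visited.insert(n) && dist(&vectors[n]) < dist(&vectors[current]) {
                current = n;
                improved = true;
            }
        }
        if !improved {
            return current; // local minimum = approximate nearest neighbor
        }
    }
}

fn main() {
    // A chain 0 - 1 - 2 - 3 with 1-D embeddings 0.0 .. 3.0.
    let vectors: Vec<Vec<f32>> = (0..4).map(|i| vec![i as f32]).collect();
    let neighbors = vec![vec![1], vec![0, 2], vec![1, 3], vec![2]];
    println!("{}", greedy_search(&neighbors, &vectors, &[2.9], 0)); // 3
}
```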

🪤Traps & gotchas

PGRX requires a matching PostgreSQL installation detected via pg_config—must install PostgreSQL dev headers before make install. Binary migration path is strict: SQL schemas must maintain backward compatibility for upgrades (every versioned .sql file must work on top of the previous). The extension name changed from 'timescale_vector' to 'vectorscale' (see timescale_vector--0.0.1--0.0.2.sql)—existing databases need rename migration. CONTRIBUTING.md likely specifies code review or CLA requirements; check before opening PRs.

🏗️Architecture

💡Concepts to learn

  • Approximate Nearest Neighbor (ANN) Search with DiskANN — StreamingDiskANN is the core algorithm—understanding graph-based ANN, Delaunay graphs, and disk-friendly layouts is essential to modify index construction and query paths
  • Binary Quantization (BQ) — Statistical Binary Quantization is pgvectorscale's compression layer for 75% cost reduction—understanding how vectors are quantized to 1-bit representations affects storage and recall tuning
  • PostgreSQL Extension API (pgrx) — PGRX procedural macros bridge Rust and SQL—you must understand how #[pg_operator] and #[pg_function] map Rust code to SQL for any feature additions
  • Memory-Mapped I/O (MMIO) for Graph Structures — DiskANN indexes live on disk with selective in-memory buffering—understanding memory mapping avoids O(n) memory usage for billion-scale graphs
  • SQL Schema Migrations and Backward Compatibility — Every pgvectorscale--X--Y.sql file must support upgrades from prior versions—breaking the migration chain leaves users unable to upgrade production databases
  • Filtered / Label-based Vector Search — Label filtering allows WHERE-clause-like constraints during ANN queries (research from Microsoft Filtered DiskANN)—essential for multi-tenant or segmented search use cases
  • Criterion Benchmarking — pgvectorscale/benches/ use Rust's criterion crate for statistical performance testing—understanding how to interpret confidence intervals and regressions is critical for proving index optimization claims
  • pgvector/pgvector — The foundational C-based vector extension pgvectorscale extends—all pgvectorscale indexes operate on pgvector types
  • pgcentralfoundation/pgrx — The Rust framework pgvectorscale is built on—any PGRX version upgrades or macro issues originate here
  • microsoft/DiskANN — The reference algorithm implementation that inspired StreamingDiskANN; understanding the original improves index modifications
  • timescale/timescaledb — Companion Timescale extension for time-series; pgvectorscale designed to integrate with TimescaleDB for time-partitioned vector data
  • postgres/postgres — PostgreSQL source—occasionally referenced for low-level extension compatibility and FunctionCallInfo details

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: `Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.`

Add comprehensive integration tests for SQL migration paths in pgvectorscale/sql/

The repo has 20+ migration files (vectorscale--0.0.2--0.9.0.sql, etc.) but no visible test coverage for migration correctness. New contributors could add a test suite that verifies each migration file executes successfully, maintains data integrity, and doesn't break existing schemas. This is critical for a production extension where failed migrations cause serious issues.

  • [ ] Create tests/sql_migrations/ directory with migration test framework
  • [ ] Add test cases in .github/workflows/ or TESTING.md for running migrations sequentially (0.0.2 → 0.3.0 → 0.9.0)
  • [ ] Test data integrity checks (verify vectors/indexes persist correctly through migrations)
  • [ ] Document in DEVELOPMENT.md how to add tests for new migration files

Add Rust benchmark suite with CI integration for DiskANN index performance regressions

The repo has benches/distance.rs and benches/lsr.rs but no automated performance regression detection in CI. pgvectorscale's core value proposition is performance, so contributors could add benchmarks that run on every PR to catch performance regressions in StreamingDiskANN index operations (insertion, search, quantization).

  • [ ] Expand pgvectorscale/benches/ with comprehensive DiskANN index operation benchmarks (build, search, serialization)
  • [ ] Add new GitHub Actions workflow (.github/workflows/benchmark.yml) to run benchmarks and compare against baseline
  • [ ] Implement results tracking in the workflow (post comments on PRs showing perf delta)
  • [ ] Document benchmark setup in TESTING.md with instructions for local reproduction

Create Python integration test suite for label-based filtered vector search feature

TESTING.md and .github/workflows/python_tests.yml exist but the specific label-filtering feature (mentioned in README) lacks visible test coverage. Contributors could add comprehensive Python tests that validate filtered search correctness, performance characteristics, and edge cases (empty filters, all vectors filtered out, etc.).

  • [ ] Add tests/python/test_label_filtering.py with test cases for: basic filtering, combined filters, empty result sets, boundary conditions
  • [ ] Verify integration with pgvector's vector type in test fixtures
  • [ ] Add test cases for both StreamingDiskANN and standard pgvector indexes with labels
  • [ ] Update .github/workflows/python_tests.yml to explicitly run label filtering tests and document in TESTING.md

🌿Good first issues

  • Add distance metric benchmarks for new dtypes: pgvectorscale/benches/distance.rs only benchmarks known distance functions; adding criterion benchmarks for newly added similarity metrics would establish performance regression baselines and improve test coverage while teaching you the benchmark infrastructure.
  • Document the DiskANN index build parameters in a new doc: the pgvectorscale/sql/*.sql files reference index construction but there is no guide explaining ef_construction, max_neighbors, or other tuning parameters. Such a doc is needed for production deployments and teaches SQL extension patterns.
  • Add an integration test for label-filtered search upgrades: the migration files show label filtering was added, but python_tests.yml may lack an end-to-end test asserting filtered queries work after upgrading from 0.7.1 to 0.9.0. This would catch regressions in the critical label-filtering feature.


📝Recent commits

  • 57c88b7 — style: qualify pg_catalog schema references in extension SQL (#271) (x4m)
  • 4c04103 — Update name to Tiger Data, copyright year (#262) (mfreed)
  • c66cae4 — Release 0.9.0 (#256) (syvb)
  • 7a96623 — Parallel index build (#251) (syvb)
  • 36271fa — PG18 support (#254) (syvb)
  • 6af0ee1 — Prepare 0.8.0 release (#246) (tjgreen42)
  • fa6f1a3 — Fix crash on concurrent insertions (#244) (tjgreen42)
  • db05d01 — Fix for issue #238: crash when scanning index with NULL value (#243) (tjgreen42)
  • adcd492 — Fix #209 rescoring cannot be disabled (#242) (tjgreen42)
  • 2ad79ea — Use base actions image for rustfmt check (#241) (syvb)

🔒Security observations

The pgvectorscale codebase demonstrates good security posture overall. It is a Rust-based PostgreSQL extension with OSS licensing and active CI/CD workflows. No critical vulnerabilities were identified in the visible configuration and file structure. Primary concerns are operational: the aggressive compiler optimizations may complicate debugging, and the lack of a security disclosure policy could hinder responsible vulnerability reporting. The numerous SQL migration files should be carefully managed with comprehensive testing to prevent data integrity issues. Rust's memory safety features provide inherent protection against common vulnerabilities like buffer overflows and use-after-free errors.

  • Medium · Aggressive Compiler Optimizations in Release Build — Cargo.toml (profile.release section). The Cargo.toml uses 'lto = "fat"' with 'codegen-units = 1' and 'opt-level = 3' in release profile. While this improves performance, it may make debugging difficult and could mask certain memory safety issues during development. The commented-out 'debug = true' suggests this was intentionally disabled, which could hinder post-mortem analysis of production issues. Fix: Consider enabling 'debug = true' in release builds to maintain symbol information for crash dumps and profiling, even with aggressive optimizations. Alternatively, create separate debug and release-optimized profiles.
  • Low · Missing SECURITY.md File — Repository root. The repository lacks a SECURITY.md file that typically outlines the vulnerability disclosure process and security contact information. This makes it harder for security researchers to responsibly report vulnerabilities. Fix: Create a SECURITY.md file following GitHub's security best practices, including vulnerability reporting guidelines and security contact information.
  • Low · Multiple Database Migration Files Suggest Version Management Risk — pgvectorscale/sql/ directory. The large number of SQL migration files (vectorscale--*.sql) indicates a long version history. While not inherently a vulnerability, this requires careful attention to ensure all migrations are properly tested and validated, as errors in migration scripts could lead to data loss or corruption. Fix: Implement comprehensive migration testing in CI/CD pipelines. Consider using tools like pg_prove for SQL migration validation. Document migration rollback procedures clearly.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
