apache/horaedb
Apache HoraeDB (incubating) is a high-performance, distributed, cloud native time-series database.
Healthy across all four use cases
- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — a clean foundation to fork and modify.
- Documented and popular — a useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit 3mo ago
- ✓ 14 active contributors
- ✓ Distributed ownership (top contributor 41% of recent commits)
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Slowing — last commit 3mo ago
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste the badge snippet (linking to https://repopilot.app/r/apache/horaedb) at the top of your README.md — it renders inline like a shields.io badge and live-updates from the latest cached analysis.
Social card preview (1200×630): the card auto-renders when someone shares https://repopilot.app/r/apache/horaedb on X, Slack, or LinkedIn.
Onboarding: apache/horaedb
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/apache/horaedb shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit 3mo ago
- 14 active contributors
- Distributed ownership (top contributor 41% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
- ⚠ Slowing — last commit 3mo ago
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live apache/horaedb
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/apache/horaedb.
What it runs against: a local clone of apache/horaedb — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in apache/horaedb | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 121 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of apache/horaedb. If you don't
# have one yet, run these first:
#
# git clone https://github.com/apache/horaedb.git
# cd horaedb
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of apache/horaedb and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "apache/horaedb(\.git)?\b" \
&& ok "origin remote is apache/horaedb" \
|| miss "origin remote is not apache/horaedb (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
# (The Apache-2.0 LICENSE text begins "Apache License", not "Apache-2.0",
# so match the real header; package.json stays as a fallback.)
(grep -qi "Apache License" LICENSE 2>/dev/null \
|| grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
&& ok "license is Apache-2.0" \
|| miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
&& ok "default branch main exists" \
|| miss "default branch main no longer exists"
# 4. Critical files exist
test -f "Cargo.toml" \\
&& ok "Cargo.toml" \\
|| miss "missing critical file: Cargo.toml"
test -f "src/columnar_storage/src/lib.rs" \\
&& ok "src/columnar_storage/src/lib.rs" \\
|| miss "missing critical file: src/columnar_storage/src/lib.rs"
test -f "src/metric_engine/src/lib.rs" \\
&& ok "src/metric_engine/src/lib.rs" \\
|| miss "missing critical file: src/metric_engine/src/lib.rs"
test -f "src/columnar_storage/src/storage.rs" \\
&& ok "src/columnar_storage/src/storage.rs" \\
|| miss "missing critical file: src/columnar_storage/src/storage.rs"
test -f "src/columnar_storage/src/compaction/mod.rs" \\
&& ok "src/columnar_storage/src/compaction/mod.rs" \\
|| miss "missing critical file: src/columnar_storage/src/compaction/mod.rs"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 121 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~91d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/apache/horaedb"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Apache HoraeDB is a high-performance, distributed, cloud-native time-series database written in Rust, designed for ingesting and querying metrics at scale with columnar storage and Prometheus-compatible remote-write support. It features a new metric engine (currently on the main branch, targeting an end-of-2025 release) that improves on the legacy analytic engine with a focus on metric-specific optimizations. Workspace monorepo (Cargo.toml workspace with 6 members): src/metric_engine contains the new engine core, src/columnar_storage handles column-oriented I/O, src/server wraps the gRPC/HTTP interfaces, src/common has shared utilities, src/pb_types contains protobuf definitions, and src/remote_write handles Prometheus-compatible write paths. Benchmarks live in src/benchmarks.
👥Who it's for
Infrastructure teams and monitoring engineers who need to store massive volumes of time-series metrics (like Prometheus remote write scenarios) in a scalable, cloud-native database; DevOps and SRE teams building observability stacks that want better performance and cost characteristics than traditional TSDBs.
🌱Maturity & risk
Early-stage Apache Incubator project with active development. The main branch is undergoing rapid iteration on the new metric engine and remains unstable; the stable legacy engine lives in the analytic-engine branch. Strong CI/CD setup (multiple GitHub Actions workflows) and test infrastructure are present, but the new engine is not yet production-ready (scheduled public release: end of 2025).
High risk for production use on the main branch: it is explicitly marked as unstable and in rapid development. Moderate dependency count (Arrow 53, DataFusion 43, Parquet 53, plus the custom columnar_storage and metric_engine crates), with most deps from the Apache ecosystem (generally stable). Single-maintainer (bus-factor) risk is unlikely given Apache sponsorship and community, but the new metric engine design is unproven in production.
Active areas of work
Active development on the metric engine RFC (docs/rfcs/20240827-metric-engine.md), replacing the legacy analytic engine. The main branch is the development focus, with ongoing Prometheus remote-write performance optimization (see the remote-write performance images in docs/assets/). Stable work happens on the analytic-engine branch.
🚀Get running
Clone the repo: git clone https://github.com/apache/horaedb.git && cd horaedb. Install Rust (rust-toolchain.toml pins the required version). Build: cargo build --release. Run the server: cargo run --bin horaedb-server (the exact binary name depends on feature flags in src/server).
Daily commands:
cargo build --release creates the binaries. The server likely starts via ./target/release/horaedb-server with a config file (docs/example.toml shows an example config). Docker images are available: apache/horaedb-server and apache/horaemeta-server (see .dockerignore for what's excluded from builds).
🗺️Map of the codebase
- Cargo.toml — Root workspace manifest defining all member crates and shared dependencies; required reading for understanding project structure and build configuration.
- src/columnar_storage/src/lib.rs — Core columnar storage engine module; entry point for the primary time-series data persistence layer.
- src/metric_engine/src/lib.rs — Metric engine implementation providing specialized handling for metric workloads; key abstraction for performance optimization.
- src/columnar_storage/src/storage.rs — Storage engine abstraction defining the primary data access interface and lifecycle management.
- src/columnar_storage/src/compaction/mod.rs — Compaction logic orchestration for managing SST files and maintaining read/write performance; critical for long-term storage efficiency.
- src/columnar_storage/src/sst.rs — Sorted String Table (SST) format and operations; fundamental data structure for columnar storage.
- src/common/src/error.rs — Shared error types and handling across all crates; essential for understanding failure modes and error propagation.
🛠️How to make changes
Add a New Compaction Strategy
- Define the strategy struct in src/columnar_storage/src/compaction/picker.rs implementing the selection logic (sketched below).
- Add the strategy variant to the compaction configuration in src/columnar_storage/src/config.rs.
- Update the compaction scheduler in src/columnar_storage/src/compaction/scheduler.rs to invoke your strategy.
- Add tests in src/columnar_storage/src/compaction/mod.rs to verify strategy correctness and performance characteristics.
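A minimal sketch of what the first step could look like. The names here (SstFile, CompactionPicker, SmallestFirstPicker) are illustrative assumptions, not the actual interfaces in picker.rs:

```rust
// Hypothetical sketch; the real trait in
// src/columnar_storage/src/compaction/picker.rs may differ.

/// Metadata the picker needs about one SST file (illustrative).
pub struct SstFile {
    pub id: u64,
    pub size_bytes: u64,
}

/// Assumed strategy interface: given candidates, pick the ids to merge.
pub trait CompactionPicker {
    fn pick(&self, candidates: &[SstFile]) -> Vec<u64>;
}

/// Example strategy: merge the N smallest files, capping file count
/// without rewriting large, already-compacted SSTs.
pub struct SmallestFirstPicker {
    pub max_files: usize,
}

impl CompactionPicker for SmallestFirstPicker {
    fn pick(&self, candidates: &[SstFile]) -> Vec<u64> {
        let mut by_size: Vec<&SstFile> = candidates.iter().collect();
        by_size.sort_by_key(|f| f.size_bytes);
        by_size.iter().take(self.max_files).map(|f| f.id).collect()
    }
}
```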
Extend Metric Engine with New Index Type
- Define the index data structure and query interface in src/metric_engine/src/index/mod.rs (sketched below).
- Add index building logic in the metric module at src/metric_engine/src/metric/mod.rs.
- Update types in src/metric_engine/src/types.rs to reflect the new index structure.
- Add a benchmark under src/benchmarks/src/ (wired up via src/benchmarks/src/lib.rs) to measure query performance.
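A sketch of the first step, assuming the new index is an inverted label index (a common shape for metric stores). LabelIndex and its methods are hypothetical, not the real types in src/metric_engine/src/index/mod.rs:

```rust
// Hypothetical sketch; the actual index interfaces may differ.
use std::collections::HashMap;

/// Inverted index from a (label key, label value) pair to series ids.
#[derive(Default)]
pub struct LabelIndex {
    postings: HashMap<(String, String), Vec<u64>>,
}

impl LabelIndex {
    /// Index-building side: record one series under a label pair.
    pub fn insert(&mut self, key: &str, value: &str, series_id: u64) {
        self.postings
            .entry((key.to_owned(), value.to_owned()))
            .or_default()
            .push(series_id);
    }

    /// Query side: all series ids matching `key == value`.
    pub fn lookup(&self, key: &str, value: &str) -> &[u64] {
        self.postings
            .get(&(key.to_owned(), value.to_owned()))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```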
Add Custom Error Type
- Define the error variant in the error enum within src/common/src/error.rs (sketched below).
- Implement conversion traits (From) in src/common/src/error.rs for easy error propagation.
- Import and use the error type in the crate where it's needed via the common module (src/common/src/lib.rs).
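A self-contained sketch of the first two steps using only std. The real enum in src/common/src/error.rs may be built differently (e.g. with a derive-based error crate), and InvalidEncoding is a made-up example variant:

```rust
// Hypothetical sketch; mirror the existing style of error.rs rather
// than copying this verbatim.
use std::fmt;

#[derive(Debug)]
pub enum Error {
    /// Pre-existing variant (illustrative).
    Io(std::io::Error),
    /// Step 1: the new variant for your failure mode.
    InvalidEncoding { name: String },
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "io error: {e}"),
            Error::InvalidEncoding { name } => write!(f, "invalid encoding: {name}"),
        }
    }
}

impl std::error::Error for Error {}

// Step 2: a From impl lets `?` convert the source error automatically.
impl From<std::io::Error> for Error {
    fn from(e: std::io::Error) -> Self {
        Error::Io(e)
    }
}
```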
Optimize Storage Encoding
- Implement the new encoding in src/columnar_storage/src/sst.rs alongside the existing formats (sketched below).
- Add a configuration option to src/columnar_storage/src/config.rs to select the encoding.
- Create a benchmark in src/benchmarks/src/encoding_bench.rs to compare compression ratio and speed.
- Update the manifest encoding in src/columnar_storage/src/manifest/encoding.rs if format compatibility is needed.
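An illustrative sketch of the first two steps. ColumnEncoder, EncodingKind, and the delta-of-delta layout are assumptions for exposition, not the repo's actual sst.rs or config.rs types:

```rust
// Hypothetical sketch; align with the real SST writer interfaces.

/// Step 2: a config option so operators can select the encoding.
#[derive(Clone, Copy, Debug)]
pub enum EncodingKind {
    Plain,
    DeltaDelta, // effective for near-regular timestamp columns
}

/// Step 1: encodings share one interface so the SST writer stays generic.
pub trait ColumnEncoder {
    fn encode(&self, values: &[i64]) -> Vec<u8>;
}

/// Delta-of-delta: store the change between successive deltas, which is
/// near zero for regularly spaced timestamps. Fixed-width output here for
/// brevity; a real encoder would bit-pack the small values.
pub struct DeltaDeltaEncoder;

impl ColumnEncoder for DeltaDeltaEncoder {
    fn encode(&self, values: &[i64]) -> Vec<u8> {
        let mut out = Vec::with_capacity(values.len() * 8);
        let (mut prev, mut prev_delta) = (0i64, 0i64);
        for &v in values {
            let delta = v - prev;
            let dod = delta - prev_delta; // small for regular intervals
            out.extend_from_slice(&dod.to_le_bytes());
            prev = v;
            prev_delta = delta;
        }
        out
    }
}
```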
🔧Why these technologies
- Rust — Memory safety without garbage collection, zero-cost abstractions, and the systems-level performance a high-performance time-series database requires.
- Columnar Storage (SST format) — Time-series queries typically access specific metrics/columns over time ranges; columnar encoding enables superior compression and cache efficiency vs row-oriented storage.
- LSM-Tree Compaction — Write-optimized data structure allowing fast sequential writes to a MemTable, with background compaction to maintain read performance and manage file count (a minimal write-path sketch follows this list).
- Protocol Buffers — Language-neutral serialization for RPC and cross-service communication in a distributed database architecture.
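To make the MemTable-plus-compaction idea concrete, here is a toy LSM write path in Rust; purely conceptual, not HoraeDB's implementation:

```rust
// Conceptual only: writes land in a sorted in-memory memtable and are
// flushed as one sequential, immutable sorted run once a size threshold
// is crossed. Background compaction (not shown) later merges runs.
use std::collections::BTreeMap;

const MEMTABLE_FLUSH_BYTES: usize = 4 * 1024 * 1024; // illustrative threshold

#[derive(Default)]
struct MemTable {
    entries: BTreeMap<u64, Vec<u8>>, // timestamp -> encoded row
    size: usize,
}

#[derive(Default)]
struct Engine {
    memtable: MemTable,
    sorted_runs: Vec<BTreeMap<u64, Vec<u8>>>, // stand-in for on-disk SSTs
}

impl Engine {
    /// Writes are a cheap in-memory insert: no random disk I/O.
    fn put(&mut self, ts: u64, row: Vec<u8>) {
        self.memtable.size += 8 + row.len();
        self.memtable.entries.insert(ts, row);
        if self.memtable.size >= MEMTABLE_FLUSH_BYTES {
            // Flush is one sequential write of an already-sorted run.
            let full = std::mem::take(&mut self.memtable);
            self.sorted_runs.push(full.entries);
        }
    }
}
```

Reads must consult the memtable plus every run, which is exactly why compaction (merging runs in the background) is needed to keep read amplification bounded.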
⚖️Trade-offs already made
- Separate Metric Engine vs Generic Columnar Storage
  - Why: Metrics workloads have different characteristics (high-cardinality labels, frequent aggregations) than generic time series; a specialized engine trades code complexity for query performance.
  - Consequence: Developers must choose the appropriate engine per table; code duplication between paths requires careful maintenance.
- Background Compaction vs Real-Time Optimization
  - Why: Background compaction reduces write latency and memory pressure by deferring expensive merges; real-time optimization would block writes and increase P99 latencies.
  - Consequence: Read performance may degrade transiently until background compaction catches up.
🪤Traps & gotchas
The main branch is unstable and not for production — users expecting a stable DB should use the analytic-engine branch. The Rust toolchain version is pinned in rust-toolchain.toml; the build will fail if your rustc doesn't match. No required env vars are documented, but remote_write integration likely requires Prometheus-compatible endpoint config (see docs/example.toml). metric_engine and columnar_storage are tightly coupled — changes to the columnar layout may require engine updates. The protobuf files require prost code generation (not shown as a build step in the file list; likely in build.rs — see the sketch below).
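If the generation step does live in build.rs (unverified above), it would typically be a prost-build invocation like this sketch; the proto paths are placeholders, not the repo's actual layout:

```rust
// build.rs — hypothetical sketch. Requires prost-build in
// [build-dependencies]; check src/pb_types for the real setup.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Re-run codegen whenever the proto definitions change.
    println!("cargo:rerun-if-changed=proto/");
    prost_build::compile_protos(
        &["proto/example.proto"], // placeholder .proto path
        &["proto/"],              // include dir for imports
    )?;
    Ok(())
}
```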
💡Concepts to learn
- Columnar Storage — HoraeDB's core differentiator—data stored column-wise (not row-wise) enables compression and selective I/O for time-series queries; the entire src/columnar_storage crate is built around this.
- Remote Write Protocol (Prometheus) — HoraeDB ingests metrics via Prometheus-compatible remote_write API (src/remote_write); understanding this protocol is essential for integrating with observability stacks.
- Metric Engine vs. Analytic Engine — The repo is transitioning from analytic_engine (stable, analytic-engine branch) to metric_engine (new, main branch); understanding this architectural shift is critical for choosing which branch to contribute to.
- Arrow IPC / Parquet Format — HoraeDB uses Arrow columnar representation in-memory and Parquet on disk (dependencies: arrow 53, parquet 53); data serialization/deserialization logic depends on these formats.
- DataFusion Query Planning — SQL queries are executed via DataFusion 43; understanding logical and physical plan optimization is key for debugging query performance or adding new query features (see the sketch after this list).
- gRPC Protocol Buffers — Server APIs are defined in src/pb_types/proto/ using protobuf; modifying APIs requires editing proto files and regenerating Rust code via prost crate.
- Object Storage Abstraction (object_store crate) — HoraeDB abstracts cloud storage (S3, GCS, local) via object_store 0.11; enables multi-cloud deployments without changing code.
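A standalone example of the DataFusion API mentioned above. This is generic DataFusion usage (requires the datafusion and tokio crates), not HoraeDB's integration code; the table name and Parquet path are placeholders:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Register a Parquet file as a queryable table.
    ctx.register_parquet("metrics", "data/metrics.parquet", ParquetReadOptions::default())
        .await?;
    // sql() parses the query, builds and optimizes a logical plan,
    // and returns a lazy DataFrame.
    let df = ctx
        .sql("SELECT host, avg(value) FROM metrics GROUP BY host")
        .await?;
    df.show().await?; // executes the physical plan and prints results
    Ok(())
}
```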
🔗Related repos
- VictoriaMetrics/VictoriaMetrics — Acknowledged inspiration in the README; battle-tested metric-focused time-series DB that HoraeDB learns from architecturally.
- prometheus/prometheus — HoraeDB implements the Prometheus remote_write protocol (see src/remote_write and the RFC); users typically deploy HoraeDB as a Prometheus backend.
- apache/arrow — Core dependency (Arrow 53); HoraeDB uses the Arrow columnar format and the DataFusion query engine built on Arrow.
- apache/datafusion — SQL query execution engine (DataFusion 43); handles query parsing, optimization, and execution for HoraeDB.
- apache/horaedb-docs — Official documentation repo (if it exists separately); the user guide and deployment docs likely live there.
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive benchmark suite and CI pipeline for remote_write performance tracking
The repo has benchmark infrastructure (src/benchmarks with remote_write_bench.rs and remote_write_memory_bench.py) and performance result images (remote-write-*.png), but no GitHub Actions workflow to automatically run and track these benchmarks on PRs. This is critical for a distributed time-series database where write performance is a core differentiator. Adding a scheduled benchmark workflow would prevent performance regressions.
- [ ] Create .github/workflows/benchmark.yml workflow to run src/benchmarks/benches/bench.rs on schedule and on PRs
- [ ] Integrate benchmark result comparisons (e.g., using critcmp) to comment on PRs with performance deltas
- [ ] Add benchmark results archival/trending to track performance over time across releases
- [ ] Document benchmark setup in CONTRIBUTING.md with instructions for running locally
Add integration tests for metric_engine functionality mentioned in RFC docs
The repo includes docs/rfcs/20240827-metric-engine.md describing a metric engine feature, and src/metric_engine exists as a workspace member in Cargo.toml, but there are no visible integration tests connecting the metric_engine module to the main server. New contributors should add tests validating metric engine operations (insert, query, aggregation) against a running HoraeDB instance to ensure the RFC implementation is solid.
- [ ] Review src/metric_engine/Cargo.toml to understand module structure and dependencies
- [ ] Create src/metric_engine/tests/ directory with integration tests covering metric insert/query workflows
- [ ] Add tests validating metric aggregation functions (count, sum, avg) over time ranges
- [ ] Update CONTRIBUTING.md with instructions for running metric_engine integration tests
Add missing license header validation workflow for generated and third-party code
The repo has .github/workflows/asf-header.yml for license validation and licenserc.toml configuration, but the workflow likely needs refinement for Protobuf-generated files (src/pb_types exists) and other generated code. Protocol buffer files auto-generated from .proto definitions commonly lack proper ASF headers, which should be handled specially. This ensures Apache compliance and saves maintainers review time.
- [ ] Audit existing .proto files and their generated outputs in src/pb_types/ for ASF license headers
- [ ] Update licenserc.toml to exclude auto-generated .pb.rs files from header checks or configure pre-generation header injection
- [ ] Add documentation in CONTRIBUTING.md about protobuf generation workflow and license header handling
- [ ] Verify .github/workflows/asf-header.yml correctly skips or handles generated files without false positives
🌿Good first issues
- Add integration tests for Prometheus remote_write API in src/remote_write: currently only performance benchmarks exist in src/benchmarks, no functional test suite validating correctness of write/query roundtrips.
- Document the columnar storage layout and format in docs/: the RFC covers metric_engine design but there's no spec for how columnar_storage organizes data on disk (blocks, encoding, compression), which blocks new contributors from understanding I/O paths.
- Extend example.toml (docs/example.toml) with commented configuration options for remote_write authentication, retention policies, and resource limits—currently minimal, making deployment configuration unclear for new users.
⭐Top contributors
- @jiacai2050 — 41 commits
- @baojinri — 13 commits
- @zealchen — 10 commits
- @dependabot[bot] — 7 commits
- @chunshao90 — 7 commits
📝Recent commits
- 8f9698c — refactor: remove async remote write decode logic (#1631) (BaldDemian)
- 9cec563 — feat: add hand-written Prometheus remote write request parser (#1628) (BaldDemian)
- 25fdaac — chore(deps): bump tracing-subscriber from 0.3.18 to 0.3.20 (#1629) (dependabot[bot])
- 725d6bd — Merge pull request #1626 from jiacai2050/refactor-storage (zealchen)
- abfa1ea — rename ObjectBasedStorage (jiacai2050)
- a33bd6b — fix ci (jiacai2050)
- 972c957 — refactor: rename columnar storage (jiacai2050)
- 1368157 — Merge pull request #1622 from jiacai2050/impl-metric (zealchen)
- 86e7619 — init metric manager (jiacai2050)
- 941ea07 — feat: add metric manager (#1621) (jiacai2050)
🔒Security observations
The HoraeDB codebase demonstrates a generally sound security foundation as an Apache incubated project with proper licensing, CI/CD workflows, and use of modern Rust tooling. However, there are areas for improvement: (1) the truncated prost dependency definition requires immediate completion, (2) dependency versions (particularly Arrow 53) should be audited against known CVEs, (3) a security policy should be documented, and (4) security scanning tools should be integrated into pre-commit hooks and CI/CD pipelines. The project would benefit from automated dependency vulnerability scanning and regular security audits. No hardcoded secrets, injection vulnerabilities, or obvious misconfigurations were detected in the visible file structure.
- Medium · Incomplete Dependency Specification in Cargo.toml — Cargo.toml (workspace dependencies section). The Cargo.toml file appears to be truncated at the 'prost' dependency definition ('prost = { vers'). This incomplete configuration could lead to unpredictable dependency resolution and potential use of incompatible or vulnerable versions of the prost crate. Fix: Complete the prost dependency specification with a proper version constraint, e.g. 'prost = { version = "0.12" }', and ensure all dependencies have explicit, pinned versions in Cargo.lock.
- Medium · Use of Arrow 53 — Potentially Outdated Dependency — Cargo.toml (workspace dependencies). The project uses Apache Arrow version 53, which may contain known vulnerabilities. Arrow is a critical dependency for data processing, and using outdated versions could expose the system to security issues. Fix: Audit Arrow 53 against known CVEs using cargo-audit, consider upgrading to the latest stable Arrow if security patches are available, and run 'cargo audit' regularly to identify vulnerable dependencies.
- Low · Missing Security Policy Documentation — Repository root. No SECURITY.md file is visible in the root directory or .github folder, which makes it difficult for security researchers to responsibly disclose vulnerabilities. Fix: Create a SECURITY.md file in the repository root following GitHub's security advisory guidelines, including how to report vulnerabilities safely and what response time to expect.
- Low · Workspace Resolver Version — Cargo.toml (workspace configuration). The workspace uses resolver version 2, which is appropriate for modern Rust projects, but dependency updates should be monitored closely to prevent supply-chain attacks. Fix: Maintain regular dependency audits using 'cargo audit' and enable Dependabot or similar automated vulnerability scanning for Cargo.toml and Cargo.lock.
- Low · Pre-commit Configuration Present — .pre-commit-config.yaml. While .pre-commit-config.yaml exists (which is good), there is no indication in the provided files that security checks like 'detect-secrets' or 'trivy' are configured. Fix: Enhance the pre-commit configuration with security scanning tools: 'detect-secrets' for credential detection, 'trivy' for vulnerability scanning, and 'cargo-audit' for Rust dependencies.
- Low · Docker Image Builds — No Visible Security Hardening — Docker build configuration (not fully visible). Dockerfile references exist (implied by .dockerignore), but no security hardening details are visible. Fix: Ensure Dockerfiles use minimal base images (e.g., distroless or alpine), implement health checks, run processes as non-root users, and scan images with tools like Trivy before publishing.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.