brianfrankcooper/YCSB
Yahoo! Cloud Serving Benchmark
Healthy across the board
Permissive license, no critical CVEs, actively maintained; safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 3w ago
- ✓35+ active contributors
- ✓Distributed ownership (top contributor 18% of recent commits)
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding: brianfrankcooper/YCSB
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
- Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/brianfrankcooper/YCSB shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 3w ago
- 35+ active contributors
- Distributed ownership (top contributor 18% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live brianfrankcooper/YCSB
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/brianfrankcooper/YCSB.
What it runs against: a local clone of brianfrankcooper/YCSB — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in brianfrankcooper/YCSB | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 53 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of brianfrankcooper/YCSB. If you don't
# have one yet, run these first:
#
# git clone https://github.com/brianfrankcooper/YCSB.git
# cd YCSB
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of brianfrankcooper/YCSB and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "brianfrankcooper/YCSB(\\.git)?\\b" \
&& ok "origin remote is brianfrankcooper/YCSB" \
|| miss "origin remote is not brianfrankcooper/YCSB (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
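# The repo ships its license as LICENSE.txt, and the standard Apache text
# spells the name out, so accept both file names and both spellings.
(cat LICENSE LICENSE.txt 2>/dev/null | grep -qiE "Apache License, Version 2\\.0|^Apache-2\\.0" \
|| grep -qiE "\"license\"\\s*:\\s*\"Apache-2\\.0\"" package.json 2>/dev/null) \
&& ok "license is Apache-2.0" \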
|| miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
&& ok "default branch master exists" \
|| miss "default branch master no longer exists"
# 4. Critical files exist
test -f "README.md" \
&& ok "README.md" \
|| miss "missing critical file: README.md"
test -f "bin/ycsb" \
&& ok "bin/ycsb" \
|| miss "missing critical file: bin/ycsb"
test -f "binding-parent/pom.xml" \
&& ok "binding-parent/pom.xml" \
|| miss "missing critical file: binding-parent/pom.xml"
test -f ".travis.yml" \
&& ok ".travis.yml" \
|| miss "missing critical file: .travis.yml"
test -f "CONTRIBUTING.md" \
&& ok "CONTRIBUTING.md" \
|| miss "missing critical file: CONTRIBUTING.md"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 53 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~23d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/brianfrankcooper/YCSB"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
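The exit-code contract described above can be sketched as a small gate function. This is a minimal illustration only: `gate_on_verify` is a hypothetical helper name, and `true` and `false` stand in for a passing and a failing `./verify.sh` run.

```shell
# Run a verification command and print the action an agent should take.
# Hypothetical helper; any command with a meaningful exit code works.
gate_on_verify() {
  if "$@" >/dev/null 2>&1; then
    echo "proceed"
  else
    echo "regenerate"
  fi
}

gate_on_verify true    # stand-in for a passing ./verify.sh, prints "proceed"
gate_on_verify false   # stand-in for a failing ./verify.sh, prints "regenerate"
```

In a real loop the argument would be the saved verification script, for example `gate_on_verify ./verify.sh`.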
⚡TL;DR
YCSB (Yahoo! Cloud Serving Benchmark) is a benchmarking framework for evaluating the performance of cloud and NoSQL databases through standardized workloads. It provides pluggable client bindings (Accumulo, Aerospike, ArangoDB, AsyncHBase, etc.) that run configurable workload patterns (WorkloadA, WorkloadB, etc.) and measure throughput, latency percentiles (P50, P95, P99, P99.9), and other metrics across heterogeneous datastores.
The repository is a multi-binding monorepo: one top-level directory per database (accumulo1.9/, aerospike/, arangodb/, asynchbase/, etc.), each containing a pom.xml and a src/main/java/site/ycsb/db/{DbName}Client.java implementation. Shared build configuration for the bindings lives in the parent POM (binding-parent/pom.xml), workload definitions live in workloads/, and the CLI entry points live in bin/ (ycsb, ycsb.sh, ycsb.bat). The build system is Maven, with each binding declared as its own module.
👥Who it's for
Database engineers, cloud infrastructure teams, and researchers who need to compare NoSQL/cloud database performance objectively before deployment. Also relevant to database vendors who want certification on YCSB benchmarks.
🌱Maturity & risk
This is a mature, production-ready project: originally from Yahoo! Labs, now community-maintained with 40+ database bindings, CI configured through Travis CI, multiple tracked releases (0.17.0 and 0.18.0-SNAPSHOT are visible), and per-binding test suites (e.g., AccumuloTest.java). The last commit landed about three weeks before this analysis, which suggests ongoing maintenance, though core development appears to have stabilized.
Moderate risk: the monorepo contains 40+ semi-independent database bindings with varying maintenance levels—some bindings may lag behind their target database versions (e.g., accumulo1.9 locked to Hadoop 2.6.4, skipping JDK9+ tests). Dependency sprawl across bindings increases supply-chain risk; breaking changes in upstream databases (Accumulo, HBase, Cassandra, etc.) cascade into binding updates. Single binding owners may become bottlenecks.
Active areas of work
Version 0.18.0-SNAPSHOT is in development (visible in pom.xml). The project maintains compatibility with modern database versions (Accumulo 1.9+) while selectively skipping tests for Java 9+ on older bindings. Recent focus appears to be adding new database bindings and maintaining existing ones rather than core algorithm changes.
🚀Get running
Clone and build with Maven: git clone https://github.com/brianfrankcooper/YCSB.git && cd YCSB && mvn clean package (builds all bindings). To build one binding: mvn -pl site.ycsb:mongodb-binding -am clean package. Then run: bin/ycsb.sh load basic -P workloads/workloada && bin/ycsb.sh run basic -P workloads/workloada on Linux or bin/ycsb.bat on Windows.
Daily commands:
Load phase: bin/ycsb.sh load <database> -P workloads/workloada -p recordcount=1000000. Run phase: bin/ycsb.sh run <database> -P workloads/workloada -p operationcount=1000000. Replace <database> with binding name (basic, mongodb, aerospike, etc.). See bin/ycsb.sh for detailed CLI parsing of -P (properties file), -p (property override), and operation mode (load/run).
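The load and run invocations above share one shape, which can be sketched as a dry-run helper that only prints the command line instead of executing it. `ycsb_cmd` is a hypothetical name introduced for illustration; the flags it emits (`-P`, `-p`) are the ones described above.

```shell
# Print (do not execute) a YCSB command line for the given phase.
# Usage: ycsb_cmd PHASE BINDING WORKLOAD_FILE [key=value ...]
ycsb_cmd() {
  phase=$1; binding=$2; workload=$3
  shift 3
  case "$phase" in
    load|run) ;;
    *) echo "phase must be load or run" >&2; return 1 ;;
  esac
  cmd="bin/ycsb.sh $phase $binding -P $workload"
  # Each remaining argument becomes a -p property override.
  for kv in "$@"; do
    cmd="$cmd -p $kv"
  done
  echo "$cmd"
}

ycsb_cmd load basic workloads/workloada recordcount=1000000
ycsb_cmd run basic workloads/workloada operationcount=1000000
```

Printing the command first makes it easy to review the property overrides before a long benchmark run starts.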
🗺️Map of the codebase
- README.md: Entry point documenting YCSB's purpose as Yahoo's Cloud Serving Benchmark, setup instructions, and project links that orient all contributors
- bin/ycsb: Main executable script that orchestrates benchmark execution across all database bindings; essential for understanding how the framework runs
- binding-parent/pom.xml: Parent Maven configuration defining shared dependencies and build patterns for all database client bindings
- .travis.yml: CI/CD configuration that validates builds across multiple database implementations and prevents regressions
- CONTRIBUTING.md: Contributor guidelines explaining how to add new database bindings and follow project conventions
- bin/bindings.properties: Central registry mapping database names to their implementation classes, enabling dynamic binding loading
- LICENSE.txt: Apache 2.0 license definition, critical for understanding code reuse and attribution requirements
🛠️How to make changes
Add a New Database Binding
- Create a new module directory (e.g., newdb/) with a Maven structure following the existing patterns in aerospike/ or cassandra/ (newdb/pom.xml)
- Implement a client class extending YCSB's abstract DB class, e.g. site.ycsb.db.newdb.NewDBClient, mirroring azurecosmos/src/main/java/site/ycsb/db/AzureCosmosClient.java (newdb/src/main/java/site/ycsb/db/newdb/NewDBClient.java)
- Add database-specific configuration properties in newdb/conf/newdb.properties (e.g., newdb.hosts, newdb.port), matching the patterns in azurecosmos/conf/azurecosmos.properties (newdb/conf/newdb.properties)
- Register the new binding in bin/bindings.properties by adding the line newdb=site.ycsb.db.newdb.NewDBClient, using the same site.ycsb package as the client class (bin/bindings.properties)
- Create unit tests in newdb/src/test/java/site/ycsb/db/newdb/NewDBTest.java, following the patterns from accumulo1.9/src/test/java/site/ycsb/db/accumulo/AccumuloTest.java (newdb/src/test/java/site/ycsb/db/newdb/NewDBTest.java)
- Add a newdb/README.md documenting setup prerequisites, configuration options, and example benchmark commands (newdb/README.md)
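The directory layout implied by the steps above can be sketched as a scaffold function. This is an illustration under stated assumptions: `scaffold_binding` and the binding name `newdb` are hypothetical, the files it creates are empty placeholders, and it deliberately does not touch bin/bindings.properties.

```shell
# Create skeleton directories and empty placeholder files for a binding.
# Usage: scaffold_binding REPO_ROOT BINDING_NAME
scaffold_binding() {
  root=$1; name=$2
  mkdir -p "$root/$name/src/main/java/site/ycsb/db/$name" \
           "$root/$name/src/test/java/site/ycsb/db/$name" \
           "$root/$name/conf"
  : > "$root/$name/pom.xml"                 # to be filled from an existing binding
  : > "$root/$name/conf/$name.properties"   # connection settings go here
  : > "$root/$name/README.md"               # setup and usage notes go here
}

# Demonstrate against a throwaway directory rather than a real clone.
demo_root=$(mktemp -d)
scaffold_binding "$demo_root" newdb
find "$demo_root" -type f | sort
```

Running it against a temporary directory first shows the resulting tree without modifying a working clone.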
Configure and Run a Benchmark
- Set database credentials and connection parameters in the database's config directory (e.g., azurecosmos/conf/azurecosmos.properties for Azure Cosmos) (azurecosmos/conf/azurecosmos.properties)
- Execute the benchmark via the main launcher: ./bin/ycsb load|run <database> -P <workload.properties> -p <key>=<value> (bin/ycsb)
- Workload files define the operation mix (read/write ratios, record counts); create or modify them per benchmark needs (workloads/)
Modify Logging Configuration
- Edit database-specific logging config (e.g., arangodb/conf/logback.xml) to adjust verbosity, output format, or file paths (arangodb/conf/logback.xml)
- Or modify test-level logging in asynchbase/src/test/resources/log4j.properties for debug output during tests (asynchbase/src/test/resources/log4j.properties)
🔧Why these technologies
- Java & Maven: Enables a single codebase to test heterogeneous databases (HBase, Cassandra, Azure, Aerospike) with pluggable adapters; the Maven multi-module structure mirrors the binding architecture
- Adapter/Bridge Pattern (DB-specific Client classes): Isolates database-specific connection logic from the core benchmarking framework; each binding implements a common YCSB DB interface
- Properties-based Configuration: Allows runtime override of connection details, cluster endpoints, and tuning parameters without code changes (e.g., azurecosmos.properties)
- Shell Scripts (bin/ycsb, bin/ycsb.sh, bin/ycsb.bat): Provide cross-platform entry points and classpath management, abstracting Java invocation complexity from operators
⚖️Trade-offs already made
- One Maven module per database (aerospike/, cassandra/, azurecosmos/) rather than a single monolith
  - Why: Allows independent version bumps of database client libraries and keeps dependency conflicts isolated
  - Consequence: More modules to build and test, but cleaner dependency trees and easier troubleshooting
- Async (AsyncHBase) and blocking (most others) clients are both supported
  - Why: Async clients can achieve higher throughput under contention; blocking clients are simpler to debug
  - Consequence: Operators must understand their database's concurrency model to pick the appropriate binding
- No built-in result aggregation or distributed coordination
  - Why: Keeps the framework lightweight; benchmarks often run on a single machine or require external orchestration
  - Consequence: Multi-node benchmarks require external tools (e.g., scripts, Kubernetes) to coordinate load and merge results
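The external result merging mentioned in the last trade-off can be sketched with awk. This is a minimal sketch, assuming each client's output was saved to its own file and contains the summary line `[OVERALL], Throughput(ops/sec), <value>`; `sum_throughput` and the sample file names are hypothetical.

```shell
# Sum the overall throughput reported in several saved YCSB output files.
# Usage: sum_throughput FILE [FILE ...]
sum_throughput() {
  awk -F', ' '$1 == "[OVERALL]" && $2 == "Throughput(ops/sec)" { total += $3 }
              END { printf "%.1f\n", total }' "$@"
}

# Hypothetical sample outputs from two client machines.
tmp=$(mktemp -d)
printf '[OVERALL], RunTime(ms), 10000\n[OVERALL], Throughput(ops/sec), 1000.5\n' > "$tmp/client1.txt"
printf '[OVERALL], RunTime(ms), 10000\n[OVERALL], Throughput(ops/sec), 2000.0\n' > "$tmp/client2.txt"

sum_throughput "$tmp/client1.txt" "$tmp/client2.txt"   # prints 3000.5
```

Adding throughput is only meaningful when the clients ran concurrently over the same window. Latency percentiles cannot be combined this way; they have to be recomputed from the underlying measurements.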
🚫Non-goals (don't propose these)
- Real-time or streaming analytics of benchmark results (YCSB generates offline reports)
- Automatic database provisioning or infrastructure-as-code (operators must pre-create
🪤Traps & gotchas
The target database must be installed and running before benchmark execution (YCSB does not start it). Workload property names (recordcount, operationcount, etc.) are case-sensitive, and a missing or misspelled property silently falls back to a hardcoded default. Tail-latency percentiles (e.g., P99.99) need a sufficiently large operation count to populate the tail buckets; otherwise the reported values are misleading. Per-binding Maven skipJDKxTests flags disable tests on newer JDKs for old bindings (e.g., Accumulo 1.9 skips JDK9+), creating blind spots. The key distribution (zipfian vs. uniform) set in workload files dramatically affects cache hit rates but is often overlooked.
🏗️Architecture
💡Concepts to learn
- Latency Percentiles (P50, P95, P99, P99.9, P99.99) — YCSB's core output metrics; understanding that P99 can be 10x higher than median (P50) is critical to avoid accepting databases with poor tail latency in production clusters
- Zipfian Distribution — Default key access pattern in YCSB workloads (heavily skewed toward hot keys), simulating real-world database access; crucial for cache-layer performance evaluation
- Read/Write Ratio and Workload Mixing — YCSB defines workload profiles (A=50/50, B=95/5, C=100% read) to model diverse use cases; incorrect ratio skews benchmark conclusions about database suitability
- Load and Run Phases — YCSB separates data insertion (load) from query execution (run) to isolate schema/data setup overhead from operational performance; many benchmarks conflate these
- Database Client Interface Abstraction — YCSB's DB abstract class defines init(), read(), insert(), update(), delete(), scan() operations; all bindings implement this contract, enabling apples-to-apples cross-database comparison
- Property-Driven Configuration — YCSB uses .properties files (workloads/workloada) and command-line overrides (-p key=value) to avoid recompilation; understanding this pattern is essential for workload customization
- Throughput vs. Latency Trade-off — YCSB measures both ops/sec (throughput) and latency percentiles simultaneously; databases often optimize for one at the expense of the other, and this trade-off is visible in benchmark reports
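To make the latency-percentile concept above concrete, here is a minimal nearest-rank percentile sketch over raw samples. It illustrates the definition only and is not how YCSB computes its reported percentiles; `percentile` is a hypothetical helper and the sample latencies are made up.

```shell
# Nearest-rank percentile of newline-separated numeric samples on stdin.
# Usage: percentile P   (P greater than 0 and at most 100)
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = p * NR / 100
      r = int(rank)
      if (r < rank) r++        # round up to the next whole rank
      if (r < 1) r = 1
      print v[r]
    }'
}

# 98 fast operations plus two slow outliers (made-up values, in ms).
samples() { seq 1 98; echo 500; echo 900; }

samples | percentile 50   # prints 50
samples | percentile 99   # prints 500
```

With these samples the 99th percentile is ten times the median even though 98% of operations are fast, which is the kind of tail behavior the percentile columns are meant to expose.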
🔗Related repos
- apache/incubator-cassandra: Common benchmarking target for YCSB; Cassandra binding development often tests against this codebase
- google/leveldb: Alternative KV-store benchmark target and reference implementation for embedded database performance characteristics
- aerospike/aerospike-benchmark: Vendor-specific benchmark for Aerospike that complements YCSB's cross-database comparisons
- cockroachdb/cockroach: Modern distributed SQL database that benefits from YCSB workload testing; some users benchmark CockroachDB via YCSB
- mongodb/mongo: Ecosystem companion; the MongoDB binding in YCSB drives adoption, and MongoDB teams in turn use YCSB for performance validation
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add integration tests for database binding modules
YCSB has multiple database bindings (Accumulo, Aerospike, ArangoDB, AsyncHBase, AzureCosmos, etc.) but only a few have test directories. Most bindings like aerospike/ and arangodb/ lack src/test/java directories. Adding integration/unit tests for each binding would improve reliability and prevent regressions when dependencies are updated.
- [ ] Create src/test/java/site/ycsb/db/ directories in aerospike/, arangodb/, and other bindings missing tests
- [ ] Add basic connection/initialization tests (similar to accumulo1.9/src/test/java/site/ycsb/db/accumulo/AccumuloTest.java)
- [ ] Add test resources (config files like hbase-site.xml in asynchbase) where needed
- [ ] Update each binding's pom.xml to include junit and mockito dependencies in test scope
Create GitHub Actions CI workflow for multi-binding compilation verification
The repo has .travis.yml but the build configuration is outdated. GitHub Actions is now standard. Currently there's no CI verification that all database bindings compile successfully. With 40+ bindings, a single breaking change in the parent POM can silently break multiple bindings.
- [ ] Create .github/workflows/multi-binding-build.yml that runs 'mvn clean compile' for each binding module (accumulo1.9, aerospike, arangodb, etc.)
- [ ] Add a matrix strategy to test against Java 8, 11, and 17 since skipJDK9Tests and similar properties suggest version compatibility concerns
- [ ] Configure workflow to fail fast and report which binding(s) failed compilation
- [ ] Add workflow status badge to README.md
Add README documentation for each database binding with configuration examples
While accumulo1.9/, aerospike/, arangodb/, and asynchbase/ have README.md files, many bindings like azurecosmos/ have incomplete documentation. The main README.md doesn't link to or summarize binding-specific setup instructions. New contributors can't easily discover how to set up each database for benchmarking.
- [ ] Create comprehensive README.md for azurecosmos/ and other bindings if missing (check full binding list)
- [ ] Each binding README should include: required dependencies, environment setup, connection configuration properties (see accumulo.properties example), and basic usage example
- [ ] Add a 'Supported Bindings' table in root README.md linking to each binding's README with version info
- [ ] Document which Java versions and parent POM version each binding targets
🌿Good first issues
- Add latency histogram unit tests for all bindings: accumulo1.9/src/test/java/site/ycsb/db/accumulo/ and aerospike/src/test/ lack explicit tests for P99/P99.9 percentile calculation accuracy. Create test class LatencyHistogramTest.java verifying percentile boundaries.
- Document minimum Java version per binding in each README: accumulo1.9/README.md, arangodb/README.md, etc. lack explicit JDK version requirements. Add 'Requires Java X.Y+' section mirroring the skipJDK flags in pom.xml.
- Create workload composition example: workloads/ directory has basic workloads but no documented pattern for mixing (e.g., 70% read + 30% update). Write workload-custom.md in docs/ showing property overrides for derived workloads.
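The derived-workload idea in the last item can be sketched as a function that writes a properties file with a custom read/update mix. This is an illustration, assuming the property names used by the core workload files (`readproportion`, `updateproportion`, and so on); `write_workload` and the output path are hypothetical.

```shell
# Write a workload properties file with a custom read/update mix.
# Usage: write_workload FILE READ_PROPORTION UPDATE_PROPORTION
write_workload() {
  {
    printf '%s\n' 'workload=site.ycsb.workloads.CoreWorkload'
    printf '%s\n' 'recordcount=1000'
    printf '%s\n' 'operationcount=1000'
    printf 'readproportion=%s\n' "$2"
    printf 'updateproportion=%s\n' "$3"
    printf '%s\n' 'scanproportion=0'
    printf '%s\n' 'insertproportion=0'
    printf '%s\n' 'requestdistribution=zipfian'
  } > "$1"
}

# 70% reads and 30% updates, written to a temporary file for inspection.
out=$(mktemp)
write_workload "$out" 0.7 0.3
cat "$out"
```

The resulting file could then be passed to the launcher with `-P`, with `-p` overrides for recordcount and operationcount at benchmark scale.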
⭐Top contributors
- @igorbernstein2 — 18 commits
- @cindy-peng — 18 commits
- @suraj-pasuparthy — 15 commits
- @busbey — 8 commits
- @sitano — 6 commits
📝Recent commits
- d9faaac: Merge pull request #1777 from brianfrankcooper/igorbernstein2-patch-1 (igorbernstein2)
- e5e2d1d: Update bigtable client version (igorbernstein2)
- 8b2ecaf: Merge pull request #1775 from PeteyPii/master (igorbernstein2)
- a25b3e8: Add option to keep going even if an insert fails (Patrick Wrobel)
- 6d0fbba: Merge pull request #1773 from kumaran-coder/i1772 (brianfrankcooper)
- 15f09db: [core] Fix issue where recordcount in the workload is limited to max integer value (kumaran-coder)
- 595a403: Merge pull request #1770 from jinseopkim0/master (suraj-pasuparthy)
- 1853b0d: Add gRPC as default transport layer for google datastore binding (jinseopkim0)
- 19e885f: Update YCSB to support slow scan controlled by three new run parameters. (#1768) (liujiongxin)
- 1e62880: Merge pull request #1760 from krishnamoorthy-r/krishraj-dev (suraj-pasuparthy)
🔒Security observations
Failed to generate security analysis.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.