
skyzh/mini-lsm

A course of building an LSM-Tree storage engine (database) in a week.

Healthy

Healthy across the board

Use as dependency: Healthy (weakest axis)

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 2w ago
  • 37+ active contributors
  • Distributed ownership (top contributor 36% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/skyzh/mini-lsm)](https://repopilot.app/r/skyzh/mini-lsm)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/skyzh/mini-lsm on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: skyzh/mini-lsm

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/skyzh/mini-lsm shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 2w ago
  • 37+ active contributors
  • Distributed ownership (top contributor 36% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live skyzh/mini-lsm repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/skyzh/mini-lsm.

What it runs against: a local clone of skyzh/mini-lsm — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in skyzh/mini-lsm | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 46 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>skyzh/mini-lsm</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of skyzh/mini-lsm. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/skyzh/mini-lsm.git
#   cd mini-lsm
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of skyzh/mini-lsm and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "skyzh/mini-lsm(\.git)?/?$" \
  && ok "origin remote is skyzh/mini-lsm" \
  || miss "origin remote is not skyzh/mini-lsm (artifact may be from a fork)"

# 2. License matches what RepoPilot saw.
# The standard Apache-2.0 LICENSE text opens with "Apache License", not the
# SPDX identifier, so accept either form. Cargo.toml is checked as a fallback
# (assumes the manifest declares a license field).
(grep -qiE "Apache-2\.0|Apache License" LICENSE 2>/dev/null \
   || grep -qiE "license[[:space:]]*=[[:space:]]*\"Apache-2\.0\"" Cargo.toml 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift: was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "mini-lsm-mvcc/src/lib.rs" \
  && ok "mini-lsm-mvcc/src/lib.rs" \
  || miss "missing critical file: mini-lsm-mvcc/src/lib.rs"
test -f "mini-lsm-mvcc/src/compact.rs" \
  && ok "mini-lsm-mvcc/src/compact.rs" \
  || miss "missing critical file: mini-lsm-mvcc/src/compact.rs"
test -f "mini-lsm-mvcc/src/block.rs" \
  && ok "mini-lsm-mvcc/src/block.rs" \
  || miss "missing critical file: mini-lsm-mvcc/src/block.rs"
test -f "mini-lsm-book/src/SUMMARY.md" \
  && ok "mini-lsm-book/src/SUMMARY.md" \
  || miss "missing critical file: mini-lsm-book/src/SUMMARY.md"
test -f "Cargo.toml" \
  && ok "Cargo.toml" \
  || miss "missing critical file: Cargo.toml"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 46 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~16d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/skyzh/mini-lsm"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Mini-LSM is a teaching implementation of a Log-Structured Merge-tree (LSM-tree) storage engine written in Rust, designed to teach database internals through a structured 3-week course. It demonstrates how key-value stores organize data into blocks, SSTables, memtables, and multi-level compaction strategies to achieve high write throughput with efficient reads, the core technique behind RocksDB and LevelDB.

The repository is a monorepo with four main crates: mini-lsm (reference solution for weeks 1-2), mini-lsm-mvcc (week 3 MVCC solution), mini-lsm-starter (empty skeleton for students), and xtask (build/copy utilities). The mini-lsm-book/ subtree contains mdBook chapters (00-overview through 09-whats-next) plus LSM architecture SVGs (week1-01-overview.svg through week2-00-triangle.svg).

Students copy tests via cargo x copy-test --week N --day N and validate with cargo x scheck.
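To make the memtable and SSTable layering described above concrete, here is a minimal sketch of an LSM read path. It is not the course's code: `ToyLsm`, `Slot`, and the use of in-memory `BTreeMap`s as stand-ins for on-disk SSTables are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// A value slot: `None` is a tombstone marking a deleted key.
type Slot = Option<Vec<u8>>;

/// Toy store (hypothetical): one mutable memtable plus immutable sorted runs
/// standing in for SSTables, ordered newest first.
#[derive(Default)]
pub struct ToyLsm {
    memtable: BTreeMap<Vec<u8>, Slot>,
    runs: Vec<BTreeMap<Vec<u8>, Slot>>, // index 0 is the newest run
}

impl ToyLsm {
    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        self.memtable.insert(key.to_vec(), Some(value.to_vec()));
    }

    pub fn delete(&mut self, key: &[u8]) {
        // A delete is also a write: the tombstone shadows older versions.
        self.memtable.insert(key.to_vec(), None);
    }

    /// Freeze the memtable into a new immutable run (stand-in for an SST flush).
    pub fn flush(&mut self) {
        let frozen = std::mem::take(&mut self.memtable);
        self.runs.insert(0, frozen);
    }

    /// Check the memtable first, then the runs from newest to oldest.
    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        std::iter::once(&self.memtable)
            .chain(self.runs.iter())
            .find_map(|table| table.get(key)) // first hit wins, even a tombstone
            .and_then(|slot| slot.clone())
    }
}
```

Writes only ever touch the memtable, which is why the write path is cheap; reads pay for that by possibly consulting several runs, which is the cost compaction later reduces.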

👥Who it's for

Computer science students and junior database engineers learning LSM internals through hands-on coding; the course scaffolds implementation across 21 days (7 per week) with the mini-lsm-starter crate providing skeleton code and the mini-lsm book providing detailed explanations and SVG diagrams of read/write flows.

🌱Maturity & risk

Actively maintained course material (not a production system). The reference implementation in mini-lsm/ and mini-lsm-mvcc/ is stable with CI/CD via GitHub Actions (main.yml, pr.yml), comprehensive mdBook documentation, and a community solution tracker (SOLUTIONS.md). However, this is explicitly educational code—Week 3 MVCC support is complete but the 'Extra Week' optimizations section notes "unlikely to be available in 2025."

Standard open source risks apply.

Active areas of work

Course is stable through Week 3 (MVCC complete); Week 4 (optimizations) is marked as work-in-progress and unlikely in 2025. The repository accepts community solutions (SOLUTIONS.md PR submissions). Binaries mini-lsm-cli-ref, mini-lsm-cli-mvcc-ref, and compaction-simulator-ref allow students to preview the reference implementation before coding.

🚀Get running

git clone https://github.com/skyzh/mini-lsm.git
cd mini-lsm
cargo x install-tools
cargo x copy-test --week 1 --day 1
cargo x scheck
cargo run --bin mini-lsm-cli

Daily commands: For students: cargo run --bin mini-lsm-cli (interactive shell). For developers: cargo x check (verify reference solutions), cargo x book (build mdBook), cargo x sync (sync API changes to starter). Reference binaries: cargo run --bin mini-lsm-cli-ref, cargo run --bin compaction-simulator-ref.

🗺️Map of the codebase

  • mini-lsm-mvcc/src/lib.rs — Main library entry point for the LSM engine; defines core data structures and public API
  • mini-lsm-mvcc/src/compact.rs — Central compaction logic orchestrating level-based and tiered strategies; critical for write performance and data organization
  • mini-lsm-mvcc/src/block.rs — Block abstraction forming the smallest serializable unit; fundamental to all SST read/write operations
  • mini-lsm-book/src/SUMMARY.md — Course structure and learning path; essential for understanding how to navigate the tutorial progression
  • Cargo.toml — Workspace manifest defining four sub-projects (mini-lsm, mini-lsm-starter, mini-lsm-mvcc, xtask) and shared dependencies
  • mini-lsm-mvcc/src/iterators.rs — Iterator abstractions used throughout read paths; enables multi-level merge and scan operations

🛠️How to make changes

Add a New Compaction Strategy

  1. Create a new module in mini-lsm-mvcc/src/compact/ (e.g., adaptive.rs) implementing the CompactionStrategy trait (mini-lsm-mvcc/src/compact/adaptive.rs)
  2. Add the new variant to the CompactionStrategy enum in mini-lsm-mvcc/src/compact.rs and route to your implementation in the match statement (mini-lsm-mvcc/src/compact.rs)
  3. Implement methods: decide_level(), compact(), and estimate_cost() following the pattern in leveled.rs or tiered.rs (mini-lsm-mvcc/src/compact/adaptive.rs)
  4. Add unit tests in your new module and integration tests in the test suite (mini-lsm-mvcc/src/compact/adaptive.rs)
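As a rough illustration of what a pluggable strategy could look like, the sketch below models a tree as per-level file sizes and picks a compaction task. Every name in it (`CompactionPolicy`, `SizeRatioPolicy`, `Task`, `Levels`) is hypothetical, and the actual types in mini-lsm-mvcc/src/compact.rs may be shaped differently, so treat it as a sketch of the idea rather than the crate's API.

```rust
/// Sizes (in bytes) of the files in each level; index 0 is L0. Hypothetical model.
pub type Levels = Vec<Vec<u64>>;

/// A compaction decision: merge level `upper` into level `lower`.
#[derive(Debug, PartialEq)]
pub struct Task {
    pub upper: usize,
    pub lower: usize,
}

/// Hypothetical strategy interface: inspect the tree, optionally return work.
pub trait CompactionPolicy {
    fn pick(&self, levels: &Levels) -> Option<Task>;
}

/// Compacts when L0 has too many files, or when a level is too small
/// relative to the level directly above it.
pub struct SizeRatioPolicy {
    /// Compact when lower_size / upper_size falls below this percentage.
    pub min_ratio_percent: u64,
    /// Compact L0 into L1 once L0 holds this many files.
    pub l0_file_limit: usize,
}

impl CompactionPolicy for SizeRatioPolicy {
    fn pick(&self, levels: &Levels) -> Option<Task> {
        // L0 files may overlap each other, so cap their count first.
        let l0_full = levels
            .first()
            .map_or(false, |l0| l0.len() >= self.l0_file_limit);
        if l0_full && levels.len() > 1 {
            return Some(Task { upper: 0, lower: 1 });
        }
        // Then find the first adjacent pair whose lower level is too small.
        for upper in 1..levels.len().saturating_sub(1) {
            let upper_size: u64 = levels[upper].iter().sum();
            let lower_size: u64 = levels[upper + 1].iter().sum();
            if upper_size > 0 && lower_size * 100 < upper_size * self.min_ratio_percent {
                return Some(Task { upper, lower: upper + 1 });
            }
        }
        None
    }
}
```

Keeping the decision (`pick`) separate from the merge itself makes a strategy easy to unit test on plain numbers before wiring it into an engine.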

Add a New Iterator Type

  1. Define a new iterator struct in mini-lsm-mvcc/src/iterators/ (e.g., filter_iterator.rs) implementing the crate's StorageIterator trait rather than std's Iterator (mini-lsm-mvcc/src/iterators/filter_iterator.rs)
  2. Declare the module in mini-lsm-mvcc/src/iterators.rs and export the struct (mini-lsm-mvcc/src/iterators.rs)
  3. Implement next(), key(), value(), and is_valid() methods following the pattern in concat_iterator.rs (mini-lsm-mvcc/src/iterators/filter_iterator.rs)
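The steps above can be sketched with a simplified cursor-style trait. `KvIterator` here is a hypothetical stand-in with the four methods named in step 3; the course's real trait likely differs in key types and error handling, so check the source before copying signatures. `VecIterator` exists only to feed the example.

```rust
/// Simplified stand-in for a storage iterator (hypothetical trait).
pub trait KvIterator {
    fn key(&self) -> &[u8];
    fn value(&self) -> &[u8];
    fn is_valid(&self) -> bool;
    fn next(&mut self);
}

/// In-memory iterator over sorted pairs, used here only as a data source.
pub struct VecIterator {
    items: Vec<(Vec<u8>, Vec<u8>)>,
    pos: usize,
}

impl VecIterator {
    pub fn new(items: Vec<(Vec<u8>, Vec<u8>)>) -> Self {
        Self { items, pos: 0 }
    }
}

impl KvIterator for VecIterator {
    fn key(&self) -> &[u8] { &self.items[self.pos].0 }
    fn value(&self) -> &[u8] { &self.items[self.pos].1 }
    fn is_valid(&self) -> bool { self.pos < self.items.len() }
    fn next(&mut self) { self.pos += 1; }
}

/// Wraps another iterator and skips entries whose key fails `keep`.
pub struct FilterIterator<I, F> {
    inner: I,
    keep: F,
}

impl<I: KvIterator, F: Fn(&[u8]) -> bool> FilterIterator<I, F> {
    pub fn new(inner: I, keep: F) -> Self {
        let mut it = Self { inner, keep };
        it.skip_rejected(); // position on the first accepted entry
        it
    }

    fn skip_rejected(&mut self) {
        while self.inner.is_valid() && !(self.keep)(self.inner.key()) {
            self.inner.next();
        }
    }
}

impl<I: KvIterator, F: Fn(&[u8]) -> bool> KvIterator for FilterIterator<I, F> {
    fn key(&self) -> &[u8] { self.inner.key() }
    fn value(&self) -> &[u8] { self.inner.value() }
    fn is_valid(&self) -> bool { self.inner.is_valid() }
    fn next(&mut self) {
        self.inner.next();
        self.skip_rejected();
    }
}
```

The wrapper keeps the invariant that whenever it is valid it points at an accepted entry, so callers never need to know filtering is happening.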

Add a New Tutorial Chapter

  1. Create a new markdown file in mini-lsm-book/src/ (e.g., week1-08-advanced.md) following the structure of existing chapters (mini-lsm-book/src/week1-08-advanced.md)
  2. Add an entry to mini-lsm-book/src/SUMMARY.md in the appropriate week section with link and description (mini-lsm-book/src/SUMMARY.md)
  3. Include code snippets (using mdbook's include syntax) referencing corresponding implementations in mini-lsm-mvcc/src/ (mini-lsm-book/src/week1-08-advanced.md)
  4. Add SVG diagrams to mini-lsm-book/src/lsm-tutorial/ and reference them in your chapter markdown (mini-lsm-book/src/lsm-tutorial/week1-08-diagram.svg)

Implement a New Optimization Feature

  1. Add your feature flag or module to mini-lsm-mvcc/src/lib.rs (e.g., bloom filters in src/filter/) (mini-lsm-mvcc/src/filter/mod.rs)
  2. Extend the Block or LsmEngine struct to utilize the feature during reads/writes (mini-lsm-mvcc/src/block.rs)
  3. Document the optimization in a new week3 or week4 chapter in mini-lsm-book/src/ (mini-lsm-book/src/week3-07-compaction-filter.md)

🔧Why these technologies

  • Rust — Memory safety without GC, zero-cost abstractions for LSM operations, and strong type system for compile-time correctness
  • mdbook — Excellent for tutorial delivery with code snippets, versioning, and static hosting; enables collaborative course development
  • Cargo workspace — Separates starter code (mini-lsm-starter) from reference solution (mini-lsm-mvcc) and build tools (xtask), reducing student confusion
  • Block-based SST format — Enables efficient compression, caching, and partial reads; foundational to RocksDB-like engines
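To illustrate why a block-based format allows partial reads, the sketch below encodes entries followed by an offset table, then reads one entry by index without scanning the others. The layout (u16 little-endian lengths, offsets, and trailing count) and the names `BlockBuilder` and `entry_at` are assumptions for illustration, not a description of the repo's on-disk format, and the decoder does not guard against malformed input.

```rust
/// Builds one block: entries, then a u16 offset per entry, then a u16 entry count.
/// The layout is an assumption made for this example.
#[derive(Default)]
pub struct BlockBuilder {
    data: Vec<u8>,
    offsets: Vec<u16>,
}

impl BlockBuilder {
    pub fn add(&mut self, key: &[u8], value: &[u8]) {
        // Remember where this entry starts so readers can jump to it later.
        self.offsets.push(self.data.len() as u16);
        self.data.extend_from_slice(&(key.len() as u16).to_le_bytes());
        self.data.extend_from_slice(key);
        self.data.extend_from_slice(&(value.len() as u16).to_le_bytes());
        self.data.extend_from_slice(value);
    }

    pub fn encode(&self) -> Vec<u8> {
        let mut out = self.data.clone();
        for off in &self.offsets {
            out.extend_from_slice(&off.to_le_bytes());
        }
        out.extend_from_slice(&(self.offsets.len() as u16).to_le_bytes());
        out
    }
}

fn read_u16(buf: &[u8], at: usize) -> usize {
    u16::from_le_bytes([buf[at], buf[at + 1]]) as usize
}

/// Reads the `idx`-th entry by jumping through the offset table.
/// Assumes `block` came from `BlockBuilder::encode`.
pub fn entry_at(block: &[u8], idx: usize) -> Option<(&[u8], &[u8])> {
    let count = read_u16(block, block.len().checked_sub(2)?);
    if idx >= count {
        return None;
    }
    let offsets_start = block.len() - 2 - count * 2;
    let start = read_u16(block, offsets_start + idx * 2);
    let key_len = read_u16(block, start);
    let key = &block[start + 2..start + 2 + key_len];
    let val_at = start + 2 + key_len;
    let val_len = read_u16(block, val_at);
    Some((key, &block[val_at + 2..val_at + 2 + val_len]))
}
```

Because entries are sorted and individually addressable, a reader can binary search the offset table instead of decoding the whole block.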

⚖️Trade-offs already made

  • Separate mini-lsm-starter and mini-lsm-mvcc crates

    • Why: Lets students implement the engine from scratch while a reference implementation stays available for comparison
    • Consequence: Requires maintaining two versions of core modules; students must understand both starter scaffolds and full solutions
  • Implement three compaction strategies (simple, tiered, leveled) in reference solution

    • Why: Demonstrates design trade-offs between write amplification, read latency, and space amplification
    • Consequence: Complexity increases by week 2; students must understand multiple strategies before choosing one
  • 7-day (chapter) progression structured as weekly milestones

    • Why: Provides clear pacing and incremental learning; aligns with typical course/bootcamp schedules
    • Consequence: Non-linear learners may struggle with strict dependencies; requires implementing features before understanding full context
  • No distributed LSM; single-machine only

    • Why: Keeps the scope manageable for a 3-week teaching course; focuses on core LSM concepts without replication/consensus
    • Consequence: Limited real-world applicability for systems requiring durability across nodes; students must understand this is a foundation

🚫Non-goals (don't propose these)

  • Distributed replication or consensus protocols (single-machine only)
  • SQL query engine or semantic analysis (key-value only)
  • Network I/O or client-server protocol (embedded library only)
  • Multi-version concurrency control (MVCC) in week 1–2 (added in week 3)
  • Production-grade error handling or observability hooks
  • Compression beyond basic block-level encoding

🪤Traps & gotchas

  1. Students must modify mini-lsm-starter, not mini-lsm; editing the reference crate directly defeats the exercise.
  2. Tests are not in the starter crate by default; use cargo x copy-test --week W --day D to populate them.
  3. The xtask crate implements the custom cargo x subcommands; run cargo x install-tools before relying on the other subcommands.
  4. The book is published to GitHub Pages; a local cargo x book generates HTML in target/book/ but does not auto-reload unless you run mdbook serve separately.
  5. The workspace edition is "2024" (not 2021), which requires Rust 1.85 or newer; update your toolchain if the build rejects the edition.

🏗️Architecture

💡Concepts to learn

  • LSM Tree (Log-Structured Merge Tree) — The entire course is structured around how an LSM tree turns random writes into sequential I/O at the cost of write amplification; you need to grasp levels, compaction, and memtables to follow every week
  • Compaction (Leveled vs. Tiered) — Week 2 focuses on compaction strategies; the simulator (compaction-simulator binary) lets you experiment with trade-offs; picking wrong strategy ruins performance
  • SSTable (Sorted String Table) — Week 1 Day 4 core data structure; all persistent data is stored as immutable SSTables organized in levels; understanding block encoding is a prerequisite for Week 2
  • Bloom Filter — Week 1 Day 7 optimization; tells you if a key might exist without reading disk, cutting read amplification; Week 3 MVCC makes it critical for version filtering
  • Multi-Version Concurrency Control (MVCC) — Week 3 entire focus; enables snapshot isolation and lock-free reads; mini-lsm-mvcc demonstrates how LSM naturally supports MVCC via immutable SSTables
  • Key Compression (Prefix Encoding) — Week 1 Day 7 optimization; shrinks sorted keys by storing the prefix shared with a neighboring key only once; affects both in-memory and on-disk sizes
  • Write-Ahead Logging (WAL) — Week 2 Day 6 (Recovery); durability requires persisting writes to a log before committing to memtable; losing this step causes data loss on crash

Related repos

  • pingcap/talent-plan — Similar Rust education project covering distributed systems; TiKV course materials; overlapping audience of Rust learners
  • facebook/rocksdb — The production LSM implementation that mini-lsm teaches through simplified design; reference to understand trade-offs made in course simplification
  • skyzh/mini-lsm-solution-checkpoint — Companion repo with each commit corresponding to one course chapter; for instructors or students tracking day-by-day progress via git history
  • google/leveldb — Google's early open-source LSM engine and the ancestor of RocksDB; source of many architectural patterns the course teaches (levels, compaction strategies)
  • etcd-io/etcd — Production Go key-value store for distributed consensus; its storage backend (bbolt) is a B+ tree rather than an LSM, so it is useful mainly as a contrast to the engine built here

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive integration tests for LSM compaction strategies

The repo has detailed documentation about leveled vs tiered compaction (week2-*.svg diagrams exist), but there are no visible integration tests verifying the compaction behavior matches the documented strategies. This would validate that the course implementations actually produce the intended LSM tree shapes and compaction patterns.

  • [ ] Create tests/compaction_integration.rs that creates LSM engines with known data patterns
  • [ ] Add test cases validating leveled compaction maintains level invariants (referenced in mini-lsm-book/src/05-compaction.md)
  • [ ] Add test cases validating tiered compaction behavior
  • [ ] Verify tests run in CI by adding them to .github/workflows/main.yml if not already present
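One invariant such tests could assert is that SSTs within a sorted level (L1 and below) are ordered and do not overlap. The helper below is a minimal sketch of that check; `SstRange` and `check_level` are hypothetical names, and a real test would build the ranges from the engine's own metadata.

```rust
/// Key range covered by one SST, inclusive on both ends. Hypothetical type.
#[derive(Debug, Clone)]
pub struct SstRange {
    pub first_key: Vec<u8>,
    pub last_key: Vec<u8>,
}

/// Checks that the SSTs of one sorted level are ordered and non-overlapping.
/// Returns the index of the first offending SST on failure.
pub fn check_level(level: &[SstRange]) -> Result<(), usize> {
    for (i, sst) in level.iter().enumerate() {
        // Each SST must cover a well-formed range.
        if sst.first_key > sst.last_key {
            return Err(i);
        }
        // Each SST must start strictly after the previous one ends.
        if i > 0 && level[i - 1].last_key >= sst.first_key {
            return Err(i);
        }
    }
    Ok(())
}
```

Returning the offending index instead of a bare boolean makes a failing integration test point straight at the bad file.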

Add benchmark suite and CI benchmarking workflow

The repo teaches LSM optimization techniques (bloom filters and key compression in week1-07-sst-optimizations) but has no visible benchmarks to measure the performance impact of these features. Adding benchmarks with a CI workflow would let contributors verify that their optimizations actually improve performance.

  • [ ] Create benches/ directory with criterion benchmarks for read/write operations
  • [ ] Add benchmarks comparing performance with/without bloom filters (referenced in mini-lsm-book/src/07-bloom-filter.md)
  • [ ] Add benchmarks comparing performance with/without key compression (referenced in mini-lsm-book/src/08-key-compression.md)
  • [ ] Create a new GitHub Action workflow (.github/workflows/benchmark.yml) to run benchmarks on main branch

Add recovery/crash-consistency tests for durability features

The course includes Week 1 Day 6 (write-path) and an entire chapter on recovery (mini-lsm-book/src/06-recovery.md), but there are no visible tests validating crash-consistency or recovery correctness. This is critical for a storage engine to actually verify durability claims.

  • [ ] Create tests/recovery_integration.rs with crash simulation tests
  • [ ] Add tests that write data, simulate crashes at various points (before/after flush, during compaction), and verify correct recovery
  • [ ] Test WAL replay correctness after simulated crashes
  • [ ] Verify recovery tests are run in CI by adding to .github/workflows/main.yml
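A crash in the middle of an append leaves a torn record at the tail of the log, so a replay test can simulate a crash simply by truncating the bytes. The sketch below shows that idea on an in-memory log. The record layout, the toy checksum, and the function names are all assumptions for illustration and do not describe the repo's WAL format.

```rust
/// Appends one record: key_len (u32 LE), val_len (u32 LE), key, value, checksum (u32 LE).
/// Layout and checksum are assumptions made for this example.
pub fn append_record(log: &mut Vec<u8>, key: &[u8], value: &[u8]) {
    let start = log.len();
    log.extend_from_slice(&(key.len() as u32).to_le_bytes());
    log.extend_from_slice(&(value.len() as u32).to_le_bytes());
    log.extend_from_slice(key);
    log.extend_from_slice(value);
    let sum = checksum(&log[start..]);
    log.extend_from_slice(&sum.to_le_bytes());
}

/// Toy rolling checksum; a real log would use something like CRC32.
fn checksum(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0u32, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u32))
}

fn read_u32(buf: &[u8], at: usize) -> Option<usize> {
    let s = buf.get(at..at + 4)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]) as usize)
}

/// Parses the record starting at `pos`; `None` means torn or corrupt.
fn read_record(log: &[u8], pos: usize) -> Option<((Vec<u8>, Vec<u8>), usize)> {
    let key_len = read_u32(log, pos)?;
    let val_len = read_u32(log, pos + 4)?;
    let body_end = pos + 8 + key_len + val_len;
    let body = log.get(pos..body_end)?;
    let stored = read_u32(log, body_end)? as u32;
    if stored != checksum(body) {
        return None;
    }
    let key = body[8..8 + key_len].to_vec();
    let value = body[8 + key_len..].to_vec();
    Some(((key, value), body_end + 4))
}

/// Replays complete records and stops at the first torn or corrupt one.
pub fn replay(log: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some((record, next)) = read_record(log, pos) {
        out.push(record);
        pos = next;
    }
    out
}
```

A recovery test can then append a few records, cut the buffer at arbitrary byte offsets, and assert that replay returns exactly the records that were fully written before the cut.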

🌿Good first issues

  • Add integration tests for Week 1 Day 4 (SST format) in mini-lsm-starter/tests/; currently the chapter has no reference test file visible in the structure, so write tests that verify key-value iteration across block boundaries matches the reference implementation's behavior.
  • Expand mini-lsm-book/src/07-bloom-filter.md with a concrete code walkthrough showing how BloomFilter::may_contain() is called in get() paths; the book describes the concept but lacks the integration point in the engine flow diagrams.
  • Add a missing troubleshooting section to mini-lsm-book/src/00-get-started.md documenting common errors when running cargo x copy-test (e.g., week/day out of range, missing xtask tools); new students report confusion here based on the book's current brevity.
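For the Bloom filter walkthrough suggested in the list above, the integration point is small: a point lookup asks the filter first and only reads the table when the answer is "maybe". The sketch below shows that shape using std only. `Bloom`, `filtered_get`, the double-hashing scheme, and the `BTreeMap` standing in for an SST are assumptions for illustration, not the repo's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter (hypothetical): `hashes` probes into a bit vector.
pub struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    pub fn new(num_bits: usize, hashes: u64) -> Self {
        Self { bits: vec![false; num_bits.max(1)], hashes: hashes.max(1) }
    }

    /// Two base hashes; the second is salted so it differs from the first.
    fn base_hashes(key: &[u8]) -> (u64, u64) {
        let mut a = DefaultHasher::new();
        key.hash(&mut a);
        let mut b = DefaultHasher::new();
        (key, 0x9e37u16).hash(&mut b);
        (a.finish(), b.finish() | 1)
    }

    /// Position of the i-th probe, derived by double hashing.
    fn slot(&self, h1: u64, h2: u64, i: u64) -> usize {
        (h1.wrapping_add(i.wrapping_mul(h2)) % self.bits.len() as u64) as usize
    }

    pub fn insert(&mut self, key: &[u8]) {
        let (h1, h2) = Self::base_hashes(key);
        for i in 0..self.hashes {
            let s = self.slot(h1, h2, i);
            self.bits[s] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    pub fn may_contain(&self, key: &[u8]) -> bool {
        let (h1, h2) = Self::base_hashes(key);
        (0..self.hashes).all(|i| self.bits[self.slot(h1, h2, i)])
    }
}

/// Point lookup that consults the filter before paying for the table read.
pub fn filtered_get<'a>(
    filter: &Bloom,
    table: &'a BTreeMap<Vec<u8>, Vec<u8>>,
    key: &[u8],
) -> Option<&'a Vec<u8>> {
    if !filter.may_contain(key) {
        return None; // definitely absent: skip the read entirely
    }
    table.get(key) // may still miss on a false positive
}
```

The filter never produces false negatives for inserted keys, so skipping the read on a negative answer is safe; false positives only cost an unnecessary read.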

📝Recent commits

  • 427c6cc — Add tiny-lsm implementation link to SOLUTIONS.md (#176) (mehrdad3301)
  • a07e833 — docs: fix slateDB link error (#171) (Standing-Man)
  • 49053e5 — bump dependencies (#170) (skyzh)
  • 88543a0 — chore: fix many clippy warnings (#166) (Standing-Man)
  • 85de1c1 — bump dependencies + change resolver (#163) (skyzh)
  • f484bc5 — adds self solution link to SOLUTIONS.md (#158) (7143192)
  • c6b7ff8 — docs: update week2-03-tiered.md (#154) (KKKZOZ)
  • af96807 — Replace a predicate that was always false with a literal (#151) (jkosh44)
  • 067fd2e — docs: update the introduction of StorageIterator (#152) (KKKZOZ)
  • fc4765b — Fix wrong input type of put_batch (#146) (ppdogg)

🔒Security observations

This is a Rust educational codebase with a generally good security posture. The primary issues are configuration-related rather than exploitable vulnerabilities. The workspace targets the Rust 2024 edition, which is valid but requires a recent toolchain. The codebase lacks a security policy and visible dependency vulnerability scanning in CI/CD. No hardcoded secrets, injection risks, or infrastructure misconfigurations are evident from the provided file structure. Regular dependency audits and a responsible disclosure policy are recommended best practices for this open-source project.

  • Low · Rust 2024 Edition Requires a Recent Toolchain — Cargo.toml - [workspace.package] section. The workspace.package edition is set to '2024'. This is a valid edition, stabilized in Rust 1.85, so toolchains older than that fail to build the workspace. Fix: keep edition = '2024' and build with Rust 1.85 or newer (for example after rustup update); downgrading to '2021' is only needed if older toolchains must be supported.
  • Low · Minimal Dependency Pinning — Cargo.toml - [workspace.dependencies] section. The workspace dependencies use loose version constraints (e.g., 'anyhow = "1"', 'bytes = "1"'). While this allows flexibility, it could potentially pull in minor/patch versions with unexpected behavior changes. Consider evaluating dependency updates regularly. Fix: Regularly audit and test dependency updates. Consider using 'cargo audit' to check for known vulnerabilities in dependencies.
  • Low · Missing Security Policy — Repository root. No SECURITY.md or security policy file is visible in the repository structure. This makes it difficult for security researchers to report vulnerabilities responsibly. Fix: Create a SECURITY.md file with instructions for reporting security vulnerabilities (e.g., via email to maintainers rather than public issues).
  • Low · No Evidence of Dependency Vulnerability Scanning — .github/workflows/. The CI/CD workflows (main.yml, pr.yml) content is not provided, making it unclear if automated security scanning (cargo audit, SAST) is configured. Fix: Add 'cargo audit' step to CI pipeline to automatically detect known vulnerabilities in dependencies. Consider integrating additional security scanning tools.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
