etcd-io/etcd
Distributed reliable key-value store for the most critical data of a distributed system
Healthy across the board
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit 1d ago
- ✓ 15 active contributors
- ✓ Distributed ownership (top contributor 35% of recent commits)
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ✓ Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/etcd-io/etcd)
Paste at the top of your README.md — renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/etcd-io/etcd on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: etcd-io/etcd
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in the Verify before trusting section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/etcd-io/etcd shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 1d ago
- 15 active contributors
- Distributed ownership (top contributor 35% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live etcd-io/etcd repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/etcd-io/etcd.
What it runs against: a local clone of etcd-io/etcd — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in etcd-io/etcd | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of etcd-io/etcd. If you don't
# have one yet, run these first:
#
# git clone https://github.com/etcd-io/etcd.git
# cd etcd
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of etcd-io/etcd and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "etcd-io/etcd(\.git)?\b" \
  && ok "origin remote is etcd-io/etcd" \
  || miss "origin remote is not etcd-io/etcd (artifact may be from a fork)"
# 2. License matches what RepoPilot saw. Note: the Apache LICENSE file
# contains "Apache License ... Version 2.0", not the SPDX identifier,
# so match the full name rather than "Apache-2.0".
(grep -qi "Apache License" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
for f in \
  api/etcdserverpb/rpc.proto \
  api/go.mod \
  api/mvccpb/kv.proto \
  api/authpb/auth.proto \
  api/etcdserverpb/etcdserver.proto
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/etcd-io/etcd"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).
⚡TL;DR
etcd is a distributed, strongly consistent key-value store that uses the Raft consensus algorithm to replicate data across a cluster. It exposes a gRPC API with automatic TLS support and is designed for storing critical metadata in distributed systems such as Kubernetes, sustaining on the order of 10,000 writes/sec with guaranteed durability and availability. Monorepo structure: go.etcd.io/etcd/api/v3 (protobuf API definitions), server (Raft integration + storage), clientv3 (client library), etcdctl (CLI tool). Consensus is provided by the Raft implementation (historically the in-tree raft/ package, now maintained separately as go.etcd.io/raft); persistent storage sits behind a backend abstraction over bbolt, the maintained fork of BoltDB. Workflows in .github/workflows/ orchestrate CI, including Antithesis chaos testing.
👥Who it's for
Platform engineers and DevOps teams building distributed systems (especially Kubernetes clusters) who need a reliable, highly-available data store for configuration, service discovery, and coordination. Contributors are typically infrastructure-focused Go developers familiar with consensus algorithms and distributed systems.
🌱Maturity & risk
etcd is production-ready with extensive real-world adoption (Kubernetes core dependency, many CNCF projects). The codebase has comprehensive testing infrastructure (CI workflows for Antithesis testing, CodeQL analysis, and flakiness measurement), active maintenance, and spans 5.7M lines of Go. However, the main branch is explicitly marked unstable during development; use releases for stability.
Risk is low for production use of released versions, but the main branch carries instability risk. Dependencies are minimal and well-maintained (protobuf, grpc, semver). The Raft consensus implementation is complex and bugs could cause data-consistency issues, though this is mitigated by extensive robustness testing in tests/robustness/. Single-maintainer risk is low given CNCF governance and a broad contributor base.
Active areas of work
Active development on reliability and testing: the repo includes workflows such as antithesis-test.yml (deterministic simulation testing), gh-workflow-approve.yaml, measure-testgrid-flakiness.yaml, and codeql-analysis.yml. Recent focus on module management is visible in Documentation/contributor-guide/modules.md, dependency management in dependency_management.md, and exit-code standardization.
🚀Get running
Clone and build with: git clone https://github.com/etcd-io/etcd.git && cd etcd && make build (requires Go 1.26+ per .go-version). Verify with ./bin/etcd --version. For local cluster dev: make run or follow Documentation/contributor-guide/local_cluster.md. Run tests with make test.
Daily commands:
Start single etcd instance: ./bin/etcd (listens on 2379 for client, 2380 for peer). For local 3-node cluster use the script in Documentation/contributor-guide/local_cluster.md or run make run which uses etcdctl in proxy mode. Monitor with etcdctl --endpoints=localhost:2379 member list.
🗺️Map of the codebase
- api/etcdserverpb/rpc.proto — Core protobuf definition for all etcd RPC APIs (KV, Auth, Lease, Watch, etc.); essential for understanding the client-server contract
- api/go.mod — API module dependencies; this is a separate module published independently and defines what etcd clients depend on
- api/mvccpb/kv.proto — Protobuf definitions for key-value data structures used throughout etcd's data model
- api/authpb/auth.proto — Authentication and authorization protobuf definitions; critical for role-based access control
- api/etcdserverpb/etcdserver.proto — Server-internal RPC definitions for inter-member communication and cluster management
- api/etcdserverpb/raft_internal.proto — Raft consensus protocol messages; essential for understanding distributed coordination
- README.md — Project overview, use cases, and quick start; all contributors should understand etcd's positioning
🛠️How to make changes
Add a New RPC Method to the etcd Client API
- Define the new RPC method and request/response messages in the service definition (api/etcdserverpb/rpc.proto)
- If the RPC introduces new data structures, add protobuf message definitions to kv.proto or authpb/auth.proto (api/mvccpb/kv.proto)
- Run protoc to generate Go stubs and the HTTP gateway (api/etcdserverpb/rpc_grpc.pb.go)
- Implement the server-side handler in the main etcd server (not in this module, but referenced via api/v3) (README.md)
- Update go.mod if new external dependencies are required and verify .gomodguard.yaml allows them (api/.gomodguard.yaml)
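As a concrete illustration of the first step, a service extension might look like the sketch below. The RPC and message names are invented for this example; only ResponseHeader is an existing etcd type, and the field numbers would need to be chosen against the real file.

```protobuf
// Hypothetical addition to api/etcdserverpb/rpc.proto — names invented.
service KV {
  // ...existing RPCs (Range, Put, DeleteRange, Txn, Compact)...

  // CountKeys returns the number of keys in a range without their values.
  rpc CountKeys(CountKeysRequest) returns (CountKeysResponse) {}
}

message CountKeysRequest {
  bytes key = 1;        // start of the range
  bytes range_end = 2;  // exclusive end of the range
}

message CountKeysResponse {
  ResponseHeader header = 1;  // standard etcd response header (real type)
  int64 count = 2;
}
```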
Add Authentication/Authorization Rules
- Define new permission types or user role structures in the auth protobuf schema (api/authpb/auth.proto)
- Add corresponding request/response messages if a new auth RPC is needed (api/etcdserverpb/rpc.proto)
- Regenerate protobuf files to produce Go code with the new auth types (api/authpb/auth.pb.go)
- Document the new auth model in the contributor guide (Documentation/contributor-guide/logging.md)
Modify Inter-Member (Raft) Protocol
- Update Raft message definitions and the internal coordination protocol in raft_internal.proto (api/etcdserverpb/raft_internal.proto)
- If cluster membership changes, update the membership protobuf as well (api/membershippb/membership.proto)
- Regenerate and verify backward compatibility, including the Raft stringer tests (api/etcdserverpb/raft_internal_stringer_test.go)
- Update cluster upgrade documentation in the contributor guide (Documentation/contributor-guide/release.md)
🔧Why these technologies
- Protocol Buffers (protobuf3) — Language-agnostic serialization; enables auto-generation of client libraries in Go, Python, Java, C++, etc.; compact binary format critical for etcd's network and storage efficiency
- gRPC — High-performance RPC framework built on HTTP/2; supports streaming (Watch), multiplexing, and backpressure; industry standard for distributed systems
- gRPC Gateway (grpc-gateway/v2) — Bridges gRPC and HTTP/JSON; allows REST clients to interact with etcd without requiring gRPC libraries
- Raft Consensus (via raft_internal) — Proven distributed consensus algorithm; ensures strong consistency across cluster members
- Go 1.26 — High performance, concurrency primitives (goroutines, channels), excellent standard library; primary language for cloud infrastructure
⚖️Trade-offs already made
- Separate api/ module published independently
  - Why: Clients only depend on protobuf definitions and generated code; decouples the client library lifecycle from server releases
  - Consequence: Adds complexity to the monorepo structure and release process; requires careful version management
- gRPC as primary protocol with HTTP/JSON gateway overlay
  - Why: gRPC provides performance and streaming; HTTP/JSON provides universal accessibility
  - Consequence: Must maintain parity between the gRPC and HTTP APIs; the gateway adds latency and overhead for HTTP requests
- Strong consistency via Raft (not eventual consistency)
  - Why: Critical for distributed systems managing configuration and state; prevents split-brain
  - Consequence: Write latency is bounded by leader election and log replication; scales vertically better than horizontally
- MVCC (Multi-Version Concurrency Control) for the key-value store
  - Why: Supports non-blocking reads and watch mechanisms without locking; enables consistent point-in-time reads
  - Consequence: Storage overhead from keeping multiple versions; garbage-collection complexity for old revisions
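To make the MVCC trade-off concrete, here is a toy revision store. This is a deliberately simplified sketch, not etcd's actual mvcc package: every Put appends a new version under a global revision counter instead of overwriting, so reads at an old revision still see a consistent snapshot until compaction.

```go
package main

import "fmt"

// version is one historical value of a key, tagged with the global
// revision at which it was written.
type version struct {
	rev   int64
	value string
}

type mvccStore struct {
	rev  int64                // global revision counter, bumped on every write
	keys map[string][]version // per-key version history, oldest first
}

func newStore() *mvccStore { return &mvccStore{keys: map[string][]version{}} }

// Put appends a new version instead of overwriting, and returns the
// revision it was written at.
func (s *mvccStore) Put(key, value string) int64 {
	s.rev++
	s.keys[key] = append(s.keys[key], version{s.rev, value})
	return s.rev
}

// Get returns the value of key as of revision rev (0 means latest).
func (s *mvccStore) Get(key string, rev int64) (string, bool) {
	vs := s.keys[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if rev == 0 || vs[i].rev <= rev {
			return vs[i].value, true
		}
	}
	return "", false
}

func main() {
	s := newStore()
	r1 := s.Put("foo", "a") // written at revision 1
	s.Put("foo", "b")       // written at revision 2
	latest, _ := s.Get("foo", 0)  // "b"
	old, _ := s.Get("foo", r1)    // still "a" at the old revision
	fmt.Println(latest, old)
}
```

The storage-overhead consequence falls straight out of this shape: the history slice grows on every write, which is why real etcd needs compaction to discard revisions older than a threshold.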
🚫Non-goals (don't propose these)
- Not a relational database: no SQL, joins, or complex queries
- Not a real-time messaging queue: etcd Watch is a change-notification stream, not a durable pub/sub system
- Not a graph database or document store: flat key-value model only
- Not a password manager: stores data unencrypted at rest by default
- Not a replacement for Zookeeper in all use cases: different consistency model and API
🪤Traps & gotchas
1. Raft state transitions are not enforced at compile time — mutations can corrupt quorum state if you skip validation (always check HasLeader() / CurrentLeaderId).
2. Protocol buffer version mismatches between api/v3 and server code cause silent data corruption; regenerate protos after any .proto change (make proto).
3. Local cluster testing requires specific port ranges (2379-2382 for clients, 2380-2383 for peers); firewall or Docker port conflicts fail silently.
4. Snapshots and WAL logs must stay synchronized; deleting only one causes recovery failure.
5. The main branch is explicitly unstable during development; always test against a specific release tag for production validation.
6. TLS certificate generation requires proper SANs or client connections fail cryptically; see SECURITY.md for mTLS setup.
🏗️Architecture
💡Concepts to learn
- Raft Consensus Algorithm — etcd's entire reliability and data consistency guarantee rests on Raft; understanding leader election, log replication, and term progression is mandatory for modifying replication logic
- Write-Ahead Logging (WAL) — etcd persists all Raft log entries to disk before applying them; WAL corruption or misalignment with snapshots causes unrecoverable cluster state—critical for understanding durability guarantees
- Distributed Consensus and Quorum — etcd requires majority quorum (N/2+1 nodes) for any decision; understanding quorum semantics prevents split-brain scenarios and data loss in partitioned networks
- MVCC (Multi-Version Concurrency Control) — etcd implements MVCC for snapshot isolation and watch subscriptions; keys have revision numbers and etcd maintains version history, enabling consistent reads without locking
- Protocol Buffers & gRPC — etcd API is entirely defined via .proto files and served over gRPC; changes to message schemas require careful versioning to maintain backward compatibility with older clients
- B+Tree Indexing (bbolt) — etcd uses bbolt (embedded B+tree) for persistent key-value storage; understanding range queries, cursor iteration, and transaction isolation is essential for storage layer debugging
- Chaos Testing & Formal Verification — etcd uses antithesis (property-based testing) in CI to find bugs under network failures and Byzantine conditions; the robustness testing philosophy is core to etcd's reliability culture and hidden assumptions in the codebase
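The quorum arithmetic behind several of these concepts is worth internalizing; it is plain integer math, sketched here outside of any etcd code:

```go
package main

import "fmt"

// quorum returns the minimum number of members (N/2 + 1) that must agree
// before the cluster can commit a write or elect a leader.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// faultTolerance is how many members can fail while the cluster stays
// available (i.e. still able to reach quorum).
func faultTolerance(clusterSize int) int {
	return clusterSize - quorum(clusterSize)
}

func main() {
	for _, n := range []int{1, 3, 5, 7} {
		fmt.Printf("cluster=%d quorum=%d tolerates=%d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that even cluster sizes buy nothing: a 4-node cluster needs a quorum of 3 and tolerates only 1 failure, the same as a 3-node cluster, which is why odd sizes are recommended.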
🔗Related repos
- hashicorp/consul — Alternative distributed KV store also using Raft, with integrated service discovery; direct competitor in the same problem space
- tikv/tikv — Distributed KV store using Raft for replication with ACID transactions; different architectural choices (Rust, key-range sharding) for similar reliability goals
- kubernetes/kubernetes — Primary production consumer of etcd; Kubernetes stores all cluster state (nodes, pods, config) in etcd, making this the reference workload for etcd design decisions
- etcd-io/etcd-io.github.io — Official etcd documentation site and project governance repository; contains release coordination and community decision records
- etcd-io/bbolt — Underlying embedded B+tree database (forked from boltdb) that handles etcd's persistent storage layer and transaction semantics
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive test coverage for exit codes documentation in etcd
Documentation/contributor-guide/exit_codes.md exists but there's no corresponding test suite validating that all documented exit codes are actually implemented and returned correctly throughout the codebase. This would prevent documentation drift and ensure exit codes remain reliable for monitoring/alerting systems that depend on etcd.
- [ ] Review Documentation/contributor-guide/exit_codes.md to catalog all documented exit codes
- [ ] Search codebase for all exit code usages and verify coverage completeness
- [ ] Create tests/integration/exit_codes_test.go with test cases for each documented exit code scenario
- [ ] Add CI validation step to ensure exit code documentation stays synchronized with implementation
Implement missing GitHub Actions workflow for dependency security scanning beyond Dependabot
The repo has .github/dependabot.yml but lacks a dedicated security audit workflow. Given etcd's critical role as a distributed key-value store for critical system data, add a GitHub Actions workflow that runs 'go mod graph' vulnerability scans and licenses compliance checks on every PR, similar to the existing codeql-analysis.yml pattern.
- [ ] Create .github/workflows/dependency-audit.yml workflow file
- [ ] Integrate 'go list -json -m all' with vulnerability database scanning (govulncheck or similar)
- [ ] Add license compatibility checks for all transitive dependencies
- [ ] Configure workflow to run on pull_request and schedule events (similar to scorecards.yml pattern)
- [ ] Document the new workflow in CONTRIBUTING.md
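A sketch of what the proposed workflow file could look like, assuming govulncheck as the scanner. The action versions, schedule, and step details are illustrative, not taken from the repo:

```yaml
# Hypothetical .github/workflows/dependency-audit.yml — illustrative only.
name: dependency-audit
on:
  pull_request:
  schedule:
    - cron: "0 6 * * 1" # weekly, Monday 06:00 UTC
permissions:
  contents: read
jobs:
  govulncheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version-file: .go-version
      - name: Install govulncheck
        run: go install golang.org/x/vuln/cmd/govulncheck@latest
      - name: Scan for known vulnerabilities
        run: govulncheck ./...
```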
Add integration tests for branch management and cherry-pick workflows documented in contributor guide
Documentation/contributor-guide/branch_management.md and cherry-pick.md exist but lack automated validation tests. Create tests that validate the documented workflows actually work correctly, including branch naming conventions, cherry-pick compatibility checks, and release branch procedures.
- [ ] Review Documentation/contributor-guide/branch_management.md and cherry-pick.md for all documented procedures
- [ ] Create tests/integration/branch_management_test.go to validate branch naming patterns and protections
- [ ] Add validation logic for cherry-pick compatibility (ensure commits can be cleanly applied to release branches)
- [ ] Create a GitHub Actions workflow (similar to .github/workflows/cherrypick-bot-ok-to-test.yaml) that auto-validates PRs follow branch management guidelines
- [ ] Document test execution requirements in CONTRIBUTING.md
🌿Good first issues
- Add integration tests for new protobuf fields in api/v3: Pick an unused proto field in KV.proto and write a test in clientv3/*_test.go verifying round-trip serialization and server handling.
- Document Raft log compaction: The raft/ package has snapshot and WAL truncation logic but contributors frequently misunderstand when snapshots are created. Add code examples in Documentation/contributor-guide/ showing the relationship between ApplyIndex and SnapshotIndex.
- Add exit code documentation for specific failure modes: exit_codes.md exists but is incomplete; map actual os.Exit() calls in server/*.go to documented codes, then write integration test in tests/ verifying each code is reachable.
⭐Top contributors
- @ahrtr — 35 commits
- @serathius — 16 commits
- @Jefftree — 15 commits
- @ivanvc — 12 commits
- @fuweid — 11 commits
📝Recent commits
- e838001 — Merge pull request #21718 from Jefftree/rangestream-gofail-delay (serathius)
- e8a9571 — Merge pull request #21717 from fuweid/gogo-part-5-cmdline (ahrtr)
- 32a860b — etcdctl/ctlv3/command: use pointer for gogopb responses (fuweid)
- 4f074ee — Merge pull request #21666 from Kunalbehbud/fix/maintenance-status-non-admin (ahrtr)
- 05749fa — Merge pull request #21726 from silentred/bump-go-1.26.3 (ahrtr)
- 8618f08 — Merge pull request #21722 from etcd-io/dependabot/github_actions/github/codeql-action-4.35.4 (ahrtr)
- 72a0d55 — Bump Go to 1.26.3 (silentred)
- e66a0a4 — build(deps): bump github/codeql-action from 4.35.2 to 4.35.4 (dependabot[bot])
- fef4394 — etcdserver: add beforeRangeStreamChunk gofail point (Jefftree)
- 0cbc031 — Merge pull request #21716 from AwesomePatrol/fix-etcd-k8s-patches (serathius)
🔒Security observations
The etcd codebase demonstrates good security practices including use of distroless containers, security workflows (CodeQL, Scorecards), and a formal security policy. Primary concerns involve mixed protobuf library versions and the deprecated gogo/protobuf dependency. The project uses a modern Go version with actively maintained dependencies. Network port exposure is appropriately documented through exposed ports. Recommend: (1) Complete migration from gogo/protobuf to google.golang.org/protobuf, (2) Regular dependency scanning and updates, (3) Enhanced container image verification, (4) HEALTHCHECK implementation in Docker. Overall security posture is strong for a critical infrastructure project.
- Medium · Outdated Go Version — .go-version, api/go.mod. The project specifies Go 1.26.3 in .go-version and go.mod files. Ensure that all dependencies are compatible and regularly updated with security patches. Fix: Monitor Go security advisories and update dependencies regularly using 'go get -u' and 'go mod tidy'. Consider using Dependabot for automated dependency updates.
- Medium · Protobuf Library Mixed Versions — api/go.mod. The project uses both 'github.com/gogo/protobuf v1.3.2' (deprecated) and 'google.golang.org/protobuf v1.36.11'. gogo/protobuf is community-maintained and may receive security updates more slowly than the official Google protobuf library. Fix: Migrate entirely to 'google.golang.org/protobuf' and remove the gogo/protobuf dependency. Update all proto generation and usage accordingly.
- Low · Distroless Base Image Without Pinned Hash Verification — Dockerfile. While the Dockerfile uses a distroless image (a security best practice), the base image is pulled from gcr.io. Ensure supply-chain security by verifying the image digest and consider using a private registry mirror. Fix: Regularly scan the base image for vulnerabilities using tools like 'trivy'. Consider implementing image signing and verification using Cosign or similar tools.
- Low · Exposed Network Ports Without Documentation — Dockerfile. The Dockerfile exposes ports 2379 and 2380 (etcd client and peer communication). While necessary for operation, ensure proper network policies and authentication mechanisms protect these ports. Fix: Document security best practices for port exposure in deployment guides. Ensure TLS/mTLS is configured for production deployments. Consider requiring authentication on both ports.
- Low · No HEALTHCHECK Directive in Dockerfile — Dockerfile. The Dockerfile lacks a HEALTHCHECK instruction, which could prevent orchestrators from detecting unhealthy etcd instances. Fix: Add a HEALTHCHECK directive using etcdctl to verify instance health, e.g., 'HEALTHCHECK CMD etcdctl endpoint health'.
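A sketch of the suggested HEALTHCHECK fix. The etcdctl path is an assumption about the image layout, and since the image is distroless (no shell), the exec form is required rather than the shell form shown in the finding:

```dockerfile
# Hypothetical addition to the etcd Dockerfile — path is illustrative.
# Distroless images have no /bin/sh, so use the exec (JSON-array) form.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD ["/usr/local/bin/etcdctl", "endpoint", "health"]
```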
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.