RepoPilot

etcd-io/etcd

Distributed reliable key-value store for the most critical data of a distributed system

Healthy

Healthy across the board

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 1d ago
  • 15 active contributors
  • Distributed ownership (top contributor 35% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

```markdown
[![RepoPilot: Healthy](https://repopilot.app/api/badge/etcd-io/etcd)](https://repopilot.app/r/etcd-io/etcd)
```

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/etcd-io/etcd on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: etcd-io/etcd

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/etcd-io/etcd shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 1d ago
  • 15 active contributors
  • Distributed ownership (top contributor 35% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live etcd-io/etcd repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/etcd-io/etcd.

What it runs against: a local clone of etcd-io/etcd — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in etcd-io/etcd | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches a relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>etcd-io/etcd</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of etcd-io/etcd. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/etcd-io/etcd.git
#   cd etcd
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of etcd-io/etcd and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "etcd-io/etcd(\.git)?\b" \
  && ok "origin remote is etcd-io/etcd" \
  || miss "origin remote is not etcd-io/etcd (artifact may be from a fork)"

# 2. License matches what RepoPilot saw.
# Note: an Apache-2.0 LICENSE file begins "Apache License", not "Apache-2.0",
# so we match the header rather than the SPDX identifier.
(head -n 5 LICENSE 2>/dev/null | grep -qi "Apache License" \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in \
  api/etcdserverpb/rpc.proto \
  api/go.mod \
  api/mvccpb/kv.proto \
  api/authpb/auth.proto \
  api/etcdserverpb/etcdserver.proto
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/etcd-io/etcd"
  exit 1
fi
```

Each check prints `ok:` or `FAIL:`. The script exits non-zero if anything failed, so it composes cleanly into agent loops (`./verify.sh || regenerate-and-retry`).

</details>
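The same exit-code contract can drive agent-side glue in Go. A minimal, hypothetical sketch: the verify-script path and the use of `true`/`false` as stand-ins for a passing or failing run are assumptions, not part of RepoPilot.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runVerify executes a verification command and reports whether it
// exited zero. It mirrors the `./verify.sh || regenerate-and-retry`
// shell pattern inside an agent loop.
func runVerify(name string, args ...string) bool {
	return exec.Command(name, args...).Run() == nil
}

func main() {
	// "true" and "false" stand in for ./verify.sh passing or failing.
	if runVerify("true") {
		fmt.Println("artifact verified, proceeding with edits")
	}
	if !runVerify("false") {
		fmt.Println("artifact stale, regenerate before editing")
	}
}
```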

TL;DR

etcd is a distributed, strongly consistent key-value store that uses the Raft consensus algorithm to replicate data across a cluster. It provides a gRPC API with automatic TLS support and is designed for storing critical metadata in distributed systems such as Kubernetes, sustaining roughly 10,000 writes/sec with guaranteed durability and availability. It is a monorepo: go.etcd.io/etcd/api/v3 holds the protobuf API definitions, the server package implements Raft consensus plus storage, clientv3 is the client library, and etcdctl is the CLI tool. Core Raft state management lives in the raft package; storage sits behind an abstraction layer (historically Bolt, currently bbolt). Workflows in .github/workflows/ orchestrate CI, including Antithesis chaos testing.

👥Who it's for

Platform engineers and DevOps teams building distributed systems (especially Kubernetes clusters) who need a reliable, highly-available data store for configuration, service discovery, and coordination. Contributors are typically infrastructure-focused Go developers familiar with consensus algorithms and distributed systems.

🌱Maturity & risk

etcd is production-ready with extensive real-world adoption (Kubernetes core dependency, many CNCF projects). The codebase has comprehensive testing infrastructure (CI workflows for antithesis testing, codeql analysis, flakiness measurement), active maintenance, and spans 5.7M lines of Go. However, the main branch is explicitly marked unstable during development—use releases for stability.

Risk is low for production use of released versions, but the main branch carries instability risk. Dependencies are minimal and well-maintained (protobuf, grpc, semver). The Raft consensus implementation is complex and bugs could cause data consistency issues, though mitigated by extensive robustness testing in tests/robustness/. No single-maintainer risk due to CoreOS/Red Hat backing.

Active areas of work

Active development on reliability and testing: the repo includes workflows for antithesis-test.yml (formal property testing), gh-workflow-approve.yaml, measure-testgrid-flakiness.yaml, and codeql-analysis. Recent focus on module management visible in Documentation/contributor-guide/modules.md, dependency management in dependency_management.md, and exit codes standardization.

🚀Get running

Clone and build with: git clone https://github.com/etcd-io/etcd.git && cd etcd && make build (requires Go 1.26+ per .go-version). Verify with ./bin/etcd --version. For local cluster dev: make run or follow Documentation/contributor-guide/local_cluster.md. Run tests with make test.

Daily commands: Start single etcd instance: ./bin/etcd (listens on 2379 for client, 2380 for peer). For local 3-node cluster use the script in Documentation/contributor-guide/local_cluster.md or run make run which uses etcdctl in proxy mode. Monitor with etcdctl --endpoints=localhost:2379 member list.

🗺️Map of the codebase

  • api/etcdserverpb/rpc.proto — Core protobuf definition for all etcd RPC APIs (KV, Auth, Lease, Watch, etc.); essential for understanding the client-server contract
  • api/go.mod — API module dependencies; this is a separate module published independently and defines what etcd clients depend on
  • api/mvccpb/kv.proto — Protobuf definitions for key-value data structures used throughout etcd's data model
  • api/authpb/auth.proto — Authentication and authorization protobuf definitions; critical for role-based access control
  • api/etcdserverpb/etcdserver.proto — Server-internal RPC definitions for inter-member communication and cluster management
  • api/etcdserverpb/raft_internal.proto — Raft consensus protocol messages; essential for understanding distributed coordination
  • README.md — Project overview, use cases, and quick start; all contributors should understand etcd's positioning

🛠️How to make changes

Add a New RPC Method to the etcd Client API

  1. Define the new RPC method and request/response messages in the service definition (api/etcdserverpb/rpc.proto)
  2. If the RPC introduces new data structures, add protobuf message definitions to kv.proto or authpb/auth.proto (api/mvccpb/kv.proto)
  3. Run protoc to generate Go stubs in rpc_grpc.pb.go and the HTTP gateway (api/etcdserverpb/rpc_grpc.pb.go)
  4. Implement the server-side handler in the main etcd server (not in this repo, but referenced via api/v3) (README.md)
  5. Update go.mod if new external dependencies are required and verify .gomodguard.yaml allows them (api/.gomodguard.yaml)

Add Authentication/Authorization Rules

  1. Define new permission types or user role structures in the auth protobuf schema (api/authpb/auth.proto)
  2. Add corresponding request/response messages if a new auth RPC is needed (api/etcdserverpb/rpc.proto)
  3. Regenerate protobuf files to produce Go code with new auth types (api/authpb/auth.pb.go)
  4. Document the new auth model in the contributor guide (Documentation/contributor-guide/logging.md)

Modify Inter-Member (Raft) Protocol

  1. Update Raft message definitions and internal coordination protocol in raft_internal.proto (api/etcdserverpb/raft_internal.proto)
  2. If cluster membership changes, update the membership protobuf as well (api/membershippb/membership.proto)
  3. Regenerate and verify backward compatibility, including the Raft stringer tests (api/etcdserverpb/raft_internal_stringer_test.go)
  4. Update cluster upgrade documentation in contributor guide (Documentation/contributor-guide/release.md)

🔧Why these technologies

  • Protocol Buffers (protobuf3) — Language-agnostic serialization; enables auto-generation of client libraries in Go, Python, Java, C++, etc.; compact binary format critical for etcd's network and storage efficiency
  • gRPC — High-performance RPC framework built on HTTP/2; supports streaming (Watch), multiplexing, and backpressure; industry standard for distributed systems
  • gRPC Gateway (grpc-gateway/v2) — Bridges gRPC and HTTP/JSON; allows REST clients to interact with etcd without requiring gRPC libraries
  • Raft Consensus (via raft_internal) — Proven distributed consensus algorithm; ensures strong consistency across cluster members
  • Go 1.26 — High performance, concurrency primitives (goroutines, channels), excellent standard library; primary language for cloud infrastructure
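The strong consistency that Raft provides rests on majority quorum (N/2 + 1). A toy sketch of the arithmetic, not etcd code:

```go
package main

import "fmt"

// quorum returns the number of members that must agree before a
// Raft cluster of size n can commit a write: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while the
// cluster keeps accepting writes.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 5, 7} {
		fmt.Printf("cluster of %d: quorum %d, tolerates %d failures\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

This is why clusters are sized with odd member counts: going from 3 to 4 members raises the quorum without raising fault tolerance.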

⚖️Trade-offs already made

  • Separate api/ module published independently

    • Why: Clients only depend on protobuf definitions and generated code; decouples client library lifecycle from server releases
    • Consequence: Adds complexity to monorepo structure and release process; requires careful version management
  • gRPC as primary protocol with HTTP/JSON gateway overlay

    • Why: gRPC provides performance and streaming; HTTP/JSON provides universal accessibility
    • Consequence: Must maintain parity between gRPC and HTTP APIs; gateway adds latency and overhead for HTTP requests
  • Strong consistency via Raft (not eventual consistency)

    • Why: Critical for distributed systems managing configuration and state; prevents split-brain
    • Consequence: Write latency bounded by leader election and log replication; scales vertically better than horizontally
  • MVCC (Multi-Version Concurrency Control) for key-value store

    • Why: Supports non-blocking reads and watch mechanisms without locking; enables consistent point-in-time reads
    • Consequence: Storage overhead due to multiple versions; garbage collection complexity for old revisions
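To make the MVCC trade-off concrete, here is a toy revisioned store, a sketch rather than etcd's implementation: every Put appends a version instead of overwriting, which buys consistent historical reads at the price of the compaction work shown at the end.

```go
package main

import "fmt"

// rev is one stored version of a key, stamped with the store-wide
// revision at which it was written.
type rev struct {
	revision int64
	value    string
}

// store is a toy MVCC map: Put appends a version, so reads can be
// served at any retained revision.
type store struct {
	current  int64
	versions map[string][]rev
}

func newStore() *store { return &store{versions: map[string][]rev{}} }

func (s *store) Put(key, value string) int64 {
	s.current++
	s.versions[key] = append(s.versions[key], rev{s.current, value})
	return s.current
}

// Get returns the value of key as of revision r (0 means latest).
func (s *store) Get(key string, r int64) (string, bool) {
	if r == 0 {
		r = s.current
	}
	vs := s.versions[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].revision <= r {
			return vs[i].value, true
		}
	}
	return "", false
}

// Compact drops versions older than revision r (always keeping the
// newest) — the garbage-collection cost of retaining history.
func (s *store) Compact(r int64) {
	for k, vs := range s.versions {
		kept := vs[:0]
		for i, v := range vs {
			if v.revision >= r || i == len(vs)-1 {
				kept = append(kept, v)
			}
		}
		s.versions[k] = kept
	}
}

func main() {
	s := newStore()
	r1 := s.Put("config", "v1") // revision 1
	s.Put("config", "v2")       // revision 2
	old, _ := s.Get("config", r1)
	now, _ := s.Get("config", 0)
	fmt.Println(old, now) // historical read vs latest
}
```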

🚫Non-goals (don't propose these)

  • Not a relational database: no SQL, joins, or complex queries
  • Not a real-time messaging queue: Watch streams key-change notifications in order, but etcd is not a general pub/sub message bus
  • Not a graph database or document store: flat key-value model only
  • Not a password manager: stores data unencrypted at rest by default
  • Not a replacement for Zookeeper in all use cases: different consistency model and API

🪤Traps & gotchas

  1. Raft state transitions are not enforced at compile time — mutations can corrupt quorum state if you skip validation (always run the HasLeader() and CurrentLeaderId checks).
  2. Protocol buffer version mismatches between api/v3 and server code cause silent data corruption; regenerate protos after any .proto change (make proto).
  3. Local cluster testing requires specific port ranges (2379–2382 for clients, 2380–2383 for peers); firewall or Docker port conflicts fail silently.
  4. Snapshots and WAL logs must stay synchronized; deleting only one causes recovery failure.
  5. The main branch is unstable — development may break release branches; always test against a specific tag for production validation.
  6. TLS certificate generation requires proper SANs or client connections fail cryptically; see SECURITY.md for mTLS setup.

🏗️Architecture

💡Concepts to learn

  • Raft Consensus Algorithm — etcd's entire reliability and data consistency guarantee rests on Raft; understanding leader election, log replication, and term progression is mandatory for modifying replication logic
  • Write-Ahead Logging (WAL) — etcd persists all Raft log entries to disk before applying them; WAL corruption or misalignment with snapshots causes unrecoverable cluster state—critical for understanding durability guarantees
  • Distributed Consensus and Quorum — etcd requires majority quorum (N/2+1 nodes) for any decision; understanding quorum semantics prevents split-brain scenarios and data loss in partitioned networks
  • MVCC (Multi-Version Concurrency Control) — etcd implements MVCC for snapshot isolation and watch subscriptions; keys have revision numbers and etcd maintains version history, enabling consistent reads without locking
  • Protocol Buffers & gRPC — etcd API is entirely defined via .proto files and served over gRPC; changes to message schemas require careful versioning to maintain backward compatibility with older clients
  • B+Tree Indexing (bbolt) — etcd uses bbolt (embedded B+tree) for persistent key-value storage; understanding range queries, cursor iteration, and transaction isolation is essential for storage layer debugging
  • Chaos Testing & Formal Verification — etcd uses antithesis (property-based testing) in CI to find bugs under network failures and Byzantine conditions; the robustness testing philosophy is core to etcd's reliability culture and hidden assumptions in the codebase

🔗Related repositories

  • hashicorp/consul — Alternative distributed KV store also using Raft, with integrated service discovery — a direct competitor in the same problem space
  • tikv/tikv — Distributed KV store using Raft for replication with ACID transactions; different architectural choices (Rust, key-range sharding) for similar reliability goals
  • kubernetes/kubernetes — Primary production consumer of etcd; Kubernetes stores all cluster state (nodes, pods, config) in etcd, making this the reference workload for etcd design decisions
  • etcd-io/etcd-io.github.io — Official etcd documentation site and project governance repository; contains release coordination and community decision records
  • etcd-io/bbolt — Underlying embedded B+tree database (forked from boltdb) that handles etcd's persistent storage layer and transaction semantics
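The WAL discipline described above (persist a log entry first, apply it only once committed) can be illustrated with a toy sketch. This is illustrative Go, not etcd's raft or WAL code:

```go
package main

import "fmt"

// entry is one Raft log record: a term plus an opaque command.
type entry struct {
	term    uint64
	command string
}

// node is a toy state machine with the WAL invariant: entries are
// appended (persisted) first, and applied only once committed.
type node struct {
	log     []entry  // the write-ahead log, durable on disk in real etcd
	commit  int      // highest log index known to be committed
	applied int      // highest log index applied to the state machine
	state   []string // the applied state
}

// Append persists an entry to the log; it is not yet visible to reads.
func (n *node) Append(e entry) { n.log = append(n.log, e) }

// CommitTo advances the commit index (in real Raft, only after a
// quorum of members has persisted the entry).
func (n *node) CommitTo(i int) {
	if i > n.commit && i <= len(n.log) {
		n.commit = i
	}
}

// Apply moves committed-but-unapplied entries into the state machine.
func (n *node) Apply() {
	for n.applied < n.commit {
		n.state = append(n.state, n.log[n.applied].command)
		n.applied++
	}
}

func main() {
	n := &node{}
	n.Append(entry{term: 1, command: "put k=v"})
	n.Apply() // no effect: appended but not committed
	fmt.Println(len(n.state)) // 0
	n.CommitTo(1)
	n.Apply()
	fmt.Println(len(n.state)) // 1
}
```

The gap between the append and the apply is exactly why deleting the WAL but keeping a snapshot (or vice versa) breaks recovery: each half assumes the other exists.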

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive test coverage for exit codes documentation in etcd

Documentation/contributor-guide/exit_codes.md exists but there's no corresponding test suite validating that all documented exit codes are actually implemented and returned correctly throughout the codebase. This would prevent documentation drift and ensure exit codes remain reliable for monitoring/alerting systems that depend on etcd.

  • [ ] Review Documentation/contributor-guide/exit_codes.md to catalog all documented exit codes
  • [ ] Search codebase for all exit code usages and verify coverage completeness
  • [ ] Create tests/integration/exit_codes_test.go with test cases for each documented exit code scenario
  • [ ] Add CI validation step to ensure exit code documentation stays synchronized with implementation

Implement missing GitHub Actions workflow for dependency security scanning beyond Dependabot

The repo has .github/dependabot.yml but lacks a dedicated security audit workflow. Given etcd's critical role as a distributed key-value store for critical system data, add a GitHub Actions workflow that runs 'go mod graph' vulnerability scans and licenses compliance checks on every PR, similar to the existing codeql-analysis.yml pattern.

  • [ ] Create .github/workflows/dependency-audit.yml workflow file
  • [ ] Integrate 'go list -json -m all' with vulnerability database scanning (govulncheck or similar)
  • [ ] Add license compatibility checks for all transitive dependencies
  • [ ] Configure workflow to run on pull_request and schedule events (similar to scorecards.yml pattern)
  • [ ] Document the new workflow in CONTRIBUTING.md

Add integration tests for branch management and cherry-pick workflows documented in contributor guide

Documentation/contributor-guide/branch_management.md and cherry-pick.md exist but lack automated validation tests. Create tests that validate the documented workflows actually work correctly, including branch naming conventions, cherry-pick compatibility checks, and release branch procedures.

  • [ ] Review Documentation/contributor-guide/branch_management.md and cherry-pick.md for all documented procedures
  • [ ] Create tests/integration/branch_management_test.go to validate branch naming patterns and protections
  • [ ] Add validation logic for cherry-pick compatibility (ensure commits can be cleanly applied to release branches)
  • [ ] Create a GitHub Actions workflow (similar to .github/workflows/cherrypick-bot-ok-to-test.yaml) that auto-validates PRs follow branch management guidelines
  • [ ] Document test execution requirements in CONTRIBUTING.md

🌿Good first issues

  • Add integration tests for new protobuf fields in api/v3: Pick an unused proto field in KV.proto and write a test in clientv3/*_test.go verifying round-trip serialization and server handling.
  • Document Raft log compaction: The raft/ package has snapshot and WAL truncation logic but contributors frequently misunderstand when snapshots are created. Add code examples in Documentation/contributor-guide/ showing the relationship between ApplyIndex and SnapshotIndex.
  • Add exit code documentation for specific failure modes: exit_codes.md exists but is incomplete; map actual os.Exit() calls in server/*.go to documented codes, then write integration test in tests/ verifying each code is reachable.

Top contributors

Click to expand

📝Recent commits

Click to expand
  • e838001 — Merge pull request #21718 from Jefftree/rangestream-gofail-delay (serathius)
  • e8a9571 — Merge pull request #21717 from fuweid/gogo-part-5-cmdline (ahrtr)
  • 32a860b — etcdctl/ctlv3/command: use pointer for gogopb responses (fuweid)
  • 4f074ee — Merge pull request #21666 from Kunalbehbud/fix/maintenance-status-non-admin (ahrtr)
  • 05749fa — Merge pull request #21726 from silentred/bump-go-1.26.3 (ahrtr)
  • 8618f08 — Merge pull request #21722 from etcd-io/dependabot/github_actions/github/codeql-action-4.35.4 (ahrtr)
  • 72a0d55 — Bump Go to 1.26.3 (silentred)
  • e66a0a4 — build(deps): bump github/codeql-action from 4.35.2 to 4.35.4 (dependabot[bot])
  • fef4394 — etcdserver: add beforeRangeStreamChunk gofail point (Jefftree)
  • 0cbc031 — Merge pull request #21716 from AwesomePatrol/fix-etcd-k8s-patches (serathius)

🔒Security observations

The etcd codebase demonstrates good security practices including use of distroless containers, security workflows (CodeQL, Scorecards), and a formal security policy. Primary concerns involve mixed protobuf library versions and the deprecated gogo/protobuf dependency. The project uses a modern Go version with actively maintained dependencies. Network port exposure is appropriately documented through exposed ports. Recommend: (1) Complete migration from gogo/protobuf to google.golang.org/protobuf, (2) Regular dependency scanning and updates, (3) Enhanced container image verification, (4) HEALTHCHECK implementation in Docker. Overall security posture is strong for a critical infrastructure project.

  • Medium · Go Version Currency — .go-version, api/go.mod. The project pins Go 1.26.3 in .go-version and its go.mod files. Ensure the toolchain and all dependencies remain compatible and receive security patches promptly. Fix: Monitor Go security advisories and update dependencies regularly using 'go get -u' and 'go mod tidy'. Consider using Dependabot for automated dependency updates.
  • Medium · Protobuf Library Mixed Versions — api/go.mod. The project uses both 'github.com/gogo/protobuf v1.3.2' (deprecated) and 'google.golang.org/protobuf v1.36.11'. Gogo/protobuf is community-maintained and may have slower security updates compared to the official Google protobuf library. Fix: Migrate entirely to 'google.golang.org/protobuf' and remove the gogo/protobuf dependency. Update all proto generation and usage accordingly.
  • Low · Distroless Base Image Without Pinned Hash Verification — Dockerfile. While the Dockerfile uses a distroless image (security best practice), the base image is pulled from gcr.io. Ensure supply chain security by verifying the image digest and considering using a private registry mirror. Fix: Regularly scan the base image for vulnerabilities using tools like 'trivy'. Consider implementing image signing and verification using Cosign or similar tools.
  • Low · Exposed Network Ports Without Documentation — Dockerfile. The Dockerfile exposes ports 2379 and 2380 (etcd client and peer communication). While necessary for operation, ensure proper network policies and authentication mechanisms protect these ports. Fix: Document security best practices for port exposure in deployment guides. Ensure TLS/mTLS is configured for production deployments. Consider requiring authentication on both ports.
  • Low · No HEALTHCHECK Directive in Dockerfile — Dockerfile. The Dockerfile lacks a HEALTHCHECK instruction, which could prevent orchestrators from detecting unhealthy etcd instances. Fix: Add a HEALTHCHECK directive using etcdctl to verify instance health, e.g., 'HEALTHCHECK CMD etcdctl endpoint health'
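For the last item, a hedged sketch of what such a directive could look like; the interval, timeout, and etcdctl flags are assumptions to validate against the actual image, not etcd's shipped Dockerfile:

```dockerfile
# Hypothetical addition — verify flag names against your etcd/etcdctl version.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD etcdctl --endpoints=http://127.0.0.1:2379 endpoint health || exit 1
```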

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
