kubernetes/kube-state-metrics
Add-on agent to generate and expose cluster-level metrics.
Healthy across the board
Permissive license, no critical CVEs, actively maintained: safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 1d ago
- ✓16 active contributors
- ✓Distributed ownership (top contributor 35% of recent commits)
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding doc
Onboarding: kubernetes/kube-state-metrics
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/kubernetes/kube-state-metrics shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 1d ago
- 16 active contributors
- Distributed ownership (top contributor 35% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live kubernetes/kube-state-metrics
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/kubernetes/kube-state-metrics.
What it runs against: a local clone of kubernetes/kube-state-metrics — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in kubernetes/kube-state-metrics | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of kubernetes/kube-state-metrics. If you don't
# have one yet, run these first:
#
# git clone https://github.com/kubernetes/kube-state-metrics.git
# cd kube-state-metrics
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of kubernetes/kube-state-metrics and re-run."
exit 2
fi
# 1. Repo identity (anchored at the end of the URL so similarly named repos do not match)
git remote get-url origin 2>/dev/null | grep -qE "kubernetes/kube-state-metrics(\.git)?/?$" \
  && ok "origin remote is kubernetes/kube-state-metrics" \
  || miss "origin remote is not kubernetes/kube-state-metrics (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
# The standard Apache LICENSE text is indented and reads "Apache License" / "Version 2.0",
# so match those phrases instead of an SPDX identifier at the start of a line.
(grep -qiE "Apache License" LICENSE 2>/dev/null \
  && grep -qiE "Version 2\.0" LICENSE 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift (was Apache-2.0 at generation time)"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical paths exist (-f for files, -d for directories)
test -f "main.go" \
  && ok "main.go" \
  || miss "missing critical file: main.go"
test -f "pkg/metrics_store.go" \
  && ok "pkg/metrics_store.go" \
  || miss "missing critical file: pkg/metrics_store.go"
test -d "pkg/collectors" \
  && ok "pkg/collectors" \
  || miss "missing critical directory: pkg/collectors"
test -f "go.mod" \
  && ok "go.mod" \
  || miss "missing critical file: go.mod"
test -f "Makefile" \
  && ok "Makefile" \
  || miss "missing critical file: Makefile"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/kubernetes/kube-state-metrics"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
kube-state-metrics is a Kubernetes add-on agent that watches the Kubernetes API server and exposes cluster-level state metrics on an HTTP /metrics endpoint in Prometheus format. It generates metrics about Kubernetes objects such as Deployments, Pods, Nodes, StatefulSets, and DaemonSets without modification, exposing raw API state rather than computed health judgments, so monitoring stacks can scrape object counts, conditions, and resource allocations at scale.
It is a monolithic Go application with a modular internal structure: internal/metrics/ contains metric generators for each Kubernetes resource type (pod metrics, deployment metrics, etc.); internal/store/ holds the object state watchers and cache layer; cmd/ wraps the main application entry point. Metrics documentation mirrors this structure in docs/metrics/, with subdirectories per API group (auth/, cluster/, apps/, etc.).
👥Who it's for
Platform engineers and SREs who operate Kubernetes clusters and need Prometheus-compatible observability of cluster state; monitoring stack maintainers who integrate KSM into Prometheus or other metric collection systems; DevOps teams building alerting rules on object health (e.g., pod crash loops, pending workloads).
🌱Maturity & risk
Production-ready and actively maintained. This is a CNCF Kubernetes project with comprehensive CI/CD (GitHub Actions workflows for govulncheck, semantic versioning, SBOM generation, and pre-release automation), OpenSSF Best Practices badge, Go 1.25 support, and Kubernetes API v0.35.4 compatibility. Regular dependency updates via Dependabot and active issue triage indicate sustained maintenance.
Low risk for core stability but moderate operational complexity: 50+ direct Go dependencies (the transitive count is much higher), tight coupling to Kubernetes API versions (cluster and KSM versions must be aligned; see docs/design/ for compatibility matrices), and CustomResourceDefinition (CRD) support, which adds versioning surface area. Single-maintainer risk is mitigated by the OWNERS file. Watch for breaking changes in Kubernetes API releases between cluster and KSM versions.
Active areas of work
Active development with focus on Kubernetes 1.35 compatibility, security scanning (govulncheck workflow), and supply-chain hardening (SBOM, OpenVEX templates). Recent work includes dependency maintenance (Prometheus client v1.23+, controller-runtime v0.23), ECMAScript regex support for metric filtering, and performance optimization docs. Semantic versioning and release automation are in place.
🚀Get running
Clone and build: git clone https://github.com/kubernetes/kube-state-metrics.git && cd kube-state-metrics && make build. This runs via Makefile (not npm/pip). Requires Go 1.25 (locked in .go-version). For local development: make && ./kube-state-metrics connects to your kubectl context. For container: make container builds the Docker image from Dockerfile.
Daily commands:
make build produces binary ./kube-state-metrics. Run with ./kube-state-metrics --kubeconfig=$HOME/.kube/config or inside cluster (auto-auth via ServiceAccount). Metrics served on :8080/metrics by default. CLI flags documented in docs/developer/cli-arguments.md. Test: make test runs Go unit tests.
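Once the binary is serving :8080/metrics, a quick sanity check is to count how many series each metric name contributes. The following is a minimal sketch using only the Go standard library; countSeries is a hypothetical helper, and the sample text stands in for a real scrape body (in practice you would feed it the response of a GET against the /metrics endpoint).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countSeries counts samples per metric name in Prometheus text format.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
// Hypothetical helper for illustration; not part of kube-state-metrics.
func countSeries(exposition string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or ' ' (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			continue // not a well-formed sample line
		}
		counts[line[:end]]++
	}
	return counts
}

func main() {
	// Illustrative scrape body; real output is much larger.
	sample := `# HELP kube_pod_info Information about pod.
# TYPE kube_pod_info gauge
kube_pod_info{namespace="default",pod="web-0"} 1
kube_pod_info{namespace="default",pod="web-1"} 1
kube_node_info{node="node-a"} 1
`
	counts := countSeries(sample)
	fmt.Println("kube_pod_info series:", counts["kube_pod_info"])
	fmt.Println("kube_node_info series:", counts["kube_node_info"])
}
```

Metric names with unexpectedly high counts are the first place to look when memory use or scrape time grows.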
🗺️Map of the codebase
- main.go: Entry point that initializes the application, sets up metrics collectors, and starts the HTTP server serving Prometheus metrics.
- pkg/metrics_store.go: Core abstraction managing all metrics collection and caching; every contributor must understand how metrics flow through this component.
- pkg/collectors: Directory containing all Kubernetes object collectors; foundational to understanding how KSM generates state metrics for different resource types.
- go.mod: Declares critical Prometheus and Kubernetes client dependencies that shape the entire architecture and API surface.
- Makefile: Build automation and testing orchestration; defines how the project compiles, tests, and releases artifacts.
- docs/design/metrics-best-practices.md: Design documentation explaining metric naming, labeling, and cardinality principles that guide all new metrics additions.
🧩Components & responsibilities
- Resource Collectors (pkg/collectors/*) — Each collector watches a specific Kubernetes resource type (Pods, Nodes, etc.) and transforms object
🛠️How to make changes
Add a new Kubernetes resource collector
- Create a new collector file in pkg/collectors/ following the pattern of existing collectors (e.g., node.go, pod.go) (pkg/collectors/mynewresource.go)
- Implement the collector interface with Describe() and Collect() methods to generate metrics for your resource type (pkg/collectors/mynewresource.go)
- Register the collector in the metrics store initialization, typically in main.go or the store factory function (main.go)
- Document the new metrics in docs/metrics/ following the existing structure and format (docs/metrics/category/mynewresource-metrics.md)
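The Describe()/Collect() shape mentioned in the steps above can be sketched with the standard library alone. Everything here is hypothetical and for illustration only: Desc, Sample, and Collector are simplified stand-ins for the Prometheus client library's types, and Widget is an invented resource. Check the existing collectors in the repository for the real interfaces before copying this pattern.

```go
package main

import "fmt"

// Desc and Sample are simplified, hypothetical stand-ins for the
// descriptor and metric types of the Prometheus client library.
type Desc struct {
	Name string
	Help string
}

type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// Collector mirrors the Describe/Collect shape with the stand-in types.
type Collector interface {
	Describe(ch chan<- Desc)
	Collect(ch chan<- Sample)
}

// Widget is an invented resource type used only for this sketch.
type Widget struct {
	Namespace string
	Name      string
	Ready     bool
}

type WidgetCollector struct {
	Widgets []Widget
}

var widgetReady = Desc{
	Name: "kube_widget_status_ready",
	Help: "Whether the widget reports ready (1) or not (0).",
}

// Describe announces which metrics this collector can emit.
func (c WidgetCollector) Describe(ch chan<- Desc) { ch <- widgetReady }

// Collect emits one sample per object, exposing raw state as 0 or 1.
func (c WidgetCollector) Collect(ch chan<- Sample) {
	for _, w := range c.Widgets {
		value := 0.0
		if w.Ready {
			value = 1
		}
		ch <- Sample{
			Name:   widgetReady.Name,
			Labels: map[string]string{"namespace": w.Namespace, "widget": w.Name},
			Value:  value,
		}
	}
}

// gather drains a collector into a slice, preserving emission order.
func gather(c Collector) []Sample {
	ch := make(chan Sample)
	go func() {
		c.Collect(ch)
		close(ch)
	}()
	var out []Sample
	for s := range ch {
		out = append(out, s)
	}
	return out
}

func main() {
	c := WidgetCollector{Widgets: []Widget{
		{Namespace: "default", Name: "a", Ready: true},
		{Namespace: "default", Name: "b", Ready: false},
	}}
	for _, s := range gather(c) {
		fmt.Printf("%s{namespace=%q,widget=%q} %g\n",
			s.Name, s.Labels["namespace"], s.Labels["widget"], s.Value)
	}
}
```

The collector only translates object state into samples; it makes no health judgment, which matches the "raw API state" stance described in the TL;DR.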
Extend metrics with custom resource state configuration
- Define a CustomResourceStateMetrics configuration YAML specifying the CRD resource, rules, and metric generators (examples/custom-resource-state.yaml)
- Pass the configuration file path to kube-state-metrics via --custom-resource-state-config-file CLI flag (main.go)
- Consult CustomResourceState collector implementation to understand rule evaluation and metric generation logic (pkg/collectors/customresourcestate.go)
Add a new CLI argument for configuration
- Define the flag in the CLI command definition, typically with Cobra in main.go or a flags package (main.go)
- Document the new flag in CLI arguments reference (docs/developer/cli-arguments.md)
- Update the template and regenerate docs if needed (docs/developer/cli-arguments.md.tpl)
Optimize or fix metrics collection performance
- Review the metrics store performance optimization design documentation (docs/design/metrics-store-performance-optimization.md)
- Identify the collector or metrics_store code path that needs optimization (pkg/metrics_store.go)
- Implement caching, filtering, or batch processing changes in the target collector or store (pkg/collectors/)
🔧Why these technologies
- Kubernetes Client-Go (k8s.io/client-go) — Official Go client for Kubernetes API; enables efficient watch/list operations on cluster objects with built-in caching and resource versioning.
- Prometheus Client Library (prometheus/client_golang) — Standard metrics exposition format; allows integration with any Prometheus-compatible monitoring system with minimal overhead.
- Cobra CLI Framework (spf13/cobra) — Structured command-line argument parsing and help generation for managing complex configuration options.
- Viper Configuration Management (spf13/viper) — Unified configuration from CLI flags, environment variables, and config files; simplifies operational deployment flexibility.
- Go 1.25+ Concurrency Primitives — Goroutines and channels enable efficient parallel collector execution and non-blocking HTTP request handling at scale.
⚖️Trade-offs already made
- In-memory metrics caching with configurable TTL instead of on-demand computation
  - Why: Kubernetes API calls are expensive; caching reduces load on API server and provides predictable scrape latency.
  - Consequence: Metrics are eventually consistent with cluster state (lag = cache TTL); not suitable for sub-second freshness requirements.
- Single-process design (no distributed sharding by default) vs. StatefulSet autosharding optional feature
  - Why: Simplicity and zero-dependency operation for small clusters; optional sharding for large clusters to reduce per-instance cardinality.
  - Consequence: Default deployment may OOM on very large clusters (>5k nodes); requires explicit sharding configuration for scale.
- Collector registry pattern with pull-based metrics generation vs. event-driven push
  - Why: Pull model aligns with Prometheus conventions; decouples scraper from collector implementation and enables graceful restart.
  - Consequence: Cannot react immediately to cluster changes; lag inherent in scrape interval.
- CustomResourceState metrics via configuration file vs. hard-coded collectors
  - Why: Enables monitoring of user-defined CRDs without rebuilding KSM; declarative, GitOps-friendly approach.
  - Consequence: Requires understanding of rule DSL and careful metric naming to avoid cardinality explosions.
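To make the sharding trade-off above concrete, here is a minimal sketch of hash-based partitioning: each instance keeps only the objects whose UID hashes to its shard index. The FNV-1a hash, the shardFor helper, and the sample UIDs are assumptions chosen for illustration; the hashing scheme kube-state-metrics actually uses may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps an object UID to a shard index in [0, totalShards).
// totalShards must be greater than zero.
// Hypothetical helper; not the project's actual sharding function.
func shardFor(uid string, totalShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(uid)) // Write on a hash.Hash never returns an error
	return h.Sum32() % totalShards
}

func main() {
	// Invented UIDs standing in for Kubernetes object UIDs.
	uids := []string{"pod-a", "pod-b", "pod-c", "pod-d", "pod-e", "pod-f"}
	const totalShards = 3

	perShard := make([]int, totalShards)
	for _, uid := range uids {
		perShard[shardFor(uid, totalShards)]++
	}
	// Every object lands on exactly one shard, so the counts sum to len(uids).
	fmt.Println("objects per shard:", perShard)
}
```

Because the mapping depends only on the UID and the shard count, instances need no coordination, but changing the shard count reassigns most objects.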
🚫Non-goals (don't propose these)
- Real-time alerting or event streaming; KSM is a metrics exporter, not an event bus.
- Authentication/authorization enforcement; assumes RBAC is handled by cluster admin and reverse proxy.
- Storing metrics or providing a time-series database; metrics are scraped by external Prometheus.
- Monitoring application-level metrics; scope is Kubernetes object state only.
- Supporting non-Kubernetes clusters or hybrid cloud without Kubernetes API access.
🪤Traps & gotchas
- Kubernetes version skew: KSM must match or stay close to the cluster API version (incompatible versions cause API discovery failures).
- ServiceAccount RBAC: KSM needs a ClusterRole with list/watch on all tracked resources; missing permissions fail silently and the affected metrics are simply absent.
- Metrics cardinality explosion: the metric filters (--metric-allowlist, --metric-denylist) take ECMAScript regex, not glob patterns.
- StatefulSet/DaemonSet metrics require explicit opt-in if CRD support is needed.
- The cluster role must grant get/list/watch on specific resource versions (e.g., apps/v1/deployments); old API group versions are not queried by default.
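The regex-versus-glob trap is easy to reproduce. The sketch below uses Go's standard regexp package (RE2 syntax) as a stand-in for the ECMAScript engine mentioned above; the two agree on the simple patterns shown. filterMetricNames and its whole-name anchoring are assumptions for illustration, not kube-state-metrics code.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterMetricNames keeps the names that fully match at least one pattern.
// Anchoring each pattern to the whole name is an assumption of this sketch.
func filterMetricNames(names, patterns []string) ([]string, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile("^(?:" + p + ")$")
		if err != nil {
			return nil, fmt.Errorf("invalid pattern %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	var kept []string
	for _, n := range names {
		for _, re := range compiled {
			if re.MatchString(n) {
				kept = append(kept, n)
				break
			}
		}
	}
	return kept, nil
}

func main() {
	names := []string{"kube_pod_info", "kube_pod_status_phase", "kube_node_info"}

	// Regex: ".*" matches any suffix, so both pod metrics are kept.
	kept, _ := filterMetricNames(names, []string{"kube_pod_.*"})
	fmt.Println(kept) // [kube_pod_info kube_pod_status_phase]

	// Glob habit: as a regex, "_*" means "zero or more underscores",
	// so nothing in the list matches.
	kept, _ = filterMetricNames(names, []string{"kube_pod_*"})
	fmt.Println(kept) // []
}
```

A glob-style pattern fails quietly here: it compiles without error and simply matches nothing, which looks identical to the metrics not existing.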
🏗️Architecture
💡Concepts to learn
- Kubernetes Informers: KSM uses the informer pattern (watch + local cache) from controller-runtime to efficiently track cluster state changes without polling the API; understanding this is essential for modifying store logic
- Prometheus Exposition Format: KSM's /metrics endpoint serves metrics in Prometheus text format (lines like kube_pod_status_phase{pod=...,namespace=...,phase=...} 1); you must understand metric naming, labels, and types to add or modify metrics
- Metric Cardinality & Label Explosion: KSM metrics can explode in cardinality if labels are unbounded (e.g., per-pod custom labels); the codebase includes --metric-allowlist/denylist filters using ECMAScript regex to prevent memory exhaustion
- Kubernetes CustomResourceDefinitions (CRDs): KSM supports both native Kubernetes resources and CRDs via dynamic discovery; generators must handle CRD versioning and schema variations
- RBAC & ServiceAccount Permissions: KSM's ability to list/watch resources depends on ClusterRole bindings; missing permissions fail silently and produce incomplete metrics, so you'll need to debug RBAC issues when adding new resource types
- Prometheus Client Library Instrumentation: KSM uses prometheus/client_golang to define Gauge, Counter, and Histogram metrics; understanding metric types (gauge vs counter) and the pull-based scrape model is needed to instrument new metrics correctly
- Metric Store Abstraction & Plugin Pattern: KSM's internal architecture separates Kubernetes watchers (store layer) from metric exposition (metric generators); new resource types plug into this store without touching the HTTP/Prometheus layer
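The text exposition format from the list above is simple enough to render by hand. This is a minimal sketch, assuming plain ASCII label values; formatSample is a hypothetical helper, and sorting the label keys is done here only to make the output deterministic, not because the format requires it.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSample renders one sample as: name{label="value",...} value
// Hypothetical helper for illustration; real exporters use a client library.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output for this sketch

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		// %q quotes the value and escapes backslashes, quotes, and newlines,
		// which is sufficient for the plain ASCII values assumed here.
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	if len(pairs) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println("# TYPE kube_pod_status_phase gauge")
	fmt.Println(formatSample("kube_pod_status_phase", map[string]string{
		"namespace": "default",
		"pod":       "web-0",
		"phase":     "Running",
	}, 1))
	// kube_pod_status_phase{namespace="default",phase="Running",pod="web-0"} 1
}
```

Each distinct combination of label values is its own series, which is why unbounded labels translate directly into memory growth.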
🔗Related repos
- prometheus-operator/prometheus-operator: Helm/CRD-based Prometheus deployment on Kubernetes; commonly used alongside KSM to scrape and store metrics it generates
- prometheus/prometheus: Time-series database that consumes metrics from KSM's HTTP /metrics endpoint; KSM is a Prometheus exporter
- kubernetes/kubernetes: Core Kubernetes project whose API objects and CRDs KSM watches; KSM tracks Kubernetes versions and API compatibility
- prometheus/client_golang: Prometheus client library used by KSM to expose metrics in text format; direct dependency in go.mod
- kubernetes/sample-controller: Kubernetes reference implementation for controller patterns; KSM uses similar informer/workqueue patterns from this architecture
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive CustomResourceState metrics documentation and examples
The CustomResourceState feature is powerful but underdocumented. While docs/metrics/extend/customresourcestate-metrics.md exists, there are no examples in docs showing how to configure CRS for common use cases (e.g., Istio VirtualServices, Flux HelmReleases, ArgoCD Applications). New contributors could add practical YAML examples and validation schemas to help users adopt this feature.
- [ ] Review docs/metrics/extend/customresourcestate-metrics.md for gaps
- [ ] Create docs/metrics/extend/customresourcestate-examples/ directory with real-world YAML examples
- [ ] Add JSON schema validation documentation for CustomResourceStateMetrics config
- [ ] Document label cardinality limits and best practices for custom metrics
Implement missing unit test coverage for collector generators
The repo has docs/metrics covering 20+ Kubernetes resource types (Node, Pod, Deployment, etc.), but many likely lack comprehensive unit tests for their state generators. Contributors could systematically add tests for edge cases like: missing optional fields, invalid state transitions, and label generation for under-tested resource types found in docs/metrics/.
- [ ] Identify collector generator files missing test coverage (e.g., in internal/generator/)
- [ ] For each resource type in docs/metrics/, verify corresponding *_test.go file has 80%+ coverage
- [ ] Add tests for edge cases: nil fields, empty label values, status phase transitions
- [ ] Update Makefile or CI workflow to enforce minimum coverage thresholds
Add deprecation lifecycle documentation and migration guide for metrics
The CHANGELOG.md lists removed/deprecated metrics, but there's no formal deprecation policy. Create docs/design/metrics-deprecation-policy.md defining: deprecation timelines, notice periods, and a migration guide index. This helps users plan upgrades and contributors understand the process for proposing breaking changes.
- [ ] Create docs/design/metrics-deprecation-policy.md with timeline (e.g., 2 releases notice)
- [ ] Add a deprecation section to MAINTAINING.md or CONTRIBUTING.md
- [ ] Create docs/migration/ directory with guides for removed metrics from recent releases
- [ ] Link deprecation notices from CHANGELOG.md to migration guides
🌿Good first issues
- Add missing metric tests for the certificatesigningrequest generator (generator exists in internal/metrics/certificatesigningrequest_gen.go but test coverage appears minimal based on metrics docs) by writing unit tests in *_test.go that assert metric names, label sets, and cardinality
- Expand CLI arguments documentation in docs/developer/cli-arguments.md.tpl with concrete examples for each flag (currently template-driven; add worked examples like --metric-allowlist 'kube_pod_.*' with expected output) and regenerate with make generate-template
- Add support for OpenShift-specific Kubernetes API resources (e.g., Route, BuildConfig) by creating new metric generators in internal/metrics/ and documenting in docs/metrics/ following the existing pod/deployment pattern (requires RBAC checks against OpenShift APIs)
⭐Top contributors
- @k8s-ci-robot — 35 commits
- @mrueg — 16 commits
- @dependabot[bot] — 15 commits
- @bhope — 8 commits
- @ystkfujii — 7 commits
📝Recent commits
- cd5430f: Merge pull request #2950 from bhope/update-go-version (k8s-ci-robot)
- aa8942a: Go bumped from 1.26.1 to (bhope)
- d40135d: Merge pull request #2920 from bhope/fix-mem-leak (k8s-ci-robot)
- 3ccffc2: Merge pull request #2947 from rexagod/fix-ghsa (rexagod)
- d1adb7b: fixup! fix: move pprof endpoints to telemetry server and protect with auth filter (rexagod)
- 80dede7: Merge pull request #2944 from kubernetes/dependabot/go_modules/github.com/fsnotify/fsnotify-1.10.1 (k8s-ci-robot)
- 50b0f3f: build(deps): Bump github.com/fsnotify/fsnotify from 1.9.0 to 1.10.1 (dependabot[bot])
- f9b1b42: fix: move pprof endpoints to telemetry server and protect with auth filter (bhope)
- 454b63b: Merge pull request #2941 from marioferh/bump_go_jose (k8s-ci-robot)
- 35abc5d: fix: stop goroutine and memory leak in CR reflectors on repeated CRD discovery (bhope)
🔒Security observations
The kube-state-metrics repository demonstrates good security practices with multiple security workflows (govulncheck, vulnerability scanning), OpenSSF compliance, and adherence to Kubernetes security standards. The Dockerfile uses distroless images and drops privileges appropriately. However, minor concerns exist regarding Go version consistency, use of 'latest' base image tags, and a pre-release dependency version. The SECURITY.md documentation appears incomplete. Overall, the project maintains a strong security posture with room for minor improvements in build reproducibility and documentation completeness.
- Medium · Go Version Mismatch in go.mod and Dockerfile (.go-version, go.mod, Dockerfile). The go.mod file specifies Go 1.25.0, but the Dockerfile uses Go 1.26 as the default GOVERSION. This version mismatch could lead to inconsistent builds and potential compatibility issues. Fix: Ensure Go versions are synchronized across .go-version, go.mod, and Dockerfile. Update the Dockerfile to match the Go version specified in go.mod (1.25.0), or update go.mod if 1.26 is the intended version.
- Low · Distroless Base Image Without Specific Version Tag (Dockerfile). The Dockerfile uses 'gcr.io/distroless/static-debian13:latest-${GOARCH}', which relies on the 'latest' tag. Using 'latest' tags can lead to unexpected behavior changes when the base image is updated. Fix: Pin the distroless image to a specific digest instead of 'latest' (e.g., 'gcr.io/distroless/static-debian13@sha256:<digest>'). This keeps builds reproducible.
- Low · Default User Configuration (Dockerfile). While the Dockerfile does drop privileges by using 'USER nobody', distroless images provide a 'nonroot' variant that offers better security. The current configuration uses a system-level nobody user, which may vary across distributions. Fix: Consider using the 'distroless/static-debian13:nonroot' variant instead, which runs as a dedicated nonroot user with better security properties than the 'nobody' user.
- Low · Pre-release Prometheus Client Library Version (go.mod). The dependency 'github.com/prometheus/client_golang' is pinned to a pre-release version (v1.23.3-0.20251103151724-a5ae20370e5e). Pre-release versions may contain unvetted changes and stability issues. Fix: Update to the latest stable release of prometheus/client_golang once available, or document the rationale for using a pre-release version. Monitor for security updates to this pre-release version.
- Low · Incomplete SECURITY.md File (SECURITY.md). The SECURITY.md file appears to be truncated at the end (it cuts off mid-URL on the last line). This incomplete documentation could confuse security researchers trying to report vulnerabilities. Fix: Complete and review the SECURITY.md file to ensure all security contact information and vulnerability disclosure procedures are clearly documented.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.