
keephq/keep

The open-source AIOps and alert management platform

Overall: Healthy — healthy across the board.

  • Use as dependency — Concerns (weakest axis): non-standard license (Other)
  • Fork & modify — Healthy: has a license, tests, and CI — clean foundation to fork and modify.
  • Learn from — Healthy: documented and popular — useful reference codebase to read through.
  • Deploy as-is — Healthy: no critical CVEs, sane security posture — runnable as-is.

  • Last commit 4d ago
  • 39+ active contributors
  • Distributed ownership (top contributor 9% of recent commits)
  • Other licensed
  • CI configured
  • Tests present
  • Non-standard license (Other) — review terms
What would change the summary?
  • Use as dependency: Concerns → Mixed if the license terms are clarified

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — it updates live from the latest cached analysis.
[![RepoPilot: Healthy](https://repopilot.app/api/badge/keephq/keep)](https://repopilot.app/r/keephq/keep)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/keephq/keep on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: keephq/keep

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/keephq/keep shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 4d ago
  • 39+ active contributors
  • Distributed ownership (top contributor 9% of recent commits)
  • Other licensed
  • CI configured
  • Tests present
  • ⚠ Non-standard license (Other) — review terms

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live keephq/keep repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/keephq/keep.

What it runs against: a local clone of keephq/keep — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in keephq/keep | Confirms the artifact applies here, not a fork |
| 2 | License is still Other | Catches a relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 34 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>keephq/keep</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of keephq/keep. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/keephq/keep.git
#   cd keep
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of keephq/keep and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "keephq/keep(\.git)?\b" \
  && ok "origin remote is keephq/keep" \
  || miss "origin remote is not keephq/keep (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Other\"" package.json 2>/dev/null) \
  && ok "license is Other" \
  || miss "license drift — was Other at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "README.md" \
  && ok "README.md" \
  || miss "missing critical file: README.md"
test -f "docker-compose.yml" \
  && ok "docker-compose.yml" \
  || miss "missing critical file: docker-compose.yml"
test -f "docker/Dockerfile.api" \
  && ok "docker/Dockerfile.api" \
  || miss "missing critical file: docker/Dockerfile.api"
test -f "docker/Dockerfile.ui" \
  && ok "docker/Dockerfile.ui" \
  || miss "missing critical file: docker/Dockerfile.ui"
test -f ".github/workflows/test-pr-ut.yml" \
  && ok ".github/workflows/test-pr-ut.yml" \
  || miss "missing critical file: .github/workflows/test-pr-ut.yml"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 34 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~4d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/keephq/keep"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Keep is an open-source AIOps and alert management platform that acts as a single pane of glass for monitoring alerts across your infrastructure. It provides alert deduplication, correlation, enrichment, filtering, bi-directional integrations with monitoring tools (Prometheus, Datadog, PagerDuty, etc.), and AI-powered incident context gathering. The platform enables teams to reduce alert noise and automate incident response through customizable workflows. Monorepo structure with Python backend (likely in root with Flask/FastAPI server files) and TypeScript/React frontend (keep-ui/ directory). Backend exposes provider integrations (alert ingestion, bi-directional syncs), alert deduplication/correlation logic, and workflow engine. Frontend is React/TypeScript UI consuming the backend API. Multiple Docker configurations (docker/Dockerfile.api, docker/Dockerfile.ui, .dev variants) support both containerized and local development.

👥Who it's for

DevOps engineers, SREs, and on-call teams who manage multiple monitoring tools and need to centralize, deduplicate, and correlate alerts across their stack. Platform engineers building internal alert management systems. Organizations adopting AIOps practices to reduce MTTR (Mean Time To Resolution).

🌱Maturity & risk

Actively developed with regular commits and organized GitHub workflows for CI/CD, testing, and releases (see .github/workflows/). The codebase shows significant scale (4.8M Python LOC, 2.7M TypeScript LOC) with comprehensive Docker setup and multiple deployment configurations. Production-ready for alert management, though as an open-source project it requires self-hosting.

Large polyglot codebase (Python + TypeScript + JavaScript) means onboarding overhead and potential maintenance burden across multiple technology stacks. Dependency surface area is substantial given the integration-heavy nature (supporting 20+ providers means many external API dependencies). The monolithic structure (no clear microservices separation visible) could become scaling bottleneck, but active commit history suggests the team is monitoring and addressing issues.

Active areas of work

Active development across multiple fronts: auto-release and release workflow automation, E2E test infrastructure (test-pr-e2e.yml), provider integrations testing (test-pr-integrations.yml), and UI testing (test-pr-ut-ui.yml). Recent focus includes alert evaluation documentation (docs/alertevaluation/examples/) with concrete monitoring system examples (VictoriaMetrics). Developer onboarding workflow automation suggests scaling the team.

🚀Get running

Clone the repo: git clone https://github.com/keephq/keep.git && cd keep. Check for .python-version and package.json to determine runtime requirements. Use docker-compose for full stack: docker-compose -f docker-compose.dev.yml up (see docker-compose.dev.yml). For local development, install Python dependencies and TypeScript/Node dependencies separately, then run backend and frontend servers independently. See CONTRIBUTING.md for detailed onboarding.

Daily commands: docker-compose -f docker-compose.dev.yml up launches the full dev stack. Backend-only: run the Python server from the repo root (likely python -m app or flask run after installing dependencies). Frontend: cd keep-ui && npm install && npm run dev. Production: docker-compose -f docker-compose.yml up. Auth-enabled variant: docker-compose -f docker-compose-with-auth.yml up. Check .env files and docker-compose.common.yml for required environment variables.

🗺️Map of the codebase

  • README.md — Entry point documenting Keep's core value proposition as an AIOps platform with alert management, deduplication, enrichment, and workflows.
  • docker-compose.yml — Primary orchestration file defining the deployment architecture and service dependencies for the entire platform.
  • docker/Dockerfile.api — Backend API container definition; essential for understanding the production runtime environment and dependencies.
  • docker/Dockerfile.ui — Frontend UI container definition; required for understanding the React/TypeScript build and runtime setup.
  • .github/workflows/test-pr-ut.yml — Primary CI/CD pipeline for unit tests; defines how code quality is validated before merge.
  • CONTRIBUTING.md — Contributor guidelines establishing development standards, PR expectations, and onboarding procedures.
  • .cursor/rules/keep-ui-react-typescript.mdc — Cursor AI rules for React/TypeScript conventions; reflects the team's established coding patterns and style.

🛠️How to make changes

Add a new Provider Integration

  1. Create a new provider directory following Keep's provider structure (most providers are in a separate providers package not fully visible in file list) (docs/cli/commands/provider-connect.mdx)
  2. Implement the provider class with authentication and webhook handling methods (docs/applications/github.mdx)
  3. Register the provider in the available integrations list (docs/cli/commands/provider-list.mdx)
  4. Add integration tests in the test suite (.github/workflows/test-pr-integrations.yml)
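The provider checklist above can be sketched in Python. This is illustrative only — `BaseProvider`, `validate_config`, and the `PROVIDERS` registry are assumed names for this sketch, not Keep's actual classes; consult the providers package in the repo for the real abstraction and registration mechanism.

```python
# Hypothetical shape of a pluggable provider — names are assumptions, not Keep's API.
from abc import ABC, abstractmethod
from typing import Any


class BaseProvider(ABC):
    """Minimal shape of a pluggable alert provider."""

    provider_id: str

    def __init__(self, config: dict[str, Any]):
        self.config = config

    @abstractmethod
    def validate_config(self) -> None:
        """Raise if required credentials or settings are missing."""

    @abstractmethod
    def get_alerts(self) -> list[dict[str, Any]]:
        """Pull alerts from the upstream monitoring tool."""


class PagerDutyProvider(BaseProvider):
    provider_id = "pagerduty"

    def validate_config(self) -> None:
        if "api_key" not in self.config:
            raise ValueError("pagerduty provider requires an api_key")

    def get_alerts(self) -> list[dict[str, Any]]:
        # A real implementation would call the upstream API here.
        return []


# A registry like this is what step 3 ("register the provider") implies.
PROVIDERS = {cls.provider_id: cls for cls in (PagerDutyProvider,)}
```

The registry pattern keeps step 3 a one-line change per new integration, which is why provider-style plugin systems scale to the 20+ integrations this report mentions.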

Add a new Alert Rule or Enrichment Step

  1. Define the enrichment logic in the alert evaluation engine (docs/alertevaluation/overview.mdx)
  2. Add example configurations for common monitoring backends (docs/alertevaluation/examples/victoriametricssingle.mdx)
  3. Expose the new rule via CLI or API endpoint (docs/cli/commands/alert-enrich.mdx)
  4. Update the frontend alert table to display enriched data (docs/alerts/table.mdx)

Add a new Workflow or Automation

  1. Define the workflow YAML schema following Keep's workflow DSL (docs/cli/commands/workflow-apply.mdx)
  2. Implement the workflow execution logic in the backend (docs/cli/commands/cli-workflow.mdx)
  3. Add the workflow template to the UI dashboard for user discovery (.cursor/rules/keep-ui-react-typescript.mdc)
  4. Document the workflow with examples and configuration options (docs/cli/overview.mdx)

Add a new UI Component or Dashboard View

  1. Follow React/TypeScript patterns and component structure (.cursor/rules/keep-ui-react-typescript.mdc)
  2. Implement tests using the established UI testing framework (.cursor/rules/keep-ui-tests.mdc)
  3. Create the component in the UI layer and connect to backend API (docs/alerts/sidebar.mdx)
  4. Add the component to the navigation and routing if needed (docker/Dockerfile.ui)

🔧Why these technologies

  • Docker + docker-compose — Standardizes dev, test, and prod environments; enables multi-service orchestration (API, UI, cache, DB) with reproducible deployments across local and cloud platforms.
  • React + TypeScript (Frontend) — Type-safe UI development with real-time alert dashboard updates; TypeScript catches integration errors early; React enables reactive alert state changes and preset filtering.
  • Python Backend (Flask/FastAPI implied) — Rapid alert processing and enrichment logic; rich ecosystem for monitoring integrations; supports async task queues (arq noted in compose) for background alert correlation.
  • Multi-auth support (Okta, Auth0, Keycloak, OAuth2, DB) — Enterprise flexibility; no single auth vendor lock-in; enables Keep deployment in strict corporate environments with existing identity systems.
  • Bi-directional webhooks/integrations — Enables both alert ingestion from monitoring tools and alert dispatch to incident management; reduces need for polling and achieves near-real-time alert flow.

⚖️Trade-offs already made

  • Single pane of glass for alert management vs. staying agnostic to monitoring backend

    • Why: Users need a unified interface to deduplicate and correlate alerts from multiple sources; a single aggregation point is essential for AIOps correlation.
    • Consequence: Requires connectors for each monitoring/alerting system; adds integration maintenance burden; creates single point of failure risk if Keep itself goes down.
  • CLI + API + UI for alert operations (three interfaces)

    • Why: Supports different user personas: ops engineers (CLI/automation), SREs (API), and incident responders (UI dashboard).
    • Consequence: Higher development and testing burden; must keep all three in sync; risk of feature inconsistency across interfaces.
  • Alert deduplication + enrichment + correlation in backend vs. at ingestion point

    • Why: Centralized dedup ensures consistency; allows rich correlation across time and multiple sources; supports backfilling historical alerts.
    • Consequence: Backend becomes compute-intensive; alert latency increases; requires efficient caching and indexing for high-volume environments.

🪤Traps & gotchas

  • Environment variables: the backend likely requires DATABASE_URL, API keys for provider integrations (OpenAI, Anthropic, Datadog, etc.), and OTEL configuration if observability is enabled.
  • Services: requires PostgreSQL or a similar DB (docker-compose sets this up) and Redis/ARQ for async tasks (docker-compose-with-arq.yml).
  • Frontend build: TypeScript strict mode is likely enforced (common in mature projects).
  • Monorepo: backend changes may require a frontend rebuild due to API changes — keep both in sync.
  • Pre-commit hooks: .pre-commit-config.yaml is present, so commits may be blocked if linting fails locally.
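A fail-fast startup check makes the environment-variable trap concrete. In this sketch every variable name except DATABASE_URL is an assumption for illustration — the authoritative list lives in the .env files and docker-compose.common.yml noted above.

```python
# Illustrative fail-fast env check — variable names beyond DATABASE_URL are
# assumptions, not Keep's documented configuration.
REQUIRED = ["DATABASE_URL"]
OPTIONAL_HINTS = {
    "OPENAI_API_KEY": "AI enrichment features",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "OpenTelemetry traces",
}


def check_env(environ: dict) -> list:
    """Return a list of human-readable problems; empty means OK."""
    problems = [f"missing required env var: {name}"
                for name in REQUIRED if not environ.get(name)]
    for name, feature in OPTIONAL_HINTS.items():
        if not environ.get(name):
            problems.append(f"note: {name} unset — {feature} will be disabled")
    return problems
```

Running a check like this (fed `os.environ`) before the server boots turns a confusing mid-request failure into an explicit error at startup.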

🏗️Architecture

💡Concepts to learn

  • Alert Deduplication and Correlation — Core feature of Keep that reduces alert fatigue by grouping related alerts from multiple sources; critical to understand fingerprinting, grouping keys, and time-window based correlation
  • Provider Pattern (Pluggable Integrations) — Keep's extensibility is built on a provider abstraction layer allowing new monitoring tools to be integrated; understanding this pattern is essential for adding new integrations
  • Bi-directional Sync / Event-driven Architecture — Alerts flow into Keep from providers AND Keep can push actions back (close incident, update status); requires understanding webhooks, polling, eventual consistency, and idempotency
  • Async Task Queuing (ARQ/Celery) — Provider syncs and alert processing likely run asynchronously; docker-compose-with-arq.yml indicates ARQ is used for background job processing
  • Alert Enrichment via AI Backends — Keep integrates OpenAI and Anthropic to add context to alerts; understanding prompt engineering, token limits, and API rate limiting is important for contributing AI features
  • Workflow Automation (Alert Response Pipelines) — Users define workflows to automatically respond to alerts (notify teams, create tickets, run remediation); understanding conditional logic, action chaining, and error handling is key
  • OpenTelemetry Instrumentation — docker-compose-with-otel.yaml present indicates OTEL is used for observability; understanding traces, metrics, and logs helps debug alert processing pipelines
  • prometheus-community/alertmanager — Alerts received and routed by Keep often originate from Prometheus AlertManager; understanding AlertManager routing and grouping is complementary
  • grafana/grafana — Grafana is a common alert source and visualization companion; Keep integrates with Grafana alerts and dashboards
  • opsgenie/atlassian-opsgenie-integration — OpsGenie is a direct competitor/alternative alert management platform; Keep likely integrates with it as a bi-directional provider
  • elastic/kibana — Elastic stack is a common source of alerts and logs for Keep's enrichment and correlation features
  • keephq/keep-workflows — Sister repository likely containing official workflow templates and examples for automating alert response
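The fingerprinting and time-window grouping behind the first concept can be shown with a toy sketch. The field choices and the 5-minute window are assumptions for illustration, not Keep's actual deduplication rules.

```python
# Toy fingerprint-based deduplication with a time window — field names and
# window length are assumptions, not Keep's real implementation.
import hashlib


def fingerprint(alert: dict, keys=("source", "service", "name")) -> str:
    """Stable hash over the fields that identify 'the same' alert."""
    material = "|".join(str(alert.get(k, "")) for k in keys)
    return hashlib.sha256(material.encode()).hexdigest()[:16]


def deduplicate(alerts: list, window_s: int = 300) -> list:
    """Keep the first alert per fingerprint within each time window."""
    seen = {}  # fingerprint -> timestamp of last kept alert
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        fp = fingerprint(alert)
        if fp not in seen or alert["ts"] - seen[fp] > window_s:
            kept.append(alert)
            seen[fp] = alert["ts"]
    return kept
```

Repeats of the same alert inside the window collapse into one, while a recurrence after the window opens a fresh group — the basic mechanism behind reducing alert fatigue.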

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add E2E tests for alert enrichment and correlation workflows

The repo has comprehensive E2E test workflows (.github/workflows/test-pr-e2e.yml, run-e2e-tests.yml) and extensive alert evaluation documentation (docs/alertevaluation/), but there are no visible E2E test files for the core alert enrichment, deduplication, and correlation features mentioned in the README. This is critical for an alert management platform where data accuracy is paramount.

  • [ ] Create tests/e2e/alert-enrichment.spec.ts to test enrichment pipeline with real provider integrations
  • [ ] Create tests/e2e/alert-correlation.spec.ts to test alert deduplication and correlation logic
  • [ ] Reference the alert evaluation examples in docs/alertevaluation/examples/ as test data sources
  • [ ] Integrate new tests into .github/workflows/test-pr-e2e.yml workflow

Add provider authentication tests and documentation

The repo has multiple Docker Compose configs with different auth modes (docker-compose-with-auth.yml, docker-compose-with-otel.yaml) and authentication docs (docs/authentication/okta.md), but there's no systematic test coverage for bi-directional provider integrations. New contributors need clearer guidance on testing provider connectivity and credential handling.

  • [ ] Create tests/unit/providers/auth-handler.test.ts to validate credential encryption and storage patterns
  • [ ] Create tests/integration/providers-smoke-test.ts to verify each provider's bi-directional integration setup
  • [ ] Add docs/authentication/provider-setup-guide.md documenting testing auth flows for new providers
  • [ ] Ensure test examples reference the existing CONTRIBUTING.md guidelines

Implement missing CLI command tests and documentation

The docs/cli/commands/ directory has many documented commands (alert-enrich, alert-get, alert-list, config, provider), but there's no corresponding test directory structure visible in the file listing. The test-pr-ut.yml workflow exists but likely lacks CLI-specific coverage for these commands.

  • [ ] Create tests/unit/cli/commands/alert-commands.test.ts covering alert-enrich, alert-get, alert-list flows
  • [ ] Create tests/unit/cli/commands/config-commands.test.ts covering cli-config-new, cli-config-show, cli-config
  • [ ] Create tests/unit/cli/commands/provider-commands.test.ts covering cli-provider operations
  • [ ] Update .github/workflows/test-pr-ut.yml to include CLI-specific test reporting and coverage thresholds

🌿Good first issues

  • Add unit test coverage for the alert deduplication logic in the backend; the test-pr-ut.yml workflow exists but likely has uncovered functions in the core alert matching/correlation system.
  • Document the provider integration template by creating a provider-skeleton.md in docs/ with a step-by-step guide; new providers are requested frequently (new_provider_request.md template exists) but there's no clear contributor guide.
  • Implement missing E2E test for the bi-directional sync workflow between Keep and a sample provider (e.g., mock PagerDuty sync); test-pr-e2e.yml infrastructure exists but coverage is likely incomplete for sync scenarios.


📝Recent commits

  • df4e48d — fix: unit tests (#6298) (Walkablenormal)
  • 08b4c77 — feat(pagerduty): add client and client_url support for Events API v2 (#6241) (jyoti369)
  • 256b07e — fix: expose is_visible in IncidentDto for workflow conditions (#6282) (alpar)
  • 0eed307 — fix: handle None alert_results in _notify_alert when no if_condition (#6284) (alpar)
  • b2f4226 — fix: Docker deploy script exit on error (#6286) (TubSticks)
  • 8669af0 — fix: set event in foreach AlertDto path so incident resolution check runs (#6213) (DragonBot00)
  • 9a8cb02 — fix: generate UUID for prometheus alerts when id is not a valid uuid (#6219) (DragonBot00)
  • 42954d8 — docs: Add ALERT_SIDEBAR_FIELDS to configuration options (#6244) (Walkablenormal)
  • 4f7981e — fix: add OKTA and KEYCLOAK to tenant seeding allowlist (#6249) (QuinnClaw)
  • 4411945 — fix(auth): restore get_roles() for Okta — API key creation broken under AUTH_TYPE=OKTA (#6254) (ahbeigi)

🔒Security observations

  • High · Default NO_AUTH Configuration in Docker Compose — docker-compose.yml (keep-frontend and keep-backend services). The docker-compose.yml file configures AUTH_TYPE=NO_AUTH for both frontend and backend services. This disables all authentication mechanisms, allowing unauthenticated access to the entire AIOps platform including alert management, workflows, and integrations. Fix: Replace NO_AUTH with appropriate authentication type (OAuth2, OIDC, JWT, etc.) for production deployments. Use environment-specific configurations and never commit production credentials. Implement proper authentication before exposing the application.
  • High · Unencrypted Inter-service Communication — docker-compose.yml (API_URL environment variable). Services communicate over plain HTTP (http://keep-backend:8080) without TLS/HTTPS encryption. This allows potential man-in-the-middle attacks on internal service communication, especially when alerts and sensitive data are transmitted between frontend and backend. Fix: Configure HTTPS/TLS for all service-to-service communication. Use encrypted connections (https://) and implement certificate validation. Consider service mesh solutions like Istio for automatic encryption.
  • Medium · Shared Volume /state Without Explicit Permissions — docker-compose.yml (volumes section for keep-frontend and keep-backend). Both keep-frontend and keep-backend services mount the same ./state volume without explicit permission controls or encryption. This could lead to unauthorized access to persistent data or configuration files stored in this directory. Fix: Implement proper file permissions on the /state directory. Use encrypted volumes. Consider using Docker secrets for sensitive data instead of volume mounts. Segregate volumes by service where possible.
  • Medium · Exposed Grafana Service on Unprotected Port — docker-compose.yml (grafana service, ports: 3001:3000). Grafana service is exposed on port 3001 without authentication requirements visible in the configuration. The default Grafana setup allows unauthenticated access to monitoring dashboards and metrics. Fix: Enforce Grafana authentication with strong credentials. Restrict port 3001 access using firewall rules. Set GF_SECURITY_ADMIN_PASSWORD and GF_AUTH_ANONYMOUS_ENABLED=false. Remove Grafana from production if not needed.
  • Medium · Missing Prometheus Multiproc Directory Configuration — docker-compose.yml (keep-backend environment variables). PROMETHEUS_MULTIPROC_DIR is set to /tmp/prometheus which is a world-readable temporary directory. Prometheus metrics may contain sensitive information about system behavior and could be accessed by other processes. Fix: Use a secure directory with restricted permissions instead of /tmp. Set appropriate ownership and chmod (700 or stricter). Consider using /var/lib/prometheus or similar application-specific directories.
  • Medium · Missing Security Headers Configuration — docker-compose.yml, keep-ui Dockerfile configuration. No visible configuration for security headers (Content-Security-Policy, X-Frame-Options, X-Content-Type-Options, etc.) in the provided infrastructure setup. This increases XSS and clickjacking risks. Fix: Configure security headers in the reverse proxy/nginx configuration. Implement CSP, X-Frame-Options: DENY, X-Content-Type-Options: nosniff, X-XSS-Protection headers.
  • Medium · Latest Image Tags in Production Configuration — docker-compose.yml (grafana service image). Docker Compose uses latest tags for Grafana (grafana/grafana:latest), which could pull vulnerable versions. Using floating tags makes deployments non-deterministic and harder to audit. Fix: Pin Docker images to specific versions (e.g., grafana/grafana:9.5.3 or grafana/grafana:10.0.0). Use image scanning and implement automated updates with proper testing.
  • Low · Missing Network Isolation — docker-compose.yml (overall service configuration). Docker Compose services use default networking without explicit network segmentation. All services can potentially communicate with each other without restrictions. Fix: Define custom Docker networks and use them to isolate services. Only expose necessary ports. Implement network policies for service-to-service communication.
  • Low · No Container Resource Limits — docker-compose.yml (service definitions). Services set no CPU or memory limits, so a runaway container can starve the host. Fix: add per-service limits (e.g. mem_limit and cpus, or deploy.resources in Compose v3/Swarm).

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
