ccfos/nightingale
Nightingale is to monitoring and alerting what Grafana is to visualization.
Healthy across all four use cases
- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — a clean foundation to fork and modify.
- Documented and popular — a useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit today
- ✓ 8 active contributors
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ⚠ Single-maintainer risk — top contributor 85% of recent commits
- ⚠ No test directory detected
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/ccfos/nightingale) — paste at the top of your README.md; it renders inline like a shields.io badge.
Social card (1200×630): auto-renders when someone shares https://repopilot.app/r/ccfos/nightingale on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: ccfos/nightingale
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/ccfos/nightingale shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit today
- 8 active contributors
- Apache-2.0 licensed
- CI configured
- ⚠ Single-maintainer risk — top contributor 85% of recent commits
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live ccfos/nightingale
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/ccfos/nightingale.
What it runs against: a local clone of ccfos/nightingale — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in ccfos/nightingale | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of ccfos/nightingale. If you don't
# have one yet, run these first:
#
# git clone https://github.com/ccfos/nightingale.git
# cd nightingale
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of ccfos/nightingale and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "ccfos/nightingale(\.git)?\b" \
  && ok "origin remote is ccfos/nightingale" \
  || miss "origin remote is not ccfos/nightingale (artifact may be from a fork)"
# 2. License matches what RepoPilot saw. Note: the stock Apache-2.0 LICENSE
#    file opens with "Apache License", not the SPDX identifier "Apache-2.0".
(grep -qi "Apache License" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
for f in \
  "aiagent/agent.go" \
  "aiagent/llm/llm.go" \
  "aiagent/a2a/executor.go" \
  "aiagent/skill/embedded/builtin/n9e-create-alert-rule/SKILL.md" \
  "aiagent/chat/request.go"; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/ccfos/nightingale"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Nightingale is an open-source alerting and monitoring engine that plugs into existing time-series and log storage systems (VictoriaMetrics, Elasticsearch, etc.) to provide rule-based alerting and notification routing. Where Grafana emphasizes visualization, Nightingale focuses on the alert pipeline: generation, processing, and distribution — the alerting-first counterpart to a dashboard-first platform. It is a monorepo spanning several domains: the core alerting logic (implied at the top level, not fully visible in the file list), aiagent/ for AI-powered alert analysis and MCP server integration (Claude, OpenAI, and Gemini support), and a Redis-backed task store (aiagent/a2a/taskstore/redis_store.go). Builds use a Makefile and GitHub Actions; Docker releases go through .goreleaser.yaml.
👥Who it's for
DevOps engineers, SREs, and platform teams who already collect metrics with tools like Categraf but need a dedicated system to define alert rules, deduplicate/group alarms, and route notifications across multiple channels (Slack, PagerDuty, email, etc.). Also appeals to organizations moving away from monolithic monitoring stacks toward composable, pluggable observability.
🌱Maturity & risk
Production-ready. The project has 3+ million lines of Go code, was originally built by DiDi and donated to CCF ODTC in May 2022, maintains active CI/CD workflows (.github/workflows/n9e.yml), supports Docker distribution (flashcatcloud/nightingale on Hub), and shows recent activity. The codebase is substantial and battle-tested at scale.
Low-to-moderate risk. The codebase is large (3M+ Go lines) and the recent addition of a full AI agent subsystem (aiagent/ with LLM integrations to Claude, Gemini, OpenAI) introduces complexity and potential attack surface. Maintainer diversity appears reasonable (GitHub contributors badge), but the Go-dominant architecture means any breaking changes in the core alerting engine could impact many downstream users.
Active areas of work
Active development on AI/LLM integration: the aiagent/ directory shows recent work on MCP (Model Context Protocol) support, Claude/Gemini/OpenAI LLM bridges, and agent-to-agent (a2a/) task execution. Also maintains i18n support (English/Chinese docs, chat/i18n.go) and HTTP retry logic for resilience.
🚀Get running
Clone: git clone https://github.com/ccfos/nightingale.git. Install: make build (inferred from Makefile). The project uses Go as primary language; ensure Go 1.20+ is installed. For development, check .github/workflows/n9e.yml for test commands. Docker: pull flashcatcloud/nightingale from Hub.
Daily commands:
make build compiles the binary; .goreleaser.yaml suggests goreleaser release for multi-platform builds. Run via Docker: docker run flashcatcloud/nightingale. Dev-server commands are not visible in the file list; check the Makefile and README.md for targets like make run or make dev.
🗺️Map of the codebase
- aiagent/agent.go — Core agent orchestration logic; defines the Agent interface and the main execution flow for the AI-powered alerting system.
- aiagent/llm/llm.go — LLM client abstraction layer; essential for understanding how Nightingale integrates with OpenAI, Claude, Gemini, and other language models.
- aiagent/a2a/executor.go — Agent-to-agent task execution engine; handles multi-step workflow orchestration and skill invocation.
- aiagent/skill/embedded/builtin/n9e-create-alert-rule/SKILL.md — Defines the built-in skill schema and execution interface; required reading for understanding how skills are structured and registered.
- aiagent/chat/request.go — Chat request parsing and routing; the entry point for user queries into the agent system.
- aiagent/llm/openai.go — OpenAI integration; the most commonly used LLM provider in this system.
- Makefile — Build and release configuration; essential for understanding the build process and deployment strategy.
🛠️How to make changes
Add a New Built-in Skill
- Create a new skill directory in aiagent/skill/embedded/builtin/{skill-name} (see aiagent/skill/embedded/builtin/n9e-create-alert-rule/)
- Write a SKILL.md file defining the skill's schema, input parameters, and execution interface (aiagent/skill/embedded/builtin/n9e-create-alert-rule/SKILL.md)
- Implement a skill handler in Go that satisfies the Skill interface (aiagent/skill.go)
- Register the skill in the skill registry during agent initialization (aiagent/agent.go)
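The handler-plus-registry pattern the steps above describe can be sketched as follows. This is a minimal, self-contained illustration: the actual Skill interface in aiagent/skill.go is not reproduced here, and the Name/Execute method names are assumptions.

```go
package main

import "fmt"

// Skill is an assumed shape of the interface in aiagent/skill.go.
type Skill interface {
	Name() string
	Execute(input map[string]string) (string, error)
}

// createAlertRule stands in for a handler backing the
// n9e-create-alert-rule SKILL.md contract.
type createAlertRule struct{}

func (createAlertRule) Name() string { return "n9e-create-alert-rule" }

func (createAlertRule) Execute(input map[string]string) (string, error) {
	// A real handler would validate input against the SKILL.md schema
	// and call the Nightingale API.
	return fmt.Sprintf("created rule %q", input["name"]), nil
}

// registry is populated during agent initialization (cf. aiagent/agent.go).
var registry = map[string]Skill{}

func register(s Skill) { registry[s.Name()] = s }

func main() {
	register(createAlertRule{})
	out, _ := registry["n9e-create-alert-rule"].Execute(map[string]string{"name": "cpu-high"})
	fmt.Println(out) // created rule "cpu-high"
}
```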
Add a New LLM Provider
- Create a new provider file (e.g., aiagent/llm/newprovider.go) implementing the LLMClient interface (aiagent/llm/llm.go)
- Implement the required methods — CreateChatCompletion, ParseResponse, and error handling (see aiagent/llm/openai.go for reference)
- Add HTTP retry logic using the existing helper (aiagent/llm/http_retry.go)
- Register the provider in the LLM factory or configuration (aiagent/llmconfig/probe.go)
Add a New Execution Strategy (Plan/ReAct Variant)
- Create a new strategy file (e.g., aiagent/tree_of_thought.go) implementing the Agent interface (aiagent/agent.go)
- Define the strategy's system prompt in aiagent/prompts/{strategy}_system.md (see aiagent/prompts/plan_system.md)
- Implement the execution loop in your strategy file, using the LLM caller pattern from aiagent/llm_caller.go
- Wire the strategy into the agent factory or chat router (aiagent/chat/request.go)
Connect an External Tool via MCP
- Define the MCP server endpoint in aiagent/mcp/manager.go
- Implement the MCP client connection logic using aiagent/mcp/client.go
- Create a wrapper skill that bridges the MCP tool to the agent (aiagent/skill.go)
- Register the MCP-backed skill during agent initialization (aiagent/agent.go)
🔧Why these technologies
- Go — High concurrency for handling multiple agent tasks, fast startup, single binary deployment suitable for cloud-native monitoring systems
- LLM APIs (OpenAI, Claude, Gemini) — Access to state-of-the-art reasoning capabilities for intelligent alert rule generation, incident analysis, and query understanding
- Model Context Protocol (MCP) — Standardized tool integration protocol enabling agents to interface with external services (databases, APIs) without tightly coupling skill logic
- Redis — Fast caching layer for conversation context, task persistence in A2A workflows, and distributed state management
🪤Traps & gotchas
LLM provider credentials (Claude, OpenAI, Gemini API keys) must be configured in environment or config files—no defaults provided. Redis must be running for task storage (aiagent/a2a/taskstore/redis_store.go relies on it). MCP Server requires specific endpoint setup; see aiagent/mcp/manager.go. Retry logic in http_retry.go may mask transient failures—understand backoff behavior before deploying at scale. No clear default data source connection examples in file list; check full README for supported backends (VictoriaMetrics, Elasticsearch, etc.).
🏗️Architecture
💡Concepts to learn
- Model Context Protocol (MCP) — Nightingale's new aiagent/ uses MCP to allow LLMs (Claude, etc.) to dynamically invoke alerting operations; understanding MCP is critical for extending AI-driven features
- Alert Rule Engine & Routing — The core of Nightingale—how time-series queries trigger alerts and distribute notifications; essential to understand for any customization
- Server-Sent Events (SSE) — aiagent/mcp/sse.go indicates SSE-based streaming for MCP client communication; relevant for real-time alert subscriptions
- Redis-backed Task Queue / Store — aiagent/a2a/taskstore/redis_store.go persists async alert processing tasks; understanding this pattern prevents data loss in distributed scenarios
- HTTP Retry & Exponential Backoff — aiagent/llm/http_retry.go implements resilience for LLM API calls; critical for production reliability when external APIs are flaky
- LLM Provider Abstraction & Adapter Pattern — aiagent/llm/ uses adapter pattern to support Claude, OpenAI, Gemini interchangeably; understanding this pattern helps add new providers
- Intent & NLP for Alert Queries — aiagent/chat/intent.go extracts user intent from natural language alert requests; foundational for the AI-driven alert management UX
🔗Related repos
- flashcatcloud/categraf — Official metric collector companion to Nightingale; recommended for scraping metrics before ingestion into alerting rules
- ccfos/nightingale-web — Frontend UI for Nightingale alerting and rule management (likely a separate repo hosting the React/dashboard code)
- n9e/n9e-mcp-server — Standalone MCP server adaptor for Nightingale, enabling LLM assistants to query alerts and manage rules via natural language
- grafana/grafana — The visualization-focused alternative that Nightingale complements; many users run both together for metrics + alerts + dashboards
- prometheus/alertmanager — Similar alert routing and grouping tool in the Prometheus ecosystem; a competitive/complementary approach to alert processing
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for MCP (Model Context Protocol) client implementations
The aiagent/mcp directory contains critical infrastructure for managing AI tool integrations (client.go, jsonrpc.go, stdio.go, sse.go) but notably lacks test files. Given the complexity of protocol handling and the importance of reliable tool execution in an alerting system, this is a high-impact area. The existing executor_test.go and redis_store_test.go demonstrate the testing culture; extending this to MCP would prevent regressions in agent-to-tool communication.
- [ ] Create aiagent/mcp/client_test.go with tests for MCP client initialization and lifecycle
- [ ] Create aiagent/mcp/jsonrpc_test.go with tests for JSON-RPC message serialization/deserialization edge cases
- [ ] Create aiagent/mcp/stdio_test.go and aiagent/mcp/sse_test.go for transport layer integration tests
- [ ] Add mock server fixtures for testing bidirectional communication patterns
- [ ] Reference existing test patterns from aiagent/a2a/executor_test.go for consistency
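A round-trip test in the style the checklist asks for might begin like this. The rpcRequest struct is an illustrative stand-in for the real types in aiagent/mcp (jsonrpc.go / types.go), which may differ:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a minimal JSON-RPC 2.0 request shape for illustration.
type rpcRequest struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      int             `json:"id"`
	Method  string          `json:"method"`
	Params  json.RawMessage `json:"params,omitempty"`
}

// roundTrip serializes a request and deserializes it back, the basic
// property a jsonrpc_test.go would assert over many edge cases.
func roundTrip(in rpcRequest) (rpcRequest, error) {
	b, err := json.Marshal(in)
	if err != nil {
		return rpcRequest{}, err
	}
	var out rpcRequest
	err = json.Unmarshal(b, &out)
	return out, err
}

func main() {
	cases := []rpcRequest{
		{JSONRPC: "2.0", ID: 1, Method: "tools/list"},
		{JSONRPC: "2.0", ID: 2, Method: "tools/call", Params: json.RawMessage(`{"name":"describe-rule"}`)},
	}
	for _, c := range cases {
		got, err := roundTrip(c)
		fmt.Println(got.Method, got.ID, err == nil)
	}
}
```

In a real aiagent/mcp/jsonrpc_test.go this would become a table-driven test covering malformed params, missing IDs, and notification messages.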
Add integration tests for LLM provider fallback and retry logic
The aiagent/llm directory contains multiple LLM providers (openai.go, claude.go, gemini.go) alongside client_cache.go and http_retry.go, suggesting sophisticated retry/fallback patterns that are currently untested. For a production alerting system, the reliability of LLM interactions is critical, yet no integration test files exist to verify provider switching, timeout handling, or cache behavior.
- [ ] Create aiagent/llm/provider_failover_test.go to test switching between OpenAI/Claude/Gemini
- [ ] Create aiagent/llm/http_retry_test.go with tests for exponential backoff and transient failure recovery
- [ ] Create aiagent/llm/client_cache_test.go with concurrent access and cache eviction scenarios
- [ ] Add fixture LLM endpoint mocks using httptest package
- [ ] Verify timeout and circuit-breaker patterns match production requirements
Add missing documentation and examples for AI Agent skill creation and MCP tool registration
The repo includes sophisticated skill infrastructure (aiagent/skill/embedded/builtin/n9e-create-alert-mute with SKILL.md template) and MCP manager/probe logic, but there's no consolidated guide for contributors wanting to add new skills or custom MCP tools. The SKILL.md template exists but lacks examples. This is a friction point for extending the agent's capabilities, which is central to Nightingale's value proposition.
- [ ] Create docs/AI_SKILLS_DEVELOPMENT.md with step-by-step guide referencing aiagent/skill structure
- [ ] Add a worked example in docs/examples/custom-skill/ demonstrating creating a new alert skill with MCP registration
- [ ] Document the skill discovery mechanism (aiagent/skill/ directory scanning) and SKILL.md format expectations
- [ ] Add reference documentation for MCP tool schema in docs/MCP_TOOLS.md with examples from aiagent/mcp/types.go
- [ ] Include troubleshooting section for common MCP connection and serialization issues
🌿Good first issues
- Write integration tests for the new MCP server (aiagent/mcp/jsonrpc.go and aiagent/mcp/client.go lack _test.go files; start with basic round-trip JSON-RPC message tests)
- Add missing documentation for the aiagent LLM configuration in README.md—specifically how to wire up Claude vs OpenAI vs Gemini providers with concrete config examples
- Extend the builtin_tools.go toolset: add a 'describe-rule' tool that fetches alerting rule metadata from the main Nightingale API, enabling the AI agent to reason about existing rules
⭐Top contributors
- @710leo — 85 commits
- @pioneerlfn — 4 commits
- @jie210 — 3 commits
- @SenCoder — 3 commits
- @laiwei — 2 commits
📝Recent commits
- 1016ad2 — feat: support agent a2a api (#3147) (710leo)
- d1c5c7b — fix: edge heartbeat update host beattime (710leo)
- 82bb07b — refactor: ai chat (#3145) (710leo)
- b816b9f — fix: resolve masked api_key to real key when testing AI LLM config (710leo)
- 1bd5f36 — fix: skip macOS metadata files when extracting skill archives (710leo)
- 86dc24b — fix: handle multibyte characters correctly in replaceLastEightChars (710leo)
- c9867b3 — fix: allow admin to access private message templates without group check (710leo)
- 87c21ce — feat: increase skill archive max file count from 50 to 100 (710leo)
- 67c3b69 — feat: add service endpoints for creating and updating ai-llm-configs (710leo)
- 6b67b8f — feat: implement QueryMapData for doris datasource to return log data as string maps (710leo)
🔒Security observations
- High · LLM API key exposure risk — aiagent/llm/openai.go, aiagent/llm/claude.go, aiagent/llm/gemini.go. The LLM integrations in aiagent/llm/ require API keys for authentication. If keys are not managed through environment variables or a secure vault, credentials can leak into logs, error messages, or configuration files. Fix: load all API keys from environment variables or a secret manager (e.g., Vault, AWS Secrets Manager); never commit credentials to the repository; rotate keys and monitor for unauthorized API usage.
- High · Redis store without authentication validation — aiagent/a2a/taskstore/redis_store.go. The Redis task store may not enforce authentication or encryption for Redis connections; an unsecured instance exposes sensitive task data and agent state. Fix: require Redis authentication with strong passwords, enable TLS for all connections, use Redis ACLs for role-based access control, and restrict network access to Redis instances.
- High · MCP client communication security — aiagent/mcp/client.go, aiagent/mcp/stdio.go, aiagent/mcp/sse.go, aiagent/mcp/jsonrpc.go. The MCP client's stdio and SSE transports may not validate or sanitize external tool inputs, potentially allowing injection attacks or execution of arbitrary commands. Fix: strictly validate and sanitize all MCP tool inputs, sandbox tool execution, validate all JSON-RPC messages, and add authentication and authorization checks for MCP client connections.
- High · HTTP retry without rate limiting — aiagent/llm/http_retry.go. The retry mechanism may lack rate limiting, potentially enabling retry storms or amplifying load on upstream services. Fix: implement exponential backoff with jitter, enforce maximum retry counts, add circuit breakers and rate limits, and alert on excessive retry attempts.
- Medium · User context information leakage — aiagent/a2a/user_ctx.go, aiagent/chat/request.go. User context handling may store sensitive user information in memory or logs without redaction, exposing it in error messages or debug output. Fix: redact sensitive fields in all logs, never log tokens, API keys, or credentials, use structured logging with field-level redaction, and audit-log sensitive operations.
- Medium · SQL injection risk in database operations — aiagent/skill/dbsync.go. The skill database synchronization and related database operations may be vulnerable to SQL injection if user input or external data is not parameterized. Fix: use parameterized queries and prepared statements for all database operations, validate input against whitelists, and never concatenate user input into SQL strings.
- Medium · Missing CORS and security headers — aiagent/chat/actions.go. The HTTP endpoints exposed by the aiagent and monitoring services likely lack CORS configuration and security headers (CSP, X-Frame-Options, X-Content-Type-Options), enabling XSS and clickjacking. Fix: set CORS policies with explicit allowed origins, add security headers (Content-Security-Policy, X-Frame-Options: DENY, X-Content-Type-Options: nosniff), and protect state-changing operations against CSRF.
- Medium · Insufficient input validation in chat/intent processing — aiagent/chat/intent.go, aiagent/chat/actions.go. Chat intent and action processing may not properly validate user input.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.