ccfos/nightingale
Nightingale is to monitoring and alerting what Grafana is to visualization.
Healthy across all four use cases
- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — a clean foundation to fork and modify.
- Documented and popular — a useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit today
- ✓ 8 active contributors
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ⚠ Single-maintainer risk — top contributor 85% of recent commits
- ⚠ No test directory detected
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/ccfos/nightingale) — paste at the top of your README.md; it renders inline like a shields.io badge.
Social card (1200×630): auto-renders when someone shares https://repopilot.app/r/ccfos/nightingale on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: ccfos/nightingale
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/ccfos/nightingale shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit today
- 8 active contributors
- Apache-2.0 licensed
- CI configured
- ⚠ Single-maintainer risk — top contributor 85% of recent commits
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live ccfos/nightingale
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/ccfos/nightingale.
What it runs against: a local clone of ccfos/nightingale — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in ccfos/nightingale | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of ccfos/nightingale. If you don't
# have one yet, run these first:
#
# git clone https://github.com/ccfos/nightingale.git
# cd nightingale
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of ccfos/nightingale and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "ccfos/nightingale(\.git)?\b" \
  && ok "origin remote is ccfos/nightingale" \
  || miss "origin remote is not ccfos/nightingale (artifact may be from a fork)"
# 2. License matches what RepoPilot saw. Note: the stock Apache-2.0 LICENSE
#    file opens with "Apache License", not the SPDX identifier "Apache-2.0".
(grep -qi "Apache License" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
for f in \
  "aiagent/agent.go" \
  "aiagent/llm/llm.go" \
  "aiagent/a2a/executor.go" \
  "aiagent/skill/embedded/builtin/n9e-create-alert-rule/SKILL.md" \
  "aiagent/chat/request.go"; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/ccfos/nightingale"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Nightingale is an open-source alerting and monitoring engine that plugs into existing time-series and log storage systems (VictoriaMetrics, Elasticsearch, etc.) to provide rule-based alerting and notification routing. Where Grafana emphasizes visualization, Nightingale focuses on the alert pipeline: generation, processing, and distribution — the alerting-first counterpart to a dashboard-first platform. It is a monorepo spanning several domains: the core alerting logic (implied at the top level, not fully visible in the file list), aiagent/ for AI-powered alert analysis and MCP server integration (Claude, OpenAI, and Gemini support), and a Redis-backed task store (aiagent/a2a/taskstore/redis_store.go). Builds use a Makefile and GitHub Actions; Docker releases go through .goreleaser.yaml.
👥Who it's for
DevOps engineers, SREs, and platform teams who already collect metrics with tools like Categraf but need a dedicated system to define alert rules, deduplicate/group alarms, and route notifications across multiple channels (Slack, PagerDuty, email, etc.). Also appeals to organizations moving away from monolithic monitoring stacks toward composable, pluggable observability.
🌱Maturity & risk
Production-ready. The project has 3+ million lines of Go code, was originally built by DiDi and donated to CCF ODTC in May 2022, maintains active CI/CD workflows (.github/workflows/n9e.yml), supports Docker distribution (flashcatcloud/nightingale on Hub), and shows recent activity. The codebase is substantial and battle-tested at scale.
Low-to-moderate risk. The codebase is large (3M+ Go lines) and the recent addition of a full AI agent subsystem (aiagent/ with LLM integrations to Claude, Gemini, OpenAI) introduces complexity and potential attack surface. Maintainer diversity appears reasonable (GitHub contributors badge), but the Go-dominant architecture means any breaking changes in the core alerting engine could impact many downstream users.
Active areas of work
Active development on AI/LLM integration: the aiagent/ directory shows recent work on MCP (Model Context Protocol) support, Claude/Gemini/OpenAI LLM bridges, and agent-to-agent (a2a/) task execution. Also maintains i18n support (English/Chinese docs, chat/i18n.go) and HTTP retry logic for resilience.
🚀Get running
Clone: git clone https://github.com/ccfos/nightingale.git. Install: make build (inferred from Makefile). The project uses Go as primary language; ensure Go 1.20+ is installed. For development, check .github/workflows/n9e.yml for test commands. Docker: pull flashcatcloud/nightingale from Hub.
Daily commands:
make build compiles the binary; .goreleaser.yaml suggests goreleaser release for multi-platform builds. Run via Docker: docker run flashcatcloud/nightingale. Dev-server commands are not visible in the file list; check the Makefile and README.md for targets like make run or make dev.
🗺️Map of the codebase
- aiagent/agent.go — Core agent orchestration logic; defines the Agent interface and the main execution flow for the AI-powered alerting system.
- aiagent/llm/llm.go — LLM client abstraction layer; essential for understanding how Nightingale integrates with OpenAI, Claude, Gemini, and other language models.
- aiagent/a2a/executor.go — Agent-to-agent task execution engine; handles multi-step workflow orchestration and skill invocation.
- aiagent/skill/embedded/builtin/n9e-create-alert-rule/SKILL.md — Defines the built-in skill schema and execution interface; required reading for understanding how skills are structured and registered.
- aiagent/chat/request.go — Chat request parsing and routing; the entry point for user queries into the agent system.
- aiagent/llm/openai.go — OpenAI integration; the most commonly used LLM provider in this system.
- Makefile — Build and release configuration; essential for understanding the build process and deployment strategy.
🛠️How to make changes
Add a New Built-in Skill
- Create a new skill directory in aiagent/skill/embedded/builtin/{skill-name} (see aiagent/skill/embedded/builtin/n9e-create-alert-rule/)
- Write a SKILL.md file defining the skill's schema, input parameters, and execution interface (aiagent/skill/embedded/builtin/n9e-create-alert-rule/SKILL.md)
- Implement a skill handler in Go that satisfies the Skill interface (aiagent/skill.go)
- Register the skill in the skill registry during agent initialization (aiagent/agent.go)
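The handler-plus-registry pattern the steps above describe can be sketched as follows. This is a minimal, self-contained illustration: the actual Skill interface in aiagent/skill.go is not reproduced here, and the Name/Execute method names are assumptions.

```go
package main

import "fmt"

// Skill is an assumed shape of the interface in aiagent/skill.go.
type Skill interface {
	Name() string
	Execute(input map[string]string) (string, error)
}

// createAlertRule stands in for a handler backing the
// n9e-create-alert-rule SKILL.md contract.
type createAlertRule struct{}

func (createAlertRule) Name() string { return "n9e-create-alert-rule" }

func (createAlertRule) Execute(input map[string]string) (string, error) {
	// A real handler would validate input against the SKILL.md schema
	// and call the Nightingale API.
	return fmt.Sprintf("created rule %q", input["name"]), nil
}

// registry is populated during agent initialization (cf. aiagent/agent.go).
var registry = map[string]Skill{}

func register(s Skill) { registry[s.Name()] = s }

func main() {
	register(createAlertRule{})
	out, _ := registry["n9e-create-alert-rule"].Execute(map[string]string{"name": "cpu-high"})
	fmt.Println(out) // created rule "cpu-high"
}
```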
Add a New LLM Provider
- Create a new provider file (e.g., aiagent/llm/newprovider.go) implementing the LLMClient interface (aiagent/llm/llm.go)
- Implement the required methods — CreateChatCompletion, ParseResponse, and error handling (see aiagent/llm/openai.go for reference)
- Add HTTP retry logic using the existing helper (aiagent/llm/http_retry.go)
- Register the provider in the LLM factory or configuration (aiagent/llmconfig/probe.go)
Add a New Execution Strategy (Plan/ReAct Variant)
- Create a new strategy file (e.g., aiagent/tree_of_thought.go) implementing the Agent interface (aiagent/agent.go)
- Define the strategy's system prompt in aiagent/prompts/{strategy}_system.md (see aiagent/prompts/plan_system.md)
- Implement the execution loop in your strategy file, using the LLM caller pattern from aiagent/llm_caller.go
- Wire the strategy into the agent factory or chat router (aiagent/chat/request.go)
Connect an External Tool via MCP
- Define the MCP server endpoint in aiagent/mcp/manager.go
- Implement the MCP client connection logic using aiagent/mcp/client.go
- Create a wrapper skill that bridges the MCP tool to the agent (aiagent/skill.go)
- Register the MCP-backed skill during agent initialization (aiagent/agent.go)
🔧Why these technologies
- Go — High concurrency for handling multiple agent tasks, fast startup, single binary deployment suitable for cloud-native monitoring systems
- LLM APIs (OpenAI, Claude, Gemini) — Access to state-of-the-art reasoning capabilities for intelligent alert rule generation, incident analysis, and query understanding
- Model Context Protocol (MCP) — Standardized tool integration protocol enabling agents to interface with external services (databases, APIs) without tightly coupling skill logic
- Redis — Fast caching layer for conversation context, task persistence in A2A workflows, and distributed state management
🪤Traps & gotchas
LLM provider credentials (Claude, OpenAI, Gemini API keys) must be configured in environment or config files—no defaults provided. Redis must be running for task storage (aiagent/a2a/taskstore/redis_store.go relies on it). MCP Server requires specific endpoint setup; see aiagent/mcp/manager.go. Retry logic in http_retry.go may mask transient failures—understand backoff behavior before deploying at scale. No clear default data source connection examples in file list; check full README for supported backends (VictoriaMetrics, Elasticsearch, etc.).
🏗️Architecture
💡Concepts to learn
- Model Context Protocol (MCP) — Nightingale's new aiagent/ uses MCP to allow LLMs (Claude, etc.) to dynamically invoke alerting operations; understanding MCP is critical for extending AI-driven features
- Alert Rule Engine & Routing — The core of Nightingale—how time-series queries trigger alerts and distribute notifications; essential to understand for any customization
- Server-Sent Events (SSE) — aiagent/mcp/sse.go indicates SSE-based streaming for MCP client communication; relevant for real-time alert subscriptions
- Redis-backed Task Queue / Store — aiagent/a2a/taskstore/redis_store.go persists async alert processing tasks; understanding this pattern prevents data loss in distributed scenarios
- HTTP Retry & Exponential Backoff — aiagent/llm/http_retry.go implements resilience for LLM API calls; critical for production reliability when external APIs are flaky
- LLM Provider Abstraction & Adapter Pattern — aiagent/llm/ uses adapter pattern to support Claude, OpenAI, Gemini interchangeably; understanding this pattern helps add new providers
- Intent & NLP for Alert Queries — aiagent/chat/intent.go extracts user intent from natural language alert requests; foundational for the AI-driven alert management UX
🔗Related repos
- flashcatcloud/categraf — Official metric collector companion to Nightingale; recommended for scraping metrics before ingestion into alerting rules
- ccfos/nightingale-web — Frontend UI for Nightingale alerting and rule management (likely a separate repo hosting the React/dashboard code)
- n9e/n9e-mcp-server — Standalone MCP server adaptor for Nightingale, enabling LLM assistants to query alerts and manage rules via natural language
- grafana/grafana — The visualization-focused alternative that Nightingale complements; many users run both together for metrics + alerts + dashboards
- prometheus/alertmanager — Similar alert routing and grouping tool in the Prometheus ecosystem; a competitive/complementary approach to alert processing
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for MCP (Model Context Protocol) client implementations
The aiagent/mcp directory contains critical infrastructure for managing AI tool integrations (client.go, jsonrpc.go, stdio.go, sse.go) but notably lacks test files. Given the complexity of protocol handling and the importance of reliable tool execution in an alerting system, this is a high-impact area. The existing executor_test.go and redis_store_test.go demonstrate the testing culture; extending this to MCP would prevent regressions in agent-to-tool communication.
- [ ] Create aiagent/mcp/client_test.go with tests for MCP client initialization and lifecycle
- [ ] Create aiagent/mcp/jsonrpc_test.go with tests for JSON-RPC message serialization/deserialization edge cases
- [ ] Create aiagent/mcp/stdio_test.go and aiagent/mcp/sse_test.go for transport layer integration tests
- [ ] Add mock server fixtures for testing bidirectional communication patterns
- [ ] Reference existing test patterns from aiagent/a2a/executor_test.go for consistency
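A round-trip test in the style the checklist asks for might begin like this. The rpcRequest struct is an illustrative stand-in for the real types in aiagent/mcp (jsonrpc.go / types.go), which may differ:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a minimal JSON-RPC 2.0 request shape for illustration.
type rpcRequest struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      int             `json:"id"`
	Method  string          `json:"method"`
	Params  json.RawMessage `json:"params,omitempty"`
}

// roundTrip serializes a request and deserializes it back, the basic
// property a jsonrpc_test.go would assert over many edge cases.
func roundTrip(in rpcRequest) (rpcRequest, error) {
	b, err := json.Marshal(in)
	if err != nil {
		return rpcRequest{}, err
	}
	var out rpcRequest
	err = json.Unmarshal(b, &out)
	return out, err
}

func main() {
	cases := []rpcRequest{
		{JSONRPC: "2.0", ID: 1, Method: "tools/list"},
		{JSONRPC: "2.0", ID: 2, Method: "tools/call", Params: json.RawMessage(`{"name":"describe-rule"}`)},
	}
	for _, c := range cases {
		got, err := roundTrip(c)
		fmt.Println(got.Method, got.ID, err == nil)
	}
}
```

In a real aiagent/mcp/jsonrpc_test.go this would become a table-driven test covering malformed params, missing IDs, and notification messages.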
Add integration tests for LLM provider fallback and retry logic
The aiagent/llm directory contains multiple LLM providers (openai.go, claude.go, gemini.go) alongside client_cache.go and http_retry.go, suggesting sophisticated retry/fallback patterns that are currently untested. For a production alerting system, the reliability of LLM interactions is critical, yet no integration test files exist to verify provider switching, timeout handling, or cache behavior.
- [ ] Create aiagent/llm/provider_failover_test.go to test switching between OpenAI/Claude/Gemini
- [ ] Create aiagent/llm/http_retry_test.go with tests for exponential backoff and transient failure recovery
- [ ] Create aiagent/llm/client_cache_test.go with concurrent access and cache eviction scenarios
- [ ] Add fixture LLM endpoint mocks using httptest package
- [ ] Verify timeout and circuit-breaker patterns match production requirements
Add missing documentation and examples for AI Agent skill creation and MCP tool registration
The repo includes sophisticated skill infrastructure (aiagent/skill/embedded/builtin/n9e-create-alert-mute with SKILL.md template) and MCP manager/probe logic, but there's no consolidated guide for contributors wanting to add new skills or custom MCP tools. The SKILL.md template exists but lacks examples. This is a friction point for extending the agent's capabilities, which is central to Nightingale's value proposition.
- [ ] Create docs/AI_SKILLS_DEVELOPMENT.md with step-by-step guide referencing aiagent/skill structure
- [ ] Add a worked example in docs/examples/custom-skill/ demonstrating creating a new alert skill with MCP registration
- [ ] Document the skill discovery mechanism (aiagent/skill/ directory scanning) and SKILL.md format expectations
- [ ] Add reference documentation for MCP tool schema in docs/MCP_TOOLS.md with examples from aiagent/mcp/types.go
- [ ] Include troubleshooting section for common MCP connection and serialization issues
🌿Good first issues
- Write integration tests for the new MCP server (aiagent/mcp/jsonrpc.go and aiagent/mcp/client.go lack _test.go files; start with basic round-trip JSON-RPC message tests)
- Add missing documentation for the aiagent LLM configuration in README.md—specifically how to wire up Claude vs OpenAI vs Gemini providers with concrete config examples
- Extend the builtin_tools.go toolset: add a 'describe-rule' tool that fetches alerting rule metadata from the main Nightingale API, enabling the AI agent to reason about existing rules
⭐Top contributors
- @710leo — 85 commits
- @pioneerlfn — 4 commits
- @jie210 — 3 commits
- @SenCoder — 3 commits
- @laiwei — 2 commits
📝Recent commits
- 1016ad2 — feat: support agent a2a api (#3147) (710leo)
- d1c5c7b — fix: edge heartbeat update host beattime (710leo)
- 82bb07b — refactor: ai chat (#3145) (710leo)
- b816b9f — fix: resolve masked api_key to real key when testing AI LLM config (710leo)
- 1bd5f36 — fix: skip macOS metadata files when extracting skill archives (710leo)
- 86dc24b — fix: handle multibyte characters correctly in replaceLastEightChars (710leo)
- c9867b3 — fix: allow admin to access private message templates without group check (710leo)
- 87c21ce — feat: increase skill archive max file count from 50 to 100 (710leo)
- 67c3b69 — feat: add service endpoints for creating and updating ai-llm-configs (710leo)
- 6b67b8f — feat: implement QueryMapData for doris datasource to return log data as string maps (710leo)
🔒Security observations
- High · LLM API key exposure risk — aiagent/llm/openai.go, aiagent/llm/claude.go, aiagent/llm/gemini.go. The LLM integrations in aiagent/llm/ require API keys for authentication. If keys are not managed through environment variables or a secure vault, credentials can leak into logs, error messages, or configuration files. Fix: load all API keys from environment variables or a secret manager (e.g., Vault, AWS Secrets Manager); never commit credentials to the repository; rotate keys and monitor for unauthorized API usage.
- High · Redis store without authentication validation — aiagent/a2a/taskstore/redis_store.go. The Redis task store may not enforce authentication or encryption for Redis connections; an unsecured instance exposes sensitive task data and agent state. Fix: require Redis authentication with strong passwords, enable TLS for all connections, use Redis ACLs for role-based access control, and restrict network access to Redis instances.
- High · MCP client communication security — aiagent/mcp/client.go, aiagent/mcp/stdio.go, aiagent/mcp/sse.go, aiagent/mcp/jsonrpc.go. The MCP client's stdio and SSE transports may not validate or sanitize external tool inputs, potentially allowing injection attacks or execution of arbitrary commands. Fix: strictly validate and sanitize all MCP tool inputs, sandbox tool execution, validate all JSON-RPC messages, and add authentication and authorization checks for MCP client connections.
- High · HTTP retry without rate limiting — aiagent/llm/http_retry.go. The retry mechanism may lack rate limiting, potentially enabling retry storms or amplifying load on upstream services. Fix: implement exponential backoff with jitter, enforce maximum retry counts, add circuit breakers and rate limits, and alert on excessive retry attempts.
- Medium · User context information leakage — aiagent/a2a/user_ctx.go, aiagent/chat/request.go. User context handling may store sensitive user information in memory or logs without redaction, exposing it in error messages or debug output. Fix: redact sensitive fields in all logs, never log tokens, API keys, or credentials, use structured logging with field-level redaction, and audit-log sensitive operations.
- Medium · SQL injection risk in database operations — aiagent/skill/dbsync.go. The skill database synchronization and related database operations may be vulnerable to SQL injection if user input or external data is not parameterized. Fix: use parameterized queries and prepared statements for all database operations, validate input against whitelists, and never concatenate user input into SQL strings.
- Medium · Missing CORS and security headers — aiagent/chat/actions.go. The HTTP endpoints exposed by the aiagent and monitoring services likely lack CORS configuration and security headers (CSP, X-Frame-Options, X-Content-Type-Options), enabling XSS and clickjacking. Fix: set CORS policies with explicit allowed origins, add security headers (Content-Security-Policy, X-Frame-Options: DENY, X-Content-Type-Options: nosniff), and protect state-changing operations against CSRF.
- Medium · Insufficient input validation in chat/intent processing — aiagent/chat/intent.go, aiagent/chat/actions.go. Chat intent and action processing may not properly validate user input.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.