Netflix/chaosmonkey

Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.

Mixed

Stale — last commit 1y ago

worst of 4 axes
Use as dependency: Mixed

last commit was 1y ago; no tests detected

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • 25+ active contributors
  • Distributed ownership (top contributor 28% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Stale — last commit 1y ago
  • No test directory detected
What would change the summary?
  • Use as dependency: Mixed → Healthy if there is 1 commit in the last 365 days

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Forkable](https://repopilot.app/api/badge/netflix/chaosmonkey?axis=fork)](https://repopilot.app/r/netflix/chaosmonkey)

Paste at the top of your README.md — renders inline like a shields.io badge.

Onboarding doc

Onboarding: Netflix/chaosmonkey

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Netflix/chaosmonkey shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Stale — last commit 1y ago

  • 25+ active contributors
  • Distributed ownership (top contributor 28% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • ⚠ Stale — last commit 1y ago
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live Netflix/chaosmonkey repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/Netflix/chaosmonkey.

What it runs against: a local clone of Netflix/chaosmonkey — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Netflix/chaosmonkey | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | Last commit ≤ 518 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>Netflix/chaosmonkey</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Netflix/chaosmonkey. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/Netflix/chaosmonkey.git
#   cd chaosmonkey
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of Netflix/chaosmonkey and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "Netflix/chaosmonkey(\.git)?\b" \
  && ok "origin remote is Netflix/chaosmonkey" \
  || miss "origin remote is not Netflix/chaosmonkey (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
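# The standard Apache-2.0 LICENSE text opens with "Apache License" / "Version 2.0", not the SPDX id.
(grep -qiE "^(Apache-2\.0)" LICENSE 2>/dev/null \
   || { grep -qi "Apache License" LICENSE 2>/dev/null \
        && grep -qiE "Version 2\.0" LICENSE 2>/dev/null; } \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"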

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 518 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~488d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Netflix/chaosmonkey"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>
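
Because each check prints a line starting with ok: or FAIL:, an agent can also inspect the captured output rather than relying only on the exit code. The following is a minimal Go sketch of that idea; the function name countFailures and the sample output are illustrative assumptions, not part of RepoPilot or Chaos Monkey.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countFailures scans captured verification output line by line and
// counts the lines that start with the "FAIL:" prefix.
func countFailures(output string) int {
	failures := 0
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "FAIL:") {
			failures++
		}
	}
	return failures
}

func main() {
	// Hypothetical captured output from one run of the script.
	sample := "ok:   origin remote is Netflix/chaosmonkey\n" +
		"FAIL: default branch master no longer exists\n" +
		"ok:   license is Apache-2.0\n"
	fmt.Println("failures:", countFailures(sample))
}
```

A caller would treat any non-zero count the same way as a non-zero exit status: stop and regenerate the artifact.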

TL;DR

Chaos Monkey is a resiliency testing tool written in Go that randomly terminates VM instances and containers in production environments to verify application fault tolerance. It integrates tightly with Spinnaker (Netflix's continuous delivery platform) and works across AWS, GCE, Azure, Kubernetes, and Cloud Foundry to inject controlled failures that expose engineering teams to realistic outages.

The codebase has a flat, modular structure:

  • cmd/chaosmonkey/main.go is the CLI entry point.
  • command/ handles subcommands (install, schedule, terminate, fetchschedule, eligible).
  • config/ manages configuration parsing, with config/monkey.go for app-specific behavior.
  • deploy/ handles ASG/app abstractions (asg.go, deployment.go, eligible_instance_groups.go).
  • Utility packages (cal/ for calendar logic, clock/ for time, constrainer/, decryptor/) handle cross-cutting concerns.

Database migrations are SQL files managed via sql-migrate.

👥Who it's for

DevOps engineers and SREs at organizations using Spinnaker who need to validate that their microservices gracefully handle instance failures; platform teams building resilient infrastructure on Spinnaker-managed clouds.

🌱Maturity & risk

Production-ready, but in maintenance mode rather than under active development: the last commit was about a year ago. The codebase shows solid structure (comprehensive command package with install/schedule/terminate workflows), tests in key modules (cal_test.go, config_test.go, deploy_test.go), CI via Travis, and integration tests (schedule_int_test.go). The Go version constraint (1.19) and dated dependencies (crypto from 2016) suggest the project is stable but not cutting-edge.

Standard open source risks apply.

Active areas of work

No specific PR or milestone data is visible in the provided file list. The stable versioning (v2 in go.mod) and mature dependency lock suggest the project is in maintenance mode rather than active feature development. The presence of docs/ (Configuration-file-format.md, How-to-deploy.md, Running-locally.md) indicates a focus on usability and operational guidance.

🚀Get running

git clone https://github.com/Netflix/chaosmonkey.git
cd chaosmonkey
make

Daily commands:

make
./cmd/chaosmonkey/chaosmonkey --help
# Or run specific subcommands (arguments are placeholders; confirm the exact usage with --help):
./cmd/chaosmonkey/chaosmonkey schedule
./cmd/chaosmonkey/chaosmonkey terminate <app> <account>

🗺️Map of the codebase

  • command/chaosmonkey.go: Main command dispatcher that orchestrates the chaos killing workflow via Spinnaker integration
  • command/schedule.go: Core scheduling logic that determines which instances are eligible for termination based on calendar and config constraints
  • deploy/deployment.go: Abstract interface that decouples the chaos logic from cloud provider specifics (AWS ASG, Kubernetes, etc.)
  • config/monkey.go: Application-level configuration that defines per-app termination policies, blackout windows, and eligibility rules
  • cal/cal.go: Calendar-based eligibility logic that decides whether a given day is a working day, so terminations are skipped on weekends and holidays
  • chaosmonkey.go: Core orchestration logic that ties deployment queries, scheduling, and termination together
  • command/install.go: Handles initial Chaos Monkey setup and database schema migration via sql-migrate
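
The decoupling described for deploy/deployment.go can be pictured with a small interface. The sketch below is illustrative only: the Instance type, the Deployment interface, and the staticDeployment stand-in are hypothetical names chosen for this example and are not the definitions found in the repository.

```go
package main

import "fmt"

// Instance is a hypothetical, provider-neutral view of a running instance.
type Instance struct {
	ID      string
	Cluster string
}

// Deployment is a hypothetical interface for this sketch; it is not the
// interface declared in deploy/deployment.go.
type Deployment interface {
	Instances(app string) ([]Instance, error)
}

// staticDeployment is an in-memory stand-in for a real backend.
type staticDeployment map[string][]Instance

func (d staticDeployment) Instances(app string) ([]Instance, error) {
	instances, ok := d[app]
	if !ok {
		return nil, fmt.Errorf("unknown app %q", app)
	}
	return instances, nil
}

// candidateIDs depends only on the interface, so the same logic works
// with any backend that implements it.
func candidateIDs(d Deployment, app string) ([]string, error) {
	instances, err := d.Instances(app)
	if err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(instances))
	for _, inst := range instances {
		ids = append(ids, inst.ID)
	}
	return ids, nil
}

func main() {
	d := staticDeployment{
		"myapp": {{ID: "i-1", Cluster: "myapp-prod"}, {ID: "i-2", Cluster: "myapp-prod"}},
	}
	ids, err := candidateIDs(d, "myapp")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ids)
}
```

The same in-memory stand-in is also a convenient way to exercise scheduling logic in tests without a live backend.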

🛠️How to make changes

  • New termination strategies: extend the interfaces in deploy/deployment.go and deploy/asg.go.
  • New scheduling logic: modify command/schedule.go and cal/cal.go (calendar-based eligibility).
  • Configuration options: add fields to config/monkey.go and config/config.go.
  • CLI commands: add new files in command/ and wire them through command/chaosmonkey.go.
  • Tests: follow the pattern in the *_test.go and *_int_test.go files in the command/ and config/ directories.

🪤Traps & gotchas

  • Spinnaker dependency: Chaos Monkey has no built-in way to talk directly to cloud providers. It relies entirely on Spinnaker's Gate API for instance enumeration and termination, so if Spinnaker is down or misconfigured, Chaos Monkey fails silently.
  • Database migration: the install command (command/install.go) runs SQL migrations. If you skip this step, the schedule and terminate commands will panic on missing tables.
  • Calendar logic (cal/cal.go): eligibility depends on parsing local time zones and custom calendar files, so misconfiguration can cause unexpected blackout periods.
  • Config hot-reload: there is no indication that Viper config hot-reloading is implemented; changes require a restart.
  • Decryptor package: decryptor/decryptor.go hints at encrypted secrets in config, but no documentation on key management is visible.
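
The time-zone trap is easy to reproduce in isolation. The following is a minimal sketch, not the logic in cal/cal.go: the function inWindow, the hour bounds, and the fixed UTC-8 zone are assumptions made for illustration. It shows the same instant landing inside a weekday window in one zone and outside it in another.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only: a fixed UTC-8 zone and one sample instant.
var (
	fixedPacific = time.FixedZone("UTC-8", -8*60*60)
	// 2024-01-06 00:30 UTC is a Saturday in UTC but Friday 16:30 in UTC-8.
	sample = time.Date(2024, time.January, 6, 0, 30, 0, 0, time.UTC)
)

// inWindow reports whether t, viewed in loc, falls on a weekday between
// startHour (inclusive) and endHour (exclusive).
func inWindow(t time.Time, loc *time.Location, startHour, endHour int) bool {
	local := t.In(loc)
	switch local.Weekday() {
	case time.Saturday, time.Sunday:
		return false
	}
	hour := local.Hour()
	return hour >= startHour && hour < endHour
}

func main() {
	fmt.Println("UTC-8:", inWindow(sample, fixedPacific, 9, 17)) // true
	fmt.Println("UTC:  ", inWindow(sample, time.UTC, 9, 17))     // false
}
```

Whether such a window is treated as an allowed period or a blackout is a policy decision for the caller; the point of the sketch is only that the answer changes with the configured zone.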

💡Concepts to learn

  • Chaos Engineering Principles — Chaos Monkey implements the formal Principles of Chaos Engineering; understanding steady-state metrics, blast radius, and controlled failure injection is essential to using this tool safely in production
  • Calendar-based Eligibility Windows — Chaos Monkey uses cal/cal.go to keep terminations within working days (skipping weekends and holidays); understanding how calendars gate termination prevents accidental outages during critical times
  • Spinnaker Gate API Integration — Chaos Monkey has no direct cloud provider integration; it routes all instance operations through Spinnaker's Gate REST API, so understanding Spinnaker's deployment abstractions (Applications, Clusters, Server Groups) is mandatory
  • Multi-cloud Abstraction via Deployment Interfaces — The deploy/deployment.go interface abstracts cloud-provider-specific logic (ASGs, Kubernetes pods, Cloud Foundry apps), allowing Chaos Monkey to work across AWS, GCE, Azure, and Kubernetes without code duplication
  • Database-backed Scheduling — Chaos Monkey persists scheduling state in a MySQL database; understanding the sql-migrate schemas and query patterns is needed to debug scheduling failures or customize eligibility logic
  • Declarative Configuration via Viper — All termination policies, blackout windows, and per-app settings are defined in YAML/TOML config files parsed by Viper; understanding Viper's config merging (defaults, env vars, files) is critical for deployment
  • Frigga AWS Naming Convention — Chaos Monkey uses frigga-go to parse AWS ASG/instance names (e.g., 'app-stack-detail-v001') to determine ownership and routing; misunderstanding Frigga naming can cause wrong instances to be targeted

🔗Related repos

  • Netflix/SimianArmy — The retired Java-based chaos engineering toolkit that housed the original Chaos Monkey and shaped this version's design and principles
  • chaostoolkit/chaostoolkit — Alternative, Python-based chaos engineering framework that works across cloud providers without a Spinnaker dependency; a complementary approach to the same problem
  • spinnaker/spinnaker — The continuous delivery platform that Chaos Monkey requires for instance enumeration and termination; tight integration means you must understand Spinnaker's Gate API
  • Netflix/titus — Netflix's container orchestration platform; Chaos Monkey can target Titus-managed containers, making this a relevant deployment backend
  • hashicorp/consul — Service discovery platform that can integrate with Spinnaker for instance metadata; useful for understanding how Chaos Monkey discovers instances in complex deployments
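
To make the Frigga naming convention concrete, here is a deliberately simplified Go sketch that splits a name such as 'app-stack-detail-v001' into its parts. It does not use frigga-go and does not reproduce its full rules; the ClusterName type and the parseName and isSequence functions are hypothetical names for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ClusterName holds the parts of a name like "app-stack-detail-v001".
type ClusterName struct {
	App, Stack, Detail, Sequence string
}

// isSequence reports whether s looks like a push sequence: "v" plus digits.
func isSequence(s string) bool {
	if len(s) < 2 || s[0] != 'v' {
		return false
	}
	for _, r := range s[1:] {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// parseName is a simplified parser: the first token is the app, the second
// the stack, a trailing vNNN token the sequence, and the rest the detail.
func parseName(name string) ClusterName {
	parts := strings.Split(name, "-")
	var cn ClusterName
	if n := len(parts); n > 1 && isSequence(parts[n-1]) {
		cn.Sequence = parts[n-1]
		parts = parts[:n-1]
	}
	cn.App = parts[0]
	if len(parts) > 1 {
		cn.Stack = parts[1]
	}
	if len(parts) > 2 {
		cn.Detail = strings.Join(parts[2:], "-")
	}
	return cn
}

func main() {
	fmt.Printf("%+v\n", parseName("app-stack-detail-v001"))
}
```

Even in this reduced form, it is clear why a name that does not follow the convention can be attributed to the wrong app or stack.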

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for the schedule command workflow

The repo has command/schedule_test.go and command/schedule_int_test.go, but the integration test file is sparse. Given that scheduling is core to Chaos Monkey's operation (determining when to terminate instances), adding comprehensive integration tests covering the full schedule fetch→eligibility check→termination flow would improve reliability. This directly supports the 'Termination-behavior.md' and scheduling logic in command/schedule.go.

  • [ ] Review existing command/schedule_int_test.go and identify missing scenarios (e.g., timezone edge cases, daylight saving transitions, concurrent scheduling)
  • [ ] Add test cases in command/schedule_int_test.go for calendar parsing (cal/cal_test.go) integration with schedule execution
  • [ ] Add mocking for deploy/deployment.go to simulate ASG eligibility during scheduling tests
  • [ ] Document expected behavior in docs/dev/Running-tests.md for running integration tests locally

Add missing unit tests for config/param/param.go parameter validation

The file structure shows config/param/param.go exists but there is no config/param/param_test.go. Given the importance of configuration in Chaos Monkey (docs show Configuration-file-format.md exists), parameter validation logic should have dedicated unit tests covering edge cases like invalid parameter types, missing required fields, and boundary conditions.

  • [ ] Create config/param/param_test.go with unit tests for parameter validation logic
  • [ ] Test parameter type coercion and error handling (building on config/config_test.go patterns)
  • [ ] Add tests for default parameter values and override behavior
  • [ ] Ensure tests cover integration with config/monkey.go config loading

Add GitHub Actions CI workflow to replace deprecated Travis CI

The repo still uses .travis.yml (visible in file structure), and the README badges reference Travis CI. The travis-ci.org service that hosted open-source builds has been shut down, so this setup is outdated. Migrating to GitHub Actions would modernize the CI/CD, improve reliability, and better integrate with the GitHub ecosystem. It would also lower the barrier for new contributors who expect standard GitHub CI.

  • [ ] Create .github/workflows/ci.yml with Go 1.19+ build and test jobs (matching go.mod requirement)
  • [ ] Include test coverage job running: go test ./... and go test -race ./...
  • [ ] Add build job for cmd/chaosmonkey/main.go for multiple platforms (linux, darwin)
  • [ ] Update README.md badges to reference GitHub Actions instead of Travis CI
  • [ ] Remove or archive .travis.yml with a comment explaining deprecation

🌿Good first issues

  • Add missing unit tests for deploy/eligible_instance_groups.go—the file exists but there are no tests for the logic that determines which instance groups are eligible for termination; write tests covering ASG filtering by account, region, and app name.
  • Expand command/terminate.go with dry-run mode—add a --dry-run flag that logs which instances would be terminated without actually calling Spinnaker, making it safer for operational testing.
  • Document the cal/cal.go calendar format and blackout window rules—add examples to docs/Configuration-file-format.md showing how to configure holiday calendars and business-hour blackout periods, since the current docs don't explain this critical feature.
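
One way to approach the dry-run idea is to hide termination behind an interface and swap in a logging implementation when the flag is set. The sketch below is an assumption-laden illustration: the Terminator interface, the newTerminator constructor, and both implementations are hypothetical and do not reflect the existing code in command/terminate.go.

```go
package main

import "fmt"

// Terminator is a hypothetical interface for this sketch.
type Terminator interface {
	Terminate(instanceID string) error
}

// recordingTerminator stands in for a real backend call; it only records IDs.
type recordingTerminator struct {
	terminated []string
}

func (r *recordingTerminator) Terminate(instanceID string) error {
	r.terminated = append(r.terminated, instanceID)
	return nil
}

// dryRunTerminator reports what would happen without touching the backend.
type dryRunTerminator struct {
	logf func(msg string)
}

func (d dryRunTerminator) Terminate(instanceID string) error {
	d.logf(fmt.Sprintf("dry-run: would terminate %s", instanceID))
	return nil
}

// newTerminator returns the dry-run variant when dryRun is true,
// otherwise the supplied real implementation.
func newTerminator(dryRun bool, real Terminator, logf func(msg string)) Terminator {
	if dryRun {
		return dryRunTerminator{logf: logf}
	}
	return real
}

func main() {
	real := &recordingTerminator{}
	t := newTerminator(true, real, func(msg string) { fmt.Println(msg) })
	if err := t.Terminate("i-123"); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("real terminations:", len(real.terminated))
}
```

Keeping the decision in one constructor means the rest of the termination flow does not need to know whether it is running in dry-run mode.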

📝Recent commits

  • eaa28fb — fix(build): Fix build in Travis CI (#101) (SihangYu)
  • 5ca6740 — Merge pull request #99 from SihangYu/sihang-mysql8 (tedpennings)
  • 4c5b73f — fix(mysql): Update to use mysql 8.0 in docs and tests (SihangYu)
  • a8ac412 — fix(mysql): Replace the deprecated system variable tx_isolation in MySQL 8.0 (SihangYu)
  • a76ce32 — Merge pull request #91 from KhurramJalil/patch-1 (robzienert)
  • 4e9de3d — Update go.mod (KhurramJalil)
  • 4f4812e — Add .idea/ to gitignore (ttomsu)
  • 5749420 — Add /v2 to package to comply with golang's semantic import versioning (ttomsu)
  • 3f8c23f — Merge pull request #66 from imsamuel/patch-1 (anotherchrisberry)
  • 32cec46 — Merge pull request #78 from ttomsu/gomod (anotherchrisberry)

🔒Security observations

  • Critical · Extremely Outdated Go Dependencies with Known Vulnerabilities — go.mod - All direct and indirect dependencies. The project uses Go dependencies from 2016 (6-8 years old) with multiple known security vulnerabilities. golang.org/x/crypto v0.0.0-20160922170629 contains known cryptographic weaknesses. github.com/go-sql-driver/mysql v1.2.1-0.20160802113842c is vulnerable to SQL injection and connection string parsing attacks (CVE-2016-3109, CVE-2016-5363). github.com/spf13/viper v0.0.0-20160926150402 has unsafe YAML parsing vulnerabilities. Fix: Immediately update all dependencies to their latest stable versions. Run 'go get -u ./...' and address any breaking changes. Minimum versions: golang.org/x/crypto v0.17.0+, go-sql-driver/mysql v1.7.0+, spf13/viper v1.18.0+
  • High · Potential Encrypted Credentials File Detected — docKey.enc. The file 'docKey.enc' appears to be an encrypted key or credentials file. If this is committed to the repository (even encrypted), it represents a potential credential exposure risk if the encryption is weak or the key is compromised. Fix: Remove all encrypted key files from version control. Store encryption keys and credentials in secure vaults (AWS Secrets Manager, HashiCorp Vault, etc.). If already committed, rotate all credentials and remove from git history using git-filter-branch or BFG Repo-Cleaner.
  • High · SQL Injection Risk via sql-migrate and Raw SQL — migration/mysql/, command/migrate.go, config/config.go. The migration system (github.com/rubenv/sql-migrate v0.0.0-20160620083229) combined with raw SQL migration files (migration/mysql/1.0.0_initial_schema.sql) presents SQL injection risks. The outdated dependency lacks modern parameterization safeguards. Additionally, if user input flows into database queries without proper prepared statements, injection is possible. Fix: Upgrade sql-migrate to the latest version. Use parameterized queries throughout the codebase. Audit all database interaction code in command/migrate.go and config/ for unsanitized user inputs. Use prepared statements for all dynamic queries.
  • High · Missing Input Validation and XSS Prevention — command/, config/param/, config/config.go. The command package processes configuration and user inputs without apparent comprehensive validation. The config/param/param.go module may accept unsanitized input. If these values are logged, displayed, or used in command construction, XSS and command injection attacks are possible. Fix: Implement strict input validation for all user-provided configuration. Sanitize all inputs before use in system commands, logs, or output. Use whitelisting approaches for configuration values. Add input type validation in config/param/param.go.
  • High · Weak Cryptography - Outdated crypto/aes Usage — decryptor/decryptor.go, golang.org/x/crypto v0.0.0-20160922170629. The golang.org/x/crypto dependency is from 2016 and lacks modern cryptographic best practices. The decryptor package (decryptor/decryptor.go) likely uses weak encryption algorithms or incorrect modes that are vulnerable to attacks. Fix: Update to golang.org/x/crypto v0.17.0 or later. Review and upgrade encryption implementation to use authenticated encryption (AES-GCM). Replace ECB mode (if used) with CBC or GCM. Use crypto/rand for key generation.
  • Medium · Potential Command Injection in Shell Commands — command/osutil.go, command/chaosmonkey.go, command/terminate.go. Multiple command files (command/osutil.go, command/chaosmonkey.go, command/terminate.go) likely execute system commands or shell operations. Without proper escaping and validation, user-controlled data could lead to command injection vulnerabilities. Fix: Use exec.Command() with an explicit argument list instead of building a command string for a shell (for example sh -c). Validate any user-supplied value before passing it as a command argument. Use filepath.Join() for file paths. Validate all inputs against strict allowlists.

LLM-derived; treat as a starting point, not a security audit.
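
The command-injection advice above can be illustrated with the standard library alone. In this minimal sketch, the buildCommand helper and the echo program are assumptions for demonstration; the point is that os/exec passes each argument to the program as a separate element and does not invoke a shell, so shell metacharacters in the input stay literal.

```go
package main

import (
	"fmt"
	"os/exec"
)

// buildCommand passes untrusted input as a single argument. No shell is
// involved, so characters such as ';' are not interpreted as separators.
func buildCommand(untrusted string) *exec.Cmd {
	return exec.Command("echo", untrusted)
}

func main() {
	cmd := buildCommand("hello; echo injected")
	// The command is only constructed here, never run; Args shows how the
	// input would be delivered to the program.
	fmt.Printf("%q\n", cmd.Args)
}
```

Argument separation removes shell interpretation but not the need for validation: a value can still be a harmful argument to the target program.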


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.