RepoPilot

chaosblade-io/chaosblade

An easy-to-use and powerful chaos engineering experiment toolkit (a simple, powerful chaos-experiment injection tool open-sourced by Alibaba).

Healthy

Healthy across the board.

Use as dependency — Healthy (weakest axis)

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 7w ago
  • 18 active contributors
  • Distributed ownership (top contributor 34% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — the badge updates automatically from the latest cached analysis.

Markdown variant (renders as "RepoPilot: Healthy"):
[![RepoPilot: Healthy](https://repopilot.app/api/badge/chaosblade-io/chaosblade)](https://repopilot.app/r/chaosblade-io/chaosblade)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/chaosblade-io/chaosblade on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: chaosblade-io/chaosblade

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/chaosblade-io/chaosblade shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 7w ago
  • 18 active contributors
  • Distributed ownership (top contributor 34% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live chaosblade-io/chaosblade repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/chaosblade-io/chaosblade.

What it runs against: a local clone of chaosblade-io/chaosblade — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in chaosblade-io/chaosblade | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches a relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 79 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>chaosblade-io/chaosblade</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of chaosblade-io/chaosblade. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/chaosblade-io/chaosblade.git
#   cd chaosblade
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of chaosblade-io/chaosblade and re-run."
  exit 2
fi

# 1. Repo identity
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "chaosblade-io/chaosblade(\.git)?\b" \
  && ok "origin remote is chaosblade-io/chaosblade" \
  || miss "origin remote is not chaosblade-io/chaosblade (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. LICENSE files usually spell out
#    "Apache License" rather than the SPDX id, so match both forms.
(grep -qiE "Apache License|Apache-2\.0" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in cli/main.go cli/cmd/cli.go exec/os/executor.go \
         exec/jvm/executor.go exec/kubernetes/executor.go; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 79 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~49d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/chaosblade-io/chaosblade"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

ChaosBlade is Alibaba's chaos engineering toolkit: it injects controlled failures (CPU, memory, network, disk, and process faults) into distributed systems at scale. It supports OS-level experiments, Java/JVM injection, C++ code instrumentation, Docker containers, and Kubernetes workloads through a unified CLI, letting teams test system resilience without hand-rolled fault simulation. The repo is a monorepo: cli/cmd/ holds the command-line interface (create, destroy, prepare, and check commands in separate .go files), build/ holds Docker image definitions and build specifications (build/spec/spec.go), and the external exec-* modules are pulled in as dependencies. The CLI layer (cli/cmd/cli.go, command.go) orchestrates the experiment lifecycle and delegates the actual chaos injection to specialized executor packages for OS, containers, middleware, and cloud platforms.

👥Who it's for

SREs, platform engineers, and DevOps teams running production Kubernetes clusters or distributed systems who need to validate fault tolerance and disaster recovery procedures. Contributors are primarily chaos engineering practitioners and Alibaba Group maintainers extending experiment scenarios.

🌱Maturity & risk

Production-ready and actively maintained. The project shows significant maturity: CI/CD pipelines (GitHub Actions in .github/workflows/), Dockerfile builds for multiple architectures (ARM, musl, UPX), structured code organization (cli/, exec modules), and Makefile-driven builds. The v1.8.0 module version pinned consistently across dependent packages suggests coordinated, ongoing releases.

Moderate risk: the toolkit depends on 16+ external chaosblade-io organization packages (chaosblade-exec-os, chaosblade-exec-cloud, chaosblade-operator, etc.) that live outside this repo, coupling it to external release cycles. The old pinned Kubernetes client (k8s.io/client-go v12.0.0+incompatible) and the Go 1.25 requirement may introduce compatibility issues. A single point of failure in the exec-* package ecosystem could affect multiple failure-injection types.

Active areas of work

Version 1.8.0 is current across all dependent packages. Active maintenance is evidenced by structured CI/CD (ci.yml, release.yml workflows), linting rules (.golangci.yml, .markdownlint.json, .yamllint.yml), and documentation governance (CODE_OF_CONDUCT.md, CONTRIBUTING.md, MAINTAINERS.md). The repo includes build targets for multiple architectures (ARM, musl libc variants) suggesting active deployment across heterogeneous environments.

🚀Get running

git clone https://github.com/chaosblade-io/chaosblade.git
cd chaosblade
make

The Makefile orchestrates the Go build system. Use make build for local binary, make docker-build for containerized distribution, or make install to place the blade binary in your PATH.

Daily commands:

make
# Binary at ./blade or installed via 'make install'
./blade -h
./blade prepare os  # Prepare environment
./blade create os process kill --process mysql  # Inject fault
./blade destroy <uid>  # Clean up

For Kubernetes: chaosblade-operator must be deployed separately; interact via CRDs or blade CLI pointing to kubeconfig.

🗺️Map of the codebase

  • cli/main.go — Entry point for the ChaosBlade CLI tool; all command-line invocations start here.
  • cli/cmd/cli.go — Core CLI framework that parses and routes all experiment commands (create, destroy, query, etc.).
  • exec/os/executor.go — OS-level executor that injects chaos experiments at the system level; fundamental to most Linux-based attacks.
  • exec/jvm/executor.go — JVM executor for Java application chaos injection; handles sandbox integration and class instrumentation.
  • exec/kubernetes/executor.go — Kubernetes executor for container orchestration chaos; integrates with K8s API and CRI.
  • data/experiment.go — Data model for chaos experiments; defines the core structure used throughout the system.
  • go.mod — Module dependencies including exec-cloud, exec-cri, and other critical executor plugins.

🛠️How to make changes

Add a New OS-Level Chaos Experiment

  1. Define the experiment model and flags in a new file under a subdirectory matching the experiment type (e.g., exec/os/kill_process.go for process killing). (exec/os/executor.go)
  2. Implement the Executor interface (Start, Stop, Destroy methods) in your new experiment file. (exec/os/executor.go)
  3. Register the new experiment type in the OS executor's factory method so the CLI can invoke it. (exec/os/executor.go)
  4. Add CLI command handler (e.g., cli/cmd/kill_process.go) that parses flags and calls the executor. (cli/cmd/create.go)
  5. Update cli/cmd/cli.go to wire the new command into the root command hierarchy. (cli/cmd/cli.go)

Add a New Kubernetes Chaos Target

  1. Define the Kubernetes resource spec (pod, node, network policy) in a new file or extend exec/kubernetes/spec.go. (exec/kubernetes/spec.go)
  2. Implement a targeted executor or extend exec/kubernetes/executor.go to handle the new resource type. (exec/kubernetes/executor.go)
  3. Add resource discovery/query logic in cli/cmd/query_k8s.go to expose available targets. (cli/cmd/query_k8s.go)
  4. Create a CLI handler (e.g., cli/cmd/k8s_pod_kill.go) that orchestrates the experiment creation. (cli/cmd/create.go)
  5. Register the new command in cli/cmd/cli.go under the Kubernetes experiment group. (cli/cmd/cli.go)

Add a New JVM Chaos Experiment

  1. Define the bytecode manipulation rules and sandbox configuration in a new file (e.g., exec/jvm/memory_fill.go). (exec/jvm/executor.go)
  2. Integrate with exec/jvm/sandbox.go to generate the correct sandbox configuration for bytecode weaving. (exec/jvm/sandbox.go)
  3. Implement the JVM executor's Start and Stop methods to manage the sandbox agent lifecycle. (exec/jvm/executor.go)
  4. Create a CLI command handler in cli/cmd that parses JVM-specific flags (e.g., pid, class filters). (cli/cmd/create.go)
  5. Wire the command into cli/cmd/cli.go and ensure prepare_jvm.go prerequisites are triggered if needed. (cli/cmd/cli.go)

Add a New Experiment Query Type

  1. Create a new file in cli/cmd/ (e.g., query_memory.go) that discovers available memory-related resources. (cli/cmd/query.go)
  2. Implement the query command handler to fetch and format resource information (processes, services, configs, etc.). (cli/cmd/query.go)
  3. Register the new query command in cli/cmd/cli.go so it appears in the query subcommand group. (cli/cmd/cli.go)

🔧Why these technologies

  • Go/Golang — Cross-platform, statically compiled binary; minimal runtime dependencies; excellent for low-level system interaction (syscalls, networking) and container orchestration APIs.
  • Docker/OCI containers — Enables reproducible distributed chaos experiments and simplifies dependency management for different target environments (musl, ARM, etc.).
  • Kubernetes API client — Native K8s integration for orchestrating chaos at pod, node, and network levels; enables dynamic discovery of targets.
  • JVM Sandbox/bytecode instrumentation — Allows non-invasive injection of chaos into running Java applications without redeployment; operates at class/method instrumentation level.
  • CLI framework (Cobra-like) — Standard for Go CLI tools; hierarchical command structure matches the natural nesting of experiment types (blade create os cpu, blade create jvm memory, etc.).

⚖️Trade-offs already made

  • Single statically-compiled binary instead of an agent-based architecture
    • Why (inferred; the generator left this field blank): a static blade binary runs on any target host with no runtime dependencies or resident agent to install, consistent with the Go rationale above.
    • Consequence (inferred): multi-host and cluster orchestration is pushed outward — for example, Kubernetes scenarios require the separately deployed chaosblade-operator.

🪤Traps & gotchas

  • OS-level fault injection requires elevated privileges (root/sudo); test environments must account for this.
  • Kubernetes integration depends on the external chaosblade-operator being deployed and reachable; local development without a Kubernetes setup will skip cloud-native scenarios.
  • The prepare commands are idempotent but install language-specific tooling; make sure no conflicting package managers are in play.
  • SQLite persistence (glebarez/sqlite) stores experiment state; a failed cleanup leaves orphaned DB records that can interfere with subsequent experiments.
  • Go 1.25 is required; older Go versions will fail module resolution.
  • The -x flags and debug mode need careful handling to avoid logging sensitive payloads.


💡Concepts to learn

  • Chaos Engineering Model — ChaosBlade strictly follows the chaos experimental model (steady-state → hypothesis → failure injection → observation → recovery) to ensure valid fault testing; understanding this framework is core to using the toolkit correctly
  • eBPF (Extended Berkeley Packet Filter) — Visible in dependency cilium/ebpf for low-level system introspection and fault injection without kernel modifications; enables non-disruptive chaos scenarios
  • CRI (Container Runtime Interface) — chaosblade-exec-cri package abstracts over Docker, containerd, and CRI-O via the Kubernetes CRI; essential for cross-container-runtime compatibility
  • Linux cgroups — Underlying mechanism for OS resource chaos (CPU/memory limits, process killing) on Linux; ChaosBlade manipulates cgroup hierarchies for injection
  • Spec-based Declarative Injection — ChaosBlade separates experiment specification (YAML/CLI flags) from execution; the spec is validated before injection, enabling portability and repeatability across environments
  • Instrumentation (Java/C++) — For application-level chaos, ChaosBlade uses bytecode instrumentation (Java via javaagent) and source-level tampering (C++ via compiler hooks) to inject failures in arbitrary methods without code changes
  • State Persistence via SQLite — The toolkit maintains experiment state in a local SQLite DB (glebarez/sqlite) to track active faults, enable cleanup, and prevent duplicate injections; understanding DB schema is critical for debugging
  • chaosblade-io/chaosblade-exec-os — Executor module for OS-level chaos injection (CPU, memory, network, disk, process faults) that chaosblade CLI delegates to
  • chaosblade-io/chaosblade-operator — Kubernetes operator enabling Chaosblade experiments via CRDs on K8s clusters; required for cloud-native chaos engineering
  • chaosblade-io/chaosblade-spec-go — Go library defining chaos experiment specifications and validation rules that all executor modules depend on
  • Gremlin — a commercial, closed-source chaos engineering platform with similar OS/container/cloud abstractions; comparable use cases
  • powerfulseal/powerfulseal — Open-source Kubernetes chaos tool focused on pod and network failures; companion to ChaosBlade for cloud-native testing

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive integration tests for CLI command workflows

The cli/cmd directory has many command implementations (create.go, destroy.go, prepare.go, query.go, etc.) but only a few have corresponding _test.go files (cli_test.go, command_test.go, destroy_test.go, prepare_jvm_test.go, query_disk_test.go). Critical paths like prepare.go, query.go, revoke.go, and server.go lack test coverage. This is valuable for chaos engineering tools where command reliability is safety-critical.

  • [ ] Create cli/cmd/prepare_test.go with tests for prepare.go command flows
  • [ ] Create cli/cmd/query_test.go with tests for query.go command flows and different experiment types
  • [ ] Create cli/cmd/revoke_test.go with tests for revoke.go cleanup operations
  • [ ] Create cli/cmd/server_test.go with tests for server.go lifecycle commands (start/stop/status)
  • [ ] Ensure tests cover both success and failure scenarios for chaos operations

Add e2e test workflow in GitHub Actions for chaos experiment execution

The .github/workflows directory only has ci.yml and release.yml, but lacks end-to-end tests that verify actual chaos experiments execute correctly. Given this is a chaos engineering toolkit with OS, middleware, cloud, and CRI executors, an e2e workflow would validate that the integrated system works across different targets (similar to how query_disk_test.go and query_network_test.go exist but no execution tests).

  • [ ] Create .github/workflows/e2e-test.yml that spins up test environments
  • [ ] Add tests for basic OS-level experiments (process kill, network delay) using chaosblade-exec-os
  • [ ] Add tests for container/CRI experiments using chaosblade-exec-cri
  • [ ] Configure workflow to run on pull requests and specific paths (cli/, build/)
  • [ ] Document test results in the workflow summary

Create comprehensive documentation for adding new experiment types

While BUILD.md and CONTRIBUTING.md exist, there's no specific developer guide for extending ChaosBlade with new experiments. The modular structure (chaosblade-exec-cloud, chaosblade-exec-middleware, chaosblade-exec-os, chaosblade-spec-go) suggests a plugin architecture, but the documentation gap makes it hard for contributors to add support for new target types or experiment scenarios.

  • [ ] Create docs/DEVELOPING_NEW_EXPERIMENTS.md explaining the executor pattern
  • [ ] Document how to implement new experiment specs using chaosblade-spec-go
  • [ ] Add a walkthrough example for creating a simple new experiment (e.g., new process chaos type)
  • [ ] Document how new executors integrate with the CLI layer (cli/cmd/exp.go patterns)
  • [ ] Include references to existing executor implementations as examples (chaosblade-exec-os, chaosblade-exec-middleware)

🌿Good first issues

  • Add integration tests for the check_os.go command covering all system checks (CPU, memory, disk, process); currently only check_java.go and check.go have test files, leaving OS validation untested.
  • Extend BUILD.md and BUILD_ZH.md with architecture-specific build instructions for ARM and musl variants (Dockerfiles exist in build/image/ but cross-compilation docs are minimal).
  • Create examples/ directory with runnable chaos scenarios (e.g., scripts for killing a process, injecting network latency, and recovering) to complement the CLI documentation.

Top contributors

See the live page for the contributor breakdown.

📝Recent commits

  • 3f698b8 — fix: Dockerfile to reduce vulnerabilities (#1272) (xcaspar)
  • 22213f4 — fix: Dockerfile to reduce vulnerabilities (#1269) (xcaspar)
  • dcebc57 — fix: build/image/musl/Dockerfile to reduce vulnerabilities (#1252) (xcaspar)
  • 234726d — build(deps): bump github.com/opencontainers/selinux (#1259) (dependabot[bot])
  • c14bad3 — build(deps): bump golang.org/x/crypto from 0.36.0 to 0.45.0 (#1256) (dependabot[bot])
  • 199f03e — chore: update maintainer role for Spencer Cai to Individual (#1257) (spencercjh)
  • a414816 — build(deps): bump github.com/opencontainers/runc from 1.0.2 to 1.2.8 (#1249) (dependabot[bot])
  • 51698b4 — chore: Introduce missing unified open source license header and toolchain for chaosblade CLI (#1244) (spencercjh)
  • 7dd785f — chore: update version to v1.8.0 (#1227) (xcaspar)
  • e2cb4f8 — build(deps): bump golang.org/x/net from 0.1.0 to 0.38.0 (#1151) (dependabot[bot])

🔒Security observations

  • High · Outdated Go Version in Dockerfile — Dockerfile (line with FROM golang:1.20.5). The Dockerfile uses golang:1.20.5 as the base image, which is outdated and may contain known security vulnerabilities. Go 1.20.5 was released in June 2023 and is no longer actively maintained. The go.mod file specifies go 1.25, creating a version mismatch between the build environment and project requirements. Fix: Update the Dockerfile to use a newer, actively maintained Go version that aligns with go.mod requirements (Go 1.25 or latest stable). Example: FROM golang:1.25-alpine
  • High · Insecure HTTP in Dockerfile — Dockerfile (wget command). The Dockerfile contains an HTTP request (wget http://www.musl-libc.org/releases/musl-${MUSL_VERSION}.tar.g) without HTTPS, exposing the build process to man-in-the-middle attacks and potential supply chain compromise. The line is incomplete but demonstrates an insecure pattern. Fix: Use HTTPS instead of HTTP for all remote resource downloads. Verify integrity with checksums or signatures. Example: wget https://www.musl-libc.org/releases/musl-${MUSL_VERSION}.tar.gz && echo 'checksum' | sha256sum -c -
  • High · Use of Incompatible Kubernetes Client Version — go.mod (k8s.io/client-go v12.0.0+incompatible). The project uses k8s.io/client-go v12.0.0+incompatible, which is a very old version (released ~2018) and marked as incompatible. This version likely contains numerous known security vulnerabilities and compatibility issues with modern Kubernetes clusters. Fix: Update to a supported version of k8s.io/client-go that matches the k8s.io/apimachinery version (0.20.6). Consider upgrading to a more recent version like v0.28.0 or later for better security and feature support.
  • Medium · Outdated Kubernetes Apimachinery Version — go.mod (k8s.io/apimachinery v0.20.6). k8s.io/apimachinery v0.20.6 is outdated (released in 2021) and may contain unpatched security vulnerabilities. Using old Kubernetes dependencies can expose the system to known exploits. Fix: Upgrade k8s.io/apimachinery to version 0.28.0 or later (2024+) to receive security patches and improvements. Ensure compatibility with other k8s dependencies.
  • Medium · Outdated gopsutil Dependency — go.mod (github.com/shirou/gopsutil v3.21.11+incompatible). github.com/shirou/gopsutil v3.21.11+incompatible is marked as incompatible and is from 2021. This package handles system information gathering and may have security implications if vulnerabilities exist in system call handling. Fix: Update to a compatible, newer version of gopsutil. The current version 5.x or latest v3.x versions are recommended.
  • Medium · Missing Input Validation in CLI Module — cli/cmd/ (create.go, query.go, destroy.go, etc.). The CLI module (cli/cmd/create.go, cli/cmd/query.go, etc.) may process user input for chaos experiments without visible input validation. Given that this is a chaos engineering tool that modifies system state, insufficient input validation could lead to command injection or unintended system modifications. Fix: Implement comprehensive input validation for all CLI parameters. Sanitize and validate experiment specifications, parameters, and configuration inputs. Use allowlists where possible for experiment types and targets.
  • Medium · Potential SQL Injection in SQLite Usage — data/ module (experiment.go, preparation.go, source.go). The project uses github.com/glebarez/sqlite v1.11.0 for data persistence (data/experiment.go, data/preparation.go). Without visible ORM usage or parameterized queries, there is a risk of SQL injection if experiment data or queries are constructed dynamically. Fix: Ensure all database queries use parameterized queries

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
