
ShangtongZhang/reinforcement-learning-an-introduction

Python Implementation of Reinforcement Learning: An Introduction

Overall: Mixed — stale; last commit 2y ago

  • Use as dependency: Mixed (weakest axis) — last commit was 2y ago; no tests detected
  • Fork & modify: Healthy — has a license and CI; a clean foundation to fork and modify
  • Learn from: Healthy — documented and popular; a useful reference codebase to read through
  • Deploy as-is: Healthy — no critical CVEs and a sane security posture; runnable as-is

  • 19 active contributors
  • MIT licensed
  • CI configured
  • Stale — last commit 2y ago
  • Concentrated ownership — top contributor handles 65% of recent commits
  • No test directory detected
What would change the summary?
  • Use as dependency: Mixed → Healthy if: ≥1 commit in the last 365 days

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — it updates live from the latest cached analysis.

Variant: RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/shangtongzhang/reinforcement-learning-an-introduction?axis=fork)](https://repopilot.app/r/shangtongzhang/reinforcement-learning-an-introduction)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/shangtongzhang/reinforcement-learning-an-introduction on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: ShangtongZhang/reinforcement-learning-an-introduction

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in the "Verify before trusting" section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/ShangtongZhang/reinforcement-learning-an-introduction shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Stale — last commit 2y ago

  • 19 active contributors
  • MIT licensed
  • CI configured
  • ⚠ Stale — last commit 2y ago
  • ⚠ Concentrated ownership — top contributor handles 65% of recent commits
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live ShangtongZhang/reinforcement-learning-an-introduction repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/ShangtongZhang/reinforcement-learning-an-introduction.

What it runs against: a local clone of ShangtongZhang/reinforcement-learning-an-introduction — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in ShangtongZhang/reinforcement-learning-an-introduction | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 666 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>ShangtongZhang/reinforcement-learning-an-introduction</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of ShangtongZhang/reinforcement-learning-an-introduction. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/ShangtongZhang/reinforcement-learning-an-introduction.git
#   cd reinforcement-learning-an-introduction
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of ShangtongZhang/reinforcement-learning-an-introduction and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "ShangtongZhang/reinforcement-learning-an-introduction(\.git)?\b" \
  && ok "origin remote is ShangtongZhang/reinforcement-learning-an-introduction" \
  || miss "origin remote is not ShangtongZhang/reinforcement-learning-an-introduction (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "chapter02/ten_armed_testbed.py" \\
  && ok "chapter02/ten_armed_testbed.py" \\
  || miss "missing critical file: chapter02/ten_armed_testbed.py"
test -f "chapter05/blackjack.py" \\
  && ok "chapter05/blackjack.py" \\
  || miss "missing critical file: chapter05/blackjack.py"
test -f "chapter06/cliff_walking.py" \\
  && ok "chapter06/cliff_walking.py" \\
  || miss "missing critical file: chapter06/cliff_walking.py"
test -f "chapter10/mountain_car.py" \\
  && ok "chapter10/mountain_car.py" \\
  || miss "missing critical file: chapter10/mountain_car.py"
test -f "chapter12/mountain_car.py" \\
  && ok "chapter12/mountain_car.py" \\
  || miss "missing critical file: chapter12/mountain_car.py"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 666 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~636d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/ShangtongZhang/reinforcement-learning-an-introduction"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

A Python implementation of Sutton & Barto's Reinforcement Learning: An Introduction (2nd Edition) textbook, providing runnable examples for 13 chapters of RL algorithms. It implements core algorithms like multi-armed bandits (Chapter 2), dynamic programming (Chapter 4), Monte Carlo methods (Chapter 5), temporal-difference learning (Chapter 6), and policy gradient methods (Chapter 13), with visualization via matplotlib and seaborn.

The structure is modular and chapter-by-chapter: each chapterXX/ directory contains standalone Python scripts implementing specific algorithms or problems from that chapter (e.g., chapter02/ten_armed_testbed.py, chapter05/blackjack.py, chapter06/cliff_walking.py). Each script is self-contained with minimal dependencies, producing matplotlib figures saved to the images/ directory. There is no package structure and there are no shared utilities—the emphasis is on clarity over code reuse.

👥Who it's for

Students and practitioners learning reinforcement learning theory who want to see working code implementations alongside the textbook; ML engineers verifying algorithm behavior on classic RL problems like blackjack, grid worlds, and mountain car; educators using the book who need runnable examples to demonstrate concepts in lectures.

🌱Maturity & risk

A mature educational resource with Travis CI integration and a clean structure organized by chapter—though, per the maintenance signals above, the last commit was roughly two years ago. The repo implements the full 13-chapter arc of a published academic textbook, making it reliable for learning purposes. However, it is not a production library—it prioritizes clarity and pedagogical value over performance or industrial robustness.

Very low risk for educational use: minimal dependencies (numpy, matplotlib, seaborn, scipy, tqdm), no external API calls, and straightforward Python code. Effectively a single-maintainer repo (ShangtongZhang authors ~65% of recent commits), with no breaking changes expected since the textbook content is stable. The main limitation is that it's designed for learning, not for extending or deploying production systems.

Active areas of work

The repo is in steady state: it represents a complete implementation of the 2nd edition textbook. Recent commits likely focus on bug fixes or clarity improvements rather than new algorithms. No major feature development is visible—the scope is fixed by the textbook's 13 chapters. The Travis CI configuration is meant to keep the code runnable across Python versions.

🚀Get running

git clone https://github.com/ShangtongZhang/reinforcement-learning-an-introduction.git
cd reinforcement-learning-an-introduction
pip install numpy matplotlib seaborn scipy tqdm
python chapter02/ten_armed_testbed.py

This runs the 10-armed bandit example from Chapter 2, generating performance comparison plots.
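
For orientation, here is a minimal, hedged sketch of the epsilon-greedy loop this kind of testbed runs. The function and its parameters are illustrative only — the real chapter02/ten_armed_testbed.py organizes this with richer bandit classes and plotting code:

```python
# Illustrative sketch (not the repo's API): the core epsilon-greedy bandit loop
# that chapter02/ten_armed_testbed.py elaborates with multiple agents and plots.
import numpy as np

def run_epsilon_greedy(k=10, steps=1000, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, k)   # true action values, drawn once
    q_est = np.zeros(k)                # incremental sample-average estimates
    counts = np.zeros(k)
    total_reward = 0.0
    for _ in range(steps):
        # Explore with probability epsilon, otherwise act greedily.
        a = rng.integers(k) if rng.random() < epsilon else int(np.argmax(q_est))
        r = rng.normal(q_true[a], 1.0)           # noisy reward for the chosen arm
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental mean update
        total_reward += r
    return total_reward / steps

print(run_epsilon_greedy())   # average reward over the run
```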

Daily commands: navigate to any chapterXX/filename.py and execute it directly, e.g. python chapter06/cliff_walking.py. Output consists of matplotlib figures saved to the images/ directory, with progress printed to the console. No server, no config files—each script is fully self-contained and executable independently.

🗺️Map of the codebase

  • chapter02/ten_armed_testbed.py — Entry point demonstrating core bandit algorithms (epsilon-greedy, UCB, gradient) that establish foundational patterns used throughout the codebase.
  • chapter05/blackjack.py — First complete Monte Carlo implementation; essential reference for understanding episodic task structure and off-policy learning used in later chapters.
  • chapter06/cliff_walking.py — Canonical TD learning example (SARSA vs Q-learning); critical for understanding temporal difference methods that dominate chapters 6–13.
  • chapter10/mountain_car.py — First continuous control problem with function approximation; establishes patterns for scaling RL beyond tabular methods.
  • chapter12/mountain_car.py — Eligibility traces with function approximation; demonstrates how the core RL concepts combine in a single example.
  • requirements.txt — Lists the minimal dependencies (numpy, matplotlib, seaborn, scipy, tqdm); versions are unpinned, so verify compatibility with your environment.
  • README.md — Maps all implementations to textbook figures/examples; essential for understanding which chapter corresponds to which algorithm and problem domain.

🛠️How to make changes

Add a new tabular RL algorithm (e.g., Expected SARSA variant)

  1. Create new file in chapter06/ following naming convention (e.g., chapter06/new_algorithm.py) (chapter06/new_algorithm.py)
  2. Import numpy and define environment class (copy structure from cliff_walking.py GridWorld) (chapter06/cliff_walking.py)
  3. Implement Q-table updates following the agent pattern (see SARSA in cliff_walking.py, the lines using q_values; a minimal update is sketched after this list) (chapter06/cliff_walking.py)
  4. Run episodes with epsilon-greedy action selection; collect rewards and state-action traces (chapter06/cliff_walking.py)
  5. Plot convergence curves using matplotlib (reference: cliff_walking.py plotting block) (chapter06/cliff_walking.py)
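
A minimal sketch of step 3's update rule, assuming a Q-table indexed as q_values[state][action]; the constants and interface are illustrative, not lifted from cliff_walking.py:

```python
# Expected SARSA update under an epsilon-greedy behavior policy.
import numpy as np

GAMMA, ALPHA, EPSILON = 1.0, 0.5, 0.1

def expected_sarsa_update(q_values, state, action, reward, next_state):
    """q_values: dict of arrays or 2-D array indexed as [state][action]."""
    q_next = q_values[next_state]                 # action values at s'
    n_actions = len(q_next)
    # Expectation over the epsilon-greedy policy instead of the sampled a'.
    greedy = int(np.argmax(q_next))
    probs = np.full(n_actions, EPSILON / n_actions)
    probs[greedy] += 1.0 - EPSILON
    expected_q = float(np.dot(probs, q_next))
    td_error = reward + GAMMA * expected_q - q_values[state][action]
    q_values[state][action] += ALPHA * td_error
```

The only difference from SARSA is the target: an expectation over next actions rather than the single sampled next action, which reduces update variance.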

Add function approximation to a continuous control task

  1. Create new file in chapter10/ or chapter12/ (e.g., chapter10/new_continuous_task.py) (chapter10/mountain_car.py)
  2. Define feature engineering function (tile coding or polynomial basis; copy from mountain_car.py) (chapter10/mountain_car.py)
  3. Replace Q-table with weight vector w; implement linear value function q_hat(s,a) = w^T * phi(s,a) (chapter10/mountain_car.py)
  4. Use semi-gradient updates with a bootstrapped TD target: w += alpha * (r + gamma * q_hat(s',a') - q_hat(s,a)) * phi(s,a) (see access_control.py and the sketch after this list) (chapter10/access_control.py)
  5. Validate convergence on learning curves; compare against tabular baseline if available (chapter10/mountain_car.py)
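
A compact sketch of steps 2–4 under simplifying assumptions — the polynomial features here stand in for the tile coding that mountain_car.py actually uses, and all names are illustrative:

```python
# Linear action-value approximation with a semi-gradient SARSA update.
import numpy as np

N_ACTIONS = 3
ALPHA, GAMMA = 0.1, 1.0

def phi(state, action, n_features=8):
    """Toy feature vector: polynomial in a scalar state, one block per action."""
    x = np.zeros(n_features * N_ACTIONS)
    base = np.array([state ** i for i in range(n_features)])
    x[action * n_features:(action + 1) * n_features] = base
    return x

def q_hat(w, state, action):
    return float(np.dot(w, phi(state, action)))   # q_hat(s,a) = w^T phi(s,a)

def semi_gradient_sarsa_step(w, s, a, r, s_next, a_next, done):
    target = r if done else r + GAMMA * q_hat(w, s_next, a_next)
    delta = target - q_hat(w, s, a)               # TD error
    w += ALPHA * delta * phi(s, a)                # semi-gradient update
    return w
```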

Add eligibility traces to an existing TD algorithm

  1. Start from a chapter12/ baseline (e.g., random_walk.py or mountain_car.py with existing TD update) (chapter12/random_walk.py)
  2. Initialize eligibility trace array e (same shape as Q or w); reset at episode start (chapter12/random_walk.py)
  3. Update trace decay: e *= gamma * lambda after each step; set e(s,a) = 1 for visited state-action (chapter12/random_walk.py)
  4. Distribute the TD error across all eligible traces: w += alpha * delta * e (backward view; sketched after this list) (chapter12/random_walk.py)
  5. Sweep lambda values (0.0, 0.4, 0.8, 1.0) and plot RMS error vs lambda (reference: lambda_effect.py) (chapter12/lambda_effect.py)
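
A sketch of steps 2–4 as a single backward-view update, assuming binary features so that np.maximum implements replacing traces; the chapter12 scripts structure this differently and implement both trace variants:

```python
# Backward-view SARSA(lambda) step skeleton.
import numpy as np

ALPHA, GAMMA, LAMBDA = 0.1, 1.0, 0.8

def sarsa_lambda_step(w, e, s_feat, delta):
    """One backward-view update.
    w: weight vector; e: eligibility trace (same shape as w, reset per episode);
    s_feat: binary phi(s,a) for the visited state-action; delta: TD error."""
    e *= GAMMA * LAMBDA          # decay all traces
    e = np.maximum(e, s_feat)    # replacing traces (use e += s_feat for accumulating)
    w += ALPHA * delta * e       # distribute the TD error across eligible traces
    return w, e
```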

Add a new gridworld variant (Dyna, planning, or stochastic environment)

  1. Copy chapter04/grid_world.py or chapter06/windy_grid_world.py as template (chapter04/grid_world.py)
  2. Modify environment: change grid size, reward structure, or add stochastic transitions (e.g., wind probabilities) (chapter06/windy_grid_world.py)
  3. For Dyna: implement model dict {(s,a) -> (s',r)} populated during real experience (see chapter08/maze.py) (chapter08/maze.py)
  4. Add a planning loop: sample (s,a) from the model and apply a TD update (see the planning steps in chapter08/maze.py; a condensed loop is sketched after this list) (chapter08/maze.py)
  5. Plot learning curves showing effect of planning (real steps vs planning ratio) (chapter08/maze.py)
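
A condensed Dyna-Q loop for steps 3–4, with simplified dict-based Q and model structures (chapter08/maze.py uses its own classes; the names here are placeholders):

```python
# Tabular Dyna-Q: one direct-RL update plus N_PLANNING simulated updates.
import random

ALPHA, GAMMA, N_PLANNING = 0.1, 0.95, 5

def dyna_q_update(q, model, s, a, r, s_next, actions):
    # Direct RL update from real experience (Q-learning).
    q[(s, a)] = q.get((s, a), 0.0) + ALPHA * (
        r + GAMMA * max(q.get((s_next, b), 0.0) for b in actions) - q.get((s, a), 0.0))
    # Learn the (deterministic) model from the observed transition.
    model[(s, a)] = (s_next, r)
    # Planning: replay transitions sampled from the model.
    for _ in range(N_PLANNING):
        (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
        q[(ps, pa)] = q.get((ps, pa), 0.0) + ALPHA * (
            pr + GAMMA * max(q.get((ps_next, b), 0.0) for b in actions) - q.get((ps, pa), 0.0))
```

The planning loop is what distinguishes Dyna from plain Q-learning: every real step is amplified by N_PLANNING model-based updates.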

🔧Why these technologies

  • NumPy — Matrix operations for Q-tables, weight vectors, feature representations; essential for efficient numerical RL updates on modern CPUs.

🪤Traps & gotchas

No hidden environment variables or service dependencies—each script runs standalone. Keep these in mind:

  • Some scripts call np.random.seed() for reproducibility; removing or changing seeds will alter outputs (see the sketch below).
  • Scripts write PNG images to the images/ directory; make sure write permissions exist.
  • Some examples, like chapter08/maze.py, are computationally intensive; allow extra runtime on slower machines.
  • The repo assumes Python 3.6+; very old Python versions may hit compatibility issues with f-strings or type hints in some files.
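
A two-line illustration of the seeding point — same seed, same random stream, hence identical figures:

```python
# Reseeding the legacy NumPy RNG replays the exact same draw sequence.
import numpy as np

np.random.seed(42)
a = np.random.normal(size=3)
np.random.seed(42)
b = np.random.normal(size=3)
assert np.array_equal(a, b)   # same seed, same stream, same outputs
```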


💡Concepts to learn

  • Multi-Armed Bandit Problem — Foundation of exploration-exploitation tradeoff in RL; Chapter 2 teaches epsilon-greedy, UCB, and gradient bandit strategies used in all later algorithms
  • Temporal-Difference (TD) Learning — Core mechanism in Chapter 6 (SARSA, Q-Learning) that bridges Monte Carlo and dynamic programming; critical for on-policy vs. off-policy algorithm design
  • Function Approximation (Tile Coding) — Chapter 10 and 12 use tile coding to handle continuous state spaces with linear approximation; essential for scaling tabular methods
  • Policy Gradient Methods (REINFORCE) — Chapter 13 implements gradient-based policy optimization; foundation for modern actor-critic and deep policy gradient algorithms (A3C, PPO, TRPO)
  • Eligibility Traces (Lambda) — Chapter 12 shows how eligibility traces unify TD methods across variable lookahead horizons; critical for understanding TD(λ) and n-step returns
  • Maximization Bias — Chapter 6 shows Double Q-Learning addresses overestimation in Q-Learning; a subtle but important correction for off-policy learning
  • Value Function Convergence (Bellman Operators) — Chapters 3-4 use iterative policy evaluation to compute value functions via Bellman equations; underpins all dynamic programming and bootstrapping methods
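
For the on-policy vs. off-policy distinction above, the two Chapter 6 update rules differ only in their targets (standard textbook forms, shown here for reference, not copied from the scripts):

```latex
% SARSA (on-policy): the target uses the action actually taken next.
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr]

% Q-learning (off-policy): the target maximizes over next actions.
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \bigr]
```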
Related projects:

  • dennybritz/reinforcement-learning — Alternative Python implementation of the same Sutton & Barto textbook; competes directly on pedagogy and completeness for Chapters 1-13
  • keras-rl/keras-rl — High-level deep RL library (DQN, A3C, DDPG); useful when you've mastered fundamentals here and want to scale to deep neural networks
  • openai/gym — Standard RL environment toolkit; many of the problems here (mountain car, blackjack) are built into Gym, allowing easy extension
  • openai/baselines — Production-grade RL baseline implementations (A2C, PPO, TRPO); shows how textbook algorithms scale to real-world problems
  • pytorch/rl — PyTorch Reinforcement Learning library; modern reference implementation for those moving from NumPy to GPU-accelerated frameworks

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Create shared utility module and refactor duplicate code across chapters

Multiple chapters implement similar algorithms (e.g., random_walk.py appears in chapter06, chapter07, and chapter09; grid_world.py in chapter03 and chapter04). A common utilities module would reduce duplication, improve maintainability, and make it easier for learners to understand core concepts. This aligns with the educational purpose of the repo.

  • [ ] Create chapter00 or utils/ directory with shared modules (e.g., grid_world_base.py, random_walk_base.py, environment.py)
  • [ ] Refactor chapter03/grid_world.py and chapter04/grid_world.py to inherit from a common base class
  • [ ] Refactor chapter06/random_walk.py, chapter07/random_walk.py, and chapter09/random_walk.py to use a shared RandomWalk base implementation
  • [ ] Update README.md to document the shared utilities section and explain code reuse patterns
  • [ ] Verify that figure generation remains identical for every refactored chapter (there are no existing unit tests to rely on—see the next PR idea)
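
One possible shape for the shared base class, sketched here purely as a suggestion — the module name and interface are hypothetical, not existing code:

```python
# Hypothetical utils/grid_world_base.py for the first two checklist items.
class GridWorldBase:
    """Minimal shared gridworld: subclasses override rewards/transitions."""

    def __init__(self, height, width):
        self.height, self.width = height, width

    def step(self, state, action):
        """Return (next_state, reward). Default: -1 per move, walls block."""
        dy, dx = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        y, x = state
        ny = min(max(y + dy, 0), self.height - 1)
        nx = min(max(x + dx, 0), self.width - 1)
        return (ny, nx), -1.0

class CliffGridWorld(GridWorldBase):
    """Cliff-walking variant: bottom-row interior cells reset to start at -100."""

    def step(self, state, action):
        next_state, reward = super().step(state, action)
        y, x = next_state
        if y == self.height - 1 and 0 < x < self.width - 1:
            return (self.height - 1, 0), -100.0   # fell off the cliff
        return next_state, reward
```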

Add comprehensive unit tests for all chapter implementations

The repo has .travis.yml indicating CI/CD but no visible test files in the file structure. Each chapter's algorithms should have unit tests validating correctness (e.g., convergence properties, expected value ranges, policy improvements). This ensures code reliability and helps catch regressions when refactoring.

  • [ ] Create tests/ directory with subdirectories mirroring chapter structure (tests/chapter02/, tests/chapter03/, etc.)
  • [ ] Add tests for chapter02/ten_armed_testbed.py validating epsilon-greedy convergence and reward statistics
  • [ ] Add tests for chapter04 dynamic programming implementations (car_rental.py, gamblers_problem.py) validating policy improvement and value convergence
  • [ ] Add tests for chapter06/cliff_walking.py and chapter05/blackjack.py validating Q-learning convergence
  • [ ] Update .travis.yml to run pytest and report coverage (currently missing explicit test runner configuration)
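
A hypothetical first test, written self-contained because the chapter scripts run plotting code at module scope and are awkward to import; the file name, seed, and threshold are illustrative:

```python
# Hypothetical tests/chapter02/test_ten_armed_testbed.py
import numpy as np

def test_epsilon_greedy_beats_random_on_average():
    rng = np.random.default_rng(0)
    q_true = rng.normal(0.0, 1.0, 10)
    q_est, counts = np.zeros(10), np.zeros(10)
    total = 0.0
    for _ in range(2000):
        a = rng.integers(10) if rng.random() < 0.1 else int(np.argmax(q_est))
        r = rng.normal(q_true[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]
        total += r
    # Epsilon-greedy should clear the random-policy baseline (the mean arm value).
    assert total / 2000 > q_true.mean()
```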

Add figure reproduction scripts and validation checks

The images/ folder contains ~30 figure files referenced in the README, but there are no explicit scripts to regenerate these figures or validate that current code still produces matching results. Adding reproducibility scripts helps learners verify implementations match the textbook and catches bugs when code changes.

  • [ ] Create reproduce_figures.py or figures/ directory with scripts named by figure (e.g., figure_2_1.py, figure_2_2.py, figure_10_1.py)
  • [ ] Each script should generate the corresponding figure and save to images/ with comparison mode to detect deviations
  • [ ] Add image comparison validation in CI/CD (compare generated figures against baseline using structural similarity or pixel-level tolerance)
  • [ ] Document in README.md how to regenerate all figures: 'python reproduce_figures.py' or 'python figures/figure_X_Y.py'
  • [ ] Update .travis.yml to include figure reproduction as part of CI pipeline
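
A possible pixel-tolerance comparison for the third checklist item; the paths, thresholds, and assumption of color PNGs (H, W, 4 arrays) are all illustrative:

```python
# Sketch: fraction of noticeably-different pixels must stay under a tolerance.
import numpy as np
from matplotlib.image import imread

def figures_match(candidate_png, baseline_png, tolerance=0.02):
    a, b = imread(candidate_png), imread(baseline_png)
    if a.shape != b.shape:
        return False
    # Per-pixel max channel difference (assumes color images in [0, 1]).
    diff = np.abs(a.astype(float) - b.astype(float)).max(axis=-1)
    return (diff > 0.1).mean() < tolerance

# Example: figures_match("images/figure_2_2.png", "baseline/figure_2_2.png")
```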

🌿Good first issues

  • Pin dependency versions in requirements.txt (or add a setup.py) for reproducible pip installation—currently versions are unpinned, so dependency version conflicts are possible across different environments
  • Create a tests/ directory with unit tests for core algorithms in Chapter 2-6 (e.g., test_ten_armed_testbed.py validating that epsilon-greedy converges correctly)—no tests currently exist
  • Add missing figure implementations: the README lists many figures (e.g., 'Figure 2.1' through 'Figure 12.11') but several chapter*/ directories lack corresponding scripts; implement missing algorithms as standalone scripts with matplotlib output


📝Recent commits

  • 96bc203 — Update README.md (ShangtongZhang)
  • c7cc538 — Update README.md (ShangtongZhang)
  • fbf020d — Merge pull request #148 from VEXLife/master (ShangtongZhang)
  • dedf66e — Corrected the typo. (VEXLife)
  • d017afb — Uploaded the new figures. (VEXLife)
  • 1c32673 — Update random_walk.py (VEXLife)
  • ac4fbce — Update README.md (ShangtongZhang)
  • a94922d — Merge pull request #144 from Johann-Huber/fig_12_14 (ShangtongZhang)
  • 9962a1b — add script that reproduce example 12.14 (Johann-Huber)
  • efff86d — Merge pull request #143 from VEXLife/chapter-5-3-change (ShangtongZhang)

🔒Security observations

This is an educational reinforcement learning codebase with a strong overall security posture. The primary concerns are dependency management practices rather than application-level vulnerabilities: the codebase contains no hardcoded credentials, SQL injection risks, web framework misconfigurations, or Docker infrastructure issues. The main recommendation is to pin dependency versions strictly and add automated security scanning to the CI pipeline to catch vulnerabilities early. The absence of web endpoints, user-controlled file I/O, and database interactions significantly reduces the attack surface.

  • Low · Outdated or Unpinned Dependencies — requirements.txt. The requirements.txt file lists dependencies without version pinning (numpy, matplotlib, seaborn, tqdm, scipy). This could lead to installation of vulnerable versions when dependencies are updated, as no specific versions are enforced. Fix: Pin all dependencies to specific versions (e.g., 'numpy==1.21.0') and regularly audit for security updates. Use tools like 'pip-audit' or 'safety' to check for known vulnerabilities.
  • Low · Missing Security Scanning in CI Configuration — .travis.yml. The .travis.yml file is present but not fully visible in the provided context. CI pipelines should include security scanning steps to detect vulnerabilities during the build. Fix: Configure Travis CI with security scanning tools (e.g., bandit for Python static analysis, safety for dependency vulnerabilities) and require passing security checks before merging.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
