thu-ml/tianshou
An elegant PyTorch deep reinforcement learning library.
Healthy across all four use cases
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 5w ago
- ✓4 active contributors
- ✓MIT licensed
- ✓CI configured
- ✓Tests present
- ⚠Small team — 4 contributors active in recent commits
- ⚠Concentrated ownership — top contributor handles 55% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/thu-ml/tianshou)
Paste this at the top of your README.md — it renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/thu-ml/tianshou on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: thu-ml/tianshou
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/thu-ml/tianshou shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit 5w ago
- 4 active contributors
- MIT licensed
- CI configured
- Tests present
- ⚠ Small team — 4 contributors active in recent commits
- ⚠ Concentrated ownership — top contributor handles 55% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live thu-ml/tianshou
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/thu-ml/tianshou.
What it runs against: a local clone of thu-ml/tianshou — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in thu-ml/tianshou | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 63 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of thu-ml/tianshou. If you don't
# have one yet, run these first:
#
# git clone https://github.com/thu-ml/tianshou.git
# cd tianshou
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of thu-ml/tianshou and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "thu-ml/tianshou(\.git)?\b" \
  && ok "origin remote is thu-ml/tianshou" \
  || miss "origin remote is not thu-ml/tianshou (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "tianshou/__init__.py" \
  && ok "tianshou/__init__.py" \
  || miss "missing critical file: tianshou/__init__.py"
test -f "tianshou/policy/base.py" \
  && ok "tianshou/policy/base.py" \
  || miss "missing critical file: tianshou/policy/base.py"
test -f "tianshou/data/buffer/base.py" \
  && ok "tianshou/data/buffer/base.py" \
  || miss "missing critical file: tianshou/data/buffer/base.py"
test -f "tianshou/collector/base.py" \
  && ok "tianshou/collector/base.py" \
  || miss "missing critical file: tianshou/collector/base.py"
test -f "tianshou/trainer/base.py" \
  && ok "tianshou/trainer/base.py" \
  || miss "missing critical file: tianshou/trainer/base.py"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 63 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~33d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/thu-ml/tianshou"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Tianshou is a production-grade, PyTorch-based deep reinforcement learning library that provides modular, type-safe abstractions for implementing RL algorithms (DQN, PPO, TRPO, SAC, etc.), with a clean separation between Algorithm and Policy classes. It supports on-policy, off-policy, offline, and experimental multi-agent RL workflows over Gymnasium environments, and is optimized for both research flexibility and practical application. The package is a monolith rooted at tianshou/, with core abstractions (Algorithm, Policy, Batch via docs/02_deep_dives/L1_Batch.ipynb), environment handling (L3_Environments.ipynb), data collection (Collector in L5_Collector.ipynb), and algorithm implementations organized by learning paradigm. Extensive Jupyter-based documentation (docs/02_deep_dives/*.ipynb) explains each component's design. GitHub workflows auto-lint, test, and publish to PyPI.
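To make the central Batch abstraction concrete, here is a minimal sketch (it assumes tianshou is installed; the keys obs/act/rew are illustrative, not a required schema):

import numpy as np
from tianshou.data import Batch

# Batch is a dict-like container with attribute access, consistent
# indexing across keys, and support for nesting and minibatch splitting.
b = Batch(obs=np.zeros((4, 8)), act=np.array([0, 1, 2, 3]), rew=np.ones(4))
print(b.obs.shape)  # (4, 8)
print(b[0])         # indexing slices every key at once
for minibatch in b.split(2, shuffle=False):
    print(len(minibatch))  # 2, 2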
👥Who it's for
RL researchers and engineers building custom reinforcement learning agents who need well-architected, hackable low-level algorithm implementations (e.g., implementing a novel policy gradient variant) as well as practitioners applying existing algorithms to custom environments without framework boilerplate. Also academia/industry teams doing MARL or offline RL research.
🌱Maturity & risk
Actively developed and production-ready: v2.0+ with complete API overhaul (released on PyPI), strong GitHub presence (indicated by stars/forks badges), extensive CI/CD via GitHub Actions (.github/workflows/pytest.yml, lint_and_docs.yml, gputest.yml), comprehensive docs in ReadTheDocs format, and official Docker support. Version 2 represents a mature rearchitecture with clear design principles, not experimental code.
Breaking changes introduced in v2 (not backwards compatible per CHANGELOG.md) require migration for existing users. Dependency on Gymnasium ecosystem (Farama-Foundation/Gymnasium) introduces external maintenance risk. Large Python codebase (1.48M lines) with broad scope (on/off-policy/offline/MARL/model-based) means high surface area for bugs; MARL and model-based features explicitly marked as experimental. No obvious single-maintainer bottleneck visible, but multi-algorithm scope requires sustained maintenance.
Active areas of work
Active development on v2.x with focus on API stability and comprehensive documentation. ReadTheDocs integration active (.readthedocs.yaml present), pre-commit hooks configured (.pre-commit-config.yaml), and multi-stage CI testing (pytest, GPU tests, linting). CHANGELOG.md and recent PR template suggest regular releases and community contributions being triaged.
🚀Get running
git clone https://github.com/thu-ml/tianshou.git
cd tianshou
pip install -e .
Then validate: python -c 'import tianshou; print(tianshou.__version__)'. For GPU testing, use the Dockerfile: docker build -t tianshou . && docker run -it tianshou.
Daily commands:
No traditional 'dev server'—this is a library. To validate installation: python -c 'import tianshou'. To run examples: check benchmark/run_benchmark.py for reference training loops. To run tests: pytest (pytest.yml workflow shows this is the canonical test command). For Docker: docker build -t tianshou . && docker run tianshou pytest.
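Beyond the import check, a quick end-to-end smoke test — a sketch assuming gymnasium is installed and a recent Tianshou (the vectorized reset() return signature has changed across versions):

import gymnasium as gym
from tianshou.env import DummyVectorEnv

# Four CartPole copies stepped in lockstep; the Collector consumes this same interface.
envs = DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
obs, info = envs.reset()
print(obs.shape)  # (4, 4): one row of observations per environment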
🗺️Map of the codebase
- tianshou/__init__.py — Main package entry point defining the public API and core module exports for the entire DRL library
- tianshou/policy/base.py — Base policy class that all reinforcement learning algorithms inherit from; foundational abstraction for training agents
- tianshou/data/buffer/base.py — Core replay buffer abstraction managing experience storage and sampling, critical for all training pipelines
- tianshou/collector/base.py — Environment interaction collector that orchestrates episode rollout and data collection from environments
- tianshou/trainer/base.py — Training loop orchestrator coordinating collector, policy updates, and episode evaluation
- tianshou/data/batch.py — Unified data structure wrapping environment transitions and batched experience; used throughout the codebase
🧩Components & responsibilities
- Policy (BasePolicy) (PyTorch, neural networks) — Converts observations to actions; learns from experience via gradient descent
  - Failure mode: Divergent training, numerical instability, NaN losses due to poor learning rates or reward scaling
- Collector (Gym, NumPy, vectorization) — Steps environments, applies policy actions, and accumulates transitions into the replay buffer
  - Failure mode: Environment deadlocks, seed desynchronization across parallel envs, episode boundary corruption
- Replay Buffer (NumPy arrays, circular buffers) — Stores and samples experience transitions to break correlations in training data (a usage sketch follows this list)
  - Failure mode: Memory exhaustion with large buffer sizes, stale data if not properly rotated, slow sampling with prioritization
- Trainer (Python async, PyTorch optimizers) — Orchestrates the collection → learning → evaluation loop and manages training state
  - Failure mode: Training hangs if the collector blocks, incorrect learning-rate schedules causing divergence
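The referenced usage sketch — a rough illustration of the Batch → buffer hand-off (the exact add() signature and required keys have shifted between Tianshou releases, so treat names and shapes as assumptions to verify):

import numpy as np
from tianshou.data import Batch, ReplayBuffer

buf = ReplayBuffer(size=100)
# add() takes a batched transition; the leading dimension is the number
# of parallel envs (1 here). The key set mirrors a Gymnasium step result.
buf.add(Batch(
    obs=np.zeros((1, 4)), act=np.array([0]), rew=np.array([1.0]),
    terminated=np.array([False]), truncated=np.array([False]),
    obs_next=np.zeros((1, 4)), info=Batch(),
))
batch, indices = buf.sample(batch_size=1)  # uniform sampling breaks temporal correlation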
🛠️How to make changes
Add a new DRL algorithm
- Create a new policy class inheriting from BasePolicy in tianshou/policy/ (tianshou/policy/your_algorithm.py)
- Implement forward(batch, state=None, ...) for inference and learn(batch, **kwargs) for training (tianshou/policy/your_algorithm.py) — a skeleton follows this list
- Register the algorithm in tianshou/policy/__init__.py
- Create an example script demonstrating training with your algorithm (examples/atari/atari_youralgo.py or examples/box2d/environment_youralgo.py)
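A hypothetical skeleton for the first two steps. It assumes the v1-style BasePolicy interface described above; the v2 Algorithm/Policy split may relocate learn(), so check tianshou/policy/base.py before copying. The loss is a stand-in (supervised on stored actions) just to keep the sketch runnable:

import torch
import torch.nn.functional as F
from tianshou.data import Batch
from tianshou.policy import BasePolicy

class YourAlgorithmPolicy(BasePolicy):  # hypothetical name from the checklist
    def __init__(self, model, optim, **kwargs):
        super().__init__(**kwargs)
        self.model = model  # maps observations to action logits
        self.optim = optim

    def forward(self, batch, state=None, **kwargs):
        # Inference: observations in, actions out (greedy for simplicity).
        logits, hidden = self.model(batch.obs, state=state)
        return Batch(logits=logits, act=logits.argmax(dim=-1), state=hidden)

    def learn(self, batch, **kwargs):
        # One gradient step; a real algorithm computes its own loss here.
        logits, _ = self.model(batch.obs, state=None)
        loss = F.cross_entropy(logits, torch.as_tensor(batch.act))
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
        return {"loss": loss.item()}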
Add a new replay buffer type
- Create a buffer class inheriting from the base buffer class in tianshou/data/buffer/base.py (ReplayBuffer) and place it in tianshou/data/buffer/ (tianshou/data/buffer/your_buffer.py)
- Implement add() for storing transitions and sample() for retrieval (tianshou/data/buffer/your_buffer.py) — see the sketch after this list
- Export the new buffer in tianshou/data/buffer/__init__.py
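The referenced sketch — a toy subclass, assuming the base class exported from tianshou/data/buffer/base.py is ReplayBuffer and that it exposes a sample_indices() hook (method names have shifted between releases; verify against your installed version):

import numpy as np
from tianshou.data import ReplayBuffer

class RecencyBiasedBuffer(ReplayBuffer):
    """Toy buffer that samples newer transitions more often (illustrative only)."""

    def sample_indices(self, batch_size):
        n = len(self)
        if n == 0 or batch_size == 0:
            return np.array([], dtype=int)
        # Linearly increasing weight toward newer entries. Ignores circular
        # wrap-around of the underlying storage — a sketch, not production code.
        w = np.arange(1, n + 1, dtype=float)
        return np.random.choice(n, size=batch_size, p=w / w.sum())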
Add environment wrappers for a new domain
- Create a wrapper class in tianshou/env/ extending gym.Wrapper (tianshou/env/your_domain_env.py)
- Implement step() and reset() to standardize observations and rewards (tianshou/env/your_domain_env.py) — a minimal example follows this list
- Create an example training script using the wrapper (examples/your_domain/environment_algorithm.py)
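The referenced minimal example. Tianshou targets Gymnasium, so step() returns the five-tuple (obs, reward, terminated, truncated, info); the wrapper name here is made up for illustration:

import gymnasium as gym

class RewardClipWrapper(gym.Wrapper):
    """Clip rewards to [-1, 1], a common preprocessing step (e.g., for Atari)."""

    def step(self, action):
        obs, rew, terminated, truncated, info = self.env.step(action)
        return obs, max(-1.0, min(1.0, float(rew))), terminated, truncated, info

# Usage: env = RewardClipWrapper(gym.make("CartPole-v1"))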
🔧Why these technologies
- PyTorch — Primary deep learning framework for neural network policies; enables GPU acceleration and automatic differentiation
- OpenAI Gym API — Standard environment interface allowing seamless integration with diverse RL benchmarks (Atari, MuJoCo, Box2D)
- NumPy/PyTorch tensors — Efficient vectorized data structures for batch processing experience transitions and parallel environment collection
- TensorBoard/W&B — Experiment tracking and visualization of training metrics across distributed runs
⚖️Trade-offs already made
- Unified Batch abstraction wrapping all data types
  - Why: Provides a consistent interface across on-policy, off-policy, and model-based algorithms
  - Consequence: Slight serialization overhead, but great flexibility and extensibility in algorithm implementations
- Vectorized environment collection in parallel
  - Why: Dramatically improves sample throughput and reduces wall-clock training time
  - Consequence: Added complexity in state management and synchronization; requires careful handling of episode boundaries across parallel envs
- Separate Trainer and Collector responsibilities
  - Why: Clean separation allows independent reuse (e.g., evaluation without training, multi-policy coordination)
  - Consequence: More moving parts to coordinate; configuration must be passed through multiple layers
- Flexible policy forward() signature using **kwargs
  - Why: Allows algorithms to handle heterogeneous observations (images, vectors, recurrent states) without subclassing
  - Consequence: Reduced type safety; developers must understand each algorithm's optional parameters
🚫Non-goals (don't propose these)
- Does not provide built-in distributed training across multiple machines (single-node multi-GPU supported via PyTorch)
- Does not implement robot-specific simulation environments (delegates to third-party simulators like MuJoCo, Pybullet)
- Does not handle real-world robotics deployment or sim-to-real transfer
- Does not include pre-trained model checkpoints (examples show training from scratch)
🪤Traps & gotchas
- Version 2 is not backwards compatible (CHANGELOG.md emphasizes this) — existing code targeting v1 will break.
- The Algorithm/Policy split is a new abstraction; old code that mixed the two will need refactoring.
- Gymnasium (not OpenAI Gym) is a hard requirement; gym-based code won't work without adapters.
- MARL and model-based features are experimental (the docs note this), so their APIs may change.
- Docker builds for GPU tests (.github/workflows/gputest.yml) suggest CUDA version pinning — verify your GPU environment matches the Dockerfile.
- ReadTheDocs builds from .readthedocs.yaml; the docs require specific build dependencies.
🏗️Architecture
(Architecture diagram not included in this export — see the live RepoPilot page.)
💡Concepts to learn
- Markov Decision Process (MDP) — Foundational RL formalism that every Tianshou algorithm solves; understanding state transitions, rewards, and policies is prerequisite to implementing any new algorithm
- Experience Replay / Replay Buffer — Central mechanism in Tianshou for breaking correlations between samples; the tianshou/data/buffer/ implementations are critical to algorithm stability, especially in off-policy learning
- Policy Gradient Methods — Family of algorithms (PPO, TRPO, A3C, SAC) heavily implemented in Tianshou; core to understanding on-policy learning and entropy-regularized RL
- Batch (vectorized transition groups) — Tianshou's custom Batch data structure (tianshou/data/batch.py) abstracts trajectory storage and enables efficient vectorized RL across environments; foundational to the library's speed and design
- Off-policy vs On-policy Learning — Tianshou cleanly separates these learning paradigms at the Algorithm class level; understanding the distinction (behavior policy vs. target policy, importance sampling) is essential for choosing and extending algorithms
- Generalized Advantage Estimation (GAE) — Tianshou has a dedicated deep-dive notebook (docs/02_deep_dives/L4_GAE.ipynb), and GAE is used in most on-policy algorithms (PPO, A3C); critical for variance reduction in policy gradients (a reference implementation sketch follows this list)
- Multi-Agent RL (MARL) — Tianshou experimental support (L6_MARL.ipynb) extends single-agent algorithms to cooperative/competitive settings; requires understanding of independent learners, centralized training/decentralized execution (CTDE), and multi-agent value functions
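The referenced GAE sketch — a textbook implementation for a single episode, not Tianshou's actual (vectorized) code:

import numpy as np

def gae_advantages(rewards, values, next_value, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda): A_t = delta_t + gamma*lam*A_{t+1},
    where delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    next_value is V(s_T), the bootstrap value (0.0 if the episode terminated)."""
    adv = np.zeros(len(rewards))
    last, v_next = 0.0, next_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * v_next - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
        v_next = values[t]
    return adv

# e.g. gae_advantages(np.ones(3), np.zeros(3), 0.0)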
🔗Related repos
- openai/spinningup — Educational RL library (now deprecated) that inspired clean algorithm documentation; Tianshou adopted a similar pedagogical separation of concerns, but in production code
- DLR-RM/stable-baselines3 — Alternative PyTorch RL library with similar scope (on/off-policy algorithms); a direct competitor with a different API philosophy (SB3 emphasizes simplicity, Tianshou emphasizes modularity)
- Farama-Foundation/Gymnasium — The environment interface standard Tianshou builds on; any custom RL environment must implement the Gymnasium API to work with Tianshou collectors
- pytorch/pytorch — Core dependency; Tianshou is pure PyTorch, so understanding PyTorch tensor ops and autograd is prerequisite knowledge
- thu-ml/tianshou-models — Companion repo (if it exists) with pre-trained model checkpoints and benchmark results for Tianshou algorithms across Gymnasium environments
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add missing deep dive documentation for Policy and Trainer abstractions
The docs/02_deep_dives folder contains L1-L6 notebooks covering Batch, Buffer, Environments, GAE, Collector, and MARL, but conspicuously missing are deep dives on Policy (core to RL) and Trainer (the main entry point). Given this is an 'elegant PyTorch deep RL library', new contributors should have accessible tutorials on these critical abstractions. This would improve onboarding and reduce GitHub issues about policy implementation.
- [ ] Create docs/02_deep_dives/L0_Policy.ipynb covering policy architecture, different policy types (DQN, PG, Actor-Critic), and the policy interface
- [ ] Create docs/02_deep_dives/L7_Trainer.ipynb covering training loops, the Trainer class, and integration with Collector/Policy/Buffer
- [ ] Link these notebooks from docs/02_deep_dives/0_intro.md and update docs/01_user_guide/02_core_abstractions.md with references
- [ ] Add example code snippets showing minimal policy and trainer usage
Add comprehensive integration tests for multi-agent RL (MARL) workflows
The repo has L6_MARL.ipynb documentation and MARL support (evident from file structure), but there's no indication of dedicated integration tests verifying end-to-end MARL training workflows. The pytest.yml workflow exists but likely lacks MARL-specific test coverage. This is critical for a library claiming elegant MARL support, as distributed multi-agent scenarios are complex and error-prone.
- [ ] Create tests/test_marl_integration.py with end-to-end tests covering: multi-agent environment setup, policy-per-agent training, shared policy training, and communication patterns
- [ ] Add a simple MARL environment fixture (e.g., using PettingZoo) in tests/fixtures/environments.py
- [ ] Update .github/workflows/pytest.yml to explicitly run MARL tests with a dedicated job or marker
- [ ] Document MARL testing requirements in CONTRIBUTING.md (e.g., PettingZoo installation)
Add type hints and type checking (mypy) to CI workflow
The repo has .github/workflows/lint_and_docs.yml but there's no evidence of type checking (mypy) in the linting pipeline. For a PyTorch library that's already well-structured, adding type hints and mypy checks would improve code quality, reduce bugs, and make the codebase more maintainable. This is a high-value contribution that affects all future PRs.
- [ ] Add mypy configuration to pyproject.toml or setup.cfg with strict settings
- [ ] Create .github/workflows/mypy_check.yml (or extend lint_and_docs.yml) to run 'mypy tianshou' on each PR
- [ ] Incrementally add type hints to core modules (start with tianshou/policy/ and tianshou/trainer/) across multiple PRs
- [ ] Document type hint requirements in CONTRIBUTING.md for new contributors
🌿Good first issues
- Add comprehensive docstring examples to all Policy subclasses in tianshou/policy/: currently many have sparse documentation, and adding runnable code snippets (like batch shape contracts and usage patterns) would help new contributors understand the Policy abstraction (a sample docstring sketch follows this list).
- Create integration tests in tests/ for Collector with vectorized Gymnasium environments: the async.png and collector documentation suggest this is important but dedicated test coverage of parallel sampling edge cases (episode termination, batch padding, reset semantics) appears minimal.
- Expand docs/02_deep_dives/L2_Buffer.ipynb with concrete examples of custom replay buffer implementations: the buffer abstraction is powerful but existing notebook only covers built-in buffers; adding 'build your own prioritized buffer' walkthrough would unblock algorithm researchers.
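A sketch of the kind of shape-contract docstring the first issue asks for — the class and shapes are illustrative stand-ins, not Tianshou's actual contract:

from tianshou.data import Batch

class ExamplePolicy:  # placeholder standing in for a tianshou.policy subclass
    def forward(self, batch, state=None):
        """Compute action values and greedy actions.

        :param batch: Batch with batch.obs of shape (B, *obs_shape).
        :param state: optional recurrent state of shape (B, hidden_dim).
        :returns: Batch with .logits of shape (B, num_actions) and .act of shape (B,).

        Example::

            result = policy(Batch(obs=obs))  # obs: (B, *obs_shape)
            actions = result.act             # (B,)
        """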
⭐Top contributors
Click to expand
Top contributors
- @opcode81 — 55 commits
- @MischaPanch — 43 commits
- @Trinkle23897 — 1 commit
- @Mr-Neutr0n — 1 commit
📝Recent commits
Click to expand
Recent commits
- f240205 — [data collector] Use monotonic clocks for collector timing (#1295) (Trinkle23897)
- bf5b636 — Fix link, improve wording [skip ci] (opcode81)
- 1d57967 — Release v2.0.1 (opcode81)
- 8da023e — Constrain pandas to <3 owing to incompatibility (#1290) (opcode81)
- 3d73f8f — Remove conda badge from README [skip ci] (opcode81)
- 162473e — Constrain pandas to <3 owing to incompatibility (opcode81)
- ddf6083 — fix: restore parameters on TRPO line search failure (#1287) (opcode81)
- af0d191 — fix: restore original parameters when TRPO line search fails (Mr-Neutr0n)
- 414589d — Disable fail-fast for matrix build (opcode81)
- be90aad — Fix link to developer guide [skip ci] (opcode81)
🔒Security observations
The Tianshou codebase shows a moderate security posture, with one high-severity issue deserving prompt attention: an incomplete COPY command in the Dockerfile. Beyond that, the lack of version pinning for system and Python packages introduces maintenance and security risks, the image lacks a non-root user, and there is no comprehensive dependency vulnerability scanning. The CI/CD pipeline (visible through the workflows) includes basic testing but could be enhanced with security scanning tools. No hardcoded secrets were detected in the analyzed file structure. Overall, the project would benefit from stricter Docker security practices and dependency management policies.
- High · Incomplete Dockerfile COPY command — Dockerfile. The Dockerfile contains a truncated COPY instruction, 'COPY pyproject.tom', which appears incomplete. This may indicate a malformed build configuration that could fail during image creation or copy the wrong files; the intended filename is presumably 'pyproject.toml'. Fix: complete the instruction, e.g. 'COPY pyproject.toml poetry.lock* .', so the dependency configuration is copied into the image.
- Medium · Unrestricted system package installation — Dockerfile, apt-get install command. System packages (curl, build-essential, git, wget, unzip, libvips-dev, gnupg2) are installed without version pinning, which can produce inconsistent builds and pull in vulnerable versions over time. Fix: pin versions for all system packages, e.g. 'curl=7.x.x' instead of 'curl'.
- Medium · Unrestricted Python package installation — Dockerfile, pipx install poetry. 'pipx' and 'poetry' are installed without version pinning, which could install vulnerable versions or versions with breaking changes. Fix: specify exact versions, e.g. 'pipx install poetry==1.x.x', to ensure reproducible builds.
- Low · Missing HEALTHCHECK in Dockerfile — Dockerfile. There is no HEALTHCHECK instruction, making it difficult to monitor container health in production. Fix: add a HEALTHCHECK instruction to enable automatic health monitoring of the container.
- Low · Running container as root — Dockerfile. No non-root user is specified, so the container runs as root by default, violating the principle of least privilege and widening the attack surface. Fix: create a non-root user and switch to it (e.g. 'RUN useradd -m tianshou' followed by 'USER tianshou').
- Low · Missing SBOM and supply-chain security metadata — Dockerfile and CI/CD workflows. No evidence of software bill of materials (SBOM) generation or dependency vulnerability scanning in the build pipeline, which limits supply-chain visibility. Fix: integrate tools like Syft or Trivy into the build pipeline to generate SBOMs and scan dependencies for known vulnerabilities.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.