axolotl-ai-cloud/axolotl
Healthy across the board
Weakest axis: Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 2d ago
- ✓21+ active contributors
- ✓Distributed ownership (top contributor 45% of recent commits)
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/axolotl-ai-cloud/axolotl)

Paste at the top of your README.md — renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/axolotl-ai-cloud/axolotl on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: axolotl-ai-cloud/axolotl
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/axolotl-ai-cloud/axolotl shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 2d ago
- 21+ active contributors
- Distributed ownership (top contributor 45% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live axolotl-ai-cloud/axolotl
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/axolotl-ai-cloud/axolotl.
What it runs against: a local clone of axolotl-ai-cloud/axolotl — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in axolotl-ai-cloud/axolotl | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 32 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of axolotl-ai-cloud/axolotl. If you don't
# have one yet, run these first:
#
# git clone https://github.com/axolotl-ai-cloud/axolotl.git
# cd axolotl
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of axolotl-ai-cloud/axolotl and re-run."
exit 2
fi
# 1. Repo identity
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "axolotl-ai-cloud/axolotl(\.git)?$" \
  && ok "origin remote is axolotl-ai-cloud/axolotl" \
  || miss "origin remote is not axolotl-ai-cloud/axolotl (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
# (the Apache-2.0 LICENSE file's text begins "Apache License", not "Apache-2.0")
(grep -qiE "Apache License|Apache-2\.0" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical paths exist (src/ and deepspeed_configs/ are directories)
test -f "README.md" \
  && ok "README.md" \
  || miss "missing critical file: README.md"
test -f ".github/CONTRIBUTING.md" \
  && ok ".github/CONTRIBUTING.md" \
  || miss "missing critical file: .github/CONTRIBUTING.md"
test -d "src" \
  && ok "src" \
  || miss "missing critical directory: src"
test -f "cicd/e2e_tests.py" \
  && ok "cicd/e2e_tests.py" \
  || miss "missing critical file: cicd/e2e_tests.py"
test -d "deepspeed_configs" \
  && ok "deepspeed_configs" \
  || miss "missing critical directory: deepspeed_configs"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 32 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~2d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/axolotl-ai-cloud/axolotl"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
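Because the script exits non-zero on any failed check, an agent can gate its edits on it. A minimal sketch of that gating, assuming the script above has been saved as `./verify.sh` (the function name and calling convention here are illustrative, not part of RepoPilot):

```shell
# Gate an action on the verification script: run the action only if every
# check passed; otherwise surface the staleness and bail with an error.
run_if_verified() {
  verify_cmd=$1; shift     # $1: verification command, e.g. ./verify.sh
  if $verify_cmd; then
    "$@"                   # remaining args: the action to run on success
  else
    echo "artifact stale — regenerate before editing" >&2
    return 1
  fi
}

# Example: run_if_verified ./verify.sh git apply patch.diff
```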
⚡TL;DR
Axolotl is a free, open-source framework for fine-tuning large language models (LLMs) using PyTorch, supporting multi-GPU training, LoRA/QLoRA adapters, and 100+ model architectures (Llama, Mistral, Gemma, etc.). It handles data loading, tokenization, distributed training, and inference with minimal boilerplate through YAML config files. Monorepo with core training logic in src/axolotl/ (train.py, core training loops), CLI entry point in src/axolotl/cli/, model-specific configs under examples/ (e.g., examples/mistral-medium-3_5, examples/gemma4), RunPod deployment code in .runpod/, and GitHub Actions workflows in .github/workflows/. Configuration driven via YAML (test configs in .runpod/src/config/config.yaml).
👥Who it's for
ML engineers and researchers who want to fine-tune proprietary or open LLMs without writing low-level distributed training code. Users range from hobbyists on single GPUs to enterprises running multi-GPU clusters on RunPod, Lambda, or local hardware.
🌱Maturity & risk
Production-ready and actively maintained. The repo has substantial GitHub stars, comprehensive CI/CD via GitHub Actions (tests.yml, multi-gpu-e2e.yml, nightlies.yml), ~4.9M lines of Python, recent 2026/04 updates adding Mistral Medium 3.5 and Gemma 4 support, and build infrastructure migrating to a uv-first workflow (PR #3545), backed by an active contributor community.
Dependencies are heavy (PyTorch, transformers, accelerate, bitsandbytes, flash-attention) with frequent upstream breaking changes in HuggingFace ecosystem. No single-maintainer risk evident from multiple workflows, but rapid model/feature churn (new models every release) means configs can bitrot. Pre-release features like SonicMoE fused LoRA suggest cutting-edge but potentially unstable additions.
Active areas of work
Active development on new model support (Mistral Medium 3.5, Gemma 4 added in 2026/04), migration to uv package manager (PR #3545), and experimental SonicMoE fused LoRA implementation. CI/CD includes nightly tests and multi-GPU semi-weekly E2E tests. Community contributions tracked via CONTRIBUTING.md and CODE_OF_CONDUCT.md.
🚀Get running
git clone https://github.com/axolotl-ai-cloud/axolotl.git && cd axolotl && pip install -e . (or use uv: uv pip install -e . per PR #3545). Install test dependencies: pip install -e .[test]. Reference .runpod/requirements.txt for pinned versions or use examples/colab-notebooks/colab-axolotl-example.ipynb for a no-setup sandbox.
Daily commands: For local training: axolotl train examples/mistral-7b-lora.yaml (YAML config specifies model, data, learning rate, etc.). For inference: axolotl inference examples/mistral-7b-lora.yaml. For RunPod: docker build .runpod && docker run with handler.py (see .runpod/src/handler.py). See .github/workflows/tests.yml for exact pytest commands.
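The install path above can be sketched as a small helper that prefers uv when it is on PATH (following the uv-first migration in PR #3545) and falls back to plain pip otherwise — an illustrative convenience, not a script shipped by the repo:

```shell
# Pick the installer for `pip install -e '.[test]'`-style commands:
# uv when available, otherwise plain pip.
choose_installer() {
  if command -v uv >/dev/null 2>&1; then
    echo "uv pip"   # uv-first workflow
  else
    echo "pip"      # classic fallback
  fi
}

# Usage (after `git clone https://github.com/axolotl-ai-cloud/axolotl.git && cd axolotl`):
#   $(choose_installer) install -e '.[test]'
```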
🗺️Map of the codebase
- README.md — Entry point describing Axolotl as a free, open-source LLM fine-tuning framework; essential for understanding project scope and features.
- .github/CONTRIBUTING.md — Defines contribution guidelines, coding standards, and development workflow that all contributors must follow.
- src — Primary source code directory containing core training, dataset, and model logic (inferred as load-bearing despite truncated file list).
- cicd/e2e_tests.py — End-to-end test suite validating training pipelines; critical for ensuring framework stability across GPU configurations.
- deepspeed_configs — DeepSpeed configuration templates (zero1, zero2, zero3) that enable distributed training; fundamental to the framework's scaling capabilities.
- docker/Dockerfile — Primary Docker image definition for containerized training environments; essential for reproducible deployments.
- .pre-commit-config.yaml — Pre-commit hooks enforcing code quality and linting standards before commits; maintains codebase consistency.
🛠️How to make changes
Add Support for a New Model Architecture
- Define model architecture specifications and initialization logic in the core training module (src)
- Add integration tests validating model loading and forward passes (cicd/e2e_tests.py)
- Document model-specific configuration requirements and hyperparameters (docs/agents/model_architectures.md)
- Create a reference fine-tuning example configuration (.runpod/src/config/config.yaml)
Add a New Fine-tuning Method (e.g., LoRA, QLoRA, Full FT)
- Implement method-specific optimizer and gradient computation in the core framework (src)
- Add benchmark scripts comparing memory/speed tradeoffs (benchmarks/bench_scattermoe_lora.py)
- Create a method selection guide with recommendations (docs/choosing_method.qmd)
- Add an end-to-end test covering the new method (cicd/e2e_tests.py)
Configure Training for a New Hardware Setup
- Create an appropriate DeepSpeed ZeRO configuration (Stage 1, 2, or 3 based on GPU memory) (deepspeed_configs/zero2.json)
- Build or adapt a Docker image for the target GPU vendor (NVIDIA/AMD/etc.) (docker/Dockerfile)
- Add a multi-GPU test workflow validating the configuration (.github/workflows/multi-gpu-e2e.yml)
- Document hardware-specific setup and environment variables (docs/amd_hpc.qmd)
Integrate a New Dataset Format
- Add dataset loader and format parser to core framework (
src) - Document format schema with examples (
docs/dataset-formats) - Add integration test loading and preprocessing sample data (
cicd/e2e_tests.py) - Create example configuration using new format (
.runpod/src/config/config.yaml)
🔧Why these technologies
- PyTorch + Transformers (HuggingFace) — De-facto standard for LLM fine-tuning; extensive model zoo and community support
- DeepSpeed (ZeRO optimizer stages) — Enables training on memory-constrained GPUs via parameter/gradient/optimizer-state partitioning across distributed systems
- Docker containerization — Ensures reproducible training environments across heterogeneous hardware (NVIDIA/AMD GPUs, cloud providers)
- YAML configuration-driven design — Allows users to define complex multi-stage training pipelines without modifying code
- CI/CD (GitHub Actions) with multi-GPU E2E tests — Validates training correctness across different GPU counts and optimization strategies before release
⚖️Trade-offs already made
- Configuration-driven (YAML) rather than programmatic API
  - Why: Lowers barrier to entry for non-ML-engineers; familiar pattern for DevOps workflows
  - Consequence: Less flexible for advanced custom training loops; schema validation complexity increases with features
- DeepSpeed ZeRO for distributed training instead of native DDP
  - Why: Reduces memory footprint by 3–10x, enabling fine-tuning of larger models on limited GPU memory
  - Consequence: Adds distributed training complexity; slight performance overhead from communication; requires careful tuning per hardware
- Support multiple fine-tuning methods (LoRA, QLoRA, full FT) in a single framework
  - Why: Users can trade off quality vs. speed/cost without switching tools
  - Consequence: Higher code complexity; larger codebase; potential for method-specific bugs
- Multiple deployment targets (Docker, RunPod, Kubernetes, local)
  - Why: Maximizes accessibility across different infrastructure (cloud, on-prem, edge)
  - Consequence: More Dockerfiles and entrypoints to maintain; environment-specific debugging complexity
🚫Non-goals (don't propose these)
- Real-time serving or inference optimization (use vLLM, TensorRT, or Ollama for deployment)
- Automated hyperparameter tuning (users must manually sweep or integrate external HPO tools)
- Multi-model ensemble training (single-model focus per run)
- Federated learning or privacy-preserving training (no differential privacy, federated averaging, etc.)
- Windows native support (Linux/WSL primary targets; Windows support not guaranteed)
🪤Traps & gotchas
1. YAML config brittleness: typos in config keys silently fail or fall back to defaults — validate early with axolotl validate examples/config.yaml.
2. GPU memory: LoRA + quantization still requires 8–24GB VRAM depending on model size; batch_size tuning is crucial.
3. Tokenizer version drift: HuggingFace tokenizers update, breaking reproducibility — lock the model repo commit in the config with model_id: username/model@commit.
4. Multi-GPU requires NCCL (nvidia-nccl) correctly installed; set NCCL_DEBUG=INFO if distributed training hangs.
5. RunPod deployment: .runpod/src/config/config.yaml and .runpod/src/handler.py are separate from the main examples/ — they must be synced manually.
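For the multi-GPU hang case, a minimal debugging sketch using standard NCCL environment variables (the train command is the one from "Get running" and is left commented out):

```shell
# Surface NCCL initialization problems before re-running a hung job.
export NCCL_DEBUG=INFO             # log NCCL init, topology, and errors
export NCCL_DEBUG_SUBSYS=INIT,NET  # focus logs on startup and networking

# Then relaunch the training run and read the NCCL lines in the output:
# axolotl train examples/mistral-7b-lora.yaml
```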
🏗️Architecture
💡Concepts to learn
- LoRA (Low-Rank Adaptation) — Axolotl's core feature: fine-tune large models by training only small low-rank adapter matrices (the adapter itself is often well under 1GB) instead of all weights — essential for production efficiency
- QLoRA (Quantized LoRA) — Combines 4-bit quantization (bitsandbytes) with LoRA for sub-8GB VRAM training — Axolotl's killer feature for hobbyists; requires understanding weight precision tradeoffs
- Distributed Data Parallelism (DDP) — Axolotl uses Accelerate to hide DDP complexity (multi-GPU via NCCL); understanding rank/world_size concepts is vital for debugging multi-GPU hangs
- Flash Attention — Kernel-level optimization Axolotl integrates for 2–3x speed; trade-off: compile time, CUDA/PyTorch version lockstep, slight numeric instability
- SFT (Supervised Fine-Tuning) vs. DPO (Direct Preference Optimization) — Axolotl abstracts both training modes via config; SFT is standard supervised loss, DPO aligns outputs to human preferences without RL — affects loss computation and data format
- Token-level Batch Processing & Padding — Axolotl's data pipeline handles variable-length sequences (padding, truncation, packing); misconfiguration causes OOM or silent accuracy loss — see src/axolotl/utils/data.py
- Quantization (INT8, NF4) — Axolotl supports BitsAndBytes 4/8-bit quantization during training; enables 13B models on consumer GPUs but introduces subtle numeric precision loss detectable only in downstream tasks
🔗Related repos
- huggingface/peft — Axolotl wraps PEFT for LoRA/QLoRA — understanding PEFT's adapter layer is essential for custom fine-tuning strategies
- huggingface/transformers — Core model loading and Trainer API that Axolotl extends; bug fixes or new model support often require Transformers PRs first
- openllm-project/OpenLLM — Alternative LLM framework; Axolotl is more training-focused, OpenLLM is more inference-focused — complementary
- unslothai/unsloth — Fast LoRA training alternative using kernel fusion (similar goal to Axolotl's Flash Attention integration) — competitor for speed-focused users
- microsoft/DeepSpeed — Distributed training library that Axolotl could integrate more deeply (currently uses Accelerate) — relevant for multi-node scaling
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add unit tests for .runpod/src/handler.py and utils.py
The RunPod integration has handler.py and utils.py files in .runpod/src/ but there are no corresponding test files visible in the repository structure. Given that this is a production integration point for RunPod deployments, comprehensive unit tests would improve reliability and catch regressions. This is especially valuable since the repo has test workflows (.github/workflows/tests.yml) and a .coveragerc file indicating testing infrastructure is already in place.
- [ ] Create .runpod/tests/ directory with __init__.py
- [ ] Add test_handler.py covering handler.py functionality with mock RunPod API calls
- [ ] Add test_utils.py covering utility functions in utils.py
- [ ] Update .github/workflows/tests.yml to include RunPod integration tests in the test matrix
- [ ] Ensure coverage reports include the new tests via .coveragerc
Add E2E validation tests for multi-GPU training configurations
The repo has .github/workflows/multi-gpu-e2e.yml and cicd/multigpu.py but lacks visible comprehensive test cases. The deepspeed_configs/ directory shows zero1.json and zero1_torch_compile.json configurations, yet there's no visible test file validating these configurations actually work end-to-end. Adding explicit test cases for different DeepSpeed configurations (ZeRO-1, ZeRO-2, ZeRO-3) would prevent configuration regressions.
- [ ] Create cicd/tests_multigpu.py with parametrized test cases for each deepspeed_configs/*.json file
- [ ] Add test cases validating ZeRO-1, ZeRO-2, and ZeRO-3 configurations work correctly
- [ ] Test gradient accumulation and mixed precision with each configuration
- [ ] Update .github/workflows/multi-gpu-e2e.yml to explicitly call the new test module
- [ ] Document expected hardware requirements in cicd/README.md
Add mypy type-checking CI workflow for Python codebase
The repo has .mypy.ini configuration file present, indicating intent to use type hints, but there's no visible GitHub Actions workflow to enforce type checking in CI. This means type errors can slip into production code undetected. Adding a dedicated mypy CI workflow would catch type inconsistencies early, especially important given the large codebase with modules like deepspeed_configs and cicd scripts.
- [ ] Create .github/workflows/type-check.yml that runs mypy against src/, cicd/, and .runpod/src/
- [ ] Configure the workflow to fail on type errors and run on pull_request and push to main
- [ ] Update .mypy.ini if needed to exclude test directories appropriately
- [ ] Add step to generate mypy coverage report and comment on PRs (similar to codecov pattern visible in codecov.yml)
- [ ] Document type-checking requirements in .github/CONTRIBUTING.md
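Before wiring the workflow, the check can be exercised locally. A hypothetical helper composing the exact invocation the workflow would run — the target directories mirror the checklist above and `mypy --config-file` is standard mypy CLI usage:

```shell
# Compose the mypy command the proposed type-check workflow would execute.
mypy_cmd() {
  echo "mypy --config-file .mypy.ini src/ cicd/ .runpod/src/"
}

# Usage inside a repo checkout with mypy installed:
#   eval "$(mypy_cmd)"
```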
🌿Good first issues
- Add type hints to src/axolotl/datasets/ (currently sparse typing; see .mypy.ini for strict config) — improves IDE support and catches bugs. Pairs with existing mypy CI check.
- Write integration tests for new model support (Mistral Medium 3.5, Gemma 4 added in 2026/04) in tests/e2e/ — currently only .github/workflows/multi-gpu-e2e.yml has E2E tests; add minimal pytest fixtures covering forward pass + LoRA export.
- Document RunPod deployment in CONTRIBUTING.md or separate DEPLOYMENT.md: .runpod/ has working code but no guide for contributors deploying there. Add examples of environment variables, secrets, config syncing from examples/ to .runpod/src/config/.
⭐Top contributors
- @winglian — 45 commits
- @NanoCode012 — 18 commits
- @ved1beta — 9 commits
- @thad0ctor — 3 commits
- @BrownianNotion — 3 commits
📝Recent commits
- 5352d41 — feat: systemic multimodal assistant-only loss masking + cfg.role_boundaries (#3625) (thad0ctor)
- c15f6cf — fix: FSDP FULL_STATE_DICT oom from memory leak (#3635) (ved1beta)
- e4032fc — Refactor separate attention flags with attn_implementation and capability/concerns feature flags (#3602) (winglian)
- 6136ae6 — Fix: add bitnet config (#3636) (younesbelkada)
- e662972 — Feat: Add bitnet integration (#3634) (younesbelkada)
- ebbd7fa — feat: Add Mistral Medium 3.5 (#3633) (NanoCode012)
- ac77da9 — use smaller pretrained models for ci (#3620) [skip ci] (winglian)
- 798c8fb — chore: update docker docs (#3623) (NanoCode012)
- 17fc747 — fix: docker build failing (#3622) (NanoCode012)
- 901f235 — dpo collation/padding (#3601) [skip ci] (winglian)
🔒Security observations
The codebase has moderate security concerns, primarily around credential management and Docker configuration. The most critical issues involve sensitive environment variables exposed in docker-compose.yaml and overly permissive volume mounts that could compromise the host system. The loose dependency versioning and unspecified base image versions create additional risk vectors for supply chain attacks. Immediate action is recommended to implement proper secrets management and restrict container access patterns. The project would benefit from implementing automated dependency scanning, container image scanning, and security policy enforcement in CI/CD pipelines.
- High · Sensitive Credentials Exposed in Docker Compose — docker-compose.yaml, environment section. The docker-compose.yaml file passes sensitive environment variables (GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, GIT_COMMITTER_NAME, GIT_COMMITTER_EMAIL, WANDB_API_KEY) directly from the host environment without validation or masking. These credentials could be exposed in container logs, process listings, or docker inspect output. Fix: Use Docker secrets management, .env files with proper .gitignore rules, or external secret management systems. Never log sensitive environment variables. Consider using --build-arg instead of environment variables for build-time secrets.
- Medium · Overly Permissive Volume Mounts — docker-compose.yaml, volumes section. The docker-compose.yaml mounts the entire workspace root (.) into the container at /workspace/axolotl and also mounts the user's HuggingFace cache (~/.cache/huggingface/). This could allow container escape or unauthorized access to sensitive model files and credentials. Fix: Restrict volume mounts to only necessary directories. Use read-only mounts where possible (e.g., ':ro'). Implement proper filesystem isolation and consider using bind mounts with specific directory restrictions.
- Medium · Loose Dependency Version Pinning — .runpod/requirements.txt and inferred dependencies. The runpod dependency is pinned with '~=1.7.0', which allows any 1.7.x patch release (>=1.7.0, <1.8.0). This flexible constraint could pull in new releases without explicit review. Fix: Use exact version pinning (==1.7.0) for production dependencies. If using flexible versioning, regularly audit and test dependency updates. Implement automated dependency scanning and security monitoring.
- Medium · Git Configuration Variables in Environment — docker-compose.yaml, GIT_* environment variables. Git author credentials are passed as environment variables, which may be logged or exposed. Git identity and credentials should be managed through git config --local or credential helpers, not environment variables. Fix: Configure git credentials using git config within the container or use SSH keys with proper agent forwarding. Avoid passing git credentials as plaintext environment variables.
- Low · Unspecified Docker Base Image — docker-compose.yaml, build.dockerfile and docker/Dockerfile*. Multiple Dockerfile variants exist (Dockerfile, Dockerfile-base, Dockerfile-tests, etc.), but the docker-compose.yaml references './docker/Dockerfile' without specifying a version. If the base image tag is 'latest' or unspecified, it could introduce unexpected changes. Fix: Use specific base image versions with digest pinning (e.g., FROM python:3.11.5-slim@sha256:...). Regularly scan and update base images, and document the rationale for version choices.
- Low · Development Server Configuration — docker-compose.yaml, command section. The docker-compose.yaml uses 'tail -f /dev/null' as the default command, which creates a long-running container in development mode. This may expose the service to unintended access if security groups or firewall rules are misconfigured. Fix: Ensure proper network isolation in development. Use explicit port mappings only when necessary. Document expected network exposure and implement health checks.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.