RepoPilot

databrickslabs/dolly

Databricks’ Dolly, a large language model trained on the Databricks Machine Learning Platform

Mixed

Stale — last commit 3y ago

Use as dependency: Mixed (weakest axis)

Last commit was 3y ago; no CI workflows detected

Fork & modify: Healthy

Has a license and tests — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Mixed

Last commit was 3y ago; no CI workflows detected

  • 13 active contributors
  • Distributed ownership (top contributor 33% of recent commits)
  • Apache-2.0 licensed
  • Tests present
  • Stale — last commit 3y ago
  • No CI workflows detected
What would change the summary?
  • Use as dependency: Mixed → Healthy if: ≥1 commit in the last 365 days
  • Deploy as-is: Mixed → Healthy if: ≥1 commit in the last 180 days

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — it updates live from the latest cached analysis.

Variant:
RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/databrickslabs/dolly?axis=fork)](https://repopilot.app/r/databrickslabs/dolly)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/databrickslabs/dolly on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: databrickslabs/dolly

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/databrickslabs/dolly shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Stale — last commit 3y ago

  • 13 active contributors
  • Distributed ownership (top contributor 33% of recent commits)
  • Apache-2.0 licensed
  • Tests present
  • ⚠ Stale — last commit 3y ago
  • ⚠ No CI workflows detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live databrickslabs/dolly repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/databrickslabs/dolly.

What it runs against: a local clone of databrickslabs/dolly — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in databrickslabs/dolly | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | Last commit ≤ 1072 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>databrickslabs/dolly</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of databrickslabs/dolly. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/databrickslabs/dolly.git
#   cd dolly
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of databrickslabs/dolly and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "databrickslabs/dolly(\.git)?\b" \
  && ok "origin remote is databrickslabs/dolly" \
  || miss "origin remote is not databrickslabs/dolly (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
# (the Apache-2.0 LICENSE text puts "Apache License" and "Version 2.0" on separate lines)
(grep -qi "Apache License" LICENSE 2>/dev/null && grep -q "Version 2\.0" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 1072 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1042d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/databrickslabs/dolly"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Dolly is Databricks' instruction-following LLM (12B parameters) derived from EleutherAI's Pythia-12b, fine-tuned on ~15k instruction/response pairs from the databricks-dolly-15k dataset. It solves the problem of training and deploying a commercial-use LLM that follows natural language instructions without the massive compute and licensing restrictions of closed-source alternatives. Three-layer structure: config/ holds hardware-specific training configs (a100_config.json, a10_config.json, v100_config.json); training/ contains the core trainer (trainer.py), data generation (generate.py), and constants; and examples/ provides runnable demos (generation.py for inference, pipeline.py for workflows, langchain.py for LLM integration). The entry point is train_dolly.py.

👥Who it's for

ML engineers and researchers at organizations wanting to fine-tune and deploy their own instruction-following language models using Databricks infrastructure; teams building on Pythia-12b who need a reference implementation with training scripts, inference examples, and configuration for multi-GPU setups (A100, A10, V100).

🌱Maturity & risk

Built by Databricks Labs but no longer actively maintained (last commit roughly 3 years ago) and not production-hardened; the codebase is lean (~33K Python lines), includes pytest coverage infrastructure (.coveragerc, test/test_trainer.py, run_pytest.sh) and configuration for multiple GPU architectures, but the README explicitly states 'dolly-v2-12b is not state-of-the-art' and documents known limitations (poor math, hallucination, syntax complexity). This is a solid reference implementation and research artifact, not a battle-tested platform.

Moderate risk: tight version constraints (accelerate<1, deepspeed<0.9, transformers<5, torch<2) create long-term maintenance burden and potential breaking change surface; only 111 lines of shell scripts and minimal CI config suggest limited automation; the model itself has documented performance gaps (mathematical reasoning, programming, factual accuracy) that limit production utility. Dependency chain is deep (torch → cuda ecosystem) with high complexity.
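
Before launching training, it can save time to confirm your environment still falls inside those pinned ranges. A minimal sketch; the ranges below are copied from this report (requirements.txt in the repo is authoritative), and it assumes the packaging library is available, which it usually is alongside pip:

```python
# Quick environment check against the version ranges quoted in this report.
from importlib.metadata import PackageNotFoundError, version

from packaging.specifiers import SpecifierSet

PINS = {
    "torch": ">=1.13.1,<2",
    "transformers": ">=4.28.1,<5",
    "deepspeed": ">=0.8.3,<0.9",
    "accelerate": ">=0.16.0,<1",
    "datasets": ">=2.10.0,<3",
}

for name, spec in PINS.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"MISSING  {name}  (expected {spec})")
        continue
    status = "ok" if installed in SpecifierSet(spec) else "MISMATCH"
    print(f"{status:8s} {name}=={installed}  (expected {spec})")
```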

Active areas of work

No recent commit data visible in the repo structure provided, but the codebase is feature-complete for its scope: training pipeline infrastructure is established, three hardware configs are published, and LangChain integration example suggests ongoing ecosystem compatibility work. Project appears in maintenance mode rather than active feature development.

🚀Get running

```bash
git clone https://github.com/databrickslabs/dolly.git
cd dolly
pip install -r requirements.txt
pip install -r requirements_dev.txt  # for development/testing
python examples/generation.py  # run inference example
```

Daily commands:

  • Training: python train_dolly.py --config config/a100_config.json (or substitute a10_config.json / v100_config.json)
  • Inference: python examples/generation.py
  • Tests: ./run_pytest.sh or pytest test/test_trainer.py

🗺️Map of the codebase

  • train_dolly.py: Main entry point; CLI interface for launching distributed training with config selection and Accelerate integration
  • training/trainer.py: Core training loop, model loading from HuggingFace, loss computation, checkpoint saving; most critical for understanding fine-tuning mechanics
  • training/consts.py: Shared constants and hyperparameter defaults (batch size, learning rate, warmup steps) used across train_dolly.py and trainer.py
  • config/a100_config.json: Reference hardware-specific training config; shows expected structure for distributed training parameters (per_device_train_batch_size, gradient_accumulation_steps, deepspeed config)
  • examples/generation.py: Minimal inference example showing model loading and prompt completion; template for users building on Dolly
  • test/test_trainer.py: Test suite for training pipeline; validates trainer.py behavior and is required reference for CI/contributions
  • data/README.md: Documents the databricks-dolly-15k dataset structure, instruction categories, and how training data maps to model capability domains

🛠️How to make changes

For training logic changes: edit training/trainer.py (core training loop) and training/consts.py (hyperparameters). For new hardware support: add config/new_gpu_config.json and update train_dolly.py argument parsing. For inference patterns: add examples/your_use_case.py following examples/generation.py structure. For dataset changes: modify training/generate.py or data loading in training/trainer.py.
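
For the new-hardware-support path, the wiring might look roughly like the sketch below. This is a hypothetical illustration, not train_dolly.py's actual CLI: the --config flag is inferred from the commands in Get running, and the load_training_config helper is invented here. Resolving paths against the repo root also sidesteps the hardcoded-CWD trap noted under Traps & gotchas.

```python
# Hypothetical sketch of selecting a hardware config by name; verify against
# the real argument handling in train_dolly.py before reusing.
import argparse
import json
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent  # anchor to the repo, not the caller's CWD
CONFIG_DIR = REPO_ROOT / "config"


def load_training_config(name_or_path: str) -> dict:
    """Accept 'a100_config', 'a100_config.json', or a full path."""
    candidate = Path(name_or_path)
    if not candidate.exists():
        candidate = CONFIG_DIR / f"{Path(name_or_path).stem}.json"
    if not candidate.exists():
        available = sorted(p.name for p in CONFIG_DIR.glob("*_config.json"))
        raise FileNotFoundError(f"No config named {name_or_path!r}; available: {available}")
    return json.loads(candidate.read_text())


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Launch Dolly fine-tuning (sketch)")
    parser.add_argument("--config", default="a100_config", help="file in config/, e.g. h100_config")
    args = parser.parse_args()
    print(load_training_config(args.config))
```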

🪤Traps & gotchas

DeepSpeed version constraint (>=0.8.3,<0.9) is brittle; minor version bumps may break training. Accelerate and Transformers have tight version coupling with torch; installing mismatched versions silently fails on obscure CUDA/CPU paths. Config JSON paths are hardcoded relative to CWD in train_dolly.py, so running from different directories may fail silently. No explicit documentation on required CUDA version or torch build variants (CPU vs GPU); users on CPU-only may hit silent failures in deepspeed initialization. HuggingFace model loading expects network access and will cache models to ~/.cache/huggingface/; offline training requires pre-caching.
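
One way to defuse the offline-training trap is to pre-populate the Hugging Face cache from a machine that does have network access, then force offline mode on the training host so a cache miss fails loudly instead of silently. A minimal sketch; the repo IDs shown (the Pythia base model, the released Dolly weights, and the databricks-dolly-15k dataset) are examples, so adjust them to whatever you actually load:

```python
# Run this where you DO have network access; it fills ~/.cache/huggingface/
# (or $HF_HOME) with everything needed for offline use later.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="EleutherAI/pythia-12b")        # base checkpoint for fine-tuning
snapshot_download(repo_id="databricks/dolly-v2-12b")       # released weights for inference
snapshot_download(repo_id="databricks/databricks-dolly-15k", repo_type="dataset")

# On the offline training machine, reuse the same cache and forbid network calls:
#   export HF_HOME=/shared/hf-cache        # if the cache lives on shared storage
#   export HF_HUB_OFFLINE=1
#   export TRANSFORMERS_OFFLINE=1
```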

💡Concepts to learn

  • Instruction Fine-Tuning — Dolly's core capability: training a base LLM (Pythia) on instruction-response pairs to create an instruction-following model; understanding how training/trainer.py applies this technique is essential to modifying training behavior
  • Causal Language Modeling — Dolly is a causal LM (predicts next token given previous tokens); understanding this loss function in training/trainer.py helps debug generation quality and training convergence
  • Distributed Data-Parallel Training (DDP) — DeepSpeed and Accelerate in requirements.txt enable DDP across multiple GPUs; config/ files (a100, a10, v100) configure gradient accumulation and device distribution for this pattern
  • Token Sequence Truncation & Padding — Training on variable-length instruction-response pairs requires padding/truncation strategy that affects training dynamics; data/README.md likely documents max_length choices that map to config/ batch size decisions
  • Gradient Checkpointing — config/ JSON files likely include gradient_checkpointing flags to reduce memory during 12B parameter training; understanding this trade-off (computation vs memory) is key to tuning on smaller GPUs
  • Domain-Specific Fine-Tuning — Dolly's ~15k instruction corpus spans 7 capability domains (brainstorming, classification, QA, generation, etc.); data/README.md and training/generate.py should show how domain labels influence training signal
  • Prompt Template & Instruction Format — examples/generation.py and training/generate.py must define a consistent prompt format (e.g., '### Instruction:\n...\n### Response:\n...') that the model learns during fine-tuning; deviating from this at inference degrades quality (see the inference sketch after this list)
  • EleutherAI/pythia — Upstream base model (Pythia-12b) from which Dolly is fine-tuned; reference for pre-training and model architecture
  • huggingface/transformers — Dependency providing the Transformers library used for model loading, tokenization, and training loop scaffolding
  • microsoft/DeepSpeed — Distributed training framework integrated via config/ and training/trainer.py; critical for multi-GPU synchronization and optimization
  • hwchase17/langchain — Companion library for LLM orchestration; dolly/examples/langchain.py shows integration pattern for production deployments
  • databrickslabs/dolly-v2 — Successor/sibling project if separate repo exists; would track improvements beyond dolly-v2-12b baseline
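
The prompt-template concept above is worth seeing concretely. The sketch below uses the standard transformers text-generation pipeline against the smaller dolly-v2-3b sibling checkpoint; the template wording mirrors the common Alpaca-style format described in that item, but the canonical version is defined in the repo's training code, so verify against the source before relying on it:

```python
# Illustrative only -- the exact prompt format Dolly was trained with is
# defined in the repo's training code; check it before reusing.
from transformers import pipeline

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

generator = pipeline(
    "text-generation",
    model="databricks/dolly-v2-3b",  # smaller sibling of dolly-v2-12b, easier to experiment with
    device_map="auto",
)

prompt = PROMPT_TEMPLATE.format(instruction="Explain instruction fine-tuning in one sentence.")
print(generator(prompt, max_new_tokens=100)[0]["generated_text"])
```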

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for training/trainer.py with GPU simulation

Currently test/test_trainer.py exists but likely has minimal coverage. The trainer.py module handles critical training logic with DeepSpeed integration, distributed training, and model checkpointing. A new contributor could add targeted tests for: trainer initialization with different config files (a100_config.json, a10_config.json, v100_config.json), training loop with mock datasets, checkpoint saving/loading, and gradient accumulation. This directly prevents regressions in the core training pipeline. A config-validation test sketch follows the checklist below.

  • [ ] Expand test/test_trainer.py to import and instantiate Trainer class from training/trainer.py
  • [ ] Add parametrized tests using config files from config/ directory (a100_config.json, a10_config.json, v100_config.json)
  • [ ] Add tests for checkpoint save/load functionality referenced in trainer.py
  • [ ] Add tests for dataset loading with datasets library and tokenization pipeline
  • [ ] Run pytest with coverage (using .coveragerc) to achieve >80% coverage on training/trainer.py
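
A hypothetical starting point for the config-related checklist items; it only validates the JSON config files, since exercising the trainer itself needs GPUs or heavy mocking. The file name, key names, and directory layout are assumptions to adjust against the real repo:

```python
# test/test_configs.py (hypothetical) -- sanity-check the hardware configs.
import json
from pathlib import Path

import pytest

CONFIG_DIR = Path(__file__).resolve().parent.parent / "config"
CONFIG_FILES = sorted(CONFIG_DIR.glob("*_config.json"))


@pytest.mark.parametrize("config_path", CONFIG_FILES, ids=lambda p: p.name)
def test_config_is_valid_json(config_path):
    config = json.loads(config_path.read_text())
    assert isinstance(config, dict) and config, f"{config_path.name} should be a non-empty object"


@pytest.mark.parametrize("config_path", CONFIG_FILES, ids=lambda p: p.name)
def test_config_declares_batch_settings(config_path):
    # Key names are guesses based on typical DeepSpeed/Accelerate configs.
    config = json.loads(config_path.read_text())
    assert any(
        key in config
        for key in ("train_micro_batch_size_per_gpu", "per_device_train_batch_size", "train_batch_size")
    ), f"{config_path.name} has no recognizable batch-size setting"
```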

Add integration test for examples/generation.py and examples/pipeline.py with model mocking

The examples directory contains generation.py and pipeline.py (likely using langchain>=0.0.139) but these lack integration tests. New contributors could add test cases in test/ that verify the example scripts run end-to-end with mocked models, ensuring API compatibility and preventing breakage when dependencies update. This is high-value because examples are the primary entry point for users. A tiny-model smoke-test sketch follows the checklist below.

  • [ ] Create test/test_examples.py to test examples/generation.py with a mock or tiny language model
  • [ ] Add test for examples/langchain.py to verify langchain integration works with mocked LLM chain
  • [ ] Add test for examples/pipeline.py to verify the full pipeline (data loading → training config → generation) executes without errors
  • [ ] Verify tests pass in run_pytest.sh and are included in CI coverage
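
As a sketch of the mocked-model idea, the test below swaps a tiny random Hub checkpoint in for dolly-v2-12b so it runs on CPU in seconds. It does not import the example scripts directly; the model ID and the substitution strategy are assumptions, and the first run needs network access to download the tiny model:

```python
# test/test_examples.py (hypothetical) -- smoke-test the generation path with
# a tiny stand-in model instead of dolly-v2-12b.
from transformers import pipeline

TINY_MODEL = "hf-internal-testing/tiny-random-gpt2"  # any tiny causal LM works


def test_text_generation_smoke():
    generator = pipeline("text-generation", model=TINY_MODEL)
    out = generator("### Instruction:\nSay hi.\n\n### Response:\n", max_new_tokens=8)
    assert out and isinstance(out[0]["generated_text"], str)
```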

Add GitHub Actions workflow for dependency compatibility testing across Python versions

The requirements.txt pins specific version ranges (torch>=1.13.1,<2, transformers>=4.28.1,<5, deepspeed>=0.8.3,<0.9) but there's no CI validation. A new contributor could create a .github/workflows/test.yml that runs pytest against multiple Python versions (3.8, 3.9, 3.10, 3.11) and key dependency versions to catch breaking changes early. This prevents silent failures for users on different environments.

  • [ ] Create .github/workflows/test.yml with matrix strategy for Python versions 3.8-3.11
  • [ ] Add step to install dependencies from requirements.txt and requirements_dev.txt
  • [ ] Add step to run run_pytest.sh and report coverage
  • [ ] Optionally add separate job to test against minimum and maximum pinned dependency versions (e.g., transformers 4.28.1 vs <5)

🌿Good first issues

  • Add test coverage for training/generate.py: currently test/test_trainer.py exists but no tests for the data generation/formatting logic that creates instruction-response pairs from databricks-dolly-15k. A new file test/test_generate.py would validate data pipeline before training.
  • Document GPU memory requirements per config: config/ has three JSON files (a100, a10, v100) but README.md doesn't explain which config to choose based on available VRAM or training dataset size. Add a table in README.md mapping GPU models to supported batch sizes and estimated memory usage.
  • Add CPU fallback or validation in train_dolly.py: currently no check for GPU availability; users without CUDA will get cryptic DeepSpeed errors. Add explicit device detection and a helpful error message suggesting CPU-only alternatives or installation steps. A device-check sketch follows this list.
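
For the CPU-fallback item, a guard like the one below could fail fast with an actionable message. Where it hooks into train_dolly.py is an assumption, since the real startup flow may differ:

```python
# Hypothetical guard for train_dolly.py -- fail fast instead of surfacing a
# cryptic DeepSpeed initialization error on CPU-only machines.
import sys

import torch


def require_gpu() -> None:
    if not torch.cuda.is_available():
        sys.exit(
            "No CUDA device detected. The published configs assume A100/A10/V100 GPUs.\n"
            "Install a CUDA-enabled torch build or run on a GPU-backed cluster."
        )
    print(f"Found {torch.cuda.device_count()} CUDA device(s): {torch.cuda.get_device_name(0)}")


if __name__ == "__main__":
    require_gpu()
```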


📝Recent commits

  • e2d664d — Update README.md (matthayes)
  • f9dc122 — Update README.md (matthayes)
  • b337d53 — Merge pull request #200 from edurdevic/update-readme-DBR-13 (srowen)
  • b1e8ca0 — Update README.md changed quotes for DBR (edurdevic)
  • aead199 — Update README.md to 13.x ML or later (edurdevic)
  • 1b3d9b2 — Update README.md with instructions for DBR 13 (edurdevic)
  • aaa0ecb — Merge pull request #193 from srowen/DBR13 (srowen)
  • a357706 — Merge pull request #192 from srowen/bf16flag (srowen)
  • 10d0d65 — Remove DBR 12 notes in favor of 13; remove test loading data as it doesn't respect flag anyway (srowen)
  • 6609c85 — Set bf16 flags corretly for a10/a100 (srowen)

🔒Security observations

  • High · Outdated and Vulnerable Dependencies — requirements.txt. The codebase uses several dependencies with potentially known vulnerabilities due to loose version pinning. Specifically: accelerate (>=0.16.0,<1), datasets (>=2.10.0,<3), deepspeed (>=0.8.3,<0.9), and transformers (>=4.28.1,<5) allow for automatic installation of newer patch versions that may not be fully tested. Additionally, these are dated versions (from 2023) and may contain unpatched security issues. Fix: Pin exact versions of all dependencies after security verification. Implement dependency scanning with tools like Safety or Dependabot. Regularly update and audit dependencies for known CVEs.
  • High · Unrestricted Model and Data Download — examples/generation.py, examples/langchain.py, training/generate.py, training/trainer.py. The codebase appears to download models from Hugging Face (databricks/dolly-v2-12b) and datasets without apparent signature verification or integrity checks. The examples/langchain.py and training modules likely load remote resources without validation, creating supply chain attack vectors. Fix: Implement model/dataset integrity verification using cryptographic hashes. Verify downloaded artifacts against known good checksums. Consider using Hugging Face Hub's security features like model signing when available. A checksum-manifest sketch follows this list.
  • Medium · Potential Arbitrary Code Execution via Pickle/Model Loading — training/trainer.py, training/generate.py. The deepspeed and transformers libraries can execute arbitrary code when loading untrusted serialized models or datasets. The training pipeline likely uses pickle for state persistence without explicit safety checks. Fix: Use safe deserialization methods. Avoid loading models from untrusted sources. Implement model provenance verification. Use safetensors format instead of pickle where possible for model serialization.
  • Medium · Missing Input Validation in Text Generation — training/generate.py, examples/generation.py. The generation.py and trainer.py modules process user input for model inference without apparent sanitization. Large language models can be vulnerable to prompt injection attacks that may leak training data or produce unintended outputs. Fix: Implement input validation and sanitization. Add guardrails for prompt injection. Monitor for anomalous input patterns. Implement rate limiting on generation endpoints if exposed as a service.
  • Medium · Insecure Configuration File Handling — config/a100_config.json, config/a10_config.json, config/v100_config.json. Configuration files in config/ directory (a100_config.json, a10_config.json, v100_config.json) may contain sensitive hyperparameters or paths. No apparent encryption or access controls are visible for these configuration files. Fix: Store sensitive configuration in environment variables or secure vaults (e.g., AWS Secrets Manager, HashiCorp Vault). Implement file-level access controls. Avoid committing sensitive values to version control.
  • Medium · Insufficient Logging and Monitoring — training/trainer.py, examples/generation.py. The codebase lacks visible security-relevant logging for authentication, authorization, data access, and model usage. This inhibits security auditing and incident response capabilities. Fix: Implement comprehensive audit logging. Log all model loads, data access, and generation requests. Implement security event monitoring and alerting. Use structured logging with tamper-evident mechanisms.
  • Low · Missing Security Headers and API Hardening — examples/langchain.py. If the examples/langchain.py exposes endpoints (e.g., via FastAPI or Flask), there are no visible security headers or authentication mechanisms configured. Fix: If exposing APIs, implement authentication (API keys, OAuth2). Add security headers (CORS, CSP, HSTS if applicable). Implement rate limiting and request validation. Use HTTPS only.
  • Low · No Visible Security Testing — test/, pytest.ini. While pytest.ini and test/test_trainer.py exist, there are no apparent security-specific tests (SAST, dependency scanning, or fuzzing) in the test suite. Fix: Add security-focused testing: dependency vulnerability scanning, code security linting (bandit), and input fuzzing. Integrate SAST tools.
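
To make the integrity-verification suggestion concrete, one lightweight pattern is to record SHA-256 digests after the first trusted download and re-check them before each run. A sketch; the manifest format and file layout are assumptions, not something the repo ships today:

```python
# Sketch: record and re-check SHA-256 digests for a downloaded model snapshot.
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_tree(root: Path) -> dict:
    """Relative path -> SHA-256 for every file under a snapshot directory."""
    return {
        str(p.relative_to(root)): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_tree(root: Path, manifest: Path) -> None:
    if digest_tree(root) != json.loads(manifest.read_text()):
        raise RuntimeError(f"files under {root} no longer match the recorded manifest")


# After the first trusted download:
#   Path("dolly.manifest.json").write_text(json.dumps(digest_tree(Path("models/dolly-v2-12b")), indent=2))
# Before each training or inference run:
#   verify_tree(Path("models/dolly-v2-12b"), Path("dolly.manifest.json"))
```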

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
