RepoPilot

databrickslabs/dolly

Databricks’ Dolly, a large language model trained on the Databricks Machine Learning Platform

Mixed

Stale — last commit 3y ago

Use as dependency: Mixed (weakest axis)

Last commit was 3y ago; no CI workflows detected

Fork & modify: Healthy

Has a license and tests — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Mixed

Last commit was 3y ago; no CI workflows detected

  • 13 active contributors
  • Distributed ownership (top contributor 33% of recent commits)
  • Apache-2.0 licensed
  • Tests present
  • Stale — last commit 3y ago
  • No CI workflows detected
What would change the summary?
  • Use as dependency: Mixed → Healthy if: ≥1 commit in the last 365 days
  • Deploy as-is: Mixed → Healthy if: ≥1 commit in the last 180 days

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — it updates live from the latest cached analysis.

Variant:
RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/databrickslabs/dolly?axis=fork)](https://repopilot.app/r/databrickslabs/dolly)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/databrickslabs/dolly on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: databrickslabs/dolly

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/databrickslabs/dolly shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Stale — last commit 3y ago

  • 13 active contributors
  • Distributed ownership (top contributor 33% of recent commits)
  • Apache-2.0 licensed
  • Tests present
  • ⚠ Stale — last commit 3y ago
  • ⚠ No CI workflows detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live databrickslabs/dolly repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/databrickslabs/dolly.

What it runs against: a local clone of databrickslabs/dolly — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in databrickslabs/dolly | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | Last commit ≤ 1072 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>databrickslabs/dolly</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of databrickslabs/dolly. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/databrickslabs/dolly.git
#   cd dolly
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of databrickslabs/dolly and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "databrickslabs/dolly(\.git)?\b" \
  && ok "origin remote is databrickslabs/dolly" \
  || miss "origin remote is not databrickslabs/dolly (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
# (the Apache-2.0 LICENSE text puts "Apache License" and "Version 2.0" on separate lines)
(grep -qi "Apache License" LICENSE 2>/dev/null && grep -q "Version 2\.0" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 1072 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1042d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/databrickslabs/dolly"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Dolly is Databricks' instruction-following LLM (12B parameters) derived from EleutherAI's Pythia-12b, fine-tuned on ~15k instruction/response pairs from the databricks-dolly-15k dataset. It solves the problem of training and deploying a commercial-use LLM that follows natural language instructions without the massive compute and licensing restrictions of closed-source alternatives. Three-layer structure: config/ holds hardware-specific training configs (a100_config.json, a10_config.json, v100_config.json); training/ contains the core trainer (trainer.py), data generation (generate.py), and constants; and examples/ provides runnable demos (generation.py for inference, pipeline.py for workflows, langchain.py for LLM integration). The entry point is train_dolly.py.

👥Who it's for

ML engineers and researchers at organizations wanting to fine-tune and deploy their own instruction-following language models using Databricks infrastructure; teams building on Pythia-12b who need a reference implementation with training scripts, inference examples, and configuration for multi-GPU setups (A100, A10, V100).

🌱Maturity & risk

Built by Databricks Labs but no longer actively maintained (last commit roughly 3 years ago) and not production-hardened; the codebase is lean (~33K Python lines), includes pytest coverage infrastructure (.coveragerc, test/test_trainer.py, run_pytest.sh) and configuration for multiple GPU architectures, but the README explicitly states 'dolly-v2-12b is not state-of-the-art' and documents known limitations (poor math, hallucination, syntax complexity). This is a solid reference implementation and research artifact, not a battle-tested platform.

Moderate risk: tight version constraints (accelerate<1, deepspeed<0.9, transformers<5, torch<2) create long-term maintenance burden and potential breaking change surface; only 111 lines of shell scripts and minimal CI config suggest limited automation; the model itself has documented performance gaps (mathematical reasoning, programming, factual accuracy) that limit production utility. Dependency chain is deep (torch → cuda ecosystem) with high complexity.
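
Before launching training, it can save time to confirm your environment still falls inside those pinned ranges. A minimal sketch; the ranges below are copied from this report (requirements.txt in the repo is authoritative), and it assumes the packaging library is available, which it usually is alongside pip:

```python
# Quick environment check against the version ranges quoted in this report.
from importlib.metadata import PackageNotFoundError, version

from packaging.specifiers import SpecifierSet

PINS = {
    "torch": ">=1.13.1,<2",
    "transformers": ">=4.28.1,<5",
    "deepspeed": ">=0.8.3,<0.9",
    "accelerate": ">=0.16.0,<1",
    "datasets": ">=2.10.0,<3",
}

for name, spec in PINS.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"MISSING  {name}  (expected {spec})")
        continue
    status = "ok" if installed in SpecifierSet(spec) else "MISMATCH"
    print(f"{status:8s} {name}=={installed}  (expected {spec})")
```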

Active areas of work

No recent commit data visible in the repo structure provided, but the codebase is feature-complete for its scope: training pipeline infrastructure is established, three hardware configs are published, and LangChain integration example suggests ongoing ecosystem compatibility work. Project appears in maintenance mode rather than active feature development.

🚀Get running

```bash
git clone https://github.com/databrickslabs/dolly.git
cd dolly
pip install -r requirements.txt
pip install -r requirements_dev.txt  # for development/testing
python examples/generation.py  # run inference example
```

Daily commands:

  • Training: python train_dolly.py --config config/a100_config.json (or substitute a10_config.json / v100_config.json)
  • Inference: python examples/generation.py
  • Tests: ./run_pytest.sh or pytest test/test_trainer.py

🗺️Map of the codebase

  • train_dolly.py: Main entry point; CLI interface for launching distributed training with config selection and Accelerate integration
  • training/trainer.py: Core training loop, model loading from HuggingFace, loss computation, checkpoint saving; most critical for understanding fine-tuning mechanics
  • training/consts.py: Shared constants and hyperparameter defaults (batch size, learning rate, warmup steps) used across train_dolly.py and trainer.py
  • config/a100_config.json: Reference hardware-specific training config; shows expected structure for distributed training parameters (per_device_train_batch_size, gradient_accumulation_steps, deepspeed config)
  • examples/generation.py: Minimal inference example showing model loading and prompt completion; template for users building on Dolly
  • test/test_trainer.py: Test suite for training pipeline; validates trainer.py behavior and is required reference for CI/contributions
  • data/README.md: Documents the databricks-dolly-15k dataset structure, instruction categories, and how training data maps to model capability domains

🛠️How to make changes

For training logic changes: edit training/trainer.py (core training loop) and training/consts.py (hyperparameters). For new hardware support: add config/new_gpu_config.json and update train_dolly.py argument parsing. For inference patterns: add examples/your_use_case.py following examples/generation.py structure. For dataset changes: modify training/generate.py or data loading in training/trainer.py.
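
For the new-hardware-support path, the wiring might look roughly like the sketch below. This is a hypothetical illustration, not train_dolly.py's actual CLI: the --config flag is inferred from the commands in Get running, and the load_training_config helper is invented here. Resolving paths against the repo root also sidesteps the hardcoded-CWD trap noted under Traps & gotchas.

```python
# Hypothetical sketch of selecting a hardware config by name; verify against
# the real argument handling in train_dolly.py before reusing.
import argparse
import json
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent  # anchor to the repo, not the caller's CWD
CONFIG_DIR = REPO_ROOT / "config"


def load_training_config(name_or_path: str) -> dict:
    """Accept 'a100_config', 'a100_config.json', or a full path."""
    candidate = Path(name_or_path)
    if not candidate.exists():
        candidate = CONFIG_DIR / f"{Path(name_or_path).stem}.json"
    if not candidate.exists():
        available = sorted(p.name for p in CONFIG_DIR.glob("*_config.json"))
        raise FileNotFoundError(f"No config named {name_or_path!r}; available: {available}")
    return json.loads(candidate.read_text())


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Launch Dolly fine-tuning (sketch)")
    parser.add_argument("--config", default="a100_config", help="file in config/, e.g. h100_config")
    args = parser.parse_args()
    print(load_training_config(args.config))
```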

🪤Traps & gotchas

DeepSpeed version constraint (>=0.8.3,<0.9) is brittle; minor version bumps may break training. Accelerate and Transformers have tight version coupling with torch; installing mismatched versions silently fails on obscure CUDA/CPU paths. Config JSON paths are hardcoded relative to CWD in train_dolly.py, so running from different directories may fail silently. No explicit documentation on required CUDA version or torch build variants (CPU vs GPU); users on CPU-only may hit silent failures in deepspeed initialization. HuggingFace model loading expects network access and will cache models to ~/.cache/huggingface/; offline training requires pre-caching.
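
One way to defuse the offline-training trap is to pre-populate the Hugging Face cache from a machine that does have network access, then force offline mode on the training host so a cache miss fails loudly instead of silently. A minimal sketch; the repo IDs shown (the Pythia base model, the released Dolly weights, and the databricks-dolly-15k dataset) are examples, so adjust them to whatever you actually load:

```python
# Run this where you DO have network access; it fills ~/.cache/huggingface/
# (or $HF_HOME) with everything needed for offline use later.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="EleutherAI/pythia-12b")        # base checkpoint for fine-tuning
snapshot_download(repo_id="databricks/dolly-v2-12b")       # released weights for inference
snapshot_download(repo_id="databricks/databricks-dolly-15k", repo_type="dataset")

# On the offline training machine, reuse the same cache and forbid network calls:
#   export HF_HOME=/shared/hf-cache        # if the cache lives on shared storage
#   export HF_HUB_OFFLINE=1
#   export TRANSFORMERS_OFFLINE=1
```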

💡Concepts to learn

  • Instruction Fine-Tuning — Dolly's core capability: training a base LLM (Pythia) on instruction-response pairs to create an instruction-following model; understanding how training/trainer.py applies this technique is essential to modifying training behavior
  • Causal Language Modeling — Dolly is a causal LM (predicts next token given previous tokens); understanding this loss function in training/trainer.py helps debug generation quality and training convergence
  • Distributed Data-Parallel Training (DDP) — DeepSpeed and Accelerate in requirements.txt enable DDP across multiple GPUs; config/ files (a100, a10, v100) configure gradient accumulation and device distribution for this pattern
  • Token Sequence Truncation & Padding — Training on variable-length instruction-response pairs requires padding/truncation strategy that affects training dynamics; data/README.md likely documents max_length choices that map to config/ batch size decisions
  • Gradient Checkpointing — config/ JSON files likely include gradient_checkpointing flags to reduce memory during 12B parameter training; understanding this trade-off (computation vs memory) is key to tuning on smaller GPUs
  • Domain-Specific Fine-Tuning — Dolly's ~15k instruction corpus spans 7 capability domains (brainstorming, classification, QA, generation, etc.); data/README.md and training/generate.py should show how domain labels influence training signal
  • Prompt Template & Instruction Format — examples/generation.py and training/generate.py must define a consistent prompt format (e.g., '### Instruction:\n...\n### Response:\n...') that the model learns during fine-tuning; deviating from this at inference degrades quality (see the inference sketch after this list)
  • EleutherAI/pythia — Upstream base model (Pythia-12b) from which Dolly is fine-tuned; reference for pre-training and model architecture
  • huggingface/transformers — Dependency providing the Transformers library used for model loading, tokenization, and training loop scaffolding
  • microsoft/DeepSpeed — Distributed training framework integrated via config/ and training/trainer.py; critical for multi-GPU synchronization and optimization
  • hwchase17/langchain — Companion library for LLM orchestration; dolly/examples/langchain.py shows integration pattern for production deployments
  • databrickslabs/dolly-v2 — Successor/sibling project if separate repo exists; would track improvements beyond dolly-v2-12b baseline
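
The prompt-template concept above is worth seeing concretely. The sketch below uses the standard transformers text-generation pipeline against the smaller dolly-v2-3b sibling checkpoint; the template wording mirrors the common Alpaca-style format described in that item, but the canonical version is defined in the repo's training code, so verify against the source before relying on it:

```python
# Illustrative only -- the exact prompt format Dolly was trained with is
# defined in the repo's training code; check it before reusing.
from transformers import pipeline

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

generator = pipeline(
    "text-generation",
    model="databricks/dolly-v2-3b",  # smaller sibling of dolly-v2-12b, easier to experiment with
    device_map="auto",
)

prompt = PROMPT_TEMPLATE.format(instruction="Explain instruction fine-tuning in one sentence.")
print(generator(prompt, max_new_tokens=100)[0]["generated_text"])
```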

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for training/trainer.py with GPU simulation

Currently test/test_trainer.py exists but likely has minimal coverage. The trainer.py module handles critical training logic with DeepSpeed integration, distributed training, and model checkpointing. A new contributor could add targeted tests for: trainer initialization with different config files (a100_config.json, a10_config.json, v100_config.json), training loop with mock datasets, checkpoint saving/loading, and gradient accumulation. This directly prevents regressions in the core training pipeline. A config-validation test sketch follows the checklist below.

  • [ ] Expand test/test_trainer.py to import and instantiate Trainer class from training/trainer.py
  • [ ] Add parametrized tests using config files from config/ directory (a100_config.json, a10_config.json, v100_config.json)
  • [ ] Add tests for checkpoint save/load functionality referenced in trainer.py
  • [ ] Add tests for dataset loading with datasets library and tokenization pipeline
  • [ ] Run pytest with coverage (using .coveragerc) to achieve >80% coverage on training/trainer.py
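
A hypothetical starting point for the config-related checklist items; it only validates the JSON config files, since exercising the trainer itself needs GPUs or heavy mocking. The file name, key names, and directory layout are assumptions to adjust against the real repo:

```python
# test/test_configs.py (hypothetical) -- sanity-check the hardware configs.
import json
from pathlib import Path

import pytest

CONFIG_DIR = Path(__file__).resolve().parent.parent / "config"
CONFIG_FILES = sorted(CONFIG_DIR.glob("*_config.json"))


@pytest.mark.parametrize("config_path", CONFIG_FILES, ids=lambda p: p.name)
def test_config_is_valid_json(config_path):
    config = json.loads(config_path.read_text())
    assert isinstance(config, dict) and config, f"{config_path.name} should be a non-empty object"


@pytest.mark.parametrize("config_path", CONFIG_FILES, ids=lambda p: p.name)
def test_config_declares_batch_settings(config_path):
    # Key names are guesses based on typical DeepSpeed/Accelerate configs.
    config = json.loads(config_path.read_text())
    assert any(
        key in config
        for key in ("train_micro_batch_size_per_gpu", "per_device_train_batch_size", "train_batch_size")
    ), f"{config_path.name} has no recognizable batch-size setting"
```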

Add integration test for examples/generation.py and examples/pipeline.py with model mocking

The examples directory contains generation.py and pipeline.py (likely using langchain>=0.0.139) but these lack integration tests. New contributors could add test cases in test/ that verify the example scripts run end-to-end with mocked models, ensuring API compatibility and preventing breakage when dependencies update. This is high-value because examples are the primary entry point for users. A tiny-model smoke-test sketch follows the checklist below.

  • [ ] Create test/test_examples.py to test examples/generation.py with a mock or tiny language model
  • [ ] Add test for examples/langchain.py to verify langchain integration works with mocked LLM chain
  • [ ] Add test for examples/pipeline.py to verify the full pipeline (data loading → training config → generation) executes without errors
  • [ ] Verify tests pass in run_pytest.sh and are included in CI coverage
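
As a sketch of the mocked-model idea, the test below swaps a tiny random Hub checkpoint in for dolly-v2-12b so it runs on CPU in seconds. It does not import the example scripts directly; the model ID and the substitution strategy are assumptions, and the first run needs network access to download the tiny model:

```python
# test/test_examples.py (hypothetical) -- smoke-test the generation path with
# a tiny stand-in model instead of dolly-v2-12b.
from transformers import pipeline

TINY_MODEL = "hf-internal-testing/tiny-random-gpt2"  # any tiny causal LM works


def test_text_generation_smoke():
    generator = pipeline("text-generation", model=TINY_MODEL)
    out = generator("### Instruction:\nSay hi.\n\n### Response:\n", max_new_tokens=8)
    assert out and isinstance(out[0]["generated_text"], str)
```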

Add GitHub Actions workflow for dependency compatibility testing across Python versions

The requirements.txt pins specific version ranges (torch>=1.13.1,<2, transformers>=4.28.1,<5, deepspeed>=0.8.3,<0.9) but there's no CI validation. A new contributor could create a .github/workflows/test.yml that runs pytest against multiple Python versions (3.8, 3.9, 3.10, 3.11) and key dependency versions to catch breaking changes early. This prevents silent failures for users on different environments.

  • [ ] Create .github/workflows/test.yml with matrix strategy for Python versions 3.8-3.11
  • [ ] Add step to install dependencies from requirements.txt and requirements_dev.txt
  • [ ] Add step to run run_pytest.sh and report coverage
  • [ ] Optionally add separate job to test against minimum and maximum pinned dependency versions (e.g., transformers 4.28.1 vs <5)

🌿Good first issues

  • Add test coverage for training/generate.py: currently test/test_trainer.py exists but no tests for the data generation/formatting logic that creates instruction-response pairs from databricks-dolly-15k. A new file test/test_generate.py would validate data pipeline before training.
  • Document GPU memory requirements per config: config/ has three JSON files (a100, a10, v100) but README.md doesn't explain which config to choose based on available VRAM or training dataset size. Add a table in README.md mapping GPU models to supported batch sizes and estimated memory usage.
  • Add CPU fallback or validation in train_dolly.py: currently no check for GPU availability; users without CUDA will get cryptic DeepSpeed errors. Add explicit device detection and a helpful error message suggesting CPU-only alternatives or installation steps. A device-check sketch follows this list.
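
For the CPU-fallback item, a guard like the one below could fail fast with an actionable message. Where it hooks into train_dolly.py is an assumption, since the real startup flow may differ:

```python
# Hypothetical guard for train_dolly.py -- fail fast instead of surfacing a
# cryptic DeepSpeed initialization error on CPU-only machines.
import sys

import torch


def require_gpu() -> None:
    if not torch.cuda.is_available():
        sys.exit(
            "No CUDA device detected. The published configs assume A100/A10/V100 GPUs.\n"
            "Install a CUDA-enabled torch build or run on a GPU-backed cluster."
        )
    print(f"Found {torch.cuda.device_count()} CUDA device(s): {torch.cuda.get_device_name(0)}")


if __name__ == "__main__":
    require_gpu()
```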


📝Recent commits

  • e2d664d — Update README.md (matthayes)
  • f9dc122 — Update README.md (matthayes)
  • b337d53 — Merge pull request #200 from edurdevic/update-readme-DBR-13 (srowen)
  • b1e8ca0 — Update README.md changed quotes for DBR (edurdevic)
  • aead199 — Update README.md to 13.x ML or later (edurdevic)
  • 1b3d9b2 — Update README.md with instructions for DBR 13 (edurdevic)
  • aaa0ecb — Merge pull request #193 from srowen/DBR13 (srowen)
  • a357706 — Merge pull request #192 from srowen/bf16flag (srowen)
  • 10d0d65 — Remove DBR 12 notes in favor of 13; remove test loading data as it doesn't respect flag anyway (srowen)
  • 6609c85 — Set bf16 flags corretly for a10/a100 (srowen)

🔒Security observations

  • High · Outdated and Vulnerable Dependencies — requirements.txt. The codebase uses several dependencies with potentially known vulnerabilities due to loose version pinning. Specifically: accelerate (>=0.16.0,<1), datasets (>=2.10.0,<3), deepspeed (>=0.8.3,<0.9), and transformers (>=4.28.1,<5) allow for automatic installation of newer patch versions that may not be fully tested. Additionally, these are dated versions (from 2023) and may contain unpatched security issues. Fix: Pin exact versions of all dependencies after security verification. Implement dependency scanning with tools like Safety or Dependabot. Regularly update and audit dependencies for known CVEs.
  • High · Unrestricted Model and Data Download — examples/generation.py, examples/langchain.py, training/generate.py, training/trainer.py. The codebase appears to download models from Hugging Face (databricks/dolly-v2-12b) and datasets without apparent signature verification or integrity checks. The examples/langchain.py and training modules likely load remote resources without validation, creating supply chain attack vectors. Fix: Implement model/dataset integrity verification using cryptographic hashes. Verify downloaded artifacts against known good checksums. Consider using Hugging Face Hub's security features like model signing when available. A checksum-manifest sketch follows this list.
  • Medium · Potential Arbitrary Code Execution via Pickle/Model Loading — training/trainer.py, training/generate.py. The deepspeed and transformers libraries can execute arbitrary code when loading untrusted serialized models or datasets. The training pipeline likely uses pickle for state persistence without explicit safety checks. Fix: Use safe deserialization methods. Avoid loading models from untrusted sources. Implement model provenance verification. Use safetensors format instead of pickle where possible for model serialization.
  • Medium · Missing Input Validation in Text Generation — training/generate.py, examples/generation.py. The generation.py and trainer.py modules process user input for model inference without apparent sanitization. Large language models can be vulnerable to prompt injection attacks that may leak training data or produce unintended outputs. Fix: Implement input validation and sanitization. Add guardrails for prompt injection. Monitor for anomalous input patterns. Implement rate limiting on generation endpoints if exposed as a service.
  • Medium · Insecure Configuration File Handling — config/a100_config.json, config/a10_config.json, config/v100_config.json. Configuration files in config/ directory (a100_config.json, a10_config.json, v100_config.json) may contain sensitive hyperparameters or paths. No apparent encryption or access controls are visible for these configuration files. Fix: Store sensitive configuration in environment variables or secure vaults (e.g., AWS Secrets Manager, HashiCorp Vault). Implement file-level access controls. Avoid committing sensitive values to version control.
  • Medium · Insufficient Logging and Monitoring — training/trainer.py, examples/generation.py. The codebase lacks visible security-relevant logging for authentication, authorization, data access, and model usage. This inhibits security auditing and incident response capabilities. Fix: Implement comprehensive audit logging. Log all model loads, data access, and generation requests. Implement security event monitoring and alerting. Use structured logging with tamper-evident mechanisms.
  • Low · Missing Security Headers and API Hardening — examples/langchain.py. If the examples/langchain.py exposes endpoints (e.g., via FastAPI or Flask), there are no visible security headers or authentication mechanisms configured. Fix: If exposing APIs, implement authentication (API keys, OAuth2). Add security headers (CORS, CSP, HSTS if applicable). Implement rate limiting and request validation. Use HTTPS only.
  • Low · No Visible Security Testing — test/, pytest.ini. While pytest.ini and test/test_trainer.py exist, there are no apparent security-specific tests (SAST, dependency scanning, or fuzzing) in the test suite. Fix: Add security-focused testing: dependency vulnerability scanning, code security linting (bandit), and input fuzzing. Integrate SAST tools.
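
To make the integrity-verification suggestion concrete, one lightweight pattern is to record SHA-256 digests after the first trusted download and re-check them before each run. A sketch; the manifest format and file layout are assumptions, not something the repo ships today:

```python
# Sketch: record and re-check SHA-256 digests for a downloaded model snapshot.
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_tree(root: Path) -> dict:
    """Relative path -> SHA-256 for every file under a snapshot directory."""
    return {
        str(p.relative_to(root)): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_tree(root: Path, manifest: Path) -> None:
    if digest_tree(root) != json.loads(manifest.read_text()):
        raise RuntimeError(f"files under {root} no longer match the recorded manifest")


# After the first trusted download:
#   Path("dolly.manifest.json").write_text(json.dumps(digest_tree(Path("models/dolly-v2-12b")), indent=2))
# Before each training or inference run:
#   verify_tree(Path("models/dolly-v2-12b"), Path("dolly.manifest.json"))
```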

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
