allenai/olmocr
Toolkit for linearizing PDFs for LLM datasets/training
Healthy across all four use cases
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit 6w ago
- ✓ 2 active contributors
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Small team — 2 contributors active in recent commits
- ⚠ Single-maintainer risk — top contributor 97% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding: allenai/olmocr
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/allenai/olmocr shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit 6w ago
- 2 active contributors
- Apache-2.0 licensed
- CI configured
- Tests present
- ⚠ Small team — 2 contributors active in recent commits
- ⚠ Single-maintainer risk — top contributor 97% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live allenai/olmocr
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/allenai/olmocr.
What it runs against: a local clone of allenai/olmocr — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in allenai/olmocr | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 72 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of allenai/olmocr. If you don't
# have one yet, run these first:
#
# git clone https://github.com/allenai/olmocr.git
# cd olmocr
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of allenai/olmocr and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "allenai/olmocr(\.git)?\b" \
&& ok "origin remote is allenai/olmocr" \
|| miss "origin remote is not allenai/olmocr (artifact may be from a fork)"
# 2. License matches what RepoPilot saw (matches the standard Apache LICENSE header)
(grep -qiE "apache license" LICENSE 2>/dev/null \
|| grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
&& ok "license is Apache-2.0" \
|| miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
&& ok "default branch main exists" \
|| miss "default branch main no longer exists"
# 4. Critical files exist
test -f "olmocr/__init__.py" \
&& ok "olmocr/__init__.py" \
|| miss "missing critical file: olmocr/__init__.py"
test -f "README.md" \
&& ok "README.md" \
|| miss "missing critical file: README.md"
test -f "olmocr/bench/benchmark.py" \
&& ok "olmocr/bench/benchmark.py" \
|| miss "missing critical file: olmocr/bench/benchmark.py"
test -f "olmocr/bench/runners/run_olmocr_pipeline.py" \
&& ok "olmocr/bench/runners/run_olmocr_pipeline.py" \
|| miss "missing critical file: olmocr/bench/runners/run_olmocr_pipeline.py"
test -f "olmocr/bench/prompts.py" \
&& ok "olmocr/bench/prompts.py" \
|| miss "missing critical file: olmocr/bench/prompts.py"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 72 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~42d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/allenai/olmocr"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
olmOCR is a Vision Language Model (VLM)-based toolkit that converts PDFs and image-based documents (PNG, JPEG) into clean, structured Markdown text with support for equations, tables, handwriting, and complex layouts. It handles multi-column documents, automatic header/footer removal, and reading-order normalization using a 7B parameter model, costing ~$200 USD per million pages on GPU infrastructure. Monolithic Python package under olmocr/ with modular subdirectories: olmocr/bench/ contains the olmOCR-Bench evaluation suite with miners (check_headers_footers.py, check_multicolumn.py, cleanup_data.py) and KaTeX rendering (katex/render.py); olmocr/train/ (mentioned in changelog) houses trainer code; docs/ contains Sphinx documentation with .readthedocs.yaml for CI building. Core inference and PDF handling logic resides in top-level olmocr/ modules.
👥Who it's for
Machine learning engineers and data scientists building LLM training datasets who need to extract high-quality text from PDFs at scale; document processing pipeline developers who require Markdown output with equation/table preservation; researchers training OCR/document understanding models on the olmOCR-Bench benchmark.
🌱Maturity & risk
Production-ready with active development: v0.4.0 released October 2025 with RL training improvements, multiple model versions released throughout 2025 (0.1.68→0.4.0), Docker support added May 2025, and comprehensive CI/CD via .github/workflows/main.yml and pr_checks.yml. The project has public papers (arxiv.org/abs/2510.19817, 2502.18443) and an online demo at olmocr.allenai.org.
Moderate risk: GPU dependence (vLLM inference pipeline as of June 2025) creates deployment constraints; maintainership is visibly concentrated in a single organization (AI2); VLM output quality depends on external Hugging Face model updates (allenai/olmOCR-2-7B-1025-FP8); the codebase is large (~2.2MB of Python) with heavy dependencies and potential transitive dependency conflicts. However, Docker containerization (Dockerfile, Dockerfile.with-model) mitigates environment inconsistencies.
Active areas of work
Most recent activity: October 21, 2025 - v0.4.0 release with new FP8 model introducing RL training and synthetic data improvements (+4 points on olmOCR-Bench); previous v0.3.0 (August 2025) fixed auto-rotation and blank document hallucinations; June 2025 switched inference backend from sglang to vLLM; active benchmark suite development with new evaluation miners.
🚀Get running
Check README for instructions.
Daily commands:
- Development: make test or make (Makefile inferred).
- Production (Docker): docker build -f Dockerfile.with-model -t olmocr:latest . then docker run --gpus all olmocr:latest.
- CLI/Python: python -m olmocr <pdf_path> (inferred from structure).
- Local GPU server: vLLM runs as the inference backend (see the June 2025 migration in the CHANGELOG).
🗺️Map of the codebase
- olmocr/__init__.py — Main package entry point that exports the public API for PDF linearization and OCR functionality.
- README.md — Essential overview of olmOCR toolkit capabilities, benchmarks, and usage patterns that orients all new contributors.
- olmocr/bench/benchmark.py — Core benchmarking framework that defines how OCR outputs are evaluated and compared across multiple systems.
- olmocr/bench/runners/run_olmocr_pipeline.py — Reference implementation of the primary olmOCR pipeline showing end-to-end PDF-to-text transformation.
- olmocr/bench/prompts.py — Centralized LLM prompts used for mining, validation, and quality assessment across all benchmark tools.
- docs/source/overview.md — High-level architecture and design documentation that explains the toolkit's approach to document linearization.
- Dockerfile — Production container configuration showing runtime dependencies and environment setup required for the full pipeline.
🛠️How to make changes
Add a new OCR system integration
- Create a new runner file following the pattern run_<system_name>.py in olmocr/bench/runners/ (olmocr/bench/runners/run_marker.py)
- Implement the OCR system invocation, input handling, and output parsing to produce standardized Markdown output (olmocr/bench/runners/run_olmocr_pipeline.py); a minimal runner sketch follows this list
- Register the new runner in the benchmark comparison framework and evaluation harness (olmocr/bench/benchmark.py)
- Add integration tests or sample outputs to olmocr/bench/sample_data/ for validation (olmocr/bench/sample_data/olmocr_pipeline/)
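To make the runner contract concrete, here is a minimal sketch of what such a file could look like. The module name, function signature, and the pypdf stand-in are illustrative assumptions, not the repo's actual runner interface; mirror an existing runner such as run_marker.py before copying anything.

```python
# Hypothetical runner sketch, e.g. olmocr/bench/runners/run_mysystem.py.
# The shape below is inferred from the pattern described above, not
# taken from the repo's actual runner interface.
import argparse
from pathlib import Path

from pypdf import PdfReader  # stand-in only; a real runner calls its OCR engine


def run_mysystem(pdf_path: str, page_num: int = 1) -> str:
    """OCR one page of pdf_path and return it as Markdown."""
    # Stand-in "OCR": read the embedded text layer. A real runner would
    # invoke its OCR system here and parse that system's native output.
    text = PdfReader(pdf_path).pages[page_num - 1].extract_text() or ""
    # Normalize to Markdown (real runners map headings, tables, equations).
    return text.strip() + "\n"


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run MySystem OCR on one PDF page")
    parser.add_argument("pdf_path", type=Path)
    parser.add_argument("--page", type=int, default=1)
    args = parser.parse_args()
    print(run_mysystem(str(args.pdf_path), args.page))
```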
Add a new data quality miner (GPT-based or rule-based)
- Create a new miner file in olmocr/bench/miners/ following the naming convention mine_<feature_name>.py or mine_<feature_name>_gpt.py (olmocr/bench/miners/mine_tables_gpt.py)
- If using LLM-based mining, add corresponding prompt entries to olmocr/bench/prompts.py (olmocr/bench/prompts.py)
- Implement the miner to accept OCR outputs and emit structured quality assessments or annotations (olmocr/bench/miners/check_multicolumn.py); a rule-based sketch follows this list
- Integrate the miner into the benchmark pipeline via benchmark.py or report.py for aggregation (olmocr/bench/benchmark.py)
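Here is a minimal rule-based miner sketch. The module name, heuristic, and output schema are illustrative assumptions; the real miners in olmocr/bench/miners/ define their own conventions, so read one before writing yours.

```python
# Hypothetical rule-based miner sketch (mine_repeated_lines.py).
# Names and the output schema are assumptions for illustration only.
from collections import Counter


def mine_repeated_lines(pages_md: list[str], threshold: float = 0.6) -> dict:
    """Flag lines that repeat across pages, a cheap header/footer signal.

    pages_md: one Markdown string per page of a document.
    Returns a structured annotation, not a pass/fail verdict.
    """
    first_last = Counter()
    for page in pages_md:
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        if lines:
            # Headers/footers tend to be the first or last line of a page.
            first_last.update({lines[0], lines[-1]})
    cutoff = max(2, int(threshold * len(pages_md)))
    repeated = [ln for ln, n in first_last.items() if n >= cutoff]
    return {"check": "repeated_lines", "suspect_headers_footers": repeated}


if __name__ == "__main__":
    pages = ["ACME Corp\n\nIntro text\n\nPage 1", "ACME Corp\n\nMore text\n\nPage 2"]
    print(mine_repeated_lines(pages))  # flags "ACME Corp"
```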
Extend the review and reporting UI
- Modify the review application to surface new metrics or annotations (olmocr/bench/review_app.py)
- Update report generation to include new evaluation dimensions (olmocr/bench/report.py)
- If adding LaTeX/math display, leverage the existing KaTeX infrastructure (olmocr/bench/katex/render.py)
🔧Why these technologies
- Python + PDF/image processing libraries — Core domain is PDF parsing and text extraction; Python ecosystem has mature libraries (PIL, pdf2image, pdfplumber, etc.)
- LLM APIs (GPT, Claude, Gemini) — Leverages language models for complex quality mining tasks (table detection, reading order, math identification) that rule-based approaches cannot reliably solve
- Pluggable OCR backends (Marker, Docling, Paddle, Transformers) — Allows fair benchmarking and comparison of competing OCR systems; toolkit is OCR-agnostic
- KaTeX for math rendering — Standardizes LaTeX equation rendering in review UI and reports for consistent presentation of mathematical content
- Markdown as output format — Human-readable, LLM-friendly, preserves structure (headers, lists, tables, code blocks), suitable for downstream LLM training datasets
⚖️Trade-offs already made
- Use external LLM APIs (GPT, Claude) for quality assessment rather than fine-tuned local models
  - Why: Maximizes accuracy and coverage for diverse document types without requiring large labeled datasets or infrastructure for model training
  - Consequence: Introduces API dependency, latency, and cost; results are non-deterministic; requires API keys and rate-limit management
- Modular miner architecture with independent quality assessment tools
  - Why: Allows targeted improvements to specific quality dimensions (tables, headers, math) without redesigning the entire pipeline; enables A/B testing
  - Consequence: Miners run sequentially or in parallel, increasing total processing time; coordination overhead; potential redundant API calls if not cached
- Support multiple OCR backends as drop-in runners
  - Why: Enables fair benchmarking and lets users choose the OCR best suited to their document domain
  - Consequence: Adds maintenance burden; inconsistent output formats require normalization; slower to add new systems than a monolithic approach
- Markdown as the canonical linearized format
  - Why: Preserves document structure (headings, lists, tables, code) and is LLM-friendly for training datasets
  - Consequence: Some document features (precise positioning, rotated text, subtle color) are lost; assumes OCR backends can produce Markdown or can be converted
🚫Non-goals (don't propose these)
- Real-time OCR processing (batch-oriented toolkit designed for offline dataset preparation)
- Handling encryption or DRM-protected PDFs
- Multi-language OCR model training or fine-tuning (focuses on benchmarking existing systems)
- Interactive document annotation (read-only review UI only)
- Preservation of exact visual layout (intentionally linearizes to plain text and structured markdown)
🪤Traps & gotchas
- GPU requirement: the vLLM inference backend (switched from sglang June 2025) requires a CUDA-compatible GPU; no CPU fallback is mentioned.
- Model downloads: Hugging Face auto-download may fail without internet access or an auth token (allenai/olmOCR-2-7B-1025-FP8).
- Docker CUDA version is pinned to 12.8 (June 2025 update) — the host GPU driver must match.
- FP8 quantization is the default (v0.2.1+) — some older GPU architectures do not support it.
- KaTeX rendering (katex/render.py, katex.min.js) requires Node.js or a headless browser for equation→image conversion in benchmarks.
- ReadTheDocs builds may fail if docs/source/conf.py references missing dependencies.
💡Concepts to learn
- Vision Language Model (VLM) — olmOCR's core technology; understanding VLM architecture (vision encoder + LLM decoder) explains why it can handle equations, handwriting, and layout — not just character recognition.
- Quantization (FP8/FP16/FP32) — v0.2.1+ defaults to FP8 quantization for speed/memory; critical to understand trade-offs when optimizing for different GPUs or latency requirements (see the memory arithmetic sketch after this list).
- Reading Order Normalization — Key feature extracting text in natural reading order despite multi-column/figure/inset layouts; requires spatial understanding beyond raw OCR.
- Reinforcement Learning (RL) training — v0.4.0 introduces RL training with synthetic data for +4 benchmark points; understanding reward signals and policy updates explains latest performance gains.
- Markdown Output Generation — olmOCR targets structured Markdown (not plain text) to preserve equations (LaTeX), tables, and hierarchy; critical for LLM training data quality.
- Header/Footer Detection and Removal — Automatic page element stripping (olmocr/bench/miners/check_headers_footers.py) prevents noise in extracted text; important for dataset cleaning at scale.
- KaTeX Rendering (Mathematical Typesetting) — olmOCR preserves equations as KaTeX/LaTeX strings (katex/render.py, katex.min.js); necessary for mathematical document extraction and LLM training on STEM content.
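To see why precision matters at this scale, here is rough weight-memory arithmetic for a 7B-parameter model. This is a back-of-the-envelope sketch only: it counts weights alone, ignores activations, KV cache, and framework overhead, and assumes the "7B" figure is the full parameter count.

```python
# Rough VRAM needed just to hold model weights, by precision.
# Assumption: ~7e9 parameters ("7B"); real deployments also need
# memory for activations, KV cache, and runtime overhead.
PARAMS = 7e9
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

for precision, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS * nbytes / 2**30
    print(f"{precision}: ~{gib:.1f} GiB of weights")
# fp32: ~26.1 GiB of weights
# fp16: ~13.0 GiB of weights
# fp8:  ~6.5 GiB of weights
```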
🔗Related repos
- allenai/OLMo — Parent open-source language model project from AI2; olmOCR is the vision extension for document understanding in OLMo's training pipeline
- huggingface/transformers — Dependency for VLM model loading and inference; olmOCR leverages the HF model hub to distribute allenai/olmOCR-2-7B-1025-FP8
- vllm-project/vllm — Core inference backend (post-June 2025); handles GPU-accelerated LLM serving for document text generation
- allenai/Ai2-documentanalyzer — Predecessor/related AI2 project for document analysis; olmOCR likely incorporates lessons from earlier document understanding research
- openai/gpt-4-vision — Conceptual competitor solving the same problem (PDF→text via VLM); olmOCR differentiates via cost (~$200/M pages vs. GPT-4V pricing) and an open-source model
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for olmocr/bench/miners/ modules
The miners directory contains 15+ specialized data mining modules (check_headers_footers.py, mine_tables_gpt.py, mine_math.py, etc.), but there is no visible test directory for them. These are critical data processing components that need validation. Adding tests would catch regressions, improve contributor confidence, and document expected behavior for each miner. A starter pytest sketch follows the checklist below.
- [ ] Create tests/bench/miners/ directory structure mirroring olmocr/bench/miners/
- [ ] Write unit tests for at least 3 high-impact miners: mine_tables_gpt.py, mine_math.py, and check_multicolumn.py
- [ ] Add fixtures with sample PDF/document data for miners to test against
- [ ] Integrate tests into .github/workflows/pr_checks.yml to run on all PRs
- [ ] Document test patterns in .github/CONTRIBUTING.md for future miner development
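As a concrete starting shape for such tests, here is a minimal pytest sketch. It targets the hypothetical mine_repeated_lines miner sketched earlier; the real miners' import paths and return schemas will differ, so adapt each test to the module's actual API.

```python
# tests/bench/miners/test_repeated_lines.py — hypothetical test sketch.
# Replace the import and assertions with a real miner's API.
from mine_repeated_lines import mine_repeated_lines


def test_flags_line_repeated_on_every_page():
    # "ACME Corp" and "footer" open/close every page, so both should be flagged.
    pages = [f"ACME Corp\n\nBody {i}\n\nfooter" for i in range(5)]
    result = mine_repeated_lines(pages)
    assert "ACME Corp" in result["suspect_headers_footers"]
    assert "footer" in result["suspect_headers_footers"]


def test_ignores_lines_unique_to_one_page():
    pages = ["Title A\n\nBody", "Title B\n\nBody two"]
    result = mine_repeated_lines(pages)
    assert result["suspect_headers_footers"] == []
```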
Add integration tests for the PDF linearization pipeline in olmocr/bench/benchmark.py
The benchmark.py file is the core orchestrator for the linearization toolkit, but there is no visible integration test suite. Contributors need confidence that end-to-end PDF processing works correctly. Adding integration tests with real or synthetic PDFs would validate the full pipeline from PDF input to linearized output. An output-validation sketch follows the checklist below.
- [ ] Create tests/integration/ directory with test_pipeline.py
- [ ] Use the existing olmocr/bench/ocr_pareto.pdf or create a minimal test PDF as a fixture
- [ ] Write tests covering: PDF parsing → OCR processing → linearization → output validation
- [ ] Add metrics validation (compare against expected output format/structure)
- [ ] Update .github/workflows/main.yml to run integration tests (separate from unit tests, may require more resources)
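A sketch of what the output-validation step might look like, assuming some earlier step has run the pipeline and produced a Markdown file. The fixture name, output path, and invariants are illustrative assumptions, not the repo's actual contract.

```python
# tests/integration/test_pipeline.py — hypothetical output checks.
# Assumes a prior step produced OUTPUT_MD from a fixture PDF; wire the
# real pipeline invocation in where your CI runs it.
from pathlib import Path

import pytest

OUTPUT_MD = Path("tests/integration/out/sample.md")  # illustrative path


@pytest.mark.skipif(not OUTPUT_MD.exists(), reason="run the pipeline first")
def test_output_is_nonempty_markdown():
    text = OUTPUT_MD.read_text(encoding="utf-8")
    assert text.strip(), "pipeline produced an empty document"
    # Structural invariants: no raw control chars, sane line lengths.
    assert "\x00" not in text
    assert max(len(ln) for ln in text.splitlines()) < 10_000
```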
Document and add CI validation for Dockerfile builds and Docker image consistency
The repo has both Dockerfile and Dockerfile.with-model but no CI workflow validates that they build successfully or that the images work correctly. This is critical for users who want to use Docker. Adding a workflow that builds and optionally tests the Docker images would catch configuration issues early.
- [ ] Create .github/workflows/docker_build.yml that builds both Dockerfile and Dockerfile.with-model on PR/push
- [ ] Add a simple smoke test step that runs the Docker image with --help or basic functionality check
- [ ] Document Docker usage and build instructions in docs/source/installation.md (currently missing Docker section)
- [ ] Add .dockerignore validation/optimization notes to CONTRIBUTING.md
- [ ] Consider adding Docker image publishing workflow for releases (reference RELEASE_PROCESS.md)
🌿Good first issues
- Add unit tests for olmocr/bench/miners/check_old_scans_math.py: currently no visible test coverage for math OCR validation in old scanned documents; this is critical for benchmark credibility.
- Document VLM model selection and quantization trade-offs in docs/source/: README mentions FP8 is default, but installation docs lack guidance on choosing between FP8, FP16, and full precision for different hardware; add a table comparing performance vs. VRAM usage.
- Create integration test for multi-column document handling (olmocr/bench/miners/check_multicolumn.py): verify that the miner correctly flags documents with 2, 3, and 4+ column layouts; currently no example test PDFs or assertions visible.
⭐Top contributors
- @jakep-allenai — 97 commits
- @rodneykinney — 3 commits
📝Recent commits
- f7cfe4c — Fixing some lints (jakep-allenai)
- fddc616 — To and right arrow equivalent (jakep-allenai)
- bdfed92 — Try a cleaner reinstall (jakep-allenai)
- 42b9acc — Chandra maybe fixed (jakep-allenai)
- 4044b9a — Fixing some scripts (jakep-allenai)
- d4c4a8f — Adding some runners (jakep-allenai)
- 48004b9 — Fixes (jakep-allenai)
- 6516474 — Chandra use uv (jakep-allenai)
- 769dbe4 — Chandra fixes (jakep-allenai)
- 92e2440 — Lint fix (jakep-allenai)
🔒Security observations
The olmocr project shows moderate security maturity with some concerns. Positives include use of Dependabot for dependency management and a structured project layout. Major concerns include an incomplete Dockerfile, missing dependency lock file specifications for analysis, potential credential exposure in LLM runner scripts, and a lack of visible API security controls. The project requires immediate attention to credential management in the benchmark/runner components and completion of Docker security hardening. Dependency scanning must be enabled and regularly reviewed. Overall, the codebase needs enhanced security review before production use.
- High · Incomplete Dockerfile Configuration — Dockerfile (lines visible in snippet). The Dockerfile snippet shows an incomplete RUN instruction that cuts off mid-execution. This could result in a broken Docker image build or an unintended state where critical security steps are not completed. The command appears to be setting up Python alternatives but is truncated at 'up'. Fix: Complete the Dockerfile RUN instruction fully. Ensure all APT operations, Python alternatives setup, and security-critical initialization steps are properly defined and tested.
- High · Hardcoded Base Image with Specific Version — Dockerfile (FROM vllm/vllm-openai:v0.11.2). The Dockerfile uses a pinned but non-latest vLLM image (v0.11.2). While pinning versions is good practice, this specific version may contain known vulnerabilities, and there is no indication of regular security updates or vulnerability scanning. Fix: Implement a process to regularly scan and update base images. Consider using a security scanning tool in CI/CD (e.g., Trivy, Grype). Document the rationale for version pinning and establish an update cadence.
- Medium · Missing Dependency Lock File Inspection — Project root (missing dependency specifications). No lock files (requirements.txt, poetry.lock, Pipfile.lock) were provided for analysis, making it impossible to assess whether the project uses vulnerable Python dependencies. The project is a machine learning toolkit that likely has numerous dependencies. Fix: Provide and maintain lock files (requirements.lock or similar). Implement automated dependency vulnerability scanning using tools like Safety, Dependabot (already present in .github/dependabot.yml), or pip-audit in the CI/CD pipeline.
- Medium · Potential Security Script Exposure in Bench Directory — olmocr/bench/runners/ directory. The olmocr/bench directory contains multiple runner scripts (run_chatgpt.py, run_claude.py, run_gemini.py, etc.) that likely handle API credentials and external service interactions. These scripts may inadvertently log or expose sensitive credentials if error handling is inadequate. Fix: Review all API credential handling in runner scripts. Ensure credentials are never logged, stored in code, or committed to version control. Use environment variables exclusively for API keys and mask them in logs (a minimal pattern is sketched below).
- Medium · LLM Integration Points Without Visible Security Controls — olmocr/bench/runners/ and related integration code. The project integrates with multiple LLM APIs (ChatGPT, Claude, Gemini, Mistral) and OCR services. Without visible authentication/authorization code in the file structure, there is a risk of credential exposure or unauthorized API usage. Fix: Implement and audit API key management using industry-standard approaches (AWS Secrets Manager, HashiCorp Vault, or similar). Implement rate limiting and usage monitoring for external API calls. Add request signing/authentication verification.
- Low · Incomplete Docker Build Configuration — Dockerfile and Dockerfile.with-model. The Dockerfile appears truncated in the provided snippet, making it impossible to fully assess security implications. There is no visible HEALTHCHECK, USER privilege setup, or other hardening. Fix: Complete the Dockerfile with security best practices: run as a non-root user, use minimal base images where possible, remove unnecessary packages, include health checks, and implement proper signal handling.
- Low · Static Assets in Repository — olmocr/bench/katex/ directory. The repository contains KaTeX JavaScript libraries (katex.min.js, auto-render.min.js), which are minified third-party assets that may contain vulnerabilities or have license implications. Fix: Replace committed minified libraries with package-manager dependencies (npm, yarn). Implement automatic vulnerability scanning for JavaScript dependencies. Document and verify licenses for all third-party assets.
LLM-derived; treat as a starting point, not a security audit.
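For the credential-handling findings above, here is a minimal sketch of the recommended environment-variable pattern. The variable name and masking helper are illustrative, not the repo's actual convention.

```python
# Hypothetical credential-loading pattern for bench runner scripts.
# The env var name and helper names are illustrative assumptions.
import os
import sys


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read a key from the environment; never hardcode or commit it."""
    key = os.environ.get(env_var)
    if not key:
        sys.exit(f"{env_var} is not set; refusing to run without credentials")
    return key


def masked(key: str) -> str:
    """Safe form for logs: never print the raw key."""
    return key[:4] + "…" + key[-2:] if len(key) > 8 else "****"


if __name__ == "__main__":
    api_key = load_api_key()
    print(f"using key {masked(api_key)}")  # log only the masked form
```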
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.