RepoPilot

grobidOrg/grobid

Machine learning software for extracting information from scholarly documents

Healthy

Healthy across all four use cases

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 1d ago
  • 6 active contributors
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • Single-maintainer risk — top contributor 94% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README; the badge updates live from the latest cached analysis.

Variant:
[![RepoPilot: Healthy](https://repopilot.app/api/badge/grobidorg/grobid)](https://repopilot.app/r/grobidorg/grobid)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/grobidorg/grobid on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: grobidOrg/grobid

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/grobidOrg/grobid shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across all four use cases

  • Last commit 1d ago
  • 6 active contributors
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • ⚠ Single-maintainer risk — top contributor 94% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
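The single-maintainer figure in the evidence above can be reproduced from a local clone. A minimal sketch: with a clone, the input would come from `git log --since="1 year ago" --format=%ae`; here the function reads author emails from stdin, and the sample addresses are made up for illustration.

```shell
# Share of recent commits held by the top author (the "bus factor" signal).
top_share() {
  sort | uniq -c | sort -rn | awk '
    { total += $1; if (NR == 1) top = $1 }   # first row after sort -rn is the top author
    END { if (total) printf "%d%%\n", 100 * top / total }'
}

# Illustrative input: three commits by a@x, one by b@x.
printf 'a@x\na@x\na@x\nb@x\n' | top_share
```

For the sample input this prints 75%; a value above ~90% is what triggers the single-maintainer warning shown above.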

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live grobidOrg/grobid repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/grobidOrg/grobid.

What it runs against: a local clone of grobidOrg/grobid — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in grobidOrg/grobid | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>grobidOrg/grobid</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of grobidOrg/grobid. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/grobidOrg/grobid.git
#   cd grobid
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of grobidOrg/grobid and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "grobidOrg/grobid(\.git)?\b" \
  && ok "origin remote is grobidOrg/grobid" \
  || miss "origin remote is not grobidOrg/grobid (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "Apache License|Apache-2\.0" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "build.gradle" \
  && ok "build.gradle" \
  || miss "missing critical file: build.gradle"
test -f "gradle.properties" \
  && ok "gradle.properties" \
  || miss "missing critical file: gradle.properties"
test -f "Readme.md" \
  && ok "Readme.md" \
  || miss "missing critical file: Readme.md"
test -f "doc/Install-Grobid.md" \
  && ok "doc/Install-Grobid.md" \
  || miss "missing critical file: doc/Install-Grobid.md"
test -f "doc/Notes-grobid-developers.md" \
  && ok "doc/Notes-grobid-developers.md" \
  || miss "missing critical file: doc/Notes-grobid-developers.md"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/grobidOrg/grobid"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

GROBID is a machine learning library for extracting, parsing, and structuring raw PDF documents—especially scholarly articles—into TEI-XML with metadata fields like headers, references, authors, affiliations, and citations. It combines CRF (Conditional Random Fields) models with Deep Learning via DELFT/JEP to achieve ~0.87-0.90 F1-score on reference extraction and ~0.76-0.91 on citation resolution across PubMed Central and bioRxiv datasets. Monorepo structure: grobid-core (Java main engine, 4.1MB), grobid-home (models and native libs), doc/ (extensive guides), .github/workflows/ (6 parallel CI pipelines for CRF/DELFT/ONNX variants). Build system uses Gradle with plugins for Kotlin (2.2.20), shadow JAR, spotless linting, and coveralls coverage. Python integration via JEP for deep learning inference.

👥Who it's for

Researchers, librarians, and software engineers building knowledge graphs, search systems, or digital repositories who need to automatically extract bibliographic metadata from unstructured PDFs without manual annotation.

🌱Maturity & risk

Production-ready and actively maintained since 2008. The project has steady releases, comprehensive CI/CD via GitHub Actions (CRF, DELFT, ONNX build pipelines), Docker support, and extensive documentation. It's a long-term Inria-supported side project with strong academic adoption (evident from bioRxiv benchmarks and PMC evaluation datasets).

Standard open source risks apply.

Active areas of work

Active CI/CD iteration with 6 distinct build workflows (manual CRF, DELFT, ONNX, unstable, tag-custom). Recent work visible in codespell.yml and trivy.yml (security scanning). Multiple Docker flavors (Dockerfile.crf, Dockerfile.delft, Dockerfile.evaluation) indicate ongoing exploration of runtime backends.

🚀Get running

git clone https://github.com/grobidOrg/grobid.git && cd grobid && ./gradlew build (requires JDK 11+, Gradle 8.3+). For Python/JEP support: ensure CONDA_PREFIX or VIRTUAL_ENV is set. For Docker: docker build -f Dockerfile.crf -t grobid:crf . or docker pull grobid/grobid.

Daily commands: ./gradlew build && ./gradlew run (check build.gradle for exact tasks). Service mode: java -jar build/libs/grobid-*.jar or use Dockerfile.crf. Batch: see doc/Grobid-batch.md for CLI entrypoints. DELFT variant requires CONDA_PREFIX set (inferred from getJavaLibraryPath closure in build.gradle).
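The prerequisites above (JDK 11+, and an active Python env for the DELFT/JEP variant) can be checked up front. A minimal preflight sketch; the messages and function names are illustrative, not part of GROBID's tooling.

```shell
# Preflight for the build requirements stated above.
check_java() {
  if command -v java >/dev/null 2>&1; then
    echo "java: found ($(java -version 2>&1 | head -1))"
  else
    echo "java: missing (JDK 11+ required)"
  fi
}

check_python_env() {
  # The DELFT/JEP variant needs one of these set (see build notes above).
  if [ -n "${CONDA_PREFIX:-}" ] || [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "python env: ok"
  else
    echo "python env: missing (set CONDA_PREFIX or VIRTUAL_ENV for DELFT)"
  fi
}

check_java
check_python_env
```

Run this before `./gradlew build` to catch the silent-failure mode described later in the gotchas.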

🗺️Map of the codebase

  • build.gradle — Root build configuration defining all project dependencies, plugins (Kotlin, Shadow, Spotless), and platform-specific JNI library paths for GROBID's CRF and deep learning model support.
  • gradle.properties — Gradle configuration properties that control build behavior, version numbers, and dependency resolution for the entire project.
  • Readme.md — Entry point documenting GROBID's purpose (ML-based scholarly document information extraction), key features, and links to essential setup and usage guides.
  • doc/Install-Grobid.md — Critical installation guide covering prerequisites, build steps, and environment setup required for all developers to run the project locally.
  • doc/Notes-grobid-developers.md — Developer-focused documentation explaining project architecture, testing patterns, and contribution conventions specific to GROBID's codebase.
  • .github/workflows — CI/CD pipeline definitions (multiple workflow files) that enforce build, test, and deployment standards across branches and releases.
  • doc/Deep-Learning-models.md — Comprehensive guide to GROBID's machine learning model architecture, training pipelines, and model selection logic—essential for understanding the core extraction engine.

🛠️How to make changes

Add a New ML Model Variant

  1. Create a new training guide in doc/training/ (e.g., doc/training/my-new-task.md) documenting data format, annotation schema, and expected features. (doc/training/General-principles.md)
  2. Define the model in the Deep Learning models documentation (doc/Deep-Learning-models.md) with architecture details and performance metrics. (doc/Deep-Learning-models.md)
  3. Add a new Dockerfile variant (e.g., Dockerfile.my-variant) if the model requires unique runtime dependencies. (Dockerfile.delft)
  4. Create a corresponding CI/CD workflow in .github/workflows/ to build, test, and publish the new variant (copy pattern from ci-build-manual-full.yml). (.github/workflows/ci-build-manual-full.yml)
  5. Add benchmarking results in doc/benchmarks/ with dataset-specific performance (follow structure of existing PMC/bioRxiv benchmarks). (doc/benchmarks/Benchmarking.md)

Add a New REST API Endpoint

  1. Document the endpoint signature and request/response schema in doc/Grobid-service.md with examples. (doc/Grobid-service.md)
  2. Update configuration options if the new endpoint requires tunable parameters in doc/Configuration.md. (doc/Configuration.md)
  3. Document the output format in doc/TEI-encoding-of-results.md if results are TEI-XML encoded. (doc/TEI-encoding-of-results.md)
  4. Add test coverage following patterns in build.gradle test configurations and CI workflows. (build.gradle)

Improve Model Training & Evaluation

  1. Review existing training guide structure (e.g., doc/training/header.md) to understand annotation conventions and data requirements. (doc/training/General-principles.md)
  2. Follow the end-to-end evaluation methodology documented in doc/End-to-end-evaluation.md to define metrics and validation splits. (doc/End-to-end-evaluation.md)
  3. Create or update benchmark results in doc/benchmarks/ with your evaluation dataset (use existing journal benchmark structure). (doc/benchmarks/Benchmarking-models.md)
  4. Document training procedures and results in the model's specific training guide (e.g., doc/training/fulltext.md). (doc/training/Training-the-models-of-Grobid.md)

🔧Why these technologies

  • Java/Kotlin + Gradle — Mature ecosystem for building cross-platform ML inference engines; JNI integration for native CRF libraries (Wapiti) and Python DeLFT bindings.
  • CRF (Conditional Random Fields) + DeLFT Deep Learning — Dual-model strategy: CRF for lightweight sequence tagging with explicit feature engineering; DeLFT for state-of-the-art accuracy with transformer-based models (BERT, etc.).
  • Docker multi-variant images — Enables deployment flexibility: users can choose CRF (faster, smaller footprint) or DeLFT (higher accuracy, larger image) based on constraints.
  • TEI XML output schema — Standard humanities/scholarly publishing format for structured document encoding; enables interoperability with digital libraries and research platforms.
  • REST API + batch processing — Dual interface: REST for real-time single-document extraction; batch mode for high-throughput processing of document collections.

⚖️Trade-offs already made

  • CRF + DeLFT dual models instead of single unified model

    • Why: Allows users to trade accuracy for speed/resource consumption; CRF enables offline evaluation without GPU, DeLFT achieves SOTA results at computational cost.
    • Consequence: Increased maintenance burden (two model pipelines) and potential inconsistency between outputs; users must select model appropriately for their constraints.
  • JNI binding for native Wapiti CRF library instead of pure Java

    • Why: Wapiti is battle-tested, optimized C++ implementation; pure Java implementation would be slower and harder to maintain.
    • Consequence: Complex platform-specific builds (multiple Dockerfile variants and prebuilt per-platform native libraries in grobid-home/lib).

🪤Traps & gotchas

  1. JEP Python integration requires the CONDA_PREFIX or VIRTUAL_ENV env var AND a matching Python-version site-packages in the lib path (see the getJavaLibraryPath closure); missing setup fails silently at runtime.
  2. Platform-specific lib binaries (mac_arm-64 vs lin-64) are prebuilt in grobid-home/lib; custom builds require SWIG, external CRF++, and recompilation steps (doc/Recompiling-and-integrating-CRF-libraries.md).
  3. The multiple Docker variants (CRF, DELFT, Evaluation) have different dependencies; the DELFT variant requires a Python runtime not present in the lightweight CRF image.
  4. Clean builds depend on Gradle root-property access and symbol resolution for the .editorconfig / .spotlessignore setup.
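The JEP gotcha boils down to locating the active Python env's site-packages, which is roughly what the getJavaLibraryPath closure does. A standalone sketch of that lookup; the directory layout is the conventional conda/venv one, not copied from build.gradle.

```shell
# Find the active Python env's site-packages (needed on the JEP library path).
find_site_packages() {
  env_root="${CONDA_PREFIX:-${VIRTUAL_ENV:-}}"
  if [ -z "$env_root" ]; then
    echo "no active python env (JEP will fail at runtime)" >&2
    return 1
  fi
  # Conda and venv both place packages under lib/pythonX.Y/site-packages.
  ls -d "$env_root"/lib/python*/site-packages 2>/dev/null | head -1
}
```

If this prints nothing for your env, the DELFT variant will hit the silent runtime failure described in gotcha 1.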

🏗️Architecture

💡Concepts to learn

  • Conditional Random Fields (CRF) — GROBID's primary sequence labeling engine for header/reference extraction; understanding CRF training and inference is critical for model improvements and troubleshooting
  • TEI (Text Encoding Initiative) — GROBID's output format for scholarly document markup; knowledge of TEI schema and semantics is essential for interpreting and validating extraction results
  • JEP (Java Embedded Python) — Bridge layer enabling Deep Learning model inference from Java; critical for DELFT variant builds and understanding native library dependency complexity
  • SWIG (Simplified Wrapper and Interface Generator) — Generates JNI bindings for CRF++ C++ library; understanding SWIG is required for maintaining or rebuilding native dependencies
  • F1-Score (Precision-Recall Harmonic Mean) — Primary evaluation metric reported across benchmarks (0.87-0.90 on references); foundation for comparing model versions and dataset performance
  • Citation Context Resolution — GROBID-specific task of linking in-text citation callouts to full bibliographic references; 0.76-0.91 F1-score metric reflects core value proposition
  • Document Segmentation (Layout Analysis) — Preprocessing step for PDF layout understanding before text extraction; models handle multi-column, header/footer detection referenced in doc/Grobid-specialized-processes.md
  • allenai/science-parse — Direct competitor for scholarly PDF parsing; similar header/reference extraction but uses different ML models; good reference for feature parity
  • kermitt2/delft — Companion Deep Learning framework used by GROBID for neural model training; understanding DELFT is prerequisite for DELFT variant builds
  • facebookresearch/fasttext — Embedding library referenced in deep learning pipelines; used for feature representation in DELFT models
  • anystyle/anystyle — Ruby-based reference parser with similar goals; comparison tool for validating GROBID's citation extraction quality
  • europepmc/europepmc — Europe PMC uses GROBID internally for large-scale PDF ingestion; production deployment example
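The F1 scores quoted throughout this report (e.g. 0.87-0.90 on references) are the harmonic mean of precision and recall. A one-liner to reproduce the arithmetic; the precision/recall values below are illustrative, not taken from GROBID's benchmarks.

```shell
# F1 = 2PR / (P + R), the harmonic mean of precision and recall.
f1() { awk -v p="$1" -v r="$2" 'BEGIN { printf "%.2f\n", 2 * p * r / (p + r) }'; }

f1 0.89 0.88   # illustrative precision/recall pair
```

Because the harmonic mean punishes imbalance, a model with P=0.99 and R=0.50 scores well below their arithmetic mean, which is why F1 is the headline metric in the benchmarks.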

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive GitHub Actions workflow for Python dependency validation and JEP compatibility testing

The build.gradle shows complex Python/JEP library path resolution logic across macOS (arm64/x86), Linux, and virtual environments (CONDA_PREFIX, VIRTUAL_ENV). Currently, there's no CI workflow validating this platform-specific behavior. Adding a workflow would catch regressions in the JEP initialization code that affects the ML model loading pipeline. The repo has ci-build-*.yml workflows but none specifically test the Python integration layer that's critical for deep learning models.

  • [ ] Create .github/workflows/ci-test-jep-compatibility.yml
  • [ ] Add matrix strategy testing macOS (arm64, x86_64), Ubuntu (x86_64), and conda/venv environments
  • [ ] Test the getJavaLibraryPath() logic in build.gradle by validating library paths are correctly resolved
  • [ ] Add validation steps to ensure jep binaries load correctly for each platform
  • [ ] Document expected library structure in doc/Notes-grobid-developers.md section for Python setup

Add integration tests for Docker flavor builds (CRF, DeLFT, Evaluation variants)

The repo contains three specialized Dockerfiles (Dockerfile.crf, Dockerfile.delft, Dockerfile.evaluation) alongside multiple CI workflows (ci-build-manual-crf.yml, etc.), but there's no documented test suite validating that each Docker build produces a functional image. Given that the README highlights Docker as a primary deployment method (with pull badges), adding automated tests ensures image consistency across flavors and prevents build regressions.

  • [ ] Create doc/Docker-testing-guide.md documenting test procedures for each flavor
  • [ ] Add .github/workflows/ci-docker-build-and-test.yml that builds all three Dockerfile variants
  • [ ] Include smoke tests in the workflow: verify grobid-service startup, health endpoint response, and basic document processing
  • [ ] Add matrix strategy for testing Docker images on ubuntu-latest and optionally arm64 runners
  • [ ] Reference this workflow in CHANGELOG.md and relevant CI build workflows

Create unit test suite for TEI output encoding and coordinate transformation logic

The repo has doc/TEI-encoding-of-results.md and doc/Coordinates-in-PDF.md describing output formats, plus doc/End-to-end-evaluation.md for benchmarking. However, the file structure suggests no dedicated test files for PDF coordinate transformation or TEI serialization correctness. These are critical for users integrating GROBID into pipelines—bugs in coordinate mapping or TEI tag nesting could cascade through downstream systems. Adding focused tests would improve reliability.

  • [ ] Review src/ structure to identify TEI serialization classes (likely in org.grobid.core.document or similar)
  • [ ] Create src/test/java/org/grobid/core/document/TEIEncodingTest.java for TEI output validation
  • [ ] Create src/test/java/org/grobid/core/utilities/CoordinateTransformationTest.java for PDF coordinate tests
  • [ ] Add test fixtures in src/test/resources with sample PDFs and expected TEI/coordinate outputs
  • [ ] Update doc/Testing.md or create it with test coverage goals aligned to benchmarking metrics in doc/benchmarks/

🌿Good first issues

  • Add unit test coverage for TEI XML marshalling in grobid-core/src/test/ — doc/TEI-encoding-of-results.md describes the schema but test directory likely has gaps for edge cases (malformed author lists, missing DOI fields, etc.)
  • Document CONDA/VIRTUALENV setup for JEP integration in doc/Install-Grobid.md or doc/Getting-started — the getJavaLibraryPath Gradle logic is opaque; a clear step-by-step guide with common failure modes would reduce onboarding friction
  • Create end-to-end integration test comparing CRF vs DELFT extraction F1-score on a fixed bioRxiv sample set — doc/benchmarks/ has per-dataset reports but no reproducible CI test; this bridges the gap between manual benchmarking and CI validation


📝Recent commits

  • bc30af7 — FIx formatting in DateParser (#1437) (lfoppiano)
  • e82a1fa — Add codespell support with configuration and fixes (#1365) (yarikoptic)
  • f8f152f — Fix git revision (#1433) (lfoppiano)
  • a7daadf — fix: uniform docker image summary #1428 (#1429) (lfoppiano)
  • 58fe2f2 — Merge pull request #1431 from grobidOrg/feature/code-formatting (lfoppiano)
  • a82c2e7 — chore: Code formatting with Spotless (lfoppiano)
  • c702a5b — feat: rewrite toString and other oveloaded methods (#1401) (lfoppiano)
  • f9e73ef — Enhance code formatting and add Spotless for code cleanup (#1384) (lfoppiano)
  • 85a97fb — Corrected links to the current CrossRef documentation, and organisation name (#1430) (luismmontilla)
  • 4b931ce — Fix ./gradlew install (#1427) (flrjrf)

🔒Security observations

The GROBID codebase shows a moderate security posture. The primary concerns relate to dynamic library path construction in the Gradle build file without sufficient input validation, which could be exploited in shared build environments or if system environment variables are compromised. The incomplete getGitRevision() function requires attention. The presence of CI/CD workflows with Trivy scanning is positive. No obvious hardcoded secrets were detected in the provided snippets. Docker configurations and GitHub Actions workflows follow reasonable security practices. Recommendations include: (1) completing and validating the build script logic, (2) implementing stricter environment variable validation, (3) maintaining dependency vulnerability scanning, and (4) regular security audits of the build infrastructure.

  • Medium · Potential Path Traversal in Library Path Construction — build.gradle - getJavaLibraryPath closure. The build.gradle file constructs Java library paths dynamically using environment variables (CONDA_PREFIX, VIRTUAL_ENV) and file system operations without sufficient validation. This could potentially be exploited if environment variables are controlled by an attacker to include malicious library paths. Fix: Validate and sanitize environment variable values before constructing library paths. Use absolute path resolution with canonical paths to prevent directory traversal attacks. Consider using a whitelist of allowed library locations.
  • Medium · Incomplete Git Revision Retrieval Logic — build.gradle - getGitRevision() function. The getGitRevision() function in build.gradle appears to be incomplete (ends with 'tr' suggesting truncation). Incomplete security-related code could lead to unexpected behavior or uninitialized variables. Fix: Complete the implementation of getGitRevision() function. Ensure proper error handling and that the git revision is properly validated before use.
  • Medium · Use of System Properties and Environment Variables Without Sanitization — build.gradle - getJavaLibraryPath closure. The code retrieves System.getProperty('java.library.path') and System.env variables without sanitization. These values are then concatenated into paths that will be used by the JVM to load native libraries, which could be a vector for local privilege escalation if the system is compromised. Fix: Implement strict validation of environment variables. Use only explicitly defined and validated paths. Consider using absolute paths only from trusted locations. Add logging to track any unusual path configurations.
  • Low · Gradle Plugin Versions Should Be Reviewed for Known CVEs — build.gradle - plugins section. While most plugin versions appear reasonably current, dependencies should be regularly checked. The kotlin.jvm plugin (2.2.20) and shadow plugin (8.3.10) should be monitored for security updates. Fix: Implement automated dependency scanning using tools like Trivy (already present in workflows), OWASP Dependency-Check, or Snyk. Keep all Gradle plugins updated to their latest stable versions. Review CVE databases regularly for plugin vulnerabilities.
  • Low · Potential Information Disclosure via Git Revision — build.gradle - getGitRevision() function. The getGitRevision() function defaults to 'unknown' string. If this is embedded in build artifacts or logs, it may help attackers understand version information, though the impact is minimal. Fix: Ensure version information is handled securely. Consider if this information should be exposed in production builds. Document version handling practices.
  • Low · Hard-coded Platform Detection Logic — build.gradle - getJavaLibraryPath closure. Platform detection using Os.FAMILY_MAC and Os.FAMILY_UNIX is platform-specific and could fail on unsupported platforms. The explicit RuntimeException for unsupported platforms is good, but this could be a DoS vector in automated build systems. Fix: Document supported platforms clearly. Consider adding graceful degradation for unsupported platforms rather than failing the build. Add logging for platform detection.
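One way to implement the "validate and sanitize environment variable values" recommendation above: accept only absolute, existing directories and reject parent-directory segments. The policy and messages are an illustrative sketch, not GROBID code.

```shell
# Validate an env-var value intended as a library/env root directory.
validate_env_dir() {
  case "$1" in
    /*) ;;                                            # must be absolute
    *) echo "rejected: not absolute"; return 1 ;;
  esac
  case "$1" in
    *..*) echo "rejected: traversal"; return 1 ;;     # no parent-dir segments
  esac
  if [ -d "$1" ]; then
    echo "ok"
  else
    echo "rejected: not a directory"; return 1
  fi
}
```

Applied to CONDA_PREFIX/VIRTUAL_ENV before path construction, this closes the traversal vector described in the first finding.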

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
