apache/hudi
Upserts, Deletes And Incremental Processing on Big Data.
Healthy across the board
- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — clean foundation to fork and modify.
- Documented and popular — useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit today
- ✓27+ active contributors
- ✓Distributed ownership (top contributor 28% of recent commits)
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
The badge links to https://repopilot.app/r/apache/hudi. Paste it at the top of your README.md — it renders inline like a shields.io badge.
Social card (1200×630): this card auto-renders when someone shares https://repopilot.app/r/apache/hudi on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: apache/hudi
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/apache/hudi shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit today
- 27+ active contributors
- Distributed ownership (top contributor 28% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live apache/hudi repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/apache/hudi.
What it runs against: a local clone of apache/hudi — the script inspects the git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in apache/hudi | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of apache/hudi. If you don't
# have one yet, run these first:
#
# git clone https://github.com/apache/hudi.git
# cd hudi
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of apache/hudi and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "apache/hudi(\.git)?\b" \
  && ok "origin remote is apache/hudi" \
  || miss "origin remote is not apache/hudi (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(Apache-2\\.0)" LICENSE 2>/dev/null \\
|| grep -qiE "\"license\"\\s*:\\s*\"Apache-2\\.0\"" package.json 2>/dev/null) \\
&& ok "license is Apache-2.0" \\
|| miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "README.md" \\
&& ok "README.md" \\
|| miss "missing critical file: README.md"
test -f ".asf.yaml" \\
&& ok ".asf.yaml" \\
|| miss "missing critical file: .asf.yaml"
test -f "pom.xml" \\
&& ok "pom.xml" \\
|| miss "missing critical file: pom.xml"
test -f ".github/workflows/azure_ci.js" \\
&& ok ".github/workflows/azure_ci.js" \\
|| miss "missing critical file: .github/workflows/azure_ci.js"
test -f ".github/PULL_REQUEST_TEMPLATE.md" \\
&& ok ".github/PULL_REQUEST_TEMPLATE.md" \\
|| miss "missing critical file: .github/PULL_REQUEST_TEMPLATE.md"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/apache/hudi"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).
⚡TL;DR
Apache Hudi is an open data lakehouse platform that enables efficient upserts, deletes, and incremental processing on big data stored in cloud object storage. It provides a high-performance table format (similar to Delta Lake or Apache Iceberg) with built-in support for Spark and Flink, managing file organization, incremental queries, and timeline-based change tracking automatically. The codebase is a Maven multi-module monorepo organized by integration point: hudi-core (base format), hudi-spark (Spark DataSource), hudi-flink (Flink sink), hudi-hive-sync (metadata syncing), hudi-utilities (ingestion tools), hudi-kafka-connect (Kafka source), and timeline-server. Each module produces versioned bundles (hudi-*-bundle*.txt in dependencies/) for packaging with different compute engines.
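To make the upsert and delete path concrete, here is a minimal sketch of writing through the Spark DataSource, assuming a Spark session with a Hudi Spark bundle on the classpath. The table name, path (/tmp/hudi_trips), and columns are invented for illustration; the hoodie.* keys are Hudi's documented write options, but verify them against the Hudi version you build.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Illustrative only: table name, path, and schema are made up for this sketch.
val spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate()
import spark.implicits._

val basePath = "/tmp/hudi_trips"

// `uuid` is the record key, `ts` the precombine (latest-wins) field, `city` the partition path.
val upserts = Seq(
  ("id-1", "2026-05-09 10:00:00", 12.5, "sf"),
  ("id-2", "2026-05-09 10:05:00", 7.3, "nyc")
).toDF("uuid", "ts", "fare", "city")

upserts.write.format("hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert") // also: "insert", "bulk_insert", "delete"
  .mode(SaveMode.Append)
  .save(basePath)

// Deletes reuse the same path; only the keys of the rows to remove matter.
upserts.filter($"uuid" === "id-2").write.format("hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "delete")
  .mode(SaveMode.Append)
  .save(basePath)
```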
👥Who it's for
Data engineers and lakehouse architects using Apache Spark or Flink who need ACID transactions (insert/update/delete), incremental data processing, and efficient data versioning without managing Parquet/ORC files manually. Also relevant for organizations migrating from traditional data warehouses to cloud storage.
🌱Maturity & risk
Highly mature and production-ready. Apache Software Foundation project with active CI/CD (Azure Pipelines, GitHub Actions), 31.8M lines of Java code, Docker multi-version testing (Hadoop 2.8.4 through 3.4.0, Spark 3.5.3, Hive 3.1.3), and clear Maven artifact publication. Actively developed with recent infrastructure updates (azure-pipelines-20230430.yml, GitHub workflow updates in .github/workflows/).
Low technical risk but high complexity. Large multi-language codebase (Java 31.8M LOC, Scala 7.9M LOC) creates maintenance burden; multiple Scala/Spark versions (2.11, 2.12, Spark3) require careful dependency management. No evident single-maintainer risk due to ASF governance, but wide dependency surface (12+ bundle types: Flink, Hive, Kafka Connect, Presto) means upgrading dependencies requires extensive testing across integrations.
Active areas of work
Version 1.3.0-SNAPSHOT active development with focus on multi-engine support (Flink/Spark 3.5.3 docker images in compose/, Hadoop 3.4.0 compatibility). Recent work on PR compliance automation (.github/workflows/pr_compliance.yml, pr_title_validation.yml) and release candidate validation suggests preparation for major release. Kafka Connect and Presto bundles indicate ecosystem expansion.
🚀Get running
git clone https://github.com/apache/hudi.git
cd hudi
mvn clean install -DskipTests=true # Build all modules with Maven 3.6+, requires Java 8+
# For Docker testing: docker-compose -f docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml up
Daily commands:
No single dev server; build modules and test with spark-shell or spark-submit: mvn -pl hudi-spark,hudi-common -am install && spark-shell --jars target/hudi-spark-1.3.0-SNAPSHOT.jar. For end-to-end: docker-compose -f docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml up provides Hadoop NameNode, Hive Metastore, Spark, and Hudi preconfigured.
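Building on the spark-shell command above, here is a small sketch of reading a table back with a snapshot query. It assumes spark-shell (where spark is already defined) and the hypothetical /tmp/hudi_trips table from the TL;DR sketch; column names are illustrative.

```scala
// Snapshot query: the latest committed view of the table.
val snapshot = spark.read.format("hudi").load("/tmp/hudi_trips")
snapshot.createOrReplaceTempView("hudi_trips")
spark.sql("SELECT uuid, city, fare FROM hudi_trips ORDER BY fare DESC").show(false)

// Hudi adds commit-metadata columns (e.g. _hoodie_commit_time) that show which
// instant on the timeline produced each row; handy when debugging write behavior.
snapshot.select("_hoodie_commit_time", "uuid").show(false)
```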
🗺️Map of the codebase
- README.md — Entry point for understanding Hudi's core mission: an open data lakehouse platform for upserts, deletes, and incremental processing on big data.
- .asf.yaml — Apache Software Foundation metadata; defines project governance, release process, and contribution standards that all maintainers must follow.
- pom.xml — Maven root POM defining all dependencies, build profiles, and module organization across 600+ files; critical for understanding the build system.
- .github/workflows/azure_ci.js — Primary CI/CD pipeline configuration; defines how all code is tested and validated before merge.
- .github/PULL_REQUEST_TEMPLATE.md — PR submission template enforcing standards for commits, testing, and documentation that every contributor must follow.
- conf/hudi-defaults.conf.template — Default configuration schema; documents all tunable parameters and serves as a reference for runtime behavior.
- Dockerfile — Container image definition; shows how the Hudi runtime environment is packaged for deployment and demos.
🛠️How to make changes
Add a new Hudi runtime integration (e.g., new Spark version)
- Define a new module with dependencies in the root pom.xml, adding a new <module> entry for hudi-spark-x-bundle (pom.xml)
- Create a bundle dependency list in the dependencies directory following the naming convention, e.g. hudi-spark3-bundle_2.13.txt (dependencies/hudi-spark3-bundle_2.12.txt)
- Add a new Docker Compose environment for integration testing with the new runtime version (docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml)
- Create a test configuration YAML in demo/config/test-suite referencing the new bundle (docker/demo/config/test-suite/simple-deltastreamer.yaml)
- Update the CI/CD workflow to run integration tests against the new runtime configuration (.github/workflows/azure_ci.js)
Add a new integration test scenario
- Create a new test YAML configuration in docker/demo/config/test-suite/ following the naming convention, e.g. feature-name.yaml (docker/demo/config/test-suite/simple-deltastreamer.yaml)
- Reference an existing Avro schema, or create a new one in the same directory if a different record structure is needed (docker/demo/config/test-suite/source.avsc)
- Reference or extend base properties from docker/demo/config/base.properties for common settings (docker/demo/config/base.properties)
- Add a scheduled test entry in .github/workflows/scheduled_workflow.yml to run nightly (.github/workflows/scheduled_workflow.yml)
Contribute a bug fix or feature following Apache guidelines
- Create a GitHub issue using the appropriate template (bug/feature/improvement) in .github/ISSUE_TEMPLATE/ (.github/ISSUE_TEMPLATE/hudi_bug.yml)
- Fork the repo, create a feature branch, and ensure code follows the Apache project standards outlined in .asf.yaml (.asf.yaml)
- Ensure the PR follows the checklist and formatting in PULL_REQUEST_TEMPLATE.md before submission (.github/PULL_REQUEST_TEMPLATE.md)
- Add integration tests using existing test-suite patterns in docker/demo/config/test-suite/ (docker/demo/config/test-suite/simple-deltastreamer.yaml)
- Verify CI/CD passes via the azure_ci.js workflow before requesting review (.github/workflows/azure_ci.js)
🔧Why these technologies
- Apache Spark, Flink, Hadoop MR — Hudi supports multiple distributed compute frameworks; dependencies are pre-bundled (hudi-spark-bundle, hudi-flink-bundle, hudi-hadoop-mr-bundle) to ensure compatibility across cloud platforms
- Avro schema format — Data schema definition standard used across demo configs (hoodie-schema.avsc, source.avsc, complex-source.avsc) for record serialization and compatibility
- Docker Compose + multi-version stacks — Integration testing requires specific combinations of Hadoop, Hive, and Spark versions; docker-compose files pre-define validated stacks (Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1, etc.)
- Azure CI/CD pipeline — Primary continuous integration orchestrated via azure_ci.js; validates all PRs against bundled dependencies and integration test suites
- Apache Maven multi-module build — Coordinates ~600 files across modular components (core, spark-bundle, flink-bundle, utilities) with unified dependency management in root pom.xml
⚖️Trade-offs already made
- Pre-built dependency bundles (hudi-spark-bundle_2.12.txt, etc.) instead of inline pom.xml
  - Why: Reduces JAR conflicts when deploying to heterogeneous cluster environments with pre-existing Spark/Hadoop installations
  - Consequence: Bundle lists must be manually maintained; out-of-sync bundles can cause runtime failures. Requires additional validation in CI/CD
- Support multiple Spark versions (2.11, 2.12, 3.x) and Scala versions simultaneously
  - Why: Enables adoption across enterprises with legacy Spark 2.x and modern Spark 3.x clusters
  - Consequence: Increased build complexity, larger codebase surface, and more integration test matrix coverage needed
- Both Copy-on-Write (CoW) and Merge-on-Read (MoR) table formats (see the sketch after this list)
  - Why: CoW favors query performance; MoR favors write throughput. Users choose based on workload (test configs: cow-large-scale-sanity.yaml vs mor-large-scale-sanity.yaml)
  - Consequence: Implementation must support two distinct code paths; increased testing burden and documentation requirements
- Async compaction as optional background job (mor-async-compact.yaml)
  - Why: Prevents write stalls in high-frequency ingestion scenarios
  - Consequence: Query consistency during active compaction requires careful timeline management; risk of stale read visibility windows
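As a companion to the CoW/MoR and compaction trade-offs above, here is a hedged sketch of how the choice surfaces in write options. The events DataFrame, table name, and path are illustrative; hoodie.datasource.write.table.type and hoodie.compact.inline are documented Hudi configs, but confirm the exact keys and defaults for the version in use.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-table-type-sketch").getOrCreate()
import spark.implicits._

// Hypothetical events; event_id is the record key, ts the precombine field.
val events = Seq(("e-1", 1715250000L, "click"), ("e-2", 1715250060L, "view"))
  .toDF("event_id", "ts", "kind")

// Table type is fixed when the table is first created. COPY_ON_WRITE (the default)
// rewrites base files on update; MERGE_ON_READ appends log files and merges on read.
events.write.format("hudi")
  .option("hoodie.table.name", "events_mor")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // MoR compaction can run inline with each write or as an async background job;
  // async avoids write stalls at the cost of timeline/visibility management.
  .option("hoodie.compact.inline", "false")
  .mode("append")
  .save("/tmp/events_mor")
```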
🚫Non-goals (don't propose these)
- Real-time sub-second query latency (Hudi is optimized for eventual consistency in batch/incremental scenarios)
- Full ACID transaction isolation levels beyond snapshot isolation (timeline-based versioning provides eventual consistency, not strict serializability)
- Standalone query engine (Hudi delegates query execution to Spark, Presto, Hive)
- Schema evolution beyond Avro type system (no full schema migration framework)
- Non-distributed, single-node execution (Hudi assumes a distributed engine such as Spark or Flink)
🪤Traps & gotchas
- Multi-Scala version complexity: hudi-spark bundles exist for Scala 2.11 and 2.12; mixing versions causes ClassNotFoundExceptions.
- Timeline state required: the .hoodie/ directory in the table path must exist and be writable; Hudi doesn't auto-create it on all storage systems (see the pre-flight sketch below).
- Metastore sync timing: Hive metastore updates lag file commits; queries may see a stale schema if run immediately after a write.
- Partition pruning depends on proper table config (hoodie.table.type=COPY_ON_WRITE vs MERGE_ON_READ) and write parallelism; incorrect settings cause full-table scans.
- Java version: requires Java 8+, but Hive 3.1.3 / Hadoop 3.4.0 combinations may have undocumented conflicts.
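For the timeline-state and table-config gotchas, here is a small pre-flight sketch using the Hadoop FileSystem API that ships with Spark. The table path is illustrative; the .hoodie/hoodie.properties layout matches current Hudi releases but may differ in older ones.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import scala.io.Source

val spark = SparkSession.builder().appName("hudi-preflight-sketch").getOrCreate()

val basePath = new Path("/tmp/hudi_trips") // illustrative table path
val fs = basePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Timeline state: the table's timeline lives under <table>/.hoodie. If it is
// missing, the path is not an initialized Hudi table and reads/writes misbehave.
val hoodieDir = new Path(basePath, ".hoodie")
require(fs.exists(hoodieDir), s"no .hoodie directory under $basePath; not an initialized Hudi table")

// Table config: hoodie.properties records hoodie.table.type (COPY_ON_WRITE vs
// MERGE_ON_READ); check it instead of assuming, since it affects pruning and reads.
val src = Source.fromInputStream(fs.open(new Path(hoodieDir, "hoodie.properties")))
try src.getLines().filter(_.startsWith("hoodie.table.type")).foreach(println)
finally src.close()
```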
🏗️Architecture
💡Concepts to learn
- Log-Structured Merge Tree (LSM) — Hudi uses LSM principles in MERGE_ON_READ mode to avoid full table rewrites on upserts; understanding read-write tradeoffs is crucial for tuning performance
- Multi-Version Concurrency Control (MVCC) — Hudi's timeline metadata enables multiple concurrent readers and writers via version isolation; timeline snapshots define what each reader sees
- Copy-on-Write vs Merge-on-Read — Hudi's two table types trade write cost (CoW rewrites all affected files) vs read cost (MoR merges base + log on read); choosing correctly affects pipeline latency and storage efficiency
- DataSource V2 API — Hudi implements Spark's DataSource V2 protocol to enable native format('hudi') integration; understanding pushdown filters, column pruning, and partition pruning is needed to optimize Spark queries
- Parquet Bloom Filters and Statistics — Hudi leverages Parquet bloom filters for record pruning and file statistics for partition elimination; incorrect statistics cause full-table scans despite small result sets
- Append-only Timeline Log — Hudi stores all commits, inflight operations, and rollbacks in the .hoodie/ directory using append-only logs; this enables time-travel queries, point-in-time recovery, and non-destructive deletes (see the incremental-query sketch below)
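To ground the timeline and incremental-query concepts, here is a hedged sketch of an incremental read: ask Hudi for only the records committed after a chosen instant. The path and columns are illustrative; the option keys are Hudi's documented DataSource read configs, so double-check them against the version you run.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-incremental-sketch").getOrCreate()
val basePath = "/tmp/hudi_trips" // illustrative table path

// Every commit gets an instant time on the timeline; take the earliest one as
// the starting point for this demo pull.
val commits = spark.read.format("hudi").load(basePath)
  .select("_hoodie_commit_time").distinct()
  .orderBy("_hoodie_commit_time")
  .collect().map(_.getString(0))

// Incremental query: only rows written by commits after the begin instant.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", commits.head)
  .load(basePath)

incremental.select("_hoodie_commit_time", "uuid", "fare").show(false)
```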
🔗Related repos
- delta-io/delta — Direct competitor in the lakehouse format space; Delta Lake provides similar ACID transactions and time-travel, widely used with Spark, but Hudi has better Flink support
- apache/iceberg — Alternative open table format with similar upsert/delete capabilities; Iceberg focuses on spec-first design while Hudi emphasizes engineering flexibility
- apache/spark — Primary execution engine for Hudi; Hudi implements Spark's DataSource V2 API and depends on Spark's Catalyst optimizer
- apache/flink — Hudi's streaming ingestion path; the hudi-flink module provides StreamingSink for continuous upserts from Flink pipelines
- apache/kafka — Data source for Hudi ingestion; hudi-kafka-connect provides a native Kafka sink connector for event streaming into Hudi tables
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive CI workflow for Docker image builds and validation
The repo has multiple Docker Compose configurations (hadoop284, hadoop334, hadoop340 with various Hive/Spark versions) and build scripts (docker/build_docker_images.sh, docker/build_local_docker_images.sh) but no dedicated GitHub Actions workflow to validate Docker builds on PR. This prevents catching breaking changes to Dockerfiles or docker-compose files before merge. A new workflow should build and validate images for at least amd64 and arm64 architectures.
- [ ] Create .github/workflows/docker_build_validation.yml that triggers on changes to docker/ and Dockerfile
- [ ] Add build steps for docker/build_local_docker_images.sh with matrix strategy for amd64/arm64
- [ ] Validate docker-compose configurations parse correctly using docker-compose config validation
- [ ] Add sanity checks: verify image layers don't exceed size thresholds, basic container startup tests
- [ ] Document in .github/workflows/README.md when this workflow runs
Create integration tests for docker-compose demo environments
The repo contains 8 docker-compose files with different Hadoop/Hive/Spark versions and demo configurations (compaction-bootstrap.commands, compaction.commands, test-suite/) but no automated tests validating these environments work end-to-end. New contributors often use these for local development but can't verify their changes don't break demo setup.
- [ ] Create docker/tests/ directory with shell scripts to validate each docker-compose configuration
- [ ] Add tests that spin up the compose services and verify they are healthy (localhost:8080, HDFS port checks)
- [ ] Add demo workflow validation: run commands from docker/demo/compaction.commands and docker/demo/compaction-bootstrap.commands in test environment
- [ ] Create a GitHub Actions workflow (.github/workflows/docker_compose_integration_test.yml) to run these tests on docker/ and demo config changes
- [ ] Document test execution in docker/README.md
Add Maven artifact validation workflow for bundle consistency
The repo has 13 bundle JAR dependencies listed (hudi-spark-bundle, hudi-flink-bundle, hudi-hadoop-mr-bundle, etc.) with corresponding .txt files in dependencies/, and a maven_artifact_validation.yml workflow exists but appears incomplete. A new contributor should enhance this to validate bundle JARs don't have conflicting transitive dependencies across Spark 2.11, 2.12, 3.x variants.
- [ ] Review and complete .github/workflows/maven_artifact_validation.yml to parse dependencies/*.txt files
- [ ] Add validation logic to detect duplicate/conflicting dependency versions across bundle types (e.g., same library in spark2 vs spark3 bundles with different versions)
- [ ] Create a helper script in build-utils/ to analyze transitive dependency trees from each bundle
- [ ] Add PR check that fails if bundle conflicts detected, with clear output showing the conflicts
- [ ] Document expected bundle dependency structure in CONTRIBUTING.md or a new docs/bundles.md
🌿Good first issues
- Add integration tests for Docker Compose variants: The repo defines 6 compose files (Hadoop 2.8.4/3.3.4/3.4.0 with Spark 3.5.3 and Hive 2.3.10/3.1.3) but no CI job runs them; create a .github/workflows/ job that validates each compose file spins up and passes a simple upsert test.
- Document hudi-defaults.conf.template keys: conf/hudi-defaults.conf.template exists but has no README explaining which configs apply to Spark vs Flink vs Hive Sync; create docs/CONFIGURATION.md cross-referencing this file with real examples.
- Implement missing test coverage for timeline recovery: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/ has no tests for incomplete commit recovery (e.g., when inflight/ files exist but commit failed); add HoodieTimelineRecoveryTest.java with failure injection scenarios.
⭐Top contributors
- @voonhous — 28 commits
- @rahil-c — 9 commits
- @suryaprasanna — 8 commits
- @yihua — 8 commits
- @mailtoboggavarapu-coder — 7 commits
📝Recent commits
- 47bf4e4 — feat(flink): Wire Flink 2.1 nested Parquet readers into the Hudi read path (FLINK-35702) (#18700) (skywalker0618)
- 34e9c7c — test(schema): Add MOR log-only compaction tests for custom types (#18583) (voonhous)
- 63f721d — fix: Fix reflection ctor signature for AwsGlueCatalogSyncTool in HiveSyncContext (#18697) (KiteSoar)
- 87019a3 — fix(hive): Tolerate pruned ArrayWritable in nested BLOB projection (#18581) (voonhous)
- 4029560 — feat(flink): Backport Flink 2.1 nested Parquet column readers and INT64 timestamp dispatch (FLINK-35702) (#18636) (skywalker0618)
- c36a5f7 — fix(flink): Avoid emitting deletes for Flink source v2 batch reads (#18694) (cshuo)
- 91f341f — fix: filter EXTERNAL property in SparkCatalogMetaStoreClient.toCatalogTable (#18672) (prashantwason)
- 471bb48 — refactor: move checkpoint metadata lookup helper to hudi-common (#18489) (suryaprasanna)
- 127c6ee — feat(common): roll over commit metadata to clean (#18590) (kbuci)
- 4d0e9cd — fix(lance): prevent file splitting for Lance base files to avoid duplicate reads (#18678) (rahil-c)
🔒Security observations
- High · Docker Image Base Not Pinned to Specific Version — Dockerfile. The Dockerfile uses 'apachehudi/hudi-ci-bundle-validation-base:azure_ci_test_base_java11' without a full digest/hash. This allows for potential supply-chain attacks if the base image is updated with malicious content or unintended breaking changes; the tag behaves like an implicit 'latest'. Fix: Pin the base image to a specific digest (FROM apachehudi/hudi-ci-bundle-validation-base:azure_ci_test_base_java11@sha256:...) instead of relying on tags alone.
- Medium · Incomplete Docker Configuration — Dockerfile. The Dockerfile appears truncated (it ends with '# Set the working d'), which makes security analysis difficult and suggests the build process may not be fully documented or tested. Fix: Complete the Dockerfile with proper security configurations, including WORKDIR, a non-root USER, HEALTHCHECK, and sensible layer caching; ensure all content is present and properly formatted.
- Medium · Potential Exposed Environment Variables in Docker Compose — docker/compose/docker-compose_*.yml and docker/compose/hadoop.env. Multiple docker-compose files reference a hadoop.env file that may contain sensitive environment variables; credentials or API keys there could leak through logs, debug output, or unintended commits. Fix: Review hadoop.env for sensitive data, prefer Docker secrets over environment variables for secrets, add hadoop.env to .gitignore, and implement secret rotation policies.
- Medium · GitHub Workflow Scripts Using JavaScript — .github/workflows/azure_ci.js, .github/workflows/labeler.js. These custom workflow scripts run in the CI/CD pipeline with elevated privileges and require careful review for injection vulnerabilities. Fix: Audit the scripts for input validation and sanitization, command-injection risks, and dependency vulnerabilities; consider official GitHub Actions instead of custom scripts where possible, and require code review for workflow file changes.
- Medium · No Security Policy Visible — Root directory. The repository structure does not show a SECURITY.md or SECURITY.txt for reporting vulnerabilities responsibly, which matters for open-source projects. Fix: Create a SECURITY.md documenting the vulnerability-reporting process, supported versions receiving security updates, and contact information for security researchers.
- Low · Incomplete POM File Analysis — pom.xml (incomplete). The provided pom.xml is truncated ('<plu' suggests cut-off content), making it difficult to assess dependency versions and known vulnerabilities. Fix: Analyze the complete pom.xml, scan dependencies with the OWASP Dependency-Check Maven plugin, and run 'mvn versions:display-dependency-updates' to identify outdated dependencies.
- Low · Build Scripts Without Execute Permission Documentation — docker/*.sh. Shell scripts such as docker/build_docker_images.sh and docker/build_local_docker_images.sh exist but their security posture is unknown; they should be reviewed for injection risks and input validation. Fix: Review the scripts for proper variable quoting, validation of input parameters, and use of 'set -euo pipefail'; add comments documenting required permissions and dependencies.
- Low · Multiple Docker Compose Variants May Diverge — docker/compose/docker-compose_*.yml. Similar compose files with slight differences (amd64 vs arm64, different Hadoop/Hive/Spark versions) can drift and end up with inconsistent security postures across environments. Fix: Use compose overrides or a single file with environment-variable substitution, document the differences between variants, and apply the same security checks to all of them.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.