apache/hudi
Upserts, Deletes And Incremental Processing on Big Data.
Healthy across the board
- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — clean foundation to fork and modify.
- Documented and popular — useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit today
- ✓27+ active contributors
- ✓Distributed ownership (top contributor 28% of recent commits)
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
The badge links to https://repopilot.app/r/apache/hudi. Paste it at the top of your README.md — it renders inline like a shields.io badge.
Social card (1200×630): this card auto-renders when someone shares https://repopilot.app/r/apache/hudi on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: apache/hudi
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/apache/hudi shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit today
- 27+ active contributors
- Distributed ownership (top contributor 28% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live apache/hudi repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/apache/hudi.
What it runs against: a local clone of apache/hudi — the script inspects the git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in apache/hudi | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of apache/hudi. If you don't
# have one yet, run these first:
#
# git clone https://github.com/apache/hudi.git
# cd hudi
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of apache/hudi and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "apache/hudi(\.git)?\b" \
  && ok "origin remote is apache/hudi" \
  || miss "origin remote is not apache/hudi (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(Apache-2\\.0)" LICENSE 2>/dev/null \\
|| grep -qiE "\"license\"\\s*:\\s*\"Apache-2\\.0\"" package.json 2>/dev/null) \\
&& ok "license is Apache-2.0" \\
|| miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "README.md" \\
&& ok "README.md" \\
|| miss "missing critical file: README.md"
test -f ".asf.yaml" \\
&& ok ".asf.yaml" \\
|| miss "missing critical file: .asf.yaml"
test -f "pom.xml" \\
&& ok "pom.xml" \\
|| miss "missing critical file: pom.xml"
test -f ".github/workflows/azure_ci.js" \\
&& ok ".github/workflows/azure_ci.js" \\
|| miss "missing critical file: .github/workflows/azure_ci.js"
test -f ".github/PULL_REQUEST_TEMPLATE.md" \\
&& ok ".github/PULL_REQUEST_TEMPLATE.md" \\
|| miss "missing critical file: .github/PULL_REQUEST_TEMPLATE.md"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/apache/hudi"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).
⚡TL;DR
Apache Hudi is an open data lakehouse platform that enables efficient upserts, deletes, and incremental processing on big data stored in cloud object storage. It provides a high-performance table format (similar to Delta Lake or Apache Iceberg) with built-in support for Spark and Flink, managing file organization, incremental queries, and timeline-based change tracking automatically. The codebase is a Maven multi-module monorepo organized by integration point: hudi-core (base format), hudi-spark (Spark DataSource), hudi-flink (Flink sink), hudi-hive-sync (metadata syncing), hudi-utilities (ingestion tools), hudi-kafka-connect (Kafka source), and timeline-server. Each module produces versioned bundles (hudi-*-bundle*.txt in dependencies/) for packaging with different compute engines.
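To make the upsert and delete path concrete, here is a minimal sketch of writing through the Spark DataSource, assuming a Spark session with a Hudi Spark bundle on the classpath. The table name, path (/tmp/hudi_trips), and columns are invented for illustration; the hoodie.* keys are Hudi's documented write options, but verify them against the Hudi version you build.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Illustrative only: table name, path, and schema are made up for this sketch.
val spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate()
import spark.implicits._

val basePath = "/tmp/hudi_trips"

// `uuid` is the record key, `ts` the precombine (latest-wins) field, `city` the partition path.
val upserts = Seq(
  ("id-1", "2026-05-09 10:00:00", 12.5, "sf"),
  ("id-2", "2026-05-09 10:05:00", 7.3, "nyc")
).toDF("uuid", "ts", "fare", "city")

upserts.write.format("hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert") // also: "insert", "bulk_insert", "delete"
  .mode(SaveMode.Append)
  .save(basePath)

// Deletes reuse the same path; only the keys of the rows to remove matter.
upserts.filter($"uuid" === "id-2").write.format("hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "delete")
  .mode(SaveMode.Append)
  .save(basePath)
```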
👥Who it's for
Data engineers and lakehouse architects using Apache Spark or Flink who need ACID transactions (insert/update/delete), incremental data processing, and efficient data versioning without managing Parquet/ORC files manually. Also relevant for organizations migrating from traditional data warehouses to cloud storage.
🌱Maturity & risk
Highly mature and production-ready. Apache Software Foundation project with active CI/CD (Azure Pipelines, GitHub Actions), 31.8M lines of Java code, Docker multi-version testing (Hadoop 2.8.4 through 3.4.0, Spark 3.5.3, Hive 3.1.3), and clear Maven artifact publication. Actively developed with recent infrastructure updates (azure-pipelines-20230430.yml, GitHub workflow updates in .github/workflows/).
Low technical risk but high complexity. Large multi-language codebase (Java 31.8M LOC, Scala 7.9M LOC) creates maintenance burden; multiple Scala/Spark versions (2.11, 2.12, Spark3) require careful dependency management. No evident single-maintainer risk due to ASF governance, but wide dependency surface (12+ bundle types: Flink, Hive, Kafka Connect, Presto) means upgrading dependencies requires extensive testing across integrations.
Active areas of work
Version 1.3.0-SNAPSHOT active development with focus on multi-engine support (Flink/Spark 3.5.3 docker images in compose/, Hadoop 3.4.0 compatibility). Recent work on PR compliance automation (.github/workflows/pr_compliance.yml, pr_title_validation.yml) and release candidate validation suggests preparation for major release. Kafka Connect and Presto bundles indicate ecosystem expansion.
🚀Get running
git clone https://github.com/apache/hudi.git
cd hudi
mvn clean install -DskipTests=true # Build all modules with Maven 3.6+, requires Java 8+
# For Docker testing: docker-compose -f docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml up
Daily commands:
No single dev server; build modules and test with spark-shell or spark-submit: mvn -pl hudi-spark,hudi-common -am install && spark-shell --jars target/hudi-spark-1.3.0-SNAPSHOT.jar. For end-to-end: docker-compose -f docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml up provides Hadoop NameNode, Hive Metastore, Spark, and Hudi preconfigured.
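Building on the spark-shell command above, here is a small sketch of reading a table back with a snapshot query. It assumes spark-shell (where spark is already defined) and the hypothetical /tmp/hudi_trips table from the TL;DR sketch; column names are illustrative.

```scala
// Snapshot query: the latest committed view of the table.
val snapshot = spark.read.format("hudi").load("/tmp/hudi_trips")
snapshot.createOrReplaceTempView("hudi_trips")
spark.sql("SELECT uuid, city, fare FROM hudi_trips ORDER BY fare DESC").show(false)

// Hudi adds commit-metadata columns (e.g. _hoodie_commit_time) that show which
// instant on the timeline produced each row; handy when debugging write behavior.
snapshot.select("_hoodie_commit_time", "uuid").show(false)
```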
🗺️Map of the codebase
- README.md — Entry point for understanding Hudi's core mission: an open data lakehouse platform for upserts, deletes, and incremental processing on big data.
- .asf.yaml — Apache Software Foundation metadata; defines project governance, release process, and contribution standards that all maintainers must follow.
- pom.xml — Maven root POM defining all dependencies, build profiles, and module organization across 600+ files; critical for understanding the build system.
- .github/workflows/azure_ci.js — Primary CI/CD pipeline configuration; defines how all code is tested and validated before merge.
- .github/PULL_REQUEST_TEMPLATE.md — PR submission template enforcing standards for commits, testing, and documentation that every contributor must follow.
- conf/hudi-defaults.conf.template — Default configuration schema; documents all tunable parameters and serves as a reference for runtime behavior.
- Dockerfile — Container image definition; shows how the Hudi runtime environment is packaged for deployment and demos.
🛠️How to make changes
Add a new Hudi runtime integration (e.g., new Spark version)
- Define a new module with dependencies in the root pom.xml, adding a new <module> entry for hudi-spark-x-bundle (pom.xml)
- Create a bundle dependency list in the dependencies directory following the naming convention, e.g. hudi-spark3-bundle_2.13.txt (dependencies/hudi-spark3-bundle_2.12.txt)
- Add a new Docker Compose environment for integration testing with the new runtime version (docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml)
- Create a test configuration YAML in demo/config/test-suite referencing the new bundle (docker/demo/config/test-suite/simple-deltastreamer.yaml)
- Update the CI/CD workflow to run integration tests against the new runtime configuration (.github/workflows/azure_ci.js)
Add a new integration test scenario
- Create a new test YAML configuration in docker/demo/config/test-suite/ following the naming convention, e.g. feature-name.yaml (docker/demo/config/test-suite/simple-deltastreamer.yaml)
- Reference an existing Avro schema, or create a new one in the same directory if a different record structure is needed (docker/demo/config/test-suite/source.avsc)
- Reference or extend base properties from docker/demo/config/base.properties for common settings (docker/demo/config/base.properties)
- Add a scheduled test entry in .github/workflows/scheduled_workflow.yml to run nightly (.github/workflows/scheduled_workflow.yml)
Contribute a bug fix or feature following Apache guidelines
- Create a GitHub issue using the appropriate template (bug/feature/improvement) in .github/ISSUE_TEMPLATE/ (.github/ISSUE_TEMPLATE/hudi_bug.yml)
- Fork the repo, create a feature branch, and ensure code follows the Apache project standards outlined in .asf.yaml (.asf.yaml)
- Ensure the PR follows the checklist and formatting in PULL_REQUEST_TEMPLATE.md before submission (.github/PULL_REQUEST_TEMPLATE.md)
- Add integration tests using existing test-suite patterns in docker/demo/config/test-suite/ (docker/demo/config/test-suite/simple-deltastreamer.yaml)
- Verify CI/CD passes via the azure_ci.js workflow before requesting review (.github/workflows/azure_ci.js)
🔧Why these technologies
- Apache Spark, Flink, Hadoop MR — Hudi supports multiple distributed compute frameworks; dependencies are pre-bundled (hudi-spark-bundle, hudi-flink-bundle, hudi-hadoop-mr-bundle) to ensure compatibility across cloud platforms
- Avro schema format — Data schema definition standard used across demo configs (hoodie-schema.avsc, source.avsc, complex-source.avsc) for record serialization and compatibility
- Docker Compose + multi-version stacks — Integration testing requires specific combinations of Hadoop, Hive, and Spark versions; docker-compose files pre-define validated stacks (Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1, etc.)
- Azure CI/CD pipeline — Primary continuous integration orchestrated via azure_ci.js; validates all PRs against bundled dependencies and integration test suites
- Apache Maven multi-module build — Coordinates ~600 files across modular components (core, spark-bundle, flink-bundle, utilities) with unified dependency management in root pom.xml
⚖️Trade-offs already made
- Pre-built dependency bundles (hudi-spark-bundle_2.12.txt, etc.) instead of inline pom.xml
  - Why: Reduces JAR conflicts when deploying to heterogeneous cluster environments with pre-existing Spark/Hadoop installations
  - Consequence: Bundle lists must be manually maintained; out-of-sync bundles can cause runtime failures. Requires additional validation in CI/CD
- Support multiple Spark versions (2.11, 2.12, 3.x) and Scala versions simultaneously
  - Why: Enables adoption across enterprises with legacy Spark 2.x and modern Spark 3.x clusters
  - Consequence: Increased build complexity, larger codebase surface, and more integration test matrix coverage needed
- Both Copy-on-Write (CoW) and Merge-on-Read (MoR) table formats (see the sketch after this list)
  - Why: CoW favors query performance; MoR favors write throughput. Users choose based on workload (test configs: cow-large-scale-sanity.yaml vs mor-large-scale-sanity.yaml)
  - Consequence: Implementation must support two distinct code paths; increased testing burden and documentation requirements
- Async compaction as optional background job (mor-async-compact.yaml)
  - Why: Prevents write stalls in high-frequency ingestion scenarios
  - Consequence: Query consistency during active compaction requires careful timeline management; risk of stale read visibility windows
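As a companion to the CoW/MoR and compaction trade-offs above, here is a hedged sketch of how the choice surfaces in write options. The events DataFrame, table name, and path are illustrative; hoodie.datasource.write.table.type and hoodie.compact.inline are documented Hudi configs, but confirm the exact keys and defaults for the version in use.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-table-type-sketch").getOrCreate()
import spark.implicits._

// Hypothetical events; event_id is the record key, ts the precombine field.
val events = Seq(("e-1", 1715250000L, "click"), ("e-2", 1715250060L, "view"))
  .toDF("event_id", "ts", "kind")

// Table type is fixed when the table is first created. COPY_ON_WRITE (the default)
// rewrites base files on update; MERGE_ON_READ appends log files and merges on read.
events.write.format("hudi")
  .option("hoodie.table.name", "events_mor")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // MoR compaction can run inline with each write or as an async background job;
  // async avoids write stalls at the cost of timeline/visibility management.
  .option("hoodie.compact.inline", "false")
  .mode("append")
  .save("/tmp/events_mor")
```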
🚫Non-goals (don't propose these)
- Real-time sub-second query latency (Hudi is optimized for eventual consistency in batch/incremental scenarios)
- Full ACID transaction isolation levels beyond snapshot isolation (timeline-based versioning provides eventual consistency, not strict serializability)
- Standalone query engine (Hudi delegates query execution to Spark, Presto, Hive)
- Schema evolution beyond Avro type system (no full schema migration framework)
- Non-distributed, single-node execution (Hudi assumes a distributed engine such as Spark or Flink)
🪤Traps & gotchas
- Multi-Scala version complexity: hudi-spark bundles exist for Scala 2.11 and 2.12; mixing versions causes ClassNotFoundExceptions.
- Timeline state required: the .hoodie/ directory in the table path must exist and be writable; Hudi doesn't auto-create it on all storage systems (see the pre-flight sketch below).
- Metastore sync timing: Hive metastore updates lag file commits; queries may see a stale schema if run immediately after a write.
- Partition pruning depends on proper table config (hoodie.table.type=COPY_ON_WRITE vs MERGE_ON_READ) and write parallelism; incorrect settings cause full-table scans.
- Java version: requires Java 8+, but Hive 3.1.3 / Hadoop 3.4.0 combinations may have undocumented conflicts.
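For the timeline-state and table-config gotchas, here is a small pre-flight sketch using the Hadoop FileSystem API that ships with Spark. The table path is illustrative; the .hoodie/hoodie.properties layout matches current Hudi releases but may differ in older ones.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import scala.io.Source

val spark = SparkSession.builder().appName("hudi-preflight-sketch").getOrCreate()

val basePath = new Path("/tmp/hudi_trips") // illustrative table path
val fs = basePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Timeline state: the table's timeline lives under <table>/.hoodie. If it is
// missing, the path is not an initialized Hudi table and reads/writes misbehave.
val hoodieDir = new Path(basePath, ".hoodie")
require(fs.exists(hoodieDir), s"no .hoodie directory under $basePath; not an initialized Hudi table")

// Table config: hoodie.properties records hoodie.table.type (COPY_ON_WRITE vs
// MERGE_ON_READ); check it instead of assuming, since it affects pruning and reads.
val src = Source.fromInputStream(fs.open(new Path(hoodieDir, "hoodie.properties")))
try src.getLines().filter(_.startsWith("hoodie.table.type")).foreach(println)
finally src.close()
```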
🏗️Architecture
💡Concepts to learn
- Log-Structured Merge Tree (LSM) — Hudi uses LSM principles in MERGE_ON_READ mode to avoid full table rewrites on upserts; understanding read-write tradeoffs is crucial for tuning performance
- Multi-Version Concurrency Control (MVCC) — Hudi's timeline metadata enables multiple concurrent readers and writers via version isolation; timeline snapshots define what each reader sees
- Copy-on-Write vs Merge-on-Read — Hudi's two table types trade write cost (CoW rewrites all affected files) vs read cost (MoR merges base + log on read); choosing correctly affects pipeline latency and storage efficiency
- DataSource V2 API — Hudi implements Spark's DataSource V2 protocol to enable native format('hudi') integration; understanding pushdown filters, column pruning, and partition pruning is needed to optimize Spark queries
- Parquet Bloom Filters and Statistics — Hudi leverages Parquet bloom filters for record pruning and file statistics for partition elimination; incorrect statistics cause full-table scans despite small result sets
- Append-only Timeline Log — Hudi stores all commits, inflight operations, and rollbacks in the .hoodie/ directory using append-only logs; this enables time-travel queries, point-in-time recovery, and non-destructive deletes (see the incremental-query sketch below)
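To ground the timeline and incremental-query concepts, here is a hedged sketch of an incremental read: ask Hudi for only the records committed after a chosen instant. The path and columns are illustrative; the option keys are Hudi's documented DataSource read configs, so double-check them against the version you run.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-incremental-sketch").getOrCreate()
val basePath = "/tmp/hudi_trips" // illustrative table path

// Every commit gets an instant time on the timeline; take the earliest one as
// the starting point for this demo pull.
val commits = spark.read.format("hudi").load(basePath)
  .select("_hoodie_commit_time").distinct()
  .orderBy("_hoodie_commit_time")
  .collect().map(_.getString(0))

// Incremental query: only rows written by commits after the begin instant.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", commits.head)
  .load(basePath)

incremental.select("_hoodie_commit_time", "uuid", "fare").show(false)
```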
🔗Related repos
- delta-io/delta — Direct competitor in the lakehouse format space; Delta Lake provides similar ACID transactions and time-travel, widely used with Spark, but Hudi has better Flink support
- apache/iceberg — Alternative open table format with similar upsert/delete capabilities; Iceberg focuses on spec-first design while Hudi emphasizes engineering flexibility
- apache/spark — Primary execution engine for Hudi; Hudi implements Spark's DataSource V2 API and depends on Spark's Catalyst optimizer
- apache/flink — Hudi's streaming ingestion path; the hudi-flink module provides StreamingSink for continuous upserts from Flink pipelines
- apache/kafka — Data source for Hudi ingestion; hudi-kafka-connect provides a native Kafka sink connector for event streaming into Hudi tables
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive CI workflow for Docker image builds and validation
The repo has multiple Docker Compose configurations (hadoop284, hadoop334, hadoop340 with various Hive/Spark versions) and build scripts (docker/build_docker_images.sh, docker/build_local_docker_images.sh) but no dedicated GitHub Actions workflow to validate Docker builds on PR. This prevents catching breaking changes to Dockerfiles or docker-compose files before merge. A new workflow should build and validate images for at least amd64 and arm64 architectures.
- [ ] Create .github/workflows/docker_build_validation.yml that triggers on changes to docker/ and Dockerfile
- [ ] Add build steps for docker/build_local_docker_images.sh with matrix strategy for amd64/arm64
- [ ] Validate docker-compose configurations parse correctly using docker-compose config validation
- [ ] Add sanity checks: verify image layers don't exceed size thresholds, basic container startup tests
- [ ] Document in .github/workflows/README.md when this workflow runs
Create integration tests for docker-compose demo environments
The repo contains 8 docker-compose files with different Hadoop/Hive/Spark versions and demo configurations (compaction-bootstrap.commands, compaction.commands, test-suite/) but no automated tests validating these environments work end-to-end. New contributors often use these for local development but can't verify their changes don't break demo setup.
- [ ] Create docker/tests/ directory with shell scripts to validate each docker-compose configuration
- [ ] Add tests that spin up the compose services and verify they are healthy (localhost:8080, HDFS port checks)
- [ ] Add demo workflow validation: run commands from docker/demo/compaction.commands and docker/demo/compaction-bootstrap.commands in test environment
- [ ] Create a GitHub Actions workflow (.github/workflows/docker_compose_integration_test.yml) to run these tests on docker/ and demo config changes
- [ ] Document test execution in docker/README.md
Add Maven artifact validation workflow for bundle consistency
The repo has 13 bundle JAR dependencies listed (hudi-spark-bundle, hudi-flink-bundle, hudi-hadoop-mr-bundle, etc.) with corresponding .txt files in dependencies/, and a maven_artifact_validation.yml workflow exists but appears incomplete. A new contributor should enhance this to validate bundle JARs don't have conflicting transitive dependencies across Spark 2.11, 2.12, 3.x variants.
- [ ] Review and complete .github/workflows/maven_artifact_validation.yml to parse dependencies/*.txt files
- [ ] Add validation logic to detect duplicate/conflicting dependency versions across bundle types (e.g., same library in spark2 vs spark3 bundles with different versions)
- [ ] Create a helper script in build-utils/ to analyze transitive dependency trees from each bundle
- [ ] Add PR check that fails if bundle conflicts detected, with clear output showing the conflicts
- [ ] Document expected bundle dependency structure in CONTRIBUTING.md or a new docs/bundles.md
🌿Good first issues
- Add integration tests for Docker Compose variants: The repo defines 6 compose files (Hadoop 2.8.4/3.3.4/3.4.0 with Spark 3.5.3 and Hive 2.3.10/3.1.3) but no CI job runs them; create a .github/workflows/ job that validates each compose file spins up and passes a simple upsert test.
- Document hudi-defaults.conf.template keys: conf/hudi-defaults.conf.template exists but has no README explaining which configs apply to Spark vs Flink vs Hive Sync; create docs/CONFIGURATION.md cross-referencing this file with real examples.
- Implement missing test coverage for timeline recovery: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/ has no tests for incomplete commit recovery (e.g., when inflight/ files exist but commit failed); add HoodieTimelineRecoveryTest.java with failure injection scenarios.
⭐Top contributors
- @voonhous — 28 commits
- @rahil-c — 9 commits
- @suryaprasanna — 8 commits
- @yihua — 8 commits
- @mailtoboggavarapu-coder — 7 commits
📝Recent commits
- 47bf4e4 — feat(flink): Wire Flink 2.1 nested Parquet readers into the Hudi read path (FLINK-35702) (#18700) (skywalker0618)
- 34e9c7c — test(schema): Add MOR log-only compaction tests for custom types (#18583) (voonhous)
- 63f721d — fix: Fix reflection ctor signature for AwsGlueCatalogSyncTool in HiveSyncContext (#18697) (KiteSoar)
- 87019a3 — fix(hive): Tolerate pruned ArrayWritable in nested BLOB projection (#18581) (voonhous)
- 4029560 — feat(flink): Backport Flink 2.1 nested Parquet column readers and INT64 timestamp dispatch (FLINK-35702) (#18636) (skywalker0618)
- c36a5f7 — fix(flink): Avoid emitting deletes for Flink source v2 batch reads (#18694) (cshuo)
- 91f341f — fix: filter EXTERNAL property in SparkCatalogMetaStoreClient.toCatalogTable (#18672) (prashantwason)
- 471bb48 — refactor: move checkpoint metadata lookup helper to hudi-common (#18489) (suryaprasanna)
- 127c6ee — feat(common): roll over commit metadata to clean (#18590) (kbuci)
- 4d0e9cd — fix(lance): prevent file splitting for Lance base files to avoid duplicate reads (#18678) (rahil-c)
🔒Security observations
- High · Docker Image Base Not Pinned to Specific Version — Dockerfile. The Dockerfile uses 'apachehudi/hudi-ci-bundle-validation-base:azure_ci_test_base_java11' without a full digest/hash. This allows for potential supply-chain attacks if the base image is updated with malicious content or unintended breaking changes; the tag behaves like an implicit 'latest'. Fix: Pin the base image to a specific digest (FROM apachehudi/hudi-ci-bundle-validation-base:azure_ci_test_base_java11@sha256:...) instead of relying on tags alone.
- Medium · Incomplete Docker Configuration — Dockerfile. The Dockerfile appears truncated (it ends with '# Set the working d'), which makes security analysis difficult and suggests the build process may not be fully documented or tested. Fix: Complete the Dockerfile with proper security configurations, including WORKDIR, a non-root USER, HEALTHCHECK, and sensible layer caching; ensure all content is present and properly formatted.
- Medium · Potential Exposed Environment Variables in Docker Compose — docker/compose/docker-compose_*.yml and docker/compose/hadoop.env. Multiple docker-compose files reference a hadoop.env file that may contain sensitive environment variables; credentials or API keys there could leak through logs, debug output, or unintended commits. Fix: Review hadoop.env for sensitive data, prefer Docker secrets over environment variables for secrets, add hadoop.env to .gitignore, and implement secret rotation policies.
- Medium · GitHub Workflow Scripts Using JavaScript — .github/workflows/azure_ci.js, .github/workflows/labeler.js. These custom workflow scripts run in the CI/CD pipeline with elevated privileges and require careful review for injection vulnerabilities. Fix: Audit the scripts for input validation and sanitization, command-injection risks, and dependency vulnerabilities; consider official GitHub Actions instead of custom scripts where possible, and require code review for workflow file changes.
- Medium · No Security Policy Visible — Root directory. The repository structure does not show a SECURITY.md or SECURITY.txt for reporting vulnerabilities responsibly, which matters for open-source projects. Fix: Create a SECURITY.md documenting the vulnerability-reporting process, supported versions receiving security updates, and contact information for security researchers.
- Low · Incomplete POM File Analysis — pom.xml (incomplete). The provided pom.xml is truncated ('<plu' suggests cut-off content), making it difficult to assess dependency versions and known vulnerabilities. Fix: Analyze the complete pom.xml, scan dependencies with the OWASP Dependency-Check Maven plugin, and run 'mvn versions:display-dependency-updates' to identify outdated dependencies.
- Low · Build Scripts Without Execute Permission Documentation — docker/*.sh. Shell scripts such as docker/build_docker_images.sh and docker/build_local_docker_images.sh exist but their security posture is unknown; they should be reviewed for injection risks and input validation. Fix: Review the scripts for proper variable quoting, validation of input parameters, and use of 'set -euo pipefail'; add comments documenting required permissions and dependencies.
- Low · Multiple Docker Compose Variants May Diverge — docker/compose/docker-compose_*.yml. Similar compose files with slight differences (amd64 vs arm64, different Hadoop/Hive/Spark versions) can drift and end up with inconsistent security postures across environments. Fix: Use compose overrides or a single file with environment-variable substitution, document the differences between variants, and apply the same security checks to all of them.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.