RepoPilot

pentaho/pentaho-kettle

Pentaho Data Integration ( ETL ) a.k.a Kettle


Mixed signals — read the receipts

Weakest axis — Use as dependency: Concerns

non-standard license (Other); no CI workflows detected

Fork & modify — Healthy

Has a license and tests — a clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit today
  • 25+ active contributors
  • Distributed ownership (top contributor 24% of recent commits)
  • License: Other
  • Tests present
  • Non-standard license (Other) — review terms
  • No CI workflows detected
What would change the summary?
  • Use as dependency: Concerns → Mixed if the license terms are clarified

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — it live-updates from the latest cached analysis.

Variant:
RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/pentaho/pentaho-kettle?axis=fork)](https://repopilot.app/r/pentaho/pentaho-kettle)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/pentaho/pentaho-kettle on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: pentaho/pentaho-kettle

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/pentaho/pentaho-kettle shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Mixed signals — read the receipts

  • Last commit today
  • 25+ active contributors
  • Distributed ownership (top contributor 24% of recent commits)
  • License: Other
  • Tests present
  • ⚠ Non-standard license (Other) — review terms
  • ⚠ No CI workflows detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live pentaho/pentaho-kettle repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/pentaho/pentaho-kettle.

What it runs against: a local clone of pentaho/pentaho-kettle — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in pentaho/pentaho-kettle | Confirms the artifact applies here, not a fork |
| 2 | License is still Other | Catches a relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>pentaho/pentaho-kettle</code></summary>
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of pentaho/pentaho-kettle. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/pentaho/pentaho-kettle.git
#   cd pentaho-kettle
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of pentaho/pentaho-kettle and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "pentaho/pentaho-kettle(\.git)?\b" \
  && ok "origin remote is pentaho/pentaho-kettle" \
  || miss "origin remote is not pentaho/pentaho-kettle (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Other\"" package.json 2>/dev/null) \
  && ok "license is Other" \
  || miss "license drift — was Other at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "README.md" \
  && ok "README.md" \
  || miss "missing critical file: README.md"
test -f "assemblies/pom.xml" \
  && ok "assemblies/pom.xml" \
  || miss "missing critical file: assemblies/pom.xml"
test -f "assemblies/client/pom.xml" \
  && ok "assemblies/client/pom.xml" \
  || miss "missing critical file: assemblies/client/pom.xml"
test -f ".github/CODEOWNERS" \
  && ok ".github/CODEOWNERS" \
  || miss "missing critical file: .github/CODEOWNERS"
test -f "pom.xml" \
  && ok "pom.xml" \
  || miss "missing critical file: pom.xml"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/pentaho/pentaho-kettle"
  exit 1
fi
```

Each check prints `ok:` or `FAIL:`. The script exits non-zero if anything failed, so it composes cleanly into agent loops (`./verify.sh || regenerate-and-retry`).

</details>

TL;DR

Pentaho Data Integration (Kettle) is an enterprise-grade, open-source ETL (Extract, Transform, Load) platform written in Java that lets data engineers design visual data pipelines without coding. It provides both a desktop GUI (using SWT) and a server-based execution engine for orchestrating complex data workflows across heterogeneous sources—databases, files, APIs—with built-in transformations, job scheduling, and monitoring capabilities. It is a monolithic multi-module Maven project: assemblies/ produces distribution packages (client ZIP, plugins, samples); core/ contains the transformation and job execution engine; ui/ wraps SWT for the GUI; engine/ and engine-ext/ provide runtime and extension points; plugins/ (see plugins/README.md) contains 30+ built-in steps/connectors; dbdialog/ abstracts database UI interaction; and integration/ holds cross-module tests. Configuration and sample jobs live in assemblies/samples/src/main/resources/jobs/.

👥Who it's for

Data engineers and ETL developers who need to build, test, and deploy data integration pipelines; system architects designing data warehouses; and DevOps teams deploying Kettle via the Carte server component. Contributors are typically Java engineers working on core transformation steps, plugin developers extending Kettle's connector ecosystem, and maintainers of the pentaho/pentaho-kettle project itself.

🌱Maturity & risk

Highly mature and production-ready: this is an active, long-established project (part of Pentaho's commercial suite) with version 11.1.0.0-SNAPSHOT indicating ongoing development. The codebase shows comprehensive unit test coverage, integration test suites (mvn verify -DrunITs), and a multi-module Maven structure across core, plugins, ui, and engine. Commits appear regular (based on active SNAPSHOT versioning), and the project maintains strict code quality standards (checkstyle enforcement).

Moderate organizational risk: Pentaho is owned by Hitachi Vantara, so roadmap and feature priorities depend on commercial decisions outside community control. Technical risks include the large monolithic codebase (46 MB+ of Java alone), which requires careful dependency management across 8+ modules, and the tight coupling between the SWT-based UI and the core transformation engine, which can complicate headless/containerized deployments. Java 11+ is a hard requirement, and the extensive plugin ecosystem creates a compatibility-testing burden.

Active areas of work

Active development toward version 11.1.0.0: the SNAPSHOT versioning and multi-layered assembly dependencies indicate ongoing feature work. Recent focus appears to be platform modernization (SWT version bumps: GTK 3.108, Win32 3.122, macOS ARM 3.122), Spark integration (visible in sample Spark Submit jobs), and plugin architecture (extensive assemblies/plugins module). No specific recent PR data visible in file list, but the maintained CODEOWNERS file and structured plugin system suggest organized governance.

🚀Get running

Clone and build with Maven 3+:

```bash
git clone https://github.com/pentaho/pentaho-kettle.git
cd pentaho-kettle
# Ensure Java 11 and Maven 3+ are installed
# Place settings.xml from https://raw.githubusercontent.com/pentaho/maven-parent-poms/master/maven-support-files/settings.xml in ~/.m2/
mvn clean install
```

Launch the desktop client from assemblies/client/target/pdi-ce-*-SNAPSHOT/ or run tests with mvn test.

Daily commands: This is a build-only project (no dev server); outputs are distributable packages. To test the GUI after build: ./assemblies/client/target/pdi-ce-*/bin/spoon.sh (Linux/Mac) or spoon.bat (Windows). To run the Carte server: ./carte.sh. To execute a transformation: ./pan.sh -file=myfile.ktr. Full build with tests: mvn verify -DrunITs.
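Collected as a copy-paste sequence, this is a sketch: it assumes the default build output location, and the carte.sh host/port arguments are an assumption to check against the launcher script's usage text.

```bash
# From the repo root: full build including integration tests (slow;
# see Traps & gotchas).
mvn verify -DrunITs

# The client distribution lands here; run the launchers from inside it.
cd assemblies/client/target/pdi-ce-*/

# Desktop GUI (Linux/macOS; spoon.bat on Windows).
./spoon.sh

# Carte server (the host/port arguments are an assumption; check the
# usage text printed by carte.sh).
./carte.sh 127.0.0.1 8081

# Execute a single transformation headlessly.
./pan.sh -file=myfile.ktr
```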

🗺️Map of the codebase

  • README.md — Entry point documenting project structure, build prerequisites (Maven 3+, Java 11), and the module hierarchy (assemblies, core, ui, engine, plugins, integration).
  • assemblies/pom.xml — Root Maven POM for assemblies module; controls distribution packaging and coordinates all sub-assemblies (client, lib, plugins, samples).
  • assemblies/client/pom.xml — Client distribution assembly definition; produces the main PDI executable package and entry point for end-users.
  • .github/CODEOWNERS — Defines code ownership and review responsibilities across modules; critical for understanding approval workflows in this large multi-team project.
  • pom.xml — Parent Maven POM (implied, version 11.1.0.0-SNAPSHOT); manages dependency versions, build plugins, and Java 11 compiler configuration across all PDI modules.
  • assemblies/samples/src/main/resources/jobs — Sample job and transformation templates (.kjb, .ktr files); reference implementations showing best practices for ETL workflows.
  • LICENSE.TXT — Legal licensing terms; essential for contributors to understand open-source obligations and redistribution rights.

🛠️How to make changes

Add a New PDI Step Plugin

  1. Create a new step plugin module under plugins/ following the naming convention (e.g., plugins/your-step-name/) (plugins/your-step-name/pom.xml)
  2. Implement the step class extending PDI's step base class and register it in the plugin manifest (plugins/your-step-name/src/main/java/YourStepName.java)
  3. Add plugin metadata and icon resource files to make the step discoverable by the UI (plugins/your-step-name/src/main/resources/plugin.xml)
  4. Update the parent pom.xml or plugins/pom.xml to include your new module in the build (assemblies/plugins/pom.xml)
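A shell-level sketch of steps 1 and 4; the module name your-step-name is a hypothetical placeholder, and the layout should be checked against an existing plugin under plugins/ before relying on it.

```bash
# Scaffold the hypothetical plugin module (names are placeholders).
mkdir -p plugins/your-step-name/src/main/java \
         plugins/your-step-name/src/main/resources

# Start from an existing small plugin's POM and edit artifactId/name by
# hand ("some-existing-plugin" is illustrative, not a real module name).
cp plugins/some-existing-plugin/pom.xml plugins/your-step-name/pom.xml

# After registering the module in the aggregator POM, build just this
# plugin plus its in-repo dependencies (-am = "also make" dependencies).
mvn -pl plugins/your-step-name -am clean install
```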

Add a Sample Transformation Workflow

  1. Create a new .ktr (Kettle transformation) file in the appropriate samples subdirectory (assemblies/samples/src/main/resources/transformations/Your-Sample-Name.ktr)
  2. If your sample requires a job orchestrator, create a corresponding .kjb file (assemblies/samples/src/main/resources/jobs/your-sample-folder/Your-Sample-Job.kjb)
  3. Document the sample in the samples assembly pom or add a README in the job/transformation folder (assemblies/samples/pom.xml)
  4. Add any required reference data or database scripts to the db/ or resources folder (assemblies/samples/src/main/resources/db/your-sample-data.script)
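A quick smoke test for a new sample, run with the Pan launcher from the unpacked client distribution; the sample path and name are the hypothetical ones from the steps above.

```bash
# Execute the new sample transformation headlessly; Pan exits non-zero
# on failure, so this composes into scripts and CI.
SAMPLE=/path/to/clone/assemblies/samples/src/main/resources/transformations/Your-Sample-Name.ktr
./pan.sh -file="$SAMPLE" && echo "sample OK" || echo "sample FAILED"
```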

Build and Package a Custom PDI Distribution

  1. Ensure all modules are defined in the Maven parent POM with correct dependency versions (pom.xml)
  2. Verify the client assembly includes all required dependencies and plugins (assemblies/client/src/assembly/assembly.xml)
  3. Run Maven clean install to compile all modules and assemble distributions (README.md)
  4. Find the packaged distribution in the target/ directory (output location post-build) (assemblies/client/pom.xml)
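The same flow as commands, assuming the pdi-ce-* artifact naming used elsewhere in this doc; whether the assembly also produces a ZIP is an assumption to verify.

```bash
# Compile all modules and assemble distributions (skip tests for a
# faster packaging-only run).
mvn clean install -DskipTests

# Locate the packaged client distribution.
ls -d assemblies/client/target/pdi-ce-*

# If the assembly also produces a ZIP, inspect it without extracting:
# unzip -l assemblies/client/target/pdi-ce-*.zip | head
```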

Configure Remote Carte Server for Distributed Execution

  1. Consult the Carte API documentation for remote job and transformation submission (CarteAPIDocumentation.md)
  2. Reference the Carte JMeter test suite for load-testing configuration patterns (Carte-jmeter.jmx)
  3. Set up server configuration in the Carte module (not enumerated; see core/engine dependencies) (core/)
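A hedged smoke test for a locally started Carte instance: the port and the cluster/cluster default credentials are assumptions to verify against CarteAPIDocumentation.md, while /kettle/status is Carte's commonly documented status endpoint.

```bash
# Start Carte from the unpacked client distribution (host/port usage
# assumed; check the script's help output).
./carte.sh 127.0.0.1 8081 &

# Give the server a moment to bind, then query its status over REST.
sleep 5
curl -u cluster:cluster "http://127.0.0.1:8081/kettle/status/?xml=Y"
```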

🔧Why these technologies

  • Maven 3+ — Enables multi-module project structure with centralized dependency management; critical for coordinating ~15+ sub-modules (core, ui, engine, plugins, assemblies) and consistent Java 11 compilation.
  • Java 11 JDK — Target runtime for PDI; provides modern language features (the module system, var, improved APIs) and the long-term support required for enterprise ETL workloads.
  • XML-based workflow files (.ktr, .kjb) — Domain-specific serialization format for transformations and jobs; enables visual editing in the UI and scripting/version-control without binary dependencies.
  • Plugin architecture — Allows extensibility without modifying core; decouples step implementations (database, file, API connectors) from the engine, enabling third-party contributions.
  • Carte remote server — Provides REST API for distributed job execution; enables cloud deployment, load balancing, and headless/command-line orchestration.

⚖️Trade-offs already made

  • Multi-module Maven build vs. monolithic JAR

    • Why: Modular structure allows independent plugin development and selective compilation; enables parallel builds and reduces rebuild time.
    • Consequence: Increases build complexity; requires careful dependency version management across ~15 modules to avoid diamond dependency and transitive conflicts.
  • XML serialization for workflows (.ktr, .kjb) vs. binary format

    • Why: Human-readable XML enables visual editor integration, version control diffing, and template generation; aligns with traditional ETL tools (Informatica, Talend).
    • Consequence: Larger file sizes; parsing overhead at runtime; XML schema drift risk if UI and engine versions diverge.
  • Plugin-based architecture for step implementations

    • Why: Decouples connector implementations (database, Salesforce, REST) from core; allows third-party plugins without rebuilding core.
    • Consequence: Plugin discovery and loading adds startup latency; version compatibility matrix between core and plugins becomes complex.
  • Carte server for remote execution vs. embedded engine

    • Why: Enables distributed execution, cloud deployment, and multi-tenant isolation; clients communicate over HTTP.
    • Consequence: Added network latency and serialization overhead; requires separate server deployment and monitoring.

🚫Non-goals (don't propose these)

  • Real-time streaming ETL (PDI is micro-batch/scheduled; Kafka/Spark Streaming are not integrated)
  • In-memory analytics or columnar storage (PDI is row-based OLTP-oriented; not optimized for OLAP)
  • Native machine learning model training (PDI delegates to external engines; no ML infrastructure built-in)
  • Windows-only or web-only deployment (PDI targets the cross-platform JVM; the desktop UI is SWT-based, not web-native)

🪤Traps & gotchas

  1. Maven settings.xml required: the build will fail without the Pentaho parent POM settings at ~/.m2/settings.xml (linked in the README).
  2. Platform-specific SWT binaries: different Windows/Linux/Mac architectures use different SWT JAR versions (win32-x86_64 vs. gtk.linux.x86_64); building on one OS and running on another causes native library mismatches.
  3. Java 11 hard requirement: the codebase uses modules and newer APIs; Java 8 or 17+ will break the build.
  4. Integration tests are slow: mvn verify -DrunITs spawns database containers and transformation instances and can take 30+ minutes.
  5. Plugin dependency order matters: plugins/ modules must be built before core consumes them; the Maven reactor may fail if plugin POM references are circular.
  6. Checkstyle strictness: minor formatting violations fail the build; always run mvn checkstyle:check before committing.
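Traps 1 and 6 can be pre-empted from the shell; a sketch (back up any existing ~/.m2/settings.xml first):

```bash
# Trap 1: install the Pentaho parent settings.xml before the first build
# (URL from the Get running section; this overwrites an existing file).
mkdir -p ~/.m2
curl -fsSL -o ~/.m2/settings.xml \
  https://raw.githubusercontent.com/pentaho/maven-parent-poms/master/maven-support-files/settings.xml

# Trap 6: run checkstyle locally before committing.
mvn checkstyle:check
```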

🏗️Architecture

💡Concepts to learn

  • Directed Acyclic Graph (DAG) Transformation Model — Kettle represents each transformation as a DAG of steps connected by data streams; understanding how steps consume/produce rows and how data flows through hop connections is essential to debugging transformation logic and reasoning about parallelization
  • Row-Based Streaming vs. Batch Processing — Kettle processes data as streams of rows in memory rather than batch files; this design choice affects buffering, memory usage, and how steps like aggregations must manage state—critical when optimizing for large datasets
  • Plugin Architecture with Classloader Isolation — Kettle's plugin system loads each connector/step in an isolated classloader to prevent dependency conflicts; understanding this prevents ClassNotFoundExceptions and guides how to package plugin JARs with transitive dependencies
  • Metadata-Driven Configuration (KTR/KJB XML DSL) — Transformations and jobs are declared as XML (KTR/KJB files); the metadata model drives code generation and execution, making it possible to serialize/version workflows as text and enable programmatic workflow generation
  • SWT (Standard Widget Toolkit) Cross-Platform GUI — Spoon uses SWT with platform-specific native binaries (GTK/Win32/Cocoa) for GUI rendering; understanding SWT's threading model (UI thread vs. worker threads) and platform-specific quirks is essential for debugging or extending the GUI
  • Carte REST Server Architecture — Kettle's Carte component exposes transformations as REST services for remote execution; understanding the request/response lifecycle is critical for deploying Kettle in containerized/microservice environments
  • Step Threading and Parallelization Strategy — Each Kettle step runs in its own thread with internal queues; data flows between steps asynchronously via row-based buffers, enabling parallelism but requiring careful handling of thread-safe state and deadlock prevention
  • pentaho/pentaho-commons-xul — Shared UI abstraction layer used by Pentaho tools; Kettle's SWT UI may depend on XUL for cross-platform UI code
  • pentaho/pentaho-platform — Pentaho's server platform; Kettle integrates with it for job scheduling, security, and metadata repository functionality
  • apache/hop — Modern fork/successor of Pentaho Kettle (as of 2020) with improved architecture; relevant for comparing design evolution and migration paths
  • pentaho/pentaho-metastore — Pentaho's lightweight metadata repository service; Kettle uses it to store transformation and job definitions
  • talend/tdi-studio-se — Talend's open-source ETL alternative; useful for benchmarking features and understanding the competitive ETL landscape

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for sample jobs and transformations in assemblies/samples

The repo contains 20+ sample .kjb (jobs) and .ktr (transformations) files in assemblies/samples/src/main/resources/jobs/ but there's no visible test suite validating these samples execute correctly. This is critical because samples are the first experience for new users and broken samples lead to poor adoption. Adding integration tests would catch regressions when core engine changes break sample workflows.

  • [ ] Create integration-tests module under integration/ specifically for sample validation (e.g., integration/sample-validation-tests/)
  • [ ] Write test cases in integration/sample-validation-tests/src/test/java/ that programmatically load and execute each sample from assemblies/samples/src/main/resources/jobs/
  • [ ] Add Maven configuration to integration/sample-validation-tests/pom.xml to run these tests as part of the build pipeline
  • [ ] Document in README.md how contributors can add new samples and ensure they include corresponding integration tests
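Before writing the Java test harness, the idea can be prototyped from the shell; a sketch that assumes the standard kitchen.sh job runner shipped in the client distribution:

```bash
# Run every sample job headlessly; kitchen.sh exits non-zero on failure.
REPO=/path/to/pentaho-kettle   # hypothetical clone location
find "$REPO/assemblies/samples/src/main/resources/jobs" -name '*.kjb' |
while read -r job; do
  # </dev/null stops the launcher from consuming the job list on stdin.
  ./kitchen.sh -file="$job" </dev/null >/dev/null 2>&1 \
    && echo "ok:   $job" \
    || echo "FAIL: $job"
done
```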

Create GitHub Actions workflow for platform-specific SWT binary validation

The pom.xml defines SWT dependencies for 5 different platform/architecture combinations (linux x86/x86_64, windows x86_64, macos x86_64/aarch64) with different versions (3.108.0, 3.115.100, 3.122.0). There's no visible CI workflow validating that these binaries are correctly downloaded, that version mismatches don't occur, and that builds succeed on each platform. Adding a matrix GitHub Actions workflow would prevent silent SWT compatibility issues.

  • [ ] Create .github/workflows/swt-matrix-build.yml with a matrix strategy testing [ubuntu-latest, windows-latest, macos-latest, macos-13] to validate SWT binary resolution
  • [ ] Add a build step that extracts and validates SWT JAR checksums match expected versions from pom.xml properties
  • [ ] Document in README.md's 'How to build' section which SWT versions are tested on which platforms and how to report platform-specific issues

Add unit tests for sample data generation scripts in assemblies/samples/src/main/resources/db/

The assemblies/samples/src/main/resources/db/ directory contains sampledata.properties and sampledata.script (likely H2 database scripts) but there's no test coverage validating these scripts produce the expected schema/data structure. New contributors may break sample data setup when modifying database connectivity or transformation logic. Unit tests would prevent this.

  • [ ] Create unit test class in integration/src/test/java/ (or new assemblies/samples/src/test/java/) named SampleDataSetupTest that executes sampledata.script and validates resulting tables/columns
  • [ ] Add assertions validating row counts and column types match what sample jobs expect (reference the sample .kjb files in assemblies/samples/src/main/resources/jobs/ to infer expected schema)
  • [ ] Document in CONTRIBUTING.md (if it exists, otherwise create it) that sample data modifications must include corresponding test updates
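A quick local sanity check for the script, assuming it is an H2 script as the idea suggests; the H2 jar path is a placeholder, and the RunScript invocation should be verified against your H2 version.

```bash
# Load sampledata.script into a throwaway in-memory H2 database; a
# non-zero exit means the script no longer parses/executes cleanly.
java -cp /path/to/h2.jar org.h2.tools.RunScript \
  -url "jdbc:h2:mem:sampledata" \
  -script assemblies/samples/src/main/resources/db/sampledata.script
```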

🌿Good first issues

  • Add unit test coverage for the dbdialog/ module — dbdialog/ exists in the file structure but no test files are visible for it; write integration tests for the database connection dialogs to catch SWT rendering issues early. dbdialog is critical for the database-configuration UX but appears under-tested, and tests will prevent regressions in future SWT updates.
  • Document the plugin.xml schema for plugins/ — the plugins README is minimal; create a schema guide and 2-3 annotated example plugin.xml files (e.g., for a simple CSV step) so new contributors can write plugins without reverse-engineering existing ones. Plugin authoring is a major extension point but is underdocumented; better docs will reduce onboarding time for ecosystem contributors.
  • Add an integration test for Carte (server) startup and its REST API — CarteAPIDocumentation.md exists, but no Carte startup/API integration tests are visible; write a test that starts Carte, submits a transformation via REST, and validates execution. Server-mode deployments are critical for production but under-tested; a test will catch breaking changes in the Carte REST API early.


📝Recent commits

  • eacd7e0 — Merge pull request #10508 from peterrinehart/BACKLOG-49854 (pentaho-whartman)
  • e44c0f3 — Merge pull request #10494 from kdarshana12/BACKLOG-48250 (hv-shiva)
  • 6d1bf95 — Merge pull request #10507 from addagudi/ConstantsAPI (addagudi)
  • 25f8d2a — [BACKLOG-49865] : Random credit card generator (addagudi)
  • cff3095 — test:[BACKLOG-49854] Updated the unit tests to avoid calling (peterrinehart)
  • bf5e24f — build[PPP-6390][PPP-6391]: use maven-parent-pom bouncycastle managed version (#10501) (befc)
  • 9e01132 — Merge pull request #10491 from pentaho/abryant/BACKLOG-48954 (pdesai16)
  • 048f18b — [BACKLOG-48954] enhance RunConfiguration Manager lifecycle (abryant-hv)
  • 5822900 — Merge pull request #10485 from tgf/blg-49333_s3VfsVarsIssues (pdesai16)
  • 46cde59 — Merge pull request #10500 from pentaho/BACKLOG-49310-1 (varuntangirala)

🔒Security observations

  • High · Incomplete Dependency Analysis — assemblies/client/pom.xml (and other pom.xml files). The provided pom.xml is truncated and incomplete. The file cuts off mid-dependency declaration, making it impossible to perform a complete audit of all project dependencies. This prevents identification of known vulnerable packages and their versions. Fix: Provide complete pom.xml files and run 'mvn dependency:tree' and 'mvn dependency-check:check' to identify known vulnerabilities in dependencies. Use tools like OWASP Dependency-Check or Snyk for continuous monitoring.
  • Medium · Development Snapshot Versions in Production — assemblies/client/pom.xml - parent version and all dependency versions. The project uses snapshot versions (e.g., '11.1.0.0-SNAPSHOT') for parent POM and dependencies. Snapshot versions are not immutable and can be overwritten, potentially allowing supply chain attacks or unintended behavior changes in production builds. Fix: Use release versions for production builds. Reserve snapshot versions only for development/CI environments. Implement version pinning and lock files for reproducible builds.
  • Medium · Wildcard Exclusions in Dependencies — assemblies/client/pom.xml - lines with pdi-static and pdi-plugins dependencies. The pom.xml uses wildcard exclusions (<exclusion><groupId></groupId><artifactId></artifactId></exclusion>) for pdi-static and pdi-plugins dependencies. This broadly excludes all transitive dependencies without validation, which could hide security updates and cause unexpected behavior. Fix: Replace wildcard exclusions with specific, named exclusions. Document why each dependency is excluded. Regularly audit to ensure necessary security patches aren't being excluded.
  • Medium · Outdated or Potentially Vulnerable SWT Dependencies — assemblies/client/pom.xml - SWT version properties. The project uses Eclipse SWT libraries with specific versions (3.108.0, 3.115.100, 3.122.0) that may be outdated. SWT is a native GUI library with a history of security issues. Without knowing the build date and current CVE status, these versions should be verified. Fix: Cross-reference all SWT versions with the official Eclipse CVE database. Update to the latest stable versions. Monitor Eclipse security advisories regularly.
  • Medium · No Evidence of Security Headers Configuration — Project root and core modules. Based on the visible file structure (README.md, assemblies, configurations), there is no visible evidence of security header configuration or web security policy implementation for any embedded web interfaces or REST APIs that Kettle may expose. Fix: If the application exposes any HTTP interfaces (REST API, web UI), implement security headers (CSP, X-Frame-Options, X-Content-Type-Options, HSTS, etc.). Review the engine and UI modules for any HTTP server implementations.
  • Medium · No Visible Authentication/Authorization Review — core, engine, ui modules (not provided for review). The README and visible structure do not indicate explicit security context, authentication mechanisms, or authorization strategies. For an ETL tool handling data, this is a significant concern. Fix: Perform a comprehensive security audit of authentication and authorization mechanisms. Implement role-based access control (RBAC) or attribute-based access control (ABAC). Add multi-factor authentication (MFA) support where applicable.
  • Low · Sample Files with Potential Sensitive Data Patterns — assemblies/samples/src/main/resources/. The codebase includes numerous sample transformation and job files (sampledata.properties, sampledata.script) which may contain patterns that could be exploited or reveal system architecture. No encryption is evident for sample configuration files. Fix: Review all sample files to ensure no real credentials, API keys, or sensitive connection strings are included. Use placeholder values (e.g., 'localhost', 'exampleuser'). Add warnings in documentation about securing sample files before deployment.
  • Low · Missing Security Policy and CODEOWNERS Configuration — .github/CODEOWNERS, .github/ directory. While a CODEOWNERS file exists, no SECURITY.md policy is visible, and no .github configuration for security scanning, branch protection rules, or security workflows is evident. Fix: Create a SECURITY.md disclosure policy, and add security scanning workflows and branch protection rules under .github/.
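The dependency checks named in the first observation can be run locally; a sketch, where the OWASP plugin coordinates are the commonly published ones and should be verified:

```bash
# Resolve the full dependency graph (surfaces conflicting versions).
mvn dependency:tree > dep-tree.txt

# Scan dependencies for known CVEs with OWASP Dependency-Check.
mvn org.owasp:dependency-check-maven:check
```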

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
