apache/seatunnel
SeaTunnel is a multimodal, high-performance, distributed, massive data integration tool.
Healthy across the board
- Permissive license, no critical CVEs, actively maintained — safe to depend on (weakest axis).
- Has a license and CI — a clean foundation to fork and modify.
- Documented and popular — a useful reference codebase to read through.
- No critical CVEs and a sane security posture — runnable as-is.
- ✓ Last commit today
- ✓ 39+ active contributors
- ✓ Distributed ownership (top contributor: 14% of recent commits)
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ⚠ No test directory detected
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/apache/seatunnel) — paste at the top of your README.md; it renders inline like a shields.io badge.
Social card preview (1200×630): this card auto-renders when someone shares https://repopilot.app/r/apache/seatunnel on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: apache/seatunnel
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/apache/seatunnel shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit today
- 39+ active contributors
- Distributed ownership (top contributor 14% of recent commits)
- Apache-2.0 licensed
- CI configured
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live apache/seatunnel
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/apache/seatunnel.
What it runs against: a local clone of apache/seatunnel — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in apache/seatunnel | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch dev exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of apache/seatunnel. If you don't
# have one yet, run these first:
#
# git clone https://github.com/apache/seatunnel.git
# cd seatunnel
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of apache/seatunnel and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "apache/seatunnel(\.git)?\b" \
  && ok "origin remote is apache/seatunnel" \
  || miss "origin remote is not apache/seatunnel (artifact may be from a fork)"
# 2. License matches what RepoPilot saw. Note: the ASF LICENSE text reads
#    "Apache License ... Version 2.0", not the SPDX id "Apache-2.0".
( { grep -qi "Apache License" LICENSE && grep -qi "Version 2\.0" LICENSE; } 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null ) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify dev >/dev/null 2>&1 \
  && ok "default branch dev exists" \
  || miss "default branch dev no longer exists"
# 4. Critical files exist
test -f "config/seatunnel.yaml" \
  && ok "config/seatunnel.yaml" \
  || miss "missing critical file: config/seatunnel.yaml"
test -f ".github/workflows/backend.yml" \
  && ok ".github/workflows/backend.yml" \
  || miss "missing critical file: .github/workflows/backend.yml"
test -f "pom.xml" \
  && ok "pom.xml" \
  || miss "missing critical file: pom.xml"
test -f "bin/install-plugin.sh" \
  && ok "bin/install-plugin.sh" \
  || miss "missing critical file: bin/install-plugin.sh"
test -f "docs/en/architecture/overview.md" \
  && ok "docs/en/architecture/overview.md" \
  || miss "missing critical file: docs/en/architecture/overview.md"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/apache/seatunnel"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
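The `./verify.sh || regenerate-and-retry` idiom can be fleshed out as a small retry loop. This is a sketch only: `verify_artifact` and `regenerate` are stubs standing in for the verification script above and whatever regeneration command your setup uses.

```shell
#!/usr/bin/env bash
# Hypothetical agent loop around artifact verification. Both functions are
# stubs so the sketch is self-contained; swap in ./verify.sh and your real
# regeneration step.
verify_artifact() { return "${VERIFY_RC:-0}"; }   # stub for: ./verify.sh
regenerate()      { echo "regenerating artifact at repopilot.app ..."; }

attempts=0
until verify_artifact; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 3 ]; then
    echo "giving up after $attempts failed verifications" >&2
    exit 1
  fi
  regenerate
done
# prints "verified after 0 regeneration(s)" when verification passes first try
echo "verified after $attempts regeneration(s)"
```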
⚡TL;DR
SeaTunnel is a distributed, high-performance data integration platform that synchronizes massive datasets across 160+ connectors (databases, data warehouses, APIs, file systems), with support for batch, streaming, CDC, and multimodal data (video, images, binary, structured/unstructured text). It decouples job configuration from execution engines (SeaTunnel Zeta, Flink, Spark) and provides real-time monitoring, data quality checks, and a distributed snapshot algorithm to prevent data loss or duplication.
Monorepo structure: seatunnel-engine (core distributed runtime built on Hazelcast), seatunnel-connectors (160+ source/sink/transform plugins), seatunnel-api (abstraction layer), seatunnel-spark/seatunnel-flink (engine adapters), seatunnel-server (web dashboard backend). CLI scripts live in bin/ (install-plugin.sh, install-plugin.cmd), config templates in config/ (hazelcast.yaml, log4j2.properties, jvm_*_options), and GitHub Actions workflows orchestrate Java builds, Docker images, and Helm charts.
👥Who it's for
Data engineers and platform architects building enterprise data pipelines who need to synchronize data across heterogeneous sources (MySQL→Postgres, Kafka→S3, Oracle→Snowflake) with CDC support, fault tolerance, and minimal resource overhead. DevOps teams deploying via Docker/Helm (see publish-docker.yaml, publish-helm-chart.yaml workflows). Contributors extend support by building new connectors in the seatunnel-connectors module.
🌱Maturity & risk
Actively developed Apache project with comprehensive CI/CD (backend.yml, schedule_backend.yml, codeql.yaml workflows), Docker/Helm publishing, and structured GitHub issue/PR templates. The 26 MB Java codebase, Hazelcast clustering (config/hazelcast*.yaml), and multi-engine abstraction indicate production-ready architecture, though as an Apache project it follows strict governance (see .asf.yaml, NOTICE, LICENSE).
Moderately high complexity: distributed system requiring understanding of multiple execution engines (Zeta, Flink, Spark), Hazelcast clustering, and 160+ connector integrations. Risk of connector-specific bugs and breaking changes across engine versions. Apache governance can slow releases, but large community mitigates single-maintainer risk. Monitor .github/workflows for CI failures and check GitHub issues for blocking connector problems.
Active areas of work
Actively merging PRs for connector expansion (160+ target), improving distributed snapshot algorithm, enhancing real-time monitoring dashboards, and stabilizing Zeta engine. CI/CD infrastructure being refined (codeql.yaml, update_build_status.yml). Docker/Helm publishing automated. Slack community and Twitter (@ASFSeaTunnel) indicate ongoing engagement; check .github/workflows/ for recent build statuses.
🚀Get running
git clone https://github.com/apache/seatunnel.git
cd seatunnel
# Maven-based project (see .mvn/wrapper)
./mvnw clean install # or mvn clean install
# Or use provided scripts:
bash bin/install-plugin.sh # Install plugins
# Docker quick-start:
docker build -f Dockerfile -t seatunnel:latest .
See README.md for engine-specific quickstarts (Zeta, Spark, Flink).
Daily commands:
# Build all
mvn clean install
# Run with SeaTunnel Zeta engine (default)
bin/seatunnel.sh --execute --config config/job.yaml
# Run with Spark
bin/seatunnel.sh --execute --config config/job.yaml --engine spark
# Run with Flink
bin/seatunnel.sh --execute --config config/job.yaml --engine flink
# See docs at seatunnel.apache.org/docs/getting-started for detailed engine setup
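To make the config/engine decoupling concrete, here is a hypothetical minimal batch job written to a temp file. The FakeSource→Console pairing mirrors the bundled smoke-test templates, but option names (e.g. result_table_name) have changed across releases, so diff against config/v2.batch.config.template in your checkout before running.

```shell
# Hypothetical minimal batch job in SeaTunnel's HOCON-style config format.
# Verify every option name against config/v2.batch.config.template.
cat > /tmp/smoke.batch.conf <<'EOF'
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  FakeSource {
    result_table_name = "fake"
    row.num = 16
    schema = { fields { name = "string", age = "int" } }
  }
}
sink {
  Console { source_table_name = "fake" }
}
EOF
echo "wrote /tmp/smoke.batch.conf"
# then (Zeta local mode, per the getting-started docs):
#   ./bin/seatunnel.sh --config /tmp/smoke.batch.conf -m local
```

Swapping engines changes only the launcher invocation, not the job file.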
🗺️Map of the codebase
- config/seatunnel.yaml — Master configuration file defining engine settings, plugins, and runtime behavior that every deployment must configure.
- .github/workflows/backend.yml — Primary CI/CD pipeline defining the build, test, and validation gates all PRs must pass.
- pom.xml — Root Maven POM managing dependencies, modules, and build profiles across the entire distributed system.
- bin/install-plugin.sh — Plugin installation script developers must understand to extend SeaTunnel with custom connectors.
- docs/en/architecture/overview.md — Foundational architecture documentation covering the data integration pipeline, engine design, and plugin system.
- config/v2.batch.config.template — Batch job configuration template demonstrating the source-transform-sink DAG structure all batch jobs follow.
- deploy/kubernetes/seatunnel/Chart.yaml — Helm chart definition, the starting point for understanding production Kubernetes deployment patterns and cluster architecture.
🛠️How to make changes
Add a New Source Connector
- Study the source architecture in docs/en/architecture/api-design/source-architecture.md to understand the Source interface and parallelization contract
- Create a source plugin module following existing connector patterns (seatunnel-connectors-v2/source-xxx/; see config/plugin_config)
- Implement the SourceFactory and Source classes with partition discovery, read methods, and schema inference
- Register the connector in config/plugin_config and add the Maven module to pom.xml
- Run plugin installation to validate discovery: ./bin/install-plugin.sh
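The module-creation step above can be sketched as a scaffold. Everything here is a placeholder: the module name is invented, and the real layout (parent POM coordinates, package structure) should be copied from an existing module under seatunnel-connectors-v2/.

```shell
# Hypothetical scaffold showing the directory shape to copy from an existing
# connector module. All names below are placeholders, not repo conventions.
name=connector-myexample
base="/tmp/$name"
mkdir -p "$base/src/main/java/org/apache/seatunnel/connectors/myexample" \
         "$base/src/main/resources" \
         "$base/src/test/java"
cat > "$base/pom.xml" <<'EOF'
<!-- Skeleton only: real modules inherit the seatunnel-connectors-v2 parent POM
     and must also be listed in that parent's <modules> section. -->
<project>
  <artifactId>connector-myexample</artifactId>
</project>
EOF
echo "scaffolded $base"
```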
Add a New Sink Connector
- Review sink architecture documentation covering write semantics, transaction handling, and exactly-once guarantees (
docs/en/architecture/api-design/sink-architecture.md) - Create sink plugin module implementing SinkFactory and Sink with write(), checkpoint(), and abort() methods (
config/plugin_config) - Handle checkpoint coordination using Hazelcast state backend (see hazelcast.yaml configuration) (
config/hazelcast.yaml) - Add sink module to pom.xml and test with v2.batch.config.template using new sink (
config/v2.batch.config.template) - Validate via CI by opening PR; backend.yml will run integration tests (
.github/workflows/backend.yml)
Deploy SeaTunnel on Kubernetes
- Customize Helm values (image tag, replicas, resource requests) in deploy/kubernetes/seatunnel/values.yaml (
deploy/kubernetes/seatunnel/values.yaml) - Create job configuration file based on config/v2.batch.config.template or v2.streaming.conf.template pattern (
config/v2.batch.config.template) - Mount job config via ConfigMap generated by deploy/kubernetes/seatunnel/templates/configmap.yaml (
deploy/kubernetes/seatunnel/templates/configmap.yaml) - Deploy master and worker pods: helm install seatunnel ./deploy/kubernetes/seatunnel (Helm uses master + worker deployments) (
deploy/kubernetes/seatunnel/templates/deployment-seatunnel-master.yaml) - Monitor using JVM options in deploy/kubernetes/seatunnel/conf/jvm_worker_options and Hazelcast cluster state (
deploy/kubernetes/seatunnel/conf/hazelcast-worker.yaml)
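The first step usually starts from a values override file rather than editing the chart in place. A hypothetical override follows; every key name is a guess and must be confirmed against deploy/kubernetes/seatunnel/values.yaml, which is the source of truth.

```shell
# Hypothetical Helm values override. Key names are assumptions; check them
# against deploy/kubernetes/seatunnel/values.yaml before deploying.
cat > /tmp/seatunnel-values.yaml <<'EOF'
image:
  tag: "2.3.12"      # pin a released tag, not latest (version illustrative)
worker:
  replicas: 3
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
EOF
echo "override written"
# then: helm upgrade --install seatunnel ./deploy/kubernetes/seatunnel \
#         -f /tmp/seatunnel-values.yaml
```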
Configure Logging & Monitoring
- Edit config/log4j2.properties to set log levels for SeaTunnel runtime, plugin discovery, and engine components (
config/log4j2.properties) - For client-side logging in distributed jobs, customize config/log4j2_client.properties separately (
config/log4j2_client.properties) - Tune JVM memory and GC via config/jvm_options or environment-specific variants (jvm_master_options, jvm_worker_options) (
config/jvm_options) - Configure Hazelcast metrics and cluster logging in config/hazelcast.yaml for distributed state visibility (
config/hazelcast.yaml)
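Flipping a log level can be scripted. The rootLogger.level key below is standard Log4j2 properties syntax, not necessarily the exact key SeaTunnel uses; the sketch works on a scratch copy so the real config/log4j2.properties stays untouched.

```shell
# Demo on a scratch file; SeaTunnel's log4j2.properties may use more specific
# logger names than the generic rootLogger key shown here.
cfg=$(mktemp)
printf 'rootLogger.level = INFO\n' > "$cfg"
# sed -i.bak works on both GNU and BSD sed
sed -i.bak 's/^rootLogger\.level = INFO$/rootLogger.level = DEBUG/' "$cfg"
grep '^rootLogger.level' "$cfg"   # prints: rootLogger.level = DEBUG
```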
🔧Why these technologies
- Hazelcast — Distributed in-memory state backend enabling exactly-once semantics, checkpoint coordination, and cluster state management across worker nodes without external dependencies
- Maven (pom.xml) — Multi-module build system managing 600+ files across connectors, core engine, and plugins with consistent dependency resolution and OSGi packaging
- Helm/Kubernetes — Production deployment abstraction enabling horizontal scaling, rolling updates, and resource isolation for master-worker architecture across cloud environments
- Log4j2 — Structured, async logging with per-component configuration critical for debugging distributed job execution across multiple workers and masters
🪤Traps & gotchas
- JVM options: startup requires tuning jvm_options (memory, GC) for large datasets; the config/jvm_*_options files are read by bin/seatunnel.sh.
- Hazelcast cluster: ensure cluster members can communicate on the ports defined in hazelcast.yaml (default 5701); multicast may be disabled.
- Connector classpath: plugins must be installed via bin/install-plugin.sh before use; missing plugins fail silently.
- Log4j2 async: log4j2.properties uses async loggers; ensure sufficient heap.
- Engine differences: the same job YAML may behave differently on Spark vs. Flink due to engine-specific optimizations (check seatunnel-spark/ and seatunnel-flink/ for quirks).
- Maven profiles: some connectors compile conditionally; inspect pom.xml profiles before building specific subsets.
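For the Hazelcast-port trap, a quick connectivity preflight saves debugging time. This sketch uses bash's /dev/tcp feature; 5701 is only the default, so read the actual port from config/hazelcast.yaml.

```shell
# Preflight: can we open a TCP connection to each would-be cluster member?
# Bash-only (/dev/tcp). Default Hazelcast port assumed; see config/hazelcast.yaml.
check_member() {
  local host=$1 port=${2:-5701}
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "ok: $host:$port reachable"
  else
    echo "FAIL: $host:$port unreachable"
    return 1
  fi
}
# Example against a port that is almost certainly closed, showing the failure path:
check_member 127.0.0.1 9 || true
```

Run it once per member host before debugging "silent" cluster-join failures.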
🏗️Architecture
💡Concepts to learn
- Change Data Capture (CDC) — SeaTunnel's CDC connectors enable real-time database replication by parsing binary logs (MySQL), WALs (Postgres), or using APIs; core differentiator from batch-only tools
- Distributed Snapshot Algorithm — SeaTunnel's fault-tolerance mechanism ensures data consistency across distributed nodes during synchronization; prevents data loss or duplication on failure recovery
- JDBC Multiplexing — SeaTunnel pools JDBC connections across multiple tables/databases to reduce connection overhead in real-time sync; critical for resource efficiency
- Pluggable Execution Engines — SeaTunnel abstracts job configuration (YAML) from runtime (Zeta/Flink/Spark), allowing users to swap engines without rewriting jobs; key architectural flexibility
- Hazelcast Clustering — SeaTunnel's Zeta engine uses Hazelcast for distributed coordination, cluster membership, and state management; alternative to Flink/Spark masters
- Multimodal Data Integration — SeaTunnel handles not just structured data (SQL) but also unstructured (text, images, video, binary), differentiating it from SQL-centric ETL tools
- Log-Based CDC Parsing — SeaTunnel parses database binary logs (MySQL binlog, Postgres WAL) to detect changes without polling; enables low-latency real-time sync
🔗Related repos
- apache/flink — Pluggable execution engine for SeaTunnel; shares distributed data processing patterns
- apache/spark — Alternative pluggable execution engine; many connectors leverage Spark DataFrames internally
- apache/kafka — Ecosystem partner; SeaTunnel has Kafka source/sink connectors and uses Kafka for CDC streaming
- alibaba/DataX — Earlier open-source data integration tool; SeaTunnel evolved as a successor with multi-engine support
- debezium/debezium — CDC (Change Data Capture) library; SeaTunnel integrates Debezium for database replication
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive integration tests for Kubernetes Helm chart deployment
The repo contains a Helm chart in deploy/kubernetes/seatunnel/ with Chart.yaml and configuration files, but there's no evidence of automated testing for Helm chart validation. This is critical for a distributed data integration tool where users deploy via Kubernetes. Adding a GitHub Action workflow to lint, validate, and test the Helm chart using tools like helm lint and chart-testing would prevent deployment misconfigurations.
- [ ] Create .github/workflows/helm-chart-test.yml workflow
- [ ] Add helm lint validation for deploy/kubernetes/seatunnel/Chart.yaml
- [ ] Integrate chart-testing tool to validate schema and template rendering
- [ ] Add smoke tests to verify generated manifests are valid Kubernetes resources
- [ ] Document Helm chart testing in CONTRIBUTING or DEVELOPMENT guide
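The first checklist item can be drafted locally before opening a PR. The workflow file below is a suggestion, not an existing file in the repo; action versions are current-generation guesses and should be pinned to whatever the project standardizes on.

```shell
# Draft the proposed helm-chart-test workflow to a scratch path for review.
mkdir -p /tmp/workflows
cat > /tmp/workflows/helm-chart-test.yml <<'EOF'
name: helm-chart-test
on:
  pull_request:
    paths: ["deploy/kubernetes/**"]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - run: helm lint deploy/kubernetes/seatunnel
      - run: helm template seatunnel deploy/kubernetes/seatunnel > /dev/null
EOF
echo "drafted helm-chart-test.yml"
```

`helm lint` catches chart-metadata problems; `helm template` proves every template renders without error.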
Implement end-to-end connector validation tests for Source and Sink plugins
SeaTunnel's core value proposition is integrating diverse data sources, but there's no visible test suite for validating that connectors work correctly end-to-end. The repo has plugin_config in config/ and multiple connector modules, but no dedicated GitHub workflow to test connector installation and basic functionality. Adding automated tests would catch breaking changes in connector APIs early.
- [ ] Create .github/workflows/connector-validation.yml workflow
- [ ] Add test matrix for major source/sink connectors (database, cloud, file-based)
- [ ] Test bin/install-plugin.sh and bin/install-plugin.cmd functionality
- [ ] Validate connector JAR availability and dependency resolution
- [ ] Add documentation in .github/ISSUE_TEMPLATE for connector-related bugs
Add License compliance checks and NOTICE file generation automation
The repo has .licenserc.yaml and NOTICE file, indicating license management concerns, but no automated validation workflow. SeaTunnel is an Apache project with strict licensing requirements. Adding a GitHub Action workflow to verify license headers in source files, validate dependencies for license compatibility, and auto-update the NOTICE file would prevent licensing violations and ensure compliance.
- [ ] Create .github/workflows/license-check.yml workflow
- [ ] Implement automated NOTICE file generation from Maven/Gradle dependencies
- [ ] Add license header validation for Java, Scala, and Python source files
- [ ] Configure license scanning against .licenserc.yaml for all PRs
- [ ] Document license compliance requirements in CONTRIBUTING.md
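A rough local pre-check for the header-validation item can be a grep pass over the first lines of each Java file. Real enforcement should use Apache RAT or skywalking-eyes driven by .licenserc.yaml; this sketch only approximates it, and is demonstrated on a scratch tree so it runs anywhere (point SRC_DIR at a real checkout to use it).

```shell
# Approximate license-header check: flag Java files whose first lines lack the
# ASF header. Fixture files below are for demonstration only.
SRC_DIR="${SRC_DIR:-/tmp/license-demo}"
mkdir -p "$SRC_DIR"
printf '// no header\nclass Bad {}\n' > "$SRC_DIR/Bad.java"
printf '// Licensed to the Apache Software Foundation (ASF)...\nclass Good {}\n' > "$SRC_DIR/Good.java"

missing=0
while IFS= read -r f; do
  if ! head -n 5 "$f" | grep -q "Licensed to the Apache Software Foundation"; then
    echo "missing header: $f"
    missing=$((missing + 1))
  fi
done < <(find "$SRC_DIR" -name '*.java' -not -path '*/target/*')
echo "$missing file(s) missing the ASF header"
```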
🌿Good first issues
- Add unit tests for JDBC multiplexing logic in seatunnel-connectors/seatunnel-connectors-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/ (currently sparse coverage for multi-table edge cases)
- Document the distributed snapshot algorithm (currently described only in Javadoc; write architecture guide in docs/design/snapshot-algorithm.md with diagrams)
- Create a sample connector template with best practices (e.g., seatunnel-connectors-examples/sample-source-connector/) showing error handling, type mapping, and testing patterns for new contributors
⭐Top contributors
- @zhangshenghang — 14 commits
- @davidzollo — 9 commits
- @chl-wxp — 8 commits
- @yzeng1618 — 8 commits
- @nzw921rx — 6 commits
📝Recent commits
- 5fb34bc — [Improve][Docs] Add minimum deployment config and clarify backup-count N definition in Zeta separated cluster guide (#10 (nzw921rx)
- 72f32ed — [Fix][Connector-V2][File] Respect custom filename for binary sink (#10817) (zhangshenghang)
- 9d6b879 — [Improve][MySQL-CDC] Enhance diagnostics for missing binlog and GTID during restore (#10566) (zhangshenghang)
- 53c11a5 — [Fix][Zeta] Job stuck permanently after master failover, unable to complete (affects BATCH / bounded source / job shutdo (nzw921rx)
- 17eb091 — [BUG][Connector-V2][MaxCompute] Fix Maxcompute source split stategy (#10845) (zhiliang-wu)
- 0fe175c — [Fix][Doc] Fix OpenAI API key documentation link (#10848) (zhangshenghang)
- 1957c1d — [ Improve ][Docs] Improve Redshift sink connector documentation (#10682) (davidzollo)
- b808a50 — [Improve][Transform] Replace Transform to support array-based replace_fields (#10712) (MyeoungDev)
- 646eabd — [Fix][Zeta] Reuse job info during restore (#10842) (zhangshenghang)
- 0b02d06 — [Fix][Connector-V2] Rename uniqueKeyFields to pkNames in dialect upsert APIs (#10367) (corgy-w)
🔒Security observations
The Apache SeaTunnel project shows moderate security posture with some areas of concern. The main vulnerabilities are related to configuration management (hardcoded configs in version control), potential input validation gaps in data processing connectors, and infrastructure-as-code security in Docker and Kubernetes manifests. The project benefits from CI/CD security scanning (CodeQL) but lacks visible comprehensive dependency management and a clear security reporting process. No critical vulnerabilities were identified based on the file structure alone, but the actual source code requires deeper analysis for injection risks and authentication/authorization mechanisms given the tool's role in data integration.
- Medium · Potential hardcoded configuration in config files — config/ directory (config/hazelcast.yaml, config/seatunnel.yaml, config/log4j2.properties, etc.). These files may contain sensitive information such as database credentials, API keys, or connection strings if not properly managed, and they are typically checked into version control. Fix: implement environment-based configuration management; use environment variables or secret-management systems (HashiCorp Vault, AWS Secrets Manager) instead of committing sensitive data; add config files to .gitignore and provide example templates with placeholder values.
- Medium · Kubernetes manifests with potential hardcoded secrets — deploy/kubernetes/seatunnel/templates/ (configmap.yaml, deployment-seatunnel-master.yaml, deployment-seatunnel-worker.yaml). Deployment files and ConfigMaps may contain hardcoded sensitive values, connection strings, or credentials that should be externalized. Fix: use Kubernetes Secrets for sensitive data instead of ConfigMaps; consider external secret-management tools such as sealed-secrets or the External Secrets Operator; never commit actual secrets to the repository.
- Medium · Missing input-validation framework visibility — source code not inspected; potential risk in connector implementations and configuration-parsing modules. For a tool processing massive amounts of data from diverse sources, the file structure does not clearly show input validation and sanitization mechanisms; SQL-injection and command-injection risks could exist when processing user-provided connectors or configurations. Fix: implement comprehensive input validation for all data source connectors; use parameterized queries for database operations; apply allowlist-based validation for plugin/connector inputs; add security testing for injection vulnerabilities.
- Low · Exposed Docker build configuration — .github/workflows/publish-docker.yaml, .github/workflows/publish-helm-chart.yaml. The Docker build and publish workflows could expose registry credentials or build artifacts if not properly secured. Fix: ensure GitHub Actions secrets are used for registry authentication; add image vulnerability scanning; use signed images and minimal base images; review workflow YAML for exposed credentials.
- Low · Missing SBOM and dependency scanning — .github/workflows/codeql.yaml and build configuration. CodeQL is configured, but there is no visible Software Bill of Materials (SBOM) generation or comprehensive scanning for vulnerable dependencies, which is critical for a tool processing data from external sources. Fix: generate an SBOM with tools such as syft or cyclonedx-maven-plugin; add dependency-check or Snyk to the CI pipeline; regularly audit and update dependencies; publish the SBOM alongside releases.
- Low · Unclear security update process — repository root (missing SECURITY.md). The file structure does not show a SECURITY.md file or a vulnerability-reporting mechanism, which is important for an Apache project handling data integration. Fix: create SECURITY.md with vulnerability-reporting guidelines; establish a responsible-disclosure process; provide a security contact; follow Apache Software Foundation security guidelines.
LLM-derived; treat as a starting point, not a security audit.
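The hardcoded-config observations can at least be triaged locally with a rough grep. The pattern below is illustrative only; a dedicated scanner such as gitleaks or trufflehog is the right tool for anything serious. The sketch runs against a scratch directory so it works anywhere; set SCAN_DIR to your clone's config/ to try it there.

```shell
# Rough, illustrative secret triage. Fixture file is for demonstration.
SCAN_DIR="${SCAN_DIR:-/tmp/scan-demo}"
mkdir -p "$SCAN_DIR"
printf 'jdbc.user = app\njdbc.password = hunter2\n' > "$SCAN_DIR/example.properties"

pattern='(password|passwd|secret|api[_-]?key|token)[[:space:]]*[:=]'
hits=$(grep -riEn "$pattern" "$SCAN_DIR" 2>/dev/null | wc -l | tr -d ' ')
echo "potential secret-like entries in $SCAN_DIR: $hits"
```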
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.