RepoPilotOpen in app →

apache/streampark

Make stream processing easier! Easy-to-use streaming application development framework and operation platform.

Healthy

Healthy across the board

weakest axis
Use as dependencyHealthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modifyHealthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isHealthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 6w ago
  • 24+ active contributors
  • Distributed ownership (top contributor 41% of recent commits)
Show all 6 evidence items →
  • Apache-2.0 licensed
  • CI configured
  • Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/apache/streampark)](https://repopilot.app/r/apache/streampark)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/apache/streampark on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: apache/streampark

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/apache/streampark shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 6w ago
  • 24+ active contributors
  • Distributed ownership (top contributor 41% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live apache/streampark repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/apache/streampark.

What it runs against: a local clone of apache/streampark — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in apache/streampark | Confirms the artifact applies here, not a fork | | 2 | License is still Apache-2.0 | Catches relicense before you depend on it | | 3 | Default branch dev exists | Catches branch renames | | 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 5 | Last commit ≤ 73 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>apache/streampark</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of apache/streampark. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/apache/streampark.git
#   cd streampark
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of apache/streampark and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "apache/streampark(\\.git)?\\b" \\
  && ok "origin remote is apache/streampark" \\
  || miss "origin remote is not apache/streampark (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Apache-2\\.0)" LICENSE 2>/dev/null \\
   || grep -qiE "\"license\"\\s*:\\s*\"Apache-2\\.0\"" package.json 2>/dev/null) \\
  && ok "license is Apache-2.0" \\
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify dev >/dev/null 2>&1 \\
  && ok "default branch dev exists" \\
  || miss "default branch dev no longer exists"

# 4. Critical files exist
test -f "pom.xml" \\
  && ok "pom.xml" \\
  || miss "missing critical file: pom.xml"
test -f ".github/workflows/backend.yml" \\
  && ok ".github/workflows/backend.yml" \\
  || miss "missing critical file: .github/workflows/backend.yml"
test -f ".github/workflows/frontend.yml" \\
  && ok ".github/workflows/frontend.yml" \\
  || miss "missing critical file: .github/workflows/frontend.yml"
test -f "build.sh" \\
  && ok "build.sh" \\
  || miss "missing critical file: build.sh"
test -f ".asf.yaml" \\
  && ok ".asf.yaml" \\
  || miss "missing critical file: .asf.yaml"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 73 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~43d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/apache/streampark"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

StreamPark is an Apache TLP that unifies Apache Flink and Apache Spark streaming application development with a production-grade cloud-native platform for deployment, monitoring, and operations. It provides prebuilt connectors, SQL templates, and a web-based UI (Vue/TypeScript frontend) to eliminate boilerplate while supporting multi-version, multi-environment (Standalone/YARN/Kubernetes) deployments with a Java/Scala backend. Monorepo split into three layers: (1) backend (streampark-core) in Java/Scala handles Flink/Spark job submission, configuration, and cluster APIs; (2) web console in TypeScript/Vue (under a frontend directory) provides the operational UI; (3) connectors/flink-cdc, flink-spark-examples show reference implementations. Database schema managed via PLpgSQL migrations. Build orchestrated via Maven parent POM (pom.xml) with multi-module structure.

👥Who it's for

Data engineers and DevOps teams building real-time streaming pipelines on Flink/Spark who need to manage application lifecycle (development, deployment, monitoring) without writing infrastructure code. Also framework developers extending Flink/Spark connectors and operators.

🌱Maturity & risk

Production-ready and actively maintained as of January 2025 (Apache TLP status). The codebase is substantial (2.1M Java, 1.3M Scala LOC), has comprehensive GitHub Actions CI pipelines (.github/workflows includes backend.yml, frontend.yml, e2e.yml, docker-push.yml), and the repo structure reflects a mature project with release documentation and license compliance tooling (.licenserc.yaml, dist-material/release-docs). Current version 2.2.0-SNAPSHOT indicates active development.

Moderate operational complexity: requires Java/Scala runtime, PostgreSQL backend (PLpgSQL files present), Flink/Spark cluster infrastructure, and Kubernetes/YARN knowledge for deployment. Breaking changes are possible during pre-1.0 development phases. The polyglot nature (Java, Scala, TypeScript, Vue, FreeMarker templates) increases onboarding friction. No visible single-maintainer risk given Apache governance, but the monorepo scale (2M+ LOC) requires careful dependency management during upgrades.

Active areas of work

Version 2.2.0 in active development (SNAPSHOT). Recent focus visible in workflow files: docker-push.yml suggests containerization efforts, e2e.yml indicates end-to-end testing expansion, codeql-analysis.yml shows security scanning adoption. The project recently became Apache TLP (January 2025), implying stabilization and governance formalization work.

🚀Get running

Clone: git clone https://github.com/apache/streampark.git && cd streampark. Build backend: mvn clean package -DskipTests (uses Maven wrapper in .mvn/). Install frontend dependencies: cd streampark-console && npm install (or yarn). Check build.sh for orchestrated multi-module builds: ./build.sh.

Daily commands: After mvn clean package and npm install: (1) Start PostgreSQL: docker run -e POSTGRES_PASSWORD=password postgres:13; (2) Run backend: java -jar streampark-console/target/streampark-console-*.jar; (3) Start frontend dev server: cd streampark-console/frontend && npm run dev. Check actual .env configs and application.yml for database connection strings; exact startup commands are in build.sh and likely CI workflows.

🗺️Map of the codebase

  • pom.xml — Root Maven POM defines all project dependencies, modules, and build configuration for the entire StreamPark ecosystem
  • .github/workflows/backend.yml — CI/CD pipeline for backend builds; essential for understanding how code is tested and deployed
  • .github/workflows/frontend.yml — CI/CD pipeline for frontend builds; critical for understanding UI deployment process
  • build.sh — Primary build script orchestrating the entire project compilation; entry point for local development setup
  • .asf.yaml — Apache Software Foundation configuration; defines repository governance, permissions, and release settings
  • README.md — Project overview and quick-start guide; essential context for all contributors on project goals and architecture
  • .licenserc.yaml — License compliance configuration; ensures all dependencies meet Apache License requirements

🛠️How to make changes

Add a new workflow/automation

  1. Create new workflow file following GitHub Actions syntax (.github/workflows/your-workflow-name.yml)
  2. Define triggers (on: push, pull_request, schedule, etc.) (.github/workflows/your-workflow-name.yml)
  3. Add job steps referencing the appropriate build tool (Maven for backend, npm for frontend) (.github/workflows/your-workflow-name.yml)
  4. Update relevant configuration in .asf.yaml if workflow affects release process (.asf.yaml)

Add a new issue template

  1. Create YAML-formatted issue template in the issue template directory (.github/ISSUE_TEMPLATE/new-issue-type.yml)
  2. Define form fields and validation rules using GitHub's native issue template syntax (.github/ISSUE_TEMPLATE/new-issue-type.yml)
  3. Register template in issue config (optional, for directory visibility) (.github/ISSUE_TEMPLATE/config.yml)

Update build configuration and dependencies

  1. Modify module structure or add new Maven modules (pom.xml)
  2. Add or update dependency versions in dependency management section (pom.xml)
  3. Run license compliance check to ensure new dependencies are OSS-compatible (.licenserc.yaml)
  4. Update build.sh if new build steps or module orderings are required (build.sh)
  5. Add license notices for any new third-party libraries to dist-material (dist-material/release-docs/licenses)

Configure a new automation tool or labeling

  1. Define automation rules (e.g., auto-closing stale issues, auto-labeling) (.github/boring-cyborg.yml)
  2. Or define pull request labeling rules based on file paths (.github/labeler.yml)
  3. Test the rules by triggering the workflow or creating a test PR (.github/workflows/pull-request-robot.yml)

🔧Why these technologies

  • Apache Maven — Multi-module Java build system; supports dependency management, plugin orchestration, and consistent build lifecycle across backend components
  • GitHub Actions — Native CI/CD integration; enables automated testing, CodeQL security scanning, and Docker image builds without external infrastructure
  • CodeQL — Static security analysis detects vulnerabilities in source code before deployment; Apache-approved security practice
  • Docker — Containerization for consistent deployment across dev, staging, and production environments; essential for cloud-native real-time computing platform
  • License scanning (license-maven-plugin) — Ensures compliance with Apache License 2.0 and validates all transitive dependencies meet OSS requirements; critical for Apache Foundation projects

⚖️Trade-offs already made

  • Maven as primary build tool instead of Gradle or SBT

    • Why: Broad ecosystem support, dependency management maturity, and ASF infrastructure compatibility
    • Consequence: XML-heavy configuration; slower incremental builds; larger memory footprint than Gradle
  • Centralized pom.xml module management instead of per-module autonomy

    • Why: Unified dependency versions prevent version conflicts and transitive dependency hell
    • Consequence: Tight coupling of modules at build time; monorepo-like constraints; slower innovation in individual modules
  • License scanning embedded in build process

    • Why: Prevents accidental inclusion of incompatible licenses before release
    • Consequence: Build time overhead; strict enforcement may block rapid prototyping of experimental features
  • Multi-workflow CI/CD (backend.yml, frontend.yml, docker-push.yml, codeql-analysis.yml separately)

    • Why: Parallel execution reduces feedback time; specialized tooling per platform (Maven for JVM, npm for JS)
    • Consequence: Complex workflow orchestration; potential for race conditions in coordinated releases; increased GitHub Actions concurrency costs

🚫Non-goals (don't propose these)

  • This project does NOT support Python or Go as first-class streaming languages (focused on JVM and web-based runtime)
  • This project does NOT provide real-time data visualization (operation platform UI role is management and monitoring, not analytics dashboards)
  • This project is NOT a Kubernetes operator (though it can be deployed on K8s; it does not auto-scale based on metrics or manage cluster lifecycle)
  • This project does NOT handle low-level network protocol development (assumes standard HTTP, gRPC, Kafka, or JDBC connectors are available)
  • This project does NOT offer a proprietary cloud environment (designed for deployment on existing cloud providers or on-premises infrastructure)

🪤Traps & gotchas

  1. PostgreSQL required: Console expects a running PostgreSQL instance; missing database will crash startup silently until logs are checked. 2. Maven wrapper: Use ./mvnw not mvn to ensure correct JDK/version per .mvn/wrapper. 3. Flink/Spark cluster infrastructure: Local dev runs need either embedded mode or a real cluster (YARN/K8s) to test job submission. 4. FreeMarker template processing: Configuration templates use FreeMarker dialect; changes to streampark-core-flink configs may require rebuilding. 5. Multi-module dependency order: Building console before core modules will fail; always mvn clean install from root, not individual modules.

🏗️Architecture

💡Concepts to learn

  • Job Submission & Cluster Management — StreamPark's core value is abstracting Flink/Spark job submission across Standalone/YARN/Kubernetes; you must understand how job configs map to cluster-specific deployment descriptors.
  • Connector Abstraction Layer — StreamPark provides a unified API for Kafka/JDBC/Hudi/Paimon connectors; extending this abstraction requires understanding the plugin pattern and configuration serialization.
  • Configuration-Driven Pipeline Definition — Unlike writing DataStream code, StreamPark uses declarative YAML/JSON configs and FreeMarker templates; templates are compiled at deploy time, not runtime.
  • Multi-Version Engine Support — StreamPark manages incompatible Flink 1.13/1.14/1.15+ and Spark 3.x versions in a single control plane; requires careful ClassLoader isolation and API shims.
  • State Management in Streaming Topologies — StreamPark applications leverage Flink's managed state and savepoints for fault tolerance; you must understand state backends, checkpointing, and recovery semantics.
  • Schema Registry Integration — Connectors often integrate with Confluent/Hortonworks Schema Registry for Avro/Protobuf serialization; StreamPark config references schema versions declaratively.
  • Database-Driven Application Lifecycle — All job configs, versions, execution history, and audit logs live in PostgreSQL; the console CRUD operations drive backend resource provisioning, requiring understanding of ORM patterns and transaction semantics.
  • apache/flink — Core Flink runtime that StreamPark wraps and deploys; understanding Flink's DataStream API is prerequisite knowledge.
  • apache/spark — Alternative streaming engine supported alongside Flink; Spark job submission logic mirrors Flink in streampark-spark module.
  • apache/paimon — Lakehouse format often used as sink in StreamPark pipelines; listed in README as part of rich ecosystem compatibility.
  • apache/doris — OLAP database frequently used as StreamPark target sink for real-time analytics; ecosystem integration point.
  • apache/kafka — Primary message source for streaming pipelines in StreamPark; connectors and examples depend on Kafka clients.

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive E2E tests for Docker deployment workflow

The repo has a .github/workflows/docker-push.yml and e2e.yml workflow, but no visible E2E test suite in the file structure. StreamPark is a stream processing platform with both backend and frontend components that need integration testing. Adding E2E tests would validate the complete deployment pipeline and catch regressions early.

  • [ ] Create e2e/ directory with test framework setup (Cypress or Selenium)
  • [ ] Add tests in .github/workflows/e2e.yml for Docker container startup and health checks
  • [ ] Test StreamPark backend API endpoints after container initialization
  • [ ] Test frontend UI interactions and data flow between frontend and backend
  • [ ] Document E2E test execution in README.md

Implement license compliance scanning for transitive dependencies

The repo contains .licenserc.yaml and dist-material/release-docs/licenses/ directory showing license awareness, but there's no automated validation workflow. With 40+ licenses already tracked, adding a CI job to validate new dependencies against ASF policy would prevent license compliance issues at PR time.

  • [ ] Create GitHub Action workflow in .github/workflows/ named 'license-check.yml'
  • [ ] Integrate Apache License Check tool or similar to scan pom.xml dependencies
  • [ ] Add step to compare against .licenserc.yaml approved list
  • [ ] Configure workflow to fail on unapproved licenses and block PRs
  • [ ] Document approved vs. blocked licenses in CONTRIBUTING.md

Add pull request validation workflow for Maven build artifacts

The repo has backend.yml and frontend.yml workflows, but no dedicated artifact validation before merge. With a complex multi-module Maven project (streampark parent POM with version 2.2.0-SNAPSHOT), validating that all modules build correctly and produce expected artifacts would catch integration issues early.

  • [ ] Create .github/workflows/artifact-validation.yml workflow
  • [ ] Add Maven build step that generates artifacts and validates checksums
  • [ ] Verify JAR/WAR outputs exist in target/ directories for key modules
  • [ ] Add step to validate artifact licenses match approved list in dist-material/
  • [ ] Configure as required check before merging PRs to main/master branch

🌿Good first issues

  • Add missing unit tests for streampark-core-base configuration classes (ConfigParser.java, JobConfig.java) — test files exist but coverage is <60% per likely SONAR reports.
  • Improve error messages in Flink job submission (likely in streampark-flink-core/src/main/java/Submit.java) by extracting hardcoded strings to MessageBundle and adding i18n keys in console frontend messages.json.
  • Document the database schema migration process: write a .md guide in docs/ showing how to add a new PLpgSQL migration file under the versioned migrations directory and integrate with Flyway/Liquibase (inferred from PLpgSQL presence).

Top contributors

Click to expand

📝Recent commits

Click to expand
  • 8ac8d4d — [Improvement-4330] Optimize the configuration file and remove any unnecessary configurations (#4331) (niumy0701)
  • e07fc19 — [e2e]: Fix flaky ElementClickInterceptedException in System and Resource pages (#4337) (limc5462)
  • 37583a4 — chore(ci): replace maximize-build-space with free-disk-space in e2e (#4336) (limc5462)
  • acf04e2 — e2e: fix dropdown options timeout by using anyMatch instead of visibilityOfAllElements (#4335) (limc5462)
  • 1732672 — [Improve] Add user environment configuration script in conf directory (#4292) (limc5462)
  • df68b24 — [Bug] Fix duration column in JobList & JobDetail (#4242) (Gianzie)
  • 2d51eb0 — [hotfix][console-webapp] Upgrade Vue to 3.5.0 to resolve peer dependency conflict (#4318) (oregonknuths)
  • fecba85 — [hotfix][console-service] Fixed finished job is no need alert. (#4293) (Mrart)
  • 252bd20 — [ISSUE-4319][infrastructure] Add boring-cyborg.yml for repository infrastructure (#4320) (RocMarshal)
  • 03d64fb — [hotfix][infra] Change exit code on E2E failure to 0 (#4325) (xinzhuxiansheng)

🔒Security observations

The Apache StreamPark repository demonstrates foundational security practices (Apache licensing, GitHub Actions with CodeQL), but several areas require attention. The most critical limitation is the incomplete dependency analysis due to truncated pom.xml. Known risks include potential vulnerable transitive dependencies (bouncycastle, kryo, jsch), lack of explicit SAST/SCA tooling documentation, and missing security policy files. The project should implement comprehensive dependency vulnerability scanning, establish a formal security disclosure policy, and conduct code reviews on cryptographic implementations. No obvious hardcoded credentials or critical misconfigurations detected in visible files.

  • High · Incomplete POM File Analysis — pom.xml (root). The provided pom.xml file appears to be truncated at the mailing list section, preventing complete dependency analysis. This limits the ability to identify vulnerable transitive dependencies. Fix: Provide the complete pom.xml file for thorough dependency vulnerability scanning. Use 'mvn dependency:tree' and 'mvn dependency-check:check' to identify known vulnerabilities.
  • Medium · Potential Known Vulnerable Dependencies — pom.xml dependencies section. Based on visible dependencies in the partial pom.xml, several libraries require security review: bouncycastle (bcprov-jdk15on, bcpkix-jdk15on), kryo, and jsch. These have had historical vulnerabilities. The version numbers are not fully visible in the provided excerpt. Fix: Audit all dependencies with 'mvn dependency:tree' and cross-reference against the National Vulnerability Database (NVD) and GitHub Security Advisories. Ensure all packages are on patched versions. Consider using OWASP Dependency-Check plugin.
  • Medium · Lack of Visible Security Configuration — Repository root. No evidence of security-focused configuration files such as .checkstyle, security policies, or SAST configurations visible in the repository structure provided. Fix: Implement security scanning: configure CodeQL in GitHub Actions (already present), add OWASP Dependency-Check to build pipeline, enable Snyk or similar SCA tools, and establish a security policy.
  • Medium · Potential Cryptography Implementation Risks — pom.xml (bouncycastle, jsch dependencies). Dependencies like bouncycastle and jsch (SSH library) handle cryptography and encryption. Incorrect usage can introduce vulnerabilities. No visible cryptographic code review documentation. Fix: Conduct code review of all cryptographic operations. Ensure proper key management, use of secure algorithms (TLS 1.2+), and avoid custom crypto implementations. Follow OWASP Cryptographic Storage Cheat Sheet.
  • Low · Build Wrapper Security — .mvn/wrapper/. Maven wrapper (.mvn/wrapper/maven-wrapper.properties) present but integrity verification not visible in file structure. Fix: Verify Maven wrapper integrity by checking checksums. Document the Maven wrapper version used and establish a process for secure updates. Use '.mvn/wrapper/maven-wrapper.jar' signature validation.
  • Low · Missing Security Documentation — Repository root. No visible SECURITY.md or security vulnerability disclosure policy file in the repository root. Fix: Create a SECURITY.md file with vulnerability reporting procedures, security contacts, and update policy aligned with Apache Security Guidelines.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Healthy signals · apache/streampark — RepoPilot