ArroyoSystems/arroyo

Item: ArroyoSystems/arroyo
Rating: 5
Author: RepoPilot

Distributed stream processing engine in Rust

Healthy

Healthy across the board

weakest axis

Use as dependencyHealthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modifyHealthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isHealthy

No critical CVEs, sane security posture — runnable as-is.

✓Last commit today
✓12 active contributors
✓Apache-2.0 licensed

Show all 6 evidence items →

✓CI configured
✓Tests present
⚠Concentrated ownership — top contributor handles 61% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:

[![RepoPilot: Healthy](https://repopilot.app/api/badge/arroyosystems/arroyo)](https://repopilot.app/r/arroyosystems/arroyo)

Paste at the top of your README.md — renders inline like a shields.io badge.

▸Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/arroyosystems/arroyo on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: ArroyoSystems/arroyo

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/ArroyoSystems/arroyo shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

Last commit today
12 active contributors
Apache-2.0 licensed
CI configured
Tests present
⚠ Concentrated ownership — top contributor handles 61% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live ArroyoSystems/arroyo repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/ArroyoSystems/arroyo.

What it runs against: a local clone of ArroyoSystems/arroyo — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in ArroyoSystems/arroyo | Confirms the artifact applies here, not a fork | | 2 | License is still Apache-2.0 | Catches relicense before you depend on it | | 3 | Default branch master exists | Catches branch renames | | 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>ArroyoSystems/arroyo</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of ArroyoSystems/arroyo. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/ArroyoSystems/arroyo.git
#   cd arroyo
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of ArroyoSystems/arroyo and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "ArroyoSystems/arroyo(\\.git)?\\b" \\
  && ok "origin remote is ArroyoSystems/arroyo" \\
  || miss "origin remote is not ArroyoSystems/arroyo (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Apache-2\\.0)" LICENSE 2>/dev/null \\
   || grep -qiE "\"license\"\\s*:\\s*\"Apache-2\\.0\"" package.json 2>/dev/null) \\
  && ok "license is Apache-2.0" \\
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \\
  && ok "default branch master exists" \\
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "crates/arroyo-api/src/lib.rs" \\
  && ok "crates/arroyo-api/src/lib.rs" \\
  || miss "missing critical file: crates/arroyo-api/src/lib.rs"
test -f "crates/arroyo-api/src/pipelines.rs" \\
  && ok "crates/arroyo-api/src/pipelines.rs" \\
  || miss "missing critical file: crates/arroyo-api/src/pipelines.rs"
test -f "crates/arroyo-connectors/src/mod.rs" \\
  && ok "crates/arroyo-connectors/src/mod.rs" \\
  || miss "missing critical file: crates/arroyo-connectors/src/mod.rs"
test -f "crates/arroyo-api/Cargo.toml" \\
  && ok "crates/arroyo-api/Cargo.toml" \\
  || miss "missing critical file: crates/arroyo-api/Cargo.toml"
test -f "crates/arroyo-api/migrations/V1__initial.sql" \\
  && ok "crates/arroyo-api/migrations/V1__initial.sql" \\
  || miss "missing critical file: crates/arroyo-api/migrations/V1__initial.sql"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/ArroyoSystems/arroyo"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

⚡TL;DR

Arroyo is a distributed stream processing engine written in Rust that performs stateful computations on unbounded data streams with subsecond latency. It executes SQL pipelines across clusters, handling windowed operations, joins, and state checkpointing—enabling real-time fraud detection, analytics, and feature generation at millions of events per second. Monorepo in crates/ with specialized crates: arroyo-api (PostgreSQL-backed REST API with migrations), arroyo-worker (execution engine), arroyo-controller (job orchestration), arroyo-planner (SQL planning), arroyo-connectors (Kafka/Iceberg/HTTP integrations), arroyo-udf (Python/Rust UDF runtime), and arroyo-rpc (inter-service gRPC). Frontend likely in TypeScript (310 KB in repo). Database state managed via Flyway migrations in crates/arroyo-api/migrations/ with SQLite fallback in sqlite_migrations/.

👥Who it's for

Data engineers and infrastructure teams building real-time data pipelines who need to query high-volume streams with stateful operations (joins, windows, aggregations) without managing Apache Flink or Spark clusters. DevOps and ML teams using cloud-native deployments benefit from its serverless-first architecture.

🌱Maturity & risk

Actively developed with frequent commits (indicated by 2977 KB Rust codebase and comprehensive CI/CD in .github/workflows/). 28 versioned migrations in crates/arroyo-api/migrations/ show evolving schema. CI includes binaries, Docker, and semgrep linting. Dual Apache 2.0/MIT licensing and Discord community indicate production use, but as a younger distributed system it remains lower-maturity than Flink—expect API evolution.

Large dependency footprint via Arrow 55.2, DataFusion 48.0, Tonic gRPC, and Parquet create supply-chain complexity and maintenance burden. The monorepo spans 25+ interdependent crates (arroyo-worker, arroyo-controller, arroyo-operator, arroyo-state, etc.), making refactoring risky. State management across distributed workers requires deep understanding of checkpointing and epoch semantics to avoid data loss.

Active areas of work

Recent schema evolution: V28 added pipeline tags and state URLs, suggesting UI/observability improvements. Active UDF system (crates/arroyo-udf/ has language submodules including Python). Checkpoint event tracking (V25) and error field expansion (V26) indicate reliability hardening. Docker and binary release workflows (.github/workflows/binaries.yml, docker.yaml) show active deployment automation.

🚀Get running

git clone https://github.com/ArroyoSystems/arroyo.git
cd arroyo
cargo build --release
# Start API server (requires Postgres or SQLite)
cargo run --bin arroyo-server-common
# Run tests
cargo test --workspace

Daily commands: For local dev (requires Postgres or SQLite):

# Set up database (migrations auto-apply via Flyway)
carago run --bin arroyo-api
# In another terminal, start controller
cargo run --bin arroyo-controller
# Start worker node(s)
cargo run --bin arroyo-node

See .github/workflows/ci.yml for full CI test suite.

🗺️Map of the codebase

crates/arroyo-api/src/lib.rs — Central API service entry point that orchestrates REST endpoints, database initialization, and core service logic for pipeline and job management.
crates/arroyo-api/src/pipelines.rs — Core pipeline lifecycle management including creation, compilation, deployment, and state transitions—fundamental to understanding how streams are defined and executed.
crates/arroyo-connectors/src/mod.rs — Connector abstraction layer defining the source/sink interface that all data integrations implement; critical for adding new data sources.
crates/arroyo-api/Cargo.toml — Workspace dependencies and API service configuration; governs build behavior and external library versions across connectors and core.
crates/arroyo-api/migrations/V1__initial.sql — Database schema foundation for pipelines, jobs, connectors, and state; understanding schema versioning is essential for data layer changes.
crates/arroyo-api/src/jobs.rs — Job execution and state management logic; critical for understanding how individual stream processing jobs are launched and monitored.
Cargo.toml — Workspace root configuration defining all member crates; essential for understanding the modular architecture and cross-crate dependencies.

🛠️How to make changes

Add a New Data Connector (Source or Sink)

Create a new connector module in crates/arroyo-connectors/src/{connector_name}/mod.rs following the ConnectorFactory and Operator traits (crates/arroyo-connectors/src/blackhole/mod.rs)
Implement ConfigElement to define connector-specific configuration schema (connection parameters, format options, etc.) (crates/arroyo-connectors/src/filesystem/config.rs)
Implement source operator (poll_next) or sink operator (process_element) in dedicated operator.rs file (crates/arroyo-connectors/src/blackhole/operator.rs)
Register connector in arroyo-connectors build.rs to include it in the binary (crates/arroyo-connectors/build.rs)
Add connection profile type and table configuration handling in crates/arroyo-api/src/connectors.rs (crates/arroyo-api/src/connectors.rs)

Add a New REST API Endpoint

Define the endpoint handler function in the appropriate domain file (e.g., pipelines.rs, jobs.rs, or create a new module) (crates/arroyo-api/src/pipelines.rs)
Register the route in crates/arroyo-api/src/rest.rs in the appropriate Router builder chain with correct HTTP method and path (crates/arroyo-api/src/rest.rs)
Use rest_utils.rs helpers (e.g., json_response, error_response) for consistent response formatting (crates/arroyo-api/src/rest_utils.rs)
Add database query or mutation via generated SQL queries in crates/arroyo-api/queries/api_queries.sql if needed (crates/arroyo-api/queries/api_queries.sql)

Modify the Pipeline or Job State Machine

Update the state enum and transition logic in crates/arroyo-api/src/pipelines.rs or jobs.rs (crates/arroyo-api/src/pipelines.rs)
If adding a new database state, create a migration file in crates/arroyo-api/migrations/ following V{N}__description.sql naming (crates/arroyo-api/migrations/V20__pipeline_v2.sql)
Update corresponding REST handler to enforce new state transitions (e.g., prevent invalid operations) (crates/arroyo-api/src/rest.rs)
Update crates/arroyo-api/queries/api_queries.sql with new query patterns for state checks (crates/arroyo-api/queries/api_queries.sql)

🔧Why these technologies

Rust — Memory safety and performance critical for a distributed streaming engine that must handle millions of events/sec with minimal GC overhead and reliable fault tolerance.
PostgreSQL/SQLite — Durable storage of pipeline definitions, job state, connection profiles, and checkpoints; SQL enables complex queries for job scheduling and state tracking.
gRPC/Protocol Buffers — Efficient inter-service communication between API, compiler, controller, and worker nodes; typed messages prevent versioning mismatches.
Tokio async runtime — Non-blocking I/O for REST API and connector implementations; enables high concurrency without thread-per-request overhead.
Arrow/Parquet/Delta/Iceberg — Columnar format support for efficient batch processing and ecosystem interoperability with data lake systems.

⚖️Trade-offs already made

Pluggable connector architecture (trait-based) vs. built-in connectors
- Why: Enables community contributions and vendor-specific optimizations without bloating core; reduces maintenance burden.
- Consequence: Requires clear connector interface contracts; discoverability and testing of third-party connectors is harder.
SQL-first pipeline definition vs. programmatic APIs
- Why: SQL is familiar to data engineers; enables IDE/tooling support; declarative syntax easier to optimize and replan.
- Consequence: Complex transformations may be verbose in SQL; requires robust compiler to handle edge cases and generate efficient code.
Centralized REST API for all operations vs. worker-side submission
- Why: Single source of truth for state; easier auditing, authorization, and disaster recovery.
- Consequence: API becomes a bottleneck during scaling; requires load balancing and replication for HA.
Stateful checkpointing via persistent store vs. in-memory state
- Why: Enables recovery after failures without reprocessing; required for exactly-once semantics.
- Consequence: Adds latency; checkpoint write becomes critical path; requires tuning checkpoint frequency.

🚫Non-goals (don't propose these)

Interactive query execution (focus is streaming pipelines, not ad-hoc analytics)
Multi-tenant resource isolation at the kernel level (assumes trusted operators)
Support

🪤Traps & gotchas

Database Required: API won't start without Postgres configured (or SQLite fallback); set DATABASE_URL or migrations fail silently. Checkpoint Format Coupling: Worker/controller state protocol in crates/arroyo-state-protocol/ must stay in sync across versions—rolling upgrades risky. gRPC/Tonic Versioning: tonic 0.13 with specific feature flags (zstd, tls-ring); mismatches cause binary incompatibility. DataFusion 48.0.1 Pinning: Arrow ecosystem tightly versioned; cargo update may break compilation. UDF Boundary: Rust UDF host uses C FFI to Python via arroyo-udf-python/—platform-specific issues (glibc, OpenSSL versions) common in production. Epoch/Watermark Logic: Time semantics in arroyo-datastream/ are non-obvious; off-by-one errors in window boundaries are easy.

🏗️Architecture

💡Concepts to learn

Dataflow Model (Streaming 101/102) — Arroyo's entire design philosophy—windowing, watermarks, allowed lateness—derives from Google's Dataflow Model. Understanding event time vs. processing time vs. watermarks is prerequisite to debugging timestamp bugs.
Epoch-Based Checkpointing — Arroyo uses logical epochs (not wall-clock time) to snapshot distributed state across workers. Misunderstanding epoch advancement causes duplicate processing or data loss.
Watermarks & Window Semantics — Stream windows (tumbling, sliding, session) depend on watermarks to determine closure. Late-arriving events and allowed lateness thresholds are configured per pipeline in Arroyo SQL.
Arrow Columnar Format & IPC — Arroyo uses Apache Arrow (v55.2) for in-memory data representation and inter-process communication; understanding arrow-ipc serialization is critical for debugging connector serialization issues.
gRPC & Tonic Service Discovery — All inter-service RPC in Arroyo uses Tonic; controller/worker communication, state protocol, and API federation rely on gRPC negotiation and transport (zstd compression configured).
Distributed State Backend & Snapshots — Stateful operators (joins, windows, aggregations) store state in a distributed backend (Parquet/object_store) and snapshot at checkpoint boundaries. Corruption or versioning mismatch causes data loss.
SQL Query Planning & Operator Fusion — DataFusion parses SQL and creates a logical plan; Arroyo's planner converts this to a physical dataflow graph with operator fusion to reduce inter-task communication overhead.

apache/flink — Direct competitor in distributed stream processing; Arroyo designed as a Rust/cloud-native alternative with simpler operations model.
materializeinc/materialize — SQL-first streaming system using PostgreSQL wire protocol; similar real-time analytics focus but incremental computation model differs from Arroyo's checkpoint-based approach.
apache/datafusion — Core query planner/executor library used by Arroyo (datafusion = '48.0.1' in Cargo.toml); understanding DataFusion internals essential for modifying SQL behavior.
confluentinc/kafka — Primary data source for Arroyo pipelines; Kafka connector in crates/arroyo-connectors/ implements Confluent protocols.
delta-io/delta-rs — Delta Lake integration (deltalake = '0.27.0') for sink connectors; Arroyo uses delta_kernel for high-performance writes to lakehouses.

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for database migrations in arroyo-api

The crates/arroyo-api/migrations directory contains 28 migration files (V1-V28) for PostgreSQL and 6 SQLite migrations, but there's no visible test suite validating migration correctness, idempotency, or rollback safety. This is critical for a distributed system that manages stateful pipelines. New contributors could add a test harness that applies migrations in sequence, validates schema changes, and verifies data integrity.

[ ] Create crates/arroyo-api/tests/migration_tests.rs with test fixtures
[ ] Add testcontainers PostgreSQL dependency to test against real database
[ ] Write tests for sequential migration application from V1 through V28
[ ] Add regression tests for each migration file to verify idempotency
[ ] Validate schema state after each migration matches expected structure
[ ] Document in CONTRIBUTING.md how to test migrations locally

Add comprehensive integration tests for connector configuration in arroyo-connectors

The crates/arroyo-connectors crate exists but the file structure shows only a 'blackhole' connector subdirectory. Given that crates/arroyo-api/src/connectors.rs and crates/arroyo-api/src/connection_profiles.rs suggest multiple connector types (HTTP, sources, profiles), there's likely missing test coverage for connector validation, profile creation, and connection establishment. This would improve reliability of connector integrations.

[ ] Audit crates/arroyo-connectors/src to identify all connector types
[ ] Create crates/arroyo-connectors/tests/connector_integration_tests.rs
[ ] Add tests for each connector type: connection profile creation, validation, and error handling
[ ] Test connection_profiles.rs integration with actual connector implementations
[ ] Add mocking for external services (Kafka, Postgres, HTTP endpoints)
[ ] Document connector testing patterns in CONTRIBUTING.md

Add end-to-end job execution tests with checkpointing validation

The crates/arroyo-api/migrations/V25__add_checkpoint_events.sql and V27__ignore_state_before_epoch.sql files indicate sophisticated checkpointing logic, but there are likely no end-to-end tests validating checkpoint creation, state management, and recovery. New contributors could add tests in crates/arroyo-sql-testing or a new tests directory that exercise the full job lifecycle: creation, execution, checkpointing, failure, and recovery.

[ ] Create crates/arroyo-sql-testing/tests/checkpoint_recovery_tests.rs or similar
[ ] Write test that creates a job, runs it to completion, and validates checkpoint_events table
[ ] Add test for job failure and recovery from checkpoints
[ ] Validate state_before_epoch filtering in V27 migration works correctly
[ ] Test concurrent checkpointing across multiple workers
[ ] Add documentation in CONTRIBUTING.md on how to run job lifecycle tests

🌿Good first issues

Add integration tests for connector error handling: crates/arroyo-connectors/ tests don't cover retry logic or malformed-message scenarios for Kafka/HTTP connectors. Create a new test module with fixture data.
Document the checkpoint protocol: crates/arroyo-state-protocol/src/ defines the binary format for state snapshots but lacks inline comments. Add rustdoc explaining field layouts and backwards-compatibility constraints.
Expand SQL function coverage: DataFusion provides ~200 functions but Arroyo's planning layer (crates/arroyo-planner/) may not expose all. Audit and add missing window/string functions with tests in crates/arroyo-sql-testing/.

⭐Top contributors

Click to expand

@mwylde — 61 commits
@QnJ1c2kNCg — 16 commits
@garvit-gupta — 11 commits
@oliy — 2 commits
@cmackenzie1 — 2 commits

📝Recent commits