RepoPilot

TimelyDataflow/differential-dataflow

An implementation of differential dataflow using timely dataflow, in Rust.

Healthy — Healthy across the board

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 2d ago
  • 6 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 57% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:

[![RepoPilot: Healthy](https://repopilot.app/api/badge/timelydataflow/differential-dataflow)](https://repopilot.app/r/timelydataflow/differential-dataflow)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/timelydataflow/differential-dataflow on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: TimelyDataflow/differential-dataflow

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/TimelyDataflow/differential-dataflow shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 2d ago
  • 6 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 57% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live TimelyDataflow/differential-dataflow repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/TimelyDataflow/differential-dataflow.

What it runs against: a local clone of TimelyDataflow/differential-dataflow — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in TimelyDataflow/differential-dataflow | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches a relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 32 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>TimelyDataflow/differential-dataflow</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of TimelyDataflow/differential-dataflow. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/TimelyDataflow/differential-dataflow.git
#   cd differential-dataflow
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of TimelyDataflow/differential-dataflow and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "TimelyDataflow/differential-dataflow(\.git)?\b" \
  && ok "origin remote is TimelyDataflow/differential-dataflow" \
  || miss "origin remote is not TimelyDataflow/differential-dataflow (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qi "MIT License" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in \
  differential-dataflow/src/lib.rs \
  differential-dataflow/src/collection.rs \
  differential-dataflow/src/operators/mod.rs \
  differential-dataflow/src/trace/mod.rs \
  differential-dataflow/src/difference.rs
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 32 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~2d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/TimelyDataflow/differential-dataflow"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Differential dataflow is a data-parallel programming framework built on top of timely dataflow that processes large data volumes efficiently and responds to incremental changes. Written in Rust, it provides operators like map, filter, join, reduce, and iterate that compile down to timely dataflow, so programs reactively update their results when input collections change rather than recomputing from scratch. The repository is a workspace monorepo: differential-dataflow/ is the core crate, with src/ holding the implementation, examples/ holding runnable demos (hello.rs, bfs.rs, reachability.rs, columnar examples), benches/ holding performance tests, and CHANGELOG.md tracking versions. Companion crates like dogsdogsdogs, interactive, server/, and diagnostics/ provide integrations and tools; advent_of_code_2017/ serves as stress-test examples.

👥Who it's for

Data engineers and systems programmers building distributed data processing pipelines that need to handle incremental updates efficiently (e.g., graph algorithms, reachability queries, degree distributions). Contributors are typically familiar with functional programming paradigms and Rust systems programming.

🌱Maturity & risk

Production-ready with active development: the project is versioned at 0.23.0, requires Rust 1.86 or newer, has comprehensive CI workflows (test.yml, test-timely-master.yml, release-plz.yml), maintains a CHANGELOG.md, and includes extensive examples. The workspace structure and workspace.dependencies configuration show mature dependency management. However, the 0.x version signals continued evolution rather than strict API-stability guarantees.

Low-to-moderate risk: the project depends heavily on timely (version 0.29, with a commented-out git dependency suggesting the maintainers track timely master), and the Rust-ecosystem dependency surface is relatively small. The single main crate (differential-dataflow/) plus companion crates in the workspace mitigate fragmentation risk. No open-issue backlog is visible from the file list alone, and the 0.x version means breaking changes are possible between minor versions.

Active areas of work

Active maintenance with release automation (release-plz.yml), CI testing against both released timely (test.yml) and timely master (test-timely-master.yml) to ensure forward compatibility. The workspace includes experimental features like columnar examples (differential-dataflow/examples/columnar/) suggesting ongoing work on memory-efficient data representations.

🚀Get running

```bash
git clone https://github.com/TimelyDataflow/differential-dataflow.git
cd differential-dataflow
cargo test --workspace
cargo run --example hello --manifest-path differential-dataflow/Cargo.toml
```

Daily commands: there is no single dev server. Instead, run examples (`cargo run --example hello --manifest-path differential-dataflow/Cargo.toml`) or benchmarks (`cargo bench --manifest-path differential-dataflow/Cargo.toml`). For the diagnostics server, see diagnostics/examples/smoke.rs. Build docs with `cargo doc --workspace --open`.
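
For orientation, here is a minimal sketch in the style of the repo's introductory examples (hello.rs and the README); InputSession, to_collection, insert/remove, and advance_to are real APIs of the crate, but exact type bounds vary by version. It needs both timely and differential-dataflow as dependencies:

```rust
use differential_dataflow::input::InputSession;

fn main() {
    timely::execute_from_args(std::env::args(), move |worker| {
        // Explicit types: timestamp u64, data u64, difference isize.
        let mut input = InputSession::<u64, u64, isize>::new();

        worker.dataflow(|scope| {
            input
                .to_collection(scope)
                .map(|x| x * 2)
                // Each output is (data, time, diff); diffs are +1/-1 deltas.
                .inspect(|x| println!("{:?}", x));
        });

        // Round 1: insert a record, advance the input time, run the worker.
        input.insert(3);
        input.advance_to(1);
        input.flush();
        worker.step();

        // Round 2: retract it — downstream sees a -1 diff, not a recompute.
        input.remove(3);
        input.advance_to(2);
        input.flush();
        worker.step();
    })
    .expect("computation terminated abnormally");
}
```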

🗺️Map of the codebase

  • differential-dataflow/src/lib.rs — Entry point and core module exports defining the public API of the differential dataflow library.
  • differential-dataflow/src/collection.rs — Core Collection abstraction that wraps timely dataflow streams and provides differential operators like map, filter, join, reduce.
  • differential-dataflow/src/operators/mod.rs — Master module for all differential dataflow operators (arrange, join, reduce, iterate); defines extension traits on Collection.
  • differential-dataflow/src/trace/mod.rs — Trace abstraction for batching and indexing updates over time; critical for efficient incremental computation.
  • differential-dataflow/src/difference.rs — Defines the Monoid trait for update differences; enables tracking fine-grained changes to collections.
  • differential-dataflow/src/input.rs — Input handle and session management for feeding data into differential dataflow computations.
  • differential-dataflow/src/lattice.rs — Partial order and lattice abstractions for timestamps; foundational to differential dataflow's incremental semantics.
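
To make difference.rs and input.rs concrete, a hedged fragment (it assumes the session is later attached to a dataflow via to_collection, as in the sketch above): differences are signed multiplicities, so inserts, removals, and batched updates all consolidate the same way.

```rust
use differential_dataflow::input::InputSession;

// Timestamp u64, data String, difference isize (a signed multiplicity).
let mut input = InputSession::<u64, String, isize>::new();

// An insert is update(+1), a removal is update(-1); updates to the same
// record at the same time consolidate by summing their differences.
input.update("page/a".to_string(), 2);  // two copies at once
input.update("page/a".to_string(), -1); // net multiplicity: 1
input.advance_to(1);
input.flush();
```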

🛠️How to make changes

Add a New Collection Operator

  1. Define a new trait method on the CollectionExt trait in differential-dataflow/src/operators/mod.rs (differential-dataflow/src/operators/mod.rs)
  2. Implement the operator by consuming self (a Collection), applying timely dataflow operators on the underlying stream, and wrapping results in a new Collection (differential-dataflow/src/operators/mod.rs)
  3. If the operator requires indexed access (like join), arrange the collection first using the arrange operator in differential-dataflow/src/operators/arrange/mod.rs (differential-dataflow/src/operators/arrange/mod.rs)
  4. Write an example demonstrating the operator at differential-dataflow/examples/{operator_name}.rs and add to Cargo.toml [[example]] section (differential-dataflow/examples)
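
A minimal sketch of steps 1–2, assuming recent trait bounds (they vary by version); the trait and method names here are illustrative, not part of the crate:

```rust
use differential_dataflow::Collection;
use differential_dataflow::difference::Semigroup;
use timely::dataflow::Scope;
use timely::Data;

/// Hypothetical extension trait for keyed collections.
pub trait KeysOnly<G: Scope, K: Data, R: Semigroup + 'static> {
    /// Project a (key, value) collection down to its keys.
    fn keys_only(&self) -> Collection<G, K, R>;
}

impl<G: Scope, K: Data, V: Data, R: Semigroup + 'static> KeysOnly<G, K, R>
    for Collection<G, (K, V), R>
{
    fn keys_only(&self) -> Collection<G, K, R> {
        // Delegate to existing Collection combinators where possible;
        // only drop down to raw timely operators when none fits.
        self.map(|(k, _v)| k)
    }
}
```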

Add a New Graph Algorithm

  1. Create a new file in differential-dataflow/src/algorithms/graphs/{algorithm_name}.rs with a public function taking a Collection of edges (differential-dataflow/src/algorithms/graphs)
  2. Use iterate() to implement fixed-point computation, leveraging join and reduce operators on collections (differential-dataflow/src/operators/iterate.rs)
  3. Export the function in differential-dataflow/src/algorithms/graphs/mod.rs (differential-dataflow/src/algorithms/graphs/mod.rs)
  4. Create an example at differential-dataflow/examples/{algorithm_name}.rs that loads a graph and calls your algorithm (differential-dataflow/examples)
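
A sketch of the pattern, modeled on the existing bfs.rs/reachability examples (exact bounds and operator imports may differ by version):

```rust
use differential_dataflow::Collection;
use differential_dataflow::lattice::Lattice;
use differential_dataflow::operators::{Iterate, Join, Threshold};
use timely::dataflow::Scope;

/// Nodes reachable from `roots` along `edges`, as an iterative fixed point.
fn reachable<G>(
    edges: &Collection<G, (u32, u32)>,
    roots: &Collection<G, u32>,
) -> Collection<G, u32>
where
    G: Scope,
    G::Timestamp: Lattice + Ord,
{
    roots.iterate(|reach| {
        // Bring the outer collections into the iteration scope.
        let edges = edges.enter(&reach.scope());
        let roots = roots.enter(&reach.scope());

        reach
            .map(|node| (node, ()))
            .join_map(&edges, |_src, &(), &dst| dst) // follow one edge
            .concat(&roots)                          // keep the seeds
            .distinct()                              // converge to a set
    })
}
```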

Create a Custom Trace Implementation

  1. Define a new struct implementing the Trace trait in a file under differential-dataflow/src/trace/implementations/ (differential-dataflow/src/trace/implementations)
  2. Implement Cursor trait to provide seek/valid/key/val navigation semantics over your batch format (differential-dataflow/src/trace/cursor/mod.rs)
  3. Implement Batcher trait for your trace type to batch incoming updates before committing to immutable storage (differential-dataflow/src/trace/mod.rs)
  4. Register your trace type in arrange operations by implementing TraceReader and making it available via arrangement configuration (differential-dataflow/src/operators/arrange/arrangement.rs)

Enable Logging & Diagnostics

  1. In your dataflow setup, call enable_tracing() and pass a logger instance in differential-dataflow/src/logging.rs (differential-dataflow/src/logging.rs)
  2. Run the diagnostic server at diagnostics/src/server.rs to visualize operator execution and bottlenecks (diagnostics/src/server.rs)
  3. Open the browser dashboard at diagnostics/index.html to inspect trace data (diagnostics/index.html)

🔧Why these technologies

  • Rust + Timely Dataflow — Enables safe, lock-free concurrent computation with explicit dataflow DAG scheduling; fine control over memory layout and iteration.
  • Monoid-based updates (Difference trait) — Allows tracking arbitrary updates (additions, deletions, negations) in a composable way; enables reversible incremental computation.
  • Partial order lattices for timestamps — Generalizes versioning beyond linear time; supports complex causality and enables efficient out-of-order execution.
  • Trace/Batch abstraction with Cursors — Separates logical collection semantics from physical storage; allows swapping storage engines (B-tree, columnar, etc.) without changing operators.
  • Arrangement for indexed access — Pre-indexes collections to enable efficient O(log n) joins and lookups; critical for scalability in multi-way joins.
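
To illustrate the last bullet, a hedged fragment (inside some dataflow scope, with `edges: Collection<_, (u32, u32)>` and `seeds: Collection<_, u32>` assumed from context) that builds one arrangement and shares it across two joins; arrange_by_key and join_core are the real entry points, the surrounding names are illustrative:

```rust
use differential_dataflow::operators::arrange::ArrangeByKey;
use differential_dataflow::operators::JoinCore;

// Build the (source -> destination) index once...
let edges_by_src = edges.arrange_by_key();

// ...then share it across joins instead of re-indexing per operator.
let one_hop = seeds
    .map(|n| (n, ()))
    .join_core(&edges_by_src, |_src, &(), &dst| Some(dst));
let two_hop = one_hop
    .map(|n| (n, ()))
    .join_core(&edges_by_src, |_src, &(), &dst| Some(dst));
```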

⚖️Trade-offs already made

  • Lazy evaluation of operators combined with explicit timely::execute scopes

    • Why: Allows strong compile-time guarantees about dataflow structure and enables global optimization; shifts complexity to user.
    • Consequence: Steeper learning curve but safer, more predictable performance; graph structure known at operator-definition time.
  • Monoid differences instead of simple addition/deletion

    • Why: Supports richer update semantics (e.g., weighted graphs, probabilistic counts); more expressive.
    • Consequence: extra generic machinery — operators and user code carry Semigroup/Monoid bounds on the difference type, which makes signatures more complex than a plain insert/delete model would be.

🪤Traps & gotchas

  1. Timestamp semantics: differential dataflow relies heavily on timestamp propagation through operators; misunderstanding lattice timestamps (a partial order on time) leads to subtle correctness bugs.
  2. Scope nesting: .enter(scope) and .leave() manage scope transitions; incorrect nesting breaks compilation or produces deadlocks.
  3. Capability semantics: timely's push-based execution requires understanding capabilities and when they are retained or released; leaking capabilities stalls pipelines.
  4. Columnar feature interaction: examples use columnar (0.12) for memory efficiency, but it is optional — ensure feature flags match your use case.
  5. Timely version coupling: the workspace allows testing against timely master (test-timely-master.yml), but this can introduce unreleased API changes.
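
For gotcha 2, a small sketch (assuming `outer` is a scope and `collection` a collection within it): entering and leaving a child scope must bracket the inner logic symmetrically.

```rust
use timely::dataflow::Scope;

// `iterative` opens a child scope whose timestamps gain a loop counter
// (here u64). Data must `enter` the child and symmetrically `leave` it;
// dropping either half is exactly the nesting trap described above.
let doubled = outer.iterative::<u64, _, _>(|inner| {
    collection
        .enter(inner)
        .map(|x: u64| x * 2)
        .leave()
});
```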

🏗️Architecture

💡Concepts to learn

  • Differential traces & version vectors — Core mechanism for incremental computation in differential dataflow; timestamps (partial orders) track which updates have been processed, enabling systems to compute only deltas rather than recompute entire results.
  • Timely's Push-Based Dataflow Graphs — Differential dataflow operators compile to timely's push-based execution model; understanding capabilities, frontiers, and scope nesting is essential to debugging performance or correctness issues.
  • Semi-lattice / Partial Order Semantics — Differential dataflow's operators (join, reduce, iterate) require partially-ordered timestamps and a meet operation; misunderstanding lattice properties leads to incorrect incremental updates.
  • Collection Parameterization by Key/Value Types — Differential's Collection<S, (K,V), R> type is parameterized by scope S, data type (K,V), and multiplicity type R; understanding why this design enables operators like join and reduce is critical for API usage.
  • Columnar Data Layout & Memory Efficiency — Examples use the columnar crate (0.12) for memory-efficient representation of collections; this is non-obvious from the high-level API but crucial for production workloads processing large data volumes.
  • Reachability as an Iterative Fixed Point — The examples/reachability.rs and graph examples demonstrate how iterate() computes fixed points; this pattern generalizes beyond graphs to any transitive closure or convergent property.
  • Work-Stealing & Capability Management in timely — Differential programs can deadlock if capabilities are leaked or held incorrectly; understanding how timely schedules work-stealing execution prevents subtle bugs in custom operators.
  • TimelyDataflow/timely-dataflow — The underlying distributed execution engine; differential-dataflow is a higher-level abstraction compiled to timely dataflow programs.
  • TimelyDataflow/dogsdogsdogs — Companion workspace crate providing join and group-by implementations; heavily used by differential-dataflow examples for complex queries.
  • TimelyDataflow/abomonation — Serialization library optimized for timely/differential (zero-copy deserialization); likely used internally for efficient data movement.
  • apache/arrow-datafusion — A Rust query engine in the Apache Arrow ecosystem; it recomputes queries in batch rather than incrementally, making it a useful contrast with differential computation.
  • sigmajs/sigma.js — Example integration target: the diagnostics server (diagnostics/src/server.rs) can visualize differential dataflow computations, similar to how Sigma visualizes graphs.
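
To pin down the "Collection Parameterization" bullet above, a small type sketch (the names are illustrative): the concept list writes the type as Collection<S, (K,V), R>; the scope parameter is conventionally spelled G in the source.

```rust
use differential_dataflow::Collection;
use timely::dataflow::Scope;

// G: the timely scope; (String, String): the record type (K, V);
// isize: the difference type R, i.e. a signed multiplicity.
type Visits<G> = Collection<G, (String, String), isize>;

fn users_only<G: Scope>(visits: &Visits<G>) -> Collection<G, String, isize> {
    visits.map(|(user, _page)| user)
}
```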

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for differential-dataflow/src/algorithms/graphs modules

The repo has graph algorithm implementations (bfs.rs, bijkstra.rs) but there are no visible test files in differential-dataflow/src/algorithms/graphs/. Graph algorithms are core functionality and should have dedicated test coverage for correctness validation, edge cases (empty graphs, single nodes, cycles), and performance characteristics.

  • [ ] Create differential-dataflow/src/algorithms/graphs/tests.rs with unit tests for BFS implementation
  • [ ] Add test cases for dijkstra algorithm covering basic paths, unreachable nodes, and edge weights
  • [ ] Add property-based tests using proptest to validate graph traversal correctness across random graph structures
  • [ ] Document expected time/space complexity in test comments for algorithm validation

Create integration tests validating columnar data format support end-to-end

The repo has differential-dataflow/examples/columnar/ directory showing columnar support is a feature, and columnar is listed as a workspace dependency (v0.12). However, there are no visible integration tests in differential-dataflow/ validating that columnar arrangements work correctly with differential dataflow operators (join, reduce, iterate).

  • [ ] Create differential-dataflow/tests/columnar_integration.rs with end-to-end columnar arrangement tests
  • [ ] Test columnar format with differential dataflow joins to ensure output correctness matches non-columnar
  • [ ] Add performance comparison tests between columnar and standard arrangements
  • [ ] Validate columnar support works across the iterate operator used in examples/columnar/

Add CI workflow to test against timely-dataflow master branch on schedule

The repo has .github/workflows/test-timely-master.yml but it's not clearly documented when it runs. Given the tight coupling between differential-dataflow and timely-dataflow (workspace dependencies reference both versions and git paths are commented), a scheduled workflow that runs nightly would catch API breakage early. The current workspace comment '#timely = { git = "https://github.com/TimelyDataflow/timely-dataflow"' suggests maintainers want to test master but lack automation.

  • [ ] Verify test-timely-master.yml exists and add schedule trigger (nightly or weekly)
  • [ ] Add step to uncomment git dependency in Cargo.toml and test full workspace against timely master
  • [ ] Add notification step to .github/create-issue-on-failure.md workflow to report breakage
  • [ ] Document in CONTRIBUTING.md when/why master branch testing is important for contributors

🌿Good first issues

  • Add missing doctests to operators in differential-dataflow/src/operators/: many operator implementations lack /// # Examples documentation with executable examples. Start with mod.rs to identify operators without doctests.
  • Create a beginner-focused example in differential-dataflow/examples/ demonstrating the three-layer architecture (Collection → timely operators → computation graph). Current examples jump to advanced use cases (reachability, graph algorithms); a simple stateful aggregation example would help new users.
  • Document the difference between .iterate() and manual scope entry in a new guide file or README section: the examples directory has both patterns but no explanation of when to use each, creating confusion for contributors unfamiliar with iterative computation semantics.


📝Recent commits

  • 1f348ab — chore: release v0.23.0 (#713) (github-actions[bot])
  • c67ad11 — Release preparation (#722) (frankmcsherry)
  • 775967a — chore: release v0.22.0 (#712) (github-actions[bot])
  • 64ec68f — Allow InternalMerger::extract to yield when full buffer (#711) (frankmcsherry)
  • 905537e — Merge pull request #703 from TimelyDataflow/release-plz-2026-03-26T17-38-20Z (frankmcsherry)
  • 04a69e4 — chore: release v0.21.1 (github-actions[bot])
  • 7bf2e02 — Merge pull request #707 from frankmcsherry/half_join_ordered_from_master (frankmcsherry)
  • 7944101 — Consolidate storage types to avoid double allocations (frankmcsherry)
  • 2a0fa88 — Re-order stages and improve session batching (frankmcsherry)
  • 1415cdd — Less quadratic half_join (frankmcsherry)

🔒Security observations

The differential-dataflow repository maintains a reasonable security posture with comprehensive clippy lints enabled and explicit dependency management. However, there are moderate concerns regarding the use of path-based and commented git dependencies without explicit pinning, incomplete lints configuration, and the absence of formal security documentation. The codebase appears to be a legitimate research/scientific project focused on dataflow processing without typical web application vulnerabilities (SQLi, XSS). No hardcoded secrets or credentials were detected in the provided file structure. Recommendations focus on strengthening dependency management, completing configuration validation, and establishing clear security policies.

  • Medium · Incomplete Clippy Lints Configuration — Cargo.toml - [workspace.lints.clippy]. The workspace clippy lints configuration in Cargo.toml is incomplete. The 'redundant_static_lifetimes' lint line ends abruptly without closing the string, which may cause parsing issues or allow unintended code patterns to slip through. Fix: Complete the configuration by properly closing the string: 'redundant_static_lifetimes = "warn"' and add any other necessary lints.
  • Medium · Dependency on External Git Repository — Cargo.toml - [workspace.dependencies]. The Cargo.toml contains commented-out dependency on timely dataflow from a GitHub repository: 'timely = { git = "https://github.com/TimelyDataflow/timely-dataflow", ... }'. While currently commented, this pattern could be used to pull untrusted code if uncommented without verification. Fix: Ensure any git-based dependencies use explicit version pinning with a specific commit hash rather than relying on branch references. Verify the repository and maintain a whitelist of trusted sources.
  • Medium · Local Path Dependencies in Workspace — Cargo.toml - [workspace.dependencies]. The workspace uses local path-based dependencies (e.g., 'differential-dataflow = { path = "differential-dataflow", ... }'), which could be vulnerable if the local filesystem is compromised or if files are modified unexpectedly. Fix: Implement file integrity monitoring for critical dependency paths. Consider using dependency verification features and lock files to ensure reproducible builds.
  • Low · Default Features Disabled Without Clear Justification — Cargo.toml - [workspace.dependencies]. Multiple dependencies explicitly disable default features (e.g., 'differential-dataflow = { ..., default-features = false }'), which may exclude important security features or stability improvements. Fix: Document why default features are disabled for each dependency. Audit the disabled features to ensure no security-relevant features are being excluded.
  • Low · Missing SECURITY.md File — Repository root. The repository lacks a SECURITY.md file that would provide security policy, vulnerability disclosure procedures, and security contact information. Fix: Create a SECURITY.md file following GitHub's recommended format that includes: security policy, how to report vulnerabilities, expected response times, and any security contacts.
  • Low · Unrestricted Rust Version — Cargo.toml - [workspace.package]. While a minimum rust-version of 1.86 is specified, there is no maximum version constraint. Using untested newer Rust versions could introduce unforeseen behavioral changes. Fix: Consider implementing MSRV (Minimum Supported Rust Version) testing in CI/CD pipelines. Document and test against specific Rust versions.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
