weld-project/weld
High-performance runtime for data analytics applications
Healthy across the board
Permissive license, no critical CVEs, actively maintained: safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 3w ago
- ✓18 active contributors
- ✓BSD-3-Clause licensed
- ✓CI configured
- ✓Tests present
- ⚠Concentrated ownership — top contributor handles 77% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding doc
Onboarding: weld-project/weld
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
- Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/weld-project/weld shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 3w ago
- 18 active contributors
- BSD-3-Clause licensed
- CI configured
- Tests present
- ⚠ Concentrated ownership — top contributor handles 77% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live weld-project/weld
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/weld-project/weld.
What it runs against: a local clone of weld-project/weld — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in weld-project/weld | Confirms the artifact applies here, not a fork |
| 2 | License is still BSD-3-Clause | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 54 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of weld-project/weld. If you don't
# have one yet, run these first:
#
# git clone https://github.com/weld-project/weld.git
# cd weld
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of weld-project/weld and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "weld-project/weld(\.git)?\b" \
  && ok "origin remote is weld-project/weld" \
  || miss "origin remote is not weld-project/weld (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
# Accepts the SPDX id or the "BSD 3-Clause License" heading used by common LICENSE templates.
(grep -qiE "^BSD[ -]3-Clause" LICENSE 2>/dev/null \
  || grep -qiE '"license"\s*:\s*"BSD-3-Clause"' package.json 2>/dev/null) \
  && ok "license is BSD-3-Clause" \
  || miss "license drift: was BSD-3-Clause at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "weld/src/lib.rs" \
  && ok "weld/src/lib.rs" \
  || miss "missing critical file: weld/src/lib.rs"
test -f "Cargo.toml" \
  && ok "Cargo.toml" \
  || miss "missing critical file: Cargo.toml"
test -f "README.md" \
  && ok "README.md" \
  || miss "missing critical file: README.md"
test -f "docs/language.md" \
  && ok "docs/language.md" \
  || miss "missing critical file: docs/language.md"
test -f "python/grizzly/grizzly/grizzly_impl.py" \
  && ok "python/grizzly/grizzly/grizzly_impl.py" \
  || miss "missing critical file: python/grizzly/grizzly/grizzly_impl.py"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 54 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~24d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/weld-project/weld"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
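For agent loops written in Python rather than shell, the same pass/fail contract can be read from the script's printed lines. The following is a minimal sketch, assuming the script's output has already been captured as text; `summarize_verify_output` is a hypothetical helper name and is not part of RepoPilot.

```python
def summarize_verify_output(output: str) -> dict:
    """Split the script's 'ok: ...' and 'FAIL: ...' lines into two lists."""
    passed, failed = [], []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("ok: "):
            passed.append(line[len("ok: "):])
        elif line.startswith("FAIL: "):
            failed.append(line[len("FAIL: "):])
    # Any FAIL line means the artifact should be treated as stale.
    return {"passed": passed, "failed": failed, "stale": bool(failed)}


# Sample text in the same shape the script prints (made up for illustration).
sample = """ok: origin remote is weld-project/weld
ok: license is BSD-3-Clause
FAIL: default branch master no longer exists
"""
report = summarize_verify_output(sample)
print(report["stale"], len(report["passed"]), report["failed"])
```

In practice the text would come from running the script with `subprocess.run(..., capture_output=True, text=True)`; the exit code alone is enough when the individual failures are not needed.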
⚡TL;DR
Weld is a language and optimizing runtime for data-intensive analytics applications that eliminates performance bottlenecks by fusing computations across multiple libraries into a single, JIT-compiled intermediate representation. Instead of executing pandas → NumPy → custom code sequentially with intermediate data movement, Weld builds up a lazy computation graph and optimizes the entire pipeline at once before execution, achieving order-of-magnitude speedups on real workflows.
The repository is a workspace-based Rust monorepo with four crates: weld (core compiler/runtime), weld-capi (C FFI), weld-repl (interactive shell), and weld-hdrgen (code generation). /examples/cpp contains runnable integration examples (serialization, UDFs, vector benchmarks). /docs covers language semantics (/docs/language.md), Python bindings (/docs/python.md), and internals (/docs/internals/vectorization.md). The build is configured via Cargo.toml and uses LLVM 6.0 as the codegen backend.
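To make "fusing computations" concrete, the sketch below contrasts step-by-step execution, where each step materializes a full intermediate list, with a fused single pass. It is plain Python for illustration only. It does not use Weld, and both function names are hypothetical.

```python
def unfused(values):
    """Step-by-step execution: every stage builds a complete intermediate list."""
    doubled = [v * 2 for v in values]       # intermediate list 1
    kept = [v for v in doubled if v > 4]    # intermediate list 2
    return sum(kept)


def fused(values):
    """The same computation as one loop with no intermediate lists."""
    total = 0
    for v in values:
        d = v * 2
        if d > 4:
            total += d
    return total


data = [1, 2, 3, 4]
assert unfused(data) == fused(data)
print(fused(data))  # 6 + 8 = 14
```

The two functions return the same value; the fused form simply avoids writing and re-reading the intermediate results, which is the kind of data movement the paragraph above describes.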
👥Who it's for
Data scientists and analytics engineers building multi-library pipelines (Pandas, NumPy, Spark, TensorFlow) who suffer from excessive data movement between functions; also machine learning researchers interested in compiler optimizations for data processing and JIT compilation strategies.
🌱Maturity & risk
Actively developed but not production-grade: the project has CI via Travis CI (.travis.yml), comprehensive docs in /docs, and a Google Group discussion community, but no GitHub star data was available for this analysis and the build depends on LLVM 6.0 (released in 2018), which suggests research-oriented code. Use for academic exploration and prototyping, not mission-critical systems.
LLVM 6.0 is obsolete (released in 2018 and no longer maintained upstream); the codebase mixes Rust (1.1M lines) with C++/C bindings, increasing surface area for FFI bugs. The workspace structure (weld-python excluded, weld-capi/weld-repl/weld-hdrgen as separate crates) suggests incomplete Python bindings. Concentrated ownership (the top contributor accounts for 77% of recent commits) raises questions about maintenance velocity, even though the last commit landed about three weeks before this analysis.
Active areas of work
Not determinable from the provided metadata alone. The recent commits listed in this document concentrate on Grizzly (Series operations, aggregation functions, string functions, DataFrame foundations). The file structure also shows mature documentation (api.md, language.md, serialization.md, tools.md) and multiple C++ example categories (composite_types, udfs, serialization variants, vector_benchmark), which cover type systems, user-defined functions, and performance validation.
🚀Get running
# Install Rust and LLVM 6.0 (macOS example)
rustup update stable
brew install llvm@6
ln -sf $(brew --prefix llvm@6)/bin/llvm-config /usr/local/bin/llvm-config
llvm-config --version # verify 6.0.x
# Clone and build
git clone https://github.com/weld-project/weld.git
cd weld
cargo build --release
# Run tests
cargo test
Daily commands:
- Development: cargo build compiles the workspace.
- Interactive: cargo run --bin weld-repl launches the REPL (see /docs/tutorial.md for commands).
- Testing: cargo test --all runs unit/integration tests.
- Examples: cd examples/cpp/add_repl && make && ./add_repl for a C++ integration demo.
- See /docs/tools.md for CLI utilities.
🗺️Map of the codebase
- weld/src/lib.rs: Core library entry point; defines the main public API and compilation pipeline that every Weld program flows through
- Cargo.toml: Workspace configuration defining all member crates (weld, weld-capi, weld-repl, weld-hdrgen) and their interdependencies
- README.md: Project overview explaining Weld's purpose (optimizing data-intensive workflows via lazy compilation and cross-library optimization)
- docs/language.md: Specification of Weld's intermediate representation language; essential for understanding what programs are compiled
- python/grizzly/grizzly/grizzly_impl.py: Python bindings orchestrating lazy evaluation and compilation; entry point for high-level user code
- weld-capi/src/lib.rs: C API wrapper exposing Weld's compiler and runtime to non-Rust consumers; critical for cross-language integration
- docs/api.md: Public API documentation for C and Rust interfaces; reference for integrating Weld into applications
🛠️How to make changes
Add a new optimization pass
- Define the pass as a function in weld/src/optimizer.rs that takes an AST and returns an optimized AST (weld/src/optimizer.rs)
- Integrate the pass into the compiler pipeline in weld/src/compiler.rs, ordering it relative to other passes (weld/src/compiler.rs)
- Add tests in weld/tests/ demonstrating the optimization's effect on a representative program (weld/tests/opt_test.rs)
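The shape of such a pass (AST in, optimized AST out, plus a test showing its effect) can be sketched in a few lines. This is a toy constant-folding pass in Python over a made-up tuple-based AST. It is not Weld's Rust AST or optimizer API, and every name in it is hypothetical.

```python
def fold_constants(expr):
    """Return a new AST in which operations on two literals are pre-computed.

    Toy node shapes (hypothetical): ("lit", n), ("var", name),
    ("add", left, right), ("mul", left, right).
    """
    kind = expr[0]
    if kind in ("lit", "var"):
        return expr
    # Optimize children first so folding can cascade upward.
    left, right = fold_constants(expr[1]), fold_constants(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        value = left[1] + right[1] if kind == "add" else left[1] * right[1]
        return ("lit", value)
    return (kind, left, right)


# x + (2 * 3) should become x + 6.
program = ("add", ("var", "x"), ("mul", ("lit", 2), ("lit", 3)))
optimized = fold_constants(program)
assert optimized == ("add", ("var", "x"), ("lit", 6))
```

The closing assertion plays the role of the third step: it pins the pass's effect on a representative program, so a later change that breaks the rewrite is caught.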
Add a new aggregation primitive
- Extend the AggregationOp enum in weld/src/ast.rs with your new operator variant (weld/src/ast.rs)
- Implement type inference in weld/src/compiler.rs for the new operator (weld/src/compiler.rs)
- Add LLVM codegen in weld/src/codegen.rs to emit the runtime call or inline computation (weld/src/codegen.rs)
- Update docs/language.md to document syntax and semantics for end-users (docs/language.md)
Add a new Python high-level API
- Create a new class in python/grizzly/grizzly/ (e.g., grizzly_new.py) wrapping lazy operations (python/grizzly/grizzly/grizzly_new.py)
- Use grizzly_impl.py's LazyOp builder to construct operation graphs (python/grizzly/grizzly/grizzly_impl.py)
- Add unit tests in python/grizzly/tests/ verifying correctness against a pandas/numpy baseline (python/grizzly/tests/grizzly_test.py)
- Create an example in examples/python/ demonstrating typical usage patterns (examples/python/hello_weld/example.py)
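The steps above amount to a wrapper that records operations and only computes when asked, checked against an eager baseline. A minimal pure-Python sketch follows. `LazySeries` is a hypothetical class written for illustration; it does not use grizzly_impl.py or the LazyOp builder, whose real interface should be read from the source.

```python
class LazySeries:
    """Hypothetical wrapper: records operations and runs them on evaluate()."""

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)

    def map(self, fn):
        # Nothing is computed here; the operation is only appended to the plan.
        return LazySeries(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazySeries(self._data, self._ops + (("filter", pred),))

    def evaluate(self):
        """Run every recorded operation in a single pass over the data."""
        out = []
        for value in self._data:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    value = fn(value)
                elif not fn(value):
                    keep = False
                    break
            if keep:
                out.append(value)
        return out


series = LazySeries([1, 2, 3, 4]).map(lambda v: v * 10).filter(lambda v: v > 20)
result = series.evaluate()

# Baseline check in the spirit of step 3: compare with an eager computation.
expected = [v * 10 for v in [1, 2, 3, 4] if v * 10 > 20]
assert result == expected
```

Returning a new `LazySeries` from each method keeps earlier plans reusable, which makes the baseline comparison straightforward to write for every new operation.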
Expose a new C API function
- Implement a public function in weld/src/lib.rs (Rust) that performs the desired operation (weld/src/lib.rs)
- Wrap it with a #[no_mangle] extern C function in weld-capi/src/lib.rs to create a C binding (weld-capi/src/lib.rs)
- Update docs/api.md with the function signature and behavior documentation (docs/api.md)
- Create a C/C++ example in examples/cpp/ demonstrating the new function (examples/cpp/add_repl/add_repl.cpp)
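From the caller's side, an exported C function is consumed by loading a library and declaring the function's signature. The sketch below shows that pattern with Python's ctypes, using libc's `abs` as a stand-in because the actual weld-capi symbol names are not listed in this document. It assumes a Linux or macOS process, where `ctypes.CDLL(None)` exposes the symbols already loaded, including libc.

```python
import ctypes

# dlopen(NULL): look up symbols already loaded into this process.
# Assumption: a real binding would load the weld-capi shared library by path.
libc = ctypes.CDLL(None)

# Declare the C signature explicitly instead of relying on ctypes defaults.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

value = libc.abs(-7)
print(value)
```

Declaring `argtypes` and `restype` matters more for pointer-returning functions, where the default `int` return type would truncate addresses on 64-bit platforms.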
🔧Why these technologies
- Rust — Type-safe systems programming for compiler and runtime; zero-cost abstractions for performance-critical code
- LLVM — Mature JIT compilation infrastructure; leverages existing optimizations (loop fusion, vectorization, register allocation)
- Python ctypes bindings — Non-invasive integration with existing Python/Pandas/NumPy ecosystems without forking libraries
- C FFI — Cross-language compatibility; enables use from C++, Java, and other languages without reimplementation
⚖️Trade-offs already made
- Lazy evaluation via operation graph deferral
  - Why: Enables cross-function optimization by observing entire workflow before compilation
  - Consequence: Delayed execution breaks synchronous APIs; requires explicit .evaluate() calls; harder to debug individual steps
- JIT compilation at runtime instead of ahead-of-time
  - Why: Allows specialization to actual data types and values; avoids need for generic code
  - Consequence: First invocation incurs ~100ms compilation overhead; amortized over large workloads but problematic for streaming/
- Weld IR as custom DSL rather than lowering to existing IRs
  - Why: Exposes high-level intent (e.g., 'merge' vs sequence of loops) for domain-specific optimizations
  - Consequence: Larger surface area for bugs; less tooling support; developers must learn new language
- Explicit loop-centric operations (map, reduce, filter) rather than SQL-like syntax
  - Why: More expressive for custom computations; closer to procedural mental models
  - Consequence: Less declarative; harder for automatic parallelization than relational models
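The JIT trade-off above is a break-even calculation: compilation pays off once the per-element saving, multiplied by the input size, exceeds the one-time compile cost. The sketch below works that out. The ~100 ms figure comes from the list above, while the per-element costs are made-up illustrative numbers, not measurements of Weld.

```python
def break_even_elements(compile_ms, baseline_ns_per_elem, jit_ns_per_elem):
    """Smallest input size at which compile-then-run beats the baseline path.

    Solves compile_ns + n * jit < n * baseline for the smallest integer n.
    """
    saving = baseline_ns_per_elem - jit_ns_per_elem
    if saving <= 0:
        return None  # compiled code is no faster per element, so it never pays off
    compile_ns = compile_ms * 1_000_000
    return int(compile_ns // saving) + 1


# Hypothetical costs: 10 ns/element before, 2 ns/element after compilation.
n = break_even_elements(compile_ms=100, baseline_ns_per_elem=10, jit_ns_per_elem=2)
print(n)  # 12500001
```

Under these assumed numbers, inputs below roughly 12.5 million elements would run faster without compilation, which is why small or latency-sensitive calls are the problematic case.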
🚫Non-goals (don't propose these)
- Does not provide a SQL query engine or relational optimizer
- Does not handle distributed execution or multi-machine parallelism
- Does not manage I/O or support streaming data sources; computation is over in-memory data only
- Does not replace NumPy/Pandas; designed as an optimization layer above them, not a standalone array library
- Does not provide imperative control-flow statements; conditionals and loops are expressions in the IR (check docs/language.md for the exact constructs)
- Does not guarantee deterministic performance; JIT compilation and optimization heuristics vary by workload
🪤Traps & gotchas
- LLVM version is pinned to 6.0: llvm-config must be on $PATH and return 6.0.x exactly; modern LLVM versions may break codegen.
- Python bindings are excluded: weld-python is not in the workspace members; integration requires manual FFI calls via weld-capi.
- Memory safety across FFI: the C FFI layer is a manual unsafe boundary; buffer overruns in deserialization (examples/cpp/serialization) can segfault.
- No async/GPU support: Weld is synchronous, CPU-only; expect single-threaded performance and potential blocking on I/O.
- Expression-based, not imperative: mental model shift needed; loops are fusion operators, not instructions.
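The LLVM pin is easy to check before a build. A minimal sketch, assuming the text passed in is the output of `llvm-config --version`; `is_supported_llvm` is a hypothetical helper, not part of the Weld build.

```python
import re


def is_supported_llvm(version_output: str) -> bool:
    """True when the version string is a 6.0.x release (suffixes such as 'svn' allowed)."""
    return re.fullmatch(r"6\.0\.\d+\S*", version_output.strip()) is not None


print(is_supported_llvm("6.0.1\n"), is_supported_llvm("12.0.1"))
```

The input would typically come from `subprocess.run(["llvm-config", "--version"], capture_output=True, text=True).stdout`, with a missing binary handled as a separate failure.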
🏗️Architecture
💡Concepts to learn
- Lazy evaluation and computation fusion — Core to Weld's speedup—building a deferred computation graph and optimizing across function boundaries is the entire point; without understanding laziness, the architecture is opaque.
- Loop fusion (iterator fusion / deforestation) — Weld's main optimization pass; eliminating intermediate vectors between pipelined operations is why Weld achieves order-of-magnitude speedups (see docs/internals/vectorization.md).
- Just-in-time (JIT) compilation — Weld compiles expressions to LLVM IR at runtime; JIT defers codegen until needed, enabling adaptive optimization based on actual data types.
- Intermediate representation (IR) — Weld's expression-based IR is the lingua franca for cross-library optimization; all Pandas/NumPy/UDF calls must lower to this IR or remain unoptimized.
- SIMD and vectorization — Weld generates SIMD-friendly loops via LLVM; understanding auto-vectorization and alignment constraints (docs/internals/vectorization.md) is critical for performance debugging.
- Foreign Function Interface (FFI) and unsafe Rust — weld-capi bridges Rust and C/C++/Python via unsafe blocks; FFI bugs can silently corrupt memory—familiarity with Rust's safety guarantees is mandatory for maintainers.
- Aggregation and builder pattern (merge, build, result) — Weld's expression syntax uses merge (combine partial results) and build (construct outputs); this is non-standard compared to SQL/pandas and requires mental reframing.
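The builder idea in the last bullet can be mimicked in plain Python: a loop threads a builder through every element, the loop body decides what to merge, and the final value is read out once at the end. This is a conceptual sketch only. `Merger` and `loop` are hypothetical names, and the real syntax is specified in docs/language.md.

```python
class Merger:
    """Hypothetical builder that combines merged values with a binary function."""

    def __init__(self, combine, identity):
        self._combine = combine
        self._value = identity

    def merge(self, item):
        self._value = self._combine(self._value, item)
        return self  # returned so the loop body can hand the builder onward

    def result(self):
        return self._value


def loop(values, builder, body):
    """Call body(builder, index, element) per element, threading the builder."""
    for index, element in enumerate(values):
        builder = body(builder, index, element)
    return builder


# Sum only the even elements: the body merges conditionally.
total = loop(
    [1, 2, 3, 4],
    Merger(lambda a, b: a + b, 0),
    lambda b, i, x: b.merge(x) if x % 2 == 0 else b,
).result()
print(total)  # 2 + 4 = 6
```

Because a filter and a reduction are expressed in one loop body, nothing like an intermediate "evens" list is ever built, which is what makes this style amenable to fusion.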
🔗Related repos
- apache/arrow: Arrow provides the columnar in-memory format and compute kernels that Weld's serialization layer (docs/serialization.md) parallels; integration would reduce reimplementation.
- dask/dask: Dask provides lazy task graphs for distributed Pandas/NumPy; Weld and Dask solve complementary problems (intra-node fusion vs. multi-node distribution).
- grizzly-group/grizzly: Grizzly is the 'Pandas on Weld' implementation (mentioned in README); critical consumer of Weld runtime demonstrating real-world integration.
- pytorch/glow: Glow is a compiler for ML inference with similar IR-to-LLVM-JIT architecture; shares vectorization and codegen strategies with Weld's approach.
- numba/numba: Numba JIT-compiles NumPy code to LLVM; Weld extends this concept across multiple libraries and adds cross-library fusion that Numba lacks.
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive Python binding tests for weld-python
The weld-python directory is explicitly excluded from the workspace (Cargo.toml), yet examples/python/ contains multiple complex use cases (grizzly, add_repl). There are no visible test files for the Python bindings. This is critical since Python is a primary user-facing interface. Adding pytest-based integration tests would validate the FFI bridge, serialization, and end-to-end workflows.
- [ ] Create weld-python/tests/ directory structure mirroring examples/python/
- [ ] Add unit tests for Python API wrappers (compilation, execution, error handling)
- [ ] Add integration tests matching grizzly examples (data_cleaning, birth_analysis workflows)
- [ ] Add serialization round-trip tests (matching docs/serialization.md examples)
- [ ] Document test setup in CONTRIBUTING.md for Python contributors
Migrate .travis.yml CI to GitHub Actions with matrix testing
.travis.yml and .travis/ scripts suggest the project uses Travis CI, but this is outdated infrastructure. The repo contains multi-version test scripts (.travis/test_multi_version.sh) and separate LLVM/Rust setup scripts, indicating complex matrix builds. Migrating to GitHub Actions would provide faster, more maintainable CI with native GitHub integration. The partial file structure shows tests across C++, Rust, and Python.
- [ ] Create .github/workflows/ci.yml with matrix for Rust versions (1.31+), LLVM versions, and OS (ubuntu/macos)
- [ ] Port .travis/llvm.sh and .travis/rust.sh logic into setup steps
- [ ] Add separate workflow for C++ examples (examples/cpp/) with Makefile builds
- [ ] Add Python test workflow triggering weld-python tests
- [ ] Preserve .travis.yml for legacy but document deprecation in CONTRIBUTING.md
Add missing documentation for C++ serialization examples
The examples/cpp/serialization/ folder contains 3 non-trivial examples (serialize_vec, serialize_dictionary, deserialize_vec_pointers) with only basic README.md files. docs/serialization.md exists but doesn't cross-reference or explain these examples in detail. There's a gap between the API documentation and practical C++ usage patterns for serialization.
- [ ] Expand docs/serialization.md with C++ subsection linking to each example
- [ ] Add inline code comments to deserialize_test.cpp and serialize_test.cpp explaining memory layout and pointer semantics
- [ ] Create docs/cpp-serialization-guide.md with walkthrough of each example (vec struct encoding, dictionary encoding, pointer deserialization)
- [ ] Add performance notes on serialization overhead in the guide
- [ ] Update examples/cpp/serialization/README.md with comparison table of when to use each serialization pattern
🌿Good first issues
- Add type annotations to the examples in /docs/language.md and create a corresponding /examples/rust/ directory with runnable code samples; most examples are pseudo-code.
- Extend /docs/serialization.md with a comparison table of vector, dictionary, and struct serialization formats (aligned vs. packed), then add a test case in /examples/cpp/serialization/ covering nested types.
- Update .travis/llvm.sh and README.md build instructions to LLVM 12+ and test compatibility; document any IR changes needed or file a breaking-changes GitHub issue.
- Write a /docs/performance_tuning.md guide with concrete examples from /examples/cpp/vector_benchmark/, explaining vectorization hints, memory layout, and profiling with LLVM's opt-remarks.
- Implement a missing built-in aggregation function (e.g., stddev, quantile) in the IR and expose it via weld-capi; add a unit test and a C++ example in /examples/cpp/.
⭐Top contributors
- @sppalkia — 77 commits
- @Max-Meldrum — 3 commits
- @pratiksha — 3 commits
- @radujica — 3 commits
- @kaz7 — 1 commit
📝Recent commits
- dcbba9a: DataFrame foundations (#510) (sppalkia)
- f5f9586: Aggregation functions for Grizzly Series (#509) (sppalkia)
- 778293e: Add string functions (#508) (sppalkia)
- 9c2e6c0: String encoding and decoding support without memory management (#506) (sppalkia)
- c11aeaa: Add function decorator to make Python Weld codegen easier (#504) (sppalkia)
- 475ff75: Corect weld-capi build.rs to support not only C++ but also C (#505) (kaz7)
- 286a13d: Add element access and masking support to Series (#502) (sppalkia)
- 6e337f7: Add GrizzlySeries + Scalar operations (#501) (sppalkia)
- 5d8aa6a: Basic Series BinOp functionality (#499) (sppalkia)
- e443e25: Update dependencies and fix lint/format issues (#500) (sppalkia)
🔒Security observations
The Weld project demonstrates moderate security posture with some concerns. Primary issues include outdated CI/CD infrastructure (Travis CI), lack of visible dependency management practices, and potential security gaps in Python and C++ bindings. The project lacks security documentation and vulnerability disclosure guidelines. No hardcoded credentials or obvious injection vulnerabilities were detected in the provided file structure. Recommendations focus on modernizing infrastructure, implementing regular dependency auditing, adding security documentation, and ensuring secure coding practices across language bindings.
- Medium · Travis CI Configuration Exposed (.travis.yml, .travis/ directory). The .travis.yml file and .travis directory contain CI/CD configuration scripts. If the repository is public, these files could expose build processes, environment setup details, and potentially sensitive build parameters. Fix: Review CI/CD configuration for sensitive data exposure. Use encrypted environment variables in Travis CI for any secrets. Avoid storing credentials or API keys in version control.
- Medium · Outdated Dependency Management (Cargo.toml, dependency management). The Cargo.toml shows a workspace structure but there is no visible lock file in the provided file structure. Additionally, as an analytics runtime project, dependencies may not be regularly audited for known vulnerabilities. The project appears to be older (references to Travis CI which is deprecated). Fix: Maintain a Cargo.lock file in version control. Run 'cargo audit' regularly to identify known vulnerabilities. Use GitHub Actions or modern CI/CD instead of Travis CI. Set up automated dependency updates with Dependabot or similar tools.
- Low · Python Bindings Security Considerations (python/grizzly/, examples/python/). The project includes Python bindings and examples that interface with compiled Rust code. Python code in examples/python and python/grizzly could potentially be vulnerable to common Python security issues (e.g., pickle deserialization, unsafe eval, SQL injection if SQL queries are constructed dynamically). Fix: Audit Python code for unsafe operations (eval, exec, pickle with untrusted data). Implement input validation for all external data. Use parameterized queries if SQL is involved. Follow OWASP guidelines for Python web applications.
- Low · C++ Examples May Contain Unsafe Code (examples/cpp/). The examples/cpp directory contains C++ code that interfaces with Weld. C++ code can be vulnerable to buffer overflows, memory leaks, and unsafe pointer operations if not carefully written. Fix: Apply strict compiler flags (-Wall, -Wextra, -Werror). Use memory safety tools like AddressSanitizer during testing. Perform code reviews focusing on memory management. Consider using smart pointers and RAII patterns.
- Low · Missing Security Documentation (CONTRIBUTING.md, docs/). The CONTRIBUTING.md and docs do not appear to include security guidelines, vulnerability disclosure procedures, or security best practices for contributors. Fix: Create a SECURITY.md file with vulnerability disclosure process. Add security guidelines to CONTRIBUTING.md. Document security considerations in API documentation.
- Low · Potential Serialization Vulnerabilities (docs/serialization.md, examples/cpp/serialization/). The project includes serialization examples and functionality (docs/serialization.md, serialization examples). Deserialization of untrusted data can lead to code execution vulnerabilities. Fix: Document serialization security considerations. Validate all deserialized data before use. Use conservative serialization formats. Never deserialize untrusted data without validation.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.