rapidsai/cudf
cuDF - GPU DataFrame Library
Healthy across the board
- Depend on it: permissive license, no critical CVEs, actively maintained — safe to depend on.
- Fork it: has a license, tests, and CI — a clean foundation to fork and modify.
- Read it: documented and popular — a useful reference codebase to read through.
- Run it: no critical CVEs and a sane security posture — runnable as-is.
- ✓ Last commit today
- ✓ 24+ active contributors
- ✓ Distributed ownership (top contributor 12% of recent commits)
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ✓ Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
Paste at the top of your README.md — the badge renders inline like a shields.io badge and links to https://repopilot.app/r/rapidsai/cudf.
Social card (1200×630): this card auto-renders when someone shares https://repopilot.app/r/rapidsai/cudf on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: rapidsai/cudf
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in the "Verify before trusting" section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/rapidsai/cudf shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit today
- 24+ active contributors
- Distributed ownership (top contributor 12% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live rapidsai/cudf
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/rapidsai/cudf.
What it runs against: a local clone of rapidsai/cudf — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in rapidsai/cudf | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of rapidsai/cudf. If you don't
# have one yet, run these first:
#
# git clone https://github.com/rapidsai/cudf.git
# cd cudf
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of rapidsai/cudf and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "rapidsai/cudf(\.git)?\b" \
  && ok "origin remote is rapidsai/cudf" \
  || miss "origin remote is not rapidsai/cudf (artifact may be from a fork)"
# 2. License matches what RepoPilot saw. The stock Apache-2.0 LICENSE text
#    reads "Apache License ... Version 2.0", not the SPDX id, so match that.
(grep -qi "Apache License" LICENSE 2>/dev/null && grep -q "Version 2\.0" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
for f in README.md VERSION build.sh ci/build_cpp.sh ci/build_python.sh; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/rapidsai/cudf"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
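For agents driving this from Python rather than a shell, a minimal wrapper might look like the sketch below — the verify.sh filename is illustrative (save the script above under any name):

```python
# Hypothetical agent-side gate: run the verification script and refuse to
# edit code if it reports stale claims. Paths and commands are illustrative.
import subprocess
import sys

result = subprocess.run(["bash", "verify.sh"], capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    # Artifact is stale: stop and ask the user to regenerate it at
    # https://repopilot.app/r/rapidsai/cudf before making any code edits.
    sys.exit("artifact stale — regenerate before editing")
```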
⚡TL;DR
cuDF is a GPU-accelerated DataFrame library that mirrors the pandas API but executes on NVIDIA GPUs via CUDA, enabling 10-100x faster tabular data processing. It's composed of libcudf (the C++/CUDA core), pylibcudf (Cython bindings), and a Python API layer, plus companion projects like cudf-polars (Polars GPU backend) and dask-cudf (Dask GPU support). The monorepo is structured as: cpp/src/ (core C++/CUDA algorithms), python/cudf/ (pandas-like Python API), python/pylibcudf/ (Cython layer), java/src/ (Java bindings), plus python/cudf_polars/ and python/dask_cudf/. Build via CMake (cpp/) and setup.py (python/). CI is orchestrated through .github/workflows/, with agents/ defining reusable skills (build-test-cudf, review-cudf).
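As a taste of the API the TL;DR describes, a minimal sketch (assumes an NVIDIA GPU, a matching CUDA toolkit, and the cudf package installed):

```python
import cudf

# Construct a GPU-resident DataFrame with the same call shape as pandas.
df = cudf.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# groupby/aggregate runs as CUDA kernels on the device, not on the CPU.
print(df.groupby("key")["val"].mean())
```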
👥Who it's for
Data scientists and ML engineers who process large tabular datasets and want to use familiar pandas syntax without rewriting code, plus distributed computing users (Spark RAPIDS, Dask) who need GPU acceleration; also Java developers via cudf-java bindings.
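For the pandas users in that audience, the documented cudf.pandas accelerator mode avoids any rewrite at all — install the hook before importing pandas (or run a script via `python -m cudf.pandas script.py`):

```python
import cudf.pandas
cudf.pandas.install()  # must run before pandas is imported

import pandas as pd  # pandas calls now run on the GPU where supported,
                     # falling back to CPU pandas otherwise

df = pd.DataFrame({"x": range(10)})
print(df["x"].sum())
```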
🌱Maturity & risk
Actively maintained and production-ready. The repo shows 14M+ lines of C++, 10M+ Python, comprehensive CI/CD via GitHub Actions (.github/workflows/ has build.yaml, test.yaml, pandas-tests.yaml), organized .devcontainer setups for CUDA 12.9 and 13.1, and a CHANGELOG tracking releases. RAPIDS is widely adopted (Spark RAPIDS, Velox, Sirius all depend on it).
Multi-language codebase (C++, Python, Cython, CUDA, Java) increases maintenance surface; GPU CUDA version constraints (must match system CUDA 12 or 13) create deployment friction. Java bindings and polars/dask integrations add external dependency risk. However, RAPIDS governance and broad industry adoption (Spark, Velox, DuckDB) mitigate single-maintainer risk.
Active areas of work
Active development across multiple fronts: CUDA 12.9 and 13.1 support in .devcontainer configs, pandas compatibility testing via pandas-tests.yaml workflow, Spark RAPIDS JNI integration (spark-rapids-jni.yaml), and CodeRabbit automation via .coderabbit.yaml. PR/issue automation configured in .github/ops-bot.yaml and labeler.yml.
🚀Get running
git clone https://github.com/rapidsai/cudf.git
cd cudf
./build.sh # Invokes cmake and builds libcudf + Python bindings
# Or for Python only: pip install -e python/cudf/  (prebuilt packages: conda install -c rapidsai cudf)
Requires CUDA 12.x or 13.1 installed; see .devcontainer/cuda13.1-conda/devcontainer.json for containerized setup.
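A quick post-install smoke test (assumes a visible NVIDIA GPU):

```python
import cudf

print(cudf.__version__)
# A trivial reduction that exercises the GPU path end to end.
print(cudf.Series([1, 2, 3]).sum())  # 6
```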
Daily commands:
For development: ./build.sh builds everything. For Python iteration: cd python/cudf && pip install -e . — see the pytest sketch below for running a focused test subset. For CI tests: GitHub Actions workflows in .github/workflows/ (build.yaml and test.yaml run on PRs). Local CUDA tests require a GPU; see ci/build_cpp.sh for the C++ test invocation.
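A sketch of a focused local test run through pytest's API — the module path and -k filter are illustrative; substitute the area you're changing:

```python
import pytest

# Equivalent to: pytest python/cudf/cudf/tests/test_dataframe.py -k groupby -x
pytest.main(["python/cudf/cudf/tests/test_dataframe.py", "-k", "groupby", "-x"])
```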
🗺️Map of the codebase
- README.md — Defines cuDF's scope across libcudf (C++), pylibcudf (Cython), cudf (pandas-like API), and cudf-polars; essential for understanding the multi-layer architecture.
- VERSION — Single source of truth for versioning across all cuDF components (C++, Python, wheels); critical for release pipelines.
- build.sh — Top-level build orchestrator for the entire project; the entry point for contributors building locally.
- ci/build_cpp.sh — Builds the libcudf C++ core; defines compilation flags, CUDA dependencies, and artifact locations for downstream Python bindings.
- ci/build_python.sh — Orchestrates Python package builds (cudf, pylibcudf, dask_cudf); bridges C++ artifacts to Python distribution.
- .github/workflows/build.yaml — CI/CD pipeline configuration; defines the matrix of CUDA versions, compilers, and platforms tested on every push.
- .github/CODEOWNERS — Maps files to maintainers; critical for code-review routing and ownership clarity across layers.
🛠️How to make changes
Add a new libcudf C++ function
- Implement the function in the libcudf source tree, typically cpp/src/*/some_algorithm.cu or a .hpp header (ci/cpp_linters.sh will check style on commit).
- Add unit tests in cpp/tests using the GoogleTest framework; tests are discovered via ci/discover_libcudf_tests.sh.
- Run ci/build_cpp.sh to compile and validate the new function.
- If exposing it via Python, add a Cython binding in python/pylibcudf/_lib and expose it in the public API (ci/build_wheel_pylibcudf.sh builds the wheel).
Add a new Python DataFrame method (cudf.DataFrame.my_method)
- Implement the method in the python/cudf source tree; the API intentionally mirrors pandas (ci/build_wheel_cudf.sh packages it).
- Add integration tests in python/cudf/tests following the test naming convention (ci/test_python_cudf.sh runs them) — see the test sketch after this list.
- Run ci/run_cudf_pytests.sh to validate locally before pushing.
- Ensure pandas compatibility if the method participates in cudf.pandas; this is gated via ci/run_cudf_pandas_pytest_benchmarks.sh and the .github/workflows/pandas-tests.yaml workflow.
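A hedged sketch of what such an integration test typically looks like — my_method is hypothetical (head stands in for it here), and the assert_eq import location can vary by cudf version:

```python
import cudf
import pandas as pd
from cudf.testing import assert_eq


def test_dataframe_my_method():
    # Build the same frame on CPU (pandas) and GPU (cudf) ...
    pdf = pd.DataFrame({"a": [1, 2, 3]})
    gdf = cudf.DataFrame.from_pandas(pdf)
    # ... then assert the new method matches pandas semantics.
    # Replace .head(2) with the method under test.
    assert_eq(pdf.head(2), gdf.head(2))
```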
Add a new CI/CD test job
- Create a test script in the ci/ folder (e.g., ci/run_my_pytests.sh) that executes the test command (see ci/run_cudf_pytests.sh for the structure).
- Add a GitHub Actions workflow YAML in .github/workflows/ that invokes the script in a matrix (see .github/workflows/test.yaml for the structure).
- Configure the job to depend on build.yaml completing and to report results back to the PR (.github/workflows/status.yaml handles result aggregation).
Update cuDF version for release
- Edit the VERSION file with the new semantic version.
- Run ci/release/update-version.sh to propagate the version to setup.py, pyproject.toml, and CMakeLists.txt.
- Update CHANGELOG.md with user-facing release notes organized by feature/fix/breaking change.
- Create a PR and merge; GitHub Actions builds the wheels and tags the release (.github/release.yml holds the auto-release config).
🔧Why these technologies
- CUDA C++ — Enables GPU-accelerated columnar operations on NVIDIA GPUs; core competitive advantage over pandas CPU performance.
- Apache Arrow — Standard in-memory columnar format; enables zero-copy data exchange with Polars, DuckDB, and other Arrow-compliant libraries.
- CMake + NVIDIA CUDA Toolkit — Industry-standard for CUDA project builds; handles GPU architecture detection, code generation, and cross-platform compilation.
- Cython (pylibcudf) — Bridges C++ and Python with minimal overhead; allows typed Python API to call libcudf directly without intermediate conversion.
- pytest + GoogleTest — Standard test frameworks for Python and C++; enable parallel test execution and rich assertions for catching regressions.
- GitHub Actions matrix workflows — Tests across CUDA 12.9, 13.1, multiple compilers, and platforms; catches environment-specific bugs early.
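A sketch of the Arrow interchange the second bullet describes (ingest is a host→device copy; the columnar layout is preserved end to end):

```python
import pyarrow as pa
import cudf

table = pa.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

gdf = cudf.DataFrame.from_arrow(table)  # host -> device
roundtrip = gdf.to_arrow()              # device -> host, Arrow layout intact

assert roundtrip.schema == table.schema
```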
⚖️Trade-offs already made
- Multi-layer design: libcudf → pylibcudf → cudf
- Why: separates the reusable C++/CUDA core from Python ergonomics; the Cython layer keeps the C++↔Python bridge low-overhead (see "Why these technologies").
- Consequence: new features must be plumbed through all three layers, and pylibcudf changes require a full rebuild rather than an editable pip install (see "Traps & gotchas").
🪤Traps & gotchas
1. CUDA version matching: the system CUDA must match the linked version; pip packages carry cu12/cu13 suffixes, and mixing them breaks silently.
2. Device memory: GPU operations require sufficient device RAM, with no automatic spilling to host (unlike pandas); profiling with nsys/nvprof is essential (see the RMM sketch after this list).
3. Cython rebuild: changes to pylibcudf .pyx files require a full rebuild (./build.sh), not just pip install -e .; pip wheel caching can hide stale builds.
4. Arrow schema compatibility: cuDF enforces the Apache Arrow type system; some pandas dtypes (category, string) map non-trivially — check the dtype casting in python/cudf/core/dtypes.py.
5. Pre-commit hooks: .pre-commit-config.yaml enforces clang-format on C++ and black on Python; CI fails if they're violated.
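For trap 2, a common mitigation is pre-reserving an RMM memory pool so allocations come from one reserved chunk instead of repeated raw cudaMalloc calls; the 2 GiB size below is illustrative:

```python
import rmm
import cudf

# Reserve a 2 GiB device-memory pool before creating any cuDF objects.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2 << 30)

df = cudf.DataFrame({"x": range(1_000_000)})  # allocated from the pool
print(df["x"].sum())
```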
🏗️Architecture
Layered: libcudf (C++/CUDA core) → pylibcudf (Cython bindings) → cudf (pandas-like Python API), with cudf-polars and dask-cudf built on top. See the TL;DR and "Trade-offs already made" for how the layers interact.
💡Concepts to learn
- Apache Arrow Columnar Format — libcudf's entire data model is Arrow-compliant; understanding Arrow schemas, buffers, and null bitmaps is essential for debugging dtype issues and understanding memory layout.
- CUDA Memory Coalescing & Warp-Level Primitives — cuDF kernels are hand-tuned for GPU SM efficiency; understanding coalesced memory access and warp shuffles explains performance cliffs and informs optimization PRs.
- Cython Type System & C++ Interop — pylibcudf bridges Python → libcudf via Cython; mastering .pyx syntax and C++ extern declarations is critical for adding new GPU operations to the Python API.
- Device Memory Management & Memory Pools — cuDF implements RMM (RAPIDS Memory Manager) for GPU allocation; understanding pinned host memory, device pools, and async copying prevents memory leaks and improves throughput.
- Kernel Fusion & Expression Templates — cuDF fuses multiple operations (e.g., filter + select) into single kernels to avoid intermediate GPU→RAM→GPU transfers; this drives much of the speedup over pandas.
- Lazy Evaluation & Query Graphs — cudf.pandas uses expression trees to defer execution; understanding deferred vs. eager evaluation modes is key to debugging performance bottlenecks.
- Zero-Copy Data Sharing & DLPack — cuDF integrates with Polars, PyTorch, etc. via DLPack tensors for zero-copy interchange; critical for building integrated data pipelines without GPU↔CPU serialization.
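A sketch of the DLPack hand-off described in the last bullet (assumes a CUDA build of PyTorch; to_dlpack keeps the data on the device):

```python
import cudf
import torch

s = cudf.Series([1.0, 2.0, 3.0])

# Export as a DLPack capsule and import into PyTorch — no host round-trip.
t = torch.from_dlpack(s.to_dlpack())
print(t.device, t.sum())
```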
🔗Related repos
- NVIDIA/spark-rapids — Production GPU accelerator for Apache Spark built atop cuDF; the canonical example of cuDF as an embedded engine.
- dask/dask-cudf — Distributed DataFrame scheduler using cuDF partitions; sister project providing multi-GPU/multi-node scaling.
- rapidsai/raft — Companion RAPIDS library for GPU-accelerated ML primitives; frequently fused with cuDF for end-to-end data science.
- facebookincubator/velox — Meta's GPU-agnostic expression engine, which integrates cuDF as an execution backend (velox/experimental/cudf/).
- pandas-dev/pandas — Upstream API reference; cuDF intentionally mirrors pandas behavior, with significant test cross-validation in the pandas-tests.yaml workflow.
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive CI workflow for cudf_polars IR validation
The repo contains ci/check_cudf_polars_ir.py script but there's no dedicated GitHub Actions workflow to run it automatically on PRs. This prevents early detection of polars IR compatibility regressions. Adding a workflow would ensure the cudf_polars component maintains compatibility as the codebase evolves.
- [ ] Create a .github/workflows/cudf-polars-ir-check.yaml workflow file
- [ ] Configure it to run ci/check_cudf_polars_ir.py on PR commits affecting cudf_polars code
- [ ] Add status checks to .github/CODEOWNERS for cudf_polars maintainers
- [ ] Document the new workflow in CONTRIBUTING.md
Implement style checking for Cython/Python files in pre-commit hooks
The repo has .pre-commit-config.yaml and ci/check_style.sh but the pre-commit configuration appears minimal. Given that cuDF has substantial Cython code (pylibcudf) and Python bindings, adding targeted pre-commit hooks for Python linting (black, isort, flake8) would catch style issues before submission and reduce CI burden.
- [ ] Audit the current .pre-commit-config.yaml to identify missing Python/Cython formatters
- [ ] Add black, isort, and flake8 hooks configured to match project standards from ci/check_style.sh
- [ ] Create .pre-commit-config-strict.yaml for optional stricter checks
- [ ] Document setup in .devcontainer/README.md for developers using dev containers
Add unit tests for build script validation and documentation
The repo contains multiple build scripts (build.sh, ci/build_cpp.sh, ci/build_python.sh, ci/build_wheel.sh variants) but no corresponding test suite to validate script correctness, argument parsing, or error handling. This is risky for a complex multi-component build system where script failures impact all contributors.
- [ ] Create ci/tests/test_build_scripts.sh to validate build-script syntax and argument handling
- [ ] Add tests verifying that build scripts fail gracefully with invalid arguments
- [ ] Document build-script behavior, environment variables, and options in ci/README.md
- [ ] Add a GitHub Actions workflow .github/workflows/build-script-validation.yaml to run validation on commits touching build files
🌿Good first issues
- Add missing docstrings to cpp/include/cudf/ header files and regenerate Sphinx docs (cpp/docs/); many classes lack user-facing API documentation.
- Expand python/cudf/tests/ coverage for edge cases in categorical dtype operations (string categories, null handling) which have gaps compared to pandas test suite.
- Create integration test between cudf-polars (python/cudf_polars/) and latest Polars version; currently only spot-checked; add pytest fixture to .github/workflows/test.yaml.
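For the categorical-coverage idea above, a hedged sketch of one such edge-case test (exact assertions and helper imports vary with the existing suite):

```python
import cudf
import pandas as pd


def test_categorical_string_categories_with_nulls():
    data = ["a", None, "b", "a"]
    pdf = pd.Series(data, dtype="category")
    gsr = cudf.Series(data, dtype="category")
    # Null handling and the category set should match pandas.
    assert int(gsr.isnull().sum()) == int(pdf.isnull().sum())
    assert list(gsr.cat.categories.to_pandas()) == list(pdf.cat.categories)
```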
⭐Top contributors
- @madsbk — 12 commits
- @mhaseeb123 — 11 commits
- @TomAugspurger — 9 commits
- @davidwendt — 8 commits
- @vuule — 8 commits
📝Recent commits
- 4534447 — fix(ci): resolve all zizmor findings and add zizmor pre-commit checks (#22343) (gforsyth)
- 9f45b1b — [JAVA] Fix ColumnWriterOptions parquet field id placement on outer list/binary/map (#22422) (res-life)
- 0a1620e — Fix MERGE_M2 for extreme finite partial means (#22393) (wjxiz1992)
- 3302376 — Handle sign-extension while decoding Parquet decimal stats (#22402) (pramodsatya)
- 7d84936 — Fallback to async-mr for the multithreaded parquet example (#22245) (mhaseeb123)
- feadb0e — Move table_device_view function definitions from .cuh to .cu (#22354) (davidwendt)
- 64a3109 — Fix pdsh script dropping records (#22412) (galipremsagar)
- 16c6356 — Run the cudf-polars test suite against DaskEngine and RayEngine (#22381) (madsbk)
- c9ad1c5 — Refactor sort_actor to prepare for OrderScheme changes (#22350) (rjzamora)
- 8d76fc2 — Python bindings and pytests for cudf::apply_deletion_mask (#22145) (mhaseeb123)
🔒Security observations
The cuDF repository demonstrates a mature project structure with established security practices including GitHub issue templates, CODEOWNERS configuration, and multiple development environments. However, several medium-severity concerns were identified: CI/CD shell scripts lack visible input validation, Docker configurations need security hardening review, and dependency management would benefit from lock file implementation. The project should prioritize implementing automated security scanning in pre-commit hooks and CI/CD workflows, conducting a comprehensive review of shell scripts for injection vulnerabilities, and establishing strict dependency pinning and version management practices. No critical vulnerabilities were identified from the static file structure analysis, but runtime behavior and actual configuration content require deeper review.
- Medium · Insufficient input validation in CI/CD scripts — ci/*.sh (build scripts and test runners). Multiple shell scripts in the ci/ directory execute commands that may process user input or external data without sufficient validation. Scripts like run_cudf_pytests.sh, run_cudf_ctests.sh, and others could be vulnerable to command injection if they process untrusted input from environment variables or external sources. Fix: implement strict input validation and sanitization in all shell scripts, use shellcheck for static analysis, avoid eval or command substitution with untrusted input, and quote all variable expansions.
- Medium · Docker image configuration exposure — .devcontainer/Dockerfile, .devcontainer/cuda*/devcontainer.json. Multiple devcontainer Dockerfiles and configuration files exist without visible security-baseline hardening; devcontainer configurations may expose sensitive build context or runtime environment details during development. Fix: review Dockerfiles for security best practices — use specific base-image versions (not latest), minimize layers, remove unnecessary packages, run as a non-root user, and scan images with tools like Trivy or Snyk.
- Medium · Potential dependency version-pinning issues — .devcontainer/cuda*/devcontainer.json, ci/build_*.sh, VERSION. Build scripts reference CUDA versions (12.9, 13.1) and conda/pip dependencies without visible lock files, which could lead to inconsistent builds and installation of vulnerable dependency versions. Fix: implement and maintain lock files (conda-lock.yml, requirements.lock.txt, poetry.lock), pin transitive dependencies to specific versions, and scan regularly with pip-audit, safety, or Dependabot.
- Low · CODEOWNERS and access-control review needed — .github/CODEOWNERS. A CODEOWNERS file exists for review routing, but specific permissions and reviewer requirements are not visible from the file structure alone. Fix: audit CODEOWNERS for appropriate review requirements, require multiple reviewers on security-sensitive paths, and enforce branch protection rules requiring status checks.
- Low · Pre-commit configuration should include security checks — .pre-commit-config.yaml. The file exists but its content was not analyzed; pre-commit hooks should include security scanning to catch issues before commit. Fix: add detect-secrets (credential detection), bandit (Python security), shellcheck (shell scripts), and hadolint (Dockerfile linting), and ensure all hooks run on pull requests.
- Low · Workflow configuration review required — .github/workflows/*.yaml. Multiple GitHub Actions workflows cover build, test, and release; without reviewing their content, issues like exposed secrets, overprivileged tokens, or insecure dependency updates cannot be assessed. Fix: verify that GITHUB_TOKEN uses minimal required permissions, secrets are handled as encrypted secrets, actions come from trusted and pinned sources, and artifacts are signed and verified before release.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.