
kedro-org/kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

Healthy

Healthy across the board

Use as dependency — Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit today
  • 27+ active contributors
  • Distributed ownership (top contributor 13% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • Scorecard: dangerous CI workflow (0/10)

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests + OpenSSF Scorecard

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — it updates automatically from the latest cached analysis.

Variant:
RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/kedro-org/kedro)](https://repopilot.app/r/kedro-org/kedro)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/kedro-org/kedro on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: kedro-org/kedro

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/kedro-org/kedro shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit today
  • 27+ active contributors
  • Distributed ownership (top contributor 13% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • ⚠ Scorecard: dangerous CI workflow (0/10)

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests + OpenSSF Scorecard</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live kedro-org/kedro repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/kedro-org/kedro.

What it runs against: a local clone of kedro-org/kedro — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in kedro-org/kedro | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches a relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>kedro-org/kedro</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of kedro-org/kedro. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/kedro-org/kedro.git
#   cd kedro
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of kedro-org/kedro and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "kedro-org/kedro(\.git)?\b" \
  && ok "origin remote is kedro-org/kedro" \
  || miss "origin remote is not kedro-org/kedro (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. Apache LICENSE files start with
# "Apache License", not the SPDX identifier, so match on that; fall back to
# the project metadata in pyproject.toml.
(grep -qiE "Apache License" LICENSE 2>/dev/null \
   || grep -qiE "apache" pyproject.toml 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in \
  "kedro/framework/context.py" \
  "kedro/pipeline/pipeline.py" \
  "kedro/io/data_catalog.py" \
  "kedro/framework/hooks/hooks.py" \
  "kedro/runner/runner.py"
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/kedro-org/kedro"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Kedro is a Python framework for building production-ready data engineering and data science pipelines with software engineering best practices baked in. It provides a structured project layout, modular pipeline composition, data catalog management, and CLI tooling to make data workflows reproducible, maintainable, and shareable. Core capability: define data transformations as reusable nodes connected in dependency graphs, with automatic dependency resolution and execution orchestration. Monorepo structure: core Kedro engine in kedro/ (pipeline, runners, data catalog), CLI scaffolding in kedro/cli/, with separate kedro-datasets package for extensible I/O adapters. .github/workflows/ orchestrates CI matrix testing across Python versions. Gherkin specs in .github/ and throughout codebase enable BDD-style testing. Starter templates managed in separate release-starters.yml workflow.
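The "automatic dependency resolution" capability above can be illustrated with a standalone sketch (not Kedro's actual implementation): nodes declare named inputs and outputs, and a valid execution order falls out of the resulting dependency graph. Node and dataset names here are hypothetical.

```python
# Illustrative sketch of DAG-based dependency resolution, as described in
# the TL;DR. Uses only the standard library (graphlib, Python 3.9+).
from graphlib import TopologicalSorter

# Hypothetical nodes: name -> (inputs, outputs)
nodes = {
    "split":       (["raw"],           ["train", "test"]),
    "train_model": (["train"],         ["model"]),
    "evaluate":    (["model", "test"], ["metrics"]),
}

# A node depends on whichever node produces one of its inputs.
producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
graph = {
    name: {producer[i] for i in inputs if i in producer}
    for name, (inputs, _) in nodes.items()
}

# Topological sort yields an execution order that respects all dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['split', 'train_model', 'evaluate']
```

In Kedro itself the same idea is expressed through `node(func, inputs, outputs)` declarations composed into a `Pipeline`, with the catalog supplying the named datasets.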

👥Who it's for

Data engineers and data scientists who need to move beyond Jupyter notebooks toward production systems. Specifically: teams building ETL/ELT pipelines, ML feature engineering workflows, or analytics that require versioning, testing, and deployment automation. Contributors include MLOps engineers extending Kedro with custom runners and dataset adapters.

🌱Maturity & risk

Production-ready and actively maintained. The project shows strong engineering practices: comprehensive CI/CD via .github/workflows/ (unit-tests.yml, e2e-tests.yml, nightly-build.yml), OpenSSF Best Practices badge, supports Python 3.10–3.14, hosted by LF AI & Data Foundation. Large codebase (1.5M+ lines of Python) with established versioning and release processes visible in .github/workflows/check-release.yml and release-starters.yml.

Low risk for a production framework. Dependency surface is managed (uses kedro-datasets as modular ecosystem), has substantial test coverage (Gherkin BDD tests + unit tests), and maintains active CI on main/develop branches. Main risk: rapid feature addition (visible in multiple workflows) can occasionally introduce breaking changes—mitigated by semantic versioning in PyPI and clear migration guides. Single-org governance (kedro-org) is stable given LF AI backing.

Active areas of work

Active development on multiple fronts: .agents/skills/review-kedro-pr/ indicates automated PR review tooling being added; nightly builds and performance benchmarking workflows show continuous integration maturity; e2e-tests.yml and pipeline-performance-test.yml suggest focus on reliability at scale. Dependabot auto-updates configured, and linting/documentation checks are comprehensive (docs-language-linter.yml, docs-linkcheck.yml).

🚀Get running

git clone https://github.com/kedro-org/kedro.git
cd kedro
pip install -e .

Or via package managers: uv pip install kedro or conda install -c conda-forge kedro. For development: install with [dev] extras to get test dependencies and Makefile commands.

Daily commands: Kedro is a framework library, not a web app. To use it: kedro new <project-name> scaffolds a project, then kedro run executes its pipelines. For development: make test runs the test suite (from the Makefile), make docs builds the documentation, and pytest tests/ runs unit tests. Run specific pipeline nodes via kedro run --pipeline <name> --node <node-name>.

🗺️Map of the codebase

  • kedro/framework/context.py — Core Context class that manages the Kedro project lifecycle, session setup, and pipeline execution—the foundation every contributor must understand
  • kedro/pipeline/pipeline.py — Pipeline abstraction that defines how nodes are composed, executed, and validated—essential for understanding DAG-based orchestration
  • kedro/io/data_catalog.py — DataCatalog manages all dataset I/O operations and versioning—critical for data provenance and artifact management
  • kedro/framework/hooks/hooks.py — Hook system for extending Kedro's lifecycle events—key extension point for plugins and custom behavior
  • kedro/runner/runner.py — Runner abstract class defining execution strategies (sequential, parallel, etc.)—fundamental to pipeline execution
  • kedro/config/config_loader.py — Configuration loading and validation—responsible for environment setup and parameterization
  • .github/workflows/unit-tests.yml — Primary CI/CD workflow validating all PRs—defines required test coverage and quality gates

🧩Components & responsibilities

  • Context — Owns project state, configuration, catalog, and pipeline execution lifecycle

🛠️How to make changes

Add a Custom Hook for Pipeline Lifecycle Events

  1. Create a hook class inheriting from KedroHook in your plugin module (kedro/framework/hooks/hooks.py)
  2. Implement desired hook methods (e.g., before_pipeline_run, after_node_run) (kedro/framework/hooks/hooks.py)
  3. Register your hook class in the Hook Manager via settings.hooks or auto-discovery (kedro/framework/hooks/manager.py)
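The three hook steps above can be sketched as follows. This is a standalone illustration: `hook_impl` is stubbed as a pass-through so the sketch runs without a Kedro install (in a real project you would import it from `kedro.framework.hooks`), and the real hook specs take more arguments (node, catalog, session id, etc.) than the simplified signatures shown here.

```python
import time

def hook_impl(func):  # stand-in for kedro.framework.hooks.hook_impl
    return func

class TimingHooks:
    """Hypothetical hook that records how long each node takes."""

    def __init__(self):
        self.durations = {}
        self._start = None

    @hook_impl
    def before_node_run(self, node_name):
        # Real spec: before_node_run(self, node, catalog, inputs, ...)
        self._start = time.perf_counter()

    @hook_impl
    def after_node_run(self, node_name):
        # Real spec: after_node_run(self, node, catalog, outputs, ...)
        self.durations[node_name] = time.perf_counter() - self._start

# Step 3 — in a real Kedro project's settings.py (not runnable here):
# HOOKS = (TimingHooks(),)

hooks = TimingHooks()
hooks.before_node_run("clean_node")
hooks.after_node_run("clean_node")
print("clean_node" in hooks.durations)
```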

Add a New Dataset Type

  1. Create a new class inheriting from AbstractDataset or AbstractVersionedDataset (kedro/io/abstract_dataset.py)
  2. Implement _load() and _save() methods for your data format (kedro/io/abstract_dataset.py)
  3. Define dataset configuration schema (name, type, and parameters) (kedro/io/data_catalog.py)
  4. Register dataset in DataCatalog via catalog YAML or programmatic API (kedro/io/data_catalog.py)
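The dataset steps above might look like the sketch below. The base class is a minimal stand-in so the example runs without Kedro; in a real project you would subclass `kedro.io.AbstractDataset` (which also expects a `_describe()` for repr and error messages). The JSON-lines format and class name are illustrative, not from the repo.

```python
import json
import os
import tempfile
from pathlib import Path

class AbstractDataset:  # stand-in for kedro.io.AbstractDataset
    def load(self):
        return self._load()

    def save(self, data):
        self._save(data)

class JSONLinesDataset(AbstractDataset):
    """Hypothetical dataset: one JSON object per line."""

    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def _load(self):
        with self._filepath.open() as f:
            return [json.loads(line) for line in f]

    def _save(self, data):
        with self._filepath.open("w") as f:
            for row in data:
                f.write(json.dumps(row) + "\n")

    def _describe(self):
        return {"filepath": str(self._filepath)}

path = os.path.join(tempfile.mkdtemp(), "rows.jsonl")
ds = JSONLinesDataset(path)
ds.save([{"a": 1}, {"a": 2}])
print(ds.load())  # [{'a': 1}, {'a': 2}]
```

Once packaged, such a dataset would be referenced from the catalog YAML via its dotted path (e.g. `type: my_pkg.JSONLinesDataset`, a hypothetical name).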

Add a Custom Runner for Alternative Execution Strategy

  1. Create a runner class inheriting from AbstractRunner (kedro/runner/runner.py)
  2. Implement _run() method to define execution logic (sequential, parallel, distributed, etc.) (kedro/runner/runner.py)
  3. Register your runner in the CLI or context configuration (kedro/framework/context.py)
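The runner steps above can be sketched as a trivial sequential strategy. Again this is standalone: the real `kedro.runner.AbstractRunner._run()` receives `Pipeline`, `DataCatalog`, and hook-manager objects, which are simplified here to plain lists and dicts so the sketch runs without Kedro. Node and dataset names are hypothetical.

```python
class AbstractRunner:  # stand-in for kedro.runner.AbstractRunner
    def run(self, nodes, catalog):
        return self._run(nodes, catalog)

class LoggingSequentialRunner(AbstractRunner):
    """Hypothetical runner: executes nodes in order, recording what ran."""

    def __init__(self):
        self.executed = []

    def _run(self, nodes, catalog):
        # nodes: list of (name, func, input_key, output_key); catalog: dict
        for name, func, inp, out in nodes:
            catalog[out] = func(catalog[inp])
            self.executed.append(name)
        return catalog

runner = LoggingSequentialRunner()
result = runner.run(
    [("double", lambda xs: [x * 2 for x in xs], "raw", "doubled")],
    {"raw": [1, 2, 3]},
)
print(result["doubled"], runner.executed)  # [2, 4, 6] ['double']
```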

Add a Custom Configuration Loader

  1. Create a config loader class inheriting from AbstractConfigLoader (kedro/config/config_loader.py)
  2. Implement _load_config() to parse your configuration source (YAML, environment, database, etc.) (kedro/config/config_loader.py)
  3. Set your loader in the Context or register via plugins (kedro/framework/context.py)
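A minimal sketch of the config-loader steps, following the `_load_config()` interface this doc describes (the actual `AbstractConfigLoader` in `kedro.config` differs; `OmegaConfigLoader`, for instance, behaves like a mapping). The environment-variable source and the `KEDRO_PARAM_` prefix are illustrative assumptions.

```python
import os

class AbstractConfigLoader:  # stand-in for kedro.config.AbstractConfigLoader
    def get(self, key):
        return self._load_config().get(key)

class EnvConfigLoader(AbstractConfigLoader):
    """Hypothetical loader: reads parameters from KEDRO_PARAM_* env vars."""

    PREFIX = "KEDRO_PARAM_"

    def _load_config(self):
        # Strip the prefix and lowercase the rest to form parameter names.
        return {
            k[len(self.PREFIX):].lower(): v
            for k, v in os.environ.items()
            if k.startswith(self.PREFIX)
        }

os.environ["KEDRO_PARAM_LEARNING_RATE"] = "0.01"
loader = EnvConfigLoader()
print(loader.get("learning_rate"))  # 0.01
```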

🔧Why these technologies

  • OmegaConf — Powerful YAML/structured config management with environment variable interpolation and schema validation; enables 12-factor app patterns
  • Python dataclasses & type hints — Explicit contract definition, IDE support, and runtime type checking for robustness in data pipelines
  • Click (for CLI) — Lightweight CLI framework with decorator-based command composition; allows extensible subcommand registration
  • pluggy (plugin system) — Industry-standard plugin architecture (used by pytest); enables third-party extensions without modifying core

⚖️Trade-offs already made

  • DAG-based pipeline (not stream/reactive)

    • Why: Batch data science workflows are more common than streaming; DAGs are easier to reason about and debug
    • Consequence: Not suitable for real-time event streaming; requires manual orchestration for incremental runs
  • Context per-project lifetime (not global singleton)

    • Why: Enables multiple isolated projects and cleaner testing; avoids global state pollution
    • Consequence: Slightly more boilerplate in CLI/interactive usage; requires explicit context passing
  • Configuration-driven (YAML catalog/params)

    • Why: Non-technical stakeholders can modify data sources; reduces code duplication and enables environment parity
    • Consequence: Less discoverable at code-read time; requires discipline to keep config in sync with code
  • Hook-based extension over direct inheritance

    • Why: Decouples plugins from framework; allows multiple handlers per event without tight coupling
    • Consequence: Slightly higher complexity for simple extensions; hook execution order not always obvious

🚫Non-goals (don't propose these)

  • Real-time streaming pipelines (target is batch workflows)
  • Distributed computing orchestration (relies on external runners like Airflow, Kubeflow)
  • Machine learning model training frameworks (integrates with MLflow, not a training library)
  • Authentication and access control (assumes trusted execution environment)
  • Data visualization (integrates with BI tools, not a charting library)

🪤Traps & gotchas

  • Config discovery: Kedro expects a conf/ directory in the project root with parameters.yml and catalog.yml; missing or misconfigured paths cause silent failures.
  • Circular dependencies: The DAG validator doesn't always catch indirect cycles until runtime.
  • Dataset versioning: Enabled per-dataset but not globally synchronized — can lead to cache-invalidation surprises.
  • Plugin discovery: Uses entry_points in setup.py; adding custom runners/datasets requires proper package installation, not just PYTHONPATH.
  • Test isolation: Some integration tests assume a clean state; running tests in parallel (pytest -n) can fail flakily without proper fixtures.
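The plugin-discovery trap comes down to entry points: Kedro finds plugins through package metadata, not the import path. A hedged sketch of how a plugin package might advertise itself in pyproject.toml; the entry-point group names follow Kedro's documented plugin convention, while the package and module names are hypothetical:

```toml
# pyproject.toml of a hypothetical Kedro plugin. Installing the package
# (pip install -e .) is what makes these discoverable; PYTHONPATH alone won't.
[project.entry-points."kedro.hooks"]
my_plugin_hooks = "my_kedro_plugin.plugin:hooks"

[project.entry-points."kedro.project_commands"]
my_plugin = "my_kedro_plugin.cli:commands"
```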

🏗️Architecture

💡Concepts to learn

  • Directed Acyclic Graph (DAG) Pipeline — Kedro's entire execution model is DAG-based: nodes are vertices, dependencies are edges. Understanding cycle detection and topological sorting is essential to extend runners or debug deadlocks.
  • Data Catalog Abstraction — Kedro decouples node logic from I/O via a catalog layer; this enables swapping dataset backends (CSV → Parquet → Delta) without changing pipeline code. Critical for portability and testing.
  • Dependency Injection (via node inputs/outputs) — Nodes declare what data they need, Kedro injects it automatically by matching names to catalog entries. Reduces boilerplate and enables declarative parallelization.
  • Plugin Architecture via Entry Points — Kedro discovers runners, datasets, and commands via setuptools entry_points; understanding this enables building custom extensions without modifying core code.
  • Lazy Evaluation & Dataset Versioning — Datasets are loaded on-demand during node execution, and versions are tracked; this enables efficient recomputation and cache invalidation across pipeline runs.
  • Configuration as Code (via YAML) — Kedro separates pipeline structure (Python nodes) from configuration (YAML parameters & catalog). This pattern enables non-engineers to adjust hyperparameters and data paths without touching code.
  • BDD Testing via Gherkin — Kedro uses Gherkin/Behave for behavior-driven tests (visible in test suite); understanding this pattern helps write maintainable specs aligned with business requirements.
  • kedro-org/kedro-datasets — Official plugin repo containing 50+ dataset adapters (pandas, spark, dask, SQL); extends Kedro's I/O capabilities without bloating core
  • AirbyteHQ/airbyte — Alternative ELT platform; Kedro users often integrate Airbyte connectors via custom datasets for data ingestion
  • great-expectations/great_expectations — Data validation framework; commonly used in Kedro pipelines as post-node assertions or custom runners for quality checks
  • mlflow/mlflow — ML experiment tracking; Kedro pipelines often log results to MLflow for reproducibility and model comparison
  • apache/airflow — Orchestration alternative; Kedro pipelines can be wrapped as Airflow DAGs via official kedro-airflow plugin (separate repo)
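The Data Catalog and configuration-as-code concepts above can be made concrete with a small catalog sketch. Dataset names and paths are illustrative, and the `pandas.CSVDataset` / `pandas.ParquetDataset` type strings assume a recent kedro-datasets release (older versions spell them `CSVDataSet`):

```yaml
# Hypothetical conf/base/catalog.yml entries. Swapping CSV for Parquet
# changes only this file, never the pipeline code that names the datasets.
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

model_input:
  type: pandas.ParquetDataset
  filepath: data/05_model_input/model_input.parquet
  versioned: true   # per-dataset versioning, as noted in Traps & gotchas
```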

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive skill documentation and examples for .agents/skills/review-kedro-pr

The .agents/skills/review-kedro-pr directory exists with SKILL.md and reference.md, but lacks concrete examples and implementation details. This skill appears to be a key part of Kedro's AI-assisted review process. A new contributor could enhance the reference.md with detailed examples of PR review patterns, automated checks performed, and integration with GitHub workflows (notably missing from the skill directory but referenced in .github/workflows/all-checks.yml).

  • [ ] Review existing .agents/skills/review-kedro-pr/SKILL.md and reference.md
  • [ ] Create example PR review scenarios in reference.md showing common patterns (linting, test coverage, code quality)
  • [ ] Document the integration points with GitHub Actions workflows (unit-tests.yml, lint.yml, e2e-tests.yml)
  • [ ] Add examples of output from post_review.sh script
  • [ ] Include troubleshooting section for common skill failures

Create missing workflow documentation and add workflow-specific tests

The .github/workflows directory contains 18+ CI/CD workflows but there's no dedicated workflow testing or documentation file. Workflows like benchmark-performance.yml, pipeline-performance-test.yml, and nightly-build.yml have complex logic that could benefit from integration tests. A new contributor could create workflow validation tests and add a .github/workflows/README.md explaining each workflow's purpose, triggers, and maintenance.

  • [ ] Create .github/workflows/README.md documenting each workflow's purpose and trigger conditions
  • [ ] Add workflow schema validation tests in tests/ directory
  • [ ] Create YAML linting rules specific to workflows in .github/styles/Kedro/
  • [ ] Document the merge-gatekeeper.yml and auto-merge-prs.yml approval logic
  • [ ] Add examples of expected outputs for performance-related workflows (benchmark-performance.yml, pipeline-performance-test.yml)

Expand .github/styles linting rules with Kedro-specific patterns

The .github/styles/Kedro directory contains 13 Vale linting rule files but lacks rules for common Kedro-specific patterns and conventions. Contributing rules for Kedro's data pipeline terminology, node naming conventions, and catalog configuration documentation would improve consistency across the codebase and documentation.

  • [ ] Review CONTRIBUTING.md and existing code for Kedro-specific terminology (nodes, pipelines, catalogs, runners)
  • [ ] Create .github/styles/Kedro/kedro-terminology.yml for consistent terminology usage
  • [ ] Add .github/styles/Kedro/catalog-conventions.yml for data catalog documentation patterns
  • [ ] Create .github/styles/Kedro/code-examples.yml for code block formatting in docs
  • [ ] Update .github/styles/Kedro/ignore.txt and ignore-names.txt to include Kedro-specific terms and classes
  • [ ] Add tests validating the new rules don't conflict with existing ones in .github/styles/Kedro/

🌿Good first issues

  • Add type hints to kedro/io/ module: currently inconsistent with mypy coverage; small, localized change with clear acceptance criteria (100% coverage, all existing tests pass).
  • Expand Gherkin BDD specs in .github/ for data catalog edge cases: dataset version collision handling and missing parameter references are undertested; write 2–3 scenario outlines following existing .feature patterns.
  • Document plugin architecture in CONTRIBUTING.md with working example: no concrete example of building a custom Runner subclass; add examples/custom-runner-plugin/ with Sphinx docs linking from main docs site.

Top contributors


📝Recent commits

  • d21d357 — fix for param validation in namespaced pipelines (#5545) (ravi-kumar-pilla)
  • dd32c87 — Part 3: Add CLI command for HTTP Server (#5522) (ankatiyar)
  • d481ebc — Part 2: Add run endpoint to HTTP Server (#5521) (ankatiyar)
  • 49ebdc4 — feat(deps): add optional kedro[pydantic] dependency extra (#5498) (rahulbansod519)
  • eba9332 — Part 1: Scaffolding for HTTP server + health check API (#5520) (ankatiyar)
  • dd9345f — Document stateful hook patterns in common use cases (#5483) (jeevan6996)
  • f263487 — Minimal docs for KedroServiceSession (#5524) (ankatiyar)
  • 485cd4d — Update session related API docs (#5523) (ankatiyar)
  • ebe40af — Add review-kedro-pr skill for cursor and copilot (#5517) (ElenaKhaustova)
  • 3f3b6b5 — Remove chardet from test dependencies (#5519) (merelcht)

🔒Security observations

The Kedro project demonstrates a moderate security posture with several good practices in place (detect-secrets, CODEOWNERS, comprehensive CI/CD workflows, explicit security policy). However, critical gaps exist: the SECURITY.md vulnerability reporting section is incomplete, preventing proper coordinated disclosure. Dependency management lacks explicit version pinning and documented scanning practices. The project should prioritize completing security documentation and implementing stricter dependency management controls. No evidence of SQL injection, XSS, or hardcoded credentials was found in the provided file structure, which is positive.

  • Medium · Incomplete Security Policy Documentation — SECURITY.md. The SECURITY.md file appears to be incomplete (cuts off mid-sentence at 'Reporting a vulnerability: W'). This prevents users and security researchers from understanding the complete vulnerability disclosure process and reporting procedures. Fix: Complete the SECURITY.md file with full vulnerability reporting instructions, including: contact methods (security@example.com or GitHub Security Advisory), expected response timeline, and disclosure guidelines.
  • Low · Unspecified Dependency Versions — Dependencies/Package file content. The dependencies file contains packages with loose version constraints (ipython>=8.10, jupyterlab>=3.0, notebook) that could pull in vulnerable versions. The kedro-datasets dependency uses bracket notation but lacks pinned versions for transitive dependencies. Fix: Implement dependency pinning with upper bounds or use lock files (poetry.lock, requirements.lock). Regularly scan dependencies with 'pip-audit' or 'safety' to detect known vulnerabilities.
  • Low · Missing Code Owners for Security-Critical Paths — CODEOWNERS. While CODEOWNERS file exists, the structure suggests it may not have explicit security-critical path assignments (.github/workflows, security policies, authentication code). Fix: Ensure CODEOWNERS explicitly assigns security-sensitive files to trusted maintainers, particularly: authentication/authorization code, dependency management, CI/CD workflows, and security policies.
  • Low · No Visible Secret Scanning Configuration — .github/workflows/detect-secrets.yml, .secrets.baseline. While .secrets.baseline exists (indicating detect-secrets is used), there's no visible documentation of secret scanning strategy in provided files. The .github/workflows/detect-secrets.yml workflow exists but configuration details are not shown. Fix: Verify detect-secrets is properly configured to catch: API keys, tokens, passwords, and private credentials. Document the secret scanning process in CONTRIBUTING.md or SECURITY.md.
  • Low · Dependency on Cookiecutter Template Variable — Dependencies/Package file content. The kedro dependency uses '{{ cookiecutter.kedro_version }}' which appears to be a template variable. If this is not properly substituted during project generation, it could result in invalid dependency specifications. Fix: Ensure cookiecutter templates are properly validated post-generation. Add CI/CD checks to verify generated projects have valid dependency specifications before being used.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Healthy signals · kedro-org/kedro — RepoPilot