RepoPilot

Oneflow-Inc/oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.

Healthy

Healthy across all four use cases

Use as dependencyHealthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modifyHealthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isHealthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 5mo ago
  • 24+ active contributors
  • Distributed ownership (top contributor 31% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • Slowing — last commit 5mo ago

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/oneflow-inc/oneflow)](https://repopilot.app/r/oneflow-inc/oneflow)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/oneflow-inc/oneflow on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: Oneflow-Inc/oneflow

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Oneflow-Inc/oneflow shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across all four use cases

  • Last commit 5mo ago
  • 24+ active contributors
  • Distributed ownership (top contributor 31% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • ⚠ Slowing — last commit 5mo ago

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live Oneflow-Inc/oneflow repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/Oneflow-Inc/oneflow.

What it runs against: a local clone of Oneflow-Inc/oneflow — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Oneflow-Inc/oneflow | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 187 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>Oneflow-Inc/oneflow</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Oneflow-Inc/oneflow. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/Oneflow-Inc/oneflow.git
#   cd oneflow
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of Oneflow-Inc/oneflow and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "Oneflow-Inc/oneflow(\.git)?\b" \
  && ok "origin remote is Oneflow-Inc/oneflow" \
  || miss "origin remote is not Oneflow-Inc/oneflow (artifact may be from a fork)"

# 2. License matches what RepoPilot saw (SPDX id or the full Apache text)
(grep -qiE "(Apache-2\.0|Apache License,? *Version 2\.0)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in CMakeLists.txt .github/workflows/pr.yml ci/test/generic_test.sh \
         .github/CODEOWNERS ci/requirements.txt; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 187 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~157d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Oneflow-Inc/oneflow"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).
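That exit-code contract can be wired straight into an agent loop. A minimal sketch, where `verify` is a stand-in for running the script above (hard-coded here to fail so the stale branch is visible) and the regeneration step is a placeholder for whatever refresh your agent uses:

```shell
# Stand-in for "sh verify.sh" — the real script exits non-zero on stale claims.
# Hard-coded to fail here so the fallback branch below is exercised.
verify() { return 1; }

if verify; then
  echo "artifact verified: proceeding with edits"
else
  echo "artifact stale: regenerate before proceeding"
  # An agent would stop here and ask the user to re-run RepoPilot at
  # https://repopilot.app/r/Oneflow-Inc/oneflow before touching code.
fi
```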

</details>

TL;DR

OneFlow is a deep learning framework that provides PyTorch-like Python APIs with a focus on distributed execution, built around its Global Tensor abstraction and a graph-based compiler for optimization. It is designed to scale models across many GPUs with n-dimensional parallelism while staying easy to use; the codebase comprises roughly 14M lines of C++ core logic, 7M of Python, and 2.8M of CUDA kernels. It is a monolithic C++/Python stack: the C++ core engine lives in the oneflow/ subdirectory, Python bindings and APIs in python/ (inferred from the Python line count), operator implementations are likely scattered across CUDA kernel files, CI orchestration sits in .github/workflows/, and a development environment is preconfigured in .devcontainer/Dockerfile for reproducibility.

👥Who it's for

ML engineers and researchers who want to train large models with distributed parallelism (data/model/pipeline parallel) without rewriting models, and production teams deploying models via the graph compiler—particularly those in China (domestic CDN mirrors suggest this focus) or seeking PyTorch compatibility with better scaling properties.

🌱Maturity & risk

Production-ready. Released v1.0.0 recently, has CI/CD pipelines (.github/workflows/ with nightly builds, PR, release, and canary workflows), uses clang-tidy and format checking (ci/check/), supports Python 3.7-3.11 and CUDA 10.0+. However, codebase size and distributed complexity suggest active development rather than stable maturity.

Moderate risk: the large C++ surface area (14M lines) makes builds slow and complex; the monolithic architecture makes incremental adoption harder than with modular frameworks. Heavy CUDA usage (2.8M lines) creates tight hardware coupling. No visible single-maintainer risk (the GitHub org is Oneflow-Inc), but distributed-training bugs can be subtle. The last commit was roughly five months ago, so watch the maintenance trend.

Active areas of work

Active development post-v1.0.0 release; multiple CI workflows (pr.yml, test.yml, release.yml) run on every merge; community and private release tracks (community_release.yml, priv_release.yml) suggest parallel release channels; nightly Docker images being built (from README nightly-cuda11.8).

🚀Get running

git clone https://github.com/Oneflow-Inc/oneflow && cd oneflow && python3 -m pip install --upgrade pip && python3 -m pip install -e . (build from source per README). Or docker pull oneflowinc/oneflow:nightly-cuda11.8 for quick start.

Daily commands: the build is driven by CMakeLists.txt, likely via the standard CMake pattern (mkdir build && cd build && cmake .. && make). For development, open the repo with .devcontainer/devcontainer.json in VS Code for a reproducible environment; ci/build/make.sh handles CI builds.

🗺️Map of the codebase

  • CMakeLists.txt — Root build configuration defining the entire project structure, dependencies, and compilation flags for OneFlow framework.
  • .github/workflows/pr.yml — CI/CD pipeline that validates all pull requests; understanding this is essential for knowing what tests your changes must pass.
  • ci/test/generic_test.sh — Main test orchestration script that exercises core functionality and operators; critical for understanding test expectations.
  • .github/CODEOWNERS — Defines ownership and review requirements per subsystem; essential to know who reviews what and where to direct questions.
  • ci/requirements.txt — Specifies all Python dependencies needed for building and testing OneFlow; understanding this prevents dependency-related failures.
  • .clang-format — Enforces C++ code style consistency across the entire codebase; violations block CI/CD.
  • ci/check/run_clang_tidy.py — Static analysis gatekeeper that enforces code quality rules on C++ contributions; violations prevent merge.

🛠️How to make changes

Add a New Operator (Op)

  1. Create operator implementation in the ops source directory following naming conventions (e.g., elementwise_add_op.cpp) (CMakeLists.txt)
  2. Register the operator in CMakeLists.txt under the appropriate subsystem target to ensure compilation (CMakeLists.txt)
  3. Create a test case in ci/test/ following the naming pattern test_<op_name>.py or add to 1node_op_test.sh (ci/test/1node_op_test.sh)
  4. Verify operator passes clang-tidy static analysis by running the check locally (ci/check/run_clang_tidy.py)
  5. Use the op_template.md PR template when opening a pull request to document op functionality (.github/PULL_REQUEST_TEMPLATE/op_template.md)

Add a New Test for a Feature

  1. Create a test script in ci/test/ following naming convention (e.g., 1node_feature_test.sh) (ci/test/generic_test.sh)
  2. Add your test invocation to the appropriate CI workflow in .github/workflows/ (likely pr.yml or test.yml) (.github/workflows/pr.yml)
  3. For distributed tests, use distributed_run.py helper to launch multi-node setups (ci/test/distributed_run.py)
  4. Ensure test respects exclusion list in ci/test/excludelist if skipping known failures (ci/test/excludelist)
  5. Run the test locally and verify it passes before submitting PR (ci/test/parallel_run.py)

Modify C++ Code & Pass Code Quality Checks

  1. Edit C++ source files, ensuring you follow the .clang-format style rules (.clang-format)
  2. Run clang-format auto-fix locally: python ci/check/run_clang_format.py (ci/check/run_clang_format.py)
  3. Run clang-tidy for static analysis: python ci/check/run_clang_tidy.py and fix warnings (ci/check/run_clang_tidy.py)
  4. For Python code, run the format check: python ci/check/run_py_format.py (ci/check/run_py_format.py)
  5. Commit and open PR; CI workflows in .github/workflows/pr.yml will re-validate all checks (.github/workflows/pr.yml)

Build OneFlow Locally for Development

  1. Review ci/requirements.txt to install all Python dependencies: pip install -r ci/requirements.txt (ci/requirements.txt)
  2. Run ci/setup_submodule.sh to initialize external Git submodules and dependencies (ci/setup_submodule.sh)
  3. Execute build script ci/build/make.sh to compile the framework (or use CMake directly) (ci/build/make.sh)
  4. For GPU builds, select appropriate CMake cache file (e.g., cmake/caches/ci/cuda.cmake) by setting -C flag (cmake/caches/ci/cuda.cmake)
  5. Run basic sanity tests with ci/test/generic_test.sh to verify build (ci/test/generic_test.sh)

🔧Why these technologies

  • CMake — Multi-platform build system supporting CPU, CUDA, and XLA backends with conditional compilation and dependency resolution.
  • GitHub Actions — Native CI/CD integration enabling automated testing, building, and releasing across multiple platforms and hardware configurations without external service dependencies.
  • clang-format + clang-tidy — Industry-standard C++ tooling for enforcing consistent code style and catching static analysis issues early; integrated into CI gates.
  • Docker (devcontainer) — Reproducible development environment defined in .devcontainer/Dockerfile, giving contributors and CI an identical toolchain.

🪤Traps & gotchas

CUDA arch ≥60 required (no fallback to older cards). Build time likely very long (14M C++ lines + CUDA compilation). clang-format/clang-tidy enforced on diffs (ci/check/ scripts), so local formatting must match exactly or CI fails. PyYAML and parsec dependencies must be pre-installed or CMake configure will fail. Graph compiler (MLIR) adds compilation complexity; understand MLIR IR if optimizing. Distributed training bugs often reproduce only on multi-GPU setups.

🏗️Architecture

💡Concepts to learn

  • Global Tensor — Core abstraction in OneFlow for n-dimensional distributed parallelism (data/model/pipeline); enables writing single model code that scales across devices without explicit communication
  • MLIR (Multi-Level Intermediate Representation) — OneFlow uses MLIR (185K lines in repo) for graph compilation; understanding MLIR is essential for optimizing computation graphs and kernel fusion
  • Distributed Data Parallelism vs. Model Parallelism — OneFlow's Global Tensor unifies both paradigms; you must understand the tradeoffs between splitting data vs. model across devices when designing distributed training
  • CUDA Kernel Programming — 2.8M lines of CUDA means heavy kernel implementation; contributing operators requires writing optimized CUDA kernels with knowledge of warp-level primitives and memory hierarchy
  • Graph Compilation & Operator Fusion — OneFlow's Graph Compiler optimizes computation graphs by fusing operations; understanding kernel fusion and memory access patterns is critical for performance
  • Operator Registration Pattern — OneFlow has a registration system for ops (inferred from op_template.md); new contributors must follow the pattern to add operators, dispatch to correct kernels, and bind to Python
  • Memory-Mapped CUDA Runtime & Device Synchronization — Distributed training requires careful synchronization and all-reduce ops; understanding OneFlow's event loop and stream management prevents deadlocks and performance cliffs
Related projects

  • pytorch/pytorch — Direct competitor with a similar PyTorch-like API and distributed execution; PyTorch uses DistributedDataParallel while OneFlow uses Global Tensor, so study the differences in the abstraction layer
  • jax/jax — Functional deep learning alternative with strong distributed support via pmap/vmap; shared design goal of scaling without model rewrites
  • tensorflow/tensorflow — Established distributed DL framework with tf.distribute strategy; OneFlow competes on ease-of-use for distributed training
  • Oneflow-Inc/docker-images — Companion repo for OneFlow nightly Docker builds; referenced in README as source of nightly-cuda11.8 images
  • openxla/xla — Compiler for accelerating ML workloads; OneFlow uses MLIR (similar goal) for graph compilation, worth comparing IR design

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive clang-tidy CI enforcement workflow for pull requests

The repo has .clang-tidy and ci/check/clang_tidy_warnings_as_errors_on_diff setup, but there's no dedicated GitHub Actions workflow file for blocking PRs on clang-tidy violations. Currently, only run_clang_tidy.py exists in ci/check/ but isn't integrated into the main PR validation pipeline (pr.yml shows basic checks but no clang-tidy enforcement). This would catch C++ code quality issues earlier and reduce reviewer burden.

  • [ ] Create .github/workflows/clang-tidy-check.yml that runs on pull_request events
  • [ ] Integrate ci/check/run_clang_tidy.py with the clang-tidy warnings_as_errors config from ci/check/clang_tidy_warnings_as_errors_on_diff
  • [ ] Configure the workflow to post clang-tidy findings as PR comments using existing tools
  • [ ] Add job status check requirement in branch protection rules documentation
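The first two checklist items could look roughly like the fragment below. Everything here is a hypothetical sketch — the filename, trigger, runner image, and run command are assumptions to adapt, not details verified against this repo's CI:

```yaml
# .github/workflows/clang-tidy-check.yml (hypothetical)
name: clang-tidy-check
on: pull_request
jobs:
  clang-tidy:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so the diff against the PR base is available
      - name: clang-tidy on the PR diff
        run: python ci/check/run_clang_tidy.py   # wire in the warnings-as-errors config here
```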

Implement missing unit tests for operator template registration system

The repo uses .github/PULL_REQUEST_TEMPLATE/op_template.md indicating a structured op registration pattern, but there are no visible unit tests in the file structure for the op registration/templating system. Given the reliance on CMakeLists.txt and custom op patterns, adding tests would ensure new ops don't break the registration pipeline and reduce merge conflicts.

  • [ ] Create oneflow/core/operator/tests/op_registration_test.cpp to test the operator template registration system
  • [ ] Add test cases for: op name collision detection, template instantiation with different device types, and op attribute validation
  • [ ] Update CMakeLists.txt to include new test targets in the build system
  • [ ] Document the op registration testing pattern in CONTRIBUTING.md

Add Python code format validation to PR workflow (complementing existing run_py_format.py)

The repo has ci/check/run_py_format.py and references to PyYAML dependencies indicating Python codebase, but pr.yml GitHub Actions workflow doesn't appear to run Python formatting checks. Meanwhile, C++ has clang-format checks. Adding a dedicated Python linting workflow would catch style/import issues early, consistent with the existing ci/requirements.txt and dev setup.

  • [ ] Create .github/workflows/python-lint-check.yml that runs on pull_request events
  • [ ] Integrate ci/check/run_py_format.py (black/autopep8 formatting) into the workflow
  • [ ] Add flake8 or pylint as a secondary check using dependencies from ci/requirements.txt
  • [ ] Configure workflow to fail the check if formatting issues are detected, with suggestions for auto-fixing

🌿Good first issues

  • Add integration tests in tests/ for new CUDA operators (many Cuda/ files likely lack dedicated test coverage; follow op_template.md pattern)
  • Improve clang-format config consistency: audit C++ style violations in oneflow/ subdirs and document exceptions in .clang-format (a 14M-line codebase is likely to have accumulated inconsistent formatting)
  • Write Python binding documentation for Global Tensor API (python/ bindings exist but docstrings/examples likely sparse for distributed-first concepts)


📝Recent commits

  • 25c8978 — support npu multinomial (#10668) (ShawnXuan)
  • f485d7e — adaptation_of_mlu_device_based_on_refactor_xccl_primitive (#10647) (Flowingsun007)
  • 9fdc6d8 — Add interpolate_like function to oneflow.nn.functional (#10644) (ShawnXuan)
  • 43dc0ca — Revert "add interpolate_like for cpu" (#10645) (ShawnXuan)
  • d8fd231 — add interpolate_like for cpu (#10544) (dcaox)
  • ca812da — Support int64 output dtype when use NPU. (#10641) (MagicalMogu)
  • 7b20e90 — adaptation of multi devices based on refactor_xccl_primitive (#10618) (Flowingsun007)
  • 3fdbfe7 — update to ubuntu-22.04 (#10631) (Flowingsun007)
  • ac757cb — support BUILD_NPU (#10623) (Flowingsun007)
  • 334b4f5 — cache@v2->v4 (#10622) (ShawnXuan)

🔒Security observations

The OneFlow repository demonstrates reasonable security practices with GitHub Actions CI/CD, code quality tools (clang-tidy, format checkers), and suppressions for memory/thread/UB sanitizers. However, there are notable concerns: (1) loose dependency version pinning creates supply chain risks, (2) potential YAML deserialization vulnerabilities if user configurations are parsed unsafely, (3) GitHub Actions workflows need security hardening (pinned versions, OIDC tokens), and (4) Docker and CI/CD configurations require content review. The codebase lacks visible hardcoded secrets, but YAML configuration handling and dependency management should be prioritized for improvement.

  • Medium · Potential YAML Deserialization Vulnerability — Dependencies (PyYAML>=5.1). PyYAML>=5.1 is listed as a dependency. While PyYAML 5.1+ mitigates some unsafe deserialization issues, the presence of YAML parsing in a deep learning framework could pose risks if user-supplied YAML configurations are deserialized without proper safeguards. The framework should use yaml.safe_load() instead of yaml.load(). Fix: Ensure all YAML parsing uses yaml.safe_load() rather than yaml.load(). Upgrade to the latest PyYAML version and add linting rules to prevent unsafe deserialization patterns.
  • Medium · Insufficient Dependency Version Pinning — Package dependencies in requirements files. Dependencies specify loose version constraints (PyYAML>=5.1, parsec without version constraints). This allows installation of newer versions that may contain breaking changes or vulnerabilities. Loose pinning increases supply chain attack surface. Fix: Use specific version ranges or lock files (e.g., requirements.lock). Implement automated dependency scanning with tools like Dependabot. Define both minimum and maximum acceptable versions.
  • Low · Git Repository Exposed Configuration Files — Root configuration files (.clang-tidy, .clangd, .cmake-format.py). Configuration files like .clang-tidy, .clangd, .cmake-format.py, and .gitignore are committed to the repository. While generally safe, these could potentially leak information about the build system or development practices. Fix: Review these configuration files to ensure they don't contain sensitive paths, credentials, or security-critical build flags. Consider using .gitignore to exclude local overrides.
  • Low · Potential Debugging Artifacts Left in CI/CD — .lsan-suppressions, .tsan-suppressions, .ubsan-suppressions. Files like .lsan-suppressions, .tsan-suppressions, and .ubsan-suppressions indicate use of sanitizers. While beneficial for security testing, these should be carefully reviewed to ensure they don't suppress critical security warnings. Fix: Review suppression files to ensure only intentional suppressions are present. Document why each suppression exists. Consider using more granular suppressions rather than blanket rules.
  • Low · Docker Build Context Not Fully Reviewed — .devcontainer/Dockerfile, .dockerignore. .devcontainer/Dockerfile and .dockerignore are present but content not provided for review. Docker configurations can introduce supply chain risks if not properly secured. Fix: Ensure Dockerfile uses specific base image versions (not 'latest'), scans dependencies for vulnerabilities, and follows container security best practices (non-root user, minimal layers, no secrets in layers).
  • Low · GitHub Actions Workflows Require Security Review — .github/workflows/ (canary.yml, release.yml, pr.yml, priv_release.yml, etc.). Multiple GitHub Actions workflows (.github/workflows/) including PR, release, and deployment pipelines are present. These could potentially be vulnerable to token theft or malicious action execution. Fix: Use pinned action versions (not 'latest' or branches). Implement OIDC token-based authentication instead of PATs. Restrict workflow permissions to minimum required. Implement branch protections and approval gates for sensitive workflows.
  • Low · Mergify Configuration Present — .mergify.yml. .mergify.yml is used for automated merge management. Misconfigured merge rules could bypass security reviews or introduce untested code. Fix: Ensure merge rules require all status checks to pass, require approvals from code owners, and enforce branch protection rules. Regularly audit merge configuration changes.
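For the YAML finding above, a quick heuristic scan for unsafe `yaml.load(...)` call sites is easy to script. A sketch: it builds a throwaway example tree so the behaviour is visible without a clone; point the `grep` at your checkout instead:

```shell
# Build a tiny throwaway tree: one unsafe call site, one safe one.
tmp=$(mktemp -d)
printf 'import yaml\ncfg = yaml.load(open("c.yml"))\n' > "$tmp/bad.py"
printf 'import yaml\ncfg = yaml.safe_load(open("c.yml"))\n' > "$tmp/good.py"

# "yaml.load(" matches the unsafe call but not "yaml.safe_load(".
grep -rln 'yaml\.load(' "$tmp" || echo "no unsafe yaml.load calls found"

rm -rf "$tmp"
```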

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Healthy signals · Oneflow-Inc/oneflow — RepoPilot