dotnet/machinelearning
ML.NET is an open source and cross-platform machine learning framework for .NET.
Healthy across the board
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit 6d ago
- ✓ 28+ active contributors
- ✓ Distributed ownership (top contributor 21% of recent commits)
- ✓ MIT licensed
- ✓ CI configured
- ⚠ No test directory detected
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/dotnet/machinelearning)
Paste at the top of your README.md — renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/dotnet/machinelearning on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: dotnet/machinelearning
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/dotnet/machinelearning shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 6d ago
- 28+ active contributors
- Distributed ownership (top contributor 21% of recent commits)
- MIT licensed
- CI configured
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live dotnet/machinelearning
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/dotnet/machinelearning.
What it runs against: a local clone of dotnet/machinelearning — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in dotnet/machinelearning | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 36 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of dotnet/machinelearning. If you don't
# have one yet, run these first:
#
# git clone https://github.com/dotnet/machinelearning.git
# cd machinelearning
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of dotnet/machinelearning and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "dotnet/machinelearning(\.git)?\b" \
  && ok "origin remote is dotnet/machinelearning" \
  || miss "origin remote is not dotnet/machinelearning (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
test -f "Microsoft.ML.sln" \
  && ok "Microsoft.ML.sln" \
  || miss "missing critical file: Microsoft.ML.sln"
test -f "Directory.Build.props" \
  && ok "Directory.Build.props" \
  || miss "missing critical file: Directory.Build.props"
test -f "Directory.Packages.props" \
  && ok "Directory.Packages.props" \
  || miss "missing critical file: Directory.Packages.props"
test -f "CONTRIBUTING.md" \
  && ok "CONTRIBUTING.md" \
  || miss "missing critical file: CONTRIBUTING.md"
test -f ".github/copilot-instructions.md" \
  && ok ".github/copilot-instructions.md" \
  || miss "missing critical file: .github/copilot-instructions.md"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 36 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~6d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/dotnet/machinelearning"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
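The exit-code composition above can be sketched as a small agent-side wrapper. `verify.sh` is a hypothetical filename for the script saved from the block above; the demo calls use `true`/`false` as stand-ins so the sketch runs anywhere:

```shell
# Sketch of an agent loop around the verify script. In practice the
# argument would be ./verify.sh; `true`/`false` stand in for it here.
run_verified() {
  if "$@"; then
    echo "verified: proceeding with edits"
  else
    echo "stale artifact: regenerate at https://repopilot.app/r/dotnet/machinelearning" >&2
    return 1
  fi
}

run_verified true                    # in practice: run_verified ./verify.sh
run_verified false || echo "agent stops here and asks the user to regenerate"
```

Because the wrapper propagates the script's exit status, it slots into any retry or gating loop without parsing the ok:/FAIL: lines.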
⚡TL;DR
ML.NET is a cross-platform, open-source machine learning framework that enables .NET developers to build, train, and deploy ML models directly in C# without Python/R expertise. It handles data loading and transformations, and runs both classical algorithms and TensorFlow/ONNX models natively on Windows, Linux, macOS, and ARM64 architectures.

Monolithic repo with modular NuGet packages: core ML algorithm transformers and estimators under the main namespace, with specialized packages for TensorFlow/ONNX interop, AutoML features, and command-line tooling. Source is organized by feature domain (Classification, Regression, Clustering, Forecasting, Recommendation) with a parallel test structure. Native interop layers (C++/P-Invoke) bridge to scikit-learn-like algorithms and deep learning runtimes.
👥Who it's for
.NET developers (C# and F#) building production ML applications who want to avoid context-switching to Python; enterprise teams needing on-premises ML deployment with .NET Framework/.NET Core compatibility; data scientists integrating ML into existing .NET codebases.
🌱Maturity & risk
Production-ready. The codebase is large (22.9M lines of C#), has extensive CI/CD via Azure DevOps and GitHub Actions (.vsts-dotnet-ci.yml, .github/workflows/), official NuGet packages with version tracking, and active documentation. Regular releases are published with release notes tracked in docs/release-notes/.
Low-to-moderate risk for production use. The framework has heavy native dependencies (C++ interop for TensorFlow/LightGBM) requiring platform-specific builds, which increases maintenance complexity. Architecture spans multiple languages (C# 22.9M LOC, C++ 304K, PowerShell/CMake build scripts) making cross-platform testing critical. Dependency on external ML libraries (TensorFlow, ONNX, LightGBM) means you inherit their stability constraints.
Active areas of work
Active development with GitHub Actions CI (add-tests.lock.yml, issue-triage workflows), Copilot-assisted code generation (8 code-testing agents in .github/skills/), and automated release-notes generation via prompts/. Recent focus on platform support (ARM64, Apple M1) and integration testing as evidenced by 1espt security baseline config.
🚀Get running
Clone the repo, install .NET Core 2.1+, and build: git clone https://github.com/dotnet/machinelearning.git && cd machinelearning && dotnet build. Consult the Makefile or .vsts-dotnet-ci.yml for platform-specific build commands (Windows requires Visual Studio, Linux/macOS use dotnet CLI).
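Since the entry point differs by platform, a minimal sketch of picking it programmatically may help — `build.sh`/`build.cmd` are the repo's real scripts, but the `pick_build` helper itself is illustrative:

```shell
# Minimal sketch: echo the build entry point for the current platform.
# Only inspects `uname`; it does not run the build.
pick_build() {
  case "$(uname -s)" in
    Linux|Darwin)         echo "./build.sh" ;;   # Linux / macOS
    MINGW*|MSYS*|CYGWIN*) echo "build.cmd" ;;    # Windows shells
    *) echo "unrecognized platform: $(uname -s)" >&2; return 1 ;;
  esac
}

pick_build
```

Usage: `"$(pick_build)"` after cloning, or consult .vsts-dotnet-ci.yml for the exact flags CI passes on each OS.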
Daily commands:
No 'dev server'; this is a library. Build and test: dotnet build src/Microsoft.ML.Core.sln (or use Visual Studio on Windows). Run unit tests: dotnet test. The repo includes example projects in a separate samples repo (dotnet/machinelearning-samples).
🗺️Map of the codebase
- Microsoft.ML.sln — Primary solution file containing all projects and dependencies; start here to understand the modular architecture
- Directory.Build.props — Central build configuration defining SDK versions, dependency constraints, and common compiler settings across all projects
- Directory.Packages.props — Centralized NuGet package version management for all projects; essential for understanding external dependencies
- CONTRIBUTING.md — Defines contribution workflow, code style, PR process, and testing expectations; required reading before submitting changes
- .github/copilot-instructions.md — Codifies naming conventions, architectural patterns, and development guidelines used throughout the ML.NET framework
- build.cmd — Entry point for the Windows build process; orchestrates compilation, testing, and packaging steps
- NuGet.config — Configures package sources and restoration rules; controls where dependencies are resolved
🛠️How to make changes
Add a New ML Algorithm Estimator
- Create algorithm documentation following the pattern in docs/api-reference/, e.g., docs/api-reference/algo-details-myalgorithm.md, describing parameters, use cases, and tuning guidance (docs/api-reference/algo-details-myalgorithm.md)
- Implement the estimator class and register it in the main ML context following conventions in the source tree (not listed, but referenced by .sln) (Microsoft.ML.sln)
- Add unit tests covering normal cases, edge cases, and serialization; place them in the parallel test directory structure (build/ci/job-template.yml)
- Update Directory.Build.props if new external dependencies (e.g., native libraries) are needed, and bump versions in Directory.Packages.props (Directory.Packages.props)
Add Support for a New Data Type or I/O Format
- Create an I/O column specification document at docs/api-reference/io-columns-yourtask.md defining input/output schemas and data contracts (docs/api-reference/io-columns-yourtask.md)
- Implement an IDataView transformer or IEstimator in the source tree (organized by ML task category) (Microsoft.ML.sln)
- Register the transformer in the MLContext API surface and update inline documentation (Directory.Build.props)
- Add integration tests to the test suite and verify with the build.cmd test target (build.cmd)
Integrate External Model Format (TensorFlow/ONNX/oneDAL)
- Document the model loading and inference contract in docs/api-reference/tensorflow-usage.md or equivalent, showing shape/type compatibility (docs/api-reference/tensorflow-usage.md)
- Add native library bindings or managed wrappers in the source tree, updating Directory.Packages.props with new external dependencies (Directory.Packages.props)
- Create loading/scoring estimators following the IEstimator<ITransformer> pattern with proper resource cleanup (Microsoft.ML.sln)
- Verify interop and performance by running the full CI pipeline via .vsts-dotnet-ci.yml across Windows/Linux/macOS (.vsts-dotnet-ci.yml)
Contribute a Bug Fix or Performance Improvement
- Fork the repo and review CONTRIBUTING.md to understand the PR branching strategy and CLA requirements (CONTRIBUTING.md)
- Create or update unit tests in the test project mirroring the source directory layout, ensuring coverage for the fix (build/ci/job-template.yml)
- Run build.cmd to validate locally on Windows; use build.sh for Linux/macOS validation (build.cmd)
- Submit a PR with a detailed description and a reference to the issue; CI automation in .vsts-dotnet-ci.yml will run end-to-end validation (.vsts-dotnet-ci.yml)
🔧Why these technologies
- .NET (C#) runtime — Cross-platform framework enabling ML.NET to run on Windows, Linux, and macOS with a single codebase; integrates tightly with .NET ecosystem for enterprise adoption
- TensorFlow & ONNX interop — Extends ML.NET beyond native algorithms to consume pre-trained deep learning and domain-specific models; vendor-neutral model portability
- IDataView abstraction — Lazy-evaluated, columnar data pipeline enabling efficient memory usage and composable transformations without materializing intermediate data
- Intel oneDAL backend — Accelerates linear algebra and tree algorithms on multi-core CPUs; improves performance for large-scale training without GPU dependency
- MSBuild (Directory.Build.props pattern) — Centralized property management ensures consistent SDK, compiler, and dependency versions across ~20+ projects; simplifies maintenance
⚖️Trade-offs already made
- Lazy IDataView evaluation vs. eager DataFrame materialization
  - Why: Reduces memory footprint for large datasets; enables pipelining without intermediate copies
  - Consequence: More complex debugging; transformations must be composable; some operations require explicit materialization
- Managed C# estimators + native C++ algorithm kernels
  - Why: Balances ease-of-use (C# API) with performance-critical compute (native code); avoids 100% JIT overhead
  - Consequence: Complex interop and platform-specific binary dependencies; harder to debug across language boundaries
- Support external formats (TensorFlow, ONNX) over proprietary serialization
  - Why: Reduces vendor lock-in; enables model sharing across frameworks and languages
  - Consequence: Inherits limitations and versioning complexity of external formats; must maintain compatibility across updates
- Cross-platform CI/CD (Windows, Linux, macOS) in .vsts-dotnet-ci.yml
  - Why: Ensures genuine cross-platform compatibility; catches platform-specific bugs early
  - Consequence: Slower, more expensive CI runs and a three-platform build matrix to maintain
🪤Traps & gotchas
- Windows x86 builds exclude TensorFlow/LightGBM (platform limitation documented in docs/project-docs/platform-limitations.md).
- C++ native builds require Visual Studio 2019+ on Windows or gcc/clang with CMake on Linux/macOS; a missing toolchain fails silently during dependency resolution.
- Blazor WASM has reduced functionality (CPU-only, no native interop).
- The repo uses .gitmodules for submodules; clone with --recursive or run git submodule update --init.
- Build artifacts reference platform-specific native DLLs; cross-platform package publishing requires full matrix builds on Azure DevOps.
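The submodule trap can be detected with a read-only check — `missing_submodules` is an illustrative helper, not something shipped in the repo:

```shell
# Read-only sketch: count uninitialized submodules in the current clone.
# `git submodule status` prefixes uninitialized entries with '-'.
missing_submodules() {
  # `|| true` keeps exit status 0 when the count is zero
  # (grep -c exits 1 on no matches but still prints "0").
  git submodule status 2>/dev/null | grep -c '^-' || true
}

# In a fresh non-recursive clone this prints a non-zero count; fix with:
#   git submodule update --init --recursive
missing_submodules
```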
🏗️Architecture
💡Concepts to learn
- DataView — ML.NET's lazy, columnar, type-safe data abstraction; core to understanding how pipelines process data without loading everything into memory, unlike DataFrames.
- IEstimator / ITransformer — Fundamental pattern in ML.NET: Estimators learn from data (fit), Transformers apply learned logic (transform); understanding this duality is required to use or extend the framework.
- Pipeline Builder Pattern — ML.NET chains transforms via fluent APIs (e.g., mlContext.Transforms.Normalize(...).Append(...)); this composition model is the primary way to structure feature engineering and model training.
- P/Invoke and Native Interop — ML.NET bridges C# to C++ for performance (TensorFlow, ONNX, LightGBM); contributors must understand marshaling, unsafe code, and platform-specific DLL loading for native dependencies.
- Columnar Data Processing (SIMD) — ML.NET's Transforms and Estimators use vectorized operations on column batches via SIMD intrinsics (SSE/AVX on x64, NEON on ARM); understanding this enables performance optimization.
- Cross-Platform Native Builds (CMake) — The repo builds C++ code on Windows, Linux, and macOS via CMake; debugging build failures requires familiarity with platform-specific toolchains and the .vsts-dotnet-ci.yml matrix.
- AutoML (Automated Machine Learning) — ML.NET includes AutoML.CodeGen which automatically searches algorithm hyperparameters and generates C# training code; useful for non-expert developers but requires understanding the search space and stopping criteria.
🔗Related repos
- dotnet/machinelearning-samples — Official companion repo with runnable C# examples for all ML.NET tasks (classification, forecasting, etc.); essential for learning the framework API.
- microsoft/onnxruntime — Native dependency for ML.NET's ONNX model inference; understanding its C API and platform support is necessary when debugging interop issues.
- microsoft/LightGBM — Gradient boosting engine integrated into ML.NET; source of native bindings and performance-critical ops in src/Native/.
- dotnet/roslyn — Used for code generation in AutoML and the ML.NET CLI (mlnet command) to dynamically emit C# training pipelines.
- scikit-learn/scikit-learn — Design inspiration and algorithm reference; several ML.NET estimators have scikit-learn analogues (e.g., OneVsAll).
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive GitHub Actions workflow for cross-platform ML model validation
The repo has multiple build configurations (.vsts-dotnet-ci.yml, .night-build.yml, .outer-loop-build.yml) but the .github/workflows directory lacks a dedicated workflow for validating ML model training and inference across Windows/Linux/macOS. Given ML.NET's cross-platform focus and the presence of TensorFlow/ONNX model support mentioned in README.md, a workflow that trains sample models and validates consistency across platforms would catch platform-specific ML bugs early.
- [ ] Create .github/workflows/ml-model-validation.yml to train and validate models on ubuntu-latest, windows-latest, and macos-latest
- [ ] Reference the sample models or test datasets already in the repo's test structure
- [ ] Add validation for both native ML.NET algorithms and imported TensorFlow/ONNX models
- [ ] Integrate with existing dependabot.yml to catch breaking changes in dependencies (numpy, pandas, scikit-learn)
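A skeleton for the first checklist item might look like the fragment below. Every detail here is a hypothetical starting point — the workflow name, matrix, and step commands are not verified against the repo, and the final step is a placeholder to replace with real model train/validate logic:

```yaml
# .github/workflows/ml-model-validation.yml — hypothetical sketch
name: ML model validation
on: [push, pull_request]
jobs:
  validate:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive   # repo uses .gitmodules (see Traps)
      - name: Build (Linux/macOS)
        if: runner.os != 'Windows'
        run: ./build.sh
      - name: Build (Windows)
        if: runner.os == 'Windows'
        run: .\build.cmd
        shell: cmd
      - name: Train and validate sample models
        run: dotnet test   # placeholder — swap in model train/validate steps
```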
Implement unit tests for data transformation pipeline in src/ (specific to DataFrame/IDataView operations)
The README mentions 'data transformations' as a key feature, but there's no visible test-focused PR template or documented test coverage gaps. The repo structure suggests ML algorithms are heavily tested, but data loading and transformation—critical for real-world ML workflows—likely needs expanded test coverage for edge cases (missing values, type mismatches, large datasets).
- [ ] Identify the core data transformation classes in src/ (likely IDataView implementations and data loading modules)
- [ ] Add parameterized unit tests covering null handling, schema validation, and type coercion across CSV/database sources
- [ ] Test transformation composition (chaining multiple transforms) for correctness and performance
- [ ] Add integration tests validating transformations against the numpy/pandas/scikit-learn behavior baseline mentioned in dependencies
Add GitHub Actions agent skill for automated ONNX/TensorFlow model compatibility testing
The .github/agents/ and .github/skills/ directories show an established AI agent infrastructure for code testing and linting. However, there's no skill for validating ONNX/TensorFlow model imports—a critical feature mentioned in README.md. Creating a reusable skill would enable automated testing of model format compatibility and version-specific issues.
- [ ] Create .github/skills/model-compatibility-validator/SKILL.md documenting the agent's purpose
- [ ] Add .github/skills/model-compatibility-validator/extensions/ for ONNX and TensorFlow validation prompts (parallel to existing dotnet.md)
- [ ] Create a prompt file (e.g., model-compatibility.prompt.md) for the agent to validate imported models against multiple runtime versions
- [ ] Reference this skill in a new or existing workflow (similar to add-tests.lock.yml pattern)
🌿Good first issues
- Add XML documentation comments to public APIs in src/Microsoft.ML.Data/Transforms/NormalizingTransformer.cs and similar core transforms; the codebase has inconsistent doc coverage despite being public API surface.
- Write integration tests for the AutoML (AutoML/) module on ARM64; current test matrix (.vsts-dotnet-ci.yml) covers x64 but ARM64 testing is marked as experimental.
- Create example notebooks or markdown guides in docs/ for the newer ONNX inference path; current samples emphasize TensorFlow but ONNX adoption is growing and documented examples are sparse.
⭐Top contributors
- @dotnet-maestro[bot] — 21 commits
- @ericstj — 16 commits
- @tarekgh — 14 commits
- @Copilot — 6 commits
- @michaelgsharp — 6 commits
📝Recent commits
- 4c8b357 — [main] Update dependencies from dotnet/arcade (#7601) (dotnet-maestro[bot])
- 3ff4be3 — Add the code testing agent and /add-tests workflow (#7602) (JanKrivanek)
- e11c2b4 — Update SkiaSharp from 2.88.8 to 3.119.2 (#7582) (Copilot)
- 9d809f1 — PoC of repository ai bootstrap (#7585) (JanKrivanek)
- d25ef12 — Add Melt method to DataFrame (#7578) (sevenzees)
- 0cef6a2 — Update macOS Helix queues from decommissioned OSX.13 to osx.15 (#7599) (rokonec)
- a564b13 — [main] Update dependencies from dotnet/arcade (#7566) (dotnet-maestro[bot])
- f79b8fa — Add gpt-5.4 to tiktoken tokenizer (#7591) (Copilot)
- def7a4a — Remove Google.Protobuf dependency from Microsoft.ML.Tokenizers (#7587) (stephentoub)
- 1a6739e — Add targets for repo health aw (#7588) (JanKrivanek)
🔒Security observations
The ML.NET repository demonstrates good security baseline practices with proper security policy (SECURITY.md) and reporting mechanisms in place. The primary concerns are around Python dependency management without version pinning and incomplete security documentation. The codebase structure appears well-organized with security workflows and configuration management. No critical vulnerabilities were identified in the static analysis of the visible file structure. However, comprehensive source code review would be needed to identify potential injection risks, hardcoded secrets, or authentication issues in the actual implementation files.
- Medium · Python Dependencies Without Version Pinning — Dependencies/Package file (numpy, pandas, scikit-learn). The dependencies file lists numpy, pandas, and scikit-learn without version constraints. This could lead to compatibility issues and security vulnerabilities if insecure versions are installed; package managers may resolve to versions with known CVEs when nothing is pinned. Fix: Pin all dependencies to specific versions (e.g., numpy==1.24.3) and regularly audit for CVEs with tools like pip-audit or Poetry.
- Low · Incomplete Security Policy Documentation — SECURITY.md. The SECURITY.md file appears truncated: the section on reporting security issues is incomplete. The Microsoft Security Response Center URL is present, but the full reporting process is cut off, which could confuse security researchers. Fix: Complete SECURITY.md with full vulnerability-reporting instructions, expected response times, and safe-disclosure guidelines following MSRC best practices.
- Low · Copilot and AI Agent Configuration Visibility — .github/copilot-instructions.md, .github/agents/, .github/skills/. The repository contains multiple AI agent configuration files (.github/agents, .github/skills/code-testing-agent) and Copilot instructions. While not inherently insecure, these should be reviewed to ensure they don't expose sensitive patterns or encourage insecure coding practices. Fix: Review AI/Copilot instruction files to confirm they promote secure coding, don't leak sensitive information, and align with your security policies.
- Low · Build Script Accessibility — build.cmd, build.sh. Build scripts in the root directory, while typical for open-source projects, execute arbitrary code during the build and could be exploited if compromised. Fix: Code-review build scripts, sign them where possible, have CI/CD pipelines validate script integrity, and restrict write access to these files.
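The version-pinning fix above lends itself to a trivial automated check. A sketch — the requirements file is a temp-file example written purely for illustration, and `unpinned` is a hypothetical helper:

```shell
# Sketch: list requirements-style entries with no pinned version.
# Writes a tiny example file so the check is runnable anywhere.
reqs=$(mktemp)
cat > "$reqs" <<'EOF'
numpy
pandas==2.2.2
scikit-learn
EOF

unpinned() {
  # Anything without '==' (ignoring comments and blank lines) is unpinned.
  grep -vE '==|^#|^$' "$1"
}

unpinned "$reqs"    # prints: numpy, scikit-learn (one per line)
```

In CI this would gate on `unpinned requirements.txt` producing no output; pip-audit or Poetry then handle the CVE side.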
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.