microsoft/nni
An open source AutoML toolkit that automates the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyperparameter tuning.
Stale — last commit 2y ago
Weakest axis: last commit was 2y ago; no tests detected
Has a license and CI — a clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ 13 active contributors
- ✓ Distributed ownership (top contributor: 20% of recent commits)
- ✓ MIT licensed
- ✓ CI configured
- ⚠ Stale — last commit 2y ago
- ⚠ No test directory detected
What would change the summary?
- → Use as dependency: Mixed → Healthy if ≥1 commit in the last 365 days
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Forkable" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/microsoft/nni) — paste at the top of your README.md; it renders inline like a shields.io badge.
Social card preview (1200×630): this card auto-renders when someone shares https://repopilot.app/r/microsoft/nni on X, Slack, or LinkedIn.
Onboarding: microsoft/nni
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/microsoft/nni shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Stale — last commit 2y ago
- 13 active contributors
- Distributed ownership (top contributor 20% of recent commits)
- MIT licensed
- CI configured
- ⚠ Stale — last commit 2y ago
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live microsoft/nni
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/microsoft/nni.
What it runs against: a local clone of microsoft/nni — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in microsoft/nni | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 703 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of microsoft/nni. If you don't
# have one yet, run these first:
#
# git clone https://github.com/microsoft/nni.git
# cd nni
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of microsoft/nni and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "microsoft/nni(\.git)?\b" \
  && ok "origin remote is microsoft/nni" \
  || miss "origin remote is not microsoft/nni (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "README.md" \
  && ok "README.md" \
  || miss "missing critical file: README.md"
test -f "dependencies/required.txt" \
  && ok "dependencies/required.txt" \
  || miss "missing critical file: dependencies/required.txt"
test -f ".github/workflows/main.yml" \
  && ok ".github/workflows/main.yml" \
  || miss "missing critical file: .github/workflows/main.yml"
test -f "Dockerfile" \
  && ok "Dockerfile" \
  || miss "missing critical file: Dockerfile"
test -f "SECURITY.md" \
  && ok "SECURITY.md" \
  || miss "missing critical file: SECURITY.md"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 703 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~673d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/microsoft/nni"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
NNI (Neural Network Intelligence) is an open-source AutoML toolkit that automates the full machine learning lifecycle: hyperparameter tuning, neural architecture search (NAS), feature engineering, and model compression. It provides tuning algorithms, training service integrations, and APIs to automatically optimize ML models across diverse frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost) and deployment targets (local, remote, cloud, Kubernetes). It is a monorepo: the Python core (nni/) houses the tuning engine, the TypeScript/JavaScript frontend (ts/) provides the web UI, and documentation lives in docs/. Core modules: nni/tuner/ (algorithm implementations), nni/experiment/ (experiment orchestration), nni/compression/ (model compression), nni/nas/ (architecture search), nni/training_service/ (runtime backends). Dependencies are segmented in the dependencies/ folder (required.txt, recommended.txt, GPU variants).
👥Who it's for
ML engineers and researchers who need to automatically optimize model hyperparameters, explore architecture spaces, or compress models without manually trying thousands of configurations. Contributors are typically AutoML researchers extending tuning algorithms or adding new training service backends.
🌱Maturity & risk
Mature and production-tested, though no longer actively developed: the last commit was roughly two years ago. The repo shows a v3.0 release (May 2022), 3.9M lines of Python code, comprehensive CI/CD via GitHub Actions (.github/workflows/main.yml), and institutional backing from Microsoft. However, v3.0 is marked 'preview', suggesting API consolidation was still in progress when activity slowed.
Moderate risk: a large monorepo (3.9M Python lines) with a complex multi-service architecture (tuners, training services, compression modules) increases the maintenance burden. Multiple training service backends (Kubernetes, AML, Kubeflow, PAI) mean dependencies on external systems. The 'v3.0 preview' status indicates breaking changes were still landing. Open issue and PR counts are substantial (visible in the README badges), suggesting a backlog exists.
Active areas of work
The most recent development focused on v3.0, modernizing APIs and documentation (docs/ upgraded per the README). Multiple bug report and enhancement issue templates (.github/ISSUE_TEMPLATE/) reflect community engagement at scale. Research papers (OSDI 2022, CVPR 2022) indicate algorithmic innovation during the active period. New demo content was also added (YouTube/Bilibili links in the README).
🚀Get running
git clone https://github.com/microsoft/nni.git
cd nni
pip install -r dependencies/required.txt
pip install -r dependencies/setup.txt
pip install -e .
Daily commands:
To launch an experiment: nnictl create --config <config.yml> (equivalently python -m nni.tools.nnictl create --config <config.yml>). To manage it: nnictl stop / nnictl view. Development server for the web UI: cd ts && npm install && npm start (inferred from TypeScript presence). See docs/ for detailed tutorial examples. A Python-API alternative to the YAML workflow is sketched below.
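Below is a minimal sketch of launching an experiment through NNI's Python Experiment API, following the field names from NNI's HPO quickstart (v2.x/v3.x era). The search-space values and the trial.py filename are illustrative, and field names may differ across versions, so verify against docs/ before relying on it.

```python
# launch.py: minimal NNI experiment via the Python API (sketch; verify field
# names against your installed NNI version). Assumes a trial.py that calls
# nni.get_next_parameter() and nni.report_final_result().
from nni.experiment import Experiment

search_space = {
    'lr':       {'_type': 'loguniform', '_value': [1e-5, 1e-1]},
    'momentum': {'_type': 'uniform',    '_value': [0.5, 0.99]},
}

experiment = Experiment('local')                 # training service backend
experiment.config.trial_command = 'python trial.py'
experiment.config.trial_code_directory = '.'
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args = {'optimize_mode': 'maximize'}
experiment.config.max_trial_number = 20
experiment.config.trial_concurrency = 2

experiment.run(8080)                             # web UI at http://localhost:8080
```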
🗺️Map of the codebase
- README.md — Entry point documenting NNI's core mission (AutoML for feature engineering, NAS, hyperparameter tuning, model compression) and primary use cases.
- dependencies/required.txt — Defines minimal dependencies for the NNI runtime; essential for understanding core framework requirements and compatibility constraints.
- .github/workflows/main.yml — CI/CD pipeline defining build, test, and release processes; critical for understanding how changes are validated before merge.
- Dockerfile — Container build definition; essential for deployment and reproducibility of NNI in containerized environments.
- SECURITY.md — Security policy and vulnerability reporting process; mandatory for all contributors to understand responsible disclosure.
- LICENSE — MIT license; establishes the legal framework and usage rights for all code in the repository.
- .readthedocs.yaml — Documentation build configuration; controls how docs are generated and deployed, critical for maintaining API documentation.
🛠️How to make changes
Add a New Hyperparameter Tuning Algorithm (Tuner)
- Create a new tuner class inheriting from the base Tuner interface in nni.tuner (nni/algorithms/hpo/[algorithm_name]_tuner.py)
- Implement the required methods: update_search_space(), generate_parameters(), receive_trial_result() (same file; a minimal skeleton follows this list)
- Register the tuner in the algorithm registry for auto-discovery (nni/algorithms/__init__.py)
- Add dependencies to dependencies/recommended.txt if the algorithm requires external libraries
- Create an example experiment configuration in the docs demonstrating the tuner (docs/reference/tuners/[algorithm_name].rst)
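A minimal skeleton under the interface named above. The three method signatures follow nni.tuner.Tuner as documented; the class name, random-search logic, and supported space types are illustrative, and registry wiring varies between NNI versions.

```python
# sketch: a random-search tuner implementing the base Tuner interface.
# Treat this as a shape, not a drop-in module; module paths and registration
# differ across NNI versions.
import random

from nni.tuner import Tuner

class MyRandomTuner(Tuner):
    def __init__(self, seed=0):
        self.space = {}
        self.rng = random.Random(seed)

    def update_search_space(self, search_space):
        # Called at startup (and again if the search space changes).
        self.space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        # Return one concrete configuration drawn from the search space.
        params = {}
        for name, spec in self.space.items():
            if spec['_type'] == 'choice':
                params[name] = self.rng.choice(spec['_value'])
            elif spec['_type'] == 'uniform':
                low, high = spec['_value']
                params[name] = self.rng.uniform(low, high)
        return params

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        # Random search ignores feedback; a smarter tuner updates its model here.
        pass
```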
Add a New Model Compression Algorithm (Pruner or Quantizer)
- Create a compression algorithm class inheriting from the base Compressor interface (nni/compression/[compression_type]/[algorithm_name].py)
- Implement the compression logic for supported frameworks (PyTorch, TensorFlow) in the same file (a usage sketch follows this list)
- Add unit tests and integration tests validating compression ratio and accuracy (tests/compression/[compression_type]/test_[algorithm_name].py)
- Update the compression documentation with an algorithm overview and example usage (docs/compression/[compression_type]/[algorithm_name].rst)
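For orientation, here is how a built-in pruner is consumed, using the v2.x-style API. Note that v3.0 merged nni.contrib.compression into nni.compression (see Recent commits below), so import paths and config keys may differ on your checkout; verify before copying.

```python
# sketch: applying a built-in pruner (v2.x-style API; paths may differ in v3.0).
import torch

from nni.compression.pytorch.pruning import L1NormPruner
from nni.compression.pytorch.speedup import ModelSpeedup

model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
)
config_list = [{'sparsity_per_layer': 0.5, 'op_types': ['Linear']}]

pruner = L1NormPruner(model, config_list)
_, masks = pruner.compress()          # compute masks; weights are not yet removed
pruner._unwrap_model()                # strip wrappers before speedup
ModelSpeedup(model, torch.rand(1, 64), masks).speedup_model()
```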
Add a New Neural Architecture Search (NAS) Algorithm
- Create a search strategy class in nni.nas implementing the SearchAlgorithm interface (nni/nas/[strategy_name].py)
- Define a search space (cell-based, macro, or custom) compatible with NNI's mutator framework (nni/nas/space/[search_space_name].py)
- Implement the search loop: sample architecture → train → evaluate → update strategy (nni/nas/[strategy_name].py; a toy version of this loop follows the list)
- Create a benchmark example against NAS-Bench-201 or a custom benchmark dataset (examples/nas/[strategy_name]_benchmark.py)
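To make the loop concrete, here is a framework-agnostic toy version. Every name in it (RandomSearchStrategy, sample, run_search, train_and_eval) is illustrative, not NNI's real API; check nni/nas/ for the actual strategy base classes.

```python
# illustrative sketch of the sample -> train -> evaluate -> update loop.
import random

class RandomSearchStrategy:
    """Toy strategy: sample architectures uniformly, keep the best."""

    def __init__(self, search_space):
        self.search_space = search_space   # e.g. {'depth': [2, 4, 8], 'width': [64, 128]}
        self.best = (None, float('-inf'))

    def sample(self):
        return {k: random.choice(v) for k, v in self.search_space.items()}

    def update(self, arch, score):
        if score > self.best[1]:
            self.best = (arch, score)

def run_search(strategy, train_and_eval, budget=20):
    for _ in range(budget):
        arch = strategy.sample()           # 1. sample an architecture
        score = train_and_eval(arch)       # 2-3. train it, evaluate it
        strategy.update(arch, score)       # 4. feed the result back
    return strategy.best
```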
Add a New Training Service Backend (Distributed Executor)
- Create a training service class implementing the TrainingService interface (nni/training_services/[backend_name]/training_service.py; an illustrative interface sketch follows this list)
- Implement job submission, status polling, and resource management for the target platform (same file)
- Add a configuration schema for the training service backend (nni/training_services/[backend_name]/config.py)
- Register the training service in the ServiceFactory for auto-discovery (nni/training_services/__init__.py)
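A hypothetical shape of the backend contract, to show what "submission, polling, and resource management" amounts to. The method names below are illustrative and chosen to mirror the steps above; the real interface in the repo (historically split between Python and the TypeScript manager under ts/) will differ.

```python
# hypothetical training-service interface; names are illustrative, not NNI's.
from abc import ABC, abstractmethod

class TrainingService(ABC):
    @abstractmethod
    def submit_trial(self, trial_command: str, resources: dict) -> str:
        """Submit one trial job to the platform; return a job ID."""

    @abstractmethod
    def poll_status(self, job_id: str) -> str:
        """Return RUNNING / SUCCEEDED / FAILED for a submitted job."""

    @abstractmethod
    def cancel(self, job_id: str) -> None:
        """Stop a job and release its resources."""
```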
🔧Why these technologies
- Python — Primary language for ML/DL research; facilitates integration with PyTorch, TensorFlow, scikit-learn; enables rapid prototyping of AutoML algorithms
- Docker — Enables reproducible, containerized deployments across heterogeneous cloud and on-premises clusters; simplifies dependency management
- Sphinx + ReadTheDocs — Industry-standard for Python API documentation; automates doc generation from docstrings; supports versioning and multi-language localization (Crowdin)
- GitHub Actions — Native CI/CD for GitHub; enables automated testing on multiple Python versions and platforms; streamlines release workflows
- MIT License — Permissive open-source license; enables broad adoption in academia and industry without restrictive reciprocal obligations
⚖️Trade-offs already made
- Pluggable tuner/pruner/NAS algorithm architecture (not monolithic)
  - Why: Allows researchers to contribute custom algorithms without forking; reduces core maintenance burden
  - Consequence: Slightly more abstraction overhead; requires users to understand plugin registration patterns
- Support for multiple training service backends (local, remote, Kubernetes, cloud providers)
  - Why: Maximizes flexibility for diverse deployment scenarios (single machine to 1000+ node clusters)
  - Consequence: Higher code complexity; more surface area for bugs; harder to optimize for a single platform
- Framework-agnostic trial interface (works with PyTorch, TensorFlow, XGBoost, etc.)
  - Why: Enables adoption across the ML ecosystem; future-proofs against framework shifts
  - Consequence: Cannot leverage framework-specific optimizations; trial code must explicitly log metrics to the NNI API (see the trial sketch after this list)
- Python-only SDK (no C++/Rust core libraries for core tuning logic)
  - Why: Simpler maintenance; faster iteration on algorithms; easier for community contributions
  - Consequence: Potential latency overhead for high-frequency operations (millions of trials); less suitable for real-time embedded AutoML
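The explicit metric-logging contract looks like this in a trial script. nni.get_next_parameter() and the two report_* calls are NNI's documented trial API; the "training" math is a placeholder.

```python
# trial.py: the framework-agnostic trial interface (sketch).
import nni

params = nni.get_next_parameter()               # configuration chosen by the tuner
lr = params.get('lr', 0.01)

accuracy = 0.0
for epoch in range(10):
    accuracy = 1.0 - (0.5 ** (epoch + 1)) - lr  # placeholder "training" curve
    nni.report_intermediate_result(accuracy)    # feeds early-stopping assessors

nni.report_final_result(accuracy)               # required, or the experiment hangs
```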
🚫Non-goals (don't propose these)
- Real-time online learning: NNI evaluates models serially or in fixed batches; not designed for streaming data or continual learning workflows
- Automatic data preprocessing: Does not infer missing value imputation, scaling, or categorical encoding; focuses on hyperparameter tuning and architecture search
- Model deployment and serving: NNI finds good models but does not manage model serving infrastructure (cf. KServe, BentoML)
- GPU memory optimization: Does not automatically tune batch size or gradient checkpointing; users must manually configure
- Windows as a primary platform: documentation emphasizes Linux and macOS workflows; Windows does not appear to be a first-class target
🪤Traps & gotchas
- Config-driven: experiments require YAML configuration files; a Python API exists, but YAML is the primary input format, and missing config fields cause cryptic errors.
- Trial reporting: trials must call nni.report_intermediate_result() / nni.report_final_result() or the experiment hangs; there is no automatic metrics collection.
- Training service startup: each backend (Kubernetes, AML, Kubeflow) requires pre-configured credentials/cluster access; local mode works without setup, but cloud backends fail silently if misconfigured.
- Version gaps: v3.0 is a preview, so existing examples may use v2.x APIs; check example dates.
- Async operations: experiment orchestration is event-driven; blocking calls in trial code block resource allocation.
💡Concepts to learn
- Hyperparameter Optimization (HPO) — Core capability of NNI—understanding search spaces, objectives, and algorithm families (grid, random, Bayesian, evolutionary) is essential to using the tuner module effectively
- Neural Architecture Search (NAS) — Major NNI subsystem (nni/nas/); involves automatically designing model architectures via differentiable methods (DARTS) or reinforcement learning (ENAS)—understanding search space representation is critical for extending NAS
- Multi-Armed Bandit (MAB) / Successive Halving — Algorithm underlying the Hyperband and PBT tuners in NNI; understanding early-stopping and resource allocation strategies is necessary for configuring efficient tuning (a toy sketch follows this list)
- Model Compression (Pruning, Quantization, Distillation) — Core NNI subsystem (nni/compression/); techniques to reduce model size/latency—understanding sparsity patterns, bit-width constraints, and knowledge transfer are essential for contributing to compression module
- Bayesian Optimization — Default tuning strategy in many NNI tuners; uses Gaussian processes to model objective function and guide search—understanding acquisition functions (EI, UCB) helps configure tuners effectively
- Distributed Trial Scheduling — NNI's multi-service architecture abstracts trial submission across local, SSH, Kubernetes, and cloud backends—understanding job submission, lifecycle management, and fault tolerance is critical for adding new training services
- Configuration as Code (YAML-driven Experiments) — NNI primary interface uses declarative YAML config files for experiment definition; understanding schema validation, defaults, and mutation strategies is needed for extending the experiment framework
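A toy version of the successive-halving idea referenced above: start many configurations on a small budget, keep the best fraction, and rerun the survivors with more resources. Purely illustrative; NNI's real implementations live in its tuner modules, and `evaluate` here is a user-supplied stand-in for partial training.

```python
# toy successive halving: the resource-allocation core behind Hyperband.
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [(evaluate(cfg, budget), cfg) for cfg in survivors]
        scores.sort(reverse=True, key=lambda pair: pair[0])   # higher is better
        survivors = [cfg for _, cfg in scores[: max(1, len(scores) // eta)]]
        budget *= eta                                         # promote with more resources
    return survivors[0]

# usage sketch: evaluate(config, budget) trains `config` for `budget` epochs
# and returns a validation score, e.g.
#   best = successive_halving([{'lr': 10 ** -i} for i in range(1, 6)],
#                             evaluate=my_partial_train)   # my_partial_train is yours
```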
🔗Related repos
- optuna/optuna — Direct competitor for hyperparameter tuning; simpler single-library approach vs NNI's multi-service ecosystem; no built-in NAS/compression
- pytorch/pytorch — Primary framework target; NNI depends heavily on PyTorch for distributed training and model mutation in NAS
- kubeflow/kubeflow — Kubernetes-native ML platform; NNI integrates Kubeflow as one of its training service backends for distributed trial scheduling
- ray/tune — Distributed hyperparameter tuning library; similar scope to NNI's tuning module, but Ray ecosystem integration vs NNI's multi-backend flexibility
- google/vizier — Google's AutoML service; comparable research-backed tuning algorithms; NNI implements many of the same strategies (Bayesian optimization, PBT)
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add missing CI workflow for model compression functionality
The repo has dedicated issue templates for model-compression-bug-report.md and model-compression-enhancement.md, indicating model compression is a major feature. However, .github/workflows/main.yml likely doesn't have dedicated testing for the model compression module. Adding a separate workflow would ensure model compression changes don't regress and would validate the feature's stability across PRs.
- [ ] Examine .github/workflows/main.yml to identify gaps in model compression test coverage
- [ ] Create .github/workflows/model-compression.yml with steps to run model compression unit tests
- [ ] Add integration tests that validate common model compression scenarios (pruning, quantization, etc.)
- [ ] Document the workflow in docs/ and ensure it triggers on PRs affecting src/model_compression/ or equivalent directories
Create comprehensive documentation index for dependencies management
The repo has 8 dependency files (required.txt, recommended.txt, recommended_gpu.txt, etc.) in dependencies/, but there's no documented guide explaining when to use each file, the differences between legacy vs current versions, or how they relate to installation. New contributors are likely confused about which dependencies to install.
- [ ] Create docs/DEPENDENCIES.md documenting the purpose of each file in dependencies/
- [ ] Explain the difference between legacy, current, and GPU variants
- [ ] Add a decision matrix showing which file to use for different scenarios (dev, production, GPU, legacy systems)
- [ ] Cross-reference this doc from docs/installation.rst if it exists in the live docs
Audit and document removed features in docs/_removed/
The repo has a large docs/_removed/ directory with deprecated content (.rst files for old training services, tuners, and examples). There's no deprecation guide or migration documentation for users upgrading from v1.x to current versions. This is a common source of user confusion and support burden.
- [ ] Create docs/DEPRECATION_GUIDE.md listing all removed features from v1.x
- [ ] For each removed file in docs/_removed/ (e.g., RemoteMachineMode.rst, SmacTuner.rst), add migration path recommendations
- [ ] Document what features replaced each deprecated component (e.g., if RemoteMachineMode was removed, what should users use instead?)
- [ ] Add version tags to indicate when each feature was deprecated and removed
🌿Good first issues
- Add missing unit tests for nni/compression/quantization/ quantizer implementations — currently only integration tests exist; pure unit-test coverage would catch edge cases in quantization schedules
- Improve error messages in nni/experiment/config/ validation: when a required field is missing from the YAML config, users get JSON-schema errors instead of human-readable guidance — add field-specific hints (a sketch follows this list)
- Document the trial callback lifecycle in docs/ with concrete code examples — currently only an API reference exists, with no walkthrough of when report_intermediate_result() vs report_final_result() should be called and their failure modes
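For the error-message issue, one plausible shape is a thin wrapper that translates raw jsonschema errors into field-specific hints. The HINTS table and function name are illustrative, and NNI's actual config validation may not use jsonschema this way; the jsonschema calls themselves (Draft7Validator, iter_errors, absolute_path) are the library's documented API.

```python
# sketch: turn raw jsonschema errors into human-readable, field-specific hints.
from jsonschema import Draft7Validator

HINTS = {
    'trialCommand': "Set trialCommand to the shell command that starts one trial, "
                    "e.g. 'python trial.py'.",
    'searchSpace': "Provide an inline searchSpace or a searchSpaceFile path.",
}

def explain_config_errors(schema: dict, config: dict) -> list:
    messages = []
    for err in Draft7Validator(schema).iter_errors(config):
        field = '.'.join(str(p) for p in err.absolute_path) or '<root>'
        hint = HINTS.get(field.split('.')[0], '')
        messages.append(f"config field '{field}': {err.message}. {hint}".strip())
    return messages
```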
⭐Top contributors
- @J-shang — 20 commits
- @liuzhe-lz — 17 commits
- @Bonytu — 15 commits
- @ultmaster — 15 commits
- @super-dainiu — 14 commits
📝Recent commits
- 767ed7f — update draft note (#5668) (#5700) (Bonytu)
- b84d25b — [Bugbash] fix example bugs (#5637) (J-shang)
- 0322e59 — fix accelerate version bug (#5645) (Bonytu)
- 27a24a1 — [Compression] merge nni.contrib.compression with nni.compression (#5573) (J-shang)
- 8dc1a83 — [compression] fix mask conflict v2 (#5592) (super-dainiu)
- b9d9492 — [bug fix] fix prefix experiment nav highlight issue (#5575) (Lijiaoa)
- 60c9459 — [BugFix] Using the builtin types like int, bool and float. (#5620) (Hzbeta)
- 750546b — [WebUI] Compatible with latest edge browser (#5599) (Lijiaoa)
- 5e22f49 — [Compression] Add bias correction feature for PTQ quantizer (#5603) (Bonytu)
- 9053d65 — [Compression] Add support for deepspeed (#5517) (Bonytu)
🔒Security observations
- High · Outdated pip Package Manager Version — Dockerfile, line: RUN python3 -m pip --no-cache-dir install pip==22.0.3. The Dockerfile pins pip to 22.0.3 (released January 2022), which is significantly outdated and may contain known vulnerabilities; current pip versions include numerous security patches. Fix: update pip to the latest stable version (at minimum 23.x or 24.x), and consider a more recent base image that ships an updated pip.
- High · Outdated PyPI Package Dependencies — Dockerfile, pip install commands. Multiple packages installed in the Dockerfile are significantly outdated with potential known vulnerabilities: numpy==1.22.2 (2022), pandas==1.4.1 (2022), scikit-learn==1.0.2 (2022), scipy==1.8.0 (2022), lightgbm==3.3.2 (2022), and torch==1.10.2 (2021) — all 2+ years old. Fix: update all dependencies to current stable versions and adopt a dependency-management strategy (pip-audit, Dependabot) to track and update vulnerable packages regularly.
- High · Outdated CUDA Base Image — Dockerfile, line: FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04. This base image dates to October 2021 and may contain OS-level vulnerabilities in the underlying Ubuntu 20.04 and CUDA runtime. Fix: update to a recent CUDA base image (e.g., nvidia/cuda:12.x on Ubuntu 22.04 or later) and refresh base images as part of the CI/CD pipeline.
- Medium · Missing Python Package Pinning in Dependencies Files — dependencies/ directory (develop.txt, recommended.txt, required.txt, etc.). While the Dockerfile pins versions, the dependency files likely use flexible version constraints that could pull in vulnerable transitive dependencies; there is no evidence of lock files or dependency auditing. Fix: generate locked dependency files with exact versions for all transitive dependencies (pip-compile or Poetry) and enable security scanning with pip-audit or Safety.
- Medium · OpenSSH Server in Docker Image — Dockerfile, line: openssh-server. Installing openssh-server significantly increases the container's attack surface; it is rarely needed in containerized environments and opens SSH-based attack vectors. Fix: remove openssh-server unless absolutely necessary; use docker exec or orchestration tooling for container access. If SSH is required, document the justification and apply additional hardening.
- Medium · Incomplete Dockerfile Build Cleanup — Dockerfile, final line. While the Dockerfile includes apt cleanup, the final 'RUN python3' line appears incomplete and serves no purpose, suggesting leftover testing code. Fix: remove the incomplete line, ensure all RUN commands are complete and necessary for production builds, and lint Dockerfiles with hadolint.
- Medium · No Security Headers or SBOM Documentation — SECURITY.md and repository root. The repository lacks documented security practices such as a Software Bill of Materials (SBOM), security scanning results, or a vulnerability disclosure timeline in SECURITY.md (the file is truncated in the analyzed content). Fix: complete SECURITY.md with a responsible-disclosure timeline, supported versions, known issues, and dependency-scanning results; generate and publish SBOM files with tools like CycloneDX or Syft.
- Low · PyTorch Installation from URL Without Verification — Dockerfile, PyTorch installation lines. PyTorch is installed from download.pytorch.org via the -f flag without hash verification or GPG signature validation; the source is reputable, but there is no integrity check. Fix: verify downloaded packages with checksums or signatures, document expected checksums for audit purposes, and prefer official PyPI packages where possible.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.