microsoft/nni
An open source AutoML toolkit that automates the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyperparameter tuning.
Stale — last commit 2y ago
Weakest axis: last commit was 2y ago; no tests detected
Has a license and CI — a clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ 13 active contributors
- ✓ Distributed ownership (top contributor: 20% of recent commits)
- ✓ MIT licensed
- ✓ CI configured
- ⚠ Stale — last commit 2y ago
- ⚠ No test directory detected
What would change the summary?
- → Use as dependency: Mixed → Healthy if ≥1 commit in the last 365 days
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Forkable" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/microsoft/nni) — paste at the top of your README.md; it renders inline like a shields.io badge.
Social card preview (1200×630): this card auto-renders when someone shares https://repopilot.app/r/microsoft/nni on X, Slack, or LinkedIn.
Onboarding: microsoft/nni
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/microsoft/nni shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Stale — last commit 2y ago
- 13 active contributors
- Distributed ownership (top contributor 20% of recent commits)
- MIT licensed
- CI configured
- ⚠ Stale — last commit 2y ago
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live microsoft/nni
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/microsoft/nni.
What it runs against: a local clone of microsoft/nni — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in microsoft/nni | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 703 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of microsoft/nni. If you don't
# have one yet, run these first:
#
# git clone https://github.com/microsoft/nni.git
# cd nni
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of microsoft/nni and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "microsoft/nni(\.git)?\b" \
  && ok "origin remote is microsoft/nni" \
  || miss "origin remote is not microsoft/nni (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "README.md" \
  && ok "README.md" \
  || miss "missing critical file: README.md"
test -f "dependencies/required.txt" \
  && ok "dependencies/required.txt" \
  || miss "missing critical file: dependencies/required.txt"
test -f ".github/workflows/main.yml" \
  && ok ".github/workflows/main.yml" \
  || miss "missing critical file: .github/workflows/main.yml"
test -f "Dockerfile" \
  && ok "Dockerfile" \
  || miss "missing critical file: Dockerfile"
test -f "SECURITY.md" \
  && ok "SECURITY.md" \
  || miss "missing critical file: SECURITY.md"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 703 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~673d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/microsoft/nni"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
NNI (Neural Network Intelligence) is an open-source AutoML toolkit that automates the full machine learning lifecycle: hyperparameter tuning, neural architecture search (NAS), feature engineering, and model compression. It provides tuning algorithms, training service integrations, and APIs to automatically optimize ML models across diverse frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost) and deployment targets (local, remote, cloud, Kubernetes). It is a monorepo: the Python core (nni/) houses the tuning engine, the TypeScript/JavaScript frontend (ts/) provides the web UI, and documentation lives in docs/. Core modules: nni/tuner/ (algorithm implementations), nni/experiment/ (experiment orchestration), nni/compression/ (model compression), nni/nas/ (architecture search), nni/training_service/ (runtime backends). Dependencies are segmented in the dependencies/ folder (required.txt, recommended.txt, GPU variants).
👥Who it's for
ML engineers and researchers who need to automatically optimize model hyperparameters, explore architecture spaces, or compress models without manually trying thousands of configurations. Contributors are typically AutoML researchers extending tuning algorithms or adding new training service backends.
🌱Maturity & risk
Mature and production-tested, though no longer actively developed: the last commit was roughly two years ago. The repo shows a v3.0 release (May 2022), 3.9M lines of Python code, comprehensive CI/CD via GitHub Actions (.github/workflows/main.yml), and institutional backing from Microsoft. However, v3.0 is marked 'preview', suggesting API consolidation was still in progress when activity slowed.
Moderate risk: a large monorepo (3.9M Python lines) with a complex multi-service architecture (tuners, training services, compression modules) increases the maintenance burden. Multiple training service backends (Kubernetes, AML, Kubeflow, PAI) mean dependencies on external systems. The 'v3.0 preview' status indicates breaking changes were still landing. Open issue and PR counts are substantial (visible in the README badges), suggesting a backlog exists.
Active areas of work
The most recent development focused on v3.0, modernizing APIs and documentation (docs/ upgraded per the README). Multiple bug report and enhancement issue templates (.github/ISSUE_TEMPLATE/) reflect community engagement at scale. Research papers (OSDI 2022, CVPR 2022) indicate algorithmic innovation during the active period. New demo content was also added (YouTube/Bilibili links in the README).
🚀Get running
git clone https://github.com/microsoft/nni.git
cd nni
pip install -r dependencies/required.txt
pip install -r dependencies/setup.txt
pip install -e .
Daily commands:
To launch an experiment: nnictl create --config <config.yml> (equivalently python -m nni.tools.nnictl create --config <config.yml>). To manage it: nnictl stop / nnictl view. Development server for the web UI: cd ts && npm install && npm start (inferred from TypeScript presence). See docs/ for detailed tutorial examples. A Python-API alternative to the YAML workflow is sketched below.
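Below is a minimal sketch of launching an experiment through NNI's Python Experiment API, following the field names from NNI's HPO quickstart (v2.x/v3.x era). The search-space values and the trial.py filename are illustrative, and field names may differ across versions, so verify against docs/ before relying on it.

```python
# launch.py: minimal NNI experiment via the Python API (sketch; verify field
# names against your installed NNI version). Assumes a trial.py that calls
# nni.get_next_parameter() and nni.report_final_result().
from nni.experiment import Experiment

search_space = {
    'lr':       {'_type': 'loguniform', '_value': [1e-5, 1e-1]},
    'momentum': {'_type': 'uniform',    '_value': [0.5, 0.99]},
}

experiment = Experiment('local')                 # training service backend
experiment.config.trial_command = 'python trial.py'
experiment.config.trial_code_directory = '.'
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args = {'optimize_mode': 'maximize'}
experiment.config.max_trial_number = 20
experiment.config.trial_concurrency = 2

experiment.run(8080)                             # web UI at http://localhost:8080
```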
🗺️Map of the codebase
- README.md — Entry point documenting NNI's core mission (AutoML for feature engineering, NAS, hyperparameter tuning, model compression) and primary use cases.
- dependencies/required.txt — Defines minimal dependencies for the NNI runtime; essential for understanding core framework requirements and compatibility constraints.
- .github/workflows/main.yml — CI/CD pipeline defining build, test, and release processes; critical for understanding how changes are validated before merge.
- Dockerfile — Container build definition; essential for deployment and reproducibility of NNI in containerized environments.
- SECURITY.md — Security policy and vulnerability reporting process; mandatory for all contributors to understand responsible disclosure.
- LICENSE — MIT license; establishes the legal framework and usage rights for all code in the repository.
- .readthedocs.yaml — Documentation build configuration; controls how docs are generated and deployed, critical for maintaining API documentation.
🛠️How to make changes
Add a New Hyperparameter Tuning Algorithm (Tuner)
- Create a new tuner class inheriting from the base Tuner interface in nni.tuner (nni/algorithms/hpo/[algorithm_name]_tuner.py)
- Implement the required methods: update_search_space(), generate_parameters(), receive_trial_result() (same file; a minimal skeleton follows this list)
- Register the tuner in the algorithm registry for auto-discovery (nni/algorithms/__init__.py)
- Add dependencies to dependencies/recommended.txt if the algorithm requires external libraries
- Create an example experiment configuration in the docs demonstrating the tuner (docs/reference/tuners/[algorithm_name].rst)
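A minimal skeleton under the interface named above. The three method signatures follow nni.tuner.Tuner as documented; the class name, random-search logic, and supported space types are illustrative, and registry wiring varies between NNI versions.

```python
# sketch: a random-search tuner implementing the base Tuner interface.
# Treat this as a shape, not a drop-in module; module paths and registration
# differ across NNI versions.
import random

from nni.tuner import Tuner

class MyRandomTuner(Tuner):
    def __init__(self, seed=0):
        self.space = {}
        self.rng = random.Random(seed)

    def update_search_space(self, search_space):
        # Called at startup (and again if the search space changes).
        self.space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        # Return one concrete configuration drawn from the search space.
        params = {}
        for name, spec in self.space.items():
            if spec['_type'] == 'choice':
                params[name] = self.rng.choice(spec['_value'])
            elif spec['_type'] == 'uniform':
                low, high = spec['_value']
                params[name] = self.rng.uniform(low, high)
        return params

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        # Random search ignores feedback; a smarter tuner updates its model here.
        pass
```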
Add a New Model Compression Algorithm (Pruner or Quantizer)
- Create a compression algorithm class inheriting from the base Compressor interface (nni/compression/[compression_type]/[algorithm_name].py)
- Implement the compression logic for supported frameworks (PyTorch, TensorFlow) in the same file (a usage sketch follows this list)
- Add unit tests and integration tests validating compression ratio and accuracy (tests/compression/[compression_type]/test_[algorithm_name].py)
- Update the compression documentation with an algorithm overview and example usage (docs/compression/[compression_type]/[algorithm_name].rst)
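For orientation, here is how a built-in pruner is consumed, using the v2.x-style API. Note that v3.0 merged nni.contrib.compression into nni.compression (see Recent commits below), so import paths and config keys may differ on your checkout; verify before copying.

```python
# sketch: applying a built-in pruner (v2.x-style API; paths may differ in v3.0).
import torch

from nni.compression.pytorch.pruning import L1NormPruner
from nni.compression.pytorch.speedup import ModelSpeedup

model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
)
config_list = [{'sparsity_per_layer': 0.5, 'op_types': ['Linear']}]

pruner = L1NormPruner(model, config_list)
_, masks = pruner.compress()          # compute masks; weights are not yet removed
pruner._unwrap_model()                # strip wrappers before speedup
ModelSpeedup(model, torch.rand(1, 64), masks).speedup_model()
```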
Add a New Neural Architecture Search (NAS) Algorithm
- Create a search strategy class in nni.nas implementing the SearchAlgorithm interface (nni/nas/[strategy_name].py)
- Define a search space (cell-based, macro, or custom) compatible with NNI's mutator framework (nni/nas/space/[search_space_name].py)
- Implement the search loop: sample architecture → train → evaluate → update strategy (nni/nas/[strategy_name].py; a toy version of this loop follows the list)
- Create a benchmark example against NAS-Bench-201 or a custom benchmark dataset (examples/nas/[strategy_name]_benchmark.py)
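To make the loop concrete, here is a framework-agnostic toy version. Every name in it (RandomSearchStrategy, sample, run_search, train_and_eval) is illustrative, not NNI's real API; check nni/nas/ for the actual strategy base classes.

```python
# illustrative sketch of the sample -> train -> evaluate -> update loop.
import random

class RandomSearchStrategy:
    """Toy strategy: sample architectures uniformly, keep the best."""

    def __init__(self, search_space):
        self.search_space = search_space   # e.g. {'depth': [2, 4, 8], 'width': [64, 128]}
        self.best = (None, float('-inf'))

    def sample(self):
        return {k: random.choice(v) for k, v in self.search_space.items()}

    def update(self, arch, score):
        if score > self.best[1]:
            self.best = (arch, score)

def run_search(strategy, train_and_eval, budget=20):
    for _ in range(budget):
        arch = strategy.sample()           # 1. sample an architecture
        score = train_and_eval(arch)       # 2-3. train it, evaluate it
        strategy.update(arch, score)       # 4. feed the result back
    return strategy.best
```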
Add a New Training Service Backend (Distributed Executor)
- Create a training service class implementing the TrainingService interface (nni/training_services/[backend_name]/training_service.py; an illustrative interface sketch follows this list)
- Implement job submission, status polling, and resource management for the target platform (same file)
- Add a configuration schema for the training service backend (nni/training_services/[backend_name]/config.py)
- Register the training service in the ServiceFactory for auto-discovery (nni/training_services/__init__.py)
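A hypothetical shape of the backend contract, to show what "submission, polling, and resource management" amounts to. The method names below are illustrative and chosen to mirror the steps above; the real interface in the repo (historically split between Python and the TypeScript manager under ts/) will differ.

```python
# hypothetical training-service interface; names are illustrative, not NNI's.
from abc import ABC, abstractmethod

class TrainingService(ABC):
    @abstractmethod
    def submit_trial(self, trial_command: str, resources: dict) -> str:
        """Submit one trial job to the platform; return a job ID."""

    @abstractmethod
    def poll_status(self, job_id: str) -> str:
        """Return RUNNING / SUCCEEDED / FAILED for a submitted job."""

    @abstractmethod
    def cancel(self, job_id: str) -> None:
        """Stop a job and release its resources."""
```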
🔧Why these technologies
- Python — Primary language for ML/DL research; facilitates integration with PyTorch, TensorFlow, scikit-learn; enables rapid prototyping of AutoML algorithms
- Docker — Enables reproducible, containerized deployments across heterogeneous cloud and on-premises clusters; simplifies dependency management
- Sphinx + ReadTheDocs — Industry-standard for Python API documentation; automates doc generation from docstrings; supports versioning and multi-language localization (Crowdin)
- GitHub Actions — Native CI/CD for GitHub; enables automated testing on multiple Python versions and platforms; streamlines release workflows
- MIT License — Permissive open-source license; enables broad adoption in academia and industry without restrictive reciprocal obligations
⚖️Trade-offs already made
- Pluggable tuner/pruner/NAS algorithm architecture (not monolithic)
  - Why: Allows researchers to contribute custom algorithms without forking; reduces core maintenance burden
  - Consequence: Slightly more abstraction overhead; requires users to understand plugin registration patterns
- Support for multiple training service backends (local, remote, Kubernetes, cloud providers)
  - Why: Maximizes flexibility for diverse deployment scenarios (single machine to 1000+ node clusters)
  - Consequence: Higher code complexity; more surface area for bugs; harder to optimize for a single platform
- Framework-agnostic trial interface (works with PyTorch, TensorFlow, XGBoost, etc.)
  - Why: Enables adoption across the ML ecosystem; future-proofs against framework shifts
  - Consequence: Cannot leverage framework-specific optimizations; trial code must explicitly log metrics to the NNI API (see the trial sketch after this list)
- Python-only SDK (no C++/Rust core libraries for core tuning logic)
  - Why: Simpler maintenance; faster iteration on algorithms; easier for community contributions
  - Consequence: Potential latency overhead for high-frequency operations (millions of trials); less suitable for real-time embedded AutoML
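The explicit metric-logging contract looks like this in a trial script. nni.get_next_parameter() and the two report_* calls are NNI's documented trial API; the "training" math is a placeholder.

```python
# trial.py: the framework-agnostic trial interface (sketch).
import nni

params = nni.get_next_parameter()               # configuration chosen by the tuner
lr = params.get('lr', 0.01)

accuracy = 0.0
for epoch in range(10):
    accuracy = 1.0 - (0.5 ** (epoch + 1)) - lr  # placeholder "training" curve
    nni.report_intermediate_result(accuracy)    # feeds early-stopping assessors

nni.report_final_result(accuracy)               # required, or the experiment hangs
```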
🚫Non-goals (don't propose these)
- Real-time online learning: NNI evaluates models serially or in fixed batches; not designed for streaming data or continual learning workflows
- Automatic data preprocessing: Does not infer missing value imputation, scaling, or categorical encoding; focuses on hyperparameter tuning and architecture search
- Model deployment and serving: NNI finds good models but does not manage model serving infrastructure (cf. KServe, BentoML)
- GPU memory optimization: Does not automatically tune batch size or gradient checkpointing; users must manually configure
- Windows as a primary platform: documentation emphasizes Linux and macOS workflows; Windows does not appear to be a first-class target
🪤Traps & gotchas
- Config-driven: experiments require YAML configuration files; a Python API exists, but YAML is the primary input format, and missing config fields cause cryptic errors.
- Trial reporting: trials must call nni.report_intermediate_result() / nni.report_final_result() or the experiment hangs; there is no automatic metrics collection.
- Training service startup: each backend (Kubernetes, AML, Kubeflow) requires pre-configured credentials/cluster access; local mode works without setup, but cloud backends fail silently if misconfigured.
- Version gaps: v3.0 is a preview, so existing examples may use v2.x APIs; check example dates.
- Async operations: experiment orchestration is event-driven; blocking calls in trial code block resource allocation.
💡Concepts to learn
- Hyperparameter Optimization (HPO) — Core capability of NNI—understanding search spaces, objectives, and algorithm families (grid, random, Bayesian, evolutionary) is essential to using the tuner module effectively
- Neural Architecture Search (NAS) — Major NNI subsystem (nni/nas/); involves automatically designing model architectures via differentiable methods (DARTS) or reinforcement learning (ENAS)—understanding search space representation is critical for extending NAS
- Multi-Armed Bandit (MAB) / Successive Halving — Algorithm underlying the Hyperband and PBT tuners in NNI; understanding early-stopping and resource allocation strategies is necessary for configuring efficient tuning (a toy sketch follows this list)
- Model Compression (Pruning, Quantization, Distillation) — Core NNI subsystem (nni/compression/); techniques to reduce model size/latency—understanding sparsity patterns, bit-width constraints, and knowledge transfer are essential for contributing to compression module
- Bayesian Optimization — Default tuning strategy in many NNI tuners; uses Gaussian processes to model objective function and guide search—understanding acquisition functions (EI, UCB) helps configure tuners effectively
- Distributed Trial Scheduling — NNI's multi-service architecture abstracts trial submission across local, SSH, Kubernetes, and cloud backends—understanding job submission, lifecycle management, and fault tolerance is critical for adding new training services
- Configuration as Code (YAML-driven Experiments) — NNI primary interface uses declarative YAML config files for experiment definition; understanding schema validation, defaults, and mutation strategies is needed for extending the experiment framework
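A toy version of the successive-halving idea referenced above: start many configurations on a small budget, keep the best fraction, and rerun the survivors with more resources. Purely illustrative; NNI's real implementations live in its tuner modules, and `evaluate` here is a user-supplied stand-in for partial training.

```python
# toy successive halving: the resource-allocation core behind Hyperband.
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [(evaluate(cfg, budget), cfg) for cfg in survivors]
        scores.sort(reverse=True, key=lambda pair: pair[0])   # higher is better
        survivors = [cfg for _, cfg in scores[: max(1, len(scores) // eta)]]
        budget *= eta                                         # promote with more resources
    return survivors[0]

# usage sketch: evaluate(config, budget) trains `config` for `budget` epochs
# and returns a validation score, e.g.
#   best = successive_halving([{'lr': 10 ** -i} for i in range(1, 6)],
#                             evaluate=my_partial_train)   # my_partial_train is yours
```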
🔗Related repos
- optuna/optuna — Direct competitor for hyperparameter tuning; simpler single-library approach vs NNI's multi-service ecosystem; no built-in NAS/compression
- pytorch/pytorch — Primary framework target; NNI depends heavily on PyTorch for distributed training and model mutation in NAS
- kubeflow/kubeflow — Kubernetes-native ML platform; NNI integrates Kubeflow as one of its training service backends for distributed trial scheduling
- ray/tune — Distributed hyperparameter tuning library; similar scope to NNI's tuning module, but Ray ecosystem integration vs NNI's multi-backend flexibility
- google/vizier — Google's AutoML service; comparable research-backed tuning algorithms; NNI implements many of the same strategies (Bayesian optimization, PBT)
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add missing CI workflow for model compression functionality
The repo has dedicated issue templates for model-compression-bug-report.md and model-compression-enhancement.md, indicating model compression is a major feature. However, .github/workflows/main.yml likely doesn't have dedicated testing for the model compression module. Adding a separate workflow would ensure model compression changes don't regress and would validate the feature's stability across PRs.
- [ ] Examine .github/workflows/main.yml to identify gaps in model compression test coverage
- [ ] Create .github/workflows/model-compression.yml with steps to run model compression unit tests
- [ ] Add integration tests that validate common model compression scenarios (pruning, quantization, etc.)
- [ ] Document the workflow in docs/ and ensure it triggers on PRs affecting src/model_compression/ or equivalent directories
Create comprehensive documentation index for dependencies management
The repo has 8 dependency files (required.txt, recommended.txt, recommended_gpu.txt, etc.) in dependencies/, but there's no documented guide explaining when to use each file, the differences between legacy vs current versions, or how they relate to installation. New contributors are likely confused about which dependencies to install.
- [ ] Create docs/DEPENDENCIES.md documenting the purpose of each file in dependencies/
- [ ] Explain the difference between legacy, current, and GPU variants
- [ ] Add a decision matrix showing which file to use for different scenarios (dev, production, GPU, legacy systems)
- [ ] Cross-reference this doc from docs/installation.rst if it exists in the live docs
Audit and document removed features in docs/_removed/
The repo has a large docs/_removed/ directory with deprecated content (.rst files for old training services, tuners, and examples). There's no deprecation guide or migration documentation for users upgrading from v1.x to current versions. This is a common source of user confusion and support burden.
- [ ] Create docs/DEPRECATION_GUIDE.md listing all removed features from v1.x
- [ ] For each removed file in docs/_removed/ (e.g., RemoteMachineMode.rst, SmacTuner.rst), add migration path recommendations
- [ ] Document what features replaced each deprecated component (e.g., if RemoteMachineMode was removed, what should users use instead?)
- [ ] Add version tags to indicate when each feature was deprecated and removed
🌿Good first issues
- Add missing unit tests for nni/compression/quantization/ quantizer implementations — currently only integration tests exist; pure unit-test coverage would catch edge cases in quantization schedules
- Improve error messages in nni/experiment/config/ validation: when a required field is missing from the YAML config, users get JSON-schema errors instead of human-readable guidance — add field-specific hints (a sketch follows this list)
- Document the trial callback lifecycle in docs/ with concrete code examples — currently only an API reference exists, with no walkthrough of when report_intermediate_result() vs report_final_result() should be called and their failure modes
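For the error-message issue, one plausible shape is a thin wrapper that translates raw jsonschema errors into field-specific hints. The HINTS table and function name are illustrative, and NNI's actual config validation may not use jsonschema this way; the jsonschema calls themselves (Draft7Validator, iter_errors, absolute_path) are the library's documented API.

```python
# sketch: turn raw jsonschema errors into human-readable, field-specific hints.
from jsonschema import Draft7Validator

HINTS = {
    'trialCommand': "Set trialCommand to the shell command that starts one trial, "
                    "e.g. 'python trial.py'.",
    'searchSpace': "Provide an inline searchSpace or a searchSpaceFile path.",
}

def explain_config_errors(schema: dict, config: dict) -> list:
    messages = []
    for err in Draft7Validator(schema).iter_errors(config):
        field = '.'.join(str(p) for p in err.absolute_path) or '<root>'
        hint = HINTS.get(field.split('.')[0], '')
        messages.append(f"config field '{field}': {err.message}. {hint}".strip())
    return messages
```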
⭐Top contributors
- @J-shang — 20 commits
- @liuzhe-lz — 17 commits
- @Bonytu — 15 commits
- @ultmaster — 15 commits
- @super-dainiu — 14 commits
📝Recent commits
- 767ed7f — update draft note (#5668) (#5700) (Bonytu)
- b84d25b — [Bugbash] fix example bugs (#5637) (J-shang)
- 0322e59 — fix accelerate version bug (#5645) (Bonytu)
- 27a24a1 — [Compression] merge nni.contrib.compression with nni.compression (#5573) (J-shang)
- 8dc1a83 — [compression] fix mask conflict v2 (#5592) (super-dainiu)
- b9d9492 — [bug fix] fix prefix experiment nav highlight issue (#5575) (Lijiaoa)
- 60c9459 — [BugFix] Using the builtin types like int, bool and float. (#5620) (Hzbeta)
- 750546b — [WebUI] Compatible with latest edge browser (#5599) (Lijiaoa)
- 5e22f49 — [Compression] Add bias correction feature for PTQ quantizer (#5603) (Bonytu)
- 9053d65 — [Compression] Add support for deepspeed (#5517) (Bonytu)
🔒Security observations
- High · Outdated pip Package Manager Version — Dockerfile, line: RUN python3 -m pip --no-cache-dir install pip==22.0.3. The Dockerfile pins pip to 22.0.3 (released January 2022), which is significantly outdated and may contain known vulnerabilities; current pip versions include numerous security patches. Fix: update pip to the latest stable version (at minimum 23.x or 24.x), and consider a more recent base image that ships an updated pip.
- High · Outdated PyPI Package Dependencies — Dockerfile, pip install commands. Multiple packages installed in the Dockerfile are significantly outdated with potential known vulnerabilities: numpy==1.22.2 (2022), pandas==1.4.1 (2022), scikit-learn==1.0.2 (2022), scipy==1.8.0 (2022), lightgbm==3.3.2 (2022), and torch==1.10.2 (2021) — all 2+ years old. Fix: update all dependencies to current stable versions and adopt a dependency-management strategy (pip-audit, Dependabot) to track and update vulnerable packages regularly.
- High · Outdated CUDA Base Image — Dockerfile, line: FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04. This base image dates to October 2021 and may contain OS-level vulnerabilities in the underlying Ubuntu 20.04 and CUDA runtime. Fix: update to a recent CUDA base image (e.g., nvidia/cuda:12.x on Ubuntu 22.04 or later) and refresh base images as part of the CI/CD pipeline.
- Medium · Missing Python Package Pinning in Dependencies Files — dependencies/ directory (develop.txt, recommended.txt, required.txt, etc.). While the Dockerfile pins versions, the dependency files likely use flexible version constraints that could pull in vulnerable transitive dependencies; there is no evidence of lock files or dependency auditing. Fix: generate locked dependency files with exact versions for all transitive dependencies (pip-compile or Poetry) and enable security scanning with pip-audit or Safety.
- Medium · OpenSSH Server in Docker Image — Dockerfile, line: openssh-server. Installing openssh-server significantly increases the container's attack surface; it is rarely needed in containerized environments and opens SSH-based attack vectors. Fix: remove openssh-server unless absolutely necessary; use docker exec or orchestration tooling for container access. If SSH is required, document the justification and apply additional hardening.
- Medium · Incomplete Dockerfile Build Cleanup — Dockerfile, final line. While the Dockerfile includes apt cleanup, the final 'RUN python3' line appears incomplete and serves no purpose, suggesting leftover testing code. Fix: remove the incomplete line, ensure all RUN commands are complete and necessary for production builds, and lint Dockerfiles with hadolint.
- Medium · No Security Headers or SBOM Documentation — SECURITY.md and repository root. The repository lacks documented security practices such as a Software Bill of Materials (SBOM), security scanning results, or a vulnerability disclosure timeline in SECURITY.md (the file is truncated in the analyzed content). Fix: complete SECURITY.md with a responsible-disclosure timeline, supported versions, known issues, and dependency-scanning results; generate and publish SBOM files with tools like CycloneDX or Syft.
- Low · PyTorch Installation from URL Without Verification — Dockerfile, PyTorch installation lines. PyTorch is installed from download.pytorch.org via the -f flag without hash verification or GPG signature validation; the source is reputable, but there is no integrity check. Fix: verify downloaded packages with checksums or signatures, document expected checksums for audit purposes, and prefer official PyPI packages where possible.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.