Dod-o/Statistical-Learning-Method_Code
Hand-written implementations of every algorithm in Li Hang's Statistical Learning Methods (《统计学习方法》)
Slowing — last commit 5mo ago
Weakest axis: no license — legally unclear to depend on; no CI workflows detected
Documented and popular — a useful reference codebase to read through.
- ✓ Last commit 5mo ago
- ✓ 4 active contributors
- ✓ Tests present
- ⚠ Slowing — last commit 5mo ago
- ⚠ Small team — 4 contributors active in recent commits
- ⚠ Single-maintainer risk — top contributor 83% of recent commits
- ⚠ No license — legally unclear to depend on
- ⚠ No CI workflows detected
What would change the summary?
- → Use as dependency: Concerns → Mixed if they publish a permissive license (MIT, Apache-2.0, etc.)
- → Fork & modify: Concerns → Mixed if they add a LICENSE file
- → Deploy as-is: Concerns → Mixed if they add a LICENSE file
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Great to learn from" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/dod-o/statistical-learning-method_code) — paste at the top of your README.md; it renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/dod-o/statistical-learning-method_code on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: Dod-o/Statistical-Learning-Method_Code
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Dod-o/Statistical-Learning-Method_Code shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Slowing — last commit 5mo ago
- Last commit 5mo ago
- 4 active contributors
- Tests present
- ⚠ Slowing — last commit 5mo ago
- ⚠ Small team — 4 contributors active in recent commits
- ⚠ Single-maintainer risk — top contributor 83% of recent commits
- ⚠ No license — legally unclear to depend on
- ⚠ No CI workflows detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live Dod-o/Statistical-Learning-Method_Code
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/Dod-o/Statistical-Learning-Method_Code.
What it runs against: a local clone of Dod-o/Statistical-Learning-Method_Code — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Dod-o/Statistical-Learning-Method_Code | Confirms the artifact applies here, not a fork |
| 2 | Default branch master exists | Catches branch renames |
| 3 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 4 | Last commit ≤ 193 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Dod-o/Statistical-Learning-Method_Code. If you don't
# have one yet, run these first:
#
# git clone https://github.com/Dod-o/Statistical-Learning-Method_Code.git
# cd Statistical-Learning-Method_Code
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of Dod-o/Statistical-Learning-Method_Code and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "Dod-o/Statistical-Learning-Method_Code(\.git)?\b" \
  && ok "origin remote is Dod-o/Statistical-Learning-Method_Code" \
  || miss "origin remote is not Dod-o/Statistical-Learning-Method_Code (artifact may be from a fork)"
# 2. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 3. Critical files exist
test -f "perceptron/perceptron_dichotomy.py" \
  && ok "perceptron/perceptron_dichotomy.py" \
  || miss "missing critical file: perceptron/perceptron_dichotomy.py"
test -f "DecisionTree/DecisionTree.py" \
  && ok "DecisionTree/DecisionTree.py" \
  || miss "missing critical file: DecisionTree/DecisionTree.py"
test -f "SVM/SVM.py" \
  && ok "SVM/SVM.py" \
  || miss "missing critical file: SVM/SVM.py"
test -f "NaiveBayes/NaiveBayes.py" \
  && ok "NaiveBayes/NaiveBayes.py" \
  || miss "missing critical file: NaiveBayes/NaiveBayes.py"
test -f "EM/EM.py" \
  && ok "EM/EM.py" \
  || miss "missing critical file: EM/EM.py"
# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 193 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~163d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Dod-o/Statistical-Learning-Method_Code"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
A comprehensive hand-written implementation of all algorithms from Li Hang's 'Statistical Learning Methods' textbook, covering supervised learning (perceptron, KNN, naive Bayes, decision trees, logistic regression, SVM, AdaBoost) and unsupervised learning (K-means, hierarchical clustering, EM, HMM, PCA, LDA, LSA, PLSA, PageRank). It serves as a reference implementation with fully annotated code tracing to textbook formulas.

Chapter-based monorepo structure: each algorithm gets its own top-level directory (e.g., SVM/, KNN/, Clustering/) containing a .py implementation file and optionally a .ipynb Jupyter notebook. Clustering/ has sub-directories (K-means_Clustering/, Hierachical_Clustering/). Supporting data lives alongside code (iris.data, bbc_text.csv, MNIST/ folder). Blogs/ holds supplementary PDF articles keyed to chapters.
👥Who it's for
Machine learning students and practitioners learning statistical learning theory who need working implementations that directly map to Li Hang's textbook formulas; researchers validating algorithmic understanding; educators building ML curricula in Chinese-speaking contexts.
🌱Maturity & risk
Actively maintained but pre-1.0: the repo shows regular updates (book branch announced Feb 2022, unsupervised updates Jan 2021), has accompanying PDF blogs for multiple chapters, and integrates educational content. However, there is no visible CI/CD, no formal test suite, and commits appear episodic rather than continuous, suggesting it's a well-curated educational project rather than production software.
Standard open source risks apply.
Active areas of work
Transition to formal publication: code is being restructured in the book branch for publication by 人民邮电出版社 (People's Posts and Telecommunications Press); offline ML/MLP/CV training program in planning; community contributions welcome via issues and PRs. Unsupervised algorithms section was recently expanded by Harold-Ran.
🚀Get running
git clone https://github.com/Dod-o/Statistical-Learning-Method_Code.git
cd Statistical-Learning-Method_Code
# For MNIST: extract Mnist/mnist_train.rar and Mnist/mnist_test.rar (e.g., with unrar x)
# No setup.py or requirements.txt; install dependencies manually:
pip install numpy scipy pandas scikit-learn matplotlib jupyter
Daily commands:
# Run a single algorithm (e.g., KNN):
python KNN/KNN.py
# Or open Jupyter notebooks for interactive learning:
jupyter notebook Clustering/K-means_Clustering/K-means_Clustering.ipynb
jupyter notebook PCA/PCA.ipynb
No centralized test suite or main entry point; each module is independent.
🗺️Map of the codebase
- perceptron/perceptron_dichotomy.py — Entry point for the perceptron algorithm (Chapter 2); the foundational supervised learning method that most other algorithms build on conceptually
- DecisionTree/DecisionTree.py — Core decision tree implementation (Chapter 5); one of the most commonly referenced supervised methods in the repo, with extensive comments
- SVM/SVM.py — Support Vector Machine implementation (Chapter 7); a complex optimization-based algorithm central to supervised learning theory
- NaiveBayes/NaiveBayes.py — Naive Bayes classifier (Chapter 4); a foundational probabilistic model used across multiple downstream algorithms
- EM/EM.py — Expectation-Maximization algorithm; the unsupervised core method that underpins the HMM and clustering implementations
- HMM/HMM.py — Hidden Markov Model (Chapter 10); a sequence modeling algorithm that extends EM with temporal structure
- AdaBoost/AdaBoost.py — AdaBoost ensemble method (Chapter 8); demonstrates the weighted iteration pattern replicated in gradient boosting variants
🛠️How to make changes
Add a New Supervised Learning Algorithm
- Create a new directory in the repo root named after the algorithm (e.g., GradientBoosting/)
- Implement the main algorithm class in a .py file (GradientBoosting/GradientBoosting.py) with extensive inline comments explaining each formula step
- Reference the book chapter and equations in comments at the top of the file, following the pattern from DecisionTree.py or SVM.py
- Include a demonstration with a standard dataset (iris, MNIST, or synthetic data) showing the train/test workflow
- Optionally create a Jupyter notebook (GradientBoosting/GradientBoosting.ipynb) in the same directory for interactive exploration, following the PCA/PCA.ipynb pattern
- Update README.md to add the new chapter/algorithm to the supervised learning section, with a link to the blog post if available
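As a sketch of what such a file might look like, here is a hypothetical GradientBoosting/GradientBoosting.py skeleton following the repo's pattern (the class, its parameters, and the decision-stump approach are illustrative assumptions, not code from the repo; chapter/equation references are placeholders):

```python
# GradientBoosting/GradientBoosting.py — hypothetical skeleton in the repo's style.
# Reference: Li Hang, "Statistical Learning Methods", Chapter 8 (boosting).
# Equation references in comments are placeholders to fill in against the book.
import numpy as np

class GradientBoosting:
    """Gradient boosting with regression stumps (squared loss)."""

    def __init__(self, n_rounds=10, learning_rate=0.1):
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.stumps = []  # each stump: (feature_index, threshold, left_value, right_value)

    def fit(self, X, y):
        # Initialize with the mean prediction, then fit stumps to residuals.
        self.base = float(np.mean(y))
        residual = y - self.base
        for _ in range(self.n_rounds):
            stump = self._fit_stump(X, residual)
            residual = residual - self.learning_rate * self._stump_predict(stump, X)
            self.stumps.append(stump)
        return self

    def _fit_stump(self, X, residual):
        # Greedy one-split regression stump: choose the (feature, threshold)
        # minimizing squared error of a piecewise-constant fit.
        best, best_err = None, float("inf")
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left, right = residual[X[:, j] <= t], residual[X[:, j] > t]
                lv = left.mean() if left.size else 0.0
                rv = right.mean() if right.size else 0.0
                err = ((left - lv) ** 2).sum() + ((right - rv) ** 2).sum()
                if err < best_err:
                    best_err, best = err, (j, t, lv, rv)
        return best

    def _stump_predict(self, stump, X):
        j, t, lv, rv = stump
        return np.where(X[:, j] <= t, lv, rv)

    def predict(self, X):
        out = np.full(X.shape[0], self.base)
        for stump in self.stumps:
            out += self.learning_rate * self._stump_predict(stump, X)
        return out
```

A demonstration block at the bottom of the file (train on iris or a synthetic split, print accuracy) would complete the repo's usual structure.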
Add a New Unsupervised Learning Algorithm
- Create a directory under Clustering/ or a new top-level directory (e.g., GaussianMixture/) following repo conventions
- Implement the algorithm class in GaussianMixture/GaussianMixture.py, clearly documenting the E-step and M-step if EM-based (see EM/EM.py for the pattern)
- Add dataset-handling code to load standard datasets such as iris.data or bbc_text.csv
- Include visualization code (clustering plots, elbow curves, or topic visualizations) in a Jupyter notebook (GaussianMixture/GaussianMixture.ipynb)
- Add evaluation metrics (silhouette score, inertia, perplexity) following the Clustering/K-means_Clustering/K-means_Clustering.py pattern
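To illustrate the E-step/M-step structure such a module would document, here is a minimal one-dimensional two-component Gaussian mixture fitted with EM. This is an illustrative sketch, not code from EM/EM.py; the function name and initialization scheme are assumptions:

```python
import numpy as np

def gmm_em_1d(x, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture to samples x via EM.
    Returns (weights pi, means mu, variances var)."""
    # Initialize: means at the data extremes, equal weights, data variance.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = P(component k | x_i).
        dens = (pi / np.sqrt(2 * np.pi * var)) * np.exp(
            -(x[:, None] - mu) ** 2 / (2 * var))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities.
        nk = gamma.sum(axis=0)
        pi = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var
```

The same alternation (compute responsibilities, then re-estimate parameters) generalizes to the multivariate mixtures a GaussianMixture/ module would implement.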
Prepare a Dataset for Algorithm Testing
- Export or download the dataset in CSV or text format, matching the structure of existing datasets (iris.data, bbc_text.csv, cars.csv); place it alongside the algorithm (e.g., NewDataset/dataset.csv)
- Create a data loader function in the algorithm file (NewAlgorithm/NewAlgorithm.py) that reads, validates, and splits the data into train/test, following the transMnist/transMnist.py pattern
- Add data preprocessing (normalization, tokenization, feature scaling) specific to your algorithm and dataset
- Document the dataset source, size, features, and any preprocessing steps in comments at the top of the algorithm file
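A loader of the kind described above might look like the following sketch (stdlib only; the function name and the iris-style last-column-is-label convention are assumptions to adapt to your dataset):

```python
import csv
import random

def load_csv_dataset(path, test_ratio=0.2, seed=42):
    """Read a CSV whose last column is the label (iris.data style),
    validate row width, and return (train, test) lists of (features, label)."""
    samples = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:                # skip blank lines (iris.data ends with one)
                continue
            *feats, label = row
            samples.append(([float(v) for v in feats], label))
    # Validate: every row must have the same number of features.
    widths = {len(f) for f, _ in samples}
    if len(widths) != 1:
        raise ValueError(f"inconsistent feature counts: {sorted(widths)}")
    random.Random(seed).shuffle(samples)  # deterministic shuffle for reproducibility
    n_test = int(len(samples) * test_ratio)
    return samples[n_test:], samples[:n_test]
```

Using a fixed seed keeps the train/test split reproducible across runs, which matters when comparing results against textbook worked examples.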
Create Educational Documentation for an Algorithm
- Write a blog post (PDF or markdown) explaining the algorithm's theory, mathematical foundations, and intuition (
blogs/algorithm_theory.pdf) - Create a Jupyter notebook (.ipynb) in the algorithm directory with step-by-step implementation and visualizations (see PCA/PCA.ipynb, LDA/LDA.ipynb for examples) (
Algorithm/Algorithm.ipynb) - Add comments in the Python implementation correlating code lines to equation numbers from the blog post or textbook (
Algorithm/Algorithm.py) - Include worked examples or toy datasets to make the tutorial reproducible (e.g., small feature matrix, simple classification task) (
Algorithm/Algorithm.ipynb)
🪤Traps & gotchas
1. MNIST data requires manual unzipping: the README says the .rar files must be extracted after download; no download script is provided.
2. No requirements.txt: dependencies (NumPy, SciPy, pandas, Jupyter) must be inferred and installed manually; versions are not pinned.
3. Data file paths are hardcoded: LSA/LSA.py, PLSA/PLSA.py, and LDA/LDA.py reference 'bbc_text.csv' with relative paths and will fail if run from the wrong directory.
4. No test suite: there is no way to validate the implementations; manual inspection is required.
5. Mixed-language code: comments are in Chinese, which may confuse non-Chinese-speaking contributors.
6. Notebooks are not version-controlled cleanly: .ipynb files can conflict in git, and no .gitignore rule for Jupyter output is visible.
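The hardcoded-path trap has a standard workaround: resolve data files relative to the script's own location rather than the current working directory. A sketch of the pattern (the `data_path` helper is hypothetical, not in the repo; the `globals().get` fallback just keeps the snippet importable where `__file__` is undefined):

```python
from pathlib import Path

# Resolve data files relative to this script's directory, so e.g.
# `python LSA/LSA.py` works no matter where it is launched from.
BASE_DIR = Path(globals().get("__file__", ".")).resolve().parent

def data_path(name, base=BASE_DIR):
    """Return the path of a data file expected to sit next to the script,
    failing loudly with the absolute path if it is missing."""
    path = Path(base) / name
    if not path.exists():
        raise FileNotFoundError(f"expected data file next to the script: {path}")
    return path
```

Replacing a bare `open('bbc_text.csv')` with `open(data_path('bbc_text.csv'))` makes the scripts runnable from the repo root or anywhere else.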
🏗️Architecture
💡Concepts to learn
- Information Gain & Entropy (Decision Trees) — Core to DecisionTree.py's split selection; understanding Shannon entropy and information gain is essential for grasping why trees choose splits the way they do
- Kernel Trick (SVM) — SVM/SVM.py's most subtle concept; allows non-linear classification in high dimensions without explicit feature transformation; critical for practical SVM understanding
- Expectation-Maximization (EM Algorithm) — EM/EM.py implements the foundation for latent variable models; reappears in HMM, PLSA, LDA; understanding E and M steps is essential for probabilistic unsupervised learning
- Distance Metrics (KNN, K-means, Hierarchical Clustering) — KNN/KNN.py, Clustering/ modules rely on Euclidean, Manhattan, or cosine distance; choice of metric directly affects clustering quality and classification accuracy
- Gradient Descent & Convex Optimization — Underpins logisticRegression.py, maxEntropy.py, and SVM optimization; understanding convergence criteria and step size selection is crucial for debugging learning failures
- Viterbi Algorithm (HMM) — HMM/HMM.py uses Viterbi for sequence decoding; a dynamic programming technique that solves maximum likelihood path problem in hidden Markov models efficiently
- Singular Value Decomposition (PCA, LSA) — PCA/PCA.py and LSA/LSA.py rely on SVD for dimensionality reduction; understanding eigenvalues and principal components is essential for interpreting results
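To make the distance-metric point concrete: the three metrics named above can give different nearest neighbors for the same query, which is why the choice matters in KNN and K-means. A minimal sketch (helper names are illustrative, not from KNN/KNN.py):

```python
import numpy as np

def euclidean(a, b):
    """L2 distance: straight-line distance in feature space."""
    return float(np.sqrt(((a - b) ** 2).sum()))

def manhattan(a, b):
    """L1 distance: sum of per-coordinate absolute differences."""
    return float(np.abs(a - b).sum())

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones.
    Ignores magnitude, so it suits text vectors (LSA/PLSA) better than raw pixels."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

For a = (0, 0) and b = (3, 4), Euclidean gives 5 while Manhattan gives 7; cosine distance treats (1, 0) and (2, 0) as identical but (1, 0) and (0, 1) as maximally far apart.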
🔗Related repos
- apachecn/AiLearning — Another comprehensive Chinese ML textbook implementation (sklearn/TensorFlow based); covers a broader ecosystem but is less annotated than this repo
- eriklindernoren/ML-From-Scratch — Hand-coded ML algorithms in Python with NumPy; similar no-frameworks philosophy, but focused on educational clarity rather than textbook formula alignment
- scikit-learn/scikit-learn — Production-grade reference implementations of many algorithms here (KNN, naive Bayes, decision trees, SVM); useful for validation and performance comparison
- tzapu/pytorch-lightning-mnist — A modern approach to MNIST classification using PyTorch Lightning; contrasts with this repo's raw NumPy approach and shows the neural network alternative
- nndl/nndl.github.io — Neural Network and Deep Learning textbook with similar educational intent; a companion resource for understanding how modern deep learning extends classical statistical learning
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add unit tests for all algorithm implementations
The repo implements 15+ statistical learning algorithms (SVM, AdaBoost, KNN, etc.) but has no test suite. This is critical for a textbook reference implementation—tests would catch regressions, validate against textbook formulas, and give contributors confidence when refactoring. Start with high-priority algorithms: SVM.py, DecisionTree.py, and AdaBoost.py.
- [ ] Create tests/ directory with test_svm.py, test_decision_tree.py, test_adaboost.py
- [ ] Add simple synthetic datasets (XOR, iris subset) for unit tests that verify algorithm correctness
- [ ] Test edge cases: single sample, single feature, binary vs multiclass classification
- [ ] Document expected outputs against textbook formulas (cite chapter in Li's book)
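A first test along these lines might look like the sketch below. The `knn_predict` helper is a stand-in defined inline so the test is self-contained; a real tests/test_knn.py would import the classifier from KNN/KNN.py instead:

```python
# tests/test_knn.py (sketch) — 1-NN sanity check on a toy XOR-style dataset.
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Stand-in k-NN: majority vote among the k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return int(np.bincount(y_train[nearest]).argmax())

def test_knn_memorizes_training_points():
    # With k=1, every training point's nearest neighbor is itself,
    # so the classifier must reproduce the training labels exactly.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])  # XOR labels — not linearly separable
    for xi, yi in zip(X, y):
        assert knn_predict(X, y, xi, k=1) == yi
```

Running `pytest tests/` would then give contributors a one-command regression check, and the same structure extends to DecisionTree splits and SVM kernels.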
Create a unified requirements.txt and environment setup guide
The repo has scattered notebook and script dependencies (pandas, sklearn, numpy) but no requirements.txt. Contributors cannot easily reproduce the environment. This blocks CI/CD and makes the repo harder to use for learning. A proper setup guide would lower friction significantly.
- [ ] Audit all .py and .ipynb files to extract imports (e.g., pandas, numpy, scipy, sklearn)
- [ ] Create requirements.txt with pinned versions tested against the code
- [ ] Create a setup.sh or DEVELOPMENT.md guide covering: Python version, virtual env, data extraction (unzipping Mnist)
- [ ] Add a note to README about the Mnist .rar files needing extraction
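The import-audit step in the checklist above can be roughed out with a one-liner (a sketch; it catches only top-level `import X` / `from X import ...` lines, not dynamic imports):

```shell
# Rough audit of imported top-level modules across the repo (run from repo root).
# Output is a deduplicated list of module names to seed requirements.txt.
grep -rhoE '^\s*(import|from)\s+[A-Za-z_][A-Za-z0-9_]*' --include='*.py' . \
  | awk '{print $2}' | sort -u
```

Standard-library names (os, sys, time) would need to be filtered out of the result by hand before pinning the remainder in requirements.txt.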
Refactor Clustering module: split Hierachical_Clustering.py and K-means_Clustering.py into separate base class with shared utilities
The two clustering algorithms share distance computation, centroid logic, and convergence checks but duplicate code. This violates DRY and makes maintenance harder. Extracting a ClusteringBase class and clustering_utils.py would reduce ~100 lines of duplication and serve as a reference for extending with new algorithms (e.g., DBSCAN).
- [ ] Create Clustering/clustering_base.py with abstract base class and shared methods (distance metrics, convergence checks, centroid updates)
- [ ] Create Clustering/clustering_utils.py for helper functions (euclidean_distance, silhouette_score, etc.)
- [ ] Refactor K-means_Clustering/K-means_Clustering.py and Hierachical_Clustering/Hierachical_Clustering.py to inherit from base class
- [ ] Update __init__.py in Clustering/ to expose a unified API and verify the notebooks still run
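The base class the checklist proposes might be shaped like the sketch below (class and method names are hypothetical; the tiny KMeans subclass only demonstrates that the shared distance and convergence helpers compose):

```python
# Clustering/clustering_base.py (sketch) — shared surface for the refactor.
from abc import ABC, abstractmethod
import numpy as np

class ClusteringBase(ABC):
    def __init__(self, tol=1e-4, max_iter=100):
        self.tol = tol
        self.max_iter = max_iter

    @staticmethod
    def euclidean(a, b):
        """Distance helper currently duplicated in both clustering modules."""
        return np.sqrt(((np.asarray(a) - np.asarray(b)) ** 2).sum(axis=-1))

    def converged(self, old_centroids, new_centroids):
        """Shared convergence check: every centroid moved less than tol."""
        return bool(np.all(self.euclidean(old_centroids, new_centroids) < self.tol))

    @abstractmethod
    def fit(self, X):
        ...

class KMeans(ClusteringBase):
    def __init__(self, k, **kw):
        super().__init__(**kw)
        self.k = k

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        centroids = X[:self.k].copy()  # naive init, enough for the sketch
        for _ in range(self.max_iter):
            # Assign each point to its nearest centroid.
            d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) for j in range(self.k)])
            if self.converged(centroids, new):
                centroids = new
                break
            centroids = new
        self.centroids_, self.labels_ = centroids, labels
        return self
```

Hierarchical clustering would subclass the same base, reusing `euclidean` for its linkage computations.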
🌿Good first issues
- Add a requirements.txt with pinned dependency versions (NumPy, SciPy, scikit-learn, Matplotlib, Jupyter) at the repo root. This lowers the barrier for new contributors and prevents 'works on my machine' problems.
- Create a tests/ directory with unit tests for core algorithms (KNN distance metrics, DecisionTree splits, SVM kernel functions), e.g., tests/test_knn.py and tests/test_decision_tree.py. Start with KNN since it's simpler; use the pytest framework.
- Standardize data file handling: add a data/ subdirectory and write a download_data.py script that fetches the MNIST, iris, and bbc_text datasets. Update Clustering/K-means_Clustering/K-means_Clustering.py, LSA/LSA.py, and LDA/LDA.py to use a common DATA_DIR path.
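The download_data.py idea could start from a sketch like this (URLS and checksums are placeholders to fill in; `fetch` and `sha256_of` are hypothetical names):

```python
# download_data.py (sketch) — fetch datasets into a common data/ directory
# and verify integrity against a recorded SHA-256.
import hashlib
import urllib.request
from pathlib import Path

DATA_DIR = Path("data")

def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch(url, name, expected_sha256=None):
    """Download url into data/<name> if absent, then verify its checksum."""
    DATA_DIR.mkdir(exist_ok=True)
    dest = DATA_DIR / name
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    if expected_sha256 and sha256_of(dest) != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}; delete and re-download")
    return dest
```

Recording the expected checksums in the script (or README) also addresses the unverified-data-files concern raised in the security section.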
⭐Top contributors
- @Dod-o — 83 commits
- @Harold-Ran — 15 commits
- @Tengchao Lv — 1 commit
- @jason-liew — 1 commit
📝Recent commits
- 1ab7287 — Update README for GreatRouter API service details (Dod-o)
- 15c7bba — Revise README with new AI tool advertisement (Dod-o)
- cd1d283 — Update README.md (Dod-o)
- 2efa742 — add blogs (Tengchao Lv)
- 9918e34 — Update README.md (Dod-o)
- f0a5b37 — Update README.md (Dod-o)
- e76e612 — Update README.md (Dod-o)
- d86abd8 — Update README.md (Dod-o)
- 0862aaf — Update README.md (Dod-o)
- ff5450d — Update README.md (Dod-o)
🔒Security observations
This is an educational machine learning algorithm implementation repository with moderate security concerns. Primary issues are: (1) No dependency management file preventing vulnerability auditing, (2) Unverified data files without integrity checks, and (3) Lack of input validation in algorithm implementations. The codebase is not intended for production use and lacks production-grade security hardening. Recommended immediate actions: create requirements.txt with pinned versions, add data file checksums, and document security limitations. The overall security posture is acceptable for educational purposes but would require significant hardening before any production deployment.
- Medium · No Dependency Management File — Root directory: missing dependency manifest. The codebase lacks a requirements.txt, setup.py, or pyproject.toml, making it impossible to track and audit dependencies for known vulnerabilities; users cannot verify package versions or detect compromised packages. Fix: create a requirements.txt listing all dependencies with pinned versions, and scan it with tools like pip-audit or safety.
- Medium · Unverified Third-Party Data Files — Clustering/iris.data, LSA/bbc_text.csv, PLSA/bbc_text.csv, Mnist/mnist_test.rar, Mnist/mnist_train.rar, transMnist/Mnist/. The repository contains multiple data files without integrity verification; they could be modified or poisoned without detection, especially the compressed MNIST files. Fix: add SHA-256 checksums for all data files in README.md, consider file signing or publishing data on trusted repositories, and document data sources clearly.
- Low · Missing Input Validation — All algorithm files (AdaBoost.py, DecisionTree.py, SVM.py, etc.). The implementations lack comprehensive input validation, which could lead to unexpected behavior on malformed data, though the impact is limited in an educational context. Fix: add type hints and basic input validation; use assertions or custom exceptions to validate input shapes and types.
- Low · Hardcoded Data Paths — transMnist/transMnist.py and other files that load external data. Hardcoded file paths reduce portability and could expose sensitive information if paths contain user home directories or system paths. Fix: use relative paths or configurable data directories via environment variables or config files.
- Low · No Security Documentation — README.md lacks security guidelines or a warning that this is educational code not intended for production use. Fix: add a security section stating this clearly and recommend professional ML frameworks for real applications.
- Low · Unencrypted Sensitive Data in Repository — HMM/HMMTrainSet.txt, HMM/testArtical.txt. The HMM training set and test articles are stored in plain text, which is appropriate for educational purposes but should be documented. Fix: document why the data is unencrypted; implement proper data protection before any production use.
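The input-validation fix suggested above can be shared across all the fit(X, y) entry points with one small helper. This is an illustrative sketch (the function name is hypothetical, not from the repo):

```python
import numpy as np

def validate_xy(X, y):
    """Basic validation for fit(X, y): 2-D float features, 1-D labels,
    matching sample counts, no NaNs. Raises ValueError on malformed input."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if X.ndim != 2:
        raise ValueError(f"X must be 2-D (n_samples, n_features), got ndim={X.ndim}")
    if y.ndim != 1:
        raise ValueError(f"y must be 1-D, got ndim={y.ndim}")
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"X and y disagree on n_samples: {X.shape[0]} vs {y.shape[0]}")
    if np.isnan(X).any():
        raise ValueError("X contains NaN values")
    return X, y
```

Calling `X, y = validate_xy(X, y)` at the top of each algorithm's fit method turns silent shape bugs into immediate, readable errors.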
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.