Dod-o/Statistical-Learning-Method_Code
Hand-written implementations of every algorithm in Li Hang's Statistical Learning Methods (《统计学习方法》)
Slowing — last commit 5mo ago
Weakest axis: no license — legally unclear to depend on; no CI workflows detected
Documented and popular — a useful reference codebase to read through.
- ✓ Last commit 5mo ago
- ✓ 4 active contributors
- ✓ Tests present
- ⚠ Slowing — last commit 5mo ago
- ⚠ Small team — 4 contributors active in recent commits
- ⚠ Single-maintainer risk — top contributor 83% of recent commits
- ⚠ No license — legally unclear to depend on
- ⚠ No CI workflows detected
What would change the summary?
- → Use as dependency: Concerns → Mixed if they publish a permissive license (MIT, Apache-2.0, etc.)
- → Fork & modify: Concerns → Mixed if they add a LICENSE file
- → Deploy as-is: Concerns → Mixed if they add a LICENSE file
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Great to learn from" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/dod-o/statistical-learning-method_code) — paste at the top of your README.md; it renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/dod-o/statistical-learning-method_code on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: Dod-o/Statistical-Learning-Method_Code
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Dod-o/Statistical-Learning-Method_Code shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Slowing — last commit 5mo ago
- Last commit 5mo ago
- 4 active contributors
- Tests present
- ⚠ Slowing — last commit 5mo ago
- ⚠ Small team — 4 contributors active in recent commits
- ⚠ Single-maintainer risk — top contributor 83% of recent commits
- ⚠ No license — legally unclear to depend on
- ⚠ No CI workflows detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live Dod-o/Statistical-Learning-Method_Code
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/Dod-o/Statistical-Learning-Method_Code.
What it runs against: a local clone of Dod-o/Statistical-Learning-Method_Code — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Dod-o/Statistical-Learning-Method_Code | Confirms the artifact applies here, not a fork |
| 2 | Default branch master exists | Catches branch renames |
| 3 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 4 | Last commit ≤ 193 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Dod-o/Statistical-Learning-Method_Code. If you don't
# have one yet, run these first:
#
# git clone https://github.com/Dod-o/Statistical-Learning-Method_Code.git
# cd Statistical-Learning-Method_Code
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of Dod-o/Statistical-Learning-Method_Code and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "Dod-o/Statistical-Learning-Method_Code(\.git)?\b" \
  && ok "origin remote is Dod-o/Statistical-Learning-Method_Code" \
  || miss "origin remote is not Dod-o/Statistical-Learning-Method_Code (artifact may be from a fork)"
# 2. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 3. Critical files exist
test -f "perceptron/perceptron_dichotomy.py" \
  && ok "perceptron/perceptron_dichotomy.py" \
  || miss "missing critical file: perceptron/perceptron_dichotomy.py"
test -f "DecisionTree/DecisionTree.py" \
  && ok "DecisionTree/DecisionTree.py" \
  || miss "missing critical file: DecisionTree/DecisionTree.py"
test -f "SVM/SVM.py" \
  && ok "SVM/SVM.py" \
  || miss "missing critical file: SVM/SVM.py"
test -f "NaiveBayes/NaiveBayes.py" \
  && ok "NaiveBayes/NaiveBayes.py" \
  || miss "missing critical file: NaiveBayes/NaiveBayes.py"
test -f "EM/EM.py" \
  && ok "EM/EM.py" \
  || miss "missing critical file: EM/EM.py"
# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 193 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~163d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Dod-o/Statistical-Learning-Method_Code"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
A comprehensive hand-written implementation of all algorithms from Li Hang's 'Statistical Learning Methods' textbook, covering supervised learning (perceptron, KNN, naive Bayes, decision trees, logistic regression, SVM, AdaBoost) and unsupervised learning (K-means, hierarchical clustering, EM, HMM, PCA, LDA, LSA, PLSA, PageRank). It serves as a reference implementation with fully annotated code tracing to textbook formulas.

Chapter-based monorepo structure: each algorithm gets its own top-level directory (e.g., SVM/, KNN/, Clustering/) containing a .py implementation file and optionally a .ipynb Jupyter notebook. Clustering/ has sub-directories (K-means_Clustering/, Hierachical_Clustering/). Supporting data lives alongside code (iris.data, bbc_text.csv, MNIST/ folder). Blogs/ holds supplementary PDF articles keyed to chapters.
👥Who it's for
Machine learning students and practitioners learning statistical learning theory who need working implementations that directly map to Li Hang's textbook formulas; researchers validating algorithmic understanding; educators building ML curricula in Chinese-speaking contexts.
🌱Maturity & risk
Actively maintained but pre-1.0: the repo shows regular updates (book branch announced Feb 2022, unsupervised updates Jan 2021), has accompanying PDF blogs for multiple chapters, and integrates educational content. However, there is no visible CI/CD, no formal test suite, and commits appear episodic rather than continuous, suggesting it's a well-curated educational project rather than production software.
Standard open source risks apply.
Active areas of work
Transition to formal publication: code is being restructured in the book branch for publication by 人民邮电出版社 (People's Posts and Telecommunications Press); offline ML/MLP/CV training program in planning; community contributions welcome via issues and PRs. Unsupervised algorithms section was recently expanded by Harold-Ran.
🚀Get running
git clone https://github.com/Dod-o/Statistical-Learning-Method_Code.git
cd Statistical-Learning-Method_Code
# For MNIST: extract Mnist/mnist_train.rar and Mnist/mnist_test.rar (e.g., with unrar x)
# No setup.py or requirements.txt; install dependencies manually:
pip install numpy scipy pandas scikit-learn matplotlib jupyter
Daily commands:
# Run a single algorithm (e.g., KNN):
python KNN/KNN.py
# Or open Jupyter notebooks for interactive learning:
jupyter notebook Clustering/K-means_Clustering/K-means_Clustering.ipynb
jupyter notebook PCA/PCA.ipynb
No centralized test suite or main entry point; each module is independent.
🗺️Map of the codebase
- perceptron/perceptron_dichotomy.py — Entry point for the perceptron algorithm (Chapter 2); the foundational supervised learning method that most other algorithms build on conceptually
- DecisionTree/DecisionTree.py — Core decision tree implementation (Chapter 5); one of the most commonly referenced supervised methods in the repo, with extensive comments
- SVM/SVM.py — Support Vector Machine implementation (Chapter 7); a complex optimization-based algorithm central to supervised learning theory
- NaiveBayes/NaiveBayes.py — Naive Bayes classifier (Chapter 4); a foundational probabilistic model used across multiple downstream algorithms
- EM/EM.py — Expectation-Maximization algorithm; the unsupervised core method that underpins the HMM and clustering implementations
- HMM/HMM.py — Hidden Markov Model (Chapter 10); a sequence modeling algorithm that extends EM with temporal structure
- AdaBoost/AdaBoost.py — AdaBoost ensemble method (Chapter 8); demonstrates the weighted iteration pattern replicated in gradient boosting variants
🛠️How to make changes
Add a New Supervised Learning Algorithm
- Create a new directory in the repo root named after the algorithm (e.g., GradientBoosting/)
- Implement the main algorithm class in a .py file (GradientBoosting/GradientBoosting.py) with extensive inline comments explaining each formula step
- Reference the book chapter and equations in comments at the top of the file, following the pattern from DecisionTree.py or SVM.py
- Include a demonstration with a standard dataset (iris, MNIST, or synthetic data) showing the train/test workflow
- Optionally create a Jupyter notebook (GradientBoosting/GradientBoosting.ipynb) in the same directory for interactive exploration, following the PCA/PCA.ipynb pattern
- Update README.md to add the new chapter/algorithm to the supervised learning section, with a link to the blog post if available
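As a sketch of what such a file might look like, here is a hypothetical GradientBoosting/GradientBoosting.py skeleton following the repo's pattern (the class, its parameters, and the decision-stump approach are illustrative assumptions, not code from the repo; chapter/equation references are placeholders):

```python
# GradientBoosting/GradientBoosting.py — hypothetical skeleton in the repo's style.
# Reference: Li Hang, "Statistical Learning Methods", Chapter 8 (boosting).
# Equation references in comments are placeholders to fill in against the book.
import numpy as np

class GradientBoosting:
    """Gradient boosting with regression stumps (squared loss)."""

    def __init__(self, n_rounds=10, learning_rate=0.1):
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.stumps = []  # each stump: (feature_index, threshold, left_value, right_value)

    def fit(self, X, y):
        # Initialize with the mean prediction, then fit stumps to residuals.
        self.base = float(np.mean(y))
        residual = y - self.base
        for _ in range(self.n_rounds):
            stump = self._fit_stump(X, residual)
            residual = residual - self.learning_rate * self._stump_predict(stump, X)
            self.stumps.append(stump)
        return self

    def _fit_stump(self, X, residual):
        # Greedy one-split regression stump: choose the (feature, threshold)
        # minimizing squared error of a piecewise-constant fit.
        best, best_err = None, float("inf")
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left, right = residual[X[:, j] <= t], residual[X[:, j] > t]
                lv = left.mean() if left.size else 0.0
                rv = right.mean() if right.size else 0.0
                err = ((left - lv) ** 2).sum() + ((right - rv) ** 2).sum()
                if err < best_err:
                    best_err, best = err, (j, t, lv, rv)
        return best

    def _stump_predict(self, stump, X):
        j, t, lv, rv = stump
        return np.where(X[:, j] <= t, lv, rv)

    def predict(self, X):
        out = np.full(X.shape[0], self.base)
        for stump in self.stumps:
            out += self.learning_rate * self._stump_predict(stump, X)
        return out
```

A demonstration block at the bottom of the file (train on iris or a synthetic split, print accuracy) would complete the repo's usual structure.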
Add a New Unsupervised Learning Algorithm
- Create a directory under Clustering/ or a new top-level directory (e.g., GaussianMixture/) following repo conventions
- Implement the algorithm class in GaussianMixture/GaussianMixture.py, clearly documenting the E-step and M-step if EM-based (see EM/EM.py for the pattern)
- Add dataset-handling code to load standard datasets such as iris.data or bbc_text.csv
- Include visualization code (clustering plots, elbow curves, or topic visualizations) in a Jupyter notebook (GaussianMixture/GaussianMixture.ipynb)
- Add evaluation metrics (silhouette score, inertia, perplexity) following the Clustering/K-means_Clustering/K-means_Clustering.py pattern
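To illustrate the E-step/M-step structure such a module would document, here is a minimal one-dimensional two-component Gaussian mixture fitted with EM. This is an illustrative sketch, not code from EM/EM.py; the function name and initialization scheme are assumptions:

```python
import numpy as np

def gmm_em_1d(x, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture to samples x via EM.
    Returns (weights pi, means mu, variances var)."""
    # Initialize: means at the data extremes, equal weights, data variance.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = P(component k | x_i).
        dens = (pi / np.sqrt(2 * np.pi * var)) * np.exp(
            -(x[:, None] - mu) ** 2 / (2 * var))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities.
        nk = gamma.sum(axis=0)
        pi = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var
```

The same alternation (compute responsibilities, then re-estimate parameters) generalizes to the multivariate mixtures a GaussianMixture/ module would implement.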
Prepare a Dataset for Algorithm Testing
- Export or download the dataset in CSV or text format, matching the structure of existing datasets (iris.data, bbc_text.csv, cars.csv); place it alongside the algorithm (e.g., NewDataset/dataset.csv)
- Create a data loader function in the algorithm file (NewAlgorithm/NewAlgorithm.py) that reads, validates, and splits the data into train/test, following the transMnist/transMnist.py pattern
- Add data preprocessing (normalization, tokenization, feature scaling) specific to your algorithm and dataset
- Document the dataset source, size, features, and any preprocessing steps in comments at the top of the algorithm file
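A loader of the kind described above might look like the following sketch (stdlib only; the function name and the iris-style last-column-is-label convention are assumptions to adapt to your dataset):

```python
import csv
import random

def load_csv_dataset(path, test_ratio=0.2, seed=42):
    """Read a CSV whose last column is the label (iris.data style),
    validate row width, and return (train, test) lists of (features, label)."""
    samples = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:                # skip blank lines (iris.data ends with one)
                continue
            *feats, label = row
            samples.append(([float(v) for v in feats], label))
    # Validate: every row must have the same number of features.
    widths = {len(f) for f, _ in samples}
    if len(widths) != 1:
        raise ValueError(f"inconsistent feature counts: {sorted(widths)}")
    random.Random(seed).shuffle(samples)  # deterministic shuffle for reproducibility
    n_test = int(len(samples) * test_ratio)
    return samples[n_test:], samples[:n_test]
```

Using a fixed seed keeps the train/test split reproducible across runs, which matters when comparing results against textbook worked examples.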
Create Educational Documentation for an Algorithm
- Write a blog post (PDF or markdown) explaining the algorithm's theory, mathematical foundations, and intuition (
blogs/algorithm_theory.pdf) - Create a Jupyter notebook (.ipynb) in the algorithm directory with step-by-step implementation and visualizations (see PCA/PCA.ipynb, LDA/LDA.ipynb for examples) (
Algorithm/Algorithm.ipynb) - Add comments in the Python implementation correlating code lines to equation numbers from the blog post or textbook (
Algorithm/Algorithm.py) - Include worked examples or toy datasets to make the tutorial reproducible (e.g., small feature matrix, simple classification task) (
Algorithm/Algorithm.ipynb)
🪤Traps & gotchas
1. MNIST data requires manual unzipping: the README says the .rar files must be extracted after download; no download script is provided.
2. No requirements.txt: dependencies (NumPy, SciPy, pandas, Jupyter) must be inferred and installed manually; versions are not pinned.
3. Data file paths are hardcoded: LSA/LSA.py, PLSA/PLSA.py, and LDA/LDA.py reference 'bbc_text.csv' with relative paths and will fail if run from the wrong directory.
4. No test suite: there is no way to validate the implementations; manual inspection is required.
5. Mixed-language code: comments are in Chinese, which may confuse non-Chinese-speaking contributors.
6. Notebooks are not version-controlled cleanly: .ipynb files can conflict in git, and no .gitignore rule for Jupyter output is visible.
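The hardcoded-path trap has a standard workaround: resolve data files relative to the script's own location rather than the current working directory. A sketch of the pattern (the `data_path` helper is hypothetical, not in the repo; the `globals().get` fallback just keeps the snippet importable where `__file__` is undefined):

```python
from pathlib import Path

# Resolve data files relative to this script's directory, so e.g.
# `python LSA/LSA.py` works no matter where it is launched from.
BASE_DIR = Path(globals().get("__file__", ".")).resolve().parent

def data_path(name, base=BASE_DIR):
    """Return the path of a data file expected to sit next to the script,
    failing loudly with the absolute path if it is missing."""
    path = Path(base) / name
    if not path.exists():
        raise FileNotFoundError(f"expected data file next to the script: {path}")
    return path
```

Replacing a bare `open('bbc_text.csv')` with `open(data_path('bbc_text.csv'))` makes the scripts runnable from the repo root or anywhere else.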
🏗️Architecture
💡Concepts to learn
- Information Gain & Entropy (Decision Trees) — Core to DecisionTree.py's split selection; understanding Shannon entropy and information gain is essential for grasping why trees choose splits the way they do
- Kernel Trick (SVM) — SVM/SVM.py's most subtle concept; allows non-linear classification in high dimensions without explicit feature transformation; critical for practical SVM understanding
- Expectation-Maximization (EM Algorithm) — EM/EM.py implements the foundation for latent variable models; reappears in HMM, PLSA, LDA; understanding E and M steps is essential for probabilistic unsupervised learning
- Distance Metrics (KNN, K-means, Hierarchical Clustering) — KNN/KNN.py, Clustering/ modules rely on Euclidean, Manhattan, or cosine distance; choice of metric directly affects clustering quality and classification accuracy
- Gradient Descent & Convex Optimization — Underpins logisticRegression.py, maxEntropy.py, and SVM optimization; understanding convergence criteria and step size selection is crucial for debugging learning failures
- Viterbi Algorithm (HMM) — HMM/HMM.py uses Viterbi for sequence decoding; a dynamic programming technique that solves maximum likelihood path problem in hidden Markov models efficiently
- Singular Value Decomposition (PCA, LSA) — PCA/PCA.py and LSA/LSA.py rely on SVD for dimensionality reduction; understanding eigenvalues and principal components is essential for interpreting results
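To make the distance-metric point concrete: the three metrics named above can give different nearest neighbors for the same query, which is why the choice matters in KNN and K-means. A minimal sketch (helper names are illustrative, not from KNN/KNN.py):

```python
import numpy as np

def euclidean(a, b):
    """L2 distance: straight-line distance in feature space."""
    return float(np.sqrt(((a - b) ** 2).sum()))

def manhattan(a, b):
    """L1 distance: sum of per-coordinate absolute differences."""
    return float(np.abs(a - b).sum())

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones.
    Ignores magnitude, so it suits text vectors (LSA/PLSA) better than raw pixels."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

For a = (0, 0) and b = (3, 4), Euclidean gives 5 while Manhattan gives 7; cosine distance treats (1, 0) and (2, 0) as identical but (1, 0) and (0, 1) as maximally far apart.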
🔗Related repos
- apachecn/AiLearning — Another comprehensive Chinese ML textbook implementation (sklearn/TensorFlow based); covers a broader ecosystem but is less annotated than this repo
- eriklindernoren/ML-From-Scratch — Hand-coded ML algorithms in Python with NumPy; similar no-frameworks philosophy, but focused on educational clarity rather than textbook formula alignment
- scikit-learn/scikit-learn — Production-grade reference implementations of many algorithms here (KNN, naive Bayes, decision trees, SVM); useful for validation and performance comparison
- tzapu/pytorch-lightning-mnist — A modern approach to MNIST classification using PyTorch Lightning; contrasts with this repo's raw NumPy approach and shows the neural network alternative
- nndl/nndl.github.io — Neural Network and Deep Learning textbook with similar educational intent; a companion resource for understanding how modern deep learning extends classical statistical learning
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add unit tests for all algorithm implementations
The repo implements 15+ statistical learning algorithms (SVM, AdaBoost, KNN, etc.) but has no test suite. This is critical for a textbook reference implementation—tests would catch regressions, validate against textbook formulas, and give contributors confidence when refactoring. Start with high-priority algorithms: SVM.py, DecisionTree.py, and AdaBoost.py.
- [ ] Create tests/ directory with test_svm.py, test_decision_tree.py, test_adaboost.py
- [ ] Add simple synthetic datasets (XOR, iris subset) for unit tests that verify algorithm correctness
- [ ] Test edge cases: single sample, single feature, binary vs multiclass classification
- [ ] Document expected outputs against textbook formulas (cite chapter in Li's book)
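A first test along these lines might look like the sketch below. The `knn_predict` helper is a stand-in defined inline so the test is self-contained; a real tests/test_knn.py would import the classifier from KNN/KNN.py instead:

```python
# tests/test_knn.py (sketch) — 1-NN sanity check on a toy XOR-style dataset.
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Stand-in k-NN: majority vote among the k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return int(np.bincount(y_train[nearest]).argmax())

def test_knn_memorizes_training_points():
    # With k=1, every training point's nearest neighbor is itself,
    # so the classifier must reproduce the training labels exactly.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])  # XOR labels — not linearly separable
    for xi, yi in zip(X, y):
        assert knn_predict(X, y, xi, k=1) == yi
```

Running `pytest tests/` would then give contributors a one-command regression check, and the same structure extends to DecisionTree splits and SVM kernels.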
Create a unified requirements.txt and environment setup guide
The repo has scattered notebook and script dependencies (pandas, sklearn, numpy) but no requirements.txt. Contributors cannot easily reproduce the environment. This blocks CI/CD and makes the repo harder to use for learning. A proper setup guide would lower friction significantly.
- [ ] Audit all .py and .ipynb files to extract imports (e.g., pandas, numpy, scipy, sklearn)
- [ ] Create requirements.txt with pinned versions tested against the code
- [ ] Create a setup.sh or DEVELOPMENT.md guide covering: Python version, virtual env, data extraction (unzipping Mnist)
- [ ] Add a note to README about the Mnist .rar files needing extraction
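The import-audit step in the checklist above can be roughed out with a one-liner (a sketch; it catches only top-level `import X` / `from X import ...` lines, not dynamic imports):

```shell
# Rough audit of imported top-level modules across the repo (run from repo root).
# Output is a deduplicated list of module names to seed requirements.txt.
grep -rhoE '^\s*(import|from)\s+[A-Za-z_][A-Za-z0-9_]*' --include='*.py' . \
  | awk '{print $2}' | sort -u
```

Standard-library names (os, sys, time) would need to be filtered out of the result by hand before pinning the remainder in requirements.txt.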
Refactor Clustering module: split Hierachical_Clustering.py and K-means_Clustering.py into separate base class with shared utilities
The two clustering algorithms share distance computation, centroid logic, and convergence checks but duplicate code. This violates DRY and makes maintenance harder. Extracting a ClusteringBase class and clustering_utils.py would reduce ~100 lines of duplication and serve as a reference for extending with new algorithms (e.g., DBSCAN).
- [ ] Create Clustering/clustering_base.py with abstract base class and shared methods (distance metrics, convergence checks, centroid updates)
- [ ] Create Clustering/clustering_utils.py for helper functions (euclidean_distance, silhouette_score, etc.)
- [ ] Refactor K-means_Clustering/K-means_Clustering.py and Hierachical_Clustering/Hierachical_Clustering.py to inherit from base class
- [ ] Update __init__.py in Clustering/ to expose a unified API and verify the notebooks still run
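The base class the checklist proposes might be shaped like the sketch below (class and method names are hypothetical; the tiny KMeans subclass only demonstrates that the shared distance and convergence helpers compose):

```python
# Clustering/clustering_base.py (sketch) — shared surface for the refactor.
from abc import ABC, abstractmethod
import numpy as np

class ClusteringBase(ABC):
    def __init__(self, tol=1e-4, max_iter=100):
        self.tol = tol
        self.max_iter = max_iter

    @staticmethod
    def euclidean(a, b):
        """Distance helper currently duplicated in both clustering modules."""
        return np.sqrt(((np.asarray(a) - np.asarray(b)) ** 2).sum(axis=-1))

    def converged(self, old_centroids, new_centroids):
        """Shared convergence check: every centroid moved less than tol."""
        return bool(np.all(self.euclidean(old_centroids, new_centroids) < self.tol))

    @abstractmethod
    def fit(self, X):
        ...

class KMeans(ClusteringBase):
    def __init__(self, k, **kw):
        super().__init__(**kw)
        self.k = k

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        centroids = X[:self.k].copy()  # naive init, enough for the sketch
        for _ in range(self.max_iter):
            # Assign each point to its nearest centroid.
            d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) for j in range(self.k)])
            if self.converged(centroids, new):
                centroids = new
                break
            centroids = new
        self.centroids_, self.labels_ = centroids, labels
        return self
```

Hierarchical clustering would subclass the same base, reusing `euclidean` for its linkage computations.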
🌿Good first issues
- Add a requirements.txt with pinned dependency versions (NumPy, SciPy, scikit-learn, Matplotlib, Jupyter) at the repo root. This lowers the barrier for new contributors and prevents 'works on my machine' problems.
- Create a tests/ directory with unit tests for core algorithms (KNN distance metrics, DecisionTree splits, SVM kernel functions), e.g., tests/test_knn.py and tests/test_decision_tree.py. Start with KNN since it's simpler; use the pytest framework.
- Standardize data file handling: add a data/ subdirectory and write a download_data.py script that fetches the MNIST, iris, and bbc_text datasets. Update Clustering/K-means_Clustering/K-means_Clustering.py, LSA/LSA.py, and LDA/LDA.py to use a common DATA_DIR path.
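The download_data.py idea could start from a sketch like this (URLS and checksums are placeholders to fill in; `fetch` and `sha256_of` are hypothetical names):

```python
# download_data.py (sketch) — fetch datasets into a common data/ directory
# and verify integrity against a recorded SHA-256.
import hashlib
import urllib.request
from pathlib import Path

DATA_DIR = Path("data")

def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch(url, name, expected_sha256=None):
    """Download url into data/<name> if absent, then verify its checksum."""
    DATA_DIR.mkdir(exist_ok=True)
    dest = DATA_DIR / name
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    if expected_sha256 and sha256_of(dest) != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}; delete and re-download")
    return dest
```

Recording the expected checksums in the script (or README) also addresses the unverified-data-files concern raised in the security section.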
⭐Top contributors
- @Dod-o — 83 commits
- @Harold-Ran — 15 commits
- @Tengchao Lv — 1 commit
- @jason-liew — 1 commit
📝Recent commits
- 1ab7287 — Update README for GreatRouter API service details (Dod-o)
- 15c7bba — Revise README with new AI tool advertisement (Dod-o)
- cd1d283 — Update README.md (Dod-o)
- 2efa742 — add blogs (Tengchao Lv)
- 9918e34 — Update README.md (Dod-o)
- f0a5b37 — Update README.md (Dod-o)
- e76e612 — Update README.md (Dod-o)
- d86abd8 — Update README.md (Dod-o)
- 0862aaf — Update README.md (Dod-o)
- ff5450d — Update README.md (Dod-o)
🔒Security observations
This is an educational machine learning algorithm implementation repository with moderate security concerns. Primary issues are: (1) No dependency management file preventing vulnerability auditing, (2) Unverified data files without integrity checks, and (3) Lack of input validation in algorithm implementations. The codebase is not intended for production use and lacks production-grade security hardening. Recommended immediate actions: create requirements.txt with pinned versions, add data file checksums, and document security limitations. The overall security posture is acceptable for educational purposes but would require significant hardening before any production deployment.
- Medium · No Dependency Management File — Root directory: missing dependency manifest. The codebase lacks a requirements.txt, setup.py, or pyproject.toml, making it impossible to track and audit dependencies for known vulnerabilities; users cannot verify package versions or detect compromised packages. Fix: create a requirements.txt listing all dependencies with pinned versions, and scan it with tools like pip-audit or safety.
- Medium · Unverified Third-Party Data Files — Clustering/iris.data, LSA/bbc_text.csv, PLSA/bbc_text.csv, Mnist/mnist_test.rar, Mnist/mnist_train.rar, transMnist/Mnist/. The repository contains multiple data files without integrity verification; they could be modified or poisoned without detection, especially the compressed MNIST files. Fix: add SHA-256 checksums for all data files in README.md, consider file signing or publishing data on trusted repositories, and document data sources clearly.
- Low · Missing Input Validation — All algorithm files (AdaBoost.py, DecisionTree.py, SVM.py, etc.). The implementations lack comprehensive input validation, which could lead to unexpected behavior on malformed data, though the impact is limited in an educational context. Fix: add type hints and basic input validation; use assertions or custom exceptions to validate input shapes and types.
- Low · Hardcoded Data Paths — transMnist/transMnist.py and other files that load external data. Hardcoded file paths reduce portability and could expose sensitive information if paths contain user home directories or system paths. Fix: use relative paths or configurable data directories via environment variables or config files.
- Low · No Security Documentation — README.md lacks security guidelines or a warning that this is educational code not intended for production use. Fix: add a security section stating this clearly and recommend professional ML frameworks for real applications.
- Low · Unencrypted Sensitive Data in Repository — HMM/HMMTrainSet.txt, HMM/testArtical.txt. The HMM training set and test articles are stored in plain text, which is appropriate for educational purposes but should be documented. Fix: document why the data is unencrypted; implement proper data protection before any production use.
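The input-validation fix suggested above can be shared across all the fit(X, y) entry points with one small helper. This is an illustrative sketch (the function name is hypothetical, not from the repo):

```python
import numpy as np

def validate_xy(X, y):
    """Basic validation for fit(X, y): 2-D float features, 1-D labels,
    matching sample counts, no NaNs. Raises ValueError on malformed input."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if X.ndim != 2:
        raise ValueError(f"X must be 2-D (n_samples, n_features), got ndim={X.ndim}")
    if y.ndim != 1:
        raise ValueError(f"y must be 1-D, got ndim={y.ndim}")
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"X and y disagree on n_samples: {X.shape[0]} vs {y.shape[0]}")
    if np.isnan(X).any():
        raise ValueError("X contains NaN values")
    return X, y
```

Calling `X, y = validate_xy(X, y)` at the top of each algorithm's fit method turns silent shape bugs into immediate, readable errors.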
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.