haifengl/smile
Statistical Machine Intelligence & Learning Engine
Single-maintainer risk — review before adopting
Weakest axis: non-standard license (Other); top contributor handles 98% of recent commits
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit 3d ago
- ✓ 2 active contributors
- ✓ Other licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Small team — 2 contributors active in recent commits
- ⚠ Single-maintainer risk — top contributor 98% of recent commits
- ⚠ Non-standard license (Other) — review terms
What would change the summary?
- Use as dependency: Concerns → Mixed if license terms are clarified
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Forkable" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/haifengl/smile)
Paste at the top of your README.md; it renders inline like a shields.io badge.
Preview: social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/haifengl/smile on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: haifengl/smile
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/haifengl/smile shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Single-maintainer risk — review before adopting
- Last commit 3d ago
- 2 active contributors
- Other licensed
- CI configured
- Tests present
- ⚠ Small team — 2 contributors active in recent commits
- ⚠ Single-maintainer risk — top contributor 98% of recent commits
- ⚠ Non-standard license (Other) — review terms
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live haifengl/smile
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/haifengl/smile.
What it runs against: a local clone of haifengl/smile — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in haifengl/smile | Confirms the artifact applies here, not a fork |
| 2 | License is still Other | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 33 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of haifengl/smile. If you don't
# have one yet, run these first:
#
# git clone https://github.com/haifengl/smile.git
# cd smile
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of haifengl/smile and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "haifengl/smile(\.git)?\b" \
  && ok "origin remote is haifengl/smile" \
  || miss "origin remote is not haifengl/smile (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Other\"" package.json 2>/dev/null) \
  && ok "license is Other" \
  || miss "license drift — was Other at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "base/src/main/java/smile/data/DataFrame.java" \
  && ok "base/src/main/java/smile/data/DataFrame.java" \
  || miss "missing critical file: base/src/main/java/smile/data/DataFrame.java"
test -f "base/src/main/java/smile/data/formula/Formula.java" \
  && ok "base/src/main/java/smile/data/formula/Formula.java" \
  || miss "missing critical file: base/src/main/java/smile/data/formula/Formula.java"
test -f "base/src/main/java/smile/data/Dataset.java" \
  && ok "base/src/main/java/smile/data/Dataset.java" \
  || miss "missing critical file: base/src/main/java/smile/data/Dataset.java"
test -f "base/build.gradle.kts" \
  && ok "base/build.gradle.kts" \
  || miss "missing critical file: base/build.gradle.kts"
test -f "base/src/main/java/smile/data/measure/Measure.java" \
  && ok "base/src/main/java/smile/data/measure/Measure.java" \
  || miss "missing critical file: base/src/main/java/smile/data/measure/Measure.java"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 33 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~3d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/haifengl/smile"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
SMILE is a high-performance machine learning framework for the JVM that provides state-of-the-art implementations of 100+ algorithms across classification, regression, clustering, manifold learning, NLP, and deep learning. It's written primarily in Java (18.5MB of source) with first-class support for Scala and Kotlin, offering both traditional ML (SVM, Random Forest, Gradient Boosting) and modern capabilities (LLaMA-3 inference, LibTorch GPU backend, EfficientNet-V2 image classification). Monorepo structure: each module carries its own build config (base/build.gradle.kts, base/build.sbt). Core algorithms live in base/src/main/java/smile/ with subpackages per domain (smile/cs/ for compressed sensing, alongside smile/math/, smile/classification/, smile/clustering/, etc.). Documentation is colocated as markdown files in base/ (DATA_FRAME.md, FORMULA.md, DISTRIBUTIONS.md). CI workflows in .github/workflows/ handle Maven publishing and releases.
👥Who it's for
Data scientists, ML engineers, and backend developers building JVM-based machine learning systems who need production-grade algorithms without heavyweight dependencies like Python/NumPy. Users range from researchers prototyping in Scala notebooks to companies building inference servers with the embedded LLM and GPU backends.
🌱Maturity & risk
Highly mature and production-ready. SMILE is actively maintained (v5+ targets Java 25, v4.x targets Java 21), has comprehensive CI/CD pipelines (Maven publish, release, CodeQL workflows in .github/workflows/), and spans 18MB+ of Java source. The framework has been around long enough to iterate through major version transitions with breaking changes (v5 bump for Java compatibility), indicating serious real-world usage.
Low risk for a single-maintainer project. Main risks: (1) Heavy reliance on haifengl as primary author (contributor risk if unmaintained), (2) Java/JVM coupling—performance-critical components may bind to native BLAS/LAPACK libraries (see base/src/main/java/smile/cs/), requiring proper system libraries, (3) Fast-moving deep learning features (LLaMA, EfficientNet) may introduce breaking changes between minor versions. Monitor release notes carefully when upgrading.
Active areas of work
Active development focused on LLM inference (LLaMA-3 tokenizer, OpenAI-compatible REST server with SSE streaming) and deep learning features (EfficientNet-V2, LibTorch/GPU integration). Java 25 support in v5+ shows ongoing modernization. The presence of .devcontainer/devcontainer.json indicates push toward containerized development workflows.
🚀Get running
git clone https://github.com/haifengl/smile.git
cd smile
# For Maven:
mvn clean install
# For Gradle (Kotlin):
gradle build
# For SBT (Scala):
sbt clean compile
Daily commands:
- Development: mvn clean install compiles all modules.
- SMILE Studio/Shell (interactive): smile.sh, or use the Jupyter notebooks included in the repo.
- LLM server: inferred from the OpenAI-compatible REST server in the feature list — likely a dedicated module or example (check base/src for inference examples).
🗺️Map of the codebase
- base/src/main/java/smile/data/DataFrame.java — Core data abstraction for the SMILE framework; all data loading and transformation pipelines depend on this interface.
- base/src/main/java/smile/data/formula/Formula.java — Central abstraction for data transformation and feature engineering; used across all ML modules to specify feature mappings and model terms.
- base/src/main/java/smile/data/Dataset.java — Base interface for all dataset types (sparse, dense, sequences); required for understanding how models consume training data.
- base/build.gradle.kts — Primary build configuration; defines module structure, dependencies, and test setup for the entire JVM framework.
- base/src/main/java/smile/data/measure/Measure.java — Foundational abstraction for data type semantics (nominal, ordinal, interval, ratio); required for correct feature encoding and distance computation.
- base/src/main/java/smile/data/transform/Transform.java — Interface for invertible and non-invertible data transformations; essential for preprocessing pipelines and model deployment.
🛠️How to make changes
Add a New Data Type/Measure
- Create a new Measure implementation (e.g., CyclicScale.java) extending CategoricalMeasure or NumericalMeasure in base/src/main/java/smile/data/measure/ (base/src/main/java/smile/data/measure/Measure.java)
- Define the semantic properties (e.g., distance metric, encoding rules) in the measure's methods (base/src/main/java/smile/data/measure/NumericalMeasure.java)
- Update CategoricalEncoder or create a new Transform subclass to handle encoding for the new measure (base/src/main/java/smile/data/CategoricalEncoder.java)
- Register the measure in the DataFrame type system so formulas and transformations can use it automatically (base/src/main/java/smile/data/DataFrame.java)
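The steps above can be sketched in miniature. The class below is illustrative only — CyclicScale is the hypothetical name from step 1, and its methods stand in for the "semantic properties" of step 2 (a sin/cos encoding and a wrap-around distance for cyclic data such as hour-of-day). It does not use SMILE's real Measure interface.

```java
// Hypothetical sketch of the measure pattern described above — not SMILE's API.
// A cyclic scale encodes a value as (sin, cos) components on the unit circle so
// that distance wraps around: hour 23 is close to hour 0.
final class CyclicScale {
    private final double period;

    CyclicScale(double period) {
        if (period <= 0) throw new IllegalArgumentException("period must be positive");
        this.period = period;
    }

    /** Encode a raw value into two features on the unit circle. */
    double[] encode(double value) {
        double angle = 2.0 * Math.PI * (value % period) / period;
        return new double[] { Math.sin(angle), Math.cos(angle) };
    }

    /** Distance between two raw values under the cyclic semantics. */
    double distance(double a, double b) {
        double[] pa = encode(a), pb = encode(b);
        return Math.hypot(pa[0] - pb[0], pa[1] - pb[1]);
    }
}
```

A real implementation would additionally plug into CategoricalEncoder or a Transform subclass (step 3) so DataFrame columns pick up the encoding automatically.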
Add a New Feature Transformation in Formulas
- Create a new Term subclass (e.g., Log.java, Sqrt.java) in base/src/main/java/smile/data/formula/ extending AbstractFunction or AbstractBiFunction (base/src/main/java/smile/data/formula/AbstractFunction.java)
- Implement the apply() method to compute the transformation and getVariables() to track dependencies (base/src/main/java/smile/data/formula/Abs.java)
- Update Formula.java to expose the new transformation as a fluent API method or static factory (base/src/main/java/smile/data/formula/Formula.java)
- Add unit tests in the corresponding test module to verify the transformation applies correctly to DataFrames (base/src/main/java/smile/data/formula/package-info.java)
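As a rough sketch of the two obligations the checklist names — apply() computes the transformed value, and a variables accessor reports the columns the term reads — consider the stand-in below. The FunctionTerm class is hypothetical; SMILE's real AbstractFunction hierarchy in base/src/main/java/smile/data/formula/ differs in its signatures.

```java
import java.util.List;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Illustrative stand-in for a formula function term, not SMILE's real class.
final class FunctionTerm {
    private final String name;           // display name, e.g. "log(income)"
    private final String column;         // source column this term reads
    private final DoubleUnaryOperator f; // the scalar transformation

    FunctionTerm(String fname, String column, DoubleUnaryOperator f) {
        this.name = fname + "(" + column + ")";
        this.column = column;
        this.f = f;
    }

    String name() { return name; }

    /** Columns this term depends on (dependency tracking for the formula). */
    List<String> variables() { return List.of(column); }

    /** Apply the transformation to one row, looked up by column name. */
    double apply(Map<String, Double> row) { return f.applyAsDouble(row.get(column)); }
}
```

In the real codebase the fluent factory in Formula.java (step 3) is what makes such a term usable from the DSL.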
Add a New Compressed Sensing Algorithm
- Create a new algorithm class (e.g., ThresholdingAlgorithm.java) in base/src/main/java/smile/cs/ implementing the sparse recovery interface (base/src/main/java/smile/cs/MeasurementMatrix.java)
- Implement the core iterative reconstruction logic (fit/solve methods) using the MeasurementMatrix interface for signal sensing (base/src/main/java/smile/cs/OMP.java)
- Document convergence criteria and recovery guarantees (add a .md file in base/) (base/COMPRESSED_SENSING.md)
- Integrate into the formula/dataset pipeline if applicable for feature recovery in ML models (base/src/main/java/smile/data/transform/Transform.java)
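The iterative structure such an algorithm follows (correlate, select, least-squares, update residual) can be seen in a simplified orthogonal matching pursuit. This standalone sketch uses plain arrays and normal equations; SMILE's production version in base/src/main/java/smile/cs/OMP.java is the one to study before contributing.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified OMP sketch: greedy sparse recovery of x with y ≈ A x.
final class OmpSketch {
    /** Recover a k-sparse solution. A is m×n, row-major. */
    static double[] solve(double[][] a, double[] y, int k) {
        int m = a.length, n = a[0].length;
        double[] residual = y.clone();
        List<Integer> support = new ArrayList<>();
        double[] x = new double[n];
        for (int iter = 0; iter < k; iter++) {
            // 1. Select the column most correlated with the current residual.
            int best = -1;
            double bestCorr = 0;
            for (int j = 0; j < n; j++) {
                if (support.contains(j)) continue;
                double dot = 0;
                for (int i = 0; i < m; i++) dot += a[i][j] * residual[i];
                if (Math.abs(dot) > Math.abs(bestCorr)) { bestCorr = dot; best = j; }
            }
            if (best < 0) break;
            support.add(best);
            // 2. Least squares on the selected columns.
            double[] coef = leastSquares(a, y, support);
            for (int s = 0; s < support.size(); s++) x[support.get(s)] = coef[s];
            // 3. Update the residual r = y - A x restricted to the support.
            for (int i = 0; i < m; i++) {
                double pred = 0;
                for (int s = 0; s < support.size(); s++) pred += a[i][support.get(s)] * coef[s];
                residual[i] = y[i] - pred;
            }
        }
        return x;
    }

    // Solve the normal equations [AᵀA | Aᵀy] on the support by Gauss-Jordan.
    private static double[] leastSquares(double[][] a, double[] y, List<Integer> cols) {
        int m = a.length, p = cols.size();
        double[][] g = new double[p][p + 1];
        for (int r = 0; r < p; r++) {
            for (int c = 0; c < p; c++)
                for (int i = 0; i < m; i++) g[r][c] += a[i][cols.get(r)] * a[i][cols.get(c)];
            for (int i = 0; i < m; i++) g[r][p] += a[i][cols.get(r)] * y[i];
        }
        for (int col = 0; col < p; col++) {
            int piv = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(g[r][col]) > Math.abs(g[piv][col])) piv = r;
            double[] tmp = g[col]; g[col] = g[piv]; g[piv] = tmp;
            for (int r = 0; r < p; r++) {
                if (r == col || g[col][col] == 0) continue;
                double f = g[r][col] / g[col][col];
                for (int c = col; c <= p; c++) g[r][c] -= f * g[col][c];
            }
        }
        double[] out = new double[p];
        for (int r = 0; r < p; r++) out[r] = g[r][p] / g[r][r];
        return out;
    }
}
```

A new algorithm class would keep this loop shape but swap in its own selection and thresholding rules, with convergence criteria documented per the checklist.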
Add Support for a New Data Source
- Extend the SQL.java class or create a new loader class (e.g., ParquetLoader.java) in base/src/main/java/smile/data/ (base/src/main/java/smile/data/SQL.java)
- Implement a factory method that returns a DataFrame by parsing the external format into Tuple/Row objects (base/src/main/java/smile/data/DataFrame.java)
- Ensure type inference or explicit schema specification for columns based on the data source's metadata (base/src/main/java/smile/data/Collectors.java)
- Add sample code and documentation in base/DATA_IO.md showing how to load data from the new source (base/DATA_IO.md)
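The type-inference step in the checklist can be sketched without any SMILE dependency. The CsvSketch class below is purely illustrative (SMILE's real readers are documented in base/DATA_IO.md): it walks each column's string values and picks the narrowest type that fits, the same widening logic (INT → DOUBLE → STRING) a loader would apply before building a schema.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical loader sketch: infer a per-column type from observed values.
final class CsvSketch {
    enum ColType { INT, DOUBLE, STRING }

    /** Widen from INT to DOUBLE to STRING as non-conforming values appear. */
    static ColType inferType(List<String> values) {
        ColType t = ColType.INT;
        for (String v : values) {
            if (t == ColType.INT && !v.matches("-?\\d+")) t = ColType.DOUBLE;
            if (t == ColType.DOUBLE && !v.matches("-?\\d+(\\.\\d+)?")) return ColType.STRING;
        }
        return t;
    }

    /** Infer column types from comma-separated lines (header first, no quoting). */
    static ColType[] schema(List<String> lines) {
        String[] header = lines.get(0).split(",");
        ColType[] types = new ColType[header.length];
        for (int c = 0; c < header.length; c++) {
            List<String> col = new ArrayList<>();
            for (int r = 1; r < lines.size(); r++) col.add(lines.get(r).split(",")[c]);
            types[c] = inferType(col);
        }
        return types;
    }
}
```

A real loader would map these inferred types onto SMILE's StructType/Measure machinery rather than a bare enum.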
🔧Why these technologies
- Java 25 / JVM — High-performance compiled language with mature ML ecosystem; provides native interop for BLAS/LAPACK libraries and horizontal scaling via framework distribution.
- Scala & Kotlin APIs — Idiomatic language bindings reduce friction for functional/multiparadigm developers; Scala collections and type inference improve DSL ergonomics.
- Gradle Kotlin DSL + SBT — Multi-language polyglot build; Gradle handles Java/Kotlin modules, SBT manages Scala artifacts and cross-compilation.
- Formula DSL (Term-based) — R-style formula syntax (y ~ x1 + x2) is familiar to statisticians; enables declarative feature engineering without explicit code.
⚖️Trade-offs already made
- In-memory DataFrame vs. streaming/lazy evaluation
  - Why: SMILE prioritizes single-node performance and algorithm correctness; in-memory loading enables complex transformations and random access.
  - Consequence: Large datasets (100GB+) require external data sampling or distributed framework wrapping; not suitable for unbounded streaming without custom extensions.
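One practical way to live with the in-memory trade-off is to down-sample a large stream to a fixed-size subset before building the DataFrame. This is a generic stdlib sketch (Algorithm R reservoir sampling), not a SMILE API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Stream;

// Uniform sample of up to k items from a stream of unknown length.
final class Reservoir {
    static <T> List<T> sample(Stream<T> stream, int k, long seed) {
        Random rng = new Random(seed);
        List<T> reservoir = new ArrayList<>(k);
        long[] seen = {0};
        stream.forEach(item -> {
            seen[0]++;
            if (reservoir.size() < k) {
                reservoir.add(item);                           // fill phase
            } else {
                long j = (long) (rng.nextDouble() * seen[0]);  // uniform in [0, seen)
                if (j < k) reservoir.set((int) j, item);       // replace with prob k/seen
            }
        });
        return reservoir;
    }
}
```

The sampled subset then fits the in-memory DataFrame model at the cost of working on a statistical slice rather than the full data.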
🪤Traps & gotchas
- Version coupling: Java 25 (v5+) is non-negotiable; running on Java 21 will fail.
- Native bindings: BLAS/LAPACK features require libopenblas, liblapack, and libgfortran system libraries (these vary by OS — see the Maven publish workflow for the build environment).
- GPU features: LibTorch deep learning requires the CUDA toolkit and appropriate GPU drivers unless using CPU-only builds.
- SBT quirks: the .sbtopts file controls SBT settings; pay attention if modifying parallel compilation flags.
- Formula API: the formula syntax (see base/FORMULA.md) has its own semantics that differ from R and pandas conventions in detail; budget time to learn SMILE's specific syntax.
🏗️Architecture
💡Concepts to learn
- BK-Tree (Burkhard-Keller Tree) — SMILE implements BK-Tree for efficient nearest-neighbor search without distance matrix materialization; critical for scaling KNN beyond small datasets
- TreeSHAP — SMILE's TreeSHAP feature importance method provides model interpretability for ensemble learners (Random Forest, GBDT) used in production ML systems
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) — Memory-efficient hierarchical clustering for massive datasets; SMILE implements it as an alternative to traditional linkage-based clustering
- Compressed Sensing (basis pursuit, orthogonal matching pursuit) — SMILE's base/cs/ package implements sparse signal recovery from compressed measurements; foundational for signal processing and dimensionality reduction
- Variational Inference (GAP algorithm) — SMILE implements approximation algorithms for intractable Bayesian inference; see base/GAP.md for probabilistic model learning
- RBF (Radial Basis Function) networks — SMILE provides both RBF classification/regression and RBF kernel for SVMs; fundamental nonlinear approximation technique
- Tokenizer (BPE, LLaMA tiktoken) — SMILE's LLM module includes Byte Pair Encoding tokenizer compatible with LLaMA-3; essential for prompt preparation in inference pipelines
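The BK-tree entry above is worth internalizing before reading SMILE's version. This self-contained sketch (not SMILE's implementation) shows why the structure prunes: children are keyed by their edit distance to the node, and the triangle inequality lets a radius query skip every subtree whose key falls outside [d - radius, d + radius].

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree over strings with Levenshtein distance.
final class BkTree {
    private final String word;
    private final Map<Integer, BkTree> children = new HashMap<>();

    BkTree(String root) { word = root; }

    void add(String w) {
        int d = editDistance(word, w);
        if (d == 0) return;                       // already stored
        BkTree child = children.get(d);
        if (child == null) children.put(d, new BkTree(w));
        else child.add(w);
    }

    /** All stored words within the given edit-distance radius of the query. */
    List<String> query(String q, int radius) {
        List<String> out = new ArrayList<>();
        collect(q, radius, out);
        return out;
    }

    private void collect(String q, int radius, List<String> out) {
        int d = editDistance(word, q);
        if (d <= radius) out.add(word);
        for (Map.Entry<Integer, BkTree> e : children.entrySet())
            if (e.getKey() >= d - radius && e.getKey() <= d + radius)  // triangle-inequality prune
                e.getValue().collect(q, radius, out);
    }

    /** Two-row Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }
}
```

The same pruning idea is what lets KNN over metric spaces avoid materializing a full distance matrix, as the concepts list notes.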
🔗Related repos
- apache/spark — Spark MLlib is the competing large-scale distributed ML framework on the JVM; SMILE targets single-machine performance, but users often evaluate both for different scalability needs
- deeplearning4j/deeplearning4j — DL4J provides another JVM-based deep learning option; SMILE's LibTorch backend is a newer alternative focusing on modern architectures (EfficientNet, LLaMA)
- tribuo/tribuo — Oracle's Tribuo is a lightweight, modular ML library for the JVM; both compete in the 'practical ML without heavyweight frameworks' space
- haifengl/smile-nlp — Sister repository (if it exists) likely containing NLP-specific modules and tokenizers (tiktoken BPE mentioned in features); check haifengl's org for companion repos
- scikit-learn/scikit-learn — Industry reference implementation for many algorithms (SVM, RBF, ensemble methods) that SMILE ports to the JVM; useful for algorithm validation and API design inspiration
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for smile/data/formula package
The formula package (base/src/main/java/smile/data/formula/) is critical for data transformation and model specification, but there are no visible test files in the provided structure. This package contains formula parsing, function definitions (Abs.java, AbstractFunction.java, etc.), and needs extensive test coverage to prevent regressions in data processing pipelines.
- [ ] Create base/src/test/java/smile/data/formula/ directory structure
- [ ] Add unit tests for AbstractFunction and AbstractBiFunction covering edge cases
- [ ] Add integration tests for formula parsing with various data types
- [ ] Test formula evaluation with missing values and categorical variables
- [ ] Add performance tests for complex formula expressions
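The real PR would write JUnit tests against smile.data.formula classes applied to DataFrames; as a framework-free sketch of the edge cases worth pinning down for an Abs-style term, the check below uses Math.abs as a stand-in for the term's apply():

```java
// Edge cases an Abs-like formula function should handle (Math.abs as stand-in).
final class AbsEdgeCases {
    static boolean allPass() {
        return Math.abs(-3.5) == 3.5
            && Double.doubleToRawLongBits(Math.abs(-0.0)) == 0L    // -0.0 normalizes to +0.0
            && Double.isNaN(Math.abs(Double.NaN))                  // missing values propagate
            && Math.abs(Double.NEGATIVE_INFINITY) == Double.POSITIVE_INFINITY;
    }
}
```

NaN propagation in particular matters for the "missing values" checklist item: a formula term should carry missingness through rather than throw.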
Implement GitHub Actions workflow for SBT builds
The project supports both Gradle (build.gradle.kts) and SBT (build.sbt) builds as evidenced by files in base/build.sbt and .sbtopts config, but the CI workflow (.github/workflows/ci.yml) likely only tests Gradle. Adding SBT-specific CI ensures consistency across both build systems and prevents SBT-specific regressions.
- [ ] Review current ci.yml to understand existing Maven/Gradle testing strategy
- [ ] Create .github/workflows/sbt-ci.yml for SBT-specific builds
- [ ] Test against Java 25 (v5+), Java 21 (v4.x) as per README requirements
- [ ] Add SBT test coverage reporting to ci.yml or new workflow
- [ ] Document SBT build process in CONTRIBUTING.md if not present
Add missing documentation for compressed sensing and sparse data modules
The base/ directory contains COMPRESSED_SENSING.md and documentation for other modules, but the actual CS implementation (smile/cs/ package with BasisPursuit.java, CoSaMP.java, OMP.java) lacks structured API examples. New contributors need clear guides on how to use these algorithms with provided examples.
- [ ] Review base/COMPRESSED_SENSING.md for completeness and update with API examples
- [ ] Add practical examples showing BasisPursuit, CoSaMP, and OMP usage
- [ ] Document MeasurementMatrix and its role in compressed sensing workflows
- [ ] Add benchmarking examples comparing the three CS algorithms
- [ ] Cross-reference sparse dataset classes (SparseDataset.java, BinarySparseDataset.java) in documentation
🌿Good first issues
- Add missing unit tests for base/src/main/java/smile/cs/ (CoSaMP.java, OMP.java, BasisPursuit.java) compressed sensing algorithms—these are numerically complex and lack comprehensive edge-case coverage
- Extend documentation in base/ with worked examples: create BASE_EXAMPLES.md with Scala notebook snippets showing DataFrame → formula → SVM/RandomForest → prediction pipeline for common datasets (iris, mnist)
- Implement Kotlin extension functions for common algorithm chains (e.g., smile.learn.classifier in Kotlin): wrap boilerplate like cross-validation, hyperparameter tuning to match Scala ergonomics
⭐Top contributors
Click to expand
Top contributors
- @haifengl — 98 commits
- @dependabot[bot] — 2 commits
📝Recent commits
Click to expand
Recent commits
- 17b9c1a — quarkus 3.35.2 (haifengl)
- 4abc618 — Merge pull request #862 from haifengl/dependabot/devcontainers/ghcr.io/devcontainers/features/node-2.0.0 (haifengl)
- 68f6d81 — Bump ghcr.io/devcontainers/features/node from 1.7.1 to 2.0.0 (dependabot[bot])
- 03f5c17 — disable parallel build to save memory for CodeQL (haifengl)
- 644a762 — multi-thread safe edit distance (haifengl)
- 758f2f6 — serve/jacoco-report-aggregation (haifengl)
- 3b27a16 — jacoco plugin (haifengl)
- 7be2a96 — remove input parameter 'version' (haifengl)
- b29d800 — rename cd.yml to release.yml (haifengl)
- 245f43d — v6.1.0 (haifengl)
🔒Security observations
The SMILE codebase demonstrates a reasonable security posture with established security policy and CI/CD infrastructure. However, potential risks exist around SQL operations, formula evaluation, and dependency management. The project has proper vulnerability reporting mechanisms in place via GitHub's private security advisory. Key recommendations: (1) Review SQL.java implementation for injection prevention, (2) Complete the SECURITY.md documentation, (3) Implement strict input validation in formula evaluation, (4) Enhance dependency vulnerability scanning in CI/CD pipelines. The Java-based nature of the project and use of established frameworks provides some inherent security benefits, but careful attention to data handling and input validation remains critical.
- Medium · SQL Injection Risk in SQL.java — base/src/main/java/smile/data/SQL.java. The file typically handles database operations. Without seeing the implementation, there is a potential risk of SQL injection if raw queries are built by string concatenation rather than with parameterized queries. Fix: Review SQL.java to ensure all database queries use parameterized or prepared statements. Avoid string concatenation when building SQL. Implement input validation and sanitization for all user-provided data.
- Low · Incomplete SECURITY.md File — SECURITY.md. The file appears to be truncated at the end ('At the bot'), leaving the security policy incomplete and the vulnerability reporting process unclear. Fix: Complete SECURITY.md with full instructions for security vulnerability reporting, including contact information, response times, and any other relevant disclosure guidelines.
- Low · Formula/Expression Evaluation Risk — base/src/main/java/smile/data/formula/ (multiple files). The codebase includes multiple formula and expression evaluation classes (Formula.java, various operators and functions). These could be vulnerable to expression injection if user input is evaluated without validation. Fix: Implement strict input validation and sanitization for all formula/expression parsing. Consider a safe expression evaluator with limited scope, and document any security constraints on formula evaluation.
- Low · Missing OWASP Dependency Check Integration — .github/workflows/ and base/build.gradle.kts. While GitHub Actions workflows are present (ci.yml, codeql.yml), the visible build configuration shows no explicit dependency scanning for known vulnerabilities. Fix: Integrate OWASP Dependency-Check or a similar tool into the CI/CD pipeline, and consider enabling GitHub's Dependabot alerts.
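The input-validation half of these recommendations can be made concrete. Values belong in JDBC PreparedStatement placeholders, but identifiers (table and column names) cannot be parameterized, so the usual defense is a strict whitelist. The class below is illustrative only and is not taken from SQL.java:

```java
import java.util.regex.Pattern;

// Whitelist check for SQL identifiers that cannot go through placeholders.
final class SqlIdentifiers {
    private static final Pattern SAFE = Pattern.compile("[A-Za-z_][A-Za-z0-9_]{0,63}");

    /** Return the identifier if it is whitelist-safe, else throw. */
    static String checked(String identifier) {
        if (!SAFE.matcher(identifier).matches())
            throw new IllegalArgumentException("unsafe SQL identifier: " + identifier);
        return identifier;
    }
}
```

A query is then assembled as "SELECT " + checked(col) + " FROM t WHERE id = ?", with the value bound through the prepared statement rather than concatenated.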
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.