NaturalNode/natural

Item: NaturalNode/natural
Rating: 5
Author: RepoPilot

general natural language facilities for node

Healthy

Healthy across all four use cases

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

✓ Last commit 2mo ago
✓ 14 active contributors
✓ MIT licensed
✓ CI configured
✓ Tests present
⚠ Single-maintainer risk — top contributor 85% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the “Healthy” badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/naturalnode/natural)](https://repopilot.app/r/naturalnode/natural)

Paste at the top of your README.md — renders inline like a shields.io badge.

Onboarding doc

Onboarding: NaturalNode/natural

Generated by RepoPilot · 2026-05-07 · Source

Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
Treat sections marked "AI · unverified" as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/NaturalNode/natural shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

Verdict

GO — Healthy across all four use cases

Last commit 2mo ago
14 active contributors
MIT licensed
CI configured
Tests present
⚠ Single-maintainer risk — top contributor 85% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live NaturalNode/natural repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/NaturalNode/natural.

What it runs against: a local clone of NaturalNode/natural — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in NaturalNode/natural | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 103 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>NaturalNode/natural</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of NaturalNode/natural. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/NaturalNode/natural.git
#   cd natural
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of NaturalNode/natural and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "NaturalNode/natural(\.git)?\b" \
  && ok "origin remote is NaturalNode/natural" \
  || miss "origin remote is not NaturalNode/natural (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "index.js" \
  && ok "index.js" \
  || miss "missing critical file: index.js"
test -f "lib/natural/classifiers/classifier.js" \
  && ok "lib/natural/classifiers/classifier.js" \
  || miss "missing critical file: lib/natural/classifiers/classifier.js"
test -f "lib/natural/brill_pos_tagger/lib/Brill_POS_Tagger.js" \
  && ok "lib/natural/brill_pos_tagger/lib/Brill_POS_Tagger.js" \
  || miss "missing critical file: lib/natural/brill_pos_tagger/lib/Brill_POS_Tagger.js"
test -f "lib/natural/analyzers/sentence_analyzer.js" \
  && ok "lib/natural/analyzers/sentence_analyzer.js" \
  || miss "missing critical file: lib/natural/analyzers/sentence_analyzer.js"
test -f "package.json" \
  && ok "package.json" \
  || miss "missing critical file: package.json"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 103 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~73d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/NaturalNode/natural"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>
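Because each check prints a line starting with ok: or FAIL:, an agent loop can also read the captured output rather than relying on the exit code alone. The following is a minimal sketch in JavaScript, assuming the script's output is already available as a string; summarizeVerifyOutput is a hypothetical helper and not part of RepoPilot or natural.

```javascript
// Hypothetical helper: summarize captured output of the verification script.
// Assumes each check line starts with "ok:" or "FAIL:", as described above.
function summarizeVerifyOutput(output) {
  const passed = [];
  const failed = [];
  for (const rawLine of output.split('\n')) {
    const line = rawLine.trim();
    if (line.startsWith('ok:')) {
      passed.push(line.slice('ok:'.length).trim());
    } else if (line.startsWith('FAIL:')) {
      failed.push(line.slice('FAIL:'.length).trim());
    }
  }
  // Any FAIL line means the artifact should be treated as stale.
  return { passed, failed, stale: failed.length > 0 };
}

const sample = [
  'ok:   origin remote is NaturalNode/natural',
  'ok:   license is MIT',
  'FAIL: default branch master no longer exists'
].join('\n');

const summary = summarizeVerifyOutput(sample);
console.log(summary.stale, summary.failed);
```

The summary lines at the end of the script ("artifact verified ..." and "artifact has ... stale claim(s)") carry neither prefix, so this sketch ignores them.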

TL;DR

Natural is a Node.js natural language processing (NLP) library. It provides tokenization, stemming in multiple languages (English, Russian, Spanish), part-of-speech tagging, sentiment analysis, text classification, phonetic matching (Soundex, Metaphone), TF-IDF, WordNet integration, and string similarity metrics (Jaro-Winkler, Levenshtein, Dice's Coefficient), bundled into a single JavaScript/TypeScript API for server-side language processing.

The structure is monolithic: lib/natural/ contains the feature domains tokenizers/, stemmers/, classifiers/, phonetics/, tfidf/, inflection/, brill_pos_tagger/. Examples in examples/ showcase each module independently (e.g., examples/classification/, examples/stemming/). Test specs in io_spec/ cover I/O and persistence (Bayes and MaxEnt classifiers with file storage). Benchmarks in benchmarks/ isolate Metaphone, Soundex, and stemmer performance.

Who it's for

Node.js developers building text processing pipelines who need production-ready NLP algorithms without external service dependencies (e.g., building chatbots, content classifiers, search engines, or sentiment analysis systems). Maintainers include original architects (Chris Umbel, Rob Ellis, Russell Mull, Hugo W.L. ter Doest) and active community contributors.

Maturity & risk

Actively maintained at v8.1.1 with CI/CD on GitHub Actions (linting, testing, CodeQL), test coverage via nyc, and integrated TypeScript support. The project enforces StandardJS style and has 727k lines of JavaScript and 358k of TypeScript. package.json declares an engines range of node >=0.4.10; that is a declared floor rather than evidence of tested compatibility with releases that old. Production-ready for most NLP tasks.

Dependencies are moderate (12 direct deps including mongoose, pg, redis for optional backends; afinn-165 for sentiment; wordnet-db for lexical data) but well-established packages. The library is community-maintained (not a commercial product) with single points of failure in specialized modules (Brill POS tagger, WordNet integration). Breaking changes are possible across major versions; review CONTRIBUTING.md and pull_request_template.md before upgrading.

Active areas of work

Active development on v8.1.1 with TypeScript definitions, storage backend abstraction (MongoDB, PostgreSQL, Redis support via storage plugins), and expanded language support. CI validates every commit (node.js.yml), publishes to npm on release (npm-publish.yml), and runs CodeQL security scanning. Recent work likely includes classifier I/O improvements (MaxEntClassifier, LogisticRegressionClassifier in io_spec/) and optimization benchmarks.

Get running

git clone https://github.com/NaturalNode/natural.git && cd natural && npm install && npm test (runs the Jasmine suite; jasmine is listed in devDependencies). For examples: node examples/tokenizer/testSentenceTokenizer.js or node benchmarks/metaphone.js.

Daily commands: npm install (installs dependencies). npm test (runs the Jasmine test suite with nyc coverage configured in .nycrc). npm run benchmark (executes benchmarks/index.js). npm run build (runs build:tests and build:esm via Rollup). npm run lint (ESLint with the standard-with-typescript config).

Map of the codebase

index.js — Main entry point exporting all Natural NLP modules; essential for understanding the public API surface.
lib/natural/classifiers/classifier.js — Core classifier abstraction; foundational for training and prediction workflows across all ML components.
lib/natural/brill_pos_tagger/lib/Brill_POS_Tagger.js — Part-of-speech tagging implementation; critical NLP functionality used in text analysis pipelines.
lib/natural/analyzers/sentence_analyzer.js — Sentence tokenization and analysis; essential preprocessing step for most NLP workflows.
package.json — Declares all external dependencies and versions; critical for understanding runtime requirements and capabilities.
lib/natural/classifiers/bayes_classifier.js — Naive Bayes classifier implementation; widely-used statistical classification algorithm in the library.

Components & responsibilities

Classifier base (lib/natural/classifiers/classifier.js) — core classifier abstraction for training and prediction workflows

How to make changes

Add a new classifier type

Create new classifier class in lib/natural/classifiers/ extending the Classifier base class (lib/natural/classifiers/classifier.js)
Implement required methods: train(doc, label), classify(doc), and toJSON()/fromJSON() for persistence (lib/natural/classifiers/your_new_classifier.js)
Export the new classifier from the module index (lib/natural/classifiers/index.js)
Add export to main library entry point (index.js)
Create integration tests in io_spec/ for training, classification, and I/O operations (io_spec/your_classifier_spec.js)
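The steps above can be sketched as follows. This is a self-contained illustration only: BaseClassifier is a hypothetical stand-in for the library's Classifier base class, WordCountClassifier is an invented toy model, and the real method contracts in lib/natural/classifiers/classifier.js may differ.

```javascript
// Hypothetical stand-in for the library's Classifier base class; the real
// interface in lib/natural/classifiers/classifier.js may differ.
class BaseClassifier {
  train() { throw new Error('train() not implemented'); }
  classify() { throw new Error('classify() not implemented'); }
}

// Toy classifier: scores each label by how often it has seen the doc's words.
class WordCountClassifier extends BaseClassifier {
  constructor() {
    super();
    this.counts = new Map(); // label -> Map(word -> count)
  }

  static tokenize(doc) {
    return doc.toLowerCase().split(/\W+/).filter(Boolean);
  }

  train(doc, label) {
    if (!this.counts.has(label)) this.counts.set(label, new Map());
    const words = this.counts.get(label);
    for (const word of WordCountClassifier.tokenize(doc)) {
      words.set(word, (words.get(word) || 0) + 1);
    }
  }

  classify(doc) {
    const tokens = WordCountClassifier.tokenize(doc);
    let best = null;
    let bestScore = -1;
    for (const [label, words] of this.counts) {
      const score = tokens.reduce((sum, w) => sum + (words.get(w) || 0), 0);
      if (score > bestScore) {
        best = label;
        bestScore = score;
      }
    }
    return best; // null if nothing has been trained yet
  }

  // Persistence: Maps are flattened to arrays so JSON.stringify can handle them.
  toJSON() {
    return { counts: [...this.counts].map(([label, words]) => [label, [...words]]) };
  }

  static fromJSON(json) {
    const restored = new WordCountClassifier();
    restored.counts = new Map(
      json.counts.map(([label, words]) => [label, new Map(words)])
    );
    return restored;
  }
}

const clf = new WordCountClassifier();
clf.train('buy cheap pills now', 'spam');
clf.train('meeting notes attached', 'ham');

// Round-trip through JSON to exercise the persistence methods.
const restoredClf = WordCountClassifier.fromJSON(JSON.parse(JSON.stringify(clf)));
console.log(restoredClf.classify('cheap pills'));
```

A spec for a real classifier would follow the same train, serialize, restore, classify round trip.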

Add support for a new language in POS tagging

Create language lexicon JSON file mapping words to tags (lib/natural/brill_pos_tagger/data/YourLanguage/brill_Lexicon.json)
Create context transformation rules JSON file defining tag correction patterns (lib/natural/brill_pos_tagger/data/YourLanguage/brill_CONTEXTRULES.json)
Update Brill_POS_Tagger.js to load language-specific data conditionally (lib/natural/brill_pos_tagger/lib/Brill_POS_Tagger.js)
Add language option to the public index and document language parameter (lib/natural/brill_pos_tagger/index.js)
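To make the roles of the two data files concrete, here is a minimal sketch of transformation-based tagging: a lexicon assigns each word an initial tag, and context rules then correct tags based on their neighbours. The data shapes, the rule format, and the tag function are assumptions made for illustration; the actual lexicon and rule files consumed by Brill_POS_Tagger.js may be structured differently.

```javascript
// Hypothetical data shapes; the real lexicon and rule file formats may differ.
const lexicon = new Map([
  ['the', ['DT']],
  ['can', ['MD', 'NN']], // first entry is treated as the most likely tag
  ['rusts', ['VBZ']]
]);

// Each rule: retag `from` as `to` when the previous token has tag `prevTag`.
const contextRules = [{ from: 'MD', to: 'NN', prevTag: 'DT' }];

function tag(tokens, defaultTag = 'NN') {
  // Step 1: assign each word its most likely tag from the lexicon.
  const tagged = tokens.map((token) => {
    const entry = lexicon.get(token.toLowerCase());
    return { token, tag: entry ? entry[0] : defaultTag };
  });
  // Step 2: apply the context rules in order to correct the initial tags.
  for (const rule of contextRules) {
    for (let i = 1; i < tagged.length; i++) {
      if (tagged[i].tag === rule.from && tagged[i - 1].tag === rule.prevTag) {
        tagged[i].tag = rule.to;
      }
    }
  }
  return tagged;
}

const result = tag(['The', 'can', 'rusts']);
console.log(result.map((t) => `${t.token}/${t.tag}`).join(' '));
```

Here "can" starts as a modal (MD) and is retagged as a noun (NN) because it follows a determiner.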

Add a new text analysis feature

Create analyzer module in lib/natural/analyzers/ following SenType interface (lib/natural/analyzers/your_new_analyzer.ts)
Implement analyze() method accepting text string and returning analysis result (lib/natural/analyzers/your_new_analyzer.ts)
Export analyzer from analyzers module index (lib/natural/analyzers/index.js)
Re-export from main library entry point (index.js)
Create example demonstrating typical usage pattern (examples/analysis/your_feature_example.js)
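As a rough illustration of the analyze() contract described above, the sketch below takes a text string and returns a plain result object. The class name SentenceLengthAnalyzer and the result fields are invented for this example and do not reflect the library's analyzer interface.

```javascript
// Hypothetical analyzer; class name and result shape are illustrative only.
class SentenceLengthAnalyzer {
  analyze(text) {
    // Naive sentence split: break on whitespace that follows ., ! or ?
    const sentences = text
      .split(/(?<=[.!?])\s+/)
      .filter((s) => s.trim().length > 0);
    const wordCounts = sentences.map(
      (s) => s.split(/\s+/).filter(Boolean).length
    );
    const totalWords = wordCounts.reduce((sum, n) => sum + n, 0);
    return {
      sentenceCount: sentences.length,
      averageWords: sentences.length ? totalWords / sentences.length : 0
    };
  }
}

const analysis = new SentenceLengthAnalyzer().analyze(
  'Hello world. How are you today?'
);
console.log(analysis);
```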

Why these technologies

Brill POS Tagging — Industry-standard transformation-based approach for accurate part-of-speech tagging; enables rule-based linguistic knowledge without training data
Naive Bayes Classifier — Probabilistic text classification; works well with sparse bag-of-words features; fast inference and training
Logistic Regression Classifier — Linear model for text classification; provides probability calibration; interpretable feature weights
Node.js/JavaScript — Enables NLP processing in web applications and server-side JavaScript runtimes; cross-platform compatibility
Redis/Memcached support — Optional distributed caching for classifier predictions and tagging results; improves performance in high-throughput scenarios
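Of the techniques listed above, logistic regression is the one whose "interpretable feature weights" are easiest to show in a few lines. The sketch below is a textbook implementation trained with stochastic gradient descent on bag-of-words vectors; it is not the library's LogisticRegressionClassifier, and the vocabulary, data, and hyperparameters are made up for the example.

```javascript
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

function predict(model, x) {
  const z = x.reduce((sum, xi, j) => sum + xi * model.weights[j], model.bias);
  return sigmoid(z);
}

// Textbook stochastic gradient descent on the log-loss.
function trainLogistic(samples, labels, { epochs = 200, rate = 0.5 } = {}) {
  const model = { weights: new Array(samples[0].length).fill(0), bias: 0 };
  for (let epoch = 0; epoch < epochs; epoch++) {
    samples.forEach((x, i) => {
      const error = labels[i] - predict(model, x);
      x.forEach((xi, j) => {
        model.weights[j] += rate * error * xi;
      });
      model.bias += rate * error;
    });
  }
  return model;
}

// Illustrative vocabulary: feature 0 counts "great", feature 1 counts "awful".
const samples = [[1, 0], [0, 1], [1, 0], [0, 1]];
const labels = [1, 0, 1, 0]; // 1 = positive, 0 = negative

const model = trainLogistic(samples, labels);
console.log(model.weights, predict(model, [1, 0]), predict(model, [0, 1]));
```

After training, the weight for "great" is positive and the weight for "awful" is negative, which is what makes the model easy to inspect.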

Trade-offs already made

Rule-based POS tagging instead of neural models
- Why: Deterministic behavior, fast inference, no ML training required, language-agnostic framework
- Consequence: Lower accuracy than modern transformers; requires manual rule creation per language
Event-based probabilistic classifiers instead of neural networks
- Why: Simple APIs, fast training, interpretable, low memory footprint, no deep learning dependency
- Consequence: Lower accuracy on complex tasks; less suitable for very large datasets; no attention mechanisms
In-memory JSON storage by default with pluggable backends
- Why: Zero external dependencies for basic usage; flexibility for production scenarios
- Consequence: Non-persistent by default; models lost on process termination; optional DB setup required for durability
Synchronous APIs with optional parallel training
- Why: Simpler developer experience; matches Node callback conventions
- Consequence: Blocks event loop during long operations; requires explicit parallelization for large training sets
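One common way to soften the event-loop consequence noted above is to split a long synchronous job into batches and yield between them. The sketch below shows that pattern with Node's setImmediate; batches, trainInBatches, and the trainOne callback are hypothetical names, and nothing here is an API of natural.

```javascript
// Hypothetical helper: split work into fixed-size batches.
function* batches(items, size) {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}

// Processes one batch per event-loop turn. `trainOne` is a caller-supplied
// function standing in for any synchronous per-document training step.
function trainInBatches(items, size, trainOne) {
  return new Promise((resolve, reject) => {
    const iterator = batches(items, size);
    function step() {
      try {
        const next = iterator.next();
        if (next.done) {
          resolve();
          return;
        }
        next.value.forEach((item) => trainOne(item));
        setImmediate(step); // let pending I/O callbacks run before the next batch
      } catch (err) {
        reject(err);
      }
    }
    step();
  });
}

const seen = [];
trainInBatches(['a', 'b', 'c', 'd', 'e'], 2, (doc) => seen.push(doc))
  .then(() => console.log(`trained on ${seen.length} documents`));
```

Smaller batches keep the process more responsive at the cost of more scheduling overhead, so the batch size is a tuning choice.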

Non-goals (don't propose these)

Does not provide neural network / deep learning models (no TensorFlow, PyTorch integration)
Does not include pre-trained word embeddings or language models
Does not perform cross-lingual transfer or multilingual fine-tuning
Does not provide real-time streaming analysis (batch processing model)
Does not include speech recognition or text-to-speech
Does not handle image or multimodal input

Traps & gotchas

Storage backends are optional dependencies: pg (PostgreSQL), mongoose (MongoDB), and redis require running services, and tests in io_spec/ may skip if the backends are unavailable.
A .env file is present but its configuration is not shown; check CONTRIBUTING.md for setup.
PEG.js grammar files (lib/natural/brill_pos_tagger/lib/TF_Parser.js, tokenizers/parser_sentence_tokenizer.js) are excluded from jscpd (copy-paste detection), suggesting auto-generated or complex logic.
Running the TypeScript examples requires ts-node.
The Brill POS tagger model may need an external data file (check examples/classification/).

Concepts to learn

Porter Stemmer — Algorithms in lib/natural/stemmers/porter_stemmer.js reduce English words to root forms (running → run); essential for matching semantically identical terms in classification and search
Brill Part-of-Speech Tagger — lib/natural/brill_pos_tagger/ uses Brill's rule-based tagging to label tokens (VB, NN, JJ); required for syntactic understanding in advanced NLP tasks
Naive Bayes Classifier — Core algorithm in lib/natural/classifiers/bayes_classifier.js for probabilistic text categorization; foundation for spam filters and sentiment analysis
TF-IDF (Term Frequency–Inverse Document Frequency) — Vectorization technique in lib/natural/tfidf/ that weights terms by relevance across a corpus; used for document similarity, search ranking, and feature extraction for classifiers
Jaro-Winkler Distance — String similarity metric in lib/natural/distance/ for fuzzy matching (typo tolerance in search, entity deduplication); particularly effective for short strings
Maximum Entropy (MaxEnt) Classification — Probabilistic classifier in lib/natural/classifiers/maxent_classifier.js used in examples/classification/MaxEntAppliedToPOSTagging_spec.js; generalizes better than Naive Bayes on overlapping feature distributions
WordNet Synsets and Hypernyms — Lexical database integration via wordnet-db dependency provides semantic relationships (synonymy, hypernymy); enables word sense disambiguation and semantic similarity computation
Logistic Regression Classifier — Linear probabilistic classifier in lib/natural/classifiers/logistic_regression_classifier.js with serialization support (io_spec/logistic_regression_classifier_spec.js); bridges rule-based and neural approaches
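To ground one of the concepts above, the following sketch computes TF-IDF in its textbook form: raw term frequency multiplied by the natural log of (number of documents / documents containing the term). The tokenize and tfidf functions are written for this example; the weighting and smoothing used in lib/natural/tfidf/ may differ.

```javascript
// Textbook TF-IDF; the library's own weighting and smoothing may differ.
function tokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

function tfidf(term, docTokens, corpusTokens) {
  const tf = docTokens.filter((t) => t === term).length;
  const docsWithTerm = corpusTokens.filter((doc) => doc.includes(term)).length;
  if (tf === 0 || docsWithTerm === 0) return 0;
  const idf = Math.log(corpusTokens.length / docsWithTerm);
  return tf * idf;
}

const corpus = [
  'node is a javascript runtime',
  'ruby is a programming language',
  'node modules ship javascript code'
].map(tokenize);

// "ruby" appears in one document, "is" in two, so "ruby" gets more weight.
console.log(tfidf('ruby', corpus[1], corpus));
console.log(tfidf('is', corpus[1], corpus));
```

A term that occurs in every document would get an IDF of zero under this formula, which is why practical implementations often add smoothing.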

Related repos

axa-group/nlp.js — Alternative Node.js NLP library with named entity recognition, language detection, sentiment analysis; overlaps on stemming and classification but with different architecture
jiahaoli95/simple-statistics — Complementary stats library for Node; many NLP classifiers (Bayes, Logistic Regression) in Natural depend on statistical algorithms
thisandagain/sentiment — Focused sentiment analysis for Node; Natural uses afinn-165 for sentiment but sentiment.js is a lighter alternative
wordnet/wordnet-cli — Official WordNet command-line tool; Natural depends on wordnet-db package for lexical data integration
dariuszgulewski/porter-stemmer — Pure Porter Stemmer implementation; Natural bundles this as lib/natural/stemmers/porter_stemmer.js but this repo isolates the algorithm

PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive TypeScript type definitions for core NLP modules

The repo has partial TypeScript support (lib/natural/analyzers/index.d.ts exists) but most of lib/natural/* lacks .d.ts files. Given the devDependencies include TypeScript and @types packages, and the file structure shows many .ts files, completing type definitions would improve DX for TypeScript users and catch bugs during development.

[ ] Audit lib/natural/ subdirectories to identify .js/.ts files without corresponding .d.ts exports
[ ] Create comprehensive .d.ts files for tokenizers/, stemmers/, classifiers/, and phonetics/ modules (reference existing lib/natural/analyzers/index.d.ts as template)
[ ] Update lib/natural/index.d.ts or create if missing to export all public APIs with proper types
[ ] Add tsconfig.json check in CI pipeline (linter.yml or node.js.yml) to validate type definitions compile without errors
[ ] Update README TypeScript section with usage examples for main exported types

Add integration tests for data persistence backends (Redis, Memcached, PostgreSQL, MongoDB)

Dependencies include redis, memjs, pg, and mongoose, suggesting classifier/model serialization features. However, io_spec/ only shows basic storage tests. Add integration tests validating that trained classifiers can be persisted and restored correctly across different backends to catch regressions in serialization logic.

[ ] Create io_spec/integration/ directory for backend-specific tests
[ ] Add io_spec/integration/redis_classifier_spec.js testing BayesClassifier and MaxEntClassifier save/load with Redis
[ ] Add io_spec/integration/postgres_classifier_spec.js testing classifier persistence with pg driver
[ ] Add io_spec/integration/mongodb_classifier_spec.js testing classifier persistence with mongoose
[ ] Update .github/workflows/node.js.yml to spin up Redis/PostgreSQL/MongoDB services during test runs (using services: config)

Add ESM/CommonJS dual export validation tests and modernize benchmark suite

package.json shows build:esm is being configured and .npmignore exists, but no tests validate that both ESM and CommonJS exports work correctly. Also, benchmarks/ directory has limited coverage (only metaphone, soundex, stemmer). Add tests ensuring dual-module support doesn't break, and extend benchmarks for key modules (classifiers, tokenizers, TF-IDF).

[ ] Create test/esm-cjs-compat/ with tests importing core modules via require() and import{} to validate dual exports
[ ] Add test to verify exported types match between lib/natural/index.d.ts and actual lib/natural/index.js exports
[ ] Extend benchmarks/index.js to include BayesClassifier training/classification performance
[ ] Add benchmarks/tfidf_benchmark.js measuring TF-IDF matrix generation and scoring on sample documents
[ ] Add benchmarks/tokenizer_benchmark.js comparing performance of different tokenizer implementations
[ ] Update package.json 'benchmark' script to run all benchmark files and output comparative results

Good first issues

Add missing TypeScript .d.ts definitions for lib/natural/phonetics/ (Metaphone, Soundex classes) — type safety for phonetic comparison functions used in examples/phonetics/compare.js
Write integration tests in io_spec/ for Redis backend persistence (redis dependency installed but no spec exists yet) — currently only MongoDB/PostgreSQL tested
Expand benchmarks/index.js to include TF-IDF performance metrics — tfidf module exists in examples/tfidf/ but no benchmark suite for scalability testing

Top contributors

@Hugo-ter-Doest — 85 commits
@MukeshSinghBisht — 2 commits
@mdmower — 2 commits
@JustinBeckwith — 1 commit
@BrodaNoel — 1 commit

Recent commits

69c59f9 — Merge branch 'master' of https://github.com/NaturalNode/natural (Hugo-ter-Doest)
9e049f3 — 8.1.1 (Hugo-ter-Doest)
df6cbea — Upgrade dependencies (#769) (Hugo-ter-Doest)
e545648 — fix: drop dependency on http-server (#768) (JustinBeckwith)
6bafb34 — Added an option for keeping umlauts intact (#766) (Hugo-ter-Doest)
2550394 — Pr 762 (#765) (Hugo-ter-Doest)
bef9985 — Add Abbreviations for Spanish (#762) (BrodaNoel)
8475a04 — 8.1.0 (Hugo-ter-Doest)
014c934 — add trimSentences option to SentenceTokenizer, to let users choose to preserve whitespace (#760) (jeremybmerrill)
791df0b — 8.0.1 (Hugo-ter-Doest)

Security observations

The natural package has moderate security concerns. The most critical issues are: (1) a uuid dependency version that the analysis flagged as possibly invalid and that needs checking against the npm registry, (2) an overly permissive Node.js engine requirement allowing installation on deprecated versions with known vulnerabilities, and (3) loose dependency version constraints that could introduce unexpected vulnerabilities. The presence of a .env file and incomplete security policy further we

High · Overly Permissive Node Engine Requirement — package.json - engines field. The package.json specifies 'node': '>=0.4.10', which is extremely outdated and allows installation on Node.js versions with known critical security vulnerabilities. Node 0.4.10 was released in 2011 and has been unsupported for over a decade. Fix: Update the minimum Node.js version requirement to at least '>=18.0.0' or preferably '>=20.0.0' to ensure users are running supported versions with security patches.
High · Outdated and Vulnerable Dependency - mongoose — package.json - dependencies.mongoose. mongoose version ^9.2.1 is specified, but the pinning is loose. More critically, mongoose has had multiple security vulnerabilities in its history. The version constraint allows for automatic updates that could introduce breaking changes or vulnerabilities. Fix: Audit mongoose for known CVEs, pin to a specific secure version, and regularly review security advisories. Consider using 'npm audit' and monitoring Dependabot alerts.
High · Unverified Dependency Version - uuid ^13.0.0 — package.json - dependencies.uuid. The analysis flagged uuid ^13.0.0 as a version that does not exist and cited v9.x as the latest stable release. That comparison is out of date, since uuid has published major versions after v9, so check the declared range against the npm registry (for example with npm view uuid versions) before treating it as a supply chain risk. Fix: If the range does not resolve to a published release, correct it to a valid, current version and verify the package is installed from the official npm registry.
Medium · Loose Dependency Version Constraints — package.json - all dependencies with ^ prefix. Multiple dependencies use caret ranges (^) which allow minor and patch version updates automatically. This could introduce unexpected changes or vulnerabilities from upstream dependencies (afinn-165, redis, pg, memjs, etc.). Fix: Consider using more restrictive versioning (e.g., ~x.y.z or exact versions) for critical dependencies. Implement automated security scanning with 'npm audit' in CI/CD pipeline.
Medium · Presence of .env File in Repository — .env file. The file structure indicates a .env file exists in the repository. Environment files may contain sensitive credentials, database passwords, or API keys that should never be committed to version control. Fix: Ensure .env is listed in .gitignore. Use .env.example with placeholder values for documentation. Rotate any credentials that may have been exposed.
Medium · Incomplete Security Policy — SECURITY.md. SECURITY.md has an incomplete header and only marks version 5.1.10 as supported, but the current package version is 8.1.1. This creates confusion about which versions receive security updates. Fix: Complete the security policy document, clarify the security support timeline, and ensure it covers currently released versions (at minimum the latest 2-3 versions).
Low · Test Data in Repository — io_spec/tmp/ directory. The io_spec/tmp directory contains test data files with UUID-like names that may contain sensitive information or be leftover from test runs. Fix: Verify these are non-sensitive test files and add tmp directories to .gitignore. Ensure no real data is committed to the repository.
Low · Missing SBOM and Dependency Auditing — .github/workflows/. No evidence of Software Bill of Materials (SBOM) generation or automated dependency scanning in the build process, despite CodeQL being configured. Fix: Add 'npm audit' checks to CI/CD pipeline. Consider generating SBOM using tools like cyclonedx or spdx. Add explicit 'npm ci' (instead of npm install) for reproducible builds.

LLM-derived; treat as a starting point, not a security audit.
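Two of the findings above (the engines floor and caret ranges) are mechanical enough to check with a short script. The sketch below is a hypothetical checker, not an existing tool: auditPackageJson and its threshold are assumptions, and the input object is illustrative rather than a copy of the live package.json.

```javascript
// Hypothetical checker for two of the findings above: a very old minimum
// Node version in `engines` and caret (^) ranges in `dependencies`.
function auditPackageJson(pkg, minSupportedMajor = 18) {
  const findings = [];
  const engines = (pkg.engines && pkg.engines.node) || '';
  const match = /^>=\s*(\d+)/.exec(engines);
  if (!match) {
    findings.push('engines.node is missing or not a simple ">=" range');
  } else if (Number(match[1]) < minSupportedMajor) {
    findings.push(`engines.node allows Node ${match[1]}.x (< ${minSupportedMajor})`);
  }
  for (const [name, range] of Object.entries(pkg.dependencies || {})) {
    if (range.startsWith('^')) {
      findings.push(`${name} uses caret range ${range}`);
    }
  }
  return findings;
}

// Illustrative input only; values echo the report, not the live package.json.
// "example-pinned" is a made-up package name.
const findings = auditPackageJson({
  engines: { node: '>=0.4.10' },
  dependencies: { mongoose: '^9.2.1', 'example-pinned': '1.0.0' }
});
console.log(findings);
```

A check like this only confirms what the manifest declares; it says nothing about whether a given version has known vulnerabilities, which is what npm audit is for.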

Where to read next

Open issues — current backlog
Recent PRs — what's actively shipping
Source on GitHub

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.