huggingface/tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Healthy across the board
- Permissive license, no critical CVEs, actively maintained → safe to depend on.
- Has a license, tests, and CI → clean foundation to fork and modify.
- Documented and popular → useful reference codebase to read through.
- No critical CVEs, sane security posture → runnable as-is.
- ✅ Last commit today
- ✅ 36+ active contributors
- ✅ Distributed ownership (top contributor 37% of recent commits)
- ✅ Apache-2.0 licensed
- ✅ CI configured
- ✅ Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README; it live-updates from the latest cached analysis.
[](https://repopilot.app/r/huggingface/tokenizers)
Paste at the top of your README.md; it renders inline like a shields.io badge.
▸ Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/huggingface/tokenizers on X, Slack, or LinkedIn.
Onboarding: huggingface/tokenizers
Generated by RepoPilot · 2026-05-09 · Source
🤖 Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/huggingface/tokenizers shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything, but if you skim only one section before pointing your agent at this repo, make it the Verify block and the suggested reading order in "Where to read next".
🎯 Verdict
GO – Healthy across the board
- Last commit today
- 36+ active contributors
- Distributed ownership (top contributor 37% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
⚠️ Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live huggingface/tokenizers repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale; regenerate it at repopilot.app/r/huggingface/tokenizers.

What it runs against: a local clone of huggingface/tokenizers. The script inspects the git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in huggingface/tokenizers | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of huggingface/tokenizers. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/huggingface/tokenizers.git
#   cd tokenizers
#
# Then paste this script. Every check is read-only; no mutations.
set +e
fail=0
ok()   { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of huggingface/tokenizers and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "huggingface/tokenizers(\.git)?\b" \
  && ok "origin remote is huggingface/tokenizers" \
  || miss "origin remote is not huggingface/tokenizers (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. The Apache-2.0 LICENSE text opens with
#    "Apache License ... Version 2.0", so match that rather than the bare SPDX
#    id (which never starts a line in that file); fall back to the npm manifest.
{ grep -qiE "apache license" LICENSE 2>/dev/null \
    && grep -qiE "version 2\.0" LICENSE 2>/dev/null; } \
  || grep -qE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" bindings/node/package.json 2>/dev/null \
  && ok "license is Apache-2.0" \
  || miss "license drift: was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical paths exist
test -f "bindings/node/src/lib.rs" \
  && ok "bindings/node/src/lib.rs" \
  || miss "missing critical file: bindings/node/src/lib.rs"
test -f "bindings/node/Cargo.toml" \
  && ok "bindings/node/Cargo.toml" \
  || miss "missing critical file: bindings/node/Cargo.toml"
# tokenizers/ is a directory, so check with -d (test -f would always fail here)
test -d "tokenizers" \
  && ok "tokenizers/" \
  || miss "missing critical directory: tokenizers/"
test -f "bindings/node/index.d.ts" \
  && ok "bindings/node/index.d.ts" \
  || miss "missing critical file: bindings/node/index.d.ts"
test -f "bindings/node/build.rs" \
  && ok "bindings/node/build.rs" \
  || miss "missing critical file: bindings/node/build.rs"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
  miss "last commit was $days_since_last days ago; artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures): safe to trust"
else
  echo "artifact has $fail stale claim(s); regenerate at https://repopilot.app/r/huggingface/tokenizers"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).
⚡ TL;DR
huggingface/tokenizers is a production-grade Rust library with Python and Node.js bindings that implements state-of-the-art tokenization algorithms (BPE, WordPiece, Unigram) with extreme performance: tokenizing 1GB of text takes under 20 seconds on CPU. It handles the full tokenization pipeline: normalization with alignment tracking, pre-tokenization, training, truncation, padding, and special-token injection, designed for both research and production NLP systems. Monorepo structure: tokenizers/ (Rust core implementation), bindings/python/ (PyO3-based Python bindings), and bindings/node/ (NAPI-based Node.js bindings), with shared dependencies and conda/pip/npm packaging. Core tokenizer logic lives in tokenizers/ with model implementations (BPE, WordPiece, Unigram), pre-tokenizers, normalizers, and post-processors; the bindings provide language-specific APIs wrapping the Rust FFI.
👥 Who it's for
NLP engineers and researchers building transformer-based models who need fast, production-ready tokenization without performance bottlenecks; integrators into ML frameworks (Hugging Face Transformers, etc.) who require sub-20ms tokenization latency; teams training custom vocabularies on massive corpora who need to avoid Python-only tokenizer bottlenecks.
🌱 Maturity & risk
Production-ready and actively maintained. The Rust core (1.34M LOC) is stable with comprehensive CI (Rust, Python, and Node workflows in .github/workflows/) covering training and release pipelines. The Python binding (391K LOC) has extensive test coverage and is widely used across the Hugging Face Transformers ecosystem. Recent commits and an active release cadence (.github/workflows/python-release.yml, python.yml) indicate ongoing development.
Low risk for production use but moderate architectural complexity: the Rust FFI bindings (PyO3 for Python; napi in bindings/node/Cargo.toml for Node.js) introduce platform-specific build requirements (see .github/conda/); tight coupling between the Rust tokenizers/ core and the bindings means breaking changes to core APIs propagate across all language bindings; and the multi-target release process (Rust, Python, Node.js) requires careful testing (evidenced by multiple CI workflows) to avoid inconsistent behavior across platforms.
Active areas of work
Active multi-platform maintenance: Node.js v0.23.2-dev.0 in development (bindings/node/Cargo.toml), CI coverage across Rust/Python/Node release pipelines, recent documentation and benchmark infrastructure (.github/scripts/render_bench_svg.py, .github/workflows/benchmark-trigger.yml), and stale issue management (.github/stale.yml). Ongoing performance optimization visible in benchmark automation.
🚀 Get running
Clone and explore the Python binding (most common): git clone https://github.com/huggingface/tokenizers.git && cd tokenizers/bindings/python && pip install -e ., or just pip install tokenizers. For the Rust core: cd tokenizers && cargo build --release. For Node.js: cd bindings/node && npm install && npm run build.
Daily commands:
Python: python -c 'from tokenizers import Tokenizer; t = Tokenizer.from_file("path/to/tokenizer.json"); t.encode("text")'. Rust: cargo test --all in tokenizers/. Node.js: npm test in bindings/node/ (uses Jest per jest.config.js). Training example: from tokenizers.trainers import BpeTrainer; tokenizer.train(files=[...], trainer=BpeTrainer()) on an appropriate tokenizer model.
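A minimal Node.js sketch of the same daily loop. The Tokenizer.fromFile, encode, getTokens, and getIds names are assumptions based on this doc's description of index.d.ts; confirm the exact signatures in bindings/node/index.d.ts before copying.

```ts
// Minimal sketch, assuming the npm package exports a Tokenizer class with
// fromFile/encode as described above -- verify names in bindings/node/index.d.ts.
import { Tokenizer } from "tokenizers";

async function main(): Promise<void> {
  // tokenizer.json is the serialized pipeline (normalizer + pre-tokenizer +
  // model + post-processor), the same file the Python binding loads.
  const tokenizer = Tokenizer.fromFile("path/to/tokenizer.json");

  // `await` is harmless whether encode is sync or Promise-based.
  const encoding = await tokenizer.encode("Hello, how are you?");
  console.log(encoding.getTokens());
  console.log(encoding.getIds());
}

main().catch(console.error);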
🗺️ Map of the codebase
- bindings/node/src/lib.rs – Main Node.js binding entry point that exposes Rust tokenizer functionality to JavaScript via NAPI; defines all exported classes and methods
- bindings/node/Cargo.toml – Node.js binding dependencies and build configuration; controls linking to the core tokenizers library and the NAPI version
- tokenizers/ – Core Rust tokenizer library (referenced as a workspace dependency); implements all tokenization algorithms and data structures
- bindings/node/index.d.ts – TypeScript type definitions for the Node.js bindings; documents the public API surface for JavaScript consumers
- bindings/node/build.rs – Build script that compiles the Rust code to a native binary; integrates the NAPI build system for platform-specific compilation
- bindings/node/package.json – Node.js package metadata and dependencies; defines NPM package versions, scripts, and platform-specific binary references
- .github/workflows/node-release.yml – CI/CD pipeline for building and publishing the Node.js bindings to NPM across multiple platforms
🛠️ How to make changes
Add a new NAPI binding for a tokenizer component
- Implement the Rust component in the core tokenizers library (e.g., a new normalizer) (tokenizers/src/normalizers/mod.rs)
- Create a NAPI wrapper module in bindings/node/src/ (e.g., src/normalizers.rs) deriving #[napi] macros (bindings/node/src/normalizers.rs)
- Export the new module and classes in bindings/node/src/lib.rs (bindings/node/src/lib.rs)
- Add TypeScript interface definitions in bindings/node/index.d.ts (bindings/node/index.d.ts)
- Write integration tests in bindings/node/lib/bindings/normalizers.test.ts; a test sketch follows this list (bindings/node/lib/bindings/normalizers.test.ts)
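For the last step, a hedged Jest test sketch for a hypothetical stripAccents normalizer. The import path, factory name, and normalizeString method are illustrative assumptions; replace them with whatever lib.rs actually exports.

```ts
// Hypothetical test for a new normalizer binding (all names are placeholders).
import { describe, expect, it } from "@jest/globals";
// Assumed export from the binding's entry point (bindings/node/src/lib.rs).
import { stripAccentsNormalizer } from "../../";

describe("stripAccents normalizer", () => {
  it("strips accents from the input string", () => {
    const normalizer = stripAccentsNormalizer();
    // normalizeString is assumed here; check index.d.ts for the real method.
    expect(normalizer.normalizeString("Héllò wörld")).toBe("Hello world");
  });
});
```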
Release a new version to NPM
- Update the version in bindings/node/Cargo.toml and bindings/node/package.json (bindings/node/package.json)
- Commit the changes and create a git tag (e.g., node-v0.24.0) (RELEASE.md)
- Push the tag to trigger the node-release.yml workflow, which builds platform-specific binaries (.github/workflows/node-release.yml)
- The workflow automatically publishes all platform shims to NPM (linux-x64-gnu, darwin-arm64, etc.) (bindings/node/npm/)
Add support for a new target platform
- Create a new directory for the platform under bindings/node/npm/ (e.g., npm/linux-riscv64/) (bindings/node/npm/linux-x64-gnu/package.json)
- Add the platform triple and build target to bindings/node/Cargo.toml as a feature or [target] section (bindings/node/Cargo.toml)
- Update .github/workflows/node-release.yml to build for the new target in the matrix (.github/workflows/node-release.yml)
- Update the optionalDependencies in bindings/node/package.json to reference the new platform binary (bindings/node/package.json)
🔧 Why these technologies
- Rust + NAPI – enables CPU-bound tokenization to run at native speeds (~20s per GB) while exposing safe bindings to JavaScript; NAPI provides zero-copy interop and AOT compilation
- TypeScript definitions – provide type safety and IDE autocomplete for JavaScript consumers; document the async and sync method signatures
- Platform-specific NPM shims – allow a single 'npm install tokenizers' to work across Windows, macOS, Linux (glibc/musl), and ARM architectures by selecting the correct prebuilt binary at install time (see the resolution sketch after this list)
- Cargo workspace – centralizes tokenizer algorithm implementations in a single Rust package (tokenizers/) that can be reused by multiple language bindings (Node, Python, etc.)
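To make the shim mechanism concrete, here is an illustrative loader, not the package's actual one, showing how per-platform optionalDependencies are typically resolved at require time. The package names mirror the npm/&lt;triple&gt;/ layout described above but are assumptions.

```ts
// Illustrative platform resolution for per-platform npm shims (assumed names).
function nativeBindingPackage(): string {
  const { platform, arch } = process; // e.g. "linux" + "x64"
  if (platform === "linux" && arch === "x64") {
    // Real loaders also distinguish glibc vs musl (e.g. via process.report).
    return "tokenizers-linux-x64-gnu";
  }
  if (platform === "darwin" && arch === "arm64") return "tokenizers-darwin-arm64";
  if (platform === "win32" && arch === "x64") return "tokenizers-win32-x64-msvc";
  throw new Error(`unsupported platform: ${platform}-${arch}`);
}

// npm only installs the optionalDependency matching the host, so exactly one
// of the candidate platform packages exists on disk at require time.
const binding = require(nativeBindingPackage());
```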
⚖️ Trade-offs already made
- Prebuilt binary distribution (not just WASM)
  - Why: native Rust code is 50–100x faster than WASM for tokenization; platform-specific binaries eliminate the need for end users to have a Rust compiler installed
  - Consequence: increases repo complexity (CI matrix across ~12 platforms), larger npm package size, and binary-compatibility concerns if dependencies change
- Single Tokenizer class (not separate classes per algorithm)
  - Why: simplifies the API surface and allows transparent swapping of algorithms (BPE, WordPiece, SentencePiece) via configuration
  - Consequence: type safety for algorithm-specific options is reduced; users must pass string enum keys rather than strongly typed classes
- Synchronous tokenization (no async/await)
  - Why: tokenization is CPU-bound and completes in milliseconds; async overhead would be unnecessary
  - Consequence: long tokenization of large texts will block the JavaScript event loop; users must offload to worker threads for massive datasets (see the worker-thread sketch after this list)
- Rust-only normalization and decoding implementations
  - Why: ensures alignment tracking and offsets are always accurate; avoids duplicating complex logic in JavaScript
  - Consequence: makes it harder for JavaScript-only developers to extend or debug tokenizer behavior without reading Rust code
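A sketch of the worker-thread workaround from the synchronous-tokenization trade-off: push heavy encoding into a worker so the main event loop stays responsive. The tokenizers calls reuse the assumed API from the earlier example; file names are illustrative.

```ts
// main.ts -- dispatch a batch of texts to a tokenizing worker (sketch).
import { Worker } from "node:worker_threads";

export function encodeInWorker(texts: string[]): Promise<number[][]> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(new URL("./tokenize-worker.js", import.meta.url), {
      workerData: texts,
    });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}

// tokenize-worker.ts -- runs off the main thread, so blocking here is fine:
//
//   import { parentPort, workerData } from "node:worker_threads";
//   import { Tokenizer } from "tokenizers";
//
//   const tokenizer = Tokenizer.fromFile("path/to/tokenizer.json"); // assumed API
//   Promise.all((workerData as string[]).map((t) => tokenizer.encode(t))).then(
//     (encodings) => parentPort?.postMessage(encodings.map((e) => e.getIds()))
//   );
```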
🚫 Non-goals (don't propose these)
- Language-agnostic tokenizer training (training is Rust-only; Python bindings handle user-facing training API)
- Real-time token streaming or streaming tokenization output
- GPU acceleration (scope is CPU tokenization only)
- Authentication or multi-user isolation (tokenizer is stateless; no session management)
- Support for dynamic vocab updates without reloading the tokenizer
🪤 Traps & gotchas
- NAPI build complexity: the Node.js binding requires native compilation (build.rs uses napi-build v2); prebuilt wheels are only available for common Python versions (check PyPI), so other builds require the Rust toolchain.
- FFI data marshaling: Rust strings/vecs cross to Python/Node with ownership semantics (PyO3/NAPI handle this, but customization is fragile).
- Version skew risk: the tokenizers core, bindings/python, and bindings/node are released separately (see .github/workflows/*-release.yml); each binding must depend on an exact core version to avoid API mismatches (a version-skew check sketch follows).
- No async API: all operations block (a design choice per the README); I/O-bound workflows need external threading.
- Normalization alignment tracking is opt-in and affects performance measurably (see bindings/python/benches/).
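A sketch of a pre-release guard for the version-skew trap: fail fast if the node binding's package.json and Cargo.toml disagree. The regex-based Cargo.toml read is a shortcut, not a real TOML parser.

```ts
// check-version-skew.ts -- compare npm and crate versions for the node binding.
import { readFileSync } from "node:fs";

const pkg = JSON.parse(readFileSync("bindings/node/package.json", "utf8"));
const cargo = readFileSync("bindings/node/Cargo.toml", "utf8");
// First `version = "..."` line; assumes [package] precedes other tables.
const cargoVersion = cargo.match(/^version\s*=\s*"([^"]+)"/m)?.[1];

if (pkg.version !== cargoVersion) {
  console.error(`version skew: package.json=${pkg.version}, Cargo.toml=${cargoVersion}`);
  process.exit(1);
}
console.log(`versions agree: ${pkg.version}`);
```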
🏛️ Architecture
💡 Concepts to learn
- Byte-Pair Encoding (BPE) – core tokenization algorithm used by GPT/DALL-E/Stable Diffusion; understanding merge operations and vocabulary construction is essential to use BPE correctly in tokenizers/src/models/bpe.rs (a toy merge-loop sketch follows this list)
- WordPiece Tokenization – competing algorithm used by BERT; implemented in tokenizers/, and understanding subword vs. word-level trade-offs helps choose the correct model for NLP tasks
- Token-to-Character Alignment – huggingface/tokenizers tracks offsets mapping tokens back to the original text (the README mentions 'alignments tracking'); critical for tasks like named entity recognition where you need original character positions
- Foreign Function Interface (FFI) / NAPI – the Rust core is exposed to Python (PyO3) and Node.js (NAPI) via FFI; understanding ownership semantics and serialization is required to extend bindings or debug crashes
- Unigram Language Model Tokenization – probabilistic tokenization algorithm used by SentencePiece and mT5; implemented in tokenizers/ as an alternative to BPE/WordPiece for multilingual scenarios
- Normalization & Pre-tokenization Pipeline – huggingface/tokenizers separates Unicode normalization (NFKC, lowercasing) from pre-tokenization (whitespace splitting) from tokenization; understanding this pipeline order prevents token corruption in production
- Rust-to-Python Memory Safety via PyO3 – bindings/python/src/lib.rs uses PyO3 to marshal Rust data structures (Vec, String) to Python objects with guaranteed memory safety; modifying this requires understanding Rust ownership and the GIL
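The merge-loop sketch promised in the BPE item above: a toy, self-contained illustration of applying an ordered merge table to one pre-tokenized word. This is not the library's implementation (that lives in tokenizers/src/models/bpe.rs); it only shows why merge rank and order matter.

```ts
// Toy BPE: repeatedly merge the adjacent pair with the lowest merge rank.
function applyBpe(word: string[], merges: string[]): string[] {
  const rank = new Map<string, number>(merges.map((m, i): [string, number] => [m, i]));
  let symbols = [...word];
  for (;;) {
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const r = rank.get(`${symbols[i]} ${symbols[i + 1]}`);
      if (r !== undefined && r < bestRank) {
        bestRank = r;
        best = i;
      }
    }
    if (best === -1) return symbols; // no learned merge applies any more
    symbols = [
      ...symbols.slice(0, best),
      symbols[best] + symbols[best + 1],
      ...symbols.slice(best + 2),
    ];
  }
}

// Merges are ordered by training frequency; earlier entries are applied first.
console.log(applyBpe(["l", "o", "w", "e", "r"], ["l o", "lo w", "e r"]));
// -> ["low", "er"]
```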
🔗 Related repos
- huggingface/transformers – primary consumer of this tokenizers library; handles tokenization via from_pretrained() using huggingface/tokenizers as the backend
- openai/tiktoken – alternative BPE tokenizer for GPT models; directly compared in bindings/python/benches/test_tiktoken.py for performance benchmarking
- google/sentencepiece – competing tokenization library supporting Unigram/SentencePiece algorithms; serves a similar use case for BERT/mT5 models
- huggingface/safetensors – companion repo for safe model-weight serialization; commonly used alongside tokenizers for full model-tokenizer pipelines
- huggingface/hub-models – central model registry where tokenizer configs (.tokenizer.json) are stored; direct integration point for loading pretrained tokenizers
🎪 PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive integration tests for Node.js bindings across all tokenizer components
The Node.js bindings have individual test files (tokenizer.test.ts, models.test.ts, etc.) but lack cross-component integration tests. The file structure shows bindings/node/lib/bindings/*.test.ts files exist, but there's no integration test suite that validates end-to-end tokenization pipelines combining normalizers, pre-tokenizers, models, and post-processors together. This is critical since tokenizer quality depends on component interaction.
- [ ] Create bindings/node/lib/bindings/integration.test.ts
- [ ] Add tests for complete pipelines: BPE + normalizers + post-processors
- [ ] Add tests for WordPiece, SentencePiece combinations with real-world examples
- [ ] Verify encoding/decoding round-trips for all major tokenizer types
- [ ] Add performance regression tests comparing Node.js vs Rust performance
Implement missing CI workflow for Node.js performance benchmarking against Rust baseline
The repo has a benchmarks.yml workflow and benchmark-trigger.yml, but they appear focused on Rust. Given that bindings/node exists as a first-class platform and Node.js is performance-critical for many users, there is no dedicated Node.js benchmarking workflow, so performance regressions in the Node.js bindings could go undetected. The benchmark infrastructure exists (.github/scripts/render_bench_svg.py) but Node.js isn't integrated. (A minimal harness sketch follows the checklist.)
- [ ] Create .github/workflows/node-benchmarks.yml similar to benchmarks.yml structure
- [ ] Add benchmark scripts in bindings/node/ to measure tokenization throughput
- [ ] Compare Node.js binding performance against native Rust implementation
- [ ] Generate and publish benchmark results on PR/release similar to Rust workflow
- [ ] Add performance thresholds to fail CI if Node.js bindings regress >5% vs baseline
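A minimal throughput-harness sketch for this idea. The measurement plumbing is standard Node; the Tokenizer calls reuse the assumed API from earlier examples and must be aligned with the real bindings.

```ts
// bench.ts -- rough tokenization throughput in MB/s (sketch, assumed API).
import { readFileSync } from "node:fs";
import { performance } from "node:perf_hooks";
import { Tokenizer } from "tokenizers";

async function throughputMBps(tokenizer: Tokenizer, lines: string[]): Promise<number> {
  const bytes = lines.reduce((n, line) => n + Buffer.byteLength(line), 0);
  const start = performance.now();
  for (const line of lines) {
    await tokenizer.encode(line); // tolerates sync or Promise-based encode
  }
  const seconds = (performance.now() - start) / 1000;
  return bytes / 1e6 / seconds;
}

// Usage: node bench.js path/to/tokenizer.json corpus.txt
const [tokenizerPath, corpusPath] = process.argv.slice(2);
const tokenizer = Tokenizer.fromFile(tokenizerPath);
const lines = readFileSync(corpusPath, "utf8").split("\n").filter(Boolean);
throughputMBps(tokenizer, lines).then((mbps) =>
  console.log(`tokenized ${lines.length} lines at ${mbps.toFixed(1)} MB/s`)
);
```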
Add type safety improvements and missing JSDoc for Node.js bindings public API
The bindings/node/index.d.ts TypeScript definitions exist but are incomplete compared to actual capabilities. The bindings/node/lib/bindings/ test files reveal many features that likely lack proper JSDoc documentation and complete type definitions. This creates friction for TypeScript users and reduces IDE autocomplete effectiveness; better DX will encourage adoption. (An illustrative JSDoc/overload sketch follows the checklist.)
- [ ] Audit bindings/node/index.d.ts against all test files in bindings/node/lib/bindings/
- [ ] Add comprehensive JSDoc comments with @param, @returns, @example for all public methods
- [ ] Add overload signatures for polymorphic methods (e.g., tokenizer methods accepting multiple input types)
- [ ] Add strict generic type constraints for encoder/decoder methods
- [ ] Generate documentation from JSDoc and validate against README examples in bindings/node/
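An illustrative slice of what the audited index.d.ts could gain: JSDoc with @param/@returns/@example plus overloads for the single-sequence and pair forms. The method names and return types here are hypothetical shape sketches; take the real ones from the existing definitions.

```ts
// Sketch of documented, overloaded declarations (hypothetical names/types).
export interface Encoding {
  getIds(): number[];
  getTokens(): string[];
}

export declare class Tokenizer {
  /**
   * Encode a single sequence.
   * @param input - Raw text to tokenize.
   * @returns The encoding with tokens, ids, and character offsets.
   * @example
   * const enc = await tokenizer.encode("hello world");
   */
  encode(input: string): Promise<Encoding>;
  /**
   * Encode a sequence pair (e.g. question + context for QA models).
   * @param input - First sequence.
   * @param pair - Second sequence.
   */
  encode(input: string, pair: string): Promise<Encoding>;
}
```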
🌿 Good first issues
- Add missing docstrings to Python binding classes in bindings/python/src/lib.rs (Tokenizer, BPE, WordPiece classes) to improve generated API docs; high-impact, low risk.
- Implement integration test in bindings/node/examples/ that mirrors bindings/python/tests/ for BPE/WordPiece training and encoding; ensures feature parity and catches FFI regressions.
- Write benchmark comparison script (similar to bindings/python/benches/test_tiktoken.py) for Node.js binding to verify NAPI overhead is acceptable; would inform performance SLAs for npm package users.
⭐ Top contributors
- @ArthurZucker – 37 commits
- @dependabot[bot] – 14 commits
- @ngoldbaum – 5 commits
- @McPatate – 3 commits
- @hf-security-analysis[bot] – 3 commits
📜 Recent commits
- 22d54d3 – update dev version to 0.23.2-dev.0 (ArthurZucker)
- bbe43ad – ci: release workflow fixes (node + python) (#2043) (ArthurZucker)
- ab0c5d8 – Fix node release (#2034) (ArthurZucker)
- decd8e0 – bindings/python: free-threaded Python (3.14t) support (#2041) (ArthurZucker)
- 3992692 – update for release (#2033) (ArthurZucker)
- bcdd25b – BPE cache: per-thread read-through cache to avoid RwLock atomics on hits (#2028) (sebpop)
- 618eb38 – Bump follow-redirects in /tokenizers/examples/unstable_wasm/www (#2024) (dependabot[bot])
- b6b1688 – chore: bump doc-builder SHA for PR upload workflow (#2025) (rtrompier)
- 19015d6 – fix: use uvx --with cairosvg instead of uv pip install --system (#2021) (ArthurZucker)
- efbcc68 – Ci benchmarks (#2019) (ArthurZucker)
🔒 Security observations
The codebase demonstrates generally good security practices with no critical vulnerabilities identified. The main concerns are: (1) outdated NAPI bindings that may lack recent security patches, (2) potential transitive dependency risks with pinned versions, (3) GitHub Actions workflows requiring audit for secrets management and third-party action security, and (4) minor risk of information disclosure through npm package documentation. The project uses dependency management (Cargo.lock implied) and has security scanning configured (trufflehog.yml workflow). Immediate actions should focus on updating the NAPI bindings to current versions and conducting a GitHub Actions workflow security audit.
- Medium · Outdated NAPI bindings version – bindings/node/Cargo.toml, dependencies: napi = "3", napi-derive = "3". The Node.js bindings pin napi and napi-derive to major version 3; if newer stable releases exist, the pinned versions may be missing security patches and bug fixes for the native module interface. Fix: check crates.io for the latest stable napi/napi-derive releases, update the requirements, run 'cargo update', and test thoroughly to ensure compatibility.
- Low · Transitive dependency risk (ahash) – bindings/node/Cargo.toml, dependencies: ahash = { version = "0.8.11", features = ["serde"] }. The explicit serde feature suggests serialization of hash outputs, which could be a vector for hash-collision attacks if not used carefully. Note that Cargo already treats "0.8.11" as a caret requirement (^0.8.11), so compatible patch updates are allowed; the actionable steps are to monitor advisories regularly with 'cargo audit' and to review how ahash values are exposed to rule out algorithmic attacks.
- Low · GitHub Actions workflow security – .github/workflows/ directory. Multiple workflows are present, including release workflows (python-release.yml, rust-release.yml, node-release.yml). Without access to their contents, potential issues could include missing branch protection, insufficient secret management, or insecure artifact handling. Fix: audit all workflow files for (1) GITHUB_TOKEN scoped to appropriate permissions, (2) secrets not logged or exposed in outputs, (3) third-party actions pinned to commit SHAs instead of version tags, (4) proper branch protection rules, and (5) artifact signing for releases.
- Low · Potential information disclosure via package structure – bindings/node/npm/*/README.md files. The bindings/node/npm/ directory contains platform-specific package configurations for multiple architectures. If documentation or README files in these directories contain sensitive information, it could be exposed via npm. Fix: audit all README.md files in the architecture-specific npm packages to ensure no sensitive build information, internal infrastructure details, or credentials are documented; include only user-facing documentation.
LLM-derived; treat as a starting point, not a security audit.
📚 Where to read next
- Open issues – current backlog
- Recent PRs – what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals; see the live page for receipts. Re-run on a new commit to refresh.