houbb/sensitive-word
👮‍♂️The sensitive word tool for Java. (Sensitive / banned / illegal / profane words. A high-performance Java sensitive-word filtering framework based on the DFA algorithm, with built-in word tagging and severity classification. Do not publish content involving politics, advertising, marketing, circumvention tools, or anything that violates national laws and regulations. A high-performance sensitive-word detection and filtering component, with traditional/simplified Chinese conversion, full-width/half-width conversion, Chinese-to-pinyin conversion, fuzzy search, and more.)
Healthy across the board
- Weakest axis: permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — clean foundation to fork and modify.
- Documented and popular — useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 7w ago
- ✓9 active contributors
- ✓Distributed ownership (top contributor 47% of recent commits)
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/houbb/sensitive-word)
Paste at the top of your README.md — renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/houbb/sensitive-word on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: houbb/sensitive-word
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/houbb/sensitive-word shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 7w ago
- 9 active contributors
- Distributed ownership (top contributor 47% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live houbb/sensitive-word
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/houbb/sensitive-word.
What it runs against: a local clone of houbb/sensitive-word — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in houbb/sensitive-word | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 76 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of houbb/sensitive-word. If you don't
# have one yet, run these first:
#
# git clone https://github.com/houbb/sensitive-word.git
# cd sensitive-word
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of houbb/sensitive-word and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "houbb/sensitive-word(\.git)?\b" \
  && ok "origin remote is houbb/sensitive-word" \
  || miss "origin remote is not houbb/sensitive-word (artifact may be from a fork)"
# 2. License matches what RepoPilot saw. The standard Apache-2.0 LICENSE file
# begins "Apache License", not "Apache-2.0", so match on that phrase.
(grep -qi "Apache License" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
for f in \
  "src/main/java/com/github/houbb/sensitive/word/core/SensitiveWord.java" \
  "src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java" \
  "src/main/java/com/github/houbb/sensitive/word/api/ISensitiveWord.java" \
  "src/main/java/com/github/houbb/sensitive/word/core/AbstractSensitiveWord.java" \
  "src/main/java/com/github/houbb/sensitive/word/support/check/WordChecks.java"
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 76 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~46d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/houbb/sensitive-word"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
A high-performance Java library for detecting and filtering sensitive words (profanity, illegal content, prohibited terms) using a Deterministic Finite Automaton (DFA) algorithm. It ships with 60K+ pre-built word entries, achieves 140K+ QPS throughput, and supports advanced text normalization (simplified/traditional Chinese conversion, full/half-width character handling, pinyin transformation) to defeat common evasion techniques. Standard Maven monolith: src/main/java/com/github/houbb/sensitive/word/ contains the core DFA implementation with modular interfaces (ISensitiveWord.java, IWordCheck.java, IWordReplace.java, etc.) for pluggable behavior. Word data lives separately (sensitive-word-data dependency v1.0.0). Test code presumably in src/test/ (not listed). Build artifacts configured in pom.xml for Maven Central publication.
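The DFA claim above is easiest to grasp from a toy version. The sketch below is illustrative only — class and method names here are invented, not the library's API. The idea: build a prefix trie from the word list once, then scan the input character by character; lookup cost depends on input length, not vocabulary size.

```java
// Minimal trie-based sensitive-word matcher (illustrative sketch, not the
// library's implementation). Build once, query many times.
import java.util.HashMap;
import java.util.Map;

public class TrieSketch {
    static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWordEnd;
    }

    private final Node root = new Node();

    public void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWordEnd = true;
    }

    /** Returns true if any added word occurs as a substring of text. */
    public boolean containsSensitive(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;          // no word continues with this character
                }
                if (node.isWordEnd) {
                    return true;    // matched a complete sensitive word
                }
            }
        }
        return false;
    }
}
```

The real engine layers normalization (full/half-width, simplified/traditional, pinyin) in front of a matcher like this so that obfuscated variants hit the same trie paths.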
👥Who it's for
Java backend developers building content moderation systems, user-generated content platforms, or chat/forum applications who need production-grade sensitive word filtering without building their own DFA engine and maintaining massive word lists.
🌱Maturity & risk
Production-ready. Active development as of v0.29.5 (recent releases in the commit history), comprehensive CI setup (.travis.yml, .coveralls.yml), Maven Central distribution, 60K+ curated word entries, and extensive documentation in /doc/issues/roadmap/ indicate a mature, actively maintained project. Benchmark claims of 140K+ QPS (written "14W+" in Chinese numeral shorthand, where W = 万 = 10,000) are documented.
Low-to-moderate risk. Single maintainer (houbb) visible in package structure com.github.houbb. Dependencies are minimal but include opencc4j (1.14.0) for Chinese conversion and heaven (0.13.0) for utility functions—audit these for security. The word list is built-in (not dynamic by default), so you must rebuild/redeploy to update banned terms; roadmap items like v009 (custom blacklist) and dynamic loading partially mitigate this. No visible open issue backlog in the provided file list.
Active areas of work
Roadmap items in /doc/issues/roadmap/ indicate active feature development: v004 (full/half-width punctuation), v005 (digit normalization), v006 (simplified/traditional Chinese swap), v007 (repeated word handling), v009 (custom blacklist), v010 (white list support), v011 (email/URL regex detection), v012 (stopword filtering), v014/v015 (phonetically/visually similar character matching, mirroring), v016 (custom noise reduction). The admin console (sensitive-word-admin) is mentioned as in early MVP development.
🚀Get running
git clone https://github.com/houbb/sensitive-word.git
cd sensitive-word
mvn clean install
mvn test
No external services required; library is self-contained. Uses Java 1.8+ (project.compiler.level=1.8 in pom.xml).
Daily commands:
This is a library, not an application. To use it, add the dependency to your pom.xml:

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.29.5</version>
</dependency>

then instantiate via the fluent API (documented in README.md). For development: mvn clean package builds the JAR; mvn test runs the test suite.
🗺️Map of the codebase
- src/main/java/com/github/houbb/sensitive/word/core/SensitiveWord.java — Primary entry point and main API implementation for DFA-based sensitive word detection; all users interact through this class
- src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java — Builder pattern implementation that orchestrates configuration, initialization, and chaining of word filtering operations
- src/main/java/com/github/houbb/sensitive/word/api/ISensitiveWord.java — Core interface contract defining all public operations (check, replace, tags, formatting); critical for understanding the API surface
- src/main/java/com/github/houbb/sensitive/word/core/AbstractSensitiveWord.java — Base implementation of the DFA algorithm logic and core word matching; foundation for all sensitive word operations
- src/main/java/com/github/houbb/sensitive/word/support/check/WordChecks.java — Registry and factory for pluggable word validation checks (email, URL, IP, numbers); enables an extensible validation pipeline
- src/main/java/com/github/houbb/sensitive/word/support/allow/WordAllows.java — Whitelist management system that integrates with allow/deny rules to customize filtering behavior
- src/main/java/com/github/houbb/sensitive/word/api/context/InnerSensitiveWordContext.java — Context holder managing state across filtering operations, including tags, formats, and intermediate results
🛠️How to make changes
Add a Custom Word Check Strategy
- Create a new check class extending AbstractWordCheck or implementing IWordCheck in src/main/java/com/github/houbb/sensitive/word/support/check/ (src/main/java/com/github/houbb/sensitive/word/support/check/WordCheckWord.java)
- Register the new check in the WordChecks.java factory as an available strategy (src/main/java/com/github/houbb/sensitive/word/support/check/WordChecks.java)
- Integrate into the builder pipeline via SensitiveWordBs by adding a fluent method that selects your check strategy (src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java)
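The shape of such a check strategy can be sketched with simplified, hypothetical interfaces — the real IWordCheck/WordChecks signatures differ, so treat the names below as stand-ins, not the library's API:

```java
// Hypothetical, simplified mirror of the pluggable-check strategy pattern.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

interface WordCheck {
    /** Returns true if the candidate token should be flagged. */
    boolean check(String token);
}

class EmailCheck implements WordCheck {
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    @Override
    public boolean check(String token) {
        return EMAIL.matcher(token).find();
    }
}

/** Registry in the spirit of WordChecks: compose strategies, flag on any hit. */
class CheckRegistry {
    private final List<WordCheck> checks = new ArrayList<>();

    CheckRegistry register(WordCheck check) {
        checks.add(check);
        return this; // fluent, mirroring the library's builder style
    }

    boolean anyMatch(String token) {
        return checks.stream().anyMatch(c -> c.check(token));
    }
}
```

The point of the pattern is that new checks plug into the registry without touching core matching code.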
Add a Custom Format/Normalization Handler
- Create a new format class implementing IWordFormat in src/main/java/com/github/houbb/sensitive/word/support/format/ (src/main/java/com/github/houbb/sensitive/word/api/IWordFormat.java)
- Implement the format() method to normalize text (e.g., simplified↔traditional, pinyin conversion) (src/main/java/com/github/houbb/sensitive/word/api/IWordFormatText.java)
- Configure in SensitiveWordBs via addFormat() to apply during word matching and replacement (src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java)
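As a concrete example of what such a normalization handler does, here is a self-contained full-width-to-half-width converter (the class name and static-method shape are illustrative; the library's IWordFormat contract differs):

```java
// Full-width ASCII (U+FF01..U+FF5E) maps to half-width by subtracting
// 0xFEE0; the ideographic space U+3000 maps to a plain space. This is the
// standard Unicode relationship, independent of any library.
public class HalfWidthFormat {
    public static String format(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));   // e.g. 'Ａ' -> 'A'
            } else if (c == '\u3000') {
                sb.append(' ');                    // ideographic space
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Running input through transforms like this before matching is what defeats full-width obfuscation of banned terms.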
Add Custom Allow/Whitelist Rules
- Create a new allow implementation extending AbstractWordAllow or implementing IWordAllow in src/main/java/com/github/houbb/sensitive/word/support/allow/ (src/main/java/com/github/houbb/sensitive/word/support/allow/WordAllowInit.java)
- Implement the contains() method to check whether a word should be exempted from filtering (src/main/java/com/github/houbb/sensitive/word/api/IWordAllow.java)
- Register via the SensitiveWordBs.allow() builder method to apply during sensitive word detection (src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java)
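The allow-rule contract reduces to "matched word is suppressed if any allow rule claims it". A minimal sketch, with hypothetical names (the real IWordAllow interface differs):

```java
// Hypothetical allow-list: exempt specific words from the match results.
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class AllowList {
    private final Set<String> allowed;

    AllowList(String... words) {
        this.allowed = new HashSet<>(Arrays.asList(words));
    }

    /** Mirrors the contains() idea: should this word be exempted? */
    boolean contains(String word) {
        return allowed.contains(word);
    }

    /** Drop matches that an allow rule exempts. */
    List<String> filter(List<String> matches) {
        return matches.stream()
                .filter(w -> !contains(w))
                .collect(Collectors.toList());
    }
}
```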
Extend with Custom Tag Classifications
- Review existing word tag types in the WordTagType.java enum (src/main/java/com/github/houbb/sensitive/word/constant/enums/WordTagType.java)
- Add new tag enum values and configure them in the word data source/repository with metadata (src/main/java/com/github/houbb/sensitive/word/api/IWordTag.java)
- Query tags via the SensitiveWord.tags() method, which returns categorized sensitive words from AbstractSensitiveWord (src/main/java/com/github/houbb/sensitive/word/core/AbstractSensitiveWord.java)
🪤Traps & gotchas
- No Spring Boot auto-configuration or property files are visible. This is a plain library, not a starter; it requires manual fluent-API instantiation.
- The bundled word list (60K+ entries) is immutable by default; dynamic updates require custom IWordData implementations (roadmap v009).
- The DFA trie is likely built once at initialization; measure initialization cost with large custom word lists before using it in hot paths.
- Chinese text processing depends on opencc4j 1.14.0; ensure compatible locale settings on Windows.
- No explicit thread-safety documentation was found in the listed files; verify concurrent access patterns before deploying to multi-threaded services.
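A common way to handle both the initialization-cost and thread-safety traps at once is the "build once, share" pattern: construct the filter a single time at class load and let every thread read the same final, fully built instance. The Engine type below is a stand-in for whatever filter object you build, not the library's class:

```java
// Class-load-time initialization is thread-safe by the Java language spec;
// every thread then reads the same final, fully constructed instance.
public final class FilterHolder {
    /** Hypothetical stand-in for an initialized, effectively immutable filter. */
    public static final class Engine {
        private final java.util.Set<String> words;
        Engine(java.util.Set<String> words) {
            this.words = java.util.Collections.unmodifiableSet(words);
        }
        public boolean contains(String text) {
            return words.stream().anyMatch(text::contains);
        }
    }

    // Expensive build happens exactly once, not per request.
    public static final Engine INSTANCE =
            new Engine(new java.util.HashSet<>(java.util.Arrays.asList("bad")));

    private FilterHolder() { }
}
```

Until the project documents its own concurrency guarantees, treating the built filter as read-only shared state like this is the conservative choice.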
🏗️Architecture
💡Concepts to learn
- Deterministic Finite Automaton (DFA) — The core algorithm in this library; understanding how DFA trie construction and character-by-character matching work is essential to optimizing detection performance and extending the engine with custom patterns.
- Trie (prefix tree) data structure — DFA implementations typically use a trie to store word lists; the trie enables O(m) lookup where m is the length of the input text, independent of vocabulary size, making it suitable for 60K+ word lists.
- Character normalization / text preprocessing — The library's advanced feature set (full/half-width conversion, simplified/traditional Chinese swapping, pinyin) exists precisely to normalize obfuscated input before DFA matching; understanding these transforms is critical for evading filter-bypass techniques.
- Fluent API / Builder pattern — The library's public interface is entirely fluent-style (ISensitiveWord), allowing method chaining for readable configuration; this is the idiomatic way to invoke the library and should be mimicked when adding new features.
- Strategy pattern (pluggable implementations) — Interfaces like IWordReplace, IWordFormat, and IWordCheck allow swappable implementations; understanding this design enables safe extension without modifying core logic.
- OpenCC (Simplified ↔ Traditional Chinese conversion) — The opencc4j (v1.14.0) dependency powers Chinese text normalization; knowing how it works helps debug issues with non-Latin scripts and character equivalence matching.
- QPS (Queries Per Second) benchmarking — The library claims 140K+ QPS performance; understanding how this metric is measured (single word lookups vs. paragraph scanning, JVM warmup, GC pauses) is crucial before trusting performance guarantees in production.
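Since the QPS figure is worth verifying yourself, here is a rough, generic harness for sanity-checking throughput of any string filter. It is a sketch, not a rigorous benchmark (a tool like JMH handles warmup and dead-code elimination properly); treat any single number as an order-of-magnitude estimate:

```java
// Crude QPS probe: warm up, then time repeated scans of one sample.
public class QpsProbe {
    public static double measureQps(java.util.function.Predicate<String> filter,
                                    String sample, int iterations) {
        for (int i = 0; i < iterations / 10; i++) {
            filter.test(sample);                 // JIT warmup pass
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            filter.test(sample);                 // timed passes
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return iterations / seconds;
    }
}
```

Results vary wildly with text length, word-list size, JIT state, and GC pauses, which is exactly why the concept list flags benchmark methodology as something to understand before trusting the headline claim.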
🔗Related repos
- houbb/sensitive-word-admin — Official admin console (MVP) for managing sensitive word lists; companion UI project for this core library
- houbb/sensitive — High-performance log desensitization component by the same author; integrates with sensitive-word for redacting sensitive fields in logs
- houbb/heaven — Utility abstraction library (v0.13.0 dependency); provides foundational utilities used throughout sensitive-word
- medcl/elasticsearch-analysis-ik — IK Chinese tokenizer for Elasticsearch; complementary solution if you need sensitive word filtering integrated with full-text search indexing
- fighting41love/funNLP — Alternative Chinese NLP toolkit with built-in sensitive word detection; similar problem domain, useful for comparison if you need broader NLP capabilities
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for ISensitiveWordCharIgnore and character normalization pipeline
The repo has advanced features like traditional/simplified Chinese conversion, full/half-width character handling, and pinyin conversion (referenced in roadmap files v004-v007), but there are no visible test files in src/test for the ISensitiveWordCharIgnore.java interface. This is critical for a tool processing multiple character encodings and transformations. Adding tests would ensure the normalization pipeline handles edge cases correctly.
- [ ] Create src/test/java/com/github/houbb/sensitive/word/api/ISensitiveWordCharIgnoreTest.java
- [ ] Add test cases for full-width to half-width character conversion
- [ ] Add test cases for traditional to simplified Chinese conversion
- [ ] Add test cases for mixed-encoding inputs (combining multiple transformations)
- [ ] Add test cases for edge cases (empty strings, special characters, numbers)
- [ ] Verify existing format implementations in src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java pass all tests
Implement and test the word denoising pipeline referenced in v016 roadmap
The roadmap file doc/issues/roadmap/v016-自定义降噪处理.md indicates custom noise reduction is a planned feature, but there's no corresponding implementation visible in the API layer. This is a high-value feature for improving detection accuracy against obfuscated sensitive words. Creating the interface and tests would unblock this feature.
- [ ] Create src/main/java/com/github/houbb/sensitive/word/api/IWordDenoise.java interface with methods for custom noise reduction rules
- [ ] Create src/main/java/com/github/houbb/sensitive/word/bs/WordDenoiseChain.java to handle chaining multiple denoise strategies
- [ ] Add integration to SensitiveWordBs.java to apply denoise operations before DFA matching
- [ ] Create src/test/java/com/github/houbb/sensitive/word/bs/WordDenoiseChainTest.java with test cases for common obfuscation patterns (homoglyphs, repeated characters, symbols)
- [ ] Document the feature with examples in README.md
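To make the denoise idea concrete, here is a self-contained sketch of what such a chain could do — collapse repeated characters and strip symbol noise before matching, so "b**a-a-d" reduces to "bad". All names here are hypothetical; the PR itself would define the real IWordDenoise interface:

```java
// Hypothetical denoise chain: each step is a text transform; the chain
// applies them in order before the text reaches the DFA matcher.
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class DenoiseChain {
    private final List<UnaryOperator<String>> steps = new ArrayList<>();

    DenoiseChain add(UnaryOperator<String> step) {
        steps.add(step);
        return this; // fluent, matching the library's builder style
    }

    String apply(String text) {
        for (UnaryOperator<String> step : steps) {
            text = step.apply(text);
        }
        return text;
    }

    /** Remove everything that is not a letter or digit (any script). */
    static final UnaryOperator<String> STRIP_SYMBOLS =
            s -> s.replaceAll("[^\\p{L}\\p{N}]", "");

    /** Collapse runs of the same character: "baad" -> "bad". */
    static final UnaryOperator<String> COLLAPSE_REPEATS =
            s -> s.replaceAll("(.)\\1+", "$1");
}
```

Note that aggressive collapsing can create false positives on legitimate doubled letters, which is why the roadmap frames this as *custom* (configurable) noise reduction.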
Add CI/CD workflow for multi-version Java compatibility testing (Java 8, 11, 17, 21)
The pom.xml specifies Java 1.8 as the compiler level, but there's no GitHub Actions workflow visible (.github/workflows missing from file list) to test against multiple JDK versions. Given the project targets production use and needs to support various enterprise environments, automated testing across LTS versions would catch compatibility regressions early.
- [ ] Create .github/workflows/java-matrix-test.yml with matrix strategy for Java 8, 11, 17, 21
- [ ] Configure Maven to run full test suite (surefire) on each Java version
- [ ] Add coverage reporting step using coveralls integration (already referenced in .coveralls.yml)
- [ ] Configure workflow to run on push to master (the repo's default branch) and on PRs
- [ ] Update README.md to show Java version compatibility badge
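The checklist above could be sketched as a workflow file roughly like the following. This file does not exist in the repo today; it is a hypothetical starting point, and the matrix values would need validation against the project's actual build:

```yaml
# Hypothetical .github/workflows/java-matrix-test.yml (sketch, not shipped)
name: java-matrix-test
on:
  push:
    branches: [master]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        java: ['8', '11', '17', '21']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: ${{ matrix.java }}
      - run: mvn -B test
```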
🌿Good first issues
- Add unit tests for IWordResult interface implementations in src/test/; the api/ folder shows 8 interfaces but no test files are listed, making it unclear if all edge cases (overlapping matches, empty strings, null words) are covered.
- Document the exact DFA trie construction algorithm and memory footprint in a /doc/internals.md file; the README mentions 'DFA algorithm' but no deep-dive exists for contributors wanting to optimize or extend the matching logic.
- Implement a concrete example in /examples showing the fluent API for each major feature (detect, replace, tag extraction, custom white/blacklist) with unit tests; the README has snippets but no runnable Maven submodule.
⭐Top contributors
- @houbb — 47 commits
- @binbin.hou — 33 commits
- @yds — 7 commits
- @k9999dot — 6 commits
- @bbhou — 3 commits
📝Recent commits
- b86249d — [fixed] English whole-word matching issue (binbin.hou)
- 67d4719 — [fixed] English whole-word matching issue (binbin.hou)
- 855659a — [Feature] add for new (bbhou)
- 3a444e8 — [Feature] add for new (bbhou)
- 82c0d00 — [Feature] add for new (bbhou)
- 00f888e — v0.29.3 opt (binbin.hou)
- dbbef1c — [Feature] add for new (binbin.hou)
- 9605052 — [Feature] add for new (binbin.hou)
- 8378e20 — [Feature] add for new (binbin.hou)
- a46f430 — v0.29.1 opt (binbin.hou)
🔒Security observations
The sensitive-word project has moderate security concerns, primarily outdated build-time dependencies and an end-of-life Java target. The most critical issues are the use of Java 1.8 and dated Maven plugin versions.
- High · Outdated JUnit Dependency — pom.xml, junit.version property (4.13.1). JUnit 4.13.1 (2020) has been superseded by 4.13.2, which carries additional fixes. Fix: upgrade JUnit to 4.13.2 or higher, and run mvn versions:display-dependency-updates to identify all outdated dependencies and patch them accordingly.
- High · Potentially Vulnerable Plugin Versions — pom.xml, plugin versions. The project uses outdated Maven plugins, including maven-surefire-plugin 2.18.1 (released 2014), maven-javadoc-plugin 2.9.1 (released 2013), and maven-gpg-plugin 1.5 (released 2013). These lack years of bug fixes and modern security features. Fix: update maven-surefire-plugin to 2.22.2+, maven-javadoc-plugin to 3.3.1+, and maven-gpg-plugin to 1.6+.
- Medium · Missing Dependency Version Management — pom.xml, dependencyManagement section. The analyzed pom.xml showed incomplete dependency declarations (a <dependency> tag not closed in the snippet), which could indicate missing version pinning for transitive dependencies. Fix: declare and pin all dependencies explicitly, and add a vulnerability-scanning plugin (e.g., OWASP dependency-check-maven) to the build.
- Medium · Java Version 1.8 (EOL) — pom.xml, project.compiler.level property. The project targets Java 1.8, which no longer receives free public security updates from Oracle, exposing applications to known JVM vulnerabilities. Fix: upgrade to Java 11 or 17 LTS, raise the compiler level accordingly, test the codebase for compatibility, and confirm all dependencies support the newer runtime.
- Medium · Sensitive Word Data Dependency Unclear — pom.xml, sensitive-word-data.version property. The project depends on sensitive-word-data 1.0.0 without clear documentation of its source, versioning strategy, or security vetting; the dataset source could be outdated or tampered with. Fix: document the source and vetting process for the dependency, update it regularly, and consider pinning to a specific release with checksum verification.
- Low · Missing HTTPS in Maven Repository Configuration — pom.xml, repository configuration (not shown). While not visible in the provided snippet, the build should ensure all Maven repositories use HTTPS to prevent man-in-the-middle attacks during dependency resolution. Fix: verify every <repository> and <pluginRepository> entry uses an HTTPS URL (https://repo.maven.apache.org/maven2) and that no plain-HTTP repositories are configured.
- Low · No Security Headers or Content Security Policy Configuration — project-wide. As a library, this project has no security-header configuration of its own; if it is consumed in a web application, headers should be enforced there. Fix: provide documentation and examples for integrating security headers in web applications that use this library.
- Low · Incomplete Source Plugin Configuration — pom.xml, maven-source-plugin.version property. maven-source-plugin 2.2.1 (released 2013) is outdated and may have issues with reproducible builds. Fix: upgrade maven-source-plugin to 2.4 or higher.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.