houbb/sensitive-word
👮‍♂️The sensitive word tool for Java. (Sensitive / banned / illegal / profane words. A high-performance Java sensitive-word filtering framework based on the DFA algorithm, with built-in word tagging and severity classification. Do not publish content involving politics, advertising, marketing, circumvention tools, or anything that violates national laws and regulations. A high-performance sensitive-word detection and filtering component, with traditional/simplified Chinese conversion, full-width/half-width conversion, Chinese-to-pinyin conversion, fuzzy search, and more.)
Healthy across the board
- Weakest axis: permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — clean foundation to fork and modify.
- Documented and popular — useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 7w ago
- ✓9 active contributors
- ✓Distributed ownership (top contributor 47% of recent commits)
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/houbb/sensitive-word)
Paste at the top of your README.md — renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/houbb/sensitive-word on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: houbb/sensitive-word
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/houbb/sensitive-word shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 7w ago
- 9 active contributors
- Distributed ownership (top contributor 47% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live houbb/sensitive-word
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/houbb/sensitive-word.
What it runs against: a local clone of houbb/sensitive-word — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in houbb/sensitive-word | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 76 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of houbb/sensitive-word. If you don't
# have one yet, run these first:
#
# git clone https://github.com/houbb/sensitive-word.git
# cd sensitive-word
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of houbb/sensitive-word and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "houbb/sensitive-word(\.git)?\b" \
  && ok "origin remote is houbb/sensitive-word" \
  || miss "origin remote is not houbb/sensitive-word (artifact may be from a fork)"
# 2. License matches what RepoPilot saw. The standard Apache-2.0 LICENSE file
# begins "Apache License", not "Apache-2.0", so match on that phrase.
(grep -qi "Apache License" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
for f in \
  "src/main/java/com/github/houbb/sensitive/word/core/SensitiveWord.java" \
  "src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java" \
  "src/main/java/com/github/houbb/sensitive/word/api/ISensitiveWord.java" \
  "src/main/java/com/github/houbb/sensitive/word/core/AbstractSensitiveWord.java" \
  "src/main/java/com/github/houbb/sensitive/word/support/check/WordChecks.java"
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 76 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~46d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/houbb/sensitive-word"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
A high-performance Java library for detecting and filtering sensitive words (profanity, illegal content, prohibited terms) using a Deterministic Finite Automaton (DFA) algorithm. It ships with 60K+ pre-built word entries, achieves 140K+ QPS throughput, and supports advanced text normalization (simplified/traditional Chinese conversion, full/half-width character handling, pinyin transformation) to defeat common evasion techniques. Standard Maven monolith: src/main/java/com/github/houbb/sensitive/word/ contains the core DFA implementation with modular interfaces (ISensitiveWord.java, IWordCheck.java, IWordReplace.java, etc.) for pluggable behavior. Word data lives separately (sensitive-word-data dependency v1.0.0). Test code presumably in src/test/ (not listed). Build artifacts configured in pom.xml for Maven Central publication.
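The DFA claim above is easiest to grasp from a toy version. The sketch below is illustrative only — class and method names here are invented, not the library's API. The idea: build a prefix trie from the word list once, then scan the input character by character; lookup cost depends on input length, not vocabulary size.

```java
// Minimal trie-based sensitive-word matcher (illustrative sketch, not the
// library's implementation). Build once, query many times.
import java.util.HashMap;
import java.util.Map;

public class TrieSketch {
    static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWordEnd;
    }

    private final Node root = new Node();

    public void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWordEnd = true;
    }

    /** Returns true if any added word occurs as a substring of text. */
    public boolean containsSensitive(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;          // no word continues with this character
                }
                if (node.isWordEnd) {
                    return true;    // matched a complete sensitive word
                }
            }
        }
        return false;
    }
}
```

The real engine layers normalization (full/half-width, simplified/traditional, pinyin) in front of a matcher like this so that obfuscated variants hit the same trie paths.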
👥Who it's for
Java backend developers building content moderation systems, user-generated content platforms, or chat/forum applications who need production-grade sensitive word filtering without building their own DFA engine and maintaining massive word lists.
🌱Maturity & risk
Production-ready. Active development as of v0.29.5 (recent releases in the commit history), comprehensive CI setup (.travis.yml, .coveralls.yml), Maven Central distribution, 60K+ curated word entries, and extensive documentation in /doc/issues/roadmap/ indicate a mature, actively maintained project. Benchmark claims of 140K+ QPS (written "14W+" in Chinese numeral shorthand, where W = 万 = 10,000) are documented.
Low-to-moderate risk. Single maintainer (houbb) visible in package structure com.github.houbb. Dependencies are minimal but include opencc4j (1.14.0) for Chinese conversion and heaven (0.13.0) for utility functions—audit these for security. The word list is built-in (not dynamic by default), so you must rebuild/redeploy to update banned terms; roadmap items like v009 (custom blacklist) and dynamic loading partially mitigate this. No visible open issue backlog in the provided file list.
Active areas of work
Roadmap items in /doc/issues/roadmap/ indicate active feature development: v004 (full/half-width punctuation), v005 (digit normalization), v006 (simplified/traditional Chinese swap), v007 (repeated word handling), v009 (custom blacklist), v010 (white list support), v011 (email/URL regex detection), v012 (stopword filtering), v014/v015 (phonetically/visually similar character matching, mirroring), v016 (custom noise reduction). The admin console (sensitive-word-admin) is mentioned as in early MVP development.
🚀Get running
git clone https://github.com/houbb/sensitive-word.git
cd sensitive-word
mvn clean install
mvn test
No external services required; library is self-contained. Uses Java 1.8+ (project.compiler.level=1.8 in pom.xml).
Daily commands:
This is a library, not an application. To use it, add the dependency to your pom.xml:

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.29.5</version>
</dependency>

then instantiate via the fluent API (documented in README.md). For development: mvn clean package builds the JAR; mvn test runs the test suite.
🗺️Map of the codebase
- src/main/java/com/github/houbb/sensitive/word/core/SensitiveWord.java — Primary entry point and main API implementation for DFA-based sensitive word detection; all users interact through this class
- src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java — Builder pattern implementation that orchestrates configuration, initialization, and chaining of word filtering operations
- src/main/java/com/github/houbb/sensitive/word/api/ISensitiveWord.java — Core interface contract defining all public operations (check, replace, tags, formatting); critical for understanding the API surface
- src/main/java/com/github/houbb/sensitive/word/core/AbstractSensitiveWord.java — Base implementation of the DFA algorithm logic and core word matching; foundation for all sensitive word operations
- src/main/java/com/github/houbb/sensitive/word/support/check/WordChecks.java — Registry and factory for pluggable word validation checks (email, URL, IP, numbers); enables an extensible validation pipeline
- src/main/java/com/github/houbb/sensitive/word/support/allow/WordAllows.java — Whitelist management system that integrates with allow/deny rules to customize filtering behavior
- src/main/java/com/github/houbb/sensitive/word/api/context/InnerSensitiveWordContext.java — Context holder managing state across filtering operations, including tags, formats, and intermediate results
🛠️How to make changes
Add a Custom Word Check Strategy
- Create a new check class extending AbstractWordCheck or implementing IWordCheck in src/main/java/com/github/houbb/sensitive/word/support/check/ (src/main/java/com/github/houbb/sensitive/word/support/check/WordCheckWord.java)
- Register the new check in the WordChecks.java factory as an available strategy (src/main/java/com/github/houbb/sensitive/word/support/check/WordChecks.java)
- Integrate into the builder pipeline via SensitiveWordBs by adding a fluent method that selects your check strategy (src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java)
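The shape of such a check strategy can be sketched with simplified, hypothetical interfaces — the real IWordCheck/WordChecks signatures differ, so treat the names below as stand-ins, not the library's API:

```java
// Hypothetical, simplified mirror of the pluggable-check strategy pattern.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

interface WordCheck {
    /** Returns true if the candidate token should be flagged. */
    boolean check(String token);
}

class EmailCheck implements WordCheck {
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    @Override
    public boolean check(String token) {
        return EMAIL.matcher(token).find();
    }
}

/** Registry in the spirit of WordChecks: compose strategies, flag on any hit. */
class CheckRegistry {
    private final List<WordCheck> checks = new ArrayList<>();

    CheckRegistry register(WordCheck check) {
        checks.add(check);
        return this; // fluent, mirroring the library's builder style
    }

    boolean anyMatch(String token) {
        return checks.stream().anyMatch(c -> c.check(token));
    }
}
```

The point of the pattern is that new checks plug into the registry without touching core matching code.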
Add a Custom Format/Normalization Handler
- Create a new format class implementing IWordFormat in src/main/java/com/github/houbb/sensitive/word/support/format/ (src/main/java/com/github/houbb/sensitive/word/api/IWordFormat.java)
- Implement the format() method to normalize text (e.g., simplified↔traditional, pinyin conversion) (src/main/java/com/github/houbb/sensitive/word/api/IWordFormatText.java)
- Configure in SensitiveWordBs via addFormat() to apply during word matching and replacement (src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java)
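As a concrete example of what such a normalization handler does, here is a self-contained full-width-to-half-width converter (the class name and static-method shape are illustrative; the library's IWordFormat contract differs):

```java
// Full-width ASCII (U+FF01..U+FF5E) maps to half-width by subtracting
// 0xFEE0; the ideographic space U+3000 maps to a plain space. This is the
// standard Unicode relationship, independent of any library.
public class HalfWidthFormat {
    public static String format(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));   // e.g. 'Ａ' -> 'A'
            } else if (c == '\u3000') {
                sb.append(' ');                    // ideographic space
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Running input through transforms like this before matching is what defeats full-width obfuscation of banned terms.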
Add Custom Allow/Whitelist Rules
- Create a new allow implementation extending AbstractWordAllow or implementing IWordAllow in src/main/java/com/github/houbb/sensitive/word/support/allow/ (src/main/java/com/github/houbb/sensitive/word/support/allow/WordAllowInit.java)
- Implement the contains() method to check whether a word should be exempted from filtering (src/main/java/com/github/houbb/sensitive/word/api/IWordAllow.java)
- Register via the SensitiveWordBs.allow() builder method to apply during sensitive word detection (src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java)
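The allow-rule contract reduces to "matched word is suppressed if any allow rule claims it". A minimal sketch, with hypothetical names (the real IWordAllow interface differs):

```java
// Hypothetical allow-list: exempt specific words from the match results.
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class AllowList {
    private final Set<String> allowed;

    AllowList(String... words) {
        this.allowed = new HashSet<>(Arrays.asList(words));
    }

    /** Mirrors the contains() idea: should this word be exempted? */
    boolean contains(String word) {
        return allowed.contains(word);
    }

    /** Drop matches that an allow rule exempts. */
    List<String> filter(List<String> matches) {
        return matches.stream()
                .filter(w -> !contains(w))
                .collect(Collectors.toList());
    }
}
```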
Extend with Custom Tag Classifications
- Review existing word tag types in the WordTagType.java enum (src/main/java/com/github/houbb/sensitive/word/constant/enums/WordTagType.java)
- Add new tag enum values and configure them in the word data source/repository with metadata (src/main/java/com/github/houbb/sensitive/word/api/IWordTag.java)
- Query tags via the SensitiveWord.tags() method, which returns categorized sensitive words from AbstractSensitiveWord (src/main/java/com/github/houbb/sensitive/word/core/AbstractSensitiveWord.java)
🪤Traps & gotchas
- No Spring Boot auto-configuration or property files are visible. This is a plain library, not a starter; it requires manual fluent-API instantiation.
- The bundled word list (60K+ entries) is immutable by default; dynamic updates require custom IWordData implementations (roadmap v009).
- The DFA trie is likely built once at initialization; measure initialization cost with large custom word lists before using it in hot paths.
- Chinese text processing depends on opencc4j 1.14.0; ensure compatible locale settings on Windows.
- No explicit thread-safety documentation was found in the listed files; verify concurrent access patterns before deploying to multi-threaded services.
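A common way to handle both the initialization-cost and thread-safety traps at once is the "build once, share" pattern: construct the filter a single time at class load and let every thread read the same final, fully built instance. The Engine type below is a stand-in for whatever filter object you build, not the library's class:

```java
// Class-load-time initialization is thread-safe by the Java language spec;
// every thread then reads the same final, fully constructed instance.
public final class FilterHolder {
    /** Hypothetical stand-in for an initialized, effectively immutable filter. */
    public static final class Engine {
        private final java.util.Set<String> words;
        Engine(java.util.Set<String> words) {
            this.words = java.util.Collections.unmodifiableSet(words);
        }
        public boolean contains(String text) {
            return words.stream().anyMatch(text::contains);
        }
    }

    // Expensive build happens exactly once, not per request.
    public static final Engine INSTANCE =
            new Engine(new java.util.HashSet<>(java.util.Arrays.asList("bad")));

    private FilterHolder() { }
}
```

Until the project documents its own concurrency guarantees, treating the built filter as read-only shared state like this is the conservative choice.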
🏗️Architecture
💡Concepts to learn
- Deterministic Finite Automaton (DFA) — The core algorithm in this library; understanding how DFA trie construction and character-by-character matching work is essential to optimizing detection performance and extending the engine with custom patterns.
- Trie (prefix tree) data structure — DFA implementations typically use a trie to store word lists; the trie enables O(m) lookup where m is the length of the input text, independent of vocabulary size, making it suitable for 60K+ word lists.
- Character normalization / text preprocessing — The library's advanced feature set (full/half-width conversion, simplified/traditional Chinese swapping, pinyin) exists precisely to normalize obfuscated input before DFA matching; understanding these transforms is critical for evading filter-bypass techniques.
- Fluent API / Builder pattern — The library's public interface is entirely fluent-style (ISensitiveWord), allowing method chaining for readable configuration; this is the idiomatic way to invoke the library and should be mimicked when adding new features.
- Strategy pattern (pluggable implementations) — Interfaces like IWordReplace, IWordFormat, and IWordCheck allow swappable implementations; understanding this design enables safe extension without modifying core logic.
- OpenCC (Simplified ↔ Traditional Chinese conversion) — The opencc4j (v1.14.0) dependency powers Chinese text normalization; knowing how it works helps debug issues with non-Latin scripts and character equivalence matching.
- QPS (Queries Per Second) benchmarking — The library claims 140K+ QPS performance; understanding how this metric is measured (single word lookups vs. paragraph scanning, JVM warmup, GC pauses) is crucial before trusting performance guarantees in production.
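Since the QPS figure is worth verifying yourself, here is a rough, generic harness for sanity-checking throughput of any string filter. It is a sketch, not a rigorous benchmark (a tool like JMH handles warmup and dead-code elimination properly); treat any single number as an order-of-magnitude estimate:

```java
// Crude QPS probe: warm up, then time repeated scans of one sample.
public class QpsProbe {
    public static double measureQps(java.util.function.Predicate<String> filter,
                                    String sample, int iterations) {
        for (int i = 0; i < iterations / 10; i++) {
            filter.test(sample);                 // JIT warmup pass
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            filter.test(sample);                 // timed passes
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return iterations / seconds;
    }
}
```

Results vary wildly with text length, word-list size, JIT state, and GC pauses, which is exactly why the concept list flags benchmark methodology as something to understand before trusting the headline claim.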
🔗Related repos
- houbb/sensitive-word-admin — Official admin console (MVP) for managing sensitive word lists; companion UI project for this core library
- houbb/sensitive — High-performance log desensitization component by the same author; integrates with sensitive-word for redacting sensitive fields in logs
- houbb/heaven — Utility abstraction library (v0.13.0 dependency); provides foundational utilities used throughout sensitive-word
- medcl/elasticsearch-analysis-ik — IK Chinese tokenizer for Elasticsearch; complementary solution if you need sensitive word filtering integrated with full-text search indexing
- fighting41love/funNLP — Alternative Chinese NLP toolkit with built-in sensitive word detection; similar problem domain, useful for comparison if you need broader NLP capabilities
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for ISensitiveWordCharIgnore and character normalization pipeline
The repo has advanced features like traditional/simplified Chinese conversion, full/half-width character handling, and pinyin conversion (referenced in roadmap files v004-v007), but there are no visible test files in src/test for the ISensitiveWordCharIgnore.java interface. This is critical for a tool processing multiple character encodings and transformations. Adding tests would ensure the normalization pipeline handles edge cases correctly.
- [ ] Create src/test/java/com/github/houbb/sensitive/word/api/ISensitiveWordCharIgnoreTest.java
- [ ] Add test cases for full-width to half-width character conversion
- [ ] Add test cases for traditional to simplified Chinese conversion
- [ ] Add test cases for mixed-encoding inputs (combining multiple transformations)
- [ ] Add test cases for edge cases (empty strings, special characters, numbers)
- [ ] Verify existing format implementations in src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java pass all tests
Implement and test the word denoising pipeline referenced in v016 roadmap
The roadmap file doc/issues/roadmap/v016-自定义降噪处理.md indicates custom noise reduction is a planned feature, but there's no corresponding implementation visible in the API layer. This is a high-value feature for improving detection accuracy against obfuscated sensitive words. Creating the interface and tests would unblock this feature.
- [ ] Create src/main/java/com/github/houbb/sensitive/word/api/IWordDenoise.java interface with methods for custom noise reduction rules
- [ ] Create src/main/java/com/github/houbb/sensitive/word/bs/WordDenoiseChain.java to handle chaining multiple denoise strategies
- [ ] Add integration to SensitiveWordBs.java to apply denoise operations before DFA matching
- [ ] Create src/test/java/com/github/houbb/sensitive/word/bs/WordDenoiseChainTest.java with test cases for common obfuscation patterns (homoglyphs, repeated characters, symbols)
- [ ] Document the feature with examples in README.md
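To make the denoise idea concrete, here is a self-contained sketch of what such a chain could do — collapse repeated characters and strip symbol noise before matching, so "b**a-a-d" reduces to "bad". All names here are hypothetical; the PR itself would define the real IWordDenoise interface:

```java
// Hypothetical denoise chain: each step is a text transform; the chain
// applies them in order before the text reaches the DFA matcher.
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class DenoiseChain {
    private final List<UnaryOperator<String>> steps = new ArrayList<>();

    DenoiseChain add(UnaryOperator<String> step) {
        steps.add(step);
        return this; // fluent, matching the library's builder style
    }

    String apply(String text) {
        for (UnaryOperator<String> step : steps) {
            text = step.apply(text);
        }
        return text;
    }

    /** Remove everything that is not a letter or digit (any script). */
    static final UnaryOperator<String> STRIP_SYMBOLS =
            s -> s.replaceAll("[^\\p{L}\\p{N}]", "");

    /** Collapse runs of the same character: "baad" -> "bad". */
    static final UnaryOperator<String> COLLAPSE_REPEATS =
            s -> s.replaceAll("(.)\\1+", "$1");
}
```

Note that aggressive collapsing can create false positives on legitimate doubled letters, which is why the roadmap frames this as *custom* (configurable) noise reduction.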
Add CI/CD workflow for multi-version Java compatibility testing (Java 8, 11, 17, 21)
The pom.xml specifies Java 1.8 as the compiler level, but there's no GitHub Actions workflow visible (.github/workflows missing from file list) to test against multiple JDK versions. Given the project targets production use and needs to support various enterprise environments, automated testing across LTS versions would catch compatibility regressions early.
- [ ] Create .github/workflows/java-matrix-test.yml with matrix strategy for Java 8, 11, 17, 21
- [ ] Configure Maven to run full test suite (surefire) on each Java version
- [ ] Add coverage reporting step using coveralls integration (already referenced in .coveralls.yml)
- [ ] Configure workflow to run on push to master (the repo's default branch) and on PRs
- [ ] Update README.md to show Java version compatibility badge
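The checklist above could be sketched as a workflow file roughly like the following. This file does not exist in the repo today; it is a hypothetical starting point, and the matrix values would need validation against the project's actual build:

```yaml
# Hypothetical .github/workflows/java-matrix-test.yml (sketch, not shipped)
name: java-matrix-test
on:
  push:
    branches: [master]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        java: ['8', '11', '17', '21']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: ${{ matrix.java }}
      - run: mvn -B test
```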
🌿Good first issues
- Add unit tests for IWordResult interface implementations in src/test/; the api/ folder shows 8 interfaces but no test files are listed, making it unclear if all edge cases (overlapping matches, empty strings, null words) are covered.
- Document the exact DFA trie construction algorithm and memory footprint in a /doc/internals.md file; the README mentions 'DFA algorithm' but no deep-dive exists for contributors wanting to optimize or extend the matching logic.
- Implement a concrete example in /examples showing the fluent API for each major feature (detect, replace, tag extraction, custom white/blacklist) with unit tests; the README has snippets but no runnable Maven submodule.
⭐Top contributors
- @houbb — 47 commits
- @binbin.hou — 33 commits
- @yds — 7 commits
- @k9999dot — 6 commits
- @bbhou — 3 commits
📝Recent commits
- b86249d — [fixed] English whole-word matching issue (binbin.hou)
- 67d4719 — [fixed] English whole-word matching issue (binbin.hou)
- 855659a — [Feature] add for new (bbhou)
- 3a444e8 — [Feature] add for new (bbhou)
- 82c0d00 — [Feature] add for new (bbhou)
- 00f888e — v0.29.3 opt (binbin.hou)
- dbbef1c — [Feature] add for new (binbin.hou)
- 9605052 — [Feature] add for new (binbin.hou)
- 8378e20 — [Feature] add for new (binbin.hou)
- a46f430 — v0.29.1 opt (binbin.hou)
🔒Security observations
The sensitive-word project has moderate security concerns, primarily outdated build-time dependencies and an end-of-life Java target. The most critical issues are the use of Java 1.8 and dated Maven plugin versions.
- High · Outdated JUnit Dependency — pom.xml, junit.version property (4.13.1). JUnit 4.13.1 (2020) has been superseded by 4.13.2, which carries additional fixes. Fix: upgrade JUnit to 4.13.2 or higher, and run mvn versions:display-dependency-updates to identify all outdated dependencies and patch them accordingly.
- High · Potentially Vulnerable Plugin Versions — pom.xml, plugin versions. The project uses outdated Maven plugins, including maven-surefire-plugin 2.18.1 (released 2014), maven-javadoc-plugin 2.9.1 (released 2013), and maven-gpg-plugin 1.5 (released 2013). These lack years of bug fixes and modern security features. Fix: update maven-surefire-plugin to 2.22.2+, maven-javadoc-plugin to 3.3.1+, and maven-gpg-plugin to 1.6+.
- Medium · Missing Dependency Version Management — pom.xml, dependencyManagement section. The analyzed pom.xml showed incomplete dependency declarations (a <dependency> tag not closed in the snippet), which could indicate missing version pinning for transitive dependencies. Fix: declare and pin all dependencies explicitly, and add a vulnerability-scanning plugin (e.g., OWASP dependency-check-maven) to the build.
- Medium · Java Version 1.8 (EOL) — pom.xml, project.compiler.level property. The project targets Java 1.8, which no longer receives free public security updates from Oracle, exposing applications to known JVM vulnerabilities. Fix: upgrade to Java 11 or 17 LTS, raise the compiler level accordingly, test the codebase for compatibility, and confirm all dependencies support the newer runtime.
- Medium · Sensitive Word Data Dependency Unclear — pom.xml, sensitive-word-data.version property. The project depends on sensitive-word-data 1.0.0 without clear documentation of its source, versioning strategy, or security vetting; the dataset source could be outdated or tampered with. Fix: document the source and vetting process for the dependency, update it regularly, and consider pinning to a specific release with checksum verification.
- Low · Missing HTTPS in Maven Repository Configuration — pom.xml, repository configuration (not shown). While not visible in the provided snippet, the build should ensure all Maven repositories use HTTPS to prevent man-in-the-middle attacks during dependency resolution. Fix: verify every <repository> and <pluginRepository> entry uses an HTTPS URL (https://repo.maven.apache.org/maven2) and that no plain-HTTP repositories are configured.
- Low · No Security Headers or Content Security Policy Configuration — project-wide. As a library, this project has no security-header configuration of its own; if it is consumed in a web application, headers should be enforced there. Fix: provide documentation and examples for integrating security headers in web applications that use this library.
- Low · Incomplete Source Plugin Configuration — pom.xml, maven-source-plugin.version property. maven-source-plugin 2.2.1 (released 2013) is outdated and may have issues with reproducible builds. Fix: upgrade maven-source-plugin to 2.4 or higher.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.