RepoPilot

infinilabs/analysis-ik

🚌 The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and OpenSearch, with support for customized dictionaries.

Healthy

Healthy across the board

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 2d ago
  • 28+ active contributors
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • Concentrated ownership — top contributor handles 63% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
[![RepoPilot: Healthy](https://repopilot.app/api/badge/infinilabs/analysis-ik)](https://repopilot.app/r/infinilabs/analysis-ik)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/infinilabs/analysis-ik on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: infinilabs/analysis-ik

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/infinilabs/analysis-ik shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 2d ago
  • 28+ active contributors
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 63% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live infinilabs/analysis-ik repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/infinilabs/analysis-ik.

What it runs against: a local clone of infinilabs/analysis-ik — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in infinilabs/analysis-ik | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches a relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 32 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>infinilabs/analysis-ik</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of infinilabs/analysis-ik. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/infinilabs/analysis-ik.git
#   cd analysis-ik
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of infinilabs/analysis-ik and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "infinilabs/analysis-ik(\.git)?\b" \
  && ok "origin remote is infinilabs/analysis-ik" \
  || miss "origin remote is not infinilabs/analysis-ik (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. The stock Apache-2.0 LICENSE text
# begins with "Apache License" / "Version 2.0", not the SPDX id "Apache-2.0".
(grep -qi "Apache License" LICENSE 2>/dev/null && grep -qi "Version 2.0" LICENSE 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in \
  "core/src/main/java/org/wltea/analyzer/core/IKSegmenter.java" \
  "core/src/main/java/org/wltea/analyzer/dic/Dictionary.java" \
  "core/src/main/java/org/wltea/analyzer/lucene/IKAnalyzer.java" \
  "elasticsearch/src/main/java/com/infinilabs/ik/elasticsearch/AnalysisIkPlugin.java" \
  "opensearch/src/main/java/com/infinilabs/ik/opensearch/AnalysisIkPlugin.java"
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 32 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~2d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/infinilabs/analysis-ik"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

A Lucene-based Chinese/CJK tokenizer plugin for Elasticsearch and OpenSearch that integrates the IK Analyzer to provide intelligent word segmentation in two modes: ik_smart (fast, fewer tokens) and ik_max_word (thorough, maximum coverage). It supports custom dictionary injection via XML configuration files and handles Chinese-specific linguistic challenges such as surname detection, quantifiers, and surrogate pairs for rare characters. It is a multi-module Maven project: core/ contains the IK analyzer implementation (segmenters, dictionary, configuration) under core/src/main/java/org/wltea/analyzer/, split into subpackages (cfg, core, dic, help, lucene). Dictionaries live in config/ as plain-text files (main.dic, stopword.dic, suffix.dic, etc.). The root pom.xml orchestrates the build; the elasticsearch/ and opensearch/ modules wrap core into plugin jars.

👥Who it's for

Search infrastructure engineers and DevOps teams deploying Elasticsearch/OpenSearch in Chinese or multilingual environments who need accurate Chinese text tokenization without building custom segmentation logic. Also used by developers integrating IK Analysis into Java applications via the core Lucene library.

🌱Maturity & risk

Production-ready. The project has CI via GitHub Actions (.github/workflows/test.yml), Apache-2.0 licensing, and is actively maintained by INFINI Labs, with pre-built binaries available at https://release.infinilabs.com/. The core IK algorithm itself originates in the Lucene ecosystem and has been battle-tested for years; the plugin wrapper is the actively maintained layer.

Low-to-moderate risk. Dependencies are minimal and stable (Lucene, Log4j, HttpClient), all pinned to specific versions. A single organization (INFINI Labs) maintains this fork, so organizational dissolution is a risk. There is no visibility into the open-issue backlog or PR velocity from the provided data. Breaking changes are possible across major Elasticsearch/OpenSearch upgrades, as evidenced by separate plugin URLs for ES 9.1.4 vs OpenSearch 2.12.0.

Active areas of work

The recent commit log shows segmentation bug fixes (issues #921, #1100, #1108, #1119, #1137) and version alignment with upstream releases (Elasticsearch 9.1.4, OpenSearch 2.12.0). PR velocity and the open-issue backlog are not visible from the provided data.

🚀Get running

```bash
git clone https://github.com/infinilabs/analysis-ik.git
cd analysis-ik
mvn clean install -DskipTests
# Binary output in elasticsearch/ or opensearch/ target/ dirs
```

Daily commands: This is a library/plugin, not a runnable application. To test locally: mvn test in core/. To use: install the plugin JAR into Elasticsearch/OpenSearch via bin/elasticsearch-plugin install file:///path/to/plugin.zip or download pre-built from https://release.infinilabs.com/.

🗺️Map of the codebase

  • core/src/main/java/org/wltea/analyzer/core/IKSegmenter.java — Core segmentation engine that performs Chinese word segmentation; understanding this is essential for grasping how IK analyzer tokenizes text
  • core/src/main/java/org/wltea/analyzer/dic/Dictionary.java — Dictionary management system that loads and manages all word lists; critical for customization and performance
  • core/src/main/java/org/wltea/analyzer/lucene/IKAnalyzer.java — Lucene integration entry point; bridges the IK core analyzer to Elasticsearch/OpenSearch ecosystems
  • elasticsearch/src/main/java/com/infinilabs/ik/elasticsearch/AnalysisIkPlugin.java — Elasticsearch plugin bootstrap; registers analyzers and tokenizers with ES runtime
  • opensearch/src/main/java/com/infinilabs/ik/opensearch/AnalysisIkPlugin.java — OpenSearch plugin bootstrap; parallel to ES plugin but tailored for OpenSearch compatibility
  • config/IKAnalyzer.cfg.xml — Configuration file that specifies dictionary paths and analyzer behavior; must be edited to customize word lists
  • pom.xml — Root Maven POM; defines multi-module build structure for core, elasticsearch, and opensearch variants

🛠️How to make changes

Add Custom Words to Dictionary

  1. Edit the extra dictionary file appropriate for your use case (config/extra_main.dic)
  2. Add one word per line to the file (words will be auto-reloaded via Monitor.java) (config/extra_main.dic)
  3. Verify the configuration references your dictionary in the XML config (config/IKAnalyzer.cfg.xml)
  4. Restart Elasticsearch/OpenSearch or wait for hot-reload via Dictionary.java monitor

Implement a New Segmentation Strategy

  1. Create a new class implementing ISegmenter interface (core/src/main/java/org/wltea/analyzer/core/ISegmenter.java)
  2. Register your segmenter in IKSegmenter.java's init() method alongside CJKSegmenter and LetterSegmenter (core/src/main/java/org/wltea/analyzer/core/IKSegmenter.java)
  3. Update AnalyzeContext to track any new state your segmenter requires (core/src/main/java/org/wltea/analyzer/core/AnalyzeContext.java)
  4. Add test cases to verify segmentation results (core/src/test/java/org/wltea/analyzer/lucene/IKAnalyzerTests.java)
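
The four steps above follow a common pattern: a segmenter inspects the context's cursor position and emits tokens when a run of characters closes. Below is a self-contained sketch of that pattern. ISegmenter and AnalyzeContext here are simplified stand-ins, and DigitRunSegmenter is hypothetical; the real interfaces in core/src/main/java/org/wltea/analyzer/core/ carry more state and differ in signature:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the real ISegmenter interface.
interface ISegmenter {
    // Inspect the character at the context cursor; emit tokens when a run ends.
    void analyze(AnalyzeContext ctx);
}

// Simplified stand-in for the real AnalyzeContext.
class AnalyzeContext {
    final String text;
    int cursor;
    final List<String> tokens = new ArrayList<>();
    AnalyzeContext(String text) { this.text = text; }
    char current() { return text.charAt(cursor); }
}

// Hypothetical segmenter that groups consecutive ASCII digits into one token,
// loosely mirroring how LetterSegmenter buffers runs of Latin/numeric chars.
class DigitRunSegmenter implements ISegmenter {
    private int start = -1;
    public void analyze(AnalyzeContext ctx) {
        boolean digit = Character.isDigit(ctx.current());
        if (digit && start < 0) start = ctx.cursor;          // open a run
        boolean atEnd = ctx.cursor == ctx.text.length() - 1;
        if (start >= 0 && (!digit || atEnd)) {               // close the run
            int end = (digit && atEnd) ? ctx.cursor + 1 : ctx.cursor;
            ctx.tokens.add(ctx.text.substring(start, end));
            start = -1;
        }
    }
}

public class SegmenterSketch {
    static List<String> run(String text, ISegmenter seg) {
        AnalyzeContext ctx = new AnalyzeContext(text);
        for (ctx.cursor = 0; ctx.cursor < text.length(); ctx.cursor++) seg.analyze(ctx);
        return ctx.tokens;
    }
    public static void main(String[] args) {
        System.out.println(run("abc123def45", new DigitRunSegmenter())); // [123, 45]
    }
}
```

In the real codebase, IKSegmenter drives all registered segmenters over the same buffer in one pass, which is why step 2 registers your class in init() rather than running it standalone.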

Support a New Search Engine (e.g., Meilisearch)

  1. Create a new Maven module mirroring elasticsearch/ or opensearch/ structure (pom.xml)
  2. Create plugin entry point by extending/copying AnalysisIkPlugin pattern (elasticsearch/src/main/java/com/infinilabs/ik/elasticsearch/AnalysisIkPlugin.java)
  3. Implement analyzer and tokenizer factories for the target platform (ConfigurationSub, IkAnalyzerProvider, IkTokenizerFactory) (elasticsearch/src/main/java/com/infinilabs/ik/elasticsearch/IkAnalyzerProvider.java)
  4. Add platform-specific assembly and descriptor files in src/main/assemblies and src/main/resources (elasticsearch/src/main/assemblies/plugin.xml)
  5. Update root pom.xml to include the new module in <modules> and configure its dependencies

Tune Segmentation Behavior

  1. Modify disambiguation logic in IKArbitrator.java to change how competing paths are ranked (core/src/main/java/org/wltea/analyzer/core/IKArbitrator.java)
  2. Adjust path costs in LexemePath.java to prioritize certain segmentation patterns (core/src/main/java/org/wltea/analyzer/core/LexemePath.java)
  3. Add new word categories by creating new .dic files and registering them in Dictionary.java (core/src/main/java/org/wltea/analyzer/dic/Dictionary.java)
  4. Run IKAnalyzerTests to validate segmentation quality on your test corpus (core/src/test/java/org/wltea/analyzer/lucene/IKAnalyzerTests.java)

🪤Traps & gotchas

  • Dictionary encoding: all .dic files in config/ must be UTF-8; other encodings silently fail during word matching.
  • Dynamic reload: IKAnalyzer.cfg.xml supports an <entry key="http.request.slow.threshold">...</entry> setting hinting at remote dictionary refresh, but no explicit lock/race-condition handling is visible in Dictionary.java; concurrent tokenization during a reload could cause issues.
  • Version alignment: the plugin JAR must exactly match the ES/OpenSearch version (e.g., 9.1.4 for ES 9.1.4); mismatches cause ClassNotFoundException on Analyzer instantiation.
  • Logging: uses log4j 2.19.0 (provided scope), but a log4j config must exist in the Elasticsearch/OpenSearch runtime or logs silently disappear.
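
The encoding trap above can be sidestepped by forcing UTF-8 explicitly when reading .dic files rather than relying on the JVM default charset. A minimal, self-contained sketch; DicLoader and loadWords are illustrative names, not the actual Dictionary.java API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class DicLoader {
    // Read a .dic word list: one word per line, blanks skipped.
    // Passing StandardCharsets.UTF_8 explicitly avoids silent mojibake when
    // the JVM's default charset is not UTF-8 (the trap described above).
    static List<String> loadWords(Path dic) {
        try {
            return Files.readAllLines(dic, StandardCharsets.UTF_8).stream()
                    .map(String::trim)
                    .filter(w -> !w.isEmpty())
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException("failed to read dictionary " + dic, e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("extra_main", ".dic");
        Files.write(tmp, "中文分词\n\n搜索引擎\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(loadWords(tmp)); // [中文分词, 搜索引擎]
    }
}
```

If a custom word never matches, checking the file's encoding with this kind of explicit-charset read is a faster diagnostic than stepping through the tokenizer.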

🏗️Architecture

💡Concepts to learn

  • Forward-Maximum-Matching (FMM) Algorithm — The core segmentation strategy in CJKSegmenter.java; understanding how it greedily matches the longest dictionary word from left-to-right is key to predicting IK's tokenization behavior
  • Trie (Prefix Tree) Data Structure — DictSegment.java implements a trie for O(m) dictionary lookup (m = word length); used for efficient prefix matching during tokenization
  • Token Conflict Arbitration / Dijkstra Path Selection — IKArbitrator.java uses graph-based path selection when overlapping segments exist (e.g., '中国' vs '中' + '国'); ik_smart picks shortest path, ik_max_word picks longest
  • CJK Character Handling & Surrogate Pairs — SurrogatePairSegmenter.java handles rare characters outside the BMP (Basic Multilingual Plane) encoded as UTF-16 surrogate pairs; necessary for full Unicode support in Chinese text
  • Lucene Analyzer/TokenStream Pipeline — The plugin wraps IK segmentation into Lucene's TokenFilter chain; understanding Analyzer.tokenStream() and Token attributes (offset, type, position) is required to integrate custom filters
  • XML Configuration Management — IKAnalyzer.cfg.xml is parsed by Configuration.java at plugin startup; supports loading external dictionaries via HTTP and controlling analysis behavior—understanding the config schema prevents silent loading failures
  • apache/lucene — Core dependency providing the Analyzer and TokenFilter interfaces that IK plugs into; understanding Lucene's analysis pipeline is essential
  • medcl/elasticsearch-analysis-ik — Original standalone IK plugin for Elasticsearch (predecessor); INFINI's version is a maintained fork with OpenSearch support
  • elastic/elasticsearch — Elasticsearch codebase; required to understand plugin loading, version-specific APIs, and how analyzers integrate into search pipelines
  • opensearch-project/opensearch — OpenSearch codebase; parallel plugin system to Elasticsearch; needed for OpenSearch-specific version builds
  • infinilabs/framework — INFINI Labs' parent framework; likely provides common logging, config, and packaging utilities used across their plugin ecosystem
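
The trie and forward-maximum-matching concepts above condense into a short runnable sketch. This illustrates the general technique only: the real DictSegment and CJKSegmenter differ in structure, and this sketch omits the overlap arbitration and surrogate-pair handling the other bullets describe:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal trie in the spirit of DictSegment: each node maps the next
// character to a child and marks whether a dictionary word ends here.
class Trie {
    final Map<Character, Trie> children = new HashMap<>();
    boolean isWord;
    void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new Trie());
        node.isWord = true;
    }
    // Length of the longest dictionary word starting at text[start], or 0.
    int longestMatch(String text, int start) {
        Trie node = this;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) break;
            if (node.isWord) best = i - start + 1;
        }
        return best;
    }
}

public class FmmSketch {
    // Forward maximum matching: greedily take the longest dictionary word
    // from left to right; fall back to a single character on no match.
    static List<String> segment(String text, Trie dict) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = Math.max(1, dict.longestMatch(text, i));
            out.add(text.substring(i, i + len));
            i += len;
        }
        return out;
    }
    public static void main(String[] args) {
        Trie dict = new Trie();
        for (String w : new String[]{"中华", "中华人民", "共和国", "人民"}) dict.insert(w);
        System.out.println(segment("中华人民共和国", dict)); // [中华人民, 共和国]
    }
}
```

Note how the greedy longest match at position 0 chooses 中华人民 over 中华; ik_max_word would additionally emit the shorter overlapping words, which is where the arbitration logic in IKArbitrator.java comes in.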

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for custom dictionary loading and dynamic updates

The repo supports customized dictionaries (config/extra_main.dic, config/extra_stopword.dic, etc.) but there are minimal tests for dictionary loading scenarios. Currently only IKAnalyzerTests.java and Issue921Test.java exist in core/src/test. A new test suite should verify: (1) dictionary file parsing, (2) hot-reload of custom dictionaries without restart, (3) fallback behavior when dictionary files are missing or malformed. This is critical for users relying on custom word lists.

  • [ ] Create core/src/test/java/org/wltea/analyzer/dic/DictionaryLoadingTests.java
  • [ ] Add tests for loading config/extra_main.dic and config/extra_stopword.dic
  • [ ] Test Dictionary.java reload mechanism with Monitor.java
  • [ ] Add test cases for corrupted/missing dictionary file handling

Add OpenSearch-specific plugin integration tests in elasticsearch/ module

The README claims support for 'major versions of Elasticsearch and OpenSearch' but the elasticsearch/ module (elasticsearch/src/main/java/com/infinilabs/) appears incomplete in the file listing and lacks OpenSearch-specific integration tests. Create a parallel opensearch/ module or add dedicated tests that verify the plugin works correctly with OpenSearch's specific API changes and configuration formats, ensuring parity with Elasticsearch support.

  • [ ] Create elasticsearch/src/test/java/com/infinilabs/analysis/IKAnalyzerPluginTests.java for ES integration
  • [ ] Add test to verify IKAnalyzer and IKTokenizer registration in plugin bootstrap
  • [ ] Test custom dictionary configuration via elasticsearch.yml (if supported)
  • [ ] Document OpenSearch-specific version compatibility matrix in README

Add comprehensive test coverage for CharacterUtil.java and CJKSegmenter.java edge cases

The core segmentation logic in core/src/main/java/org/wltea/analyzer/core/ (CharacterUtil.java, CJKSegmenter.java, LetterSegmenter.java) handles complex CJK character classification and segmentation. Current tests in core/src/test are minimal. Add focused unit tests for: (1) boundary conditions in character type detection, (2) surrogate pair handling in SurrogatePairSegmenter.java, (3) CJK punctuation and special character handling, (4) mixed CJK/Latin/numeric sequences.

  • [ ] Create core/src/test/java/org/wltea/analyzer/core/CharacterUtilTests.java with tests for all CharacterUtil static methods
  • [ ] Create core/src/test/java/org/wltea/analyzer/core/CJKSegmenterTests.java testing CJK-only and mixed-script inputs
  • [ ] Add SurrogatePairSegmenter edge case tests (emoji, rare Unicode characters)
  • [ ] Add parametrized tests for real-world CJK text samples with expected segmentation results

🌿Good first issues

  • Add unit tests for the CN_QuantifierSegmenter.java class to verify correct tokenization of Chinese numeric quantifiers (e.g., '三个人' → ['三', '个', '人']); current test coverage unknown.
  • Document dictionary format specification in README: what fields .dic files support (word, frequency, POS tag), encoding requirements, and how to add custom domains (e.g., medical terms) without modifying core files.
  • Implement thread-safe reloading in Dictionary.java so .dic files can be reloaded dynamically without a restart; there is currently no explicit support for hot-swapping dictionaries, and a reload can race with iteration over DictSegment during concurrent tokenization.
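
The thread-safety issue in the last item is commonly solved with a copy-on-write swap: build the replacement dictionary off to the side, then publish it in a single atomic reference update so in-flight tokenization never observes a half-loaded structure. A hypothetical sketch of that approach, not the current Dictionary.java design:

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class HotSwapDictionary {
    // Readers always see a fully built, immutable snapshot; reload() builds
    // the replacement first and publishes it in one atomic reference swap.
    private final AtomicReference<Set<String>> words;

    HotSwapDictionary(Set<String> initial) {
        this.words = new AtomicReference<>(Set.copyOf(initial));
    }

    boolean contains(String word) {
        return words.get().contains(word);    // lock-free read path
    }

    void reload(Set<String> freshlyLoaded) {
        words.set(Set.copyOf(freshlyLoaded)); // publish a complete snapshot
    }

    public static void main(String[] args) {
        HotSwapDictionary dict = new HotSwapDictionary(Set.of("旧词"));
        dict.reload(Set.of("新词"));
        System.out.println(dict.contains("新词")); // true
    }
}
```

The same idea extends to a trie: build the new DictSegment tree completely, then swap the root reference, so tokenizer threads finish their walk on the old tree while new lookups see the new one.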


📝Recent commits

  • b72b9d4 — FIX: incorrect finalOffset and positionIncrement when stopwords filtered for issue#921 (#1146) (kin122)
  • d03917f — fix: missing long number in smart mode for issue1137 (#1138) (kin122)
  • 9b82025 — Fix typos in some files (#1130) (Edge-Seven)
  • d491580 — Upgrade to Elasticsearch to 9.1.4 (#1126) (johnnyshields)
  • fc1c2e0 — fix: remove some words from dicts (#1128) (kin122)
  • f8451de — Update README.md (#1116) (a180285)
  • 9fe1546 — Fix the problem caused by long repeat words in issue #1119 (#1120) (kin122)
  • f8b9e07 — fix: fix CN_Quantifier problem in ISSUE 1108 and add testcode (#1109) (kin122)
  • c4a00e7 — fix: add entitlement-policy to fix ISSUE#1104 (#1106) (kin122)
  • 8673613 — fix: SurrogatePairSegmenter problem in ISSUES#1100 and add testcode tokenizeCase6_correctly (#1103) (kin122)

🔒Security observations

The codebase has a reasonable security posture but requires dependency updates. The primary concerns are outdated Apache HttpClient (4.5.13) and Log4j (2.19.0) packages, both containing known vulnerabilities. The Lucene version cannot be assessed due to variable reference in the pom.xml. Test dependencies are significantly outdated but have lower risk exposure. No hardcoded secrets, SQL injection risks, or obvious infrastructure misconfigurations were identified in the provided file structure. The project uses Apache 2.0 license and maintains CI/CD workflows which are positive indicators. Immediate action is recommended for HttpClient and Log4j updates.

  • High · Outdated Apache HttpClient Dependency — core/pom.xml - org.apache.httpcomponents:httpclient:4.5.13. The pom.xml specifies httpclient version 4.5.13, which contains known security vulnerabilities. This version was released in 2020 and has multiple CVEs including connection security issues and potential exploitation vectors. Fix: Upgrade to httpclient 4.5.14 or later. Review the Apache HttpComponents security advisories and apply the latest stable patch version available.
  • Medium · Outdated Log4j Dependency — core/pom.xml - org.apache.logging.log4j:log4j-api:2.19.0. The pom.xml specifies log4j-api version 2.19.0. While this version is post-Log4Shell (CVE-2021-44228), it is not the latest stable version. Newer versions contain additional security patches and bug fixes. Fix: Upgrade to log4j-api 2.20.0 or the latest stable version (2.x series) to ensure all security patches are applied.
  • Low · Outdated Hamcrest Test Dependencies — core/pom.xml - org.hamcrest:hamcrest-core:1.3 and hamcrest-library:1.3. The test dependencies use hamcrest-core and hamcrest-library version 1.3, released in 2012. While test dependencies have lower security impact, outdated test frameworks may have unpatched vulnerabilities. Fix: Upgrade to hamcrest 2.2 or later (latest stable version) to modernize the test framework and receive security patches.
  • Medium · Missing Dependency Version Management — core/pom.xml - lucene-core and lucene-analysis-common dependencies. The pom.xml uses ${lucene.version} as a variable reference without explicitly showing the version number in the provided snippet. This makes it difficult to verify if the Lucene version is current and secure. Version drift between lucene-core and lucene-analysis-common could cause compatibility issues. Fix: Explicitly define and document the Lucene version. Ensure it matches the target Elasticsearch/OpenSearch version and contains all security patches. Review release notes for any CVEs.
  • Low · Incomplete POM Configuration — core/pom.xml - end of file. The provided pom.xml snippet appears truncated (ends abruptly at hamcrest-library dependency). This incomplete view may mask additional dependency or security configuration issues. Fix: Review the complete pom.xml file to ensure all dependencies are listed, security plugins are configured, and no sensitive information is exposed in version control.
  • Low · Dictionary Files in Version Control — config/ directory - *.dic files. The repository contains multiple dictionary files (main.dic, stopword.dic, etc.) in the config directory. If these files are dynamically loaded and not properly validated, they could potentially be manipulated or used for injection attacks. Fix: Implement integrity checks (checksums/signatures) for dictionary files. Validate file contents before loading. Consider restricting write permissions to dictionary files at runtime.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.