pemistahl/grex
A command-line tool and Rust library with Python bindings for generating regular expressions from user-provided test cases
Healthy across all four use cases
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit 2mo ago
- ✓ 3 active contributors
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Small team — 3 contributors active in recent commits
- ⚠ Concentrated ownership — top contributor handles 59% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding doc
Onboarding: pemistahl/grex
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/pemistahl/grex shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit 2mo ago
- 3 active contributors
- Apache-2.0 licensed
- CI configured
- Tests present
- ⚠ Small team — 3 contributors active in recent commits
- ⚠ Concentrated ownership — top contributor handles 59% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live pemistahl/grex
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/pemistahl/grex.
What it runs against: a local clone of pemistahl/grex — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in pemistahl/grex | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 99 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of pemistahl/grex. If you don't
# have one yet, run these first:
#
# git clone https://github.com/pemistahl/grex.git
# cd grex
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of pemistahl/grex and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "pemistahl/grex(\.git)?\b" \
  && ok "origin remote is pemistahl/grex" \
  || miss "origin remote is not pemistahl/grex (artifact may be from a fork)"
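# 2. License matches what RepoPilot saw.
# The standard Apache license text opens with "Apache License" (indented), and a
# Rust crate declares its SPDX identifier in Cargo.toml rather than package.json.
(grep -qiE "^[[:space:]]*Apache License" LICENSE 2>/dev/null \
  || grep -qE "^license[[:space:]]*=[[:space:]]*\"Apache-2\.0\"" Cargo.toml 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"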
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
test -f "src/lib.rs" \
  && ok "src/lib.rs" \
  || miss "missing critical file: src/lib.rs"
test -f "src/builder.rs" \
  && ok "src/builder.rs" \
  || miss "missing critical file: src/builder.rs"
test -f "src/dfa.rs" \
  && ok "src/dfa.rs" \
  || miss "missing critical file: src/dfa.rs"
test -f "src/main.rs" \
  && ok "src/main.rs" \
  || miss "missing critical file: src/main.rs"
test -f "src/config.rs" \
  && ok "src/config.rs" \
  || miss "missing critical file: src/config.rs"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 99 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~69d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/pemistahl/grex"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
grex is a Rust-based library and CLI tool that automatically generates regular expressions from user-provided test cases. It uses deterministic finite automata (DFA) algorithms to infer minimal regex patterns that match all positive examples without over-generalizing. It exposes functionality via a Rust library (published to crates.io), Python bindings (via PyO3), and a standalone command-line binary, plus experimental WebAssembly bindings.
Architecturally, it is a monolithic Rust library with optional feature gates. Core regex generation logic lives in src/ (builder.rs, dfa.rs, cluster.rs, component.rs, regexp.rs, expression.rs drive the algorithm). Feature flags control compilation: 'cli' (default, pulls in clap for CLI), 'python' (pulls PyO3 for bindings). Language-specific entry points: src/main.rs for CLI, src/python.rs for Python FFI, src/wasm.rs for WebAssembly. Tests are split into cli_integration_tests.rs, lib_integration_tests.rs, property_tests.rs, and tests/python/.
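To make the "automaton from test cases" idea concrete, here is a minimal sketch of the first step such a tool could take: building a trie-shaped deterministic automaton that accepts exactly the given examples. It is not grex's code. The type `PrefixAutomaton` and its methods are hypothetical names, and the sketch works on `char` values for brevity, whereas the real crate is described as working on grapheme clusters.

```rust
use std::collections::BTreeMap;

/// Hypothetical trie-shaped automaton; state 0 is the start state.
#[derive(Debug)]
pub struct PrefixAutomaton {
    /// transitions[state] maps an input character to the next state.
    transitions: Vec<BTreeMap<char, usize>>,
    /// accepting[state] is true when some test case ends in that state.
    accepting: Vec<bool>,
}

impl PrefixAutomaton {
    /// Builds the automaton that accepts exactly the given test cases.
    pub fn from_test_cases(test_cases: &[&str]) -> Self {
        let mut dfa = PrefixAutomaton {
            transitions: vec![BTreeMap::new()],
            accepting: vec![false],
        };
        for case in test_cases {
            let mut state = 0;
            for ch in case.chars() {
                // Reuse an existing edge so shared prefixes share states.
                let existing = dfa.transitions[state].get(&ch).copied();
                state = match existing {
                    Some(next) => next,
                    None => {
                        let next = dfa.transitions.len();
                        dfa.transitions.push(BTreeMap::new());
                        dfa.accepting.push(false);
                        dfa.transitions[state].insert(ch, next);
                        next
                    }
                };
            }
            dfa.accepting[state] = true;
        }
        dfa
    }

    /// Returns true when `input` is one of the original test cases.
    pub fn accepts(&self, input: &str) -> bool {
        let mut state = 0;
        for ch in input.chars() {
            match self.transitions[state].get(&ch) {
                Some(&next) => state = next,
                None => return false,
            }
        }
        self.accepting[state]
    }

    /// Number of states, including the start state.
    pub fn state_count(&self) -> usize {
        self.transitions.len()
    }
}
```

For the inputs `test1` and `test2`, the shared prefix `test` is stored once, which is the kind of structure that a later minimization and regex-conversion step can exploit.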
👥Who it's for
Developers and data engineers who need to quickly generate regex patterns without manually writing them—particularly those working with text parsing, validation, or pattern extraction pipelines. Users range from Python developers (via PyPI bindings) to Rust library consumers to CLI users on any OS. Maintainers of text processing tools who want to expose regex generation to end users.
🌱Maturity & risk
Production-ready and actively maintained. Currently at v1.4.6 with comprehensive CI/CD (rust-build.yml, python-build.yml, release.yml workflows), significant test coverage (integration tests in tests/, property tests via proptest, benchmarks via criterion), and multi-platform binary distribution (Linux x86_64/ARM64, macOS, Windows). The project demonstrates maturity through versioned releases, Python 3.12+ support, and structured dependency management.
Low maintenance risk but moderate complexity. The codebase has a single primary maintainer (pemistahl) and ~442KB of Rust code, making it a moderate-sized project. Dependencies are well-maintained (itertools, petgraph, ndarray, regex crates). No obvious red flags in the file list; CI is green across platforms. Main risk: the Python bindings (PyO3-based) add platform-specific build complexity—watch for cross-compilation issues on unusual architectures.
Active areas of work
The project is actively developed with support for Python 3.12+ (PyPI badge shows v1.0.2). GitHub Actions CI pipelines (python-build.yml, rust-build.yml, release.yml) are configured for continuous testing and automated releases. The presence of dependabot.yml indicates active dependency maintenance. Recent work appears focused on cross-language support (Python bindings, WASM) and ensuring multi-platform binary availability.
🚀Get running
Clone and build locally:
git clone https://github.com/pemistahl/grex.git
cd grex
cargo build --release
./target/release/grex --help
For Python: pip install grex (from PyPI), or build locally with cargo build --release --features python. For development: cargo test runs the full test suite; cargo bench runs criterion benchmarks.
Daily commands:
- CLI: ./target/release/grex --help (after cargo build --release). Try: echo -e 'test1\ntest2' | grex.
- Library (Rust): add grex = "1.4.6" to Cargo.toml, then import the builder with use grex::RegExpBuilder;.
- Python: python -m pip install grex, then from grex import RegExpBuilder.
- Tests: cargo test (all tests), cargo test --lib (unit only), cargo test --test cli_integration_tests (CLI only).
- Benchmarks: cargo bench --bench benchmark.
🗺️Map of the codebase
- src/lib.rs — Main library entry point exposing the core API that all bindings and CLI tools depend on; defines the Builder trait and public interface.
- src/builder.rs — Core builder pattern implementation that orchestrates DFA construction, optimization, and regex generation from test cases.
- src/dfa.rs — Deterministic Finite Automaton construction and minimization—the algorithmic heart of regex generation.
- src/main.rs — CLI entry point and argument parsing; demonstrates the public API contract for all users.
- src/config.rs — Configuration struct controlling generation options (case-insensitivity, escape behavior, verbose mode); used across all generation paths.
- Cargo.toml — Defines crate version, features (wasm, python), and minimal dependencies—essential for understanding build targets.
- src/python.rs — Python FFI bindings via PyO3; shows how the library is exposed to Python users.
🛠️How to make changes
Add a new character class rule (e.g., new \p{...} property)
- Add character data table to src/unicode_tables/mod.rs or new file (e.g., src/unicode_tables/hex.rs) (src/unicode_tables/mod.rs)
- Import and reference the table in src/cluster.rs where Character::Digit/Space/Word patterns are detected (src/cluster.rs)
- Add matching logic to src/component.rs Component enum to represent the new class (src/component.rs)
- Update src/format.rs to emit the correct regex syntax (e.g., [[:xdigit:]] or [0-9A-Fa-f] for hex digits; note that \h in PCRE means horizontal whitespace, not hex) (src/format.rs)
- Add test cases to tests/lib_integration_tests.rs verifying the new class is generated correctly (tests/lib_integration_tests.rs)
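The first two steps above can be sketched as a sorted range table plus a membership check. This is an illustration under assumptions, not the layout of the real src/unicode_tables files: the constant `HEX_DIGIT` and the function `in_table` are hypothetical names, and the table covers only ASCII hex digits.

```rust
use std::cmp::Ordering;

/// Hypothetical table of inclusive character ranges, sorted in ascending order.
pub const HEX_DIGIT: &[(char, char)] = &[('0', '9'), ('A', 'F'), ('a', 'f')];

/// Returns true when `c` falls inside one of the table's ranges.
/// Relies on the table being sorted and non-overlapping.
pub fn in_table(table: &[(char, char)], c: char) -> bool {
    table
        .binary_search_by(|&(start, end)| {
            if c < start {
                // The probed range lies above the character.
                Ordering::Greater
            } else if c > end {
                // The probed range lies below the character.
                Ordering::Less
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}
```

A detection step could then call `in_table(HEX_DIGIT, ch)` for every character of a test case and switch to the class only when all of them match.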
Add a new configuration option to control regex generation behavior
- Add field to Config struct in src/config.rs (e.g., max_repetition_count: Option<usize>) (src/config.rs)
- Pass Config reference through Builder::build() and use in src/builder.rs pipeline (src/builder.rs)
- Update src/main.rs CLI argument parsing to accept the new flag (via clap derive) (src/main.rs)
- Expose Config setter in src/lib.rs public API if it's a builder method (src/lib.rs)
- Update src/python.rs to wrap the new Config field for Python callers (src/python.rs)
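The shape of the first, second, and fourth steps can be sketched as follows. Everything here is hypothetical: the `Config` fields, the `Builder` type, and the `with_max_repetition_count` setter are illustrative stand-ins and are not taken from the crate's source. The field `max_repetition_count` is the example option named in the steps above.

```rust
/// Hypothetical configuration struct; field names are illustrative only.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Config {
    pub is_case_insensitive: bool,
    pub is_verbose: bool,
    /// The new option: `None` means "no limit".
    pub max_repetition_count: Option<usize>,
}

/// Hypothetical builder that owns the test cases and the configuration.
#[derive(Debug)]
pub struct Builder {
    test_cases: Vec<String>,
    config: Config,
}

impl Builder {
    /// Starts from the default configuration.
    pub fn new(test_cases: &[&str]) -> Self {
        Builder {
            test_cases: test_cases.iter().map(|s| s.to_string()).collect(),
            config: Config::default(),
        }
    }

    /// Consuming setter, so calls can be chained.
    pub fn with_max_repetition_count(mut self, max: usize) -> Self {
        self.config.max_repetition_count = Some(max);
        self
    }

    /// Read access for the generation pipeline and for tests.
    pub fn config(&self) -> &Config {
        &self.config
    }

    /// Number of test cases the builder holds.
    pub fn test_case_count(&self) -> usize {
        self.test_cases.len()
    }
}
```

Deriving `Default` keeps existing callers unaffected: a new `Option` field defaults to `None`, so behavior only changes when the setter is called.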
Optimize the DFA minimization algorithm for very large automata
- Review current minimize() and hopcroft() implementations in src/dfa.rs (src/dfa.rs)
- Profile with benches/benchmark.rs to identify bottlenecks on large test case sets (benches/benchmark.rs)
- Implement improved algorithm (e.g., faster state equivalence checking) in src/dfa.rs (src/dfa.rs)
- Add new benchmark case to benches/benchmark.rs to validate improvement (benches/benchmark.rs)
- Add regression test to tests/property_tests.rs ensuring correctness is maintained (tests/property_tests.rs)
🔧Why these technologies
- Rust + deterministic finite automaton (DFA) — Provides memory-safe, fast regex generation from test cases via state-machine formalism; no external regex engine dependency means smaller binary and full control over output.
- PyO3 for Python bindings — Enables Rust performance in Python without rewriting; zero-copy interop for string/list handling.
- wasm-bindgen for browser & Node.js — Allows the same Rust algorithm to run client-side without a server; enables the demo website and offline usage.
- Hopcroft's algorithm for DFA minimization — Near-linear complexity minimization ensures O(n log n) performance even on large test case sets, critical for real-time CLI responsiveness.
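As a reading aid for the minimization bullet, the sketch below computes the equivalence classes of DFA states by partition refinement. It deliberately uses Moore's simpler refinement loop rather than Hopcroft's algorithm: both arrive at the same partition, but Moore's version is easier to follow and has a worse worst-case running time. The function name `moore_partition` and the table layout (`delta[state][symbol]`, complete transition function) are assumptions for illustration, not the crate's data structures.

```rust
use std::collections::HashMap;

/// Returns, for every state, the index of its equivalence class.
/// `delta[state][symbol]` is the successor state; the table must be complete
/// and `accepting` must have one entry per state.
pub fn moore_partition(delta: &[Vec<usize>], accepting: &[bool]) -> Vec<usize> {
    // Initial partition: accepting states versus non-accepting states.
    let mut class: Vec<usize> = accepting.iter().map(|&a| a as usize).collect();
    let mut class_count = 0;
    loop {
        // A state's signature is its own class plus the classes of its successors.
        let mut ids: HashMap<(usize, Vec<usize>), usize> = HashMap::new();
        let mut next: Vec<usize> = Vec::with_capacity(class.len());
        for (state, row) in delta.iter().enumerate() {
            let successors: Vec<usize> = row.iter().map(|&target| class[target]).collect();
            let fresh = ids.len();
            let id = *ids.entry((class[state], successors)).or_insert(fresh);
            next.push(id);
        }
        // Each round can only split classes, so an unchanged count is a fixed point.
        if ids.len() == class_count {
            return next;
        }
        class_count = ids.len();
        class = next;
    }
}
```

States that end up with the same class index can be merged into one state of the minimal automaton.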
⚖️Trade-offs already made
- Generate regex from test cases rather than learn from grammar or syntax
  - Why: Simpler UX (just provide examples), no need to train on corpora; directly maps user intent to regex.
  - Consequence: Generated regex may be narrower or wider than ideal; not suitable for learning all patterns (e.g., 'test1' and 'test2' may not cover edge cases the user actually needs).
- Build DFA incrementally from all test cases at once, then minimize globally
  - Why: Produces smaller, more optimal final regex than incremental union of per-case automata.
  - Consequence: Higher initial memory/CPU cost; cannot stream or interactively refine results if a test case contradicts earlier ones.
- No integration with external regex engines; generate output only
  - Why: Minimizes binary size, avoids dependency bloat, maintains pure algorithmic focus.
  - Consequence: Generated regex is not validated against the regex engine until user runs it; no interactive back-pressure loop to refine bad patterns.
- Support multiple output targets (CLI, library, Python, WASM) from single Rust codebase
  - Why: Maximizes code reuse and ensures consistency across platforms.
  - Consequence: Build complexity (features, FFI, tooling); Python/WASM versions lag native in release cycles if build process fails.
🚫Non-goals (don't propose these)
- Not a regex validator or executor; does not match strings against generated patterns.
- Not a regex tutorial or teaching tool; outputs only the pattern, not explanation of why it was chosen.
- Not real-time collaborative; no multi-user session or persisted regex library.
- Not a full formal language learner; cannot infer grammars or context-free structures from examples.
- Not a cross-platform GUI; only CLI and programmatic APIs.
🪤Traps & gotchas
- PyO3 build complexity: Python bindings require matching Python.h headers and Rust toolchain compatibility; cross-compilation (e.g., Apple Silicon or musl-libc) can fail silently.
- Feature gate interactions: Building without '--features cli' disables clap, so src/main.rs won't compile; similarly, '--features python' is required for src/python.rs.
- Unicode tables are precomputed: src/unicode_tables/ files are static data generated offline (not code-generated at build time); if you need to update Unicode support, regeneration logic is not in the repo.
- Benchmark setup: benches/testcases.txt must exist for benchmarks to run; it's checked in but easy to accidentally delete.
- Wasm build isolation: src/wasm.rs uses #[cfg(target_family = "wasm")], so the full library won't link to wasm-bindgen unless targeting WebAssembly; this is intentional but can surprise developers testing locally.
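The feature-gate trap is easier to reason about with a small example of how conditional compilation behaves. The sketch below is generic Rust, not code from the repository: the function names `entry_point` and `python_bindings_enabled` are hypothetical, and only the feature names 'cli' and 'python' come from the text above.

```rust
/// Compiled only when the crate is built with the `cli` feature enabled.
#[cfg(feature = "cli")]
pub fn entry_point() -> &'static str {
    "cli enabled"
}

/// Fallback that exists only when the `cli` feature is disabled.
/// Without it, any caller of `entry_point` would fail to compile.
#[cfg(not(feature = "cli"))]
pub fn entry_point() -> &'static str {
    "cli disabled"
}

/// `cfg!` evaluates to a compile-time boolean instead of removing code,
/// so both branches of an `if` must still type-check.
pub fn python_bindings_enabled() -> bool {
    cfg!(feature = "python")
}
```

The difference matters when debugging: an item behind `#[cfg(...)]` does not exist at all in a build without the feature, so the compiler reports it as unresolved rather than as disabled.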
💡Concepts to learn
- Deterministic Finite Automaton (DFA) — The mathematical foundation of grex: it builds a DFA from test cases (in src/dfa.rs) and then converts the DFA into a minimal regex pattern. Understanding DFA state transitions and minimization is essential to modifying the core algorithm.
- Regex Synthesis / Inductive Synthesis — Grex solves the inverse problem: given examples, generate the pattern. This is fundamentally different from regex matching and requires searching the space of valid regexes; critical for understanding why grex makes design choices around character classes and quantifiers.
- Character Class Clustering — Implemented in src/cluster.rs and src/char_range.rs; grex minimizes regex size by grouping characters into efficient ranges (e.g., [a-z0-9] instead of [abcdefg...xyz0123456789]). Requires understanding Unicode code point ordering and range optimization.
- Quantifier Optimization — Module src/quantifier.rs converts repetition patterns into compact quantifier syntax (*, +, {n,m}, etc.). Non-obvious: grex must decide when a{2,3} is better than aa|aaa, requiring heuristics about pattern readability vs. size.
- Unicode Grapheme Clusters — Handled via unicode-segmentation dependency and src/grapheme.rs; critically, a 'character' in grex is often a grapheme (e.g., emoji with combining marks), not a single code point. Misunderstanding this breaks pattern matching for non-ASCII text.
- FFI (Foreign Function Interface) via PyO3 — src/python.rs uses PyO3 to expose Rust code to Python without writing separate Python logic. Critical for maintenance: a single Rust code path serves both Python and Rust consumers, reducing bugs and duplication.
- Feature Flags in Cargo — Grex uses Cargo feature gates ('cli', 'python', 'default') to conditionally compile optional dependencies. Understanding this is essential: attempting to use the Python API without --features python, or the CLI without --features cli, will fail at compile time.
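The character class clustering concept above can be illustrated with a short sketch that merges code points into contiguous ranges and prints them as a bracket expression. This is an assumption-laden illustration, not the logic in src/cluster.rs or src/char_range.rs: the names `to_ranges` and `to_class` are hypothetical, and the output is not escaped, so the sketch is only meaningful for characters such as letters and digits that carry no special meaning inside brackets.

```rust
/// Merges characters into sorted, inclusive ranges of adjacent code points.
pub fn to_ranges(chars: &[char]) -> Vec<(char, char)> {
    let mut sorted: Vec<char> = chars.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut ranges: Vec<(char, char)> = Vec::new();
    for c in sorted {
        if let Some(last) = ranges.last_mut() {
            // Extend the current range when `c` directly follows its end.
            if (last.1 as u32) + 1 == (c as u32) {
                last.1 = c;
                continue;
            }
        }
        ranges.push((c, c));
    }
    ranges
}

/// Renders the ranges as a bracket expression such as `[0-2a-cx]`.
/// No escaping is performed (assumption: input has no regex metacharacters).
pub fn to_class(chars: &[char]) -> String {
    let mut out = String::from("[");
    for (start, end) in to_ranges(chars) {
        out.push(start);
        if end != start {
            out.push('-');
            out.push(end);
        }
    }
    out.push(']');
    out
}
```

Sorting by code point first is what makes a single left-to-right pass sufficient; duplicates are dropped so they cannot break a run.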
🔗Related repos
- BurntSushi/regex — The canonical Rust regex implementation (grex's dependency); understanding its DFA/NFA engine helps comprehend grex's synthesis approach
- google/re2 — Leading alternative regex engine with different DFA/NFA tradeoffs; re2 inspired thinking around efficient finite automata that grex leverages
- pemistahl/grex-js — Official JavaScript/WebAssembly port of grex (grex-js); used for the demo website—shows multi-language commitment and WASM strategy
- pemistahl/languagetool-core — Sibling project by the same author; language/text processing expertise applies to grex's Unicode and grapheme handling
- quickwit-oss/tantivy — Full-text search engine in Rust using regex for tokenization; a major downstream consumer of regex synthesis logic similar to grex
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive Unicode property tests for src/unicode_tables/
The repo has Unicode support modules (decimal.rs, space.rs, word.rs) but tests/property_tests.rs likely lacks exhaustive coverage of edge cases for Unicode character classification. This is critical for a regex generation tool that handles international text, ensuring correctness across all Unicode categories.
- [ ] Review existing coverage in tests/property_tests.rs for Unicode table functions
- [ ] Add property-based tests using proptest (already a dev dependency) for src/unicode_tables/decimal.rs, src/unicode_tables/space.rs, and src/unicode_tables/word.rs
- [ ] Test boundary conditions: characters at category transitions, surrogate pairs, combining marks
- [ ] Verify integration with src/grapheme.rs for proper grapheme cluster handling in Unicode contexts
- [ ] Run with cargo test --test property_tests and ensure codecov improves
Add WebAssembly-specific integration tests in tests/wasm_browser_tests.rs
The repo has wasm_browser_tests.rs and wasm_node_tests.rs but they appear minimal. With wasm-bindgen already configured (src/wasm.rs exists), the WASM target needs proper end-to-end testing of regex generation through the JS API, especially for edge cases that differ between Node.js and browser environments.
- [ ] Expand tests/wasm_browser_tests.rs with real browser test cases (Chromium, Firefox) using wasm-bindgen-test
- [ ] Add tests for src/wasm.rs public API: regex generation with various builder options, special character handling
- [ ] Test WASM module initialization, memory management, and cleanup in browser context
- [ ] Verify tests run in CI via .github/workflows (add dedicated WASM browser testing workflow if missing)
- [ ] Document results in README under WebAssembly section
Add performance regression tests with src/builder.rs optimization benchmarks
The repo has benches/benchmark.rs but only one benchmark named 'benchmark'. The builder pattern in src/builder.rs is central to performance, yet lacks targeted micro-benchmarks for detecting regressions in critical paths like DFA construction (src/dfa.rs) and expression clustering (src/cluster.rs).
- [ ] Add new benchmark targets in benches/benchmark.rs using criterion (already configured)
- [ ] Create benchmarks for: RegExpBuilder::from() construction and build() performance with varying input sizes, DFA state minimization in src/dfa.rs
- [ ] Add benchmarks for different input patterns: simple strings, complex Unicode, large character ranges (testing src/char_range.rs)
- [ ] Set baseline thresholds to catch 10%+ performance regressions in CI
- [ ] Run with cargo bench and integrate results into .github/workflows/rust-build.yml
🌿Good first issues
- Add property-based tests for the quantifier module (src/quantifier.rs). Currently tests/property_tests.rs covers DFA and cluster logic but quantifier edge cases (e.g., converting greedy {n,m} to lazy {n,m}? syntax) lack explicit property tests against real regex engines.
- Expand integration test coverage for Python bindings. tests/python/test_grex.py exists but is minimal; add tests for error handling, builder options (e.g., RegExpBuilder.with_case_insensitive_matching()), and edge cases like empty input or very large test case sets.
- Document the cluster.rs algorithm in code comments. The character range clustering logic is complex (handling overlaps, minimizing ranges) but lacks inline documentation; add doc comments explaining the algorithm and key data structures for contributors.
- Add benchmarks for the DFA construction phase specifically. Currently benches/benchmark.rs tests end-to-end performance; break it down to isolate DFA build time, component assembly time, and output formatting to identify bottlenecks.
- Create example programs in a new examples/ directory. Add example1.rs (basic usage), example2.rs (with config options), example3.rs (Python calling Rust). Currently only README.md documents API; runnable examples reduce friction for new users.
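For the quantifier item in the list above, a round-trip property is a natural starting point: compressing repeated characters into `{n}` notation and expanding them again should return the original string. The sketch below shows that property on a toy run-length scheme. It is not the logic of src/quantifier.rs, the names `compress_runs` and `expand_runs` are hypothetical, and it assumes inputs without braces or other metacharacters.

```rust
/// Collapses each run of a repeated character into `c{n}` notation.
pub fn compress_runs(input: &str) -> String {
    let mut out = String::new();
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        let mut count: usize = 1;
        while chars.peek() == Some(&c) {
            chars.next();
            count += 1;
        }
        out.push(c);
        if count > 1 {
            out.push_str(&format!("{{{}}}", count));
        }
    }
    out
}

/// Inverse of `compress_runs` for patterns that it produced.
pub fn expand_runs(pattern: &str) -> String {
    let mut out = String::new();
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        let mut count: usize = 1;
        if chars.peek() == Some(&'{') {
            chars.next(); // consume '{'
            // `take_while` also consumes the closing '}'.
            let digits: String = chars.by_ref().take_while(|d| *d != '}').collect();
            count = digits.parse().unwrap_or(1);
        }
        for _ in 0..count {
            out.push(c);
        }
    }
    out
}
```

A property test would generate many random inputs and assert `expand_runs(&compress_runs(s)) == s`; with proptest, which the list above names as the existing test dependency, the same assertion would sit inside a `proptest!` block.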
⭐Top contributors
- @dependabot[bot] — 59 commits
- @pemistahl — 40 commits
- @jqnatividad — 1 commit
📝Recent commits
- 99cc347 — Remove deprecated Command::cargo_bin (#349) (pemistahl)
- fa3e8ed — Bump clap from 4.5.51 to 4.5.53 (#342) (dependabot[bot])
- 2f4f6aa — Bump actions/checkout from 5 to 6 (#341) (dependabot[bot])
- db9275a — Enable pyo3 feature generate-import-lib (pemistahl)
- 5f80377 — Try to fix Rust targets on Windows (pemistahl)
- 4d313a7 — Update release notes (pemistahl)
- a9915b3 — Replace unmaintained rust unic-* crates (#337) (jqnatividad)
- 78e2c83 — Drop support for Python < 3.12 (pemistahl)
- 7bc12e1 — Update READMEs (pemistahl)
- ae47226 — Update pytest (pemistahl)
🔒Security observations
The grex codebase demonstrates generally good security practices with proper license headers, dependency management, and no obvious injection vulnerabilities or hardcoded secrets. However, there are moderate concerns around flexible dependency versioning, Python FFI security boundaries, ReDoS potential in generated regex patterns, and WASM execution resource limits. The primary attack surface is through malicious input to the regex generation engine. No critical vulnerabilities were identified, but the recommendations should be addressed to improve the overall security posture, particularly for production deployments handling untrusted input.
- Medium · Caret version requirements without lock-file enforcement — Cargo.toml - [dependencies] section. Dependencies are declared with bare versions (e.g., itertools = "0.14.0", ndarray = "0.17.1"). Cargo treats a bare version as a caret requirement, so "0.14.0" is identical to "^0.14.0" and already means >=0.14.0, <0.15.0; it is not a wildcard. It does, however, allow automatic patch-level updates that may introduce unexpected behavioral changes or security issues, which increases exposure to supply chain attacks. Fix: Pin exact versions with "=0.14.0" where that matters. Regularly audit and test dependency updates. Enforce the lock file (Cargo.lock) in CI/CD pipelines, for example with cargo build --locked.
- Low · Optional feature without security context — Cargo.toml - [features] section and src/python.rs. The 'python' feature enables pyo3 binding without clear security boundaries or sandboxing mechanisms documented. Python FFI can introduce security risks if not properly validated, especially when processing untrusted regex input. Fix: Add documentation regarding security considerations for Python bindings. Implement strict input validation for all data crossing the FFI boundary. Consider adding security-focused tests for the Python interface.
- Low · Regex library dependency with complex parsing — Cargo.toml - regex = "1.12.2" and src/regexp.rs, src/dfa.rs. The crate uses the 'regex' crate (1.12.2) as a dependency. While the regex crate itself is well-maintained, grex generates regex patterns which are then used with the regex library. ReDoS (Regular Expression Denial of Service) vulnerabilities could occur if grex generates inefficient patterns from malicious input. Fix: Implement fuzzing tests specifically targeting ReDoS vulnerability patterns. Add complexity metrics to generated regex patterns and reject patterns exceeding reasonable complexity thresholds. Document known limitations around regex generation safety.
- Low · WASM target with limited security considerations — src/wasm.rs and Cargo.toml - [target.'cfg(target_family = "wasm")'] sections. The codebase includes WASM support (src/wasm.rs) which exposes the regex generation logic to browser environments. Browser-based execution of regex generation could be exploited if input validation is insufficient, potentially causing browser hang or crash. Fix: Implement resource limits and timeouts in WASM functions. Add rate limiting for WASM calls. Validate all input size constraints. Consider adding usage documentation warning about resource consumption in browser contexts.
LLM-derived; treat as a starting point, not a security audit.
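Several of the fixes above come down to bounding untrusted input before it reaches the generator. A minimal sketch of such a guard follows. It is a suggestion under assumptions, not existing grex behavior: the limits `MAX_TEST_CASES` and `MAX_CASE_CHARS`, the `InputError` type, and `validate_test_cases` are all hypothetical, and the numeric limits are placeholders to be tuned.

```rust
/// Hypothetical upper bound on the number of test cases.
pub const MAX_TEST_CASES: usize = 10_000;
/// Hypothetical upper bound on characters per test case.
pub const MAX_CASE_CHARS: usize = 1_000;

/// Reasons for rejecting input before any automaton is built.
#[derive(Debug, PartialEq)]
pub enum InputError {
    Empty,
    TooManyCases { count: usize },
    CaseTooLong { index: usize, length: usize },
}

/// Checks size limits only; it does not inspect the content of the cases.
pub fn validate_test_cases(cases: &[&str]) -> Result<(), InputError> {
    if cases.is_empty() {
        return Err(InputError::Empty);
    }
    if cases.len() > MAX_TEST_CASES {
        return Err(InputError::TooManyCases { count: cases.len() });
    }
    for (index, case) in cases.iter().enumerate() {
        // Count characters, not bytes, so multi-byte text is not penalized.
        let length = case.chars().count();
        if length > MAX_CASE_CHARS {
            return Err(InputError::CaseTooLong { index, length });
        }
    }
    Ok(())
}
```

Calling a check like this at the Python and WASM boundaries would give both bindings the same limits, since they share one Rust code path.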
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.