RepoPilot

rust-lang/regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

Healthy

Healthy across the board

Weakest axis: Use as dependency — Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 2mo ago
  • 26+ active contributors
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • Concentrated ownership — top contributor handles 67% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — it updates automatically from the latest cached analysis.

Variant: RepoPilot: Healthy

```markdown
[![RepoPilot: Healthy](https://repopilot.app/api/badge/rust-lang/regex)](https://repopilot.app/r/rust-lang/regex)
```

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/rust-lang/regex on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: rust-lang/regex

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in the "Verify before trusting" section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/rust-lang/regex shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 2mo ago
  • 26+ active contributors
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 67% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live rust-lang/regex repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/rust-lang/regex.

What it runs against: a local clone of rust-lang/regex — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in rust-lang/regex | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 102 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>rust-lang/regex</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of rust-lang/regex. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/rust-lang/regex.git
#   cd regex
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of rust-lang/regex and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "rust-lang/regex(\.git)?$" \
  && ok "origin remote is rust-lang/regex" \
  || miss "origin remote is not rust-lang/regex (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. The repo dual-licenses via
#    LICENSE-MIT and LICENSE-APACHE files; Cargo.toml records the SPDX string.
(test -f LICENSE-APACHE \
   || grep -qE 'license\s*=\s*".*Apache-2\.0' Cargo.toml 2>/dev/null) \
  && ok "license includes Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in Cargo.toml regex-syntax/src/lib.rs regex-automata/src/lib.rs \
         src/lib.rs regex-lite/src/lib.rs; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 102 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~72d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/rust-lang/regex"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

A high-performance regex engine for Rust that uses finite automata to guarantee worst-case O(m · n) matching time on all inputs (linear in the haystack for a fixed pattern), avoiding catastrophic backtracking. It provides the standard Regex API with support for Unicode, named capture groups, and extensive pattern syntax while prioritizing predictable performance over feature completeness. Workspace monorepo: regex/ is the main crate with src/ containing the high-level API; regex-automata/ provides lower-level DFA engines (dense/sparse); regex-syntax/ handles AST parsing; regex-lite/ offers a minimal version; regex-cli/ and regex-capi/ provide CLI and C FFI bindings. Tests live in tests/, benchmarks in bench/, and fuzz targets in fuzz/fuzz_targets/.

👥Who it's for

Rust developers building systems that need reliable, fast text pattern matching—especially those in security-sensitive, performance-critical, or real-time contexts who cannot tolerate worst-case exponential regex behavior. Contributors are typically compiler and systems engineers interested in automata theory and Rust performance.

🌱Maturity & risk

Highly mature and production-ready. Version 1.12.3 with MIT/Apache-2.0 dual licensing, a comprehensive fuzzing harness (fuzz/fuzz_targets/ contains 8+ OSS-Fuzz targets catching regressions), active CI/CD (.github/workflows/ci.yml), and a multi-crate workspace (regex-automata, regex-syntax, regex-cli) all indicate ongoing professional maintenance. Actively developed with clear design documentation (UNICODE.md, bench/README.md).

Low risk for core functionality, higher risk if you need unsupported features. The crate deliberately omits lookahead/lookbehind and backreferences, so unsupported patterns are rejected with an error when the pattern is compiled (they never silently misbehave at match time). The bus-factor concern (one dominant maintainer) is mitigated by rust-lang organization stewardship and extensive regression tests. No known critical vulnerabilities, but regex compilation and DFA deserialization are sensitive attack surfaces (evidenced by the fuzz_regex_automata_deserialize_* targets).

Active areas of work

The repo is actively maintained with recent workspace organization (regex-automata separated), feature stabilization (Cargo.toml shows conditional dependencies like aho-corasick?, memchr?), and continuous fuzzing integration. Changelog and UNICODE.md suggest ongoing work on Unicode property support and deserialization safety.

🚀Get running

```bash
git clone https://github.com/rust-lang/regex
cd regex
cargo build
cargo test
```

Daily commands:

```bash
cargo test --all
cargo bench --all
cargo run -p regex-cli -- --help   # regex-cli is a workspace member; see its subcommands
```

🗺️Map of the codebase

  • Cargo.toml — Workspace root defining all member crates (regex-automata, regex-syntax, regex-lite, regex-cli, etc.) and shared dependencies that orchestrate the entire regex implementation ecosystem.
  • regex-syntax/src/lib.rs — Core AST parser and regex syntax validation; every regex string flows through this layer before compilation, making it foundational to understanding how patterns are interpreted.
  • regex-automata/src/lib.rs — Low-level finite automata (DFA/NFA) engine that provides the linear-time matching guarantee; implements both dense and sparse DFA serialization and is the performance-critical core.
  • src/lib.rs — Main regex crate API surface; high-level Regex type and Match iterators that users interact with; bridges user code to automata and syntax layers.
  • regex-lite/src/lib.rs — Lightweight regex implementation for constrained environments; alternative entry point showing how the codebase can be sliced for different use cases.
  • fuzz/fuzz_targets — Regression tests and fuzzing harnesses that validate correctness of parsing, matching, serialization, and deserialization across all major code paths.
  • UNICODE.md — Documents Unicode handling strategy and design decisions; essential reading for understanding how the engine interprets character classes and properties.

🛠️How to make changes

Add a new regex matching method to the public API

  1. Define the new method signature in the Regex struct implementation block (src/lib.rs)
  2. Implement the method by composing existing automata search methods from regex-automata crate (src/lib.rs)
  3. Add integration tests in the same file or dedicated test module (src/lib.rs)
  4. Add fuzzing target to exercise the new code path with random inputs (fuzz/fuzz_targets/fuzz_regex_match.rs)

Add support for a new character class or Unicode property

  1. Document the property semantics and edge cases in UNICODE.md (UNICODE.md)
  2. Update the lexer to recognize the new syntax (e.g., \p{New_Property}) (regex-syntax/src/parser.rs)
  3. Add AST node type for the new property if not already generic (regex-syntax/src/ast.rs)
  4. Implement Unicode property lookup in the NFA compiler to generate correct state transitions (regex-automata/src/nfa/thompson/compiler.rs)
  5. Add regression tests to fuzz_targets to catch parsing and matching regressions (fuzz/fuzz_targets/ast_fuzz_regex.rs)

Optimize DFA construction or matching performance

  1. Profile the slow path using benchmarks in the bench directory (bench/README.md)
  2. Identify the bottleneck: parser, NFA compiler, or DFA minimization in the automata layer (regex-automata/src/lib.rs)
  3. Implement optimization (e.g., state merging, early termination, cache-friendly layout) (regex-automata/src/dfa/dense.rs)
  4. Run full test suite and fuzz targets to validate correctness (fuzz/fuzz_targets/fuzz_regex_match.rs)
  5. Update CHANGELOG.md with performance notes and benchmark results (CHANGELOG.md)

🔧Why these technologies

  • Finite Automata (DFA/NFA) — Guarantees worst-case O(m · n) matching time on all inputs (linear in the haystack for a fixed pattern), avoiding the catastrophic backtracking present in backtracking regex engines; a proven theoretical foundation for worst-case performance.
  • Workspace with multiple member crates — Separates concerns (syntax parsing, automata, public API, CLI) into independently versioned and testable units; enables regex-lite subset for embedded/constrained environments.
  • OSS-Fuzz integration — Continuous fuzzing catches correctness regressions in parsing, matching, and serialization without manual test case creation; critical for a parsing/matching library.
  • Zero-copy DFA serialization — Compiled automata can be cached or embedded at build-time; users pay parsing/compilation cost only once, not per regex creation.
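The first bullet's linear-time claim can be made concrete with a toy dense DFA written against the standard library only. This is a from-scratch illustration (the pattern, states, and transition table are invented here), not code from regex-automata:

```rust
// Toy dense DFA over the alphabet {a, b} that recognizes strings ending
// in "ab". A dense encoding stores one row per state and one column per
// input symbol, so every input byte costs exactly one table lookup.

const STATES: usize = 3; // 0 = start, 1 = seen 'a', 2 = seen "ab" (accepting)

const TABLE: [[usize; 2]; STATES] = [
    [1, 0], // state 0: 'a' -> 1, 'b' -> 0
    [1, 2], // state 1: 'a' -> 1, 'b' -> 2
    [1, 0], // state 2: 'a' -> 1, 'b' -> 0
];

fn matches(input: &str) -> bool {
    let mut state = 0;
    for byte in input.bytes() {
        let col = match byte {
            b'a' => 0,
            b'b' => 1,
            _ => return false, // outside the toy alphabet: reject
        };
        state = TABLE[state][col]; // one lookup per byte, no backtracking
    }
    state == 2
}

fn main() {
    assert!(matches("ab"));
    assert!(matches("aaab"));
    assert!(!matches("ba"));
    assert!(!matches("abba"));
    // Even a huge input is a single linear pass.
    let long = "a".repeat(1_000_000) + "b";
    assert!(matches(&long));
    println!("toy DFA checks passed");
}
```

Because the scan never revisits input, match time grows linearly with the haystack no matter how pathological the input looks; the crate generalizes this property to arbitrary patterns.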

⚖️Trade-offs already made

  • No lookaround or backreferences support

    • Why: These features are not known to be implementable efficiently in finite automata without exponential worst-case blowup.
    • Consequence: Regex patterns are less expressive than PCRE or .NET regex, but matching is provably fast; users must refactor complex patterns or use alternative engines.
  • Byte-oriented DFA matching rather than Unicode-aware matching at automata level

    • Why: Maximizes performance on raw bytes; Unicode character properties are handled at parse/compilation stage, not runtime.
    • Consequence: Efficient byte-level scanning, but developers must understand UTF-8 encoding and byte boundaries; Unicode-aware features (such as large \p{...} classes) add cost at pattern-compile time.
  • Powerset construction for NFA-to-DFA conversion, with optional state minimization

    • Why: Subset (powerset) construction yields a DFA with fast table-driven transitions, and a separate minimization pass can then reduce the state count, cutting memory and matching time.
    • Consequence: Compilation can be slow and memory-hungry for very large patterns; not suitable for dynamic, untrusted pattern input without resource limits.
  • Lazy DFA and hybrid NFA/DFA matching available as alternatives

    • Why: Not all use cases need full DFA; lazy DFA trades memory for slower matching; NFA matching is faster for first-match with early exit.
    • Consequence: Multiple code paths increase complexity; users must choose appropriate matching strategy for their use case.
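The powerset-construction trade-off above can be seen in miniature with a self-contained sketch (standard library only; the NFA, pattern, and state numbering here are invented for illustration and are not the crate's internals):

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

// Toy subset (powerset) construction for an NFA matching a(a|b)*b over
// the alphabet {a, b}: state 0 --a--> 1; state 1 --a--> 1, --b--> 1 and
// --b--> 2 (the nondeterministic choice); state 2 is accepting.

type NfaState = usize;
type DfaState = BTreeSet<NfaState>; // a DFA state is a *set* of NFA states

fn nfa_step(state: NfaState, sym: u8) -> Vec<NfaState> {
    match (state, sym) {
        (0, b'a') => vec![1],
        (1, b'a') => vec![1],
        (1, b'b') => vec![1, 2], // nondeterminism: both successors
        _ => vec![],
    }
}

fn subset_construction() -> (HashMap<(DfaState, u8), DfaState>, DfaState) {
    let start: DfaState = BTreeSet::from([0]);
    let mut transitions = HashMap::new();
    let mut seen: BTreeSet<DfaState> = BTreeSet::from([start.clone()]);
    let mut queue = VecDeque::from([start.clone()]);
    while let Some(set) = queue.pop_front() {
        for sym in [b'a', b'b'] {
            // The successor DFA state is the union of all NFA successors.
            let next: DfaState =
                set.iter().flat_map(|&s| nfa_step(s, sym)).collect();
            if seen.insert(next.clone()) {
                queue.push_back(next.clone());
            }
            transitions.insert((set.clone(), sym), next);
        }
    }
    (transitions, start)
}

fn dfa_matches(input: &str) -> bool {
    let (transitions, start) = subset_construction();
    let mut state = start;
    for sym in input.bytes() {
        state = transitions[&(state, sym)].clone(); // panics outside {a, b}
    }
    state.contains(&2) // accepting iff NFA state 2 is in the set
}

fn main() {
    assert!(dfa_matches("ab"));
    assert!(dfa_matches("aabab"));
    assert!(!dfa_matches("a"));
    assert!(!dfa_matches("ba"));
    println!("subset-construction demo passed");
}
```

Each DFA state is a set of NFA states; on large patterns these sets can multiply quickly, which is exactly why compilation "can be slow and memory-hungry" and why the crate imposes configurable size limits.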

🚫Non-goals (don't propose these)

  • Does not implement lookahead or lookbehind assertions
  • Does not support backreferences or capture group recursion
  • Not a real-time regex engine (compilation phase can block)
  • Does not provide PCRE-compatible API or feature parity
  • Not designed for arbitrary Unicode grapheme cluster handling (UTF-8 byte-based)

🪤Traps & gotchas

  1. No backreferences or lookaround: Patterns using (?=...), (?<=...), or \1 fail at pattern-compile time with an error; they never match incorrectly.
  2. DFA size limits: Very complex patterns can hit dfa_size_limit (default 10MB); use builder().dfa_size_limit() to adjust.
  3. Fuzzing regressions: Serialized DFA formats in fuzz/regressions/ are regression test inputs; breaking the DFA serialization format requires careful migration.
  4. Unicode normalization: \p{...} patterns match the input as-is; there is no implicit NFC/NFD normalization.
  5. Thread safety: Regex is Send + Sync, and cloning is cheap because the compiled program sits behind an Arc; avoid sharing mutable builder state.

🏗️Architecture

💡Concepts to learn

  • Thompson NFA — The foundational construction that converts a parsed pattern into a non-deterministic finite automaton before any DFA is built (see regex-automata/src/nfa/thompson/); understanding it explains why certain patterns are rejected and how the engine guarantees linear time
  • Deterministic Finite Automaton (DFA) — The core execution engine in regex-automata/src/dfa/ that guarantees O(m*n) worst-case time; DFA state explosion is why the crate has a configurable size limit
  • Lazy DFA / Lazy evaluation — The hybrid NFA/DFA engine builds DFA states on demand during matching rather than pre-compiling all of them, bounding memory and reducing compile latency; see regex-automata/src/hybrid/
  • Aho-Corasick algorithm — A trie-based multi-pattern matcher used (via the optional aho-corasick dependency) for literal-prefix acceleration before the main engines run; it speeds up common cases like fixed strings
  • Unicode grapheme clusters and properties — The crate handles Unicode via \p{...} syntax; see UNICODE.md and regex-syntax/src/unicode.rs for which properties are supported and why some (like emoji) require careful handling
  • Catastrophic backtracking — The core problem this crate solves—traditional regex engines fail on patterns like (a+)+b with pathological inputs; this crate eliminates backtracking entirely via automata
  • Sparse vs. Dense DFA encoding — Two complementary state table representations in regex-automata/src/dfa/; dense uses O(states * alphabet) memory for fast transitions, sparse uses less memory but slower lookup; the crate uses both and chooses at compile time
  • rust-lang/rust — The Rust standard library deliberately ships no regex support; this crate fills that gap as the ecosystem's de facto standard and lives under the rust-lang organization
  • BurntSushi/ripgrep — High-performance grep alternative that directly uses regex + regex-automata for its line matching engine
  • BurntSushi/aho-corasick — Multi-pattern string matching library used optionally by regex for literal sets; complementary algorithm in the matching pipeline
  • Geal/nom — Parser combinator library for Rust; offers an alternative to regex for structured text parsing with better composability
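The "Sparse vs. Dense DFA encoding" entry above boils down to a layout choice that can be sketched in a few lines of standard-library Rust. These struct layouts are simplified illustrations, not regex-automata's actual binary formats:

```rust
// Two encodings of the same transition function: dense trades memory for
// O(1) lookups; sparse stores only the transitions that exist and pays a
// binary search per step.

const ALPHABET: usize = 256;

// Dense: one row of 256 entries per state. O(states * 256) memory.
struct DenseDfa {
    transitions: Vec<[u8; ALPHABET]>, // transitions[state][byte] = next state
}

// Sparse: per state, a sorted list of (byte, next) pairs.
struct SparseDfa {
    transitions: Vec<Vec<(u8, u8)>>,
}

impl DenseDfa {
    fn next(&self, state: u8, byte: u8) -> u8 {
        self.transitions[state as usize][byte as usize] // direct index
    }
}

impl SparseDfa {
    fn next(&self, state: u8, byte: u8) -> u8 {
        let row = &self.transitions[state as usize];
        match row.binary_search_by_key(&byte, |&(b, _)| b) {
            Ok(i) => row[i].1,
            Err(_) => 0, // no explicit transition: fall into the dead state
        }
    }
}

fn main() {
    // The same two-state machine in both encodings: state 1 loops on b'x'
    // and dies (state 0) on anything else.
    let mut dense_row = [0u8; ALPHABET];
    dense_row[b'x' as usize] = 1;
    let dense = DenseDfa { transitions: vec![[0; ALPHABET], dense_row] };
    let sparse = SparseDfa { transitions: vec![vec![], vec![(b'x', 1)]] };

    for byte in [b'x', b'y'] {
        assert_eq!(dense.next(1, byte), sparse.next(1, byte));
    }
    let dense_entries = 2 * ALPHABET; // 512 table entries
    let sparse_entries = 1;           // one (byte, next) pair
    println!("dense: {} entries, sparse: {} entries", dense_entries, sparse_entries);
}
```

The real crate ships both representations and picks between them when the DFA is built, which is the trade-off the concepts list describes.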

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add regression test suite for fuzzing discoveries

The repo has 15+ clusterfuzz regression test cases in fuzz/regressions/ but they appear to be undocumented and not integrated into the standard test suite. Creating a structured regression test module that systematically validates these cases would prevent future regressions and document known edge cases discovered through fuzzing.

  • [ ] Create tests/fuzz_regressions.rs that reads and validates each case from fuzz/regressions/
  • [ ] Parse the binary regression cases and create minimal Regex patterns that trigger the issues
  • [ ] Document in each test what class of bug it prevents (e.g., deserialization panic, incorrect match)
  • [ ] Add this test module to the standard test suite in Cargo.toml (autotests = true for this module)
  • [ ] Update fuzz/README.md with guidance on converting fuzzing crashes into permanent tests

Document and test the unicode data handling pipeline

The repo has UNICODE.md and relies on unicode data, but there's no documented process for unicode updates or tests validating the unicode property tables used by regex-syntax. Adding explicit tests for unicode support would help contributors understand and maintain this complex subsystem.

  • [ ] Create tests/unicode_properties.rs with specific tests for unicode categories, scripts, and properties
  • [ ] Test edge cases like surrogate pairs, recent unicode versions, and grapheme cluster handling
  • [ ] Add documentation to UNICODE.md explaining: (1) how unicode data is sourced, (2) version pinning strategy, (3) validation approach
  • [ ] Create a test that verifies unicode property tables match expected Unicode version (reference unicode-general-category, unicode-script, etc.)
  • [ ] Document the process for updating unicode support when new Rust/Unicode versions release

Add benchmarking CI workflow for performance regression detection

The repo has extensive benchmark infrastructure (bench/, record/), but .github/workflows/ci.yml doesn't appear to run benchmarks. Adding automated performance benchmarking on each commit would catch regressions early and maintain the repo's performance guarantees, especially important given the O(m*n) complexity claims.

  • [ ] Review .github/workflows/ci.yml and add a new benchmark job that runs cargo bench with stable output
  • [ ] Configure the job to compare against baseline (using criterion.rs comparison features or similar)
  • [ ] Set up automatic comment on PRs showing performance impact for key patterns (literal, alternation, complex DFA)
  • [ ] Create bench/REGRESSION_DETECTION.md documenting acceptable performance variance thresholds
  • [ ] Add logic to fail CI if any benchmark regresses more than 5% (configurable per pattern class)

🌿Good first issues

  • Add integration tests for regex-lite/ package (currently minimal test coverage visible in workspace). Start by examining regex-lite/src/ and creating tests under regex-lite/tests/ mirroring the patterns in tests/ at repo root.
  • Document the exact DFA serialization format and version in code comments. Currently fuzz_regex_automata_deserialize_*.rs tests exist but the format specification is implicit; add rustdoc to regex-automata/src/dfa/dense.rs and sparse.rs explaining binary layout.
  • Add explicit error messages for common regex anti-patterns. When a user writes an unsupported pattern like (?<=...), the error could suggest alternatives (use find_at() for positional matching); modify regex-syntax/src/error.rs to detect and hint on these.

Top contributors


📝Recent commits

  • 839d16b — regex-syntax-0.8.10 (BurntSushi)
  • c4865a0 — syntax: fix negation handling in HIR translation (pandaman64)
  • d8761c0 — cargo: also include benches (BurntSushi)
  • 2aaa18d — rure-0.2.5 (BurntSushi)
  • b028e4f — 1.12.3 (BurntSushi)
  • 5e195de — regex-automata-0.4.14 (BurntSushi)
  • a3433f6 — regex-syntax-0.8.9 (BurntSushi)
  • 0c07fae — regex-lite-0.1.9 (BurntSushi)
  • 6a81006 — cargo: exclude development scripts and fuzzing data (weiznich)
  • 4733e28 — automata: fix onepass::DFA::try_search_slots panic when too many slots are given (keith-hall)

🔒Security observations

The regex crate demonstrates strong security practices. As a core parsing library, it maintains extensive fuzzing infrastructure with multiple regression test cases, uses finite automata to guarantee linear time matching (preventing ReDoS attacks), and is actively maintained by the Rust project. No critical or high-severity vulnerabilities were identified in the provided codebase structure. The project's use of workspace dependencies and feature flags is well-organized. Minor recommendations include formalizing a security policy document and potentially enhancing visibility of the security testing approach. The crate's design inherently mitigates Regular Expression Denial of Service (ReDoS) attacks through algorithmic guarantees.

  • Low · Fuzzing Regression Test Cases Stored in Repository — fuzz/regressions/. The repository contains multiple fuzzing regression test cases in the fuzz/regressions directory that may include edge cases or malformed inputs. While these are intentionally stored for regression testing, they could potentially be analyzed by attackers to understand attack vectors against the regex engine. Fix: Continue storing these for regression testing as they are valuable for security hardening. Consider documenting the fuzzing process in SECURITY.md to help researchers understand the security testing approach.
  • Low · Public Fuzzing Infrastructure Configuration — fuzz/oss-fuzz-build.sh. The repository includes OSS-Fuzz integration configuration (fuzz/oss-fuzz-build.sh) which is intentionally public but exposes the fuzzing targets and build strategy. Fix: This is standard practice for OSS-Fuzz projects. Ensure fuzzing continues to be run regularly and results are acted upon promptly.
  • Low · Missing SECURITY.md Policy — Root directory. The repository lacks a SECURITY.md file that would typically outline security policies, vulnerability disclosure procedures, and security contact information. Fix: Create a SECURITY.md file following GitHub's security policy guidelines to establish a responsible disclosure process for security vulnerabilities.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
