apache/datafusion-sqlparser-rs
Extensible SQL Lexer and Parser for Rust
Healthy across the board
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit today
- ✓ 32+ active contributors
- ✓ Distributed ownership (top contributor 12% of recent commits)
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ✓ Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/apache/datafusion-sqlparser-rs) — paste at the top of your README.md; it renders inline like a shields.io badge.
Social card preview (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/apache/datafusion-sqlparser-rs on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: apache/datafusion-sqlparser-rs
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/apache/datafusion-sqlparser-rs shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit today
- 32+ active contributors
- Distributed ownership (top contributor 12% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live apache/datafusion-sqlparser-rs
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/apache/datafusion-sqlparser-rs.
What it runs against: a local clone of apache/datafusion-sqlparser-rs — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in apache/datafusion-sqlparser-rs | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | Last commit ≤ 30 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of apache/datafusion-sqlparser-rs. If you don't
# have one yet, run these first:
#
# git clone https://github.com/apache/datafusion-sqlparser-rs.git
# cd datafusion-sqlparser-rs
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of apache/datafusion-sqlparser-rs and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "apache/datafusion-sqlparser-rs(\.git)?\b" \
  && ok "origin remote is apache/datafusion-sqlparser-rs" \
  || miss "origin remote is not apache/datafusion-sqlparser-rs (artifact may be from a fork)"
# 2. License matches what RepoPilot saw (the Apache LICENSE text reads
# "Apache License"; the SPDX identifier lives in Cargo.toml for a Rust crate)
(grep -qiE "Apache License" LICENSE 2>/dev/null \
  || grep -qiE "^license[[:space:]]*=[[:space:]]*\"Apache-2\.0\"" Cargo.toml 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 30 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~0d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/apache/datafusion-sqlparser-rs"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Apache DataFusion SQLParser is a Rust crate providing an extensible SQL lexer and parser with support for ANSI/ISO SQL:2011 and multiple vendor dialects. It transforms raw SQL strings into an abstract syntax tree (AST) that can be consumed by SQL query engines, vendor-specific parsers, and SQL analysis tools. The core capability is dialect-agnostic parsing with pluggable dialect support (GenericDialect, AnsiDialect, etc.), enabling tools like Apache DataFusion to understand and execute SQL queries. Dual-crate monorepo: src/lib.rs is the entry point of the main lexer/parser crate (sqlparser), while derive/src/ provides derive macros for implementing custom dialects via the derive-dialect feature. examples/ contains runnable parse_select.rs and cli.rs demonstrating real usage; fuzz/ holds fuzzing targets. Tests are co-located with source files, following Rust conventions.
👥Who it's for
Database engine developers building SQL query executors (like Apache DataFusion), vendors implementing dialect-specific SQL support, and SQL analysis tool creators who need robust AST representations without writing parser code from scratch. Contributors are primarily Apache DataFusion maintainers and community members extending SQL dialect coverage.
🌱Maturity & risk
Production-ready and actively maintained. The project is at v0.62.0 with consistent quarterly releases (changelog/ shows releases through 0.62.0), comprehensive CI/CD via .github/workflows/rust.yml, coverage tracking via Coveralls, and an active changelog documenting breaking changes and features. Recent version history and organized GitHub Actions indicate sustained development and community adoption.
Low risk for core parsing but moderate risk for dialect extensions. The crate has minimal direct dependencies (log, recursive, serde are optional), reducing supply chain risk. However, new SQL dialects added via the extensible plugin system (derive/) require careful testing to avoid parser ambiguities. The project follows Apache governance with controlled release processes (dev/release/ scripts), mitigating abandonment risk.
Active areas of work
The project is in active maintenance with version 0.62.0 released and changelog tracking through 0.61.0. GitHub workflows (rust.yml, license.yml, stale.yml) indicate continuous CI validation, linting, and dependency management via Dependabot. The focus appears to be on dialect coverage expansion and stability, evident from the derive/ macro system supporting custom dialect implementations.
🚀Get running
git clone https://github.com/apache/datafusion-sqlparser-rs.git
cd datafusion-sqlparser-rs
cargo build
cargo test
cargo run --example parse_select
Daily commands:
- Run the parse example: `cargo run --example parse_select`
- Run the interactive CLI with JSON output: `cargo run --example cli --features json_example`
- Execute tests: `cargo test`
- Benchmark: refer to docs/benchmarking.md
🗺️Map of the codebase
- src/lib.rs: Entry point exporting Parser, Tokenizer, Dialect, and AST types; defines the public API
- src/parser.rs: Core 5000+ line recursive descent parser with parse_select, parse_insert, parse_create_table methods and dialect-agnostic statement parsing
- src/ast.rs: Complete AST node definitions (Select, Expr, Statement, etc.) representing parsed SQL structure; serializable with serde
- src/tokenizer.rs: Lexical analyzer converting SQL strings into Token stream; handles dialect-specific keyword recognition
- src/dialect/: Contains GenericDialect, AnsiDialect, and 15+ vendor-specific dialect implementations (PostgresDialect, MySqlDialect, T-SQL, etc.)
- derive/src/lib.rs: Procedural macro crate enabling custom Dialect trait implementations via derive-dialect feature
- examples/parse_select.rs: Minimal working example showing how to use Parser::parse_sql with GenericDialect on a SELECT statement
- Cargo.toml: Workspace config defining features (recursive-protection, visitor, derive-dialect), edition 2021, optional dependencies
🛠️How to make changes
For new SQL dialect support: implement Dialect trait in src/dialect/ (study src/dialect/ansi.rs or src/dialect/generic.rs for patterns). For parser changes: modify src/parser.rs and add tests in src/parser.rs test modules. For AST node changes: edit src/ast.rs. For tokenizer changes: src/tokenizer.rs. Use the derive/ crate to generate visitor patterns for new AST nodes via the visitor feature.
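As a shape reference for the dialect pattern, here is a self-contained sketch. The trait below is a stand-in that mirrors two of the real Dialect trait's hooks by name; the actual trait lives in src/dialect/ and has many more methods, and the `$`-identifier rule is a hypothetical example:

```rust
// Stand-in trait mirroring two hooks of sqlparser's Dialect trait;
// the real trait (src/dialect/) carries many more default methods.
trait Dialect: std::fmt::Debug {
    /// May a character begin an identifier in this dialect?
    fn is_identifier_start(&self, ch: char) -> bool;
    /// May a character continue an identifier?
    fn is_identifier_part(&self, ch: char) -> bool;
}

// Hypothetical dialect permitting `$` inside identifiers,
// as some vendors do.
#[derive(Debug)]
struct DollarDialect;

impl Dialect for DollarDialect {
    fn is_identifier_start(&self, ch: char) -> bool {
        ch.is_ascii_alphabetic() || ch == '_'
    }
    fn is_identifier_part(&self, ch: char) -> bool {
        ch.is_ascii_alphanumeric() || ch == '_' || ch == '$'
    }
}
```

The tokenizer consults these hooks character by character, which is why keyword and identifier conflicts (see Traps below) surface at the dialect layer.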
🪤Traps & gotchas
The recursive-protection feature is enabled by default and uses the recursive crate to prevent stack overflow on deeply nested queries; disabling it in embedded scenarios requires explicit feature control. Dialect trait implementations require careful keyword ordering in the tokenizer to avoid conflicts (e.g., T-SQL DECLARE vs SELECT). The derive/ macro system requires sqlparser_derive dependency to be in scope. Tests expect deterministic AST output but some dialects may normalize identifiers differently; study existing dialect tests before adding new ones.
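To illustrate the failure mode recursive-protection guards against, here is a self-contained sketch (not the crate's implementation) of a manual depth cap: nesting past the limit returns an error instead of risking a stack overflow.

```rust
// Illustrative depth guard: count parenthesis nesting in raw input
// and bail out past a cap, instead of recursing unboundedly.
#[derive(Debug, PartialEq)]
enum ParseError {
    TooDeep,
}

fn nesting_depth(input: &str, max: usize) -> Result<usize, ParseError> {
    let mut depth = 0usize;
    let mut deepest = 0usize;
    for ch in input.chars() {
        match ch {
            '(' => {
                depth += 1;
                if depth > max {
                    // A real parser would surface a RecursionLimitExceeded
                    // style error here rather than overflow the stack.
                    return Err(ParseError::TooDeep);
                }
                deepest = deepest.max(depth);
            }
            ')' => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
    Ok(deepest)
}
```

The crate's feature instead reuses the recursive crate to move deep recursion onto the heap, but the contract is the same: pathological nesting must fail gracefully.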
💡Concepts to learn
- Recursive Descent Parsing — The entire parser.rs is hand-written recursive descent (not generated from grammar); understanding mutual recursion, lookahead, and backtracking is essential to modify parsing logic
- Dialect-Agnostic AST — sqlparser-rs solves the multi-dialect problem by mapping vendor syntax to a unified AST via Dialect trait implementations; contributors must understand how dialect differences map to common AST nodes
- Tokenizer State Machine — The tokenizer.rs implements a stateful lexer handling SQL comments, string literals with escaping, and dialect-specific reserved words; understanding token lookahead is crucial for fixing tokenizer bugs
- Stack Overflow Protection via Recursion Limiting — The optional recursive-protection feature prevents denial-of-service on pathological nested SQL (1000+ levels of subqueries); contributors must understand the recursive crate's depth tracking
- Procedural Macros for Dialect Extension — The derive/ crate generates Dialect and Visitor implementations at compile-time; understanding #[derive(Dialect)] and attribute macros is needed to extend parsing without modifying core parser.rs
- Abstract Syntax Tree (AST) Design — ast.rs defines 100+ enum variants (Select, Insert, Expr, etc.) that represent parsed SQL; understanding AST shape is mandatory for adding new SQL features or dialects
- Operator Precedence and Associativity — The parser handles operator precedence implicitly via recursive parse_expression() call depth; misunderstanding precedence rules causes incorrect AST for complex expressions like `a OR b AND c`
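A self-contained toy (not the crate's parser) showing how call depth encodes precedence: parse_or delegates operands to parse_and, so AND binds tighter and `a OR b AND c` parses as `a OR (b AND c)`.

```rust
use std::iter::Peekable;
use std::vec::IntoIter;

#[derive(Debug, PartialEq)]
enum Expr {
    Ident(String),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

type Tokens = Peekable<IntoIter<String>>;

fn parse(input: &str) -> Expr {
    let toks: Vec<String> = input.split_whitespace().map(str::to_string).collect();
    parse_or(&mut toks.into_iter().peekable())
}

// Loosest precedence level: OR. Its operands are whole AND-expressions.
fn parse_or(tokens: &mut Tokens) -> Expr {
    let mut left = parse_and(tokens);
    while tokens.peek().map(String::as_str) == Some("OR") {
        tokens.next();
        left = Expr::Or(Box::new(left), Box::new(parse_and(tokens)));
    }
    left
}

// Tighter level: AND. Because parse_or calls down into this, AND
// groups its operands before OR ever sees them.
fn parse_and(tokens: &mut Tokens) -> Expr {
    let mut left = parse_atom(tokens);
    while tokens.peek().map(String::as_str) == Some("AND") {
        tokens.next();
        left = Expr::And(Box::new(left), Box::new(parse_atom(tokens)));
    }
    left
}

fn parse_atom(tokens: &mut Tokens) -> Expr {
    Expr::Ident(tokens.next().expect("expected identifier"))
}
```

The real parser.rs follows the same pattern across many more levels (comparison, additive, multiplicative, unary), which is why inserting a new operator means choosing the correct level to hook into.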
🔗Related repos
- apache/datafusion — Primary consumer of this parser; the DataFusion query engine uses sqlparser-rs to parse SQL into ASTs for execution planning and optimization
- sqlparser/sqlparser-rs — Historical predecessor (this is the official Apache-maintained fork); check it for algorithm inspiration or legacy issue discussions
- postgres/postgres — Reference implementation for PostgreSQL dialect grammar; dialect maintainers cross-reference the Postgres parser for feature parity
- eclipse/eclipse-collections — No direct relation, but a similarly governed infrastructure project; may share release/CI patterns and community practices
- rustls/rustls — Comparable security-critical Rust library; demonstrates Rust ecosystem best practices for procedural macros and test coverage that this project emulates
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive dialect-specific parser tests for src/ast/
The repo supports multiple SQL dialects (ANSI SQL:2011, PostgreSQL, MySQL, etc.) but the test coverage for dialect-specific AST parsing appears minimal. The derive/src/dialect.rs and docs/custom_sql_parser.md exist but there's no dedicated test suite validating each dialect's unique syntax (e.g., PostgreSQL JSON operators, MySQL CREATE TABLE extensions). This would catch regressions and improve maintainability.
- [ ] Create tests/dialect_specific/ directory with subdirectories for each supported dialect
- [ ] Add test module in src/ast/ or separate file testing PostgreSQL-specific DDL parsing from src/ast/ddl.rs
- [ ] Document which dialects map to which AST nodes in src/ast/dml.rs and src/ast/dcl.rs
- [ ] Add integration tests validating the dialect trait system works correctly with custom implementations
Implement missing SQL standard features in src/ast/data_type.rs
The ANSI SQL:2011 standard includes complex data types (ARRAY, ROW, INTERVAL with precision) but src/ast/data_type.rs likely has incomplete coverage. Adding these would improve compatibility with PostgreSQL, BigQuery, and Snowflake dialects. A parser that can't handle modern SQL types severely limits adoption.
- [ ] Audit src/ast/data_type.rs against ANSI SQL:2011 standard data types and update DataType enum
- [ ] Add parser rules in src/ for ARRAY[dimension], ROW type constructors, and INTERVAL YEAR TO MONTH syntax
- [ ] Create tests/data_types/ with examples from PostgreSQL and Snowflake dialect documentation
- [ ] Update examples/parse_select.rs to demonstrate parsing complex types
Add missing edge case tests for recursive-protection feature and fuzzing coverage
The repo has recursive-protection feature (Cargo.toml) and fuzz/fuzz_targets/fuzz_parse_sql.rs but no documented test cases validating deeply nested queries, CTE chains, or malformed input that would trigger the recursive limit. This is critical for security and stability in production use.
- [ ] Create tests/recursive_protection/ with test cases for max nesting depth limits (e.g., 1000+ nested SELECTs)
- [ ] Add fuzz tests in fuzz/fuzz_targets/ that specifically target edge cases like unbalanced parentheses and circular references
- [ ] Document the recursive-protection behavior and limits in docs/custom_sql_parser.md
- [ ] Add benchmarks in sqlparser_bench/benches/ comparing performance with/without recursive-protection enabled
🌿Good first issues
- Add parser support for SQL window function frame clauses (ROWS BETWEEN / RANGE BETWEEN) by extending src/ast.rs WindowFrame enum and updating src/parser.rs parse_window_frame() method—currently only basic window specs are parsed.
- Write comprehensive dialect-specific integration tests comparing AST output across all 15+ dialects in src/dialect/ by creating tests/dialect_compat.rs with test cases from dialect-specific SQL syntax guides (e.g., MySQL-specific INDEX hints).
- Implement missing operator precedence documentation in docs/ by analyzing src/parser.rs binary operator parsing and documenting the precedence hierarchy with examples (currently only examples/parse_select.rs covers basic queries).
⭐Top contributors
- @LucaCappelletti94 — 12 commits
- @romanoff — 12 commits
- @xitep — 8 commits
- @guan404ming — 8 commits
- @yoavcloud — 6 commits
📝Recent commits
- 7c78d13 — BigQuery: Parse WITH CONNECTION on CREATE EXTERNAL TABLE (#2326) (97nitt)
- 79f06bd — Add support for PostgreSQL's ORDER BY ... USING <operator> clause (#2246) (LucaCappelletti94)
- f98c9d1 — [Teradata] Add CREATE TABLE options (#2329) (iffyio)
- 3f347e3 — Fix panic in EscapeQuotedString and parse_flush, clean up a few unwraps (#2330) (Mrmaxmeier)
- 299f54d — Snowflake: Accept COPY GRANTS after CREATE VIEW column list (#2327) (97nitt)
- 824b5c1 — Snowflake: Fix COPY INTO transformation parsing for cast expressions (#2328) (97nitt)
- a281171 — Databricks: Add support for UPDATE SET * and INSERT * in MERGE statements (#2325) (finchxxia)
- 3dd0e30 — chore: Bump version to 0.62.0 and add changelog (#2331) (andygrove)
- 9833c03 — feat(clickhouse): support PARTITION BY after ORDER BY and ARRAY JOIN (#2283) (cristhiank)
- 9550e93 — Add Teradata dialect (#2309) (iffyio)
🔒Security observations
The apache/datafusion-sqlparser-rs codebase demonstrates strong security posture as a SQL parser library. Key strengths include: minimal dependencies, no hardcoded credentials, Apache licensing, presence of fuzzing infrastructure, and recursive-protection features. Minor security observations include: incomplete vulnerability reporting guidance (using personal email instead of Apache channels), lack of explicit security documentation regarding input validation and DoS prevention, and need for clarification on logging practices to prevent sensitive data exposure. The library's core purpose (parsing untrusted SQL) is well-suited to security best practices, though documentation could be enhanced around threat modeling and recommended usage patterns.
- Low · Incomplete Security Reporting Channel — SECURITY.md. The file specifies only a single personal email address (andrew@nerdnetworks.org) for vulnerability reporting rather than an official Apache project security contact, and lacks a formal disclosure policy with response-time commitments or embargo periods. Fix: point SECURITY.md at the Apache Security Team contact (security@apache.org) and follow Apache's vulnerability reporting guidelines, with documented response timelines and embargo policies.
- Low · Missing Input Validation Documentation — README.md, documentation. As a SQL parser library, the codebase processes untrusted SQL input, yet there is no explicit security documentation about input validation, maximum query complexity limits, or protection against denial of service. Fix: add security documentation covering (1) input size limits, (2) recursion depth protection (the recursive-protection feature exists), (3) performance characteristics with malicious input, and (4) recommended usage patterns to prevent DoS.
- Low · Recursive-Protection Feature Under-Documented — Cargo.toml (features section). The recursive-protection feature is enabled by default, which is good, but the documentation should clearly explain what it mitigates (stack overflow from deeply nested queries), its performance implications, and why applications parsing untrusted SQL should keep it on.
- Low · Minimal Dependency Footprint But Log Crate Included — Cargo.toml (dependencies). The project includes the optional log crate. Logging is generally safe, but all logging statements should be reviewed to ensure no sensitive SQL text, user input, or parsed values are emitted, especially at DEBUG level; prefer high-level error/warn logging only.
- Low · Fuzzing Infrastructure Present But Integration Unclear — fuzz/ directory. The repository contains a fuzzing target (fuzz/fuzz_targets/fuzz_parse_sql.rs), indicating security testing, but integration with OSS-Fuzz or another continuous fuzzing program is not documented. Fix: document fuzzing coverage and results, and ensure the project participates in a continuous fuzzing program to catch parser edge cases and potential vulnerabilities.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.