sfu-db/connector-x
Fastest library to load data from DB to DataFrames in Rust and Python
Healthy across the board
- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — a clean foundation to fork and modify.
- Documented and popular — a useful reference codebase to read through.
- No critical CVEs and a sane security posture — runnable as-is.
- ✓ Last commit 1w ago
- ✓ 11 active contributors
- ✓ MIT licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Concentrated ownership — top contributor handles 52% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/sfu-db/connector-x) — paste at the top of your README.md; it renders inline like a shields.io badge.
Social card (1200×630): auto-renders when someone shares https://repopilot.app/r/sfu-db/connector-x on X, Slack, or LinkedIn.
Onboarding: sfu-db/connector-x
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/sfu-db/connector-x shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 1w ago
- 11 active contributors
- MIT licensed
- CI configured
- Tests present
- ⚠ Concentrated ownership — top contributor handles 52% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live sfu-db/connector-x
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/sfu-db/connector-x.
What it runs against: a local clone of sfu-db/connector-x — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in sfu-db/connector-x | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 38 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of sfu-db/connector-x. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/sfu-db/connector-x.git
#   cd connector-x
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok()   { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of sfu-db/connector-x and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "sfu-db/connector-x(\.git)?\b" \
  && ok "origin remote is sfu-db/connector-x" \
  || miss "origin remote is not sfu-db/connector-x (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in \
  "connectorx-python/src/lib.rs" \
  "connectorx-python/src/cx_read_sql.rs" \
  "Cargo.toml" \
  "connectorx-python/Cargo.toml" \
  "connectorx-python/connectorx/__init__.py"; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 38 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~8d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/sfu-db/connector-x"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
ConnectorX is a Rust-based library that loads data from databases (PostgreSQL, MySQL, Oracle, MSSQL, SQLite) into Python DataFrames 10–50x faster than pandas/SQLAlchemy by using parallel streaming and zero-copy Arrow IPC. It supports multi-threaded, partition-based parallel loading on numerical columns and experimental federated queries across multiple databases. It is a workspace monorepo (Cargo.toml defines workspace members connectorx and connectorx-cpp). The core Rust library in connectorx/ handles database connections and Arrow buffer management; Python bindings wrap the Rust layer (implied by the ~261K lines of Python). Benchmarks in benchmarks/ compare performance against pandas, Dask, and Modin. The dev environment in .devcontainer/ uses docker-compose with multiple databases.
👥Who it's for
Data engineers and data scientists who need to ingest large datasets (GB-scale) from databases into Python for analysis, with a focus on those using PostgreSQL/MySQL and who value memory efficiency and speed over convenience.
🌱Maturity & risk
Production-ready with active development. The project has CI/CD pipelines (.github/workflows/ with ci.yml, benchmark.yml, release.yml, import-test.yml), a comprehensive benchmarking suite (benchmarks/ comparing against Dask, Modin, and turbodbc), and a Cargo workspace with multiple member crates. Recent commits indicate ongoing maintenance and feature development (federated queries are marked experimental).
Moderate risk: The codebase is Rust-heavy (877K lines) with Python bindings, creating a maintenance burden for a potentially small core team (single GitHub org suggests limited contributors). Federated query support is explicitly marked experimental. The dependency on Arrow (v54) and Polars (v0.49) locks you to specific versions that may lag upstream. No obvious red flags in dependencies but Rust ecosystem stability matters.
Active areas of work
The project maintains active CI with benchmarking against competitors (benchmark.yml runs performance tests), release automation (release.yml), and import testing (import-test.yml). Federated query feature is under development (mentioned in README as experimental). The presence of multiple workflow files suggests the team is focused on correctness, performance regression detection, and reliable releases.
🚀Get running
```bash
git clone https://github.com/sfu-db/connector-x.git
cd connector-x
cargo build --release
pip install -e .
```
Or for Python only: pip install connectorx. For development with databases, use the docker-compose setup in .devcontainer/docker-compose.yml.
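Below is a minimal usage sketch based on the read_sql API described above; the connection string, table, and partition column are placeholders for your own setup:

```python
import connectorx as cx

# Placeholder DSN: substitute your own credentials and database.
conn = "postgresql://user:password@localhost:5432/tpch"

# Single-threaded load into a pandas DataFrame.
df = cx.read_sql(conn, "SELECT * FROM lineitem")

# Parallel load: split l_orderkey's range into 4 partitions fetched on
# 4 threads. The partition column must be numeric and NULL-free.
df = cx.read_sql(
    conn,
    "SELECT * FROM lineitem",
    partition_on="l_orderkey",
    partition_num=4,
)

# Other destinations are selected via return_type (e.g. "polars", "arrow").
pl_df = cx.read_sql(conn, "SELECT * FROM lineitem", return_type="polars")
```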
Daily commands:
For development, cargo build --release builds the Rust library. Python tests likely run via python -m pytest (inferred from the test code in the Python tree). Benchmarks run via python benchmarks/tpch-cx.py or similar. See the Justfile for available recipes (its presence suggests task automation via just).
🗺️Map of the codebase
- connectorx-python/src/lib.rs — Entry point for Python bindings; all Python API calls route through here via PyO3
- connectorx-python/src/cx_read_sql.rs — Core implementation of the read_sql() function that orchestrates database queries and result materialization
- Cargo.toml — Workspace configuration defining all members and shared dependencies (Arrow, Polars); critical for build and feature flags
- connectorx-python/Cargo.toml — Python crate configuration with PyO3 bindings and platform-specific compilation targets
- connectorx-python/connectorx/__init__.py — Python module initialization exposing the main public API (read_sql, read_arrow, etc.)
- connectorx-python/connectorx/connectorx.pyi — Type stub file providing static type hints for all Python bindings to consumers
- connectorx-python/build.rs — Build script managing Rust compilation, linking, and feature-gating for database drivers
🛠️How to make changes
Add support for a new database backend
- Create a new feature flag in connectorx-python/Cargo.toml under [features] (e.g., "oracle" or "snowflake") (connectorx-python/Cargo.toml)
- Implement the driver protocol in the core connector crate (outside this file list) or enable it via a dependency feature (connectorx-python/build.rs)
- Add a test file connectorx-python/connectorx/tests/test_<dbname>.py with fixtures connecting to your database (see the pytest sketch after this list) (connectorx-python/connectorx/tests/conftest.py)
- Update the CI workflow to add a database service container for testing (.github/workflows/ci.yml)
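To make the test step concrete, here is a hedged pytest skeleton; the backend name, environment variable, and query are hypothetical, and the fixture style should be verified against the repo's conftest.py before copying:

```python
# Hypothetical test skeleton for a new backend ("newdb" is a stand-in).
import os

import pytest
import connectorx as cx

@pytest.fixture(scope="module")
def newdb_url():
    # Read the connection string from the environment, never hardcode it.
    url = os.environ.get("NEWDB_URL")
    if not url:
        pytest.skip("NEWDB_URL not set; skipping newdb integration tests")
    return url

def test_read_sql_basic(newdb_url):
    df = cx.read_sql(newdb_url, "SELECT 1 AS x")
    assert df["x"][0] == 1
```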
Add a new output destination format (beyond pandas/polars/arrow)
- Extend the destination enum in connectorx-python/src/cx_read_sql.rs to support the new format (connectorx-python/src/cx_read_sql.rs)
- Implement conversion logic; if using Arrow FFI, update connectorx-python/src/arrow.rs (connectorx-python/src/arrow.rs)
- Add Python bindings in connectorx-python/src/lib.rs to expose the new destination (connectorx-python/src/lib.rs)
- Export it in connectorx-python/connectorx/__init__.py and add type hints in the .pyi stub (connectorx-python/connectorx/__init__.py)
- Add integration tests in connectorx-python/connectorx/tests/test_arrow.py or similar (connectorx-python/connectorx/tests/test_arrow.py)
Add a new configuration option to read_sql()
- Add the parameter to the Rust function signature in connectorx-python/src/cx_read_sql.rs (connectorx-python/src/cx_read_sql.rs)
- Update the PyO3 #[pyfunction] macro in connectorx-python/src/lib.rs to accept the parameter (connectorx-python/src/lib.rs)
- Update the Python type stub in connectorx-python/connectorx/connectorx.pyi; a stub sketch follows this list (connectorx-python/connectorx/connectorx.pyi)
- Add test cases in the relevant test files (e.g., connectorx-python/connectorx/tests/test_postgres.py) (connectorx-python/connectorx/tests/test_postgres.py)
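As an illustration of the stub change, here is an abridged, hypothetical edit to connectorx/connectorx.pyi; query_timeout is an invented example parameter and the real signature has more options, so check the actual stub first:

```python
# Abridged sketch of connectorx/connectorx.pyi, not the real file.
from typing import Optional

def read_sql(
    conn: str,
    query: str,
    *,
    partition_on: Optional[str] = ...,
    partition_num: Optional[int] = ...,
    query_timeout: Optional[int] = ...,  # hypothetical new option
) -> object: ...
```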
Add a new benchmark scenario
- Create a new Python benchmark script in the benchmarks/ directory (e.g., benchmarks/tpch-custom.py) (benchmarks/tpch-cx.py)
- Follow the pattern of the existing benchmarks: load with connectorx, compare against pandas/dask; a sketch follows this list (benchmarks/ddos-cx.py)
- Optionally add it to the GitHub Actions workflow for automated regression detection (.github/workflows/benchmark.yml)
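A minimal sketch of that pattern, assuming a reachable Postgres with a lineitem table; the DSN is a placeholder and this is not one of the repo's scripts:

```python
import time

import connectorx as cx
import pandas as pd
from sqlalchemy import create_engine

DSN = "postgresql://user:password@localhost:5432/tpch"  # placeholder
QUERY = "SELECT * FROM lineitem"

# Timed connectorx load with 4 parallel partitions.
t0 = time.perf_counter()
df_cx = cx.read_sql(DSN, QUERY, partition_on="l_orderkey", partition_num=4)
print(f"connectorx: {time.perf_counter() - t0:.2f}s, {len(df_cx)} rows")

# Timed pandas/SQLAlchemy load for comparison.
t0 = time.perf_counter()
df_pd = pd.read_sql(QUERY, create_engine(DSN))
print(f"pandas:     {time.perf_counter() - t0:.2f}s, {len(df_pd)} rows")
```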
🔧Why these technologies
- Rust + PyO3 — Enables C-grade performance for data loading while maintaining Python ergonomics; PyO3 provides zero-copy FFI and seamless GIL management
- Apache Arrow 54 — Columnar in-memory format enables efficient parallel processing, zero-copy transfers to Python, and easy conversion to pandas/polars/Dask
- Polars 0.49 — High-performance alternative to pandas; often faster for data transformations and supports lazy evaluation
- Maturin build system — Simplifies PyO3 project builds and enables automated wheel distribution for multiple Python versions and platforms
- Multi-threaded async query execution — Allows parallel fetching from partitioned queries, dramatically reducing wall-clock time for large result sets
⚖️Trade-offs already made
- Rust-based implementation instead of pure Python
  - Why: Python is too slow for high-throughput data loading; Rust enables 10–100x speedups
  - Consequence: requires binary wheels, platform-specific compilation, and a higher maintenance burden for edge cases
- PyO3 instead of ctypes/CFFI
  - Why: PyO3 is idiomatic, safer, and integrates tightly with Rust's type system
  - Consequence: tight coupling to specific PyO3 versions; Rust refactors may require API changes
- Arrow FFI for zero-copy transfer
  - Why: minimizes memory copies and enables shared-memory semantics with downstream libraries (Dask, Polars)
  - Consequence: requires consumer libraries to support the Arrow C Data Interface; fallback to materialization is needed for older pandas versions
- Feature flags for optional database drivers
  - Why: reduces binary size and compilation time for users who don't need all backends
  - Consequence: a more complex Cargo.toml, and feature combinations can break at runtime if untested
🚫Non-goals (don't propose these)
- Does not provide query optimization or cost-based planning; assumes user provides well-formed SQL
🪤Traps & gotchas
- Database credentials in .github/config/: db1.json and db2.json contain test connection strings; confirm these are placeholders, not real credentials.
- Docker dependency: development requires Docker for the multi-database test environment (.devcontainer/docker-compose.yml); building without Docker requires manual database setup.
- Partition NULL constraint: partition-based parallelism fails if the partition column contains NULL values (an explicit constraint in the README).
- Rust toolchain version: the workspace requires a stable Rust version; check .cargo/config for the MSRV (not obvious in the provided files).
- Arrow/Polars version pinning: workspace.dependencies lock Arrow 54 and Polars 0.49 exactly; updating these may require significant integration work.
💡Concepts to learn
- Arrow IPC (Inter-Process Communication) — ConnectorX uses Arrow's columnar format and IPC mechanisms to transfer data from Rust to Python without copying — understanding Arrow buffers is essential for debugging performance or adding new destinations
- Partition-based data parallelism — The core performance feature: ConnectorX splits a numeric column's range into N sub-ranges and spawns N threads to fetch the partitions in parallel; understanding partition boundaries and NULL handling is critical for using partition_on correctly (see the sketch after this list)
- Columnar storage vs. row-oriented storage — ConnectorX's speed advantage comes from materializing data as Arrow columnar format (SIMD-friendly) rather than row-oriented pandas; understanding column layout explains why it's faster and more memory-efficient
- Foreign Function Interface (FFI) — The Python bindings to the Rust core go through PyO3; modifying the API or adding new Rust functions requires understanding how Rust types map to Python
- Connection pooling — Parallel partition loading requires reusing database connections across threads safely; the codebase likely implements pooling to avoid connection exhaustion
- Query optimization with join pushdown — Federated queries optimize by pushing joins from the same data source down to that database before cross-database joins; this is the key insight in Federation.md
- Rust async/await with tokio — The Rust layer likely uses async I/O via tokio for non-blocking database reads; understanding Rust's async model is essential for modifying the core driver
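A plain-Python sketch of the partitioning idea described above; the real logic lives in the Rust core (with MIN/MAX probing and per-backend query rewriting), so treat this as an illustration, not the implementation:

```python
# Split [lo, hi] into n contiguous buckets and emit one sub-query each,
# mimicking what partition_on/partition_num do conceptually.
def partition_queries(query: str, col: str, lo: int, hi: int, n: int) -> list[str]:
    step = (hi - lo + 1) // n
    bounds = [lo + i * step for i in range(n)] + [hi + 1]
    return [
        f"SELECT * FROM ({query}) t "
        f"WHERE t.{col} >= {bounds[i]} AND t.{col} < {bounds[i + 1]}"
        for i in range(n)
    ]

# Each of the 4 sub-queries would be fetched on its own thread.
for q in partition_queries("SELECT * FROM lineitem", "l_orderkey", 1, 6_000_000, 4):
    print(q)
```

Rows where the partition column is NULL satisfy none of the range predicates, which is why the NULL-free constraint on partition_on exists.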
🔗Related repos
- pandas-dev/pandas — Incumbent DataFrame library that ConnectorX optimizes against; key competitor for read_sql()
- dask/dask — Parallel DataFrame library used in benchmarks (benchmarks/tpch-dask.py); ConnectorX targets faster ingest for Dask workloads
- pola-rs/polars — Fast Rust DataFrame engine that ConnectorX uses as a destination and dependency (workspace.dependencies specifies polars 0.49)
- apache/arrow — Columnar format underlying ConnectorX's zero-copy data transfer (workspace.dependencies specifies arrow 54)
- launchbadge/sqlx — Async Rust SQL toolkit likely used for database connections in the core driver layer
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive integration tests for Federation feature in .github/workflows/
Federation.md exists as documentation but there's no dedicated CI workflow (federation-test.yml or similar) to validate federation functionality across different database sources. The repo has import-test.yml and ci.yml, but federation integration tests should run separately to catch regressions in the federation layer. This would prevent breaking changes to the federation system.
- [ ] Create .github/workflows/federation-test.yml with matrix testing across supported databases
- [ ] Reference Federation.md to identify core federation scenarios to test
- [ ] Add federation-specific test cases in connectorx/tests/ directory
- [ ] Update CONTRIBUTING.md with federation testing guidelines
- [ ] Ensure workflow triggers on changes to federation-related code paths
Implement performance regression detection in .github/workflows/benchmark.yml
The benchmark.yml workflow exists but appears to lack automated comparison against baseline metrics. Given the repo's focus on performance (assets/mssql-mem.png, pg-time.png, etc.), there should be a mechanism to fail CI if benchmarks regress beyond acceptable thresholds. This prevents performance regressions from merging undetected.
- [ ] Enhance .github/workflows/benchmark.yml to store baseline metrics
- [ ] Add comparison logic that fails CI if any benchmark degrades >5% (configurable)
- [ ] Store historical benchmark results in .github/config/ (similar to db1.json, db2.json structure)
- [ ] Add benchmark result annotations to PR comments for visibility
- [ ] Document acceptable performance regression thresholds in CONTRIBUTING.md
Add missing backend-specific documentation in docs/ with setup guides
The repo supports multiple database sources (shown in assets/sources.gif) and destinations (assets/destinations.gif) but the repo structure shows no dedicated docs/ directory with per-backend setup instructions. Contributors struggle to test against Oracle, MSSQL, PostgreSQL, MySQL locally. Create backend-specific guides to lower contribution friction.
- [ ] Create docs/ directory structure with docs/backends/ subdirectory
- [ ] Add database-specific setup guides: docs/backends/postgres-setup.md, docs/backends/mysql-setup.md, docs/backends/mssql-setup.md, docs/backends/oracle-setup.md, docs/backends/sqlite-setup.md
- [ ] Reference .devcontainer/docker-compose.yml and .devcontainer/Dockerfile in guides
- [ ] Document connection string formats and test database initialization for each backend
- [ ] Update CONTRIBUTING.md to link to backend setup documentation
- [ ] Add backend selection instructions for local testing against specific database types
🌿Good first issues
- Add integration tests for federated queries across PostgreSQL + MySQL: Federation.md exists, but no explicit test file appears in the top 60 files; adding connectorx/tests/federation_test.rs with reproducible examples would validate the experimental feature
- Document partition_on limitations with examples for each database: the README says partition_on works on "numerical columns without NULL", but no docs/partition_limitations.md exists; add a guide showing which numeric types work and example failure modes
- Add benchmarks comparing memory usage vs. pandas chunked reading: benchmarks/ddos-pandas-chunk.py exists but produces no memory-profiling output; add a tracemalloc comparison to show the exact memory savings claimed in the README (see the sketch after this list)
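A rough starting point for the last idea (placeholder DSN and table); note that tracemalloc only traces Python-side allocations, so connectorx's Rust-side Arrow buffers will be undercounted and an RSS-based measure may be fairer:

```python
import tracemalloc

import connectorx as cx
import pandas as pd
from sqlalchemy import create_engine

DSN = "postgresql://user:password@localhost:5432/tpch"  # placeholder
QUERY = "SELECT * FROM lineitem"

# Peak Python-side memory for a connectorx load.
tracemalloc.start()
df = cx.read_sql(DSN, QUERY)
print(f"connectorx peak: {tracemalloc.get_traced_memory()[1] / 2**20:.1f} MiB")
tracemalloc.stop()

# Peak Python-side memory for pandas chunked reading.
tracemalloc.start()
df = pd.concat(pd.read_sql(QUERY, create_engine(DSN), chunksize=100_000))
print(f"pandas chunked peak: {tracemalloc.get_traced_memory()[1] / 2**20:.1f} MiB")
tracemalloc.stop()
```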
⭐Top contributors
- @wangxiaoying — 52 commits
- @dbascoules — 11 commits
- @github-actions[bot] — 10 commits
- @dependabot[bot] — 9 commits
- @Vincent — 8 commits
📝Recent commits
- 6ec2229 — 0.4.6-alpha.1 (wangxiaoying)
- ab77155 — docs(contributor): contrib-readme-action has updated readme (github-actions[bot])
- 5a2c792 — Merge pull request #902 from oystersuki/hotfix/tinyint_was_being_converted_to_bool (wangxiaoying)
- 8e6aa6e — fix mysql tiny to bool bug (oystersuki)
- 6cdc2da — docs(contributor): contrib-readme-action has updated readme (github-actions[bot])
- c7ad47c — Merge pull request #896 from antivirak/feature/allow-redshift-iam-or-saml-login (wangxiaoying)
- 09d91ca — Add redshift identity provide login info (antivirak)
- d1c0c62 — Merge pull request #901 from dbascoules/feat/dbascoules/default-to-microsecond-precision (wangxiaoying)
- 7ff7add — feat(tests): add support for TIMESTAMP(9) and TIMESTAMP WITH TIME ZONE(9) in Oracle tests (dbascoules)
- 6dd1d5a — Merge pull request #900 from sfu-db/dependabot/github_actions/extractions/setup-just-4 (wangxiaoying)
🔒Security observations
The ConnectorX codebase shows moderate security posture with several areas requiring attention. The main concerns are: (1) potential SQL injection risks inherent in a multi-database connector library, (2) outdated dependencies (Arrow, Polars) that may contain vulnerabilities, (3) debug symbols in release builds, and (4) potential credential exposure through configuration files. The codebase lacks a formal security policy. Strengths include active CI/CD pipeline, structured testing, and use of Rust for memory safety. Immediate actions should focus on validating input sanitization in the database query layer, updating dependencies, and implementing proper credential management through GitHub Secrets.
- Medium · Outdated Arrow Dependency — Cargo.toml, workspace.dependencies. The workspace uses arrow v54, which may contain known vulnerabilities. Arrow is a critical dependency for data processing and serialization, so regular security updates matter. Fix: regularly check the arrow changelog and CVE databases; update to the latest stable version that maintains API compatibility; consider Dependabot or similar automated dependency scanning.
- Medium · Outdated Polars Dependency — Cargo.toml, workspace.dependencies. The workspace uses polars v0.49, which may contain known vulnerabilities. Polars is a core dataframe library used in data handling operations. Fix: monitor polars releases and security advisories; update to the latest stable version; check the changelog for security fixes before major-version updates.
- High · Potential SQL Injection Risk in Database Connector — connectorx database connection layer (not visible in the provided structure). As a library designed to load data from multiple database types (PostgreSQL, MySQL, SQLite, Oracle, MSSQL), there is inherent risk of SQL injection if user input is not properly sanitized; the file structure suggests dynamic query construction across multiple sources. Fix: use parameterized or prepared statements exclusively; never concatenate user input directly into SQL; add input validation and a security-focused review process for database-layer changes (a parameterization example follows this list).
- Medium · Debug Symbols in Release Builds — Cargo.toml, [profile.release]. The release profile sets debug = true, which includes debug symbols in release builds, increasing binary size and potentially exposing information in stack traces. Fix: set debug = false or debug = "line-tables-only" for production builds; reserve full debug symbols for development.
- Medium · Credential Storage in Config Files — .github/config/db1.json, .github/config/db2.json. These configuration files may contain database credentials for CI/CD testing; real credentials here would be a security risk. Fix: use GitHub Secrets for sensitive credentials; if config files must exist, use placeholders only; scan repository history for accidentally committed credentials with tools like git-secrets or TruffleHog.
- Low · Missing Security Policy — repository root. No SECURITY.md documents how to report vulnerabilities responsibly. Fix: add a SECURITY.md with a vulnerability disclosure policy and contact information so researchers can report issues responsibly.
- Low · Benchmark Scripts May Require Hardcoded Credentials — benchmarks/ directory. Scripts such as benchmarks/tpch-*.py and benchmarks/ddos-*.py likely require database connection strings, which may end up hardcoded in script files. Fix: read credentials from environment variables rather than hardcoding them, and document secure setup procedures for running benchmarks.
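To illustrate the parameterized-statement recommendation in the SQL-injection item, here is a self-contained example using Python's stdlib sqlite3; connectorx itself accepts a finished SQL string, so sanitizing user input is the caller's responsibility:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
user_input = "x'; DROP TABLE users; --"  # hostile input

# Unsafe: concatenation lets the input rewrite the query.
#   conn.execute(f"SELECT * FROM users WHERE name = '{user_input}'")

# Safe: a bound parameter is treated strictly as data, never as SQL.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # []; the hostile string matched nothing and changed nothing
```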
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.