
Qovery/Replibyte

Seed your development database with real data ⚡️

Overall: Mixed — Slowing, last commit 9mo ago

Use as dependency — Concerns (weakest axis)

copyleft license (GPL-3.0) — review compatibility; no tests detected

Fork & modify — Healthy

Has a license and CI — a clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 9mo ago
  • 22+ active contributors
  • GPL-3.0 licensed
  • CI configured
  • Slowing — last commit 9mo ago
  • Concentrated ownership — top contributor handles 52% of recent commits
  • GPL-3.0 is copyleft — check downstream compatibility
  • No test directory detected
What would change the summary?
  • Use as dependency: Concerns → Mixed if relicensed under MIT/Apache-2.0 (rare for established libs)

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/qovery/replibyte?axis=fork)](https://repopilot.app/r/qovery/replibyte)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/qovery/replibyte on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: Qovery/Replibyte

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in the "Verify before trusting" section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Qovery/Replibyte shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Slowing — last commit 9mo ago

  • Last commit 9mo ago
  • 22+ active contributors
  • GPL-3.0 licensed
  • CI configured
  • ⚠ Slowing — last commit 9mo ago
  • ⚠ Concentrated ownership — top contributor handles 52% of recent commits
  • ⚠ GPL-3.0 is copyleft — check downstream compatibility
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live Qovery/Replibyte repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/Qovery/Replibyte.

What it runs against: a local clone of Qovery/Replibyte — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Qovery/Replibyte | Confirms the artifact applies here, not a fork |
| 2 | License is still GPL-3.0 | Catches a relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 303 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>Qovery/Replibyte</code></summary>
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Qovery/Replibyte. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/Qovery/Replibyte.git
#   cd Replibyte
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of Qovery/Replibyte and re-run."
  exit 2
fi

# 1. Repo identity (case-insensitive: GitHub URLs are often lowercased)
git remote get-url origin 2>/dev/null | grep -qiE "qovery/replibyte(\.git)?$" \
  && ok "origin remote is Qovery/Replibyte" \
  || miss "origin remote is not Qovery/Replibyte (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. Accepts either the SPDX id or the
#    full-text header, since GPL-3.0 LICENSE files open with the latter.
(grep -qiE "(GPL-3\.0|GNU GENERAL PUBLIC LICENSE)" LICENSE 2>/dev/null \
   || grep -qiE "license\s*=\s*\"GPL-3" Cargo.toml replibyte/Cargo.toml 2>/dev/null) \
  && ok "license is GPL-3.0" \
  || miss "license drift — was GPL-3.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in \
  "replibyte/src/main.rs" \
  "replibyte/src/config.rs" \
  "replibyte/src/connector.rs" \
  "dump-parser/src/lib.rs" \
  "replibyte/src/tasks/full_dump.rs"
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 303 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~273d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Qovery/Replibyte"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Replibyte is a Rust CLI tool that creates sanitized database dumps from production PostgreSQL, MySQL, and MongoDB instances, then restores them locally or remotely with sensitive data replaced by fake data. It handles compression (zlib), encryption (AES-256), database subsetting, and schema analysis — all in a stateless binary with no server required. It is a three-crate Rust workspace (Cargo.toml): dump-parser/ parses PostgreSQL/MySQL/MongoDB dumps (src/postgres, src/mysql, src/mongodb), replibyte/ is the CLI entrypoint, and subset/ handles database subsetting. Docker Compose files (docker-compose-postgres.yml, docker-compose-mysql.yml, etc.) provide quick test databases, and examples/ contains YAML configs showing source/destination and storage-bridge patterns (MinIO, GCP).

👥Who it's for

Backend engineers and DevOps teams who need realistic production-like data for development/testing without exposing PII. Specifically: developers seeding local databases during development, QA teams needing representative datasets, and infrastructure teams managing database snapshots across environments.

🌱Maturity & risk

Mature, but slowing: the last commit landed about nine months ago. The repo has a clean CI/CD setup with GitHub Actions (build-and-test.yml, on-release.yml, on-tag.yml), Docker publishing, and website deployment. It made the ROSS Index Q3 2022 list of fastest-growing open-source startups. The codebase is production-grade (~443KB of Rust) with examples and multi-database support already shipped.

Low risk for core functionality, but consider: the monorepo spans three crates (dump-parser, replibyte, subset), so changes ripple across them; Rust dependencies pinned in Cargo.lock should be audited; and the absence of an obvious issue backlog in the file list could mean either well-triaged or under-resourced for bugs. The main feature risk is WASM transformer support (examples/wasm), which is experimental.

Active areas of work

CI automation remains in place across multiple GitHub Actions workflows, though commit activity has slowed. The website.yml workflow suggests ongoing documentation updates. The presence of both bridge-minio and GCP datastore examples indicates expansion toward cloud storage integrations, and WASM transformer support in examples/ points to experimental feature development.

🚀Get running

Clone: git clone https://github.com/Qovery/replibyte && cd replibyte. Install Rust if needed. Build: cargo build --release. Run tests: cargo test. Spin up a test database: docker-compose -f docker-compose-postgres.yml up. See examples/replibyte.yaml for config syntax.

Daily commands: cargo run --release -- -c examples/replibyte.yaml dump create (after configuring a source database). Or replibyte -c conf.yaml dump restore local -v latest -i postgres -p 5432 to restore to a local container. See README usage section for full command list.
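The -c flag above points at a YAML config. A minimal sketch of its shape, modeled on the example files this doc names — the field names here are illustrative, so treat examples/replibyte.yaml in the repo as authoritative:

```yaml
# Illustrative sketch — verify every key against examples/replibyte.yaml.
source:
  connection_uri: postgres://user:password@localhost:5432/prod_db
  transformers:
    - database: public
      table: employees
      columns:
        - name: email
          transformer_name: email   # built-in fake-email transformer (assumed name)
datastore:
  aws:                              # or a MinIO/GCP bridge, per the bridge-minio examples
    bucket: my-dumps
    region: us-east-2
destination:
  connection_uri: postgres://user:password@localhost:5432/dev_db
```

Keep credentials out of version control; the Traps section below notes that this file should have 0600 permissions.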

🗺️Map of the codebase

  • replibyte/src/main.rs — Entry point for the CLI application; initializes the runtime and dispatches all commands
  • replibyte/src/config.rs — Core configuration parser and validation; defines the entire YAML schema that users interact with
  • replibyte/src/connector.rs — Central abstraction for database connections; coordinates source/destination/datastore interactions
  • dump-parser/src/lib.rs — Shared parsing library for SQL dumps; used by all database type handlers to extract and transform data
  • replibyte/src/tasks/full_dump.rs — Orchestrates the complete dump pipeline: source extraction → transformation → datastore storage
  • replibyte/src/tasks/full_restore.rs — Orchestrates the complete restore pipeline: datastore retrieval → transformation → destination injection
  • replibyte/src/transformer/mod.rs — Transformation engine abstraction; routes data through built-in and WASM transformers for PII masking

🛠️How to make changes

Add Support for a New Database Type

  1. Create a new parser module in dump-parser/src/{dbtype}/mod.rs to parse that database's dump format (dump-parser/src/lib.rs)
  2. Implement a source connector in replibyte/src/source/{dbtype}.rs that invokes the native dump tool (e.g., pg_dump, mysqldump) (replibyte/src/source/mod.rs)
  3. Implement a destination connector in replibyte/src/destination/{dbtype}.rs that restores data using the native restore tool (replibyte/src/destination/mod.rs)
  4. Update replibyte/src/config.rs to add the new database type to the Source and Destination enum variants (replibyte/src/config.rs)
  5. Add integration tests using docker-compose-{dbtype}.yml and examples/source-{dbtype}.yaml (examples/source-postgres.yaml)
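Steps 1-3 above can be sketched in code. This is a hypothetical shape only — the real Source trait lives in replibyte/src/source/mod.rs and its signature will differ — but it shows the pattern the existing connectors follow: shell out to the native dump tool and stream its stdout.

```rust
// Hypothetical sketch; the real trait is in replibyte/src/source/mod.rs.
use std::io::{BufRead, BufReader, Result};
use std::process::{Command, Stdio};

/// Assumed shape: a source streams dump lines to a callback.
pub trait Source {
    fn read(&self, on_line: &mut dyn FnMut(&str)) -> Result<()>;
}

/// A new database type wraps its native dump tool, mirroring how the
/// Postgres connector wraps pg_dump.
pub struct CommandSource {
    pub program: String,   // e.g. "sqlite3" for a hypothetical SQLite source
    pub args: Vec<String>, // e.g. ["test.db", ".dump"]
}

impl Source for CommandSource {
    fn read(&self, on_line: &mut dyn FnMut(&str)) -> Result<()> {
        let mut child = Command::new(&self.program)
            .args(&self.args)
            .stdout(Stdio::piped())
            .spawn()?;
        let stdout = child.stdout.take().expect("stdout was piped");
        // Stream line by line — never buffer the whole dump.
        for line in BufReader::new(stdout).lines() {
            on_line(&line?);
        }
        let _ = child.wait(); // reap the child process
        Ok(())
    }
}

fn main() -> Result<()> {
    // Demo with `echo` standing in for the real dump tool.
    let src = CommandSource {
        program: "echo".into(),
        args: vec!["CREATE TABLE t (id INT);".into()],
    };
    let mut lines = Vec::new();
    src.read(&mut |l: &str| lines.push(l.to_string()))?;
    println!("streamed {} line(s)", lines.len());
    Ok(())
}
```

The subprocess approach keeps the connector thin, at the cost of requiring the native tool on PATH (see Trade-offs below).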

Add a New Data Transformer for PII Masking

  1. Create a new transformer module at replibyte/src/transformer/{transform_name}.rs with masking logic (replibyte/src/transformer/credit_card.rs)
  2. Define the transformer configuration struct and implement detection/replacement logic (replibyte/src/config.rs)
  3. Register the transformer in the transformation pipeline within full_dump.rs and full_restore.rs (replibyte/src/tasks/full_dump.rs)
  4. Add a YAML example in examples/with-transformer-options.yaml showing configuration (examples/with-transformer-options.yaml)
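The masking logic in steps 1-2 might look like the sketch below. The trait name and signature are assumptions — the real abstraction lives in replibyte/src/transformer/mod.rs — but the key property holds for any transformer: it rewrites one column value at a time and keeps the output format-valid.

```rust
// Hypothetical sketch; the real trait is in replibyte/src/transformer/mod.rs.

/// Assumed shape: a transformer rewrites one column value at a time.
pub trait Transformer {
    fn id(&self) -> &str;
    fn transform(&self, value: &str) -> String;
}

/// Masks the local part of an email, keeping the first character and the
/// domain so downstream code that parses emails still works.
pub struct EmailMask;

impl Transformer for EmailMask {
    fn id(&self) -> &str {
        "email-mask"
    }

    fn transform(&self, value: &str) -> String {
        match value.split_once('@') {
            Some((local, domain)) if !local.is_empty() => {
                let first = local.chars().next().unwrap();
                let stars = "*".repeat(local.chars().count() - 1);
                format!("{first}{stars}@{domain}")
            }
            // Not an email — pass through unchanged rather than corrupt data.
            _ => value.to_string(),
        }
    }
}

fn main() {
    let t = EmailMask;
    println!("{}", t.transform("john.doe@example.com")); // j*******@example.com
}
```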

Add a New Datastore Backend (S3, GCS, Etc.)

  1. Create a new datastore implementation at replibyte/src/datastore/{backend_name}.rs (replibyte/src/datastore/s3.rs)
  2. Implement the Datastore trait (upload, download, list, delete operations) (replibyte/src/datastore/mod.rs)
  3. Update the config parser to add the new backend type to the DatastoreType enum (replibyte/src/config.rs)
  4. Add an example configuration in examples/ showing how to use the new datastore (examples/source-postgres-bridge-minio.yaml)
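The four operations named in step 2 can be sketched with an in-memory stand-in. The trait signatures are assumptions (the real trait is in replibyte/src/datastore/mod.rs); a real S3 or GCS backend would issue API calls where this one touches a HashMap.

```rust
// Hypothetical sketch; the real trait is in replibyte/src/datastore/mod.rs.
use std::collections::HashMap;

pub trait Datastore {
    fn upload(&mut self, key: &str, data: Vec<u8>);
    fn download(&self, key: &str) -> Option<&[u8]>;
    fn list(&self) -> Vec<String>;
    fn delete(&mut self, key: &str);
}

/// In-memory stand-in; useful as a test double for the pipeline code.
#[derive(Default)]
pub struct MemoryStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl Datastore for MemoryStore {
    fn upload(&mut self, key: &str, data: Vec<u8>) {
        self.blobs.insert(key.to_string(), data);
    }
    fn download(&self, key: &str) -> Option<&[u8]> {
        self.blobs.get(key).map(|v| v.as_slice())
    }
    fn list(&self) -> Vec<String> {
        let mut keys: Vec<_> = self.blobs.keys().cloned().collect();
        keys.sort(); // deterministic ordering for callers and tests
        keys
    }
    fn delete(&mut self, key: &str) {
        self.blobs.remove(key);
    }
}

fn main() {
    let mut store = MemoryStore::default();
    store.upload("dump-1", b"data".to_vec());
    println!("{:?}", store.list()); // ["dump-1"]
}
```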

🔧Why these technologies

  • Rust — Memory-safe systems language enables high-performance data processing with minimal overhead; essential for handling large database dumps without crashes or data corruption
  • Subprocess-based DB tools (pg_dump, mysqldump, mongodump) — Leverages battle-tested native tools rather than reimplementing dump logic; avoids DB-specific protocol complexity and maintains compatibility with latest DB versions
  • WASM for custom transformers — Sandboxed execution model allows users to define arbitrary data transformations without requiring Rust recompilation; maintains security and portability
  • Docker integration — Enables seeding ephemeral test databases in CI/CD pipelines without external DB infrastructure; simplifies local development and testing workflows

⚖️Trade-offs already made

  • Subprocess-based DB tool invocation instead of native driver libraries

    • Why: Keeps dependency footprint small and avoids deep DB protocol knowledge; subprocess calls are more robust to DB version drift
    • Consequence: Requires native tools (pg_dump, mysqldump) to be pre-installed in the runtime environment; adds I/O overhead from process spawning and shell command parsing
  • Single-pass streaming dump model instead of in-memory buffering

    • Why: Handles multi-terabyte databases without RAM constraints; enables on-the-fly transformation and compression
    • Consequence: Cannot perform complex multi-table joins or secondary analysis during dump; transformations must be row-by-row or dataset-wide without cross-referencing
  • YAML configuration files instead of programmatic API

    • Why: Simplifies user onboarding and enables version control of dump/restore workflows; human-readable and industry-standard
    • Consequence: Less flexible for complex conditional logic; limited IDE support; harder to reuse configurations across environments without templating
  • Pluggable datastore backends (S3, local, GCP) instead of single backend

    • Why: Accommodates diverse infrastructure choices (cloud, on-prem, air-gapped); enables portability across organizations
    • Consequence: Increased code complexity and testing surface area; each new backend requires full integration testing
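The single-pass streaming trade-off above can be made concrete: read the dump line by line, transform each row independently, and write it out without ever holding the full dump in memory. This toy sketch (the real pipeline lives in replibyte/src/tasks/) shows why cross-row joins are impossible — each row is gone once it's written.

```rust
// Toy sketch of the single-pass model; any per-row transform plugs in.
use std::io::{BufRead, Cursor, Write};

fn stream_transform<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    transform: impl Fn(&str) -> String,
) -> std::io::Result<()> {
    for line in input.lines() {
        // One row in, one row out — constant memory regardless of dump size.
        writeln!(output, "{}", transform(&line?))?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dump = "INSERT INTO users VALUES ('alice@example.com');\n\
                INSERT INTO users VALUES ('bob@example.com');\n";
    let mut out = Vec::new();
    // Toy transform standing in for a real PII masker.
    stream_transform(Cursor::new(dump), &mut out, |line| {
        line.replace("alice", "xxxxx").replace("bob", "xxx")
    })?;
    print!("{}", String::from_utf8(out).unwrap());
    Ok(())
}
```

Compression and encryption slot into the same shape as additional Write wrappers around the output.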

🚫Non-goals (don't propose these)

  • Real-time database replication or CDC (Change Data Capture) — this is a one-shot dump-and-restore tool, not a streaming pipeline

🪤Traps & gotchas

  • YAML config paths must match the examples/ patterns exactly (source.host, destination.host, storage.type); no validation docs are provided
  • Encryption is AES-256, but the key-derivation strategy is not documented in the visible files — check the source
  • Docker Compose files assume specific ports are free (5432 for Postgres, 27017 for MongoDB); adjust if they conflict
  • The WASM transformer path must be reachable at dump time
  • Database credentials are passed via the YAML config file — keep its permissions at 0600

🏗️Architecture

💡Concepts to learn

  • Data Anonymization & PII Masking — Core feature of Replibyte—understanding tokenization, hashing, and fake-data generation strategies is essential to safely use production data in dev
  • Database Dump Formats (SQL text, binary protocols) — Replibyte must parse PostgreSQL protocol, MySQL binary logs, and MongoDB BSON; understanding these formats explains why separate parsers exist in dump-parser/src/
  • Stream-based Processing & Zlib Compression — Replibyte handles >10GB databases by streaming rows and compressing on-the-fly rather than buffering; essential for memory efficiency
  • AES-256 Encryption at Rest — Encrypted dumps prevent data leaks in storage; understanding block cipher modes and key derivation helps reason about security guarantees
  • Database Subsetting (Row filtering) — The subset/ crate reduces production data to a smaller, representative sample; requires understanding foreign key constraints and referential integrity
  • WASM (WebAssembly) for Custom Transformers — Replibyte supports pluggable WASM modules for data transformation rules; understanding WASM bytecode and the Rust/WASM interface (examples/wasm) enables custom anonymization logic
  • Stateless CLI Architecture — Replibyte is a standalone binary with no server/daemon; understanding stateless design helps reason about idempotency, concurrency safety, and deployment simplicity
Related repos

  • getdbt/dbt-core — dbt handles database transformation and testing; complements Replibyte by allowing data quality validation on seeded databases
  • stripe/pg_chaosmonkey — Postgres-specific chaos testing tool; users of Replibyte often need to test failure scenarios after seeding
  • ankane/pgsync — Postgres-only data sync tool; Replibyte's multi-DB alternative with added encryption/compression
  • Qovery/engine — Same org's infrastructure engine; Replibyte integrates as database seeding component in Qovery's deployment pipelines
  • opencontainers/image-spec — Replibyte's Docker publishing workflows (publish-image.yaml) depend on OCI image standards; relevant for CI/CD contributors

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for dump-parser with real database outputs

The dump-parser crate has separate modules for postgres, mysql, and mongodb (dump-parser/src/postgres/mod.rs, dump-parser/src/mysql/mod.rs, dump-parser/src/mongodb/mod.rs) but there are no visible test files. The repo includes real database dumps in db/postgres/, db/mysql/, and db/mongodb/ that should be used to validate the parser against actual production-like data. This would catch regressions and ensure data integrity across database types.

  • [ ] Create dump-parser/tests/postgres_parser_test.rs using db/postgres/fulldump.sql and db/postgres/fulldump-with-inserts.sql
  • [ ] Create dump-parser/tests/mysql_parser_test.rs using db/mysql/world.sql
  • [ ] Create dump-parser/tests/mongodb_parser_test.rs using db/mongodb/init-mongo.js
  • [ ] Add test assertions verifying correct parsing of tables, schemas, and data transformations
  • [ ] Update CI workflow .github/workflows/build-and-test.yml to run these new integration tests
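A golden-file test for the first checklist item might look like the sketch below. The dump-parser public API is not shown in this artifact, so `count_statements` is a naive stand-in for the real entry point in dump-parser/src/lib.rs; the pattern — feed a committed fixture through the parser and assert on the result — is what matters.

```rust
// Sketch of a golden-file test. `count_statements` is a hypothetical
// stand-in for the real parser entry point in dump-parser/src/lib.rs.
fn count_statements(sql: &str) -> usize {
    // Naive stand-in: count non-comment lines ending in ';'.
    sql.lines()
        .map(str::trim)
        .filter(|l| !l.starts_with("--") && l.ends_with(';'))
        .count()
}

fn main() {
    // A real test would load the committed fixture instead:
    //   let fixture = std::fs::read_to_string("db/postgres/fulldump.sql").unwrap();
    let fixture = "-- schema\nCREATE TABLE t (id INT);\nINSERT INTO t VALUES (1);\n";
    println!("fixture parsed: {} statements", count_statements(fixture));
}
```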

Add security-focused documentation and validation examples for transformer functions

The repo emphasizes 'keeping sensitive data safe' in the README and examples include transformers (examples/with-transformer-options.yaml, examples/wasm/wasm-transformer-reverse-string.wasm), but there is no dedicated security guide documenting best practices for PII redaction patterns. Creating a comprehensive transformer examples file with common patterns (email masking, phone redaction, credit card anonymization) would help contributors build safer data pipelines and reduce security misconfiguration risks.

  • [ ] Create docs/TRANSFORMERS_GUIDE.md documenting transformer capabilities and security best practices
  • [ ] Add 3-5 concrete transformer examples (email masking, phone number anonymization, UUID replacement) to examples/ directory with .yaml configs
  • [ ] Create examples/transformers/common-pii-patterns.wasm showing safe redaction implementations
  • [ ] Link the new guide from README.md in the 'Usage' or 'Examples' section
  • [ ] Update docs/DESIGN.md with a 'Security Considerations' subsection referencing the transformers guide

Add GitHub Actions workflow for validating example configurations against schema

The examples/ directory contains 11+ YAML configuration files (source-postgres.yaml, with-encryption.yaml, source-and-dest-mongodb-bridge-minio.yaml, etc.) but there's no CI validation ensuring these configs remain valid. A configuration validation workflow would catch breaking changes when the config schema evolves and serve as end-to-end verification that all documented examples actually work.

  • [ ] Create .github/workflows/validate-examples.yml that runs on pull requests
  • [ ] Use replibyte's config parsing to validate each .yaml file in examples/ directory against the current schema
  • [ ] Add checks for required fields, valid database types (postgres/mysql/mongodb), and datastore configurations
  • [ ] Ensure workflow fails if any example config is invalid, with clear error messages
  • [ ] Document this validation in CONTRIBUTING.md (or create it if missing) so contributors know to test examples
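A sketch of what that workflow could look like. Job and step names are illustrative, and the validation step assumes replibyte gains (or already has) a way to parse a config without running it — check the CLI before relying on any flag shown here:

```yaml
# Sketch of .github/workflows/validate-examples.yml — illustrative only.
name: validate-examples
on: pull_request
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Build replibyte
        run: cargo build --release
      - name: Parse every example config
        run: |
          for f in examples/*.yaml; do
            echo "checking $f"
            # Hypothetical: substitute the CLI's actual dry-run/validate
            # mode here once identified.
            ./target/release/replibyte -c "$f" --help >/dev/null || exit 1
          done
```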

🌿Good first issues

  • Add integration tests for dump-parser/src/mysql/mod.rs (exists but likely lacks coverage compared to postgres). See dump-parser/src/postgres/mod.rs patterns and write equivalent tests in a tests/ directory.
  • Document the WASM transformer API in docs/DESIGN.md. The examples/wasm directory exists but main docs don't explain how to author custom transformers; write a tutorial with a concrete example (e.g., email masking WASM module).
  • Add SQLite support to dump-parser/src/ following the PostgreSQL/MySQL/MongoDB pattern. Create dump-parser/src/sqlite/mod.rs and register it in dump-parser/src/lib.rs. SQLite is single-file so implementation should be simpler than PostgreSQL.


📝Recent commits

  • 5504db9 — Parse passwords with special chars correctly (#291) (pm-trey)
  • 1476dd7 — fix: handle multi-byte chars on redacted transformer (#279) (pepoviola)
  • 15ba775 — chore: update doc - deploy replibyte with qovery (evoxmusic)
  • d6b35a7 — Correct filename (#238) (sondrelg)
  • 7fe0609 — fix: Add utf-8 parsing error handling (#236) (sondrelg)
  • da8cd34 — Update README.md (evoxmusic)
  • bf2d5d5 — update Cargo.lock (evoxmusic)
  • 8a6a519 — Release v0.10.0 (#229) (evoxmusic)
  • 7714dd0 — feat: add a new sub command 'source' to print the database schema (#226) (Filip Kieres)
  • 8a918eb — fix(docs): add credentials item to Configuration page (#222) (skyline93)

🔒Security observations

  • High · Outdated Rust Base Image — Dockerfile. Dockerfile uses rust:1.59-buster which is from February 2022 and contains known vulnerabilities. The final stage uses debian:buster-slim which is also outdated (released 2019, EOL June 2024). Fix: Update to latest stable Rust image (1.75+) and use debian:bookworm-slim or bullseye-slim for the final stage. Implement regular image updates in CI/CD pipeline.
  • High · Missing Container Security Context — Dockerfile. Dockerfile does not define security context settings. The final image lacks USER specification, meaning the application runs as root by default, violating principle of least privilege. Fix: Add 'USER' directive with a non-root user account. Include security options like 'RUN groupadd -r appuser && useradd -r -g appuser appuser' and 'USER appuser'.
  • Medium · Unnecessary Build Artifacts in Final Image — Dockerfile. Dockerfile uses multi-stage build but the final image inherits potential build artifacts and unnecessary libraries from debian:buster-slim. No explicit cleanup of cache or unnecessary packages. Fix: Add 'RUN apt-get clean && rm -rf /var/lib/apt/lists/*' before finalizing. Consider using distroless base images for reduced attack surface.
  • Medium · Missing HEALTHCHECK Directive — Dockerfile. Dockerfile lacks HEALTHCHECK instruction, making it difficult to detect container health issues in orchestration platforms. Fix: Add HEALTHCHECK directive to monitor application health: 'HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 CMD ...'
  • Medium · Hardcoded Database Initialization Scripts — db/mysql/world.sql, db/postgres/*.sql, db/mongodb/init-mongo.js. Multiple SQL initialization files in db/ directories (world.sql, northwind.sql, etc.) appear to be committed with potentially sensitive test data or structure information. Fix: Review all test data files for sensitive information. Consider using .gitignore to exclude large database dumps. Document which files are safe for version control.
  • Medium · Missing Input Validation on Configuration Files — replibyte/src/cli.rs, replibyte/src/config.rs, examples/replibyte.yaml. Multiple YAML configuration examples reference database credentials and connection strings. No visible validation mechanism in CLI parser for config inputs. Fix: Implement strict YAML schema validation. Sanitize all configuration inputs. Use environment variables for sensitive credentials rather than config files.
  • Medium · Potential SQL Injection in Dump Parsing — dump-parser/src/mysql/mod.rs, dump-parser/src/postgres/mod.rs, replibyte/src/connector.rs. dump-parser module processes SQL dumps from mysql, postgres without visible sanitization. Dynamic query construction risk in migrations and connectors. Fix: Use parameterized queries and prepared statements exclusively. Implement input validation on all parsed SQL. Add security-focused unit tests for injection attempts.
  • Medium · Insufficient Encryption Configuration — examples/with-encryption.yaml, replibyte/src/datastore/s3.rs. Example 'with-encryption.yaml' shows encryption is optional. No enforcement of encryption for sensitive data at rest or in transit. Fix: Make encryption mandatory for production use cases. Implement TLS for all database connections. Use envelope encryption for data stored in S3 and local disk.
  • Low · Missing SBOM and Dependency Scanning — .github/workflows/build-and-test.yml. No evidence of Software Bill of Materials (SBOM) generation or automated dependency vulnerability scanning in CI/CD workflows. Fix: Integrate cargo-audit or OWASP Dependency-Check in CI/CD. Generate SBOM using tools like cargo-sbom. Fail builds on critical vulnerabilities.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
