
Qovery/Replibyte

Seed your development database with real data ⚡️

Overall: Mixed — Slowing, last commit 9mo ago

Use as dependency — Concerns (weakest axis)

copyleft license (GPL-3.0) — review compatibility; no tests detected

Fork & modify — Healthy

Has a license and CI — a clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 9mo ago
  • 22+ active contributors
  • GPL-3.0 licensed
  • CI configured
  • Slowing — last commit 9mo ago
  • Concentrated ownership — top contributor handles 52% of recent commits
  • GPL-3.0 is copyleft — check downstream compatibility
  • No test directory detected
What would change the summary?
  • Use as dependency: Concerns → Mixed if relicensed under MIT/Apache-2.0 (rare for established libs)

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/qovery/replibyte?axis=fork)](https://repopilot.app/r/qovery/replibyte)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/qovery/replibyte on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: Qovery/Replibyte

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in the "Verify before trusting" section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Qovery/Replibyte shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Slowing — last commit 9mo ago

  • Last commit 9mo ago
  • 22+ active contributors
  • GPL-3.0 licensed
  • CI configured
  • ⚠ Slowing — last commit 9mo ago
  • ⚠ Concentrated ownership — top contributor handles 52% of recent commits
  • ⚠ GPL-3.0 is copyleft — check downstream compatibility
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live Qovery/Replibyte repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/Qovery/Replibyte.

What it runs against: a local clone of Qovery/Replibyte — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Qovery/Replibyte | Confirms the artifact applies here, not a fork |
| 2 | License is still GPL-3.0 | Catches a relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 303 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>Qovery/Replibyte</code></summary>
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Qovery/Replibyte. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/Qovery/Replibyte.git
#   cd Replibyte
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of Qovery/Replibyte and re-run."
  exit 2
fi

# 1. Repo identity (case-insensitive: GitHub URLs are often lowercased)
git remote get-url origin 2>/dev/null | grep -qiE "qovery/replibyte(\.git)?$" \
  && ok "origin remote is Qovery/Replibyte" \
  || miss "origin remote is not Qovery/Replibyte (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. Accepts either the SPDX id or the
#    full-text header, since GPL-3.0 LICENSE files open with the latter.
(grep -qiE "(GPL-3\.0|GNU GENERAL PUBLIC LICENSE)" LICENSE 2>/dev/null \
   || grep -qiE "license\s*=\s*\"GPL-3" Cargo.toml replibyte/Cargo.toml 2>/dev/null) \
  && ok "license is GPL-3.0" \
  || miss "license drift — was GPL-3.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in \
  "replibyte/src/main.rs" \
  "replibyte/src/config.rs" \
  "replibyte/src/connector.rs" \
  "dump-parser/src/lib.rs" \
  "replibyte/src/tasks/full_dump.rs"
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 303 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~273d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Qovery/Replibyte"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Replibyte is a Rust CLI tool that creates sanitized database dumps from production PostgreSQL, MySQL, and MongoDB instances, then restores them locally or remotely with sensitive data replaced by fake data. It handles compression (zlib), encryption (AES-256), database subsetting, and schema analysis — all in a stateless binary with no server required. It is a three-crate Rust workspace (Cargo.toml): dump-parser/ parses PostgreSQL/MySQL/MongoDB dumps (src/postgres, src/mysql, src/mongodb), replibyte/ is the CLI entrypoint, and subset/ handles database subsetting. Docker Compose files (docker-compose-postgres.yml, docker-compose-mysql.yml, etc.) provide quick test databases, and examples/ contains YAML configs showing source/destination and storage-bridge patterns (MinIO, GCP).

👥Who it's for

Backend engineers and DevOps teams who need realistic production-like data for development/testing without exposing PII. Specifically: developers seeding local databases during development, QA teams needing representative datasets, and infrastructure teams managing database snapshots across environments.

🌱Maturity & risk

Mature, but slowing: the last commit landed about nine months ago. The repo has a clean CI/CD setup with GitHub Actions (build-and-test.yml, on-release.yml, on-tag.yml), Docker publishing, and website deployment. It made the ROSS Index Q3 2022 list of fastest-growing open-source startups. The codebase is production-grade (~443KB of Rust) with examples and multi-database support already shipped.

Low risk for core functionality, but consider: the monorepo spans three crates (dump-parser, replibyte, subset), so changes ripple across them; Rust dependencies pinned in Cargo.lock should be audited; and the absence of an obvious issue backlog in the file list could mean either well-triaged or under-resourced for bugs. The main feature risk is WASM transformer support (examples/wasm), which is experimental.

Active areas of work

CI automation remains in place across multiple GitHub Actions workflows, though commit activity has slowed. The website.yml workflow suggests ongoing documentation updates. The presence of both bridge-minio and GCP datastore examples indicates expansion toward cloud storage integrations, and WASM transformer support in examples/ points to experimental feature development.

🚀Get running

Clone: git clone https://github.com/Qovery/replibyte && cd replibyte. Install Rust if needed. Build: cargo build --release. Run tests: cargo test. Spin up a test database: docker-compose -f docker-compose-postgres.yml up. See examples/replibyte.yaml for config syntax.

Daily commands: cargo run --release -- -c examples/replibyte.yaml dump create (after configuring a source database). Or replibyte -c conf.yaml dump restore local -v latest -i postgres -p 5432 to restore to a local container. See README usage section for full command list.
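The -c flag above points at a YAML config. A minimal sketch of its shape, modeled on the example files this doc names — the field names here are illustrative, so treat examples/replibyte.yaml in the repo as authoritative:

```yaml
# Illustrative sketch — verify every key against examples/replibyte.yaml.
source:
  connection_uri: postgres://user:password@localhost:5432/prod_db
  transformers:
    - database: public
      table: employees
      columns:
        - name: email
          transformer_name: email   # built-in fake-email transformer (assumed name)
datastore:
  aws:                              # or a MinIO/GCP bridge, per the bridge-minio examples
    bucket: my-dumps
    region: us-east-2
destination:
  connection_uri: postgres://user:password@localhost:5432/dev_db
```

Keep credentials out of version control; the Traps section below notes that this file should have 0600 permissions.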

🗺️Map of the codebase

  • replibyte/src/main.rs — Entry point for the CLI application; initializes the runtime and dispatches all commands
  • replibyte/src/config.rs — Core configuration parser and validation; defines the entire YAML schema that users interact with
  • replibyte/src/connector.rs — Central abstraction for database connections; coordinates source/destination/datastore interactions
  • dump-parser/src/lib.rs — Shared parsing library for SQL dumps; used by all database type handlers to extract and transform data
  • replibyte/src/tasks/full_dump.rs — Orchestrates the complete dump pipeline: source extraction → transformation → datastore storage
  • replibyte/src/tasks/full_restore.rs — Orchestrates the complete restore pipeline: datastore retrieval → transformation → destination injection
  • replibyte/src/transformer/mod.rs — Transformation engine abstraction; routes data through built-in and WASM transformers for PII masking

🛠️How to make changes

Add Support for a New Database Type

  1. Create a new parser module in dump-parser/src/{dbtype}/mod.rs to parse that database's dump format (dump-parser/src/lib.rs)
  2. Implement a source connector in replibyte/src/source/{dbtype}.rs that invokes the native dump tool (e.g., pg_dump, mysqldump) (replibyte/src/source/mod.rs)
  3. Implement a destination connector in replibyte/src/destination/{dbtype}.rs that restores data using the native restore tool (replibyte/src/destination/mod.rs)
  4. Update replibyte/src/config.rs to add the new database type to the Source and Destination enum variants (replibyte/src/config.rs)
  5. Add integration tests using docker-compose-{dbtype}.yml and examples/source-{dbtype}.yaml (examples/source-postgres.yaml)
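Steps 1-3 above can be sketched in code. This is a hypothetical shape only — the real Source trait lives in replibyte/src/source/mod.rs and its signature will differ — but it shows the pattern the existing connectors follow: shell out to the native dump tool and stream its stdout.

```rust
// Hypothetical sketch; the real trait is in replibyte/src/source/mod.rs.
use std::io::{BufRead, BufReader, Result};
use std::process::{Command, Stdio};

/// Assumed shape: a source streams dump lines to a callback.
pub trait Source {
    fn read(&self, on_line: &mut dyn FnMut(&str)) -> Result<()>;
}

/// A new database type wraps its native dump tool, mirroring how the
/// Postgres connector wraps pg_dump.
pub struct CommandSource {
    pub program: String,   // e.g. "sqlite3" for a hypothetical SQLite source
    pub args: Vec<String>, // e.g. ["test.db", ".dump"]
}

impl Source for CommandSource {
    fn read(&self, on_line: &mut dyn FnMut(&str)) -> Result<()> {
        let mut child = Command::new(&self.program)
            .args(&self.args)
            .stdout(Stdio::piped())
            .spawn()?;
        let stdout = child.stdout.take().expect("stdout was piped");
        // Stream line by line — never buffer the whole dump.
        for line in BufReader::new(stdout).lines() {
            on_line(&line?);
        }
        let _ = child.wait(); // reap the child process
        Ok(())
    }
}

fn main() -> Result<()> {
    // Demo with `echo` standing in for the real dump tool.
    let src = CommandSource {
        program: "echo".into(),
        args: vec!["CREATE TABLE t (id INT);".into()],
    };
    let mut lines = Vec::new();
    src.read(&mut |l: &str| lines.push(l.to_string()))?;
    println!("streamed {} line(s)", lines.len());
    Ok(())
}
```

The subprocess approach keeps the connector thin, at the cost of requiring the native tool on PATH (see Trade-offs below).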

Add a New Data Transformer for PII Masking

  1. Create a new transformer module at replibyte/src/transformer/{transform_name}.rs with masking logic (replibyte/src/transformer/credit_card.rs)
  2. Define the transformer configuration struct and implement detection/replacement logic (replibyte/src/config.rs)
  3. Register the transformer in the transformation pipeline within full_dump.rs and full_restore.rs (replibyte/src/tasks/full_dump.rs)
  4. Add a YAML example in examples/with-transformer-options.yaml showing configuration (examples/with-transformer-options.yaml)
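The masking logic in steps 1-2 might look like the sketch below. The trait name and signature are assumptions — the real abstraction lives in replibyte/src/transformer/mod.rs — but the key property holds for any transformer: it rewrites one column value at a time and keeps the output format-valid.

```rust
// Hypothetical sketch; the real trait is in replibyte/src/transformer/mod.rs.

/// Assumed shape: a transformer rewrites one column value at a time.
pub trait Transformer {
    fn id(&self) -> &str;
    fn transform(&self, value: &str) -> String;
}

/// Masks the local part of an email, keeping the first character and the
/// domain so downstream code that parses emails still works.
pub struct EmailMask;

impl Transformer for EmailMask {
    fn id(&self) -> &str {
        "email-mask"
    }

    fn transform(&self, value: &str) -> String {
        match value.split_once('@') {
            Some((local, domain)) if !local.is_empty() => {
                let first = local.chars().next().unwrap();
                let stars = "*".repeat(local.chars().count() - 1);
                format!("{first}{stars}@{domain}")
            }
            // Not an email — pass through unchanged rather than corrupt data.
            _ => value.to_string(),
        }
    }
}

fn main() {
    let t = EmailMask;
    println!("{}", t.transform("john.doe@example.com")); // j*******@example.com
}
```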

Add a New Datastore Backend (S3, GCS, Etc.)

  1. Create a new datastore implementation at replibyte/src/datastore/{backend_name}.rs (replibyte/src/datastore/s3.rs)
  2. Implement the Datastore trait (upload, download, list, delete operations) (replibyte/src/datastore/mod.rs)
  3. Update the config parser to add the new backend type to the DatastoreType enum (replibyte/src/config.rs)
  4. Add an example configuration in examples/ showing how to use the new datastore (examples/source-postgres-bridge-minio.yaml)
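The four operations named in step 2 can be sketched with an in-memory stand-in. The trait signatures are assumptions (the real trait is in replibyte/src/datastore/mod.rs); a real S3 or GCS backend would issue API calls where this one touches a HashMap.

```rust
// Hypothetical sketch; the real trait is in replibyte/src/datastore/mod.rs.
use std::collections::HashMap;

pub trait Datastore {
    fn upload(&mut self, key: &str, data: Vec<u8>);
    fn download(&self, key: &str) -> Option<&[u8]>;
    fn list(&self) -> Vec<String>;
    fn delete(&mut self, key: &str);
}

/// In-memory stand-in; useful as a test double for the pipeline code.
#[derive(Default)]
pub struct MemoryStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl Datastore for MemoryStore {
    fn upload(&mut self, key: &str, data: Vec<u8>) {
        self.blobs.insert(key.to_string(), data);
    }
    fn download(&self, key: &str) -> Option<&[u8]> {
        self.blobs.get(key).map(|v| v.as_slice())
    }
    fn list(&self) -> Vec<String> {
        let mut keys: Vec<_> = self.blobs.keys().cloned().collect();
        keys.sort(); // deterministic ordering for callers and tests
        keys
    }
    fn delete(&mut self, key: &str) {
        self.blobs.remove(key);
    }
}

fn main() {
    let mut store = MemoryStore::default();
    store.upload("dump-1", b"data".to_vec());
    println!("{:?}", store.list()); // ["dump-1"]
}
```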

🔧Why these technologies

  • Rust — Memory-safe systems language enables high-performance data processing with minimal overhead; essential for handling large database dumps without crashes or data corruption
  • Subprocess-based DB tools (pg_dump, mysqldump, mongodump) — Leverages battle-tested native tools rather than reimplementing dump logic; avoids DB-specific protocol complexity and maintains compatibility with latest DB versions
  • WASM for custom transformers — Sandboxed execution model allows users to define arbitrary data transformations without requiring Rust recompilation; maintains security and portability
  • Docker integration — Enables seeding ephemeral test databases in CI/CD pipelines without external DB infrastructure; simplifies local development and testing workflows

⚖️Trade-offs already made

  • Subprocess-based DB tool invocation instead of native driver libraries

    • Why: Keeps dependency footprint small and avoids deep DB protocol knowledge; subprocess calls are more robust to DB version drift
    • Consequence: Requires native tools (pg_dump, mysqldump) to be pre-installed in the runtime environment; adds I/O overhead from process spawning and shell command parsing
  • Single-pass streaming dump model instead of in-memory buffering

    • Why: Handles multi-terabyte databases without RAM constraints; enables on-the-fly transformation and compression
    • Consequence: Cannot perform complex multi-table joins or secondary analysis during dump; transformations must be row-by-row or dataset-wide without cross-referencing
  • YAML configuration files instead of programmatic API

    • Why: Simplifies user onboarding and enables version control of dump/restore workflows; human-readable and industry-standard
    • Consequence: Less flexible for complex conditional logic; limited IDE support; harder to reuse configurations across environments without templating
  • Pluggable datastore backends (S3, local, GCP) instead of single backend

    • Why: Accommodates diverse infrastructure choices (cloud, on-prem, air-gapped); enables portability across organizations
    • Consequence: Increased code complexity and testing surface area; each new backend requires full integration testing
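The single-pass streaming trade-off above can be made concrete: read the dump line by line, transform each row independently, and write it out without ever holding the full dump in memory. This toy sketch (the real pipeline lives in replibyte/src/tasks/) shows why cross-row joins are impossible — each row is gone once it's written.

```rust
// Toy sketch of the single-pass model; any per-row transform plugs in.
use std::io::{BufRead, Cursor, Write};

fn stream_transform<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    transform: impl Fn(&str) -> String,
) -> std::io::Result<()> {
    for line in input.lines() {
        // One row in, one row out — constant memory regardless of dump size.
        writeln!(output, "{}", transform(&line?))?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dump = "INSERT INTO users VALUES ('alice@example.com');\n\
                INSERT INTO users VALUES ('bob@example.com');\n";
    let mut out = Vec::new();
    // Toy transform standing in for a real PII masker.
    stream_transform(Cursor::new(dump), &mut out, |line| {
        line.replace("alice", "xxxxx").replace("bob", "xxx")
    })?;
    print!("{}", String::from_utf8(out).unwrap());
    Ok(())
}
```

Compression and encryption slot into the same shape as additional Write wrappers around the output.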

🚫Non-goals (don't propose these)

  • Real-time database replication or CDC (Change Data Capture) — this is a one-shot dump-and-restore tool, not a streaming pipeline

🪤Traps & gotchas

  • YAML config paths must match the examples/ patterns exactly (source.host, destination.host, storage.type); no validation docs are provided
  • Encryption is AES-256, but the key-derivation strategy is not documented in the visible files — check the source
  • Docker Compose files assume specific ports are free (5432 for Postgres, 27017 for MongoDB); adjust if they conflict
  • The WASM transformer path must be reachable at dump time
  • Database credentials are passed via the YAML config file — keep its permissions at 0600

🏗️Architecture

💡Concepts to learn

  • Data Anonymization & PII Masking — Core feature of Replibyte—understanding tokenization, hashing, and fake-data generation strategies is essential to safely use production data in dev
  • Database Dump Formats (SQL text, binary protocols) — Replibyte must parse PostgreSQL protocol, MySQL binary logs, and MongoDB BSON; understanding these formats explains why separate parsers exist in dump-parser/src/
  • Stream-based Processing & Zlib Compression — Replibyte handles >10GB databases by streaming rows and compressing on-the-fly rather than buffering; essential for memory efficiency
  • AES-256 Encryption at Rest — Encrypted dumps prevent data leaks in storage; understanding block cipher modes and key derivation helps reason about security guarantees
  • Database Subsetting (Row filtering) — The subset/ crate reduces production data to a smaller, representative sample; requires understanding foreign key constraints and referential integrity
  • WASM (WebAssembly) for Custom Transformers — Replibyte supports pluggable WASM modules for data transformation rules; understanding WASM bytecode and the Rust/WASM interface (examples/wasm) enables custom anonymization logic
  • Stateless CLI Architecture — Replibyte is a standalone binary with no server/daemon; understanding stateless design helps reason about idempotency, concurrency safety, and deployment simplicity
Related repos

  • getdbt/dbt-core — dbt handles database transformation and testing; complements Replibyte by allowing data quality validation on seeded databases
  • stripe/pg_chaosmonkey — Postgres-specific chaos testing tool; users of Replibyte often need to test failure scenarios after seeding
  • ankane/pgsync — Postgres-only data sync tool; Replibyte's multi-DB alternative with added encryption/compression
  • Qovery/engine — Same org's infrastructure engine; Replibyte integrates as database seeding component in Qovery's deployment pipelines
  • opencontainers/image-spec — Replibyte's Docker publishing workflows (publish-image.yaml) depend on OCI image standards; relevant for CI/CD contributors

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for dump-parser with real database outputs

The dump-parser crate has separate modules for postgres, mysql, and mongodb (dump-parser/src/postgres/mod.rs, dump-parser/src/mysql/mod.rs, dump-parser/src/mongodb/mod.rs) but there are no visible test files. The repo includes real database dumps in db/postgres/, db/mysql/, and db/mongodb/ that should be used to validate the parser against actual production-like data. This would catch regressions and ensure data integrity across database types.

  • [ ] Create dump-parser/tests/postgres_parser_test.rs using db/postgres/fulldump.sql and db/postgres/fulldump-with-inserts.sql
  • [ ] Create dump-parser/tests/mysql_parser_test.rs using db/mysql/world.sql
  • [ ] Create dump-parser/tests/mongodb_parser_test.rs using db/mongodb/init-mongo.js
  • [ ] Add test assertions verifying correct parsing of tables, schemas, and data transformations
  • [ ] Update CI workflow .github/workflows/build-and-test.yml to run these new integration tests
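A golden-file test for the first checklist item might look like the sketch below. The dump-parser public API is not shown in this artifact, so `count_statements` is a naive stand-in for the real entry point in dump-parser/src/lib.rs; the pattern — feed a committed fixture through the parser and assert on the result — is what matters.

```rust
// Sketch of a golden-file test. `count_statements` is a hypothetical
// stand-in for the real parser entry point in dump-parser/src/lib.rs.
fn count_statements(sql: &str) -> usize {
    // Naive stand-in: count non-comment lines ending in ';'.
    sql.lines()
        .map(str::trim)
        .filter(|l| !l.starts_with("--") && l.ends_with(';'))
        .count()
}

fn main() {
    // A real test would load the committed fixture instead:
    //   let fixture = std::fs::read_to_string("db/postgres/fulldump.sql").unwrap();
    let fixture = "-- schema\nCREATE TABLE t (id INT);\nINSERT INTO t VALUES (1);\n";
    println!("fixture parsed: {} statements", count_statements(fixture));
}
```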

Add security-focused documentation and validation examples for transformer functions

The repo emphasizes 'keeping sensitive data safe' in the README and examples include transformers (examples/with-transformer-options.yaml, examples/wasm/wasm-transformer-reverse-string.wasm), but there is no dedicated security guide documenting best practices for PII redaction patterns. Creating a comprehensive transformer examples file with common patterns (email masking, phone redaction, credit card anonymization) would help contributors build safer data pipelines and reduce security misconfiguration risks.

  • [ ] Create docs/TRANSFORMERS_GUIDE.md documenting transformer capabilities and security best practices
  • [ ] Add 3-5 concrete transformer examples (email masking, phone number anonymization, UUID replacement) to examples/ directory with .yaml configs
  • [ ] Create examples/transformers/common-pii-patterns.wasm showing safe redaction implementations
  • [ ] Link the new guide from README.md in the 'Usage' or 'Examples' section
  • [ ] Update docs/DESIGN.md with a 'Security Considerations' subsection referencing the transformers guide

Add GitHub Actions workflow for validating example configurations against schema

The examples/ directory contains 11+ YAML configuration files (source-postgres.yaml, with-encryption.yaml, source-and-dest-mongodb-bridge-minio.yaml, etc.) but there's no CI validation ensuring these configs remain valid. A configuration validation workflow would catch breaking changes when the config schema evolves and serve as end-to-end verification that all documented examples actually work.

  • [ ] Create .github/workflows/validate-examples.yml that runs on pull requests
  • [ ] Use replibyte's config parsing to validate each .yaml file in examples/ directory against the current schema
  • [ ] Add checks for required fields, valid database types (postgres/mysql/mongodb), and datastore configurations
  • [ ] Ensure workflow fails if any example config is invalid, with clear error messages
  • [ ] Document this validation in CONTRIBUTING.md (or create it if missing) so contributors know to test examples
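A sketch of what that workflow could look like. Job and step names are illustrative, and the validation step assumes replibyte gains (or already has) a way to parse a config without running it — check the CLI before relying on any flag shown here:

```yaml
# Sketch of .github/workflows/validate-examples.yml — illustrative only.
name: validate-examples
on: pull_request
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Build replibyte
        run: cargo build --release
      - name: Parse every example config
        run: |
          for f in examples/*.yaml; do
            echo "checking $f"
            # Hypothetical: substitute the CLI's actual dry-run/validate
            # mode here once identified.
            ./target/release/replibyte -c "$f" --help >/dev/null || exit 1
          done
```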

🌿Good first issues

  • Add integration tests for dump-parser/src/mysql/mod.rs (exists but likely lacks coverage compared to postgres). See dump-parser/src/postgres/mod.rs patterns and write equivalent tests in a tests/ directory.
  • Document the WASM transformer API in docs/DESIGN.md. The examples/wasm directory exists but main docs don't explain how to author custom transformers; write a tutorial with a concrete example (e.g., email masking WASM module).
  • Add SQLite support to dump-parser/src/ following the PostgreSQL/MySQL/MongoDB pattern. Create dump-parser/src/sqlite/mod.rs and register it in dump-parser/src/lib.rs. SQLite is single-file so implementation should be simpler than PostgreSQL.


📝Recent commits

  • 5504db9 — Parse passwords with special chars correctly (#291) (pm-trey)
  • 1476dd7 — fix: handle multi-byte chars on redacted transformer (#279) (pepoviola)
  • 15ba775 — chore: update doc - deploy replibyte with qovery (evoxmusic)
  • d6b35a7 — Correct filename (#238) (sondrelg)
  • 7fe0609 — fix: Add utf-8 parsing error handling (#236) (sondrelg)
  • da8cd34 — Update README.md (evoxmusic)
  • bf2d5d5 — update Cargo.lock (evoxmusic)
  • 8a6a519 — Release v0.10.0 (#229) (evoxmusic)
  • 7714dd0 — feat: add a new sub command 'source' to print the database schema (#226) (Filip Kieres)
  • 8a918eb — fix(docs): add credentials item to Configuration page (#222) (skyline93)

🔒Security observations

  • High · Outdated Rust Base Image — Dockerfile. Dockerfile uses rust:1.59-buster which is from February 2022 and contains known vulnerabilities. The final stage uses debian:buster-slim which is also outdated (released 2019, EOL June 2024). Fix: Update to latest stable Rust image (1.75+) and use debian:bookworm-slim or bullseye-slim for the final stage. Implement regular image updates in CI/CD pipeline.
  • High · Missing Container Security Context — Dockerfile. Dockerfile does not define security context settings. The final image lacks USER specification, meaning the application runs as root by default, violating principle of least privilege. Fix: Add 'USER' directive with a non-root user account. Include security options like 'RUN groupadd -r appuser && useradd -r -g appuser appuser' and 'USER appuser'.
  • Medium · Unnecessary Build Artifacts in Final Image — Dockerfile. Dockerfile uses multi-stage build but the final image inherits potential build artifacts and unnecessary libraries from debian:buster-slim. No explicit cleanup of cache or unnecessary packages. Fix: Add 'RUN apt-get clean && rm -rf /var/lib/apt/lists/*' before finalizing. Consider using distroless base images for reduced attack surface.
  • Medium · Missing HEALTHCHECK Directive — Dockerfile. Dockerfile lacks HEALTHCHECK instruction, making it difficult to detect container health issues in orchestration platforms. Fix: Add HEALTHCHECK directive to monitor application health: 'HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 CMD ...'
  • Medium · Hardcoded Database Initialization Scripts — db/mysql/world.sql, db/postgres/*.sql, db/mongodb/init-mongo.js. Multiple SQL initialization files in db/ directories (world.sql, northwind.sql, etc.) appear to be committed with potentially sensitive test data or structure information. Fix: Review all test data files for sensitive information. Consider using .gitignore to exclude large database dumps. Document which files are safe for version control.
  • Medium · Missing Input Validation on Configuration Files — replibyte/src/cli.rs, replibyte/src/config.rs, examples/replibyte.yaml. Multiple YAML configuration examples reference database credentials and connection strings. No visible validation mechanism in CLI parser for config inputs. Fix: Implement strict YAML schema validation. Sanitize all configuration inputs. Use environment variables for sensitive credentials rather than config files.
  • Medium · Potential SQL Injection in Dump Parsing — dump-parser/src/mysql/mod.rs, dump-parser/src/postgres/mod.rs, replibyte/src/connector.rs. dump-parser module processes SQL dumps from mysql, postgres without visible sanitization. Dynamic query construction risk in migrations and connectors. Fix: Use parameterized queries and prepared statements exclusively. Implement input validation on all parsed SQL. Add security-focused unit tests for injection attempts.
  • Medium · Insufficient Encryption Configuration — examples/with-encryption.yaml, replibyte/src/datastore/s3.rs. Example 'with-encryption.yaml' shows encryption is optional. No enforcement of encryption for sensitive data at rest or in transit. Fix: Make encryption mandatory for production use cases. Implement TLS for all database connections. Use envelope encryption for data stored in S3 and local disk.
  • Low · Missing SBOM and Dependency Scanning — .github/workflows/build-and-test.yml. No evidence of Software Bill of Materials (SBOM) generation or automated dependency vulnerability scanning in CI/CD workflows. Fix: Integrate cargo-audit or OWASP Dependency-Check in CI/CD. Generate SBOM using tools like cargo-sbom. Fail builds on critical vulnerabilities.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
