RepoPilot

ankane/pgsync

Sync data from one Postgres database to another

Healthy

Healthy across all four use cases

Use as dependencyHealthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modifyHealthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isHealthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 4mo ago
  • 4 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • Slowing — last commit 4mo ago
  • Small team — 4 contributors active in recent commits
  • Single-maintainer risk — top contributor 97% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant: RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/ankane/pgsync)](https://repopilot.app/r/ankane/pgsync)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/ankane/pgsync on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: ankane/pgsync

Generated by RepoPilot · 2026-05-10 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/ankane/pgsync shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across all four use cases

  • Last commit 4mo ago
  • 4 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • ⚠ Slowing — last commit 4mo ago
  • ⚠ Small team — 4 contributors active in recent commits
  • ⚠ Single-maintainer risk — top contributor 97% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live ankane/pgsync repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/ankane/pgsync.

What it runs against: a local clone of ankane/pgsync — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in ankane/pgsync | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 164 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>ankane/pgsync</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of ankane/pgsync. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/ankane/pgsync.git
#   cd pgsync
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of ankane/pgsync and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "ankane/pgsync(\.git)?\b" \
  && ok "origin remote is ankane/pgsync" \
  || miss "origin remote is not ankane/pgsync (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "exe/pgsync" \
  && ok "exe/pgsync" \
  || miss "missing critical file: exe/pgsync"
test -f "lib/pgsync.rb" \
  && ok "lib/pgsync.rb" \
  || miss "missing critical file: lib/pgsync.rb"
test -f "lib/pgsync/sync.rb" \
  && ok "lib/pgsync/sync.rb" \
  || miss "missing critical file: lib/pgsync/sync.rb"
test -f "lib/pgsync/table_sync.rb" \
  && ok "lib/pgsync/table_sync.rb" \
  || miss "missing critical file: lib/pgsync/table_sync.rb"
test -f "lib/pgsync/client.rb" \
  && ok "lib/pgsync/client.rb" \
  || miss "missing critical file: lib/pgsync/client.rb"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 164 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~134d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/ankane/pgsync"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

pgsync is a Ruby CLI tool that syncs data between PostgreSQL databases with parallel table transfers, schema-aware diffing, and built-in data masking. It covers much of what pg_dump/pg_restore do, but adds speed, security (a rules engine that prevents sensitive-data leaks), and flexibility (schema mismatches, partial syncs, and related-record traversal). It is a monolithic Ruby gem with its entry point at exe/pgsync (CLI dispatcher). Core sync logic lives in lib/pgsync/ as modular classes: Sync (orchestrator), TableSync (per-table logic), SchemaSync (schema DDL), Client (PG connection wrapper), DataSource (config and credentials parsing), and TaskResolver (group/variable expansion). Tests in test/ mirror the lib structure.

👥Who it's for

PostgreSQL DevOps engineers and backend developers at organizations like Instacart who need to replicate production data to staging/development environments safely, without manual scripting or exposing sensitive columns like passwords and API keys.

🌱Maturity & risk

Production-ready and battle-tested at Instacart. The codebase shows active maintenance, with a comprehensive test suite (the test/ directory covers commands, data rules, schema, sync, and tables), CI via GitHub Actions (build.yml), and a full CHANGELOG. Roughly 73 KB of Ruby with a deliberate feature set suggests stability over rapid iteration.

Single maintainer (ankane) is a potential risk for critical infrastructure tooling. The gem declares few dependencies in its Gemfile (verify the current Gemfile for the exact count), but tight coupling to version-specific PostgreSQL features could surface breaking changes. Open-issue counts were not available in the analyzed data, and production use at scale means breaking changes would have wide impact.

Active areas of work

No PR or milestone data provided, but CHANGELOG.md and build.yml suggest ongoing maintenance. The recent Dockerfile addition suggests container distribution is a focus. GitHub Actions is actively used for CI.

🚀Get running

git clone https://github.com/ankane/pgsync.git
cd pgsync
bundle install
bundle exec rake test

Daily commands:

pgsync --init                     # Generate .pgsync.yml with DB credentials
pgsync                            # Sync all tables
pgsync table1,table2              # Sync specific tables
pgsync products "where id=1"      # Sync with WHERE clause
pgsync --schema-first             # Sync schema before data
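These commands read connection details and masking rules from .pgsync.yml in the working directory. A minimal sketch of such a config, parsed here with Ruby's stdlib YAML — the keys (from, to, exclude, data_rules) and rule names follow what pgsync's README documents, but verify against the current README before relying on them:

```ruby
require "yaml"

# Hypothetical .pgsync.yml — key names based on pgsync's documented
# config format; check the project README for the authoritative list.
config = YAML.safe_load(<<~YML)
  from: postgres://localhost:5432/myapp_production
  to: postgres://localhost:5432/myapp_development
  exclude:
    - schema_migrations
  data_rules:
    email: unique_email
    birthday: random_date
YML

puts config["data_rules"]["email"]   # => unique_email
```

`pgsync --init` generates a commented starter version of this file.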

🗺️Map of the codebase

  • exe/pgsync — Entry point for the CLI tool that parses commands and orchestrates the sync workflow.
  • lib/pgsync.rb — Main module that loads all dependencies and defines the public API for pgsync.
  • lib/pgsync/sync.rb — Core synchronization logic that coordinates table syncing and schema management.
  • lib/pgsync/table_sync.rb — Handles individual table data transfer with batching and parallel execution.
  • lib/pgsync/client.rb — Database connection wrapper that manages queries and transaction handling for both source and target.
  • lib/pgsync/data_source.rb — Parses configuration files and resolves database connections with support for data rules and filtering.
  • lib/pgsync/schema_sync.rb — Manages schema synchronization including table structure and sequence initialization.

🧩Components & responsibilities

  • Sync orchestrator (sync.rb) (Ruby threads, Client API) — Coordinates overall sync workflow: schema sync, parallel table transfers, sequence updates
    • Failure mode: If any table sync fails, entire sync halts; partial data may be written to target
  • Table sync engine (table_sync.rb) (PG COPY, SQL WHERE clauses, batch processing) — Batched row transfer with filtering and data rule application per-table
    • Failure mode: Failed INSERT/COPY rolls back; corrupted/incomplete table in target
  • Client abstraction (client.rb) (PG gem, PostgreSQL protocol) — Wraps PG gem; executes queries, manages connections, handles transactions
    • Failure mode: Connection loss terminates sync; partial transaction state undefined
  • Configuration parser (data_source.rb) (YAML, Postgres URI parsing) — Loads YAML, resolves database URIs, applies data rules and filters
    • Failure mode: Invalid config halts startup; unclear error message if URI/rules malformed
  • Schema sync (schema_sync.rb) (DDL execution, constraint introspection) — Creates or alters target schema to match source; handles sequences and dependencies
    • Failure mode: An unresolved schema mismatch causes the subsequent data sync to fail
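The orchestrator / table-engine split above can be sketched in miniature. Everything below is illustrative — the method names and structure are invented for the sketch, not taken from pgsync's internals:

```ruby
# Illustrative sketch of the pattern described above: a fixed pool of
# worker threads drains a queue of tables, and any per-table failure
# is collected so the whole run halts (the documented failure mode).
def sync_all(tables, jobs: 4)
  queue  = Queue.new
  errors = Queue.new
  tables.each { |t| queue << t }

  jobs.times.map do
    Thread.new do
      loop do
        table = begin
          queue.pop(true)        # non-blocking pop; raises when drained
        rescue ThreadError
          break
        end
        begin
          yield table            # per-table transfer would happen here
        rescue => e
          errors << [table, e]
        end
      end
    end
  end.each(&:join)

  raise "sync halted: #{errors.pop.first} failed" unless errors.empty?
end

done = Queue.new
sync_all(%w[users orders posts], jobs: 2) { |t| done << t }
done.size   # => 3
```

Note how a single failed table aborts the run after all workers drain — which is why the failure modes above warn about partial data left in the target.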

🛠️How to make changes

Add a new data rule (masking/filtering)

  1. Define rule logic in lib/pgsync/data_source.rb, where data_rules are parsed from config
  2. Apply the rule in lib/pgsync/table_sync.rb during batch processing (row transformations)
  3. Add a test case in test/data_rules_test.rb with an example YAML configuration
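A row-level masking rule (step 2) can be pictured like this — a hypothetical rule table and helper, not pgsync's real rule engine in lib/pgsync/table_sync.rb:

```ruby
require "digest"

# Hypothetical masking rules keyed by column name. pgsync's actual
# rules are declared in .pgsync.yml and applied during batch transfer.
RULES = {
  "email"    => ->(v) { "user-#{Digest::SHA256.hexdigest(v.to_s)[0, 8]}@example.org" },
  "password" => ->(_) { nil },
}

def apply_rules(row)
  row.to_h { |col, val| [col, RULES.key?(col) ? RULES[col].call(val) : val] }
end

row    = { "id" => 1, "email" => "real@corp.com", "password" => "hunter2" }
masked = apply_rules(row)
masked["password"]   # => nil (redacted before it reaches the target DB)
```

The hash-based email rule is deterministic, so related records that share an email still line up after masking.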

Add a new CLI command/task

  1. Create a new task class following the Task pattern in lib/pgsync/task.rb
  2. Register the task in the name-to-class mapping in lib/pgsync/task_resolver.rb
  3. Implement the command logic in your new task, using Client (lib/pgsync/client.rb) for DB operations
  4. Add tests in test/commands_test.rb verifying CLI invocation and behavior

Extend table filtering or partial sync logic

  1. Add table/row filter criteria to config parsing in lib/pgsync/data_source.rb
  2. Apply filters in lib/pgsync/table_sync.rb, where SQL WHERE clauses are constructed
  3. Test via lib/pgsync/sync.rb integration, or add a unit test in test/tables_test.rb

Add support for schema exclusions or custom transformations

  1. Define exclusion rules in config parsing in lib/pgsync/data_source.rb
  2. Check exclusions in lib/pgsync/schema_sync.rb before creating tables/sequences
  3. Verify with schema tests in test/schema_test.rb

🔧Why these technologies

  • Ruby (with Rake for tasks) — Lightweight, fast development for CLI tools; excellent for system automation and database operations
  • PostgreSQL COPY protocol (via PG gem) — Fastest bulk data transfer method; far faster than INSERT statements for large volumes
  • Parallel table transfers (threads) — Independent tables can be synced concurrently without blocking; improves throughput
  • YAML configuration — Human-readable, version-control friendly, easy to define rules and data transformations

⚖️Trade-offs already made

  • In-process parallel syncing (threads) vs. distributed job queue

    • Why: Simpler deployment and no external dependencies; better for typical use cases
    • Consequence: Cannot distribute sync across machines; scaling is bound by single-host resources
  • Batch-based row transfer vs. streaming

    • Why: Allows data transformations (masking) without streaming complexity; clearer memory control
    • Consequence: Entire batches must fit in memory; very large tables require smaller batch sizes
  • Schema detection via introspection vs. explicit schema definitions

    • Why: Automatic schema sync reduces manual configuration; gracefully handles schema drift
    • Consequence: Less control over fine-grained schema transformations; relies on PostgreSQL introspection queries
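The batch trade-off is easy to see in miniature — an illustrative batched copy loop (not pgsync's implementation), where only one batch is materialized at a time and per-row transformations such as masking slot in naturally:

```ruby
# Illustrative: process rows in fixed-size batches so memory use is
# bounded by batch_size, and apply a per-row transform to each batch.
def copy_in_batches(rows, batch_size:, &transform)
  written = 0
  rows.each_slice(batch_size) do |batch|   # one batch in memory at a time
    out = batch.map(&transform)            # masking/filtering hook
    written += out.size                    # a real sink would COPY here
  end
  written
end

copy_in_batches(1..10, batch_size: 3) { |r| r * 2 }   # => 10
```

Shrinking batch_size trades throughput for a smaller memory ceiling — the consequence noted above for very large tables.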

🚫Non-goals (don't propose these)

  • Real-time replication (one-way snapshot sync only)
  • Cross-database platform support (PostgreSQL-only)
  • Distributed/multi-host orchestration (single-machine tool)
  • Bi-directional sync or conflict resolution
  • Support for PostGIS or other non-core PostgreSQL extensions beyond basic schema

🪤Traps & gotchas

  • PG gem version coupling: schema introspection queries in Client may break on PostgreSQL 15+ (OID schema changes); verify PG version compatibility.
  • Parallel sync concurrency: the default worker count is not visible in the provided files — check lib/pgsync/sync.rb for Thread/Concurrent usage; FK-heavy schemas may deadlock without explicit dependency ordering.
  • Credentials in .pgsync.yml: use a git hook or CI secret scanning to prevent committing database passwords.
  • Column type mismatches: syncing BYTEA or JSONB without explicit casting rules can fail silently; see data_rules_test.rb for handling.
  • Sequence state: lib/pgsync/sequence.rb must reset PG sequences after bulk INSERT, or source IDs will collide with future inserts on the destination.
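The sequence-state trap is fixed with PostgreSQL's setval after a bulk load. A sketch of the SQL involved, built with a conservative identifier-quoting helper (illustrative code, not pgsync's):

```ruby
# Standard SQL identifier quoting: wrap in double quotes and double any
# embedded quote characters. Illustrative helper, not pgsync's code.
def quote_ident(name)
  %("#{name.gsub('"', '""')}")
end

# setval(pg_get_serial_sequence(...), MAX(id)) realigns the sequence so
# future inserts on the destination don't collide with copied rows.
def reset_sequence_sql(table, column)
  "SELECT setval(pg_get_serial_sequence('#{table}', '#{column}'), " \
    "(SELECT COALESCE(MAX(#{quote_ident(column)}), 1) FROM #{quote_ident(table)}))"
end

reset_sequence_sql("users", "id")
```

This produces the standard setval-over-MAX pattern; run it against the destination once the table's rows have landed.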

🏗️Architecture

💡Concepts to learn

  • Logical Replication & CDC (Change Data Capture) — pgsync is a one-time full sync; understanding WAL-based logical replication helps you recognize when pgsync + triggers (for incremental updates) is better than pg_basebackup or continuous CDC.
  • Column-Level Data Masking / PII Anonymization — Core pgsync feature: data_rules let you hash/redact sensitive columns during sync; critical for compliance (GDPR/CCPA) when replicating production to non-prod.
  • Foreign Key Dependency Resolution — pgsync must sync tables in dependency order to avoid FK constraint violations; understanding topological sorting of schema graph is key to debugging 'insert or update on table X violates foreign key' errors.
  • UPSERT (INSERT ... ON CONFLICT) semantics — table_sync.rb likely uses UPSERT to idempotently re-sync overlapping data; understanding conflict resolution (DO UPDATE vs DO NOTHING) prevents silent data loss when syncing partial tables.
  • Parallel Batch Processing & Rate Limiting — pgsync syncs tables in parallel to speed up large datasets; understanding thread pool saturation and batch size tuning prevents connection exhaustion and memory OOM on large tables.
  • PostgreSQL Prepared Statements & SQL Injection Prevention — Client.rb wraps PG gem prepared statement API; pgsync safely handles user-supplied WHERE clauses via parameterization, preventing SQL injection.
  • Schema Inference & Version-Safe Introspection — pgsync queries information_schema and pg_catalog to discover columns/types/constraints; PostgreSQL version differences (9.5 vs 15) break hardcoded queries—critical for multi-version production environments.
🔗Related projects

  • ankane/pgslice — Same author; pgslice partitions large tables while pgsync replicates them — users often chain both tools for parallel incremental syncs.
  • ankane/pghero — Same author; pghero monitors PostgreSQL health and identifies slow queries; pgsync users run it on dest DB post-sync to validate performance.
  • pg_chameleon — Logical replication alternative; pgsync is pull-based (one-time sync) while pg_chameleon is push-based (continuous CDC) — choose based on sync frequency needs.
  • debezium/debezium — Enterprise CDC platform for real-time sync across databases; pgsync is simpler/lighter for one-off replica creation, Debezium for streaming Postgres changes to Kafka.
  • flyway/flyway — Schema migration tool; pgsync users run flyway first to align schemas, then pgsync to replicate data—complementary not competing.
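The foreign-key dependency resolution concept above maps directly to a topological sort of the schema graph. A self-contained illustration using Ruby's stdlib TSort — the table names are invented, and pgsync's own ordering logic may differ:

```ruby
require "tsort"

# Each table maps to the tables it references via foreign keys.
# TSort emits dependencies first, so parents are synced before children.
class TableGraph
  include TSort

  def initialize(deps)
    @deps = deps
  end

  def tsort_each_node(&block)
    @deps.each_key(&block)
  end

  def tsort_each_child(table, &block)
    @deps.fetch(table, []).each(&block)
  end
end

deps = {
  "comments" => ["posts", "users"],
  "posts"    => ["users"],
  "users"    => [],
}
order = TableGraph.new(deps).tsort
# "users" precedes "posts", which precedes "comments"; a cyclic FK graph
# raises TSort::Cyclic — the analogue of the FK-violation errors above.
```

Debugging "violates foreign key constraint" errors usually comes down to checking whether the sync order respects exactly this graph.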

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive tests for lib/pgsync/schema_sync.rb

The schema_sync.rb module handles schema synchronization between databases, but there is no dedicated test file — only the broader test/schema_test.rb. Given the complexity of schema diffing and the critical nature of schema operations, this module deserves focused unit tests covering edge cases like column type mismatches, constraint handling, and partial schema syncs.

  • [ ] Create test/schema_sync_test.rb with tests for schema_sync.rb operations
  • [ ] Add test cases for: adding columns, removing columns, type changes, constraint mismatches, and schema differences
  • [ ] Test the interaction between schema_sync.rb and table.rb for schema validation
  • [ ] Ensure tests use test/support/schema*.sql fixtures appropriately

Add integration tests for lib/pgsync/task_resolver.rb

The task_resolver.rb is responsible for parsing and resolving sync tasks from configuration, but there are no dedicated tests for this critical module. It bridges config files and actual sync operations, making comprehensive testing essential for reliability.

  • [ ] Create test/task_resolver_test.rb with tests for task resolution logic
  • [ ] Add test cases using test/support/config.yml for: single table tasks, group tasks, related records resolution, and error cases
  • [ ] Test edge cases like circular dependencies in related records and missing table references
  • [ ] Test interaction with lib/pgsync/table_sync.rb for task execution

Add tests for lib/pgsync/sequence.rb sequence synchronization

The sequence.rb module handles Postgres sequence synchronization (critical for auto-increment columns), but there are no dedicated test cases. This is a discrete module that should have isolated unit tests to ensure sequences are properly reset and synchronized across databases.

  • [ ] Create test/sequence_test.rb with tests for sequence detection and synchronization
  • [ ] Add test cases for: sequence value reset, sequence ownership, dependent sequences, and schema-qualified sequences
  • [ ] Test edge cases like sequences with gaps and sequences not owned by any table
  • [ ] Verify integration with lib/pgsync/table_sync.rb for post-sync sequence updates

🌿Good first issues

  • Add test coverage for TableSync edge cases with GENERATED columns: PostgreSQL 12+ GENERATED columns are read-only, so INSERTs in table_sync.rb likely fail unless those columns are skipped. Write a test in test/tables_test.rb covering GENERATED ALWAYS AS IDENTITY.
  • Document data masking rule syntax with examples in README.md: data_rules_test.rb shows masking works but README has no example of defining rules in .pgsync.yml config; add a 'Data Masking' section with concrete YAML + output.
  • Add --dry-run flag to preview sync without applying changes: exe/pgsync has no --dry-run; users want to validate WHERE clauses and rule application before commit. Requires passing dry_run: true through Sync → TableSync and wrapping inserts in rollback.


📝Recent commits

  • db602da — Test with Ruby 4.0 on CI (ankane)
  • 23d1aad — Added load_schema method [skip ci] (ankane)
  • 26d4ac9 — Updated test setup [skip ci] (ankane)
  • 5d61038 — Test with Postgres 18 on CI (ankane)
  • 901928e — Updated checkout action (ankane)
  • a21260f — Version bump to 0.8.1 [skip ci] (ankane)
  • 469a23d — Updated changelog [skip ci] (ankane)
  • 14694ef — Fixed CI (ankane)
  • 1f44e64 — Fixed local destination detection with pg 1.6+ (ankane)
  • ab793f6 — Fixed unrecognized configuration parameter error with Docker image - fixes #249 (ankane)

🔒Security observations

  • High · Potential SQL Injection via Data Rules Configuration — lib/pgsync/table_sync.rb, lib/pgsync/data_source.rb, test/data_rules_test.rb. The tool processes data rules from YAML configuration files (referenced as 'data_rules_test.rb' in tests) that may contain raw SQL or filtering logic. If user-supplied data rules are not properly sanitized before being incorporated into SQL queries, this could lead to SQL injection attacks. Fix: Implement strict validation and parameterized queries for all data rule filters. Use prepared statements exclusively and avoid string interpolation for SQL construction. Add comprehensive input validation for rule configuration.
  • High · Hardcoded Database Credentials in Configuration Files — config.yml, test/support/config.yml. The config.yml file and test configuration files (test/support/config.yml) may contain database connection strings with embedded credentials. These could be exposed if committed to version control or logged in error messages. Fix: Use environment variables for sensitive credentials instead of hardcoded values. Implement a secure credential management system and add config.yml to .gitignore. Use connection URI environment variables (DATABASE_URL pattern).
  • High · Missing Input Validation on Table/Schema Names — lib/pgsync/table.rb, lib/pgsync/schema_sync.rb, lib/pgsync/sync.rb. The tool dynamically constructs SQL queries based on table names, schema names, and column identifiers from configuration files. Without proper validation, this could enable SQL injection through table name manipulation. Fix: Implement whitelist-based validation for all identifiers (table names, schema names, column names). Use PostgreSQL's identifier quoting (e.g., pg_format gem or proper escaping). Validate against system catalogs before use.
  • Medium · Insufficient Error Handling and Information Disclosure — lib/pgsync/client.rb, lib/pgsync/sync.rb. Database errors and stack traces may be exposed to users, potentially revealing sensitive information about database structure, configuration, or internal implementation details. Fix: Implement proper error handling that logs detailed errors securely while returning generic user-facing messages. Avoid exposing database schema or connection details in error messages.
  • Medium · Parallel Execution Without Rate Limiting — lib/pgsync/table_sync.rb, lib/pgsync/sync.rb. The tool transfers tables in parallel for speed. Without proper connection pooling limits and rate limiting, this could be exploited for resource exhaustion attacks or used to overwhelm the target database. Fix: Implement configurable connection pool limits, query timeouts, and rate limiting. Add safeguards to prevent excessive parallelization. Document resource requirements and best practices.
  • Medium · Sensitive Data Leakage in Logs and Debugging — lib/pgsync/sync.rb, lib/pgsync/table_sync.rb, lib/pgsync/utils.rb. The tool may log sensitive data during sync operations. SQL queries containing actual data, connection strings, or query parameters could end up in application logs. Fix: Implement structured logging that excludes sensitive data. Sanitize SQL queries before logging. Add configuration options for log verbosity levels. Ensure logs are properly secured and rotated.
  • Medium · Docker Image Uses Alpine with Limited Security Updates — Dockerfile. The Dockerfile builds from alpine with pre-installed gems. While minimal, Alpine has fewer security updates than other base images. The gem installation during build could pull vulnerable dependencies. Fix: Implement multi-stage builds to reduce attack surface. Use specific version pinning for base image and dependencies. Regularly scan images with tools like Trivy. Consider using distroless images for final stage.
  • Low · Missing Network Security Configuration — Dockerfile. The Dockerfile doesn't specify network policies or resource limits. Running untrusted configurations could potentially lead to resource exhaustion or lateral movement. Fix: Add health checks to the Docker image. Document recommended Docker security flags (--read-only, --cap-drop, memory limits). Provide example docker-compose.yml with security best practices.
  • Low · Missing Security Documentation — no SECURITY.md file is visible in the repository. Fix: Add a SECURITY.md describing supported versions and how to report vulnerabilities privately.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Healthy signals · ankane/pgsync — RepoPilot