
BurntSushi/xsv

A fast CSV command line toolkit written in Rust.

Healthy

Healthy across all four use cases

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • 29+ active contributors
  • Unlicense licensed
  • CI configured
  • Tests present
  • Stale — last commit 1y ago
  • Concentrated ownership — top contributor handles 67% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
[![RepoPilot: Healthy](https://repopilot.app/api/badge/burntsushi/xsv)](https://repopilot.app/r/burntsushi/xsv)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/burntsushi/xsv on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: BurntSushi/xsv

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/BurntSushi/xsv shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across all four use cases

  • 29+ active contributors
  • Unlicense licensed
  • CI configured
  • Tests present
  • ⚠ Stale — last commit 1y ago
  • ⚠ Concentrated ownership — top contributor handles 67% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live BurntSushi/xsv repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/BurntSushi/xsv.

What it runs against: a local clone of BurntSushi/xsv — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in BurntSushi/xsv | Confirms the artifact applies here, not a fork |
| 2 | License is still Unlicense | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 408 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>BurntSushi/xsv</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of BurntSushi/xsv. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/BurntSushi/xsv.git
#   cd xsv
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of BurntSushi/xsv and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "BurntSushi/xsv(\.git)?\b" \
  && ok "origin remote is BurntSushi/xsv" \
  || miss "origin remote is not BurntSushi/xsv (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Unlicense)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Unlicense\"" package.json 2>/dev/null) \
  && ok "license is Unlicense" \
  || miss "license drift — was Unlicense at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in src/main.rs src/cmd/mod.rs src/config.rs src/select.rs src/index.rs; do
  test -f "$f" \
    && ok "$f" \
    || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 408 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~378d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/BurntSushi/xsv"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

xsv is a command-line CSV processing toolkit written in Rust that provides 25+ composable commands (cat, count, select, join, stats, frequency, etc.) for slicing, analyzing, splitting, and reshaping CSV data at high speed. It uses indexed access patterns to enable constant-time row lookups and parallel statistics computation without loading entire files into memory. Single-binary monolith: src/cmd/ contains 20+ command modules (each cmd/*.rs implements one subcommand like select.rs, join.rs, stats.rs), src/main.rs routes CLI args to command handlers via docopt, src/config.rs manages CSV parsing options, src/select.rs handles column selection logic shared across commands, and src/index.rs implements the indexed file format for O(1) row access.

👥Who it's for

Data analysts, DevOps engineers, and systems administrators who need to process, transform, and analyze CSV/TSV files in shell pipelines without the overhead of Python/pandas or Excel, particularly those working with large datasets where performance and composability matter.

🌱Maturity & risk

The project is unmaintained as of the README notice—the author recommends alternatives (qsv, xan) instead. However, the code is production-quality: v0.13.0 with dual MIT/Unlicense licensing, comprehensive test coverage (tests/ directory with tests for each major command), Travis CI + AppVeyor setup, and a stable Rust dependency set using csv crate v1. Last visible activity suggests it reached a stable stopping point rather than active abandonment mid-development.

Unmaintained status is the primary risk: no new bug fixes or Rust edition updates will land. Dependency risk is low (14 crates, mostly stable: csv, regex, serde, threadpool are well-maintained), but Rust MSRV is not specified—could fail to compile on future Rust editions. No breaking changes expected since project is frozen, but toolchain incompatibility could emerge.

Active areas of work

Nothing—the repo is frozen. The README explicitly states xsv is unmaintained; the author moved focus to qsv (a maintained fork/evolution). No pending work, PRs, or issues are being addressed.

🚀Get running

git clone https://github.com/BurntSushi/xsv.git
cd xsv
cargo build --release
./target/release/xsv --help

Binaries appear in target/release/. See the Makefile for additional build targets.

Daily commands:

# Development build
cargo build
./target/debug/xsv count data.csv

# Release (optimized, 3x faster)
cargo build --release
./target/release/xsv select 1,3,5 data.csv

# Run all tests
cargo test --test tests

No external services; input is file or stdin, output is stdout.

🗺️Map of the codebase

  • src/main.rs — Entry point that dispatches CLI commands; every contributor must understand the command routing and docopt configuration.
  • src/cmd/mod.rs — Command module registry and trait definitions; essential for adding any new CSV operation.
  • src/config.rs — CSV parsing configuration (delimiter, quoting, escaping); core to all CSV reading/writing behavior.
  • src/select.rs — Column selection logic used across multiple commands; critical path for field indexing and filtering.
  • src/index.rs — CSV index file format and querying; enables fast random access and slicing operations.
  • Cargo.toml — Dependencies and binary configuration; csv and csv-index crates are the foundation.
  • tests/tests.rs — Main test harness and integration test utilities; demonstrates command invocation patterns.

🧩Components & responsibilities

  • main.rs & docopt dispatch (docopt, std::env) — Parse CLI arguments and route to the correct command module. Handles --help, version, and global flags.
    • Failure mode: Invalid arguments or missing subcommand cause early exit with usage message.
  • config.rs dialect manager (csv crate, serde) — Encapsulates CSV format settings (delimiter, quote, escape). Applied uniformly across all I/O.
    • Failure mode: Incorrect delimiter or escape settings cause parse errors; invalid UTF-8 terminates reading.
  • select.rs column indexer (regex, custom parser) — Parses column selection syntax (e.g., '1,3,5' or '1-10'). Used by select, sort, stats, join, etc.
    • Failure mode: Invalid selector syntax or out-of-bounds column indices cause error; silently skipped if not strict.
  • index.rs random access layer (csv-index, File I/O, byteorder) — Stores and queries byte offsets for fast row lookup. Enables slice and seek without full scan.
    • Failure mode: Missing or corrupted .idx file falls back to sequential scan; index regeneration may be needed.
  • Command modules (cmd/*.rs) (csv crate, crossbeam-channel, regex) — Each command (sort, join, stats, search, etc.) implements filtering, transformation, or aggregation logic.
    • Failure mode: Invalid input or unsupported operations error early; malformed CSV causes partial output or error.


🛠️How to make changes

Add a new CSV command

  1. Create a new file in src/cmd/ (e.g., src/cmd/mycommand.rs) implementing the Command trait with config() and run() methods (src/cmd/mycommand.rs)
  2. Register the command in src/cmd/mod.rs by adding it to the command dispatch match statement (src/cmd/mod.rs)
  3. Update src/main.rs docopt string to document the new command's CLI usage (src/main.rs)
  4. Add integration tests in tests/test_mycommand.rs following the pattern of existing test files (tests/test_mycommand.rs)
  5. Use src/config.rs to read CSV dialect settings and src/select.rs for column selection if needed (src/config.rs)

Modify CSV parsing or dialect handling

  1. Edit src/config.rs to adjust delimiter, quote character, escape handling, or other CSV dialect options (src/config.rs)
  2. Update src/cmd/input.rs if reader initialization logic needs changes (src/cmd/input.rs)
  3. Verify impact in tests/test_fmt.rs (formatting) and other dialect-sensitive tests (tests/test_fmt.rs)

Optimize indexing or random access

  1. Review or modify the index binary format in src/index.rs (src/index.rs)
  2. Update src/cmd/index.rs to generate indexes with new format if needed (src/cmd/index.rs)
  3. Test with src/cmd/slice.rs which relies on index for fast row access (src/cmd/slice.rs)
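The byte-offset idea behind src/index.rs can be sketched as follows. This naive version splits on newlines only, so it would miscount quoted fields containing embedded newlines; the real csv-index format is binary and quote-aware. Treat it as an illustration of why O(1) row access works, not as the actual format.

```rust
// Build a list of byte offsets where each record starts. Index entry i
// is the offset of row i; a lookup is then a single seek, not a scan.
fn build_index(data: &str) -> Vec<u64> {
    let mut offsets = vec![0u64];
    for (i, b) in data.bytes().enumerate() {
        if b == b'\n' && i + 1 < data.len() {
            offsets.push((i + 1) as u64);
        }
    }
    offsets
}

// Fetch row `row` (0 = header) in O(1) using the precomputed offsets.
fn read_row(data: &str, index: &[u64], row: usize) -> Option<String> {
    let start = *index.get(row)? as usize;
    let end = index
        .get(row + 1)
        .map(|&o| o as usize - 1) // stop before the next row's newline
        .unwrap_or(data.len());
    Some(data[start..end].trim_end_matches('\n').to_string())
}
```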

🔧Why these technologies

  • csv crate (Rust) — Handles RFC 4180 CSV parsing with excellent performance; avoids reinventing dialect handling.
  • csv-index — Enables O(1) byte-offset lookup for row slicing without loading entire file into memory.
  • docopt — Derives CLI parsing from docstring; reduces boilerplate and keeps usage documentation in sync with code.
  • regex crate — Provides efficient pattern matching for search/filter operations.
  • crossbeam-channel — Enables work-stealing parallelism for multi-threaded commands (e.g., frequency, stats).
  • Rust — Zero-cost abstractions and memory safety; handles large CSV files without GC pauses.

⚖️Trade-offs already made

  • In-memory sorting vs streaming

    • Why: Most sort algorithms require random access; streaming would require external sorting (disk I/O).
    • Consequence: sort and stats commands require dataset to fit in RAM; trade memory for simplicity and speed.
  • Binary index file format vs on-the-fly indexing

    • Why: Repeated slicing/access operations benefit from cached index; avoids re-scanning file.
    • Consequence: Requires separate index command; index files must be regenerated if CSV changes.
  • Command-line composition over unified data model

    • Why: Unix philosophy: each command reads stdin/file and writes stdout; enables shell piping.
    • Consequence: No in-process command chaining; overhead of CSV parsing/serialization at each step, but high composability.
  • Parallel work-stealing for stats/frequency via crossbeam

    • Why: Reduces lock contention and distributes work evenly across CPU cores.
    • Consequence: Complexity in multi-threaded code; data races must be carefully avoided (mitigated by Rust's type system).
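The partition-then-merge shape behind the last trade-off can be sketched with std threads in place of xsv's crossbeam-channel worker pool. Assumption: this mirrors the general pattern only, not xsv's actual frequency/stats code.

```rust
use std::collections::HashMap;
use std::thread;

// Split rows into chunks, count frequencies per chunk in parallel,
// then merge the partial maps. Rust's ownership rules keep the worker
// maps disjoint, so no locking is needed until the merge.
fn parallel_frequency(rows: Vec<String>, workers: usize) -> HashMap<String, usize> {
    let chunk = (rows.len() + workers.max(1) - 1) / workers.max(1);
    let mut handles = Vec::new();
    for part in rows.chunks(chunk.max(1)) {
        let part = part.to_vec();
        handles.push(thread::spawn(move || {
            let mut m: HashMap<String, usize> = HashMap::new();
            for v in part {
                *m.entry(v).or_insert(0) += 1;
            }
            m
        }));
    }
    let mut total: HashMap<String, usize> = HashMap::new();
    for h in handles {
        for (k, v) in h.join().unwrap() {
            *total.entry(k).or_insert(0) += v;
        }
    }
    total
}
```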

🚫Non-goals (don't propose these)

  • Real-time streaming (all commands load or iterate full input before output)
  • SQL-like joins in a single command (multi-CSV joins require explicit join command, not ad-hoc filtering)
  • Data type inference (all fields treated as strings; stats operations are statistical, not semantic)
  • GUI or interactive mode (CLI-only tool)
  • Compression (no built-in gzip, brotli, etc.; relies on Unix toolchain compression)
  • Network I/O (local file system only; no S3, HTTP, database connectors)

🪤Traps & gotchas

  1. Index format is external: the csv-index crate's binary format is not documented here—if you break index compatibility, existing indices become unreadable.
  2. No MSRV specified: Cargo.toml doesn't pin a Rust version, so building on very old or new Rust editions may fail silently.
  3. Docopt string is law: CLI behavior is entirely driven by the docopt usage strings in src/main.rs; changing arg names requires matching the string exactly or args won't parse.
  4. Test binaries expect xsv in PATH: tests/test_*.rs spawn child processes assuming xsv is built and discoverable; run cargo build before cargo test.
  5. Threadpool is single-threaded by default: commands opt in to parallelism manually via threadpool creation; there is no automatic parallelism.

🏗️Architecture

💡Concepts to learn

  • Binary Index Format (csv-index) — xsv's performance advantage comes from pre-computed byte offsets stored in .idx files, enabling O(1) random row access instead of O(n) linear scans—critical for slice/count commands on large files
  • Reservoir Sampling — The sample command uses this algorithm to draw uniformly random rows using O(k) memory instead of O(n), enabling efficient sampling of multi-GB CSV files
  • Work-Stealing Threadpool — xsv uses crossbeam-channel + threadpool for parallelization in frequency/stats commands; understanding task distribution patterns explains how multi-core speedups are achieved
  • Hash-Based Join — The join command builds an in-memory hash table for the left input to enable fast O(1) lookups during the join, trading memory for speed—pattern visible in src/cmd/join.rs
  • Streaming Statistics (One-Pass) — The stats command computes mean/stddev/median/percentiles in a single pass using streaming-stats crate, avoiding the need to load entire columns into memory
  • RFC 4180 CSV Semantics — xsv's csv crate implements the RFC 4180 standard precisely; understanding quoting rules (escaping, flexible record length) is essential for using --flexible, --delimiter, and --quote correctly
  • Regex on Field Values — The search command applies regex independently to each CSV field, not to raw bytes; this is a common gotcha (e.g., regex can't match across quoted fields)
  • dathere/qsv — Modern maintained successor to xsv—same Rust+CSV philosophy, actively developed with bug fixes and new features; recommended in xsv's README as the preferred alternative
  • medialab/xan — Another xsv alternative in Rust with different design tradeoffs; also recommended in xsv README as an option
  • BurntSushi/csv-index — Companion crate providing the binary index format and I/O that xsv depends on for constant-time row access; used by index.rs
  • BurntSushi/rust-csv — The underlying csv crate (owned by xsv author); all xsv commands use this for RFC 4180 parsing—source of truth for CSV semantics
  • docopt/docopt.rs — Provides the docopt CLI parser that xsv uses; understanding this crate helps with modifying src/main.rs argument handling
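The reservoir-sampling concept above (Algorithm R) can be sketched in a few lines. Hedged illustration: xsv's actual sample command uses a proper RNG from the rand crate; the tiny xorshift step here only keeps the example dependency-free.

```rust
// Algorithm R: keep the first k items, then replace a random slot with
// probability k/(i+1) for each later item. Memory is O(k) regardless of
// input length, which is the point for multi-GB CSV files.
fn reservoir_sample<I: Iterator<Item = String>>(items: I, k: usize, seed: u64) -> Vec<String> {
    let mut rng = seed;
    let mut next = move || {
        // xorshift-style step; NOT cryptographic, illustration only
        rng ^= rng << 13;
        rng ^= rng >> 7;
        rng ^= rng << 17;
        rng
    };
    let mut reservoir: Vec<String> = Vec::with_capacity(k);
    for (i, item) in items.enumerate() {
        if i < k {
            reservoir.push(item);
        } else {
            let j = (next() % (i as u64 + 1)) as usize;
            if j < k {
                reservoir[j] = item;
            }
        }
    }
    reservoir
}
```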

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Migrate CI from Travis CI and AppVeyor to GitHub Actions

The repo currently uses .travis.yml and appveyor.yml, which are legacy CI systems. GitHub Actions is the native GitHub CI/CD platform and would be more maintainable. This is especially important given the project is now unmaintained: modern CI ensures the build instructions in ci/ scripts stay current and are easier for new contributors to understand.

  • [ ] Create .github/workflows/ci.yml with matrix builds for Linux, macOS, and Windows
  • [ ] Port the logic from ci/script.sh, ci/install.sh, and ci/before_deploy.sh into the workflow
  • [ ] Test the workflow runs successfully for the current Rust version and MSRV
  • [ ] Remove or deprecate .travis.yml and appveyor.yml with notices pointing to the new workflow
  • [ ] Update README.md to reference GitHub Actions badges instead of Travis/AppVeyor

Add comprehensive error handling tests for src/cmd/join.rs

The join.rs command is one of the most complex operations in xsv (combining two CSV files), but tests/test_join.rs appears minimal. Edge cases like mismatched headers, duplicate keys, malformed input, and case sensitivity in joins are critical for correctness and should be explicitly tested.

  • [ ] Review tests/test_join.rs to identify missing edge case scenarios
  • [ ] Add tests for: duplicate join keys, missing columns, empty files, quote handling in join keys
  • [ ] Add tests for case-sensitive vs case-insensitive join behavior
  • [ ] Add tests for various join types (inner, left, right, full outer if supported)
  • [ ] Run tests with cargo test --test test_join to verify
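One of the edge cases the checklist calls out, duplicate join keys, is easy to reason about with a toy in-memory hash join. Hypothetical sketch of the pattern only, not join.rs's actual code: the left side is hashed, and a duplicate left key must produce one output row per match.

```rust
use std::collections::HashMap;

// Toy inner hash join over (key, value) rows. The left input is loaded
// into a hash table (key -> all left values), then the right input is
// streamed against it, emitting (key, left_value, right_value) triples.
fn inner_join(
    left: &[(String, String)],
    right: &[(String, String)],
) -> Vec<(String, String, String)> {
    let mut table: HashMap<&str, Vec<&str>> = HashMap::new();
    for (k, v) in left {
        table.entry(k.as_str()).or_default().push(v.as_str());
    }
    let mut out = Vec::new();
    for (k, rv) in right {
        if let Some(lvs) = table.get(k.as_str()) {
            for lv in lvs {
                // Duplicate left keys fan out into multiple output rows.
                out.push((k.clone(), lv.to_string(), rv.clone()));
            }
        }
    }
    out
}
```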

Add integration tests for command piping and composition scenarios

The README emphasizes that 'Composition should not come at the expense of performance', but the current test suite (tests/test_*.rs) appears to test individual commands in isolation. Adding tests that verify common multi-command pipelines (e.g., select | sort | stats, or cat | search | frequency) would ensure the claimed composability works correctly.

  • [ ] Create tests/test_composition.rs for multi-command pipelines
  • [ ] Add test cases for: select then sort, search then count, cat multiple files then sort
  • [ ] Verify output matches manual sequential execution
  • [ ] Test with the workdir.rs utility to simulate real CLI invocations
  • [ ] Document in BENCHMARKS.md or README.md which composition patterns are recommended

🌿Good first issues

  • Add integration test for xsv join with --full outer joins. Currently tests/test_join.rs exists but may not cover all join types (inner/left/right/full). Check the join.rs command implementation for all supported join modes and ensure corresponding test cases exist. Why: tests are the easiest to add, low risk, and this is a complex feature that deserves thorough coverage.
  • Document the csv-index binary format in docs/ or as a code comment in src/index.rs. Currently only the csv-index crate documents it, but xsv users and contributors should understand what the .idx file structure looks like to debug index corruption or extend indexing. Why: high-value documentation gap visible from the file structure; no docs/ directory exists.
  • Add --output / -o flag to write command results to a file instead of stdout. Currently all commands write to stdout only. Check how tabwriter and csv::Writer are used in each cmd/*.rs and add a common output abstraction in config.rs or util.rs. Why: this is a quality-of-life improvement matching Unix tradition; it involves modest changes across multiple files but has a clear pattern to follow.


📝Recent commits

  • f430466 — readme: add unmaintained notice (BurntSushi)
  • 4278b85 — license: revert license file change (atouchet)
  • 3de6c04 — doc: use HTTPS for all links (bruceadams)
  • c0d2666 — readme: add MacPorts instructions (lperry)
  • c8c3484 — xsv: fix error message for invalid commands (mintyplanet)
  • 63ad0b3 — readme: note naming collision (BurntSushi)
  • 9574d89 — deps: use crossbeam-channel instead of chan (BurntSushi)
  • 5c5c538 — deps: update to quickcheck 0.7 (BurntSushi)
  • bc730d7 — xsv: add --drop flag to partition command (Dimagog)
  • 72db9ed — xsv: add reverse command (Dimagog)

🔒Security observations

The xsv project has moderate security concerns primarily related to its unmaintained status and outdated dependencies. The codebase itself appears reasonably well-structured as a command-line CSV tool with minimal injection attack surfaces (no web components, databases, or dynamic code execution visible). However, the lack of maintenance means accumulated dependencies with known vulnerabilities are not being patched. Users should migrate to maintained alternatives or establish a security maintenance fork. The main technical risks are outdated Rust crate dependencies and potential path traversal in file handling operations.

  • Medium · Outdated Dependencies with Known Vulnerabilities — Cargo.toml - [dependencies] section. The project uses several outdated dependencies that may contain known security vulnerabilities. Notable outdated packages include: rand 0.5 (released 2018, multiple CVEs), crossbeam-channel 0.2.4 (old version), and csv 1.x. The project was last maintained around 2018-2019 based on version numbers. Fix: Update all dependencies to their latest versions. Specifically: upgrade rand to 0.8+, update crossbeam-channel to 0.5+, and review csv crate for security patches. Run 'cargo audit' to identify CVEs.
  • Medium · Unmaintained Project — README.md. The project is explicitly marked as unmaintained in the README. This means security vulnerabilities will not be patched, and the codebase may have accumulated unaddressed security issues over time. Fix: Users should migrate to maintained alternatives like 'qsv' or 'xan' as recommended in the README. If continuing to use xsv, conduct a thorough security audit and maintain a fork with security patches.
  • Low · Debug Symbols in Release Builds — Cargo.toml - [profile.release] section. The release profile includes 'debug = true' which keeps debug symbols in release binaries. While this aids in debugging, it can leak information about the binary structure and source code locations. Fix: Set 'debug = false' or use 'debug = 1' in release profile to strip debug symbols unless debugging capabilities are explicitly needed in production builds.
  • Low · Potential Path Traversal in File Operations — src/cmd/split.rs, src/cmd/join.rs, src/cmd/index.rs, ci/before_deploy.sh. The codebase includes file manipulation commands (split, join, index) and CI/deployment scripts (ci/before_deploy.sh, scripts/github-release) that may handle user-provided file paths. Without visible input validation in the file structure, there's potential for path traversal attacks. Fix: Ensure all file path inputs are validated and sanitized. Use path canonicalization and whitelist allowed directories. Review command implementations for proper path validation.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Healthy signals · BurntSushi/xsv — RepoPilot