RepoPilot

pkolaczk/fclones

Efficient Duplicate File Finder

Mixed

Stale — last commit 1y ago

weakest axis
Use as dependency: Mixed

last commit was 1y ago; no tests detected…

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Mixed

last commit was 1y ago; no CI workflows detected

  • 14 active contributors
  • MIT licensed
  • Stale — last commit 1y ago
  • Concentrated ownership — top contributor handles 60% of recent commits
  • No CI workflows detected
  • No test directory detected
What would change the summary?
  • Use as dependency: Mixed → Healthy if: at least 1 commit in the last 365 days; add a test suite
  • Deploy as-is: Mixed → Healthy if: at least 1 commit in the last 180 days

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/pkolaczk/fclones?axis=fork)](https://repopilot.app/r/pkolaczk/fclones)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/pkolaczk/fclones on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: pkolaczk/fclones

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/pkolaczk/fclones shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Stale — last commit 1y ago

  • 14 active contributors
  • MIT licensed
  • ⚠ Stale — last commit 1y ago
  • ⚠ Concentrated ownership — top contributor handles 60% of recent commits
  • ⚠ No CI workflows detected
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live pkolaczk/fclones repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/pkolaczk/fclones.

What it runs against: a local clone of pkolaczk/fclones — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in pkolaczk/fclones | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 461 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>pkolaczk/fclones</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of pkolaczk/fclones. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/pkolaczk/fclones.git
#   cd fclones
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of pkolaczk/fclones and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "pkolaczk/fclones(\.git)?\b" \
  && ok "origin remote is pkolaczk/fclones" \
  || miss "origin remote is not pkolaczk/fclones (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in fclones/src/main.rs fclones/src/walk.rs fclones/src/hasher.rs \
         fclones/src/group.rs fclones/src/phase.rs; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 461 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~431d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/pkolaczk/fclones"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

fclones is a high-performance duplicate file finder written in Rust that identifies and optionally deduplicates files by content hash across filesystems. It uses multi-threaded scanning with configurable hashing strategies (MD5, SHA1, SHA256) and supports advanced filtering via regex patterns, path transformations, and device-aware traversal to efficiently locate duplicate content in large directory hierarchies. Workspace monorepo (Cargo.toml members: fclones, gen-test-files) with the main binary at fclones/src/main.rs orchestrating a pipeline: walk.rs traverses directories → file.rs groups by size → hasher.rs computes content hashes → group.rs clusters duplicates → selector.rs picks candidates → dedupe.rs performs reflink/hardlink/copy operations. Config flows through config.rs; output via report.rs.
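The multi-phase pipeline described above can be sketched in miniature. This is an illustrative toy (in-memory byte strings, std's DefaultHasher, made-up file names), not fclones' actual code, which streams real files and uses faster configurable hashes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn main() {
    // Hypothetical in-memory "files": (name, contents).
    let files: Vec<(&str, &[u8])> = vec![
        ("a.txt", b"hello world".as_slice()),
        ("b.txt", b"hello world".as_slice()), // duplicate of a.txt
        ("c.txt", b"hello moon!".as_slice()), // same size + prefix, different content
        ("d.txt", b"short".as_slice()),       // unique size: eliminated in phase 1
    ];

    // Phase 1: group by size — files of different sizes can never be duplicates.
    let mut by_size: HashMap<usize, Vec<(&str, &[u8])>> = HashMap::new();
    for (name, data) in files {
        by_size.entry(data.len()).or_default().push((name, data));
    }

    // Phases 2 and 3: within a size group, hash a cheap prefix first, then
    // the full content, so most unique files never get read end to end.
    const PREFIX: usize = 4; // tiny for the demo; the real tool reads larger prefixes
    for group in by_size.values().filter(|g| g.len() > 1) {
        let mut by_prefix: HashMap<u64, Vec<(&str, &[u8])>> = HashMap::new();
        for &(name, data) in group {
            let cut = data.len().min(PREFIX);
            by_prefix.entry(digest(&data[..cut])).or_default().push((name, data));
        }
        for candidates in by_prefix.values().filter(|g| g.len() > 1) {
            let mut by_full: HashMap<u64, Vec<&str>> = HashMap::new();
            for &(name, data) in candidates {
                by_full.entry(digest(data)).or_default().push(name);
            }
            for dupes in by_full.values().filter(|g| g.len() > 1) {
                println!("duplicates: {:?}", dupes);
            }
        }
    }
}
```

Note that "hello world" and "hello moon!" survive phases 1 and 2 (same size, same 4-byte prefix) and are only separated by the full-content hash, which is exactly the case the multi-phase design optimizes.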

👥Who it's for

System administrators and power users managing large storage systems (NAS, servers, workstations) who need to reclaim disk space by identifying duplicate files across multiple directories or drives, and want a faster, more memory-efficient alternative to tools like fdupes or rdfind.

🌱Maturity & risk

Mature and well-organized, though aging: the codebase is 428KB of Rust, uses LTO and panic=abort for release builds, includes CircleCI configuration in .circleci/config.yml, and has a workspace structure with both the main crate (fclones) and a test utility (gen-test-files). The architecture is clean, with dedicated modules for caching, deduplication, and progress tracking — but weigh this against the headline maintenance signal that the last commit was roughly a year ago.

Standard open source risks apply.

Active areas of work

Unable to infer from repo data alone (no visible recent commits or PR list in file structure), but the presence of reflink.rs, semaphore.rs, and rlimit.rs suggests active focus on filesystem efficiency, resource limiting, and parallelism. The pattern.rs and transform.rs modules hint at ongoing improvements to filtering and path manipulation.

🚀Get running

```bash
git clone https://github.com/pkolaczk/fclones.git
cd fclones
cargo build --release
./target/release/fclones --help
```

Daily commands:

```bash
cargo build --release
./target/release/fclones /path/to/scan
cargo run --release -- /path/to/scan   # build and run in one step
cargo test                             # run test suite
```

🗺️Map of the codebase

  • fclones/src/main.rs — Application entry point that orchestrates the entire duplicate-finding pipeline and CLI argument handling.
  • fclones/src/walk.rs — Core file system traversal logic that recursively discovers files and applies filtering/pattern matching—critical for performance and correctness.
  • fclones/src/hasher.rs — Hash computation engine that groups files by content; the primary mechanism for identifying duplicates.
  • fclones/src/group.rs — Data structures and logic for organizing files into duplicate groups; essential for result aggregation and reporting.
  • fclones/src/phase.rs — Orchestrates multi-phase duplicate detection (grouping by size, partial hash, full hash) and manages the execution pipeline.
  • fclones/src/dedupe.rs — Implements deduplication strategies (reflinks, hardlinks, symlinks) that consume the duplicate groups output.
  • fclones/src/report.rs — Formats and outputs results in multiple formats (text, JSON, etc.); critical for user-facing output.

🛠️How to make changes

Add a New Output Format

  1. Define a new output format variant in the report module's format enum (fclones/src/report.rs)
  2. Implement the formatting logic by extending the format match statement in report.rs with your new format case (fclones/src/report.rs)
  3. Add a new argument variant to the CLI args to allow users to select your format (fclones/src/arg.rs)
  4. Thread the output format from config through main.rs to the report generation function (fclones/src/main.rs)
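As a hedged sketch of steps 1, 2, and 4: the enum, struct, and function names below are invented for illustration; fclones' real types in report.rs differ.

```rust
// Illustrative only — not fclones' actual report.rs types.
#[derive(Clone, Copy)]
enum OutputFormat {
    Text,
    Json,
    Csv, // step 1: the hypothetical new variant
}

struct Group<'a> {
    file_len: u64,
    files: Vec<&'a str>,
}

fn write_group(format: OutputFormat, g: &Group) -> String {
    match format {
        OutputFormat::Text => format!("{} bytes: {}", g.file_len, g.files.join(", ")),
        OutputFormat::Json => format!(
            "{{\"len\":{},\"files\":[{}]}}",
            g.file_len,
            g.files
                .iter()
                .map(|f| format!("\"{f}\""))
                .collect::<Vec<_>>()
                .join(",")
        ),
        // Step 2: the new arm added to the existing match.
        OutputFormat::Csv => g
            .files
            .iter()
            .map(|f| format!("{},{}", g.file_len, f))
            .collect::<Vec<_>>()
            .join("\n"),
    }
}

fn main() {
    // Step 4 in miniature: the selected format is threaded into reporting.
    let g = Group { file_len: 11, files: vec!["a.txt", "b.txt"] };
    println!("{}", write_group(OutputFormat::Csv, &g)); // 11,a.txt / 11,b.txt
}
```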

Add a New File Filtering Criterion

  1. Define a new filter field in the config struct to store filter parameters (fclones/src/config.rs)
  2. Add CLI argument(s) to accept the filter parameter from users (fclones/src/arg.rs)
  3. Extend the walk.rs visitor logic (ScannerBuilder or filter closure) to apply your new criterion during traversal (fclones/src/walk.rs)
  4. Update the File struct or metadata if additional file properties are needed for the filter (fclones/src/file.rs)
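A hedged illustration of steps 1 and 3, with invented Config and FileEntry types standing in for fclones' real config.rs and file.rs structures:

```rust
// Illustrative only — fclones' real Config and file types differ.
struct Config {
    min_size: u64, // step 1: hypothetical new filter field
}

struct FileEntry {
    path: &'static str,
    len: u64,
}

// Step 3 in miniature: the predicate applied to each entry during traversal.
fn passes(cfg: &Config, f: &FileEntry) -> bool {
    f.len >= cfg.min_size
}

fn main() {
    let cfg = Config { min_size: 1024 };
    let entries = vec![
        FileEntry { path: "big.bin", len: 4096 },
        FileEntry { path: "tiny.txt", len: 12 },
    ];
    // Only entries passing the filter continue down the pipeline.
    let selected: Vec<&str> = entries
        .iter()
        .filter(|f| passes(&cfg, f))
        .map(|f| f.path)
        .collect();
    println!("{:?}", selected); // ["big.bin"]
}
```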

Add a New Deduplication Strategy

  1. Define a new deduplication method variant in the config or add a new module parallel to dedupe.rs (fclones/src/dedupe.rs)
  2. Add a CLI argument to allow users to select your new deduplication strategy (fclones/src/arg.rs)
  3. Implement the core deduplication logic (file operations, error handling) in dedupe.rs or a new module (fclones/src/dedupe.rs)
  4. Update main.rs to route to your new deduplication strategy based on user selection (fclones/src/main.rs)
  5. Use lock.rs utilities to prevent concurrent access conflicts during your deduplication operation (fclones/src/lock.rs)
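A toy dispatch for steps 1 and 4. The variant names are hypothetical, and instead of performing filesystem operations it only describes the roughly equivalent shell command (the real dedupe.rs does the work with proper locking and error handling):

```rust
// Illustrative only — fclones' real strategy enum and dedupe.rs API differ.
enum DedupeOp {
    HardLink,
    RefLink,
    SymLink, // step 1: the hypothetical new strategy
}

// Step 4 in miniature: route each duplicate pair to the selected strategy.
fn describe(op: &DedupeOp, keep: &str, dup: &str) -> String {
    match op {
        DedupeOp::HardLink => format!("ln {keep} {dup}"),
        DedupeOp::RefLink => format!("cp --reflink=always {keep} {dup}"),
        DedupeOp::SymLink => format!("ln -s {keep} {dup}"),
    }
}

fn main() {
    println!("{}", describe(&DedupeOp::RefLink, "a.txt", "b.txt"));
    // cp --reflink=always a.txt b.txt
}
```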

Optimize Hash Phase Thresholds

  1. Identify threshold constants (e.g., minimum file size for partial hash) in phase.rs (fclones/src/phase.rs)
  2. Add configurable parameters to config.rs to control phase behavior (e.g., partial hash size) (fclones/src/config.rs)
  3. Expose the new parameters via CLI arguments (fclones/src/arg.rs)
  4. Thread the config values into phase.rs and update the phase orchestration logic to use them (fclones/src/phase.rs)
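Steps 2 through 4 amount to replacing a hard-coded constant with a config field. A sketch with invented names (fclones' real threshold names and defaults differ):

```rust
// Illustrative only — the field name and default are assumptions.
struct Config {
    max_prefix_len: u64, // steps 2-3: hypothetical CLI-exposed knob
}

// Step 4 in miniature: phase logic reads the threshold from config
// instead of a hard-coded constant.
fn prefix_len(cfg: &Config, file_len: u64) -> u64 {
    file_len.min(cfg.max_prefix_len)
}

fn main() {
    let cfg = Config { max_prefix_len: 4096 };
    assert_eq!(prefix_len(&cfg, 100), 100); // small file: read it all
    assert_eq!(prefix_len(&cfg, 1_000_000), 4096); // large file: capped
    println!("ok");
}
```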

🔧Why these technologies

  • Rust — Memory safety without GC, zero-cost abstractions for I/O-heavy operations, parallel iteration via Rayon, and statically linked binaries for distribution across systems.
  • Multi-phase hashing (size → partial → full) — Dramatically reduces disk I/O by eliminating candidates early; most duplicates are caught by size + partial hash without reading entire files.
  • File locking (flock/fcntl) — Prevents data corruption during concurrent deduplication by serializing access to files being linked or modified.
  • Reflinks (copy-on-write) — Achieves deduplication with zero disk space overhead on modern filesystems (Btrfs, APFS, XFS) without inode limits of hardlinks.
  • Semaphore-controlled I/O — Limits concurrent file descriptors and memory usage during parallel traversal and hashing to prevent resource exhaustion.
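The semaphore idea above can be demonstrated with std-only primitives: a channel pre-loaded with N permits acts as a counting semaphore, so at most N workers hold a "file descriptor" at once. This is illustrative, not how fclones' semaphore.rs is implemented:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Spawn `workers` threads, but let at most `permits` of them run their
// bounded section concurrently. Returns the sum of squares 0^2..(n-1)^2
// as a stand-in for per-file work like hashing.
fn bounded_sum_of_squares(workers: i32, permits: usize) -> i32 {
    let (tx, rx) = mpsc::channel::<()>();
    for _ in 0..permits {
        tx.send(()).unwrap(); // pre-load the permit pool
    }
    let rx = Arc::new(Mutex::new(rx));

    let mut handles = Vec::new();
    for i in 0..workers {
        let rx = Arc::clone(&rx);
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Acquire: blocks until a permit is available.
            rx.lock().unwrap().recv().unwrap();
            let result = i * i; // the bounded work (e.g. hashing one file)
            // Release: return the permit for the next waiting worker.
            tx.send(()).unwrap();
            result
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 0 + 1 + 4 + 9 + 16 = 30, computed with at most 2 workers "open" at once.
    println!("sum = {}", bounded_sum_of_squares(5, 2));
}
```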

⚖️Trade-offs already made

  • Multi-phase hashing with progressive pruning

    • Why: Balances speed vs. correctness: early phases avoid expensive full-file reads but require multiple passes.
    • Consequence: Faster for large datasets with few true duplicates, but overhead for small datasets with many duplicates.
  • In-memory grouping and caching of hashes

    • Why: Fast duplicate detection without persistent database, but limits scalability to very large filesets.
    • Consequence: Works well for typical use (millions of files), but memory footprint grows with number of unique files.
  • Hardlinks/symlinks as primary deduplication method

    • Why: Universal filesystem support with instant space reclamation
    • Consequence: all linked copies share one inode, so modifying one "copy" silently changes the others, and links cannot cross filesystem boundaries.

🪤Traps & gotchas

  • Reflink support is filesystem-dependent: dedupe.rs attempts reflinks first but silently falls back to hardlinks or copies if unsupported (Btrfs, XFS, APFS supported; NTFS/ext4 not); users must understand this trade-off.
  • Resource limits matter: rlimit.rs may need tuning on systems with many open files (dedup on huge trees).
  • Cross-device dedup: device.rs logic prevents hardlinking across filesystems, which may surprise users.
  • No atomic dedup: if dedupe.rs crashes mid-operation, cleanup may be incomplete.
  • Cache invalidation: cache.rs may serve stale hashes if files are modified externally between runs.
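The cache-invalidation trap suggests the standard defense: key cached hashes on (size, mtime) and reject entries whose metadata has drifted. A sketch with an invented CachedHash type (fclones' real cache keys in cache.rs may differ):

```rust
use std::fs;
use std::time::SystemTime;

// Hypothetical cache entry: metadata observed when the hash was computed.
struct CachedHash {
    mtime: SystemTime,
    len: u64,
}

// A cached hash is stale if the file's current size or mtime no longer
// matches what was recorded at hashing time.
fn is_stale(cached: &CachedHash, len: u64, mtime: SystemTime) -> bool {
    len != cached.len || mtime != cached.mtime
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("fclones_cache_demo.txt");
    fs::write(&path, b"hello")?;
    let meta = fs::metadata(&path)?;
    let cached = CachedHash { mtime: meta.modified()?, len: meta.len() };

    let m1 = fs::metadata(&path)?;
    println!("stale now: {}", is_stale(&cached, m1.len(), m1.modified()?)); // false

    fs::write(&path, b"hello, modified externally")?; // external modification
    let m2 = fs::metadata(&path)?;
    // true: at minimum the length changed (5 -> 26 bytes)
    println!("stale after rewrite: {}", is_stale(&cached, m2.len(), m2.modified()?));

    fs::remove_file(&path)?;
    Ok(())
}
```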

🏗️Architecture

💡Concepts to learn

  • Content-addressable deduplication — fclones groups files by content hash (not filename); understanding that two files with identical MD5/SHA256 are guaranteed duplicates is core to the algorithm
  • Reflink / Copy-on-Write (CoW) — reflink.rs implements CoW deduplication for modern filesystems (Btrfs, XFS, APFS); this is the fastest dedup method, saving space without hardlink constraints
  • Two-pass hashing (size + content hash) — fclones first groups by file size to avoid hashing unique files; hasher.rs only hashes within size-matched groups, a major performance optimization
  • Hardlinks — dedupe.rs falls back to hardlinks when reflinks unavailable; hardlinks share inodes but add directory entry overhead and cross-device limitations
  • Semaphore-based rate limiting — semaphore.rs and rlimit.rs bound concurrent I/O to prevent file descriptor exhaustion on large scans; critical for system stability
  • Device-aware traversal — device.rs prevents hardlinking across mounted filesystems (POSIX constraint); understanding inode isolation is key to correct dedup behavior
  • Bloom filter / early exclusion — If implemented in cache.rs or hasher.rs, Bloom filters quickly reject files that couldn't be duplicates (e.g., by size bounds); check for probabilistic filtering
  • sahib/rsfind — Alternative Rust duplicate finder with similar goals; useful for comparing performance and feature trade-offs
  • adrianlopezperez/fdupes — Original C-based fdupes tool; fclones offers the same core functionality as a faster, more parallel Rust alternative
  • ggreer/the_silver_searcher — Fast code-search tool written in C; its directory-walk and pattern-matching optimizations are comparable to fclones' walk.rs, pattern.rs, and regex.rs
  • BurntSushi/ripgrep — High-performance text search in Rust with excellent regex and I/O patterns; architectural inspiration for walk.rs and caching strategies
  • shundhammer/qdirstat — Interactive disk usage visualization tool; often used in tandem with fclones to find and understand duplicate clusters

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for deduplication strategies in src/dedupe.rs

The dedupe.rs module handles critical file deduplication logic, but there are no visible integration tests validating the deduplication process across different file scenarios (hardlinks, reflinks, copies). This is high-risk code that deserves comprehensive test coverage for edge cases like permission errors, filesystem-specific behavior, and concurrent deduplication.

  • [ ] Create tests/dedupe_integration.rs with test fixtures using gen-test-files
  • [ ] Test hardlink deduplication on various filesystems (using src/reflink.rs and src/device.rs helpers)
  • [ ] Test reflink deduplication where supported (validate src/reflink.rs integration)
  • [ ] Test error handling scenarios (permission denied, readonly filesystems, cross-device deduplication)
  • [ ] Validate src/selector.rs logic with different selection strategies for duplicate groups

Add property-based tests for hash collision handling in src/hasher.rs and src/group.rs

The hasher.rs module computes file hashes and group.rs manages duplicate groups by hash. Property-based testing with arbitrary file content would catch edge cases in hash collision handling and grouping logic that unit tests might miss, ensuring data integrity when deduplicating large file sets.

  • [ ] Add proptest and quickcheck dependencies to fclones/Cargo.toml
  • [ ] Create tests/hash_properties.rs with property-based tests for src/hasher.rs hash consistency
  • [ ] Test that identical file content always produces identical hashes regardless of read order/buffering
  • [ ] Create tests/group_properties.rs testing src/group.rs grouping logic with generated duplicate file scenarios
  • [ ] Test that group merging and file selection in src/selector.rs maintains invariants across arbitrary inputs
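Even without pulling in proptest, the core property (identical content always hashes identically, regardless of how the bytes were produced) can be sketched with a std-only pseudo-random generator; a real property-based framework would additionally shrink failing inputs:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

// Tiny deterministic xorshift64 generator so the example needs no external
// crates; proptest/quickcheck would generate and shrink inputs for real.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    for _ in 0..1000 {
        let len = (xorshift(&mut state) % 64) as usize;
        let content: Vec<u8> = (0..len).map(|_| xorshift(&mut state) as u8).collect();
        // Property: hashing the same content twice must always agree.
        assert_eq!(digest(&content), digest(&content.clone()));
    }
    println!("1000 random cases passed");
}
```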

Implement CI workflow for multi-platform testing in .circleci/config.yml

The repo has CircleCI config but lacks comprehensive multi-platform testing. Given that fclones handles filesystem-specific features (reflinks, hardlinks, device detection in src/reflink.rs and src/device.rs), testing on Linux, macOS, and ideally BSD variants would catch platform-specific bugs early. The current setup likely only tests one platform.

  • [ ] Expand .circleci/config.yml with separate Linux and macOS jobs
  • [ ] Add reflink support detection tests (src/reflink.rs) that skip on unsupported platforms
  • [ ] Add filesystem-specific tests for device detection (src/device.rs) across platforms
  • [ ] Test packaging.sh and snapcraft.yaml build output on Linux for distribution compatibility
  • [ ] Add resource limit tests (src/rlimit.rs) which behave differently across Unix variants

🌿Good first issues

  • Add integration tests for reflink.rs: currently no test file exists for the critical reflink detection and fallback logic; create tests/reflink_integration.rs with mocked Btrfs/XFS scenarios.
  • Extend report.rs with JSON output format: add --format=json CLI flag that outputs duplicate groups as JSON arrays instead of text; useful for scripting and integration with other tools.
  • Document semaphore.rs concurrency model in lib.rs or a new ARCHITECTURE.md: explain how thread pooling, semaphore-limited I/O, and resource limits interact; currently undocumented.

Top contributors


📝Recent commits

  • a74f90d — Bump up version to 0.35.0 (pkolaczk)
  • c0865f5 — Add shell completion subcommand (ede1998)
  • cd952af — Modernize circleci config (pkolaczk)
  • b7ffc30 — Upgrade clap to 4.5, anstream to 0.6.18 to fix vulnerability (pkolaczk)
  • 886990c — Set minimum rustc version to 1.74 (pkolaczk)
  • eb1e8d5 — Fix clippy lints for 1.85 (pkolaczk)
  • edb2468 — By default, only report progress when stderr is a terminal (th1000s)
  • ba7c5bc — cargo clippy fixes for v1.84 (th1000s)
  • f6d1d76 — cargo clippy --fix for v1.84 (th1000s)
  • 770d8ea — Update config.rs (wcarmon)

🔒Security observations

The fclones duplicate file finder codebase appears to be a well-structured Rust project with moderate security posture. No critical vulnerabilities were identified from the provided static analysis. The main concerns are around dependency management (dependency pinning and vulnerability scanning), potential symlink/path traversal risks in file system operations, and container image pinning in Docker builds. The use of Rust's type system and memory safety features provides inherent protection against many common vulnerability classes. Recommendations focus on implementing best practices around dependency auditing, path validation, and containerization security.

  • Medium · Insufficient Dependency Pinning — Cargo.toml, Cargo.lock. The Cargo.toml workspace configuration does not provide the actual dependency list, but the resolver = "2" indicates workspace dependencies are in use. Without seeing specific dependency versions in the provided content, it's recommended to verify that all external crates are pinned to specific versions rather than using loose constraints (e.g., avoid "*" or overly broad semver ranges). Fix: Ensure Cargo.lock is committed to version control and all dependencies specify exact or minimal semver versions. Run 'cargo audit' regularly to check for known vulnerabilities in dependencies.
  • Low · Potential Path Traversal Risk — fclones/src/path.rs, fclones/src/walk.rs, fclones/src/file.rs. The codebase contains modules related to path handling (path.rs, walk.rs) and file operations (file.rs, dedupe.rs). Given that this is a file duplicate finder that walks directories, there's a potential risk of symlink following or improper path validation leading to unintended file access. Fix: Ensure all path operations use safe APIs that prevent symlink attacks (e.g., follow_symlinks is properly controlled). Validate and sanitize user-provided paths. Use Rust's built-in path security features and avoid following symlinks unless explicitly intended.
  • Low · Panic on Release Build — Cargo.toml [profile.release]. The Cargo.toml sets 'panic = "abort"' in the release profile, which causes the program to abort immediately on panic rather than unwinding. While this can prevent information leaks in some cases, it may also prevent proper resource cleanup (drop handlers won't run). Fix: Review whether 'panic = "abort"' is appropriate for this application. If resource cleanup is critical, consider using 'panic = "unwind"' or ensure all critical resources use RAII patterns. Document the decision if panic abort is intentional.
  • Low · Docker Build Image Not Pinned — packaging/Dockerfile, packaging/build-image.sh. The packaging/Dockerfile references 'build-image.sh' but without seeing its contents, the base Docker image may not be pinned to a specific version, potentially leading to supply chain risks. Fix: Ensure all Docker base images are pinned to specific digest hashes (e.g., 'FROM ubuntu:20.04@sha256:...' instead of 'FROM ubuntu:latest'). Review and pin all build dependencies.
  • Low · Missing Input Validation Documentation — fclones/src/arg.rs, fclones/src/pattern.rs, fclones/src/regex.rs, fclones/src/transform.rs. Modules handling user input (arg.rs, pattern.rs, regex.rs, transform.rs) lack visible security documentation in file names. Input validation and sanitization are critical for a tool that processes user-provided patterns and paths. Fix: Implement strict input validation for all user-provided arguments, patterns, and paths. Add unit tests for boundary conditions and malicious inputs. Document validation logic clearly.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
