hartator/wayback-machine-downloader
Download an entire website from the Wayback Machine.
Stale — last commit 2y ago
Worst of 4 axes: non-standard license (Other); last commit was 2y ago
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ 9 active contributors
- ✓ Other licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Concentrated ownership — top contributor handles 72% of recent commits
- ⚠ Non-standard license (Other) — review terms
What would change the summary?
- Use as dependency: Concerns → Mixed, if the license terms are clarified
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Forkable" badge
Paste into your README — live-updates from the latest cached analysis.
Paste at the top of your README.md — it renders inline like a shields.io badge and links to https://repopilot.app/r/hartator/wayback-machine-downloader.
Social card preview (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/hartator/wayback-machine-downloader on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: hartator/wayback-machine-downloader
Generated by RepoPilot · 2026-05-10 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/hartator/wayback-machine-downloader shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Stale — last commit 2y ago
- 9 active contributors
- Other licensed
- CI configured
- Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Concentrated ownership — top contributor handles 72% of recent commits
- ⚠ Non-standard license (Other) — review terms
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live hartator/wayback-machine-downloader
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/hartator/wayback-machine-downloader.
What it runs against: a local clone of hartator/wayback-machine-downloader — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in hartator/wayback-machine-downloader | Confirms the artifact applies here, not a fork |
| 2 | License is still Other | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 852 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of hartator/wayback-machine-downloader. If you don't
# have one yet, run these first:
#
# git clone https://github.com/hartator/wayback-machine-downloader.git
# cd wayback-machine-downloader
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of hartator/wayback-machine-downloader and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "hartator/wayback-machine-downloader(\.git)?\b" \
  && ok "origin remote is hartator/wayback-machine-downloader" \
  || miss "origin remote is not hartator/wayback-machine-downloader (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Other\"" package.json 2>/dev/null) \
  && ok "license is Other" \
  || miss "license drift — was Other at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "bin/wayback_machine_downloader" \
  && ok "bin/wayback_machine_downloader" \
  || miss "missing critical file: bin/wayback_machine_downloader"
test -f "lib/wayback_machine_downloader.rb" \
  && ok "lib/wayback_machine_downloader.rb" \
  || miss "missing critical file: lib/wayback_machine_downloader.rb"
test -f "lib/wayback_machine_downloader/archive_api.rb" \
  && ok "lib/wayback_machine_downloader/archive_api.rb" \
  || miss "missing critical file: lib/wayback_machine_downloader/archive_api.rb"
test -f "lib/wayback_machine_downloader/to_regex.rb" \
  && ok "lib/wayback_machine_downloader/to_regex.rb" \
  || miss "missing critical file: lib/wayback_machine_downloader/to_regex.rb"
test -f "lib/wayback_machine_downloader/tidy_bytes.rb" \
  && ok "lib/wayback_machine_downloader/tidy_bytes.rb" \
  || miss "missing critical file: lib/wayback_machine_downloader/tidy_bytes.rb"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 852 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~822d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/hartator/wayback-machine-downloader"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
A Ruby CLI tool that downloads entire websites from the Internet Archive Wayback Machine, reconstructing the original directory structure and creating index.html files so the result can be served directly with Apache/Nginx. It fetches the latest snapshot of every archived file, or optionally all snapshots within a date range, using the Wayback Machine's CDX API to discover available captures.

The gem has a simple CLI-oriented structure: bin/wayback_machine_downloader is the executable entry point; lib/wayback_machine_downloader.rb contains the main logic; supporting modules in lib/wayback_machine_downloader/ handle specific tasks (archive_api.rb queries CDX, tidy_bytes.rb sanitizes filenames, to_regex.rb converts filters to patterns).
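The CDX discovery step can be sketched in plain Ruby. This is an illustration, not the repo's code: the field names (`timestamp`, `original`) match the public CDX API's `output=json` format, but the exact query archive_api.rb builds may differ.

```ruby
require "json"

# With output=json, the CDX API (https://web.archive.org/cdx/search/cdx)
# returns an array of rows: the first row is a header naming the fields,
# the remaining rows are captures.
SAMPLE_CDX_RESPONSE = <<~JSON
  [["timestamp","original"],
   ["20200101000000","http://example.com/index.html"],
   ["20210615120000","http://example.com/index.html"],
   ["20200301000000","http://example.com/about.html"]]
JSON

# Keep only the most recent capture per original URL — the tool's
# default behaviour of fetching "the latest snapshot of every file".
def latest_captures(cdx_json)
  header, *rows = JSON.parse(cdx_json)
  ts_idx  = header.index("timestamp")
  url_idx = header.index("original")
  rows.group_by { |row| row[url_idx] }
      .transform_values { |captures| captures.max_by { |row| row[ts_idx] }[ts_idx] }
end

latest_captures(SAMPLE_CDX_RESPONSE)
# => {"http://example.com/index.html"=>"20210615120000",
#     "http://example.com/about.html"=>"20200301000000"}
```

Timestamps in `YYYYMMDDHHMMSS` form sort correctly as strings, which is why `max_by` needs no date parsing here.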
👥Who it's for
Web archivists, digital preservation specialists, and developers who need to recover lost websites or migrate content from archived snapshots. Also useful for researchers analyzing historical website versions and for sites wanting to restore their web presence after outages.
🌱Maturity & risk
Production-ready but lightly maintained. The project has a Travis CI setup (.travis.yml), basic test coverage (test/test_wayback_machine_downloader.rb), and is published as a RubyGem. However, the codebase is relatively small (26KB of Ruby) and development appears dormant; the last commit was roughly two years ago.
Low risk for core functionality but single-maintainer (hartator) creates bus factor risk. The Gemfile and Dockerfile suggest some dependencies, but the small codebase minimizes complexity. Main risk: reliance on Wayback Machine API stability and potential breaking changes in Internet Archive's CDX API format, which isn't directly versioned in this repo.
Active areas of work
No specific active development visible from the file list alone. The presence of Dockerfile and Gemfile suggests containerization support was added at some point, but commit recency is unknown from the provided data.
🚀Get running
git clone https://github.com/hartator/wayback-machine-downloader && cd wayback-machine-downloader && bundle install && bundle exec bin/wayback_machine_downloader http://example.com
Daily commands: After bundle install, run: wayback_machine_downloader http://example.com [options]. Or via bundler: bundle exec bin/wayback_machine_downloader http://example.com. Docker alternative: docker build . then docker run with the URL as argument.
🗺️Map of the codebase
- bin/wayback_machine_downloader — Entry point executable that parses CLI arguments and orchestrates the download workflow; every contributor must understand how options flow into the library.
- lib/wayback_machine_downloader.rb — Main library class containing the core download logic and orchestration; the heart of the codebase that all features depend on.
- lib/wayback_machine_downloader/archive_api.rb — Wayback Machine API client handling all HTTP requests to the Internet Archive; any API or retry-logic changes go here.
- lib/wayback_machine_downloader/to_regex.rb — URL pattern-matching utility for filtering which files to download; critical for include/exclude logic.
- lib/wayback_machine_downloader/tidy_bytes.rb — File encoding and byte-cleaning utility ensuring downloaded files are properly handled; prevents corruption of binary and text files.
- wayback_machine_downloader.gemspec — Gem specification defining dependencies, metadata, and versioning; required for packaging and distribution.
- test/test_wayback_machine_downloader.rb — Primary test suite validating core download logic, API interactions, and utility functions; must pass before any release.
🧩Components & responsibilities
- CLI Executable (Ruby OptionParser, ARGV) — parse command-line arguments, handle user input, display progress/results
  - Failure mode: invalid arguments or a missing URL produce an error message; the user exits without a download
- Download Orchestrator (Ruby file I/O, array processing) — manage the overall workflow: query snapshots, filter, download, organize files, create indexes
  - Failure mode: API failures or disk errors halt the download; a partial directory structure may remain on disk
- Wayback Machine API Client (Net::HTTP, JSON parsing) — handle all HTTP requests to the Internet Archive, parse JSON responses, manage retries
  - Failure mode: network timeouts or API unavailability stop snapshot retrieval; there is no fallback mechanism
- URL Filtering Engine (Ruby regex via the to_regex utility) — evaluate include/exclude regex patterns to decide which files to download
  - Failure mode: an invalid regex pattern raises an exception; the user must fix the pattern and restart
- File Encoding Cleaner (Ruby string encoding, byte manipulation) — fix encoding issues and byte-level corruption in downloaded files for readability
  - Failure mode: corrupted encoding may persist in output files; data loss is unlikely but readability is impaired
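The File Encoding Cleaner's job can be approximated with Ruby's stdlib; the real tidy_bytes.rb does finer-grained, CP1252-aware byte repair, so treat this as a behavioural sketch only.

```ruby
# Force a byte string into valid UTF-8, replacing undecodable bytes.
# String#scrub (stdlib, Ruby >= 2.1) replaces invalid byte sequences
# with the given string instead of raising.
def clean_utf8(bytes)
  bytes.dup.force_encoding("UTF-8").scrub("?")
end

clean_utf8("caf\xE9".b)  # 0xE9 is not valid UTF-8 here, so it becomes "?"
clean_utf8("café".b)     # already valid UTF-8; returned unchanged
```

Note that `String#encode("UTF-8", invalid: :replace)` would not work here: encoding to the string's own encoding is a no-op in Ruby, which is why `scrub` is the right stdlib tool.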
🔀Data flow
- CLI Arguments → Download Orchestrator — user provides base URL, date filters, include/exclude patterns, output directory
- Download Orchestrator → Wayback Machine API — query available snapshots for the target URL within the date range
- Wayback Machine API → Download Orchestrator — returns a list of snapshot timestamps; the orchestrator selects which to download
- Download Orchestrator → URL Filtering Engine — each discovered file URL is tested against include/exclude patterns
- URL Filtering Engine → Download Orchestrator — returns a boolean decision (download or skip) for each file
- Download Orchestrator → Wayback Machine API — for each accepted file, request the original (non-rewritten) file
🛠️How to make changes
Add a new URL filtering pattern
- Review existing include/exclude logic in the main download method (lib/wayback_machine_downloader.rb)
- Add pattern-matching logic using the to_regex utility (lib/wayback_machine_downloader/to_regex.rb)
- Add CLI flag support in the executable (bin/wayback_machine_downloader)
- Write tests for the new pattern-matching behavior (test/test_wayback_machine_downloader.rb)
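The //-notation conversion these steps rely on can be sketched as follows; this helper is hypothetical, and the real to_regex.rb may handle more cases.

```ruby
# Convert a CLI filter argument into a Regexp: a "/.../"-wrapped value
# is treated as a regex literal (the "// notation" the --only and
# --exclude options accept), anything else as a literal substring.
def filter_to_regex(arg)
  if arg.length > 2 && arg.start_with?("/") && arg.end_with?("/")
    Regexp.new(arg[1..-2])
  else
    Regexp.new(Regexp.escape(arg))
  end
end

filter_to_regex("/\\.jpe?g$/")  # => /\.jpe?g$/
filter_to_regex("admin")        # => /admin/
```

Escaping non-regex filters matters: without `Regexp.escape`, a filter like `a.b` would also match `axb`.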
Add support for a new Wayback Machine API endpoint
- Add a new API method to the archive_api client (lib/wayback_machine_downloader/archive_api.rb)
- Integrate the new API call into the main download orchestration logic (lib/wayback_machine_downloader.rb)
- Add corresponding tests for the new API integration (test/test_wayback_machine_downloader.rb)
Add a new file encoding or format handler
- Review existing encoding utilities and file-cleanup logic (lib/wayback_machine_downloader/tidy_bytes.rb)
- Extend tidy_bytes or create a new format handler for the specific file type (lib/wayback_machine_downloader/tidy_bytes.rb)
- Call the new handler from the main download logic where files are written (lib/wayback_machine_downloader.rb)
- Add test cases for the new format handler (test/test_wayback_machine_downloader.rb)
🔧Why these technologies
- Ruby — Lightweight, portable scripting language ideal for CLI tools with built-in HTTP and file I/O capabilities
- Internet Archive Wayback Machine API — Only reliable public source for historical website snapshots; provides structured API for querying and accessing archived content
- Regex-based URL filtering — Allows flexible pattern matching for include/exclude rules without heavy dependency overhead
- File system storage — Direct disk writing enables offline browsing and eliminates dependencies on databases or cloud storage
⚖️Trade-offs already made
- Download entire website snapshots rather than on-demand streaming
  - Why: users want complete offline access and recreation of the original site structure
  - Consequence: requires significant disk space and network bandwidth; may take time for large sites
- Re-create directory structure and auto-generate index.html pages
  - Why: makes downloaded sites immediately browsable with Apache/Nginx without additional configuration
  - Consequence: adds complexity to file-organization logic; assumes Unix-style directory structures
- Use only original (non-rewritten) Wayback Machine files
  - Why: ensures links and URLs remain valid and authentic to the original site
  - Consequence: excludes Wayback UI enhancements; URLs must be manually verified for internal consistency
- Single-threaded sequential downloads
  - Why: simpler logic; avoids rate-limiting and API-throttling issues
  - Consequence: slower for very large sites, but better-behaved from the Wayback Machine's perspective
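Lifting the single-threaded default while keeping a cap is what a `-c`/`--concurrency`-style flag implies (the Concepts section mentions one). A stdlib-only sketch of a bounded worker pool, not the tool's actual implementation:

```ruby
# Drain a shared queue with `concurrency` worker threads. One :stop
# sentinel per worker tells each thread when the queue is exhausted.
def each_with_concurrency(items, concurrency: 1)
  queue = Queue.new
  items.each { |item| queue << item }
  concurrency.times { queue << :stop }
  results = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (item = queue.pop) != :stop
        results << yield(item)
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

# Order of results is nondeterministic with concurrency > 1.
each_with_concurrency(%w[a b c d], concurrency: 2) { |u| "got #{u}" }
```

Capping the pool size keeps the tool polite toward the archive while still overlapping network waits.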
🚫Non-goals (don't propose these)
- Does not interact with live websites—only archives from Wayback Machine
- Does not support authentication or login-protected content
- Does not handle dynamic JavaScript-rendered content from original snapshots
- Does not provide real-time incremental updates; downloads entire snapshots at once
- Does not generate static site generators or CDN integration—direct file serving only
🪤Traps & gotchas
- The CDX API may return snapshots in unpredictable order; snapshot-selection logic (e.g., "last version") depends on API response order.
- The --all-timestamps mode can generate enormous file counts; concurrency defaults to 1, but there is no timeout/retry logic for hung downloads.
- File paths are sanitized via tidy_bytes.rb, but the exact rules aren't obvious without reading the code.
- The --exact-url flag requires an exact string match to the URL provided, which can break if trailing slashes differ.
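The trailing-slash gotcha can be neutralized with a tiny normalization step before comparison. This helper is hypothetical and not part of the repo; it only illustrates the safe comparison.

```ruby
# Compare two URLs for --exact-url purposes, ignoring a single
# trailing slash so "…/page" and "…/page/" are treated as equal.
def same_exact_url?(a, b)
  a.chomp("/") == b.chomp("/")
end

same_exact_url?("http://example.com/page", "http://example.com/page/")  # => true
```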
🏗️Architecture
💡Concepts to learn
- CDX API (Capture Index API) — This project entirely depends on querying the Wayback Machine's CDX API to discover snapshots; understanding its endpoint structure, query parameters, and response format is essential for modifying archive_api.rb
- URL-to-filepath conversion — The downloader must reconstruct on-disk directory structures that mirror URL hierarchies, handling query strings, fragments, and special characters; this is the core of tidy_bytes.rb and index.html generation
- Regex-based filtering — The --only and --exclude options use regex patterns (with // notation) to allow flexible matching; understanding to_regex.rb's conversion logic is key to extending filter capabilities
- Concurrent HTTP downloads with concurrency limit — The -c/--concurrency flag controls how many files download in parallel; the implementation must handle thread pools or similar to avoid overwhelming the archive or network
- Directory index auto-generation — The tool creates index.html files in directories lacking them, allowing web servers to serve the reconstructed site without URL rewriting; this requires understanding Apache/Nginx directory behavior
- Timestamp-based snapshot filtering — The --from and --to flags allow restricting downloads to snapshots within a date range (YYYYMMDDHHMMSS format); the implementation must parse and compare timestamps correctly
- Web archive format (original vs. rewritten content) — The Wayback Machine serves rewritten HTML (with modified links) by default; this tool downloads the original unmodified files, requiring knowledge of how to request the raw version from the API
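The URL-to-filepath concept above can be sketched minimally. Assumption: the real logic in lib/wayback_machine_downloader.rb also handles query strings, fragments, and byte sanitization (tidy_bytes.rb), all of which this ignores.

```ruby
require "uri"

# Map an archived URL onto an on-disk path under `root`, appending
# index.html for directory-style URLs so Apache/Nginx can serve the
# reconstructed site without URL rewriting.
def url_to_path(url, root: "websites")
  uri  = URI.parse(url)
  path = uri.path.empty? ? "/" : uri.path
  path += "index.html" if path.end_with?("/")
  File.join(root, uri.host, path)
end

url_to_path("http://example.com/blog/")
# => "websites/example.com/blog/index.html"
url_to_path("http://example.com/css/site.css")
# => "websites/example.com/css/site.css"
```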
🔗Related repos
- jmorataya/wayback-machine — Alternative Python-based Wayback Machine downloader; same use case, different language
- internetarchive/ia-download-service — Official Internet Archive download tool; authoritative implementation by the archive maintainers
- mitmproxy/mitmproxy — Not directly related, but often used alongside for MITM capturing of live traffic to archive (complementary workflow)
- archivebox/ArchiveBox — Comprehensive web-archiving solution that can ingest Wayback snapshots and export/serve archived websites
- user2589/wayback-machine-selenium — Browser-based Wayback Machine downloader using Selenium; handles JavaScript-heavy sites better than this CLI tool
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for lib/wayback_machine_downloader/archive_api.rb
The archive_api.rb module is critical for fetching snapshots from the Wayback Machine API, but test/test_wayback_machine_downloader.rb appears to lack specific test coverage for API interaction, error handling, and edge cases like rate limiting, malformed responses, or network failures. This would improve reliability and make future refactoring safer.
- [ ] Create test cases in test/test_wayback_machine_downloader.rb for archive_api.rb methods (query, fetch_snapshot_list, etc.)
- [ ] Add mock tests for API responses (valid JSON, empty results, API errors)
- [ ] Test edge cases: rate limiting, timeout handling, invalid URLs passed to API
- [ ] Test that tidy_bytes.rb and to_regex.rb are correctly integrated with archive_api.rb
Migrate from Travis CI (.travis.yml) to GitHub Actions workflow
The repo uses .travis.yml for CI, but GitHub Actions is now the standard for GitHub repos, offering better integration, faster execution, and no external service dependency. This modernizes the workflow and reduces maintenance burden.
- [ ] Create .github/workflows/test.yml with Ruby matrix testing (1.9.2+, and modern versions)
- [ ] Ensure the workflow runs: bundle install, rake test, and any linting tasks
- [ ] Add gem build test to catch gemspec issues early
- [ ] Remove or deprecate .travis.yml with a note in README
Add integration tests for URL filtering in lib/wayback_machine_downloader.rb with real Wayback Machine scenarios
The repo uses to_regex.rb for URL pattern matching and filtering, but there are no visible tests for complex scenarios like wildcard filtering, exclude patterns, or the interaction between URL filters and actual Wayback Machine results. Adding these would prevent regressions in the core download logic.
- [ ] Create test cases in test/test_wayback_machine_downloader.rb that test URL pattern matching against sample Wayback snapshots
- [ ] Test wildcard patterns (e.g., '.jpg', '/admin/') against realistic snapshot lists
- [ ] Test exclusion patterns and verify they correctly filter out unwanted URLs
- [ ] Add regression tests for known edge cases (query parameters, fragments, internationalized URLs)
🌿Good first issues
- Add integration tests in test/test_wayback_machine_downloader.rb that mock CDX API responses for edge cases: empty snapshots, malformed URLs, and timestamps outside the archive range.
- Document the exact filename-sanitization rules applied by lib/wayback_machine_downloader/tidy_bytes.rb in a code comment or README section, since it's non-obvious how paths like foo/../bar or special characters are handled.
- Add a --dry-run flag that performs full API queries and file discovery but skips actual downloads, useful for previewing what would be downloaded without writing to disk.
⭐Top contributors
📝Recent commits
- 653b94b — Merge pull request #234 from giovanni-cutri/master (hartator)
- c805108 — Fix small typo (giovanni-cutri)
- cf770c2 — Bump gem version (hartator)
- fe9893e — Merge pull request #192 from pabs3/uri-open-compat (hartator)
- 9da87bf — Make URI#open cross Ruby versions compatible (pabs3)
- 66ff4d9 — Merge pull request #188 from pabs3/fixes (hartator)
- 83b4f88 — Bump Gem version (hartator)
- 30475c5 — Make URI#open cross Ruby versions compatible (hartator)
- ba4ca60 — Do not emit a comma for the final item in JSON output (pabs3)
- 06e2595 — Print progress messages to stderr when printing JSON (pabs3)
🔒Security observations
- Critical · Outdated Ruby version in Docker — Dockerfile. The Dockerfile uses Ruby 2.3, which reached end-of-life on March 31, 2019; it no longer receives security patches and contains multiple known vulnerabilities. Fix: update 'FROM ruby:2.3' to 'FROM ruby:3.2' or the latest stable version.
- High · No dependency lock-file pinning — Gemfile / Dockerfile. No Gemfile.lock content was available in the analysis context. Without a lock file, the Docker build can install vulnerable gem versions, and dependencies may silently update to releases with security issues. Fix: commit Gemfile.lock to version control and use it during Docker builds; run 'bundle lock' to generate it and consider 'bundle install --frozen' in Docker to enforce locked versions.
- High · Unsafe URL handling and potential SSRF — lib/wayback_machine_downloader.rb, lib/wayback_machine_downloader/archive_api.rb. The application downloads files based on user-supplied URLs without visible validation, which could enable Server-Side Request Forgery (SSRF) against internal networks or retrieval of unintended resources. Fix: validate URLs: whitelist allowed domains, allow only http/https schemes, reject localhost/127.0.0.1/internal IPs, and implement request timeouts.
- High · Path traversal vulnerability — lib/wayback_machine_downloader.rb. The application reconstructs directory structures from Wayback Machine URLs. Without proper sanitization, malicious URLs with '../' sequences could write files outside the intended directory (e.g., '../../etc/passwd'). Fix: sanitize all path components extracted from URLs; use secure path-joining methods, validate against traversal patterns, and reject paths containing '..', absolute paths, or null bytes.
- Medium · No input validation on command-line arguments — bin/wayback_machine_downloader. The script accepts user input without visible validation; arguments could contain shell metacharacters or malicious input. Fix: validate all command-line arguments with Ruby's built-in methods and reject arguments containing shell metacharacters.
- Medium · Unencrypted HTTP support — README.md, lib/wayback_machine_downloader.rb. The README examples use http:// URLs; downloading over unencrypted HTTP exposes data in transit to man-in-the-middle attacks. Fix: enforce HTTPS-only connections; validate that URLs use the https:// scheme and reject http:// except for localhost testing.
- Medium · Missing security headers in downloaded content — lib/wayback_machine_downloader.rb. When serving or processing downloaded HTML locally, there is no protection against XSS from archived content that may contain malicious scripts. Fix: sanitize processed HTML (e.g., with the Sanitize gem); if serving files, set Content-Security-Policy and X-Content-Type-Options headers.
- Low · Missing security scanning in CI/CD — .travis.yml. CI is configured, but no security scanning tools (Brakeman, bundle audit, OWASP checks) are mentioned. Fix: add 'bundle audit' for dependency vulnerabilities and 'brakeman' for code vulnerabilities to the pipeline.
- Low · No rate limiting — lib/wayback_machine_downloader/archive_api.rb. The application may issue many Wayback Machine API requests without rate limiting, risking service impact or IP blocking. Fix: add configurable delays between requests and exponential backoff for failed requests.
LLM-derived; treat as a starting point, not a security audit.
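As a concrete starting point for the path-traversal item, a containment check along these lines rejects escaping paths. This is a sketch under the stated assumptions, not the project's code.

```ruby
# Reject any reconstructed path that escapes the destination root.
# File.expand_path resolves ".." segments, so a prefix check on the
# expanded paths catches "../../etc/passwd"-style components.
def safe_join(root, relative)
  root_abs  = File.expand_path(root)
  candidate = File.expand_path(File.join(root_abs, relative))
  unless candidate == root_abs || candidate.start_with?(root_abs + File::SEPARATOR)
    raise ArgumentError, "path escapes #{root}: #{relative}"
  end
  candidate
end

safe_join("websites/example.com", "blog/index.html")      # ok, returns absolute path
# safe_join("websites/example.com", "../../etc/passwd")   # raises ArgumentError
```

Appending `File::SEPARATOR` before the prefix check matters: without it, a root of `websites/example.com` would wrongly accept `websites/example.com-evil`.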
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.