hartator/wayback-machine-downloader
Download an entire website from the Wayback Machine.
Stale — last commit 2y ago
Worst of 4 axes: non-standard license (Other); last commit was 2y ago
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ 9 active contributors
- ✓ Other licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Concentrated ownership — top contributor handles 72% of recent commits
- ⚠ Non-standard license (Other) — review terms
What would change the summary?
- Use as dependency: Concerns → Mixed, if the license terms are clarified
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Forkable" badge
Paste into your README — live-updates from the latest cached analysis.
Paste at the top of your README.md — it renders inline like a shields.io badge and links to https://repopilot.app/r/hartator/wayback-machine-downloader.
Social card preview (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/hartator/wayback-machine-downloader on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: hartator/wayback-machine-downloader
Generated by RepoPilot · 2026-05-10 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/hartator/wayback-machine-downloader shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Stale — last commit 2y ago
- 9 active contributors
- Other licensed
- CI configured
- Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Concentrated ownership — top contributor handles 72% of recent commits
- ⚠ Non-standard license (Other) — review terms
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live hartator/wayback-machine-downloader
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/hartator/wayback-machine-downloader.
What it runs against: a local clone of hartator/wayback-machine-downloader — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in hartator/wayback-machine-downloader | Confirms the artifact applies here, not a fork |
| 2 | License is still Other | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 852 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of hartator/wayback-machine-downloader. If you don't
# have one yet, run these first:
#
# git clone https://github.com/hartator/wayback-machine-downloader.git
# cd wayback-machine-downloader
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of hartator/wayback-machine-downloader and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "hartator/wayback-machine-downloader(\.git)?\b" \
  && ok "origin remote is hartator/wayback-machine-downloader" \
  || miss "origin remote is not hartator/wayback-machine-downloader (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Other\"" package.json 2>/dev/null) \
  && ok "license is Other" \
  || miss "license drift — was Other at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
test -f "bin/wayback_machine_downloader" \
  && ok "bin/wayback_machine_downloader" \
  || miss "missing critical file: bin/wayback_machine_downloader"
test -f "lib/wayback_machine_downloader.rb" \
  && ok "lib/wayback_machine_downloader.rb" \
  || miss "missing critical file: lib/wayback_machine_downloader.rb"
test -f "lib/wayback_machine_downloader/archive_api.rb" \
  && ok "lib/wayback_machine_downloader/archive_api.rb" \
  || miss "missing critical file: lib/wayback_machine_downloader/archive_api.rb"
test -f "lib/wayback_machine_downloader/to_regex.rb" \
  && ok "lib/wayback_machine_downloader/to_regex.rb" \
  || miss "missing critical file: lib/wayback_machine_downloader/to_regex.rb"
test -f "lib/wayback_machine_downloader/tidy_bytes.rb" \
  && ok "lib/wayback_machine_downloader/tidy_bytes.rb" \
  || miss "missing critical file: lib/wayback_machine_downloader/tidy_bytes.rb"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 852 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~822d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/hartator/wayback-machine-downloader"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
A Ruby CLI tool that downloads entire websites from the Internet Archive Wayback Machine, reconstructing the original directory structure and creating index.html files so the result can be served directly with Apache/Nginx. It fetches the latest snapshot of every archived file, or optionally all snapshots within a date range, using the Wayback Machine's CDX API to discover available captures.

The gem has a simple CLI-oriented structure: bin/wayback_machine_downloader is the executable entry point; lib/wayback_machine_downloader.rb contains the main logic; supporting modules in lib/wayback_machine_downloader/ handle specific tasks (archive_api.rb queries CDX, tidy_bytes.rb sanitizes filenames, to_regex.rb converts filters to patterns).
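The CDX discovery step can be sketched in plain Ruby. This is an illustration, not the repo's code: the field names (`timestamp`, `original`) match the public CDX API's `output=json` format, but the exact query archive_api.rb builds may differ.

```ruby
require "json"

# With output=json, the CDX API (https://web.archive.org/cdx/search/cdx)
# returns an array of rows: the first row is a header naming the fields,
# the remaining rows are captures.
SAMPLE_CDX_RESPONSE = <<~JSON
  [["timestamp","original"],
   ["20200101000000","http://example.com/index.html"],
   ["20210615120000","http://example.com/index.html"],
   ["20200301000000","http://example.com/about.html"]]
JSON

# Keep only the most recent capture per original URL — the tool's
# default behaviour of fetching "the latest snapshot of every file".
def latest_captures(cdx_json)
  header, *rows = JSON.parse(cdx_json)
  ts_idx  = header.index("timestamp")
  url_idx = header.index("original")
  rows.group_by { |row| row[url_idx] }
      .transform_values { |captures| captures.max_by { |row| row[ts_idx] }[ts_idx] }
end

latest_captures(SAMPLE_CDX_RESPONSE)
# => {"http://example.com/index.html"=>"20210615120000",
#     "http://example.com/about.html"=>"20200301000000"}
```

Timestamps in `YYYYMMDDHHMMSS` form sort correctly as strings, which is why `max_by` needs no date parsing here.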
👥Who it's for
Web archivists, digital preservation specialists, and developers who need to recover lost websites or migrate content from archived snapshots. Also useful for researchers analyzing historical website versions and for sites wanting to restore their web presence after outages.
🌱Maturity & risk
Production-ready but lightly maintained. The project has a Travis CI setup (.travis.yml), basic test coverage (test/test_wayback_machine_downloader.rb), and is published as a RubyGem. However, the codebase is relatively small (26KB of Ruby) and development appears dormant; the last commit was roughly two years ago.
Low risk for core functionality but single-maintainer (hartator) creates bus factor risk. The Gemfile and Dockerfile suggest some dependencies, but the small codebase minimizes complexity. Main risk: reliance on Wayback Machine API stability and potential breaking changes in Internet Archive's CDX API format, which isn't directly versioned in this repo.
Active areas of work
No specific active development visible from the file list alone. The presence of Dockerfile and Gemfile suggests containerization support was added at some point, but commit recency is unknown from the provided data.
🚀Get running
git clone https://github.com/hartator/wayback-machine-downloader && cd wayback-machine-downloader && bundle install && bundle exec bin/wayback_machine_downloader http://example.com
Daily commands: After bundle install, run: wayback_machine_downloader http://example.com [options]. Or via bundler: bundle exec bin/wayback_machine_downloader http://example.com. Docker alternative: docker build . then docker run with the URL as argument.
🗺️Map of the codebase
- bin/wayback_machine_downloader — Entry point executable that parses CLI arguments and orchestrates the download workflow; every contributor must understand how options flow into the library.
- lib/wayback_machine_downloader.rb — Main library class containing the core download logic and orchestration; the heart of the codebase that all features depend on.
- lib/wayback_machine_downloader/archive_api.rb — Wayback Machine API client handling all HTTP requests to the Internet Archive; any API or retry-logic changes go here.
- lib/wayback_machine_downloader/to_regex.rb — URL pattern-matching utility for filtering which files to download; critical for include/exclude logic.
- lib/wayback_machine_downloader/tidy_bytes.rb — File encoding and byte-cleaning utility ensuring downloaded files are properly handled; prevents corruption of binary and text files.
- wayback_machine_downloader.gemspec — Gem specification defining dependencies, metadata, and versioning; required for packaging and distribution.
- test/test_wayback_machine_downloader.rb — Primary test suite validating core download logic, API interactions, and utility functions; must pass before any release.
🧩Components & responsibilities
- CLI Executable (Ruby OptionParser, ARGV) — parse command-line arguments, handle user input, display progress/results
  - Failure mode: invalid arguments or a missing URL produce an error message; the user exits without a download
- Download Orchestrator (Ruby file I/O, array processing) — manage the overall workflow: query snapshots, filter, download, organize files, create indexes
  - Failure mode: API failures or disk errors halt the download; a partial directory structure may remain on disk
- Wayback Machine API Client (Net::HTTP, JSON parsing) — handle all HTTP requests to the Internet Archive, parse JSON responses, manage retries
  - Failure mode: network timeouts or API unavailability stop snapshot retrieval; there is no fallback mechanism
- URL Filtering Engine (Ruby regex via the to_regex utility) — evaluate include/exclude regex patterns to decide which files to download
  - Failure mode: an invalid regex pattern raises an exception; the user must fix the pattern and restart
- File Encoding Cleaner (Ruby string encoding, byte manipulation) — fix encoding issues and byte-level corruption in downloaded files for readability
  - Failure mode: corrupted encoding may persist in output files; data loss is unlikely but readability is impaired
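The File Encoding Cleaner's job can be approximated with Ruby's stdlib; the real tidy_bytes.rb does finer-grained, CP1252-aware byte repair, so treat this as a behavioural sketch only.

```ruby
# Force a byte string into valid UTF-8, replacing undecodable bytes.
# String#scrub (stdlib, Ruby >= 2.1) replaces invalid byte sequences
# with the given string instead of raising.
def clean_utf8(bytes)
  bytes.dup.force_encoding("UTF-8").scrub("?")
end

clean_utf8("caf\xE9".b)  # 0xE9 is not valid UTF-8 here, so it becomes "?"
clean_utf8("café".b)     # already valid UTF-8; returned unchanged
```

Note that `String#encode("UTF-8", invalid: :replace)` would not work here: encoding to the string's own encoding is a no-op in Ruby, which is why `scrub` is the right stdlib tool.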
🔀Data flow
- CLI Arguments → Download Orchestrator — user provides base URL, date filters, include/exclude patterns, output directory
- Download Orchestrator → Wayback Machine API — query available snapshots for the target URL within the date range
- Wayback Machine API → Download Orchestrator — returns a list of snapshot timestamps; the orchestrator selects which to download
- Download Orchestrator → URL Filtering Engine — each discovered file URL is tested against include/exclude patterns
- URL Filtering Engine → Download Orchestrator — returns a boolean decision (download or skip) for each file
- Download Orchestrator → Wayback Machine API — for each accepted file, request the original (non-rewritten) file
🛠️How to make changes
Add a new URL filtering pattern
- Review existing include/exclude logic in the main download method (lib/wayback_machine_downloader.rb)
- Add pattern-matching logic using the to_regex utility (lib/wayback_machine_downloader/to_regex.rb)
- Add CLI flag support in the executable (bin/wayback_machine_downloader)
- Write tests for the new pattern-matching behavior (test/test_wayback_machine_downloader.rb)
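The //-notation conversion these steps rely on can be sketched as follows; this helper is hypothetical, and the real to_regex.rb may handle more cases.

```ruby
# Convert a CLI filter argument into a Regexp: a "/.../"-wrapped value
# is treated as a regex literal (the "// notation" the --only and
# --exclude options accept), anything else as a literal substring.
def filter_to_regex(arg)
  if arg.length > 2 && arg.start_with?("/") && arg.end_with?("/")
    Regexp.new(arg[1..-2])
  else
    Regexp.new(Regexp.escape(arg))
  end
end

filter_to_regex("/\\.jpe?g$/")  # => /\.jpe?g$/
filter_to_regex("admin")        # => /admin/
```

Escaping non-regex filters matters: without `Regexp.escape`, a filter like `a.b` would also match `axb`.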
Add support for a new Wayback Machine API endpoint
- Add a new API method to the archive_api client (lib/wayback_machine_downloader/archive_api.rb)
- Integrate the new API call into the main download orchestration logic (lib/wayback_machine_downloader.rb)
- Add corresponding tests for the new API integration (test/test_wayback_machine_downloader.rb)
Add a new file encoding or format handler
- Review existing encoding utilities and file-cleanup logic (lib/wayback_machine_downloader/tidy_bytes.rb)
- Extend tidy_bytes or create a new format handler for the specific file type (lib/wayback_machine_downloader/tidy_bytes.rb)
- Call the new handler from the main download logic where files are written (lib/wayback_machine_downloader.rb)
- Add test cases for the new format handler (test/test_wayback_machine_downloader.rb)
🔧Why these technologies
- Ruby — Lightweight, portable scripting language ideal for CLI tools with built-in HTTP and file I/O capabilities
- Internet Archive Wayback Machine API — Only reliable public source for historical website snapshots; provides structured API for querying and accessing archived content
- Regex-based URL filtering — Allows flexible pattern matching for include/exclude rules without heavy dependency overhead
- File system storage — Direct disk writing enables offline browsing and eliminates dependencies on databases or cloud storage
⚖️Trade-offs already made
- Download entire website snapshots rather than on-demand streaming
  - Why: users want complete offline access and recreation of the original site structure
  - Consequence: requires significant disk space and network bandwidth; may take time for large sites
- Re-create directory structure and auto-generate index.html pages
  - Why: makes downloaded sites immediately browsable with Apache/Nginx without additional configuration
  - Consequence: adds complexity to file-organization logic; assumes Unix-style directory structures
- Use only original (non-rewritten) Wayback Machine files
  - Why: ensures links and URLs remain valid and authentic to the original site
  - Consequence: excludes Wayback UI enhancements; URLs must be manually verified for internal consistency
- Single-threaded sequential downloads
  - Why: simpler logic; avoids rate-limiting and API-throttling issues
  - Consequence: slower for very large sites, but better-behaved from the Wayback Machine's perspective
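Lifting the single-threaded default while keeping a cap is what a `-c`/`--concurrency`-style flag implies (the Concepts section mentions one). A stdlib-only sketch of a bounded worker pool, not the tool's actual implementation:

```ruby
# Drain a shared queue with `concurrency` worker threads. One :stop
# sentinel per worker tells each thread when the queue is exhausted.
def each_with_concurrency(items, concurrency: 1)
  queue = Queue.new
  items.each { |item| queue << item }
  concurrency.times { queue << :stop }
  results = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (item = queue.pop) != :stop
        results << yield(item)
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

# Order of results is nondeterministic with concurrency > 1.
each_with_concurrency(%w[a b c d], concurrency: 2) { |u| "got #{u}" }
```

Capping the pool size keeps the tool polite toward the archive while still overlapping network waits.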
🚫Non-goals (don't propose these)
- Does not interact with live websites—only archives from Wayback Machine
- Does not support authentication or login-protected content
- Does not handle dynamic JavaScript-rendered content from original snapshots
- Does not provide real-time incremental updates; downloads entire snapshots at once
- Does not generate static site generators or CDN integration—direct file serving only
🪤Traps & gotchas
- The CDX API may return snapshots in unpredictable order; snapshot-selection logic (e.g., "last version") depends on API response order.
- The --all-timestamps mode can generate enormous file counts; concurrency defaults to 1, but there is no timeout/retry logic for hung downloads.
- File paths are sanitized via tidy_bytes.rb, but the exact rules aren't obvious without reading the code.
- The --exact-url flag requires an exact string match to the URL provided, which can break if trailing slashes differ.
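The trailing-slash gotcha can be neutralized with a tiny normalization step before comparison. This helper is hypothetical and not part of the repo; it only illustrates the safe comparison.

```ruby
# Compare two URLs for --exact-url purposes, ignoring a single
# trailing slash so "…/page" and "…/page/" are treated as equal.
def same_exact_url?(a, b)
  a.chomp("/") == b.chomp("/")
end

same_exact_url?("http://example.com/page", "http://example.com/page/")  # => true
```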
🏗️Architecture
💡Concepts to learn
- CDX API (Capture Index API) — This project entirely depends on querying the Wayback Machine's CDX API to discover snapshots; understanding its endpoint structure, query parameters, and response format is essential for modifying archive_api.rb
- URL-to-filepath conversion — The downloader must reconstruct on-disk directory structures that mirror URL hierarchies, handling query strings, fragments, and special characters; this is the core of tidy_bytes.rb and index.html generation
- Regex-based filtering — The --only and --exclude options use regex patterns (with // notation) to allow flexible matching; understanding to_regex.rb's conversion logic is key to extending filter capabilities
- Concurrent HTTP downloads with concurrency limit — The -c/--concurrency flag controls how many files download in parallel; the implementation must handle thread pools or similar to avoid overwhelming the archive or network
- Directory index auto-generation — The tool creates index.html files in directories lacking them, allowing web servers to serve the reconstructed site without URL rewriting; this requires understanding Apache/Nginx directory behavior
- Timestamp-based snapshot filtering — The --from and --to flags allow restricting downloads to snapshots within a date range (YYYYMMDDHHMMSS format); the implementation must parse and compare timestamps correctly
- Web archive format (original vs. rewritten content) — The Wayback Machine serves rewritten HTML (with modified links) by default; this tool downloads the original unmodified files, requiring knowledge of how to request the raw version from the API
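The URL-to-filepath concept above can be sketched minimally. Assumption: the real logic in lib/wayback_machine_downloader.rb also handles query strings, fragments, and byte sanitization (tidy_bytes.rb), all of which this ignores.

```ruby
require "uri"

# Map an archived URL onto an on-disk path under `root`, appending
# index.html for directory-style URLs so Apache/Nginx can serve the
# reconstructed site without URL rewriting.
def url_to_path(url, root: "websites")
  uri  = URI.parse(url)
  path = uri.path.empty? ? "/" : uri.path
  path += "index.html" if path.end_with?("/")
  File.join(root, uri.host, path)
end

url_to_path("http://example.com/blog/")
# => "websites/example.com/blog/index.html"
url_to_path("http://example.com/css/site.css")
# => "websites/example.com/css/site.css"
```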
🔗Related repos
- jmorataya/wayback-machine — Alternative Python-based Wayback Machine downloader; same use case, different language
- internetarchive/ia-download-service — Official Internet Archive download tool; authoritative implementation by the archive maintainers
- mitmproxy/mitmproxy — Not directly related, but often used alongside for MITM capturing of live traffic to archive (complementary workflow)
- archivebox/ArchiveBox — Comprehensive web-archiving solution that can ingest Wayback snapshots and export/serve archived websites
- user2589/wayback-machine-selenium — Browser-based Wayback Machine downloader using Selenium; handles JavaScript-heavy sites better than this CLI tool
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for lib/wayback_machine_downloader/archive_api.rb
The archive_api.rb module is critical for fetching snapshots from the Wayback Machine API, but test/test_wayback_machine_downloader.rb appears to lack specific test coverage for API interaction, error handling, and edge cases like rate limiting, malformed responses, or network failures. This would improve reliability and make future refactoring safer.
- [ ] Create test cases in test/test_wayback_machine_downloader.rb for archive_api.rb methods (query, fetch_snapshot_list, etc.)
- [ ] Add mock tests for API responses (valid JSON, empty results, API errors)
- [ ] Test edge cases: rate limiting, timeout handling, invalid URLs passed to API
- [ ] Test that tidy_bytes.rb and to_regex.rb are correctly integrated with archive_api.rb
Migrate from Travis CI (.travis.yml) to GitHub Actions workflow
The repo uses .travis.yml for CI, but GitHub Actions is now the standard for GitHub repos, offering better integration, faster execution, and no external service dependency. This modernizes the workflow and reduces maintenance burden.
- [ ] Create .github/workflows/test.yml with Ruby matrix testing (1.9.2+, and modern versions)
- [ ] Ensure the workflow runs: bundle install, rake test, and any linting tasks
- [ ] Add gem build test to catch gemspec issues early
- [ ] Remove or deprecate .travis.yml with a note in README
Add integration tests for URL filtering in lib/wayback_machine_downloader.rb with real Wayback Machine scenarios
The repo uses to_regex.rb for URL pattern matching and filtering, but there are no visible tests for complex scenarios like wildcard filtering, exclude patterns, or the interaction between URL filters and actual Wayback Machine results. Adding these would prevent regressions in the core download logic.
- [ ] Create test cases in test/test_wayback_machine_downloader.rb that test URL pattern matching against sample Wayback snapshots
- [ ] Test wildcard patterns (e.g., '.jpg', '/admin/') against realistic snapshot lists
- [ ] Test exclusion patterns and verify they correctly filter out unwanted URLs
- [ ] Add regression tests for known edge cases (query parameters, fragments, internationalized URLs)
🌿Good first issues
- Add integration tests in test/test_wayback_machine_downloader.rb that mock CDX API responses for edge cases: empty snapshots, malformed URLs, and timestamps outside the archive range.
- Document the exact filename-sanitization rules applied by lib/wayback_machine_downloader/tidy_bytes.rb in a code comment or README section, since it's non-obvious how paths like foo/../bar or special characters are handled.
- Add a --dry-run flag that performs full API queries and file discovery but skips actual downloads, useful for previewing what would be downloaded without writing to disk.
⭐Top contributors
📝Recent commits
- 653b94b — Merge pull request #234 from giovanni-cutri/master (hartator)
- c805108 — Fix small typo (giovanni-cutri)
- cf770c2 — Bump gem version (hartator)
- fe9893e — Merge pull request #192 from pabs3/uri-open-compat (hartator)
- 9da87bf — Make URI#open cross Ruby versions compatible (pabs3)
- 66ff4d9 — Merge pull request #188 from pabs3/fixes (hartator)
- 83b4f88 — Bump Gem version (hartator)
- 30475c5 — Make URI#open cross Ruby versions compatible (hartator)
- ba4ca60 — Do not emit a comma for the final item in JSON output (pabs3)
- 06e2595 — Print progress messages to stderr when printing JSON (pabs3)
🔒Security observations
- Critical · Outdated Ruby version in Docker — Dockerfile. The Dockerfile uses Ruby 2.3, which reached end-of-life on March 31, 2019; it no longer receives security patches and contains multiple known vulnerabilities. Fix: update 'FROM ruby:2.3' to 'FROM ruby:3.2' or the latest stable version.
- High · No dependency lock-file pinning — Gemfile / Dockerfile. No Gemfile.lock content was available in the analysis context. Without a lock file, the Docker build can install vulnerable gem versions, and dependencies may silently update to releases with security issues. Fix: commit Gemfile.lock to version control and use it during Docker builds; run 'bundle lock' to generate it and consider 'bundle install --frozen' in Docker to enforce locked versions.
- High · Unsafe URL handling and potential SSRF — lib/wayback_machine_downloader.rb, lib/wayback_machine_downloader/archive_api.rb. The application downloads files based on user-supplied URLs without visible validation, which could enable Server-Side Request Forgery (SSRF) against internal networks or retrieval of unintended resources. Fix: validate URLs: whitelist allowed domains, allow only http/https schemes, reject localhost/127.0.0.1/internal IPs, and implement request timeouts.
- High · Path traversal vulnerability — lib/wayback_machine_downloader.rb. The application reconstructs directory structures from Wayback Machine URLs. Without proper sanitization, malicious URLs with '../' sequences could write files outside the intended directory (e.g., '../../etc/passwd'). Fix: sanitize all path components extracted from URLs; use secure path-joining methods, validate against traversal patterns, and reject paths containing '..', absolute paths, or null bytes.
- Medium · No input validation on command-line arguments — bin/wayback_machine_downloader. The script accepts user input without visible validation; arguments could contain shell metacharacters or malicious input. Fix: validate all command-line arguments with Ruby's built-in methods and reject arguments containing shell metacharacters.
- Medium · Unencrypted HTTP support — README.md, lib/wayback_machine_downloader.rb. The README examples use http:// URLs; downloading over unencrypted HTTP exposes data in transit to man-in-the-middle attacks. Fix: enforce HTTPS-only connections; validate that URLs use the https:// scheme and reject http:// except for localhost testing.
- Medium · Missing security headers in downloaded content — lib/wayback_machine_downloader.rb. When serving or processing downloaded HTML locally, there is no protection against XSS from archived content that may contain malicious scripts. Fix: sanitize processed HTML (e.g., with the Sanitize gem); if serving files, set Content-Security-Policy and X-Content-Type-Options headers.
- Low · Missing security scanning in CI/CD — .travis.yml. CI is configured, but no security scanning tools (Brakeman, bundle audit, OWASP checks) are mentioned. Fix: add 'bundle audit' for dependency vulnerabilities and 'brakeman' for code vulnerabilities to the pipeline.
- Low · No rate limiting — lib/wayback_machine_downloader/archive_api.rb. The application may issue many Wayback Machine API requests without rate limiting, risking service impact or IP blocking. Fix: add configurable delays between requests and exponential backoff for failed requests.
LLM-derived; treat as a starting point, not a security audit.
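As a concrete starting point for the path-traversal item, a containment check along these lines rejects escaping paths. This is a sketch under the stated assumptions, not the project's code.

```ruby
# Reject any reconstructed path that escapes the destination root.
# File.expand_path resolves ".." segments, so a prefix check on the
# expanded paths catches "../../etc/passwd"-style components.
def safe_join(root, relative)
  root_abs  = File.expand_path(root)
  candidate = File.expand_path(File.join(root_abs, relative))
  unless candidate == root_abs || candidate.start_with?(root_abs + File::SEPARATOR)
    raise ArgumentError, "path escapes #{root}: #{relative}"
  end
  candidate
end

safe_join("websites/example.com", "blog/index.html")      # ok, returns absolute path
# safe_join("websites/example.com", "../../etc/passwd")   # raises ArgumentError
```

Appending `File::SEPARATOR` before the prefix check matters: without it, a root of `websites/example.com` would wrongly accept `websites/example.com-evil`.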
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.