RepoPilot

Y2Z/monolith

⬛️ CLI tool and library for saving complete web pages as a single HTML file

Overall: Healthy — healthy across the board

Use as dependency — Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 1w ago
  • 14 active contributors
  • CC0-1.0 licensed
  • CI configured
  • Tests present
  • Concentrated ownership — top contributor handles 79% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/y2z/monolith)](https://repopilot.app/r/y2z/monolith)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/y2z/monolith on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: Y2Z/monolith

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Y2Z/monolith shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 1w ago
  • 14 active contributors
  • CC0-1.0 licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 79% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live Y2Z/monolith repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/Y2Z/monolith.

What it runs against: a local clone of Y2Z/monolith — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Y2Z/monolith | Confirms the artifact applies here, not a fork |
| 2 | License is still CC0-1.0 | Catches a relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 37 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>Y2Z/monolith</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Y2Z/monolith. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/Y2Z/monolith.git
#   cd monolith
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of Y2Z/monolith and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "Y2Z/monolith(\.git)?\b" \
  && ok "origin remote is Y2Z/monolith" \
  || miss "origin remote is not Y2Z/monolith (artifact may be from a fork)"

# 2. License matches what RepoPilot saw (this is a Rust crate, so the
#    authoritative declaration is in Cargo.toml; LICENSE is the fallback)
(grep -qiE '^license[[:space:]]*=[[:space:]]*"CC0-1\.0"' Cargo.toml 2>/dev/null \
   || grep -qi "CC0 1.0" LICENSE 2>/dev/null) \
  && ok "license is CC0-1.0" \
  || miss "license drift — was CC0-1.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in src/main.rs src/lib.rs src/core.rs src/html.rs src/css.rs; do
  test -f "$f" \
    && ok "$f" \
    || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 37 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~7d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Y2Z/monolith"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Monolith is a Rust CLI tool and library that bundles complete web pages—HTML, CSS, images, JavaScript, and all assets—into a single self-contained HTML5 file using data URLs. Unlike 'Save Page As' or wget -mpk, it embeds all resources inline, producing a portable .html file that renders identically offline without external dependencies.

Structurally it is a modular Rust library: src/lib.rs exports the public API and src/main.rs provides the CLI entry point. Core processing is split by concern: src/html.rs (DOM manipulation), src/css.rs (CSS embedding), src/js.rs (JavaScript inlining), src/url.rs (URL parsing/resolution), src/cache.rs (asset caching), src/cookies.rs (session handling), src/session.rs (HTTP sessions), and src/core.rs (the main fetch/embed pipeline). An optional GUI mode lives in src/gui.rs (Druid). Tests sit in tests/ with sample HTML/CSS/JS fixtures under tests/_data_/.

👥Who it's for

Data hoarders, archivists, and users who want to preserve web content as portable single files; developers building tools for web scraping, offline content preservation, or page capture; people needing to share complex web pages without broken asset links.

🌱Maturity & risk

Production-ready. The project shows a healthy CI/CD setup (5 workflow files covering GNU/Linux, macOS, Windows, NetBSD), broad packaging distribution (Homebrew, Chocolatey, Scoop, Winget, MacPorts, Snapcraft, Guix, NixPkgs, Flox), and stable versioning (at v2.11.0). Multi-author contribution history and presence on major package managers indicate active maintenance and real-world usage.

Low risk for core functionality. Dependencies are pinned to exact versions (e.g., reqwest = "=0.12.15", html5ever = "=0.29.1"), reducing surprise breakage. Single primary author (snshn) with supporting contributors. The openssl dependency requires static linking setup, which can be tricky on some platforms. Large Rust compilation footprint and reqwest's async runtime add build complexity.

Active areas of work

Active maintenance on platform support and packaging (Docker, Apify Actor at .actor/). CI covers Linux, macOS, Windows, NetBSD. Integration with Apify platform suggests recent serverless/cloud-oriented work. The actor.json, actor.sh, and .actor/ directory indicate ongoing effort to expose monolith as an Apify actor service.

🚀Get running

git clone https://github.com/Y2Z/monolith.git
cd monolith
cargo build --release
./target/release/monolith https://example.com -o output.html

Or install via Cargo: cargo install monolith, then run monolith <URL>.

Daily commands:

  • Build: cargo build (debug) or cargo build --release (optimized)
  • Run: monolith <URL> [OPTIONS] (e.g., monolith https://example.com -o page.html)
  • See the Makefile for standard tasks

No dev server needed; this is a CLI tool, not a service.

🗺️Map of the codebase

  • src/main.rs — CLI entry point and argument parsing—defines the command-line interface that all users interact with
  • src/lib.rs — Library root exposing public API and core module organization for library consumers
  • src/core.rs — Main orchestration logic for fetching and embedding resources; handles the core monolith workflow
  • src/html.rs — HTML parsing, DOM manipulation, and serialization—critical for web page reconstruction
  • src/css.rs — CSS parsing and embedding logic; handles stylesheet inlining and resource embedding
  • Cargo.toml — Dependency manifest and project metadata—defines external libraries and build configuration

🧩Components & responsibilities

  • CLI (src/main.rs) (structopt or clap for argument parsing) — Parse command-line arguments, handle user input/output, orchestrate high-level flow
    • Failure mode: Invalid arguments, file I/O errors → user-facing error messages
  • Core Engine (src/core.rs) (Async I/O, DOM APIs) — Orchestrate fetching, parsing, embedding workflow; manage options and session state
    • Failure mode: Network failures, malformed HTML → graceful degradation or error propagation
  • HTML Parser (src/html.rs) (html5ever, DOM manipulation) — Parse HTML, walk DOM, identify resources (img, link, script), rewrite and embed them
    • Failure mode: Malformed HTML → parser handles robustly; invalid selectors → skip gracefully
  • CSS Handler (src/css.rs) (CSS parsing library) — Parse CSS rules, detect @import and url() references, inline stylesheets and imports
    • Failure mode: Invalid CSS → skip rule or property; missing imports → continue without
  • Cache (src/cache.rs) (HashMap or similar in-memory store) — In-memory deduplication of fetched resources, avoid redundant HTTP requests
    • Failure mode: Memory exhaustion on very large pages → proceed without cache benefits
  • URL Handler (src/url.rs) (url crate) — Parse, validate, normalize, and resolve relative URLs to absolute forms
    • Failure mode: Invalid URLs → skip or error; relative paths → resolve against base URL
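The URL Handler's resolution step is worth seeing concretely. The real src/url.rs delegates to the url crate; the std-only toy below (all names are illustrative, not monolith's API) shows the core rules — absolute references pass through, root-relative references replace the whole path, and `..` segments pop the base path:

```rust
// Simplified sketch of relative-URL resolution. Monolith's real code uses
// the `url` crate; this toy handles only http(s) bases, with no
// query/fragment or percent-encoding handling.
fn resolve(base: &str, reference: &str) -> String {
    // Absolute references are returned untouched.
    if reference.contains("://") {
        return reference.to_string();
    }
    // Split the base into origin ("https://host") and path ("/a/b.html").
    let scheme_end = base.find("://").map(|i| i + 3).unwrap_or(0);
    let path_start = base[scheme_end..]
        .find('/')
        .map(|i| scheme_end + i)
        .unwrap_or(base.len());
    let origin = &base[..path_start];
    // Root-relative references replace the whole path.
    if reference.starts_with('/') {
        return format!("{}{}", origin, reference);
    }
    // Drop the base's last segment, then apply "." / ".." segments.
    let mut segments: Vec<&str> = base[path_start..].split('/').collect();
    segments.pop();
    if segments.is_empty() {
        segments.push(""); // base had no path: resolve against "/"
    }
    for part in reference.split('/') {
        match part {
            "." => {}
            ".." => {
                if segments.len() > 1 {
                    segments.pop();
                }
            }
            _ => segments.push(part),
        }
    }
    format!("{}{}", origin, segments.join("/"))
}

fn main() {
    // e.g. an <img src="../img.png"> inside https://example.com/a/b/page.html
    println!("{}", resolve("https://example.com/a/b/page.html", "../img.png"));
    // -> https://example.com/a/img.png
}
```

Every embedded asset reference monolith encounters goes through a resolution step like this before it can be fetched and inlined.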

🔀Data flow

  • User input (CLI) → src/main.rs — Command-line URL, options (base-url, user-agent, cookie-file, etc.)
  • src/main.rs → src/core.rs — Parsed options and URL passed to document processing
  • src/core.rs → HTTP Client — Fetch initial HTML document from URL
  • HTTP Client → src/core.rs — Response body handed back for DOM parsing and recursive asset embedding

🛠️How to make changes

Add Support for a New Content Type

  1. Identify the content type (e.g., video, audio, manifest) in the fetch/response handler (src/core.rs)
  2. Add parsing logic in the appropriate module (src/html.rs for HTML tags, src/css.rs for CSS imports, etc.) (src/html.rs)
  3. Implement embedding as data URL or inline content, similar to existing image/style handling (src/html.rs)
  4. Add test cases in tests/ directory to verify correct embedding (tests/html/mod.rs)

Add a New Command-Line Option

  1. Define the option in the argument parser with help text and default value (src/main.rs)
  2. Add corresponding field to the Options struct in core.rs (or relevant module) (src/core.rs)
  3. Pass the option through the request pipeline and apply logic where needed (src/core.rs)
  4. Add CLI integration test to verify option is parsed and applied correctly (tests/cli/mod.rs)
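As a sketch of steps 2–3: monolith's real parser is clap-based, but the std-only toy below shows the shape of the change — a new struct field, a new match arm, and the value riding along into the pipeline. The --no-audio flag, Options fields, and parse_args helper are illustrative names, not monolith's actual API:

```rust
// Hypothetical illustration of threading a new CLI flag into an Options
// struct. The real implementation lives in src/main.rs (clap) and the
// library's options type; this only shows the pattern.
#[derive(Debug, Default)]
struct Options {
    output: Option<String>,
    no_audio: bool, // <- the new field the pipeline will consult
}

fn parse_args(args: &[String]) -> (Option<String>, Options) {
    let mut opts = Options::default();
    let mut url = None;
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--no-audio" => opts.no_audio = true,        // step 1: recognize the flag
            "-o" => opts.output = it.next().cloned(),     // existing option
            other => url = Some(other.to_string()),       // positional URL
        }
    }
    (url, opts) // step 3: Options travels into the request pipeline
}

fn main() {
    let argv: Vec<String> = ["https://example.com", "--no-audio", "-o", "page.html"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let (url, opts) = parse_args(&argv);
    println!("url={:?} opts={:?}", url, opts);
}
```

The integration test in step 4 would then invoke the built binary with the new flag and assert on the produced output file.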

Improve Resource Embedding Strategy

  1. Identify the resource type in the DOM walker (e.g., img, link, script) (src/html.rs)
  2. Add detection logic for the specific scenario (e.g., srcset, integrity attributes) (src/html.rs)
  3. Fetch and encode the resource (reuse cache.rs for deduplication) (src/cache.rs)
  4. Update the node with the embedded data and test with test data in tests/data/ (tests/_data_/basic)
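Step 3's "fetch and encode" amounts to base64-encoding the fetched bytes and prefixing the media type. Monolith's real code uses the base64 crate; the encoder below is hand-rolled only so this sketch stays dependency-free, and to_data_url is an illustrative name, not monolith's API:

```rust
// Sketch of embedding fetched bytes as a `data:` URL (the mechanism behind
// monolith's single-file output). Real code uses the `base64` crate.
fn base64_encode(data: &[u8]) -> String {
    const TABLE: &[u8] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::new();
    for chunk in data.chunks(3) {
        // Pack up to 3 bytes into a 24-bit group, then emit four 6-bit digits,
        // padding with '=' when the final chunk is short.
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = ((b[0] as u32) << 16) | ((b[1] as u32) << 8) | b[2] as u32;
        out.push(TABLE[(n >> 18) as usize & 63] as char);
        out.push(TABLE[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { TABLE[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { TABLE[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Build a data URL of the kind monolith writes into src/href attributes.
fn to_data_url(media_type: &str, data: &[u8]) -> String {
    format!("data:{};base64,{}", media_type, base64_encode(data))
}

fn main() {
    // "Man" is the classic base64 example: it encodes to "TWFu".
    println!("{}", to_data_url("text/plain", b"Man")); // data:text/plain;base64,TWFu
}
```

Deduplication (step 3's cache.rs reuse) would key on the resolved URL so the same asset is encoded once per run.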

🔧Why these technologies

  • Rust — Memory-safe, performant CLI tool with strong type system; ideal for resource-intensive HTML parsing
  • reqwest (implied HTTP client) — Async HTTP client for fetching remote resources without blocking
  • html5ever / markup5ever (DOM parsing) — Robust HTML5-compliant parser that handles malformed HTML gracefully
  • base64 encoding — Convert binary resources (images, fonts) to text data URLs for embedding in HTML
  • Multi-platform CI/CD — GitHub Actions workflows ensure consistent cross-platform builds (Linux, macOS, Windows)

⚖️Trade-offs already made

  • Single-file output (monolith HTML) instead of directory bundle

    • Why: Simplicity, portability, and ease of sharing
    • Consequence: File size can be large; no incremental caching benefits for individual assets
  • Data URL embedding for all resources (images, stylesheets, scripts)

    • Why: Ensures the single HTML file is completely self-contained and offline-accessible
    • Consequence: Performance trade-off: browser cannot parallelize HTTP requests; larger HTML payload
  • Synchronous embedding and serialization in core.rs

    • Why: Deterministic, simpler logic flow
    • Consequence: May block on large pages; limits processing to a single thread per invocation
  • CLI-first design with optional library API

    • Why: Primary use case is command-line tool; library mode for programmatic integration
    • Consequence: API surface is derived from CLI options; less ergonomic for library consumers

🚫Non-goals (don't propose these)

  • Real-time page monitoring or incremental updates
  • JavaScript execution or DOM rendering (no headless browser integration)
  • Cross-domain cookie persistence or session replay
  • Compression or format conversion (output is always HTML)

🪤Traps & gotchas

  • OpenSSL static linking: the openssl = "=0.10.72" dependency attempts static linking with default settings, which can fail on systems without OpenSSL dev headers; set OPENSSL_LIB_DIR and OPENSSL_INCLUDE_DIR, or disable it on platforms where it isn't needed.
  • reqwest async runtime: requires a Tokio runtime; blocking code in callbacks can deadlock.
  • Charset edge cases: encoding_rs may not detect the encoding of all edge-case malformed HTML; the fallback is UTF-8.
  • Large-page memory usage: no streaming — the entire DOM and all fetched assets are held in memory, so huge pages can exhaust RAM.
  • Cache format: the redb cache format may not be portable across versions; delete .cache/monolith/ if an upgrade breaks reads.
  • User-Agent requirements: some sites block automated requests; you may need custom UA headers via cookies/session config.

🏗️Architecture

💡Concepts to learn

  • Data URL encoding — Core mechanism in monolith: all remote assets (images, stylesheets, scripts) are converted to data: URLs and embedded inline in HTML, eliminating external dependencies.
  • DOM traversal and tree manipulation — Monolith parses HTML5 into a DOM tree using html5ever, then recursively walks nodes to find <img>, <link>, <script> tags and rewrite their src/href attributes; understanding tree traversal patterns is essential to extending the tool.
  • CSS parsing and @import resolution — The cssparser crate parses stylesheets and extracts @import rules; monolith recursively fetches imported CSS and inlines it, requiring understanding of CSS parsing and cascade handling.
  • Character encoding detection and conversion — Websites declare encoding via <meta charset> or HTTP headers; encoding_rs detects and converts charsets to UTF-8 for output, preventing garbled text in archived pages.
  • Integrity attributes (SRI) — Monolith preserves Subresource Integrity integrity attributes and calculates SHA-256/SHA-384/SHA-512 hashes using the sha2 crate to maintain security guarantees in archived pages.
  • Content Security Policy (CSP) rewriting — Original pages may have CSP headers restricting data: URLs; monolith must strip or rewrite CSP meta tags to allow inlined assets to execute.
  • Persistent key-value caching with redb — The redb dependency provides on-disk caching to avoid re-fetching identical remote assets across multiple monolith runs; understanding redb's transaction model is key to optimizing performance.
Related projects

  • gildas-lormeau/SingleFile — Direct competitor: JavaScript-based browser extension for single-file web capture; similar goal but browser-native, not CLI/Rust.
  • ArchiveBox/ArchiveBox — Broader archival framework that uses monolith as a backend; ArchiveBox orchestrates multiple saving methods including monolith for HTML extraction.
  • getlantern/lantern — Rust-based proxy/networking tool; shares dependency ecosystem (reqwest, url parsing) and relevant for offline/low-bandwidth scenarios where monolith is used.
  • servo/html5ever — Upstream HTML5 parser library used by monolith; understanding servo's DOM model directly impacts monolith's asset inlining logic.
  • apify/apify-sdk-js — Apify platform integration shown in .actor/; SDK used when monolith runs as an Apify actor for cloud-based web scraping workflows.
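The traversal pattern behind the DOM concepts above can be sketched without html5ever. The toy Node below is a stand-in for the real rcdom tree — only the recursive walk-and-rewrite shape carries over to src/html.rs; the names here are illustrative:

```rust
// Toy stand-in for monolith's DOM walk. The real src/html.rs walks an
// html5ever/rcdom tree; this simplified Node shows the traversal pattern.
struct Node {
    tag: String,
    attrs: Vec<(String, String)>,
    children: Vec<Node>,
}

/// Recursively visit every element and rewrite `src`/`href` attributes,
/// mirroring how monolith swaps remote URLs for embedded data URLs.
fn rewrite_urls(node: &mut Node, rewrite: &dyn Fn(&str) -> String) {
    for (name, value) in node.attrs.iter_mut() {
        if name == "src" || name == "href" {
            *value = rewrite(value);
        }
    }
    for child in node.children.iter_mut() {
        rewrite_urls(child, rewrite);
    }
}

fn main() {
    let mut tree = Node {
        tag: "body".into(),
        attrs: vec![],
        children: vec![Node {
            tag: "img".into(),
            attrs: vec![("src".into(), "logo.png".into())],
            children: vec![],
        }],
    };
    // A real rewriter would fetch the asset and return a data: URL.
    rewrite_urls(&mut tree, &|url| format!("data:embedded:{}", url));
    println!("{}", tree.children[0].attrs[0].1);
}
```

Extending monolith to a new tag type mostly means adding another attribute name (or tag-specific case) to a walk of this shape.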

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for CSS asset embedding and @import resolution

The repo has test data for CSS scenarios (tests/data/css/, tests/data/import-css-via-data-url/) but no corresponding test files in tests/cli/. There's a basic.rs and base_url.rs, but no dedicated CSS embedding tests. Given that src/css.rs is a core module handling CSS parsing and inlining, adding comprehensive tests would catch regressions in @import handling, url() rewriting, and data-uri conversion—critical for the single-file output feature.

  • [ ] Create tests/cli/css_embedding.rs with test cases for each scenario in tests/data/css/
  • [ ] Add test for @import CSS via data-url from tests/data/import-css-via-data-url/
  • [ ] Add test verifying url() paths are correctly rewritten to data URIs
  • [ ] Test edge cases like nested @imports and relative paths in CSS

Add GitHub Actions workflow for macOS/Windows ARM64 builds

Current workflows exist for build_gnu_linux.yml, build_macos.yml, and build_windows.yml, but they likely only target x86_64. With the rise of ARM64 machines (Apple Silicon, Windows ARM), adding native ARM64 build workflows would expand platform coverage. The Cargo.toml and existing CI structure support this, but there's no evidence of ARM64 compilation in the workflows.

  • [ ] Create .github/workflows/build_macos_arm64.yml targeting aarch64-apple-darwin
  • [ ] Create .github/workflows/build_windows_arm64.yml targeting aarch64-pc-windows-msvc
  • [ ] Update or extend existing workflows with matrix strategy for [x86_64, aarch64]
  • [ ] Test that openssl dependency (Cargo.toml line) builds correctly on ARM64 targets

Add comprehensive tests for unusual character encodings and integrity attribute handling

Test data exists (tests/data/unusual_encodings/ with gb2312.html and iso-8859-1.html, plus tests/data/integrity/), but there are no test files in tests/cli/ for these scenarios. The src/html.rs module handles DOM manipulation and src/core.rs handles integrity attributes, making these critical to test. Missing tests could let encoding-related bugs (from encoding_rs dependency) or SRI validation bugs slip through.

  • [ ] Create tests/cli/encoding.rs testing gb2312 and iso-8859-1 HTML files from tests/data/unusual_encodings/
  • [ ] Create tests/cli/integrity.rs verifying integrity attributes are preserved and validated
  • [ ] Test that charset detection works correctly when converting to single-file output
  • [ ] Add test for integrity attribute removal/preservation based on CLI flags

🌿Good first issues

  • Add integration tests for src/cookies.rs cookie handling—currently no test fixtures in tests/_data_/ validate cookie serialization/parsing. Create a test case with Set-Cookie headers and verify session persistence.
  • Expand src/cache.rs documentation with examples showing how to enable/disable caching and inspect cache entries; the module is undocumented beyond type signatures, making it hard for contributors to optimize cache logic.
  • Add a test fixture in tests/_data_/ for CSS @font-face inlining; current test coverage (in css/) focuses on @import rules, but font embedding is a common use case and likely has untested edge cases (WOFF2, EOT formats).


📝Recent commits

  • 1634dae — fix local inlined svg symbols from embedding entire page (brettp)
  • 623ffcc — Improve Dockerfile apk cache handling (PeterDaveHello)
  • 0432662 — fix: Adds handling of webmanifest. (css-optivoy)
  • 79a4345 — fix: Fixes script type handling. (css-optivoy)
  • 8702e66 — Fix typos (kianmeng)
  • 7fed227 — use specific package versions (snshn)
  • 0f7e309 — roll redb back to 2.4.0 due to NetBSD not yet supporting edition2024 (snshn)
  • f57819e — bump version number (2.10.1 -> 2.11.0), update README and crates (snshn)
  • b2002c1 — basic support for saving as MHTML, refactor code and fix bugs (snshn)
  • a483897 — Update README.md (snshn)

🔒Security observations

  • High · Dockerfile Uses Unreliable Upstream Source — Dockerfile (lines 3-5). The Dockerfile fetches the latest monolith release from GitHub API at runtime without pinning a specific version. This creates a supply chain risk where a compromised or malicious release could be downloaded and executed. The curl command to api.github.com is also not verified with checksums. Fix: Pin to a specific release version using a hardcoded URL and verify the tarball with a SHA256 checksum before extraction. Example: use a specific release tag instead of 'latest'.
  • High · OpenSSL Static Linking Dependency — Cargo.toml (openssl dependency). The project depends on openssl=0.10.72 which statically links OpenSSL. This means security patches to the system OpenSSL won't apply to the bundled binary. Additionally, this version is relatively old and may contain known vulnerabilities. Fix: Consider using rustls as a replacement (pure Rust TLS implementation) or dynamically link to system OpenSSL and ensure regular updates.
  • Medium · Reqwest with Incomplete Default Features Configuration — Cargo.toml (reqwest dependency). In Cargo.toml, reqwest dependency has default-features = fal which appears to be a typo (should be 'false'). This means some default features may be unexpectedly enabled, potentially including features with security implications. Fix: Correct the typo to default-features = false and explicitly enable only required features.
  • Medium · Potential XXE/XML Injection via HTML5 Parsing — src/html.rs, src/core.rs (likely DOM parsing code). The project uses html5ever for DOM manipulation. If user-supplied HTML/XML content is parsed without proper safeguards, this could lead to XML External Entity (XXE) attacks or entity expansion attacks, particularly when processing malicious web pages. Fix: Ensure html5ever is used with secure defaults. Validate and sanitize input HTML. Consider using a dedicated HTML sanitization library if user content is directly processed.
  • Medium · Network Request Validation Missing — src/core.rs, src/url.rs. As a web scraping tool, monolith makes arbitrary network requests to user-specified URLs. There's no indication of SSRF (Server-Side Request Forgery) protections such as blocking requests to private IP ranges (127.0.0.1, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, etc.). Fix: Implement URL validation to block requests to private IP ranges, localhost, and link-local addresses. Add configuration options for security policies.
  • Medium · Regex DoS Vulnerability Potential — Cargo.toml (regex dependency), src/html.rs (NOSCRIPT unwrapping). The regex dependency is used with features 'std', 'perf-dfa', 'unicode-perl'. The unicode-perl feature combined with complex regex patterns could potentially enable Regular Expression Denial of Service (ReDoS) attacks if processing untrusted regex patterns. Fix: Ensure regex patterns are hardcoded and not derived from user input. Review NOSCRIPT processing logic for potential ReDoS vectors.
  • Low · No Integrity Validation for Downloaded Assets — src/core.rs (asset handling). While the project uses sha2 for integrity checks on scripts/styles with integrity attributes, there's no validation for assets without such attributes, potentially allowing MITM attacks for unprotected resources. Fix: Add optional pinning of expected asset hashes or implement certificate pinning for HTTPS connections.
  • Low · Minimal Input Sanitization Visible — src/html.rs, src/css.rs, src/js.rs. The project processes and embeds user-provided URLs, JavaScript, and CSS directly into HTML output. Without explicit sanitization visible in file structure, there's potential for XSS if output is not properly escaped. Fix: Implement proper HTML escaping for all user-controlled content. Use a context-aware escaping library and validate output encoding.
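The SSRF mitigation suggested above can be sketched with std::net alone. This is a hypothetical helper, not code monolith ships — and a real guard must run on addresses after DNS resolution, not only on literal IPs in URLs:

```rust
// Sketch of the suggested SSRF guard: refuse to fetch targets in
// loopback/private/link-local ranges. Hypothetical; monolith does not
// currently implement this (per the observation above).
use std::net::IpAddr;

fn is_forbidden_target(ip: IpAddr) -> bool {
    match ip {
        // 127/8, 10/8, 172.16/12, 192.168/16, 169.254/16, 0.0.0.0
        IpAddr::V4(v4) => {
            v4.is_loopback() || v4.is_private() || v4.is_link_local() || v4.is_unspecified()
        }
        // ::1 and ::; a fuller guard would also cover ULA and v4-mapped forms.
        IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
    }
}

fn main() {
    for addr in ["127.0.0.1", "10.0.0.7", "192.168.1.1", "93.184.216.34"] {
        let ip: IpAddr = addr.parse().unwrap();
        println!("{addr}: blocked = {}", is_forbidden_target(ip));
    }
}
```

Wired in before each fetch in src/core.rs, a check like this would block the private-range requests the observation describes, ideally behind a configurable policy flag.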

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
