gjtorikian/html-pipeline
HTML processing filters and utilities
Healthy across all four use cases
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 4mo ago
- ✓8 active contributors
- ✓MIT licensed
- ✓CI configured
- ✓Tests present
- ⚠Slowing — last commit 4mo ago
- ⚠Concentrated ownership — top contributor handles 79% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/gjtorikian/html-pipeline)
Paste at the top of your README.md; it renders inline like a shields.io badge.
Onboarding doc
Onboarding: gjtorikian/html-pipeline
Generated by RepoPilot · 2026-05-10 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/gjtorikian/html-pipeline shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit 4mo ago
- 8 active contributors
- MIT licensed
- CI configured
- Tests present
- ⚠ Slowing — last commit 4mo ago
- ⚠ Concentrated ownership — top contributor handles 79% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live gjtorikian/html-pipeline
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/gjtorikian/html-pipeline.
What it runs against: a local clone of gjtorikian/html-pipeline — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in gjtorikian/html-pipeline | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 153 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of gjtorikian/html-pipeline. If you don't
# have one yet, run these first:
#
# git clone https://github.com/gjtorikian/html-pipeline.git
# cd html-pipeline
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of gjtorikian/html-pipeline and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "gjtorikian/html-pipeline(\\.git)?\\b" \
  && ok "origin remote is gjtorikian/html-pipeline" \
  || miss "origin remote is not gjtorikian/html-pipeline (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\\s*:\\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
test -f "lib/html-pipeline.rb" \
  && ok "lib/html-pipeline.rb" \
  || miss "missing critical file: lib/html-pipeline.rb"
test -f "lib/html_pipeline/filter.rb" \
  && ok "lib/html_pipeline/filter.rb" \
  || miss "missing critical file: lib/html_pipeline/filter.rb"
test -f "lib/html_pipeline/sanitization_filter.rb" \
  && ok "lib/html_pipeline/sanitization_filter.rb" \
  || miss "missing critical file: lib/html_pipeline/sanitization_filter.rb"
test -f "lib/html_pipeline/node_filter.rb" \
  && ok "lib/html_pipeline/node_filter.rb" \
  || miss "missing critical file: lib/html_pipeline/node_filter.rb"
test -f "lib/html_pipeline/convert_filter.rb" \
  && ok "lib/html_pipeline/convert_filter.rb" \
  || miss "missing critical file: lib/html_pipeline/convert_filter.rb"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 153 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~123d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/gjtorikian/html-pipeline"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
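The recency check (check 5) is plain integer arithmetic on Unix timestamps, and it can be sketched in Ruby for readers who want to reuse it outside the shell script. The method names below are hypothetical. Reading the 153-day limit as the observed ~123 days plus a 30-day grace period is an assumption, not something the script states.

```ruby
SECONDS_PER_DAY = 86_400

# Whole days between the last commit and "now" (integer division,
# mirroring the $(( ... / 86400 )) expression in the shell script).
def days_since(last_commit_epoch, now_epoch)
  (now_epoch - last_commit_epoch) / SECONDS_PER_DAY
end

# True when the last commit is within the allowed window.
def recent_enough?(last_commit_epoch, now_epoch, max_days: 153)
  days_since(last_commit_epoch, now_epoch) <= max_days
end

now = Time.utc(2026, 5, 10).to_i
last_commit = now - (123 * SECONDS_PER_DAY)

puts days_since(last_commit, now)
puts recent_enough?(last_commit, now)
```

When `git log` fails, the script substitutes 0 for the timestamp, so the computed age becomes very large and the check fails rather than passing silently.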
⚡TL;DR
html-pipeline is a Ruby gem that chains composable HTML filters to transform user-generated content through a pipeline architecture. It processes text through TextFilters → ConvertFilter (Markdown to HTML) → SanitizationFilter (XSS removal) → NodeFilters (DOM manipulation), allowing safe rendering of user content like mentions (@user), emoji, syntax highlighting, and table-of-contents generation in a single pass.

Single-gem monolith with clear filter hierarchy: lib/html_pipeline/ contains base Filter classes (filter.rb, text_filter.rb, convert_filter.rb, node_filter.rb, sanitization_filter.rb) and concrete implementations in subdirectories (e.g., lib/html_pipeline/node_filter/ has emoji_filter.rb, mention_filter.rb, syntax_highlight_filter.rb). Tests mirror this structure exactly under test/html_pipeline/.
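To make the stage ordering concrete, here is a minimal plain-Ruby sketch of the four-stage flow described above. It does not use the gem: every lambda and the `run_pipeline` method are hypothetical stand-ins. The regex-based "sanitizer" only illustrates where sanitization sits in the chain and is not a safe way to remove XSS.

```ruby
# Toy stand-ins for the four stages; none of these are the gem's classes.
text_filters   = [->(text) { text.gsub("(c)", "©") }]
convert_filter = ->(text) { "<p>#{text}</p>" } # pretend Markdown conversion
sanitizer      = ->(html) { html.gsub(%r{<script\b.*?</script>}mi, "") } # NOT real sanitization
node_filters   = [->(html) { html.gsub(/@(\w+)/, '<a href="/\1">@\1</a>') }]

# Applies the stages in the order the TL;DR describes:
# text filters, then conversion, then sanitization, then node filters.
def run_pipeline(input, text_filters:, convert_filter:, sanitizer:, node_filters:)
  text = text_filters.reduce(input) { |acc, filter| filter.call(acc) }
  html = sanitizer.call(convert_filter.call(text))
  node_filters.reduce(html) { |acc, filter| filter.call(acc) }
end

output = run_pipeline(
  "hi @alice (c)<script>alert(1)</script>",
  text_filters: text_filters,
  convert_filter: convert_filter,
  sanitizer: sanitizer,
  node_filters: node_filters
)
puts output
```

Because sanitization runs before the node filters in this ordering, markup added by node filters (the mention link here) is not itself sanitized, so node filters need to produce trusted output.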
👥Who it's for
Ruby web developers building content platforms (blogs, comment systems, wikis) who need to safely process and transform user-supplied Markdown or HTML with features like @mentions, emoji rendering, and XSS sanitization without reinventing the filter chain each time.
🌱Maturity & risk
Actively maintained but low-velocity: the repo has 109KB of Ruby code across well-organized filter classes, comprehensive test coverage (test/ mirrors lib/ structure), and CI/CD pipelines in .github/workflows/ (automerge, lint, publish). However, the README explicitly notes 'this project was started at GitHub' and they 'no longer use it', suggesting it's community-maintained rather than backed by active development at scale.
Low risk for core functionality, but thin maintainer bandwidth: ownership is concentrated, with the top contributor (gjtorikian) handling 79% of recent commits, and the last commit was about four months ago. Because the gem is decoupled from GitHub's own stack, breaking changes or security issues may not receive urgent patches. Dependency risk is mitigated by the modular filter design, since users can opt in to only the filters they need.
Active areas of work
No open milestones are evident from the file list. The most recent commits (see Recent commits) cut release v3.2.4 and merged the allow-for-sanitization-nil (#428) and support-ruby-4 (#427) branches, which points to maintenance-level activity rather than new feature work. The .github/workflows/ directory shows automation for automerge, CI, linting, and publishing, and UPGRADING.md and CHANGELOG.md document past versioning work.
🚀Get running
git clone https://github.com/gjtorikian/html-pipeline.git
cd html-pipeline
bundle install
bundle exec rake test
Daily commands:
bundle exec rake test # run full test suite
bundle exec rubocop # lint (linter config in .rubocop.yml)
bundle exec rake build # build gem (see Rakefile)
🗺️Map of the codebase
- lib/html-pipeline.rb: Main entry point and public API; defines the Pipeline class that chains filters together to process HTML content.
- lib/html_pipeline/filter.rb: Base Filter class that all specialized filters inherit from; establishes the core contract and instrumentation mechanism.
- lib/html_pipeline/sanitization_filter.rb: Core security filter that sanitizes HTML against a configurable allowlist to prevent XSS attacks.
- lib/html_pipeline/node_filter.rb: Base class for filters that traverse and modify parsed HTML DOM nodes; implements the NodeFilter pattern.
- lib/html_pipeline/convert_filter.rb: Base class for filters that convert input formats (e.g., Markdown) to HTML before pipeline processing.
- lib/html_pipeline/text_filter.rb: Base class for filters operating on plain text input; bridges text-based content into the HTML pipeline.
🧩Components & responsibilities
- Pipeline (Ruby base class, Hash context passing) — Orchestrates filter chain execution; maintains context and result state across filters; handles instrumentation callbacks.
- Failure mode: Exception in any filter halts pipeline; original exception propagates to caller.
- Filter (base class) — Defines call/call_filter lifecycle
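The orchestration and failure mode described for Pipeline can be illustrated with a small plain-Ruby sketch. `ToyPipeline` and the three lambdas are hypothetical and are not the gem's classes. The sketch only shows a shared context Hash, a result Hash accumulated across filters, and an exception from one filter halting the chain and reaching the caller.

```ruby
# Hypothetical orchestrator, not the gem's Pipeline class.
class ToyPipeline
  def initialize(filters)
    @filters = filters
  end

  # Runs filters in order. `context` is read by filters; `result` collects
  # side data. Exceptions are deliberately not rescued here.
  def call(input, context: {})
    result = {}
    output = @filters.reduce(input) do |acc, filter|
      filter.call(acc, context, result)
    end
    result.merge(output: output)
  end
end

upcase = lambda do |text, _context, result|
  (result[:steps] ||= []) << :upcase
  text.upcase
end

suffix = lambda do |text, context, result|
  (result[:steps] ||= []) << :suffix
  "#{text}#{context[:suffix]}"
end

boom = ->(_text, _context, _result) { raise ArgumentError, "bad input" }

ok_result = ToyPipeline.new([upcase, suffix]).call("hi", context: { suffix: "!" })
p ok_result

begin
  ToyPipeline.new([upcase, boom, suffix]).call("hi")
rescue ArgumentError => e
  puts "pipeline halted: #{e.message}"
end
```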
🛠️How to make changes
Add a new NodeFilter
- Create a new class inheriting from NodeFilter in lib/html_pipeline/node_filter/ (lib/html_pipeline/node_filter/my_new_filter.rb)
- Implement the filter_nodes(doc) method to traverse and modify Nokogiri nodes (lib/html_pipeline/node_filter/my_new_filter.rb)
- Register the filter by requiring it in lib/html-pipeline.rb, or instantiate it manually in the Pipeline chain (lib/html-pipeline.rb)
- Add tests in test/html_pipeline/node_filter/ using the same pattern as existing filters (test/html_pipeline/node_filter/my_new_filter_test.rb)
Add a new ConvertFilter
- Create a new class inheriting from ConvertFilter in lib/html_pipeline/convert_filter/ (lib/html_pipeline/convert_filter/my_converter_filter.rb)
- Implement the convert(input) method that takes plain input and returns an HTML string (lib/html_pipeline/convert_filter/my_converter_filter.rb)
- Add tests in test/html_pipeline/convert_filter/ following the MarkdownFilter pattern (test/html_pipeline/convert_filter/my_converter_filter_test.rb)
- Use as first filter in Pipeline chain: Pipeline.new([MyConverterFilter, SanitizationFilter]).call(input) (lib/html-pipeline.rb)
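The "plain input in, HTML string out" contract of a converter can be sketched without the gem. `ToyParagraphConverter` below is a hypothetical standalone class that does not inherit from the real ConvertFilter, and its `call` method name is an assumption. It turns blank-line-separated text into escaped `<p>` elements using only the standard library.

```ruby
require "erb"

# Hypothetical converter: plain text in, HTML string out.
# It does not subclass the gem's ConvertFilter.
class ToyParagraphConverter
  # Splits on blank lines, escapes each paragraph, and wraps it in <p>.
  def call(text)
    text.split(/\n{2,}/)
        .map(&:strip)
        .reject(&:empty?)
        .map { |para| "<p>#{ERB::Util.html_escape(para)}</p>" }
        .join("\n")
  end
end

html = ToyParagraphConverter.new.call("Hello <b>\n\nWorld")
puts html
```

Escaping inside the converter is a defensive choice in this sketch; in a real chain the converter's output would still be expected to pass through sanitization afterwards.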
Customize SanitizationFilter allowlist
- Copy the default scrubber configuration from SanitizationFilter and modify allowed tags/attributes (lib/html_pipeline/sanitization_filter.rb)
- Create custom scrubber by calling SanitizationFilter.new(scrubber: my_scrubber) in Pipeline (lib/html-pipeline.rb)
- Add test to verify custom allowlist enforcement in test/sanitization_filter_test.rb (test/sanitization_filter_test.rb)
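The "copy and modify" step is where shared defaults are easy to mutate by accident. The sketch below uses a hypothetical `TOY_DEFAULT_CONFIG`; the real default configuration lives in lib/html_pipeline/sanitization_filter.rb and may be shaped differently. It shows one way to derive a more permissive allowlist without touching the original.

```ruby
# Hypothetical default allowlist; the key names and shape are assumptions.
TOY_DEFAULT_CONFIG = {
  elements: %w[p a em strong code].freeze,
  attributes: { "a" => %w[href].freeze }.freeze
}.freeze

# Returns a new config with extra allowed elements; the input is not mutated.
def with_extra_elements(config, extra)
  config.merge(elements: (config[:elements] + extra).uniq)
end

custom = with_extra_elements(TOY_DEFAULT_CONFIG, %w[table tr td])

p custom[:elements]
p TOY_DEFAULT_CONFIG[:elements]
```

Freezing the default makes accidental in-place edits (for example `config[:elements] << "table"`) raise immediately instead of silently widening the allowlist for every pipeline.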
🔧Why these technologies
- Nokogiri (HTML/XML parser) — Provides robust DOM parsing and traversal for NodeFilter operations; standard Ruby solution for HTML manipulation.
- Rack::Utils (HTML escaping) — Secure HTML entity encoding to prevent XSS when processing user-provided content.
- Ruby Regexp (pattern matching) — Efficient detection of patterns like mentions (@user), emoji shortcodes (:smile:), and URLs in text filters.
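As a small illustration of the pattern-matching point, the following plain-Ruby sketch detects mentions and emoji shortcodes. Both patterns are hypothetical and are not taken from the gem's MentionFilter or EmojiFilter, whose actual expressions may differ.

```ruby
# Hypothetical patterns for illustration only.
# The lookbehind avoids matching the "@example" inside an email address.
MENTION   = /(?<![\w@])@([a-z0-9][a-z0-9-]*)/i
SHORTCODE = /:([a-z0-9_+-]+):/

text = "Thanks @alice and @bob-2 :smile: (mail: carol@example.com)"

mentions   = text.scan(MENTION).flatten
shortcodes = text.scan(SHORTCODE).flatten

p mentions
p shortcodes
```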
⚖️Trade-offs already made
- Filter chain architecture (sequential, immutable input)
  - Why: Enables composability and predictable behavior; each filter operates on output of previous.
  - Consequence: Full HTML reparsing between filters may have performance impact on very large documents; parallelization not possible.
- NodeFilter requires full Nokogiri document parsing
  - Why: Enables precise DOM navigation and attribute manipulation; safe modification semantics.
  - Consequence: Cannot process streaming or partial HTML; must hold entire document in memory.
- Instrumentation via optional service callback
  - Why: Allows performance monitoring without adding observability burden to core code path.
  - Consequence: Instrumentation is opt-in and not built-in; requires explicit service injection.
- Allowlist-based sanitization (not blocklist)
  - Why: Secure by default; protects against unknown attack vectors.
  - Consequence: May be overly restrictive for some use cases; requires customization for permissive HTML.
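The opt-in instrumentation trade-off can be sketched as follows. `ToyInstrumenter`, `run_filter`, and the `"call_filter"` event name are all hypothetical; the gem's actual service interface should be checked in lib/html_pipeline/filter.rb. The point is only that the core path runs unchanged when no service is injected.

```ruby
# Hypothetical service object; any object responding to #instrument would do.
class ToyInstrumenter
  attr_reader :events

  def initialize
    @events = []
  end

  # Times the block, records one event, and returns the block's value.
  def instrument(name, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    value = yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @events << { name: name, payload: payload, seconds: elapsed }
    value
  end
end

# Stand-in for running one filter; the upcase call is placeholder work.
def run_filter(filter_name, input, service: nil)
  work = -> { input.upcase }
  return work.call if service.nil?

  service.instrument("call_filter", { filter: filter_name }) { work.call }
end

service = ToyInstrumenter.new
puts run_filter("ToyUpcase", "hello")                   # no service injected
puts run_filter("ToyUpcase", "hello", service: service) # same value, one event recorded
puts service.events.length
```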
🚫Non-goals (don't propose these)
- Real-time or streaming HTML processing; all filters operate on complete documents in memory.
- Parallel filter execution; pipeline is strictly sequential.
- Authentication or user identity verification; filters are content-agnostic.
- HTML validation or compliance checking against W3C standards.
- JavaScript execution or dynamic content rendering; filters operate on static HTML only.
🪤Traps & gotchas
No explicit environment variables are required. Key gotcha: the README asks "Why doesn't my pipeline work when there's no root element in the document?" Nokogiri wraps fragmentary HTML in <html><body> automatically, which can surprise users expecting bare strings. Sanitization occurs by default if no sanitization_config is provided, but the allowlist config must match your security requirements. The gem assumes Nokogiri is installed and compatible; version constraints are in html-pipeline.gemspec (check dependencies/).
🏗️Architecture
💡Concepts to learn
- Filter Chain / Pipeline Pattern — Core architectural pattern in this gem—understanding how filters compose sequentially (TextFilter → ConvertFilter → SanitizationFilter → NodeFilter) is essential to extending it
- Allowlist-based Sanitization (HTML5 SafeList) — SanitizationFilter uses allowlist (whitelist) approach for XSS prevention, not blacklist; critical to understand why DEFAULT_CONFIG is conservative
- CSS Selectors for DOM Queries — NodeFilters operate via Nokogiri CSS selectors (e.g., doc.css('a[@data-user]')); mastery of this enables writing efficient, readable filters
- Document Fragment Processing — html-pipeline handles both full HTML documents and fragments; understanding Nokogiri's auto-wrapping of fragments in <html><body> prevents subtle bugs
- Instrumentation / Observable Pattern — Filters can emit instrumentation events (see Filter base class); understanding how to hook into this enables observability and debugging of pipeline execution
- Markdown to HTML Conversion (CommonMark Spec) — MarkdownFilter relies on CommonMark (not GitHub-Flavored Markdown by default); understanding the spec and its differences from GFM is important for expected output
🔗Related repos
- gjtorikian/gollum-lib: Wiki engine built on html-pipeline; shows real-world pipeline composition with Markdown conversion and custom node filters
- github/markup: GitHub's markup language dispatcher; often paired with html-pipeline for handling multiple input formats (Markdown, AsciiDoc, etc.)
- jch/html-pipeline-linkify: Community extension for html-pipeline adding autolinking; demonstrates the third-party filter pattern
- sparklemotion/nokogiri: Underlying DOM library that html-pipeline depends on for all HTML/XML parsing and node manipulation
- commonmark/commonmark-ruby: Markdown parser powering MarkdownFilter; used for text-to-HTML conversion in the pipeline
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive test coverage for ConvertFilter and its subclasses
The test directory has test/html_pipeline/convert_filter/markdown_filter_test.rb, but there's no test file for the base ConvertFilter class itself (lib/html_pipeline/convert_filter.rb). Given the inheritance pattern used throughout the codebase (Filter -> ConvertFilter -> MarkdownFilter), the base class likely has important behavior that should be tested independently. This would ensure the filter abstraction is solid and catch regressions in the conversion pipeline.
- [ ] Create test/html_pipeline/convert_filter_test.rb
- [ ] Add tests for ConvertFilter initialization, context handling, and error cases
- [ ] Verify integration with the parent Filter class behavior
- [ ] Run full test suite to ensure no regressions
Add missing test file for Filter base class
The lib/html_pipeline/filter.rb file is the foundation for all filters in the codebase, yet there's no dedicated test/html_pipeline/filter_test.rb. Looking at the test structure, all concrete filter classes have tests, but the base abstraction does not. This should include tests for context passing, error handling, and the fundamental filter lifecycle that all subclasses depend on.
- [ ] Create test/html_pipeline/filter_test.rb
- [ ] Add tests for Filter base class initialization and abstract method contracts
- [ ] Test context object handling and validation
- [ ] Test error handling for improperly implemented filters
- [ ] Verify compatibility with NodeFilter and TextFilter subclasses
Add integration tests for multi-filter pipelines in html_pipeline_test.rb
The main test/html_pipeline_test.rb likely tests basic pipeline functionality, but given the README emphasizes 'chainable content filters', there should be explicit tests demonstrating real-world filter chains. For example: MarkdownFilter -> SyntaxHighlightFilter -> SanitizationFilter, or PlainTextInputFilter -> EmojiFilter -> MentionFilter. These would validate that context is properly threaded through multiple filters and demonstrate the primary value proposition of the library.
- [ ] Review current test/html_pipeline_test.rb to identify coverage gaps
- [ ] Add test cases for 2-3 realistic filter chains mentioned in README
- [ ] Include tests for context mutation across filter boundaries
- [ ] Test error propagation and filter ordering dependencies
- [ ] Document expected behavior for commonly-used filter combinations
🌿Good first issues
- Add test coverage for edge cases in lib/html_pipeline/node_filter/syntax_highlight_filter.rb—examine test/html_pipeline/node_filter/syntax_highlight_filter_test.rb to identify untested code paths (e.g., missing language specification, malformed code blocks).
- Document the allowlist schema for SanitizationFilter::DEFAULT_CONFIG in lib/html_pipeline/sanitization_filter.rb with inline examples in the README; currently only referenced abstractly in the FAQ.
- Create a new TextFilter example (e.g., lib/html_pipeline/text_filter/placeholder_filter.rb) that replaces {{token}} patterns before Markdown conversion, with full test coverage, to demonstrate the TextFilter extension pattern for documentation.
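The `{{token}}` idea in the last item can be prototyped in plain Ruby before wiring it into the gem. `ToyPlaceholderFilter` is a hypothetical standalone class: it does not inherit from the real TextFilter, and its constructor and `call` signature are assumptions.

```ruby
# Hypothetical text-level filter; not a subclass of the gem's TextFilter.
class ToyPlaceholderFilter
  TOKEN = /\{\{\s*(\w+)\s*\}\}/

  def initialize(values)
    @values = values
  end

  # Replaces known tokens; unknown tokens are left untouched so they
  # stay visible in the output instead of disappearing silently.
  def call(text)
    text.gsub(TOKEN) do
      match = Regexp.last_match
      @values.fetch(match[1].to_sym) { match[0] }
    end
  end
end

filter = ToyPlaceholderFilter.new(name: "Ada")
puts filter.call("Hi {{ name }}, see {{missing}}")
```

Running this kind of replacement before Markdown conversion means substituted values are still treated as Markdown source, so values that come from users should be treated as untrusted input.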
⭐Top contributors
- @gjtorikian — 79 commits
- @actions-user — 8 commits
- @dependabot[bot] — 5 commits
- @jeremysmithco — 2 commits
- @ppworks — 2 commits
📝Recent commits
- c99d76d: Merge pull request #429 from gjtorikian/release/v3.2.4 (gjtorikian)
- f00ac92: [skip test] update changelog (gjtorikian)
- 4bd9392: Merge pull request #428 from gjtorikian/allow-for-sanitization-nil (gjtorikian)
- 7a75c3e: :gem: bump to 3.2.4 (gjtorikian)
- 973cbef: add minitest/mock for stubs (gjtorikian)
- f75cd21: Merge branch 'main' into allow-for-sanitization-nil (gjtorikian)
- 7a6e748: Merge pull request #427 from gjtorikian/support-ruby-4 (gjtorikian)
- 251dde6: loosen commonmarker (gjtorikian)
- a1b66f0: no need for this (gjtorikian)
- 1b5c5fb: [auto-lint]: Lint files (gjtorikian)
🔒Security observations
- High · Potential XSS Vulnerability in HTML Processing Pipeline (lib/html_pipeline/sanitization_filter.rb and all filter implementations). The html-pipeline library processes and filters HTML content. Given the presence of filters like SanitizationFilter, NodeFilters (MentionFilter, EmojiFilter, etc.), and ConvertFilters, there is a significant risk of Cross-Site Scripting (XSS) vulnerabilities if output is not properly escaped or if sanitization is misconfigured. The library's core purpose is HTML manipulation, which is inherently risky. Fix: Ensure all filters properly escape output. Verify that SanitizationFilter uses an allowlist-based approach (e.g., Sanitize gem with strict configuration). Conduct thorough security testing for edge cases in XSS prevention. Document safe usage patterns clearly.
- High · Missing Dependency Information (html-pipeline.gemspec and Gemfile). The dependency file content is empty or not provided, so it cannot be verified whether dependent gems have known vulnerabilities. The gemspec file (html-pipeline.gemspec) should declare dependencies, but its content was not provided for analysis. This makes it impossible to identify whether vulnerable versions of gems like Sanitize, Nokogiri, or Markdown parsers are being used. Fix: Provide complete gemspec and Gemfile contents. Run 'bundle audit' regularly to check for known vulnerabilities in dependencies. Use specific version pinning and regularly update dependencies. Implement automated dependency scanning in the CI/CD pipeline (as indicated by dependabot.yml, which is good).
- High · Potential Code Injection via Custom Filters (lib/html_pipeline/filter.rb, lib/html_pipeline/text_filter.rb, lib/html_pipeline/node_filter.rb, lib/html_pipeline/convert_filter.rb). The pipeline architecture allows users to define custom filters (TextFilter, NodeFilter, ConvertFilter). If user-supplied filter logic is executed without proper validation, arbitrary code execution may be possible. The framework's flexibility is a security risk if not carefully constrained. Fix: Implement strict input validation for filter configuration. Use sandboxing or code review processes for custom filters. Document security best practices for extending the pipeline. Consider restricting filter instantiation to known safe filters only.
- Medium · Markdown Filter Processing Risk (lib/html_pipeline/convert_filter/markdown_filter.rb). The MarkdownFilter converts Markdown to HTML. Depending on the underlying Markdown parser used, there could be vulnerabilities in how raw HTML is handled within Markdown content. Some parsers allow raw HTML passthrough, which could enable XSS. Fix: Verify that the Markdown parser (apparently commonmarker, judging by the recent "loosen commonmarker" commit) is configured to disable raw HTML or to properly escape it. Run the output through SanitizationFilter. Test with payloads containing embedded HTML/JavaScript in Markdown.
- Medium · External Resource Loading Risks (lib/html_pipeline/node_filter/asset_proxy_filter.rb and lib/html_pipeline/node_filter/absolute_source_filter.rb). Filters like AssetProxyFilter and AbsoluteSourceFilter modify URLs and handle external resources. There is potential risk of Server-Side Request Forgery (SSRF) if URLs are not properly validated before being fetched or proxied. Fix: Implement strict URL validation and allowlisting for external resources. Prevent access to internal/private IP ranges (127.0.0.1, 10.x.x.x, 172.16.x.x, 192.168.x.x). Use an allowlist-based approach for protocols (http, https only). Set timeouts for external requests.
- Medium · Regular Expression Denial of Service (ReDoS) Risk (lib/html_pipeline/node_filter/mention_filter.rb, lib/html_pipeline/node_filter/team_mention_filter.rb, lib/html_pipeline/node_filter/syntax_highlight_filter.rb). The codebase includes multiple filters with regex patterns (MentionFilter, TeamMentionFilter, SyntaxHighlightFilter, etc.). Without careful regex design, attackers could craft input that causes catastrophic backtracking and denial of service. Fix: Audit all regular expressions for ReDoS vulnerabilities. Use tools like 'ruby-regex-check' or online ReDoS checkers. Implement input length limits. Consider using non-back
LLM-derived; treat as a starting point, not a security audit.
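The URL validation suggested for the external-resource item can be sketched with Ruby's standard library. `safe_external_url?` and `PRIVATE_RANGES` are hypothetical names and are not part of html-pipeline. The sketch checks only the scheme and literal IP hosts; it does not resolve hostnames, so it is a starting point and not a complete SSRF defense.

```ruby
require "uri"
require "ipaddr"

# Loopback and private ranges to reject (172.16.0.0/12 covers 172.16.x.x to 172.31.x.x).
PRIVATE_RANGES = %w[127.0.0.0/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 ::1/128]
                 .map { |cidr| IPAddr.new(cidr) }
                 .freeze

# Hypothetical helper: true only for http/https URLs whose host is not a
# literal loopback or private IP address. Hostnames are NOT resolved here.
def safe_external_url?(url)
  uri = URI.parse(url)
  return false unless %w[http https].include?(uri.scheme)

  host = uri.hostname
  return false if host.nil? || host.empty?

  begin
    ip = IPAddr.new(host)
  rescue IPAddr::InvalidAddressError
    return true # not an IP literal; DNS-based checks are out of scope for this sketch
  end

  PRIVATE_RANGES.none? { |range| range.include?(ip) }
rescue URI::InvalidURIError
  false
end

puts safe_external_url?("https://example.com/a.png")
puts safe_external_url?("http://127.0.0.1/admin")
```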
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.