scinfu/SwiftSoup

Rating: 5
Author: RepoPilot

SwiftSoup: Pure Swift HTML Parser, with best of DOM, CSS, and jquery (Supports Linux, iOS, Mac, tvOS, watchOS)

Healthy

Healthy across the board

Dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

⚠ Concentrated ownership: top contributor handles 78% of recent commits
✓ Last commit 6w ago
✓ 5 active contributors
✓ MIT licensed
✓ CI configured
✓ Tests present

Computed from maintenance signals — commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Onboarding: scinfu/SwiftSoup

Generated by RepoPilot · 2026-06-24 · Source

🎯Verdict

GO — Healthy across the board

Last commit 6w ago
5 active contributors
MIT licensed
CI configured
Tests present
⚠ Concentrated ownership — top contributor handles 78% of recent commits

⚡TL;DR

SwiftSoup is a pure Swift HTML parser that parses and manipulates HTML according to the WHATWG HTML5 specification, providing DOM traversal, CSS selector queries, and jQuery-like methods for extracting and modifying HTML content. It handles malformed HTML reliably across iOS, macOS, tvOS, watchOS, and Linux without external C dependencies. Monolithic structure: Sources/ contains ~80 Swift files implementing the HTML5 parser (CharacterReader.swift handles lexing, Document.swift models the DOM tree, CssSelector.swift handles queries). Example/ provides an iOS app demo. Tests are likely in a separate directory not shown in top-60 listing.

👥Who it's for

iOS/Swift developers building web scrapers, data extraction tools, or HTML content processors who need a native Swift alternative to JSoup (Java) or BeautifulSoup (Python), without relying on system HTML parsers or Objective-C bridging.

🌱Maturity & risk

Production-ready. The project has 2000+ Swift LOC with multi-platform CI/CD (macOS and Ubuntu workflows), CocoaPods/SPM/Carthage distribution, version 2.6.0+, and Swift 5 support. Regular maintenance visible but commit frequency appears moderate—solid stability over bleeding-edge pace.

Low risk. Ownership is concentrated (the top contributor, @aehlke, accounts for 78% of recent commits), which is a potential concern for future maintenance. Against that, the pure Swift implementation has no complex external dependencies, the API surface is clear, and established multi-platform support reduces adoption friction. No major breaking changes are visible in recent tags.

Active areas of work

Project appears in steady-state maintenance rather than active feature development—no specific milestone or active PR data visible in the file structure. CI pipelines (macos.yml, ubuntu.yml) suggest ongoing integration testing. CHANGELOG.md likely documents recent fixes.

🚀Get running

git clone https://github.com/scinfu/SwiftSoup.git && cd SwiftSoup && swift build (uses Swift Package Manager via Package.swift). For Xcode: open .swiftpm/xcode/package.xcworkspace. For iOS example app: open Example/Example.xcworkspace and build for simulator.

Daily commands: swift build compiles the library. swift test runs the test suite (if Tests/ dir exists). For the iOS Example app: xcode-select --install, then open Example/Example.xcworkspace && select Example target → Run on simulator.

🗺️Map of the codebase

Sources/SwiftSoup.swift — Main API entry point exposing parse() and related public methods that all users interact with
Sources/Parser.swift — Core parsing orchestration that drives HTML tokenization and tree building—fundamental to all parsing flows
Sources/HtmlTreeBuilder.swift — State machine implementing HTML5 tree construction algorithm, central to parsing correctness
Sources/Tokeniser.swift — Lexical analyzer converting input streams into tokens; any parsing failure often originates here
Sources/Node.swift — Base DOM node abstraction; all tree elements inherit from this class
Sources/Element.swift — HTML element representation with selector/query support; primary interface for DOM manipulation
Sources/CssSelector.swift — CSS selector compilation and evaluation engine enabling select()/selectFirst() queries

🧩Components & responsibilities

Tokeniser + TokeniserState (CharacterReader, Token, TokenQueue) — Convert byte stream into semantic tokens (tags, text, comments, doctype) following HTML5 tokenisation algorithm
- Failure mode: Malformed token recognition; incorrect text/tag boundaries cause downstream tree corruption
HtmlTreeBuilder + HtmlTreeBuilderState (Node, Element, Document, Attributes) — Apply HTML5 tree construction rules to feed tokens into a DOM tree with proper foster-parenting and scope management
- Failure mode: Incorrect tree structure (unclosed tags not auto-closed, nesting violations); violates HTML5 semantics
CssSelector + QueryParser + Evaluator chain — Parse CSS selector strings and evaluate them against DOM elements to return matching subsets
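
To make the first stage of this pipeline concrete, here is a deliberately tiny sketch of turning markup into tag and text tokens. It is not SwiftSoup's Tokeniser: `MiniToken` and `miniTokenise` are hypothetical names, and the sketch ignores attributes, comments, doctypes, entities, and all HTML5 error recovery.

```swift
// Hypothetical, simplified token type; SwiftSoup's real Token is richer.
enum MiniToken: Equatable {
    case startTag(String)
    case endTag(String)
    case text(String)
}

// Splits markup into tokens by scanning one character at a time.
// Assumption: every "<" opens a tag that a later ">" closes.
func miniTokenise(_ html: String) -> [MiniToken] {
    var tokens: [MiniToken] = []
    var buffer = ""
    var insideTag = false

    for ch in html {
        if ch == "<" && !insideTag {
            // Flush any pending text before the tag starts.
            if !buffer.isEmpty { tokens.append(.text(buffer)) }
            buffer = ""
            insideTag = true
        } else if ch == ">" && insideTag {
            // A leading "/" marks an end tag.
            if buffer.hasPrefix("/") {
                tokens.append(.endTag(String(buffer.dropFirst()).lowercased()))
            } else {
                tokens.append(.startTag(buffer.lowercased()))
            }
            buffer = ""
            insideTag = false
        } else {
            buffer.append(ch)
        }
    }
    // Leftover characters (including an unterminated tag's contents) become text.
    if !buffer.isEmpty { tokens.append(.text(buffer)) }
    return tokens
}

let tokens = miniTokenise("<P>Hi</P>")
print(tokens)
```

A tree builder would consume this token stream next; the failure mode listed for the tokeniser corresponds to a wrong boundary decision in the loop above.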

🛠️How to make changes

Add support for a new CSS pseudo-selector

Define new evaluator class inheriting from Evaluator in Sources/Evaluator.swift (e.g., PseudoFirstChild) (Sources/Evaluator.swift)
Register the evaluator parsing logic in QueryParser.parseSelector() to recognize the pseudo-class syntax (Sources/QueryParser.swift)
Implement matches(Element) logic to check if an element satisfies the pseudo-selector predicate (Sources/Evaluator.swift)
Add test cases in the Example project or create inline tests verifying select() returns correct elements
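
The shape of such an evaluator can be sketched without the library. Everything below is hypothetical: `MiniElement`, `MiniEvaluator`, `FirstChildEvaluator`, and `collect` are illustrative stand-ins, and the real Evaluator base class and its `matches` signature in Sources/Evaluator.swift may differ.

```swift
// Hypothetical stand-in for a DOM element; not SwiftSoup's Element.
final class MiniElement {
    let tagName: String
    weak var parent: MiniElement?
    private(set) var children: [MiniElement] = []

    init(_ tagName: String) { self.tagName = tagName }

    func appendChild(_ child: MiniElement) {
        child.parent = self
        children.append(child)
    }
}

// Hypothetical evaluator contract: one predicate per selector fragment.
protocol MiniEvaluator {
    func matches(_ element: MiniElement) -> Bool
}

// Sketch of a ":first-child"-style predicate.
struct FirstChildEvaluator: MiniEvaluator {
    func matches(_ element: MiniElement) -> Bool {
        guard let parent = element.parent else { return false }
        return parent.children.first === element
    }
}

// Collects every descendant of `root` that satisfies the evaluator.
func collect(_ evaluator: MiniEvaluator, under root: MiniElement) -> [MiniElement] {
    var hits: [MiniElement] = []
    for child in root.children {
        if evaluator.matches(child) { hits.append(child) }
        hits += collect(evaluator, under: child)
    }
    return hits
}
```

A test for a new pseudo-selector then reduces to building a small tree and asserting which elements `collect` returns.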

Add a new HTML tag with special parsing rules

Define tag in Sources/Tag.swift as a static constant (e.g., Tag("customtag")) (Sources/Tag.swift)
If tag requires special tree construction, add state transitions in HtmlTreeBuilderState.processCharacters() or relevant state method (Sources/HtmlTreeBuilderState.swift)
Update ParsingStrings.swift if the tag needs special entity/text handling (Sources/ParsingStrings.swift)

Improve parser robustness for malformed HTML

Add error recovery logic in HtmlTreeBuilder error handling paths (e.g., unclosed tag handling) (Sources/HtmlTreeBuilder.swift)
Update ParseErrorList to categorize new error types and enable error callback reporting (Sources/ParseErrorList.swift)
Test edge case by enabling ParseError tracking in Parser.parse() and validating tree structure (Sources/Parser.swift)
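
A bounded error collector of the kind these steps describe might look like the sketch below. `MiniParseError` and `MiniErrorList` are hypothetical names, not the API of Sources/ParseErrorList.swift; the cap on recorded errors is an assumption made for illustration.

```swift
// Hypothetical bounded error collector; not SwiftSoup's ParseErrorList.
struct MiniParseError: Equatable {
    let position: Int
    let message: String
}

struct MiniErrorList {
    let maxSize: Int
    private(set) var errors: [MiniParseError] = []

    init(maxSize: Int) { self.maxSize = maxSize }

    // True while fewer than maxSize errors are recorded; maxSize 0 disables tracking.
    var canAddError: Bool { errors.count < maxSize }

    mutating func add(_ message: String, at position: Int) {
        guard canAddError else { return }   // errors past the cap are dropped
        errors.append(MiniParseError(position: position, message: message))
    }
}

var errorList = MiniErrorList(maxSize: 2)
errorList.add("unexpected end tag", at: 4)
errorList.add("unclosed element", at: 9)
errorList.add("this one is dropped", at: 12)
print(errorList.errors.count)
```

Capping the list keeps a badly malformed document from producing an unbounded error log while still surfacing the first few problems.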

🔧Why these technologies

Pure Swift with no external dependencies — Enables cross-platform support (iOS, macOS, tvOS, watchOS, Linux) without managing native bindings or version conflicts
HTML5 tree construction state machine: Implements the WHATWG HTML5 spec for robust parsing of real-world malformed HTML, matching browser behavior
LRU query cache (QueryParserCache) — Amortizes CSS selector compilation cost across repeated queries on the same selector string
Character-by-character tokeniser with lookahead — Allows accurate context-sensitive token recognition (e.g., CDATA vs text in XML)
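
The query-cache idea can be illustrated with a minimal LRU cache. This is a sketch under stated assumptions, not the code in Sources/LRUCache.swift: `MiniLRUCache` is a hypothetical type, and it tracks recency in an array, so each access costs O(n) in the capacity.

```swift
// Hypothetical minimal LRU cache; recency bookkeeping is intentionally naive.
struct MiniLRUCache<Key: Hashable, Value> {
    private let capacity: Int
    private var storage: [Key: Value] = [:]
    private var order: [Key] = []   // least recently used first

    init(capacity: Int) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    // Returns the cached value and marks the key as most recently used.
    mutating func value(for key: Key) -> Value? {
        guard let value = storage[key] else { return nil }
        touch(key)
        return value
    }

    mutating func insert(_ value: Value, for key: Key) {
        if storage[key] == nil && storage.count == capacity {
            // Evict the least recently used entry to stay within capacity.
            let evicted = order.removeFirst()
            storage[evicted] = nil
        }
        storage[key] = value
        touch(key)
    }

    // Moves `key` to the most-recently-used end of the order list.
    private mutating func touch(_ key: Key) {
        if let index = order.firstIndex(of: key) { order.remove(at: index) }
        order.append(key)
    }
}

// Usage: cache a stand-in for a "compiled" selector, keyed by the selector string.
var selectorCache = MiniLRUCache<String, [String]>(capacity: 2)
selectorCache.insert(["div", "p"], for: "div p")
print(selectorCache.value(for: "div p") ?? [])
```

Keying on the raw selector string is what lets repeated `select()` calls with the same query skip recompilation.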

⚖️Trade-offs already made

Strict WHATWG HTML5 compliance with automatic error recovery
- Why: Real-world HTML is malformed; spec-compliant parsing guarantees consistent behavior
- Consequence: Parser is slower and more memory-intensive than permissive parsers; adds complexity in state machine
Mutable DOM tree kept separate from the source HTML (the input string itself is never edited in place)
- Why: Safer API; edits go through node methods that keep the tree structure consistent
- Consequence: Changes exist only in the tree until it is serialised back to HTML; may be slower for heavy DOM manipulation
Single-threaded synchronous parsing
- Why: Simplicity and predictable performance; callers can wrap parsing in GCD or async/await if needed
- Consequence: Large documents block the calling thread; no built-in parallelization
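
Because parsing is synchronous, a caller who wants to keep a UI thread free has to move the work elsewhere. The sketch below shows one way to do that with GCD. `slowParse` and `parseInBackground` are hypothetical names, and `slowParse` only counts `<` characters as a stand-in for a real blocking parse call.

```swift
import Foundation

// Hypothetical stand-in for a blocking parse call; it only counts "<" characters.
func slowParse(_ html: String) -> Int {
    return html.filter { $0 == "<" }.count
}

// Runs the blocking work on a background queue and reports back on `callbackQueue`.
func parseInBackground(_ html: String,
                       callbackQueue: DispatchQueue = .main,
                       completion: @escaping (Int) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        let result = slowParse(html)   // the caller's thread is not blocked here
        callbackQueue.async { completion(result) }
    }
}
```

In an app the default `.main` callback queue is usually what you want; a command-line tool without a running main queue would pass its own queue instead.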

🚫Non-goals (don't propose these)

Does not validate XML/HTML against DTD or schema
Does not execute JavaScript or handle dynamic content
Does not provide real-time incremental parsing (full document buffered before parsing)
Does not include HTTP/URL fetching (caller must provide HTML string)
Does not guarantee thread-safe mutations of parsed tree without external locking

🪤Traps & gotchas

No external service dependencies, but note the following:
HTML5 parsing is complex: edge cases in malformed HTML may behave unexpectedly unless you consult the WHATWG spec.
Performance on very large documents (>10MB) is not covered in the public docs.
Character encoding handling appears limited to explicitly supplied UTF-8; charset detection from HTML meta tags may differ from browsers.
watchOS/tvOS support is declared but rarely exercised in practice; report bugs if you target these platforms.

💡Concepts to learn

WHATWG HTML5 Parsing Algorithm — SwiftSoup implements this standardized state machine for robust tag soup parsing; understanding it explains why malformed HTML is handled predictably.
CSS Selector Combinators and Pseudo-selectors — Core to the query API (select, selectFirst)—knowing descendant, child, and attribute selectors is essential for effective data extraction.
DOM Tree Traversal (pre-order, depth-first): NodeTraversor.swift walks the tree depth-first, while sibling accessors such as nextSibling() step sideways; understanding the visit order is critical for iteration logic.
Character Encoding Detection — CharacterReader.swift must handle UTF-8, Latin-1, and declared charsets; mismatches cause parsing corruption in international content.
Tokenization and Lookahead Buffering — HTML parsing requires 1+ character lookahead for entity recognition; CharacterReader implements buffer management for efficiency.
XSS Prevention via Whitelist Sanitization — Cleaner.swift provides tag/attribute whitelisting; essential for safe user-submitted HTML rendering without script injection.
Memory-Efficient String Handling with ByteSlice — Sources/ByteSlice.swift avoids full string copies during parsing; crucial for handling large documents on memory-constrained platforms like watchOS.
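
Of the concepts above, pre-order depth-first traversal is the easiest to pin down in code. The sketch below uses a hypothetical `TreeNode` type and `preOrder` function, not SwiftSoup's Node or NodeTraversor; it visits each node before its children, left to right.

```swift
// Hypothetical tree node; not SwiftSoup's Node.
final class TreeNode {
    let name: String
    var children: [TreeNode]

    init(_ name: String, _ children: [TreeNode] = []) {
        self.name = name
        self.children = children
    }
}

// Visits a node before its children, left to right (pre-order, depth-first).
func preOrder(_ node: TreeNode, depth: Int = 0, visit: (TreeNode, Int) -> Void) {
    visit(node, depth)
    for child in node.children {
        preOrder(child, depth: depth + 1, visit: visit)
    }
}

// Usage: print an indented outline of a tiny document tree.
let page = TreeNode("html", [
    TreeNode("head"),
    TreeNode("body", [TreeNode("p")])
])
preOrder(page) { node, depth in
    print(String(repeating: "  ", count: depth) + node.name)
}
```

Document order in a DOM is exactly this visit order, which is why results of a selector query come back in the order the elements appear in the markup.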

🔗Related repos

scinfu/SwiftSoup: the main repository itself, the canonical source.
joannis/HTML — Alternative Swift HTML parser using Foundation's XMLParser; lighter weight but less jQuery-like API and CSS selector support.
vapor/html — HTML generation and light parsing for Swift Vapor web framework; complements SwiftSoup for server-side templating.
apple/swift-org-website — Official Swift.org site uses SwiftSoup for internal content parsing examples.
jpsim/Yams — Peer library in Swift ecosystem for structured parsing (YAML); demonstrates similar pure-Swift parser design patterns.

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for CSS selector evaluation (CssSelector.swift & Evaluator.swift)

No test directory is visible in the file listing RepoPilot sampled, even though the maintenance signals report tests as present, so confirm existing coverage first. CssSelector.swift and Evaluator.swift are core components that need test coverage for pseudo-selectors, combinators, and edge cases. This is critical for a parser library where correctness is paramount.

[ ] Create a Tests/ directory structure with the XCTest framework if one does not already exist
[ ] Add unit tests for CssSelector.swift covering pseudo-classes (:nth-child, :first-child, :last-child, etc.)
[ ] Add tests for Evaluator.swift covering element matching, attribute selectors, and combinator logic
[ ] Add regression tests for known CSS selector edge cases and malformed inputs
[ ] Update Package.swift to include a test target if it is missing

Add GitHub Actions workflow for Swift Package Manager on multiple Swift versions

The repo has macos.yml and ubuntu.yml CI workflows, and a legacy .travis.yml appears to still be present. Modern Swift projects should test across multiple Swift versions (for example 5.3, 5.5, 5.9) natively with SPM, which guards against differences in dependency resolution and language features.

[ ] Create .github/workflows/spm-matrix.yml testing Swift 5.3, 5.5, 5.9 on macOS and Ubuntu
[ ] Test both swift build and swift test with matrix strategy
[ ] Add test for watchOS and tvOS minimum deployment targets via build-only targets
[ ] Replace or deprecate .travis.yml references in README.md
[ ] Verify backwards compatibility with older Swift versions

Add performance benchmarks and memory profiling utilities (new Benchmarks directory)

A parser library needs performance guarantees, especially for large HTML documents. The codebase has optimization hints (LRUCache.swift, BinarySearch.swift) but no benchmarks. Contributors need a way to measure impact on parsing speed and memory usage.

[ ] Create Benchmarks/ directory with Swift package target
[ ] Add benchmark suite for parsing large HTML files (1MB+, 10MB+) measuring execution time and peak memory
[ ] Create micro-benchmarks for hot paths: HtmlTreeBuilder.swift, CharacterReader.swift, NodeTraversor.swift
[ ] Add CI workflow (.github/workflows/benchmarks.yml) to track performance across PRs
[ ] Document how to run benchmarks in CONTRIBUTING.md (create if missing)

🌿Good first issues

Add comprehensive unit tests for Sources/CharacterExt.swift and Sources/BinarySearch.swift—these utility files lack visible test coverage and are foundational to correctness.
Create a performance benchmark suite (compare parse time vs. file size for public HTML5 test cases) and document in README; currently no performance guidance exists.
Implement missing CSS selector features (e.g., :nth-child() or :contains() edge cases) by extending CombiningEvaluator.swift and Collector.swift with failing tests as reference.

⭐Top contributors

@aehlke: 78 commits
@rursache: 16 commits
@macdrevx: 4 commits
@chrisjenkins: 1 commit
@MiuraKairi: 1 commit

📝Recent commits

6c7915e — Merge pull request #391 from rursache/fix/compound-attribute-selector-regression (aehlke)
9bbf537 — Fix compound attribute selector regression in fast query path (rursache)
e409102 — Merge pull request #389 from rursache/fix/form-input-direct-children-388 (aehlke)
8fcecf0 — Fix form and input element tree building regression (#388) (rursache)
fd541c4 — Merge pull request #387 from chrisjenkins/codex/fix-base-url-prepending (aehlke)
13e3937 — Fix whitespace trim before resolve (chrisjenkins)
b38f3ce — Merge pull request #386 from MiuraKairi/fix/license-format (aehlke)
d01d004 — Normalize copyright lines in LICENSE (MiuraKairi)
dba183c — Merge pull request #380 from rursache/feature/auto-detect-xml-parse (aehlke)
9f089a0 — Add auto-detection to parse() and explicit parseHTML() APIs (rursache)

🔒Security observations

SwiftSoup is a pure HTML parsing library with a relatively clean security posture for a parser. The main security considerations are inherent to HTML parsing libraries: XSS risks when parsing untrusted content without sanitization, potential entity expansion attacks, and lack of input size limits. The codebase shows no obvious hardcoded credentials, malicious dependencies, or infrastructure misconfigurations. Primary recommendations include: (1) implementing input size/recursion depth limits, (2) adding comprehensive security documentation for library users, (3) establishing a vulnerability disclosure process, and (4) documenting best practices for safe handling of untrusted HTML input. Users of this library should understand they are responsible for sanitizing output when rendering parsed content from untrusted sources.

High · HTML Parsing Without Built-in XSS Protection (Sources/Parser.swift, Sources/HtmlTreeBuilder.swift, Sources/Document.swift). As a parsing library, SwiftSoup does not sanitize or filter potentially malicious content while parsing. Applications that parse untrusted HTML could be vulnerable to XSS if the parsed content is rendered without sanitization. Fix: Document security best practices for library users. Recommend output encoding and HTML sanitization before rendering parsed content from untrusted sources, and point users to the opt-in whitelist support in Cleaner.swift.
Medium · Potential Entity Expansion Attack (XXE-like) — Sources/Entities.swift, Sources/Parser.swift. HTML parsers can be vulnerable to entity expansion attacks if they process HTML entities without limits. The Entities.swift file suggests entity handling is implemented, but without visible limits on entity expansion or recursive processing. Fix: Implement limits on entity expansion depth and size. Add protection against billion laughs attack style vulnerabilities. Document entity handling behavior and any security limitations.
Medium · No Apparent Input Size Limits — Sources/Parser.swift, Sources/StreamReader.swift, Sources/Tokeniser.swift. The parser accepts arbitrary input size without visible constraints in the documented file structure. Large malicious HTML documents could potentially cause DoS through excessive memory consumption or processing time. Fix: Implement and document maximum input size limits. Consider implementing configurable limits for document size, nesting depth, and parsing timeout. Add guidance for production deployments.
Low · Missing Security Documentation — Repository root. No explicit security policy, vulnerability disclosure process, or security guidelines are evident in the repository structure. Security.md or SECURITY.txt file is missing. Fix: Create a SECURITY.md file documenting security best practices for library users, known limitations, and a responsible disclosure process for security issues.
Low · LRU Cache Implementation Without Size Limits Documentation — Sources/LRUCache.swift, Sources/QueryParserCache.swift. LRUCache.swift is used (likely in QueryParserCache.swift) but cache size limits and overflow behavior are not documented in the provided file listing. Fix: Document cache size limits, eviction policies, and memory usage. Ensure unbounded cache growth is prevented.
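
Until limits exist in the library, the input-size recommendation can be applied on the caller's side. The sketch below is an assumption-laden illustration: `InputGuardError` and `parseWithLimit` are hypothetical names, and the `parse` closure stands in for whatever parsing call the application actually makes.

```swift
// Hypothetical caller-side guard; not part of SwiftSoup.
enum InputGuardError: Error, Equatable {
    case tooLarge(bytes: Int, limit: Int)
}

// Rejects oversized input before it reaches the parser.
// `parse` is whatever parsing closure the caller supplies.
func parseWithLimit<T>(_ html: String,
                       maxBytes: Int,
                       parse: (String) throws -> T) throws -> T {
    let size = html.utf8.count   // size in UTF-8 bytes, not characters
    guard size <= maxBytes else {
        throw InputGuardError.tooLarge(bytes: size, limit: maxBytes)
    }
    return try parse(html)
}

// Usage with a placeholder "parser" that just measures the input.
do {
    let length = try parseWithLimit("<p>hi</p>", maxBytes: 1_000_000) { $0.count }
    print("parsed input of length \(length)")
} catch {
    print("rejected: \(error)")
}
```

A byte cap addresses memory exhaustion only; nesting depth and parse time would need separate checks.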

LLM-derived; treat as a starting point, not a security audit.

👉Where to read next

Open issues — current backlog
Recent PRs — what's actively shipping
Source on GitHub

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/scinfu/SwiftSoup shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live scinfu/SwiftSoup repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/scinfu/SwiftSoup.

What it runs against: a local clone of scinfu/SwiftSoup — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in scinfu/SwiftSoup | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 74 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>scinfu/SwiftSoup</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of scinfu/SwiftSoup. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/scinfu/SwiftSoup.git
#   cd SwiftSoup
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of scinfu/SwiftSoup and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "scinfu/SwiftSoup(\.git)?\b" \
  && ok "origin remote is scinfu/SwiftSoup" \
  || miss "origin remote is not scinfu/SwiftSoup (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift: was MIT at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "Sources/SwiftSoup.swift" \
  && ok "Sources/SwiftSoup.swift" \
  || miss "missing critical file: Sources/SwiftSoup.swift"
test -f "Sources/Parser.swift" \
  && ok "Sources/Parser.swift" \
  || miss "missing critical file: Sources/Parser.swift"
test -f "Sources/HtmlTreeBuilder.swift" \
  && ok "Sources/HtmlTreeBuilder.swift" \
  || miss "missing critical file: Sources/HtmlTreeBuilder.swift"
test -f "Sources/Tokeniser.swift" \
  && ok "Sources/Tokeniser.swift" \
  || miss "missing critical file: Sources/Tokeniser.swift"
test -f "Sources/Node.swift" \
  && ok "Sources/Node.swift" \
  || miss "missing critical file: Sources/Node.swift"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 74 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~44d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/scinfu/SwiftSoup"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
