
jhy/jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

Healthy

Healthy across the board

Use as dependency: Healthy (weakest axis)

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 1d ago
  • 5 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • Concentrated ownership — top contributor handles 65% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/jhy/jsoup)](https://repopilot.app/r/jhy/jsoup)

Paste at the top of your README.md — renders inline like a shields.io badge.

Onboarding doc

Onboarding: jhy/jsoup

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/jhy/jsoup shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 1d ago
  • 5 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 65% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live jhy/jsoup repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/jhy/jsoup.

What it runs against: a local clone of jhy/jsoup — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in jhy/jsoup | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>jhy/jsoup</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of jhy/jsoup. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/jhy/jsoup.git
#   cd jsoup
#
# Then paste this script. Every check is read-only; no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of jhy/jsoup and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "jhy/jsoup(\.git)?\b" \
  && ok "origin remote is jhy/jsoup" \
  || miss "origin remote is not jhy/jsoup (artifact may be from a fork)"

# 2. License matches what RepoPilot saw.
# Accepts a first line of "MIT ..." or "The MIT ..." (the optional "The "
# is an assumption about how the LICENSE file is worded).
(grep -qiE "^(The )?MIT" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift: was MIT at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
  miss "last commit was $days_since_last days ago; artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures); safe to trust"
else
  echo "artifact has $fail stale claim(s); regenerate at https://repopilot.app/r/jhy/jsoup"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

jsoup is a Java HTML/XML parser that implements the WHATWG HTML5 specification and provides a DOM API for parsing, extracting, and manipulating real-world HTML. It handles malformed tag-soup HTML gracefully, offers CSS and XPath selectors, and includes built-in sanitization against XSS attacks via safe-list filtering. The code is organized into packages under src/main/java/org/jsoup/: DOM node types (nodes/), parsing (parser/), HTTP connection handling (helper/HttpConnection.java), W3C DOM conversion (helper/W3CDom.java), the regex abstraction (helper/Regex.java, helper/Re2jRegex.java), and internal support (internal/, with StringUtil and Normalizer). Examples in src/main/java/org/jsoup/examples/ demonstrate real usage patterns.

👥Who it's for

Java developers who need to scrape web pages, extract structured data from HTML, clean user-submitted HTML content for safety, or build web automation tools. Both backend engineers (server-side scraping) and Android developers (with core library desugaring) use it.

🌱Maturity & risk

Production-ready and actively maintained. The project has been developed since 2009 (inceptionYear in pom.xml), uses Maven with multi-release JAR support for Java 8+ compatibility, and maintains CI/CD via GitHub Actions (build.yml, codeql.yml, cifuzz.yml). Currently at v1.23.1-SNAPSHOT with active dependency management (dependabot.yml).

Low risk: ownership is concentrated in one maintainer (jhy, about 65% of recent commits), but the project is well established, MIT licensed, and shows no heavy external dependencies in the pom.xml snippet. Tracking the WHATWG spec means HTML5 parsing behavior can change between releases, though spec-tracking is a feature, not a risk. Security is taken seriously: SECURITY.md is present, fuzzing runs via cifuzz.yml, and static analysis runs via codeql.yml.

Active areas of work

Version 1.23.1 is in active development (SNAPSHOT). The codebase includes recent security infrastructure (cifuzz.yml for fuzzing, codeql.yml for static analysis) and dependabot automation for dependency updates. CHANGES.md exists for changelog tracking.

🚀Get running

git clone https://github.com/jhy/jsoup.git && cd jsoup && mvn clean install

Daily commands: mvn clean package to build the JAR. Run examples via: mvn exec:java -Dexec.mainClass="org.jsoup.examples.Wikipedia" (see src/main/java/org/jsoup/examples/ for runnable demos like ListLinks.java, HtmlToPlainText.java).

🛠️How to make changes

Start in src/main/java/org/jsoup/. The nodes/ package holds the DOM node types (Node, Element, Document), helper/ holds HTTP and utility logic, and parser/ holds the parsing logic (Parser.java, Tokeniser.java). For HTML sanitization, search for usages of Safelist. For parsing behavior, read the nodes/ and parser/ packages together. Add tests under src/test/, following the layout of the existing tests there (the test tree was not included in the analyzed snippet).

🪤Traps & gotchas

Building the multi-release JAR requires JDK 11+ (per the pom.xml comments, the Java 11 overlay will not compile on JDK 8 alone). Android users must enable core library desugaring, per the README. Because regex support is abstracted (Regex.java vs. Re2jRegex.java), regex behavior can vary with the engine on the classpath, so confirm that your expectations match the selected implementation. No environment variables appear to be required for basic usage; HTTP connection behavior (timeouts, proxies) is configured through the Connection API.
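To make the classpath-dependent regex gotcha concrete, here is a minimal sketch of how a caller could check which engine a classpath-based selection would end up with. This artifact does not show how jsoup itself chooses between Regex.java and Re2jRegex.java, so the class `RegexEngineProbe` and its methods are hypothetical; the only real name used is RE2J's `com.google.re2j.Pattern`.

```java
/**
 * Hypothetical helper (not part of jsoup): reports which regex engine a
 * classpath-based selection would pick.
 */
final class RegexEngineProbe {
    // Fully qualified name of RE2J's Pattern class.
    static final String RE2J_PATTERN = "com.google.re2j.Pattern";

    private RegexEngineProbe() {}

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, RegexEngineProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Assumed selection rule: prefer RE2J when present, else the JDK engine. */
    static String selectedEngine() {
        return isOnClasspath(RE2J_PATTERN) ? "re2j" : "java.util.regex";
    }
}
```

Logging `RegexEngineProbe.selectedEngine()` at startup is one way to notice when a transitive dependency has changed which engine is available.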

💡Concepts to learn

  • WHATWG HTML5 Specification Compliance — jsoup parses HTML the same way modern browsers do, not by XML rules—this is critical for handling real-world malformed HTML and predicting DOM structure correctly
  • DOM Tree Construction Algorithm — jsoup implements the HTML5 parsing algorithm for building a valid DOM from broken HTML; understanding tokenization, tree construction phases, and adoption agency is essential for parser behavior
  • CSS Selector Parsing and Matching — jsoup's select(cssQuery) API requires parsing CSS selectors and matching them against the DOM tree; the implementation must handle pseudo-classes, combinators, and attribute selectors per CSS spec
  • Character Encoding Detection — HTTP responses often lack explicit charset declarations; jsoup's DataUtil must detect encoding from meta tags, BOM, and HTTP headers to parse HTML correctly
  • XSS Prevention via Safe-list Sanitization: cleaning user-submitted HTML is a core jsoup use case; the safe-list (allow-list) model prevents script injection by removing any tag or attribute that is not explicitly allowed
  • Multi-release JAR Files (Java 9+) — jsoup's pom.xml uses multi-release JARs to maintain Java 8 compatibility while using Java 11+ features in optional code paths—required knowledge for building and modifying the project
  • Abstract Regex Engine Swapping — Regex.java and Re2jRegex.java provide pluggable regex implementations; understanding the abstraction pattern is key to extending regex support or debugging regex-specific bugs

Related repos

  • HtmlUnit/htmlunit — Alternative Java HTML parser with JavaScript execution support; choose jsoup for lightweight parsing, HtmlUnit for browser-like behavior
  • playframework/playframework — Play Framework web framework uses jsoup internally for HTML processing in templates and form handling
  • SeleniumHQ/selenium — Selenium for browser automation often partners with jsoup for DOM parsing post-page-load and data extraction from rendered HTML
  • OpenRewrite/rewrite — Code refactoring framework that uses jsoup-like DOM parsing patterns for AST manipulation of code and configuration files
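
The character encoding detection concept listed above starts with the cheapest signal, the byte-order mark. The sketch below shows only that first step using JDK classes; `BomSniffer` is a hypothetical name, and this is not jsoup's DataUtil implementation, which also consults HTTP headers and meta tags.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical BOM sniffer; illustrates the idea, not jsoup's actual code. */
final class BomSniffer {
    private BomSniffer() {}

    /**
     * Returns the charset implied by a leading byte-order mark, or null when
     * no BOM is present. UTF-32 BOMs are deliberately left out of this sketch.
     */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null;
    }

    // Compares the first bytes of data, as unsigned values, against prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A null result means the caller has to fall back to the other signals (declared charset in the HTTP header, then meta tags, then a default).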

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for W3CDom.java helper class

The W3CDom.java class converts jsoup DOM to W3C DOM but there are likely missing edge case tests for namespace handling, attribute preservation, and node type conversions. This is critical for users relying on W3C interoperability. Reviewing src/test for W3CDom tests will reveal gaps.

  • [ ] Check src/test for existing W3CDomTest.java coverage and identify untested methods
  • [ ] Add tests for namespace-qualified elements and attributes handling
  • [ ] Add tests for CDATA nodes, comments, and document type conversion
  • [ ] Add tests for malformed HTML to W3C DOM conversion edge cases
  • [ ] Run full test suite to ensure no regressions

Implement missing unit tests for Re2jRegex.java and Regex.java pattern handling

The helper/Regex.java and helper/Re2jRegex.java classes provide regex utilities for CSS selector and HTML parsing, but likely lack comprehensive tests for edge cases like special characters, Unicode patterns, and invalid regex. This is foundational to jsoup's selector engine reliability.

  • [ ] Create or expand src/test tests for Regex.java with Unicode, escaped characters, and boundary cases
  • [ ] Add tests for Re2jRegex.java (Google RE2J integration) to verify drop-in compatibility
  • [ ] Add tests for empty patterns, null inputs, and catastrophic backtracking prevention
  • [ ] Test regex behavior differences between Java Pattern and RE2J to document constraints
  • [ ] Verify CSS selector regex edge cases work correctly
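
For the catastrophic-backtracking item in the checklist above, one common JDK-only technique is to wrap the input in a CharSequence that checks a deadline on every character read, so a runaway java.util.regex match aborts instead of hanging. The sketch below assumes that approach; `BoundedRegex` and `RegexTimeoutException` are hypothetical names, and nothing here is taken from jsoup's Regex.java.

```java
import java.util.regex.Pattern;

/** Hypothetical time-bounded matcher built on java.util.regex. */
final class BoundedRegex {
    private BoundedRegex() {}

    /** Thrown when a match exceeds its time budget. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) { super(message); }
    }

    /** CharSequence wrapper that checks the deadline on each charAt call. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // Subtraction form stays correct even if nanoTime wraps around.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new RegexTimeoutException("regex match exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return inner.toString(); }
    }

    /** Full-input match that throws RegexTimeoutException once the budget is spent. */
    static boolean matchesWithin(Pattern pattern, CharSequence input, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
    }
}
```

A test can then assert that a pathological pattern either completes or throws within the budget. Which patterns actually backtrack catastrophically depends on the JDK version, so such tests are best written against the timeout behavior rather than a specific pattern.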

Add integration tests for RequestAuthenticator.java and AuthenticationHandler.java

The authentication system (helper/AuthenticationHandler.java, helper/RequestAuthenticator.java) for HTTP connections lacks visible integration tests. These are critical for security and credential handling. Add tests covering common auth schemes and edge cases.

  • [ ] Review helper/AuthenticationHandler.java and helper/RequestAuthenticator.java for supported auth types
  • [ ] Create integration test suite covering Basic Auth, Digest Auth (if supported), and custom auth handlers
  • [ ] Add tests for credential leakage prevention (e.g., stripping auth headers on redirect)
  • [ ] Add tests for malformed credentials, empty passwords, and special characters in usernames
  • [ ] Test interaction with helper/CookieUtil.java for cookie-based session handling
  • [ ] Document any security constraints in SECURITY.md if gaps are found

🌿Good first issues

  • Add sanitization safe-list tests: src/test/java likely lacks comprehensive tests for Safelist filtering against XSS vectors. Add test cases for HTML5 event attributes, iframe srcdoc, and CSS expression attacks.
  • Improve regex engine abstraction documentation: Regex.java and Re2jRegex.java are parallel implementations but lack developer docs explaining when each is used and how they differ. Add package-info.java and inline examples.
  • Complete missing examples: src/main/java/org/jsoup/examples/ has Wikipedia.java, ListLinks.java, and HtmlToPlainText.java but lacks examples for advanced CSS selector usage, custom scraping with error handling, and HTML cleaning with a Safelist configuration.

📝Recent commits

  • 44ac231 — Bump netty to 4.2.13 (jhy)
  • 3a767de — Refactor classNames scanner and add classList (#2500) (jhy)
  • 7fd813c — Bump com.google.code.gson:gson from 2.13.2 to 2.14.0 (#2499) (dependabot[bot])
  • 477740f — Refactor source range storage (#2498) (jhy)
  • 9c5fbf7 — Bump com.github.siom79.japicmp:japicmp-maven-plugin from 0.25.4 to 0.25.6 (#2496) (dependabot[bot])
  • f190e2a — Bump github/codeql-action from 4.35.1 to 4.35.2 (#2497) (dependabot[bot])
  • ab88c1b — Changes update for cleanup (jhy)
  • d4638c8 — Add a struct for the progress callback state (jhy)
  • 7c58902 — Clean up unchecked warnings (jhy)
  • d87cddb — Suppress expected deprecation warnings (jhy)

🔒Security observations

jsoup demonstrates a solid security posture as an HTML parsing library designed with XSS prevention as a core feature. The codebase is well structured and follows security-conscious design patterns (e.g., safe-listing for HTML output). However, several areas require attention: (1) XXE protection in the W3CDom conversion needs verification, (2) SSRF prevention in HTTP connection handling should be explicit, (3) ReDoS exposure should be reduced through consistent use of the RE2J-backed regex path, and (4) the parser should defend against resource exhaustion with configurable limits. The library's security model is sound, but implementation details need hardening against advanced attack vectors.

  • Medium · Potential XXE (XML External Entity) Vulnerability in W3CDom (src/main/java/org/jsoup/helper/W3CDom.java). The W3CDom class converts jsoup documents to W3C DOM objects. XML parsers can be vulnerable to XXE attacks if not properly configured, so this file should be reviewed to confirm that any XML parser it uses disables external entity resolution and DTD processing. Fix: On any XML parser in use, set the XMLConstants.ACCESS_EXTERNAL_DTD and XMLConstants.ACCESS_EXTERNAL_SCHEMA attributes to the empty string and disallow DOCTYPE declarations.
  • Medium · Potential SSRF in HTTP Connection Handling (src/main/java/org/jsoup/helper/HttpConnection.java). The HttpConnection class handles URL fetching. If users can control the URLs being fetched and no validation is applied, it could be exposed to Server-Side Request Forgery (SSRF), potentially allowing access to internal services or file:// protocol abuse. Fix: Validate URLs to block loopback (127.0.0.0/8), private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), localhost, and non-HTTP(S) schemes. Consider an allowlist approach for URLs.
  • Medium · Regex Denial of Service (ReDoS) Potential — src/main/java/org/jsoup/helper/Regex.java, src/main/java/org/jsoup/helper/Re2jRegex.java. The codebase includes regex handling in multiple places, including Re2jRegex.java and Regex.java. While Re2j provides protection, the base Regex class may be vulnerable to ReDoS attacks if user-controlled input is used in regex patterns, particularly in CSS selectors or custom parsing logic. Fix: Ensure all regex operations, especially those processing user input or CSS selectors, use Re2j regex library consistently. Add timeout mechanisms for regex operations to prevent ReDoS attacks.
  • Medium · Missing Input Validation in Parser — src/main/java/org/jsoup/parser/Parser.java, src/main/java/org/jsoup/parser/Tokeniser.java. The Parser and Tokeniser classes handle potentially untrusted HTML/XML input. While jsoup is designed for XSS safety, ensure that all parsing paths properly handle malformed input, especially extremely large documents that could cause DoS through memory exhaustion. Fix: Implement configurable limits on document size, parsing depth, and entity expansion. Add memory monitoring during parsing to detect and prevent resource exhaustion attacks.
  • Low · No Explicit Version Pinning in Dependencies (pom.xml). The pom.xml file appears incomplete in the provided snippet, so pinning could not be confirmed. Dependency versions should be pinned explicitly (no version ranges) to ensure reproducible builds and to prevent unplanned upgrades that might break compatibility. Fix: Use exact versions for all dependencies, and audit and update them regularly to pick up security patches rather than relying on version ranges.
  • Low · Missing Security Headers in Documentation — README.md, SECURITY.md. While jsoup is a parsing library, if it's used to serve web content, users may not be aware of the need to set appropriate security headers (CSP, X-Frame-Options, etc.) in their applications. Fix: Add security best practices documentation explaining how to safely use jsoup in web applications, including the importance of security headers and Content Security Policy.
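
The XXE fix suggested in the list above can be sketched with the JDK's own JAXP classes. This is a generic hardening pattern, not a description of what W3CDom.java currently does; `HardenedXml` is a hypothetical helper, and the `disallow-doctype-decl` feature URI is the Xerces feature supported by the JDK's built-in parser (other parser implementations may reject it).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper showing a locked-down DocumentBuilderFactory. */
final class HardenedXml {
    private HardenedXml() {}

    static DocumentBuilderFactory newFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Reject any document that carries a DOCTYPE declaration.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Empty string means no protocol is allowed for external DTDs or schemas.
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory;
    }

    /** Returns true if the XML parses under the hardened settings, false if rejected. */
    static boolean parses(String xml) {
        try {
            newFactory().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With these settings a plain document still parses, while one that declares an external entity is rejected before the entity can be resolved.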

LLM-derived; treat as a starting point, not a security audit.
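
As a starting point for the SSRF item above, the following sketch shows the kind of pre-fetch URL check an application could apply before handing a user-supplied URL to an HTTP client. `FetchGuard` is a hypothetical name and is not part of jsoup's Connection API. Two known gaps: checking before fetching does not stop DNS rebinding, and IPv6 unique-local addresses (fc00::/7) are not covered by the JDK's `isSiteLocalAddress`.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;

/** Hypothetical pre-fetch guard; a sketch, not a complete SSRF defense. */
final class FetchGuard {
    private FetchGuard() {}

    /** Returns true only for http(s) URLs whose host resolves to public addresses. */
    static boolean isAllowed(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return false; // blocks file://, ftp://, jar:, and relative URLs
        }
        String host = uri.getHost();
        if (host == null) return false;
        try {
            // IP literals are parsed locally; host names trigger a DNS lookup.
            for (InetAddress address : InetAddress.getAllByName(host)) {
                if (address.isLoopbackAddress()      // 127.0.0.0/8, ::1
                        || address.isAnyLocalAddress()   // 0.0.0.0, ::
                        || address.isLinkLocalAddress()  // 169.254.0.0/16, fe80::/10
                        || address.isSiteLocalAddress()  // 10/8, 172.16/12, 192.168/16
                        || address.isMulticastAddress()) {
                    return false;
                }
            }
        } catch (UnknownHostException e) {
            return false;
        }
        return true;
    }
}
```

An allowlist of expected hosts, applied before this check, is generally stricter and easier to reason about than a blocklist of address ranges.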


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
