RepoPilot

yasserg/crawler4j

Open Source Web Crawler for Java

Healthy

Healthy across all four use cases

Use as dependency: Healthy

Permissive license, no critical CVEs, mature codebase — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • 11 active contributors
  • Distributed ownership (top contributor 40% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • ⚠ Stale — last commit 5y ago

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Markdown:
[![RepoPilot: Healthy](https://repopilot.app/api/badge/yasserg/crawler4j)](https://repopilot.app/r/yasserg/crawler4j)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/yasserg/crawler4j on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: yasserg/crawler4j

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/yasserg/crawler4j shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across all four use cases

  • 11 active contributors
  • Distributed ownership (top contributor 40% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • ⚠ Stale — last commit 5y ago

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live yasserg/crawler4j repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/yasserg/crawler4j.

What it runs against: a local clone of yasserg/crawler4j — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in yasserg/crawler4j | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 1676 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>yasserg/crawler4j</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of yasserg/crawler4j. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/yasserg/crawler4j.git
#   cd crawler4j
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of yasserg/crawler4j and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "yasserg/crawler4j(\.git)?\b" \
  && ok "origin remote is yasserg/crawler4j" \
  || miss "origin remote is not yasserg/crawler4j (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. (Apache LICENSE files begin with
#    "Apache License" / "Version 2.0", not the SPDX ID, so grep for those.)
(grep -qi "Apache License" LICENSE 2>/dev/null \
   && grep -q "Version 2.0" LICENSE 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in \
  "crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlController.java" \
  "crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/WebCrawler.java" \
  "crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java" \
  "crawler4j/src/main/java/edu/uci/ics/crawler4j/fetcher/PageFetcher.java" \
  "crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/Page.java"
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 1676 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1646d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/yasserg/crawler4j"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

crawler4j is a multi-threaded Java web crawler library that automates discovering and downloading web pages at scale. You extend a simple WebCrawler base class to define crawling logic (which URLs to visit, how to handle responses); the library takes care of HTTP fetching, HTML parsing, link extraction, URL de-duplication, and concurrent page processing across a configurable thread pool — so you can build a production crawler in minutes without managing sockets or thread coordination yourself.

It is a multi-module Gradle project: the core library lives in the crawler4j/ module, and crawler4j-examples/ contains working examples (basic, image, localdata, multiple-crawlers, shutdown, and statushandler patterns) under crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/. config/ holds Checkstyle rules and IntelliJ templates. The build targets Java 1.8 with UTF-8 encoding and uses Checkstyle for linting.

👥Who it's for

Java developers building web scraping tools, data collection pipelines, or SEO analysis platforms who need a ready-made concurrent crawler framework rather than implementing HTTP clients and threading from scratch. Also used by researchers needing to crawl websites for NLP datasets or information extraction.

🌱Maturity & risk

Mature and widely used, but no longer actively maintained — the maintenance signals show the last commit roughly five years ago. The library is published to Maven Central (version 4.5.0-SNAPSHOT in build.gradle, 4.4.0 in the README), has Travis CI configured (.travis.yml), includes comprehensive examples in crawler4j-examples/, and targets Java 1.8+ with Gradle 5.2.1. The codebase is 322KB of Java code — substantial enough to be battle-tested, though ownership is concentrated in a handful of contributors.

Risk is moderate: there are no recent commits, and the dependency chain is not fully visible in this excerpt. Any web crawler also inherits operational risks — robots.txt compliance is the user's responsibility, and misconfiguration can lead to IP blocking or legal exposure. No open issue backlog is visible in the file list, but the version bump to 4.5.0-SNAPSHOT suggests work was in progress when development stalled.

Active areas of work

Version 4.5.0-SNAPSHOT was in progress when development stalled (build.gradle carries the -SNAPSHOT suffix). The examples cover diverse patterns: basic crawling, image crawling, local data collection with CrawlStat aggregation, a multi-crawler controller, graceful shutdown, and status handler callbacks. CHANGES.txt should document recent features, though its content is not included here.

🚀Get running

git clone https://github.com/yasserg/crawler4j.git
cd crawler4j
./gradlew build
./gradlew :crawler4j-examples-base:test

Gradle wrapper is included (gradlew script). Requires Java 1.8+. See crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawlController.java for a runnable example.

Daily commands:

./gradlew :crawler4j-examples-base:test --tests "*BasicCrawlController"

Or run tests directly from crawler4j-examples-base/src/test/java/... in your IDE. Examples are in test/ (not main/) so they run as unit tests. Examine BasicCrawler.java and BasicCrawlController.java to see the minimal setup pattern.

🗺️Map of the codebase

  • crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlController.java — Core orchestration engine that manages crawler thread pools, URL frontier, and crawl execution—every contributor must understand the crawl lifecycle.
  • crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/WebCrawler.java — Abstract base class that defines the contract for implementing custom crawlers; essential for extending crawler behavior.
  • crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java — Configuration container for all crawl parameters; required reading to understand tunable behavior and defaults.
  • crawler4j/src/main/java/edu/uci/ics/crawler4j/fetcher/PageFetcher.java — HTTP client and page retrieval logic; critical for understanding content acquisition and connection management.
  • crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/Page.java — Data structure representing a crawled page; foundational model for all crawl results.
  • build.gradle — Multi-module Gradle build configuration defining dependencies and project structure; required for local builds and dependency updates.

🛠️How to make changes

Create a Custom Crawler for a New Domain

  1. Extend WebCrawler abstract class and override shouldVisit() and visit() methods (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/WebCrawler.java)
  2. In visit(), inspect the Page object (getWebURL(), getParseData(), getContentData()) and extract desired data (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/Page.java)
  3. Instantiate CrawlConfig, set politeness delays, user agent, and crawl scope (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java)
  4. Create a CrawlController, register your custom crawler class, and call start(), as sketched after this list (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlController.java)
  5. Optionally persist data to a database by storing results in visit(); see PostgresWebCrawler for a reference implementation (crawler4j-examples/crawler4j-examples-postgres/src/main/java/edu/uci/ics/crawler4j/examples/crawler/PostgresWebCrawler.java)
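
Putting steps 1–4 together, a minimal sketch. The seed URL, filter regex, and storage folder are illustrative, and the constructor and override signatures follow the 4.x API — check them against your checkout:

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private static final Pattern ASSETS = Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip|gz)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Step 1: stay on one site and skip binary assets.
        return !ASSETS.matcher(href).matches() && href.startsWith("https://example.com/");
    }

    @Override
    public void visit(Page page) {
        // Step 2: inspect the Page and extract what you need.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            logger.info("{} -> {} outgoing links", page.getWebURL().getURL(),
                    html.getOutgoingUrls().size());
        }
    }

    public static void main(String[] args) throws Exception {
        // Step 3: configure politeness, scope, and on-disk crawl state.
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");
        config.setPolitenessDelay(250);
        // Step 4: wire up the controller, seed it, and start crawler threads.
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("https://example.com/");
        controller.start(MyCrawler.class, 4); // second argument = number of crawler threads
    }
}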

Add Authentication (Basic Auth, Form Login, or NTLM)

  1. Choose an AuthInfo subclass based on your authentication type (BasicAuthInfo, FormAuthInfo, or NtAuthInfo) (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/authentication/BasicAuthInfo.java)
  2. Instantiate the chosen AuthInfo with domain and credentials (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/authentication/AuthInfo.java)
  3. Register it with CrawlConfig via addAuthInfo() before passing to CrawlController (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java)
  4. The PageFetcher will automatically inject credentials via BasicAuthHttpRequestInterceptor or an equivalent handler (crawler4j/src/main/java/edu/uci/ics/crawler4j/fetcher/BasicAuthHttpRequestInterceptor.java); a setup sketch follows this list
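
A wiring sketch for steps 1–3. Constructor parameter order on the AuthInfo subclasses is recalled from the 4.x sources, so verify it against the authentication package; the hosts and credentials are placeholders:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.BasicAuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

public class AuthSetup {
    public static CrawlConfig configWithAuth() throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");

        // HTTP Basic auth for a protected host.
        config.addAuthInfo(new BasicAuthInfo("user", "s3cret", "https://example.com/"));

        // Form login: the last two arguments must match the login form's input names.
        config.addAuthInfo(new FormAuthInfo("user", "s3cret",
                "https://example.com/login", "username", "password"));
        return config;
    }
}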

Configure and Fine-Tune Crawler Performance

  1. Tune CrawlConfig limits: setMaxPagesToFetch() caps total pages and setMaxDepthOfCrawling() caps link depth; the crawler thread count is passed to CrawlController.start(), not set on CrawlConfig (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java)
  2. Set connection timeouts via setConnectionTimeout() and setSocketTimeout() (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java)
  3. Adjust politeness delay with setPolitenessDelay() to control request rate per domain (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/CrawlConfig.java)
  4. PageFetcher uses IdleConnectionMonitorThread to clean up idle connections; monitor system resources (crawler4j/src/main/java/edu/uci/ics/crawler4j/fetcher/IdleConnectionMonitorThread.java). A tuning sketch follows this list.
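
Continuing from the MyCrawler sketch above, a tuning fragment. The values are illustrative; the point to notice is that concurrency is an argument to CrawlController.start(), not a CrawlConfig setter:

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawl");
config.setPolitenessDelay(250);        // ms between requests to the same host
config.setMaxPagesToFetch(10_000);     // overall crawl budget (-1 = unlimited)
config.setMaxDepthOfCrawling(4);       // link hops from the seeds (-1 = unlimited)
config.setConnectionTimeout(30_000);   // ms to establish a connection
config.setSocketTimeout(20_000);       // ms to wait for data on the socket

// Concurrency is chosen when the crawl starts, not on CrawlConfig:
controller.start(MyCrawler.class, 8);  // 8 crawler threads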

Handle Custom Content Types and Parsing

  1. In your WebCrawler subclass, check Page.getContentType() to identify MIME type (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/Page.java)
  2. Use Page.getParseData() and check whether it is an HtmlParseData (getHtml(), getText(), getOutgoingUrls()); handle ParseException if parsing failed (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/exceptions/ParseException.java)
  3. For non-HTML types (JSON, XML, PDFs), access raw bytes via Page.getContentData() (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/Page.java)
  4. Implement custom parsing logic in your visit() method or delegate to external libraries (crawler4j/src/main/java/edu/uci/ics/crawler4j/crawler/WebCrawler.java); see the sketch after this list
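
A sketch of a content-type-aware visit(). The HtmlParseData accessors exist in the 4.x sources; the JSON branch and logging are illustrative:

import java.nio.charset.StandardCharsets;
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.parser.ParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class ContentAwareCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        String contentType = page.getContentType();       // e.g. "text/html; charset=UTF-8"
        ParseData parsed = page.getParseData();
        if (parsed instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) parsed;
            String text = html.getText();                 // extracted visible text
            Set<WebURL> links = html.getOutgoingUrls();   // normalized outgoing links
            logger.info("HTML page: {} links, {} chars", links.size(), text.length());
        } else if (contentType != null && contentType.startsWith("application/json")) {
            // Non-HTML types: hand the raw bytes to your own parser.
            String json = new String(page.getContentData(), StandardCharsets.UTF_8);
            logger.info("JSON payload: {} bytes", json.length());
        }
    }
}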

🪤Traps & gotchas

  1. Examples live in src/test/java/, not src/main/java/ — run them as unit tests via Gradle, not as standalone apps.
  2. Depending on how this artifact was generated, the core library sources may not appear in the file list; the real implementations of WebCrawler, CrawlController, and Page live under crawler4j/src/main/java/ (or in the published Maven artifact).
  3. Each crawler example creates its own CrawlController instance; running multiple examples concurrently in the same JVM can hit port/resource conflicts.
  4. Checkstyle is strict (toolVersion 8.17); code that fails checkstyleMain/checkstyleTest will fail the ./gradlew check task.
  5. Tests set jvmArgs '-Xmx2g', so you need at least 2GB of heap available.

🏗️Architecture

💡Concepts to learn

  • Thread Pool Executor Pattern — crawler4j's core is multi-threaded fetching; understanding how it partitions URLs across a bounded thread pool and queues work is critical to tuning crawler performance and avoiding deadlocks.
  • URL Canonicalization & De-duplication — Web crawlers must avoid re-visiting the same URL; crawler4j must handle trailing slashes, query params, fragments, and relative-to-absolute URL normalization—core to crawler correctness.
  • Robots.txt Compliance & Crawl Politeness — Ethical crawling requires respecting robots.txt rules and adding delay between requests; crawler4j provides the framework but you must implement shouldVisit() logic correctly to avoid being blocked.
  • HTML DOM Parsing & Link Extraction — The visit(Page) hook receives parsed HTML; you'll need to understand CSS selectors or DOM traversal to extract links and data using HtmlParseData.
  • Distributed Crawl State & Frontier — crawler4j maintains a frontier (queue of unvisited URLs) and visited set; understanding in-memory vs. persistent storage trade-offs (seen in LocalDataCollectorCrawler) is key for scaling.
  • HTTP Content Negotiation & Character Encoding — Pages arrive with Content-Type and charset headers; mishandling can corrupt text extraction. crawler4j must decode gzip, handle charset conversions, and parse meta-charset tags.
  • Observer Pattern (Callbacks) — crawler4j uses template method pattern (shouldVisit/visit overrides) and implicit observer hooks (StatusHandler in examples); understanding callback timing is essential for correct crawler behavior.
Related projects

  • HtmlUnit/htmlunit — Likely the HTML parsing engine crawler4j uses internally (or an alternative such as jsoup); essential for understanding page parsing behavior.
  • seleniumhq/selenium — Alternative for crawling JavaScript-heavy sites (crawler4j is static HTML only); users often need both for different use cases.
  • apache/nutch — Apache's production-grade distributed web crawler; crawler4j is simpler/single-machine while Nutch scales to billions of pages.
  • javalite/javalite — Provides HTTP client and HTML parsing utilities used by some Java crawlers; overlaps with crawler4j's dependencies.
  • vert-x3/vertx-web — Reactive web framework often used alongside crawlers for serving results or APIs; complementary for building crawler output UIs.

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for crawler4j-core module

The repository has extensive examples in crawler4j-examples but lacks visible unit tests for the core crawler4j module itself. Given this is a web crawler library, core components like URL filtering, link extraction, and crawl state management need test coverage to prevent regressions.

  • [ ] Create src/test/java/edu/uci/ics/crawler4j/crawler/test/ directory structure in crawler4j module
  • [ ] Add unit tests for CrawlController initialization and thread pool management
  • [ ] Add unit tests for URL normalization and filtering logic in the WebURL class (a sketch follows this checklist)
  • [ ] Add unit tests for robots.txt parsing and compliance checking
  • [ ] Integrate tests into build.gradle with coverage reporting using JaCoCo plugin
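
As a concrete starting point for the URL normalization item, a sketch. It assumes crawler4j's URLCanonicalizer utility class and JUnit 4; the expected strings are illustrative, so pin them to the library's observed behavior before committing:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

import edu.uci.ics.crawler4j.url.URLCanonicalizer;

public class UrlCanonicalizationTest {

    @Test
    public void lowercasesHostAndDropsFragment() {
        // Expected value is illustrative — assert against observed output first.
        assertEquals("http://example.com/a",
                URLCanonicalizer.getCanonicalURL("HTTP://Example.com/a#section"));
    }

    @Test
    public void resolvesRelativeUrlsAgainstContext() {
        assertEquals("http://example.com/dir/page",
                URLCanonicalizer.getCanonicalURL("../dir/page", "http://example.com/x/y"));
    }
}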

Migrate from Travis CI to GitHub Actions workflow

The project uses .travis.yml (visible in file structure) but should migrate to GitHub Actions for better integration with GitHub. This modernizes CI/CD and removes dependency on external service.

  • [ ] Create .github/workflows/build.yml with Java 8+ matrix testing (Java 8, 11, 17)
  • [ ] Configure gradle build steps with checkstyle validation from config/checkstyle.xml
  • [ ] Add test result reporting and artifact archival for HTML test reports
  • [ ] Set up Maven Central deployment workflow triggered on release tags
  • [ ] Remove .travis.yml after validating new workflow works

Add integration tests for PostgreSQL example with testcontainers

The crawler4j-examples-postgres module exists but appears to lack automated integration tests. Using testcontainers library provides isolated, reproducible PostgreSQL testing without manual setup.

  • [ ] Add testcontainers dependency to crawler4j-examples-postgres/build.gradle
  • [ ] Create src/test/java/edu/uci/ics/crawler4j/examples/PostgresIntegrationTest.java (a wiring sketch follows this checklist)
  • [ ] Implement test cases for PostgresDBService.java using embedded PostgreSQL
  • [ ] Add tests for PostgresWebCrawler data persistence and retrieval
  • [ ] Document test execution in crawler4j-examples-postgres/README.md with instructions
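
A wiring sketch for the checklist above, using JUnit 4 to match the existing test style. The PostgresDBServiceImpl constructor mentioned in the comments is hypothetical — adapt it to the module's actual API:

import org.junit.Assert;
import org.junit.ClassRule;
import org.junit.Test;
import org.testcontainers.containers.PostgreSQLContainer;

public class PostgresIntegrationTest {

    // Starts one disposable PostgreSQL instance for the whole test class.
    @ClassRule
    public static final PostgreSQLContainer<?> POSTGRES =
            new PostgreSQLContainer<>("postgres:15-alpine");

    @Test
    public void containerIsReachable() {
        // Wire these into the example's DB service instead of hardcoded config:
        String jdbcUrl = POSTGRES.getJdbcUrl();
        String user = POSTGRES.getUsername();
        String password = POSTGRES.getPassword();
        // e.g. new PostgresDBServiceImpl(jdbcUrl, user, password)  // hypothetical signature
        // ...then run a short crawl against a local fixture and assert rows were inserted.
        Assert.assertNotNull(jdbcUrl);
    }
}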

🌿Good first issues

  • Add integration test for robots.txt compliance: The examples lack a test verifying that crawlers respect robots.txt (shouldVisit() currently only filters by URL pattern). Create crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/robotstxt/ with a RobotsCrawler.java that demonstrates fetching and parsing robots.txt, blocking disallowed paths.
  • Add pagination example: No example shows crawling multi-page results (e.g., extracting links from paginated search results). Add crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/pagination/ demonstrating how to detect next-page links, extract data from each page, and stop at a limit.
  • Document CrawlController configuration options in Javadoc or README: BasicCrawlController.java instantiates CrawlController but the constructor parameters and setters are not visible in the examples. Add inline documentation or a new config-example showing all available options (timeout, politeness delay, user-agent, proxy, etc.) and their effects.

Top contributors

  • @pgalbraith — 40 commits
  • @s17t — 25 commits
  • @yasserg — 22 commits
  • Federico Tolomei — 4 commits
  • @valfirst — 3 commits

📝Recent commits

  • 68f5c1e — Merge pull request #454 from teeteejo/patch-1 (yasserg)
  • 45d79d3 — Fixed typo (teeteejo)
  • f81fcc2 — Merge pull request #423 from valfirst/patch-1 (yasserg)
  • af1f9b4 — Switch Travis builds from oraclejdk8 to openjdk8 (valfirst)
  • d3c1157 — Merge pull request #418 from djp3/patch-1 (yasserg)
  • 42407db — Update the maximum size of a robots.txt file (djp3)
  • 4fcddc8 — Delete Donwloader example (yasserg)
  • c30dd41 — Update README for usage of factory (yasserg)
  • 13afe94 — Use factory in multiple-crawlers example (yasserg)
  • acee0e7 — Update README.md (yasserg)

🔒Security observations

The crawler4j project has moderate security concerns, primarily related to outdated tooling and dependencies. The most critical issues are the outdated Gradle (5.2.1) and Checkstyle (8.17) versions.

  • High · Outdated Gradle Wrapper Version — build.gradle (wrapper configuration). The project uses Gradle 5.2.1 (released in 2019), which is significantly outdated and may contain known security vulnerabilities. Modern Gradle versions include important security patches and dependency management improvements. Fix: Update to the latest stable Gradle version (7.x or 8.x). Run 'gradle wrapper --gradle-version=<latest-version>' to update.
  • High · Outdated Checkstyle Version — build.gradle (checkstyle toolVersion). Checkstyle 8.17 (released in 2019) is outdated and may have unpatched vulnerabilities. Current versions are significantly newer with security improvements. Fix: Update checkstyle to version 10.x or later: change 'toolVersion = "8.17"' to a current version like '10.12.2'.
  • Medium · Missing Dependency Version Constraints — crawler4j/build.gradle and subproject build.gradle files. The build.gradle file does not show explicit version constraints for project dependencies. The provided snippet does not show the actual dependencies block, which is a concern for reproducible builds and security. Transitive dependencies of crawler4j (like HTTP clients) could introduce vulnerabilities. Fix: Explicitly specify versions for all dependencies, use dependency locking (gradle.lockfile), and regularly audit transitive dependencies using 'gradle dependencies' or OWASP Dependency-Check.
  • Medium · No Security Scanning in CI/CD — .travis.yml (Travis CI configuration). The .travis.yml configuration file is present but content not shown. Without visibility into CI/CD setup, it's unclear if security scanning (SAST, dependency vulnerability checks, or SBOM generation) is performed on commits. Fix: Integrate security scanning tools in CI/CD pipeline: OWASP Dependency-Check, Sonarqube, or GitHub/GitLab security features. Add automatic blocking of vulnerable dependencies.
  • Medium · Potential SQL Injection Risk in PostgreSQL Example — crawler4j-examples-postgres/src/main/java/edu/uci/ics/crawler4j/examples/db/impl/PostgresDBServiceImpl.java. The codebase includes a PostgreSQL example module (crawler4j-examples-postgres) with database integration. Without reviewing the actual SQL code in PostgresDBServiceImpl.java, there's a risk of SQL injection if queries are built using string concatenation rather than parameterized queries. Fix: Ensure all database queries use parameterized statements/prepared statements. Never concatenate user input directly into SQL queries. Use ORM frameworks like Hibernate or JDBI if possible. A before/after sketch follows this list.
  • Medium · Unsanitized Handling of Crawled Content — crawler4j core library. As a library that ingests untrusted web responses, there is no evidence of validation or sanitization being applied to crawled content before it reaches user code. Fix: Validate and sanitize crawled content; guard against XSS, CSRF, and other web-based attacks when processing or re-serving extracted HTML.
  • Low · Java 8 Target Compatibility — build.gradle (sourceCompatibility = 1.8). The project targets Java 8 (sourceCompatibility/targetCompatibility = 1.8), which reached end-of-life in March 2022. While Java 8 is still widely used, newer versions provide security enhancements and performance improvements. Fix: Consider migrating to Java 11 LTS or Java 17 LTS as minimum target versions. Update sourceCompatibility and targetCompatibility accordingly.
  • Low · Disabled Checkstyle Report Output — build.gradle (Checkstyle configuration). Checkstyle HTML and XML reports are explicitly disabled (enabled false). While this may be intentional, it reduces code quality visibility and traceability. Fix: Enable HTML or XML reports for checkstyle to maintain visibility into code quality issues and compliance history.
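
For the SQL injection item above, the standard remediation pattern in plain JDBC. This is generic, not code lifted from PostgresDBServiceImpl (which was not inspected); the pages table and column names are assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class PageStore {
    // Unsafe (illustrative): concatenating crawled data into SQL invites injection.
    //   stmt.executeUpdate("INSERT INTO pages (url, title) VALUES ('" + url + "', '" + title + "')");

    // Safe: a prepared statement keeps crawled data out of the SQL grammar.
    static void insertPage(Connection connection, String url, String title) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO pages (url, title) VALUES (?, ?)")) {
            ps.setString(1, url);
            ps.setString(2, title);
            ps.executeUpdate();
        }
    }
}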

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
