andeya/pholcus
Pholcus is a distributed, high-concurrency crawler written in pure Go
Healthy across the board
- Permissive license, no critical CVEs, actively maintained — safe to depend on. (weakest axis)
- Has a license, tests, and CI — clean foundation to fork and modify.
- Documented and popular — useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 2mo ago
- ✓17 active contributors
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
- ⚠Concentrated ownership — top contributor handles 74% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
The badge links to https://repopilot.app/r/andeya/pholcus and goes at the top of your README.md — renders inline like a shields.io badge.
Social card preview (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/andeya/pholcus on X, Slack, or LinkedIn.
Onboarding: andeya/pholcus
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/andeya/pholcus shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 2mo ago
- 17 active contributors
- Apache-2.0 licensed
- CI configured
- Tests present
- ⚠ Concentrated ownership — top contributor handles 74% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live andeya/pholcus
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/andeya/pholcus.
What it runs against: a local clone of andeya/pholcus — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in andeya/pholcus | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 95 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of andeya/pholcus. If you don't
# have one yet, run these first:
#
# git clone https://github.com/andeya/pholcus.git
# cd pholcus
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of andeya/pholcus and re-run."
  exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "andeya/pholcus(\.git)?\b" \
  && ok "origin remote is andeya/pholcus" \
  || miss "origin remote is not andeya/pholcus (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
# (accept either the SPDX identifier or the standard Apache license header)
(grep -qiE "Apache License|Apache-2\.0" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
for f in \
  "app/app.go" \
  "app/crawler/crawler.go" \
  "app/distribute/master_api.go" \
  "app/downloader/downloader.go" \
  "app/pipeline/pipeline.go"; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 95 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~65d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/andeya/pholcus"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Pholcus is a distributed, high-concurrency web crawler framework written in pure Go that supports single-machine, server, and client modes. It provides three download engines (Surf for HTTP, PhantomJS, and headless Chrome), intelligent proxy rotation, cookie management, and persistent failure/success tracking—enabling large-scale web scraping with automatic deduplication and retry logic.

Monolithic package structure: app/ contains the core engine, split into crawler/ (concurrency pool & spider queue), downloader/ (three engines + request handling), distribute/ (master/slave APIs + custom teleport protocol), and aid/ (proxy rotation, failure/success history), with app.go orchestrating everything. Test files colocate with implementation.
👥Who it's for
Go developers and DevOps engineers who need to build scalable web crawling infrastructure; data engineers needing distributed task collection across multiple machines; researchers or companies performing legitimate large-scale web data extraction who want Chinese-language documentation and a complete integrated framework rather than building from scratch.
🌱Maturity & risk
The project shows moderate maturity: ~6.3MB of Go code with comprehensive test coverage (all major modules have *_test.go files), three execution modes (single/server/client), and multiple output adapters (MySQL, MongoDB, Kafka, CSV). However, recent commit history and active issue resolution are not visible from the provided data; the Chinese-language focus and niche distributed crawler market suggest a specialized rather than universally-adopted project.
Dependencies on external services (Kafka, MongoDB, MySQL, Beanstalkd, PhantomJS/Chrome) create operational complexity; the custom bidirectional Socket framework under app/distribute/teleport/ requires careful protocol maintenance; no visible CI/CD pipeline metadata. Single-maintainer projects in the crawler space risk legal exposure due to terms-of-service violations by end users, and the framework's ease-of-use amplifies that liability.
Active areas of work
Unable to determine from provided data—no git log, CI status, or milestone information is available. The file structure suggests a mature codebase, but activity level is not visible.
🚀Get running
```bash
git clone https://github.com/andeya/pholcus.git
cd pholcus
go mod download
go test ./...
```
Daily commands:
Not fully visible from the provided files. Likely go run main.go (entry point not shown) or go build to produce a binary. The web UI is accessed via a browser after startup; GUI mode on Windows uses lxn/walk; CLI mode is driven by command flags. Requires Go 1.24+.
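For orientation, a minimal entry point in the style the upstream README has historically documented is sketched below. It is not confirmed by this artifact: the exec.DefaultRun call, the import path, and the "web" mode string are all assumptions to verify against the repo's own README before use.

```go
package main

import (
	// Assumed import path for the bootstrap package; confirm it in the repo.
	"github.com/andeya/pholcus/exec"
	// Spider rule libraries are usually pulled in via blank imports, e.g.:
	// _ "github.com/andeya/pholcus_lib"
)

func main() {
	// "web" starts the browser UI; "gui" (Windows only) and "cmd" are the
	// other modes the project has documented. Verify accepted values first.
	exec.DefaultRun("web")
}
```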
🗺️Map of the codebase
- app/app.go — Main application orchestrator that coordinates the crawler, scheduler, pipeline, and distribution subsystems; all contributions must understand the entry point.
- app/crawler/crawler.go — Core crawler engine managing concurrency, spider scheduling, and request/response cycles; fundamental to the distributed crawling model.
- app/distribute/master_api.go — Master node API for distributed coordination; essential for understanding task distribution and slave node communication.
- app/downloader/downloader.go — HTTP request execution layer supporting multiple backends (surfer, chrome); critical for all network operations.
- app/pipeline/pipeline.go — Data pipeline orchestrator handling results collection and multi-output dispatch (CSV, MySQL, Kafka, etc.).
- app/scheduler/scheduler.go — Request scheduling logic determining spider execution order and concurrency control across the crawler pool.
- app/distribute/teleport/teleport.go — Custom RPC protocol for master–slave communication in distributed deployments; critical for distribution reliability.
🛠️How to make changes
Add a new output plugin
- Create a new file in app/pipeline/collector/ named output_<backend>.go, following the pattern of output_csv.go (app/pipeline/collector/output_csv.go)
- Implement the OutputInterface methods (Name(), Outputting(), Close()) in your new plugin (app/pipeline/collector/collector.go)
- Register the plugin in the collectors map within the Collector initialization logic (app/pipeline/collector/collector.go); a hedged sketch follows below
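A minimal sketch of what such a plugin could look like, assuming the method set the checklist names (Name/Outputting/Close). The interface is re-declared locally for illustration only; the authoritative types and the registration map live in app/pipeline/collector/collector.go and may differ, so treat every identifier here as hypothetical.

```go
package collector

import (
	"encoding/json"
	"os"
)

// Hypothetical interface mirroring the method set named in the checklist above;
// the real definition in collector.go may differ.
type OutputInterface interface {
	Name() string
	Outputting(records []map[string]interface{}) error
	Close() error
}

// jsonLinesOutput is a hypothetical backend that appends one JSON object per record.
type jsonLinesOutput struct {
	f *os.File
}

func (o *jsonLinesOutput) Name() string { return "jsonl" }

func (o *jsonLinesOutput) Outputting(records []map[string]interface{}) error {
	enc := json.NewEncoder(o.f)
	for _, r := range records {
		if err := enc.Encode(r); err != nil {
			return err
		}
	}
	return nil
}

func (o *jsonLinesOutput) Close() error { return o.f.Close() }
```

Registration is then a one-line addition wherever the collectors map is initialized in collector.go; read that initialization before assuming a map literal.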
Add a new HTTP downloader backend
- Create a new surfer backend file in app/downloader/surfer/ (e.g., surfer_mybackend.go) implementing the Surfer interface (app/downloader/surfer/surfer.go)
- Implement the required methods (Do, Download, etc.) following the pattern in surf.go (app/downloader/surfer/surf.go)
- Register the backend in the downloader selection logic in app/downloader/downloader.go; see the sketch below for the rough shape
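Under the same caveat, a skeleton backend might look like this; the Surfer interface is re-declared here with an assumed signature (the real one in app/downloader/surfer/surfer.go likely takes the project's own Request type rather than *http.Request).

```go
package surfer

import "net/http"

// Hypothetical shape of the Surfer contract; the authoritative definition is
// in app/downloader/surfer/surfer.go and may differ.
type Surfer interface {
	Download(req *http.Request) (*http.Response, error)
}

// myBackend is a hypothetical engine that simply delegates to net/http;
// a real backend (e.g. a new headless browser) would do its own fetching.
type myBackend struct {
	client *http.Client
}

func NewMyBackend() Surfer {
	return &myBackend{client: &http.Client{}}
}

func (b *myBackend) Download(req *http.Request) (*http.Response, error) {
	return b.client.Do(req)
}
```

Wiring the new engine into the selection logic in app/downloader/downloader.go depends on how that switch is written today, so read it before adding a case.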
Extend spider capabilities with custom parsing
- Create a spider implementation file in app/spider/ following the common pattern (app/spider/common/common.go)
- Implement the spider interface methods (Name, Requests, Parse, etc.) (app/spider/common/common.go)
- Register the spider in the scheduler to make it available to the crawler (app/scheduler/scheduler.go); a hedged sketch follows
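The sketch below follows the interface-style shape this checklist describes, but note that pholcus spiders have historically been defined declaratively (a Spider struct holding a rule tree of parse callbacks) rather than via an interface, so check app/spider/ before committing to either shape. All identifiers here are hypothetical.

```go
package spider

// Hypothetical method set matching the checklist above. Confirm the actual
// contract in app/spider/ before implementing anything.
type Spider interface {
	Name() string
	Requests() []string // seed URLs
	Parse(body []byte) (records []map[string]interface{}, followups []string)
}

// quotesSpider is a purely illustrative implementation.
type quotesSpider struct{}

func (quotesSpider) Name() string { return "quotes" }

func (quotesSpider) Requests() []string {
	return []string{"https://example.com/quotes"}
}

func (quotesSpider) Parse(body []byte) ([]map[string]interface{}, []string) {
	// A real spider would extract fields here and return newly discovered URLs.
	return []map[string]interface{}{{"raw_len": len(body)}}, nil
}
```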
Add custom RPC endpoints for distributed mode
- Define new message types in app/distribute/teleport/protocol.go (app/distribute/teleport/protocol.go)
- Implement handler methods in app/distribute/master_api.go (master-side) or app/distribute/slave_api.go (slave-side) (app/distribute/master_api.go)
- Register the handler in the teleport server initialization (app/distribute/teleport/server.go); a rough illustration follows
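The teleport API is custom and not documented in this artifact, so the following only illustrates the general pattern the checklist implies: a typed message plus a handler that the server registers at startup. Every name is invented and must be mapped onto the real types in app/distribute/teleport/.

```go
package distribute

// statsRequest is a hypothetical message a master could broadcast to slaves.
// Real teleport messages are defined in app/distribute/teleport/protocol.go.
type statsRequest struct {
	Since int64 `json:"since"` // unix timestamp
}

// statsReply is the hypothetical slave-side answer.
type statsReply struct {
	Crawled int `json:"crawled"`
	Failed  int `json:"failed"`
}

// handleStats is what a slave-side handler might look like; in the real code
// it would be registered during teleport server initialization
// (app/distribute/teleport/server.go) under a routing key such as "stats".
func handleStats(req statsRequest) statsReply {
	// Gather counters from local crawler state and return them.
	_ = req.Since
	return statsReply{}
}
```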
🔧Why these technologies
- Go (pure language choice) — Enables lightweight goroutines for high-concurrency crawling without heavyweight threading overhead; native concurrent primitives (channels, mutexes) simplify distributed coordination
- Custom RPC protocol (teleport) — Avoids external service dependencies (gRPC, JSON-RPC libraries); minimal serialization overhead for master–slave task distribution; tight control over protocol version compatibility
- Multiple downloader backends (HTTP, Chrome, PhantomJS) — Handles diverse web content: static HTML via standard HTTP, JavaScript-rendered pages via headless browsers, graceful fallback strategy
- Modular output plugins (CSV, MySQL, Kafka, Beanstalkd, MongoDB) — Supports flexible result persistence; different deployment scenarios (batch processing, real-time streaming, data warehousing) without core changes
- Proxy pool abstraction (app/aid/proxy) — Rotates requests across proxies to avoid IP bans; supports distributed proxy sources for large-scale crawling
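To make the proxy-pool idea concrete, here is a small generic round-robin pool in plain Go. It is not the app/aid/proxy implementation, just an illustration of the thread-safe rotation the last bullet describes.

```go
package proxypool

import "sync"

// Pool hands out proxy URLs in round-robin order; safe for concurrent use.
// Generic illustration only, not the app/aid/proxy implementation.
type Pool struct {
	mu      sync.Mutex
	proxies []string
	next    int
}

func New(proxies []string) *Pool {
	return &Pool{proxies: proxies}
}

// Next returns the next proxy URL, or "" if the pool is empty.
func (p *Pool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.proxies) == 0 {
		return ""
	}
	proxy := p.proxies[p.next%len(p.proxies)]
	p.next++
	return proxy
}

// Remove drops a proxy that has started failing, so rotation skips it.
func (p *Pool) Remove(bad string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i, pr := range p.proxies {
		if pr == bad {
			p.proxies = append(p.proxies[:i], p.proxies[i+1:]...)
			return
		}
	}
}
```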
⚖️Trade-offs already made
- Custom RPC protocol instead of gRPC/protobuf
  - Why: Avoid external serialization library dependencies and reduce binary size; prioritize simplicity and control
  - Consequence: Lower type safety and discoverability compared to protobuf; requires manual protocol evolution management; harder debugging without protocol inspection tools
- Synchronous request–parse–output pipeline
  - Why: Simpler model for single-threaded spider logic; easier backpressure handling via channel buffering
  - Consequence: Cannot easily parallelize parsing across multiple cores for CPU-bound spider code; output I/O blocks crawler scheduling
- Master–slave architecture with single master
  - Why: Centralized task distribution simplifies state management and consistency; no distributed consensus complexity
  - Consequence: Master is a single point of failure in production; does not scale to extreme slave counts; master bottleneck under very high task throughput
- In-process scheduler matrix
  - Why: No external scheduler service required; low-latency local request prioritization
  - Consequence: Cannot share schedule state across multiple instances; scheduler becomes part of the tight crawler loop (cannot be updated live)
🚫Non-goals (don't propose these)
- Does not provide browser automation beyond headless Chrome/PhantomJS integration—no session management or interactive form filling
- Not a distributed file system or cluster management tool—assumes operator manages master/slave provisioning and networking
- Does not include built-in caching layer (Redis, memcached)—all state kept in-process or written to external storage
- Not a visual UI framework—CLI and GUI (walk) are minimal; not suitable for complex multi-user concurrent crawler management
- Does not handle CAPTCHA solving or anti-bot evasion beyond proxy rotation and basic User-Agent spoofing
🪤Traps & gotchas
- Protocol versioning: The custom teleport protocol has no visible version negotiation — rolling updates between master and slave nodes risk silent failures.
- Chrome/PhantomJS: Requires a headless browser binary on PATH; packaging differs per OS (apt, brew, Windows MSI); a missing binary causes silent fallback to Surf without warning.
- Goroutine leaks: CrawlerPool and SpiderQueue use channels; improper context cancellation can hang goroutines — watch for unbuffered channel sends/receives.
- Proxy rotation: aid/proxy/proxy.go must maintain thread-safe state; concurrency bugs here corrupt request routing.
- Config file: References gopkg.in/ini.v1 but no example .ini file appears in the listing — required config keys are undocumented.
- Otto JS sandbox: JavaScript rule execution via Otto is single-threaded; CPU-intensive JS blocks the Go goroutine.
🏗️Architecture
💡Concepts to learn
- Worker Pool Pattern — CrawlerPool in app/crawler/crawlerpool.go uses bounded goroutine pools to control concurrency; understanding pool sizing, work stealing, and graceful shutdown is essential for tuning crawler performance and preventing resource exhaustion (a minimal generic sketch follows this list)
- Custom Protocol & Serialization — The bidirectional Socket protocol in app/distribute/teleport/ is Pholcus's unique distribution backbone; mastering netdata marshaling and protocol versioning is required to extend master–slave communication or debug distributed crawls
- Request Deduplication & Bloom Filters — Large-scale crawlers must avoid re-fetching URLs; Pholcus likely uses hashing or bloom filters in the history module (app/aid/history/) to detect duplicates — critical for efficiency on billion-URL datasets
- Distributed Task Scheduling — The master node in app/distribute/master_api.go must fairly distribute work to multiple slave nodes while handling failures and uneven load; this is a classic distributed-systems challenge requiring load-aware task assignment
- Headless Browser Automation (CDP) — The Chrome downloader uses the Chrome DevTools Protocol via chromedp to execute JavaScript before scraping; understanding session management, page lifecycle, and network interception is needed for JS-heavy sites
- Proxy Rotation & Rate Limiting — app/aid/proxy/proxy.go manages IP pools and frequency-based switching to avoid detection; token bucket or sliding window algorithms are likely used to throttle requests per proxy
- Goroutine Lifecycle & Channel Coordination — Go concurrency primitives (channels, context, WaitGroup) are used throughout for worker coordination in crawlerpool.go and distribution; improper cleanup causes goroutine leaks and hangs on shutdown
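A generic illustration of the bounded worker pool plus graceful-shutdown pattern that the first and last bullets describe, using only the standard library. It is not pholcus's CrawlerPool; none of these names are expected to exist in app/crawler/.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// crawl simulates fetching one URL; the real work would be a downloader call.
func crawl(ctx context.Context, url string) {
	select {
	case <-time.After(100 * time.Millisecond): // pretend network latency
		fmt.Println("fetched", url)
	case <-ctx.Done(): // shutdown requested mid-fetch
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	urls := make(chan string)
	var wg sync.WaitGroup

	// Bounded pool: at most 4 goroutines touch the network at once.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls { // exits cleanly when urls is closed
				crawl(ctx, u)
			}
		}()
	}

	for i := 0; i < 20; i++ {
		select {
		case urls <- fmt.Sprintf("https://example.com/page/%d", i):
		case <-ctx.Done(): // stop feeding work on shutdown
		}
	}
	close(urls) // lets workers drain and return, so no goroutine leak
	wg.Wait()
}
```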
🔗Related repos
- colly/colly — Popular Go web scraping framework with a simpler API and better docs; directly competes on single-machine crawling but lacks Pholcus's distributed architecture
- gocolly/colly — Official Colly organization fork; same as above, the industry-standard alternative in the Go ecosystem
- qwuaker/zhonghua — Chinese web crawler framework (also Go); similar target audience but a smaller feature set; a reference for Chinese-docs patterns
- chromedp/chromedp — Pholcus depends on this for headless Chrome automation; understanding the CDP protocol and session management is prerequisite knowledge
- robertkrimen/otto — JavaScript VM that Pholcus uses for hot-loaded dynamic rules; needed to extend rule execution or debug JS-based scrapers
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive integration tests for distributed crawler coordination
The distribute package has an integration_test.go file but lacks thorough coverage of master-slave communication. Given this is a distributed crawler framework, testing task distribution, synchronization, and failure handling between master and slave nodes is critical. The teleport protocol has multiple test files but the high-level task coordination logic needs better coverage to catch race conditions and network failures.
- [ ] Expand app/distribute/integration_test.go to test master_api.go and slave_api.go task distribution workflows
- [ ] Add tests in app/distribute/taskjar_test.go for concurrent task queue access patterns
- [ ] Create test scenarios in app/distribute/task_test.go covering task retry logic and state transitions across distributed nodes
Add missing unit tests for downloader request parameter validation
The downloader/surfer/param.go file has a corresponding param_test.go but it appears minimal. The surfer package handles critical HTTP client behavior with multiple implementations (Chrome, Phantom). Parameter validation for request headers, proxies, timeouts, and user agents needs thorough testing to prevent silent failures in production crawling.
- [ ] Expand app/downloader/surfer/param_test.go to test edge cases: missing headers, invalid proxies, timeout bounds
- [ ] Add validation tests in app/downloader/request/request_test.go for malformed URLs and conflicting parameters
- [ ] Create tests in app/downloader/surfer/agent/agent_test.go covering platform-specific user agent generation across Linux, Windows, BSD, and ARM
Add missing test coverage for proxy pool and failure history tracking
The aid/proxy and aid/history packages handle critical reliability features (proxy rotation, failure recovery) but have incomplete test coverage. Testing concurrent proxy availability checks, history persistence across restarts, and the interaction between failure tracking and proxy selection is essential for a production crawler framework.
- [ ] Expand app/aid/proxy/proxy_test.go to test concurrent access patterns and proxy availability verification under load
- [ ] Add tests in app/aid/history/history_test.go for persistence layer (success/failure log synchronization and replay after restarts)
- [ ] Create integration tests covering failure recovery: mark proxies as failed, verify they're excluded, test re-enablement logic in proxy rotation
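If you pick up the proxy-pool idea, a concurrency test typically has the stdlib-only shape below. The pool type here is a stand-in; map it onto whatever app/aid/proxy/proxy.go actually exports before writing the real test, and run it with the race detector.

```go
package proxy_test

import (
	"sync"
	"testing"
)

// fakePool stands in for the real proxy pool type; replace it with the
// exported type from app/aid/proxy before using this skeleton.
type fakePool struct {
	mu   sync.Mutex
	used map[string]int
}

func (p *fakePool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.used["http://127.0.0.1:8080"]++
	return "http://127.0.0.1:8080"
}

func TestConcurrentNext(t *testing.T) {
	p := &fakePool{used: map[string]int{}}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // hammer the pool from many goroutines
		wg.Add(1)
		go func() {
			defer wg.Done()
			if got := p.Next(); got == "" {
				t.Error("Next returned an empty proxy")
			}
		}()
	}
	wg.Wait()
	// Run with `go test -race ./...` so the race detector checks the locking.
}
```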
🌿Good first issues
- Add an integration test for Chrome downloader fallback behavior when the binary is missing (app/downloader/downloader_test.go only tests Surf); currently no test verifies graceful degradation or error messaging
- Write missing benchmarks for CrawlerPool concurrency scaling (app/crawler/crawlerpool_test.go exists but has no Benchmark* functions); critical for performance-tuning the goroutine pool size; a generic benchmark skeleton follows below
- Document the custom teleport protocol wire format with a .md file in app/distribute/teleport/; currently it is only code-readable, and new contributors must reverse-engineer netdata.go and protocol.go to understand message serialization
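For the benchmark item, the conventional Go shape is shown below. The inlined channel-based pool is a placeholder for CrawlerPool; substitute the real constructor and scheduling calls from app/crawler/crawlerpool.go.

```go
package crawler_test

import (
	"fmt"
	"sync"
	"testing"
)

// BenchmarkPoolThroughput measures how fast a bounded pool drains tasks at
// different worker counts. The inlined pool is a stand-in for CrawlerPool.
func BenchmarkPoolThroughput(b *testing.B) {
	for _, size := range []int{1, 4, 16, 64} {
		b.Run(fmt.Sprintf("workers=%d", size), func(b *testing.B) {
			tasks := make(chan int)
			var wg sync.WaitGroup
			for w := 0; w < size; w++ {
				wg.Add(1)
				go func() {
					defer wg.Done()
					for range tasks {
						// no-op work; a real benchmark would run a spider request
					}
				}()
			}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				tasks <- i
			}
			close(tasks)
			wg.Wait()
		})
	}
}
```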
⭐Top contributors
- @andeya — 74 commits
- @liguoqinjim — 6 commits
- @xianyunyh — 2 commits
- @zlh — 2 commits
- @zerozh — 2 commits
📝Recent commits
- 91a5608 — refactor: translate all Chinese comments and logs to English (andeya)
- cd403a7 — test: clean up dead code, improve comments, and achieve ≥80% test coverage (andeya)
- a4c31a7 — feat: add Chrome headless browser downloader and fix Baidu search spiders (andeya)
- e2aac49 — refactor: overhaul config system, normalize naming conventions, and refresh README (andeya)
- fb174d9 — fix: repair broken static spider rules (andeya)
- 9b8d19f — refactor: deep gust adoption and JS-friendly Context API (andeya)
- 3d1faf8 — refactor: translate comments to English, use strings.ReplaceAll, and replace bindata with go:embed (andeya)
- 85ac561 — refactor: consolidate rules into sample/ and restructure project layout (andeya)
- 04a4f95 — refactor: rename simple/ to sample/ for clarity (andeya)
- b00ce6f — chore: sync go.mod versions after go work sync (andeya)
🔒Security observations
- Critical · Outdated and Vulnerable Dependencies — go.mod dependencies: golang.org/x/net, golang.org/x/crypto, golang.org/x/sys. Multiple dependencies have known security vulnerabilities and are severely outdated: golang.org/x/net (from 2019), golang.org/x/crypto (from 2019), and golang.org/x/sys. These legacy versions contain publicly disclosed CVEs affecting web protocols, cryptographic operations, and system calls. Fix: Update all dependencies to their latest stable versions. Specifically: update the golang.org/x modules to current versions (likely v0.25.0+), audit and update github.com/Shopify/sarama (v1.23.1 is from 2019), and bump github.com/chromedp/chromedp to latest.
- Critical · Insecure Database Driver without TLS — go.mod dependencies: github.com/go-sql-driver/mysql v1.4.1, gopkg.in/mgo.v2. github.com/go-sql-driver/mysql v1.4.1 (2018) is outdated and may not enforce TLS connections by default. The MongoDB driver gopkg.in/mgo.v2 is deprecated and unmaintained since 2018, with known authentication and connection security issues. Fix: Upgrade to github.com/go-sql-driver/mysql v1.8.0+ and replace gopkg.in/mgo.v2 with the official mongo-go-driver (go.mongodb.org/mongo-driver), which is actively maintained and secure.
- High · Unsafe JavaScript Execution Engine — go.mod dependency: github.com/robertkrimen/otto v0.0.0-20180617131154-15f95af6e78d. otto (2017) is an abandoned JavaScript interpreter used for dynamic script execution, with no security updates since 2018. It can be exploited for arbitrary code execution if untrusted spider scripts are processed. Fix: Consider removing JavaScript execution capabilities or switching to a sandboxed alternative like goja. If JavaScript execution is essential, validate and sanitize all spider scripts and run them in isolated environments with resource limits.
- High · Unvalidated External HTTP Requests — app/downloader/surfer/ request-handling modules. The crawler framework downloads content from arbitrary URLs via the downloader/surfer modules without apparent validation. Combined with the outdated golang.org/x/net, this creates SSRF vulnerability risks and exposure to malicious content. Fix: Implement URL validation (whitelist/blacklist), disable requests to private IP ranges (127.0.0.1, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), add request timeouts, and validate Content-Type headers before processing. (A minimal private-range check is sketched below.)
- High · Deprecated and Unmaintained Dependencies — go.mod dependencies: github.com/lxn/walk, github.com/lxn/win, gopkg.in/ini.v1. github.com/lxn/walk and github.com/lxn/win are Windows GUI libraries last updated in 2019, and gopkg.in/ini.v1 requires an audit for configuration-file injection. These increase the supply-chain attack surface. Fix: Evaluate whether GUI components are needed in production deployments, update gopkg.in/ini.v1 to the latest version, and consider removing Windows-specific GUI dependencies from server builds or auditing configuration-file handling for injection vulnerabilities.
- High · Insecure Message Queue Integration — app/distribute/ and app/pipeline/collector/output_beanstalkd.go. The Kafka consumer (github.com/Shopify/sarama v1.23.1 from 2019) and Beanstalk queue (github.com/kr/beanstalk) may lack authentication validation and encryption, so the distributed system could expose task data in transit. Fix: Update Sarama to v1.38.0+ with TLS/SASL configuration enforced, verify Beanstalk connections use authentication, encrypt all inter-node communication, and audit taskjar.go for message validation.
- Medium · Potential SQL Injection in Database Layers — location not identified by the analysis. Multiple database-related files (output_*.go collectors) suggest dynamic query construction. Fix: not specified by the analysis; audit the output collectors for string-built SQL and prefer parameterized queries.
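To make the SSRF item concrete, a minimal pre-flight check in plain Go is sketched below: it resolves the URL's host and rejects loopback, private, and link-local addresses. This is an illustration, not existing code in app/downloader/, and production use should also pin the resolved IP in the dialer to prevent DNS rebinding.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// blockPrivate reports an error if the URL's host resolves to a loopback,
// private, or link-local address, i.e. the ranges the finding above calls out.
// Note: production code should also pin the resolved IP in the HTTP dialer
// so a later lookup cannot be rebound to a private address.
func blockPrivate(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		return err
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() {
			return fmt.Errorf("refusing to crawl %s: resolves to %s", raw, ip)
		}
	}
	return nil
}

func main() {
	fmt.Println(blockPrivate("http://127.0.0.1:8080/admin")) // refused
	fmt.Println(blockPrivate("https://example.com/"))        // nil for public hosts
}
```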
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.