andeya/pholcus
Pholcus is a distributed, high-concurrency crawler written in pure Go
Healthy across the board
- Permissive license, no critical CVEs, actively maintained — safe to depend on. (weakest axis)
- Has a license, tests, and CI — clean foundation to fork and modify.
- Documented and popular — useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 2mo ago
- ✓17 active contributors
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
- ⚠Concentrated ownership — top contributor handles 74% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
The badge links to https://repopilot.app/r/andeya/pholcus and goes at the top of your README.md — renders inline like a shields.io badge.
Social card preview (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/andeya/pholcus on X, Slack, or LinkedIn.
Onboarding: andeya/pholcus
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/andeya/pholcus shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 2mo ago
- 17 active contributors
- Apache-2.0 licensed
- CI configured
- Tests present
- ⚠ Concentrated ownership — top contributor handles 74% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live andeya/pholcus
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/andeya/pholcus.
What it runs against: a local clone of andeya/pholcus — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in andeya/pholcus | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 95 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of andeya/pholcus. If you don't
# have one yet, run these first:
#
# git clone https://github.com/andeya/pholcus.git
# cd pholcus
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of andeya/pholcus and re-run."
  exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "andeya/pholcus(\.git)?\b" \
  && ok "origin remote is andeya/pholcus" \
  || miss "origin remote is not andeya/pholcus (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
# (accept either the SPDX identifier or the standard Apache license header)
(grep -qiE "Apache License|Apache-2\.0" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
for f in \
  "app/app.go" \
  "app/crawler/crawler.go" \
  "app/distribute/master_api.go" \
  "app/downloader/downloader.go" \
  "app/pipeline/pipeline.go"; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 95 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~65d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/andeya/pholcus"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Pholcus is a distributed, high-concurrency web crawler framework written in pure Go that supports single-machine, server, and client modes. It provides three download engines (Surf for HTTP, PhantomJS, and headless Chrome), intelligent proxy rotation, cookie management, and persistent failure/success tracking—enabling large-scale web scraping with automatic deduplication and retry logic.

Monolithic package structure: app/ contains the core engine, split into crawler/ (concurrency pool & spider queue), downloader/ (three engines + request handling), distribute/ (master/slave APIs + custom teleport protocol), and aid/ (proxy rotation, failure/success history), with app.go orchestrating everything. Test files colocate with implementation.
👥Who it's for
Go developers and DevOps engineers who need to build scalable web crawling infrastructure; data engineers needing distributed task collection across multiple machines; researchers or companies performing legitimate large-scale web data extraction who want Chinese-language documentation and a complete integrated framework rather than building from scratch.
🌱Maturity & risk
The project shows moderate maturity: ~6.3MB of Go code with comprehensive test coverage (all major modules have *_test.go files), three execution modes (single/server/client), and multiple output adapters (MySQL, MongoDB, Kafka, CSV). However, recent commit history and active issue resolution are not visible from the provided data; the Chinese-language focus and niche distributed crawler market suggest a specialized rather than universally-adopted project.
Dependencies on external services (Kafka, MongoDB, MySQL, Beanstalkd, PhantomJS/Chrome) create operational complexity; the custom bidirectional Socket framework under app/distribute/teleport/ requires careful protocol maintenance; no visible CI/CD pipeline metadata. Single-maintainer projects in the crawler space risk legal exposure due to terms-of-service violations by end users, and the framework's ease-of-use amplifies that liability.
Active areas of work
Unable to determine from provided data—no git log, CI status, or milestone information is available. The file structure suggests a mature codebase, but activity level is not visible.
🚀Get running
```bash
git clone https://github.com/andeya/pholcus.git
cd pholcus
go mod download
go test ./...
```
Daily commands:
Not fully visible from the provided files. Likely go run main.go (entry point not shown) or go build to produce a binary. The web UI is accessed via a browser after startup; GUI mode on Windows uses lxn/walk; CLI mode is driven by command flags. Requires Go 1.24+.
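For orientation, a minimal entry point in the style the upstream README has historically documented is sketched below. It is not confirmed by this artifact: the exec.DefaultRun call, the import path, and the "web" mode string are all assumptions to verify against the repo's own README before use.

```go
package main

import (
	// Assumed import path for the bootstrap package; confirm it in the repo.
	"github.com/andeya/pholcus/exec"
	// Spider rule libraries are usually pulled in via blank imports, e.g.:
	// _ "github.com/andeya/pholcus_lib"
)

func main() {
	// "web" starts the browser UI; "gui" (Windows only) and "cmd" are the
	// other modes the project has documented. Verify accepted values first.
	exec.DefaultRun("web")
}
```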
🗺️Map of the codebase
- app/app.go — Main application orchestrator that coordinates the crawler, scheduler, pipeline, and distribution subsystems; all contributions must understand the entry point.
- app/crawler/crawler.go — Core crawler engine managing concurrency, spider scheduling, and request/response cycles; fundamental to the distributed crawling model.
- app/distribute/master_api.go — Master node API for distributed coordination; essential for understanding task distribution and slave node communication.
- app/downloader/downloader.go — HTTP request execution layer supporting multiple backends (surfer, chrome); critical for all network operations.
- app/pipeline/pipeline.go — Data pipeline orchestrator handling results collection and multi-output dispatch (CSV, MySQL, Kafka, etc.).
- app/scheduler/scheduler.go — Request scheduling logic determining spider execution order and concurrency control across the crawler pool.
- app/distribute/teleport/teleport.go — Custom RPC protocol for master–slave communication in distributed deployments; critical for distribution reliability.
🛠️How to make changes
Add a new output plugin
- Create a new file in app/pipeline/collector/ named output_<backend>.go, following the pattern of output_csv.go (app/pipeline/collector/output_csv.go)
- Implement the OutputInterface methods (Name(), Outputting(), Close()) in your new plugin (app/pipeline/collector/collector.go)
- Register the plugin in the collectors map within the Collector initialization logic (app/pipeline/collector/collector.go); a hedged sketch follows below
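A minimal sketch of what such a plugin could look like, assuming the method set the checklist names (Name/Outputting/Close). The interface is re-declared locally for illustration only; the authoritative types and the registration map live in app/pipeline/collector/collector.go and may differ, so treat every identifier here as hypothetical.

```go
package collector

import (
	"encoding/json"
	"os"
)

// Hypothetical interface mirroring the method set named in the checklist above;
// the real definition in collector.go may differ.
type OutputInterface interface {
	Name() string
	Outputting(records []map[string]interface{}) error
	Close() error
}

// jsonLinesOutput is a hypothetical backend that appends one JSON object per record.
type jsonLinesOutput struct {
	f *os.File
}

func (o *jsonLinesOutput) Name() string { return "jsonl" }

func (o *jsonLinesOutput) Outputting(records []map[string]interface{}) error {
	enc := json.NewEncoder(o.f)
	for _, r := range records {
		if err := enc.Encode(r); err != nil {
			return err
		}
	}
	return nil
}

func (o *jsonLinesOutput) Close() error { return o.f.Close() }
```

Registration is then a one-line addition wherever the collectors map is initialized in collector.go; read that initialization before assuming a map literal.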
Add a new HTTP downloader backend
- Create a new surfer backend file in app/downloader/surfer/ (e.g., surfer_mybackend.go) implementing the Surfer interface (app/downloader/surfer/surfer.go)
- Implement the required methods (Do, Download, etc.) following the pattern in surf.go (app/downloader/surfer/surf.go)
- Register the backend in the downloader selection logic in app/downloader/downloader.go; see the sketch below for the rough shape
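Under the same caveat, a skeleton backend might look like this; the Surfer interface is re-declared here with an assumed signature (the real one in app/downloader/surfer/surfer.go likely takes the project's own Request type rather than *http.Request).

```go
package surfer

import "net/http"

// Hypothetical shape of the Surfer contract; the authoritative definition is
// in app/downloader/surfer/surfer.go and may differ.
type Surfer interface {
	Download(req *http.Request) (*http.Response, error)
}

// myBackend is a hypothetical engine that simply delegates to net/http;
// a real backend (e.g. a new headless browser) would do its own fetching.
type myBackend struct {
	client *http.Client
}

func NewMyBackend() Surfer {
	return &myBackend{client: &http.Client{}}
}

func (b *myBackend) Download(req *http.Request) (*http.Response, error) {
	return b.client.Do(req)
}
```

Wiring the new engine into the selection logic in app/downloader/downloader.go depends on how that switch is written today, so read it before adding a case.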
Extend spider capabilities with custom parsing
- Create a spider implementation file in app/spider/ following the common pattern (app/spider/common/common.go)
- Implement the spider interface methods (Name, Requests, Parse, etc.) (app/spider/common/common.go)
- Register the spider in the scheduler to make it available to the crawler (app/scheduler/scheduler.go); a hedged sketch follows
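The sketch below follows the interface-style shape this checklist describes, but note that pholcus spiders have historically been defined declaratively (a Spider struct holding a rule tree of parse callbacks) rather than via an interface, so check app/spider/ before committing to either shape. All identifiers here are hypothetical.

```go
package spider

// Hypothetical method set matching the checklist above. Confirm the actual
// contract in app/spider/ before implementing anything.
type Spider interface {
	Name() string
	Requests() []string // seed URLs
	Parse(body []byte) (records []map[string]interface{}, followups []string)
}

// quotesSpider is a purely illustrative implementation.
type quotesSpider struct{}

func (quotesSpider) Name() string { return "quotes" }

func (quotesSpider) Requests() []string {
	return []string{"https://example.com/quotes"}
}

func (quotesSpider) Parse(body []byte) ([]map[string]interface{}, []string) {
	// A real spider would extract fields here and return newly discovered URLs.
	return []map[string]interface{}{{"raw_len": len(body)}}, nil
}
```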
Add custom RPC endpoints for distributed mode
- Define new message types in app/distribute/teleport/protocol.go (app/distribute/teleport/protocol.go)
- Implement handler methods in app/distribute/master_api.go (master-side) or app/distribute/slave_api.go (slave-side) (app/distribute/master_api.go)
- Register the handler in the teleport server initialization (app/distribute/teleport/server.go); a rough illustration follows
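The teleport API is custom and not documented in this artifact, so the following only illustrates the general pattern the checklist implies: a typed message plus a handler that the server registers at startup. Every name is invented and must be mapped onto the real types in app/distribute/teleport/.

```go
package distribute

// statsRequest is a hypothetical message a master could broadcast to slaves.
// Real teleport messages are defined in app/distribute/teleport/protocol.go.
type statsRequest struct {
	Since int64 `json:"since"` // unix timestamp
}

// statsReply is the hypothetical slave-side answer.
type statsReply struct {
	Crawled int `json:"crawled"`
	Failed  int `json:"failed"`
}

// handleStats is what a slave-side handler might look like; in the real code
// it would be registered during teleport server initialization
// (app/distribute/teleport/server.go) under a routing key such as "stats".
func handleStats(req statsRequest) statsReply {
	// Gather counters from local crawler state and return them.
	_ = req.Since
	return statsReply{}
}
```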
🔧Why these technologies
- Go (pure language choice) — Enables lightweight goroutines for high-concurrency crawling without heavyweight threading overhead; native concurrent primitives (channels, mutexes) simplify distributed coordination
- Custom RPC protocol (teleport) — Avoids external service dependencies (gRPC, JSON-RPC libraries); minimal serialization overhead for master–slave task distribution; tight control over protocol version compatibility
- Multiple downloader backends (HTTP, Chrome, PhantomJS) — Handles diverse web content: static HTML via standard HTTP, JavaScript-rendered pages via headless browsers, graceful fallback strategy
- Modular output plugins (CSV, MySQL, Kafka, Beanstalkd, MongoDB) — Supports flexible result persistence; different deployment scenarios (batch processing, real-time streaming, data warehousing) without core changes
- Proxy pool abstraction (app/aid/proxy) — Rotates requests across proxies to avoid IP bans; supports distributed proxy sources for large-scale crawling
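To make the proxy-pool idea concrete, here is a small generic round-robin pool in plain Go. It is not the app/aid/proxy implementation, just an illustration of the thread-safe rotation the last bullet describes.

```go
package proxypool

import "sync"

// Pool hands out proxy URLs in round-robin order; safe for concurrent use.
// Generic illustration only, not the app/aid/proxy implementation.
type Pool struct {
	mu      sync.Mutex
	proxies []string
	next    int
}

func New(proxies []string) *Pool {
	return &Pool{proxies: proxies}
}

// Next returns the next proxy URL, or "" if the pool is empty.
func (p *Pool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.proxies) == 0 {
		return ""
	}
	proxy := p.proxies[p.next%len(p.proxies)]
	p.next++
	return proxy
}

// Remove drops a proxy that has started failing, so rotation skips it.
func (p *Pool) Remove(bad string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i, pr := range p.proxies {
		if pr == bad {
			p.proxies = append(p.proxies[:i], p.proxies[i+1:]...)
			return
		}
	}
}
```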
⚖️Trade-offs already made
- Custom RPC protocol instead of gRPC/protobuf
  - Why: Avoid external serialization library dependencies and reduce binary size; prioritize simplicity and control
  - Consequence: Lower type safety and discoverability compared to protobuf; requires manual protocol evolution management; harder debugging without protocol inspection tools
- Synchronous request–parse–output pipeline
  - Why: Simpler model for single-threaded spider logic; easier backpressure handling via channel buffering
  - Consequence: Cannot easily parallelize parsing across multiple cores for CPU-bound spider code; output I/O blocks crawler scheduling
- Master–slave architecture with single master
  - Why: Centralized task distribution simplifies state management and consistency; no distributed consensus complexity
  - Consequence: Master is a single point of failure in production; does not scale to extreme slave counts; master bottleneck under very high task throughput
- In-process scheduler matrix
  - Why: No external scheduler service required; low-latency local request prioritization
  - Consequence: Cannot share schedule state across multiple instances; scheduler becomes part of the tight crawler loop (cannot be updated live)
🚫Non-goals (don't propose these)
- Does not provide browser automation beyond headless Chrome/PhantomJS integration—no session management or interactive form filling
- Not a distributed file system or cluster management tool—assumes operator manages master/slave provisioning and networking
- Does not include built-in caching layer (Redis, memcached)—all state kept in-process or written to external storage
- Not a visual UI framework—CLI and GUI (walk) are minimal; not suitable for complex multi-user concurrent crawler management
- Does not handle CAPTCHA solving or anti-bot evasion beyond proxy rotation and basic User-Agent spoofing
🪤Traps & gotchas
- Protocol versioning: The custom teleport protocol has no visible version negotiation — rolling updates between master and slave nodes risk silent failures.
- Chrome/PhantomJS: Requires a headless browser binary on PATH; packaging differs per OS (apt, brew, Windows MSI); a missing binary causes silent fallback to Surf without warning.
- Goroutine leaks: CrawlerPool and SpiderQueue use channels; improper context cancellation can hang goroutines — watch for unbuffered channel sends/receives.
- Proxy rotation: aid/proxy/proxy.go must maintain thread-safe state; concurrency bugs here corrupt request routing.
- Config file: References gopkg.in/ini.v1 but no example .ini file appears in the listing — required config keys are undocumented.
- Otto JS sandbox: JavaScript rule execution via Otto is single-threaded; CPU-intensive JS blocks the Go goroutine.
🏗️Architecture
💡Concepts to learn
- Worker Pool Pattern — CrawlerPool in app/crawler/crawlerpool.go uses bounded goroutine pools to control concurrency; understanding pool sizing, work stealing, and graceful shutdown is essential for tuning crawler performance and preventing resource exhaustion (a minimal generic sketch follows this list)
- Custom Protocol & Serialization — The bidirectional Socket protocol in app/distribute/teleport/ is Pholcus's unique distribution backbone; mastering netdata marshaling and protocol versioning is required to extend master–slave communication or debug distributed crawls
- Request Deduplication & Bloom Filters — Large-scale crawlers must avoid re-fetching URLs; Pholcus likely uses hashing or bloom filters in the history module (app/aid/history/) to detect duplicates — critical for efficiency on billion-URL datasets
- Distributed Task Scheduling — The master node in app/distribute/master_api.go must fairly distribute work to multiple slave nodes while handling failures and uneven load; this is a classic distributed-systems challenge requiring load-aware task assignment
- Headless Browser Automation (CDP) — The Chrome downloader uses the Chrome DevTools Protocol via chromedp to execute JavaScript before scraping; understanding session management, page lifecycle, and network interception is needed for JS-heavy sites
- Proxy Rotation & Rate Limiting — app/aid/proxy/proxy.go manages IP pools and frequency-based switching to avoid detection; token bucket or sliding window algorithms are likely used to throttle requests per proxy
- Goroutine Lifecycle & Channel Coordination — Go concurrency primitives (channels, context, WaitGroup) are used throughout for worker coordination in crawlerpool.go and distribution; improper cleanup causes goroutine leaks and hangs on shutdown
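A generic illustration of the bounded worker pool plus graceful-shutdown pattern that the first and last bullets describe, using only the standard library. It is not pholcus's CrawlerPool; none of these names are expected to exist in app/crawler/.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// crawl simulates fetching one URL; the real work would be a downloader call.
func crawl(ctx context.Context, url string) {
	select {
	case <-time.After(100 * time.Millisecond): // pretend network latency
		fmt.Println("fetched", url)
	case <-ctx.Done(): // shutdown requested mid-fetch
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	urls := make(chan string)
	var wg sync.WaitGroup

	// Bounded pool: at most 4 goroutines touch the network at once.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls { // exits cleanly when urls is closed
				crawl(ctx, u)
			}
		}()
	}

	for i := 0; i < 20; i++ {
		select {
		case urls <- fmt.Sprintf("https://example.com/page/%d", i):
		case <-ctx.Done(): // stop feeding work on shutdown
		}
	}
	close(urls) // lets workers drain and return, so no goroutine leak
	wg.Wait()
}
```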
🔗Related repos
- colly/colly — Popular Go web scraping framework with a simpler API and better docs; directly competes on single-machine crawling but lacks Pholcus's distributed architecture
- gocolly/colly — Official Colly organization fork; same as above, the industry-standard alternative in the Go ecosystem
- qwuaker/zhonghua — Chinese web crawler framework (also Go); similar target audience but a smaller feature set; a reference for Chinese-docs patterns
- chromedp/chromedp — Pholcus depends on this for headless Chrome automation; understanding the CDP protocol and session management is prerequisite knowledge
- robertkrimen/otto — JavaScript VM that Pholcus uses for hot-loaded dynamic rules; needed to extend rule execution or debug JS-based scrapers
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive integration tests for distributed crawler coordination
The distribute package has an integration_test.go file but lacks thorough coverage of master-slave communication. Given this is a distributed crawler framework, testing task distribution, synchronization, and failure handling between master and slave nodes is critical. The teleport protocol has multiple test files but the high-level task coordination logic needs better coverage to catch race conditions and network failures.
- [ ] Expand app/distribute/integration_test.go to test master_api.go and slave_api.go task distribution workflows
- [ ] Add tests in app/distribute/taskjar_test.go for concurrent task queue access patterns
- [ ] Create test scenarios in app/distribute/task_test.go covering task retry logic and state transitions across distributed nodes
Add missing unit tests for downloader request parameter validation
The downloader/surfer/param.go file has a corresponding param_test.go but it appears minimal. The surfer package handles critical HTTP client behavior with multiple implementations (Chrome, Phantom). Parameter validation for request headers, proxies, timeouts, and user agents needs thorough testing to prevent silent failures in production crawling.
- [ ] Expand app/downloader/surfer/param_test.go to test edge cases: missing headers, invalid proxies, timeout bounds
- [ ] Add validation tests in app/downloader/request/request_test.go for malformed URLs and conflicting parameters
- [ ] Create tests in app/downloader/surfer/agent/agent_test.go covering platform-specific user agent generation across Linux, Windows, BSD, and ARM
Add missing test coverage for proxy pool and failure history tracking
The aid/proxy and aid/history packages handle critical reliability features (proxy rotation, failure recovery) but have incomplete test coverage. Testing concurrent proxy availability checks, history persistence across restarts, and the interaction between failure tracking and proxy selection is essential for a production crawler framework.
- [ ] Expand app/aid/proxy/proxy_test.go to test concurrent access patterns and proxy availability verification under load
- [ ] Add tests in app/aid/history/history_test.go for persistence layer (success/failure log synchronization and replay after restarts)
- [ ] Create integration tests covering failure recovery: mark proxies as failed, verify they're excluded, test re-enablement logic in proxy rotation
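If you pick up the proxy-pool idea, a concurrency test typically has the stdlib-only shape below. The pool type here is a stand-in; map it onto whatever app/aid/proxy/proxy.go actually exports before writing the real test, and run it with the race detector.

```go
package proxy_test

import (
	"sync"
	"testing"
)

// fakePool stands in for the real proxy pool type; replace it with the
// exported type from app/aid/proxy before using this skeleton.
type fakePool struct {
	mu   sync.Mutex
	used map[string]int
}

func (p *fakePool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.used["http://127.0.0.1:8080"]++
	return "http://127.0.0.1:8080"
}

func TestConcurrentNext(t *testing.T) {
	p := &fakePool{used: map[string]int{}}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // hammer the pool from many goroutines
		wg.Add(1)
		go func() {
			defer wg.Done()
			if got := p.Next(); got == "" {
				t.Error("Next returned an empty proxy")
			}
		}()
	}
	wg.Wait()
	// Run with `go test -race ./...` so the race detector checks the locking.
}
```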
🌿Good first issues
- Add an integration test for Chrome downloader fallback behavior when the binary is missing (app/downloader/downloader_test.go only tests Surf); currently no test verifies graceful degradation or error messaging
- Write missing benchmarks for CrawlerPool concurrency scaling (app/crawler/crawlerpool_test.go exists but has no Benchmark* functions); critical for performance-tuning the goroutine pool size; a generic benchmark skeleton follows below
- Document the custom teleport protocol wire format with a .md file in app/distribute/teleport/; currently it is only code-readable, and new contributors must reverse-engineer netdata.go and protocol.go to understand message serialization
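For the benchmark item, the conventional Go shape is shown below. The inlined channel-based pool is a placeholder for CrawlerPool; substitute the real constructor and scheduling calls from app/crawler/crawlerpool.go.

```go
package crawler_test

import (
	"fmt"
	"sync"
	"testing"
)

// BenchmarkPoolThroughput measures how fast a bounded pool drains tasks at
// different worker counts. The inlined pool is a stand-in for CrawlerPool.
func BenchmarkPoolThroughput(b *testing.B) {
	for _, size := range []int{1, 4, 16, 64} {
		b.Run(fmt.Sprintf("workers=%d", size), func(b *testing.B) {
			tasks := make(chan int)
			var wg sync.WaitGroup
			for w := 0; w < size; w++ {
				wg.Add(1)
				go func() {
					defer wg.Done()
					for range tasks {
						// no-op work; a real benchmark would run a spider request
					}
				}()
			}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				tasks <- i
			}
			close(tasks)
			wg.Wait()
		})
	}
}
```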
⭐Top contributors
- @andeya — 74 commits
- @liguoqinjim — 6 commits
- @xianyunyh — 2 commits
- @zlh — 2 commits
- @zerozh — 2 commits
📝Recent commits
- 91a5608 — refactor: translate all Chinese comments and logs to English (andeya)
- cd403a7 — test: clean up dead code, improve comments, and achieve ≥80% test coverage (andeya)
- a4c31a7 — feat: add Chrome headless browser downloader and fix Baidu search spiders (andeya)
- e2aac49 — refactor: overhaul config system, normalize naming conventions, and refresh README (andeya)
- fb174d9 — fix: repair broken static spider rules (andeya)
- 9b8d19f — refactor: deep gust adoption and JS-friendly Context API (andeya)
- 3d1faf8 — refactor: translate comments to English, use strings.ReplaceAll, and replace bindata with go:embed (andeya)
- 85ac561 — refactor: consolidate rules into sample/ and restructure project layout (andeya)
- 04a4f95 — refactor: rename simple/ to sample/ for clarity (andeya)
- b00ce6f — chore: sync go.mod versions after go work sync (andeya)
🔒Security observations
- Critical · Outdated and Vulnerable Dependencies — go.mod dependencies: golang.org/x/net, golang.org/x/crypto, golang.org/x/sys. Multiple dependencies have known security vulnerabilities and are severely outdated: golang.org/x/net (from 2019), golang.org/x/crypto (from 2019), and golang.org/x/sys. These legacy versions contain publicly disclosed CVEs affecting web protocols, cryptographic operations, and system calls. Fix: Update all dependencies to their latest stable versions. Specifically: update the golang.org/x modules to current versions (likely v0.25.0+), audit and update github.com/Shopify/sarama (v1.23.1 is from 2019), and bump github.com/chromedp/chromedp to latest.
- Critical · Insecure Database Driver without TLS — go.mod dependencies: github.com/go-sql-driver/mysql v1.4.1, gopkg.in/mgo.v2. github.com/go-sql-driver/mysql v1.4.1 (2018) is outdated and may not enforce TLS connections by default. The MongoDB driver gopkg.in/mgo.v2 is deprecated and unmaintained since 2018, with known authentication and connection security issues. Fix: Upgrade to github.com/go-sql-driver/mysql v1.8.0+ and replace gopkg.in/mgo.v2 with the official mongo-go-driver (go.mongodb.org/mongo-driver), which is actively maintained and secure.
- High · Unsafe JavaScript Execution Engine — go.mod dependency: github.com/robertkrimen/otto v0.0.0-20180617131154-15f95af6e78d. otto (2017) is an abandoned JavaScript interpreter used for dynamic script execution, with no security updates since 2018. It can be exploited for arbitrary code execution if untrusted spider scripts are processed. Fix: Consider removing JavaScript execution capabilities or switching to a sandboxed alternative like goja. If JavaScript execution is essential, validate and sanitize all spider scripts and run them in isolated environments with resource limits.
- High · Unvalidated External HTTP Requests — app/downloader/surfer/ request-handling modules. The crawler framework downloads content from arbitrary URLs via the downloader/surfer modules without apparent validation. Combined with the outdated golang.org/x/net, this creates SSRF vulnerability risks and exposure to malicious content. Fix: Implement URL validation (whitelist/blacklist), disable requests to private IP ranges (127.0.0.1, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), add request timeouts, and validate Content-Type headers before processing. (A minimal private-range check is sketched below.)
- High · Deprecated and Unmaintained Dependencies — go.mod dependencies: github.com/lxn/walk, github.com/lxn/win, gopkg.in/ini.v1. github.com/lxn/walk and github.com/lxn/win are Windows GUI libraries last updated in 2019, and gopkg.in/ini.v1 requires an audit for configuration-file injection. These increase the supply-chain attack surface. Fix: Evaluate whether GUI components are needed in production deployments, update gopkg.in/ini.v1 to the latest version, and consider removing Windows-specific GUI dependencies from server builds or auditing configuration-file handling for injection vulnerabilities.
- High · Insecure Message Queue Integration — app/distribute/ and app/pipeline/collector/output_beanstalkd.go. The Kafka consumer (github.com/Shopify/sarama v1.23.1 from 2019) and Beanstalk queue (github.com/kr/beanstalk) may lack authentication validation and encryption, so the distributed system could expose task data in transit. Fix: Update Sarama to v1.38.0+ with TLS/SASL configuration enforced, verify Beanstalk connections use authentication, encrypt all inter-node communication, and audit taskjar.go for message validation.
- Medium · Potential SQL Injection in Database Layers — location not identified by the analysis. Multiple database-related files (output_*.go collectors) suggest dynamic query construction. Fix: not specified by the analysis; audit the output collectors for string-built SQL and prefer parameterized queries.
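To make the SSRF item concrete, a minimal pre-flight check in plain Go is sketched below: it resolves the URL's host and rejects loopback, private, and link-local addresses. This is an illustration, not existing code in app/downloader/, and production use should also pin the resolved IP in the dialer to prevent DNS rebinding.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// blockPrivate reports an error if the URL's host resolves to a loopback,
// private, or link-local address, i.e. the ranges the finding above calls out.
// Note: production code should also pin the resolved IP in the HTTP dialer
// so a later lookup cannot be rebound to a private address.
func blockPrivate(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		return err
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() {
			return fmt.Errorf("refusing to crawl %s: resolves to %s", raw, ip)
		}
	}
	return nil
}

func main() {
	fmt.Println(blockPrivate("http://127.0.0.1:8080/admin")) // refused
	fmt.Println(blockPrivate("https://example.com/"))        // nil for public hosts
}
```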
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.