RepoPilot

andeya/pholcus

Pholcus is a distributed high-concurrency crawler software written in pure golang

Healthy

Healthy across the board

HealthyDependency

Permissive license, no critical CVEs, actively maintained — safe to depend on.

HealthyFork & modify

Has a license, tests, and CI — clean foundation to fork and modify.

HealthyLearn from

Documented and popular — useful reference codebase to read through.

HealthyDeploy as-is

No critical CVEs, sane security posture — runnable as-is.

  • Concentrated ownership — top contributor handles 74% of recent commits
  • Scorecard: default branch unprotected (0/10)
  • Last commit 2mo ago
  • 17 active contributors
  • Apache-2.0 licensed
  • CI configured
  • Tests present

Computed from maintenance signals — commit recency, contributor breadth, bus factor, license, CI, tests, cross-checked against OpenSSF Scorecard

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/andeya/pholcus)](https://repopilot.app/r/andeya/pholcus)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card

This card auto-renders when someone shares https://repopilot.app/r/andeya/pholcus on X, Slack, or LinkedIn.

Ask AI about andeya/pholcus

Grounded in the actual source code. Pick a starter question or write your own.

Or write your own question →

Onboarding doc

Onboarding: andeya/pholcus

Generated by RepoPilot · 2026-06-24 · Source

🎯Verdict

GO — Healthy across the board

  • Last commit 2mo ago
  • 17 active contributors
  • Apache-2.0 licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 74% of recent commits
  • ⚠ Scorecard: default branch unprotected (0/10)

<sub>Computed from maintenance signals — commit recency, contributor breadth, bus factor, license, CI, tests, cross-checked against OpenSSF Scorecard</sub>

TL;DR

Pholcus (幽灵蛛) is a pure Go distributed web crawler framework supporting high-concurrency crawling across multiple machines. It provides three execution engines (Surf HTTP, PhantomJS, and Chrome Chromium headless), intelligent cookie management, proxy pooling, and persistent task state for distributed crawl operations coordinated via master-slave architecture with full-duplex Socket communication. Monolithic modular structure: app/crawler/ handles core crawling loops (crawlerpool.go manages goroutine pools, spiderqueue.go manages task queues), app/downloader/ abstracts three HTTP engines, app/distribute/ implements master-slave coordination with custom teleport protocol in app/distribute/teleport/, and app/aid/ provides cross-cutting concerns (history persistence, proxy management).

👥Who it's for

Go developers building enterprise-scale web scraping systems who need to crawl millions of pages across distributed node clusters, handle JavaScript-heavy sites with Chrome automation, and manage persistent crawl state with automatic retry and deduplication.

🌱Maturity & risk

Actively maintained with 6.3M lines of Go code and comprehensive test coverage (failure_test.go, crawler_test.go, teleport_test.go, etc.). The project has released versions and uses semantic versioning, but commit recency and GitHub activity metrics are not visible in provided data. Conservative assessment: production-ready for Go practitioners, though smaller community than mainstream crawlers.

Moderate dependency burden (Kafka/Sarama, MongoDB/mgo, Chrome DevTools Protocol, Otto JS engine) with some outdated packages (mgo v2 deprecated, otto unmaintained). No visible CI/CD pipeline configuration in file list. Single active maintainer (andeya) concentrates risk; legal liability concerns exist given the explicit disclaimer about compliance with local laws in the README.

Active areas of work

No specific PR or milestone data provided in repo snapshot. Based on file structure, active areas include: distributed task coordination (taskjar.go), teleport protocol enhancements, and Chrome integration (chromedp v0.14.2 dependency suggests ongoing browser automation work).

🚀Get running

git clone https://github.com/andeya/pholcus.git
cd pholcus
go mod download
go build ./cmd  # if a cmd entry point exists, or examine app/app.go for library usage

Daily commands: No Makefile or shell scripts visible. As a library: import github.com/andeya/pholcus/app in your main.go and instantiate via app/app.go interfaces. For distributed mode, start a master node listening on teleport server (app/distribute/teleport/server.go) and slave nodes connecting via client (app/distribute/teleport/client.go). See app/distribute/master_api.go and slave_api.go for coordination APIs.

🗺️Map of the codebase

  • app/app.go — Main application entry point and orchestrator for the entire crawler framework
  • app/crawler/crawler.go — Core crawler engine that manages concurrent spider execution and lifecycle
  • app/downloader/downloader.go — HTTP request executor and response handling layer for all web fetches
  • app/spider/common/common.go — Spider interface and common utilities defining the contract all spiders must implement
  • app/distribute/master_api.go — Master node coordinator for distributed crawling across multiple slave nodes
  • app/pipeline/pipeline.go — Data processing and output pipeline that collects and persists crawled results
  • app/distribute/teleport/teleport.go — RPC framework enabling distributed communication between master and slave nodes

🛠️How to make changes

Create a New Spider

  1. Study the Spider interface in app/spider/common/common.go to understand the contract (app/spider/common/common.go)
  2. Create your spider struct implementing the Spider interface (Name, Descripton, Namespace methods, etc.) (app/spider/common/common.go)
  3. Implement the Init() method to set up spider configuration and parsing rules (app/spider/common/common.go)
  4. Implement the Request() method to define initial URLs and return Request objects (app/downloader/request/request.go)
  5. Implement the Parse() method to extract data from responses and emit results via ctx.Output() (app/spider/common/common.go)
  6. Register your spider with the spider collection so app.go can discover it (app/app.go)

Add a New Data Output Backend

  1. Study existing output implementations like output_mysql.go or output_kafka.go (app/pipeline/collector/output_mysql.go)
  2. Create output_newbackend.go implementing the output interface with Save(data) method (app/pipeline/collector/collector.go)
  3. Add configuration parsing for your backend in the output function (app/pipeline/collector/collector.go)
  4. Register your output handler in the collector's switch statement (app/pipeline/collector/collector.go)
  5. Test by running spider with output configuration pointing to your backend (app/app.go)

Implement a Custom Downloader Strategy

  1. Review the Downloader interface in app/downloader/downloader.go (app/downloader/downloader.go)
  2. Create downloader_custom.go implementing Download(req *Request) (*Response, error) (app/downloader/downloader.go)
  3. Optionally use surfer backends in app/downloader/surfer/surfer.go for browser rendering (app/downloader/surfer/surfer.go)
  4. Register your downloader in app.go's downloader selector logic (app/app.go)
  5. Add unit tests in downloader_custom_test.go validating request/response handling (app/downloader/downloader_test.go)

Set Up Distributed Crawling (Master-Slave)

  1. Understand the master API structure in app/distribute/master_api.go (app/distribute/master_api.go)
  2. Start a master node with app.go configured as master role (app/app.go)
  3. Define Task objects specifying spiders and parameters in app/distribute/task.go (app/distribute/task.go)
  4. Start slave nodes pointing to master via app/distribute/slave_api.go (app/distribute/slave_api.go)
  5. Submit tasks to master which distributes work via teleport RPC in app/distribute/teleport/teleport.go (app/distribute/teleport/teleport.go)
  6. Results aggregate at master pipeline and persist via collectors (app/pipeline/pipeline.go)

🔧Why these technologies

  • Go 1.24.0 — Pure Go implementation enables single-binary deployment, goroutine-based concurrency, and cross-platform compilation for distributed nodes
  • chromedp (Chrome DevTools Protocol) — Headless Chrome automation for JavaScript-heavy sites without heavy PhantomJS dependencies
  • Sarama (Kafka client) — Stream real-time crawl results to distributed systems for downstream processing
  • mgo (MongoDB driver) — Schema-flexible document storage for unstructured scraped data with native BSON support
  • Custom teleport RPC — Lightweight binary protocol optimized for master-slave communication instead of gRPC overhead
  • Otto (JavaScript interpreter) — Server-side JavaScript parsing and rule evaluation without browser instantiation

⚖️Trade-offs already made

  • Custom RPC (teleport) instead of gRPC

    • Why: Reduce framework overhead and binary size for lightweight slave nodes while maintaining master control
    • Consequence: Less mature ecosystem, custom protocol debugging needed, but faster marshaling and simpler deployment
  • undefined

    • Why: undefined
    • Consequence: undefined

🪤Traps & gotchas

Distributed mode requires custom teleport protocol handshake (see teleport/protocol_test.go for binary format details). Chrome automation requires Chromium/Headless Chrome binary in PATH or specified config; missing browser causes silent failures. Otto JavaScript engine (otto v0.0.0-20180617131154) is unmaintained—complex JS rules may fail. Proxy rotation uses frequency-based cycling (app/aid/proxy/proxy.go) not success-based; misconfiguration causes IP bans. MongoDB mgo driver is deprecated; mgo v2 connections are deprecated in Go 1.16+. No visible database migrations or schema versioning; upgrades must be manual.

🏗️Architecture

💡Concepts to learn

  • go-colly/colly — Production Go web scraping library with similar high-level API but monolithic design without native distributed support
  • gocolly/colly — Community fork of colly addressing historical issues; simpler alternative if distributed master-slave complexity not needed
  • chromedp/chromedp — Standalone Chrome DevTools Protocol client used as dependency; understanding this directly helps with pholcus browser automation layer
  • elastic/go-elasticsearch — Ecosystem companion for storing and indexing crawled documents at scale beyond MySQL/MongoDB
  • andeya/gust — Direct dependency (github.com/andeya/gust v1.20.7) providing utilities; study for understanding framework helper functions

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive integration tests for distributed crawler task execution

The repo has app/distribute/integration_test.go but the distribute package is critical for the core functionality (master_api, slave_api, task management, teleport protocol). Current test coverage appears minimal. Adding thorough integration tests would validate the distributed coordination between master and slave nodes, task jar management, and the custom teleport protocol used for inter-node communication.

  • [ ] Expand app/distribute/integration_test.go with multi-node task distribution scenarios
  • [ ] Add tests validating task serialization/deserialization in app/distribute/task_test.go
  • [ ] Add tests for app/distribute/teleport/protocol.go covering the custom protocol handshake, message framing, and error handling
  • [ ] Add tests for app/distribute/teleport/netdata.go validating concurrent network data transmission
  • [ ] Add tests validating master-slave synchronization with app/distribute/master_api_test.go and app/distribute/slave_api_test.go

Add unit tests for critical downloader request and surfer modules

The downloader package handles HTTP requests and browser automation (Chrome, Phantom JS). While some test files exist (downloader_test.go, request_test.go), there are no tests for several critical modules: app/downloader/surfer/chrome.go, app/downloader/surfer/phantom.go, app/downloader/surfer/param.go, and platform-specific user agent selection in app/downloader/surfer/agent/. These are essential for the crawler's core functionality.

  • [ ] Add tests to app/downloader/surfer/param_test.go validating parameter parsing and validation
  • [ ] Add tests to app/downloader/surfer/chrome_test.go for Chrome automation scenarios (mock chromedp interactions)
  • [ ] Add tests to app/downloader/surfer/phantom_test.go for PhantomJS fallback scenarios
  • [ ] Expand app/downloader/surfer/agent/agent_test.go to validate platform-specific user agent selection across Linux, Windows, BSD, and ARM architectures

Add missing unit tests for proxy management and history tracking modules

The app/aid/ package contains critical helper modules for proxy management and request history tracking, but test coverage is sparse. app/aid/proxy/proxy.go handles proxy rotation and validation (essential for distributed crawling), and app/aid/history/ tracks success/failure state. These have basic test files but are missing edge-case coverage for concurrent operations and state transitions.

  • [ ] Expand app/aid/proxy/proxy_test.go with concurrent proxy rotation tests, proxy validation failure scenarios, and proxy pool exhaustion handling
  • [ ] Expand app/aid/proxy/host_test.go with hostname parsing edge cases and port validation
  • [ ] Expand app/aid/history/history_test.go with concurrent read/write scenarios and state consistency validation
  • [ ] Add tests in app/aid/history/success_test.go and app/aid/history/failure_test.go for tracking concurrent crawler success/failure under load

🌿Good first issues

  • Add test coverage to app/aid/proxy/host.go: only host.go lacks corresponding host_test.go, and proxy pooling is critical for reliability
  • Document the teleport protocol binary format: protocol_test.go has examples but no architecture doc; add ADR in doc/ directory explaining message framing, type codes, and return_func lifecycle
  • Implement request deduplication using cryptographic hashing: app/crawler/spiderqueue.go queue management doesn't show bloom filter or hash set deduplication; add configurable hash-based request deduplication to reduce redundant crawls

Top contributors

Click to expand

📝Recent commits

Click to expand
  • 91a5608 — refactor: translate all Chinese comments and logs to English (andeya)
  • cd403a7 — test: clean up dead code, improve comments, and achieve ≥80% test coverage (andeya)
  • a4c31a7 — feat: add Chrome headless browser downloader and fix Baidu search spiders (andeya)
  • e2aac49 — refactor: overhaul config system, normalize naming conventions, and refresh README (andeya)
  • fb174d9 — fix: repair broken static spider rules (andeya)
  • 9b8d19f — refactor: deep gust adoption and JS-friendly Context API (andeya)
  • 3d1faf8 — refactor: translate comments to English, use strings.ReplaceAll, and replace bindata with go:embed (andeya)
  • 85ac561 — refactor: consolidate rules into sample/ and restructure project layout (andeya)
  • 04a4f95 — refactor: rename simple/ to sample/ for clarity (andeya)
  • b00ce6f — chore: sync go.mod versions after go work sync (andeya)

🔒Security observations

  • High · Outdated and Vulnerable Dependencies — go.mod dependency versions. Multiple dependencies are significantly outdated with known security vulnerabilities. Notable examples: golang.org/x/net (July 2019), golang.org/x/crypto (June 2019), go-sql-driver/mysql (April 2018), and gopkg.in/mgo.v2 (July 2018). These versions contain multiple CVEs including TLS vulnerabilities, cryptographic weaknesses, and protocol implementation issues. Fix: Update all dependencies to their latest stable versions immediately. Prioritize: golang.org/x/net, golang.org/x/crypto, golang.org/x/sys, go-sql-driver/mysql, and gopkg.in/mgo.v2. Consider replacing mgo.v2 with a maintained MongoDB driver.
  • High · Deprecated Go Version — go.mod: go directive. The project specifies 'go 1.24.0' which appears to be a future/unreleased version or typo. This could indicate version management issues and may lead to compatibility problems or build failures. Fix: Verify and use a stable, released Go version (1.21+ recommended). Update to a version that aligns with the actual development environment.
  • High · Unsupported MongoDB Driver (mgo.v2) — go.mod: gopkg.in/mgo.v2. The project depends on gopkg.in/mgo.v2 which has been unmaintained since 2018. This driver contains known security issues and is no longer supported. The maintainer recommends migration to the official MongoDB Go driver. Fix: Migrate to the official MongoDB Go driver (go.mongodb.org/mongo-driver). This will ensure security patches and ongoing maintenance.
  • High · Vulnerable Sarama Kafka Client — go.mod: github.com/Shopify/sarama v1.23.1. Shopify/sarama v1.23.1 (May 2019) is outdated and contains known vulnerabilities related to SASL authentication and encryption handling. Fix: Update to the latest version of sarama. Verify SASL configuration is properly secured and certificates are validated.
  • Medium · Use of otto JavaScript Engine — go.mod: github.com/robertkrimen/otto, spider/parse files (inferred). The project includes robertkrimen/otto for JavaScript evaluation. JavaScript engines used for dynamic code execution can be exploitation vectors if not carefully controlled. Input validation and sandboxing are critical. Fix: Ensure all JavaScript code executed through otto comes from trusted sources only. Implement strict input validation and consider using a more restricted scripting environment. Never execute user-supplied JavaScript directly.
  • Medium · Potential SQL Injection in Database Operations — app/pipeline/collector/ (database output modules), app/downloader/. The codebase includes database-related modules (app/pipeline/collector/output_*.go) and uses outdated go-sql-driver/mysql (1.4.1 from April 2018). Crawler frameworks that parse and execute queries are at higher risk for SQL injection if parameterized queries are not consistently used. Fix: Audit all database query construction to ensure parameterized queries/prepared statements are used exclusively. Never concatenate user input into SQL queries. Update go-sql-driver/mysql to latest version.
  • Medium · Chrome/Headless Browser Security — app/downloader/surfer/chrome.go, app/downloader/surfer/chrome_test.go. The project includes chromedp for headless browser automation (app/downloader/surfer/chrome.go). Running untrusted crawled content in a browser context can lead to script execution and data exfiltration. Fix: Sandbox browser instances. Disable JavaScript execution for untrusted content when possible. Run chromedp in restricted environments. Monitor and limit resource access. Consider running in containers with restricted capabilities.
  • Medium · Insecure Deserialization (mgo.v2) — go.mod: gopkg.in/mgo.v2. mgo.v2 uses BSON deserialization which can be vulnerable to malicious payloads if untrusted data is deserialized without proper validation. Fix:

LLM-derived; treat as a starting point, not a security audit.

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/andeya/pholcus shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live andeya/pholcus repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/andeya/pholcus.

What it runs against: a local clone of andeya/pholcus — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in andeya/pholcus | Confirms the artifact applies here, not a fork | | 2 | License is still Apache-2.0 | Catches relicense before you depend on it | | 3 | Default branch master exists | Catches branch renames | | 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 5 | Last commit ≤ 99 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>andeya/pholcus</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of andeya/pholcus. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/andeya/pholcus.git
#   cd pholcus
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of andeya/pholcus and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "andeya/pholcus(\\.git)?\\b" \\
  && ok "origin remote is andeya/pholcus" \\
  || miss "origin remote is not andeya/pholcus (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Apache-2\\.0)" LICENSE 2>/dev/null \\
   || grep -qiE "\"license\"\\s*:\\s*\"Apache-2\\.0\"" package.json 2>/dev/null) \\
  && ok "license is Apache-2.0" \\
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \\
  && ok "default branch master exists" \\
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "app/app.go" \\
  && ok "app/app.go" \\
  || miss "missing critical file: app/app.go"
test -f "app/crawler/crawler.go" \\
  && ok "app/crawler/crawler.go" \\
  || miss "missing critical file: app/crawler/crawler.go"
test -f "app/downloader/downloader.go" \\
  && ok "app/downloader/downloader.go" \\
  || miss "missing critical file: app/downloader/downloader.go"
test -f "app/spider/common/common.go" \\
  && ok "app/spider/common/common.go" \\
  || miss "missing critical file: app/spider/common/common.go"
test -f "app/distribute/master_api.go" \\
  && ok "app/distribute/master_api.go" \\
  || miss "missing critical file: app/distribute/master_api.go"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 99 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~69d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/andeya/pholcus"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Embed this chat in your README →

Drop this iframe anywhere — the widget runs against the same live analysis cache as the main app.

<iframe
  src="https://repopilot.app/embed/andeya/pholcus"
  width="100%" height="500"
  style="border:1px solid #d0d7de; border-radius:8px;"
  allow="microphone"
  loading="lazy"
></iframe>