RepoPilot

gocolly/colly

Elegant Scraper and Crawler Framework for Golang

Healthy

Healthy across the board

Use as dependencyHealthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modifyHealthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isHealthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 2w ago
  • 22+ active contributors
  • Distributed ownership (top contributor 39% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • No test directory detected

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/gocolly/colly)](https://repopilot.app/r/gocolly/colly)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/gocolly/colly on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: gocolly/colly

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/gocolly/colly shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 2w ago
  • 22+ active contributors
  • Distributed ownership (top contributor 39% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live gocolly/colly repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/gocolly/colly.

What it runs against: a local clone of gocolly/colly — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in gocolly/colly | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches a relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 45 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>gocolly/colly</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of gocolly/colly. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/gocolly/colly.git
#   cd colly
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of gocolly/colly and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "gocolly/colly(\.git)?\b" \
  && ok "origin remote is gocolly/colly" \
  || miss "origin remote is not gocolly/colly (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. The Apache-2.0 LICENSE file
# opens with "Apache License" / "Version 2.0" rather than the SPDX id.
(grep -qi "Apache License" LICENSE 2>/dev/null \
   && grep -qi "Version 2\.0" LICENSE 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
for f in colly.go request.go response.go context.go http_backend.go; do
  test -f "$f" \
    && ok "$f" \
    || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 45 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~15d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/gocolly/colly"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Colly is a fast, elegant web scraping and crawling framework for Go that provides a clean API for extracting structured data from websites. It handles HTTP requests, HTML/XML parsing, cookie management, rate limiting, and concurrent scraping at >1k req/sec on a single core, making it production-grade for data mining, archiving, and processing tasks. The package structure is monolithic: core collector logic lives in the root package (implied from the README), with _examples/ containing 20+ standalone demo programs (basic.go, login.go, parallel.go, etc.) plus internal parsers and extensions. No internal/ subdirectories are visible, suggesting a flat hierarchy with all exported APIs at the package root.

👥Who it's for

Go developers building web scrapers, data miners, and crawlers who need to avoid reinventing request management, parsing, and concurrency control. Specifically: DevOps/SRE engineers monitoring sites, data engineers building ETL pipelines, and security researchers analyzing web content at scale.

🌱Maturity & risk

Highly mature and production-ready. The project has strong GitHub presence, active CI/CD via .github/workflows/ci.yml, comprehensive examples in _examples/ covering 20+ real-world scenarios (Instagram, Coursera, Reddit, etc.), and Go 1.24 support. Last visible activity is recent (go.mod shows current toolchain), with A+ goreportcard rating indicating code quality.

Low risk for core usage, but moderate dependency sprawl: it pulls in PuerkitoBio/goquery, antchfx htmlquery/xmlquery, temoto/robotstxt, and saintfish/chardet among 9 direct dependencies. The self-dependency on gocolly/colly v1.2.0 in go.mod is unusual and worth investigating. No recent commits are visible in the snapshot, so verify current maintainer activity on GitHub before adopting it in new greenfield projects.

Active areas of work

Repository is in maintenance mode—the file structure shows stable examples and no breaking changes indicated in visible CHANGELOG/VERSION. Recent Go toolchain bump to 1.24.9 suggests ongoing compatibility work. Active CI via GitHub Actions; no specific PR/milestone data in snapshot, so check Issues and recent commits on GitHub for current focus.

🚀Get running

git clone https://github.com/gocolly/colly.git
cd colly
go mod download
go run _examples/basic/basic.go

Daily commands: Single package, no build/dev server. Run examples directly: go run _examples/basic/basic.go or go run _examples/parallel/parallel.go. For integration: import github.com/gocolly/colly/v2 in your own main.go and call colly.NewCollector() (see README example).

🗺️Map of the codebase

  • colly.go — Core Collector type and primary API; entry point for all scraping operations and request handling.
  • request.go — Request abstraction managing HTTP requests, callbacks, and request-level state throughout the scraping lifecycle.
  • response.go — Response wrapper providing parsed HTML/XML DOM access and callback execution during response processing.
  • context.go — Context storage for request/response-scoped data and state management across callback chains.
  • http_backend.go — HTTP client abstraction handling transport, proxy, cookie, and header management for all network requests.
  • htmlelement.go — DOM selector and element manipulation interface built on goquery for CSS selector parsing and traversal.
  • go.mod — Dependencies including goquery, htmlquery, xmlquery for parsing; robotstxt for compliance; chardet for encoding detection.

🧩Components & responsibilities

  • Collector — Orchestrates the scraping lifecycle: manages request/response callbacks, request queue, options (rate limit, proxies, cookies), and invokes user handlers at each stage

🛠️How to make changes

Add a new scraper for a website

  1. Create a new .go file in _examples/ with a func main() that creates a Collector via colly.NewCollector() (_examples/mysite/mysite.go)
  2. Call collector.OnRequest() to log the request being made (_examples/mysite/mysite.go)
  3. Call collector.OnHTML(selector) with goquery CSS selectors to extract data from parsed HTML elements (_examples/mysite/mysite.go)
  4. Call collector.OnError() to handle network or parsing errors (_examples/mysite/mysite.go)
  5. Call collector.Visit(url) to start the scraping; use e.Request.Visit() in callbacks to follow links (_examples/mysite/mysite.go)

Add a custom extension or middleware

  1. Create a new file in extensions/ (e.g., extensions/my_extension.go) with a public func that accepts *colly.Collector (extensions/my_extension.go)
  2. Call collector.OnRequest() or other On* callbacks to inject behavior before/after requests (extensions/my_extension.go)
  3. Register the extension in user code by calling MyExtension(collector) after NewCollector() (_examples/myapp/myapp.go)

Enable distributed crawling with a custom queue

  1. Implement the queue.Queue interface in storage/ (Push, Pop, Size, etc.) for Redis, SQL, or file-based persistence (storage/custom_queue.go)
  2. Pass the custom queue to collector via collector.SetQueue(myQueue) after NewCollector() (_examples/myapp/myapp.go)
  3. Call collector.Visit(url) to enqueue; use e.Request.Visit() in callbacks to enqueue discovered URLs (_examples/myapp/myapp.go)
  4. Run multiple instances of the scraper concurrently; each will pop from the shared queue (_examples/myapp/myapp.go)
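The queue.Queue interface and collector.SetQueue named above come from an AI-suggested section — verify them against colly's real queue and storage packages before coding against them. As a concept sketch only (the Queue interface here is hypothetical), a minimal thread-safe in-memory queue with the Push/Pop/Size shape that a Redis- or SQL-backed version would share:

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is the hypothetical interface named in the steps above;
// the real interfaces in the queue/storage packages may differ.
type Queue interface {
	Push(url string)
	Pop() (string, bool)
	Size() int
}

// memQueue is an in-memory, mutex-guarded implementation. A
// persistent backend would swap the slice for Redis/SQL calls.
type memQueue struct {
	mu    sync.Mutex
	items []string
}

func (q *memQueue) Push(url string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, url)
}

func (q *memQueue) Pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	url := q.items[0]
	q.items = q.items[1:]
	return url, true
}

func (q *memQueue) Size() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

func main() {
	var q Queue = &memQueue{}
	q.Push("https://example.com/a")
	q.Push("https://example.com/b")
	for {
		url, ok := q.Pop()
		if !ok {
			break
		}
		fmt.Println("would visit:", url)
	}
}
```

With a shared backend, each scraper instance pops from the same queue, which is what makes step 4's multi-instance setup safe.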

🔧Why these technologies

  • goquery (PuerkitoBio/goquery) — Provides jQuery-like CSS selector API for HTML parsing, making element selection and traversal intuitive for web scrapers
  • htmlquery (antchfx/htmlquery) — Enables XPath selectors for complex HTML queries; alternative to CSS selectors for power users
  • xmlquery (antchfx/xmlquery) — XPath-based XML/SOAP parsing for non-HTML documents and feeds
  • golang.org/x/net — Provides net, http, html packages for core HTTP and HTML parsing at the Go stdlib level
  • robotstxt (temoto/robotstxt) — Parses and respects robots.txt to ensure ethical, compliant scraping
  • chardet (saintfish/chardet) — Detects character encoding of responses to handle non-UTF8 pages correctly

⚖️Trade-offs already made

  • Single-threaded Collector with opt-in concurrency

    • Why: Simplicity and ease of use for simple scripts; developers add goroutines manually for parallelism
    • Consequence: No built-in thread-safety or request batching; users must manage concurrency and resource limits themselves
  • Callback-based architecture (OnRequest, OnHTML, OnError) rather than channels or futures

    • Why: Familiar imperative style; aligns with event-driven web scraping patterns
    • Consequence: Callback hell and error propagation complexity in deep callback chains; harder to compose sequential logic
  • Optional queue abstraction for distributed crawling

    • Why: Decouples crawl state from the core Collector; allows custom persistence backends
    • Consequence: Developers must implement their own queue for multi-machine crawling; no out-of-box distributed support
  • Extensible middleware via On* callbacks rather than a middleware-chain pattern

    • Why: Direct control over request/response lifecycle; flexible plugin points
    • Consequence: Extensions must be manually chained and ordered; harder to compose cross-cutting concerns

🚫Non-goals (don't propose these)

  • Real-time streaming or event-driven updates (batch/crawl-at-rest focused)
  • JavaScript rendering (no headless browser; static HTML/XML only)
  • Built-in distributed/multi-machine coordination (queue abstraction is user-implemented)
  • GraphQL or REST API client (HTML/XML scraping only)
  • Authentication frameworks beyond cookie/session management (no OAuth/SAML helpers)

🪤Traps & gotchas

  1. Self-dependency quirk: go.mod lists github.com/gocolly/colly v1.2.0 as a dependency while this is clearly colly v2 (import path colly/v2). This may be a legacy constraint or workspace issue — verify no circular dependency problems in your build.
  2. No visible test files in snapshot: Be prepared to look in root or colly/ subdirectory for *_test.go files; testing patterns not immediately obvious.
  3. Encoding auto-detection: Framework automatically detects non-Unicode via chardet; can mask encoding mismatches.
  4. Example-centric docs: Heavy reliance on _examples/ for learning; no detailed API godoc snippets visible, so rely on pkg.go.dev or source reading.
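A cheap guard for the encoding trap: check whether a response body is even valid UTF-8 before trusting downstream string handling. This stdlib sketch is not colly's chardet path — chardet also guesses legacy charsets — just a coarse sanity check:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validUTF8 reports whether a response body is already valid UTF-8.
// It only catches obvious mismatches; real charset detection (as
// chardet does) goes further.
func validUTF8(body []byte) bool {
	return utf8.Valid(body)
}

func main() {
	fmt.Println(validUTF8([]byte("héllo")))        // valid UTF-8
	fmt.Println(validUTF8([]byte{0xe9, 'l', 'l'})) // bare Latin-1 é: invalid
}
```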

🏗️Architecture

💡Concepts to learn

  • Request callback pipeline — Colly's OnRequest → OnHTML → OnError → OnScraped flow defines async execution; understanding callback ordering is critical for state management across requests
  • Token bucket rate limiting — Colly implements per-domain concurrency throttling to avoid IP bans; used in _examples/rate_limit/ and essential for production crawling
  • Robots.txt protocol — Built-in robots.txt parsing (temoto/robotstxt dependency) allows ethical crawling; Colly respects crawl-delay and disallow rules automatically
  • DOM selector engines (CSS and XPath) — Colly's OnHTML uses CSS selectors (goquery) and OnXML uses XPath (antchfx); mastering both is needed for flexible element extraction
  • Character encoding detection — Colly auto-detects and normalizes non-unicode responses via chardet; understanding charset handling prevents data corruption on international sites
  • HTTP session and cookie persistence — Colly auto-manages cookies and sessions (shown in _examples/login/); transparent jar handling enables stateful multi-request workflows without manual headers
  • Distributed crawling with queues — Example _examples/queue/queue.go hints at queue-based coordination for parallel/distributed scraping; understanding work queues unlocks horizontal scaling
  • PuerkitoBio/goquery — jQuery-style selector library that Colly wraps; understanding goquery is essential for OnHTML callbacks and element extraction
  • chromedp/chromedp — Alternative scraper for JavaScript-heavy sites; complementary to Colly for dynamic rendering needs Colly doesn't cover
  • asdf-vm/asdf — Not directly related, but useful for managing Go versions when testing Colly against multiple toolchains (go 1.24 in this repo)
  • antchfx/htmlquery — XPath query engine Colly uses internally; understanding xpath patterns unlocks OnXML and advanced selector capabilities
  • temoto/robotstxt — Robots.txt parser integrated into Colly; core dependency for respectful crawling and understanding bot directives

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for context.go and request context handling

The repo has context.go and a _examples/request_context/request_context.go example, but context_test.go likely has incomplete coverage. Context management is critical for scraping workflows (managing state across requests, handling request-scoped data). Adding thorough tests would ensure reliability and serve as documentation for contributors on proper context usage patterns.

  • [ ] Review existing context_test.go for coverage gaps
  • [ ] Add tests for context value setting/retrieval across request lifecycle
  • [ ] Add tests for context cleanup and garbage collection scenarios
  • [ ] Add tests for concurrent request contexts to ensure thread-safety
  • [ ] Reference the request_context example to ensure test scenarios match real-world usage

Add GitHub Actions workflow for dependency security scanning and automated updates

The repo has .github/workflows/ci.yml but there's no visible security scanning or dependabot configuration. With 11 dependencies including critical ones (golang.org/x/net, google.golang.org packages), automated vulnerability detection and version updates are essential for a widely-used scraping framework that users rely on for production workloads.

  • [ ] Add Dependabot configuration (.github/dependabot.yml) for Go module updates
  • [ ] Create GitHub Action workflow for security scanning using 'go list -json -m all | nancy sleuth' or similar
  • [ ] Add workflow step to run 'go mod verify' to detect tampering
  • [ ] Configure workflow to run on schedule (weekly) and on pull requests
  • [ ] Document security update process in CONTRIBUTING.md
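The first checklist item can be as small as this — a sketch of .github/dependabot.yml using the standard gomod ecosystem (the schedule and PR limit are arbitrary defaults to tune):

```yaml
version: 2
updates:
  - package-ecosystem: "gomod"   # watch go.mod / go.sum
    directory: "/"               # module lives at the repo root
    schedule:
      interval: "weekly"
    open-pull-requests-limit: 5
```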

Add integration tests for real-world scraping scenarios with example validation

The _examples directory has 20+ example scripts (basic, coursera_courses, instagram, login, etc.) but there's no automated validation that these examples work correctly. Adding integration tests that verify examples produce expected output would catch regressions early and ensure the documentation examples remain functional as dependencies update.

  • [ ] Create tests/integration_test.go or similar test file
  • [ ] Add test for _examples/basic/basic.go to verify it produces expected HTML parsing output
  • [ ] Add test for _examples/local_files/local_files.go using the provided HTML fixtures in _examples/local_files/html/
  • [ ] Add test for _examples/rate_limit/rate_limit.go to verify rate limiting behavior
  • [ ] Add test for _examples/error_handling/error_handling.go to verify error scenarios
  • [ ] Document in CONTRIBUTING.md how to add new example validation tests

🌿Good first issues

  • Add integration test covering the _examples/login/login.go scenario end-to-end (currently appears to only show code structure, not test coverage of session/cookie persistence)
  • Document rate limiting behavior in _examples/rate_limit/rate_limit.go with inline comments explaining per-domain queue mechanics and backoff calculation
  • Create a new example _examples/error_recovery/error_recovery.go showing retry logic, circuit breaker patterns, and graceful degradation on HTTP 429/503 responses

Top contributors


📝Recent commits

  • c01f0c6 — Merge pull request #746 from WGH-/gzip-probing (asciimoo)
  • a19645a — Merge pull request #811 from WGH-/fix-self-redirect-again (WGH-)
  • 9df0ecf — Fix more cases of pages redirecting to themselves (WGH-)
  • 5d24aa3 — Don't decompress gzip if data doesn't look like gzip (WGH-)
  • d49c222 — Add tests for automated .gz decompression (WGH-)
  • f37710b — Merge pull request #870 from chuanshanjida/master (asciimoo)
  • 79ef54f — refactor: replace Split in loops with more efficient SplitSeq (chuanshanjida)
  • cd1b0a5 — [fix] add link (asciimoo)
  • 26f58ef — [doc] add sponsor (asciimoo)
  • 1a15bf9 — [fix] remove unused asset (asciimoo)

🔒Security observations

  • High · Outdated golang.org/x/net Dependency — go.mod - golang.org/x/net v0.47.0. The dependency golang.org/x/net is pinned to version 0.47.0, which may lag the latest release. golang.org/x/net is critical for security: it handles network protocols (including HTTP/2 stream handling) and is frequently updated with security patches. Fix: Update golang.org/x/net to the latest stable version by running 'go get -u golang.org/x/net', then check the result against current Go security advisories.
  • High · Outdated google.golang.org/protobuf Dependency — go.mod - google.golang.org/protobuf v1.36.10. The indirect dependency google.golang.org/protobuf is pinned to version 1.36.10. While relatively recent, protobuf is used for serialization and deserialization; outdated versions may have known CVEs related to parsing malformed protobuf messages that could lead to DoS or code execution. Fix: Verify this is the latest available version. If newer versions are available, update using 'go get -u google.golang.org/protobuf' to ensure all security patches are applied.
  • Medium · Dependency Duplication - gocolly/colly v1 and v2 — go.mod - github.com/gocolly/colly v1.2.0. The go.mod file includes both 'github.com/gocolly/colly' (v1.2.0) and the module is 'github.com/gocolly/colly/v2'. This creates potential confusion and may lead to security issues if v1 dependencies have unpatched vulnerabilities or if the application inadvertently uses v1 code paths. Fix: Remove the v1 dependency if not explicitly needed. If v1 is required for compatibility, ensure it is the latest version and document why both versions are necessary. Consider migrating entirely to v2.
  • Medium · Missing SECURITY.md File — Repository root. The repository does not contain a SECURITY.md file with security policy and vulnerability disclosure process. This makes it difficult for security researchers to report vulnerabilities responsibly. Fix: Create a SECURITY.md file following GitHub's recommended format that includes: how to report vulnerabilities securely, expected response timeline, and supported versions receiving security updates.
  • Medium · No Input Validation Examples for Web Scraping — _examples/ and README.md. As a web scraping framework, Colly could be misused for malicious purposes. While examples exist in _examples/, there's no prominent security guidance on validating and sanitizing user input when using Colly to scrape untrusted sources, which could lead to XSS or injection attacks if output is used unsafely. Fix: Add security guidelines in README.md and create an example demonstrating secure input validation, HTML sanitization, and safe output handling when scraping untrusted websites.
  • Low · Outdated Go Toolchain Version — go.mod - toolchain go1.24.9. The go.mod specifies 'toolchain go1.24.9', pinning a specific patch release in the Go 1.24 series. Go 1.24 is recent, but patch releases frequently carry security fixes. Fix: Regularly check for Go security advisories at https://go.dev/security and update to the latest 1.24.x patch release to ensure all fixes are applied.
  • Low · Missing .gitignore Verification — Repository root. The file structure provided doesn't show a .gitignore file. Without proper .gitignore rules, sensitive files like .env, credentials, or private keys could accidentally be committed to the repository. Fix: Ensure .gitignore includes rules for: environment files (.env, .env.local), IDE configurations, build artifacts, and any files containing credentials or secrets. Standard Go .gitignore templates are available at gitignore.io.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
