
crawlab-team/crawlab

Distributed web crawler admin platform for managing spiders regardless of language or framework.

Healthy

Healthy across all four use cases

  • Use as dependency: Healthy. Permissive license, no critical CVEs, actively maintained — safe to depend on.
  • Fork & modify: Healthy. Has a license, tests, and CI — clean foundation to fork and modify.
  • Learn from: Healthy. Documented and popular — useful reference codebase to read through.
  • Deploy as-is: Healthy. No critical CVEs, sane security posture — runnable as-is.

  • Last commit 3mo ago
  • BSD-3-Clause licensed
  • CI configured
  • Tests present
  • ⚠ Solo or near-solo (1 contributor active in recent commits)

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/crawlab-team/crawlab)](https://repopilot.app/r/crawlab-team/crawlab)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/crawlab-team/crawlab on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: crawlab-team/crawlab

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/crawlab-team/crawlab shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Map of the codebase.

🎯Verdict

GO — Healthy across all four use cases

  • Last commit 3mo ago
  • BSD-3-Clause licensed
  • CI configured
  • Tests present
  • ⚠ Solo or near-solo (1 contributor active in recent commits)

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live crawlab-team/crawlab repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/crawlab-team/crawlab.

What it runs against: a local clone of crawlab-team/crawlab — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in crawlab-team/crawlab | Confirms the artifact applies here, not a fork |
| 2 | License is still BSD-3-Clause | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 116 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>crawlab-team/crawlab</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of crawlab-team/crawlab. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/crawlab-team/crawlab.git
#   cd crawlab
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of crawlab-team/crawlab and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "crawlab-team/crawlab(\.git)?\b" \
  && ok "origin remote is crawlab-team/crawlab" \
  || miss "origin remote is not crawlab-team/crawlab (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(BSD-3-Clause)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"BSD-3-Clause\"" package.json 2>/dev/null) \
  && ok "license is BSD-3-Clause" \
  || miss "license drift — was BSD-3-Clause at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "backend/main.go" \
  && ok "backend/main.go" \
  || miss "missing critical file: backend/main.go"
test -f "core/cmd/server.go" \
  && ok "core/cmd/server.go" \
  || miss "missing critical file: core/cmd/server.go"
test -f "core/apps/server_v2.go" \
  && ok "core/apps/server_v2.go" \
  || miss "missing critical file: core/apps/server_v2.go"
test -f "core/apps/interfaces.go" \
  && ok "core/apps/interfaces.go" \
  || miss "missing critical file: core/apps/interfaces.go"
test -f "backend/conf/config.yml" \
  && ok "backend/conf/config.yml" \
  || miss "missing critical file: backend/conf/config.yml"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 116 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~86d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/crawlab-team/crawlab"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Crawlab is a Go-based distributed web crawler management platform that provides a unified UI to deploy, schedule, and monitor spiders written in any language (Python, Node.js, Go, Java, PHP) across a cluster of worker nodes. It abstracts away language and framework differences, letting teams manage Scrapy, Puppeteer, Selenium, and custom crawlers from a single master control plane. Monorepo structure: /backend contains the main Go application (backend/main.go, backend/go.mod); /core houses the shared kernel as a separate module (github.com/crawlab-team/crawlab/core); /db, /fs, /grpc, /template-parser, /trace, and /vcs are modular packages pulled into backend/go.mod via local replace directives. Deployment is Docker-first, with a Dockerfile at the root plus backend/Dockerfile; configuration lives in backend/conf/config.yml, and test configs for a master plus 3 worker nodes are in backend/test/.

👥Who it's for

DevOps engineers and data engineering teams who operate web scraping infrastructure at scale and need centralized job scheduling, monitoring, and result aggregation across heterogeneous spider codebases without rewriting crawlers to fit a specific framework.

🌱Maturity & risk

Actively maintained with recent commits, published Docker images, multi-environment CI/CD via GitHub Actions (docker-crawlab.yml, docker-crawlab-tencent.yml), and semantic versioning with changelog entries (CHANGELOG.md, CHANGELOG-zh.md). Supports Go 1.22 in production deployments and has a live demo at demo.crawlab.cn — production-ready but relatively niche compared to mainstream projects.

Moderate risk: monorepo with internal module dependencies (core, db, fs, grpc, template-parser, trace, vcs) that can create tight coupling and version-alignment issues; the primary language is Go (roughly 1.1 MB of source) with minimal Python/Shell glue (502 bytes of Python). Relies on external MongoDB for state; the single master node is a potential single point of failure. No visible test-coverage metrics in the file listing. Bilingual docs (English/Chinese) suggest international maintenance but could indicate split focus.

Active areas of work

Recent activity includes Go 1.22 adoption, Docker multi-registry support (Tencent Cloud + Docker Hub via separate workflows), and version management via bin/update-ver.sh. No open PR visibility in file list, but presence of .air.master.conf and .air.worker.conf suggests active hot-reload development setup. SECURITY.md exists, indicating recent governance attention.

🚀Get running

Clone and run via Docker Compose (no local Go build needed for a demo): git clone https://github.com/crawlab-team/examples && cd examples/docker/basic && docker-compose up -d. To develop locally: cd backend && go mod download, configure backend/conf/config.yml with your MongoDB connection, then run the server directly or use air for hot reload via .air.master.conf.

Daily commands: Docker: docker-compose up -d (see README snippet docker-compose.yml). Local backend: cd backend && go run main.go with CRAWLAB_NODE_MASTER=Y and CRAWLAB_MONGO_HOST=localhost env vars set, MongoDB running on localhost:27017. Use air for hot reload: air -c .air.master.conf (master) or .air.worker.conf (worker).

🗺️Map of the codebase

  • backend/main.go — Backend service entry point; understand how the master/worker nodes initialize and bootstrap the crawlab platform
  • core/cmd/server.go — Core server bootstrap logic that orchestrates node startup, registration, and service initialization
  • core/apps/server_v2.go — Main API server application logic; defines how HTTP requests are routed and handled in v2 API
  • core/apps/interfaces.go — Core application interfaces that define contracts for services, controllers, and middleware; foundational to extensibility
  • backend/conf/config.yml — Default configuration schema showing how crawlab configures masters, workers, databases, and gRPC services
  • go.mod — Dependency manifest; shows use of core modules (core, db, fs, grpc, vcs) and external libraries for distributed coordination
  • core/constants/node.go — Node constants defining master/worker roles and node state machines; essential for understanding distributed architecture

🧩Components & responsibilities

  • Node (Master or Worker) (Go, config.yml, gRPC) — Single running instance of backend/main.go configured as either master (schedules tasks, aggregates results) or worker (receives tasks, executes spiders)
    • Failure mode: Node crashes → tasks on that node are orphaned.

🛠️How to make changes

Add a new v2 API endpoint

  1. Define the controller function in core/controllers/ (e.g., core/controllers/spider_v2.go with GetSpiders, CreateSpider methods) (core/controllers/base_file_v2.go)
  2. Implement the IController interface from core/apps/interfaces.go with GetPath() returning the route prefix (core/apps/interfaces.go)
  3. Register the controller in core/apps/api_v2.go by adding it to the RegisterRoutes() function's router groups (core/apps/api_v2.go)
  4. Add the controller to the DI container in core/container/container.go so it can be instantiated at startup (core/container/container.go)
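
As a rough illustration of steps 1 and 2, a minimal controller might look like the sketch below. It assumes IController only requires GetPath() and that handlers are plain Gin functions; the real interface in core/apps/interfaces.go may require more, so check it before copying. The "widget" domain and all names here are made up.

```go
// Hypothetical sketch, not actual crawlab code.
package controllers

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// WidgetController is an illustrative v2 controller.
type WidgetController struct{}

// GetPath returns the route prefix (step 2); RegisterRoutes() in
// core/apps/api_v2.go would mount the controller under this path (step 3).
func (c *WidgetController) GetPath() string {
	return "/widgets"
}

// GetList would back GET /widgets once the route is registered.
func (c *WidgetController) GetList(ctx *gin.Context) {
	ctx.JSON(http.StatusOK, gin.H{"data": []string{}})
}
```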

Add a new configuration setting

  1. Add the new field to the config struct in core/config/config.go with YAML tags matching backend/conf/config.yml naming (core/config/config.go)
  2. Add the setting to backend/conf/config.yml with a sensible default value and comment (backend/conf/config.yml)
  3. Update core/config/base.go if the setting requires custom parsing or validation logic (core/config/base.go)
  4. Access the setting in services via the config instance injected from the container (core/container/container.go)
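
Step 1 in sketch form, assuming the struct uses plain YAML tags (the field name max_task_retries is hypothetical; mirror whatever core/config/config.go actually declares):

```go
// Hypothetical sketch of the config struct in core/config/config.go.
package config

// Config mirrors backend/conf/config.yml; existing fields are elided.
type Config struct {
	// MaxTaskRetries is the illustrative new setting. The yaml tag must
	// match the key you add to backend/conf/config.yml in step 2.
	MaxTaskRetries int `yaml:"max_task_retries"`
}
```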

Add a new service layer

  1. Define the service interface in core/apps/interfaces.go extending IService with your business logic methods (core/apps/interfaces.go)
  2. Implement the service in a new file under core/ (e.g., core/services/spider_service.go) with dependency injection of other services (core/apps/interfaces.go)
  3. Register the service factory in core/container/container.go to be injected into controllers and other services (core/container/container.go)
  4. Inject the service into controllers via constructor and call its methods from API endpoints (core/apps/api_v2.go)
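
A skeletal version of steps 1 through 3, with every name hypothetical (the real IService contract lives in core/apps/interfaces.go, and the real container API lives in core/container/container.go):

```go
// Hypothetical sketch, not actual crawlab code.
package services

// SpiderRunner is an illustrative service interface (step 1).
type SpiderRunner interface {
	RunSpider(id string) error
}

type spiderRunner struct {
	// collaborators injected by the DI container (step 3) would live here
}

// NewSpiderRunner is the factory the container would register so the
// service can be injected into controllers (step 4).
func NewSpiderRunner() SpiderRunner {
	return &spiderRunner{}
}

func (s *spiderRunner) RunSpider(id string) error {
	// business logic goes here
	return nil
}
```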

🔧Why these technologies

  • Go (1.22) with Gin framework — High-performance concurrent HTTP server suitable for distributed crawler coordination; easy multi-platform cross-compilation for master/worker nodes
  • MongoDB/MySQL (via db module) — Scalable persistent storage for spider definitions, task queues, and execution results across distributed nodes
  • gRPC (via grpc module) — Efficient inter-node communication for master-to-worker task distribution, result collection, and cluster coordination
  • Docker (Dockerfile, docker-init.sh) — Containerized deployment ensuring consistent runtime environment across master and worker nodes in production
  • Local module replaces (core, db, fs, grpc, vcs, trace) — Monorepo-style organization allowing independent versioning and deployment of core abstraction layers while maintaining tight integration
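
Concretely, a local module replace in backend/go.mod looks like this (paths illustrative; check the actual file for the full list):

```
require github.com/crawlab-team/crawlab/core v0.0.0

// Resolve the core module from the sibling directory instead of a
// published release, so backend and core can change in one commit.
replace github.com/crawlab-team/crawlab/core => ../core
```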

⚖️Trade-offs already made

  • Dependency Injection container pattern over global singletons

    • Why: Enables testability, loose coupling, and late binding of services at runtime
    • Consequence: Slightly more verbose instantiation code but easier to mock and extend services without modifying core bootstrap
  • Local module replaces instead of vendoring core dependencies

    • Why: Allows development of core, db, grpc modules alongside backend without publishing intermediate releases
    • Consequence: Requires developers to manage local paths; CI/CD must handle module paths correctly during builds
  • Separate master and worker nodes via configuration instead of runtime detection

    • Why: Explicit node role declaration simplifies deployment, scaling, and role-based logic branches
    • Consequence: Operators must correctly configure node roles; misconfiguration could lead to duplicate masters or orphaned workers

🚫Non-goals (don't propose these)

  • Embedded crawler framework (does not include spider execution engine; orchestrates external spiders)
  • Real-time WebSocket communication (uses HTTP polling for task status; no push updates)
  • Multi-tenancy isolation (shared single database; no tenant-level access controls)

🪤Traps & gotchas

  • MongoDB must be running and reachable at CRAWLAB_MONGO_HOST before master/worker startup — no fallback or embedded DB.
  • Master role assignment requires the manual CRAWLAB_NODE_MASTER=Y flag per container; no leader-election logic is visible.
  • gRPC communication between master and workers assumes network connectivity on the default port (check backend/.air.worker.conf for the gRPC port); firewalled workers will fail silently.
  • Monorepo replace directives in go.mod mean local ../core and ../db changes break builds if submodule paths shift.
  • Version bumping goes through bin/update-ver.sh — run it before release or the version baked into the Docker image will be stale.
  • Configuration file format (YAML vs JSON) differs between the Dockerfile default and the test configs; use the correct format for your environment.
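
The master/worker split is the easiest of these to get wrong. A minimal sketch of env-driven role selection, assuming the convention described above (the real bootstrap in core/cmd/server.go is more involved):

```go
// Hypothetical sketch of role selection, not actual crawlab code.
package main

import (
	"log"
	"os"
)

func main() {
	// CRAWLAB_NODE_MASTER=Y selects the master role; anything else,
	// including unset, means worker. Run exactly one master per cluster.
	if os.Getenv("CRAWLAB_NODE_MASTER") == "Y" {
		log.Println("starting as master: scheduling tasks, aggregating results")
		// bootstrap master services here
	} else {
		log.Println("starting as worker: executing tasks from the master")
		// bootstrap worker services here
	}
}
```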

🏗️Architecture

💡Concepts to learn

  • Master-Worker Distributed Architecture — Crawlab's core topology uses a single master node to coordinate scheduling and monitor multiple stateless worker nodes; understanding this pattern is essential to troubleshooting node communication failures and scaling bottlenecks
  • gRPC (Remote Procedure Call) — Master and worker nodes communicate via gRPC (github.com/crawlab-team/crawlab/grpc module); you need to understand protobuf message serialization and streaming to extend inter-node APIs or debug protocol errors
  • MongoDB & Document Stores — Spider metadata, job results, and scheduling state live in MongoDB; schema design and query performance directly impact dashboard responsiveness and job throughput
  • Container Orchestration & Docker Compose Networking — Quickstart uses docker-compose networking (CRAWLAB_MONGO_HOST: 'mongo' service name); understanding service discovery via hostnames and container restart policies is critical for production deployments
  • Environment-Driven Configuration — Node role (master vs worker) is determined by CRAWLAB_NODE_MASTER env var, not config file; this pattern allows single Dockerfile to serve dual roles but requires disciplined env var management
  • Monorepo with Local Module Replacements — backend/go.mod uses 'replace' directives to import ../core, ../db, ../grpc as local modules during development; breaking this contract requires rebuilding and retesting multiple modules simultaneously
  • Template Parsing & Variable Substitution — The crawlab-team/crawlab/template-parser module suggests spider arguments, cron expressions, or result field mappings use template syntax; misunderstanding template scope could lead to injection vulnerabilities or unexpected variable binding
  • scrapy-cloud/scrapy-cloud — Alternative distributed Scrapy cloud platform; represents competing approach to same problem but Scrapy-specific vs Crawlab's language-agnostic design
  • airbytehq/airbyte — Related ELT/orchestration platform with similar distributed execution model and multi-language connector support; different problem domain (data pipelines vs spiders) but shares worker-pool architecture pattern
  • crawlab-team/crawlab-core — Actual /core module location (referenced as github.com/crawlab-team/crawlab/core in go.mod); separate repo containing the kernel logic extracted from main monorepo
  • apify/crawlee — TypeScript-based web crawling framework with cloud execution option; represents modern lightweight alternative to Crawlab for single-language teams
  • crawlab-team/examples — Companion repo with runnable docker-compose examples and spider templates; referenced in quickstart section of main README

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for backend gRPC services

The repo has a gRPC module (github.com/crawlab-team/crawlab/grpc) and backend server implementation (core/apps/server_v2.go, core/apps/api_v2.go), but backend/test/ only contains configuration files for master/worker nodes with no actual test cases. Adding integration tests would ensure the distributed communication layer works correctly across master-worker architecture.

  • [ ] Create backend/test/grpc_integration_test.go with test cases for master-worker communication
  • [ ] Add test cases for gRPC service initialization and message passing in core/apps/
  • [ ] Implement test fixtures using existing config files in backend/test/config-*.json
  • [ ] Ensure tests validate spider task distribution and result collection
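
A starting point for the first checklist item might look like the skeleton below. The port 9666 and insecure transport are assumptions; pull the real gRPC address out of the backend/test/config-*.json files before using it.

```go
// Hypothetical skeleton for backend/test/grpc_integration_test.go.
package test

import (
	"context"
	"testing"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// TestMasterGRPCReachable verifies a locally running master accepts gRPC
// connections before any message-level assertions are attempted.
func TestMasterGRPCReachable(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Address is an assumption; read it from backend/test/config-*.json.
	conn, err := grpc.DialContext(ctx, "localhost:9666",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		t.Fatalf("master gRPC endpoint unreachable: %v", err)
	}
	defer conn.Close()
}
```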

Add GitHub Actions workflow for backend Go linting and code quality

The repo has Docker CI workflows (.github/workflows/docker-*.yml) but no dedicated Go linting/testing workflow for the backend. With Go 1.22 as the target version and multiple interdependent modules (core, db, fs, grpc, trace, vcs), a proper CI workflow would catch regressions early and maintain code quality standards.

  • [ ] Create .github/workflows/backend-lint.yml to run golangci-lint on backend/ and core/ modules
  • [ ] Add go test ./... steps for all replaced modules referenced in backend/go.mod
  • [ ] Include go mod tidy and go mod verify checks to ensure dependency consistency
  • [ ] Configure the workflow to trigger on PR changes to backend/, core/, and go.mod files

Create comprehensive API documentation for core/apps/api_v2.go endpoints

The V2 API implementation (core/apps/api_v2.go) exists but there's no generated API documentation. With this being a distributed platform managing spiders across multiple nodes, API consumers (CLI tools, frontend, external integrations) need clear endpoint documentation. This is especially important given the bilingual nature of the project (English/Chinese).

  • [ ] Add OpenAPI/Swagger annotations to all endpoint handlers in core/apps/api_v2.go
  • [ ] Generate API documentation using swag CLI in the Taskfile.yml build process
  • [ ] Create docs/api/ directory with OpenAPI spec output and endpoint reference (in both English and Chinese)
  • [ ] Update backend/README.md with API documentation links and example usage for core features (task submission, spider management, worker registration)
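
The first checklist item would look roughly like this with swag's declarative comment format (handler name, route, and response shape are placeholders, not crawlab's actual API):

```go
// Hypothetical swag-annotated handler, not actual crawlab code.
package controllers

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// GetSpiderList godoc
// @Summary      List spiders
// @Description  Returns all spiders registered on the platform
// @Tags         spiders
// @Produce      json
// @Success      200  {object}  map[string]interface{}
// @Router       /spiders [get]
func GetSpiderList(ctx *gin.Context) {
	ctx.JSON(http.StatusOK, gin.H{"data": []string{}})
}
```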

🌿Good first issues

  • Add unit tests for the backend's local modules (db, fs, grpc, trace) — backend/test/ currently holds only node config files, with no _test.go files visible; start by testing backend/conf/config.yml parsing
  • Document the gRPC service contract between master and worker nodes — /grpc module is imported but no .proto files visible in file listing; create proto definitions and comments for new contributors onboarding
  • Create a Helm chart or Kubernetes manifest examples — current quickstart only covers docker-compose; add /k8s or /helm directory with StatefulSet for master + Deployment for workers, addressing the single-master bottleneck
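
For the first issue, a config-parsing test could start as small as the sketch below. It uses gopkg.in/yaml.v3 purely for illustration; match whatever loader core/config actually uses, and fix the relative path to wherever the test file lands.

```go
// Hypothetical sketch of a first test for config loading.
package test

import (
	"os"
	"testing"

	"gopkg.in/yaml.v3"
)

func TestDefaultConfigParses(t *testing.T) {
	// Path assumes the test lives in backend/test/; adjust as needed.
	raw, err := os.ReadFile("../conf/config.yml")
	if err != nil {
		t.Fatalf("read config: %v", err)
	}
	var cfg map[string]interface{}
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		t.Fatalf("config.yml is not valid YAML: %v", err)
	}
	if len(cfg) == 0 {
		t.Fatal("config.yml parsed to an empty document")
	}
}
```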

Top contributors


📝Recent commits

  • 0485310 — Merge pull request #1509 from crawlab-team/develop (tikazyq)
  • 4f52936 — refactor: integrated database services into task data insert (tikazyq)
  • 0ed2150 — feat: added performance monitoring for elasticsearch (tikazyq)
  • e15c3c9 — feat: added performance monitoring for database (tikazyq)
  • 64d664b — feat: updated database model (tikazyq)
  • 204021d — feat: updated database model (tikazyq)
  • a3e1751 — chore: ci cleanup (tikazyq)
  • 4807c0a — feat: added database table rows crud (tikazyq)
  • aa0cb96 — feat: added index operations for mysql service (tikazyq)
  • f566689 — feat: updated database services (tikazyq)

🔒Security observations

  • High · Use of 'latest' Tag in Multi-Stage Docker Build — Dockerfile (lines 1-3). The Dockerfile uses 'latest' tags for base images (crawlabteam/crawlab-backend:latest, crawlabteam/crawlab-frontend:latest, crawlabteam/crawlab-base:latest). Using 'latest' tags introduces supply chain risks as the image can change unexpectedly, potentially introducing vulnerabilities or breaking changes without explicit version control. Fix: Pin specific image versions (e.g., crawlabteam/crawlab-backend:v0.6.0 instead of :latest) to ensure reproducible and auditable builds. Implement a process to regularly update and test pinned versions.
  • High · JWT Dependency Needs Ongoing Audit — go.mod (github.com/golang-jwt/jwt/v5 v5.2.1). The project uses github.com/golang-jwt/jwt/v5 v5.2.1. While this is a recent version, JWT library implementations can carry cryptographic vulnerabilities, so the dependency should be audited regularly and kept up to date. Fix: Run regular dependency audits with govulncheck and 'go mod graph'. Monitor security advisories for JWT libraries and update immediately upon vulnerability disclosure.
  • Medium · Multiple Indirect Dependencies Without Version Pinning — go.mod (indirect dependencies). The go.mod file contains multiple indirect dependencies marked as 'indirect' that are pulled transitively. These indirect dependencies may introduce vulnerabilities that are harder to track and audit. Notable examples include google.golang.org/cloud, ProtonMail/go-crypto, and others. Fix: Regularly run 'go mod tidy' and 'go list -u -m all' to identify outdated dependencies. Use 'go mod graph' to understand dependency chains. Consider using tools like Dependabot or Snyk for continuous monitoring.
  • Medium · Underspecified Security Policy — SECURITY.md. The SECURITY.md vulnerability-reporting process lacks specific security contact details, response-time SLAs, and a secure reporting mechanism (e.g., no mention of a PGP key or a responsible-disclosure platform like HackerOne). This may delay vulnerability patching. Fix: Enhance the security policy to include a dedicated security contact email, response-time commitments (e.g., 48-72 hours), a secure communication channel (PGP key for encrypted reports), and a clear responsible-disclosure timeline before public disclosure.
  • Medium · Missing Security Headers Configuration — Dockerfile (nginx configuration) and ./nginx/crawlab.conf (not fully visible). While nginx configuration files are referenced in the Dockerfile (/etc/nginx/conf.d/crawlab.conf), there is no visibility into security headers (CSP, X-Frame-Options, X-Content-Type-Options, HSTS, etc.) being properly configured. Fix: Ensure the nginx configuration includes security headers: Content-Security-Policy, X-Frame-Options: DENY, X-Content-Type-Options: nosniff, Strict-Transport-Security, and X-XSS-Protection headers.
  • Medium · No Explicit Secret Management for Configuration Files — backend/conf/config.yml and backend/test/config-*.json. Configuration files (backend/conf/config.yml, backend/test/config-*.json) are stored in the repository. While test configs are acceptable, production configuration should never contain secrets, database credentials, or API keys. Fix: Use environment variables or a secrets management system (Kubernetes Secrets, HashiCorp Vault, AWS Secrets Manager) for sensitive data. Add *.local and sensitive config patterns to .gitignore. Document a secret rotation policy.
  • Low · Limited Dependency Visibility on Public Dependencies — go.mod (montanaflynn/stats). The go.mod file shows dependencies like montanaflynn/stats v0.0.0-20171201202039-1bf9dbcd8cbe, which appears to have an old, non-standard version format, potentially indicating a fork or unmaintained package. Fix: Audit all dependencies to ensure they are from official, actively-maintained projects. Replace or update any packages showing signs of abandonment (no recent commits, outdated versions).

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
