dotnetcore/DotnetSpider
DotnetSpider is a .NET Standard web crawling library: a lightweight, efficient, and fast high-level web crawling & scraping framework.
Healthy across all four use cases
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit 5w ago
- ✓ 10 active contributors
- ✓ MIT licensed
- ✓ Tests present
- ⚠ Single-maintainer risk — top contributor 88% of recent commits
- ⚠ No GitHub Actions workflows detected — CI runs via Azure Pipelines (azure-pipelines.yml)
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/dotnetcore/dotnetspider)

Paste at the top of your README.md — renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/dotnetcore/dotnetspider on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: dotnetcore/DotnetSpider
Generated by RepoPilot · 2026-05-10 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/dotnetcore/DotnetSpider shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- Last commit 5w ago
- 10 active contributors
- MIT licensed
- Tests present
- ⚠ Single-maintainer risk — top contributor 88% of recent commits
- ⚠ No GitHub Actions workflows detected — CI runs via Azure Pipelines (azure-pipelines.yml)
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live dotnetcore/DotnetSpider
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/dotnetcore/DotnetSpider.
What it runs against: a local clone of dotnetcore/DotnetSpider — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in dotnetcore/DotnetSpider | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | Last commit ≤ 67 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of dotnetcore/DotnetSpider. If you don't
# have one yet, run these first:
#
# git clone https://github.com/dotnetcore/DotnetSpider.git
# cd DotnetSpider
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of dotnetcore/DotnetSpider and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "dotnetcore/DotnetSpider(\.git)?\b" \
  && ok "origin remote is dotnetcore/DotnetSpider" \
  || miss "origin remote is not dotnetcore/DotnetSpider (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 67 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~37d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/dotnetcore/DotnetSpider"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
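That composition can be made explicit with a small gate helper. This is a sketch only, and `verify.sh` is a placeholder name for wherever you saved the Verify block:

```shell
# gate CMD... : run a verifier command and only report success when it
# passes; callers can chain the real work after it.
gate() {
  if "$@"; then
    echo "verified"
  else
    echo "stale: regenerate the artifact before editing" >&2
    return 1
  fi
}

# Usage in an agent loop ("verify.sh" is a placeholder filename):
#   gate bash verify.sh && apply-edits
```

Because `gate` propagates the verifier's exit status, it slots into any `&&`/`||` pipeline an agent already uses.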
⚡TL;DR
DotnetSpider is a .NET Standard web crawling and scraping framework that provides lightweight, high-performance distributed web data collection. It handles HTML parsing, entity extraction, and multi-agent orchestration via RabbitMQ, with pluggable storage backends (HBase, MongoDB, SQL Server) for storing crawled data at scale. Modular architecture split across src/DotnetSpider.* projects: Core spider logic, distributed Agent workers, AgentCenter management hub, and database adapters (HBase, Mongo) as plugins. Docker Compose orchestrates the full stack (agent.yml, portal.yml, rabbitmq.yml, hbase.yml); build scripts (build.sh, publish_*.sh) handle multi-component deployment.
👥Who it's for
.NET developers building web scraping pipelines who need distributed crawling across multiple agents, entity mapping to strongly-typed C# objects, and centralized job management via the AgentCenter portal—particularly useful for data collection teams handling large-scale, continuous scraping workloads.
🌱Maturity & risk
Actively developed and production-ready: the repo shows Azure Pipelines CI/CD, NuGet packaging, Docker support for Agent/Portal/Spiders, and a mature multi-component architecture (Agent, AgentCenter, storage modules). The codebase spans 750K+ lines of C#, indicating substantial feature completeness.
Moderate risk: the project depends on multiple heavy services (RabbitMQ, HBase, MongoDB, SQL Server, PostgreSQL—see docker-compose/) creating complex local dev setup; primarily maintained by a single author (zlzforever) based on GitHub handle prevalence; the Portal and AgentCenter components are separate deployables that add operational complexity.
Active areas of work
No recent commit data visible in the file list, but the project maintains Azure Pipelines CI/CD (azure-pipelines.yml), publishes to both NuGet and MyGet feeds (publish_nuget.sh, publish_myget.sh), and supports containerized deployment. The presence of agent-register logs dated 2020-04 suggests historical activity; current status unclear from static file list.
🚀Get running
git clone https://github.com/dotnetcore/DotnetSpider.git
cd DotnetSpider
dotnet restore
# Start dependencies in the background (requires Docker; without -d the
# first command blocks and the later ones never run):
docker-compose -f docker-compose/mysql.yml up -d
docker-compose -f docker-compose/redis.yml up -d
docker-compose -f docker-compose/rabbitmq.yml up -d
dotnet build DotnetSpider.sln
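Before building, it can help to confirm the dependency containers are actually listening. A sketch using bash's `/dev/tcp`; the ports are the upstream image defaults (3306 for MySQL, 6379 for Redis, 5672 for RabbitMQ), so adjust them if the compose files map different ones:

```shell
# probe PORT: report whether something is listening on localhost:PORT.
# Uses bash's built-in /dev/tcp, so no extra tools are required.
probe() {
  if (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null; then
    echo "port $1: listening"
  else
    echo "port $1: not reachable"
  fi
}

# Assumed default ports; check docker-compose/*.yml for overrides.
for port in 3306 6379 5672; do probe "$port"; done
```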
Daily commands:
# Run AgentCenter (web UI for job management):
cd src/DotnetSpider.AgentCenter
dotnet run --configuration Release
# Run Agent (worker that executes crawling jobs):
cd src/DotnetSpider.Agent
dotnet run --configuration Release
# Or use Docker Compose:
docker-compose -f docker-compose/portal.yml up
docker-compose -f docker-compose/agent.yml up
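Scaling out means several Agent processes, each with a unique name (see Traps & gotchas below). A sketch of generating those names; `AGENT_NAME` is an assumed setting, so confirm the real configuration key in src/DotnetSpider.Agent before relying on it:

```shell
# agent_name N: derive a unique, zero-padded name for the Nth agent.
agent_name() { printf 'agent-%02d' "$1"; }

for i in 1 2 3; do
  name=$(agent_name "$i")
  echo "would start $name"
  # Hypothetical launch line; AGENT_NAME is an assumption, not a
  # documented setting:
  # AGENT_NAME="$name" dotnet run --project src/DotnetSpider.Agent -c Release &
done
```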
🗺️Map of the codebase
- src/DotnetSpider.Agent/Program.cs: Entry point for the distributed worker agent that connects to RabbitMQ and executes crawling jobs
- src/DotnetSpider.AgentCenter/Program.cs: ASP.NET Core management hub that orchestrates job distribution to agents and monitors crawling progress
- src/DotnetSpider.HBase/HBaseStorage.cs: HBase persistence adapter demonstrating the storage plugin pattern for extracted entity data
- src/DotnetSpider.Mongo/MongoEntityStorage.cs: MongoDB entity storage implementation showing alternative persistence backend
- docker-compose/rabbitmq.yml: RabbitMQ service definition critical for agent-to-center communication in the distributed system
- Directory.Build.props: Central MSBuild properties file controlling version, build output, and shared NuGet package metadata across all projects
- azure-pipelines.yml: CI/CD pipeline definition for automated builds, tests, and NuGet/MyGet publishing
🛠️How to make changes
For new spider logic: examine src/DotnetSpider.Sample/samples/BaseUsageSpider.cs and entity definitions in the same folder. For storage: modify src/DotnetSpider.HBase/HBaseStorage.cs or src/DotnetSpider.Mongo/MongoEntityStorage.cs. For distributed features: check src/DotnetSpider.Agent/Program.cs and the RabbitMQ configuration in docker-compose/rabbitmq.yml. Build scripts (build.sh, publish_nuget.sh, publish_myget.sh) control packaging; update azure-pipelines.yml for pipeline changes.
🪤Traps & gotchas
Docker Compose requires specific service startup order (MySQL → Redis → RabbitMQ → HBase before agents can connect); default credentials hardcoded in docker-compose files (root:1qazZAQ! for MySQL, user:password for RabbitMQ) must be changed in production. Agent and AgentCenter expect RabbitMQ connectivity—missing RabbitMQ will cause silent connection failures. The HBase integration requires HBase 1.x+ and Java 8+; version mismatches cause cryptic serialization errors. Portal and Agent are separate Docker images; scaling requires running multiple agent containers with unique agent names.
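Since a missing RabbitMQ fails silently, a short wait-loop before starting agents is cheap insurance. A sketch using bash's `/dev/tcp`; 5672 is RabbitMQ's upstream default port:

```shell
# wait_for_port HOST PORT [TRIES]: poll once per second until PORT
# accepts a TCP connection; returns non-zero after TRIES attempts.
wait_for_port() {
  local host=$1 port=$2 tries=${3:-30} i=0
  while [ "$i" -lt "$tries" ]; do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      echo "$host:$port is up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "$host:$port not reachable after $tries attempts" >&2
  return 1
}

# e.g. wait_for_port 127.0.0.1 5672 60 && docker-compose -f docker-compose/agent.yml up
```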
💡Concepts to learn
- Distributed web scraping with message queues — DotnetSpider uses RabbitMQ to decouple job submission (AgentCenter) from execution (Agent workers); understanding producer-consumer patterns is essential to extend the crawling cluster
- CSS selectors and XPath for HTML extraction — Core to defining entity mappings in DotnetSpider; the framework relies on selector-based extraction to map HTML fragments to C# properties
- Attribute-driven ORM (Object-Relational Mapping) — DotnetSpider uses C# attributes on entity classes to declaratively define extraction rules and storage targets, similar to Entity Framework; critical for understanding how crawled data maps to databases
- Horizontal scaling with stateless workers — The Agent is stateless by design so multiple instances can run in parallel; understanding idempotency and distributed state in the AgentCenter is needed for reliable scaling
- Container orchestration (Docker Compose and implicit Kubernetes support) — DotnetSpider is designed for containerized deployment; the docker-compose files and separate Agent/Portal/Spiders Dockerfiles show the system assumes container-based infrastructure
- Pluggable persistence backends (HBase, MongoDB, SQL Server) — The framework abstracts storage via adapter classes (HBaseStorage, MongoEntityStorage); learning to implement a new storage backend teaches the plugin architecture pattern
- Rate limiting and politeness in web scraping — Essential for ethical crawling; DotnetSpider likely includes request throttling and robots.txt handling, though this is not visible in the static files — understanding these concepts prevents IP bans and server overload
🔗Related repos
- ScrapySharp/ScrapySharp — Alternative .NET web scraping library with a simpler API; useful comparison for design decisions around distributed vs. single-process architecture
- dotnetcore/Util — Companion library from the same .NET Core Community; DotnetSpider likely depends on or shares utilities from this project
- louthy/language-ext — Provides functional programming primitives (Option, Either, Pipe) that DotnetSpider may use for error handling and composition
- serilog/serilog — Structured logging framework commonly integrated into .NET crawlers for distributed tracing across Agent/AgentCenter components
- AutoMapper/AutoMapper — Entity mapping library that could simplify the ORM-style attribute-driven extraction used in DotnetSpider's entity definitions
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add unit tests for DotnetSpider.MySql scheduler implementations
The src/DotnetSpider.MySql/Scheduler directory contains three scheduler implementations (BFS, DFS, and base Queue scheduler) but there's no evidence of dedicated unit tests for these critical components. Testing scheduler behavior (queue ordering, deduplication, task distribution) is essential for a web crawling framework. This would improve reliability for distributed crawling scenarios.
- [ ] Create tests/DotnetSpider.MySql.Tests/Scheduler/ directory structure
- [ ] Add unit tests for MySqlQueueScheduler.cs covering basic queue operations
- [ ] Add tests for MySqlQueueBfsScheduler.cs verifying breadth-first traversal order
- [ ] Add tests for MySqlQueueDfsScheduler.cs verifying depth-first traversal order
- [ ] Add integration tests using test database to verify persistence and recovery
Add GitHub Actions CI workflow to replace/supplement Azure Pipelines
The repo currently uses Azure Pipelines (azure-pipelines.yml) but lacks GitHub Actions workflows. Adding a GitHub Actions workflow (.github/workflows/) would provide faster feedback on PRs and reduce CI/CD platform dependencies. This is especially valuable for contributors who prefer native GitHub integration.
- [ ] Create .github/workflows/build-and-test.yml for .NET solution builds
- [ ] Add a build matrix for the currently targeted frameworks (a recent PR upgraded the solution toward .NET 10; .NET 5–7 are end-of-life)
- [ ] Include steps for running unit tests and generating code coverage reports
- [ ] Add conditional job for publishing NuGet pre-release packages to MyGet on successful master builds
- [ ] Document the new workflow in CONTRIBUTING.md or README.md
Add comprehensive integration tests for storage providers (MongoDB, HBase, MySQL)
The repo has multiple storage implementations (src/DotnetSpider.Mongo, src/DotnetSpider.HBase, src/DotnetSpider.MySql) but no visible integration test suite. Testing actual data persistence, connection pooling, and error handling across different backends is critical for reliability. The docker-compose files suggest infrastructure-as-code capability for test environments.
- [ ] Create tests/DotnetSpider.Integration.Tests/ directory structure
- [ ] Add docker-compose.test.yml that spins up MySQL, MongoDB, and HBase test instances
- [ ] Write integration tests for MongoEntityStorage.cs (CRUD operations, bulk insert)
- [ ] Write integration tests for MySqlEntityStorage.cs and MySqlFileEntityStorage.cs
- [ ] Write integration tests for HBaseStorage.cs
- [ ] Document test setup in CONTRIBUTING.md with docker-compose startup instructions
🌿Good first issues
- Add integration tests for MongoEntityStorage.cs and HBaseStorage.cs to ensure entity extraction round-trips correctly through each storage backend
- Document the entity attribute syntax (selector, regex extraction, type conversion) with code examples in src/DotnetSpider.Sample/samples/ since BaseUsageSpider.cs exists but lacks inline comments
- Create a local-dev docker-compose.yml template in the root that pre-configures all services with sensible defaults, reducing setup friction for new contributors
⭐Top contributors
- @zlzforever — 88 commits
- @ananck — 3 commits
- @aminparsa18 — 2 commits
- @TTonlyV5 — 1 commit
- @SalehAhmadi — 1 commit
📝Recent commits
- 414facf — Merge pull request #274 from aminparsa18/upgrade-net10 (zlzforever)
- 27c3694 — remove legacy library (aminparsa18)
- f558b29 — upgrade framework and deprecated packages (aminparsa18)
- 8496941 — Merge pull request #271 from TTonlyV5/master (zlzforever)
- 3060102 — fix missing data in POST request headers (TTonlyV5)
- 94eb899 — Merge pull request #264 from SalehAhmadi/patch-1 (zlzforever)
- aaf6372 — Update README.md (SalehAhmadi)
- 7cc3f15 — adapt proxy to the new version (zlzforever)
- 6a8e94c — fix dependency errors (zlzforever)
- a293999 — fix method-not-found issue with dynamic types (zlzforever)
🔒Security observations
- High · Outdated jQuery Dependency — package.json, jquery dependency. The package.json specifies jQuery ^3.4.1, which is significantly outdated. This version range may include versions with known security vulnerabilities. Current jQuery releases are in the 3.6+ and 3.7+ ranges with numerous security patches. Fix: Update to the latest stable jQuery version (3.7.x or later). Run `npm audit` to identify all vulnerable dependencies and update them accordingly.
- High · Outdated Bootstrap Dependency — package.json, bootstrap dependency. Bootstrap ^4.4.1 is outdated (released 2020). Current stable versions are Bootstrap 5.x+ with improved security and accessibility features. The specified version may contain known vulnerabilities. Fix: Update to Bootstrap 5.x or later. Review breaking changes during migration. Test thoroughly across all components.
- Medium · SQL Injection Risk in MySql Components — src/DotnetSpider.MySql/. The codebase contains multiple MySql-related files (MySqlEntityStorage.cs, MySqlSchedulerOptions.cs, MySqlQueueBfsScheduler.cs) that may construct SQL queries. Without code review, raw SQL query construction patterns are common sources of SQL injection vulnerabilities. Fix: Conduct a thorough code review of all SQL construction code. Ensure parameterized queries are used exclusively. Use ORM frameworks or query builders with built-in SQL injection protection.
- Medium · Potential Log Injection — src/DotnetSpider.AgentCenter/logs/. The presence of log files in version control (src/DotnetSpider.AgentCenter/logs/agent-register-*.txt) indicates logs may be stored in the repository. This can expose sensitive information and create log injection opportunities. Fix: Remove log files from version control. Add logs/ to .gitignore immediately. Implement proper log management with secure storage outside the repository.
- Medium · Insecure Package Repository Configuration — README.md, MyGet feed configuration snippet. The README mentions adding a MyGet feed via HTTP-style configuration without an explicit HTTPS requirement visible. Package sources should enforce HTTPS to prevent man-in-the-middle attacks. Fix: Ensure all NuGet package sources use HTTPS. Explicitly document HTTPS-only package feed requirements in build configurations.
- Medium · Exposed Configuration Files in Docker — docker-compose/agent1.json, docker-compose/agent2.json. Docker Compose files reference agent configuration files (agent1.json, agent2.json) that may contain sensitive configuration. These files are included in the repository structure. Fix: Extract sensitive configuration into environment variables or secure secret management systems. Use .env files with .env.example templates. Never commit sensitive configuration to version control.
- Low · Outdated jQuery Validation Plugin — package.json, jquery-validation dependency. jquery-validation ^1.19.1 is outdated. While primarily a client-side validation tool, outdated dependencies can introduce inconsistent behavior and potential security issues. Fix: Update to the latest version of jquery-validation. Ensure server-side validation is implemented regardless, as client-side validation is not a security control.
- Low · Vue.js Version is Outdated — package.json, vue dependency. Vue ^2.6.11 is from the Vue 2.x line, which reached end-of-life in December 2023. Vue 3.x is the current maintained version with improved security and performance. Fix: Plan a migration to Vue 3.x (or later). Review the migration guide and update components accordingly. This provides better security maintenance and support.
- Low · Missing Security Headers Configuration — src/DotnetSpider.Portal/Controllers/. No visible security headers configuration in Portal controllers or middleware. Standard headers like CSP, X-Frame-Options, and X-Content-Type-Options are not evident in the file structure. Fix: Implement security headers middleware in the ASP.NET Core application. Configure Content-Security-Policy, X-Frame-Options, X-Content-Type-Options, Strict-Transport-Security, and other protective headers.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.