RepoPilot

TecharoHQ/anubis

Weighs the soul of incoming HTTP requests to stop AI crawlers

Healthy across the board

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 1d ago
  • 30+ active contributors
  • Distributed ownership (top contributor 38% of recent commits)
  • MIT licensed
  • CI configured
  • Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant: RepoPilot: Healthy

```markdown
[![RepoPilot: Healthy](https://repopilot.app/api/badge/techarohq/anubis)](https://repopilot.app/r/techarohq/anubis)
```

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/techarohq/anubis on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: TecharoHQ/anubis

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/TecharoHQ/anubis shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 1d ago
  • 30+ active contributors
  • Distributed ownership (top contributor 38% of recent commits)
  • MIT licensed
  • CI configured
  • Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live TecharoHQ/anubis repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/TecharoHQ/anubis.

What it runs against: a local clone of TecharoHQ/anubis — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in TecharoHQ/anubis | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>TecharoHQ/anubis</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of TecharoHQ/anubis. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/TecharoHQ/anubis.git
#   cd anubis
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of TecharoHQ/anubis and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "TecharoHQ/anubis(\.git)?\b" \
  && ok "origin remote is TecharoHQ/anubis" \
  || miss "origin remote is not TecharoHQ/anubis (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
for f in anubis.go cmd/anubis/main.go data/botPolicies.yaml \
         data/bots/ai-catchall.yaml cmd/robots2policy/main.go; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/TecharoHQ/anubis"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Anubis is a Go-based HTTP middleware that identifies and blocks AI crawler requests (such as those from GPTBot, CCBot, and Anthropic's crawlers) by analyzing request patterns, User-Agent headers, and behavioral signals. It acts as a reverse proxy or middleware layer that 'weighs the soul' of incoming requests to prevent unauthorized data scraping for large language model training. The repository is a modular Go monorepo: core HTTP middleware logic in Go at the root, a TypeScript/templ frontend for the admin dashboard (web/ directory with static assets), comprehensive documentation in docs/ (a Docusaurus 3.8.1 setup), CI/CD orchestration via .github/workflows/, and containerized deployment via the Dockerfile and the docker-compose setup in .devcontainer/. Development is configuration-driven, with .air.toml providing hot reload.

👥Who it's for

Website operators, content creators, and infrastructure teams who want to protect their sites from being indexed by AI training crawlers without blocking legitimate traffic. DevOps engineers and site owners concerned about data usage without permission.

🌱Maturity & risk

Actively developed and production-ready. The project has comprehensive CI/CD pipelines (.github/workflows/), Docker support, package builds for multiple platforms (stable/unstable channels), and DCO enforcement, all suggesting mature governance. The extensive Go codebase (546KB), multiple deployment options, and sponsor support indicate a well-maintained project under active development.

Low risk for production use. GitHub Sponsors data suggests a single-primary-maintainer (Xe) model, but this is mitigated by comprehensive test coverage (go.yml workflow), active dependency management (dependabot.yml), and established release processes. No major red flags are visible in the repository structure, though crawler detection is inherently a cat-and-mouse game as AI vendors evolve their request signatures.

Active areas of work

Active development with security-focused changes (zizmor.yml for security linting), spelling checks via GitHub Actions, package builds for stable/unstable channels, and smoke testing workflows. The presence of AGENTS.md and CLAUDE.md suggests recent work on documentation and AI integration patterns. Multiple open workflows indicate the project is shipping regular updates.

🚀Get running

Clone the repo, set up the dev environment, and run locally: git clone https://github.com/TecharoHQ/anubis && cd anubis && go mod download && make (or use the devcontainer: open the repo in your editor and reopen in container). The .devcontainer/docker-compose.yaml provides a complete local environment via docker-compose up.

Daily commands: Dev: run make, or use air for hot reload (configured in .air.toml). Tests: go test ./... (mirrored in .github/workflows/go.yml). Dashboard: in web/, run npm install && npm start (Docusaurus). Container: docker-compose -f .devcontainer/docker-compose.yaml up.

🗺️Map of the codebase

  • anubis.go — Core library implementing the soul-weighing logic for HTTP request classification and AI crawler detection
  • cmd/anubis/main.go — Entry point for the Anubis HTTP middleware/service; demonstrates how to integrate the core library
  • data/botPolicies.yaml — Central configuration file defining all bot detection rules and policies that the system applies to incoming requests
  • data/bots/ai-catchall.yaml — Defines AI crawler detection patterns; critical for understanding how the system identifies machine learning bots
  • cmd/robots2policy/main.go — Converts robots.txt files into Anubis bot policies; essential tool for extending bot detection coverage
  • .github/workflows/go.yml — CI/CD pipeline for Go code; shows testing and build requirements for contributions
  • CONTRIBUTING.md — Developer guidelines for the project; required reading for all contributors before submitting changes

🧩Components & responsibilities

  • Anubis Core Library (anubis.go) (Go, regexp/pattern matching) — Implements the soul-weighing algorithm: parses requests, matches patterns, returns allow/deny decision
    • Failure mode: Malformed policy or request crashes detector; graceful fallback to allow or explicit error logging needed
  • Middleware (cmd/anubis/main.go) (Go net/http, YAML unmarshaling) — HTTP server wrapper that instantiates core library, intercepts requests, enforces decisions
    • Failure mode: Network errors or slow backend timeouts; circuit breaker pattern recommended for production
  • Bot Policies (data/bots/*.yaml) (YAML) — Declarative rules for AI crawlers, headless browsers, aggressive scrapers; loaded at startup
    • Failure mode: Overly permissive patterns allow crawlers through; overly strict patterns block legitimate users (false positives)
  • App Exemptions (data/apps/*.yaml) (YAML) — Whitelisted services and routes that bypass bot detection (e.g., SSL Labs, Gitea RSS feeds)
    • Failure mode: Missing exemption blocks legitimate service; incorrect exemption allows malicious crawler
  • robots2policy Tool (cmd/robots2policy/main.go) (Go, robots.txt parser) — Converts standard robots.txt to Anubis YAML policy format for bulk policy generation
    • Failure mode: Malformed robots.txt or output mismatch requires manual editing; batch processing needs retry logic

🔀Data flow

  • HTTP Client → Anubis Middleware — incoming HTTP request with headers (User-Agent, …)

🛠️How to make changes

Add a new AI crawler detection pattern

  1. Create a new YAML file in data/bots/ (e.g., data/bots/new-ai-bot.yaml) with User-Agent patterns and detection rules (data/bots/new-ai-bot.yaml)
  2. Reference the new file in data/botPolicies.yaml under the appropriate category (ai, headless-browsers, etc.) (data/botPolicies.yaml)
  3. Add test cases in cmd/robots2policy/testdata/ if converting from robots.txt (cmd/robots2policy/testdata/)
  4. Run go test to verify the policy loads correctly (.github/workflows/go.yml)
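A minimal example of what the file in step 1 might look like. The field names below (`name`, `user_agent_regex`, `action`) are assumptions based on common Anubis policy examples; confirm the real schema against an existing file in data/bots/ before copying:

```yaml
# data/bots/new-ai-bot.yaml — illustrative only; verify field names
# against an existing policy file in data/bots/.
- name: new-ai-bot
  user_agent_regex: NewAIBot
  action: DENY
```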

Add an exemption for a legitimate service

  1. Create a new YAML file in data/apps/ (e.g., data/apps/my-service.yaml) defining allowed routes or User-Agent patterns (data/apps/allow-api-routes.yaml)
  2. Reference the exemption in data/botPolicies.yaml or configure it in the middleware (data/botPolicies.yaml)
  3. Document the exemption rationale in a comment within the YAML file (data/apps/)

Convert robots.txt to Anubis policy format

  1. Place the robots.txt file in cmd/robots2policy/testdata/ (e.g., complex.robots.txt) (cmd/robots2policy/testdata/)
  2. Run: go run cmd/robots2policy/main.go -input testdata/complex.robots.txt -output testdata/complex.yaml (cmd/robots2policy/main.go)
  3. Review and edit the generated YAML to match Anubis policy conventions (cmd/robots2policy/testdata/complex.yaml)
  4. Run tests to ensure the policy is valid: go test ./cmd/robots2policy/... (cmd/robots2policy/robots2policy_test.go)
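For intuition about what step 2's conversion does, the core mapping — robots.txt Disallow directives becoming per-agent deny rules — can be sketched as below. This is a hypothetical simplification: the real robots2policy tool emits Anubis policy YAML, not these strings:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// toPolicy sketches the robots.txt → policy mapping: each Disallow
// path under a User-agent line becomes one deny rule for that agent.
func toPolicy(robots string) []string {
	var rules []string
	agent := "*" // default agent until a User-agent line is seen
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "User-agent:"):
			agent = strings.TrimSpace(strings.TrimPrefix(line, "User-agent:"))
		case strings.HasPrefix(line, "Disallow:"):
			path := strings.TrimSpace(strings.TrimPrefix(line, "Disallow:"))
			if path != "" { // an empty Disallow means "allow everything"
				rules = append(rules, fmt.Sprintf("deny %s on %s", agent, path))
			}
		}
	}
	return rules
}

func main() {
	robots := "User-agent: GPTBot\nDisallow: /\n"
	for _, r := range toPolicy(robots) {
		fmt.Println(r)
	}
}
```

This also shows why step 3 (manual review) matters: directives like Allow, Crawl-delay, and wildcard paths need judgment calls that a mechanical conversion cannot make.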

🔧Why these technologies

  • Go — High performance, minimal runtime overhead, easy containerization for middleware deployment
  • YAML policies — Human-readable, declarative rule definitions that can be updated without recompilation
  • Docker/Ko — Cross-platform containerization enabling deployment as sidecar or reverse-proxy middleware
  • GitHub Actions — Native CI/CD for Go projects with automated testing, linting, and container builds

⚖️Trade-offs already made

  • Purely policy-driven detection (no ML models)

    • Why: Simplicity, transparency, and zero dependency on external AI inference services
    • Consequence: May miss sophisticated new attack patterns; requires manual policy updates to stay ahead of evasion
  • YAML configuration files over embedded defaults

    • Why: Allows operators to customize bot detection without rebuilding binaries
    • Consequence: Adds file I/O overhead and requires careful policy management; misconfiguration can allow or block unwanted traffic
  • Pattern matching on User-Agent and headers only

    • Why: Fast, stateless, and suitable for reverse-proxy/middleware placement
    • Consequence: Cannot detect sophisticated crawlers spoofing legitimate User-Agents; behavioral analysis would require stateful analysis

🚫Non-goals (don't propose these)

  • Real-time ML-based bot detection
  • Browser fingerprinting or JavaScript challenge (TLS/HTTP-only)
  • Authentication or session management
  • Log aggregation or analytics beyond request classification
  • Content caching or optimization
  • Rate limiting (delegated to reverse-proxy layer)

🪤Traps & gotchas

No obvious pitfalls, but note:

  • The .devcontainer setup is strongly recommended; a raw local Go/Node setup may miss environment variables or service dependencies.
  • DCO enforcement (.github/workflows/dco-check.yaml) requires commits to be signed off with legal authority — every commit must carry a 'Signed-off-by' trailer.
  • The project uses multiple deployment channels (stable/unstable in package-builds), so check which version you're targeting.
  • AGENTS.md and CLAUDE.md exist — read these before starting to understand the design decisions around AI integration.


💡Concepts to learn

  • User-Agent Analysis & Fingerprinting — core to Anubis: identifying crawlers by analyzing User-Agent headers, header order, and TLS fingerprints to detect spoofed requests
  • Behavioral Request Heuristics — Anubis weighs multiple request signals (timing, patterns, headers), not just single identifiers; learn which patterns are reliable crawler indicators
  • Reverse Proxy Middleware Pattern — Anubis intercepts HTTP traffic before it reaches application servers; understanding the middleware request/response lifecycle is essential for extending it
  • TLS Fingerprinting — Anubis can identify bots by analyzing TLS handshake properties and cipher-suite ordering, which is more reliable than the User-Agent alone
  • Rate Limiting & Token Bucket — blocking crawlers while allowing legitimate users requires rate limiting that doesn't trigger false positives
  • Hot Reload Development (air) — the .air.toml configuration enables rapid iteration on crawler detection rules without restarting the server
  • Multi-stage Docker Builds — the Anubis Dockerfile likely uses multi-stage builds to keep production images small while supporting a full dev toolchain
  • semrush/robots-txt-parser — parses robots.txt to enforce crawler policies; Anubis could integrate this to respect site-level directives before blocking
  • cloudflare/wrangler — edge computing tooling that could deliver Anubis-like blocking at the CDN level, before requests reach origin servers
  • getlantern/lantern — a related proxy/middleware project showing patterns for request inspection and filtering across networks
  • owasp/ModSecurity — a WAF that inspired middleware-based request validation; Anubis implements similar layered inspection principles
  • TecharoHQ/site-bans-crawler — likely a companion project or predecessor in the same organization showing crawler-blocking techniques
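As a concrete reference for the token-bucket concept listed above (note that rate limiting itself is a stated non-goal for Anubis and is delegated to the proxy layer), a minimal limiter looks like this:

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a minimal token-bucket limiter: up to capacity tokens,
// refilled at rate tokens per second. Illustration only — not code
// that Anubis itself ships.
type bucket struct {
	capacity float64
	rate     float64
	tokens   float64
	last     time.Time
}

func newBucket(capacity, rate float64) *bucket {
	return &bucket{capacity: capacity, rate: rate, tokens: capacity, last: time.Now()}
}

// allow refills tokens for the elapsed time, then spends one token
// if available. A full bucket lets legitimate bursts through while
// sustained crawling drains it.
func (b *bucket) allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	b := newBucket(3, 1) // burst of 3, refill 1 token/s
	for i := 0; i < 5; i++ {
		fmt.Println("request", i, "allowed:", b.allow())
	}
}
```

In a tight loop the first three requests pass and the rest are limited; spacing requests a second apart lets them all through, which is exactly the burst-tolerant behavior that avoids false positives on real users.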

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: "Implement the '<title>' PR idea from CLAUDE.md, working through the checklist as the task list."

Add comprehensive unit tests for cmd/robots2policy package

The robots2policy command has a test file (robots2policy_test.go) and testdata directory, but testing coverage appears minimal. Given this tool converts robots.txt files to policy YAML, adding tests for edge cases (malformed robots.txt, various directives, batch processing logic in batch_process.go) would significantly improve reliability and prevent regressions.

  • [ ] Review existing cmd/robots2policy/robots2policy_test.go coverage
  • [ ] Add tests for batch_process.go logic (cmd/robots2policy/batch/batch_process.go)
  • [ ] Add edge case tests using testdata files (blacklist.robots.txt, blacklist.yaml)
  • [ ] Test error handling for malformed input
  • [ ] Ensure test coverage >80% for the package

Add GitHub Actions workflow for Go security scanning with gosec

The repo has extensive CI/CD workflows (.github/workflows/) including linting, module tidy checks, and docker builds, but no dedicated security scanning workflow. Given Anubis's role in blocking malicious requests, a gosec security scanner workflow would catch potential vulnerabilities in the Go codebase automatically on PRs.

  • [ ] Create .github/workflows/gosec.yml workflow file
  • [ ] Configure gosec to scan ./cmd/... and root anubis.go
  • [ ] Set severity thresholds and exclude false positives
  • [ ] Integrate with existing PR checks in PULL_REQUEST_TEMPLATE.md
  • [ ] Document security scanning process in CONTRIBUTING.md or SECURITY.md

Add integration tests for cmd/containerbuild functionality

The containerbuild command (cmd/containerbuild/main.go) appears to manage container image building, but there are no test files in that directory. Adding integration tests that verify container image building logic would prevent breakage and improve maintainability, especially given the .ko.yaml and Docker workflows present.

  • [ ] Create cmd/containerbuild/containerbuild_test.go
  • [ ] Add tests for image tagging and manifest generation logic
  • [ ] Verify integration with .ko.yaml build configuration
  • [ ] Test error cases (missing dependencies, invalid Docker setup)
  • [ ] Add testdata fixtures if needed (sample dockerfiles, build configs)

🌿Good first issues

  • Add test coverage for User-Agent parsing edge cases in the crawler detection logic; .github/workflows/go.yml runs tests, but coverage reports would reveal untested paths in header parsing
  • Expand the AGENTS.md documentation with concrete examples of blocked vs. allowed requests; it is currently sparse on debugging information for operators troubleshooting why legitimate requests are blocked
  • Add TypeScript type definitions and improve the dashboard UI in web/ to show real-time block statistics; the web/ directory has minimal React/TS, suggesting limited frontend functionality compared to the backend's complexity


📝Recent commits

  • 0491f1f — fix: patch GHSA-6wcg-mqvh-fcvg (#1616) (Xe)
  • d3a00da — feat: Log weight when issuing challenge (#1611) (tdgroot)
  • 7e037b6 — feat: add ASN data from Thoth to logs/metrics (#1608) (lillian-b)
  • ebf9a30 — fix(metrics): bind to the right network/bindhost (#1606) (Xe)
  • f8605bc — fix: Thoth geoip compare (#1564) (lenny87)
  • 1d700a0 — fix(honeypot): remove DoS vector (#1581) (Xe)
  • 681c2cc — feat(metrics): basic auth support (#1579) (Xe)
  • 8f8ae76 — feat(metrics): enable TLS/mTLS serving support (#1576) (Xe)
  • f21706e — feat(data): add Meta's web indexer used for AI purposes (#1573) (bnjbvr)
  • d5ccf9c — feat: move metrics server config to the policy file (#1572) (Xe)

🔒Security observations

The codebase shows moderate security posture with several areas for improvement. The main concerns are: (1) use of recent major versions without extensive testing history (React 19), (2) automatic dependency updates via caret specifications without explicit security scanning, (3) missing visible npm audit configuration in CI/CD, and (4) loose Node.js engine constraints allowing EOL versions. The project does have Dependabot configured and security contact information in SECURITY.md, which are positive signs. Recommendations focus on stricter dependency management, adding security scanning to CI/CD pipelines, and updating version constraints to use actively maintained releases.

  • High · React 19.0.0 — Potential Stability Concerns (package.json, dependencies.react). React 19.0.0 is a major version release with significant changes; the documentation site is using a very recent major version that may have undiscovered vulnerabilities or breaking changes. This is less a direct security issue than a dependency risk. Fix: monitor React security advisories closely, test thoroughly before deploying to production, and run npm audit regularly to detect known vulnerabilities.
  • Medium · Caret Dependency Specifications (package.json, all dependencies with caret syntax). Caret (^) specifications allow minor and patch updates automatically; for example, ^3.8.1 could update to 3.9.0 or 3.8.2 without explicit review, increasing exposure to newly introduced vulnerabilities. Fix: consider exact version pinning for production deployments, or add dependency scanning to the CI/CD pipeline, and use 'npm ci' instead of 'npm install' in CI environments.
  • Medium · Missing npm audit Configuration (package.json and .github/workflows/). No npm audit scripts or explicit security-auditing steps are visible in package.json or the GitHub workflows. Fix: add 'npm audit --production' to a GitHub Actions workflow, and verify the existing Dependabot configuration (dependabot.yml) is active.
  • Low · No Security Headers Configuration Visible (docs configuration; docusaurus.config.js not shown in the file structure). The Docusaurus configuration does not show explicit security headers (CSP, X-Frame-Options, etc.), which could allow certain client-side attacks if the site is served without them. Fix: configure security headers in the Docusaurus config or via a reverse proxy — Content-Security-Policy, X-Content-Type-Options: nosniff, X-Frame-Options: DENY, and so on.
  • Low · TypeScript Version Flexibility (package.json, devDependencies.typescript). TypeScript is specified as ~5.6.2, allowing patch updates within the 5.6.x range. While TypeScript is not a runtime dependency, looser constraints can introduce unexpected behavior changes. Fix: consider pinning TypeScript to an exact version so type checking stays consistent across development and CI/CD.
  • Low · Engine Version Constraint Not Strict (package.json, engines.node). The Node.js engine requirement is >=18.0, which is broad enough to include EOL versions; Node 18.x has reached End-of-Life, and some security patches may no longer be available. Fix: update to >=20.0 or >=22.0 to require maintained LTS versions — Node 20 LTS and 22 LTS currently receive regular security updates.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Healthy signals · TecharoHQ/anubis — RepoPilot