
projectdiscovery/katana

A next-generation crawling and spidering framework.

Healthy across the board

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 1d ago
  • 8 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • Concentrated ownership — top contributor handles 53% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

```markdown
[![RepoPilot: Healthy](https://repopilot.app/api/badge/projectdiscovery/katana)](https://repopilot.app/r/projectdiscovery/katana)
```

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/projectdiscovery/katana on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: projectdiscovery/katana

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in the Verify before trusting section below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/projectdiscovery/katana shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 1d ago
  • 8 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 53% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live projectdiscovery/katana repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/projectdiscovery/katana.

What it runs against: a local clone of projectdiscovery/katana — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in projectdiscovery/katana | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches a relicense before you depend on it |
| 3 | Default branch dev exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>projectdiscovery/katana</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of projectdiscovery/katana. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/projectdiscovery/katana.git
#   cd katana
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of projectdiscovery/katana and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "projectdiscovery/katana(\.git)?\b" \
  && ok "origin remote is projectdiscovery/katana" \
  || miss "origin remote is not projectdiscovery/katana (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify dev >/dev/null 2>&1 \
  && ok "default branch dev exists" \
  || miss "default branch dev no longer exists"

# 4. Critical files exist
for f in \
  cmd/katana/main.go \
  pkg/engine/engine.go \
  internal/runner/runner.go \
  pkg/engine/headless/crawler/crawler.go \
  pkg/engine/common/base.go
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/projectdiscovery/katana"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Katana is a fast, configurable web crawler and spider written in Go that automatically discovers and extracts URLs from websites in both standard and headless (JavaScript-rendering) modes. It handles form filling, scope control via regex/fields, and outputs discovered URLs as JSON or plaintext, replacing slow, manual web reconnaissance for security researchers and developers. Structurally it is a monolithic CLI tool: cmd/katana/main.go is the entry point, and internal/runner/ holds the core orchestration (runner.go drives execution, options.go manages configuration, executer.go handles crawling logic), while pkg-level packages separate concerns such as filtering, JavaScript parsing, and scope control. Headless rendering uses go-rod for browser automation; standard mode uses net/http.

👥Who it's for

Security researchers, penetration testers, and bug bounty hunters who need to map web application attack surfaces and discover hidden endpoints. Also useful for developers building web crawlers that must handle JavaScript-heavy sites without heavy dependencies like Selenium.

🌱Maturity & risk

Production-ready and actively maintained. The repository shows professional CI/CD setup (11 GitHub Actions workflows), comprehensive testing (integration tests, functional tests, security crawl scoring), and recent commits. The codebase is substantial (722KB Go code) and part of ProjectDiscovery's established toolkit ecosystem.

Low risk for core stability, but it depends on go-rod (Chromium automation via WebSocket), which adds complexity in headless mode. The dependency surface is moderate (~25 direct dependencies in go.mod), and the project requires Go 1.25+, a relatively recent toolchain constraint. No obvious red flags in maintenance: regular releases and active GitHub workflows indicate ongoing support.

Active areas of work

Active development with security improvements and feature additions. Recent workflow additions (security-crawl-maze-score.yaml, compat-checks.yaml) suggest ongoing quality assurance. Release pipeline is automated (release-binary.yml, release-test.yml, dockerhub-push.yml), indicating regular versioned releases to users.

🚀Get running

```bash
git clone https://github.com/projectdiscovery/katana.git
cd katana
CGO_ENABLED=1 go install ./cmd/katana
katana -u https://example.com
```

Or via Docker: docker run projectdiscovery/katana:latest -u https://example.com

Daily commands:

  • Development build: make build (see Makefile)
  • Run directly: go run cmd/katana/main.go -u https://example.com
  • Run tests: go test ./...
  • Integration tests: cd integration_tests && bash run.sh
  • Functional tests: cd cmd/functional-test && bash run.sh

🗺️Map of the codebase

  • cmd/katana/main.go — Entry point for the CLI tool; defines command-line interface and orchestrates the crawling workflow.
  • pkg/engine/engine.go — Core crawling engine abstraction; coordinates between standard and headless crawlers, managing the crawl lifecycle.
  • internal/runner/runner.go — Main execution runner that initializes options, health checks, and manages the crawling process end-to-end.
  • pkg/engine/headless/crawler/crawler.go — Headless browser crawler implementation; handles JavaScript-heavy sites using Rod for browser automation.
  • pkg/engine/common/base.go — Shared base crawler logic and URL discovery; used by both standard and headless crawlers.
  • internal/runner/options.go — Configuration options structure; defines all CLI flags and crawler behavior settings.
  • pkg/engine/headless/captcha/captcha.go — CAPTCHA detection and solving framework; integrates with external solvers for headless crawling.

🛠️How to make changes

Add a new crawler strategy or mode

  1. Define a new crawler type struct implementing the crawler interface in a new file under pkg/engine/ (pkg/engine/mytype/crawler.go)
  2. Register the new crawler in the engine selector logic (pkg/engine/engine.go)
  3. Add corresponding CLI option flags for the strategy (internal/runner/options.go)
  4. Add integration tests validating crawl output (cmd/integration-test/integration-test.go)
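
A minimal sketch of steps 1 and 2 under stated assumptions: the package, type, and method names below (mytype, Crawler, New, ExecuteCrawl) are illustrative, not katana's actual API. Copy the interface that the standard and headless crawlers in pkg/engine/ already satisfy.

```go
// Hypothetical sketch for pkg/engine/mytype/crawler.go; not katana's real contract.
package mytype

import (
	"context"
	"fmt"
)

// Crawler implements a new crawl strategy. The real interface to satisfy is
// whatever pkg/engine/engine.go selects between the standard and headless
// crawlers; mirror one of those implementations.
type Crawler struct {
	maxDepth int // assumed option threaded in from internal/runner/options.go
}

// New constructs the crawler from its options.
func New(maxDepth int) *Crawler {
	return &Crawler{maxDepth: maxDepth}
}

// ExecuteCrawl walks a single root URL; the signature is illustrative.
func (c *Crawler) ExecuteCrawl(ctx context.Context, rootURL string) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
	}
	// TODO: enqueue rootURL, fetch, extract links, enforce scope and depth.
	fmt.Printf("crawling %s to depth %d\n", rootURL, c.maxDepth)
	return nil
}
```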

Add support for a new CAPTCHA type

  1. Add detection pattern and type constant in captcha/identify.go (pkg/engine/headless/captcha/identify.go)
  2. Implement injection and solving logic for the new type (pkg/engine/headless/captcha/injection_test.go)
  3. Add JavaScript payload for client-side identification if needed (pkg/engine/headless/captcha/js/identify.js)
  4. Extend the solver to route to the correct external service or handler (pkg/engine/headless/captcha/solver.go)
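
A hedged sketch of the shape step 1 implies: a type constant plus a detection signature. Every name here (Type, TypeMyCaptcha, signatures, Identify) is an assumption; the real identify.go may be organized quite differently.

```go
// Hypothetical additions for pkg/engine/headless/captcha/identify.go.
package captcha

import "regexp"

// Type labels a supported CAPTCHA provider.
type Type string

// TypeMyCaptcha is the new provider being added (illustrative name).
const TypeMyCaptcha Type = "mycaptcha"

// signatures maps each type to a pattern that matches its widget markup.
var signatures = map[Type]*regexp.Regexp{
	TypeMyCaptcha: regexp.MustCompile(`mycaptcha\.example/api\.js`),
}

// Identify returns the first CAPTCHA type whose signature matches the page HTML.
func Identify(html string) (Type, bool) {
	for t, re := range signatures {
		if re.MatchString(html) {
			return t, true
		}
	}
	return "", false
}
```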

Add a new CLI option or configuration setting

  1. Define a new field in the Options struct (internal/runner/options.go)
  2. Add the flag parsing logic in the CLI setup (cmd/katana/main.go)
  3. Thread the option through runner.Run() and into the engine/crawler (internal/runner/runner.go)
  4. Update the engine or crawler to use the new option (pkg/engine/engine.go)
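
A compressed sketch of the flow steps 1 to 4 describe. Katana's actual flag layer is projectdiscovery/goflags (see cmd/katana/main.go for the real registration pattern); the standard-library flag call below is a stand-in, and the MaxRetries field is hypothetical.

```go
// Illustrative only: a new option defined, parsed, and threaded to the engine.
package main

import (
	"flag"
	"fmt"
)

// Options mirrors the shape of internal/runner/options.go; the field is hypothetical.
type Options struct {
	MaxRetries int // 1. new field on the Options struct
}

func main() {
	opts := &Options{}
	// 2. flag parsing; katana really uses projectdiscovery/goflags here.
	flag.IntVar(&opts.MaxRetries, "max-retries", 3, "maximum fetch retries per URL")
	flag.Parse()
	// 3. thread the option through the runner into the engine/crawler.
	run(opts)
}

func run(opts *Options) {
	// 4. the engine or crawler consumes the new option.
	fmt.Println("max retries:", opts.MaxRetries)
}
```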

Add duplicate detection or page similarity logic

  1. Extend the normalizer to preprocess pages according to new rules (pkg/engine/headless/crawler/normalizer/normalizer.go)
  2. Modify text or DOM utilities to extract fingerprint-worthy content (pkg/engine/headless/crawler/normalizer/text_utils.go)
  3. Update the simhash or add alternative hashing in the deduplication check (pkg/engine/headless/crawler/normalizer/simhash/simhash.go)
  4. Write unit tests for the new normalization strategy (pkg/engine/headless/crawler/normalizer/normalizer_test.go)
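
For orientation, here is a self-contained textbook version of the simhash technique step 3 builds on: hash each token, take a per-bit majority vote to form a 64-bit fingerprint, and call two pages near-duplicates when the fingerprints' Hamming distance is small. The real implementation in pkg/engine/headless/crawler/normalizer/simhash/simhash.go will differ in tokenization and thresholds.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash folds per-token 64-bit hashes into one fingerprint: each output bit
// is the majority vote of that bit position across all token hashes.
func simhash(text string) uint64 {
	var votes [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// nearDuplicate treats two fingerprints as the same page when they differ
// in at most maxDist bits.
func nearDuplicate(a, b uint64, maxDist int) bool {
	return bits.OnesCount64(a^b) <= maxDist
}

func main() {
	a := simhash("welcome to the example store home page")
	b := simhash("welcome to the example store homepage")
	// Near-identical token sets should yield a small Hamming distance.
	fmt.Println(nearDuplicate(a, b, 8))
}
```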

🔧Why these technologies

  • Go 1.25.7 — High-performance concurrent crawling; fast compilation and deployment; excellent goroutine support for parallel page fetching
  • Rod (go-rod/rod) — Lightweight, high-level Chrome DevTools Protocol client; enables JavaScript rendering and headless browser automation without Selenium overhead
  • PuerkitoBio/goquery — jQuery-like HTML parsing and DOM traversal in Go; fast, intuitive link and element extraction from HTML
  • Simhash algorithm — Fast approximate string matching for duplicate detection in large crawls; avoids storing full page bodies
  • Capsolver integration — Third-party CAPTCHA solving service; enables automated crawling of CAPTCHA-protected sites without manual intervention
  • jsluice (Mzack9999/jsluice) — JavaScript source code parsing and extraction; discovers URLs and endpoints hidden in JavaScript files

⚖️Trade-offs already made

  • Dual crawler design (standard + headless)

    • Why: Standard crawler is fast for static content; headless crawler needed for JavaScript-heavy sites. Trade memory/performance vs coverage.
    • Consequence: More code paths to maintain; user must choose or let auto-detection decide; headless mode is ~10x slower but more thorough

🪤Traps & gotchas

  • CGO_ENABLED=1 is required for install (stated in the README but easy to miss); without it, crypto/DNS libraries fail to build.
  • Headless mode needs Chromium (or Chrome) on the system; go-rod auto-downloads a browser on Linux/macOS but may fail in restricted environments.
  • YAML config files are not discovered automatically; pass them explicitly via the -config flag.
  • The test server in internal/testutils/testserver.go needs free localhost ports to start.
  • Large-scope crawls without depth/concurrency limits can consume significant memory due to LRU caches and worker pools.

💡Concepts to learn

  • Headless Browser Automation (go-rod) — Katana's headless mode uses go-rod to render JavaScript-heavy SPAs and extract dynamically-generated content—critical for crawling modern web apps that don't expose URLs in static HTML
  • Worker Pool Concurrency Pattern — Katana uses remeh/sizedwaitgroup to manage a bounded pool of concurrent crawl workers, preventing resource exhaustion while maximizing throughput (a combined sketch with the LRU dedup pattern follows this list)
  • Scope Control via Regex and Prefix Matching — The -scope and -scope-regex flags allow crawlers to define what URLs are in/out of bounds, preventing crawl explosion into third-party domains or sensitive areas
  • LRU Cache for URL Deduplication — Katana uses hashicorp/golang-lru to cache already-visited URLs, preventing infinite loops in cyclic web structures without storing entire URL history in memory
  • JavaScript AST Parsing (jsluice) — The jsluice library extracts URLs, API endpoints, and variable names directly from JavaScript source code, discovering hidden endpoints not present in DOM
  • Form Auto-fill and Submission — Katana can automatically detect and fill HTML forms with synthetic data, allowing crawling of POST-only pages without manual form configuration
  • Technology Detection (wappalyzergo) — Integrated tech detection identifies frameworks and libraries running on crawled sites, providing reconnaissance context for security assessments
  • projectdiscovery/subfinder — Companion tool in the ProjectDiscovery suite for subdomain enumeration—feeds domain lists into katana for deep crawling
  • projectdiscovery/nuclei — ProjectDiscovery's vulnerability scanner that consumes URLs discovered by katana to test for security flaws
  • OWASP/ZAP — Established web crawler/proxy alternative; katana differs by being CLI-first, faster for bulk URL discovery, and headless-by-default rather than proxy-based
  • hakluke/hakrawler — Simpler bash/Go crawler focused on URL discovery; katana improves on it with headless rendering and form automation
  • projectdiscovery/httpx — Companion tool for HTTP probing and metadata extraction from URLs discovered by katana
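
The worker-pool and LRU items above compose in a few lines. A standalone sketch using the public APIs of remeh/sizedwaitgroup and hashicorp/golang-lru follows; katana's real wiring differs, and the fixed queue here is a stand-in for link extraction.

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
	"github.com/remeh/sizedwaitgroup"
)

func main() {
	seen, _ := lru.New(4096)     // bounded visited-URL cache
	swg := sizedwaitgroup.New(8) // at most 8 crawl workers in flight

	queue := []string{
		"https://example.com/",
		"https://example.com/about",
		"https://example.com/", // duplicate: skipped by the LRU check
	}
	for _, u := range queue {
		// ContainsOrAdd marks the URL visited and reports whether it already was.
		if ok, _ := seen.ContainsOrAdd(u, struct{}{}); ok {
			continue
		}
		swg.Add() // blocks until a worker slot frees up
		go func(url string) {
			defer swg.Done()
			fmt.Println("crawl", url) // fetching and link extraction go here
		}(u)
	}
	swg.Wait()
}
```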

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for pkg/engine/common/http.go

The http.go file in pkg/engine/common/ lacks corresponding test coverage (*_test.go). Given that HTTP handling is critical for a crawling framework, adding tests for request building, header manipulation, response handling, and edge cases would significantly improve reliability. This is especially important since related files like backoff_test.go and base_test.go exist in the same package, establishing a testing pattern.

  • [ ] Create pkg/engine/common/http_test.go with test cases for HTTP request construction
  • [ ] Add tests for header handling, content-type detection, and response parsing
  • [ ] Test edge cases like malformed URLs, redirects, and timeout scenarios
  • [ ] Ensure coverage aligns with existing test patterns in backoff_test.go and base_test.go
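
A hedged starting point for the first two checklist items: a table-driven test against an httptest server. The buildRequest helper below is a placeholder; substitute whatever pkg/engine/common/http.go actually exports before relying on this.

```go
package common

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// buildRequest stands in for the real helper in http.go; swap in the actual
// exported function and its option parameters.
func buildRequest(method, url string) (*http.Request, error) {
	return http.NewRequest(method, url, nil)
}

func TestBuildRequest(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
	}))
	defer srv.Close()

	tests := []struct {
		name    string
		method  string
		url     string
		wantErr bool
	}{
		{"valid GET", http.MethodGet, srv.URL, false},
		{"malformed URL", http.MethodGet, "http://[::1]:namedport", true},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			req, err := buildRequest(tc.method, tc.url)
			if (err != nil) != tc.wantErr {
				t.Fatalf("err = %v, wantErr = %v", err, tc.wantErr)
			}
			if err == nil {
				if _, err := http.DefaultClient.Do(req); err != nil {
					t.Fatalf("request failed: %v", err)
				}
			}
		})
	}
}
```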

Add integration tests for headless browser cookie handling (pkg/engine/headless/browser/cookie/)

The cookie package ships a rules.json and already has a cookie_test.go, but a TODOS.md in the headless directory suggests incomplete work. Integration tests validating cookie persistence across navigation, cookie rule enforcement, and SameSite/Secure flag handling would be valuable. This directly supports katana's headless browsing capabilities and is critical for realistic crawling scenarios.

  • [ ] Review pkg/engine/headless/TODOS.md to understand outstanding cookie-related work
  • [ ] Create integration tests in cmd/integration-test/ for cookie scenarios (persistence, rules application, security flags)
  • [ ] Add test cases for cookie handling with JavaScript-set cookies and third-party cookie scenarios
  • [ ] Validate that rules.json restrictions are properly enforced during headless crawling

Add missing error handling and input validation tests for internal/runner/options.go

An options_test.go file exists, but option parsing in CLI tools typically carries complex validation logic. Given that cmd/katana/main.go passes options through internal/runner, comprehensive tests for invalid flag combinations, missing required parameters, and edge cases in option parsing would catch bugs early. This directly improves the user experience by validating CLI input before execution.

  • [ ] Extend internal/runner/options_test.go with tests for conflicting flag combinations
  • [ ] Add validation tests for boundary conditions (rate limits, timeout values, proxy configurations)
  • [ ] Test error cases: invalid URLs, malformed scope configurations, incompatible headless+standard mode combinations
  • [ ] Add tests for options that depend on external files (config files referenced in options)

🌿Good first issues

  • Expand the currently minimal coverage in internal/runner/options_test.go to exercise all CLI flag combinations, scope regex validation, and output format marshaling. This would improve confidence in configuration handling.
  • Document the YAML configuration schema with examples in a dedicated CONFIG.md file. Currently config options are scattered across options.go code comments. A new contributor could extract these into a structured guide with before/after examples.
  • Implement missing output filter tests in cmd/integration-test/filters.go—the test framework exists but filter logic for custom fields, regex matching, and JSON output paths needs explicit test cases to verify correctness.


📝Recent commits

  • 0f80ea8 — feat: add Context field to Options for external lifecycle control (#1593) (idanavishai)
  • 63bffb6 — Merge pull request #1646 from projectdiscovery/dependabot/go_modules/dev/github.com/projectdiscovery/dsl-0.8.17 (dogancanbakir)
  • 207b5f1 — Merge pull request #1647 from projectdiscovery/dependabot/go_modules/dev/github.com/projectdiscovery/retryablehttp-go-1. (dogancanbakir)
  • e00be2e — Merge pull request #1648 from projectdiscovery/dependabot/go_modules/dev/github.com/projectdiscovery/wappalyzergo-0.2.79 (dogancanbakir)
  • a079c32 — bump version (dogancanbakir)
  • d4f0749 — chore(deps): bump github.com/projectdiscovery/wappalyzergo (dependabot[bot])
  • 7775029 — chore(deps): bump github.com/projectdiscovery/retryablehttp-go (dependabot[bot])
  • 681e3c6 — chore(deps): bump github.com/projectdiscovery/dsl from 0.8.16 to 0.8.17 (dependabot[bot])
  • 7217e42 — Merge pull request #1638 from projectdiscovery/dependabot/go_modules/dev/github.com/projectdiscovery/ratelimit-0.0.86 (dogancanbakir)
  • c3a398d — chore(deps): bump github.com/projectdiscovery/ratelimit (dependabot[bot])

🔒Security observations

  • High · Outdated Go Version in Dockerfile — Dockerfile (line 1) and go.mod (line 3). The Dockerfile uses golang:1.26.2-alpine for building, which is a future/non-existent version. The go.mod specifies 'go 1.25.7', which is also unusually high. This may indicate version mismatch issues or use of untested/unstable Go versions that could contain security bugs. Fix: Use a stable, released Go version that matches go.mod. Verify Go 1.25.7 is a valid release and move to a current supported release if it is not (Go has no LTS line; only the two most recent minor versions receive fixes). Ensure build and runtime Go versions are properly tested.
  • High · Missing Input Validation in Web Crawler — pkg/engine/headless/captcha/js/ and pkg/engine/headless/browser/. As a web crawling framework that processes arbitrary URLs and web content, the codebase likely parses user-supplied URLs and HTML/JavaScript without explicit validation. Files like 'pkg/engine/headless/captcha/inject_test.go' and JavaScript injection files suggest dynamic code execution. This creates risks for XSS, command injection, or SSRF attacks. Fix: Implement strict input validation for all URLs (use url.Parse, validate schemes). Sanitize all HTML/JavaScript before execution. Use CSP headers. Validate and escape all data before passing to browser/JS execution contexts.
  • High · Potential SSRF via Unconstrained URL Crawling — pkg/engine/common/ and URL processing logic. A crawling framework by nature follows URLs. Without proper validation, it could crawl private/internal IPs (127.0.0.1, 10.0.0.0/8, 172.16.0.0/12, 169.254.0.0/16) or be exploited for Server-Side Request Forgery attacks against internal services. Fix: Implement IP whitelist/blacklist validation. Block requests to private IP ranges by default. Add configurable options to restrict crawling scope. Validate redirects strictly. A minimal guard sketch follows this list.
  • Medium · Headless Browser Security Risks — pkg/engine/headless/browser/ and pkg/engine/headless/captcha/. The use of go-rod (Chromium automation) with features like captcha injection (pkg/engine/headless/captcha/) and stealth mode suggests executing arbitrary JavaScript and accessing browser contexts. This could allow malicious websites to break out of the browser sandbox or exfiltrate data. Fix: Run headless browser in isolated environments. Implement resource limits (memory, timeout). Disable access to sensitive browser APIs. Consider using a sandboxed approach or containerization per crawl session.
  • Medium · Dependency on Third-Party CAPTCHA Solver — pkg/engine/headless/captcha/capsolver/capsolver.go. The codebase includes integration with capsolver (pkg/engine/headless/captcha/capsolver/). This requires external API keys and network communication. Hardcoded API credentials, insecure storage, or interception of CAPTCHA solutions could compromise security. Fix: Store API keys in environment variables or secure vaults, never hardcode. Validate CAPSOLVER_API_KEY configuration. Use HTTPS for capsolver communication. Implement API key rotation policies.
  • Medium · Insufficient Rate Limiting Enforcement — pkg/engine/ and rate limit configuration. While the project includes 'projectdiscovery/ratelimit' dependency, there's no visible enforcement mechanism in the file structure. A crawler without proper rate limiting could be used for DoS attacks or abuse target servers. Fix: Enforce rate limiting by default. Implement per-domain rate limits. Add configurable delays between requests. Respect robots.txt and crawl-delay directives.
  • Medium · Alpine Base Image Missing Security Updates — Dockerfile (line 10). The runtime Dockerfile uses alpine:3.23.4, which may be outdated or carry unpatched vulnerabilities. It also installs chromium, a large, complex dependency with its own security exposure. Fix: Use the latest stable Alpine release. Run 'apk update && apk upgrade' in the build stage. Regularly update base images in CI/CD. Consider distroless images for a smaller attack surface.
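
Referenced from the SSRF finding above: a minimal private-range guard using only the standard library (net.IP.IsPrivate needs Go 1.17+). Function names are illustrative, and a production version must also pin the resolved IP in the dialer, since re-resolving at connect time is vulnerable to DNS rebinding.

```go
package main

import (
	"fmt"
	"net"
)

// blockedIP reports whether an address falls in a range a crawler should
// refuse by default: loopback, RFC 1918 private, link-local, or unspecified.
func blockedIP(ip net.IP) bool {
	return ip.IsLoopback() ||
		ip.IsPrivate() || // 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
		ip.IsLinkLocalUnicast() || // 169.254.0.0/16
		ip.IsUnspecified()
}

// safeToCrawl resolves a host and rejects it if any returned address is blocked.
func safeToCrawl(host string) (bool, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return false, err
	}
	for _, ip := range ips {
		if blockedIP(ip) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := safeToCrawl("localhost")
	fmt.Println(ok, err) // false <nil>: loopback is rejected
}
```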

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
