digininja/CeWL

Item: digininja/CeWL
Rating: 3
Author: RepoPilot

CeWL is a Custom Word List Generator

Mixed

Missing license — unclear to depend on

worst of 4 axes

Use as dependencyConcerns

no license — legally unclear; no tests detected

Fork & modifyConcerns

no license — can't legally use code; no tests detected

Learn fromHealthy

Documented and popular — useful reference codebase to read through.

Deploy as-isConcerns

no license — can't legally use code

✓Last commit 1d ago
✓8 active contributors
✓CI configured

Show 3 more →

⚠Concentrated ownership — top contributor handles 74% of recent commits
⚠No license — legally unclear to depend on
⚠No test directory detected

What would change the summary?

→Use as dependency Concerns → Mixed if: publish a permissive license (MIT, Apache-2.0, etc.)
→Fork & modify Concerns → Mixed if: add a LICENSE file
→Deploy as-is Concerns → Mixed if: add a LICENSE file

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Great to learn from" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Great to learn from](https://repopilot.app/api/badge/digininja/cewl?axis=learn)](https://repopilot.app/r/digininja/cewl)

Paste at the top of your README.md — renders inline like a shields.io badge.

▸Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/digininja/cewl on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: digininja/CeWL

Generated by RepoPilot · 2026-05-10 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/digininja/CeWL shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Missing license — unclear to depend on

Last commit 1d ago
8 active contributors
CI configured
⚠ Concentrated ownership — top contributor handles 74% of recent commits
⚠ No license — legally unclear to depend on
⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live digininja/CeWL repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/digininja/CeWL.

What it runs against: a local clone of digininja/CeWL — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in digininja/CeWL | Confirms the artifact applies here, not a fork | | 2 | Default branch master exists | Catches branch renames | | 3 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 4 | Last commit ≤ 31 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>digininja/CeWL</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of digininja/CeWL. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/digininja/CeWL.git
#   cd CeWL
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of digininja/CeWL and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "digininja/CeWL(\\.git)?\\b" \\
  && ok "origin remote is digininja/CeWL" \\
  || miss "origin remote is not digininja/CeWL (artifact may be from a fork)"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \\
  && ok "default branch master exists" \\
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "cewl.rb" \\
  && ok "cewl.rb" \\
  || miss "missing critical file: cewl.rb"
test -f "cewl_lib.rb" \\
  && ok "cewl_lib.rb" \\
  || miss "missing critical file: cewl_lib.rb"
test -f "fab.rb" \\
  && ok "fab.rb" \\
  || miss "missing critical file: fab.rb"
test -f "Gemfile" \\
  && ok "Gemfile" \\
  || miss "missing critical file: Gemfile"
test -f "Dockerfile" \\
  && ok "Dockerfile" \\
  || miss "missing critical file: Dockerfile"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 31 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~1d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/digininja/CeWL"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

⚡TL;DR

CeWL is a Ruby-based web spider that crawls a target website to a configurable depth, extracts unique words from page content and metadata, and generates custom wordlists for password cracking tools like John the Ripper. It also includes FAB (Files Already Bagged), a companion tool that extracts author/creator metadata from already-downloaded files using exiftool. Monolithic structure: cewl.rb is the main CLI entry point that imports cewl_lib.rb (the core spider logic), fab.rb provides the metadata extraction utility, and Gemfile pins all gem dependencies. Docker support via Dockerfile and compose.yml allows containerized execution without local Ruby setup.

👥Who it's for

Penetration testers and security professionals who need to generate custom wordlists tailored to a specific target organization by harvesting words from that organization's website, enabling more effective dictionary-based password attacks than generic wordlists.

🌱Maturity & risk

Actively maintained but modest in scale—the Ruby codebase is ~44KB with a single Dockerfile for containerization, suggesting a stable, focused tool rather than a large project. The presence of a documented changelog and Docker support indicates production readiness for its specific niche, though commit recency data is not visible in the provided metadata.

Low architectural risk due to its focused, single-purpose design; however, it depends on 8 external gems (nokogiri, spider, mini_exiftool, rubyzip, public_suffix, mime-types, mime, getoptlong) and requires the system-level exiftool binary, creating a moderate supply-chain dependency surface. Single maintainer (Robin Wood) is a known risk factor for community-driven contributions.

Active areas of work

No specific recent activity is visible in the provided file list, but the presence of .github/workflows/docker-image.yml suggests active CI/CD for Docker builds. The changelog.md file indicates historical maintenance, though its most recent entries are not shown.

🚀Get running

git clone https://github.com/digininja/CeWL.git
cd CeWL
bundle install
chmod u+x ./cewl.rb
./cewl.rb --help

Daily commands:

./cewl.rb https://target.com
# or with Docker:
docker compose up
# or specific arguments:
./cewl.rb -d 3 -w output.txt https://target.com

🗺️Map of the codebase

cewl.rb — Main entry point and CLI handler; every contributor must understand how command-line arguments are parsed and routed to the core library.
cewl_lib.rb — Core web spider and word extraction logic; this is the load-bearing abstraction that implements URL crawling, HTML parsing, and word collection.
fab.rb — Companion CLI tool for metadata extraction from downloaded files; demonstrates how to reuse cewl_lib patterns for a separate use case.
Gemfile — Ruby dependency manifest; contributors must check this to understand required gems and compatibility constraints.
Dockerfile — Container build definition; essential for understanding deployment, Ruby version, and runtime environment setup.
README.md — Project overview and usage guide; foundational reference for understanding CeWL's purpose, behavior, and configuration options.
changelog.md — Version history and breaking changes; critical for understanding deprecated features and evolution of the API.

🧩Components & responsibilities

cewl.rb (CLI Handler) (Ruby stdlib (OptionParser, File I/O)) — Parses command-line arguments, initializes library with user options, coordinates output streams.
- Failure mode: Invalid arguments cause usage message; uncaught exceptions terminate with error code.
cewl_lib.rb (Spider & Extractor) (Ruby HTTP client, HTML parser (Nokogiri or similar), regex) — Core recursion engine: fetches URLs, parses HTML, extracts words and links, applies filters, returns word list.
- Failure mode: Network timeouts or 404s skip that URL and continue; malformed HTML gracefully degrades; bad regex filters silently skip words.
fab.rb (File Metadata Extractor) (File I/O, metadata extraction libraries (document parsers)) — Scans local files (PDF, DOCX, etc.) for embedded metadata (author, creator, timestamps); outputs author lists.
- Failure mode: Unsupported file types skipped; corrupted files logged and continue; missing metadata fields ignored.
Output Handler (Ruby File I/O, stdout redirection) — Writes deduplicated word list to stdout or file; formats as plain text (one word per line).
- Failure mode: Disk full or permission denied raises exception; stdout pipe broken terminates process.

🔀Data flow

User (CLI args) → cewl.rb — Target URL, depth, follow-external, output file, word length, exclude patterns.
cewl.rb → cewl_lib.rb — Configuration object with crawl parameters and callbacks for word collection.
cewl_lib.rb → HTTP Client — URLs to fetch; receives raw HTML responses.
cewl_lib.rb → HTML Parser — Raw HTML; receives parsed DOM with text nodes and href attributes.
cewl_lib.rb → Word Filter — Extracted words; receives filtered words that meet length/pattern criteria.
cewl_lib.rb → Output Handler — Deduplicated word list; written to file or stdout.
Output Handler → User (stdout/file) — Plain text word list,

🛠️How to make changes

Add a new word filtering rule

Open cewl_lib.rb and locate the word collection/filtering section where minimum length and character validation occur (cewl_lib.rb)
Add a new conditional check (e.g., regex pattern, word list exclusion) before the word is added to the output array (cewl_lib.rb)
Update cewl.rb CLI argument parsing to accept a new option flag (e.g., --exclude-pattern) if user control is needed (cewl.rb)
Test by running cewl.rb with the new filter on a small target site and verify words are correctly filtered (cewl.rb)

Extend spider depth and crawling behavior

Open cewl.rb and review the depth and follow-external-links argument handlers (cewl.rb)
Modify the depth recursion logic in cewl_lib.rb to implement new traversal rules (e.g., domain whitelist, URL pattern matching) (cewl_lib.rb)
Add new CLI flags in cewl.rb to expose the configurable parameters to end users (cewl.rb)
Update README.md with examples of the new crawling behavior (README.md)

Add metadata extraction for a new file type in FAB

Open fab.rb and identify where file type detection and metadata parsing occur (fab.rb)
Add a new parser method or delegate to cewl_lib.rb for the target file format (e.g., PDF, DOCX) (fab.rb)
Implement extraction logic to pull author, creator, or other metadata fields (fab.rb)
Test with sample files and verify metadata is correctly extracted and output in expected format (fab.rb)

🔧Why these technologies

Ruby — Cross-platform scripting language ideal for rapid prototyping; strong standard library for HTTP and HTML parsing; widely used in security tools.
Docker — Ensures consistent runtime across development and production environments; eliminates Ruby version and dependency conflicts; simplifies distribution.
HTTP + HTML Parsing — Standard web technologies required to spider arbitrary websites; parsing HTML extracts text and links for recursive crawling.

⚖️Trade-offs already made

Single-threaded sequential crawling by default
- Why: Simplifies implementation and avoids race conditions; reduces resource overhead for small to medium sites.
- Consequence: Slow on large, deep sites; users must accept longer execution times or manually implement external parallelization.
Minimum word length of 3 characters as default
- Why: Reduces noise in word lists (many 1-2 char words are not useful for password cracking); balances list size vs. utility.
- Consequence: May miss valid short words or acronyms; users must adjust threshold if specific words are needed.
Opt-in external link following
- Why: Prevents uncontrolled scope creep and drift into unrelated domains; protects against accidental resource exhaustion.
- Consequence: Requires explicit user awareness to explore beyond target site; may require multiple runs for comprehensive coverage.
Words output to stdout (or optionally to file)
- Why: Allows piping directly into other tools (grep, sort, john, etc.); Unix philosophy of composable utilities.
- Consequence: Large word lists printed to terminal can be unwieldy; requires redirection to file for most production use.

🚫Non-goals (don't propose these)

Does not handle HTTPS certificate validation or advanced SSL/TLS configurations—relies on Ruby's default OpenSSL.
Does not support JavaScript rendering or dynamic DOM content—parses static HTML only.
Does not implement authentication (login, cookies, session handling) for access-controlled pages.
Does not provide GUI or web interface—CLI-only tool.
Does not perform concurrent/parallel crawling by default.
Not designed for ethical penetration testing coordination or reporting—a raw data collection tool only.

🪤Traps & gotchas

The mini_exiftool gem requires the system exiftool binary to be installed separately (not included in gems)—Docker users are protected by the Dockerfile, but bare-metal installs must run apt-get install exiftool or equivalent. Ruby 2.7 produces spurious warnings from the mime-types gem's logger (see README)—this is cosmetic but tests may fail if running with strict warning levels. CeWL respects robots.txt by default via the Spider gem, but deep crawls with --follow-external can drift off-site uncontrollably, creating legal/ethical risk for the user.

🏗️Architecture

💡Concepts to learn

Web Crawling & Spidering — CeWL's core function relies on the Spider gem to traverse links up to a configurable depth; understanding breadth-first vs. depth-first traversal and robots.txt compliance is essential to predict crawl behavior
DOM Parsing & XPath/CSS Selectors — Nokogiri (the gem CeWL uses) parses HTML into a DOM tree; contributors need to understand how Nokogiri extracts text nodes and filters by selector to modify word extraction logic
Metadata Extraction (Exiftool) — FAB's key feature relies on mini_exiftool to read EXIF, IPTC, and XMP metadata from files; understanding what metadata exiftool can extract informs wordlist quality from document author names and keywords
Public Suffix Lists — The public_suffix gem helps CeWL identify domain boundaries (e.g., distinguish example.co.uk as one domain, not two parts), preventing unintended external link following
Dictionary-Based Password Attacks — CeWL's entire purpose is to generate custom wordlists for tools like John the Ripper; understanding how dictionary attacks work and why target-specific wordlists are more effective than generic lists motivates the tool's design
Command-Line Argument Parsing (GetoptLong) — CeWL uses Ruby's GetoptLong to handle CLI flags (--depth, --follow-external, --write); contributors adding features must extend this argument parser
Container Orchestration (Docker & Docker Compose) — The Dockerfile and compose.yml enable reproducible execution; understanding how the Dockerfile installs Ruby gems and exiftool dependency helps with debugging containerized runs

zaproxy/zaproxy — OWASP ZAP is a full-featured security testing suite that includes spidering and can generate wordlists, but CeWL is more lightweight and wordlist-focused
danielmiessler/SecLists — Curated collection of security wordlists for testing; CeWL generates target-specific wordlists that can complement or replace generic SecLists entries
Corelan/mona — Another Ruby-based pentesting utility from the community; CeWL shares the same ecosystem and use case (custom wordlist generation for password attacks)
hashicorp/vagrant — Not directly related, but CeWL users often pair it with Vagrant for portable penetration testing environments (works well with the provided Dockerfile)
sqlmap/sqlmap — Another Python/Ruby-era security tool in the same pentester toolkit ecosystem; some users chain CeWL output into SQLMap's wordlist features

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive RSpec test suite for cewl_lib.rb core functionality

The repo has cewl_lib.rb as the core library but there's no visible test directory structure in the file listing. Given this is a web spider with complex parsing logic, unit tests for URL handling, word extraction, depth control, and external link following would catch regressions and make refactoring safe. This is critical for a tool that modifies word lists.

[ ] Create spec/ directory structure with spec/spec_helper.rb
[ ] Add tests for core cewl_lib.rb methods: spider initialization, depth traversal, word filtering (3+ chars), external link handling
[ ] Add tests for edge cases: invalid URLs, circular redirects, timeout handling, malformed HTML
[ ] Update Gemfile to include rspec and webmock (for mocking HTTP responses)
[ ] Create .github/workflows/rspec.yml to run tests on push/PR

Add integration test GitHub Action for Docker image with real-world crawl validation

The repo has Dockerfile and docker-image.yml workflow, but the workflow likely only builds the image without validating it works correctly. Adding an integration test that spiders a test fixture site (or local test server) and validates output word list format would catch Docker-specific bugs and ensure the packaged app functions properly across versions.

[ ] Extend .github/workflows/docker-image.yml to add an integration test step after build
[ ] Create a lightweight test fixture (test/fixtures/test-site/) with sample HTML files or use a public test domain
[ ] Run the built Docker image against the fixture and validate: output contains expected words, respects --depth flag, respects --min-word-length flag
[ ] Add assertions for word count, word format, and exit codes
[ ] Document the test approach in README.md Testing section

Extract FAB (Files Already B...) functionality into separate fab_lib.rb and add CLI wrapper tests

The README mentions an associated command-line app FAB but the implementation is unclear from the file structure (fab.rb exists but is not described). Clarifying this by separating business logic (fab_lib.rb) from CLI logic (fab.rb) and adding unit tests for FAB functionality would improve maintainability and make the tool more composable, following the same pattern as cewl.rb/cewl_lib.rb.

[ ] Review fab.rb and identify core logic vs CLI argument parsing
[ ] Create fab_lib.rb with extracted business logic (similar structure to cewl_lib.rb)
[ ] Refactor fab.rb to use fab_lib.rb and handle only CLI concerns
[ ] Add spec/fab_lib_spec.rb with unit tests for FAB core functions
[ ] Update README.md with complete documentation of FAB feature and usage examples

🌿Good first issues

Add integration tests in a tests/ directory covering spider depth limits (currently no test files visible), focusing on validating that the Spider gem respects max-depth and external-link flags
Extend fab.rb to support additional metadata formats (e.g., PDF author extraction via pdf-reader gem) and document which file types are currently supported in README
Add a --exclude-pattern option to cewl.rb (via GetoptLong in cewl.rb and filtering in cewl_lib.rb) to allow users to skip words matching regex patterns, reducing false positives in generated wordlists

⭐Top contributors

Click to expand

@digininja — 74 commits
@dependabot[bot] — 13 commits
@Umair-khurshid — 6 commits
@loris-intergalactique — 2 commits
@cbrunnkvist — 2 commits

📝Recent commits