
minimaxir/big-list-of-naughty-strings

The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data.

Mixed

Stale — last commit 2y ago

Use as dependency: Mixed (weakest axis)

last commit was 2y ago; no tests detected…

Fork & modify: Mixed

no tests detected; no CI workflows detected…

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Mixed

last commit was 2y ago; no CI workflows detected

  • 35+ active contributors
  • Distributed ownership (top contributor 46% of recent commits)
  • MIT licensed
  • Stale — last commit 2y ago
  • No CI workflows detected
  • No test directory detected
What would change the summary?
  • Use as dependency: Mixed → Healthy if: 1 commit in the last 365 days; add a test suite
  • Fork & modify: Mixed → Healthy if: add a test suite
  • Deploy as-is: Mixed → Healthy if: 1 commit in the last 180 days

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Great to learn from" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Great to learn from](https://repopilot.app/api/badge/minimaxir/big-list-of-naughty-strings?axis=learn)](https://repopilot.app/r/minimaxir/big-list-of-naughty-strings)

Paste at the top of your README.md — renders inline like a shields.io badge.


Onboarding doc

Onboarding: minimaxir/big-list-of-naughty-strings

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/minimaxir/big-list-of-naughty-strings shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Stale — last commit 2y ago

  • 35+ active contributors
  • Distributed ownership (top contributor 46% of recent commits)
  • MIT licensed
  • ⚠ Stale — last commit 2y ago
  • ⚠ No CI workflows detected
  • ⚠ No test directory detected


Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live minimaxir/big-list-of-naughty-strings repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/minimaxir/big-list-of-naughty-strings.

What it runs against: a local clone of minimaxir/big-list-of-naughty-strings — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in minimaxir/big-list-of-naughty-strings | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 779 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>minimaxir/big-list-of-naughty-strings</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of minimaxir/big-list-of-naughty-strings. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/minimaxir/big-list-of-naughty-strings.git
#   cd big-list-of-naughty-strings
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of minimaxir/big-list-of-naughty-strings and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "minimaxir/big-list-of-naughty-strings(\.git)?\b" \
  && ok "origin remote is minimaxir/big-list-of-naughty-strings" \
  || miss "origin remote is not minimaxir/big-list-of-naughty-strings (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
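# -w matches MIT as a whole word anywhere in the file, because MIT license
# files often open with "The MIT License" rather than with "MIT" itself.
(grep -qiwE "MIT" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \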
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 4. Critical files exist
test -f "blns.txt" \
  && ok "blns.txt" \
  || miss "missing critical file: blns.txt"
test -f "blns.json" \
  && ok "blns.json" \
  || miss "missing critical file: blns.json"
test -f "naughtystrings/__init__.py" \
  && ok "naughtystrings/__init__.py" \
  || miss "missing critical file: naughtystrings/__init__.py"
test -f "naughtystrings/naughtystrings.go" \
  && ok "naughtystrings/naughtystrings.go" \
  || miss "missing critical file: naughtystrings/naughtystrings.go"
test -f "scripts/txt_to_json.py" \
  && ok "scripts/txt_to_json.py" \
  || miss "missing critical file: scripts/txt_to_json.py"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 779 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~749d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/minimaxir/big-list-of-naughty-strings"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>
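
Because every check prints a line beginning with ok: or FAIL:, an agent loop can parse the captured output rather than rely on the exit code alone. The following is a minimal Python sketch under that assumption; summarize_verify_output is a hypothetical helper name, not part of RepoPilot.

```python
def summarize_verify_output(output: str) -> dict:
    """Split captured verify-script output into passed and failed checks."""
    passed, failed = [], []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("ok:"):
            passed.append(line[len("ok:"):].strip())
        elif line.startswith("FAIL:"):
            failed.append(line[len("FAIL:"):].strip())
    # Only trust the artifact if at least one check ran and none failed.
    return {"passed": passed, "failed": failed, "trusted": bool(passed) and not failed}


# Hypothetical captured output, shaped like the script's messages.
sample = """ok:   origin remote is minimaxir/big-list-of-naughty-strings
ok:   license is MIT
FAIL: missing critical file: blns.json
"""
result = summarize_verify_output(sample)
print(result["trusted"], result["failed"])
```

Treating empty output as untrusted is a deliberate choice in this sketch: a script that printed nothing most likely did not run at all.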

TL;DR

The Big List of Naughty Strings is a curated collection of strings designed to break or exploit edge cases in software, distributed as blns.txt, blns.json, and blns.base64.* files. It serves as a QA testing dataset for catching input-handling failures (such as zero-width space crashes on Twitter) before they reach production, with library implementations in Python, Go, Node, and Shell.

Monorepo structure: blns.txt at the root is the canonical source (newline-delimited, with comments), and blns.json is generated from it by scripts/txt_to_json.py. The naughtystrings/ directory contains the Go package (naughtystrings.go for the core API, naughtystrings_test.go for its tests) and the Python wrapper (naughtystrings/__init__.py). Builds are orchestrated by the root Makefile and naughtystrings/Makefile.

👥Who it's for

QA engineers and developers performing both automated and manual testing who need comprehensive test inputs for edge cases; security researchers validating input sanitization; package maintainers integrating test data into testing frameworks across multiple languages.

🌱Maturity & risk

Mature and stable as a dataset: the project is a well-established reference list (the README mentions npm, NuGet, and RubyGems packages) with library wrappers in several languages. The Go implementation includes a test file (naughtystrings_test.go) and a proper module structure, although no test directory or CI workflow was detected. Verdict: usable as a reference resource, but not actively maintained (last commit about 2 years ago).

Low risk for a reference dataset: recent pull requests were all merged by one maintainer (minimaxir), which creates some maintenance dependency even with 35+ contributors, but the data itself is static and non-breaking by design (the README explicitly discourages very long strings and problematic content such as null characters to preserve manual usability). The core datasets have no external dependencies, and no dependencies are visible in the Go module's go.mod.

Active areas of work

No recent activity data provided in the repo snapshot, but the project maintains synchronization across three formats (txt, json, base64) — any string addition requires updates to all three files per README contribution guidelines. The scripts/ folder contains transformation pipelines (txt_to_json.py, texttobase64.sh) that are re-run on updates.

🚀Get running

git clone https://github.com/minimaxir/big-list-of-naughty-strings.git
cd big-list-of-naughty-strings
make  # Regenerates all derived formats from blns.txt

For Go: cd naughtystrings && go test ./... (run from the repo root). For Python: import naughtystrings after installing the package from PyPI with pip.

Daily commands: This is a static dataset repo, not a runnable service. To regenerate all formats from source:

make

To run tests:

cd naughtystrings && go test -v ./...

To use programmatically: import naughtystrings in Go or Python after installation.
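
As an alternative to the language packages, the dataset can be read directly with the standard library. The sketch below assumes blns.json is a flat JSON array of strings (check this against your clone); load_naughty_strings and find_crashes are hypothetical helper names, and the demo writes a small stand-in file instead of reading the real blns.json.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def load_naughty_strings(path: str) -> List[str]:
    """Read a JSON array of strings (the assumed layout of blns.json)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list) or not all(isinstance(s, str) for s in data):
        raise ValueError("expected a flat JSON array of strings")
    return data


def find_crashes(func: Callable[[str], object], strings: List[str]) -> List[Tuple[str, Exception]]:
    """Call func on every string and collect the inputs that raise."""
    crashes = []
    for s in strings:
        try:
            func(s)
        except Exception as exc:  # broad on purpose: any crash is a finding
            crashes.append((s, exc))
    return crashes


# Demo with a stand-in file; point the loader at blns.json in a real clone.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_text(json.dumps(["", "1/0", "123"]), encoding="utf-8")
    strings = load_naughty_strings(str(sample_path))

crashes = find_crashes(int, strings)
for bad_input, exc in crashes:
    print(repr(bad_input), type(exc).__name__)
```

In a real test suite, func would be the parser, form handler, or serializer under test rather than int.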

🗺️Map of the codebase

  • blns.txt — The authoritative source list of naughty strings in human-readable format with comments; all other formats derive from this.
  • blns.json — Structured JSON export of naughty strings used by programmatic consumers and language bindings.
  • naughtystrings/__init__.py — Python package entry point that exposes the naughty strings list via the primary Python interface.
  • naughtystrings/naughtystrings.go — Go package implementation providing programmatic access to naughty strings for Go consumers.
  • scripts/txt_to_json.py — Build script that transforms blns.txt into blns.json; critical for maintaining consistency across formats.
  • README.md — Documents the rationale and usage patterns that guide all contributions and integrations.

🧩Components & responsibilities

  • blns.txt (Master Source) (Plain text, comment-based structure) — Authoritative, human-curated collection of problematic input strings organized by category
    • Failure mode: Accidental deletion or corruption leads to loss of test cases across all derived formats; requires version control recovery
  • txt_to_json.py (Transformation Engine) (Python, JSON serialization) — Parses blns.txt, extracts categories and strings, generates structured JSON and base64 exports
    • Failure mode: Parser bug or incomplete category extraction results in missing or malformed strings in JSON; test coverage should validate output completeness
  • blns.json (Primary Export) (JSON, flat or nested array structure) — Structured JSON format exposing all naughty strings with metadata; serves as distribution point for programmatic consumers
    • Failure mode: Corruption or format change breaks downstream consumers; should maintain backward compatibility
  • Python Binding (naughtystrings/__init__.py) (Python, optional embedded data or JSON loading) — Provides Pythonic API (e.g., list or generator) for test frameworks to access naughty strings
    • Failure mode: Import errors or missing embedded data make module unavailable; should handle missing JSON gracefully
  • Go Binding (naughtystrings.go) (Go, go.mod, resource embedding or code generation) — Exposes naughty strings as Go types/functions, distributable via go get; intended for Go testing suites
    • Failure mode: Build errors or missing embedded data prevent module installation; should validate resource availability at compile time

🔀Data flow

  • blns.txt → txt_to_json.py — Build-time: txt_to_json.py reads and parses blns.txt to extract category metadata and string entries
  • txt_to_json.py → blns.json — Build-time: txt_to_json.py generates structured JSON export with category keys and string arrays
  • txt_to_json.py → blns.base64.json — Build-time: txt_to_json.py generates a base64-encoded JSON variant for systems that require encoded input
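
The text-to-JSON flow can be illustrated with a short Python sketch. This is not the contents of scripts/txt_to_json.py: it assumes that lines starting with # are comments, that blank lines are separators, and that the output is a flat JSON array. The name parse_blns_text is hypothetical.

```python
import json
from typing import List


def parse_blns_text(text: str) -> List[str]:
    """Keep every line that is neither blank nor a # comment."""
    strings = []
    # split("\n") rather than splitlines(): splitlines() also breaks on
    # characters such as U+2028, which may themselves be test strings.
    for line in text.split("\n"):
        if not line or line.startswith("#"):
            continue
        strings.append(line)
    return strings


sample = "#\tReserved Strings\n#\n\nundefined\nnull\n\n#\tNumeric Strings\n\n0\n1.00\n"
strings = parse_blns_text(sample)
# ensure_ascii=False keeps non-ASCII test strings readable in the output.
print(json.dumps(strings, ensure_ascii=False, indent=2))
```

Lines made up only of whitespace are kept on purpose, since whitespace-only input is itself a useful test case.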

🛠️How to make changes

Add a new naughty string to the master list

  1. Edit blns.txt and add your string under the appropriate category comment or create a new one (blns.txt)
  2. Run the transformation script to regenerate all export formats from the updated source (scripts/txt_to_json.py)
  3. Verify blns.json and base64 variants were updated correctly (blns.json)

Create a new language binding for naughty strings

  1. Read from blns.json as the canonical data source in your new language directory (blns.json)
  2. Implement a module following the pattern used in naughtystrings/__init__.py (return list of strings from embedded/fetched JSON) (naughtystrings/__init__.py)
  3. Write test suite following the pattern in naughtystrings/naughtystrings_test.go to validate string count and basic correctness (naughtystrings/naughtystrings_test.go)
  4. Add package metadata (e.g., go.mod or package.json equivalent) and publish to language registry (package.json)

Regenerate data exports after manual edits

  1. Ensure blns.txt contains all changes in the correct format (strings with # prefixes for category comments) (blns.txt)
  2. Execute txt_to_json.py to parse blns.txt and generate structured JSON (scripts/txt_to_json.py)
  3. Run texttobase64.sh to generate base64-encoded variants (scripts/texttobase64.sh)
  4. Verify all three derived files (blns.json, blns.base64.json, blns.base64.txt) are updated (blns.json)
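
To make the base64 step concrete, here is a Python sketch of the encoding it implies. It is not a port of scripts/texttobase64.sh; it assumes each string is UTF-8 encoded and then base64 encoded on its own, and to_base64 and from_base64 are hypothetical names.

```python
import base64
from typing import List


def to_base64(strings: List[str]) -> List[str]:
    """Encode each string separately so entries stay one-to-one."""
    return [base64.b64encode(s.encode("utf-8")).decode("ascii") for s in strings]


def from_base64(encoded: List[str]) -> List[str]:
    """Reverse to_base64; used to confirm nothing was lost."""
    return [base64.b64decode(e).decode("utf-8") for e in encoded]


originals = ["undefined", "Ω≈ç√∫", "<script>alert(1)</script>"]
encoded = to_base64(originals)
round_tripped = from_base64(encoded)
print(encoded[0])
print(round_tripped == originals)
```

A round-trip comparison like this is a cheap way to confirm that a regenerated base64 file still corresponds to the JSON export.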

🔧Why these technologies

  • Plain text (blns.txt) as canonical source — Human-readable, version-control-friendly, comment-based organization enables category browsing and manual QA copy-paste workflows
  • JSON export formats (blns.json) — Enables programmatic consumption across web, mobile, and scripting ecosystems without parsing complexity
  • Base64-encoded variants (blns.base64.*) — Provides pre-encoded strings for systems that require encoded input or need to avoid encoding complexity
  • Python and Go language bindings — Reduces friction for testing frameworks in these languages; distributed via native package managers (pip, go get)

⚖️Trade-offs already made

  • Single source of truth (blns.txt) with derived exports

    • Why: Simplifies maintenance and prevents divergence between formats
    • Consequence: Requires running transformation scripts after any edit; adds build step but prevents format desynchronization
  • No real-time fetching or API server

    • Why: Keeps project lightweight and distribution-agnostic; suitable for static QA testing resources
    • Consequence: Consumers embed or locally cache the list; cannot push updates to live installations without consumer-side refresh
  • Categorical organization via text comments

    • Why: Makes the list readable and manually navigable for QA engineers copying test cases into forms
    • Consequence: Requires custom parsing in txt_to_json.py; JSON export structure must preserve category metadata

🚫Non-goals (don't propose these)

  • Real-time API service for consuming naughty strings
  • Hosting or deployment infrastructure
  • Authentication or access control
  • Machine learning or automatic string generation
  • Language-specific validation (applies strings only; does not interpret or sanitize them)
  • Support for systems that cannot load JSON or base64 data

🪤Traps & gotchas

  • Synchronization requirement: Adding or removing a string in blns.txt requires matching updates to blns.json and the base64 files. The Makefile should orchestrate this, but verify that all three formats are in sync.
  • File encoding: blns.txt is UTF-8 and must remain so (the README warns against encoding changes in PRs).
  • Null byte prohibition: Do not add strings containing U+0000; they make GitHub treat the file as binary and break readability.
  • String length limit: Contributions should not exceed 255 characters (per the README) to preserve manual usability.
  • No EICAR test string: Explicitly forbidden due to antivirus scanner false positives.
  • Go module: go.mod is expected to stay clean, with no external dependencies (keep it that way).
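
Several of these rules can be checked mechanically before opening a PR. The Python sketch below covers only the null-byte, length, and UTF-8 rules as stated above; it is not a tool shipped with the repo, contribution_problems is a hypothetical name, and the EICAR rule is left to manual review.

```python
from typing import List

MAX_LENGTH = 255  # limit as stated in the gotchas above


def contribution_problems(candidate: str) -> List[str]:
    """Return the reasons a candidate string should not be added."""
    problems = []
    if "\x00" in candidate:
        problems.append("contains a null character (U+0000)")
    if len(candidate) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    try:
        candidate.encode("utf-8")
    except UnicodeEncodeError:
        # e.g. a lone surrogate, which cannot be stored in a UTF-8 file
        problems.append("cannot be encoded as UTF-8")
    return problems


for candidate in ["Robert'); DROP TABLE Students;--", "a\x00b", "x" * 300]:
    print(len(candidate), contribution_problems(candidate))
```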

💡Concepts to learn

  • Zero-width Unicode characters — BLNS's canonical motivating example (zero-width space U+200B breaking Twitter) — understanding invisible Unicode is essential for validating string handling, regex matching, and display edge cases.
  • SQL injection and parameterized queries — BLNS includes SQL fragment strings to catch unsafe string interpolation; knowing why single-quotes and -- are dangerous is crucial context for QA testing.
  • Null byte injection (NUL / U+0000) — BLNS explicitly forbids null-byte strings in contributions because they change GitHub's file format detection to binary; understanding this edge case is essential for file handling and string truncation vulnerabilities.
  • Unicode normalization (NFD/NFC/NFKC/NFKD) — Different Unicode forms can represent the same character visually but differ in byte representation; BLNS likely includes normalization variants to catch string comparison and sanitization failures.
  • Regular expression denial of service (ReDoS) — BLNS includes strings designed to exploit pathological regex backtracking (e.g., repeated patterns); testing against these strings reveals performance and security vulnerabilities in input validation.
  • Command injection and shell metacharacters — BLNS contains shell operators and escape sequences to catch unsafe command construction; essential for validating that string inputs don't break out of intended execution context.
  • Base64 encoding and binary safety — BLNS provides base64-encoded variants (blns.base64.json, blns.base64.txt) for testing systems that handle encoded input; understanding when to encode/decode prevents injection vectors and data corruption.

Related repos

  • OWASP/owasp-testing-guide — Comprehensive security testing methodology that relies on payload lists like BLNS for injection and validation testing.
  • payloadbox/sql-injection-payload-list — Specialized SQL injection payload collection; complementary to BLNS for deep-dive security testing of specific vulnerability classes.
  • minimaxir/big-list-of-naughty-strings-javascript — Parallel implementation providing JavaScript/Node.js bindings and examples specific to the BLNS dataset.
  • fuzzdb-project/fuzzdb — Broader fuzzing database with similar goals (edge-case discovery) but larger scope covering protocol formats, file types, and command injection vectors.
  • SecLists/SecLists — Massive collection of security testing wordlists and payloads; BLNS is narrower and more curated, but SecLists offers complementary domain-specific lists (usernames, passwords, file paths).
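
Two of the concepts listed above, zero-width characters and Unicode normalization, are easy to demonstrate with Python's standard unicodedata module. The sketch below uses illustrative strings chosen for this example, not entries quoted from blns.txt, and strip_format_chars is a hypothetical helper.

```python
import unicodedata

visible = "admin"
sneaky = "ad\u200bmin"  # zero-width space (U+200B) hidden inside the word

# The two strings render alike but compare as different values.
print(visible == sneaky, len(visible), len(sneaky))


def strip_format_chars(s: str) -> str:
    """Drop characters in Unicode category Cf (format), which includes U+200B."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")


composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# Same visible text, different code points until both are normalized.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Whether to strip or reject such characters depends on the application; the point is that naive equality and length checks do not see them.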

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive test coverage for naughtystrings Go package

The naughtystrings/naughtystrings_test.go exists but likely has incomplete coverage. Given that this is a data-heavy library with multiple output formats (json, txt, base64), tests should verify: (1) all naughty strings load correctly from the embedded resource in naughtystrings/internal/resource.go, (2) string counts match across blns.json and blns.txt, (3) base64 encoding/decoding round-trip correctly for blns.base64.json and blns.base64.txt. This ensures data integrity across formats.

  • [ ] Expand naughtystrings/naughtystrings_test.go with table-driven tests for each string category
  • [ ] Add test that verifies string counts in blns.json match naughtystrings.go exported functions
  • [ ] Add integration test validating blns.base64.json decodes to match blns.json content
  • [ ] Run tests in CI to prevent data sync issues in future releases

Add Python package integration tests for naughtystrings/__init__.py

The naughtystrings/__init__.py module exists but there are no visible Python-specific tests. The Python package should be tested to ensure: (1) strings load correctly from the embedded blns.json, (2) the module exports match the Go version, (3) it works with Python 3.x standard library. This is important since naughtystrings is also published as a PyPI package (visible in package.json references).

  • [ ] Create naughtystrings/tests/ directory with test_naughtystrings.py
  • [ ] Add tests verifying all string categories load and match expected counts from blns.json
  • [ ] Add test for Python 3.6+ compatibility (f-strings, type hints if present)
  • [ ] Ensure Makefile target or GitHub Action runs Python tests on each commit

Create CI workflow to validate data consistency across all formats

This repo maintains 4 data formats (blns.json, blns.txt, blns.base64.json, blns.base64.txt) but has no visible automated validation that they stay in sync. The scripts/txt_to_json.py and scripts/texttobase64.sh exist but their output is never verified in CI. A GitHub Action should: (1) re-generate all formats from a source of truth, (2) compare against committed files, (3) fail if any format is out of date. This prevents accidental data corruption or format mismatches.

  • [ ] Create .github/workflows/validate-data-formats.yml GitHub Action
  • [ ] Run scripts/txt_to_json.py and scripts/texttobase64.sh in CI and diff against committed files
  • [ ] Add step to validate blns.json is valid JSON and blns.txt is UTF-8 encoded
  • [ ] Document in README.md how contributors should regenerate data formats before committing
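
The comparison step of such a workflow could look like the Python sketch below. It assumes that blns.json holds exactly the non-comment, non-blank lines of blns.txt in order and that blns.base64.json holds the same entries base64 encoded; if the real scripts add or transform entries, the comparison needs adjusting. find_format_drift is a hypothetical name.

```python
import base64
import json
from typing import List


def find_format_drift(txt_text: str, json_text: str, b64_json_text: str) -> List[str]:
    """Describe every mismatch between the text, JSON, and base64 JSON forms."""
    # Assumption: data lines are those that are neither blank nor # comments.
    expected = [ln for ln in txt_text.split("\n") if ln and not ln.startswith("#")]
    from_json = json.loads(json_text)
    from_b64 = [base64.b64decode(item).decode("utf-8") for item in json.loads(b64_json_text)]

    problems = []
    if from_json != expected:
        problems.append("JSON export does not match the text source")
    if from_b64 != from_json:
        problems.append("base64 export does not decode to the JSON export")
    return problems


txt = "# Reserved Strings\n\nundefined\nnull\n"
good_json = json.dumps(["undefined", "null"])
good_b64 = json.dumps(
    [base64.b64encode(s.encode("utf-8")).decode("ascii") for s in ["undefined", "null"]]
)
stale_json = json.dumps(["undefined"])  # simulates a forgotten regeneration

in_sync = find_format_drift(txt, good_json, good_b64)
drifted = find_format_drift(txt, stale_json, good_b64)
print(in_sync, drifted)
```

A CI job would read the three committed files, call a function like this, and exit non-zero when the returned list is not empty.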

🌿Good first issues

  • Add comprehensive test coverage to naughtystrings/__init__.py — currently only naughtystrings_test.go exists for Go; Python users deserve parity with unit tests validating string counts and category membership.
  • Create a naughtystrings/naughtystrings.test.js (or similar) for the Node.js package or reference implementation — README lists npm as a distribution target but no test file is visible in the repo structure.
  • Document the section categories in blns.txt with a mapping table in README.md — readers should know what each comment-delimited section tests for (SQL injection, Unicode issues, etc.) without parsing the raw file.

📝Recent commits

  • db33ec7 — Merge pull request #226 from caasi/patch-1 (minimaxir)
  • 18a8898 — Index XSS strings (caasi)
  • 894882e — Merge pull request #211 from doroshenko/master (minimaxir)
  • f356d4d — Merge pull request #210 from tryauuum/master (minimaxir)
  • b2eada7 — Added emoji zwj sequences (Dmytro Doroshenko)
  • 0d5fd11 — added jinja2 injections (tryauuum)
  • 9c25300 — Merge pull request #209 from xeroskiller/patch-1 (minimaxir)
  • 494b425 — Update blns.json (xeroskiller)
  • a9bae33 — Added tSQL-specific injection string (xeroskiller)
  • ff8b1b2 — Merge pull request #206 from mattsparks/master (minimaxir)

🔒Security observations

This is a legitimate security testing utility repository with minimal security concerns. The project is a curated list of test strings designed specifically for QA and security testing. No hardcoded secrets, insecure dependencies, or dangerous code patterns were identified in the provided file structure and content. The repository follows good practices with clear documentation and a permissive license. The main consideration is ensuring users understand this is testing data meant for authorized security validation only, not for malicious purposes.

  • Low · Test Data Repository Public Access — blns.json, blns.txt, blns.base64.json, blns.base64.txt. The repository contains a list of naughty strings and test vectors that could potentially be used to identify common input validation patterns. While this is by design for QA testing purposes, the public availability of these test cases could assist attackers in crafting payloads. Fix: Document in README that this is intended for authorized testing only. Consider adding terms of use guidance. Ensure this test data is only used in non-production environments.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
