RepoPilot

benbusby/whoogle-search

A self-hosted, ad-free, privacy-respecting metasearch engine

Overall: Healthy — Healthy across the board

Use as dependency — Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 2w ago
  • 15 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • Concentrated ownership — top contributor handles 59% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — the badge updates live from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/benbusby/whoogle-search)](https://repopilot.app/r/benbusby/whoogle-search)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/benbusby/whoogle-search on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: benbusby/whoogle-search

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/benbusby/whoogle-search shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 2w ago
  • 15 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 59% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live benbusby/whoogle-search repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/benbusby/whoogle-search.

What it runs against: a local clone of benbusby/whoogle-search — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in benbusby/whoogle-search | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches a relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | Last commit ≤ 45 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>benbusby/whoogle-search</code></summary>
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of benbusby/whoogle-search. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/benbusby/whoogle-search.git
#   cd whoogle-search
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of benbusby/whoogle-search and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "benbusby/whoogle-search(\.git)?$" \
  && ok "origin remote is benbusby/whoogle-search" \
  || miss "origin remote is not benbusby/whoogle-search (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^MIT" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 45 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~15d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/benbusby/whoogle-search"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Whoogle is a self-hosted metasearch engine that strips Google search results of ads, JavaScript, AMP links, cookies, and IP tracking. It proxies queries through Google's search backend while serving clean HTML responses, letting users run privacy-respecting Google searches without client-side tracking. However, Google's aggressive blocking of non-JavaScript requests (escalating since early 2025) broke the core scraping mechanism, and the project shipped its final release in April 2026. Architecturally it is a monolithic Flask application: app/__init__.py bootstraps the server, app/routes.py handles HTTP endpoints, app/request.py performs the Google scraping via httpx, app/filter.py sanitizes HTML responses, and app/models/ defines configuration and search result schemas. Static assets (CSS themes, bangs JSON) live in app/static/, with Docker and environment-based configuration in root files.

👥Who it's for

Privacy-conscious users and system administrators who want to self-host a Google search alternative without surveillance; developers building personal search appliances or federated search infrastructure; users in restrictive network environments seeking a lightweight metasearch solution.

🌱Maturity & risk

The project was production-ready with Docker deployment, CI/CD pipelines (GitHub Actions for tests, buildx, PyPI), and active maintenance through its final release in April 2026. It is now deprecated and unmaintained: Google's JavaScript-blocking countermeasures break the core scraping mechanism, and no new feature development is planned unless a reliable User-Agent workaround emerges.

Critical: This project is end-of-life as of April 2026 — Google actively blocks requests from non-JavaScript clients, which is fundamental to Whoogle's operation. Production deployments will fail without either a custom Google CSE key or a hardcoded working User-Agent string (neither is reliably obtainable). Dependency surface is moderate (~30 packages via pip) with reasonable maintenance (beautifulsoup4, httpx, Flask), but no active security updates should be expected.

Active areas of work

The project is in maintenance mode with no active development. The final notice (April 2026) indicates that Google's JavaScript enforcement has blocked all standard scraping approaches. Existing deployments may continue to work with workarounds (CSE keys, custom User-Agents), but the upstream maintainer is not pursuing fixes. GitHub workflows remain for CI but new issues are unlikely to be addressed.

🚀Get running

git clone https://github.com/benbusby/whoogle-search.git
cd whoogle-search
pip install -r requirements.txt
python -m app

This starts the Flask dev server on localhost:5000. For Docker: docker build -t whoogle . && docker run -p 5000:5000 whoogle.

Daily commands:

  • Development: python -m app (starts Flask on port 5000; override with the FLASK_PORT env var)
  • Production: Waitress via waitress-serve app:app, or Docker: docker run -e FLASK_PORT=5000 -p 5000:5000 benbusby/whoogle-search
  • Configuration: via .env file or environment variables (e.g., WHOOGLE_CSE_KEY for a Custom Search Engine key, WHOOGLE_USER_AGENT for a UA override)
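For the Docker path, a minimal compose sketch can tie these pieces together. The environment variable names (FLASK_PORT, WHOOGLE_CSE_KEY, WHOOGLE_USER_AGENT) come from this doc; the file layout and port mapping are assumptions, not copied from the repo's actual docker-compose.yml — verify against it before use:

```yaml
# Hypothetical docker-compose.yml sketch — not the repo's real compose file.
services:
  whoogle:
    image: benbusby/whoogle-search:latest
    ports:
      - "5000:5000"          # host:container; FLASK_PORT must match the container side
    environment:
      - FLASK_PORT=5000
      - WHOOGLE_CSE_KEY=     # optional: Custom Search Engine key (the post-EOL workaround)
      - WHOOGLE_USER_AGENT=  # optional: hardcoded User-Agent override
    restart: unless-stopped
```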

🗺️Map of the codebase

  • app/request.py: Core scraping logic that constructs Google requests, handles User-Agent rotation, and performs the actual HTTP query — this is where Google's blocking manifests
  • app/filter.py: HTML sanitization and AMP link removal; critical for stripping tracking and cleaning results before serving to clients
  • app/routes.py: Flask route handlers for search, config, theme selection, and bangs support — entry point for all user-facing functionality
  • app/models/config.py: Configuration schema and environment variable parsing — defines all settings users can customize
  • Dockerfile: Production deployment template; shows how to containerize the Flask app with Waitress and minimal image size
  • app/services/cse_client.py: Google Custom Search Engine (CSE) fallback integration — the workaround for when standard scraping fails

🛠️How to make changes

  • Adding filters: edit app/filter.py to modify HTML sanitization logic.
  • Changing search behavior: modify app/request.py (Google request construction) or app/routes.py (endpoint handling).
  • UI changes: edit templates in app/templates/ and styles in app/static/css/ (dark-theme.css, light-theme.css, main.css).
  • Search result parsing: update app/models/g_classes.py for result schema changes.
  • Testing: add test cases mirroring the app/ file structure under a tests/ directory (currently sparse).
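To make the filtering step concrete, here is a stdlib-only sketch of the general idea behind link cleaning: recovering a destination URL from a Google redirect wrapper. The function name and URL shapes are illustrative assumptions, not the repo's actual API — app/filter.py works over full result pages with BeautifulSoup:

```python
from urllib.parse import urlparse, parse_qs

def unwrap_google_redirect(href: str) -> str:
    """Illustrative: recover the destination URL from a Google /url?q=... redirect.

    This is a hypothetical helper showing the link-unwrapping idea on a single
    href; the real filter operates on whole parsed documents.
    """
    parsed = urlparse(href)
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return href  # already a direct link

print(unwrap_google_redirect(
    "https://www.google.com/url?q=https://example.com/page&sa=U"
))  # → https://example.com/page
```

The same pattern (inspect host and path, extract the real target from a query parameter) generalizes to AMP-cache URLs, which also route through Google-controlled hosts.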

🪤Traps & gotchas

  • Critical gotcha: the project is non-functional against live Google without either (1) a valid Google Custom Search Engine (CSE) key via the WHOOGLE_CSE_KEY env var, or (2) a working User-Agent string hardcoded in app/request.py. Google actively blocks non-JavaScript requests.
  • Config quirk: the WHOOGLE_CONFIG_FILENAME env var allows loading a custom .json config file instead of environment variables.
  • Service dependency: Tor support via the stem library is optional (WHOOGLE_USE_TOR env var) but requires a tor daemon running separately.
  • No tests: the test suite is minimal; CI passes but code coverage is unknown.
  • Template issues: some templates may reference undefined context variables if routes.py doesn't populate them correctly (no strict mode enforcement).

💡Concepts to learn

  • Web Scraping & HTML Parsing — Whoogle's entire existence depends on parsing Google's HTML responses with BeautifulSoup4 to extract search results; understanding CSS selectors and DOM traversal is essential to maintain or fix the scraper
  • Request Proxying & User-Agent Spoofing — The project masks client requests as legitimate browsers to bypass Google's anti-bot detection; httpx's header management and User-Agent rotation are core to avoiding blocks
  • HTML Sanitization & XSS Prevention — app/filter.py must strip malicious JavaScript and tracking pixels from Google results before serving them; defusedxml and cssutils prevent injection attacks and XXE vulnerabilities
  • Content Security Policy (CSP) — Whoogle serves cleaned Google results but must prevent inline scripts and external tracking — understanding CSP headers is critical for maintaining the privacy guarantee
  • AMP (Accelerated Mobile Pages) Unwrapping — Google serves AMP links that route through Google's proxy for tracking; Whoogle's filter must detect and replace these with canonical URLs to preserve privacy
  • Docker Containerization & Multi-Stage Builds — Whoogle is deployed primarily via Docker; the Dockerfile uses multi-stage patterns to minimize image size, and understanding container networking is essential for reverse proxy setups
  • Search Bang Syntax (DuckDuckGo-compatible) — Whoogle supports bang shortcuts (e.g., !gh for GitHub) via app/static/bangs/ JSON files; this is a user-facing feature that requires understanding the bang index format
Related repositories

  • searxng/searxng — Active metasearch engine alternative supporting 100+ search backends; similar privacy goals but uses a different architecture and is actively maintained
  • asciimoo/searx — Predecessor to SearXNG; inspired Whoogle's metasearch concept, though now less actively maintained
  • benbusby/whoogle-docker-compose — Official Docker Compose deployment template for Whoogle, simplifying multi-container setups with reverse proxy and database
  • acheong08/ChatGPT — Similar reverse-engineering approach for accessing AI APIs without official clients; shares similar risk profile vs. vendor blocking
  • nondanee/UnblockNeteaseMusic — Demonstrates metaproxy pattern (intercepting and modifying third-party service responses) applied to music streaming; same architectural philosophy
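The bang-syntax concept above can be sketched in a few lines. The real bang index lives in JSON files under app/static/bangs/; the dict format and function below are assumptions for illustration, not the repo's actual schema:

```python
# Illustrative bang resolution — the dict format and function name are
# hypothetical; check app/static/bangs/ for the real index format.
from urllib.parse import quote

BANGS = {
    "gh": "https://github.com/search?q={}",
    "w":  "https://en.wikipedia.org/wiki/Special:Search?search={}",
}

def resolve_bang(query: str):
    """Return a redirect URL if the query starts with a known bang, else None."""
    if not query.startswith("!"):
        return None
    bang, _, rest = query[1:].partition(" ")
    template = BANGS.get(bang)
    if template and rest:
        return template.format(quote(rest))
    return None

print(resolve_bang("!gh whoogle-search"))  # → https://github.com/search?q=whoogle-search
```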

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: `Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.`

Add unit tests for app/filter.py and app/request.py core modules

These modules handle critical search filtering and HTTP request logic, but there are no visible test files for them in the repo. Given the project's struggle with Google blocking (per the WARNING), robust tests for filter behavior and request handling are essential to prevent regressions. The repo has pytest in dependencies and a tests.yml workflow, but core module coverage is missing.

  • [ ] Create tests/test_filter.py with unit tests for filter.py functions (e.g., filtering by language, region, result type)
  • [ ] Create tests/test_request.py with unit tests for request.py HTTP client logic and response parsing
  • [ ] Verify tests run in existing .github/workflows/tests.yml pipeline
  • [ ] Aim for >80% code coverage on both modules using pytest-cov

Add integration tests for app/services/http_client.py with mocked Google responses

The http_client.py service is the frontline for handling User-Agent strings and working around Google's blocking (the core issue mentioned in the WARNING). There should be integration tests that verify the client handles different response codes, blocked requests, and fallback scenarios. This would help contributors quickly validate if new UA strings or workarounds actually work.

  • [ ] Create tests/test_http_client_integration.py with mocked responses for success, 429/403 (blocked), and different User-Agent scenarios
  • [ ] Add fixtures in tests/conftest.py for common mock Google search responses
  • [ ] Test fallback logic when requests fail (if any exists in app/services/http_client.py)
  • [ ] Document how to run integration tests in CONTRIBUTING.md (if it exists) or README.md
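A blocked-response test along the lines of this checklist can be mocked with stdlib tools alone. The `fetch_results` wrapper and its response shape below are hypothetical stand-ins for the repo's real client code (app/request.py / app/services/http_client.py), shown only to illustrate the mocking pattern:

```python
# Sketch of mocking success vs. blocked (403/429) responses with unittest.mock.
# fetch_results is a hypothetical wrapper, not existing repo code.
from unittest.mock import Mock

def fetch_results(client, query: str) -> str:
    """Raise on Google's block statuses, else return the response HTML."""
    resp = client.get("https://www.google.com/search", params={"q": query})
    if resp.status_code in (403, 429):
        raise RuntimeError(f"blocked by Google (HTTP {resp.status_code})")
    return resp.text

# Success path: the mocked client returns a 200 with canned HTML.
ok_client = Mock()
ok_client.get.return_value = Mock(status_code=200, text="<html>results</html>")
assert fetch_results(ok_client, "whoogle") == "<html>results</html>"

# Blocked path: a 429 should surface as an error, not an empty result page.
blocked_client = Mock()
blocked_client.get.return_value = Mock(status_code=429, text="")
try:
    fetch_results(blocked_client, "whoogle")
    raise AssertionError("expected a blocked-request error")
except RuntimeError as e:
    print(e)
```

Fixtures in tests/conftest.py could return clients like these, parameterized by status code, so new User-Agent workarounds can be validated without hitting Google.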

Add comprehensive API endpoint tests for app/routes.py

The routes.py file handles all Flask endpoints but there's no visible test coverage for the search routes, parameter validation, and configuration endpoints. Given that this is a Flask web application with multiple routes (indicated by routes.py existence), contributors need tests to verify endpoint behavior without manual testing. This also ensures the final release remains stable.

  • [ ] Create tests/test_routes.py with Flask test client fixtures
  • [ ] Add tests for main search endpoint with various query parameters (q, lang, region, tbm, etc.)
  • [ ] Add tests for configuration endpoints (GET/POST /config) and configuration persistence
  • [ ] Add tests for edge cases: empty queries, malformed parameters, missing required fields
  • [ ] Verify tests integrate with existing pytest workflow in .github/workflows/tests.yml

🌿Good first issues

  • Add unit tests for app/filter.py sanitization logic — currently the HTML cleaning code has no test coverage and is prone to regressing when Google's HTML structure changes
  • Document the CSE fallback setup in README — clarify step-by-step instructions for obtaining a Google Custom Search Engine key and setting WHOOGLE_CSE_KEY, since this is now the only reliable workaround
  • Create a health-check endpoint in app/routes.py that reports whether scraping is currently working vs. falling back to CSE — helps operators diagnose if Google has blocked their User-Agent without manual testing
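For the health-check idea above, the classification logic could look like the sketch below. The status-code semantics (403/429 meaning blocked) mirror the gotchas section; the function, its name, and its return shape are assumptions, not existing repo code, and the Flask route wiring is omitted:

```python
# Hypothetical status classification for a proposed /healthz-style endpoint.
def scraper_health(last_status: int, cse_configured: bool) -> dict:
    """Classify scraper state from the last upstream HTTP status."""
    if 200 <= last_status < 300:
        return {"mode": "scraping", "healthy": True}
    if last_status in (403, 429):
        # Google is blocking this User-Agent; CSE is the only fallback.
        return {"mode": "cse" if cse_configured else "blocked",
                "healthy": cse_configured}
    return {"mode": "error", "healthy": False}

print(scraper_health(429, cse_configured=True))  # → {'mode': 'cse', 'healthy': True}
```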


📝Recent commits

  • efa36b1 — Merge pull request #1300 from benbusby/FinalRelease (Don-Swanson)
  • b842c44 — Update README and User Agent generation scripts for final release; version bump to 1.2.4 (Don-Swanson)
  • bf44f2e — Merge pull request #1292 from benbusby/dependabot/pip/werkzeug-3.1.6 (Don-Swanson)
  • 42313d6 — Merge pull request #1294 from benbusby/dependabot/pip/flask-3.1.3 (Don-Swanson)
  • a7bb335 — Bump flask from 3.1.2 to 3.1.3 (dependabot[bot])
  • f22829b — Bump werkzeug from 3.1.5 to 3.1.6 (dependabot[bot])
  • d9fe38b — Merge pull request #1291 from benbusby/Safari_UAs (Don-Swanson)
  • 18e3659 — Merge pull request #1287 from benbusby/dependabot/pip/werkzeug-3.1.5 (Don-Swanson)
  • c0a948a — Merge pull request #1288 from benbusby/dependabot/pip/cryptography-46.0.5 (Don-Swanson)
  • 5a062af — Update User Agent generation from Opera to Safari (Don-Swanson)

🔒Security observations

  • Critical · Project End of Life - Fundamental Functionality Broken — README.md - Final Release Notice. As stated in the README, Google has been aggressively blocking search queries without JavaScript since early 2025. This is the core mechanism Whoogle relies on. The project is in final release status with no active maintenance for fixing this fundamental issue. The application is effectively non-functional for its primary purpose. Fix: Do not deploy this application in production. Consider alternative metasearch engines or implement a working Custom Search Engine (CSE) key-based solution if required.
  • High · Missing HTTPS/TLS Enforcement — Dockerfile, docker-compose.yml, app/routes.py. No evidence of HTTPS enforcement, HSTS headers, or secure cookie flags in the configuration. The docker-compose.yml and Dockerfile do not show TLS configuration, and routes.py structure suggests potential HTTP-only operation. This exposes user search queries and configuration in transit. Fix: Configure Flask with Flask-Talisman or similar to enforce HTTPS, set HSTS headers (Strict-Transport-Security), and mark all cookies as Secure and SameSite.
  • High · Potential XSS Vulnerability in Search Results — app/filter.py, app/static/js/controller.js, Jinja2 templates. The application processes and displays HTML search results from Google (via BeautifulSoup in filter.py). Without proper output encoding in Jinja2 templates, there is risk of stored/reflected XSS if scraped content contains malicious scripts. The static JS files suggest client-side processing of user data. Fix: Ensure all user input and scraped content is properly escaped when rendering in templates. Use Jinja2's default autoescaping (|escape filter). Implement Content Security Policy (CSP) headers. Sanitize HTML content from search results using bleach or defusedxml.
  • High · Insufficient Input Validation on Search Queries — app/routes.py, app/request.py, app/services/http_client.py. No evidence of comprehensive input validation on search query parameters (q, lang, time, etc.). User input is passed to external services (Google via HTTP client, Tor via stem). Improper validation could lead to header injection, SSRF, or query manipulation attacks. Fix: Implement strict input validation using libraries like validators.py (already in dependencies). Validate query length, character sets, and parameter types. Use parameterized requests to external services. Implement rate limiting and query complexity checks.
  • High · Potential XXE and Unvalidated Config Parsing — app/models/config.py, app/static/settings/ (JSON files). The defusedxml==0.7.1 dependency suggests XML parsing is involved; defusedxml itself is the hardened parser, but XXE (XML External Entity) attacks remain possible if any user-controlled XML is processed outside it. Configuration files (settings/countries.json, etc.) are loaded without validation. Fix: Ensure defusedxml is used for all XML parsing. Validate and sanitize all JSON configuration files. Implement schema validation for all external config inputs. Do not accept user-supplied XML/JSON without strict schema validation.
  • Medium · Exposed Tor Configuration and Sensitive Directories — docker-compose.yml (tmpfs configuration). docker-compose.yml creates tmpfs mounts for /config/, /var/lib/tor/, and /run/tor/ with specific UIDs/GIDs. If Tor is exposed or misconfigured, these directories could reveal Tor bridge information or exit node data. The tmpfs mode 1700 is correct but the exposure of Tor service itself is a concern. Fix: Ensure Tor service is not directly exposed. Use network policies to restrict Tor communication. Implement authentication on the Whoogle application itself. Monitor Tor exit node information and implement guardrails against malicious node selection.
  • Medium · Missing Security Headers Configuration — app/__init__.py, app/routes.py (Flask configuration). No visible configuration for critical security headers (X-Frame-Options, X-Content-Type-Options, Content-Security-Policy, Referrer-Policy). Static assets in app/static/ may be served without proper cache-control or security directives. Fix: Add Flask-Talisman or manually configure response headers: X-Frame-Options: DENY, X-Content-Type-Options: nosniff, a restrictive Content-Security-Policy, and Referrer-Policy: no-referrer.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
