RepoPilot

benbusby/whoogle-search

A self-hosted, ad-free, privacy-respecting metasearch engine

Overall: Healthy — Healthy across the board

Use as dependency — Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify — Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from — Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is — Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 2w ago
  • 15 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • Concentrated ownership — top contributor handles 59% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — the badge updates live from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/benbusby/whoogle-search)](https://repopilot.app/r/benbusby/whoogle-search)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/benbusby/whoogle-search on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: benbusby/whoogle-search

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/benbusby/whoogle-search shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 2w ago
  • 15 active contributors
  • MIT licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 59% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live benbusby/whoogle-search repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/benbusby/whoogle-search.

What it runs against: a local clone of benbusby/whoogle-search — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in benbusby/whoogle-search | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches a relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | Last commit ≤ 45 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>benbusby/whoogle-search</code></summary>
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of benbusby/whoogle-search. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/benbusby/whoogle-search.git
#   cd whoogle-search
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of benbusby/whoogle-search and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "benbusby/whoogle-search(\.git)?$" \
  && ok "origin remote is benbusby/whoogle-search" \
  || miss "origin remote is not benbusby/whoogle-search (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^MIT" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 45 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~15d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/benbusby/whoogle-search"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Whoogle is a self-hosted metasearch engine that strips Google search results of ads, JavaScript, AMP links, cookies, and IP tracking. It proxies queries through Google's search backend while serving clean HTML responses, letting users run privacy-respecting Google searches without client-side tracking. However, Google's aggressive blocking of non-JavaScript requests (escalating since early 2025) broke the core scraping mechanism, and the project shipped its final release in April 2026. Architecturally it is a monolithic Flask application: app/__init__.py bootstraps the server, app/routes.py handles HTTP endpoints, app/request.py performs the Google scraping via httpx, app/filter.py sanitizes HTML responses, and app/models/ defines configuration and search result schemas. Static assets (CSS themes, bangs JSON) live in app/static/, with Docker and environment-based configuration in root files.

👥Who it's for

Privacy-conscious users and system administrators who want to self-host a Google search alternative without surveillance; developers building personal search appliances or federated search infrastructure; users in restrictive network environments seeking a lightweight metasearch solution.

🌱Maturity & risk

The project was production-ready with Docker deployment, CI/CD pipelines (GitHub Actions for tests, buildx, PyPI), and active maintenance through its final release in April 2026. It is now deprecated and unmaintained: Google's JavaScript-blocking countermeasures break the core scraping mechanism, and no new feature development is planned unless a reliable User-Agent workaround emerges.

Critical: This project is end-of-life as of April 2026 — Google actively blocks requests from non-JavaScript clients, which is fundamental to Whoogle's operation. Production deployments will fail without either a custom Google CSE key or a hardcoded working User-Agent string (neither is reliably obtainable). Dependency surface is moderate (~30 packages via pip) with reasonable maintenance (beautifulsoup4, httpx, Flask), but no active security updates should be expected.

Active areas of work

The project is in maintenance mode with no active development. The final notice (April 2026) indicates that Google's JavaScript enforcement has blocked all standard scraping approaches. Existing deployments may continue to work with workarounds (CSE keys, custom User-Agents), but the upstream maintainer is not pursuing fixes. GitHub workflows remain for CI but new issues are unlikely to be addressed.

🚀Get running

git clone https://github.com/benbusby/whoogle-search.git
cd whoogle-search
pip install -r requirements.txt
python -m app

This starts the Flask dev server on localhost:5000. For Docker: docker build -t whoogle . && docker run -p 5000:5000 whoogle.

Daily commands:

  • Development: python -m app (starts Flask on port 5000; override with the FLASK_PORT env var)
  • Production: Waitress via waitress-serve app:app, or Docker: docker run -e FLASK_PORT=5000 -p 5000:5000 benbusby/whoogle-search
  • Configuration: via .env file or environment variables (e.g., WHOOGLE_CSE_KEY for a Custom Search Engine key, WHOOGLE_USER_AGENT for a UA override)
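For the Docker path, a minimal compose sketch can tie these pieces together. The environment variable names (FLASK_PORT, WHOOGLE_CSE_KEY, WHOOGLE_USER_AGENT) come from this doc; the file layout and port mapping are assumptions, not copied from the repo's actual docker-compose.yml — verify against it before use:

```yaml
# Hypothetical docker-compose.yml sketch — not the repo's real compose file.
services:
  whoogle:
    image: benbusby/whoogle-search:latest
    ports:
      - "5000:5000"          # host:container; FLASK_PORT must match the container side
    environment:
      - FLASK_PORT=5000
      - WHOOGLE_CSE_KEY=     # optional: Custom Search Engine key (the post-EOL workaround)
      - WHOOGLE_USER_AGENT=  # optional: hardcoded User-Agent override
    restart: unless-stopped
```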

🗺️Map of the codebase

  • app/request.py: Core scraping logic that constructs Google requests, handles User-Agent rotation, and performs the actual HTTP query — this is where Google's blocking manifests
  • app/filter.py: HTML sanitization and AMP link removal; critical for stripping tracking and cleaning results before serving to clients
  • app/routes.py: Flask route handlers for search, config, theme selection, and bangs support — entry point for all user-facing functionality
  • app/models/config.py: Configuration schema and environment variable parsing — defines all settings users can customize
  • Dockerfile: Production deployment template; shows how to containerize the Flask app with Waitress and minimal image size
  • app/services/cse_client.py: Google Custom Search Engine (CSE) fallback integration — the workaround for when standard scraping fails

🛠️How to make changes

  • Adding filters: edit app/filter.py to modify HTML sanitization logic.
  • Changing search behavior: modify app/request.py (Google request construction) or app/routes.py (endpoint handling).
  • UI changes: edit templates in app/templates/ and styles in app/static/css/ (dark-theme.css, light-theme.css, main.css).
  • Search result parsing: update app/models/g_classes.py for result schema changes.
  • Testing: add test cases mirroring the app/ file structure under a tests/ directory (currently sparse).
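To make the filtering step concrete, here is a stdlib-only sketch of the general idea behind link cleaning: recovering a destination URL from a Google redirect wrapper. The function name and URL shapes are illustrative assumptions, not the repo's actual API — app/filter.py works over full result pages with BeautifulSoup:

```python
from urllib.parse import urlparse, parse_qs

def unwrap_google_redirect(href: str) -> str:
    """Illustrative: recover the destination URL from a Google /url?q=... redirect.

    This is a hypothetical helper showing the link-unwrapping idea on a single
    href; the real filter operates on whole parsed documents.
    """
    parsed = urlparse(href)
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return href  # already a direct link

print(unwrap_google_redirect(
    "https://www.google.com/url?q=https://example.com/page&sa=U"
))  # → https://example.com/page
```

The same pattern (inspect host and path, extract the real target from a query parameter) generalizes to AMP-cache URLs, which also route through Google-controlled hosts.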

🪤Traps & gotchas

  • Critical gotcha: the project is non-functional against live Google without either (1) a valid Google Custom Search Engine (CSE) key via the WHOOGLE_CSE_KEY env var, or (2) a working User-Agent string hardcoded in app/request.py. Google actively blocks non-JavaScript requests.
  • Config quirk: the WHOOGLE_CONFIG_FILENAME env var allows loading a custom .json config file instead of environment variables.
  • Service dependency: Tor support via the stem library is optional (WHOOGLE_USE_TOR env var) but requires a tor daemon running separately.
  • No tests: the test suite is minimal; CI passes but code coverage is unknown.
  • Template issues: some templates may reference undefined context variables if routes.py doesn't populate them correctly (no strict mode enforcement).

💡Concepts to learn

  • Web Scraping & HTML Parsing — Whoogle's entire existence depends on parsing Google's HTML responses with BeautifulSoup4 to extract search results; understanding CSS selectors and DOM traversal is essential to maintain or fix the scraper
  • Request Proxying & User-Agent Spoofing — The project masks client requests as legitimate browsers to bypass Google's anti-bot detection; httpx's header management and User-Agent rotation are core to avoiding blocks
  • HTML Sanitization & XSS Prevention — app/filter.py must strip malicious JavaScript and tracking pixels from Google results before serving them; defusedxml and cssutils prevent injection attacks and XXE vulnerabilities
  • Content Security Policy (CSP) — Whoogle serves cleaned Google results but must prevent inline scripts and external tracking — understanding CSP headers is critical for maintaining the privacy guarantee
  • AMP (Accelerated Mobile Pages) Unwrapping — Google serves AMP links that route through Google's proxy for tracking; Whoogle's filter must detect and replace these with canonical URLs to preserve privacy
  • Docker Containerization & Multi-Stage Builds — Whoogle is deployed primarily via Docker; the Dockerfile uses multi-stage patterns to minimize image size, and understanding container networking is essential for reverse proxy setups
  • Search Bang Syntax (DuckDuckGo-compatible) — Whoogle supports bang shortcuts (e.g., !gh for GitHub) via app/static/bangs/ JSON files; this is a user-facing feature that requires understanding the bang index format
Related repositories

  • searxng/searxng — Active metasearch engine alternative supporting 100+ search backends; similar privacy goals but uses a different architecture and is actively maintained
  • asciimoo/searx — Predecessor to SearXNG; inspired Whoogle's metasearch concept, though now less actively maintained
  • benbusby/whoogle-docker-compose — Official Docker Compose deployment template for Whoogle, simplifying multi-container setups with reverse proxy and database
  • acheong08/ChatGPT — Similar reverse-engineering approach for accessing AI APIs without official clients; shares similar risk profile vs. vendor blocking
  • nondanee/UnblockNeteaseMusic — Demonstrates metaproxy pattern (intercepting and modifying third-party service responses) applied to music streaming; same architectural philosophy
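The bang-syntax concept above can be sketched in a few lines. The real bang index lives in JSON files under app/static/bangs/; the dict format and function below are assumptions for illustration, not the repo's actual schema:

```python
# Illustrative bang resolution — the dict format and function name are
# hypothetical; check app/static/bangs/ for the real index format.
from urllib.parse import quote

BANGS = {
    "gh": "https://github.com/search?q={}",
    "w":  "https://en.wikipedia.org/wiki/Special:Search?search={}",
}

def resolve_bang(query: str):
    """Return a redirect URL if the query starts with a known bang, else None."""
    if not query.startswith("!"):
        return None
    bang, _, rest = query[1:].partition(" ")
    template = BANGS.get(bang)
    if template and rest:
        return template.format(quote(rest))
    return None

print(resolve_bang("!gh whoogle-search"))  # → https://github.com/search?q=whoogle-search
```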

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: `Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.`

Add unit tests for app/filter.py and app/request.py core modules

These modules handle critical search filtering and HTTP request logic, but there are no visible test files for them in the repo. Given the project's struggle with Google blocking (per the WARNING), robust tests for filter behavior and request handling are essential to prevent regressions. The repo has pytest in dependencies and a tests.yml workflow, but core module coverage is missing.

  • [ ] Create tests/test_filter.py with unit tests for filter.py functions (e.g., filtering by language, region, result type)
  • [ ] Create tests/test_request.py with unit tests for request.py HTTP client logic and response parsing
  • [ ] Verify tests run in existing .github/workflows/tests.yml pipeline
  • [ ] Aim for >80% code coverage on both modules using pytest-cov

Add integration tests for app/services/http_client.py with mocked Google responses

The http_client.py service is the frontline for handling User-Agent strings and working around Google's blocking (the core issue mentioned in the WARNING). There should be integration tests that verify the client handles different response codes, blocked requests, and fallback scenarios. This would help contributors quickly validate if new UA strings or workarounds actually work.

  • [ ] Create tests/test_http_client_integration.py with mocked responses for success, 429/403 (blocked), and different User-Agent scenarios
  • [ ] Add fixtures in tests/conftest.py for common mock Google search responses
  • [ ] Test fallback logic when requests fail (if any exists in app/services/http_client.py)
  • [ ] Document how to run integration tests in CONTRIBUTING.md (if it exists) or README.md
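A blocked-response test along the lines of this checklist can be mocked with stdlib tools alone. The `fetch_results` wrapper and its response shape below are hypothetical stand-ins for the repo's real client code (app/request.py / app/services/http_client.py), shown only to illustrate the mocking pattern:

```python
# Sketch of mocking success vs. blocked (403/429) responses with unittest.mock.
# fetch_results is a hypothetical wrapper, not existing repo code.
from unittest.mock import Mock

def fetch_results(client, query: str) -> str:
    """Raise on Google's block statuses, else return the response HTML."""
    resp = client.get("https://www.google.com/search", params={"q": query})
    if resp.status_code in (403, 429):
        raise RuntimeError(f"blocked by Google (HTTP {resp.status_code})")
    return resp.text

# Success path: the mocked client returns a 200 with canned HTML.
ok_client = Mock()
ok_client.get.return_value = Mock(status_code=200, text="<html>results</html>")
assert fetch_results(ok_client, "whoogle") == "<html>results</html>"

# Blocked path: a 429 should surface as an error, not an empty result page.
blocked_client = Mock()
blocked_client.get.return_value = Mock(status_code=429, text="")
try:
    fetch_results(blocked_client, "whoogle")
    raise AssertionError("expected a blocked-request error")
except RuntimeError as e:
    print(e)
```

Fixtures in tests/conftest.py could return clients like these, parameterized by status code, so new User-Agent workarounds can be validated without hitting Google.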

Add comprehensive API endpoint tests for app/routes.py

The routes.py file handles all Flask endpoints but there's no visible test coverage for the search routes, parameter validation, and configuration endpoints. Given that this is a Flask web application with multiple routes (indicated by routes.py existence), contributors need tests to verify endpoint behavior without manual testing. This also ensures the final release remains stable.

  • [ ] Create tests/test_routes.py with Flask test client fixtures
  • [ ] Add tests for main search endpoint with various query parameters (q, lang, region, tbm, etc.)
  • [ ] Add tests for configuration endpoints (GET/POST /config) and configuration persistence
  • [ ] Add tests for edge cases: empty queries, malformed parameters, missing required fields
  • [ ] Verify tests integrate with existing pytest workflow in .github/workflows/tests.yml

🌿Good first issues

  • Add unit tests for app/filter.py sanitization logic — currently the HTML cleaning code has no test coverage and is prone to regressing when Google's HTML structure changes
  • Document the CSE fallback setup in README — clarify step-by-step instructions for obtaining a Google Custom Search Engine key and setting WHOOGLE_CSE_KEY, since this is now the only reliable workaround
  • Create a health-check endpoint in app/routes.py that reports whether scraping is currently working vs. falling back to CSE — helps operators diagnose if Google has blocked their User-Agent without manual testing
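For the health-check idea above, the classification logic could look like the sketch below. The status-code semantics (403/429 meaning blocked) mirror the gotchas section; the function, its name, and its return shape are assumptions, not existing repo code, and the Flask route wiring is omitted:

```python
# Hypothetical status classification for a proposed /healthz-style endpoint.
def scraper_health(last_status: int, cse_configured: bool) -> dict:
    """Classify scraper state from the last upstream HTTP status."""
    if 200 <= last_status < 300:
        return {"mode": "scraping", "healthy": True}
    if last_status in (403, 429):
        # Google is blocking this User-Agent; CSE is the only fallback.
        return {"mode": "cse" if cse_configured else "blocked",
                "healthy": cse_configured}
    return {"mode": "error", "healthy": False}

print(scraper_health(429, cse_configured=True))  # → {'mode': 'cse', 'healthy': True}
```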


📝Recent commits

  • efa36b1 — Merge pull request #1300 from benbusby/FinalRelease (Don-Swanson)
  • b842c44 — Update README and User Agent generation scripts for final release; version bump to 1.2.4 (Don-Swanson)
  • bf44f2e — Merge pull request #1292 from benbusby/dependabot/pip/werkzeug-3.1.6 (Don-Swanson)
  • 42313d6 — Merge pull request #1294 from benbusby/dependabot/pip/flask-3.1.3 (Don-Swanson)
  • a7bb335 — Bump flask from 3.1.2 to 3.1.3 (dependabot[bot])
  • f22829b — Bump werkzeug from 3.1.5 to 3.1.6 (dependabot[bot])
  • d9fe38b — Merge pull request #1291 from benbusby/Safari_UAs (Don-Swanson)
  • 18e3659 — Merge pull request #1287 from benbusby/dependabot/pip/werkzeug-3.1.5 (Don-Swanson)
  • c0a948a — Merge pull request #1288 from benbusby/dependabot/pip/cryptography-46.0.5 (Don-Swanson)
  • 5a062af — Update User Agent generation from Opera to Safari (Don-Swanson)

🔒Security observations

  • Critical · Project End of Life - Fundamental Functionality Broken — README.md - Final Release Notice. As stated in the README, Google has been aggressively blocking search queries without JavaScript since early 2025. This is the core mechanism Whoogle relies on. The project is in final release status with no active maintenance for fixing this fundamental issue. The application is effectively non-functional for its primary purpose. Fix: Do not deploy this application in production. Consider alternative metasearch engines or implement a working Custom Search Engine (CSE) key-based solution if required.
  • High · Missing HTTPS/TLS Enforcement — Dockerfile, docker-compose.yml, app/routes.py. No evidence of HTTPS enforcement, HSTS headers, or secure cookie flags in the configuration. The docker-compose.yml and Dockerfile do not show TLS configuration, and routes.py structure suggests potential HTTP-only operation. This exposes user search queries and configuration in transit. Fix: Configure Flask with Flask-Talisman or similar to enforce HTTPS, set HSTS headers (Strict-Transport-Security), and mark all cookies as Secure and SameSite.
  • High · Potential XSS Vulnerability in Search Results — app/filter.py, app/static/js/controller.js, Jinja2 templates. The application processes and displays HTML search results from Google (via BeautifulSoup in filter.py). Without proper output encoding in Jinja2 templates, there is risk of stored/reflected XSS if scraped content contains malicious scripts. The static JS files suggest client-side processing of user data. Fix: Ensure all user input and scraped content is properly escaped when rendering in templates. Use Jinja2's default autoescaping (|escape filter). Implement Content Security Policy (CSP) headers. Sanitize HTML content from search results using bleach or defusedxml.
  • High · Insufficient Input Validation on Search Queries — app/routes.py, app/request.py, app/services/http_client.py. No evidence of comprehensive input validation on search query parameters (q, lang, time, etc.). User input is passed to external services (Google via HTTP client, Tor via stem). Improper validation could lead to header injection, SSRF, or query manipulation attacks. Fix: Implement strict input validation using libraries like validators.py (already in dependencies). Validate query length, character sets, and parameter types. Use parameterized requests to external services. Implement rate limiting and query complexity checks.
  • High · Potential XXE and Unvalidated Config Parsing — app/models/config.py, app/static/settings/ (JSON files). The defusedxml==0.7.1 dependency suggests XML parsing is involved; defusedxml itself is the hardened parser, but XXE (XML External Entity) attacks remain possible if any user-controlled XML is processed outside it. Configuration files (settings/countries.json, etc.) are loaded without validation. Fix: Ensure defusedxml is used for all XML parsing. Validate and sanitize all JSON configuration files. Implement schema validation for all external config inputs. Do not accept user-supplied XML/JSON without strict schema validation.
  • Medium · Exposed Tor Configuration and Sensitive Directories — docker-compose.yml (tmpfs configuration). docker-compose.yml creates tmpfs mounts for /config/, /var/lib/tor/, and /run/tor/ with specific UIDs/GIDs. If Tor is exposed or misconfigured, these directories could reveal Tor bridge information or exit node data. The tmpfs mode 1700 is correct but the exposure of Tor service itself is a concern. Fix: Ensure Tor service is not directly exposed. Use network policies to restrict Tor communication. Implement authentication on the Whoogle application itself. Monitor Tor exit node information and implement guardrails against malicious node selection.
  • Medium · Missing Security Headers Configuration — app/__init__.py, app/routes.py (Flask configuration). No visible configuration for critical security headers (X-Frame-Options, X-Content-Type-Options, Content-Security-Policy, Referrer-Policy). Static assets in app/static/ may be served without proper cache-control or security directives. Fix: Add Flask-Talisman or manually configure response headers: X-Frame-Options: DENY, X-Content-Type-Options: nosniff, a restrictive Content-Security-Policy, and Referrer-Policy: no-referrer.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
