RepoPilot

soimort/you-get

:arrow_double_down: Dumb downloader that scrapes the web

Mixed

Mixed signals — read the receipts

ConcernsDependency

non-standard license (Other)

HealthyFork & modify

Has a license, tests, and CI — clean foundation to fork and modify.

HealthyLearn from

Documented and popular — useful reference codebase to read through.

HealthyDeploy as-is

No critical CVEs, sane security posture — runnable as-is.

  • Concentrated ownership — top contributor handles 79% of recent commits
  • Non-standard license (Other) — review terms
  • Last commit 1w ago
  • 14 active contributors
  • Other licensed
  • CI configured
  • Tests present

What would improve this?

  • Use as dependency ConcernsMixed if: clarify license terms

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/soimort/you-get?axis=fork)](https://repopilot.app/r/soimort/you-get)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card

This card auto-renders when someone shares https://repopilot.app/r/soimort/you-get on X, Slack, or LinkedIn.

Ask AI about soimort/you-get

Grounded in the actual source code. Pick a starter question or write your own.

Or write your own question →

Onboarding doc

Onboarding: soimort/you-get

Generated by RepoPilot · 2026-06-21 · Source

🎯Verdict

WAIT — Mixed signals — read the receipts

  • Last commit 1w ago
  • 14 active contributors
  • Other licensed
  • CI configured
  • Tests present
  • ⚠ Concentrated ownership — top contributor handles 79% of recent commits
  • ⚠ Non-standard license (Other) — review terms

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

TL;DR

You-Get is a command-line media downloader that scrapes web pages to extract and download videos, audio, and images from 100+ websites (YouTube, Bilibili, Niconico, etc.). It's a Python-based tool that reverse-engineers site-specific playback mechanisms to bypass streaming-only restrictions and download content for offline use. Monolithic CLI tool with layered architecture: src/you_get/main.py is the CLI entry point, src/you_get/extractor.py defines the base Extractor class, src/you_get/extractors/ contains 50+ site-specific subclasses (acfun.py, bilibili.py, youtube.py, etc.), and src/you_get/cli_wrapper/ wraps external tools (ffmpeg, players, openssl). Common utilities live in src/you_get/common.py.

👥Who it's for

Individual users and power users who want to download video/audio content from restricted platforms without using a web browser, plus developers who contribute site-specific extractors (in src/you_get/extractors/) to support new websites.

🌱Maturity & risk

Moderately mature and active: published on PyPI with CI via GitHub Actions (python-package.yml), supports Python 3.7+, and has 100+ site extractors. However, the NOTICE about dropping Python 3.5-3.7 support and the migration to TLS 1.3 indicate active maintenance but also edge-case fragility. Recent activity appears steady but not rapid.

Single-maintainer project (soimort) creates sustainability risk. Heavy dependency on external site APIs (40+ extractors in src/you_get/extractors/) means frequent breakage when sites change structure. The dukpy runtime dependency suggests JavaScript parsing is used for some sites, adding complexity. No visible test directory in top-level structure suggests testing coverage may be sparse.

Active areas of work

No specific PR or milestone data visible in the provided structure. Based on NOTICE dates (May 2022), the project is in maintenance mode, handling Python version deprecations and TLS compatibility issues rather than major feature development.

🚀Get running

git clone https://github.com/soimort/you-get.git
cd you-get
pip install -e .
you-get 'https://www.youtube.com/watch?v=jNQXAC9IVRw'

Daily commands:

make
# or
python -m you_get [URL]

🗺️Map of the codebase

  • src/you_get/__init__.py — Main package initialization; entry point for the library that exports core functionality and coordinates the download workflow
  • src/you_get/__main__.py — CLI entry point that parses arguments and orchestrates the download process; essential for understanding how the tool is invoked
  • src/you_get/extractor.py — Base extractor class defining the plugin architecture; all site-specific extractors inherit from this interface
  • src/you_get/common.py — Shared utilities and HTTP client wrapper; provides common functionality used across all extractors
  • src/you_get/extractors/__init__.py — Extractor registry and loader; dynamically imports and manages all site-specific extractor plugins
  • setup.py — Package configuration and dependencies; defines the project structure and runtime requirements (dukpy)

🧩Components & responsibilities

  • Extractor Registry (Python import system, regex) — Dynamically loads all extractor plugins and routes URLs to the correct handler based on domain/pattern matching
    • Failure mode: If extractor not found, download fails with 'Unsupported site' error; registry corruption would break all downloads
  • Base Extractor Class (Python ABC, HTTP client) — Defines the interface that all site-specific extractors must implement (extract method, metadata parsing)
    • Failure mode: Breaking changes to interface break all downstream extractors; extraction logic errors propagate to all sites
  • HTTP Client (common.py) (urllib, custom headers) — Provides unified HTTP request handling with retries, User-Agent spoofing, proxy support, and cookie management
    • Failure mode: Network errors or blocked IPs cause all extractors to fail; timeout misconfigurations cause hangs
  • Site-Specific Extractors (BeautifulSoup, regex, JavaScript execution (dukpy)) — Parse HTML/JSON responses and extract stream URLs, titles, and metadata for individual platforms
    • Failure mode: Site layout changes break parsing logic for that extractor only; requires rapid patching
  • Download Module (urllib, subprocess, file I/O) — Manages file writing, progress reporting, and delegates to external tools (ffmpeg, wget, aria2) or built-in downloader
    • Failure mode: Disk full, permissions denied, or network interruption aborts download; resume logic optional

🔀Data flow

  • User CLI inputURL parser in __main__.py — Command-line arguments parsed into URL, output path, and options
  • URL parserExtractor Registry — URL routed to registry to find matching extractor based on domain/pattern
  • Extractor RegistrySelected Extractor Plugin — Matched extractor class instantiated and passed the URL
  • Extractor PluginHTTP Client — Extractor requests content

🛠️How to make changes

Add Support for a New Video Platform

  1. Create a new extractor file in src/you_get/extractors/ named after the site (e.g., newsite.py) (src/you_get/extractors/newsite.py)
  2. Import the base Extractor class and define your extractor class inheriting from it, implementing name, patterns, and extract() method (src/you_get/extractor.py)
  3. Register your extractor in the init.py by adding an import statement in the extractors module (src/you_get/extractors/__init__.py)
  4. Implement URL pattern matching in the class attribute and scrape logic in extract() to populate stream metadata (src/you_get/extractors/newsite.py)
  5. Test by running: python -m you_get 'https://newsite.com/video-url' (src/you_get/__main__.py)

Add a New Download Backend

  1. Create a new wrapper module in src/you_get/cli_wrapper/downloader/ (src/you_get/cli_wrapper/downloader/__init__.py)
  2. Implement the downloader interface with methods to invoke external download tools (src/you_get/cli_wrapper/downloader/__init__.py)
  3. Integrate the downloader selection logic in the main CLI via main.py (src/you_get/__main__.py)

Extend HTTP Client Functionality

  1. Review the shared utilities and HTTP helpers in common.py (src/you_get/common.py)
  2. Add custom headers, proxy handling, or authentication logic to the common HTTP client (src/you_get/common.py)
  3. Import and use the enhanced client in your extractor implementations (src/you_get/extractors/newsite.py)

🔧Why these technologies

  • Python 3.8+ — Cross-platform scripting language with strong web scraping libraries; allows single codebase for Linux/macOS/Windows
  • dukpy — JavaScript runtime integration for sites that require JavaScript execution to render video metadata (dynamic sites)
  • Plugin Architecture (extractors) — Decouples site-specific logic from core; enables community contributions for new platforms without modifying core code
  • CLI wrappers (ffmpeg, VLC, wget) — Delegates heavy lifting to battle-tested external tools; reduces code complexity and improves reliability

⚖️Trade-offs already made

  • Web scraping instead of official APIs

    • Why: Most media sites do not provide public APIs or restrict download programmatically; scraping is the only viable approach
    • Consequence: Extractors are fragile and break when sites update HTML structure; requires frequent maintenance
  • Reliance on external tools (ffmpeg, wget, players)

    • Why: Avoids reimplementing video codecs, HTTP optimization, and playback logic
    • Consequence: Users must install multiple dependencies; platform-specific installation complexity
  • Single-threaded download by default

    • Why: Simplifies state management and reduces resource usage for typical users
    • Consequence: Slower for large files; multi-threaded downloads not natively supported (must use external downloader)

🚫Non-goals (don't propose these)

  • Does not provide a graphical user interface (GUI-only); CLI-only tool
  • Does not handle DRM/encrypted content (respects DMCA; will not circumvent copy protection)
  • Does not support real-time streaming or live broadcast recording (focuses on on-demand content)
  • Not a replacement for official platform clients or downloaders

🪤Traps & gotchas

  1. Extractor breakage is frequent: Each site extractor is tightly coupled to that site's HTML/JavaScript structure. YouTube, Bilibili, and others change playback APIs regularly, causing silent failures. 2. dukpy dependency is fragile: Used for JavaScript evaluation on some sites; requires node.js/JavaScript runtime, making installation environment-specific. 3. No formal test suite visible: No tests/ directory in structure means validating changes requires manual site testing. 4. Python version constraints: Notice warns about dropping 3.5-3.7; TLS 1.3 post-handshake auth issues may cause cryptic SSL errors on older systems. 5. FFmpeg/RTMPDump are optional but site-dependent: Some extractors assume ffmpeg is installed; others fail silently if RTMPDump isn't present.

🏗️Architecture

💡Concepts to learn

  • ytdl-org/youtube-dl — Direct predecessor and spiritual ancestor; much larger and more feature-complete, but you-get is maintained and lighter-weight for basic use cases
  • yt-dlp/yt-dlp — Modern active fork of youtube-dl with better maintenance; if you-get breaks for a site, yt-dlp likely has a working extractor
  • Homebrew/homebrew-core — you-get is packaged here; important distribution path for macOS users
  • rg3/youtube-dl — Original youtube-dl repo (now archived); you-get was influenced by its extractor plugin pattern
  • AnimeFire/AnimeFire-website — Example of a website that you-get targets; useful for testing the bilibili/acfun extractors

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for extractor base class and common utilities

The repo has 50+ extractor implementations in src/you_get/extractors/ but no visible test suite for the core extractor.py and common.py modules. This is critical for a web scraping tool where extractors frequently break due to site changes. A test suite would enable CI to catch regressions early and give contributors confidence when modifying core extraction logic.

  • [ ] Create tests/test_extractor.py with unit tests for src/you_get/extractor.py base classes
  • [ ] Create tests/test_common.py with unit tests for src/you_get/common.py utilities (HTTP requests, parsing helpers, etc.)
  • [ ] Add mock fixtures for common web responses to avoid live network calls in CI
  • [ ] Update .github/workflows/python-package.yml to run pytest and report coverage

Add integration tests for high-traffic extractors with GitHub Actions

With 50+ extractor modules, there's no visible testing infrastructure to catch when extractors break (which happens frequently as websites change). A periodic CI workflow testing a curated subset of extractors against real sites would provide early warnings of breakage and help prioritize maintenance effort.

  • [ ] Create .github/workflows/extractor-tests.yml that runs on schedule (daily/weekly)
  • [ ] Add tests/test_extractors_integration.py with smoke tests for top 10-15 extractors (bilibili, youtube, dailymotion, etc.)
  • [ ] Implement optional skipping for flaky tests using pytest.mark.flaky or environment variables
  • [ ] Add job status badge to README.md documenting extractor health

Create CLI wrapper documentation and tests for transcoder and player modules

The src/you_get/cli_wrapper/ directory contains transcoder (ffmpeg, libav, mencoder) and player (vlc, mplayer, etc.) integrations but there's no documentation about how these are configured or tested. New contributors can't understand when/how these modules are invoked, and there are no unit tests to verify the command-line argument construction.

  • [ ] Create tests/test_cli_wrappers.py with unit tests for src/you_get/cli_wrapper/transcoder/ffmpeg.py and player modules
  • [ ] Test command construction (verify correct flags passed to external tools) without requiring actual ffmpeg/vlc installation
  • [ ] Add docstrings and examples to src/you_get/cli_wrapper/init.py explaining transcoder/player detection and fallback logic
  • [ ] Document in CONTRIBUTING.md how the CLI wrapper selection works and how to test it locally

🌿Good first issues

  • Add unit tests for src/you_get/common.py functions (url_size(), parse_querystring(), etc.)—this file has no test coverage visible and handles critical shared logic.
  • Create a test suite for src/you_get/extractors/archive.py (Internet Archive extractor)—it's a low-complexity extractor, making it ideal for writing tests without deep site-specific knowledge.
  • Document the Extractor plugin interface in src/you_get/extractor.py with docstring examples—new contributors trying to add sites are forced to reverse-engineer the base class by reading existing extractors.

Top contributors

Click to expand

📝Recent commits

Click to expand
  • 049548f — README.md: add --force-reinstall to pip because it is now necessary for upgrading from a VCS URL to work when the packag (soimort)
  • 1af8b71 — python-package.yml: remove python 3.7 (no longer available in Ubuntu 24.04) (soimort)
  • c7e7525 — python-package.yml: disable the new flake8 F824 check (soimort)
  • 57cf717 — python-package.yml: disable the new flake8 F824 check (soimort)
  • 4fb7d23 — [common] fix a long-standing bug that causes infinite downloading when content-length is missing (soimort)
  • e9165e0 — version 0.4.1743 (soimort)
  • f25ddca — [youtube] fix caption tracks extraction (soimort)
  • 51a7eb5 — [youtube] update self.ua (fix extraction) (soimort)
  • ce1f930 — Merge branch 'develop' of https://github.com/ljhcage/you-get into ljhcage-develop (soimort)
  • 7fbd4c3 — python-package.yml: update artifact actions (soimort)

🔒Security observations

This web scraping utility has moderate security concerns primarily centered around its use of an unmaintained dependency (dukpy), inherent risks of web scraping without proper validation, support for end-of-life Python versions, and insufficient input validation in extractor modules. The project requires significant security hardening, particularly in input validation, dependency management, and external command execution. The decision to use JavaScript execution via dukpy is a critical risk that should be addressed immediately.

  • High · Insecure Dependency: dukpy — requirements.txt, setup.py. dukpy is a JavaScript execution library that has known security vulnerabilities and is no longer actively maintained. It's used for executing arbitrary JavaScript code, which poses significant security risks including potential code injection and sandbox escape attacks. Fix: Replace dukpy with a modern, actively maintained alternative such as nodejs integration, or evaluate if JavaScript execution is truly necessary. If required, consider using safer sandboxing approaches like subprocess isolation with strict input validation.
  • High · Web Scraping Security Risks — src/you_get/extractors/*.py, src/you_get/common.py. The project is designed as a web scraper/downloader that interacts with multiple external websites (YouTube, Bilibili, Instagram, etc.). This introduces risks including: SSL/TLS certificate validation bypasses, insecure handling of authentication credentials, and potential SSRF vulnerabilities when processing user-supplied URLs. Fix: Implement strict URL validation, enforce HTTPS, validate SSL/TLS certificates properly, sanitize all external inputs, implement rate limiting, and provide clear security documentation about safe usage patterns.
  • Medium · Deprecated Python Versions Support — README.md, .github/workflows/python-package.yml. The README explicitly mentions support for Python 3.5, 3.6, and 3.7, which are end-of-life versions with known unpatched security vulnerabilities. TLS 1.3 post-handshake authentication (PHA) is disabled in older Python versions. Fix: Drop support for Python versions before 3.8, update CI/CD pipelines to test only on actively supported Python versions, and enforce minimum Python version in setup.py.
  • Medium · Missing Input Validation in Downloaders — src/you_get/extractors/*.py, src/you_get/cli_wrapper/downloader/. Multiple extractor modules (bilibili, youtube, instagram, etc.) process URLs and metadata from external sources without apparent centralized validation. This could lead to path traversal attacks during file downloads or arbitrary file write vulnerabilities. Fix: Implement robust input validation for all URLs and file paths, use safe path joining utilities, validate file names to prevent directory traversal, implement allowlist-based filename sanitization.
  • Medium · Unencrypted External Command Execution — src/you_get/cli_wrapper/transcoder/*.py, src/you_get/cli_wrapper/player/*.py, src/you_get/cli_wrapper/openssl/. The project uses CLI wrappers for external tools (ffmpeg, mplayer, vlc, OpenSSL) without visible security controls. This could be exploited for command injection if command arguments aren't properly escaped. Fix: Use subprocess with shell=False, implement strict argument whitelisting, avoid shell interpolation, validate all external inputs before passing to subprocess calls, use shlex.quote() for any required shell escaping.
  • Low · No Security Headers Documentation — SECURITY.md, README.md, CONTRIBUTING.md. While this is primarily a CLI tool, there's no documented security best practices for users or guidelines for secure credential handling. Fix: Expand SECURITY.md with security best practices for users, including: how to handle credentials securely, warning about risks of downloading from untrusted sources, recommendations for network security.
  • Low · GitHub Workflow Security — .github/workflows/python-package.yml. The CI/CD workflow file is minimal and doesn't show evidence of security scanning, dependency checking, or SAST integration. Fix: Integrate security scanning tools such as Bandit, Safety, or Snyk; implement dependency vulnerability checking; add SAST tools; enforce branch protection rules; implement code review requirements.

LLM-derived; treat as a starting point, not a security audit.

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/soimort/you-get shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live soimort/you-get repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/soimort/you-get.

What it runs against: a local clone of soimort/you-get — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in soimort/you-get | Confirms the artifact applies here, not a fork | | 2 | License is still Other | Catches relicense before you depend on it | | 3 | Default branch develop exists | Catches branch renames | | 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 5 | Last commit ≤ 39 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>soimort/you-get</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of soimort/you-get. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/soimort/you-get.git
#   cd you-get
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of soimort/you-get and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "soimort/you-get(\\.git)?\\b" \\
  && ok "origin remote is soimort/you-get" \\
  || miss "origin remote is not soimort/you-get (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \\
   || grep -qiE "\"license\"\\s*:\\s*\"Other\"" package.json 2>/dev/null) \\
  && ok "license is Other" \\
  || miss "license drift — was Other at generation time"

# 3. Default branch
git rev-parse --verify develop >/dev/null 2>&1 \\
  && ok "default branch develop exists" \\
  || miss "default branch develop no longer exists"

# 4. Critical files exist
test -f "src/you_get/__init__.py" \\
  && ok "src/you_get/__init__.py" \\
  || miss "missing critical file: src/you_get/__init__.py"
test -f "src/you_get/__main__.py" \\
  && ok "src/you_get/__main__.py" \\
  || miss "missing critical file: src/you_get/__main__.py"
test -f "src/you_get/extractor.py" \\
  && ok "src/you_get/extractor.py" \\
  || miss "missing critical file: src/you_get/extractor.py"
test -f "src/you_get/common.py" \\
  && ok "src/you_get/common.py" \\
  || miss "missing critical file: src/you_get/common.py"
test -f "src/you_get/extractors/__init__.py" \\
  && ok "src/you_get/extractors/__init__.py" \\
  || miss "missing critical file: src/you_get/extractors/__init__.py"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 39 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~9d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/soimort/you-get"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Embed this chat in your README →

Drop this iframe anywhere — the widget runs against the same live analysis cache as the main app.

<iframe
  src="https://repopilot.app/embed/soimort/you-get"
  width="100%" height="500"
  style="border:1px solid #d0d7de; border-radius:8px;"
  allow="microphone"
  loading="lazy"
></iframe>