soimort/you-get
:arrow_double_down: Dumb downloader that scrapes the web
Mixed signals — read the receipts
Weakest axis: non-standard license (Other)
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓ Last commit 1w ago
- ✓ 14 active contributors
- ✓ License: Other
- ✓ CI configured
- ✓ Tests present
- ⚠ Concentrated ownership — top contributor handles 79% of recent commits
- ⚠ Non-standard license (Other) — review terms
What would change the summary?
- → Use as dependency: Concerns → Mixed if license terms are clarified
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding: soimort/you-get
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/soimort/you-get shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Mixed signals — read the receipts
- Last commit 1w ago
- 14 active contributors
- License: Other
- CI configured
- Tests present
- ⚠ Concentrated ownership — top contributor handles 79% of recent commits
- ⚠ Non-standard license (Other) — review terms
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live soimort/you-get
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/soimort/you-get.
What it runs against: a local clone of soimort/you-get — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in soimort/you-get | Confirms the artifact applies here, not a fork |
| 2 | License is still Other | Catches relicense before you depend on it |
| 3 | Default branch develop exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 37 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of soimort/you-get. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/soimort/you-get.git
#   cd you-get
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok()   { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of soimort/you-get and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "soimort/you-get(\.git)?\b" \
  && ok "origin remote is soimort/you-get" \
  || miss "origin remote is not soimort/you-get (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Other\"" package.json 2>/dev/null) \
  && ok "license is Other" \
  || miss "license drift — was Other at generation time"

# 3. Default branch
git rev-parse --verify develop >/dev/null 2>&1 \
  && ok "default branch develop exists" \
  || miss "default branch develop no longer exists"

# 4. Critical files exist
for f in \
  src/you_get/__main__.py \
  src/you_get/extractor.py \
  src/you_get/common.py \
  src/you_get/extractors/__init__.py \
  src/you_get/cli_wrapper/downloader/__init__.py
do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 37 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~7d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/soimort/you-get"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
You-Get is a command-line scraper and downloader that extracts media (video, audio, images) from web pages—primarily YouTube, Youku, Niconico, and 50+ other video platforms—without requiring a browser or proprietary plugins. It parses HTML/JavaScript from target sites, identifies media streams, and downloads them directly to disk, optionally converting with FFmpeg. It is a monolithic CLI tool: `src/you_get/__main__.py` is the entry point; `src/you_get/extractor.py` defines the base `Extractor` class that all 50+ site-specific extractors (`src/you_get/extractors/*.py`) inherit from; `src/you_get/common.py` holds shared utilities (HTTP, serialization, file I/O); and wrappers in `src/you_get/cli_wrapper/` abstract external tools (FFmpeg, players like VLC/MPlayer, OpenSSL).
👥Who it's for
Users (power users, privacy advocates, archivists) who want to download online video content without Flash or proprietary code, and Python developers who want to build custom downloaders by extending the extractor framework in src/you_get/extractors/ for new video sites.
🌱Maturity & risk
Actively maintained but aging: the codebase is 10+ years old (Python 2/3 migration completed), has 3.5K+ stars, includes CI via GitHub Actions (.github/workflows/python-package.yml), but the last major version bump notice (May 2022) warns of dropping Python 3.5-3.7 support due to TLS 1.3 PHA issues, indicating gradual modernization rather than rapid feature development.
Moderate risk: (1) Heavy web-scraping dependency—site structure changes break extractors frequently (50+ extractor files in src/you_get/extractors/ require manual updates); (2) Single dependency (dukpy) in requirements.txt for JavaScript evaluation, which can be a fragility point; (3) No test directory visible in the top 60 files, suggesting limited automated test coverage; (4) Relies on external tools (FFmpeg, RTMPDump) which may not be installed; (5) Maintainer-heavy project (core logic lives in src/you_get/extractor.py base class).
Active areas of work
Maintenance mode: the repo is actively tested via CI on each push (per python-package.yml), but recent focus is stability/compatibility (Python version support deprecation notice, TLS 1.3 compliance). No specific open initiatives visible from file list; typical activity is likely bug fixes for broken extractors and dependency updates.
🚀Get running
```bash
git clone https://github.com/soimort/you-get.git
cd you-get
pip install -r requirements.txt
pip install -e .    # install in editable mode
you-get --help
```
Daily commands — Development: run `python -m you_get <URL>` after `pip install -e .`. For spot-testing a specific site, something like `python -c "from you_get.extractors.youtube import YouTube; YouTube('<URL>').download('.')"` may work, but verify the class name against the extractor module first; per-site APIs vary. The Makefile likely has targets such as `make test` or `make install`.
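Beyond one-off `python -c` calls, a pattern that survives internal refactors is to drive the CLI as a subprocess. A minimal sketch; only the `--output-dir` flag is assumed here (verify it against `you-get --help`), and `extra_args` is a pass-through for anything version-specific:

```python
import subprocess
import sys

def build_you_get_cmd(url, output_dir=".", extra_args=()):
    """Build an argv list for running you-get via `python -m you_get`."""
    cmd = [sys.executable, "-m", "you_get", "--output-dir", output_dir]
    cmd.extend(extra_args)   # e.g. site/format flags, verified per version
    cmd.append(url)          # URL last, as its own argv element (no shell parsing)
    return cmd

def download(url, **kwargs):
    """Run the download; returns the process exit code (0 = success)."""
    return subprocess.run(build_you_get_cmd(url, **kwargs), check=False).returncode
```

Because the argv is a list and no shell is involved, a hostile URL cannot inject shell commands, which also sidesteps the command-injection concern raised in the security observations.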
🗺️Map of the codebase
- `src/you_get/__main__.py` — entry point for the CLI; all downloads start here, essential for understanding the execution flow
- `src/you_get/extractor.py` — core abstraction defining the `Extractor` base class that all 100+ site extractors inherit from
- `src/you_get/common.py` — shared utilities for HTTP requests, file I/O, and media parsing used across all extractors
- `src/you_get/extractors/__init__.py` — registry and loader for all site-specific extractors; critical for URL routing and plugin discovery
- `src/you_get/cli_wrapper/downloader/__init__.py` — orchestrates the actual download once an extractor provides media URLs and metadata
- `setup.py` — package metadata, dependencies (dukpy), and entry points for CLI installation
🧩Components & responsibilities
- Extractor (base class) (Python: abc, requests, regex, JSON parsing, dukpy for JS decryption) — abstract interface defining the prepare() and download() lifecycle. Each site extractor overrides prepare() to populate self.streams with media URLs and metadata.
  - Failure mode: if prepare() fails to populate streams, the download cannot proceed; the user sees an extraction failure until the extractor is fixed.
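A schematic sketch of that lifecycle, with stand-in names mirroring the description above (the real base class in `src/you_get/extractor.py` carries more state and helper methods; this is not its actual API):

```python
class Extractor:
    """Toy version of the prepare()/download() contract described above."""

    def __init__(self, url):
        self.url = url
        self.streams = {}  # stream_id -> {"url": ..., "size": ...}

    def prepare(self, **kwargs):
        """Site-specific: scrape the page and populate self.streams."""
        raise NotImplementedError

    def download(self, stream_id=None, **kwargs):
        self.prepare(**kwargs)
        if not self.streams:
            # The failure mode noted above: nothing extracted, nothing to fetch.
            raise RuntimeError(f"no streams extracted for {self.url}")
        sid = stream_id or max(self.streams,
                               key=lambda s: self.streams[s].get("size", 0))
        return self.streams[sid]["url"]  # real code fetches this URL to disk

class FakeSite(Extractor):
    def prepare(self, **kwargs):
        self.streams = {
            "hd": {"url": self.url + "/hd.mp4", "size": 2},
            "sd": {"url": self.url + "/sd.mp4", "size": 1},
        }
```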
🛠️How to make changes
Add support for a new video/audio/image site
1. Create a new extractor file in `src/you_get/extractors/`, following the naming convention (e.g., `src/you_get/extractors/newsite.py`).
2. Inherit from the `Extractor` base class (`src/you_get/extractor.py`) and implement `prepare()` to fetch metadata and `download()` to retrieve media.
3. In `prepare()`, parse the URL, scrape the HTML/JSON, and populate `self.streams` with media URLs and metadata (`src/you_get/extractors/newsite.py`).
4. Add site-detection logic to `src/you_get/extractors/__init__.py` so the CLI auto-routes matching URLs to your extractor.
5. Use `common.py` utilities (`request()`, `parse_*`) to handle HTTP and media parsing consistently (`src/you_get/common.py`).
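The routing step can be sketched like this; the table name and shape are hypothetical, and the real registry lives in `src/you_get/extractors/__init__.py`:

```python
# Hypothetical URL-routing sketch: map URL patterns to extractor names.
import re

SITE_TABLE = [
    (re.compile(r"https?://(www\.)?newsite\.example/watch/(\w+)"), "newsite"),
    (re.compile(r"https?://(www\.)?youtube\.com/watch\?v="), "youtube"),
]

def route(url):
    """Return the extractor name that claims this URL, or None."""
    for pattern, name in SITE_TABLE:
        if pattern.match(url):
            return name
    return None
```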
Customize download behavior (parallel streams, formats, output paths)
1. Examine the extractor instance's `self.streams` dict structure (MIME type, quality, URL mapping) (`src/you_get/extractor.py`).
2. Modify `src/you_get/cli_wrapper/downloader/__init__.py` to select specific streams, adjust concurrency, or chain transcoding.
3. Optionally wrap `ffmpeg.py` or other transcoders to apply filters or format conversions post-download (`src/you_get/cli_wrapper/transcoder/ffmpeg.py`).
Extend CLI options and help text
1. Add argument-parser configuration in `src/you_get/__main__.py` to define new flags (`--quality`, `--playlist`, etc.).
2. Pass parsed options to the selected extractor's `prepare()` method so site-specific logic can honor user preferences (`src/you_get/extractor.py`).
3. Update the downloader and player wrappers to consume and apply those options during stream selection and playback (`src/you_get/cli_wrapper/downloader/__init__.py`).
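A standalone illustration of wiring a new flag through to site-specific logic; you-get's real option handling lives in `src/you_get/common.py` and `__main__.py` and may not be structured this way:

```python
import argparse

def parse_args(argv):
    p = argparse.ArgumentParser(prog="you-get-sketch")
    p.add_argument("url")
    p.add_argument("--quality", choices=["best", "worst"], default="best")
    p.add_argument("--playlist", action="store_true",
                   help="download the whole playlist, not just one item")
    return p.parse_args(argv)

def run(argv, extractor):
    """Parse flags, then hand them to the extractor as keyword options."""
    args = parse_args(argv)
    return extractor(args.url, quality=args.quality, playlist=args.playlist)
```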
🔧Why these technologies
- Python 3.8+ — Cross-platform scripting; easy HTTP and JSON parsing; mature ecosystem for web scraping
- dukpy (runtime JS engine) — Some extractors (e.g., Bilibili, IQiyi) require executing JS to decrypt/obfuscate media URLs server-side
- External tools (ffmpeg, VLC, mplayer) — Offload transcoding and playback to proven, highly optimized binaries rather than pure-Python implementations
- Modular extractor plugin pattern — 100+ sites require independent parsing logic; plugin architecture isolates site-specific code and enables easy addition of new sites
⚖️Trade-offs already made
- Dynamic extractor loading at runtime rather than a pre-built binary/bundled format
  - Why: allows rapid iteration on individual site extractors without a full rebuild; users can patch a single .py file
  - Consequence: startup is slightly slower (module discovery), but site breakage can be patched immediately without cutting a new release
- External ffmpeg/VLC dependency rather than Python-native transcoding
  - Why: avoids reimplementing codec logic; ffmpeg is ubiquitous, battle-tested, and highly performant
  - Consequence: users must install ffmpeg separately (notably on Windows); adds a system dependency but ensures quality and speed
- Synchronous, blocking I/O with optional parallel downloads instead of fully async
  - Why: simpler mental model for a CLI tool; easier to implement with standard-library threading; dukpy JS execution is blocking anyway
  - Consequence: large parallel downloads may be CPU-bound on JS decryption; no event-loop overhead
- Human-oriented CLI output (pretty tables, progress bars) rather than JSON-only
  - Why: the primary use case is direct human interaction and manual downloads; easier to debug site breakage visually
  - Consequence: the default output is hard to parse programmatically; automation must go through the internal streams dict (or a JSON dump, if the installed version provides one)
🚫Non-goals (don't propose these)
- Not a media server or streaming proxy—no HTTP server, no caching layer, no multi-user access control
- Not a real-time transcoder—does not support live streams (only VOD/pre-recorded content for most sites)
- Not a browser automation tool—does not handle JavaScript-rendered pages unless site provides static JSON API or dukpy can decrypt it
- Not a DRM-breaking tool—skips sites that require Widevine or PlayReady; focuses on scraping public/unencrypted streams
- Not multi-threaded by default—uses sequential extraction and serial downloads unless --concurrent flag is set; no distributed crawling
🪤Traps & gotchas
- Site-specific changes: YouTube, Bilibili, and Youku frequently change their HTML/JS structure, breaking extractors without notice; each prepare() method must scrape or reverse-engineer the site's current API.
- JavaScript evaluation: dukpy runs site JavaScript to extract video URLs; if a site switches to WebAssembly or service workers, the extractor will fail silently.
- External tool dependencies: FFmpeg and RTMPDump must be in PATH; there is no bundled fallback.
- No isolated tests: the test suite (if it exists) is not in the top 60 files, suggesting tests may be integration-only (real HTTP calls to live sites), which is fragile.
- Python 3.8+ only after 2022: older code may still carry Python 2 remnants (search for six and `__future__` imports).
- Encoding issues: common.py likely handles charset detection; malformed HTML from some sites may cause UnicodeDecodeError.
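For the encoding trap, a defensive-decoding pattern (illustrative only; you-get's real charset handling lives in `src/you_get/common.py`):

```python
def decode_html(raw, declared=None):
    """Try the declared charset, then likely fallbacks, before giving up.

    Avoids letting a single UnicodeDecodeError from a malformed page
    kill the whole download run.
    """
    for enc in filter(None, [declared, "utf-8", "gbk", "latin-1"]):
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("utf-8", errors="replace")  # last resort: never raise
```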
💡Concepts to learn
- Web scraping & DOM parsing — Extractors must parse HTML to find media URLs; understanding XPath, CSS selectors, and how to handle dynamic JavaScript is core to extending you-get to new sites
- Reverse engineering HTTP APIs — Many video sites hide stream URLs behind JavaScript or proprietary APIs; you-get extractors often need to mimic browser requests, decode obfuscated payloads, or intercept network calls
- Plugin/Strategy pattern — The codebase uses inheritance (Extractor base class) and dynamic loading to support 50+ sites without monolithic conditionals; learning this pattern is key to adding new extractors
- Media stream negotiation (HLS, DASH, progressive download) — Modern video platforms serve adaptive streams (m3u8, mpd manifests) rather than single files; extractors must parse these formats and select quality based on user preference or bandwidth
- JavaScript evaluation in Python (dukpy/Duktape) — Some sites encrypt or obfuscate stream URLs in JavaScript; dukpy allows you-get to execute site JavaScript within Python to extract the real URLs
- User-Agent spoofing & request headers — Many video sites block or degrade service to non-browser clients; extractors must forge convincing browser headers to avoid detection and throttling
- Subtitle/metadata extraction (SRT, VTT, JSON formats) — Videos often ship with subtitles and metadata; you-get extractors should fetch and save these alongside the video file in standard formats
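To make the HLS concept concrete, here is a minimal parser that picks the highest-bandwidth variant from an m3u8 master playlist; real manifests carry many more tags than this handles:

```python
def best_variant(m3u8_text):
    """Return the URI of the highest-BANDWIDTH #EXT-X-STREAM-INF variant."""
    best = (-1, None)
    lines = [l.strip() for l in m3u8_text.splitlines() if l.strip()]
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF"):
            attrs = line.split(":", 1)[1]
            bw = 0
            for part in attrs.split(","):
                if part.startswith("BANDWIDTH="):
                    bw = int(part.split("=", 1)[1])
            # Per the HLS spec, the variant URI is the next non-tag line.
            if i + 1 < len(lines) and not lines[i + 1].startswith("#"):
                best = max(best, (bw, lines[i + 1]))
    return best[1]
```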
🔗Related repos
- yt-dlp/yt-dlp — drop-in replacement for the now-unmaintained youtube-dl, with active development; handles 1000+ sites via a similar plugin architecture but with better maintenance and faster update cycles
- rg3/youtube-dl — predecessor/inspiration; the original popular YouTube downloader that partly inspired you-get; mostly unmaintained now but widely forked
- mikf/gallery-dl — similar scraper architecture but focused on image galleries (Flickr, Instagram, Tumblr, etc.); uses the same plugin pattern and appeals to the same power-user audience
- soimort/mononoke — companion project by the same author; a Rust-based reimplementation exploring an alternative architecture for media downloading
- ytdl-org/youtube-dl — the official current fork/revival of the original youtube-dl project; competes with you-get on YouTube/broad-site support
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for extractor base class (src/you_get/extractor.py)
The repo has 50+ extractor implementations in src/you_get/extractors/ but no visible test suite. The extractor.py base class is critical infrastructure - adding unit tests would ensure new extractors follow patterns correctly and catch regressions. This is especially important given the breadth of site-specific extractors that depend on it.
- [ ] Create tests/test_extractor.py to test src/you_get/extractor.py base class methods
- [ ] Add tests for common extractor patterns (URL validation, metadata extraction, stream parsing)
- [ ] Add integration tests for 2-3 sample extractors (e.g., bilibili, youtube) to verify real-world usage
- [ ] Update .github/workflows/python-package.yml to run pytest in CI pipeline
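A possible shape for the first checklist item, sketched against a stand-in class since the real base class is not importable here (swap in `from you_get.extractor import ...` once the actual API is confirmed):

```python
class StubExtractor:
    """Stand-in mirroring the documented contract: prepare() fills streams."""
    def __init__(self, url):
        self.url, self.streams = url, {}

    def prepare(self):
        self.streams = {"default": {"url": self.url, "container": "mp4"}}

def test_prepare_populates_streams():
    ex = StubExtractor("https://example.com/v/1")
    ex.prepare()
    assert ex.streams, "prepare() must populate streams"
    assert all("url" in s for s in ex.streams.values())

def test_fresh_extractor_has_no_streams():
    assert StubExtractor("https://example.com/v/2").streams == {}
```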
Add type hints and mypy validation to src/you_get/common.py
The common.py module is a critical utility module used across all extractors, but has no type hints. Adding comprehensive type hints would improve code quality, catch bugs early, and make the codebase more maintainable. This is high-leverage since it affects all downstream code that imports from common.py.
- [ ] Add type hints to all function signatures in src/you_get/common.py
- [ ] Create a mypy configuration in setup.cfg or pyproject.toml
- [ ] Add mypy check to .github/workflows/python-package.yml to enforce types on future PRs
- [ ] Document the typing approach in CONTRIBUTING.md for new contributors
Expand the minimal security disclosure policy (SECURITY.md)
The repo has a SECURITY.md file present in the file structure, but given the project's nature (downloading content from the web), it likely needs comprehensive security guidance. The file should cover dependency vulnerability reporting, extractor-specific security concerns (SSL/TLS), and a disclosure timeline. This is critical for a project that handles web scraping and encryption (openssl wrapper exists).
- [ ] Expand SECURITY.md with a detailed reporting process and expected response timeline
- [ ] Add security best practices section covering: HTTPS validation, user-agent handling, and dependency pinning recommendations
- [ ] Document known security considerations in src/you_get/cli_wrapper/openssl/ module usage
- [ ] Add a MAINTAINERS or SECURITY_CONTACTS section with response details
🌿Good first issues
- Add unit tests for `src/you_get/common.py` utilities: the common module handles HTTP, file I/O, and serialization but appears to have no dedicated test file in the visible structure. Write pytest tests for `prettify()`, URL parsing, and error-handling functions to improve coverage and catch regressions early.
- Document the Extractor base class API: create a contributor guide showing how to implement `prepare()` and `download()` with a template (see `extractor.py`). Include examples from simple sites (e.g., Bandcamp) versus complex ones (YouTube) to lower the barrier for adding new extractors.
- Add integration tests for 3–5 major extractors: the file list shows no `tests/` directory. Create a test suite that validates the YouTube, Bilibili, and Archive.org extractors (with cached/mocked responses) to catch when sites change and break extractors; integrate it with CI.
⭐Top contributors
Click to expand
- @soimort — 79 commits
- @crnkv — 7 commits
- @chenrui333 — 2 commits
- @ifui — 2 commits
- @ljhcage — 1 commit
📝Recent commits
Click to expand
- 049548f — README.md: add --force-reinstall to pip because it is now necessary for upgrading from a VCS URL to work when the packag (soimort)
- 1af8b71 — python-package.yml: remove python 3.7 (no longer available in Ubuntu 24.04) (soimort)
- c7e7525 — python-package.yml: disable the new flake8 F824 check (soimort)
- 57cf717 — python-package.yml: disable the new flake8 F824 check (soimort)
- 4fb7d23 — [common] fix a long-standing bug that causes infinite downloading when content-length is missing (soimort)
- e9165e0 — version 0.4.1743 (soimort)
- f25ddca — [youtube] fix caption tracks extraction (soimort)
- 51a7eb5 — [youtube] update self.ua (fix extraction) (soimort)
- ce1f930 — Merge branch 'develop' of https://github.com/ljhcage/you-get into ljhcage-develop (soimort)
- 7fbd4c3 — python-package.yml: update artifact actions (soimort)
🔒Security observations
You-Get has moderate security concerns, primarily around its use of an unmaintained dependency (dukpy) and the inherent risks of web scraping from untrusted sources. The project handles external command execution and URL processing without explicit security documentation. Key risks include potential code injection through JavaScript execution, command injection via subprocess calls, MITM vulnerabilities in HTTP requests, and input validation issues with user-provided URLs. The codebase lacks comprehensive security documentation and best practices. Immediate actions should include the dependency, subprocess, and input-validation fixes listed below.
- High · Dependency on dukpy, inactive and unmaintained — `requirements.txt`, `setup.py`. The project depends on dukpy, a Python binding for the embedded Duktape JavaScript engine. The package is reported as no longer actively maintained and is not regularly updated for security patches. Fix: replace dukpy with actively maintained alternatives such as Node.js invoked via subprocess, pyppeteer, or playwright for JavaScript execution; if in-process JS execution is required, apply proper sandboxing.
- High · Web scraping with potential code injection — `src/you_get/extractors/` (all extractor files). The project scrapes web content from many sources; the large number of extractors (60+ sites) implies extensive regex parsing, HTML parsing, and execution of untrusted JavaScript, creating injection risks if user-controlled URLs or content are processed without validation. Fix: implement strict input validation for all URLs and scraped content; sandbox JavaScript execution; audit regex patterns and HTML-parsing routines; consider security-focused parsing libraries with built-in protections.
- Medium · Insecure external command execution — `src/you_get/cli_wrapper/` (downloader, player, transcoder, openssl modules). The project wraps external tools (ffmpeg, mplayer, vlc, mencoder, libav, openssl) via CLI; unsanitized user input passed into subprocess calls could enable command injection. Fix: use subprocess with shell=False and pass arguments as a list, not shell strings; validate and sanitize arguments before passing them on; never use os.system() or shell=True.
- Medium · Missing HTTPS enforcement — `src/you_get/common.py`, `src/you_get/extractor.py`. Without explicit HTTPS enforcement and certificate validation, the tool is vulnerable to MITM attacks while scraping. Fix: enforce HTTPS where possible, validate certificates strictly, warn or error on unencrypted connections, and document best practices for users.
- Medium · No input validation on URLs — `src/you_get/__main__.py`, `src/you_get/extractor.py`. Malicious URLs could exploit parsing logic or enable SSRF if the scraper visits attacker-controlled endpoints. Fix: validate URLs against supported domains, restrict schemes to http/https, and block private IP ranges and localhost.
- Low · Minimal security documentation — `SECURITY.md`. The file lacks guidance on secure download practices, signature verification, and the disclosure process. Fix: expand SECURITY.md with user best practices, a vulnerability-disclosure timeline, supported versions for security updates, secure-installation instructions, and warnings about third-party sites' content policies.
- Low · Potential path traversal in downloads — `src/you_get/cli_wrapper/downloader/__init__.py`. The downloader accepts filenames from web sources without explicit validation; crafted responses containing traversal sequences (../../../) could write files outside the intended directory. Fix: sanitize filenames extracted from web content (e.g., os.path.basename()), and verify with pathlib.Path.resolve() that the final download path stays inside the intended directory.
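The path-traversal fix can be sketched as follows (function name and signature are hypothetical, not you-get code):

```python
import pathlib

def safe_download_path(download_dir, remote_name):
    """Resolve a server-supplied filename safely inside download_dir."""
    base = pathlib.Path(download_dir).resolve()
    # basename-style strip: drops "../" prefixes and directory components
    name = pathlib.PurePosixPath(remote_name.replace("\\", "/")).name or "download"
    target = (base / name).resolve()
    # Belt and braces: even after stripping, confirm we stayed inside base
    # (catches names like "a/.." whose final component is still "..").
    if base not in target.parents:
        raise ValueError(f"unsafe filename from server: {remote_name!r}")
    return target
```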
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.