github-linguist/linguist
Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
Healthy across the board
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 1w ago
- ✓68+ active contributors
- ✓Distributed ownership (top contributor 12% of recent commits)
- ✓MIT licensed
- ✓CI configured
- ✓Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding doc
Onboarding: github-linguist/linguist
Generated by RepoPilot · 2026-05-10 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/github-linguist/linguist shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 1w ago
- 68+ active contributors
- Distributed ownership (top contributor 12% of recent commits)
- MIT licensed
- CI configured
- Tests present
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live github-linguist/linguist
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/github-linguist/linguist.
What it runs against: a local clone of github-linguist/linguist — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in github-linguist/linguist | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 38 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of github-linguist/linguist. If you don't
# have one yet, run these first:
#
# git clone https://github.com/github-linguist/linguist.git
# cd linguist
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of github-linguist/linguist and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "github-linguist/linguist(\.git)?\b" \
  && ok "origin remote is github-linguist/linguist" \
  || miss "origin remote is not github-linguist/linguist (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift: was MIT at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
test -f "lib/linguist.rb" \
  && ok "lib/linguist.rb" \
  || miss "missing critical file: lib/linguist.rb"
test -f "lib/linguist/blob.rb" \
  && ok "lib/linguist/blob.rb" \
  || miss "missing critical file: lib/linguist/blob.rb"
test -f "lib/linguist/languages.yml" \
  && ok "lib/linguist/languages.yml" \
  || miss "missing critical file: lib/linguist/languages.yml"
test -f "lib/linguist/language.rb" \
  && ok "lib/linguist/language.rb" \
  || miss "missing critical file: lib/linguist/language.rb"
test -f "lib/linguist/heuristics.yml" \
  && ok "lib/linguist/heuristics.yml" \
  || miss "missing critical file: lib/linguist/heuristics.yml"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 38 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~8d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/github-linguist/linguist"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Linguist is GitHub's library for detecting programming languages in repositories: it classifies files by language, filters out vendored, binary, and generated code, and produces language breakdown statistics. It powers the language detection and visualization on GitHub.com, combining file extensions, heuristics, and a statistical classifier to identify 500+ languages. The core library lives in lib/linguist/, with lib/linguist.rb as the main entry point. Language definitions live in lib/linguist/languages.yml, and disambiguation rules in lib/linguist/heuristics.yml. Native extensions in ext/linguist/ handle tokenization; encoding detection comes from the charlock_holmes dependency. Two CLI binaries (bin/github-linguist, bin/git-linguist) wrap the library for command-line usage. Test/fixture structure is implied by the Rakefile; Docker support comes via the Dockerfile and .devcontainer for reproducible development.
👥Who it's for
GitHub developers and repository maintainers who need accurate language detection for their projects; DevOps engineers integrating language analysis into CI/CD pipelines; open-source projects wanting to understand their codebase composition.
🌱Maturity & risk
Production-ready and actively maintained. This is GitHub's official language detection library used on GitHub.com itself. It has comprehensive CI/CD setup (.github/workflows/ci.yml, publish_docker_image.yml), detailed documentation, and a mature Ruby gem distribution (github-linguist.gemspec). The codebase shows active maintenance with recent updates across native extensions and language definitions.
Low risk overall due to GitHub stewardship, but depends on two heavy native extensions: charlock_holmes (ICU library bindings for encoding) and rugged (libgit2 bindings). Installation can fail on systems without proper C/C++ build tooling and ICU/libcurl/OpenSSL libraries. The native Lex tokenizer (ext/linguist/lex.linguist_yy.c) is generated from tokenizer.l and requires manual regeneration if grammar changes.
Active areas of work
Active maintenance of language definitions, CI infrastructure, and dependency updates (see dependabot.yml). The repo includes issue templates for bug reports and new language requests (.github/ISSUE_TEMPLATE/new_language.md), indicating ongoing community contributions for language support expansion.
🚀Get running
git clone https://github.com/github-linguist/linguist.git
cd linguist
bundle install
bundle exec rake test
bundle exec bin/github-linguist # Run Linguist on its own repository
Alternatively, use GitHub Codespaces: click the 'Open in GitHub Codespaces' badge in the README. Ruby 2.7+ required; system dependencies installed via Brewfile (macOS) or apt (Ubuntu).
Daily commands:
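No server to start; this is a library.
- In your app: require 'linguist' and build a Linguist::Repository from a Rugged repository object and a commit ID.
- CLI analysis: cd /any/git/repo && github-linguist (with the gem installed).
- Development testing: bundle exec rake test.
- Docker: docker build -t linguist . && docker run linguist github-linguist /path/to/repo. The repository path must be mounted into the container (for example with -v), otherwise the container cannot see it.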
🗺️Map of the codebase
- lib/linguist.rb: Main entry point and top-level API that exposes the core Linguist functionality for language detection.
- lib/linguist/blob.rb: Core abstraction representing a file blob; orchestrates language detection across all detection strategies.
- lib/linguist/languages.yml: Master language definitions and configuration data that drives all language recognition logic.
- lib/linguist/language.rb: Language model class that encapsulates language properties, file associations, and type metadata.
- lib/linguist/heuristics.yml: Heuristic rules that disambiguate edge cases in language detection for ambiguous file patterns.
- lib/linguist/repository.rb: Repository analysis interface that aggregates per-file language detection into repository-level statistics.
- ext/linguist/tokenizer.l: Lexical tokenizer implementation (Lex/Flex) that provides fast statistical language analysis.
🧩Components & responsibilities
- Blob (Ruby, File I/O) — Core abstraction representing a single file; orchestrates detection strategy chain and caches results.
- Failure mode: Missing or incorrect file content raises exceptions; detection returns nil if all strategies fail.
- Strategy Layer — Pluggable detection modules (Extension, Filename, Shebang, Modeline, XML) applied in order until
🛠️How to make changes
Add support for a new language
- Define the language in lib/linguist/languages.yml with name, type, extensions, aliases, color (lib/linguist/languages.yml)
- If needed, add ambiguity resolution rules in lib/linguist/heuristics.yml (lib/linguist/heuristics.yml)
- Optionally add sample files in samples/{LanguageName}/ for training the classifier (samples)
- Run tests to verify detection works across file extensions and edge cases (Rakefile)
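As a rough illustration of the first step, the sketch below parses a made-up entry shaped like the fields listed above and runs a few sanity checks on it. The language "Examplang", its values, and the `entry_problems` helper are hypothetical, and the checks are illustrative only; real entries in languages.yml may need additional keys, so copy an existing entry rather than this one.

```ruby
require "yaml"

# Hypothetical entry using only the fields named in the steps above.
# "Examplang" and every value here are invented for illustration.
ENTRY_YAML = <<~YML
  Examplang:
    type: programming
    color: "#3a7bd5"
    extensions:
      - ".exl"
    aliases:
      - exl
YML

# Types assumed for this sketch.
VALID_TYPES = %w[programming markup data prose].freeze

# Returns a list of human-readable problems; an empty list means the entry
# passed these (illustrative, not exhaustive) checks.
def entry_problems(name, attrs)
  problems = []
  problems << "#{name}: missing type" unless attrs.key?("type")
  if attrs["type"] && !VALID_TYPES.include?(attrs["type"])
    problems << "#{name}: unknown type #{attrs['type']}"
  end
  Array(attrs["extensions"]).each do |ext|
    problems << "#{name}: extension #{ext} must start with a dot" unless ext.start_with?(".")
  end
  problems
end

LANGUAGES = YAML.safe_load(ENTRY_YAML)
LANGUAGE_PROBLEMS = LANGUAGES.flat_map { |name, attrs| entry_problems(name, attrs) }
puts LANGUAGE_PROBLEMS.empty? ? "entry ok" : LANGUAGE_PROBLEMS
```

A check like this only catches shape errors; whether detection actually works still has to be confirmed by the test suite.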
Add a custom file detection strategy
- Create a new strategy class in lib/linguist/strategy/{strategy_name}.rb (lib/linguist/strategy)
- Implement a call(blob, candidates) class method that returns the candidate languages, or an empty list when the strategy has no opinion (lib/linguist/strategy)
- Register the strategy in the STRATEGIES list in lib/linguist.rb, which sets the order of the detection chain (lib/linguist.rb)
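To make the idea of an ordered strategy chain concrete, here is a minimal, self-contained sketch. It does not use the Linguist gem: `SketchBlob`, the three strategy modules, and `detect_sketch` are hypothetical names, and the convention that each strategy returns a list of candidate language names is an assumption of this sketch. Check lib/linguist.rb and lib/linguist/strategy/ for the real interface before writing a strategy.

```ruby
# Stand-in for a file. The real library has its own blob classes; this
# Struct is a hypothetical substitute so the sketch runs on its own.
SketchBlob = Struct.new(:name, :data)

# Each hypothetical strategy answers call(blob, candidates) with a list of
# language names. An empty list means "no opinion".
module ShebangSketch
  def self.call(blob, _candidates)
    first_line = blob.data.lines.first.to_s
    (first_line.start_with?("#!") && first_line.match?(/\bruby\b/)) ? ["Ruby"] : []
  end
end

module ExtensionSketch
  TABLE = { ".rb" => ["Ruby"], ".h" => ["C", "C++", "Objective-C"] }.freeze

  def self.call(blob, _candidates)
    TABLE.fetch(File.extname(blob.name), [])
  end
end

module ContentSketch
  # Narrows an ambiguous candidate list by looking at the file content.
  def self.call(blob, candidates)
    return [] unless candidates.include?("Objective-C")
    blob.data.match?(/^\s*@(interface|implementation)\b/) ? ["Objective-C"] : []
  end
end

SKETCH_STRATEGIES = [ShebangSketch, ExtensionSketch, ContentSketch].freeze

# Walks the chain in order and stops as soon as exactly one candidate remains.
def detect_sketch(blob, strategies = SKETCH_STRATEGIES)
  candidates = []
  strategies.each do |strategy|
    result = strategy.call(blob, candidates)
    return result.first if result.size == 1
    candidates = result if result.size > 1
  end
  nil
end

puts detect_sketch(SketchBlob.new("tool", "#!/usr/bin/env ruby\nputs 1"))
puts detect_sketch(SketchBlob.new("view.h", "@interface View\n@end"))
```

The order of the list matters in this design: cheap, decisive checks go first, and later strategies only see files that earlier ones could not settle.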
Handle a file as vendor, generated, or documentation
- Add path patterns to lib/linguist/vendor.yml to ignore vendored code (lib/linguist/vendor.yml)
- Or add regex patterns to lib/linguist/generated.rb to mark files as generated (lib/linguist/generated.rb)
- Or add patterns to lib/linguist/documentation.yml for documentation files (lib/linguist/documentation.yml)
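Path-pattern files of this kind are typically lists of regular expressions that are tested against a repository-relative path. The sketch below shows that mechanism with three illustrative patterns; the pattern list and the `vendored_sketch?` helper are assumptions made for the example, not the contents of vendor.yml or Linguist's own API.

```ruby
require "yaml"

# Illustrative patterns only; the real vendor.yml has its own, longer list.
VENDOR_YAML = <<~YML
  - '(^|/)node_modules/'
  - '(^|/)vendor/'
  - '(^|/)dist/'
YML

# Combine every pattern into one regex so each path is scanned once.
VENDOR_REGEX = Regexp.union(YAML.safe_load(VENDOR_YAML).map { |source| Regexp.new(source) })

# Hypothetical helper: true when the path matches any vendor pattern.
def vendored_sketch?(path)
  path.match?(VENDOR_REGEX)
end

%w[node_modules/left-pad/index.js src/app.rb third_party/vendor/lib.c].each do |path|
  puts "#{path}: #{vendored_sketch?(path) ? 'vendored' : 'counted'}"
end
```

Anchoring each pattern with `(^|/)` keeps a directory named `my_vendor/` from matching the `vendor/` rule, which is worth checking whenever a new pattern is added.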
🔧Why these technologies
- Ruby — Cross-platform, good for text processing and DSLs; GitHub's main language facilitates integration and contribution.
- Lex/Flex (C) for tokenizer — Provides fast, compiled tokenization for statistical language detection; avoids performance bottlenecks in pure Ruby.
- Rugged (libgit2 bindings) — Enables efficient repository traversal and git metadata access without shelling out to git CLI.
- YAML for configuration — Human-readable format for language definitions, heuristics, vendor patterns, and generated file rules.
⚖️Trade-offs already made
- Hybrid detection: rule-based strategies + statistical classifier
  - Why: Rules are fast and deterministic for obvious cases; statistics handle ambiguous files robustly.
  - Consequence: More code complexity, but better accuracy across diverse codebases; adds latency for ambiguous files requiring tokenization.
- Separate native tokenizer (Lex) vs. pure Ruby
  - Why: Native code is much faster for large files and repeated tokenization at scale.
  - Consequence: Requires C compilation and maintenance; added build complexity but essential for GitHub-scale performance.
- Per-file detection vs. whole-repository analysis
  - Why: Enables independent file language inference and supports offline/distributed usage; Repository class aggregates for stats.
  - Consequence: Repository-level detection is post-hoc aggregation, not co-optimized; may miss cross-file context clues.
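The last trade-off, per-file detection followed by post-hoc aggregation, can be sketched in a few lines. This is an assumption-laden illustration rather than the Repository class itself: `language_breakdown` is a hypothetical helper, and weighting by file size in bytes is assumed here as the aggregation rule.

```ruby
# Hypothetical input: one [language name or nil, size in bytes] pair per file.
# Files with no detected language (nil) are left out of the totals.
def language_breakdown(file_results)
  totals = Hash.new(0)
  file_results.each do |language, bytes|
    next if language.nil?
    totals[language] += bytes
  end

  grand_total = totals.values.sum
  return {} if grand_total.zero?

  # Convert byte totals into percentages rounded to one decimal place.
  totals.to_h { |language, bytes| [language, (100.0 * bytes / grand_total).round(1)] }
end

results = [["Ruby", 600], ["Ruby", 150], ["C", 250], [nil, 9_999]]
language_breakdown(results).each { |language, pct| puts "#{language}: #{pct}%" }
```

Because each file is classified independently, nothing in this step can use information from neighbouring files, which is the consequence noted above.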
🚫Non-goals (don't propose these)
- Does not provide syntax highlighting; only language identification.
- Does not execute code or perform semantic analysis; purely syntactic/statistical.
- Does not provide IDE features like code completion or refactoring.
- Does not handle real-time language detection on unsaved editor buffers at scale.
- Does not offer language-specific parsing or AST generation; stops at identification.
🪤Traps & gotchas
1) Native extensions require a full C build environment plus ICU/libcurl/OpenSSL; brew install cmake pkg-config icu4c on macOS or apt-get install build-essential on Ubuntu.
2) Changing tokenizer.l requires manually running flex to regenerate lex.linguist_yy.c; the .c file is committed, so this is easy to forget.
3) languages.yml is the single source of truth for language definitions; it must remain valid YAML, because a malformed file breaks the entire gem.
4) Repository analysis requires a valid git repo object (via Rugged); it won't work on arbitrary file trees.
5) Character encoding detection (charlock_holmes) can be slow on large files and may fail if ICU is misconfigured.
🏗️Architecture
💡Concepts to learn
- Tokenization via Flex/Lex — Linguist uses a hand-written Lex grammar (tokenizer.l) compiled to C for fast, accurate token extraction—understanding this explains the ext/linguist/ native extension and why tokens are a key input to the classifier
- Bayesian Classification: classifier.rb scores a file's tokens probabilistically against statistics learned from the sample files to assign language confidence; knowing this helps you understand why some detections are ambiguous or need overrides
- Character Encoding Detection (ICU) — charlock_holmes wraps ICU for encoding detection; Linguist must decode files correctly before tokenization, and encoding mismatches are a common source of language detection failures
- Git Object Types (Blob, Tree, Commit) — Rugged exposes Git internals; Linguist::Repository iterates Git blobs to analyze a repo—understanding Git's object model helps with debugging large-repo performance or submodule handling
- Vendor/Generated/Documentation File Suppression: Linguist filters out node_modules, dist/, build/, and auto-generated files to give an accurate language breakdown; the patterns live in lib/linguist/vendor.yml, lib/linguist/generated.rb, and lib/linguist/documentation.yml, with per-repository overrides described in docs/overrides.md
- TextMate Grammar Bundles — Linguist can integrate TextMate .plist grammars for syntax highlighting patterns and language detection; understanding .plist parsing helps extend language support
- Ruby C Extension API (Gem.StdLib) — The native extensions (linguist.c, linguist.h) are compiled Ruby C modules; understanding Ruby's C API and extconf.rb is necessary to modify tokenization or encoding logic
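For the Bayesian classification concept above, a toy naive Bayes classifier over tokens shows the general technique. This is a teaching sketch, not the algorithm in classifier.rb: the `TinyBayes` class, the add-one smoothing, and the uniform prior over languages are all assumptions chosen to keep the example short.

```ruby
# Toy multinomial naive Bayes over tokens (hypothetical, for illustration).
class TinyBayes
  def initialize
    @token_counts = Hash.new { |hash, language| hash[language] = Hash.new(0) }
    @totals = Hash.new(0) # tokens seen per language
    @vocabulary = {}      # every distinct token seen in training
  end

  # Record the tokens of one training sample for a language.
  def train(language, tokens)
    tokens.each do |token|
      @token_counts[language][token] += 1
      @totals[language] += 1
      @vocabulary[token] = true
    end
  end

  # Sum of log P(token | language) with add-one smoothing. A uniform prior
  # over languages is assumed, so it is omitted from the score.
  def score(language, tokens)
    denominator = @totals[language] + @vocabulary.size
    tokens.sum { |token| Math.log((@token_counts[language][token] + 1.0) / denominator) }
  end

  # Best-scoring trained language, or nil when nothing has been trained.
  def classify(tokens)
    @totals.keys.max_by { |language| score(language, tokens) }
  end
end

classifier = TinyBayes.new
classifier.train("Ruby", %w[def end puts end])
classifier.train("C", %w[int return ; { }])
puts classifier.classify(%w[def puts])
```

Working in log space avoids multiplying many small probabilities together, and smoothing keeps a single unseen token from zeroing out a language's score.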
🔗Related repos
- github/linguist-grammars: Official collection of language grammar definitions and TextMate bundles used by Linguist for syntax highlighting patterns
- tree-sitter/tree-sitter: Modern alternative for language parsing that Linguist could integrate; used by GitHub.com for code navigation
- github/gitignore: Companion repo of language-specific .gitignore templates; used by Linguist to identify vendored files
- libgit2/libgit2: C library that Rugged (Linguist's Git dependency) wraps; understanding this helps debug Git integration issues
- github/super-linter: GitHub linting action that uses Linguist for language detection to apply language-specific lint rules
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive test coverage for lib/linguist/heuristics.yml rules
The heuristics.yml file contains language detection rules that are critical to Linguist's accuracy, but there's no dedicated test suite showing coverage of each heuristic rule. A new test file would verify that heuristics.yml rules work as intended and prevent regressions when rules are modified.
- [ ] Create test/linguist/test_heuristics.rb with test cases for each rule in lib/linguist/heuristics.yml
- [ ] Add sample files in test/samples/ that match each heuristic rule
- [ ] Verify all heuristic rules in heuristics.yml have corresponding tests
- [ ] Ensure tests validate both positive matches and negative cases (files that should NOT match)
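The positive and negative cases in that checklist are easier to picture with a small model of what a heuristic rule does. The sketch below assumes a rule is an ordered list of (language, pattern) pairs per ambiguous extension, with the first match winning; `HEURISTIC_SKETCH`, its patterns, and `disambiguate_sketch` are hypothetical and much simpler than the rules in heuristics.yml.

```ruby
# Hypothetical rule table: ordered rules per ambiguous extension.
# A rule without a pattern acts as the fallback for that extension.
HEURISTIC_SKETCH = {
  ".h" => [
    { language: "Objective-C", pattern: /^\s*@(interface|protocol|implementation)\b/ },
    { language: "C++", pattern: /^\s*(template\s*<|class\s+\w+|namespace\s+\w+)/ },
    { language: "C" }
  ]
}.freeze

# Returns the language of the first matching rule, or nil when the
# extension has no rules at all.
def disambiguate_sketch(filename, content)
  rules = HEURISTIC_SKETCH.fetch(File.extname(filename), [])
  rule = rules.find { |r| r[:pattern].nil? || content.match?(r[:pattern]) }
  rule && rule[:language]
end

# One positive and one negative case per rule, in the spirit of the checklist.
puts disambiguate_sketch("view.h", "@interface View\n@end")
puts disambiguate_sketch("util.h", "int add(int a, int b);")
```

A test file built this way would pair each rule with at least one sample that must match and one that must fall through to a later rule.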
Expand lib/linguist/strategy/ with missing detection strategies
The strategy directory currently has extension.rb and filename.rb, but the codebase references additional detection methods (shebang, content-based). Creating dedicated strategy classes would improve code organization and make the detection pipeline more testable and maintainable.
- [ ] Create lib/linguist/strategy/shebang.rb extracting logic from lib/linguist/shebang.rb
- [ ] Create lib/linguist/strategy/content.rb for content-based detection heuristics
- [ ] Refactor lib/linguist/blob.rb to use the new strategy classes consistently
- [ ] Add tests in test/linguist/strategy/ for each new strategy class
Add integration tests for GitHub Actions CI workflow scenarios
The .github/workflows/ci.yml exists but there are no documented integration tests validating that Linguist correctly handles real repository scenarios (vendored files, generated files, .gitattributes overrides). This would catch regressions in the most common GitHub use cases.
- [ ] Create test/integration/ directory for end-to-end scenarios
- [ ] Add test repositories with .gitattributes override files and vendored/generated file patterns
- [ ] Create test/integration/test_github_attributes.rb validating .gitattributes override behavior
- [ ] Create test/integration/test_vendored_generated.rb validating correct handling of vendored and generated files from docs/overrides.md
🌿Good first issues
1) Expand language detection heuristics in lib/linguist/classifier.rb for ambiguous extensions (.sh vs shell script vs bash, .mk vs Makefile vs other build files), and add test cases to verify accuracy.
2) Add missing shebang patterns to lib/linguist/languages.yml for languages with weak extension matching (e.g., #!/usr/bin/env python2 vs python3 detection).
3) Write integration tests for each language in lib/linguist/languages.yml, verifying correct detection on real code samples in the test/samples/ directory.
⭐Top contributors
- @lildude — 12 commits
- @DecimalTurn — 7 commits
- @Alhadis — 6 commits
- @nickswalker — 3 commits
- @spenserblack — 3 commits
📝Recent commits
- e535c9a: Add SpiceDB Schema language support (#7936) (ivanauth)
- 917e840: Mark mise lockfiles as generated TOML (#7923) (risu729)
- 537297c: Release v9.5.0 (#7858) (lildude)
- cb756ae: Add .gitattributes override mention when returning the strategy (#7600) (DecimalTurn)
- e5e38c0: Add XVBA dependencies as vendored (#7532) (DecimalTurn)
- 240bf92: Adding txtpb extension to Protocol Buffer Text Format (#6566) (milesflo)
- 1549797: Support JetBrains colour scheme (.icls) files (#7851) (hearsilent)
- ba889d0: Add language support for Liquidsoap (#6565) (toots)
- 83f2cc8: Add FlatBuffers language support (#7837) (aidalgol)
- d403bc7: Add typescriptreact alias (#7762) (dennyac)
🔒Security observations
The codebase has moderate to high security concerns, primarily in the Docker infrastructure. The use of end-of-life Ruby 2.x and Alpine 3.13 is the most serious issue, as neither receives security patches any longer. The container runs as root without privilege dropping, which increases risk. The main application code structure appears reasonable, with separation of concerns and proper module organization, but infrastructure hardening is needed. No obvious hardcoded secrets or injection vulnerabilities were detected in the visible code structure. Immediate action required: update Ruby and Alpine versions, implement user privilege dropping, and establish a dependency update strategy.
- High · Outdated Alpine Linux Base Image (Dockerfile, line 1). The Dockerfile uses 'ruby:2-alpine3.13', which is based on Alpine 3.13, released in January 2021. That release has reached end-of-life and no longer receives security updates. Fix: Update to a current Alpine version such as 'ruby:3-alpine3.19' or later, and move to a currently maintained Ruby release at the same time.
- High · Outdated Ruby Version (Dockerfile, line 1). The Dockerfile specifies 'ruby:2-*', which refers to the Ruby 2.x series. Ruby 2.7, the last 2.x release, reached end-of-life in March 2023, so the series no longer receives security patches. Fix: Upgrade to an actively maintained Ruby 3.x release. Test compatibility with the codebase before upgrading.
- Medium · Incomplete apk Cache Cleanup (Dockerfile, lines 4-6). The Dockerfile removes build dependencies and cache with 'apk del build_deps && rm /var/cache/apk/', but does not explicitly clear other package manager artifacts. This could leave unnecessary files that increase image size and potential attack surface. Fix: Consolidate RUN commands, install with 'apk add --no-cache' so no index cache is written, and remove build dependencies with 'apk del --purge build_deps'.
- Medium · Missing HEALTHCHECK Directive (Dockerfile). The Dockerfile lacks a HEALTHCHECK instruction, which could allow the container to run in a failed state undetected. This is important for production deployments. Fix: Add a HEALTHCHECK directive to verify that linguist is functioning properly, e.g., 'HEALTHCHECK CMD github-linguist --version || exit 1'.
- Medium · No User Privilege Dropping (Dockerfile). The Dockerfile does not create or specify a non-root user. The container runs as root by default, which increases the impact of any container escape or RCE vulnerability. Fix: Create a non-root user and switch to it before the CMD instruction. Example: 'RUN addgroup -g 1000 linguist && adduser -D -u 1000 -G linguist linguist' and 'USER linguist'.
- Low · Missing Explicit Version Pinning for gem install (Dockerfile, line 5). The Dockerfile uses 'gem install github-linguist' without specifying a version, which installs the latest version at build time. This can lead to non-reproducible builds and unpredictable updates. Fix: Pin to a specific version with 'gem install github-linguist -v VERSION', or use a Gemfile with Bundler for reproducible dependency management.
- Low · Missing Container Labels and Metadata (Dockerfile). The Dockerfile lacks LABEL directives for version, maintainer, and source information, making it harder to track the container's provenance and purpose. Fix: Add LABEL directives for metadata: version, description, maintainer, source, and any security-relevant information.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.