github-linguist/linguist

Item: github-linguist/linguist
Rating: 5
Author: RepoPilot

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!

Healthy

Healthy across the board

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

✓Last commit 1w ago
✓68+ active contributors
✓Distributed ownership (top contributor 12% of recent commits)

✓MIT licensed
✓CI configured
✓Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/github-linguist/linguist)](https://repopilot.app/r/github-linguist/linguist)

Paste at the top of your README.md — renders inline like a shields.io badge.

▸Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/github-linguist/linguist on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: github-linguist/linguist

Generated by RepoPilot · 2026-05-10 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/github-linguist/linguist shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

Last commit 1w ago
68+ active contributors
Distributed ownership (top contributor 12% of recent commits)
MIT licensed
CI configured
Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live github-linguist/linguist repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/github-linguist/linguist.

What it runs against: a local clone of github-linguist/linguist — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in github-linguist/linguist | Confirms the artifact applies here, not a fork |
| 2 | License is still MIT | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 38 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>github-linguist/linguist</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of github-linguist/linguist. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/github-linguist/linguist.git
#   cd linguist
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of github-linguist/linguist and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "github-linguist/linguist(\.git)?\b" \
  && ok "origin remote is github-linguist/linguist" \
  || miss "origin remote is not github-linguist/linguist (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(MIT)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"MIT\"" package.json 2>/dev/null) \
  && ok "license is MIT" \
  || miss "license drift — was MIT at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "lib/linguist.rb" \
  && ok "lib/linguist.rb" \
  || miss "missing critical file: lib/linguist.rb"
test -f "lib/linguist/blob.rb" \
  && ok "lib/linguist/blob.rb" \
  || miss "missing critical file: lib/linguist/blob.rb"
test -f "lib/linguist/languages.yml" \
  && ok "lib/linguist/languages.yml" \
  || miss "missing critical file: lib/linguist/languages.yml"
test -f "lib/linguist/language.rb" \
  && ok "lib/linguist/language.rb" \
  || miss "missing critical file: lib/linguist/language.rb"
test -f "lib/linguist/heuristics.yml" \
  && ok "lib/linguist/heuristics.yml" \
  || miss "missing critical file: lib/linguist/heuristics.yml"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 38 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~8d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/github-linguist/linguist"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>
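
Because every check prints a line starting with ok: or FAIL:, the output is easy to consume from a wrapper. The following is a minimal Ruby sketch of how an agent harness might summarise captured output; `summarize_verification` is a hypothetical helper name, not part of RepoPilot or Linguist, and the sample text is invented for illustration.

```ruby
# Hypothetical helper: summarise captured output of the verification script.
# Assumes each check prints one line starting with "ok:" or "FAIL:".
def summarize_verification(output)
  passed = []
  failed = []
  output.each_line do |line|
    line = line.strip
    if line.start_with?("ok:")
      passed << line.sub(/\Aok:\s*/, "")
    elsif line.start_with?("FAIL:")
      failed << line.sub(/\AFAIL:\s*/, "")
    end
  end
  { passed: passed, failed: failed, trusted: failed.empty? }
end

# Invented sample output, shaped like what the script prints.
sample = <<~OUT
  ok:   origin remote is github-linguist/linguist
  ok:   license is MIT
  FAIL: default branch main no longer exists
OUT

summary = summarize_verification(sample)
puts "#{summary[:passed].size} passed, #{summary[:failed].size} failed"
puts "stale artifact, regenerate before editing" unless summary[:trusted]
```

In practice the exit status alone is enough for a simple loop; parsing the lines is only useful when the harness wants to report which specific claim went stale.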

⚡TL;DR

Linguist is GitHub's library for detecting the programming languages in a repository. It classifies files by language, filters out vendored, binary, and generated code, and produces the language breakdown statistics shown on GitHub.com. Detection combines file names and extensions, heuristics, and a statistical (Bayesian) classifier to identify 500+ languages.

The core library lives in lib/linguist/, with lib/linguist.rb as the main entry point. Language definitions live in lib/linguist/languages.yml; grammars.yml and lib/linguist/documentation.yml hold grammar mappings and documentation path patterns respectively. The native extension in ext/linguist/ handles tokenization, while encoding detection is delegated to charlock_holmes. Two CLI binaries (bin/github-linguist, bin/git-linguist) wrap the library for command-line use. The Rakefile drives the test suite, and a Dockerfile plus .devcontainer support reproducible development.

👥Who it's for

GitHub developers and repository maintainers who need accurate language detection for their projects; DevOps engineers integrating language analysis into CI/CD pipelines; open-source projects wanting to understand their codebase composition.

🌱Maturity & risk

Production-ready and actively maintained. This is GitHub's official language detection library used on GitHub.com itself. It has comprehensive CI/CD setup (.github/workflows/ci.yml, publish_docker_image.yml), detailed documentation, and a mature Ruby gem distribution (github-linguist.gemspec). The codebase shows active maintenance with recent updates across native extensions and language definitions.

Low risk overall due to GitHub stewardship, but depends on two heavy native extensions: charlock_holmes (ICU library bindings for encoding) and rugged (libgit2 bindings). Installation can fail on systems without proper C/C++ build tooling and ICU/libcurl/OpenSSL libraries. The native Lex tokenizer (ext/linguist/lex.linguist_yy.c) is generated from tokenizer.l and requires manual regeneration if grammar changes.

Active areas of work

Active maintenance of language definitions, CI infrastructure, and dependency updates (see dependabot.yml). The repo includes issue templates for bug reports and new language requests (.github/ISSUE_TEMPLATE/new_language.md), indicating ongoing community contributions for language support expansion.

🚀Get running

git clone https://github.com/github-linguist/linguist.git
cd linguist
bundle install
bundle exec rake test
bundle exec bin/github-linguist  # Run the CLI from the clone against the repo itself

Alternatively, use GitHub Codespaces: click the 'Open in GitHub Codespaces' badge in the README. Ruby 2.7+ required; system dependencies installed via Brewfile (macOS) or apt (Ubuntu).

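Daily commands: No server to start; this is a library.
In your app: require 'linguist' and build a Linguist::Repository from a Rugged repository and a commit OID, for example Linguist::Repository.new(repo, repo.head.target_id).
CLI analysis: cd /any/git/repo && github-linguist (requires the gem to be installed).
Development testing: bundle exec rake test.
Docker: docker build -t linguist . and then run the image with the target repository mounted as a volume; a bare docker run linguist github-linguist /path/to/repo only sees paths that exist inside the container.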

🗺️Map of the codebase

lib/linguist.rb — Main entry point and top-level API that exposes the core Linguist functionality for language detection.
lib/linguist/blob.rb — Core abstraction representing a file blob; orchestrates language detection across all detection strategies.
lib/linguist/languages.yml — Master language definitions and configuration data that drives all language recognition logic.
lib/linguist/language.rb — Language model class that encapsulates language properties, file associations, and type metadata.
lib/linguist/heuristics.yml — Heuristic rules that disambiguate edge cases in language detection for ambiguous file patterns.
lib/linguist/repository.rb — Repository analysis interface that aggregates per-file language detection into repository-level statistics.
ext/linguist/tokenizer.l — Lexical tokenizer implementation (Lex/Flex) that provides fast statistical language analysis.
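
To make the role of the heuristics file concrete, here is a minimal Ruby sketch of regex-based disambiguation for extensions shared by several languages. The rule table, its patterns, and the `disambiguate` helper are all hypothetical and written for illustration; they are not copied from lib/linguist/heuristics.yml.

```ruby
# Hypothetical rule table in the spirit of a heuristics file: each ambiguous
# extension maps to an ordered list of (language, pattern) rules.
# The patterns are illustrative only.
RULES = {
  ".h" => [
    { language: "Objective-C", pattern: /^\s*@(interface|protocol|end)\b/ },
    { language: "C++",         pattern: /^\s*(template\s*<|class\s+\w+|namespace\s+\w+)/ }
  ],
  ".pl" => [
    { language: "Prolog", pattern: /^[^#]*:-/ },
    { language: "Perl",   pattern: /\buse\s+strict\b/ }
  ]
}.freeze

# Returns the first language whose pattern matches, or nil when no rule
# applies and a later stage (such as a classifier) would have to decide.
def disambiguate(filename, content)
  rules = RULES[File.extname(filename)] || []
  match = rules.find { |rule| content.match?(rule[:pattern]) }
  match && match[:language]
end

puts disambiguate("widget.h", "@interface Widget : NSObject\n@end\n").inspect
puts disambiguate("solve.pl", "parent(X, Y) :- father(X, Y).\n").inspect
puts disambiguate("plain.h", "int add(int a, int b);\n").inspect
```

Ordering matters in a table like this: the first matching rule wins, so more specific patterns belong before broader ones.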

🧩Components & responsibilities

Blob (Ruby, File I/O) — Core abstraction representing a single file; orchestrates detection strategy chain and caches results.
- Failure mode: Missing or incorrect file content raises exceptions; detection returns nil if all strategies fail.
Strategy Layer — Pluggable detection modules (Extension, Filename, Shebang, Modeline, XML) applied in order until

🛠️How to make changes

Add support for a new language

Define the language in lib/linguist/languages.yml with name, type, extensions, aliases, color (lib/linguist/languages.yml)
If needed, add ambiguity resolution rules in lib/linguist/heuristics.yml (lib/linguist/heuristics.yml)
Optionally add sample files in samples/{LanguageName}/ for training the classifier (samples)
Run tests to verify detection works across file extensions and edge cases (Rakefile)
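
As a rough illustration of the first step, the sketch below parses a made-up language entry and runs a few sanity checks on it. The language name, field values, and the `entry_problems` helper are hypothetical; the real test suite enforces its own, stricter rules, so treat this as a pre-flight check under stated assumptions rather than a replacement for running the tests.

```ruby
require "yaml"

# Hypothetical entry for an invented language, shaped like a language
# definition with type, color, extensions, and aliases.
ENTRY_TEXT = <<~'ENTRY'
  Examplelang:
    type: programming
    color: "#123abc"
    extensions:
    - ".exl"
    aliases:
    - exl
ENTRY

VALID_TYPES = %w[data markup programming prose].freeze

# Returns a list of human-readable problems; an empty list means the entry
# passed these deliberately minimal checks.
def entry_problems(name, attrs)
  problems = []
  unless VALID_TYPES.include?(attrs["type"])
    problems << "#{name}: unknown type #{attrs['type'].inspect}"
  end
  extensions = attrs["extensions"] || []
  unless extensions.all? { |ext| ext.start_with?(".") }
    problems << "#{name}: extensions must start with a dot"
  end
  if attrs["color"] && attrs["color"] !~ /\A#\h{6}\z/
    problems << "#{name}: color must be a six-digit hex value"
  end
  problems
end

entries = YAML.safe_load(ENTRY_TEXT)
entries.each do |name, attrs|
  problems = entry_problems(name, attrs)
  puts problems.empty? ? "#{name}: looks well-formed" : problems.join("\n")
end
```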

Add a custom file detection strategy

Create a new strategy in lib/linguist/strategy/{strategy_name}.rb, following the shape of the existing strategies there such as Extension and Filename (lib/linguist/strategy)
Implement call(blob, candidates), returning the languages the strategy considers plausible, or an empty list when it has no opinion (lib/linguist/strategy)
Register the strategy in the STRATEGIES list that Linguist.detect walks in order (lib/linguist.rb)
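
The chain idea can be sketched in plain Ruby. This is a simplified model, assuming strategies are callables that take a blob plus the current candidate list and return an array of language names; `Blob`, the lookup tables, and `detect_language` are hypothetical stand-ins, not Linguist's own classes.

```ruby
# Hypothetical stand-ins for detection strategies. Each responds to
# call(blob, candidates) and returns an array of language names: empty for
# "no opinion", one entry for a decision, several to narrow the field.
Blob = Struct.new(:name, :data)

FILENAMES  = { "Makefile" => ["Makefile"] }.freeze
EXTENSIONS = { ".rb" => ["Ruby"], ".m" => ["MATLAB", "Objective-C"] }.freeze

by_filename  = ->(blob, _candidates) { FILENAMES.fetch(File.basename(blob.name), []) }
by_extension = ->(blob, _candidates) { EXTENSIONS.fetch(File.extname(blob.name), []) }
by_content   = lambda do |blob, candidates|
  return [] unless candidates.include?("Objective-C")
  blob.data.include?("@interface") ? ["Objective-C"] : []
end

chain = [by_filename, by_extension, by_content]

# Walk the chain in order and stop as soon as exactly one candidate remains.
# Returns nil when the chain ends without a single winner.
def detect_language(blob, chain)
  candidates = []
  chain.each do |strategy|
    result = strategy.call(blob, candidates)
    return result.first if result.size == 1
    candidates = result if result.size > 1
  end
  nil
end

puts detect_language(Blob.new("lib/app.rb", ""), chain).inspect
puts detect_language(Blob.new("view.m", "@interface View\n@end\n"), chain).inspect
puts detect_language(Blob.new("calc.m", "x = 1;\n"), chain).inspect
```

Passing the candidate list forward is what lets cheap strategies narrow the field so that more expensive ones only have to choose among a few languages.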

Handle a file as vendor, generated, or documentation

Add path patterns to lib/linguist/vendor.yml to ignore vendored code (lib/linguist/vendor.yml)
Or add regex patterns to lib/linguist/generated.rb to mark files as generated (lib/linguist/generated.rb)
Or add patterns to lib/linguist/documentation.yml for documentation files (lib/linguist/documentation.yml)
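
Pattern files of this kind boil down to regular expressions matched against repository-relative paths. A minimal Ruby sketch of that idea follows; the pattern lists and the `classify_path` helper are illustrative assumptions and are not copied from vendor.yml or documentation.yml.

```ruby
# Illustrative patterns, matched against repository-relative paths.
VENDOR_PATTERNS = [
  %r{(^|/)node_modules/},
  %r{(^|/)vendor/},
  %r{(^|/)dist/}
].freeze

DOC_PATTERNS = [
  %r{^docs?/},
  %r{(^|/)README(\.|\z)}
].freeze

VENDOR_REGEX = Regexp.union(VENDOR_PATTERNS)
DOC_REGEX    = Regexp.union(DOC_PATTERNS)

# Returns :vendored, :documentation, or :counted for a given path.
# Vendored wins when both sets of patterns match.
def classify_path(path)
  return :vendored if path.match?(VENDOR_REGEX)
  return :documentation if path.match?(DOC_REGEX)
  :counted
end

paths = [
  "node_modules/lodash/index.js",
  "docs/intro.md",
  "README.md",
  "lib/linguist.rb"
]
paths.each { |path| puts "#{path}: #{classify_path(path)}" }
```

Anchoring each pattern with `(^|/)` keeps a directory name from matching in the middle of an unrelated path segment, for example `myvendor/` versus `vendor/`.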

🔧Why these technologies

Ruby — Cross-platform, good for text processing and DSLs; GitHub's main language facilitates integration and contribution.
Lex/Flex (C) for tokenizer — Provides fast, compiled tokenization for statistical language detection; avoids performance bottlenecks in pure Ruby.
Rugged (libgit2 bindings) — Enables efficient repository traversal and git metadata access without shelling out to git CLI.
YAML for configuration — Human-readable format for language definitions, heuristics, vendor patterns, and generated file rules.

⚖️Trade-offs already made

Hybrid detection: rule-based strategies + statistical classifier
- Why: Rules are fast and deterministic for obvious cases; statistics handle ambiguous files robustly.
- Consequence: More code complexity, but better accuracy across diverse codebases; adds latency for ambiguous files requiring tokenization.
Separate native tokenizer (Lex) vs. pure Ruby
- Why: Native code is much faster for large files and repeated tokenization at scale.
- Consequence: Requires C compilation and maintenance; added build complexity but essential for GitHub-scale performance.
Per-file detection vs. whole-repository analysis
- Why: Enables independent file language inference and supports offline/distributed usage; Repository class aggregates for stats.
- Consequence: Repository-level detection is post-hoc aggregation, not co-optimized; may miss cross-file context clues.
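
The post-hoc aggregation described in the last trade-off can be sketched as a simple fold over per-file results. This assumes each file has already been reduced to a language name and a byte size; `language_breakdown` and the sample data are hypothetical.

```ruby
# Per-file results as [language, size_in_bytes] pairs. A nil language means
# the file was skipped (vendored, generated, binary, or undetected).
def language_breakdown(file_results)
  totals = Hash.new(0)
  file_results.each do |language, bytes|
    next if language.nil?
    totals[language] += bytes
  end
  grand_total = totals.values.sum
  return {} if grand_total.zero?
  totals.transform_values { |bytes| (100.0 * bytes / grand_total).round(1) }
end

# Invented sample data for illustration.
results = [["Ruby", 6000], ["C", 3000], [nil, 50_000], ["Ruby", 1000]]
breakdown = language_breakdown(results)
breakdown.each { |language, percent| puts "#{language}: #{percent}%" }
```

With the sample data this prints `Ruby: 70.0%` and `C: 30.0%`; the 50,000 skipped bytes do not affect the percentages, which is why filtering matters so much for the final breakdown.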

🚫Non-goals (don't propose these)

Does not provide syntax highlighting; only language identification.
Does not execute code or perform semantic analysis; purely syntactic/statistical.
Does not provide IDE features like code completion or refactoring.
Does not handle real-time language detection on unsaved editor buffers at scale.
Does not offer language-specific parsing or AST generation; stops at identification.

🪤Traps & gotchas

1) Native extensions require a full C build environment plus ICU/libcurl/OpenSSL: brew install cmake pkg-config icu4c on macOS, or apt-get install build-essential on Ubuntu.
2) Changing tokenizer.l requires manually running flex to regenerate lex.linguist_yy.c; the .c file is committed, so this step is easy to forget.
3) lib/linguist/languages.yml is the single source of truth for language definitions; it must be valid YAML or the gem cannot load its language data.
4) Repository analysis requires a valid git repo object (via Rugged); it won't work on arbitrary file trees.
5) Character encoding detection (charlock_holmes) can be slow on large files and may fail if ICU is misconfigured.

🏗️Architecture

💡Concepts to learn

Tokenization via Flex/Lex — Linguist uses a hand-written Lex grammar (tokenizer.l) compiled to C for fast, accurate token extraction—understanding this explains the ext/linguist/ native extension and why tokens are a key input to the classifier
Bayesian Classification — The classifier.rb uses probabilistic scoring across heuristics (extensions, shebangs, content patterns) to assign language confidence—knowing this helps you understand why some detections are ambiguous or need overrides
Character Encoding Detection (ICU) — charlock_holmes wraps ICU for encoding detection; Linguist must decode files correctly before tokenization, and encoding mismatches are a common source of language detection failures
Git Object Types (Blob, Tree, Commit) — Rugged exposes Git internals; Linguist::Repository iterates Git blobs to analyze a repo—understanding Git's object model helps with debugging large-repo performance or submodule handling
Vendor/Generated/Documentation File Suppression — Linguist filters out node_modules, dist/, build/, and auto-generated files to give accurate language breakdown—this heuristic is defined in file_blob.rb and docs/overrides.md and is critical for GitHub's accuracy
TextMate Grammar Bundles — Linguist can integrate TextMate .plist grammars for syntax highlighting patterns and language detection; understanding .plist parsing helps extend language support
Ruby C Extension API (Gem.StdLib) — The native extensions (linguist.c, linguist.h) are compiled Ruby C modules; understanding Ruby's C API and extconf.rb is necessary to modify tokenization or encoding logic
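
For readers new to Bayesian classification, the following toy naive Bayes classifier shows the general technique on token counts. It is a generic illustration with add-one smoothing and an invented two-sample training set; `ToyClassifier` is hypothetical, and Linguist's own classifier and tokenizer are more involved than this.

```ruby
# Toy naive Bayes classifier over tokens, with add-one smoothing.
# Training data and tokenisation are deliberately simplistic.
class ToyClassifier
  def initialize
    @token_counts = Hash.new { |hash, key| hash[key] = Hash.new(0) }
    @doc_counts = Hash.new(0)
    @vocabulary = {}
  end

  def train(language, text)
    @doc_counts[language] += 1
    tokenize(text).each do |token|
      @token_counts[language][token] += 1
      @vocabulary[token] = true
    end
  end

  # Returns [language, log_score] pairs sorted from most to least likely.
  def classify(text)
    tokens = tokenize(text)
    total_docs = @doc_counts.values.sum.to_f
    scores = @doc_counts.keys.map do |language|
      counts = @token_counts[language]
      denominator = counts.values.sum + @vocabulary.size
      log_prior = Math.log(@doc_counts[language] / total_docs)
      log_likelihood = tokens.sum { |t| Math.log((counts[t] + 1.0) / denominator) }
      [language, log_prior + log_likelihood]
    end
    scores.sort_by { |_, score| -score }
  end

  private

  # Splits text into identifier-like words and single punctuation marks.
  def tokenize(text)
    text.scan(/[A-Za-z_]+|[^\sA-Za-z_0-9]/)
  end
end

classifier = ToyClassifier.new
classifier.train("Ruby", "def greet(name)\n  puts name\nend")
classifier.train("Python", "def greet(name):\n    print(name)")

best, _score = classifier.classify("puts total\nend").first
puts "best guess: #{best}"
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and the smoothing term keeps an unseen token from zeroing out a language entirely.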

🔗Related repos

github/linguist-grammars — Official collection of language grammar definitions and TextMate bundles used by Linguist for syntax highlighting patterns
tree-sitter/tree-sitter — Modern alternative for language parsing that Linguist could integrate; used by GitHub.com for code navigation
github/gitignore — Companion repo of language-specific .gitignore templates; used by Linguist to identify vendored files
libgit2/libgit2 — C library that Rugged (Linguist's Git dependency) wraps; understanding this helps debug Git integration issues
github/super-linter — GitHub linting action that uses Linguist for language detection to apply language-specific lint rules

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive test coverage for lib/linguist/heuristics.yml rules

The heuristics.yml file contains language detection rules that are critical to Linguist's accuracy, but there's no dedicated test suite showing coverage of each heuristic rule. A new test file would verify that heuristics.yml rules work as intended and prevent regressions when rules are modified.

[ ] Create test/linguist/test_heuristics.rb with test cases for each rule in lib/linguist/heuristics.yml
[ ] Add sample files in test/samples/ that match each heuristic rule
[ ] Verify all heuristic rules in lib/linguist/heuristics.yml have corresponding tests
[ ] Ensure tests validate both positive matches and negative cases (files that should NOT match)

Expand lib/linguist/strategy/ with missing detection strategies

The strategy directory currently has extension.rb and filename.rb, but the codebase references additional detection methods (shebang, content-based). Creating dedicated strategy classes would improve code organization and make the detection pipeline more testable and maintainable.

[ ] Create lib/linguist/strategy/shebang.rb extracting logic from lib/linguist/shebang.rb
[ ] Create lib/linguist/strategy/content.rb for content-based detection heuristics
[ ] Refactor lib/linguist/blob.rb to use the new strategy classes consistently
[ ] Add tests in test/linguist/strategy/ for each new strategy class

Add integration tests for GitHub Actions CI workflow scenarios

The .github/workflows/ci.yml exists but there are no documented integration tests validating that Linguist correctly handles real repository scenarios (vendored files, generated files, .gitattributes overrides). This would catch regressions in the most common GitHub use cases.

[ ] Create test/integration/ directory for end-to-end scenarios
[ ] Add test repositories with .gitattributes override files and vendored/generated file patterns
[ ] Create test/integration/test_github_attributes.rb validating .gitattributes override behavior
[ ] Create test/integration/test_vendored_generated.rb validating correct handling of vendored and generated files from docs/overrides.md

🌿Good first issues

1) Expand language detection heuristics in lib/linguist/classifier.rb for ambiguous extensions (.sh vs shell script vs bash, .mk vs Makefile vs other build files), and add test cases to verify accuracy.
2) Add missing shebang interpreters to lib/linguist/languages.yml for languages with weak extension matching (e.g., #!/usr/bin/env python2 vs python3 detection).
3) Write integration tests for each language in lib/linguist/languages.yml that verify correct detection on real code samples in the samples/ directory.

⭐Top contributors

@lildude — 12 commits
@DecimalTurn — 7 commits
@Alhadis — 6 commits
@nickswalker — 3 commits
@spenserblack — 3 commits

📝Recent commits

e535c9a — Add SpiceDB Schema language support (#7936) (ivanauth)
917e840 — Mark mise lockfiles as generated TOML (#7923) (risu729)
537297c — Release v9.5.0 (#7858) (lildude)
cb756ae — Add .gitattributes override mention when returning the strategy (#7600) (DecimalTurn)
e5e38c0 — Add XVBA dependencies as vendored (#7532) (DecimalTurn)
240bf92 — Adding txtpb extension to Protocol Buffer Text Format (#6566) (milesflo)
1549797 — Support JetBrains colour scheme (.icls) files (#7851) (hearsilent)
ba889d0 — Add language support for Liquidsoap (#6565) (toots)
83f2cc8 — Add FlatBuffers language support (#7837) (aidalgol)
d403bc7 — Add typescriptreact alias (#7762) (dennyac)

🔒Security observations

The security concerns are moderate to high and sit mainly in the Docker infrastructure. The use of end-of-life Ruby 2.x and Alpine 3.13 is the most serious issue, because neither receives security patches any longer. The container also runs as root without privilege dropping, which increases risk. The main application code structure appears reasonable, with separation of concerns and proper module organization, and no obvious hardcoded secrets or injection vulnerabilities were detected in the visible code structure. Recommended actions: update the Ruby and Alpine versions, implement user privilege dropping, and establish a dependency update strategy.

High · Outdated Alpine Linux Base Image — Dockerfile (line 1). The Dockerfile uses 'ruby:2-alpine3.13', which is based on Alpine 3.13, released in January 2021. That release has reached end-of-life and no longer receives security updates, so known vulnerabilities remain unpatched. Fix: Update to a currently supported Alpine release, such as 'ruby:3-alpine3.19' or later, and move to a maintained Ruby release at the same time (Ruby 2.7, the last 2.x series, reached end-of-life in March 2023).
High · Outdated Ruby Version — Dockerfile (line 1). The Dockerfile specifies 'ruby:2-*', which refers to the Ruby 2.x series. The last 2.x series, Ruby 2.7, reached end-of-life in March 2023 and no longer receives security patches, so any vulnerabilities found in Ruby 2.x will not be addressed. Fix: Upgrade to a maintained Ruby 3.x release that still receives security updates. Test compatibility with the codebase before upgrading.
Medium · Incomplete apk Cache Cleanup — Dockerfile (lines 4-6). The Dockerfile removes build dependencies and cache with 'apk del build_deps && rm /var/cache/apk/', but 'rm' without '-r' cannot remove a directory, so the package index cache may be left behind. This leaves unnecessary files that increase image size. Fix: Consolidate RUN commands and ensure complete cleanup: install with 'apk add --no-cache' so no index cache is written, use 'apk del --purge build_deps', and if cleanup is still needed use 'rm -rf /var/cache/apk/*'.
Medium · Missing HEALTHCHECK Directive — Dockerfile. The Dockerfile lacks a HEALTHCHECK instruction, which could allow the container to run in a failed state undetected. This is important for production deployments. Fix: Add a HEALTHCHECK directive to verify the linguist service is functioning properly, e.g., 'HEALTHCHECK CMD github-linguist --version || exit 1'.
Medium · No User Privilege Dropping — Dockerfile. The Dockerfile does not create or specify a non-root user. The container runs as root by default, which increases the impact of any container escape or RCE vulnerability. Fix: Create a non-root user and switch to it before the CMD instruction. Example: 'RUN addgroup -g 1000 linguist && adduser -D -u 1000 -G linguist linguist' and 'USER linguist'.
Low · Missing Explicit Version Pinning for gem install — Dockerfile (line 5). The Dockerfile uses 'gem install github-linguist' without specifying a version, which installs the latest version at build time. This can lead to non-reproducible builds and unpredictable updates. Fix: Pin to a specific version with 'gem install github-linguist -v VERSION', or use a Gemfile with bundler for reproducible dependency management.
Low · Missing Container Labels and Metadata — Dockerfile. The Dockerfile lacks LABEL directives for version, maintainer, and source information, making it harder to track the container's provenance and purpose. Fix: Add LABEL directives for metadata: version, description, maintainer, source, and any security-relevant information.

LLM-derived; treat as a starting point, not a security audit.

👉Where to read next

Open issues — current backlog
Recent PRs — what's actively shipping
Source on GitHub

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
