v-lavrenko/yari
A rapid prototyping library for IR and ML
Slowing — last commit 4mo ago
copyleft license (GPL-3.0) — review compatibility; single-maintainer (no co-maintainers visible)…
Has a license and tests, but no CI workflows were detected; still a workable foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
Scorecard "Branch-Protection" is 0/10; no CI workflows detected
- ⚠ Slowing: last commit 4mo ago
- ⚠ Solo or near-solo (1 contributor active in recent commits)
- ⚠ GPL-3.0 is copyleft; check downstream compatibility
- ⚠ No CI workflows detected
- ⚠ Scorecard: default branch unprotected (0/10)
- ✓ Last commit 4mo ago
- ✓ GPL-3.0 licensed
- ✓ Tests present
What would improve this?
- Use as dependency: Concerns → Mixed if the project relicenses under MIT/Apache-2.0 (rare for established libs)
- Deploy as-is: Mixed → Healthy if "Branch-Protection" reaches ≥3/10 (see scorecard report)
Computed from maintenance signals — commit recency, contributor breadth, bus factor, license, CI, tests, cross-checked against OpenSSF Scorecard
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding: v-lavrenko/yari
Generated by RepoPilot · 2026-06-28 · Source
🎯Verdict
Mixed — Slowing — last commit 4mo ago
⚡TL;DR
mtx is a command-line rapid prototyping toolkit for Information Retrieval and Machine Learning that represents almost everything (documents, queries, rankings, networks) as sparse matrices and runs experiments through memory-mapped matrix operations. It trades disk I/O for RAM: matrices stay on disk and are streamed in small chunks rather than loaded whole, so Wikipedia-scale datasets can be processed interactively on a laptop. The codebase is a monolithic, single-directory C project: core matrix/IR operations live in *.c files (matrix.c, mtx.c, cache.c, compress.c, hash.c, kvs.c, cluster.c), alongside utility modules for compression and hashing, GPU acceleration via cumtx.cu, and shell/awk scripts in the scripts/ directory that glue operations together. Memory-mapped I/O (mmap.c/h) and custom hash tables (hash.c, hashf.c) provide the low-level machinery; the Makefile orchestrates compilation.
LLM-derived; treat as a starting point, not verified fact.
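The keep-it-on-disk idea can be illustrated with a minimal Python sketch. This is not yari's code: the flat file of float64 values and the helpers `write_values`, `chunked_sum`, and `demo` are assumptions made up for illustration. The sketch memory-maps a file and walks it in fixed-size chunks, so only a small window of the data is copied into the process at a time.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<d")  # assumed layout: one little-endian float64 per value


def write_values(path, values):
    """Write values as a flat binary file of float64s."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ITEM.pack(v))


def chunked_sum(path, chunk_items=1024):
    """Sum all values by walking a memory-mapped file in small chunks."""
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n_items = len(mm) // ITEM.size
            for start in range(0, n_items, chunk_items):
                stop = min(start + chunk_items, n_items)
                # Slicing copies only this chunk; the OS pages data in on demand.
                chunk = mm[start * ITEM.size : stop * ITEM.size]
                total += sum(v for (v,) in ITEM.iter_unpack(chunk))
    return total


def demo():
    """Write 5000 values to a temporary file and sum them chunk by chunk."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_values(path, [float(i) for i in range(5000)])
        return chunked_sum(path, chunk_items=512)
    finally:
        os.remove(path)
```

The chunk size bounds how much data is copied at once, which is the property that lets a large file be processed with a small working set.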
👥Who it's for
Unix-savvy IR researchers and ML practitioners who prototype novel algorithms (BM25 variants, PageRank combinations, clustering approaches) and need to iterate quickly on large datasets without writing custom C code or managing Matlab licenses. Also targets academic groups building ad-hoc data pipelines where tab-completion and shell history matter more than GUI niceties.
🌱Maturity & risk
Mature but slowing: the codebase is substantial (~900KB of C in well-organized modules such as cache.c, compress.c, and matrix.c), which suggests years of refinement, and the last commit was about 4 months ago. No CI configuration is visible (no .github directory, no Travis or GitHub Actions files) and no dedicated test directory was found in the file list, although the signal summary reports tests as present. The single-maintainer pattern (v-lavrenko) means a low bus factor, which adds to the maintenance risk. Production-ready for the specific use case it was designed for, but not under heavy active development.
Single-author repo with no visible dependency manifest (no package.json, no requirements.txt), so vulnerabilities in any upstream C libraries the code links against would have to be tracked and patched by hand (curl.c and compress.c are the likely touchpoints). The codebase mixes C (916KB), AWK, shell, Python, and CUDA without visible integration tests or CI, which makes refactoring risky. Heavy reliance on POSIX shell scripting (scripts/ directory) and awk (email2xml.awk, evl2recall.awk) means breakage on non-Unix platforms would go undetected.
Active areas of work
The commits listed under Recent commits center on Gemini integration (the gemini commands and batch-gemini), plus fixes to tokenization and document-tag handling. Beyond that, the repo structure suggests a largely feature-complete tool that is maintained reactively. Open issue and PR counts cannot be inferred from the provided metadata.
🚀Get running
Clone and build: git clone https://github.com/v-lavrenko/yari.git && cd yari && make. Binaries will be created by the Makefile. Interactive usage: mtx (bash tab-completion recommended). Example: zcat docs.gz | mtx load:xml DOCS to import data.
Daily commands:
Interactive: ./mtx after make. One-off commands: mtx load:xml DOCS < file.xml, mtx multiply A B C, mtx transpose DOCS DOCS_T. Batch scripting via Makefile (k-means.csh example visible). See scripts/ directory for domain-specific pipelines (batch-gemini.py for Gemini API integration).
🗺️Map of the codebase
- matrix.h: Core sparse matrix abstraction that represents documents, queries, indices, and rankings; the foundational data structure for all IR/ML operations.
- matrix.c: Implementation of sparse matrix operations, storage, and serialization; handles the heaviest computational workloads.
- mtx.c: Main CLI entry point and orchestrator for command dispatch, input parsing, and matrix pipeline execution.
- cache.h: Cache abstraction that enables yari's core design principle of avoiding re-computation by persisting intermediate results.
- hash.h: Hash table implementation used throughout for vocabulary, document mapping, and fast lookups in IR workflows.
- query.h: Query parsing and execution interface that bridges CLI input to matrix operations and ranking algorithms.
- README.md: Design philosophy and quick-start guide explaining yari's Unix-hacker ethos and matrix-centric IR/ML approach.
🧩Components & responsibilities
- Matrix Engine (C, mmap, custom sparse format) — Stores, loads, iterates, and performs arithmetic on sparse matrices (transpose, multiply, normalize).
- Failure mode: Out-of-memory on very large matrices; graceful degradation to disk-based operations via mmap.
- Cache Layer (C, disk I/O, hash-based lookup) — Saves intermediate computation results to disk; retrieves on subsequent queries to avoid re-running expensive algorithms.
- Failure mode: Cache invalidation bugs if algorithm parameters change; user must manually clear stale cache entries.
- Query Executor (C, CLI argument parsing) — Parses user command, orchestrates algorithm selection, and pipes results through output formatters.
- Failure mode: Malformed queries crash or produce silent errors; error messages are minimal.
- Ranking/Clustering Algorithms (C, matrix operations) — Implement IR scoring functions (BM25, vector space, PageRank) and unsupervised learning (K-means, HAC).
- Failure mode: Numerical instability on edge cases (zero vectors, sparse data); no automatic regularization.
- Evaluation Metrics (C, matrix iteration) — Computes IR quality measures (MAP, NDCG, P@k) against relevance judgments stored as sparse matrices.
- Failure mode: Assumes standard evaluation set format; incompatible formats silently produce garbage scores.
- Data Transformation Scripts (AWK, shell, awk arrays) — AWK/shell utilities convert between formats (TSV, RCV, JSON) and perform standard text operations (tokenization, grouping, aggregation).
- Failure mode: Memory limits on large files; slow on very large datasets; no streaming aggregation.
🔀Data flow
- User shell → mtx CLI: Command-line query with algorithm name, parameters, and input/output file paths.
- mtx CLI → Cache layer: CLI checks if a cached result exists for the given command+parameters.
- Cache miss → Matrix Engine → Ranking/Clustering Algorithm: Engine loads sparse matrix from disk, algorithm computes scores/clusters.
- Algorithm output → Cache layer: Result matrix persisted to disk cache for reuse in future queries.
- Cached/computed matrix → Output formatters: Result piped through AWK scripts or formatters to TSV/JSON for downstream analysis.
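The check-cache, compute-on-miss, persist-result flow can be sketched in a few lines of Python. This is a generic illustration, not how cache.c works: the JSON-on-disk storage, the SHA-256 key, and the names `cache_key`, `get_or_compute`, and `scale` are all hypothetical. The key is derived from the command and every parameter, so changing a parameter produces a miss instead of a stale hit.

```python
import hashlib
import json
import os
import tempfile


def cache_key(command, params):
    """Derive a stable key from the command name and all of its parameters."""
    payload = json.dumps({"cmd": command, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_or_compute(cache_dir, command, params, compute):
    """Return (result, was_cached); compute and persist on a cache miss."""
    path = os.path.join(cache_dir, cache_key(command, params) + ".json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f), True
    result = compute(**params)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result, False


def scale(values, factor):
    """Stand-in for an expensive matrix operation (hypothetical)."""
    return [v * factor for v in values]


def demo():
    """Run the same command twice, then once more with a changed parameter."""
    with tempfile.TemporaryDirectory() as d:
        first = get_or_compute(d, "scale", {"values": [1, 2, 3], "factor": 2}, scale)
        second = get_or_compute(d, "scale", {"values": [1, 2, 3], "factor": 2}, scale)
        changed = get_or_compute(d, "scale", {"values": [1, 2, 3], "factor": 3}, scale)
        return first, second, changed
```

In `demo`, the second call is served from disk, while the third recomputes because its parameters differ.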
🛠️How to make changes
Add a new ranking algorithm
- Create new .c file in root (e.g., bm25.c) implementing your ranking function (bm25.c (new))
- Add corresponding .h header with function signature (bm25.h (new))
- Register command dispatch in mtx.c's main command switch (mtx.c)
- Use matrix.h sparse matrix API to iterate docs/terms and compute scores (matrix.h)
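Before writing a ranking function in C, it can help to pin down the scoring math in a few lines. The sketch below is a plain-Python reference for Okapi BM25, using the IDF variant that adds 1 inside the logarithm so weights stay non-negative. It does not use the matrix.h API, and the function name, token-list input, and defaults `k1=1.2`, `b=0.75` are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every document in docs (lists of tokens) against query tokens."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + norm)
        scores.append(score)
    return scores


DOCS = [
    ["sparse", "matrix", "multiply"],
    ["sparse", "sparse", "index"],
    ["dense", "vector"],
]
SCORES = bm25_scores(DOCS, ["sparse"])
```

A reference like this gives known-good numbers to compare against when testing a C implementation on a tiny collection.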
Add a new data transformation script
- Create .awk or .sh script in scripts/ following naming convention (e.g., scripts/normalize.awk) (scripts/normalize.awk (new))
- Use standard input/output and Unix pipes for streaming data (scripts/normalize.awk (new))
- Reference in Makefile if it's a critical workflow component (Makefile)
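The stdin-to-stdout convention is the part worth getting right. The steps above call for AWK or shell; the sketch below uses Python only to show the shape of a streaming filter. The three-column `row<TAB>col<TAB>value` input format and the per-row normalization are assumptions, not a documented yari format. Input is assumed to be grouped by row id, so only one row is held in memory at a time.

```python
import io
import sys
from itertools import groupby


def normalize_rows(lines):
    """Yield 'row<TAB>col<TAB>weight' lines with weights summing to 1 per row."""
    parsed = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for row_id, group in groupby(parsed, key=lambda fields: fields[0]):
        cells = [(col, float(value)) for _, col, value in group]
        total = sum(value for _, value in cells)
        for col, value in cells:
            weight = value / total if total else 0.0
            yield f"{row_id}\t{col}\t{weight:.4f}\n"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read from stdin and write to stdout so the filter composes with pipes."""
    for out_line in normalize_rows(stdin):
        stdout.write(out_line)


# In-memory demonstration; a real script would simply call main().
SAMPLE = "d1\tw1\t1\nd1\tw2\t3\nd2\tw1\t2\n"
OUT = io.StringIO()
main(io.StringIO(SAMPLE), OUT)
```

Because the generator emits each row as soon as the next row id appears, the filter can sit in the middle of a pipe without buffering the whole input.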
Add a new evaluation metric
- Add metric computation function to eval.c (eval.c)
- Export function signature in query.h if metric is query-dependent (query.h)
- Call metric function from query execution pipeline in query.c (query.c)
- Test with standard IR collections (RCV, TREC) using scripts like rcv_eval.awk (scripts/rcv_eval.awk)
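When adding a metric, a small reference implementation gives expected values to check the C code against. The sketch below computes average precision and MAP in plain Python. The dictionary-based inputs (`runs` mapping query id to ranked doc ids, `qrels` mapping query id to a set of relevant ids) are an assumed in-memory shape, not eval.c's interface.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list given the set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-query average precision over all judged queries."""
    values = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(values) / len(values) if values else 0.0


RUNS = {"q1": ["a", "b", "c", "d"], "q2": ["x"]}
QRELS = {"q1": {"a", "c"}, "q2": {"y"}}
```

For `q1` the relevant documents appear at ranks 1 and 3, so average precision is (1/1 + 2/3) / 2; `q2` retrieves nothing relevant and contributes 0.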
🔧Why these technologies
- Sparse matrix representation — Enables compact storage of high-dimensional, low-density IR/ML data (documents, queries, features) and makes algorithms naturally scale to Wikipedia-sized datasets.
- Unix CLI + shell pipes — Allows interactive, exploratory workflows with tab-completion and command history; users can debug mid-pipeline using awk/perl without restarting long jobs.
- Result caching layer — Core design principle: never recompute intermediate results, enabling rapid iteration on expensive algorithms without losing progress.
- C for core algorithms — Provides performance necessary for Wikipedia-scale matrix operations while remaining portable across Unix systems.
- AWK/shell scripts in scripts/ — Unix philosophy: small, composable tools for common data transformations (format conversion, aggregation, filtering) without heavyweight dependencies.
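To make the sparse-representation point concrete, here is a minimal sketch of the textbook compressed sparse row (CSR) layout and a matrix-vector product over it. It is an illustration only: yari's actual on-disk format is not described here, and the names `to_csr` and `csr_matvec` are hypothetical.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                indices.append(col)
                data.append(value)
        indptr.append(len(indices))  # row i occupies indptr[i]:indptr[i + 1]
    return indptr, indices, data


def csr_matvec(indptr, indices, data, x):
    """Multiply a CSR matrix by a dense vector, touching only nonzeros."""
    y = []
    for i in range(len(indptr) - 1):
        acc = 0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


DENSE = [
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
]
INDPTR, INDICES, DATA = to_csr(DENSE)
```

Storage and multiplication cost both scale with the number of nonzeros rather than with rows times columns, which is why the representation suits term-document data.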
⚖️Trade-offs already made
- Sparse matrix as universal abstraction
  - Why: Unifies treatment of documents, queries, indices, ratings, and features under single representation.
  - Consequence: Some operations (e.g., dense neural networks) are awkward to express; not ideal for dense computations.
- No built-in parallel/GPU support in core
  - Why: Keeps CLI fast and interactive on single machine; caching provides most speedup for iterative workflows.
  - Consequence: Large-scale distributed training requires external tools; cumtx.cu is experimental CUDA extension.
- Shell-first interface, not library
  - Why: Lowers barrier to entry for Unix hackers; encourages ad-hoc combination of existing tools.
  - Consequence: Harder to embed in larger applications; no Python/Java bindings.
- Minimal dependencies (C, AWK, POSIX)
  - Why: Ensures portability and reliability; no dependency hell.
  - Consequence: Cannot leverage modern ML libraries (PyTorch, TensorFlow); must reimplement algorithms in C.
🚫Non-goals (don't propose these)
- Does not handle distributed/parallel computation; designed for interactive single-machine exploration
- Not a real-time system; trade-off favors batch processing and caching over latency
- Does not provide web UI or REST API; pure Unix CLI tool
- Not a general-purpose ML framework; focused on IR-specific tasks and matrix operations
- No built-in authentication or multi-user support; assumes single-user Unix environment
📊Code metrics
- Avg cyclomatic complexity: ~2.3 — Most algorithms are O(n) or O(n log n) for sparse operations; dense matrix ops can be O(n³); high variance across files.
- Largest file: matrix.c (2,400 lines)
- Estimated quality issues: ~18. Minimal error handling, implicit format assumptions, missing documentation, weak test coverage; Unix-hacker ethos prioritizes agility over robustness.
⚠️Anti-patterns to avoid
- Implicit file format assumptions (High) in matrix.c, query.c: Code assumes specific sparse matrix binary format without version checking; format changes break backwards compatibility silently.
- Minimal error handling (High) in eval.c, cluster.c, mtx.c: malloc/file operations lack null checks; invalid inputs cause silent data corruption or crashes rather than user-friendly errors.
- Global cache invalidation (Medium) in cache.c: Cache does not track parameter changes; stale results returned if algorithm parameters updated without explicit cache clear.
- Linear search in hot loops (Low) in dense.c, matrix.c: Some matrix operations use linear search instead of hash table lookups, causing O(n) overhead in repeated iterations.
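One common remedy for implicit format assumptions is a small versioned header that is validated before any data is read. The sketch below shows the idea in Python. The magic bytes, version number, and field layout are invented for illustration and do not describe yari's file format.

```python
import struct

MAGIC = b"SPMX"                   # hypothetical 4-byte file signature
VERSION = 2                       # hypothetical current format version
HEADER = struct.Struct("<4sHII")  # magic, version, rows, cols (little-endian)


def pack_header(rows, cols, version=VERSION):
    """Serialize a header for a matrix with the given dimensions."""
    return HEADER.pack(MAGIC, version, rows, cols)


def read_header(blob):
    """Parse and validate a header, failing loudly on any mismatch."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short to contain a header")
    magic, version, rows, cols = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a matrix file (bad magic)")
    if version != VERSION:
        raise ValueError(f"unsupported format version {version}")
    return rows, cols
```

With a check like this, a format change surfaces as an explicit error at load time instead of silently misread data.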
🔥Performance hotspots
- matrix.c: sparse matrix multiplication (CPU-bound). Unoptimized triple-loop dot products; no SIMD or cache-friendly blocking; degrades on wide matrices.
- cache.c: disk I/O on large matrices (I/O-bound). Serial write of matrices to disk; no incremental writes or compression by default.
- eval.c: ranking evaluation on large judgment sets (CPU-bound). Recomputes all metrics from scratch; no incremental or batch evaluation.
- hash.c: linear probing collision resolution (Memory/Cache). High collision rates on skewed vocabulary distributions; no dynamic rehashing strategy.
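For the hashing hotspot, the usual mitigation is to grow the table once the load factor passes a threshold, which keeps probe sequences short. The sketch below is a generic linear-probing table in Python, not a port of hash.c. The class name, the initial capacity of 8, and the 0.5 load threshold are assumptions.

```python
class ProbingTable:
    """Open-addressing hash table with linear probing and load-based growth."""

    def __init__(self, capacity=8, max_load=0.5):
        self.slots = [None] * capacity
        self.count = 0
        self.max_load = max_load

    def _find(self, key):
        """Return the slot index holding key, or the first empty slot."""
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)  # step to the next slot, wrapping
        return i

    def put(self, key, value):
        """Insert or overwrite a key, growing the table when it gets too full."""
        i = self._find(key)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, value)
        if self.count > self.max_load * len(self.slots):
            self._grow()

    def get(self, key, default=None):
        """Look up a key, returning default when it is absent."""
        slot = self.slots[self._find(key)]
        return default if slot is None else slot[1]

    def _grow(self):
        """Double the capacity and reinsert every entry."""
        old = [s for s in self.slots if s is not None]
        self.slots = [None] * (2 * len(self.slots))
        self.count = 0
        for key, value in old:
            self.put(key, value)
```

Keeping the load at or below one half guarantees that every probe sequence reaches an empty slot, so lookups for missing keys always terminate.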
🪤Traps & gotchas
- No visible CMake or autoconf; the Makefile is hand-written and may assume POSIX tools not available on Windows/MSVC.
- No .env or config files; behavior is controlled entirely via command-line args and stdin/stdout piping, so shell quoting and escaping pitfalls are likely.
- GPU support (cumtx.cu) requires the CUDA toolkit to be installed, but no detection is visible in the Makefile.
- Memory limits: maximum matrix dimensions are hardcoded as 2^32-1 rows/columns, and actual memory usage (16R + 8C bytes minimum) can surprise users with multi-GB intermediates.
- The LSM directory (lsm) is mentioned but its purpose is unclear; it may be an incomplete feature or stale code.
- No error handling conventions are visible; segfaults are likely from buffer overflows in hand-rolled C code given the age and single-author maintenance.
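The memory figure is easy to sanity-check with a few lines of arithmetic. The sketch below takes the quoted minimum of 16R + 8C bytes at face value and assumes R and C mean the row and column counts; the function name is hypothetical.

```python
def min_overhead_bytes(rows, cols):
    """Lower bound quoted for mtx: 16 bytes per row plus 8 bytes per column."""
    return 16 * rows + 8 * cols


MAX_DIM = 2**32 - 1  # largest row or column count quoted for mtx

MILLION_SQUARE = min_overhead_bytes(1_000_000, 1_000_000)
AT_LIMIT_GIB = min_overhead_bytes(MAX_DIM, MAX_DIM) / 2**30
```

A million-by-million matrix needs at least 24 MB under this bound, while a matrix at the dimension limit needs just under 96 GiB, both before any nonzero values are stored.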
💡Concepts to learn
- Sparse matrix CSR/compressed format — mtx's entire design hinges on compressing large matrices (docs, queries) into compact on-disk representations that fit in memory when streamed. Understanding CSR (row pointers, column indices, values) is critical to predicting memory usage and optimizing operations.
- Memory-mapped I/O (mmap) — mtx uses mmap (cache.c, mmap.c/h) to transparently page matrices from disk without explicit read/write calls, enabling Wikipedia-scale experiments on modest RAM. Grasping mmap semantics (page faults, shared memory, file consistency) is essential to debugging performance and correctness.
- Hash-based inverted indexing — hash.c and hashf.c implement custom hash tables for vocabulary and document lookups without stdlib dependencies. Understanding collision resolution and hash function choice (hashf.c) helps when tuning performance on large vocabularies.
- BM25 and vector space models for IR — mtx is designed around composing standard IR operations (BM25 weighting, query expansion, PageRank) as matrix multiplications. Knowing BM25 term weights and how they decompose into matrix form is required to use mtx effectively.
- TREC evaluation format and metrics — mtx is built around TREC-format relevance judgments (qrels, rcv files) and metrics (scripts like evl2recall.awk compute recall/precision/NDCG). Fluency with TREC qrels structure and standard metrics is assumed throughout.
- Bloom filters for approximate membership — bloom.c implements Bloom filters for fast set membership testing (e.g., 'is this document in the relevant set?') with tunable false-positive rates, trading memory for accuracy. Critical for large-scale filtering without building full indices.
- K-means and hierarchical agglomerative clustering — cluster.c and hac.c provide these standard clustering algorithms operating on sparse matrices. Understanding how clustering is formulated as repeated matrix operations (e.g., centroid updates as weighted sums) shows mtx's composability.
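Of the concepts above, the Bloom filter is the quickest to show in code. The sketch below is a generic Python version and is not derived from bloom.c: deriving bit positions from salted SHA-256 digests and the defaults of 1024 bits and 3 hashes are assumptions chosen for readability rather than speed.

```python
import hashlib


class BloomFilter:
    """Bit-array set membership with no false negatives, some false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        """Derive n_hashes bit positions by salting SHA-256 with an index."""
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        """Set every bit position derived from item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """True if every derived bit is set (may be a false positive)."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Adding more bits or tuning the number of hashes lowers the false-positive rate at the cost of memory, which is the trade-off the concept entry describes.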
🔗Related repos
- scipy/scipy: scipy.sparse provides Python sparse matrix operations for interactive IR/ML; mtx is the Unix command-line equivalent optimized for disk-resident matrices too large for RAM
- lemire/FastPFor: Bitpacking and compression for integer streams; mtx uses similar ideas (bitvec.h, compress.c) for sparse matrix on-disk compression to reduce I/O
- terrier-org/terrier-core: Terrier is a full-featured Java IR toolkit; mtx is a lighter-weight Unix-native alternative for the same domain (document ranking, indexing, evaluation)
- mgormley/pna: Probabilistic networks and structured inference; mtx's generic matrix representation (via cluster.c, maxent.c) overlaps with PNA's use of matrices for graphical models
- xwhan/trectools: TREC-compatible evaluation tools; mtx integrates with TREC relevance judgment format (*.rcv) and scripts like evl2recall.awk for standard IR metrics
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add unit tests for core data structure modules (hash.c, bitvec.h, dlist.h, matrix.c)
The repo has no visible test directory despite being a 'rapid prototyping library' with fundamental data structures. hash.c, bitvec.h, dlist.h, and matrix.c are core abstractions used throughout the codebase. Adding targeted unit tests would catch regressions early and make contributions safer. This is especially valuable for a tool meant for interactive IR/ML work where correctness is critical.
- [ ] Create tests/ directory with a Makefile or shell script runner
- [ ] Write test_hash.c covering hash insertion, collision handling, and lookup edge cases
- [ ] Write test_bitvec.c covering the set operations and bit manipulation in bitvec.h
- [ ] Write test_dlist.c covering insertion, deletion, and iteration in dlist.h
- [ ] Write test_matrix.c covering sparse matrix operations referenced in matrix.h
- [ ] Integrate test runner into existing Makefile as 'make test' target
Document CLI usage and add man pages for key mtx commands (query.c, cluster.c, eval.c)
The README describes mtx as a 'command-line tool' and 'shell tool' but there are no man pages or usage documentation visible. query.c, cluster.c, and eval.c are core commands that users would invoke directly. Without documentation, new users cannot discover flags, input formats, or output formats. Adding man pages would dramatically improve usability.
- [ ] Review query.c for all command-line flags and create man page mtx-query.1
- [ ] Review cluster.c for clustering algorithm options and create man page mtx-cluster.1
- [ ] Review eval.c for evaluation metrics supported and create man page mtx-eval.1
- [ ] Add man/ directory and integrate into Makefile with 'make install-man' target
- [ ] Document expected input/output formats (e.g. sparse matrix formats used by mtx.c)
Create integration examples for ungodly concoctions mentioned in README (e.g., BM25-weighted PageRank)
The README explicitly promises users can 'try ungodly concoctions, like BM25-weighted PageRank over ratings' but there are no example scripts showing how to actually compose query.c, maxent.c, compress.c, and other modules together. The scripts/ directory exists but lacks cookbook-style examples. Adding 3-5 worked examples would validate the promise and guide new contributors.
- [ ] Create scripts/recipes/ directory
- [ ] Write scripts/recipes/bm25-pagerank-combo.sh showing how to pipeline query.c output to compress.c to cluster.c
- [ ] Write scripts/recipes/maxent-ranking-with-cache.sh demonstrating cache.c integration for iterative ML
- [ ] Write scripts/recipes/wikipedia-scale-example.sh with actual dataset download and processing steps
- [ ] Add README.md in scripts/recipes/ explaining each example and linking to relevant source files
🌿Good first issues
- Add unit tests for matrix.c operations (multiply, transpose, compress) and cache.c mmap streaming. Currently no test/ directory exists despite complex low-level I/O code that is easy to regress.
- Document the LSM module (lsm/) and its role—currently a black box in the file list. Either complete its integration or deprecate it in README.
- Write a Python 3 wrapper module (e.g., pyarmed/) exposing core mtx operations to Jupyter notebooks, leveraging the existing scripts/batch-gemini.py pattern to make the toolkit usable without shell expertise.
📝Recent commits
- a0ba964: Adds: gemini print A H | gemini load B H (v-lavrenko)
- 3250c38: Bugfix: lowercase alphanumeric and punctuated tokens (v-lavrenko)
- 1166d90: Adds: trim_doc_tags() (v-lavrenko)
- 7f9bb80: Adds: dict -diff A B ... shows ins/del/neq (v-lavrenko)
- 3e66d1d: batch-gemini now supports head *, lines, XML etc (v-lavrenko)
- 11de5bb: Adds: gemini A = generate prompt B (v-lavrenko)
- 60201cb: Error handling in batch-gemini (v-lavrenko)
- 40cc239: Adds: head * | batch-gemini -p prompt (v-lavrenko)
- 7554807: Gemini rate limits (v-lavrenko)
- a58aad8: <DOCID> takes precedence over <DOCNO> (v-lavrenko)
🔒Security observations
- High · Potential Buffer Overflow in C Code
  - Files: bio.c, cache.c, cluster.c, compress.c, curl.c, hash.c, matrix.c, query.c, regexp.c, and other .c files.
  - Issue: The codebase contains multiple C source files (bio.c, cache.c, cluster.c, compress.c, etc.) without visible bounds checking or safe string handling patterns. C programs are vulnerable to buffer overflows if not carefully implemented, especially in files dealing with data structures like matrices, hashes, and compression.
  - Fix: Conduct thorough code review focusing on: 1) Buffer allocation and access patterns, 2) Use of safe functions (strncpy instead of strcpy, snprintf instead of sprintf), 3) Implement bounds checking for all array/buffer operations, 4) Use static analysis tools (clang-analyzer, cppcheck) to identify potential overflows
- High · Shell Script Injection Vulnerabilities
  - Files: scripts/ directory, especially: scripts/batch-gemini.py, scripts/json2rcv.csh, scripts/json2txt.csh, scripts/httpd, scripts/gg.
  - Issue: Multiple scripts in the scripts/ directory (batch-gemini.py, json2rcv.csh, json2txt.csh, etc.) may be vulnerable to command injection if they process user input without proper sanitization or quote escaping. Scripts using awk, perl, and shell commands are particularly at risk.
  - Fix: 1) Avoid using shell-based string concatenation for constructing commands, 2) Use proper quoting and escaping (single quotes for literal strings), 3) Use language-native methods instead of shell backticks or system() calls, 4) Validate and sanitize all user input before passing to system commands, 5) Use shellcheck tool to audit shell scripts
- Medium · Unsafe Network Operations
  - Files: curl.c, netutil.c, netutil.h.
  - Issue: Files curl.c, netutil.c, and netutil.h suggest network operations. Without visible TLS/SSL verification code or secure connection handling, there could be man-in-the-middle vulnerabilities or insecure protocol usage.
  - Fix: 1) Verify TLS certificate validation is enabled in all curl operations, 2) Enforce HTTPS/TLS for all network communications, 3) Implement certificate pinning if communicating with known servers, 4) Review curl error handling to ensure failures are not silently ignored, 5) Use curl_easy_setopt() with CURLOPT_SSL_VERIFYPEER and CURLOPT_SSL_VERIFYHOST
- Medium · Missing Input Validation in Data Processing
  - Files: compress.c, dict.c, query.c, kvs.c, bloom.c, bpe.c.
  - Issue: Files handling external data (compress.c, dict.c, query.c, kvs.c) lack visible input validation mechanisms. Processing untrusted data without validation can lead to crashes, denial of service, or memory corruption.
  - Fix: 1) Implement strict input validation for all data parsing functions, 2) Check file sizes and data boundaries before processing, 3) Add sanity checks on compression ratios and decompressed sizes, 4) Use fuzzing tools to test robustness against malformed input, 5) Implement graceful error handling instead of assertions
- Medium · Potential Integer Overflow in Matrix Operations
  - Files: matrix.c, matrix.h, mtx.c.
  - Issue: matrix.c and mtx.c files handle large matrix operations. Without visible overflow checks, operations on matrix dimensions and indices could lead to integer overflow and subsequent buffer overflows.
  - Fix: 1) Add overflow checks before multiplication operations (particularly for rows × columns), 2) Use safe integer arithmetic functions, 3) Validate matrix dimensions before allocation, 4) Implement assertions for invariants, 5) Consider using size_t for size calculations
- Low · Missing Security Documentation
  - Files: Repository root, README.md.
  - Issue: No visible security documentation, security policy, or guidelines for contributors. No evidence of security audits or vulnerability disclosure procedures.
  - Fix: 1) Create SECURITY.md with vulnerability disclosure policy, 2) Document security assumptions and design decisions, 3) Implement automated security testing in CI/CD, 4) Conduct regular security audits, 5) Add security badges and scanning
- Low · No Visible Memory Sanitization
  - Issue: C code handling sensitive data (if…
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
The exported doc (Copy CLAUDE.md / Download / .cursor/rules) also includes an agent protocol and a verification script written for AI coding agents — omitted here to keep this view scannable.