RepoPilot

Kodezi/Chronos

Kodezi Chronos is a debugging-first language model that achieves state-of-the-art results on SWE-bench Lite (80.33%) and 67% real-world fix accuracy, over six times better than GPT-4. Built with Adaptive Graph-Guided Retrieval and Persistent Debug Memory. Model available Q1 2026 via Kodezi OS.

Mixed

Slowing — last commit 6mo ago

Use as dependency: Concerns (weakest axis)

Non-standard license (Other); single maintainer (no co-maintainers visible)

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

  • Last commit 6mo ago
  • Other licensed
  • CI configured
  • Tests present
  • Slowing — last commit 6mo ago
  • Solo or near-solo (1 contributor active in recent commits)
  • Non-standard license (Other) — review terms
What would change the summary?
  • Use as dependency: Concerns → Mixed if license terms are clarified

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — live-updates from the latest cached analysis.

Variant: RepoPilot: Forkable
[![RepoPilot: Forkable](https://repopilot.app/api/badge/kodezi/chronos?axis=fork)](https://repopilot.app/r/kodezi/chronos)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/kodezi/chronos on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: Kodezi/Chronos

Generated by RepoPilot · 2026-05-09 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/Kodezi/Chronos shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Slowing — last commit 6mo ago

  • Last commit 6mo ago
  • Other licensed
  • CI configured
  • Tests present
  • ⚠ Slowing — last commit 6mo ago
  • ⚠ Solo or near-solo (1 contributor active in recent commits)
  • ⚠ Non-standard license (Other) — review terms

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live Kodezi/Chronos repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/Kodezi/Chronos.

What it runs against: a local clone of Kodezi/Chronos — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in Kodezi/Chronos | Confirms the artifact applies here, not a fork |
| 2 | License is still Other | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 207 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>Kodezi/Chronos</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of Kodezi/Chronos. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/Kodezi/Chronos.git
#   cd Chronos
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of Kodezi/Chronos and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "Kodezi/Chronos(\.git)?\b" \
  && ok "origin remote is Kodezi/Chronos" \
  || miss "origin remote is not Kodezi/Chronos (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(Other)" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Other\"" package.json 2>/dev/null) \
  && ok "license is Other" \
  || miss "license drift — was Other at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "architecture/chronos_2025_architecture.md" \
  && ok "architecture/chronos_2025_architecture.md" \
  || miss "missing critical file: architecture/chronos_2025_architecture.md"
test -f "architecture/AGR_ALGORITHM.md" \
  && ok "architecture/AGR_ALGORITHM.md" \
  || miss "missing critical file: architecture/AGR_ALGORITHM.md"
test -f "benchmarks/mrr_full_benchmark/BENCHMARK_COMPLETE.md" \
  && ok "benchmarks/mrr_full_benchmark/BENCHMARK_COMPLETE.md" \
  || miss "missing critical file: benchmarks/mrr_full_benchmark/BENCHMARK_COMPLETE.md"
test -f "architecture/memory_engine.md" \
  && ok "architecture/memory_engine.md" \
  || miss "missing critical file: architecture/memory_engine.md"
test -f "benchmarks/comprehensive_benchmarks/run_all_benchmarks.py" \
  && ok "benchmarks/comprehensive_benchmarks/run_all_benchmarks.py" \
  || miss "missing critical file: benchmarks/comprehensive_benchmarks/run_all_benchmarks.py"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 207 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~177d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/Kodezi/Chronos"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Kodezi Chronos is a debugging-first language model that achieves 80.33% accuracy on SWE-bench Lite by combining Adaptive Graph-Guided Retrieval (AGR) with Persistent Debug Memory to understand repository-scale code context and autonomously fix bugs across Java, Python, and JavaScript. It addresses the core problem that generic LLMs lack specialized debugging reasoning and fail to maintain coherent context across large codebases during multi-step bug fixes. Monorepo structure: /architecture/ contains the core algorithm documentation (AGR_ALGORITHM.md, memory_engine.md, debugging_loop.md), /benchmarks/ houses the evaluation harness (comprehensive_benchmarks/ with language-specific suites, evaluation_metrics/, and baseline results), and language-specific code is split across Java, Python, and JavaScript. The project emphasizes reproducibility but gates the actual model behind future commercial access.

👥Who it's for

Software engineers and DevOps teams who need autonomous code debugging and bug fixing at repository scale; researchers evaluating debugging-first AI systems; enterprises looking to reduce time spent on bug triage and fix verification (the project claims a 40% time reduction vs. manual debugging).

🌱Maturity & risk

Actively developed but not yet production-ready: the model itself is proprietary and gated behind Kodezi OS (Q1 2026 GA timeline), though this repository contains fully functional research code, benchmarks, and evaluation harnesses with comprehensive CI/CD pipelines (GitHub Actions in .github/workflows/). The codebase shows mature practices (CHANGELOG.md, CODE_OF_CONDUCT.md, CONTRIBUTING.md) and substantial test infrastructure in benchmarks/, but the core model weights are unavailable, limiting real-world evaluation to the paper's claims.

High risk for production adoption: the model is proprietary, not available until Q1 2026, and the repository explicitly states 'Model available Q1 2026 via Kodezi OS'—meaning this is primarily a research artifact with a future commercial gate. The codebase relies on reproducing paper results (80.33% SWE-bench, 67% real-world accuracy) without access to model internals, and there's no published benchmark data for common repositories beyond the curated SWE-bench Lite set. Single organizational risk—Kodezi owns both the model and the evaluation framework.

Active areas of work

Active research publication and benchmark refinement: the repository tracks SWE-bench Lite performance (80.33%), human preference studies (89% human preference), and is preparing Q4 2025 beta access and Q1 2026 general availability. Benchmarking is the primary focus—comprehensive_benchmarks/ includes distributed systems, dynamic language, performance regression, and hardware-dependent test suites, suggesting ongoing evaluation across diverse failure modes.

🚀Get running

Clone and explore the benchmarks (no training or inference without model access): git clone https://github.com/Kodezi/Chronos.git && cd Chronos && cat QUICK_START.md, then run make (a Makefile is present) or python benchmarks/evaluate_2025.py to run the evaluation harnesses. Note: .env.example suggests required configuration, but model weights are not included—this repo is documentation and evaluation only.

Daily commands: make (Makefile present), or for specific benchmarks: python benchmarks/comprehensive_benchmarks/run_all_benchmarks.py to execute the comprehensive suite, or python benchmarks/evaluate_2025.py for a single evaluation. A Docker setup is available: docker-compose -f benchmarks/docker-compose.yml up for an isolated benchmarking environment. All commands require Python 3.8+ and the dependencies from requirements.txt/setup.py (not shown in this artifact).

🗺️Map of the codebase

  • architecture/chronos_2025_architecture.md — Defines the complete 2025 architecture including Adaptive Graph-Guided Retrieval (AGR) and Persistent Debug Memory — essential for understanding the core debugging-first design philosophy.
  • architecture/AGR_ALGORITHM.md — Details the Adaptive Graph-Guided Retrieval algorithm that achieves repository-scale code understanding — the foundational innovation behind Chronos's 80.33% SWE-bench performance.
  • benchmarks/mrr_full_benchmark/BENCHMARK_COMPLETE.md — Comprehensive benchmark specification and results across all debugging categories (API misuse, logic errors, performance regressions) — required reading for evaluating and extending the model.
  • architecture/memory_engine.md — Describes the Persistent Debug Memory system that maintains context across debugging iterations — critical for understanding how Chronos sustains multi-turn debugging workflows.
  • benchmarks/comprehensive_benchmarks/run_all_benchmarks.py — Entry point for executing all benchmark suites; demonstrates how to validate model performance and integrates all evaluation metrics.
  • QUICK_START.md — Getting started guide with installation and usage patterns for the Chronos framework — the first file new developers should review.
  • architecture/debugging_loop.md — Illustrates the iterative debugging loop that forms Chronos's core execution model — essential for contributors implementing or extending debugging functionality.

🛠️How to make changes

Add a new debugging benchmark category

  1. Create a new benchmark module in benchmarks/comprehensive_benchmarks/ following the naming pattern *_benchmarks.py (e.g., security_benchmarks.py) (benchmarks/comprehensive_benchmarks/new_category_benchmarks.py)
  2. Implement benchmark functions that yield test cases with task, expected_fix, and category fields matching the structure in benchmarks/debugging-tasks/sample_tasks.json; a sketch of such a module follows this list (benchmarks/comprehensive_benchmarks/new_category_benchmarks.py)
  3. Register the new module in benchmarks/comprehensive_benchmarks/run_all_benchmarks.py by importing and adding to the benchmark execution loop (benchmarks/comprehensive_benchmarks/run_all_benchmarks.py)
  4. Add bug category definitions to benchmarks/debug_categories/bug_categories.json for classification (benchmarks/debug_categories/bug_categories.json)
  5. Update benchmarks/mrr_full_benchmark/BENCHMARK_COMPLETE.md to document the new category and its evaluation methodology (benchmarks/mrr_full_benchmark/BENCHMARK_COMPLETE.md)
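
A minimal sketch of steps 1 and 2, assuming only the test-case fields named above (task, expected_fix, category). The module name, category value, and function name are hypothetical; mirror an existing *_benchmarks.py module for the real interface.

```python
# benchmarks/comprehensive_benchmarks/security_benchmarks.py (hypothetical)
# New benchmark category module following the *_benchmarks.py naming pattern.
# Test cases use the task / expected_fix / category fields described above;
# align the exact structure with benchmarks/debugging-tasks/sample_tasks.json.
from typing import Dict, Iterator

CATEGORY = "security"


def generate_test_cases() -> Iterator[Dict[str, str]]:
    """Yield debugging test cases for the (hypothetical) security category."""
    yield {
        "task": "Fix SQL injection in the user lookup query",
        "expected_fix": "Use a parameterized query instead of string formatting",
        "category": CATEGORY,
    }
    yield {
        "task": "Remove the hard-coded API key from the config loader",
        "expected_fix": "Read the key from an environment variable",
        "category": CATEGORY,
    }
```

Step 3 then amounts to importing this module in run_all_benchmarks.py and adding it to whatever loop drives the other *_benchmarks.py modules.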

Extend the AGR algorithm with a new retrieval strategy

  1. Review the current AGR algorithm specification in architecture/AGR_ALGORITHM.md to understand the graph-guided retrieval structure (architecture/AGR_ALGORITHM.md)
  2. Document the new retrieval strategy in architecture/agr_retrieval.md, including how it integrates with the existing graph-guided approach (architecture/agr_retrieval.md)
  3. Create a new benchmark in benchmarks/comprehensive_benchmarks/retrieval_benchmarks.py to validate the new strategy's effectiveness on different code patterns (benchmarks/comprehensive_benchmarks/retrieval_benchmarks.py)
  4. Update the 2025 architecture document with the new strategy and its performance implications (architecture/chronos_2025_architecture.md)

Add support for a new programming language to the debugging framework

  1. Create language-specific test cases in benchmarks/debugging-tasks/sample_tasks.json with a language identifier and syntax patterns; an example entry follows this list (benchmarks/debugging-tasks/sample_tasks.json)
  2. Extend benchmarks/comprehensive_benchmarks/multi_language_benchmarks.py to include the new language's parser and AST handling (benchmarks/comprehensive_benchmarks/multi_language_benchmarks.py)
  3. Add category mappings in benchmarks/debug_categories/bug_categories.json for language-specific bug patterns (benchmarks/debug_categories/bug_categories.json)
  4. Create evaluation metrics in benchmarks/evaluation_metrics/metrics.py that account for language-specific syntax and semantics (benchmarks/evaluation_metrics/metrics.py)
  5. Document the new language support in benchmarks/mrr_full_benchmark/BENCHMARK_COMPLETE.md (benchmarks/mrr_full_benchmark/BENCHMARK_COMPLETE.md)
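
For step 1, a hedged sketch of appending a new task entry. It assumes sample_tasks.json is a flat JSON array and that entries use the task, expected_fix, and category fields described above; the "language" key and the example values are illustrative guesses, so match the real schema in the file.

```python
# Append a hypothetical task entry for a new language to
# benchmarks/debugging-tasks/sample_tasks.json. Assumes the file is a flat
# JSON array of task objects; the "language" key is an illustrative guess,
# and the category value must match benchmarks/debug_categories/bug_categories.json.
import json

TASKS_PATH = "benchmarks/debugging-tasks/sample_tasks.json"

new_task = {
    "task": "Fix nil map write panic in the request cache",
    "expected_fix": "Initialize the map before the first write",
    "category": "logic",
    "language": "go",
}

with open(TASKS_PATH) as f:
    tasks = json.load(f)

tasks.append(new_task)

with open(TASKS_PATH, "w") as f:
    json.dump(tasks, f, indent=2)
```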

Integrate Chronos into a new evaluation framework

  1. Review the benchmark structure in benchmarks/BENCHMARK_GUIDE.md to understand expected input/output formats (benchmarks/BENCHMARK_GUIDE.md)
  2. Examine existing metrics implementations in benchmarks/evaluation_metrics/comprehensive_metrics.py and adapt for your framework (benchmarks/evaluation_metrics/comprehensive_metrics.py)
  3. Use benchmarks/evaluate_2025.py as the integration entry point, extending it with framework-specific result aggregation logic; a sketch follows this list (benchmarks/evaluate_2025.py)
  4. Reference benchmarks/mrr_full_benchmark/PERFORMANCE_METRICS.md for expected metric definitions and reporting standards (benchmarks/mrr_full_benchmark/PERFORMANCE_METRICS.md)
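
A hedged sketch of step 3, assuming only that benchmarks/evaluate_2025.py can be run as a script. Its flags and output format are not documented in this artifact, so the wrapper just captures stdout and leaves parsing and aggregation to your framework.

```python
# Hypothetical wrapper around benchmarks/evaluate_2025.py for an external
# evaluation framework. The script's CLI and output format are unknown here,
# so this only shells out and hands the captured stdout to your own parsing.
import subprocess


def run_chronos_evaluation() -> str:
    """Run the documented entry point and return its raw stdout."""
    proc = subprocess.run(
        ["python", "benchmarks/evaluate_2025.py"],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout


if __name__ == "__main__":
    raw = run_chronos_evaluation()
    # Framework-specific aggregation goes here: parse `raw` (or whatever
    # results file the script writes) into your own reporting schema.
    print(raw[:500])
```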

🔧Why these technologies

  • Adaptive Graph-Guided Retrieval (AGR) — Enables repository-scale code understanding by building semantic graphs of dependencies and using them to guide context retrieval, achieving 6x better performance than GPT-4 on real-world fixes
  • Persistent Debug Memory — Maintains debugging context across multiple iterations, allowing the model to reason over accumulated evidence and refine hypotheses without losing prior discoveries
  • Multi-language benchmark suite — Validates debugging capabilities across Python, JavaScript, Java,

🪤Traps & gotchas

  • Model weights and inference code are absent; this repo is research documentation plus a benchmarking harness only.
  • .env.example indicates required environment variables (likely API keys or model endpoints), but they are not documented in the visible files.
  • Benchmarks require significant compute resources: the Docker image in benchmarks/Dockerfile suggests GPU or multi-core scaling for realistic evaluation.
  • No explicit Python version constraints are visible; the codebase uses both notebook format (.ipynb in benchmarks/) and .py scripts, which may have compatibility issues.
  • SWE-bench Lite evaluation depends on private or gated datasets not included in the repository; evaluation is reproducible only if you have access to SWE-bench itself.
  • The 'real-world fix accuracy' (67%) is measured against undisclosed proprietary datasets, so external reproducibility is limited.

🏗️Architecture

💡Concepts to learn

  • Adaptive Graph-Guided Retrieval (AGR) — Core innovation that enables Chronos to select only relevant code snippets from massive repositories instead of passing entire codebases to the LLM, reducing context noise and improving fix accuracy from 40% to 80%+ on SWE-bench
  • Persistent Debug Memory — Mechanism that tracks debug state across multiple LLM iterations, allowing Chronos to learn from previous failed fix attempts and refine hypotheses—key to achieving 67% real-world accuracy vs. one-shot baselines
  • SWE-bench Lite — Standardized benchmark for evaluating software engineering agents on real GitHub issues; Chronos's 80.33% is the state-of-the-art result, making understanding the benchmark essential to contextualize claims
  • Mean Reciprocal Rank (MRR) — Evaluation metric used in benchmarks/MRR_BENCHMARK_USAGE.md for measuring retrieval quality; Chronos uses MRR to optimize AGR's ability to rank relevant code snippets at the top of context windows. A worked example of the computation follows this list.
  • Debugging-First Language Model — Paradigm shift from generic code generation to specialized debug reasoning; Chronos is trained with debugging as the primary objective, not a secondary capability, explaining why it outperforms general-purpose LLMs on fix tasks
  • Repository-Scale Code Understanding — The ability to reason about thousands of files and cross-repository dependencies simultaneously; Chronos solves this via AGR + Memory, whereas naive approaches fail due to context window limits and semantic drift across large codebases
  • Multi-Language Debugging Parity — Chronos handles Java, Python, and JavaScript with equivalent accuracy; understanding language-specific debugging patterns in benchmarks/comprehensive_benchmarks/multi_language_benchmarks.py is critical for extending to new languages
  • OpenDevin/OpenDevin — Autonomous AI software engineer with repository-scale code understanding; competes on SWE-bench and shares the multi-step reasoning + codebase context problem that Chronos solves with AGR
  • aider-ai/aider — AI pair programmer for code generation and debugging; alternative approach to repository-scale code modification that Chronos benchmarks itself against in multi-language scenarios
  • gpt4-code-interpreter/gpt4-code-interpreter — Early GPT-4 based debugging framework; Chronos explicitly claims 6x improvement over GPT-4 and uses this as a baseline comparison point
  • SWE-bench/SWE-bench — Official SWE-bench evaluation dataset and harness; Chronos's 80.33% result is measured directly against this benchmark, making it essential context for understanding claim validity
  • Kodezi/Kodezi-IDE — Kodezi's commercial IDE plugin that will integrate the Chronos model (Q1 2026); this repo is the research foundation for that product
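
Since MRR underpins the retrieval benchmarks referenced above, here is a minimal, self-contained illustration of the standard computation (not code from this repository): for each query, take the reciprocal of the rank of the first relevant item (0 if nothing relevant is retrieved), then average over queries.

```python
# Standard Mean Reciprocal Rank over ranked retrieval results.
from typing import List, Sequence, Set


def mean_reciprocal_rank(ranked_results: Sequence[Sequence[str]],
                         relevant: Sequence[Set[str]]) -> float:
    scores: List[float] = []
    for results, gold in zip(ranked_results, relevant):
        score = 0.0
        for rank, item in enumerate(results, start=1):
            if item in gold:
                score = 1.0 / rank
                break
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


# Example: the relevant file is at rank 1 for the first query and rank 3
# for the second, so MRR = (1/1 + 1/3) / 2 ≈ 0.67.
print(mean_reciprocal_rank(
    [["bug.py", "util.py"], ["a.py", "b.py", "fix.py"]],
    [{"bug.py"}, {"fix.py"}],
))
```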

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for benchmarks/evaluation_metrics/ modules

The repo contains multiple evaluation metric modules (metrics.py, mrr_metrics.py, mrr_metrics_2025.py, comprehensive_metrics.py) but there's no visible test suite for these critical components. Since Chronos claims 80.33% SWE-bench performance, the metrics that measure this need rigorous testing. This ensures accuracy of reported results and prevents metric regressions. A hedged sketch of such a test module follows the checklist.

  • [ ] Create tests/benchmarks/test_metrics.py with unit tests for metrics.py functions
  • [ ] Create tests/benchmarks/test_mrr_metrics.py covering mrr_metrics.py and mrr_metrics_2025.py
  • [ ] Create tests/benchmarks/test_comprehensive_metrics.py for comprehensive_metrics.py
  • [ ] Add pytest fixtures using sample data from benchmarks/mrr_full_benchmark/api_misuse/
  • [ ] Integrate test execution into .github/workflows/tests.yml
  • [ ] Ensure tests validate statistical_analysis.py calculations with known datasets
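
A hedged sketch of the second checklist item. The import path and function name are guesses standing in for whatever benchmarks/evaluation_metrics/mrr_metrics.py actually exposes; read that module first and adjust the assertions to its real signatures.

```python
# tests/benchmarks/test_mrr_metrics.py (sketch)
# Assumes an MRR-style function exists in benchmarks/evaluation_metrics/;
# the module path and function name below are hypothetical placeholders.
import pytest

from benchmarks.evaluation_metrics.mrr_metrics import mean_reciprocal_rank  # hypothetical


def test_perfect_ranking_scores_one():
    # Relevant item ranked first for every query -> MRR must be exactly 1.0.
    assert mean_reciprocal_rank([["fix.py"]], [{"fix.py"}]) == pytest.approx(1.0)


def test_relevant_item_at_rank_three():
    # Single query, first relevant hit at rank 3 -> MRR = 1/3.
    ranked = [["a.py", "b.py", "fix.py"]]
    relevant = [{"fix.py"}]
    assert mean_reciprocal_rank(ranked, relevant) == pytest.approx(1 / 3)


def test_no_relevant_results_scores_zero():
    # Nothing relevant retrieved -> the query contributes 0 to the average.
    assert mean_reciprocal_rank([["a.py", "b.py"]], [{"fix.py"}]) == pytest.approx(0.0)
```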

Add integration tests for AGR (Adaptive Graph-Guided Retrieval) algorithm implementation

Architecture documentation exists (architecture/AGR_ALGORITHM.md, architecture/agr_retrieval.md, benchmarks/mrr_full_benchmark/AGR_ARCHITECTURE.md) and benchmarks reference AGR retrieval metrics, but there's no visible test suite validating the AGR implementation. This is core to Chronos' differentiation and needs testing to ensure the retrieval pipeline functions correctly across the documented debug categories.

  • [ ] Create tests/architecture/test_agr_retrieval.py with tests for AGR initialization, ranking, and graph construction
  • [ ] Create tests/architecture/test_agr_integration.py for end-to-end retrieval workflows
  • [ ] Use sample tasks from benchmarks/debugging-tasks/sample_tasks.json and debug categories from benchmarks/debug_categories/bug_categories.json
  • [ ] Add test cases for memory engine integration (referenced in architecture/memory_engine.md)
  • [ ] Add performance regression tests comparing AGR performance against baseline_results/chronos_baseline_sample.json
  • [ ] Update .github/workflows/quality.yml to run AGR tests on each PR

Create CI workflow for multi-language benchmark test coverage

The repo has comprehensive_benchmarks/multi_language_benchmarks.py and benchmarks for distributed_systems, dynamic_language, and hardware_dependent scenarios, but .github/workflows/ only contains quality.yml and tests.yml. There's no dedicated benchmark execution workflow. This prevents automated validation that Chronos handles the diverse language scenarios it claims to support.

  • [ ] Create .github/workflows/benchmarks.yml workflow file
  • [ ] Add job to run benchmarks/comprehensive_benchmarks/run_all_benchmarks.py on PRs targeting architecture/ or benchmarks/
  • [ ] Configure conditional execution: run multi_language_benchmarks.py on code changes, run dynamic_language_benchmarks.py for Python/JS changes
  • [ ] Add result artifact uploads referencing BENCHMARK_METADATA.json structure
  • [ ] Add comparison logic to check new results against benchmarks/mrr_full_benchmark/BENCHMARK_SUMMARY.json baseline
  • [ ] Document expected runtime and resource requirements in benchmarks/BENCHMARK_GUIDE.md

🌿Good first issues

  • Add missing benchmark documentation for benchmarks/comprehensive_benchmarks/hardware_dependent_benchmarks.py—currently no README or docstrings explain how to run hardware-specific debugging tests or interpret results for CPU vs. GPU scenarios.
  • Expand bug_categories.json with examples and validation schema—the taxonomy in benchmarks/debug_categories/bug_categories.json lacks concrete code examples for each category (syntax, type, logic, concurrency) and a JSON schema, making it hard for new contributors to add custom debugging profiles.
  • Create a local evaluation guide for non-benchmark scenarios—BENCHMARK_GUIDE.md documents SWE-bench but lacks instructions for debugging a custom Java/Python/JavaScript repository using the same AGR + Memory pipeline, blocking adoption by developers outside the research community.

Top contributors


📝Recent commits

  • 4b95b9f — Add files via upload (ishraqkhann)
  • 93612b3 — Q4 2025 Updates (ishraqkhann)
  • 47bf0ce — Update README.md (ishraqkhann)
  • ce9d6b3 — Q3 2025 Updates (ishraqkhann)
  • 9754280 — Added new data (ishraqkhann)

🔒Security observations

Failed to generate security analysis.

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
