cortexlabs/cortex
Production infrastructure for machine learning at scale
Stale — last commit 2y ago
Weakest axis: last commit was 2y ago; no CI workflows detected
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
- ✓ 7 active contributors
- ✓ Apache-2.0 licensed
- ✓ Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Concentrated ownership — top contributor handles 56% of recent commits
- ⚠ No CI workflows detected
What would change the summary?
- Use as dependency: Mixed → Healthy if ≥ 1 commit in the last 365 days
- Deploy as-is: Mixed → Healthy if ≥ 1 commit in the last 180 days
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Forkable" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/cortexlabs/cortex)
Paste at the top of your README.md — renders inline like a shields.io badge.
Preview: social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/cortexlabs/cortex on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: cortexlabs/cortex
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/cortexlabs/cortex shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Stale — last commit 2y ago
- 7 active contributors
- Apache-2.0 licensed
- Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Concentrated ownership — top contributor handles 56% of recent commits
- ⚠ No CI workflows detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live cortexlabs/cortex
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/cortexlabs/cortex.
What it runs against: a local clone of cortexlabs/cortex — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in cortexlabs/cortex | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 724 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of cortexlabs/cortex. If you don't
# have one yet, run these first:
#
# git clone https://github.com/cortexlabs/cortex.git
# cd cortex
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of cortexlabs/cortex and re-run."
exit 2
fi
# 1. Repo identity
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "cortexlabs/cortex(\.git)?\b" \
  && ok "origin remote is cortexlabs/cortex" \
  || miss "origin remote is not cortexlabs/cortex (artifact may be from a fork)"
# 2. License matches what RepoPilot saw (the Apache LICENSE file begins
#    "Apache License / Version 2.0", not "Apache-2.0")
(grep -qi "Apache License" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
for f in \
  cmd/operator/main.go \
  cli/main.go \
  cli/cmd/lib_manager.go \
  cli/cluster/deploy.go \
  cli/types/cliconfig/cli_config.go; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 724 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~694d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/cortexlabs/cortex"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Cortex is a production ML infrastructure platform that deploys, manages, and autoscales machine learning models on AWS EKS (Kubernetes). It abstracts away cluster management complexity by providing declarative configuration for realtime APIs, async workloads, and batch jobs, with built-in autoscaling, spot instance support, and metrics/logging integrations. Monorepo structured as: cli/ contains the command-line interface (split into cmd/ for command handlers and cluster/ for cluster operations), build/ houses Docker image and AMI generation scripts, and .circleci/ configures CI. The Go module root is at repo root with no separate packages/ subdirectory.
👥Who it's for
ML engineers and platform teams who need to deploy trained models to production without becoming Kubernetes experts—they write YAML configs and the Cortex CLI handles provisioning, scaling, and lifecycle management on their AWS account.
🌱Maturity & risk
This project is no longer actively maintained by its original authors (as stated in README). It has substantial Go and Python codebases (3.1 MB Go, 184 KB Python), organized CI/CD via CircleCI, and comprehensive CLI tooling, suggesting it reached production maturity—but is now in maintenance/archived mode.
High risk for new adopters: the project is explicitly unmaintained, dependency graph is large (EKS, Istio, Kubernetes client-go, AWS SDK), and pinned to Go 1.17 (released August 2021, well past EOL). No active security patches or bug fixes incoming; suited only for teams inheriting existing deployments.
Active areas of work
Nothing — the README explicitly notes the project is no longer actively maintained. The codebase is stable but receives no new features, bug fixes, or dependency updates; the last activity predates Go 1.17 reaching end of life.
🚀Get running
Clone via git clone https://github.com/cortexlabs/cortex.git && cd cortex. Install Go 1.17 (or compatible), then run make build (visible in Makefile) to compile the CLI. AWS credentials and EKS cluster access are required for actual deployments.
Daily commands:
make build compiles the CLI binary. For cluster deployment, use cortex cluster create (via cli/cmd/cluster.go handlers) after configuring AWS credentials. Development uses CircleCI (.circleci/config.yml) for testing and image builds.
🗺️Map of the codebase
- cmd/operator/main.go — Entry point for the operator service that manages ML workload orchestration and cluster control; foundational to understanding the system architecture.
- cli/main.go — CLI entry point; essential for understanding how users interact with Cortex and how commands are routed to cluster management logic.
- cli/cmd/lib_manager.go — Core library handling communication with the operator; critical dependency for all CLI commands that manage deployments.
- cli/cluster/deploy.go — Implements the deployment workflow for models; illustrates the primary user-facing operation and how configs flow into the system.
- cli/types/cliconfig/cli_config.go — Configuration parsing and management; defines how environments and credentials are stored and used throughout the CLI.
- go.mod — Lists all external dependencies (AWS SDK, Docker, Kubernetes, gRPC utilities); essential for understanding external integrations and tool choices.
- Makefile — Build and release orchestration; shows how the system is compiled, tested, and deployed to production.
🛠️How to make changes
Add a new CLI command
- Create a new command file in cli/cmd/ (e.g., cli/cmd/mynewcmd.go) with a func init() that registers the command to the root command (cli/cmd/root.go)
- Implement the command function using the manager HTTP client to communicate with the operator (cli/cmd/lib_manager.go)
- Add any required CLI flags in the command file and parse them in the Run function (cli/cmd/const.go)
- Update the CLI help/docs and test the command via build/cli.sh (build/cli.sh)
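The registration step above can be sketched in plain Go. This is a minimal stdlib sketch of the init()-registration pattern, not the repo's actual code — the real CLI wires commands into a root command in cli/cmd/root.go via a command framework, and every name below (register, dispatch, "mynewcmd") is illustrative:

```go
package main

import "fmt"

// commands maps a command name to its handler, mimicking how each file in
// cli/cmd/ registers itself with the root command from func init().
var commands = map[string]func(args []string) error{}

// register is what a hypothetical cli/cmd/mynewcmd.go would call in init().
func register(name string, run func(args []string) error) {
	commands[name] = run
}

func init() {
	// Registration happens at package load time, before main runs.
	register("mynewcmd", func(args []string) error {
		fmt.Printf("mynewcmd called with %d args\n", len(args))
		return nil
	})
}

// dispatch routes a command name to its registered handler.
func dispatch(name string, args []string) error {
	run, ok := commands[name]
	if !ok {
		return fmt.Errorf("unknown command: %s", name)
	}
	return run(args)
}

func main() {
	// A real CLI would route os.Args[1:] here.
	if err := dispatch("mynewcmd", []string{"--env", "prod"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because init() runs before main, a new command file only needs to exist in the package to become routable — no central list to edit.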
Add a new backend operator service
- Create a new main.go file in cmd/mynewservice/ that imports the necessary packages and defines main() (cmd/operator/main.go)
- Implement gRPC or HTTP service handlers using the gorilla/mux patterns from existing services (cmd/proxy/main.go)
- Add a build target to the Makefile for compiling the binary and Docker image (Makefile)
- Extend build/build-images.sh to containerize the new service (build/build-images.sh)
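The handler step can be sketched with the standard library. The existing services use gorilla/mux, but net/http's ServeMux shows the same shape; the route, payload, and port below are hypothetical, and a real cmd/mynewservice/main.go would also wire config, logging, and graceful shutdown:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// healthBody builds the JSON payload for a minimal health endpoint; split
// out from the handler so it can be exercised without a running server.
func healthBody() string {
	b, _ := json.Marshal(map[string]string{"status": "ok"})
	return string(b)
}

// healthHandler is the kind of handler a new service would register.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	fmt.Fprintln(w, healthBody())
}

// newRouter assembles the service's routes in one place.
func newRouter() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthHandler)
	return mux
}

func main() {
	_ = newRouter()
	// A real service would block here:
	// log.Fatal(http.ListenAndServe(":8888", newRouter()))
	fmt.Println(healthBody())
}
```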
Add a new cluster configuration option
- Define the config struct field in cli/cmd/lib_cluster_config.go with YAML tags (cli/cmd/lib_cluster_config.go)
- Add validation logic for the new field in the same file or in cli/cluster/errors.go (cli/cluster/errors.go)
- Update the CLI config loading to parse the field from the environment or cortex.yaml files (cli/types/cliconfig/cli_config.go)
- Pass the new config to the operator via lib_manager.go HTTP requests (cli/cmd/lib_manager.go)
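The first two steps can be sketched as a YAML-tagged struct plus a validation method. This mirrors the pattern in cli/cmd/lib_cluster_config.go but is not the real struct — MaxIdleNodes is a hypothetical new field and the real config has many more options:

```go
package main

import "fmt"

// ClusterConfig sketches the YAML-tagged config struct pattern; a YAML
// decoder maps keys like max_idle_nodes onto these fields via the tags.
type ClusterConfig struct {
	ClusterName  string `yaml:"cluster_name"`
	Region       string `yaml:"region"`
	MaxIdleNodes int    `yaml:"max_idle_nodes"` // the new option being added
}

// Validate mirrors the per-field checks the real loader performs before the
// config is sent to the operator.
func (c *ClusterConfig) Validate() error {
	if c.ClusterName == "" {
		return fmt.Errorf("cluster_name is required")
	}
	if c.MaxIdleNodes < 0 {
		return fmt.Errorf("max_idle_nodes must be >= 0, got %d", c.MaxIdleNodes)
	}
	return nil
}

func main() {
	cfg := ClusterConfig{ClusterName: "demo", Region: "us-east-1", MaxIdleNodes: 2}
	fmt.Println("valid:", cfg.Validate() == nil)
}
```

Keeping validation as a method on the struct means every code path that loads a config (env, cortex.yaml, operator request) reuses the same checks.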
Add monitoring/metrics collection
- Use the DataDog client (imported in go.mod) to emit metrics from operator services (cmd/operator/main.go)
- Add Prometheus endpoints or StatsD gauge increments in service main files (cmd/autoscaler/main.go)
- Configure metric tags and endpoint details in dev/prometheus.md or environment variables (dev/prometheus.md)
🔧Why these technologies
- Go + gRPC/HTTP — High-performance, concurrent service-to-service communication; efficient for orchestration and low-latency operator commands
- Kubernetes — Industry-standard container orchestration; built-in workload scheduling, networking, autoscaling, and declarative config management
- AWS SDK (boto3/aws-go) — Enables multi-region cluster provisioning, elastic compute, and managed services integration for spot instances and on-demand backups
- Docker — Containerization of ML models and system components; enables reproducible, portable deployments across environments
- DataDog + Prometheus — Observability and metrics collection; allows integration with any monitoring stack and provides pre-built dashboards
- YAML configuration — Human-readable, declarative workload definitions; familiar to DevOps engineers and integrates with GitOps workflows
⚖️Trade-offs already made
- Operator-based architecture instead of serverless platforms
  - Why: Need for full control over scheduling, autoscaling policies, and multi-workload orchestration; avoiding vendor lock-in
  - Consequence: Higher operational overhead; requires Kubernetes expertise and cluster management; more flexibility but more complexity
- Async/batch workloads via a queue-based system (enqueuer/dequeuer) rather than event streaming
  - Why: Simpler fault tolerance, guaranteed delivery, and easier backpressure handling for batch ML jobs
  - Consequence: Lower throughput for extremely high-frequency events; requires queue infrastructure management
- CLI-first user experience with HTTP/gRPC backend
  - Why: Integrates with CI/CD pipelines and infrastructure-as-code tools; familiar workflow for DevOps users
  - Consequence: No web UI; requires users to learn the CLI; less interactive for exploratory deployments
- Spot instance support with on-demand failover
  - Why: Cost optimization for ML workloads; leverages AWS pricing discounts without risk
  - Consequence: Increased complexity in scheduling and failover logic; workload interruptions are managed transparently but remain unpredictable
🚫Non-goals (don't propose these)
- Does not provide built-in model training; only deployment and serving of pre-trained models
- Does not handle authentication/authorization; assumes integration with cloud provider IAM or external auth systems
- Not a data lake or feature store; does not manage raw training data or feature pipelines
- Does not support serverless platforms other than Kubernetes (e.g., AWS Lambda, Google Cloud Run); Kubernetes-only
- Does not provide web UI; CLI and programmatic API only
🪤Traps & gotchas
- Go 1.17 is pinned in go.mod (EOL January 2023); attempting to use newer toolchains may cause module resolution issues.
- AWS credentials must be available (env vars or ~/.aws/config) before any cortex cluster command.
- Kubernetes client-go is pinned to v0.22.11 (old); EKS API changes in newer clusters could cause compatibility issues.
- No Terraform provider code is visible despite the README mention — it may be in a separate repo or removed.
- CircleCI secrets (AWS keys, Docker registry) are required to publish images; local builds will fail at the push stages.
🏗️Architecture
💡Concepts to learn
- Kubernetes Custom Resource Definitions (CRDs) — Cortex extends EKS by defining custom API types (realtime APIs, async APIs, batch jobs) as CRDs; understanding CRDs is essential to grasp how Cortex models are deployed as K8s objects
- EKS Autoscaling (Cluster Autoscaler, Karpenter) — Cortex's core value proposition is elastic cluster autoscaling; it adjusts EKS node pools (CPU/GPU instances, spot vs on-demand) based on workload demand—this is what differentiates it from manual EKS management
- Istio Virtual Services and Traffic Management — Cortex uses Istio (istio.io/client-go in dependencies) to route traffic, implement canary deployments, and manage traffic splitting between API versions—critical for rolling updates without downtime
- IAM AssumeRole and IRSA (IAM Roles for Service Accounts) — Cortex integrates with AWS IAM for fine-grained permissions; understanding IRSA allows deployed models to access S3, RDS, etc. without embedding AWS keys — visible in lib_aws_creds.go and sigs.k8s.io/aws-iam-authenticator
- Declarative Infrastructure as Code (GitOps pattern) — Cortex follows IaC principles: users define clusters and APIs in YAML config files, versioned in Git; the CLI reconciles desired state with actual cluster state, enabling reproducible, auditable deployments
- Spot Instances and On-Demand Fallback — Cortex can run workloads on AWS Spot instances (cheaper but interruptible) with automatic failover to on-demand; this cost-optimization pattern is core to the platform's value for cost-conscious teams
- Blue-Green and Canary Deployments — Cortex enables traffic splitting between API versions (via lib_traffic_splitters.go and Istio), allowing gradual rollouts and instant rollbacks — essential for production ML where bad models cause revenue loss
🔗Related repos
- seldon/seldon-core — Kubernetes-native ML model serving platform; alternative to Cortex for production inference, also uses Istio for traffic management
- bentoml/bentoml — Python-first ML model packaging and serving framework; lighter-weight alternative focused on reproducibility and rapid iteration
- ray-project/ray — Distributed computing framework for ML workloads; complements Cortex for batch and async processing at scale
- aws/amazon-sagemaker-examples — AWS-native ML ops examples; relevant for teams migrating from Cortex to SageMaker or evaluating SageMaker vs self-managed EKS
- kubernetes/kubernetes — Cortex runs on EKS (managed Kubernetes); understanding K8s primitives (Deployments, Services, Custom Resources) is essential
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add integration tests for CLI commands in cli/cmd with Golang testing framework
The cli/cmd directory contains critical command implementations (deploy.go, delete.go, get.go, logs.go, etc.) but there are no visible test files (*_test.go). These commands interact with cluster management and API operations, making them high-risk areas for regressions. Adding integration tests would catch breaking changes early and document expected CLI behavior.
- [ ] Create cli/cmd/deploy_test.go with tests for flag parsing and cluster deployment flows
- [ ] Create cli/cmd/delete_test.go testing deletion confirmation and error handling
- [ ] Create cli/cmd/get_test.go and cli/cmd/logs_test.go for API retrieval logic
- [ ] Use testify assertions (already in go.mod) for consistent test patterns
- [ ] Update build/test.sh to run Go tests in ./cli/... directory
- [ ] Reference existing ginkgo/gomega setup in go.mod for BDD-style tests if preferred
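The table-driven shape the checklist asks for can be sketched against a small, isolated piece of logic. The --env flag and its default below are hypothetical stand-ins for the real deploy command's flags; the real tests would use testify assertions inside a func TestDeployFlags(t *testing.T):

```go
package main

import (
	"flag"
	"fmt"
)

// parseEnvFlag isolates the kind of flag parsing a cli/cmd/deploy_test.go
// would exercise, so it can be tested without touching a cluster.
func parseEnvFlag(args []string) (string, error) {
	fs := flag.NewFlagSet("deploy", flag.ContinueOnError)
	env := fs.String("env", "default", "environment name")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *env, nil
}

func main() {
	// Table-driven cases: each row is one scenario, so adding coverage
	// means adding a row, not a new test function.
	cases := []struct {
		args []string
		want string
	}{
		{nil, "default"},
		{[]string{"--env", "prod"}, "prod"},
	}
	for _, c := range cases {
		got, err := parseEnvFlag(c.args)
		fmt.Printf("args=%v got=%q err=%v want=%q\n", c.args, got, err, c.want)
	}
}
```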
Add GitHub Actions workflow for automated Go linting and security scanning on PRs
The repo has build/lint.sh script but no visible GitHub Actions workflow (only .circleci/config.yml for CircleCI). Contributors won't get immediate feedback on code quality issues before PR review. A GitHub Actions workflow for golangci-lint and gosec would catch issues early and reduce review burden.
- [ ] Create .github/workflows/go-lint.yml with golangci-lint action targeting ./cli and ./cmd directories
- [ ] Add gosec security scanning step for vulnerability detection in Go dependencies
- [ ] Configure to run on pull_request and push to main branch events
- [ ] Set up workflow to fail on high-severity linting violations to enforce standards
- [ ] Add status check requirement in GitHub branch protection rules (documented in CONTRIBUTING.md)
Add comprehensive error handling tests for cli/cmd/errors.go and cli/cluster/errors.go
Two dedicated error files exist (cli/cmd/errors.go and cli/cluster/errors.go) suggesting custom error types, but without visible test coverage. These are critical for user experience—incorrect error messages or missing error cases cause poor CLI UX. Testing error handling paths ensures clarity and consistency.
- [ ] Create cli/cmd/errors_test.go with table-driven tests for each error type and message formatting
- [ ] Create cli/cluster/errors_test.go testing HTTP error mappings and cluster-specific error scenarios
- [ ] Test error message formatting, error wrapping chains (pkg/errors is in dependencies), and user-facing text
- [ ] Use existing stretchr/testify for assertions and ensure error cases in deploy/delete/get commands propagate correctly
- [ ] Document error handling patterns in CONTRIBUTING.md for future contributors
🌿Good first issues
- Add missing unit tests for cli/cmd/lib_traffic_splitters.go — it's in the file list but has no corresponding test file; write tests for traffic splitting config parsing and validation.
- Document the cluster configuration schema in docs/ — the CLI accepts YAML via lib_cluster_config.go but no inline documentation of required/optional fields exists; extract the schema and write reference docs.
- Refactor error handling in cli/cluster/errors.go — consolidate duplicate AWS/Kubernetes error messages into a single error catalog to improve consistency across deploy.go, get.go, and logs.go.
⭐Top contributors
- @deliahu — 56 commits
- @RobertLucian — 28 commits
- Miguel Varela Ramos — 6 commits
- @vishalbollu — 4 commits
- @bluuewhale — 2 commits
📝Recent commits
- dc48c02 — Update website URL (deliahu)
- a1bfb09 — Add maintenance note (deliahu)
- dc5f732 — Update stable version to 0.42.1 (RobertLucian)
- 51c18df — Send al VPC CNI logs to /dev/null by default (#2443) (RobertLucian)
- d647b0d — Upgrade CNI version to 1.11.3 (#2442) (RobertLucian)
- 78de0c6 — Update Cortex versions (eksctl, EKS, AWS IAM, Python, etc) (#2438) (RobertLucian)
- 231cbb6 — fix: export_images.sh (#2424) (bluuewhale)
- 46bc56f — Mention VPC endpoints in docs (deliahu)
- cdeb9df — Comment out test (deliahu)
- 30c9960 — Update logo url (deliahu)
🔒Security observations
- High · Outdated Go Dependencies with Known Vulnerabilities — go.mod. The go.mod file uses Go 1.17, which is outdated (released August 2021). Multiple dependencies have known vulnerabilities and are pinned to older versions, notably docker v20.10.7 (June 2021), prometheus/client_golang v1.11.0, and aws-sdk-go v1.43.29. The project is no longer actively maintained, increasing the risk of unpatched vulnerabilities. Fix: Upgrade Go to 1.21+, audit and update all dependencies to their latest patched versions, run a vulnerability scanner ('go list -json -m all | nancy sleuth' or similar), and consider automated dependency updates via Dependabot.
- High · Insecure Docker Image Dependencies — go.mod (docker/docker dependency), build/ directory. The codebase uses Docker v20.10.7+incompatible, which has multiple known vulnerabilities (CVE-2021-30465, CVE-2021-41091, etc.). The build scripts reference Docker operations but version pinning is not enforced, risking execution of vulnerable container runtimes during build and deployment. Fix: Upgrade to Docker 20.10.16+ or 23.0.0+, use exact image digests in Dockerfiles instead of tags, and add image scanning to CI/CD with tools like Trivy, Clair, or ECR image scanning.
- High · Kubernetes client-go with Security Issues — go.mod (k8s.io/client-go dependency). k8s.io/client-go v0.22.11 (released September 2021) is deprecated and contains known vulnerabilities — critical for a production infrastructure tool that manages Kubernetes clusters. Fix: Upgrade k8s.io/client-go to v0.28.0 or later, review release notes for breaking changes, and test thoroughly in staging before production deployment.
- Medium · Use of Deprecated and Unmaintained Dependencies — go.mod. Multiple dependencies are outdated or from less-maintained sources: github.com/davecgh/go-spew v1.1.1, github.com/xtgo/uuid (last updated 2014), github.com/patrickmn/go-cache (last meaningful update 2015). These increase technical debt and potential security risk. Fix: Replace the deprecated UUID library with google/uuid (already in use), audit go-cache usage and consider more maintained alternatives or native sync.Map, and run 'go list -m -u all' to surface deprecated modules and available updates.
- Medium · No Apparent Secret Management Implementation — cli/cmd/lib_aws_creds.go, .circleci/config.yml, cli/types/cliconfig/. The codebase handles AWS credentials (lib_aws_creds.go) but shows no evidence of secret management best practices; credentials may be passed via environment variables or config files without encryption, and .circleci/config.yml and the deployment scripts could expose secrets in logs. Fix: Use AWS Secrets Manager or HashiCorp Vault for secret storage, prefer IAM roles over static credentials, add secret scanning to CI/CD (e.g., git-secrets, truffleHog), and ensure CI/CD logs don't expose sensitive data.
- Medium · HTTP Client Usage Without Apparent TLS Verification — cli/cluster/lib_http_client.go. The file name suggests a custom HTTP client implementation. Without reviewing the actual code, standard risks include disabled TLS verification, no certificate pinning, and missing timeout configurations that could be exploited for MITM or DoS attacks. Fix: Review the client to ensure TLS verification is enabled by default, certificates are properly validated, request timeouts are set, and secure cipher suites are configured; never disable TLS verification in production.
- Medium · Sentry Integration Without Configuration Validation — go.mod (getsentry/sentry-go dependency). github.com/getsentry/sentry-go v0.11.0 is included but integration details are unclear; if improperly configured, it could leak sensitive stack traces or error details to external servers.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.