cortexlabs/cortex
Production infrastructure for machine learning at scale
Stale — last commit 2y ago
Weakest axis: last commit was 2y ago; no CI workflows detected
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
- ✓ 7 active contributors
- ✓ Apache-2.0 licensed
- ✓ Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Concentrated ownership — top contributor handles 56% of recent commits
- ⚠ No CI workflows detected
What would change the summary?
- Use as dependency: Mixed → Healthy if ≥ 1 commit in the last 365 days
- Deploy as-is: Mixed → Healthy if ≥ 1 commit in the last 180 days
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Forkable" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/cortexlabs/cortex)
Paste at the top of your README.md — renders inline like a shields.io badge.
Preview: social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/cortexlabs/cortex on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: cortexlabs/cortex
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/cortexlabs/cortex shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Stale — last commit 2y ago
- 7 active contributors
- Apache-2.0 licensed
- Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Concentrated ownership — top contributor handles 56% of recent commits
- ⚠ No CI workflows detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live cortexlabs/cortex
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/cortexlabs/cortex.
What it runs against: a local clone of cortexlabs/cortex — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in cortexlabs/cortex | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch master exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 724 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of cortexlabs/cortex. If you don't
# have one yet, run these first:
#
# git clone https://github.com/cortexlabs/cortex.git
# cd cortex
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of cortexlabs/cortex and re-run."
exit 2
fi
# 1. Repo identity
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "cortexlabs/cortex(\.git)?\b" \
  && ok "origin remote is cortexlabs/cortex" \
  || miss "origin remote is not cortexlabs/cortex (artifact may be from a fork)"
# 2. License matches what RepoPilot saw (the Apache LICENSE file begins
#    "Apache License / Version 2.0", not "Apache-2.0")
(grep -qi "Apache License" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"
# 4. Critical files exist
for f in \
  cmd/operator/main.go \
  cli/main.go \
  cli/cmd/lib_manager.go \
  cli/cluster/deploy.go \
  cli/types/cliconfig/cli_config.go; do
  test -f "$f" && ok "$f" || miss "missing critical file: $f"
done
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 724 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~694d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/cortexlabs/cortex"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Cortex is a production ML infrastructure platform that deploys, manages, and autoscales machine learning models on AWS EKS (Kubernetes). It abstracts away cluster management complexity by providing declarative configuration for realtime APIs, async workloads, and batch jobs, with built-in autoscaling, spot instance support, and metrics/logging integrations. Monorepo structured as: cli/ contains the command-line interface (split into cmd/ for command handlers and cluster/ for cluster operations), build/ houses Docker image and AMI generation scripts, and .circleci/ configures CI. The Go module root is at repo root with no separate packages/ subdirectory.
👥Who it's for
ML engineers and platform teams who need to deploy trained models to production without becoming Kubernetes experts—they write YAML configs and the Cortex CLI handles provisioning, scaling, and lifecycle management on their AWS account.
🌱Maturity & risk
This project is no longer actively maintained by its original authors (as stated in README). It has substantial Go and Python codebases (3.1 MB Go, 184 KB Python), organized CI/CD via CircleCI, and comprehensive CLI tooling, suggesting it reached production maturity—but is now in maintenance/archived mode.
High risk for new adopters: the project is explicitly unmaintained, dependency graph is large (EKS, Istio, Kubernetes client-go, AWS SDK), and pinned to Go 1.17 (released August 2021, well past EOL). No active security patches or bug fixes incoming; suited only for teams inheriting existing deployments.
Active areas of work
Nothing — the README explicitly notes the project is no longer actively maintained. The codebase is stable but receives no new features, bug fixes, or dependency updates; the last activity predates Go 1.17 reaching end of life.
🚀Get running
Clone via git clone https://github.com/cortexlabs/cortex.git && cd cortex. Install Go 1.17 (or compatible), then run make build (visible in Makefile) to compile the CLI. AWS credentials and EKS cluster access are required for actual deployments.
Daily commands:
make build compiles the CLI binary. For cluster deployment, use cortex cluster create (via cli/cmd/cluster.go handlers) after configuring AWS credentials. Development uses CircleCI (.circleci/config.yml) for testing and image builds.
🗺️Map of the codebase
- cmd/operator/main.go — Entry point for the operator service that manages ML workload orchestration and cluster control; foundational to understanding the system architecture.
- cli/main.go — CLI entry point; essential for understanding how users interact with Cortex and how commands are routed to cluster management logic.
- cli/cmd/lib_manager.go — Core library handling communication with the operator; critical dependency for all CLI commands that manage deployments.
- cli/cluster/deploy.go — Implements the deployment workflow for models; illustrates the primary user-facing operation and how configs flow into the system.
- cli/types/cliconfig/cli_config.go — Configuration parsing and management; defines how environments and credentials are stored and used throughout the CLI.
- go.mod — Lists all external dependencies (AWS SDK, Docker, Kubernetes, gRPC utilities); essential for understanding external integrations and tool choices.
- Makefile — Build and release orchestration; shows how the system is compiled, tested, and deployed to production.
🛠️How to make changes
Add a new CLI command
- Create a new command file in cli/cmd/ (e.g., cli/cmd/mynewcmd.go) with a func init() that registers the command to the root command (cli/cmd/root.go)
- Implement the command function using the manager HTTP client to communicate with the operator (cli/cmd/lib_manager.go)
- Add any required CLI flags in the command file and parse them in the Run function (cli/cmd/const.go)
- Update the CLI help/docs and test the command via build/cli.sh (build/cli.sh)
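The registration step above can be sketched in plain Go. This is a minimal stdlib sketch of the init()-registration pattern, not the repo's actual code — the real CLI wires commands into a root command in cli/cmd/root.go via a command framework, and every name below (register, dispatch, "mynewcmd") is illustrative:

```go
package main

import "fmt"

// commands maps a command name to its handler, mimicking how each file in
// cli/cmd/ registers itself with the root command from func init().
var commands = map[string]func(args []string) error{}

// register is what a hypothetical cli/cmd/mynewcmd.go would call in init().
func register(name string, run func(args []string) error) {
	commands[name] = run
}

func init() {
	// Registration happens at package load time, before main runs.
	register("mynewcmd", func(args []string) error {
		fmt.Printf("mynewcmd called with %d args\n", len(args))
		return nil
	})
}

// dispatch routes a command name to its registered handler.
func dispatch(name string, args []string) error {
	run, ok := commands[name]
	if !ok {
		return fmt.Errorf("unknown command: %s", name)
	}
	return run(args)
}

func main() {
	// A real CLI would route os.Args[1:] here.
	if err := dispatch("mynewcmd", []string{"--env", "prod"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because init() runs before main, a new command file only needs to exist in the package to become routable — no central list to edit.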
Add a new backend operator service
- Create a new main.go file in cmd/mynewservice/ that imports the necessary packages and defines main() (cmd/operator/main.go)
- Implement gRPC or HTTP service handlers using the gorilla/mux patterns from existing services (cmd/proxy/main.go)
- Add a build target to the Makefile for compiling the binary and Docker image (Makefile)
- Extend build/build-images.sh to containerize the new service (build/build-images.sh)
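The handler step can be sketched with the standard library. The existing services use gorilla/mux, but net/http's ServeMux shows the same shape; the route, payload, and port below are hypothetical, and a real cmd/mynewservice/main.go would also wire config, logging, and graceful shutdown:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// healthBody builds the JSON payload for a minimal health endpoint; split
// out from the handler so it can be exercised without a running server.
func healthBody() string {
	b, _ := json.Marshal(map[string]string{"status": "ok"})
	return string(b)
}

// healthHandler is the kind of handler a new service would register.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	fmt.Fprintln(w, healthBody())
}

// newRouter assembles the service's routes in one place.
func newRouter() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthHandler)
	return mux
}

func main() {
	_ = newRouter()
	// A real service would block here:
	// log.Fatal(http.ListenAndServe(":8888", newRouter()))
	fmt.Println(healthBody())
}
```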
Add a new cluster configuration option
- Define the config struct field in cli/cmd/lib_cluster_config.go with YAML tags (cli/cmd/lib_cluster_config.go)
- Add validation logic for the new field in the same file or in cli/cluster/errors.go (cli/cluster/errors.go)
- Update the CLI config loading to parse the field from the environment or cortex.yaml files (cli/types/cliconfig/cli_config.go)
- Pass the new config to the operator via lib_manager.go HTTP requests (cli/cmd/lib_manager.go)
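The first two steps can be sketched as a YAML-tagged struct plus a validation method. This mirrors the pattern in cli/cmd/lib_cluster_config.go but is not the real struct — MaxIdleNodes is a hypothetical new field and the real config has many more options:

```go
package main

import "fmt"

// ClusterConfig sketches the YAML-tagged config struct pattern; a YAML
// decoder maps keys like max_idle_nodes onto these fields via the tags.
type ClusterConfig struct {
	ClusterName  string `yaml:"cluster_name"`
	Region       string `yaml:"region"`
	MaxIdleNodes int    `yaml:"max_idle_nodes"` // the new option being added
}

// Validate mirrors the per-field checks the real loader performs before the
// config is sent to the operator.
func (c *ClusterConfig) Validate() error {
	if c.ClusterName == "" {
		return fmt.Errorf("cluster_name is required")
	}
	if c.MaxIdleNodes < 0 {
		return fmt.Errorf("max_idle_nodes must be >= 0, got %d", c.MaxIdleNodes)
	}
	return nil
}

func main() {
	cfg := ClusterConfig{ClusterName: "demo", Region: "us-east-1", MaxIdleNodes: 2}
	fmt.Println("valid:", cfg.Validate() == nil)
}
```

Keeping validation as a method on the struct means every code path that loads a config (env, cortex.yaml, operator request) reuses the same checks.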
Add monitoring/metrics collection
- Use the DataDog client (imported in go.mod) to emit metrics from operator services (cmd/operator/main.go)
- Add Prometheus endpoints or StatsD gauge increments in service main files (cmd/autoscaler/main.go)
- Configure metric tags and endpoint details in dev/prometheus.md or environment variables (dev/prometheus.md)
🔧Why these technologies
- Go + gRPC/HTTP — High-performance, concurrent service-to-service communication; efficient for orchestration and low-latency operator commands
- Kubernetes — Industry-standard container orchestration; built-in workload scheduling, networking, autoscaling, and declarative config management
- AWS SDK (boto3/aws-go) — Enables multi-region cluster provisioning, elastic compute, and managed services integration for spot instances and on-demand backups
- Docker — Containerization of ML models and system components; enables reproducible, portable deployments across environments
- DataDog + Prometheus — Observability and metrics collection; allows integration with any monitoring stack and provides pre-built dashboards
- YAML configuration — Human-readable, declarative workload definitions; familiar to DevOps engineers and integrates with GitOps workflows
⚖️Trade-offs already made
- Operator-based architecture instead of serverless platforms
  - Why: Need for full control over scheduling, autoscaling policies, and multi-workload orchestration; avoiding vendor lock-in
  - Consequence: Higher operational overhead; requires Kubernetes expertise and cluster management; more flexibility but more complexity
- Async/batch workloads via a queue-based system (enqueuer/dequeuer) rather than event streaming
  - Why: Simpler fault tolerance, guaranteed delivery, and easier backpressure handling for batch ML jobs
  - Consequence: Lower throughput for extremely high-frequency events; requires queue infrastructure management
- CLI-first user experience with HTTP/gRPC backend
  - Why: Integrates with CI/CD pipelines and infrastructure-as-code tools; familiar workflow for DevOps users
  - Consequence: No web UI; requires users to learn the CLI; less interactive for exploratory deployments
- Spot instance support with on-demand failover
  - Why: Cost optimization for ML workloads; leverages AWS pricing discounts without risk
  - Consequence: Increased complexity in scheduling and failover logic; workload interruptions are managed transparently but remain unpredictable
🚫Non-goals (don't propose these)
- Does not provide built-in model training; only deployment and serving of pre-trained models
- Does not handle authentication/authorization; assumes integration with cloud provider IAM or external auth systems
- Not a data lake or feature store; does not manage raw training data or feature pipelines
- Does not support serverless platforms other than Kubernetes (e.g., AWS Lambda, Google Cloud Run); Kubernetes-only
- Does not provide web UI; CLI and programmatic API only
🪤Traps & gotchas
- Go 1.17 is pinned in go.mod (EOL January 2023); attempting to use newer toolchains may cause module resolution issues.
- AWS credentials must be available (env vars or ~/.aws/config) before any cortex cluster command.
- Kubernetes client-go is pinned to v0.22.11 (old); EKS API changes in newer clusters could cause compatibility issues.
- No Terraform provider code is visible despite the README mention — it may be in a separate repo or removed.
- CircleCI secrets (AWS keys, Docker registry) are required to publish images; local builds will fail at the push stages.
🏗️Architecture
💡Concepts to learn
- Kubernetes Custom Resource Definitions (CRDs) — Cortex extends EKS by defining custom API types (realtime APIs, async APIs, batch jobs) as CRDs; understanding CRDs is essential to grasp how Cortex models are deployed as K8s objects
- EKS Autoscaling (Cluster Autoscaler, Karpenter) — Cortex's core value proposition is elastic cluster autoscaling; it adjusts EKS node pools (CPU/GPU instances, spot vs on-demand) based on workload demand—this is what differentiates it from manual EKS management
- Istio Virtual Services and Traffic Management — Cortex uses Istio (istio.io/client-go in dependencies) to route traffic, implement canary deployments, and manage traffic splitting between API versions—critical for rolling updates without downtime
- IAM AssumeRole and IRSA (IAM Roles for Service Accounts) — Cortex integrates with AWS IAM for fine-grained permissions; understanding IRSA allows deployed models to access S3, RDS, etc. without embedding AWS keys — visible in lib_aws_creds.go and sigs.k8s.io/aws-iam-authenticator
- Declarative Infrastructure as Code (GitOps pattern) — Cortex follows IaC principles: users define clusters and APIs in YAML config files, versioned in Git; the CLI reconciles desired state with actual cluster state, enabling reproducible, auditable deployments
- Spot Instances and On-Demand Fallback — Cortex can run workloads on AWS Spot instances (cheaper but interruptible) with automatic failover to on-demand; this cost-optimization pattern is core to the platform's value for cost-conscious teams
- Blue-Green and Canary Deployments — Cortex enables traffic splitting between API versions (via lib_traffic_splitters.go and Istio), allowing gradual rollouts and instant rollbacks — essential for production ML where bad models cause revenue loss
🔗Related repos
- seldon/seldon-core — Kubernetes-native ML model serving platform; alternative to Cortex for production inference, also uses Istio for traffic management
- bentoml/bentoml — Python-first ML model packaging and serving framework; lighter-weight alternative focused on reproducibility and rapid iteration
- ray-project/ray — Distributed computing framework for ML workloads; complements Cortex for batch and async processing at scale
- aws/amazon-sagemaker-examples — AWS-native ML ops examples; relevant for teams migrating from Cortex to SageMaker or evaluating SageMaker vs self-managed EKS
- kubernetes/kubernetes — Cortex runs on EKS (managed Kubernetes); understanding K8s primitives (Deployments, Services, Custom Resources) is essential
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add integration tests for CLI commands in cli/cmd with Golang testing framework
The cli/cmd directory contains critical command implementations (deploy.go, delete.go, get.go, logs.go, etc.) but there are no visible test files (*_test.go). These commands interact with cluster management and API operations, making them high-risk areas for regressions. Adding integration tests would catch breaking changes early and document expected CLI behavior.
- [ ] Create cli/cmd/deploy_test.go with tests for flag parsing and cluster deployment flows
- [ ] Create cli/cmd/delete_test.go testing deletion confirmation and error handling
- [ ] Create cli/cmd/get_test.go and cli/cmd/logs_test.go for API retrieval logic
- [ ] Use testify assertions (already in go.mod) for consistent test patterns
- [ ] Update build/test.sh to run Go tests in ./cli/... directory
- [ ] Reference existing ginkgo/gomega setup in go.mod for BDD-style tests if preferred
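The table-driven shape the checklist asks for can be sketched against a small, isolated piece of logic. The --env flag and its default below are hypothetical stand-ins for the real deploy command's flags; the real tests would use testify assertions inside a func TestDeployFlags(t *testing.T):

```go
package main

import (
	"flag"
	"fmt"
)

// parseEnvFlag isolates the kind of flag parsing a cli/cmd/deploy_test.go
// would exercise, so it can be tested without touching a cluster.
func parseEnvFlag(args []string) (string, error) {
	fs := flag.NewFlagSet("deploy", flag.ContinueOnError)
	env := fs.String("env", "default", "environment name")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *env, nil
}

func main() {
	// Table-driven cases: each row is one scenario, so adding coverage
	// means adding a row, not a new test function.
	cases := []struct {
		args []string
		want string
	}{
		{nil, "default"},
		{[]string{"--env", "prod"}, "prod"},
	}
	for _, c := range cases {
		got, err := parseEnvFlag(c.args)
		fmt.Printf("args=%v got=%q err=%v want=%q\n", c.args, got, err, c.want)
	}
}
```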
Add GitHub Actions workflow for automated Go linting and security scanning on PRs
The repo has build/lint.sh script but no visible GitHub Actions workflow (only .circleci/config.yml for CircleCI). Contributors won't get immediate feedback on code quality issues before PR review. A GitHub Actions workflow for golangci-lint and gosec would catch issues early and reduce review burden.
- [ ] Create .github/workflows/go-lint.yml with golangci-lint action targeting ./cli and ./cmd directories
- [ ] Add gosec security scanning step for vulnerability detection in Go dependencies
- [ ] Configure to run on pull_request and push to main branch events
- [ ] Set up workflow to fail on high-severity linting violations to enforce standards
- [ ] Add status check requirement in GitHub branch protection rules (documented in CONTRIBUTING.md)
Add comprehensive error handling tests for cli/cmd/errors.go and cli/cluster/errors.go
Two dedicated error files exist (cli/cmd/errors.go and cli/cluster/errors.go) suggesting custom error types, but without visible test coverage. These are critical for user experience—incorrect error messages or missing error cases cause poor CLI UX. Testing error handling paths ensures clarity and consistency.
- [ ] Create cli/cmd/errors_test.go with table-driven tests for each error type and message formatting
- [ ] Create cli/cluster/errors_test.go testing HTTP error mappings and cluster-specific error scenarios
- [ ] Test error message formatting, error wrapping chains (pkg/errors is in dependencies), and user-facing text
- [ ] Use existing stretchr/testify for assertions and ensure error cases in deploy/delete/get commands propagate correctly
- [ ] Document error handling patterns in CONTRIBUTING.md for future contributors
🌿Good first issues
- Add missing unit tests for cli/cmd/lib_traffic_splitters.go — it's in the file list but has no corresponding test file; write tests for traffic splitting config parsing and validation.
- Document the cluster configuration schema in docs/ — the CLI accepts YAML via lib_cluster_config.go but no inline documentation of required/optional fields exists; extract the schema and write reference docs.
- Refactor error handling in cli/cluster/errors.go — consolidate duplicate AWS/Kubernetes error messages into a single error catalog to improve consistency across deploy.go, get.go, and logs.go.
⭐Top contributors
- @deliahu — 56 commits
- @RobertLucian — 28 commits
- Miguel Varela Ramos — 6 commits
- @vishalbollu — 4 commits
- @bluuewhale — 2 commits
📝Recent commits
- dc48c02 — Update website URL (deliahu)
- a1bfb09 — Add maintenance note (deliahu)
- dc5f732 — Update stable version to 0.42.1 (RobertLucian)
- 51c18df — Send al VPC CNI logs to /dev/null by default (#2443) (RobertLucian)
- d647b0d — Upgrade CNI version to 1.11.3 (#2442) (RobertLucian)
- 78de0c6 — Update Cortex versions (eksctl, EKS, AWS IAM, Python, etc) (#2438) (RobertLucian)
- 231cbb6 — fix: export_images.sh (#2424) (bluuewhale)
- 46bc56f — Mention VPC endpoints in docs (deliahu)
- cdeb9df — Comment out test (deliahu)
- 30c9960 — Update logo url (deliahu)
🔒Security observations
- High · Outdated Go Dependencies with Known Vulnerabilities — go.mod. The go.mod file uses Go 1.17, which is outdated (released August 2021). Multiple dependencies have known vulnerabilities and are pinned to older versions, notably docker v20.10.7 (June 2021), prometheus/client_golang v1.11.0, and aws-sdk-go v1.43.29. The project is no longer actively maintained, increasing the risk of unpatched vulnerabilities. Fix: Upgrade Go to 1.21+, audit and update all dependencies to their latest patched versions, run a vulnerability scanner ('go list -json -m all | nancy sleuth' or similar), and consider automated dependency updates via Dependabot.
- High · Insecure Docker Image Dependencies — go.mod (docker/docker dependency), build/ directory. The codebase uses Docker v20.10.7+incompatible, which has multiple known vulnerabilities (CVE-2021-30465, CVE-2021-41091, etc.). The build scripts reference Docker operations but version pinning is not enforced, risking execution of vulnerable container runtimes during build and deployment. Fix: Upgrade to Docker 20.10.16+ or 23.0.0+, use exact image digests in Dockerfiles instead of tags, and add image scanning to CI/CD with tools like Trivy, Clair, or ECR image scanning.
- High · Kubernetes client-go with Security Issues — go.mod (k8s.io/client-go dependency). k8s.io/client-go v0.22.11 (released September 2021) is deprecated and contains known vulnerabilities — critical for a production infrastructure tool that manages Kubernetes clusters. Fix: Upgrade k8s.io/client-go to v0.28.0 or later, review release notes for breaking changes, and test thoroughly in staging before production deployment.
- Medium · Use of Deprecated and Unmaintained Dependencies — go.mod. Multiple dependencies are outdated or from less-maintained sources: github.com/davecgh/go-spew v1.1.1, github.com/xtgo/uuid (last updated 2014), github.com/patrickmn/go-cache (last meaningful update 2015). These increase technical debt and potential security risk. Fix: Replace the deprecated UUID library with google/uuid (already in use), audit go-cache usage and consider more maintained alternatives or native sync.Map, and run 'go list -m -u all' to surface deprecated modules and available updates.
- Medium · No Apparent Secret Management Implementation — cli/cmd/lib_aws_creds.go, .circleci/config.yml, cli/types/cliconfig/. The codebase handles AWS credentials (lib_aws_creds.go) but shows no evidence of secret management best practices; credentials may be passed via environment variables or config files without encryption, and .circleci/config.yml and the deployment scripts could expose secrets in logs. Fix: Use AWS Secrets Manager or HashiCorp Vault for secret storage, prefer IAM roles over static credentials, add secret scanning to CI/CD (e.g., git-secrets, truffleHog), and ensure CI/CD logs don't expose sensitive data.
- Medium · HTTP Client Usage Without Apparent TLS Verification — cli/cluster/lib_http_client.go. The file name suggests a custom HTTP client implementation. Without reviewing the actual code, standard risks include disabled TLS verification, no certificate pinning, and missing timeout configurations that could be exploited for MITM or DoS attacks. Fix: Review the client to ensure TLS verification is enabled by default, certificates are properly validated, request timeouts are set, and secure cipher suites are configured; never disable TLS verification in production.
- Medium · Sentry Integration Without Configuration Validation — go.mod (getsentry/sentry-go dependency). github.com/getsentry/sentry-go v0.11.0 is included but integration details are unclear; if improperly configured, it could leak sensitive stack traces or error details to external servers.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.