LAION-AI/Open-Assistant
OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
Healthy across all four use cases:
- Permissive license, no critical CVEs, actively maintained — safe to depend on.
- Has a license, tests, and CI — clean foundation to fork and modify.
- Documented and popular — useful reference codebase to read through.
- No critical CVEs, sane security posture — runnable as-is.
- ✓ 34+ active contributors
- ✓ Distributed ownership (top contributor 21% of recent commits)
- ✓ Apache-2.0 licensed
- ✓ CI configured
- ✓ Tests present
- ⚠ Stale — last commit 2y ago
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Healthy" badge
Paste into your README — live-updates from the latest cached analysis.
`[](https://repopilot.app/r/laion-ai/open-assistant)` — paste at the top of your README.md; renders inline like a shields.io badge.
▸Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/laion-ai/open-assistant on X, Slack, or LinkedIn.
Onboarding doc
Onboarding: LAION-AI/Open-Assistant
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/LAION-AI/Open-Assistant shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across all four use cases
- 34+ active contributors
- Distributed ownership (top contributor 21% of recent commits)
- Apache-2.0 licensed
- CI configured
- Tests present
- ⚠ Stale — last commit 2y ago
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live LAION-AI/Open-Assistant
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/LAION-AI/Open-Assistant.
What it runs against: a local clone of LAION-AI/Open-Assistant — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in LAION-AI/Open-Assistant | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 658 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of LAION-AI/Open-Assistant. If you don't
# have one yet, run these first:
#
# git clone https://github.com/LAION-AI/Open-Assistant.git
# cd Open-Assistant
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
echo "FAIL: not inside a git repository. cd into your clone of LAION-AI/Open-Assistant and re-run."
exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "LAION-AI/Open-Assistant(\.git)?\b" \
  && ok "origin remote is LAION-AI/Open-Assistant" \
  || miss "origin remote is not LAION-AI/Open-Assistant (artifact may be from a fork)"
# 2. License matches what RepoPilot saw
# An actual Apache LICENSE file starts with "Apache License", not "Apache-2.0",
# so match both spellings.
(grep -qiE "Apache License|Apache-2\.0" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Critical files exist
test -f "backend/main.py" \
  && ok "backend/main.py" \
  || miss "missing critical file: backend/main.py"
test -f "backend/alembic/env.py" \
  && ok "backend/alembic/env.py" \
  || miss "missing critical file: backend/alembic/env.py"
# oasst_backend is a package directory, so test -d, not -f.
test -d "backend/oasst_backend" \
  && ok "backend/oasst_backend" \
  || miss "missing critical path: backend/oasst_backend (directory)"
test -f "backend/.env.example" \
  && ok "backend/.env.example" \
  || miss "missing critical file: backend/.env.example"
test -f ".github/workflows/test-api-contract.yaml" \
  && ok ".github/workflows/test-api-contract.yaml" \
  || miss "missing critical file: .github/workflows/test-api-contract.yaml"
# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 658 ]; then
ok "last commit was $days_since_last days ago (artifact saw ~628d)"
else
miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
echo "artifact verified (0 failures) — safe to trust"
else
echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/LAION-AI/Open-Assistant"
exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
OpenAssistant is a large-scale, open-source RLHF-trained chat assistant built on FastAPI (backend) and Next.js/React (frontend) that collects human feedback to improve LLM responses. It powers a web interface where users interact with an AI assistant while the platform systematically gathers preference data (rankings, ratings) to create the oasst2 dataset—solving the bottleneck of expensive proprietary human feedback for LLM training. Monorepo structure: backend/ contains FastAPI+SQLAlchemy REST API with alembic migrations; frontend/ holds Next.js/TypeScript web UI; inference/ (via ansible/) manages model serving; .devcontainer/ provides isolated dev environments for backend and frontend separately. State flows from React frontend → FastAPI endpoints → PostgreSQL + Redis cache, with Celery workers for async tasks.
👥Who it's for
ML researchers and engineers building open-source LLMs who need crowdsourced RLHF training data; full-stack developers contributing to the chat UI, API, or data collection pipeline; and organizations wanting to fine-tune models on human preference signals without vendor lock-in.
🌱Maturity & risk
Project is completed and archived as of October 2023 (per README note). It achieved production scale with 13K+ GitHub stars, comprehensive CI/CD via 9+ GitHub Actions workflows (test-e2e.yaml, test-api-contract.yaml, production-deploy.yaml), and released the final oasst2 dataset on HuggingFace. Not actively developed anymore, but codebase is stable and the dataset is the lasting artifact.
As an archived project, no ongoing maintenance or security patches. Backend dependencies (FastAPI 0.88.0, SQLAlchemy 1.4.41, pydantic 1.10.7) are pinned to 2022–2023 versions and may have unpatched CVEs. The monorepo couples frontend/backend/inference tightly—forking requires significant DevOps work (see ansible/ for production infrastructure). No obvious single-maintainer risk since LAION is an org, but community contributions have ceased.
Active areas of work
Project is in maintenance mode—no active development. The codebase is frozen at the point of dataset completion. The final oasst2 dataset (1M+ conversations) is published on HuggingFace. Workflows still run (CI checks pass) but no new features or PRs are being merged. Historical contributions and annotation work are archived.
🚀Get running
git clone https://github.com/LAION-AI/Open-Assistant.git
cd Open-Assistant
# For backend:
cd backend && pip install -r requirements.txt && cp .env.example .env
# For frontend:
cd ../frontend && npm install
# See .devcontainer/ for containerized setup
Daily commands:
Backend: cd backend && uvicorn main:app --reload (requires PostgreSQL + Redis running). Frontend: cd frontend && npm run dev (Next.js dev server on :3000). See .devcontainer/backend-dev/post_create_command.sh and .devcontainer/frontend-dev/post_create_command.sh for full bootstrap steps including DB migrations.
🗺️Map of the codebase
- backend/main.py — Entry point for the FastAPI backend server; initializes all routes, middleware, and core application logic.
- backend/alembic/env.py — Database migration configuration; must understand before making schema changes or running the application.
- backend/oasst_backend/ — Core backend package containing models, routes, and business logic; primary codebase for backend contributors.
- backend/.env.example — Defines all required environment variables and configuration; essential for local setup and CI/CD.
- .github/workflows/test-api-contract.yaml — API contract testing and validation; shows how changes are validated before merge.
- backend/alembic.ini — Alembic configuration for database migrations; required for understanding schema evolution.
🛠️How to make changes
Add a new API endpoint
- Create a new route handler in backend/oasst_backend/routes or a subdirectory.
- Define request/response models using Pydantic in backend/oasst_backend/models.
- Register the route in backend/main.py via app.include_router().
- Add database migrations (backend/alembic/versions) if new tables or columns are needed, using `alembic revision --autogenerate -m 'description'`.
- Add OpenAPI documentation via docstrings and route parameters for the auto-generated API docs.
Update database schema
- Define or modify SQLAlchemy ORM models in backend/oasst_backend/models or subdirectories.
- Generate a migration into backend/alembic/versions with `alembic revision --autogenerate -m "migration description"`.
- Review and edit the generated migration file if needed.
- Apply the migration locally with `alembic upgrade head` (configured via backend/alembic/env.py).
- Update related API serialization models in backend/oasst_backend/models.
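The first step might look like this (the Message model and edited_at column are invented for illustration, not the repo's real schema):

```python
# Hypothetical ORM model edit that `alembic revision --autogenerate` would
# pick up by diffing this metadata against the live database schema.
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Message(Base):
    __tablename__ = "message"
    id = sa.Column(sa.Integer, primary_key=True)
    text = sa.Column(sa.String, nullable=False)
    # New column: nullable=True means the autogenerated migration is a plain
    # ADD COLUMN with no backfill needed for existing rows.
    edited_at = sa.Column(sa.DateTime, nullable=True)
```

Keeping new columns nullable (or giving them a server default) keeps the generated migration trivially reversible.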
Add a new background task or worker
- Celery is already in the dependencies; create task definitions in backend/oasst_backend/tasks or similar.
- Configure the Redis connection in backend/.env.example and the corresponding app initialization.
- Register Celery beat schedules, or trigger tasks from API endpoints in backend/oasst_backend/routes.
- Update Ansible playbooks (ansible/inference/deploy-worker.yaml) if workers need separate deployment.
Modify environment configuration
- Add the new environment variable to backend/.env.example with documentation.
- Load and validate the variable in backend/main.py or a dedicated config module using python-dotenv.
- Update GitHub Actions workflows (.github/workflows/*.yaml, e.g. production-deploy.yaml) to set required secrets or vars.
- Update .devcontainer/post_create_command.sh or .devcontainer/backend-dev/post_create_command.sh for local dev setup.
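The validation step can be as small as this stdlib-only fail-fast sketch; the variable list is illustrative, and the authoritative set lives in backend/.env.example:

```python
import os

# Illustrative required-variable list; the real set lives in backend/.env.example.
REQUIRED = ("DATABASE_URL", "REDIS_URL", "SECRET_KEY")

def missing_env_vars() -> list:
    """Names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not os.environ.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        # In backend/main.py you would raise at startup instead of printing.
        print("missing required environment variables:", ", ".join(missing))
```

Failing at startup turns a vague runtime error hours later into an immediate, named configuration error.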
🔧Why these technologies
- FastAPI + Uvicorn — Modern async Python web framework with automatic OpenAPI documentation, minimal boilerplate, and high performance for I/O-bound operations.
- SQLAlchemy + SQLModel — Unified ORM and data validation layer; SQLModel bridges SQLAlchemy models and Pydantic schemas for consistency.
- PostgreSQL + Alembic — Production-grade relational database with strong ACID guarantees; Alembic provides version-controlled schema migrations.
- Redis — In-memory cache and Celery message broker; reduces database load and enables distributed task queuing.
- Celery — Distributed task queue for background jobs (labeling, inference, aggregation) decoupled from request-response cycle.
- Docker + Ansible — Containerized deployment with Infrastructure-as-Code for reproducible, scalable multi-node inference clusters.
⚖️Trade-offs already made
- Monolithic backend vs. microservices
  - Why: simpler deployment and operation for a crowdsourced labeling platform; the trade-off is reduced independent scaling.
  - Consequence: all API routes, business logic, and database access share a single FastAPI instance; easier to develop and debug but harder to scale individual components.
- SQLAlchemy ORM vs. raw SQL
  - Why: type safety, query composition, and Alembic migration support improve maintainability.
  - Consequence: slight performance overhead; requires understanding of ORM lazy-loading and N+1 query pitfalls.
- Pydantic + SQLModel validation at the API boundary
  - Why: automatic request validation, serialization, and OpenAPI schema generation.
  - Consequence: schema definitions are duplicated (ORM model + Pydantic), though SQLModel attempts to unify them.
- Centralized PostgreSQL vs. event-sourced architecture
  - Why: simpler querying and reporting for labeling analytics; audit trail tracked via database triggers or journal tables.
  - Consequence: not optimized for high-frequency event streaming; the labeling flow relies on transactional consistency rather than eventual consistency.
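The N+1 pitfall mentioned under the ORM trade-off, demonstrated against an in-memory SQLite database with hypothetical models:

```python
# Hypothetical two-table schema to show the lazy-loading / N+1 issue.
import sqlalchemy as sa
from sqlalchemy.orm import Session, declarative_base, relationship, selectinload

Base = declarative_base()

class Tree(Base):
    __tablename__ = "tree"
    id = sa.Column(sa.Integer, primary_key=True)
    messages = relationship("Message")

class Message(Base):
    __tablename__ = "message"
    id = sa.Column(sa.Integer, primary_key=True)
    tree_id = sa.Column(sa.ForeignKey("tree.id"))

engine = sa.create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Tree(id=1, messages=[Message(id=1), Message(id=2)]))
    session.commit()
    # Default lazy loading: each `tree.messages` access issues its own SELECT,
    # so N trees cost N+1 queries. selectinload() fetches all children in one
    # extra query instead, regardless of row count:
    trees = session.query(Tree).options(selectinload(Tree.messages)).all()
    counts = [len(t.messages) for t in trees]
```

The fix is one `.options(selectinload(...))` call at the query site, which is why spotting the pattern matters more than memorizing it.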
🚫Non-goals (don't propose these)
- Not a real-time collaborative editing system; no WebSocket-based live cursors or live label merging.
- Does not handle model training or inference directly; delegates LLM inference to separate inference servers (ansible/inference).
- Not a general-purpose task platform; specifically designed for crowdsourced dialogue annotation and ranking.
- Does not provide built-in federation or data privacy for HIPAA/GDPR compliance; relies on deployment environment.
🪤Traps & gotchas
- Critical environment variables: DATABASE_URL (PostgreSQL), REDIS_URL, SECRET_KEY (JWT signing), OPENAI_API_KEY (if using the OpenAI fallback).
- Database migrations must run on startup: Alembic auto-migration is expected; see post_create_command.sh.
- PostgreSQL and Redis must be running before starting the backend — no in-memory fallbacks.
- The frontend requires a running backend at the NEXT_PUBLIC_API_URL env var.
- Devcontainer post-create scripts are essential — running pip/npm without them will miss critical setup steps.
- pydantic 1.10.7 is pinned (not v2) — code uses v1 syntax and will break on v2 upgrades.
🏗️Architecture
💡Concepts to learn
- RLHF (Reinforcement Learning from Human Feedback) — Core methodology behind this project—human rankings of assistant responses are used as reward signals to fine-tune LLMs, and understanding this paradigm is essential to grasping why the data collection UI and backend are designed this way
- Alembic migrations (SQLAlchemy-based) — The backend uses Alembic for declarative schema versioning; contributors must understand migration workflows to modify the database safely in a multi-environment setup
- Celery async task queue — Heavy operations (model inference calls, dataset exports, email notifications) are offloaded to Celery workers; understanding task definitions and Redis brokering is essential for extending the platform
- JWT (JSON Web Token) authentication — The API uses python-jose to issue and validate JWTs for stateless user sessions; crucial for securing the feedback collection API and understanding the auth flow
- SQLModel (Pydantic + SQLAlchemy hybrid) — The backend uses sqlmodel to unify ORM models with Pydantic validators; this pattern reduces boilerplate but requires understanding both ORMs simultaneously
- Prometheus instrumentation for FastAPI — The codebase uses prometheus-fastapi-instrumentator to expose metrics; relevant for understanding observability requirements in production deployments
- Next.js server-side rendering (SSR) + API routes — The frontend uses Next.js not just for static generation but for its API route layer (pages/api/), which proxies to the FastAPI backend; essential for deployment and understanding the full-stack request flow
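The JWT concept above notes that python-jose issues and validates tokens; this stdlib-only sketch shows the HS256 mechanics underneath it (illustrative only; use a maintained JWT library in real code):

```python
# What an HS256 JWT is under the hood: base64url(header).base64url(payload).HMAC.
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: str) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

def verify_jwt(token: str, secret: str) -> bool:
    header, body, sig = token.split(".")
    expected = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    # compare_digest prevents timing attacks on the signature check.
    return hmac.compare_digest(b64url(expected), sig)
```

This is why SECRET_KEY (see the traps section) must stay secret: anyone holding it can mint valid sessions.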
🔗Related repos
- OpenAssistant/oasst2 — The official HuggingFace dataset repo (1M conversations + rankings) that this codebase was built to collect; the permanent artifact of the project
- LAION-AI/Open-Assistant-Plugins — Companion repo for extending Open-Assistant with third-party tool integrations (APIs, web search, code execution)
- Stability-AI/StabilityLM — Uses a similar RLHF + crowdsourcing methodology as Open-Assistant for training instruction-following models
- CarperAI/trlx — Transformer Reinforcement Learning X library; the underlying framework used for RLHF training on oasst2 data
- allenai/open-instruct — Related instruction-tuning dataset and training code that complements Open-Assistant's preference learning approach
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add database migration validation tests for Alembic versions
The backend has 13+ Alembic migration files (backend/alembic/versions/) but no visible test suite validating that migrations apply cleanly, can be reversed, and don't break schema integrity. This is critical for a production chat system with user data. New contributors could add pytest-based tests that validate each migration's up/down cycle against a test PostgreSQL database.
- [ ] Create backend/tests/test_migrations.py with fixtures for temporary test databases
- [ ] Add test cases that apply each migration sequentially and verify schema state
- [ ] Add test cases that validate downgrade/upgrade cycles don't lose data
- [ ] Integrate tests into .github/workflows/test-api-contract.yaml to run on PRs
- [ ] Document migration testing in backend/README.md
Add API contract tests for OpenAssistant backend endpoints
The repo has a workflow file .github/workflows/test-api-contract.yaml but no visible test suite for it. With FastAPI, SQLModel, and Pydantic models defined throughout the backend, there should be request/response validation tests. Contributors could add comprehensive endpoint contract tests to ensure request schemas, response codes, and error handling remain consistent.
- [ ] Create backend/tests/test_api_contracts.py with pytest fixtures for test client
- [ ] Add parametrized tests for all major endpoints (auth, chat, feedback submission, etc.)
- [ ] Validate request validation (Pydantic schema enforcement) for malformed inputs
- [ ] Validate response schemas match Pydantic models for success/error cases
- [ ] Ensure test-api-contract.yaml workflow properly runs and reports results
Add integration tests for Redis caching layer in backend
The dependencies show redis==4.5.5 and there are .devcontainer Redis configs, but no visible test suite for Redis integration. With Celery and FastAPI-limiter (rate limiting) likely using Redis, contributors could add integration tests to verify cache operations, session management, and rate limiting work correctly.
- [ ] Create backend/tests/test_redis_integration.py with Redis container fixtures
- [ ] Add tests for rate limiter functionality using fastapi-limiter with test endpoints
- [ ] Add tests for any session/cache operations that depend on Redis
- [ ] Add tests for Celery task queueing and retrieval from Redis
- [ ] Document Redis testing setup in backend/README.md and update .devcontainer/backend-dev/post_create_command.sh if needed
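The rate-limiting behavior these tests would exercise can be pinned down against any Redis-compatible client; this sketch's key scheme is illustrative, not fastapi-limiter's actual implementation:

```python
# Fixed-window rate limiter over any Redis-like client exposing incr/expire.
def over_limit(client, caller_id, limit=5, window_s=60):
    """Return True once `caller_id` exceeds `limit` calls in the current window."""
    key = "rate:" + caller_id
    count = client.incr(key)
    if count == 1:
        client.expire(key, window_s)  # window starts on the first hit
    return count > limit

class StubRedis:
    """Minimal in-memory stand-in; swap in redis.Redis() or fakeredis.FakeRedis()."""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, seconds):
        pass  # expiry is irrelevant for a single-window demo

r = StubRedis()
hits = [over_limit(r, "u1", limit=3) for _ in range(4)]
```

Writing the limiter against the incr/expire interface means the same test runs unchanged against a stub, fakeredis, or a real Redis container.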
🌿Good first issues
- Add missing integration tests for the feedback ranking endpoint (backend/app/routes/feedback.py) to increase test coverage beyond what .github/workflows/test-api-contract.yaml currently validates
- Document the RLHF training pipeline: create a markdown guide in docs/ explaining how to consume the final oasst2 dataset and retrain models using the collected rankings
- Refactor the Celery task definitions in backend/app/workers/ to use type hints and add task-specific logging; current code lacks celery_app type safety
⭐Top contributors
Click to expand
- @dependabot[bot] — 21 commits
- @andreaskoepf — 15 commits
- @yk — 10 commits
- @shahules786 — 8 commits
- @olliestanley — 6 commits
📝Recent commits
Click to expand
- f1e6ed9 — add note about oasst2 being available (#3743) (andrewm4894)
- e1769c1 — next build fixes (yk)
- ca7dc79 — apparently, prod ignores redirects or process env (yk)
- 29c50ee — ansible fix (yk)
- 46520c3 — deployment workflows for bye (yk)
- 5c0efa6 — added dashboard redirect (yk)
- fcd2453 — pre-commit (yk)
- 1f621d3 — added bye page (yk)
- de1f5c3 — Update docs for current project status. (#3730) (someone13574)
- 7558fa8 — add note to readme about project being completed (#3724) (andrewm4894)
🔒Security observations
- Critical · Hardcoded database credentials in Docker Compose — docker-compose.yaml, db service environment variables. The file contains hardcoded PostgreSQL credentials (POSTGRES_USER: postgres, POSTGRES_PASSWORD: postgres). These defaults are exposed in the repository and used in development environments, a significant risk if the configuration is ever reused in production. Fix: manage credentials via environment variables or Docker secrets; reference them from .env files (already present as .env.example); use POSTGRES_PASSWORD: ${DB_PASSWORD} in docker-compose.yaml and document that the variables must be set before deployment.
- High · Outdated cryptography dependencies with known vulnerabilities — backend requirements. The dependencies include cryptography==41.0.0, python-jose[cryptography]==3.3.0, and fastapi==0.88.0, all outdated and possibly carrying known CVEs. Fix: update to current secure versions (cryptography>=42.0.0, latest python-jose, fastapi>=0.100.0) and enable automated dependency scanning with Snyk or Dependabot (already configured in .github/dependabot.yml, but worth reviewing).
- High · Unencrypted default database port exposure — docker-compose.yaml, db service ports. PostgreSQL exposes port 5432 to the host with no network isolation or Docker-level authentication enforcement; combined with default credentials, this is a direct attack vector. Fix: remove the port exposure for production deployments; for development, restrict to localhost with ports: ['127.0.0.1:5432:5432']; require strong credentials before any port exposure.
- High · Missing security headers and HTTPS configuration — backend FastAPI app initialization (not fully visible). No explicit mechanism is visible for enforcing HTTPS, CORS restrictions, or security headers (CSP, X-Frame-Options, X-Content-Type-Options, etc.). Fix: add TrustedHostMiddleware, CORSMiddleware with restricted origins, security headers via custom middleware, enforce HTTPS in production, and properly configure the already-included fastapi-limiter.
- High · Outdated SQLAlchemy ORM — SQLAlchemy==1.4.41, sqlmodel==0.0.8. While the ORM generally protects against SQL injection, outdated versions may carry unpatched vulnerabilities, and sqlmodel 0.0.8 is extremely early-stage with possible stability and security issues. Fix: upgrade to SQLAlchemy 2.0+ for its enhanced protections, evaluate sqlmodel's maturity or migrate to a well-maintained alternative, pin specific minor versions, and review security advisories regularly.
- Medium · Missing environment variable validation — backend/.env.example and application initialization. The .env.example file exists, but nothing visibly enforces that required variables are set before startup, which could lead to misconfigurations with insecure defaults. Fix: validate settings at startup (e.g. Pydantic settings with python-dotenv) in a settings.py that raises if mandatory variables are missing.
- Medium · Potential deserialization risk via Celery/Redis — Celery==5.2.0, redis==4.5.5. These components may deserialize untrusted data and, if misconfigured, be vulnerable to deserialization attacks. Fix: configure Celery with secure serializers (never pickle in production), use only trusted message brokers, validate data before deserialization, update Celery, and review broker security in the ansible configuration files.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.