FunAudioLLM/CosyVoice
Multilingual large voice generation model with full-stack inference, training, and deployment capabilities.
Healthy across the board
Permissive license, no critical CVEs, actively maintained — safe to depend on.
Has a license, tests, and CI — clean foundation to fork and modify.
Documented and popular — useful reference codebase to read through.
No critical CVEs, sane security posture — runnable as-is.
- ✓Last commit 4d ago
- ✓14 active contributors
- ✓Apache-2.0 licensed
- ✓CI configured
- ✓Tests present
- ⚠Concentrated ownership — top contributor handles 72% of recent commits
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Onboarding doc
Onboarding: FunAudioLLM/CosyVoice
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/FunAudioLLM/CosyVoice shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
GO — Healthy across the board
- Last commit 4d ago
- 14 active contributors
- Apache-2.0 licensed
- CI configured
- Tests present
- ⚠ Concentrated ownership — top contributor handles 72% of recent commits
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live FunAudioLLM/CosyVoice
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/FunAudioLLM/CosyVoice.
What it runs against: a local clone of FunAudioLLM/CosyVoice — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in FunAudioLLM/CosyVoice | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | Last commit ≤ 34 days ago | Catches sudden abandonment since generation |
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of FunAudioLLM/CosyVoice. If you don't
# have one yet, run these first:
#
# git clone https://github.com/FunAudioLLM/CosyVoice.git
# cd CosyVoice
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok() { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }
# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of FunAudioLLM/CosyVoice and re-run."
  exit 2
fi
# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE 'FunAudioLLM/CosyVoice(\.git)?\b' \
  && ok "origin remote is FunAudioLLM/CosyVoice" \
  || miss "origin remote is not FunAudioLLM/CosyVoice (artifact may be from a fork)"
# 2. License matches what RepoPilot saw.
# Assumption: LICENSE holds the standard Apache text, which opens with
# "Apache License" / "Version 2.0" rather than the SPDX id "Apache-2.0".
# The SPDX id and a package.json field are still accepted as fallbacks.
if { grep -qiE 'Apache License' LICENSE && grep -qiE 'Version 2\.0' LICENSE; } 2>/dev/null \
   || grep -qiE '^(Apache-2\.0)' LICENSE 2>/dev/null \
   || grep -qiE '"license"[[:space:]]*:[[:space:]]*"Apache-2\.0"' package.json 2>/dev/null; then
  ok "license is Apache-2.0"
else
  miss "license drift — was Apache-2.0 at generation time"
fi
# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"
# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 34 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~4d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi
echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/FunAudioLLM/CosyVoice"
  exit 1
fi
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
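As a rough illustration of how an agent loop might consume that output, the sketch below splits a transcript into passed and failed checks using the `ok:` / `FAIL:` prefixes described above. The names `VerifySummary` and `summarize_verify_output` and the sample transcript are hypothetical and not part of RepoPilot or CosyVoice.

```python
from typing import NamedTuple


class VerifySummary(NamedTuple):
    passed: list[str]
    failed: list[str]

    @property
    def trusted(self) -> bool:
        # Mirror the script's rule: any FAIL line means the artifact is stale.
        return not self.failed


def summarize_verify_output(output: str) -> VerifySummary:
    """Split verify-script output into ok: and FAIL: messages."""
    passed, failed = [], []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("ok: "):
            passed.append(line[len("ok: "):])
        elif line.startswith("FAIL: "):
            failed.append(line[len("FAIL: "):])
    return VerifySummary(passed, failed)


# Hypothetical transcript, shaped like the script's output.
sample = """ok: origin remote is FunAudioLLM/CosyVoice
ok: license is Apache-2.0
FAIL: default branch main no longer exists
"""
summary = summarize_verify_output(sample)
```

In practice the transcript would come from running the script (for example with `subprocess.run(..., capture_output=True, text=True)`), and the exit code remains the authoritative signal; parsing the lines only tells the agent which claim went stale.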
⚡TL;DR
CosyVoice is a multilingual text-to-speech (TTS) system powered by a large language model (LLM) that generates natural speech with zero-shot voice cloning. It supports 9+ languages and 18+ Chinese dialects with state-of-the-art content consistency, speaker similarity, and prosody naturalness, using a flow-matching architecture with DiT-based diffusion and HiFi-GAN vocoding for high-quality audio synthesis.
The monorepo is structured as a cosyvoice/ package with modular subsystems:
- cosyvoice/llm: language model backbone
- cosyvoice/flow: flow-matching diffusion with DiT transformer
- cosyvoice/hifigan: neural vocoder
- cosyvoice/tokenizer: multilingual token encoding
- cosyvoice/transformer: core attention/decoder blocks
- cosyvoice/cli: inference CLI
- cosyvoice/bin: training/export scripts
- cosyvoice/dataset: data loading
Entry points are cosyvoice/cli/cosyvoice.py and cosyvoice/cli/model.py.
👥Who it's for
Speech synthesis researchers and audio engineers building production TTS systems who need multilingual support, speaker cloning, and instruction-based control (emotion, speed, volume, dialect); ML practitioners who want to train or fine-tune large voice models with full-stack inference, training, and deployment pipelines.
🌱Maturity & risk
Production-ready with active development. CosyVoice 3.0 was released in December 2024 with recent model checkpoints on ModelScope and HuggingFace; the repo shows continuous iteration (v1.0→v2.0→v3.0 progression) with structured CI workflows (.github/workflows/lint.yml), although specific test coverage metrics are not visible in the file listing.
Moderate dependency risk: 30+ dependencies (including conformer, diffusers, lightning, onnxruntime-gpu, and tensorrt) make the project sensitive to changes in the GPU/ONNX ecosystem, and no test suite is visible in the top-level structure, which raises regression risk for contributions. Ownership is concentrated: the top contributor handles 72% of recent commits. Breaking changes are likely between major versions (v1→v2→v3), and no clear migration guides are evident.
Active areas of work
Actively shipping Fun-CosyVoice 3.0 (Dec 2024) with base and RL-tuned models; recent work includes NVIDIA TensorRT/Triton runtime support (July 2025), vLLM integration for faster inference (May 2025), and streaming support (both text-in and audio-out with 150ms latency). Continuous model checkpoint releases across ModelScope and HuggingFace platforms.
🚀Get running
git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
pip install -e .
# Or install from requirements (inferred from dependencies):
pip install conformer==0.3.2 diffusers==0.29.0 lightning==2.2.4 onnxruntime-gpu==1.18.0 torch torchaudio
Daily commands:
Inference (CLI):
python -m cosyvoice.cli.cosyvoice --model_path <model_ckpt> --text "Hello" --output_path output.wav
Training:
python cosyvoice/bin/train.py --config conf/train.yaml
Export to ONNX/JIT:
python cosyvoice/bin/export_onnx.py --model_path ckpt.pth
python cosyvoice/bin/export_jit.py --model_path ckpt.pth
(Exact commands inferred from bin/ structure; see conf/ directory for config templates)
🗺️Map of the codebase
- cosyvoice/cli/cosyvoice.py: Main inference entry point; handles text→speech pipeline, voice cloning, instruction parsing, and streaming orchestration
- cosyvoice/flow/flow.py: Core flow-matching diffusion model; defines reverse-time diffusion schedule and acoustic feature generation from LLM embeddings
- cosyvoice/flow/DiT/dit.py: Diffusion Transformer encoder backbone; implements cross-attention with LLM tokens and duration/pitch conditioning
- cosyvoice/hifigan/hifigan.py: Neural vocoder converting mel-spectrogram → waveform; critical for speech quality and inference latency
- cosyvoice/llm/llm.py: Language model semantic token generation; decoupled from diffusion, can swap backends (vLLM, TensorRT-LLM, etc.)
- cosyvoice/tokenizer/tokenizer.py: Multilingual token encoding using tiktoken; handles 9+ languages and dialect-specific phoneme conversion
- cosyvoice/bin/train.py: Training orchestration script; integrates Lightning, dataset loading, and checkpoint management for base + RL fine-tuning
- cosyvoice/dataset/processor.py: Data pipeline: text preprocessing, alignment, F0/duration extraction via pyworld; critical for training data quality
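The files above line up with the data flow named elsewhere in this document (text → tokenize → LLM → flow matching → vocoding). The sketch below only illustrates that staging with stub callables; `run_pipeline` and every stub are hypothetical stand-ins and do not reflect CosyVoice's actual classes or signatures.

```python
from typing import Callable, Sequence

# Each stage is a plain callable so a stub can stand in for a real model.
Stage = Callable[[object], object]


def run_pipeline(text: str, stages: Sequence[tuple[str, Stage]]) -> tuple[object, list[str]]:
    """Feed the output of each stage into the next and record the order."""
    value: object = text
    trace: list[str] = []
    for name, stage in stages:
        value = stage(value)
        trace.append(name)
    return value, trace


# Hypothetical stubs: the real stages are neural models, not arithmetic.
stub_stages = [
    ("tokenizer", lambda text: [ord(c) for c in text]),            # text -> token ids
    ("llm", lambda ids: [i % 7 for i in ids]),                     # token ids -> speech tokens
    ("flow", lambda toks: [[float(t)] * 2 for t in toks]),         # speech tokens -> mel frames
    ("hifigan", lambda mel: [x for frame in mel for x in frame]),  # mel frames -> samples
]
waveform, trace = run_pipeline("hi", stub_stages)
```

Keeping each stage behind a narrow interface like this is one way to test the boundaries between cosyvoice/cli/, cosyvoice/flow/, and cosyvoice/hifigan/ with mock models.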
🛠️How to make changes
- Adding a new language: Extend cosyvoice/tokenizer/assets/ with new tiktoken vocab; update cosyvoice/cli/frontend.py text normalization rules.
- Modifying model architecture: Edit cosyvoice/flow/flow.py (diffusion core) or cosyvoice/flow/DiT/dit.py (transformer encoder).
- Training pipeline changes: Modify cosyvoice/bin/train.py and cosyvoice/dataset/processor.py.
- CLI features: Extend cosyvoice/cli/cosyvoice.py entry point.
- Vocoding: Replace or fine-tune cosyvoice/hifigan/hifigan.py.
🪤Traps & gotchas
- CUDA/ONNX version alignment: onnxruntime-gpu==1.18.0 requires specific CUDA version; mismatches cause silent inference failures.
- Config management: Hydra config overrides via command line (conf/ not in file list but referenced) are order-sensitive; incorrect YAML syntax breaks silently.
- Model checkpoint locations: Assumes ModelScope/HuggingFace models auto-download; offline environments need pre-cached checkpoints.
- WeTextProcessing dependency: Custom text normalization library not in standard PyPI; requires separate installation and may lag on non-Linux systems.
- Tiktoken assets: multilingual_zh_ja_yue_char_del.tiktoken is baked into repo; changes require regeneration script (not visible in file list).
- F0 extraction (pyworld): Requires audio sample rate consistency across training/inference; mismatches silently degrade prosody.
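The sample-rate gotcha is cheap to guard against before audio reaches a model. The sketch below uses only the standard-library `wave` module to read the rate from a PCM WAV header and fail loudly on a mismatch. The helper names and the 16 kHz demo value are assumptions for illustration, not rates or functions taken from CosyVoice.

```python
import os
import tempfile
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def assert_sample_rate(path: str, expected_hz: int) -> None:
    """Raise ValueError on a mismatch so it cannot pass silently."""
    actual_hz = wav_sample_rate(path)
    if actual_hz != expected_hz:
        raise ValueError(f"{path}: expected {expected_hz} Hz, found {actual_hz} Hz")


# Demo: write a short silent 16 kHz mono clip, then read its rate back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)                 # 16-bit samples
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 160)  # 10 ms of silence

demo_rate = wav_sample_rate(demo_path)
```

A check like this could run once per file in a data-preparation step, with the expected rate read from whichever config the training run uses.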
💡Concepts to learn
- Flow Matching — CosyVoice uses flow-matching (cosyvoice/flow/flow_matching.py) instead of traditional VAE-based TTS; enables faster, more stable acoustic feature generation from text embeddings without explicit duration models
- Diffusion Transformer (DiT) — cosyvoice/flow/DiT/dit.py replaces U-Net with pure transformer blocks for conditional acoustic synthesis; improves content consistency and reduces hallucinations via cross-attention to LLM tokens
- Zero-Shot Voice Cloning — CosyVoice's core capability: generates speech in a target speaker's voice using only a reference audio prompt without speaker embeddings or fine-tuning; powered by LLM semantic understanding
- Byte Pair Encoding (BPE) via tiktoken — cosyvoice/tokenizer/ uses OpenAI's tiktoken for multilingual tokenization; enables efficient vocabulary sharing across 9+ languages while preserving language-specific structure
- Mel-Spectrogram Vocoding — HiFi-GAN (cosyvoice/hifigan/hifigan.py) converts diffusion-generated mel-spectrograms → waveforms; critical bottleneck for inference speed and audio quality; alternative to autoregressive WaveNet
- Length Regulation — cosyvoice/flow/length_regulator.py maps phoneme/token sequences to acoustic frame counts; replaces traditional duration models in flow-matching pipeline for more robust control
- Cross-Lingual Prosody Transfer — CosyVoice 3.0 supports cross-lingual zero-shot cloning (generate English speech in Mandarin speaker's voice); achieved via LLM's multilingual semantic embeddings and flow-matched acoustic features
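To make the flow-matching idea concrete, here is a toy sketch of the sampling side: integrating dx/dt = v(x, t) from noise at t = 0 to a sample at t = 1 with Euler steps. It is not CosyVoice's implementation. The oracle velocity below is an assumption that stands in for a trained network, and it uses the straight-line path x_t = (1 - t)·x0 + t·x1, one common choice of path.

```python
def euler_sample(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(target):
    """Velocity field of the straight-line path toward a known target.

    On the path x_t = (1 - t) * x0 + t * target, the constant velocity
    target - x0 equals (target - x_t) / (1 - t), which needs only x_t and t.
    A trained model would approximate this field without seeing the target.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.5, -1.0, 2.0]    # stands in for a Gaussian noise draw
target = [1.0, 0.0, -1.0]   # stands in for acoustic features
sample = euler_sample(noise, make_oracle_velocity(target), steps=10)
```

Because the oracle field is constant along a straight path, Euler integration lands on the target up to floating-point error. With a learned field the step count trades speed against accuracy.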
🔗Related repos
- microsoft/VoiceConversion — Zero-shot voice conversion technique that inspired CosyVoice's voice cloning without explicit speaker embeddings
- openai/whisper — Multilingual speech-to-text backbone used in some CosyVoice evaluation pipelines (jiwer dependency suggests ASR eval); complementary to TTS
- NVIDIA/TensorRT-LLM — Inference optimization for the LLM backbone in CosyVoice; integrated via recent triton trtllm runtime support for sub-100ms latency
- MoonSound/HiFi-GAN — Original HiFi-GAN vocoder architecture that CosyVoice's cosyvoice/hifigan/ directly builds upon for mel→waveform synthesis
- jik876/HiFi-GAN — Reference HiFi-GAN implementation; CosyVoice customizes discriminator.py and generator.py for speech-specific training
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add comprehensive unit tests for cosyvoice/utils/ module
The utils module contains critical helper functions (file_utils.py, common.py, frontend_utils.py, losses.py, mask.py, onnx.py) that lack visible test coverage. These utilities are foundational for training, inference, and data processing pipelines. Adding tests would catch regressions early and improve maintainability as the codebase evolves.
- [ ] Create tests/utils/test_file_utils.py for file I/O operations in cosyvoice/utils/file_utils.py
- [ ] Create tests/utils/test_frontend_utils.py for audio preprocessing in cosyvoice/utils/frontend_utils.py
- [ ] Create tests/utils/test_losses.py for loss computation functions in cosyvoice/utils/losses.py
- [ ] Create tests/utils/test_onnx.py for ONNX export/import utilities in cosyvoice/utils/onnx.py
- [ ] Add pytest configuration to setup.py or pyproject.toml
- [ ] Integrate test execution into existing .github/workflows/lint.yml
Add GitHub Action workflow for multi-version Python compatibility testing
The project supports multiple Python versions and heavy dependencies (torch, onnxruntime-gpu, tensorrt). Currently only lint.yml exists. A matrix-based test workflow would catch compatibility issues across Python 3.8-3.11 and help prevent breaking changes when upgrading core dependencies like numpy, onnx, and diffusers.
- [ ] Create .github/workflows/test-matrix.yml with Python 3.8, 3.9, 3.10, 3.11 matrix
- [ ] Include CPU-only test variants (using onnxruntime instead of onnxruntime-gpu) to reduce CI overhead
- [ ] Add dependency installation and basic inference tests for cosyvoice/cli/cosyvoice.py
- [ ] Configure workflow to run on PR, push to main, and scheduled weekly basis
- [ ] Document in CONTRIBUTING.md which tests are required before merging
Refactor cosyvoice/vllm/cosyvoice2.py into a module with clear inference APIs
The vllm/ directory suggests integration with vLLM (LLM serving framework) but contains only a single cosyvoice2.py file. As CosyVoice 3.0 is released, this module needs clear separation between inference, serving, and model loading concerns. Currently, the CLI inference path (cosyvoice/cli/cosyvoice.py) and vllm path likely duplicate logic. Refactoring would improve code reuse and support deployment at scale.
- [ ] Create cosyvoice/vllm/__init__.py with clear public API exports
- [ ] Split cosyvoice/vllm/cosyvoice2.py into: model_loader.py (model initialization), inference_engine.py (forward passes), and server.py (vLLM integration)
- [ ] Extract shared inference logic from cosyvoice/cli/model.py and cosyvoice/vllm/ into cosyvoice/core/inference.py
- [ ] Add docstrings and type hints (using pydantic models already in dependencies) to the refactored modules
- [ ] Add integration tests in tests/vllm/ demonstrating batch inference and concurrent requests
- [ ] Update README.md with vLLM deployment example referencing the new public API
🌿Good first issues
- Add unit tests for cosyvoice/tokenizer/tokenizer.py covering all 9+ language tokenization paths and the multilingual vocab boundary cases (missing currently in visible file structure); helps prevent silent encoding regressions. Difficulty: Medium
- Document the Hydra config schema (conf/ directory not in file listing) with example YAML for training, inference, and model export; add validation schema using Pydantic (pydantic==2.7.0 is already a dependency) to catch config errors early. Difficulty: Medium
- Create integration test suite in tests/ for end-to-end inference pipeline (text → tokenize → LLM → flow matching → vocoding) with mock models to catch breaking changes across cosyvoice/cli/cosyvoice.py, cosyvoice/flow/, and cosyvoice/hifigan/ boundaries. Difficulty: Hard
⭐Top contributors
- @aluminumbox — 72 commits
- @root — 8 commits
- @tyanz — 5 commits
- @yuekaizhang — 3 commits
- @GoyoUijin — 3 commits
📝Recent commits
- ace7c47 — Merge pull request #1850 from yuekaizhang/cosy3_pr (aluminumbox)
- 914454e — fix lint (yuekaizhang)
- f4d376e — update results (root)
- 4ddb996 — update results (root)
- c7686fa — rename files (root)
- 8f0b28b — add cosyvoice3 (root)
- 04bcadc — fix vllm yaml version (aluminumbox)
- 0e62489 — simply code (aluminumbox)
- 2145b58 — keep offline embedding/token extraction for compatibale (aluminumbox)
- f08872a — Merge pull request #1814 from hexisyztem/main (aluminumbox)
🔒Security observations
- High · Outdated and Vulnerable Dependencies — requirements/dependency specification. Multiple dependencies have known vulnerabilities: numpy==1.25.2 (CVE-2024-35493, CVE-2024-56303), protobuf==4.25 (potential deserialization issues), and onnxruntime-gpu==1.18.0 which may have unpatched security issues. These should be updated to latest secure versions. Fix: Update all dependencies to their latest stable versions. Run 'pip install --upgrade' and use tools like 'safety' or 'pip-audit' to identify and remediate known vulnerabilities. Implement automated dependency scanning in CI/CD pipeline.
- High · Unvalidated Model Loading from External Sources — cosyvoice/cli/model.py, cosyvoice/bin/export_jit.py, cosyvoice/bin/export_onnx.py. The codebase loads models from ModelScope and HuggingFace (as seen in README and cosyvoice/cli/model.py). There is no apparent verification of model integrity (checksums, signatures) which could allow man-in-the-middle attacks or malicious model injection. Fix: Implement cryptographic verification of downloaded models using checksums (SHA-256) or digital signatures. Verify model sources and cache them securely. Add integrity checks before loading models into memory.
- High · Potential Arbitrary Code Execution via ONNX/TensorRT — cosyvoice/bin/export_onnx.py, cosyvoice/utils/onnx.py, dependencies: onnx, onnxruntime-gpu, tensorrt. The codebase uses ONNX and TensorRT for model inference. These frameworks can be vectors for arbitrary code execution if untrusted models are loaded, as they support operator definitions that can execute arbitrary code. Fix: Only load models from trusted sources. Implement sandboxing or containerization for model inference. Use ONNX opset validation to restrict dangerous operations. Consider using ONNX Runtime with execution provider restrictions.
- Medium · Missing Input Validation in CLI Interface — cosyvoice/cli/cosyvoice.py, cosyvoice/cli/frontend.py. The CLI interface (cosyvoice/cli/cosyvoice.py) and frontend (cosyvoice/cli/frontend.py) may not properly validate user inputs before processing, which could lead to path traversal, injection attacks, or resource exhaustion. Fix: Implement strict input validation for all user-provided parameters. Validate file paths to prevent directory traversal. Implement rate limiting and resource limits for inference requests. Use parameterized inputs rather than string concatenation.
- Medium · Insecure File Operations — cosyvoice/utils/file_utils.py, cosyvoice/utils/executor.py. File operations in cosyvoice/utils/file_utils.py and cosyvoice/utils/executor.py may be vulnerable to path traversal attacks if user input is not properly sanitized before being used in file operations. Fix: Use pathlib.Path with .resolve() to prevent path traversal. Implement whitelist-based file path validation. Ensure all file operations are restricted to designated directories.
- Medium · Potential Denial of Service via Resource Exhaustion — cosyvoice/llm/llm.py, cosyvoice/flow/flow.py, cosyvoice/vllm/cosyvoice2.py. Large language model inference (cosyvoice/llm/llm.py) and audio generation (cosyvoice/flow/) can consume significant memory and CPU. No visible rate limiting, resource quotas, or request size limits detected. Fix: Implement rate limiting on API endpoints. Add request size and timeout limits. Implement memory and CPU usage monitoring with automatic request rejection when thresholds are exceeded. Add queue management for concurrent requests.
- Medium · Unsafe Pickle/Serialization Usage — cosyvoice/bin/average_model.py, cosyvoice/cli/model.py. PyTorch model loading via torch.load() in cosyvoice/bin/average_model.py and throughout the codebase can execute arbitrary code if loading untrusted pickled models. Fix: Use torch.load()
LLM-derived; treat as a starting point, not a security audit.
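Two of the suggested fixes above, SHA-256 verification of downloaded checkpoints and path resolution that rejects traversal, can be sketched with the standard library alone. The helper names (`sha256_of_file`, `verify_checkpoint`, `resolve_inside`) and the demo file are hypothetical. Nothing here is taken from the CosyVoice codebase, and where a trusted digest would come from is left as an assumption.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256):
    """Return True only if the file's SHA-256 matches the pinned digest."""
    return sha256_of_file(path) == expected_sha256.lower()


def resolve_inside(base_dir, user_path):
    """Resolve user_path under base_dir and reject paths that escape it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes {base}: {user_path}") from None
    return candidate


# Demo with a fake checkpoint; a real digest would be pinned by the publisher.
work_dir = tempfile.mkdtemp()
ckpt = os.path.join(work_dir, "model.pt")
with open(ckpt, "wb") as handle:
    handle.write(b"fake checkpoint bytes")
pinned = hashlib.sha256(b"fake checkpoint bytes").hexdigest()
checkpoint_ok = verify_checkpoint(ckpt, pinned)
```

A digest check only helps if the expected value arrives over a channel the attacker cannot also modify, so it complements rather than replaces loading models from trusted sources.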
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.