FunAudioLLM/CosyVoice

Item: FunAudioLLM/CosyVoice
Rating: 5
Author: RepoPilot

Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.

Healthy

Healthy across the board

Use as dependency: Healthy

Permissive license, no critical CVEs, actively maintained — safe to depend on.

Fork & modify: Healthy

Has a license, tests, and CI — clean foundation to fork and modify.

Learn from: Healthy

Documented and popular — useful reference codebase to read through.

Deploy as-is: Healthy

No critical CVEs, sane security posture — runnable as-is.

✓Last commit 4d ago
✓14 active contributors
✓Apache-2.0 licensed

✓CI configured
✓Tests present
⚠Concentrated ownership — top contributor handles 72% of recent commits

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Healthy](https://repopilot.app/api/badge/funaudiollm/cosyvoice)](https://repopilot.app/r/funaudiollm/cosyvoice)

Paste at the top of your README.md — renders inline like a shields.io badge.

Onboarding doc

Onboarding: FunAudioLLM/CosyVoice

Generated by RepoPilot · 2026-05-07

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

Verify the contract. Run the bash script in the "Verify before trusting" section below. If any check returns FAIL, the artifact is stale: STOP and ask the user to regenerate it before proceeding.
Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/FunAudioLLM/CosyVoice shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

Last commit 4d ago
14 active contributors
Apache-2.0 licensed
CI configured
Tests present
⚠ Concentrated ownership — top contributor handles 72% of recent commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

✅Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live FunAudioLLM/CosyVoice repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/FunAudioLLM/CosyVoice.

What it runs against: a local clone of FunAudioLLM/CosyVoice — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in FunAudioLLM/CosyVoice | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | Last commit ≤ 34 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>FunAudioLLM/CosyVoice</code></summary>

#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of FunAudioLLM/CosyVoice. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/FunAudioLLM/CosyVoice.git
#   cd CosyVoice
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of FunAudioLLM/CosyVoice and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "FunAudioLLM/CosyVoice(\.git)?\b" \
  && ok "origin remote is FunAudioLLM/CosyVoice" \
  || miss "origin remote is not FunAudioLLM/CosyVoice (artifact may be from a fork)"

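# 2. License matches what RepoPilot saw.
# A stock Apache-2.0 LICENSE file opens with "Apache License" / "Version 2.0"
# (indented), not the SPDX id, so accept either spelling anywhere in the file.
(grep -qiE "Apache-2\.0|Apache License" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"[[:space:]]*:[[:space:]]*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift: was Apache-2.0 at generation time"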

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 34 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~4d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/FunAudioLLM/CosyVoice"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).
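For an agent loop that wants more than the exit code, the `ok:` / `FAIL:` prefixes can be parsed directly. The following is a minimal sketch; the names `VerifyResult` and `parse_verify_output` are hypothetical helpers and not part of RepoPilot, and the sample text simply mimics the message format the script above prints.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class VerifyResult:
    """Messages from the verification script, grouped by outcome."""
    passed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)

    @property
    def trusted(self) -> bool:
        # The artifact is only trusted when no check failed.
        return not self.failed


def parse_verify_output(output: str) -> VerifyResult:
    """Collect every line that starts with 'ok:' or 'FAIL:'; ignore the rest."""
    result = VerifyResult()
    for line in output.splitlines():
        match = re.match(r"^(ok|FAIL):\s+(.*)$", line)
        if match is None:
            continue
        bucket = result.passed if match.group(1) == "ok" else result.failed
        bucket.append(match.group(2))
    return result


# Sample text in the script's output format (not captured from a real run).
sample = """ok:   origin remote is FunAudioLLM/CosyVoice
ok:   license is Apache-2.0
FAIL: default branch main no longer exists
ok:   last commit was 4 days ago (artifact saw ~4d)
"""
result = parse_verify_output(sample)
print(len(result.passed), len(result.failed), result.trusted)  # prints: 3 1 False
```

In practice the `output` string would come from running the script with `subprocess.run(..., capture_output=True, text=True)` and reading `stdout`.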

</details>

⚡TL;DR

CosyVoice is a multilingual text-to-speech (TTS) system built on large language models (LLMs) that generates natural speech with zero-shot voice cloning. It supports 9+ languages and 18+ Chinese dialects with state-of-the-art content consistency, speaker similarity, and prosody naturalness. Synthesis uses a flow-matching stage with a DiT backbone, followed by HiFi-GAN vocoding.

The repository is organized around a single cosyvoice/ package with modular subsystems:

cosyvoice/llm: language model backbone
cosyvoice/flow: flow-matching diffusion with a DiT transformer
cosyvoice/hifigan: neural vocoder
cosyvoice/tokenizer: multilingual token encoding
cosyvoice/transformer: core attention/decoder blocks
cosyvoice/cli: inference CLI
cosyvoice/bin: training/export scripts
cosyvoice/dataset: data loading

Entry points are cosyvoice/cli/cosyvoice.py and cosyvoice/cli/model.py.

👥Who it's for

Speech synthesis researchers and audio engineers building production TTS systems who need multilingual support, speaker cloning, and instruction-based control (emotion, speed, volume, dialect); ML practitioners who want to train or fine-tune large voice models with full-stack inference, training, and deployment pipelines.

🌱Maturity & risk

Production-ready with active development. Fun-CosyVoice 3.0 was released in December 2025, with recent model checkpoints on ModelScope and HuggingFace. The repo shows continuous iteration (v1.0→v2.0→v3.0) and a structured CI workflow (.github/workflows/lint.yml), although test coverage metrics are not visible in the file listing.

Moderate dependency risk: 30+ critical dependencies (conformer, diffusers, lightning, onnxruntime-gpu, tensorrt) make the project fragile to changes in the GPU/ONNX ecosystem. No test suite is visible in the top-level structure, which raises regression risk for contributions. Ownership is concentrated: the top contributor accounts for 72% of recent commits. Breaking changes are likely between major versions (v1→v2→v3), and no clear migration guides are visible.

Active areas of work

Actively shipping Fun-CosyVoice 3.0 (Dec 2025) with base and RL-tuned models. Other recent work includes NVIDIA TensorRT/Triton runtime support (July 2025), vLLM integration for faster inference (May 2025), and streaming support (both text-in and audio-out, with 150ms latency). Model checkpoints are released continuously on ModelScope and HuggingFace.

🚀Get running

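git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
# Editable install; this only works if the checkout ships a setup.py or pyproject.toml (not confirmed here):
pip install -e .
# Otherwise install the dependencies directly (package pins inferred from the dependency list):
pip install conformer==0.3.2 diffusers==0.29.0 lightning==2.2.4 onnxruntime-gpu==1.18.0 torch torchaudio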

Daily commands

Inference (CLI):

python -m cosyvoice.cli.cosyvoice --model_path <model_ckpt> --text "Hello" --output_path output.wav

Training:

python cosyvoice/bin/train.py --config conf/train.yaml

Export to ONNX/JIT:

python cosyvoice/bin/export_onnx.py --model_path ckpt.pth
python cosyvoice/bin/export_jit.py --model_path ckpt.pth

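(These commands are inferred from the bin/ layout, not taken from the repository's documentation. Script names and flags are unverified, so check each script's argument parser and the upstream README before running them; see the conf/ directory for config templates.)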

🗺️Map of the codebase

cosyvoice/cli/cosyvoice.py: Main inference entry point; handles text→speech pipeline, voice cloning, instruction parsing, and streaming orchestration
cosyvoice/flow/flow.py: Core flow-matching diffusion model; defines reverse-time diffusion schedule and acoustic feature generation from LLM embeddings
cosyvoice/flow/DiT/dit.py: Diffusion Transformer encoder backbone; implements cross-attention with LLM tokens and duration/pitch conditioning
cosyvoice/hifigan/hifigan.py: Neural vocoder converting mel-spectrogram → waveform; critical for speech quality and inference latency
cosyvoice/llm/llm.py: Language model semantic token generation; decoupled from diffusion, can swap backends (vLLM, TensorRT-LLM, etc.)
cosyvoice/tokenizer/tokenizer.py: Multilingual token encoding using tiktoken; handles 9+ languages and dialect-specific phoneme conversion
cosyvoice/bin/train.py: Training orchestration script; integrates Lightning, dataset loading, and checkpoint management for base + RL fine-tuning
cosyvoice/dataset/processor.py: Data pipeline: text preprocessing, alignment, F0/duration extraction via pyworld; critical for training data quality
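The map above implies a four-stage data flow: text is tokenized, the LLM turns text tokens into semantic speech tokens, the flow model turns those into mel frames, and the vocoder turns mel frames into a waveform. The sketch below only illustrates that ordering. Every name in it (`TtsPipeline`, the stage signatures, the toy lambdas) is hypothetical; the real classes and their interfaces live in the modules listed above and are not shown in this document.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures, chosen only to make the ordering explicit.
Tokenize = Callable[[str], List[int]]              # cosyvoice/tokenizer
SpeechLM = Callable[[List[int]], List[int]]        # cosyvoice/llm
FlowModel = Callable[[List[int]], List[List[float]]]  # cosyvoice/flow
Vocoder = Callable[[List[List[float]]], List[float]]  # cosyvoice/hifigan


@dataclass
class TtsPipeline:
    tokenize: Tokenize
    llm: SpeechLM
    flow: FlowModel
    vocoder: Vocoder

    def synthesize(self, text: str) -> List[float]:
        text_tokens = self.tokenize(text)      # text -> token ids
        speech_tokens = self.llm(text_tokens)  # token ids -> semantic speech tokens
        mel = self.flow(speech_tokens)         # speech tokens -> mel frames
        return self.vocoder(mel)               # mel frames -> waveform samples


# Toy stand-ins so the sketch runs; they perform no real synthesis.
pipeline = TtsPipeline(
    tokenize=lambda text: [ord(ch) for ch in text],
    llm=lambda tokens: [tok % 7 for tok in tokens],
    flow=lambda tokens: [[float(tok)] * 2 for tok in tokens],  # 2 fake mel bins per frame
    vocoder=lambda mel: [sum(frame) for frame in mel],
)
waveform = pipeline.synthesize("hi")
print(waveform)  # prints: [12.0, 0.0]
```

Keeping each stage behind a plain callable mirrors the claim above that the LLM stage is decoupled from the diffusion stage, so one stage can be swapped without touching the others.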

🛠️How to make changes

Adding a new language: Extend cosyvoice/tokenizer/assets/ with a new tiktoken vocab and update the text normalization rules in cosyvoice/cli/frontend.py.
Modifying model architecture: Edit cosyvoice/flow/flow.py (diffusion core) or cosyvoice/flow/DiT/dit.py (transformer encoder).
Training pipeline changes: Modify cosyvoice/bin/train.py and cosyvoice/dataset/processor.py.
CLI features: Extend the cosyvoice/cli/cosyvoice.py entry point.
Vocoding: Replace or fine-tune cosyvoice/hifigan/hifigan.py.

🪤Traps & gotchas

CUDA/ONNX version alignment: onnxruntime-gpu==1.18.0 requires a specific CUDA version; mismatches cause silent inference failures.
Config management: Hydra config overrides via the command line (conf/ is referenced but not in the file list) are order-sensitive; incorrect YAML syntax breaks silently.
Model checkpoint locations: The code assumes ModelScope/HuggingFace models auto-download; offline environments need pre-cached checkpoints.
WeTextProcessing dependency: Custom text normalization library not in standard PyPI; requires separate installation and may lag on non-Linux systems.
Tiktoken assets: multilingual_zh_ja_yue_char_del.tiktoken is baked into the repo; changes require a regeneration script (not visible in the file list).
F0 extraction (pyworld): Requires a consistent audio sample rate across training and inference; mismatches silently degrade prosody.

💡Concepts to learn

Flow Matching — CosyVoice uses flow-matching (cosyvoice/flow/flow_matching.py) instead of traditional VAE-based TTS; enables faster, more stable acoustic feature generation from text embeddings without explicit duration models
Diffusion Transformer (DiT) — cosyvoice/flow/DiT/dit.py replaces U-Net with pure transformer blocks for conditional acoustic synthesis; improves content consistency and reduces hallucinations via cross-attention to LLM tokens
Zero-Shot Voice Cloning — CosyVoice's core capability: generates speech in a target speaker's voice using only a reference audio prompt without speaker embeddings or fine-tuning; powered by LLM semantic understanding
Byte Pair Encoding (BPE) via tiktoken — cosyvoice/tokenizer/ uses OpenAI's tiktoken for multilingual tokenization; enables efficient vocabulary sharing across 9+ languages while preserving language-specific structure
Mel-Spectrogram Vocoding — HiFi-GAN (cosyvoice/hifigan/hifigan.py) converts diffusion-generated mel-spectrograms → waveforms; critical bottleneck for inference speed and audio quality; alternative to autoregressive WaveNet
Length Regulation — cosyvoice/flow/length_regulator.py maps phoneme/token sequences to acoustic frame counts; replaces traditional duration models in flow-matching pipeline for more robust control
Cross-Lingual Prosody Transfer — CosyVoice 3.0 supports cross-lingual zero-shot cloning (generate English speech in Mandarin speaker's voice); achieved via LLM's multilingual semantic embeddings and flow-matched acoustic features
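To make the flow-matching entry above concrete, here is a toy sketch of one common variant: a straight-line path from noise to data, with Euler integration at sampling time. It is not CosyVoice's implementation. The function names are hypothetical, and `oracle_velocity` stands in for the trained network by using the known target, which a real model never sees at inference.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for the straight path: d(x_t)/dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted: Vector, x0: Vector, x1: Vector) -> float:
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


def euler_sample(velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy target standing in for one frame of acoustic features.
target = [0.5, -1.0, 2.0]


def oracle_velocity(x: Vector, t: float) -> Vector:
    # On the straight path, (x1 - x_t) / (1 - t) equals x1 - x0.
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle_velocity, noise, steps=10)
max_error = max(abs(s - b) for s, b in zip(sample, target))
```

With a perfect velocity field the path is straight, so even ten Euler steps land on the target up to floating-point error. This is the intuition behind flow matching needing few sampling steps.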

microsoft/VoiceConversion — Zero-shot voice conversion technique that inspired CosyVoice's voice cloning without explicit speaker embeddings
openai/whisper — Multilingual speech-to-text backbone used in some CosyVoice evaluation pipelines (jiwer dependency suggests ASR eval); complementary to TTS
NVIDIA/TensorRT-LLM — Inference optimization for the LLM backbone in CosyVoice; integrated via recent triton trtllm runtime support for sub-100ms latency
MoonSound/HiFi-GAN — Original HiFi-GAN vocoder architecture that CosyVoice's cosyvoice/hifigan/ directly builds upon for mel→waveform synthesis
jik876/HiFi-GAN — Reference HiFi-GAN implementation; CosyVoice customizes discriminator.py and generator.py for speech-specific training

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for cosyvoice/utils/ module

The utils module contains critical helper functions (file_utils.py, common.py, frontend_utils.py, losses.py, mask.py, onnx.py) that lack visible test coverage. These utilities are foundational for training, inference, and data processing pipelines. Adding tests would catch regressions early and improve maintainability as the codebase evolves.

[ ] Create tests/utils/test_file_utils.py for file I/O operations in cosyvoice/utils/file_utils.py
[ ] Create tests/utils/test_frontend_utils.py for audio preprocessing in cosyvoice/utils/frontend_utils.py
[ ] Create tests/utils/test_losses.py for loss computation functions in cosyvoice/utils/losses.py
[ ] Create tests/utils/test_onnx.py for ONNX export/import utilities in cosyvoice/utils/onnx.py
[ ] Add pytest configuration to setup.py or pyproject.toml
[ ] Integrate test execution into existing .github/workflows/lint.yml
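As a rough sketch of the shape such tests could take, the example below checks a padding-mask helper. The real API of cosyvoice/utils/mask.py is not shown in this document, so `make_pad_mask` here is a hypothetical pure-Python stand-in with an assumed signature; an actual test would import the helper from the package instead of defining it.

```python
from typing import List


def make_pad_mask(lengths: List[int], max_len: int = 0) -> List[List[bool]]:
    """Hypothetical stand-in: True marks padded positions, one row per sequence."""
    max_len = max_len or max(lengths)
    return [[pos >= length for pos in range(max_len)] for length in lengths]


def test_pad_mask_marks_only_padding() -> None:
    mask = make_pad_mask([3, 1])
    assert mask == [[False, False, False], [False, True, True]]


def test_pad_mask_respects_explicit_max_len() -> None:
    mask = make_pad_mask([2], max_len=4)
    assert mask == [[False, False, True, True]]


# pytest would discover the test_* functions; calling them directly keeps the sketch runnable.
test_pad_mask_marks_only_padding()
test_pad_mask_respects_explicit_max_len()
```

Small, deterministic cases like these run without a GPU or model checkpoints, which makes them a good fit for the existing lint workflow.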

Add GitHub Action workflow for multi-version Python compatibility testing

The project supports multiple Python versions and heavy dependencies (torch, onnxruntime-gpu, tensorrt). Currently only lint.yml exists. A matrix-based test workflow would catch compatibility issues across Python 3.8-3.11 and help prevent breaking changes when upgrading core dependencies like numpy, onnx, and diffusers.

[ ] Create .github/workflows/test-matrix.yml with Python 3.8, 3.9, 3.10, 3.11 matrix
[ ] Include CPU-only test variants (using onnxruntime instead of onnxruntime-gpu) to reduce CI overhead
[ ] Add dependency installation and basic inference tests for cosyvoice/cli/cosyvoice.py
[ ] Configure workflow to run on PR, push to main, and scheduled weekly basis
[ ] Document in CONTRIBUTING.md which tests are required before merging

Refactor cosyvoice/vllm/cosyvoice2.py into a module with clear inference APIs

The vllm/ directory suggests integration with vLLM (LLM serving framework) but contains only a single cosyvoice2.py file. As CosyVoice 3.0 is released, this module needs clear separation between inference, serving, and model loading concerns. Currently, the CLI inference path (cosyvoice/cli/cosyvoice.py) and vllm path likely duplicate logic. Refactoring would improve code reuse and support deployment at scale.

[ ] Create cosyvoice/vllm/__init__.py with clear public API exports
[ ] Split cosyvoice/vllm/cosyvoice2.py into: model_loader.py (model initialization), inference_engine.py (forward passes), and server.py (vLLM integration)
[ ] Extract shared inference logic from cosyvoice/cli/model.py and cosyvoice/vllm/ into cosyvoice/core/inference.py
[ ] Add docstrings and type hints (using pydantic models already in dependencies) to the refactored modules
[ ] Add integration tests in tests/vllm/ demonstrating batch inference and concurrent requests
[ ] Update README.md with vLLM deployment example referencing the new public API

🌿Good first issues

Add unit tests for cosyvoice/tokenizer/tokenizer.py covering all 9+ language tokenization paths and the multilingual vocab boundary cases (currently missing from the visible file structure); helps prevent silent encoding regressions. (Difficulty: Medium)
Document the Hydra config schema (conf/ directory not in file listing) with example YAML for training, inference, and model export; add a validation schema using Pydantic (pydantic==2.7.0 is already a dependency) to catch config errors early. (Difficulty: Medium)
Create an integration test suite in tests/ for the end-to-end inference pipeline (text → tokenize → LLM → flow matching → vocoding) with mock models to catch breaking changes across the cosyvoice/cli/cosyvoice.py, cosyvoice/flow/, and cosyvoice/hifigan/ boundaries. (Difficulty: Hard)

⭐Top contributors

@aluminumbox — 72 commits
@root — 8 commits
@tyanz — 5 commits
@yuekaizhang — 3 commits
@GoyoUijin — 3 commits
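The "72% of recent commits" ownership signal can be checked against the counts listed above. This sketch assumes a 100-commit analysis window, which is inferred from 72 commits being reported as 72% and is not stated in the report. The function name `top_contributor_share` is hypothetical.

```python
from typing import Dict


def top_contributor_share(commits_by_author: Dict[str, int], total_commits: int) -> float:
    """Fraction of total_commits authored by the single most active contributor."""
    return max(commits_by_author.values()) / total_commits


listed = {"aluminumbox": 72, "root": 8, "tyanz": 5, "yuekaizhang": 3, "GoyoUijin": 3}

# Assumption: a 100-commit window, so contributors not listed above account for the rest.
share_of_window = top_contributor_share(listed, total_commits=100)
# Share among only the five listed contributors (72 of 91 commits).
share_of_listed = top_contributor_share(listed, total_commits=sum(listed.values()))

print(f"{share_of_window:.0%} of window, {share_of_listed:.0%} of listed")
# prints: 72% of window, 79% of listed
```

Either way the top contributor holds a clear majority, which is what the concentrated-ownership warning flags.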

📝Recent commits