RepoPilot

OpenBMB/VoxCPM

VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

Healthy — healthy across the board

  • Use as dependency: Healthy. Permissive license, no critical CVEs, actively maintained — safe to depend on.
  • Fork & modify: Healthy. Has a license, tests, and CI — a clean foundation to fork and modify.
  • Learn from: Healthy. Documented and popular — a useful reference codebase to read through.
  • Deploy as-is: Healthy. No critical CVEs, sane security posture — runnable as-is.

  • Last commit 1w ago
  • 24+ active contributors
  • Distributed ownership (top contributor 25% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant: RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/openbmb/voxcpm)](https://repopilot.app/r/openbmb/voxcpm)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/openbmb/voxcpm on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: OpenBMB/VoxCPM

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/OpenBMB/VoxCPM shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

GO — Healthy across the board

  • Last commit 1w ago
  • 24+ active contributors
  • Distributed ownership (top contributor 25% of recent commits)
  • Apache-2.0 licensed
  • CI configured
  • Tests present

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live OpenBMB/VoxCPM repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/OpenBMB/VoxCPM.

What it runs against: a local clone of OpenBMB/VoxCPM — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in OpenBMB/VoxCPM | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches a relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code |
| 5 | Last commit ≤ 39 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>OpenBMB/VoxCPM</code></summary>

```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of OpenBMB/VoxCPM. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/OpenBMB/VoxCPM.git
#   cd VoxCPM
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of OpenBMB/VoxCPM and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "OpenBMB/VoxCPM(\.git)?\b" \
  && ok "origin remote is OpenBMB/VoxCPM" \
  || miss "origin remote is not OpenBMB/VoxCPM (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
( (grep -qi "Apache License" LICENSE && grep -q "Version 2.0" LICENSE) 2>/dev/null \
   || grep -qE "Apache-2\.0" pyproject.toml 2>/dev/null ) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "src/voxcpm/core.py" \\
  && ok "src/voxcpm/core.py" \\
  || miss "missing critical file: src/voxcpm/core.py"
test -f "src/voxcpm/model/voxcpm2.py" \\
  && ok "src/voxcpm/model/voxcpm2.py" \\
  || miss "missing critical file: src/voxcpm/model/voxcpm2.py"
test -f "src/voxcpm/modules/audiovae/audio_vae_v2.py" \\
  && ok "src/voxcpm/modules/audiovae/audio_vae_v2.py" \\
  || miss "missing critical file: src/voxcpm/modules/audiovae/audio_vae_v2.py"
test -f "src/voxcpm/modules/locdit/local_dit_v2.py" \\
  && ok "src/voxcpm/modules/locdit/local_dit_v2.py" \\
  || miss "missing critical file: src/voxcpm/modules/locdit/local_dit_v2.py"
test -f "src/voxcpm/training/config.py" \\
  && ok "src/voxcpm/training/config.py" \\
  || miss "missing critical file: src/voxcpm/training/config.py"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 39 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~9d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/OpenBMB/VoxCPM"
  exit 1
fi
```

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

VoxCPM2 is a 2B-parameter tokenizer-free text-to-speech system that generates continuous speech representations through a hybrid diffusion/autoregressive architecture, supporting 30 languages, voice design from natural-language descriptions, controllable voice cloning, and 48kHz studio-quality audio output. It bypasses discrete tokenization entirely, enabling highly natural and expressive multilingual synthesis. The layout is a modular monorepo: src/voxcpm/core.py is the main inference API; src/voxcpm/model/ contains voxcpm.py (v1) and voxcpm2.py (the v2 variant); src/voxcpm/modules/ houses audiovae/ (audio encoding), locdit/ (the tokenizer-free backbone), and layers/ (LoRA, scalar quantization). Training is configuration-driven via conf/ with per-version YAML specs; scripts/ contains train_voxcpm_finetune.py and inference examples; app.py and lora_ft_webui.py are Gradio interfaces.

👥Who it's for

Speech synthesis researchers and ML engineers building multilingual TTS applications, voice cloning products, or creative voice design systems; users who need production-grade audio synthesis across 30+ languages without tokenizer quantization artifacts.

🌱Maturity & risk

Actively developed and production-ready: VoxCPM2 is a major release (v2) trained on 2M+ hours of multilingual data with HuggingFace and ModelScope model hosting, GitHub CI/CD via publish-to-pypi.yml, and live demo at huggingface.co/spaces. The codebase shows mature structure with configuration versioning (v1, v1.5, v2 in conf/), training infrastructure, and LoRA fine-tuning support.

Moderate dependency risk: the project is actively maintained but relies heavily on custom modules (audiovae, the tokenizer-free locdit components) that are core to its differentiation. Test coverage looks thin relative to the surface area — a tests/ directory exists and scripts/ carries ad-hoc test scripts, but the core inference and training paths lack visible unit-test coverage. Single-org maintenance (OpenBMB) means the community contribution surface may be limited, and the large model size (2B parameters) requires significant compute for inference and fine-tuning.

Active areas of work

The project is actively maintained with VoxCPM2 as the latest release; visible work includes voice design capability addition, multilingual expansion to 30 languages, LoRA fine-tuning infrastructure (scripts/test_voxcpm_lora_infer.py, lora_ft_webui.py), and audio quality improvements (AudioVAE v2). PyPI publishing is automated via GitHub Actions.

🚀Get running

Clone and install: git clone https://github.com/OpenBMB/VoxCPM.git && cd VoxCPM && pip install -e . (the project is packaged via pyproject.toml, so an editable install is supported). Run inference: python scripts/test_voxcpm_ft_infer.py, or launch the web UI: python app.py (requires Gradio and model weights from HuggingFace).

Daily commands: python app.py launches the Gradio web interface for inference; python scripts/train_voxcpm_finetune.py trains with conf/voxcpm_v2/voxcpm_finetune_all.yaml or conf/voxcpm_v2/voxcpm_finetune_lora.yaml configs; python scripts/test_voxcpm_ft_infer.py and python scripts/test_voxcpm_lora_infer.py test fine-tuned and LoRA models respectively.
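For programmatic use, the same flow can be driven from Python. The sketch below is hypothetical: the VoxCPM class, the from_pretrained entry point, the generate() arguments, and the openbmb/VoxCPM2 model id are all assumptions modeled on the v1-style API — verify each against src/voxcpm/core.py before relying on it.

```python
# Hypothetical inference sketch — check src/voxcpm/core.py for the real API.
import soundfile as sf
from voxcpm import VoxCPM  # assumed export; confirm in src/voxcpm/__init__.py

# Model id is an assumption — use whichever checkpoint the README points at.
model = VoxCPM.from_pretrained("openbmb/VoxCPM2")

wav = model.generate(
    text="VoxCPM generates speech without a discrete audio tokenizer.",
    prompt_wav_path=None,  # optional reference clip for voice cloning
    prompt_text=None,      # transcript of the reference clip, if provided
)
sf.write("output.wav", wav, 16000)  # sample rate depends on the model version
```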

🗺️Map of the codebase

  • src/voxcpm/core.py — Main inference entry point for VoxCPM models; all TTS generation flows through here.
  • src/voxcpm/model/voxcpm2.py — Core VoxCPM2 model architecture implementation; defines the tokenizer-free TTS pipeline.
  • src/voxcpm/modules/audiovae/audio_vae_v2.py — Audio VAE encoder/decoder for converting between raw audio and latent space; critical for quality.
  • src/voxcpm/modules/locdit/local_dit_v2.py — Local Diffusion Transformer for iterative audio generation; core generative model.
  • src/voxcpm/training/config.py — Training configuration schema; defines all hyperparameters and model setup for fine-tuning.
  • src/voxcpm/modules/minicpm4/model.py — Text encoding backbone (MiniCPM4); processes input text into semantic embeddings.
  • src/voxcpm/training/data.py — Data loading and preprocessing pipeline for training; handles JSONL dataset ingestion.

🛠️How to make changes

Add a new Language or Voice Variant

  1. Update text normalization rules to handle the new language's phonetics. (src/voxcpm/utils/text_normalize.py)
  2. Add language-specific tokens or embeddings to MiniCPM4 if needed. (src/voxcpm/modules/minicpm4/model.py)
  3. Create a new config file for fine-tuning on the language. (conf/voxcpm_v2/voxcpm_finetune_lora.yaml)
  4. Run training with the new config to adapt the model. (scripts/train_voxcpm_finetune.py)
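Step 1 typically amounts to registering language-specific rewrite rules. A minimal, hypothetical sketch of that idea — the rule table and function names are illustrative, not the actual API of src/voxcpm/utils/text_normalize.py:

```python
import re

# Hypothetical per-language normalization rules; the real module
# (src/voxcpm/utils/text_normalize.py) defines its own structure.
NORMALIZATION_RULES = {
    "de": [
        (re.compile(r"(\d+),(\d+)"), r"\1 Komma \2"),  # German decimal comma
        (re.compile(r"\bz\.B\."), "zum Beispiel"),      # common abbreviation
    ],
}

def normalize(text: str, lang: str) -> str:
    """Apply the rewrite rules registered for `lang`, if any."""
    for pattern, replacement in NORMALIZATION_RULES.get(lang, []):
        text = pattern.sub(replacement, text)
    return text

print(normalize("Das kostet 3,50 Euro, z.B. im Laden.", "de"))
```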

Fine-tune VoxCPM2 on Custom Voice Data

  1. Prepare training data in JSONL format (see examples/train_data_example.jsonl). (examples/train_data_example.jsonl)
  2. Choose a training config (full or LoRA) based on available compute. (conf/voxcpm_v2/voxcpm_finetune_lora.yaml)
  3. Update dataset path and hyperparameters in the config. (src/voxcpm/training/config.py)
  4. Launch training script with your config. (scripts/train_voxcpm_finetune.py)
  5. Load the checkpoint and test inference. (scripts/test_voxcpm_lora_infer.py)
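For step 1, a manifest can be generated programmatically. The field names below are assumptions — copy the real schema from examples/train_data_example.jsonl:

```python
import json

# Hypothetical training rows — field names are assumptions; mirror the actual
# schema shipped in examples/train_data_example.jsonl.
rows = [
    {"audio": "data/speaker1/utt001.wav", "text": "Hello, this is a sample utterance."},
    {"audio": "data/speaker1/utt002.wav", "text": "Fine-tuning adapts the base voice."},
]
with open("my_voice_train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```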

Add Custom Voice Cloning Encoder

  1. Implement a new speaker encoder in the acoustic modules. (src/voxcpm/modules/locenc/local_encoder.py)
  2. Integrate the encoder into VoxCPM2 model. (src/voxcpm/model/voxcpm2.py)
  3. Update the core inference to use the new encoder. (src/voxcpm/core.py)
  4. Add test cases for the new encoder. (tests/test_model_utils.py)
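A skeletal shape for step 1, assuming a mel-spectrogram input; the class name and dimensions are hypothetical, and the interfaces to match live in src/voxcpm/modules/locenc/local_encoder.py:

```python
import torch
import torch.nn as nn

class RefSpeakerEncoder(nn.Module):
    """Hypothetical speaker-encoder skeleton: mel features in, one embedding out.

    Names and dimensions are illustrative; match the real interfaces in
    src/voxcpm/modules/locenc/local_encoder.py before wiring into voxcpm2.py.
    """

    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels) -> fixed-size (batch, dim) embedding
        x = self.proj(mels)
        _, h = self.encoder(x)
        return h[-1]
```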

Optimize Model for Inference Speed

  1. Check current dtype and optimization options. (scripts/test_pick_runtime_dtype.py)
  2. Use model utilities to apply quantization or compilation. (src/voxcpm/model/utils.py)
  3. Consider compression via ZipEnhancer utility. (src/voxcpm/zipenhancer.py)
  4. Benchmark inference time and quality tradeoffs. (src/voxcpm/core.py)
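A sketch of the dtype heuristic step 1 refers to. This only illustrates the tradeoff scripts/test_pick_runtime_dtype.py probes (its actual logic may differ); the MPS/bf16 note comes from the VOXCPM_MPS_DTYPE fix in recent commits:

```python
import torch

def pick_runtime_dtype() -> torch.dtype:
    """Hypothetical heuristic in the spirit of scripts/test_pick_runtime_dtype.py.

    bf16 where the hardware supports it well, fp16 on older CUDA parts,
    fp32 on CPU. Verify against the script's real behavior.
    """
    if torch.cuda.is_available():
        return torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    if torch.backends.mps.is_available():
        # Recent commits mention an MPS bf16 override (VOXCPM_MPS_DTYPE).
        return torch.bfloat16
    return torch.float32

print(pick_runtime_dtype())
```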

🔧Why these technologies

  • Diffusion Transformers (Local DiT) — Enables iterative refinement of audio generation without discrete tokenizers; better quality and naturalness than AR/VQ approaches.
  • Audio VAE (Continuous latent space) — Provides continuous audio representation enabling smooth synthesis and voice cloning without quantization artifacts.
  • MiniCPM4 Text Encoder — Lightweight yet capable text understanding backbone; reduces compute vs. full LLMs while preserving semantic fidelity.
  • LoRA Fine-tuning — Parameter-efficient adaptation for custom voices/languages; reduces memory and training time for practitioners.
  • HuggingFace Transformers + Hub — model loading and distribution; weights are hosted on HuggingFace (mirrored on ModelScope), and the MiniCPM4 backbone likely loads through the transformers stack.

🪤Traps & gotchas

  • Model weights: inference requires downloading the 2B model from HuggingFace (openbmb/VoxCPM2) or ModelScope; weights are not bundled in the repo.
  • Audio dtype: scripts/test_pick_runtime_dtype.py suggests runtime dtype selection is non-trivial; float16 vs. float32 inference may require explicit configuration.
  • Config versioning: conf/ has separate v1, v1.5, and v2 trees; using the wrong version's YAML will cause model mismatches.
  • LoRA training: lora_ft_webui.py and scripts/test_voxcpm_lora_infer.py use distinct training and inference paths; LoRA adapters and base weights must be compatible.
  • Language handling: 30-language support is auto-detected; no explicit language tags are needed in prompts (unlike some TTS systems).


💡Concepts to learn

  • Tokenizer-Free Speech Generation — Core design philosophy of VoxCPM2 — bypassing discrete audio tokenization to generate continuous speech directly, avoiding quantization artifacts and enabling higher fidelity synthesis
  • Diffusion Autoregressive Architecture — VoxCPM2's hybrid approach combining diffusion for quality and autoregressive generation for efficiency; essential to understanding its inference pipeline
  • LoRA (Low-Rank Adaptation) — Implemented in src/voxcpm/modules/layers/lora.py for efficient voice cloning and fine-tuning without updating all 2B parameters; critical for production customization (a minimal concept sketch follows this list)
  • AudioVAE (Variational Autoencoder for Audio) — Continuous audio representation engine (audio_vae_v2.py); replaces discrete tokenizers with learned latent space for high-quality speech reconstruction
  • Voice Design via Natural Language Description — VoxCPM2's ability to synthesize novel voices from text prompts (e.g., 'warm female voice, 40s, slightly emotional') without reference audio; requires embedding voice semantics in the language model
  • Scalar Quantization Layer — Implemented in src/voxcpm/modules/layers/scalar_quantization_layer.py for discretizing continuous voice embeddings in voice design; enables structured voice space navigation
  • Controllable Voice Cloning — VoxCPM2's synthesis from short reference clips with optional style guidance (emotion, pace); combines speaker embeddings with style conditioning, visible in app.py and lora_ft_webui.py interfaces
  • OpenBMB/MiniCPM — VoxCPM2's language backbone is built on MiniCPM-4; understanding MiniCPM's architecture is essential for language understanding in the TTS pipeline
  • coqui-ai/TTS — Alternative open-source TTS framework with similar goals (multilingual synthesis); key comparison point for tokenizer-free vs. discrete token approaches
  • openai/gpt-4-turbo-preview — MiniCPM is positioned as an open alternative to large proprietary models; understanding large-model inference patterns is relevant to serving 2B-parameter inference
  • google-research/soundstream — SoundStream is a neural audio codec paradigm that influenced tokenizer-free audio representations; VoxCPM2's AudioVAE v2 draws conceptual lineage from this space
  • huggingface/transformers — VoxCPM2 integrates with HuggingFace model hosting and likely uses transformers for language backbone; required for model loading and fine-tuning
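To make the LoRA item above concrete, here is a textbook LoRA linear wrapper. It is a concept illustration only — the repo's actual adapter in src/voxcpm/modules/layers/lora.py may differ in naming, scaling, and which projections it wraps:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Textbook LoRA wrapper — a concept sketch, not the repo's implementation."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus a trainable low-rank residual.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Only the rank×(in+out) adapter parameters train, which is why LoRA fine-tuning fits on far smaller GPUs than updating all 2B weights.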

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add comprehensive unit tests for core inference pipeline (src/voxcpm/core.py and src/voxcpm/model/)

The repo lacks unit tests for the main inference components. Currently only script-based tests exist (scripts/test_voxcpm_ft_infer.py, scripts/test_voxcpm_lora_infer.py). Adding proper unit tests would improve code reliability, enable CI/CD validation, and help new contributors understand the API contract. The core.py module is the entry point for users and needs robust test coverage.

  • [ ] Create tests/test_core.py for src/voxcpm/core.py inference methods
  • [ ] Create tests/test_voxcpm_models.py for src/voxcpm/model/voxcpm.py and voxcpm2.py
  • [ ] Add fixture-based tests for mock audio inputs and speaker references using examples/example.wav
  • [ ] Test multilingual TTS generation with different language inputs
  • [ ] Verify LoRA fine-tuning integration via tests/test_lora_integration.py
  • [ ] Integrate unit tests into .github/workflows/publish-to-pypi.yml before package publishing
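A skeletal starting point for the first checklist items — every repo-specific name here (the voxcpm import, VoxCPM class, generate() signature, model id) is an assumption to replace with the real API from src/voxcpm/core.py:

```python
# tests/test_core.py — hypothetical skeleton for the inference test suite.
import numpy as np
import pytest

@pytest.fixture(scope="session")
def model():
    voxcpm = pytest.importorskip("voxcpm")
    # Assumed entry point and checkpoint id — swap in the real ones.
    return voxcpm.VoxCPM.from_pretrained("openbmb/VoxCPM2")

def test_generate_returns_audio(model):
    wav = np.asarray(model.generate(text="unit test utterance"))
    assert wav.ndim >= 1 and wav.size > 0
    assert np.isfinite(wav).all()  # no NaNs/Infs in synthesized audio
```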

Add integration tests and validation workflow for training pipeline (src/voxcpm/training/)

The training module (src/voxcpm/training/config.py, data.py, accelerator.py, validate.py) has no visible test coverage. Scripts like scripts/train_voxcpm_finetune.py exist but lack automated validation. Adding tests would ensure training configuration schemas are correct, data loaders work properly, and accelerator compatibility is maintained across updates.

  • [ ] Create tests/test_training_config.py to validate training configuration loading from conf/voxcpm_v2/
  • [ ] Create tests/test_training_data.py to validate JSONL data parsing using examples/train_data_example.jsonl
  • [ ] Add tests/test_packers.py for src/voxcpm/training/packers.py batch packing logic
  • [ ] Create tests/test_validator.py for src/voxcpm/training/validate.py validation functions
  • [ ] Add GitHub Actions workflow .github/workflows/test-training.yml to run training tests on CPU with small models
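The JSONL-parsing item can start as small as this — it assumes only that the example manifest is one JSON object per line:

```python
# tests/test_training_data.py — minimal well-formedness check for the manifest.
import json
from pathlib import Path

def test_example_manifest_is_valid_jsonl():
    path = Path("examples/train_data_example.jsonl")
    lines = path.read_text(encoding="utf-8").splitlines()
    assert lines, "example manifest should not be empty"
    for i, line in enumerate(lines, 1):
        row = json.loads(line)  # raises on malformed JSON
        assert isinstance(row, dict), f"line {i} is not a JSON object"
```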

Add module-level documentation and docstring tests for src/voxcpm/modules/ subpackages

The modules directory contains complex components (AudioVAE, LocalDIT, LocalEncoder, MiniCPM4) without visible docstrings or module-level documentation. The repo has ReadTheDocs setup but likely missing technical docs for these internal components. Adding docstrings and doctest examples would improve maintainability and allow auto-generating API documentation.

  • [ ] Add comprehensive docstrings to src/voxcpm/modules/audiovae/audio_vae.py and audio_vae_v2.py following NumPy style
  • [ ] Add docstrings to src/voxcpm/modules/locdit/local_dit.py and unified_cfm.py explaining CFM diffusion logic
  • [ ] Add docstrings to src/voxcpm/modules/locenc/local_encoder.py for speech encoding pipeline
  • [ ] Add docstrings to src/voxcpm/modules/layers/lora.py explaining LoRA application mechanism
  • [ ] Create docs/API_REFERENCE.md or update ReadTheDocs config to auto-generate from docstrings
  • [ ] Add doctest examples in docstrings (testable via pytest --doctest-modules)
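The doctest item might look like this in practice. The function is a generic audio utility written for illustration, not something from the VoxCPM codebase:

```python
def db_to_amplitude(db: float) -> float:
    """Convert a decibel value to a linear amplitude ratio.

    Parameters
    ----------
    db : float
        Gain in decibels.

    Returns
    -------
    float
        Linear amplitude ratio, ``10 ** (db / 20)``.

    Examples
    --------
    >>> db_to_amplitude(0.0)
    1.0
    >>> db_to_amplitude(20.0)
    10.0
    """
    return 10.0 ** (db / 20.0)
```

Running pytest --doctest-modules then executes the Examples block as a test, keeping documentation and behavior in sync.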

🌿Good first issues

  • Add unit tests for src/voxcpm/modules/audiovae/audio_vae_v2.py and audio_vae.py to verify encoding/decoding correctness on example WAV files (examples/example.wav exists); currently no visible test coverage for audio quality.
  • Document voice design API in src/voxcpm/core.py with docstring examples showing how to generate voices from natural language descriptions (gender, age, tone); API exists but user-facing documentation is missing.
  • Create a minimal inference example script in examples/ demonstrating voice cloning workflow (load reference speaker, clone with style guidance) matching the patterns in scripts/test_voxcpm_ft_infer.py but accessible to users without training knowledge.


📝Recent commits

  • 19b6bf7 — fix: handle LoRA rank mismatch during inference in lora_ft_webui (liuxin)
  • 86bff0f — Merge pull request #253 from SuperMarioYL/feat/validate-training-data (a710128)
  • dd7b78f — refactor(cli): defer soundfile and voxcpm.core imports to inference commands (SuperMarioYL)
  • 29577d5 — test: fix test_cli_validate_exit_code to use --manifest flag and assert specific exit code (SuperMarioYL)
  • 4509bec — fix: address four validation correctness issues from review (SuperMarioYL)
  • cd79a64 — Merge pull request #263 from Oumnya/fix/mps-bf16-dtype (a710128)
  • 96d605b — fix(mps): align VOXCPM_MPS_DTYPE override set with get_dtype parser (Oumnya)
  • a9b03a7 — Merge pull request #277 from gluttony-10/main (a710128)
  • 77f847f — Merge pull request #268 from shaun0927/fix/lora-weights-only (a710128)
  • d3cc887 — feat: enhance control text processing in VoxCPMDemo (gluttony-10)

🔒Security observations

VoxCPM codebase shows moderate security posture with several areas requiring attention. Primary concerns include unsafe YAML loading, unvalidated model weight loading from external sources, and potential path traversal vulnerabilities in data handling. The project lacks visible hardcoded credentials but would benefit from stricter input validation, dependency version pinning, and verification of secure configuration practices.

  • High · Potential Arbitrary Code Execution via Unsafe YAML Loading — conf/voxcpm_v*/voxcpm_finetune_*.yaml and src/voxcpm/training/config.py. Configuration files in the conf/ directory use YAML. If they are loaded with yaml.load() instead of yaml.safe_load(), deserialized objects could execute arbitrary Python code. Fix: ensure all YAML loading uses yaml.safe_load() and verify that the configuration loading in training/config.py uses safe parsing (a generic mitigation sketch follows this list).
  • High · Unvalidated Model Loading from External Sources — src/voxcpm/model/voxcpm.py, src/voxcpm/model/voxcpm2.py. The codebase appears to load pre-trained models from Hugging Face and ModelScope. Without proper signature verification, this could allow loading malicious model weights. The src/voxcpm/model/ directory suggests direct model loading without validation. Fix: Implement cryptographic verification of downloaded models using checksums or signatures. Validate model integrity before loading. Use trusted model repositories only.
  • Medium · Potential Arbitrary File Read via Path Traversal — src/voxcpm/training/data.py, examples/train_data_example.jsonl. Training-data loading from JSONL files and file-path handling in training/data.py could be vulnerable to path traversal if user input is not properly sanitized. Fix: implement strict path validation and normalization — resolve paths with os.path.abspath() and verify the resolved paths stay within expected directories; never trust user-provided file paths directly (covered in the same sketch below).
  • Medium · Missing Dependency Version Pinning — pyproject.toml, uv.lock. Without access to pyproject.toml or requirements files content, dependency versions cannot be verified. Unpinned dependencies could introduce vulnerabilities through transitive dependency updates. Fix: Implement strict dependency version pinning in pyproject.toml. Use lock files (uv.lock appears present). Regularly audit dependencies with tools like pip-audit or safety.
  • Medium · Potential Command Injection via CLI Arguments — src/voxcpm/cli.py. CLI interface (src/voxcpm/cli.py) may accept user input that is passed to system commands or shell operations without proper sanitization. Fix: Use subprocess with shell=False. Avoid string concatenation for commands. Validate and sanitize all CLI inputs. Use argument parsing libraries (argparse) with type validation.
  • Low · Exposed PyPI Publishing Workflow — .github/workflows/publish-to-pypi.yml. GitHub Actions workflow for publishing to PyPI (.github/workflows/publish-to-pypi.yml) could potentially expose credentials if secrets are not properly configured. Fix: Verify all publishing credentials are stored as GitHub Secrets, never hardcoded. Use trusted publisher authentication. Implement branch protection rules. Review workflow permissions.
  • Low · Debug/Development Code in Production — app_old.py, scripts/. Presence of both app.py and app_old.py, plus test files in scripts/, suggests development artifacts may be included in distribution. Fix: Remove development/debug files from production distribution. Use .gitignore and MANIFEST.in to exclude unnecessary files from packages. Keep test files in tests/ directory only.
  • Low · Potential Information Disclosure via Error Messages — app.py, lora_ft_webui.py. Web interface (app.py) may expose sensitive system information through error stack traces if error handling is not properly configured. Fix: Implement proper error handling with generic error messages for users. Log detailed errors server-side only. Disable debug mode in production. Use structured exception handling.
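The YAML and path-traversal findings above map to standard Python hardening patterns. A minimal sketch, assuming PyYAML is the parser in use — this is not a claim about how training/config.py or training/data.py currently load anything:

```python
import os
import yaml  # PyYAML

def load_config(path: str) -> dict:
    """Parse a config with safe_load so YAML tags cannot instantiate objects."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)  # never yaml.load() on untrusted input

def resolve_inside(base_dir: str, user_path: str) -> str:
    """Resolve a user-supplied path and reject anything escaping base_dir."""
    base = os.path.abspath(base_dir)
    resolved = os.path.abspath(os.path.join(base, user_path))
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"path escapes {base_dir!r}: {user_path!r}")
    return resolved
```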

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.