SparkAudio/Spark-TTS
Spark-TTS Inference Code
Onboarding: SparkAudio/Spark-TTS
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/SparkAudio/Spark-TTS shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Stale — last commit 1y ago
- 6 active contributors
- Apache-2.0 licensed
- ⚠ Stale — last commit 1y ago
- ⚠ Concentrated ownership — top contributor handles 67% of recent commits
- ⚠ No CI workflows detected
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live SparkAudio/Spark-TTS
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/SparkAudio/Spark-TTS.
What it runs against: a local clone of SparkAudio/Spark-TTS — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in SparkAudio/Spark-TTS | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | Last commit ≤ 423 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of SparkAudio/Spark-TTS. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/SparkAudio/Spark-TTS.git
#   cd Spark-TTS
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok()   { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of SparkAudio/Spark-TTS and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "SparkAudio/Spark-TTS(\.git)?\b" \
  && ok "origin remote is SparkAudio/Spark-TTS" \
  || miss "origin remote is not SparkAudio/Spark-TTS (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
# (the stock Apache-2.0 LICENSE file is headed "Apache License ... Version 2.0")
(grep -qi "Apache License" LICENSE 2>/dev/null && grep -q "Version 2.0" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 423 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~393d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/SparkAudio/Spark-TTS"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Spark-TTS is an LLM-based text-to-speech inference system built on Qwen2.5 that generates high-quality audio by predicting speech tokens directly with an LLM, then reconstructing waveforms via audio tokenizers and vocoders. It eliminates separate acoustic-feature generation pipelines, offering efficient bilingual (Chinese/English) synthesis with zero-shot voice cloning. Modular architecture:
- sparktts/models/ — audio_tokenizer.py and bicodec.py for audio encoding
- sparktts/modules/ — encoder_decoder/ (feature processing), fsq/ (quantization), speaker/ (voice embeddings), and blocks/ (Vocos vocoder)
- cli/ — the SparkTTS.py inference wrapper
- runtime/triton_trtllm/ — Triton model definitions for deployment
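The token-based flow described above can be sketched as three stages. Everything below is illustrative stand-in code, not the Spark-TTS API; the real counterparts are the speaker modules (sparktts/modules/speaker/), the Qwen2.5 LM, and the Vocos vocoder (sparktts/modules/blocks/vocos.py).

```python
# Illustrative stand-ins for the three pipeline stages; none of these
# function names exist in Spark-TTS.

def extract_speaker_embedding(prompt_wav):
    # Stage 1: condition on the prompt speaker (ECAPA-TDNN in the repo).
    return [sum(prompt_wav) / len(prompt_wav)]  # toy stand-in

def predict_speech_tokens(text, speaker_emb):
    # Stage 2: an LLM predicts discrete audio tokens from text plus the
    # speaker embedding (ignored by this toy stand-in).
    return [ord(ch) % 16 for ch in text]

def vocode(tokens):
    # Stage 3: a vocoder reconstructs a waveform from the tokens.
    return [t / 16.0 for t in tokens]

def synthesize(text, prompt_wav):
    emb = extract_speaker_embedding(prompt_wav)
    return vocode(predict_speech_tokens(text, emb))
```

The point of the sketch is the data flow: text and a speaker embedding go into the LLM, discrete tokens come out, and only the vocoder touches waveforms.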
👥Who it's for
ML engineers and researchers building production TTS systems who need efficient LLM-based speech synthesis with voice cloning support; deployment engineers integrating TTS into systems via Triton Inference Server; audio ML practitioners exploring token-based audio generation instead of traditional spectrogram approaches.
🌱Maturity & risk
Research-grade with early production scaffolding: the codebase includes production deployment patterns (Triton TensorRT-LLM runtime with Docker, gRPC/HTTP clients) alongside the core inference code, suggesting post-paper release maturity. However, the repo has seen no commits in roughly a year and lacks visible test suites, CI/CD pipelines, or extensive documentation beyond examples—typical of research code with an early production deployment setup.
Moderate risk: tight coupling to specific versions (PyTorch 2.5+, transformers 4.46.2) with 7 core dependencies could cause compatibility issues; no test coverage visible limits regression detection; deployment relies on external Triton/TensorRT-LLM infrastructure. Single research institution ownership (SparkAudio) means community support is limited.
Most recent areas of work
The code snapshot includes a complete Triton model repository setup with audio_tokenizer, spark_tts LLM, and vocoder services; an example inference script (example/infer.sh) and sample output (example/results/20250225113521.wav) point to inference testing around the time of the last commits. Conversion and templating scripts in runtime/triton_trtllm/scripts/ mark deployment optimization as the most recent focus.
🚀Get running
```bash
git clone https://github.com/SparkAudio/Spark-TTS.git && cd Spark-TTS
pip install -r requirements.txt
bash example/infer.sh
```
Daily commands:
```bash
# CLI inference
python cli/inference.py --text 'hello world' --speaker_wav example/prompt_audio.wav --output_path result.wav
# Triton server (then use client_grpc.py or client_http.py)
cd runtime/triton_trtllm && docker-compose up
# Gradio UI
python -m gradio cli/SparkTTS.py
```
🗺️Map of the codebase
- cli/inference.py: Main entry point for end-to-end inference: orchestrates text preprocessing, LLM generation, tokenization, and vocoding into final waveform output
- sparktts/models/audio_tokenizer.py: Core audio codec for converting waveforms to discrete tokens and vice versa; critical path in the inference pipeline
- sparktts/modules/encoder_decoder/wave_generator.py: Converts LLM-predicted tokens back to audio waveforms; directly impacts audio quality and synthesis speed
- runtime/triton_trtllm/model_repo/spark_tts/1/model.py: Triton deployment wrapper for the Qwen2.5 LLM; bridges inference script to production serving infrastructure
- sparktts/modules/blocks/vocos.py: Vocoder module responsible for final audio reconstruction from acoustic codes; critical for output audio quality
- sparktts/modules/speaker/ecapa_tdnn.py: Speaker embedding extractor for voice cloning; enables zero-shot voice adaptation to prompt speaker
🛠️How to make changes
- To add new speaker encoders: modify sparktts/modules/speaker/ (e.g., add new pooling in pooling_layers.py or an encoder in perceiver_encoder.py).
- To change audio tokenization: edit sparktts/models/audio_tokenizer.py or bicodec.py.
- To adjust decoding: modify sparktts/modules/encoder_decoder/wave_generator.py.
- To customize Triton deployment: edit model configs in runtime/triton_trtllm/model_repo/*/config.pbtxt and the corresponding */1/model.py.
🪤Traps & gotchas
- PyTorch 2.5.1 and torchaudio 2.5.1 are pinned versions; mixing with older/newer torch risks CUDA/library incompatibilities.
- Triton deployment requires a separately built TensorRT-LLM engine (not included; run runtime/triton_trtllm/scripts/convert_checkpoint.py).
- prompt_audio.wav must match the tokenizer's expected format (likely 16 kHz mono); the CLI performs no validation.
- FSQ (finite scalar quantization) indices from the LLM must match the audio_tokenizer codebook size, or decoding fails silently.
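Given the no-validation trap around prompt audio, a small pre-flight check is cheap. The 16 kHz mono expectation is an assumption inferred from typical speech tokenizers, so confirm it against sparktts/models/audio_tokenizer.py before adopting this; the function itself is hypothetical, not part of the repo.

```python
# Hypothetical pre-flight check for prompt audio; not Spark-TTS code.
import wave

def prompt_audio_problems(path, expected_rate=16000):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1:
            problems.append(f"expected mono, got {f.getnchannels()} channels")
        if f.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {f.getframerate()} Hz")
        if f.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Called at the top of cli/inference.py's entry point, this lets the CLI reject a bad prompt with clear messages instead of failing deep inside the tokenizer.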
💡Concepts to learn
- Discrete Audio Tokenization — Spark-TTS converts continuous waveforms to discrete tokens (via audio_tokenizer.py) so LLMs can predict speech—understanding this bridging layer is key to modifying codecs or debugging audio artifacts
- Finite Scalar Quantization (FSQ) — FSQ in sparktts/modules/fsq/ replaces traditional VQ-VAE for smoother token learning; critical for token prediction accuracy and understanding bicodec.py architecture
- Zero-Shot Voice Cloning — Via speaker embeddings (ECAPA-TDNN in sparktts/modules/speaker/), the model adapts to unseen speakers without fine-tuning; essential for understanding the voice conditioning pipeline in inference.py
- Token-Based Speech Synthesis — Unlike traditional spectral synthesis, Spark-TTS predicts discrete audio tokens via Qwen2.5 then reconstructs waveforms; understanding this paradigm shift explains why flow matching and separate acoustic models are eliminated
- Triton Inference Server Model Repository — Deployment requires config.pbtxt and model.py for each stage (tokenizer, LLM, vocoder); understanding Triton's composable pipeline pattern is essential for productionizing or scaling the inference stack
- Speaker Embeddings and Perceiver Encoders — sparktts/modules/speaker/perceiver_encoder.py extracts speaker-invariant embeddings from prompt audio for conditioning the LLM; modifying this affects voice cloning fidelity and cross-lingual capability
- Vocoder Design (Vocos) — sparktts/modules/blocks/vocos.py converts quantized audio codes back to waveforms; vocoder quality directly determines final audio naturalness, making this module critical for output tuning
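The FSQ bullet above can be made concrete in a few lines: bound each latent dimension, snap it to a small fixed set of levels, and read off a flat codebook index. This is a toy sketch of the general FSQ idea, not the straight-through-differentiable implementation in sparktts/modules/fsq/.

```python
import math

def fsq_quantize(z, levels):
    # Toy FSQ forward pass: each latent dim is bounded with tanh, then
    # snapped to one of `levels[i]` uniform values in [-1, 1]. The flat
    # index shows how per-dim codes combine into one codebook entry.
    quantized, index, stride = [], 0, 1
    for zi, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        code = int(round(math.tanh(zi) * half + half))  # 0 .. num_levels-1
        quantized.append(code / half - 1.0)             # back to [-1, 1]
        index += code * stride                          # flat codebook index
        stride *= num_levels
    return quantized, index
```

With levels [3, 5] the implicit codebook has 15 entries; no learned codebook vectors are needed, which is the practical appeal of FSQ over classic VQ-VAE.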
🔗Related repos
- qwen-team/Qwen2.5 — The base LLM backbone for Spark-TTS token prediction; understanding Qwen2.5 architecture and tokenization is essential for model modifications
- openai/whisper — Comparable token-based audio encoding approach for speech recognition; shares FSQ-style quantization patterns relevant to understanding the Spark-TTS audio codec
- NVIDIA/TensorRT-LLM — Required runtime dependency for Triton deployment in runtime/triton_trtllm/; enables inference optimization critical for production latency
- coqui-ai/TTS — Alternative open-source TTS system; useful for benchmarking quality, speed, and ease of use against traditional non-LLM approaches
- huggingface/transformers — Core dependency (4.46.2) for loading Qwen2.5 and handling tokenization; upgrading or debugging requires familiarity with the transformers pipeline API
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
```
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
```
Add integration tests for CLI inference pipeline
The repo has cli/SparkTTS.py and cli/inference.py but no visible test suite. Given that this is inference code with real audio I/O (example/infer.sh exists), adding integration tests would catch regressions when dependencies update (torch, torchaudio, transformers are frequently updated). Tests should verify end-to-end inference with example/prompt_audio.wav and validate output audio quality metrics.
- [ ] Create tests/ directory with test_inference.py
- [ ] Add pytest fixtures that load example/prompt_audio.wav
- [ ] Test cli/SparkTTS.py main entry point with various text inputs
- [ ] Validate output audio from example/results/ matches expected format (WAV, sample rate)
- [ ] Add tests for cli/inference.py functions with mock models to avoid large downloads
- [ ] Update requirements.txt to include pytest and pytest-cov as dev dependencies
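The "mock models" checklist item can be sketched as below. `synthesize_with` stands in for whatever orchestration function cli/inference.py actually exposes, and any patch target (e.g. a `load_model` helper) is hypothetical until checked against the real source.

```python
# Hedged sketch: test CLI orchestration without downloading checkpoints.
from unittest.mock import MagicMock

def synthesize_with(model, text):
    # Stand-in for the orchestration under test: it should delegate the
    # text to the model and return the waveform unchanged.
    return model.inference(text)

def test_synthesize_uses_model():
    fake = MagicMock()
    fake.inference.return_value = [0.0] * 16000  # 1 s of silence at 16 kHz
    wav = synthesize_with(fake, "hello")
    assert len(wav) == 16000
    fake.inference.assert_called_once_with("hello")

test_synthesize_uses_model()
```

In the real suite, `unittest.mock.patch` pointed at the repo's actual model loader would substitute the fake so tests never touch the network.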
Create Triton inference client wrapper for easier production deployment
The runtime/triton_trtllm/ directory contains two separate client implementations (client_grpc.py and client_http.py) but they appear to be boilerplate. Adding a high-level unified client class would make it easier for users to switch between gRPC/HTTP backends and handle the audio preprocessing (sparktts/utils/audio.py utilities should be integrated). This bridges the gap between the CLI tool and production Triton deployment.
- [ ] Create runtime/triton_trtllm/triton_client.py with TritonSparkttsClient class
- [ ] Implement automatic backend detection (gRPC vs HTTP) based on server availability
- [ ] Integrate sparktts/utils/audio.py preprocessing into the client
- [ ] Add methods for batch inference and streaming audio output
- [ ] Add error handling for model not ready, timeout, and invalid inputs
- [ ] Document usage in runtime/triton_trtllm/README.md with example code
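The backend auto-detection item above can be sketched with injected probes; in real code each probe would wrap tritonclient's `is_server_live()` check. The class below is a hypothetical sketch matching the checklist's `TritonSparkttsClient` name, not existing repo code.

```python
# Hedged sketch of gRPC-first backend selection with HTTP fallback.
class TritonSparkttsClient:
    def __init__(self, grpc_probe, http_probe):
        # Preference order: gRPC first (lower overhead), HTTP as fallback.
        self._probes = [("grpc", grpc_probe), ("http", http_probe)]
        self.backend = None

    def connect(self):
        for name, probe in self._probes:
            try:
                if probe():
                    self.backend = name
                    return name
            except Exception:
                continue  # unreachable backend: try the next one
        raise ConnectionError("no Triton backend reachable")
```

Injecting the probes as callables keeps the selection logic unit-testable without a running Triton server.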
Add unit tests for speaker encoder modules
The sparktts/modules/speaker/ directory contains ECAPA-TDNN and Perceiver encoder implementations (ecapa_tdnn.py, perceiver_encoder.py, speaker_encoder.py) but there are no visible tests. These are critical components for speaker conditioning. Unit tests would validate that speaker embeddings have correct shapes, handle various audio lengths properly, and that pooling layers work correctly with different input dimensions.
- [ ] Create tests/test_speaker_encoder.py
- [ ] Add tests for SpeakerEncoder forward pass with random audio tensors of varying lengths
- [ ] Test ECAPA-TDNN with different input shapes and verify embedding output dimension
- [ ] Test Perceiver encoder with different sequence lengths
- [ ] Add tests for all pooling layers in pooling_layers.py (temporal pooling, attention pooling, etc.)
- [ ] Verify gradients flow correctly through speaker components for training
- [ ] Add parametrized tests covering edge cases (very short audio, mono vs stereo)
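The shape tests in the checklist above reduce to one property: the embedding dimension must not depend on input length. The sketch below checks that property with a toy mean-pooling "encoder" in place of the real torch-based ECAPA-TDNN; the 192-dim size and 80-dim features are illustrative numbers, not values from the repo.

```python
# Toy stand-in for a speaker encoder: mean-pool per-frame features,
# then pad/truncate to a fixed embedding size.
def toy_speaker_encoder(frames, emb_dim=192):
    # frames: list of per-frame feature vectors (lists of floats).
    pooled = [sum(col) / len(frames) for col in zip(*frames)]
    return (pooled + [0.0] * emb_dim)[:emb_dim]

for n_frames in (1, 10, 500):  # "various audio lengths"
    frames = [[0.1] * 80 for _ in range(n_frames)]
    emb = toy_speaker_encoder(frames)
    assert len(emb) == 192     # embedding dim independent of input length
```

A real test would do the same loop over random torch tensors through SpeakerEncoder and assert on `emb.shape`.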
🌿Good first issues
- Add pytest suite for sparktts/models/audio_tokenizer.py covering encoding/decoding round-trip with synthetic and real audio to catch quantization artifacts
- Document required speaker prompt audio specifications (sample rate, duration, loudness) in README with validation code in cli/inference.py to reject invalid inputs early
- Extract hardcoded hyperparameters (vocoder settings, token vocabulary size, embedding dimensions) from sparktts/modules/ into sparktts/config.yaml with OmegaConf loading, enabling experiment tracking
⭐Top contributors
- @xinshengwang — 39 commits
- @yuekaizhang — 6 commits
- @pandamq — 6 commits
- @yuekaiz — 3 commits
- @xcv58 — 3 commits
📝Recent commits
- 2f1ea90 — [runtime] Benchmark streaming Triton TRT-LLM first chunk latency (#188) (yuekaizhang)
- 77b1786 — Feat: add triton runtime decoupled spark_tts python backend + decoupled tensorrt_llm backend (#126) (weedge)
- ee29f36 — update readme for Triton (xinshengwang)
- 1c17251 — Merge pull request #92 from yuekaizhang/triton (xinshengwang)
- b228154 — change usage (yuekaizhang)
- 6de055e — add license (yuekaizhang)
- 1354114 — Merge branch 'SparkAudio:main' into triton (yuekaizhang)
- 22e21cd — add code commit (yuekaizhang)
- 4d769ff — update docker file (yuekaizhang)
- 82f7b02 — update http client; launch script (yuekaiz)
🔒Security observations
- High · Outdated and Potentially Vulnerable Dependencies — requirements.txt. Multiple dependencies have known vulnerabilities or are significantly outdated. Notably: gradio==5.18.0 has had security issues in previous versions; transformers==4.46.2 is from mid-2024; numpy==2.2.3 and torch==2.5.1 may have unpatched vulnerabilities. No dependency pinning with hash verification is evident. Fix: 1) Update all dependencies to the latest stable versions. 2) Run pip-audit or safety check against requirements.txt. 3) Implement dependency scanning in a CI/CD pipeline. 4) Consider lock files (pip-compile, poetry.lock) with hash verification.
- High · Arbitrary File Upload via Gradio Interface — cli/SparkTTS.py, cli/inference.py. Gradio is used in the codebase (cli/SparkTTS.py likely contains a Gradio interface, based on dependencies). Gradio interfaces can be vulnerable to arbitrary file uploads if not properly validated. The inference code may accept user-supplied audio files without adequate validation, potentially leading to path traversal, DoS, or malicious file execution. Fix: 1) Implement strict file-type validation (whitelist only .wav files). 2) Enforce file-size limits. 3) Use secure temporary directories. 4) Scan uploaded files for malicious content. 5) Rate-limit Gradio endpoints. 6) Run Gradio with share=False in production.
- Medium · Untrusted Model Artifacts — sparktts/models/*.py, runtime/triton_trtllm/model_repo/*/1/model.py. The codebase uses safetensors==0.5.2 for model loading. Unlike pickle, safetensors does not execute code on load, but model repositories can still ship malicious configuration or custom code paths, and weights from untrusted sources remain a supply-chain risk. Fix: 1) Only load model files from trusted sources. 2) Verify model-file integrity using checksums/signatures. 3) Sandbox model inference. 4) Detect and reject suspicious model configurations. 5) Keep safetensors updated.
- Medium · Unvalidated Command Execution in Shell Scripts — example/infer.sh, runtime/triton_trtllm/run.sh, sparktts/utils/parse_options.sh. These scripts execute commands that may accept user input; if parameters are not properly escaped, shell injection is possible. Fix: 1) Prefer Python argparse over shell scripts for parameter handling. 2) If shell scripts are necessary, quote all variables: "$VAR", not $VAR. 3) Use set -e and set -u for safer defaults. 4) Validate all input parameters against whitelists. 5) Avoid eval and command substitution with user input.
- Medium · Docker Container Security Issues — runtime/triton_trtllm/Dockerfile.server, runtime/triton_trtllm/docker-compose.yml. Dockerfile.server likely runs as root and may not implement non-root execution, read-only filesystems, or resource limits; docker-compose.yml may expose ports without authentication. Fix: 1) Create a non-root user and switch to it. 2) Set proper ownership with --chown. 3) Mount filesystems read-only where possible. 4) Set memory/CPU limits in docker-compose.yml. 5) Don't expose inference endpoints directly; add authentication/authorization. 6) Scan the image with Trivy or a similar tool.
- Medium · Insecure gRPC/HTTP API Exposure — runtime/triton_trtllm/client_grpc.py, runtime/triton_trtllm/client_http.py. The clients interact with inference endpoints without apparent authentication, authorization, or TLS validation; if exposed to untrusted networks, this could allow unauthorized access to inference capabilities. Fix: 1) Implement mutual TLS (mTLS) for gRPC. 2) Add authentication tokens
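The shell-quoting advice in the observations above is easiest to get right by not going through a shell at all: in Python, pass an argument vector. The flag names below are illustrative, not example/infer.sh's real interface.

```python
# Hedged sketch: user input passed as argv elements is never interpreted
# by a shell, so metacharacters like ";" stay literal.
import subprocess  # used via the commented run() call below

def build_inference_cmd(text, prompt_wav):
    # Each list element reaches the program verbatim; no word splitting,
    # globbing, or command substitution happens.
    return ["bash", "example/infer.sh", "--text", text, "--prompt", prompt_wav]

cmd = build_inference_cmd("hello; rm -rf /", "a.wav")
# subprocess.run(cmd, check=True) would execute safely: the ";" is a
# literal character inside the --text argument, not a command separator.
```

Contrast with `subprocess.run(f"bash example/infer.sh --text {text}", shell=True)`, where the same input would run `rm -rf /`.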
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.