SparkAudio/Spark-TTS
Spark-TTS Inference Code
Onboarding: SparkAudio/Spark-TTS
Generated by RepoPilot · 2026-05-07 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/SparkAudio/Spark-TTS shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
WAIT — Stale — last commit 1y ago
- 6 active contributors
- Apache-2.0 licensed
- ⚠ Stale — last commit 1y ago
- ⚠ Concentrated ownership — top contributor handles 67% of recent commits
- ⚠ No CI workflows detected
- ⚠ No test directory detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an
agent acts on it, the checks below confirm that the live SparkAudio/Spark-TTS
repo on your machine still matches what RepoPilot saw. If any fail,
the artifact is stale — regenerate it at
repopilot.app/r/SparkAudio/Spark-TTS.
What it runs against: a local clone of SparkAudio/Spark-TTS — the script
inspects git remote, the LICENSE file, file paths in the working
tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in SparkAudio/Spark-TTS | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch main exists | Catches branch renames |
| 4 | Last commit ≤ 423 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of SparkAudio/Spark-TTS. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/SparkAudio/Spark-TTS.git
#   cd Spark-TTS
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok()   { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of SparkAudio/Spark-TTS and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "SparkAudio/Spark-TTS(\.git)?\b" \
  && ok "origin remote is SparkAudio/Spark-TTS" \
  || miss "origin remote is not SparkAudio/Spark-TTS (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
# (the stock Apache-2.0 LICENSE file is headed "Apache License ... Version 2.0")
(grep -qi "Apache License" LICENSE 2>/dev/null && grep -q "Version 2.0" LICENSE 2>/dev/null \
  || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 423 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~393d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/SparkAudio/Spark-TTS"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if
anything failed, so it composes cleanly into agent loops
(./verify.sh || regenerate-and-retry).
⚡TL;DR
Spark-TTS is an LLM-based text-to-speech inference system built on Qwen2.5 that generates high-quality audio by predicting speech tokens directly with an LLM, then reconstructing waveforms via audio tokenizers and vocoders. It eliminates separate acoustic-feature generation pipelines, offering efficient bilingual (Chinese/English) synthesis with zero-shot voice cloning. Modular architecture:
- sparktts/models/ — audio_tokenizer.py and bicodec.py for audio encoding
- sparktts/modules/ — encoder_decoder/ (feature processing), fsq/ (quantization), speaker/ (voice embeddings), and blocks/ (Vocos vocoder)
- cli/ — the SparkTTS.py inference wrapper
- runtime/triton_trtllm/ — Triton model definitions for deployment
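The token-based flow described above can be sketched as three stages. Everything below is illustrative stand-in code, not the Spark-TTS API; the real counterparts are the speaker modules (sparktts/modules/speaker/), the Qwen2.5 LM, and the Vocos vocoder (sparktts/modules/blocks/vocos.py).

```python
# Illustrative stand-ins for the three pipeline stages; none of these
# function names exist in Spark-TTS.

def extract_speaker_embedding(prompt_wav):
    # Stage 1: condition on the prompt speaker (ECAPA-TDNN in the repo).
    return [sum(prompt_wav) / len(prompt_wav)]  # toy stand-in

def predict_speech_tokens(text, speaker_emb):
    # Stage 2: an LLM predicts discrete audio tokens from text plus the
    # speaker embedding (ignored by this toy stand-in).
    return [ord(ch) % 16 for ch in text]

def vocode(tokens):
    # Stage 3: a vocoder reconstructs a waveform from the tokens.
    return [t / 16.0 for t in tokens]

def synthesize(text, prompt_wav):
    emb = extract_speaker_embedding(prompt_wav)
    return vocode(predict_speech_tokens(text, emb))
```

The point of the sketch is the data flow: text and a speaker embedding go into the LLM, discrete tokens come out, and only the vocoder touches waveforms.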
👥Who it's for
ML engineers and researchers building production TTS systems who need efficient LLM-based speech synthesis with voice cloning support; deployment engineers integrating TTS into systems via Triton Inference Server; audio ML practitioners exploring token-based audio generation instead of traditional spectrogram approaches.
🌱Maturity & risk
Research-grade with early production scaffolding: the codebase includes production deployment patterns (Triton TensorRT-LLM runtime with Docker, gRPC/HTTP clients) alongside the core inference code, suggesting post-paper release maturity. However, the repo has seen no commits in roughly a year and lacks visible test suites, CI/CD pipelines, or extensive documentation beyond examples—typical of research code with an early production deployment setup.
Moderate risk: tight coupling to specific versions (PyTorch 2.5+, transformers 4.46.2) with 7 core dependencies could cause compatibility issues; no test coverage visible limits regression detection; deployment relies on external Triton/TensorRT-LLM infrastructure. Single research institution ownership (SparkAudio) means community support is limited.
Most recent areas of work
The code snapshot includes a complete Triton model repository setup with audio_tokenizer, spark_tts LLM, and vocoder services; an example inference script (example/infer.sh) and sample output (example/results/20250225113521.wav) point to inference testing around the time of the last commits. Conversion and templating scripts in runtime/triton_trtllm/scripts/ mark deployment optimization as the most recent focus.
🚀Get running
```bash
git clone https://github.com/SparkAudio/Spark-TTS.git && cd Spark-TTS
pip install -r requirements.txt
bash example/infer.sh
```
Daily commands:
```bash
# CLI inference
python cli/inference.py --text 'hello world' --speaker_wav example/prompt_audio.wav --output_path result.wav
# Triton server (then use client_grpc.py or client_http.py)
cd runtime/triton_trtllm && docker-compose up
# Gradio UI
python -m gradio cli/SparkTTS.py
```
🗺️Map of the codebase
- cli/inference.py: Main entry point for end-to-end inference: orchestrates text preprocessing, LLM generation, tokenization, and vocoding into final waveform output
- sparktts/models/audio_tokenizer.py: Core audio codec for converting waveforms to discrete tokens and vice versa; critical path in the inference pipeline
- sparktts/modules/encoder_decoder/wave_generator.py: Converts LLM-predicted tokens back to audio waveforms; directly impacts audio quality and synthesis speed
- runtime/triton_trtllm/model_repo/spark_tts/1/model.py: Triton deployment wrapper for the Qwen2.5 LLM; bridges inference script to production serving infrastructure
- sparktts/modules/blocks/vocos.py: Vocoder module responsible for final audio reconstruction from acoustic codes; critical for output audio quality
- sparktts/modules/speaker/ecapa_tdnn.py: Speaker embedding extractor for voice cloning; enables zero-shot voice adaptation to prompt speaker
🛠️How to make changes
- To add new speaker encoders: modify sparktts/modules/speaker/ (e.g., add new pooling in pooling_layers.py or an encoder in perceiver_encoder.py).
- To change audio tokenization: edit sparktts/models/audio_tokenizer.py or bicodec.py.
- To adjust decoding: modify sparktts/modules/encoder_decoder/wave_generator.py.
- To customize Triton deployment: edit model configs in runtime/triton_trtllm/model_repo/*/config.pbtxt and the corresponding */1/model.py.
🪤Traps & gotchas
- PyTorch 2.5.1 and torchaudio 2.5.1 are pinned versions; mixing with older/newer torch risks CUDA/library incompatibilities.
- Triton deployment requires a separately built TensorRT-LLM engine (not included; run runtime/triton_trtllm/scripts/convert_checkpoint.py).
- prompt_audio.wav must match the tokenizer's expected format (likely 16 kHz mono); the CLI performs no validation.
- FSQ (finite scalar quantization) indices from the LLM must match the audio_tokenizer codebook size, or decoding fails silently.
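Given the no-validation trap around prompt audio, a small pre-flight check is cheap. The 16 kHz mono expectation is an assumption inferred from typical speech tokenizers, so confirm it against sparktts/models/audio_tokenizer.py before adopting this; the function itself is hypothetical, not part of the repo.

```python
# Hypothetical pre-flight check for prompt audio; not Spark-TTS code.
import wave

def prompt_audio_problems(path, expected_rate=16000):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1:
            problems.append(f"expected mono, got {f.getnchannels()} channels")
        if f.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {f.getframerate()} Hz")
        if f.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Called at the top of cli/inference.py's entry point, this lets the CLI reject a bad prompt with clear messages instead of failing deep inside the tokenizer.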
💡Concepts to learn
- Discrete Audio Tokenization — Spark-TTS converts continuous waveforms to discrete tokens (via audio_tokenizer.py) so LLMs can predict speech—understanding this bridging layer is key to modifying codecs or debugging audio artifacts
- Finite Scalar Quantization (FSQ) — FSQ in sparktts/modules/fsq/ replaces traditional VQ-VAE for smoother token learning; critical for token prediction accuracy and understanding bicodec.py architecture
- Zero-Shot Voice Cloning — Via speaker embeddings (ECAPA-TDNN in sparktts/modules/speaker/), the model adapts to unseen speakers without fine-tuning; essential for understanding the voice conditioning pipeline in inference.py
- Token-Based Speech Synthesis — Unlike traditional spectral synthesis, Spark-TTS predicts discrete audio tokens via Qwen2.5 then reconstructs waveforms; understanding this paradigm shift explains why flow matching and separate acoustic models are eliminated
- Triton Inference Server Model Repository — Deployment requires config.pbtxt and model.py for each stage (tokenizer, LLM, vocoder); understanding Triton's composable pipeline pattern is essential for productionizing or scaling the inference stack
- Speaker Embeddings and Perceiver Encoders — sparktts/modules/speaker/perceiver_encoder.py extracts speaker-invariant embeddings from prompt audio for conditioning the LLM; modifying this affects voice cloning fidelity and cross-lingual capability
- Vocoder Design (Vocos) — sparktts/modules/blocks/vocos.py converts quantized audio codes back to waveforms; vocoder quality directly determines final audio naturalness, making this module critical for output tuning
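The FSQ bullet above can be made concrete in a few lines: bound each latent dimension, snap it to a small fixed set of levels, and read off a flat codebook index. This is a toy sketch of the general FSQ idea, not the straight-through-differentiable implementation in sparktts/modules/fsq/.

```python
import math

def fsq_quantize(z, levels):
    # Toy FSQ forward pass: each latent dim is bounded with tanh, then
    # snapped to one of `levels[i]` uniform values in [-1, 1]. The flat
    # index shows how per-dim codes combine into one codebook entry.
    quantized, index, stride = [], 0, 1
    for zi, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        code = int(round(math.tanh(zi) * half + half))  # 0 .. num_levels-1
        quantized.append(code / half - 1.0)             # back to [-1, 1]
        index += code * stride                          # flat codebook index
        stride *= num_levels
    return quantized, index
```

With levels [3, 5] the implicit codebook has 15 entries; no learned codebook vectors are needed, which is the practical appeal of FSQ over classic VQ-VAE.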
🔗Related repos
- qwen-team/Qwen2.5 — The base LLM backbone for Spark-TTS token prediction; understanding Qwen2.5 architecture and tokenization is essential for model modifications
- openai/whisper — Comparable token-based audio encoding approach for speech recognition; shares FSQ-style quantization patterns relevant to understanding the Spark-TTS audio codec
- NVIDIA/TensorRT-LLM — Required runtime dependency for Triton deployment in runtime/triton_trtllm/; enables inference optimization critical for production latency
- coqui-ai/TTS — Alternative open-source TTS system; useful for benchmarking quality, speed, and ease of use against traditional non-LLM approaches
- huggingface/transformers — Core dependency (4.46.2) for loading Qwen2.5 and handling tokenization; upgrading or debugging requires familiarity with the transformers pipeline API
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
```
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
```
Add integration tests for CLI inference pipeline
The repo has cli/SparkTTS.py and cli/inference.py but no visible test suite. Given that this is inference code with real audio I/O (example/infer.sh exists), adding integration tests would catch regressions when dependencies update (torch, torchaudio, transformers are frequently updated). Tests should verify end-to-end inference with example/prompt_audio.wav and validate output audio quality metrics.
- [ ] Create tests/ directory with test_inference.py
- [ ] Add pytest fixtures that load example/prompt_audio.wav
- [ ] Test cli/SparkTTS.py main entry point with various text inputs
- [ ] Validate output audio from example/results/ matches expected format (WAV, sample rate)
- [ ] Add tests for cli/inference.py functions with mock models to avoid large downloads
- [ ] Update requirements.txt to include pytest and pytest-cov as dev dependencies
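The "mock models" checklist item can be sketched as below. `synthesize_with` stands in for whatever orchestration function cli/inference.py actually exposes, and any patch target (e.g. a `load_model` helper) is hypothetical until checked against the real source.

```python
# Hedged sketch: test CLI orchestration without downloading checkpoints.
from unittest.mock import MagicMock

def synthesize_with(model, text):
    # Stand-in for the orchestration under test: it should delegate the
    # text to the model and return the waveform unchanged.
    return model.inference(text)

def test_synthesize_uses_model():
    fake = MagicMock()
    fake.inference.return_value = [0.0] * 16000  # 1 s of silence at 16 kHz
    wav = synthesize_with(fake, "hello")
    assert len(wav) == 16000
    fake.inference.assert_called_once_with("hello")

test_synthesize_uses_model()
```

In the real suite, `unittest.mock.patch` pointed at the repo's actual model loader would substitute the fake so tests never touch the network.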
Create Triton inference client wrapper for easier production deployment
The runtime/triton_trtllm/ directory contains two separate client implementations (client_grpc.py and client_http.py) but they appear to be boilerplate. Adding a high-level unified client class would make it easier for users to switch between gRPC/HTTP backends and handle the audio preprocessing (sparktts/utils/audio.py utilities should be integrated). This bridges the gap between the CLI tool and production Triton deployment.
- [ ] Create runtime/triton_trtllm/triton_client.py with TritonSparkttsClient class
- [ ] Implement automatic backend detection (gRPC vs HTTP) based on server availability
- [ ] Integrate sparktts/utils/audio.py preprocessing into the client
- [ ] Add methods for batch inference and streaming audio output
- [ ] Add error handling for model not ready, timeout, and invalid inputs
- [ ] Document usage in runtime/triton_trtllm/README.md with example code
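The backend auto-detection item above can be sketched with injected probes; in real code each probe would wrap tritonclient's `is_server_live()` check. The class below is a hypothetical sketch matching the checklist's `TritonSparkttsClient` name, not existing repo code.

```python
# Hedged sketch of gRPC-first backend selection with HTTP fallback.
class TritonSparkttsClient:
    def __init__(self, grpc_probe, http_probe):
        # Preference order: gRPC first (lower overhead), HTTP as fallback.
        self._probes = [("grpc", grpc_probe), ("http", http_probe)]
        self.backend = None

    def connect(self):
        for name, probe in self._probes:
            try:
                if probe():
                    self.backend = name
                    return name
            except Exception:
                continue  # unreachable backend: try the next one
        raise ConnectionError("no Triton backend reachable")
```

Injecting the probes as callables keeps the selection logic unit-testable without a running Triton server.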
Add unit tests for speaker encoder modules
The sparktts/modules/speaker/ directory contains ECAPA-TDNN and Perceiver encoder implementations (ecapa_tdnn.py, perceiver_encoder.py, speaker_encoder.py) but there are no visible tests. These are critical components for speaker conditioning. Unit tests would validate that speaker embeddings have correct shapes, handle various audio lengths properly, and that pooling layers work correctly with different input dimensions.
- [ ] Create tests/test_speaker_encoder.py
- [ ] Add tests for SpeakerEncoder forward pass with random audio tensors of varying lengths
- [ ] Test ECAPA-TDNN with different input shapes and verify embedding output dimension
- [ ] Test Perceiver encoder with different sequence lengths
- [ ] Add tests for all pooling layers in pooling_layers.py (temporal pooling, attention pooling, etc.)
- [ ] Verify gradients flow correctly through speaker components for training
- [ ] Add parametrized tests covering edge cases (very short audio, mono vs stereo)
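The shape tests in the checklist above reduce to one property: the embedding dimension must not depend on input length. The sketch below checks that property with a toy mean-pooling "encoder" in place of the real torch-based ECAPA-TDNN; the 192-dim size and 80-dim features are illustrative numbers, not values from the repo.

```python
# Toy stand-in for a speaker encoder: mean-pool per-frame features,
# then pad/truncate to a fixed embedding size.
def toy_speaker_encoder(frames, emb_dim=192):
    # frames: list of per-frame feature vectors (lists of floats).
    pooled = [sum(col) / len(frames) for col in zip(*frames)]
    return (pooled + [0.0] * emb_dim)[:emb_dim]

for n_frames in (1, 10, 500):  # "various audio lengths"
    frames = [[0.1] * 80 for _ in range(n_frames)]
    emb = toy_speaker_encoder(frames)
    assert len(emb) == 192     # embedding dim independent of input length
```

A real test would do the same loop over random torch tensors through SpeakerEncoder and assert on `emb.shape`.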
🌿Good first issues
- Add pytest suite for sparktts/models/audio_tokenizer.py covering encoding/decoding round-trip with synthetic and real audio to catch quantization artifacts
- Document required speaker prompt audio specifications (sample rate, duration, loudness) in README with validation code in cli/inference.py to reject invalid inputs early
- Extract hardcoded hyperparameters (vocoder settings, token vocabulary size, embedding dimensions) from sparktts/modules/ into sparktts/config.yaml with OmegaConf loading, enabling experiment tracking
⭐Top contributors
- @xinshengwang — 39 commits
- @yuekaizhang — 6 commits
- @pandamq — 6 commits
- @yuekaiz — 3 commits
- @xcv58 — 3 commits
📝Recent commits
- 2f1ea90 — [runtime] Benchmark streaming Triton TRT-LLM first chunk latency (#188) (yuekaizhang)
- 77b1786 — Feat: add triton runtime decoupled spark_tts python backend + decoupled tensorrt_llm backend (#126) (weedge)
- ee29f36 — update readme for Triton (xinshengwang)
- 1c17251 — Merge pull request #92 from yuekaizhang/triton (xinshengwang)
- b228154 — change usage (yuekaizhang)
- 6de055e — add license (yuekaizhang)
- 1354114 — Merge branch 'SparkAudio:main' into triton (yuekaizhang)
- 22e21cd — add code commit (yuekaizhang)
- 4d769ff — update docker file (yuekaizhang)
- 82f7b02 — update http client; launch script (yuekaiz)
🔒Security observations
- High · Outdated and Potentially Vulnerable Dependencies — requirements.txt. Multiple dependencies have known vulnerabilities or are significantly outdated. Notably: gradio==5.18.0 has had security issues in previous versions; transformers==4.46.2 is from mid-2024; numpy==2.2.3 and torch==2.5.1 may have unpatched vulnerabilities. No dependency pinning with hash verification is evident. Fix: 1) Update all dependencies to the latest stable versions. 2) Run pip-audit or safety check against requirements.txt. 3) Implement dependency scanning in a CI/CD pipeline. 4) Consider lock files (pip-compile, poetry.lock) with hash verification.
- High · Arbitrary File Upload via Gradio Interface — cli/SparkTTS.py, cli/inference.py. Gradio is used in the codebase (cli/SparkTTS.py likely contains a Gradio interface, based on dependencies). Gradio interfaces can be vulnerable to arbitrary file uploads if not properly validated. The inference code may accept user-supplied audio files without adequate validation, potentially leading to path traversal, DoS, or malicious file execution. Fix: 1) Implement strict file-type validation (whitelist only .wav files). 2) Enforce file-size limits. 3) Use secure temporary directories. 4) Scan uploaded files for malicious content. 5) Rate-limit Gradio endpoints. 6) Run Gradio with share=False in production.
- Medium · Untrusted Model Artifacts — sparktts/models/*.py, runtime/triton_trtllm/model_repo/*/1/model.py. The codebase uses safetensors==0.5.2 for model loading. Unlike pickle, safetensors does not execute code on load, but model repositories can still ship malicious configuration or custom code paths, and weights from untrusted sources remain a supply-chain risk. Fix: 1) Only load model files from trusted sources. 2) Verify model-file integrity using checksums/signatures. 3) Sandbox model inference. 4) Detect and reject suspicious model configurations. 5) Keep safetensors updated.
- Medium · Unvalidated Command Execution in Shell Scripts — example/infer.sh, runtime/triton_trtllm/run.sh, sparktts/utils/parse_options.sh. These scripts execute commands that may accept user input; if parameters are not properly escaped, shell injection is possible. Fix: 1) Prefer Python argparse over shell scripts for parameter handling. 2) If shell scripts are necessary, quote all variables: "$VAR", not $VAR. 3) Use set -e and set -u for safer defaults. 4) Validate all input parameters against whitelists. 5) Avoid eval and command substitution with user input.
- Medium · Docker Container Security Issues — runtime/triton_trtllm/Dockerfile.server, runtime/triton_trtllm/docker-compose.yml. Dockerfile.server likely runs as root and may not implement non-root execution, read-only filesystems, or resource limits; docker-compose.yml may expose ports without authentication. Fix: 1) Create a non-root user and switch to it. 2) Set proper ownership with --chown. 3) Mount filesystems read-only where possible. 4) Set memory/CPU limits in docker-compose.yml. 5) Don't expose inference endpoints directly; add authentication/authorization. 6) Scan the image with Trivy or a similar tool.
- Medium · Insecure gRPC/HTTP API Exposure — runtime/triton_trtllm/client_grpc.py, runtime/triton_trtllm/client_http.py. The clients interact with inference endpoints without apparent authentication, authorization, or TLS validation; if exposed to untrusted networks, this could allow unauthorized access to inference capabilities. Fix: 1) Implement mutual TLS (mTLS) for gRPC. 2) Add authentication tokens
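The shell-quoting advice in the observations above is easiest to get right by not going through a shell at all: in Python, pass an argument vector. The flag names below are illustrative, not example/infer.sh's real interface.

```python
# Hedged sketch: user input passed as argv elements is never interpreted
# by a shell, so metacharacters like ";" stay literal.
import subprocess  # used via the commented run() call below

def build_inference_cmd(text, prompt_wav):
    # Each list element reaches the program verbatim; no word splitting,
    # globbing, or command substitution happens.
    return ["bash", "example/infer.sh", "--text", text, "--prompt", prompt_wav]

cmd = build_inference_cmd("hello; rm -rf /", "a.wav")
# subprocess.run(cmd, check=True) would execute safely: the ";" is a
# literal character inside the --text argument, not a command separator.
```

Contrast with `subprocess.run(f"bash example/infer.sh --text {text}", shell=True)`, where the same input would run `rm -rf /`.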
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.