RepoPilot

SparkAudio/Spark-TTS

Spark-TTS Inference Code

Mixed

Stale — last commit 1y ago

Weakest axis: Use as dependency

  • Use as dependency · Mixed: last commit was 1y ago; no tests detected…
  • Fork & modify · Healthy: has a license, tests, and CI — clean foundation to fork and modify.
  • Learn from · Healthy: documented and popular — useful reference codebase to read through.
  • Deploy as-is · Mixed: last commit was 1y ago; no CI workflows detected

  • 6 active contributors
  • Apache-2.0 licensed
  • Stale — last commit 1y ago
  • Concentrated ownership — top contributor handles 67% of recent commits
  • No CI workflows detected
  • No test directory detected
What would change the summary?
  • Use as dependency: Mixed → Healthy if: 1 commit in the last 365 days; add a test suite
  • Deploy as-is: Mixed → Healthy if: 1 commit in the last 180 days

Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Forkable" badge

Paste into your README — live-updates from the latest cached analysis.

[![RepoPilot: Forkable](https://repopilot.app/api/badge/sparkaudio/spark-tts?axis=fork)](https://repopilot.app/r/sparkaudio/spark-tts)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card (1200×630)

This card auto-renders when someone shares https://repopilot.app/r/sparkaudio/spark-tts on X, Slack, or LinkedIn.

Onboarding doc

Onboarding: SparkAudio/Spark-TTS

Generated by RepoPilot · 2026-05-07 · Source

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/SparkAudio/Spark-TTS shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

🎯Verdict

WAIT — Stale — last commit 1y ago

  • 6 active contributors
  • Apache-2.0 licensed
  • ⚠ Stale — last commit 1y ago
  • ⚠ Concentrated ownership — top contributor handles 67% of recent commits
  • ⚠ No CI workflows detected
  • ⚠ No test directory detected

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live SparkAudio/Spark-TTS repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/SparkAudio/Spark-TTS.

What it runs against: a local clone of SparkAudio/Spark-TTS — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in SparkAudio/Spark-TTS | Confirms the artifact applies here, not a fork |
| 2 | License is still Apache-2.0 | Catches relicense before you depend on it |
| 3 | Default branch `main` exists | Catches branch renames |
| 4 | Last commit ≤ 423 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>SparkAudio/Spark-TTS</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of SparkAudio/Spark-TTS. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/SparkAudio/Spark-TTS.git
#   cd Spark-TTS
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of SparkAudio/Spark-TTS and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "SparkAudio/Spark-TTS(\.git)?\b" \
  && ok "origin remote is SparkAudio/Spark-TTS" \
  || miss "origin remote is not SparkAudio/Spark-TTS (artifact may be from a fork)"

# 2. License matches what RepoPilot saw. Apache LICENSE files begin with
# "Apache License", not the SPDX identifier, so match on that text.
(grep -qiE "apache license" LICENSE 2>/dev/null \
   || grep -qiE "\"license\"\s*:\s*\"Apache-2\.0\"" package.json 2>/dev/null) \
  && ok "license is Apache-2.0" \
  || miss "license drift — was Apache-2.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \
  && ok "default branch main exists" \
  || miss "default branch main no longer exists"

# 4. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 423 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~393d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/SparkAudio/Spark-TTS"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

TL;DR

Spark-TTS is an LLM-based text-to-speech inference system built on Qwen2.5. It generates high-quality audio by predicting speech tokens directly through an LLM, then reconstructing waveforms via audio tokenizers and vocoders. This eliminates separate acoustic feature generation pipelines, offering efficient bilingual (Chinese/English) synthesis with zero-shot voice cloning. Modular architecture:

  • sparktts/models/ contains audio_tokenizer.py and bicodec.py for audio encoding
  • sparktts/modules/ splits into encoder_decoder/ (feature processing), fsq/ (quantization), speaker/ (voice embeddings), and blocks/ (Vocos vocoder)
  • cli/ provides SparkTTS.py, the inference wrapper
  • runtime/triton_trtllm/ contains Triton model definitions for deployment

👥Who it's for

ML engineers and researchers building production TTS systems who need efficient LLM-based speech synthesis with voice cloning support; deployment engineers integrating TTS into systems via Triton Inference Server; audio ML practitioners exploring token-based audio generation instead of traditional spectrogram approaches.

🌱Maturity & risk

Research-grade with early productionization: the codebase includes production deployment patterns (Triton TensorRT-LLM runtime with Docker, gRPC/HTTP clients) alongside core inference code, suggesting post-paper release maturity. However, the repo structure lacks visible test suites, CI/CD pipelines, or extensive documentation beyond examples, which is typical of research code with an early production deployment setup.

Moderate risk: tight coupling to specific versions (PyTorch 2.5+, transformers 4.46.2) with 7 core dependencies could cause compatibility issues; no test coverage visible limits regression detection; deployment relies on external Triton/TensorRT-LLM infrastructure. Single research institution ownership (SparkAudio) means community support is limited.

Active areas of work

Recent code snapshot includes complete Triton model repository setup with audio_tokenizer, spark_tts LLM, and vocoder services; example inference script (example/infer.sh) and sample output (example/results/20250225113521.wav) suggest active testing. Conversion and templating scripts in runtime/triton_trtllm/scripts/ indicate ongoing deployment optimization.

🚀Get running

git clone https://github.com/SparkAudio/Spark-TTS.git && cd Spark-TTS && pip install -r requirements.txt && bash example/infer.sh

Daily commands:

  • CLI inference: python cli/inference.py --text 'hello world' --speaker_wav example/prompt_audio.wav --output_path result.wav
  • Triton server: cd runtime/triton_trtllm && docker-compose up, then use client_grpc.py or client_http.py
  • Gradio UI: python -m gradio cli/SparkTTS.py
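A thin Python wrapper around the CLI command above can make scripted runs less error-prone. The flag names (`--text`, `--speaker_wav`, `--output_path`) are taken from this doc's example; confirm them with `python cli/inference.py --help` before relying on this sketch.

```python
# Hedged convenience wrapper for the Spark-TTS CLI. Assumes you run it from
# the repo root with requirements installed; flag names are from this doc.
import subprocess
import sys

def cli_command(text: str, speaker_wav: str, output_path: str) -> list[str]:
    """Build the argv list for one CLI inference run (no shell involved)."""
    return [
        sys.executable, "cli/inference.py",
        "--text", text,
        "--speaker_wav", speaker_wav,
        "--output_path", output_path,
    ]

def synthesize(text: str, speaker_wav: str, output_path: str) -> None:
    """Run one inference; check=True raises instead of failing silently."""
    subprocess.run(cli_command(text, speaker_wav, output_path), check=True)

# From a Spark-TTS clone:
#   synthesize("hello world", "example/prompt_audio.wav", "result.wav")
```

Passing argv as a list (rather than a shell string) also sidesteps quoting issues with text that contains spaces or punctuation.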

🗺️Map of the codebase

🛠️How to make changes

  • Add new speaker encoders: modify sparktts/modules/speaker/ (e.g., add a new pooling layer in pooling_layers.py or a new encoder in perceiver_encoder.py)
  • Change audio tokenization: edit sparktts/models/audio_tokenizer.py or bicodec.py
  • Adjust decoding: modify sparktts/modules/encoder_decoder/wave_generator.py
  • Customize Triton deployment: edit model configs in runtime/triton_trtllm/model_repo/*/config.pbtxt and */1/model.py

🪤Traps & gotchas

  • PyTorch 2.5.1 and torchaudio 2.5.1 are pinned versions; mixing with older or newer torch risks CUDA/library incompatibilities.
  • Triton deployment requires a separately built TensorRT-LLM engine (not included; run runtime/triton_trtllm/scripts/convert_checkpoint.py first).
  • prompt_audio.wav format must match tokenizer expectations (likely 16 kHz mono); the CLI performs no validation.
  • FSQ (finite scalar quantization) indices from the LLM must match the audio_tokenizer codebook size or decoding fails silently.
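Since the CLI does no prompt-audio validation, a small stdlib-only pre-check can catch format mismatches early. The 16 kHz mono expectation is an assumption from this doc ("likely 16kHz mono"); confirm it against the tokenizer's actual config before trusting the defaults here.

```python
# Minimal sanity check for prompt audio before inference, using only the
# stdlib `wave` module. The expected_rate default is an assumption — verify
# against sparktts/models/audio_tokenizer.py.
import wave

def validate_prompt_wav(path: str, expected_rate: int = 16000) -> None:
    """Raise ValueError if the WAV file is not mono at the expected rate."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        if channels != 1:
            raise ValueError(f"{path}: expected mono, got {channels} channels")
        if rate != expected_rate:
            raise ValueError(f"{path}: expected {expected_rate} Hz, got {rate} Hz")
```

Calling `validate_prompt_wav("example/prompt_audio.wav")` before inference turns a silent quality degradation into an explicit error.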

💡Concepts to learn

  • Discrete Audio Tokenization — Spark-TTS converts continuous waveforms to discrete tokens (via audio_tokenizer.py) so LLMs can predict speech—understanding this bridging layer is key to modifying codecs or debugging audio artifacts
  • Finite Scalar Quantization (FSQ) — FSQ in sparktts/modules/fsq/ replaces traditional VQ-VAE for smoother token learning; critical for token prediction accuracy and understanding bicodec.py architecture
  • Zero-Shot Voice Cloning — Via speaker embeddings (ECAPA-TDNN in sparktts/modules/speaker/), the model adapts to unseen speakers without fine-tuning; essential for understanding the voice conditioning pipeline in inference.py
  • Token-Based Speech Synthesis — Unlike traditional spectral synthesis, Spark-TTS predicts discrete audio tokens via Qwen2.5 then reconstructs waveforms; understanding this paradigm shift explains why flow matching and separate acoustic models are eliminated
  • Triton Inference Server Model Repository — Deployment requires config.pbtxt and model.py for each stage (tokenizer, LLM, vocoder); understanding Triton's composable pipeline pattern is essential for productionizing or scaling the inference stack
  • Speaker Embeddings and Perceiver Encoders — sparktts/modules/speaker/perceiver_encoder.py extracts speaker-invariant embeddings from prompt audio for conditioning the LLM; modifying this affects voice cloning fidelity and cross-lingual capability
  • Vocoder Design (Vocos) — sparktts/modules/blocks/vocos.py converts quantized audio codes back to waveforms; vocoder quality directly determines final audio naturalness, making this module critical for output tuning
  • qwen-team/Qwen2.5 — The base LLM backbone for Spark-TTS token prediction; understanding Qwen2.5 architecture and tokenization is essential for model modifications
  • openai/whisper — A widely used large-scale speech model (recognition rather than synthesis); useful background for transformer-based audio pipelines, though its mel-spectrogram front end differs from Spark-TTS's FSQ token codec
  • NVIDIA/TensorRT-LLM — Required runtime dependency for Triton deployment in runtime/triton_trtllm/; enables inference optimization critical for production latency
  • coqui-ai/TTS — Alternative open-source TTS system; useful for benchmarking quality, speed, and ease-of-use comparisons against traditional non-LLM approaches
  • huggingface/transformers — Core dependency (4.46.2) for loading Qwen2.5 and handling tokenization; upgrading or debugging requires familiarity with transformers pipeline API
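The FSQ idea above can be illustrated with a toy round-trip: clamp each latent dimension to [-1, 1], snap it to one of L evenly spaced levels, then pack the per-dimension codes into a single token index. This is a conceptual sketch, not the implementation in sparktts/modules/fsq/.

```python
# Toy finite scalar quantization (FSQ) — illustration only, not the repo's code.
# Each dimension is quantized independently; the codebook is implicit (a grid),
# which is what lets FSQ replace learned VQ-VAE codebooks.

def fsq_quantize(z, levels):
    """Snap each value in z to the nearest of `levels` evenly spaced points in [-1, 1]."""
    half = (levels - 1) / 2
    return [round(max(-1.0, min(1.0, x)) * half) / half for x in z]

def fsq_index(codes, levels):
    """Pack per-dimension codes into one integer token (mixed-radix encoding)."""
    half = (levels - 1) / 2
    idx = 0
    for c in codes:
        digit = int(round(c * half + half))  # shift from [-half, half] to [0, levels-1]
        idx = idx * levels + digit
    return idx
```

With 3 dimensions at 5 levels each, the token vocabulary is 5³ = 125 — this dimensionality/levels trade-off is exactly the knob the trap above warns about: the LLM's token range must match the codec's grid.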

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for CLI inference pipeline

The repo has cli/SparkTTS.py and cli/inference.py but no visible test suite. Given that this is inference code with real audio I/O (example/infer.sh exists), adding integration tests would catch regressions when dependencies update (torch, torchaudio, transformers are frequently updated). Tests should verify end-to-end inference with example/prompt_audio.wav and validate output audio quality metrics.

  • [ ] Create tests/ directory with test_inference.py
  • [ ] Add pytest fixtures that load example/prompt_audio.wav
  • [ ] Test cli/SparkTTS.py main entry point with various text inputs
  • [ ] Validate output audio from example/results/ matches expected format (WAV, sample rate)
  • [ ] Add tests for cli/inference.py functions with mock models to avoid large downloads
  • [ ] Update requirements.txt to include pytest and pytest-cov as dev dependencies
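A possible shape for the tests/test_inference.py in the checklist, written as plain test_* functions that pytest can collect without plugins. The CLI flags are taken from this doc's examples; treat the entry point and flags as assumptions to verify against cli/inference.py.

```python
# Hypothetical tests/test_inference.py sketch. Skips itself (returns early)
# when run outside a Spark-TTS clone, so it is safe to collect anywhere.
import subprocess
import sys
import wave
from pathlib import Path

PROMPT = Path("example/prompt_audio.wav")

def cli_command(text: str, prompt_wav: str, out_path: str) -> list[str]:
    """Build the documented CLI invocation as an argv list."""
    return [sys.executable, "cli/inference.py",
            "--text", text,
            "--speaker_wav", prompt_wav,
            "--output_path", out_path]

def test_cli_inference_produces_wav(tmp_path=Path(".")):
    if not PROMPT.exists():  # not inside a Spark-TTS clone — skip
        return
    out = tmp_path / "out.wav"
    subprocess.run(cli_command("hello world", str(PROMPT), str(out)), check=True)
    with wave.open(str(out), "rb") as wf:
        assert wf.getnframes() > 0     # produced non-empty audio
        assert wf.getnchannels() >= 1  # valid WAV header
```

Under pytest, the `tmp_path` fixture replaces the placeholder default, so outputs land in a per-test temporary directory.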

Create Triton inference client wrapper for easier production deployment

The runtime/triton_trtllm/ directory contains two separate client implementations (client_grpc.py and client_http.py) but they appear to be boilerplate. Adding a high-level unified client class would make it easier for users to switch between GRPC/HTTP backends and handle the audio preprocessing (sparktts/utils/audio.py utilities should be integrated). This bridges the gap between the CLI tool and production Triton deployment.

  • [ ] Create runtime/triton_trtllm/triton_client.py with TritonSparkttsClient class
  • [ ] Implement automatic backend detection (GRPC vs HTTP) based on server availability
  • [ ] Integrate sparktts/utils/audio.py preprocessing into the client
  • [ ] Add methods for batch inference and streaming audio output
  • [ ] Add error handling for model not ready, timeout, and invalid inputs
  • [ ] Document usage in runtime/triton_trtllm/README.md with example code
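One way to structure the backend-detection step from the checklist. A real implementation would wrap tritonclient.grpc and tritonclient.http (whose InferenceServerClient exposes a readiness check); here the probes are injected callables so the fallback logic is visible and testable without a running server. The class name comes from the checklist; everything else is a design sketch.

```python
# Design sketch for a unified Triton client with automatic backend detection.
# Probes are injected so no network or tritonclient install is required here.

class TritonSparkttsClient:
    def __init__(self, grpc_probe, http_probe):
        # Each probe returns True if that backend's server answers a readiness
        # check (e.g. tritonclient's InferenceServerClient.is_server_ready()).
        self._probes = {"grpc": grpc_probe, "http": http_probe}
        self.backend = None

    def connect(self) -> str:
        """Pick the first reachable backend, preferring gRPC over HTTP."""
        for name in ("grpc", "http"):
            try:
                if self._probes[name]():
                    self.backend = name
                    return name
            except Exception:
                continue  # treat transport errors as "backend not available"
        raise ConnectionError("no Triton backend reachable")
```

Keeping the preference order explicit (gRPC first, HTTP fallback) makes the switching behavior easy to document and to override.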

Add unit tests for speaker encoder modules

The sparktts/modules/speaker/ directory contains ECAPA-TDNN and Perceiver encoder implementations (ecapa_tdnn.py, perceiver_encoder.py, speaker_encoder.py) but there are no visible tests. These are critical components for speaker conditioning. Unit tests would validate that speaker embeddings have correct shapes, handle various audio lengths properly, and that pooling layers work correctly with different input dimensions.

  • [ ] Create tests/test_speaker_encoder.py
  • [ ] Add tests for SpeakerEncoder forward pass with random audio tensors of varying lengths
  • [ ] Test ECAPA-TDNN with different input shapes and verify embedding output dimension
  • [ ] Test Perceiver encoder with different sequence lengths
  • [ ] Add tests for all pooling layers in pooling_layers.py (temporal pooling, attention pooling, etc.)
  • [ ] Verify gradients flow correctly through speaker components for training
  • [ ] Add parametrized tests covering edge cases (very short audio, mono vs stereo)
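The shape checks above follow a common pattern, sketched here with a stand-in encoder so it runs anywhere. For the real tests, replace DummyEncoder with the class from sparktts/modules/speaker/speaker_encoder.py (its constructor arguments and the 192-dim default are assumptions to verify against the source).

```python
# Shape-test skeleton for speaker encoder modules. DummyEncoder is a stand-in
# that only mimics the input/output contract: (batch, samples) -> (batch, dim).

class DummyEncoder:
    """Stand-in encoder: mean-pools each waveform and tiles to embed_dim."""
    def __init__(self, embed_dim: int = 192):
        self.embed_dim = embed_dim

    def __call__(self, batch):
        # Just enough computation to exercise the shape contract.
        return [[sum(x) / max(len(x), 1)] * self.embed_dim for x in batch]

def check_embedding_shape(encoder, lengths, embed_dim):
    """Every input length must yield one fixed-size embedding per item."""
    for n in lengths:
        out = encoder([[0.0] * n])
        assert len(out) == 1 and len(out[0]) == embed_dim, (n, len(out[0]))

# Varying lengths covers the "very short audio" edge case in the checklist.
check_embedding_shape(DummyEncoder(), lengths=[160, 16000, 48000], embed_dim=192)
```

The same `check_embedding_shape` helper can back a pytest parametrization over lengths and embedding dimensions once the real encoders are wired in.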

🌿Good first issues

  • Add pytest suite for sparktts/models/audio_tokenizer.py covering encoding/decoding round-trip with synthetic and real audio to catch quantization artifacts
  • Document required speaker prompt audio specifications (sample rate, duration, loudness) in README with validation code in cli/inference.py to reject invalid inputs early
  • Extract hardcoded hyperparameters (vocoder settings, token vocabulary size, embedding dimensions) from sparktts/modules/ into sparktts/config.yaml with OmegaConf loading, enabling experiment tracking

Top contributors


📝Recent commits

  • 2f1ea90 — [runtime] Benchmark streaming Triton TRT-LLM first chunk latency (#188) (yuekaizhang)
  • 77b1786 — Feat: add triton runtime decoupled spark_tts python backend + decoupled tensorrt_llm backend (#126) (weedge)
  • ee29f36 — update readme for Triton (xinshengwang)
  • 1c17251 — Merge pull request #92 from yuekaizhang/triton (xinshengwang)
  • b228154 — change usage (yuekaizhang)
  • 6de055e — add license (yuekaizhang)
  • 1354114 — Merge branch 'SparkAudio:main' into triton (yuekaizhang)
  • 22e21cd — add code commit (yuekaizhang)
  • 4d769ff — update docker file (yuekaizhang)
  • 82f7b02 — update http client; launch script (yuekaiz)

🔒Security observations

  • High · Outdated and Potentially Vulnerable Dependencies — requirements.txt. Multiple dependencies have known vulnerabilities or are significantly outdated. Notably: gradio==5.18.0 has had security issues in previous versions; transformers==4.46.2 is from mid-2024; numpy==2.2.3 and torch==2.5.1 may have unpatched vulnerabilities. No dependency pinning with hash verification is evident. Fix: 1) Update all dependencies to latest stable versions. 2) Run pip-audit or safety check against requirements.txt. 3) Implement dependency scanning in CI/CD pipeline. 4) Consider using lock files (pip-compile, poetry.lock) with hash verification.
  • High · Arbitrary File Upload via Gradio Interface — cli/SparkTTS.py, cli/inference.py. Gradio is used in the codebase (cli/SparkTTS.py likely contains a Gradio interface based on dependencies). Gradio interfaces can be vulnerable to arbitrary file uploads if not properly validated. The inference code may accept user-supplied audio files without adequate validation, potentially leading to path traversal, DoS, or malicious file execution. Fix: 1) Implement strict file type validation (whitelist only .wav files). 2) Validate file size limits. 3) Use secure temporary directories. 4) Scan uploaded files for malicious content. 5) Implement rate limiting on Gradio endpoints. 6) Run Gradio with share=False in production.
  • Medium · Insecure Deserialization via SafeTensors — sparktts/models/*.py, runtime/triton_trtllm/model_repo/*/1/model.py. The codebase uses safetensors==0.5.2 for model loading. While safer than pickle, safetensors can still execute arbitrary Python code if custom deserializers are used. Model files loaded from untrusted sources could be exploited. Fix: 1) Only load model files from trusted sources. 2) Verify model file integrity using checksums/signatures. 3) Implement sandboxing for model inference. 4) Add code to detect and reject suspicious model configurations. 5) Keep safetensors updated.
  • Medium · Unvalidated Command Execution in Shell Scripts — example/infer.sh, runtime/triton_trtllm/run.sh, sparktts/utils/parse_options.sh. Shell scripts (example/infer.sh, runtime/triton_trtllm/run.sh, sparktts/utils/parse_options.sh) execute commands that may accept user input. If parameters are not properly escaped, shell injection attacks are possible. Fix: 1) Avoid shell scripts for parameter handling; use Python argparse instead. 2) If shell scripts are necessary, quote all variables: "$VAR" not $VAR. 3) Use set -e and set -u for safer defaults. 4) Validate all input parameters against whitelists. 5) Avoid eval() and command substitution with user input.
  • Medium · Docker Container Security Issues — runtime/triton_trtllm/Dockerfile.server, runtime/triton_trtllm/docker-compose.yml. Dockerfile.server in runtime/triton_trtllm likely runs as root and may not implement non-root user execution, read-only filesystems, or resource limits. The docker-compose.yml may expose ports without authentication. Fix: 1) Create a non-root user and switch to it. 2) Use RUN commands with --chown to set proper permissions. 3) Mark filesystem as read-only where possible. 4) Implement resource limits in docker-compose.yml (memory, CPU). 5) Don't expose inference endpoints directly; use authentication/authorization. 6) Scan image with Trivy or similar tools.
  • Medium · Insecure gRPC/HTTP API Exposure — runtime/triton_trtllm/client_grpc.py, runtime/triton_trtllm/client_http.py. Client implementations (client_grpc.py, client_http.py) interact with inference endpoints without apparent authentication, authorization, or TLS validation. If exposed to untrusted networks, this could allow unauthorized access to inference capabilities. Fix: 1) Implement mutual TLS (mTLS) for gRPC. 2) Add authentication tokens
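The "quote all variables" fix from the shell-injection item looks like this from the Python side: pass arguments as an argv list (no shell interpretation at all), or quote untrusted input with shlex when a single shell string is unavoidable. The `--text` flag on infer.sh is illustrative; check the script's actual parameters.

```python
# Two safe ways to hand user text to a shell script. Flag name is hypothetical.
import shlex

def run_infer_argv(text: str) -> list[str]:
    # Preferred: argv list — `text` is never interpreted by a shell.
    return ["bash", "example/infer.sh", "--text", text]
    # e.g. subprocess.run(run_infer_argv(text), check=True) inside a clone

def run_infer_shell_string(text: str) -> str:
    # If a single shell string is required (ssh, docker exec), quote it.
    return f"bash example/infer.sh --text {shlex.quote(text)}"
```

With quoting, a hostile input like `x; rm -rf /` stays a literal argument instead of becoming a second command.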

LLM-derived; treat as a starting point, not a security audit.


Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
