π ViiTor Voice TTS
Fast, flexible speech cloning with transformers or vLLM β batch-friendly and duration-aware.
δΈζζζ‘£ Β· Demo page Β· GitHub Β· Hugging Face
π What it is
ViiTor Voice is a three-stage speech cloning stack:
- Stage 1: prompt + text β semantic tokens.
- Stage 2: prompt acoustic/semantic + predicted semantic β predicted acoustic tokens.
- Stage 3: acoustic tokens β waveform.
β¨ Why it shines
- Text-free prompts: stronger cross-lingual cloning, less ASR dependencyβraw prompts are welcome.
- Similarity boost: InfoNCE + condition encoder as a similarity constraint; robust even with noisy/background prompts.
- Built-in duration control: duration prediction in the LLM trunk; force duration with ~0.5s precision.
- LoRA-based emotion control: plug in LoRA adapters to steer emotion/style without full finetuning.
cli.py covers both backends, two batch modes, and an optional duration hint (single-text only).
β‘ Quickstart (Linux)
1) Environment
Use the provided script (PyTorch, vLLM 0.12.0 CUDA 12.8, requirements, dualcodec):
bash create_env.sh
source .venv/bin/activate
Notes:
create_env.shusesuv venvwith Python 3.12βadjust if needed.- vLLM install targets CUDA 12.8 (
--torch-backend=cu128); adapt to your CUDA/toolkit.
2) Checkpoints
Fetch required models (Hugging Face mirror by default):
bash download_checkpoints.sh
Default paths (override via CLI if you store elsewhere):
- SoundStorm:
checkpoints/viitor/soundstorm - DualCodec:
checkpoints/dualcodec - wav2vec:
checkpoints/w2v - LLM:
checkpoints/viitor/llm/zh-en
π― Demo usage
π₯οΈ Gradio demo
Launch a web UI (hosted on 0.0.0.0, Gradio share disabled):
python gradio_demo.py \
--soundstorm-model-path checkpoints/viitor/soundstorm \
--dualcodec-model-path checkpoints/dualcodec \
--w2v-path checkpoints/w2v \
--llm-model-path checkpoints/viitor/llm/zh-en \
--server-port 7860
Upload a prompt audio file in the UI, type text, optionally set a duration (seconds), then click βSynthesizeβ to preview the generated audio. Toggle βEnable two-pass speaker refinement (prompt + generated speech)β to reduce accent leakage; helpful for cross-language cloning when you want lighter source accent.
π» CLI demo
Base command (transformers backend + default checkpoints):
python cli.py \
--prompt /path/to/prompt.wav \
--text "Hello ViiTorVoice!" \
--output outputs/out.wav
Common flags:
--use-vllmswitch to vLLM.--duration <seconds>duration hint; honored only when exactly one text.--speaker-windowedenable two-pass speaker refinement (average prompt embedding with generated-speech embedding; reduces accent leakage, useful for cross-language cloning).
π§ͺ Cases
- Single inference (transformers)
python cli.py \
--prompt data/prompt.wav \
--text "Welcome to ViiTorVoice." \
--output outputs/single.wav
- vLLM backend
python cli.py \
--use-vllm \
--prompt data/prompt.wav \
--text "This runs with vLLM." \
--output outputs/vllm.wav
- Duration hint (single text)
python cli.py \
--prompt data/prompt.wav \
--text "Keep this around three seconds." \
--duration 3.0 \
--output outputs/with_duration.wav
- Batch: prompts and texts 1:1
python cli.py \
--prompt data/p1.wav data/p2.wav \
--text "First line" "Second line" \
--output outputs/pair_batch/
Paired by order; outputs auto-named in the directory.
- Batch: one prompt, many texts
python cli.py \
--prompt data/prompt.wav \
--text "Line 1" "Line 2" "Line 3" \
--output outputs/multi_text_batch/
Generates multiple files, auto-named 000_prompt_t0.wav, etc.
π£ Output log
Saved -> path | text='...' | prompt='...' | set/predicted duration=3.00s | actual duration=2.95s
set/predicted duration: provided duration (or model-predicted if none)actual duration: measured from generated audio
π§ Tips
- Ensure CUDA driver/toolkit matches the PyTorch/vLLM build; edit
create_env.shif you need a different CUDA wheel. - vLLM prefers generous GPU memory; fall back to transformers if constrained.
- Set duration hints reasonably; extreme values can produce abnormal audio.
π TODO
- β Open-sourced Chinese/English base model
- β Inference code (this repo and demo)
- β³ SoundStorm training recipe
- β³ LLM training recipe
- β Gradio demo
- β³ Emotion-control LoRA
- β³ Japanese, Korean, Cantonese model weights
- β³ Flow matchingβbased semantic-to-wav module
π Acknowledgments
π Product
Official site: ViiTor AI