
🚀 ViiTor Voice TTS

Fast, flexible speech cloning with transformers or vLLM: batch-friendly and duration-aware.

Chinese Documentation · Demo page · GitHub · Hugging Face

πŸ€ What it is

ViiTor Voice is a three-stage speech cloning stack (sketched in code after the list):

  • Stage 1: prompt + text → semantic tokens.
  • Stage 2: prompt acoustic/semantic + predicted semantic → predicted acoustic tokens.
  • Stage 3: acoustic tokens → waveform.
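
In code terms, the flow is roughly the following. This is a minimal sketch with hypothetical method names, not the repo's actual API; the three models are passed in as opaque objects:

```python
# Hypothetical sketch of the three-stage flow; names are placeholders.
def clone_voice(prompt_wav, text, llm, soundstorm, codec):
    # Stage 1: prompt audio + target text -> semantic tokens.
    semantic_tokens = llm.predict_semantic(prompt_wav, text)
    # Stage 2: prompt acoustic/semantic tokens + predicted semantic tokens
    # -> predicted acoustic tokens.
    acoustic_tokens = soundstorm.predict_acoustic(prompt_wav, semantic_tokens)
    # Stage 3: acoustic tokens -> waveform.
    return codec.decode(acoustic_tokens)
```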

✨ Why it shines

  • Text-free prompts: stronger cross-lingual cloning and less ASR dependency; raw prompts are welcome.
  • Similarity boost: InfoNCE + condition encoder as a similarity constraint; robust even with noisy/background prompts.
  • Built-in duration control: duration prediction in the LLM trunk; force a duration with ~0.5s precision (see the sketch after this list).
  • LoRA-based emotion control: plug in LoRA adapters to steer emotion/style without full finetuning.

cli.py covers both backends, two batch modes, and an optional duration hint (single-text only).

⚡ Quickstart (Linux)

1) Environment

Use the provided script (installs PyTorch, vLLM 0.12.0 for CUDA 12.8, the project requirements, and dualcodec):

```bash
bash create_env.sh
source .venv/bin/activate
```

Notes:

  • create_env.sh uses uv venv with Python 3.12; adjust if needed.
  • vLLM install targets CUDA 12.8 (--torch-backend=cu128); adapt to your CUDA/toolkit.
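
Once the venv is active, a quick sanity check with standard PyTorch calls confirms the CUDA wheel landed correctly:

```python
import torch

print(torch.__version__)          # wheel version, e.g. a +cu128 build
print(torch.version.cuda)         # CUDA version the wheel targets
print(torch.cuda.is_available())  # True if the driver/toolkit line up
```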

2) Checkpoints

Fetch required models (Hugging Face mirror by default):

```bash
bash download_checkpoints.sh
```

Default paths (override via CLI flags if you store them elsewhere; see the check after the list):

  • SoundStorm: checkpoints/viitor/soundstorm
  • DualCodec: checkpoints/dualcodec
  • wav2vec: checkpoints/w2v
  • LLM: checkpoints/viitor/llm/zh-en
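
If you relocate them, a short existence check against the default paths above can save a failed model load later:

```python
from pathlib import Path

# Default locations fetched by download_checkpoints.sh; edit this list if
# you passed different paths on the CLI.
checkpoints = [
    "checkpoints/viitor/soundstorm",
    "checkpoints/dualcodec",
    "checkpoints/w2v",
    "checkpoints/viitor/llm/zh-en",
]
for ckpt in checkpoints:
    status = "ok" if Path(ckpt).exists() else "MISSING"
    print(f"[{status}] {ckpt}")
```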

🎯 Demo usage

🖥️ Gradio demo

Launch a web UI (hosted on 0.0.0.0, Gradio share disabled):

```bash
python gradio_demo.py \
  --soundstorm-model-path checkpoints/viitor/soundstorm \
  --dualcodec-model-path checkpoints/dualcodec \
  --w2v-path checkpoints/w2v \
  --llm-model-path checkpoints/viitor/llm/zh-en \
  --server-port 7860
```

Upload a prompt audio file in the UI, type text, optionally set a duration (seconds), then click “Synthesize” to preview the generated audio. Toggle “Enable two-pass speaker refinement (prompt + generated speech)” to reduce accent leakage; this helps cross-language cloning when you want a lighter source accent.

💻 CLI demo

Base command (transformers backend + default checkpoints):

```bash
python cli.py \
  --prompt /path/to/prompt.wav \
  --text "Hello ViiTorVoice!" \
  --output outputs/out.wav
```

Common flags:

  • --use-vllm: switch to the vLLM backend.
  • --duration <seconds>: duration hint; honored only when exactly one text is given.
  • --speaker-windowed: enable two-pass speaker refinement, averaging the prompt embedding with a generated-speech embedding (reduces accent leakage; useful for cross-language cloning; see the sketch below).
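
The refinement behind --speaker-windowed boils down to blending two speaker embeddings. A minimal sketch of that averaging step, assuming equal-weight averaging followed by re-normalization (the repo's actual weighting may differ):

```python
import torch
import torch.nn.functional as F

def blend_speaker_embeddings(prompt_emb: torch.Tensor,
                             generated_emb: torch.Tensor) -> torch.Tensor:
    # Pass 1 extracts an embedding from the prompt; pass 2 extracts one from
    # the first-pass generated speech. Averaging the two dilutes the prompt's
    # source accent while keeping speaker identity.
    blended = 0.5 * (prompt_emb + generated_emb)
    return F.normalize(blended, dim=-1)

# Toy usage with random stand-in embeddings:
blended = blend_speaker_embeddings(torch.randn(256), torch.randn(256))
```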

🧪 Cases

1. Single inference (transformers)

   ```bash
   python cli.py \
     --prompt data/prompt.wav \
     --text "Welcome to ViiTorVoice." \
     --output outputs/single.wav
   ```
2. vLLM backend

   ```bash
   python cli.py \
     --use-vllm \
     --prompt data/prompt.wav \
     --text "This runs with vLLM." \
     --output outputs/vllm.wav
   ```
3. Duration hint (single text)

   ```bash
   python cli.py \
     --prompt data/prompt.wav \
     --text "Keep this around three seconds." \
     --duration 3.0 \
     --output outputs/with_duration.wav
   ```
4. Batch: prompts and texts 1:1

   ```bash
   python cli.py \
     --prompt data/p1.wav data/p2.wav \
     --text "First line" "Second line" \
     --output outputs/pair_batch/
   ```

   Paired by order; outputs are auto-named in the directory.

5. Batch: one prompt, many texts

   ```bash
   python cli.py \
     --prompt data/prompt.wav \
     --text "Line 1" "Line 2" "Line 3" \
     --output outputs/multi_text_batch/
   ```

   Generates one file per text, auto-named 000_prompt_t0.wav, etc.
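
The names appear to encode a running index, the prompt's filename stem, and the text index; a guess at the pattern (the repo's exact format string may differ):

```python
from pathlib import Path

def output_name(index: int, prompt_path: str, text_index: int) -> str:
    # Reconstructs names like 000_prompt_t0.wav from the example above.
    return f"{index:03d}_{Path(prompt_path).stem}_t{text_index}.wav"

print(output_name(0, "data/prompt.wav", 0))  # 000_prompt_t0.wav
```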

📣 Output log

Each synthesis prints a line like:

```
Saved -> path | text='...' | prompt='...' | set/predicted duration=3.00s | actual duration=2.95s
```

  • set/predicted duration: the provided duration, or the model-predicted one if none was given
  • actual duration: measured from the generated audio
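
For post-processing a batch run, the log line is regular enough to parse. A small sketch that assumes exactly the format shown above:

```python
import re

# Matches the "Saved ->" line shown above; assumes that exact format.
LOG_RE = re.compile(
    r"Saved -> (?P<path>\S+) \| text='(?P<text>.*?)' \| prompt='(?P<prompt>.*?)'"
    r" \| set/predicted duration=(?P<target>[\d.]+)s"
    r" \| actual duration=(?P<actual>[\d.]+)s"
)

line = ("Saved -> outputs/out.wav | text='Hello ViiTorVoice!' | "
        "prompt='data/prompt.wav' | set/predicted duration=3.00s | "
        "actual duration=2.95s")
match = LOG_RE.match(line)
if match:
    drift = abs(float(match["actual"]) - float(match["target"]))
    print(f"{match['path']}: drift {drift:.2f}s")
```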

🧭 Tips

  • Ensure CUDA driver/toolkit matches the PyTorch/vLLM build; edit create_env.sh if you need a different CUDA wheel.
  • vLLM prefers generous GPU memory; fall back to transformers if constrained.
  • Set duration hints reasonably; extreme values can produce abnormal audio.

📌 TODO

  • ✅ Open-sourced Chinese/English base model
  • ✅ Inference code (this repo and demo)
  • ⏳ SoundStorm training recipe
  • ⏳ LLM training recipe
  • ✅ Gradio demo
  • ⏳ Emotion-control LoRA
  • ⏳ Japanese, Korean, Cantonese model weights
  • ⏳ Flow matching–based semantic-to-wav module

🙌 Acknowledgments

🌟 Product

Official site: ViiTor AI
