LEMAS-TTS
LEMAS-TTS is a multilingual zero-shot text-to-speech system presented in the paper "LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models".
- Project Page: https://lemas-project.github.io/LEMAS-Project
- Paper: https://arxiv.org/abs/2601.04233
- GitHub Repository: https://github.com/LEMAS-Project/LEMAS-TTS
- Hugging Face Demo: https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS
Model Description
LEMAS-TTS is built on a non-autoregressive flow-matching framework. It is trained on the LEMAS-Dataset, whose scale and linguistic diversity support robust zero-shot multilingual synthesis. To mitigate cross-lingual accent issues, the model combines accent-adversarial training with a CTC loss, which improves synthesis stability and quality across languages.
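The card does not specify the exact probability path, network, or solver used by LEMAS-TTS. As a rough illustration of what "flow matching" means, the sketch below assumes the common straight-line (optimal-transport) conditional path: a velocity model is trained to predict x1 - x0 at the interpolated point x_t, and at inference an ODE is integrated from noise to data with all elements updated in parallel. The toy vectors and the `oracle` velocity (which stands in for a trained network) are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight-line path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Velocity of the straight-line path, d x_t / dt = x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model: VelocityFn, x0: Vector, x1: Vector, t: float) -> float:
    """Mean squared error between the model's velocity at x_t and the path velocity."""
    xt = interpolate(x0, x1, t)
    pred = model(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(model: VelocityFn, x0: Vector, steps: int = 10) -> Vector:
    """Integrate dx/dt = model(x, t) from t = 0 to t = 1 with fixed-step Euler.

    Every element of x is updated in parallel at each step, which is what makes
    this kind of sampler non-autoregressive over frames.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    random.seed(0)
    x1 = [0.5, -1.0, 2.0, 0.0]                 # toy "acoustic frame" (hypothetical data)
    x0 = [random.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    oracle = lambda x, t: target_velocity(x0, x1)  # stands in for a trained network
    print(flow_matching_loss(oracle, x0, x1, t=0.3))  # 0.0
    print(euler_sample(oracle, x0, steps=10))          # approximately x1
```

With a perfect velocity model the loss is zero and Euler integration lands on the data point; a real model only approximates this, which is why practical systems tune the number of solver steps. The accent-adversarial and CTC terms mentioned above would be additional losses on top of this objective; their form is not described in this card.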
Supported Languages
The model supports 10 major languages for zero-shot synthesis:
- Chinese (zh)
- English (en)
- Spanish (es)
- Russian (ru)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Indonesian (id)
- Vietnamese (vi)
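The list above can be captured as a lookup table for validating user input before synthesis. The sketch below is a hypothetical helper, not part of the LEMAS-TTS API; the repository's actual language argument may be named or formatted differently. It assumes the codes are ISO 639-1 and that a regional tag such as "pt-BR" should fall back to its primary subtag.

```python
# Language codes listed in this model card (ISO 639-1).
SUPPORTED_LANGUAGES = {
    "zh": "Chinese",
    "en": "English",
    "es": "Spanish",
    "ru": "Russian",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "id": "Indonesian",
    "vi": "Vietnamese",
}


def normalize_language(tag: str) -> str:
    """Map a tag such as "EN" or "pt-BR" to its primary code and check support.

    Hypothetical helper: it keeps only the primary language subtag (the part
    before "-" or "_"), so regional variants collapse to the base language.
    """
    primary = tag.strip().replace("_", "-").split("-")[0].lower()
    if primary not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"unsupported language {tag!r}; expected one of: {supported}")
    return primary


print(normalize_language("pt-BR"))  # pt
```

Collapsing regional variants is a design assumption: the card does not say which regional varieties (for example, Brazilian versus European Portuguese) are covered by the training data.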
Training Data
LEMAS-TTS was trained on the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps, covering over 150,000 hours across the 10 languages listed above.
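To make "word-level timestamps" concrete, the sketch below models an utterance as a list of words with start and end times and aggregates duration in hours. The record layout (`WordSpan`, `Utterance`, and their field names) is hypothetical and is not the LEMAS-Dataset's actual schema; consult the dataset release for the real format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordSpan:
    word: str
    start: float  # seconds from the start of the utterance
    end: float    # seconds; expected to be >= start


@dataclass
class Utterance:
    language: str  # e.g. "en"
    words: List[WordSpan] = field(default_factory=list)

    def duration(self) -> float:
        """Seconds from the first word onset to the last word offset.

        Leading and trailing silence outside the aligned words is not counted.
        """
        if not self.words:
            return 0.0
        return self.words[-1].end - self.words[0].start

    def words_in_window(self, t0: float, t1: float) -> List[str]:
        """Words whose whole span lies inside [t0, t1]."""
        return [w.word for w in self.words if w.start >= t0 and w.end <= t1]


def total_hours(utterances: List[Utterance]) -> float:
    """Sum of word-aligned durations across utterances, in hours."""
    return sum(u.duration() for u in utterances) / 3600.0


u = Utterance("en", [WordSpan("hello", 0.10, 0.45), WordSpan("world", 0.50, 0.95)])
print(u.words_in_window(0.0, 0.5))  # ['hello']
```

Word-level alignments like these let a training pipeline cut or condition on exact word boundaries rather than whole utterances, which is one reason timestamped corpora are useful for speech generation.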
Citation
@article{zhao2026lemas,
title={LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models},
author={Zhao, Zhiyuan and Lin, Lijian and Zhu, Ye and Xie, Kai and Liu, Yunfei and Li, Yu},
journal={arXiv preprint arXiv:2601.04233},
year={2026}
}