Update (2025-02-07): Our paper has been released! The Llasa-1B Multilingual version has also been released!
```shell
torchrun --nproc_per_node=8 train_tts.py config.json
```
or
```shell
sbatch run_slurm.sh
```
You can download the tokenized open-source speech data here. It includes LibriHeavy, Emilia (in both Chinese and English), and WenetSpeech4TTS, totaling approximately 160,000 hours of open-source data.
Our models are trained on 250,000 hours of speech data. Of this, 160,000 hours come from the open-source datasets mentioned above, while the remaining 90,000 hours are from internal datasets, which are not yet available for open-source release.
Text_sequence is encoded by the text tokenizer from Llama, for example, Llama-3.2-1B-Instruct.
Speech_sequence is extracted through X-codec2. We offset each speech token's value by len(text tokenizer) + 8 special tokens, thereby forming a unified tokenizer that encompasses both speech and text.
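As a minimal sketch of the unified-vocabulary mapping described above (the exact vocabulary size and the placement of the 8 special tokens are assumptions, not taken from the released code):

```python
# Sketch of the unified tokenizer id mapping: text ids come first,
# then 8 special tokens, then the shifted X-codec2 speech tokens.
TEXT_VOCAB_SIZE = 128_256   # assumed len(text tokenizer) for Llama 3.2
NUM_SPECIAL_TOKENS = 8      # the 8 extra special tokens mentioned above

def speech_to_unified(speech_token: int) -> int:
    """Shift a raw speech token into the unified id space."""
    return speech_token + TEXT_VOCAB_SIZE + NUM_SPECIAL_TOKENS

def unified_to_speech(unified_id: int) -> int:
    """Recover the raw speech token from a unified id."""
    return unified_id - TEXT_VOCAB_SIZE - NUM_SPECIAL_TOKENS

# Round trip: a raw speech token survives the shift and the inverse shift.
raw = 42
assert unified_to_speech(speech_to_unified(raw)) == raw
```

With this offset, text ids and speech ids never collide, so a single language model head can predict both modalities from one vocabulary.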
Coming Soon
Codec: xcodec2 (please install the new version, xcodec2==0.1.3)
Llasa 1b version: Llasa-1B
Llasa 1b Multilingual version: Llasa-1B-Multilingual (Not mentioned in the paper)
Llasa 3b version: Llasa-3B
Llasa 8b version: Llasa-8B