This repository contains code to train an end-to-end speech synthesis system. Currently only single-speaker models are supported, and the text frontend supports English.
The system consists of two parts:
- A Tacotron model with Dynamic Convolutional Attention, which modifies the hybrid location-sensitive attention mechanism to be purely location-based, as described in Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis, resulting in better generalization to long utterances. This model takes text (as a sequence of characters) as input and predicts a sequence of mel-spectrogram frames as output (the seq2seq model). A sketch of the attention mechanism follows this list.
- A WaveRNN-based vocoder, which takes the mel-spectrogram predicted in the previous step as input and generates a waveform as output (the vocoder model). A sketch of the vocoder core also follows this list.
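The defining property of Dynamic Convolutional Attention is that the attention energies are computed purely from the previous alignment: a set of fixed (static) filters and a set of filters predicted from the attention-RNN state (dynamic) are convolved with the previous attention weights. Below is a minimal PyTorch sketch of that idea; all layer sizes are illustrative, the prior filter from the paper is omitted for brevity, and this is not the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConvolutionAttention(nn.Module):
    """Sketch of Dynamic Convolutional Attention (illustrative sizes only).

    Energies depend only on the previous attention weights, never on the
    encoder content; the prior filter from the paper is omitted.
    """

    def __init__(self, query_dim=1024, attn_dim=128,
                 static_channels=8, static_kernel=21,
                 dynamic_channels=8, dynamic_kernel=21):
        super().__init__()
        self.dynamic_channels = dynamic_channels
        self.dynamic_kernel = dynamic_kernel
        # Fixed (static) filters convolved over the previous alignment.
        self.static_conv = nn.Conv1d(1, static_channels, static_kernel,
                                     padding=static_kernel // 2, bias=False)
        # Dynamic filters are predicted from the attention-RNN state each step.
        self.dynamic_proj = nn.Linear(query_dim,
                                      dynamic_channels * dynamic_kernel)
        self.W_static = nn.Linear(static_channels, attn_dim, bias=False)
        self.W_dynamic = nn.Linear(dynamic_channels, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, prev_alignment):
        # query: [B, query_dim], prev_alignment: [B, T]
        B, T = prev_alignment.shape
        # Static location features: [B, T, static_channels].
        f = self.static_conv(prev_alignment.unsqueeze(1)).transpose(1, 2)
        # Predict per-utterance dynamic filters, then apply them to the
        # previous alignment via a grouped conv (one group per batch item).
        filters = self.dynamic_proj(query).view(
            B * self.dynamic_channels, 1, self.dynamic_kernel)
        g = F.conv1d(prev_alignment.view(1, B, T), filters,
                     padding=self.dynamic_kernel // 2, groups=B)
        g = g.view(B, self.dynamic_channels, T).transpose(1, 2)
        energies = self.v(torch.tanh(
            self.W_static(f) + self.W_dynamic(g))).squeeze(-1)
        return F.softmax(energies, dim=-1)  # new alignment: [B, T]
```

Because there is no content term in the energies, the alignment can only move relative to where it was at the previous step, which is the property credited with the better generalization on long utterances.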
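On the vocoder side, a WaveRNN-style model predicts audio one sample at a time with a compact recurrent network conditioned on the (upsampled) mel-spectrogram. The sketch below is deliberately simplified: it uses a single softmax over 9-bit mu-law samples instead of WaveRNN's coarse/fine split, and all sizes are assumptions rather than this repository's configuration:

```python
import torch
import torch.nn as nn


class WaveRNNSketch(nn.Module):
    """Simplified WaveRNN-style vocoder core (teacher-forced training pass).

    Single categorical output over 9-bit mu-law samples; sizes are
    assumptions, not this repository's configuration.
    """

    def __init__(self, mel_dim=80, rnn_dim=512, n_classes=512):
        super().__init__()
        self.embed = nn.Embedding(n_classes, rnn_dim)  # previous audio sample
        self.mel_proj = nn.Linear(mel_dim, rnn_dim)    # local conditioning
        self.rnn = nn.GRU(rnn_dim, rnn_dim, batch_first=True)
        self.out = nn.Linear(rnn_dim, n_classes)

    def forward(self, prev_samples, mels_upsampled):
        # prev_samples: [B, T] int64; mels_upsampled: [B, T, mel_dim]
        # (mel frames must first be upsampled to one vector per audio sample)
        x = self.embed(prev_samples) + self.mel_proj(mels_upsampled)
        h, _ = self.rnn(x)
        return self.out(h)  # logits over the next sample: [B, T, n_classes]
```

At inference time the same network is run autoregressively: each generated sample is fed back in as `prev_samples` for the next step.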
All audio processing parameters, model hyperparameters, training configuration, etc. are specified in the `config/config.py` file.
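As a rough illustration of what lives in such a file, a config module of this kind usually collects entries like the following (all names and values here are hypothetical, not the repository's actual settings):

```python
# Hypothetical excerpt; see config/config.py for the real parameter names.
sampling_rate = 22050   # audio sample rate
n_fft = 2048            # FFT size used for spectrogram extraction
hop_length = 275        # hop between successive analysis frames, in samples
num_mels = 80           # number of mel-spectrogram channels
batch_size = 32         # training batch size
learning_rate = 1e-3    # optimizer learning rate
use_amp = True          # train with automatic mixed precision
```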
Both the seq2seq model and the vocoder model need to be trained separately. Training with automatic mixed precision is supported.
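Mixed-precision training in PyTorch is typically wired up with `torch.cuda.amp`; the loop below is a generic sketch with placeholder `model`, `optimizer`, and `train_loader` names, not this repository's training code:

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # loss scaler to avoid fp16 underflow

for texts, mels in train_loader:       # placeholder data loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = model(texts, mels)
    scaler.scale(loss).backward()      # backprop through the scaled loss
    scaler.step(optimizer)             # unscale gradients, then step
    scaler.update()                    # adapt the scale factor
```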
- Download and extract the dataset
  - English single-speaker dataset LJSpeech:

    ```sh
    wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
    tar -xvjf LJSpeech-1.1.tar.bz2
    ```
- Edit the configuration parameters in `config/config.py` as appropriate for the dataset to be used for training.
- Process the downloaded dataset, and split it into train and eval splits:

  ```sh
  python preprocess.py \
      --dataset_dir <Path to the root of the downloaded dataset> \
      --out_dir <Output path to write the processed dataset>
  ```
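  For example, with LJSpeech extracted into the working directory (the paths here are arbitrary):

  ```sh
  python preprocess.py --dataset_dir LJSpeech-1.1 --out_dir data/ljspeech
  ```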
- Train the Tacotron (seq2seq) model:

  ```sh
  python train_tts.py \
      --train_data_dir <Path to the processed train split> \
      --checkpoint_dir <Path to location where training checkpoints will be saved> \
      --alignments_dir <Path to the location where training alignments will be saved> \
      --resume_checkpoint_path <If specified, load checkpoint and resume training>
  ```
- Train the vocoder model:

  ```sh
  python train_vocoder.py \
      --train_data_dir <Path to the processed train split> \
      --checkpoint_dir <Path to location where training checkpoints will be saved> \
      --resume_checkpoint_path <If specified, load checkpoint and resume training>
  ```
- Prepare the text to be synthesized

  The text to be synthesized should be placed in the `synthesis.csv` file in the following format:

  ```
  ID_1|TEXT_1
  ID_2|TEXT_2
  .
  .
  .
  ```
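  For example (the IDs and sentences are arbitrary):

  ```
  utt_001|The birch canoe slid on the smooth planks.
  utt_002|Glue the sheet to the dark blue background.
  ```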
- Text-to-speech synthesis:

  ```sh
  python tts_synthesis.py \
      --synthesis_file <Path to the synthesis.csv file (created in the previous step)> \
      --seq2seq_checkpoint <Path to the trained seq2seq model to use for synthesis> \
      --vocoder_checkpoint <Path to the trained vocoder model to use for synthesis> \
      --out_dir <Path to where the synthesized waveforms will be written to disk>
  ```
This code is based on the code in the following repositories:
- Location Relative Attention Mechanisms for Robust Long-Form Speech Synthesis
- Tacotron: Towards End-To-End Speech Synthesis
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Planned future work:
- Support for multi-speaker models
- Support for Indic languages