Fast Wavenet, Curriculum Learning, T2 pipeline...
Rayhane-mamah authored May 19, 2018
1 parent 78138a3 commit ec0cd50
Showing 30 changed files with 2,612 additions and 705 deletions.
README.md (48 additions & 50 deletions)

# Tacotron-2:
Tensorflow implementation of DeepMind's Tacotron-2, a deep neural network architecture described in this paper: [Natural TTS synthesis by conditioning Wavenet on MEL spectrogram predictions](https://arxiv.org/pdf/1712.05884.pdf)


# Repository Structure:
├── LJSpeech-1.1 (0)
│   └── wavs
├── logs-Tacotron (2)
│   ├── eval_-dir
│   │   ├── plots
│   │   └── wavs
│   ├── mel-spectrograms
│   ├── plots
│   ├── pretrained
│   └── wavs
├── logs-Wavenet (4)
│   ├── eval-dir
│   │   ├── plots
│   │   └── wavs
│   ├── plots
│   ├── pretrained
│   └── wavs
├── papers
├── tacotron
│   ├── models
│   └── utils
├── tacotron_output (3)
│   ├── eval
│   ├── gta
│   ├── logs-eval
│   │   ├── plots
│   │   └── wavs
│   └── natural
├── wavenet_output (5)
│   ├── plots
│   └── wavs
├── training_data (1)
│   ├── audio
│   ├── linear
│   └── mels
└── wavenet_vocoder
└── models




The previous tree shows the current state of the repository (separate training, one step at a time).

- Step **(0)**: Get your dataset; here we use **Ljspeech**, **en_US** and **en_UK** (from **M-AILABS**) as examples.
- Step **(1)**: Preprocess your data. This will give you the **training_data** folder.
- Step **(2)**: Train your Tacotron model. Yields the **logs-Tacotron** folder.
- Step **(3)**: Synthesize/Evaluate the Tacotron model. Gives the **tacotron_output** folder.
- Step **(4)**: Train your Wavenet model. Yields the **logs-Wavenet** folder.
- Step **(5)**: Synthesize audio using the Wavenet model. Gives the **wavenet_output** folder.
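
As a quick reference, the steps above map to the following commands (all detailed in the sections below, default arguments assumed):

- Step **(1)**: `python preprocess.py`
- Step **(2)**: `python train.py --model='Tacotron'`
- Step **(3)**: `python synthesize.py --model='Tacotron' --GTA=True`
- Step **(4)**: `python train.py --model='WaveNet'`
- Step **(5)**: `python synthesize.py --model='WaveNet'`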


Note:
- **Our preprocessing only supports Ljspeech and Ljspeech-like datasets (M-AILABS speech data)!** If running on datasets stored differently, you will probably need to make your own preprocessing script.
- In the previous tree, files **were not represented** and **max depth was set to 3** for simplicity.
- If you train both **models at the same time**, the repository structure will be different.

# Model Architecture:
To have an overview of our advance on this project, please refer to [this discussion].

Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during the wavenet training.
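
Conceptually, the two stages chain into a single text-to-speech pipeline once both are trained. A purely illustrative sketch of that flow (the object and method names below are hypothetical, not this repository's actual API):

```python
# Hypothetical outline of the two-stage flow described above (not the repo's API).
def train_pipeline(dataset, tacotron, wavenet):
    tacotron.fit(dataset.text, dataset.mels)     # 1) train the feature prediction model
    gta_mels = tacotron.predict(dataset.text)    # 2) GTA-synthesize the mels the vocoder will see
    wavenet.fit(gta_mels, dataset.audio)         # 3) train the wavenet vocoder on predicted mels

def text_to_speech(text, tacotron, wavenet):
    mel = tacotron.predict(text)                 # text -> mel spectrogram
    return wavenet.generate(mel)                 # mel spectrogram -> waveform
```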

# How to start
First, you need to have Python 3 installed along with [Tensorflow](https://www.tensorflow.org/install/).

Next, you can install the requirements. If you are an Anaconda user (otherwise replace **pip** with **pip3** and **python** with **python3**):

> pip install -r requirements.txt
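
A quick way to check that TensorFlow is importable from your environment (an optional sanity check):

> python -c "import tensorflow as tf; print(tf.__version__)"
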
# Dataset:
We tested the code above on the [ljspeech dataset](https://keithito.com/LJ-Speech-Dataset/), which has almost 24 hours of labeled recordings from a single female speaker. (Further information on the dataset is available in the README file you get when you download it.)

Preprocessing can then be started using:

> python preprocess.py
The dataset can be chosen using the **--dataset** argument. If using the M-AILABS dataset, you need to provide the **language, voice, reader, merge_books and book arguments** to fit your needs. The default is **Ljspeech**.

Example M-AILABS:
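
An illustrative invocation (the language, voice, reader and book values are placeholders; adapt them to your data):

> python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'
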
or if you want to use all books for a single speaker:
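
(again with placeholder argument values):

> python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True
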
This should take no longer than a **few minutes.**

# Training:
To **train both models** sequentially (one after the other):

> python train.py --model='Tacotron-2'

or:

> python train.py --model='Both'

The feature prediction model can be **trained separately** using:

> python train.py --model='Tacotron'

Checkpoints will be made every **250 steps** and stored under the **logs-Tacotron** folder.

Naturally, **training the wavenet separately** is done by:

> python train.py --model='WaveNet'

Logs will be stored inside **logs-Wavenet**.
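
Assuming training writes TensorBoard summaries under these log folders (check your training logs to confirm), progress can be monitored with:

> tensorboard --logdir=logs-Tacotron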

**Note:**
- If the model argument is not provided, training will default to Tacotron-2 model training (both models).
- Please refer to train arguments under [train.py](https://github.com/Rayhane-mamah/Tacotron-2/blob/master/train.py) for a set of options you can use.

# Synthesis
To **synthesize audio** in an **End-to-End** (text to audio) manner (both models at work):

> python synthesize.py --model='Tacotron-2'

For the spectrogram prediction network (separately), there are **three types** of mel spectrogram synthesis:

- **Evaluation** (synthesis on custom sentences). This is what we'll usually use after having a full end-to-end model.

> python synthesize.py --model='Tacotron' --mode='eval'

- **Natural synthesis** (let the model make predictions alone by feeding last decoder output to the next time step).

> python synthesize.py --model='Tacotron' --GTA=False

- **Ground Truth Aligned synthesis** (DEFAULT: the model is assisted by true labels in a teacher forcing manner). This synthesis method is used to predict the mel spectrograms on which the wavenet vocoder is trained. (It yields better results, as stated in the paper.)

> python synthesize.py --model='Tacotron' --GTA=True
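
For intuition, the only difference between natural and GTA synthesis is what gets fed back into the decoder at each time step. A minimal, self-contained sketch with a toy NumPy "decoder" (not the repository's actual Tacotron code):

```python
import numpy as np

def decode(gt_frames, step_fn, teacher_forcing=True):
    """Toy autoregressive decoding loop over mel frames."""
    prev = np.zeros_like(gt_frames[0])            # <GO> frame
    outputs = []
    for t in range(len(gt_frames)):
        pred = step_fn(prev)                      # predict frame t from the previous frame
        outputs.append(pred)
        # GTA (teacher forcing): feed the ground-truth frame back in.
        # Natural synthesis: feed the model's own prediction back in.
        prev = gt_frames[t] if teacher_forcing else pred
    return np.stack(outputs)

# Stand-in "model": a fixed linear map instead of the real decoder step.
rng = np.random.default_rng(0)
W = rng.standard_normal((80, 80)) * 0.01
step = lambda x: x @ W

mels = rng.standard_normal((20, 80))              # fake ground-truth mel frames
gta_mels = decode(mels, step, teacher_forcing=True)
natural_mels = decode(mels, step, teacher_forcing=False)
```

GTA outputs stay aligned with the ground truth, which is why they are the ones used to train the wavenet vocoder.
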
Synthesizing the **waveforms** conditioned on previously synthesized Mel-spectrograms (separately) can be done with:

> python synthesize.py --model='WaveNet'

**Note:**
- If the model argument is not provided, synthesis will default to Tacotron-2 model synthesis (End-to-End TTS).
- If the mode argument is not provided, synthesis defaults to Ground Truth Aligned synthesis.
- Please refer to synthesis arguments under [synthesize.py](https://github.com/Rayhane-mamah/Tacotron-2/blob/master/synthesize.py) for a set of options you can use.

# Pretrained model and Samples:
Pre-trained models and audio samples will be added at a later date. You can, however, check some preliminary insights into the model's performance (at early stages of training) [here](https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-378741465).


# References and Resources:
- [Natural TTS synthesis by conditioning Wavenet on MEL spectrogram predictions](https://arxiv.org/pdf/1712.05884.pdf)
- [Original tacotron paper](https://arxiv.org/pdf/1703.10135.pdf)
- [Attention-Based Models for Speech Recognition](https://arxiv.org/pdf/1506.07503.pdf)
- [Wavenet: A generative model for raw audio](https://arxiv.org/pdf/1609.03499.pdf)
- [Fast Wavenet](https://arxiv.org/pdf/1611.09482.pdf)
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder)
- [keithito/tacotron](https://github.com/keithito/tacotron)

**Work in progress**