Wavenet prep/correction, Global cond, GTA train
- It is now possible to run wavenet preprocessing on its own, so wavenet can be used as a standalone model (this omits GTA training)
- Wavenet synthesis has been fixed: Rayhane-mamah#106
- Added global conditioning (provided you write the speaker_id rules during preprocessing)
- Added GTA training function
Rayhane-mamah authored Aug 4, 2018
1 parent e2f9780 commit 87bedae
Showing 18 changed files with 529 additions and 178 deletions.
17 changes: 10 additions & 7 deletions README.md
@@ -66,6 +66,9 @@ Note:
- In the previous tree, files **were not represented** and **max depth was set to 3** for simplicity.
- If you run training of both **models at the same time**, the repository structure will be different.

# Pretrained model and Samples:
Pre-trained models and audio samples will be added at a later date. You can, however, check some preliminary insights into the model's performance (at early stages of training) [here](https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-378741465). THIS IS VERY OUTDATED, I WILL UPDATE THIS SOON.

# Model Architecture:
<p align="center">
<img src="https://preview.ibb.co/bU8sLS/Tacotron_2_Architecture.png"/>
@@ -97,6 +100,11 @@ We are also running current tests on the [new M-AILABS speech dataset](http://ww

After **downloading** the dataset, **extract** the compressed file, and **place the folder inside the cloned repository.**

# Hparams setting:
Before proceeding, you must pick the hyperparameters that best suit your needs. While it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes once and for all directly in the **hparams.py** file.

To pick optimal FFT parameters, I have made a **griffin_lim_synthesis_tool** notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are well explained in **hparams.py** and have meaningful names, so you can easily experiment with them. A minimal sketch of this kind of check is shown below.
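For intuition only, here is a rough, self-contained sketch of the kind of check the notebook performs, using librosa directly rather than the repository's audio module; the wav path and parameter values are placeholders to adapt to your own setup.

# Rough FFT-parameter sanity check (sketch only, not the repository's notebook).
# Assumes librosa and soundfile are installed; the wav path is a placeholder.
import librosa
import numpy as np
import soundfile as sf

wav, sr = librosa.load('LJSpeech-1.1/wavs/LJ001-0001.wav', sr=24000)
n_fft, hop_size, win_size, power = 2048, 300, 1200, 1.5  # mirror your hparams.py choices

# Magnitude spectrogram with the candidate parameters
S = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_size, win_length=win_size))

# Griffin-Lim phase reconstruction: if this already sounds bad, the FFT
# parameters (not the model) are the bottleneck.
y = librosa.griffinlim(S ** power, n_iter=60, hop_length=hop_size, win_length=win_size)
sf.write('griffin_lim_check.wav', y / np.abs(y).max(), sr)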

# Preprocessing
Before running the following steps, please make sure you are inside **Tacotron-2 folder**

@@ -123,15 +131,12 @@ To **train both models** sequentially (one after the other):

> python train.py --model='Tacotron-2'
or:

> python train.py --model='Both'

The feature prediction model can be **separately trained** using:

> python train.py --model='Tacotron'
checkpoints will be made each **250 steps** and stored under **logs-Tacotron folder.**
checkpoints will be made every **5000 steps** and stored under the **logs-Tacotron folder.**

Naturally, **training the wavenet separately** is done by:

@@ -142,6 +147,7 @@ logs will be stored inside **logs-Wavenet**.
**Note:**
- If the model argument is not provided, training will default to Tacotron-2 model training (both models).
- Please refer to train arguments under [train.py](https://github.com/Rayhane-mamah/Tacotron-2/blob/master/train.py) for a set of options you can use.
- It is now possible to run wavenet preprocessing on its own using **wavenet_preprocess.py** (see the example below).
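For example, assuming the script's default arguments (check its argument parser for dataset and output-folder options before relying on this exact form):

> python wavenet_preprocess.py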

# Synthesis
To **synthesize audio** in an **End-to-End** (text to audio) manner (both models at work):
@@ -171,9 +177,6 @@ Synthesizing the **waveforms** conditioned on previously synthesized Mel-spectr
- If the model argument is not provided, synthesis will default to Tacotron-2 model synthesis (End-to-End TTS).
- Please refer to synthesis arguments under [synthesize.py](https://github.com/Rayhane-mamah/Tacotron-2/blob/master/synthesize.py) for a set of options you can use.

# Pretrained model and Samples:
Pre-trained models and audio samples will be added at a later date. You can however check some primary insights of the model performance (at early stages of training) [here](https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-378741465).


# References and Resources:
- [Natural TTS synthesis by conditioning Wavenet on Mel spectrogram predictions](https://arxiv.org/pdf/1712.05884.pdf)
11 changes: 6 additions & 5 deletions datasets/preprocessor.py
@@ -32,9 +32,10 @@ def build_from_path(hparams, input_dirs, mel_dir, linear_dir, wav_dir, n_jobs=12
with open(os.path.join(input_dir, 'metadata.csv'), encoding='utf-8') as f:
for line in f:
parts = line.strip().split('|')
wav_path = os.path.join(input_dir, 'wavs', '{}.wav'.format(parts[0]))
basename = parts[0]
wav_path = os.path.join(input_dir, 'wavs', '{}.wav'.format(basename))
text = parts[2]
futures.append(executor.submit(partial(_process_utterance, mel_dir, linear_dir, wav_dir, index, wav_path, text, hparams)))
futures.append(executor.submit(partial(_process_utterance, mel_dir, linear_dir, wav_dir, basename, wav_path, text, hparams)))
index += 1

return [future.result() for future in tqdm(futures) if future.result() is not None]
@@ -130,9 +131,9 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
time_steps = len(out)

# Write the spectrogram and audio to disk
audio_filename = 'speech-audio-{:05d}.npy'.format(index)
mel_filename = 'speech-mel-{:05d}.npy'.format(index)
linear_filename = 'speech-linear-{:05d}.npy'.format(index)
audio_filename = 'audio-{}.npy'.format(index)
mel_filename = 'mel-{}.npy'.format(index)
linear_filename = 'linear-{}.npy'.format(index)
np.save(os.path.join(wav_dir, audio_filename), out.astype(out_dtype), allow_pickle=False)
np.save(os.path.join(mel_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)
np.save(os.path.join(linear_dir, linear_filename), linear_spectrogram.T, allow_pickle=False)
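For illustration only (not part of the diff): with the change above, output files are keyed by the metadata basename rather than a zero-padded running index, so each .npy file can be traced straight back to its source wav. For an LJSpeech entry such as LJ001-0001:

# Hypothetical example of the new naming scheme
basename = 'LJ001-0001'
print('audio-{}.npy'.format(basename))   # audio-LJ001-0001.npy
print('mel-{}.npy'.format(basename))     # mel-LJ001-0001.npy
print('linear-{}.npy'.format(basename))  # linear-LJ001-0001.npy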
134 changes: 134 additions & 0 deletions datasets/wavenet_preprocessor.py
@@ -0,0 +1,134 @@
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from datasets import audio
import os
import numpy as np
from wavenet_vocoder.util import mulaw_quantize, mulaw, is_mulaw, is_mulaw_quantize


def build_from_path(hparams, input_dir, mel_dir, wav_dir, n_jobs=12, tqdm=lambda x: x):
"""
Preprocesses the speech dataset from a gven input path to given output directories
Args:
- hparams: hyper parameters
- input_dir: input directory that contains the files to prerocess
- mel_dir: output directory of the preprocessed speech mel-spectrogram dataset
- linear_dir: output directory of the preprocessed speech linear-spectrogram dataset
- wav_dir: output directory of the preprocessed speech audio dataset
- n_jobs: Optional, number of worker process to parallelize across
- tqdm: Optional, provides a nice progress bar
Returns:
- A list of tuple describing the train examples. this should be written to train.txt
"""

# We use ProcessPoolExecutor to parallelize across processes; this is just for
# optimization purposes and can be omitted
executor = ProcessPoolExecutor(max_workers=n_jobs)
futures = []
for file in os.listdir(input_dir):
wav_path = os.path.join(input_dir, file)
basename = os.path.basename(wav_path).replace('.wav', '')
futures.append(executor.submit(partial(_process_utterance, mel_dir, wav_dir, basename, wav_path, hparams)))

return [future.result() for future in tqdm(futures) if future.result() is not None]


def _process_utterance(mel_dir, wav_dir, index, wav_path, hparams):
"""
Preprocesses a single utterance wav/text pair
this writes the mel scale spectogram to disk and return a tuple to write
to the train.txt file
Args:
- mel_dir: the directory to write the mel spectograms into
- linear_dir: the directory to write the linear spectrograms into
- wav_dir: the directory to write the preprocessed wav into
- index: the numeric index to use in the spectrogram filename
- wav_path: path to the audio file containing the speech input
- text: text spoken in the input audio file
- hparams: hyper parameters
Returns:
- A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, linear_frames, text)
"""
try:
# Load the audio as numpy array
wav = audio.load_wav(wav_path, sr=hparams.sample_rate)
except FileNotFoundError: #catch missing wav exception
print('file {} is not present in the wav folder. skipping!'.format(
wav_path))
return None

#rescale wav
if hparams.rescale:
wav = wav / np.abs(wav).max() * hparams.rescaling_max

#M-AILABS extra silence specific
if hparams.trim_silence:
wav = audio.trim_silence(wav, hparams)

#Mu-law quantize
if is_mulaw_quantize(hparams.input_type):
#[0, quantize_channels)
out = mulaw_quantize(wav, hparams.quantize_channels)

#Trim silences
start, end = audio.start_and_end_indices(out, hparams.silence_threshold)
wav = wav[start: end]
out = out[start: end]

constant_values = mulaw_quantize(0, hparams.quantize_channels)
out_dtype = np.int16

elif is_mulaw(hparams.input_type):
#[-1, 1]
out = mulaw(wav, hparams.quantize_channels)
constant_values = mulaw(0., hparams.quantize_channels)
out_dtype = np.float32

else:
#[-1, 1]
out = wav
constant_values = 0.
out_dtype = np.float32

# Compute the mel scale spectrogram from the wav
mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
mel_frames = mel_spectrogram.shape[1]

if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length:
return None

#Ensure time resolution adjustment between audio and mel-spectrogram
fft_size = hparams.n_fft if hparams.win_size is None else hparams.win_size
l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams))

#Zero pad for quantized signal
out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)
assert len(out) >= mel_frames * audio.get_hop_size(hparams)

#time resolution adjustment
#ensure length of raw audio is multiple of hop size so that we can use
#transposed convolution to upsample
out = out[:mel_frames * audio.get_hop_size(hparams)]
assert len(out) % audio.get_hop_size(hparams) == 0
time_steps = len(out)

# Write the spectrogram and audio to disk
audio_filename = os.path.join(wav_dir, 'audio-{}.npy'.format(index))
mel_filename = os.path.join(mel_dir, 'mel-{}.npy'.format(index))
np.save(audio_filename, out.astype(out_dtype), allow_pickle=False)
np.save(mel_filename, mel_spectrogram.T, allow_pickle=False)

#global condition features
if hparams.gin_channels > 0:
raise RuntimeError('When activating global conditions, please set your speaker_id rules in line 128 of datasets/wavenet_preprocessor.py to use them during training')
speaker_id = '<no_g>' #put the rule to determine how to assign speaker ids (using file names maybe? file basenames are available in "index" variable)
else:
speaker_id = '<no_g>'

# Return a tuple describing this training example
return (audio_filename, mel_filename, '_', speaker_id, time_steps, mel_frames)
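The global-conditioning branch above only raises a RuntimeError until you supply a rule. As an illustration only (not part of the committed file), one possible stateless rule for a corpus whose basenames carry a speaker prefix, e.g. VCTK-style 'p225_001', could look like this; the helper name, the speaker list, and the naming convention are all assumptions to adapt to your own dataset.

# Hypothetical speaker_id rule for basenames like 'p225_001' (VCTK-style naming).
# It should be deterministic and stateless, since _process_utterance runs inside
# parallel worker processes; ids must stay in [0, hparams.n_speakers).
KNOWN_SPEAKERS = ['p225', 'p226', 'p227']  # example ordering, fixed once up front

def speaker_id_from_basename(basename):
    prefix = basename.split('_')[0]            # 'p225_001' -> 'p225'
    if prefix in KNOWN_SPEAKERS:
        return KNOWN_SPEAKERS.index(prefix)    # row of the speaker embedding
    return '<no_g>'                            # fall back to no global condition

# Inside _process_utterance, this would replace the placeholder assignment in the
# gin_channels > 0 branch (after removing the RuntimeError):
# speaker_id = speaker_id_from_basename(index)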
54 changes: 29 additions & 25 deletions hparams.py
@@ -15,30 +15,30 @@

#Audio
num_mels = 80, #Number of mel-spectrogram channels and local conditioning dimensionality
num_freq = 513, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing network
num_freq = 1025, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing network
rescale = True, #Whether to rescale audio prior to preprocessing
rescaling_max = 0.999, #Rescaling value
trim_silence = True, #Whether to clip silence in Audio (at beginning and end of audio only, not the middle)
clip_mels_length = True, #For cases of OOM (Not really recommended, working on a workaround)
max_mel_frames = 900, #Only relevant when clip_mels_length = True
max_mel_frames = 1100, #Only relevant when clip_mels_length = True

# Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
# It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
# Does not work if n_fft is not a multiple of hop_size!!
use_lws=True,
use_lws=False,
silence_threshold=2, #silence threshold used for sound trimming for wavenet preprocessing

#Mel spectrogram
n_fft = 1024, #Extra window size is filled with 0 paddings to match this parameter
hop_size = 256, #For 22050Hz, 275 ~= 12.5 ms
win_size = None, #For 22050Hz, 1100 ~= 50 ms (If None, win_size = n_fft)
sample_rate = 22050, #22050 Hz (corresponding to ljspeech dataset)
n_fft = 2048, #Extra window size is filled with 0 paddings to match this parameter
hop_size = 300, #For 24000Hz, 300 = 12.5 ms
win_size = 1200, #For 24000Hz, 1200 = 50 ms (If None, win_size = n_fft)
sample_rate = 24000, #24000 Hz (note: ljspeech is natively 22050 Hz, audio is resampled on load)
frame_shift_ms = None,

#M-AILABS (and other datasets) trim params
trim_fft_size = 512,
trim_hop_size = 128,
trim_top_db = 60,
trim_top_db = 23,

#Mel and Linear spectrograms normalization/scaling and clipping
signal_normalization = True,
@@ -49,11 +49,11 @@
#Limits
min_level_db = -100,
ref_level_db = 20,
fmin = 25, #Set this to 75 if your speaker is male! if female, 125 should help taking off noise. (To test depending on dataset)
fmin = 0, #Set this to 75 if your speaker is male! If female, 125 should help remove noise. (To test depending on dataset)
fmax = 7600,

#Griffin Lim
power = 1.2,
power = 1.5,
griffin_lim_iters = 60,
###########################################################################################################################################

@@ -77,17 +77,17 @@
prenet_layers = [256, 256], #number of layers and number of units of prenet
decoder_layers = 2, #number of decoder lstm layers
decoder_lstm_units = 1024, #number of decoder lstm units on each layer
max_iters = 2500, #Max decoder steps during inference (Just for safety from infinite loop cases)
max_iters = 1000, #Max decoder steps during inference (Just for safety from infinite loop cases)

postnet_num_layers = 5, #number of postnet convolutional layers
postnet_kernel_size = (5, ), #size of postnet convolution filters for each layer
postnet_channels = 512, #number of postnet convolution filters for each layer

mask_encoder = True, #whether to mask encoder padding while computing attention
mask_decoder = True, #Whether to use loss mask for padded sequences (if False, <stop_token> loss function will not be weighted, else recommended pos_weight = 20)
mask_encoder = False, #whether to mask encoder padding while computing attention
mask_decoder = False, #Whether to use loss mask for padded sequences (if False, <stop_token> loss function will not be weighted, else recommended pos_weight = 20)

cross_entropy_pos_weight = 20, #Use class weights to reduce the stop token classes imbalance (by adding more penalty on False Negatives (FN)) (1 = disabled)
predict_linear = False, #Whether to add a post-processing network to the Tacotron to predict linear spectrograms (True mode Not tested!!)
cross_entropy_pos_weight = 1, #Use class weights to reduce the stop token classes imbalance (by adding more penalty on False Negatives (FN)) (1 = disabled)
predict_linear = True, #Whether to add a post-processing network to the Tacotron to predict linear spectrograms (True mode Not tested!!)
###########################################################################################################################################


@@ -105,30 +105,33 @@
log_scale_min=float(np.log(1e-14)), #Mixture of logistic distributions minimal log scale

out_channels = 10 * 3, #This should be equal to quantize channels when input type is 'mulaw-quantize' else: num_distributions * 3 (prob, mean, log_scale)
layers = 24, #Number of dilated convolutions (Default: Simplified Wavenet of Tacotron-2 paper)
stacks = 4, #Number of dilated convolution stacks (Default: Simplified Wavenet of Tacotron-2 paper)
layers = 30, #Number of dilated convolutions (Default: Simplified Wavenet of Tacotron-2 paper)
stacks = 3, #Number of dilated convolution stacks (Default: Simplified Wavenet of Tacotron-2 paper)
residual_channels = 512,
gate_channels = 512, #split in 2 in gated convolutions
skip_out_channels = 256,
kernel_size = 3,

cin_channels = 80, #Set this to -1 to disable local conditioning, else it must be equal to num_mels!!
upsample_conditional_features = True, #Whether to repeat conditional features or upsample them (The latter is recommended)
upsample_scales = [16, 16], #prod(scales) should be equal to hop size
upsample_scales = [5, 5, 4, 3], #prod(scales) should be equal to hop size
freq_axis_kernel_size = 3,

gin_channels = -1, #Set this to -1 to disable global conditioning, Only used for multi speaker dataset
gin_channels = -1, #Set this to -1 to disable global conditioning, Only used for multi speaker dataset. It defines the depth of the embeddings (Recommended: 512)
use_speaker_embedding = True, #whether to make a speaker embedding
n_speakers = 6, #number of speakers (rows of the embedding)

use_bias = True, #Whether to use bias in convolutional layers of the Wavenet

max_time_sec = None,
max_time_steps = 13000, #Max time steps in audio used to train wavenet (decrease to save memory)
max_time_steps = 8000, #Max time steps in audio used to train wavenet (decrease to save memory)
###########################################################################################################################################

#Tacotron Training
tacotron_random_seed = 5339, #Determines initial graph and operations (i.e: model) random state for reproducibility
tacotron_swap_with_cpu = False, #Whether to use cpu as support to gpu for decoder computation (Not recommended: may cause major slowdowns! Only use when critical!)

tacotron_batch_size = 48, #number of training samples on each training steps
tacotron_batch_size = 32, #number of training samples on each training steps
tacotron_reg_weight = 1e-6, #regularization weight (for L2 regularization)
tacotron_scale_regularization = True, #Whether to rescale regularization weight to adapt for outputs range (used when reg_weight is high and biasing the model)

@@ -138,8 +141,8 @@

tacotron_decay_learning_rate = True, #boolean, determines if the learning rate will follow an exponential decay
tacotron_start_decay = 50000, #Step at which learning decay starts
tacotron_decay_steps = 40000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.2, #learning rate decay rate (UNDER TEST)
tacotron_decay_steps = 50000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.4, #learning rate decay rate (UNDER TEST)
tacotron_initial_learning_rate = 1e-3, #starting learning rate
tacotron_final_learning_rate = 1e-5, #minimal learning rate

@@ -150,6 +153,7 @@
tacotron_zoneout_rate = 0.1, #zoneout rate for all LSTM cells in the network
tacotron_dropout_rate = 0.5, #dropout rate for all convolutional layers + prenet

tacotron_clip_gradients = False, #whether to clip gradients
natural_eval = False, #Whether to use 100% natural eval (to evaluate Curriculum Learning performance) or with same teacher-forcing ratio as in training (just for overfit)

#Decoder RNN learning can take be done in one of two ways:
@@ -176,10 +180,10 @@
wavenet_test_batches = None, #number of test batches.
wavenet_data_random_state = 1234, #random state for train test split repeatability

wavenet_learning_rate = 1e-4,
wavenet_learning_rate = 1e-3,
wavenet_adam_beta1 = 0.9,
wavenet_adam_beta2 = 0.999,
wavenet_adam_epsilon = 1e-6,
wavenet_adam_epsilon = 1e-8,

wavenet_ema_decay = 0.9999, #decay rate of exponential moving average
