Wavenet prep/correction, Global cond, GTA train
- It is now possible to run wavenet preprocessing on its own, so wavenet can be used as a standalone model (this omits GTA training)
- Wavenet synthesis has been fixed: Rayhane-mamah#106
- Added global conditioning (provided you write the speaker_id rules during preprocessing)
- Added GTA training function
Rayhane-mamah authored Aug 4, 2018
1 parent e2f9780 commit 87bedae
Showing 18 changed files with 529 additions and 178 deletions.
17 changes: 10 additions & 7 deletions README.md
@@ -66,6 +66,9 @@ Note:
- In the previous tree, files **were not represented** and **max depth was set to 3** for simplicity.
- If you run training of both **models at the same time**, the repository structure will be different.

# Pretrained model and Samples:
Pre-trained models and audio samples will be added at a later date. You can, however, check some preliminary insights into the model's performance (at early stages of training) [here](https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-378741465). THIS IS VERY OUTDATED, I WILL UPDATE THIS SOON.

# Model Architecture:
<p align="center">
<img src="https://preview.ibb.co/bU8sLS/Tacotron_2_Architecture.png"/>
@@ -97,6 +100,11 @@ We are also running current tests on the [new M-AILABS speech dataset](http://ww

After **downloading** the dataset, **extract** the compressed file, and **place the folder inside the cloned repository.**

# Hparams setting:
Before proceeding, you must pick the hyperparameters that best suit your needs. While it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes once and for all directly in the **hparams.py** file.

To pick optimal FFT parameters, I have made a **griffin_lim_synthesis_tool** notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are well explained in **hparams.py** and have meaningful names, so you can easily experiment with them. A minimal sketch of this kind of check is shown below.
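For intuition only, here is a rough, self-contained sketch of the kind of check the notebook performs, using librosa directly rather than the repository's audio module; the wav path and parameter values are placeholders to adapt to your own setup.

# Rough FFT-parameter sanity check (sketch only, not the repository's notebook).
# Assumes librosa and soundfile are installed; the wav path is a placeholder.
import librosa
import numpy as np
import soundfile as sf

wav, sr = librosa.load('LJSpeech-1.1/wavs/LJ001-0001.wav', sr=24000)
n_fft, hop_size, win_size, power = 2048, 300, 1200, 1.5  # mirror your hparams.py choices

# Magnitude spectrogram with the candidate parameters
S = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_size, win_length=win_size))

# Griffin-Lim phase reconstruction: if this already sounds bad, the FFT
# parameters (not the model) are the bottleneck.
y = librosa.griffinlim(S ** power, n_iter=60, hop_length=hop_size, win_length=win_size)
sf.write('griffin_lim_check.wav', y / np.abs(y).max(), sr)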

# Preprocessing
Before running the following steps, please make sure you are inside **Tacotron-2 folder**

@@ -123,15 +131,12 @@ To **train both models** sequentially (one after the other):

> python train.py --model='Tacotron-2'
or:

> python train.py --model='Both'

The feature prediction model can be **separately trained** using:

> python train.py --model='Tacotron'
checkpoints will be made each **250 steps** and stored under **logs-Tacotron folder.**
checkpoints will be made every **5000 steps** and stored under the **logs-Tacotron folder.**

Naturally, **training the wavenet separately** is done by:

@@ -142,6 +147,7 @@ logs will be stored inside **logs-Wavenet**.
**Note:**
- If the model argument is not provided, training will default to Tacotron-2 model training (both models).
- Please refer to train arguments under [train.py](https://github.com/Rayhane-mamah/Tacotron-2/blob/master/train.py) for a set of options you can use.
- It is now possible to run wavenet preprocessing on its own using **wavenet_preprocess.py** (see the example below).
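For example, assuming the script's default arguments (check its argument parser for dataset and output-folder options before relying on this exact form):

> python wavenet_preprocess.py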

# Synthesis
To **synthesize audio** in an **End-to-End** (text to audio) manner (both models at work):
@@ -171,9 +177,6 @@ Synthesizing the **waveforms** conditioned on previously synthesized Mel-spectr
- If the model argument is not provided, synthesis will default to Tacotron-2 model synthesis (End-to-End TTS).
- Please refer to synthesis arguments under [synthesize.py](https://github.com/Rayhane-mamah/Tacotron-2/blob/master/synthesize.py) for a set of options you can use.

# Pretrained model and Samples:
Pre-trained models and audio samples will be added at a later date. You can however check some primary insights of the model performance (at early stages of training) [here](https://github.com/Rayhane-mamah/Tacotron-2/issues/4#issuecomment-378741465).


# References and Resources:
- [Natural TTS synthesis by conditioning Wavenet on Mel spectrogram predictions](https://arxiv.org/pdf/1712.05884.pdf)
11 changes: 6 additions & 5 deletions datasets/preprocessor.py
@@ -32,9 +32,10 @@ def build_from_path(hparams, input_dirs, mel_dir, linear_dir, wav_dir, n_jobs=12
with open(os.path.join(input_dir, 'metadata.csv'), encoding='utf-8') as f:
for line in f:
parts = line.strip().split('|')
wav_path = os.path.join(input_dir, 'wavs', '{}.wav'.format(parts[0]))
basename = parts[0]
wav_path = os.path.join(input_dir, 'wavs', '{}.wav'.format(basename))
text = parts[2]
futures.append(executor.submit(partial(_process_utterance, mel_dir, linear_dir, wav_dir, index, wav_path, text, hparams)))
futures.append(executor.submit(partial(_process_utterance, mel_dir, linear_dir, wav_dir, basename, wav_path, text, hparams)))
index += 1

return [future.result() for future in tqdm(futures) if future.result() is not None]
@@ -130,9 +131,9 @@ def _process_utterance(mel_dir, linear_dir, wav_dir, index, wav_path, text, hpar
time_steps = len(out)

# Write the spectrogram and audio to disk
audio_filename = 'speech-audio-{:05d}.npy'.format(index)
mel_filename = 'speech-mel-{:05d}.npy'.format(index)
linear_filename = 'speech-linear-{:05d}.npy'.format(index)
audio_filename = 'audio-{}.npy'.format(index)
mel_filename = 'mel-{}.npy'.format(index)
linear_filename = 'linear-{}.npy'.format(index)
np.save(os.path.join(wav_dir, audio_filename), out.astype(out_dtype), allow_pickle=False)
np.save(os.path.join(mel_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)
np.save(os.path.join(linear_dir, linear_filename), linear_spectrogram.T, allow_pickle=False)
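For illustration only (not part of the diff): with the change above, output files are keyed by the metadata basename rather than a zero-padded running index, so each .npy file can be traced straight back to its source wav. For an LJSpeech entry such as LJ001-0001:

# Hypothetical example of the new naming scheme
basename = 'LJ001-0001'
print('audio-{}.npy'.format(basename))   # audio-LJ001-0001.npy
print('mel-{}.npy'.format(basename))     # mel-LJ001-0001.npy
print('linear-{}.npy'.format(basename))  # linear-LJ001-0001.npy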
134 changes: 134 additions & 0 deletions datasets/wavenet_preprocessor.py
@@ -0,0 +1,134 @@
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from datasets import audio
import os
import numpy as np
from wavenet_vocoder.util import mulaw_quantize, mulaw, is_mulaw, is_mulaw_quantize


def build_from_path(hparams, input_dir, mel_dir, wav_dir, n_jobs=12, tqdm=lambda x: x):
"""
Preprocesses the speech dataset from a gven input path to given output directories
Args:
- hparams: hyper parameters
- input_dir: input directory that contains the files to prerocess
- mel_dir: output directory of the preprocessed speech mel-spectrogram dataset
- linear_dir: output directory of the preprocessed speech linear-spectrogram dataset
- wav_dir: output directory of the preprocessed speech audio dataset
- n_jobs: Optional, number of worker process to parallelize across
- tqdm: Optional, provides a nice progress bar
Returns:
- A list of tuple describing the train examples. this should be written to train.txt
"""

# We use ProcessPoolExecutor to parallelize across processes; this is just for
# optimization purposes and can be omitted
executor = ProcessPoolExecutor(max_workers=n_jobs)
futures = []
for file in os.listdir(input_dir):
wav_path = os.path.join(input_dir, file)
basename = os.path.basename(wav_path).replace('.wav', '')
futures.append(executor.submit(partial(_process_utterance, mel_dir, wav_dir, basename, wav_path, hparams)))

return [future.result() for future in tqdm(futures) if future.result() is not None]


def _process_utterance(mel_dir, wav_dir, index, wav_path, hparams):
"""
Preprocesses a single utterance wav/text pair
this writes the mel scale spectogram to disk and return a tuple to write
to the train.txt file
Args:
- mel_dir: the directory to write the mel spectograms into
- linear_dir: the directory to write the linear spectrograms into
- wav_dir: the directory to write the preprocessed wav into
- index: the numeric index to use in the spectrogram filename
- wav_path: path to the audio file containing the speech input
- text: text spoken in the input audio file
- hparams: hyper parameters
Returns:
- A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, linear_frames, text)
"""
try:
# Load the audio as numpy array
wav = audio.load_wav(wav_path, sr=hparams.sample_rate)
except FileNotFoundError: #catch missing wav exception
print('file {} is not present in the wav folder. skipping!'.format(
wav_path))
return None

#rescale wav
if hparams.rescale:
wav = wav / np.abs(wav).max() * hparams.rescaling_max

#M-AILABS extra silence specific
if hparams.trim_silence:
wav = audio.trim_silence(wav, hparams)

#Mu-law quantize
if is_mulaw_quantize(hparams.input_type):
#[0, quantize_channels)
out = mulaw_quantize(wav, hparams.quantize_channels)

#Trim silences
start, end = audio.start_and_end_indices(out, hparams.silence_threshold)
wav = wav[start: end]
out = out[start: end]

constant_values = mulaw_quantize(0, hparams.quantize_channels)
out_dtype = np.int16

elif is_mulaw(hparams.input_type):
#[-1, 1]
out = mulaw(wav, hparams.quantize_channels)
constant_values = mulaw(0., hparams.quantize_channels)
out_dtype = np.float32

else:
#[-1, 1]
out = wav
constant_values = 0.
out_dtype = np.float32

# Compute the mel scale spectrogram from the wav
mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
mel_frames = mel_spectrogram.shape[1]

if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length:
return None

#Ensure time resolution adjustment between audio and mel-spectrogram
fft_size = hparams.n_fft if hparams.win_size is None else hparams.win_size
l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams))

#Zero pad for quantized signal
out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)
assert len(out) >= mel_frames * audio.get_hop_size(hparams)

#time resolution adjustment
#ensure length of raw audio is multiple of hop size so that we can use
#transposed convolution to upsample
out = out[:mel_frames * audio.get_hop_size(hparams)]
assert len(out) % audio.get_hop_size(hparams) == 0
time_steps = len(out)

# Write the spectrogram and audio to disk
audio_filename = os.path.join(wav_dir, 'audio-{}.npy'.format(index))
mel_filename = os.path.join(mel_dir, 'mel-{}.npy'.format(index))
np.save(audio_filename, out.astype(out_dtype), allow_pickle=False)
np.save(mel_filename, mel_spectrogram.T, allow_pickle=False)

#global condition features
if hparams.gin_channels > 0:
raise RuntimeError('When activating global conditions, please set your speaker_id rules in line 128 of datasets/wavenet_preprocessor.py to use them during training')
speaker_id = '<no_g>' #put the rule to determine how to assign speaker ids (using file names maybe? file basenames are available in "index" variable)
else:
speaker_id = '<no_g>'

# Return a tuple describing this training example
return (audio_filename, mel_filename, '_', speaker_id, time_steps, mel_frames)
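The global-conditioning branch above only raises a RuntimeError until you supply a rule. As an illustration only (not part of the committed file), one possible stateless rule for a corpus whose basenames carry a speaker prefix, e.g. VCTK-style 'p225_001', could look like this; the helper name, the speaker list, and the naming convention are all assumptions to adapt to your own dataset.

# Hypothetical speaker_id rule for basenames like 'p225_001' (VCTK-style naming).
# It should be deterministic and stateless, since _process_utterance runs inside
# parallel worker processes; ids must stay in [0, hparams.n_speakers).
KNOWN_SPEAKERS = ['p225', 'p226', 'p227']  # example ordering, fixed once up front

def speaker_id_from_basename(basename):
    prefix = basename.split('_')[0]            # 'p225_001' -> 'p225'
    if prefix in KNOWN_SPEAKERS:
        return KNOWN_SPEAKERS.index(prefix)    # row of the speaker embedding
    return '<no_g>'                            # fall back to no global condition

# Inside _process_utterance, this would replace the placeholder assignment in the
# gin_channels > 0 branch (after removing the RuntimeError):
# speaker_id = speaker_id_from_basename(index)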
54 changes: 29 additions & 25 deletions hparams.py
@@ -15,30 +15,30 @@

#Audio
num_mels = 80, #Number of mel-spectrogram channels and local conditioning dimensionality
num_freq = 513, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing network
num_freq = 1025, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing network
rescale = True, #Whether to rescale audio prior to preprocessing
rescaling_max = 0.999, #Rescaling value
trim_silence = True, #Whether to clip silence in Audio (at beginning and end of audio only, not the middle)
clip_mels_length = True, #For cases of OOM (Not really recommended, working on a workaround)
max_mel_frames = 900, #Only relevant when clip_mels_length = True
max_mel_frames = 1100, #Only relevant when clip_mels_length = True

# Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
# It's preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
# Does not work if n_fft is not a multiple of hop_size!!
use_lws=True,
use_lws=False,
silence_threshold=2, #silence threshold used for sound trimming for wavenet preprocessing

#Mel spectrogram
n_fft = 1024, #Extra window size is filled with 0 paddings to match this parameter
hop_size = 256, #For 22050Hz, 275 ~= 12.5 ms
win_size = None, #For 22050Hz, 1100 ~= 50 ms (If None, win_size = n_fft)
sample_rate = 22050, #22050 Hz (corresponding to ljspeech dataset)
n_fft = 2048, #Extra window size is filled with 0 paddings to match this parameter
hop_size = 300, #For 24000Hz, 300 = 12.5 ms
win_size = 1200, #For 24000Hz, 1200 = 50 ms (If None, win_size = n_fft)
sample_rate = 24000, #24000 Hz (note: ljspeech is natively 22050 Hz, audio is resampled on load)
frame_shift_ms = None,

#M-AILABS (and other datasets) trim params
trim_fft_size = 512,
trim_hop_size = 128,
trim_top_db = 60,
trim_top_db = 23,

#Mel and Linear spectrograms normalization/scaling and clipping
signal_normalization = True,
@@ -49,11 +49,11 @@
#Limits
min_level_db = -100,
ref_level_db = 20,
fmin = 25, #Set this to 75 if your speaker is male! if female, 125 should help taking off noise. (To test depending on dataset)
fmin = 0, #Set this to 75 if your speaker is male! If female, 125 should help remove noise. (To test depending on dataset)
fmax = 7600,

#Griffin Lim
power = 1.2,
power = 1.5,
griffin_lim_iters = 60,
###########################################################################################################################################

@@ -77,17 +77,17 @@
prenet_layers = [256, 256], #number of layers and number of units of prenet
decoder_layers = 2, #number of decoder lstm layers
decoder_lstm_units = 1024, #number of decoder lstm units on each layer
max_iters = 2500, #Max decoder steps during inference (Just for safety from infinite loop cases)
max_iters = 1000, #Max decoder steps during inference (Just for safety from infinite loop cases)

postnet_num_layers = 5, #number of postnet convolutional layers
postnet_kernel_size = (5, ), #size of postnet convolution filters for each layer
postnet_channels = 512, #number of postnet convolution filters for each layer

mask_encoder = True, #whether to mask encoder padding while computing attention
mask_decoder = True, #Whether to use loss mask for padded sequences (if False, <stop_token> loss function will not be weighted, else recommended pos_weight = 20)
mask_encoder = False, #whether to mask encoder padding while computing attention
mask_decoder = False, #Whether to use loss mask for padded sequences (if False, <stop_token> loss function will not be weighted, else recommended pos_weight = 20)

cross_entropy_pos_weight = 20, #Use class weights to reduce the stop token classes imbalance (by adding more penalty on False Negatives (FN)) (1 = disabled)
predict_linear = False, #Whether to add a post-processing network to the Tacotron to predict linear spectrograms (True mode Not tested!!)
cross_entropy_pos_weight = 1, #Use class weights to reduce the stop token classes imbalance (by adding more penalty on False Negatives (FN)) (1 = disabled)
predict_linear = True, #Whether to add a post-processing network to the Tacotron to predict linear spectrograms (True mode Not tested!!)
###########################################################################################################################################


@@ -105,30 +105,33 @@
log_scale_min=float(np.log(1e-14)), #Mixture of logistic distributions minimal log scale

out_channels = 10 * 3, #This should be equal to quantize channels when input type is 'mulaw-quantize' else: num_distributions * 3 (prob, mean, log_scale)
layers = 24, #Number of dilated convolutions (Default: Simplified Wavenet of Tacotron-2 paper)
stacks = 4, #Number of dilated convolution stacks (Default: Simplified Wavenet of Tacotron-2 paper)
layers = 30, #Number of dilated convolutions (Default: Simplified Wavenet of Tacotron-2 paper)
stacks = 3, #Number of dilated convolution stacks (Default: Simplified Wavenet of Tacotron-2 paper)
residual_channels = 512,
gate_channels = 512, #split in 2 in gated convolutions
skip_out_channels = 256,
kernel_size = 3,

cin_channels = 80, #Set this to -1 to disable local conditioning, else it must be equal to num_mels!!
upsample_conditional_features = True, #Whether to repeat conditional features or upsample them (The latter is recommended)
upsample_scales = [16, 16], #prod(scales) should be equal to hop size
upsample_scales = [5, 5, 4, 3], #prod(scales) should be equal to hop size
freq_axis_kernel_size = 3,

gin_channels = -1, #Set this to -1 to disable global conditioning, Only used for multi speaker dataset
gin_channels = -1, #Set this to -1 to disable global conditioning, Only used for multi speaker dataset. It defines the depth of the embeddings (Recommended: 512)
use_speaker_embedding = True, #whether to make a speaker embedding
n_speakers = 6, #number of speakers (rows of the embedding)

use_bias = True, #Whether to use bias in convolutional layers of the Wavenet

max_time_sec = None,
max_time_steps = 13000, #Max time steps in audio used to train wavenet (decrease to save memory)
max_time_steps = 8000, #Max time steps in audio used to train wavenet (decrease to save memory)
###########################################################################################################################################

#Tacotron Training
tacotron_random_seed = 5339, #Determines initial graph and operations (i.e: model) random state for reproducibility
tacotron_swap_with_cpu = False, #Whether to use cpu as support to gpu for decoder computation (Not recommended: may cause major slowdowns! Only use when critical!)

tacotron_batch_size = 48, #number of training samples on each training steps
tacotron_batch_size = 32, #number of training samples on each training steps
tacotron_reg_weight = 1e-6, #regularization weight (for L2 regularization)
tacotron_scale_regularization = True, #Whether to rescale regularization weight to adapt for outputs range (used when reg_weight is high and biasing the model)

@@ -138,8 +141,8 @@

tacotron_decay_learning_rate = True, #boolean, determines if the learning rate will follow an exponential decay
tacotron_start_decay = 50000, #Step at which learning decay starts
tacotron_decay_steps = 40000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.2, #learning rate decay rate (UNDER TEST)
tacotron_decay_steps = 50000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.4, #learning rate decay rate (UNDER TEST)
tacotron_initial_learning_rate = 1e-3, #starting learning rate
tacotron_final_learning_rate = 1e-5, #minimal learning rate

@@ -150,6 +153,7 @@
tacotron_zoneout_rate = 0.1, #zoneout rate for all LSTM cells in the network
tacotron_dropout_rate = 0.5, #dropout rate for all convolutional layers + prenet

tacotron_clip_gradients = False, #whether to clip gradients
natural_eval = False, #Whether to use 100% natural eval (to evaluate Curriculum Learning performance) or with same teacher-forcing ratio as in training (just for overfit)

#Decoder RNN learning can take be done in one of two ways:
@@ -176,10 +180,10 @@
wavenet_test_batches = None, #number of test batches.
wavenet_data_random_state = 1234, #random state for train test split repeatability

wavenet_learning_rate = 1e-4,
wavenet_learning_rate = 1e-3,
wavenet_adam_beta1 = 0.9,
wavenet_adam_beta2 = 0.999,
wavenet_adam_epsilon = 1e-6,
wavenet_adam_epsilon = 1e-8,

wavenet_ema_decay = 0.9999, #decay rate of exponential moving average
