add Relative Multi-Head Attention and unify masking
keonlee9420 committed Oct 8, 2021 · 1 parent 1e56fc7 · commit 81be46d
Showing 8 changed files with 156 additions and 290 deletions.
README.md (3 changes: 1 addition & 2 deletions)
```diff
@@ -10,7 +10,7 @@ PyTorch Implementation of [PortaSpeech: Portable and High-Quality Generative Text-to-Speech]
 | Module | Normal | Small | Normal (paper) | Small (paper) |
 | :----- | :-----: | :-----: | :-----: | :-----: |
 | *Total* | 34.3M | 9.6M | 21.8M | 6.7M
-| *LinguisticEncoder* | 14M | 3.4M | - | -
+| *LinguisticEncoder* | 14M | 3.5M | - | -
 | *VariationalGenerator* | 11M | 2.8M | - | -
 | *FlowPostNet* | 9.3M | 3.4M | - | -
```

```diff
@@ -122,7 +122,6 @@ to serve TensorBoard on your localhost.
 - For vocoder, **HiFi-GAN** and **MelGAN** are supported.
 - Add convolution layer and residual layer in **VariationalGenerator** to match the shape of conditioner and output.
 - No ReLU activation and LayerNorm in **VariationalGenerator** for convergence of word-to-phoneme alignment of **LinguisticEncoder**.
-- Use absolute positional encoding in **LinguisticEncoder** instead of relative positional encoding.
 - Will be extended to a **multi-speaker TTS**.
 <!-- - Two options for embedding for the **multi-speaker TTS** setting: training speaker embedder from scratch or using a pre-trained [philipperemy's DeepSpeaker](https://github.com/philipperemy/deep-speaker) model (as [STYLER](https://github.com/keonlee9420/STYLER) did). You can toggle it by setting the config (between `'none'` and `'DeepSpeaker'`).
 - DeepSpeaker on VCTK dataset shows clear identification among speakers. The following figure shows the T-SNE plot of extracted speaker embedding.
```
config/LJSpeech/train.yaml (4 changes: 2 additions & 2 deletions)
```diff
@@ -15,10 +15,10 @@ optimizer:
   grad_clip_thresh: 1.0
   grad_acc_step: 1
   warm_up_step: 4000
-  anneal_steps: [300000, 400000, 500000]
+  anneal_steps: [100000, 200000, 300000]
   anneal_rate: 0.3
 step:
-  total_step: 900000
+  total_step: 500000
   log_step: 100
   synth_step: 1000
   val_step: 1000
```
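In this codebase's FastSpeech2 lineage, these optimizer keys typically drive a Noam-style warmup followed by step-wise annealing. Below is a minimal sketch of that schedule; the exact `ScheduledOptim` formulation and the `d_model = 256` base are assumptions, not confirmed by this diff:

```python
# Hypothetical sketch of the schedule train.yaml's optimizer keys usually control.
def learning_rate(step, d_model=256, warm_up_step=4000,
                  anneal_steps=(100000, 200000, 300000), anneal_rate=0.3):
    assert step >= 1
    # Noam schedule: linear warmup for warm_up_step steps, then ~step^-0.5 decay
    lr = d_model ** -0.5 * min(step ** -0.5, step * warm_up_step ** -1.5)
    for s in anneal_steps:
        if step > s:
            lr *= anneal_rate  # extra 0.3x decay past each milestone
    return lr
```

Shifting the milestones from [300000, 400000, 500000] to [100000, 200000, 300000] lets all three annealing steps take effect well inside the shortened `total_step: 500000` run, instead of the last milestone coinciding with the final step.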
model/PortaSpeech.py (6 changes: 3 additions & 3 deletions)
```diff
@@ -17,7 +17,7 @@ def __init__(self, preprocess_config, model_config):
         super(PortaSpeech, self).__init__()
         self.model_config = model_config

-        self.linguistic_encoder = LinguisticEncoder(model_config, abs_mha=True)
+        self.linguistic_encoder = LinguisticEncoder(model_config)
         self.variational_generator = VariationalGenerator(
             preprocess_config, model_config)
         self.postnet = FlowPostNet(preprocess_config, model_config)
```
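Dropping `abs_mha=True` switches the `LinguisticEncoder` to the relative multi-head attention named in the commit title. The layer itself is not visible in this diff; the following is a minimal sketch of one common variant (learned, distance-clipped relative position embeddings in the style of Shaw et al., 2018), offered for orientation rather than as the repo's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeMultiHeadAttention(nn.Module):
    """Self-attention with learned, clipped relative position embeddings (a sketch)."""

    def __init__(self, d_model, n_head, max_rel_dist=16):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head, self.d_head = n_head, d_model // n_head
        self.max_rel_dist = max_rel_dist
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one vector per clipped relative offset in [-max_rel_dist, max_rel_dist]
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, self.d_head)

    def forward(self, x, mask=None):
        # x: (B, T, d_model); mask: (B, 1, T) with True at padded positions
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_head, self.d_head).transpose(1, 2)  # (B, H, T, d)
        k = k.view(B, T, self.n_head, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.d_head).transpose(1, 2)

        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        r = self.rel_emb(rel + self.max_rel_dist)  # (T, T, d): one embedding per (i, j) offset

        # content-content plus content-position scores, then scale
        scores = q @ k.transpose(-2, -1) + torch.einsum("bhtd,tsd->bhts", q, r)
        scores = scores / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask.unsqueeze(1), -1e9)  # broadcast over heads
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
        return self.out(out)
```

Because scores depend on token distance rather than absolute index, a layer like this generalizes better to utterance lengths unseen in training, which is consistent with the README above dropping its absolute-positional-encoding note.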
```diff
@@ -104,14 +104,14 @@ def forward(
                 mels, mel_lens, mel_masks, output)
             postnet_output = self.postnet(
                 mels.transpose(1, 2),
-                ~mel_masks.unsqueeze(1),
+                mel_masks.unsqueeze(1),
                 g=(out_residual + residual).transpose(1, 2),
             )
         else:
             _, out_residual, dist_info = self.variational_generator.inference(
                 mel_lens, mel_masks, output)
             output = self.postnet.inference(
-                ~mel_masks.unsqueeze(1),
+                mel_masks.unsqueeze(1),
                 g=(out_residual + residual).transpose(1, 2),
             )
             postnet_output = None
```