add Relative Multi-Head Attention and unify masking
keonlee9420 committed Oct 8, 2021 · 1 parent 1e56fc7 · commit 81be46d
Showing 8 changed files with 156 additions and 290 deletions.
README.md (3 changes: 1 addition & 2 deletions)
```diff
@@ -10,7 +10,7 @@ PyTorch Implementation of [PortaSpeech: Portable and High-Quality Generative Text-to-Speech]
 | Module | Normal | Small | Normal (paper) | Small (paper) |
 | :----- | :-----: | :-----: | :-----: | :-----: |
 | *Total* | 34.3M | 9.6M | 21.8M | 6.7M
-| *LinguisticEncoder* | 14M | 3.4M | - | -
+| *LinguisticEncoder* | 14M | 3.5M | - | -
 | *VariationalGenerator* | 11M | 2.8M | - | -
 | *FlowPostNet* | 9.3M | 3.4M | - | -
```

```diff
@@ -122,7 +122,6 @@ to serve TensorBoard on your localhost.
 - For vocoder, **HiFi-GAN** and **MelGAN** are supported.
 - Add convolution layer and residual layer in **VariationalGenerator** to match the shape of conditioner and output.
 - No ReLU activation and LayerNorm in **VariationalGenerator** for convergence of word-to-phoneme alignment of **LinguisticEncoder**.
-- Use absolute positional encoding in **LinguisticEncoder** instead of relative positional encoding.
 - Will be extended to a **multi-speaker TTS**.
 <!-- - Two options for embedding for the **multi-speaker TTS** setting: training speaker embedder from scratch or using a pre-trained [philipperemy's DeepSpeaker](https://github.com/philipperemy/deep-speaker) model (as [STYLER](https://github.com/keonlee9420/STYLER) did). You can toggle it by setting the config (between `'none'` and `'DeepSpeaker'`).
 - DeepSpeaker on VCTK dataset shows clear identification among speakers. The following figure shows the T-SNE plot of extracted speaker embedding.
```
config/LJSpeech/train.yaml (4 changes: 2 additions & 2 deletions)
```diff
@@ -15,10 +15,10 @@ optimizer:
   grad_clip_thresh: 1.0
   grad_acc_step: 1
   warm_up_step: 4000
-  anneal_steps: [300000, 400000, 500000]
+  anneal_steps: [100000, 200000, 300000]
   anneal_rate: 0.3
 step:
-  total_step: 900000
+  total_step: 500000
   log_step: 100
   synth_step: 1000
   val_step: 1000
```
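In this codebase's FastSpeech2 lineage, these optimizer keys typically drive a Noam-style warmup followed by step-wise annealing. Below is a minimal sketch of that schedule; the exact `ScheduledOptim` formulation and the `d_model = 256` base are assumptions, not confirmed by this diff:

```python
# Hypothetical sketch of the schedule train.yaml's optimizer keys usually control.
def learning_rate(step, d_model=256, warm_up_step=4000,
                  anneal_steps=(100000, 200000, 300000), anneal_rate=0.3):
    assert step >= 1
    # Noam schedule: linear warmup for warm_up_step steps, then ~step^-0.5 decay
    lr = d_model ** -0.5 * min(step ** -0.5, step * warm_up_step ** -1.5)
    for s in anneal_steps:
        if step > s:
            lr *= anneal_rate  # extra 0.3x decay past each milestone
    return lr
```

Shifting the milestones from [300000, 400000, 500000] to [100000, 200000, 300000] lets all three annealing steps take effect well inside the shortened `total_step: 500000` run, instead of the last milestone coinciding with the final step.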
model/PortaSpeech.py (6 changes: 3 additions & 3 deletions)
```diff
@@ -17,7 +17,7 @@ def __init__(self, preprocess_config, model_config):
         super(PortaSpeech, self).__init__()
         self.model_config = model_config

-        self.linguistic_encoder = LinguisticEncoder(model_config, abs_mha=True)
+        self.linguistic_encoder = LinguisticEncoder(model_config)
         self.variational_generator = VariationalGenerator(
             preprocess_config, model_config)
         self.postnet = FlowPostNet(preprocess_config, model_config)
```
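Dropping `abs_mha=True` switches the `LinguisticEncoder` to the relative multi-head attention named in the commit title. The layer itself is not visible in this diff; the following is a minimal sketch of one common variant (learned, distance-clipped relative position embeddings in the style of Shaw et al., 2018), offered for orientation rather than as the repo's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeMultiHeadAttention(nn.Module):
    """Self-attention with learned, clipped relative position embeddings (a sketch)."""

    def __init__(self, d_model, n_head, max_rel_dist=16):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head, self.d_head = n_head, d_model // n_head
        self.max_rel_dist = max_rel_dist
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one vector per clipped relative offset in [-max_rel_dist, max_rel_dist]
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, self.d_head)

    def forward(self, x, mask=None):
        # x: (B, T, d_model); mask: (B, 1, T) with True at padded positions
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_head, self.d_head).transpose(1, 2)  # (B, H, T, d)
        k = k.view(B, T, self.n_head, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.d_head).transpose(1, 2)

        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        r = self.rel_emb(rel + self.max_rel_dist)  # (T, T, d): one embedding per (i, j) offset

        # content-content plus content-position scores, then scale
        scores = q @ k.transpose(-2, -1) + torch.einsum("bhtd,tsd->bhts", q, r)
        scores = scores / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask.unsqueeze(1), -1e9)  # broadcast over heads
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
        return self.out(out)
```

Because scores depend on token distance rather than absolute index, a layer like this generalizes better to utterance lengths unseen in training, which is consistent with the README above dropping its absolute-positional-encoding note.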
```diff
@@ -104,14 +104,14 @@ def forward(
                 mels, mel_lens, mel_masks, output)
             postnet_output = self.postnet(
                 mels.transpose(1, 2),
-                ~mel_masks.unsqueeze(1),
+                mel_masks.unsqueeze(1),
                 g=(out_residual + residual).transpose(1, 2),
             )
         else:
             _, out_residual, dist_info = self.variational_generator.inference(
                 mel_lens, mel_masks, output)
             output = self.postnet.inference(
-                ~mel_masks.unsqueeze(1),
+                mel_masks.unsqueeze(1),
                 g=(out_residual + residual).transpose(1, 2),
             )
             postnet_output = None
```