This repository contains the official PyTorch implementation of SQ-VAE.
- Ensure you have Python 3 and PyTorch 1.4 or greater.
- Install NVIDIA/apex for mixed precision training.
- Install pip dependencies: `pip install -r requirements.txt`
- Download and extract the ZeroSpeech2020 dataset. For reproduction, only `zerospeech2020.z01`, `zerospeech2020.z02`, and `zerospeech2020.zip` are required (follow the official instructions to extract the dataset).
- Download the train/test splits here and extract them in the root directory of the repo.
- Preprocess the audio and extract train/test log-Mel spectrograms (a rough sketch of this feature extraction follows this list):

  `python preprocess.py in_dir=/path/to/dataset dataset=2019/english`

  Note: `in_dir` must be the path to the `2019` folder, e.g. `python preprocess.py in_dir=../datasets/2020/2019 dataset=2019/english`
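For intuition, here is a minimal sketch of log-Mel feature extraction using librosa. The sample rate, FFT size, hop length, and number of mel bins below are placeholders, not the values used by `preprocess.py`; the authoritative settings live in this repo's preprocessing config.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=16000, n_fft=2048, hop_length=160, n_mels=80):
    # Placeholder parameters; the real values come from the preprocessing config.
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Clamp before the log to avoid -inf on silent frames.
    return np.log(np.maximum(mel, 1e-10))
```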
Train a model:

`python train.py checkpoint_dir=path/to/checkpoint_dir dataset=2019/english`

e.g. `python train.py checkpoint_dir=checkpoints/2019english dataset=2019/english`

Note: the default parameterization of the variance is Gaussian SQ-VAE (IV), `"gaussian_4"`. You can switch the parameterization in `config/model/default.yaml`: Gaussian SQ-VAE (I) `"gaussian_1"`, Gaussian SQ-VAE (III) `"gaussian_3"`, and Gaussian SQ-VAE (IV) `"gaussian_4"`.
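To double-check which parameterization is active before launching a run, something like the sketch below can be used. It assumes the Hydra-style configs under `config/` are plain YAML files loadable with OmegaConf, which may differ from how the training script actually composes them.

```python
from omegaconf import OmegaConf

# Load the model config and print it; look for the variance parameterization
# (gaussian_1 / gaussian_3 / gaussian_4) among its entries.
cfg = OmegaConf.load("config/model/default.yaml")
print(OmegaConf.to_yaml(cfg))
```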
Evaluate a trained model (MSE):

`python evaluate_mse.py checkpoint=path/to/checkpoint in_dir=path/to/wavs evaluation_list=path/to/evaluation_list dataset=2019/english`
Note: the evaluation list is a `json` file:

    [
        [
            "english/train/parallel/voice/V001_0000000047.wav",
            "V001"
        ]
    ]

containing a list of items with a) the path (relative to `in_dir`) of the source `wav` files; and b) the target speaker (see `datasets/2019/english/speakers.json` for a list of options).
e.g. `python evaluate_mse.py checkpoint=checkpoints/2019english/model.ckpt-500000.pt in_dir=../datasets/2020/2019 evaluation_list=datasets/2019/english/mse_evaluation.json dataset=2019/english`
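To build a custom evaluation list, a minimal sketch like the one below is enough; `my_evaluation.json` is a placeholder filename, and the target speaker IDs must come from `datasets/2019/english/speakers.json`.

```python
import json

# Each entry is [source wav path relative to in_dir, target speaker ID].
evaluation_list = [
    ["english/train/parallel/voice/V001_0000000047.wav", "V001"],
]

with open("my_evaluation.json", "w") as f:  # placeholder filename
    json.dump(evaluation_list, f, indent=2)
```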
- Install bootphon/zerospeech2020.
- Encode the test data for evaluation (a sanity-check sketch for the encoded output follows this list):

  `python encode.py checkpoint=path/to/checkpoint out_dir=path/to/out_dir dataset=2019/english`

  e.g. `python encode.py checkpoint=checkpoints/2019english/model.ckpt-500000.pt out_dir=submission/2019/english/test dataset=2019/english`
- Run the ABX evaluation script (see bootphon/zerospeech2020).
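As a rough sanity check of the encoded output, something like the sketch below can be used. It assumes `encode.py` writes one plain-text feature file per test utterance under `out_dir` (whitespace-separated floats, one frame per line, as expected by the ZeroSpeech tooling); if the actual output format differs, adjust accordingly.

```python
import glob
import numpy as np

# Load every encoded utterance under the submission directory and report its shape.
for path in sorted(glob.glob("submission/2019/english/test/*.txt")):
    feats = np.loadtxt(path)
    print(path, feats.shape)  # (num_frames, feature_dim)
```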
This work is based on:

- van Niekerk, Nortje, and Kamper. "Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge." INTERSPEECH. 2020.
- Chorowski, Jan, et al. "Unsupervised speech representation learning using WaveNet autoencoders." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.12 (2019): 2041-2053.
- Lorenzo-Trueba, Jaime, et al. "Towards achieving robust universal neural vocoding." INTERSPEECH. 2019.
- van den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in Neural Information Processing Systems. 2017.
Cite as:

@INPROCEEDINGS{takida2022sq-vae,
    author={Takida, Yuhta and Shibuya, Takashi and Liao, Wei-Hsiang and Lai, Chieh-Hsin and Ohmura, Junki and Uesaka, Toshimitsu and Murata, Naoki and Takahashi, Shusuke and Kumakura, Toshiyuki and Mitsufuji, Yuki},
    title={SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization},
    booktitle={International Conference on Machine Learning},
    year={2022},
}