SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing
Official PyTorch implementation and pretrained models of SpeechT5
Model | Pre-training Dataset | Fine-tuning Dataset | Checkpoint |
---|---|---|---|
SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | - | HuggingFace / Google Drive |
SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | 100 hrs LibriSpeech | HuggingFace / Google Drive |
SpeechT5 Large | 60k hrs Libri-Light + LibriSpeech LM Dataset | - | Google Drive |
Model | Dataset | Checkpoint | Vocabulary | SPM Model |
---|---|---|---|---|
LM | LibriSpeech LM Dataset | LM Model | Vocabulary | SPM Model |
git submodule update --init SpeechT5/fairseq
cd SpeechT5/
pip install --editable fairseq/
pip install espnet
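Optionally, you can verify the installation with a quick import check. This snippet is only a sanity check, not part of the official setup:

```python
# Confirm the editable fairseq install and espnet are importable.
import fairseq
import espnet

print("fairseq", fairseq.__version__)
print("espnet", espnet.__version__)
```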
Please follow the steps here for preparing the wav2vec 2.0 manifest and here for preparing the HuBERT labels.
We add a third column to the manifest for the speaker embedding, which is provided here. It includes the speaker embeddings for the 960-hour training data and the dev-other data of LibriSpeech.
We also provide example manifests for your reference here (a hypothetical excerpt is shown below).
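For orientation only, such a manifest might look like the following. The root directory, file names, frame counts, and the .npy speaker-embedding paths are all made up; the provided example manifests are authoritative:

```
/path/to/LibriSpeech/train-960
103/1240/103-1240-0000.flac	225360	103/1240/103-1240-0000.npy
103/1240/103-1240-0001.flac	255120	103/1240/103-1240-0001.npy
```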
Please use fairseq-preprocess to generate the index and bin files of the text data. Note that we use SentencePiece to pre-process the text, so please download the SPM model and dictionary from here for preparing the text data. In other words, you first need to encode the text with the SPM model and then run fairseq-preprocess with the provided dictionary to get the index and bin files; a sketch of these two steps follows.
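A minimal sketch of the SPM encoding step, assuming the downloaded SPM model is saved as `spm.model` and the raw text lives in `train.txt`/`valid.txt` (all file names here are placeholders):

```python
# Encode raw text into SentencePiece pieces before running fairseq-preprocess.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")  # placeholder path
for split in ("train", "valid"):
    with open(f"{split}.txt") as fin, open(f"{split}.spm", "w") as fout:
        for line in fin:
            pieces = sp.encode(line.strip(), out_type=str)
            fout.write(" ".join(pieces) + "\n")
```

The resulting `*.spm` files, together with the provided dictionary, are then passed to fairseq-preprocess to produce the index and bin files.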
import torch
from speecht5.tasks.speecht5 import SpeechT5Task
from speecht5.models.speecht5 import T5TransformerModel

# Load the checkpoint and point the stored task config at your local data.
checkpoint = torch.load('/path/to/speecht5_checkpoint')
checkpoint['cfg']['task'].t5_task = 'pretrain'
checkpoint['cfg']['task'].hubert_label_dir = "/path/to/hubert_label"
checkpoint['cfg']['task'].data = "/path/to/tsv_file"

# Rebuild the task and model from the stored config, then load the weights.
task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])
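As a quick sanity check after loading (plain PyTorch, nothing SpeechT5-specific is assumed), you can switch the restored model to evaluation mode and count its parameters:

```python
# Optional sanity check on the restored model (generic PyTorch).
model.eval()  # disable dropout before any inference
num_params = sum(p.numel() for p in model.parameters())
print(f"SpeechT5 restored with {num_params / 1e6:.1f}M parameters")
```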
DATA_ROOT=
SAVE_DIR=
LABEL_DIR=
TRAIN_SET="speech_train|text_train"
VALID_SET="speech_valid|text_valid"
fairseq-train ${DATA_ROOT} \
--save-dir ${SAVE_DIR} \
--tensorboard-logdir ${SAVE_DIR} \
--train-subset ${TRAIN_SET} \
--valid-subset ${VALID_SET} \
--hubert-label-dir ${LABEL_DIR} \
--distributed-world-size 32 \
--distributed-port 0 \
--ddp-backend legacy_ddp \
--user-dir SpeechT5/speecht5 \
--log-format json \
--seed 1337 \
--fp16 \
\
--task speecht5 \
--t5-task pretrain \
--label-rates 50 \
--sample-rate 16000 \
--random-crop \
\
--num-workers 0 \
--max-tokens 1400000 \
--max-speech-sample-size 250000 \
--update-freq 2 \
--batch-ratio "[1,0.0086]" \
\
--criterion speecht5 \
--optimizer adam \
--reset-optimizer \
--adam-betas "(0.9, 0.98)" \
--adam-eps 1e-06 \
--weight-decay 0.01 \
--power 1 \
--clip-norm 5.0 \
--lr 0.0002 \
--lr-scheduler polynomial_decay \
\
--max-update 800000 \
--warmup-updates 64000 \
--total-num-update 800000 \
--save-interval-updates 3000 \
--skip-invalid-size-inputs-valid-test \
--required-batch-size-multiple 1 \
\
--arch t5_transformer_base \
--share-input-output-embed \
--find-unused-parameters \
--bert-init \
--relative-position-embedding \
--use-codebook \
--codebook-prob 0.1 \
--loss-weights="[10,0.1]" \
--max-text-positions 600
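For reference, the effective speech batch per update is roughly `--max-tokens` × world size × `--update-freq`. Assuming `--max-tokens` counts raw 16 kHz waveform samples on the speech side, as in other fairseq speech tasks, a back-of-the-envelope sketch:

```python
# Rough effective speech batch per update for the pre-training command above.
max_tokens = 1_400_000   # --max-tokens (per-GPU budget)
world_size = 32          # --distributed-world-size
update_freq = 2          # --update-freq
samples_per_update = max_tokens * world_size * update_freq
print(samples_per_update / 16_000 / 3600, "hours of audio per update (approx.)")
```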
DATA_ROOT=
SAVE_DIR=
TRAIN_SET=
VALID_SET=
LABEL_DIR=
BPE_TOKENIZER=
USER_DIR=
PT_CHECKPOINT_PATH=
mkdir -p ${SAVE_DIR}
fairseq-train ${DATA_ROOT} \
--save-dir ${SAVE_DIR} \
--tensorboard-logdir ${SAVE_DIR} \
--train-subset ${TRAIN_SET} \
--valid-subset ${VALID_SET} \
--hubert-label-dir ${LABEL_DIR} \
--distributed-world-size 8 \
--distributed-port 0 \
--ddp-backend legacy_ddp \
--user-dir ${USER_DIR} \
--log-format json \
--seed 1 \
--fp16 \
\
--task speecht5 \
--t5-task s2t \
--sample-rate 16000 \
--num-workers 0 \
--max-tokens 1600000 \
--update-freq 2 \
--bpe-tokenizer ${BPE_TOKENIZER} \
\
--criterion speecht5 \
--report-accuracy \
--zero-infinity \
--ce-weight 0.5 \
--ctc-weight 0.5 \
--sentence-avg \
\
--optimizer adam \
--adam-betas "(0.9, 0.98)" \
--adam-eps 1e-08 \
--weight-decay 0.1 \
--clip-norm 25.0 \
--lr 0.00006 \
--lr-scheduler tri_stage \
--phase-ratio "[0.1, 0.4, 0.5]" \
--final-lr-scale 0.05 \
\
--max-update 80000 \
--max-text-positions 600 \
--required-batch-size-multiple 1 \
--save-interval-updates 3000 \
--skip-invalid-size-inputs-valid-test \
\
--arch t5_transformer_base_asr \
--share-input-output-embed \
--find-unused-parameters \
--bert-init \
--relative-position-embedding \
--freeze-encoder-updates 13000 \
\
--keep-last-epochs 10 \
--feature-grad-mult 1.0 \
--best-checkpoint-metric s2t_accuracy \
--maximize-best-checkpoint-metric \
--finetune-from-model ${PT_CHECKPOINT_PATH}
Note that joint CTC/decoder inference for ASR is only supported when the batch size is 1.
CHECKPOINT_PATH=
DATA_ROOT=
SUBSET=
BPE_TOKENIZER=
LABEL_DIR=
USER_DIR=
BEAM=
MAX_TOKENS=
CTC_WEIGHT=
LM_WEIGHT=
LM_PATH=
fairseq-generate ${DATA_ROOT} \
--gen-subset ${SUBSET} \
--bpe-tokenizer ${BPE_TOKENIZER} \
--user-dir ${USER_DIR} \
--task speecht5 \
--t5-task s2t \
--path ${CHECKPOINT_PATH} \
--hubert-label-dir ${LABEL_DIR} \
--ctc-weight ${CTC_WEIGHT} \
--lm-weight ${LM_WEIGHT} \
--lm-path ${LM_PATH} \
--max-tokens ${MAX_TOKENS} \
--beam ${BEAM} \
--scoring wer \
--max-len-a 0 \
--max-len-b 620 \
--sample-rate 16000
The manifest and the pre-trained vocoder can be found on HuggingFace, which may be helpful for reproducing the results of the SpeechT5 TTS model.
We also provide a re-implementation of the TTS fine-tuned model, speecht5_tts.pt, trained with a smaller batch size or fewer max updates, which may be helpful.
DATA_ROOT=
SAVE_DIR=
TRAIN_SET=
VALID_SET=
LABEL_DIR=
BPE_TOKENIZER=
USER_DIR=
PT_CHECKPOINT_PATH=
fairseq-train ${DATA_ROOT} \
--save-dir ${SAVE_DIR} \
--tensorboard-logdir ${SAVE_DIR} \
--train-subset ${TRAIN_SET} \
--valid-subset ${VALID_SET} \
--hubert-label-dir ${LABEL_DIR} \
--distributed-world-size 8 \
--distributed-port 0 \
--ddp-backend legacy_ddp \
--user-dir ${USER_DIR} \
--log-format json \
--seed 1 \
--fp16 \
\
--task speecht5 \
--t5-task t2s \
--sample-rate 16000 \
--num-workers 4 \
--max-tokens 3200000 \
--update-freq 1 \
--bpe-tokenizer ${BPE_TOKENIZER} \
--max-tokens-valid 3200000 \
\
--criterion speecht5 \
--use-guided-attn-loss \
--report-accuracy \
--sentence-avg \
\
--optimizer adam \
--adam-betas "(0.9, 0.98)" \
--dropout 0.15 \
--activation-dropout 0.15 \
--attention-dropout 0.15 \
--encoder-layerdrop 0.0 \
--decoder-layerdrop 0.0 \
--weight-decay 0.0 \
--clip-norm 25.0 \
--lr 0.0001 \
--lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--feature-grad-mult 1.0 \
\
--max-update 120000 \
--max-text-positions 600 \
--min-speech-sample-size 1056 \
--max-speech-sample-size 480256 \
--max-speech-positions 1876 \
--required-batch-size-multiple 1 \
--skip-invalid-size-inputs-valid-test \
--keep-last-epochs 10 \
--validate-after-updates 20000 \
--validate-interval 50 \
--log-interval 10 \
\
--arch t5_transformer_base_asr \
--share-input-output-embed \
--find-unused-parameters \
--bert-init \
--relative-position-embedding \
--freeze-encoder-updates 20000 \
\
--finetune-from-model ${PT_CHECKPOINT_PATH}
Note that generating speech (TTS inference) is only supported when the batch size is 1.
SPEECHT5_CODE_DIR=
CHECKPOINT_PATH=
DATA_ROOT=
SUBSET=
BPE_TOKENIZER=
LABEL_DIR=
USER_DIR=
RESULTS_PATH=
python3 ${SPEECHT5_CODE_DIR}/SpeechT5/scripts/generate_speech.py ${DATA_ROOT} \
--gen-subset ${SUBSET} \
--bpe-tokenizer ${BPE_TOKENIZER} \
--user-dir ${USER_DIR} \
--task speecht5 \
--t5-task t2s \
--path ${CHECKPOINT_PATH} \
--hubert-label-dir ${LABEL_DIR} \
--batch-size 1 \
--results-path ${RESULTS_PATH} \
--sample-rate 16000
For ST, we follow fairseq/speech_to_text/mustc to generate the vocabulary, which differs from the one used by the pre-trained models, so we randomly initialize the embedding table of the pre-trained model during fine-tuning (a sketch of this re-initialization is shown below).
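A minimal sketch of what this random re-initialization amounts to, with placeholder sizes (`new_vocab_size`, `embed_dim`, and `pad_idx` are hypothetical; the repository's fine-tuning code is authoritative):

```python
import torch.nn as nn

# Build a fresh embedding table sized to the new MuST-C vocabulary instead of
# reusing the pre-trained (LibriSpeech SPM) one; sizes below are placeholders.
new_vocab_size, embed_dim, pad_idx = 8000, 768, 1
embed_tokens = nn.Embedding(new_vocab_size, embed_dim, padding_idx=pad_idx)
nn.init.normal_(embed_tokens.weight, mean=0.0, std=embed_dim ** -0.5)
nn.init.constant_(embed_tokens.weight[pad_idx], 0.0)
```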
DATA_ROOT=
SAVE_DIR=
TRAIN_SET=
VALID_SET=
LABEL_DIR=
BPE_TOKENIZER=
USER_DIR=
PT_CHECKPOINT_PATH=
fairseq-train ${DATA_ROOT} \
--save-dir ${SAVE_DIR} \
--tensorboard-logdir ${SAVE_DIR} \
--train-subset ${TRAIN_SET} \
--valid-subset ${VALID_SET} \
--hubert-label-dir ${LABEL_DIR} \
--distributed-world-size 8 \
--distributed-port 0 \
--ddp-backend legacy_ddp \
--user-dir ${USER_DIR} \
--log-format json \
--seed 1 \
--fp16 \
\
--task speecht5 \
--t5-task s2t \
--sample-rate 16000 \
--num-workers 6 \
--max-tokens 480256 \
--update-freq 4 \
--bpe-tokenizer ${BPE_TOKENIZER} \
--max-tokens-valid 3200000 \
\
--criterion speecht5 \
--label-smoothing 0.1 \
--report-accuracy \
--sentence-avg \
\
--optimizer adam \
--adam-betas "(0.9, 0.98)" \
--weight-decay 0.0 \
--clip-norm 10.0 \
--lr 0.0002 \
--lr-scheduler inverse_sqrt \
--warmup-updates 25000 \
--feature-grad-mult 1.0 \
\
--max-update 80000 \
--max-text-positions 600 \
--min-speech-sample-size 1056 \
--max-speech-sample-size 480256 \
--max-speech-positions 1876 \
--required-batch-size-multiple 1 \
--skip-invalid-size-inputs-valid-test \
--keep-last-epochs 10 \
\
--arch t5_transformer_base_asr \
--share-input-output-embed \
--find-unused-parameters \
--bert-init \
--relative-position-embedding \
--freeze-encoder-updates 0 \
--mask-prob 0.5 \
--mask-channel-prob 0.5 \
\
--finetune-from-model ${PT_CHECKPOINT_PATH}
FAIRSEQ_DIR=
CHECKPOINT_PATH=
DATA_ROOT=
BPE_TOKENIZER=
LABEL_DIR=
USER_DIR=
MAX_TOKENS=
python3 ${FAIRSEQ_DIR}/scripts/average_checkpoints.py \
--inputs ${CHECKPOINT_PATH} \
--num-epoch-checkpoints 10 \
--output ${CHECKPOINT_PATH}/avg_last_10_checkpoint.pt
fairseq-generate ${DATA_ROOT} \
--gen-subset tst-COMMON \
--bpe-tokenizer ${BPE_TOKENIZER} \
--user-dir ${USER_DIR} \
--task speecht5 \
--t5-task s2t \
--path ${CHECKPOINT_PATH}/avg_last_10_checkpoint.pt \
--hubert-label-dir ${LABEL_DIR} \
--max-tokens ${MAX_TOKENS} \
--min-speech-sample-size 1056 \
--beam 5 \
--scoring sacrebleu \
--max-len-a 0 \
--max-len-b 620 \
--sample-rate 16000
The manifest and the pre-trained vocoder can be found on HuggingFace, which may be helpful for reproducing the results of the SpeechT5 VC model.
We also provide a re-implementation of the VC fine-tuned model, speecht5_vc.pt, trained with a smaller batch size or fewer max updates, which may be helpful.
DATA_ROOT=
SAVE_DIR=
TRAIN_SET=
VALID_SET=
LABEL_DIR=
BPE_TOKENIZER=
USER_DIR=
PT_CHECKPOINT_PATH=
fairseq-train ${DATA_ROOT} \
--save-dir ${SAVE_DIR} \
--tensorboard-logdir ${SAVE_DIR} \
--train-subset ${TRAIN_SET} \
--valid-subset ${VALID_SET} \
--hubert-label-dir ${LABEL_DIR} \
--distributed-world-size 8 \
--distributed-port 0 \
--ddp-backend legacy_ddp \
--user-dir ${USER_DIR} \
--log-format json \
--seed 1 \
--fp16 \
\
--task speecht5 \
--t5-task s2s \
--sample-rate 16000 \
--num-workers 4 \
--max-tokens 1280000 \
--update-freq 3 \
--max-tokens-valid 1280000 \
\
--criterion speecht5 \
--use-guided-attn-loss \
--report-accuracy \
--sentence-avg \
\
--optimizer adam \
--dropout 0.2 \
--activation-dropout 0.2 \
--attention-dropout 0.2 \
--encoder-layerdrop 0.05 \
--decoder-layerdrop 0.0 \
--clip-norm 1.0 \
--lr 0.0001 \
--lr-scheduler inverse_sqrt \
--warmup-updates 6000 \
--feature-grad-mult 1.0 \
\
--max-update 60000 \
--max-text-positions 600 \
--min-speech-sample-size 1056 \
--max-speech-sample-size 480256 \
--max-speech-positions 1876 \
--required-batch-size-multiple 1 \
--skip-invalid-size-inputs-valid-test \
--keep-last-epochs 10 \
--save-interval-updates 10000 \
--disable-validation \
--log-interval 10 \
\
--arch t5_transformer_base_asr \
--share-input-output-embed \
--find-unused-parameters \
--bert-init \
--relative-position-embedding \
--mask-prob 0.0 \
--mask-channel-prob 0.0 \
\
--finetune-from-model ${PT_CHECKPOINT_PATH}
Note that generating speech (VC inference) is only supported when the batch size is 1.
SPEECHT5_CODE_DIR=
CHECKPOINT_PATH=
DATA_ROOT=
SUBSET=
LABEL_DIR=
USER_DIR=
RESULTS_PATH=
python3 ${SPEECHT5_CODE_DIR}/SpeechT5/scripts/generate_speech.py ${DATA_ROOT} \
--gen-subset ${SUBSET} \
--user-dir ${USER_DIR} \
--task speecht5 \
--t5-task s2s \
--path ${CHECKPOINT_PATH} \
--hubert-label-dir ${LABEL_DIR} \
--batch-size 1 \
--results-path ${RESULTS_PATH} \
--sample-rate 16000
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ and ESPnet projects.
Microsoft Open Source Code of Conduct
If you find our work useful in your research, please cite the following paper:
@article{Ao2021SpeechT5,
  title         = {SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing},
  author        = {Junyi Ao and Rui Wang and Long Zhou and Chengyi Wang and Shuo Ren and Yu Wu and Shujie Liu and Tom Ko and Qing Li and Yu Zhang and Zhihua Wei and Yao Qian and Jinyu Li and Furu Wei},
  eprint        = {2110.07205},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2021}
}
For help or issues using SpeechT5 models, please submit a GitHub issue.
For other communications related to SpeechT5, please contact Long Zhou ([email protected]).