Official Python 3 implementation of the ICASSP 2021 paper (https://arxiv.org/abs/2010.13105) proposing a two-stage textual knowledge distillation method for end-to-end spoken language understanding (SLU).
Authors: Seongbin Kim, Gyuwan Kim, Seongjin Shin, Sangmin Lee
Our E2E SLU model combines a vq-wav2vec BERT speech encoder with a DeepSpeech2 acoustic model. We perform knowledge distillation from a text BERT model to the speech encoder during (1) additional pre-training (PT-KD) and (2) fine-tuning (FT-KD). We also apply (3) data augmentation (DA) during fine-tuning.
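To make the idea concrete, here is a minimal sketch of a two-part distillation objective: hidden-state matching plus softened-logit KL, combined with the supervised task loss. The function name, loss weighting, and temperature below are illustrative assumptions, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_hidden, teacher_hidden, student_logits, teacher_logits,
            labels, temperature=2.0, alpha=0.5):
    """Illustrative distillation loss; weights and temperature are assumptions."""
    # Match student (speech) representations to teacher (text) representations.
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)
    # Soften both output distributions and penalize their divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction='batchmean') * temperature ** 2
    # Standard supervised intent-classification loss.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * (hidden_loss + kd) + (1 - alpha) * ce
```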
Description | Dataset | Model |
---|---|---|
vq-wav2vec K-means (vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations, Baevski et al., 2019) | Librispeech | download |
RoBERTa on K-means codes (vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations, Baevski et al., 2019) | Librispeech | download |
RoBERTa-base on text (RoBERTa: A Robustly Optimized BERT Pretraining Approach, Liu et al., 2019) | - | download |
Knowledge distillation pre-trained model | Librispeech | download |
Fine-tuned text RoBERTa model | Fluent Speech Commands (requires authorization) | download |
Acoustic model pre-trained model | Librispeech | download |
Two-stage textual knowledge distillation SLU model (PT-KD + FT-KD + AM pre-training + DA) | Fluent Speech Commands (requires authorization) | download |
Python 3.7.4
ctcdecode==0.4
torch==1.4.0
fairseq==0.9.0 (git commit 3335de5f441ee1b3824e16dcd98db620e40beaba)
torchaudio==0.5.0
warpctc-pytorch==0.1
soundfile
Installing with pip is tricky because of the dependencies between these libraries, so we recommend using our Docker image instead: `docker run bekinsmingo/icassp:v4`.
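If you need GPU access and your datasets mounted inside the container, an invocation along the lines of `docker run -it --gpus all -v /path/to/datasets:/workspace/data bekinsmingo/icassp:v4 /bin/bash` should work; the mount path and working directory here are assumptions about your local setup, not properties of the image.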
Download the datasets and run the following script to convert the wav files into vq tokens and build the manifest files.
python preprocessing.py --config ./configs/preprocessing.json
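For reference, the wav-to-vq-token conversion follows fairseq's standard vq-wav2vec usage; a minimal sketch is below, where the checkpoint and wav filenames are placeholders (use the vq-wav2vec K-means checkpoint from the table above):

```python
import torch
import soundfile as sf
from fairseq.models.wav2vec import Wav2VecModel

# Placeholder checkpoint path.
cp = torch.load('vq-wav2vec_kmeans.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

wav, sr = sf.read('sample.wav')  # expects 16 kHz mono audio
wav = torch.from_numpy(wav).float().unsqueeze(0)
with torch.no_grad():
    z = model.feature_extractor(wav)
    _, idxs = model.vector_quantizer.forward_idx(z)  # discrete vq token ids
print(idxs.shape)  # (1, T, 2): two codebook groups per timestep
```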
python text_finetuning.py --config ./configs/text_fine.json \
--train-manifest ./manifest/vq_fsc_train.csv \
--val-manifest ./manifest/vq_fsc_valid.csv
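This step fine-tunes the text RoBERTa teacher for intent classification on the FSC transcripts. After fine-tuning, the teacher can be queried through fairseq's RoBERTa hub interface; in this sketch the checkpoint path and classification-head name are assumptions (see configs/text_fine.json for the real settings):

```python
from fairseq.models.roberta import RobertaModel

# Hypothetical checkpoint path and head name.
roberta = RobertaModel.from_pretrained('checkpoints', checkpoint_file='text_roberta_fsc.pt')
roberta.eval()

tokens = roberta.encode('turn on the kitchen lights')
# return_logits=True skips the log-softmax, giving raw teacher logits for distillation.
logits = roberta.predict('intent_head', tokens, return_logits=True)
intent_id = logits.argmax(dim=-1).item()
```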
python am_pretraining.py --train-manifest manifest/am_pretrain_manifest.csv \
--val-manifest manifest/am_pretrain_manifest.csv \
--config configs/am_pre.json
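Acoustic-model pre-training optimizes a CTC objective on Librispeech. The repo pins warpctc-pytorch; the sketch below uses torch.nn.CTCLoss as a stand-in, and all shapes and vocabulary sizes are illustrative:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Illustrative shapes: (T, N, C) = (time steps, batch, characters incl. blank).
T, N, C, L = 100, 8, 32, 20
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, C, (N, L))               # character targets, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), L, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in training, gradients flow into the acoustic encoder
```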
python finetuning.py --config ./configs/finetuning_kd.json \
--train-manifest ./manifest/vq_fsc_train.csv \
--val-manifest ./manifest/vq_fsc_valid.csv \
--infer-manifest ./manifest/vq_fsc_test.csv \
--prekd-path xbt_recent \
--ampre-path xbt_asr.pth \
--intent-path ./manifest/intent_dict
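The file passed via --intent-path maps FSC intent labels to class indices. Below is a hedged sketch of reading such a mapping at inference time; the one-entry-per-line, tab-separated format is an assumption, so check manifest/intent_dict for the actual layout:

```python
# Assumed format: one "label<TAB>index" entry per line; the real file may differ.
def load_intent_dict(path):
    idx2label = {}
    with open(path) as f:
        for line in f:
            label, idx = line.rstrip('\n').split('\t')
            idx2label[int(idx)] = label
    return idx2label

idx2label = load_intent_dict('./manifest/intent_dict')
pred_idx = 3  # e.g., argmax over the SLU model's intent logits
print(idx2label.get(pred_idx, '<unk>'))
```

If you use this code or the pretrained models, please cite: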
@article{kim2020two,
title={Two-Stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding},
author={Kim, Seongbin and Kim, Gyuwan and Shin, Seongjin and Lee, Sangmin},
journal={arXiv preprint arXiv:2010.13105},
year={2020}
}
Copyright (c) 2021-present NAVER Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.