- Kaldi (data preparation utility scripts) GitHub link
- Espnet-0.9.7 GitHub link
- Google SentencePiece (`pip3 install sentencepiece`) GitHub link
- Modify the ESPnet installation path in the `path.sh` file
- All the data used in the experiment are stored in the `data` directory: `train` is used for training, `valid` is the validation set, and `cv_all` and `test` are each used for testing.
- To reproduce my experimental results, download the dataset first, then change the paths in the `wav.scp` file of each set in the `data` directory. You can also use the `sed` command to replace the paths in the `wav.scp` files with your own.
- The other files (e.g., `utt2IntLabel`, `utt2accent`, `text`, `utt2spk`...) can be used directly without changes.
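
The `sed` replacement mentioned above could look roughly like the sketch below. The function name and both path prefixes are placeholders chosen for illustration; check what prefix your copy of the `wav.scp` files actually contains before running it.

```shell
# Sketch: rewrite the audio path prefix in every <set>/wav.scp under a data directory.
# replace_wav_prefix is a hypothetical helper, not part of this repository.
replace_wav_prefix() {
  data_dir=$1
  old_prefix=$2
  new_prefix=$3
  for scp in "$data_dir"/*/wav.scp; do
    # Skip the unexpanded glob pattern when no wav.scp exists.
    [ -f "$scp" ] || continue
    # '#' is the sed delimiter so slashes in the paths need no escaping.
    # The old prefix is treated as a regular expression (a '.' matches any character).
    # Writing to a temporary file keeps this portable across GNU and BSD sed.
    sed "s#${old_prefix}#${new_prefix}#" "$scp" > "$scp.tmp" && mv "$scp.tmp" "$scp"
  done
}

# Example call with placeholder paths:
# replace_wav_prefix data /old/path/to/audio /your/path/to/audio
```

Only the first match on each line is replaced, which is enough when every line holds a single utterance ID and a single path.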
- Model file preparation
`run_accent_recogntion.sh` is used to train an accent recognition model. Before running it, you need to put the model file (`models/e2e_asr_transformer_accent.py`) into your ESPnet directory.
eg:
move `models/e2e_asr_transformer_accent.py` to `/your espnet location/espnet/nets/pytorch_backend`
move `models/e2e_asr_transformer_accent_with_attention.py` to `/your espnet location/espnet/nets/pytorch_backend`
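
As a rough sketch, the two model files can be put in place with a small helper like the one below. The function name is hypothetical, the ESPnet root is whatever your own checkout path is, and the sketch copies the files instead of moving them so the repository's `models` directory stays intact.

```shell
# Sketch: copy both accent model files into an ESPnet checkout.
# install_accent_models is a hypothetical helper; run it from the repository root.
install_accent_models() {
  espnet_root=$1
  backend_dir="$espnet_root/espnet/nets/pytorch_backend"
  # Stop early if the path does not look like an ESPnet checkout.
  if [ ! -d "$backend_dir" ]; then
    echo "backend directory not found: $backend_dir" >&2
    return 1
  fi
  for model_file in \
    models/e2e_asr_transformer_accent.py \
    models/e2e_asr_transformer_accent_with_attention.py; do
    cp "$model_file" "$backend_dir/" || return 1
  done
}

# Example call with a placeholder path:
# install_accent_models /path/to/your/espnet
```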
- Step by step
The overall code is divided into four parts: feature extraction, JSON file generation, model training, and decoding. Model training comes in two variants: with ASR initialization (step05) and without ASR initialization (step04). You can control which steps are executed by changing the value of the `steps` variable.
egs:
bash run_accent_recogntion.sh --nj 20 --steps 1-2 data exp
bash run_accent_recogntion.sh --nj 20 --steps 3 data exp
bash run_accent_recogntion.sh --nj 20 --steps 4 data exp
bash run_accent_recogntion.sh --nj 20 --steps 6 data exp
- ASR initialization
To get better results, the encoder of an ASR model can be used to initialize the encoder of the accent recognition model.
In the `run_accent_recogntion.sh` script, set the value of the `pretrained_model` variable to your ASR model path, then use the following command to run it.
bash run_accent_recogntion.sh --nj 20 --steps 5 data exp
- In addition, to make the results easier to reproduce and save you from training the ASR system again, I uploaded two ASR models: `pretrained_model/accent160.val5.avg.best` and `pretrained_model/accent160_and_librispeech960.val5.avg.best`. The first is trained on accent160 data only; the second is trained on both accent160 and librispeech960 data. You can use either model by changing the value of the `pretrained_model` variable.
- In the experiment, we found that training for too many epochs leads to overfitting, so we also examined how many epochs should be used for decoding to get the best result. In the accent classification system without ASR initialization, using only 10 epochs gives better results. With ASR initialization, using 5 epochs gives better results. You can decode with a different epoch by changing the `max_epoch` variable in `step06`, and you can vary `max_epoch` to find out which number of epochs produces the best results.
The purpose of training the ASR model is to initialize the accent recognition model. Because ASR training is no different from normal transformer training, there is no need to prepare additional model files. You can directly execute the `run_accent160_asr.sh` script step by step. The features of the single accent system (steps 01-02) can be reused directly.
egs:
bash run_accent160_asr.sh --nj 20 --steps 1-2 data exp
bash run_accent160_asr.sh --nj 20 --steps 3 data exp
bash run_accent160_asr.sh --nj 20 --steps 4 data exp
bash run_accent160_asr.sh --nj 20 --steps 5 data exp  # not necessary, because we only need to train the ASR model
bash run_accent160_asr.sh --nj 20 --steps 6 data exp
bash run_accent160_asr.sh --nj 20 --steps 7 data exp
All scripts take three inputs: `data`, `exp`, and `steps`.
- `data`: directory that stores the prepared data
- `exp`: output directory used during training
- `steps`: controls which steps are executed
For LibriSpeech data, you can prepare the data in Kaldi format and then mix it with the accent data to train the ASR system.
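
Mixing two Kaldi-format data directories amounts to concatenating their index files and re-sorting them. The sketch below illustrates the idea in plain shell; the function name and the directory names in the example call are placeholders, and it only handles `wav.scp`, `text`, and `utt2spk`. Kaldi itself ships `utils/combine_data.sh`, which is the more complete tool for this job, so treat this only as an illustration of what the mixing step does.

```shell
# Sketch: merge Kaldi-style data directories by concatenating and sorting index files.
# combine_kaldi_dirs is a hypothetical helper, not part of this repository.
combine_kaldi_dirs() {
  dest_dir=$1
  shift
  mkdir -p "$dest_dir" || return 1
  for index_file in wav.scp text utt2spk; do
    for src_dir in "$@"; do
      if [ -f "$src_dir/$index_file" ]; then
        cat "$src_dir/$index_file"
      fi
    done | LC_ALL=C sort -u > "$dest_dir/$index_file"
    # Drop the output file if none of the sources provided this index file.
    [ -s "$dest_dir/$index_file" ] || rm -f "$dest_dir/$index_file"
  done
}

# Example call with placeholder directory names:
# combine_kaldi_dirs data/train_mixed data/train data/librispeech_train
```

Sorting under `LC_ALL=C` follows the ordering Kaldi expects for its data files. Files such as `spk2utt` still need to be regenerated from the merged `utt2spk` afterwards.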
In practice, it is hard to obtain enough domain-specific real telephony data to train acoustic models because of data privacy considerations. We therefore use a data augmentation method based on simulating diverse audio codecs to train a telephony speech recognition system.
In this study, we use the AESRC accent data as wide-band data. We first down-sample the 16 kHz accent data to 8 kHz. To simulate narrow-band data, we then select a codec at random from the full list of codecs and use FFmpeg to convert the audio to narrow-band data.
For the specific implementation, refer to the `add-codec/add-codec.sh` script. Before you run it, you must change the value "/home4/hhx502/w2019/ffmpeg_source/bin/ffmpeg" in `add-codec/scripts/add-codec-with-ffmpeg.pl` to your ffmpeg path. Then modify the values of the `data_set` and `source_dir` variables in the `add-codec/add-codec.sh` script. After these two steps, you can run it directly.
egs:
bash add-codec.sh
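
To illustrate the idea behind the codec simulation (not the actual logic of `add-codec.sh` or its Perl helper), the sketch below picks a codec at random and prints an ffmpeg command that down-samples a file to 8 kHz mono with that codec. The codec list, function names, and file names are assumptions made for this example; the real script uses its own codec list.

```shell
# Sketch: random codec choice plus an 8 kHz ffmpeg conversion command.
# CODECS is an illustrative list of ffmpeg encoders, not the list used by add-codec.sh.
CODECS="pcm_mulaw pcm_alaw adpcm_ms"

pick_codec() {
  # awk's srand() seeds from the clock, so calls within the same second repeat.
  echo "$CODECS" | tr ' ' '\n' |
    awk 'BEGIN { srand() } { c[NR] = $0 } END { print c[int(rand() * NR) + 1] }'
}

build_ffmpeg_cmd() {
  in_wav=$1
  out_wav=$2
  codec=$3
  # -ar 8000 resamples to 8 kHz, -ac 1 forces mono, -c:a selects the audio encoder.
  # The command is only printed, so file names containing spaces are not handled.
  echo "ffmpeg -y -i $in_wav -ar 8000 -ac 1 -c:a $codec $out_wav"
}

# Example: print the command for one file (placeholder file names).
# build_ffmpeg_cmd input_16k.wav output_8k.wav "$(pick_codec)"
```

Printing the command instead of running it makes it easy to inspect the chosen codec first; pipe the output to `sh` once it looks right.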