PyTorch implementation of the speech embedding network and generalized end-to-end (GE2E) loss described in https://arxiv.org/pdf/1710.10467.pdf.
Also contains code to create embeddings suitable as input for the speaker diarization model at https://github.com/google/uis-rnn.
The model was trained on the TIMIT speech corpus, available at https://catalog.ldc.upenn.edu/LDC93S1 or https://github.com/philipperemy/timit.
- PyTorch 0.4.1
- python 3.5+
- numpy 1.15.4
- librosa 0.6.1
- webrtcvad
- pydub
WebRTC VAD and pydub are required to run the real-time speaker identification application.
The Python WebRTC VAD package (https://github.com/wiseman/py-webrtcvad) is also required to run dvector_create.py, but not to train the neural network.
Set the following config.yaml key to a glob pattern matching all .WAV files in your downloaded TIMIT dataset. The TIMIT .WAV files must be converted to the standard RIFF format for the dvector_create.py script, but not for training the neural network.
unprocessed_data: './TIMIT/*/*/*/*.wav'
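Before preprocessing, it can help to check what the pattern actually matches. The sketch below assumes the value is expanded with Python's `glob` module, which is not confirmed here; the helper `list_matches` and the fake directory tree are hypothetical and exist only so the example runs anywhere.

```python
import glob
import os
import tempfile


def list_matches(pattern):
    """Return the sorted list of paths matched by a glob pattern."""
    return sorted(glob.glob(pattern))


# Build a tiny TIMIT-like tree (hypothetical layout) in a temp directory.
root = tempfile.mkdtemp()
utt_dir = os.path.join(root, "TIMIT", "TRAIN", "DR1", "FCJF0")
os.makedirs(utt_dir)
for name in ("SA1.wav", "SA2.wav", "notes.txt"):
    open(os.path.join(utt_dir, name), "w").close()

# Same shape as the config value: three directory levels, then *.wav.
pattern = os.path.join(root, "TIMIT", "*", "*", "*", "*.wav")
matches = list_matches(pattern)
print(len(matches), "files matched")
```

Python's glob matching is case-sensitive on Linux, so the extension in the pattern should use the same case as the files on disk.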
Run the preprocessing script:
./data_preprocess.py
Two folders, train_tisv and test_tisv, will be created. Each holds .npy files of numpy ndarrays of speaker utterances, with a 90%/10% training/testing split.
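As a rough illustration of a 90%/10% split, the sketch below divides a list of speakers so that the first nine tenths go to training and the rest to testing. Splitting by speaker rather than by utterance is an assumption here, and `split_speakers` is a hypothetical helper, not a function from data_preprocess.py.

```python
def split_speakers(speaker_ids):
    """Split speakers 90%/10% into train and test lists (integer arithmetic)."""
    n_train = (len(speaker_ids) * 9) // 10
    return speaker_ids[:n_train], speaker_ids[n_train:]


# Hypothetical speaker identifiers, for illustration only.
speakers = ["spk%03d" % i for i in range(20)]
train_speakers, test_speakers = split_speakers(speakers)
print(len(train_speakers), len(test_speakers))
```

Keeping all of a speaker's utterances on one side of the split means test speakers are never seen during training, which suits a verification task.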
To train the speaker verification model, run:
./train_speech_embedder.py
with the following config.yaml key set to true:
training: !!bool "true"
For testing, set the key value to:
training: !!bool "false"
The log file and checkpoint save locations are controlled by the following values:
log_file: './speech_id_checkpoint/Stats'
checkpoint_dir: './speech_id_checkpoint'
Only TI-SV is implemented.
EER across 10 epochs: 0.0377
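For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false acceptance rate equals the false rejection rate. The following is a generic sketch of that definition, not the evaluation code used to produce the figure above; the function name and the example scores are made up for illustration.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping a threshold over all observed scores.

    genuine:  similarity scores for same-speaker trials
    impostor: similarity scores for different-speaker trials
    """
    best_gap, eer = None, None
    for thr in sorted(set(genuine) | set(impostor)):
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= thr for s in impostor) / len(impostor)
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor_scores = [0.7, 0.5, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine_scores, impostor_scores)
print("EER: %.2f" % eer)
```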
Place the enrollment data for each speaker at dataset/chihweif/train_wav/[speaker_id]/*.wav.
Run
python ./my_preprocess_data.py
to preprocess the enrollment audio samples. The preprocessed data will be located at dataset/chihweif/train_npy.
Run
python ./my_enroll_speaker.py [model path] dataset/chihweif/train_npy
to create an embedding vector for each speaker.
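One common way to build a per-speaker vector is to average the L2-normalized embeddings of that speaker's utterances. The sketch below shows that idea in plain Python; whether my_enroll_speaker.py does exactly this is an assumption, and `l2_normalize` and `enroll` are hypothetical helper names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def enroll(utterance_embeddings):
    """Average L2-normalized utterance embeddings into one unit-length vector."""
    normed = [l2_normalize(e) for e in utterance_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


# Two toy 2-dimensional "utterance embeddings" for one speaker.
speaker_vec = enroll([[3.0, 4.0], [4.0, 3.0]])
print(speaker_vec)
```

Normalizing before averaging keeps a single loud or long utterance from dominating the speaker vector.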
After enrollment, run
python ./my_realtime.py [model path] dataset/chihweif/enroll
to perform real-time speaker identification.
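Identification of this kind is typically done by comparing the embedding of the incoming audio against each enrolled speaker vector and picking the closest one. The sketch below uses cosine similarity with a rejection threshold; the threshold value, the speaker names, and the function names are hypothetical, and my_realtime.py may work differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(embedding, enrolled, threshold=0.5):
    """Return (speaker_id, score) for the best match, or (None, score)
    when even the best score falls below the threshold."""
    best_id, best_score = None, -1.0
    for speaker_id, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Toy enrolled speakers and a toy incoming embedding.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
who, score = identify([0.9, 0.1], enrolled)
print(who, round(score, 3))
```

Returning `None` below the threshold lets the application report an unknown speaker instead of forcing a match.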
After training and testing the model, run dvector_create.py to create the numpy files train_sequence.npy, train_cluster_ids.npy, test_sequence.npy, and test_cluster_ids.npy.
These files can be loaded and used to train the uis-rnn model found at https://github.com/google/uis-rnn.