The goal of this project is to create a multi-modal Speech Emotion Recognition system trained on the IEMOCAP dataset.
- Feb 2019 - IEMOCAP dataset acquisition and parsing
- Mar 2019 - Baseline of linguistic model
- Apr 2019 - Baseline of acoustic model
- May 2019 - Integration and optimization of both models
- Jun 2019 - Integration with open-source ASR (most likely DeepSpeech)
IEMOCAP stands for the Interactive Emotional Dyadic Motion Capture database. It is one of the most widely used databases for multi-modal speech emotion recognition.
The IEMOCAP database suffers from a major class imbalance. To mitigate this, we reduce the number of classes to four and merge Excitement and Happiness into a single class.
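For reference, here is a minimal sketch of the merging step, assuming the utterances are available as (features, label) pairs with IEMOCAP's string labels; the actual logic lives in create_balanced_iemocap() (see the usage steps below).

```python
# Illustrative sketch only; the repository's create_balanced_iemocap() is the real implementation.
# Assumes utterances come as (features, label) pairs with IEMOCAP's string labels.
LABEL_MAP = {
    "neutral": "neutral",
    "happiness": "happiness",
    "excitement": "happiness",  # fold Excitement into Happiness
    "sadness": "sadness",
    "anger": "anger",
}

def balance_iemocap(samples):
    """Keep only the four target classes and merge Excitement into Happiness."""
    return [(feats, LABEL_MAP[label]) for feats, label in samples if label in LABEL_MAP]
```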
Acoustic model results:

| Classifier Architecture | Input type | Accuracy [%] |
|---|---|---|
| Convolutional Neural Network | Spectrogram | 55.3 |
| Bidirectional LSTM with self-attention | LLD features | 53.2 |
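The table only names the acoustic architectures; as an illustration, here is a minimal PyTorch sketch of a CNN spectrogram classifier of this kind (layer counts and sizes are assumptions, not the repository's actual hyperparameters):

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Small CNN over (batch, 1, freq, time) spectrograms; layer sizes are illustrative."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size output regardless of input length
        )
        self.classifier = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = SpectrogramCNN()(torch.randn(8, 1, 128, 256))  # batch of 8 spectrograms
```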
Linguistic model results:

| Classifier Architecture | Input type | Accuracy [%] |
|---|---|---|
| LSTM | Transcription | 58.9 |
| Bidirectional LSTM | Transcription | 59.4 |
| Bidirectional LSTM with self-attention | Transcription | 63.1 |
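Similarly, a minimal PyTorch sketch of a bidirectional LSTM with additive self-attention pooling over word-embedding sequences (embedding size, hidden size, and the attention form are assumptions):

```python
import torch
import torch.nn as nn

class BiLSTMSelfAttention(nn.Module):
    """BiLSTM encoder with a simple self-attention pooling layer over timesteps."""
    def __init__(self, embed_dim=300, hidden=128, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden, 1)   # one attention score per timestep
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                   # x: (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                 # (batch, seq_len, 2 * hidden)
        weights = torch.softmax(self.attention(h), dim=1)
        context = (weights * h).sum(dim=1)  # attention-weighted sum over time
        return self.classifier(context)

logits = BiLSTMSelfAttention()(torch.randn(8, 50, 300))  # 8 transcriptions of 50 tokens
```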
The ensemble architectures combine the most accurate acoustic and linguistic models: the bidirectional LSTM with self-attention for the linguistic branch and the Convolutional Neural Network for the acoustic branch. A sketch of both ensembling strategies follows the results table below.
| Ensemble type | Accuracy [%] |
|---|---|
| Decision-level Ensemble (maximum confidence) | 66.7 |
| Decision-level Ensemble (average) | 68.8 |
| Decision-level Ensemble (weighted average) | 69.0 |
| Feature-level Ensemble | 71.1 |
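For illustration, a minimal sketch of both ensembling strategies, assuming each base model exposes class logits (for the decision-level variants) and penultimate-layer features (for the feature-level variant); the weights and dimensions are assumptions:

```python
import torch
import torch.nn as nn

def decision_level_ensemble(acoustic_logits, linguistic_logits, w_acoustic=0.5):
    """Weighted average of the class probabilities predicted by both models."""
    p_acoustic = torch.softmax(acoustic_logits, dim=1)
    p_linguistic = torch.softmax(linguistic_logits, dim=1)
    return w_acoustic * p_acoustic + (1 - w_acoustic) * p_linguistic

class FeatureLevelEnsemble(nn.Module):
    """Concatenates penultimate-layer features of both models and classifies them jointly."""
    def __init__(self, acoustic_dim, linguistic_dim, num_classes=4):
        super().__init__()
        self.classifier = nn.Linear(acoustic_dim + linguistic_dim, num_classes)

    def forward(self, acoustic_features, linguistic_features):
        return self.classifier(torch.cat([acoustic_features, linguistic_features], dim=1))
```

Setting w_acoustic to 0.5 reproduces the plain average; an unequal weight gives the weighted-average variant.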
1. Download the IEMOCAP dataset from https://sail.usc.edu/iemocap/
2. Create the dataset pickle using this module: https://github.com/didi/delta/blob/master/egs/iemocap/emo/v1/local/python/mocap_data_collect.py
3. Use create_balanced_iemocap() to get the balanced version of the IEMOCAP dataset containing 4 classes.
4. Use load_<DATASET_TYPE>_dataset to load a specific dataset.
The first time you load a dataset, it is created from scratch and cached in .npy files; this might take a while. On subsequent loads it is read from the cached .npy files.
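A hypothetical usage sketch: the function names come from the steps above, while the module path and exact signatures are assumptions that may need adjusting to the actual package layout.

```python
# Hypothetical usage; adjust the import path and return handling to the actual code.
from speech_emotion_recognition.datasets import (  # assumed module path
    create_balanced_iemocap,
    load_spectrogram_dataset,  # one of the load_<DATASET_TYPE>_dataset helpers
)

create_balanced_iemocap()                      # build the 4-class balanced IEMOCAP subset
features, labels = load_spectrogram_dataset()  # assumed return values; cached to .npy after the first call
```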
Hyperparameter tuning:

python3 -m speech_emotion_recognition.run_hyperparameter_tuning -m acoustic-spectrogram

Training:

python3 -m speech_emotion_recognition.run_training_ensemble -m acoustic-spectrogram

python3 -m speech_emotion_recognition.run_training_ensemble -a /path/to/acoustic_spec_model.torch -l /path/to/linguistic_model.torch

Evaluation:

python3 -m speech_emotion_recognition.run_evaluate -a /path/to/acoustic_spec_model.torch -l /path/to/linguistic_model.torch -e /path/to/ensemble_model.torch
The same commands can be run through Docker:

docker run -t -v /path/to/project/data:/data -v /path/to/project/saved_models:/saved_models -v /tmp:/tmp speech-emotion-recognition -m speech_emotion_recognition.run_hyperparameter_tuning -m acoustic-spectrogram

docker run -t -v /path/to/project/data:/data -v /path/to/project/saved_models:/saved_models -v /tmp:/tmp speech-emotion-recognition -m speech_emotion_recognition.run_training_ensemble -m acoustic-spectrogram

docker run -t -v /path/to/project/data:/data -v /path/to/project/saved_models:/saved_models -v /tmp:/tmp speech-emotion-recognition -m speech_emotion_recognition.run_training_ensemble -a /path/to/acoustic_spec_model.torch -l /path/to/linguistic_model.torch

docker run -t -v /path/to/project/data:/data -v /path/to/project/saved_models:/saved_models -v /tmp:/tmp speech-emotion-recognition -m speech_emotion_recognition.run_evaluate -a /path/to/acoustic_spec_model.torch -l /path/to/linguistic_model.torch -e /path/to/ensemble_model.torch