This repository contains the realignment data, and the code of the realignment procedure and of the emotional recognition presented at Whose Emotion Matters? Speaking Activity Localisation without Prior Knowledge.
The idea of MELD-FAIR is to be a first step towards making the acoustic and visual data of MELD more reliable.
MELD is a multimodal multi-party dataset for emotion recognition in conversations. It contains videos with scenes from the Friends TV-sitcom as well as transcriptions of the utterances within those scenes.
The textual data of MELD (the transcriptions) are quite reliable. However, this reliability is not reflected in the video scenes. MELD-FAIR was conceived with the aim of addressing this issue.
Install the required packages
python3 -m pip install -r requirements.txt
Download the original dataset from MELD and store it in a new folder.
Extract the source code from TalkNet from its repository and store it in a new folder.
Obtain TalkNet's pretrained model pretrain_AVA.model
and store it in TalkNet's base folder.
Default folder names are already given in config.py. If the folder structure differs from the values given in config.py, modify it accordingly.
The generation of MELD-FAIR dataset is composed of several steps that should take quite a long time each.
The first step consists in generating the CSV with time stamp data for forced alignment and assembling of the realigned videos.
This data associates each video of MELD-FAIR (the realigned version of MELD) with a sequences of chunks of the original videos from MELD.
To execute this step, run the following script at the parent folder:
python3 -m MELD-FAIR.realigner.forced_alignment_data_generator
In the script above, MELD-FAIR is assumed to be the name of the folder containing this code. In case the name of the folder is different, use the corresponding name.
In this step, the videos are assembled with ffmpeg based on the information present in the CSV generated in the previous step.
To execute this step, run the following script at the parent folder:
python3 -m MELD-FAIR.realigner.realigned_video_assembler
In this step, for every frame of every video of MELD-FAIR, the delimiting coordinates of the bounding boxes that determine the regions of all detected faces are retrieved.
The bounding boxes retrieved this way are organised into tracks (faces are detected and tracked).
The information of the coordinates of those bounding boxes as well as the tracks they are associated with is then stored in a CSV file.
To execute this step, run the following script at the parent folder:
python3 -m MELD-FAIR.asd.face_bbox_determination
In this step, for every video of MELD-FAIR, TalkNet is used to determine the frames in which the face of the active speaker is detected, and to which of the above mentioned tracks it is assigned to.
This step produces another CSV file, which contains the same kind of information that exists in the CSV produced in the previous step.
However, instead of having information regarding all detected faces, it only contains information on the position of the face of the active speaker.
To execute this step, run the following script at the parent folder:
python3 -m MELD-FAIR.asd.active_speaker_detection
In this step, for every MELD-FAIR video, faces are cropped given the information given in the CSV produced in the previous step.
To execute this step, run the following script at the parent folder:
python3 -m MELD-FAIR.asd.generate_face_crops
A CSV file with bounding box information is already available at the csvs folder.
In order to extract face crops using the information in this CSV, remind to modify the content of config.py accordingly and run the script above.
In the folder er, there is the implementation of a model for the recognition of emotions from utterances in MELD-FAIR given acoustic and visual data.
In case you used MELD-FAIR for your research, please cite the following paper:
@article{carneiro2023emotion,
title = {Whose emotion matters? Speaking activity localisation without prior knowledge},
author = {Hugo Carneiro and Cornelius Weber and Stefan Wermter},
journal = {Neurocomputing},
volume = {545},
pages = {126271},
year = {2023},
doi = {10.1016/j.neucom.2023.126271}
}