
Multimodal Engagement Classification - EmotiW2024

Introduction

For the ACM EmotiW 2024 challenge, we focused on the Engagement Classification on Videos sub-challenge, building a multimodal engagement prediction model on top of the EngageNet baselines.

Check out our presentation slides here

Dataset and baselines

We worked with the EngageNet dataset and its pre-ensemble baselines.
Through data augmentation (horizontal flips and color filters), we ensured that each class has a minimum of 3,500 videos; a minimal augmentation sketch follows the dataset figure below.

Dataset
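The augmentation itself lives in `notebooks/augmentation`; the snippet below is only a minimal sketch of the flip and color-filter idea, using OpenCV with illustrative parameters (the exact color filter used in the notebooks may differ).

```python
# Minimal sketch of flip + color-filter video augmentation (illustrative only).
import cv2
import numpy as np

def augment_video(src_path: str, dst_path: str, flip: bool = True,
                  saturation_scale: float = 1.2) -> None:
    """Write an augmented copy of a video: optional horizontal flip plus a
    simple color filter (saturation scaling in HSV space)."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if flip:
            frame = cv2.flip(frame, 1)  # 1 = horizontal flip
        # Color filter: scale saturation in HSV space, then convert back to BGR.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype("float32")
        hsv[..., 1] = np.clip(hsv[..., 1] * saturation_scale, 0, 255)
        frame = cv2.cvtColor(hsv.astype("uint8"), cv2.COLOR_HSV2BGR)
        out.write(frame)

    cap.release()
    out.release()
```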

Architecture

The final model is an ensemble of four modalities: Pose Tracking, Facial Landmarks, Facial Features, and Video Understanding.

Model Architecture
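The fusion step (the Early-Fusion (Transformer Fusion) row in the results below) can be pictured roughly as follows: each modality contributes one feature vector, the four vectors are projected to a common width and treated as a four-token sequence, and a small transformer encoder mixes them before a linear classifier. This is a hedged sketch with assumed feature dimensions, layer counts, and the four EngageNet engagement classes, not the exact configuration from `notebooks/ensemble`.

```python
# Sketch of early "transformer fusion" over four modality feature vectors.
# All dimensions and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, modality_dims=(256, 136, 512, 768), d_model=256,
                 num_classes=4, nhead=4, num_layers=2):
        super().__init__()
        # One linear projection per modality (pose, landmark, face, video).
        self.projections = nn.ModuleList(
            nn.Linear(d, d_model) for d in modality_dims
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, pose, landmark, face, video):
        # Project each modality to d_model and stack into a 4-token sequence.
        tokens = torch.stack(
            [proj(x) for proj, x in zip(self.projections,
                                        (pose, landmark, face, video))],
            dim=1,
        )                                       # (batch, 4, d_model)
        fused = self.encoder(tokens)            # (batch, 4, d_model)
        pooled = fused.mean(dim=1)              # average over modality tokens
        return self.classifier(pooled)          # (batch, num_classes) logits
```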

Code Layout

Structure:

- `notebooks/augmentation` - Data augmentation
- `notebooks/preprocessing` - Data preprocessing pipelines
- `notebooks/ensemble` - Model ensembling across the different modalities

Results

Individual Modalities

Evaluated on the EngageNet test set:

| Modality | Accuracy | F1-Score |
|---|---|---|
| Pose | 0.698 | 0.69 |
| Landmark | 0.614 | 0.58 |
| Face | 0.689 | 0.67 |
| Video Understanding | 0.652 | 0.61 |

Ensembling Performance

| Ensemble | Accuracy |
|---|---|
| Late-Fusion (Hard) | 0.676 |
| Late-Fusion (Soft) | 0.718 |
| Late-Fusion (Weighted) | 0.694 |
| Early-Fusion (Transformer Fusion) | 0.744 |
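
For reference, the three late-fusion variants above can be sketched as simple combinations of per-modality class probabilities. Array shapes, names, and weights below are illustrative, not the exact code from `notebooks/ensemble`.

```python
# Late-fusion sketches over per-modality probability arrays of shape (N, C).
import numpy as np

def hard_vote(prob_list):
    """Majority vote over each modality's argmax prediction (ties -> lowest class id)."""
    votes = np.stack([p.argmax(axis=1) for p in prob_list], axis=1)   # (N, M)
    num_classes = prob_list[0].shape[1]
    counts = np.stack([(votes == c).sum(axis=1) for c in range(num_classes)], axis=1)
    return counts.argmax(axis=1)

def soft_vote(prob_list):
    """Average the class probabilities across modalities, then take the argmax."""
    return np.mean(prob_list, axis=0).argmax(axis=1)

def weighted_vote(prob_list, weights):
    """Weighted average of class probabilities (e.g. weights from validation accuracy)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(prob_list, axis=0)                             # (M, N, C)
    return np.tensordot(weights, stacked, axes=1).argmax(axis=1)
```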

Ablation Results

| Ensemble | Accuracy |
|---|---|
| Pose-Land-Face | 0.743 |
| Pose-Land-Vid | 0.740 |
| Pose-Face-Vid | 0.747 |
| Land-Face-Vid | 0.695 |

Final Ensemble

| Dataset | Accuracy |
|---|---|
| Validation | 0.713 |
| Test | 0.747 |

The Team

Yichen Kang, Yanchun Zhang, Jun Wu
EESM5900V, The Hong Kong University of Science and Technology (HKUST)
