Audio-Visual Lip Synthesis via intermediate landmark representation | Final Year Project (Dissertation) of Wish Suharitdamrong
This is the code implementation of Wish Suharitdamrong's Final Year Project for Year 3 BSc Computer Science at the University of Surrey, on the topic of Audio-Visual Lip Synthesis via an intermediate landmark representation.
An online demonstration is available at 🤗 HuggingFace.
There are two ways to install the required packages: conda or pip.

1. Create a virtual conda environment from `environment.yml`.
2. Use pip to install the packages (make sure you use Python 3.7 or above, since older versions might not support some libraries).
# Create virtual environment from .yml file
conda env create -f environment.yml
# activate virtual environment
conda activate fyp
# Use pip to install required packages
pip install -r requirement.txt
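To confirm that the environment resolved correctly, you can run a quick import check. This is only a minimal sketch that assumes the environment provides PyTorch, OpenCV, and librosa (typical dependencies for this kind of lip-sync pipeline); adjust it to whatever `requirement.txt` actually installs.

```python
# Minimal environment sanity check (assumes torch, cv2, and librosa are installed).
import torch
import cv2
import librosa

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("OpenCV:", cv2.__version__)
print("librosa:", librosa.__version__)
```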
The audio-visual datasets used in this project are LRS2 and LRS3. LRS2 data was used for both model training and evaluation, while LRS3 data was used only for model evaluation.
| Dataset | Page |
|---|---|
| LRS2 | Link |
| LRS3 | Link |
Download weights for the Generator models
| Model | Download Link |
|---|---|
| Generator | Link |
| Generator + SyncLoss | Link |
| Attention Generator + SyncLoss | Link |
Download weights for the Landmark-based SyncNet model: Download Link
Pre-trained weights for the Image2Image Translation model can be downloaded from the MakeItTalk repository, in their pre-trained models section: Repo Link.
├── checkpoint                 # Directory for model checkpoints
│   ├── generator              # put Generator model weights here
│   ├── syncnet                # put Landmark SyncNet model weights here
│   └── image2image            # put Image2Image Translation model weights here
python run_inference.py --generator_checkpoint <checkpoint_path> --image2image_checkpoint <checkpoint_path> --input_face <image/video_path> --input_audio <audio_source_path>
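If you want to drive inference over several audio files programmatically, a small wrapper like the one below can help. It is only a sketch: every path and file name in it (checkpoints, the face video, the audio folder) is a placeholder rather than something shipped with this repository.

```python
# Hypothetical batch-inference wrapper: runs run_inference.py once per audio file.
# All paths below are placeholders; point them at your own checkpoints and data.
import subprocess
from pathlib import Path

GENERATOR_CKPT = "checkpoint/generator/generator.pth"        # placeholder checkpoint name
IMAGE2IMAGE_CKPT = "checkpoint/image2image/image2image.pth"  # placeholder checkpoint name
FACE = "inputs/speaker.mp4"                                  # placeholder face image/video
AUDIO_DIR = Path("inputs/audio")                             # placeholder folder of .wav files

for audio in sorted(AUDIO_DIR.glob("*.wav")):
    subprocess.run([
        "python", "run_inference.py",
        "--generator_checkpoint", GENERATOR_CKPT,
        "--image2image_checkpoint", IMAGE2IMAGE_CKPT,
        "--input_face", FACE,
        "--input_audio", str(audio),
    ], check=True)
```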
I used the same data preprocessing approach as Wav2Lip; more details about the folder structure can be found in their repository Here.
python preprocess_data.py --data_root data_root/main --preprocessed_root preprocessed_lrs2_landmark/
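After preprocessing, it can be worth sanity-checking the output folder before starting a training run. The snippet below is a rough sketch that assumes the Wav2Lip-style layout referenced above, where each clip gets its own folder containing an `audio.wav`; adapt the expected file names if your preprocessed layout differs.

```python
# Rough sanity check of the preprocessed dataset (assumes Wav2Lip-style layout:
# preprocessed_root/<video_id>/<clip_id>/ with an audio.wav per clip).
from pathlib import Path

root = Path("preprocessed_lrs2_landmark")
clips = [d for d in root.glob("*/*") if d.is_dir()]
missing_audio = [d for d in clips if not (d / "audio.wav").exists()]

print(f"{len(clips)} preprocessed clips found")
print(f"{len(missing_audio)} clips are missing audio.wav")
```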
# CLI for training the attention generator with a pretrained landmark SyncNet discriminator
python run_train_generator.py --model_type attnlstm --train_type pretrain --data_root preprocessed_lrs2_landmark/ --checkpoint_dir <folder_to_save_checkpoints>
# CLI for training the landmark SyncNet discriminator (used as the pretrained discriminator above)
python run_train_syncnet.py --data_root preprocessed_lrs2_landmark/ --checkpoint_dir <folder_to_save_checkpoints>
This project used data from the LRS2 and LRS3 datasets for quantitative evaluation; the list of evaluation data is provided by Wav2Lip. The filelist (video and audio data used for evaluation) and details about the Lip Sync benchmark are available in their repository Here.
cd evaluation
# generate evaluation videos
python gen_eval_vdo.py --filelist <path> --data_root <path> --model_type <type_of_model> --result_dir <save_path> --generator_checkpoint <gen_ckpt> --image2image_checkpoint <image2image_checkpoint>
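Before computing any metrics, it can help to verify that every entry in the filelist actually produced a result video. The sketch below assumes one relative clip path per line in the filelist and `.mp4` outputs under the result directory; both the paths and the output naming scheme are assumptions for illustration, not guarantees about how `gen_eval_vdo.py` names its outputs.

```python
# Check that every filelist entry has a corresponding generated video.
# Paths and the output naming scheme below are placeholders; adjust them
# to match how gen_eval_vdo.py actually writes its results.
from pathlib import Path

filelist = Path("filelists/test.txt")   # placeholder filelist path
result_dir = Path("results")            # placeholder result directory

entries = [line.strip() for line in filelist.read_text().splitlines() if line.strip()]
missing = [e for e in entries if not (result_dir / f"{e.replace('/', '_')}.mp4").exists()]

print(f"{len(entries)} entries in filelist, {len(missing)} without a generated video")
```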
The code base of this project was inspired by Wav2Lip and MakeItTalk. I would like to thank the authors of both projects for making the code implementations of their amazing work available online.