
sslsv

sslsv is a PyTorch-based deep learning framework that implements a collection of Self-Supervised Learning (SSL) methods for learning speaker representations, applicable to a variety of speaker-related downstream tasks, most notably Speaker Verification (SV).

Our aim is to: (1) provide SOTA self-supervised methods by porting algorithms from the computer vision domain; and (2) evaluate them in a comparable environment.

Our training framework is depicted in the figure below.


News

  • April 2024 – 👏 Introduction of various new methods and a complete refactoring (v2.0).
  • June 2022 – 🌠 First release of sslsv (v1.0).

Features

General

  • Data:
    • Supervised and Self-supervised datasets (siamese and DINO sampling)
    • Audio augmentation (noise and reverberation)
  • Training:
    • CPU, GPU and multi-GPUs (DataParallel and DistributedDataParallel)
    • Checkpointing, resuming, early stopping and logging
    • Tensorboard and wandb
  • Evaluation:
    • Speaker verification
      • Backends: cosine scoring and PLDA
      • Metrics: EER, minDCF, actDCF, Cllr, avgRPrec (see the scoring sketch after this list)
    • Classification (emotion, language, ...)
  • Notebooks: DET curve, scores distribution, t-SNE on embeddings, ...
  • Misc: scalable config, typing, documentation and tests
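
As an illustration of the speaker verification evaluation listed above, the sketch below scores trial pairs with cosine similarity and derives the EER from the ROC curve. It is a minimal, self-contained example built on numpy and scikit-learn (both listed dependencies); the toy data and function names are hypothetical and not part of the sslsv API.

import numpy as np
from sklearn.metrics import roc_curve

def cosine_scores(enroll: np.ndarray, test: np.ndarray) -> np.ndarray:
    """Cosine similarity between paired enrollment and test embeddings."""
    enroll = enroll / np.linalg.norm(enroll, axis=1, keepdims=True)
    test = test / np.linalg.norm(test, axis=1, keepdims=True)
    return np.sum(enroll * test, axis=1)

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where false acceptance equals false rejection."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)

# Toy trials: 1 = same speaker (target), 0 = different speakers (non-target).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
enroll = rng.normal(size=(1000, 256)) + labels[:, None]
test = rng.normal(size=(1000, 256)) + labels[:, None]
print(f"EER: {compute_eer(labels, cosine_scores(enroll, test)) * 100:.2f}%")
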
Encoders
  • TDNN (sslsv.encoders.TDNN)
    X-Vectors: Robust DNN Embeddings for Speaker Recognition (PDF)
    David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur

  • Simple Audio CNN (sslsv.encoders.SimpleAudioCNN)
    Representation Learning with Contrastive Predictive Coding (arXiv)
    Aaron van den Oord, Yazhe Li, Oriol Vinyals

  • ResNet-34 (sslsv.encoders.ResNet34)
    VoxCeleb2: Deep Speaker Recognition (arXiv)
    Joon Son Chung, Arsha Nagrani, Andrew Zisserman

  • ECAPA-TDNN (sslsv.encoders.ECAPATDNN)
    ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (arXiv)
    Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck

Methods
  • LIM (sslsv.methods.LIM)
    Learning Speaker Representations with Mutual Information (arXiv)
    Mirco Ravanelli, Yoshua Bengio

  • CPC (sslsv.methods.CPC)
    Representation Learning with Contrastive Predictive Coding (arXiv)
    Aaron van den Oord, Yazhe Li, Oriol Vinyals

  • SimCLR (sslsv.methods.SimCLR)
    A Simple Framework for Contrastive Learning of Visual Representations (arXiv)
    Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

  • MoCo v2+ (sslsv.methods.MoCo)
    Improved Baselines with Momentum Contrastive Learning (arXiv)
    Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He

  • W-MSE (sslsv.methods.WMSE)
    Whitening for Self-Supervised Representation Learning (arXiv)
    Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe

  • Barlow Twins (sslsv.methods.BarlowTwins)
    Barlow Twins: Self-Supervised Learning via Redundancy Reduction (arXiv)
    Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

  • VICReg (sslsv.methods.VICReg)
    VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning (arXiv)
    Adrien Bardes, Jean Ponce, Yann LeCun

  • VIbCReg (sslsv.methods.VIbCReg)
    Computer Vision Self-supervised Learning Methods on Time Series (arXiv)
    Daesoo Lee, Erlend Aune

  • BYOL (sslsv.methods.BYOL)
    Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (arXiv)
    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko

  • SimSiam (sslsv.methods.SimSiam)
    Exploring Simple Siamese Representation Learning (arXiv)
    Xinlei Chen, Kaiming He

  • DINO (sslsv.methods.DINO)
    Emerging Properties in Self-Supervised Vision Transformers (arXiv)
    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

  • DeepCluster v2 (sslsv.methods.DeepCluster)
    Deep Clustering for Unsupervised Learning of Visual Features (arXiv)
    Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze

  • SwAV (sslsv.methods.SwAV)
    Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (arXiv)
    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin

Methods (ours)
  • Combiner (sslsv.methods.Combiner)
    Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning (arXiv)
    Théo Lepage, Réda Dehak

  • SimCLR Margins (sslsv.methods.SimCLRMargins)
    Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
    Théo Lepage, Réda Dehak

  • MoCo Margins (sslsv.methods.MoCoMargins)
    Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations (arXiv)
    Théo Lepage, Réda Dehak

  • SimCLR MultiViews (sslsv.methods.SimCLRMultiViews)


Requirements

sslsv runs on Python 3.8 with the following dependencies.

Module         Version
torch          >= 1.11.0
torchaudio     >= 0.11.0
numpy          *
pandas         *
soundfile      *
scikit-learn   *
speechbrain    *
tensorboard    *
wandb          *
ruamel.yaml    *
dacite         *
prettyprinter  *
tqdm           *

Note: developers will also need pytest, pre-commit and twine to work on this project.


Datasets

Speaker recognition: VoxCeleb1, VoxCeleb2

Language recognition:

Emotion recognition:

Data-augmentation: MUSAN, simulated Room Impulse Responses (RIRs)

Data used for the main experiments (conducted on VoxCeleb1 and VoxCeleb2 with data-augmentation) can be automatically downloaded, extracted, and prepared using the following scripts.

python tools/prepare_data/prepare_voxceleb.py data/
python tools/prepare_data/prepare_augmentation.py data/

The resulting data folder should have the structure shown below (a short sanity-check sketch follows the tree).

data
├── musan_split/
├── simulated_rirs/
├── voxceleb1/
├── voxceleb2/
├── voxceleb1_test_O
├── voxceleb1_test_H
├── voxceleb1_test_E
├── voxsrc2021_val
├── voxceleb1_train.csv
└── voxceleb2_train.csv
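
Assuming the layout above, a quick sanity check that the preparation scripts completed can be written in a few lines of Python; this is a convenience sketch, not a script shipped with sslsv.

from pathlib import Path

data = Path("data")
expected = [
    "musan_split", "simulated_rirs", "voxceleb1", "voxceleb2",
    "voxceleb1_test_O", "voxceleb1_test_H", "voxceleb1_test_E",
    "voxsrc2021_val", "voxceleb1_train.csv", "voxceleb2_train.csv",
]
missing = [name for name in expected if not (data / name).exists()]
print("OK" if not missing else f"Missing: {missing}")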

Other datasets have to be downloaded and extracted manually, but their train and trials files can be created with the corresponding scripts from the tools/prepare_data/ folder. Both file formats are illustrated below, followed by a short loading sketch.

  • Example format of a train file (voxceleb1_train.csv)

    File,Speaker
    voxceleb1/id10001/1zcIwhmdeo4/00001.wav,id10001
    ...
    voxceleb1/id11251/s4R4hvqrhFw/00009.wav,id11251
    
  • Example format of a trials file (voxceleb1_test_O)

    1 voxceleb1/id10270/x6uYqmx31kE/00001.wav voxceleb1/id10270/8jEAjG6SegY/00008.wav
    ...
    0 voxceleb1/id10309/0cYFdtyWVds/00005.wav voxceleb1/id10296/Y-qKARMSO7k/00001.wav
    
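Both files are plain text and straightforward to inspect with pandas (a listed dependency). The snippet below is an illustrative loading sketch based on the formats shown above, not part of the sslsv API.

import pandas as pd

# Train file: CSV with File and Speaker columns.
train = pd.read_csv("data/voxceleb1_train.csv")
print(train["Speaker"].nunique(), "speakers,", len(train), "utterances")

# Trials file: space-separated label (1 = target, 0 = non-target)
# followed by two utterance paths.
trials = pd.read_csv(
    "data/voxceleb1_test_O",
    sep=" ",
    names=["label", "enrollment", "test"],
)
print(trials["label"].value_counts())
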

Installation

  1. Clone this repository: git clone https://github.com/theolepage/sslsv.git.
  2. Install dependencies: pip install -r requirements.txt.

Note: sslsv can also be installed as a standalone package via pip: use pip install sslsv for the released version, or pip install . from the project root folder for the latest version.


Usage

  • Start a training (2 GPUs): ./train_ddp.sh 2 <config_path>.
  • Evaluate your model (2 GPUs): ./evaluate_ddp.sh 2 <config_path>.

Note: use sslsv/bin/train.py and sslsv/bin/evaluate.py to run in non-distributed mode on a CPU, a single GPU, or multiple GPUs (DataParallel).

Tensorboard

You can visualize your experiments with tensorboard --logdir models/your_model/.

wandb

To log your experiments, first provide your API key with wandb login API_KEY. Use wandb online and wandb offline to toggle wandb syncing.


Documentation

Documentation is currently being developed...


Results

SOTA

  • Train set: VoxCeleb2
  • Evaluation: VoxCeleb1-O (Original)
  • Encoder: ECAPA-TDNN (C=1024)
Method      Model                                             EER (%)  minDCF (p=0.01)  Checkpoint
SimCLR      ssl/voxceleb2/simclr/simclr_e-ecapa-1024          6.41     0.5160           🔗
MoCo        ssl/voxceleb2/moco/moco_e-ecapa-1024              6.38     0.5384           🔗
SwAV        ssl/voxceleb2/swav/swav_e-ecapa-1024              8.33     0.6120           🔗
VICReg      ssl/voxceleb2/vicreg/vicreg_e-ecapa-1024          7.85     0.6004           🔗
DINO        ssl/voxceleb2/dino/dino+_e-ecapa-1024             2.92     0.3523           🔗
Supervised  ssl/voxceleb2/supervised/supervised_e-ecapa-1024  1.34     0.1521           🔗
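
The minDCF values above correspond to the minimum of the normalized detection cost function over all score thresholds. Below is a minimal sketch of the standard definition, assuming C_miss = C_fa = 1; it is not necessarily the exact implementation used by sslsv.

import numpy as np
from sklearn.metrics import roc_curve

def compute_min_dcf(labels, scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr  # miss rate at each threshold
    dcf = p_target * c_miss * fnr + (1 - p_target) * c_fa * fpr
    return float(dcf.min() / min(p_target * c_miss, (1 - p_target) * c_fa))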

Acknowledgements

sslsv contains third-party components and code adapted from other open-source projects, including: voxceleb_trainer, voxceleb_unsupervised and solo-learn.


Citations

If you use sslsv, please consider starring this repository on GitHub and citing one of the following papers.

@InProceedings{lepage2024AdditiveMarginSSLSV,
  author    = {Lepage, Théo and Dehak, Réda},
  booktitle = {The Speaker and Language Recognition Workshop (Odyssey)},
  title     = {Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations},
  year      = {2024},
  url       = {https://www.isca-archive.org/odyssey_2024/lepage24_odyssey.html},
}

@InProceedings{lepage2023ExperimentingAdditiveMarginsSSLSV,
  author    = {Lepage, Théo and Dehak, Réda},
  booktitle = {INTERSPEECH},
  title     = {Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification},
  year      = {2023},
  url       = {https://www.isca-speech.org/archive/interspeech_2023/lepage23_interspeech.html},
}

@InProceedings{lepage2022LabelEfficientSelfSupervisedSV,
  author    = {Lepage, Théo and Dehak, Réda},
  booktitle = {INTERSPEECH},
  title     = {Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning},
  year      = {2022},
  url       = {https://www.isca-speech.org/archive/interspeech_2022/lepage22_interspeech.html},
}

License

This project is released under the MIT License.