Neural speaker diarization with pyannote.audio

This is the development branch of the upcoming pyannote.audio 2.0, for which it was decided to rewrite almost everything from scratch. Highlights of the upcoming release are showcased in the pyannote.audio 101 section below.

Installation

Until a proper release is available on PyPI, install from the develop branch:

pip install https://github.com/pyannote/pyannote-audio/archive/develop.zip

Windows users need to install PyTorch themselves using the recommended commands (only torch and torchaudio are required) after installing pyannote.audio.
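
For example, on a CPU-only machine this boils down to something like the following (check pytorch.org for the command matching your OS and CUDA setup):

pip install torch torchaudio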

pyannote.audio 101

For now, this is the closest you can get to actual documentation.

The experimental protocol is made reproducible thanks to pyannote.database. Here, we use the AMI "only_words" speaker diarization protocol.

from pyannote.database import get_protocol
ami = get_protocol('AMI.SpeakerDiarization.only_words')
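
If you want to peek at what a protocol provides, each subset is a generator of file dictionaries. A minimal sketch, assuming the AMI corpus is configured for pyannote.database:

# each file comes with (at least) its 'uri' identifier
# and its reference 'annotation' (who speaks when)
for file in ami.train():
    print(file["uri"], file["annotation"])
    break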

Data augmentation is supported via torch-audiomentations.

from torch_audiomentations import Compose, ApplyImpulseResponse, AddBackgroundNoise
augmentation = Compose(transforms=[ApplyImpulseResponse(...),
                                   AddBackgroundNoise(...)])
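
The ellipses above stand for transform-specific parameters. Here is a hedged sketch of a filled-in version; the ir_paths / background_paths argument names, paths, and probabilities are assumptions, so check the torch-audiomentations documentation:

# assumed argument names and illustrative values, not verbatim API
augmentation = Compose(transforms=[
    ApplyImpulseResponse(ir_paths="/path/to/rirs", p=0.5),         # random reverberation
    AddBackgroundNoise(background_paths="/path/to/noises", p=0.5)  # random background noise
])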

A growing collection of tasks can be addressed. Here, we address speaker segmentation.

from pyannote.audio.tasks import Segmentation
seg = Segmentation(ami, augmentation=augmentation)
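
Tasks also control how training chunks are sampled. It could look something like this, where the duration and batch_size keywords are assumptions about the task interface and the values are purely illustrative:

# assumed keywords: duration (chunk length in seconds) and batch_size
seg = Segmentation(ami, duration=5.0, batch_size=32, augmentation=augmentation)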

A growing collection of model architectures can be used. Here, we use the PyanNet (SincNet + LSTM) architecture.

from pyannote.audio.models.segmentation import PyanNet
model = PyanNet(task=seg)

We benefit from all the nice things that pytorch-lightning has to offer: distributed (GPU & TPU) training, model checkpointing, logging, etc. In this example, we don't really use any of them (see the sketch after the snippet below).

from pytorch_lightning import Trainer
trainer = Trainer()
trainer.fit(model)
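
A sketch of what enabling some of them could look like, assuming the pytorch-lightning API of that era (gpus flag, ModelCheckpoint callback) with illustrative values:

from pytorch_lightning.callbacks import ModelCheckpoint

# train on one GPU for 10 epochs, saving checkpoints along the way
trainer = Trainer(gpus=1, max_epochs=10, callbacks=[ModelCheckpoint()])
trainer.fit(model)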

Predictions are obtained by wrapping the model into the Inference engine.

from pyannote.audio import Inference
inference = Inference(model)
predictions = inference('audio.wav')
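
By default, the model is applied over the whole file. A hedged sketch of tweaking the sliding window, assuming Inference accepts duration and step parameters (values illustrative):

# assumed parameters: 2s windows with 500ms hop
inference = Inference(model, duration=2.0, step=0.5)
predictions = inference('audio.wav')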

Pretrained models can be shared on the Huggingface.co model hub. Here, we download and use a pretrained voice activity detection model.

inference = Inference('hbredin/VoiceActivityDetection-PyanNet-DIHARD')
predictions = inference('audio.wav')
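
Assuming predictions come back as a pyannote.core.SlidingWindowFeature (frame-wise scores), they can be inspected like this:

# .data is the raw score matrix, .sliding_window maps frames to time
print(predictions.data.shape)
print(predictions.sliding_window)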

Fine-tuning is as easy as setting the task attribute, freezing early layers, and training. Here, we take a voice activity detection model pretrained on the DIHARD dataset and fine-tune it on the AMI dataset.

from pyannote.audio import Model
from pyannote.audio.tasks import VoiceActivityDetection
model = Model.from_pretrained('hbredin/VoiceActivityDetection-PyanNet-DIHARD')
model.task = VoiceActivityDetection(ami)
model.freeze_up_to('sincnet')
trainer.fit(model)

Transfer learning is also supported out of the box. Here, we do transfer learning from voice activity detection to overlapped speech detection.

from pyannote.audio.tasks import OverlappedSpeechDetection
osd = OverlappedSpeechDetection(ami)
model.task = osd
trainer.fit(model)

A default optimizer (Adam with default parameters) is automatically set up for you. Customizing the optimizer (and scheduler) requires overriding the model.configure_optimizers method:

from types import MethodType
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR

def configure_optimizers(self):
    # note: the learning rate value is illustrative
    optimizer = SGD(self.parameters(), lr=1e-3)
    return {"optimizer": optimizer,
            "lr_scheduler": ExponentialLR(optimizer, gamma=0.9)}

model.configure_optimizers = MethodType(configure_optimizers, model)
trainer.fit(model)

Contributing

The commands below will set up pre-commit hooks and install the packages needed for developing the pyannote.audio library.

pip install -e .[dev,testing]
pre-commit install

Testing

Tests rely on a set of debugging files available in the tests/data directory. Set the PYANNOTE_DATABASE_CONFIG environment variable to tests/data/database.yml before running the tests:

PYANNOTE_DATABASE_CONFIG=tests/data/database.yml pytest
