Spoken Language Classification

This repository contains models pretrained on the VoxLingua107 dataset for spoken (audio-based) language classification. The dataset (and therefore the models) covers 107 different languages. Four pretrained models are provided (see below).

Usage

git clone https://github.com/RicherMans/SpokenLanguageClassifiers
cd SpokenLanguageClassifiers
pip install -r requirements.txt
python3 predict.py AUDIOFILE

Currently four models have been pretrained, any of which can be selected with the --model MODELNAME parameter (see below).

By default the script prints the top N results (N=5; this can be changed with --N NUMBER).
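To illustrate the top-N step, here is a minimal sketch of how class probabilities can be turned into a ranked list. This is an assumption about what predict.py does internally (the function name `top_n` and the output format are hypothetical, and a toy three-language label set stands in for the 107 VoxLingua107 languages):

```python
import numpy as np

def top_n(logits, labels, n=5):
    """Return the n most probable (label, probability) pairs."""
    # Softmax over the per-language logits (shifted for numerical stability).
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # Indices of the n largest probabilities, in descending order.
    order = np.argsort(probs)[::-1][:n]
    return [(labels[i], float(probs[i])) for i in order]

# Toy example with three "languages" instead of the full 107.
labels = ["eng", "deu", "fra"]
logits = np.array([2.0, 0.5, 1.0])
for lang, p in top_n(logits, labels, n=2):
    print(f"{lang}: {p:.3f}")
```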

Models

Four models were pretrained and can be chosen as the back-end:

  1. CNN6 (default): A six-layer CNN using attention for temporal aggregation.
  2. CNN10: A ten-layer CNN using mean and max pooling for temporal aggregation.
  3. MobilenetV2: A MobileNetV2 implementation adapted for audio classification.
  4. CNNVAD: A model that performs VAD and language classification simultaneously. The VAD model is taken from GPV and Data-driven GPVAD; training fine-tunes both the VAD and the language-classification models. The back-end here is the default CNN6.
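The two temporal-aggregation schemes named above can be sketched as follows. This is an illustration, not the repository's actual code: the frame features and the learned attention vector `w` are stand-ins, and whether CNN10 sums or concatenates the mean- and max-pooled features is an assumption (summed here):

```python
import numpy as np

def attention_pool(frames, w):
    """Attention temporal aggregation (as in CNN6): a learned vector w
    scores each frame, softmax turns the scores into weights, and the
    frames are averaged with those weights."""
    scores = frames @ w                          # one score per frame, shape (T,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames                      # weighted average, shape (D,)

def mean_max_pool(frames):
    """Mean + max temporal aggregation (as in CNN10); the pooled
    vectors are summed here, which is an assumption."""
    return frames.mean(axis=0) + frames.max(axis=0)

# Toy clip: 6 frames of 4-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
w = rng.normal(size=4)
print(attention_pool(frames, w).shape)  # (4,)
print(mean_max_pool(frames).shape)      # (4,)
```

Both reduce a variable-length sequence of frame features to a single fixed-size clip embedding, which the classifier head then maps to the 107 language logits.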

Since I don't have access to other datasets for cross-dataset evaluation, I provide the current performance on my held-out cross-validation dataset:

| Model       | Precision (%) | Recall (%) | Accuracy (%) |
|-------------|---------------|------------|--------------|
| CNN6        | 81.7          | 84.4       | 83.6         |
| CNN10       | 89.9          | 90.9       | 90.8         |
| MobileNetV2 | 80.0          | 80.1       | 79.3         |
| CNNVAD      | 81.0          | 82.4       | 82.9         |
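For reference, here is one common way such metrics are computed for a multi-class problem. The macro-averaging scheme (per-class precision/recall averaged over classes) is an assumption; the repository does not state which averaging it uses:

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Macro-averaged precision and recall, plus overall accuracy,
    for integer class labels in [0, n_classes)."""
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = float(np.mean(y_true == y_pred))
    return float(np.mean(precisions)), float(np.mean(recalls)), accuracy

# Toy two-class example.
p, r, a = macro_metrics(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), 2)
print(f"precision={p:.3f} recall={r:.3f} accuracy={a:.3f}")
```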
