This repository contains models pretrained on the VoxLingua107 dataset for spoken (audio-based) language classification. The dataset (and therefore the models) covers 107 languages. Four models are provided (see below).
```bash
git clone https://github.com/RicherMans/SpokenLanguageClassifiers
pip install -r requirements.txt
python3 predict.py AUDIOFILE
```
Four models have been pretrained (see below); select one with the `--model MODELNAME` parameter. By default the script prints the top N results (N=5, changeable with `--N NUMBER`).
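The top-N selection could be sketched as follows (a minimal illustration with hypothetical names and toy scores; the actual logic in `predict.py` may differ):

```python
# Hypothetical sketch: pick the N highest-scoring languages from a
# {language: score} mapping, as `predict.py --N NUMBER` might do internally.
def top_n(scores: dict, n: int = 5):
    """Return the n highest-scoring (language, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]

scores = {"en": 0.70, "fr": 0.15, "de": 0.10, "es": 0.05}
print(top_n(scores, n=2))  # [('en', 0.7), ('fr', 0.15)]
```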
Four models were pretrained and can be chosen as the back-end:
- CNN6 (default): A six-layer CNN using attention as temporal aggregation.
- CNN10: A ten-layer CNN using mean- and max-pooling as temporal aggregation.
- MobileNetV2: A MobileNet implementation for audio classification.
- CNNVAD: A model that performs VAD and language classification simultaneously. The VAD model is taken from GPV and Data-driven GPVAD; both the VAD and language-classification models were fine-tuned jointly. The back-end here is the default CNN6.
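Attention-based temporal aggregation, as used in CNN6, can be illustrated with a small NumPy sketch (the weight vector `w` here is a hypothetical stand-in; the real model learns its attention parameters during training):

```python
import numpy as np

def attention_pool(frames: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) sequence of frame embeddings into one (D,) clip vector."""
    logits = frames @ w                      # one scalar attention score per frame
    weights = np.exp(logits - logits.max())  # numerically stable softmax over time
    weights /= weights.sum()
    return weights @ frames                  # attention-weighted average over frames

frames = np.random.randn(100, 64)  # T=100 frames, D=64 features
w = np.random.randn(64)            # hypothetical learned attention vector
pooled = attention_pool(frames, w)
print(pooled.shape)  # (64,)
```

With all-zero attention weights the softmax is uniform, so the pooling reduces to a plain temporal mean; the learned weights let the model emphasize speech-heavy frames.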
Since I don't have access to other datasets for cross-dataset evaluation, I report performance on my held-out cross-validation set:
Model | Precision (%) | Recall (%) | Accuracy (%)
---|---|---|---
CNN6 | 81.7 | 84.4 | 83.6
CNN10 | 89.9 | 90.9 | 90.8
MobileNetV2 | 80.0 | 80.1 | 79.3
CNNVAD | 81.0 | 82.4 | 82.9