
GigaAM: the family of open-source acoustic models for speech processing

Overview

GigaAM (Giga Acoustic Model) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the Conformer architecture and leverage self-supervised learning (wav2vec2-based for GigaAM-v1 and HuBERT-based for GigaAM-v2).

GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.

This repository includes:

  • GigaAM: A foundational self-supervised model pre-trained on massive Russian speech datasets.
  • GigaAM-CTC and GigaAM-RNNT: Fine-tuned models for automatic speech recognition (ASR).
  • GigaAM-Emo: A fine-tuned model for emotion recognition.

Installation

Requirements

  • Python ≥ 3.8
  • ffmpeg installed and added to your system's PATH (a quick check is sketched below)
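
If you are not sure whether ffmpeg is visible from Python, a quick check like the one below (a small helper sketch, not part of the package) can catch a missing binary before audio loading fails:

import shutil
import subprocess

# Make sure the ffmpeg binary is discoverable on PATH.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found: install it and add it to your PATH")

# Print the first line of `ffmpeg -version` as a sanity check.
print(subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True).stdout.splitlines()[0])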

Install the GigaAM Package

  1. Clone the repository:

    git clone https://github.com/salute-developers/GigaAM.git
    cd GigaAM
  2. Install the package in editable mode:

    pip install -e .
  3. Verify the installation:

    import gigaam
    model = gigaam.load_model("ctc")
    print(model)

GigaAM: The Foundational Model

GigaAM is a Conformer-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.

It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.

There are two available versions:

  • GigaAM-v1 was trained with a wav2vec2-like approach and can be used by loading the v1_ssl model version.
  • GigaAM-v2 was trained with a HuBERT-like approach, which yields a higher-quality GigaAM-v2 ASR model. It can be used by loading the v2_ssl or ssl model version.

More information about GigaAM-v1 can be found in our post on Habr.

GigaAM Usage Example

import gigaam

# Load the self-supervised encoder and extract frame-level audio features.
model = gigaam.load_model('ssl') # Options: "ssl", "v1_ssl"
embedding, _ = model.embed_audio("example.wav")
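
The returned embedding is a frame-level representation that can be pooled into a single utterance-level vector for downstream tasks. A minimal follow-on sketch, assuming the embedding is a [batch, time, features] torch tensor (mean pooling is one common choice, not the only one):

# Assumption: embedding has shape [batch, time, features];
# average over the time axis to get one vector per utterance.
utterance_vector = embedding.mean(dim=1)
print(utterance_vector.shape)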

GigaAM for Speech Recognition

We fine-tuned the GigaAM encoder for ASR using two different architectures: CTC and RNNT.

Fine-tuning was done for both the GigaAM-v1 and GigaAM-v2 SSL models, so there are four ASR models in total: v1 and v2 versions for both CTC and RNNT.
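
As a quick way to compare the four checkpoints on your own data, the sketch below (illustrative only) loads each one by the names accepted by gigaam.load_model and transcribes the same file:

import gigaam

# The four fine-tuned ASR checkpoints: v1/v2 encoders with CTC and RNNT decoders.
for name in ["v1_ctc", "v1_rnnt", "v2_ctc", "v2_rnnt"]:
    model = gigaam.load_model(name)
    print(name, "->", model.transcribe("example.wav"))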

Training Data

The models were trained on publicly available Russian datasets:

| Dataset | Size (hours) | Weight |
|---|---|---|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
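
The Weight column presumably gives each corpus's relative contribution to the training mixture. A minimal sketch of how such weighted sampling could look (an assumption for illustration, not the project's actual training recipe):

import random

# Relative sampling weights taken from the table above.
datasets = {"Golos": 0.6, "SOVA": 0.2, "Russian Common Voice": 0.1, "Russian LibriSpeech": 0.1}

def sample_dataset() -> str:
    # Pick which corpus the next training example is drawn from.
    names, weights = zip(*datasets.items())
    return random.choices(names, weights=weights, k=1)[0]

print(sample_dataset())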

Performance Metrics (Word Error Rate)

| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| GigaAM-CTC-v1 | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| GigaAM-RNNT-v1 | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| GigaAM-CTC-v2 | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| GigaAM-RNNT-v2 | 243M | 2.2 | 3.9 | 13.3 | 20.0 | 10.2 | 1.8 | 2.7 | 5.5 |
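
Word Error Rate on your own data can be computed with any standard WER implementation; the sketch below uses the external jiwer package (not a GigaAM dependency, and the benchmark's exact text normalization is not reproduced here):

import gigaam
import jiwer  # pip install jiwer

model = gigaam.load_model("rnnt")
hypothesis = model.transcribe("example.wav")
reference = "ожидаемый текст"  # placeholder ground-truth transcript for example.wav

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")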

Speech Recognition Example (GigaAM-ASR)

Basic usage: short audio transcription (up to 30 seconds)

import gigaam
model_name = "rnnt"  # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
transcription = model.transcribe("example.wav")
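
The same call works file by file, so transcribing a folder of short clips is just a loop; a minimal sketch (the audio_dir path is a placeholder):

from pathlib import Path

import gigaam

model = gigaam.load_model("rnnt")

# Transcribe every .wav file in a directory of short (<= 30 s) recordings.
for path in sorted(Path("audio_dir").glob("*.wav")):
    print(path.name, "->", model.transcribe(str(path)))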

Long-form audio transcription

  1. Install external VAD dependencies (pyannote.audio library) with
    pip install gigaam[longform]
  2. Use the model.transcribe_longform method:
    import os
    import gigaam
    
    os.environ["HF_TOKEN"] = "<HF_TOKEN>"
    
    model = gigaam.load_model("ctc")
    recognition_result = model.transcribe_longform("long_example.wav")
    
    for utterance in recognition_result:
        transcription = utterance["transcription"]
        start, end = utterance["boundaries"]
        print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")

ONNX inference example

  1. Export the model to ONNX using the model.to_onnx method:
    onnx_dir = "onnx"
    model_type = "rnnt" # or "ctc"
    
    model = gigaam.load_model(
       model_type,
       fp16_encoder=False,  # only fp32 tensors
       use_flash=False,  # disable flash attention
    )
    model.to_onnx(dir_path=onnx_dir)
  2. Run ONNX inference:
    from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample
    
    sessions = load_onnx_sessions(onnx_dir, model_type)
    transcribe_sample("example.wav", model_type, sessions)
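
If you prefer to drive the exported graphs yourself rather than through the helper functions, onnxruntime can load them directly. The sketch below (the exact file names inside onnx_dir depend on the export) simply lists the exported graphs and prints their input signatures:

from pathlib import Path

import onnxruntime as ort

# Inspect every ONNX graph produced by model.to_onnx.
for onnx_file in sorted(Path(onnx_dir).glob("*.onnx")):
    session = ort.InferenceSession(str(onnx_file))
    inputs = [(inp.name, inp.shape) for inp in session.get_inputs()]
    print(onnx_file.name, inputs)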

All of these examples can also be found in the inference_example.ipynb notebook.


GigaAM-Emo: Emotion Recognition

GigaAM-Emo is a fine-tuned model for emotion recognition trained on the Dusha dataset. It significantly outperforms existing models on several metrics.

Performance Metrics

| Model | Crowd Unweighted Accuracy | Crowd Weighted Accuracy | Crowd Macro F1-score | Podcast Unweighted Accuracy | Podcast Weighted Accuracy | Podcast Macro F1-score |
|---|---|---|---|---|---|---|
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
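
For reference, accuracy and macro F1 on your own labels can be computed with scikit-learn. The sketch below is illustrative only; how the table's "unweighted" and "weighted" accuracy map onto overall versus class-balanced accuracy is an assumption, so both are printed:

from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels: replace with real references and GigaAM-Emo predictions.
y_true = ["neutral", "angry", "sad", "neutral", "positive"]
y_pred = ["neutral", "angry", "neutral", "neutral", "positive"]

print("overall accuracy:", accuracy_score(y_true, y_pred))
print("class-balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))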

Emotion Recognition Example (GigaAM-Emo)

from typing import Dict

import gigaam

model = gigaam.load_model('emo')
emotion2prob: Dict[str, float] = model.get_probs("example.wav")

print(", ".join([f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()]))

License

GigaAM's code and model weights are released under the MIT License.


Links