- 2024/12 — MIT License, GigaAM-v2 (15% and 12% WER reduction for the CTC and RNN-T models, respectively), ONNX export support
- 2024/05 — GigaAM-RNNT (19% WER reduction), long-form inference using external Voice Activity Detection
- 2024/04 — GigaAM release: GigaAM-CTC (state-of-the-art speech recognition model for the Russian language) and GigaAM-Emo
- Overview
- Installation
- GigaAM: The Foundational Model
- GigaAM for Speech Recognition
- GigaAM-Emo: Emotion Recognition
- License
- Links
GigaAM (Giga Acoustic Model) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the Conformer architecture and leverage self-supervised learning (wav2vec2-based for GigaAM-v1 and HuBERT-based for GigaAM-v2).
GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.
This repository includes:
- GigaAM: A foundational self-supervised model pre-trained on massive Russian speech datasets.
- GigaAM-CTC and GigaAM-RNNT: Fine-tuned models for automatic speech recognition (ASR).
- GigaAM-Emo: A fine-tuned model for emotion recognition.
- Python ≥ 3.8
- ffmpeg installed and added to your system's PATH
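If you are unsure whether ffmpeg is actually visible from Python, a quick sanity check (just an illustration, not part of the package) is:

```python
import shutil

# Prints the full path to the ffmpeg binary, or None if it is not on PATH
print(shutil.which("ffmpeg"))
```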
1. Clone the repository:

   ```bash
   git clone https://github.com/salute-developers/GigaAM.git
   cd GigaAM
   ```

2. Install the package in editable mode:

   ```bash
   pip install -e .
   ```

3. Verify the installation:

   ```python
   import gigaam

   model = gigaam.load_model("ctc")
   print(model)
   ```
GigaAM is a Conformer-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.
It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.
There are two available versions:

- GigaAM-v1 was trained with a wav2vec2-like approach and can be used by loading the `v1_ssl` model version.
- GigaAM-v2 was trained with a HuBERT-like approach and yields higher-quality fine-tuned ASR models. It can be used by loading the `v2_ssl` or `ssl` model version.
More information about GigaAM-v1 can be found in our post on Habr.
```python
import gigaam

model = gigaam.load_model("ssl")  # Options: "ssl", "v1_ssl"

audio_path = "example.wav"  # path to your audio file
embedding, _ = model.embed_audio(audio_path)
```
We fine-tuned the GigaAM encoder for ASR using two different architectures:
- GigaAM-CTC was fine-tuned with Connectionist Temporal Classification and a character-based tokenizer.
- GigaAM-RNNT was fine-tuned with RNN Transducer loss.
Fine-tuning was done for both the GigaAM-v1 and GigaAM-v2 SSL models, so there are four ASR models: `v1` and `v2` versions for both CTC and RNNT.
The models were trained on publicly available Russian datasets:
| Dataset | Size (hours) | Weight |
|---|---|---|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
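The Weight column most likely controls how often each dataset is sampled during fine-tuning. As a minimal illustration (the weights-as-sampling-probabilities reading is an assumption, not taken from the actual training code), mixing training examples proportionally to these weights could look like:

```python
import random

# Dataset mix from the table above: (name, sampling weight)
datasets = [
    ("Golos", 0.6),
    ("SOVA", 0.2),
    ("Russian Common Voice", 0.1),
    ("Russian LibriSpeech", 0.1),
]

names = [name for name, _ in datasets]
weights = [weight for _, weight in datasets]

# Pick the source dataset for each training example proportionally to its weight
for _ in range(5):
    print(random.choices(names, weights=weights, k=1)[0])
```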
Word Error Rate (WER, %) on public test sets (lower is better):

| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| GigaAM-CTC-v1 | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| GigaAM-RNNT-v1 | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| GigaAM-CTC-v2 | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| GigaAM-RNNT-v2 | 243M | 2.2 | 3.9 | 13.3 | 20.0 | 10.2 | 1.8 | 2.7 | 5.5 |
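As a reminder of what these numbers measure, WER is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. A minimal, self-contained computation (a didactic sketch, not the evaluation script used for the table above) could look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,  # deletion
                dp[i][j - 1] + 1,  # insertion
                dp[i - 1][j - 1] + substitution_cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("привет мир", "привет весь мир"))  # 0.5: one insertion against two reference words
```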
To transcribe a short audio file:

```python
import gigaam

model_name = "rnnt"  # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)

audio_path = "example.wav"  # path to your audio file
transcription = model.transcribe(audio_path)
```
To transcribe long audio files:

1. Install the external VAD dependencies (the pyannote.audio library):

   ```bash
   pip install gigaam[longform]
   ```

2. Generate a Hugging Face API token.
3. Accept the conditions to access the pyannote/voice-activity-detection files and content.
4. Accept the conditions to access the pyannote/segmentation files and content.
5. Use the `model.transcribe_longform` method:

   ```python
   import os

   import gigaam

   os.environ["HF_TOKEN"] = "<HF_TOKEN>"

   model = gigaam.load_model("ctc")
   recognition_result = model.transcribe_longform("long_example.wav")

   for utterance in recognition_result:
       transcription = utterance["transcription"]
       start, end = utterance["boundaries"]
       print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")
   ```
To export and run the models with ONNX:

1. Export the model to ONNX using the `model.to_onnx` method:

   ```python
   import gigaam

   onnx_dir = "onnx"
   model_type = "rnnt"  # or "ctc"

   model = gigaam.load_model(
       model_type,
       fp16_encoder=False,  # only fp32 tensors
       use_flash=False,  # disable flash attention
   )
   model.to_onnx(dir_path=onnx_dir)
   ```

2. Run ONNX inference:

   ```python
   from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample

   sessions = load_onnx_sessions(onnx_dir, model_type)
   transcribe_sample("example.wav", model_type, sessions)
   ```
All of these examples can also be found in the inference_example.ipynb notebook.
GigaAM-Emo is a fine-tuned model for emotion recognition trained on the Dusha dataset. It significantly outperforms existing models on several metrics.
| Model | Crowd Unweighted Accuracy | Crowd Weighted Accuracy | Crowd Macro F1-score | Podcast Unweighted Accuracy | Podcast Weighted Accuracy | Podcast Macro F1-score |
|---|---|---|---|---|---|---|
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
```python
from typing import Dict

import gigaam

model = gigaam.load_model('emo')
emotion2prob: Dict[str, float] = model.get_probs("example.wav")
print(", ".join([f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()]))
```
GigaAM's code and model weights are released under the MIT License.