
GigaAM: the family of open-source acoustic models for speech processing

Overview

GigaAM (Giga Acoustic Model) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the Conformer architecture and leverage self-supervised learning (wav2vec2-based for GigaAM-v1 and HuBERT-based for GigaAM-v2).

GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.

This repository includes:

  • GigaAM: A foundational self-supervised model pre-trained on massive Russian speech datasets.
  • GigaAM-CTC and GigaAM-RNNT: Fine-tuned models for automatic speech recognition (ASR).
  • GigaAM-Emo: A fine-tuned model for emotion recognition.

Installation

Requirements

  • Python ≥ 3.8
  • ffmpeg installed and added to your system's PATH (a quick check is sketched below)
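
If you are not sure whether ffmpeg is visible from Python, a quick check like the one below (a small helper sketch, not part of the package) can catch a missing binary before audio loading fails:

import shutil
import subprocess

# Make sure the ffmpeg binary is discoverable on PATH.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found: install it and add it to your PATH")

# Print the first line of `ffmpeg -version` as a sanity check.
print(subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True).stdout.splitlines()[0])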

Install the GigaAM Package

  1. Clone the repository:

    git clone https://github.com/salute-developers/GigaAM.git
    cd GigaAM
  2. Install the package in editable mode:

    pip install -e .
  3. Verify the installation:

    import gigaam
    model = gigaam.load_model("ctc")
    print(model)

GigaAM: The Foundational Model

GigaAM is a Conformer-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.

It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.

There are two available versions:

  • GigaAM-v1 was trained with a wav2vec2-like approach and can be used by loading the v1_ssl model version.
  • GigaAM-v2 was trained with a HuBERT-like approach, which yields a higher-quality GigaAM-v2 ASR model. It can be used by loading the v2_ssl or ssl model version.

More information about GigaAM-v1 can be found in our post on Habr.

GigaAM Usage Example

import gigaam

# Load the self-supervised encoder and extract frame-level audio features.
model = gigaam.load_model('ssl') # Options: "ssl", "v1_ssl"
embedding, _ = model.embed_audio("example.wav")
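
The returned embedding is a frame-level representation that can be pooled into a single utterance-level vector for downstream tasks. A minimal follow-on sketch, assuming the embedding is a [batch, time, features] torch tensor (mean pooling is one common choice, not the only one):

# Assumption: embedding has shape [batch, time, features];
# average over the time axis to get one vector per utterance.
utterance_vector = embedding.mean(dim=1)
print(utterance_vector.shape)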

GigaAM for Speech Recognition

We fine-tuned the GigaAM encoder for ASR using two different architectures: CTC and RNNT.

Fine-tuning was done for both the GigaAM-v1 and GigaAM-v2 SSL models, so there are four ASR models in total: v1 and v2 versions for both CTC and RNNT.
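
As a quick way to compare the four checkpoints on your own data, the sketch below (illustrative only) loads each one by the names accepted by gigaam.load_model and transcribes the same file:

import gigaam

# The four fine-tuned ASR checkpoints: v1/v2 encoders with CTC and RNNT decoders.
for name in ["v1_ctc", "v1_rnnt", "v2_ctc", "v2_rnnt"]:
    model = gigaam.load_model(name)
    print(name, "->", model.transcribe("example.wav"))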

Training Data

The models were trained on publicly available Russian datasets:

| Dataset | Size (hours) | Weight |
|---|---|---|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
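
The Weight column presumably gives each corpus's relative contribution to the training mixture. A minimal sketch of how such weighted sampling could look (an assumption for illustration, not the project's actual training recipe):

import random

# Relative sampling weights taken from the table above.
datasets = {"Golos": 0.6, "SOVA": 0.2, "Russian Common Voice": 0.1, "Russian LibriSpeech": 0.1}

def sample_dataset() -> str:
    # Pick which corpus the next training example is drawn from.
    names, weights = zip(*datasets.items())
    return random.choices(names, weights=weights, k=1)[0]

print(sample_dataset())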

Performance Metrics (Word Error Rate)

| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| GigaAM-CTC-v1 | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| GigaAM-RNNT-v1 | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| GigaAM-CTC-v2 | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| GigaAM-RNNT-v2 | 243M | 2.2 | 3.9 | 13.3 | 20.0 | 10.2 | 1.8 | 2.7 | 5.5 |
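
Word Error Rate on your own data can be computed with any standard WER implementation; the sketch below uses the external jiwer package (not a GigaAM dependency, and the benchmark's exact text normalization is not reproduced here):

import gigaam
import jiwer  # pip install jiwer

model = gigaam.load_model("rnnt")
hypothesis = model.transcribe("example.wav")
reference = "ожидаемый текст"  # placeholder ground-truth transcript for example.wav

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")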

Speech Recognition Example (GigaAM-ASR)

Basic usage: short audio transcription (up to 30 seconds)

import gigaam
model_name = "rnnt"  # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
transcription = model.transcribe("example.wav")
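
The same call works file by file, so transcribing a folder of short clips is just a loop; a minimal sketch (the audio_dir path is a placeholder):

from pathlib import Path

import gigaam

model = gigaam.load_model("rnnt")

# Transcribe every .wav file in a directory of short (<= 30 s) recordings.
for path in sorted(Path("audio_dir").glob("*.wav")):
    print(path.name, "->", model.transcribe(str(path)))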

Long-form audio transcription

  1. Install external VAD dependencies (pyannote.audio library) with
    pip install gigaam[longform]
  2. Use the model.transcribe_longform method:
    import os
    import gigaam
    
    os.environ["HF_TOKEN"] = "<HF_TOKEN>"
    
    model = gigaam.load_model("ctc")
    recognition_result = model.transcribe_longform("long_example.wav")
    
    for utterance in recognition_result:
        transcription = utterance["transcription"]
        start, end = utterance["boundaries"]
        print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")

ONNX inference example

  1. Export the model to ONNX using the model.to_onnx method:
    onnx_dir = "onnx"
    model_type = "rnnt" # or "ctc"
    
    model = gigaam.load_model(
       model_type,
       fp16_encoder=False,  # only fp32 tensors
       use_flash=False,  # disable flash attention
    )
    model.to_onnx(dir_path=onnx_dir)
  2. Run ONNX inference:
    from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample
    
    sessions = load_onnx_sessions(onnx_dir, model_type)
    transcribe_sample("example.wav", model_type, sessions)
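
If you prefer to drive the exported graphs yourself rather than through the helper functions, onnxruntime can load them directly. The sketch below (the exact file names inside onnx_dir depend on the export) simply lists the exported graphs and prints their input signatures:

from pathlib import Path

import onnxruntime as ort

# Inspect every ONNX graph produced by model.to_onnx.
for onnx_file in sorted(Path(onnx_dir).glob("*.onnx")):
    session = ort.InferenceSession(str(onnx_file))
    inputs = [(inp.name, inp.shape) for inp in session.get_inputs()]
    print(onnx_file.name, inputs)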

All of these examples can also be found in the inference_example.ipynb notebook.


GigaAM-Emo: Emotion Recognition

GigaAM-Emo is a fine-tuned model for emotion recognition trained on the Dusha dataset. It significantly outperforms existing models on several metrics.

Performance Metrics

| Model | Crowd Unweighted Accuracy | Crowd Weighted Accuracy | Crowd Macro F1-score | Podcast Unweighted Accuracy | Podcast Weighted Accuracy | Podcast Macro F1-score |
|---|---|---|---|---|---|---|
| DUSHA baseline (MobileNetV2 + Self-Attention) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| АБК (TIM-Net) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |
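
For reference, accuracy and macro F1 on your own labels can be computed with scikit-learn. The sketch below is illustrative only; how the table's "unweighted" and "weighted" accuracy map onto overall versus class-balanced accuracy is an assumption, so both are printed:

from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels: replace with real references and GigaAM-Emo predictions.
y_true = ["neutral", "angry", "sad", "neutral", "positive"]
y_pred = ["neutral", "angry", "neutral", "neutral", "positive"]

print("overall accuracy:", accuracy_score(y_true, y_pred))
print("class-balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))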

Emotion Recognition Example (GigaAM-Emo)

from typing import Dict

import gigaam

model = gigaam.load_model('emo')
emotion2prob: Dict[str, float] = model.get_probs("example.wav")

print(", ".join([f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()]))

License

GigaAM's code and model weights are released under the MIT License.


Links