- A clearer and more detailed environment setup.
- Integrate the source code of HumanOmni into this project.
- Open-source a more detailed reproduction process.
- Open-source all the training data used.
- Inference for single-video and single-audio modality data.
- Results of the 7B version of the model.
We will complete these updates as soon as possible.
R1-Omni is the industry’s first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model. We focus on emotion recognition, a task where both the visual and audio modalities play crucial roles, to validate the potential of combining RLVR with Omni models. Our findings reveal several key insights:
- Enhanced Reasoning Capability: R1-Omni demonstrates superior reasoning abilities, enabling a clearer understanding of how visual and audio information contributes to emotion recognition.
- Improved Understanding Capability: Compared to SFT, RLVR significantly boosts performance on emotion recognition tasks.
- Stronger Generalization Capability: RLVR models exhibit markedly better generalization capabilities, particularly excelling in out-of-distribution scenarios.
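To make the RLVR idea concrete, below is a minimal sketch of what a verifiable reward for this task can look like: a format term for emitting the `<think>`/`<answer>` tags plus an accuracy term for matching the ground-truth label. The function name and exact regexes are our assumptions, not the reward implementation used to train R1-Omni.

```python
import re


def verifiable_reward(completion: str, gt_label: str) -> float:
    """Sketch of a verifiable reward: format reward + accuracy reward."""
    # Format reward: the completion must be exactly <think>...</think><answer>...</answer>.
    format_ok = re.fullmatch(
        r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", completion, flags=re.DOTALL
    )
    # Accuracy reward: the extracted answer must match the ground-truth emotion label.
    match = re.search(r"<answer>(.+?)</answer>", completion, flags=re.DOTALL)
    accuracy = float(bool(match) and match.group(1).strip().lower() == gt_label.lower())
    return float(bool(format_ok)) + accuracy


print(verifiable_reward(
    "<think>Wide eyes and a raised voice suggest anger.</think><answer>angry</answer>",
    "angry",
))  # -> 2.0
```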
We chose the open-source Omni model HumanOmni-0.5B as our base model. We have open-sourced the following: the base model HumanOmni-0.5B, the cold-start model EMER-SFT, the model MAFW-DFEW-SFT fine-tuned directly on the MAFW and DFEW training sets, and our final model R1-Omni.
Below is the performance (%) on emotion recognition datasets. We use symbols to indicate whether the data is in-distribution (⬤) or out-of-distribution (△).
| Method | DFEW (WAR) ⬤ | DFEW (UAR) ⬤ | MAFW (WAR) ⬤ | MAFW (UAR) ⬤ | RAVDESS (WAR) △ | RAVDESS (UAR) △ |
|---|---|---|---|---|---|---|
| HumanOmni-0.5B | 22.64 | 19.44 | 20.18 | 13.52 | 7.33 | 9.38 |
| EMER-SFT | 38.66 | 35.31 | 38.39 | 28.02 | 29.00 | 27.19 |
| MAFW-DFEW-SFT | 60.23 | 44.39 | 50.44 | 30.39 | 29.33 | 30.75 |
| R1-Omni | 65.83 | 56.27 | 57.68 | 40.04 | 43.00 | 44.69 |
- ⬤: Indicates in-distribution data (DFEW and MAFW).
- △: Indicates out-of-distribution data (RAVDESS).
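For reference, WAR (weighted average recall) is overall accuracy, while UAR (unweighted average recall) averages per-class recalls so that rare emotions count equally. A minimal sketch of both metrics (function name and toy labels are illustrative):

```python
from collections import Counter


def war_uar(y_true, y_pred):
    # WAR: overall accuracy (per-class recall weighted by class frequency).
    war = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # UAR: unweighted mean of per-class recalls.
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    uar = sum(hits[c] / n for c, n in totals.items()) / len(totals)
    return war, uar


print(war_uar(["angry", "happy", "sad", "happy"],
              ["angry", "happy", "happy", "happy"]))  # -> (0.75, 0.666...)
```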
10069.mp4
User: As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags.
HumanOmni-0.5B: fear
EMER-SFT: <think>In the video, a man in a brown jacket stands in front of a vibrant mural, his face showing clear signs of anger. His furrowed brows and open mouth express his dissatisfaction. From his expressions and vocal traits, it can be inferred that he is experiencing intense emotional turmoil. The audio content mentions "fear" and "lowering one's voice," suggesting that he is facing an urgent situation or expressing strong emotions during an emotionally tense moment. Overall, the analysis indicates that the man is experiencing a deep sense of fear and anxiety.</think>
<answer>fear</answer>
MAFW-DFEW-SFT: angry
R1-Omni: <think>In the video, a man in a brown jacket stands in front of a vibrant mural. He is wearing a pink shirt underneath his brown jacket, and his hair is dark and curly. His facial expression is complex, with wide eyes, slightly open mouth, raised eyebrows, and furrowed brows, revealing surprise and anger. Speech recognition technology suggests that his voice contains words like "you", "lower your voice", "freaking out", indicating strong emotions and agitation. Overall, he displays an emotional state of confusion, anger, and excitement.</think>
<answer>angry</answer>
GT: angry
14791.mp4
HumanOmni-0.5B: happy
EMER-SFT: <think>In the video, we see a woman in an indoor setting that appears to be a personal space or a study room. She is wearing a red headband and a beige blazer over a white top, with her hair neatly tied back. Her facial expression shows slight confusion and surprise, as if she has encountered something unexpected. As time goes on, her emotions seem to calm down, but the specific reasons for this are not explained. In the audio, there is a long silence between "but" and "it was," indicating hesitation and uncertainty. In the text, the subtitle reads: "But it was interesting." This sentence may express the confusion and surprise of the person. Based on the video clues of the woman's slightly confused and surprised expression, as well as her overall appearance and facial expression showing minor confusion and surprise, we can infer that this sentence may have a hesitant or uncertain tone. Additionally, the audio clue describing a long silence between "but" and "it was" also indicates some hesitation and uncertainty. Therefore, this sentence expresses the woman's interest and curiosity in something, while also showing a certain level of hesitation and uncertainty.</think>
<answer>surprise</answer>
MAFW-DFEW-SFT: happy
R1-Omni: <think>In the video, in the opening scene, we see a woman with her eyes slightly closed and mouth slowly opening as if she is laughing. Her facial expression appears somewhat joyful, which may indicate that she is experiencing some pleasant or amusing situation at that moment. In the audio, there are no pauses between sentences, they flow continuously, and the tone is light and cheerful. Combined with the text content, it can be felt that the character is in a very happy and positive emotional state. In the text, the subtitle reads: "It was interesting." This sentence may express the woman's satisfaction and curiosity towards something or someone. Based on the laughter and joyful emotion displayed by the woman in the video clues, as well as the description of continuous flowing, light and cheerful tone in the audio clues, we can infer that this sentence expresses the woman's interest and curiosity towards something. The joy in this sentence aligns with the overall display of happiness and positivity.</think>
<answer>happy</answer>
GT: happy
Our code is built on the R1-V framework. To set up the environment, please follow the installation instructions in the R1-V repository. Our environment uses: NVIDIA driver 535.54, torch 2.5.1+cu124, torchvision 0.20.1+cu124, torchaudio 2.5.1+cu124, transformers 4.49.0, and flash_attn 2.7.4.
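To quickly confirm the pinned versions are installed, a small sanity-check script (assuming the standard PyPI package names) might look like:

```python
# Check the installed versions against the ones pinned above.
import torch, torchvision, torchaudio, transformers

assert torch.__version__.startswith("2.5.1"), torch.__version__
assert torchvision.__version__.startswith("0.20.1"), torchvision.__version__
assert torchaudio.__version__.startswith("2.5.1"), torchaudio.__version__
assert transformers.__version__ == "4.49.0", transformers.__version__
print("CUDA available:", torch.cuda.is_available())
```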
Our inference code is based on the implementation from HumanOmni. To ensure the model runs inference smoothly, follow these steps:
- Download the required models: the R1-Omni model itself, along with the siglip-base-patch16-224, whisper-large-v3, and bert-base-uncased checkpoints it depends on.
- Update the configuration file:
  - In the directory where you downloaded the R1-Omni model, locate the config.json file.
  - Update the paths on line 23 and line 31 to point to the local folders where you saved the models.
  - Update the path on line 21 in inference.py to point to the local folder where you saved bert-base-uncased.
For example, if you saved the models to the following local paths:

```
/path/to/local/models/siglip-base-patch16-224
/path/to/local/models/whisper-large-v3
/path/to/local/models/bert_base_uncased
```
Update the relevant lines in config.json as follows:

```json
"mm_audio_tower": "/path/to/local/models/whisper-large-v3",
"mm_vision_tower": "/path/to/local/models/siglip-base-patch16-224"
```

Update the relevant line in inference.py as follows:

```python
bert_model = "/path/to/local/models/bert_base_uncased"
```
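If you prefer to patch config.json programmatically instead of by hand, a minimal sketch using the example paths above (the cfg_path value assumes you saved R1-Omni to ./R1-Omni-0.5B, as in the command below):

```python
import json

cfg_path = "./R1-Omni-0.5B/config.json"  # adjust to where you saved the model
with open(cfg_path) as f:
    cfg = json.load(f)

# Keys as shown in the config.json snippet above; values are the example paths.
cfg["mm_audio_tower"] = "/path/to/local/models/whisper-large-v3"
cfg["mm_vision_tower"] = "/path/to/local/models/siglip-base-patch16-224"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```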
We provide inference.py for single-video inference:
```bash
python inference.py --modal video_audio \
  --model_path ./R1-Omni-0.5B \
  --video_path video.mp4 \
  --instruct "As an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
```
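To run the same command over a folder of videos, a small wrapper along these lines works (the videos/ folder name is a placeholder; the CLI flags are exactly those shown above):

```python
import pathlib
import subprocess

PROMPT = (
    "As an emotional recognition expert; throughout the video, which emotion "
    "conveyed by the characters is the most obvious to you? Output the thinking "
    "process in <think> </think> and final emotion in <answer> </answer> tags."
)

# Invoke the single-video CLI once per .mp4 file in the folder.
for video in sorted(pathlib.Path("videos").glob("*.mp4")):
    subprocess.run(
        ["python", "inference.py",
         "--modal", "video_audio",
         "--model_path", "./R1-Omni-0.5B",
         "--video_path", str(video),
         "--instruct", PROMPT],
        check=True,
    )
```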
We initialize HumanOmni-0.5B by fine-tuning it on a combined dataset consisting of 232 samples from the Explainable Multimodal Emotion Reasoning (EMER) dataset and 348 samples from the HumanOmni dataset. An example JSON file of the training data:
```json
[
  {
    "video": "MER24/sample_00000967.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you? Output the thinking process in <think> </think> and final emotion in <answer> </answer> tags."
      },
      {
        "from": "gpt",
        "value": "<think>The video depicts a bright and tranquil indoor setting, where a man in a white Polo shirt stands by the window, engaged in a phone call. His furrowed brow and open mouth suggest he is experiencing tension and anxiety. According to the audio content of the video, his speech is fast-paced, and his tone is filled with confusion and stress. A comprehensive analysis reveals that the man is facing a moderate level of anxiety, closely linked to the challenging phone conversation he is having. Consequently, the entire emotional analysis report emphasizes his anxiety and nervousness in handling challenging situations.</think>\n<answer>anxious</answer>"
      }
    ]
  },
  ...
]
```
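Before training, it can be worth verifying that every cold-start sample follows the `<think>`/`<answer>` format shown above. A minimal sketch (the cold_start.json filename is a placeholder):

```python
import json
import re

# A gpt reply must contain a <think> block followed by an <answer> block.
TAG_FORMAT = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)

with open("cold_start.json") as f:  # placeholder filename
    samples = json.load(f)

for sample in samples:
    reply = next(t["value"] for t in sample["conversations"] if t["from"] == "gpt")
    assert TAG_FORMAT.search(reply), f"bad format in {sample['video']}"
print(f"{len(samples)} samples OK")
```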
All of the cold-start data will be released as soon as possible.
In this stage, we utilize the training sets from the MAFW and DFEW (https://dfew-dataset.github.io/) datasets, comprising a total of 15,306 video samples. An example JSON file of the training data:
```json
[
  {
    "video": "DFEW/videos/1.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?"
      },
      {
        "from": "gpt",
        "value": "sad"
      }
    ]
  },
  ...
]
```
To ensure transparency and accountability, please review our Usage Policy.
If you find our work helpful, please consider citing it:
```bibtex
@article{zhao2025r1omniexplainableomnimultimodalemotion,
  title={R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning},
  author={Jiaxing Zhao and Xihan Wei and Liefeng Bo},
  journal={arXiv preprint arXiv:2503.05379},
  year={2025}
}
```