🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper
The first survey for Multimodal Large Language Models (MLLMs). ✨
Welcome to add WeChat ID (wmd_ustc) to join our MLLM communication group! 🌟
🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
[2024.06.03] We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 🌟
It applies to both image MLLMs, i.e., generalizing to multiple images, and video MLLMs. Our leaderboard involves SOTA models such as Gemini 1.5 Pro, GPT-4o, GPT-4V, LLaVA-NeXT-Video, InternVL-Chat-V1.5, and Qwen-VL-Max. 🌟
It includes short- (< 2min), medium- (4min~15min), and long-term (30min~60min) videos, ranging from 11 seconds to 1 hour. ✨
All data are newly collected and annotated by humans, not from any existing video dataset. ✨
🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper | ✒️ Citation
The first comprehensive evaluation benchmark for MLLMs. Now the leaderboards include 50+ advanced models, such as Qwen-VL-Max, Gemini Pro, and GPT-4V. ✨
If you want to add your model to our leaderboards, please feel free to email [email protected]. We will update the leaderboards in time. ✨
Download MME 🌟🌟
The benchmark dataset is collected by Xiamen University for academic research only. You can email [email protected] to obtain the dataset, according to the following requirement.
Requirement: A real-name system is encouraged for better academic communication. Your email suffix needs to match your affiliation (e.g., an email like [email protected] for Xiamen University); otherwise, you need to explain why. Please include the information below when sending your application email.
Name: (tell us who you are.)
Affiliation: (the name/url of your university or company)
Job Title: (e.g., professor, PhD student, researcher)
Email: (your email address)
How to use: (only for non-commercial use)
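For reference, a minimal application email might look like the sketch below. All values are hypothetical placeholders; substitute your own details before sending:

```text
Subject: MME Benchmark Dataset Application

(All values below are placeholders -- replace them with your own information.)

Name: Zhang San
Affiliation: Xiamen University (https://www.xmu.edu.cn)
Job Title: PhD student
Email: [email protected]
How to use: Academic research on MLLM evaluation only; no commercial use.
```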
🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | Source Code
The first work to correct hallucinations in MLLMs. ✨
🔥🔥🔥 A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Paper
The first technical report comparing Gemini with GPT-4V, totaling 128 pages and completed within one week of the Gemini API opening. 🌟
📑 If you find our projects helpful to your research, please consider citing:
@article{fu2023mme,
title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and others},
journal={arXiv preprint arXiv:2306.13394},
year={2023}
}
@article{fu2024video,
title={Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis},
author={Fu, Chaoyou and Dai, Yuhan and Luo, Yongdong and Li, Lei and Ren, Shuhuai and Zhang, Renrui and Wang, Zihan and Zhou, Chenyu and Shen, Yunhang and Zhang, Mengdan and others},
journal={arXiv preprint arXiv:2405.21075},
year={2024}
}
@article{yin2023survey,
title={A Survey on Multimodal Large Language Models},
author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
journal={arXiv preprint arXiv:2306.13549},
year={2023}
}
@article{yin2023woodpecker,
title={Woodpecker: Hallucination Correction for Multimodal Large Language Models},
author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Xu, Tong and Wang, Hao and Sui, Dianbo and Shen, Yunhang and Li, Ke and Sun, Xing and Chen, Enhong},
journal={arXiv preprint arXiv:2310.16045},
year={2023}
}
@article{fu2023challenger,
title={A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise},
author={Fu, Chaoyou and Zhang, Renrui and Lin, Haojia and Wang, Zihan and Gao, Timin and Luo, Yongdong and Huang, Yubo and Zhang, Zhengye and Qiu, Longtian and Ye, Gaoxiang and others},
journal={arXiv preprint arXiv:2312.12436},
year={2023}
}
Multimodal RLHF
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | Github | - |
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
Others
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | arXiv | 2024-02-03 | Github | - |
VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | Github | Local Demo |
Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | Github | - |
Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | Github | - |
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Github | - |
Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | Github | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | Github | Demo |
Datasets of Multimodal Instruction Tuning
Name | Paper | Link | Notes |
---|---|---|---|
VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | A vision-and-language caption and instruction dataset generated by GPT-4V |
IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction for "I Know" hallucination |
CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |
LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions, designed to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns |
StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data |
M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data |
SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. |
ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation |
LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | Visual instruction tuning dataset for addressing hallucination issue |
Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A dataset of 100K high-quality video instructions |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning |
M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset for improving the model's usability and generation fluency |
LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
Datasets of In-Context Learning
Name | Paper | Link | Notes |
---|---|---|---|
MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |
Datasets of Multimodal Chain-of-Thought
Name | Paper | Link | Notes |
---|---|---|---|
EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for explainable emotion reasoning task |
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
Datasets of Multimodal RLHF
Name | Paper | Link | Notes |
---|---|---|---|
VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
Benchmarks for Evaluation
Name | Paper | Link | Notes |
---|---|---|---|
Video-MME | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | Link | A comprehensive evaluation benchmark of Multi-modal LLMs in video analysis |
VL-ICL Bench | VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning | Link | A benchmark for M-ICL evaluation, covering a wide spectrum of tasks |
TempCompass | TempCompass: Do Video LLMs Really Understand Videos? | Link | A benchmark to evaluate the temporal perception ability of Video LLMs |
CoBSAT | Can MLLMs Perform Text-to-Image In-Context Learning? | Link | A benchmark for text-to-image ICL |
VQAv2-IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | A benchmark for assessing "I Know" visual hallucination |
Math-Vision | Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | Link | A diverse mathematical reasoning benchmark |
CMMMU | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | Link | A Chinese benchmark involving reasoning and knowledge across multiple disciplines |
MMCBench | Benchmarking Large Multimodal Models against Common Corruptions | Link | A benchmark for examining self-consistency under common corruptions |
MMVP | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Link | A benchmark for assessing visual capabilities |
TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | Link | A video instruction-tuning dataset with timestamp annotations, covering diverse time-sensitive video-understanding tasks. |
ViP-Bench | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A benchmark for visual prompts |
M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A 3D-centric benchmark |
Video-Bench | Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | Link | A benchmark for video-MLLM evaluation |
Charting-New-Territories | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Link | A benchmark for evaluating geographic and geospatial capabilities |
MLLM-Bench | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | Link | GPT-4V evaluation with per-sample criteria |
BenchLMM | BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Link | A benchmark for assessment of the robustness against different image styles |
MMC-Benchmark | MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | Link | A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts |
MVBench | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Link | A comprehensive multimodal benchmark for video understanding |
Bingo | Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | Link | A benchmark for hallucination evaluation that focuses on two common types |
MagnifierBench | OtterHD: A High-Resolution Multi-modality Model | Link | A benchmark designed to probe models' ability of fine-grained perception |
HallusionBench | HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | Link | An image-context reasoning benchmark for evaluation of hallucination |
PCA-EVAL | Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond | Link | A benchmark for evaluating multi-domain embodied decision-making. |
MMHal-Bench | Aligning Large Multimodal Models with Factually Augmented RLHF | Link | A benchmark for hallucination evaluation |
MathVista | MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | Link | A benchmark that challenges both visual and math reasoning capabilities |
SparklesEval | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria. |
ISEKAI | Link-Context Learning for Multimodal LLMs | Link | A benchmark comprising exclusively of unseen generated image-label pairs designed for link-context learning |
M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
I4 | Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | Link | A benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions |
SciGraphQA | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | Link | A large-scale chart-visual question-answering dataset |
MM-Vet | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | Link | An evaluation benchmark that examines large multimodal models on complicated multimodal tasks |
SEED-Bench | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Link | A benchmark for evaluation of generative comprehension in MLLMs |
MMBench | MMBench: Is Your Multi-modal Model an All-around Player? | Link | A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models |
Lynx | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Link | A comprehensive evaluation benchmark including both image and video tasks |
GAVIE | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A benchmark to evaluate the hallucination and instruction following ability |
MME | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Link | A comprehensive MLLM Evaluation benchmark |
LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Link | An evaluation platform for MLLMs |
LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks |
M3Exam | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | Link | A multilingual, multimodal, multilevel benchmark for evaluating MLLM |
OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Link | Dataset for evaluation on multiple capabilities |
Others
Name | Paper | Link | Notes |
---|---|---|---|
IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | Multimodal dialogue dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject instructions |
Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually pictured multimodal fine-tuning dataset for learning to reject instructions |
InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset that focuses on recognizing the Visual Entity on the Wikipedia, from images in the wild |