- Lead of NJU-MiG (Multimodal Intelligence Group, Nanjing University), VITA, MME, and Awesome-MLLM
- https://bradyfu.github.io/
Stars
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension"
This is the official implementation of our paper "QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension"
Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
The Next Step Forward in Multimodal LLM Alignment
MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"
✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs
✨✨ [ICLR 2025] MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Awesome OVD-OVS - A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Simple PyTorch implementation of "Libra: Building Decoupled Vision System on Large Language Models" (accepted by ICML 2024)
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
VMamba: Visual State Space Models, code is based on mamba
[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
[CVPR 2024] 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
[CVPR 2024] GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models