Stars
✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
[ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
📖 This is a repository for organizing papers, code, and other resources related to unified multimodal models.
[CSUR] A Survey on Video Diffusion Models
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Easily compute CLIP embeddings and build a CLIP retrieval system with them
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception
✨✨Latest Advances on Multimodal Large Language Models
This is a third-party implementation of the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
A pure and clear PyTorch distributed training framework.