Stars
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
Mixture-of-Experts for Large Vision-Language Models
[CVPR 2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
VPTQ: A flexible and extreme low-bit quantization algorithm
Qwen2.5 is the large language model series developed by the Qwen team at Alibaba Cloud.
SEED-Voken: A Series of Powerful Visual Tokenizers
VMamba: Visual State Space Models. The code is based on Mamba.
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
fyabc / vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows this list).
Qwen2.5-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
For optimization algorithm research and development.
An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
SALMONN: Speech Audio Language Music Open Neural Network
AI2-THOR Data Collection Tool Based On Keyboard Interaction
openvla / openvla
Forked from TRI-ML/prismatic-vlms. OpenVLA: An open-source vision-language-action model for robotic manipulation.
A flexible and efficient codebase for training visually-conditioned language models (VLMs)
ustcwhy / unilm
Forked from microsoft/unilm. Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities.
Code for the paper "M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models".
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
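As an illustration of the vLLM fork listed above, here is a minimal offline-inference sketch using vLLM's public Python API. The model ID Qwen/Qwen2.5-7B-Instruct is an assumption chosen to match the Qwen2.5 entry in this list, not something the fork prescribes; any Hugging Face model supported by vLLM would work.

```python
# Minimal vLLM offline-inference sketch (assumes `pip install vllm` and a GPU).
# The model ID below is an assumption; substitute any vLLM-supported model.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() batches prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["Summarize mixture-of-experts in one sentence."], params)
print(outputs[0].outputs[0].text)
```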