Stars
[ICLR 2025] γ-MoD: Mixture-of-Depth Adaptation for Multimodal Large Language Models
📖 A repository for organizing papers, code, and other resources related to unified multimodal models.
[AAAI-25] Cobra: Extending Mamba to Multi-modal Large Language Model for Efficient Inference
🔥 SpatialVLA: a spatially enhanced vision-language-action model trained on 1.1 million real robot episodes.
🔥 CVPR 2025 & ICLR 2025 Embodied AI paper list and resources. Star ⭐ the repo and follow me if you like what you see 🤩.
Official PyTorch implementation for "Large Language Diffusion Models"
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
✨ First Open-Source R1-like Video-LLM [2025/02/18]
Code release for "PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop" (arXiv 2025)
[EMNLP 2023] Context Compression for Auto-regressive Transformers with Sentinel Tokens
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
🔥 CVPR 2025 Multimodal Large Language Models Paper List
The Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
[NeurIPS 2023] ImageReward: Learning and Evaluating Human Preferences for Text-to-image Generation
[CVPR 2025] Official implementation of "MangaNinja: Line Art Colorization with Precise Reference Following"
Boosting the Class-Incremental Learning in 3D Point Clouds via Zero-Collection-Cost Basic Shape Pre-Training
[CVPR 2025] Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models"
Improving Video Generation with Human Feedback
Integrate the DeepSeek API into popular software
Official repository of "Visual-RFT: Visual Reinforcement Fine-Tuning"
[Technical Report 2023] PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction
Code for "Diffusion Model Alignment Using Direct Preference Optimization"
Video Diffusion Alignment via Reward Gradients. We improve a variety of video diffusion models such as VideoCrafter, OpenSora, ModelScope and StableVideoDiffusion by finetuning them using various r…
Explore the Multimodal "Aha Moment" on a 2B Model
Official code repository of "CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction"
moojink/openvla-oft (forked from openvla/openvla): Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success