Starred repositories
🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".
Low-level locomotion policy training in Isaac Lab
VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Code for "DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT"
Liquid: Language Models are Scalable Multi-modal Generators
[CVPR'2024] "SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution"
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
CoRL2024 | Hint-AD: Holistically Aligned Interpretability for End-to-End Autonomous Driving
OpenDILab Decision AI Engine. The most comprehensive reinforcement learning framework.
A family of versatile and state-of-the-art video tokenizers.
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Pytorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI
A simple testbed for robotics manipulation policies
Codebase for Aria - an Open Multimodal Native MoE
[ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation
OpenEMMA, a permissively licensed open source reproduction of Waymo’s EMMA model.
The code for the paper "Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding".
Code for "Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning"
[RSS 2024] Code for "Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals" for CALVIN experiments with pre-trained weights
Reimplementation of GR-1, a generalized policy for robotics manipulation.
A generative world for general-purpose robotics & embodied AI learning.
A summary of key papers and blog posts for learning about diffusion models, plus a detailed list of published diffusion-based robotics papers.
[NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models"