Starred repositories
Framework to reduce autotune overhead to zero for well-known deployments.
RUCAIBox / Virgo
Forked from Richar-Du/Virgo. Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness
The official implementation of Tensor ProducT ATTenTion Transformer (T6)
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
GRAPE: Guided-Reinforced Vision-Language-Action Preference Optimization
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
A comprehensive video analysis tool that combines computer vision, audio transcription, and natural language processing to generate detailed descriptions of video content. This tool extracts key fr…
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
3D Occupancy Prediction Benchmark in Autonomous Driving
[ECCV 2024] Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
The public-facing frontend, www.svlsimulator.com
GaussianAD: Gaussian-Centric End-to-End Autonomous Driving
The official repo for the paper "In-Context Imitation Learning via Next-Token Prediction"
LLaVA-Mini is a unified large multimodal model (LMM) that supports efficient understanding of images, high-resolution images, and videos.
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
A collection of reference AI microservices and workflows for Jetson Platform Services
Cosmos is a world model development platform consisting of world foundation models, tokenizers, and a video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…
A Telegram bot to recommend arXiv papers
FastBee is a simple, easy-to-use open-source IoT platform that can be used to build IoT platforms and for secondary development and learning. Suitable for smart homes, smart offices, smart communities, agricultural monitoring, water monitoring, industrial control, and more.
VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving
🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".
Low-level locomotion policy training in Isaac Lab
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.