Stars
"Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models", Hanwen Liang*, Yuyang Yin*, Dejia Xu, Hanxue Liang, Zhangyang Wang, Konstantinos N. Plataniotis, Yao Zhao, …
Memory-optimized training scripts for video models based on Diffusers
Official implementation of the paper "Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content".
🔥🔥 First-ever hour-scale video understanding models
Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
📺 An End-to-End Solution for High-Resolution and Long Video Generation Based on Transformer Diffusion
[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Official repository for the paper "Can LVLMs Obtain a Driver’s License? A Benchmark Towards Reliable AGI for Autonomous Driving"
Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
An open source implementation of CLIP.
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
[MM2024, oral] "Self-Supervised Visual Preference Alignment" https://arxiv.org/abs/2404.10501
Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation
[IROS 2024] HPHS: Hierarchical Planning based on Hybrid Frontier Sampling for Unknown Environments Exploration
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.