Stars
M3GPT: An advanced multimodal, multitask framework for motion comprehension and generation.
Code for "Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers" (NeurIPS 2024)
SOTA Re-identification Methods and Toolbox
[ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
XBM: Cross-Batch Memory for Embedding Learning
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Ego4D Goal-Step: Toward Hierarchical Understanding of Procedural Activities (NeurIPS 2023)
Fine-Grained Egocentric Hand-Object Segmentation, ECCV 2022
【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Official PyTorch code of "Grounded Question-Answering in Long Egocentric Videos", accepted by CVPR 2024.
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
GRiT: A Generative Region-to-text Transformer for Object Understanding (https://arxiv.org/abs/2212.00280)
[BMVC2022, IJCV2023, Best Student Paper, Spotlight] Official codes for the paper "In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation".
[CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering
Pandora: Towards General World Model with Natural Language Actions and Video States
[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
This is the official code for MIME: Human-Aware 3D Scene Generation (CVPR2023)
🔥🔥🔥 A curated list of papers on LLM-based multimodal generation (image, video, 3D and audio).
Resolving 3D Human Pose Ambiguities with 3D Scene Constraints https://prox.is.tue.mpg.de