Starred repositories
[AAAI 25] Official Implementation for "E-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment"
HunyuanVideo: A Systematic Framework For Large Video Generation Model
PhysGame Benchmark for Physical Commonsense Evaluation in Gameplay Videos
Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"
Code implementation of paper "MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval"
[ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"
[ECCV 2024] Official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.
Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
The official repository for the ICLR 2024 paper "FROSTER: Frozen CLIP is a Strong Teacher for Open-Vocabulary Action Recognition"
Multi-granularity Correspondence Learning from Long-term Noisy Videos [ICLR 2024, Oral]
[ICML 2024] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.
This is the official GitHub repository for the ECCV 2022 paper "On the Versatile Uses of Partial Distance Correlation in Deep Learning"
[arXiv22] Disentangled Representation Learning for Text-Video Retrieval
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
Official PyTorch code for "ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations"
[ACL 2024] TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Code and models for the paper "One Transformer Fits All Distributions in Multi-Modal Diffusion"
[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
Align 3D Point Cloud with Multi-modalities for Large Language Models
[NeurIPS 2022 Spotlight] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
The official GitHub page for the survey paper "A Survey of Large Language Models".
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
[ICCV2023] DETR Doesn’t Need Multi-Scale or Locality Design
Universal and Transferable Attacks on Aligned Language Models