Stars
A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
This is the official repository for M2UGen
Official implementation of the paper "Acoustic Music Understanding Model with Large-Scale Self-supervised Training".
MU-LLaMA: Music Understanding Large Language Model
Scripts to optimize NJU EasyConnect client routing rules.
Machine Translation: Foundations and Models (《机器翻译:基础与模型》), by Xiao Tong and Zhu Jingbo
code for paper "Modality-Aware Triplet Hard Mining for Zero-shot Sketch-Based Image Retrieval"
[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Clash for Windows for Mac: tutorial and configuration guide for Clash for Mac
[ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"
Official repository of paper titled "How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs".
Weakly Supervised Video Moment Localisation with Contrastive Negative Sample Mining
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
[Preprint] TRACE: Temporal Grounding Video LLM via Causal Event Modeling
[AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
A paper list of recent works on token compression for ViTs and VLMs
This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"
Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GPT4V-level open-source multi-modal model based on Llama3-8B
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
The official repository of "Video assistant towards large language model makes everything easy"
Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as MiniGPT-4, StableLM, and MOSS.
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding