Stars
This is the official implementation of our paper "QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension"
[ICLR 2025] CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
Official implementation for "Seagull: No-reference Image Quality Assessment for Regions of Interest via Visual-Language Instruction Tuning"
Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"
Official code for paper: [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.
[ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
Official implementation of paper 'Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models'.
[arXiv] PDF-Wukong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
[ICLR 2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
pkunlp-icler / MIC
Forked from HaozheZhao/MIC
MMICL, a state-of-the-art VLM with in-context learning (ICL) ability, PKU
Code for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA, VL-Vicuna.
[ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
PyTorch Implementation of "Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models"
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
A PyTorch implementation of the paper "All are Worth Words: A ViT Backbone for Diffusion Models".