LLM/LVM

  • (arXiv 2023.07) INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers, [Paper]
  • (arXiv 2023.11) NExT-Chat: An LMM for Chat, Detection and Segmentation, [Paper], [Code]
  • (arXiv 2023.11) u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model, [Paper]
  • (arXiv 2023.11) Towards Open-Ended Visual Recognition with Large Language Model, [Paper], [Code]
  • (arXiv 2023.11) Stable Segment Anything Model, [Paper], [Code]
  • (arXiv 2023.11) Adapter is All You Need for Tuning Visual Tasks, [Paper], [Code]
  • (arXiv 2023.11) LLaFS: When Large-Language Models Meet Few-Shot Segmentation, [Paper], [Code]
  • (arXiv 2023.11) Efficient In-Context Learning in Vision-Language Models for Egocentric Videos, [Paper], [Code]
  • (arXiv 2023.11) Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model, [Paper]
  • (arXiv 2023.11) PoseGPT: Chatting about 3D Human Pose, [Paper], [Code]
  • (arXiv 2023.11) InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation, [Paper], [Code]
  • (arXiv 2023.11) Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models, [Paper], [Code]
  • (arXiv 2023.11) Contrastive Vision-Language Alignment Makes Efficient Instruction Learner, [Paper], [Code]
  • (arXiv 2023.12) Bootstrapping SparseFormers from Vision Foundation Models, [Paper], [Code]
  • (arXiv 2023.12) IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks, [Paper], [Code]
  • (arXiv 2023.12) Segment and Caption Anything, [Paper], [Code]
  • (arXiv 2023.12) EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything, [Paper]
  • (arXiv 2023.12) Segment Any 3D Gaussians, [Paper], [Code]
  • (arXiv 2023.12) Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts, [Paper]
  • (arXiv 2023.12) PixelLM: Pixel Reasoning with Large Multimodal Model, [Paper], [Code]
  • (arXiv 2023.12) Foundation Model Assisted Weakly Supervised Semantic Segmentation, [Paper]
  • (arXiv 2023.12) AI-SAM: Automatic and Interactive Segment Anything Model, [Paper], [Code]
  • (arXiv 2023.12) MobileSAMv2: Faster Segment Anything to Everything, [Paper], [Code]
  • (arXiv 2023.12) MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices, [Paper], [Code]
  • (arXiv 2024.01) One for All: Toward Unified Foundation Models for Earth Vision, [Paper]
  • (arXiv 2024.01) RAP-SAM: Towards Real-Time All-Purpose Segment Anything, [Paper], [Code]
  • (arXiv 2024.02) MobileVLM V2: Faster and Stronger Baseline for Vision Language Model, [Paper], [Code]
  • (arXiv 2024.02) Data-efficient Large Vision Models through Sequential Autoregression, [Paper], [Code]
  • (arXiv 2024.02) EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss, [Paper], [Code]
  • (arXiv 2024.02) Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models, [Paper]
  • (arXiv 2024.02) PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter, [Paper]
  • (arXiv 2024.02) GROUNDHOG: Grounding Large Language Models to Holistic Segmentation, [Paper]
  • (arXiv 2024.03) VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection with Vision-Language Model, [Paper]
  • (arXiv 2024.03) Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models, [Paper], [Code]
  • (arXiv 2024.04) Adapting LLaMA Decoder to Vision Transformer, [Paper], [Code]
  • (arXiv 2024.04) Surgical-DeSAM: Decoupling SAM for Instrument Segmentation in Robotic Surgery, [Paper], [Code]
  • (arXiv 2024.04) Dense Connector for MLLMs, [Paper], [Code]
  • (arXiv 2024.05) Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR, [Paper]
  • (arXiv 2024.05) Matryoshka Query Transformer for Large Vision-Language Models, [Paper], [Code]
  • (arXiv 2024.07) A Single Transformer for Scalable Vision-Language Modeling, [Paper], [Code]
  • (arXiv 2024.07) X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs, [Paper]
  • (arXiv 2024.07) EVLM: An Efficient Vision-Language Model for Visual Understanding, [Paper]
  • (arXiv 2024.07) Hierarchical Generation for Coherent Long Visual Sequences, [Paper], [Code]
  • (arXiv 2024.08) ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers, [Paper]
  • (arXiv 2024.09) VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition, [Paper]
  • (arXiv 2024.09) Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling, [Paper], [Code]
  • (arXiv 2024.09) TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models, [Paper], [Code]
  • (arXiv 2024.10) OMCAT: Omni Context Aware Transformer, [Paper], [Code]
  • (arXiv 2024.11) MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding, [Paper]
  • (arXiv 2024.12) TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models, [Paper]
  • (arXiv 2025.01) 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer, [Paper]