Stars
Official implementation of 🛸 "UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface"
DINO-X: The World's Top-Performing Vision Model for Open-World Object Detection and Understanding
Labeling tool with SAM (Segment Anything Model); supports SAM, SAM2, sam-hq, MobileSAM, EdgeSAM, etc. An interactive semi-automatic image annotation tool.
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Quick scripts to calculate CLIP text-image similarity
Evaluating text-to-image/video/3D models with VQAScore
Official Repo for Open-Reasoner-Zero
Solve Visual Understanding with Reinforced VLMs
Extend OpenRLHF to support LMM RL training, reproducing DeepSeek-R1 on multimodal tasks.
✨First Open-Source R1-like Video-LLM [2025/02/18]
Janus-Series: Unified Multimodal Understanding and Generation Models
A fork to add multimodal model training to open-r1
Fully open reproduction of DeepSeek-R1
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Official code implementation of Slow Perception: Let's Perceive Geometric Figures Step-by-step
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
DVIS: Decoupled Video Instance Segmentation Framework
RedisPOI crawls POIs (points of interest) in a specified region and stores them in a Redis database. RedisPOI also implements basic query retrieval and performance-measurement features.
[NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models"
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context
[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
[ECCV 2024] Official PyTorch implementation of Mixture of All Intelligence (MoAI), improving performance on numerous zero-shot vision-language tasks.