Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
[CVPR'25] Official Implementations for Paper - MagicQuill: An Intelligent Interactive Image Editing System
Reasoning in LLMs: Papers and Resources, including Chain-of-Thought, OpenAI o1, and DeepSeek-R1 🍓
SpatialLM: Large Language Model for Spatial Understanding
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Agent S: an open agentic framework that uses computers like a human
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
🚀🚀🚀 A collection of awesome public YOLO object detection projects and related object detection datasets.
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
🚀🚀🚀 A collection of awesome public projects about Large Language Models (LLM), Vision Language Models (VLM), Vision Language Action (VLA), AI Generated Content (AIGC), and the related datasets and applications.
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
OpenEMMA: a permissively licensed, open-source "reproduction" of Waymo's EMMA model.
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization