Starred repositories
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
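A minimal usage sketch of EasyOCR's Python API (the language list and the image filename are illustrative assumptions, not part of the repo description):

```python
import easyocr

# Load detection + recognition models for English and Arabic (downloads weights on first run).
reader = easyocr.Reader(['en', 'ar'])

# readtext returns a list of (bounding_box, text, confidence) tuples for a hypothetical image file.
for bbox, text, confidence in reader.readtext('sign.png'):
    print(text, confidence)
```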
Python sample code for robotics algorithms.
A generative world for general-purpose robotics & embodied AI learning.
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
An open-source tool-augmented conversational language model from Fudan University
This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.
An open source implementation of CLIP.
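A minimal zero-shot image-text matching sketch with open_clip (the model name, pretrained tag, image path, and prompts are illustrative assumptions):

```python
import torch
import open_clip
from PIL import Image

# Build a CLIP model with its matching image preprocessing and text tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open('cat.jpg')).unsqueeze(0)          # hypothetical local image
text = tokenizer(['a photo of a cat', 'a photo of a dog'])      # candidate captions

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings and score caption similarity with a softmax over cosine similarities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability of each caption matching the image
```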
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
A collaboration-friendly studio for NeRFs
A collection of libraries to optimise AI model performance
Large World Model -- Modeling Text and Video with Millions of Tokens of Context
Cosmos is a world model development platform that consists of world foundation models, tokenizers, and a video processing pipeline to accelerate the development of Physical AI in robotics & AV labs. C…
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
Enjoy the magic of Diffusion models!
[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud.
Use PEFT or Full-parameter to finetune 400+ LLMs (Qwen2.5, InternLM3, GLM4, Llama3.3, Mistral, Yi1.5, Baichuan2, DeepSeek3, ...) and 150+ MLLMs (Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, Inter…
g1: Using Llama-3.1 70b on Groq to create o1-like reasoning chains
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
OpenDILab Decision AI Engine. The Most Comprehensive Reinforcement Learning Framework.
FFCV: Fast Forward Computer Vision (and other ML workloads!)
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
Lumina-T2X is a unified framework for Text to Any Modality Generation
PyTorch pre-trained model for real-time interest point detection, description, and sparse tracking (https://arxiv.org/abs/1712.07629)
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
VideoSys: An easy and efficient system for video generation
LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning