✨✨✨ A curated collection of Multimodal Large Language Model (MLLM) resources 📚🔍, covering datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, foundation models, and more. 🌟🔥
✨✨✨ This list will be updated continuously to track the latest MLLM developments. 🔄🚀
✨✨✨ We are also preparing a survey paper on MLLM, which will be released soon. 🎉📑
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | Github | - |
Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | Github | - |
Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | Demo |
Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Transfer Visual Prompt Generator across LLMs | arXiv | 2023-05-02 | Github | Demo |
GPT-4 Technical Report | arXiv | 2023-03-15 | - | - |
PaLM-E: An Embodied Multimodal Language Model | arXiv | 2023-03-06 | - | Demo |
Prismer: A Vision-Language Model with An Ensemble of Experts | arXiv | 2023-03-04 | Github | Demo |
Language Is Not All You Need: Aligning Perception with Language Models | arXiv | 2023-02-27 | Github | - |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | arXiv | 2023-01-30 | Github | Demo |
VIMA: General Robot Manipulation with Multimodal Prompts | ICML | 2022-10-06 | Github | Local Demo |
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | NeurIPS | 2022-06-17 | Github | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Coming soon | - |
Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | Github | - |
On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
Evaluating Object Hallucination in Large Vision-Language Models | arXiv | 2023-05-17 | Github | - |
Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | Github | Demo |
- [Andrej Karpathy] State of GPT video
- [Hyung Won Chung] Instruction finetuning and RLHF lecture Youtube
- [Jason Wei] Scaling, emergence, and reasoning in large language models Slides
- [Susan Zhang] Open Pretrained Transformers Youtube
- [Ameet Deshpande] How Does ChatGPT Work? Slides
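- [Yao Fu] Pretraining, Instruction Tuning, Alignment, Specialization: On the Source of Large Language Model Abilities Bilibili
- [Hung-yi Lee] An Analysis of How ChatGPT Works Youtube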
- [Jay Mody] GPT in 60 Lines of NumPy Link
- [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models Link
- [NeurIPS 2022] Foundational Robustness of Foundation Models Link
- [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. Video|Code
- [DAIR.AI] Prompt Engineering Guide Link
- [邱锡鹏] Capability Analysis and Applications of Large Language Models Slides | Video
- [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers Link
- [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) Link
- [HuggingFace] What Makes a Dialog Agent Useful? Link
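- [张俊林] The Road to AGI: Technical Essentials of Large Language Models (LLMs) Link
- [大师兄] ChatGPT/InstructGPT Explained in Detail Link
- [HeptaAI] The Core of ChatGPT: InstructGPT, PPO Reinforcement Learning Based on Feedback Instructions Link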
- [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources Link
- [Stephen Wolfram] What Is ChatGPT Doing … and Why Does It Work? Link
- [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? Link
- [Hung-yi Lee] How ChatGPT Was (Probably) Made: The Socialization Process of GPT Video
- [Keyvan Kambakhsh] Pure Rust implementation of a minimal Generative Pretrained Transformer code
- LLaMA2 - The successor to LLaMA, a large language model released in 7-, 13-, and 70-billion-parameter sizes. LLaMA2 HF - TheBloke/Llama-2-13B-GPTQ
- LLaMA - A foundational, 65-billion-parameter large language model. LLaMA.cpp Lit-LLaMA
- Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca.cpp Alpaca-LoRA
- Flan-Alpaca - Instruction Tuning from Humans and Machines.
- Baize - Baize is an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself.
- Cabrita - A Portuguese instruction-finetuned LLaMA.
- Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
- Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
- Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
- GPTQ-for-LLaMA - 4-bit quantization of LLaMA using GPTQ.
- GPT4All - Demo, data, and code to train open-source assistant-style large language models based on GPT-J and LLaMA.
- Koala - A Dialogue Model for Academic Research
- BELLE - Be Everyone's Large Language model Engine
- StackLLaMA - A hands-on guide to train LLaMA with RLHF.
- RedPajama - An Open Source Recipe to Reproduce LLaMA training dataset.
- Chimera - Latin Phoenix.
- WizardLM|WizardCoder - Family of instruction-following LLMs powered by Evol-Instruct: WizardLM, WizardCoder.
- CaMA - a Chinese-English Bilingual LLaMA Model.
- Orca - Microsoft's finetuned LLaMA model that reportedly matches GPT-3.5, finetuned on 5M examples generated with ChatGPT and GPT-4.
- BayLing - an English/Chinese LLM equipped with advanced language alignment, showing superior capability in English/Chinese generation, instruction following and multi-turn interaction.
- UltraLM - Large-scale, Informative, and Diverse Multi-round Chat Models.
- Guanaco - QLoRA tuned LLaMA
- BLOOM - BigScience Large Open-science Open-access Multilingual Language Model BLOOM-LoRA
- BLOOMZ&mT0 - a family of models capable of following human instructions in dozens of languages zero-shot.
- Phoenix
- T5 - Text-to-Text Transfer Transformer
- T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization
- OPT - Open Pre-trained Transformer Language Models.
- UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.
- GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective that can be finetuned on various natural language understanding and generation tasks.
- ChatGLM-6B - ChatGLM-6B is an open-source dialogue language model supporting both Chinese and English, based on the General Language Model (GLM) architecture, with 6.2 billion parameters.
- ChatGLM2-6B - An Open Bilingual Chat LLM.
- RWKV - Parallelizable RNN with Transformer-level LLM Performance.
- ChatRWKV - ChatRWKV is like ChatGPT but powered by my RWKV (100% RNN) language model.
- StableLM - Stability AI Language Models.
- YaLM - a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.
- GPT-Neo - An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library.
- GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile.
- Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT.
- Pythia - Interpreting Autoregressive Transformers Across Time and Scale
- Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
- OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.
- Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.
- GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
- GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
- Palmyra - Palmyra Base was primarily pre-trained with English text.
- Camel - a state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.
- h2oGPT
- PanGu-α - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, MindSpore Team and Peng Cheng Laboratory.
- MOSS - MOSS is an open-source dialogue language model that supports both Chinese and English as well as a variety of plugins.
- Open-Assistant - a project meant to give everyone access to a great chat based large language model.
- HuggingChat - Powered by Open Assistant's latest model – the best open source chat model right now and @huggingface Inference API.
- StarCoder - Hugging Face LLM for Code
- MPT-7B - Open LLM for commercial use by MosaicML
- Falcon - A foundational large language model (LLM) from TII with 40 billion parameters, trained on one trillion tokens.
- XGen - Salesforce open-source LLMs with 8k sequence length.
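- baichuan-7B - baichuan-7B is an open-source, commercially usable large-scale pretrained language model developed by Baichuan Intelligent Technology.
- Aquila - The Wudao Aquila language model is the first open-source large language model with bilingual Chinese-English knowledge that supports a commercial license and meets Chinese domestic data-compliance requirements.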
- DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- Megatron-DeepSpeed - DeepSpeed version of NVIDIA's Megatron-LM that adds additional support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others.
- FairScale - FairScale is a PyTorch extension library for high performance and large scale training.
- Megatron-LM - Ongoing research training transformer models at scale.
- Colossal-AI - Making large AI models cheaper, faster, and more accessible.
- BMTrain - Efficient Training for Big Models.
- Mesh Tensorflow - Mesh TensorFlow: Model Parallelism Made Easier.
- maxtext - A simple, performant and scalable Jax LLM!
- Alpa - Alpa is a system for training and serving large-scale neural networks.
- GPT-NeoX - An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.
- FastChat - A distributed multi-model LLM serving system with web UI and OpenAI-compatible RESTful APIs.
- SkyPilot - Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution, all with a simple interface.
- vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs.
- Text Generation Inference - A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power LLMs api-inference widgets.
- Haystack - an open-source NLP framework that allows you to use LLMs and transformer-based models from Hugging Face, OpenAI and Cohere to interact with your own data.
- Sidekick - Data integration platform for LLMs.
- LangChain - Building applications with LLMs through composability.
- wechat-chatgpt - Use ChatGPT on WeChat via wechaty.
- promptfoo - Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality.
- Agenta - Easily build, version, evaluate and deploy your LLM-powered apps.
- Embedchain - Framework to create ChatGPT-like bots over your dataset.
- [DeepLearning.AI] ChatGPT Prompt Engineering for Developers Homepage
- [Princeton] Understanding Large Language Models Homepage
- [OpenBMB] Open Course on Large Models Homepage
- [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF Slides
- [Stanford] CS324-Large Language Models Homepage
- [Stanford] CS25-Transformers United V2 Homepage
- [Stanford Webinar] GPT-3 & Beyond Video
- [李沐] InstructGPT Paper Close Reading Bilibili Youtube
- [陳縕儂] OpenAI InstructGPT: Learning from Human Feedback, the Predecessor of ChatGPT Youtube
- [李沐] HELM: Holistic Evaluation of Language Models Bilibili
- [李沐] GPT, GPT-2, GPT-3 Paper Close Reading Bilibili Youtube
- [Aston Zhang] Chain of Thought Paper Bilibili Youtube
- [MIT] Introduction to Data-Centric AI Homepage
- AutoGPT - an experimental open-source application showcasing the capabilities of the GPT-4 language model.
- OpenAGI - When LLM Meets Domain Experts.
- HuggingGPT - Solving AI Tasks with ChatGPT and its Friends in HuggingFace.
- EasyEdit - An easy-to-use framework to edit large language models.
- chatgpt-shroud - A Chrome extension for OpenAI's ChatGPT, enhancing user privacy by enabling easy hiding and unhiding of chat history. Ideal for privacy during screen shares.
- Arize-Phoenix - Open-source tool for ML observability that runs in your notebook environment. Monitor and fine tune LLM, CV and Tabular Models.
- Emergent Mind - The latest AI news, curated & explained by GPT-4.
- ShareGPT - Share your wildest ChatGPT conversations with one click.
- Major LLMs + Data Availability
- 500+ Best AI Tools
- Cohere Summarize Beta - Introducing Cohere Summarize Beta: A New Endpoint for Text Summarization
- chatgpt-wrapper - ChatGPT Wrapper is an open-source unofficial Python API and CLI that lets you interact with ChatGPT.
- Open-evals - A framework that extends OpenAI's Evals to different language models.
- Cursor - Write, edit, and chat about your code with a powerful AI.
Name | Paper | Link | Notes |
---|---|---|---|
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Coming soon | Multimodal in-context instruction dataset |
Name | Paper | Link | Notes |
---|---|---|---|
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |