Next Token Prediction Towards Multimodal Intelligence


Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across modalities, achieving considerable success in both understanding and generation. This repo is a comprehensive collection of papers and repositories for the survey: "Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey".
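
Concretely, NTP trains a model to predict token t+1 from the tokens up to t under a cross-entropy loss, regardless of which modality the tokens were drawn from. A minimal NumPy sketch of that loss (illustrative only; real models produce the logits with a transformer over a learned multimodal vocabulary):

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: (T, V) array of unnormalized scores, one row per position.
    tokens: (T,) int array of token ids (any modality, once tokenized).
    """
    # Shift: position t predicts token t+1, so the last position has no target.
    pred, target = logits[:-1], tokens[1:]
    # Log-softmax over the vocabulary, numerically stabilized.
    z = pred - pred.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target)), target].mean()

# Toy example: a vocabulary of 5 token ids shared across modalities.
rng = np.random.default_rng(0)
tokens = np.array([0, 3, 1, 4, 2])
logits = rng.normal(size=(5, 5))
loss = next_token_loss(logits, tokens)
```

The same objective covers understanding (predict text tokens given image tokens) and generation (predict image or audio tokens given text tokens).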

Authors: Liang Chen1, Zekun Wang2, Shuhuai Ren1, Lei Li3, Haozhe Zhao1, Yunshui Li4, Zefan Cai1, Hongcheng Guo2, Lei Zhang4, Yizhe Xiong5, Yichi Zhang1, Ruoyu Wu1, Qingxiu Dong1, Ge Zhang6, Jian Yang8, Lingwei Meng7, Shujie Hu7, Yulong Chen9, Junyang Lin8, Shuai Bai8, Andreas Vlachos9, Xu Tan10, Minjia Zhang11, Wen Xiao10, Aaron Yee12,13, Tianyu Liu8, Baobao Chang1

1Peking University 2Beihang University 3University of Hong Kong 4Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences 5Tsinghua University 6M-A-P 7The Chinese University of Hong Kong 8Alibaba Group 9University of Cambridge 10Microsoft Research 11UIUC 12Humanify Inc. 13Zhejiang University


🔥🔥 News

  • 2024.12.30: We release the survey on arXiv and this repo on GitHub! Feel free to open pull requests to add the latest work to the seasonal updates of the survey ~

📑 Table of Contents

  1. Awesome Multimodal Tokenizers
  2. Awesome MMNTP Models
  3. Awesome Multimodal Prompt Engineering
  4. Citation

Awesome Multimodal Tokenizers

Vision Tokenizer

| Paper | Time | Modality | Tokenization Type | GitHub |
| --- | --- | --- | --- | --- |
| TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | 2024 | Image | Discrete | Star |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Qwen2-VL ViT) | 2024 | Image, Video | Continuous | Star |
| Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | 2024 | Image | Discrete | Star |
| SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | 2023 | Image | Discrete | - |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | 2023 | Image | Discrete | Star |
| Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | 2023 | Image, Video | Discrete | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | 2023 | Image | Continuous | Star |
| Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | 2023 | Image | Continuous | - |
| Planting a SEED of Vision in Large Language Model | 2023 | Image | Discrete | Star |
| SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding | 2023 | Image | Continuous | - |
| EVA-CLIP: Improved Training Techniques for CLIP at Scale | 2023 | Image | Continuous | Star |
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | 2023 | Image | Continuous | Star |
| A Unified View of Masked Image Modeling | 2023 | Image | Continuous | Star |
| BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers | 2022 | Image | Continuous | Star |
| MAGVIT: Masked Generative Video Transformer | 2022 | Video | Discrete | Star |
| Phenaki: Variable Length Video Generation From Open Domain Textual Description | 2022 | Video | Discrete | - |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | 2022 | Image | Continuous | - |
| Autoregressive Image Generation using Residual Quantization | 2022 | Image | Discrete | - |
| ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning | 2022 | Image | Continuous | Star |
| FlexiViT: One Model for All Patch Sizes | 2022 | Image | Continuous | Star |
| Vector-quantized Image Modeling with Improved VQGAN | 2021 | Image | Discrete | - |
| ViViT: A Video Vision Transformer | 2021 | Video | Continuous | Star |
| BEiT: BERT Pre-Training of Image Transformers | 2021 | Image | Continuous | Star |
| High-Performance Large-Scale Image Recognition Without Normalization | 2021 | Image | Continuous | Star |
| Learning Transferable Visual Models From Natural Language Supervision (CLIP) | 2021 | Image | Continuous | Star |
| Taming Transformers for High-Resolution Image Synthesis | 2020 | Image | Discrete | Star |
| Generating Diverse High-Fidelity Images with VQ-VAE-2 | 2019 | Image | Discrete | Star |
| Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification | 2017 | Video | Continuous | Star |
| Neural Discrete Representation Learning (VQ-VAE) | 2017 | Image, Video, Audio | Discrete | - |
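
The Tokenization Type column splits tokenizers into discrete (VQ-VAE/VQGAN-style codebook lookup, yielding integer token ids an NTP model can predict directly) and continuous (encoder features used as-is). A toy NumPy sketch of the discrete case, with an invented codebook; real tokenizers learn the codebook jointly with an encoder and decoder:

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map continuous patch features to discrete codebook indices.

    features: (N, D) continuous encoder outputs (one row per image patch).
    codebook: (K, D) learned code vectors.
    Returns (indices, quantized): the token ids and their code vectors.
    """
    # Squared Euclidean distance between every feature and every code.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d.argmin(axis=1)           # the discrete "image tokens"
    return indices, codebook[indices]    # the decoder sees the quantized vectors
```

The returned indices are what an autoregressive model predicts one at a time; the quantized vectors are what a decoder reconstructs pixels from.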

Audio Tokenizer

| Paper | Time | Modality | Tokenization Type | GitHub |
| --- | --- | --- | --- | --- |
| Moshi: a speech-text foundation model for real-time dialogue (Mimi) | 2024 | Audio | Discrete | Star |
| WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | 2024 | Audio | Discrete | Star |
| SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound | 2024 | Audio | Discrete | Star |
| NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models (FACodec) | 2024 | Audio | Discrete | - |
| SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models | 2023 | Audio | Discrete | Star |
| HiFi-Codec: Group-residual Vector Quantization for High Fidelity Audio Codec | 2023 | Audio | Discrete | Star |
| LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models | 2023 | Audio | Discrete | - |
| High-Fidelity Audio Compression with Improved RVQGAN (DAC) | 2023 | Audio | Discrete | Star |
| Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages | 2023 | Audio | Continuous | - |
| High Fidelity Neural Audio Compression (EnCodec) | 2022 | Audio | Discrete | Star |
| CLAP: Learning Audio Concepts From Natural Language Supervision | 2022 | Audio | Continuous | Star |
| Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) | 2022 | Audio | Continuous | Star |
| data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | 2022 | Audio | Continuous | Star |
| WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | 2021 | Audio | Continuous | Star |
| HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | 2021 | Audio | Continuous | Star |
| SoundStream: An End-to-End Neural Audio Codec | 2021 | Audio | Discrete | - |
| wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | 2020 | Audio | Continuous | Star |
| vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | 2019 | Audio | Discrete | Star |
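
Many of the discrete codecs above (SoundStream, EnCodec, DAC, HiFi-Codec) build on residual vector quantization (RVQ): each codebook stage quantizes the residual the previous stages failed to capture, so later stages add progressively finer acoustic detail. A toy sketch of the idea, not any specific codec's implementation:

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Encode one frame embedding with a stack of codebooks.

    x: (D,) continuous audio frame embedding.
    codebooks: list of (K, D) arrays; stage i quantizes the residual left
    by stages 0..i-1. Returns the per-stage token ids and the reconstruction.
    """
    residual, ids, recon = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:
        # Pick the code closest to what is still unexplained.
        d = ((residual[None, :] - cb) ** 2).sum(-1)
        i = int(d.argmin())
        ids.append(i)
        recon = recon + cb[i]
        residual = residual - cb[i]
    return ids, recon
```

Dropping later codebooks trades reconstruction quality for bitrate, which is why such codecs can operate at several quality levels from one model.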

Awesome MMNTP Models

Vision Model

| Paper | Time | Modality | Model Type | Task | GitHub |
| --- | --- | --- | --- | --- | --- |
| Multimodal Latent Language Modeling with Next-Token Diffusion | 2024 | Image | Unified | Image2Text, Text2Image | Star |
| Randomized Autoregressive Visual Generation (RAR) | 2024 | Image | Unified | Text2Image | Star |
| Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training (Mono-InternVL) | 2024 | Image | Unified | Image2Text | - |
| A Single Transformer for Scalable Vision-Language Modeling (SOLO) | 2024 | Image | Unified | Image2Text | - |
| Unveiling Encoder-Free Vision-Language Models (EVE) | 2024 | Image | Unified | Image2Text | Star |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Qwen2-VL) | 2024 | Image | Compositional | Image2Text | Star |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (Janus) | 2024 | Image | Compositional | Image2Text, Text2Image | Star |
| Emu3: Next-Token Prediction is All You Need (Emu3) | 2024 | Image, Video | Unified | Image2Text, Text2Image, Text2Video | Star |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (Show-o) | 2024 | Image, Video | Unified | Image2Text, Text2Image, Text2Video | Star |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (VILA-U) | 2024 | Image, Video | Unified | Image2Text, Text2Image, Text2Video | Star |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Transfusion) | 2024 | Image | Unified | Image2Text, Text2Image | - |
| Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (Fluid) | 2024 | Image | Unified | Text2Image | - |
| Autoregressive Image Generation without Vector Quantization (MAR) | 2024 | Image | Unified | Text2Image | Star |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models (Chameleon) | 2024 | Image | Unified | Image2Text, Text2Image | Star |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (Mini-Gemini) | 2024 | Image | Compositional | Image2Text, Text2Image | Star |
| A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Fine-grained Image Generation (DnD-Transformer) | 2024 | Image | Unified | Text2Image | Star |
| Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR) | 2024 | Image | Unified | Text2Image | Star |
| Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (LlamaGen) | 2024 | Image | Unified | Text2Image | Star |
| MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens (MiniGPT-5) | 2023 | Image | Compositional | Image2Text, Text2Image | Star |
| BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing (BLIP-Diffusion) | 2023 | Image | Compositional | Text2Image | Star |
| Kosmos-G: Generating Images in Context with Multimodal Large Language Models (Kosmos-G) | 2023 | Image | Compositional | Text2Image | Star |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | 2023 | Image | Compositional | Image2Text | Star |
| Kosmos-2.5: A Multimodal Literate Model | 2023 | Image | Compositional | Image2Text | Star |
| Kosmos-E: Learning to Follow Instruction for Robotic Grasping | 2023 | Image | Compositional | Image2Text | Star |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (LaVIT) | 2023 | Image | Compositional | Image2Text, Text2Image | Star |
| Generative Multimodal Models are In-Context Learners (Emu2) | 2023 | Image | Compositional | Image2Text, Text2Image | Star |
| Generative Pretraining in Multimodality (Emu1) | 2023 | Image | Compositional | Image2Text, Text2Image | Star |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (Unified-IO 2) | 2023 | Image, Video, Audio | Compositional | Image2Text, Text2Image, Audio2Text, Text2Audio, Text2Video | Star |
| Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) | 2023 | Image | Compositional | Image2Text | Star |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (InternVL) | 2023 | Image | Compositional | Image2Text | Star |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Qwen-VL) | 2023 | Image | Compositional | Image2Text | Star |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (Molmo) | 2024 | Image | Compositional | Image2Text | - |
| Fuyu-8B: A Multimodal Architecture for AI Agents (Fuyu) | 2023 | Image | Unified | Image2Text | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (BLIP-2) | 2023 | Image | Compositional | Image2Text | Star |
| Visual Instruction Tuning (LLaVA) | 2023 | Image | Compositional | Image2Text | Star |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (MiniGPT-4) | 2023 | Image | Compositional | Image2Text | - |
| Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (Unified-IO) | 2022 | Image | Compositional | Image2Text, Text2Image | - |
| Zero-Shot Text-to-Image Generation (DALL-E) | 2021 | Image | Unified | Text2Image | - |
| Language Models are General-Purpose Interfaces | 2022 | Image | Compositional | Image2Text | Star |
| Flamingo: a Visual Language Model for Few-Shot Learning (Flamingo) | 2022 | Image | Compositional | Image2Text | - |
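
The Model Type column distinguishes unified models (one transformer over a single, typically discrete token stream for all modalities) from compositional models (a separate pretrained vision encoder whose continuous features are projected into the LM's embedding space). A schematic NumPy sketch of the two input pipelines; all sizes, names, and the offset-vocabulary scheme are invented for illustration:

```python
import numpy as np

# Toy sizes, purely illustrative; real vocabularies and widths are far larger.
TEXT_VOCAB, D_VIS, D_LM = 100, 4, 8

def unified_sequence(text_ids, image_ids):
    """Unified pipeline: everything becomes one discrete token stream,
    here by offsetting image token ids into a shared vocabulary."""
    return np.concatenate([text_ids, image_ids + TEXT_VOCAB])

def compositional_inputs(text_ids, patch_feats, text_embed, projector):
    """Compositional pipeline: text ids are embedded, while a vision
    encoder's continuous patch features are projected into the LM's
    embedding space and spliced into the input sequence."""
    return np.concatenate([text_embed[text_ids], patch_feats @ projector])
```

Unified models can generate image tokens by NTP directly; compositional models need an extra head or external decoder (e.g. a diffusion model) to produce images.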

Audio Model

| Paper | Time | Modality | Model Type | Task | GitHub |
| --- | --- | --- | --- | --- | --- |
| VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks (VoxtLM) | 2024 | Audio | Unified | A2T, T2A, A2A, T2T | - |
| Moshi: a speech-text foundation model for real-time dialogue (Moshi) | 2024 | Audio | Unified | A2A | Star |
| Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (Mini-Omni) | 2024 | Audio | Compositional | A2A | Star |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models (LLaMA-Omni) | 2024 | Audio | Compositional | A2A | Star |
| SpeechVerse: A Large-scale Generalizable Audio Language Model (SpeechVerse) | 2024 | Audio | Compositional | A2T | - |
| Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (Audio Flamingo) | 2024 | Audio | Compositional | A2T | Star |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model (WavLLM) | 2024 | Audio | Compositional | A2T | Star |
| MELLE: Autoregressive Speech Synthesis without Vector Quantization (MELLE) | 2024 | Audio | Unified | T2A | - |
| Seed-TTS: A Family of High-Quality Versatile Speech Generation Models (Seed-TTS) | 2024 | Audio | Compositional | T2A | - |
| FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications (FireRedTTS) | 2024 | Audio | Compositional | T2A | Star |
| CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens (CosyVoice) | 2024 | Audio | Compositional | T2A | Star |
| UniAudio: An Audio Foundation Model Toward Universal Audio Generation (UniAudio) | 2024 | Audio | Unified | T2A, A2A | Star |
| BASE TTS: Lessons from Building a Billion-Parameter Text-to-Speech Model on 100K Hours of Data (BASE TTS) | 2024 | Audio | Unified | T2A | - |
| VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (VoiceCraft) | 2024 | Audio | Unified | T2A | Star |
| SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities (SpeechGPT) | 2023 | Audio | Unified | A2T, T2A, A2A, T2T | Star |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT (LauraGPT) | 2023 | Audio | Unified | A2T, T2A, A2A, T2T | - |
| VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation (VioLA) | 2023 | Audio | Compositional | A2T, T2A, A2A, T2T | - |
| AudioPaLM: A Large Language Model That Can Speak and Listen (AudioPaLM) | 2023 | Audio | Compositional | A2T, T2A, A2A | - |
| Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models (Qwen-Audio) | 2023 | Audio | Compositional | A2T | Star |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models (SALMONN) | 2023 | Audio | Compositional | A2T | Star |
| On Decoder-Only Architecture for Speech-to-Text and Large Language Model Integration (SpeechLLaMA) | 2023 | Audio | Compositional | A2T | - |
| Listen, Think, and Understand (LTU) | 2023 | Audio | Compositional | A2T | Star |
| Pengi: An Audio Language Model for Audio Tasks (Pengi) | 2023 | Audio | Compositional | A2T | Star |
| Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning (MU-LLaMA) | 2023 | Audio | Compositional | A2T | - |
| SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts (SpeechGen) | 2023 | Audio | Unified | T2A | Star |
| Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) | 2023 | Audio | Compositional | T2A | Star |
| Simple and Controllable Music Generation (MusicGen) | 2023 | Audio | Unified | T2A | Star |
| Make-A-Voice: Unified Voice Synthesis With Discrete Representation (Make-A-Voice) | 2023 | Audio | Compositional | T2A | - |
| Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision (SPEAR-TTS) | 2023 | Audio | Compositional | T2A | - |
| AudioGen: Textually Guided Audio Generation (AudioGen) | 2022 | Audio | Unified | T2A | - |
| AudioLM: a Language Modeling Approach to Audio Generation (AudioLM) | 2022 | Audio | Compositional | A2A | - |
| Generative Spoken Language Modeling from Raw Audio (GSLM) | 2021 | Audio | Unified | A2A | - |

Awesome Multimodal Prompt Engineering

Multimodal ICL

| Paper | Time | Modality | GitHub |
| --- | --- | --- | --- |
| Multimodal Few-Shot Learning with Frozen Language Models (Frozen) | 2021 | Image | - |
| Flamingo: a Visual Language Model for Few-Shot Learning (Flamingo) | 2022 | Image | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning (MMICL) | 2023 | Image | Star |
| Efficient In-Context Learning in Vision-Language Models for Egocentric Videos (EILeV) | 2023 | Image | Star |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models (OpenFlamingo) | 2023 | Image | Star |
| Link-Context Learning for Multimodal LLMs (LCL) | 2023 | Image | Star |
| Med-Flamingo: a Multimodal Medical Few-shot Learner (Med-Flamingo) | 2023 | Image | Star |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning (MIMIC-IT) | 2023 | Image | Star |
| Sequential Modeling Enables Scalable Learning for Large Vision Models (LVM) | 2023 | Image | Star |
| World Model on Million-Length Video And Language With Blockwise RingAttention (LWM) | 2023 | Image, Video | Star |
| Exploring Diverse In-Context Configurations for Image Captioning (Yang et al.) | 2024 | Image | Star |
| Visual In-Context Learning for Large Vision-Language Models (VisualICL) | 2024 | Image | - |
| Many-Shot In-Context Learning in Multimodal Foundation Models (Many-Shot ICL) | 2024 | Image | Star |
| Can MLLMs Perform Text-to-Image In-Context Learning? (CoBSAT) | 2024 | Image | Star |
| Video In-context Learning (Video ICL) | 2024 | Video | Star |
| Generative Pretraining in Multimodality (Emu) | 2024 | Image, Video | Star |
| Generative Multimodal Models are In-Context Learners (Emu2) | 2024 | Image, Video | Star |
| Towards More Unified In-context Visual Understanding (Sheng et al.) | 2024 | Image | - |
| Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) | 2023 | Audio | Star |
| MELLE: Autoregressive Speech Synthesis without Vector Quantization (MELLE) | 2024 | Audio | - |
| Seed-TTS: A Family of High-Quality Versatile Speech Generation Models (Seed-TTS) | 2024 | Audio | - |
| Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (Audio Flamingo) | 2024 | Audio | Star |
| Moshi: a speech-text foundation model for real-time dialogue (Moshi) | 2024 | Audio | Star |
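
Multimodal in-context learning, as pioneered by Frozen and Flamingo, interleaves demonstration pairs before the query so the model completes the pattern via next-token prediction, with no weight updates. A schematic sketch of the prompt layout, with string placeholders standing in for images:

```python
def multimodal_icl_prompt(demos, query_image):
    """Build a Flamingo-style few-shot prompt: interleaved (image, caption)
    demonstrations followed by the query image, whose caption the model is
    expected to complete. Images are placeholder strings here; a real model
    would splice in image tokens or features at these positions."""
    parts = []
    for image, caption in demos:
        parts += [image, caption]
    parts.append(query_image)
    return parts
```

The same layout generalizes to audio (e.g. VALL-E's speaker prompt) and video demonstrations.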

Multimodal CoT

| Paper | Time | Modality | GitHub |
| --- | --- | --- | --- |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model (WavLLM) | 2024 | Audio | Star |
| SpeechVerse: A Large-scale Generalizable Audio Language Model (SpeechVerse) | 2024 | Audio | - |
| CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought | 2024 | Audio | Star |
| Chain-of-Thought Prompting for Speech Translation | 2024 | Audio | - |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | 2024 | Video | Star |
| VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool | 2024 | Video | - |
| Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning | 2024 | Image | Star |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | 2024 | Image | Star |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | 2023 | Image | Star |
| V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs | 2023 | Image | Star |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | 2023 | Image | Star |
| Visual Chain-of-Thought Diffusion Models | 2023 | Image | - |
| Multimodal Chain-of-Thought Reasoning in Language Models | 2023 | Image | Star |

Citation

If you find our work helpful, please cite the paper :)

@misc{chen2024tokenpredictionmultimodalintelligence,
      title={Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey}, 
      author={Liang Chen and Zekun Wang and Shuhuai Ren and Lei Li and Haozhe Zhao and Yunshui Li and Zefan Cai and Hongcheng Guo and Lei Zhang and Yizhe Xiong and Yichi Zhang and Ruoyu Wu and Qingxiu Dong and Ge Zhang and Jian Yang and Lingwei Meng and Shujie Hu and Yulong Chen and Junyang Lin and Shuai Bai and Andreas Vlachos and Xu Tan and Minjia Zhang and Wen Xiao and Aaron Yee and Tianyu Liu and Baobao Chang},
      year={2024},
      eprint={2412.18619},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.18619}, 
}
