Awesome-Visual-Autoregressive-Model

🌟This repository is still being updated. Please stay tuned.

👉If you find mistakes or overlooked papers, please open an issue or a pull request.

Contents:

Title	Venue	Links
Autoregressive Pretraining with Mamba in Vision	ICLR2025	Paper | Code
Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens	ICLR2025	Paper | Code
ControlAR: Controllable Image Generation with Autoregressive Models	ICLR2025	Paper | Code
ImageFolder: Autoregressive Image Generation with Folded Tokens	ICLR2025	Paper | Code
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation	ICLR2025	Paper | Code
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step	Arxiv2025	Paper | Code
EditAR: Unified Conditional Generation with Autoregressive Models	Arxiv2025	Paper | Code
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction	NeurIPS2024	Paper | Code
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis	Arxiv2024	Paper | Code
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation	Arxiv2024	Paper | Code
STAR: Scale-wise Text-to-image generation via Auto-Regressive representations	Arxiv2024	Paper | Code
VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling	Arxiv2024	Paper | Code
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer	Arxiv2024	Paper | Code
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation	Arxiv2024	Paper | Code

Title	Venue	Links
ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation	ICLR2025	Paper | Code
Autoregressive Video Generation without Vector Quantization	ICLR2025	Paper | Code
Autoregressive Transformers are Zero-Shot Video Imitators	ICLR2025	Paper | Code
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior	ICLR2025	Paper | Code
Taming Teacher Forcing for Masked Autoregressive Video Generation	Arxiv2025	Paper | Code
GameFactory: Creating New Games with Generative Interactive Videos	Arxiv2025	Paper | Code
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation	Arxiv2025	Paper | Code
An Empirical Study of Autoregressive Pre-training from Videos	Arxiv2025	Paper | Code
AR4D: Autoregressive 4D Generation from Monocular Videos	Arxiv2025	Paper | Code
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	ICML2024	Paper | Code
ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models	CVPRW2024	Paper | Code
MarDini: Masked Auto-Regressive Diffusion for Video Generation at Scale	Arxiv2024	Paper | Code
Loong: Generating Minute-level Long Videos with Autoregressive Language Models	Arxiv2024	Paper | Code
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models	Arxiv2024	Paper | Code

Title	Venue	Links
MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers	ICLR2025	Paper | Code
DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control	ICLR2025	Paper | Code
General Point Model Pretraining with Autoencoding and Autoregressive	CVPR2024	Paper | Code
Bidirectional Autoregressive Diffusion Model for Dance Generation	CVPR2024	Paper | Code
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model	ECCV2024	Paper | Code
BAMM: Bidirectional Autoregressive Motion Model	ECCV2024	Paper | Code
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE	Arxiv2024	Paper | Code
3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation	Arxiv2024	Paper | Code

Title	Venue	Links
JetFormer: An autoregressive generative model of raw images and text	ICLR2025	Paper | Code
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation	ICLR2025	Paper | Code
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model	Arxiv2025	Paper | Code
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities	CVPR2024	Paper | Code
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action	CVPR2024	Paper | Code
VideoPoet: A Large Language Model for Zero-Shot Video Generation	ICML2024	Paper | Code
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing	NeurIPS2024	Paper | Code
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation	NeurIPS2024	Paper | Code
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers	Arxiv2024	Paper | Code
Emu3: Next-Token Prediction is All You Need	Arxiv2024	Paper | Code

Title	Venue	Links
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding	ICLR2025	Paper | Code
Next Patch Prediction for Autoregressive Visual Generation	Arxiv2025	Paper | Code
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation	ICLR2024	Paper | Code
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling	Arxiv2024	Paper | Code
LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization	Arxiv2024	Paper | Code
CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient	Arxiv2024	Paper | Code
XQ-GAN🚀: An Open-source Image Tokenization Framework for Autoregressive Generation	Arxiv2024	Paper | Code
TokenFlow🚀: Unified Image Tokenizer for Multimodal Understanding and Generation	Arxiv2024	Paper | Code
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis	Arxiv2024	Paper | Code
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching	Arxiv2024	Paper | Code
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching	Arxiv2024	Paper | Code
Next Token Prediction Towards Multimodal Intelligence	Arxiv2024	Paper | Code
Parallelized Autoregressive Visual Generation	Arxiv2024	Paper | Code

Title	Venue	Links
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression	Arxiv2025	Paper | Code
FAST: Efficient Action Tokenization for Vision-Language-Action Models	Arxiv2025	Paper | Code
DeTrack: In-model Latent Denoising Learning for Visual Object Tracking	Arxiv2025	Paper | Code
Less is More: Token Context-aware Learning for Object Tracking	AAAI2025	Paper | Code
DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion	CVPR2024	Paper | Code
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction	Arxiv2024	Paper | Code
Varformer: Adapting VAR’s Generative Prior for Image Restoration	Arxiv2024	Paper | Code
Scalable Autoregressive Monocular Depth Estimation	Arxiv2024	Paper | Code