Awesome-Visual-Autoregressive-Model

🌟 This repository is still being updated; please stay tuned.

👉 If you find mistakes or overlooked papers, please open an issue or a pull request.

Content:

Image Generation
Video Generation
3D Generation
Multimodal Generation
Understanding or Optimization
Others

Image Generation

Title	Venue	Links
Autoregressive Pretraining with Mamba in Vision	ICLR2025	Paper | Code
Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens	ICLR2025	Paper | Code
ControlAR: Controllable Image Generation with Autoregressive Models	ICLR2025	Paper | Code
ImageFolder: Autoregressive Image Generation with Folded Tokens	ICLR2025	Paper | Code
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation	ICLR2025	Paper | Code
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step	Arxiv2025	Paper | Code
EditAR: Unified Conditional Generation with Autoregressive Models	Arxiv2025	Paper | Code
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction	NeurIPS2024	Paper | Code
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis	Arxiv2024	Paper | Code
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation	Arxiv2024	Paper | Code
STAR: Scale-wise Text-to-image generation via Auto-Regressive representations	Arxiv2024	Paper | Code
VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling	Arxiv2024	Paper | Code
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer	Arxiv2024	Paper | Code
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation	Arxiv2024	Paper | Code

Video Generation

Title	Venue	Links
ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation	ICLR2025	Paper | Code
Autoregressive Video Generation without Vector Quantization	ICLR2025	Paper | Code
Autoregressive Transformers are Zero-Shot Video Imitators	ICLR2025	Paper | Code
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior	ICLR2025	Paper | Code
Taming Teacher Forcing for Masked Autoregressive Video Generation	Arxiv2025	Paper | Code
GameFactory: Creating New Games with Generative Interactive Videos	Arxiv2025	Paper | Code
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation	Arxiv2025	Paper | Code
An Empirical Study of Autoregressive Pre-training from Videos	Arxiv2025	Paper | Code
AR4D: Autoregressive 4D Generation from Monocular Videos	Arxiv2025	Paper | Code
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	ICML2024	Paper | Code
ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models	CVPRW2024	Paper | Code
MarDini: Masked Auto-Regressive Diffusion for Video Generation at Scale	Arxiv2024	Paper | Code
Loong: Generating Minute-level Long Videos with Autoregressive Language Models	Arxiv2024	Paper | Code
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models	Arxiv2024	Paper | Code

3D Generation

Title	Venue	Links
MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers	ICLR2025	Paper | Code
DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control	ICLR2025	Paper | Code
General Point Model Pretraining with Autoencoding and Autoregressive	CVPR2024	Paper | Code
Bidirectional Autoregressive Diffusion Model for Dance Generation	CVPR2024	Paper | Code
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model	ECCV2024	Paper | Code
BAMM: Bidirectional Autoregressive Motion Model	ECCV2024	Paper | Code
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE	Arxiv2024	Paper | Code
3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation	Arxiv2024	Paper | Code

Multimodal Generation

Title	Venue	Links
JetFormer: An autoregressive generative model of raw images and text	ICLR2025	Paper | Code
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation	ICLR2025	Paper | Code
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model	Arxiv2025	Paper | Code
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities	CVPR2024	Paper | Code
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action	CVPR2024	Paper | Code
VideoPoet: A Large Language Model for Zero-Shot Video Generation	ICML2024	Paper | Code
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing	NeurIPS2024	Paper | Code
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation	NeurIPS2024	Paper | Code
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers	Arxiv2024	Paper | Code
Emu3: Next-Token Prediction is All You Need	Arxiv2024	Paper | Code

Understanding or Optimization

Title	Venue	Links
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding	ICLR2025	Paper | Code
Next Patch Prediction for Autoregressive Visual Generation	Arxiv2025	Paper | Code
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation	ICLR2024	Paper | Code
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling	Arxiv2024	Paper | Code
LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization	Arxiv2024	Paper | Code
CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient	Arxiv2024	Paper | Code
XQ-GAN🚀: An Open-source Image Tokenization Framework for Autoregressive Generation	Arxiv2024	Paper | Code
TokenFlow🚀: Unified Image Tokenizer for Multimodal Understanding and Generation	Arxiv2024	Paper | Code
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis	Arxiv2024	Paper | Code
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching	Arxiv2024	Paper | Code
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching	Arxiv2024	Paper | Code
Next Token Prediction Towards Multimodal Intelligence	Arxiv2024	Paper | Code
Parallelized Autoregressive Visual Generation	Arxiv2024	Paper | Code

Others

Title	Venue	Links
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression	Arxiv2025	Paper | Code
FAST: Efficient Action Tokenization for Vision-Language-Action Models	Arxiv2025	Paper | Code
DeTrack: In-model Latent Denoising Learning for Visual Object Tracking	Arxiv2025	Paper | Code
Less is More: Token Context-aware Learning for Object Tracking	AAAI2025	Paper | Code
DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion	CVPR2024	Paper | Code
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction	Arxiv2024	Paper | Code
Varformer: Adapting VAR’s Generative Prior for Image Restoration	Arxiv2024	Paper | Code
Scalable Autoregressive Monocular Depth Estimation	Arxiv2024	Paper | Code

ZNan-Chen/Awesome-Visual-Autoregressive-Model