Awesome-Visual-Autoregressive-Model
🌟 This repository is still being updated; please stay tuned.
👉 If you find mistakes or overlooked papers, please open an issue or a pull request.

| Title | Venue | Links |
| --- | --- | --- |
| Autoregressive Pretraining with Mamba in Vision | ICLR2025 | Paper / Code |
| Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens | ICLR2025 | Paper / Code |
| ControlAR: Controllable Image Generation with Autoregressive Models | ICLR2025 | Paper / Code |
| ImageFolder: Autoregressive Image Generation with Folded Tokens | ICLR2025 | Paper / Code |
| A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation | ICLR2025 | Paper / Code |
| Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | Arxiv2025 | Paper / Code |
| EditAR: Unified Conditional Generation with Autoregressive Models | Arxiv2025 | Paper / Code |
| Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | NeurIPS2024 | Paper / Code |
| Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis | Arxiv2024 | Paper / Code |
| Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | Arxiv2024 | Paper / Code |
| STAR: Scale-wise Text-to-image generation via Auto-Regressive representations | Arxiv2024 | Paper / Code |
| VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling | Arxiv2024 | Paper / Code |
| HART: Efficient Visual Generation with Hybrid Autoregressive Transformer | Arxiv2024 | Paper / Code |
| M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation | Arxiv2024 | Paper / Code |

| Title | Venue | Links |
| --- | --- | --- |
| ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation | ICLR2025 | Paper / Code |
| Autoregressive Video Generation without Vector Quantization | ICLR2025 | Paper / Code |
| Autoregressive Transformers are Zero-Shot Video Imitators | ICLR2025 | Paper / Code |
| LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior | ICLR2025 | Paper / Code |
| Taming Teacher Forcing for Masked Autoregressive Video Generation | Arxiv2025 | Paper / Code |
| GameFactory: Creating New Games with Generative Interactive Videos | Arxiv2025 | Paper / Code |
| StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation | Arxiv2025 | Paper / Code |
| An Empirical Study of Autoregressive Pre-training from Videos | Arxiv2025 | Paper / Code |
| AR4D: Autoregressive 4D Generation from Monocular Videos | Arxiv2025 | Paper / Code |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ICML2024 | Paper / Code |
| ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models | CVPRW2024 | Paper / Code |
| MarDini: Masked Auto-Regressive Diffusion for Video Generation at Scale | Arxiv2024 | Paper / Code |
| Loong: Generating Minute-level Long Videos with Autoregressive Language Models | Arxiv2024 | Paper / Code |
| DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models | Arxiv2024 | Paper / Code |

| Title | Venue | Links |
| --- | --- | --- |
| MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers | ICLR2025 | Paper / Code |
| DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control | ICLR2025 | Paper / Code |
| General Point Model Pretraining with Autoencoding and Autoregressive | CVPR2024 | Paper / Code |
| Bidirectional Autoregressive Diffusion Model for Dance Generation | CVPR2024 | Paper / Code |
| SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model | ECCV2024 | Paper / Code |
| BAMM: Bidirectional Autoregressive Motion Model | ECCV2024 | Paper / Code |
| SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE | Arxiv2024 | Paper / Code |
| 3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation | Arxiv2024 | Paper / Code |

| Title | Venue | Links |
| --- | --- | --- |
| JetFormer: An autoregressive generative model of raw images and text | ICLR2025 | Paper / Code |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR2025 | Paper / Code |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Arxiv2025 | Paper / Code |
| Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | CVPR2024 | Paper / Code |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR2024 | Paper / Code |
| VideoPoet: A Large Language Model for Zero-Shot Video Generation | ICML2024 | Paper / Code |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | NeurIPS2024 | Paper / Code |
| OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | NeurIPS2024 | Paper / Code |
| DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers | Arxiv2024 | Paper / Code |
| Emu3: Next-Token Prediction is All You Need | Arxiv2024 | Paper / Code |

Understanding or Optimization

| Title | Venue | Links |
| --- | --- | --- |
| LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding | ICLR2025 | Paper / Code |
| Next Patch Prediction for Autoregressive Visual Generation | Arxiv2025 | Paper / Code |
| Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ICLR2024 | Paper / Code |
| E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling | Arxiv2024 | Paper / Code |
| LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization | Arxiv2024 | Paper / Code |
| CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient | Arxiv2024 | Paper / Code |
| XQ-GAN🚀: An Open-source Image Tokenization Framework for Autoregressive Generation | Arxiv2024 | Paper / Code |
| TokenFlow🚀: Unified Image Tokenizer for Multimodal Understanding and Generation | Arxiv2024 | Paper / Code |
| Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis | Arxiv2024 | Paper / Code |
| FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching | Arxiv2024 | Paper / Code |
| Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching | Arxiv2024 | Paper / Code |
| Next Token Prediction Towards Multimodal Intelligence | Arxiv2024 | Paper / Code |
| Parallelized Autoregressive Visual Generation | Arxiv2024 | Paper / Code |

| Title | Venue | Links |
| --- | --- | --- |
| Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression | Arxiv2025 | Paper / Code |
| FAST: Efficient Action Tokenization for Vision-Language-Action Models | Arxiv2025 | Paper / Code |
| DeTrack: In-model Latent Denoising Learning for Visual Object Tracking | Arxiv2025 | Paper / Code |
| Less is More: Token Context-aware Learning for Object Tracking | AAAI2025 | Paper / Code |
| DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion | CVPR2024 | Paper / Code |
| CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction | Arxiv2024 | Paper / Code |
| Varformer: Adapting VAR’s Generative Prior for Image Restoration | Arxiv2024 | Paper / Code |
| Scalable Autoregressive Monocular Depth Estimation | Arxiv2024 | Paper / Code |