🌟 This repository is still being updated; please stay tuned.

👉 If you find a mistake or an overlooked paper, please open an issue or a pull request.

Title | Venue | Paper | Code |
---|---|---|---|
Autoregressive Pretraining with Mamba in Vision | ICLR2025 | Paper|Code |
Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens | ICLR2025 | Paper|Code |
ControlAR: Controllable Image Generation with Autoregressive Models | ICLR2025 | Paper|Code |
ImageFolder: Autoregressive Image Generation with Folded Tokens | ICLR2025 | Paper|Code |
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation | ICLR2025 | Paper|Code |
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | Arxiv2025 | Paper|Code |
EditAR: Unified Conditional Generation with Autoregressive Models | Arxiv2025 | Paper|Code |
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | NeurIPS2024 | Paper|Code |
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis | Arxiv2024 | Paper|Code |
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation | Arxiv2024 | Paper|Code |
STAR: Scale-wise Text-to-image generation via Auto-Regressive representations | Arxiv2024 | Paper|Code |
VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling | Arxiv2024 | Paper|Code |
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer | Arxiv2024 | Paper|Code |
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation | Arxiv2024 | Paper|Code |

Title | Venue | Paper | Code |
---|---|---|---|
ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation | ICLR2025 | Paper|Code |
Autoregressive Video Generation without Vector Quantization | ICLR2025 | Paper|Code |
Autoregressive Transformers are Zero-Shot Video Imitators | ICLR2025 | Paper|Code |
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior | ICLR2025 | Paper|Code |
Taming Teacher Forcing for Masked Autoregressive Video Generation | Arxiv2025 | Paper|Code |
GameFactory: Creating New Games with Generative Interactive Videos | Arxiv2025 | Paper|Code |
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation | Arxiv2025 | Paper|Code |
An Empirical Study of Autoregressive Pre-training from Videos | Arxiv2025 | Paper|Code |
AR4D: Autoregressive 4D Generation from Monocular Videos | Arxiv2025 | Paper|Code |
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ICML2024 | Paper|Code |
ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models | CVPRW2024 | Paper|Code |
MarDini: Masked Auto-Regressive Diffusion for Video Generation at Scale | Arxiv2024 | Paper|Code |
Loong: Generating Minute-level Long Videos with Autoregressive Language Models | Arxiv2024 | Paper|Code |
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models | Arxiv2024 | Paper|Code |

Title | Venue | Paper | Code |
---|---|---|---|
MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers | ICLR2025 | Paper|Code |
DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control | ICLR2025 | Paper|Code |
General Point Model Pretraining with Autoencoding and Autoregressive | CVPR2024 | Paper|Code |
Bidirectional Autoregressive Diffusion Model for Dance Generation | CVPR2024 | Paper|Code |
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model | ECCV2024 | Paper|Code |
BAMM: Bidirectional Autoregressive Motion Model | ECCV2024 | Paper|Code |
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE | Arxiv2024 | Paper|Code |
3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation | Arxiv2024 | Paper|Code |

Title | Venue | Paper | Code |
---|---|---|---|
JetFormer: An autoregressive generative model of raw images and text | ICLR2025 | Paper|Code |
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR2025 | Paper|Code |
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Arxiv2025 | Paper|Code |
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | CVPR2024 | Paper|Code |
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR2024 | Paper|Code |
VideoPoet: A Large Language Model for Zero-Shot Video Generation | ICML2024 | Paper|Code |
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | NeurIPS2024 | Paper|Code |
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | NeurIPS2024 | Paper|Code |
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers | Arxiv2024 | Paper|Code |
Emu3: Next-Token Prediction is All You Need | Arxiv2024 | Paper|Code |

Title | Venue | Paper | Code |
---|---|---|---|
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding | ICLR2025 | Paper|Code |
Next Patch Prediction for Autoregressive Visual Generation | Arxiv2025 | Paper|Code |
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ICLR2024 | Paper|Code |
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling | Arxiv2024 | Paper|Code |
LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization | Arxiv2024 | Paper|Code |
CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient | Arxiv2024 | Paper|Code |
XQ-GAN🚀: An Open-source Image Tokenization Framework for Autoregressive Generation | Arxiv2024 | Paper|Code |
TokenFlow🚀: Unified Image Tokenizer for Multimodal Understanding and Generation | Arxiv2024 | Paper|Code |
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis | Arxiv2024 | Paper|Code |
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching | Arxiv2024 | Paper|Code |
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching | Arxiv2024 | Paper|Code |
Next Token Prediction Towards Multimodal Intelligence | Arxiv2024 | Paper|Code |
Parallelized Autoregressive Visual Generation | Arxiv2024 | Paper|Code |

Title | Venue | Paper | Code |
---|---|---|---|
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression | Arxiv2025 | Paper|Code |
FAST: Efficient Action Tokenization for Vision-Language-Action Models | Arxiv2025 | Paper|Code |
DeTrack: In-model Latent Denoising Learning for Visual Object Tracking | Arxiv2025 | Paper|Code |
Less is More: Token Context-aware Learning for Object Tracking | AAAI2025 | Paper|Code |
DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion | CVPR2024 | Paper|Code |
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction | Arxiv2024 | Paper|Code |
Varformer: Adapting VAR’s Generative Prior for Image Restoration | Arxiv2024 | Paper|Code |
Scalable Autoregressive Monocular Depth Estimation | Arxiv2024 | Paper|Code |