ZNan-Chen/Awesome-Visual-Autoregressive-Model

Awesome-Visual-Autoregressive-Model

🌟This repository is still being updated; please stay tuned.

👉If you find a mistake or an overlooked paper, please open an issue or a pull request.

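Content: Image Generation, Video Generation, 3D Generation, Multimodal Generation, Understanding or Optimization, Others
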
Image Generation

Title Venue Links
Autoregressive Pretraining with Mamba in Vision ICLR2025 Paper|Code
Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens ICLR2025 Paper|Code
ControlAR: Controllable Image Generation with Autoregressive Models ICLR2025 Paper|Code
ImageFolder: Autoregressive Image Generation with Folded Tokens ICLR2025 Paper|Code
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation ICLR2025 Paper|Code
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Arxiv2025 Paper|Code
EditAR: Unified Conditional Generation with Autoregressive Models Arxiv2025 Paper|Code
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction NeurIPS2024 Paper|Code
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis Arxiv2024 Paper|Code
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation Arxiv2024 Paper|Code
STAR: Scale-wise Text-to-image generation via Auto-Regressive representations Arxiv2024 Paper|Code
VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling Arxiv2024 Paper|Code
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer Arxiv2024 Paper|Code
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation Arxiv2024 Paper|Code

Video Generation

Title Venue Links
ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation ICLR2025 Paper|Code
Autoregressive Video Generation without Vector Quantization ICLR2025 Paper|Code
Autoregressive Transformers are Zero-Shot Video Imitators ICLR2025 Paper|Code
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior ICLR2025 Paper|Code
Taming Teacher Forcing for Masked Autoregressive Video Generation Arxiv2025 Paper|Code
GameFactory: Creating New Games with Generative Interactive Videos Arxiv2025 Paper|Code
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation Arxiv2025 Paper|Code
An Empirical Study of Autoregressive Pre-training from Videos Arxiv2025 Paper|Code
AR4D: Autoregressive 4D Generation from Monocular Videos Arxiv2025 Paper|Code
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization ICML2024 Paper|Code
ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models CVPRW2024 Paper|Code
MarDini: Masked Auto-Regressive Diffusion for Video Generation at Scale Arxiv2024 Paper|Code
Loong: Generating Minute-level Long Videos with Autoregressive Language Models Arxiv2024 Paper|Code
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models Arxiv2024 Paper|Code

3D Generation

Title Venue Links
MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers ICLR2025 Paper|Code
DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control ICLR2025 Paper|Code
General Point Model Pretraining with Autoencoding and Autoregressive CVPR2024 Paper|Code
Bidirectional Autoregressive Diffusion Model for Dance Generation CVPR2024 Paper|Code
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model ECCV2024 Paper|Code
BAMM: Bidirectional Autoregressive Motion Model ECCV2024 Paper|Code
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE Arxiv2024 Paper|Code
3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation Arxiv2024 Paper|Code

Multimodal Generation

Title Venue Links
JetFormer: An autoregressive generative model of raw images and text ICLR2025 Paper|Code
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ICLR2025 Paper|Code
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model Arxiv2025 Paper|Code
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities CVPR2024 Paper|Code
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action CVPR2024 Paper|Code
VideoPoet: A Large Language Model for Zero-Shot Video Generation ICML2024 Paper|Code
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing NeurIPS2024 Paper|Code
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation NeurIPS2024 Paper|Code
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers Arxiv2024 Paper|Code
Emu3: Next-Token Prediction is All You Need Arxiv2024 Paper|Code

Understanding or Optimization

Title Venue Links
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding ICLR2025 Paper|Code
Next Patch Prediction for Autoregressive Visual Generation Arxiv2025 Paper|Code
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation ICLR2024 Paper|Code
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling Arxiv2024 Paper|Code
LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization Arxiv2024 Paper|Code
CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient Arxiv2024 Paper|Code
XQ-GAN🚀: An Open-source Image Tokenization Framework for Autoregressive Generation Arxiv2024 Paper|Code
TokenFlow🚀: Unified Image Tokenizer for Multimodal Understanding and Generation Arxiv2024 Paper|Code
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis Arxiv2024 Paper|Code
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching Arxiv2024 Paper|Code
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching Arxiv2024 Paper|Code
Next Token Prediction Towards Multimodal Intelligence Arxiv2024 Paper|Code
Parallelized Autoregressive Visual Generation Arxiv2024 Paper|Code

Others

Title Venue Links
Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression Arxiv2025 Paper|Code
FAST: Efficient Action Tokenization for Vision-Language-Action Models Arxiv2025 Paper|Code
DeTrack: In-model Latent Denoising Learning for Visual Object Tracking Arxiv2025 Paper|Code
Less is More: Token Context-aware Learning for Object Tracking AAAI2025 Paper|Code
DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion CVPR2024 Paper|Code
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction Arxiv2024 Paper|Code
Varformer: Adapting VAR’s Generative Prior for Image Restoration Arxiv2024 Paper|Code
Scalable Autoregressive Monocular Depth Estimation Arxiv2024 Paper|Code

About

Latest Advances on Autoregressive Visual Models.📖
