by Michael Tschannen, Cian Eastwood, Fabian Mentzer [arxiv] [colab]
We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a β-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.
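To make the two modifications concrete, here is a minimal sketch (not the released implementation) of a GIVT-style input projection and GMM output head in Flax. The module name, widths, and number of mixture components are illustrative assumptions, and class conditioning as well as the causal/MaskGIT transformer blocks are omitted.

```python
# Minimal sketch of the two GIVT modifications (illustrative, not the released code).
# d_model, latent_dim and num_mixtures are placeholder hyperparameters.
import flax.linen as nn


class GIVTHeads(nn.Module):
  d_model: int = 768       # transformer width (assumption)
  latent_dim: int = 16     # dimension of the real-valued VAE latents (assumption)
  num_mixtures: int = 16   # number of GMM components (assumption)

  @nn.compact
  def __call__(self, x):   # x: [batch, seq_len, latent_dim] real-valued latent vectors
    # 1) Input: a linear projection replaces the finite-vocabulary embedding lookup.
    h = nn.Dense(self.d_model)(x)

    # ... decoder-only transformer blocks would process h here ...

    # 2) Output: predict GMM parameters instead of logits over a finite vocabulary.
    k, d = self.num_mixtures, self.latent_dim
    out = nn.Dense(k + 2 * k * d)(h)
    mix_logits = out[..., :k]                                       # mixture weights (pre-softmax)
    means = out[..., k:k + k * d].reshape(*h.shape[:-1], k, d)      # per-component means
    log_scales = out[..., k + k * d:].reshape(*h.shape[:-1], k, d)  # per-component log std devs
    return mix_logits, means, log_scales
```

In this setup, the predicted mixture parameters would be trained with a negative log-likelihood loss on the next latent vector and sampled from at inference time.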
We provide model checkpoints for a subset of the models from the paper. These are meant as small-scale baselines for researchers interested in exploring GIVT, and are not optimized to provide the best possible visual quality (e.g. scaling the model size can substantially improve visual quality as shown in the paper). See below for instructions to train your own models.
ImageNet 2012 VAEs
β | 1e-5 | 2.5e-5 | 5e-5 | 1e-4 | 2e-4 |
---|---|---|---|---|---|
checkpoint | link | link | link | link | link |
Class-conditional ImageNet 2012 generative models
model | resolution | β | inference settings | FID | checkpoint |
---|---|---|---|---|---|
GIVT-Causal | 256 x 256 | 5e-5 | t=0.95, DB-CFG=0.4 | 3.35 | link |
GIVT-MaskGIT | 256 x 256 | 5e-5 | t_C=35, DB-CFG=0.1 | 4.53 | link |
GIVT-MaskGIT | 512 x 512 | 5e-5 | t_C=140 | 4.86 | link |
UViM
task | model | dataset | metric | checkpoint |
---|---|---|---|---|
Panoptic segmentation | VAE (stage 1) | COCO (2017) | 71.0 (PQ) | link |
Panoptic segmentation | GIVT (stage 2) | COCO (2017) | 40.2 (PQ) | link |
Depth estimation | VAE (stage 1) | NYU Depth v2 | 0.195 (RMSE) | link |
Depth estimation | GIVT (stage 2) | NYU Depth v2 | 0.474 (RMSE) | link |
This directory contains configs to train GIVT models as well as VAEs (for the UViM variants). For training the ImageNet 2012 VAE models we used a modified version of the MaskGIT code.
The `big_vision` input pipeline relies on TensorFlow Datasets (TFDS), which supports some datasets out-of-the-box, whereas others, for example ImageNet and COCO (2017), require a manual download of the data (see the `big_vision` main README and the UViM README, respectively, for details).
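As an optional sanity check that TFDS is set up, the sketch below builds Imagenette, which downloads automatically; the dataset name and config are taken from the public TFDS catalog and are not part of this repository's setup instructions.

```python
# Minimal TFDS sanity check (assumes the public "imagenette" dataset in the TFDS catalog;
# not part of the big_vision codebase).
import tensorflow_datasets as tfds

# Imagenette downloads automatically; datasets such as imagenet2012 require the raw
# files to be placed in the TFDS manual-download directory before building.
builder = tfds.builder('imagenette/full-size-v2')
builder.download_and_prepare()
train_ds = builder.as_dataset(split='train')
print(builder.info.splits['train'].num_examples)
```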
After setting up `big_vision` as described in the main README, training can be launched locally as follows:

```
python -m big_vision.trainers.proj.givt.generative \
  --config big_vision/configs/proj/givt/givt_imagenet2012.py \
  --workdir gs://$GS_BUCKET_NAME/big_vision/`date '+%m-%d_%H%M'`
```
Add the suffix `:key1=value1,key2=value2,...` to the config path in the launch command to modify the config with predefined arguments (see the config for details), for example `--config big_vision/configs/proj/givt/givt_imagenet2012.py:model_size=large`.
Note that `givt_imagenet2012.py` uses Imagenette to ensure that the config is runnable without a manual ImageNet download. This is only meant for testing and will overfit immediately; please download ImageNet to reproduce the paper results.
VAE training runs for the GIVT variant of UViM can be launched as follows:

```
python -m big_vision.trainers.proj.givt.vae \
  --config big_vision/configs/proj/givt/vae_nyu_depth.py \
  --workdir gs://$GS_BUCKET_NAME/big_vision/`date '+%m-%d_%H%M'`
```
Please refer to the main README for details on how to launch training on a (multi-host) TPU setup.
This is not an official Google product.
```
@article{tschannen2023givt,
  title={GIVT: Generative Infinite-Vocabulary Transformers},
  author={Tschannen, Michael and Eastwood, Cian and Mentzer, Fabian},
  journal={arXiv:2312.02116},
  year={2023}
}
```