PyTorch implementation of 2D positional encodings for Vision Transformers (ViT). Positional encodings/embeddings: Sinusoidal (Absolute), Learnable, Relative, and Rotary (RoPe).


2D Positional Encodings for Vision Transformers (ViT)

Overview

This repository explores various 2D positional encoding strategies for Vision Transformers (ViTs), including:

  • No Position
  • Learnable
  • Sinusoidal (Absolute)
  • Relative
  • Rotary Position Embedding (RoPe)

The encodings are tested on the CIFAR10 and CIFAR100 datasets with a compact ViT architecture (~800k parameters).

Key Features

  • Implements 2D positional encodings by encoding the x- and y-axis positions as separate 1D sequences and combining them.
  • Handles the classification token appropriately for each encoding type.
  • Provides a compact ViT model with only ~800k parameters.
  • Comprehensive comparisons across CIFAR10 and CIFAR100 datasets (using a patch size of 4).

Run commands (also available in scripts.sh)

Use the following command to run the model with different positional encodings:

python main.py --dataset cifar10 --pos_embed [TYPE]

Replace [TYPE] with one of the following:

Positional Encoding Type   Argument
No Position                --pos_embed none
Learnable                  --pos_embed learn
Sinusoidal (Absolute)      --pos_embed sinusoidal
Relative                   --pos_embed relative --max_relative_distance 2
Rotary (RoPe)              --pos_embed rope
  • Use the --dataset argument to switch between CIFAR10 and CIFAR100.
  • For relative encoding, adjust the --max_relative_distance parameter as needed.
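
For example, a CIFAR100 run with the relative encoding (assuming the --dataset flag accepts cifar100 in the same way it accepts cifar10 above):

python main.py --dataset cifar100 --pos_embed relative --max_relative_distance 2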

Results

Test-set accuracy when the ViT is trained with each type of positional encoding.

Positional Encoding Type   CIFAR10   CIFAR100
No Position                79.63     53.25
Learnable                  86.52     60.87
Sinusoidal (Absolute)      86.09     59.73
Relative                   90.57     65.11
Rotary (RoPe)              88.49     62.88

Splitting X and Y Axes into 1D Positional Encodings

Instead of assigning a single 1D position index to each flattened patch, we encode spatial information separately for the x and y axes:

  • X-axis encoding applies 1D positional encoding to horizontal sequences.
  • Y-axis encoding applies 1D positional encoding to vertical sequences.

Visualizations of the x-axis and y-axis encodings are included in the repository.

The x- and y-axis position sequences are replicated across the patch grid by the get_x_positions and get_y_positions functions in utils.py, and the resulting encodings are combined to represent the 2D position of each patch.
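
Below is a minimal sketch of this idea for the sinusoidal case. It assumes row-major patch ordering and that half of the embedding dimensions encode x while the other half encode y; the exact function signatures and the way the repository combines the two encodings may differ.

import math
import torch

def get_x_positions(n_side):
    # x (column) index of each patch in row-major order: 0,1,...,7, 0,1,...,7, ...
    return torch.arange(n_side).repeat(n_side)

def get_y_positions(n_side):
    # y (row) index of each patch in row-major order: 0,...,0, 1,...,1, ...
    return torch.arange(n_side).repeat_interleave(n_side)

def sinusoidal_1d(pos, dim):
    # standard 1D sinusoidal encoding evaluated at the given positions
    pe = torch.zeros(pos.numel(), dim)
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(pos.float().unsqueeze(1) * div)
    pe[:, 1::2] = torch.cos(pos.float().unsqueeze(1) * div)
    return pe

embed_dim, grid = 128, 8   # 32x32 CIFAR image, patch size 4 -> 8x8 patch grid
pe_x = sinusoidal_1d(get_x_positions(grid), embed_dim // 2)   # encodes columns
pe_y = sinusoidal_1d(get_y_positions(grid), embed_dim // 2)   # encodes rows
pe_2d = torch.cat([pe_x, pe_y], dim=1)                        # (64, 128) patch encoding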


Handling the Classification Token

Each positional encoding technique handles the classification token differently:

  • No Position: No encoding applied to classification tokens.
  • Learnable: Classification token learns its encoding.
  • Sinusoidal: Patch tokens receive sinusoidal encoding; classification token learns its own.
  • Relative: The classification token is excluded from distance calculations. A fixed index (0) represents its distance in the lookup tables.
  • Rotary (RoPe): X and Y positions start at 1 for patch tokens, reserving position 0 for the classification token so that no rotation is applied to it (see the sketch after this list).
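
The RoPe indexing above can be sketched as follows (a hypothetical illustration; the repository's tensor shapes and ordering may differ). Position 0 yields rotation angles of zero (cos = 1, sin = 0), so the classification token passes through unrotated.

import torch

grid = 8                                               # 8x8 patch grid
x = torch.arange(grid).repeat(grid) + 1                # patch x positions start at 1
y = torch.arange(grid).repeat_interleave(grid) + 1     # patch y positions start at 1
cls_pos = torch.zeros(1, dtype=torch.long)             # position 0 reserved for the CLS token
x_pos = torch.cat([cls_pos, x])                        # length 65: CLS token + 64 patches
y_pos = torch.cat([cls_pos, y])
# Rotation angles are proportional to the position index, so index 0 leaves the
# CLS token's query/key vectors unchanged.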

Parameter Comparison

The table below shows additional parameters introduced by different positional encodings:

Encoding Type              Parameter Description                                   Count
No Position                N/A                                                     0
Learnable                  64 x 128 table                                          8192
Sinusoidal (Absolute)      No learned parameters                                   0
Relative                   Derived from max_relative_distance and other factors    2304
Rotary (RoPe)              No learned parameters                                   0
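
For the Learnable row, the count is simply the size of a 64 x 128 table, one 128-dimensional vector per patch token. A minimal sketch (the parameter name and shape are illustrative, not necessarily the repository's):

import torch
import torch.nn as nn

pos_embed = nn.Parameter(torch.zeros(1, 64, 128))   # 64 patch positions x 128 dims
print(pos_embed.numel())                            # 8192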

Base Transformer Configuration

Below are the training and architecture configurations:

  • Input Size: 3 x 32 x 32
  • Patch Size: 4
  • Sequence Length: 64
  • Embedding Dimension: 128
  • Number of Layers: 6
  • Number of Attention Heads: 4
  • Total Parameters: 820k
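
A quick sanity check relating the values above (variable names are illustrative):

img_size, patch_size, embed_dim = 32, 4, 128
n_patches_per_side = img_size // patch_size          # 32 / 4 = 8
seq_len = n_patches_per_side ** 2                    # 8 x 8 = 64 patch tokens
print(n_patches_per_side, seq_len, embed_dim)        # 8 64 128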

Note: This repo is built upon the following GitHub repo: Vision Transformers from Scratch in PyTorch

Citations

@article{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
  journal={Advances in neural information processing systems},
  volume={30},
  year={2017}
}
@inproceedings{dosovitskiy2020image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others},
  booktitle={International Conference on Learning Representations},
  year={2020}
}
@article{shaw2018self,
  title={Self-attention with relative position representations},
  author={Shaw, Peter and Uszkoreit, Jakob and Vaswani, Ashish},
  journal={arXiv preprint arXiv:1803.02155},
  year={2018}
}
@article{su2024roformer,
  title={Roformer: Enhanced transformer with rotary position embedding},
  author={Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng},
  journal={Neurocomputing},
  volume={568},
  pages={127063},
  year={2024},
  publisher={Elsevier}
}
