PyTorch implementation of 2D positional encodings for Vision Transformers (ViT). Positional encodings/embeddings: Sinusoidal (Absolute), Learnable, Relative, and Rotary (RoPe).


2D Positional Encodings for Vision Transformers (ViT)

Overview

This repository explores various 2D positional encoding strategies for Vision Transformers (ViTs), including:

  • No Position
  • Learnable
  • Sinusoidal (Absolute)
  • Relative
  • Rotary Position Embedding (RoPe)

The encodings are tested on the CIFAR10 and CIFAR100 datasets with a compact ViT architecture (~800k parameters).

Key Features

  • Implements 2D positional encodings by encoding the x- and y-axis positions as separate 1D sequences and combining them.
  • Handles the classification token appropriately for each encoding type.
  • Provides a compact ViT model with only ~800k parameters.
  • Comprehensive comparisons across CIFAR10 and CIFAR100 datasets (using a patch size of 4).

Run commands (also available in scripts.sh)

Use the following command to run the model with different positional encodings:

python main.py --dataset cifar10 --pos_embed [TYPE]

Replace [TYPE] with one of the following:

Positional Encoding Type   Argument
No Position                --pos_embed none
Learnable                  --pos_embed learn
Sinusoidal (Absolute)      --pos_embed sinusoidal
Relative                   --pos_embed relative --max_relative_distance 2
Rotary (RoPe)              --pos_embed rope
  • Use the --dataset argument to switch between CIFAR10 and CIFAR100.
  • For relative encoding, adjust the --max_relative_distance parameter as needed.
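
For example, a CIFAR100 run with the relative encoding (assuming the --dataset flag accepts cifar100 in the same way it accepts cifar10 above):

python main.py --dataset cifar100 --pos_embed relative --max_relative_distance 2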

Results

Test-set accuracy when the ViT is trained with each type of positional encoding.

Positional Encoding Type   CIFAR10   CIFAR100
No Position                79.63     53.25
Learnable                  86.52     60.87
Sinusoidal (Absolute)      86.09     59.73
Relative                   90.57     65.11
Rotary (RoPe)              88.49     62.88

Splitting X and Y Axes into 1D Positional Encodings

Instead of assigning a single 1D position index to each flattened patch, we encode spatial information separately for the x and y axes:

  • X-axis encoding applies 1D positional encoding to horizontal sequences.
  • Y-axis encoding applies 1D positional encoding to vertical sequences.

Visualizations of the x-axis and y-axis encodings are included in the repository.

The x- and y-axis position sequences are replicated across the patch grid by the get_x_positions and get_y_positions functions in utils.py, and the resulting encodings are combined to represent the 2D position of each patch.
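
Below is a minimal sketch of this idea for the sinusoidal case. It assumes row-major patch ordering and that half of the embedding dimensions encode x while the other half encode y; the exact function signatures and the way the repository combines the two encodings may differ.

import math
import torch

def get_x_positions(n_side):
    # x (column) index of each patch in row-major order: 0,1,...,7, 0,1,...,7, ...
    return torch.arange(n_side).repeat(n_side)

def get_y_positions(n_side):
    # y (row) index of each patch in row-major order: 0,...,0, 1,...,1, ...
    return torch.arange(n_side).repeat_interleave(n_side)

def sinusoidal_1d(pos, dim):
    # standard 1D sinusoidal encoding evaluated at the given positions
    pe = torch.zeros(pos.numel(), dim)
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(pos.float().unsqueeze(1) * div)
    pe[:, 1::2] = torch.cos(pos.float().unsqueeze(1) * div)
    return pe

embed_dim, grid = 128, 8   # 32x32 CIFAR image, patch size 4 -> 8x8 patch grid
pe_x = sinusoidal_1d(get_x_positions(grid), embed_dim // 2)   # encodes columns
pe_y = sinusoidal_1d(get_y_positions(grid), embed_dim // 2)   # encodes rows
pe_2d = torch.cat([pe_x, pe_y], dim=1)                        # (64, 128) patch encoding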


Handling the Classification Token

Each positional encoding technique handles the classification token differently:

  • No Position: No encoding applied to classification tokens.
  • Learnable: Classification token learns its encoding.
  • Sinusoidal: Patch tokens receive sinusoidal encoding; classification token learns its own.
  • Relative: The classification token is excluded from distance calculations. A fixed index (0) represents its distance in the lookup tables.
  • Rotary (RoPe): X and Y positions start at 1 for patch tokens, reserving position 0 for the classification token so that no rotation is applied to it (see the sketch after this list).
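
The RoPe indexing above can be sketched as follows (a hypothetical illustration; the repository's tensor shapes and ordering may differ). Position 0 yields rotation angles of zero (cos = 1, sin = 0), so the classification token passes through unrotated.

import torch

grid = 8                                               # 8x8 patch grid
x = torch.arange(grid).repeat(grid) + 1                # patch x positions start at 1
y = torch.arange(grid).repeat_interleave(grid) + 1     # patch y positions start at 1
cls_pos = torch.zeros(1, dtype=torch.long)             # position 0 reserved for the CLS token
x_pos = torch.cat([cls_pos, x])                        # length 65: CLS token + 64 patches
y_pos = torch.cat([cls_pos, y])
# Rotation angles are proportional to the position index, so index 0 leaves the
# CLS token's query/key vectors unchanged.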

Parameter Comparison

The table below shows additional parameters introduced by different positional encodings:

Encoding Type              Parameter Description                                   Count
No Position                N/A                                                     0
Learnable                  64 x 128 table                                          8192
Sinusoidal (Absolute)      No learned parameters                                   0
Relative                   Derived from max_relative_distance and other factors    2304
Rotary (RoPe)              No learned parameters                                   0
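
For the Learnable row, the count is simply the size of a 64 x 128 table, one 128-dimensional vector per patch token. A minimal sketch (the parameter name and shape are illustrative, not necessarily the repository's):

import torch
import torch.nn as nn

pos_embed = nn.Parameter(torch.zeros(1, 64, 128))   # 64 patch positions x 128 dims
print(pos_embed.numel())                            # 8192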

Base Transformer Configuration

Below are the training and architecture configurations:

  • Input Size: 3 x 32 x 32
  • Patch Size: 4
  • Sequence Length: 64
  • Embedding Dimension: 128
  • Number of Layers: 6
  • Number of Attention Heads: 4
  • Total Parameters: 820k
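
A quick sanity check relating the values above (variable names are illustrative):

img_size, patch_size, embed_dim = 32, 4, 128
n_patches_per_side = img_size // patch_size          # 32 / 4 = 8
seq_len = n_patches_per_side ** 2                    # 8 x 8 = 64 patch tokens
print(n_patches_per_side, seq_len, embed_dim)        # 8 64 128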

Note: This repo is built upon the following GitHub repo: Vision Transformers from Scratch in PyTorch

Citations

@article{vaswani2017attention,
  title={Attention is all you need},
  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
  journal={Advances in neural information processing systems},
  volume={30},
  year={2017}
}
@inproceedings{dosovitskiy2020image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others},
  booktitle={International Conference on Learning Representations},
  year={2020}
}
@article{shaw2018self,
  title={Self-attention with relative position representations},
  author={Shaw, Peter and Uszkoreit, Jakob and Vaswani, Ashish},
  journal={arXiv preprint arXiv:1803.02155},
  year={2018}
}
@article{su2024roformer,
  title={Roformer: Enhanced transformer with rotary position embedding},
  author={Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng},
  journal={Neurocomputing},
  volume={568},
  pages={127063},
  year={2024},
  publisher={Elsevier}
}
