This repository contains all the code I wrote or used while reading up on Vision Transformers, following the research paper: https://arxiv.org/abs/2010.11929
This Jupyter notebook contains the code for the model that I wrote, alongside explanations.
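For orientation, here is a minimal sketch of the core patchify-and-embed step from the paper (Dosovitskiy et al., 2020); the notebook's actual implementation may differ in names and details:

```python
# Sketch of ViT patch embedding: split an image into fixed-size patches
# and linearly project each one into a token. Not the notebook's exact code.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the standard trick for patchify + linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, D): a sequence of patch tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```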
This imports a ViT model pretrained on ImageNet and applies it to a video input, producing an output video with the classification overlaid as text.
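A minimal sketch of that per-frame pipeline, assuming a timm ViT checkpoint and OpenCV for video I/O (the model name and file names here are placeholders, not necessarily what this repo uses):

```python
# Classify each frame of a video with a pretrained ViT and write the
# predicted ImageNet class index onto the output frames.
import cv2
import timm
import torch
from PIL import Image
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
transform = create_transform(**resolve_data_config({}, model=model))

cap = cv2.VideoCapture("vid.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("video.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    with torch.no_grad():
        logits = model(transform(img).unsqueeze(0))
    label = logits.argmax(dim=-1).item()  # ImageNet class index
    cv2.putText(frame, str(label), (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    out.write(frame)

cap.release()
out.release()
```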
vid.mp4 and video.mp4 are examples of the input and output videos.
This is code for a pretrained model that I modified to return the attention weights alongside the output.
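One way to get those attention weights without rewriting the model is a forward hook on an attention block. The sketch below assumes the internal layout of timm's ViT `Attention` module (a fused `qkv` linear, `num_heads`, `scale`); the repo's modified model may take a different approach:

```python
# Capture the last block's attention weights alongside the prediction
# by recomputing softmax(q @ k^T) from the block's input inside a hook.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
attn_maps = []

def grab(module, inputs, output):
    x = inputs[0]
    B, N, C = x.shape
    qkv = (module.qkv(x)
           .reshape(B, N, 3, module.num_heads, C // module.num_heads)
           .permute(2, 0, 3, 1, 4))        # (3, B, heads, N, head_dim)
    q, k = qkv[0], qkv[1]
    attn_maps.append(((q @ k.transpose(-2, -1)) * module.scale).softmax(dim=-1))

handle = model.blocks[-1].attn.register_forward_hook(grab)
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
handle.remove()
print(logits.shape, attn_maps[0].shape)  # (1, 1000), (1, 12, 197, 197)
```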
This file uses the PretrainedModel to render an attention layer as an image. This helps us see which parts of the image the model focused on.
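A minimal sketch of turning such an attention map into an image, assuming weights of shape (1, heads, 197, 197) for a 16x16-patch, 224x224 ViT (e.g. as captured by the hook sketch above); the actual visualization in this repo may differ:

```python
# Take the CLS token's attention to the 196 patch tokens, average it over
# heads, and upsample the 14x14 grid into a color heatmap.
import cv2
import numpy as np
import torch

def attention_to_heatmap(attn, grid=14, out_size=(224, 224)):
    cls_attn = attn.mean(1)[0, 0, 1:]            # (196,): CLS -> patch attention
    heat = cls_attn.reshape(grid, grid).numpy()  # 14x14 patch grid
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    heat = cv2.resize(heat, out_size, interpolation=cv2.INTER_CUBIC)
    return cv2.applyColorMap(np.uint8(255 * heat), cv2.COLORMAP_JET)

# Demo with random weights; in practice `attn` comes from the model,
# e.g. attn_maps[0] from the hook sketch above.
heatmap = attention_to_heatmap(torch.rand(1, 12, 197, 197).softmax(dim=-1))
cv2.imwrite("attention.png", heatmap)
```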