Image caption generation is an interesting artificial intelligence problem where a descriptive sentence is generated for a given image. It involves the dual techniques from computer vision to understand the content of the image and a language model from the field of natural language processing to turn the understanding of the image into words in the right order
COCO dataset 80K training Image 40K val Image with 5 captioning for every image
Transfer learning : encode the image (224 * 224 * 3 ) into 4096 feature vector using pre-trained VGG 16 Linear layer : to get small feature vector with shape of 512
- Its job to encode the feature vector and generate word by word
- Since it’s generating sequence , it will need RNN, we will use LSTM
In a setting with Attention, we want the Decoder to be able to look at different parts of the image at different points in the sequence.
- The straightforward – and greedy – option would be to choose the word with the highest score and use it to predict the next word
- It would be best if we could somehow not decide until we've finished decoding completely, and choose the sequence that has the highest overall score from a basket of candidate sequences.
pizza sitting on top of a wooden table
A man riding a wave on top of a surfboard
A close up of a plate of food with broccoli
A small boat on a body of water
A box filled with different type of donuts
A baseball player swinging a bat at ball