BLIP is a pre-training framework for unified vision-language understanding and generation that achieves state-of-the-art results on a wide range of vision-language tasks. This tutorial demonstrates how to use BLIP for visual question answering and image captioning.
The complete pipeline of this demo is shown below:
The following image shows an example of the input image and generated caption:
The following image shows an example of the input image, question, and answer generated by the model.

The tutorial consists of the following parts:
- Instantiate a BLIP model (sketched in the first code example below).
- Convert the BLIP model to OpenVINO IR (also covered in the first example).
- Run visual question answering and image captioning with OpenVINO (sketched in the second example).
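As a rough illustration of the first two steps, the sketch below loads public BLIP checkpoints from Hugging Face and exports the vision encoder to OpenVINO IR. The `Salesforce/blip-vqa-base` and `Salesforce/blip-image-captioning-base` checkpoint IDs and the 384×384 input size come from the public BLIP releases; note that a complete conversion typically exports the vision encoder, text encoder, and text decoder as separate IR models, because text generation runs as an iterative loop. Treat this as a minimal sketch, not the full conversion:

```python
import torch
import openvino as ov
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForQuestionAnswering,
)

# Public BLIP checkpoints: one for captioning, one for VQA
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
caption_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Trace the vision encoder with a dummy 384x384 RGB input and save the
# traced graph as OpenVINO IR.
dummy_image = torch.zeros(1, 3, 384, 384)
ov_vision = ov.convert_model(vqa_model.vision_model, example_input=dummy_image)
ov.save_model(ov_vision, "blip_vision_model.xml")
```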
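The last step can then run on the compiled IR model. The following sketch is again a hedged illustration: `demo.jpg` and the question are placeholders for your own inputs, and the token-by-token generation loop over BLIP's text decoder is elided for brevity.

```python
from PIL import Image

core = ov.Core()
compiled_vision = core.compile_model("blip_vision_model.xml", device_name="CPU")

# "demo.jpg" and the question text are placeholders for your own inputs
image = Image.open("demo.jpg").convert("RGB")
inputs = processor(images=image, text="What is in the picture?", return_tensors="pt")

# The OpenVINO vision encoder produces the image features that BLIP's text
# encoder and decoder consume while generating the answer token by token.
image_embeds = compiled_vision(inputs["pixel_values"].numpy())[0]
print(image_embeds.shape)  # e.g. (1, 577, 768) for the base model
```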
If you have not installed all required dependencies, follow the Installation Guide.
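For reference, the sketches above assume roughly the following dependency set; exact package names and versions are an assumption, and the Installation Guide remains the authoritative source:

```python
# Assumed dependencies for the sketches above; see the Installation Guide
# for the exact, supported versions.
%pip install -q "openvino>=2023.1.0" "transformers>=4.30" torch pillow
```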