BLIP is a pre-training framework for unified vision-language understanding and generation that achieves state-of-the-art results on a wide range of vision-language tasks. This tutorial demonstrates how to use BLIP for visual question answering and image captioning.
The complete pipeline of this demo is shown below:
The following image shows an example of the input image and generated caption:
The following image shows an example of the input image, question, and answer generated by the model.

The tutorial consists of the following parts:
- Instantiate a BLIP model (sketched in the first code example below).
- Convert the BLIP model to OpenVINO IR (also covered in the first example).
- Run visual question answering and image captioning with OpenVINO (sketched in the second example).
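As a rough illustration of the first two steps, the sketch below loads public BLIP checkpoints from Hugging Face and exports the vision encoder to OpenVINO IR. The `Salesforce/blip-vqa-base` and `Salesforce/blip-image-captioning-base` checkpoint IDs and the 384×384 input size come from the public BLIP releases; note that a complete conversion typically exports the vision encoder, text encoder, and text decoder as separate IR models, because text generation runs as an iterative loop. Treat this as a minimal sketch, not the full conversion:

```python
import torch
import openvino as ov
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForQuestionAnswering,
)

# Public BLIP checkpoints: one for captioning, one for VQA
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
caption_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Trace the vision encoder with a dummy 384x384 RGB input and save the
# traced graph as OpenVINO IR.
dummy_image = torch.zeros(1, 3, 384, 384)
ov_vision = ov.convert_model(vqa_model.vision_model, example_input=dummy_image)
ov.save_model(ov_vision, "blip_vision_model.xml")
```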
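The last step can then run on the compiled IR model. The following sketch is again a hedged illustration: `demo.jpg` and the question are placeholders for your own inputs, and the token-by-token generation loop over BLIP's text decoder is elided for brevity.

```python
from PIL import Image

core = ov.Core()
compiled_vision = core.compile_model("blip_vision_model.xml", device_name="CPU")

# "demo.jpg" and the question text are placeholders for your own inputs
image = Image.open("demo.jpg").convert("RGB")
inputs = processor(images=image, text="What is in the picture?", return_tensors="pt")

# The OpenVINO vision encoder produces the image features that BLIP's text
# encoder and decoder consume while generating the answer token by token.
image_embeds = compiled_vision(inputs["pixel_values"].numpy())[0]
print(image_embeds.shape)  # e.g. (1, 577, 768) for the base model
```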
If you have not installed all required dependencies, follow the Installation Guide.
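For reference, the sketches above assume roughly the following dependency set; exact package names and versions are an assumption, and the Installation Guide remains the authoritative source:

```python
# Assumed dependencies for the sketches above; see the Installation Guide
# for the exact, supported versions.
%pip install -q "openvino>=2023.1.0" "transformers>=4.30" torch pillow
```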