Welcome to the repository of our state-of-the-art image captioning model. We combine the strengths of BLIP-2 (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models) with LoRA (Low-Rank Adaptation of Large Language Models) to create an effective and precise image captioning tool. Our dataset, rich in image descriptions, was automatically labeled using a combination of multi-modal models.
Check out our GitHub Page! https://diegobonilla98.github.io/PixLore/
The main objective of this project is to generate detailed and accurate captions for a wide range of images. By combining the power of BLIP-2 and LoRA, we aim to make a significant leap in the image captioning domain.
Our dataset consists of 11,000 examples that have been automatically labeled using an assortment of multi-modal models.
- Images: High-quality images randomly sampled from the COCO dataset.
- Captions: Rich and detailed descriptions automatically generated for each image.
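As a rough illustration of the labeling step, the sketch below captions a single COCO image with a BLIP-2 checkpoint from the Hugging Face Hub. The checkpoint name, image path, and decoding settings are assumptions for illustration, not necessarily the exact models used to build the dataset.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Hypothetical checkpoint and image path, used only for illustration.
checkpoint = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.float16
).to("cuda")

# Load one sampled COCO image and prepare it for the model.
image = Image.open("path/to/coco_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

# Generate a caption for the image; decoding settings are illustrative.
generated_ids = model.generate(**inputs, max_new_tokens=60)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```

Running a loop like this over the sampled images is one way such an automatically labeled set of image-caption pairs could be produced.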
We employ the BLIP-2 architecture, known for effectively combining the strengths of frozen image encoders and large language models. To further enhance its performance, we fine-tune the model with LoRA, a technique that adapts large language models through low-rank weight updates.
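A minimal sketch of this setup, assuming the Hugging Face transformers and peft libraries: the base checkpoint, the LoRA rank, alpha, and target modules, and the single-pair training step below are illustrative assumptions, not the exact configuration used in this repository.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Base checkpoint is an assumption; any BLIP-2 variant from the Hub works similarly.
checkpoint = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)

# Inject low-rank adapters into the attention projections of the language model;
# rank, alpha, dropout, and target modules are illustrative choices.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights require gradients

# One illustrative training step on a single (image, caption) pair.
image = Image.open("path/to/coco_image.jpg").convert("RGB")   # hypothetical path
caption = "A detailed description of the image."              # placeholder target
inputs = processor(images=image, text=caption, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
```

Because only the low-rank adapter weights receive gradients, the frozen image encoder and language model stay untouched and the memory footprint of fine-tuning drops considerably.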