# moondream1

a tiny vision language model that kicks ass and runs anywhere

moondream1 is a 1.6B-parameter model built using SigLIP, Phi-1.5, and the LLaVA training dataset. The weights are licensed under CC-BY-SA because the model was trained on LLaVA data. Try it out on Hugging Face Spaces!
## Benchmarks

| Model | Parameters | VQAv2 | GQA | TextVQA |
| --- | --- | --- | --- | --- |
| LLaVA-1.5 | 13.3B | 80.0 | 63.3 | 61.3 |
| LLaVA-1.5 | 7.3B | 78.5 | 62.0 | 58.2 |
| moondream1 | 1.6B | 74.7 | 57.9 | 35.6 |
## Usage

Clone this repository and install the dependencies:

```bash
pip install -r requirements.txt
```
Use the `sample.py` script to run the model on CPU:

```bash
python sample.py --image [IMAGE_PATH] --prompt [PROMPT]
```

When the `--prompt` argument is not provided, the script will allow you to ask questions interactively.
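If you'd rather call the model from your own code, the weights are also published on the Hugging Face Hub. The snippet below is a minimal sketch, assuming the `vikhyat/moondream1` repo exposes its custom `encode_image`/`answer_question` helpers through `trust_remote_code`; the repo id and helper names are assumptions here, so check the model card if they differ.

```python
# Minimal sketch: load moondream1 from the Hugging Face Hub and ask a
# question about an image. The repo id and the encode_image /
# answer_question helpers are assumptions -- verify against the model card.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyat/moondream1"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("demo.jpg")         # any local image
enc_image = model.encode_image(image)  # assumed custom helper
print(model.answer_question(enc_image, "What is in this image?", tokenizer))
```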
## Gradio demo

Use the `gradio_demo.py` script to run the Gradio app:

```bash
python gradio_demo.py
```
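For reference, a Gradio wrapper around a model like this is only a few lines. The sketch below is illustrative rather than the actual contents of `gradio_demo.py`; the `answer` function is a hypothetical stand-in for the real model call.

```python
# Illustrative sketch of a Gradio image-question app; gradio_demo.py in this
# repo may be structured differently.
import gradio as gr

def answer(image, prompt):
    # Hypothetical stand-in: the real demo would run the vision-language
    # model on the uploaded image and prompt here.
    return f"(model answer for: {prompt})"

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Prompt")],
    outputs=gr.Textbox(label="Answer"),
)

demo.launch()
```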
## Limitations
- The model may generate inaccurate statements.
- It may struggle to adhere to intricate or nuanced instructions.
- It is primarily designed to understand English. Informal English, slang, and non-English languages may not work well.
- The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
- The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.