
🌔 moondream

a tiny vision language model that kicks ass and runs anywhere

Website | Hugging Face | Demo

Benchmarks

moondream2 is a 1.86B parameter model initialized with weights from SigLIP and Phi 1.5.

Model                 VQAv2   GQA    TextVQA   POPE            TallyQA
moondream1            74.7    57.9   35.6      -               -
moondream2 (latest)   75.4    59.8   43.1      (coming soon)   (coming soon)

Examples

(example image)

What is the girl doing?
The girl is eating a hamburger.

What color is the girl's hair?
White

(example image)

What is this?
A rack is present in the image, containing various electronic devices. A chair is situated on the left side, and a brick wall is visible in the background.

What is behind the stand?
A brick wall is visible behind the stand.

Usage

Using transformers (recommended)

pip install transformers timm einops

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-03-06"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
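
Because image encoding is separate from question answering, you can encode an image once and ask it several questions. A minimal sketch building on the snippet above (the describe helper, the "images" directory, and the questions are illustrative, not part of the library):

from pathlib import Path

def describe(image_path, questions):
    # Encode the image once and reuse the encoding for every question.
    enc = model.encode_image(Image.open(image_path))
    return [model.answer_question(enc, q, tokenizer) for q in questions]

# Illustrative usage: assumes a local "images" directory of .jpg files.
for path in Path("images").glob("*.jpg"):
    caption, objects = describe(path, ["Describe this image.", "What objects are in this image?"])
    print(path.name, caption, objects)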

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.

To enable Flash Attention on the text model, pass in attn_implementation="flash_attention_2" when instantiating the model.

import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")

Batch inference is also supported.

answers = model.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)
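
The returned answers are in the same order as the inputs, so they can be paired back up with the prompts. A small illustrative follow-up (the prompts list simply repeats the prompts passed above):

prompts = ["Describe this image.", "Are there people in this image?"]
for prompt, answer in zip(prompts, answers):
    # Print each prompt next to the answer the model produced for it.
    print(f"{prompt} -> {answer}")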

Using this repository

Clone this repository and install dependencies.

pip install -r requirements.txt

sample.py provides a CLI interface for running the model. When the --prompt argument is not provided, the script will allow you to ask questions interactively.

python sample.py --image [IMAGE_PATH] --prompt [PROMPT]
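
For example, omitting --prompt starts an interactive question session for the given image (the path placeholder is the same as above):

python sample.py --image [IMAGE_PATH]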

Use the gradio_demo.py script to start a Gradio interface for the model.

python gradio_demo.py

webcam_gradio_demo.py provides a Gradio interface for the model that uses your webcam as input and performs inference in real-time.

python webcam_gradio_demo.py

Limitations

  • The model may generate inaccurate statements and struggle to understand intricate or nuanced instructions.
  • The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
  • The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.
