A single round of dialogue can contain multiple images (or no images):
- Qwen-VL Best Practice
- Qwen-Audio Best Practice
- Deepseek-VL Best Practice
- Internlm-Xcomposer2 Best Practice
- Phi3-Vision Best Practice
A single round of dialogue can only contain one image:
The entire conversation revolves around one image.
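
The three categories above can be read as constraints on how many images each round of a conversation may reference. The following is a minimal sketch of that reading, not the library's own validation logic. The policy names, the `conversation_fits` function, and the representation of a conversation as a list of per-round image lists are all hypothetical. It also assumes that "only one image" means at most one image per round, and that "revolves around one image" means every image reference in the conversation points to the same single image.

```python
from typing import List

# Hypothetical policy names, one per category described above.
MULTI_IMAGE_PER_ROUND = "multi_image_per_round"
ONE_IMAGE_PER_ROUND = "one_image_per_round"
ONE_IMAGE_PER_CONVERSATION = "one_image_per_conversation"


def conversation_fits(rounds: List[List[str]], policy: str) -> bool:
    """Check a conversation against one of the three image policies.

    `rounds` holds one list of image paths per dialogue round
    (an empty list means the round has no image).
    """
    if policy == MULTI_IMAGE_PER_ROUND:
        # Any number of images per round, including none, is accepted.
        return True
    if policy == ONE_IMAGE_PER_ROUND:
        # Assumption: "only one image" is read as "at most one per round".
        return all(len(images) <= 1 for images in rounds)
    if policy == ONE_IMAGE_PER_CONVERSATION:
        # Assumption: every reference must point to the same single image.
        distinct = {image for images in rounds for image in images}
        return len(distinct) == 1
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    sample = [["cat.png", "dog.png"], []]
    for policy in (MULTI_IMAGE_PER_ROUND, ONE_IMAGE_PER_ROUND,
                   ONE_IMAGE_PER_CONVERSATION):
        print(policy, conversation_fits(sample, policy))
```

A check like this could be run over a dataset before training or inference to catch samples that a given model family cannot accept, rather than discovering the mismatch at runtime.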