InternVL series models update (modelscope#1426)
hjh0119 authored Jul 18, 2024
1 parent 0d88ce1 commit e95db47
Showing 9 changed files with 381 additions and 49 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -55,12 +55,13 @@ You can contact us and communicate with us by adding our group:
<img src="asset/discord_qr.jpg" width="200" height="200"> | <img src="asset/wechat.png" width="200" height="200">

## 🎉 News
- 2024.07.17: Support for the newly released InternVL2 models: the `model_type` values are internvl2-1b, internvl2-40b, and internvl2-llama3-76b. For best practices, refer to [here](docs/source_en/Multi-Modal/internvl-best-practice.md).
- 2024.07.17: Support the training and inference of [NuminaMath-7B-TIR](https://huggingface.co/AI-MO/NuminaMath-7B-TIR). Use model_type `numina-math-7b`.
- 🔥2024.07.16: Support exporting to Ollama and bitsandbytes. Use `swift export --model_type xxx --to_ollama true` or `swift export --model_type xxx --quant_method bnb --quant_bits 4`.
- 2024.07.08: Support cogvlm2-video-13b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/cogvlm2-video-best-practice.md).
- 2024.07.08: Support internlm-xcomposer2_5-7b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md).
- 🔥2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. You can refer to [llava-video best practice](docs/source_en/Multi-Modal/llava-video-best-practice.md) for more information.
- 🔥2024.07.06: Support internvl2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
- 🔥2024.07.06: Support InternVL2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
- 2024.07.06: Support codegeex4-9b-chat.
- 2024.07.04: Support internlm2_5-7b series: internlm2_5-7b, internlm2_5-7b-chat, internlm2_5-7b-chat-1m.
- 2024.07.02: Support for using vLLM for accelerating inference and deployment of multimodal large models such as the llava series and phi3-vision models. You can refer to the [Multimodal & vLLM Inference Acceleration Documentation](docs/source_en/Multi-Modal/vllm-inference-acceleration.md) for more information.
@@ -606,7 +607,7 @@ The complete list of supported models and datasets can be found at [Supported Mo
| Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
| Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
| mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
| InternVL<br>Mini-Internvl<br>Internvl2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-40B<br>including quantized version | chat model |
| InternVL<br>Mini-InternVL<br>InternVL2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 1B-40B<br>including quantized version | chat model |
| Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | English | 8B | chat model |
| Phi3-Vision | Microsoft | English | 4B | chat model |
| PaliGemma | Google | English | 3B | chat model |
5 changes: 3 additions & 2 deletions README_CN.md
@@ -56,12 +56,13 @@ SWIFT has rich and comprehensive documentation. Please check our documentation site:


## 🎉 News
- 2024.07.17: Support for the newly released InternVL2 models: the `model_type` values are internvl2-1b, internvl2-40b, and internvl2-llama3-76b. For best practices, refer to [here](docs/source/Multi-Modal/internvl最佳实践.md).
- 2024.07.17: Support the training and inference of [NuminaMath-7B-TIR](https://www.modelscope.cn/models/AI-ModelScope/NuminaMath-7B-TIR). Use model_type `numina-math-7b`.
- 🔥2024.07.16: Support exporting to Ollama and bitsandbytes. Use `swift export --model_type xxx --to_ollama true` or `swift export --model_type xxx --quant_method bnb --quant_bits 4`.
- 2024.07.08: Support cogvlm2-video-13b-chat. You can check the best practice [here](docs/source/Multi-Modal/cogvlm2-video最佳实践.md).
- 2024.07.08: Support internlm-xcomposer2_5-7b-chat. You can check the best practice [here](docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md).
- 🔥2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. You can refer to [llava-video best practice](docs/source/Multi-Modal/llava-video最佳实践.md) for more information.
- 🔥2024.07.06: Support the internvl-2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
- 🔥2024.07.06: Support the InternVL2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
- 2024.07.06: Support codegeex4-9b-chat.
- 2024.07.04: Support the internlm2_5-7b series: internlm2_5-7b, internlm2_5-7b-chat, internlm2_5-7b-chat-1m.
- 2024.07.02: Support using vLLM to accelerate inference and deployment of multimodal large models such as the llava series and phi3-vision models. You can refer to the [Multimodal & vLLM Inference Acceleration documentation](docs/source/Multi-Modal/vLLM推理加速文档.md) for more information.
@@ -600,7 +601,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
| Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
| Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
| mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
| InternVL<br>Mini-Internvl<br>Internvl2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-40B<br>including quantized version | chat model |
| InternVL<br>Mini-InternVL<br>InternVL2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 1B-40B<br>including quantized version | chat model |
| Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | English | 8B | chat model |
| Phi3-Vision | Microsoft | English | 4B | chat model |
| PaliGemma | Google | English | 3B | chat model |
2 changes: 2 additions & 0 deletions docs/source/LLM/支持的模型和数据集.md
@@ -354,11 +354,13 @@
|internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)|
|mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|wqkv|internvl|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)|
|mini-internvl-chat-4b-v1_5|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5/summary)|qkv_proj|internvl-phi3|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)|
|internvl2-1b|[OpenGVLab/InternVL2-1B](https://modelscope.cn/models/OpenGVLab/InternVL2-1B/summary)|q_proj, k_proj, v_proj|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-1B](https://huggingface.co/OpenGVLab/InternVL2-1B)|
|internvl2-2b|[OpenGVLab/InternVL2-2B](https://modelscope.cn/models/OpenGVLab/InternVL2-2B/summary)|wqkv|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B)|
|internvl2-4b|[OpenGVLab/InternVL2-4B](https://modelscope.cn/models/OpenGVLab/InternVL2-4B/summary)|qkv_proj|internvl2-phi3|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B)|
|internvl2-8b|[OpenGVLab/InternVL2-8B](https://modelscope.cn/models/OpenGVLab/InternVL2-8B/summary)|wqkv|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B)|
|internvl2-26b|[OpenGVLab/InternVL2-26B](https://modelscope.cn/models/OpenGVLab/InternVL2-26B/summary)|wqkv|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-26B](https://huggingface.co/OpenGVLab/InternVL2-26B)|
|internvl2-40b|[OpenGVLab/InternVL2-40B](https://modelscope.cn/models/OpenGVLab/InternVL2-40B/summary)|q_proj, k_proj, v_proj|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-40B](https://huggingface.co/OpenGVLab/InternVL2-40B)|
|internvl2-llama3-76b|[OpenGVLab/InternVL2-Llama3-76B](https://modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B/summary)|q_proj, k_proj, v_proj|internvl2|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-Llama3-76B](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B)|
|deepseek-vl-1_3b-chat|[deepseek-ai/deepseek-vl-1.3b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|&#x2714;|&#x2718;||vision|[deepseek-ai/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-1.3b-chat)|
|deepseek-vl-7b-chat|[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|&#x2714;|&#x2718;||vision|[deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)|
|paligemma-3b-pt-224|[AI-ModelScope/paligemma-3b-pt-224](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-pt-224/summary)|q_proj, k_proj, v_proj|paligemma|&#x2714;|&#x2718;|transformers>=4.41|vision|[google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224)|
113 changes: 109 additions & 4 deletions docs/source/Multi-Modal/internvl最佳实践.md
@@ -6,18 +6,41 @@
- [internvl-chat-v1_5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)
- [mini-internvl-chat-2b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)
- [mini-internvl-chat-4b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)
- [internvl2-1b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-1B)
- [internvl2-2b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-2B)
- [internvl2-4b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-4B)
- [internvl2-8b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-8B)
- [internvl2-26b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-26B)
- [internvl2-40b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-40B)
- [internvl2-llama3-76b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B)


The following walkthrough uses `internvl-chat-v1_5` as an example; you can switch to another model by specifying `--model_type`.

**FAQ**

1. **The model reports `The request model does not exist!`**
This usually happens when trying to use a mini-internvl or InternVL2 model, because the corresponding models on ModelScope are gated and require an access application. To resolve it, log in to ModelScope, go to the model page, and **apply for download**. Once approved, you can obtain the model in either of the following ways (see the first sketch after this FAQ):
- Download the model locally with `snapshot_download` (the model page's download section shows the corresponding code), then point `--model_id_or_path` to the local model directory
- Get your account's SDK token from the [ModelScope account page](https://www.modelscope.cn/my/myaccesstoken) and pass it via the `--hub_token` argument or the `MODELSCOPE_API_TOKEN` environment variable

You can also set the environment variable `USE_HF` to download the model from Hugging Face instead.

2. **When running the model on multiple GPUs, why is memory distributed unevenly across the cards, causing OOM?**
The auto device map algorithm in transformers does not handle multimodal models well, which can lead to uneven memory allocation across GPUs (see the second sketch after this FAQ).
- You can cap the memory used on each card with `--device_max_memory`, e.g. in a four-GPU environment: `--device_max_memory 15GB 15GB 15GB 15GB`
- Or explicitly specify the device map via `--device_map_config_path`

3. **Differences between the InternVL2 models and the previous generation (InternVL-V1.5 and Mini-InternVL)**
- InternVL2 models support multi-round, multi-image inference and training, i.e. multi-round conversations may contain images, and text and images can be interleaved within a single round; see [Custom Dataset](#custom-dataset) and the InternVL2 part of the inference section. The previous models support multi-round conversations, but only a single round may contain images
- InternVL2 models support video input; see [Custom Dataset](#custom-dataset) for the exact format
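
A minimal sketch of the first workaround, assuming your ModelScope account has already been approved for the gated model (the model id, token placeholder, and swift flags shown here are illustrative, not prescriptive):

```python
# Sketch: pre-download a gated InternVL2 checkpoint from ModelScope with
# snapshot_download, then hand the local directory to swift.
# Assumes access to the model has already been granted to your account.
import os
import subprocess

from modelscope import snapshot_download

# Optional: authenticate non-interactively with your SDK token.
os.environ.setdefault('MODELSCOPE_API_TOKEN', '<your-sdk-token>')  # placeholder

model_dir = snapshot_download('OpenGVLab/InternVL2-2B')  # returns the local cache directory

# Point swift at the local files (same effect as passing --model_id_or_path on the CLI).
subprocess.run([
    'swift', 'infer',
    '--model_type', 'internvl2-2b',
    '--model_id_or_path', model_dir,
], check=True)
```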
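
For the second issue, the per-card cap that `--device_max_memory` sets corresponds to the `max_memory` argument of transformers' auto device map. The sketch below only illustrates that underlying mechanism (the checkpoint, cap values, and dtype are assumptions); it is not swift's actual code path:

```python
# Sketch: shard a large multimodal model across GPUs while capping the memory
# budget of each card, mirroring what --device_max_memory configures.
import torch
from transformers import AutoModel

# e.g. four cards limited to 15GiB each, with optional spill-over to CPU RAM
max_memory = {i: '15GiB' for i in range(torch.cuda.device_count())}
max_memory['cpu'] = '60GiB'

model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2-26B',   # example checkpoint
    torch_dtype=torch.bfloat16,
    device_map='auto',           # let accelerate place the modules
    max_memory=max_memory,       # cap per-device memory
    trust_remote_code=True,      # InternVL2 ships custom modeling code
)

# Inspect how layers were spread across the cards.
print(model.hf_device_map)
```

If you need full control, an explicit module-to-device mapping of this kind is what `--device_map_config_path` lets you supply.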


## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Custom Dataset](#custom-dataset)
- [Inference After Fine-tuning](#inference-after-fine-tuning)


@@ -49,7 +72,7 @@ CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internvl-chat-v1_5 --dtype bf1
```

Output: (local paths and URLs are supported)
```python
```
"""
<<< Who are you?
Input a media path or URL <<<
@@ -107,6 +130,64 @@ ui功能了。
"""
```

For the **InternVL2** series models, multi-round multi-image inference is supported, and images and text can be interleaved within a single round; mark the position of each image in the input with the `<image>` tag.
Multi-round, multi-image example
```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type internvl2-2b
```

```
<<< <image>Describe the image
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a very cute kitten. Its fur is mostly white, with black and gray stripes on its face and ears. The cat's eyes are large and bright, with blue pupils that look very clear and innocent. Its ears are upright with light-colored tips. The kitten looks lively, and its gaze conveys curiosity and friendliness. The background is blurred, probably to highlight the cat itself. The overall picture feels warm and adorable.
--------------------------------------------------
<<< <image>Describe the differences from the previous image
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
Compared with the previous image, this one differs in several notable ways:
1. **Different animal species**:
   - The animal in the previous image is a cat, while the animal in this image is a sheep.
2. **Different appearance and features**:
   - The cat has distinctly upright ears and blue eyes, while the sheep has long upright horns and white wool.
3. **Different background and environment**:
   - The previous image has a blurred background with some clouds in the sky, while this image has a clear background with green hills, a blue sky and white clouds.
4. **Different expression and posture**:
   - The cat looks alert and curious, while the sheep looks gentle and friendly.
5. **Different overall style**:
   - The previous image has a more realistic, finely detailed painting style, while this image has a cartoonish, simplified style.
These differences make this image visually distinct from the previous one.
--------------------------------------------------
<<< What do the two images have in common?
Input an image path or URL <<<
The two images have the following in common:
1. **Different animal species**:
   - The animals in both images are sheep.
2. **Different background and environment**:
   - The backgrounds of both images are natural scenery, including green hills, a blue sky and white clouds.
3. **Different expression and posture**:
   - The cat and the sheep have different expressions and postures, but both look very cute and friendly.
4. **Different overall style**:
   - The two images differ somewhat in style, but both have a cartoonish, simplified look.
These commonalities make the two images visually quite different, yet both present a lovable animal.
```

Single-round, multi-image example
```
<<< image1: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img> image2: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img> What is the difference between the two images?
Input an image path or URL <<<
The two images are of the same kitten, but the first image is a close-up shot, while the second image is a more distant, artistic illustration. The close-up image captures the kitten in detail, showing its fur, eyes, and facial features in sharp focus. In contrast, the artistic illustration is more abstract and stylized, with a blurred background and a different color palette. The distant illustration gives the kitten a more whimsical and dreamy appearance, while the close-up image emphasizes the kitten's realism and detail.
```

The example images are shown below:

cat:
@@ -134,6 +215,7 @@ ocr:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
# os.environ['MODELSCOPE_API_TOKEN'] = 'Your API Token' # If the message "The request model does not exist!" appears.

from swift.llm import (
get_model_tokenizer, get_template, inference,
@@ -142,6 +224,7 @@ from swift.llm import (
from swift.utils import seed_everything
import torch


model_type = "internvl-chat-v1_5"
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
@@ -244,17 +327,39 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--sft_type full \
```


## Custom Dataset
[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support json and jsonl formats. Below are examples of custom datasets:

(Multi-round conversations are supported, but only one image may appear in the whole conversation; local paths or URLs can be passed)
(Multi-round conversations are supported; images can be passed as local paths or URLs, and multiple images are separated by commas ',')

```jsonl
{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path1", "image_path2"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]], "images": ["image_path"]}
```

(Plain-text data is also supported)
```jsonl
{"query": "55555", "response": "66666"}
{"query": "eeeee", "response": "fffff", "history": []}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
```

**InternVL2** models support multi-image, multi-round training. Use the `<image>` tag to mark where each image appears in the conversation; if the dataset contains no `<image>` tag, the images are placed at the beginning of the query in the last round by default.
```jsonl
{"query": "Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.", "response": "xxxxxxxxx", "history": [["<image> Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]], "images": ["image_path1", "image_path2", "image_path3"]}
```
Alternatively, use `<img>image_path</img>` to indicate both the image path and its position in the text
```jsonl
{"query": "Image-1: <img>img_path</img>\n Image-2: <img>img_path2</img>\n Describe the two images in detail.", "response": "xxxxxxxxx", "history": [["<img>img_path3</img> Describe the image", "xxxxxxx"], ["CCCCC", "DDDDD"]], }
```

**InternVL2** models also support training on video datasets; no tag is required.
```jsonl
{"query": "Describe this video in detail. Don't repeat", "response": "xxxxxxxxx", "history": [], "videos": ["video_path"]}
```
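
As a small illustrative sketch (file name, paths, and sample texts are placeholders, not taken from the original document), a custom jsonl file in the formats above could be generated like this:

```python
# Sketch: write a custom multimodal jsonl dataset in the formats described above.
import json

samples = [
    # single image, single round
    {"query": "<image>Describe the image", "response": "xxxx",
     "images": ["/path/to/cat.jpg"]},
    # multi-image, multi-round: one <image> tag per image, in conversation order
    {"query": "Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.",
     "response": "xxxx",
     "history": [["<image> Describe the image", "xxxx"]],
     "images": ["/path/to/img1.jpg", "/path/to/img2.jpg", "/path/to/img3.jpg"]},
    # video sample (InternVL2 only), no tag required
    {"query": "Describe this video in detail.", "response": "xxxx",
     "videos": ["/path/to/clip.mp4"]},
]

with open('my_dataset.jsonl', 'w', encoding='utf-8') as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + '\n')
```

Point the fine-tuning command at the resulting file using the custom-dataset arguments described in the linked document above.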

## Inference After Fine-tuning
Direct inference: