update docs (specific model arguments) (#2822)
Jintao-Huang authored Dec 31, 2024
1 parent 054ae1a commit d87d8ed
Showing 4 changed files with 130 additions and 10 deletions.
64 changes: 62 additions & 2 deletions docs/source/Instruction/命令行参数.md
@@ -1,14 +1,14 @@
# 命令行参数

命令行参数的介绍会分为基本参数,原子参数和集成参数。命令行最终使用的参数列表为集成参数。集成参数继承自基本参数和一些原子参数。
命令行参数的介绍会分为基本参数、原子参数、集成参数和特定模型参数。命令行最终使用的参数列表为集成参数。集成参数继承自基本参数和一些原子参数。特定模型参数是针对具体模型的参数,可以通过`--model_kwargs`或者环境变量进行设置。

## 基本参数

- 🔥tuner_backend: 可选为'peft', 'unsloth', 默认为'peft'
- 🔥train_type: 默认为'lora'. 可选为: 'lora', 'full', 'longlora', 'adalora', 'llamapro', 'adapter', 'vera', 'boft', 'fourierft', 'reft'
- 🔥adapters: 用于指定adapter的id/path的list,默认为`[]`.
- seed: 默认为42
- model_kwargs: 特定模型可传入的额外参数. 该参数列表会在训练推理时打印日志进行提示
- model_kwargs: 特定模型可传入的额外参数. 该参数列表会在训练推理时打印日志进行提示,例如`--model_kwargs '{"fps_max_frames": 12}'`
- load_args: 当指定`--resume_from_checkpoint`, `--model`, `--adapters`会读取保存文件中的`args.json`,将默认为None的`基本参数`(除去数据参数和生成参数)进行赋值(可通过手动传入进行覆盖)。默认为True
- load_data_args: 如果将该参数设置为True, 则会额外读取数据参数. 默认为False
- use_hf: 默认为False. 控制模型下载、数据集下载、模型push的hub
@@ -392,3 +392,63 @@ App参数继承于[部署参数](#部署参数), [Web-UI参数](#Web-UI参数)
- hub_model_id: 推送的model_id,默认为None
- hub_private_repo: 是否是private repo,默认为False
- commit_message: 提交信息,默认为'update files'


## 特定模型参数
特定模型参数可以通过`--model_kwargs`或者环境变量进行设置,例如: `--model_kwargs '{"fps_max_frames": 12}'`或者`FPS_MAX_FRAMES=12`

### qwen2_vl, qvq
参数含义可以查看[这里](https://github.com/QwenLM/Qwen2-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L24)

- IMAGE_FACTOR: 默认为28
- MIN_PIXELS: 默认为`4 * 28 * 28`
- MAX_PIXELS: 默认为`16384 * 28 * 28`,参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/ocr.sh#L3)
- MAX_RATIO: 默认为200
- VIDEO_MIN_PIXELS: 默认为`128 * 28 * 28`
- VIDEO_MAX_PIXELS: 默认为`768 * 28 * 28`,参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/video.sh#L7)
- VIDEO_TOTAL_PIXELS: 默认为`24576 * 28 * 28`
- FRAME_FACTOR: 默认为2
- FPS: 默认为2.0
- FPS_MIN_FRAMES: 默认为4
- FPS_MAX_FRAMES: 默认为768,参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/video.sh#L8)

### internvl, internvl_phi3
参数含义可以查看[这里](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)
- MAX_NUM: 默认为12
- INPUT_SIZE: 默认为448

### internvl2, internvl2_phi3, internvl2_5
- MAX_NUM: 默认为12
- INPUT_SIZE: 默认为448
- VIDEO_MAX_NUM: 默认为1,即视频的MAX_NUM
- VIDEO_SEGMENTS: 默认为8


### minicpmv2_6
- MAX_SLICE_NUMS: 默认为9,参考[这里](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6/file/view/master?fileName=config.json&status=1)
- VIDEO_MAX_SLICE_NUMS: 默认为1,视频的MAX_SLICE_NUMS,参考[这里](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6)
- MAX_NUM_FRAMES: 默认为64,参考[这里](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6)

### ovis1_6
- MAX_PARTITION: 参考[这里](https://github.com/AIDC-AI/Ovis/blob/d248e34d755a95d24315c40e2489750a869c5dbc/ovis/model/modeling_ovis.py#L312)

### mplug_owl3, mplug_owl3_241101
- MAX_NUM_FRAMES: 默认为16,参考[这里](https://modelscope.cn/models/iic/mPLUG-Owl3-7B-240728)

### xcomposer2_4khd
- HD_NUM: 默认为55,参考[这里](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-4khd-7b)

### xcomposer2_5
- HD_NUM: 图片数量为1时,默认值为24;大于1时,默认为6。参考[这里](https://modelscope.cn/models/AI-ModelScope/internlm-xcomposer2d5-7b/file/view/master?fileName=modeling_internlm_xcomposer2.py&status=1#L254)

### video_cogvlm2
- NUM_FRAMES: 默认为24,参考[这里](https://github.com/THUDM/CogVLM2/blob/main/video_demo/inference.py#L22)

### phi3_vision
- NUM_CROPS: 默认为4,参考[这里](https://modelscope.cn/models/LLM-Research/Phi-3.5-vision-instruct)

### llama3_1_omni
- N_MELS: 默认为128,参考[这里](https://github.com/ictnlp/LLaMA-Omni/blob/544d0ff3de8817fdcbc5192941a11cf4a72cbf2b/omni_speech/infer/infer.py#L57)

### video_llava
- NUM_FRAMES: 默认为16
63 changes: 61 additions & 2 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -1,14 +1,14 @@
# Command Line Parameters

The introduction to command line parameters will cover base arguments, atomic arguments, and integration arguments. The final list of arguments used in the command line is the integration arguments. The integration arguments inherit from the base arguments and some atomic arguments.
The introduction to command line parameters covers base arguments, atomic arguments, integrated arguments, and specific model arguments. The final list of arguments used on the command line is the integrated arguments. Integrated arguments inherit from the base arguments and some atomic arguments. Specific model arguments are designed for specific models and can be set using `--model_kwargs` or environment variables.

## Base Arguments

- 🔥tuner_backend: Optional values are 'peft' and 'unsloth', default is 'peft'
- 🔥train_type: Default is 'lora'. Optional values: 'lora', 'full', 'longlora', 'adalora', 'llamapro', 'adapter', 'vera', 'boft', 'fourierft', 'reft'
- 🔥adapters: A list used to specify the ID/path of the adapter, default is `[]`.
- seed: Default is 42
- model_kwargs: Extra parameters specific to the model. This parameter list will be logged during training for reference.
- model_kwargs: Extra parameters specific to the model. This parameter list will be logged during training for reference, for example, `--model_kwargs '{"fps_max_frames": 12}'`.
- load_args: When `--resume_from_checkpoint`, `--model`, or `--adapters` is specified, it will read the `args.json` file from the saved checkpoint and assign values to the `BaseArguments` that are defaulted to None (excluding DataArguments and GenerationArguments). These can be overridden by manually passing in values. The default is `True`.
- load_data_args: If this parameter is set to True, it will additionally read the data parameters. The default is `False`.
- use_hf: Default is False. Controls model and dataset downloading, and model pushing to the hub.
@@ -392,3 +392,62 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge argum
- hub_model_id: Model ID for pushing, default is None.
- hub_private_repo: Whether it is a private repo, default is False.
- commit_message: Commit message, default is 'update files'.

## Specific Model Arguments

Specific model arguments can be set using `--model_kwargs` or environment variables, for example: `--model_kwargs '{"fps_max_frames": 12}'` or `FPS_MAX_FRAMES=12`.
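How `--model_kwargs` and environment variables interact is not spelled out above; the sketch below illustrates one plausible resolution order (explicit `--model_kwargs` first, then the uppercase environment variable, then the built-in default). The helper name `resolve_model_arg` is invented for illustration only; ms-swift's own helper is `get_env_args`, whose exact behavior may differ.

```python
import json
import os


def resolve_model_arg(name, type_fn, default, model_kwargs=None):
    # Hypothetical resolution order for a specific model argument:
    # 1. an explicit --model_kwargs entry (lowercase key),
    # 2. the uppercase environment variable,
    # 3. the built-in default.
    if model_kwargs and name in model_kwargs:
        return type_fn(model_kwargs[name])
    env_value = os.environ.get(name.upper())
    if env_value is not None:
        return type_fn(env_value)
    return default


# `--model_kwargs '{"fps_max_frames": 12}'` arrives as a JSON dict:
kwargs = json.loads('{"fps_max_frames": 12}')
print(resolve_model_arg('fps_max_frames', int, 768, kwargs))  # 12

# With no override, the documented default applies:
print(resolve_model_arg('fps_max_frames', int, 768))  # 768
```

Setting `FPS_MAX_FRAMES=12` in the environment would have the same effect as the `--model_kwargs` form, per the example above.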

### qwen2_vl, qvq
For the meaning of the arguments, please refer to [here](https://github.com/QwenLM/Qwen2-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L24)

- IMAGE_FACTOR: Default is 28
- MIN_PIXELS: Default is `4 * 28 * 28`
- MAX_PIXELS: Default is `16384 * 28 * 28`, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/ocr.sh#L3)
- MAX_RATIO: Default is 200
- VIDEO_MIN_PIXELS: Default is `128 * 28 * 28`
- VIDEO_MAX_PIXELS: Default is `768 * 28 * 28`, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/video.sh#L7)
- VIDEO_TOTAL_PIXELS: Default is `24576 * 28 * 28`
- FRAME_FACTOR: Default is 2
- FPS: Default is 2.0
- FPS_MIN_FRAMES: Default is 4
- FPS_MAX_FRAMES: Default is 768, refer to [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/video.sh#L8)
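These pixel bounds interact through the resizing step in the linked `vision_process.py`: both sides are snapped to multiples of IMAGE_FACTOR and the total pixel count is pulled into `[MIN_PIXELS, MAX_PIXELS]`, with overly elongated images rejected via MAX_RATIO. A simplified sketch of that logic (the upstream implementation may differ in rounding details):

```python
import math


def smart_resize(height, width, factor=28,
                 min_pixels=4 * 28 * 28, max_pixels=16384 * 28 * 28,
                 max_ratio=200):
    # Reject images whose aspect ratio exceeds MAX_RATIO.
    if max(height, width) / min(height, width) > max_ratio:
        raise ValueError('aspect ratio exceeds MAX_RATIO')
    # Round each side to the nearest multiple of the factor.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    # Scale down if over the pixel budget, up if under it.
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


h, w = smart_resize(1080, 1920)
# Both sides are multiples of 28 and the pixel budget is respected.
```

Lowering MAX_PIXELS (as in the linked `ocr.sh`) therefore trades resolution for fewer visual tokens.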

### internvl, internvl_phi3
For the meaning of the arguments, please refer to [here](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)
- MAX_NUM: Default is 12
- INPUT_SIZE: Default is 448

### internvl2, internvl2_phi3, internvl2_5
- MAX_NUM: Default is 12
- INPUT_SIZE: Default is 448
- VIDEO_MAX_NUM: Default is 1, which is the MAX_NUM for videos
- VIDEO_SEGMENTS: Default is 8

### minicpmv2_6
- MAX_SLICE_NUMS: Default is 9, refer to [here](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6/file/view/master?fileName=config.json&status=1)
- VIDEO_MAX_SLICE_NUMS: Default is 1, which is the MAX_SLICE_NUMS for videos, refer to [here](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6)
- MAX_NUM_FRAMES: Default is 64, refer to [here](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2_6)

### ovis1_6
- MAX_PARTITION: Refer to [here](https://github.com/AIDC-AI/Ovis/blob/d248e34d755a95d24315c40e2489750a869c5dbc/ovis/model/modeling_ovis.py#L312)

### mplug_owl3, mplug_owl3_241101
- MAX_NUM_FRAMES: Default is 16, refer to [here](https://modelscope.cn/models/iic/mPLUG-Owl3-7B-240728)

### xcomposer2_4khd
- HD_NUM: Default is 55, refer to [here](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-4khd-7b)

### xcomposer2_5
- HD_NUM: Default is 24 when the number of images is 1. Greater than 1, the default is 6. Refer to [here](https://modelscope.cn/models/AI-ModelScope/internlm-xcomposer2d5-7b/file/view/master?fileName=modeling_internlm_xcomposer2.py&status=1#L254)

### video_cogvlm2
- NUM_FRAMES: Default is 24, refer to [here](https://github.com/THUDM/CogVLM2/blob/main/video_demo/inference.py#L22)

### phi3_vision
- NUM_CROPS: Default is 4, refer to [here](https://modelscope.cn/models/LLM-Research/Phi-3.5-vision-instruct)

### llama3_1_omni
- N_MELS: Default is 128, refer to [here](https://github.com/ictnlp/LLaMA-Omni/blob/544d0ff3de8817fdcbc5192941a11cf4a72cbf2b/omni_speech/infer/infer.py#L57)

### video_llava
- NUM_FRAMES: Default is 16
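Frame-count arguments such as NUM_FRAMES and MAX_NUM_FRAMES typically bound uniform sampling over the decoded video. A hypothetical sketch of that pattern (not taken from any of the linked implementations):

```python
def sample_frame_indices(total_frames, num_frames=16):
    # Pick `num_frames` evenly spaced frame indices; if the clip has
    # fewer frames than the budget, keep every frame.
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


print(sample_frame_indices(160))  # 16 indices: 0, 10, 20, ..., 150
```

Raising the frame budget increases temporal coverage at the cost of more visual tokens per video.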
5 changes: 4 additions & 1 deletion swift/llm/template/template/internvl.py
@@ -136,7 +136,10 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         if images:
             has_video = bool(inputs.videos)
             input_size = get_env_args('input_size', int, 448)
-            max_num = get_env_args('max_num', int, 1 if has_video else 12)
+            max_num = get_env_args('max_num', int, 12)
+            video_max_num = get_env_args('video_max_num', int, 1)
+            if has_video:
+                max_num = video_max_num
             pixel_values = [transform_image(image, input_size, max_num) for image in images]
             num_patches = [pv.shape[0] for pv in pixel_values]
             pixel_values = torch.cat(pixel_values).to(self.config.torch_dtype)
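The effect of this change: images keep a MAX_NUM default of 12, while videos now read a separate VIDEO_MAX_NUM (default 1) instead of the hard-coded `1 if has_video else 12`. A standalone sketch of the resulting precedence, with the env-var reading simplified relative to the repo's `get_env_args` (which also accepts `--model_kwargs`):

```python
import os


def effective_max_num(has_video: bool) -> int:
    # Images: MAX_NUM env var, falling back to 12.
    # Videos: VIDEO_MAX_NUM env var, falling back to 1.
    max_num = int(os.environ.get('MAX_NUM', 12))
    video_max_num = int(os.environ.get('VIDEO_MAX_NUM', 1))
    return video_max_num if has_video else max_num


print(effective_max_num(False))  # 12 unless MAX_NUM is set
print(effective_max_num(True))   # 1 unless VIDEO_MAX_NUM is set
```

This lets video and image inputs be tuned independently, since videos multiply the patch budget by the number of sampled frames.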
8 changes: 3 additions & 5 deletions swift/llm/template/template/minicpm.py
@@ -174,13 +174,11 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         use_video = bool(inputs.videos)
         is_plain_text = not images and not use_video
         use_image_id = True
-        max_slice_nums = None
-
+        max_slice_nums = get_env_args('max_slice_nums', int, None)
+        video_max_slice_nums = get_env_args('video_max_slice_nums', int, 1)  # or 2
         if use_video:
+            max_slice_nums = video_max_slice_nums
             use_image_id = False
-            max_slice_nums = 1  # or 2
-
-        max_slice_nums = get_env_args('max_slice_nums', int, max_slice_nums)
         input_ids = encoded['input_ids']
         labels = encoded['labels']
         idx_list = findall(input_ids, -100)
