support transformers==4.41 (modelscope#979)
Jintao-Huang authored May 22, 2024
1 parent 14a5283 commit d1224e0
Showing 6 changed files with 48 additions and 24 deletions.
10 changes: 5 additions & 5 deletions docs/source/LLM/命令行参数.md
@@ -39,15 +39,15 @@
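- Support for dataset_path, e.g. '1.jsonl#5000'. (If it is a relative path, it is resolved relative to the running directory.)
- `--val_dataset`: Specifies a separate validation set, in the same format as the `dataset` parameter. If this parameter is used, `dataset_test_ratio` no longer takes effect.
- `--dataset_seed`: Seed for dataset processing, default is `42`. It exists as a random_state and does not affect the global seed.
- `--dataset_test_ratio`: Ratio for splitting each sub-dataset into training and validation sets, default is `0.01`.
- `--dataset_test_ratio`: Ratio for splitting each sub-dataset into training and validation sets, default is `0.01`. If `--val_dataset` is set, this parameter has no effect.
- `--train_dataset_sample`: Number of samples drawn from the training set, default is `-1`, i.e. train on the complete training set. This parameter is deprecated, please use `--dataset {dataset_name}#{dataset_sample}`.
- `--val_dataset_sample`: Sampling of the validation set, default is `None`, which automatically selects a suitable number of samples for validation. If you specify `-1`, the complete validation set is used. This parameter is deprecated; the validation set size is controlled entirely by `dataset_test_ratio`.
- `--val_dataset_sample`: Sampling of the validation set, default is `None`, which automatically selects a suitable number of samples for validation. If you specify `-1`, the complete validation set is used. This parameter is deprecated; the validation set size is controlled by `--dataset_test_ratio` or `--val_dataset {dataset_name}#{dataset_sample}`.
- `--system`: The system used in the dialogue template, default is `None`, i.e. the model's default system is used. If set to '', no system is used.
- `--max_length`: Maximum token length, default is `2048`. This avoids OOM caused by individual overly long samples. With `--truncation_strategy delete`, a sample longer than max_length is deleted. With `--truncation_strategy truncation_left`, the leading tokens are cut off: `input_ids[-max_length:]`. If set to -1, there is no limit.
- `--truncation_strategy`: Default is `'delete'`, which removes sentences longer than max_length from the dataset. `'truncation_left'` cuts the excess text off from the left; this may cut special tokens and hurt performance, so it is not recommended.
- `--check_dataset_strategy`: Default is `'none'`, i.e. no checking. If you are training an LLM, `'warning'` is the recommended data check strategy. If your training target is a task such as sentence classification, `'none'` is recommended.
- `--custom_train_dataset_path`: Default is `[]`. This parameter is deprecated, please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default is `[]`. This parameter is deprecated. Training and validation sets are no longer distinguished, and the split is done uniformly with `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}`.
- `--self_cognition_sample`: Number of samples drawn from the self-cognition dataset, default is `0`. If you set this value to >0, you must also specify `--model_name` and `--model_author`. This parameter is deprecated, please use `--dataset self-cognition#{self_cognition_sample}`.
- `--model_name`: Default is `[None, None]`. If self-cognition dataset sampling is enabled (i.e. `--dataset self-cognition` is specified or self_cognition_sample>0), you need to pass two values, the Chinese and English names of the model. For example: `--model_name 小黄 'Xiao Huang'`. To learn more, see [Self-Cognition Fine-tuning Best Practices](自我认知微调最佳实践.md).
- `--model_author`: Default is `[None, None]`. If self-cognition dataset sampling is enabled, you need to pass two values, the Chinese and English names of the author. For example: `--model_author 魔搭 ModelScope`.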
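
As an illustration of the `{dataset_name}#{dataset_sample}` form used by `--dataset` and `--val_dataset` above, here is a minimal sketch of how such a spec could be split. `parse_dataset_spec` is a hypothetical helper written for this note, not a function from swift, and the actual parsing in the library may differ.

```python
from typing import Optional, Tuple


def parse_dataset_spec(spec: str) -> Tuple[str, Optional[int]]:
    """Split '<name-or-path>#<sample>' into (name, sample).

    Hypothetical helper for illustration only; not part of swift.
    """
    # Split on the last '#' so that earlier characters stay in the name.
    name, sep, sample = spec.rpartition('#')
    if not sep:
        # No '#' present: no sample count was requested.
        return spec, None
    return name, int(sample)


print(parse_dataset_spec('1.jsonl#5000'))
print(parse_dataset_spec('self-cognition'))
```

A spec without `#` is read here as "use the whole dataset", which matches the documented default of the deprecated `--train_dataset_sample -1`.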
@@ -241,14 +241,14 @@ dpo parameters inherit from sft parameters, with the following added parameters:
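- `--dtype`: Default is `'AUTO'`, see `sft.sh command line arguments` for parameter details.
- `--dataset`: Default is `[]`, see `sft.sh command line arguments` for parameter details.
- `--dataset_seed`: Default is `42`, see `sft.sh command line arguments` for parameter details.
- `--dataset_test_ratio`: Default is `None`. If `--load_dataset_config true` is set, the dataset_test_ratio from training is used, otherwise it is set to 1. See `sft.sh command line arguments` for parameter details.
- `--dataset_test_ratio`: Default is `0.01`. See `sft.sh command line arguments` for parameter details.
- `--show_dataset_sample`: Number of validation set samples to evaluate and display, default is `10`.
- `--system`: Default is `None`. See `sft.sh command line arguments` for parameter details.
- `--max_length`: Default is `-1`. See `sft.sh command line arguments` for parameter details.
- `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for parameter details.
- `--check_dataset_strategy`: Default is `'none'`, see `sft.sh command line arguments` for parameter details.
- `--custom_train_dataset_path`: Default is `[]`. This parameter is deprecated, please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default is `[]`. This parameter is deprecated. Training and validation sets are no longer distinguished, and the split is done uniformly with `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}`.
- `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for parameter details.
- `--quant_method`: Quantization method, default is `None`. You can choose from 'bnb', 'hqq', 'eetq'.
- `--hqq_axis`: hqq quantization parameter, the axis along which grouping is performed. Default is `0`, supported values are `0` and `1`.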
10 changes: 5 additions & 5 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -37,16 +37,16 @@
- Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
- `--val_dataset`: Specify separate validation datasets with the same format of the `dataset` argument. If using `val_dataset`, the `dataset_test_ratio` will be ignored.
- `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as random_state, does not affect global seed.
- `--dataset_test_ratio`: Ratio for splitting subdataset into train and validation sets, default is `0.01`.
- `--dataset_test_ratio`: Used to specify the ratio for splitting the sub-dataset into training and validation sets. The default value is `0.01`. If `--val_dataset` is set, this parameter becomes ineffective.
- `--train_dataset_sample`: The number of samples for the training dataset, default is `-1`, which means using the complete training dataset for training. This parameter is deprecated, please use `--dataset {dataset_name}#{dataset_sample}` instead.
- `--val_dataset_sample`: Sampling for the validation dataset, default is `None`, which automatically selects an appropriate number of samples for validation. If you specify `-1`, it uses the complete validation dataset for validation. This parameter is deprecated, and the number of samples in the validation dataset is fully controlled by dataset_test_ratio.
- `--val_dataset_sample`: Used to sample the validation set, with a default value of `None`, which automatically selects a suitable number of data samples for validation. If you specify `-1`, the complete validation set is used for validation. This parameter is deprecated and the number of samples in the validation set is controlled by `--dataset_test_ratio` or `--val_dataset {dataset_name}#{dataset_sample}`.
- `--system`: System used in dialogue template, default is `None`, i.e. use the model's default system. If set to '', no system is used.
- `--max_length`: Maximum token length, default is `2048`. Avoids OOM issues caused by individual overly long samples. When `--truncation_strategy delete` is specified, samples exceeding max_length will be deleted. When `--truncation_strategy truncation_left` is specified, the leftmost tokens will be truncated: `input_ids[-max_length:]`. If set to -1, no limit.
- `--truncation_strategy`: Default is `'delete'` which removes sentences exceeding max_length from dataset. `'truncation_left'` will truncate excess text from the left, which may truncate special tokens and affect performance, not recommended.
- `--check_dataset_strategy`: Default is `'none'`, i.e. no checking. If training an LLM model, `'warning'` is recommended as data check strategy. If your training target is sentence classification etc., setting to `'none'` is recommended.

- `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default value is `[]`. This parameter has been deprecated. There is no longer a distinction between training and validation datasets, and the split is now unified using `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default value is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}` instead.
- `--self_cognition_sample`: The number of samples for the self-cognition dataset. Default is `0`. If you set this value to >0, you need to specify `--model_name` and `--model_author` at the same time. This parameter has been deprecated, please use `--dataset self-cognition#{self_cognition_sample}` instead.
- `--model_name`: Default value is `[None, None]`. If self-cognition dataset sampling is enabled (i.e., specifying `--dataset self-cognition` or self_cognition_sample>0), you need to provide two values, representing the Chinese and English names of the model, respectively. For example: `--model_name 小黄 'Xiao Huang'`. If you want to learn more, you can refer to the [Self-Cognition Fine-tuning Best Practices](Self-cognition-best-practice.md).
- `--model_name`: Default is `[None, None]`. If self-cognition dataset sampling is enabled (i.e. self_cognition_sample>0), you need to pass two values, representing the model's Chinese and English names respectively. E.g. `--model_name 小黄 'Xiao Huang'`.
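
The interaction between `--max_length` and `--truncation_strategy` described above can be sketched as follows. `apply_truncation` is a hypothetical function written for this note; only the `input_ids[-max_length:]` slice and the `delete` / `truncation_left` names come from the documentation, and the real preprocessing code may be organised differently.

```python
from typing import List, Optional


def apply_truncation(input_ids: List[int], max_length: int,
                     strategy: str = 'delete') -> Optional[List[int]]:
    """Return the (possibly truncated) ids, or None if the sample is dropped.

    Hypothetical sketch of the documented behaviour; not swift's own code.
    """
    # -1 means no limit; short samples pass through unchanged.
    if max_length == -1 or len(input_ids) <= max_length:
        return input_ids
    if strategy == 'delete':
        # The over-long sample is removed from the dataset.
        return None
    if strategy == 'truncation_left':
        # Keep only the last max_length tokens; leading tokens are cut off.
        return input_ids[-max_length:]
    raise ValueError(f'unknown truncation strategy: {strategy!r}')


print(apply_truncation([1, 2, 3, 4, 5], 3, 'delete'))
print(apply_truncation([1, 2, 3, 4, 5], 3, 'truncation_left'))
```

Because `truncation_left` discards the start of the sequence, any special tokens placed there by the template are lost, which is the reason the documentation advises against it.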
@@ -240,14 +240,14 @@ dpo parameters inherit from sft parameters, with the following added parameters:
- `--dtype`: Default is `'AUTO'`, see `sft.sh command line arguments` for parameter details.
- `--dataset`: Default is `[]`, see `sft.sh command line arguments` for parameter details.
- `--dataset_seed`: Default is `42`, see `sft.sh command line arguments` for parameter details.
- `--dataset_test_ratio`: Default value is `None`, if `--load_dataset_config true` is set, then use the dataset_test_ratio from training, else set it to 1. For specific parameter details, refer to the `sft.sh command line arguments`.
- `--dataset_test_ratio`: Default value is `0.01`. For specific parameter details, refer to the `sft.sh command line arguments`.
- `--show_dataset_sample`: Represents number of validation set samples to evaluate and display, default is `10`.
- `--system`: Default is `None`. See `sft.sh command line arguments` for parameter details.
- `--max_length`: Default is `-1`. See `sft.sh command line arguments` for parameter details.
- `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for parameter details.
- `--check_dataset_strategy`: Default is `'none'`, see `sft.sh command line arguments` for parameter details.
- `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default value is `[]`. This parameter has been deprecated. There is no longer a distinction between training and validation datasets, and the split is now unified using `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default value is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}` instead.
- `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for parameter details.
- `--quant_method`: Quantization method, default is None. You can choose from 'bnb', 'hqq', 'eetq'.
- `--hqq_axis`: Hqq argument. Axis along which grouping is performed. Supported values are 0 or 1. default is `0`
3 changes: 1 addition & 2 deletions requirements/framework.txt
@@ -1,6 +1,5 @@
accelerate
dacite
datasets<=2.18 # modelscope
jieba
matplotlib
modelscope>=1.14
@@ -14,6 +13,6 @@ rouge
safetensors
tensorboard
tqdm
transformers>=4.33,<4.41
transformers>=4.33,<4.42
transformers_stream_generator
trl>=0.8.2
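
The new pin `transformers>=4.33,<4.42` admits the 4.41 series named in the commit title. A rough way to check a version string against that range with only the standard library is sketched below. `major_minor` and `satisfies_pin` are hypothetical helpers, and the comparison looks only at the major and minor components, so it does not reproduce every pre-release rule a real installer applies.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '4.41.0'."""
    match = re.match(r'(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError(f'cannot parse version: {version!r}')
    return int(match.group(1)), int(match.group(2))


def satisfies_pin(version: str) -> bool:
    """True if the version falls inside transformers>=4.33,<4.42 (simplified)."""
    return (4, 33) <= major_minor(version) < (4, 42)


print(satisfies_pin('4.41.0'))
print(satisfies_pin('4.42.0'))
```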
17 changes: 10 additions & 7 deletions swift/llm/infer.py
@@ -386,13 +386,16 @@ def llm_infer(args: InferArguments) -> None:
append_to_jsonl(jsonl_path, obj)
result.append(obj)
else:
_, val_dataset = get_dataset(
args.dataset,
args.dataset_test_ratio,
args.dataset_seed,
check_dataset_strategy=args.check_dataset_strategy,
model_name=args.model_name,
model_author=args.model_author)
dataset_kwargs = {
'dataset_seed': args.dataset_seed,
'check_dataset_strategy': args.check_dataset_strategy,
'model_name': args.model_name,
'model_author': args.model_author
}
if args.val_dataset is None:
_, val_dataset = get_dataset(args.dataset, args.dataset_test_ratio, **dataset_kwargs)
else:
_, val_dataset = get_dataset(args.val_dataset, 1.0, **dataset_kwargs)
_, val_dataset = args._handle_dataset_compat(_, val_dataset)
if args.show_dataset_sample >= 0 and val_dataset.shape[0] > args.show_dataset_sample:
random_state = np.random.RandomState(args.dataset_seed)
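
The branch added in this hunk picks the validation rows either by splitting `args.dataset` with `dataset_test_ratio`, or by taking all of `args.val_dataset` (a ratio of `1.0`). The sketch below mimics that choice on plain lists. `split_dataset` is a stand-in written for this note and is not swift's `get_dataset`, which loads named datasets and takes more arguments.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_dataset(rows: Sequence[int], test_ratio: float,
                  seed: int) -> Tuple[List[int], List[int]]:
    """Stand-in for a train/val split; returns (train_rows, val_rows)."""
    rng = random.Random(seed)  # local RNG, leaves the global seed untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * test_ratio)
    return shuffled[n_val:], shuffled[:n_val]


def select_val_rows(dataset: Sequence[int], val_dataset: Optional[Sequence[int]],
                    dataset_test_ratio: float, seed: int = 42) -> List[int]:
    """Mirror the if/else in the hunk: prefer val_dataset when it is given."""
    if val_dataset is None:
        _, val = split_dataset(dataset, dataset_test_ratio, seed)
    else:
        # A ratio of 1.0 sends every row of val_dataset to the validation side.
        _, val = split_dataset(val_dataset, 1.0, seed)
    return val


print(len(select_val_rows(list(range(100)), None, 0.01)))
print(sorted(select_val_rows(list(range(100)), [7, 8, 9], 0.01)))
```

With 100 rows and the default ratio of `0.01`, the first call keeps a single validation row; the second ignores the ratio and returns all three rows of the separate validation set.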
12 changes: 7 additions & 5 deletions swift/llm/utils/argument.py
@@ -966,8 +966,9 @@ class InferArguments(ArgumentsBase):

dataset: List[str] = field(
default_factory=list, metadata={'help': f'dataset choices: {list(DATASET_MAPPING.keys())}'})
val_dataset: List[str] = field(default=None, metadata={'help': f'dataset choices: {list(DATASET_MAPPING.keys())}'})
dataset_seed: int = 42
dataset_test_ratio: Optional[float] = None
dataset_test_ratio: float = 0.01
show_dataset_sample: int = 10
save_result: bool = True
system: Optional[str] = None
@@ -1035,6 +1036,9 @@ def __post_init__(self) -> None:
'the dir contains a `configuration.json` file.')
self.handle_compatibility()
self._register_self_cognition()
if self.val_dataset is not None:
# val_dataset supplies the validation data, so no split ratio is needed.
self.dataset_test_ratio = 0.0
logger.info('Using val_dataset, ignoring dataset_test_ratio')
self.handle_path()
logger.info(f'ckpt_dir: {self.ckpt_dir}')
if self.ckpt_dir is None and self.load_args_from_ckpt_dir:
@@ -1054,8 +1058,6 @@ def __post_init__(self) -> None:

self.torch_dtype, _, _ = self.select_dtype()
self.prepare_template()
if self.dataset_test_ratio is None:
self.dataset_test_ratio = 1
if self.eval_human is None:
if not len(self.dataset) > 0:
self.eval_human = True
@@ -1139,8 +1141,8 @@ def load_from_ckpt_dir(self) -> None:
]
if self.load_dataset_config:
imported_keys += [
'dataset', 'dataset_seed', 'dataset_test_ratio', 'check_dataset_strategy', 'self_cognition_sample',
'model_name', 'model_author', 'train_dataset_sample', 'val_dataset_sample'
'dataset', 'val_dataset', 'dataset_seed', 'dataset_test_ratio', 'check_dataset_strategy',
'self_cognition_sample', 'model_name', 'model_author', 'train_dataset_sample', 'val_dataset_sample'
]
for key in imported_keys:
value = getattr(self, key)
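
The argument changes in this file amount to a new default (`dataset_test_ratio` is now `0.01` instead of `None`) plus an override in `__post_init__` when `val_dataset` is given. A minimal, self-contained sketch of that pattern follows. `InferArgsSketch` is a hypothetical cut-down dataclass, not the real `InferArguments`, and it types `val_dataset` as `Optional` for clarity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InferArgsSketch:
    """Hypothetical three-field stand-in for InferArguments."""
    dataset: List[str] = field(default_factory=list)
    val_dataset: Optional[List[str]] = None
    dataset_test_ratio: float = 0.01

    def __post_init__(self) -> None:
        # When a separate validation set is supplied, the split ratio is ignored.
        if self.val_dataset is not None:
            self.dataset_test_ratio = 0.0


print(InferArgsSketch(dataset=['a']).dataset_test_ratio)
print(InferArgsSketch(dataset=['a'], val_dataset=['b']).dataset_test_ratio)
```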
20 changes: 20 additions & 0 deletions tests/llm/test_run.py
@@ -39,6 +39,25 @@ def setUp(self):
def tearDown(self):
shutil.rmtree(self.tmp_dir)

def test_template(self):
if not __name__ == '__main__':
# ignore citest error in github
return
torch.cuda.empty_cache()
output = sft_main(
SftArguments(
model_type=ModelType.qwen1half_1_8b,
model_id_or_path='../models/Qwen1.5-1.8B',
template_type='qwen',
sft_type='full',
dataset=f'{DatasetName.jd_sentiment_zh}#200',
eval_steps=5))
best_model_checkpoint = output['best_model_checkpoint']
torch.cuda.empty_cache()
result = infer_main(
InferArguments(ckpt_dir=best_model_checkpoint, load_dataset_config=True, val_dataset_sample=2))
assert len(result['result'][0]['response']) < 20

def test_basic(self):
output_dir = 'output'
quantization_bit_list = [0, 4]
@@ -481,6 +500,7 @@ def tokenize_func(examples):
metric_for_best_model='loss',
greater_is_better=False,
gradient_accumulation_steps=1,
logging_steps=5,
eval_steps=10,
save_only_model=save_only_model)
trainer_args._n_gpu = 1