
Multi-GPU finetune_chat fails with "mat1 and mat2 shapes cannot be multiplied (1024x2 and 1x11008)" #240

Open
18065013 opened this issue Jul 17, 2023 · 2 comments

Comments

@18065013

The script runs and trains normally on a single GPU, but the same unmodified code fails on multiple GPUs with the error below. Any ideas?
mat1 and mat2 shapes cannot be multiplied (1024x2 and 1x11008)
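
For context: when more than one GPU is visible, the Hugging Face Trainer wraps the model in torch.nn.DataParallel, and that wrapper is the only difference between the single-GPU run that works and the multi-GPU run that fails here. A commonly suggested workaround in alpaca-lora-style finetuning scripts (not verified against this repo's finetune_chat.py) is to flag the model as already model-parallel so the Trainer skips DataParallel; the helper below is a minimal sketch under that assumption, and its name is hypothetical:

```python
import torch

def mark_model_parallel(model):
    """Hypothetical helper: flag a PEFT/8-bit model as model-parallel so the
    Hugging Face Trainer skips its automatic torch.nn.DataParallel wrapping
    when more than one GPU is visible. Whether this resolves the shape error
    in this repo is an assumption, not a verified fix."""
    if torch.cuda.device_count() > 1:
        model.is_parallelizable = True  # attributes the Trainer consults before
        model.model_parallel = True     # deciding to wrap the model in DataParallel
    return model
```

Another way to confirm the DataParallel connection is to restrict the run to a single visible device (e.g. CUDA_VISIBLE_DEVICES=0) and check that the error disappears, which would match the single-GPU behaviour reported above.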

@18065013
Author

(vicuna_training_310) root@ubuntu-3090x2:/home/huwei/training/Chinese-Vicuna# python finetune_chat.py --data_path sample/merge_split_s1.json --model_path decapoda-research/llama-7b-hf --test_size=10

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Using custom data configuration default-9e71a4e6f9c8d1a3
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-9e71a4e6f9c8d1a3/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 1/1 [00:00<00:00, 759.84it/s]
100%|██████████| 5/5 [00:00<00:00, 191.43ex/s]
Loading checkpoint shards: 100%|██████████| 33/33 [00:10<00:00, 3.13it/s]
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
#0: 0%|          | 0/625 [00:00<?, ?ex/s]
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
#1: 100%|██████████| 625/625 [00:03<00:00, 201.15ex/s]
#0: 100%|██████████| 625/625 [00:03<00:00, 197.08ex/s]
#3: 100%|██████████| 625/625 [00:03<00:00, 198.10ex/s]
#2: 100%|██████████| 625/625 [00:03<00:00, 193.27ex/s]
The following columns in the training set don't have a corresponding argument in PeftModelForCausalLM.forward and have been ignored: input, output. If input, output are not expected by PeftModelForCausalLM.forward, you can safely ignore this message.
/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
***** Running training *****
Num examples = 2500
Num Epochs = 3
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 256
Gradient Accumulation steps = 32
Total optimization steps = 19
Number of trainable parameters = 19988480
0%|          | 0/19 [00:00<?, ?it/s]
/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/huwei/training/Chinese-Vicuna/finetune_chat.py:273 in │
│ │
│ 270 if torch.__version__ >= "2" and sys.platform != "win32": │
│ 271 │ model = torch.compile(model) │
│ 272 │
│ ❱ 273 trainer.train(resume_from_checkpoint=args.resume_from_checkpoint) │
│ 274 model.save_pretrained(OUTPUT_DIR) │
│ 275 │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/trainer.py:16 │
│ 36 in train │
│ │
│ 1633 │ │ inner_training_loop = find_executable_batch_size( │
│ 1634 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1635 │ │ ) │
│ ❱ 1636 │ │ return inner_training_loop( │
│ 1637 │ │ │ args=args, │
│ 1638 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1639 │ │ │ trial=trial, │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/trainer.py:19 │
│ 03 in _inner_training_loop │
│ │
│ 1900 │ │ │ │ │ with model.no_sync(): │
│ 1901 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1902 │ │ │ │ else: │
│ ❱ 1903 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1904 │ │ │ │ │
│ 1905 │ │ │ │ if ( │
│ 1906 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/trainer.py:26 │
│ 49 in training_step │
│ │
│ 2646 │ │ │ return loss_mb.reduce_mean().detach().to(self.args.device) │
│ 2647 │ │ │
│ 2648 │ │ with self.compute_loss_context_manager(): │
│ ❱ 2649 │ │ │ loss = self.compute_loss(model, inputs) │
│ 2650 │ │ │
│ 2651 │ │ if self.args.n_gpu > 1: │
│ 2652 │ │ │ loss = loss.mean() # mean() to average on multi-gpu parallel training │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/trainer.py:26 │
│ 81 in compute_loss │
│ │
│ 2678 │ │ │ labels = inputs.pop("labels") │
│ 2679 │ │ else: │
│ 2680 │ │ │ labels = None │
│ ❱ 2681 │ │ outputs = model(**inputs) │
│ 2682 │ │ # Save past state if it exists │
│ 2683 │ │ # TODO: this needs to be fixed and made cleaner later. │
│ 2684 │ │ if self.args.past_index >= 0: │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py │
│ :1194 in _call_impl │
│ │
│ 1191 │ │ # this function, and just call forward. │
│ 1192 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1193 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1194 │ │ │ return forward_call(*input, **kwargs) │
│ 1195 │ │ # Do not call functions when jit is used │
│ 1196 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1197 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/data_par │
│ allel.py:171 in forward │
│ │
│ 168 │ │ │ if len(self.device_ids) == 1: │
│ 169 │ │ │ │ return self.module(*inputs[0], **kwargs[0]) │
│ 170 │ │ │ replicas = self.replicate(self.module, self.device_ids[:len(inputs)]) │
│ ❱ 171 │ │ │ outputs = self.parallel_apply(replicas, inputs, kwargs) │
│ 172 │ │ │ return self.gather(outputs, self.output_device) │
│ 173 │ │
│ 174 │ def replicate(self, module, device_ids): │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/data_par │
│ allel.py:181 in parallel_apply │
│ │
│ 178 │ │ return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim) │
│ 179 │ │
│ 180 │ def parallel_apply(self, replicas, inputs, kwargs): │
│ ❱ 181 │ │ return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)]) │
│ 182 │ │
│ 183 │ def gather(self, outputs, output_device): │
│ 184 │ │ return gather(outputs, output_device, dim=self.dim) │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/parallel │
│ _apply.py:89 in parallel_apply │
│ │
│ 86 │ for i in range(len(inputs)): │
│ 87 │ │ output = results[i] │
│ 88 │ │ if isinstance(output, ExceptionWrapper): │
│ ❱ 89 │ │ │ output.reraise() │
│ 90 │ │ outputs.append(output) │
│ 91 │ return outputs │
│ 92 │
│ │
│ /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/_utils.py:543 in │
│ reraise │
│ │
│ 540 │ │ │ # If the exception takes multiple arguments, don't try to │
│ 541 │ │ │ # instantiate since we don't know how to │
│ 542 │ │ │ raise RuntimeError(msg) from None │
│ ❱ 543 │ │ raise exception │
│ 544 │
│ 545 │
│ 546 def _get_available_device_type(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 607, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 603, in custom_forward
return module(*inputs, output_attentions, None)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
hidden_states = self.mlp(hidden_states)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 151, in forward
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x2 and 1x11008)
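
The failing line is bitsandbytes' mixed-precision LLM.int8() path: the fp16 outlier columns extracted from the activations (subA, here 1024x2) are multiplied against the matching rows cached in the quantization state (state.subB, here 1x11008). The counts disagree, which suggests the int8 quantization state does not stay consistent when nn.DataParallel replicates the module onto each device every step; that would also explain why a single-GPU run, which never goes through DataParallel, trains fine. A minimal illustration of the raw shape rule, independent of bitsandbytes and offered only to make the mismatch concrete:

```python
import torch

# subA holds the fp16 outlier columns of the activation (1024 tokens, 2 outlier
# features); state.subB must provide one row per outlier column. With a stale
# 1-row subB the matmul fails exactly like the trace above.
subA = torch.randn(1024, 2)
stale_subB = torch.randn(1, 11008)
try:
    subA @ stale_subB
except RuntimeError as err:
    print(err)  # mat1 and mat2 shapes cannot be multiplied (1024x2 and 1x11008)

fresh_subB = torch.randn(2, 11008)  # consistent state: 2 rows for 2 outlier columns
print((subA @ fresh_subB).shape)    # torch.Size([1024, 11008])
```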

@18065013
Author

1
