Training error #128

Open
minhduc01168 opened this issue Oct 15, 2024 · 0 comments
minhduc01168 commented Oct 15, 2024

I get an error when I run training. Can you help me?
deepspeed /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
    --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json \
    --model_name_or_path /GOT_weights/ \
    --use_im_start_end True \
    --bf16 True \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 1 \
    --weight_decay 0. \
    --warmup_ratio 0.001 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --report_to none \
    --per_device_train_batch_size 2 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --datasets pdf-ocr+scence \
    --output_dir /your/output/path_
Error:
[2024-10-15 08:01:18,206] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] please install triton==1.0.0 if you want to use sparse attention
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[2024-10-15 08:01:24,289] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-10-15 08:01:24,289] [INFO] [runner.py:568:main] cmd = /opt/conda/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py --deepspeed /kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/zero_config/zero2.json --model_name_or_path /kaggle/working/GOT_OCR2/GOT_weights/ --use_im_start_end True --bf16 True --gradient_accumulation_steps 2 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --weight_decay 0. --warmup_ratio 0.001 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 8192 --gradient_checkpointing True --dataloader_num_workers 8 --report_to none --per_device_train_batch_size 2 --num_train_epochs 1 --learning_rate 2e-5 --datasets plain --output_dir /kaggle/working/GOT_OCR2/output
[2024-10-15 08:01:26,184] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] please install triton==1.0.0 if you want to use sparse attention
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.20.3-1+cuda12.3
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.20.3-1
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NCCL_VERSION=2.20.3-1
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.20.3-1+cuda12.3
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.20.3-1
[2024-10-15 08:01:30,469] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-10-15 08:01:30,469] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-10-15 08:01:30,469] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-10-15 08:01:30,469] [INFO] [launch.py:164:main] dist_world_size=2
[2024-10-15 08:01:30,469] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-10-15 08:01:30,470] [INFO] [launch.py:256:main] process 328 spawned with command: ['/opt/conda/bin/python3.10', '-u', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py', '--local_rank=0', '--deepspeed', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/zero_config/zero2.json', '--model_name_or_path', '/kaggle/working/GOT_OCR2/GOT_weights/', '--use_im_start_end', 'True', '--bf16', 'True', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '1', '--weight_decay', '0.', '--warmup_ratio', '0.001', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '8192', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '8', '--report_to', 'none', '--per_device_train_batch_size', '2', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--datasets', 'plain', '--output_dir', '/kaggle/working/GOT_OCR2/output']
[2024-10-15 08:01:30,471] [INFO] [launch.py:256:main] process 329 spawned with command: ['/opt/conda/bin/python3.10', '-u', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py', '--local_rank=1', '--deepspeed', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/zero_config/zero2.json', '--model_name_or_path', '/kaggle/working/GOT_OCR2/GOT_weights/', '--use_im_start_end', 'True', '--bf16', 'True', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '1', '--weight_decay', '0.', '--warmup_ratio', '0.001', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '8192', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '8', '--report_to', 'none', '--per_device_train_batch_size', '2', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--datasets', 'plain', '--output_dir', '/kaggle/working/GOT_OCR2/output']
False

===================================BUG REPORT===================================
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

warn(msg)
The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib'), PosixPath('/usr/local/lib/x86_64-linux-gnu'), PosixPath('/usr/local/nvidia/lib')}
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/opt/conda/lib/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}. We select the PyTorch default libcudart.so, which is {torch.version.cuda}, but this might mismatch with the CUDA version that is needed for bitsandbytes. To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environment variable. For example, if you want to use CUDA version 122: BNB_CUDA_VERSION=122 python ... OR set the environment variable in your .bashrc: export BNB_CUDA_VERSION=122. In the case of a manual override, make sure you set LD_LIBRARY_PATH, e.g. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
warn(msg)
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//www.kaggle.com')}
The following directories listed in your path were found to be non-existent: {PosixPath('gcr.io/kaggle-gpu-images/python@sha256'), PosixPath('141219e230dab548ccc19aa4e62bcf805ed9de0b4d5112227e28f5f1a25991f8')}
The following directories listed in your path were found to be non-existent: {PosixPath('tf2-gpu/2-16+cu123')}
The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib'), PosixPath('/usr/local/lib/x86_64-linux-gnu'), PosixPath('/usr/local/nvidia/lib')}
The following directories listed in your path were found to be non-existent: {PosixPath('/kaggle/lib/kagglegym')}
The following directories listed in your path were found to be non-existent: {PosixPath('//dp.kaggle.net'), PosixPath('https')}
DEBUG: Possible options found for libcudart.so: {PosixPath('/opt/conda/lib/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=123, Highest Compute Capability: 7.5.
CUDA SETUP: To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Required library version not found: libbitsandbytes_cuda123.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:
  1. You need to manually override the PyTorch CUDA version. Please see: https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
  2. CUDA driver not installed
  3. CUDA not installed
  4. You have multiple conflicting CUDA libraries
  5. Required library not pre-compiled for this bitsandbytes release!
CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION, for example, make CUDA_VERSION=113.
CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=123
python setup.py install
CUDA SETUP: Setup Failed!
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1364, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 190, in <module>
    from peft import PeftModel
  File "/opt/conda/lib/python3.10/site-packages/peft/__init__.py", line 22, in <module>
    from .auto import (
  File "/opt/conda/lib/python3.10/site-packages/peft/auto.py", line 30, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING
  File "/opt/conda/lib/python3.10/site-packages/peft/mapping.py", line 20, in <module>
    from .peft_model import (
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 39, in <module>
    from .tuners import (
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/__init__.py", line 21, in <module>
    from .lora import LoraConfig, LoraModel
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/lora.py", line 42, in <module>
    import bitsandbytes as bnb
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from . import cuda_setup, utils, research
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
    from . import nn
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
    from .modules import LinearFP8Mixed, LinearFP8Global
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
    from bitsandbytes.optim import GlobalOptimManager
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
    from bitsandbytes.cextension import COMPILED_WITH_CUDA
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 20, in <module>
    raise RuntimeError('''
RuntimeError:
    CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py", line 25, in <module>
    from GOT.train.trainer_vit_fixlr import GOTTrainer
  File "/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/trainer_vit_fixlr.py", line 5, in <module>
    from transformers import Trainer
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1354, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1366, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):

    CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

(Both worker processes, ranks 0 and 1, print the same bitsandbytes setup report and raise the same traceback.)
[2024-10-15 08:01:53,494] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 328
[2024-10-15 08:01:53,494] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 329
[2024-10-15 08:01:53,495] [ERROR] [launch.py:325:sigkill_handler] ['/opt/conda/bin/python3.10', '-u', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py', '--local_rank=1', '--deepspeed', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/zero_config/zero2.json', '--model_name_or_path', '/kaggle/working/GOT_OCR2/GOT_weights/', '--use_im_start_end', 'True', '--bf16', 'True', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '1', '--weight_decay', '0.', '--warmup_ratio', '0.001', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '8192', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '8', '--report_to', 'none', '--per_device_train_batch_size', '2', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--datasets', 'plain', '--output_dir', '/kaggle/working/GOT_OCR2/output'] exits with return code = 1
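From the log, bitsandbytes detects CUDA 12.3 through PyTorch but cannot find a matching libbitsandbytes_cuda123.so, falls back to libbitsandbytes_cpu.so, and then raises as soon as peft imports it. A minimal way to collect the information the error message asks for (the paths below are taken from the log above; this is a sketch, not a verified fix):

# Run bitsandbytes' own setup report, as the error message suggests
python -m bitsandbytes
# Check the PyTorch build's CUDA version and the GPU compute capability
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability())"
# List the two libcudart copies the log found, to see which CUDA runtimes are present
ls -l /opt/conda/lib/libcudart.so* /usr/local/cuda/lib64/libcudart.so*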

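If the CUDA 12.3 toolkit under /usr/local/cuda is the runtime that should be used, the UserWarning in the log already describes an override. A sketch of that route, with the version number taken from the log and the library path an assumption about this Kaggle image:

# Point bitsandbytes at the CUDA version PyTorch reports (CUDA_VERSION=123 in the log)
export BNB_CUDA_VERSION=123
# Assumed location of the CUDA 12.3 runtime on this image; adjust if ls shows a different path
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
# then rerun the same deepspeed command as above

Alternatively, upgrading to a bitsandbytes release that ships CUDA 12.x binaries (pip install -U bitsandbytes) may avoid the missing libbitsandbytes_cuda123.so altogether.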