Training error #128

Open
minhduc01168 opened this issue Oct 15, 2024 · 0 comments
minhduc01168 commented Oct 15, 2024

I get an error when I run training. Can you help me?
deepspeed /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
    --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json \
    --model_name_or_path /GOT_weights/ \
    --use_im_start_end True \
    --bf16 True \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 1 \
    --weight_decay 0. \
    --warmup_ratio 0.001 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --report_to none \
    --per_device_train_batch_size 2 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --datasets pdf-ocr+scence \
    --output_dir /your/output/path_
Error:
[2024-10-15 08:01:18,206] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] please install triton==1.0.0 if you want to use sparse attention
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[2024-10-15 08:01:24,289] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-10-15 08:01:24,289] [INFO] [runner.py:568:main] cmd = /opt/conda/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py --deepspeed /kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/zero_config/zero2.json --model_name_or_path /kaggle/working/GOT_OCR2/GOT_weights/ --use_im_start_end True --bf16 True --gradient_accumulation_steps 2 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --weight_decay 0. --warmup_ratio 0.001 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 8192 --gradient_checkpointing True --dataloader_num_workers 8 --report_to none --per_device_train_batch_size 2 --num_train_epochs 1 --learning_rate 2e-5 --datasets plain --output_dir /kaggle/working/GOT_OCR2/output
[2024-10-15 08:01:26,184] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] please install triton==1.0.0 if you want to use sparse attention
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.20.3-1+cuda12.3
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.20.3-1
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NCCL_VERSION=2.20.3-1
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.20.3-1+cuda12.3
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-10-15 08:01:30,469] [INFO] [launch.py:139:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.20.3-1
[2024-10-15 08:01:30,469] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-10-15 08:01:30,469] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-10-15 08:01:30,469] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-10-15 08:01:30,469] [INFO] [launch.py:164:main] dist_world_size=2
[2024-10-15 08:01:30,469] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-10-15 08:01:30,470] [INFO] [launch.py:256:main] process 328 spawned with command: ['/opt/conda/bin/python3.10', '-u', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py', '--local_rank=0', '--deepspeed', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/zero_config/zero2.json', '--model_name_or_path', '/kaggle/working/GOT_OCR2/GOT_weights/', '--use_im_start_end', 'True', '--bf16', 'True', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '1', '--weight_decay', '0.', '--warmup_ratio', '0.001', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '8192', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '8', '--report_to', 'none', '--per_device_train_batch_size', '2', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--datasets', 'plain', '--output_dir', '/kaggle/working/GOT_OCR2/output']
[2024-10-15 08:01:30,471] [INFO] [launch.py:256:main] process 329 spawned with command: ['/opt/conda/bin/python3.10', '-u', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py', '--local_rank=1', '--deepspeed', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/zero_config/zero2.json', '--model_name_or_path', '/kaggle/working/GOT_OCR2/GOT_weights/', '--use_im_start_end', 'True', '--bf16', 'True', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '1', '--weight_decay', '0.', '--warmup_ratio', '0.001', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '8192', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '8', '--report_to', 'none', '--per_device_train_batch_size', '2', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--datasets', 'plain', '--output_dir', '/kaggle/working/GOT_OCR2/output']
False

===================================BUG REPORT===================================
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

warn(msg)
The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib'), PosixPath('/usr/local/lib/x86_64-linux-gnu'), PosixPath('/usr/local/nvidia/lib')}
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/opt/conda/lib/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}. We select the PyTorch default libcudart.so, which is {torch.version.cuda}, but this might mismatch with the CUDA version that is needed for bitsandbytes. To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environment variable. For example, if you want to use CUDA version 122: BNB_CUDA_VERSION=122 python ... OR set the environment variable in your .bashrc: export BNB_CUDA_VERSION=122. In the case of a manual override, make sure you set LD_LIBRARY_PATH, e.g. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
warn(msg)
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//www.kaggle.com')}
The following directories listed in your path were found to be non-existent: {PosixPath('gcr.io/kaggle-gpu-images/python@sha256'), PosixPath('141219e230dab548ccc19aa4e62bcf805ed9de0b4d5112227e28f5f1a25991f8')}
The following directories listed in your path were found to be non-existent: {PosixPath('tf2-gpu/2-16+cu123')}
The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib'), PosixPath('/usr/local/lib/x86_64-linux-gnu'), PosixPath('/usr/local/nvidia/lib')}
The following directories listed in your path were found to be non-existent: {PosixPath('/kaggle/lib/kagglegym')}
The following directories listed in your path were found to be non-existent: {PosixPath('//dp.kaggle.net'), PosixPath('https')}
DEBUG: Possible options found for libcudart.so: {PosixPath('/opt/conda/lib/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=123, Highest Compute Capability: 7.5.
CUDA SETUP: To manually override the PyTorch CUDA version please see: https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Required library version not found: libbitsandbytes_cuda123.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

================================================ERROR=====================================
CUDA SETUP: CUDA detection failed! Possible reasons:
  1. You need to manually override the PyTorch CUDA version. Please see: https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
  2. CUDA driver not installed
  3. CUDA not installed
  4. You have multiple conflicting CUDA libraries
  5. Required library not pre-compiled for this bitsandbytes release!
CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION, for example, make CUDA_VERSION=113.
CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via conda list | grep cuda.
================================================================================

CUDA SETUP: Something unexpected happened. Please compile from source:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=123
python setup.py install
CUDA SETUP: Setup Failed!
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1364, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 190, in <module>
    from peft import PeftModel
  File "/opt/conda/lib/python3.10/site-packages/peft/__init__.py", line 22, in <module>
    from .auto import (
  File "/opt/conda/lib/python3.10/site-packages/peft/auto.py", line 30, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING
  File "/opt/conda/lib/python3.10/site-packages/peft/mapping.py", line 20, in <module>
    from .peft_model import (
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 39, in <module>
    from .tuners import (
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/__init__.py", line 21, in <module>
    from .lora import LoraConfig, LoraModel
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/lora.py", line 42, in <module>
    import bitsandbytes as bnb
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from . import cuda_setup, utils, research
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
    from . import nn
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
    from .modules import LinearFP8Mixed, LinearFP8Global
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
    from bitsandbytes.optim import GlobalOptimManager
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
    from bitsandbytes.cextension import COMPILED_WITH_CUDA
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 20, in <module>
    raise RuntimeError('''
RuntimeError:
    CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py", line 25, in <module>
    from GOT.train.trainer_vit_fixlr import GOTTrainer
  File "/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/trainer_vit_fixlr.py", line 5, in <module>
    from transformers import Trainer
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1354, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1366, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):

    CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

(Both worker processes, ranks 0 and 1, print the same bitsandbytes setup report and raise the same traceback.)
[2024-10-15 08:01:53,494] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 328
[2024-10-15 08:01:53,494] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 329
[2024-10-15 08:01:53,495] [ERROR] [launch.py:325:sigkill_handler] ['/opt/conda/bin/python3.10', '-u', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/GOT/train/train_GOT.py', '--local_rank=1', '--deepspeed', '/kaggle/working/GOT_OCR2/GOT-OCR-2.0-master/zero_config/zero2.json', '--model_name_or_path', '/kaggle/working/GOT_OCR2/GOT_weights/', '--use_im_start_end', 'True', '--bf16', 'True', '--gradient_accumulation_steps', '2', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '1', '--weight_decay', '0.', '--warmup_ratio', '0.001', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '8192', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '8', '--report_to', 'none', '--per_device_train_batch_size', '2', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--datasets', 'plain', '--output_dir', '/kaggle/working/GOT_OCR2/output'] exits with return code = 1
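From the log, bitsandbytes detects CUDA 12.3 through PyTorch but cannot find a matching libbitsandbytes_cuda123.so, falls back to libbitsandbytes_cpu.so, and then raises as soon as peft imports it. A minimal way to collect the information the error message asks for (the paths below are taken from the log above; this is a sketch, not a verified fix):

# Run bitsandbytes' own setup report, as the error message suggests
python -m bitsandbytes
# Check the PyTorch build's CUDA version and the GPU compute capability
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability())"
# List the two libcudart copies the log found, to see which CUDA runtimes are present
ls -l /opt/conda/lib/libcudart.so* /usr/local/cuda/lib64/libcudart.so*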

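If the CUDA 12.3 toolkit under /usr/local/cuda is the runtime that should be used, the UserWarning in the log already describes an override. A sketch of that route, with the version number taken from the log and the library path an assumption about this Kaggle image:

# Point bitsandbytes at the CUDA version PyTorch reports (CUDA_VERSION=123 in the log)
export BNB_CUDA_VERSION=123
# Assumed location of the CUDA 12.3 runtime on this image; adjust if ls shows a different path
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
# then rerun the same deepspeed command as above

Alternatively, upgrading to a bitsandbytes release that ships CUDA 12.x binaries (pip install -U bitsandbytes) may avoid the missing libbitsandbytes_cuda123.so altogether.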