Hi authors, I do not have access to solid hardware. What I have for now is two 3090s (24 GB each). I am planning to run/debug the code on this setup and then move experiments to A100s. On these two 3090s, I have CUDA 12.4. Torch version 2.0.0 (in the requirements) does not support CUDA 12.x. I found that torch 2.1.1 supports CUDA 12.1. This is the only change I have made to the requirements; otherwise the setup is as suggested.
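For reference, the torch swap was roughly the following (assuming the standard cu121 wheel index):

```bash
# Assumption: the cu121 build of torch 2.1.1 from the official wheel index
pip install torch==2.1.1 --index-url https://download.pytorch.org/whl/cu121
```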
When I run it, the code gets stuck at the following step:
LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=32, is_fast=True, padding_side='right', truncation_side='left', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
/mnt/shared_home/vdeshpande/miniconda3/envs/env_spag/lib/python3.9/site-packages/accelerate/accelerator.py:457: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
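(This FutureWarning looks harmless and unrelated to the hang; as the message says, the two kwargs just need to move into a config object. A minimal sketch of the fix, assuming the import location in recent accelerate releases:)

```python
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration  # assumption: import path in recent accelerate versions

# Same values the warning prints; passing them via the config object
# instead of as bare Accelerator kwargs avoids the deprecated path.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
accelerator = Accelerator(dataloader_config=dataloader_config)
```

The log then continues: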
Installed CUDA version 12.2 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Installed CUDA version 12.2 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.3634016513824463 seconds
Time to load cpu_adam op: 3.0814285278320312 seconds
Parameter Offload: Total persistent parameters: 532480 in 130 params
[2024-09-04 15:34:07,215] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50337 closing signal SIGTERM
[2024-09-04 15:34:22,297] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 50336) of binary: /mnt/shared_home/vdeshpande/miniconda3/envs/env_spag/bin/python
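In case it is relevant: my understanding is that exit code -9 corresponds to SIGKILL, which usually points at the kernel OOM killer (host RAM, not GPU VRAM) rather than a CUDA error, and ZeRO-3 with the cpu_adam offload keeps optimizer state in host memory, which is substantial for a 7B model. A quick check (the grep pattern is just a sketch):

```bash
# Look for an OOM-killer record mentioning the dead rank's PID (50336 above)
sudo dmesg -T | grep -iE "out of memory|killed process"
```

If an OOM record shows up there, shrinking the micro-batch size, reducing offload, or adding swap might at least get a debug run through on this machine.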
Any insights on resolving this issue?