I am trying to run:
```
python llama_finetune.py --run-name 7b-test-run --model 7b
```
However, the training loss hasn't decreased over 2 epochs; it stays around 7.1, and `grad_norm` is `nan`. A slice of the logs:
```
{'loss': 7.1384, 'grad_norm': nan, 'learning_rate': 2.7709798710423807e-05, 'epoch': 1.95}
{'loss': 7.1393, 'grad_norm': nan, 'learning_rate': 2.757036104838462e-05, 'epoch': 1.95}
{'loss': 7.15, 'grad_norm': nan, 'learning_rate': 2.7431141430957046e-05, 'epoch': 1.96}
{'loss': 7.1296, 'grad_norm': nan, 'learning_rate': 2.7292141211532795e-05, 'epoch': 1.96}
{'loss': 7.1576, 'grad_norm': nan, 'learning_rate': 2.7153361741370774e-05, 'epoch': 1.96}
{'loss': 7.1666, 'grad_norm': nan, 'learning_rate': 2.7014804369583906e-05, 'epoch': 1.97}
{'loss': 7.1668, 'grad_norm': nan, 'learning_rate': 2.6876470443125978e-05, 'epoch': 1.97}
{'loss': 7.1313, 'grad_norm': nan, 'learning_rate': 2.673836130677871e-05, 'epoch': 1.97}
```
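To narrow it down, I tried checking which parameters actually carry non-finite gradients after `backward()`. This is only a minimal sketch assuming a standard PyTorch training loop (the helper name `find_nan_grads` and the toy model are illustrative, not part of `llama_finetune.py`):

```python
import torch
import torch.nn as nn

def find_nan_grads(model: nn.Module) -> list[str]:
    """Return names of parameters whose gradients contain NaN or Inf."""
    bad = []
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(name)
    return bad

# Tiny demo: force a NaN loss so the check fires.
model = nn.Linear(4, 1)
out = model(torch.randn(2, 4))
loss = out.sum() * float("nan")  # simulate an overflow/NaN in the loss
loss.backward()
print(find_nan_grads(model))  # both 'weight' and 'bias' grads are NaN here
```

Calling this right after the first `backward()` shows whether the NaNs originate in a specific layer or everywhere at once; `torch.autograd.set_detect_anomaly(True)` can then point at the op that first produced them.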
I am training on two A6000 GPUs, and the environment is:
accelerate==0.34.2
aiohappyeyeballs==2.4.3
aiohttp==3.10.8
aiosignal==1.3.1
attrs==24.2.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.3.0
cycler==0.12.1
datasets==3.0.1
dill==0.3.8
docker-pycreds==0.4.0
filelock==3.16.1
fonttools==4.54.1
frozenlist==1.4.1
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
huggingface-hub==0.25.1
idna==3.10
inquirerpy==0.3.4
Jinja2==3.1.4
joblib==1.4.2
kiwisolver==1.4.7
latexcodec==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
monty==2024.7.30
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.3
numpy==2.1.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.77
nvidia-nvtx-cu12==12.1.105
packaging==24.1
palettable==3.3.3
pandas==2.2.3
peft==0.13.0
pfzy==0.3.4
pillow==10.4.0
platformdirs==4.3.6
plotly==5.24.1
prompt_toolkit==3.0.48
protobuf==5.28.2
psutil==6.0.0
pyarrow==17.0.0
pybtex==0.24.0
pymatgen==2024.9.17.1
pyparsing==3.1.4
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
safetensors==0.4.5
scipy==1.14.1
sentencepiece==0.2.0
sentry-sdk==2.15.0
setproctitle==1.3.3
setuptools==75.1.0
six==1.16.0
smmap==5.0.1
spglib==2.5.0
sympy==1.13.3
tabulate==0.9.0
tenacity==9.0.0
tokenizers==0.20.0
torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
tqdm==4.66.5
transformers==4.45.1
triton==3.0.0
typing_extensions==4.12.2
tzdata==2024.2
uncertainties==3.2.2
urllib3==2.2.3
wandb==0.18.3
wcwidth==0.2.13
wheel==0.44.0
xxhash==3.5.0
yarl==1.13.1
Any suggestions are appreciated!