
Training loss didn't decrease #13

Open
Jun-Kai-Zhang opened this issue Oct 3, 2024 · 2 comments

Comments

@Jun-Kai-Zhang

I am trying to run

```
python llama_finetune.py --run-name 7b-test-run --model 7b
```

However, the training loss did not decrease over 2 epochs, staying around 7.1, and the grad_norm is NaN. A slice of the log:

{'loss': 7.1384, 'grad_norm': nan, 'learning_rate': 2.7709798710423807e-05, 'epoch': 1.95}
{'loss': 7.1393, 'grad_norm': nan, 'learning_rate': 2.757036104838462e-05, 'epoch': 1.95}
{'loss': 7.15, 'grad_norm': nan, 'learning_rate': 2.7431141430957046e-05, 'epoch': 1.96}
{'loss': 7.1296, 'grad_norm': nan, 'learning_rate': 2.7292141211532795e-05, 'epoch': 1.96}
{'loss': 7.1576, 'grad_norm': nan, 'learning_rate': 2.7153361741370774e-05, 'epoch': 1.96}
{'loss': 7.1666, 'grad_norm': nan, 'learning_rate': 2.7014804369583906e-05, 'epoch': 1.97}
{'loss': 7.1668, 'grad_norm': nan, 'learning_rate': 2.6876470443125978e-05, 'epoch': 1.97}
{'loss': 7.1313, 'grad_norm': nan, 'learning_rate': 2.673836130677871e-05, 'epoch': 1.97}
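(Editorial note: a `grad_norm` of NaN usually means at least one parameter's gradient has overflowed, often in mixed-precision training. A minimal sketch of how one might locate the offending parameters after `loss.backward()` — this assumes a plain PyTorch loop, not this repo's exact script, and `find_nan_grads` is a hypothetical helper, not part of the codebase:)

```python
import torch
import torch.nn as nn

def find_nan_grads(model: nn.Module) -> list[str]:
    """Return the names of parameters whose gradients contain NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad

# Tiny demo: inject a NaN into the loss and confirm the helper catches it.
model = nn.Linear(4, 2)
out = model(torch.randn(3, 4))
loss = (out * float("nan")).sum()  # simulate an overflow producing NaN
loss.backward()
print(find_nan_grads(model))  # both 'weight' and 'bias' gradients are NaN
```

Calling this right after `backward()` on the first step narrows down whether the NaN originates in the loss itself or in a specific layer.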

I am training on 2 A6000 GPUs, and the environment is:

accelerate==0.34.2
aiohappyeyeballs==2.4.3
aiohttp==3.10.8
aiosignal==1.3.1
attrs==24.2.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.3.0
cycler==0.12.1
datasets==3.0.1
dill==0.3.8
docker-pycreds==0.4.0
filelock==3.16.1
fonttools==4.54.1
frozenlist==1.4.1
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
huggingface-hub==0.25.1
idna==3.10
inquirerpy==0.3.4
Jinja2==3.1.4
joblib==1.4.2
kiwisolver==1.4.7
latexcodec==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
monty==2024.7.30
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.3
numpy==2.1.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.77
nvidia-nvtx-cu12==12.1.105
packaging==24.1
palettable==3.3.3
pandas==2.2.3
peft==0.13.0
pfzy==0.3.4
pillow==10.4.0
platformdirs==4.3.6
plotly==5.24.1
prompt_toolkit==3.0.48
protobuf==5.28.2
psutil==6.0.0
pyarrow==17.0.0
pybtex==0.24.0
pymatgen==2024.9.17.1
pyparsing==3.1.4
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
safetensors==0.4.5
scipy==1.14.1
sentencepiece==0.2.0
sentry-sdk==2.15.0
setproctitle==1.3.3
setuptools==75.1.0
six==1.16.0
smmap==5.0.1
spglib==2.5.0
sympy==1.13.3
tabulate==0.9.0
tenacity==9.0.0
tokenizers==0.20.0
torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
tqdm==4.66.5
transformers==4.45.1
triton==3.0.0
typing_extensions==4.12.2
tzdata==2024.2
uncertainties==3.2.2
urllib3==2.2.3
wandb==0.18.3
wcwidth==0.2.13
wheel==0.44.0
xxhash==3.5.0
yarl==1.13.1

Any suggestions are appreciated!

@Zhongan-Wang

Hi, I met the same problem. Did you solve it?

@Jun-Kai-Zhang
Author

> Hi, I met the same problem. Did you solve it?

Please try the 13b model. I don't know why the 7b one is not compatible.
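(Editorial note: a possible explanation, not confirmed by the thread, is fp16 overflow. Some Llama checkpoints produce activations that exceed fp16's range, which surfaces exactly as a NaN grad_norm; bf16, which Ampere GPUs like the A6000 support, keeps fp32's exponent range and avoids the overflow. A minimal illustration of the range difference:)

```python
import torch

# fp16's max finite value is ~65504, so 65536 overflows to inf —
# a single inf in an activation then turns the gradient norm into NaN.
big = torch.tensor(65536.0)
print(big.to(torch.float16))   # overflows to inf
print(big.to(torch.bfloat16))  # bf16 shares fp32's exponent range: stays finite
```

If the training script exposes a precision flag, switching it from fp16 to bf16 would be worth trying before changing model sizes.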
