
Training loss didn't decrease #13

Open
Jun-Kai-Zhang opened this issue Oct 3, 2024 · 2 comments

Comments

@Jun-Kai-Zhang

I am trying to run

```
python llama_finetune.py --run-name 7b-test-run --model 7b
```

However, the training loss did not decrease over 2 epochs, staying around 7.1, and the grad_norm is NaN. A slice of the log:

{'loss': 7.1384, 'grad_norm': nan, 'learning_rate': 2.7709798710423807e-05, 'epoch': 1.95}
{'loss': 7.1393, 'grad_norm': nan, 'learning_rate': 2.757036104838462e-05, 'epoch': 1.95}
{'loss': 7.15, 'grad_norm': nan, 'learning_rate': 2.7431141430957046e-05, 'epoch': 1.96}
{'loss': 7.1296, 'grad_norm': nan, 'learning_rate': 2.7292141211532795e-05, 'epoch': 1.96}
{'loss': 7.1576, 'grad_norm': nan, 'learning_rate': 2.7153361741370774e-05, 'epoch': 1.96}
{'loss': 7.1666, 'grad_norm': nan, 'learning_rate': 2.7014804369583906e-05, 'epoch': 1.97}
{'loss': 7.1668, 'grad_norm': nan, 'learning_rate': 2.6876470443125978e-05, 'epoch': 1.97}
{'loss': 7.1313, 'grad_norm': nan, 'learning_rate': 2.673836130677871e-05, 'epoch': 1.97}
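(Editorial note: a `grad_norm` of NaN usually means at least one parameter's gradient has overflowed, often in mixed-precision training. A minimal sketch of how one might locate the offending parameters after `loss.backward()` — this assumes a plain PyTorch loop, not this repo's exact script, and `find_nan_grads` is a hypothetical helper, not part of the codebase:)

```python
import torch
import torch.nn as nn

def find_nan_grads(model: nn.Module) -> list[str]:
    """Return the names of parameters whose gradients contain NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad

# Tiny demo: inject a NaN into the loss and confirm the helper catches it.
model = nn.Linear(4, 2)
out = model(torch.randn(3, 4))
loss = (out * float("nan")).sum()  # simulate an overflow producing NaN
loss.backward()
print(find_nan_grads(model))  # both 'weight' and 'bias' gradients are NaN
```

Calling this right after `backward()` on the first step narrows down whether the NaN originates in the loss itself or in a specific layer.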

I am training on 2 A6000 GPUs, and the environment is:

accelerate==0.34.2
aiohappyeyeballs==2.4.3
aiohttp==3.10.8
aiosignal==1.3.1
attrs==24.2.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.3.0
cycler==0.12.1
datasets==3.0.1
dill==0.3.8
docker-pycreds==0.4.0
filelock==3.16.1
fonttools==4.54.1
frozenlist==1.4.1
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
huggingface-hub==0.25.1
idna==3.10
inquirerpy==0.3.4
Jinja2==3.1.4
joblib==1.4.2
kiwisolver==1.4.7
latexcodec==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
monty==2024.7.30
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.3
numpy==2.1.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.77
nvidia-nvtx-cu12==12.1.105
packaging==24.1
palettable==3.3.3
pandas==2.2.3
peft==0.13.0
pfzy==0.3.4
pillow==10.4.0
platformdirs==4.3.6
plotly==5.24.1
prompt_toolkit==3.0.48
protobuf==5.28.2
psutil==6.0.0
pyarrow==17.0.0
pybtex==0.24.0
pymatgen==2024.9.17.1
pyparsing==3.1.4
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
safetensors==0.4.5
scipy==1.14.1
sentencepiece==0.2.0
sentry-sdk==2.15.0
setproctitle==1.3.3
setuptools==75.1.0
six==1.16.0
smmap==5.0.1
spglib==2.5.0
sympy==1.13.3
tabulate==0.9.0
tenacity==9.0.0
tokenizers==0.20.0
torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
tqdm==4.66.5
transformers==4.45.1
triton==3.0.0
typing_extensions==4.12.2
tzdata==2024.2
uncertainties==3.2.2
urllib3==2.2.3
wandb==0.18.3
wcwidth==0.2.13
wheel==0.44.0
xxhash==3.5.0
yarl==1.13.1

Any suggestions are appreciated!

@Zhongan-Wang

Hi, I met the same problem. Did you solve it?

@Jun-Kai-Zhang
Author

> Hi, I met the same problem. Did you solve it?

Please try the 13b model. I don't know why the 7b one is not compatible.
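(Editorial note: a possible explanation, not confirmed by the thread, is fp16 overflow. Some Llama checkpoints produce activations that exceed fp16's range, which surfaces exactly as a NaN grad_norm; bf16, which Ampere GPUs like the A6000 support, keeps fp32's exponent range and avoids the overflow. A minimal illustration of the range difference:)

```python
import torch

# fp16's max finite value is ~65504, so 65536 overflows to inf —
# a single inf in an activation then turns the gradient norm into NaN.
big = torch.tensor(65536.0)
print(big.to(torch.float16))   # overflows to inf
print(big.to(torch.bfloat16))  # bf16 shares fp32's exponent range: stays finite
```

If the training script exposes a precision flag, switching it from fp16 to bf16 would be worth trying before changing model sizes.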
