LaVIN-7B results cannot be reproduced; two runs of identical code differ widely #40

yanghu819 · 2024-01-07T06:11:40Z

Results of running LaVIN-7B on 2×A100.
Command: bash ./scripts/finetuning_sqa_7b.sh

Results of the two runs:
{'acc_natural': '87.88', 'acc_social': '94.71', 'acc_language': '84.91', 'acc_has_text': '87.24', 'acc_has_image': '86.37', 'acc_no_context': '87.87', 'acc_grade_1_6': '89.35', 'acc_grade_7_12': '87.08', 'acc_average': '88.54'}

{'acc_natural': '87.08', 'acc_social': '95.61', 'acc_language': '87.82', 'acc_has_text': '86.46', 'acc_has_image': '86.27', 'acc_no_context': '89.83', 'acc_grade_1_6': '90.09', 'acc_grade_7_12': '87.21', 'acc_average': '89.06'}

These were two independent runs with no changes in between, yet the results differ considerably. What might be causing this?
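To put a number on the gap between the two runs, here is a minimal sketch in plain Python. It is not part of the LaVIN repository; `run_a`, `run_b` and `per_metric_gap` are hypothetical names, and the values are copied from the two result dictionaries in this post.

```python
# Results copied from the two runs reported in this post (the log stores them as strings).
run_a = {'acc_natural': '87.88', 'acc_social': '94.71', 'acc_language': '84.91',
         'acc_has_text': '87.24', 'acc_has_image': '86.37', 'acc_no_context': '87.87',
         'acc_grade_1_6': '89.35', 'acc_grade_7_12': '87.08', 'acc_average': '88.54'}
run_b = {'acc_natural': '87.08', 'acc_social': '95.61', 'acc_language': '87.82',
         'acc_has_text': '86.46', 'acc_has_image': '86.27', 'acc_no_context': '89.83',
         'acc_grade_1_6': '90.09', 'acc_grade_7_12': '87.21', 'acc_average': '89.06'}


def per_metric_gap(a, b):
    """Return {metric: b - a} in accuracy points, rounded to two decimals."""
    return {key: round(float(b[key]) - float(a[key]), 2) for key in a}


gaps = per_metric_gap(run_a, run_b)
largest = max(gaps, key=lambda key: abs(gaps[key]))

for name, gap in gaps.items():
    print(f"{name:>16}: {gap:+.2f}")
print("largest gap:", largest)
```

With these inputs the average differs by about half a point, while acc_language moves by almost three points, so the per-category swings are larger than the headline figure suggests.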

luogen1996 · 2024-01-07T06:23:33Z

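Results are fairly stable on a single GPU. With multiple GPUs, you may need to add the following code to fix the random seed. The versions of the dependencies can also cause performance fluctuations, so please make sure they match requirements.txt.
random.seed(seed)
g = torch.Generator()
g.manual_seed(seed)
DataLoader(
    train_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    generator=g,
)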
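A fuller version of this suggestion, following the general recipe in the PyTorch reproducibility notes, might look like the sketch below. `seed_everything`, `seed_worker`, `make_loader` and `epoch_order` are hypothetical names, not functions from the LaVIN repository, and the toy dataset is only there to show that the shuffle order repeats for a given seed. It does not make GPU kernels deterministic, and it assumes the loader shuffles through its own generator rather than through a custom sampler.

```python
import random

import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_everything(seed: int) -> None:
    """Seed the Python and PyTorch RNGs (CPU and, if present, all GPUs)."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


def seed_worker(worker_id: int) -> None:
    """Give each DataLoader worker a seed derived from the loader's base seed."""
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)


def make_loader(dataset, seed: int, batch_size: int = 4, num_workers: int = 0):
    """Build a shuffling DataLoader whose order depends only on `seed`."""
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=g,
    )


def epoch_order(seed: int):
    """Return the batches of one epoch over a toy dataset of 32 integers."""
    seed_everything(seed)
    dataset = TensorDataset(torch.arange(32))
    return [batch[0].tolist() for batch in make_loader(dataset, seed)]


# Two passes with the same seed visit the samples in the same order.
print(epoch_order(0) == epoch_order(0))
```

Passing the generator controls the shuffle order, while `worker_init_fn` matters once `num_workers` is greater than zero and the dataset uses Python-level randomness.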

yanghu819 · 2024-01-09T08:33:42Z

I tried running 7B on a single A100.
The dependency versions match the official ones.

Command:
CUDA_VISIBLE_DEVICES=0 /opt/anaconda3/envs/lavin/bin/torchrun --nproc_per_node 1 --master_port 11111 train.py \
--llm_model 7B \
--llama_model_path ../data/weights/ \
--data_path ../data/alpaca_data.json \
--max_seq_len 512 \
--batch_size 4 \
--accum_iter 8 \
--epochs 20 \
--warmup_epochs 2 \
--blr 9e-3 \
--weight_decay 0.02 \
--output_dir ./LaVIN-7B/ \
--adapter_type attn \
--adapter_dim 8 \
--adapter_scale 1 \
--n_prompt 6 \
--prompt_format QCM-ALE \
--temperature 10. \
--visual_adapter_type router

CUDA_VISIBLE_DEVICES=0 /opt/anaconda3/envs/lavin/bin/torchrun --nproc_per_node 1 --master_port 11111 eval.py \
--ckpt_dir ../data/weights/ \
--llm_model 7B \
--tokenizer_path ../data/weights/tokenizer.model \
--data_root ../data \
--caption_file ../data/captions.json \
--adapter_path ./LaVIN-7B/checkpoint-19.pth \
--adapter_type attn \
--adapter_dim 8 \
--adapter_scale 1 \
--prompt_format QCM-ALE \
--max_batch_size 64 \
--max_seq_len 512 \
--split test \
--n_prompt 6 \
--temperature 10. \
--visual_adapter_type router

Result:
{'acc_natural': '88.19', 'acc_social': '94.38', 'acc_language': '85.27', 'acc_has_text': '87.10', 'acc_has_image': '86.22', 'acc_no_context': '88.22', 'acc_grade_1_6': '89.76', 'acc_grade_7_12': '86.88', 'acc_average': '88.73'}

What might be the cause? How should batch_size and accum_iter be set on a single GPU? Currently I use:
--batch_size 4
--accum_iter 8
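For reference, the arithmetic behind this question can be sketched as follows. This assumes the usual data-parallel convention, where each process consumes `batch_size` samples per iteration and gradients are accumulated for `accum_iter` iterations before an optimizer step. Whether train.py follows that convention, and which values the 2-GPU script uses, is not stated in this thread. Both function names are hypothetical.

```python
def effective_batch_size(batch_size: int, accum_iter: int, world_size: int = 1) -> int:
    """Samples contributing to one optimizer step under data-parallel training."""
    return batch_size * accum_iter * world_size


def accum_iter_for(target: int, batch_size: int, world_size: int) -> int:
    """Accumulation steps needed to reach `target` samples per optimizer step."""
    per_iteration = batch_size * world_size
    if target % per_iteration != 0:
        raise ValueError("target is not a multiple of batch_size * world_size")
    return target // per_iteration


# Single-GPU settings from the command in this post.
single_gpu = effective_batch_size(batch_size=4, accum_iter=8, world_size=1)
print(single_gpu)

# Hypothetical: matching the same effective batch size on two GPUs.
two_gpu_accum = accum_iter_for(target=single_gpu, batch_size=4, world_size=2)
print(two_gpu_accum)
```

Under this assumption the single-GPU command uses 32 samples per optimizer step, and a 2-GPU run with the same per-GPU batch size would need `accum_iter` 4 to match it.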

luogen1996 · 2024-01-09T08:54:41Z

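Thank you for your interest. We have found that quite a few people are running into this problem. On our A100 40G machines the performance is stable. We have now switched to A800 80G for testing and found that the performance does fluctuate. We are urgently investigating the gap and will try to fix this issue as soon as possible.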

yanghu819 · 2024-01-09T09:13:33Z

Could you share the exact single-GPU parameter configuration for the A100 40G setup?

luogen1996 · 2024-01-09T09:31:20Z

We currently use the same settings:
--batch_size 4
--accum_iter 8
Could you export your conda environment and send it to us so we can investigate?
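One way to check an environment against a pinned list before sending it over is sketched below. It uses only the standard library; `parse_pins`, `installed_versions`, `base_version` and `find_mismatches` are hypothetical helpers, and the pins in the example are made up for illustration rather than taken from the repository's requirements.txt.

```python
from importlib import metadata


def parse_pins(text: str) -> dict:
    """Extract {package: version} from lines of the form 'name==version'."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def installed_versions(names) -> dict:
    """Look up installed versions; None marks a package that is not installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def base_version(version):
    """Strip a local suffix such as '+cu117' so '1.13.0+cu117' compares as '1.13.0'."""
    return None if version is None else version.split("+", 1)[0]


def find_mismatches(pins: dict, installed: dict) -> dict:
    """Return {package: (pinned, installed)} for every package that differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if base_version(installed.get(name)) != wanted
    }


# Made-up pins and installed versions, for illustration only.
pins = parse_pins("torch==1.13.0\ntimm==0.6.12  # hypothetical pin\n")
installed = {"torch": "1.13.0+cu117", "timm": "0.9.2"}
mismatches = find_mismatches(pins, installed)
print(mismatches)
```

In a real environment, `installed_versions(pins)` would replace the hand-written `installed` dictionary.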

yanghu819 · 2024-01-11T02:24:25Z

Results on A100 40G still fall short:
[8737] {'acc_natural': '87.66', 'acc_social': '94.71', 'acc_language': '85.64', 'acc_has_text': '87.15', 'acc_has_image': '86.86', 'acc_no_context': '88.08', 'acc_grade_1_6': '89.79', 'acc_grade_7_12': '86.49', 'acc_average': '88.61'}
torch and related packages:
torch 1.13.0+cu117
transformers 4.37.0.dev0
bitsandbytes 0.41.3.post2
Full environment:

name: lavin
channels:
  - pytorch
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - blas=1.0=mkl
  - brotli-python=1.0.9=py38h6a678d5_7
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2023.12.12=h06a4308_0
  - certifi=2023.11.17=py38h06a4308_0
  - cffi=1.16.0=py38h5eee18b_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - cryptography=41.0.7=py38hdda0065_0
  - cudatoolkit=11.3.1=h2bc3f7f_2
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.12.1=h4a9f257_0
  - giflib=5.2.1=h5eee18b_3
  - gmp=6.2.1=h295c915_3
  - gnutls=3.6.15=he1e5248_0
  - idna=3.4=py38h06a4308_0
  - intel-openmp=2023.1.0=hdb19cb5_46306
  - jpeg=9e=h5eee18b_1
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.38=h1181459_1
  - lerc=3.0=h295c915_0
  - libdeflate=1.17=h5eee18b_1
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libiconv=1.16=h7f8727e_2
  - libidn2=2.3.4=h5eee18b_0
  - libpng=1.6.39=h5eee18b_0
  - libstdcxx-ng=11.2.0=h1234567_1
  - libtasn1=4.19.0=h5eee18b_0
  - libtiff=4.5.1=h6a678d5_0
  - libunistring=0.9.10=h27cfd23_0
  - libwebp=1.3.2=h11a3e52_0
  - libwebp-base=1.3.2=h5eee18b_0
  - lz4-c=1.9.4=h6a678d5_0
  - mkl=2023.1.0=h213fc3f_46344
  - mkl-service=2.4.0=py38h5eee18b_1
  - mkl_fft=1.3.8=py38h5eee18b_0
  - mkl_random=1.2.4=py38hdb19cb5_0
  - ncurses=6.4=h6a678d5_0
  - nettle=3.7.3=hbbd107a_1
  - numpy=1.24.3=py38hf6e8229_1
  - numpy-base=1.24.3=py38h060ed82_1
  - openh264=2.1.1=h4ff587b_0
  - openjpeg=2.4.0=h3ad879b_0
  - openssl=3.0.12=h7f8727e_0
  - pillow=10.0.1=py38ha6cbd5a_0
  - pip=23.3.1=py38h06a4308_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pyopenssl=23.2.0=py38h06a4308_0
  - pysocks=1.7.1=py38h06a4308_0
  - python=3.8.18=h955ad1f_0
  - pytorch-mutex=1.0=cuda
  - readline=8.2=h5eee18b_0
  - requests=2.31.0=py38h06a4308_0
  - setuptools=68.2.2=py38h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tbb=2021.8.0=hdb19cb5_0
  - tk=8.6.12=h1ccaba5_0
  - torchaudio=0.12.1=py38_cu113
  - torchvision=0.13.1=py38_cu113
  - urllib3=1.26.18=py38h06a4308_0
  - wheel=0.41.2=py38h06a4308_0
  - xz=5.4.5=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - zstd=1.5.5=hc292b87_0
  - pip:
      - absl-py==2.0.0
      - aiofiles==23.2.1
      - altair==5.2.0
      - annotated-types==0.6.0
      - anyio==3.7.1
      - asttokens==2.4.1
      - attrs==23.1.0
      - backcall==0.2.0
      - bitsandbytes==0.41.3.post2
      - cachetools==5.3.2
      - click==8.1.7
      - colorama==0.4.6
      - contourpy==1.1.1
      - cycler==0.12.1
      - decorator==5.1.1
      - exceptiongroup==1.2.0
      - executing==2.0.1
      - fairscale==0.4.13
      - fastapi==0.105.0
      - ffmpy==0.3.1
      - filelock==3.13.1
      - fire==0.5.0
      - fonttools==4.47.0
      - fsspec==2023.12.2
      - ftfy==6.1.3
      - google-auth==2.25.2
      - google-auth-oauthlib==1.0.0
      - gradio==4.12.0
      - gradio-client==0.8.0
      - grpcio==1.60.0
      - h11==0.14.0
      - httpcore==1.0.2
      - httpx==0.26.0
      - huggingface-hub==0.20.1
      - importlib-metadata==7.0.1
      - importlib-resources==6.1.1
      - ipdb==0.13.13
      - ipython==8.12.3
      - jedi==0.19.1
      - jinja2==3.1.2
      - jsonschema==4.20.0
      - jsonschema-specifications==2023.11.2
      - kiwisolver==1.4.5
      - markdown==3.5.1
      - markdown-it-py==3.0.0
      - markupsafe==2.1.3
      - matplotlib==3.7.4
      - matplotlib-inline==0.1.6
      - mdurl==0.1.2
      - nvidia-cublas-cu11==11.10.3.66
      - nvidia-cuda-nvrtc-cu11==11.7.99
      - nvidia-cuda-runtime-cu11==11.7.99
      - nvidia-cudnn-cu11==8.5.0.96
      - oauthlib==3.2.2
      - orjson==3.9.10
      - packaging==23.2
      - pandas==2.0.3
      - parso==0.8.3
      - pexpect==4.9.0
      - pickleshare==0.7.5
      - pkgutil-resolve-name==1.3.10
      - prompt-toolkit==3.0.43
      - protobuf==4.25.1
      - ptyprocess==0.7.0
      - pure-eval==0.2.2
      - pyasn1==0.5.1
      - pyasn1-modules==0.3.0
      - pydantic==2.5.3
      - pydantic-core==2.14.6
      - pydub==0.25.1
      - pygments==2.17.2
      - pyparsing==3.1.1
      - python-dateutil==2.8.2
      - python-multipart==0.0.6
      - pytz==2023.3.post1
      - pyyaml==6.0.1
      - referencing==0.32.0
      - regex==2023.12.25
      - requests-oauthlib==1.3.1
      - rich==13.7.0
      - rpds-py==0.15.2
      - rsa==4.9
      - safetensors==0.4.1
      - scipy==1.10.1
      - semantic-version==2.10.0
      - sentencepiece==0.1.99
      - shellingham==1.5.4
      - six==1.16.0
      - sniffio==1.3.0
      - stack-data==0.6.3
      - starlette==0.27.0
      - tensorboard==2.14.0
      - tensorboard-data-server==0.7.2
      - termcolor==2.4.0
      - timm==0.6.12
      - tokenizers==0.15.0
      - tomli==2.0.1
      - tomlkit==0.12.0
      - toolz==0.12.0
      - torch==1.13.0
      - tqdm==4.66.1
      - traitlets==5.14.0
      - transformers==4.37.0.dev0
      - typer==0.9.0
      - typing-extensions==4.9.0
      - tzdata==2023.3
      - uvicorn==0.25.0
      - wcwidth==0.2.12
      - websockets==11.0.3
      - werkzeug==3.0.1
      - zipp==3.17.0
prefix: /opt/anaconda3/envs/lavin

luogen1996 · 2024-01-17T05:35:20Z

The random seeds in the code were not fully fixed, which caused the results to fluctuate. We have updated the code.

luogen1996 · 2024-01-17T05:38:19Z

{'acc_natural': '88.90', 'acc_social': '95.61', 'acc_language': '84.45', 'acc_has_text': '87.83', 'acc_has_image': '87.65', 'acc_no_context': '88.50', 'acc_grade_1_6': '90.57', 'acc_grade_7_12': '86.62', 'acc_average': '89.15'}
Because the random seeds in the code were not fully fixed, the results fluctuated from run to run. We have updated the code. The result above is from a single A800 (80G) card; in practice, measured performance can still fluctuate by about ±0.2 depending on the environment. The command is:
CUDA_VISIBLE_DEVICES=4 torchrun --nproc_per_node 1 --master_port 11411 train.py \
--llm_model 7B \
--llama_model_path ../data/weights/ \
--data_path ../data/alpaca_data.json \
--max_seq_len 512 \
--batch_size 4 \
--accum_iter 8 \
--epochs 20 \
--warmup_epochs 2 \
--blr 9e-3 \
--weight_decay 0.02 \
--output_dir ./LaVIN-7B/ \
--adapter_type attn \
--adapter_dim 8 \
--adapter_scale 1 \
--n_prompt 6 \
--prompt_format QCM-ALE \
--temperature 10. \
--visual_adapter_type router
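As a rough summary of the acc_average values reported in this thread, the sketch below computes their mean, range and sample standard deviation. The labels are informal descriptions rather than official run names, and the runs mix hardware and code versions, so this is not a controlled measurement of seed variance.

```python
from statistics import mean, stdev

# acc_average values as reported in this thread; labels are informal.
reported = [
    ("2xA100, run 1", 88.54),
    ("2xA100, run 2", 89.06),
    ("single A100", 88.73),
    ("A100 40G", 88.61),
    ("single A800 80G, after the seed fix", 89.15),
]

values = [value for _, value in reported]
avg = mean(values)
spread = max(values) - min(values)
sd = stdev(values)  # sample standard deviation

print(f"mean   = {avg:.2f}")
print(f"range  = {spread:.2f}")
print(f"stdev  = {sd:.2f}")
```

Across these five numbers the range is about 0.6 points and the sample standard deviation is a little under 0.3.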
