LaVIN-7B results cannot be reproduced; two runs of the same code differ widely #40

Open
yanghu819 opened this issue Jan 7, 2024 · 8 comments
@yanghu819

Results of running LaVIN-7B on 2xA100.
Command: bash ./scripts/finetuning_sqa_7b.sh

Results of the two runs:
{'acc_natural': '87.88', 'acc_social': '94.71', 'acc_language': '84.91', 'acc_has_text': '87.24', 'acc_has_image': '86.37', 'acc_no_context': '87.87', 'acc_grade_1_6': '89.35', 'acc_grade_7_12': '87.08', 'acc_average': '88.54'}

{'acc_natural': '87.08', 'acc_social': '95.61', 'acc_language': '87.82', 'acc_has_text': '86.46', 'acc_has_image': '86.27', 'acc_no_context': '89.83', 'acc_grade_1_6': '90.09', 'acc_grade_7_12': '87.21', 'acc_average': '89.06'}

These were two independent runs with no changes between them, yet the results differ considerably. What could be the cause?
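
To make the size of the gap concrete, the short sketch below subtracts the first result from the second, metric by metric. The two dictionaries are copied from the post above; the helper name `metric_deltas` is only illustrative.

```python
# Results copied verbatim from the two runs reported above.
run_1 = {'acc_natural': '87.88', 'acc_social': '94.71', 'acc_language': '84.91',
         'acc_has_text': '87.24', 'acc_has_image': '86.37', 'acc_no_context': '87.87',
         'acc_grade_1_6': '89.35', 'acc_grade_7_12': '87.08', 'acc_average': '88.54'}
run_2 = {'acc_natural': '87.08', 'acc_social': '95.61', 'acc_language': '87.82',
         'acc_has_text': '86.46', 'acc_has_image': '86.27', 'acc_no_context': '89.83',
         'acc_grade_1_6': '90.09', 'acc_grade_7_12': '87.21', 'acc_average': '89.06'}


def metric_deltas(a, b):
    """Return {metric: b - a} in percentage points, rounded to two decimals."""
    return {key: round(float(b[key]) - float(a[key]), 2) for key in a}


deltas = metric_deltas(run_1, run_2)

# Print the metrics with the largest absolute change first.
for name, delta in sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:16s} {delta:+.2f}")
```

The average moves by about half a point, while individual categories such as acc_language move by several points.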

@luogen1996
Owner

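Results are fairly stable on a single GPU. With multiple GPUs you may need to add the code below to fix the random seed. The versions of the dependencies can also cause performance fluctuations, so please make sure they match requirements.txt.
    import random
    import torch
    from torch.utils.data import DataLoader

    # seed, train_dataset, batch_size and num_workers come from the training script
    random.seed(seed)
    g = torch.Generator()
    g.manual_seed(seed)
    DataLoader(
        train_dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        generator=g,
    )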
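
A slightly fuller version of this idea might look like the sketch below. It is not the code used in the repository. It follows the general pattern from the PyTorch reproducibility notes, and the helper names (`seed_everything`, `seed_worker`, `build_loader`, `first_epoch_order`) and the toy dataset are assumptions made for illustration.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy and PyTorch RNGs (CPU and all visible GPUs)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


def seed_worker(worker_id: int) -> None:
    """Derive each DataLoader worker's NumPy/random seed from its torch seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def build_loader(dataset, batch_size: int, seed: int, num_workers: int = 0) -> DataLoader:
    """Build a shuffling DataLoader whose sample order is fixed by `seed`."""
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, worker_init_fn=seed_worker,
                      generator=g)


def first_epoch_order(seed: int) -> list:
    """Return the sample order of one epoch over a toy 16-item dataset."""
    seed_everything(seed)
    dataset = TensorDataset(torch.arange(16))
    loader = build_loader(dataset, batch_size=4, seed=seed)
    return [int(x) for (batch,) in loader for x in batch]


# Two loaders built with the same seed visit the samples in the same order.
print(first_epoch_order(0) == first_epoch_order(0))
```

Fixing the seeds this way only pins the sources of randomness that go through these generators. Nondeterministic GPU kernels and differences between library versions can still change the result.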

@yanghu819
Author

I tried training the 7B model on a single A100.
The dependency versions match the official ones.

Commands:
CUDA_VISIBLE_DEVICES=0 /opt/anaconda3/envs/lavin/bin/torchrun --nproc_per_node 1 --master_port 11111 train.py \
    --llm_model 7B \
    --llama_model_path ../data/weights/ \
    --data_path ../data/alpaca_data.json \
    --max_seq_len 512 \
    --batch_size 4 \
    --accum_iter 8 \
    --epochs 20 \
    --warmup_epochs 2 \
    --blr 9e-3 \
    --weight_decay 0.02 \
    --output_dir ./LaVIN-7B/ \
    --adapter_type attn \
    --adapter_dim 8 \
    --adapter_scale 1 \
    --n_prompt 6 \
    --prompt_format QCM-ALE \
    --temperature 10. \
    --visual_adapter_type router

CUDA_VISIBLE_DEVICES=0 /opt/anaconda3/envs/lavin/bin/torchrun --nproc_per_node 1 --master_port 11111 eval.py \
    --ckpt_dir ../data/weights/ \
    --llm_model 7B \
    --tokenizer_path ../data/weights/tokenizer.model \
    --data_root ../data \
    --caption_file ../data/captions.json \
    --adapter_path ./LaVIN-7B/checkpoint-19.pth \
    --adapter_type attn \
    --adapter_dim 8 \
    --adapter_scale 1 \
    --prompt_format QCM-ALE \
    --max_batch_size 64 \
    --max_seq_len 512 \
    --split test \
    --n_prompt 6 \
    --temperature 10. \
    --visual_adapter_type router

Result:
{'acc_natural': '88.19', 'acc_social': '94.38', 'acc_language': '85.27', 'acc_has_text': '87.10', 'acc_has_image': '86.22', 'acc_no_context': '88.22', 'acc_grade_1_6': '89.76', 'acc_grade_7_12': '86.88', 'acc_average': '88.73'}

What could be the cause? How should batch_size and accum_iter be set on a single GPU? I am currently using:
--batch_size 4
--accum_iter 8
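
For reference, the arithmetic behind these two flags can be sketched as follows. This assumes the common convention in which the effective batch size is batch_size x accum_iter x number of GPUs and the learning rate is derived from `--blr` by linear scaling against a base batch of 256. Whether train.py follows exactly this rule should be checked in the repository; the function names here are illustrative.

```python
def effective_batch_size(batch_size: int, accum_iter: int, world_size: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * accum_iter * world_size


def scaled_lr(blr: float, eff_batch_size: int, base: int = 256) -> float:
    """Assumed linear scaling rule: lr = blr * effective_batch / base."""
    return blr * eff_batch_size / base


# Flags from the single-GPU command above: --batch_size 4 --accum_iter 8 --blr 9e-3
single_gpu = effective_batch_size(batch_size=4, accum_iter=8, world_size=1)
print(single_gpu, scaled_lr(9e-3, single_gpu))

# The same flags on two GPUs would double the effective batch size.
two_gpus = effective_batch_size(batch_size=4, accum_iter=8, world_size=2)
print(two_gpus, scaled_lr(9e-3, two_gpus))
```

Under this assumption, changing the number of GPUs without adjusting the flags changes both the effective batch size and the learning rate, which is one reason single-GPU and multi-GPU runs may not be directly comparable.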

@luogen1996
Owner

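Thank you for your interest. We have found that quite a few people run into this problem. On our A100 40G machine the performance is stable. We have now switched to an A800 80G for testing and found that the performance does fluctuate. We are urgently investigating the gap and will try to fix this as soon as possible.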

@yanghu819
Author

Could you share the exact single-GPU parameter settings for the A100 40G configuration?

@luogen1996
Owner

We currently use the same settings:
--batch_size 4
--accum_iter 8
Could you export your conda environment and send it to us so that we can investigate?

@yanghu819
Author

Results on an A100 40G, which still fall short of the reported numbers:
[8737] {'acc_natural': '87.66', 'acc_social': '94.71', 'acc_language': '85.64', 'acc_has_text': '87.15', 'acc_has_image': '86.86', 'acc_no_context': '88.08', 'acc_grade_1_6': '89.79', 'acc_grade_7_12': '86.49', 'acc_average': '88.61'}
Versions of torch and related packages:
torch 1.13.0+cu117
transformers 4.37.0.dev0
bitsandbytes 0.41.3.post2
Full environment:

name: lavin
channels:
  • pytorch
  • defaults
dependencies:
  • _libgcc_mutex=0.1=main
  • _openmp_mutex=5.1=1_gnu
  • blas=1.0=mkl
  • brotli-python=1.0.9=py38h6a678d5_7
  • bzip2=1.0.8=h7b6447c_0
  • ca-certificates=2023.12.12=h06a4308_0
  • certifi=2023.11.17=py38h06a4308_0
  • cffi=1.16.0=py38h5eee18b_0
  • charset-normalizer=2.0.4=pyhd3eb1b0_0
  • cryptography=41.0.7=py38hdda0065_0
  • cudatoolkit=11.3.1=h2bc3f7f_2
  • ffmpeg=4.3=hf484d3e_0
  • freetype=2.12.1=h4a9f257_0
  • giflib=5.2.1=h5eee18b_3
  • gmp=6.2.1=h295c915_3
  • gnutls=3.6.15=he1e5248_0
  • idna=3.4=py38h06a4308_0
  • intel-openmp=2023.1.0=hdb19cb5_46306
  • jpeg=9e=h5eee18b_1
  • lame=3.100=h7b6447c_0
  • lcms2=2.12=h3be6417_0
  • ld_impl_linux-64=2.38=h1181459_1
  • lerc=3.0=h295c915_0
  • libdeflate=1.17=h5eee18b_1
  • libffi=3.4.4=h6a678d5_0
  • libgcc-ng=11.2.0=h1234567_1
  • libgomp=11.2.0=h1234567_1
  • libiconv=1.16=h7f8727e_2
  • libidn2=2.3.4=h5eee18b_0
  • libpng=1.6.39=h5eee18b_0
  • libstdcxx-ng=11.2.0=h1234567_1
  • libtasn1=4.19.0=h5eee18b_0
  • libtiff=4.5.1=h6a678d5_0
  • libunistring=0.9.10=h27cfd23_0
  • libwebp=1.3.2=h11a3e52_0
  • libwebp-base=1.3.2=h5eee18b_0
  • lz4-c=1.9.4=h6a678d5_0
  • mkl=2023.1.0=h213fc3f_46344
  • mkl-service=2.4.0=py38h5eee18b_1
  • mkl_fft=1.3.8=py38h5eee18b_0
  • mkl_random=1.2.4=py38hdb19cb5_0
  • ncurses=6.4=h6a678d5_0
  • nettle=3.7.3=hbbd107a_1
  • numpy=1.24.3=py38hf6e8229_1
  • numpy-base=1.24.3=py38h060ed82_1
  • openh264=2.1.1=h4ff587b_0
  • openjpeg=2.4.0=h3ad879b_0
  • openssl=3.0.12=h7f8727e_0
  • pillow=10.0.1=py38ha6cbd5a_0
  • pip=23.3.1=py38h06a4308_0
  • pycparser=2.21=pyhd3eb1b0_0
  • pyopenssl=23.2.0=py38h06a4308_0
  • pysocks=1.7.1=py38h06a4308_0
  • python=3.8.18=h955ad1f_0
  • pytorch-mutex=1.0=cuda
  • readline=8.2=h5eee18b_0
  • requests=2.31.0=py38h06a4308_0
  • setuptools=68.2.2=py38h06a4308_0
  • sqlite=3.41.2=h5eee18b_0
  • tbb=2021.8.0=hdb19cb5_0
  • tk=8.6.12=h1ccaba5_0
  • torchaudio=0.12.1=py38_cu113
  • torchvision=0.13.1=py38_cu113
  • urllib3=1.26.18=py38h06a4308_0
  • wheel=0.41.2=py38h06a4308_0
  • xz=5.4.5=h5eee18b_0
  • zlib=1.2.13=h5eee18b_0
  • zstd=1.5.5=hc292b87_0
  • pip:
    • absl-py==2.0.0
    • aiofiles==23.2.1
    • altair==5.2.0
    • annotated-types==0.6.0
    • anyio==3.7.1
    • asttokens==2.4.1
    • attrs==23.1.0
    • backcall==0.2.0
    • bitsandbytes==0.41.3.post2
    • cachetools==5.3.2
    • click==8.1.7
    • colorama==0.4.6
    • contourpy==1.1.1
    • cycler==0.12.1
    • decorator==5.1.1
    • exceptiongroup==1.2.0
    • executing==2.0.1
    • fairscale==0.4.13
    • fastapi==0.105.0
    • ffmpy==0.3.1
    • filelock==3.13.1
    • fire==0.5.0
    • fonttools==4.47.0
    • fsspec==2023.12.2
    • ftfy==6.1.3
    • google-auth==2.25.2
    • google-auth-oauthlib==1.0.0
    • gradio==4.12.0
    • gradio-client==0.8.0
    • grpcio==1.60.0
    • h11==0.14.0
    • httpcore==1.0.2
    • httpx==0.26.0
    • huggingface-hub==0.20.1
    • importlib-metadata==7.0.1
    • importlib-resources==6.1.1
    • ipdb==0.13.13
    • ipython==8.12.3
    • jedi==0.19.1
    • jinja2==3.1.2
    • jsonschema==4.20.0
    • jsonschema-specifications==2023.11.2
    • kiwisolver==1.4.5
    • markdown==3.5.1
    • markdown-it-py==3.0.0
    • markupsafe==2.1.3
    • matplotlib==3.7.4
    • matplotlib-inline==0.1.6
    • mdurl==0.1.2
    • nvidia-cublas-cu11==11.10.3.66
    • nvidia-cuda-nvrtc-cu11==11.7.99
    • nvidia-cuda-runtime-cu11==11.7.99
    • nvidia-cudnn-cu11==8.5.0.96
    • oauthlib==3.2.2
    • orjson==3.9.10
    • packaging==23.2
    • pandas==2.0.3
    • parso==0.8.3
    • pexpect==4.9.0
    • pickleshare==0.7.5
    • pkgutil-resolve-name==1.3.10
    • prompt-toolkit==3.0.43
    • protobuf==4.25.1
    • ptyprocess==0.7.0
    • pure-eval==0.2.2
    • pyasn1==0.5.1
    • pyasn1-modules==0.3.0
    • pydantic==2.5.3
    • pydantic-core==2.14.6
    • pydub==0.25.1
    • pygments==2.17.2
    • pyparsing==3.1.1
    • python-dateutil==2.8.2
    • python-multipart==0.0.6
    • pytz==2023.3.post1
    • pyyaml==6.0.1
    • referencing==0.32.0
    • regex==2023.12.25
    • requests-oauthlib==1.3.1
    • rich==13.7.0
    • rpds-py==0.15.2
    • rsa==4.9
    • safetensors==0.4.1
    • scipy==1.10.1
    • semantic-version==2.10.0
    • sentencepiece==0.1.99
    • shellingham==1.5.4
    • six==1.16.0
    • sniffio==1.3.0
    • stack-data==0.6.3
    • starlette==0.27.0
    • tensorboard==2.14.0
    • tensorboard-data-server==0.7.2
    • termcolor==2.4.0
    • timm==0.6.12
    • tokenizers==0.15.0
    • tomli==2.0.1
    • tomlkit==0.12.0
    • toolz==0.12.0
    • torch==1.13.0
    • tqdm==4.66.1
    • traitlets==5.14.0
    • transformers==4.37.0.dev0
    • typer==0.9.0
    • typing-extensions==4.9.0
    • tzdata==2023.3
    • uvicorn==0.25.0
    • wcwidth==0.2.12
    • websockets==11.0.3
    • werkzeug==3.0.1
    • zipp==3.17.0
prefix: /opt/anaconda3/envs/lavin
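
One way to check an environment like this against a set of pinned versions is sketched below, using only the standard library. The pin lines in the usage example are hypothetical and are not taken from the project's requirements.txt; the helper names are also illustrative. Local version suffixes such as `+cu117` are ignored when comparing.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name==version' lines; skip blanks, comments and unpinned entries."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for every package that differs."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # Drop a local build suffix such as '+cu117' before comparing.
        base = found.split("+", 1)[0] if found else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Hypothetical pins checked against versions listed in the environment above.
example_pins = parse_pins(["torch==1.13.0", "timm==0.4.12  # hypothetical pin", "fairscale"])
example_env = {"torch": "1.13.0+cu117", "timm": "0.6.12"}
print(find_mismatches(example_pins, lookup=example_env.get))
```

In a real environment, calling `find_mismatches(parse_pins(open("requirements.txt")))` would compare against the packages that are actually installed.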

@luogen1996
Owner

The random seeds in the code were not completely fixed, which caused the results to fluctuate. We have updated the code.

@luogen1996
Owner

{'acc_natural': '88.90', 'acc_social': '95.61', 'acc_language': '84.45', 'acc_has_text': '87.83', 'acc_has_image': '87.65', 'acc_no_context': '88.50', 'acc_grade_1_6': '90.57', 'acc_grade_7_12': '86.62', 'acc_average': '89.15'}
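The random seeds in the code were not completely fixed, which caused the results to fluctuate from run to run. We have updated the code. The result above is from a single A800 (80G) GPU. In practice, the measured performance still fluctuates by about ±0.2 depending on the environment. The command is: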
CUDA_VISIBLE_DEVICES=4 torchrun --nproc_per_node 1 --master_port 11411 train.py \
    --llm_model 7B \
    --llama_model_path ../data/weights/ \
    --data_path ../data/alpaca_data.json \
    --max_seq_len 512 \
    --batch_size 4 \
    --accum_iter 8 \
    --epochs 20 \
    --warmup_epochs 2 \
    --blr 9e-3 \
    --weight_decay 0.02 \
    --output_dir ./LaVIN-7B/ \
    --adapter_type attn \
    --adapter_dim 8 \
    --adapter_scale 1 \
    --n_prompt 6 \
    --prompt_format QCM-ALE \
    --temperature 10. \
    --visual_adapter_type router
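
To put the ±0.2 figure next to the numbers reported in this thread, the sketch below summarizes the five acc_average values posted above. The labels are informal descriptions of each run, and the runs differ in hardware and code version, so this is only a rough picture of the spread rather than a controlled comparison.

```python
from statistics import mean, stdev

# acc_average values copied from the results posted in this thread.
acc_average = {
    "2xA100, run 1": 88.54,
    "2xA100, run 2": 89.06,
    "1xA100": 88.73,
    "1xA100 40G": 88.61,
    "1xA800 80G (maintainer, updated code)": 89.15,
}

values = list(acc_average.values())
avg = mean(values)
spread = max(values) - min(values)
sd = stdev(values)  # sample standard deviation

print(f"mean={avg:.3f}  range={spread:.2f}  stdev={sd:.3f}")
```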
