lavin-7B results cannot be reproduced, and two runs of identical code differ widely #40
Comments
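Results are fairly stable on a single GPU; with multiple GPUs you may need to add the following code to fix the random seed. Differences in dependency versions can also cause performance fluctuations, so please make sure your environment matches requirements.txt.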
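The snippet this comment refers to did not survive in the thread. As a rough sketch only, and not the maintainers' actual fix, seed fixing for a PyTorch job commonly looks like the following. The function name `seed_everything`, the `rank` parameter, and the per-rank offset are assumptions made for illustration; NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random


def seed_everything(seed: int, rank: int = 0) -> int:
    """Seed the common RNG sources and return the seed actually used.

    `rank` is a hypothetical per-process offset so that each worker
    gets a different but fixed seed.
    """
    effective = seed + rank

    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(effective)
    random.seed(effective)

    try:
        import numpy as np
        np.random.seed(effective)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.

    try:
        import torch
        torch.manual_seed(effective)
        torch.cuda.manual_seed_all(effective)  # ignored when CUDA is unavailable
        # Trade some speed for repeatable cuDNN kernel selection.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.

    return effective
```

Such a function would typically be called once per process, before the model and data loaders are built. Even with fixed seeds, some GPU kernels are non-deterministic, so small differences between runs or between hardware types can remain.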
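I tried running 7B on a single A100. Command: CUDA_VISIBLE_DEVICES=0 /opt/anaconda3/envs/lavin/bin/torchrun --nproc_per_node 1 --master_port 11111 eval.py Result: Could you explain the likely cause, and how batch_size and accum_iter should be set on a single GPU? Currently they are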
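On the batch_size and accum_iter question: in training scripts that use gradient accumulation, the effective batch size is usually the per-GPU batch size times the accumulation steps times the number of GPUs. The sketch below assumes that convention applies here, which the thread does not confirm. The helper names and the numbers in the usage note are hypothetical and are not the project's settings.

```python
def effective_batch_size(batch_size: int, accum_iter: int, world_size: int) -> int:
    """Samples contributing to one optimizer step (assumed convention)."""
    return batch_size * accum_iter * world_size


def accum_iter_for(target: int, batch_size: int, world_size: int) -> int:
    """Accumulation steps needed to reach `target` samples per optimizer step."""
    per_forward = batch_size * world_size
    if target % per_forward != 0:
        raise ValueError("target must be a multiple of batch_size * world_size")
    return target // per_forward
```

For example, with hypothetical values, 2 GPUs with batch_size 4 and accum_iter 2 give an effective batch size of 16; matching that on 1 GPU with batch_size 4 would need accum_iter 4.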
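Thanks for your interest. We have found that quite a few people are running into this problem. On our A100 40G machines the performance is stable. We have now switched to A800 80G for testing and found that the performance does fluctuate. We are urgently investigating the gap and will try to fix this as soon as possible.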
Could you share the exact parameter configuration for a single GPU on the A100 40G setup?
Currently we are also using
Results on A100 40G, still falling short: name: lavin
Because the random seeds in the code were not fully fixed, the results fluctuated. We have updated the code.
{'acc_natural': '88.90', 'acc_social': '95.61', 'acc_language': '84.45', 'acc_has_text': '87.83', 'acc_has_image': '87.65', 'acc_no_context': '88.50', 'acc_grade_1_6': '90.57', 'acc_grade_7_12': '86.62', 'acc_average': '89.15'}
Results of running lavin-7B on 2xA100
Command: bash ./scripts/finetuning_sqa_7b.sh
Results of the two runs:
{'acc_natural': '87.88', 'acc_social': '94.71', 'acc_language': '84.91', 'acc_has_text': '87.24', 'acc_has_image': '86.37', 'acc_no_context': '87.87', 'acc_grade_1_6': '89.35', 'acc_grade_7_12': '87.08', 'acc_average': '88.54'}
{'acc_natural': '87.08', 'acc_social': '95.61', 'acc_language': '87.82', 'acc_has_text': '86.46', 'acc_has_image': '86.27', 'acc_no_context': '89.83', 'acc_grade_1_6': '90.09', 'acc_grade_7_12': '87.21', 'acc_average': '89.06'}
These were two independent runs with no changes at all, yet the results differ substantially. What could be the cause?
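To make the size of the gap concrete, the per-category differences between the two runs can be computed directly from the reported dictionaries. This is only a small arithmetic sketch; the variable and function names are illustrative.

```python
# The two result dictionaries reported above (values are percentage strings).
run_1 = {'acc_natural': '87.88', 'acc_social': '94.71', 'acc_language': '84.91',
         'acc_has_text': '87.24', 'acc_has_image': '86.37', 'acc_no_context': '87.87',
         'acc_grade_1_6': '89.35', 'acc_grade_7_12': '87.08', 'acc_average': '88.54'}
run_2 = {'acc_natural': '87.08', 'acc_social': '95.61', 'acc_language': '87.82',
         'acc_has_text': '86.46', 'acc_has_image': '86.27', 'acc_no_context': '89.83',
         'acc_grade_1_6': '90.09', 'acc_grade_7_12': '87.21', 'acc_average': '89.06'}


def run_deltas(a: dict, b: dict) -> dict:
    """Return b - a in percentage points for every metric present in both runs."""
    return {k: float(b[k]) - float(a[k]) for k in a if k in b}


deltas = run_deltas(run_1, run_2)

# Print metrics ordered by the absolute size of the gap, largest first.
for name, d in sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:16s} {d:+.2f}")
```

On these two runs the largest gap is in acc_language (2.91 points) while acc_average differs by 0.52 points, so the variation is concentrated in a few categories rather than spread evenly.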