Training becomes very slow partway through: how can I fix it? Is it a data problem? #152

Open
MichealZhangxa opened this issue Dec 9, 2024 · 5 comments

Comments

@MichealZhangxa

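Training used to take about 15 seconds per item, but it suddenly became very slow and the speed is now very unstable; I don't know why. The GPU temperature is not particularly high, but utilization is very low. It also doesn't look like data is being swapped frequently: training was reasonably fast at the start, and I would expect frequent swapping with host memory to make it slow the whole time. Previously I trained on only the coco subset of llava_dataset_665k, which is about half of llava_dataset_665k, and did not run into this problem. Now that I am training on the full llava_dataset_665k, the problem appears.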
[Image: question2]

@ZhangXJ199
Collaborator

Does this problem also occur when you use only our code, without adding any extra structures?

@MichealZhangxa
Author

MichealZhangxa commented Dec 10, 2024

> Does this problem also occur when you use only our code, without adding any extra structures?

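I added a small amount of extra structure, essentially a linear layer. Training was normal at first, and the slowdown only appeared partway through.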

@ZhangXJ199
Collaborator

Try setting group_by_modality_length to false. If the problem still occurs, you may need a GPU with more memory.
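As a minimal sketch of this suggestion, assuming the training arguments are kept as a flat list of flag/value strings (the format used for `--dataloader_num_workers` later in this thread), the flag could be overridden as shown here. The helper `set_flag` and the sample list are hypothetical and not part of the repository, and the string "False" assumes the script's argument parser accepts it as a boolean value.

```python
def set_flag(args, name, value):
    """Return a copy of a flat [flag, value, ...] list with `name` set to `value`."""
    updated = list(args)
    if name in updated:
        idx = updated.index(name)
        if idx + 1 < len(updated):
            updated[idx + 1] = str(value)  # overwrite the existing value
        else:
            updated.append(str(value))  # flag was last and had no value
    else:
        updated.extend([name, str(value)])  # flag not present yet
    return updated


# Hypothetical argument list; only the two flag names come from this thread.
train_args = [
    "--dataloader_num_workers", "8",
    "--group_by_modality_length", "True",
]

train_args = set_flag(train_args, "--group_by_modality_length", False)
print(train_args)
```

The same helper could be used to try a smaller `--dataloader_num_workers` value without editing the list by hand.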

@MichealZhangxa
Author

> group_by_modality_length

Does the "--dataloader_num_workers", "8" parameter affect training speed? If I set it lower, would that stop training from slowing down?

@ZhangXJ199
Collaborator

You can give it a try.
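One way to check whether data loading is the bottleneck before changing the worker count is to time how long each step waits for a batch versus how long it spends computing. The sketch below is only an illustration: `timed_batches`, `profile_steps`, and `slow_loader` are hypothetical names, and the loader and step function are stand-ins rather than the project's actual training loop.

```python
import itertools
import time


def timed_batches(loader):
    """Yield (batch, wait_seconds), where wait_seconds is the time spent
    blocked waiting for the loader to produce the batch."""
    iterator = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def profile_steps(loader, step_fn, max_steps=20):
    """Run step_fn on up to max_steps batches and return
    (total data-wait seconds, total compute seconds)."""
    data_time = 0.0
    compute_time = 0.0
    for batch, wait in itertools.islice(timed_batches(loader), max_steps):
        data_time += wait
        start = time.perf_counter()
        # If step_fn launches asynchronous GPU work, it should wait for
        # completion (e.g. torch.cuda.synchronize()) before returning.
        step_fn(batch)
        compute_time += time.perf_counter() - start
    return data_time, compute_time


def slow_loader(n, delay):
    """Stand-in loader that takes `delay` seconds to produce each item."""
    for i in range(n):
        time.sleep(delay)
        yield i


# Stand-in training step that just sleeps briefly.
data_time, compute_time = profile_steps(
    slow_loader(5, 0.02), lambda batch: time.sleep(0.005)
)
print(f"data wait: {data_time:.3f}s, compute: {compute_time:.3f}s")
```

If the data-wait total dominates during the slow phase, the loader (worker count, disk reads, image decoding) is the more likely cause; if compute dominates, the slowdown is more likely on the GPU side.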
