Fix for multi-GPU setup training with a single GPU. #974
base: main
Conversation
check_nvidia() originally spawns a new process for nvidia-smi, which bypasses any GPU restriction set through an OS environment variable (such as CUDA_VISIBLE_DEVICES), since that restriction is not reflected in the new process. This PR adds a check for whether the visible GPUs are limited by the OS environment: if multiple GPUs are still enabled, it raises an error like the original behaviour; if only one GPU is enabled, it returns output for that GPU only.
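The behaviour described above can be sketched as follows. This is a hypothetical, simplified reimplementation for illustration only (the real function lives in the project's trainer patcher, and the function body, error message, and return value here are assumptions, not the PR's exact code):

```python
import os
import subprocess

def check_nvidia():
    """Sketch: count GPUs while honouring CUDA_VISIBLE_DEVICES.

    A subprocess running nvidia-smi sees every physical GPU, so the
    restriction must be read from this process's own environment first.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        # Respect the OS-level restriction instead of trusting nvidia-smi.
        devices = [d for d in visible.split(",") if d.strip() != ""]
        if len(devices) > 1:
            # Multiple GPUs still enabled: keep the original error behaviour.
            raise RuntimeError("Multi-GPU training is not supported.")
        return devices  # only the single visible GPU
    # No restriction set: fall back to querying nvidia-smi directly.
    output = subprocess.check_output(["nvidia-smi", "--list-gpus"])
    return output.decode().strip().splitlines()
```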
Add fixed code to the trainer patcher.
Fixed misnamed variable in trainer patcher.
Ready to merge; tested in my local multi-GPU setup, and single-GPU training is now possible when limited by the OS environment.
There are several instances where potential failures (like in convert_to_fast_tokenizer, try_fix_tokenizer, and assert_same_tokenization) are not properly logged or handled. Instead of silently returning the slow tokenizer, there should be logs or warnings for failed conversions or mismatches.
if not check_vocab or not check_special:
    logger.warning("Vocab or special tokens do not match between slow and fast tokenizer.")
    return slow_tokenizer
@RahulVadisetty91 Hello, your review does not seem relevant to my PR. Can you elaborate on the relevancy?
index_for_cuda = -1
if "CUDA_VISIBLE_DEVICES" in os.environ:
    index_for_cuda = os.environ["CUDA_VISIBLE_DEVICES"]
Suggested change (replacing the three lines above):

    index_for_cuda = os.environ.get("CUDA_VISIBLE_DEVICES", -1)
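One subtlety with the suggested one-liner: os.environ.get returns the raw string when the variable is set, but the int -1 when it is not, so index_for_cuda ends up with two possible types. A hedged sketch of keeping one type (the helper name and string sentinel are illustrative assumptions, not code from the PR):

```python
import os

def cuda_index(env=None):
    """Sketch: read CUDA_VISIBLE_DEVICES while keeping one return type.

    With os.environ.get("CUDA_VISIBLE_DEVICES", -1) the result is a str
    when the variable is set but an int when it is not; a string
    sentinel avoids type checks downstream. Hypothetical helper.
    """
    if env is None:
        env = os.environ
    return env.get("CUDA_VISIBLE_DEVICES", "-1")
```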
What if CUDA_VISIBLE_DEVICES="0,1,2"?
The next few lines would take care of that I suppose.
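The handling referred to above amounts to splitting the variable's value. A minimal sketch of parsing a comma-separated list such as "0,1,2" (hypothetical helper name; assumes numeric device indices rather than GPU UUIDs, which CUDA_VISIBLE_DEVICES also permits):

```python
def visible_gpu_indices(value):
    """Sketch: parse a CUDA_VISIBLE_DEVICES-style string into int indices.

    Illustrative only; the real handling lives in the PR's patched code.
    Skips empty entries such as the middle of "0,,1".
    """
    return [int(part) for part in value.split(",") if part.strip()]
```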
Right, sorry
@Sehyo @hife-ai While development is ongoing, wouldn't it be better for someone to document the os.environ behaviour in the README? This is a problem for the notebooks too. Also, I think we should use a device argument (which can be a list or an int) and rely on that everywhere rather than checking like this. And if someone specifies multiple devices, the code can exit at the start, saying multi-GPU is not supported. What do you say?
@giuliabaldini in that case can you please link that here?
@Datta0 I am still testing that it works, I will open the PR as soon as I am sure and link it here!
I am not sure why you would think that is the case. The trainer patcher clearly contains code that will break multi-GPU functionality without my PR. This is the appropriate solution.
Sure, I could do it tomorrow if necessary.
Let me know if you can't and I will do it!