[RFC] Add HuggingFace tests with pinned dependencies to CI #8542
Open
tengyifei opened this issue
Jan 7, 2025
· 0 comments
· May be fixed by GoogleCloudPlatform/ml-auto-solutions#548
Comments
tengyifei
added a commit
to GoogleCloudPlatform/ml-auto-solutions
that referenced
this issue
Jan 9, 2025
As proposed in pytorch/xla#8542, this change adds back accelerate smoke test and bert training example to 2.6 and nightly CI. Additionally, llama2 training, accelerate smoke test, and bert training are modified to install huggingface dependencies following a constraint file.
tengyifei
added a commit
to GoogleCloudPlatform/ml-auto-solutions
that referenced
this issue
Jan 9, 2025
* Revert "Remove HuggingFace tests from PyTorch/XLA CI (#513)"
This reverts commit 313e33a.
* Add back most Huggingface tests
As proposed in pytorch/xla#8542, this change adds back accelerate smoke test and bert training example to 2.6 and nightly CI. Additionally, llama2 training, accelerate smoke test, and bert training are modified to install huggingface dependencies following a constraint file.
tengyifei
added a commit
to GoogleCloudPlatform/ml-auto-solutions
that referenced
this issue
Jan 14, 2025
This change fixes pytorch/xla#8542. See the linked proposal. I referenced https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/training/trillium/Diffusion-2-PyTorch when creating this test.
🚀 Feature
This RFC proposes adding a number of tests to PyTorch/XLA CI that exercise the
combination of `torch_xla` and Hugging Face libraries.
Motivation
Testing against our customers' code ensures that we do not break common user
workflows.
Pitch
Historically, PyTorch/XLA CI had some HuggingFace tests that installed the latest
versions of `transformers`, `diffusers`, and `accelerate` from the main branch of
the respective git repositories. That caused test breakage whenever HuggingFace
introduced backwards-incompatible changes. To prevent those issues, we'll pin
HuggingFace libraries to a fixed version when running the tests.
In principle, we should also pin other packages that may affect training,
such as `numpy`. However, `torch_xla` and `torch` themselves depend on a
number of Python libraries, such as `numpy` and `networkx`. Therefore we'll keep
the list of pinned packages small to start with, and we can always grow it later if
a particular package becomes problematic.
List of tests
We propose these tests, which are a slight variation on the existing tests removed
in [3].
`llama2-google-next-training` branch in the pytorch-tpu fork of HF transformers
`main` branch in the pytorch-tpu fork of HF diffusers
The SD2 training test will be added referencing the recipe in `tpu-recipes` [2].
Note #1: the accelerate test broke for a few weeks and we suspected it was due to
upstream changes in Hugging Face. After I filed [4], it turned out that this was
really a case of PyTorch/XLA changes [5] breaking Hugging Face. When we add back
this test we should work around the breakage.
Note #2: during local testing, the `bert` test hit a race condition at the end,
causing an `OSError: handle is closed`. That also looks like a legitimate error
stemming from incorrect multiprocessing usage.
Initial pinned versions
Based on local testing, I've narrowed the pins down to the following versions, which
work for the above tests:
We'll check in this file as `pip-constraints.txt` (a constraint file [1]) in
https://github.com/GoogleCloudPlatform/ml-auto-solutions, so that whenever a
HuggingFace library is installed, it is constrained to one of the tested
versions. This file will be shared by all tests in the list above.
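
To make the mechanism concrete, here is a minimal sketch of how a test could install a library under a shared constraint file. It relies on pip's `--constraint` option, which restricts the version chosen for a package without causing that package to be installed on its own. The helper names and the entries in `EXAMPLE_PINS` are hypothetical placeholders, not the versions tested for this RFC.

```python
import sys
from pathlib import Path

# Hypothetical pins for illustration only; NOT the versions tested in this RFC.
EXAMPLE_PINS = {"some-hf-library": "1.2.3"}


def write_constraints(path, pins):
    """Write one `name==version` line per pinned package, sorted by name."""
    lines = [f"{name}=={version}" for name, version in sorted(pins.items())]
    Path(path).write_text("\n".join(lines) + "\n")


def pip_install_command(requirements, constraints_path):
    """Build a pip command that installs `requirements` under the constraints."""
    return [
        sys.executable, "-m", "pip", "install",
        "--constraint", str(constraints_path),
        *requirements,
    ]
```

A test would pass the returned list to `subprocess.run(..., check=True)`. Because every test points at the same file, updating a pin in one place changes it for all of them.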
`transformers` will be installed from
https://github.com/pytorch-tpu/transformers/tree/llama2-google-next-training
and `diffusers` will be installed from
https://github.com/pytorch-tpu/diffusers/tree/main. If we don't touch these
branches, then they will also be effectively pinned.
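
One possible way to express the fork installs is as pip direct references of the form `name @ git+URL@ref`. The sketch below only builds the requirement strings; the helper name `git_requirement` is hypothetical, and the RFC does not state which install syntax the tests actually use.

```python
def git_requirement(package, repo_url, ref):
    """Return a pip direct reference that installs `package` from a git ref."""
    return f"{package} @ git+{repo_url}@{ref}"


# Repository URLs and branch names are the ones named in this RFC.
TRANSFORMERS_REQ = git_requirement(
    "transformers",
    "https://github.com/pytorch-tpu/transformers",
    "llama2-google-next-training",
)
DIFFUSERS_REQ = git_requirement(
    "diffusers",
    "https://github.com/pytorch-tpu/diffusers",
    "main",
)
```

A branch name stays pinned only as long as nobody pushes to the branch. Passing a full commit hash as `ref` instead would keep the install fixed even if the branch moves.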
What to do if a test fails?
We should prioritize reverting the offending PR if a change in `torch_xla`
breaks the HuggingFace tests.
Alternatives
It's also worth testing tip-of-tree versions of HuggingFace libraries against
stable versions of `torch_xla`. This ensures that HuggingFace does not introduce
new breakages in their development cycle. We should work with the HuggingFace
team to help them set up the tests on their end. That can be done independently
of this proposal.
Additional context
We had some HuggingFace tests for a while, but they frequently broke due to the
lack of version pinning, and they were removed in [3].