[RFC] Add HuggingFace tests with pinned dependencies to CI #8542

Open
tengyifei opened this issue Jan 7, 2025 · 0 comments · May be fixed by GoogleCloudPlatform/ml-auto-solutions#548

🚀 Feature

This RFC proposes adding a number of tests to PyTorch/XLA CI that exercise the
combination of torch_xla and Hugging Face libraries.

Motivation

Testing against our customers' code ensures that we do not break common user
workflows.

Pitch

Historically, PyTorch/XLA CI had some HuggingFace tests that installed the latest
versions of transformers, diffusers, and accelerate from the main branch of
their respective git repositories. That caused test breakage whenever HuggingFace
introduced backwards-incompatible changes. To prevent those issues, we'll pin
HuggingFace libraries to a fixed version when running the tests.

In principle, we should also pin other packages that may affect training,
such as numpy. However, torch_xla and torch themselves depend on a
number of Python libraries, such as numpy and networkx. Therefore we'll keep
the list of pinned packages small to start with, and we can always grow it later if
a particular package becomes problematic.

List of tests

We propose the following tests, which are a slight variation of the existing tests
removed in [3].

| Name | Type | Test in nightly? | Test in RC? | Notes |
| --- | --- | --- | --- | --- |
| Llama 2 7B training | Example | Yes (already exists) | Yes (already exists) | Testing the llama2-google-next-training branch in pytorch-tpu fork of HF transformers |
| SD2 training | Example | New addition | New addition | Testing the main branch in pytorch-tpu fork of HF diffusers |
| accelerate test | Smoke test | Add back | Add back | See note #1. |
| bert | Example | Add back | Add back | This exercises our own test (pytorch/xla/test/pjrt/test_train_hf_transformer.py) so we should run it |
| diffusers | Example | Remove | Remove | This trains stable-diffusion-v1. Replaced by planned SD2 training test |

The SD2 training test will be added based on the recipe in tpu-recipes [2].

Note #1: the accelerate test broke for a few weeks and we suspected it was due to
upstream changes in Hugging Face. After I filed [4], it turned out that this was
really a case of PyTorch/XLA changes [5] breaking Hugging Face. When we add back
this test we should work around the breakage.

Note #2: during local testing, the bert test hit a race condition at the end,
causing an OSError: handle is closed. That also looks like a legitimate error
stemming from incorrect multiprocessing usage.

Initial pinned versions

Based on local testing, I've narrowed it down to the following versions that work for
the above tests:

accelerate==1.2.1
datasets==3.2.0
evaluate==0.4.3
huggingface-hub==0.27.1
safetensors==0.5.0
tokenizers==0.19.1

We'll check in this list as a pip-constraints.txt (constraint file [1]) in
https://github.com/GoogleCloudPlatform/ml-auto-solutions, so that whenever a
HuggingFace library is installed, it is constrained to the tested
version. This file will be shared by all tests in the list above.
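
As a rough illustration of how the constraint file behaves: pip reads it via `pip install -c pip-constraints.txt <package>`, and it only restricts the versions of packages that end up being installed; it does not install anything by itself. The sketch below parses the pinned list above and reports installed packages whose version differs from the pin. The helper names `parse_constraints` and `find_mismatches` are hypothetical and not part of any CI code; this is only a sanity check one might run locally, assuming the file contains exact `name==version` pins.

```python
from importlib import metadata

# The pinned versions proposed in this RFC.
CONSTRAINTS = """\
accelerate==1.2.1
datasets==3.2.0
evaluate==0.4.3
huggingface-hub==0.27.1
safetensors==0.5.0
tokenizers==0.19.1
"""


def parse_constraints(text: str) -> dict[str, str]:
    """Parse exact 'name==version' pins, skipping blank lines and comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"only exact pins are supported: {raw!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {name: (installed, pinned)} for installed packages off their pin."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Like a constraint file, ignore packages that are not installed.
            continue
        if installed != pinned:
            mismatches[name] = (installed, pinned)
    return mismatches


if __name__ == "__main__":
    for name, (installed, pinned) in find_mismatches(parse_constraints(CONSTRAINTS)).items():
        print(f"{name}: installed {installed}, pinned {pinned}")
```

Skipping packages that are not installed mirrors the constraint-file semantics: a pin applies only if something pulls the package in.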

transformers will be installed from
https://github.com/pytorch-tpu/transformers/tree/llama2-google-next-training
and diffusers will be installed from
https://github.com/pytorch-tpu/diffusers/tree/main. If we don't touch these
branches, then they will also be effectively pinned.
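
A minimal sketch of what the install step could look like, assuming the tests install the forks through pip's VCS support (`git+<url>@<ref>`) together with the shared constraint file. The repository URLs and branch names come from this RFC; the helper names and the `HF_FORKS` table are hypothetical and only serve the illustration.

```python
import shlex

# Fork URL and branch for each library, as listed in this RFC.
HF_FORKS = {
    "transformers": (
        "https://github.com/pytorch-tpu/transformers",
        "llama2-google-next-training",
    ),
    "diffusers": ("https://github.com/pytorch-tpu/diffusers", "main"),
}


def git_requirement(repo_url: str, ref: str) -> str:
    """Build a pip VCS requirement for a git repository at a branch, tag, or commit."""
    return f"git+{repo_url}@{ref}"


def pip_install_command(packages, constraints_file="pip-constraints.txt"):
    """Return the argv for installing the given forks under the shared constraints."""
    requirements = [git_requirement(*HF_FORKS[name]) for name in packages]
    return ["python", "-m", "pip", "install", "-c", constraints_file, *requirements]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(shlex.join(pip_install_command(["transformers", "diffusers"])))
```

Because `ref` can also be a commit hash, the same helper would work if we later decide to pin the forks to a specific commit instead of relying on the branches staying untouched.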

What to do if a test fails?

We should prioritize reverting the offending PR if a change in torch_xla
breaks the HuggingFace tests.

Alternatives

It's also worth testing tip-of-tree versions of HuggingFace libraries against
stable versions of torch_xla. This ensures that HuggingFace does not introduce
new breakages in their development cycle. We should work with the HuggingFace
team to help them set up the tests on their end. That can be done independently
of this proposal.

Additional context

We had some HuggingFace tests for a while, but they frequently broke due to the
lack of version pinning, and they were removed in [3].

tengyifei self-assigned this Jan 7, 2025
tengyifei added a commit to GoogleCloudPlatform/ml-auto-solutions that referenced this issue Jan 9, 2025
As proposed in pytorch/xla#8542, this change
adds back accelerate smoke test and bert training example to 2.6 and
nightly CI.

Additionally, llama2 training, accelerate smoke test, and bert training
are modified to install huggingface dependencies following a constraint
file.
tengyifei added a commit to GoogleCloudPlatform/ml-auto-solutions that referenced this issue Jan 9, 2025
* Revert "Remove HuggingFace tests from PyTorch/XLA CI (#513)"

This reverts commit 313e33a.

* Add back most Huggingface tests

As proposed in pytorch/xla#8542, this change
adds back accelerate smoke test and bert training example to 2.6 and
nightly CI.

Additionally, llama2 training, accelerate smoke test, and bert training
are modified to install huggingface dependencies following a constraint
file.
tengyifei added a commit to GoogleCloudPlatform/ml-auto-solutions that referenced this issue Jan 14, 2025