[RFC] Add HuggingFace tests with pinned dependencies to CI

## 🚀 Feature

This RFC proposes adding a number of tests to PyTorch/XLA CI that exercise the
combination of `torch_xla` and Hugging Face libraries.


## Motivation

Testing against our customers' code helps ensure that we do not break common
user workflows.


## Pitch

Historically, PyTorch/XLA CI had some HuggingFace tests that installed the latest
versions of `transformers`, `diffusers`, and `accelerate` from the main branch of
their respective git repositories. That caused test breakage whenever HuggingFace
introduced backwards-incompatible changes. To prevent those issues, we'll pin
HuggingFace libraries to fixed versions when running the tests.

In principle, we should also pin other packages that may affect training,
such as `numpy`. However, `torch_xla` and `torch` themselves depend on a
number of Python libraries, such as `numpy` and `networkx`, so pinning those
risks conflicting with the versions they require. Therefore we'll keep the list
of pinned packages small to start with, and we can always grow it later if a
particular package becomes problematic.


### List of tests

We propose the following tests, which are a slight variation of the existing
tests removed in [3].

| Name                | Type       | Test in nightly?     | Test in RC?          | Notes                                                                                                |
|---------------------|------------|----------------------|----------------------|------------------------------------------------------------------------------------------------------|
| Llama 2 7B training | Example    | Yes (already exists) | Yes (already exists) | Testing the `llama2-google-next-training` branch in pytorch-tpu fork of HF transformers              |
| SD2 training        | Example    | **New addition**     | **New addition**     | Testing the `main` branch in pytorch-tpu fork of HF diffusers                                        |
| accelerate test     | Smoke test | Add back             | Add back             | See note #1.                                                                                         |
| bert                | Example    | Add back             | Add back             | This exercises our own test (pytorch/xla/test/pjrt/test_train_hf_transformer.py) so we should run it |
| diffusers           | Example    | **Remove**           | **Remove**           | This trains stable-diffusion-v1. Replaced by planned SD2 training test                               |

The SD2 training test will be added referencing the recipe in `tpu-recipes` [2].

Note #1: the accelerate test broke for a few weeks, and we suspected it was due
to upstream changes in Hugging Face. After I filed [4], it turned out that this
was really a case of PyTorch/XLA changes [5] breaking Hugging Face. When we add
back this test, we should work around the breakage.

Note #2: during local testing, the `bert` test hit a race condition at the end,
causing an `OSError: handle is closed`. That also looks like a legitimate error
stemming from incorrect multiprocessing usage.

### Initial pinned versions

Based on local testing, I've narrowed it down to the following versions, which
work for the above tests:

```
accelerate==1.2.1
datasets==3.2.0
evaluate==0.4.3
huggingface-hub==0.27.1
safetensors==0.5.0
tokenizers==0.19.1
```

We'll check in this list as `pip-constraints.txt` (a constraints file [1]) in
https://github.com/GoogleCloudPlatform/ml-auto-solutions, so that whenever one
of these HuggingFace libraries is installed, it is constrained to the tested
version. This file will be shared by all tests in the list above.
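As a rough illustration of how a test job could confirm that the constraints took effect, here is a minimal sketch that compares installed versions against the pins listed above. The helper names `parse_pins` and `find_mismatches` are hypothetical and not part of any existing CI code; the sketch also assumes every line in the file is an exact `name==version` pin.

```python
from importlib import metadata

# The pins proposed in this RFC, inlined here so the sketch is self-contained.
PINS_TEXT = """\
accelerate==1.2.1
datasets==3.2.0
evaluate==0.4.3
huggingface-hub==0.27.1
safetensors==0.5.0
tokenizers==0.19.1
"""


def parse_pins(text):
    """Parse `name==version` lines, skipping blank lines and `#` comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"not an exact pin: {raw!r}")
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, get_version=metadata.version):
    """Return {name: (expected, installed)} for packages that are off their pin.

    A package that is not installed is reported with installed=None. Note that
    a constraints file does not force installation, so None may be acceptable
    for tests that do not need that package.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(parse_pins(PINS_TEXT)).items():
        print(f"{name}: expected {expected}, found {installed}")
```

Running a check like this at the start of a test would make a version drift show up as an explicit message rather than as an unrelated training failure.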

`transformers` will be installed from
https://github.com/pytorch-tpu/transformers/tree/llama2-google-next-training
and `diffusers` will be installed from
https://github.com/pytorch-tpu/diffusers/tree/main. If we don't touch these
branches, then they will also be effectively pinned.
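The install step described above could be sketched as follows. This is only an illustration, assuming the constraints file sits in the working directory under the name `pip-constraints.txt`; the helper names `git_requirement` and `pip_install_command` are hypothetical. It relies on pip's `-c` (constraint) option and on pip's `name @ git+<url>@<ref>` direct-reference syntax.

```python
import sys


def git_requirement(name, repo_url, ref):
    """Build a pip direct-reference requirement for a git branch, tag, or commit."""
    return f"{name} @ git+{repo_url}@{ref}"


def pip_install_command(requirement, constraints="pip-constraints.txt"):
    """Build a `pip install` command that applies the shared constraints file."""
    return [sys.executable, "-m", "pip", "install", "-c", constraints, requirement]


# The two forks named in this RFC, installed from their respective branches.
REQUIREMENTS = [
    git_requirement(
        "transformers",
        "https://github.com/pytorch-tpu/transformers",
        "llama2-google-next-training",
    ),
    git_requirement("diffusers", "https://github.com/pytorch-tpu/diffusers", "main"),
]

if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for requirement in REQUIREMENTS:
        print(" ".join(pip_install_command(requirement)))
```

Because a branch name can move, passing a commit hash as `ref` would be one way to make the pin independent of later pushes to these branches; whether that is worth the extra maintenance is left open here.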


### What to do if a test fails?

We should prioritize reverting the offending PR if a change in `torch_xla`
breaks HuggingFace tests.


## Alternatives

It's also worth testing tip-of-tree versions of HuggingFace libraries against
stable versions of `torch_xla`. This helps catch new breakages that HuggingFace
introduces during their development cycle. We should work with the HuggingFace
team to help them set up the tests on their end. That can be done independently
of this proposal.


## Additional context

We had some HuggingFace tests for a while, but they frequently broke due to the
lack of version pinning and were removed in [3].


[1]: https://pip.pypa.io/en/stable/user_guide/#constraints-files
[2]: https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/Diffusion-2-PyTorch/train.sh
[3]: https://github.com/GoogleCloudPlatform/ml-auto-solutions/commit/313e33af1c6011b071c773749d08ddb3524353af#diff-8b676cd9c43692fb6a49b7b4cfed15b388a2cbe2bfa389b933ceefefcf419c03L37
[4]: https://github.com/huggingface/accelerate/issues/3304
[5]: https://github.com/pytorch/xla/commit/3cd46778cc170cf744d8ed1a7eac095e07064f5b

