
Transformers Example with Ignite

In this example, we show how to use Ignite to fine-tune a transformer model and (see the sketch after this list):

  • train on 1 or more GPUs or TPUs
  • compute training/validation metrics
  • log the learning rate, metrics, etc.
  • save the best model weights
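The example's actual training loop lives in main.py; the following is only a minimal sketch of the underlying Ignite pattern, with the model name, dummy data, learning rate, and /tmp/models checkpoint directory all chosen here for illustration:

# Minimal sketch: fine-tune a transformers model with an Ignite trainer/evaluator
# (illustrative only -- main.py's actual implementation may differ)
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from ignite.engine import Engine, Events
from ignite.metrics import Accuracy
from ignite.handlers import ModelCheckpoint, global_step_from_engine

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=5e-5)

# Tiny dummy dataset so the sketch is self-contained
texts, labels = ["great movie", "terrible film"] * 8, [1, 0] * 8
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
data = [{**{k: v[i] for k, v in enc.items()}, "labels": torch.tensor(labels[i])} for i in range(len(texts))]
loader = DataLoader(data, batch_size=4)

def train_step(engine, batch):
    model.train()
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss  # transformers computes the loss when labels are passed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def eval_step(engine, batch):
    model.eval()
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        output = model(**batch)
    return output.logits, batch["labels"]

trainer, evaluator = Engine(train_step), Engine(eval_step)
Accuracy().attach(evaluator, "accuracy")

# Keep only the best weights according to validation accuracy
best_model = ModelCheckpoint("/tmp/models", "best", n_saved=1, require_empty=False,
                             score_function=lambda e: e.state.metrics["accuracy"], score_name="accuracy",
                             global_step_transform=global_step_from_engine(trainer))
evaluator.add_event_handler(Events.COMPLETED, best_model, {"model": model})

@trainer.on(Events.EPOCH_COMPLETED)
def validate(engine):
    evaluator.run(loader)  # reusing the dummy loader as "validation" data here

trainer.run(loader, max_epochs=1)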

Configurations:

  • single GPU
  • multiple GPUs on a single node
  • TPUs on Colab

Requirements:

Install all the requirements with pip install -r requirements.txt.
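The full pinned list lives in requirements.txt; as a quick sanity check that the assumed core dependencies (PyTorch, pytorch-ignite, transformers, datasets) are importable:

# Sanity check for the assumed core dependencies of this example
import torch, ignite, transformers, datasets
print(torch.__version__, ignite.__version__, transformers.__version__, datasets.__version__)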

Usage:

Run the example on a single GPU:

python main.py run

If needed, adjust the batch size to your GPU device with the --batch_size argument.

The default model is bert-base-uncased. In case you need to change it, use the --model argument; for details on which models can be used, refer to the transformers documentation.

Example:

# Using DistilBERT, which has 40% fewer parameters than bert-base-uncased
python main.py run --model="distilbert-base-uncased"
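Any checkpoint name that the transformers Auto classes can resolve should work with --model; one quick, hypothetical way to verify a name before starting a long run (this check is not part of main.py):

# Check that a model id resolves to a known configuration (raises on unknown ids)
from transformers import AutoConfig
config = AutoConfig.from_pretrained("distilbert-base-uncased")
print(type(config).__name__)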

For details on accepted arguments:

python main.py run -- --help

Distributed training

Single node, multiple GPUs

Let's start training on a single node with 2 GPUs:

# using torch.distributed.launch
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl"

or

# using function spawn inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
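Both launch modes are typically unified through ignite.distributed.Parallel, which either attaches to an existing launcher or spawns the worker processes itself; a minimal sketch of that mechanism (illustrative, main.py may wire it differently):

# Minimal sketch of ignite.distributed.Parallel handling both launch modes
import ignite.distributed as idist

def training(local_rank, config):
    # each worker process sees its own rank and device
    print(idist.get_rank(), idist.device(), config)

if __name__ == "__main__":
    # under torch.distributed.launch the launcher sets the world size;
    # when spawning from code, nproc_per_node is passed here instead
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, {"batch_size": 32})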

Using Horovod as distributed backend

Please make sure Horovod is installed before running.

Let's start training on a single node with 2 GPUs:

# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"

or

# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2

Colab or Kaggle kernels, on 8 TPUs

# setup TPU environment
import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
VERSION = "nightly"
!curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION > /dev/null
from main import run
run(backend="xla-tpu", nproc_per_node=8)
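Once the setup script finishes, a quick assumed check (not part of the example) that PyTorch/XLA can see a TPU device:

# Verify that PyTorch/XLA resolves a TPU device
import torch_xla.core.xla_model as xm
print(xm.xla_device())  # e.g. xla:1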

ClearML fileserver

If a ClearML server is used (i.e., with the --with_clearml argument), the configuration for uploading artifacts must be done by modifying the ClearML configuration file ~/clearml.conf generated by clearml-init. According to the documentation, the output_uri argument can be configured in sdk.development.default_output_uri to the fileserver URI. If the server is self-hosted, the ClearML fileserver URI is http://localhost:8081.
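For example, the relevant ~/clearml.conf entry could look like the following (a sketch for a self-hosted server; adjust the URI to your deployment):

# ~/clearml.conf (excerpt)
sdk {
    development {
        # upload artifacts to the self-hosted fileserver
        default_output_uri: "http://localhost:8081"
    }
}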

For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html