In this example, we show how to use Ignite to fine-tune a transformer model:
- train on one or more GPUs or TPUs
- compute training/validation metrics
- log the learning rate, metrics, etc.
- save the best model weights (see the sketch below)
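Before the setup details, here is a minimal, self-contained sketch of that pattern, assuming only PyTorch and Ignite are installed. A toy linear model stands in for the transformer, and the checkpoint directory is illustrative; `main.py` wires in the real model, loggers, and distributed setup:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.metrics import Accuracy, Loss
from ignite.handlers import Checkpoint, DiskSaver

# toy stand-ins so the sketch runs end to end; main.py uses a transformers model instead
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
X, y = torch.randn(64, 8), torch.randint(0, 2, (64,))
train_loader = val_loader = DataLoader(TensorDataset(X, y), batch_size=16)

trainer = create_supervised_trainer(model, optimizer, criterion)
evaluator = create_supervised_evaluator(
    model, metrics={"accuracy": Accuracy(), "loss": Loss(criterion)}
)

@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(_):
    evaluator.run(val_loader)

# keep only the best weights, scored by validation accuracy
# ("/tmp/checkpoints" is a hypothetical path, adjust as needed)
best_ckpt = Checkpoint(
    {"model": model},
    DiskSaver("/tmp/checkpoints", require_empty=False),
    n_saved=1,
    score_function=lambda engine: engine.state.metrics["accuracy"],
    score_name="accuracy",
)
evaluator.add_event_handler(Events.COMPLETED, best_ckpt)

trainer.run(train_loader, max_epochs=2)
```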
Configurations:
- single GPU
- multiple GPUs on a single node
- TPUs on Colab
Requirements:
- pytorch-ignite: `pip install pytorch-ignite`
- transformers: `pip install transformers`
- datasets: `pip install datasets`
- tqdm: `pip install tqdm`
- tensorboardX: `pip install tensorboardX`
- python-fire: `pip install fire`
- Optional: clearml: `pip install clearml`
Alternatively, install all the requirements using `pip install -r requirements.txt`.
Run the example on a single GPU:
```bash
python main.py run
```
If needed, adjust the batch size to your GPU device with the `--batch_size` argument, as shown below.
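For instance, a run with a smaller batch (the value 16 is only illustrative):
```bash
python main.py run --batch_size=16
```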
The default model is `bert-base-uncased`. In case you need to change that, use the `--model` argument; for details on which models can be used, refer here.
Example:
```bash
# Using DistilBERT, which has 40% fewer parameters than bert-base-uncased
python main.py run --model="distilbert-base-uncased"
```
For details on accepted arguments:
```bash
python main.py run -- --help
```
Let's start training on a single node with two GPUs:
```bash
# using torch.distributed.launch
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl"
```
or
```bash
# using function spawn inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
```
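Under the hood, this spawn mode typically relies on Ignite's `idist.Parallel` helper; a minimal sketch of that mechanism follows (the `training` body and `config` dict are placeholders, not the actual `main.py` code):

```python
import ignite.distributed as idist

def training(local_rank, config):
    # runs once per spawned process; local_rank identifies the process on this node
    print(f"process {local_rank} / {idist.get_world_size()} on device {idist.device()}")
    # ... build the model, wrap data loaders with idist.auto_* helpers, run the trainer ...

config = {"batch_size": 32}  # illustrative only

# spawns nproc_per_node processes; backend="horovod" or "xla-tpu" works the same way
with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
    parallel.run(training, config)
```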
Using Horovod as the distributed backend
Please make sure Horovod is installed before running.
Let's start training on a single node with two GPUs:
```bash
# horovodrun
horovodrun -np 2 python -u main.py run --backend="horovod"
```
or
```bash
# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2
```
Run the example on TPUs in Colab:
```python
# setup the TPU environment
import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'

VERSION = "nightly"
!curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION > /dev/null
```
Then start training on the 8 TPU cores:
```python
from main import run

run(backend="xla-tpu", nproc_per_node=8)
```
If a ClearML server is used (i.e. the `--with_clearml` argument), the configuration to upload artifacts must be done by modifying the ClearML configuration file `~/clearml.conf` generated by `clearml-init`. According to the documentation, the `output_uri` argument can be configured in `sdk.development.default_output_uri` to the fileserver URI. If the server is self-hosted, the ClearML fileserver URI is `http://localhost:8081`.
For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html
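For reference, a sketch of the relevant fragment of `~/clearml.conf`, assuming a self-hosted server with the default fileserver port:

```
sdk {
    development {
        # assumes a self-hosted server; point this at your own fileserver URI
        default_output_uri: "http://localhost:8081"
    }
}
```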