Commit
TrainsLogger -> ClearMLLogger (pytorch#1557)
* TrainsLogger -> ClearMLLogger

* TrainsLogger -> ClearMLLogger

* add docs

* add tests for TrainsLogger
Jeff Yang authored Jan 18, 2021
1 parent 55d8cd8 commit 2485fd4
Showing 34 changed files with 1,601 additions and 1,524 deletions.
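For readers updating their own code after this rename, the sketch below applies the identifier changes visible in this commit (`trains_logger` to `clearml_logger`, `TrainsLogger` to `ClearMLLogger`, `TrainsSaver` to `ClearMLSaver`, `with_trains` to `with_clearml`) to a piece of source text. The helper `migrate_source` and the `RENAMES` table are hypothetical names introduced for illustration; they are not part of Ignite or ClearML.

```python
import re

# Old -> new identifiers, taken from the renames visible in this commit.
RENAMES = {
    "trains_logger": "clearml_logger",
    "TrainsLogger": "ClearMLLogger",
    "TrainsSaver": "ClearMLSaver",
    "with_trains": "with_clearml",
}

# Match whole identifiers only, so names that merely contain an old
# identifier (e.g. "my_trains_logger_v2") are left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_source(text: str) -> str:
    """Return `text` with the old Trains identifiers replaced by the ClearML ones."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], text)


old = "from ignite.contrib.handlers.trains_logger import TrainsLogger, TrainsSaver"
print(migrate_source(old))
```

The bare package import (`from trains import Task`) is deliberately not covered, since replacing the plain word `trains` could also rewrite unrelated prose or comments; that line is probably easier to change by hand.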
6 changes: 2 additions & 4 deletions README.md
@@ -4,7 +4,6 @@

<img src="assets/logo/ignite_logo_mixed.svg" width=512>


<!-- [![image](https://travis-ci.com/pytorch/ignite.svg?branch=master)](https://travis-ci.com/pytorch/ignite) -->

| ![image](https://img.shields.io/badge/-Tests:-black?style=flat-square) [![image](https://github.com/pytorch/ignite/workflows/Run%20unit%20tests/badge.svg)](https://github.com/pytorch/ignite/actions) [![image](https://img.shields.io/badge/-GPU%20tests-black?style=flat-square)](https://app.circleci.com/pipelines/github/pytorch/ignite?branch=master)[![image](https://circleci.com/gh/pytorch/ignite.svg?style=svg)](https://app.circleci.com/pipelines/github/pytorch/ignite?branch=master) [![image](https://codecov.io/gh/pytorch/ignite/branch/master/graph/badge.svg)](https://codecov.io/gh/pytorch/ignite) [![image](https://img.shields.io/badge/dynamic/json.svg?label=docs&url=https%3A%2F%2Fpypi.org%2Fpypi%2Fpytorch-ignite%2Fjson&query=%24.info.version&colorB=brightgreen&prefix=v)](https://pytorch.org/ignite/index.html) |
@@ -364,10 +363,9 @@ Complete list of examples can be found [here](https://pytorch.org/ignite/example
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pytorch/ignite/blob/master/examples/notebooks/MNIST_on_TPU.ipynb) [MNIST training on a single
TPU](https://github.com/pytorch/ignite/blob/master/examples/notebooks/MNIST_on_TPU.ipynb)
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx) [CIFAR10 Training on multiple TPUs](https://github.com/pytorch/ignite/tree/master/examples/contrib/cifar10)
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pytorch/ignite/blob/master/examples/notebooks/HandlersTimeProfiler_MNIST.ipynb) [Basic example of handlers
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pytorch/ignite/blob/master/examples/notebooks/HandlersTimeProfiler_MNIST.ipynb) [Basic example of handlers
time profiling on MNIST training example](https://github.com/pytorch/ignite/blob/master/examples/notebooks/HandlersTimeProfiler_MNIST.ipynb)


## Reproducible Training Examples

Inspired by [torchvision/references](https://github.com/pytorch/vision/tree/master/references),
@@ -379,7 +377,7 @@ we provide several reproducible baselines for vision tasks:
Features:

- Distributed training with mixed precision by [nvidia/apex](https://github.com/NVIDIA/apex/)
- Experiments tracking with [MLflow](https://mlflow.org/), [Polyaxon](https://polyaxon.com/) or [TRAINS](https://github.com/allegroai/trains/)
- Experiments tracking with [MLflow](https://mlflow.org/), [Polyaxon](https://polyaxon.com/) or [ClearML](https://github.com/allegroai/clearml/)

<!-- ############################################################################################################### -->

2 changes: 1 addition & 1 deletion docker/hvd/Dockerfile.hvd-apex-vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
opencv-python \
py_config_runner \
pillow \
"trains>=0.15.0"
clearml
2 changes: 1 addition & 1 deletion docker/hvd/Dockerfile.hvd-vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
opencv-python \
py_config_runner \
pillow \
trains
clearml
2 changes: 1 addition & 1 deletion docker/main/Dockerfile.apex-vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
opencv-python \
py_config_runner \
pillow \
trains
clearml
2 changes: 1 addition & 1 deletion docker/main/Dockerfile.vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
opencv-python \
py_config_runner \
pillow \
trains
clearml
2 changes: 1 addition & 1 deletion docker/msdp/Dockerfile.msdp-apex-vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
opencv-python \
py_config_runner \
pillow \
trains
clearml
14 changes: 9 additions & 5 deletions docs/source/contrib/handlers.rst
@@ -153,22 +153,27 @@ for detailed usage.
:members:
:inherited-members:

trains_logger
clearml_logger
---------------

See `trains mnist example <https://github.com/pytorch/ignite/blob/master/examples/contrib/mnist/mnist_with_trains_logger.py>`_
See `clearml mnist example <https://github.com/pytorch/ignite/blob/master/examples/contrib/mnist/mnist_with_clearml_logger.py>`_
for detailed usage.

.. currentmodule:: ignite.contrib.handlers.trains_logger
.. currentmodule:: ignite.contrib.handlers.clearml_logger

.. autosummary::
:nosignatures:
:autolist:

.. automodule:: ignite.contrib.handlers.trains_logger
.. automodule:: ignite.contrib.handlers.clearml_logger
:members:
:inherited-members:

trains_logger
--------------

.. note:: ``trains_logger`` was renamed to ``clearml_logger``. Please refer to :ref:`clearml_logger`.

More on parameter scheduling
----------------------------

@@ -477,4 +482,3 @@ Concatenate with torch schedulers
.. image:: ../_static/img/schedulers/concat_linear_exp_step_lr.png
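
The note in this file says that ``trains_logger`` was renamed to ``clearml_logger``. One common way to keep old imports working across such a rename is a thin subclass that emits a deprecation warning. The sketch below only illustrates that general pattern and is not taken from Ignite's source: the `ClearMLLogger` here is a stand-in class so the example runs on its own, and the warning text is an assumption.

```python
import warnings


class ClearMLLogger:
    """Stand-in for the real logger class; only here so the sketch runs on its own."""

    def __init__(self, project_name=None, task_name=None):
        self.project_name = project_name
        self.task_name = task_name


class TrainsLogger(ClearMLLogger):
    """Hypothetical compatibility alias: warns, then behaves like ClearMLLogger."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "TrainsLogger was renamed to ClearMLLogger; please update your imports.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this shim
        )
        super().__init__(*args, **kwargs)


# Record warnings so the deprecation notice can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    logger = TrainsLogger(project_name="examples", task_name="ignite")

print(type(logger).__name__, len(caught))
```

Because the alias subclasses the new class, `isinstance` checks against `ClearMLLogger` keep working for code that still constructs the old name.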

2 changes: 1 addition & 1 deletion docs/source/examples.rst
@@ -77,4 +77,4 @@ reproducible baselines for vision tasks:
Features:

- Distributed training with mixed precision by `nvidia/apex <https://github.com/NVIDIA/apex/>`_
- Experiments tracking with `MLflow <https://mlflow.org/>`_ or `Polyaxon <https://polyaxon.com/>`_ or `TRAINS <https://github.com/allegroai/trains/>`_
- Experiments tracking with `MLflow <https://mlflow.org/>`_ or `Polyaxon <https://polyaxon.com/>`_ or `ClearML <https://github.com/allegroai/clearml/>`_
52 changes: 32 additions & 20 deletions examples/contrib/cifar10/README.md
@@ -1,17 +1,18 @@
# CIFAR10 Example with Ignite

In this example, we show how to use *Ignite* to train a neural network:
In this example, we show how to use _Ignite_ to train a neural network:

- on 1 or more GPUs or TPUs
- compute training/validation metrics
- log learning rate, metrics etc
- save the best model weights

Configurations:

* [x] single GPU
* [x] multi GPUs on a single node
* [x] multi GPUs on multiple nodes
* [x] TPUs on Colab
- [x] single GPU
- [x] multi GPUs on a single node
- [x] multi GPUs on multiple nodes
- [x] TPUs on Colab

## Requirements:

@@ -20,21 +21,24 @@ Configurations:
- [tqdm](https://github.com/tqdm/tqdm/): `pip install tqdm`
- [tensorboardx](https://github.com/lanpa/tensorboard-pytorch): `pip install tensorboardX`
- [python-fire](https://github.com/google/python-fire): `pip install fire`
- Optional: [trains](https://github.com/allegroai/trains): `pip install trains`
- Optional: [clearml](https://github.com/allegroai/clearml): `pip install clearml`

## Usage:

Run the example on a single GPU:

```bash
python main.py run
```

For details on accepted arguments:

```bash
python main.py run -- --help
```

If user would like to provide already downloaded dataset, the path can be setup in parameters as

```bash
--data_path="/path/to/cifar10/"
```
@@ -44,11 +48,14 @@ If user would like to provide already downloaded dataset, the path can be setup
#### Single node, multiple GPUs

Let's start training on a single node with 2 gpus:

```bash
# using torch.distributed.launch
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl"
```
or

or

```bash
# using function spawn inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
@@ -59,28 +66,29 @@ python -u main.py run --backend="nccl" --nproc_per_node=2
Please, make sure to have Horovod installed before running.

Let's start training on a single node with 2 gpus:

```bash
# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"
```
or

or

```bash
# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2
```


#### Colab, on 8 TPUs


Same code can be run on TPUs: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx)


#### Multiple nodes, multiple GPUs

Let's start training on two nodes with 2 GPUs each. We assume that the master node is reachable as `master`, e.g. via `ping master`.

1) Execute on master node
1. Execute on master node

```bash
python -u -m torch.distributed.launch \
--nnodes=2 \
@@ -90,7 +98,8 @@ python -u -m torch.distributed.launch \
main.py run --backend="nccl"
```

2) Execute on worker node
2. Execute on worker node

```bash
python -u -m torch.distributed.launch \
--nnodes=2 \
@@ -100,17 +109,18 @@ python -u -m torch.distributed.launch \
main.py run --backend="nccl"
```


### Check resume training

#### Single GPU

Initial training with a stop on 1000 iteration (~11 epochs)

```bash
python main.py run --stop_iteration=1000
```

Resume from the latest checkpoint

```bash
python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-1000/training_checkpoint_1000.pt
```
@@ -120,25 +130,27 @@ python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_sto
#### Single node, multiple GPUs

Initial training on a single node with 2 gpus with a stop on 1000 iteration (~11 epochs):

```bash
# using torch.distributed.launch
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl" --stop_iteration=1000
```

Resume from the latest checkpoint

```bash
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl" \
--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-1000/training_checkpoint_1000.pt
```

Similar commands can be adapted for other cases.

## Trains fileserver
## ClearML fileserver

If `Trains` server is used (i.e. `--with_trains` argument), the configuration to upload artifact must be done by
modifying the `Trains` configuration file `~/trains.config` generated by `trains-init`. According to the
[documentation](https://allegro.ai/docs/examples/reporting/artifacts/), the `output_uri` argument can be
configured in `sdk.development.default_output_uri` to fileserver uri. If server is self-hosted, `Trains` fileserver uri is
If `ClearML` server is used (i.e. `--with_clearml` argument), the configuration to upload artifact must be done by
modifying the `ClearML` configuration file `~/clearml.conf` generated by `clearml-init`. According to the
[documentation](https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html), the `output_uri` argument can be
configured in `sdk.development.default_output_uri` to fileserver uri. If server is self-hosted, `ClearML` fileserver uri is
`http://localhost:8081`.

For more details, see https://allegro.ai/docs/examples/reporting/artifacts/
For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html
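
As an alternative to editing `~/clearml.conf`, the `output_uri` argument mentioned above can be passed when the task is created. The sketch below is a minimal illustration under stated assumptions: it assumes the `clearml` package is installed and a server is reachable, it reuses the self-hosted fileserver address quoted in the README text, and `init_task` and `DEFAULT_FILESERVER_URI` are hypothetical names introduced here.

```python
# Assumption: the self-hosted fileserver address quoted in the README text.
DEFAULT_FILESERVER_URI = "http://localhost:8081"


def init_task(project_name, task_name, output_uri=DEFAULT_FILESERVER_URI):
    """Create a ClearML task whose artifacts are uploaded to `output_uri`."""
    # Imported lazily so this module can be loaded without clearml installed.
    from clearml import Task

    return Task.init(project_name=project_name, task_name=task_name, output_uri=output_uri)
```

Calling `init_task("CIFAR10-Training", "my-run")` would then create the task with uploads directed to the fileserver; the task name here is only a placeholder.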
14 changes: 7 additions & 7 deletions examples/contrib/cifar10/main.py
@@ -47,8 +47,8 @@ def training(local_rank, config):
if "cuda" in device.type:
config["cuda device name"] = torch.cuda.get_device_name(local_rank)

if config["with_trains"]:
from trains import Task
if config["with_clearml"]:
from clearml import Task

task = Task.init("CIFAR10-Training", task_name=output_path.stem)
task.connect_configuration(config)
@@ -150,7 +150,7 @@ def run(
log_every_iters=15,
nproc_per_node=None,
stop_iteration=None,
with_trains=False,
with_clearml=False,
**spawn_kwargs,
):
"""Main entry to train a model on CIFAR10 dataset.
@@ -177,7 +177,7 @@ def run(
log_every_iters (int): argument to log batch loss every ``log_every_iters`` iterations.
It can be 0 to disable it. Default, 15.
stop_iteration (int, optional): iteration to stop the training. Can be used to check resume from checkpoint.
with_trains (bool): if True, experiment Trains logger is setup. Default, False.
with_clearml (bool): if True, experiment ClearML logger is setup. Default, False.
**spawn_kwargs: Other kwargs to spawn run in child processes: master_addr, master_port, node_rank, nnodes
"""
@@ -340,10 +340,10 @@ def train_step(engine, batch):


def get_save_handler(config):
if config["with_trains"]:
from ignite.contrib.handlers.trains_logger import TrainsSaver
if config["with_clearml"]:
from ignite.contrib.handlers.clearml_logger import ClearMLSaver

return TrainsSaver(dirname=config["output_path"])
return ClearMLSaver(dirname=config["output_path"])

return DiskSaver(config["output_path"], require_empty=False)

examples/contrib/mnist/mnist_with_clearml_logger.py
@@ -1,14 +1,14 @@
"""
MNIST example with training and validation monitoring using Trains.
MNIST example with training and validation monitoring using ClearML.
Requirements:
Trains: `pip install trains`
ClearML: `pip install clearml`
Usage:
Run the example:
```bash
python mnist_with_trains_logger.py
python mnist_with_clearml_logger.py
```
"""
from argparse import ArgumentParser
@@ -21,7 +21,7 @@
from torchvision.datasets import MNIST
from torchvision.transforms import Compose, ToTensor, Normalize

from ignite.contrib.handlers.trains_logger import *
from ignite.contrib.handlers.clearml_logger import *
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.handlers import Checkpoint
from ignite.metrics import Accuracy, Loss
@@ -86,43 +86,43 @@ def compute_metrics(engine):
train_evaluator.run(train_loader)
validation_evaluator.run(val_loader)

trains_logger = TrainsLogger(project_name="examples", task_name="ignite")
clearml_logger = ClearMLLogger(project_name="examples", task_name="ignite")

trains_logger.attach_output_handler(
clearml_logger.attach_output_handler(
trainer,
event_name=Events.ITERATION_COMPLETED(every=100),
tag="training",
output_transform=lambda loss: {"batchloss": loss},
)

for tag, evaluator in [("training metrics", train_evaluator), ("validation metrics", validation_evaluator)]:
trains_logger.attach_output_handler(
clearml_logger.attach_output_handler(
evaluator,
event_name=Events.EPOCH_COMPLETED,
tag=tag,
metric_names=["loss", "accuracy"],
global_step_transform=global_step_from_engine(trainer),
)

trains_logger.attach_opt_params_handler(
clearml_logger.attach_opt_params_handler(
trainer, event_name=Events.ITERATION_COMPLETED(every=100), optimizer=optimizer
)

trains_logger.attach(
clearml_logger.attach(
trainer, log_handler=WeightsScalarHandler(model), event_name=Events.ITERATION_COMPLETED(every=100)
)

trains_logger.attach(trainer, log_handler=WeightsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=100))
clearml_logger.attach(trainer, log_handler=WeightsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=100))

trains_logger.attach(
clearml_logger.attach(
trainer, log_handler=GradsScalarHandler(model), event_name=Events.ITERATION_COMPLETED(every=100)
)

trains_logger.attach(trainer, log_handler=GradsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=100))
clearml_logger.attach(trainer, log_handler=GradsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=100))

handler = Checkpoint(
{"model": model},
TrainsSaver(),
ClearMLSaver(),
n_saved=1,
score_function=lambda e: e.state.metrics["accuracy"],
score_name="val_acc",
@@ -134,7 +134,7 @@ def compute_metrics(engine):
# kick everything off
trainer.run(train_loader, max_epochs=epochs)

trains_logger.close()
clearml_logger.close()


if __name__ == "__main__":