TrainsLogger -> ClearMLLogger (pytorch#1557)

* TrainsLogger -> ClearMLLogger
* TrainsLogger -> ClearMLLogger
* add docs
* add tests for TrainsLogger
pranavvp16 · Jan 18, 2021 · 2485fd4
1 parent 55d8cd8
commit 2485fd4
Showing 34 changed files with 1,601 additions and 1,524 deletions.
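
Downstream code that has to run against Ignite releases from both before and after this rename could resolve the handler module at import time. The sketch below is not part of this commit: `import_first_available` and `load_logger_module` are hypothetical helpers, and the only names taken from the diff are the two module paths `ignite.contrib.handlers.clearml_logger` and `ignite.contrib.handlers.trains_logger`.

```python
import importlib


def import_first_available(candidates):
    """Return the first module in `candidates` that can be imported.

    Hypothetical helper for illustration; it is not part of Ignite.
    """
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


def load_logger_module():
    # Prefer the new module path; fall back to the pre-rename path.
    return import_first_available(
        [
            "ignite.contrib.handlers.clearml_logger",
            "ignite.contrib.handlers.trains_logger",
        ]
    )
```

The class names change in this commit as well (`TrainsLogger` to `ClearMLLogger`, `TrainsSaver` to `ClearMLSaver`), so a caller of `load_logger_module()` would still have to look up the logger class under whichever name the returned module exposes.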
diff --git a/README.md b/README.md
@@ -4,7 +4,6 @@
 
 <img src="assets/logo/ignite_logo_mixed.svg" width=512>
 
-
 <!-- [![image](https://travis-ci.com/pytorch/ignite.svg?branch=master)](https://travis-ci.com/pytorch/ignite) -->
 
 | ![image](https://img.shields.io/badge/-Tests:-black?style=flat-square) [![image](https://github.com/pytorch/ignite/workflows/Run%20unit%20tests/badge.svg)](https://github.com/pytorch/ignite/actions) [![image](https://img.shields.io/badge/-GPU%20tests-black?style=flat-square)](https://app.circleci.com/pipelines/github/pytorch/ignite?branch=master)[![image](https://circleci.com/gh/pytorch/ignite.svg?style=svg)](https://app.circleci.com/pipelines/github/pytorch/ignite?branch=master) [![image](https://codecov.io/gh/pytorch/ignite/branch/master/graph/badge.svg)](https://codecov.io/gh/pytorch/ignite) [![image](https://img.shields.io/badge/dynamic/json.svg?label=docs&url=https%3A%2F%2Fpypi.org%2Fpypi%2Fpytorch-ignite%2Fjson&query=%24.info.version&colorB=brightgreen&prefix=v)](https://pytorch.org/ignite/index.html) |
@@ -364,10 +363,9 @@ Complete list of examples can be found [here](https://pytorch.org/ignite/example
 - [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pytorch/ignite/blob/master/examples/notebooks/MNIST_on_TPU.ipynb) [MNIST training on a single
   TPU](https://github.com/pytorch/ignite/blob/master/examples/notebooks/MNIST_on_TPU.ipynb)
 - [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx) [CIFAR10 Training on multiple TPUs](https://github.com/pytorch/ignite/tree/master/examples/contrib/cifar10)
-- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pytorch/ignite/blob/master/examples/notebooks/HandlersTimeProfiler_MNIST.ipynb) [Basic example of handlers 
+- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pytorch/ignite/blob/master/examples/notebooks/HandlersTimeProfiler_MNIST.ipynb) [Basic example of handlers
   time profiling on MNIST training example](https://github.com/pytorch/ignite/blob/master/examples/notebooks/HandlersTimeProfiler_MNIST.ipynb)
 
-
 ## Reproducible Training Examples
 
 Inspired by [torchvision/references](https://github.com/pytorch/vision/tree/master/references),
@@ -379,7 +377,7 @@ we provide several reproducible baselines for vision tasks:
 Features:
 
 - Distributed training with mixed precision by [nvidia/apex](https://github.com/NVIDIA/apex/)
-- Experiments tracking with [MLflow](https://mlflow.org/), [Polyaxon](https://polyaxon.com/) or [TRAINS](https://github.com/allegroai/trains/)
+- Experiments tracking with [MLflow](https://mlflow.org/), [Polyaxon](https://polyaxon.com/) or [ClearML](https://github.com/allegroai/clearml/)
 
 <!-- ############################################################################################################### -->
 

diff --git a/docker/hvd/Dockerfile.hvd-apex-vision b/docker/hvd/Dockerfile.hvd-apex-vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
                                          opencv-python \
                                          py_config_runner \
                                          pillow \
-                                         "trains>=0.15.0"
+                                         clearml
diff --git a/docker/hvd/Dockerfile.hvd-vision b/docker/hvd/Dockerfile.hvd-vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
                                          opencv-python \
                                          py_config_runner \
                                          pillow \
-                                         trains
+                                         clearml
diff --git a/docker/main/Dockerfile.apex-vision b/docker/main/Dockerfile.apex-vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
                                          opencv-python \
                                          py_config_runner \
                                          pillow \
-                                         trains
+                                         clearml
diff --git a/docker/main/Dockerfile.vision b/docker/main/Dockerfile.vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
                                          opencv-python \
                                          py_config_runner \
                                          pillow \
-                                         trains
+                                         clearml
diff --git a/docker/msdp/Dockerfile.msdp-apex-vision b/docker/msdp/Dockerfile.msdp-apex-vision
@@ -16,4 +16,4 @@ RUN pip install --upgrade --no-cache-dir albumentations \
                                          opencv-python \
                                          py_config_runner \
                                          pillow \
-                                         trains
+                                         clearml
diff --git a/docs/source/contrib/handlers.rst b/docs/source/contrib/handlers.rst
@@ -153,22 +153,27 @@ for detailed usage.
    :members:
    :inherited-members:
 
-trains_logger
+clearml_logger
 ---------------
 
-See `trains mnist example <https://github.com/pytorch/ignite/blob/master/examples/contrib/mnist/mnist_with_trains_logger.py>`_
+See `clearml mnist example <https://github.com/pytorch/ignite/blob/master/examples/contrib/mnist/mnist_with_clearml_logger.py>`_
 for detailed usage.
 
-.. currentmodule:: ignite.contrib.handlers.trains_logger
+.. currentmodule:: ignite.contrib.handlers.clearml_logger
 
 .. autosummary::
     :nosignatures:
     :autolist:
 
-.. automodule:: ignite.contrib.handlers.trains_logger
+.. automodule:: ignite.contrib.handlers.clearml_logger
    :members:
    :inherited-members:
 
+trains_logger
+--------------
+
+.. note:: ``trains_logger`` was renamed to ``clearml_logger``. Please refer to :ref:`clearml_logger`.
+
 More on parameter scheduling
 ----------------------------
 
@@ -477,4 +482,3 @@ Concatenate with torch schedulers
 
 
 .. image:: ../_static/img/schedulers/concat_linear_exp_step_lr.png
-
diff --git a/docs/source/examples.rst b/docs/source/examples.rst
@@ -77,4 +77,4 @@ reproducible baselines for vision tasks:
 Features:
 
 - Distributed training with mixed precision by `nvidia/apex <https://github.com/NVIDIA/apex/>`_
-- Experiments tracking with `MLflow <https://mlflow.org/>`_ or `Polyaxon <https://polyaxon.com/>`_ or `TRAINS <https://github.com/allegroai/trains/>`_
+- Experiments tracking with `MLflow <https://mlflow.org/>`_ or `Polyaxon <https://polyaxon.com/>`_ or `ClearML <https://github.com/allegroai/clearml/>`_
diff --git a/examples/contrib/cifar10/README.md b/examples/contrib/cifar10/README.md
@@ -1,17 +1,18 @@
 # CIFAR10 Example with Ignite
 
-In this example, we show how to use *Ignite* to train a neural network:
+In this example, we show how to use _Ignite_ to train a neural network:
+
 - on 1 or more GPUs or TPUs
 - compute training/validation metrics
 - log learning rate, metrics etc
 - save the best model weights
 
 Configurations:
 
-* [x] single GPU
-* [x] multi GPUs on a single node
-* [x] multi GPUs on multiple nodes
-* [x] TPUs on Colab
+- [x] single GPU
+- [x] multi GPUs on a single node
+- [x] multi GPUs on multiple nodes
+- [x] TPUs on Colab
 
 ## Requirements:
 
@@ -20,21 +21,24 @@ Configurations:
 - [tqdm](https://github.com/tqdm/tqdm/): `pip install tqdm`
 - [tensorboardx](https://github.com/lanpa/tensorboard-pytorch): `pip install tensorboardX`
 - [python-fire](https://github.com/google/python-fire): `pip install fire`
-- Optional: [trains](https://github.com/allegroai/trains): `pip install trains`
+- Optional: [clearml](https://github.com/allegroai/clearml): `pip install clearml`
 
 ## Usage:
 
 Run the example on a single GPU:
+
 ```bash
 python main.py run
 ```
 
 For details on accepted arguments:
+
 ```bash
 python main.py run -- --help
 ```
 
 If user would like to provide already downloaded dataset, the path can be setup in parameters as
+
 ```bash
 --data_path="/path/to/cifar10/"
 ```
@@ -44,11 +48,14 @@ If user would like to provide already downloaded dataset, the path can be setup
 #### Single node, multiple GPUs
 
 Let's start training on a single node with 2 gpus:
+
 ```bash
 # using torch.distributed.launch
 python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl"
 ```
-or 
+
+or
+
 ```bash
 # using function spawn inside the code
 python -u main.py run --backend="nccl" --nproc_per_node=2
@@ -59,28 +66,29 @@ python -u main.py run --backend="nccl" --nproc_per_node=2
 Please, make sure to have Horovod installed before running.
 
 Let's start training on a single node with 2 gpus:
+
 ```bash
 # horovodrun
 horovodrun -np=2 python -u main.py run --backend="horovod"
 ```
-or 
+
+or
+
 ```bash
 # using function spawn inside the code
 python -u main.py run --backend="horovod" --nproc_per_node=2
 ```
 
-
 #### Colab, on 8 TPUs
 
-
 Same code can be run on TPUs: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx)
 
-
 #### Multiple nodes, multiple GPUs
 
 Let's start training on two nodes with 2 gpus each. We assuming that master node can be connected as `master`, e.g. `ping master`.
 
-1) Execute on master node
+1. Execute on master node
+
 ```bash
 python -u -m torch.distributed.launch \
     --nnodes=2 \
@@ -90,7 +98,8 @@ python -u -m torch.distributed.launch \
     main.py run --backend="nccl"
 ```
 
-2) Execute on worker node
+2. Execute on worker node
+
 ```bash
 python -u -m torch.distributed.launch \
     --nnodes=2 \
@@ -100,17 +109,18 @@ python -u -m torch.distributed.launch \
     main.py run --backend="nccl"
 ```
 
-
 ### Check resume training
 
 #### Single GPU
 
 Initial training with a stop on 1000 iteration (~11 epochs)
+
 ```bash
 python main.py run --stop_iteration=1000
 ```
 
 Resume from the latest checkpoint
+
 ```bash
 python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-1000/training_checkpoint_1000.pt
 ```
@@ -120,25 +130,27 @@ python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_sto
 #### Single node, multiple GPUs
 
 Initial training on a single node with 2 gpus with a stop on 1000 iteration (~11 epochs):
+
 ```bash
 # using torch.distributed.launch
 python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl" --stop_iteration=1000
 ```
 
 Resume from the latest checkpoint
+
 ```bash
 python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl" \
     --resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-1000/training_checkpoint_1000.pt
 ```
 
 Similar commands can be adapted for other cases.
 
-## Trains fileserver
+## ClearML fileserver
 
-If `Trains` server is used (i.e. `--with_trains` argument), the configuration to upload artifact must be done by 
-modifying the `Trains` configuration file `~/trains.config` generated by `trains-init`. According to the
-[documentation](https://allegro.ai/docs/examples/reporting/artifacts/), the `output_uri` argument can be 
-configured in `sdk.development.default_output_uri` to fileserver uri. If server is self-hosted, `Trains` fileserver uri is
+If `ClearML` server is used (i.e. `--with_clearml` argument), the configuration to upload artifact must be done by
+modifying the `ClearML` configuration file `~/clearml.conf` generated by `clearml-init`. According to the
+[documentation](https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html), the `output_uri` argument can be
+configured in `sdk.development.default_output_uri` to fileserver uri. If server is self-hosted, `ClearML` fileserver uri is
 `http://localhost:8081`.
 
-For more details, see https://allegro.ai/docs/examples/reporting/artifacts/
+For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html
diff --git a/examples/contrib/cifar10/main.py b/examples/contrib/cifar10/main.py
@@ -47,8 +47,8 @@ def training(local_rank, config):
         if "cuda" in device.type:
             config["cuda device name"] = torch.cuda.get_device_name(local_rank)
 
-        if config["with_trains"]:
-            from trains import Task
+        if config["with_clearml"]:
+            from clearml import Task
 
             task = Task.init("CIFAR10-Training", task_name=output_path.stem)
             task.connect_configuration(config)
@@ -150,7 +150,7 @@ def run(
     log_every_iters=15,
     nproc_per_node=None,
     stop_iteration=None,
-    with_trains=False,
+    with_clearml=False,
     **spawn_kwargs,
 ):
     """Main entry to train an model on CIFAR10 dataset.
@@ -177,7 +177,7 @@ def run(
         log_every_iters (int): argument to log batch loss every ``log_every_iters`` iterations.
             It can be 0 to disable it. Default, 15.
         stop_iteration (int, optional): iteration to stop the training. Can be used to check resume from checkpoint.
-        with_trains (bool): if True, experiment Trains logger is setup. Default, False.
+        with_clearml (bool): if True, experiment ClearML logger is setup. Default, False.
         **spawn_kwargs: Other kwargs to spawn run in child processes: master_addr, master_port, node_rank, nnodes
 
     """
@@ -340,10 +340,10 @@ def train_step(engine, batch):
 
 
 def get_save_handler(config):
-    if config["with_trains"]:
-        from ignite.contrib.handlers.trains_logger import TrainsSaver
+    if config["with_clearml"]:
+        from ignite.contrib.handlers.clearml_logger import ClearMLSaver
 
-        return TrainsSaver(dirname=config["output_path"])
+        return ClearMLSaver(dirname=config["output_path"])
 
     return DiskSaver(config["output_path"], require_empty=False)
 

diff --git a/...contrib/mnist/mnist_with_trains_logger.py b/...ontrib/mnist/mnist_with_clearml_logger.py
@@ -1,14 +1,14 @@
 """
- MNIST example with training and validation monitoring using Trains.
+ MNIST example with training and validation monitoring using ClearML.
 
  Requirements:
-    Trains: `pip install trains`
+    ClearML: `pip install clearml`
 
  Usage:
 
     Run the example:
     ```bash
-    python mnist_with_trains_logger.py
+    python mnist_with_clearml_logger.py
     ```
 """
 from argparse import ArgumentParser
@@ -21,7 +21,7 @@
 from torchvision.datasets import MNIST
 from torchvision.transforms import Compose, ToTensor, Normalize
 
-from ignite.contrib.handlers.trains_logger import *
+from ignite.contrib.handlers.clearml_logger import *
 from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
 from ignite.handlers import Checkpoint
 from ignite.metrics import Accuracy, Loss
@@ -86,43 +86,43 @@ def compute_metrics(engine):
         train_evaluator.run(train_loader)
         validation_evaluator.run(val_loader)
 
-    trains_logger = TrainsLogger(project_name="examples", task_name="ignite")
+    clearml_logger = ClearMLLogger(project_name="examples", task_name="ignite")
 
-    trains_logger.attach_output_handler(
+    clearml_logger.attach_output_handler(
         trainer,
         event_name=Events.ITERATION_COMPLETED(every=100),
         tag="training",
         output_transform=lambda loss: {"batchloss": loss},
     )
 
     for tag, evaluator in [("training metrics", train_evaluator), ("validation metrics", validation_evaluator)]:
-        trains_logger.attach_output_handler(
+        clearml_logger.attach_output_handler(
             evaluator,
             event_name=Events.EPOCH_COMPLETED,
             tag=tag,
             metric_names=["loss", "accuracy"],
             global_step_transform=global_step_from_engine(trainer),
         )
 
-    trains_logger.attach_opt_params_handler(
+    clearml_logger.attach_opt_params_handler(
         trainer, event_name=Events.ITERATION_COMPLETED(every=100), optimizer=optimizer
     )
 
-    trains_logger.attach(
+    clearml_logger.attach(
         trainer, log_handler=WeightsScalarHandler(model), event_name=Events.ITERATION_COMPLETED(every=100)
     )
 
-    trains_logger.attach(trainer, log_handler=WeightsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=100))
+    clearml_logger.attach(trainer, log_handler=WeightsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=100))
 
-    trains_logger.attach(
+    clearml_logger.attach(
         trainer, log_handler=GradsScalarHandler(model), event_name=Events.ITERATION_COMPLETED(every=100)
     )
 
-    trains_logger.attach(trainer, log_handler=GradsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=100))
+    clearml_logger.attach(trainer, log_handler=GradsHistHandler(model), event_name=Events.EPOCH_COMPLETED(every=100))
 
     handler = Checkpoint(
         {"model": model},
-        TrainsSaver(),
+        ClearMLSaver(),
         n_saved=1,
         score_function=lambda e: e.state.metrics["accuracy"],
         score_name="val_acc",
@@ -134,7 +134,7 @@ def compute_metrics(engine):
     # kick everything off
     trainer.run(train_loader, max_epochs=epochs)
 
-    trains_logger.close()
+    clearml_logger.close()
 
 
 if __name__ == "__main__":