Example of Quantization Aware Training (QAT) with Ignite on CIFAR10

The model implementation is based on https://discuss.pytorch.org/t/evaluator-returns-nan/107972/3

In this example, we show how to use Ignite to:

  • train a neural network on one or more GPUs
  • compute training and validation metrics
  • log the learning rate, metrics, etc.
  • save the best model weights

Configurations:

  • single GPU
  • multi GPUs on a single node

Requirements:

  • pytorch-ignite: pip install pytorch-ignite
  • torchvision: pip install torchvision
  • tqdm: pip install tqdm
  • tensorboardX: pip install tensorboardX
  • python-fire: pip install fire
  • brevitas: pip install git+https://github.com/Xilinx/brevitas.git

Usage:

For example, we can train ResNet-18 with 8-bit weights and activations.

Run the example on a single GPU:

CUDA_VISIBLE_DEVICES=0 python main.py run --model="resnet18_QAT_8b"

Note: torch DataParallel does not work with QAT (v1.7.1).

For details on accepted arguments:

python main.py run -- --help

To use an already downloaded dataset, pass its path as a parameter:

--data_path="/path/to/cifar10/"

The following models are available:

  • resnet18_QAT_8b - ResNet-18 with 8-bit weights and activations
  • resnet18_QAT_6b - ResNet-18 with 6-bit weights and activations
  • resnet18_QAT_5b - ResNet-18 with 5-bit weights and activations
  • resnet18_QAT_4b - ResNet-18 with 4-bit weights and activations
  • torchvision models
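
To give a feel for what the different bit widths mean, here is a minimal pure-Python sketch of uniform symmetric "fake" quantization: values are rounded to a signed n-bit grid and mapped back to floats. It is a generic illustration, not the quantizers Brevitas uses in this example; `fake_quantize` and the sample weights are hypothetical.

```python
def fake_quantize(values, num_bits, max_abs=None):
    """Round values to a signed num_bits-bit grid and map them back to floats."""
    if max_abs is None:
        max_abs = max(abs(v) for v in values)
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = []
    for v in values:
        q = round(v / scale)          # nearest integer level
        q = max(-qmax, min(qmax, q))  # clamp to the symmetric integer range
        quantized.append(q * scale)   # back to floating point
    return quantized


weights = [0.9, -0.31, 0.07, -1.0, 0.5]  # hypothetical sample weights
for bits in (8, 6, 5, 4):
    approx = fake_quantize(weights, bits)
    max_err = max(abs(w - a) for w, a in zip(weights, approx))
    print(f"{bits}-bit: max rounding error = {max_err:.4f}")
```

With fewer bits the grid is coarser, so the worst-case rounding error (half a grid step) grows; this is the precision loss that QAT lets the network adapt to during training.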

Distributed training

Single node, multiple GPUs

Let's start training on a single node with 2 GPUs:

# using torch.distributed.launch
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl" --model="resnet18_QAT_8b"

Using Horovod as distributed backend

Make sure Horovod is installed before running.

Let's start training on a single node with 2 GPUs:

# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod" --model="resnet18_QAT_8b"

or

# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2 --model="resnet18_QAT_8b"

Online logs

On TensorBoard.dev: https://tensorboard.dev/experiment/Kp9Wod3XR36Sg2I1gAh1cA/