Regeneration of Google's tpu-resnet tutorial
- @ Jeju Google Deep Learning Camp 2018
- Special thanks to Sourabh and Yu-han @ Google
Easy GCP TPU training in Jeju Google Deep Learning Camp 2018
- macOS and command line interface only
- TensorFlow >= 1.8
You need to install the gcloud SDK directly from the link:
- What we must configure is:
- account
- project
- a default compute region and zone
Note that the zone should be set to us-central1-f in this Google camp.
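For example, under these camp settings, the configuration can be set like this (a minimal sketch; run gcloud auth login first if you have not authenticated yet):
$ gcloud auth login
$ gcloud config set project ordinal-virtue-208004
$ gcloud config set compute/region us-central1
$ gcloud config set compute/zone us-central1-f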
- You can check your configuration by:
$ gcloud config list
[compute]
region = us-central1
zone = us-central1-f
[core]
account = [email protected]
disable_usage_reporting = False
project = ordinal-virtue-208004
Your active configuration is: [default]
The use of a GCP TPU has three steps:
1) Enabling the related APIs
2) Creating a virtual machine (VM) instance + connecting to the VM by ssh
3) Creating a TPU instance from the VM
In order to use the TPU, the two APIs below must be enabled:
- Cloud TPU API
- Compute Engine API
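You can enable both APIs in the GCP Console; if your gcloud SDK provides the services subcommand, a command-line sketch (the service identifiers below assume the standard API names) is:
$ gcloud services enable tpu.googleapis.com
$ gcloud services enable compute.googleapis.com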
Basically, our aim is to use TPUs in a certain GCP zone by creating a virtual machine on the cloud.
- The zones where the TPU can be used are limited (see the official TPU documentation for details)
- To create the VM, we have create_vm_instance.sh:
#! /bin/bash
# create_vm_instance.sh
export YOUR_PRJ_NAME=ordinal-virtue-208004
export YOUR_ZONE=us-central1-f
echo Set your proj and zone again
gcloud config set project $YOUR_PRJ_NAME
gcloud config set compute/zone $YOUR_ZONE
echo CREATE GCLOUD VM
gcloud compute instances create $USER-vm \
--machine-type=n1-standard-2 \
--image-project=ml-images \
--image-family=tf-1-8 \
--scopes=cloud-platform
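After the script finishes, you can quickly confirm that the VM exists:
$ gcloud compute instances list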
Note that in order to use a TPU, you first need permission from Google.
Now you can connect to your VM by ssh:
$ gcloud compute ssh $USER-vm
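Once inside the VM, it is worth checking that the tf-1-8 image actually provides the expected TensorFlow version:
$ python -c "import tensorflow as tf; print(tf.__version__)"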
In the VM, you need three configurations for the TPU use:
- your project name
- your zone
- your TPU IP
create_tpu_instance.sh provides the above configuration + TPU instance generation.
#! /bin/bash
#create_tpu_instance.sh
export YOUR_PRJ_NAME=ordinal-virtue-208004
export YOUR_ZONE=us-central1-f
export TPU_IP=10.240.6.2
echo Set your proj and zone again
gcloud config set project $YOUR_PRJ_NAME
gcloud config set compute/zone $YOUR_ZONE
echo CREATE TPU INSTANCE
# ${TPU_IP/%2/0} replaces the trailing 2 in TPU_IP with 0, giving the CIDR block 10.240.6.0/29
gcloud alpha compute tpus create $USER-tpu \
--range=${TPU_IP/%2/0}/29 --version=1.8 --network=default
echo CHECK YOUR TPU INSTANCE
gcloud alpha compute tpus list
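If you want more detail than the list output, you can also inspect the instance directly (assuming your SDK version exposes the alpha describe subcommand):
$ gcloud alpha compute tpus describe $USER-tpu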
After finishing with the TPU, we need to remove the TPU instance by delete_tpu_instance.sh:
#! /bin/bash
# delete_tpu_instance.sh
export YOUR_ZONE=us-central1-f
echo DELETE TPU INSTANCE
gcloud alpha compute tpus delete $USER-tpu --zone=$YOUR_ZONE
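A TPU instance is billed for as long as it exists, and so is the VM. When you are completely done, you may also want to delete the VM; a minimal sketch (delete_vm_instance.sh is a hypothetical companion script, not one of the scripts above):
#! /bin/bash
# delete_vm_instance.sh (hypothetical companion to delete_tpu_instance.sh)
export YOUR_ZONE=us-central1-f
echo DELETE VM INSTANCE
gcloud compute instances delete $USER-vm --zone=$YOUR_ZONE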
Google provides a ready-made example for practicing TPU use.
For this tutorial, you need to connect to the VM holding a TPU instance by
$ gcloud compute ssh $USER-vm
- This tutorial includes the steps below:
1) Downloading the MNIST dataset and converting it to TFRecord
2) Creating a GCS bucket to hold the TFRecord dataset
3) Cloning the TPU tutorial repo with git
4) Running the ResNet code on the TPU
create_mnistdata_to_tfrecord.sh
#! /bin/bash
# create_mnistdata_to_tfrecord.sh
echo Downloading and converting the MNIST data to TFRecords
python /usr/share/tensorflow/tensorflow/examples/how_tos/reading_data/convert_to_records.py --directory=./data
gunzip ./data/*.gz
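After the script runs, a quick sanity check that ./data now holds the converted TFRecord files (the exact file names come from convert_to_records.py):
$ ls ./data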
For the TPU use, we need to create a bucket. We have two main purposes for the bucket in GCP:
- holding the dataset during training
- storing the .ckpt checkpoints generated by training
(The Cloud TPU reads training data and writes checkpoints directly from/to the bucket, not from the VM's local disk.)
Here, we just create a single bucket and use it for the above two purposes.
However, in your actual training, we recommend creating two different buckets: one for the dataset and one for the .ckpt files.
#! /bin/bash
echo CREATE BUCKET
export STORAGE_BUCKET=gs://mnist_tfrecord
export YOUR_PRJ_NAME=ordinal-virtue-208004
# gsutil mb expects a bucket location such as the region us-central1, not a zone
export YOUR_REGION=us-central1
gsutil mb -l ${YOUR_REGION} -p ${YOUR_PRJ_NAME} ${STORAGE_BUCKET}
echo COPY DATA TO BUCKET FROM /DATA DIR
gsutil cp -r ./data ${STORAGE_BUCKET}
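You can verify that the upload succeeded by listing the bucket contents:
$ gsutil ls -r ${STORAGE_BUCKET}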
First, we need to git clone the tutorial ResNet code from the TensorFlow repository.
An important note here is that we should switch the git branch from master to r1.8, because the master branch does not support tf.estimator for the TPU use.
#! /bin/bash
# gitclone_resnet_repo.sh
echo git clone Resnet repository
git clone https://github.com/tensorflow/tpu.git ./tpu
echo First you need to check out the r1.8 branch
cd ./tpu
git branch -r
git checkout -t origin/r1.8
cd ..
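To confirm that the checkout succeeded, print the current branch of the clone (it should report r1.8):
$ git -C ./tpu rev-parse --abbrev-ref HEAD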
All that remains is to run resnet_main.py with:
#! /bin/bash
# run_python_resnet_main.sh
echo RUN RESNET TRAINING BY TPU
export STORAGE_BUCKET=gs://mnist_tfrecord
# the TPU reads input data from the bucket, not from the VM's local disk
export DATA_DIR=${STORAGE_BUCKET}/data
python ./tpu/models/official/resnet/resnet_main.py \
--tpu=$USER-tpu \
--data_dir=$DATA_DIR \
--model_dir=${STORAGE_BUCKET}/resnet
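Since the checkpoints and summaries land in the bucket, you can monitor training with TensorBoard pointed at the GCS path (a sketch; TensorBoard can read gs:// paths when TensorFlow is installed):
$ tensorboard --logdir=${STORAGE_BUCKET}/resnet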
- Jaewook Kang ([email protected])