tpu-resnet-tutorial

Regeneration of Google's tpu-resnet tutorial

  • @ Jeju Google Deep Learning Camp 2018
  • Special thanks to Sourabh and Yu-han @ Google

A Korean (KOR) README is provided at this link

About

Easy GCP TPU training at the Jeju Google Deep Learning Camp 2018

Dependencies

  • macOS and a command-line interface only
  • TensorFlow >= 1.8

gcloud SDK Installation

You need to install the gcloud SDK directly from the link:

  • After installation, you must configure the following (as shown in the sketch below):
    • account
    • project
    • a default compute region and zone

Note that the zone should be set to us-central1-f for this Google camp.
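
These properties can be set from the command line; a minimal sketch (the project ID and zone are the camp's examples, and the account is a placeholder, substitute your own):

$ gcloud config set account your-account@example.com
$ gcloud config set project ordinal-virtue-208004
$ gcloud config set compute/region us-central1
$ gcloud config set compute/zone us-central1-f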

  • You can check your configuration with
$ gcloud config list
[compute]
region = us-central1
zone = us-central1-f
[core]
account = [email protected]
disable_usage_reporting = False
project = ordinal-virtue-208004

Your active configuration is: [default]

TPU Instance Creation and Deletion

The use of a GCP TPU has three steps:

1) Enabling the related APIs
2) Virtual machine (VM) instance generation + SSH connection to the VM
3) TPU instance generation in the VM

1) API enabling (local)

In order to use a TPU, the following two APIs must be enabled (see the command-line sketch below):

  • Cloud TPU API
  • Compute Engine API
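
Both can be enabled from the Cloud Console, or from the command line; a minimal sketch, assuming the standard service names for these two APIs:

$ gcloud services enable tpu.googleapis.com
$ gcloud services enable compute.googleapis.com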

2) VM instance generation + SSH access (local)

Basically, our aim is to use TPUs in a certain GCP zone by creating a virtual machine on the cloud.

  • The zones where TPUs can be used are limited (see the documentation for details)
  • To create the VM, we have create_vm_instance.sh:
#! /bin/bash
# create_vm_instance.sh

export YOUR_PRJ_NAME=ordinal-virtue-208004
export YOUR_ZONE=us-central1-f

echo  Set your proj and zone again
gcloud config set project $YOUR_PRJ_NAME
gcloud config set compute/zone $YOUR_ZONE


echo CREATE GCLOUD VM
gcloud compute instances create $USER-vm \
  --machine-type=n1-standard-2 \
  --image-project=ml-images \
  --image-family=tf-1-8 \
  --scopes=cloud-platform

Note that in order to use a TPU, you first need permission from Google.

Now you can connect to your VM via SSH:

gcloud compute ssh $USER-vm
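
If you keep the helper scripts on your local machine, you can copy them to the VM before connecting; a minimal sketch using gcloud compute scp (the script names are this tutorial's, the destination path is an assumption):

$ gcloud compute scp ./create_tpu_instance.sh ./delete_tpu_instance.sh $USER-vm:~/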

3) TPU instance generation (VM)

In the VM, you need three pieces of configuration for TPU use:

  • your project name
  • your zone
  • your TPU IP

create_tpu_instance.sh provides the above configuration + TPU instance generation.

#! /bin/bash
#create_tpu_instance.sh

export YOUR_PRJ_NAME=ordinal-virtue-208004
export YOUR_ZONE=us-central1-f
export TPU_IP=10.240.6.2

echo  Set your proj and zone again
gcloud config set project $YOUR_PRJ_NAME
gcloud config set compute/zone $YOUR_ZONE

echo  CREATE TPU INSTANCE
# ${TPU_IP/%2/0} replaces the trailing 2 of TPU_IP with 0,
# so --range becomes the CIDR block 10.240.6.0/29
gcloud alpha compute tpus create $USER-tpu \
	--range=${TPU_IP/%2/0}/29 --version=1.8 --network=default
echo CHECK YOUR TPU INSTANCE
gcloud alpha compute tpus list

After finishing with the TPU, we need to remove the TPU instance with delete_tpu_instance.sh.

#! /bin/bash
# delete_tpu_instance.sh

export YOUR_ZONE=us-central1-f

echo DELETE TPU INSTANCE
gcloud alpha compute tpus delete $USER-tpu --zone=$YOUR_ZONE
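
Once you are completely done, you may also want to delete the VM itself to avoid idle charges; a minimal sketch (a hypothetical delete_vm_instance.sh, not part of the original tutorial scripts):

#! /bin/bash
# delete_vm_instance.sh (hypothetical companion script)

export YOUR_ZONE=us-central1-f

echo DELETE VM INSTANCE
gcloud compute instances delete $USER-vm --zone=$YOUR_ZONE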

Run ResNet Tutorial

Google provides a complete example for practicing TPU use.

For this tutorial, you need to connect to the VM holding a TPU instance, by $ gcloud compute ssh $USER-vm

  • This tutorial includes the following steps:
1) Downloading MNIST dataset and converting to TFrecord
2) GCP Bucket generation for holding the TFrecord dataset
3) Git clone TPU tutorial repo 
4) Run ResNet codes by TPU 

1) Downloading MNIST dataset and converting to TFrecord

  • create_mnistdata_to_tfrecord.sh
#! /bin/bash
# create_mnistdata_to_tfrecord.sh

echo Downloading and converting the MNIST data to TFRecords
python  /usr/share/tensorflow/tensorflow/examples/how_tos/reading_data/convert_to_records.py --directory=./data
gunzip ./data/*.gz
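
After the script finishes, you can sanity-check the output; a minimal sketch (the exact TFRecord file names depend on the converter script):

$ ls -lh ./data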

2) GCP Bucket generation for holding the TFrecord dataset

For TPU use, we need to create a bucket. The bucket has two main purposes in GCP:

  • Holding the dataset during training
  • Storing the .ckpt files generated by training

Here, we just create a single bucket and use it for both purposes. However, for your actual training, we recommend creating two separate buckets, one for the dataset and one for the .ckpt files.

#! /bin/bash

echo CREATE BUCKET
export STORAGE_BUCKET=gs://mnist_tfrecord
export YOUR_PRJ_NAME=ordinal-virtue-208004
export YOUR_ZONE=us-central1-f

gsutil mb -l ${YOUR_ZONE} -p ${YOUR_PRJ_NAME} ${STORAGE_BUCKET}

echo COPY DATA TO BUCKET FROM /DATA DIR
gsutil cp -r ./data ${STORAGE_BUCKET}
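
You can confirm that the data landed in the bucket; a minimal sketch:

$ gsutil ls ${STORAGE_BUCKET}/data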

3) Git clone TPU tutorial repo

First we need to git clone the tutorial ResNet code from the TensorFlow repository.

An important note here is that we should switch the git branch from master to r1.8, because the master branch does not support tf.estimator for TPU use.

#! /bin/bash
# gitclone_resnet_repo.sh

echo git clone Resnet repository
git clone https://github.com/tensorflow/tpu.git ./tpu

echo First you need to check out the r1.8 branch
cd ./tpu
git branch -r
git checkout -t origin/r1.8
cd ..
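
You can verify that the checkout succeeded; a minimal sketch (git checkout -t origin/r1.8 creates a local branch named r1.8):

$ git -C ./tpu rev-parse --abbrev-ref HEAD
r1.8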

4) Run ResNet codes by TPU

All that remains is to run resnet_main.py:

#! /bin/bash
# run_python_resnet_main.sh

echo RUN RESNET TRAINING BY TPU
# Cloud TPUs read training data from GCS, not from the VM's local disk,
# so DATA_DIR points at the bucket copy made in step 2)
# (the original script used ./data)
export DATA_DIR=${STORAGE_BUCKET}/data

python ./tpu/models/official/resnet/resnet_main.py \
	  --tpu=$USER-tpu \
	  --data_dir=$DATA_DIR \
	  --model_dir=${STORAGE_BUCKET}/resnet
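
To monitor training, you can point TensorBoard at the model directory in the bucket; a minimal sketch, assuming TensorBoard is available on the VM:

$ tensorboard --logdir=${STORAGE_BUCKET}/resnet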
