This directory contains scripts and libraries that help users run torchelastic jobs on AWS.
- Familiarity with basic AWS (EC2, Auto Scaling Groups, S3, EFS).
- (suggested) install and setup
awscli
. - Basic knowledge of containers (we use Docker in our examples).
git clone https://github.com/pytorch/elastic.git
cd elastic/aws && pip install -r requirements.txt
- The following AWS resources:
- EC2 instance profile
- Subnet(s)
- Security group
- EFS volume
- S3 Bucket
- Install the AWS Session Manager plugin
This guide shows how to get a simple torchelastic job running on AWS. petctl
is a commandline tool that helps run distributed jobs written with torchelastic
on EC2 instances.
python3 petctl.py setup
This will bootstrap all the AWS resources required to run a torchelastic job. For details take a look at the CloudFormation template .
Use --s3_bucket
and --efs_id
to use an existing S3 bucket and EFS
file system. Otherwise an S3 bucket and EFS volume will be created.
IMPORTANT when specifying
--efs_id
you MUST ensure that NO mount targets exist on the EFS file system. torchelastic's cfn stack will attempt to create mount targets for the subnets it creates and WILL FAIL if the file system already has mount targets on a different VPC. For more information refer to the EFS docs.
TIP: If the stack creation fails, log into the CloudFormation console, inspect
the failure reason, address the failure, then manually delete the stack and re-run
petctl configure
.
If you are familiar with AWS or already have the resources specified in the
Requirements section above, then you can follow the Manual Setup instructions
below and simply copy the sample specs file and fill
in the template, then run python petctl.py configure
.
If you already have a script that uses torchelastic to run distributed training, great! Otherwise you can use the provided imagenet example or classy vision. Or... this is a great time to work on one.
If you are using
examples/imagenet/main.py
you must download the imagenet dataset from here onto the EFS volume you specified. The dataset will be available to the workers on/mnt/efs/fs1/<download_dir>
.
We will assume that you are working with the imagenet example and that
you have downloaded the imagenet dataset to /mnt/efs/fs1/data/imagenet/train
.
To run the script we'll use petctl
,
python3 aws/petctl.py run_job --size 2 --min_size 1 --max_size 3 --name ${USER}-job examples/imagenet/main.py -- --input_path /mnt/efs/fs1/data/imagenet/train
In the example above, the named arguments, such as, --size
, --min_size
, and
--max_size
are parameters to the run_job
sub-command of petctl
. In the example
above, we created an elastic job where the initial worker --size=2
, we are
allowed to scale down to --min_size=1
and up to --max_size=3
. This is used by
torchelastic's rendezvous algorithm to determine how many nodes to admit on each
re-rendezvous before considering the group final and start the train_step
.
The other positional arguments have the form:
[local script] -- [script args ...]
-- or --
[local directory] -- [script] [script args...]
If the first positional argument is a path to a script file, then the script
is uploaded to S3 and the script arguments specified after the --
delimiter
are passed through to the script.
If the first positional argument is a directory, then a tarball of the directory
is created and uploaded to S3 and is extracted on the worker-side. In this case
the first argument after the --
delimiter is the path to the script relative to the
specified directory and the rest of the arguments after the delimiter is passed
to the script.
In our example we specified
petctl.py run_job [...] examples/imagenet/main.py -- --input_path /mnt/efs/fs1/data/imagenet/train
We could have decided to specify the directory instead
petctl.py run_job [...] examples -- imagenet/main.py --input_path /mnt/efs/fs1/data/imagenet/train
TIP 1: Besides a local script or directory you can run with scripts or tar
files
that have already been uploaded to S3 or directly point it to a file or directory
on the container.
python3 petctl.py run_job [...] s3://my-bucket/my_script.py
python3 petctl.py run_job [...] s3://my-bucket/my_dir.tar.gz -- my_script.py
# or
python3 petctl.py run_job [...] docker:///abs/path/in/container/dir -- my_script.py
python3 petctl.py run_job [...] docker://rel/path/in/container/dir/my_script.py
TIP 2: To iterate quickly, simply make changes to your local script and upload the script to S3 using
python3 petctl.py upload examples/imagenet/main.py s3://<bucket>/<prefix>/<job_name>
TIP 3: Use the EFS volume attached on /mnt/efs/fs1
on all the workers to
save input data, checkpoints and job output.
Once the run_job
command returns log into the EC2 console, you will see two
Auto Scaling Groups
- etcd server
- workers
Log into the AWS CloudWatch Logs console. You should see a log group called
torchelastic/$USER
. Under it there will be a log stream per instance with the
name $job_name/$instance_id
(e.g. my_job/i0b938EXAMPLE
).
To SSH onto the worker nodes to debug/inspect the worker process use AWS Session Manager instead of the ec2 key pair. Install the Session Manager plugin and run
# get the instance ids of the workers
python3 petctl.py list_hosts <job_name>
# ssh onto one of the workers
awscli ssm start-session --target <instance_id>
-- example --
awscli ssm start-session --target i-00b00EXAMPLE
Once SSH'ed, the workers run in a docker container managed by systemd
.
You can take a look at their console outputs by running
# see the status of the worker
sudo systemctl status torchelastic_worker
# get the container id
sudo docker ps
# tail the container logs
sudo docker logs -f <container id>
Note since we have configured the log driver to be
awslogs
tailing the docker logs will not work. For more information see: https://docs.docker.com/config/containers/logging/awslogs/
You can also manually stop and start the workers by running
sudo systemctl stop torchelastic_worker
sudo systemctl start torchelastic_worker
EXERCISE: Try stopping or adding worker(s) to see elasticity in action! To add workers, simply increase the
desired
size of the worker autoscaling group.
- torchelastic_worker systemd:
/etc/systemd/service/torchelastic_worker.service
- torchelastic run scripts:
/var/torchelastic
Note: by design,
petctl
tries to use the least number of AWS services. This was done intentionally to allow non-AWS users to easily transfer the functionality to their environment. Hence it currently does not have the functionality to query status of the job or to terminate the ASG when the job is done (there is nothing that is monitoring the job!). In practice consider using EKS, Batch, or SageMaker.
To stop the job and tear down the resources, use the kill_job
command:
python3 petctl.py kill_job ${USER}-job
You'll notice that the two ASGs created with the run_job
command are deleted.
First, lets create a launch spec. This is a simple json file that specifies
the launch configuration of EC2 instances. We have included a
sample specs file so make a copy and fill it in.
You only need to fill in the fields with {{ AWS_Resource_Id }}
, you can leave the other
fields alone for now.
# cd $torchelastic_repository_root
mkdir ~/torchelastic_workspace
cp aws/config/sample_specs.json ~/torchelastic_workspace/specs.json
The specs file is divided into two sections: rdzv
and worker
. As their names
imply the rdzv
section contains the launch specs for the instances
of the rendezvous backend (e.g. etcd). The worker
section contains the launch
specs for the worker instances.
The following subsections describe the fields in the specs file.
"instance_type" : "[ec2 instance type]",
"accelerator" : "[none | gpu]",
The instance type specifies the EC2 instance type to use. The accelerator
field can either be none
or gpu
. If an EC2 instance that has GPU capability
is specified (e.g. g3
, p2
, p3
instance families) then you must specify
accelerator = gpu
.
If
accelerator=gpu
is not specified on a GPU capable instance type,petctl
assumes you will only use CPU and will use an AMI that does not have CUDA nor NVIDIA drivers.
Note that you can add multiple subnets. Each subnet belongs to an availability zone (AZ) so you can spread your instances across AZs by specifying multiple subnets:
"subnets" : [
"[subnet_in_us-west-2a]",
"[subnet_in_us-west-2b]",
...
],
Some instances are not available in all AZs, make sure to create the subnet in the AZ that supports the instance type that you plan to run your jobs on.
torchelastic jobs are distributed and hence require nodes to communicate with each other. Make sure that your security group allows all inbound traffic between instances within the same security group.
Optionally you may want to allow inbound SSH access in case you need to log into the instance for debugging.
"instance_role" : "[ec2 instance profile]",
This is the IAM role that is used when accessing other AWS services from within the EC2 instance (e.g. accessing S3 from the worker host). To learn more about instance profiles refer to the AWS documentation.
Refer to the Ref: IAMRoleRendezvous
and Ref: IAMRoleWorker
resources
in the CloudFormation template for a full list of IAM policies
that must be attached to this role.
You may wish to add other privileges to this role depending on what your workers do. For instance, if your job writes output to S3, then you will have to attach S3 Write IAM policy to this profile.
petctl
uploads your local script to S3 and pulls it down on the worker.
Specify the S3 bucket and prefix for this purpose:
"s3_bucket" : "<YOUR S3 BUCKET NAME>",
"s3_prefix" : "<YOUR S3 PREFIX>"
Workers have a couple of additional specs compared to rdzv.
"docker_image" : "torchelastic/examples:0.1.0rc1",
Note that the worker
section in the specs file has an extra docker_image
configuration. This is because the workers run in a docker container whereas
the rendezvous backend (etcd) runs directly on the instance. The
torchelastic/aws
image contains torchelastic (along with all of its dependencies)
and a few runtime utilities such as the fetch_and_run
script that allows us
to run arbitrary user scripts. For production, you may consider creating your
own docker image with a custom ENTRYPOINT
specific to your application.
An Elastic File System volume is mounted on each worker instance (it is mounted through all the way to the container). EFS acts much like NFS in terms of semantics. Use it as if you were using NFS. You may store your input dataset, store model checkpoints, or job outputs here.
The specified EFS volume is mounted on
/mnt/efs/fs1
. On the host and container.
After creating a specs file, configure petctl
python3 petctl.py configure
This will prompt for the absolute path of your specs file and the AWS region.
You can run the configure
sub-command as many times as you wish in case you made
a mistake or you need to reconfigure petctl
.
You'll notice that after configuration is done, there is a config file generated
under $HOME/.petctl
.
This is similar to how the aws cli is configured. If you decide to skip configuration, then you must pass
--specs_file
and--region
arguments topetctl
each time (e.g.petctl --sepcs_file /home/$USER/specs.json --region us-west-2
).