This solution was designed to provide a reproducible, easy-to-deploy environment that integrates Hail with AWS EMR. Where possible, AWS-native tools have been used.
To integrate Hail and EMR, we leverage Packer from HashiCorp alongside AWS CodeBuild to create a custom AMI pre-packaged with Hail, and optionally containing the Variant Effect Predictor (VEP). Then, an EMR cluster is launched using this custom AMI.
Users leverage an AWS SageMaker Notebook instance to run JupyterLab, and pass commands to Hail from the Notebook via Apache Livy.
This repository contains an AWS Quick Start solution for rapid deployment into your AWS account. Certain parts of this repository assume a working knowledge of AWS, CloudFormation, S3, EMR, Hail, Jupyter, SageMaker, EC2, Packer, and shell scripting.
The core directories in this repository are:
- packer-files - Documentation and example configuration of Packer (used in the AMI build process)
- sagemaker - Sample Jupyter Notebooks and shell scripts
- submodules - Optional submodules supporting the deployment
- templates - CloudFormation nested stacks
- vep-configuration - VEP JSON configuration files
This document will walk through deployment steps, and highlight potential pitfalls.
Note: This process will create S3 buckets, IAM resources, AMI build resources, a SageMaker notebook, and an EMR cluster. These resources may not be covered by the AWS Free Tier, and may generate significant costs. For up-to-date information, refer to the AWS Pricing page.
You will require elevated IAM privileges in AWS, ideally AdministratorAccess, to complete this process.
To deploy Hail on EMR, follow these steps:
- Log into your AWS account and navigate to the S3 console.
- Create an S3 bucket in the region where you would like to launch this CloudFormation stack. In your newly created S3 bucket, create a directory called "quickstart-hail".
- Download the contents of this repository, unzip them, and place them in the "quickstart-hail" directory in your S3 bucket. Your final result should look like the following:
- Navigate to the "templates" directory of your S3 bucket. Find the file named "hail-launcher.template.yaml" and copy its S3 object URL. Save this URL; it is used to launch the solution.
- Navigate to the CloudFormation console.
- Create a new stack using the S3 URL that you copied in step 4 as the template source.
- Set parameters based on your environment requirements. This CloudFormation template includes an optional Identity and Access Management section where you can set a permission boundary as well as custom prefixes/suffixes to be used with all IAM role and policy names created by the templates. Check with your IT administrator if this is required in your AWS environment. Once all parameters are set, choose Next.
- Optionally configure stack options and choose Next.
- Review your settings and acknowledge the stack capabilities. Choose Create Stack.
- Once stack creation is complete, select the root stack and open the Outputs tab. Locate and choose the Service Catalog Portfolio URL.
- The Service Catalog Portfolio requires assignment to specific users, groups, or roles. Select the Users, Groups, or Roles tab and click Add groups, roles, users.
- Select the users, groups, and/or roles that will be allowed to deploy the Hail EMR cluster and SageMaker notebook instances. When complete, click Add Access.
- The selected users, groups, or roles can now click Products in the Service Catalog console.
- To get started, launch a Hail EMR Cluster using the custom Hail AMI built in the Building AMIs section. Note: A custom AMI must be built before the EMR cluster is launched.
- Launch a Hail SageMaker Notebook Instance. Once the SageMaker Notebook Instance is provisioned, open the Console Notebook URL. This will bring you to the SageMaker console for your specific notebook instance.
- Select Open JupyterLab.
- Inside your notebook server, note that there is a common-notebooks directory. This directory contains tutorial notebooks to get started interacting with your Hail EMR cluster.
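For readers who prefer to script the stack creation instead of using the console, the steps for locating the template and creating the stack can be sketched in Python. This is a minimal sketch, not the Quick Start's own tooling: the bucket name, region, and stack name are placeholders, the function names are illustrative, and the sketch assumes the repository contents sit directly under "quickstart-hail" and that boto3 is installed with valid credentials. The list of capabilities is also an assumption about what the template needs.

```python
from typing import Dict, Optional
from urllib.parse import quote


def template_url(bucket: str, region: str,
                 prefix: str = "quickstart-hail",
                 template: str = "templates/hail-launcher.template.yaml") -> str:
    """Build the virtual-hosted-style S3 object URL of the launcher template."""
    key = quote(f"{prefix}/{template}")  # "/" is left unescaped by default
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


def launch_stack(bucket: str, region: str, stack_name: str,
                 parameters: Optional[Dict[str, str]] = None) -> str:
    """Create the root stack from the uploaded template and return its stack ID."""
    import boto3  # imported lazily so template_url() works without boto3

    cfn = boto3.client("cloudformation", region_name=region)
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url(bucket, region),
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in (parameters or {}).items()
        ],
        # Assumption: the template creates named IAM resources and nested
        # stacks, so all three acknowledgements are passed.
        Capabilities=[
            "CAPABILITY_IAM",
            "CAPABILITY_NAMED_IAM",
            "CAPABILITY_AUTO_EXPAND",
        ],
    )
    return response["StackId"]


if __name__ == "__main__":
    # Placeholder bucket and region; replace with your own values.
    print(template_url("my-hail-bucket", "us-east-1"))
```

The Service Catalog steps that follow stack creation are not covered by this sketch.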
The Service Catalog product for the Hail EMR cluster will deploy a single master node, a minimum of 1 core node, and optional autoscaling task nodes.
The AWS Systems Manager Agent (SSM) can be used to gain access to the EMR nodes. This agent is pre-installed on the AMI. To allow a SageMaker Notebook instance to connect to the Hail cluster nodes, set the following parameter to true.
Notebook service catalog deployments also require a parameter adjustment to complete access.
Task nodes can be set to 0 to omit them. The target market, SPOT or ON_DEMAND, is also set through parameters. If SPOT is selected, the bid price is set to the current on-demand price of the selected instance type.
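To make the task node options concrete, the sketch below builds the kind of instance group definition that the EMR API accepts for task nodes. It is illustrative only: the function name and the group name are hypothetical, and how the Quick Start templates express these settings internally is not shown here. The literal "OnDemandPrice" is the value the EMR API documents for capping a Spot bid at the on-demand rate.

```python
from typing import Optional


def task_group_config(instance_type: str, count: int, market: str) -> Optional[dict]:
    """Return an EMR task instance group definition, or None when count is 0."""
    if market not in ("SPOT", "ON_DEMAND"):
        raise ValueError(f"market must be SPOT or ON_DEMAND, got {market!r}")
    if count < 0:
        raise ValueError("count must not be negative")
    if count == 0:
        return None  # no task nodes requested, so the group is omitted

    group = {
        "Name": "Task nodes",  # hypothetical display name
        "InstanceRole": "TASK",
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": market,
    }
    if market == "SPOT":
        # Cap the Spot bid at the current on-demand price of the instance type.
        group["BidPrice"] = "OnDemandPrice"
    return group


if __name__ == "__main__":
    print(task_group_config("r5.2xlarge", 2, "SPOT"))
    print(task_group_config("r5.2xlarge", 0, "ON_DEMAND"))
```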
EMR uses managed scaling, which automatically increases or decreases the number of instances or units in your cluster based on workload. EMR managed scaling continuously evaluates cluster metrics to make scaling decisions that optimize your clusters for cost and speed.
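As an illustration of what a managed scaling policy looks like, the sketch below builds the policy document expected by the EMR `PutManagedScalingPolicy` API. The limits are placeholder values and the function name is hypothetical; the Quick Start sets its own limits through template parameters. The range checks are sanity checks added for this example, not documented EMR rules.

```python
def managed_scaling_policy(min_units: int, max_units: int, max_core_units: int) -> dict:
    """Build a managed scaling policy that counts capacity in instances."""
    # Sanity checks added for this sketch.
    if not 1 <= min_units <= max_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    if max_core_units > max_units:
        raise ValueError("max_core_units must not exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Capacity above this limit is added as task nodes only.
            "MaximumCoreCapacityUnits": max_core_units,
        }
    }


if __name__ == "__main__":
    policy = managed_scaling_policy(min_units=1, max_units=10, max_core_units=3)
    print(policy)
    # With boto3 installed and a running cluster, the policy could be applied with:
    #   boto3.client("emr").put_managed_scaling_policy(
    #       ClusterId="j-EXAMPLE", ManagedScalingPolicy=policy)
```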
The Service Catalog product for the SageMaker Notebook instance deploys a single Notebook instance in the same subnet as your EMR cluster. Upon launch, several example Notebooks are seeded into the common-notebooks folder. These example notebooks offer an immediate orientation to interacting with your Hail EMR cluster.
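Because the notebook passes commands to Hail through Apache Livy, it can help to see the underlying REST exchange. The sketch below builds the two Livy requests involved: one that creates a PySpark session and one that submits a statement to it. It uses only the standard library. The host name is a placeholder, port 8998 is Livy's default, and the helper names are hypothetical. The seeded notebooks may rely on a Jupyter kernel that hides these calls.

```python
import json
from urllib import request


def livy_request(base_url: str, path: str, payload: dict) -> request.Request:
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return request.Request(
        url=f"{base_url.rstrip('/')}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_session(base_url: str) -> request.Request:
    """Request a new interactive PySpark session."""
    return livy_request(base_url, "/sessions", {"kind": "pyspark"})


def run_statement(base_url: str, session_id: int, code: str) -> request.Request:
    """Request execution of a code snippet in an existing session."""
    return livy_request(base_url, f"/sessions/{session_id}/statements", {"code": code})


def send(req: request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with request.urlopen(req) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder address of the EMR master node.
    livy = "http://emr-master.example.internal:8998"
    # `sc` is the SparkContext that Livy defines in PySpark sessions.
    hail_code = "import hail as hl\nhl.init(sc)\nprint(hl.version())"
    req = run_statement(livy, 0, hail_code)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Only `send` touches the network, so the request builders can be inspected without a running cluster.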
CloudFormation parameters exist on both the EMR Cluster and SageMaker Notebook products to optionally allow Notebook instances shell access through SSM. Set the following parameter to true when deploying your notebook product to allow SSM access.
Example connection from JupyterLab shell:
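The exact command depends on your environment. As a hedged sketch, assuming the AWS CLI and the Session Manager plugin are available on the notebook instance, the Python helpers below look up the cluster's master node and compose an `aws ssm start-session` command. The cluster ID, instance ID, region, and function names are illustrative and are not taken from this repository.

```python
import shlex
from typing import List, Optional


def master_instance_id(emr_client, cluster_id: str) -> str:
    """Return the EC2 instance ID of the cluster's master node."""
    response = emr_client.list_instances(
        ClusterId=cluster_id,
        InstanceGroupTypes=["MASTER"],
    )
    instances = response["Instances"]
    if not instances:
        raise LookupError(f"no master instance found for cluster {cluster_id}")
    return instances[0]["Ec2InstanceId"]


def ssm_session_command(instance_id: str, region: Optional[str] = None) -> List[str]:
    """Compose the AWS CLI command that opens an SSM shell session."""
    if not instance_id.startswith("i-"):
        raise ValueError(f"not an EC2 instance ID: {instance_id!r}")
    command = ["aws", "ssm", "start-session", "--target", instance_id]
    if region:
        command += ["--region", region]
    return command


if __name__ == "__main__":
    # Placeholder instance ID. With boto3 installed it could be resolved with:
    #   master_instance_id(boto3.client("emr"), "j-EXAMPLE")
    command = ssm_session_command("i-0123456789abcdef0", "us-east-1")
    print(shlex.join(command))  # paste the printed command into a terminal
```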
Hail on EMR requires the use of a custom AMI with Hail, Spark, VEP, and reference genomes preconfigured. This build process is driven by Packer, and leverages AWS CodeBuild. Note that some of these software packages are optional, and the build process can be executed for different versions or combinations of these software packages.
Before building, keep the following in mind:
- Builds including VEP can take a very long time (upwards of 1-2 hours in some cases)
- AMI names are unique. If building an updated AMI, deregister the previous one first
From the AWS CodeBuild dashboard, select the desired build's radio button and click Start with overrides.
On the next page you may optionally override any build parameters, but you are required to override the HAIL_VERSION value with the Hail version you wish to use. Then click Start build.
Once the build begins, you can optionally tail the logs to view progress. Closing this window will not terminate the build.
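The same override can be supplied programmatically. The sketch below starts a CodeBuild build with HAIL_VERSION overridden, using the `start_build` call from boto3's CodeBuild client. The project name and version string are placeholders, and the function name is hypothetical; the client is passed in so that the helper can be exercised without AWS access.

```python
def start_hail_build(codebuild_client, project_name: str, hail_version: str) -> str:
    """Start a build with HAIL_VERSION overridden and return the build ID."""
    if not hail_version:
        raise ValueError("hail_version is required")
    response = codebuild_client.start_build(
        projectName=project_name,
        environmentVariablesOverride=[
            {"name": "HAIL_VERSION", "value": hail_version, "type": "PLAINTEXT"},
        ],
    )
    return response["build"]["id"]


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   build_id = start_hail_build(
#       boto3.client("codebuild"),
#       "my-hail-ami-build",  # placeholder project name
#       "0.2.x",              # placeholder; use a real Hail version
#   )
```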
AMI names are unique. To rebuild an AMI with the same name, you must first deregister the existing AMI in your AWS account and target region.
Additional documentation on building a custom Hail AMI can be found in the AMI Creation Guide.
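Deregistration can also be scripted. The following sketch looks up AMIs owned by your account by name and deregisters each match, using the `describe_images` and `deregister_image` calls of boto3's EC2 client. The AMI name is a placeholder and the function name is hypothetical. Note that deregistering an AMI does not delete its backing EBS snapshots.

```python
from typing import List


def deregister_ami_by_name(ec2_client, ami_name: str) -> List[str]:
    """Deregister every self-owned AMI with the given name; return their IDs."""
    response = ec2_client.describe_images(
        Owners=["self"],
        Filters=[{"Name": "name", "Values": [ami_name]}],
    )
    image_ids = [image["ImageId"] for image in response["Images"]]
    for image_id in image_ids:
        ec2_client.deregister_image(ImageId=image_id)
    return image_ids


# Usage, assuming boto3 is installed and the client targets the build region:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   print(deregister_ami_by_name(ec2, "my-hail-ami"))  # placeholder AMI name
```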