This repository provides a container and associated pipeline for comparing runs of a Singularity workflow across several cloud providers and HPC clusters:
- Google Cloud
- Amazon AWS
- Stanford Sherlock Cluster
- Stanford SCG4 Cluster
- Microsoft Azure
We will ideally look at metrics such as memory and cost, and assess the differences (or lack thereof) in running the analysis in multiple cloud environments.
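As a minimal sketch of how such metrics might be collected, the function below wraps a command and appends its exit status and elapsed time to a log file. The name `measure_run` and the log format are hypothetical and not part of this repository. Where GNU `time` is installed at `/usr/bin/time`, its verbose report (which includes the maximum resident set size) is written alongside the log. Cost is not measured here and would have to come from each provider's billing data.

```shell
# Hypothetical helper: run a command and record basic metrics.
# Usage: measure_run LOG_FILE COMMAND [ARGS...]
measure_run() {
    log_file="$1"; shift
    start=$(date +%s)
    # Probe for a time(1) that supports -v and -o (GNU time does).
    if /usr/bin/time -v -o /dev/null true 2>/dev/null; then
        # The verbose report includes "Maximum resident set size (kbytes)".
        /usr/bin/time -v -o "$log_file.time" "$@"
    else
        # Fall back to running the command without memory statistics.
        "$@"
    fi
    exit_code=$?
    end=$(date +%s)
    echo "command=$* exit_status=$exit_code elapsed_seconds=$((end - start))" >> "$log_file"
    return "$exit_code"
}
```

For example, `measure_run /scratch/logs/run.log singularity run analysis.img` would log one line per run, where `analysis.img` stands in for whatever image name you are using.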
- Start with a basic analysis, meaning an analysis packaged in a container.
- Build the container using Singularity Hub.
- Use Singularity Hub and Packer builds (included in this repo) to run and compare.
The folder cloud contains the runscript and other files necessary for running the pipeline on the cloud providers in the list above. The folder hpc contains the equivalent scripts for running on a local HPC cluster.
You have a few options for retrieving the logs. If this were a pipeline intended to run in parallel, you would want an endpoint waiting to receive a POST with the data, or even a simple function to upload to Dropbox. For the purposes of testing, an easy solution is to do one of the following:
- add the logs to a GitHub repo
- transfer the files using scp (or a provider-specific command, detailed below)
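The first option could look roughly like the sketch below. The function name `save_logs_to_repo` and the default repository path are hypothetical. It assumes the instance already has a local clone of a repo you can commit to, and that the logs live in `/scratch/logs` as in the transfer commands in this README.

```shell
# Hypothetical helper: copy logs into a local git clone and commit them.
# Usage: save_logs_to_repo [LOG_DIR] [REPO_DIR]
save_logs_to_repo() {
    log_dir="${1:-/scratch/logs}"
    repo_dir="${2:-$HOME/singularity-testing-logs}"   # illustrative default
    # Keep each machine's logs in its own subfolder, named after the host.
    dest="$repo_dir/logs/$(uname -n)"
    mkdir -p "$dest" || return 1
    cp -r "$log_dir"/. "$dest"/ || return 1
    git -C "$repo_dir" add logs || return 1
    git -C "$repo_dir" commit -q -m "Add logs from $(uname -n)" || return 1
    # Pushing needs a configured remote and credentials on the instance:
    # git -C "$repo_dir" push
}
```

Keeping one subfolder per host means logs from different providers can be committed to the same repo without overwriting each other.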
Google Cloud makes it easy to transfer files with the gcloud command line utility. For example:
# Copy from instance to present working directory
# (gcloud compute scp replaces the deprecated copy-files subcommand)
gcloud compute scp 'singularity-scientific:/scratch/logs/*' "$PWD"
On AWS, you can use the traditional tool scp to do this, passing your credential (.pem) file to the -i argument:
scp -i ~/.ssh/amazon.pem '<user>@<instance-address>:/scratch/logs/*' "$PWD"
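If you fetch logs from several instances, the scp call can be wrapped in a small function. This is only a sketch: `fetch_logs` and the `DRY_RUN` variable are hypothetical names, and the key path and remote address are whatever your own instance uses.

```shell
# Hypothetical helper: copy /scratch/logs/* from a remote host with scp.
# Usage: fetch_logs KEY_FILE USER@HOST [DEST_DIR]
# Set DRY_RUN=1 to print the scp command instead of running it.
fetch_logs() {
    key="$1"
    remote="$2"
    dest="${3:-$PWD}"
    mkdir -p "$dest" || return 1
    # The remote path is quoted so the glob is expanded on the remote side,
    # not by the local shell.
    ${DRY_RUN:+echo} scp -i "$key" "$remote:/scratch/logs/*" "$dest"
}
```

Running it first with `DRY_RUN=1` is a cheap way to check the key path and remote address before any transfer is attempted.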
For the HPC clusters Sherlock and SCG4, I used gFTP from my Ubuntu 16.04 machine.
git clone https://www.github.com/vsoch/singularity-testing
cd singularity-testing
Then you can follow the run.sh script in the cloud folder, given that you have sudo access on your endpoint. Running on a cloud provider (with sudo) is equivalent to this.
As an alternative to building the Docker image from the provided Dockerfile, you can use the one on Docker Hub.
- Moveable. We have included the entire analysis in a GitHub repo that can be easily cloned and run, provided the user supplies credentials for the various environments. The container is served by Singularity Hub.
- Transparent. The container is transparent in that running it reveals instructions for its use. This should be a minimal requirement for a scientific container.
- Scalable. The analysis should be possible to run optimally on a cluster (HPC).
- Environment Agnostic. It should work equivalently on a local computer, a computer cluster, Google Cloud, AWS, Azure, or other.
- Customizable. Hopefully the modular design, and the fact that each container takes input arguments, make this possible.