Read the Leonhard/Euler cluster guides: Getting_started_with_clusters
For this you can follow the guide on the official cluster website, which shows you how to generate your local SSH key and copy it to the cluster.
Steps in short:
- Connect to the ETH network via VPN. Cisco AnyConnect is highly recommended (most stable).
- Generate your local ssh key.
- Copy your local ssh key to the cluster by running:
cat ~/.ssh/id_rsa.pub | ssh [email protected] "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"
Here we assume you have already created your SSH key at '~/.ssh/id_rsa.pub' on your local machine.
- Try to connect. 'ssh [email protected]'
When you connect to the cluster, you connect to a login node. A variety of modules is pre-installed.
First, make sure to use the new software stack with the following command:
You can list the currently loaded modules with:
module list
When you want to develop something in Python you can either use pre-compiled binaries by loading the correct python module
module load gcc/6.3.0 python_gpu/3.7.4 cuda/10.1.243
module load gcc/6.3.0 python_gpu/3.8.5 cuda/11.0.3
or create your own Python installation.
The job execution nodes are not directly connected to the internet, but you can access the internet by loading the proxy module.
module load eth_proxy
You can take a look into the provided pre-compiled python binaries here: https://scicomp.ethz.ch/wiki/Python_on_Euler
In general we recommend setting up Miniconda to manage your Python environment. This allows you to fully match the cluster and your local setup.
Using Miniconda to set up a custom Python environment (https://docs.conda.io/en/latest/miniconda.html).
To install miniconda:
- Connect to the cluster
- Navigate to $HOME
- Run the following:
cd ~ && wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh
chmod +x ./Miniconda3-py38_4.9.2-Linux-x86_64.sh
./Miniconda3-py38_4.9.2-Linux-x86_64.sh
In general:
It is important to install the conda environment (which will contain a lot of small files) in your $HOME folder (/cluster/home/username/miniconda3). This directory is always copied to the compute node before a job runs. Your home folder is quite small (less than 15 GB) but perfect for storing your code and Python environments.
- Source the .bashrc file or open a new shell.
source ~/.bashrc
- Verify your installation:
You should now see the currently active conda environment in parentheses before your username.
(base) [username@login-noden ~]$
Follow this guide on how to set up a new environment. When using GPUs, make sure to match the CUDA version. You can load different CUDA versions with module load. Also be aware of the GCC version. We recommend GCC 6.3.0 and CUDA 11.0.
Guide on how to manage conda environments
Execute the following command to create your Python environment named myenv (you can change the name):
conda create -n myenv python=3.8.5
conda activate myenv
Install some packages:
Example PyTorch Installation (Here it's important to match the cudatoolkit version!):
conda install pytorch==1.7.1 \
torchvision==0.8.2 \
torchaudio==0.7.2 \
cudatoolkit=11.0 -c pytorch
- First, check your Python path:
which python
It should point to $HOME/miniconda3/envs/myenv/bin/python. If another path is shown, run conda deactivate
Then reactivate your environment: conda activate myenv
- Open an interactive Python shell and import torch to check that you have installed the correct PyTorch version:
python
import torch
print(torch.__version__)
Exit the shell with exit()
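The same check can be scripted. The following is a minimal sketch, assuming PyTorch was installed into the active environment as described above. The helper name `describe_torch` is made up for illustration. On a login node without a GPU, `cuda_available` is expected to be False even when the installation is fine.

```python
def describe_torch():
    """Return basic PyTorch/CUDA information, or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return {
        # Installed PyTorch version string.
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible right now.
        "cuda_available": torch.cuda.is_available(),
    }


info = describe_torch()
print(info if info is not None else "torch is not installed in this environment")
```

Comparing `cuda_build` with the CUDA module you load is one way to spot a version mismatch early.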
It's important to manage your data storage correctly on the cluster.
All large datasets should be stored under /cluster/work/riner
If your experiment results are large, store them under /cluster/work/riner as well.
It's important not to store many small files on the network storage. When you need to train your model on a large dataset, the workflow is the following.
- Tar the dataset folder without compression!
- Schedule the job and request SCRATCH storage (will be discussed in the job-section)
- Untar the dataset to the SCRATCH partition of the compute node, which is mounted under $TMPDIR.
- Now you can access the small files individually very fast given that they are on the SSD directly on the compute-node and no network transfer is needed.
If you don't follow this procedure and try to access a lot of small files on a network storage (/cluster/work/riner) you will slow down the network and your bandwidth will be massively reduced when you hit a certain file number limit.
cd directory/containing/datasets
tar -cvf dataset.tar dataset_folder
Open a shell on your local computer to upload data to the cluster:
scp -r ./path/to/local_folder [email protected]:/cluster/work/riner/some_folder
Open a shell on your local computer to download results from the cluster:
scp -r [email protected]:/cluster/work/riner/results ./path/to/local_results
tar -xvf /cluster/work/riner/datasets.tar -C $TMPDIR
Because the TMPDIR variable is set automatically, you can extract the dataset from your Python script as follows:
import os
# TMPDIR points to the local scratch of the compute node.
tmpdir = os.getenv('TMPDIR')
os.system(f'tar -xf /cluster/work/riner/yourtarfile -C {tmpdir}')
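The pack and unpack steps can also be done without calling the tar binary. The sketch below uses Python's standard `tarfile` module. The function names `pack_dataset` and `unpack_to_scratch` are hypothetical, and the fallback to the system temp directory when TMPDIR is unset is an assumption added so the code also runs on a workstation.

```python
import os
import tarfile
import tempfile


def pack_dataset(dataset_dir, tar_path):
    """Create an uncompressed tar archive (mode 'w', not 'w:gz') of dataset_dir."""
    folder_name = os.path.basename(os.path.normpath(dataset_dir))
    with tarfile.open(tar_path, "w") as tar:
        # arcname keeps only the folder name inside the archive, not the full path.
        tar.add(dataset_dir, arcname=folder_name)
    return tar_path


def unpack_to_scratch(tar_path, scratch_dir=None):
    """Extract tar_path into scratch_dir (default: $TMPDIR) and return that directory."""
    if scratch_dir is None:
        # Assumption: fall back to the system temp dir when TMPDIR is not set.
        scratch_dir = os.environ.get("TMPDIR", tempfile.gettempdir())
    with tarfile.open(tar_path, "r") as tar:
        # Only extract archives you created yourself.
        tar.extractall(scratch_dir)
    return scratch_dir
```

Inside a job you would call `unpack_to_scratch` once at start-up and then point your data loader at the returned directory.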
Don't use compression if your files are already compressed, such as images stored as JPGs or PNGs.
HDF5 files are also handy to use.
If your dataset is small, consider loading all files into RAM, since you can request a large amount of it.
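Loading a small dataset into RAM could look like the sketch below. It is a minimal illustration, not a tuned loader: the function name `load_into_ram` and the choice to keep raw bytes in a dict keyed by relative path are assumptions.

```python
from pathlib import Path


def load_into_ram(dataset_dir, pattern="*"):
    """Read every file matching pattern below dataset_dir into a dict of raw bytes."""
    root = Path(dataset_dir)
    return {
        # Key: path relative to the dataset root, with forward slashes.
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in sorted(root.rglob(pattern))
        if p.is_file()
    }
```

Each file is read exactly once, so later accesses during training no longer touch the file system.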
Read the Using the batch system section. Getting_started_with_clusters
First, let's start an interactive job running a shell.
bsub -n 16 -W 1:00 -R "rusage[mem=5000,ngpus_excl_p=2]" -R "select[gpu_mtotal0>=10000]" -R "rusage[scratch=10000]" -Is bash
This command returns an interactive bash session (-Is) with 16 cores (-n 16) that runs for 1 hour (-W 1:00) with 2 GPUs with more than 10 GB of memory. The total RAM is 16 x 5000 MB and the total SSD scratch is 16 x 10000 MB.
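The totals follow from the per-core values in the request. As a small worked example, assuming (as stated above) that both mem and scratch are specified per core, with the function name `total_resources` chosen for illustration:

```python
def total_resources(n_cores, mem_per_core_mb, scratch_per_core_mb):
    """Totals for a request in which mem and scratch are given per core."""
    return {
        "ram_mb": n_cores * mem_per_core_mb,
        "scratch_mb": n_cores * scratch_per_core_mb,
    }


# Values taken from the bsub command above: -n 16, mem=5000, scratch=10000.
totals = total_resources(n_cores=16, mem_per_core_mb=5000, scratch_per_core_mb=10000)
print(totals)  # {'ram_mb': 80000, 'scratch_mb': 160000}
```

So the example session asks for 80 GB of RAM and 160 GB of scratch in total, which is worth checking before raising the core count.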
We can run the following two commands to see the GPU utilization: nvidia-smi
and the CPU usage: htop
You can now simply activate the correct conda environment and run your python code as on your local computer. This is especially useful for debugging. If your code crashes it might happen that the terminal freezes and you have to submit a new interactive session.
If you know a workaround for this freezing problem, please share it!
You can see the running jobs with bjobs
or bbjobs
for more details.
You can use the JOB-IDs to stop or peek at a job.
bkill JOB-ID # Sends stop signal to the selected job
bkill 0 # Sends stop signals to ALL-jobs.
bpeek JOB-ID # Prints STD OUT of the selected job to the terminal.
When you want to evaluate or debug certain problems, it's helpful to connect to the job execution node directly.
bjob_connect JOB-ID
You will see in brackets how the node changes from a login node to the execution node.
To schedule a Python job, we will create a shell script submit.sh
Don't forget to set the correct permissions for execution:
chmod +x submit.sh
# Always reload all-modules before execution for consistency.
module list &> /dev/null || source /cluster/apps/modules/init/bash
module purge
module load legacy new gcc/6.3.0 hdf5 eth_proxy
# Navigate to the folder containing your python project.
# Specify the conda version.
# "$@" forwards all arguments to the Python file (quoted so arguments with spaces stay intact).
$HOME/miniconda3/envs/myenv/bin/python main.py "$@"
Scheduling the Job:
bsub -I -n 4 -W 1:00 -R "rusage[mem=5000]" $HOME/submit.sh --env=hello --exp=world
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--exp', help='Some flag.')
    parser.add_argument('--env', help='Other flag')
    args = parser.parse_args()
    print(args.exp, args.env)
When using interactive bash sessions, you may want to stop the program with Ctrl-C without freezing the terminal. It helps to catch the signal explicitly by adding the following to the main script:
import signal
import sys

def signal_handler(signum, frame):
    print('exiting on CTRL-C')
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
Tested on Leonhard and Euler.
Machine learning and vision tasks.
Python 3.8.5 & GCC/6.3.0
conda env create -f ./conda/py38.yml
- torch=1.7.1+cu110
- scikit-learn=0.24
- scipy=1.6.1
- numpy=1.19.2
- pandas=1.2.3
- pytorch-lightning=1.2.3
- opencv=4.5.1
- imageio=2.9.0
- pillow=8.1.2
- torchvision=0.8.2+cu110
- h5py=h5py
- matplotlib=3.3.4
- neptune-client=0.5.1
- tensorboard=2.4.1
Append the following lines to the end of your ~/.bashrc file (vi ~/.bashrc):
export NEPTUNE_API_TOKEN="token"
export ENV_WORKSTATION_NAME="leonhard"
NEPTUNE_API_TOKEN specifies your neptune.ai key for debugging (only necessary if you want to use neptune).
ENV_WORKSTATION_NAME specifies the name of the cluster. You can read this variable later from your Python script to keep track of which cluster you are on. It is also used to load the environment YAML file with the same name, for example /home/jonfrey/ASL_leonhard_euler/cfg/env/euler.yml
where you are able to specify cluster-specific paths and settings.
This allows you to easily move between your workstation and cluster.
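Reading the variable from Python might look like the sketch below. It only builds the path to the environment file and leaves the YAML parsing to whatever loader your project uses. The function name `env_config_path` and the fallback name "workstation" are assumptions for illustration, not part of the template.

```python
import os
from pathlib import Path


def env_config_path(project_root, default="workstation"):
    """Build <project_root>/cfg/env/<name>.yml from ENV_WORKSTATION_NAME."""
    # Hypothetical fallback: use `default` when the variable is not exported.
    name = os.environ.get("ENV_WORKSTATION_NAME", default)
    return Path(project_root) / "cfg" / "env" / f"{name}.yml"
```

With `ENV_WORKSTATION_NAME="euler"` exported, `env_config_path("/home/jonfrey/ASL_leonhard_euler")` resolves to the euler.yml path mentioned above.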
Follow the installing ansible on Ubuntu guide.
Configure ansible settings by modifying the following files.
sudo vi /etc/ansible/ansible.cfg
host_key_checking = False
sudo_flags=-H -S
private_key_file = /home/jonfrey/.ssh/id_rsa
pipelining = True
sudo vi /etc/ansible/hosts
login.leonhard.ethz.ch ansible_ssh_user=username
euler.ethz.ch ansible_ssh_user=username
Replace the username with your ETH email abbreviation.
You should now be able to ping the configured hosts: Command:
sudo ansible all -m ping
euler.ethz.ch | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": false,
    "ping": "pong"
}
login.leonhard.ethz.ch | SUCCESS => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": false,
    "ping": "pong"
}
First, have a look at the official documentation (https://docs.ansible.com/ansible/latest/user_guide/playbooks_intro.html)
Example Playbook (ansible/queue_jobs.yml)
- name: Schedule Experiments
  hosts: euler
  vars:
    - project_dir: "{{ ansible_env.HOME }}/"
  tasks:
    - name: Sync
      copy:
        src: /home/jonfrey/ASL_leonhard_euler
        dest: "{{ project_dir }}"
    - name: Load variables
      include_vars:
        file: /home/jonfrey/ASL_leonhard_euler/ansible/experiments.yml
        name: experiments
    - name: Schedule all experiments
      shell: >
        bsub -n 1 -W 0:10 -R "rusage[mem=5000,ngpus_excl_p=2]" -R "select[gpu_mtotal0>=10000]" -R "rusage[scratch=1000]" $HOME/ASL_leonhard_euler/scripts/submit.sh --exp={{ item.exp }}
      loop: "{{ experiments.jobs }}"
Playbook Explanation:
- Specify the execution host:
hosts: euler
The available hosts can be found in the previously set up /etc/ansible/hosts file.
- Synchronize your local code with the cluster:
You can modify the dest path as needed. It's also possible to use rsync here instead.
- Load variables:
Loads ansible/experiments.yml, where the paths to the experiment files are listed. Each entry in the jobs list is handled separately. We loop over the jobs list in the next command.
- Scheduling:
Schedules the job with the bash command and sets the correct exp-file path for each experiment. The scripts/submit.sh file loads the correct modules and starts main.py with the template conda environment. The arguments passed to the script (--exp=) are passed on to main.py. The loop keyword tells Ansible to iterate over the list:
loop: "{{ experiments.jobs }}"
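To make the scheduling step concrete, the sketch below reproduces in Python what the loop does: it builds one bsub command per entry of the jobs list. The layout of the list (dictionaries with an `exp` key) is inferred from `item.exp` in the playbook. The function name `build_bsub_commands` is hypothetical and the code only builds strings, it does not submit anything.

```python
import shlex


def build_bsub_commands(jobs, submit_script="$HOME/ASL_leonhard_euler/scripts/submit.sh"):
    """Return one bsub command line per job, mirroring the playbook's loop."""
    commands = []
    for job in jobs:
        commands.append(
            "bsub -n 1 -W 0:10 "
            '-R "rusage[mem=5000,ngpus_excl_p=2]" '
            '-R "select[gpu_mtotal0>=10000]" '
            '-R "rusage[scratch=1000]" '
            # shlex.quote protects experiment paths that contain spaces.
            f"{submit_script} --exp={shlex.quote(job['exp'])}"
        )
    return commands


# Example entries; the paths are placeholders.
example_jobs = [{"exp": "cfg/exp/exp.yml"}, {"exp": "cfg/exp/exp2.yml"}]
for command in build_bsub_commands(example_jobs):
    print(command)
```

Printing the commands first is a cheap way to check the resource flags before running the playbook.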
sudo ansible-playbook ansible/queue_jobs.yml
PLAY [Schedule Experiments] *********************************************************************************
TASK [Gathering Facts] **************************************************************************************
ok: [euler.ethz.ch]
TASK [Sync] *************************************************************************************************
changed: [euler.ethz.ch]
TASK [Load experiments] *************************************************************************************
ok: [euler.ethz.ch]
TASK [Schedule all experiments] ******************************************************************************
changed: [euler.ethz.ch] => (item={u'exp': u'/home/jonfrey/ASL_leonhard_euler/cfg/exp/exp.yml'})
changed: [euler.ethz.ch] => (item={u'exp': u'/home/jonfrey/ASL_leonhard_euler/cfg/exp/exp.yml'})
PLAY RECAP **************************************************************************************************
euler.ethz.ch : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
(base) [jonfrey@eu-login-11 ~]$ bjobs
165381072 jonfrey PEND gpu.4h eu-login-21 *p/exp.yml Mar 15 07:00
165381081 jonfrey PEND gpu.4h eu-login-21 *p/exp.yml Mar 15 07:00
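If you want to check the queue from a script, the bjobs listing can be parsed as plain text. The sketch below is deliberately conservative: it reads only the first four whitespace-separated columns (job ID, user, status, queue), since the later columns vary between pending and running jobs. The function name `parse_bjobs` is made up for illustration.

```python
def parse_bjobs(output):
    """Extract job ID, user, status and queue from bjobs text output."""
    jobs = []
    for line in output.splitlines():
        fields = line.split()
        # Skip header rows, prompts and anything not starting with a numeric job ID.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        job_id, user, stat, queue = fields[:4]
        jobs.append({"job_id": int(job_id), "user": user, "stat": stat, "queue": queue})
    return jobs


# The two lines from the listing above.
sample = """165381072 jonfrey PEND gpu.4h eu-login-21 *p/exp.yml Mar 15 07:00
165381081 jonfrey PEND gpu.4h eu-login-21 *p/exp.yml Mar 15 07:00"""
pending = [job["job_id"] for job in parse_bjobs(sample) if job["stat"] == "PEND"]
print(pending)  # [165381072, 165381081]
```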
