Python is widely used in scientific research for tasks such as data processing, analysis and visualisation. However, it is not yet widely used for large-scale modelling and simulation on High Performance Computing (HPC) architectures because of its performance: Python is primarily designed for ease of use and flexibility, not for speed, and a C program naively translated into Python can often run more than 100 times slower. Fortunately, there are techniques that can dramatically increase the speed of Python programs, such as fast array processing using numpy, parallelisation using MPI message passing, and running on HPC systems.
In this hands-on workshop we will illustrate these techniques in practice by applying them to a toy application which simulates traffic flow using a simple cellular automaton model. Having developed and tested the program on your own laptop, we will then move to the UK National Supercomputer ARCHER2, which has in excess of 750,000 CPU-cores. With a combination of numpy and mpi4py we will aim for a performance increase of more than a factor of 10,000 over the original program.
As much of the programming as possible will be done using Jupyter Notebooks, but we will run on ARCHER2 via the standard Slurm batch scheduling system. Although attendees will be encouraged to write their own code, full solutions will be provided for all exercises. The main aim is to illustrate the potential power of supercomputer systems to new users, and to demystify HPC and parallel programming.
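To give a flavour of the model before the hands-on sessions, here is a minimal sketch of a single update step, assuming the usual rule that a car moves forward one cell if and only if the cell ahead is empty, on a periodic road. The names and structure are illustrative and will not exactly match the traffic.py supplied with the exercises.

```python
# Minimal sketch of one step of the cellular-automaton traffic model, assuming
# the usual rule: a car (1) advances into the cell ahead if it is empty (0),
# on a periodic road. Names are illustrative, not those used in traffic.py.
import random

def update(road):
    n = len(road)
    new = [0] * n
    moved = 0
    for i in range(n):
        if road[i] == 1:
            if road[(i + 1) % n] == 0:    # cell ahead is empty: move forward
                new[(i + 1) % n] = 1
                moved += 1
            else:                          # blocked: stay where we are
                new[i] = 1
    return new, moved

road = [1 if random.random() < 0.5 else 0 for _ in range(20)]
for step in range(5):
    road, moved = update(road)
    print(f"step {step}: {moved} cars moved")
```

In the exercises the interesting quantity is typically how many cars move each step, so the update has to be applied many times to a long road - which is exactly where the cell-by-cell Python loop above becomes the bottleneck.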
Follow the installation instructions at https://github.com/davidhenty/PythonHPCprep/#readme.
Unless otherwise indicated all material is Copyright © EPCC, The University of Edinburgh, and is only made available for private study.
- 09:00 - 09:10 : Introduction
- 09:10 - 09:30 : Traffic Model (lecture)
- 09:30 - 09:45 : Traffic Model (hands-on)
- 09:45 - 10:10 : Introduction to Numpy Arrays (lecture)
- 10:10 - 10:30 : Traffic Model with Numpy (hands-on)
- 10:30 - 11:00 : Coffee Break
- 11:00 - 11:30 : Message-Passing Concepts (lecture)
- 11:30 - 11:40 : Log on to ARCHER2 (hands-on)
- 11:40 - 12:00 : Basic MPI with mpi4py (lecture)
- 12:00 - 12:30 : Parallel Traffic Model (hands-on)
Download and unzip PythonHPC-main.zip, which includes all the Jupyter notebooks for the exercises on your laptop - see the notebooks directory. Lecture material is available in the pdf directory.
We are not using Jupyter notebooks on ARCHER2 - follow the instructions below.
- Log on to ARCHER2 using ssh - you will have an account in the project ta083.
- Change directory to /work using cd /work/ta083/ta083/$USER/
- Copy over the exercises material: cp /work/ta083/shared/PythonHPC.zip .
- Unpack it: unzip PythonHPC.zip
- Change directory: cd PythonHPC/code/
You will see a number of directories containing C, Fortran and Python examples in serial and MPI (and OpenMP except for Python). We are only concerned with the Python examples today.
- Load the Python environment: module load cray-python
- Go to the P-SER-NP directory for the numpy version (a sketch of the vectorisation idea follows this list).
- Execute the traffic model: python traffic.py - how does the speed compare to your laptop?
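If you want a feel for what the numpy version is doing, here is a minimal sketch of how the same update rule can be vectorised with whole-array operations instead of a Python loop over cells; the supplied traffic.py may be organised differently.

```python
# Sketch of a vectorised update with numpy, assuming the same move-if-empty rule
# on a periodic road. The supplied traffic.py may differ in detail.
import numpy as np

def update(road):
    ahead = np.roll(road, -1)     # state of the cell in front of each cell
    behind = np.roll(road, 1)     # state of the cell behind each cell
    # an occupied cell stays occupied only if the cell ahead is occupied;
    # an empty cell becomes occupied if the cell behind holds a car
    new = np.where(road == 1, ahead, behind)
    moved = int(np.sum((road == 1) & (ahead == 0)))
    return new, moved

rng = np.random.default_rng(0)
road = (rng.random(1_000_000) < 0.5).astype(np.int32)
road, moved = update(road)
print(moved, "cars moved")
```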
We should really be running all computational jobs on the compute nodes, which we access by submitting the batch script archer2.job (in fact you cannot run parallel jobs on the login nodes - you have to use the compute nodes).
- Submit the batch job: sbatch archer2.job - this will run the numpy code on a single process (a sketch of what such a script contains follows this list).
- Output will eventually appear in a file called something like traffic-1234567.out - you can monitor the progress of your jobs using squeue -u $USER
- Is the performance on the compute nodes similar to the login nodes?
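For reference, a single-node batch script for the serial numpy run looks roughly like the sketch below. The account, partition and QoS values here are assumptions based on the course project ta083 and standard ARCHER2 settings - use the supplied archer2.job rather than retyping this.

```bash
#!/bin/bash
#SBATCH --job-name=traffic
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1     # a single process for the serial numpy version
#SBATCH --time=00:10:00
#SBATCH --account=ta083         # course budget (assumed)
#SBATCH --partition=standard    # standard ARCHER2 CPU partition (assumed)
#SBATCH --qos=standard

module load cray-python         # same Python environment as on the login nodes

srun python traffic.py
```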
We will now run the parallel version on ARCHER2. You will see that we have increased the number of iterations by a factor of 10 - without this the runtimes become so short that it is difficult to get accurate performance figures. As a rule of thumb, anything less than a second is probably too short.
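If you add your own timing, a pattern like the following (using time.perf_counter, with an update loop standing in for the real code) is enough - just make sure the timed region runs for well over a second:

```python
import time
import numpy as np

road = (np.random.default_rng(0).random(1_000_000) < 0.5).astype(np.int32)

start = time.perf_counter()
for _ in range(100):   # raise this until the timed region takes well over a second
    road = np.where(road == 1, np.roll(road, -1), np.roll(road, 1))
elapsed = time.perf_counter() - start
print(f"{elapsed:.3f} seconds for 100 iterations")
```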
- First change directory to P-MPI
- Submit the batch job: sbatch archer2.job - this will run the parallel MPI code on all 128 cores of a single ARCHER2 node (a sketch of the halo-exchange idea behind the parallel version follows this list).
- What is the performance? How much faster is it than the numpy code on a single process? How much faster is it than the original code?
- You can run on 256 or 512 cores by editing archer2.job and increasing the value of nodes to 2 or 4. These jobs may take a surprisingly long time to run - this is due to slow startup times for parallel Python jobs, not slow performance. At these high process counts you may want to increase the number of iterations by another factor of 10.
- Did we manage to break the 10,000 times faster barrier?
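For a flavour of what the MPI solution involves, here is a minimal mpi4py sketch of the underlying idea: split the road across processes in a 1D decomposition and swap one halo cell with each neighbour before every update. The supplied parallel code will differ in detail.

```python
# Minimal sketch of a 1D decomposition with halo exchange in mpi4py, assuming
# each rank holds a contiguous chunk of the road plus one halo cell at each end.
# The supplied MPI solution may be structured differently.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

nlocal = 10                                   # cells per rank (illustrative)
road = np.zeros(nlocal + 2, dtype=np.int32)   # indices 0 and nlocal+1 are halos
road[1:nlocal + 1] = (np.random.default_rng(rank).random(nlocal) < 0.5).astype(np.int32)

left = (rank - 1) % size                      # periodic road: neighbours wrap around
right = (rank + 1) % size

# swap halo cells with both neighbours (Sendrecv avoids deadlock)
comm.Sendrecv(sendbuf=road[nlocal:nlocal + 1], dest=right,
              recvbuf=road[0:1], source=left)
comm.Sendrecv(sendbuf=road[1:2], dest=left,
              recvbuf=road[nlocal + 1:nlocal + 2], source=right)

# each rank now updates its own nlocal cells exactly as in the serial version
new = np.where(road[1:nlocal + 1] == 1, road[2:nlocal + 2], road[0:nlocal])
moved = comm.allreduce(int(np.sum((road[1:nlocal + 1] == 1) & (road[2:nlocal + 2] == 0))))
```

On a laptop a sketch like this could be launched with something like mpirun -n 4 python halo_sketch.py (the file name is hypothetical); on ARCHER2 the batch script launches the real code with srun.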
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.