NIWA Auckland / Climate and Weather Applications, Forecasting Services
Auckland, New Zealand, 31 August and 1 September 2017
Contact:
Nicolas Fauchereau
- The Anaconda python distribution
- Installation of some additional libraries
- Running the Jupyter notebooks
- Content of the workshop
For this tutorial, I strongly recommend installing the Anaconda Python distribution. It is a completely free, enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing. It includes the Python interpreter itself, the Python standard library, and a set of packages providing data structures and methods for data manipulation, scientific computing, and visualization. In particular it provides Numpy, Scipy, Pandas, Matplotlib, etc., i.e. all the main packages we will be using during the tutorial. The full list of packages is available at:
http://docs.continuum.io/anaconda/pkgs.html
The Anaconda Python distribution (NOTE: select the version shipping with Python 3.6) must be downloaded for your platform from the Anaconda website.
You should not need administrator rights, as Anaconda is completely self-contained and can be installed in your HOME directory. I suggest installing Anaconda in /Users/USERNAME/anaconda on Macs, and in C:\Users\USERNAME\anaconda on Windows.
Once you have installed Anaconda, you can update to the latest compatible versions of all the pre-installed packages by running the following at the command line (on Windows, open Cmd.exe; on Mac, open the Terminal application in Utilities). The $ sign signifies the command prompt, so do not enter it! When asked "do you want to proceed?", just enter yes (y):
$ conda update conda
Then
$ conda update anaconda
You might also want to install pip, to be able to install packages from the Python Package Index:
$ conda install pip
netcdf4 allows you to read and write netcdf files (versions 3 and 4 are supported); install it with:
$ conda install netcdf4
Basemap is a graphic library for plotting (static, publication quality) geographical maps (see http://matplotlib.org/basemap/). Basemap is available directly in Anaconda for Mac and Linux using the conda package manager; install it with:
$ conda install basemap
but on Windows you'll have to use an alternative channel (conda-forge); the command becomes:
$ conda install -c conda-forge basemap
Cartopy is the second library for making geographical maps that we are going to see. It has been developed by the UKMO and will eventually replace Basemap; however, to date Cartopy does not have all the features present in Basemap.
Cartopy is not available through the standard Anaconda channel; to install it you need to use the community-maintained conda-forge channel, with this syntax:
$ conda install -c conda-forge cartopy
seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. You should be able to install it with conda as well:
$ conda install seaborn
xarray (previously xray) is a library aimed at bringing the power of Pandas to multidimensional labelled arrays, such as those usually associated with geophysical quantities varying along time and space dimensions (e.g. [time, latitudes, longitudes], [time, level, latitudes, longitudes], etc.), and it supports reading and writing netcdf files. It can be installed via conda:
$ conda install xarray
All of this can also be done using the Anaconda Navigator GUI (Graphical User Interface). If you go to the Environments tab on the left and select the root environment (we will talk about the concept of Python environments later in the workshop), you can then search for, select and install the available packages, including the ones mentioned above.
The material of the tutorial is in the form of Jupyter notebooks. In a nutshell, a Jupyter notebook is a web-based (i.e. running in the browser) interactive computational environment in which you can combine Python code execution, text, rendered mathematical expressions, plots, images and rich media into a single document.
It is in my opinion the best environment for exploratory data analysis: it allows you to weave comments, background information, interpretations and references in with the code itself, and it can be exported to a variety of formats (HTML, PDF) for sharing with colleagues. I strongly recommend spending some time learning about the notebook and its many features.
After uncompressing the archive of the repo (or after cloning it with git), navigate to the notebooks directory (containing the *.ipynb files) and type:
$ jupyter notebook
That should bring up the Jupyter notebook dashboard in your default browser. You should be ready to go!
The following is a brief description of the workshop material content:
- 01_test.ipynb: A simple Jupyter notebook to test the installation of the main libraries and their versions
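As a rough sketch of what such an installation check involves (the notebook's actual contents may differ), importing the libraries and printing their versions is enough:

```python
# Quick installation check: import the main libraries and print their versions.
import numpy
import pandas

for lib in (numpy, pandas):
    print(lib.__name__, lib.__version__)
```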
- 02_resources.ipynb: Some links to interesting resources and topics not covered during the workshop
- 03_Jupyter_notebook.ipynb: Introduction to the main features of the Jupyter notebook
- 04_introduction_Python.ipynb: Introduction to the basics of the Python language:
    - what's an interpreted programming language, what is the Python interpreter
    - how to write a Python script
    - basic Python native data types: dealing with numbers in native Python; strings and their formatting; lists, tuples, dictionaries, ...
    - arithmetic operators (+, -, *, /)
    - comparison operators (==, >=, <=, !=) and how they relate to booleans (True / False)
    - control flow structures (for, if / elif / else, while, etc.)
    - reusing your code: writing functions, modules and packages
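For a flavour of these basics, a small generic example (not taken from the notebook itself):

```python
# Native data types, string formatting, control flow and a simple function.
temperatures = [18.2, 19.5, 21.0, 17.8]  # a list of floats

def mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

for t in temperatures:
    if t >= 19.0:
        print("{:.1f} C: warm".format(t))
    else:
        print("{:.1f} C: cool".format(t))

print("mean = {:.2f}".format(mean(temperatures)))
```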
- 05_Numpy.ipynb: An introduction to Numpy. Numpy introduces in particular a data structure called the ndarray (N-dimensional array), which can store numerical values, and it is the foundational library of the Python scientific ecosystem. In this notebook we'll learn the main principles of manipulating numpy ndarrays. While you might not actually spend much time working with numpy itself, it is necessary to understand its basic principles, as (almost) everything else in scientific Python is built on top of numpy. We'll see:
    - how to create numpy arrays
    - indexing, slicing, reshaping, transposing, etc.
    - the main methods and functions in numpy operating on ndarrays
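A minimal sketch of these principles (generic numpy usage, not the notebook's actual code):

```python
import numpy as np

# Create a 1-D array of 0..11 and reshape it into a 3 x 4 matrix
a = np.arange(12).reshape(3, 4)

print(a.shape)         # the dimensions of the array
print(a[0, :])         # first row
print(a[:, -1])        # last column
print(a.T.shape)       # shape after transposing
print(a.mean(axis=0))  # column means (a method operating along an axis)
```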
- 06_Scipy.ipynb: Scipy is the second pillar of the Python scientific ecosystem. Where Numpy introduces a data structure (the ndarray), Scipy provides a collection of efficient scientific algorithms. These are organized in submodules covering topics ranging from linear algebra to signal and image processing and optimisation (see the Scipy documentation for the list of submodules and their dedicated tutorials). In this notebook we will focus on interpolation, and on some of the statistical algorithms and methods that scipy makes available.
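To illustrate the interpolation part, a small generic sketch using scipy.interpolate (the notebook's own examples may differ):

```python
import numpy as np
from scipy.interpolate import interp1d

# Coarse samples of a known signal
x = np.linspace(0, 10, 11)
y = np.sin(x)

# Build a linear and a cubic interpolator, then evaluate on a finer grid
f_linear = interp1d(x, y)
f_cubic = interp1d(x, y, kind="cubic")

x_fine = np.linspace(0, 10, 101)
print(f_cubic(x_fine)[:5])
```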
- 07_Pandas.ipynb: This is where we'll spend quite a bit of time! Pandas is THE library you need when dealing with tabular data, i.e. "spreadsheet-like" data, where values are stored in 2D arrays with row and column labels. It is typically the type of data you find in csv, tab / space delimited or Excel files. In this notebook we'll first see:
- The basics of the main data structures in Pandas: the Series and the Dataframe
- How to read from / write to common data files (csv, excel, space delimited, tab delimited etc). It will include how to read from a list of e.g. csv files, and concatenating their data in a Dataframe
- How to manipulate tabular data in Pandas, from selecting rows / columns to more sophisticated conditional indexing
- How to perform complex queries and assignments on Pandas Dataframes
- How to perform groupby operations (split / apply / combine)
- How to deal with Missing Values
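A small generic taste of these operations (the station names and values below are made up for illustration):

```python
import pandas as pd

# A small "spreadsheet-like" table of hypothetical station observations
df = pd.DataFrame({
    "station": ["AKL", "AKL", "WLG", "WLG"],
    "month": [1, 2, 1, 2],
    "rain_mm": [75.0, 65.0, 89.0, 81.0],
})

# Conditional indexing: rows where rainfall exceeds 70 mm
wet = df[df["rain_mm"] > 70]

# Split / apply / combine: mean rainfall per station
means = df.groupby("station")["rain_mm"].mean()
print(wet)
print(means)
```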
One of the strengths of Pandas is its ability to store, and perform sophisticated operations on, time-series: i.e. data indexed by dates, dates and times, timestamps, etc. In the second part of the Pandas tutorial we will focus on time-series manipulation in Pandas. In particular we'll see:
- how to read in and correctly parse files containing date / time information
- how to resample time-series
- rolling window operations (e.g. moving averages)
- how to deal with missing values and missing index values
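A generic sketch of these time-series operations (synthetic data, not the workshop's):

```python
import numpy as np
import pandas as pd

# A daily time-series indexed by dates
idx = pd.date_range("2017-01-01", periods=60, freq="D")
ts = pd.Series(np.arange(60, dtype=float), index=idx)

# Resample to monthly means (labelled at month start)
monthly = ts.resample("MS").mean()

# 7-day moving average via a rolling window
smooth = ts.rolling(window=7).mean()
print(monthly)
print(smooth.tail())
```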
- 08_xarray.ipynb: xarray is a library for reading / writing netcdf files and for manipulating multi-dimensional labelled arrays; it is especially handy for gridded data varying along latitudes, longitudes, time, depth, height, etc. Its design closely follows that of Pandas, meaning that familiarity with Pandas allows you to quickly pick up xarray. We'll see how to read and write netcdf files in xarray, and perform a series of common analyses such as:
- calculating a climatology
- calculating anomalies
- going from and to Pandas Dataframes
- calculating composite anomalies
- resampling, aggregation, groupby operations
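A generic sketch of the climatology / anomaly workflow (synthetic data; the coordinate names are illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

# A synthetic monthly field: 24 months x 2 latitudes x 2 longitudes
time = pd.date_range("2015-01-01", periods=24, freq="MS")
da = xr.DataArray(
    np.random.rand(24, 2, 2),
    coords={"time": time, "lat": [-37.0, -36.5], "lon": [174.5, 175.0]},
    dims=["time", "lat", "lon"],
    name="temp",
)

# Climatology: mean over all Januaries, all Februaries, ... (a groupby operation)
clim = da.groupby("time.month").mean("time")

# Anomalies: departure of each month from its climatology
anom = da.groupby("time.month") - clim

# Round-trip to a Pandas DataFrame
df = da.to_dataframe()
print(clim.shape, anom.shape, df.shape)
```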
- 09_plotting.ipynb: In this notebook we'll go over the basics of Matplotlib, the main plotting library in Python. It is more or less the equivalent of numpy but for plotting: i.e. a foundational library of the Python scientific ecosystem, on which a number of more specialised plotting libraries have been built. One of these we will briefly go over is seaborn, a plotting library handy for statistical plots. A short and non-exhaustive list of plotting libraries in Python is provided, covering both static plots (e.g. plotnine, ggplot2, etc.) and interactive plots (bokeh, plotly, holoviews, etc.)
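A minimal generic Matplotlib example (the Agg backend is used here so it runs without a display; in the notebook, plots are displayed inline instead):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()
fig.savefig("sines.png", dpi=100)  # write the figure to a png file
```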
- 10_Mapping.ipynb: Making maps in Python: a brief overview of basemap, cartopy and (very briefly, for interactive maps) folium. We'll also see briefly how to read shapefiles (with geopandas), transform them into geojson and edit them in the browser.