This repository contains Jupyter notebooks and example data sets for my Apache Spark tutorial.
NOTE: The tutorial is still under active development, so please make sure you update (pull) it on the day of the workshop.
Required skills:
- Confident user of the `python` programming language, including the use of the `map`, `filter` and `reduce` functions and `lambda` expressions (a minimal example is shown below)
- Basic skills in using Linux-based operating systems
If you need a brush-up on the `python` part, please check out the following tutorials: Python2 Tutorial or Python Functions - map ....
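For reference, here is a minimal plain-python example (no Spark required) of the constructs listed above:

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

squares = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4]
total = reduce(lambda acc, x: acc + x, numbers)      # 15

print(squares, evens, total)
```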
(Highly) Recommended skills:
- Familiarity with ipython/jupyter notebooks (https://jupyter.org/)
- Familiarity with the Databricks platform (see below)
- Experience with `python` packages such as `pandas` and `scikit-learn`
- Basic understanding of Machine Learning concepts such as: classification, regression, linear models, training and testing sets, cross-validation, parameter tuning (see the sketch below)
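If you would like a quick refresher on these concepts, here is a minimal, hypothetical `scikit-learn` sketch on a toy data set (it is not part of the tutorial notebooks) illustrating training and testing sets and cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Toy classification data set
X, y = load_iris(return_X_y=True)

# Hold out a testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a linear model on the training set and evaluate it on the held-out set
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation on the training set
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("cv accuracy:", scores.mean())
```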
Helpful skills:
- Experience with `python` packages such as `numpy` and `matplotlib`
- Experience with R
- Exposure to functional programming
- Exposure to SQL and RDBMS (Relational Database Management Systems)
- Basic understanding of Hadoop Map-Reduce and Hadoop HDFS
- Basic experience with using public cloud services (AWS or Google Cloud)
The easiest way to run the examples is to use the Databricks Platform, which provides a cloud-based analytics platform for data science and engineering. Databricks was founded by the team who created Apache Spark, and works with the open source community to continue to expand the project.
In addition to the commercial platform, Databricks offers a free Community Edition, which is a great way to learn both the Databricks Platform and Apache Spark, and to run this tutorial.
To sign up for a free account go to: https://community.cloud.databricks.com/
To learn how to use Databricks Platform please familiarize yourself with the Getting Started guide.
You may also like to check out some of the featured notebooks, for example Introduction to Apache Spark on Databricks.
If it seems a bit complicated, do not panic - we will cover the basics at the workshop.
To set up the Databricks Platform for the tutorial:
- In your `Workspace` create a folder named `spark-tutorial`
- Import to `spark-tutorial` the notebook from this URL: https://raw.githubusercontent.com/piotrszul/spark-tutorial/master/setup/0.1_Setup_DataBricks.ipynb (or from the file in `setup/0.1_Setup_DataBricks.ipynb`)
- Run the notebook with `Run All` on the default cluster `MyCluster` - it should run without errors
- Restart `MyCluster`
To verify that the setup is correct and to run your first tutorial notebook:
- Import to `spark-tutorial` the notebook from this URL: https://raw.githubusercontent.com/piotrszul/spark-tutorial/master/notebooks/0.1_Welcome.ipynb (or from the file in `notebooks/0.1_Welcome.ipynb`)
- `Run All` the notebook - it should run without errors and the output should look like this
- Please have a look through the notebook to familiarize yourself with the different ways of using the notebook to run python code, run shell commands and visualize the results (see the illustrative sketch below)
To run the rest of the tutorial follow the steps above for the other notebooks in the `notebooks` directory.
Here are the URLs to copy:
- 1.1_RDD-Basics.ipynb
- 1.2_RDD_Data-Processing.ipynb
- 1.3_RDD_Text-Processing.ipynb
- 2.1_StructuredData-Introduction.ipynb
- 2.2_StructuredData-Formats.ipynb
- 2.3_StructuredData-Analyzing.ipynb
- 3.1_ML-Introduction.ipynb
- 3.2_ML_Classification-Text.ipynb
- 3.3_ML_Classification-Categorical.ipynb
- 4.1_SparkR-Introduction.html
- 5.1_BigData_Genomics-Clustering.ipynb
- 5.2_BigData_Genomics_Visualise.ipynb
You can also run the tutorial notebooks on your computer with the Docker Platform.
First you need to install Docker Community Edition for your computer and OS from https://www.docker.com.
Once the installation is successful you need to download the Docker image with the tutorial to your machine.
WARNING: The image size is about 5GB, so you will need at least that amount of free disk space. The download size is also a few gigabytes, so be mindful of your Internet usage and the time it may take to finish.
To download the image, type at the command prompt:
docker pull piotrszul/spark-tutorial
Once the download is complete you can start your local pyspark enabled jupyter notebook with:
docker run -it -p 8888:8888 piotrszul/spark-tutorial
If all is good you should see the following output:
[I 02:41:57.564 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[W 02:41:57.895 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[W 02:41:57.896 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure and not recommended.
[I 02:41:57.903 NotebookApp] Serving notebooks from local directory: /home/jovyan/spark-tutorial/notebooks
[I 02:41:57.903 NotebookApp] 0 active kernels
[I 02:41:57.903 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/
[I 02:41:57.903 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
You can now navigate your browser to http://localhost:8888/ to see the list of the tutorial notebooks.
- Open `0.1_Welcome.ipynb` and run all the cells with `Cell\Run All` - it should run without errors and the output should look like this
- Please have a look through the notebook to familiarize yourself with the different ways of using the notebook to run python code, run shell commands and visualize the results
IMPORTANT: When you are done with the notebook, close it with `File\Close and Halt`. ONLY ONE pyspark notebook may be running at any given time.
To stop the `jupyter` server press `[Ctrl+C]` twice at the console window. You can start it again at any time with the `docker run ...` command above.
NOTE: Every time you use `docker run ...` the server will start from the clean image, without any changes that you might have made. If you want to preserve your changes, use the `--name` option when you run the server for the first time and then use the `docker start` command to connect to it again, e.g.:
# run a server (container) named `my-server` for the first time:
docker run -it -p 8888:8888 --name 'my-server' piotrszul/spark-tutorial
# do some work and stop the server
# to start `my-server` again use:
docker start -i 'my-server'
NOTE: The `piotrszul/spark-tutorial` image is still under active development, so please make sure you update (pull) it on the day of the workshop. The update is incremental, so the download size should not exceed 100MB.
Just go to https://piotrszul.github.io/spark-tutorial/.
The `5.1_BigData_Genomics-Clustering.ipynb` example can be run on the entire chromosome 22 data (or potentially the entire genome) with the Google Cloud Dataproc service.
The instructions are available in `google-cloud/README.md`
All data sets used in this tutorial are described in data/README.md
- Python API Docs
- Spark Programming Guide
- Spark SQL Programming Guide
- Spark ML Programming Guide
- Spark R Programming Guide
Other interesting sites:
- Kaggle - data science competitions with (sometimes huge) prizes + interesting data sets
- UCI Machine Learning Repository - many interesting machine learning data sets