This repository contains supplementary material for the Apache Spark lectures in the course Data Engineering 1 [1TD169], offered by the Department of Information Technology, Uppsala University. © 2025 by Usama Zafar.
This repository provides resources, guides, and code examples to help students learn and practice Apache Spark concepts covered in the course. It includes instructions for setting up a Spark cluster and a Spark driver VM, as well as notebooks and other materials to support your learning.
Before using this repository, you should have:
- Basic knowledge of Apache Spark and distributed computing.
- Access to a Spark cluster (deployed for the course).
- A Spark driver VM configured for submitting applications.
A Spark cluster should already be deployed for you. However, if you want to learn how to deploy a cluster yourself, follow the Spark Cluster Deployment Guide.
To run the notebooks and submit applications to the Spark cluster, you need to set up your Spark driver VM. Follow the Spark Driver VM Deployment Instructions. Make sure to customize the setup to fit your environment.
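After following the deployment instructions, a quick sanity check can confirm that the driver VM has the basic prerequisites in place. The snippet below is a minimal sketch, not part of the repository: the function name `check_driver_environment` is hypothetical, and it assumes the notebooks use PySpark and that Java is expected to be on the `PATH`.

```python
import importlib.util
import shutil


def check_driver_environment():
    """Return a dict describing whether basic PySpark prerequisites are present.

    Hypothetical helper for illustration; not provided by this repository.
    """
    return {
        # Spark runs on the JVM, so a `java` executable should be on PATH.
        "java_on_path": shutil.which("java") is not None,
        # Assumption: the notebooks use PySpark, so the package must be importable.
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_driver_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If either check reports `MISSING`, revisit the corresponding step of the driver VM instructions before opening the notebooks.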
The repository for course instance DE-2025/ is organized as follows:
- guides/: Contains step-by-step instructions for setting up the Spark cluster and driver VM.
- examples/: Includes Jupyter notebooks with code examples.
- data/: Sample datasets used in the notebooks.
- spark-deploy/: Helper code for deploying a Spark cluster using the OpenStack API.
⚠ Warning: The spark-deploy/ code is intended for advanced users familiar with the OpenStack API.
Once your Spark driver VM is set up, you can:
- Access the provided Jupyter notebooks to explore Spark concepts.
- Submit Spark applications to the cluster using the driver VM.
- Modify and experiment with the code to deepen your understanding.
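As an illustration of what connecting from the driver VM to the cluster might look like, here is a minimal PySpark sketch. It assumes a Spark standalone cluster, whose master listens on port 7077 by default; the host placeholder `<master-host>`, the application name `de1-example`, and the helper names `standalone_master_url`, `build_session`, and `main` are all hypothetical and not taken from the course material. Use the values given in the driver VM instructions instead.

```python
def standalone_master_url(host, port=7077):
    """Build a Spark standalone master URL (7077 is the standalone default port)."""
    return f"spark://{host}:{port}"


def build_session(master_url, app_name="de1-example", executor_cores=2):
    """Create (or reuse) a SparkSession attached to the given master.

    All parameter defaults here are illustrative assumptions.
    """
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master_url)
        .appName(app_name)
        .config("spark.executor.cores", str(executor_cores))
        .getOrCreate()
    )


def main():
    # Replace <master-host> with the address of the course cluster's master.
    spark = build_session(standalone_master_url("<master-host>"))
    try:
        # A tiny job to confirm that executors on the cluster respond.
        print(spark.range(10).count())
    finally:
        # Release cluster resources so other students can use them.
        spark.stop()
```

Calling `main()` from a notebook cell or a script runs the small test job. Stopping the session when finished matters on a shared cluster, since an idle session can keep holding executors.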
This repository is primarily for course material. However, if you find any issues or have suggestions for improvement, feel free to open an issue or submit a pull request.
This repository builds on the work of previous teaching assistants for the Data Engineering 1 course. Special thanks to Tianru Zhang for his contributions to the materials and setup instructions.
This repository is licensed under the Apache License 2.0. See the LICENSE file for details. By using this repository, you agree to comply with the terms of the Apache 2.0 License.