[1TD169] - Data Engineering 1: Spark Lecture

This repository contains supplementary material for the Apache Spark lectures in the course Data Engineering 1 [1TD169], offered by the Department of Information Technology, Uppsala University. © 2025 by Usama Zafar.

Overview

This repository provides resources, guides, and code examples to help students learn and practice Apache Spark concepts covered in the course. It includes instructions for setting up a Spark cluster and a Spark driver VM, as well as notebooks and other materials to support your learning.

Prerequisites

Before using this repository, you should have:

Basic knowledge of Apache Spark and distributed computing.
Access to a Spark cluster (deployed for the course).
A Spark driver VM configured for submitting applications.

Setup Instructions

1. Spark Cluster Deployment

A Spark cluster should already be deployed for you. However, if you want to learn how to deploy a cluster yourself, follow the Spark Cluster Deployment Guide.

2. Spark Driver VM Setup

To run the notebooks and submit applications to the Spark cluster, you need to set up your Spark driver VM. Follow the Spark Driver VM Deployment Instructions. Make sure to customize the setup to fit your environment.
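Once the driver VM is set up, a quick sanity check is to confirm that it can reach the cluster's master. The sketch below is an illustration only and is not taken from the course guides. It assumes a Spark standalone master addressed by a `spark://host:port` URL, and the helper names `parse_master_url` and `master_is_reachable` are hypothetical.

```python
import socket
from urllib.parse import urlparse

DEFAULT_MASTER_PORT = 7077  # default port of a Spark standalone master


def parse_master_url(master_url):
    """Split a spark://host:port URL into a (host, port) tuple."""
    parsed = urlparse(master_url)
    if parsed.scheme != "spark" or not parsed.hostname:
        raise ValueError(f"not a spark:// master URL: {master_url!r}")
    # Fall back to the standalone default when no port is given.
    return parsed.hostname, parsed.port or DEFAULT_MASTER_PORT


def master_is_reachable(master_url, timeout=3.0):
    """Return True if a TCP connection to the master can be opened."""
    host, port = parse_master_url(master_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `master_is_reachable` with the master URL of your course cluster returns `False` when the port is blocked or the address is wrong, which usually points to a network or security-group issue rather than a Spark problem.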

Repository Structure

The DE-2025/ directory, which holds the material for the 2025 course instance, is organized as follows:

guides/: Contains step-by-step instructions for setting up the Spark cluster and driver VM.
examples/: Includes Jupyter notebooks with code examples.
data/: Sample datasets used in the notebooks.
spark-deploy/: Helper code for deploying a Spark cluster using the OpenStack API.

⚠ Warning: The code in spark-deploy/ is intended for advanced users who are familiar with the OpenStack API.

Usage

Once your Spark driver VM is set up, you can:

Access the provided Jupyter notebooks to explore Spark concepts.
Submit Spark applications to the cluster using the driver VM.
Modify and experiment with the code to deepen your understanding.
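As a rough sketch of the submission step above, the snippet below assembles a `spark-submit` command line in Python. It is not the course's own tooling: the function `build_spark_submit`, the application file `wordcount.py`, the input path, and the master address are all hypothetical placeholders. The flags `--master`, `--name`, and `--conf` are standard `spark-submit` options.

```python
import shlex


def build_spark_submit(app_path, master_url, app_name=None, conf=None, app_args=()):
    """Assemble a spark-submit command line as a list of arguments."""
    cmd = ["spark-submit", "--master", master_url]
    if app_name:
        cmd += ["--name", app_name]
    # Each Spark property is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # The application file comes last, followed by its own arguments.
    cmd.append(app_path)
    cmd += [str(arg) for arg in app_args]
    return cmd


# Hypothetical values for illustration only.
command = build_spark_submit(
    app_path="wordcount.py",
    master_url="spark://spark-master.example:7077",
    app_name="wordcount-demo",
    conf={"spark.executor.memory": "2g", "spark.cores.max": "4"},
    app_args=["data/input.txt"],
)
print(shlex.join(command))
```

On a driver VM where `spark-submit` is on the `PATH`, the resulting list could be passed to `subprocess.run(command, check=True)`. Building the command as a list avoids shell-quoting problems when paths or arguments contain spaces.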

Contributing

This repository is primarily for course material. However, if you find any issues or have suggestions for improvement, feel free to open an issue or submit a pull request.

Acknowledgments

This repository builds on the work of previous teaching assistants for the Data Engineering 1 course. Special thanks to Tianru Zhang for his contributions to the materials and setup instructions.

License

This repository is licensed under the Apache License 2.0. See the LICENSE file for details. By using this repository, you agree to comply with the terms of the Apache 2.0 License.
