mccyx/DE1-Spark

 
 

[1TD169] - Data Engineering 1: Spark Lecture

This repository contains supplementary material for the Apache Spark lectures in the course Data Engineering 1 [1TD169], offered by the Department of Information Technology, Uppsala University. © 2025 by Usama Zafar.

Overview

This repository provides resources, guides, and code examples to help students learn and practice Apache Spark concepts covered in the course. It includes instructions for setting up a Spark cluster and a Spark driver VM, as well as notebooks and other materials to support your learning.


Prerequisites

Before using this repository, you should have:

  • Basic knowledge of Apache Spark and distributed computing.
  • Access to a Spark cluster (deployed for the course).
  • A Spark driver VM configured for submitting applications.

Setup Instructions

1. Spark Cluster Deployment

A Spark cluster should already be deployed for you. However, if you want to learn how to deploy a cluster yourself, follow the Spark Cluster Deployment Guide.

2. Spark Driver VM Setup

To run the notebooks and submit applications to the Spark cluster, you need to set up your Spark driver VM. Follow the Spark Driver VM Deployment Instructions. Make sure to customize the setup to fit your environment.
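As a rough illustration of what customizing the setup might involve, the sketch below collects the properties a driver typically needs to reach a standalone Spark master. The helper name `driver_conf`, the IP addresses, and the application name are assumptions made for this example; they are not taken from the course guide, so substitute the values for your own environment.

```python
def driver_conf(master_host, driver_host, app_name, master_port=7077, max_cores=2):
    """Return Spark properties for a driver that talks to a standalone master.

    All argument values are placeholders; 7077 is the default port of a
    standalone Spark master.
    """
    return {
        # URL of the standalone master the driver connects to.
        "spark.master": f"spark://{master_host}:{master_port}",
        # Name shown for the application in the Spark web UI.
        "spark.app.name": app_name,
        # Address that the executors use to reach the driver.
        "spark.driver.host": driver_host,
        # Upper bound on the cores the application may claim on the cluster.
        "spark.cores.max": str(max_cores),
    }


# Hypothetical addresses from the documentation range 192.0.2.0/24.
conf = driver_conf("192.0.2.10", "192.0.2.20", "de1-example")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

If PySpark is installed on the driver VM, each pair can be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.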


Repository Structure

The material for the course instance, under DE-2025/, is organized as follows:

  • guides/: Contains step-by-step instructions for setting up the Spark cluster and driver VM.
  • examples/: Includes Jupyter notebooks with code examples.
  • data/: Sample datasets used in the notebooks.
  • spark-deploy/: Helper code for deploying a Spark cluster through the OpenStack API.

    ⚠ Warning: This code is intended for advanced users who are familiar with the OpenStack API.


Usage

Once your Spark driver VM is set up, you can:

  1. Access the provided Jupyter notebooks to explore Spark concepts.
  2. Submit Spark applications to the cluster using the driver VM.
  3. Modify and experiment with the code to deepen your understanding.
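Step 2 can be sketched as a small helper that assembles a `spark-submit` command line. This is only an illustration: the helper name `spark_submit_command`, the master URL, the script `wordcount.py`, and its input path are hypothetical and do not come from this repository.

```python
import shlex


def spark_submit_command(master_url, app_path, app_args=(), conf=None, name=None):
    """Build the argument list for a spark-submit invocation."""
    cmd = ["spark-submit", "--master", master_url]
    if name:
        cmd += ["--name", name]
    # Each Spark property is passed as a separate --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # The application file comes after all spark-submit options,
    # followed by the application's own arguments.
    cmd.append(app_path)
    cmd.extend(app_args)
    return cmd


# Hypothetical master URL, script, and input path.
cmd = spark_submit_command(
    "spark://192.0.2.10:7077",
    "wordcount.py",
    app_args=["data/input.txt"],
    conf={"spark.cores.max": "2"},
    name="de1-wordcount",
)
print(shlex.join(cmd))
```

On the driver VM, the resulting list could be run with `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids quoting problems when paths or arguments contain spaces.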

Contributing

This repository is primarily for course material. However, if you find any issues or have suggestions for improvement, feel free to open an issue or submit a pull request.


Acknowledgments

This repository builds on the work of previous teaching assistants for the Data Engineering 1 course. Special thanks to Tianru Zhang for his contributions to the materials and setup instructions.


License

This repository is licensed under the Apache License 2.0. See the LICENSE file for details. By using this repository, you agree to comply with the terms of the Apache 2.0 License.
