
Premier League Data Pipeline

Overview

Note

This repository contains a personal project designed to enhance my skills in Data Engineering. It focuses on developing data pipelines that extract, transform, and load data from various sources into diverse databases. Additionally, it involves creating a dashboard with visualizations using Streamlit.

Important

Many architectural choices in this project are intentionally not the most efficient; they were made for the sake of practice and learning.


Infrastructure

Tools & Services

Google Cloud, Streamlit, Terraform, Docker, Prefect, dbt

Databases

Firestore, PostgreSQL, BigQuery

Code Quality

pre-commit

  • Security linter: bandit
  • Code formatting: ruff-format
  • Type checking: mypy
  • Code linting: ruff

Data and CI/CD Pipelines

Data Pipelines

Data Pipeline 1

  1. Data from the Financial Modeling Prep API is extracted with Python using the /quote endpoint.
  2. The data is loaded directly into a PostgreSQL database hosted on Cloud SQL with no transformations.
  3. The prior steps are orchestrated with Prefect (a sketch of this flow appears after this list).
  4. Once the data is loaded into PostgreSQL, Datastream replicates the data into BigQuery. Datastream checks for staleness every 15 minutes.
  5. dbt is used to transform the data in BigQuery and create a view with transformed data.
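
For steps 1–3, the extract-and-load logic reduces to a short Prefect flow. The following is a minimal sketch, assuming a hypothetical FMP_API_KEY environment variable, CLOUD_SQL_URL connection string, raw_quotes table, and response field names; none of these names are taken from the repository:

```python
import os

import requests
from prefect import flow, task
from sqlalchemy import create_engine, text

QUOTE_URL = "https://financialmodelingprep.com/api/v3/quote"


@task(retries=2)
def extract_quotes(symbols: str) -> list[dict]:
    """Extract raw quote records from the FMP /quote endpoint."""
    resp = requests.get(
        f"{QUOTE_URL}/{symbols}",
        params={"apikey": os.environ["FMP_API_KEY"]},  # hypothetical env var name
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


@task
def load_quotes(records: list[dict]) -> None:
    """Load records into Cloud SQL (PostgreSQL) as-is; dbt transforms them later in BigQuery."""
    engine = create_engine(os.environ["CLOUD_SQL_URL"])  # hypothetical connection string
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO raw_quotes (symbol, price, volume) VALUES (:symbol, :price, :volume)"),
            # Assumed field names in the /quote response.
            [{"symbol": r["symbol"], "price": r["price"], "volume": r["volume"]} for r in records],
        )


@flow
def quotes_pipeline(symbols: str = "MANU") -> None:
    load_quotes(extract_quotes(symbols))


if __name__ == "__main__":
    quotes_pipeline()
```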

Data Pipeline 2

  1. Data is extracted from multiple API sources with Python:
    • Data from the Football Data API is extracted with Python using the /standings, /teams, and top_scorers endpoints.
    • Data from the NewsAPI is extracted with Python using the /everything endpoint with parameters set to search for the Premier League.
    • Data from the Go & Gin API is extracted with Python using the /stadiums endpoint.
  2. Python performs any necessary transformations and loads the data into BigQuery (see the sketch after this list).
  3. The prior steps are orchestrated with Prefect.
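
The same extract → transform → load pattern applies to each source. Here is a minimal sketch for the standings source, assuming a hypothetical endpoint URL, FOOTBALL_API_KEY environment variable, BigQuery table ID, and response shape:

```python
import os

import requests
from google.cloud import bigquery
from prefect import flow, task

STANDINGS_URL = "https://api.football.example/standings"  # hypothetical URL
TABLE_ID = "my-project.premier_league.standings"  # hypothetical table ID


@task(retries=2)
def extract_standings() -> list[dict]:
    """Extract raw standings rows from the Football Data API."""
    resp = requests.get(
        STANDINGS_URL,
        headers={"X-Auth-Token": os.environ["FOOTBALL_API_KEY"]},  # hypothetical
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["standings"]  # assumed response shape


@task
def transform(rows: list[dict]) -> list[dict]:
    """Trim each row down to the fields the dashboard needs."""
    return [
        {"position": r["position"], "team": r["team"], "points": r["points"]}
        for r in rows
    ]


@task
def load(rows: list[dict]) -> None:
    """Stream the transformed rows straight into BigQuery."""
    errors = bigquery.Client().insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")


@flow
def standings_pipeline() -> None:
    load(transform(extract_standings()))
```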

Data Pipeline 3

  1. Data from the Football Data API is extracted with Python using the /fixtures endpoint.
  2. Python shapes the data into dictionaries and loads it into Firestore (see the sketch after this list).
  3. The prior steps run as a Docker container on Cloud Run as a Job, triggered on a schedule by Cloud Scheduler.
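
A minimal sketch of the extract-and-load script, assuming a hypothetical endpoint URL, FOOTBALL_API_KEY environment variable, and response field names:

```python
import os

import requests
from google.cloud import firestore

FIXTURES_URL = "https://api.football.example/fixtures"  # hypothetical URL


def extract_fixtures() -> list[dict]:
    """Extract raw fixture records from the Football Data API's /fixtures endpoint."""
    resp = requests.get(
        FIXTURES_URL,
        headers={"X-Auth-Token": os.environ["FOOTBALL_API_KEY"]},  # hypothetical
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["fixtures"]  # assumed response shape


def load_fixtures(fixtures: list[dict]) -> None:
    """Shape each fixture into a dictionary and upsert it into a Firestore collection."""
    db = firestore.Client()
    for f in fixtures:
        doc = {
            "home": f["home_team"],  # assumed field names
            "away": f["away_team"],
            "kickoff": f["kickoff_utc"],
        }
        # Keying documents on the fixture ID makes scheduled re-runs idempotent.
        db.collection("fixtures").document(str(f["id"])).set(doc)


if __name__ == "__main__":
    # Packaged in a Docker image, run on Cloud Run as a Job, fired by Cloud Scheduler.
    load_fixtures(extract_fixtures())
```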

Pipeline Diagram

[Image: data pipeline diagram]

CI/CD Pipeline

The CI/CD pipeline is focused on building the Streamlit app into a Docker image that is then pushed to Artifact Registry and deployed to Cloud Run as a Service. Images are also built for different architectures (machine types) and pushed to Docker Hub.

  1. The repository code is checked out and a Docker image containing the updated streamlit_app.py file is built.
  2. The newly built Docker image will be pushed to Artifact Registry.
  3. The Docker image is then deployed to Cloud Run as a Service (the commands these steps reduce to are sketched below).
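
The workflow itself lives in the CI configuration, but the three steps above reduce to a handful of commands. A sketch with hypothetical image, project, region, and service names:

```python
import subprocess

# Hypothetical names; the real values live in the workflow's secrets and variables.
IMAGE = "europe-west2-docker.pkg.dev/my-project/premier-league/streamlit-app:latest"
SERVICE = "premier-league-dashboard"
REGION = "europe-west2"


def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. Build the image containing the updated streamlit_app.py.
run(["docker", "build", "-t", IMAGE, "."])
# 2. Push the newly built image to Artifact Registry.
run(["docker", "push", IMAGE])
# 3. Deploy the pushed image to Cloud Run as a Service.
run(["gcloud", "run", "deploy", SERVICE, "--image", IMAGE, "--region", REGION])
```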

Pipeline Diagram

[Image: CI/CD pipeline diagram]


Security

  • Syft and Grype work together to scan the Streamlit Docker image: Syft creates an SBOM (software bill of materials) and Grype scans that SBOM for vulnerabilities. The results are sent to the repository's Security tab.
  • Snyk is also used to scan the repository for vulnerabilities in the Python packages.

About

A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.

Languages

  • Python 90.4%
  • Go 4.9%
  • HCL 2.3%
  • Dockerfile 1.7%
  • Shell 0.7%