Analytics of Airline Delays of US Airports

Analytics of Airline Delays of US Airports

Preface

For my data engineering course project, I chose to utilize the Airline Delay and Cancellation Data, 2009 - 2018. The dataset is a summary information on the number of on-time, delayed, canceled, and diverted flights.

Problem

This dataset is not readily analyzable and there are few obstacles for data analysis, however. It has a few limitations, however. Firstly, it is divided into yearly segments spanning from 2009 to 2018. Additionally, the information is not easily extractable as there is little metadata. For example, the use of airline and airport codes may pose a challenge as there is no accompanying information within the dataset to identify these codes. To overcome this issue, I was able to source supplementary CSV files that contain details on each airport’s location, state, city and corresponding airport code.

Data Engineering Architecture Overview

Repository Structure

This repository has a typical data processing workflow structure, consisting of infrastructure as code (Terraform), raw data in CSV format, processed data in parquet format, and data processing scripts in the processing folder. It also includes lookup tables for airline and airport codes in the data folder.

Infrastructure as Code

terraform: folder that contains Terraform configurations

terraform/main.tf: includes terraform for Google Cloud image

terraform/variable.tf: includes terraform variables for Google Cloud image

Data

/data/csv: the raw data after downloading it from Kaggle

data/parquet: parquet files

data/: includes lookup tables for airline and airports codes

Processing

processing/flow: folder that contains transform and uploading raw data from local to Google Cloud Bucket then from Bucket to Google Cloud

DHW Queries for Partitioning and Clustering

How to run it

Download the dataset from Airline Delay and Cancellation Data, 2009 - 2018
Move the dataset to data/csv folder in the code
Run the Python Script processing/flow/etl_to_gcs.py
```
python ./processing/flow/etl_to_gcs.py    
```
Run the python Script processing/flow/etl_gcs_to_bq.py
```
python ./processing/flow/etl_gcs_to_bq.py 
```
You can find the dbt files in second the branch called dbt-model
Go to your Google data studio and connect your BigQuery with the Dashboard Studio
You can find BigQuery Queries for partitioning and clustering in code

My Database Optimization through Partitioning and Clustering

CREATE TABLE data-engineering-camp-376112.us_airline_delay.flights_delay_partitioned PARTITION BY FL_DATE CLUSTER BY OP_CARRIER, ORIGIN, DEST OPTIONS( expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY), description='My partitioned and clustered table' ) AS SELECT * FROM data-engineering-camp-376112.us_airline_delay.flights_delay;

CREATE TABLE data-engineering-camp-376112.us_airline_delay.flights_delay_clustered PARTITION BY FL_DATE CLUSTER BY OP_CARRIER OPTIONS( expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY), description='My partitioned and clustered table' ) AS SELECT * FROM data-engineering-camp-376112.us_airline_delay.flights_delay_partitioned;

DBT documentation

https://cloud.getdbt.com/accounts/148444/runs/138771363/docs/#!/overview

Visualization

The Visualization section showcases a collection of charts that are designed to visually represent flight delay and cancellation data.

Google Data Studio link

Page 1 : Flights Delay

Page 2 : Flights Cancellations

Contact

If you have any questions, feedback, or issues with the repository, please feel free to reach out to us using the following channels:

Email: [email protected]
LinkedIn: Shaalan Marwan
GitHub Issue: https://github.com/ShaalanMarwan/airline-delay-and-cancellation/issues

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
.vscode		.vscode
dbt		dbt
images		images
terraform		terraform
.gitignore		.gitignore
DHW-queries.md		DHW-queries.md
README.md		README.md
cancellation_code.md		cancellation_code.md
columns.md		columns.md
data-eng.md		data-eng.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Analytics of Airline Delays of US Airports

Preface

Problem

Data Engineering Architecture Overview

Repository Structure

Infrastructure as Code

Data

Processing

How to run it

My Database Optimization through Partitioning and Clustering

DBT documentation

Visualization

Page 1 : Flights Delay

Page 2 : Flights Cancellations

Contact

About

Releases

Packages

Languages

ShaalanMarwan/airline-delay-and-cancellation

Folders and files

Latest commit

History

Repository files navigation

Analytics of Airline Delays of US Airports

Preface

Problem

Data Engineering Architecture Overview

Repository Structure

Infrastructure as Code

Data

Processing

How to run it

My Database Optimization through Partitioning and Clustering

DBT documentation

Visualization

Page 1 : Flights Delay

Page 2 : Flights Cancellations

Contact

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages