Skip to content

ShaalanMarwan/airline-delay-and-cancellation

Repository files navigation

Analytics of Airline Delays of US Airports


Preface

For my data engineering course project, I chose to utilize the Airline Delay and Cancellation Data, 2009 - 2018. The dataset is a summary information on the number of on-time, delayed, canceled, and diverted flights.


Problem

This dataset is not readily analyzable and there are few obstacles for data analysis, however. It has a few limitations, however. Firstly, it is divided into yearly segments spanning from 2009 to 2018. Additionally, the information is not easily extractable as there is little metadata. For example, the use of airline and airport codes may pose a challenge as there is no accompanying information within the dataset to identify these codes. To overcome this issue, I was able to source supplementary CSV files that contain details on each airport’s location, state, city and corresponding airport code.


Data Engineering Architecture Overview

Markdown Monster icon


Repository Structure

This repository has a typical data processing workflow structure, consisting of infrastructure as code (Terraform), raw data in CSV format, processed data in parquet format, and data processing scripts in the processing folder. It also includes lookup tables for airline and airport codes in the data folder.

  • Infrastructure as Code

terraform: folder that contains Terraform configurations

terraform/main.tf: includes terraform for Google Cloud image

terraform/variable.tf: includes terraform variables for Google Cloud image

  • Data

/data/csv: the raw data after downloading it from Kaggle

data/parquet: parquet files

data/: includes lookup tables for airline and airports codes

  • Processing

processing/flow: folder that contains transform and uploading raw data from local to Google Cloud Bucket then from Bucket to Google Cloud

DHW Queries for Partitioning and Clustering


How to run it

  • Download the dataset from Airline Delay and Cancellation Data, 2009 - 2018

  • Move the dataset to data/csv folder in the code

  • Run the Python Script processing/flow/etl_to_gcs.py

    python ./processing/flow/etl_to_gcs.py    
  • Run the python Script processing/flow/etl_gcs_to_bq.py

    python ./processing/flow/etl_gcs_to_bq.py 
  • You can find the dbt files in second the branch called dbt-model

  • Go to your Google data studio and connect your BigQuery with the Dashboard Studio

  • You can find BigQuery Queries for partitioning and clustering in code


My Database Optimization through Partitioning and Clustering

CREATE TABLE data-engineering-camp-376112.us_airline_delay.flights_delay_partitioned PARTITION BY FL_DATE CLUSTER BY OP_CARRIER, ORIGIN, DEST OPTIONS( expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY), description='My partitioned and clustered table' ) AS SELECT * FROM data-engineering-camp-376112.us_airline_delay.flights_delay;

CREATE TABLE data-engineering-camp-376112.us_airline_delay.flights_delay_clustered PARTITION BY FL_DATE CLUSTER BY OP_CARRIER OPTIONS( expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY), description='My partitioned and clustered table' ) AS SELECT * FROM data-engineering-camp-376112.us_airline_delay.flights_delay_partitioned;


DBT documentation

Markdown Monster icon

https://cloud.getdbt.com/accounts/148444/runs/138771363/docs/#!/overview


Visualization

The Visualization section showcases a collection of charts that are designed to visually represent flight delay and cancellation data.

Google Data Studio link

Page 1 : Flights Delay

Markdown Monster icon

Page 2 : Flights Cancellations

Markdown Monster icon


Contact

If you have any questions, feedback, or issues with the repository, please feel free to reach out to us using the following channels:

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published