The dataset for this project is the Online Retail dataset, available on Kaggle.
| Column | Description |
|---|---|
| InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. |
| StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. |
| Description | Product (item) name. Nominal. |
| Quantity | The quantities of each product (item) per transaction. Numeric. |
| InvoiceDate | Invoice date and time. Numeric, the day and time when each transaction was generated. |
| UnitPrice | Unit price. Numeric, product price per unit in sterling. |
| CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. |
| Country | Country name. Nominal, the name of the country where each customer resides. |
- Docker
- Astro CLI
- Soda
- Google Cloud account
IMPORTANT! Open the `Dockerfile` and make sure it uses `quay.io/astronomer/astro-runtime:8.8.0` (Airflow 2.6.1). If it doesn't, switch to that version and restart Airflow (`astro dev restart` with the Astro CLI).
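For reference, the runtime is pinned by the base image on the first line of the `Dockerfile`:

```dockerfile
FROM quay.io/astronomer/astro-runtime:8.8.0
```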
- Download the dataset from Kaggle and store the CSV file in `include/dataset/online_retail.csv`.
- Add `apache-airflow-providers-google==10.3.0` to `requirements.txt` and restart Airflow.
- Create a Google Cloud Storage (GCS) bucket with a unique name, e.g. `<your_name>_online_retail`.
- Create a service account named `airflow-online-retail` and grant it admin access to GCS and BigQuery. Create a JSON key for the service account and save it as `service_account.json` in `include/gcp/`.
- Add the Google Cloud connection in Airflow with the service account key.
- Create the DAG for loading the dataset into GCS (see the DAG sketch after this list).
- Test the task for uploading the CSV to GCS.
- Create an empty dataset in BigQuery.
- Create a task for loading the CSV file into a BigQuery table (included in the DAG sketch below).
- Install Soda Core and create a configuration file `configuration.yml` (an example follows this list).
- Create a Soda Cloud account and add the API key to the configuration file.
- Test the connection and create quality check YAML files for raw invoices.
- Install Cosmos and dbt, and set up the required files and configurations (see the task-group sketch after this list).
- Run the DBT models to transform the data.
- Add quality check YAML files for transformed data.
- Add a task for running quality checks on transformed data (see the scan helper after this list).
- Test the task for quality checks.
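As referenced in the steps above, here is a minimal sketch of the ingestion DAG. The `gcp` connection ID, the task IDs, and the `raw_invoices` table name are assumptions; adjust the bucket name to the one you created:

```python
# dags/retail.py -- a sketch, not the definitive implementation
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False, tags=["retail"])
def retail():
    # Upload the local CSV into the GCS bucket created earlier
    upload_csv_to_gcs = LocalFilesystemToGCSOperator(
        task_id="upload_csv_to_gcs",
        src="include/dataset/online_retail.csv",
        dst="raw/online_retail.csv",
        bucket="<your_name>_online_retail",  # your bucket name
        gcp_conn_id="gcp",  # assumed Airflow connection ID
        mime_type="text/csv",
    )

    # Create the empty BigQuery dataset (no-op if it already exists)
    create_retail_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create_retail_dataset",
        dataset_id="retail",
        gcp_conn_id="gcp",
    )

    # Load the uploaded CSV into the raw_invoices table
    gcs_to_raw = GCSToBigQueryOperator(
        task_id="gcs_to_raw",
        bucket="<your_name>_online_retail",
        source_objects=["raw/online_retail.csv"],
        destination_project_dataset_table="retail.raw_invoices",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
        gcp_conn_id="gcp",
    )

    upload_csv_to_gcs >> create_retail_dataset >> gcs_to_raw


retail()
```

Each task can be tested in isolation, e.g. `astro dev run tasks test retail upload_csv_to_gcs 2023-01-01`.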
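For the Soda steps, here is a sketch of `configuration.yml` for a BigQuery data source; the path under `include/soda/`, the `retail` data source name, and the placeholders are assumptions:

```yaml
# include/soda/configuration.yml -- a sketch; fill in the placeholders
data_source retail:
  type: bigquery
  connection:
    account_info_json_path: /usr/local/airflow/include/gcp/service_account.json
    auth_scopes:
      - https://www.googleapis.com/auth/bigquery
    project_id: <your_project_id>
    dataset: retail

soda_cloud:
  host: cloud.soda.io
  api_key_id: <your_api_key_id>
  api_key_secret: <your_api_key_secret>
```

A quality check file for the raw invoices could then look like this (the file name and column list are illustrative):

```yaml
# include/soda/checks/sources/raw_invoices.yml
checks for raw_invoices:
  - schema:
      fail:
        when required column missing:
          [InvoiceNo, StockCode, Quantity, InvoiceDate, UnitPrice]
```

The connection can be verified from inside the container with `soda test-connection -d retail -c include/soda/configuration.yml`.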
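For the Cosmos step, here is a sketch of mounting the dbt project as an Airflow task group, assuming a recent astronomer-cosmos release and the dbt project living in `include/dbt`; the profile and group names are assumptions:

```python
# Inside the DAG definition -- a sketch against the astronomer-cosmos 1.x API
from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig
from cosmos.profiles import GoogleCloudServiceAccountFileProfileMapping

profile_config = ProfileConfig(
    profile_name="retail",
    target_name="dev",
    # Reuses the same assumed "gcp" Airflow connection as the ingestion tasks
    profile_mapping=GoogleCloudServiceAccountFileProfileMapping(
        conn_id="gcp",
        profile_args={"dataset": "retail"},
    ),
)

transform = DbtTaskGroup(
    group_id="transform",
    project_config=ProjectConfig("/usr/local/airflow/include/dbt"),
    profile_config=profile_config,
)
```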
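Finally, for the quality-check tasks, here is a sketch of a helper that runs a Soda scan programmatically via Soda Core's Python API; the same function can back both the raw and the transformed checks (the file layout is an assumption):

```python
# include/soda/check_function.py -- a sketch using Soda Core's Python API
def run_soda_scan(project_root: str, scan_name: str, checks_subpath: str) -> int:
    from soda.scan import Scan

    scan = Scan()
    scan.set_scan_definition_name(scan_name)
    scan.set_data_source_name("retail")
    scan.add_configuration_yaml_file(f"{project_root}/soda/configuration.yml")
    scan.add_sodacl_yaml_files(f"{project_root}/soda/checks/{checks_subpath}")
    scan.set_verbose(True)

    # execute() returns a non-zero exit code when checks fail
    result = scan.execute()
    print(scan.get_logs_text())
    if scan.has_check_fails():
        raise ValueError(f"Soda scan failed: {scan.get_checks_fail_text()}")
    return result
```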
Note: The reports section is yet to be completed.
- In `include/dbt/models/report`, create the following report models:
```sql
-- daily_revenue.sql
-- Get daily revenue for the last 30 days
SELECT
  DATE(datetime) AS date,
  SUM(total) AS revenue
FROM {{ ref('fct_invoices') }}
WHERE datetime BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY), DAY)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
GROUP BY date
```

```sql
-- monthly_revenue.sql
-- Get monthly revenue for the last 12 months
SELECT
  EXTRACT(YEAR FROM datetime) AS year,
  EXTRACT(MONTH FROM datetime) AS month,
  SUM(total) AS revenue
FROM {{ ref('fct_invoices') }}
WHERE datetime BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 MONTH), MONTH)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MONTH)
GROUP BY year, month
```

```sql
-- country_revenue.sql
-- Get revenue by country
SELECT
  dc.iso,
  SUM(total) AS revenue
FROM {{ ref('fct_invoices') }} fi
INNER JOIN {{ ref('dim_customer') }} dc
  ON fi.customer_id = dc.customer_id
GROUP BY dc.iso
```
- Test the reports:

```bash
astro dev bash
dbt run --models report
```
🏆 First dbt reports in place!
- Install the Google Cloud SDK by adding the following to the `Dockerfile`:

```dockerfile
# Google Cloud SDK
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
    apt-get update -y && \
    apt-get install google-cloud-sdk -y
```
- Create a new file `include/viz/plotly.py`:

```python
import plotly.express as px
from google.cloud import bigquery

client = bigquery.Client()


def plot_country_revenue():
    # Revenue by country, read straight from the BigQuery tables
    query = """
        SELECT dc.iso, SUM(total) AS revenue
        FROM `airtube-390719.retail.fct_invoices` fi
        INNER JOIN `airtube-390719.retail.dim_customer` dc
          ON fi.customer_id = dc.customer_id
        GROUP BY dc.iso
    """
    df = client.query(query).to_dataframe()
    fig = px.bar(df, x='iso', y='revenue', title='Revenue by Country')
    fig.show()


if __name__ == '__main__':
    plot_country_revenue()
```
- Test the visualization:

```bash
astro dev bash
python include/viz/plotly.py
```
🏆 First visualization in place!
Congratulations! The data pipeline for the retail project is now fully set up. Here's what was accomplished:
- Dataset Acquisition: Obtained the dataset from Kaggle.
- Data Modeling: Defined a data model and transformed the raw data into structured tables using dbt.
- Data Quality: Implemented data quality checks to ensure the integrity of the data.
- Reports: Created SQL queries to generate daily, monthly, and country-wise revenue reports.
- Visualization: Developed a visualization to display revenue by country using Plotly.