The dataset for this project is the Online Retail dataset, available on Kaggle.
| Column | Description |
|---|---|
| InvoiceNo | Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. |
| StockCode | Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. |
| Description | Product (item) name. Nominal. |
| Quantity | The quantities of each product (item) per transaction. Numeric. |
| InvoiceDate | Invoice date and time. Numeric, the day and time when each transaction was generated. |
| UnitPrice | Unit price. Numeric, product price per unit in sterling. |
| CustomerID | Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. |
| Country | Country name. Nominal, the name of the country where each customer resides. |
- Docker
- Astro CLI
- Soda
- Google Cloud account
IMPORTANT! Open the `Dockerfile` and make sure it uses `quay.io/astronomer/astro-runtime:8.8.0` (Airflow 2.6.1). If it doesn't, switch to that version and restart Airflow (`astro dev restart` with the Astro CLI).
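For reference, the runtime is pinned by the base image on the first line of the `Dockerfile`:

```dockerfile
FROM quay.io/astronomer/astro-runtime:8.8.0
```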
- Download the dataset from Kaggle and store the CSV file in `include/dataset/online_retail.csv`.
- Add `apache-airflow-providers-google==10.3.0` to `requirements.txt` and restart Airflow.
- Create a Google Cloud Storage (GCS) bucket with a unique name, e.g. `<your_name>_online_retail`.
- Create a service account named `airflow-online-retail` and grant it admin access to GCS and BigQuery. Create a JSON key for the service account and save it as `service_account.json` in `include/gcp/`.
- Add the Google Cloud connection in Airflow with the service account key.
- Create the DAG for loading the dataset into GCS (see the DAG sketch after this list).
- Test the task for uploading the CSV to GCS.
- Create an empty dataset in BigQuery.
- Create a task for loading the CSV file into a BigQuery table (included in the DAG sketch below).
- Install Soda Core and create a configuration file `configuration.yml` (an example follows this list).
- Create a Soda Cloud account and add the API key to the configuration file.
- Test the connection and create quality check YAML files for raw invoices.
- Install Cosmos and dbt, and set up the required files and configurations (see the task-group sketch after this list).
- Run the DBT models to transform the data.
- Add quality check YAML files for transformed data.
- Add a task for running quality checks on transformed data (see the scan helper after this list).
- Test the task for quality checks.
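As referenced in the steps above, here is a minimal sketch of the ingestion DAG. The `gcp` connection ID, the task IDs, and the `raw_invoices` table name are assumptions; adjust the bucket name to the one you created:

```python
# dags/retail.py -- a sketch, not the definitive implementation
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False, tags=["retail"])
def retail():
    # Upload the local CSV into the GCS bucket created earlier
    upload_csv_to_gcs = LocalFilesystemToGCSOperator(
        task_id="upload_csv_to_gcs",
        src="include/dataset/online_retail.csv",
        dst="raw/online_retail.csv",
        bucket="<your_name>_online_retail",  # your bucket name
        gcp_conn_id="gcp",  # assumed Airflow connection ID
        mime_type="text/csv",
    )

    # Create the empty BigQuery dataset (no-op if it already exists)
    create_retail_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create_retail_dataset",
        dataset_id="retail",
        gcp_conn_id="gcp",
    )

    # Load the uploaded CSV into the raw_invoices table
    gcs_to_raw = GCSToBigQueryOperator(
        task_id="gcs_to_raw",
        bucket="<your_name>_online_retail",
        source_objects=["raw/online_retail.csv"],
        destination_project_dataset_table="retail.raw_invoices",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
        gcp_conn_id="gcp",
    )

    upload_csv_to_gcs >> create_retail_dataset >> gcs_to_raw


retail()
```

Each task can be tested in isolation, e.g. `astro dev run tasks test retail upload_csv_to_gcs 2023-01-01`.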
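For the Soda steps, here is a sketch of `configuration.yml` for a BigQuery data source; the path under `include/soda/`, the `retail` data source name, and the placeholders are assumptions:

```yaml
# include/soda/configuration.yml -- a sketch; fill in the placeholders
data_source retail:
  type: bigquery
  connection:
    account_info_json_path: /usr/local/airflow/include/gcp/service_account.json
    auth_scopes:
      - https://www.googleapis.com/auth/bigquery
    project_id: <your_project_id>
    dataset: retail

soda_cloud:
  host: cloud.soda.io
  api_key_id: <your_api_key_id>
  api_key_secret: <your_api_key_secret>
```

A quality check file for the raw invoices could then look like this (the file name and column list are illustrative):

```yaml
# include/soda/checks/sources/raw_invoices.yml
checks for raw_invoices:
  - schema:
      fail:
        when required column missing:
          [InvoiceNo, StockCode, Quantity, InvoiceDate, UnitPrice]
```

The connection can be verified from inside the container with `soda test-connection -d retail -c include/soda/configuration.yml`.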
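For the Cosmos step, here is a sketch of mounting the dbt project as an Airflow task group, assuming a recent astronomer-cosmos release and the dbt project living in `include/dbt`; the profile and group names are assumptions:

```python
# Inside the DAG definition -- a sketch against the astronomer-cosmos 1.x API
from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig
from cosmos.profiles import GoogleCloudServiceAccountFileProfileMapping

profile_config = ProfileConfig(
    profile_name="retail",
    target_name="dev",
    # Reuses the same assumed "gcp" Airflow connection as the ingestion tasks
    profile_mapping=GoogleCloudServiceAccountFileProfileMapping(
        conn_id="gcp",
        profile_args={"dataset": "retail"},
    ),
)

transform = DbtTaskGroup(
    group_id="transform",
    project_config=ProjectConfig("/usr/local/airflow/include/dbt"),
    profile_config=profile_config,
)
```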
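Finally, for the quality-check tasks, here is a sketch of a helper that runs a Soda scan programmatically via Soda Core's Python API; the same function can back both the raw and the transformed checks (the file layout is an assumption):

```python
# include/soda/check_function.py -- a sketch using Soda Core's Python API
def run_soda_scan(project_root: str, scan_name: str, checks_subpath: str) -> int:
    from soda.scan import Scan

    scan = Scan()
    scan.set_scan_definition_name(scan_name)
    scan.set_data_source_name("retail")
    scan.add_configuration_yaml_file(f"{project_root}/soda/configuration.yml")
    scan.add_sodacl_yaml_files(f"{project_root}/soda/checks/{checks_subpath}")
    scan.set_verbose(True)

    # execute() returns a non-zero exit code when checks fail
    result = scan.execute()
    print(scan.get_logs_text())
    if scan.has_check_fails():
        raise ValueError(f"Soda scan failed: {scan.get_checks_fail_text()}")
    return result
```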
Note: The reports section is yet to be completed.
- In `include/dbt/models/report`, create the following report models:
```sql
-- daily_revenue.sql
-- Get daily revenue for the last 30 days
SELECT
  DATE(datetime) AS date,
  SUM(total) AS revenue
FROM {{ ref('fct_invoices') }}
WHERE datetime BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY), DAY)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
GROUP BY date
```

```sql
-- monthly_revenue.sql
-- Get monthly revenue for the last 12 months
SELECT
  EXTRACT(YEAR FROM datetime) AS year,
  EXTRACT(MONTH FROM datetime) AS month,
  SUM(total) AS revenue
FROM {{ ref('fct_invoices') }}
WHERE datetime BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 MONTH), MONTH)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MONTH)
GROUP BY year, month
```

```sql
-- country_revenue.sql
-- Get revenue by country
SELECT
  dc.iso,
  SUM(total) AS revenue
FROM {{ ref('fct_invoices') }} fi
INNER JOIN {{ ref('dim_customer') }} dc
  ON fi.customer_id = dc.customer_id
GROUP BY dc.iso
```
- Test the reports:

```bash
astro dev bash
dbt run --models report
```
🏆 First dbt reports in place!
- Install the Google Cloud SDK by adding the following to the `Dockerfile`:

```dockerfile
# Google Cloud SDK
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
    apt-get update -y && \
    apt-get install google-cloud-sdk -y
```
- Create a new file `include/viz/plotly.py`:

```python
import plotly.express as px
from google.cloud import bigquery

client = bigquery.Client()


def plot_country_revenue():
    # Revenue by country, read straight from the BigQuery tables
    query = """
        SELECT dc.iso, SUM(total) AS revenue
        FROM `airtube-390719.retail.fct_invoices` fi
        INNER JOIN `airtube-390719.retail.dim_customer` dc
          ON fi.customer_id = dc.customer_id
        GROUP BY dc.iso
    """
    df = client.query(query).to_dataframe()
    fig = px.bar(df, x='iso', y='revenue', title='Revenue by Country')
    fig.show()


if __name__ == '__main__':
    plot_country_revenue()
```
- Test the visualization:

```bash
astro dev bash
python include/viz/plotly.py
```
🏆 First visualization in place!
Congratulations! The data pipeline for the retail project is now fully set up. Here's what was accomplished:
- Dataset Acquisition: Obtained the dataset from Kaggle.
- Data Modeling: Defined a data model and transformed the raw data into structured tables using dbt.
- Data Quality: Implemented data quality checks to ensure the integrity of the data.
- Reports: Created SQL queries to generate daily, monthly, and country-wise revenue reports.
- Visualization: Developed a visualization to display revenue by country using Plotly.