This project assists IFCO's Data Team in the analysis of business data. For this purpose, two files have been provided:
- orders.csv (which contains factual information about the orders received)
- invoicing_data.json (which contains invoicing information)
For this exercise, you can only use Python, PySpark, or SQL (e.g. in dbt). Unit testing is essential for ensuring the reliability and correctness of your code. Please include appropriate unit tests for each task; a minimal sketch follows.
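As an illustration only (the `clean_company_name` helper and its cleaning rule are assumptions, not part of the provided files), a unit test for a small cleaning function could look like this:

```python
# tests/test_cleaning.py -- a minimal pytest sketch, assuming a hypothetical
# clean_company_name() helper under src/ that strips special characters.
import re

import pytest


def clean_company_name(raw: str) -> str:
    """Hypothetical cleaning helper: drop special characters, collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]", "", raw)
    return re.sub(r"\s+", " ", cleaned).strip()


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("Fresh Fruits Co.*", "Fresh Fruits Co"),
        ("  Healthy & Green!  ", "Healthy Green"),
    ],
)
def test_clean_company_name(raw, expected):
    assert clean_company_name(raw) == expected
```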
- Data Infrastructure
- Set up the Databricks runtime as the base image to process data using PySpark, Python, and SQL.
- Use Jupyter Notebook and Streamlit for visualization.
- Python and Java environment
- Supervisor for process management
- Data Ingestion
- Ingest the orders data (CSV) into a DataFrame. Clean the source data for ingestion, e.g. remove special characters.
- Ingest the invoices data (JSON) into a DataFrame and create an `invoices` view.
- Create the base views `orders` and `invoices` (see the ingestion sketch below).
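A minimal PySpark sketch of this ingestion step, assuming illustrative file paths, read options, and a simple cleaning rule on an assumed `company_name` column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ifco-ingestion").getOrCreate()

# Ingest the orders data (CSV); the path and read options are assumptions.
orders_df = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# Example cleaning rule: strip special characters from an assumed company_name column.
orders_df = orders_df.withColumn(
    "company_name", F.regexp_replace(F.col("company_name"), r"[^A-Za-z0-9 ]", "")
)

# Ingest the invoicing data (JSON), assuming it can be read as a flat document,
# and expose both DataFrames as the base views.
invoices_df = spark.read.json("data/invoicing_data.json")

orders_df.createOrReplaceTempView("orders")
invoices_df.createOrReplaceTempView("invoices")
```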
- Data Transform
- Transform the data, apply data validation rules, and create views for reuse (a sketch follows this list):
  - `ORDERS_VW`: base view with business rules applied.
  - `SALESOWNERS_VW`: de-normalized view of sales owners.
  - `SALES_OWNER_COMMISSION_VW`: sales owners commission view.
  - `ORDER_SALES_OWNER_VW`: de-normalized view of orders and sales owners, with one record per order and sales owner.
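The mechanics of layering such views on top of the base `orders` view might look like the sketch below; the column names, the validation filter, and the comma-separated `salesowners` format are assumptions for illustration, and the base `orders` view is assumed to exist from the ingestion step:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ifco-transform").getOrCreate()

# Base view with (illustrative) validation and business rules applied.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW ORDERS_VW AS
    SELECT order_id, company_id, company_name, crate_type, salesowners
    FROM orders
    WHERE order_id IS NOT NULL
""")

# De-normalised view: one record per order and sales owner, assuming the
# salesowners column is a comma-separated string of names.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW ORDER_SALES_OWNER_VW AS
    SELECT o.order_id,
           t.owner_position,
           trim(t.sales_owner) AS sales_owner
    FROM ORDERS_VW o
    LATERAL VIEW posexplode(split(o.salesowners, ',')) t AS owner_position, sales_owner
""")
```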
- Data Analytics
- Generate datasets for each use case using the above views:
  - Test 1: Distribution of Crate Type per Company
  - Test 2: DataFrame of Orders with Full Name of the Contact
  - Test 3: DataFrame of Orders with Contact Address
  - Test 4: Calculation of Sales Team Commissions
  - Test 5: DataFrame of Companies with Sales Owners
  - Test 6: Data visualization data sets
- Render the data visualizations using Streamlit (see the sketch below).
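As an example of the last two items, the Test 1 dataset (distribution of crate type per company) could be derived from `ORDERS_VW` and rendered in Streamlit roughly as follows; the view and column names carry over the assumptions from the sketches above:

```python
import streamlit as st
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ifco-analytics").getOrCreate()

# Test 1: distribution of crate type per company, built on ORDERS_VW
# (assumed to have been created already; see the transform sketch above).
crate_distribution = spark.sql("""
    SELECT company_name, crate_type, COUNT(*) AS order_count
    FROM ORDERS_VW
    GROUP BY company_name, crate_type
    ORDER BY company_name, crate_type
""")

# Streamlit works with pandas, so convert the (small) aggregated result.
pdf = crate_distribution.toPandas()

st.title("Distribution of Crate Type per Company")
st.dataframe(pdf)
st.bar_chart(
    pdf.pivot_table(index="company_name", columns="crate_type",
                    values="order_count", fill_value=0)
)
```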
### Dependencies
- Ubuntu 24.04.1 LTS, or WSL 2 (Windows Subsystem for Linux) on Windows
- Java: `openjdk-11-jdk`, with `JAVA_HOME=/usr/lib/jvm/java-1.11.0-openjdk-amd64`
- Set `PYTHONPATH=$(pwd)/src:$PYTHONPATH` (to execute the files under `/src`).
**Core dependencies**
- pandas
- pyspark
- urllib3
- streamlit
**Jupyter dependencies**
- notebook
- jupyterlab
- ipykernel
- ipywidgets
- matplotlib
**Additional dependencies**
- requests
- Install Docker on Windows
- Install Docker Desktop for Windows.
- Ensure WSL 2 (Windows Subsystem for Linux) is enabled, as it's recommended for running Linux containers.
- Pull the image from the registry (Docker Hub):
  `docker pull mooney042/ifco-data-app:latest`
- Run the image on Windows. A Linux-based image runs fine on Windows as long as Docker is set to run Linux containers (the default with WSL 2):
  `docker run -it --rm -p 8888:8888 -p 8501:8501 mooney042/ifco-data-app:latest`
- Access Jupyter and Streamlit by navigating to:
- Jupyter Notebook: http://localhost:8888
- Streamlit Dashboard: http://localhost:8501
- Install Docker on Windows
- Install Docker Desktop for Windows.
- Ensure WSL 2 (Windows Subsystem for Linux) is enabled, as it's recommended for running Linux containers.
- Check out or download all the files from the Git repository to your local project folder, and navigate to the project root folder in the terminal.
- Execute the Docker build process:
  - If the Dockerfile is in a different directory than the current project directory, specify its location using the `-f` flag.
  - By default, Docker looks for a Dockerfile in the current directory (`.`); in the example below, the Dockerfile is inside the `Docker/` subdirectory:
    `docker build -t <my_image_name> -f Docker/Dockerfile .`
- Run the image. A Linux-based image runs fine on Windows as long as Docker is set to run Linux containers (the default with WSL 2):
  `docker run -it --rm -p 8888:8888 -p 8501:8501 <my_image_name>`
- Access Jupyter and Streamlit by navigating to:
- Jupyter Notebook: http://localhost:8888
- Streamlit Dashboard: http://localhost:8501
- For the current application requirements, both source data sets (orders and invoices) are static files.
- No file format changes are expected for the source files: the `orders` data is in CSV format and the `invoices` data is in JSON format.
- No schema changes are expected for either source data file.
- There is no requirement to persist the data for future use.
- Windows with WSL 2 enabled is used for running Docker.
- Look for tracebacks or errors in the output:
  `docker logs <container_id>`
- Restart the container:
  `docker stop <container_id> && docker rm <container_id>`
  `docker run -d <your-image>`
- Verify running processes inside the container by opening an interactive shell:
  `docker ps` (to get the container id)
  `docker exec -it <container_id> /bin/bash`
- Manually start Jupyter & Streamlit:
  `jupyter notebook --allow-root --ip=0.0.0.0 --port=8888`
  `streamlit run /path/to/app.py --server.port 8501 --server.address 0.0.0.0`
- Verify that supervisor is working in the container:
  `supervisorctl status`
  `supervisorctl restart all`
- Ingest the source data from a real-time API, a database, or an automated process for production use.
- Persist the data in a database for reusability, making it available across teams and geographies.
- Use `docker-compose.yml` to deal with persisting data, multiple services, shared volumes, environment variables, or networking.
- Use a data visualization tool such as Tableau for more dashboard capabilities once the data is persisted in a database.
- Change the base image to an optimized Databricks image for better performance.
- Create purpose-built dimensional data models, such as facts and dimensions or de-normalized tables, for better query and report performance.
- Dockerfile
- [Jupyter Notebook Git Link](https://github.com/Lenin-Subramonian/data-engineering-ifco_test/blob/main/notebooks/IFCO_Data_Analysis.ipynb)