The goal of this project is to implement a Data Lakehouse in the Azure cloud, built on the Medallion architecture.
Here's a summary of the data pipeline stages described:
- Stage 1 - Extract data from the local SQL Server using Data Factory and ingest it into Data Lake Gen 2 in Parquet format, in the Bronze layer.
- Stage 2 - Use Databricks to read and clean the data from the Data Lake, formatting columns and types. The cleaned and transformed data is then stored in the Silver layer.
- Stage 3 - The Silver-to-Gold transition makes minor adjustments to the data, such as standardizing column names and performing business aggregations.
- Stage 4 - Create views in Synapse to consume the processed data directly from the Gold layer of the Lakehouse, ensuring reliable data and high performance.
This structured approach ensures data integrity and efficient processing, and it supports analytics and reporting needs effectively across the Bronze, Silver, and Gold layers of the Lakehouse architecture.
This concept combines the best features of a Data Lake and a Data Warehouse, integrating organization, flexibility, and performance into a single platform. Additionally, it provides governance and control over data and processes.
The issue that Lakehouse resolves: Imagine a company with a data lake receiving continuous data and multiple applications consuming these data streams, such as a Machine Learning application or a Data Warehouse.
During data ingestion or transformation, processing errors or data inconsistencies can occur.
In the traditional model, that inconsistent data is passed along to consumers, because there is no transaction control.
With the Lakehouse and ACID transactions, only committed, ready-to-use data is exposed. This ensures accuracy and reliability, transforming how your company manages and uses data.
Layered Data Processing: Organizes data processing into distinct Bronze, Silver, and Gold layers:
- Bronze Layer: Stores raw data directly from sources while preserving its original integrity, serving as the initial landing zone for incoming data.
- Silver Layer: Transforms and cleanses data to improve quality and standardization, preparing it for intermediate analytics and processing.
- Gold Layer: Highly refines and optimizes data for fast, precise analytical queries, often involving aggregation and summarization for efficient querying and reporting.
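To make the layer layout concrete, here is a minimal sketch of how the three layers can be addressed from a Databricks notebook. The storage account and container names are assumptions, not the project's actual values, and the built-in `spark` session is assumed, as in any Databricks notebook.

```python
# Assumed ADLS Gen2 storage account and container names -- adjust to your environment.
STORAGE_ACCOUNT = "mydatalake"

BRONZE = f"abfss://bronze@{STORAGE_ACCOUNT}.dfs.core.windows.net"  # raw data, as ingested (Parquet)
SILVER = f"abfss://silver@{STORAGE_ACCOUNT}.dfs.core.windows.net"  # cleaned and standardized (Delta)
GOLD = f"abfss://gold@{STORAGE_ACCOUNT}.dfs.core.windows.net"      # aggregated, analytics-ready (Delta)

# General promotion pattern: read from one layer, transform, write to the next.
raw_df = spark.read.parquet(f"{BRONZE}/SalesLT/Customer")
raw_df.dropDuplicates().write.format("delta").mode("overwrite").save(f"{SILVER}/SalesLT/Customer")
```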
How do we maintain security when connecting to the source and extracting data? It is crucial to manage user credentials securely so that sensitive information, such as the connection details for your transactional server, is never exposed or left unprotected.
Using Azure Key Vault, we register the access credentials for the transactional server, such as the username and password, only once and in a single location.
This also simplifies integration with other cloud services, while centrally ensuring security, monitoring access, and safeguarding sensitive data.
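To illustrate how a consumer of those credentials picks them up without hard-coding anything, here is a minimal sketch from the Databricks side, assuming a Key Vault-backed secret scope. The scope and key names are made up for the example; in the pipeline itself, Data Factory reads Key Vault through a linked service.

```python
# Minimal sketch: reading the SQL Server credentials from a Key Vault-backed
# Databricks secret scope. The scope and key names below are assumptions.
sql_user = dbutils.secrets.get(scope="kv-scope", key="sql-user")
sql_password = dbutils.secrets.get(scope="kv-scope", key="sql-password")

# The values are redacted in notebook output, so they never appear in plain text,
# and they can be passed to a JDBC connection or any other service that needs them.
```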
Now, with the secure connection between the source and the cloud established, let's move on to the next step.
How do we automate extracting the various tables and loading them into the Data Lake?
Using Data Factory, I parameterized a Lookup activity to run a query against the local database's metadata and retrieve the tables in the SalesLT schema.
The resulting table list is stored in a Data Factory variable and passed as a parameter to the next step: a ForEach activity.
Within this loop, each database table is read using that list as a parameter and loaded into the Bronze layer, organized so that each table lands in its own folder.
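The pipeline itself is built with Data Factory activities, but purely to illustrate the logic, here is a rough PySpark equivalent of the Lookup + ForEach pattern. The server, credentials, secret names, and paths are assumptions, and `spark`/`dbutils` are the usual Databricks notebook globals.

```python
# Rough PySpark equivalent of the Lookup + ForEach logic (the real pipeline
# uses Data Factory activities). Server, credentials, and paths are assumptions.
jdbc_url = "jdbc:sqlserver://<your-server>:1433;databaseName=AdventureWorksLT"
jdbc_props = {
    "user": dbutils.secrets.get(scope="kv-scope", key="sql-user"),
    "password": dbutils.secrets.get(scope="kv-scope", key="sql-password"),
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# "Lookup": list every table in the SalesLT schema from the database metadata.
tables_query = (
    "(SELECT TABLE_SCHEMA, TABLE_NAME "
    "FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_SCHEMA = 'SalesLT' AND TABLE_TYPE = 'BASE TABLE') AS t"
)
tables = spark.read.jdbc(url=jdbc_url, table=tables_query, properties=jdbc_props).collect()

# "ForEach": copy each table into its own folder in the Bronze layer, as Parquet.
bronze_root = "abfss://bronze@mydatalake.dfs.core.windows.net"  # assumed container/account
for row in tables:
    df = spark.read.jdbc(url=jdbc_url, table=f"{row.TABLE_SCHEMA}.{row.TABLE_NAME}", properties=jdbc_props)
    df.write.mode("overwrite").parquet(f"{bronze_root}/{row.TABLE_SCHEMA}/{row.TABLE_NAME}")
```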
After that, I configured Data Factory to execute the Databricks notebooks in sequence, applying the transformation logic and moving the data between layers.
Here, with this excellent tool, we see the Data Lakehouse in action: data is read and stored on top of the Delta framework, which is composed of Parquet data files plus a JSON transaction log, providing ACID transactions and features such as Time Travel.
Time Travel allows you to view and restore previous versions of a table and to query how the data has changed between versions.
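A short sketch of what Time Travel looks like from a notebook; the table path here is an assumption.

```python
# Delta Time Travel sketch: the table path is an assumption.
silver_customer = "abfss://silver@mydatalake.dfs.core.windows.net/SalesLT/Customer"

# Inspect the transaction log: every write creates a new, queryable version.
spark.sql(f"DESCRIBE HISTORY delta.`{silver_customer}`").show(truncate=False)

# Read the table as it was at an earlier version (or use the timestampAsOf option).
previous = (spark.read.format("delta")
            .option("versionAsOf", 0)
            .load(silver_customer))

# Compare versions to see what changed.
current = spark.read.format("delta").load(silver_customer)
current.exceptAll(previous).show()  # rows added or changed since version 0
```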
In this part, some data transformations and typecasting are performed. Refer to the file in the Databricks folder to view the complete code.
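As a hedged illustration of the kind of typecasting involved (the project's actual logic is in the notebooks), here is a minimal Bronze-to-Silver sketch; the paths and column names are assumptions.

```python
from pyspark.sql.functions import col, to_date

# Bronze -> Silver sketch: paths and column names are assumptions, not the
# project's actual code (see the notebooks in the Databricks folder for that).
bronze_path = "abfss://bronze@mydatalake.dfs.core.windows.net/SalesLT/Customer"
silver_path = "abfss://silver@mydatalake.dfs.core.windows.net/SalesLT/Customer"

df = spark.read.parquet(bronze_path)

# Typecasting and light cleaning: fix the types and trim the timestamp to a date.
df_clean = (df
            .withColumn("CustomerID", col("CustomerID").cast("int"))
            .withColumn("ModifiedDate", to_date(col("ModifiedDate"))))

# Store the cleaned data as a Delta table in the Silver layer.
df_clean.write.format("delta").mode("overwrite").save(silver_path)
```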
After the data is processed in each layer, the treated data is saved to its respective folder in Data Lake Gen 2.
Check the code and transformations performed in the notebooks inside the Databricks folder.
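For the Silver-to-Gold step described earlier, here is a similar hedged sketch of standardizing column names and building a business aggregate; the tables, columns, and paths are assumptions.

```python
from pyspark.sql.functions import sum as _sum

# Silver -> Gold sketch: standardize column names and build a business aggregate.
# Paths, tables, and columns are assumptions; the real logic lives in the notebooks.
silver_orders = "abfss://silver@mydatalake.dfs.core.windows.net/SalesLT/SalesOrderHeader"
gold_path = "abfss://gold@mydatalake.dfs.core.windows.net/sales_by_customer"

orders = spark.read.format("delta").load(silver_orders)

# Standardize column names (e.g. CamelCase -> snake_case).
orders = (orders
          .withColumnRenamed("CustomerID", "customer_id")
          .withColumnRenamed("TotalDue", "total_due"))

# Business aggregation: total sales per customer.
sales_by_customer = (orders
                     .groupBy("customer_id")
                     .agg(_sum("total_due").alias("total_sales")))

sales_by_customer.write.format("delta").mode("overwrite").save(gold_path)
```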
Within Azure Synapse Analytics, I set up a serverless SQL pool to be used for consuming the data in the Gold layer.
How do we consume the data?
Views were created from the final tables in the Gold layer, which are in Delta format.
- Data Governance: Views ensure that data is governed in a centralized and controlled manner.
- Single Point of Consumption: All data is consumed from a single location, avoiding redundancy and inconsistency.
- Storage Efficiency: Views make storage more efficient, since there is no need to duplicate the data for other applications.
- Flexibility and Cost Savings: Views provide flexible data access and reduce cost, because the data does not have to be replicated for each application that consumes it.
The main benefit is that using views allows for flexibility and cost savings while maintaining data efficiency and integrity.
Now, the data is ready to be consumed by BI applications, data analyst teams, data scientist teams, and users for ad hoc queries.
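As one example of ad hoc consumption, here is a small sketch that queries a Gold view through the Synapse serverless SQL endpoint from Python; the endpoint, database, and view names are assumptions.

```python
import pyodbc

# Ad hoc query against a Gold view exposed by the Synapse serverless SQL pool.
# Endpoint, database, and view names are assumptions -- adjust to your workspace.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-workspace>-ondemand.sql.azuresynapse.net;"
    "Database=gold_db;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.sales_by_customer ORDER BY total_sales DESC")
for row in cursor.fetchall():
    print(row)
```

BI tools such as Power BI connect to the same serverless SQL endpoint, so the views remain the single point of consumption for every downstream consumer.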
If you've read this far, thank you very much! This project allowed me to understand how the Data Lakehouse architecture works in practice, enabling me to apply it in different scenarios.