Commit ab4a840: add new demo for fabric integration

rgward committed Apr 29, 2024
1 parent 5ccdd03 commit ab4a840
Showing 8 changed files with 305 additions and 0 deletions.
20 changes: 20 additions & 0 deletions demos/sqlserver2022/microsoftfabricsqlserver2022/archivetodatalake/archive_table.sql
@@ -0,0 +1,20 @@
USE SalesDB;
GO
-- Switch partition 1 into the Sales Archive table
--
ALTER TABLE Sales
SWITCH PARTITION 1 TO SalesArchive;
GO
SELECT * FROM SalesArchive;
GO
-- Create external table to export to Parquet from SalesArchive
--
CREATE EXTERNAL TABLE SalesArchiveSept2022
WITH (
LOCATION = '/salessept2022',
DATA_SOURCE = bwdatalake,
FILE_FORMAT = ParquetFileFormat
)
AS
SELECT * FROM SalesArchive;
GO
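-- Optional check (a sketch): the exported external table should return the
-- same rows as the archive table it was created from
SELECT (SELECT COUNT(*) FROM SalesArchive) AS archive_rows,
       (SELECT COUNT(*) FROM SalesArchiveSept2022) AS exported_rows;
GO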
23 changes: 23 additions & 0 deletions demos/sqlserver2022/microsoftfabricsqlserver2022/archivetodatalake/ddl_datalake.sql
@@ -0,0 +1,23 @@
USE SalesDB;
GO
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE name = '##MS_DatabaseMasterKey##')
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongPass0wrd!';
GO
IF EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'bwdatalake')
DROP EXTERNAL DATA SOURCE bwdatalake;
GO
IF EXISTS (SELECT * FROM sys.database_scoped_credentials WHERE name = 'bwdatalake_creds')
DROP DATABASE SCOPED CREDENTIAL bwdatalake_creds;
GO
CREATE DATABASE SCOPED CREDENTIAL bwdatalake_creds
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<SAS Token>';
GO
CREATE EXTERNAL DATA SOURCE bwdatalake
WITH
(
LOCATION = 'abs://[email protected]'
,CREDENTIAL = bwdatalake_creds
);
GO
IF EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'ParquetFileFormat')
DROP EXTERNAL FILE FORMAT ParquetFileFormat;
CREATE EXTERNAL FILE FORMAT ParquetFileFormat WITH (FORMAT_TYPE = PARQUET);
GO
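-- Optional check (a sketch): confirm the external data source and file
-- format were created with the expected names
SELECT name, location FROM sys.external_data_sources WHERE name = 'bwdatalake';
SELECT name, format_type FROM sys.external_file_formats WHERE name = 'ParquetFileFormat';
GO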
8 changes: 8 additions & 0 deletions demos/sqlserver2022/microsoftfabricsqlserver2022/archivetodatalake/enablepolybase.sql
@@ -0,0 +1,8 @@
EXEC sp_configure 'polybase enabled', 1;
GO
RECONFIGURE;
GO
EXEC sp_configure 'allow polybase export', 1;
GO
RECONFIGURE;
GO
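-- Optional check (a sketch): both options should show value_in_use = 1
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('polybase enabled', 'allow polybase export');
GO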
18 changes: 18 additions & 0 deletions demos/sqlserver2022/microsoftfabricsqlserver2022/archivetodatalake/getarchivesept2022.sql
@@ -0,0 +1,18 @@
-- Query archive table
--
USE SalesDB;
GO
SELECT * FROM SalesArchiveSept2022;
GO
-- Combine existing sales with Archive
--
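-- (The switched-out rows no longer exist in Sales, so the two sets are
-- disjoint; UNION ALL would also work here and avoids the duplicate-removal sort.)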
SELECT * FROM Sales
UNION
SELECT * FROM SalesArchiveSept2022
ORDER BY sales_dt;
GO

-- Optionally truncate SalesArchive
--
TRUNCATE TABLE SalesArchive;
GO
19 changes: 19 additions & 0 deletions demos/sqlserver2022/microsoftfabricsqlserver2022/archivetodatalake/readme.md
@@ -0,0 +1,19 @@
# Demo to use Data Virtualization to archive partitioned tables into a data lake

This is a demo that uses SQL Server 2022 data virtualization to archive partitions of "cold" data from a table into a data lake while still being able to query them like a table.

## Setup

1. Install SQL Server 2022. You must enable the PolyBase feature during setup.
2. Using your Azure subscription, create an Azure Storage account using these steps: https://learn.microsoft.com/azure/storage/blobs/create-data-lake-storage-account.
3. Create a container for the storage account using these steps: https://learn.microsoft.com/azure/storage/blobs/blob-containers-portal. Note that you can leave access as Private.
4. Execute the script enablepolybase.sql.
5. Read through the details in salesddl.sql and execute all the T-SQL in the script. You now have a database with a Sales table partitioned by date ranges.
6. Create a Shared Access Signature (SAS) for the Azure Storage account. For tips on creating this and setting the right access, see the doc page at https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql and look at the section for the argument titled CREDENTIAL = credential_name.
7. Edit the script ddl_datalake.sql to put in your proper storage account name, container, and Shared Access Signature. Execute the script to set up the external data source and file format.

## Archive cold data to a data lake

1. Execute the script archive_table.sql to move a partition to the archive table and then export the archive table to the data lake.
2. Check the Azure Portal for your container to make sure the new folder and parquet file exist (a server-side check is also sketched below).
3. Execute the script getarchivesept2022.sql to query the archived files through the external table, union them with the existing Sales table, and truncate the archive table.
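If you prefer a quick server-side check after the export, this query (a sketch using the object names from the scripts) confirms the external table exists and where it points:

```sql
SELECT name, location
FROM sys.external_tables
WHERE name = 'SalesArchiveSept2022';
```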
119 changes: 119 additions & 0 deletions demos/sqlserver2022/microsoftfabricsqlserver2022/archivetodatalake/salesddl.sql
@@ -0,0 +1,119 @@
USE master;
GO
DROP DATABASE IF EXISTS SalesDB;
GO
CREATE DATABASE SalesDB;
GO
USE SalesDB;
GO
DROP TABLE IF EXISTS Sales;
GO
IF EXISTS (SELECT * FROM sys.partition_schemes WHERE name = 'myRangePS')
DROP PARTITION SCHEME myRangePS;
GO
IF EXISTS (SELECT * FROM sys.partition_functions WHERE name = 'myRangePF')
DROP PARTITION FUNCTION myRangePF;
GO
CREATE PARTITION FUNCTION myRangePF (date)
AS RANGE RIGHT FOR VALUES ('20221001', '20221101', '20221201');
GO

CREATE PARTITION SCHEME myRangePS
AS PARTITION myRangePF
ALL TO ('PRIMARY');
GO
CREATE TABLE Sales (
    salesid int identity not null, customer varchar(50) not null, sales_dt date not null,
    salesperson varchar(50) not null, sales_amount bigint not null,
    CONSTRAINT PKSales PRIMARY KEY CLUSTERED (sales_dt, salesid))
ON myRangePS (sales_dt);
GO

-- Insert five months of data across the table's 4 partitions
--
-- Insert data for September 2022
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_1', '20220901', 'SalesPerson1', 500);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_2', '20220902', 'SalesPerson1', 200);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_3', '20220903', 'SalesPerson1', 500);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_4', '20220904', 'SalesPerson2', 100);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_5', '20220905', 'SalesPerson2', 100);

--
-- Insert data for October 2022
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_1', '20221001', 'SalesPerson1', 100);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_2', '20221002', 'SalesPerson1', 200);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_3', '20221003', 'SalesPerson1', 300);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_4', '20221004', 'SalesPerson2', 400);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_5', '20221005', 'SalesPerson2', 500);
--
-- Insert data for November 2022
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_1', '20221101', 'SalesPerson1', 100);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_2', '20221102', 'SalesPerson1', 200);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_3', '20221103', 'SalesPerson1', 300);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_4', '20221104', 'SalesPerson2', 400);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_5', '20221105', 'SalesPerson2', 500);

--
-- Insert data for December 2022
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_1', '20221201', 'SalesPerson1', 100);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_2', '20221202', 'SalesPerson1', 200);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_3', '20221203', 'SalesPerson1', 300);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_4', '20221204', 'SalesPerson2', 400);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_5', '20221205', 'SalesPerson2', 500);

--
-- Insert data for January 2023
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_1', '20230101', 'SalesPerson1', 100);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_2', '20230102', 'SalesPerson1', 200);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_3', '20230103', 'SalesPerson1', 300);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_4', '20230104', 'SalesPerson2', 400);
INSERT INTO Sales (customer, sales_dt, salesperson, sales_amount)
VALUES ('Customer_5', '20230105', 'SalesPerson2', 500);

-- Check partitions
--
SELECT
p.partition_number AS [Partition],
fg.name AS [Filegroup],
p.Rows
FROM sys.partitions p
INNER JOIN sys.allocation_units au
ON au.container_id = p.hobt_id
INNER JOIN sys.filegroups fg
ON fg.data_space_id = au.data_space_id
WHERE p.object_id = OBJECT_ID('Sales');
GO
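-- Optional check (a sketch): confirm which partition a given date maps to;
-- September 2022 dates should land in partition 1 under this RANGE RIGHT function
SELECT $PARTITION.myRangePF('20220915') AS partition_number;
GO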

-- Create an archive table that is empty
--
DROP TABLE IF EXISTS SalesArchive;
GO
CREATE TABLE SalesArchive (
    salesid int identity not null, customer varchar(50) not null, sales_dt date not null,
    salesperson varchar(50) not null, sales_amount bigint not null,
    CONSTRAINT PKSalesArchive PRIMARY KEY CLUSTERED (sales_dt, salesid));
GO
6 changes: 6 additions & 0 deletions demos/sqlserver2022/microsoftfabricsqlserver2022/customersentimentsept2022.csv
@@ -0,0 +1,6 @@
customer,productsentiment,salespersonsentiment,surveydate
Customer_1,Excellent,Excellent,9/1/22 12:00 AM
Customer_2,Negative,Good,9/1/22 12:00 AM
Customer_3,Good,Excellent,9/1/22 12:00 AM
Customer_4,Good,Negative,9/1/22 12:00 AM
Customer_5,Good,Negative,9/1/22 12:00 AM
92 changes: 92 additions & 0 deletions demos/sqlserver2022/microsoftfabricsqlserver2022/readme.md
@@ -0,0 +1,92 @@
# Integrating SQL Server 2022 and Microsoft Fabric

In this example you will learn how to:

- Archive "cold" sales data into a data lake on Azure Storage using the data virtualization capabilities in SQL Server 2022.
- Query the archived data in Azure Storage just as if it were a SQL Server table.
- Integrate the archived sales data into a Microsoft Fabric Lakehouse.
- Analyze sales trends together with customer sentiment using Microsoft Fabric Lakehouse data and Power BI visualization.

> **Note:** You should also be able to demonstrate this example using Azure SQL Managed Instance, since it has the same data virtualization capabilities as SQL Server 2022.

## Prerequisites

- You have installed SQL Server 2022 and enabled the PolyBase Query Service for External Data feature during setup.
- Download SQL Server Management Studio (SSMS) from <https://aka.ms/ssms19> to run on your client machine.
- You have access to a Premium Power BI workspace to use Microsoft Fabric.
- You have installed the OneLake File Explorer add-on for Windows.

## Archive and access data with SQL Server 2022 and data virtualization

Follow the instructions in the **archivetodatalake/readme.md** file to set up, archive, and access data with data virtualization in SQL Server 2022.

## Create a shortcut for archived data from Azure Storage

1. Create a new Lakehouse in your Microsoft Fabric workspace.
1. In the Lakehouse explorer, create a new shortcut under Files.
1. Select Azure Data Lake Storage Gen 2 as the external source.
1. Use the dfs Primary Endpoint for the ADLS storage account (you can find this under the JSON View of the storage account). For example, from the archivetodatalake demo my dfs endpoint is https://bwdatalakestorage.dfs.core.windows.net. Put in the full SAS token from the storage account under Connection Credentials.
1. Give the shortcut a name.
1. For the subpath, put in the container name.
1. Use the Lakehouse explorer to drill in and verify the parquet file can be seen.

## Verify the data using a Notebook in Lakehouse Explorer

Use a Notebook in the Lakehouse Explorer to verify the archived sales data.

1. Select Open notebook/New notebook.
1. Paste in the following PySpark code to query the data in the first cell.

```python
df = spark.read.parquet("<file path>")
display(df)
```

1. In the Lakehouse explorer select "..." next to the parquet file and select **Copy relative path for Spark**.
1. Paste the copied path into `"<file path>"` (leave the quotes).
1. Select **Run all** at the top of the screen.
1. After a few seconds the results of your data should appear. You can now verify this is valid sales data.
1. Select Stop Session at the top of the menu.
1. You can optionally select the Save button to save the notebook.

## Load the archived sales data as a table in the Lakehouse

1. Select the Lakehouse you created on the left-hand menu to go back to the full Lakehouse explorer view.
1. Drill into the shortcut from Files and select "..." next to the parquet file. Select **Load to Delta table**. Put in a table name of salessept2022. You will see a **Loading table in progress** message. This will take a few seconds until it says **Table successfully loaded**.
1. In the Lakehouse explorer under Tables you will see the new table. If you click on the table you will see the sales data. You can also check the table with T-SQL, as in the sketch below.
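As an optional check (a sketch, assuming the default `salessept2022` table name used above), you can query the new table from the Lakehouse SQL endpoint:

```sql
-- Run against the Lakehouse SQL endpoint (read-only T-SQL)
SELECT TOP 5 customer, sales_dt, salesperson, sales_amount
FROM salessept2022
ORDER BY sales_dt;
```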

## Upload customer sentiment data into the Microsoft Fabric Lakehouse

1. Using the OneLake File Explorer add-on, copy the provided customersentimentsept2022.csv file into your `<Lakehouse name>`.Lakehouse\Files folder.
1. In the Lakehouse explorer you should be able to click on Files and see that the .csv file now exists.
1. Select the "..." next to the file name and select **Load to Delta table**. Use the default name provided. You should get a **Loading table in progress** message and eventually a **Table successfully loaded** message.
1. You can now see the new table in the Tables view in the Lakehouse explorer. If you click the table you will see the data in a table format.

## Create a relationship between the data

Let's make sure the two tables are known to have a relationship based on the customer column.

1. At the top right corner of the Lakehouse explorer screen select the Lakehouse dropdown and select SQL Endpoint.
1. At the bottom of the screen select Model.
1. You will now see a visual of the two tables.
1. Drag the customer field from the customersentimentsept2022 table onto the customer field of salessept2022.
1. In the Create Relationship screen select Confirm.
1. At the top right of the screen select the SQL Endpoint drop-down and select Lakehouse. (A T-SQL sketch of the equivalent join is shown below.)
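The relationship mirrors a simple join on the customer column. If you want to see the combined data directly, here is a sketch you could run on the SQL endpoint (table names assume the defaults used above):

```sql
-- Total sales per customer alongside the survey sentiment
SELECT s.customer,
       SUM(s.sales_amount) AS total_sales,
       c.productsentiment,
       c.salespersonsentiment
FROM salessept2022 AS s
JOIN customersentimentsept2022 AS c
    ON c.customer = s.customer
GROUP BY s.customer, c.productsentiment, c.salespersonsentiment;
```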

## Analyze customer sentiment data with archived sales data

Let's see if we can visualize any trends between customer sentiment captured by surveys and the sales data.

1. Select New Power BI dataset at the top menu.
1. Click Select All and click Confirm.
1. You will be presented with a Synapse Data Engineering page.
1. In the middle of the page select **+ Create from scratch** on the Visualize this data part of the page. Select Auto-create.
1. When the **Your report is ready** message pops up, select View report.
1. On the right-hand side of the screen is a view of the columns of the tables. Expand both and unselect any columns that are selected.
1. We want to see relationships between sentiment for both product and salesperson and the sales amounts.
1. For the salessept2022 table select the customer, sales_amount (use the Sum option), and salesperson fields. For customersentimentsept2022 select customer, productsentiment, and salespersonsentiment.
1. You can now analyze any trend data. There are 6 visuals to view.
1. In the upper left view we can see Customer_1 and Customer_3 have the highest sales. Click on Customer_1. You can see Customer_1 has SalesPerson1 and sentiment is Excellent across the board.
1. Customer_3 also belongs to SalesPerson1 and has Good product sentiment but Excellent salesperson sentiment.
1. Customer_2 has lower sales and Negative product sentiment but Good salesperson sentiment, and also belongs to SalesPerson1.
1. Looking at Customer_4 you can see SalesPerson2 is assigned, and even though the product sentiment is Good the salesperson sentiment is Negative.
1. Customer_5 also has lower sales with the same trend.
