Commit fd1ad22 (merge of parents 3a92779 and 61a1a6a)

36 files changed

Lines changed: 945 additions & 1888 deletions

samples/databases/wide-world-importers/wwi-app/wwwroot/lib/bootstrap/Gemfile.lock

Lines changed: 2 additions & 2 deletions
```diff
@@ -3,8 +3,8 @@ GEM
   specs:
     addressable (2.4.0)
     colorator (0.1)
-    ffi (1.9.14-x64-mingw32)
-    jekyll (3.1.6)
+    ffi (1.9.24-x64-mingw32)
+    jekyll (3.6.3)
       colorator (~> 0.1)
       jekyll-sass-converter (~> 1.0)
       jekyll-watch (~> 1.1)
```

samples/databases/wide-world-importers/wwi-ssdt/wwi-ssdt/PostDeploymentScripts/Script.PostDeployment1.sql

Lines changed: 1 addition & 0 deletions
```diff
@@ -172,6 +172,7 @@ EXEC DataLoadSimulation.DailyProcessToCreateHistory
 GO

 :r .\pds400-ins-unkown-orderline.sql
+:r .\pds410-update-archive-tables.sql

 /*
 There is one other stored procedure you may find useful:
```
samples/databases/wide-world-importers/wwi-ssdt/wwi-ssdt/PostDeploymentScripts/pds410-update-archive-tables.sql

Lines changed: 21 additions & 0 deletions

```sql
-- NOTE: This script should be moved to the MakeTemporalChanges procedure, but it currently doesn't work there.
-- jovanpop: creating a separate file here.
-- @TODO: Investigate how to move it there.

PRINT N'Updating StockItems history...'
GO
EXEC DataloadSimulation.DeactivatetemporalTablesBeforeDataLoad;
GO
-- Age each archived row relative to how long ago it was valid: prices drift
-- down a few percent per year, while quantities and lead times drift up.
-- DATEDIFF(DAY, ...)/365 is integer division, so the factor changes in
-- whole-year steps.
UPDATE Warehouse.StockItems_Archive
SET UnitPrice = s.UnitPrice * (1 - .05 * (DATEDIFF(DAY, sa.ValidFrom, GETDATE()) / 365)),
    RecommendedRetailPrice = s.RecommendedRetailPrice * (1 - .03 * (DATEDIFF(DAY, sa.ValidFrom, GETDATE()) / 365)),
    TaxRate = s.TaxRate * (1 + .02 * (DATEDIFF(DAY, sa.ValidFrom, GETDATE()) / 365)),
    QuantityPerOuter = CEILING(s.QuantityPerOuter * (1 + .05 * (DATEDIFF(DAY, sa.ValidFrom, GETDATE()) / 365))),
    LeadTimeDays = CEILING(s.LeadTimeDays * (1 + .03 * (DATEDIFF(DAY, sa.ValidFrom, GETDATE()) / 365))),
    TypicalWeightPerUnit = CEILING(s.TypicalWeightPerUnit * (1 + .02 * (DATEDIFF(DAY, sa.ValidFrom, GETDATE()) / 365)))
FROM Warehouse.StockItems_Archive sa
JOIN Warehouse.StockItems s
    ON sa.StockItemID = s.StockItemID;
GO
EXEC DataloadSimulation.ReActivatetemporalTablesAfterDataLoad;
GO
```

samples/databases/wide-world-importers/wwi-ssdt/wwi-ssdt/WideWorldImporters.sqlproj

Lines changed: 1 addition & 0 deletions
```diff
@@ -844,5 +844,6 @@
     <None Include="PostDeploymentScripts\pds105-ins-dls-ficticiousnamepool.sql" />
     <None Include="PostDeploymentScripts\pds106-ins-dls-areacode.sql" />
     <None Include="PostDeploymentScripts\pds400-ins-unkown-orderline.sql" />
+    <None Include="PostDeploymentScripts\pds410-update-archive-tables.sql" />
   </ItemGroup>
 </Project>
```
Lines changed: 44 additions & 0 deletions
# SQL Server big data clusters

## Prerequisites

1. Kubernetes cluster configuration & kubectl command-line utility
2. curl utility
3. sqlcmd utility
4. bcp utility
5. Azure Data Studio or SQL Server Management Studio
6. SQL Server 2019 big data cluster

Installation instructions for a SQL Server 2019 big data cluster can be found [here](https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-guidance?view=sql-server-2017).

## Samples Setup

**Before you begin**, download the sample database [backup file](https://sqlchoice.blob.core.windows.net/sqlchoice/static/tpcxbb_1gb.bak) and save it locally. Then run the CMD script *bootstrap-sample-db.cmd* or the shell script *bootstrap-sample-db.sh*, depending on your platform. The script restores the database on the SQL Master instance, executes the *bootstrap-sample-db.sql* script to create the database objects needed, exports the web_clickstreams & inventory tables to CSV files, and uploads the web_clickstreams CSV file to HDFS inside the SQL Server 2019 big data cluster.
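For example, on Linux the bootstrap invocation might look like this (all values below are placeholders for your own cluster; the final Knox password argument is optional and defaults to the SA password):

```bash
# Placeholder values - substitute your own namespace, IPs, SA password, and
# the directory that contains tpcxbb_1gb.bak.
./bootstrap-sample-db.sh mssql-cluster 203.0.113.10 'MyS@P@ssw0rd' ~/Downloads 203.0.113.11
```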
__[data-pool](data-pool/)__

### Data ingestion using Spark
Connect to the master instance in your SQL Server big data cluster and to the SQL Server big data cluster endpoint, then follow the steps in *data-pool/data-ingestion-spark.sql*.

### Data ingestion using SQL
Connect to the master instance in your SQL Server big data cluster and execute the steps in *data-pool/data-ingestion-sql.sql*.
__[data-virtualization](data-virtualization/)__

### External table over HDFS
Connect to the master instance in your SQL Server big data cluster and execute the steps in *data-virtualization/external-table-hdfs.sql*.
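For orientation, the external table in that script generally takes the following shape. This is a sketch, not the script's exact contents: it assumes the SqlStoragePool data source created by *bootstrap-sample-db.sql*, and the file format name and column list are illustrative (in practice the columns must match the CSV layout exactly).

```sql
USE sales;
GO
-- Describe how the CSV files in HDFS are parsed.
CREATE EXTERNAL FILE FORMAT csv_file
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"'));
GO
-- External table over the /clickstream_data directory uploaded by the
-- bootstrap script (column list abbreviated for illustration).
CREATE EXTERNAL TABLE web_clickstreams_hdfs_csv
    (wcs_user_sk BIGINT, wcs_item_sk BIGINT)
WITH (DATA_SOURCE = SqlStoragePool,
      LOCATION = '/clickstream_data',
      FILE_FORMAT = csv_file);
GO
SELECT TOP 10 * FROM web_clickstreams_hdfs_csv;
```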
### External table over Oracle
To execute this sample script, you will need the following:

1. An Oracle instance and credentials
1. The inventory table created in Oracle using the [data-virtualization/inventory-oracle.sql](data-virtualization/inventory-oracle.sql) script
1. The inventory.csv file generated by the bootstrap-sample-db script, imported into that Oracle table

Connect to the master instance in your SQL Server big data cluster and execute the steps in *data-virtualization/external-table-oracle.sql*. A sketch of the pattern follows.
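The script's general shape is shown below; every name, credential, and address here is a placeholder, not the sample's actual values.

```sql
USE sales;
GO
-- A database master key must exist before a scoped credential can be created.
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys WHERE name = '##MS_DatabaseMasterKey##')
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
GO
-- Credential for the Oracle instance (placeholder identity/secret).
CREATE DATABASE SCOPED CREDENTIAL OracleCredential
WITH IDENTITY = 'oracle_user', SECRET = 'oracle_user_password';
GO
-- Data source pointing at the Oracle listener (placeholder host).
CREATE EXTERNAL DATA SOURCE OracleServer
WITH (LOCATION = 'oracle://oracle-host:1521', CREDENTIAL = OracleCredential);
GO
-- External table mapped onto the Oracle inventory table
-- (placeholder database/schema names, column list abbreviated).
CREATE EXTERNAL TABLE inventory_oracle
    (inv_date_sk BIGINT, inv_item_sk BIGINT, inv_quantity_on_hand INT)
WITH (DATA_SOURCE = OracleServer, LOCATION = '[ORCL].[ORACLE_USER].[INVENTORY]');
GO
SELECT TOP 10 * FROM inventory_oracle;
```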
__[machine-learning](machine-learning/)__

### SQL Server ML Services on master instance
Connect to the master instance in your SQL Server big data cluster and execute the steps in *machine-learning/sql/book-category-r-ml.sql*.
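That script builds on `sp_execute_external_script`, which *bootstrap-sample-db.sql* enables through the `external scripts enabled` option. As a quick smoke test of the mechanism (not part of the sample itself), a trivial R round-trip over the bootstrap view might look like:

```sql
-- Echo a query result through the R runtime to verify ML Services works.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet;',
    @input_data_1 = N'SELECT TOP 5 clicks_in_category FROM sales.dbo.web_clickstreams_book_clicks'
WITH RESULT SETS ((clicks_in_category INT));
```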
### Spark ML
Connect to the SQL Server big data cluster endpoint and run the notebook files *machine-learning/spark/1-data-prep.ipynb* and *machine-learning/spark/2-build-ml-model.ipynb* cell by cell.
bootstrap-sample-db.cmd

Lines changed: 61 additions & 0 deletions

```bat
@echo off
REM CLICKSTREAM FILES
setlocal enableextensions
set CLUSTER_NAMESPACE=%1
set SQL_MASTER_IP=%2
set SQL_MASTER_SA_PASSWORD=%3
set BACKUP_FILE_PATH=%~4
set KNOX_IP=%5
set KNOX_PASSWORD=%6
set STARTUP_PATH=%~dp0

if NOT DEFINED CLUSTER_NAMESPACE goto :usage
if NOT DEFINED SQL_MASTER_IP goto :usage
if NOT DEFINED SQL_MASTER_SA_PASSWORD goto :usage
if NOT DEFINED BACKUP_FILE_PATH goto :usage
if NOT DEFINED KNOX_IP goto :usage
if NOT DEFINED KNOX_PASSWORD set KNOX_PASSWORD=%SQL_MASTER_SA_PASSWORD%

set SQL_MASTER_INSTANCE=%SQL_MASTER_IP%,31433
set KNOX_ENDPOINT=%KNOX_IP%:30443

echo Verifying sqlcmd.exe is in path & CALL WHERE /Q sqlcmd.exe || GOTO exit
echo Verifying bcp.exe is in path & CALL WHERE /Q bcp.exe || GOTO exit
REM Parentheses group the failure branch; without them "|| echo ... && GOTO exit"
REM would jump to :exit even when the tool is found.
echo Verifying kubectl.exe is in path & CALL WHERE /Q kubectl.exe || (echo HINT: Install the kubernetes-cli - https://kubernetes.io/docs/tasks/tools/install-kubectl & GOTO exit)
echo Verifying curl.exe is in path & CALL WHERE /Q curl.exe || (echo HINT: Install curl - https://curl.haxx.se/download.html & GOTO exit)

REM Copy the backup file, restore the database, create necessary objects and data files
echo Copying database backup file...
pushd "%BACKUP_FILE_PATH%"
%DEBUG% kubectl cp tpcxbb_1gb.bak mssql-master-pool-0:/var/opt/mssql/data -c mssql-server -n %CLUSTER_NAMESPACE% || goto exit
popd

echo Configuring sample database...
%DEBUG% sqlcmd -S %SQL_MASTER_INSTANCE% -Usa -P%SQL_MASTER_SA_PASSWORD% -i "%STARTUP_PATH%bootstrap-sample-db.sql" -o "%STARTUP_PATH%bootstrap.out" -I -b || goto exit

for %%F in (web_clickstreams inventory) do (
    echo Exporting %%F data...
    %DEBUG% bcp sales.dbo.%%F out "%STARTUP_PATH%%%F.csv" -S %SQL_MASTER_INSTANCE% -Usa -P%SQL_MASTER_SA_PASSWORD% -c -t, -o "%STARTUP_PATH%%%F.out" -e "%STARTUP_PATH%%%F.err" || goto exit
)

REM Copy the data file to HDFS via the Knox WebHDFS gateway
echo Uploading web_clickstreams data to HDFS...
pushd "%STARTUP_PATH%"
%DEBUG% curl -i -L -k -u root:%KNOX_PASSWORD% -X PUT "https://%KNOX_ENDPOINT%/gateway/default/webhdfs/v1/clickstream_data?op=MKDIRS" || goto exit
%DEBUG% curl -i -L -k -u root:%KNOX_PASSWORD% -X PUT "https://%KNOX_ENDPOINT%/gateway/default/webhdfs/v1/clickstream_data/web_clickstreams.csv?op=create" -H "Content-Type: application/octet-stream" -T "web_clickstreams.csv" || goto exit

REM Uncomment to clean up intermediate files:
:: del /q *.out *.err *.csv
popd

endlocal
exit /b 0
goto :eof

:exit
echo Bootstrap of the sample database failed.
exit /b %ERRORLEVEL%

:usage
echo USAGE: %0 ^<CLUSTER_NAMESPACE^> ^<SQL_MASTER_IP^> ^<SQL_MASTER_SA_PASSWORD^> ^<BACKUP_FILE_PATH^> ^<KNOX_IP^> [^<KNOX_PASSWORD^>]
echo Default ports are assumed for SQL Master instance ^& Knox gateway.
exit /b 0
```
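A sample invocation from a Windows command prompt (placeholder values, mirroring the usage message above):

```bat
REM Placeholder namespace, IPs, SA password, and backup directory.
bootstrap-sample-db.cmd mssql-cluster 203.0.113.10 MyS@P@ssw0rd "C:\Users\me\Downloads" 203.0.113.11
```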
bootstrap-sample-db.sh

Lines changed: 51 additions & 0 deletions

```bash
#!/bin/bash
set -e
set -o pipefail
USAGE_MESSAGE="USAGE: $0 <CLUSTER_NAMESPACE> <SQL_MASTER_IP> <SQL_MASTER_SA_PASSWORD> <BACKUP_FILE_PATH> <KNOX_IP> [<KNOX_PASSWORD>]"
ERROR_MESSAGE="Bootstrap of the sample database failed."

# Print usage if mandatory parameters are missing
: "${1:?$USAGE_MESSAGE}"
: "${2:?$USAGE_MESSAGE}"
: "${3:?$USAGE_MESSAGE}"
: "${4:?$USAGE_MESSAGE}"
: "${5:?$USAGE_MESSAGE}"
: "${DEBUG=}"

# Save the input parameters
CLUSTER_NAMESPACE=$1
SQL_MASTER_IP=$2
SQL_MASTER_SA_PASSWORD=$3
BACKUP_FILE_PATH=$4
KNOX_IP=$5
KNOX_PASSWORD=$6
# If the Knox password is not supplied, default to the SQL Master password
KNOX_PASSWORD=${KNOX_PASSWORD:=$SQL_MASTER_SA_PASSWORD}

SQL_MASTER_INSTANCE=$SQL_MASTER_IP,31433
KNOX_ENDPOINT=$KNOX_IP:30443

# Copy the backup file, restore the database, create necessary objects and data files
echo Copying database backup file...
pushd "$BACKUP_FILE_PATH"
$DEBUG kubectl cp tpcxbb_1gb.bak mssql-master-pool-0:/var/opt/mssql/data -c mssql-server -n "$CLUSTER_NAMESPACE" || (echo "$ERROR_MESSAGE" && exit 1)
popd

echo Configuring sample database...
# WSL ex: "/mnt/c/Program Files/Microsoft SQL Server/Client SDK/ODBC/130/Tools/Binn/SQLCMD.EXE"
$DEBUG sqlcmd -S "$SQL_MASTER_INSTANCE" -Usa -P"$SQL_MASTER_SA_PASSWORD" -i "bootstrap-sample-db.sql" -o "bootstrap.out" -I -b || (echo "$ERROR_MESSAGE" && exit 2)

for table in web_clickstreams inventory
do
    echo Exporting $table data...
    # WSL ex: "/mnt/c/Program Files/Microsoft SQL Server/Client SDK/ODBC/130/Tools/Binn/bcp.exe"
    $DEBUG bcp sales.dbo.$table out "$table.csv" -S "$SQL_MASTER_INSTANCE" -Usa -P"$SQL_MASTER_SA_PASSWORD" -c -t, -o "$table.out" -e "$table.err" || (echo "$ERROR_MESSAGE" && exit 3)
done

# Copy the data file to HDFS via the Knox WebHDFS gateway
echo Uploading web_clickstreams data to HDFS...
$DEBUG curl -i -L -k -u "root:$KNOX_PASSWORD" -X PUT "https://$KNOX_ENDPOINT/gateway/default/webhdfs/v1/clickstream_data?op=MKDIRS" || (echo "$ERROR_MESSAGE" && exit 4)
$DEBUG curl -i -L -k -u "root:$KNOX_PASSWORD" -X PUT "https://$KNOX_ENDPOINT/gateway/default/webhdfs/v1/clickstream_data/web_clickstreams.csv?op=create" -H 'Content-Type: application/octet-stream' -T "web_clickstreams.csv" || (echo "$ERROR_MESSAGE" && exit 5)

# Uncomment to clean up intermediate files:
# rm -f *.out *.err *.csv
exit
```
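To confirm the upload landed, the same Knox WebHDFS gateway can be queried with a standard LISTSTATUS call (placeholder endpoint and password, matching the script's variables):

```bash
# Expect web_clickstreams.csv in the FileStatuses output.
curl -i -L -k -u "root:$KNOX_PASSWORD" \
  "https://$KNOX_ENDPOINT/gateway/default/webhdfs/v1/clickstream_data?op=LISTSTATUS"
```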
bootstrap-sample-db.sql

Lines changed: 74 additions & 0 deletions

```sql
USE master;
GO
-- Enable external script execution for R/Python/Java:
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
GO

-- Restore the sample database from the backup copied into the master instance pod
IF DB_ID('sales') IS NULL
    RESTORE DATABASE sales
    FROM DISK = N'/var/opt/mssql/data/tpcxbb_1gb.bak'
    WITH
        MOVE N'tpcxbb_1gb' TO N'/var/opt/mssql/data/sales.mdf',
        MOVE N'tpcxbb_1gb_log' TO N'/var/opt/mssql/data/sales.ldf';
GO

USE sales;
GO
-- Create default data sources for SQL Big Data Cluster
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlDataPool')
    CREATE EXTERNAL DATA SOURCE SqlDataPool
    WITH (LOCATION = 'sqldatapool://service-mssql-controller:8080/datapools/default');

IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://service-mssql-controller:8080');
GO

-- Create the view used by the ML Services training stored procedure:
-- per-user click counts in the Books category and in each category id,
-- joined with customer demographics.
CREATE OR ALTER VIEW [dbo].[web_clickstreams_book_clicks]
AS
SELECT
    q.clicks_in_category,
    CASE WHEN cd.cd_education_status IN ('Advanced Degree', 'College', '4 yr Degree', '2 yr Degree') THEN 1 ELSE 0 END AS college_education,
    CASE WHEN cd.cd_gender = 'M' THEN 1 ELSE 0 END AS male,
    q.clicks_in_1,
    q.clicks_in_2,
    q.clicks_in_3,
    q.clicks_in_4,
    q.clicks_in_5,
    q.clicks_in_6,
    q.clicks_in_7,
    q.clicks_in_8,
    q.clicks_in_9
FROM (
    SELECT
        w.wcs_user_sk,
        SUM(CASE WHEN i.i_category = 'Books' THEN 1 ELSE 0 END) AS clicks_in_category,
        SUM(CASE WHEN i.i_category_id = 1 THEN 1 ELSE 0 END) AS clicks_in_1,
        SUM(CASE WHEN i.i_category_id = 2 THEN 1 ELSE 0 END) AS clicks_in_2,
        SUM(CASE WHEN i.i_category_id = 3 THEN 1 ELSE 0 END) AS clicks_in_3,
        SUM(CASE WHEN i.i_category_id = 4 THEN 1 ELSE 0 END) AS clicks_in_4,
        SUM(CASE WHEN i.i_category_id = 5 THEN 1 ELSE 0 END) AS clicks_in_5,
        SUM(CASE WHEN i.i_category_id = 6 THEN 1 ELSE 0 END) AS clicks_in_6,
        SUM(CASE WHEN i.i_category_id = 7 THEN 1 ELSE 0 END) AS clicks_in_7,
        SUM(CASE WHEN i.i_category_id = 8 THEN 1 ELSE 0 END) AS clicks_in_8,
        SUM(CASE WHEN i.i_category_id = 9 THEN 1 ELSE 0 END) AS clicks_in_9
    FROM web_clickstreams AS w
    INNER JOIN item AS i ON (w.wcs_item_sk = i.i_item_sk
                             AND w.wcs_user_sk IS NOT NULL)
    GROUP BY w.wcs_user_sk
) AS q
INNER JOIN customer AS c ON q.wcs_user_sk = c.c_customer_sk
INNER JOIN customer_demographics AS cd ON c.c_current_cdemo_sk = cd.cd_demo_sk;
GO

-- Create table for storing the machine learning models
CREATE TABLE sales_models (
    model_name   varchar(100)   NOT NULL PRIMARY KEY,
    model        varbinary(max) NOT NULL,
    model_native varbinary(max) NOT NULL,
    created_by   nvarchar(300)  NOT NULL DEFAULT(SYSTEM_USER),
    create_time  datetime2      NOT NULL DEFAULT(SYSDATETIME())
);
GO
```
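Once the bootstrap has run, a quick sanity check of the created view might look like this (hypothetical ad-hoc query, not part of the scripts):

```sql
-- Expect one row per clicking user, heaviest Books readers first.
SELECT TOP 10 * FROM sales.dbo.web_clickstreams_book_clicks
ORDER BY clicks_in_category DESC;
```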
data-pool/README.md

Lines changed: 62 additions & 0 deletions

# Data pools in SQL Server 2019 big data cluster

SQL Server big data clusters provide scale-out compute and storage to improve the performance of analyzing any data. Data from a variety of sources can be ingested and distributed across data pool instances for analysis.

## Data ingestion using SQL stored procedure

In this example, we will insert the results of a SQL query into an external table stored in the data pool and then query it.

### Instructions

1. Connect to the SQL Server Master instance.

1. Execute the SQL script [data-ingestion-sql.sql](data-ingestion-sql.sql). A sketch of the pattern it follows appears below.
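The pattern is roughly the following. This is a sketch assuming the SqlDataPool data source created by *bootstrap-sample-db.sql*; the table and column names are illustrative, not the script's exact ones.

```sql
USE sales;
GO
-- External table whose rows are distributed across the data pool instances.
CREATE EXTERNAL TABLE web_clickstream_clicks_data_pool
    (wcs_user_sk BIGINT, i_category_id BIGINT, clicks BIGINT)
WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);
GO
-- Ingest the results of a T-SQL query into the data pool.
INSERT INTO web_clickstream_clicks_data_pool
SELECT w.wcs_user_sk, i.i_category_id, COUNT_BIG(*) AS clicks
FROM web_clickstreams AS w
INNER JOIN item AS i ON w.wcs_item_sk = i.i_item_sk
GROUP BY w.wcs_user_sk, i.i_category_id;
GO
```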
## Data ingestion using Spark streaming

In this example, you will use Spark to read and transform data from HDFS and cache it in a data pool. Querying the external table created over this aggregated data is much more efficient than always going back to the raw data.

### Instructions

1. Using Azure Data Studio, connect to the HDFS/Spark gateway (SQL Server big data cluster connection type).

1. Connect to the SQL Server Master instance using Azure Data Studio.

1. Execute the SQL script [data-ingestion-spark.sql](data-ingestion-spark.sql).

1. Create and submit a Spark job that ingests data from HDFS into the external table. Submitting the job starts a Spark streaming session using spark-submit.

   The arguments to the jar file are:

   1. server name - the SQL Server master instance to connect to for reading the table schema
   2. port number
   3. username - SQL Server username for the master instance
   4. password - SQL Server password for the master instance
   5. database name
   6. external table name
   7. source directory for streaming; this must be a full URI, such as "hdfs:///clickstream_data"
   8. input format; one of "csv", "parquet", or "json"
   9. enable checkpoint: true or false
Submit the Spark job with the parameters below. You can use the Spark submit experience from Azure Data Studio (right-click the big data cluster endpoint -> Submit Spark Job):

ARGUMENTS:

**Job name:** yourJobName

**Switch** from "Local" to "HDFS"

**Path to jar** (copy/paste this):

/jar/mssql-spark-lib-assembly-1.0.jar

**Main class:**

FileStreaming

**Parameters** (copy/paste this; make sure you replace the password!):

mssql-master-pool-0.service-master-pool 1433 sa passwordHere sales web_clickstreams_spark_results hdfs:///clickstream_data csv false
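For reference, an equivalent invocation from a shell with access to the cluster's Spark environment might look like the following. This is hypothetical; the sample's documented path is the Azure Data Studio dialog above, and the positional arguments are the nine listed earlier.

```bash
# Hypothetical spark-submit equivalent of the Azure Data Studio job above.
# Args: server port user password database table source-URI format checkpoint
spark-submit --class FileStreaming /jar/mssql-spark-lib-assembly-1.0.jar \
  mssql-master-pool-0.service-master-pool 1433 sa passwordHere sales \
  web_clickstreams_spark_results hdfs:///clickstream_data csv false
```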
5. Query the external table created earlier using the SELECT queries in the script to watch data from the streaming job land in the table.
