
Commit 09b207f

Initial samples for SQL Server 2019 big data cluster
Demonstrates various functionality in a big data cluster.
1 parent 5083b5c commit 09b207f

20 files changed

Lines changed: 862 additions & 0 deletions
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# SQL Server big data clusters

## Prerequisites

1. Kubernetes cluster configuration and the kubectl command-line utility
2. curl utility
3. sqlcmd utility
4. bcp utility
5. Azure Data Studio or SQL Server Management Studio
6. SQL Server 2019 big data cluster

Installation instructions for SQL Server 2019 big data cluster can be found [here](https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-guidance?view=sql-server-2017).

## Samples Setup

**Before you begin**, download the sample database [backup file](https://sqlchoice.blob.core.windows.net/sqlchoice/static/tpcxbb_1gb.bak) and save it locally. Then run the CMD script *bootstrap-sample-db.cmd* or the shell script *bootstrap-sample-db.sh*, depending on your platform. The script restores the database on the SQL Server master instance, runs the *bootstrap-sample-db.sql* script to create the database objects needed, exports the web_clickstreams and inventory tables to CSV files, and uploads the web_clickstreams CSV file to HDFS inside the SQL Server 2019 big data cluster.
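
Both scripts take the same positional arguments, in the order shown in their usage messages: cluster namespace, master instance IP, SA password, path to the folder containing the backup file, Knox gateway IP, and an optional Knox password that defaults to the SA password. A minimal sketch of an invocation follows; all values are placeholders to replace with your own.

```bash
# Usage: bootstrap-sample-db.sh <CLUSTER_NAMESPACE> <SQL_MASTER_IP> <SQL_MASTER_SA_PASSWORD> <BACKUP_FILE_PATH> <KNOX_IP> [<KNOX_PASSWORD>]
# All values below are placeholders for illustration only.
./bootstrap-sample-db.sh mssql-cluster 13.82.32.10 'Str0ngPassw0rd' ~/Downloads 13.82.32.11
```

On Windows, *bootstrap-sample-db.cmd* takes the same arguments in the same order.
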
__[data-pool](data-pool/)__

### Data ingestion using Spark

Connect to the master instance in your SQL Server big data cluster and to the SQL Server big data cluster endpoint, then follow the steps in *data-pool/data-ingestion-spark.sql*.

### Data ingestion using SQL

Connect to the master instance in your SQL Server big data cluster and execute the steps in *data-pool/data-ingestion-sql.sql*.
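
If you prefer the command line to Azure Data Studio for these steps, the sample scripts can also be run with sqlcmd against the master instance. A sketch, assuming the default master instance port 31433 used by the bootstrap scripts and placeholder connection values:

```bash
# Placeholder IP and password; 31433 is the default SQL master instance port used by the bootstrap scripts.
sqlcmd -S 13.82.32.10,31433 -U sa -P 'Str0ngPassw0rd' -i data-pool/data-ingestion-sql.sql -I -b
```
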
__[data-virtualization](data-virtualization/)__

### External table over HDFS

Connect to the master instance in your SQL Server big data cluster and execute the steps in *data-virtualization/external-table-hdfs.sql*.

### External table over Oracle

To execute this sample script, you will need the following:

1. An Oracle instance and credentials
1. An inventory table created in Oracle using the [data-virtualization/inventory-oracle.sql](data-virtualization/inventory-oracle.sql) script
1. The inventory.csv file generated by the bootstrap-sample-db script imported into that Oracle table

Connect to the master instance in your SQL Server big data cluster and execute the steps in *data-virtualization/external-table-oracle.sql*.

__[machine-learning](machine-learning/)__

### SQL Server ML Services on master instance

Connect to the master instance in your SQL Server big data cluster and execute the steps in *machine-learning/sql/book-category-r-ml.sql*.

### Spark ML

Connect to the SQL Server big data cluster endpoint and run the notebook files *machine-learning/spark/1-data-prep.ipynb* and *machine-learning/spark/2-build-ml-model.ipynb* cell by cell.
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
@echo off
REM CLICKSTREAM FILES
setlocal enableextensions
set CLUSTER_NAMESPACE=%1
set SQL_MASTER_IP=%2
set SQL_MASTER_SA_PASSWORD=%3
set BACKUP_FILE_PATH=%~4
set KNOX_IP=%5
set KNOX_PASSWORD=%6
set STARTUP_PATH=%~dp0

if NOT DEFINED CLUSTER_NAMESPACE goto :usage
if NOT DEFINED SQL_MASTER_IP goto :usage
if NOT DEFINED SQL_MASTER_SA_PASSWORD goto :usage
if NOT DEFINED BACKUP_FILE_PATH goto :usage
if NOT DEFINED KNOX_IP goto :usage
if NOT DEFINED KNOX_PASSWORD set KNOX_PASSWORD=%SQL_MASTER_SA_PASSWORD%

set SQL_MASTER_INSTANCE=%SQL_MASTER_IP%,31433
set KNOX_ENDPOINT=%KNOX_IP%:30443

echo Verifying sqlcmd.exe is in path & CALL WHERE /Q sqlcmd.exe || GOTO exit
echo Verifying bcp.exe is in path & CALL WHERE /Q bcp.exe || GOTO exit
echo Verifying kubectl.exe is in path & CALL WHERE /Q kubectl.exe || (echo HINT: Install the kubernetes-cli - https://kubernetes.io/docs/tasks/tools/install-kubectl && GOTO exit)
echo Verifying curl.exe is in path & CALL WHERE /Q curl.exe || (echo HINT: Install curl - https://curl.haxx.se/download.html && GOTO exit)

REM Copy the backup file, restore the database, create necessary objects and data file
echo Copying database backup file...
pushd "%BACKUP_FILE_PATH%"
%DEBUG% kubectl cp tpcxbb_1gb.bak mssql-master-pool-0:/var/opt/mssql/data -c mssql-server -n %CLUSTER_NAMESPACE% || goto exit
popd

echo Configuring sample database...
%DEBUG% sqlcmd -S %SQL_MASTER_INSTANCE% -Usa -P%SQL_MASTER_SA_PASSWORD% -i "%STARTUP_PATH%bootstrap-sample-db.sql" -o "%STARTUP_PATH%bootstrap.out" -I -b || goto exit

for %%F in (web_clickstreams inventory) do (
    echo Exporting %%F data...
    %DEBUG% bcp sales.dbo.%%F out "%STARTUP_PATH%%%F.csv" -S %SQL_MASTER_INSTANCE% -Usa -P%SQL_MASTER_SA_PASSWORD% -c -t, -o "%STARTUP_PATH%%%F.out" -e "%STARTUP_PATH%%%F.err" || goto exit
)

REM Copy the data file to HDFS
echo Uploading web_clickstreams data to HDFS...
pushd "%STARTUP_PATH%"
%DEBUG% curl -i -L -k -u root:%KNOX_PASSWORD% -X PUT "https://%KNOX_ENDPOINT%/gateway/default/webhdfs/v1/clickstream_data?op=MKDIRS" || goto exit
%DEBUG% curl -i -L -k -u root:%KNOX_PASSWORD% -X PUT "https://%KNOX_ENDPOINT%/gateway/default/webhdfs/v1/clickstream_data/web_clickstreams.csv?op=create" -H "Content-Type: application/octet-stream" -T "web_clickstreams.csv" || goto exit

:: del /q *.out *.err *.csv
popd

endlocal
exit /b 0
goto :eof

:exit
echo Bootstrap of the sample database failed.
exit /b %ERRORLEVEL%

:usage
echo USAGE: %0 ^<CLUSTER_NAMESPACE^> ^<SQL_MASTER_IP^> ^<SQL_MASTER_SA_PASSWORD^> ^<BACKUP_FILE_PATH^> ^<KNOX_IP^> [^<KNOX_PASSWORD^>]
echo Default ports are assumed for SQL Master instance ^& Knox gateway.
exit /b 0
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
#!/bin/bash
set -e
set -o pipefail
USAGE_MESSAGE="USAGE: $0 <CLUSTER_NAMESPACE> <SQL_MASTER_IP> <SQL_MASTER_SA_PASSWORD> <BACKUP_FILE_PATH> <KNOX_IP> [<KNOX_PASSWORD>]"
ERROR_MESSAGE="Bootstrap of the sample database failed."

# Print usage if mandatory parameters are missing
: "${1:?$USAGE_MESSAGE}"
: "${2:?$USAGE_MESSAGE}"
: "${3:?$USAGE_MESSAGE}"
: "${4:?$USAGE_MESSAGE}"
: "${5:?$USAGE_MESSAGE}"
: "${DEBUG=}"

# Save the input parameters
CLUSTER_NAMESPACE=$1
SQL_MASTER_IP=$2
SQL_MASTER_SA_PASSWORD=$3
BACKUP_FILE_PATH=$4
KNOX_IP=$5
KNOX_PASSWORD=$6
# If Knox password is not supplied then default to SQL Master password
KNOX_PASSWORD=${KNOX_PASSWORD:=$SQL_MASTER_SA_PASSWORD}

SQL_MASTER_INSTANCE=$SQL_MASTER_IP,31433
KNOX_ENDPOINT=$KNOX_IP:30443

# Copy the backup file, restore the database, create necessary objects and data file
echo Copying database backup file...
pushd "$BACKUP_FILE_PATH"
$DEBUG kubectl cp tpcxbb_1gb.bak mssql-master-pool-0:/var/opt/mssql/data -c mssql-server -n $CLUSTER_NAMESPACE || (echo $ERROR_MESSAGE && exit 1)
popd

echo Configuring sample database...
# WSL ex: "/mnt/c/Program Files/Microsoft SQL Server/Client SDK/ODBC/130/Tools/Binn/SQLCMD.EXE"
$DEBUG sqlcmd -S $SQL_MASTER_INSTANCE -Usa -P$SQL_MASTER_SA_PASSWORD -i "bootstrap-sample-db.sql" -o "bootstrap.out" -I -b || (echo $ERROR_MESSAGE && exit 2)

for table in web_clickstreams inventory
do
    echo Exporting $table data...
    # WSL ex: "/mnt/c/Program Files/Microsoft SQL Server/Client SDK/ODBC/130/Tools/Binn/bcp.exe"
    $DEBUG bcp sales.dbo.$table out "$table.csv" -S $SQL_MASTER_INSTANCE -Usa -P$SQL_MASTER_SA_PASSWORD -c -t, -o "$table.out" -e "$table.err" || (echo $ERROR_MESSAGE && exit 3)
done

# Copy the data file to HDFS
echo Uploading web_clickstreams data to HDFS...
$DEBUG curl -i -L -k -u root:$KNOX_PASSWORD -X PUT "https://$KNOX_ENDPOINT/gateway/default/webhdfs/v1/clickstream_data?op=MKDIRS" || (echo $ERROR_MESSAGE && exit 4)
$DEBUG curl -i -L -k -u root:$KNOX_PASSWORD -X PUT "https://$KNOX_ENDPOINT/gateway/default/webhdfs/v1/clickstream_data/web_clickstreams.csv?op=create" -H 'Content-Type: application/octet-stream' -T "web_clickstreams.csv" || (echo $ERROR_MESSAGE && exit 5)

# rm -f *.out *.err *.csv
exit
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
USE master;
GO
-- Enable external scripts execution for R/Python/Java:
exec sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
GO

IF DB_ID('sales') IS NULL
    RESTORE DATABASE sales
    FROM DISK = N'/var/opt/mssql/data/tpcxbb_1gb.bak'
    WITH
        MOVE N'tpcxbb_1gb' TO N'/var/opt/mssql/data/sales.mdf',
        MOVE N'tpcxbb_1gb_log' TO N'/var/opt/mssql/data/sales.ldf';
GO

USE sales;
GO
-- Create default data sources for SQL Big Data Cluster
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlDataPool')
    CREATE EXTERNAL DATA SOURCE SqlDataPool
    WITH (LOCATION = 'sqldatapool://service-mssql-controller:8080/datapools/default');

IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://service-mssql-controller:8080');
GO

-- Create view used for ML services training stored procedure
CREATE OR ALTER VIEW [dbo].[web_clickstreams_book_clicks]
AS
SELECT
    q.clicks_in_category,
    CASE WHEN cd.cd_education_status IN ('Advanced Degree', 'College', '4 yr Degree', '2 yr Degree') THEN 1 ELSE 0 END AS college_education,
    CASE WHEN cd.cd_gender = 'M' THEN 1 ELSE 0 END AS male,
    q.clicks_in_1,
    q.clicks_in_2,
    q.clicks_in_3,
    q.clicks_in_4,
    q.clicks_in_5,
    q.clicks_in_6,
    q.clicks_in_7,
    q.clicks_in_8,
    q.clicks_in_9
FROM (
    SELECT
        w.wcs_user_sk,
        SUM( CASE WHEN i.i_category = 'Books' THEN 1 ELSE 0 END) AS clicks_in_category,
        SUM( CASE WHEN i.i_category_id = 1 THEN 1 ELSE 0 END) AS clicks_in_1,
        SUM( CASE WHEN i.i_category_id = 2 THEN 1 ELSE 0 END) AS clicks_in_2,
        SUM( CASE WHEN i.i_category_id = 3 THEN 1 ELSE 0 END) AS clicks_in_3,
        SUM( CASE WHEN i.i_category_id = 4 THEN 1 ELSE 0 END) AS clicks_in_4,
        SUM( CASE WHEN i.i_category_id = 5 THEN 1 ELSE 0 END) AS clicks_in_5,
        SUM( CASE WHEN i.i_category_id = 6 THEN 1 ELSE 0 END) AS clicks_in_6,
        SUM( CASE WHEN i.i_category_id = 7 THEN 1 ELSE 0 END) AS clicks_in_7,
        SUM( CASE WHEN i.i_category_id = 8 THEN 1 ELSE 0 END) AS clicks_in_8,
        SUM( CASE WHEN i.i_category_id = 9 THEN 1 ELSE 0 END) AS clicks_in_9
    FROM web_clickstreams as w
    INNER JOIN item as i ON (w.wcs_item_sk = i_item_sk
                             AND w.wcs_user_sk IS NOT NULL)
    GROUP BY w.wcs_user_sk
) AS q
INNER JOIN customer as c ON q.wcs_user_sk = c.c_customer_sk
INNER JOIN customer_demographics as cd ON c.c_current_cdemo_sk = cd.cd_demo_sk;
GO

-- Create table for storing the machine learning models
CREATE TABLE sales_models (
    model_name   varchar(100)   NOT NULL PRIMARY KEY,
    model        varbinary(max) NOT NULL,
    model_native varbinary(max) NOT NULL,
    created_by   nvarchar(300)  NOT NULL DEFAULT(SYSTEM_USER),
    create_time  datetime2      NOT NULL DEFAULT(SYSDATETIME())
);
GO
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Data ingestion using Spark streaming

SQL Server big data clusters provide scale-out compute and storage to improve the performance of analyzing any data. Data from a variety of sources can be ingested and distributed across data pool instances for analysis. In this example, you are going to use Spark to read and transform data from HDFS and cache it in a data pool. Querying the external table created over this aggregated data in the data pool is much more efficient than repeatedly querying the raw data.
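
The Spark job reads the clickstream CSV data that the bootstrap script uploaded to HDFS through the Knox gateway. If you want to confirm that the data is in place before you start, a quick check with curl against the WebHDFS endpoint is sketched below; the Knox IP and password are placeholders for your own values, and 30443 is the default Knox gateway port used by the bootstrap scripts.

```bash
# Placeholder Knox gateway IP and password; LISTSTATUS is a standard WebHDFS operation.
curl -i -L -k -u root:'Str0ngPassw0rd' "https://13.82.32.11:30443/gateway/default/webhdfs/v1/clickstream_data?op=LISTSTATUS"
```
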
### Instructions

1. Using Azure Data Studio, connect to the HDFS/Spark gateway (SQL Server big data cluster connection type).

1. Connect to the SQL Server master instance using Azure Data Studio.

1. Execute the SQL script [data-ingestion-spark.sql](data-ingestion-spark.sql).

1. Create and submit a Spark job that ingests data from HDFS into the external table.

Submitting the Spark job starts a Spark streaming session using spark-submit.

The arguments to the jar file are:

1. server name - the SQL Server instance to connect to for reading the table schema
2. port number
3. username - SQL Server username for the master instance
4. password - SQL Server password for the master instance
5. database name
6. external table name
7. source directory for streaming; this must be a full URI, such as "hdfs:///clickstream_data"
8. input format; this can be "csv", "parquet", or "json"
9. enable checkpoint: true or false

Submit the Spark job with the parameters below. You can use the Spark submit experience from Azure Data Studio (right-click the big data cluster endpoint -> Submit Spark Job):

ARGUMENTS:

**Job name:** yourJobName

**Switch** from "Local" to "HDFS"

**Path to jar** (copy/paste this):

/jar/mssql-spark-lib-assembly-1.0.jar

**Main class:**
FileStreaming

**Parameters (copy/paste this; make sure you replace the password!):**

mssql-master-pool-0.service-master-pool 1433 sa passwordHere sales web_clickstreams_spark_results hdfs:///clickstream_data csv false
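
If you would rather submit the job from a command line than through the Azure Data Studio dialog, the same jar, main class, and argument list map onto a spark-submit invocation. This is only a sketch, assuming spark-submit is run somewhere the jar path above is reachable; the password is a placeholder:

```bash
# Sketch of the equivalent command-line submission; replace the password placeholder.
spark-submit --class FileStreaming /jar/mssql-spark-lib-assembly-1.0.jar \
  mssql-master-pool-0.service-master-pool 1433 sa passwordHere sales \
  web_clickstreams_spark_results hdfs:///clickstream_data csv false
```
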
6. Query the external table we created earlier using the SELECT queries in the script to see data coming from the streaming job and landing in the table.
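
As with the other samples, this final check can also be run with sqlcmd instead of Azure Data Studio. A sketch with placeholder connection values, using one of the SELECT queries from *data-ingestion-spark.sql*:

```bash
# Placeholder IP and password; web_clickstreams_spark_results is the external table created by the script.
sqlcmd -S 13.82.32.10,31433 -U sa -P 'Str0ngPassw0rd' \
  -Q "SELECT TOP 10 * FROM sales.dbo.web_clickstreams_spark_results;"
```
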
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
USE sales
GO

-- Create external table in a data pool in SQL Server 2019 big data cluster.
-- The SqlDataPool data source is a special data source that is available in
-- any new database in the SQL master instance. This is used to reference the
-- data pool in a SQL Server 2019 big data cluster.
--
CREATE EXTERNAL TABLE [web_clickstreams_spark_results]
("wcs_click_date_sk" BIGINT , "wcs_click_time_sk" BIGINT , "wcs_sales_sk" BIGINT , "wcs_item_sk" BIGINT , "wcs_web_page_sk" BIGINT , "wcs_user_sk" BIGINT)
WITH
(
    DATA_SOURCE = SqlDataPool,
    DISTRIBUTION = ROUND_ROBIN
);

-- Data can be ingested into the external table from a Spark job.
--
-- Submit a Spark job with the below parameters. You can use the Spark submit experience from Azure Data Studio.
-- Right-click the server name in a SQL Server big data cluster connection and click "Submit Spark Job".
--
-- Specify the following values in the job submission dialog box:
---- Job name: <yourJobName>
---- Switch from "Local" to "HDFS"
---- Main class: "FileStreaming"
---- Path to jar: /jar/mssql-spark-lib-assembly-1.0.jar
---- Arguments:
----   mssql-master-pool-0.service-master-pool 1433 sa %PASSWORD% sales web_clickstreams_spark_results hdfs:///clickstream_data csv false

-- The arguments to the jar file are:
-- 1: server name - the SQL Server instance to connect to for reading the table schema
-- 2: port number
-- 3: username - SQL Server username for the master instance
-- 4: password - SQL Server password for the master instance
-- 5: database name
-- 6: external table name
-- 7: source directory for streaming; this must be a full URI, such as "hdfs:///clickstream_data"
-- 8: input format; this can be "csv", "parquet", or "json"
-- 9: enable checkpoint: true or false
--

-- After the Spark streaming job has been successfully submitted, you can run the query below to view the results.
--
-- Wait until some rows are available.
WHILE (1=1)
    IF EXISTS(SELECT * FROM [web_clickstreams_spark_results])
        BREAK;

SELECT count(*) FROM [web_clickstreams_spark_results];
SELECT TOP 10 * FROM [web_clickstreams_spark_results];
GO

DROP EXTERNAL TABLE [dbo].[web_clickstreams_spark_results];
GO
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Data ingestion using SQL stored procedure

SQL Server big data clusters provide scale-out compute and storage to improve the performance of analyzing any data. Data from a variety of sources can be ingested and distributed across data pool instances for analysis. In this example, we will insert data from a SQL query into an external table stored in a data pool and then query it.

## Instructions

1. Connect to the SQL Server master instance.

1. Execute the .sql script [data-ingestion-sql.sql](data-ingestion-sql.sql).
