Commit 64d95b0

Added sklearn ML procs. Fixed other BIN files.
1 parent 63e8050 commit 64d95b0

5 files changed

Lines changed: 147 additions & 1 deletion

samples/features/sql-big-data-cluster/machine-learning/sql/python/README.md

Lines changed: 7 additions & 1 deletion
@@ -6,14 +6,20 @@ SQL Server 2016 added capability to run R script from T-SQL. SQL Server 2017 add
 
 **Applies to:** SQL Server 2017+, SQL Server 2019, SQL Server 2019 big data cluster
 
-In this example, we are building a machine learning model using Python and a logistic regression algorithm for a recommendation engine on an online store. Based on existing users' click pattern online and their interest in other categories and demographics, we are training a machine learning model. This model will then be used to predict if the visitor is interested in a given item category using the T-SQL PREDICT function.
+In this example, we are building a machine learning model using Python. The script uses a logistic regression algorithm from the revoscalepy package in Microsoft ML Server. Based on existing users' click pattern online and their interest in other categories and demographics, we are training a machine learning model. This model will then be used to predict if the visitor is interested in a given item category using the T-SQL PREDICT function.
 
 [book-click-prediction-partitioned-py.sql](book-click-prediction-partitioned-py.sql/)
 
 **Applies to:** SQL Server 2019, SQL Server 2019 big data cluster
 
 In this example, we are leveraging the new partitioning support (SQL Server 2019) in sp_execute_external_script to partition the input data and run the Python script per partition. So we will modify the training script to train model per group of users based on credit rating. The Python script will produce N models for the same input data set.
 
+[book-click-prediction-sklearn-py.sql](book-click-prediction-sklearn-py.sql/)
+
+**Applies to:** SQL Server 2017+, SQL Server 2019 big data cluster
+
+In this example, we are building a machine learning model using Python. The script uses a logistic regression algorithm from the sklearn package. In SQL Server 2017 or SQL Server 2019, you need to install the ***sklearn*** package before running the SQL script.
+
 ## Instructions
 
 1. Connect to SQL Server or SQL Server Master instance.
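
The README change above says the sklearn package must be installed before the SQL script is run. As a minimal sketch of how one might confirm that, the snippet below checks whether a package is importable without actually importing it. It assumes it is run with the same Python interpreter that SQL Server uses for external scripts; `package_available` is a hypothetical helper name, not part of the commit.

```python
import importlib.util


def package_available(name: str) -> bool:
    """Return True if the top-level package `name` is importable here."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # The scripts in this commit import sklearn, pandas and pickle.
    for pkg in ("sklearn", "pandas", "pickle"):
        status = "available" if package_available(pkg) else "missing"
        print(f"{pkg}: {status}")
```

If `sklearn` is reported as missing, install it into that interpreter before creating the stored procedures.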
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+USE sales
+GO
+
+-- Inspect top 100 rows
+--
+SELECT TOP(100) * FROM web_clickstreams_hdfs_book_clicks;
+GO
+
+-- Step #1a
+-- Create the training stored procedure
+CREATE OR ALTER PROCEDURE [dbo].[train_book_category_visitor_sklearn_python]
+(@model_name varchar(100))
+AS
+BEGIN
+DECLARE @model varbinary(max)
+, @input_query nvarchar(max)
+, @train_script nvarchar(max)
+
+-- Set the input query for training. We will use 80% of the data.
+SET @input_query = N'
+SELECT TOP(80) PERCENT SIGN(q.clicks_in_category) AS book_category
+, q.college_education
+, q.male
+, q.clicks_in_1
+, q.clicks_in_2
+, q.clicks_in_3
+, q.clicks_in_4
+, q.clicks_in_5
+, q.clicks_in_6
+, q.clicks_in_7
+, q.clicks_in_8
+, q.clicks_in_9
+FROM web_clickstreams_hdfs_book_clicks as q
+';
+-- Python script that uses logistic regression function from sklearn package to generate model to predict book_category click(s).
+SET @train_script = N'
+model = bytes()
+
+# build classification model to predict book_category
+import pickle
+from sklearn.linear_model import LogisticRegression
+
+# 1. instantiate model
+logreg = LogisticRegression( solver="lbfgs")
+
+# 2. fit and finalize the model
+feature_cols = ["college_education", "male", "clicks_in_1", "clicks_in_2","clicks_in_3","clicks_in_4","clicks_in_5","clicks_in_6","clicks_in_7","clicks_in_8","clicks_in_9"]
+logit_model = logreg.fit(indata[feature_cols], indata["book_category"])
+
+model = pickle.dumps(logit_model)
+';
+
+-- Generate sales model using Python script with the book clicks stats for each user
+EXECUTE sp_execute_external_script
+@language = N'Python'
+, @script = @train_script
+, @input_data_1 = @input_query
+, @input_data_1_name = N'indata'
+, @params = N'@model varbinary(max) OUTPUT'
+, @model = @model OUTPUT;
+
+-- Save the trained model to predict user clicks on book category in the website
+DELETE FROM sales_models WHERE model_name = @model_name;
+INSERT INTO sales_models (model_name, model) VALUES(@model_name, @model);
+END;
+GO
+
+
+-- Step #1b
+-- Train the book category prediction model:
+DECLARE @model_name varchar(100) = 'category_model - sklearn (Python)';
+EXECUTE dbo.train_book_category_visitor_sklearn_python @model_name;
+SELECT * FROM sales_models WHERE model_name = @model_name;
+GO
+
+-- Step #2a
+-- Predict the book category clicks for new users based on their pattern of
+-- visiting various categories in the web site
+CREATE OR ALTER PROCEDURE [dbo].[predict_book_category_visitor_sklearn_python]
+(@model_name varchar(100), @top_percent int = 20)
+AS
+BEGIN
+DECLARE @model varbinary(max) = (SELECT model FROM sales_models WHERE model_name = @model_name)
+, @input_query nvarchar(max)
+, @predict_script nvarchar(max);
+
+-- Set the input query for scoring. We will use 20% of the data by default
+SET @input_query = N'
+SELECT TOP(@top_count_value) PERCENT SIGN(q.clicks_in_category) AS book_category
+, q.college_education
+, q.male
+, q.clicks_in_1
+, q.clicks_in_2
+, q.clicks_in_3
+, q.clicks_in_4
+, q.clicks_in_5
+, q.clicks_in_6
+, q.clicks_in_7
+, q.clicks_in_8
+, q.clicks_in_9
+FROM web_clickstreams_hdfs_book_clicks as q
+';
+
+-- Scoring script that uses sklearn logistic regression model to predict book_category click(s)
+SET @predict_script = N'
+import pandas as pd
+import pickle
+
+logit_model = pickle.loads(model)
+
+feature_cols = ["college_education", "male", "clicks_in_1", "clicks_in_2","clicks_in_3","clicks_in_4","clicks_in_5","clicks_in_6","clicks_in_7","clicks_in_8","clicks_in_9"]
+
+predictions = logit_model.predict(indata[feature_cols])
+
+predictions_df = pd.DataFrame(predictions, columns = ["book_category_prediction"])
+outdata = pd.concat([predictions_df, indata], axis = 1, copy = False)
+';
+
+-- Predict the book category click based on the sklearn model
+EXECUTE sp_execute_external_script
+@language = N'Python'
+, @script = @predict_script
+, @input_data_1 = @input_query
+, @input_data_1_name = N'indata'
+, @output_data_1_name = N'outdata'
+, @params = N'@model varbinary(max), @top_count_value int'
+, @model = @model
+, @top_count_value = @top_percent
+WITH RESULT SETS ((book_category_prediction bit, book_category_actual bit, college_education varchar(30), male bit,
+clicks_in_1 int, clicks_in_2 int, clicks_in_3 int, clicks_in_4 int, clicks_in_5 int,
+clicks_in_6 int, clicks_in_7 int, clicks_in_8 int, clicks_in_9 int));
+END
+GO
+
+-- Step #2b
+-- Predict the book category clicks for new users based on their pattern of
+-- visiting various categories in the web site
+DECLARE @model_name varchar(100) = 'category_model - sklearn (Python)';
+EXECUTE dbo.predict_book_category_visitor_sklearn_python @model_name, 1 /* Score only on 1 PERCENT for testing purpose. */;
+GO
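
The two procedures above pass the model between training and scoring as a `varbinary(max)` value: the training script calls `pickle.dumps` on the fitted model, and the scoring script calls `pickle.loads` on the stored bytes. The sketch below illustrates only that round trip, outside SQL Server. As a loudly stated assumption, it replaces the fitted `LogisticRegression` object with a hypothetical dictionary of made-up coefficients (`model_params`) and a hand-written `predict` function, so that it runs without sklearn; none of these names or numbers come from the commit.

```python
import math
import pickle

# Hypothetical stand-in for a fitted model: illustrative coefficients only.
model_params = {"weights": [0.8, -0.4, 1.2], "bias": -1.0}


def predict(params, rows):
    """Label each row 1 if its logistic score is at least 0.5, else 0."""
    labels = []
    for row in rows:
        z = params["bias"] + sum(w * x for w, x in zip(params["weights"], row))
        probability = 1.0 / (1.0 + math.exp(-z))
        labels.append(1 if probability >= 0.5 else 0)
    return labels


# Training side: serialize to bytes, which is what lands in the model column.
blob = pickle.dumps(model_params)

# Scoring side: restore the object from the stored bytes and score new rows.
restored = pickle.loads(blob)
rows = [[0, 1, 0], [2, 0, 1]]
predictions = predict(restored, rows)
```

Because `pickle.loads` can execute arbitrary code while deserializing, this pattern is only appropriate when the stored bytes come from a trusted source, such as a model table that only the training procedure writes to.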
Binary file not shown.
