`samples/features/sql-big-data-cluster/data-pool/README.md`

# Data pools in SQL Server 2019 big data cluster
SQL Server Big Data clusters provide scale-out compute and storage to improve the performance of analyzing any data. Data from a variety of sources can be ingested and distributed across data pool instances for analysis.
## Data ingestion using SQL stored procedure
In this example, we will insert the results of a SQL query into an external table stored in a data pool and then query it.
### Instructions
1. Connect to the SQL Server Master instance.
1. Execute the SQL script [data-ingestion-sql.sql](data-ingestion-sql.sql).
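The script follows the general data-pool ingestion pattern. The sketch below is illustrative only; the table and column names are hypothetical and not the exact contents of [data-ingestion-sql.sql](data-ingestion-sql.sql):

```sql
-- Illustrative sketch; table and column names are hypothetical.

-- The data pool is exposed through a built-in data source.
IF NOT EXISTS (SELECT 1 FROM sys.external_data_sources WHERE name = 'SqlDataPool')
    CREATE EXTERNAL DATA SOURCE SqlDataPool
    WITH (LOCATION = 'sqldatapool://controller-svc/default');

-- External table whose rows are distributed across the data pool instances.
CREATE EXTERNAL TABLE [web_clickstream_clicks_data_pool]
    (wcs_user_sk BIGINT, i_category_id BIGINT, clicks BIGINT)
WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);

-- Ingest the results of a SQL query into the data pool.
INSERT INTO [web_clickstream_clicks_data_pool]
SELECT wcs_user_sk, i_category_id, COUNT_BIG(*) AS clicks
FROM dbo.web_clickstreams
GROUP BY wcs_user_sk, i_category_id;
```

Once created, the external table can be queried from the master instance like any other table.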
## Data ingestion using Spark streaming
In this example, you are going to use Spark to read and transform data from HDFS and cache it in a data pool. Querying the external table created over this aggregated data in the data pool is much more efficient than querying the raw data directly every time.
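After the aggregated data is cached, the master instance reads it in parallel from the data pool instances. A minimal query sketch, assuming a hypothetical external table name:

```sql
-- Query the aggregate cached in the data pool instead of rescanning raw HDFS data.
-- The table name is hypothetical.
SELECT TOP 10 wcs_user_sk, SUM(clicks) AS total_clicks
FROM [web_clickstream_clicks_data_pool]
GROUP BY wcs_user_sk
ORDER BY total_clicks DESC;
```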
# Data virtualization in SQL Server 2019 big data cluster
In SQL Server 2019 big data clusters, the SQL Server engine has gained the ability to natively read HDFS files, such as CSV and Parquet files, by using SQL Server instances collocated on each of the HDFS data nodes to filter and aggregate data locally, in parallel across all of the HDFS data nodes. SQL Server 2019 also introduces new ODBC connectors to data sources like SQL Server, Oracle, MongoDB, and Teradata.
## Query data in HDFS from SQL Server master
In this example, you are going to create an external table in the SQL Server Master instance that points to data in HDFS within the SQL Server big data cluster. Then you will join the data in the external table with high-value data in the SQL Server Master instance.
### Instructions
1. Connect to the SQL Server Master instance.
1. Execute the SQL script [external-table-hdfs.sql](external-table-hdfs.sql).
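The script follows the standard pattern for external tables over the cluster's storage pool. The sketch below is illustrative only; the file path, schema, and table names are hypothetical, not the exact contents of [external-table-hdfs.sql](external-table-hdfs.sql):

```sql
-- Illustrative sketch; paths, schema, and names are hypothetical.

-- Built-in data source pointing at the cluster's storage pool (HDFS).
IF NOT EXISTS (SELECT 1 FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default');

-- Describe the CSV layout of the files in HDFS.
CREATE EXTERNAL FILE FORMAT csv_file
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2));

-- External table over an HDFS directory.
CREATE EXTERNAL TABLE [product_reviews_hdfs]
    (pr_review_sk BIGINT, pr_item_sk BIGINT, pr_review_content NVARCHAR(4000))
WITH (DATA_SOURCE = SqlStoragePool,
      LOCATION = '/product_review_data',
      FILE_FORMAT = csv_file);

-- Join HDFS data with high-value data in the master instance.
SELECT r.pr_review_content, i.i_item_desc
FROM [product_reviews_hdfs] AS r
JOIN dbo.item AS i ON r.pr_item_sk = i.i_item_sk;
```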
## Query data in Oracle from SQL Server master
In this example, you are going to create an external table in the SQL Server Master instance over the inventory table that resides on an Oracle server.
**Before you begin**, you need an Oracle instance and credentials. Execute the SQL script [inventory-ora.sql](inventory-ora.sql) in Oracle to create the table and import the "inventory.csv" file created by the bootstrap sample database.
### Instructions
1. Connect to the SQL Server Master instance.
1. Execute the SQL script [external-table-oracle.sql](external-table-oracle.sql).
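The script uses the Oracle connector from the master instance. The sketch below is illustrative only; the server name, credential, and Oracle schema are hypothetical, not the exact contents of [external-table-oracle.sql](external-table-oracle.sql):

```sql
-- Illustrative sketch; server, credential, and Oracle object names are hypothetical.
-- Assumes a database master key already exists in this database.

-- Credential used to log in to the Oracle instance.
CREATE DATABASE SCOPED CREDENTIAL [OracleCredential]
WITH IDENTITY = 'oracle_user', SECRET = 'oracle_user_password';

-- Data source using the Oracle connector.
CREATE EXTERNAL DATA SOURCE [OracleServer]
WITH (LOCATION = 'oracle://oracleserver.contoso.com:1521',
      CREDENTIAL = [OracleCredential]);

-- External table over the remote inventory table.
CREATE EXTERNAL TABLE [inventory_ora]
    (inv_date_sk BIGINT, inv_item_sk BIGINT, inv_quantity_on_hand INT)
WITH (DATA_SOURCE = [OracleServer], LOCATION = '[XE].[ORA_USER].[INVENTORY]');

SELECT TOP 10 * FROM [inventory_ora];
```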
`samples/features/sql-big-data-cluster/machine-learning/README.md`

# Machine learning in SQL Server 2019 big data cluster
## SQL Server Machine Learning Services on SQL Master instance
In this example, we are building a machine learning model for a recommendation engine on an online store, using R and a logistic regression algorithm. The model is trained on existing users' online click patterns, their interest in other categories, and their demographics. It will then be used to predict whether a visitor is interested in a given item category, using the T-SQL PREDICT function.
### Instructions
1. Connect to the SQL Server Master instance.
1. Execute the SQL script [sql/book-click-prediction-r.sql](sql/book-click-prediction-r.sql).
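The script follows the usual Machine Learning Services pattern: train with `sp_execute_external_script`, serialize the model, then score with PREDICT. The sketch below is illustrative only; the table, column, and formula names are hypothetical, not the exact contents of the script:

```sql
-- Illustrative sketch; table and column names are hypothetical.
DECLARE @model VARBINARY(MAX);

-- Train a logistic regression in R and return the serialized model.
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        model <- rxLogit(clicked ~ category_interest + age + income,
                         data = InputDataSet)
        serialized <- rxSerializeModel(model, realtimeScoringOnly = TRUE)',
    @input_data_1 = N'SELECT clicked, category_interest, age, income
                      FROM dbo.user_clicks',
    @params = N'@serialized VARBINARY(MAX) OUTPUT',
    @serialized = @model OUTPUT;

-- Score new visitors with the native T-SQL PREDICT function.
SELECT p.clicked_Pred, v.*
FROM PREDICT(MODEL = @model, DATA = dbo.new_visitors AS v)
WITH (clicked_Pred FLOAT) AS p;
```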
## Machine learning using Spark
The new built-in notebooks in Azure Data Studio enable data scientists and data engineers to run Python, R, or Scala code against the cluster. This is a great way to explore the data and build machine learning models. Notebooks also facilitate collaboration between teammates working on a shared data set.
This sample builds a machine learning model using AdultCensusIncome.csv available [here](https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv).
### Instructions
In this example, you are going to run sample notebooks that build a machine learning model over a public data set.
Follow the steps below to get up and running with the sample.
#### Upload the data for analysis
1. From Azure Data Studio, connect to the SQL Server big data cluster endpoint. Information about how you connect from Azure Data Studio can be found [here](https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-server-2019-extension?view=sql-server-ver15).
2. Download the data from https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv and save AdultCensusIncome.csv in a folder called spark_ml in HDFS.
#### Run notebook for data preparation
As a first step, we'll load the data, do some basic cleanup, and choose the features we want to build the machine learning model with. Finally, we'll split the data into training and test sets.
1. Download and save the notebook file [spark/1-data-prep.ipynb](spark/1-data-prep.ipynb) locally.
1. Open the notebook file in Azure Data Studio (right-click the SQL Server big data cluster server name -> **Manage** -> **Open Notebook**).
1. The training and test sets created will be stored as /spark_ml/AdultCensusIncomeTrain and /spark_ml/AdultCensusIncomeTest.
#### Run notebook to create a machine learning model and use it to predict
We'll now create the machine learning model, use it to predict results on the test set, and then save the model to a file.
1. Download and save the notebook (ipynb) file [spark/2-build-ml-model.ipynb](spark/2-build-ml-model.ipynb) locally.
1. Open the notebook file in Azure Data Studio (right-click the SQL Server big data cluster server name -> **Manage** -> **Open Notebook**).