Skip to content

Commit 6cf3543

Browse files
authored
Merge pull request #552 from shivsood/master
MSSQL Spark connector, and readme fixes
2 parents 48f6e16 + 6d5bdc9 commit 6cf3543

4 files changed

Lines changed: 127 additions & 30 deletions

File tree

Lines changed: 13 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1,29 +1,27 @@
11
# SQL Server big data clusters
22

3-
The new built-in notebooks in Azure Data Studio enables data scientists and data engineers to run Python, R, Scala, or Spark SQL code against the cluster.
3+
SQL Server Big Data cluster bundles Spark and HDFS together with SQL server. Azure Data Studio IDE provides built in notebooks that enables data scientists and data engineers to run Spark notebooks and job in Python, R, or Scala code against the Big Data Cluster. This folder contains spark sample notebook on using Spark in SQL server Big data cluster
44

5-
## Instructions to open a notebook from Azure Data Studio and execute the commands
5+
## Folder contents
66

7-
1. Connect to the SQL Server Master instance in a big data cluster
7+
[PySpark Hello World](dataloading/hello_PySpark.ipynb)
88

9-
1. Right-click on the server name, select **Manage**, switch to **SQL Server Big Data Cluster** tab, and use open Notebook.
9+
[Scala Hello World ](dataloading/hello_Scala.ipynb)
1010

11-
1. Open the notebook in Azure Data Studio, wait for the “Kernel” and the target context (“Attach to”) to be populated.
11+
[SparkR Hello World ](dataloading/hello_sparkR.ipynb)
1212

13-
1. Run each cell in the Notebook sequentially.
13+
[DataLoading - Transforming CSV to Parquet](dataloading/transform-csv-files.ipynb/)
1414

15-
## __[data-loading](data-loading/)__
15+
[Data Transfer - Spark to SQL using Spark JDBC connector](data-virtualization/spark_to_sql_jdbc.ipynb/)
1616

17-
This folder contains samples that show how to load data using Spark and query them using SQL statements.
17+
[Data Transfer - Spark to SQL using MSSQL Spark connector](spark_to_sql/mssql_spark_connector.ipynb/)
1818

19-
[data-loading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/)
20-
21-
This samnple notebook shows how to transform CSV files in HDFS to parquet files.
19+
## Instructions on how to run in Azure Data Studio
2220

23-
[dataloading/spark-sql.ipynb](dataloading/spark-sql.ipynb/)
21+
[data-loading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/)
2422

25-
This samnple notebook shows how to query hive tables created from Spark.
23+
2. From Azure Data Studio Connect to the SQL Server Master instance in a big data cluster.
2624

27-
## __[data-virtualization](data-virtualization/)__
25+
3. Right-click on the server name, select **Manage**, switch to **SQL Server Big Data Cluster** tab, and open the notebook in Azure Data Studio. Wait for the “Kernel” and the target context (“Attach to”) to be populated. If required set the relevant “Kernel” ( e.g **PySpark3** ) and **Attach to** needs to be the IP address of your big data cluster endpoint.
2826

29-
This folder contains samples that show how to integrate Spark with other data sources.
27+
4. Run each cell in the Notebook sequentially.

samples/features/sql-big-data-cluster/spark/data-virtualization/spark_to_sql_jdbc.ipynb

Lines changed: 32 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@
1919
"cells": [
2020
{
2121
"cell_type": "markdown",
22-
"source": "# Read and write from Spark to SQL\r\nA typical big data scenario is large scale ETL in Spark and writing the processed data to SQLServer. The following samples shows \r\n- reading a HDFS file, \r\n- some basic processing on it and \r\n- then processed data to SQL Server table.\r\n\r\nNeed a database precreated in SQL for this sample. Here we are using database name \"MyTestDatabase\" that can be created using SQL statements below.\r\n\r\n``` sql\r\nCreate DATABASE MyTestDatabase\r\nGO \r\n``` \r\n ",
22+
"source": "# Read and write from Spark to SQL\r\nA typical big data scenario is large scale ETL in Spark and post processing the data is written out to SQLServer for access to LOB applications. This sample shows how to write to SQLServer from Spark. The main steps in the sample are \r\n- Reading a HDFS file, \r\n- Basic processing on it and \r\n- Then writing processed data to SQL Server table using JDBC\r\n\r\nPreReq : \r\n- The sample uses a SQL database named \"MyTestDatabase\". Create this before you run this sample. The database can be created as follows\r\n ``` sql\r\n Create DATABASE MyTestDatabase\r\n GO \r\n ``` \r\n- Download [AdultCensusIncome.csv]( https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv ) to your local machine. Create a hdfs folder named spark_data and upload the file there. \r\n\r\n \r\n ",
2323
"metadata": {}
2424
},
2525
{
@@ -28,51 +28,69 @@
2828
"metadata": {},
2929
"outputs": [
3030
{
31-
"output_type": "stream",
3231
"name": "stdout",
33-
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
32+
"text": "Starting Spark application\n",
33+
"output_type": "stream"
34+
},
35+
{
36+
"data": {
37+
"text/plain": "<IPython.core.display.HTML object>",
38+
"text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>2</td><td>application_1554755839506_0003</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"https://40.78.42.207:30443/gateway/default/yarn/proxy/application_1554755839506_0003/\">Link</a></td><td><a target=\"_blank\" href=\"http://storage-0-1.storage-0-svc.mssql-cluster.svc.cluster.local:8042/node/containerlogs/container_1554755839506_0003_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
39+
},
40+
"metadata": {},
41+
"output_type": "display_data"
42+
},
43+
{
44+
"name": "stdout",
45+
"text": "SparkSession available as 'spark'.\n",
46+
"output_type": "stream"
47+
},
48+
{
49+
"name": "stdout",
50+
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
51+
"output_type": "stream"
3452
}
3553
],
36-
"execution_count": 8
54+
"execution_count": 3
3755
},
3856
{
3957
"cell_type": "code",
4058
"source": "\r\n#Process this data. Very simple data cleanup steps. Replacing \"-\" with \"_\" in column names\r\ncolumns_new = [col.replace(\"-\", \"_\") for col in df.columns]\r\ndf = df.toDF(*columns_new)\r\ndf.show(5)\r\n\r\n",
4159
"metadata": {},
4260
"outputs": [
4361
{
44-
"output_type": "stream",
4562
"name": "stdout",
46-
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
63+
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
64+
"output_type": "stream"
4765
}
4866
],
49-
"execution_count": 9
67+
"execution_count": 4
5068
},
5169
{
5270
"cell_type": "code",
53-
"source": "#Write from Spark to SQL table using JDBC\r\nprint(\"Use build in JDBC connector to write to SQLServer master instance in Big data \")\r\n\r\nservername = \"jdbc:sqlserver://mssql-master-pool-0.service-master-pool\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\nc = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"****\"\r\n\r\nprint(\"url is \", url)\r\n\r\ntry:\r\n df.write \\\r\n .format(\"jdbc\") \\\r\n .mode(\"overwrite\") \\\r\n .option(\"url\", url) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password)\\\r\n .save()\r\nexcept ValueError as error :\r\n print(\"JDBC Write failed\", error)\r\n\r\nprint(\"JDBC Write done \")\r\n\r\n\r\n",
71+
"source": "#Write from Spark to SQL table using JDBC\r\nprint(\"Use build in JDBC connector to write to SQLServer master instance in Big data \")\r\n\r\nservername = \"jdbc:sqlserver://master-0.master-svc\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\ndbtable = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"Yukon900\"\r\n\r\nprint(\"url is \", url)\r\n\r\ntry:\r\n df.write \\\r\n .format(\"jdbc\") \\\r\n .mode(\"overwrite\") \\\r\n .option(\"url\", url) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password)\\\r\n .save()\r\nexcept ValueError as error :\r\n print(\"JDBC Write failed\", error)\r\n\r\nprint(\"JDBC Write done \")\r\n\r\n\r\n",
5472
"metadata": {},
5573
"outputs": [
5674
{
57-
"output_type": "stream",
5875
"name": "stdout",
59-
"text": "Use build in JDBC connector to write to SQLServer master instance in Big data \nurl is jdbc:sqlserver://mssql-master-pool-0.service-master-pool;databaseName=MyTestDatabase;\nJDBC Write done"
76+
"text": "Use build in JDBC connector to write to SQLServer master instance in Big data \nurl is jdbc:sqlserver://master-0.master-svc;databaseName=MyTestDatabase;\nJDBC Write done",
77+
"output_type": "stream"
6078
}
6179
],
62-
"execution_count": 10
80+
"execution_count": 9
6381
},
6482
{
6583
"cell_type": "code",
6684
"source": "#Read to Spark from SQL table using JDBC\r\nprint(\"read data from SQL server table \")\r\njdbcDF = spark.read \\\r\n .format(\"jdbc\") \\\r\n .option(\"url\", url\r\n ) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password) \\\r\n .load()\r\n\r\njdbcDF.show(5)",
6785
"metadata": {},
6886
"outputs": [
6987
{
70-
"output_type": "stream",
7188
"name": "stdout",
72-
"text": "read data from SQL server table \n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
89+
"text": "read data from SQL server table \n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
90+
"output_type": "stream"
7391
}
7492
],
75-
"execution_count": 13
93+
"execution_count": 11
7694
}
7795
]
7896
}

0 commit comments

Comments
 (0)