Commit b2ef5f6

minor doc fixes
1 parent 1916815 commit b2ef5f6

3 files changed

Lines changed: 56 additions & 20 deletions

samples/features/sql-big-data-cluster/spark/README.md

Lines changed: 23 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -2,12 +2,30 @@
22

33
The new built-in notebooks in Azure Data Studio enable data scientists and data engineers to run Python, R, or Scala code against the cluster.
44

5-
## Instructions
5+
## Instructions to open a notebook from Azure Data Studio
6+
7+
1. Connect to the SQL Server Master instance in a big data cluster
8+
9+
1. Right-click the server name, select **Manage**, switch to the **SQL Server Big Data Cluster** tab, and select **Open Notebook**.
10+
11+
## __[dataloading](dataloading/)__
12+
13+
This folder contains samples that show how to load data using Spark.
14+
15+
[PySpark Hello World](dataloading/hello_PySpark.ipynb)
616

7-
1. Download and save the notebook file [spark-sql.ipynb](spark-sql.ipynb/) locally.
17+
[Scala Hello World](dataloading/hello_Scala.ipynb)
18+
19+
[SparkR Hello World](dataloading/hello_sparkR.ipynb)
20+
21+
[DataLoading - Transforming CSV to Parquet](dataloading/transform-csv-files.ipynb/)
22+
23+
[Data Transfer - Spark to SQL using JDBC](spark_to_sql/spark_to_sql_jdbc.ipynb/)
24+
25+
## Instructions
826

9-
1. Open the notebook file in Azure Data Studio (right click on the SQL Server big data cluster server name-> **Manage**-> Open Notebook.
27+
1. Download and save the notebook file [dataloading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/) locally.
1028

11-
1. Wait for the “Kernel” and the target context (“Attach to”) to be populated. Set the “Kernel” to **PySpark3** and Attach to needs to be the IP address of your big data cluster endpoint.
29+
2. Open the notebook in Azure Data Studio and wait for the “Kernel” and the target context (“Attach to”) to be populated. Set the “Kernel” to **PySpark3** and set **Attach to** to the IP address of your big data cluster endpoint.
1230

13-
1. Run each cell in the Notebook sequentially using Azure Data Studio.
31+
3. Run each cell in the Notebook sequentially.

samples/features/sql-big-data-cluster/spark/spark_to_sql/spark_to_sql_jdbc.ipynb

Lines changed: 32 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -19,7 +19,7 @@
1919
"cells": [
2020
{
2121
"cell_type": "markdown",
22-
"source": "# Read and write from Spark to SQL\r\nA typical big data scenario is large scale ETL in Spark and writing the processed data to SQLServer. The following samples shows \r\n- reading a HDFS file, \r\n- some basic processing on it and \r\n- then processed data to SQL Server table.\r\n\r\nNeed a database precreated in SQL for this sample. Here we are using database name \"MyTestDatabase\" that can be created using SQL statements below.\r\n\r\n``` sql\r\nCreate DATABASE MyTestDatabase\r\nGO \r\n``` \r\n ",
22+
"source": "# Read and write from Spark to SQL\r\nA typical big data scenario is large scale ETL in Spark and post processing the data is written out to SQLServer for access to LOB applications. This sample shows how to write to SQLServer from Spark. The main steps in the sample are \r\n- Reading a HDFS file, \r\n- Basic processing on it and \r\n- Then writing processed data to SQL Server table using JDBC\r\n\r\nPreReq : \r\n- The sample uses a SQL database named \"MyTestDatabase\". Create this before you run this sample. The database can be created as follows\r\n ``` sql\r\n Create DATABASE MyTestDatabase\r\n GO \r\n ``` \r\n- Download [AdultCensusIncome.csv]( https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv ) to your local machine. Create a hdfs folder named spark_data and upload the file there. \r\n\r\n \r\n ",
2323
"metadata": {}
2424
},
2525
{
@@ -28,51 +28,69 @@
2828
"metadata": {},
2929
"outputs": [
3030
{
31-
"output_type": "stream",
3231
"name": "stdout",
33-
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
32+
"text": "Starting Spark application\n",
33+
"output_type": "stream"
34+
},
35+
{
36+
"data": {
37+
"text/plain": "<IPython.core.display.HTML object>",
38+
"text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>2</td><td>application_1554755839506_0003</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"https://40.78.42.207:30443/gateway/default/yarn/proxy/application_1554755839506_0003/\">Link</a></td><td><a target=\"_blank\" href=\"http://storage-0-1.storage-0-svc.mssql-cluster.svc.cluster.local:8042/node/containerlogs/container_1554755839506_0003_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
39+
},
40+
"metadata": {},
41+
"output_type": "display_data"
42+
},
43+
{
44+
"name": "stdout",
45+
"text": "SparkSession available as 'spark'.\n",
46+
"output_type": "stream"
47+
},
48+
{
49+
"name": "stdout",
50+
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
51+
"output_type": "stream"
3452
}
3553
],
36-
"execution_count": 8
54+
"execution_count": 3
3755
},
3856
{
3957
"cell_type": "code",
4058
"source": "\r\n#Process this data. Very simple data cleanup steps. Replacing \"-\" with \"_\" in column names\r\ncolumns_new = [col.replace(\"-\", \"_\") for col in df.columns]\r\ndf = df.toDF(*columns_new)\r\ndf.show(5)\r\n\r\n",
4159
"metadata": {},
4260
"outputs": [
4361
{
44-
"output_type": "stream",
4562
"name": "stdout",
46-
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
63+
"text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
64+
"output_type": "stream"
4765
}
4866
],
49-
"execution_count": 9
67+
"execution_count": 4
5068
},
5169
{
5270
"cell_type": "code",
53-
"source": "#Write from Spark to SQL table using JDBC\r\nprint(\"Use build in JDBC connector to write to SQLServer master instance in Big data \")\r\n\r\nservername = \"jdbc:sqlserver://mssql-master-pool-0.service-master-pool\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\nc = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"****\"\r\n\r\nprint(\"url is \", url)\r\n\r\ntry:\r\n df.write \\\r\n .format(\"jdbc\") \\\r\n .mode(\"overwrite\") \\\r\n .option(\"url\", url) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password)\\\r\n .save()\r\nexcept ValueError as error :\r\n print(\"JDBC Write failed\", error)\r\n\r\nprint(\"JDBC Write done \")\r\n\r\n\r\n",
71+
"source": "#Write from Spark to SQL table using JDBC\r\nprint(\"Use build in JDBC connector to write to SQLServer master instance in Big data \")\r\n\r\nservername = \"jdbc:sqlserver://master-0.master-svc\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\ndbtable = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"Yukon900\"\r\n\r\nprint(\"url is \", url)\r\n\r\ntry:\r\n df.write \\\r\n .format(\"jdbc\") \\\r\n .mode(\"overwrite\") \\\r\n .option(\"url\", url) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password)\\\r\n .save()\r\nexcept ValueError as error :\r\n print(\"JDBC Write failed\", error)\r\n\r\nprint(\"JDBC Write done \")\r\n\r\n\r\n",
5472
"metadata": {},
5573
"outputs": [
5674
{
57-
"output_type": "stream",
5875
"name": "stdout",
59-
"text": "Use build in JDBC connector to write to SQLServer master instance in Big data \nurl is jdbc:sqlserver://mssql-master-pool-0.service-master-pool;databaseName=MyTestDatabase;\nJDBC Write done"
76+
"text": "Use build in JDBC connector to write to SQLServer master instance in Big data \nurl is jdbc:sqlserver://master-0.master-svc;databaseName=MyTestDatabase;\nJDBC Write done",
77+
"output_type": "stream"
6078
}
6179
],
62-
"execution_count": 10
80+
"execution_count": 9
6381
},
6482
{
6583
"cell_type": "code",
6684
"source": "#Read to Spark from SQL table using JDBC\r\nprint(\"read data from SQL server table \")\r\njdbcDF = spark.read \\\r\n .format(\"jdbc\") \\\r\n .option(\"url\", url\r\n ) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password) \\\r\n .load()\r\n\r\njdbcDF.show(5)",
6785
"metadata": {},
6886
"outputs": [
6987
{
70-
"output_type": "stream",
7188
"name": "stdout",
72-
"text": "read data from SQL server table \n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows"
89+
"text": "read data from SQL server table \n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
90+
"output_type": "stream"
7391
}
7492
],
75-
"execution_count": 13
93+
"execution_count": 11
7694
}
7795
]
7896
}
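
The JDBC cell in the notebook diff above builds the connection URL by string concatenation and assigns the password as a string literal. As a minimal sketch, not part of this commit, the same steps could be factored into small helpers that read the password from the environment instead. The function names and the `SQL_PASSWORD` variable are hypothetical, chosen only for illustration; the column cleanup and URL format mirror what the notebook cells do.

```python
import os


def normalize_columns(columns):
    """Replace '-' with '_' in column names, as the notebook's cleanup cell does."""
    return [name.replace("-", "_") for name in columns]


def build_jdbc_url(servername, dbname):
    """Assemble the URL the same way the notebook does: server;databaseName=db;"""
    return servername + ";" + "databaseName=" + dbname + ";"


def jdbc_options(url, dbtable, user, password_env="SQL_PASSWORD"):
    """Collect the JDBC options, taking the password from an environment variable.

    SQL_PASSWORD is a hypothetical variable name used only in this sketch.
    """
    password = os.environ.get(password_env)
    if password is None:
        raise RuntimeError("Set the %s environment variable first" % password_env)
    return {"url": url, "dbtable": dbtable, "user": user, "password": password}
```

In the notebook, the resulting dictionary could be passed along as `df.write.format("jdbc").mode("overwrite").options(**opts).save()`. Where the environment variable has to be set depends on how the Spark session is launched, so treat that part as an assumption to verify on your cluster.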
