
Commit 92ae53e

Re-organized spark samples
1 parent 954fb3c

7 files changed

Lines changed: 168 additions & 48 deletions


Lines changed: 16 additions & 10 deletions
@@ -1,23 +1,29 @@
 # SQL Server big data clusters
 
-The new built-in notebooks in Azure Data Studio enables data scientists and data engineers to run Python, R, or Scala code against the cluster.
+The new built-in notebooks in Azure Data Studio enable data scientists and data engineers to run Python, R, Scala, or Spark SQL code against the cluster.
 
-## Instructions to open a notebook from Azure Data Studio
+## Instructions to open a notebook from Azure Data Studio and execute the commands
 
 1. Connect to the SQL Server Master instance in a big data cluster
 
-1. Right-click on the server name, select **Manage**, switch to **SQL Server Big Data Cluster** tab, and use open Notebook
+1. Right-click on the server name, select **Manage**, switch to the **SQL Server Big Data Cluster** tab, and use Open Notebook.
 
-## __[dataloading](dataloading/)__
+1. Open the notebook in Azure Data Studio and wait for the “Kernel” and the target context (“Attach to”) to be populated.
 
-This folder contains samples that show how to load data using Spark.
+1. Run each cell in the Notebook sequentially.
 
-[dataloading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/)
+## __[data-loading](data-loading/)__
 
-## Instructions
+This folder contains samples that show how to load data using Spark and query it using SQL statements.
 
-1. Download and save the notebook file [dataloading/transnform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/) locally.
+[data-loading/transform-csv-files.ipynb](data-loading/transform-csv-files.ipynb)
 
-1. Open the notebook in Azure Data Studio, wait for the “Kernel” and the target context (“Attach to”) to be populated. Set the “Kernel” to **PySpark3** and **Attach to** needs to be the IP address of your big data cluster endpoint.
+This sample notebook shows how to transform CSV files in HDFS to parquet files.
 
-1. Run each cell in the Notebook sequentially.
+[data-loading/spark-sql.ipynb](data-loading/spark-sql.ipynb)
+
+This sample notebook shows how to query Hive tables created from Spark.
+
+## __[data-virtualization](data-virtualization/)__
+
+This folder contains samples that show how to integrate Spark with other data sources.
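
As a minimal PySpark sketch of the CSV-to-parquet flow the data-loading samples demonstrate, based on the cell sources visible in the transform-csv-files.ipynb diff further below (the /product_review_data HDFS path, column names, and table names come from that sample, and `spark` is the SparkSession a PySpark3 notebook session provides):

# Read CSV files from an HDFS directory into a DataFrame, letting Spark infer the schema
results = spark.read.option("inferSchema", "true") \
    .csv("/product_review_data") \
    .toDF("pr_review_sk", "pr_review_content")

results.printSchema()   # confirm the inferred column types
results.show()          # preview the top rows

# Persist the data as parquet (and ORC) files and register Hive tables over them
results.write.format("parquet").mode("overwrite").saveAsTable("product_reviews")
results.write.format("orc").mode("overwrite").saveAsTable("product_reviews_orc")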

samples/features/sql-big-data-cluster/spark/dataloading/hello_PySpark.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-loading/hello_PySpark.ipynb

File renamed without changes.

samples/features/sql-big-data-cluster/spark/dataloading/hello_Scala.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-loading/hello_Scala.ipynb

File renamed without changes.

samples/features/sql-big-data-cluster/spark/dataloading/hello_sparkR.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-loading/hello_sparkR.ipynb

File renamed without changes.

samples/features/sql-big-data-cluster/spark/data-loading/spark-sql.ipynb

Lines changed: 122 additions & 0 deletions
Large diffs are not rendered by default.

samples/features/sql-big-data-cluster/spark/dataloading/transform-csv-files.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-loading/transform-csv-files.ipynb

Lines changed: 30 additions & 38 deletions
@@ -19,7 +19,7 @@
 "cells": [
 {
 "cell_type": "markdown",
-"source": "# Spark sample showing read/write methods\nIn this sample notebook, we will read CSV file from HDFS, write it as parquet file and save a Hive table definition. We will also run some Spark SQL commands using the Hive table.\n",
+"source": "# Spark sample showing read/write methods\nIn this sample notebook, we will read CSV file from HDFS, write it as parquet file and save a Hive table definition.",
 "metadata": {}
 },
 {
@@ -28,71 +28,63 @@
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
32-
"text": "root\n |-- wcs_click_date_sk: integer (nullable = true)\n |-- wcs_click_time_sk: integer (nullable = true)\n |-- wcs_sales_sk: integer (nullable = true)\n |-- wcs_item_sk: integer (nullable = true)\n |-- wcs_web_page_sk: integer (nullable = true)\n |-- wcs_user_sk: integer (nullable = true)\n\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n|wcs_click_date_sk|wcs_click_time_sk|wcs_sales_sk|wcs_item_sk|wcs_web_page_sk|wcs_user_sk|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n| 36890| 40052| null| 4379| 34| null|\n| 36890| 41285| null| 6245| 34| null|\n| 36890| 23115| null| 13852| 34| null|\n| 36890| 17702| null| 15975| 34| null|\n| 36890| 62676| null| 2119| 34| null|\n| 36890| 34267| null| 10273| 34| null|\n| 36890| 8502| null| 17790| 34| null|\n| 36890| 54340| null| 3453| 34| null|\n| 36890| 54370| null| 6372| 34| null|\n| 36890| 6578| null| 17203| 34| null|\n| 36890| 75088| null| 4891| 34| null|\n| 36890| 23922| null| 11332| 34| null|\n| 36890| 28761| null| 4484| 34| null|\n| 36890| 21444| null| 5582| 34| null|\n| 36890| 58917| null| 8833| 34| null|\n| 36890| 27578| null| 8599| 34| null|\n| 36890| 8059| null| 6720| 34| null|\n| 36890| 43008| null| 17175| 34| null|\n| 36890| 4378| null| 10644| 34| null|\n| 36890| 55403| null| 8139| 34| null|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\nonly showing top 20 rows",
-"output_type": "stream"
-}
-],
-"execution_count": 3
-},
-{
-"cell_type": "code",
-"source": "# Disable saving SUCCESS file\r\nsc._jsc.hadoopConfiguration().set(\"mapreduce.fileoutputcommitter.marksuccessfuljobs\", \"false\") \r\n\r\n# Print the current warehouse directory where the parquet files will be stored\r\nprint(spark.conf.get(\"spark.sql.warehouse.dir\"))\r\n\r\n# Save results as parquet & orc file and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"web_clickstreams\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"web_clickstreams_orc\")",
-"metadata": {},
-"outputs": [
+"text": "Starting Spark application\n"
+},
 {
+"output_type": "display_data",
+"data": {
+"text/plain": "<IPython.core.display.HTML object>",
+"text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>0</td><td>application_1555189187089_0001</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"http://master-0.master-svc:8088/proxy/application_1555189187089_0001/\">Link</a></td><td><a target=\"_blank\" href=\"http://storage-0-1.storage-0-svc.demo-ctp25.svc.cluster.local:8042/node/containerlogs/container_1555189187089_0001_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
+},
+"metadata": {}
+},
+{
+"output_type": "stream",
 "name": "stdout",
-"text": "hdfs:///user/hive/warehouse",
-"output_type": "stream"
+"text": "SparkSession available as 'spark'.\n"
+},
+{
+"output_type": "stream",
+"name": "stdout",
51+
"text": "root\n |-- wcs_click_date_sk: integer (nullable = true)\n |-- wcs_click_time_sk: integer (nullable = true)\n |-- wcs_sales_sk: integer (nullable = true)\n |-- wcs_item_sk: integer (nullable = true)\n |-- wcs_web_page_sk: integer (nullable = true)\n |-- wcs_user_sk: integer (nullable = true)\n\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n|wcs_click_date_sk|wcs_click_time_sk|wcs_sales_sk|wcs_item_sk|wcs_web_page_sk|wcs_user_sk|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n| 36890| 40052| null| 4379| 34| null|\n| 36890| 41285| null| 6245| 34| null|\n| 36890| 23115| null| 13852| 34| null|\n| 36890| 17702| null| 15975| 34| null|\n| 36890| 62676| null| 2119| 34| null|\n| 36890| 34267| null| 10273| 34| null|\n| 36890| 8502| null| 17790| 34| null|\n| 36890| 54340| null| 3453| 34| null|\n| 36890| 54370| null| 6372| 34| null|\n| 36890| 6578| null| 17203| 34| null|\n| 36890| 75088| null| 4891| 34| null|\n| 36890| 23922| null| 11332| 34| null|\n| 36890| 28761| null| 4484| 34| null|\n| 36890| 21444| null| 5582| 34| null|\n| 36890| 58917| null| 8833| 34| null|\n| 36890| 27578| null| 8599| 34| null|\n| 36890| 8059| null| 6720| 34| null|\n| 36890| 43008| null| 17175| 34| null|\n| 36890| 4378| null| 10644| 34| null|\n| 36890| 55403| null| 8139| 34| null|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\nonly showing top 20 rows"
 }
 ],
-"execution_count": 4
+"execution_count": 2
 },
 {
 "cell_type": "code",
53-
"source": "# Execute Spark SQL commands\r\nsqlDF = spark.sql(\"SELECT * FROM web_clickstreams LIMIT 100\")\r\nsqlDF.show()\r\n\r\nsqlDF = spark.sql(\"SELECT wcs_user_sk, COUNT(*)\\\r\n FROM web_clickstreams\\\r\n WHERE wcs_user_sk IS NOT NULL\\\r\n GROUP BY wcs_user_sk\\\r\n ORDER BY COUNT(*) DESC LIMIT 100\")\r\nsqlDF.show()",
+"source": "# Disable saving SUCCESS file\r\nsc._jsc.hadoopConfiguration().set(\"mapreduce.fileoutputcommitter.marksuccessfuljobs\", \"false\") \r\n\r\n# Print the current warehouse directory where the parquet files will be stored\r\nprint(spark.conf.get(\"spark.sql.warehouse.dir\"))\r\n\r\n# Save results as parquet & orc file and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"web_clickstreams\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"web_clickstreams_orc\")",
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
58-
"text": "+-----------------+-----------------+------------+-----------+---------------+-----------+\n|wcs_click_date_sk|wcs_click_time_sk|wcs_sales_sk|wcs_item_sk|wcs_web_page_sk|wcs_user_sk|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n| 37506| 7933| null| 1384| 2| 39437|\n| 37506| 56044| null| 14689| 2| 26419|\n| 37506| 52706| null| 8541| 2| 44016|\n| 37506| 67325| null| 16129| 2| 83371|\n| 37506| 84857| null| 1869| 2| 13090|\n| 37506| 49599| null| 2994| 2| 8940|\n| 37506| 78150| null| 11392| 2| 65633|\n| 37506| 38720| null| 14366| 2| 22281|\n| 37506| 79915| null| 11102| 2| 81755|\n| 37506| 67253| null| 5380| 2| 46868|\n| 37506| 6507| null| 6813| 2| 49363|\n| 37506| 18280| null| 1458| 2| 49363|\n| 37506| 72258| null| 2869| 2| 67756|\n| 37506| 8045| null| 615| 2| 86035|\n| 37506| 86164| null| 7000| 2| 94821|\n| 37506| 29724| null| 2767| 2| 94821|\n| 37506| 55471| null| 3584| 2| 62792|\n| 37506| 677| null| 1720| 2| 27212|\n| 37506| 66638| null| 9898| 2| 20370|\n| 37506| 48515| null| 9394| 2| 17157|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\nonly showing top 20 rows\n\n+-----------+--------+\n|wcs_user_sk|count(1)|\n+-----------+--------+\n| 65042| 832|\n| 55928| 821|\n| 15570| 791|\n| 31138| 788|\n| 68188| 784|\n| 88205| 760|\n| 15678| 757|\n| 48063| 741|\n| 77518| 741|\n| 92978| 728|\n| 82129| 727|\n| 21700| 725|\n| 69707| 724|\n| 38895| 719|\n| 97643| 716|\n| 74426| 707|\n| 7813| 704|\n| 49528| 700|\n| 55766| 698|\n| 54355| 697|\n+-----------+--------+\nonly showing top 20 rows",
-"output_type": "stream"
+"text": "hdfs:///user/hive/warehouse"
 }
 ],
-"execution_count": 5
+"execution_count": 3
 },
 {
 "cell_type": "code",
"source": "# Read the product reviews CSV files into a spark data frame, print schema & top rows\r\nresults = spark.read.option(\"inferSchema\", \"true\").csv('/product_review_data').toDF(\r\n \"pr_review_sk\", \"pr_review_content\"\r\n )\r\nresults.printSchema()\r\nresults.show()",
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
71-
"text": "root\n |-- pr_review_sk: integer (nullable = true)\n |-- pr_review_content: string (nullable = true)\n\n+------------+--------------------+\n|pr_review_sk| pr_review_content|\n+------------+--------------------+\n| 72621|Works fine. Easy ...|\n| 89334|great product to ...|\n| 89335|Next time will go...|\n| 84259|Great Gift Great ...|\n| 84398|After trip to Par...|\n| 66434|Simply the best t...|\n| 66501|This is the exact...|\n| 66587|Not super magnet;...|\n| 66680|Installed as bath...|\n| 66694|Our home was buil...|\n| 84489|Hi ;We are runnin...|\n| 79052|Terra cotta is th...|\n| 73034|One of my fingern...|\n| 73298|We installed thes...|\n| 66810|needed silicone c...|\n| 66912|Great Gift Great ...|\n| 67028|Laguiole knives a...|\n| 89770|Good sound timers...|\n| 84679|AWESOME FEEDBACK ...|\n| 84953|love the retro gl...|\n+------------+--------------------+\nonly showing top 20 rows",
-"output_type": "stream"
77+
"text": "root\n |-- pr_review_sk: integer (nullable = true)\n |-- pr_review_content: string (nullable = true)\n\n+------------+--------------------+\n|pr_review_sk| pr_review_content|\n+------------+--------------------+\n| 72621|Works fine. Easy ...|\n| 89334|great product to ...|\n| 89335|Next time will go...|\n| 84259|Great Gift Great ...|\n| 84398|After trip to Par...|\n| 66434|Simply the best t...|\n| 66501|This is the exact...|\n| 66587|Not super magnet;...|\n| 66680|Installed as bath...|\n| 66694|Our home was buil...|\n| 84489|Hi ;We are runnin...|\n| 79052|Terra cotta is th...|\n| 73034|One of my fingern...|\n| 73298|We installed thes...|\n| 66810|needed silicone c...|\n| 66912|Great Gift Great ...|\n| 67028|Laguiole knives a...|\n| 89770|Good sound timers...|\n| 84679|AWESOME FEEDBACK ...|\n| 84953|love the retro gl...|\n+------------+--------------------+\nonly showing top 20 rows"
 }
 ],
-"execution_count": 6
+"execution_count": 5
 },
 {
 "cell_type": "code",
-"source": "# Save results as parquet, and orc formats and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"product_reviews\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"product_reviews_orc\")\r\n",
+"source": "# Save results as parquet, and orc formats and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"product_reviews\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"product_reviews_orc\")",
 "metadata": {},
 "outputs": [],
-"execution_count": 7
-},
-{
-"cell_type": "code",
-"source": "# Execute Spark SQL commands\r\nsqlDF = spark.sql(\"SELECT pr_review_sk, CHAR_LENGTH(pr_review_content) as len FROM product_reviews LIMIT 100\")\r\nsqlDF.show()",
-"metadata": {},
-"outputs": [
-{
-"name": "stdout",
91-
"text": "+------------+----+\n|pr_review_sk| len|\n+------------+----+\n| 26035| 876|\n| 26037| 109|\n| 26038| 478|\n| 26041| 106|\n| 26043| 332|\n| 26044| 487|\n| 26045| 428|\n| 26048| 87|\n| 26049| 118|\n| 26051|2906|\n| 26053| 464|\n| 26054| 212|\n| 26059| 191|\n| 26060| 207|\n| 26061| 515|\n| 26063| 59|\n| 26069| 487|\n| 26070| 160|\n| 26071| 380|\n| 26072| 234|\n+------------+----+\nonly showing top 20 rows",
-"output_type": "stream"
-}
-],
-"execution_count": 8
+"execution_count": 6
 }
 ]
-}}
+}
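
The Spark SQL cells removed in this diff queried the Hive tables created by the saveAsTable steps; the README above now points readers to data-loading/spark-sql.ipynb for that part (its diff is not rendered here). A minimal sketch of the removed queries, assuming the web_clickstreams and product_reviews Hive tables already exist and `spark` is the notebook's SparkSession:

# Top 100 users by click count from the web_clickstreams Hive table
top_users = spark.sql("""
    SELECT wcs_user_sk, COUNT(*) AS click_count
    FROM web_clickstreams
    WHERE wcs_user_sk IS NOT NULL
    GROUP BY wcs_user_sk
    ORDER BY COUNT(*) DESC
    LIMIT 100
""")
top_users.show()

# Review length for a sample of rows from the product_reviews Hive table
review_lengths = spark.sql("""
    SELECT pr_review_sk, CHAR_LENGTH(pr_review_content) AS len
    FROM product_reviews
    LIMIT 100
""")
review_lengths.show()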

samples/features/sql-big-data-cluster/spark/spark_to_sql/spark_to_sql_jdbc.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-virtualization/spark_to_sql_jdbc.ipynb

File renamed without changes.
