
Commit 92ae53e

Re-organized spark samples
1 parent 954fb3c

7 files changed

Lines changed: 168 additions & 48 deletions


Lines changed: 16 additions & 10 deletions
@@ -1,23 +1,29 @@
 # SQL Server big data clusters
 
-The new built-in notebooks in Azure Data Studio enables data scientists and data engineers to run Python, R, or Scala code against the cluster.
+The new built-in notebooks in Azure Data Studio enable data scientists and data engineers to run Python, R, Scala, or Spark SQL code against the cluster.
 
-## Instructions to open a notebook from Azure Data Studio
+## Instructions to open a notebook from Azure Data Studio and execute the commands
 
 1. Connect to the SQL Server Master instance in a big data cluster
 
-1. Right-click on the server name, select **Manage**, switch to **SQL Server Big Data Cluster** tab, and use open Notebook
+1. Right-click on the server name, select **Manage**, switch to the **SQL Server Big Data Cluster** tab, and use Open Notebook.
 
-## __[dataloading](dataloading/)__
+1. Open the notebook in Azure Data Studio and wait for the “Kernel” and the target context (“Attach to”) to be populated.
 
-This folder contains samples that show how to load data using Spark.
+1. Run each cell in the Notebook sequentially.
 
-[dataloading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/)
+## __[data-loading](data-loading/)__
 
-## Instructions
+This folder contains samples that show how to load data using Spark and query it using SQL statements.
 
-1. Download and save the notebook file [dataloading/transnform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/) locally.
+[data-loading/transform-csv-files.ipynb](data-loading/transform-csv-files.ipynb)
 
-1. Open the notebook in Azure Data Studio, wait for the “Kernel” and the target context (“Attach to”) to be populated. Set the “Kernel” to **PySpark3** and **Attach to** needs to be the IP address of your big data cluster endpoint.
+This sample notebook shows how to transform CSV files in HDFS to parquet files.
 
-1. Run each cell in the Notebook sequentially.
+[data-loading/spark-sql.ipynb](data-loading/spark-sql.ipynb)
+
+This sample notebook shows how to query Hive tables created from Spark.
+
+## __[data-virtualization](data-virtualization/)__
+
+This folder contains samples that show how to integrate Spark with other data sources.
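
As a minimal PySpark sketch of the CSV-to-parquet flow the data-loading samples demonstrate, based on the cell sources visible in the transform-csv-files.ipynb diff further below (the /product_review_data HDFS path, column names, and table names come from that sample, and `spark` is the SparkSession a PySpark3 notebook session provides):

# Read CSV files from an HDFS directory into a DataFrame, letting Spark infer the schema
results = spark.read.option("inferSchema", "true") \
    .csv("/product_review_data") \
    .toDF("pr_review_sk", "pr_review_content")

results.printSchema()   # confirm the inferred column types
results.show()          # preview the top rows

# Persist the data as parquet (and ORC) files and register Hive tables over them
results.write.format("parquet").mode("overwrite").saveAsTable("product_reviews")
results.write.format("orc").mode("overwrite").saveAsTable("product_reviews_orc")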

samples/features/sql-big-data-cluster/spark/dataloading/hello_PySpark.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-loading/hello_PySpark.ipynb

File renamed without changes.

samples/features/sql-big-data-cluster/spark/dataloading/hello_Scala.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-loading/hello_Scala.ipynb

File renamed without changes.

samples/features/sql-big-data-cluster/spark/dataloading/hello_sparkR.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-loading/hello_sparkR.ipynb

File renamed without changes.

samples/features/sql-big-data-cluster/spark/data-loading/spark-sql.ipynb

Lines changed: 122 additions & 0 deletions
Large diffs are not rendered by default.

samples/features/sql-big-data-cluster/spark/dataloading/transform-csv-files.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-loading/transform-csv-files.ipynb

Lines changed: 30 additions & 38 deletions
@@ -19,7 +19,7 @@
 "cells": [
 {
 "cell_type": "markdown",
-"source": "# Spark sample showing read/write methods\nIn this sample notebook, we will read CSV file from HDFS, write it as parquet file and save a Hive table definition. We will also run some Spark SQL commands using the Hive table.\n",
+"source": "# Spark sample showing read/write methods\nIn this sample notebook, we will read CSV file from HDFS, write it as parquet file and save a Hive table definition.",
 "metadata": {}
 },
 {
@@ -28,71 +28,63 @@
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
32-
"text": "root\n |-- wcs_click_date_sk: integer (nullable = true)\n |-- wcs_click_time_sk: integer (nullable = true)\n |-- wcs_sales_sk: integer (nullable = true)\n |-- wcs_item_sk: integer (nullable = true)\n |-- wcs_web_page_sk: integer (nullable = true)\n |-- wcs_user_sk: integer (nullable = true)\n\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n|wcs_click_date_sk|wcs_click_time_sk|wcs_sales_sk|wcs_item_sk|wcs_web_page_sk|wcs_user_sk|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n| 36890| 40052| null| 4379| 34| null|\n| 36890| 41285| null| 6245| 34| null|\n| 36890| 23115| null| 13852| 34| null|\n| 36890| 17702| null| 15975| 34| null|\n| 36890| 62676| null| 2119| 34| null|\n| 36890| 34267| null| 10273| 34| null|\n| 36890| 8502| null| 17790| 34| null|\n| 36890| 54340| null| 3453| 34| null|\n| 36890| 54370| null| 6372| 34| null|\n| 36890| 6578| null| 17203| 34| null|\n| 36890| 75088| null| 4891| 34| null|\n| 36890| 23922| null| 11332| 34| null|\n| 36890| 28761| null| 4484| 34| null|\n| 36890| 21444| null| 5582| 34| null|\n| 36890| 58917| null| 8833| 34| null|\n| 36890| 27578| null| 8599| 34| null|\n| 36890| 8059| null| 6720| 34| null|\n| 36890| 43008| null| 17175| 34| null|\n| 36890| 4378| null| 10644| 34| null|\n| 36890| 55403| null| 8139| 34| null|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\nonly showing top 20 rows",
-"output_type": "stream"
-}
-],
-"execution_count": 3
-},
-{
-"cell_type": "code",
-"source": "# Disable saving SUCCESS file\r\nsc._jsc.hadoopConfiguration().set(\"mapreduce.fileoutputcommitter.marksuccessfuljobs\", \"false\") \r\n\r\n# Print the current warehouse directory where the parquet files will be stored\r\nprint(spark.conf.get(\"spark.sql.warehouse.dir\"))\r\n\r\n# Save results as parquet & orc file and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"web_clickstreams\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"web_clickstreams_orc\")",
-"metadata": {},
-"outputs": [
+"text": "Starting Spark application\n"
+},
 {
+"output_type": "display_data",
+"data": {
+"text/plain": "<IPython.core.display.HTML object>",
+"text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>0</td><td>application_1555189187089_0001</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"http://master-0.master-svc:8088/proxy/application_1555189187089_0001/\">Link</a></td><td><a target=\"_blank\" href=\"http://storage-0-1.storage-0-svc.demo-ctp25.svc.cluster.local:8042/node/containerlogs/container_1555189187089_0001_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
+},
+"metadata": {}
+},
+{
+"output_type": "stream",
 "name": "stdout",
-"text": "hdfs:///user/hive/warehouse",
-"output_type": "stream"
+"text": "SparkSession available as 'spark'.\n"
+},
+{
+"output_type": "stream",
+"name": "stdout",
51+
"text": "root\n |-- wcs_click_date_sk: integer (nullable = true)\n |-- wcs_click_time_sk: integer (nullable = true)\n |-- wcs_sales_sk: integer (nullable = true)\n |-- wcs_item_sk: integer (nullable = true)\n |-- wcs_web_page_sk: integer (nullable = true)\n |-- wcs_user_sk: integer (nullable = true)\n\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n|wcs_click_date_sk|wcs_click_time_sk|wcs_sales_sk|wcs_item_sk|wcs_web_page_sk|wcs_user_sk|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n| 36890| 40052| null| 4379| 34| null|\n| 36890| 41285| null| 6245| 34| null|\n| 36890| 23115| null| 13852| 34| null|\n| 36890| 17702| null| 15975| 34| null|\n| 36890| 62676| null| 2119| 34| null|\n| 36890| 34267| null| 10273| 34| null|\n| 36890| 8502| null| 17790| 34| null|\n| 36890| 54340| null| 3453| 34| null|\n| 36890| 54370| null| 6372| 34| null|\n| 36890| 6578| null| 17203| 34| null|\n| 36890| 75088| null| 4891| 34| null|\n| 36890| 23922| null| 11332| 34| null|\n| 36890| 28761| null| 4484| 34| null|\n| 36890| 21444| null| 5582| 34| null|\n| 36890| 58917| null| 8833| 34| null|\n| 36890| 27578| null| 8599| 34| null|\n| 36890| 8059| null| 6720| 34| null|\n| 36890| 43008| null| 17175| 34| null|\n| 36890| 4378| null| 10644| 34| null|\n| 36890| 55403| null| 8139| 34| null|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\nonly showing top 20 rows"
 }
 ],
-"execution_count": 4
+"execution_count": 2
 },
 {
 "cell_type": "code",
53-
"source": "# Execute Spark SQL commands\r\nsqlDF = spark.sql(\"SELECT * FROM web_clickstreams LIMIT 100\")\r\nsqlDF.show()\r\n\r\nsqlDF = spark.sql(\"SELECT wcs_user_sk, COUNT(*)\\\r\n FROM web_clickstreams\\\r\n WHERE wcs_user_sk IS NOT NULL\\\r\n GROUP BY wcs_user_sk\\\r\n ORDER BY COUNT(*) DESC LIMIT 100\")\r\nsqlDF.show()",
+"source": "# Disable saving SUCCESS file\r\nsc._jsc.hadoopConfiguration().set(\"mapreduce.fileoutputcommitter.marksuccessfuljobs\", \"false\") \r\n\r\n# Print the current warehouse directory where the parquet files will be stored\r\nprint(spark.conf.get(\"spark.sql.warehouse.dir\"))\r\n\r\n# Save results as parquet & orc file and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"web_clickstreams\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"web_clickstreams_orc\")",
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
58-
"text": "+-----------------+-----------------+------------+-----------+---------------+-----------+\n|wcs_click_date_sk|wcs_click_time_sk|wcs_sales_sk|wcs_item_sk|wcs_web_page_sk|wcs_user_sk|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\n| 37506| 7933| null| 1384| 2| 39437|\n| 37506| 56044| null| 14689| 2| 26419|\n| 37506| 52706| null| 8541| 2| 44016|\n| 37506| 67325| null| 16129| 2| 83371|\n| 37506| 84857| null| 1869| 2| 13090|\n| 37506| 49599| null| 2994| 2| 8940|\n| 37506| 78150| null| 11392| 2| 65633|\n| 37506| 38720| null| 14366| 2| 22281|\n| 37506| 79915| null| 11102| 2| 81755|\n| 37506| 67253| null| 5380| 2| 46868|\n| 37506| 6507| null| 6813| 2| 49363|\n| 37506| 18280| null| 1458| 2| 49363|\n| 37506| 72258| null| 2869| 2| 67756|\n| 37506| 8045| null| 615| 2| 86035|\n| 37506| 86164| null| 7000| 2| 94821|\n| 37506| 29724| null| 2767| 2| 94821|\n| 37506| 55471| null| 3584| 2| 62792|\n| 37506| 677| null| 1720| 2| 27212|\n| 37506| 66638| null| 9898| 2| 20370|\n| 37506| 48515| null| 9394| 2| 17157|\n+-----------------+-----------------+------------+-----------+---------------+-----------+\nonly showing top 20 rows\n\n+-----------+--------+\n|wcs_user_sk|count(1)|\n+-----------+--------+\n| 65042| 832|\n| 55928| 821|\n| 15570| 791|\n| 31138| 788|\n| 68188| 784|\n| 88205| 760|\n| 15678| 757|\n| 48063| 741|\n| 77518| 741|\n| 92978| 728|\n| 82129| 727|\n| 21700| 725|\n| 69707| 724|\n| 38895| 719|\n| 97643| 716|\n| 74426| 707|\n| 7813| 704|\n| 49528| 700|\n| 55766| 698|\n| 54355| 697|\n+-----------+--------+\nonly showing top 20 rows",
-"output_type": "stream"
+"text": "hdfs:///user/hive/warehouse"
 }
 ],
-"execution_count": 5
+"execution_count": 3
 },
 {
 "cell_type": "code",
"source": "# Read the product reviews CSV files into a spark data frame, print schema & top rows\r\nresults = spark.read.option(\"inferSchema\", \"true\").csv('/product_review_data').toDF(\r\n \"pr_review_sk\", \"pr_review_content\"\r\n )\r\nresults.printSchema()\r\nresults.show()",
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
71-
"text": "root\n |-- pr_review_sk: integer (nullable = true)\n |-- pr_review_content: string (nullable = true)\n\n+------------+--------------------+\n|pr_review_sk| pr_review_content|\n+------------+--------------------+\n| 72621|Works fine. Easy ...|\n| 89334|great product to ...|\n| 89335|Next time will go...|\n| 84259|Great Gift Great ...|\n| 84398|After trip to Par...|\n| 66434|Simply the best t...|\n| 66501|This is the exact...|\n| 66587|Not super magnet;...|\n| 66680|Installed as bath...|\n| 66694|Our home was buil...|\n| 84489|Hi ;We are runnin...|\n| 79052|Terra cotta is th...|\n| 73034|One of my fingern...|\n| 73298|We installed thes...|\n| 66810|needed silicone c...|\n| 66912|Great Gift Great ...|\n| 67028|Laguiole knives a...|\n| 89770|Good sound timers...|\n| 84679|AWESOME FEEDBACK ...|\n| 84953|love the retro gl...|\n+------------+--------------------+\nonly showing top 20 rows",
-"output_type": "stream"
77+
"text": "root\n |-- pr_review_sk: integer (nullable = true)\n |-- pr_review_content: string (nullable = true)\n\n+------------+--------------------+\n|pr_review_sk| pr_review_content|\n+------------+--------------------+\n| 72621|Works fine. Easy ...|\n| 89334|great product to ...|\n| 89335|Next time will go...|\n| 84259|Great Gift Great ...|\n| 84398|After trip to Par...|\n| 66434|Simply the best t...|\n| 66501|This is the exact...|\n| 66587|Not super magnet;...|\n| 66680|Installed as bath...|\n| 66694|Our home was buil...|\n| 84489|Hi ;We are runnin...|\n| 79052|Terra cotta is th...|\n| 73034|One of my fingern...|\n| 73298|We installed thes...|\n| 66810|needed silicone c...|\n| 66912|Great Gift Great ...|\n| 67028|Laguiole knives a...|\n| 89770|Good sound timers...|\n| 84679|AWESOME FEEDBACK ...|\n| 84953|love the retro gl...|\n+------------+--------------------+\nonly showing top 20 rows"
 }
 ],
-"execution_count": 6
+"execution_count": 5
 },
 {
 "cell_type": "code",
-"source": "# Save results as parquet, and orc formats and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"product_reviews\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"product_reviews_orc\")\r\n",
+"source": "# Save results as parquet, and orc formats and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"product_reviews\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"product_reviews_orc\")",
 "metadata": {},
 "outputs": [],
-"execution_count": 7
-},
-{
-"cell_type": "code",
-"source": "# Execute Spark SQL commands\r\nsqlDF = spark.sql(\"SELECT pr_review_sk, CHAR_LENGTH(pr_review_content) as len FROM product_reviews LIMIT 100\")\r\nsqlDF.show()",
-"metadata": {},
-"outputs": [
-{
-"name": "stdout",
91-
"text": "+------------+----+\n|pr_review_sk| len|\n+------------+----+\n| 26035| 876|\n| 26037| 109|\n| 26038| 478|\n| 26041| 106|\n| 26043| 332|\n| 26044| 487|\n| 26045| 428|\n| 26048| 87|\n| 26049| 118|\n| 26051|2906|\n| 26053| 464|\n| 26054| 212|\n| 26059| 191|\n| 26060| 207|\n| 26061| 515|\n| 26063| 59|\n| 26069| 487|\n| 26070| 160|\n| 26071| 380|\n| 26072| 234|\n+------------+----+\nonly showing top 20 rows",
-"output_type": "stream"
-}
-],
-"execution_count": 8
+"execution_count": 6
 }
 ]
-}}
+}
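
The Spark SQL cells removed in this diff queried the Hive tables created by the saveAsTable steps; the README above now points readers to data-loading/spark-sql.ipynb for that part (its diff is not rendered here). A minimal sketch of the removed queries, assuming the web_clickstreams and product_reviews Hive tables already exist and `spark` is the notebook's SparkSession:

# Top 100 users by click count from the web_clickstreams Hive table
top_users = spark.sql("""
    SELECT wcs_user_sk, COUNT(*) AS click_count
    FROM web_clickstreams
    WHERE wcs_user_sk IS NOT NULL
    GROUP BY wcs_user_sk
    ORDER BY COUNT(*) DESC
    LIMIT 100
""")
top_users.show()

# Review length for a sample of rows from the product_reviews Hive table
review_lengths = spark.sql("""
    SELECT pr_review_sk, CHAR_LENGTH(pr_review_content) AS len
    FROM product_reviews
    LIMIT 100
""")
review_lengths.show()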

samples/features/sql-big-data-cluster/spark/spark_to_sql/spark_to_sql_jdbc.ipynb renamed to samples/features/sql-big-data-cluster/spark/data-virtualization/spark_to_sql_jdbc.ipynb

File renamed without changes.
