The new built-in notebooks in Azure Data Studio enable data scientists and data engineers to run Python, R, Scala, or Spark SQL code against the cluster.
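For example, once a notebook is attached to the cluster, a single PySpark cell is enough to pull data out of HDFS. A minimal sketch (the `spark` session object is provided by the PySpark3 kernel; `/my_data.csv` is a placeholder path):

```python
# Read a CSV file from the cluster's HDFS and preview a few rows
# (sketch only: '/my_data.csv' is a hypothetical path)
df = spark.read.csv('/my_data.csv')
df.show(5)
```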
## Instructions to open a notebook from Azure Data Studio and execute the commands
1. Connect to the SQL Server Master instance in a big data cluster.
1. Right-click on the server name, select **Manage**, switch to the **SQL Server Big Data Cluster** tab, and use **Open Notebook**.
1. Open the notebook in Azure Data Studio and wait for the "Kernel" and the target context ("Attach to") to be populated. Set the "Kernel" to **PySpark3** and set "Attach to" to the IP address of your big data cluster endpoint.
This sample notebook shows how to transform CSV files in HDFS to parquet files.
The sample notebook: `samples/features/sql-big-data-cluster/spark/data-loading/transform-csv-files.ipynb`
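In outline, the flow the notebook implements, from CSV files in HDFS to a parquet-backed Hive table, condenses to a few lines. A sketch, not the notebook itself (`spark` comes from the PySpark3 kernel; `/my_csv_dir` and `my_table` are placeholders):

```python
# Read CSV data from HDFS, inferring column types,
# then persist it as a parquet-backed Hive table
df = spark.read.option("inferSchema", "true").csv('/my_csv_dir')
df.write.format("parquet").mode("overwrite").saveAsTable("my_table")
```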
## Spark sample showing read/write methods

In this sample notebook, we will read a CSV file from HDFS, write it as a parquet file, and save a Hive table definition.
"source": "# Disable saving SUCCESS file\r\nsc._jsc.hadoopConfiguration().set(\"mapreduce.fileoutputcommitter.marksuccessfuljobs\", \"false\") \r\n\r\n# Print the current warehouse directory where the parquet files will be stored\r\nprint(spark.conf.get(\"spark.sql.warehouse.dir\"))\r\n\r\n# Save results as parquet & orc file and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"web_clickstreams\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"web_clickstreams_orc\")",
"source": "# Execute Spark SQL commands\r\nsqlDF = spark.sql(\"SELECT * FROM web_clickstreams LIMIT 100\")\r\nsqlDF.show()\r\n\r\nsqlDF = spark.sql(\"SELECT wcs_user_sk, COUNT(*)\\\r\n FROM web_clickstreams\\\r\n WHERE wcs_user_sk IS NOT NULL\\\r\n GROUP BY wcs_user_sk\\\r\n ORDER BY COUNT(*) DESC LIMIT 100\")\r\nsqlDF.show()",
58
+
"source": "# Disable saving SUCCESS file\r\nsc._jsc.hadoopConfiguration().set(\"mapreduce.fileoutputcommitter.marksuccessfuljobs\", \"false\") \r\n\r\n# Print the current warehouse directory where the parquet files will be stored\r\nprint(spark.conf.get(\"spark.sql.warehouse.dir\"))\r\n\r\n# Save results as parquet & orc file and create hive table\r\nresults.write.format(\"parquet\").mode(\"overwrite\").saveAsTable(\"web_clickstreams\")\r\nresults.write.format(\"orc\").mode(\"overwrite\").saveAsTable(\"web_clickstreams_orc\")",
"source": "# Read the product reviews CSV files into a spark data frame, print schema & top rows\r\nresults = spark.read.option(\"inferSchema\", \"true\").csv('/product_review_data').toDF(\r\n\"pr_review_sk\", \"pr_review_content\"\r\n )\r\nresults.printSchema()\r\nresults.show()",
Output:

```
root
 |-- pr_review_sk: integer (nullable = true)
 |-- pr_review_content: string (nullable = true)

+------------+--------------------+
|pr_review_sk|   pr_review_content|
+------------+--------------------+
|       72621|Works fine. Easy ...|
|       89334|great product to ...|
|       89335|Next time will go...|
|       84259|Great Gift Great ...|
|       84398|After trip to Par...|
|       66434|Simply the best t...|
|       66501|This is the exact...|
|       66587|Not super magnet;...|
|       66680|Installed as bath...|
|       66694|Our home was buil...|
|       84489|Hi ;We are runnin...|
|       79052|Terra cotta is th...|
|       73034|One of my fingern...|
|       73298|We installed thes...|
|       66810|needed silicone c...|
|       66912|Great Gift Great ...|
|       67028|Laguiole knives a...|
|       89770|Good sound timers...|
|       84679|AWESOME FEEDBACK ...|
|       84953|love the retro gl...|
+------------+--------------------+
only showing top 20 rows
```
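The `inferSchema` option makes Spark read through the input once to guess column types. For larger data sets you can avoid that extra pass by declaring the schema up front; a sketch of the same read with an explicit schema (path and column names taken from the cell above):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Declare the two columns instead of inferring them from the data
schema = StructType([
    StructField("pr_review_sk", IntegerType(), True),
    StructField("pr_review_content", StringType(), True),
])
results = spark.read.schema(schema).csv('/product_review_data')
results.printSchema()
```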
```python
# Save results in parquet and orc formats and create hive tables
results.write.format("parquet").mode("overwrite").saveAsTable("product_reviews")
results.write.format("orc").mode("overwrite").saveAsTable("product_reviews_orc")
```

```python
# Execute Spark SQL commands
sqlDF = spark.sql("SELECT pr_review_sk, CHAR_LENGTH(pr_review_content) as len FROM product_reviews LIMIT 100")
sqlDF.show()
```