
Commit e3d1d3a

Merge pull request #578 from shivsood/CTP31_CR
CTP3.1 Update for samples
2 parents ac3639b + c3cd034 commit e3d1d3a

9 files changed: 549 additions & 103 deletions

File tree

samples/features/sql-big-data-cluster/spark/README.md

Lines changed: 11 additions & 7 deletions
@@ -2,7 +2,7 @@
 
 SQL Server Big Data cluster bundles Spark and HDFS together with SQL server. Azure Data Studio IDE provides built in notebooks that enables data scientists and data engineers to run Spark notebooks and job in Python, R, or Scala code against the Big Data Cluster. This folder contains spark sample notebook on using Spark in SQL server Big data cluster
 
-## Folder contents
+## Contents
 
 [PySpark Hello World](dataloading/hello_PySpark.ipynb)
 
@@ -14,14 +14,18 @@ SQL Server Big Data cluster bundles Spark and HDFS together with SQL server. Azu
 
 [Data Transfer - Spark to SQL using Spark JDBC connector](data-virtualization/spark_to_sql_jdbc.ipynb/)
 
-[Data Transfer - Spark to SQL using MSSQL Spark connector](spark_to_sql/mssql_spark_connector.ipynb/)
+[Data Transfer - Spark to SQL using MSSQL Spark connector](data-virtualization/mssql_spark_connector.ipynb/)
 
-## Instructions on how to run in Azure Data Studio
+[Configure - Configure a Spark session using a notebook](config-install/configure_spark_session.ipynb/)
+
+[Install - Install 3rd-party packages](config-install/installpackage_Spark.ipynb/)
 
-[data-loading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/)
+[Restful-Access - Access Spark in BDC via RESTful Livy APIs](restful-api-accessn/accessing_spark_via_livy.ipynb/)
+
+## Instructions on how to run in Azure Data Studio
 
-2. From Azure Data Studio Connect to the SQL Server Master instance in a big data cluster.
+1. From Azure Data Studio, connect to the SQL Server master instance in a big data cluster.
 
-3. Right-click on the server name, select **Manage**, switch to **SQL Server Big Data Cluster** tab, and open the notebook in Azure Data Studio. Wait for the “Kernel” and the target context (“Attach to”) to be populated. If required set the relevant “Kernel” ( e.g **PySpark3** ) and **Attach to** needs to be the IP address of your big data cluster endpoint.
+2. Right-click the server name, select **Manage**, switch to the **SQL Server Big Data Cluster** tab, and open the notebook. Wait for the “Kernel” and the target context (“Attach to”) to be populated. If required, set the relevant “Kernel” (e.g. **PySpark3**); **Attach to** should be the IP address of your big data cluster endpoint.
 
-4. Run each cell in the Notebook sequentially.
+3. Run each cell in the notebook sequentially.
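The Restful-Access sample added above drives Spark in the cluster through its Livy endpoint. As a minimal sketch of what such a call carries, the snippet below only builds the JSON body of a Livy batch submission; the gateway URL, HDFS script path, and credentials are illustrative placeholders, and the actual authenticated POST is left commented out:

```python
import json

# Hypothetical Knox gateway endpoint for a Big Data Cluster; substitute
# your own cluster's address and Livy path.
livy_url = "https://<knox-gateway>:30443/gateway/default/livy/v1/batches"

# Livy batch body: run a PySpark script stored in HDFS with a small
# per-job Spark configuration ("file" and "conf" are Livy API fields).
payload = {
    "file": "/jobs/my_job.py",  # hypothetical HDFS path
    "conf": {
        "spark.executor.memory": "2g",
        "spark.executor.instances": 2,
    },
}

# The actual submission would be an authenticated POST, e.g. with requests:
#   requests.post(livy_url, data=json.dumps(payload),
#                 headers={"Content-Type": "application/json"},
#                 auth=("<user>", "<password>"), verify=False)
print(json.dumps(payload))
```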
samples/features/sql-big-data-cluster/spark/config-install/configure_spark_session.ipynb

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
{
  "metadata": {
    "kernelspec": {
      "name": "pyspark3kernel",
      "display_name": "PySpark3"
    },
    "language_info": {
      "name": "pyspark3",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "python",
        "version": 3
      },
      "pygments_lexer": "python3"
    }
  },
  "nbformat_minor": 2,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "markdown",
      "source": "# Configuring a Spark session using configure -f\r\nRefer to [Spark Configurations](https://spark.apache.org/docs/latest/configuration.html) for specific parameters",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "%%configure -f\r\n{\"conf\": {\r\n \"spark.executor.memory\": \"4g\",\r\n \"spark.driver.memory\": \"4g\",\r\n \"spark.executor.cores\": 2,\r\n \"spark.driver.cores\": 1,\r\n \"spark.executor.instances\": 4\r\n }\r\n}",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": "<IPython.core.display.HTML object>",
            "text/html": "Current session configs: <tt>{'conf': {'spark.executor.memory': '4g', 'spark.driver.memory': '4g', 'spark.executor.cores': 2, 'spark.driver.cores': 1, 'spark.executor.instances': 4}, 'kind': 'pyspark3'}</tt><br>"
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": "<IPython.core.display.HTML object>",
            "text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>93</td><td>application_1558765999724_0190</td><td>pyspark</td><td>idle</td><td><a target=\"_blank\" href=\"https://10.193.16.144:30443/gateway/default/yarn/proxy/application_1558765999724_0190/\">Link</a></td><td><a target=\"_blank\" href=\"http://storage-0-1.storage-0-svc.test.svc.cluster.local:8042/node/containerlogs/container_1558765999724_0190_01_000001/root\">Link</a></td><td></td></tr></table>"
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "execution_count": 3
    },
    {
      "cell_type": "code",
      "source": "datafile = \"/spark_data/AdultCensusIncome.csv\"\r\ndf = spark.read.format('csv').options(header='true', inferSchema='true').load(datafile)\r\n\r\ndf.show(5)",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "text": "Starting Spark application\n",
          "output_type": "stream"
        },
        {
          "data": {
            "text/plain": "<IPython.core.display.HTML object>",
            "text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>96</td><td>application_1558765999724_0193</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"https://10.193.16.144:30443/gateway/default/yarn/proxy/application_1558765999724_0193/\">Link</a></td><td><a target=\"_blank\" href=\"http://storage-0-0.storage-0-svc.test.svc.cluster.local:8042/node/containerlogs/container_1558765999724_0193_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "text": "SparkSession available as 'spark'.\n",
          "output_type": "stream"
        },
        {
          "name": "stdout",
          "text": "+---+-----------------+--------+----------+--------------+-------------------+------------------+--------------+------+-------+-------------+-------------+---------------+---------------+-------+\n|age| workclass| fnlwgt| education| education-num| marital-status| occupation| relationship| race| sex| capital-gain| capital-loss| hours-per-week| native-country| income|\n+---+-----------------+--------+----------+--------------+-------------------+------------------+--------------+------+-------+-------------+-------------+---------------+---------------+-------+\n| 39| State-gov| 77516.0| Bachelors| 13.0| Never-married| Adm-clerical| Not-in-family| White| Male| 2174.0| 0.0| 40.0| United-States| <=50K|\n| 50| Self-emp-not-inc| 83311.0| Bachelors| 13.0| Married-civ-spouse| Exec-managerial| Husband| White| Male| 0.0| 0.0| 13.0| United-States| <=50K|\n| 38| Private|215646.0| HS-grad| 9.0| Divorced| Handlers-cleaners| Not-in-family| White| Male| 0.0| 0.0| 40.0| United-States| <=50K|\n| 53| Private|234721.0| 11th| 7.0| Married-civ-spouse| Handlers-cleaners| Husband| Black| Male| 0.0| 0.0| 40.0| United-States| <=50K|\n| 28| Private|338409.0| Bachelors| 13.0| Married-civ-spouse| Prof-specialty| Wife| Black| Female| 0.0| 0.0| 40.0| Cuba| <=50K|\n+---+-----------------+--------+----------+--------------+-------------------+------------------+--------------+------+-------+-------------+-------------+---------------+---------------+-------+\nonly showing top 5 rows",
          "output_type": "stream"
        }
      ],
      "execution_count": 4
    },
    {
      "cell_type": "code",
      "source": "from pyspark import SparkConf\r\nfrom pyspark.sql import SparkSession\r\n\r\ndef isConfiguredItem(cfg_items):\r\n    if(cfg_items == 'spark.executor.instances' or cfg_items == 'spark.executor.memory' or \\\r\n       cfg_items == 'spark.executor.cores' or cfg_items == 'spark.driver.memory' or \\\r\n       cfg_items == 'spark.driver.cores'):\r\n        return True\r\n\r\nspark = SparkSession.builder.getOrCreate()\r\nconf = SparkConf().getAll()\r\n\r\nfor cfg_items in conf:\r\n    if(isConfiguredItem(cfg_items[0])):\r\n        print(cfg_items)\r\n",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "text": "('spark.executor.instances', '4')\n('spark.driver.memory', '4g')\n('spark.driver.cores', '1')\n('spark.executor.memory', '4g')\n('spark.executor.cores', '2')",
          "output_type": "stream"
        }
      ],
      "execution_count": 21
    }
  ]
}
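The final cell of the notebook above filters SparkConf().getAll() with a chain of equality tests; the same check reads more simply as set membership. A minimal sketch, with the conf list hard-coded to stand in for SparkConf().getAll() so it runs without a cluster (the extra spark.app.name entry is illustrative):

```python
# Settings overridden by the %%configure cell in the notebook above.
TUNED = {
    "spark.executor.instances", "spark.executor.memory",
    "spark.executor.cores", "spark.driver.memory", "spark.driver.cores",
}

# Stand-in for SparkConf().getAll(); a list of (key, value) string pairs.
conf = [
    ("spark.executor.instances", "4"),
    ("spark.app.name", "demo"),  # not a tuned setting, gets filtered out
    ("spark.driver.memory", "4g"),
    ("spark.executor.cores", "2"),
]

# Keep only the settings the session override touched.
tuned = [(k, v) for k, v in conf if k in TUNED]
for item in tuned:
    print(item)
```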
samples/features/sql-big-data-cluster/spark/config-install/installpackage_Spark.ipynb

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
{
  "metadata": {
    "kernelspec": {
      "name": "sparkkernel",
      "display_name": "Spark | Scala"
    },
    "language_info": {
      "name": "scala",
      "mimetype": "text/x-scala",
      "codemirror_mode": "text/x-scala",
      "pygments_lexer": "scala"
    }
  },
  "nbformat_minor": 2,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "markdown",
      "source": "# Packaging in Spark\r\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "## Use Case 1: Key packages come in-box\r\n - All packages that ship with the Spark and Hadoop distribution\r\n - Python 3.5 and Python 2.7\r\n - Pandas, Sklearn and several other supporting ML packages\r\n - R and supporting packages as part of MRO\r\n - sparklyr\r\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "## Use Case 2: Install packages from a Maven repo to my Spark cluster\r\nMaven Central hosts many packages, including much of the Spark ecosystem. These packages can be installed to your Spark cluster using a notebook cell configuration at the start of your Spark session.\r\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "%%configure -f\n{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}",
      "metadata": {
        "language": "scala"
      },
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": "<IPython.core.display.HTML object>",
            "text/html": "Current session configs: <tt>{'conf': {'spark.jars.packages': 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.50'}, 'kind': 'spark'}</tt><br>"
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": "<IPython.core.display.HTML object>",
            "text/html": "No active sessions."
          },
          "metadata": {}
        }
      ],
      "execution_count": 3
    },
    {
      "cell_type": "code",
      "source": "import com.microsoft.azure.eventhubs._",
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": "import com.microsoft.azure.eventhubs._\n"
        }
      ],
      "execution_count": 5
    },
    {
      "cell_type": "markdown",
      "source": "## Use Case 3: Run a local jar in the Spark cluster\r\nYou may build your own custom packages to run as part of your Spark jobs. Upload the jar to HDFS, and a notebook configuration lets Spark consume it.\r\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "%%configure -f\r\n {\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}",
      "metadata": {},
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "code",
      "source": "import com.my.mycodeJar._",
      "metadata": {},
      "outputs": [],
      "execution_count": 0
    }
  ]
}
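The two %%configure cells in the notebook above set spark.jars.packages (Maven) and spark.jars (HDFS jar) in separate sessions; both keys can also ride in a single session payload. A minimal sketch that only builds and prints the JSON body the magic expects, using the Maven coordinate and placeholder jar path from the notebook:

```python
import json

conf = {
    # Use Case 2: resolve the connector from Maven Central at session start.
    "spark.jars.packages": "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1",
    # Use Case 3: attach a jar already uploaded to HDFS (placeholder path).
    "spark.jars": "/jar/mycodeJar.jar",
}

# Body that a %%configure -f cell would carry.
payload = json.dumps({"conf": conf}, indent=2)
print(payload)
```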