Commit 88eff98

spark samples v1

1 parent c7910d6 commit 88eff98

5 files changed

Lines changed: 354 additions & 0 deletions

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
{
  "metadata": {
    "kernelspec": {
      "name": "pyspark3kernel",
      "display_name": "PySpark3"
    },
    "language_info": {
      "name": "pyspark3",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "python",
        "version": 3
      },
      "pygments_lexer": "python3"
    }
  },
  "nbformat_minor": 2,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "code",
      "source": "print(\"Hello World! \")\r\n\r\nimport sys\r\nprint(\"Python version \", sys.version)\r\n\r\n# Run some Python in the notebook\r\nnum = [i*i for i in range(0, 20)]\r\nprint(\"My squared numbers \", num)\r\n",
      "metadata": {
        "language": "python"
      },
      "outputs": [
        {
          "name": "stdout",
          "text": "Hello World! \nPython version 3.5.2 (default, Nov 12 2018, 13:43:14) \n[GCC 5.4.0 20160609]\nMy squared numbers [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]",
          "output_type": "stream"
        }
      ],
      "execution_count": 1
    }
  ]
}
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
{
  "metadata": {
    "kernelspec": {
      "name": "sparkkernel",
      "display_name": "Spark | Scala"
    },
    "language_info": {
      "name": "scala",
      "mimetype": "text/x-scala",
      "codemirror_mode": "text/x-scala",
      "pygments_lexer": "scala"
    }
  },
  "nbformat_minor": 2,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "code",
      "source": "object HelloWorld {\r\n  def main(args: Array[String]): Unit = {\r\n    println(\"Hello Spark Scala\")\r\n  }\r\n}\r\n",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "text": "Starting Spark application\n",
          "output_type": "stream"
        },
        {
          "data": {
            "text/plain": "<IPython.core.display.HTML object>",
            "text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>3</td><td>application_1554316083160_0004</td><td>spark</td><td>idle</td><td><a target=\"_blank\" href=\"https://52.191.187.81:30443/gateway/default/yarn/proxy/application_1554316083160_0004/\">Link</a></td><td><a target=\"_blank\" href=\"http://mssql-storage-pool-default-1.service-storage-pool-default.ctp24.svc.cluster.local:8042/node/containerlogs/container_1554316083160_0004_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "text": "SparkSession available as 'spark'.\n",
          "output_type": "stream"
        },
        {
          "name": "stdout",
          "text": "defined object HelloWorld\n",
          "output_type": "stream"
        }
      ],
      "execution_count": 2
    }
  ]
}
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
{
  "metadata": {
    "kernelspec": {
      "name": "sparkrkernel",
      "display_name": "Spark | R"
    },
    "language_info": {
      "name": "sparkR",
      "mimetype": "text/x-rsrc",
      "codemirror_mode": "text/x-rsrc",
      "pygments_lexer": "r"
    }
  },
  "nbformat_minor": 2,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "code",
      "source": "print(\"Hello SparkR\")\r\n\r\nhead(iris)\r\n",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "text": "[1] \"Hello SparkR\"\n  Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n1          5.1         3.5          1.4         0.2  setosa\n2          4.9         3.0          1.4         0.2  setosa\n3          4.7         3.2          1.3         0.2  setosa\n4          4.6         3.1          1.5         0.2  setosa\n5          5.0         3.6          1.4         0.2  setosa\n6          5.4         3.9          1.7         0.4  setosa",
          "output_type": "stream"
        }
      ],
      "execution_count": 5
    }
  ]
}
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
{
  "metadata": {
    "kernelspec": {
      "name": "pyspark3kernel",
      "display_name": "PySpark3"
    },
    "language_info": {
      "name": "pyspark3",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "python",
        "version": 3
      },
      "pygments_lexer": "python3"
    }
  },
  "nbformat_minor": 2,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "markdown",
      "source": "# Read and write from Spark to SQL\r\nA typical big data scenario is large-scale ETL in Spark, writing the processed data to SQL Server. The following sample shows:\r\n- reading an HDFS file,\r\n- doing some basic processing on it, and\r\n- writing the processed data to a SQL Server table.\r\n\r\nThis sample needs a database precreated in SQL Server. Here we use the database name \"MyTestDatabase\", which can be created using the SQL statements below.\r\n\r\n``` sql\r\nCREATE DATABASE MyTestDatabase\r\nGO\r\n```\r\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "# Read the CSV file from HDFS into a Spark DataFrame\r\ndatafile = \"/spark_data/AdultCensusIncome.csv\"\r\ndf = spark.read.format('csv').options(header='true', inferSchema='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true').load(datafile)\r\ndf.show(5)\r\n",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
          "output_type": "stream"
        }
      ],
      "execution_count": 8
    },
    {
      "cell_type": "code",
      "source": "# Process this data: a very simple cleanup step, replacing \"-\" with \"_\" in column names\r\ncolumns_new = [col.replace(\"-\", \"_\") for col in df.columns]\r\ndf = df.toDF(*columns_new)\r\ndf.show(5)\r\n",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
          "output_type": "stream"
        }
      ],
      "execution_count": 9
    },
    {
      "cell_type": "code",
      "source": "# Write from Spark to the SQL table using JDBC\r\nprint(\"Use built-in JDBC connector to write to SQLServer master instance in Big data \")\r\n\r\nservername = \"jdbc:sqlserver://mssql-master-pool-0.service-master-pool\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\ndbtable = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"ShivTron007\"\r\n\r\nprint(\"url is \", url)\r\n\r\ntry:\r\n    df.write \\\r\n        .format(\"jdbc\") \\\r\n        .mode(\"overwrite\") \\\r\n        .option(\"url\", url) \\\r\n        .option(\"dbtable\", dbtable) \\\r\n        .option(\"user\", user) \\\r\n        .option(\"password\", password) \\\r\n        .save()\r\nexcept ValueError as error:\r\n    print(\"JDBC Write failed\", error)\r\n\r\nprint(\"JDBC Write done \")\r\n",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "text": "Use built-in JDBC connector to write to SQLServer master instance in Big data \nurl is jdbc:sqlserver://mssql-master-pool-0.service-master-pool;databaseName=MyTestDatabase;\nJDBC Write done",
          "output_type": "stream"
        }
      ],
      "execution_count": 10
    },
    {
      "cell_type": "code",
      "source": "# Read from the SQL table back into Spark using JDBC\r\nprint(\"read data from SQL server table \")\r\njdbcDF = spark.read \\\r\n    .format(\"jdbc\") \\\r\n    .option(\"url\", url) \\\r\n    .option(\"dbtable\", dbtable) \\\r\n    .option(\"user\", user) \\\r\n    .option(\"password\", password) \\\r\n    .load()\r\n\r\njdbcDF.show(5)",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "text": "read data from SQL server table \n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows",
          "output_type": "stream"
        }
      ],
      "execution_count": 13
    }
  ]
}
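Two pieces of the ETL notebook above are plain Python and can be sketched without a Spark session: the column-name cleanup (replacing "-" with "_" before the JDBC write) and the assembly of the JDBC connection URL. The column list below is a hypothetical excerpt of the AdultCensusIncome.csv header, used only for illustration:

```python
# Hypothetical excerpt of the CSV header columns (for illustration only)
columns = ["age", "education-num", "marital-status", "hours-per-week"]

# Hyphenated column names are awkward in SQL, so the notebook rewrites
# them with underscores before handing the DataFrame to the JDBC writer.
columns_new = [col.replace("-", "_") for col in columns]
print(columns_new)  # ['age', 'education_num', 'marital_status', 'hours_per_week']

# The JDBC URL is built by concatenating the server name and database name,
# matching the notebook's servername/dbname values.
servername = "jdbc:sqlserver://mssql-master-pool-0.service-master-pool"
dbname = "MyTestDatabase"
url = servername + ";" + "databaseName=" + dbname + ";"
print(url)  # jdbc:sqlserver://mssql-master-pool-0.service-master-pool;databaseName=MyTestDatabase;
```

In the notebook itself, `columns_new` is then applied with `df.toDF(*columns_new)` and `url` is passed to the reader and writer via `.option("url", url)`.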
