
Commit a82e76f

Merge pull request #956 from DaniBunny/20210806-object-store-example
Spark to S3 examples
2 parents e748524 + f67c3ad

4 files changed

Lines changed: 319 additions & 2 deletions

File tree

samples/features/sql-big-data-cluster/spark/README.md

Lines changed: 4 additions & 2 deletions
@@ -12,9 +12,11 @@ SQL Server Big Data cluster bundles Spark and HDFS together with SQL server. Azu
 
 [Data Loading - Transforming CSV to Parquet](data-loading/transform-csv-files.ipynb/)
 
-[Data Transfer - Spark to SQL using Spark JDBC connector](data-virtualization/spark_to_sql_jdbc.ipynb/)
+[Data Virtualization - Spark to SQL using MSSQL Spark connector](data-virtualization/mssql_spark_connector_non_ad_pyspark.ipynb/)
 
-[Data Transfer - Spark to SQL using MSSQL Spark connector](data-virtualization/mssql_spark_connector_non_ad_pyspark.ipynb/)
+[Data Virtualization - Spark to SQL using Spark JDBC connector](data-virtualization/spark_to_sql_jdbc.ipynb/)
+
+[Data Virtualization - Spark with external Object Stores](data-virtualization/spark_external_object_store.ipynb/)
 
 [Configure - Configure a spark session using a notebook](config-install/configure_spark_session.ipynb/)
 

samples/features/sql-big-data-cluster/spark/data-virtualization/spark_external_object_store.ipynb

Lines changed: 315 additions & 0 deletions
@@ -0,0 +1,315 @@
{
"metadata": {
"kernelspec": {
"name": "pysparkkernel",
"display_name": "PySpark",
"language": ""
},
"language_info": {
"name": "pyspark",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "python",
"version": 3
},
"file_extension": ".py",
"pygments_lexer": "python3"
}
},
"nbformat_minor": 2,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"source": [
"# Spark read and write on remote Object Stores\n",
"\n",
"By loading the right libraries, Spark is able to both read and write to external Object Stores in a distributed way.\n",
"\n",
"SQL Server Big Data Clusters ships libraries to access S3 and ADLS Gen2 protocols.\n",
"\n",
"Libraries are updated with each cumulative update, so make sure to list the available libraries. To list the S3 protocol libraries, use the following:\n",
"\n",
"```\n",
"kubectl -n <YOUR-BDC-NAMESPACE> exec sparkhead-0 -- bash -c \"ls /opt/hadoop/share/hadoop/tools/lib/*aws*\"\n",
"```\n",
"\n",
"If your scenario requires a library that is either unavailable or of a version incompatible with what ships with Big Data Clusters, you have two options:\n",
"\n",
"1. Use a session-based configure cell with dynamic library loading in notebooks or jobs.\n",
"2. Copy the additional libraries to a known HDFS location on the Big Data Cluster and reference that location at session configuration.\n",
"\n",
"These two scenarios are described in detail in the [Manage libraries](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-install-packages?view=sql-server-ver15) and [Submit Spark jobs by using command-line tools](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-command-line?view=sql-server-ver15) articles.\n",
"\n",
"## Step 1 - Configure access to the remote storage\n",
"\n",
"In this example, we will access a remote S3 protocol object store.\n",
"\n",
"The example uses a [MinIO](https://min.io/) object store service, but it would work with other S3 protocol providers.\n",
"\n",
"Please check your S3 object store provider documentation to understand which libraries are required.\n",
"\n",
"With that information at hand, configure your notebook session or job to use the right library, as in the example below."
],
"metadata": {
"azdata_cell_guid": "32339638-a7ad-465b-b991-17c33798e5d5"
},
"attachments": {}
},
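{
"cell_type": "markdown",
"source": [
"The ADLS Gen2 libraries can be listed the same way. A minimal sketch, assuming the hadoop-azure jars ship in the same `tools/lib` folder (verify the exact path on your cluster):\n",
"\n",
"```\n",
"kubectl -n <YOUR-BDC-NAMESPACE> exec sparkhead-0 -- bash -c \"ls /opt/hadoop/share/hadoop/tools/lib/*azure*\"\n",
"```"
],
"metadata": {}
},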
{
"cell_type": "code",
"source": [
"%%configure -f \\\r\n",
"{\r\n",
" \"conf\": {\r\n",
" \"spark.driver.extraClassPath\": \"/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar\",\r\n",
" \"spark.executor.extraClassPath\": \"/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar\",\r\n",
" \"spark.hadoop.fs.s3a.buffer.dir\": \"/var/opt/yarnuser\"\r\n",
" }\r\n",
"}"
],
"metadata": {
"azdata_cell_guid": "64591678-3192-4005-9bff-edd8d05bd982"
},
"outputs": [],
"execution_count": null
},
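{
"cell_type": "markdown",
"source": [
"If the additional libraries were instead copied to HDFS on the cluster (option 2 above), the session configuration can reference them from there. A minimal sketch, assuming a hypothetical extra jar uploaded to an HDFS folder named `/jars`:\n",
"\n",
"```\n",
"%%configure -f \\\n",
"{\n",
" \"conf\": {\n",
" \"spark.jars\": \"hdfs:///jars/extra-s3-library.jar\"\n",
" }\n",
"}\n",
"```\n",
"\n",
"`spark.jars` is a standard Spark property; the folder and jar name here are placeholders."
],
"metadata": {}
},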
{
"cell_type": "code",
"source": [
"spark"
],
"metadata": {
"azdata_cell_guid": "6b343b63-8876-4e58-8058-d2f5798c50dd"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"## Step 2 - Add access tokens to access the remote storage dynamically\r\n",
"\r\n",
"Follow your S3 provider security documentation and update the following cells to correctly configure Spark to connect to the endpoint."
],
"metadata": {
"azdata_cell_guid": "235198a1-5a97-404f-8067-929ed35d8af1"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"access_key=\"YOUR_ACCESS_KEY\"\r\n",
"secret=\"YOUR_SECRET\""
],
"metadata": {
"azdata_cell_guid": "9d2880f4-3e74-474e-a955-9b4206c2653b",
"tags": []
},
"outputs": [],
"execution_count": null
},
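{
"cell_type": "markdown",
"source": [
"For anything beyond a quick test, consider keeping the keys out of the notebook. One option the Hadoop S3A connector supports is a credential provider keystore; a minimal sketch, assuming a hypothetical JCEKS file at `/user/spark/s3.jceks` on HDFS (check the Hadoop credential provider documentation for the exact provider URI on your cluster):\n",
"\n",
"```\n",
"# Run once from a cluster shell; each command prompts for the value to store.\n",
"hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/spark/s3.jceks\n",
"hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/spark/s3.jceks\n",
"```\n",
"\n",
"The session then only needs to reference the keystore instead of the raw keys:\n",
"\n",
"```\n",
"spark._jsc.hadoopConfiguration().set(\"hadoop.security.credential.provider.path\", \"jceks://hdfs/user/spark/s3.jceks\")\n",
"```"
],
"metadata": {}
},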
{
"cell_type": "code",
"source": [
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\")\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\", access_key)\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\", secret)\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.endpoint\", \"YOUR_ENDPOINT\")\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.buffer.dir\", \"/var/opt/yarnuser\") # Temp dir for writes back to S3\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.connection.ssl.enabled\", \"false\")"
],
"metadata": {
"azdata_cell_guid": "97fb73a6-af3c-47d2-8bd5-13c9999135dd"
},
"outputs": [],
"execution_count": null
},
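{
"cell_type": "markdown",
"source": [
"The same settings can also be supplied up front in the session `%%configure` cell by prefixing each Hadoop key with `spark.hadoop.`, which Spark copies into the Hadoop configuration. A minimal sketch, reusing the placeholder values from the cells above:\n",
"\n",
"```\n",
"%%configure -f \\\n",
"{\n",
" \"conf\": {\n",
" \"spark.hadoop.fs.s3a.endpoint\": \"YOUR_ENDPOINT\",\n",
" \"spark.hadoop.fs.s3a.access.key\": \"YOUR_ACCESS_KEY\",\n",
" \"spark.hadoop.fs.s3a.secret.key\": \"YOUR_SECRET\"\n",
" }\n",
"}\n",
"```"
],
"metadata": {}
},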
129+
{
130+
"cell_type": "markdown",
131+
"source": [
132+
"## Spark read and write patterns\n",
133+
"\n",
134+
"Use the following examples to cover a range of read and write scenarios to remote object stores.\n",
135+
"\n",
136+
"### Read from external S3 and write to BDC HDFS as table"
137+
],
138+
"metadata": {
139+
"azdata_cell_guid": "3d34444e-026e-44ea-ac79-24e19fe6b4a0"
140+
},
141+
"attachments": {}
142+
},
143+
{
144+
"cell_type": "code",
145+
"source": [
146+
"df = spark.read.csv(\"s3a://NYC-Cab/fhv_tripdata_2015-01.csv\", header=True)"
147+
],
148+
"metadata": {
149+
"azdata_cell_guid": "273a3ff7-0f2e-41a3-a229-0d3eda1c95d6"
150+
},
151+
"outputs": [],
152+
"execution_count": null
153+
},
154+
{
155+
"cell_type": "code",
156+
"source": [
157+
"df.count()"
158+
],
159+
"metadata": {
160+
"azdata_cell_guid": "c966b240-2e2c-4bb3-9f22-1dacf491d72a"
161+
},
162+
"outputs": [],
163+
"execution_count": null
164+
},
165+
{
166+
"cell_type": "code",
167+
"source": [
168+
"df.write.format(\"parquet\").save(\"/securelake/fhv_tripdata_2015-01\")"
169+
],
170+
"metadata": {
171+
"azdata_cell_guid": "4bc6608d-da67-423e-8238-3ec186b7396a"
172+
},
173+
"outputs": [],
174+
"execution_count": null
175+
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"DROP TABLE IF EXISTS tripdata"
],
"metadata": {
"azdata_cell_guid": "1ea7a5ed-9567-4c21-932d-99b05af79c4f",
"tags": [
"hide_input"
]
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"CREATE TABLE tripdata\r\n",
"USING parquet\r\n",
"LOCATION '/securelake/fhv_tripdata_2015-01'"
],
"metadata": {
"azdata_cell_guid": "9c3c2ce9-1796-4d32-be3f-c522425306b9"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"select count(*) from tripdata"
],
"metadata": {
"azdata_cell_guid": "9fa76a3b-60ff-4868-8184-956191d0d96c"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"select * from tripdata limit 10"
],
"metadata": {
"azdata_cell_guid": "0a99df82-8a4f-479e-8681-b406883646ac"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"### Write back to S3 as parquet"
],
"metadata": {
"azdata_cell_guid": "9f540e0c-b18c-436d-9a88-e123382d9778"
}
},
{
"cell_type": "code",
"source": [
"df.write.format(\"parquet\").save(\"s3a://NYC-Cab/fhv_tripdata_2015-01-3\")"
],
"metadata": {
"azdata_cell_guid": "5cdeaf5c-3c29-4fbf-97f9-110a803be1d0"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"### Create external table on S3\n",
"\n",
"This example virtualizes a folder on the external object store as a Hive table."
],
"metadata": {
"azdata_cell_guid": "e734b5ae-1b6e-4209-abad-94f9226e9af6"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"DROP TABLE IF EXISTS tripdata_s3"
],
"metadata": {
"azdata_cell_guid": "3f5b308e-d528-4c39-afe1-e4924ce38db8",
"tags": [
"hide_input"
]
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"CREATE TABLE tripdata_s3\r\n",
"USING parquet\r\n",
"LOCATION 's3a://NYC-Cab/fhv_tripdata_2015-01-3'"
],
"metadata": {
"azdata_cell_guid": "1229cff9-7457-48c1-9f0a-dc12ae0ad031"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"select count(*) from tripdata_s3"
],
"metadata": {
"azdata_cell_guid": "ef4f6b6f-ac19-4704-9436-e1a590cf64fc"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"select * from tripdata_s3 limit 10"
],
"metadata": {
"azdata_cell_guid": "1878f87c-ef41-4312-acb1-c97de021a817"
},
"outputs": [],
"execution_count": null
}
]
}

samples/manage/.DS_Store

-6 KB
Binary file not shown.
-6 KB
Binary file not shown.
