{
"metadata": {
"kernelspec": {
"name": "pysparkkernel",
"display_name": "PySpark",
"language": ""
},
"language_info": {
"name": "pyspark",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "python",
"version": 3
},
"file_extension": ".py",
"pygments_lexer": "python3"
}
},
"nbformat_minor": 2,
"nbformat": 4,
21+ "cells" : [
22+ {
23+ "cell_type" : " markdown" ,
24+ "source" : [
25+ " # Spark read and write on remote Object Stores\n " ,
26+ " \n " ,
27+ " By loading the right libraries, Spark is able to both read and write to external Object Stores in a distributed way.\n " ,
28+ " \n " ,
29+ " SQL Server Server Big Data Clusters ships libraries to access S3 and ADLS Gen2 protocols.\n " ,
30+ " \n " ,
31+ " Libraries are updated with each cumulative update, please make sure to list the available libraries. To list S3 protocol libraries use the following:\n " ,
32+ " \n " ,
33+ " ```\n " ,
34+ " kubectl -n <YOUR-BDC-NAMESPACE> exec sparkhead-0 -- bash -c \" ls /opt/hadoop/share/hadoop/tools/lib/*aws*\"\n " ,
35+ " ```\n " ,
36+ " \n " ,
37+ " If your scenario requires a library either unavailable or version incompatible with what is shipped with Big Data Clusters, you have some options:\n " ,
38+ " \n " ,
39+ " 1. Use a session based configure cell with dynamic library loading on Notebooks or Jobs.\n " ,
40+ " 2. Copy the additional libraries to a known HDFS on BDC and reference that at session configuration.\n " ,
41+ " \n " ,
42+ " These two scenarios are described in detail in the [Manage libraries](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-install-packages?view=sql-server-ver15) and [Submit Spark jobs by using command-line tools](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-command-line?view=sql-server-ver15) articles.\n " ,
43+ " \n " ,
44+ " ## Step 1 - Configure access to the remote storage\n " ,
45+ " \n " ,
46+ " In this example we will access a remote S3 protocol object store. \n " ,
47+ " \n " ,
48+ " The example considers a [MinIO](https://min.io/) object store service, but would would work with other S3 protocol providers.\n " ,
49+ " \n " ,
50+ " Please check your S3 object store provider documentation to understand which libraries are required.\n " ,
51+ " \n " ,
52+ " With that information at hand, configure your notebook session or job to use the right library like the example bellow."
53+ ],
54+ "metadata" : {
55+ "azdata_cell_guid" : " 32339638-a7ad-465b-b991-17c33798e5d5"
56+ },
57+ "attachments" : {}
58+ },
{
"cell_type": "code",
"source": [
"%%configure -f \\\r\n",
"{\r\n",
"    \"conf\": {\r\n",
"        \"spark.driver.extraClassPath\": \"/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar\",\r\n",
"        \"spark.executor.extraClassPath\": \"/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar\",\r\n",
"        \"spark.hadoop.fs.s3a.buffer.dir\": \"/var/opt/yarnuser\"\r\n",
"    }\r\n",
"}"
],
"metadata": {
"azdata_cell_guid": "64591678-3192-4005-9bff-edd8d05bd982"
},
"outputs": [],
"execution_count": null
},
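{
"cell_type": "markdown",
"source": [
"Note: `%%configure -f` forces the session to be (re)created with the configuration above, so run it before any other code cell. The cell below simply references `spark` to start the session and confirm it is up."
],
"metadata": {}
},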
{
"cell_type": "code",
"source": [
"spark"
],
"metadata": {
"azdata_cell_guid": "6b343b63-8876-4e58-8058-d2f5798c50dd"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"## Step 2 - Add access tokens to access the remote storage dynamically\r\n",
"\r\n",
"Follow your S3 provider's security documentation and change the following cells to correctly configure Spark to connect to the endpoint."
],
"metadata": {
"azdata_cell_guid": "235198a1-5a97-404f-8067-929ed35d8af1"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"access_key = \"YOUR_ACCESS_KEY\"\r\n",
"secret = \"YOUR_SECRET\""
],
"metadata": {
"azdata_cell_guid": "9d2880f4-3e74-474e-a955-9b4206c2653b",
"tags": []
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\")\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\", access_key)\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\", secret)\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.endpoint\", \"YOUR_ENDPOINT\")\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.buffer.dir\", \"/var/opt/yarnuser\")  # Temp dir for writes back to S3; hadoopConfiguration keys drop the spark.hadoop. prefix\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.connection.ssl.enabled\", \"false\")"
],
"metadata": {
"azdata_cell_guid": "97fb73a6-af3c-47d2-8bd5-13c9999135dd"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"## Spark read and write patterns\n",
"\n",
"The following examples cover a range of read and write scenarios against remote object stores.\n",
"\n",
"### Read from external S3 and write to BDC HDFS as a table"
],
"metadata": {
"azdata_cell_guid": "3d34444e-026e-44ea-ac79-24e19fe6b4a0"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"df = spark.read.csv(\"s3a://NYC-Cab/fhv_tripdata_2015-01.csv\", header=True)"
],
"metadata": {
"azdata_cell_guid": "273a3ff7-0f2e-41a3-a229-0d3eda1c95d6"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"df.count()"
],
"metadata": {
"azdata_cell_guid": "c966b240-2e2c-4bb3-9f22-1dacf491d72a"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"df.write.format(\"parquet\").save(\"/securelake/fhv_tripdata_2015-01\")"
],
"metadata": {
"azdata_cell_guid": "4bc6608d-da67-423e-8238-3ec186b7396a"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"DROP TABLE IF EXISTS tripdata"
],
"metadata": {
"azdata_cell_guid": "1ea7a5ed-9567-4c21-932d-99b05af79c4f",
"tags": [
"hide_input"
]
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"CREATE TABLE tripdata\r\n",
"USING parquet\r\n",
"LOCATION '/securelake/fhv_tripdata_2015-01'"
],
"metadata": {
"azdata_cell_guid": "9c3c2ce9-1796-4d32-be3f-c522425306b9"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"SELECT COUNT(*) FROM tripdata"
],
"metadata": {
"azdata_cell_guid": "9fa76a3b-60ff-4868-8184-956191d0d96c"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"SELECT * FROM tripdata LIMIT 10"
],
"metadata": {
"azdata_cell_guid": "0a99df82-8a4f-479e-8681-b406883646ac"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"### Write back to S3 as Parquet"
],
"metadata": {
"azdata_cell_guid": "9f540e0c-b18c-436d-9a88-e123382d9778"
}
},
{
"cell_type": "code",
"source": [
"df.write.format(\"parquet\").save(\"s3a://NYC-Cab/fhv_tripdata_2015-01-3\")"
],
"metadata": {
"azdata_cell_guid": "5cdeaf5c-3c29-4fbf-97f9-110a803be1d0"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"### Create an external table on S3\n",
"\n",
"This example virtualizes a folder on an external object store as a Hive table."
],
"metadata": {
"azdata_cell_guid": "e734b5ae-1b6e-4209-abad-94f9226e9af6"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"DROP TABLE IF EXISTS tripdata_s3"
],
"metadata": {
"azdata_cell_guid": "3f5b308e-d528-4c39-afe1-e4924ce38db8",
"tags": [
"hide_input"
]
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"CREATE TABLE tripdata_s3\r\n",
"USING parquet\r\n",
"LOCATION 's3a://NYC-Cab/fhv_tripdata_2015-01-3'"
],
"metadata": {
"azdata_cell_guid": "1229cff9-7457-48c1-9f0a-dc12ae0ad031"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"SELECT COUNT(*) FROM tripdata_s3"
],
"metadata": {
"azdata_cell_guid": "ef4f6b6f-ac19-4704-9436-e1a590cf64fc"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"SELECT * FROM tripdata_s3 LIMIT 10"
],
"metadata": {
"azdata_cell_guid": "1878f87c-ef41-4312-acb1-c97de021a817"
},
"outputs": [],
"execution_count": null
}
]
}