{
"metadata": {
"kernelspec": {
"name": "pysparkkernel",
"display_name": "PySpark",
"language": ""
},
"language_info": {
"name": "pyspark",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "python",
"version": 3
},
"file_extension": ".py",
"pygments_lexer": "python3"
}
},
"nbformat_minor": 2,
"nbformat": 4,
21+ "cells" : [
22+ {
23+ "cell_type" : " markdown" ,
24+ "source" : [
25+ " # Spark read and write on remote Object Stores\n " ,
26+ " \n " ,
27+ " By loading the right libraries, Spark is able to both read and write to external Object Stores in a distributed way.\n " ,
28+ " \n " ,
29+ " SQL Server Server Big Data Clusters ships libraries to access S3 and ADLS Gen2 protocols.\n " ,
30+ " \n " ,
31+ " Libraries are updated with each cumulative update, please make sure to list the available libraries. To list S3 protocol libraries use the following:\n " ,
32+ " \n " ,
33+ " ```\n " ,
34+ " kubectl -n <YOUR-BDC-NAMESPACE> exec sparkhead-0 -- bash -c \" ls /opt/hadoop/share/hadoop/tools/lib/*aws*\"\n " ,
35+ " ```\n " ,
36+ " \n " ,
37+ " If your scenario requires a library either unavailable or version incompatible with what is shipped with Big Data Clusters, you have some options:\n " ,
38+ " \n " ,
39+ " 1. Use a session based configure cell with dynamic library loading on Notebooks or Jobs.\n " ,
40+ " 2. Copy the additional libraries to a known HDFS on BDC and reference that at session configuration.\n " ,
41+ " \n " ,
42+ " These two scenarios are described in detail in the [Manage libraries](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-install-packages?view=sql-server-ver15) and [Submit Spark jobs by using command-line tools](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-command-line?view=sql-server-ver15) articles.\n " ,
43+ " \n " ,
44+ " ## Step 1 - Configure access to the remote storage\n " ,
45+ " \n " ,
46+ " In this example we will access a remote S3 protocol object store. \n " ,
47+ " \n " ,
48+ " The example considers a [MinIO](https://min.io/) object store service, but would would work with other S3 protocol providers.\n " ,
49+ " \n " ,
50+ " Please check your S3 object store provider documentation to understand which libraries are required.\n " ,
51+ " \n " ,
52+ " With that information at hand, configure your notebook session or job to use the right library like the example bellow."
53+ ],
54+ "metadata" : {
55+ "azdata_cell_guid" : " 32339638-a7ad-465b-b991-17c33798e5d5"
56+ },
57+ "attachments" : {}
58+ },
{
"cell_type": "code",
"source": [
"%%configure -f \\\r\n",
"{\r\n",
"    \"conf\": {\r\n",
"        \"spark.driver.extraClassPath\": \"/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar\",\r\n",
"        \"spark.executor.extraClassPath\": \"/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.271.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.1.168513.jar\",\r\n",
"        \"spark.hadoop.fs.s3a.buffer.dir\": \"/var/opt/yarnuser\"\r\n",
"    }\r\n",
"}"
],
"metadata": {
"azdata_cell_guid": "64591678-3192-4005-9bff-edd8d05bd982"
},
"outputs": [],
"execution_count": null
},
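{
"cell_type": "markdown",
"source": [
"Note: `%%configure -f` forces the session to be (re)created with the configuration above, so run it before any other code cell. The cell below simply references `spark` to start the session and confirm it is up."
],
"metadata": {}
},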
{
"cell_type": "code",
"source": [
"spark"
],
"metadata": {
"azdata_cell_guid": "6b343b63-8876-4e58-8058-d2f5798c50dd"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"## Step 2 - Add access tokens to access the remote storage dynamically\r\n",
"\r\n",
"Follow your S3 provider's security documentation and change the following cells to correctly configure Spark to connect to the endpoint."
],
"metadata": {
"azdata_cell_guid": "235198a1-5a97-404f-8067-929ed35d8af1"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"access_key = \"YOUR_ACCESS_KEY\"\r\n",
"secret = \"YOUR_SECRET\""
],
"metadata": {
"azdata_cell_guid": "9d2880f4-3e74-474e-a955-9b4206c2653b",
"tags": []
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\")\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.access.key\", access_key)\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.secret.key\", secret)\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.endpoint\", \"YOUR_ENDPOINT\")\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.buffer.dir\", \"/var/opt/yarnuser\")  # Temp dir for writes back to S3; hadoopConfiguration keys drop the spark.hadoop. prefix\r\n",
"spark._jsc.hadoopConfiguration().set(\"fs.s3a.connection.ssl.enabled\", \"false\")"
],
"metadata": {
"azdata_cell_guid": "97fb73a6-af3c-47d2-8bd5-13c9999135dd"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"## Spark read and write patterns\n",
"\n",
"The following examples cover a range of read and write scenarios against remote object stores.\n",
"\n",
"### Read from external S3 and write to BDC HDFS as a table"
],
"metadata": {
"azdata_cell_guid": "3d34444e-026e-44ea-ac79-24e19fe6b4a0"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"df = spark.read.csv(\"s3a://NYC-Cab/fhv_tripdata_2015-01.csv\", header=True)"
],
"metadata": {
"azdata_cell_guid": "273a3ff7-0f2e-41a3-a229-0d3eda1c95d6"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"df.count()"
],
"metadata": {
"azdata_cell_guid": "c966b240-2e2c-4bb3-9f22-1dacf491d72a"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"df.write.format(\"parquet\").save(\"/securelake/fhv_tripdata_2015-01\")"
],
"metadata": {
"azdata_cell_guid": "4bc6608d-da67-423e-8238-3ec186b7396a"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"DROP TABLE IF EXISTS tripdata"
],
"metadata": {
"azdata_cell_guid": "1ea7a5ed-9567-4c21-932d-99b05af79c4f",
"tags": [
"hide_input"
]
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"CREATE TABLE tripdata\r\n",
"USING parquet\r\n",
"LOCATION '/securelake/fhv_tripdata_2015-01'"
],
"metadata": {
"azdata_cell_guid": "9c3c2ce9-1796-4d32-be3f-c522425306b9"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"SELECT COUNT(*) FROM tripdata"
],
"metadata": {
"azdata_cell_guid": "9fa76a3b-60ff-4868-8184-956191d0d96c"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"SELECT * FROM tripdata LIMIT 10"
],
"metadata": {
"azdata_cell_guid": "0a99df82-8a4f-479e-8681-b406883646ac"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"### Write back to S3 as Parquet"
],
"metadata": {
"azdata_cell_guid": "9f540e0c-b18c-436d-9a88-e123382d9778"
}
},
{
"cell_type": "code",
"source": [
"df.write.format(\"parquet\").save(\"s3a://NYC-Cab/fhv_tripdata_2015-01-3\")"
],
"metadata": {
"azdata_cell_guid": "5cdeaf5c-3c29-4fbf-97f9-110a803be1d0"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"### Create an external table on S3\n",
"\n",
"This example virtualizes a folder on an external object store as a Hive table."
],
"metadata": {
"azdata_cell_guid": "e734b5ae-1b6e-4209-abad-94f9226e9af6"
},
"attachments": {}
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"DROP TABLE IF EXISTS tripdata_s3"
],
"metadata": {
"azdata_cell_guid": "3f5b308e-d528-4c39-afe1-e4924ce38db8",
"tags": [
"hide_input"
]
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"CREATE TABLE tripdata_s3\r\n",
"USING parquet\r\n",
"LOCATION 's3a://NYC-Cab/fhv_tripdata_2015-01-3'"
],
"metadata": {
"azdata_cell_guid": "1229cff9-7457-48c1-9f0a-dc12ae0ad031"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"SELECT COUNT(*) FROM tripdata_s3"
],
"metadata": {
"azdata_cell_guid": "ef4f6b6f-ac19-4704-9436-e1a590cf64fc"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"%%sql\r\n",
"SELECT * FROM tripdata_s3 LIMIT 10"
],
"metadata": {
"azdata_cell_guid": "1878f87c-ef41-4312-acb1-c97de021a817"
},
"outputs": [],
"execution_count": null
}
]
}