
Commit 168bcb3

committed: update notebook with the correct link to the image.

1 parent 56f5e11

1 file changed: 25 additions & 25 deletions

samples/features/sql-big-data-cluster/spark/sparkml/train_score_export_ml_models_with_spark.ipynb
@@ -19,7 +19,7 @@
 "cells": [
 {
 "cell_type": "markdown",
-"source": "# Machine learning with SPARK in SQL Server 2019 Big Data Cluster\r\nSpark in Unified Big data compute engine that enables big data processing, Machine learning and AI\r\n\r\nKey Spark advantages are \r\n1. Distributed compute enging \r\n2. Choice of langauge (Python, R, Scala, Java)\r\n3. Single engine for Batch and Streaming jobs\r\n\r\nIn this tutorial we'll cover how we can use Spark to create and deploy machine learning models. The example is a python(PySpark) sample. The same can also be done using Scala and R ( SparkR) in Spark.\r\n\r\n<img src = \"C:\\repos\\sql-server-samples\\samples\\features\\sql-big-data-cluster\\spark\\sparkml\\Train_Score_Export_with_Spark.jpg\" style=\"float: center;\" alt=\"drawing\" width=\"900\">\r\n\r\n## Steps\r\n1. Explore your Data\r\n2. Data Prep and split Data as Training and Test set\r\n3. Model Training\r\n4. Model Scoring \r\n5. Persist as Spark Model\r\n6. Persist as Portable Model\r\n\r\nE2E machine learning involves several additional step e.g data exploration, feature selection and principal component analysis,model selection etc. Many of these steps are ignored here for brevity.\r\n\r\n\r\n\r\n",
+"source": "# Machine learning with SPARK in SQL Server 2019 Big Data Cluster\r\nSpark in Unified Big data compute engine that enables big data processing, Machine learning and AI\r\n\r\nKey Spark advantages are \r\n1. Distributed compute enging \r\n2. Choice of langauge (Python, R, Scala, Java)\r\n3. Single engine for Batch and Streaming jobs\r\n\r\nIn this tutorial we'll cover how we can use Spark to create and deploy machine learning models. The example is a python(PySpark) sample. The same can also be done using Scala and R ( SparkR) in Spark.\r\n\r\n<img src = \"Train_Score_Export_with_Spark.jpg\" style=\"float: center;\" alt=\"drawing\" width=\"900\">\r\n\r\n## Steps\r\n1. Explore your Data\r\n2. Data Prep and split Data as Training and Test set\r\n3. Model Training\r\n4. Model Scoring \r\n5. Persist as Spark Model\r\n6. Persist as Portable Model\r\n\r\nE2E machine learning involves several additional step e.g data exploration, feature selection and principal component analysis,model selection etc. Many of these steps are ignored here for brevity.\r\n\r\n\r\n\r\n",
 "metadata": {}
 },
 {
@@ -29,31 +29,31 @@
 },
 {
 "cell_type": "code",
-"source": "datafile = \"/spark_data/AdultCensusIncome.csv\"\r\n\r\n#Read the data to a spark data frame.\r\ndata_all = spark.read.format('csv').options(header='true', inferSchema='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true').load(datafile)\r\nprint(\"Number of rows: {}, Number of coulumns : {}\".format(data_all.count(), len(data_all.columns)))\r\ndata_all.printSchema() \r\n\r\n#Replace \"-\" with \"_\" in column names\r\ncolumns_new = [col.replace(\"-\", \"_\") for col in data_all.columns]\r\ndata_all = data_all.toDF(*columns_new)\r\ndata_all.printSchema()\r\n",
+"source": "![title](datafile = \"/spark_data/AdultCensusIncome.csv\"\r\n\r\n#Read the data to a spark data frame.\r\ndata_all = spark.read.format('csv').options(header='true', inferSchema='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true').load(datafile)\r\nprint(\"Number of rows: {}, Number of coulumns : {}\".format(data_all.count(), len(data_all.columns)))\r\ndata_all.printSchema() \r\n\r\n#Replace \"-\" with \"_\" in column names\r\ncolumns_new = [col.replace(\"-\", \"_\") for col in data_all.columns]\r\ndata_all = data_all.toDF(*columns_new)\r\ndata_all.printSchema()\r\n",
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "Starting Spark application\n",
-"output_type": "stream"
+"text": "Starting Spark application\n"
 },
 {
+"output_type": "display_data",
 "data": {
 "text/plain": "<IPython.core.display.HTML object>",
 "text/html": "<table>\n<tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>19</td><td>application_1559313998190_0085</td><td>pyspark3</td><td>idle</td><td><a target=\"_blank\" href=\"https://10.193.17.7:30443/gateway/default/yarn/proxy/application_1559313998190_0085/\">Link</a></td><td><a target=\"_blank\" href=\"https://10.193.17.7:30443/gateway/default/yarn/container/container_1559313998190_0085_01_000001/root\">Link</a></td><td>✔</td></tr></table>"
 },
-"metadata": {},
-"output_type": "display_data"
+"metadata": {}
 },
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "SparkSession available as 'spark'.\n",
-"output_type": "stream"
+"text": "SparkSession available as 'spark'.\n"
 },
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "Number of rows: 32561, Number of coulumns : 15\nroot\n |-- age: integer (nullable = true)\n |-- workclass: string (nullable = true)\n |-- fnlwgt: integer (nullable = true)\n |-- education: string (nullable = true)\n |-- education-num: integer (nullable = true)\n |-- marital-status: string (nullable = true)\n |-- occupation: string (nullable = true)\n |-- relationship: string (nullable = true)\n |-- race: string (nullable = true)\n |-- sex: string (nullable = true)\n |-- capital-gain: integer (nullable = true)\n |-- capital-loss: integer (nullable = true)\n |-- hours-per-week: integer (nullable = true)\n |-- native-country: string (nullable = true)\n |-- income: string (nullable = true)\n\nroot\n |-- age: integer (nullable = true)\n |-- workclass: string (nullable = true)\n |-- fnlwgt: integer (nullable = true)\n |-- education: string (nullable = true)\n |-- education_num: integer (nullable = true)\n |-- marital_status: string (nullable = true)\n |-- occupation: string (nullable = true)\n |-- relationship: string (nullable = true)\n |-- race: string (nullable = true)\n |-- sex: string (nullable = true)\n |-- capital_gain: integer (nullable = true)\n |-- capital_loss: integer (nullable = true)\n |-- hours_per_week: integer (nullable = true)\n |-- native_country: string (nullable = true)\n |-- income: string (nullable = true)",
-"output_type": "stream"
+"text": "Number of rows: 32561, Number of coulumns : 15\nroot\n |-- age: integer (nullable = true)\n |-- workclass: string (nullable = true)\n |-- fnlwgt: integer (nullable = true)\n |-- education: string (nullable = true)\n |-- education-num: integer (nullable = true)\n |-- marital-status: string (nullable = true)\n |-- occupation: string (nullable = true)\n |-- relationship: string (nullable = true)\n |-- race: string (nullable = true)\n |-- sex: string (nullable = true)\n |-- capital-gain: integer (nullable = true)\n |-- capital-loss: integer (nullable = true)\n |-- hours-per-week: integer (nullable = true)\n |-- native-country: string (nullable = true)\n |-- income: string (nullable = true)\n\nroot\n |-- age: integer (nullable = true)\n |-- workclass: string (nullable = true)\n |-- fnlwgt: integer (nullable = true)\n |-- education: string (nullable = true)\n |-- education_num: integer (nullable = true)\n |-- marital_status: string (nullable = true)\n |-- occupation: string (nullable = true)\n |-- relationship: string (nullable = true)\n |-- race: string (nullable = true)\n |-- sex: string (nullable = true)\n |-- capital_gain: integer (nullable = true)\n |-- capital_loss: integer (nullable = true)\n |-- hours_per_week: integer (nullable = true)\n |-- native_country: string (nullable = true)\n |-- income: string (nullable = true)"
 }
 ],
 "execution_count": 3
@@ -64,9 +64,9 @@
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "Select few columns to see the data\n+------+---+--------------+---------+\n|income|age|hours_per_week|education|\n+------+---+--------------+---------+\n| <=50K| 39| 40|Bachelors|\n| <=50K| 50| 13|Bachelors|\n| <=50K| 38| 40| HS-grad|\n| <=50K| 53| 40| 11th|\n| <=50K| 28| 40|Bachelors|\n| <=50K| 37| 40| Masters|\n| <=50K| 49| 16| 9th|\n| >50K| 52| 45| HS-grad|\n| >50K| 31| 50| Masters|\n| >50K| 42| 40|Bachelors|\n+------+---+--------------+---------+\nonly showing top 10 rows\n\nNumber of distinct values for income\n+------+\n|income|\n+------+\n| <=50K|\n| >50K|\n+------+\n\nAdded numeric column(income_code) derived from income column\n+------+---+--------------+---------+-----------+\n|income|age|hours_per_week|education|income_code|\n+------+---+--------------+---------+-----------+\n| <=50K| 39| 40|Bachelors| 0|\n| <=50K| 50| 13|Bachelors| 0|\n| <=50K| 38| 40| HS-grad| 0|\n| <=50K| 53| 40| 11th| 0|\n| <=50K| 28| 40|Bachelors| 0|\n| <=50K| 37| 40| Masters| 0|\n| <=50K| 49| 16| 9th| 0|\n| >50K| 52| 45| HS-grad| 1|\n| >50K| 31| 50| Masters| 1|\n| >50K| 42| 40|Bachelors| 1|\n+------+---+--------------+---------+-----------+\nonly showing top 10 rows\n\nPrint a statistical summary of a few columns\n+-------+------+------------------+------------------+------------+-------------------+\n|summary|income| age| hours_per_week| education| income_code|\n+-------+------+------------------+------------------+------------+-------------------+\n| count| 32561| 32561| 32561| 32561| 32561|\n| mean| null| 38.58164675532078|40.437455852092995| null| 0.2408095574460244|\n| stddev| null|13.640432553581356|12.347428681731838| null|0.42758148856469247|\n| min| <=50K| 17| 1| 10th| 0|\n| max| >50K| 90| 99|Some-college| 1|\n+-------+------+------------------+------------------+------------+-------------------+\n\nCalculate Co variance between a few columns to understand features to use\nCovariance between income and hours_per_week is 1.2\nCovariance between income and age is 1.4",
-"output_type": "stream"
+"text": "Select few columns to see the data\n+------+---+--------------+---------+\n|income|age|hours_per_week|education|\n+------+---+--------------+---------+\n| <=50K| 39| 40|Bachelors|\n| <=50K| 50| 13|Bachelors|\n| <=50K| 38| 40| HS-grad|\n| <=50K| 53| 40| 11th|\n| <=50K| 28| 40|Bachelors|\n| <=50K| 37| 40| Masters|\n| <=50K| 49| 16| 9th|\n| >50K| 52| 45| HS-grad|\n| >50K| 31| 50| Masters|\n| >50K| 42| 40|Bachelors|\n+------+---+--------------+---------+\nonly showing top 10 rows\n\nNumber of distinct values for income\n+------+\n|income|\n+------+\n| <=50K|\n| >50K|\n+------+\n\nAdded numeric column(income_code) derived from income column\n+------+---+--------------+---------+-----------+\n|income|age|hours_per_week|education|income_code|\n+------+---+--------------+---------+-----------+\n| <=50K| 39| 40|Bachelors| 0|\n| <=50K| 50| 13|Bachelors| 0|\n| <=50K| 38| 40| HS-grad| 0|\n| <=50K| 53| 40| 11th| 0|\n| <=50K| 28| 40|Bachelors| 0|\n| <=50K| 37| 40| Masters| 0|\n| <=50K| 49| 16| 9th| 0|\n| >50K| 52| 45| HS-grad| 1|\n| >50K| 31| 50| Masters| 1|\n| >50K| 42| 40|Bachelors| 1|\n+------+---+--------------+---------+-----------+\nonly showing top 10 rows\n\nPrint a statistical summary of a few columns\n+-------+------+------------------+------------------+------------+-------------------+\n|summary|income| age| hours_per_week| education| income_code|\n+-------+------+------------------+------------------+------------+-------------------+\n| count| 32561| 32561| 32561| 32561| 32561|\n| mean| null| 38.58164675532078|40.437455852092995| null| 0.2408095574460244|\n| stddev| null|13.640432553581356|12.347428681731838| null|0.42758148856469247|\n| min| <=50K| 17| 1| 10th| 0|\n| max| >50K| 90| 99|Some-college| 1|\n+-------+------+------------------+------------------+------------+-------------------+\n\nCalculate Co variance between a few columns to understand features to use\nCovariance between income and hours_per_week is 1.2\nCovariance between income and age is 1.4",
-"output_type": "stream"
+"text": "Select few columns to see the data\n+------+---+--------------+---------+\n|income|age|hours_per_week|education|\n+------+---+--------------+---------+\n| <=50K| 39| 40|Bachelors|\n| <=50K| 50| 13|Bachelors|\n| <=50K| 38| 40| HS-grad|\n| <=50K| 53| 40| 11th|\n| <=50K| 28| 40|Bachelors|\n| <=50K| 37| 40| Masters|\n| <=50K| 49| 16| 9th|\n| >50K| 52| 45| HS-grad|\n| >50K| 31| 50| Masters|\n| >50K| 42| 40|Bachelors|\n+------+---+--------------+---------+\nonly showing top 10 rows\n\nNumber of distinct values for income\n+------+\n|income|\n+------+\n| <=50K|\n| >50K|\n+------+\n\nAdded numeric column(income_code) derived from income column\n+------+---+--------------+---------+-----------+\n|income|age|hours_per_week|education|income_code|\n+------+---+--------------+---------+-----------+\n| <=50K| 39| 40|Bachelors| 0|\n| <=50K| 50| 13|Bachelors| 0|\n| <=50K| 38| 40| HS-grad| 0|\n| <=50K| 53| 40| 11th| 0|\n| <=50K| 28| 40|Bachelors| 0|\n| <=50K| 37| 40| Masters| 0|\n| <=50K| 49| 16| 9th| 0|\n| >50K| 52| 45| HS-grad| 1|\n| >50K| 31| 50| Masters| 1|\n| >50K| 42| 40|Bachelors| 1|\n+------+---+--------------+---------+-----------+\nonly showing top 10 rows\n\nPrint a statistical summary of a few columns\n+-------+------+------------------+------------------+------------+-------------------+\n|summary|income| age| hours_per_week| education| income_code|\n+-------+------+------------------+------------------+------------+-------------------+\n| count| 32561| 32561| 32561| 32561| 32561|\n| mean| null| 38.58164675532078|40.437455852092995| null| 0.2408095574460244|\n| stddev| null|13.640432553581356|12.347428681731838| null|0.42758148856469247|\n| min| <=50K| 17| 1| 10th| 0|\n| max| >50K| 90| 99|Some-college| 1|\n+-------+------+------------------+------------------+------------+-------------------+\n\nCalculate Co variance between a few columns to understand features to use\nCovariance between income and hours_per_week is 1.2\nCovariance between income and age is 1.4"
 }
 ],
 "execution_count": 4
@@ -77,9 +77,9 @@
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "label = income\nfeatures = ['age', 'hours_per_week', 'education']\nCount of rows that are <=50K 24720\nCount of rows that are >50K 7841",
-"output_type": "stream"
+"text": "label = income\nfeatures = ['age', 'hours_per_week', 'education']\nCount of rows that are <=50K 24720\nCount of rows that are >50K 7841"
 }
 ],
 "execution_count": 5
@@ -91,13 +91,13 @@
 },
 {
 "cell_type": "code",
-"source": "train, test = data.randomSplit([0.75, 0.25], seed=123)\r\n\r\nprint(\"train ({}, {})\".format(train.count(), len(train.columns)))\r\nprint(\"test ({}, {})\".format(test.count(), len(test.columns)))\r\n\r\ntrain_data_path = \"/spark_ml/AdultCensusIncomeTrain\"\r\ntest_data_path = \"/spark_ml/AdultCensusIncomeTest\"\r\n\r\ntrain.write.mode('overwrite').orc(train_data_path)\r\ntest.write.mode('overwrite').orc(test_data_path)\r\nprint(\"train and test datasets saved to {} and {}\".format(train_data_path, test_data_path))",
+"source": "![title](train, test = data.randomSplit([0.75, 0.25], seed=123)\r\n\r\nprint(\"train ({}, {})\".format(train.count(), len(train.columns)))\r\nprint(\"test ({}, {})\".format(test.count(), len(test.columns)))\r\n\r\ntrain_data_path = \"/spark_ml/AdultCensusIncomeTrain\"\r\ntest_data_path = \"/spark_ml/AdultCensusIncomeTest\"\r\n\r\ntrain.write.mode('overwrite').orc(train_data_path)\r\ntest.write.mode('overwrite').orc(test_data_path)\r\nprint(\"train and test datasets saved to {} and {}\".format(train_data_path, test_data_path))",
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "train (24469, 4)\ntest (8092, 4)\ntrain and test datasets saved to /spark_ml/AdultCensusIncomeTrain and /spark_ml/AdultCensusIncomeTest",
-"output_type": "stream"
+"text": "train (24469, 4)\ntest (8092, 4)\ntrain and test datasets saved to /spark_ml/AdultCensusIncomeTrain and /spark_ml/AdultCensusIncomeTest"
 }
 ],
 "execution_count": 6
@@ -113,9 +113,9 @@
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "Using LogisticRegression model with Regularization Rate of 0.1.\nPipeline Created\nModel Trained\nModel is PipelineModel_8c7a4fdc6110\nModel Stages [StringIndexer_7d244350f55d, OneHotEncoderEstimator_559780c5ce92, StringIndexer_53280e6349e6, VectorAssembler_1051507100cb, LogisticRegressionModel: uid = LogisticRegression_5c0eda4eab78, numClasses = 2, numFeatures = 17]",
-"output_type": "stream"
+"text": "Using LogisticRegression model with Regularization Rate of 0.1.\nPipeline Created\nModel Trained\nModel is PipelineModel_8c7a4fdc6110\nModel Stages [StringIndexer_7d244350f55d, OneHotEncoderEstimator_559780c5ce92, StringIndexer_53280e6349e6, VectorAssembler_1051507100cb, LogisticRegressionModel: uid = LogisticRegression_5c0eda4eab78, numClasses = 2, numFeatures = 17]"
 }
 ],
 "execution_count": 7
@@ -131,9 +131,9 @@
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "Area under ROC: 0.7964496884726682\nArea Under PR: 0.5358180243123482\n+------+-----+----------+\n|income|label|prediction|\n+------+-----+----------+\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| <=50K| 0.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n+------+-----+----------+\nonly showing top 20 rows",
-"output_type": "stream"
+"text": "Area under ROC: 0.7964496884726682\nArea Under PR: 0.5358180243123482\n+------+-----+----------+\n|income|label|prediction|\n+------+-----+----------+\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| <=50K| 0.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n| <=50K| 0.0| 1.0|\n| >50K| 1.0| 1.0|\n+------+-----+----------+\nonly showing top 20 rows"
 }
 ],
 "execution_count": 8
@@ -149,9 +149,9 @@
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "saved model to /spark_ml/AdultCensus.mml\nSuccessfully loaded from /spark_ml/AdultCensus.mml",
-"output_type": "stream"
+"text": "saved model to /spark_ml/AdultCensus.mml\nSuccessfully loaded from /spark_ml/AdultCensus.mml"
 }
 ],
 "execution_count": 9
@@ -167,9 +167,9 @@
 "metadata": {},
 "outputs": [
 {
+"output_type": "stream",
 "name": "stdout",
-"text": "persist the mleap bundle from local to hdfs",
-"output_type": "stream"
+"text": "persist the mleap bundle from local to hdfs"
 }
 ],
 "execution_count": 10

0 commit comments
