Creating machine learning models in Watson Studio notebooks

You can create a machine learning model in a notebook by writing the code and implementing the machine learning API. After a model is created, trained, and deployed, you can run the deployed model in a notebook.

For example, the following sample notebook shows how to load data, create an Apache Spark model, create a pipeline, and train the model.

Load and explore data

In this section you will load the data as an Apache® Spark DataFrame and perform a basic exploration.

Load the data into a Spark DataFrame: use the wget package to download the data file to GPFS, and then use the read method to load it.

First, install the required packages by running the following code. You need to run it only once.

!pip install wget --user --upgrade

Then you can run the following commands:

Input:

import wget
import os

filename = 'GoSales_Tx_NaiveBayes.csv'

# Download the sample data file only if it is not already present on GPFS.
if not os.path.isfile(filename):
    link_to_data = 'https://apsportal.ibm.com/exchange-api/v1/entries/8044492073eb964f46597b4be06ff5ea/data?accessKey=9561295fa407698694b1e254d0099600'
    filename = wget.download(link_to_data)

print(filename)

Output:

GoSales_Tx_NaiveBayes.csv

The CSV file GoSales_Tx_NaiveBayes.csv is available on GPFS now. Load the file into an Apache® Spark DataFrame by using the following code.

Input:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the CSV file, treating the first row as the header and inferring column types.
df_data = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load(filename)
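On Spark 2.0 or later, the built-in csv reader is an equivalent shorthand (a minor convenience; the long format class name above works the same way):

# Equivalent shorthand on Spark 2.0+.
df_data = spark.read.csv(filename, header=True, inferSchema=True)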

Explore the loaded data by using the following Apache® Spark DataFrame methods:

  • print the schema
  • preview the top 20 records
  • count all records

Input:

df_data.printSchema()

Output:

root
 |-- PRODUCT_LINE: string (nullable = true)
 |-- GENDER: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- MARITAL_STATUS: string (nullable = true)
 |-- PROFESSION: string (nullable = true)

As you can see, the data contains five fields. The PRODUCT_LINE field is the label that you want to predict.
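Because PRODUCT_LINE is the label, it is worth checking how its values are distributed. A quick optional check, using only standard DataFrame methods:

# Show how many records fall into each product line, most frequent first.
df_data.groupBy('PRODUCT_LINE').count().orderBy('count', ascending=False).show()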

Input:

df_data.show()

Output:

+--------------------+------+---+--------------+------------+
|        PRODUCT_LINE|GENDER|AGE|MARITAL_STATUS|  PROFESSION|
+--------------------+------+---+--------------+------------+
|Personal Accessories|     M| 27|        Single|Professional|
|Personal Accessories|     F| 39|       Married|       Other|
|Mountaineering Eq...|     F| 39|       Married|       Other|
|Personal Accessories|     F| 56|   Unspecified| Hospitality|
|      Golf Equipment|     M| 45|       Married|     Retired|
|      Golf Equipment|     M| 45|       Married|     Retired|
|   Camping Equipment|     F| 39|       Married|       Other|
|   Camping Equipment|     F| 49|       Married|       Other|
|  Outdoor Protection|     F| 49|       Married|       Other|
|      Golf Equipment|     M| 47|       Married|     Retired|
|      Golf Equipment|     M| 47|       Married|     Retired|
|Mountaineering Eq...|     M| 21|        Single|      Retail|
|Personal Accessories|     F| 66|       Married|       Other|
|   Camping Equipment|     F| 35|       Married|Professional|
|Mountaineering Eq...|     M| 20|        Single|       Sales|
|Mountaineering Eq...|     M| 20|        Single|       Sales|
|Mountaineering Eq...|     M| 20|        Single|       Sales|
|Personal Accessories|     F| 37|        Single|       Other|
|   Camping Equipment|     M| 42|       Married|       Other|
|   Camping Equipment|     F| 24|       Married|      Retail|
+--------------------+------+---+--------------+------------+
only showing top 20 rows

Input:

df_data.count()

Output:

60252

As you can see, the data set contains 60252 records.

Create an Apache® Spark machine learning model

In this section you will learn how to prepare data, create an Apache® Spark machine learning pipeline, and train a model.

Prepare data

In this subsection you will split your data into train and test datasets.

Input:

# Split the data into train (80%) and test (20%) sets, using 24 as the random seed.
split_data = df_data.randomSplit([0.8, 0.2], 24)
train_data = split_data[0]
test_data = split_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

Output:

Number of training records: 48176
Number of testing records : 12076

As you can see, your data has been successfully split into two datasets:

  • The train data set, which is the largest group, is used for training.
  • The test data set will be used for model evaluation and is used to test the assumptions of the model.
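Because fitting the pipeline in the next section scans the training data several times, you can optionally cache it first. This is a standard Spark optimization, not a required step:

# Keep the training set in memory across the repeated passes made during fitting.
train_data.cache()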

Create pipeline and train a model

In this section you will create an Apache® Spark machine learning pipeline and then train the model.

In the first step you need to import the Apache® Spark machine learning packages that will be needed in the subsequent steps.

Input:

from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

In the following step, convert all the string fields to numeric ones by using the StringIndexer transformer.

Input:

stringIndexer_label = StringIndexer(inputCol="PRODUCT_LINE", outputCol="label").fit(df_data)
stringIndexer_prof = StringIndexer(inputCol="PROFESSION", outputCol="PROFESSION_IX")
stringIndexer_gend = StringIndexer(inputCol="GENDER", outputCol="GENDER_IX")
stringIndexer_mar = StringIndexer(inputCol="MARITAL_STATUS", outputCol="MARITAL_STATUS_IX")
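Note that stringIndexer_label is fit immediately so that its label mapping is available later for the IndexToString converter. If you want to inspect that mapping, the fitted model exposes it through its labels attribute:

# Index 0 corresponds to the most frequent product line.
print(stringIndexer_label.labels)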

In the following step, create a feature vector by combining all features together.

Input:

vectorAssembler_features = VectorAssembler(inputCols=["GENDER_IX", "AGE", "MARITAL_STATUS_IX", "PROFESSION_IX"], outputCol="features")
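If you want to see what the assembler produces before building the full pipeline, you can apply the indexers manually and preview a few feature vectors. This sanity check is optional:

# Apply the three string indexers, then assemble and preview the feature vectors.
indexed = stringIndexer_prof.fit(df_data).transform(df_data)
indexed = stringIndexer_gend.fit(indexed).transform(indexed)
indexed = stringIndexer_mar.fit(indexed).transform(indexed)
vectorAssembler_features.transform(indexed).select('features').show(3, truncate=False)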

Next, define the estimator you want to use for classification. Random Forest is used in the following example.

Input:

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
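The example above uses the default settings (for example, 20 trees). If you want to experiment, RandomForestClassifier accepts the usual tuning parameters; the values below are illustrative only:

# An illustrative, more heavily tuned variant of the same estimator.
rf_tuned = RandomForestClassifier(labelCol="label", featuresCol="features",
                                  numTrees=50, maxDepth=10, seed=24)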

Finally, convert the indexed labels back to the original labels by using the IndexToString transformer.

Input:

labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)

Let's build the pipeline now. A pipeline consists of transformers and an estimator.

Input:

pipeline_rf = Pipeline(stages=[
    stringIndexer_label,
    stringIndexer_prof,
    stringIndexer_gend,
    stringIndexer_mar,
    vectorAssembler_features,
    rf,
    labelConverter])

Now, you can train your Random Forest model by using the previously defined pipeline and train data.

Input:

model_rf = pipeline_rf.fit(train_data)
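If you are on Spark 2.0 or later, you can optionally inspect which features the trained forest relies on. The fitted Random Forest model is the second-to-last pipeline stage (the last stage is the label converter):

# featureImportances is ordered like the VectorAssembler inputs:
# GENDER_IX, AGE, MARITAL_STATUS_IX, PROFESSION_IX.
rf_model = model_rf.stages[-2]
print(rf_model.featureImportances)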

You can now check the accuracy of the model by evaluating it against the test data.

Input:

predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)

print("Accuracy = %g" % accuracy)

Output:

Accuracy = 0.584796
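Beyond the single accuracy number, it can help to eyeball a few predictions next to the true labels. This uses only columns that the pipeline already adds:

# Compare true labels with predicted labels and class probabilities.
predictions.select('PRODUCT_LINE', 'predictedLabel', 'probability').show(5, truncate=False)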

Next steps

After you create a model, you can deploy it so that it can be used to create applications.
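Deployment to Watson Machine Learning is described in the deployment documentation. As a minimal local sketch (assuming Spark 2.0 or later; the directory name is illustrative), you can also persist the fitted pipeline with the standard Spark ML persistence API and reload it later for scoring:

from pyspark.ml import PipelineModel

# Save the fitted pipeline; the directory name is illustrative.
model_rf.write().overwrite().save('go_sales_rf_model')

# Reload it later and score new data with transform().
loaded_model = PipelineModel.load('go_sales_rf_model')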

Interested in more sample code? Check out our extensive offering of sample notebooks and sample applications.