AutoAI tutorial: Build a Binary Classification Model

This tutorial guides you through training a model to predict if a customer is likely to buy a tent from an outdoor equipment store.

Create an AutoAI experiment to build a model that analyzes your data and selects the best model type and algorithms to produce, train, and optimize pipelines. After you review the pipelines, save one as a model, deploy it, and then test it to get a prediction.

Watch this video to see a preview of the steps in this tutorial.

This video provides a visual method to learn the concepts and tasks in this documentation.

Video transcript
Time	Transcript
00:00	In this video, you will see how to build a binary classification model that assesses the likelihood that a customer of an outdoor equipment company will buy a tent.
00:11	This video uses a data set called "GoSales", which you'll find in the Resource Hub.
00:16	View the data set.
00:20	The feature columns are "GENDER", "AGE", "MARITAL_STATUS", and "PROFESSION" and contain the attributes on which the machine learning model will base predictions.
00:31	The label columns are "IS_TENT", "PRODUCT_LINE", and "PURCHASE_AMOUNT" and contain historical outcomes that the models could be trained to predict.
00:44	Add this data set to the "Machine Learning" project and then go to the project.
00:56	You'll find the GoSales.csv file with your other data assets.
01:02	Add to the project an "AutoAI experiment".
01:08	This project already has the Watson Machine Learning service associated.
01:13	If you haven't done that yet, first, watch the video showing how to run an AutoAI experiment based on a sample.
01:22	Just provide a name for the experiment and then click "Create".
01:30	The AutoAI experiment builder displays.
01:33	You first need to load the training data.
01:36	In this case, the data set will be from the project.
01:40	Select the GoSales.csv file from the list.
01:45	AutoAI reads the data set and lists the columns found in the data set.
01:50	Since you want the model to predict the likelihood that a given customer will purchase a tent, select "IS_TENT" as the column to predict.
01:59	Now, edit the experiment settings.
02:03	First, look at the settings for the data source.
02:06	If you have a large data set, you can run the experiment on a subsample of rows and you can configure how much of the data will be used for training and how much will be used for evaluation.
02:19	The default is a 90%/10% split, where 10% of the data is reserved for evaluation.
02:27	You can also select which columns from the data set to include when running the experiment.
02:35	On the "Prediction" panel, you can select a prediction type.
02:39	In this case, AutoAI analyzed your data and determined that the "IS_TENT" column contains true-false information, making this data suitable for a "Binary classification" model.
02:52	The positive class is "TRUE" and the recommended metric is "Accuracy".
03:01	If you'd like, you can choose specific algorithms to consider for this experiment and the number of top algorithms for AutoAI to test, which determines the number of pipelines generated.
03:16	On the "Runtime" panel, you can review other details about the experiment.
03:21	In this case, accepting the default settings makes the most sense.
03:25	Now, run the experiment.
03:28	AutoAI first loads the data set, then splits the data into training data and holdout data.
03:37	Then wait, as the "Pipeline leaderboard" fills in to show the generated pipelines using different estimators, such as XGBoost classifier, or enhancements such as hyperparameter optimization and feature engineering, with the pipelines ranked based on the accuracy metric.
03:58	Hyperparameter optimization is a mechanism for automatically exploring a search space for potential hyperparameters, building a series of models and comparing the models using metrics of interest.
04:10	Feature engineering attempts to transform the raw data into the combination of features that best represents the problem to achieve the most accurate prediction.
04:21	Okay, the run has completed.
04:24	By default, you'll see the "Relationship map".
04:28	But you can swap views to see the "Progress map".
04:32	You may want to start with comparing the pipelines.
04:36	This chart provides metrics for the eight pipelines, viewed by cross validation score or by holdout score.
04:46	You can see the pipelines ranked based on other metrics, such as average precision.
04:55	Back on the "Experiment summary" tab, expand a pipeline to view the model evaluation measures and ROC curve.
05:03	During AutoAI training, your data set is split into two parts: training data and holdout data.
05:11	The training data is used by the AutoAI training stages to generate the model pipelines, and cross validation scores are used to rank them.
05:21	After training, the holdout data is used for the resulting pipeline model evaluation and computation of performance information, such as ROC curves and confusion matrices.
05:33	You can view an individual pipeline to see more details in addition to the confusion matrix, precision recall curve, model information, and feature importance.
05:46	This pipeline had the highest ranking, so you can save this as a machine learning model.
05:52	Just accept the defaults and save the model.
05:56	Now that you've trained the model, you're ready to view the model and deploy it.
06:04	The "Overview" tab shows a model summary and the input schema.
06:09	To deploy the model, you'll need to promote it to a deployment space.
06:15	Select the deployment space from the list, add a description for the model, and click "Promote".
06:24	Use the link to go to the deployment space.
06:28	Here's the model you just created, which you can now deploy.
06:33	In this case, it will be an online deployment.
06:37	Just provide a name for the deployment and click "Create".
06:41	Then wait, while the model is deployed.
06:44	When the model deployment is complete, view the deployment.
06:49	On the "API reference" tab, you'll find the scoring endpoint for future reference.
06:56	You'll also find code snippets for various programming languages to utilize this deployment from your application.
07:05	On the "Test" tab, you can test the model prediction.
07:09	You can either enter test input data or paste JSON input data, and click "Predict".
07:20	This shows that there's a very high probability that the first customer will buy a tent and a very high probability that the second customer will not buy a tent.
07:33	And back in the project, you'll find the AutoAI experiment and the model on the "Assets" tab.
07:44	Find more videos in the Cloud Pak for Data as a Service documentation.
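
The holdout split and cross-validation that the transcript describes can be sketched in plain Python. This is a minimal illustration, not AutoAI's implementation: the helper names, the fixed seed, and the three-fold count are assumptions chosen for the example. Only the 90%/10% split comes from the transcript.

```python
import random

def split_holdout(rows, holdout_fraction=0.1, seed=42):
    """Shuffle rows and reserve a fraction as holdout data (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    # The first n_holdout shuffled rows are held out; the rest are used for training.
    return shuffled[n_holdout:], shuffled[:n_holdout]

def k_fold_indices(n_rows, k=3):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

rows = list(range(100))  # stand-in for 100 data rows
training, holdout = split_holdout(rows)
folds = list(k_fold_indices(len(training), k=3))
print(len(training), len(holdout), [len(v) for _, v in folds])
```

The holdout rows never appear in any fold, which mirrors the idea that cross-validation scores rank the pipelines while the holdout data is kept for the final evaluation.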

Overview of the data sets

The sample data is structured (in rows and columns) and saved in a .csv file format.

You can view the sample data file in a text editor or spreadsheet program:
Spreadsheet of the Go Sales data set that contains customer and purchase information

What do you want to predict?

Choose the column whose values your model predicts.

In this tutorial, the model predicts the values of the IS_TENT column:

IS_TENT: Whether the customer bought a tent

The model that is built in this tutorial predicts whether a customer is likely to purchase a tent.
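
Before you run the experiment, you might want to confirm that the target column really holds two classes. The sketch below counts the distinct values of IS_TENT by using only the Python standard library. The rows are made up for illustration: the column names follow this tutorial, but the values, the column order, and the helper name `target_value_counts` are assumptions, not the real GoSales data.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; not taken from the GoSales file.
SAMPLE_CSV = """GENDER,AGE,MARITAL_STATUS,PROFESSION,IS_TENT,PRODUCT_LINE,PURCHASE_AMOUNT
M,27,Single,Professional,TRUE,Camping Equipment,144.78
F,39,Married,Other,FALSE,Golf Equipment,123.45
F,56,Married,Hospitality,FALSE,Personal Accessories,98.10
"""

def target_value_counts(csv_text, target="IS_TENT"):
    """Count how often each value appears in the target column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[target] for row in reader)

counts = target_value_counts(SAMPLE_CSV)
is_binary = len(counts) == 2  # two distinct values suggest binary classification
print(dict(counts), is_binary)
```

To check the downloaded file instead, read its text with `open(path, newline="").read()` and pass it to the same function.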

Tasks overview

This tutorial presents the basic steps for building and training a machine learning model with AutoAI:

Create a project
Create an AutoAI experiment
Train the experiment
Deploy the trained model
Test the deployed model
Create a batch job to score the model

Task 1: Create a project

From the Resource hub, download the GoSales data set file to your local computer.
From the Projects page, select New Project to create a project.
b. Type your project name.
c. Click Create.

Task 2: Create an AutoAI experiment

On the Assets tab from within your project, click New asset > Build machine learning models automatically.
Specify a name and optional description for your new experiment.
Select the Associate a Machine Learning service instance link to associate the Watson Machine Learning service instance with your project. Click Reload to confirm your configuration.
To add a data source, you can choose one of these options:
a. If you downloaded your file locally, upload the training data file, GoSales.csv, from your local computer. Drag the file onto the data panel or click browse and follow the prompts.
b. If you already uploaded your file to your project, click select from project, then select the data asset tab and choose GoSales.csv.

Task 3: Train the experiment

In Configuration details, select No for the option to create a Time Series Forecast.
Choose IS_TENT as the column to predict. AutoAI analyzes your data and determines that the IS_TENT column contains True and False information, making this data suitable for a binary classification model. The default metric for a binary classification is ROC/AUC.
Click Run experiment. As the model trains, an infographic shows the process of building the pipelines.

Note:
You might see slight differences in results based on the Cloud Pak for Data platform and version you use.

For a list of algorithms or estimators that are available with each machine learning technique in AutoAI, see AutoAI implementation detail.
When all the pipelines are created, you can compare their accuracy on the Pipeline leaderboard.
Select the pipeline with Rank 1 and click Save as to create your model. Then, select Create. This option saves the pipeline under the Models section in the Assets tab.
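
To make the ROC/AUC metric concrete, the following sketch computes the area under the ROC curve from its pairwise definition: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. It is a plain-Python illustration, not the code AutoAI runs, and the function name, labels, and scores are invented for the example.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons (illustrative helper)."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Invented example: True means the customer bought a tent.
labels = [True, False, True, False]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores))  # 3 of 4 pairs are ordered correctly: 0.75
```

A value of 1.0 means that every buyer is scored above every non-buyer, and 0.5 is what random scoring would produce on average. The pairwise loop is quadratic, so it suits small examples rather than large data sets.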

Task 4: Deploy the trained model

You can deploy the model from the model details page. You can access the model details page in one of these ways:
1. Click the model’s name in the notification that is displayed when you save the model.
2. Open the Assets tab for the project, select the Models section, and select the model’s name.
Click the Promote to deployment space icon, and then select an existing space or create a new space where the model will be deployed.
1. Type a name for the deployment space.
2. Associate it with a Machine Learning Service.
3. Click Create.
After you create your deployment space or select an existing one, select Promote.
Click the deployment space link from the notification.
From the Assets tab of the deployment space:
1. Hover over the model’s name and click the Deploy icon.
  1. In the page that opens, complete the fields:
    1. Select Online as the Deployment type.
    2. Specify a name for the deployment.
    3. Click Create.

Creating an online deployment space to promote the model

After the deployment is complete, click Deployments and select the deployment name to view the details page.

Task 5: Test the deployed model

You can test the deployed model from the deployment details page:

On the Test tab of the deployment details page, complete the form with test values, or click the Terminal icon and enter the following JSON input data.
```
{"input_data":[{
  "fields": ["GENDER","AGE","MARITAL_STATUS","PROFESSION","PRODUCT_LINE","PURCHASE_AMOUNT"],
  "values": [["M",27,"Single","Professional","Camping Equipment",144.78]]
}]}
```
Note: The test data replicates the data fields for the model, except for the prediction field.
Click Predict to predict whether a customer with the entered attributes is likely to buy a tent. The resulting prediction indicates that a customer with the attributes entered has a high probability of purchasing a tent.

Result of the Tent model prediction. Prediction equals true, likely to buy a tent
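
The same test payload can be assembled in code before it is sent to the scoring endpoint that the API reference tab shows. The sketch below builds the JSON body and an HTTP POST request with the Python standard library, but does not send it. The endpoint URL, the token value, the bearer-token header, and both helper names are placeholders and assumptions; copy the real endpoint and authentication details from your deployment.

```python
import json
import urllib.request

FIELDS = ["GENDER", "AGE", "MARITAL_STATUS", "PROFESSION",
          "PRODUCT_LINE", "PURCHASE_AMOUNT"]

def build_payload(fields, rows):
    """Wrap field names and value rows in the input_data structure used by the tutorial."""
    return {"input_data": [{"fields": fields, "values": rows}]}

def build_scoring_request(endpoint_url, bearer_token, payload):
    """Create (but do not send) a POST request; bearer-token auth is an assumption."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {bearer_token}",
        },
        method="POST",
    )

payload = build_payload(
    FIELDS,
    [["M", 27, "Single", "Professional", "Camping Equipment", 144.78]],
)
# Placeholder endpoint and token; replace both with values from your deployment.
request = build_scoring_request(
    "https://example.invalid/scoring-endpoint", "YOUR_TOKEN", payload
)
print(request.get_method(), request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would submit it. That call is left out here so that the sketch runs without network access or credentials.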

Task 6: Create a batch job to score the model

For a batch deployment, you provide input data, also known as the model payload, in a CSV file. The data must be structured like the training data, with the same column headers. The batch job processes each row of data and creates a corresponding prediction.

In a real scenario, you would submit new data to the model to get a score. However, this tutorial uses the same training data GoSales-updated.csv that you downloaded as part of the tutorial setup. Ensure that you delete the IS_TENT column and save the file before you upload it to the batch job. When deploying a model, you can add the payload data to a project, upload it to a space, or link to it in a storage repository such as a Cloud Object Storage bucket. For this tutorial, upload the file directly to the deployment space.
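
If you prefer to remove the IS_TENT column with a script rather than a spreadsheet program, a short standard-library sketch such as the following would work. The helper name `drop_column` and the two-row sample are made up for illustration. The function reads from one file-like object and writes to another, so the demo runs in memory.

```python
import csv
import io

def drop_column(source, target, column="IS_TENT"):
    """Copy CSV data from source to target, omitting one named column."""
    reader = csv.DictReader(source)
    fieldnames = reader.fieldnames or []
    if column not in fieldnames:
        raise KeyError(f"column {column!r} not found in header")
    kept = [name for name in fieldnames if name != column]
    # extrasaction="ignore" lets each row keep its dropped key without raising an error.
    writer = csv.DictWriter(target, fieldnames=kept,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

# Made-up rows for illustration only.
source = io.StringIO("GENDER,AGE,IS_TENT\nM,27,TRUE\nF,39,FALSE\n")
target = io.StringIO()
drop_column(source, target)
print(target.getvalue())
```

For a file on disk, open the input and output paths with `newline=""` and pass the two file objects to the same function.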

Step 1: Add data to space

From the Assets page of the deployment space:

Click Add to space then choose Data.
Upload the GoSales-updated.csv file that you saved locally.

Step 2: Create the batch deployment

Now you can define the batch deployment.

Click the Deploy icon next to the model’s name.
Enter a name for the deployment.
1. Select Batch as the Deployment type.
2. Choose the smallest hardware specification.
3. Click Create.

Step 3: Create the batch job

The batch job runs the deployment. To create the job, you must specify the input data and the name for the output file. You can set up a job to run on a schedule or run immediately.

Click New job.
Specify a name for the job.
Choose the smallest hardware specification.
(Optional) Set a schedule and receive notifications.
Upload the input file: GoSales-updated.csv
Name the output file: GoSales-output.csv
Review and click Create to run the job.

Step 4: View the output

When the deployment status changes to Deployed, return to the Assets page for the deployment space. The file GoSales-output.csv was created and added to your assets list.

Click the Download icon next to the output file and open the file in an editor. You can review the prediction results for the customer information that is submitted for batch processing.

For each case, the returned prediction includes a confidence score that indicates how likely the customer is to buy a tent.

Next steps

Building an AutoAI experiment
