Model builder overview

The model builder in IBM Watson Studio guides you, step by step, through building a model that uses popular machine learning algorithms. Just upload your training data, and then let the model builder automatically prepare it and recommend techniques that suit it.

 


Building a model

Using automatic mode in model builder, you can build a machine learning model in three steps.

 

Step 1: Supply the training data

Supply your structured, historical data in one of two ways:

  • Upload a .csv file
  • Connect to an IBM Db2 on Cloud database

Example

In this image, a .csv file containing historical data about loans is selected:

Supplying data in model builder
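To illustrate the expected shape of such a file, here is a minimal sketch of a historical-loans .csv loaded with standard Python. The column names other than RISK_LEVEL are hypothetical illustrations, not columns from an actual Watson Studio sample:

```python
import csv
import io

# A tiny stand-in for an uploaded historical-loans .csv file.
# AGE, INCOME, and LOAN_AMOUNT are hypothetical feature columns.
csv_text = """\
AGE,INCOME,LOAN_AMOUNT,RISK_LEVEL
34,52000,10000,LOW
29,31000,15000,MED
45,28000,22000,HIGH
"""

# Each row becomes a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(len(rows))               # number of historical records
print(rows[0]["RISK_LEVEL"])   # the labeled, historical result
```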

 

Step 2: Identify the label column and the feature columns

The label column is the column that contains the labeled, historical results (what your model will predict after it's trained).

The feature columns contain the historical data that the prediction is based on, such as patient symptoms or customer attributes.

Example

In this image, the column called RISK_LEVEL is specified as the label column, and by default all other columns are specified as feature columns:

Specifying the column to predict in model builder
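The split performed in this step can be sketched in a few lines of plain Python. The column names other than RISK_LEVEL are hypothetical; the point is that one column is singled out as the label and all others default to features:

```python
# Hypothetical header and record from a historical-loans file.
header = ["AGE", "INCOME", "LOAN_AMOUNT", "RISK_LEVEL"]
record = ["34", "52000", "10000", "LOW"]

# RISK_LEVEL is the label column; every other column is a feature by default.
label_column = "RISK_LEVEL"
feature_columns = [c for c in header if c != label_column]

label = record[header.index(label_column)]
features = {c: record[header.index(c)] for c in feature_columns}
```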

 

Step 3: Choose the technique

Choose which machine learning technique to apply.

Example

In this image, the multiclass classification technique is recommended because the label column, RISK_LEVEL, contains one of three string values ("LOW", "MED", or "HIGH") in the historical data .csv file:

Choosing a technique in model builder

 

Techniques

Table 1 lists the basic machine learning techniques available in the model builder.

Table 1. Basic machine learning techniques available in the model builder
Binary classification: Classifies data into two categories. (Recommended if the label column of the training data contains two distinct values.) Example use cases:
  • Predict whether or not a customer is likely to cancel their service
  • Predict whether or not someone will be receptive to a given advertising campaign
  • Classify a patient as either at risk for a specific disease or not at risk

Multiclass classification: Classifies data into multiple categories. (Recommended if the label column of the training data contains multiple distinct values.) Example use cases:
  • Classify a customer into one of multiple customer cohorts
  • Predict which service a customer is likely to purchase
  • Assess the level of risk (HIGH, MEDIUM, or LOW) for a given loan application

Regression: Predicts a value from a continuous set of values. (Recommended if the label column of the training data contains a large number of distinct numeric values.) Example use cases:
  • Predict house prices
  • Predict the number of cashiers needed in a store
  • Assess how likely it is that someone is going to purchase a product
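The recommendation rules in Table 1 can be sketched as a simple heuristic. Note that the cutoff of 10 distinct numeric values for "a large number" is a hypothetical threshold, not the model builder's documented rule:

```python
def recommend_technique(label_values, numeric_threshold=10):
    """Suggest a technique from the distinct values in the label column.

    The numeric_threshold of 10 is a hypothetical cutoff for
    "a large number of numeric values", not the actual rule.
    """
    distinct = set(label_values)
    # Crude numeric check: ints, floats, or strings that parse as numbers.
    all_numeric = all(
        isinstance(v, (int, float))
        or str(v).lstrip("-").replace(".", "", 1).isdigit()
        for v in distinct
    )
    if len(distinct) == 2:
        return "binary classification"
    if all_numeric and len(distinct) > numeric_threshold:
        return "regression"
    return "multiclass classification"
```

For example, a RISK_LEVEL column containing "LOW", "MED", and "HIGH" yields "multiclass classification".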

 

Estimators

Using manual mode in model builder, you can also choose one or more specific estimators.

 

Estimators available when you choose the binary classification technique

Table 2. Estimators you can assemble together to manually build a binary classification model in the model builder
Logistic regression: Analyzes a data set in which one or more independent variables determine one of two outcomes. Only binary logistic regression is supported.
Decision tree classifier: Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both binary and multiclass labels, as well as both continuous and categorical features.
Random forest classifier: Constructs multiple decision trees and produces the label that is the mode of the individual trees' predictions. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Gradient boosted tree classifier: Produces a classification prediction model in the form of an ensemble of decision trees. It supports only binary labels, and both continuous and categorical features.
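To make the logistic regression estimator concrete, here is a minimal from-scratch sketch trained with gradient descent. This illustrates the idea of the estimator only; it is not the model builder's actual implementation:

```python
import math

def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Minimal binary logistic regression via gradient descent.

    X: list of feature lists; y: list of 0/1 labels.
    A sketch of the technique, not the model builder's implementation.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid: score -> probability
            err = p - yi                      # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

# Toy data: a single feature whose sign determines the outcome.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w, b = train_logistic_regression(X, y)
```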

 

Estimators available when you choose the multiclass classification technique

Table 3. Estimators you can assemble together to manually build a multiclass classification model in the model builder
Decision tree classifier: Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both binary and multiclass labels, as well as both continuous and categorical features.
Random forest classifier: Constructs multiple decision trees and produces the label that is the mode of the individual trees' predictions. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Naive Bayes: Classifies features based on Bayes' theorem, which assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
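The Naive Bayes idea, Bayes' theorem plus the feature-independence assumption, can be sketched for categorical features with Laplace smoothing. Again, this is an illustrative sketch with a toy weather data set, not the model builder's implementation:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Categorical Naive Bayes: count class and per-feature value frequencies.

    rows: list of feature-value tuples; labels: parallel list of classes.
    """
    class_counts = Counter(labels)
    # feature_counts[class][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def classify(model, row):
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, count in class_counts.items():
        # log P(class) + sum of log P(feature=value | class);
        # the sum reflects the independence assumption. Laplace smoothing
        # (+1) keeps unseen values from zeroing out the probability.
        score = math.log(count / total)
        for i, value in enumerate(row):
            seen = feature_counts[label][i]
            score += math.log((seen[value] + 1) / (count + len(seen) + 1))
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical toy data: (weather, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
```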

 

Estimators available when you choose the regression technique

Table 4. Estimators you can assemble together to manually build a regression model in the model builder
Linear regression: Models the linear relationship between a scalar dependent variable y and one or more explanatory (independent) variables x.
Decision tree regressor: Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both continuous and categorical features.
Random forest regressor: Constructs multiple decision trees and produces the mean of the individual trees' predictions. It supports both continuous and categorical features.
Gradient boosted tree regressor: Produces a regression prediction model in the form of an ensemble of decision trees. It supports both continuous and categorical features.
Isotonic regression: Models the isotonic relationship of a sequence of observations by fitting a free-form line to the observations under the following constraints: the fitted line must be non-decreasing everywhere, and it must lie as close to the observations as possible.
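The isotonic constraint described above is classically satisfied by the pool adjacent violators algorithm, sketched here in plain Python. This illustrates the technique; it is not the model builder's actual implementation:

```python
def isotonic_fit(y):
    """Pool adjacent violators: fit a non-decreasing sequence to y.

    Minimizes squared error subject to the fitted values never
    decreasing, matching the isotonic regression description above.
    """
    # Each block holds [sum, count]; whenever a new value would make the
    # fit decrease, merge blocks and replace them with their mean.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted
```

For example, the dip in `[1, 3, 2, 4]` is pooled into a flat segment: `[1.0, 2.5, 2.5, 4.0]`.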

 

Automatic data preparation

The model builder automatically prepares your training data (in both automatic mode and manual mode):

  1. Extracts the first 1000 records as a data sample to determine if string categories exceed the maximum allowed.
  2. Handles missing string values, and defines missing values as a separate category.
  3. Applies mean value category encoding to numeric fields, and normalizes feature columns returned by the category encoding operation.
  4. If the label column contains strings, encodes the label column to a column of indices. (See: StringIndexer.)
  5. Combines numeric columns into a single vector column. (See: VectorAssembler.)
  6. Converts categorical features to category indices. (See: VectorIndexer.)
  7. Groups all fields generated by StringIndexer and VectorIndexer in a separate vector field, and then filters out temporary fields generated by StringIndexer and VectorIndexer.
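Steps 4 and 5 can be illustrated with a minimal pure-Python sketch of the indexing and assembly ideas. These functions only mirror the concepts behind Spark ML's StringIndexer and VectorAssembler; they are not those transformers' actual APIs:

```python
from collections import Counter

def string_indexer(values):
    """Map each distinct string to an index, most frequent string first.

    A sketch of the StringIndexer idea, not its API.
    """
    order = [v for v, _ in Counter(values).most_common()]
    mapping = {v: i for i, v in enumerate(order)}
    return [mapping[v] for v in values], mapping

def assemble_vector(*columns):
    """Combine numeric columns into one vector per row.

    A sketch of the VectorAssembler idea, not its API.
    """
    return [list(row) for row in zip(*columns)]

# Encode a string label column, then assemble two hypothetical
# numeric feature columns into per-row vectors.
labels, mapping = string_indexer(["LOW", "HIGH", "LOW", "MED", "LOW"])
vectors = assemble_vector([34, 29, 45, 31, 52],
                          [52000, 31000, 28000, 40000, 61000])
```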