This tutorial builds a logistic regression model, a statistical technique for
classifying records based on the values of input fields. It is analogous to linear regression, but
takes a categorical target field instead of a numeric one.
For example, suppose that a telecommunications provider is concerned about the number of
customers it's losing to competitors. If service usage data can be used to predict which customers
are liable to transfer to another provider, offers can be customized to retain as many customers as
possible.
Preview the tutorial
Watch this video to preview the steps in this tutorial. The user interface shown in the video might
differ slightly from what you see. The video is intended as a companion to the written tutorial and
provides a visual way to learn the concepts and tasks in this documentation.
This tutorial uses the Telecommunications Churn flow in the sample project. The data file
used is telco.csv. The following image shows the sample modeler flow.
Figure 1. Sample modeler flow
This example focuses on using usage data to predict customer loss (churn). Because the target has
two distinct categories, a binomial model is used. If the target has multiple categories, a
multinomial model might be created instead.
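Outside of the Modeler canvas, the same idea can be sketched in Python. The following minimal example, using scikit-learn on synthetic data, is an illustration only: the column names and the rule that generates churn are assumptions, not the actual telco.csv schema.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
tenure = rng.uniform(1, 72, n)        # months of service (illustrative field)
income = rng.uniform(20, 120, n)      # household income, in thousands (illustrative field)
# Synthetic rule: short-tenure customers are more likely to churn.
churn = (rng.uniform(size=n) < 1 / (1 + np.exp(0.1 * (tenure - 24)))).astype(int)

X = np.column_stack([tenure, income])
model = LogisticRegression().fit(X, churn)   # binomial: the target has two categories
# Probability of staying vs. churning for a hypothetical new customer:
print(model.predict_proba([[6, 50]]))
# With more than two target categories, scikit-learn would fit a
# multinomial model instead, mirroring the binomial/multinomial choice above.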
The following image shows the data set used with this modeler flow.
Figure 2. Sample data set
Task 1: Open the sample project
The sample project contains several data sets and sample modeler flows. If you don't already have
the sample project, refer to the Tutorials topic to create it. Then follow these steps to open the
sample project:
In watsonx, from the Navigation menu, choose
Projects > View all Projects.
Click SPSS Modeler Project.
Click the Assets tab to see the data sets and modeler flows.
Check your progress
The following image shows the project Assets tab. You are now ready to work with the sample
modeler flow associated with this tutorial.
Task 2: Examine the Data Asset and Type nodes
The Telecommunications Churn flow includes several nodes. Follow these steps to examine the Data
Asset and Type nodes:
From the Assets tab, open the Telecommunications Churn modeler
flow, and wait for the canvas to load.
Double-click the telco.csv node. This node is a Data Asset node that points to the
telco.csv file in the project.
Review the File format properties.
Optional: Click Preview data to see the full data set.
Double-click the Type node. This node specifies field properties, such as the measurement
level (the type of data that the field contains) and the role of each field as a target or input in
modeling. Make sure that all measurement levels are set correctly. For example, most fields with
values of 0.0 and 1.0 can be regarded as flags, but certain
fields, such as gender, are more accurately viewed as nominal fields with two values.
Figure 3. Measurement levels
Notice that churn is set as a Flag with a Target role. The
role for all other fields is set to Input.
Double-click the churn (Feature Selection) modeling node to see its
properties. You can use a Feature Selection node to remove predictors or data that don't add
any useful information about the predictor/target relationship. (A rough Python analogue of the
Type and Feature Selection steps appears after this task.)
Hover over the churn (Feature Selection) node, and click the Run icon.
In the Outputs and models pane, click the first model in the list with the name
churn to view the model details.
Check your progress
The following image shows the model details. You are now ready to check the Filter
node.
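For readers who want a rough non-Modeler analogue of the Type and Feature Selection steps, the following pandas and scikit-learn sketch declares a measurement level and ranks candidate predictors against the target. The use of mutual information is an assumption standing in for Modeler's own feature-selection criteria, and the tiny data frame is purely illustrative.

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Illustrative frame; the real telco.csv has many more fields.
df = pd.DataFrame({
    "tenure": [2, 50, 13, 66, 8, 41, 3, 29],
    "gender": [0, 1, 1, 0, 1, 0, 0, 1],   # 0/1 values, but nominal rather than a true flag
    "churn":  [1, 0, 1, 0, 1, 0, 1, 0],   # target: Flag measurement, Target role
})

# Roughly like the Type node: declare the measurement level explicitly.
df["gender"] = df["gender"].astype("category")

# Roughly like the Feature Selection node: score each input against the target.
X = df[["tenure", "gender"]].copy()
X["gender"] = X["gender"].cat.codes
scores = mutual_info_classif(X, df["churn"], discrete_features=[False, True], random_state=0)
print(dict(zip(X.columns, scores)))   # higher score = more informative predictor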
Task 3: Check the Filter node
Only some of the data in the telco.csv data file is useful in predicting churn. You can
use the Filter node to select just the fields that are considered important for use as predictors
(the fields marked as Important in the model generated in the previous task). Follow
these steps to check the Filter node:
Double-click the Important Features (Filter) node to see its properties.
Notice that this node retains only selected fields: tenure,
age, address, income, ed,
employ, and more. Other fields are excluded from this analysis.
Figure 4. Filter node Important Features
Click Cancel.
Double-click the 28 Fields (Data Audit) output node after the Filter node.
Hover over the Data Audit node, and click the Run icon.
In the Outputs and models pane, click the results with the name Data Audit to view
the output.
Look at the % Complete column, which you can use to identify any fields
with large amounts of missing data. In this case, the only field you need to amend is
logtoll, which is less than 50% complete. (A pandas analogue of this check appears after
these steps.)
Figure 5. 28 Fields output
Close the Output.
Double-click the Missing Value Imputation supernode.
Click View supernode.
Double-click the Fill logtoll (Filler) node.
Filler nodes are used to replace
field values and change storage. You can choose to replace values based on a specified CLEM
condition, such as @BLANK(FIELD). Alternatively, you can choose to replace
all blank or null values with a specific value. Filler nodes are often used with the
Type node to replace missing values. (See the sketch after these steps for a rough pandas
equivalent.)
In the Fill in fields section, you can
specify the fields from the data set whose values you want to examine and replace. In this case,
the logtoll field is specified together with a Blank and null values option
under the Replace section.
Figure 6. Missing Value Imputation supernode with Filler
properties
Click Return to previous flow.
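As a rough Python equivalent of the Data Audit node's % Complete check and the Filler node's blank-and-null replacement, consider the following pandas sketch. The fill value used here (the field mean) is an assumption for illustration, not necessarily what the supernode applies.

import numpy as np
import pandas as pd

# Illustrative frame: logtoll is mostly missing, as in telco.csv.
df = pd.DataFrame({
    "tenure":  [2, 50, 13, 66],
    "logtoll": [1.2, np.nan, np.nan, 3.4],
})

# Roughly like the Data Audit node's % Complete column:
print((df.notna().mean() * 100).round(1))   # completeness per field, in percent

# Roughly like the Filler node replacing blank and null values. Filling with
# the field mean is an illustrative choice, not the supernode's actual setting.
df["logtoll"] = df["logtoll"].fillna(df["logtoll"].mean())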
Check your progress
The following image shows the flow. You are now ready to build the model.
Task 4: Build the model
Follow these steps to generate a model nugget from the Logistic node:
Hover over the churn (Logistic) node, and click the Run icon.
In the Outputs and models pane, click the churn model to view the results.
The Variables in the Equation page shows the target (churn) and the inputs (predictor fields)
used by the model. These fields are the subset chosen by the Forwards Stepwise method, not the
complete list that was submitted for consideration.
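Modeler performs the stepwise selection internally; purely as an illustration of the idea, the following sketch greedily adds the predictor that most improves cross-validated accuracy. This is a simplification: real stepwise procedures typically use likelihood-based criteria, and the synthetic data stands in for the churn fields.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn data: 6 candidate inputs, binary target.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # Try adding each remaining field; keep the one that helps most.
    scores = {f: cross_val_score(LogisticRegression(), X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:   # stop when no remaining field improves the model
        break
    selected.append(f)
    remaining.remove(f)
    best_score = score

print("fields in the equation:", selected, "cv accuracy:", round(best_score, 3))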
To assess how well the model fits your data, several diagnostics are available in the expert node
settings when you're building the flow.
Note also that these results are based on the training data only. To assess how well the model
generalizes to other data in the real world, you can use a Partition node to hold out a subset of
records for testing and validation.
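A Partition node isn't part of this sample flow; as a rough Python equivalent, a holdout split can be sketched with scikit-learn's train_test_split. The 80/20 ratio and the synthetic data are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Roughly like a Partition node: hold out 20% of records for testing.
# The 80/20 split ratio is an illustrative choice.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on held-out records:", round(model.score(X_test, y_test), 3))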
This example showed how to use service usage data to predict customer loss (churn). Because the
target has two distinct categories, a binomial model was built.