To build a flow that will create a model, we need at least three elements:
- A Data Asset node that reads in data from an external source, in this case a .csv data file
- An Import or Type node that specifies field properties, such as measurement level (the type of data that the field contains), and the role of each field as a target or input in modeling
- A modeling node that generates a model nugget when the flow runs
In this example, we're using a CHAID modeling node. CHAID, or Chi-squared Automatic Interaction Detection, is a classification method that builds decision trees by using chi-square statistics to determine the best places to split the tree.
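To make the idea of a chi-square split concrete, here is a minimal Python sketch that computes the Pearson chi-square statistic for two candidate predictors and picks the one more strongly associated with the target. It is not the CHAID node's implementation: the counts are made up for illustration, "Age group" is a hypothetical binned field, and a full CHAID implementation does more (for example, merging similar categories and comparing adjusted p-values rather than raw statistics).

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and target were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts for illustration only.
# Rows are predictor categories; columns are target classes (Bad, Good).
candidates = {
    "Income level": [[80, 20], [50, 50], [20, 80]],
    "Age group": [[55, 45], [50, 50], [45, 55]],  # hypothetical binned field
}

scores = {name: chi_square(table) for name, table in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Both tables have the same shape, so their statistics can be compared directly; tables with different numbers of categories would need to be compared through p-values instead.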
If measurement levels are specified in the source node, the separate Type node can be eliminated. Functionally, the result is the same.
This flow also has Table and Analysis nodes that will be used to view the scoring results after the model nugget has been created and added to the flow.
The Data Asset import node reads data in from the sample tree_credit.csv data file.
The Type node specifies the measurement level for each field. The measurement level is a category that indicates the type of data in the field. Our source data file uses three different measurement levels:
A Continuous field (such as the Age field) contains continuous numeric values, while a Nominal field (such as the Credit rating field) has two or more distinct values, for example Bad, Good, or No credit history. An Ordinal field (such as the Income level field) describes data with multiple distinct values that have an inherent order, in this case Low, Medium, and High.
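The practical difference between the three measurement levels is whether values can be ordered. The following Python sketch illustrates this with a hypothetical metadata dictionary (FIELDS) and a hypothetical helper (compare); neither is part of the product, and only the three fields named above are included.

```python
# Hypothetical field metadata mirroring the three fields named above.
FIELDS = {
    "Age": {"measurement": "continuous"},
    "Credit rating": {
        "measurement": "nominal",
        "values": ["Bad", "Good", "No credit history"],
    },
    "Income level": {
        "measurement": "ordinal",
        "values": ["Low", "Medium", "High"],  # listed from lowest to highest
    },
}


def compare(field, a, b):
    """Return -1, 0, or 1 for ordered fields; raise TypeError for nominal ones."""
    meta = FIELDS[field]
    if meta["measurement"] == "continuous":
        # Numeric values have a natural order
        return (a > b) - (a < b)
    if meta["measurement"] == "ordinal":
        # Order comes from the position in the declared value list
        rank_a = meta["values"].index(a)
        rank_b = meta["values"].index(b)
        return (rank_a > rank_b) - (rank_a < rank_b)
    raise TypeError(f"{field} is nominal: its values have no order")


print(compare("Income level", "Low", "High"))
print(compare("Age", 40, 30))
```

Calling compare on the Credit rating field raises a TypeError, because nominal values are distinct labels with no inherent order.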
For each field, the Type node also specifies a role to indicate the part that each field plays in modeling. The role is set to Target for the field Credit rating, which is the field that indicates whether or not a given customer defaulted on the loan. This is the target, or the field for which we want to predict the value.
Role is set to Input for the other fields. Input fields are sometimes known as predictors, or fields whose values are used by the modeling algorithm to predict the value of the target field.
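As a rough illustration of how roles partition the fields, this Python sketch separates one target from the inputs. The ROLES dictionary and the split_by_role helper are hypothetical names used only for this example, and only the three fields mentioned in this section are listed.

```python
# Hypothetical role assignments for the three fields named in this example.
ROLES = {
    "Credit rating": "target",
    "Age": "input",
    "Income level": "input",
}


def split_by_role(roles):
    """Return (target field name, list of input field names)."""
    targets = [name for name, role in roles.items() if role == "target"]
    if len(targets) != 1:
        raise ValueError("this sketch expects exactly one target field")
    inputs = [name for name, role in roles.items() if role == "input"]
    return targets[0], inputs


target, inputs = split_by_role(ROLES)
print("Predict:", target)
print("Using:", inputs)
```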
The CHAID modeling node generates the model. In the node's properties, under FIELDS, the option Use custom field roles is available. We could select this option and change the field roles, but for this example we'll use the default targets and inputs as specified in the Type node.
- Double-click the CHAID node (named Creditrating). The node properties are displayed.
Several options here let us specify the kind of model we want to build.
We want a brand-new model, so under OBJECTIVES we'll use the default option Build new model.
We want a single, standard decision tree model without any enhancements, so we'll also use the default objective option Create a standard model.
For this example, we want to keep the tree fairly simple, so we'll limit the tree growth by raising the minimum number of cases for parent and child nodes.
- Under STOPPING RULES, select Use absolute value.
- Set Minimum records in parent branch to 400.
- Set Minimum records in child branch to 200.
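The effect of these two stopping rules can be sketched in a few lines of Python. This is an illustration of the logic, not the node's implementation: the can_split function is a hypothetical name, and it assumes that a count exactly equal to the minimum is allowed.

```python
MIN_PARENT = 400  # Minimum records in parent branch
MIN_CHILD = 200   # Minimum records in child branch


def can_split(parent_count, child_counts,
              min_parent=MIN_PARENT, min_child=MIN_CHILD):
    """Return True if a proposed split satisfies both stopping rules.

    Assumes counts equal to the minimum are allowed (an assumption of
    this sketch).
    """
    if parent_count < min_parent:
        # The branch being split is already too small
        return False
    # Every branch produced by the split must be large enough
    return all(count >= min_child for count in child_counts)


print(can_split(500, [300, 200]))  # both rules satisfied
print(can_split(350, [200, 150]))  # parent has too few records
print(can_split(500, [350, 150]))  # one child has too few records
```

Raising either minimum rejects more candidate splits, which is why the resulting tree has fewer branches.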
We can use all the other default options for this example, so click Save and then click the Run button on the toolbar to create the model. (Alternatively, right-click the CHAID node and choose Run from the context menu.)