The Classification and Regression (C&R) Tree node is a
tree-based classification and prediction method. Similar to C5.0, this method uses recursive
partitioning to split the training records into segments with similar output field values. The
C&R Tree node starts by examining the input fields to find the best split, measured by the
reduction in an impurity index that results from the split. The split defines two subgroups, each of
which is subsequently split into two more subgroups, and so on, until one of the stopping criteria
is triggered. All splits are binary (only two subgroups).
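The split search described above can be sketched in a few lines. This is a minimal illustration, not the node's actual implementation: it assumes Gini as the impurity index (the text above says only "an impurity index"), handles a single numeric input field, and uses made-up toy data. The function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

def best_binary_split(values, labels):
    """Try each midpoint between sorted distinct values.

    Returns (threshold, reduction) for the split with the largest
    impurity reduction, or (None, 0.0) if no split improves on the parent.
    """
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        gain = impurity_reduction(labels, left, right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical toy data: age as the input field, purchase intent as the target.
ages = [22, 25, 31, 38, 45, 52]
intent = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_binary_split(ages, intent)
print(threshold, gain)
```

A full tree would apply the same search recursively to each of the two resulting subgroups, across all input fields, until a stopping criterion is met.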
Pruning
C&R Trees give you the option to first grow the tree and then prune based
on a cost-complexity algorithm that adjusts the risk estimate based on the number of terminal nodes.
This method, which enables the tree to grow large before pruning based on more complex criteria, may
result in smaller trees with better cross-validation properties. Increasing the number of terminal
nodes generally reduces the risk for the current (training) data, but the actual risk may be higher
when the model is generalized to unseen data. In an extreme case, suppose you have a separate
terminal node for each record in the training set. The risk estimate would be 0%, since every record
falls into its own node, but the risk of misclassification for unseen (testing) data would almost
certainly be greater than 0%. The cost-complexity measure attempts to compensate for this by
penalizing trees that have more terminal nodes.
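To make the trade-off concrete, here is a small sketch of a cost-complexity measure. It assumes the standard CART form, risk plus a penalty proportional to the number of terminal nodes; the exact formula the node uses is not stated here. The penalty weight `alpha`, the function names, and the candidate subtrees are all hypothetical.

```python
def cost_complexity(risk, n_terminal_nodes, alpha):
    """Penalized risk: training risk plus alpha per terminal node."""
    return risk + alpha * n_terminal_nodes

def select_subtree(candidates, alpha):
    """Pick the (n_terminal_nodes, risk) pair with the lowest penalized risk."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[0], alpha))

# Hypothetical pruned subtrees of one large tree:
# (number of terminal nodes, risk estimate on the training data).
candidates = [(1, 0.40), (3, 0.22), (8, 0.15), (50, 0.00)]

# With no penalty, the largest tree (0% training risk) always wins.
unpenalized = select_subtree(candidates, alpha=0.0)

# With a penalty per terminal node, a smaller tree is preferred.
penalized = select_subtree(candidates, alpha=0.02)
print(unpenalized, penalized)
```

In this toy setup the 50-node tree is chosen when `alpha` is 0, while a penalty of 0.02 per terminal node favors the 3-node tree, even though its training risk is higher.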
Example. A cable TV company has commissioned a
marketing study to determine which customers would buy a subscription to an interactive news service
via cable. Using the data from the study, you can create a flow in which the target field is the
intent to buy the subscription and the predictor fields include age, sex, education, income
category, hours spent watching television each day, and number of children. By applying a C&R
Tree node to the flow, you will be able to predict and classify the responses to get the highest
response rate for your campaign.
Requirements. To train a C&R Tree model, you need
one or more Input fields and exactly one Target field. Target and
input fields can be continuous (numeric range) or categorical. Fields set to Both
or None are ignored. Fields used in the model must have their types fully
instantiated, and any ordinal (ordered set) fields used in the model must have numeric storage (not
string). If necessary, the Reclassify node can be used to convert them.
Strengths. C&R Tree models are quite robust in the
presence of problems such as missing data and large numbers of fields. They usually do not require
long training times to estimate. In addition, C&R Tree models tend to be easier to understand
than some other model types: the rules derived from the model have a very straightforward
interpretation. Unlike C5.0, C&R Tree can accommodate continuous as well as categorical output
fields.