Random trees models in notebooks
The Random Trees algorithm is a sophisticated modern approach to supervised learning for categorical or continuous targets. The algorithm uses groups of classification or regression trees and randomness to make predictions that are particularly robust when applied to new observations. The IBM SPSS Spark Machine Learning Library implementation features a table of top decision rules for classification models without imbalance handling and measures of relative predictor importance for all models.
The following steps show an example random trees model that you might build, visualize, and interpret.
Step 1. Build a model
The following code shows an example of a random trees model that you might build.
    // Executing "ibm spss/randomtrees" operation
    import com.ibm.spss.ml.classificationandregression.ensemble.RandomTrees

    val randomTrees = RandomTrees()
      .setInputFieldList(Array("income", "gender", "agecat"))
      .setTargetField("approved")
      .setNumTrees(10)
      .setMaxTreeDepth(5)

    val randomTreesModel = randomTrees.fit(data)
The code performs these steps:
- Create a RandomTrees() object and assign it to the value randomTrees.
- Set the inputFieldList to the array of variable names: “income”, “gender”, and “agecat”.
- Set the targetField to the variable “approved”, which is what the example model is trying to classify.
- Set the parameter values. In the example, the number of trees is set to 10 and the maximum tree depth to 5.
- To fit the model, call the fit method on the RandomTrees object and pass in the data set, in this case, data. The fitted model is assigned to the value randomTreesModel.
For syntax options, see Algorithm syntax for random trees.
Step 2. Create model visualizations
After you build your model, the best way to understand it is by calling the toHTML method from the ModelViewer class from the IBM SPSS Spark Machine Learning Library. To call this method, pass your fitted model to ModelViewer.toHTML(), then display the output by using kernel.magics.html(html):
    import com.ibm.spss.scala.ModelViewer

    val html = ModelViewer.toHTML(pc, randomTreesModel)
    kernel.magics.html(html)
where randomTreesModel is the predictive model that you created in the earlier cells in the notebook.
Step 3. Interpret visualizations
The output includes the following charts and tables, which you can review to evaluate the effectiveness of your model:
- Model information table
This table contains information on how the model was fitted, so you can make sure that the model you have is what you intended. It contains information on input settings, including the target field, the frequency and regression weight fields if these have been specified, the number of predictors input and the type of model, which will be either Random Trees Classification for a categorical response, or Random Trees Regression for a scale or “continuous” response.
The table also contains summary statistics on the accuracy of the model predictions, with the specific measures shown depending on the type of model. Note that the statistics below are computed on the testing or “out-of-bag” instances for each tree in the set of random trees. The accuracy measures shown will thus typically be lower than what you will see if you score the training data set using the ensemble of random trees, and are likely to provide better estimates of accuracy when predicting new observations.
For a classification model based on a standard analysis, the table displays:
- Model accuracy, or the proportion of records correctly classified.
- Misclassification rate, which is 1 minus model accuracy.
For a classification model where imbalance handling has been specified, the table displays:
- The Gmean measure, or geometric mean of the recall or true positive rate (TPR) values for each observed category of the response where the recall or TPR value is positive.
- The recall or true positive rate (TPR) for each category of the response.
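The Gmean measure described above can be sketched in a few lines of plain Scala. This is only an illustration of the definition, not the library's implementation: the object name GMeanSketch, the gMean helper, and the recall values are hypothetical, and returning 0.0 when no recall is positive is an assumption of this sketch.

```scala
object GMeanSketch {
  // Geometric mean of the per-category recall (TPR) values,
  // taken only over categories whose recall is positive.
  // Assumption: return 0.0 when no category has a positive recall.
  def gMean(recalls: Seq[Double]): Double = {
    val positive = recalls.filter(_ > 0.0)
    if (positive.isEmpty) 0.0
    else math.exp(positive.map(math.log).sum / positive.size)
  }

  def main(args: Array[String]): Unit = {
    // Made-up recall values for two target categories.
    val recallByCategory = Map("approved" -> 0.9, "rejected" -> 0.4)
    println(f"Gmean = ${gMean(recallByCategory.values.toSeq)}%.4f")
  }
}
```

With recall values of 0.9 and 0.4, the geometric mean is the square root of 0.36, or 0.6, which is noticeably lower than the arithmetic mean of 0.65 because the weaker category pulls the measure down.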
For a regression model, the table displays:
- The Root Mean Squared Error (RMSE), or the square root of the mean squared error (MSE) of prediction.
- The Relative Error, which is the ratio of the sum of squared errors of prediction to the sum of squared deviations from the grand mean of the target. This is equal to 1 - R².
- The Variance Explained expressed as a proportion, or R².
Note: Because the Relative Error and R² are computed by comparing predictions on out-of-bag observations to predicting the overall grand mean, it is possible for the relative error to exceed 1 and thus for R² to be negative.
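The three regression measures can be reproduced from a list of observed and predicted values, as in the following minimal sketch. The names RegressionMetricsSketch, Metrics, and evaluate are hypothetical, the data is made up, and dividing the sum of squared errors by the number of records to obtain the MSE is an assumption; the library's exact computation may differ.

```scala
object RegressionMetricsSketch {
  final case class Metrics(rmse: Double, relativeError: Double, varianceExplained: Double)

  // observed: target values; predicted: predictions for the same records
  // (in the model viewer these would be out-of-bag predictions).
  def evaluate(observed: Seq[Double], predicted: Seq[Double]): Metrics = {
    require(observed.nonEmpty && observed.size == predicted.size,
      "observed and predicted must be non-empty and of equal length")
    val n = observed.size
    // Sum of squared errors of prediction.
    val sse = observed.zip(predicted).map { case (y, p) => (y - p) * (y - p) }.sum
    // Sum of squared deviations from the grand mean of the target.
    val grandMean = observed.sum / n
    val sst = observed.map(y => (y - grandMean) * (y - grandMean)).sum
    require(sst > 0.0, "the target must not be constant")
    val relativeError = sse / sst
    Metrics(math.sqrt(sse / n), relativeError, 1.0 - relativeError)
  }

  def main(args: Array[String]): Unit = {
    // Predictions that are worse than always predicting the grand mean.
    println(evaluate(Seq(1.0, 2.0, 3.0), Seq(3.0, 2.0, 1.0)))
  }
}
```

In this made-up example the predictions are worse than the grand mean, so the relative error is 4 and the variance explained is -3, which illustrates the case described in the note.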
- Records summary table
This table shows you how many records were used to fit the model and whether any records were excluded due to missing data. If frequency weighting is in effect, it shows information about both unweighted and weighted numbers of records.
- Predictor importance chart
This chart displays bars representing the predictors in descending order of relative importance for predicting the target, as determined by decreases in the Gini measure of impurity for categorical targets, or by decreases in the least squares deviation (LSD) impurity criterion for scale or “continuous” targets, taken over all splits on that predictor in all random trees. The importance values for each predictor are scaled so that they add to 1. Hovering over the bar for a particular predictor shows a table with its importance value and descriptive statistics about the predictor.
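The scaling step, in which importance values are made to add to 1 and then ordered for display, might look like the following sketch. The object name ImportanceSketch, the scale helper, and the raw impurity-decrease totals are hypothetical; how the library accumulates the raw decreases is not shown here.

```scala
object ImportanceSketch {
  // raw: total impurity decrease per predictor, summed over all splits
  // on that predictor in all trees (made-up inputs in this sketch).
  // Returns the predictors in descending order of scaled importance.
  def scale(raw: Map[String, Double]): Seq[(String, Double)] = {
    val total = raw.values.sum
    require(total > 0.0, "at least one predictor must have a positive decrease")
    raw.toSeq
      .map { case (name, decrease) => name -> decrease / total }
      .sortBy(-_._2)
  }

  def main(args: Array[String]): Unit = {
    val raw = Map("income" -> 6.0, "agecat" -> 3.0, "gender" -> 1.0)
    scale(raw).foreach { case (name, importance) =>
      println(f"$name%-8s $importance%.2f")
    }
  }
}
```

Because the values are divided by their total, each bar in the chart can be read as that predictor's share of the overall impurity reduction.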
- Top decision rules table
This table appears only for classification models without imbalance handling. It is based on identification of decision rules that are deemed most interesting according to an “Interestingness Index.” To identify the most interesting decision rules, the set of trees used to evaluate the model on out-of-bag observations is searched first. Each leaf node that is less than a certain depth from the root node (by default, five levels) and that has a certain level of support (by default, 5% of the observations in the root node) becomes a candidate node. Nodes that are deeper in a tree, have insufficient support, or both are merged with those sharing the same parent until they meet the consideration criteria, or they are abandoned if they cannot be merged. Once the set of candidate nodes has been assembled, the Interestingness Index is computed for the splitting rule used in each of them.
Interesting node decision rules are defined as those which have both high prediction accuracy and high agreement with the predictions of the set of random trees. The Interestingness Index is defined as the probability of a correct prediction by the node decision rule multiplied by the probability of a correct prediction by the ensemble of trees, multiplied by the probability that the node and ensemble decision rules agree. As the product of three probabilities, it thus ranges from 0 to 1. By default, the top five decision rules based on the Interestingness Index are displayed in the Top Decision Rules table.
The table contains the following columns:
- The Decision Rule – A listing of the components of the decision rule, including lists of categories for categorical predictors and cut-points for scale or “continuous” predictors.
- The Most Frequent Category – The observed target category that contains the largest number of the observations that fall into that node (the classification decision based on that decision rule).
- The Rule Accuracy – The proportion of the observations that fall into that node that are correctly predicted by the node’s decision rule.
- The Ensemble Accuracy – The proportion of the observations that fall into that node that are correctly predicted by the set of random trees.
- The Interestingness Index – The product of:
- the probability of a correct classification, using the node’s rule, for an observation falling into that node,
- the probability of a correct classification, using the set of random trees, for an observation falling into that node, and
- the probability that the node decision rule and the overall decision rule were both correct or were both incorrect, for the observations that fell into that node.
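As a rough illustration of the Interestingness Index, the following sketch multiplies the three proportions for the observations in a single node. It follows the "both correct or both incorrect" wording for the agreement term. The names InterestingnessSketch, NodeObservation, and index are hypothetical, the data is made up, and the library's internal computation is not shown in this document.

```scala
object InterestingnessSketch {
  // One observation that falls into the node: its observed target category
  // and the category predicted for it by the ensemble of trees.
  final case class NodeObservation(observed: String, ensemblePrediction: String)

  // nodeCategory: the category predicted by the node's decision rule
  // (the node's most frequent category).
  def index(nodeCategory: String, observations: Seq[NodeObservation]): Double = {
    require(observations.nonEmpty, "the node must contain observations")
    val n = observations.size.toDouble
    def ruleCorrect(o: NodeObservation) = o.observed == nodeCategory
    def ensembleCorrect(o: NodeObservation) = o.observed == o.ensemblePrediction
    val ruleAccuracy = observations.count(ruleCorrect) / n
    val ensembleAccuracy = observations.count(ensembleCorrect) / n
    // Both correct or both incorrect for the same observation.
    val agreement = observations.count(o => ruleCorrect(o) == ensembleCorrect(o)) / n
    ruleAccuracy * ensembleAccuracy * agreement
  }

  def main(args: Array[String]): Unit = {
    val node = Seq(
      NodeObservation("yes", "yes"), // rule correct, ensemble correct
      NodeObservation("yes", "yes"), // rule correct, ensemble correct
      NodeObservation("yes", "no"),  // rule correct, ensemble incorrect
      NodeObservation("no", "no")    // rule incorrect, ensemble correct
    )
    println(index("yes", node))
  }
}
```

For these four made-up observations the rule accuracy is 0.75, the ensemble accuracy is 0.75, and the agreement is 0.5, giving an index of 0.28125.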
The rules in the table are listed in descending order of the Interestingness Index. Above the table is a drop-down menu entitled “Table Contents:” that has a default setting of “All Rules,” which means that all interesting decision rules are displayed. If you click the arrow on the right of the drop-down menu, you can choose the option “Top Rules by Insight.” If this option is chosen, a second drop-down menu, entitled “Target Category,” appears to the right of the first. The options in the second list consist of all distinct target categories that appear in any of the interesting decision rules. Selecting one of these restricts the rules displayed to those that have that category as their most frequent category.
- Confusion matrix (Classification table)
The confusion matrix or classification table contains a cross-classification of observed by predicted labels or groups, where predictions are based on the majority vote of the trees in the ensemble of random trees, applied to out-of-bag observations. The numbers of correct predictions are shown in the cells along the main diagonal. Correct percentages are shown for each row, column and overall:
- The percent correct for each row shows what percentage of the observations with that observed label were correctly predicted by the ensemble. If a given label is considered a target label, this is known as sensitivity, recall or true positive rate (TPR). In a 2 x 2 confusion matrix, if one label is considered the non-target label, the percentage for that row is known as the specificity or true negative rate (TNR).
- The percent correct for each column shows the percentage of observations with that predicted label that were correctly predicted. If a given predicted label is considered a target label, this is known as precision or positive predictive value (PPV). For a 2 x 2 confusion matrix, if one label is considered the non-target label, the percentage for that column is known as the negative predictive value (NPV).
- The percent correct at the bottom right of the table gives the overall percentage of correctly classified observations, known as the overall accuracy.
Note: The numbers in the confusion matrix are based on predictions for observations when they are out of bag (i.e., based on predictions for observations from trees built without using those observations) and are thus not the same as what you would see if you use the model results to score the training data. The accuracy numbers in the table provide a better estimate of prediction accuracy on future data.
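The row, column, and overall percentages can be derived from the cell counts as in the following sketch. The object name ConfusionMatrixSketch and its helpers are hypothetical, the counts are made up, and returning NaN for an empty row or column is an assumption of this sketch rather than documented viewer behavior.

```scala
object ConfusionMatrixSketch {
  // counts(i)(j) = number of observations with observed label i
  // and predicted label j; the matrix is assumed to be square.
  type Matrix = Vector[Vector[Long]]

  // Assumption: an empty row or column yields NaN.
  private def percent(correct: Long, total: Long): Double =
    if (total == 0L) Double.NaN else 100.0 * correct / total

  // Percent correct per observed label (recall; specificity for a non-target row).
  def rowPercentCorrect(m: Matrix): Vector[Double] =
    m.indices.map(i => percent(m(i)(i), m(i).sum)).toVector

  // Percent correct per predicted label (precision; NPV for a non-target column).
  def columnPercentCorrect(m: Matrix): Vector[Double] =
    m.indices.map(j => percent(m(j)(j), m.map(_(j)).sum)).toVector

  // Overall accuracy: diagonal total over grand total.
  def overallPercentCorrect(m: Matrix): Double =
    percent(m.indices.map(i => m(i)(i)).sum, m.map(_.sum).sum)

  def main(args: Array[String]): Unit = {
    val counts: Matrix = Vector(Vector(40L, 10L), Vector(5L, 45L))
    println(s"rows:    ${rowPercentCorrect(counts)}")
    println(s"columns: ${columnPercentCorrect(counts)}")
    println(s"overall: ${overallPercentCorrect(counts)}")
  }
}
```

In this made-up 2 x 2 example, the first row is 80% correct and the second is 90% correct, while the overall accuracy is 85%.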