Random Trees Visualizations
The following tables and options are available for Random Trees visualizations.
Model Evaluation panel
For classification models, the Model Evaluation panel shows a bar graph of the overall prediction accuracy (the proportion of correct predictions) and a table of evaluation statistics. If the prediction accuracy is exactly 0, the graph is not shown. The evaluation statistics include the overall accuracy and a series of weighted measures. Each weighted measure is calculated by treating each category of the target field in turn as the category of interest (or positive response) and then averaging the resulting statistics across categories, with weights proportional to the observed proportion of instances in each category. The weighted measures include true and false positive rates (TPR and FPR), precision, recall, and the F1 measure, which is the harmonic mean of precision and recall. When weighted by observed proportions in this way, the weighted true positive rate and weighted recall are the same as the overall accuracy.
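The weighting scheme can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name `weighted_metrics`, the sample labels, and the convention of counting an undefined ratio as 0 are all assumptions made for illustration. It computes one-vs-rest statistics per category and averages them with weights equal to the observed category proportions.

```python
from collections import Counter

def weighted_metrics(observed, predicted):
    """Average one-vs-rest statistics across categories, weighting each
    category by its observed proportion."""
    n = len(observed)
    pairs = list(zip(observed, predicted))
    totals = {"tpr": 0.0, "fpr": 0.0, "precision": 0.0, "f1": 0.0}
    for category, n_cat in Counter(observed).items():
        tp = sum(1 for o, p in pairs if o == category and p == category)
        fp = sum(1 for o, p in pairs if o != category and p == category)
        fn = n_cat - tp
        tn = n - tp - fp - fn
        tpr = tp / n_cat  # recall for this category
        # Assumption: ratios with a zero denominator are counted as 0.
        fpr = fp / (fp + tn) if fp + tn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
        weight = n_cat / n  # observed proportion of this category
        for name, value in (("tpr", tpr), ("fpr", fpr),
                            ("precision", precision), ("f1", f1)):
            totals[name] += weight * value
    accuracy = sum(1 for o, p in pairs if o == p) / n
    return accuracy, totals

# Hypothetical observed and predicted target categories.
observed = ["a", "a", "a", "b", "b", "c"]
predicted = ["a", "a", "b", "b", "c", "c"]
accuracy, weighted = weighted_metrics(observed, predicted)
```

In this example the weighted TPR equals the overall accuracy (4 of 6 records correct), which is the identity described above.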
For regression models, the panel shows a bar graph displaying R² as a measure of prediction accuracy, and a table with R², mean squared error (MSE), and root mean squared error (RMSE).
Note that error estimates for this panel are based on the final ensemble model and do not use the out-of-bag instances for error estimation as do the measures reported in the Model Information table.
Model Information table
This table contains information on how the model was fitted, so you can make sure that the model you have is what you intended. It contains information on input settings, including the target field, the frequency and regression weight fields if these have been specified, the number of predictors input and the type of model, which will be either Random Trees Classification for a categorical response, or Random Trees Regression for a scale or "continuous" response.
The table also contains summary statistics on the accuracy of the model predictions, with the specific measures shown depending on the type of model. Note that the following statistics are computed on the testing or "out-of-bag" instances for each tree in the set of random trees. The accuracy measures shown will thus typically be lower than what you will see if you score the training data set using the ensemble of random trees, and are likely to provide better estimates of accuracy when predicting new observations.
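To make the term "out-of-bag" concrete, here is a small Python sketch. It assumes that each tree is grown on a bootstrap sample, that is, records drawn with replacement, which is the usual meaning of bagging; the source does not spell out the sampling scheme, and the function name `bootstrap_split` is hypothetical. Records that are never drawn for a given tree are that tree's out-of-bag instances.

```python
import random

def bootstrap_split(n_records, rng):
    """Draw a bootstrap sample of record indices and return it together
    with the indices that were never drawn (the out-of-bag records)."""
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    out_of_bag = sorted(set(range(n_records)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_split(10, rng)
```

Because a tree never sees its out-of-bag records during fitting, predictions for those records behave like predictions on new data.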
For a classification model based on a standard analysis, the table displays:
- Model accuracy, or the proportion of records correctly classified.
- Misclassification rate, which is 1 minus model accuracy.
For a classification model where imbalance handling has been specified, the table displays:
- The Gmean measure, or geometric mean of the recall or true positive rate (TPR) values for each observed category of the response where the recall or TPR value is positive.
- The recall or true positive rate (TPR) for each category of the response.
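The Gmean measure can be sketched in a few lines of Python. The function names and sample data are hypothetical, and returning 0 when no category has a positive recall is an assumption; the source only states that the geometric mean is taken over categories whose recall is positive.

```python
import math
from collections import Counter

def recall_by_category(observed, predicted):
    """Recall (TPR) for each observed category of the response."""
    counts = Counter(observed)
    hits = Counter(o for o, p in zip(observed, predicted) if o == p)
    return {category: hits[category] / n for category, n in counts.items()}

def gmean(recalls):
    """Geometric mean of the recall values that are positive."""
    positive = [r for r in recalls.values() if r > 0]
    if not positive:
        return 0.0  # assumption: no positive recall gives a Gmean of 0
    return math.exp(sum(math.log(r) for r in positive) / len(positive))

# Hypothetical imbalanced result: category "a" is always recovered,
# category "b" only one time in four.
observed = ["a"] * 4 + ["b"] * 4
predicted = ["a", "a", "a", "a", "b", "a", "a", "a"]
recalls = recall_by_category(observed, predicted)
g = gmean(recalls)
```

Here the recalls are 1.0 and 0.25, so the Gmean is 0.5, well below the overall accuracy of 0.625. This is why the measure is useful when one category dominates.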
For a regression model, the table displays:
- The Root Mean Squared Error (RMSE), or the square root of the mean squared error (MSE) of prediction.
- The Relative Error, which is the ratio of the sum of squared errors of prediction to the sum of squared deviations from the grand mean of the target. This is equal to 1 - R².
- The Variance Explained, expressed as a proportion, or R².
Note: Because the Relative Error and R² are computed by comparing predictions on out-of-bag observations with predictions of the overall grand mean, it is possible for the relative error to exceed 1 and thus for R² to be negative.
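The following Python sketch shows how these regression measures relate to one another. The function name and data are hypothetical, and dividing the sum of squared errors by the number of records to obtain the MSE is an assumption. The predictions are deliberately worse than the grand mean, so the relative error exceeds 1 and R² is negative.

```python
import math

def regression_measures(observed, predicted):
    """RMSE, relative error, and R squared for a set of predictions."""
    n = len(observed)
    grand_mean = sum(observed) / n
    # Sum of squared errors of prediction.
    sse = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    # Sum of squared deviations from the grand mean of the target.
    sst = sum((y - grand_mean) ** 2 for y in observed)
    relative_error = sse / sst
    return {
        "rmse": math.sqrt(sse / n),  # assumption: MSE divides by n
        "relative_error": relative_error,
        "r2": 1 - relative_error,
    }

# Hypothetical out-of-bag predictions that run opposite to the target.
measures = regression_measures([1, 2, 3, 4], [4, 3, 2, 1])
```

In this example the sum of squared errors is 20 and the sum of squared deviations from the grand mean is 5, giving a relative error of 4 and an R² of -3.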
Records Summary table
This table shows you how many records were used to fit the model and whether any records were excluded due to missing data. If frequency weighting is in effect, it shows information about both unweighted and weighted numbers of records.
Predictor Importance chart
This chart displays bars representing the predictors in descending order of relative importance for predicting the target, as determined by decreases in the Gini measure of impurity for categorical targets, or by decreases in the least squares deviation (LSD) impurity criterion for scale or "continuous" targets, taken over all splits on that predictor in all random trees. The importance values for each predictor are scaled so that they add to 1. Hovering over the bar for a particular predictor shows a table with its importance value and descriptive statistics about the predictor.
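As a rough illustration for a categorical target, the Python sketch below computes the Gini impurity decrease for a single split and then scales per-predictor totals so that they add to 1. The function names, the weighting of child nodes by their share of records, and the sample values are assumptions; the exact weighting the product applies within and across trees is not stated in the source.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a collection of target labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity decrease from splitting parent into left and right,
    with each child weighted by its share of the parent's records."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

def scaled_importance(split_decreases):
    """Sum impurity decreases per predictor over all splits in all trees,
    then scale the totals to add to 1, in descending order."""
    totals = defaultdict(float)
    for predictor, decrease in split_decreases:
        totals[predictor] += decrease
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {predictor: value / grand_total for predictor, value in ranked}

# A perfect split of a balanced two-category node removes all impurity.
perfect = gini_decrease(["a", "a", "b", "b"], ["a", "a"], ["b", "b"])

# Hypothetical (predictor, decrease) pairs collected from several trees.
importance = scaled_importance(
    [("age", 0.2), ("income", 0.1), ("age", 0.1), ("region", 0.1)]
)
```

With these values, "age" accounts for 0.6 of the total decrease and the other two predictors for 0.2 each.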
Top Decision Rules table
This table appears only for classification models without imbalance handling. It is based on identification of decision rules that are deemed most interesting according to an "Interestingness Index." To identify the most interesting decision rules, the set of trees used to evaluate the model on out-of-bag observations is searched first. Each leaf node that is less than a certain depth from the root node (by default, five levels) and that has a certain level of support (by default, 5% of the observations in the root node) becomes a candidate node. Nodes that are deeper in a tree, have insufficient support, or both are merged with those sharing the same parent until they meet the consideration criteria, or are abandoned if they cannot be merged. Once the set of candidate nodes has been assembled, the Interestingness Index is computed for the splitting rule used in each of them.
Interesting node decision rules are defined as those which have both high prediction accuracy and high agreement with the predictions of the set of random trees. The Interestingness Index is defined as the probability of a correct prediction by the node decision rule multiplied by the probability of a correct prediction by the ensemble of trees, multiplied by the probability that the node and ensemble decision rules agree. As the product of three probabilities, it thus ranges from 0 to 1. By default, the top five decision rules based on the Interestingness Index are displayed in the Top Decision Rules table.
The table contains the following columns:
- The Decision Rule – A listing of the components of the decision rule, including lists of categories for categorical predictors and cut-points for scale or "continuous" predictors.
- The Most Frequent Category – The observed target category in which the largest number of observations that fall into that node appear (the classification decision based on that decision rule).
- The Rule Accuracy – The proportion of the observations that fall into that node that are correctly predicted by the node's decision rule.
- The Ensemble Accuracy – The proportion of the observations that fall into that node that are correctly predicted by the set of random trees.
- The Interestingness Index – The product of:
  - the probability of a correct classification, using the node's rule, for an observation falling into that node,
  - the probability of a correct classification, using the set of random trees, for an observation falling into that node, and
  - the probability that the node decision rule and the overall decision rule were both correct or were both incorrect, for the observations that fell into that node.
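The three factors of the Interestingness Index can be sketched in Python as follows. The function name and sample data are hypothetical, and each probability is estimated as a simple proportion of the observations in the node, which is an assumption about how the product estimates them.

```python
def interestingness_index(observed, ensemble_predicted, rule_category):
    """Product of rule accuracy, ensemble accuracy, and the proportion of
    observations on which the two are both correct or both incorrect."""
    n = len(observed)
    # The node's rule predicts its most frequent category for every record.
    rule_correct = [o == rule_category for o in observed]
    ensemble_correct = [o == e for o, e in zip(observed, ensemble_predicted)]
    rule_accuracy = sum(rule_correct) / n
    ensemble_accuracy = sum(ensemble_correct) / n
    agreement = sum(r == e for r, e in zip(rule_correct, ensemble_correct)) / n
    return rule_accuracy * ensemble_accuracy * agreement

# Hypothetical node with four observations whose most frequent category is "yes".
observed = ["yes", "yes", "yes", "no"]
ensemble_predicted = ["yes", "yes", "no", "no"]
index = interestingness_index(observed, ensemble_predicted, "yes")
```

In this example the rule accuracy and the ensemble accuracy are both 0.75, and the two are jointly correct or jointly incorrect on half of the observations, so the index is 0.75 × 0.75 × 0.5 = 0.28125.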
The rules in the table are listed in descending order of the Interestingness Index. Atop the table is a drop-down menu entitled "Table Contents:" that has a default setting of "All Rules," which means that all interesting decision rules are displayed. If you click the arrow on the right of the drop-down menu, you can choose the option "Top Rules by Insight." If this option is chosen, a second drop-down menu, entitled "Target Category," appears to the right of the first. The options in the second list consist of all distinct target categories that appear in any of the interesting decision rules. Selecting one of these restricts the rules displayed to those that have that category as their most frequent category.
Confusion Matrix (Classification Table)
The confusion matrix or classification table contains a cross-classification of observed by predicted labels or groups, where predictions are based on the majority vote of the trees in the ensemble of random trees, applied to out-of-bag observations. The numbers of correct predictions are shown in the cells along the main diagonal. Correct percentages are shown for each row, column and overall:
- The percent correct for each row shows what percentage of the observations with that observed label were correctly predicted by the ensemble. If a given label is considered a target label, this is known as sensitivity, recall or true positive rate (TPR). In a 2 x 2 confusion matrix, if one label is considered the non-target label, the percentage for that row is known as the specificity or true negative rate (TNR).
- The percent correct for each column shows the percentage of observations with that predicted label that were correctly predicted. If a given predicted label is considered a target label, this is known as precision or positive predictive value (PPV). For a 2 x 2 confusion matrix, if one label is considered the non-target label, the percentage for that column is known as the negative predictive value (NPV).
- The percent correct at the bottom right of the table gives the overall percentage of correctly classified observations, known as the overall accuracy.
Note: The numbers in the confusion matrix are based on predictions for observations when they are out of bag (i.e., based on predictions for observations from trees built without using those observations) and are thus not the same as what you would see if you use the model results to score the training data. The accuracy numbers in the table provide a better estimate of prediction accuracy on future data.
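The row, column, and overall percentages can be reproduced with a short Python sketch. The function name and the sample labels are hypothetical; the predicted labels stand in for out-of-bag ensemble predictions, which this sketch does not compute.

```python
from collections import Counter

def confusion_summary(observed, predicted):
    """Cell counts plus percent correct by row, by column, and overall."""
    labels = sorted(set(observed) | set(predicted))
    cells = Counter(zip(observed, predicted))  # (observed, predicted) -> count
    row_totals = Counter(observed)
    col_totals = Counter(predicted)
    # Row percent correct: recall (TPR) for a target label.
    row_pct = {label: 100 * cells[(label, label)] / row_totals[label]
               for label in labels if row_totals[label]}
    # Column percent correct: precision (PPV) for a target label.
    col_pct = {label: 100 * cells[(label, label)] / col_totals[label]
               for label in labels if col_totals[label]}
    overall = 100 * sum(cells[(label, label)] for label in labels) / len(observed)
    return cells, row_pct, col_pct, overall

# Hypothetical 2 x 2 case with "pos" as the target label.
observed = ["pos"] * 4 + ["neg"] * 6
predicted = ["pos", "pos", "pos", "neg"] + ["pos", "pos", "neg", "neg", "neg", "neg"]
cells, row_pct, col_pct, overall = confusion_summary(observed, predicted)
```

With "pos" as the target label, the row percentage for "pos" (75%) is the sensitivity, the column percentage for "pos" (60%) is the precision, the column percentage for "neg" (80%) is the negative predictive value, and the overall accuracy is 70%.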
Like your visualization? Why not deploy it? For more information, see Deploy a model.