The Auto Classifier node estimates and compares models for
either nominal (set) or binary (yes/no) targets, using a number of different methods, enabling you
to try out a variety of approaches in a single modeling run. You can select the algorithms to use,
and experiment with multiple combinations of options. For example, rather than choose between Radial
Basis Function, polynomial, sigmoid, or linear methods for an SVM, you can try them all. The node
explores every possible combination of options, ranks each candidate model based on the measure you
specify, and saves the best models for use in scoring or further analysis.
Example
A retail company has historical data tracking the offers made to specific customers in past
campaigns. The company now wants to achieve more profitable results by matching the appropriate
offer to each customer.
Requirements
A target field with a measurement level of either Nominal or
Flag (with the role set to Target), and at least one input
field (with the role set to Input). For a flag field, the
True value defined for the target is assumed to represent a hit when calculating
profits, lift, and related statistics. Input fields can have a measurement level of
Continuous or Categorical, with the limitation that some inputs
may not be appropriate for some model types. For example, ordinal fields used as inputs in C&R
Tree, CHAID, and QUEST models must have numeric storage (not string); otherwise, these models
ignore them. Similarly, continuous input fields can be binned in some cases. The
requirements are the same as when using the individual modeling nodes; for example, a Bayes Net
model works the same whether generated from the Bayes Net node or the Auto Classifier node.
Frequency and weight fields
Frequency and weight are used to give extra importance to some records over others because, for
example, the user knows that the build dataset under-represents a section of the parent population
(Weight) or because one record represents a number of identical cases (Frequency). If specified, a
frequency field can be used by C&R Tree, CHAID, QUEST, Decision List, and Bayes Net models. A
weight field can be used by C&R Tree, CHAID, and C5.0 models. Other model types will ignore these
fields and build the models anyway. Frequency and weight fields are used only for model building,
and are not considered when evaluating or scoring models.
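As a conceptual illustration (this is a sketch of the idea only, not SPSS Modeler behavior or API; the record and field names are hypothetical), a frequency field is equivalent to replicating each record that many times during model building:

```python
# Sketch: a frequency field acts like replicating each record that many
# times when the model is built, whereas a weight field instead scales a
# record's influence. Field names here are hypothetical.
records = [
    {"offer": "A", "response": 1, "frequency": 3},
    {"offer": "B", "response": 0, "frequency": 1},
]

# Expanding by frequency: the first record now counts as 3 identical cases.
expanded = [
    {k: v for k, v in r.items() if k != "frequency"}
    for r in records
    for _ in range(r["frequency"])
]

print(len(expanded))  # 4 effective training cases from 2 stored records
```

Note that, as described above, this expansion applies only to model building; scoring and evaluation do not use the frequency field.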
Prefixes
If you attach a table node to the nugget for the Auto Classifier node, the table contains several
new fields whose names begin with a $ prefix.
The names of the fields that are generated during scoring are based on the target field, but
with a standard prefix. Different model types use different sets of prefixes.
For example, the prefixes $G, $R, and $C are used for predictions that are generated by the
Generalized Linear, CHAID, and C5.0 models, respectively. $X is typically generated by an
ensemble, and $XR, $XS, and $XF are used as prefixes in cases where the target
field is a Continuous, Categorical, or Flag field, respectively.
$..C prefixes are used for the prediction confidence of a Categorical or Flag target; for example,
$XFC is the prefix for ensemble Flag prediction confidence. $RC and $CC are the prefixes for the
prediction confidence of a CHAID model and a C5.0 model, respectively.
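The naming pattern can be illustrated with a short sketch. The mapping and the `scored_field_names` helper below are hypothetical and reflect only the examples given in this section; they are not an exhaustive or official list of prefixes:

```python
# Hypothetical mapping, based only on the examples in this section.
PREDICTION_PREFIX = {
    "Generalized Linear": "$G",
    "CHAID": "$R",
    "C5.0": "$C",
    "Ensemble (flag target)": "$XF",
}

def scored_field_names(model_type, target="response"):
    """Sketch of the scored-field naming pattern: the prediction field is
    <prefix>-<target>, and the confidence field appends C to the prefix."""
    prefix = PREDICTION_PREFIX[model_type]
    return f"{prefix}-{target}", f"{prefix}C-{target}"

print(scored_field_names("CHAID"))  # ('$R-response', '$RC-response')
```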
Supported Model Types
Supported model types include Neural Net, C&R Tree, QUEST, CHAID, C5.0,
Logistic Regression, Decision List, Bayes Net, Discriminant, Nearest Neighbor, SVM, XGBoost Tree,
and XGBoost-AS.
Cross-validation settings
In the node properties, note that cross-validation settings are available. Cross-validation is a
valuable technique for testing the effectiveness (avoiding overfitting) of machine learning models,
and it's also a resampling procedure you can use to evaluate a model when you have limited data.
K-fold is a popular and straightforward way to perform cross-validation. It generally results in a
less biased model than a single train/test partition, because it ensures that every observation
from the original dataset has the chance to appear in both the training and test sets. The general
procedure of k-fold cross-validation is as follows.
Note: Parallel auto modeling in cross-validation mode (running two or more auto modeling nodes at
the same time, such as via the Run all button) isn't supported at this time. As a
workaround, run each auto modeling node one at a time with cross-validation enabled (it's
disabled by default).
1. Shuffle the dataset randomly.
2. Split the dataset into k folds/groups.
3. For each unique fold/group:
   a. Take the fold/group as a hold-out or test dataset.
   b. Take the remaining groups as a training dataset.
   c. Fit a model on the training set and evaluate it on the test set.
   d. Retain the evaluation score and discard the model.
4. Summarize the overall evaluation of the model using the retained k-fold evaluation scores.
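The steps above can be sketched in a few lines of Python. This is a minimal illustration of the general procedure, not SPSS Modeler's implementation; the toy majority-class "model" and the `k_fold_cv` helper are assumptions for demonstration:

```python
import random

def k_fold_cv(data, k, fit, evaluate, seed=0):
    """Generic k-fold cross-validation following the steps above."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # 1. shuffle the dataset
    folds = [idx[i::k] for i in range(k)]     # 2. split into k groups
    scores = []
    for i in range(k):                        # 3. take each fold in turn...
        test = [data[j] for j in folds[i]]    # ...as the hold-out set
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit(train)                    # fit on the remaining folds
        scores.append(evaluate(model, test))  # retain score, discard model
    return sum(scores) / k                    # 4. summarize the k scores

# Toy example: a majority-class "model" on flag labels, scored by accuracy.
labels = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
fit = lambda train: max(set(train), key=train.count)              # majority label
evaluate = lambda m, test: sum(y == m for y in test) / len(test)  # accuracy
print(round(k_fold_cv(labels, 5, fit, evaluate), 2))  # prints 0.7
```

Because every record appears in exactly one test fold, the averaged score reflects the whole dataset rather than a single held-out slice.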
Cross-validation is currently supported via the Auto Classifier node and the Auto Numeric node.
Double-click the node to open its properties. When you select the Cross-validate
option, the single train/test partition is disabled and the Auto nodes use k-fold
cross-validation to evaluate the selected set of different algorithms.
You can specify the Number of folds (K). The default is 5, with a range of
3 to 10. To retain repeatable sampling during cross-validation, so that the final
evaluation measures for generated models are consistent across different executions, select the
Repeatable Cross Validation partition assignment option. You can also set the
Random seed to a specific value so the resulting model is exactly
reproducible, or click Generate to always generate the same sequence of
random values, in which case running the node always yields the same generated model.
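The effect of a repeatable partition with a fixed seed can be sketched as follows (`fold_assignment` is a hypothetical helper illustrating the idea, not the product's actual partitioning algorithm):

```python
import random

def fold_assignment(n_records, k, seed):
    """Assign each record index to one of k folds. A fixed seed makes the
    assignment identical on every run; this sketches the idea behind a
    repeatable partition, not SPSS Modeler's actual algorithm."""
    rng = random.Random(seed)
    idx = list(range(n_records))
    rng.shuffle(idx)                      # repeatable shuffle for a given seed
    return {record: i % k for i, record in enumerate(idx)}

# The same seed always yields the same partition, so evaluation measures
# computed on these folds are consistent across executions.
print(fold_assignment(100, 5, seed=42) == fold_assignment(100, 5, seed=42))  # True
```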
Continuous machine learning
A common inconvenience with modeling is that models become outdated as your data changes over
time. This is commonly referred to as model drift or concept drift. To help
you overcome model drift effectively, SPSS Modeler provides continuous automated machine learning.
This feature is available for the Auto Classifier node and Auto Numeric node model nuggets. For more
information, see Continuous machine learning.