Creating machine learning flows with SPSS Modeler nodes
You can create a machine learning flow by using SPSS nodes.
Note: Watson Studio does not include SPSS functionality in Peru, Ecuador, Colombia and Venezuela.
The SPSS Modeler node palette
For detailed information about a node, open the node settings panel in the Flow Editor and click the get information icon. An information panel appears with both overview and detailed information about the node, such as how to use fields and controls.
Source nodes contain the data for your machine learning flows and are available via ObjectStore and the Connections or Db2 on Cloud tabs from within Watson Studio. In addition to these data sets, the User Input node, which can be used to generate a small data set, is available in the Record Operations tab.
- Data Asset
The Data Asset node contains the data from your project for your flows. Drag it from the Palette to the canvas area. Double-click the node, click the Change data asset button, select one of the data assets from your project, and then click OK.
Table 1. Data Asset node input and output values
Input: Data only from a data asset that has already been added to the project
Output: All data from the source, either from a file or data connection
- User Input
The User Input node provides an easy way for you to create synthetic data, either from scratch or by altering existing data. This is useful, for example, when you want to create a test data set for modeling.
Table 1. User Input node input and output values
Input: Data that a user types into the node itself
Output: All data that has been typed into the node
- Sim Gen
The Simulation Generate node provides an easy way to generate simulated data, either without historical data using user-specified statistical distributions, or automatically using the distributions obtained from running a Simulation Fitting node on existing historical data. Generating simulated data is useful when you want to evaluate the outcome of a predictive model in the presence of uncertainty in the model inputs.
Table 1. Sim Gen node input and output values
Input: To use this node with actual historical data, the input comes from a Sim Fit node; however, Sim Gen can also get data from an Import node, such as the Data Asset node, or from the preceding stream of connected nodes.
Output: Simulated data either with or without historical data
- Select
You can use Select nodes to select or discard a subset of records from the data stream based on a specific condition, such as BP (blood pressure) = "HIGH".
Table 1. Select node input and output values
Input: Data from an Import node, such as the Data Asset node or from the preceding stream of connected nodes
Output: A subset of data that is received and that is created by selecting Include or Exclude and typing an SQL statement
- Sample
The Sample node selects a subset of records. A variety of sample types are supported, including stratified, clustered, and nonrandom (structured) samples. Sampling can be useful to improve performance, and to select groups of related records or transactions for analysis.
Only the simple modes are currently supported.
Table 1. Sample node input and output values
Input: Data from an Import node, such as the Data Asset node or from the preceding stream of connected nodes
Output: A subset of data that is received and that is created by selecting a Sampling Type and typing an SQL statement
- Sort
You can use Sort nodes to sort records into ascending or descending order based on the values of one or more fields. For example, Sort nodes are frequently used to view and select records with the most common data values. Typically, you first aggregate the data by using the Aggregate node and then use the Sort node to sort the aggregated data into descending order of record counts. Display these results in a table so you can explore the data and make decisions, such as selecting the records of the top 10 best customers.
The basic Settings tab is implemented, but the Optimization tab is not yet implemented.
Table 1. Sort node input and output values
Input: Data from an Aggregate node (but also possibly an Import node, such as the Data Asset node or from the preceding stream of connected nodes)
Output: A subset of sorted data usually sent to a Table node for further analysis
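The aggregate-then-sort pattern described for the Sort node can be sketched in plain Python. This only illustrates the logic and is not SPSS Modeler code; the sample records, the `customer` field, and the `top_n_by_count` helper are hypothetical.

```python
from collections import Counter

def top_n_by_count(records, key_field, n):
    """Aggregate records by key_field, then sort by record count, descending."""
    counts = Counter(record[key_field] for record in records)
    # Sort by count descending; break ties by key so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Hypothetical sample data: one record per purchase.
purchases = [
    {"customer": "C1"}, {"customer": "C2"}, {"customer": "C1"},
    {"customer": "C3"}, {"customer": "C1"}, {"customer": "C2"},
]
top_customers = top_n_by_count(purchases, "customer", 2)
print(top_customers)
```

In a flow, the counting step corresponds to the Aggregate node with a record count, and the ranking step to the Sort node in descending order.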
- Balance
The Balance node corrects imbalances in a data set, so it conforms to a specified condition. The balancing directive adjusts the proportion of records where a condition is true by the factor specified.
Table 1. Balance node input and output values
Input: Data from an Import node, such as the Data Asset node, from the preceding stream of connected nodes, or from a Distribution node
Output: Either a subset or superset of data that has been re-proportioned to meet the specified condition
- Distinct
Duplicate records in a data set must be removed before data mining can begin. For example, in a marketing database, individuals may appear multiple times with different address or company information. You can use the Distinct node to find or remove duplicate records in your data, or to create a single, composite record from a group of duplicate records.
Table 1. Distinct node input and output values
Input: Data from an Import node, such as the Data Asset node or from the preceding stream of connected nodes
Output: A subset of data with duplicate records removed
- Aggregate
Aggregation is a data preparation task that is frequently used to reduce the size of a data set. Before proceeding with aggregation, you should take time to clean the data, concentrating especially on missing values. When you aggregate, potentially useful information regarding missing values might be lost.
On the Settings tab, the Key fields, Aggregate fields, Include record count and the count field name are implemented. Default operations for other fields are currently fixed.
Table 1. Aggregate node input and output values
Output: A subset of merged data records
- Merge
The Merge node takes multiple input records and creates a single output record that contains some or all of the input fields. It is useful for merging data from different sources, such as internal customer data and purchased demographic data.
Only the Merge tab is currently supported. Within that, the Merge Method of Ranked Condition is not supported.
Table 1. Merge node input and output values
Input: Data from multiple Import nodes, such as Data Asset nodes or from the preceding stream of connected nodes
Output: A single data set that combines records from multiple sources
- Append
The Append node allows multiple data sets to be appended together (similar to UNION in SQL). For example, a customer might have sales data in separate files for each month and want to combine them into a single view of sales over several years.
Table 1. Append node input and output values
Input: Data from multiple Import nodes, such as Data Asset nodes or from the preceding stream of connected nodes
Output: A single data set that creates a more complex data structure from multiple sources
- Streaming TS
You use the Streaming Time Series node to build and score time series models in one step. A separate time series model is built for each target field; however, model nuggets are not added to the generated models palette and the model information cannot be browsed.
Table 1. Streaming TS node input and output values
Output: For each target field, a separate time series model is built when the flow is run.
- SMOTE
The Synthetic Minority Over-sampling Technique (SMOTE) node provides an over-sampling algorithm to deal with imbalanced data sets. It provides an advanced method for balancing data. The SMOTE process node is implemented in Python and requires the imbalanced-learn© Python library.
Table 1. SMOTE node input and output values
Output: A superset of data that has been re-apportioned to meet the criteria specified in the imbalanced-learn Python library
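The core idea behind SMOTE is to create each synthetic minority record on the line segment between an existing minority record and one of its nearest minority-class neighbors. The following standard-library sketch illustrates that idea only; it is not the imbalanced-learn implementation that the node relies on, and the `smote_sketch` helper, its parameters, and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest neighbors of base among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class records with two numeric fields.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority_points, n_new=3)
print(new_points)
```

Because every synthetic point lies between two real minority records, the new records stay inside the region that the minority class already occupies.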
- RFM Aggregate
The Recency, Frequency, Monetary (RFM) Aggregate node allows you to take customers' historical transactional data, strip away any unused data, and combine all of their remaining transaction data into a single row (using their unique customer ID as a key) that lists when they last dealt with you (recency), how many transactions they have made (frequency), and the total value of those transactions (monetary).
Table 1. RFM Aggregate node input and output values
Output: A subset of data that has had unused data removed and has been moved into a single row
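As a rough illustration of what RFM aggregation computes, the following plain-Python sketch collapses transactions into one row per customer ID. It is not the node's implementation: the `rfm_aggregate` helper, the tuple layout of the transactions, and the choice to measure recency in days before a reference date are all assumptions made for this example.

```python
from datetime import date

def rfm_aggregate(transactions, today):
    """Collapse transactions into one row per customer ID with
    recency (days since last transaction), frequency, and monetary total."""
    rows = {}
    for customer_id, when, amount in transactions:
        row = rows.setdefault(customer_id, {"last": when, "frequency": 0, "monetary": 0.0})
        row["last"] = max(row["last"], when)
        row["frequency"] += 1
        row["monetary"] += amount
    return {
        cid: {"recency": (today - r["last"]).days,
              "frequency": r["frequency"],
              "monetary": r["monetary"]}
        for cid, r in rows.items()
    }

# Hypothetical transactions: (customer ID, date, amount)
transactions = [
    ("A", date(2024, 1, 5), 20.0),
    ("B", date(2024, 1, 20), 50.0),
    ("A", date(2024, 2, 1), 30.0),
]
rfm = rfm_aggregate(transactions, today=date(2024, 3, 1))
print(rfm)
```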
- Extension Transform
The Extension Transform node takes data from a stream and applies transformations to the data using R scripting or Python for Spark scripting.
Table 1. Extension Transform node input and output values
Output: Data records that have been transformed via scripting
- Auto Data Prep
Preparing data for analysis is one of the most important steps in any project, and traditionally one of the most time-consuming. Automated Data Preparation (ADP) handles the task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques. You can use the algorithm in fully automatic fashion, allowing it to choose and apply fixes, or you can use it in interactive fashion, previewing the changes before they are made and accepting or rejecting them as you want.
Table 1. Auto Data Prep node input and output values
Output: A subset of data with problematic records removed
- Type
The Type node specifies field metadata and properties. For example, you can specify a measurement level (continuous, nominal, ordinal, or flag) for each field, set options for handling missing values and system nulls, set the role of a field for modeling purposes, specify field and value labels, and specify values for a field. In some cases you might need to fully instantiate the Type node in order for other nodes to work correctly, such as the fields from property of the Set to Flag node. You can simply connect a Table node and execute it to instantiate the fields.
The Types tab is partly implemented in that many of the settings that can be changed by double-clicking the field name in desktop model builder can be edited. However, behavior in the cloud version is different from the desktop model builder in that it only shows property settings that are local to the node. You might not see the same derived information that you would expect to see in the desktop model builder.
The Types tab in desktop model builder also has action buttons to clear values or force a data pass. This is not currently supported.
The Format tab is not currently implemented.
Table 1. Type node input and output values
Output: Data records with additional field metadata and properties that affect how to model a specific data point
- Filter
Creates a new Filter node to filter fields that are not used by rules in the rule set.
The filtering (dropping) of fields and the renaming of fields are implemented by using separate tabs. As noted previously, dependencies between tabs might not be supported. For example, fields that are dropped on the Filter tab might still appear in the list of available fields in the Rename tab.
Table 1. Filter node input and output values
Output: Data records with only a subset of fields
- Derive
The Derive node modifies data values or creates new fields from one or more existing fields. It creates fields of type formula, flag, nominal, state, count, and conditional.
Table 1. Derive node input and output values
Output: Data records with additional fields or modified field values
- Filler
Filler nodes are used to replace field values and change storage. You can choose to replace values based on a specified CLEM condition, such as @BLANK(FIELD). Alternatively, you can choose to replace all blanks or null values with a specific value. Filler nodes are often used in conjunction with the Type node to replace missing values. For example, you can fill blanks with the mean value of a field by specifying an expression such as @GLOBAL_MEAN. This expression will fill all blanks with the mean value as calculated by a Set Globals node.
Table 1. Filler node input and output values
Output: Data records with replaced field values, based on CLEM conditions that you specify
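The mean-fill example can be pictured with a small Python analog. This is only a sketch of the effect, not CLEM and not how the node is implemented; `None` stands in for a blank or null value, and the `fill_blanks_with_mean` helper and sample values are hypothetical.

```python
def fill_blanks_with_mean(values):
    """Replace None (standing in for blank or null) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# Hypothetical field with two missing values.
ages = [30, None, 50, 40, None]
filled = fill_blanks_with_mean(ages)
print(filled)
```

Computing the mean first and then substituting it mirrors the two-step arrangement in the text, where a Set Globals node calculates the value and the Filler node applies it.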
- Reclassify
The Reclassify node enables the transformation from one set of categorical values to another. Reclassification is useful for collapsing categories or regrouping data for analysis. For example, you could reclassify the values for Product into three groups, such as Kitchenware, Bath and Linens, and Appliances. Often, this operation is performed directly from a Distribution node by grouping values and generating a Reclassify node.
Table 1. Reclassify node input and output values
Input: Typically, data from a Distribution node or from the preceding stream of connected nodes
Output: Data records with subsets assigned to different classes
- Binning
Instead of displaying all data values individually, you can bin them. Binning involves grouping individual data values into one instance of a graphic element. A bin may be a point that indicates the number of cases in the bin. Or it may be a histogram bar, whose height indicates the number of cases in the bin. The majority of settings are supported; however, the ability to view the bin intervals after they have been computed is not yet supported.
Table 1. Binning node input and output values
Output: Data records with individual data values grouped into one instance of a graphic
- RFM Analysis
You can use the Recency, Frequency, Monetary (RFM) Analysis node to determine quantitatively which customers are likely to be the best ones by examining how recently they last purchased from you (recency), how often they purchased (frequency), and how much they spent over all transactions (monetary).
Table 1. RFM Analysis node input and output values
Output: Data records with calculated recency, frequency, and monetary values
- Ensemble
The Ensemble node combines two or more model nuggets to obtain more accurate predictions than can be gained from any of the individual models. By combining predictions from multiple models, limitations in individual models may be avoided, resulting in a higher overall accuracy. Models combined in this manner typically perform at least as well as the best of the individual models and often better.
Table 1. Ensemble node input and output values
Input: Two or more model nuggets
Output: Combined predictions from multiple models
- Partition
Partition nodes are used to generate a partition field that splits the data into separate subsets or samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a separate sample to test it, you can get a good indication of how well the model will generalize to larger data sets that are similar to the current data.
Only Simple mode (a single new field) is currently implemented.
While the desktop modeler supports multiple ways of deriving the new value, only the Formula mode is currently supported.
The Field type drop-down control in the desktop modeler includes a Specify option. This is not yet supported.
Table 1. Partition node input and output values
Output: Data records with the addition of the partition field that assigns individual records to the training, testing, or validation subset
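The idea of a partition field can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the node's implementation: the `add_partition_field` helper, the field name `partition`, the 60/20/20 proportions, and the fixed seed are all hypothetical choices.

```python
import random

def add_partition_field(records, train=0.6, test=0.2, seed=42):
    """Add a 'partition' field that assigns each record to training,
    testing, or validation according to the given proportions."""
    rng = random.Random(seed)
    partitioned = []
    for record in records:
        draw = rng.random()
        if draw < train:
            label = "training"
        elif draw < train + test:
            label = "testing"
        else:
            label = "validation"
        partitioned.append({**record, "partition": label})
    return partitioned

# Hypothetical input: ten records identified by an id field.
rows = [{"id": i} for i in range(10)]
for row in add_partition_field(rows):
    print(row)
```

Fixing the seed keeps the assignment repeatable, so the same records land in the same subset each time the sketch runs.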
- SetTo Flag
The Set to Flag node is used to derive flag fields based on the categorical values defined for one or more nominal fields. For example, your data set might contain a nominal field, BP (blood pressure), with the values High, Normal, and Low. For easier data manipulation, you might create a flag field for high blood pressure, which indicates whether or not the patient has high blood pressure.
Table 1. SetTo Flag node input and output values
Output: Data records with additional flag fields that you can use for classification
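The blood pressure example can be sketched in plain Python as follows. This only illustrates the derivation; the `set_to_flag` helper and the `<field>_<value>` naming scheme for the new fields are assumptions, not the node's actual behavior.

```python
def set_to_flag(records, field, values):
    """Derive one flag field per listed value of a nominal field."""
    flagged = []
    for record in records:
        new_record = dict(record)
        for value in values:
            # Hypothetical naming scheme: <field>_<value>
            new_record[f"{field}_{value}"] = record[field] == value
        flagged.append(new_record)
    return flagged

# Hypothetical patient records with a nominal BP field.
patients = [{"BP": "High"}, {"BP": "Normal"}, {"BP": "Low"}]
result = set_to_flag(patients, "BP", ["High"])
print(result)
```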
- Restructure
The Restructure node can be used to generate multiple fields based on the values of a nominal or flag field. The newly generated fields can contain values from another field or numeric flags (0 and 1). The functionality of this node is similar to that of the Set to Flag node. However, it offers more flexibility. It allows you to create fields of any type (including numeric flags), using the values from another field. You can then perform aggregation or other manipulations with other nodes downstream. (The Set to Flag node lets you aggregate fields in one step, which may be convenient if you are creating flag fields.)
Table 1. Restructure node input and output values
Input: Data from an Import node, such as the Data Asset node, from the preceding stream of connected nodes, or from a SetTo Flag node
Output: Data records with additional fields of any type that you can use for classification or aggregation
- Transpose
By default, columns are fields and rows are records or observations. If necessary, you can use a Transpose node to swap the data in rows and columns so that fields become records and records become fields. For example, if you have time series data where each series is a row rather than a column, you can transpose the data prior to analysis.
Table 1. Transpose node input and output values
Output: Data records where columns and rows are swapped
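The time series case mentioned for the Transpose node can be pictured with a short Python sketch. The data values are hypothetical, and the snippet shows only the row and column swap, not the node's implementation.

```python
# Each inner list is one series (row); after transposing, each series is a column.
series_as_rows = [
    [10, 11, 12],   # series A at t1, t2, t3
    [20, 21, 22],   # series B at t1, t2, t3
]
transposed = [list(column) for column in zip(*series_as_rows)]
print(transposed)  # [[10, 20], [11, 21], [12, 22]]
```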
- Field Reorder
Use the Field Reorder node to define the natural order used to display fields downstream. This order affects the display of fields in a variety of places, such as tables, lists, and the Field Chooser. This operation is useful, for example, when working with wide data sets to make fields of interest more visible.
Table 1. Field Reorder node input and output values
Output: Sorted data records
- History
History nodes are most often used for sequential data, such as time series data. They are used to create new fields containing data from fields in previous records. When using a History node, you may want to use data that is presorted by a particular field. You can use a Sort node to do this.
Table 1. History node input and output values Input Output
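Creating fields from previous records amounts to adding lagged copies of a field. The following plain-Python sketch illustrates this on presorted data; the `add_history` helper, the `span` parameter, and the `<field>_lag<k>` naming scheme are hypothetical and do not reflect the node's actual field names.

```python
def add_history(records, field, span):
    """Add fields holding the value of `field` from the previous `span` records.
    Records without enough history get None."""
    out = []
    for i, record in enumerate(records):
        new_record = dict(record)
        for lag in range(1, span + 1):
            # Hypothetical naming scheme: <field>_lag<k>
            previous = records[i - lag][field] if i - lag >= 0 else None
            new_record[f"{field}_lag{lag}"] = previous
        out.append(new_record)
    return out

sales = [{"month": 1, "units": 5}, {"month": 2, "units": 7}, {"month": 3, "units": 9}]
# Presort by the time field first, as the text suggests.
sales.sort(key=lambda r: r["month"])
with_history = add_history(sales, "units", span=2)
print(with_history)
```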
- Time Intervals
Use the Time Intervals node to specify intervals and derive a new time field for estimating or forecasting. A full range of time intervals is supported, from seconds to years.
Table 1. Time Intervals node input and output values
Output: Data records with the addition of a new time field; the new field has the same storage type as the input time field you chose. The node generates the following items:
- The field specified on the Fields tab as the Time Field, along with the chosen prefix/suffix. By default the prefix is $TI_.
- The fields specified on the Fields tab as the Dimension fields.
- The fields specified on the Fields tab as the Fields to aggregate.
A number of extra fields can also be generated, depending on the selected interval or period (such as the minute or second within which a measurement falls).
- Anonymize
With the Anonymize node, you can disguise field names, field values, or both when working with data that's to be included in a model downstream of the node. In this way, the generated model can be freely distributed (for example, to Technical Support) with no danger that unauthorized users will be able to view confidential data, such as employee records or patients' medical records.
Table 1. Anonymize node input and output values
Output: Data records with selected field names or values disguised so that sensitive data can be hidden from view
- Plot
Plot nodes show the relationship between numeric fields. You can create a plot using points (also known as a scatterplot), or you can use lines. You can create three types of line plots by specifying an X Mode in the dialog box.
Table 1. Plot node input and output values
Output: A scatterplot or line that shows the relationship between numeric fields
- Multiplot
A multiplot is a special type of plot that displays multiple Y fields over a single X field. The Y fields are plotted as colored lines and each is equivalent to a Plot node with Style set to Line and X Mode set to Sort. Multiplots are useful when you have time sequence data and want to explore the fluctuation of several variables over time.
Most of the Plot tab is implemented, except for animation. Most of the Appearance tab is supported, except for the auto/custom X and Y labels.
Table 1. Multiplot node input and output values
Output: A multiplot with multiple Y fields depicted as colored lines over a single X field
- Time Plot
Time Plot nodes enable you to view one or more time series plotted over time. The series you plot must contain numeric values and are assumed to occur over a range of time in which the periods are uniform.
Table 1. Time Plot node input and output values
Output: A time plot with numeric data from one or more time series plotted over time
- Distribution
A distribution graph or table shows the occurrence of symbolic (non-numeric) values, such as mortgage type or gender, in a data set. A typical use of the Distribution node is to show imbalances in the data that can be rectified by using a Balance node before creating a model. You can automatically generate a Balance node using the Generate menu in the distribution graph or table window.
The Plot settings are implemented. Most of the Appearance tab is supported, except for the auto/custom X and Y labels.
Table 1. Distribution node input and output values
Output: A distribution graph or table made up from non-numeric values
- Histogram
Histogram nodes show the occurrence of values for numeric fields. They are often used to explore the data before manipulations and model building. Similar to the Distribution node, Histogram nodes are frequently used to reveal imbalances in the data.
There are some limitations for the way that the Histogram node is implemented. Make note of the following restrictions:
- Most of the Plot tab is implemented, except for animation.
- None of the Options tab is implemented.
- Most of the Appearance tab is supported, except for the auto/custom X and Y labels.
Table 1. Histogram node input and output values
Output: A simplified version of a histogram that shows the occurrence of values for numeric fields
- Collection
Collections are similar to histograms except that collections show the distribution of values for one numeric field relative to the values of another, rather than the occurrence of values for a single field. A collection is useful for illustrating a variable or field whose values change over time. Using 3-D graphing, you can also include a symbolic axis displaying distributions by category. Two dimensional Collections are shown as stacked bar charts, with overlays where used.
Table 1. Collection node input and output values
Output: A histogram that shows the distribution of values for one numeric field relative to the values of another
- Web
Web nodes show the strength of relationships between values of two or more symbolic fields. The graph displays connections using varying types of lines to indicate connection strength. You can use a Web node, for example, to explore the relationship between the purchase of various items at an e-commerce site or a traditional retail outlet.
Table 1. Web node input and output values
Output: A line chart that shows the strength of relationships between two or more symbolic fields
- Evaluation
The Evaluation node offers an easy way to evaluate and compare predictive models to choose the best model for your application. Evaluation charts show how models perform in predicting particular outcomes. They work by sorting records based on the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and then plotting the value of the business criterion for each quantile, from highest to lowest. Multiple models are shown as separate lines in the plot.
Table 1. Evaluation node input and output values
Output: An evaluation chart with plotted values for each quantile from highest to lowest
- t-SNE
t-Distributed Stochastic Neighbor Embedding (t-SNE)© is a tool for visualizing high-dimensional data. It converts affinities of data points to probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student's t-distributions. This allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:
- Revealing the structure at many scales on a single map
- Revealing data that lie in multiple, different, manifolds, or clusters
- Reducing the tendency to crowd points together at the center
Table 1. t-SNE node input and output values
Output: A scatterplot that reduces highly dimensional data to only two or three dimensions
The following algorithms have basic implementations. For most of them, you can set the values on the Fields tab and some Build Options:
- Auto Classifier
The Auto Classifier node builds several classification models using multiple algorithms and settings, evaluates them, and selects the best-performing ones. You can then use these models to score new data; by combining ("ensembling") the results from those models, you can obtain a more accurate prediction.
Table 1. Auto Classifier node input and output values
Output: A classification model that has been compared and selected for its performance against other models
- Auto Numeric
The Auto Numeric node is equivalent to the Auto Classifier node, but for numeric (continuous) targets.
Table 1. Auto Numeric node input and output values
Output: A predictive model for a numeric target that has been compared and selected for its performance against other models
- Auto Cluster
The Auto Cluster node estimates and compares clustering models that identify groups of records with similar characteristics. The node works in the same manner as other automated modeling nodes, enabling you to experiment with multiple combinations of options in a single modeling pass. You can compare models by using basic measures that attempt to filter and rank the usefulness of the cluster models, and that provide a measure based on the importance of particular fields.
Table 1. Auto Cluster node input and output values
Output: A cluster model that has been compared and selected for its performance against other models
- Bayes Net
A Bayesian network is a model that displays variables in a data set and the probabilistic, or conditional, independencies between them. Using the Netezza Bayes Net node, you can build a probability model by combining observed and recorded evidence with "common-sense" real-world knowledge to establish the likelihood of occurrences by using seemingly unlinked attributes.
Table 1. Bayes Net node input and output values
Output: A probability model that is built on both observed and recorded data points
- C5.0
The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed.
Table 1. C5.0 node input and output values
Output: A decision tree or rule set model that uses categorical analysis
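To make "maximum information gain" concrete, the following Python sketch computes the gain of splitting on each candidate field and picks the best one. It is a textbook calculation on hypothetical data, not the C5.0 algorithm itself, which includes further refinements; the `entropy` and `information_gain` helpers and the field names are assumptions for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(records, field, target):
    """Entropy of the target minus the weighted entropy after splitting on field."""
    before = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[field], []).append(r[target])
    after = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return before - after

# Hypothetical records: BP separates the target perfectly, age does not.
data = [
    {"BP": "HIGH", "age": "old", "drug": "A"},
    {"BP": "HIGH", "age": "young", "drug": "A"},
    {"BP": "LOW", "age": "old", "drug": "B"},
    {"BP": "LOW", "age": "young", "drug": "B"},
]
gains = {f: information_gain(data, f, "drug") for f in ("BP", "age")}
best_field = max(gains, key=gains.get)
print(gains, best_field)
```

In this toy data set, splitting on BP removes all uncertainty about the target, so it is the field a gain-based method would choose first.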
- C&R Tree
The Classification and Regression (C&R) Tree node generates a decision tree that you can use to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups).
Table 1. C&R Tree node input and output values
Output: A decision tree model that is used for prediction or classification
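The notion of impurity can be illustrated with the Gini index, one commonly used impurity measure for classification trees. The sketch below assumes Gini for simplicity and uses hypothetical data; the `gini` and `split_impurity` helpers are illustrative and are not taken from SPSS Modeler.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every case falls into a single category."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_impurity(records, field, threshold, target):
    """Weighted Gini impurity of the binary split field <= threshold."""
    left = [r[target] for r in records if r[field] <= threshold]
    right = [r[target] for r in records if r[field] > threshold]
    total = len(records)
    return sum(len(side) / total * gini(side) for side in (left, right) if side)

# Hypothetical training records with a numeric input and a categorical target.
data = [
    {"age": 25, "buys": "no"}, {"age": 30, "buys": "no"},
    {"age": 45, "buys": "yes"}, {"age": 50, "buys": "yes"},
]
print(gini([r["buys"] for r in data]))          # impurity before splitting
print(split_impurity(data, "age", 30, "buys"))  # both segments are pure
```

Recursive partitioning repeats this comparison at each step, choosing the binary split with the lowest weighted impurity and then splitting the resulting segments again.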
- CHAID
Chi-squared Automatic Interaction Detection (CHAID) is used to discover the relationship between categorical variables by building a predictive model or tree to explain an outcome.
Table 1. CHAID node input and output values
Output: A predictive model or tree that establishes the relationship between categorical values
- QUEST
QUEST (Quick, Unbiased, Efficient Statistical Tree) is a binary classification method for building decision trees. A major motivation in its development was to reduce the processing time required for large C&R Tree analyses with either many variables or many cases. A second goal of QUEST was to reduce the tendency found in classification tree methods to favor inputs that allow more splits, that is, continuous (numeric range) input fields or those with many categories.
Table 1. QUEST node input and output values
Output: A decision tree that uses a binary classification method
- Tree-AS
The Tree-AS node can be used with data in a distributed environment. In this node you can choose to build decision trees using either a CHAID or Exhaustive CHAID model. CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits.
Table 1. Tree-AS node input and output values
Output: A decision tree that uses either the CHAID or Exhaustive CHAID methods
- Random Trees
The Random Trees node is similar to the C&R Tree node; however, the Random Trees node is designed to process big data to create a single tree. The Random Trees node generates a decision tree that you use to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered pure if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups).
Table 1. Random Trees node input and output values
Output: A decision tree that predicts future observations
- Random Forest
Random Forest is an advanced implementation of a bagging algorithm with a tree model as the base model. In random forests, each tree in the ensemble is built from a sample drawn with replacement (that is, a bootstrap sample) from the training set. When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. Because of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
Table 1. Random Forest node input and output values
Output: A tree model based on a random subset of features
- Decision List
The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side to compare the results. Decision List models consist of a list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the first rule that matches determines the outcome.
Table 1. Decision List node input and output values
Output: A decision list model
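The first-match behavior of a rule list can be sketched in a few lines of Python. This shows only how an ordered list of rules is applied to a record, not how the node finds the segments; the `apply_decision_list` helper, the rules, and the field names are hypothetical.

```python
def apply_decision_list(rules, record, default):
    """Return the outcome of the first rule whose condition matches the record."""
    for condition, outcome in rules:
        if condition(record):
            return outcome
    return default

# Hypothetical segments for a campaign-response model.
rules = [
    (lambda r: r["age"] < 30 and r["visits"] > 5, "likely to respond"),
    (lambda r: r["visits"] > 5, "may respond"),
]
customer = {"age": 25, "visits": 8}
print(apply_decision_list(rules, customer, default="unlikely to respond"))
```

Because the first matching rule wins, the order of the rules matters: the customer above also satisfies the second rule, but only the first outcome is returned.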
- Time series
The Time Series node can be used with data in either a local or distributed environment; in a distributed environment you can harness the power of IBM® SPSS® Analytic Server. With this node, you can choose to estimate and build exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), or multivariate ARIMA (or transfer function) models for time series, and produce forecasts based on the time series data.
Table 1. Time series node input and output values
Output: A time series model
- TCM
Use this node to create a temporal causal model (TCM).
Table 1. TCM node input and output values Input Output The TCM modeling operation creates a number of new fields with the prefix $TCM-, including the value forecasted by the model for each target series, the lower confidence intervals for each forecasted series, the upper confidence intervals for each forecasted series, and the noise residual value for each column of the generated model data
The generalized linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers widely used statistical models, such as linear regression for normally distributed responses, logistic models for binary data, loglinear models for count data, complementary log-log models for interval-censored survival data, plus many other statistical models through its very general model formulation.
Table 1. GenLin node input and output values Input Output A GenLin model
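The role of the link function can be shown with a small sketch: the linear predictor is computed from the covariates, and the inverse link maps it back to the scale of the target. The coefficients below are hypothetical, and only the logit and log links are shown.

```python
import math

def logit(p):
    """Logit link: maps a probability to the linear-predictor scale."""
    return math.log(p / (1 - p))

def inverse_logit(eta):
    """Inverse logit: maps a linear predictor to a probability (binary data)."""
    return 1 / (1 + math.exp(-eta))

def inverse_log(eta):
    """Inverse log link: maps a linear predictor to an expected count."""
    return math.exp(eta)

intercept, slope = -1.0, 0.5   # hypothetical coefficients
x = 2.0
eta = intercept + slope * x    # linear predictor, here 0.0

probability = inverse_logit(eta)   # logistic model for binary data
expected_count = inverse_log(eta)  # loglinear model for count data
```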
Generalized Linear Engine uses a variety of statistical techniques to support both classification and continuous predicted values. Unlike many algorithms, the target does not need to have a normal distribution.
Table 1. GLE node input and output values Input Output A GLE model
Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
Table 1. Linear node input and output values Input Output A linear regression model
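For a single predictor, the linear relationship can be fitted with ordinary least squares in closed form. The sketch below uses made-up data that lie exactly on a line; it illustrates the idea and is not how the node is implemented.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]      # generated from y = 1 + 2x
intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 5.0
```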
Linear regression is a common statistical technique for predicting a numeric target based on the values of numeric input fields. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. Linear-AS can run when connected to IBM SPSS Analytic Server.
Table 1. Linear-AS node input and output values Input Output A linear regression model that is connected to IBM SPSS Analytic Server
Linear regression is a common statistical technique for predicting a numeric target based on the values of numeric input fields. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. The Regression node is due to be replaced by the Linear node in a future release. We recommend using Linear models for linear regression from now on.
Table 1. Regression node input and output values Input Output A linear regression model
The Linear Support Vector Machine (LSVM) is a classification algorithm that is particularly suited for use with wide data sets, that is, those with a large number of predictor fields.
Table 1. LSVM node input and output values Input Output A Linear Support Vector Machine (LSVM) model
Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range.
Table 1. Logistic node input and output values Input Output A logistic regression model
- Neural Net
The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply.
Table 1. Neural Net node input and output values Input Output A neural network model
Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns of data without requiring an exact match to any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other. Thus, the distance between two cases is a measure of their dissimilarity.
Table 1. KNN node input and output values Input Output A KNN model
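The idea that distance measures dissimilarity can be sketched as a small nearest-neighbor classifier. The training cases, labels, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance between two cases; larger means more dissimilar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, query, k):
    """Label the query by majority vote among its k nearest training cases."""
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stored cases: (features, label)
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]

label_near_a = knn_classify(training, (1.1, 1.0), k=3)
label_near_b = knn_classify(training, (5.2, 4.9), k=3)
```

Neither query matches a stored case exactly, yet each is assigned the label of the cases it sits closest to.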
Cox Regression is used for survival analysis, such as estimating the probability that an event has occurred at a certain time. For example, a company is interested in modeling the time to churn in order to determine the factors that are associated with customers who are quick to switch to another service.
Table 1. Cox node input and output values Input Output A Cox regression model
The Principal Components Analysis node aims to reduce the complexity of data by finding a smaller number of derived fields that effectively summarize the information in the original set of fields.
Table 1. PCA/Factor node input and output values Input Output A PCA model that uses a subset of critical fields
The SVM node enables you to use a support vector machine to classify data. SVM is particularly suited for use with wide data sets, that is, those with a large number of predictor fields. You can use the default settings on the node to produce a basic model relatively quickly, or you can use the Expert settings to experiment with different types of SVM model.
Table 1. SVM node input and output values Input Output A support vector machine model that you can use to classify data
- Feature Selection
The Feature Selection node screens input fields for removal based on a set of criteria, such as the percentage of missing values. It then ranks the importance of remaining inputs relative to a specified target. For example, given a data set with hundreds of potential inputs, which are most likely to be useful in modeling patient outcomes?
Table 1. Feature Selection node input and output values Input Output A ranked output of input fields based on the usefulness of the data for predictive modeling
Discriminant analysis builds a predictive model for group membership. The model is composed of a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases that have measurements for the predictor variables but have unknown group membership.
Table 1. Discriminant node input and output values Input Output A discriminant model that extracts a function for assigning records to a group
This node creates a generalized linear mixed model (GLMM).
Generalized linear mixed models extend the linear model so that:
- The target is linearly related to the factors and covariates via a specified link function
- The target can have a non-normal distribution
- The observations can be correlated
Generalized linear mixed models cover a wide variety of models, from simple linear regression to complex multilevel models for non-normal longitudinal data.
Table 1. GLMM node input and output values Input Output A generalized linear mixed model
The Self-Learning Response Model (SLRM) node enables you to build a model that you can continually update, or reestimate, as a dataset grows without having to rebuild the model every time using the complete dataset. For example, this is useful when you have several products and you want to identify which product a customer is most likely to buy if you offer it to them. This model allows you to predict which offers are most appropriate for customers and the probability of the offers being accepted.
Table 1. SLRM node input and output values Input Output An SLRM model
- Association Rules
Association rules are statements of the following form.
if condition(s) then prediction(s)
For example, "If a customer purchases a razor and after shave, then that customer will purchase shaving cream with 80% confidence." The Association Rules node extracts a set of rules from the data, pulling out the rules with the highest information content.
Table 1. Association Rules node input and output values Input Output An association rules model that extracts if/then rules from the data
The Apriori node discovers association rules in the data. Association rules are statements of the following form:
if antecedent(s) then consequent(s)
For example, "if a customer purchases a razor and after shave, then that customer will purchase shaving cream with 80% confidence." Apriori extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to efficiently process large data sets.
Table 1. Apriori node input and output values Input Output An apriori model that is based on if/then rules
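The "80% confidence" in the example rule can be reproduced with a short calculation. The sketch below only computes support and confidence for one rule over made-up transactions; it does not implement Apriori's rule search or indexing scheme.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions matching the antecedent, the share that also match the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical baskets: 5 contain razor and after shave, 4 of those also contain shaving cream.
transactions = (
    [{"razor", "after shave", "shaving cream"}] * 4
    + [{"razor", "after shave"}]
    + [{"bread", "milk"}] * 5
)

rule_support = support(transactions, ["razor", "after shave", "shaving cream"])
rule_confidence = confidence(transactions, ["razor", "after shave"], ["shaving cream"])
```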
The CARMA node uses an association rules discovery algorithm to discover association rules in the data. Association rules are statements that are in the following form: if antecedent(s) then consequent(s)
Table 1. Carma node input and output values Input Output A Carma model
The Sequence node discovers patterns in sequential or time-oriented data, in the format bread -> cheese. The elements of a sequence are item sets that constitute a single transaction. For example, if a person goes to the store and purchases bread and milk and then a few days later returns to the store and purchases some cheese, that person's buying activity can be represented as two item sets. The first item set contains bread and milk, and the second one contains cheese. A sequence is a list of item sets that tend to occur in a predictable order. The Sequence node detects frequent sequences and creates a generated model node that can be used to make predictions.
Table 1. Sequence node input and output values Input Output A sequence model that lists transactions in a particular order
The Kohonen node generates a type of neural network that can be used to cluster the data set into distinct groups. When the network is fully trained, records that are similar should be close together on the output map, while records that are different will be far apart. You can look at the number of observations captured by each unit in the model nugget to identify the strong units. This may give you a sense of the appropriate number of clusters.
Table 1. Kohonen node input and output values Input Output A Kohonen neural network that places similar records close together in an output map
The Anomaly node identifies outliers, or unusual cases, in the data. Unlike other modeling methods that store rules about unusual cases, anomaly detection models store information on what normal behavior looks like. This makes it possible to identify outliers even if they do not conform to any known pattern, and it can be particularly useful in applications such as fraud detection.
Table 1. Anomaly node input and output values Input Output An anomaly model that highlights normal behavior
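The principle of storing what normal looks like, rather than rules about unusual cases, can be illustrated with a deliberately simple stand-in technique: a mean and standard deviation profile with a z-score threshold. This is not the algorithm the node uses, and the values and threshold are assumptions.

```python
import statistics

def fit_normal_profile(values):
    """Summarize normal behavior as a mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(value, profile, threshold=3.0):
    """Flag a value that lies more than `threshold` deviations from normal."""
    mean, stdev = profile
    return abs(value - mean) / stdev > threshold

# Hypothetical transaction amounts that represent normal behavior.
normal_amounts = [20.0, 22.0, 19.0, 21.0, 18.0, 20.0]
profile = fit_normal_profile(normal_amounts)

typical = is_anomaly(21.0, profile)   # close to the profile
unusual = is_anomaly(95.0, profile)   # far from anything seen before
```

The profile contains no description of fraud, yet the second value is flagged because it does not resemble normal behavior.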
K-means is an unsupervised algorithm used to cluster the data set into distinct groups. Instead of trying to predict an outcome, k-means tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar.
Table 1. K-Means node input and output values Input Output A k-means model in which data records are clustered or grouped automatically based on similarities
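The grouping step can be sketched for a single numeric field. The starting centers, the data, and the fixed iteration count are assumptions made to keep the example deterministic; the node handles multiple fields and its own initialization.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and recomputing centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(values, centers=[1.0, 12.0])
```

No target field is involved: the two groups emerge only from how close the values are to each other.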
The TwoStep Cluster node provides a form of cluster analysis. It can be used to cluster the data set into distinct groups when you don't know what those groups are at the beginning. As with Kohonen nodes and K-Means nodes, TwoStep Cluster models do not use a target field. Instead of trying to predict an outcome, TwoStep Cluster tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar.
Table 1. TwoStep node input and output values Input Output A cluster model where data records are grouped according to patterns
TwoStep Cluster is an exploratory tool that is designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent.
Table 1. TwoStep-AS node input and output values Input Output A cluster model that reveals natural groupings of data records
Isotonic Regression belongs to the family of regression algorithms. The Isotonic-AS node in SPSS® Modeler is implemented in Spark. For details about Isotonic Regression algorithms, see Regression - RDD-based API.
Table 1. Isotonic-AS node input and output values Input Output An isotonic regression model
K-Means is one of the most commonly used clustering algorithms. It clusters data points into a predefined number of clusters. The K-Means-AS node in SPSS® Modeler is implemented in Spark. For details about K-Means algorithms, see K-Means-AS.
Table 1. K-Means-AS node input and output values Input Output A k-means model
XGBoost© is an advanced implementation of a gradient boosting algorithm. Boosting algorithms iteratively learn weak classifiers and then add them to a final strong classifier. XGBoost is very flexible and provides many parameters that can be overwhelming to most users, so the XGBoost-AS node in Watson Studio exposes the core features and commonly used parameters. The XGBoost-AS node is implemented in Spark.
Table 1. XGBoost-AS node input and output values Input Output An XGBoost-AS model
- XGBoost Tree
XGBoost Tree© is an advanced implementation of a gradient boosting algorithm with a tree model as the base model. Boosting algorithms iteratively learn weak classifiers and then add them to a final strong classifier. XGBoost Tree is very flexible and provides many parameters that can be overwhelming to most users, so the XGBoost Tree node in SPSS® Modeler exposes the core features and commonly used parameters. The node is implemented in Python.
Table 1. XGBoost Tree node input and output values Input Output An XGBoost tree model
- XGBoost Linear
XGBoost Linear© is an advanced implementation of a gradient boosting algorithm with a linear model as the base model. Boosting algorithms iteratively learn weak classifiers and then add them to a final strong classifier. The XGBoost Linear node in SPSS® Modeler is implemented in Python.
Table 1. XGBoost Linear node input and output values Input Output An XGBoost linear model
- One-Class SVM
The One-Class SVM© node uses an unsupervised learning algorithm. The node can be used for novelty detection. It detects the soft boundary of a given set of samples and then classifies new points as belonging to that set or not. This One-Class SVM modeling node is implemented in Python and requires the scikit-learn© Python library.
Table 1. One-Class SVM node input and output values Input Output A One-Class SVM model
- Extension Model
Build and score results by running R or Python for Spark scripts.
Table 1. Extension Model node input and output values Input Output An extended model
Use the controls on the Outputs tab, which appears on most output nodes, to generate an output object to disk rather than opening an output window containing the content. This is supported on any node and output management within Watson Studio. There is also an assumption that there is only one object that needs to be rendered in the UI, such as a table or a bitmap. Some output nodes have a more complex output structure, such as the Statistics node output, which has a tree-based structure. Apart from the Table node, outputs are exported in HTML format, sent to the thin client, and rendered in an iFrame.
The Table node displays the data in table format, which can also be written to a file. This is useful anytime that you need to inspect your data values or export them in an easily readable form.
The Settings and Formats tabs are not implemented.
Table 1. Table node input and output values Input Output Data records in a table format
The Matrix node enables you to create a table that shows relationships between fields. It is most commonly used to show the relationship between two categorical fields (flag, nominal, or ordinal), but it can also be used to show relationships between continuous (numeric range) fields.
Table 1. Matrix node input and output values Input Output Data records in a matrix format
The Analysis node can provide valuable information about the model itself. The Analysis node evaluates predictive models' ability to generate accurate predictions. Analysis nodes perform various comparisons between predicted values and actual values for one or more model nuggets. They can also compare predictive models to each other.
Table 1. Analysis node input and output values Input Output An evaluation of a model's accuracy or a comparison of models
- Data Audit
Provides a comprehensive first look at the data, including summary statistics, histograms and distribution for each field, as well as information about outliers, missing values, and extremes. Results are displayed in an easy-to-read matrix that can be sorted and used to generate full-size graphs and data preparation nodes.
The Settings tab is implemented.
On the Quality tab, only the Missing Values settings are available.
Table 1. Data Audit node input and output values Input Output A matrix that analyzes data and provides useful tools for understanding a data set
Normalizing input fields is an important step before using traditional scoring techniques, such as regression, logistic regression, and discriminant analysis. These techniques carry assumptions about normal distributions of data that may not be true for many raw data files. One approach to dealing with real-world data is to apply transformations that move a raw data element toward a more normal distribution. In addition, normalized fields can easily be compared with each other—for example, income and age are on totally different scales in a raw data file but when normalized, the relative impact of each can be easily interpreted.
Table 1. Transform node input and output values Input Output Data records that have been normalized
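The point about comparing fields on different scales can be illustrated with z-score standardization. This is a stand-in chosen for the example and is not necessarily one of the transformations that the node applies; the income and age values are made up.

```python
import statistics

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / stdev for v in values]

incomes = [30000.0, 50000.0, 70000.0]   # hypothetical raw field
ages = [25.0, 40.0, 55.0]               # hypothetical raw field

income_z = z_scores(incomes)
age_z = z_scores(ages)
```

After rescaling, both fields take the same three values even though the raw numbers differ by three orders of magnitude.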
The Statistics node gives you basic summary information about numeric fields. You can get summary statistics for individual fields and correlations between fields.
Table 1. Statistics node input and output values Input Output A data table with statistics about numeric fields
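A correlation between two numeric fields can be computed by hand as follows. The sketch assumes the Pearson coefficient and uses made-up field values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric fields."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

visits = [1.0, 2.0, 3.0, 4.0]            # hypothetical field
spend = [2.0, 4.0, 6.0, 8.0]             # rises with visits
returns = [8.0, 6.0, 4.0, 2.0]           # falls with visits

positive = pearson(visits, spend)
negative = pearson(visits, returns)
```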
The Means node compares the means between independent groups or between pairs of related fields to test whether a significant difference exists. For example, you can compare mean revenues before and after running a promotion or compare revenues from customers who didn't receive the promotion with those who did.
Table 1. Means node input and output values Input Output A table that compares the average of related fields or groups of fields
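Comparing the means of two independent groups can be sketched with a t statistic. The example below assumes Welch's form of the statistic and made-up revenue figures; it stops at the statistic itself and does not compute a significance level, and the node may use a different test.

```python
import math
import statistics

def welch_t(a, b):
    """t statistic for the difference in means of two independent groups."""
    var_a = statistics.variance(a)      # sample variances
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

with_promotion = [12.0, 14.0, 16.0]     # hypothetical revenues
without_promotion = [8.0, 10.0, 12.0]

mean_difference = statistics.mean(with_promotion) - statistics.mean(without_promotion)
t_statistic = welch_t(with_promotion, without_promotion)
```

A larger absolute t statistic indicates that the difference in means is large relative to the variation within the groups.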
You can use the Report node to create formatted reports containing fixed text, data, or other expressions derived from the data. Specify the format of the report by using text templates to define the fixed text and the data output constructions. You can provide custom text formatting using HTML tags in the template and by setting output options.
Table 1. Report node input and output values Input Output Data values and other conditional output are included in the report using CLEM expressions in the template.
- Set Globals
The Set Globals node scans the data and computes summary values that can be used in CLEM expressions. For example, you can use a Set Globals node to compute statistics for a field called age and then use the overall mean of age in CLEM expressions by inserting the function
Table 1. Set Globals node input and output values Input Output Data records with calculated summary values that can be used in CLEM expressions
- Sim Fit
The Simulation Fitting node fits a set of candidate statistical distributions to each field in the data. The fit of each distribution to a field is assessed using a goodness of fit criterion. When a Simulation Fitting node runs, a Simulation Generate node is built (or an existing node is updated). Each field is assigned its best fitting distribution. The Simulation Generate node can then be used to generate simulated data for each field.
Table 1. Sim Fit node input and output values Input Output Although the Simulation Fitting node is a terminal node, it does not add output to the Outputs panel or export data.
- Sim Eval
The Simulation Evaluation node is a terminal node that evaluates a specified field, provides a distribution of the field, and produces charts of distributions and correlations. This node is primarily used to evaluate continuous fields. It therefore complements the evaluation chart, which is generated by an Evaluation node and is useful for evaluating discrete fields. Another difference is that the Simulation Evaluation node evaluates a single prediction across several iterations, whereas the Evaluation node evaluates multiple predictions each with a single iteration. Iterations are generated when more than one value is specified for a distribution parameter in the Simulation Generate node.
Table 1. Sim Eval node input and output values Input Output The Simulation Evaluation node is designed to be used with data that was obtained from the Simulation Fitting and Simulation Generate nodes. The node can, however, be used with any other node. Any number of processing steps can be placed between the Simulation Generate node and the Simulation Evaluation node.
The Simulation Evaluation node is a terminal node that evaluates a specified field, provides a distribution of the field, and produces charts of distributions and correlations.
- Extension Output
Run R or Python for Spark scripts to extend the output.
Table 1. Extension Output node input and output values Input Output An extended output
- Data Asset Export
You can use Database nodes to write data to data sources, which are listed in the project data assets.
Table 1. Data Asset Export node input and output values Input Output Updated or expanded data records written to the data source, which must be part of the project assets
Ready to create an SPSS Modeler Flow? For a real-world example of working with the flow tool, see this machine learning flow tutorial.