Creating machine learning flows with SPSS Modeler nodes
You can create a machine learning flow by using SPSS nodes.
Note: Watson Studio does not include SPSS functionality in Peru, Ecuador, Colombia and Venezuela.
The SPSS Modeler node palette
Source nodes contain the data for your machine learning flows and are available via ObjectStore and the Connections or Db2 on Cloud tabs from within Watson Studio. In addition to these data sets, the User Input node, which can be used to generate a small data set, is available in the Record Operations tab.
- Data Asset
- The Data Asset node contains the data from your project for your flows. Drag it from the Palette to the canvas area. Double-click the node, click the Change data asset button, select one of the data assets from your project, and then click OK.
- User Input
- The User Input node provides an easy way for you to create synthetic data, either from scratch or by altering existing data. This is useful, for example, when you want to create a test data set for modeling.
- You can use Select nodes to select or discard a subset of records from the data stream based on a specific condition, such as BP (blood pressure) = "HIGH".
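As a rough illustration of what a Select condition does, the following Python sketch keeps or discards records based on a predicate. It is not SPSS Modeler or CLEM code: the record layout, the `id` field, and the helper `select_records` are hypothetical names chosen for this example.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "BP": "HIGH"},
    {"id": 2, "BP": "NORMAL"},
    {"id": 3, "BP": "HIGH"},
    {"id": 4, "BP": "LOW"},
]


def select_records(rows, condition, mode="include"):
    """Keep rows that match the condition ("include") or drop them ("discard")."""
    if mode not in ("include", "discard"):
        raise ValueError("mode must be 'include' or 'discard'")
    keep_matching = mode == "include"
    return [row for row in rows if condition(row) == keep_matching]


def is_high_bp(row):
    """The condition from the text: BP = "HIGH"."""
    return row["BP"] == "HIGH"


included = select_records(records, is_high_bp, mode="include")
discarded = select_records(records, is_high_bp, mode="discard")

print([row["id"] for row in included])   # [1, 3]
print([row["id"] for row in discarded])  # [2, 4]
```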
The Sample node selects a subset of records. A variety of sample types are supported, including stratified, clustered, and nonrandom (structured) samples. Sampling can be useful to improve performance, and to select groups of related records or transactions for analysis.
Only the simple modes are currently supported.
You can use Sort nodes to sort records into ascending or descending order based on the values of one or more fields. For example, Sort nodes are frequently used to view and select records with the most common data values. Typically, you first aggregate the data by using the Aggregate node and then use the Sort node to sort the aggregated data into descending order of record counts. Display these results in a table so you can explore the data and make decisions, such as selecting the records of the top 10 best customers.
The basic Settings tab is implemented, but the Optimization tab is not yet implemented.
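The aggregate-then-sort pattern described above can be sketched in plain Python. This is only an analogy for what the Aggregate and Sort nodes do together; the customer IDs and the variable names are made up for the example.

```python
from collections import Counter

# Hypothetical input: one entry per purchase record, identified by customer ID.
purchases = ["C3", "C1", "C2", "C1", "C3", "C1", "C4"]

# Aggregate step: one output row per key, with a record count.
record_counts = Counter(purchases)

# Sort step: descending by record count; ties are broken by key for a stable order.
ranked = sorted(record_counts.items(), key=lambda item: (-item[1], item[0]))

# Keep the top N rows, analogous to selecting the best customers from a table.
top_n = 2
top_customers = ranked[:top_n]
print(top_customers)  # [('C1', 3), ('C3', 2)]
```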
- The Balance node corrects imbalances in a data set, so it conforms to a specified condition. The balancing directive adjusts the proportion of records where a condition is true by the factor specified.
- Duplicate records in a data set must be removed before data mining can begin. For example, in a marketing database, individuals may appear multiple times with different address or company information. You can use the Distinct node to find or remove duplicate records in your data, or to create a single, composite record from a group of duplicate records.
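The two uses of the Distinct node mentioned above, removing duplicates and building a composite record, can be sketched as follows. This is a simplified illustration, not the node's implementation: the helper names are hypothetical, and the composite rule used here (the last non-empty value of each field wins) is an assumption made for the example.

```python
def distinct_first(rows, key_fields):
    """Keep only the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def composite(rows, key_fields):
    """Build one record per key; the last non-empty value of each field wins."""
    merged = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        target = merged.setdefault(key, {})
        for field, value in row.items():
            if value not in (None, ""):
                target[field] = value
    return list(merged.values())


# Hypothetical marketing records: the same person appears twice
# with different company and city information.
customers = [
    {"name": "Ann Lee", "company": "Acme", "city": ""},
    {"name": "Bob Roy", "company": "Initech", "city": "Austin"},
    {"name": "Ann Lee", "company": "", "city": "Boston"},
]

first_only = distinct_first(customers, ["name"])
combined = composite(customers, ["name"])
print(len(first_only))  # 2
print(combined[0])
```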
Aggregation is a data preparation task that is frequently used to reduce the size of a data set. Before proceeding with aggregation, you should take time to clean the data, concentrating especially on missing values. When you aggregate, potentially useful information regarding missing values might be lost.
On the Settings tab, the Key fields, Aggregate fields, Include record count and the count field name are implemented. Default operations for other fields are currently fixed.
The Merge node takes multiple input records and creates a single output record that contains some or all of the input fields. It is useful for merging data from different sources, such as internal customer data and purchased demographic data.
Only the Merge tab is currently supported. Within that, the Merge Method of Ranked Condition is not supported.
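A key-based merge behaves much like an inner join. The sketch below illustrates the idea in plain Python; it is not how the Merge node is implemented. The field names and the helper `merge_by_key` are hypothetical, and the sketch assumes that key values are unique in the second input.

```python
def merge_by_key(left, right, key):
    """Inner join: one output record per key value that appears in both inputs."""
    # Assumes each key value occurs at most once in `right`.
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Combine the fields of both records into a single output record.
            merged.append({**row, **match})
    return merged


# Hypothetical internal customer data and purchased demographic data.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cy"},
]
demographics = [
    {"customer_id": 1, "region": "North"},
    {"customer_id": 3, "region": "South"},
]

merged = merge_by_key(customers, demographics, "customer_id")
for record in merged:
    print(record)
```

Customer 2 has no demographic record, so it does not appear in the output of this inner-join sketch.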
- The Append node allows multiple data sets to be appended together (similar to 'UNION' in SQL). For example, a customer may have sales data in separate files for each month and want to combine them into a single view of sales over several years.
- Streaming TS
- You use the Streaming Time Series node to build and score time series models in one step. A separate time series model is built for each target field; however, model nuggets are not added to the generated models palette and the model information cannot be browsed.
- The Synthetic Minority Over-sampling Technique (SMOTE) node provides an over-sampling algorithm to deal with imbalanced data sets. It provides an advanced method for balancing data. The SMOTE process node is implemented in Python and requires the imbalanced-learn© Python library.
- Auto Data Prep
- Preparing data for analysis is one of the most important steps in any project and, traditionally, one of the most time consuming. Automated Data Preparation (ADP) handles the task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques. You can use the algorithm in fully automatic fashion, allowing it to choose and apply fixes, or you can use it in interactive fashion, previewing the changes before they are made and accepting or rejecting them as you want.
The Type node specifies field metadata and properties. For example, you can specify a measurement level (continuous, nominal, ordinal, or flag) for each field, set options for handling missing values and system nulls, set the role of a field for modeling purposes, specify field and value labels, and specify values for a field. In some cases you might need to fully instantiate the Type node in order for other nodes to work correctly, such as the fields from property of the Set to Flag node. You can simply connect a Table node and execute it to instantiate the fields.
The Types tab is partly implemented in that many of the settings that can be changed by double-clicking the field name in desktop model builder can be edited. However, behavior in the cloud version is different from the desktop model builder in that it only shows property settings which are local to the node. You might not see the same derived information that you would expect to see in the desktop model builder.
The Types tab in desktop model builder also has action buttons to clear values or force a data pass. This is not currently supported.
The Format tab is not currently implemented.
Creates a new Filter node to filter fields that are not used by rules in the rule set.
The filtering (dropping) of fields and the renaming of fields are implemented by using separate tabs. As noted previously, dependencies between tabs might not be supported. For example, fields that are dropped on the Filter tab might still appear in the list of available fields in the Rename tab.
The Derive node modifies data values or creates new fields from one or more existing fields. It creates fields of type formula, flag, nominal, state, count, and conditional.
- Filler nodes are used to replace field values and change storage. You can choose to replace values based on a specified CLEM condition, such as @BLANK(FIELD). Alternatively, you can choose to replace all blanks or null values with a specific value. Filler nodes are often used in conjunction with the Type node to replace missing values. For example, you can fill blanks with the mean value of a field by specifying an expression such as @GLOBAL_MEAN. This expression will fill all blanks with the mean value as calculated by a Set Globals node.
- The Reclassify node enables the transformation from one set of categorical values to another. Reclassification is useful for collapsing categories or regrouping data for analysis. For example, you could reclassify the values for Product into three groups, such as Kitchenware, Bath and Linens, and Appliances. Often, this operation is performed directly from a Distribution node by grouping values and generating a Reclassify node.
- Instead of displaying all data values individually, you can bin them. Binning involves grouping individual data values into one instance of a graphic element. A bin may be a point that indicates the number of cases in the bin. Or it may be a histogram bar, whose height indicates the number of cases in the bin. The majority of settings are supported; however, the ability to view the bin intervals after they have been computed is not yet supported.
- The Ensemble node combines two or more model nuggets to obtain more accurate predictions than can be gained from any of the individual models. By combining predictions from multiple models, limitations in individual models may be avoided, resulting in a higher overall accuracy. Models combined in this manner typically perform at least as well as the best of the individual models and often better.
- Partition nodes are used to generate a partition field that splits the data into separate subsets or samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a separate sample to test it, you can get a good indication of how well the model will generalize to larger data sets that are similar to the current data.
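The idea of a partition field can be sketched in Python as a random label added to each record. This is an illustration only, not the Partition node's algorithm: the field name `Partition`, the label values, the default proportions, and the seed are all assumptions made for the example.

```python
import random


def add_partition(rows, train=0.5, test=0.3, seed=1234567):
    """Return copies of the rows with a 'Partition' field added.

    Each record is labelled Training, Testing, or Validation at random;
    whatever share is not given to training or testing goes to validation.
    """
    if train < 0 or test < 0 or train + test > 1:
        raise ValueError("train and test must be non-negative and sum to at most 1")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    labelled = []
    for row in rows:
        draw = rng.random()
        if draw < train:
            label = "Training"
        elif draw < train + test:
            label = "Testing"
        else:
            label = "Validation"
        labelled.append({**row, "Partition": label})
    return labelled


# Hypothetical input records.
rows = [{"id": i} for i in range(10)]
partitioned = add_partition(rows)
for record in partitioned:
    print(record["id"], record["Partition"])
```

Because assignment is random per record, the realized proportions only approximate the requested ones, especially on small data sets.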
Only Simple mode (a single new field) is currently implemented.
While the desktop modeler supports multiple ways of deriving the new value, only the Formula mode is currently supported.
The Field type drop-down control in the desktop modeler includes a Specify option. This is not yet supported.
- Set to Flag
- The Set to Flag node is used to derive flag fields based on the categorical values defined for one or more nominal fields. For example, your data set might contain a nominal field, BP (blood pressure), with the values High, Normal, and Low. For easier data manipulation, you might create a flag field for high blood pressure, which indicates whether or not the patient has high blood pressure.
- The Restructure node can be used to generate multiple fields based on the values of a nominal or flag field. The newly generated fields can contain values from another field or numeric flags (0 and 1). The functionality of this node is similar to that of the Set to Flag node. However, it offers more flexibility. It allows you to create fields of any type (including numeric flags), using the values from another field. You can then perform aggregation or other manipulations with other nodes downstream. (The Set to Flag node lets you aggregate fields in one step, which may be convenient if you are creating flag fields.)
- By default, columns are fields and rows are records or observations. If necessary, you can use a Transpose node to swap the data in rows and columns so that fields become records and records become fields. For example, if you have time series data where each series is a row rather than a column, you can transpose the data prior to analysis.
- Field Reorder
- Use the Field Reorder node to define the natural order used to display fields downstream. This order affects the display of fields in a variety of places, such as tables, lists, and the Field Chooser. This operation is useful, for example, when working with wide data sets to make fields of interest more visible.
- Plot nodes show the relationship between numeric fields. You can create a plot using points (also known as a scatterplot), or you can use lines. You can create three types of line plots by specifying an X Mode in the dialog box.
A multiplot is a special type of plot that displays multiple Y fields over a single X field. The Y fields are plotted as colored lines and each is equivalent to a Plot node with Style set to Line and X Mode set to Sort. Multiplots are useful when you have time sequence data and want to explore the fluctuation of several variables over time.
Most of the Plot tab is implemented, except for animation. Most of the Appearance tab is supported, except for the auto/custom X and Y labels.
- Time Plot
Time Plot nodes enable you to view one or more time series plotted over time. The series you plot must contain numeric values and are assumed to occur over a range of time in which the periods are uniform.
A distribution graph or table shows the occurrence of symbolic (non-numeric) values, such as mortgage type or gender, in a data set. A typical use of the Distribution node is to show imbalances in the data that can be rectified by using a Balance node before creating a model. You can automatically generate a Balance node using the Generate menu in the distribution graph or table window.
The Plot settings are implemented. Most of the Appearance tab is supported, except for the auto/custom X and Y labels.
Histogram nodes show the occurrence of values for numeric fields. They are often used to explore the data before manipulations and model building. Similar to the Distribution node, Histogram nodes are frequently used to reveal imbalances in the data.
There are some limitations for the way that the Histogram node is implemented. Make note of the following restrictions:
- Most of the Plot tab is implemented, except for animation.
- None of the Options tab is implemented.
- Most of the Appearance tab is supported, except for the auto/custom X and Y labels.
- Collections are similar to histograms except that collections show the distribution of values for one numeric field relative to the values of another, rather than the occurrence of values for a single field. A collection is useful for illustrating a variable or field whose values change over time. Using 3-D graphing, you can also include a symbolic axis displaying distributions by category. Two-dimensional collections are shown as stacked bar charts, with overlays where used.
- Web nodes show the strength of relationships between values of two or more symbolic fields. The graph displays connections using varying types of lines to indicate connection strength. You can use a Web node, for example, to explore the relationship between the purchase of various items at an e-commerce site or a traditional retail outlet.
- The Evaluation node offers an easy way to evaluate and compare predictive models to choose the best model for your application. Evaluation charts show how models perform in predicting particular outcomes. They work by sorting records based on the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and then plotting the value of the business criterion for each quantile, from highest to lowest. Multiple models are shown as separate lines in the plot.
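The sort-split-plot procedure described above can be sketched in Python. The example computes a cumulative gains curve, which is one possible evaluation criterion and is chosen here as an assumption; the function name, the sample scores, and the requirement that records divide evenly into quantiles are simplifications for this sketch, not the Evaluation node's behavior.

```python
def cumulative_gains(scored, n_quantiles):
    """Cumulative percentage of hits captured after each quantile.

    `scored` is a list of (confidence, is_hit) pairs. Records are sorted by
    confidence, highest first, and split into quantiles of equal size.
    """
    if n_quantiles <= 0 or len(scored) % n_quantiles != 0:
        raise ValueError("this sketch needs the records to divide evenly into quantiles")
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_hits = sum(1 for _, hit in ordered if hit)
    if total_hits == 0:
        raise ValueError("at least one hit is required")
    size = len(ordered) // n_quantiles
    gains = []
    captured = 0
    for q in range(n_quantiles):
        chunk = ordered[q * size:(q + 1) * size]
        captured += sum(1 for _, hit in chunk if hit)
        gains.append(100.0 * captured / total_hits)
    return gains


# Hypothetical model output: (confidence of a positive prediction, actual outcome).
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, False), (0.30, True), (0.10, False),
]

gains = cumulative_gains(scored, 4)
print(gains)  # [50.0, 75.0, 75.0, 100.0]
```

A model that ranks records well captures most of the hits in the first quantiles, so its curve rises steeply at the start.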
t-Distributed Stochastic Neighbor Embedding (t-SNE)© is a tool for visualizing high-dimensional data. It converts affinities of data points to probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student's t-distributions. This allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:
- Revealing the structure at many scales on a single map
- Revealing data that lie in multiple, different manifolds or clusters
- Reducing the tendency to crowd points together at the center
The following algorithms have basic implementations, with most having the ability to set the values on the Fields tab and some Build Options:
- Auto Classifier
- This node builds several classification models using multiple algorithms and settings, evaluates them, and selects the best-performing ones. You can then use these models to score new data; combining ("ensembling") their results can give a more accurate prediction.
- Auto Numeric
- This is equivalent to the Auto Classifier but for numeric/continuous targets.
- Auto Cluster
- The Auto Cluster node estimates and compares clustering models that identify groups of records with similar characteristics. The node works in the same manner as other automated modeling nodes, enabling you to experiment with multiple combinations of options in a single modeling pass. Models can be compared using basic measures that attempt to filter and rank the usefulness of the cluster models, and that provide a measure based on the importance of particular fields.
- Bayes Net
- A Bayesian network is a model that displays variables in a data set and the probabilistic, or conditional, independencies between them. Using the Netezza Bayes Net node, you can build a probability model by combining observed and recorded evidence with "common-sense" real-world knowledge to establish the likelihood of occurrences by using seemingly unlinked attributes.
- The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed.
- C&R Tree
- The Classification and Regression (C&R) Tree node generates a decision tree that you can use to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups).
- Chi-square Automatic Interaction Detector is used to discover the relationship between categorical variables by building a predictive model or tree to explain an outcome.
- QUEST (Quick, Unbiased, Efficient Statistical Tree) is a binary classification method for building decision trees. A major motivation in its development was to reduce the processing time required for large C&R Tree analyses with either many variables or many cases. A second goal of QUEST was to reduce the tendency found in classification tree methods to favor inputs that allow more splits, that is, continuous (numeric range) input fields or those with many categories.
- The Tree-AS node can be used with data in a distributed environment. In this node you can choose to build decision trees using either a CHAID or Exhaustive CHAID model. CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits.
- Random Trees
- The Random Trees node is similar to the C&RT node; however, the Random Trees node is designed to process big data to create a single tree. The Random Trees node generates a decision tree that you use to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered pure if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups).
- Random Forest
- Random Forest is an advanced implementation of a bagging algorithm with a tree model as the base model. In random forests, each tree in the ensemble is built from a sample drawn with replacement (that is, a bootstrap sample) from the training set. When splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. Because of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
- Decision List
- The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side to compare the results. Decision List models consist of a list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the first rule that matches determines the outcome.
- Time series
- The Time Series node can be used with data in either a local or distributed environment; in a distributed environment you can harness the power of IBM® SPSS® Analytic Server. With this node, you can choose to estimate and build exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), or multivariate ARIMA (or transfer function) models for time series, and produce forecasts based on the time series data.
- The generalized linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution. It covers widely used statistical models, such as linear regression for normally distributed responses, logistic models for binary data, loglinear models for count data, complementary log-log models for interval-censored survival data, plus many other statistical models through its very general model formulation.
- Generalized Linear Engine uses a variety of statistical techniques to support both classification and continuous predicted values. Unlike many algorithms, the target does not need to have a normal distribution.
- Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
- Linear regression is a common statistical technique for predicting a numeric target based on the values of numeric input fields. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. Linear-AS can run when connected to IBM SPSS Analytic Server.
- Linear regression is a common statistical technique for predicting a numeric target based on the values of numeric input fields. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. The Regression node is due to be replaced by the Linear node in a future release. We recommend using Linear models for linear regression from now on.
- The Linear Support Vector Machine (LSVM) is a classification algorithm that is particularly suited for use with wide data sets, that is, those with a large number of predictor fields.
- Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range.
- Neural Net
- The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply.
- Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns of data without requiring an exact match to any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other. Thus, the distance between two cases is a measure of their dissimilarity.
- Cox Regression is used for survival analysis, such as estimating the probability that an event has occurred at a certain time. For example, a company is interested in modeling the time to churn in order to determine the factors that are associated with customers who are quick to switch to another service.
- The Principal Components Analysis node aims to reduce the complexity of data by finding a smaller number of derived fields that effectively summarize the information in the original set of fields.
- The SVM node enables you to use a support vector machine to classify data. SVM is particularly suited for use with wide data sets, that is, those with a large number of predictor fields. You can use the default settings on the node to produce a basic model relatively quickly, or you can use the Expert settings to experiment with different types of SVM models.
- Feature Selection
- The Feature Selection node screens input fields for removal based on a set of criteria, such as the percentage of missing values. It then ranks the importance of remaining inputs relative to a specified target. For example, given a data set with hundreds of potential inputs, which are most likely to be useful in modeling patient outcomes?
- Discriminant analysis builds a predictive model for group membership. The model is composed of a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases that have measurements for the predictor variables but have unknown group membership.
- Association Rules
Association rules are statements of the following form.
if condition(s) then prediction(s)
For example, "If a customer purchases a razor and after shave, then that customer will purchase shaving cream with 80% confidence." The Association Rules node extracts a set of rules from the data, pulling out the rules with the highest information content.
- The Apriori node discovers association rules in the data. Association rules are statements of the following form:
if antecedent(s) then consequent(s)
For example, "if a customer purchases a razor and after shave, then that customer will purchase shaving cream with 80% confidence." Apriori extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to efficiently process large data sets.
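The confidence figure in the example above can be made concrete with a short Python sketch. It only measures one given rule on a small set of made-up transactions; it does not implement the Apriori search itself, and the function name and the basket contents are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (antecedent_support, confidence) for: if antecedent then consequent.

    antecedent_support: share of all transactions that contain the antecedent.
    confidence: share of those transactions that also contain the consequent.
    """
    antecedent = set(antecedent)
    consequent = set(consequent)
    n_antecedent = sum(1 for basket in transactions if antecedent <= basket)
    if n_antecedent == 0:
        raise ValueError("no transaction contains the antecedent")
    n_both = sum(1 for basket in transactions if (antecedent | consequent) <= basket)
    return n_antecedent / len(transactions), n_both / n_antecedent


# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"razor", "aftershave", "shaving cream"},
    {"razor", "aftershave", "shaving cream"},
    {"razor", "aftershave", "shaving cream", "soap"},
    {"razor", "aftershave", "shaving cream"},
    {"razor", "aftershave"},
    {"razor"},
    {"soap", "shaving cream"},
    {"aftershave", "soap"},
]

antecedent_support, confidence = rule_metrics(
    transactions, {"razor", "aftershave"}, {"shaving cream"}
)
print(antecedent_support)  # 0.625
print(confidence)          # 0.8
```

Five of the eight baskets contain both a razor and aftershave, and four of those five also contain shaving cream, which gives the 80% confidence.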
- The Sequence node discovers patterns in sequential or time-oriented data, in the format bread -> cheese. The elements of a sequence are item sets that constitute a single transaction. For example, if a person goes to the store and purchases bread and milk and then a few days later returns to the store and purchases some cheese, that person's buying activity can be represented as two item sets. The first item set contains bread and milk, and the second one contains cheese. A sequence is a list of item sets that tend to occur in a predictable order. The Sequence node detects frequent sequences and creates a generated model node that can be used to make predictions.
- The Kohonen node generates a type of neural network that can be used to cluster the data set into distinct groups. When the network is fully trained, records that are similar should be close together on the output map, while records that are different will be far apart. You can look at the number of observations captured by each unit in the model nugget to identify the strong units. This may give you a sense of the appropriate number of clusters.
- Identify outliers, or unusual cases, in the data. Unlike other modeling methods that store rules about unusual cases, anomaly detection models store information on what normal behavior looks like. This makes it possible to identify outliers even if they do not conform to any known pattern, and it can be particularly useful in applications, such as fraud detection.
- This is an unsupervised algorithm used to cluster the data set into distinct groups. Instead of trying to predict an outcome, k-means tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar.
- The TwoStep Cluster node provides a form of cluster analysis. It can be used to cluster the data set into distinct groups when you don't know what those groups are at the beginning. As with Kohonen nodes and K-Means nodes, TwoStep Cluster models do not use a target field. Instead of trying to predict an outcome, TwoStep Cluster tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar.
- TwoStep Cluster is an exploratory tool that is designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent.
- Isotonic Regression belongs to the family of regression algorithms. The Isotonic-AS node in SPSS® Modeler is implemented in Spark. For details about Isotonic Regression algorithms, see Regression - RDD-based API.
- K-Means is one of the most commonly used clustering algorithms. It clusters data points into a predefined number of clusters. The K-Means-AS node in SPSS® Modeler is implemented in Spark. For details about K-Means algorithms, see K-Means-AS.
- XGBoost Tree
- XGBoost Tree© is an advanced implementation of a gradient boosting algorithm with a tree model as the base model. Boosting algorithms iteratively learn weak classifiers and then add them to a final strong classifier. XGBoost Tree is very flexible and provides many parameters that can be overwhelming to most users, so the XGBoost Tree node in SPSS® Modeler exposes the core features and commonly used parameters. The node is implemented in Python.
- XGBoost Linear
- XGBoost Linear© is an advanced implementation of a gradient boosting algorithm with a linear model as the base model. Boosting algorithms iteratively learn weak classifiers and then add them to a final strong classifier. The XGBoost Linear node in SPSS® Modeler is implemented in Python.
- One-Class SVM
- The One-Class SVM© node uses an unsupervised learning algorithm. The node can be used for novelty detection. It will detect the soft boundary of a given set of samples, to then classify new points as belonging to that set or not. This One-Class SVM modeling node is implemented in Python and requires the scikit-learn© Python library.
The Outputs tab appears on most output nodes and allows the user to generate an output object to disk rather than opening an output window containing the content. This is supported on any node and output management within Watson Studio is not yet implemented. There is also an assumption that there is only one object that needs to be rendered in the UI, such as a table or a bitmap. Some output nodes have a more complex output structure, such as the Statistics node output, which has a tree-based structure. Apart from the Table node, outputs are exported in HTML format, sent to the thin client, and rendered in an iFrame.
Displays the data in table format, which can also be written to a file. This is useful any time you need to inspect your data values or export them in an easily readable form.
The Settings and Formats tabs are not implemented.
- The Matrix node enables you to create a table that shows relationships between fields. It is most commonly used to show the relationship between two categorical fields (flag, nominal, or ordinal), but it can also be used to show relationships between continuous (numeric range) fields.
- The Analysis node can provide valuable information about the model itself. The Analysis node evaluates predictive models' ability to generate accurate predictions. Analysis nodes perform various comparisons between predicted values and actual values for one or more model nuggets. They can also compare predictive models to each other.
- Data Audit
Provides a comprehensive first look at the data, including summary statistics, histograms and distributions for each field, as well as information about outliers, missing values, and extremes. Results are displayed in an easy-to-read matrix that can be sorted and used to generate full-size graphs and data preparation nodes.
The Settings tab is implemented.
On the Quality tab, only the Missing Values settings are available.
- Normalizing input fields is an important step before using traditional scoring techniques, such as regression, logistic regression, and discriminant analysis. These techniques carry assumptions about normal distributions of data that may not be true for many raw data files. One approach to dealing with real-world data is to apply transformations that move a raw data element toward a more normal distribution. In addition, normalized fields can easily be compared with each other—for example, income and age are on totally different scales in a raw data file but when normalized, the relative impact of each can be easily interpreted.
- The Statistics node gives you basic summary information about numeric fields. You can get summary statistics for individual fields and correlations between fields.
- The Means node compares the means between independent groups or between pairs of related fields to test whether a significant difference exists. For example, you can compare mean revenues before and after running a promotion or compare revenues from customers who didn't receive the promotion with those who did.
- Data Asset node
- You can use Database nodes to write data to data sources, which are listed in the project data assets.
Ready to create an SPSS Modeler Flow? For a real-world example of working with the flow tool, see this machine learning flow tutorial.