TwoStep Overview

As its name suggests, TwoStep is a two-step clustering method. The first step makes a single pass through the data, compressing the raw input into a manageable set of sub-clusters. The second step uses a hierarchical clustering method to progressively merge the sub-clusters into larger and larger clusters, without requiring another pass through the data. Hierarchical clustering has the advantage of not requiring the number of clusters to be selected ahead of time. Many hierarchical clustering methods treat each individual record as its own starting cluster and merge these recursively to produce ever larger clusters. Such approaches often break down with large amounts of data, but because TwoStep merges only the pre-clustered sub-clusters rather than individual records, hierarchical clustering stays fast even for large datasets.
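The two stages described above can be illustrated with a small sketch. It is not the node's actual implementation: the pre-clustering here is a simple distance-threshold pass, the merge step uses Euclidean distance between centroids, and the names `precluster`, `merge_subclusters`, and `threshold` are hypothetical, chosen only for this illustration.

```python
import math

def precluster(records, threshold):
    """Step 1 (assumed simplification): one pass over the data.

    Each record joins the nearest sub-cluster whose centroid lies within
    `threshold`; otherwise it opens a new sub-cluster. Only a count and a
    running sum are kept per sub-cluster, not the records themselves.
    """
    subs = []  # each entry is [count, sum_vector]
    for rec in records:
        best, best_d = None, threshold
        for sub in subs:
            centre = [s / sub[0] for s in sub[1]]
            d = math.dist(rec, centre)
            if d <= best_d:
                best, best_d = sub, d
        if best is None:
            subs.append([1, list(rec)])
        else:
            best[0] += 1
            best[1] = [s + x for s, x in zip(best[1], rec)]
    return subs

def merge_subclusters(subs, n_clusters):
    """Step 2: merge the two closest sub-clusters until n_clusters remain.

    Works on the summaries only, so the raw data is never read again.
    Returns a list of (count, centroid) pairs.
    """
    clusters = [[n, list(s)] for n, s in subs]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]

        def centroid_gap(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            return math.dist([s / a[0] for s in a[1]],
                             [s / b[0] for s in b[1]])

        i, j = min(pairs, key=centroid_gap)
        a, b = clusters[i], clusters[j]
        merged = [a[0] + b[0], [x + y for x, y in zip(a[1], b[1])]]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [(n, [s / n for s in total]) for n, total in clusters]
```

Because the second function sees only one summary per sub-cluster, its cost depends on the number of sub-clusters rather than on the number of records, which is the reason the pre-clustering pass keeps the hierarchical step affordable.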

The TwoStep node can handle mixed field types and processes large datasets efficiently. It can also test several cluster solutions and choose the best, so you don't need to know how many clusters to ask for at the outset. TwoStep can be set to automatically exclude outliers, that is, extremely unusual cases that can contaminate your results.
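One way to picture "testing several cluster solutions" is sketched below. This overview does not say which criterion the node uses to choose the best solution, so the rule here is an assumption: keep merging the two closest clusters, and pick the number of clusters at which the next merge would require the largest relative jump in distance. The function names and the `max_k` parameter are hypothetical.

```python
import math

def centroid(cluster):
    """Mean of a list of equal-length numeric tuples."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def gap(clusters, i, j):
    """Euclidean distance between the centroids of clusters i and j."""
    return math.dist(centroid(clusters[i]), centroid(clusters[j]))

def merge_distances(points):
    """Merge the two closest clusters until one remains.

    Returns {k: distance between the closest pair while k clusters exist}.
    """
    clusters = [[p] for p in points]
    closest = {}
    while len(clusters) > 1:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        i, j = min(pairs, key=lambda ab: gap(clusters, ab[0], ab[1]))
        closest[len(clusters)] = gap(clusters, i, j)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return closest

def choose_k(points, max_k):
    """Pick k in 2..max_k where the closest-pair distance jumps the most
    relative to the solution with one more cluster (assumed heuristic)."""
    closest = merge_distances(points)
    candidates = [k for k in range(2, max_k + 1)
                  if k + 1 in closest and closest[k + 1] > 0]
    return max(candidates, key=lambda k: closest[k] / closest[k + 1])
```

A large ratio at a given k means the clusters in that solution are far apart compared with the clusters that were merged to reach it, which is a reasonable sign that further merging would join groups that are genuinely distinct. This heuristic cannot return a single-cluster solution.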

Note that, as with other types of cluster analysis, the resulting model can depend to some extent on the order of the training data. Reordering the data and rebuilding the model may lead to a different final cluster model. To assess how robust the solution is to record order, fit the model several times, each time with the records in a different random order, and compare the results.
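The robustness check described above can be sketched as follows. The `fit` argument is a hypothetical stand-in for whatever builds the cluster model: it is assumed to take a list of records and return one cluster label per record, in the same order. Agreement between runs is measured with the Rand index, which is one reasonable choice and not something this overview prescribes.

```python
import itertools
import random

def rand_index(labels_a, labels_b):
    """Share of record pairs that both labelings treat the same way
    (both place the pair together, or both place it apart)."""
    pairs = list(itertools.combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def order_stability(records, fit, n_runs=5, seed=0):
    """Refit on shuffled copies of the data.

    Returns the Rand index of every later run against the first run.
    Values near 1.0 suggest the solution is insensitive to record order.
    """
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        order = list(range(len(records)))
        rng.shuffle(order)
        shuffled_labels = fit([records[i] for i in order])
        labels = [None] * len(records)
        for position, original_index in enumerate(order):
            # Map each label back to the record's original position.
            labels[original_index] = shuffled_labels[position]
        runs.append(labels)
    return [rand_index(runs[0], other) for other in runs[1:]]
```

The Rand index compares pairs of records rather than label values, so two runs that find the same groups but number them differently still score 1.0.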
