K-Means Overview

The K-Means node provides a method of cluster analysis. It can be used to cluster the dataset into distinct groups when you don’t know what those groups are at the beginning. Unlike most learning methods in IBM SPSS Modeler, K-Means models do not use a target field. This type of learning, with no target field, is called unsupervised learning. Instead of trying to predict an outcome, K-Means tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar, but records in different groups are dissimilar.

K-Means works by defining a set of starting cluster centers derived from data. It then assigns each record to the cluster to which it is most similar, based on the record’s input field values. After all cases have been assigned, the cluster centers are updated to reflect the new set of records assigned to each cluster. The records are then checked again to see whether they should be reassigned to a different cluster, and the record assignment/cluster iteration process continues until either the maximum number of iterations is reached, or the change between one iteration and the next fails to exceed a specified threshold.

The K-Means algorithm results in an approximation to a solution to the problem of assigning instances to clusters to minimize the within cluster sum of squared Euclidean distances to the between cluster sum of squared distances across the clustering features. An exhaustive search to optimize this ratio for a given set of data becomes impractical in practice on even moderate sized problems.

To keep features on scales with larger variability from dominating the analysis, scale fields are rescaled to a standard 0-1 range. Categorical features are encoded as sets of dummy or indicator fields, and by default computations involving these sets are performed with a scaling factor that results in the contribution of a categorical field where two records disagree to the Euclidean distance between the records is the same as their being at opposite ends of the 0-1 scale on a scale feature.

Note that the resulting model often depends to a certain extent on the order of the training data. Reordering the data and rebuilding the model may lead to a different final cluster model. Despite the order-dependence of the results, the K-Means clustering algorithm remains among the most popular clustering algorithms, as it is fast and tends to yield results that are competitive with more complicated algorithms in situations where optimal solutions cannot feasibly be produced.

Next steps