Last updated: Jan 12, 2024
You can use the scalable Two-Step algorithm to cluster data in notebooks, and the Cluster model evaluation algorithm to evaluate cluster models.
Two-Step Cluster
Scalable Two-Step is based on the familiar two-step clustering algorithm, but extends both its functionality and performance in several directions.
First, it works effectively with large, distributed data through Spark, which provides the Map-Reduce computing paradigm.
Second, the algorithm provides mechanisms for selecting the most relevant features for clustering the given data and for detecting rare outlier points. It also provides an enhanced set of evaluation and diagnostic features that help you gain insight into the results.
The two-step clustering algorithm works as its name suggests. First, a pre-clustering step scans the entire dataset and records the dense regions of data cases as summary statistics called cluster features. The cluster features are stored in memory in a data structure called the CF-tree. Second, an agglomerative hierarchical clustering algorithm is applied to cluster the set of cluster features.
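The two steps described above can be illustrated with a minimal, pure-Python sketch. This is not the SPSS implementation: the names `ClusterFeature`, `pre_cluster`, and `agglomerate` are hypothetical, the cluster features are kept in a flat list rather than a CF-tree, and plain Euclidean distance between centroids stands in for whatever distance measure the real algorithm uses. It only shows the idea that points are first condensed into additive summary statistics, and that the summaries (not the raw points) are then merged hierarchically.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """Hypothetical summary of a group of points: count, linear sum, squared sum."""
    n: int
    linear_sum: List[float]
    square_sum: float

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusterFeature":
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusterFeature") -> "ClusterFeature":
        # Cluster features are additive, so merging needs no raw data.
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (an assumption made for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pre_cluster(points, threshold: float) -> List[ClusterFeature]:
    """Step 1: one pass over the data; absorb each point into the nearest
    cluster feature if it lies within `threshold`, else start a new one."""
    features: List[ClusterFeature] = []
    for p in points:
        best_i, best_d = None, None
        for i, cf in enumerate(features):
            d = sq_dist(cf.centroid(), p)
            if best_d is None or d < best_d:
                best_i, best_d = i, d
        if best_i is not None and best_d <= threshold ** 2:
            features[best_i] = features[best_i].merge(ClusterFeature.from_point(p))
        else:
            features.append(ClusterFeature.from_point(p))
    return features


def agglomerate(features: List[ClusterFeature], k: int) -> List[ClusterFeature]:
    """Step 2: repeatedly merge the two closest cluster features until k remain."""
    features = list(features)
    while len(features) > k:
        i, j = min(
            combinations(range(len(features)), 2),
            key=lambda ij: sq_dist(features[ij[0]].centroid(),
                                   features[ij[1]].centroid()),
        )
        merged = features[i].merge(features[j])
        features = [cf for idx, cf in enumerate(features) if idx not in (i, j)]
        features.append(merged)
    return features


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
    summaries = pre_cluster(pts, threshold=0.5)
    for cf in agglomerate(summaries, k=2):
        print(cf.n, cf.centroid())
```

Because the second step only ever touches the summaries, its cost depends on the number of cluster features rather than on the number of raw data cases, which is what makes the approach attractive for large datasets.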
Python example code:
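from spss.ml.clustering.twostep import TwoStep

# Configure the estimator: input fields, distance measure,
# feature importance method, and automatic clustering.
cluster = TwoStep(). \
    setInputFieldList(["region", "happy", "age"]). \
    setDistMeasure("LOGLIKELIHOOD"). \
    setFeatureImportanceMethod("CRITERION"). \
    setAutoClustering(True)

# `data` is not defined in this snippet; it is assumed to be a
# data frame loaded earlier in the notebook.
clusterModel = cluster.fit(data)
predictions = clusterModel.transform(data)
predictions.show()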
Cluster model evaluation
Cluster model evaluation (CME) aims to interpret cluster models and discover useful insights based on various evaluation measures.
It is a generic post-modeling analysis that is independent of the type of cluster model.
Python example code:
from spss.ml.clustering.twostep import TwoStep

cluster = TwoStep(). \
    setInputFieldList(["region", "happy", "age"]). \
    setDistMeasure("LOGLIKELIHOOD"). \
    setFeatureImportanceMethod("CRITERION"). \
    setAutoClustering(True)

clusterModel = cluster.fit(data)
predictions = clusterModel.transform(data)
predictions.show()
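To make the idea of a model-independent evaluation measure concrete, here is a small pure-Python sketch of one widely used measure, the silhouette coefficient. This page does not state which measures CME computes, so treat the choice of silhouette as an assumption made purely for illustration; `silhouette_scores` is a hypothetical helper, not part of the `spss.ml` API. It needs only the points and their cluster labels, which is why it applies to the output of any clustering model.

```python
from math import dist
from statistics import mean
from typing import Hashable, List, Sequence


def silhouette_scores(points: Sequence[Sequence[float]],
                      labels: Sequence[Hashable]) -> List[float]:
    """Return the silhouette value of each point (hypothetical helper).

    For a point p: a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster,
    silhouette = (b - a) / max(a, b). Values near 1 indicate a good fit.
    """
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a singleton cluster gets a silhouette of 0.
            scores.append(0.0)
            continue
        # dist(p, p) == 0, so summing over the whole cluster and dividing
        # by (size - 1) gives the mean distance to the *other* members.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(mean(dist(p, q) for q in members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(mean(silhouette_scores(pts, [0, 0, 1, 1])))  # well-separated labeling
    print(mean(silhouette_scores(pts, [0, 1, 0, 1])))  # poor labeling
```

Comparing the average score across candidate labelings of the same data is one simple way to judge which clustering separates the data better.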