In notebooks, you can cluster data with the scalable Two-Step algorithm and evaluate cluster models with the Cluster model evaluation algorithm.
Two-Step Cluster
Scalable Two-Step is based on the familiar two-step clustering algorithm, but extends both its functionality and performance in several directions.
First, it works effectively with large, distributed data on Spark, which provides the Map-Reduce computing paradigm.
Second, the algorithm provides mechanisms for selecting the features that are most relevant for clustering the given data and for detecting rare outlier points. It also provides an enhanced set of evaluation and diagnostic features that give more insight into the results.
The two-step clustering algorithm first performs a pre-clustering step by scanning the entire dataset and storing the dense regions of data cases in terms of summary statistics called cluster features. The cluster features are stored in memory in a data structure called the CF-tree. Finally, an agglomerative hierarchical clustering algorithm is applied to cluster the set of cluster features.
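The two steps can be illustrated with a minimal, library-independent sketch. It is not the SPSS implementation: it assumes Euclidean distance between centroids instead of the log-likelihood measure, keeps the cluster features in a flat list rather than a CF-tree, and uses hypothetical names (`ClusterFeature`, `pre_cluster`, `agglomerate`) and an arbitrary `threshold`. What it does show is that each cluster feature holds only a count, a linear sum, and a sum of squares, so summaries can be merged without revisiting the raw data.

```python
from dataclasses import dataclass
from math import dist  # requires Python 3.8 or later


@dataclass
class ClusterFeature:
    """Summary of a dense region: count, per-dimension sum, per-dimension sum of squares."""
    n: int
    linear_sum: list
    square_sum: list

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), [x * x for x in p])

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def merge(self, other):
        # Cluster features are additive, so merging never touches raw points.
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            [a + b for a, b in zip(self.square_sum, other.square_sum)],
        )


def pre_cluster(points, threshold):
    """Step 1: scan the data once, absorbing each point into a nearby summary."""
    features = []
    for p in points:
        point_cf = ClusterFeature.from_point(p)
        if features:
            i = min(range(len(features)), key=lambda j: dist(features[j].centroid(), p))
            if dist(features[i].centroid(), p) <= threshold:
                features[i] = features[i].merge(point_cf)
                continue
        features.append(point_cf)  # no summary is close enough: start a new one
    return features


def agglomerate(features, k):
    """Step 2: repeatedly merge the two closest summaries until k remain."""
    features = list(features)
    while len(features) > k:
        pairs = [(i, j) for i in range(len(features)) for j in range(i + 1, len(features))]
        i, j = min(
            pairs,
            key=lambda ij: dist(features[ij[0]].centroid(), features[ij[1]].centroid()),
        )
        merged = features[i].merge(features[j])
        features = [f for idx, f in enumerate(features) if idx not in (i, j)] + [merged]
    return features


# Toy data (made up for illustration): two dense groups and one point in between.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0), (8.1, 7.9), (4.0, 4.2), (7.9, 8.2)]
summaries = pre_cluster(points, threshold=0.5)
clusters = agglomerate(summaries, k=2)
```

On this toy data the single pass compresses seven points into three summaries, and the hierarchical step then merges the two closest summaries to reach two clusters.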
Python example code:
from spss.ml.clustering.twostep import TwoStep

# `data` is assumed to be an already loaded Spark DataFrame that contains the input fields.
cluster = TwoStep(). \
    setInputFieldList(["region", "happy", "age"]). \
    setDistMeasure("LOGLIKELIHOOD"). \
    setFeatureImportanceMethod("CRITERION"). \
    setAutoClustering(True)

# Fit the model, then assign each record to a cluster.
clusterModel = cluster.fit(data)
predictions = clusterModel.transform(data)
predictions.show()
Cluster model evaluation
Cluster model evaluation (CME) aims to interpret cluster models and discover useful insights based on various evaluation measures.
It is a generic post-modeling analysis that is independent of the type of cluster model.
Python example code (the same TwoStep model as in the Two-Step Cluster section):
from spss.ml.clustering.twostep import TwoStep

# `data` is assumed to be an already loaded Spark DataFrame that contains the input fields.
cluster = TwoStep(). \
    setInputFieldList(["region", "happy", "age"]). \
    setDistMeasure("LOGLIKELIHOOD"). \
    setFeatureImportanceMethod("CRITERION"). \
    setAutoClustering(True)

# Fit the model, then assign each record to a cluster.
clusterModel = cluster.fit(data)
predictions = clusterModel.transform(data)
predictions.show()
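As a library-independent illustration of what a model-agnostic evaluation measure looks like, the sketch below computes the silhouette coefficient, one widely used measure of cluster quality. This is an assumption made for illustration: the text above does not list the measures that CME computes, and `silhouette_scores` is a hypothetical helper, not part of the SPSS API. The function needs only the data points and their cluster labels, so it applies to the output of any clustering model. It assumes at least two clusters.

```python
from math import dist  # requires Python 3.8 or later


def silhouette_scores(points, labels):
    """Return one silhouette value per point, using only points and cluster labels."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


# Toy data and labels (made up for illustration): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0), (8.1, 7.9), (7.9, 8.2)]
labels = [0, 0, 0, 1, 1, 1]
scores = silhouette_scores(points, labels)
mean_silhouette = sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest that points sit between clusters or are assigned to the wrong one.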