Clustering

Two Step Cluster

Scalable Two-Step is based on the familiar two-step clustering algorithm, but extends both its functionality and performance in several directions.

First, it can work effectively with large, distributed data sets on Spark, which provides the MapReduce computing paradigm.

Second, the algorithm provides mechanisms for selecting the features most relevant to clustering the given data, as well as for detecting rare outlier points. Moreover, it provides an enhanced set of evaluation and diagnostic features to enable insight into the resulting clusters.

The two-step clustering algorithm first performs a pre-clustering step by scanning the entire dataset and summarizing the dense regions of data cases as summary statistics called cluster features. The cluster features are stored in memory in a data structure called the CF-tree. In the second step, an agglomerative hierarchical clustering algorithm is applied to cluster the set of cluster features.
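The two steps above can be illustrated with a minimal, self-contained sketch. This is not the IBM SPSS implementation: the CF-tree is replaced by a flat list of cluster features with threshold-based insertion, distances are plain Euclidean distances between centroids (rather than the log-likelihood measure), and all names below are hypothetical.

```scala
object TwoStepSketch {
  // A cluster feature summarizes a dense region: point count, linear sum,
  // and sum of squares per dimension. Two features merge by adding fields.
  case class CF(n: Int, sum: Array[Double], sumSq: Array[Double]) {
    def centroid: Array[Double] = sum.map(_ / n)
    def merge(o: CF): CF =
      CF(n + o.n,
         sum.zip(o.sum).map { case (a, b) => a + b },
         sumSq.zip(o.sumSq).map { case (a, b) => a + b })
  }

  // Euclidean distance between feature centroids (a stand-in for the
  // log-likelihood distance used by the real algorithm).
  def dist(a: CF, b: CF): Double =
    math.sqrt(a.centroid.zip(b.centroid).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Step 1 (pre-clustering): absorb each point into the nearest feature if it
  // lies within `threshold`, otherwise open a new feature. A real CF-tree
  // makes this lookup hierarchical instead of linear.
  def preCluster(points: Seq[Array[Double]], threshold: Double): Vector[CF] =
    points.foldLeft(Vector.empty[CF]) { (cfs, p) =>
      val cf = CF(1, p, p.map(x => x * x))
      if (cfs.isEmpty) cfs :+ cf
      else {
        val (nearest, i) = cfs.zipWithIndex.minBy { case (c, _) => dist(c, cf) }
        if (dist(nearest, cf) <= threshold) cfs.updated(i, nearest.merge(cf))
        else cfs :+ cf
      }
    }

  // Step 2: agglomerative hierarchical clustering over the features —
  // repeatedly merge the closest pair until k clusters remain.
  def cluster(cfs: Vector[CF], k: Int): Vector[CF] = {
    var cur = cfs
    while (cur.size > k) {
      val pairs = for (i <- cur.indices; j <- cur.indices if i < j) yield (i, j)
      val (i, j) = pairs.minBy { case (a, b) => dist(cur(a), cur(b)) }
      cur = cur.updated(i, cur(i).merge(cur(j))).patch(j, Nil, 1)
    }
    cur
  }
}
```

Because step 2 operates only on the compact set of cluster features rather than the raw cases, the expensive hierarchical pass stays cheap even when the original dataset is large.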

Example code:

import com.ibm.spss.ml.clustering.TwoStep

val cluster = TwoStep().
  setInputFieldList(Array("region", "happy", "age")).  // fields used for clustering
  setDistMeasure("LOGLIKELIHOOD").                     // log-likelihood distance measure
  setFeatureImportanceMethod("CRITERION").             // criterion-based feature selection
  setAutoClustering(true)                              // determine the number of clusters automatically
val clusterModel = cluster.fit(data)

val predictions = clusterModel.transform(data)
predictions.show()

Cluster model evaluation

Cluster model evaluation (CME) aims to interpret cluster models and discover useful insights based on various evaluation measures.

It is a post-modeling analysis that is generic and independent of the type of cluster model.

Example code:

import com.ibm.spss.ml.clustering.ClusterModelEvaluation

val cluster = ClusterModelEvaluation(local).
  setInputContainerKeys(List("k")).
  setEvaluationFieldList(Array("Na")).
  setNumBins(8).
  setNumExtEvalCats(8).
  setMaxNumImportantFields(3).
  setSigLevel(0.9).
  setFMeasureBeta(1.0)

val clusterModel = cluster.fit(data)
