Data preparation

Descriptives

Descriptives efficiently computes univariate and bivariate statistics and provides automatic data preparation features for large-scale data. It is widely applicable to data profiling, data exploration, and data preparation for subsequent modeling analyses.

The core statistical features include essential univariate and bivariate statistical summaries, univariate order statistics, metadata creation from raw data, statistics for visualizing single fields and field pairs, data preparation features, a data interestingness score, and data quality assessment. Together, these features support automated data processing, user interactivity, and obtaining insights into single fields or into the relationships between pairs of fields, including pairs that involve a specified target.

Example code:

import com.ibm.spss.ml.datapreparation.Descriptives

val de = Descriptives().
 setInputFieldList(Array("Field1", "Field2")).
 setTargetFieldList(Array("Field3")).
 setTrimBlanks("TRIM_BOTH")

val deModel = de.fit(df)

val PMML = deModel.toPMML()
val statXML = deModel.statXML()

val predictions = deModel.transform(df)
predictions.show()

Descriptives Selection Strategy

When the number of field pairs is too large (for example, larger than the default of 1,000), SelectionStrategy limits the number of pairs for which bivariate statistics are computed. The strategy involves two steps:

  1. Limit the number of pairs based on the univariate statistics.
  2. Limit the number of pairs based on the core association bivariate statistics.

Note that a pair is always included if either of the following conditions holds:

  1. The pair consists of a predictor field and a target field.
  2. The pair of predictors or targets is enforced.

Smart Data Preprocessing

The Smart Data Preprocessing (SDP) engine is a new analytic component for data preparation. It consists of three separate modules: relevance analysis, relevance and redundancy analysis, and smart metadata (SMD) integration.

Given the data with regular fields, list fields, and map fields, relevance analysis evaluates the associations of input fields with targets, and selects a specified number of fields for subsequent analysis. Meanwhile, it expands list fields and map fields, and extracts the selected fields into regular column-based format.

Because relevance analysis is efficient, it is also used to reduce the large number of fields in wide data to a moderate level where traditional analytics can work.

SmartDataPreprocessingRelevanceAnalysis exports the following outputs:

  • JSON file, containing model information
  • new column-based data
  • the related data model

For more details about outputs, see the SmartDataPreprocessing Output Document.

Example code:

import com.ibm.spss.ml.datapreparation.SmartDataPreprocessingRelevanceAnalysis
val sdpRA = SmartDataPreprocessingRelevanceAnalysis().
  setInputFieldList(Array("holderage", "vehicleage", "claimamt")).
  setTargetFieldList(Array("vehiclegroup", "nclaims")).
  setMaxNumTarget(3).
  setInvalidPairsThresEnabled(true).
  setRMSSEThresEnabled(true).
  setAbsVariCoefThresEnabled(true).
  setInvalidPairsThreshold(0.7).
  setRMSSEThreshold(0.7).
  setAbsVariCoefThreshold(0.05).
  setMaxNumSelFields(2).
  setConCatRatio(0.3).
  setFilterSelFields(true)

val predictions = sdpRA.transform(data)
predictions.show()

Sparse Data Convertor

Sparse Data Convertor (SDC) converts regular data fields into list fields. You specify the fields to convert, and SDC merges them according to their measurement level. It generates at most three kinds of list fields: a continuous list field, a categorical list field, and a map field.

Example code:

import com.ibm.spss.ml.datapreparation.SparseDataConverter
val sdc = SparseDataConverter().
  setInputFieldList(Array("Age", "Sex", "Marriage", "BP", "Cholesterol", "Na", "K", "Drug"))
val predictions = sdc.transform(data)
predictions.show()

Binning

The Binning function can be used to derive one or more new binned fields, to obtain the bin definitions used to determine the bin values, or both.

Example code:

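import com.ibm.spss.ml.datapreparation.binning.Binning
// BinDefinitions, CutPoint, and BinRequest must also be in scope;
// their import path is not shown in this example.

// Define the bins through a single cut point at 50.0.
val binDefinition = new BinDefinitions(1, false, true, true, List(new CutPoint(50.0, false)))
// Bin "integer_field" into a new field named "integer_bin".
val binField = new BinRequest("integer_field", "integer_bin", Some(binDefinition), None)

val params: List[BinRequest] = List(binField)
val binning = Binning().setBinRequestsParam(params)

val outputDF = binning.transform(inputDF)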
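
To make the idea of bin definitions more concrete, the following is a minimal sketch in plain Scala of how a list of cut points can map a value to a bin index. It does not use the Binning API. The Cut class, its inclusive flag, and the 1-based bin numbering are assumptions made for illustration, and are not claimed to match the semantics of CutPoint or BinDefinitions.

```scala
// Illustrative only: plain Scala, not the SPSS Binning API.
object CutPointBinningSketch {
  // Hypothetical cut point. When `inclusive` is true, a value equal to the
  // cut point stays in the lower bin; otherwise it moves to the upper bin.
  final case class Cut(value: Double, inclusive: Boolean)

  // Returns a 1-based bin index: one plus the number of cut points
  // that the value has passed.
  def binIndex(x: Double, cuts: Seq[Cut]): Int = {
    val passed = cuts.count { c =>
      if (c.inclusive) x > c.value else x >= c.value
    }
    passed + 1
  }
}
```

Under these assumptions, a single cut point at 50.0 with inclusive set to false places 12.0 in bin 1, and both 50.0 and 73.5 in bin 2.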

Hex Binning

The HexBinning function can be used to calculate and assign hexagonal bins to two fields.

Example code:

import com.ibm.spss.ml.datapreparation.binning.HexBinning
import com.ibm.spss.ml.datapreparation.binning.HexBinningSetting

val params: List[HexBinningSetting] = List(
  new HexBinningSetting("field1_out", "field1", 5, -1.0, 25.0, 5.0),
  new HexBinningSetting("field2_out", "field2", 5, -1.0, 25.0, 5.0))

val hexBinning = HexBinning().setHexBinRequestsParam(params)
val outputDF = hexBinning.transform(inputDF)
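
For readers unfamiliar with hexagonal binning, the sketch below shows one common, generic way to assign an (x, y) point to a hexagonal cell: convert the point to fractional axial coordinates on a pointy-top grid and round to the nearest cell. This is not the algorithm used by HexBinning, whose internals and parameter meanings are not described here. The object name, the size parameter, and the choice of a pointy-top layout are all assumptions for illustration.

```scala
// Illustrative only: a generic hexagonal-grid assignment, not the HexBinning API.
object HexBinSketch {
  // Maps a point to axial hex coordinates (q, r) on a pointy-top grid.
  // `size` is the distance from a cell's center to any of its corners.
  def hexCell(x: Double, y: Double, size: Double): (Int, Int) = {
    require(size > 0.0, "size must be positive")
    val q = (math.sqrt(3.0) / 3.0 * x - y / 3.0) / size
    val r = (2.0 / 3.0 * y) / size
    cubeRound(q, r)
  }

  // Rounds fractional axial coordinates to the nearest hex cell by rounding
  // in cube coordinates (q + r + s = 0) and repairing the component with
  // the largest rounding error.
  private def cubeRound(q: Double, r: Double): (Int, Int) = {
    val s = -q - r
    var rq = math.round(q).toDouble
    var rr = math.round(r).toDouble
    val rs = math.round(s).toDouble
    val dq = math.abs(rq - q)
    val dr = math.abs(rr - r)
    val ds = math.abs(rs - s)
    if (dq > dr && dq > ds) rq = -rr - rs
    else if (dr > ds) rr = -rq - rs
    (rq.toInt, rr.toInt)
  }
}
```

All points that fall inside the same hexagon receive the same (q, r) pair, which can then serve as the bin identifier for the two input fields.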

Complex Sampling

The complexSampling function selects a pseudo-random sample of records from a data source.

The complexSampling function performs stratified sampling of incoming data using simple exact sampling and simple proportional sampling. The stratifying fields are specified as input, and the sampling count or sampling ratio must also be provided for each stratum to be sampled. Optionally, the record count for each stratum may be provided to improve performance.

Example code:

import com.ibm.spss.ml.datapreparation.sampling.ComplexSampling
import com.ibm.spss.ml.datapreparation.params.{RealStrata, Strata, Stratification, StringStrata}
val transformer = ComplexSampling().
  setRandomSeed(123444).
  setRepeatable(true).
  setStratification(Stratification(List("real_field"), Some(List(
    Strata(key = List(RealStrata(11.1)), samplingCount = Some(25)),
    Strata(key = List(RealStrata(2.4)), samplingCount = Some(40)),
    Strata(key = List(RealStrata(12.9)), samplingRatio = Some(0.5)))))).
  setFrequencyField("frequency_field")

val sampled = transformer.transform(unionDF)
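
The difference between exact and proportional sampling within strata can be sketched in plain Scala as follows. This is an illustration of the general technique only, not the ComplexSampling implementation. The Rule types, the sample function, and the decision to drop strata that have no rule are assumptions, and frequency fields are ignored.

```scala
import scala.util.Random

// Illustrative only: plain Scala collections, not the ComplexSampling API.
object StratifiedSamplingSketch {
  // Hypothetical per-stratum rule: an exact record count or a sampling ratio.
  sealed trait Rule
  final case class ExactCount(n: Int) extends Rule
  final case class Proportion(ratio: Double) extends Rule

  // Groups records by stratum key and samples each group by its rule.
  // Assumption: strata without a rule are not sampled at all.
  def sample[A, K](records: Seq[A], key: A => K, rules: Map[K, Rule], seed: Long): Seq[A] = {
    val rng = new Random(seed)
    records.groupBy(key).toSeq.flatMap { case (k, group) =>
      rules.get(k) match {
        // Exact sampling: shuffle, then keep exactly n records (or all, if fewer).
        case Some(ExactCount(n)) => rng.shuffle(group).take(n)
        // Proportional sampling: keep each record with probability `ratio`.
        case Some(Proportion(r)) => group.filter(_ => rng.nextDouble() < r)
        case None                => Seq.empty[A]
      }
    }
  }
}
```

In this sketch an exact count always yields the requested number of records when the stratum is large enough, whereas a ratio yields only approximately that proportion of the stratum.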

Count and Sample

The countAndSample function produces a pseudo-random sample having a size approximately equal to the 'samplingCount' input.

The sampling is accomplished by calling the SamplingComponent with a sampling ratio that is computed as 'samplingCount / totalRecords' where 'totalRecords' is the record count of the incoming data.

Example code:

import com.ibm.spss.ml.datapreparation.sampling.CountAndSample
val transformer = CountAndSample().setSamplingCount(20000).setRandomSeed(123)
val sampled = transformer.transform(unionDF)
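
The ratio computation described above can be sketched in plain Scala as follows. This is only an illustration of the arithmetic and of why the sample size is approximate: each record is kept independently with probability samplingCount / totalRecords. The object and function names are hypothetical, and capping the ratio at 1.0 is an assumption.

```scala
import scala.util.Random

// Illustrative only: plain Scala collections, not the CountAndSample API.
object CountAndSampleSketch {
  // samplingCount / totalRecords, capped at 1.0 (the cap is an assumption).
  def samplingRatio(samplingCount: Long, totalRecords: Long): Double =
    if (totalRecords <= 0) 0.0
    else math.min(1.0, samplingCount.toDouble / totalRecords)

  // Keeps each record independently with probability equal to the ratio,
  // so the resulting size is only approximately `samplingCount`.
  def sample[A](records: Seq[A], samplingCount: Long, seed: Long): Seq[A] = {
    val ratio = samplingRatio(samplingCount, records.size.toLong)
    val rng = new Random(seed)
    records.filter(_ => rng.nextDouble() < ratio)
  }
}
```

For example, requesting 20,000 records from a source of 100,000 gives a sampling ratio of 0.2.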

MR Sampling

The mrsampling function selects a pseudo-random sample of records from a data source at a specified sampling ratio. The size of the sample will be approximately the specified proportion of the total number of records, subject to an optional maximum. The set of records and their total number will vary with the random seed. Every record in the data source has the same probability of being selected.

Example code:

import com.ibm.spss.ml.datapreparation.sampling.MRSampling
val transformer = MRSampling().setSamplingRatio(0.5).setRandomSeed(123).setDiscard(true)
val sampled = transformer.transform(unionDF)

Sampling Model

The samplingModel function selects a pseudo-random percentage of the subsequence of input records defined by every Nth record for a given step size N. The total sample size may be optionally limited by a maximum.

When the step size is 1, the subsequence is the entire sequence of input records. When the sampling ratio is 1.0, selection becomes deterministic, not pseudo-random.

Note that with distributed data, the samplingModel function applies the selection criteria independently to each data split. The maximum sample size, if any, applies independently to each split and not to the entire data source; the subsequence is started afresh at the start of each split.

Example code:

import com.ibm.spss.ml.datapreparation.sampling.SamplingModel
val transformer = SamplingModel().setSamplingRatio(1.0).setSamplingStep(2).setRandomSeed(123).setDiscard(false)
val sampled = transformer.transform(unionDF)

Sequential Sampling

The sequentialSampling function is similar to the samplingModel function. It also selects a pseudo-random percentage of the subsequence of input records defined by every Nth record for a given step size N. The total sample size may be optionally limited by a maximum.

When the step size is 1, the subsequence is the entire sequence of input records. When the sampling ratio is 1.0, selection becomes deterministic, not pseudo-random. The main difference between sequentialSampling and samplingModel is that with distributed data, the sequentialSampling function applies the selection criteria to the entire data source, while the samplingModel function applies the selection criteria independently to each data split.

Example code:

import com.ibm.spss.ml.datapreparation.sampling.SequentialSampling
val transformer = SequentialSampling().setSamplingRatio(1.0).setSamplingStep(2).setRandomSeed(123).setDiscard(false)
val sampled = transformer.transform(unionDF)
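
The difference between applying the selection criteria per split and applying them to the entire data source can be sketched in plain Scala, with the splits modeled as a sequence of sequences. This is an illustration of the described behavior, not the SamplingModel or SequentialSampling implementation. All names are hypothetical, and the sketch assumes that the every-Nth subsequence starts with the first record.

```scala
import scala.util.Random

// Illustrative only: plain Scala collections, not the sampling API.
object StepSamplingSketch {
  // Keeps a pseudo-random fraction `ratio` of every `step`-th record,
  // limited by an optional maximum.
  // Assumption: the subsequence starts with the first record.
  def everyNth[A](records: Seq[A], step: Int, ratio: Double,
                  max: Option[Int], rng: Random): Seq[A] = {
    require(step >= 1, "step must be at least 1")
    val subsequence = records.grouped(step).map(_.head).toVector
    val picked = subsequence.filter(_ => rng.nextDouble() < ratio)
    max.fold(picked)(m => picked.take(m))
  }

  // samplingModel-style: step and maximum restart at the start of each split.
  def perSplit[A](splits: Seq[Seq[A]], step: Int, ratio: Double,
                  max: Option[Int], seed: Long): Seq[A] = {
    val rng = new Random(seed)
    splits.flatMap(split => everyNth(split, step, ratio, max, rng))
  }

  // sequentialSampling-style: step and maximum apply to the whole source.
  def wholeSource[A](splits: Seq[Seq[A]], step: Int, ratio: Double,
                     max: Option[Int], seed: Long): Seq[A] =
    everyNth(splits.flatten, step, ratio, max, new Random(seed))
}
```

With two splits holding records 1 to 3 and 4 to 6, a step size of 2, and a sampling ratio of 1.0, the per-split variant selects 1, 3, 4, 6, while the whole-source variant selects 1, 3, 5. A maximum of 1 yields 1, 4 per split but only 1 for the whole source.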