Data imputation in AutoAI experiments

Data imputation is the process of replacing missing values in your data set with substituted values. If you enable imputation, you can specify how missing values are filled in.

Imputation by experiment type

Imputation methods depend on the type of experiment that you build.

  • For classification and regression you can configure categorical and numerical imputation methods.
  • For timeseries problems, you can choose from a set of imputation methods to apply to numerical columns. When the experiment runs, the best-performing method from the set is applied automatically. You can also provide a fixed value to use as the replacement.

Enabling imputation

To view and set imputation options:

  1. Click Experiment settings when you configure your experiment.
  2. Click the Data source option.
  3. Click Enable data imputation. Note that if you do not explicitly enable data imputation but your data source has missing values, AutoAI warns you and applies default imputation methods. See imputation details.
  4. Select options in the Imputation section.
  5. Optionally, set a threshold for the percentage of imputed values that is acceptable for a column of data. If the percentage of missing values in a column exceeds the specified threshold, the experiment fails. To resolve the failure, update the data source or adjust the threshold.
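
To make the threshold in step 5 concrete, the following minimal sketch computes the percentage of missing values per column and lists the columns that exceed a limit. It uses pandas and is only an illustration of the idea, not the check that AutoAI runs internally; the function name `columns_over_threshold` and the sample data are hypothetical.

```python
import pandas as pd


def columns_over_threshold(df: pd.DataFrame, threshold_pct: float) -> dict:
    """Return {column: missing %} for columns whose missing share exceeds threshold_pct.

    Hypothetical helper for illustration; not part of AutoAI.
    """
    # isna().mean() gives the fraction of missing cells per column.
    missing_pct = df.isna().mean() * 100
    over = missing_pct[missing_pct > threshold_pct]
    return {col: float(pct) for col, pct in over.items()}


df = pd.DataFrame({
    "age": [34, None, 29, None],             # 2 of 4 values missing
    "city": ["Oslo", "Rome", None, "Lima"],  # 1 of 4 values missing
})

offenders = columns_over_threshold(df, threshold_pct=30.0)
print(offenders)
```

Running a check like this on your data source before you start the experiment can show which columns would need cleanup or a higher threshold.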

Configuring imputation for classification and regression experiments

Choose one of these methods for imputing missing data in binary classification, multiclass classification, or regression experiments. You can set one method for categorical (text-based) data and a different method for numerical data.

Method          Description
Most frequent   Replace missing value with the value that appears most frequently in the column.
Median          Replace missing value with the value in the middle of the sorted column.
Mean            Replace missing value with the average value for the column.
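
The three methods can be illustrated with a short pandas sketch. This is an assumption-laden example of what each method computes, not AutoAI's own code; the column names and values are made up.

```python
import pandas as pd

# Hypothetical sample data: one categorical and one numerical column.
df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],
    "height": [1.0, 2.0, None, 9.0],
})

# Most frequent: fill with the column's mode (the first mode if there is a tie).
most_frequent = df["color"].fillna(df["color"].mode().iloc[0])

# Median: fill with the middle value of the sorted non-missing values.
median_filled = df["height"].fillna(df["height"].median())

# Mean: fill with the average of the non-missing values.
mean_filled = df["height"].fillna(df["height"].mean())

print(most_frequent.tolist())
print(median_filled.tolist())
print(mean_filled.tolist())
```

In this sample the median (2.0) and the mean (4.0) differ noticeably because of the single large value, which is one reason to prefer the median for skewed numerical columns.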

Configuring imputation for timeseries experiments

Choose some or all of these methods. When multiple methods are selected, the best-performing method is automatically applied for the experiment.
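
How AutoAI scores the selected methods is not described here. One common way to compare imputation methods, shown below purely as an assumption, is to hide some known values, impute them with each candidate, and keep the method with the lowest error. The function `pick_best_method`, the candidate set, and the use of mean absolute error are all hypothetical choices for this sketch.

```python
import pandas as pd


def pick_best_method(series, candidates, holdout_positions):
    """Hide known values, impute them with each candidate, and score by mean absolute error.

    Hypothetical helper; the scoring rule is an assumption, not AutoAI's documented behavior.
    """
    masked = series.copy()
    masked.iloc[holdout_positions] = float("nan")
    truth = series.iloc[holdout_positions]

    scores = {}
    for name, impute in candidates.items():
        restored = impute(masked).iloc[holdout_positions]
        scores[name] = float((restored - truth).abs().mean())

    best = min(scores, key=scores.get)
    return best, scores


candidates = {
    "linear": lambda s: s.interpolate(method="linear"),
    "previous": lambda s: s.ffill(),
    "next": lambda s: s.bfill(),
}

# A steadily increasing series, so linear interpolation should recover hidden values exactly.
series = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
best, scores = pick_best_method(series, candidates, holdout_positions=[2, 4])
print(best, scores)
```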

Note: Imputation is not supported for date/time values.

Method             Description
Cubic              Uses cubic interpolation (pandas/SciPy) to fill missing values.
Fill               Replaces missing values with a numeric value that you specify. Choose value as the fill type.
Flatten iterative  Flattens the data first, then applies the scikit-learn iterative imputer to estimate missing values.
Linear             Uses linear interpolation (pandas/SciPy) to fill missing values.
Next               Replaces each missing value with the next available value.
Previous           Replaces each missing value with the previous available value.
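
As a rough illustration of several of these methods, the sketch below applies the pandas equivalents to a small series. It is an approximation for understanding the behavior, not AutoAI's implementation; the sample values and the fill value of 0.0 are arbitrary.

```python
import pandas as pd

# Hypothetical numerical column with gaps.
series = pd.Series([1.0, None, 3.0, None, None, 6.0])

# Linear: draw a straight line between the neighboring known values.
linear = series.interpolate(method="linear")

# Previous: carry the last known value forward.
previous = series.ffill()

# Next: pull the next known value backward.
next_value = series.bfill()

# Fill: replace every missing value with a number that you choose.
filled = series.fillna(0.0)

print(linear.tolist())
print(previous.tolist())
print(next_value.tolist())
print(filled.tolist())
```

Cubic interpolation is available in pandas through `series.interpolate(method="cubic")`, which requires SciPy to be installed. The scikit-learn iterative imputer is `sklearn.impute.IterativeImputer`, which is still marked experimental and must be enabled with `from sklearn.experimental import enable_iterative_imputer` before it can be imported.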
