Data imputation in AutoAI experiments

Last updated: Jan 12, 2024

Data imputation replaces missing values in your data set with substituted values. If you enable imputation, you can specify how the missing values in your data are filled in.

Imputation by experiment type

Imputation methods depend on the type of experiment that you build.

For classification and regression experiments, you can configure categorical and numerical imputation methods.
For time series experiments, you can choose a set of imputation methods to apply to numerical columns. When the experiment runs, the best-performing method from the set is applied automatically. You can also specify a fixed value to use as the replacement.

Enabling imputation

To view and set imputation options:

Click Experiment settings when you configure your experiment.
Click the Data source option.
Click Enable data imputation. Note that if you do not explicitly enable data imputation but your data source has missing values, AutoAI warns you and applies default imputation methods. See imputation details.
Select options in the Imputation section.
Optionally, set a threshold for the percentage of missing values that is acceptable for a column of data. If the percentage of missing values in a column exceeds the specified threshold, the experiment fails. To resolve the failure, update the data source or adjust the threshold.
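
The threshold check described in the last step can be illustrated with a minimal sketch. This is not AutoAI code: the function names `missing_percentage` and `check_threshold`, the sample columns, and the use of `None` to mark a missing value are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def missing_percentage(column: Sequence[Optional[float]]) -> float:
    """Return the percentage of entries in a column that are missing (None)."""
    if not column:
        return 0.0
    missing = sum(1 for value in column if value is None)
    return 100.0 * missing / len(column)


def check_threshold(columns: Dict[str, Sequence[Optional[float]]],
                    threshold_pct: float) -> List[str]:
    """Return the names of columns whose missing percentage exceeds the threshold."""
    return [name for name, values in columns.items()
            if missing_percentage(values) > threshold_pct]


# Hypothetical data: None marks a missing value.
data = {
    "age": [34, None, 29, 41],            # 1 of 4 values missing
    "income": [None, None, None, 5200],   # 3 of 4 values missing
}

failing = check_threshold(data, threshold_pct=50.0)
print(failing)
```

In this sketch, any column name returned by `check_threshold` corresponds to a column that would cause the experiment to fail until the data source or the threshold is changed.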

Configuring imputation for classification and regression experiments

Choose one of these methods for imputing missing data in binary classification, multiclass classification, or regression experiments. You can choose one method for text-based (categorical) data and another for numerical data.

Method	Description
Most frequent	Replace each missing value with the value that appears most frequently in the column.
Median	Replace each missing value with the value in the middle of the sorted column.
Mean	Replace each missing value with the average value for the column.
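
To make the three methods in the table concrete, here is a minimal sketch that uses only the Python standard library. It illustrates the concept and is not how AutoAI implements imputation; the `impute` function, its method names, and the sample columns are hypothetical, and `None` stands in for a missing value.

```python
from collections import Counter
from statistics import mean, median


def impute(column, method):
    """Fill None entries using a statistic computed from the observed values."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if method == "most_frequent":
        # Works for categorical and numerical data alike.
        fill = Counter(observed).most_common(1)[0][0]
    elif method == "median":
        fill = median(observed)
    elif method == "mean":
        fill = mean(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if value is None else value for value in column]


colors = ["red", None, "blue", "red"]   # categorical column
sizes = [1.0, None, 3.0, 8.0]           # numerical column

colors_filled = impute(colors, "most_frequent")
sizes_median = impute(sizes, "median")
sizes_mean = impute(sizes, "mean")
print(colors_filled, sizes_median, sizes_mean)
```

The sketch applies a different method to the categorical column than to the numerical column, mirroring the option to configure the two separately.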

Configuring imputation for time series experiments

Choose some or all of these methods. When multiple methods are selected, the best-performing method is automatically applied for the experiment.

Note: Imputation is not supported for date or time values.

Method	Description
Cubic	Use cubic interpolation, by way of a pandas/SciPy method, to fill missing values.
Fill	Choose value as the type, then specify a numeric value to replace the missing values.
Flatten iterative	The data is first flattened, and then the scikit-learn iterative imputer is applied to estimate the missing values.
Linear	Use linear interpolation, by way of a pandas/SciPy method, to fill missing values.
Next	Replace each missing value with the next value.
Previous	Replace each missing value with the previous value.
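
The Fill, Previous, Next, and Linear methods from the table can be sketched in plain Python as follows. The table states that AutoAI relies on pandas/SciPy for interpolation, so this standard-library version only illustrates the idea. The function names are hypothetical, `None` marks a missing value, and leaving leading or trailing gaps unfilled is an assumption of this sketch that may differ from the product's behavior. Cubic and Flatten iterative are omitted because they need SciPy and scikit-learn.

```python
def fill_value(series, value):
    """Fill: replace every missing entry with one fixed numeric value."""
    return [value if v is None else v for v in series]


def fill_previous(series):
    """Previous: carry the last observed value forward into each gap."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)  # stays None until a first value is observed
    return filled


def fill_next(series):
    """Next: pull the next observed value backward into each gap."""
    return fill_previous(series[::-1])[::-1]


def fill_linear(series):
    """Linear: interpolate on a straight line between neighboring observations."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled  # gaps before the first or after the last observation stay None


series = [10.0, None, None, 16.0, None]

print(fill_value(series, 0.0))
print(fill_previous(series))
print(fill_next(series))
print(fill_linear(series))
```

Each function returns a new list and leaves the input series unchanged, which makes it straightforward to compare the results of several methods on the same data.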

Next steps

Data imputation implementation details for time series experiments
