Data imputation is the process of replacing missing values in your data set with substituted values. If you enable imputation, you can specify how missing values in your data are filled in.
Imputation by experiment type
Imputation methods depend on the type of experiment that you build.
For classification and regression you can configure categorical and numerical imputation methods.
For timeseries experiments, you can choose from a set of imputation methods to apply to numerical columns. When the experiment runs, the best-performing method from the set is applied automatically. You can also specify a fixed value as the replacement value.
Enabling imputation
To view and set imputation options:
1. Click Experiment settings when you configure your experiment.
2. Click the Data source option.
3. Click Enable data imputation. Note that if you do not explicitly enable data imputation but your data source has missing values, AutoAI warns you and applies default imputation methods. See imputation details.
4. Select options in the Imputation section.
5. Optionally set a threshold for the percentage of imputation acceptable for a column of data. If the percentage of missing values exceeds the specified threshold, the experiment fails. To resolve, update the data source or adjust the threshold.
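The threshold check can be pictured with a small sketch. This is not AutoAI code: the function names, the table layout (a dictionary of column lists), and the rule that None or NaN counts as missing are all assumptions made for illustration.

```python
import math


def is_missing(value):
    """Assumption: a value is missing if it is None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_percentage(column):
    """Return the percentage of entries in a column that are missing."""
    if not column:
        return 0.0
    missing = sum(1 for value in column if is_missing(value))
    return 100.0 * missing / len(column)


def columns_over_threshold(table, threshold_pct):
    """Return {column name: missing %} for columns that exceed the threshold."""
    over = {}
    for name, column in table.items():
        pct = missing_percentage(column)
        if pct > threshold_pct:
            over[name] = pct
    return over


# Hypothetical data: "age" is 50% missing, "city" is 20% missing.
table = {
    "age": [31, None, 45, None],
    "city": ["Oslo", "Rome", None, "Lima", "Kyiv"],
}
flagged = columns_over_threshold(table, threshold_pct=25)
```

Running a check like this on your data source before you start the experiment can show which columns would push it over the threshold you plan to set.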
Configuring imputation for classification and regression experiments
Choose one of these methods for imputing missing data in binary classification, multiclass classification, or regression experiments. You can use one method for text-based (categorical) data and another for numerical data.
| Method | Description |
|---|---|
| Most frequent | Replace missing value with the value that appears most frequently in the column. |
| Median | Replace missing value with the value in the middle of the sorted column. |
| Mean | Replace missing value with the average value for the column. |
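To make the three methods concrete, here is a minimal sketch that uses only the Python standard library. It illustrates the statistics involved and is not the implementation that AutoAI uses; the `impute` function, the method names passed to it, and the use of None to mark a missing value are assumptions.

```python
import statistics


def impute(column, method):
    """Fill None entries using a statistic computed from the observed values."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if method == "most_frequent":
        fill = statistics.mode(observed)      # works for text and numbers
    elif method == "median":
        fill = statistics.median(observed)    # numerical columns only
    elif method == "mean":
        fill = statistics.mean(observed)      # numerical columns only
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if value is None else value for value in column]


# One method for a categorical column, another for a numerical column.
colors = impute(["red", None, "blue", "red"], "most_frequent")
prices = impute([1, None, 3, 10], "median")
```

The median is less sensitive to outliers than the mean: in the `prices` column above, the value 10 pulls the mean well above the middle value of the observed data.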
Configuring imputation for timeseries experiments
Choose some or all of these methods. When multiple methods are selected, the best-performing method is automatically applied for the experiment.
Note: Imputation is not supported for date or time values.
| Method | Description |
|---|---|
| Cubic | Fills missing values by cubic interpolation, using the pandas/SciPy method. |
| Fill | Choose value as the type, then specify the numeric value that replaces the missing values. |
| Flatten iterative | The data is first flattened, and then the scikit-learn iterative imputer is applied to estimate the missing values. |
| Linear | Fills missing values by linear interpolation, using the pandas/SciPy method. |
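The sketch below illustrates two of these methods (Fill and Linear) and one plausible way to compare candidates. It is a standard-library stand-in, not the pandas/SciPy routines and not AutoAI's selection logic: the function names are hypothetical, None marks a missing value, and scoring by hiding known values and measuring the mean absolute error is an assumption about how "best-performing" could be judged. Cubic and Flatten iterative are not covered here.

```python
def fill_value(series, value):
    """Fill: replace each missing entry with a fixed numeric value."""
    return [value if v is None else v for v in series]


def linear_interpolate(series):
    """Linear: fill interior gaps on a straight line between the nearest
    observed neighbours. Leading and trailing gaps stay None in this sketch."""
    result = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = series[left] + step * (i - left)
    return result


def pick_best(series, candidates, holdout):
    """Hide the observed values at the holdout indices, impute with each
    candidate, and return the name with the lowest mean absolute error."""
    masked = [None if i in holdout else v for i, v in enumerate(series)]
    scores = {}
    for name, impute in candidates.items():
        filled = impute(masked)
        errors = [abs(filled[i] - series[i]) for i in holdout
                  if filled[i] is not None]
        scores[name] = sum(errors) / len(errors) if errors else float("inf")
    return min(scores, key=scores.get), scores


# Hypothetical series with a steady trend; index 2 is held out for scoring.
series = [10.0, 12.0, 14.0, 16.0, 18.0]
candidates = {
    "linear": linear_interpolate,
    "fill_zero": lambda s: fill_value(s, 0.0),
}
best, scores = pick_best(series, candidates, holdout={2})
```

On trending data like this series, interpolation recovers the hidden point exactly while a constant fill does not, which is why offering several methods and letting the experiment choose can pay off.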