AutoAI implementation details
AutoAI automatically prepares data, applies algorithms, or estimators, and builds model pipelines best suited for your data and use case.
This topic describes some of the technical details that go into generating the pipelines:
- Preparing and pre-processing the data
- Algorithms used for classification models
- Algorithms used for regression models
- Metrics by model type
- Data transformations
- AutoAI FAQ
Preparing the data for training
During automatic data preparation, AutoAI analyzes the training data and prepares it for model selection and pipeline generation. Data preparation involves these steps:
Feature column classification
- Detects the types of feature columns and classifies them as categorical or numerical
- Detects various types of missing values (default, user-provided, outliers)
- Handles rows for which target values are missing, either by dropping them (default) or by imputing the target
- Drops unique value columns (except datetime and timestamps)
- Drops constant value columns
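The column screening steps above can be sketched in plain Python. This is a minimal illustration and not AutoAI's implementation: the function name `screen_columns`, the `"datetime"` kind, and the rule that float columns are exempt from the unique-value check are all assumptions made for the example.

```python
from datetime import date


def screen_columns(table):
    """Classify columns and flag constant or unique-valued ones for dropping.

    `table` maps a column name to a list of values; None marks a missing value.
    """
    kinds, dropped = {}, {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        distinct = set(present)
        if len(distinct) <= 1:
            dropped[name] = "constant"
            continue
        is_time = all(isinstance(v, date) for v in present)  # datetime subclasses date
        is_float = all(isinstance(v, float) for v in present)
        # Assumption: float columns are exempt from the unique-value rule,
        # because continuous measurements are often all distinct.
        if len(distinct) == len(values) and not (is_time or is_float):
            dropped[name] = "unique"
            continue
        if is_time:
            kinds[name] = "datetime"
        elif all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
            kinds[name] = "numerical"
        else:
            kinds[name] = "categorical"
    return kinds, dropped


table = {
    "customer_id": ["a1", "a2", "a3", "a4"],   # every value distinct
    "country": ["US", "US", "US", "US"],       # a single value
    "age": [31, 45, 31, 52],
    "plan": ["free", "pro", "free", "pro"],
    "signup": [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3), date(2020, 1, 4)],
    "height": [1.70, 1.80, 1.65, 1.90],
}
kinds, dropped = screen_columns(table)
```

In this toy table, `customer_id` is flagged as unique, `country` as constant, and `signup` is kept even though every date is distinct.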
Pre-processing (data imputation and encoding)
- Applies Sklearn imputation/encoding/scaling strategies (separately on each feature class)
- Handles labels in the test set that were not seen in the training set
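The pre-processing steps can be illustrated with a small, library-free sketch. The strategies shown here (median for numerical columns, most frequent value for categorical columns, and a reserved code of -1 for unseen labels) are assumptions chosen for the example; this topic does not state which strategies AutoAI selects. All helper names are hypothetical.

```python
from collections import Counter
from statistics import median


def fit_imputer(values, kind):
    """Learn a fill value from the training column (None marks a missing value)."""
    present = [v for v in values if v is not None]
    if kind == "numerical":
        return median(present)
    return Counter(present).most_common(1)[0][0]


def impute(values, fill):
    return [fill if v is None else v for v in values]


def fit_encoder(values):
    """Map each category seen in training to an integer code."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, codes, unknown_code=-1):
    """Encode values; categories not seen in training get `unknown_code`."""
    return [codes.get(v, unknown_code) for v in values]


train_plan = ["free", "pro", None, "free"]
test_plan = ["pro", "enterprise", None]      # "enterprise" never appears in training

fill_plan = fit_imputer(train_plan, "categorical")
codes = fit_encoder(impute(train_plan, fill_plan))
encoded_test = encode(impute(test_plan, fill_plan), codes)

train_age = [31, None, 45, 52]
fill_age = fit_imputer(train_age, "numerical")
```

The fill values and the category codes are learned from the training column only and then reused on the test column, which is why an unseen label needs an explicit fallback code.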
Algorithms used for classification models
These algorithms are the default algorithms used for automatic model selection for classification problems.
|Decision Tree Classifier||Maps observations about an item (represented in branches) to conclusions about the item’s target value (represented in leaves). Supports both binary and multiclass labels, as well as both continuous and categorical features.|
|Extra Trees Classifier||An averaging algorithm based on randomized decision trees.|
|Gradient Boosted Tree Classifier||Produces a classification prediction model in the form of an ensemble of decision trees. It supports only binary labels, and both continuous and categorical features.|
|LGBM Classifier||Gradient boosting framework that uses leaf-wise (horizontal) tree-based learning algorithm.|
|Logistic Regression||Analyzes a data set in which there are one or more independent variables that determine one of two outcomes. Only binary logistic regression is supported.|
|Random Forest Classifier||Constructs multiple decision trees and produces the label that is the mode of the labels predicted by the individual trees. It supports both binary and multiclass labels, as well as both continuous and categorical features.|
|XGBoost Classifier||An accurate procedure that can be used for classification problems. XGBoost models are used in a variety of areas including Web search ranking and ecology.|
Algorithms used for regression models
These algorithms are the default algorithms used for automatic model selection for regression problems.
|Decision Tree Regression||Maps observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). It supports both continuous and categorical features.|
|Extra Trees Regression||An averaging algorithm based on randomized decision trees.|
|Gradient Boosting Regression||Produces a regression prediction model in the form of an ensemble of decision trees. It supports both continuous and categorical features.|
|LGBM Regression||Gradient boosting framework that uses tree-based learning algorithms.|
|Linear Regression||Models the linear relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) x.|
|Random Forest Regression||Constructs multiple decision trees and produces the mean of the predictions of the individual trees. It supports both continuous and categorical features.|
|Ridge||Ridge regression is similar to Ordinary Least Squares but imposes a penalty on the size of coefficients.|
|XGBoost Regression||Gradient boosted regression trees (GBRT) are an accurate and effective off-the-shelf procedure that can be used for regression problems. Gradient Tree Boosting models are used in a variety of areas including Web search ranking and ecology.|
Metrics by model type
The following metrics are available for measuring the accuracy of pipelines during training and when scoring data.
Binary classification metrics
- Accuracy (default for ranking the pipelines)
- ROC AUC
- Average precision
- Negative log loss
Multi-class classification metrics
Metrics for multi-class models can be adjusted to account for imbalances in labels. For example:
- Metrics with the micro qualifier calculate metrics globally by counting the total true positives, false negatives, and false positives.
- Metrics with the macro qualifier calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- Metrics with the weighted qualifier calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters macro to account for label imbalance; it can result in an F-score that is not between precision and recall.
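The difference between the averaging qualifiers is easiest to see on a small imbalanced example. The following sketch computes precision three ways from scratch; the function name and the toy labels are made up for illustration.

```python
from collections import Counter


def precision_scores(y_true, y_pred):
    """Return micro-, macro-, and weighted-averaged precision."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)   # true positives + false positives per label
    support = Counter(y_true)     # number of true instances per label

    per_label = {
        label: (true_pos[label] / predicted[label]) if predicted[label] else 0.0
        for label in labels
    }
    # Micro: pool all true positives and false positives before dividing.
    micro = sum(true_pos.values()) / len(y_pred)
    # Macro: unweighted mean of the per-label scores.
    macro = sum(per_label.values()) / len(labels)
    # Weighted: per-label scores averaged by support.
    weighted = sum(per_label[label] * support[label] for label in labels) / len(y_true)
    return {"micro": micro, "macro": macro, "weighted": weighted}


y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
scores = precision_scores(y_true, y_pred)
```

Here the per-label precisions are 1.0 for `a`, 2/3 for `b`, and 0.5 for `c`. The macro average treats the rare label `c` the same as the common label `a`, so it comes out lowest; the weighted average leans toward `a` and comes out highest.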
These are the multi-class classification metrics:
- Accuracy (default for ranking the pipelines)
- F1 Micro
- F1 Macro
- F1 Weighted
- Precision Micro
- Precision Macro
- Precision Weighted
- Recall Micro
- Recall Macro
- Recall Weighted
Regression metrics
- Negative root mean squared error (default for ranking the pipeline)
- Negative mean absolute error
- Negative root mean squared log error
- Explained variance
- Negative mean squared error
- Negative mean squared log error
- Negative median absolute error
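Most of the regression metrics are error measures reported with a negative sign. This presumably follows the scikit-learn scoring convention, in which every score is defined so that a higher value is better. The sketch below computes four of the listed metrics by hand; the dictionary keys and the sample values are illustrative only.

```python
import math
from statistics import median


def regression_scores(y_true, y_pred):
    """Compute negated error metrics so that a higher score means a better fit."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {
        "neg_root_mean_squared_error": -math.sqrt(mse),
        "neg_mean_absolute_error": -mae,
        "neg_mean_squared_error": -mse,
        "neg_median_absolute_error": -median(abs(e) for e in errors),
    }


y_true = [3.0, 5.0, 2.0, 7.0]
close_fit = regression_scores(y_true, [2.0, 5.0, 4.0, 7.0])
poor_fit = regression_scores(y_true, [0.0, 9.0, 6.0, 1.0])
```

With this sign convention, ranking pipelines by the highest score selects `close_fit`, whose errors are smaller.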
Metrics used for feature importance
Feature importance is calculated from the average of nine measures applied to the training data:
- Linear Correlation (f_regression) metric
- Maximal Information Coefficient (MIC) metric
- Linear Regression (LR) metric
- L1 regularization metric (Lasso)
- Ridge metric
- RF metric
- Stability Selection
- Recursive Feature Elimination (RFE)
- Recursive Feature Elimination plus selection of best number of features
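As a rough illustration of averaging several measures into one importance value, the sketch below rescales each measure to the range 0 to 1 and then takes the mean per feature. The rescaling step is an assumption made so that measures on different scales can be combined; this topic does not say how AutoAI normalizes the individual measures. The scores are toy numbers, and only three of the nine measures are shown.

```python
def average_importance(scores_by_measure):
    """Average min-max normalized scores across measures for each feature."""
    features = list(next(iter(scores_by_measure.values())))
    totals = dict.fromkeys(features, 0.0)
    for scores in scores_by_measure.values():
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0   # avoid dividing by zero for flat measures
        for feature in features:
            totals[feature] += (scores[feature] - low) / span
    count = len(scores_by_measure)
    return {feature: totals[feature] / count for feature in features}


# Toy scores, invented for the example.
measures = {
    "f_regression": {"age": 12.0, "income": 30.0, "zip": 3.0},
    "ridge": {"age": 0.4, "income": 0.5, "zip": 0.1},
    "rf": {"age": 0.3, "income": 0.6, "zip": 0.1},
}
importance = average_importance(measures)
ranking = sorted(importance, key=importance.get, reverse=True)
```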
Data transformations
For feature engineering, AutoAI uses a novel approach that explores various feature construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy by using reinforcement learning. This results in an optimized sequence of transformations for the data that best matches the algorithms, or estimators, of the model selection step.
This table lists some of the transformations used and some well-known conditions under which they are useful. It is not an exhaustive list of scenarios where each transformation is useful, as that can be complex and hard to interpret. Nor do the listed scenarios explain how the transformations are selected: the transforms to apply are chosen in a trial-and-error, performance-oriented manner.
|Principal Component Analysis||pca||Reduces dimensions of data and realigns it across a more suitable coordinate system. Helps tackle the ‘curse of dimensionality’ in linearly correlated data. It eliminates redundancy and separates significant signals in data.|
|Standard Scaler||stdscaler||Scales data features to a standard range. This helps the efficacy and efficiency of certain learning algorithms as well as other transformations such as PCA.|
|Logarithm||log||Reduces right skewness in features and makes them more symmetric. Resulting symmetry in features helps algorithms understand the data better. Even scaling based on mean and variance is more meaningful on symmetrical data. Additionally, it can capture specific physical relationships between feature and target best described through a logarithm.|
|Cube Root||cbrt||Reduces right skewness in data like logarithm, but is weaker than log in its impact, which might be more suitable in some cases. It is also applicable to negative or zero values to which log doesn’t apply. Cube root can also change units such as reducing volume to length.|
|Square root||sqrt||Reduces mild right skewness in data. It is weaker than log or cube root. It works with zeroes and reduces spatial dimensions such as area to length.|
|Square||square||Reduces left skewness to a moderate extent to make such distributions more symmetric. It can also be helpful in capturing certain phenomena such as super-linear growth.|
|Product||product||A product of two features can expose a non-linear relationship to better predict the target value than the individual values alone. For example, item cost multiplied by number of items sold is a better indication of the size of a business than either of those alone.|
|Numerical XOR||nxor||This transform helps capture “exclusive disjunction” type of relationships between variables, similar to a bitwise XOR, but in a general numerical context.|
|Sum||sum||Sometimes the sum of two features is better correlated to the prediction target than the features alone. For instance, loans from different sources, when summed up, provide a better idea of a credit applicant’s total indebtedness.|
|Divide||divide||Division is a fundamental operation used to express quantities such as GDP over population (per capita GDP), which can represent a country’s average lifespan better than either GDP alone or population alone.|
|Maximum||max||Take the higher of two values.|
|Rounding||round||This transformation can be seen as perturbation or adding some noise to reduce overfitting that might have been a result of inaccurate observations.|
|Absolute Value||abs||Consider only the magnitude and not the sign of observation. Sometimes, the direction or sign of an observation doesn’t matter so much as the magnitude of it, such as physical displacement, while considering fuel or time spent in the actual movement.|
|Hyperbolic tangent||tanh||Non-linear activation function can improve prediction accuracy, similar to that of neural network activation functions.|
|Sine||sin||Can reorient data to discover periodic trends such as simple harmonic motions.|
|Cosine||cos||Can reorient data to discover periodic trends such as simple harmonic motions.|
|Tangent||tan||Trigonometric tangent transform is usually helpful in combination with other transforms.|
|Feature Agglomeration||featureagglomeration||Clustering different features into groups, based upon distance or affinity, provides ease of classification for the learning algorithm.|
|Sigmoid||sigmoid||Non-linear activation function can improve prediction accuracy, similar to that of neural network activation functions.|
|Isolation Forest||isoforestanomaly||Performs clustering by using an Isolation Forest to create a new feature containing an anomaly score for each sample.|
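The relative strength of the skew-reducing transformations in the table can be checked on a small right-skewed sample. This sketch uses only the standard library; the sample values are invented, and the `cbrt` helper is written by hand so that it also accepts negative inputs.

```python
import math


def cbrt(x):
    """Cube root that also works for negative values and zero."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# A right-skewed toy feature: many small values and a few very large ones.
values = [1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 9.0, 15.0, 40.0, 120.0]

skew = {
    "raw": skewness(values),
    "sqrt": skewness([math.sqrt(v) for v in values]),
    "cbrt": skewness([cbrt(v) for v in values]),
    "log": skewness([math.log(v) for v in values]),
}
```

On this sample the skewness drops in the order raw, `sqrt`, `cbrt`, `log`, which matches the table's description of square root as the mildest and logarithm as the strongest of the three. Note that `math.log` fails for zero or negative inputs, whereas `cbrt` does not.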
AutoAI FAQ
The following are commonly asked questions about creating an AutoAI experiment.
How many pipelines are created?
Two AutoAI parameters determine the number of pipelines:
max_num_daub_ensembles: Maximum number (top-K, as ranked by DAUB model selection) of algorithm, or estimator, types to use in pipeline composition, for example LGBMClassifierEstimator, XGBoostClassifierEstimator, or LogisticRegressionEstimator. The default is 1, where only the algorithm type ranked highest by model selection is used.
num_folds: Number of subsets of the full data set on which to train pipelines, in addition to the full data set. The default is 1, which trains on the full data set only.
For each fold and algorithm type, AutoAI creates 4 pipelines of increased refinement, corresponding to:
- Pipeline with default sklearn parameters for this algorithm type
- Pipeline with optimized algorithm using HPO
- Pipeline with optimized feature engineering
- Pipeline with optimized feature engineering and optimized algorithm using HPO
The total number of pipelines generated is:
TotalPipelines = max_num_daub_ensembles * 4, if num_folds = 1
TotalPipelines = (num_folds + 1) * max_num_daub_ensembles * 4, if num_folds > 1
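The pipeline count rule can be written as a short helper. The function name `total_pipelines` is hypothetical; the two parameter names and the arithmetic come from the formula above.

```python
def total_pipelines(max_num_daub_ensembles=1, num_folds=1):
    """Number of pipelines AutoAI generates, per the formula in this topic."""
    if max_num_daub_ensembles < 1 or num_folds < 1:
        raise ValueError("both parameters must be at least 1")
    # With one fold, only the full data set is used; with more folds,
    # each fold is trained in addition to the full data set.
    training_passes = 1 if num_folds == 1 else num_folds + 1
    return training_passes * max_num_daub_ensembles * 4


default_count = total_pipelines()
two_algorithms = total_pipelines(max_num_daub_ensembles=2)
two_algorithms_three_folds = total_pipelines(max_num_daub_ensembles=2, num_folds=3)
```

With the defaults, 4 pipelines are generated. Two algorithm types yield 8, and two algorithm types with three folds yield (3 + 1) * 2 * 4 = 32.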
What hyperparameter optimization is applied to my model?
AutoAI uses a model-based, derivative-free global search algorithm called RBFOpt, which is tailored for the costly machine learning model training and scoring evaluations required by hyperparameter optimization (HPO). In contrast to Bayesian optimization, which fits a Gaussian model to the unknown objective function, RBFOpt fits a radial basis function model to accelerate the discovery of hyperparameter configurations that maximize the objective function of the machine learning problem at hand. This acceleration is achieved by minimizing the number of expensive model training and scoring evaluations and by eliminating the need to compute partial derivatives.
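The general idea of a surrogate-based, derivative-free search can be shown with a deliberately simplified sketch. This is not the RBFOpt algorithm: it fits a Gaussian radial basis function interpolant to the points evaluated so far and greedily evaluates the candidate with the best surrogate value, whereas a real optimizer also balances exploration against exploitation. The kernel width, the grid of candidates, the minimum gap between points, and the one-dimensional `toy_score` objective are all assumptions made for the example.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def kernel(r, width=0.2):
    """Gaussian radial basis function of the distance r."""
    return math.exp(-((r / width) ** 2))


def fit_rbf(xs, ys):
    """Return a surrogate function that interpolates the evaluated points."""
    centers = list(xs)
    gram = [[kernel(abs(xi - xj)) for xj in centers] for xi in centers]
    for i in range(len(centers)):
        gram[i][i] += 1e-8   # small ridge term for numerical stability
    weights = solve(gram, ys)
    return lambda x: sum(w * kernel(abs(x - c)) for w, c in zip(weights, centers))


def surrogate_search(objective, low, high, n_evals=10, grid=101, min_gap=0.03):
    """Maximize `objective` on [low, high] using at most `n_evals` evaluations."""
    xs = [low, (low + high) / 2.0, high]
    ys = [objective(x) for x in xs]
    candidates = [low + (high - low) * i / (grid - 1) for i in range(grid)]
    while len(xs) < n_evals:
        model = fit_rbf(xs, ys)
        # Skip candidates too close to points that were already evaluated.
        fresh = [c for c in candidates if all(abs(c - x) >= min_gap for x in xs)]
        if not fresh:
            break
        next_x = max(fresh, key=model)   # cheap surrogate call, no derivatives
        xs.append(next_x)
        ys.append(objective(next_x))     # the expensive evaluation
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)


def toy_score(x):
    """Stand-in for an expensive train-and-score run of one hyperparameter value."""
    return -((x - 0.3) ** 2)


best_x, best_score, evaluations = surrogate_search(toy_score, 0.0, 1.0)
```

The expensive objective is called only `n_evals` times, while the surrogate is evaluated on every candidate at each step. That asymmetry is the point of a model-based search: the cheap model decides where to spend the next costly training and scoring run.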
For each fold and algorithm type, AutoAI creates two pipelines that use HPO to optimize for the algorithm type.
- The first optimizes the algorithm type on the preprocessed (imputed/encoded/scaled) data set (pipeline 2 above).
- The second optimizes the algorithm type on the result of optimized feature engineering applied to the preprocessed (imputed/encoded/scaled) data set (pipeline 4 above).
The parameter values of the algorithms of all pipelines generated by AutoAI are published in status messages.
For more details regarding the RBFOpt algorithm, see: