AutoAI implementation details
AutoAI automatically prepares data, applies algorithms, or estimators, and builds model pipelines that are best suited for your data and use case.
The following sections describe some of these technical details that go into generating the pipelines and provide a list of research papers that describe how AutoAI was designed and implemented.
 Preparing the data for training (preprocessing)
 Automated model selection
 Algorithms used for classification models
 Algorithms used for regression models
 Metrics by model type
 Data transformations
 Automated Feature Engineering
 Hyperparameter optimization
 AutoAI FAQ
 Learn more
Preparing the data for training (data preprocessing)
During automatic data preparation, or preprocessing, AutoAI analyzes the training data and prepares it for model selection and pipeline generation. Most data sets contain missing values, but machine learning algorithms typically expect none. One exception to this rule is described in xgboost section 3.4. AutoAI imputes missing values in your data set by using various techniques, making your data ready for machine learning. In addition, AutoAI detects and categorizes features based on their data types, such as categorical or numerical. It explores encoding and scaling strategies that are based on the feature categorization.
Data preparation involves these steps:
Feature column classification
 Detects the types of feature columns and classifies them as categorical or numerical
 Detects various types of missing values (default, user-provided, outliers)
Feature engineering
 Handles rows for which target values are missing (drop (default) or target imputation)
 Drops unique value columns (except datetime and timestamps)
 Drops constant value columns
Preprocessing (data imputation and encoding)
 Applies Sklearn imputation/encoding/scaling strategies (separately on each feature class). For example, the current default missing value imputation strategies that are used in the product are most frequent for categorical variables and mean for numerical variables.
 Handles labels of test set that were not seen in training set
 HPO feature: Optimizes imputation/encoding/scaling strategies given a data set and algorithm
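The default imputation strategies named above can be sketched in a few lines. This is a minimal standard-library illustration, not AutoAI's code: the function name `impute_column` and the sample columns are hypothetical, and the product itself is described as applying Sklearn strategies (in scikit-learn, `SimpleImputer` offers the `most_frequent` and `mean` strategies).

```python
from collections import Counter
from statistics import mean


def impute_column(values, kind):
    """Fill None entries: most frequent value for categorical, mean for numerical."""
    observed = [v for v in values if v is not None]
    if kind == "categorical":
        # Most frequent observed category.
        fill = Counter(observed).most_common(1)[0][0]
    else:
        # Arithmetic mean of the observed numbers.
        fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns with one missing value each.
colors = ["red", None, "blue", "red"]
prices = [10.0, None, 20.0, 30.0]

imputed_colors = impute_column(colors, "categorical")
imputed_prices = impute_column(prices, "numerical")
print(imputed_colors)  # ['red', 'red', 'blue', 'red']
print(imputed_prices)  # [10.0, 20.0, 20.0, 30.0]
```

Each feature class gets its own strategy, which mirrors the "separately on each feature class" behavior described in the list.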
Automatic model selection
The second stage in an AutoAI experiment training is automated model selection. The automated model selection algorithm uses the Data Allocation Using Upper Bounds (DAUB) strategy. This approach sequentially allocates small subsets of training data among a large set of algorithms. The goal is to select an algorithm that gives near-optimal accuracy when trained on all data, while also minimizing the cost of misallocated samples. The system currently supports all Scikit-learn algorithms, and the popular XGBoost and LightGBM algorithms. Training and evaluation of models on large data sets is costly. The approach of starting with small subsets and allocating incrementally larger ones to models that work well on the data set saves time, without sacrificing performance. Snap machine learning algorithms were added to the system to boost the performance even more.
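The allocation idea can be illustrated with a much-simplified sketch. It is not AutoAI's implementation: the learning curves are simulated, the names `simulated_accuracy` and `select_learner` are hypothetical, and the optimistic bound is assumed to be a linear projection of the most recent slope, which overestimates final accuracy when learning curves flatten out.

```python
import math


def simulated_accuracy(ceiling, rate):
    """Hypothetical learning curve: accuracy as a function of sample count."""
    return lambda n: ceiling * (1.0 - math.exp(-rate * n))


def select_learner(curves, total, start=100, growth=2.0):
    """Give growing data subsets to the learner with the best optimistic bound."""
    history = {}
    for name, curve in curves.items():
        # Two small initial allocations provide a first slope estimate.
        second = int(start * growth)
        history[name] = [(start, curve(start)), (second, curve(second))]

    while True:
        bounds = {}
        for name, points in history.items():
            (n_prev, a_prev), (n_last, a_last) = points[-2], points[-1]
            if n_last >= total:
                # Already trained on all data: the score is exact.
                bounds[name] = a_last
            else:
                # Project the latest slope out to the full data set.
                slope = max(0.0, (a_last - a_prev) / (n_last - n_prev))
                bounds[name] = min(1.0, a_last + slope * (total - n_last))

        best = max(bounds, key=bounds.get)
        n_last = history[best][-1][0]
        if n_last >= total:
            return best, history
        # Allocate a larger subset only to the most promising learner.
        n_next = min(total, int(n_last * growth))
        history[best].append((n_next, curves[best](n_next)))


curves = {
    "fast_but_weak": simulated_accuracy(ceiling=0.80, rate=0.01),
    "slow_but_strong": simulated_accuracy(ceiling=0.95, rate=0.001),
}
best, history = select_learner(curves, total=10_000)
print(best)
print({name: points[-1][0] for name, points in history.items()})
```

In this toy run, the learner whose curve plateaus early stops receiving data once its projected bound falls below its competitor's, so only one learner is ever trained on the full data set.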
Selecting algorithms for a model
Algorithms are selected to match the data and the nature of the model, but they can also balance accuracy and duration of runtime, if the model is configured for that option. For example, Snap ML algorithms are typically faster for training than Scikit-learn algorithms. They are often the preferred algorithms AutoAI selects automatically for cases where training is optimized for a shorter run time and accuracy. You can manually select them if training speed is a priority. For details, see Snap ML documentation. For a discussion of when SnapML algorithms are useful, see this blog post on using SnapML algorithms.
Algorithms used for classification models
These algorithms are the default algorithms that are used for model selection for classification problems.
Algorithm  Description 

Decision Tree Classifier  Maps observations about an item (represented in branches) to conclusions about the item's target value (represented in leaves). Supports both binary and multiclass labels, and both continuous and categorical features. 
Extra Trees Classifier  An averaging algorithm based on randomized decision trees. 
Gradient Boosted Tree Classifier  Produces a classification prediction model in the form of an ensemble of decision trees. It supports binary labels and both continuous and categorical features. 
LGBM Classifier  Gradient boosting framework that uses a leaf-wise (horizontal) tree-based learning algorithm. 
Logistic Regression  Analyzes a data set in which one or more independent variables determine one of two outcomes. Only binary logistic regression is supported. 
Random Forest Classifier  Constructs multiple decision trees and produces the label that is the mode of the labels predicted by the individual trees. It supports both binary and multiclass labels, and both continuous and categorical features. 
SnapDecisionTreeClassifier  This algorithm provides a decision tree classifier by using the IBM Snap ML library. 
SnapLogisticRegression  This algorithm provides regularized logistic regression by using the IBM Snap ML solver. 
SnapRandomForestClassifier  This algorithm provides a random forest classifier by using the IBM Snap ML library. 
SnapSVMClassifier  This algorithm provides a regularized support vector machine by using the IBM Snap ML solver. 
XGBoost Classifier  An accurate and effective procedure that can be used for classification problems. XGBoost models are used in various areas, including web search ranking and ecology. 
SnapBoostingMachineClassifier  Boosting machine for binary and multiclass classification tasks that mixes binary decision trees with linear models with random Fourier features. 
Algorithms used for regression models
These algorithms are the default algorithms that are used for automatic model selection for regression problems.
Algorithm  Description 

Decision Tree Regression  Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both continuous and categorical features. 
Extra Trees Regression  An averaging algorithm based on randomized decision trees. 
Gradient Boosting Regression  Produces a regression prediction model in the form of an ensemble of decision trees. It supports both continuous and categorical features. 
LGBM Regression  Gradient boosting framework that uses tree-based learning algorithms. 
Linear Regression  Models the linear relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) x. 
Random Forest Regression  Constructs multiple decision trees and produces the mean of the predictions of the individual trees. It supports both continuous and categorical features. 
Ridge  Ridge regression is similar to Ordinary Least Squares but imposes a penalty on the size of coefficients. 
SnapBoostingMachineRegressor  This algorithm provides a boosting machine by using the IBM Snap ML library that can be used to construct an ensemble of decision trees. 
SnapDecisionTreeRegressor  This algorithm provides a decision tree by using the IBM Snap ML library. 
SnapRandomForestRegressor  This algorithm provides a random forest by using the IBM Snap ML library. 
XGBoost Regression  GBRT is an accurate and effective off-the-shelf procedure that can be used for regression problems. Gradient Tree Boosting models are used in various areas, including web search ranking and ecology. 
Metrics by model type
The following metrics are available for measuring the accuracy of pipelines during training and for scoring data.
Binary classification metrics
 Accuracy (default for ranking the pipelines)
 ROC AUC
 Average precision
 F1
 Negative log loss
 Precision
 Recall
Multiclass classification metrics
Metrics for multiclass models generate scores for how well a pipeline performs against the specified measurement. For example, an F1 score is the harmonic mean of precision (of the predictions made, how many positive predictions were correct) and recall (of all possible positive predictions, how many were predicted correctly).
You can further refine a score by qualifying it to calculate the given metric globally (micro), per label (macro), or to weight an imbalanced data set to favor classes with more representation.
 Metrics with the micro qualifier calculate metrics globally by counting the total number of true positives, false negatives and false positives.
 Metrics with the macro qualifier calculate metrics for each label, and find their unweighted mean. All labels are weighted equally.
 Metrics with the weighted qualifier calculate metrics for each label, and find their average weighted by the contribution of each class. For example, in a data set that includes categories for apples, peaches, and plums, if there are many more instances of apples, the weighted metric gives greater importance to correctly predicting apples. This alters macro to account for label imbalance. Use a weighted metric such as F1-weighted for an imbalanced data set.
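The three qualifiers can be compared on a small example. The following is a minimal standard-library sketch with hypothetical labels and predictions; the function name `f1_scores` is illustrative and is not an AutoAI API.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return micro, macro, and weighted F1 for a multiclass prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(y_true)

    # micro: pool all counts, then compute one global score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # macro: unweighted mean of the per-label scores.
    macro = sum(per_label.values()) / len(labels)
    # weighted: per-label scores averaged by the number of true instances.
    weighted = sum(per_label[l] * support[l] for l in labels) / len(y_true)
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Imbalanced toy data: many apples, few peaches and plums.
y_true = ["apple"] * 6 + ["peach"] * 2 + ["plum"] * 2
y_pred = ["apple"] * 6 + ["peach", "apple"] + ["plum", "peach"]

scores = f1_scores(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
# {'micro': 0.8, 'macro': 0.6966, 'weighted': 0.7872}
```

Because apples are both the most frequent and the best-predicted class here, the weighted score sits well above the macro score, which treats all three labels equally.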
These are the multiclass classification metrics:
 Accuracy (default for ranking the pipelines)
 F1
 F1 Micro
 F1 Macro
 F1 Weighted
 Precision
 Precision Micro
 Precision Macro
 Precision Weighted
 Recall
 Recall Micro
 Recall Macro
 Recall Weighted
Regression metrics
 Negative root mean squared error (default for ranking the pipeline)
 Negative mean absolute error
 Negative root mean squared log error
 Explained variance
 Negative mean squared error
 Negative mean squared log error
 Negative median absolute error
 R2
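The "negative" prefix on the error metrics is assumed here to follow the scikit-learn scoring convention, in which a higher score always means a better pipeline, so error measures are negated before ranking. A minimal sketch with hypothetical data:

```python
import math


def neg_root_mean_squared_error(y_true, y_pred):
    """Negated RMSE, so that larger (closer to zero) means better."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)


actual = [3.0, 5.0, 7.0]
close_fit = neg_root_mean_squared_error(actual, [3.0, 5.0, 8.0])
poor_fit = neg_root_mean_squared_error(actual, [0.0, 5.0, 10.0])

# The better pipeline has the higher (less negative) score.
print(close_fit > poor_fit)  # True
```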
Automated Feature Engineering
The third stage in the AutoAI process is automated feature engineering. The automated feature engineering algorithm is based on Cognito, described in the research papers, Cognito: Automated Feature Engineering for Supervised Learning and Feature Engineering for Predictive Modeling using Reinforcement Learning. The system explores various feature construction choices in a hierarchical and non-exhaustive manner, while progressively maximizing the accuracy of the model through an exploration-exploitation strategy. This method is inspired by the "trial and error" strategy for feature engineering, but conducted by an autonomous agent in place of a human.
Metrics used for feature importance
For tree-based classification and regression algorithms such as Decision Tree, Extra Trees, Random Forest, XGBoost, Gradient Boosted, and LGBM, feature importances are their inherent feature importance scores based on the reduction in the criterion that is used to select split points, and calculated when these algorithms are trained on the training data.
For non-tree algorithms such as Logistic Regression, Linear Regression, SnapSVM, and Ridge, the feature importances are the feature importances of a Random Forest algorithm that is trained on the same training data as the non-tree algorithm.
For any algorithm, all feature importances are in the range between zero and one and have been normalized as the ratio with respect to the maximum feature importance.
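The normalization described above, dividing each score by the maximum score, can be sketched as follows. The feature names and raw values are hypothetical, and the all-zero guard is an assumption added for safety.

```python
def normalize_importances(raw):
    """Scale importances to [0, 1] as a ratio of the largest importance."""
    top = max(raw.values())
    if top == 0:
        # Assumption: if no feature has any importance, report zeros.
        return {name: 0.0 for name in raw}
    return {name: value / top for name, value in raw.items()}


raw = {"age": 0.10, "income": 0.40, "tenure": 0.20}
normalized = normalize_importances(raw)
print(normalized)  # {'age': 0.25, 'income': 1.0, 'tenure': 0.5}
```

With this scheme the most important feature always scores 1, so values are comparable within a pipeline but are not shares that sum to 1.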
Data transformations
For feature engineering, AutoAI uses a novel approach that explores various feature construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy by using reinforcement learning. This results in an optimized sequence of transformations for the data that best match the algorithm, or algorithms, of the model selection step. This table lists some of the transformations that are used and some well-known conditions under which they are useful. This is not an exhaustive list of scenarios where the transformation is useful, as that can be complex and hard to interpret. Finally, the listed scenarios are not an explanation of how the transformations are selected. The selection of which transforms to apply is done in a trial and error, performance-oriented manner.
Name  Code  Function 

Principal Component Analysis  pca  Reduces dimensions of data and realigns it across a more suitable coordinate system. Helps tackle the 'curse of dimensionality' in linearly correlated data. It eliminates redundancy and separates significant signals in data. 
Standard Scaler  stdscaler  Scales data features to a standard range. This helps the efficacy and efficiency of certain learning algorithms and other transformations such as PCA. 
Logarithm  log  Reduces right skewness in features and makes them more symmetric. Resulting symmetry in features helps algorithms understand the data better. Even scaling based on mean and variance is more meaningful on symmetrical data. Additionally, it can capture specific physical relationships between feature and target that are best described through a logarithm. 
Cube Root  cbrt  Reduces right skewness in data like logarithm, but is weaker than log in its impact, which might be more suitable in some cases. It is also applicable to negative or zero values to which log doesn't apply. Cube root can also change units such as reducing volume to length. 
Square root  sqrt  Reduces mild right skewness in data. It is weaker than log or cube root. It works with zeros and reduces spatial dimensions such as area to length. 
Square  square  Reduces left skewness to a moderate extent to make such distributions more symmetric. It can also be helpful in capturing certain phenomena such as superlinear growth. 
Product  product  A product of two features can expose a nonlinear relationship to better predict the target value than the individual values alone. For example, item cost multiplied by the number of items that are sold is a better indication of the size of a business than either of those alone. 
Numerical XOR  nxor  This transform helps capture "exclusive disjunction" type of relationships between variables, similar to a bitwise XOR, but in a general numerical context. 
Sum  sum  Sometimes the sum of two features is better correlated to the prediction target than the features alone. For instance, loans from different sources, when summed up, provide a better idea of a credit applicant's total indebtedness. 
Divide  divide  Division is a fundamental operation that is used to express quantities such as GDP over population (per capita GDP), representing a country's average lifespan better than either GDP alone or population alone. 
Maximum  max  Take the higher of two values. 
Rounding  round  This transformation can be seen as perturbation or adding some noise to reduce overfitting that might be a result of inaccurate observations. 
Absolute Value  abs  Consider only the magnitude and not the sign of observation. Sometimes, the direction or sign of an observation doesn't matter so much as the magnitude of it, such as physical displacement, while considering fuel or time spent in the actual movement. 
Hyperbolic tangent  tanh  Nonlinear activation function can improve prediction accuracy, similar to that of neural network activation functions. 
Sine  sin  Can reorient data to discover periodic trends such as simple harmonic motions. 
Cosine  cos  Can reorient data to discover periodic trends such as simple harmonic motions. 
Tangent  tan  Trigonometric tangent transform is usually helpful in combination with other transforms. 
Feature Agglomeration  feature agglomeration  Clustering different features into groups, based on distance or affinity, provides ease of classification for the learning algorithm. 
Sigmoid  sigmoid  Nonlinear activation function can improve prediction accuracy, similar to that of neural network activation functions. 
Isolation Forest  isoforestanomaly  Performs clustering by using an Isolation Forest to create a new feature containing an anomaly score for each sample. 
Word to vector  word2vec  This algorithm, which is used for text analysis, is applied before all other transformations. It takes a corpus of text as input and outputs a set of vectors. By turning text into a numerical representation, it can detect and compare similar words. When trained with enough data, word2vec can make accurate predictions about a word's meaning or relationship to other words. The predictions can be used to analyze text and predict meaning in sentiment analysis applications. 
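The relative strength of the skew-reducing transforms in the table can be checked on a toy feature. This is a standard-library sketch with hypothetical data. The dictionary keys reuse the codes from the table, but the implementations are illustrative and are not AutoAI's.

```python
import math
from statistics import mean, pstdev


def cbrt(x):
    """Cube root that also accepts negative values and zero."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


# Codes taken from the table; implementations are illustrative only.
TRANSFORMS = {"log": math.log, "cbrt": cbrt, "sqrt": math.sqrt}


def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])


# Right-skewed toy feature: one large value dominates.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

skew = {"raw": skewness(feature)}
for code, fn in TRANSFORMS.items():
    skew[code] = skewness([fn(v) for v in feature])

# Expect log < cbrt < sqrt < raw: log reduces right skewness the most.
print({code: round(value, 3) for code, value in skew.items()})
print(round(cbrt(-8.0), 6))  # cube root is defined for negative inputs
```

On this sample the ordering matches the table's description: square root is the mildest correction, cube root is stronger, and logarithm is the strongest, while only cube root accepts negative inputs.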
Hyperparameter Optimization
The final stage in AutoAI is hyperparameter optimization. The AutoAI approach optimizes the parameters of the best performing pipelines from the previous phases. It is done by exploring the parameter ranges of these pipelines by using a black box hyperparameter optimizer called RBFOpt. RBFOpt is described in the research paper RBFOpt: an open-source library for black-box optimization with costly function evaluations. RBFOpt is suited for AutoAI experiments because it is built for optimizations with costly evaluations, as in the case of training and scoring an algorithm. RBFOpt's approach builds and iteratively refines a surrogate model of the unknown objective function to converge quickly despite the long evaluation times of each iteration.
AutoAI FAQs
The following are commonly asked questions about creating an AutoAI experiment.
How many pipelines are created?
Two AutoAI parameters determine the number of pipelines:

max_num_daub_ensembles: Maximum number (top-K ranked by DAUB model selection) of the selected algorithm, or estimator types, for example LGBMClassifierEstimator, XGBoostClassifierEstimator, or LogisticRegressionEstimator to use in pipeline composition. The default is 1, where only the algorithm type that is ranked highest by model selection is used.

num_folds: Number of subsets of the full data set to train pipelines in addition to the full data set. The default is 1 for training the full data set.
For each fold and algorithm type, AutoAI creates four pipelines of increased refinement, corresponding to:
 Pipeline with default sklearn parameters for this algorithm type
 Pipeline with optimized algorithm by using HPO
 Pipeline with optimized feature engineering
 Pipeline with optimized feature engineering and optimized algorithm by using HPO
The total number of pipelines that are generated is:
TotalPipelines = max_num_daub_ensembles * 4, if num_folds = 1
TotalPipelines = (num_folds + 1) * max_num_daub_ensembles * 4, if num_folds > 1
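The two cases of the formula can be written as one small function. The function name is hypothetical; the parameter names and defaults are the ones given above.

```python
def total_pipelines(max_num_daub_ensembles=1, num_folds=1):
    """Number of pipelines generated, following the formula above."""
    # Four pipelines of increasing refinement per algorithm type.
    per_data_set = max_num_daub_ensembles * 4
    if num_folds == 1:
        return per_data_set
    # Each fold plus the full data set gets its own set of pipelines.
    return (num_folds + 1) * per_data_set


print(total_pipelines())                                       # 4
print(total_pipelines(max_num_daub_ensembles=2))               # 8
print(total_pipelines(max_num_daub_ensembles=2, num_folds=3))  # 32
```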
What hyperparameter optimization is applied to my model?
AutoAI uses a model-based, derivative-free global search algorithm, called RBFOpt, which is tailored for the costly machine learning model training and scoring evaluations that are required by hyperparameter optimization (HPO). In contrast to Bayesian optimization, which fits a Gaussian model to the unknown objective function, RBFOpt fits a radial basis function model to accelerate the discovery of hyperparameter configurations that maximize the objective function of the machine learning problem at hand. This acceleration is achieved by minimizing the number of expensive model training and scoring evaluations and by eliminating the need to compute partial derivatives.
For each fold and algorithm type, AutoAI creates two pipelines that use HPO to optimize for the algorithm type.
 The first is based on optimizing this algorithm type based on the preprocessed (imputed/encoded/scaled) data set (pipeline 2 above).
 The second is based on optimizing the algorithm type based on optimized feature engineering of the preprocessed (imputed/encoded/scaled) data set (pipeline 4 above).
The parameter values of the algorithms of all pipelines that are generated by AutoAI are published in status messages.
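For more details regarding the RBFOpt algorithm, see the research paper RBFOpt: an open-source library for black-box optimization with costly function evaluations.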
How is feature significance calculated?
When you configure a classification or regression experiment, you can optionally specify how to handle features with no impact on the model. The choices are to always remove the features, remove them when doing so improves the model quality, or not remove them. Feature significance is calculated as follows:
 Feature importance is calculated on a data sample.
 Some estimators do not have builtin capabilities to return feature importances. In those cases, an estimator such as RandomForest is used to measure impact.
 The number of features matters: if the importance value for a feature is 0.0000000001 but there are a large number of low-importance features (for example, more than 200), then leaving or removing them can have some impact on the experiment results.
In auto mode, the following steps are used to validate that removal of low-importance features does not affect the experiment results:
 If removing all features with a calculated importance of 0 has some impact on model accuracy, the Principal Component Analysis algorithm is applied to those features, and the top K components that explain 90% of the variance across those insignificant features are selected.
 Next, the transformed components are used as new features in place of original ones and the model is evaluated again.
 If there is still a drop in accuracy, all original features are added back to the experiment.
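The "top K components explaining 90% of variance" step can be sketched as a cumulative sum over explained-variance ratios. The ratios below are made-up values, and the function name is hypothetical. In practice the ratios would come from a fitted PCA.

```python
def components_for_variance(explained_variance_ratio, threshold=0.90):
    """Smallest K whose leading components explain at least `threshold`."""
    cumulative = 0.0
    ranked = sorted(explained_variance_ratio, reverse=True)
    for k, ratio in enumerate(ranked, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    # Threshold never reached: keep every component.
    return len(ranked)


# Hypothetical ratios for five components of the zero-importance features.
ratios = [0.55, 0.25, 0.12, 0.05, 0.03]
top_k = components_for_variance(ratios)
print(top_k)  # 3
```

The selected components then replace the original low-importance features for the re-evaluation described in the next step.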
Research references
This list includes some of the foundational research articles that further detail how AutoAI was designed and implemented to promote trust and transparency in the automated model-building process.
Next steps
Data imputation in AutoAI experiments