AutoAI implementation details
AutoAI automatically prepares data, applies algorithms, or estimators, and builds model pipelines best suited for your data and use case.
This topic describes some of the technical details that go into generating the pipelines:
 Algorithms used for classification models
 Algorithms used for regression models
 Metrics by model type
 Data transformations
 AutoAI FAQ
Algorithms used for classification models
Algorithm  Description 

AdaBoost Classifier  Fits a sequence of weak learners (only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data, then combines the results to produce the final prediction.
Bernoulli Naïve Bayes Classifier  Version of the Naïve Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions.
Calibrated Classifier with Cross-Validation  For each cross-validation split, fits the model parameters on the training samples and calibrates on the test samples, then averages the predicted probabilities across the folds.
Decision Tree Classifier  Maps observations about an item (represented in branches) to conclusions about the item’s target value (represented in leaves). Supports both binary and multiclass labels, as well as both continuous and categorical features. 
Extra Trees Classifier  An averaging algorithm based on randomized decision trees. 
Gaussian Naïve Bayes Classifier  Classifies features based on Bayes’ theorem, which assumes that the presence of a feature in a class is unrelated to the presence of any other feature. 
Gaussian Process Classifier  Performs a probabilistic classification, where test predictions take the form of class probabilities. 
Gradient Boosted Tree Classifier  Produces a classification prediction model in the form of an ensemble of decision trees. It supports only binary labels, and both continuous and categorical features.
Nearest Neighbor Analysis (KNN) Classifier  Determines class membership based on the membership of surrounding data points. 
Label Propagation  Infers labels by constructing a similarity graph over all items in the input dataset.
Label Spreading  Infers labels by constructing a similarity graph over all items in the input dataset, minimizing the loss function.
LGBM Classifier  Gradient boosting framework that uses a leaf-wise (horizontal) tree-based learning algorithm.
Linear Discriminant Analysis  Makes predictions by estimating the probability that a new set of inputs belongs to each class. Best for binary classification. 
Linear Support Vector Classifier  An implementation of Support Vector Classification for the case of a linear kernel. 
Logistic Regression with Cross-Validation  Divides the dataset into folds and, in each iteration, holds out one fold as the test set and trains the model on the remaining folds.
Logistic Regression  Analyzes a data set in which there are one or more independent variables that determine one of two outcomes. Only binary logistic regression is supported.
MLP Classifier  This algorithm optimizes the log-loss function using L-BFGS or stochastic gradient descent.
Multinomial Naïve Bayes Classifier  Suitable for classification with discrete features (for example, word counts for text classification). 
Nearest Centroid  Each class is represented by its centroid, with test samples classified to the class with the nearest centroid. 
Nu Support Vector Classifier  Similar to Support Vector Classifier but uses a parameter to control the number of support vectors. 
Passive Aggressive Classifier  Classifier for large-scale learning. Like the Perceptron, it does not require a learning rate. Unlike the Perceptron, it includes a regularization parameter.
Perceptron  Classification algorithm suitable for large-scale learning. Does not require a learning rate, is not regularized, and updates only on mistakes.
Quadratic Discriminant Analysis  A classifier with a closed-form solution that can be easily computed, is multiclass, and has no hyperparameters to tune.
Radius Neighbors Classifier  Implements learning based on the number of neighbors within a fixed radius of each training point, where the radius is a floating-point value specified by the user.
Random Forest Classifier  Constructs multiple decision trees to produce the label that is the mode of the labels predicted by the individual trees. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Ridge Classifier with Cross-Validation  Linear classifier tuned with generalized cross-validation.
Ridge Classifier  Linear classifier well suited for high-dimensional problems such as text classification.
SGD Classifier  Stochastic gradient descent is a simple yet efficient approach to fit linear models. Particularly useful when the number of samples and features is very large. 
Support Vector Classifier  Tries to find a combination of samples to build a plane maximizing the margin between the two classes. 
XGBoost Classifier  An accurate and effective procedure that can be used for classification problems. XGBoost models are used in a variety of areas, including web search ranking and ecology.
Algorithms used for regression models
Algorithm  Description 

AdaBoost Regression  Fits a sequence of weak learners, such as small decision trees, on repeatedly modified versions of the data. Combines the learners' predictions through a weighted vote or sum to produce the final prediction.
ARD Regression  Similar to Bayesian Ridge Regression with slightly different weighting.
Bayesian Ridge Regression  Estimates a probabilistic model of the regression problem. 
CCA  Canonical correlation analysis (CCA) is used to find linear relations between two multivariate datasets.
Decision Tree Regression  Maps observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). It supports both continuous and categorical features. 
Extra Trees Regression  An averaging algorithm based on randomized decision trees. 
Elastic Net with Cross-Validation  A variation of Elastic Net tuned with cross-validation.
Elastic Net  Useful when there are multiple features which are correlated with each other.
Gaussian Process  A generic supervised learning method designed to use probabilities to solve regression and classification problems.
Gaussian Process Regression  Creates a prediction based on probability starting with the value of the Gaussian Process and using interpolated data. 
Gradient Boosting Regression  Produces a regression prediction model in the form of an ensemble of decision trees. It supports both continuous and categorical features. 
Huber Regression  Linear regression model that is robust to outliers. 
Nearest Neighbor Analysis (KNN)  Determines the property value for an object using the average of the values of its k nearest neighbors. 
Kernel Ridge  Similar to support vector regression but uses squared error loss. 
Lars with Cross-Validation  Applies the least-angle regression algorithm, then tunes with cross-validation.
Lars  Least-angle regression (LARS) fits linear regression models to high-dimensional data.
Lasso with Cross-Validation  Used for high-dimensional data sets with many collinear regressors.
Lasso  Linear model that performs both variable selection and regularization to enhance prediction accuracy.
Lasso Lars with Cross-Validation  Lasso algorithm tuned with cross-validation.
Lasso Lars  A lasso model implemented with the LARS algorithm.
Lasso Lars IC  Lasso model with LARS using Bayes information criterion for model selection.
LGBM Regression  Gradient boosting framework that uses tree-based learning algorithms.
Linear Regression  Models the linear relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) x.
Linear Support Vector Regression  Implementation of Support Vector Regression for linear kernels. 
MLP Regression  Multilayer Perceptron algorithm that can process one or more hidden layers between input and output. 
Multi-Task Elastic Net CV  A variation of Elastic Net that works on multiple regression problems jointly and performs cross-validation.
Multi-Task Elastic Net  A variation of Elastic Net that works on multiple regression problems jointly.
Multi-Task Lasso CV  A variation of Lasso that works on multiple regression problems jointly and performs cross-validation.
Multi-Task Lasso  A variation of Lasso that works on multiple regression problems jointly.
Nu SVR  A variation of Support Vector Regression that uses a parameter nu to control the number of support vectors. 
Orthogonal Matching Pursuit with Cross-Validation  OMP with cross-validation testing.
Orthogonal Matching Pursuit  Approximates the fit of a linear model with constraints imposed on the number of coefficients.
Passive-Aggressive Regression  Algorithm used for large-scale learning. Like the Perceptron, it does not require a learning rate. Unlike the Perceptron, it includes a regularization parameter.
PLS Canonical  A variation of the Partial Least Squares algorithm. 
PLS Regression  Partial Least Squares is useful to find linear relations between two multivariate datasets. 
Radius Neighbors Regression  Learning algorithm based on the number of neighbors within a fixed radius of each training point. 
Random Forest Regression  Constructs multiple decision trees to produce the mean of the predictions of the individual trees. It supports both continuous and categorical features.
RANSAC Regression  RANSAC (RANdom SAmple Consensus) iteratively fits a model to random subsets of inliers drawn from the complete data set, which makes the fit robust to outliers.
Ridge with Cross-Validation  Ridge regression tuned with cross-validation.
Ridge  Ridge regression is similar to Ordinary Least Squares but imposes a penalty on the size of coefficients. 
SGD Regression  Algorithm well suited to large-scale machine learning problems, such as those associated with text classification and natural language processing.
Support Vector Regression  Ignores the training data close to the model prediction, so the model depends on only a subset of the training data.
Theil-Sen Regression  A generalized-median algorithm that handles outliers well.
XGBoost Regression  Gradient boosted regression trees (GBRT) is an accurate and effective off-the-shelf procedure that can be used for regression problems. Gradient tree boosting models are used in a variety of areas, including web search ranking and ecology.
Metrics by model type
The following metrics are available for measuring the accuracy of pipelines during training and when scoring data.
Binary classification metrics
 ROC AUC (default for ranking the pipelines)
 Accuracy
 Average precision
 F1
 Negative log loss
 Precision
 Recall
Multiclass classification metrics
Metrics for multiclass models can be adjusted to account for imbalances in labels. For example:
 Metrics with the micro qualifier calculate metrics globally by counting the total true positives, false negatives, and false positives.
 Metrics with the macro qualifier calculate metrics for each label and find their unweighted mean. This does not take label imbalance into account.
 Metrics with the weighted qualifier calculate metrics for each label and find their average weighted by support (the number of true instances for each label). This alters macro to account for label imbalance; it can result in an F-score that is not between precision and recall.
These are the multiclass classification metrics:
 Accuracy (default for ranking the pipelines)
 F1
 F1 Micro
 F1 Macro
 F1 Weighted
 Precision
 Precision Micro
 Precision Macro
 Precision Weighted
 Recall
 Recall Micro
 Recall Macro
 Recall Weighted
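As a rough illustration of how the micro, macro, and weighted qualifiers differ, the following sketch computes the three F1 variants by hand on a small, imbalanced label set. The function names and sample labels are hypothetical and are not part of AutoAI; the averaging rules follow the descriptions given for the qualifiers.

```python
from collections import Counter

def f1_from_counts(tp, fp, fn):
    """F1 expressed with true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    """Return micro, macro, and weighted F1 for a multiclass problem."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        counts[label] = (tp, fp, fn)

    # micro: pool the counts over all labels, then compute one score
    micro = f1_from_counts(*(sum(c[i] for c in counts.values()) for i in range(3)))

    # macro: unweighted mean of the per-label scores
    per_label = {label: f1_from_counts(*c) for label, c in counts.items()}
    macro = sum(per_label.values()) / len(per_label)

    # weighted: mean of the per-label scores weighted by support
    support = Counter(y_true)
    weighted = sum(per_label[l] * support[l] for l in labels) / len(y_true)
    return {"micro": micro, "macro": macro, "weighted": weighted}

# Imbalanced sample: eight "a" items and two "b" items, with one "b" misclassified.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
scores = averaged_f1(y_true, y_pred)
for name, value in scores.items():
    print(f"F1 {name}: {value:.3f}")
```

In this sample, the rare label "b" pulls the macro score down the most, because macro gives every label equal weight regardless of how often it occurs.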
Regression metrics
 Negative root mean squared error (default for ranking the pipelines)
 Negative mean absolute error
 Negative root mean squared log error
 Explained variance
 Negative mean squared error
 Negative mean squared log error
 Negative median absolute error
 R2
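The "Negative" prefix on the error metrics suggests the common convention (used, for example, by scikit-learn scorers) of negating an error so that a higher score is always better. Assuming that convention, the following minimal sketch shows how two of the listed metrics and R2 could be computed. The function names and sample values are illustrative only.

```python
import math

def neg_root_mean_squared_error(y_true, y_pred):
    """Negated RMSE: values closer to zero indicate a better fit."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)

def neg_mean_absolute_error(y_true, y_pred):
    """Negated MAE: values closer to zero indicate a better fit."""
    return -sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, 0 matches predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
close_fit = [2.5, 5.5, 7.0, 9.5]      # small errors
mean_only = [6.0, 6.0, 6.0, 6.0]      # always predicts the mean of y_true

for name, pred in (("close_fit", close_fit), ("mean_only", mean_only)):
    print(name,
          round(neg_root_mean_squared_error(y_true, pred), 4),
          round(neg_mean_absolute_error(y_true, pred), 4),
          round(r2(y_true, pred), 4))
```

Under this convention, pipelines can be ranked by sorting every metric in descending order, whether it is an error measure or a goodness-of-fit measure.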
Metrics used for feature importance
Feature importance is calculated from the average of nine measures applied to the training data:
 Linear Correlation (f_regression) metric
 Maximal Information Coefficient (MIC) metric
 Linear Regression (LR) metric
 L1 regularization metric (Lasso)
 Ridge metric
 RF metric
 Stability Selection
 Recursive Feature Elimination (RFE)
 Recursive Feature Elimination plus selection of best number of features
Data transformations
For feature engineering, AutoAI uses a novel approach that explores various feature construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy using reinforcement learning. This results in an optimized sequence of transformations for the data that best matches the algorithms, or estimators, of the model selection step. This table lists some of the transformations used and some well-known conditions under which they are useful. It is not an exhaustive list of scenarios in which each transformation is useful, as that can be complex and hard to interpret. Nor do the listed scenarios explain how the transformations are selected: the selection of which transforms to apply is done in a trial-and-error, performance-oriented manner.
Name  Code  Function 

Principal Component Analysis  pca  Reduces the dimensions of the data and realigns it to a more suitable coordinate system. Helps tackle the ‘curse of dimensionality’ in linearly correlated data. It eliminates redundancy and separates significant signals in data.
Standard Scaler  stdscaler  Scales data features to a standard range. This helps the efficacy and efficiency of certain learning algorithms as well as other transformations such as PCA.
Logarithm  log  Reduces right skewness in features and makes them more symmetric. Resulting symmetry in features helps algorithms understand the data better. Even scaling based on mean and variance is more meaningful on symmetrical data. Additionally, it can capture specific physical relationships between feature and target best described through a logarithm.
Cube Root  cbrt  Reduces right skewness in data like logarithm, but is weaker than log in its impact, which might be more suitable in some cases. It is also applicable to negative or zero values to which log doesn’t apply. Cube root can also change units such as reducing volume to length. 
Square root  sqrt  Reduces mild right skewness in data. It is weaker than log or cube root. It works with zeroes and reduces spatial dimensions such as area to length. 
Square  square  Reduces left skewness to a moderate extent to make such distributions more symmetric. It can also be helpful in capturing certain phenomena such as superlinear growth.
Product  product  A product of two features can expose a nonlinear relationship that predicts the target value better than the individual values alone. For example, item cost multiplied by number of items sold is a better indication of the size of a business than either of those alone.
Numerical XOR  nxor  This transform helps capture “exclusive disjunction” type of relationships between variables, similar to a bitwise XOR, but in a general numerical context. 
Sum  sum  Sometimes the sum of two features is better correlated to the prediction target than the features alone. For instance, loans from different sources, when summed up, provide a better idea of a credit applicant’s total indebtedness. 
Divide  divide  Division is a fundamental operation used to express quantities such as GDP over population (per capita GDP), which can represent a country’s average lifespan better than either GDP alone or population alone.
Maximum  max  Take the higher of two values. 
Rounding  round  This transformation can be seen as perturbation or adding some noise to reduce overfitting that might have been a result of inaccurate observations. 
Absolute Value  abs  Considers only the magnitude and not the sign of an observation. Sometimes the direction or sign of an observation matters less than its magnitude, such as physical displacement when considering the fuel or time spent in the actual movement.
Hyperbolic tangent  tanh  Nonlinear activation function can improve prediction accuracy, similar to that of neural network activation functions. 
Sine  sin  Can reorient data to discover periodic trends such as simple harmonic motions. 
Cosine  cos  Can reorient data to discover periodic trends such as simple harmonic motions. 
Tangent  tan  Trigonometric tangent transform is usually helpful in combination with other transforms. 
Feature Agglomeration  featureagglomeration  Clustering different features into groups, based upon distance or affinity, provides ease of classification for the learning algorithm. 
Sigmoid  sigmoid  Nonlinear activation function can improve prediction accuracy, similar to that of neural network activation functions. 
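To make the relative strength of the skew-reducing transformations concrete, the following sketch applies sqrt, cbrt, and log to a small right-skewed sample and compares the resulting skewness. The sample data and the skewness helper are assumptions chosen for illustration; this is not how AutoAI selects or applies transformations.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Right-skewed sample: powers of two from 1 to 512 (all strictly positive).
raw = [2.0 ** k for k in range(10)]

transforms = {
    "raw": lambda v: v,
    "sqrt": math.sqrt,
    "cbrt": lambda v: v ** (1.0 / 3.0),  # valid here because every value is positive
    "log": math.log,
}

skews = {name: skewness([fn(v) for v in raw]) for name, fn in transforms.items()}
for name, value in skews.items():
    print(f"{name:>4}: skewness = {value:.3f}")
```

On this sample the ordering agrees with the table: square root has the mildest effect, cube root is stronger, and the logarithm makes the values evenly spaced and therefore symmetric.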
AutoAI FAQ
The following are commonly asked questions about creating an AutoAI experiment.
How many pipelines are created?
Two AutoAI parameters determine the number of pipelines:

max_num_daub_ensembles: Maximum number (top-K ranked by DAUB model selection) of the selected algorithm, or estimator, types (for example, LGBMClassifierEstimator, XGBoostClassifierEstimator, or LogisticRegressionEstimator) to use in pipeline composition. The default is 1, where only the algorithm type ranked highest by model selection is used.

num_folds: Number of subsets of the full data set on which to train pipelines, in addition to the full data set. The default is 1, which trains on the full data set only. For each fold and algorithm type, AutoAI creates four pipelines of increasing refinement:
 Pipeline with default sklearn parameters for this algorithm type
 Pipeline with optimized algorithm using HPO
 Pipeline with optimized feature engineering
 Pipeline with optimized feature engineering and optimized algorithm using HPO
The total number of pipelines generated is:
TotalPipelines = max_num_daub_ensembles * 4, if num_folds = 1
TotalPipelines = (num_folds + 1) * max_num_daub_ensembles * 4, if num_folds > 1
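The pipeline count above can be sketched as a small helper. The function total_pipelines is hypothetical and is not part of the AutoAI API; it only restates the two cases of the formula, reusing the parameter names and defaults described in this section.

```python
PIPELINES_PER_ALGORITHM = 4  # the four refinement stages listed above

def total_pipelines(max_num_daub_ensembles: int = 1, num_folds: int = 1) -> int:
    """Number of pipelines generated for the given AutoAI parameter values."""
    if max_num_daub_ensembles < 1 or num_folds < 1:
        raise ValueError("both parameters must be at least 1")
    # With num_folds = 1 only the full data set is used; otherwise each fold
    # is trained in addition to the full data set, giving num_folds + 1 runs.
    runs = 1 if num_folds == 1 else num_folds + 1
    return runs * max_num_daub_ensembles * PIPELINES_PER_ALGORITHM

print(total_pipelines())                                         # defaults
print(total_pipelines(max_num_daub_ensembles=3))                 # three algorithm types
print(total_pipelines(max_num_daub_ensembles=2, num_folds=3))    # folds plus full data set
```

With the defaults, a single algorithm type trained on the full data set yields four pipelines.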
What hyperparameter optimization is applied to my model?
AutoAI uses a model-based, derivative-free global search algorithm, called RBFOpt, which is tailored for the costly machine learning model training and scoring evaluations required by hyperparameter optimization (HPO). In contrast to Bayesian optimization, which fits a Gaussian model to the unknown objective function, RBFOpt fits a radial basis function model to accelerate the discovery of hyperparameter configurations that maximize the objective function of the machine learning problem at hand. This acceleration is achieved by minimizing the number of expensive model training and scoring evaluations and by eliminating the need to compute partial derivatives.
For each fold and algorithm type, AutoAI creates two pipelines that use HPO to optimize for the algorithm type.
 The first optimizes the algorithm type on the preprocessed (imputed/encoded/scaled) data set (pipeline 2 above).
 The second optimizes the algorithm type on the result of optimized feature engineering of the preprocessed (imputed/encoded/scaled) data set (pipeline 4 above).
The parameter values of the algorithms of all pipelines generated by AutoAI are published in status messages.
For more details regarding the RBFOpt algorithm, see: