AutoAI implementation details

AutoAI automatically prepares data, applies algorithms, and builds model pipelines best suited for your data and use case. This topic describes some of the technical details that go into generating the pipelines.


Estimators used for classification models

  • AdaBoost Classifier: Fits a sequence of weak learners (slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data, then combines the results to produce the final prediction.
  • Bernoulli Naïve Bayes Classifier: Version of the Naïve Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions.
  • Calibrated Classifier with Cross-Validation: Uses cross-validation: for each split, the model is fit on the training samples and calibrated on the test samples, and the predicted probabilities of the folds are then averaged.
  • Decision Tree Classifier: Maps observations about an item (represented in branches) to conclusions about the item’s target value (represented in leaves). Supports both binary and multiclass labels, as well as both continuous and categorical features.
  • Extra Trees Classifier: An averaging algorithm based on randomized decision trees.
  • Gaussian Naïve Bayes Classifier: Classifies features based on Bayes’ theorem, which assumes that the presence of a feature in a class is unrelated to the presence of any other feature.
  • Gaussian Process Classifier: Performs probabilistic classification, where test predictions take the form of class probabilities.
  • Gradient Boosted Tree Classifier: Produces a classification prediction model in the form of an ensemble of decision trees. Supports only binary labels, and both continuous and categorical features.
  • Nearest Neighbor Analysis (KNN) Classifier: Determines class membership based on the membership of surrounding data points.
  • Label Propagation: Infers labels by constructing a similarity graph over all items in the input dataset.
  • Label Spreading: Infers labels by constructing a similarity graph over all items in the input dataset, minimizing a loss function.
  • LGBM Classifier: Gradient boosting framework that uses a leaf-wise (horizontal) tree-based learning algorithm.
  • Linear Discriminant Analysis: Makes predictions by estimating the probability that a new set of inputs belongs to each class. Best for binary classification.
  • Linear Support Vector Classifier: An implementation of Support Vector Classification for the case of a linear kernel.
  • Logistic Regression with Cross-Validation: Divides the dataset into folds and, in each iteration, holds out one fold as the test set and trains the model on the remaining folds.
  • Logistic Regression: Analyzes a dataset in which one or more independent variables determine one of two outcomes. Only binary logistic regression is supported.
  • MLP Classifier: Multi-layer Perceptron algorithm that optimizes the log-loss function using L-BFGS or stochastic gradient descent.
  • Multinomial Naïve Bayes Classifier: Suitable for classification with discrete features (for example, word counts for text classification).
  • Nearest Centroid: Represents each class by its centroid, with test samples classified to the class with the nearest centroid.
  • Nu Support Vector Classifier: Similar to the Support Vector Classifier but uses a parameter to control the number of support vectors.
  • Passive Aggressive Classifier: Classifier for large-scale learning. Like the Perceptron, it does not require a learning rate; unlike the Perceptron, it includes a regularization parameter.
  • Perceptron: Classification algorithm suitable for large-scale learning. Does not require a learning rate, is not regularized, and updates the model only on mistakes.
  • Quadratic Discriminant Analysis: A classifier with a closed-form solution that can be easily computed, is inherently multiclass, and has no hyperparameters to tune.
  • Radius Neighbors Classifier: Implements learning based on the number of neighbors within a fixed radius of each training point, where the radius is a floating-point value specified by the user.
  • Random Forest Classifier: Constructs multiple decision trees and produces the label that is the mode of the individual trees’ predictions. Supports both binary and multiclass labels, as well as both continuous and categorical features.
  • Ridge Classifier with Cross-Validation: Linear classifier tuned with generalized cross-validation.
  • Ridge Classifier: Linear classifier well suited for high-dimensional problems such as text classification.
  • SGD Classifier: Stochastic gradient descent is a simple yet efficient approach to fitting linear models. Particularly useful when the number of samples and features is very large.
  • Support Vector Classifier: Tries to find a combination of samples to build a plane that maximizes the margin between the two classes.
  • XGBoost Classifier: Accurate and effective off-the-shelf procedure that can be used for classification problems. XGBoost models are used in a variety of areas, including web search ranking and ecology.
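
Most of these estimators map to open-source implementations in scikit-learn, LightGBM, or XGBoost. The following sketch is illustrative only (it assumes scikit-learn is installed and is not AutoAI's internal code); it trains two of the listed classifiers on a sample dataset:

    # Illustrative sketch: two of the listed classifiers via their scikit-learn
    # equivalents. Not AutoAI internals; assumes scikit-learn is installed.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for clf in (LogisticRegression(max_iter=200), RandomForestClassifier(random_state=0)):
        clf.fit(X_train, y_train)
        print(type(clf).__name__, clf.score(X_test, y_test))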


Estimators used for regression models

  • AdaBoost Regression: Fits a sequence of weak learners, such as small decision trees, on repeatedly modified versions of the data, then combines them with a weighted majority vote to produce the final prediction.
  • ARD Regression: Bayesian regression similar to Bayesian Ridge Regression but with slightly different weighting.
  • Bayesian Ridge Regression: Estimates a probabilistic model of the regression problem.
  • CCA: Canonical correlation analysis, used to find linear relations between two multivariate datasets.
  • Decision Tree Regression: Maps observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Supports both continuous and categorical features.
  • Extra Trees Regression: An averaging algorithm based on randomized decision trees.
  • Elastic Net with Cross-Validation: A variation of Elastic Net tuned with cross-validation.
  • Elastic Net: Useful when there are multiple features that are correlated with one another.
  • Gaussian Process: A generic supervised learning method designed to use probabilities to solve regression and classification problems.
  • Gaussian Process Regression: Creates a prediction based on probability, starting with the value of the Gaussian Process and using interpolated data.
  • Gradient Boosting Regression: Produces a regression prediction model in the form of an ensemble of decision trees. Supports both continuous and categorical features.
  • Huber Regression: Linear regression model that is robust to outliers.
  • Nearest Neighbor Analysis (KNN): Determines the property value for an object by using the average of the values of its k nearest neighbors.
  • Kernel Ridge: Similar to support vector regression but uses squared error loss.
  • Lars with Cross-Validation: Applies the least-angle regression algorithm, then tunes it with cross-validation.
  • Lars: Least-angle regression (LARS) fits linear regression models to high-dimensional data.
  • Lasso with Cross-Validation: Lasso tuned with cross-validation; used for high-dimensional datasets with many collinear regressors.
  • Lasso: Linear model that performs both variable selection and regularization to enhance prediction accuracy.
  • Lasso Lars with Cross-Validation: Lasso algorithm tuned with cross-validation.
  • Lasso Lars: A Lasso model implemented with the LARS algorithm.
  • Lasso Lars IC: Lasso model with LARS that uses the Bayes information criterion for model selection.
  • LGBM Regression: Gradient boosting framework that uses tree-based learning algorithms.
  • Linear Regression: Models the linear relationship between a scalar dependent variable y and one or more explanatory (independent) variables x.
  • Linear Support Vector Regression: Implementation of Support Vector Regression for linear kernels.
  • MLP Regression: Multi-layer Perceptron algorithm that can process one or more hidden layers between input and output.
  • MultiTask Elastic Net CV: A variation of Elastic Net that works on multiple regression problems jointly and performs cross-validation.
  • MultiTask Elastic Net: A variation of Elastic Net that works on multiple regression problems jointly.
  • Multi Task Lasso CV: A variation of Lasso that works on multiple regression problems jointly and performs cross-validation.
  • Multi Task Lasso: A variation of Lasso that works on multiple regression problems jointly.
  • Nu SVR: A variation of Support Vector Regression that uses a parameter nu to control the number of support vectors.
  • Orthogonal Matching Pursuit with Cross-Validation: Orthogonal Matching Pursuit tuned with cross-validation.
  • Orthogonal Matching Pursuit: Approximates the fit of a linear model with constraints imposed on the number of coefficients.
  • Passive-Aggressive Regression: Algorithm for large-scale learning. Like the Perceptron, it does not require a learning rate; unlike the Perceptron, it requires a regularization parameter.
  • PLS Canonical: A variation of the Partial Least Squares algorithm.
  • PLS Regression: Partial Least Squares is useful for finding linear relations between two multivariate datasets.
  • Radius Neighbors Regression: Learning algorithm based on the number of neighbors within a fixed radius of each training point.
  • Random Forest Regression: Constructs multiple decision trees and produces the mean of the individual trees’ predictions. Supports both continuous and categorical features.
  • RANSAC Regression: Iteratively fits the model to random subsets of the data and identifies the consensus set of inliers, making the regression robust to outliers.
  • Ridge with Cross-Validation: Ridge regression tuned with cross-validation.
  • Ridge: Ridge regression is similar to Ordinary Least Squares but imposes a penalty on the size of the coefficients.
  • SGD Regression: Algorithm well suited to large-scale machine learning problems such as text classification and natural language processing.
  • Support Vector Regression: Ignores training data that is close to the model prediction, so the model depends on only a subset of the training data.
  • Theil-Sen Regression: A generalized-median estimator that handles outliers well.
  • XGBoost Regression: Gradient boosted regression trees (GBRT) are an accurate and effective off-the-shelf procedure that can be used for regression problems. Gradient Tree Boosting models are used in a variety of areas, including web search ranking and ecology.
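
Similarly, the regression estimators largely correspond to scikit-learn, LightGBM, and XGBoost classes. A minimal sketch (illustrative only, assuming scikit-learn is installed) scores two of the listed regressors with the default ranking metric described later in this topic:

    # Illustrative sketch: cross-validating two of the listed regressors with the
    # negative root mean squared error scorer. Not AutoAI internals.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

    for reg in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
        scores = cross_val_score(reg, X, y, scoring="neg_root_mean_squared_error")
        print(type(reg).__name__, scores.mean())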

Metrics by model type

The following metrics are available for measuring the accuracy of pipelines during training and when scoring data.

Binary classification metrics

  • ROC AUC (default for ranking the pipelines)
  • Accuracy
  • Average precision
  • F1
  • Negative log loss
  • Precision
  • Recall
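
These metric names match standard scikit-learn scoring functions. The following sketch (illustrative only, assuming scikit-learn) computes them for a toy binary problem:

    # Illustrative sketch: the binary classification metrics listed above,
    # computed with scikit-learn on toy predictions.
    from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                                 log_loss, precision_score, recall_score,
                                 roc_auc_score)

    y_true = [0, 0, 1, 1, 1, 0]
    y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2]   # predicted probability of class 1
    y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

    print("ROC AUC          ", roc_auc_score(y_true, y_prob))
    print("Accuracy         ", accuracy_score(y_true, y_pred))
    print("Average precision", average_precision_score(y_true, y_prob))
    print("F1               ", f1_score(y_true, y_pred))
    print("Negative log loss", -log_loss(y_true, y_prob))
    print("Precision        ", precision_score(y_true, y_pred))
    print("Recall           ", recall_score(y_true, y_pred))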

Multi-class classification metrics

Metrics for multi-class models can be adjusted to account for imbalances in labels. For example:

  • Metrics with the micro qualifier calculate metrics globally by counting the total true positives, false negatives, and false positives.
  • Metrics with the macro qualifier calculate metrics for each label and find their unweighted mean. This does not take label imbalance into account.
  • Metrics with the weighted qualifier calculate metrics for each label and find their average weighted by support (the number of true instances for each label). This alters macro to account for label imbalance; it can result in an F-score that is not between precision and recall. A short sketch comparing the three qualifiers follows this list.
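
The following minimal sketch (illustrative only, assuming scikit-learn) shows how the micro, macro, and weighted qualifiers change the F1 score on an imbalanced label set:

    # Illustrative sketch: micro, macro, and weighted averaging of F1 on an
    # imbalanced multi-class problem.
    from sklearn.metrics import f1_score

    y_true = [0, 0, 0, 0, 1, 1, 2]   # label 0 dominates
    y_pred = [0, 0, 0, 1, 1, 2, 2]

    print(f1_score(y_true, y_pred, average="micro"))     # global TP/FP/FN counts
    print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over labels
    print(f1_score(y_true, y_pred, average="weighted"))  # mean weighted by label support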

These are the multi-class classification metrics:

  • Accuracy (default for ranking the pipelines)
  • F1
  • F1 Micro
  • F1 Macro
  • F1 Weighted
  • Precision
  • Precision Micro
  • Precision Macro
  • Precision Weighted
  • Recall
  • Recall Micro
  • Recall Macro
  • Recall Weighted

Regression metrics

  • Negative root mean squared error (default for ranking the pipelines)
  • Negative mean absolute error
  • Negative root mean squared log error
  • Explained variance
  • Negative mean squared error
  • Negative mean squared log error
  • Negative median absolute error
  • R2

Metrics used for feature importance

Feature importance is calculated from the average of nine measures applied to the training data (see the sketch after this list):

  • Linear Correlation (f_regression) metric
  • Maximal Information Coefficient (MIC) metric
  • Linear Regression (LR) metric
  • L1 regularization metric (Lasso)
  • Ridge metric
  • RF metric
  • Stability Selection
  • Recursive Feature Elimination (RFE)
  • Recursive Feature Elimination plus selection of best number of features
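
AutoAI's exact implementation is internal, but the following sketch (illustrative only; the measure choices, normalization, and dataset are assumptions) shows the general idea of averaging several normalized measures into a single importance score:

    # Illustrative sketch only, not AutoAI's internal code: average normalized
    # scores from a few of the listed measures into one importance value.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import RFE, f_regression
    from sklearn.linear_model import Lasso, LinearRegression

    X, y = make_regression(n_samples=200, n_features=6, noise=0.1, random_state=0)

    def normalize(scores):
        scores = np.abs(np.asarray(scores, dtype=float))
        return scores / scores.max() if scores.max() > 0 else scores

    measures = [
        normalize(f_regression(X, y)[0]),                              # linear correlation
        normalize(Lasso(alpha=0.01).fit(X, y).coef_),                  # L1 regularization (Lasso)
        normalize(RandomForestRegressor(random_state=0).fit(X, y).feature_importances_),  # RF
        normalize(1.0 / RFE(LinearRegression()).fit(X, y).ranking_),   # recursive feature elimination
    ]

    importance = np.mean(measures, axis=0)   # average across the measures
    print(importance)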

How many pipelines are created?

Two AutoAI parameters determine the number of pipelines:

  • max_num_daub_ensembles: Maximum number (top-K, ranked by DAUB model selection) of the selected estimator types, for example LGBMClassifierEstimator, XGBoostClassifierEstimator, or LogisticRegressionEstimator, to use in pipeline composition. The default is 1, meaning that only the estimator type ranked highest by model selection is used.

  • num_folds: Number of subsets of the full dataset on which to train pipelines, in addition to the full dataset. The default is 1, which trains on the full dataset only.

For each fold and estimator type, AutoAI creates 4 pipelines of increasing refinement, corresponding to:

  1. Pipeline with default sklearn parameters for this estimator type
  2. Pipeline with optimized estimator using HPO
  3. Pipeline with optimized feature engineering
  4. Pipeline with optimized feature engineering and optimized estimator using HPO

The total number of pipelines generated is:

TotalPipelines = max_num_daub_ensembles * 4, if num_folds = 1

TotalPipelines = (num_folds + 1) * max_num_daub_ensembles * 4, if num_folds > 1
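
As a quick sanity check, the formula can be expressed directly in code (a sketch; the parameter names simply mirror the AutoAI settings described above):

    # Sketch of the pipeline-count formula above.
    def total_pipelines(max_num_daub_ensembles=1, num_folds=1):
        if num_folds == 1:
            return max_num_daub_ensembles * 4
        return (num_folds + 1) * max_num_daub_ensembles * 4

    print(total_pipelines())                                       # defaults: 1 * 4 = 4
    print(total_pipelines(max_num_daub_ensembles=2, num_folds=3))  # (3 + 1) * 2 * 4 = 32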

What hyperparameter optimization is applied to my model?

For each fold and estimator type, AutoAI creates two pipelines that use HPO to optimize the estimator type.

  • The first optimizes this estimator type on the preprocessed (imputed/encoded/scaled) dataset (pipeline 2 above).
  • The second optimizes this estimator type on the optimized feature engineering of the preprocessed (imputed/encoded/scaled) dataset.
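
AutoAI's hyperparameter optimization is performed internally, but conceptually each HPO pipeline searches estimator hyperparameters on top of the preprocessing (and, where applicable, feature engineering) steps. A rough scikit-learn analogue is sketched below; the search strategy, parameter grid, and dataset are illustrative assumptions, not AutoAI's actual HPO:

    # Rough analogue only, not AutoAI's internal HPO: tune an estimator on top of
    # an imputation/scaling preprocessing step, scored with ROC AUC.
    from sklearn.datasets import load_breast_cancer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    pipe = Pipeline([
        ("impute", SimpleImputer()),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, scoring="roc_auc")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)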

The parameter values of the estimators of all pipelines generated by AutoAI are published in status messages.