The AutoAI graphical tool analyzes your data and uses data algorithms, transformations, and parameter settings to create the best predictive model. AutoAI displays various potential models as model candidate pipelines and ranks them on a leaderboard for you to choose from.
- Required service
- Watson Machine Learning
- Watson Studio
- Data format
- Tabular: CSV files, with comma (,) delimiter for all types of AutoAI experiments.
- Connected data from IBM Cloud Object Storage.
Note: You can use a data asset that is saved as a Feature Group (beta) but the metadata is not used to populate the AutoAI experiment settings.
- Data size
- Up to 1 GB or up to 20 GB. For details, refer to AutoAI data use.
AutoAI data use
These limits are based on the default compute configuration of 8 CPU and 32 GB.
AutoAI classification and regression experiments:
- You can upload a file up to 1 GB for AutoAI experiments.
- If you connect to a data source that exceeds 1 GB, only the first 1 GB of records is used.
AutoAI time series experiments:
- If the data source contains a timestamp column, AutoAI samples the data at a uniform frequency. For example, data can be in increments of one minute, one hour, or one day. The specified timestamp is used to determine the lookback window to improve the model accuracy.
Note: If the file size is larger than 1 GB, AutoAI sorts the data in descending time order and only the first 1 GB is used to train the experiment.
- If the data source does not contain a timestamp column, ensure that the data is sampled at uniform intervals and sorted in ascending time order. This means the value in the first row is the oldest, and the value in the last row is the most recent. Note: If the file size is larger than 1 GB, truncate the file so that it is smaller than 1 GB.
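As a hedged illustration of the preparation step described above, the following sketch sorts time series data into ascending time order and keeps only the most recent rows before upload. The `max_rows` cap stands in for the 1 GB size limit; the column names and sample data are hypothetical, and this is a pre-upload step you would run yourself, not AutoAI's internal logic.

```python
import pandas as pd

def prepare_timeseries(df: pd.DataFrame, time_col: str, max_rows: int) -> pd.DataFrame:
    """Sort in ascending time order (oldest first) and keep only the
    most recent `max_rows` rows, mimicking a size cap such as the 1 GB limit."""
    df = df.sort_values(time_col).reset_index(drop=True)
    return df.tail(max_rows).reset_index(drop=True)

# Hypothetical sample: minute-level readings arriving in shuffled order
data = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 00:02", "2024-01-01 00:00",
                          "2024-01-01 00:03", "2024-01-01 00:01"]),
    "value": [3, 1, 4, 2],
})
ready = prepare_timeseries(data, "ts", max_rows=3)
```

After this step, the first row holds the oldest retained value and the last row the most recent, matching the ordering AutoAI expects.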
For more information on choosing the right tool for your data and use case, refer to Choosing a tool.
Using AutoAI, you can build and deploy a machine learning model with sophisticated training features and no coding. The tool does most of the work for you.
To view the code that created a particular experiment, or interact with the experiment programmatically, you can save an experiment as a notebook.
AutoAI automatically runs the following tasks to build and evaluate candidate model pipelines:
- Data pre-processing
- Automated model selection
- Automated feature engineering
- Hyperparameter optimization
- Ensembling and incremental learning
Understanding the AutoAI process
For additional detail on each of these phases, including links to associated research papers and descriptions of the algorithms applied to create the model pipelines, see AutoAI implementation details.
Data pre-processing
Most data sets contain different data formats and missing values, but standard machine learning algorithms work only with numbers and no missing values. Therefore, AutoAI applies various algorithms, or estimators, to analyze, clean, and prepare your raw data for machine learning. AutoAI automatically detects and categorizes values based on features, such as data type: categorical or numerical. Depending on the categorization, AutoAI uses hyper-parameter optimization to determine the best combination of strategies for missing value imputation, feature encoding, and feature scaling for your data.
Automated model selection
AutoAI uses automated model selection to identify the best model for your data. This novel approach tests potential models against small subsets of the data and ranks them based on accuracy. AutoAI then selects the most promising models and increases the size of the data subset until it identifies the best match. This approach saves time and improves performance by gradually narrowing down the potential models based on accuracy.
For information on how to handle automatically-generated pipelines to select the best model, refer to Selecting an AutoAI model.
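The idea of testing candidates on small subsets, dropping the weakest, and growing the subset can be sketched as a simple successive-elimination loop. The candidate estimators, subset sizes, and elimination rule below are assumptions for illustration; AutoAI's own selection algorithm is more sophisticated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree":   DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
}

subset = 200
while len(candidates) > 1 and subset <= len(X_tr):
    # Train each surviving candidate on the current subset, score on holdout
    scores = {name: est.fit(X_tr[:subset], y_tr[:subset]).score(X_te, y_te)
              for name, est in candidates.items()}
    # Eliminate the weakest candidate, then double the training subset
    worst = min(scores, key=scores.get)
    del candidates[worst]
    subset *= 2

best_name = next(iter(candidates))
```

Because weak candidates are eliminated while the data subset is still small, most of the training budget is spent only on the most promising models.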
Automated feature engineering
Feature engineering identifies the most accurate model by transforming raw data into a combination of features that best represent the problem. This unique approach explores various feature construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy using reinforcement learning. This results in an optimized sequence of transformations for the data that best match the algorithms of the model selection step.
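A structured, non-exhaustive search over feature transformations can be sketched as a greedy loop that keeps a transformed column only when it improves cross-validated accuracy. The transformations, target, and greedy acceptance rule are illustrative assumptions; AutoAI guides this search with reinforcement learning rather than a greedy pass.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(300, 2))
y = np.log(X[:, 0]) * X[:, 1]          # target with a hidden log relationship

transforms = {"log": np.log, "sqrt": np.sqrt, "square": np.square}

def score(features):
    return cross_val_score(Ridge(), features, y, cv=3).mean()

best_X, best_score = X, score(X)
# Greedily try appending one transformed column at a time
for name, fn in transforms.items():
    for col in range(X.shape[1]):
        candidate = np.column_stack([best_X, fn(X[:, col])])
        s = score(candidate)
        if s > best_score:
            best_X, best_score = candidate, s
```

Each accepted transformation becomes part of the feature set that the model selection and optimization steps then build on.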
Hyperparameter optimization
Hyperparameter optimization refines the best performing models. AutoAI uses a novel hyperparameter optimization algorithm for certain function evaluations, such as model training and scoring, that are typical in machine learning. This approach quickly identifies the best model despite long evaluation times at each iteration.
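The effect of this step can be illustrated with an off-the-shelf search such as scikit-learn's `RandomizedSearchCV`, which stands in here for AutoAI's own optimizer; the estimator, parameter range, and iteration budget are assumptions for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Randomized search stands in for AutoAI's proprietary optimization algorithm
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10, cv=3, random_state=0,
)
search.fit(X, y)
best_C = search.best_params_["C"]
```

Each candidate evaluation requires a full train-and-score cycle, which is why an optimizer that needs few evaluations matters when training is slow.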
Ensembling and incremental learning
AutoAI builds BatchedTreeEnsemble pipelines on top of the ranked pipelines. The ensemble pipelines provide incremental learning capabilities and can be used to continue training with the remaining data in a subsampled source, dividing the remaining data into batches, if needed. Each batch of training data is scored independently using the optimized metric, so you can review the performance of each batch when you explore the results. For details, see Incremental learning.
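Training in batches with a per-batch score can be sketched with any estimator that supports incremental fitting, such as scikit-learn's `SGDClassifier`; the batch count and estimator are illustrative assumptions, not the BatchedTreeEnsemble implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y_tr)

batch_scores = []
for batch_X, batch_y in zip(np.array_split(X_tr, 5), np.array_split(y_tr, 5)):
    # Continue training on the next batch of remaining data
    clf.partial_fit(batch_X, batch_y, classes=classes)
    # Score the holdout after each batch, mirroring per-batch metric review
    batch_scores.append(clf.score(X_te, y_te))
```

Recording a score after each batch is what lets you review how model quality evolves as more of the source data is consumed.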
AutoAI tutorial: Build a Binary Classification Model
Parent topic: Analyzing data and building models