Building an AutoAI model
AutoAI automatically prepares data, applies algorithms, and attempts to build model pipelines best suited for your data and use case. This topic describes how to generate the model pipelines.
Follow these steps to upload data and have AutoAI create the best model for your data and use case.
- Prepare your training data
- Create a project in Watson Studio
- Open the AutoAI tool
- Specify details of your experiment and launch AutoAI
- View the results
Prepare your training data
Collect your training data in a CSV file that is smaller than 100 MB. Where possible, AutoAI transforms the data and imputes missing values.
You can join related data sources that share one or more keys. For details, see Joining data sources.
Note: You can use the IBM Watson Studio Data Refinery tool to prepare and shape your data.
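Before uploading, it can help to confirm locally that the file meets the size limit and to see how many cells are empty. The following is a minimal sketch, not part of AutoAI: the function name check_training_csv is hypothetical, and treating "100 MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
import csv
import os

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_BYTES = 100 * 1024 * 1024


def check_training_csv(path, max_bytes=MAX_BYTES):
    """Return (column_names, empty_cell_count) for a CSV file.

    Raises ValueError if the file is not smaller than max_bytes
    or if it contains no header row.
    """
    size = os.path.getsize(path)
    if size >= max_bytes:
        raise ValueError(f"{path} is {size} bytes; expected fewer than {max_bytes}")

    empty_cells = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{path} has no header row")
        for row in reader:
            # Count cells that are empty or contain only whitespace.
            empty_cells += sum(1 for cell in row if cell.strip() == "")
    return header, empty_cells
```

A nonzero count of empty cells does not block an upload, because AutoAI imputes missing values where possible, but it indicates how much of the data will be imputed.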
Create a project in Watson Studio
AutoAI uses the default storage associated with your project to store your data and to save model results, so you do not need to set up separate repositories.
- In Watson Studio, click the IBM Watson link in the header to navigate to the home panel.
- Click New project.
- If you are prompted to select a region, choose the US South region.
- If you don’t already have the required Watson Machine Learning service, follow the prompts to create new service instances.
- (Optional) Upload your training data file in CSV format as a data asset for the project.
Open the AutoAI tool
- In your Watson Studio project, click Add to project.
- Click AUTOAI EXPERIMENT.
Note: After you create an AutoAI asset, it appears on the Assets page for your project in the AutoAI experiment section, so you can return to it.
Specify details of your experiment
- Specify a name and description for your experiment.
- Select a machine learning service instance and a compute configuration and click Create.
- Choose data from your project or upload it from your file system, then click Continue. Data must be in a CSV file and must be smaller than 100 MB. Click the Preview icon to review your data.
- For Column to predict, choose the column that contains the values you want the experiment to predict.
- Based on analyzing a subset of the data set, AutoAI chooses a default model type: binary classification, multiclass classification, or regression. Binary classification is selected if the target column has two possible values, multiclass classification if it has a discrete set of possible values (for example, 5 or 10), and regression if the values are continuous. You can override this selection.
- AutoAI chooses a default metric for optimizing. For example, the default metric for a binary classification model is ROC AUC, which summarizes the trade-off between the true positive rate and the false positive rate across classification thresholds.
- By default, ten percent of the training data is held out to test the performance of the model.
- (Optional) Click Experiment settings to view or customize options for your AutoAI run. The default settings balance speed and depth. You can prioritize speed, which generates fewer pipelines, or you can optimize for depth, which increases both the number of pipelines generated and the duration of hyper-parameter optimization. Finally, you can create a custom configuration by adjusting:
- Source settings, where you can adjust the percentage of holdout data. Holdout data is withheld from training the model and used to measure the performance of the model. You can also adjust the number of rows used to train the model.
- Prediction settings, to change the model type or the metric to optimize. For binary classification models you can also edit the positive class.
- Advanced settings, where you can change the compute configuration and customize further options for your AutoAI run. You can set:
- The number of estimators, which controls the number of pipelines generated after model selection. For each estimator, 4 pipelines are generated. For example, a setting of 2 generates 8 pipelines from the top 2 estimators.
- HPO duration, which controls the maximum time that the hyper-parameter optimization (HPO) stage runs. If a timeout occurs, the current hyper-parameter settings found by the HPO stage are returned. The default timeout is 60 minutes.
- Feature importance, which shows how each pipeline views the features and what new features are created. It is on by default.
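The rule for choosing a default model type, described in the steps above, can be illustrated with a small heuristic. This is a sketch only and is not AutoAI's actual logic: the function name infer_prediction_type and the max_classes cutoff of 20 are hypothetical choices made for the example.

```python
def infer_prediction_type(target_values, max_classes=20):
    """Guess a model type from the values of the target column.

    Hypothetical heuristic for illustration; AutoAI's own analysis
    of the data subset may differ.
    """
    distinct = {value for value in target_values if value is not None}
    if len(distinct) < 2:
        raise ValueError("target column needs at least two distinct values")
    if len(distinct) == 2:
        return "binary classification"

    # Treat the target as continuous only if every value is numeric
    # and there are more distinct values than max_classes.
    all_numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in distinct
    )
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    return "multiclass classification"
```

Any such guess can be wrong, for example for a numeric column that holds category codes, which is why the selection can be overridden in the experiment settings.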
Click Run Experiment to begin model pipeline creation.
A progress infographic shows the creation of pipelines for your data. The duration of this phase depends on the size of your data set. A notification message tells you whether processing will be brief or take longer. You can work in other parts of the product while the pipelines build.
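Two of the experiment settings involve simple arithmetic: the holdout percentage and the number of estimators. The sketch below shows a 10 percent holdout split and the pipeline count for a given number of estimators. It is an illustration under stated assumptions, not AutoAI's implementation: the function names are hypothetical, and the shuffled random split is an assumed strategy that the documentation does not specify.

```python
import random

# From the settings description: 4 pipelines are generated per estimator.
PIPELINES_PER_ESTIMATOR = 4


def split_holdout(rows, holdout_fraction=0.10, seed=0):
    """Return (training_rows, holdout_rows) using a shuffled split.

    Assumption: rows are split at random; AutoAI may split differently.
    """
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    holdout_count = round(len(shuffled) * holdout_fraction)
    return shuffled[holdout_count:], shuffled[:holdout_count]


def total_pipelines(num_estimators):
    """Return the number of pipelines generated for the top estimators."""
    if num_estimators < 1:
        raise ValueError("num_estimators must be at least 1")
    return num_estimators * PIPELINES_PER_ESTIMATOR


training_rows, holdout_rows = split_holdout(range(100))
print(len(training_rows), len(holdout_rows))  # 90 training rows, 10 holdout rows
print(total_pipelines(2))  # 8 pipelines from the top 2 estimators
```

Raising the holdout percentage leaves fewer rows for training, and raising the number of estimators increases the pipeline count and therefore the run time.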
View the results
When the pipeline generation process completes, you can view the leading model candidates and evaluate them before saving a pipeline as a model.
Follow the steps in Selecting an AutoAI model for details on how to evaluate the pipelines as model candidates, then save a model.