Deep Learning Experiment Builder

Last updated: Jun 20, 2022

As a data scientist, you need to train thousands of models to identify the right combination of data in conjunction with hyperparameters to optimize the performance of your neural networks. You want to perform more experiments and faster. You want to train deeper neural networks and explore more complicated hyperparameter spaces. IBM Watson Machine Learning accelerates this iterative cycle by simplifying the process to train models in parallel with an auto-allocated GPU compute containers.

Attention: Support for Deep Learning as a Service and the Deep Learning Experiment Builder on Cloud Pak for Data as a Service is deprecated and will be discontinued on April 2, 2022. See Service plan changes and deprecations for details.

Deep Learning experiment overview

Required service Watson Machine Learning

Data format Textual: CSV files with labeled textual data
Image: Image files in a PKL file. For example, a model testing signatures uses images resized to 32×32 pixels and stored as numpy arrays in a pickled format.
Files are stored in IBM Cloud Object Storage

Data size Any size

Prerequisites for using deep learning experiments

To use the Deep Learning Experiment Builder, you must provision the following assets:

IBM Watson Machine Learning service instance
IBM Cloud Object Storage, with a bucket for both your training results and a training source

You must create a separate bucket outside of the Watson Studio project so that it will not get deleted if or when the project is ever deleted. There are several ways that you can upload data to these buckets. For example, you can upload data via the IBM Cloud, use the AWS CLI, or an FTP tool.
a model definition (an internal training specification), which stores metadata about how a model needs to be trained. For more information, refer to Create a Model definition.
a Python execution script, which is used to deliver metrics to the training run

Review the Coding guidelines for deep learning programs to ensure that all scripts and manifest files comply with the requirements.

Creating a new experiment

Open a project that has the required service instances set up. If the correct service instances are not yet attached to the project, you will be prompted to create new service instances as part of defining the experiment details.
To create an experiment, click New asset > Experiment.

For more information on choosing the right tool for your data and use case, refer to Choosing a tool.

Defining your data connection

Use Experiment Builder to access existing connections as well as create new connections to your data assets on the IBM Cloud Object Storage. If one doesn't already exist, you must create a data connection.

Security Note: Although it is possible to reuse IBM Cloud Object Storage connections, to maintain security, a connection should be used only for a specific experiment. Why? The credentials used for experiments must be granted write access for storing the assets generated during training. For this reason, reusing connections is not recommended.

Type a name and description for this experiment.
Select the Machine Learning service instance.
In the IBM Cloud Object Storage section, click Select.
Choose an IBM Cloud Object Storage connection or create a new one.
Choose buckets for the training results and training source or create a new bucket. Although you can have the same bucket, you should choose different buckets so that the source and target are separate for easier management of large data. If you create a new bucket, you must upload the training data assets prior to initiating the training run.
Click Create.

Steps for building and deploying Deep Learning models

Write Python code to specify metrics for training runs. For more information about these requirements, refer to the coding guidelines
Associate model definitions. For more information, refer to Create a Model definition.
Perform hyperparameter optimization. Refer to Hyperparameter optimization overview and Defining hyperparameters. For a detailed HPO description, refer to Hyperparameter Optimization
Find the optimal values for large numbers of hyperparameters by running hundreds or thousands of training runs. For information on how to run a training, refer to Creating and running the experiment and Adding a training run to an experiment
Run distributed training with GPUs and specialized, powerful hardware and infrastructure. For details, refer to Distributed deep learning and Using graphics processing units (GPUs)
Compare the performance of training runs. Refer to Viewing training run results
Save a training run as a model. For details, refer to Saving as a model and deploying

Associating model definitions

You must associate one or more model definitions to this experiment. You can associate multiple model definitions as part of running an experiment. Model definitions can be a mix of either existing model definitions or ones that you create as part of the process.

Click Add model definition.
Choose whether to create a new model definition or use an existing model definition.
- To create a new model definition, click the New tab, and follow steps listed in Defining a new model definition.
- To choose an existing model definition, click the Existing tab, and follow steps listed in Selecting an existing model definition.

Defining a new model definition

Type a name and a description.
Choose a .zip file that has the Python code that you have set up to indicate the metrics to use for training runs.
From the Framework box, select the appropriate framework. This must be compatible with the code you use in the Python file.
In the Execution command box, type the execution command that can be used to execute the Python code.
- It must reference the .py file.
- It must indicate the data buckets.
In the Attributes during this experiment section, from the Compute configuration box, select a compute plan, which determines the number and size of GPUs to use for the experiment. For more information about GPUs, see Using GPUs.
From the Hyperparameter optimization method box, select a method.
Click Create.

Selecting an existing model definition

Notes:

If you select an existing model definition, you cannot view or modify the training attributes, however, you can change the computation plan.
You can use any compatible model definition as part of your experiment. You must make several selections to identify it to the experiment.
From the Select model definition box, select a model definition.
Select the buckets to be used for your training results and training source.
In the Attributes during this experiment section, from the Compute configuration box, select a compute plan, which determines the number and size of GPUs to use for the experiment. For more information about GPUs, see Using GPUs.
From the Hyperparameter optimization method box, select a method.
Click Select.

Hyperparameter optimization overview

Hyperparameter optimization enables your experiment to run against an array of parameters and find the most accurate models for you to deploy. You can choose to run the following HPO options:

None: No hyperparameters are used in the training runs. You must create all training runs manually to use in the experiment.
rbfopt: Uses a technique called RBFOpt to explore the search space. Determining the parameters of a neural network is a challenging problem because of the extremely large configuration space (for instance, how many nodes per layer, activation functions, learning rates, drop-out rates, filter sizes) and the computational cost of evaluating a proposed configuration, such as evaluating a single configuration can take hours to days. To address this challenging problem the rbfopt algorithm uses a model-based global optimization algorithm that does not require derivatives. Similarly to Bayesian Optimization which fits a Gaussian model to the unknown objective function, our approach fits a radial basis function model.

The underlying optimization software for the rbfopt algorithm is open source. For more information, refer to RbfOpt: A blackbox optimization library in Python.
Random: Implements a simple algorithm which randomly assigns hyperparameter values from the ranges specified for an experiment.

When you choose to run with HPO, the number of optimizer steps equates to the number of training runs that are executed.

Defining hyperparameters

Click Add hyperparameter.
Type a name for this hyperparameter.
Choose whether to have distinct values or a range of values.
- To use distinct values, click Distinct values and then list the values separated by commas (,) in the Values box.
- To use a range of values, click Range, type the Lower bound and Upper bound, and then choose whether to traverse the range either by a power (exponential) or a step. You must then enter the exponent or step value.
Choose the data type of the range value or distinct value. Watson automatically reads the data and selects the most likely type, however, you can change the default. The following data types are available:
- Integer
- Double
- String
To create this hyperparameter and then add another, click Add and Create Another.
To create this hyperparameter and return to the model definition window, click Add.

Depending on what you specify, you can have an exponential number of runs. Experiment Builder tries to give maximal results without wasting runs while maximizing accuracy.
Click Create.
Examine the results of your work. You might only create a single training run, but because of the use of multiple hyperparameters you may see a large set of training runs.

Creating and running the experiment

After you define or select the model definition file, it appears in the list of model definitions. Because a model definition must have an execution command, you are given the option of setting a global execution command.

Choose whether to use a global execution command. Although model definition assets can have a saved execution command, you can use the global execution command setting to override the commands saved with the model definitions file. It's possible to define a model definition without specifying an execution command. In this case, the global execution command can be used for training defnitions. It is important to note that this setting overrides all model definitions, even ones with predefined execution commands.
- To use a global execution command, set the Use global execution command check box.
- To use the execution command specific to the model definition file, clear the Use global execution command check box.
Click Create and run.

Adding a training run to an experiment

Because an experiment is a live asset, you can add additional training runs to an experiment.

From the Watson Studio action bar, click the experiment name.
Click Add model definition.

Viewing run results

After you create and run the model definition, you can find the new experiment asset listed in the IBM Watson Studio Projects area and also in the Watson Machine Learning service repository on IBM Cloud. The process of running the model definition also invokes the experiment run endpoint. It creates a manifest file and sends it to the IBM Watson deep learning service.

If you created a model definition without using hyperparameter optimization, the runs are specific to each model definition that you provisioned. If you created a model definition with hyperparameter optimization, the service executes many training runs based on the total number of metrics and parameters that you provisioned.

As the training run proceeds, the results are dynamically added to the display.

For real-time results, in the In progress section, click a training run.
- On the Monitor tab, you can view metrics, such as accuracy, loss, and values.
- On the Overview tab, you can view the model definition, framework, and execution command that was used to create this run.
- On the Logs tab, you can find a selection of logs. For performance reasons, only the most-recent 500 logs are displayed. To download the full logs directly, go to the training result bucket that correspond to the training run ID and download the log files.
Return to the Overview area. To compare multiple runs, click Compare Runs. Here you can find all of the hyperparameters that were used.
In the Completed section you can find the model metrics and compare them.

Saving as a model and deploying

After a job completes successfully, you can save it as a model and publish it to the IBM Watson Machine Learning service repository on IBM Cloud.

Go to the Experiment Builder window and find the job, click Actions > Save model
Type a name and description and click Save.
Go to the Watson Studio Projects page.
From the assets page, find and open the model.
Review the model details.
To deploy the model, navigate to the Deployment tab and click Create Deployment.

Parent topic: Running deep learning experiments