As a data scientist, you need to train thousands of models to identify the right combination of data and hyperparameters to optimize the performance of your neural networks. You want to run more experiments, and run them faster. You want to train deeper neural networks and explore more complicated hyperparameter spaces. IBM Watson Machine Learning accelerates this iterative cycle by simplifying the process of training models in parallel in auto-allocated GPU compute containers.
Prerequisites for using Experiment Builder
To use Experiment Builder, you must provision the following assets:
- IBM Watson Machine Learning service instance
- IBM Cloud Object Storage, with a bucket for your training results and a bucket for your training source
You must create these buckets outside of the Watson Studio project so that they are not deleted if the project is ever deleted. There are several ways to upload data to these buckets. For example, you can upload data via IBM Cloud, use the AWS CLI, or use an FTP tool.
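The upload can also be scripted. The sketch below is a minimal example, assuming the `ibm-cos-sdk` Python package (`ibm_boto3`); the endpoint, credentials, bucket, and file names are placeholders, not values from your service.

```python
# Hedged sketch: uploading training data to an IBM Cloud Object Storage bucket
# with the ibm-cos-sdk package (ibm_boto3). The endpoint, credentials, bucket,
# and file names are placeholders -- substitute values from your own service.
import os

def object_keys(prefix, files):
    """Object keys under which the local files are stored in the bucket."""
    return [prefix + "/" + os.path.basename(f) for f in files]

def upload_training_data(cos, bucket, prefix, files):
    """Upload each local file to the given bucket under a common prefix."""
    for local_path, key in zip(files, object_keys(prefix, files)):
        cos.upload_file(Filename=local_path, Bucket=bucket, Key=key)

# Example wiring (placeholders; not executed here):
# import ibm_boto3
# from ibm_botocore.client import Config
# cos = ibm_boto3.client(
#     "s3",
#     ibm_api_key_id="<api-key>",
#     ibm_service_instance_id="<service-instance-crn>",
#     config=Config(signature_version="oauth"),
#     endpoint_url="https://s3.us.cloud-object-storage.appdomain.cloud",
# )
# upload_training_data(cos, "training-source-bucket", "mnist",
#                      ["data/train.npz", "data/test.npz"])
```

Keeping all of a run's data under one prefix makes it easy to point a training source connection at just that data set.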
- A training definition (an internal training specification), which stores metadata about how a model is to be trained
You can use the Neural Network Design Flow Editor to define a neural network model and create a training definition. One training definition is required for each training run. For example, if you want to compare different runtimes, you must configure one training definition for each runtime.
- A Python execution script, which is used to deliver metrics to the training run
Review the Coding guidelines for deep learning programs to ensure that all scripts and manifest files comply with the requirements.
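As a rough, hedged sketch of the general shape such a script can take: read the data and result locations from the environment and print metric lines so that the service can collect them. The `DATA_DIR`/`RESULT_DIR` variable names and the metric line format below are assumptions; the coding guidelines define the exact contract your script must follow.

```python
# Hedged sketch of a training script skeleton. The DATA_DIR / RESULT_DIR
# environment variable names and the printed metric format are assumptions --
# check the coding guidelines for the exact contract.
import os

def format_metrics(epoch, loss, accuracy):
    """Format one metrics line for the training log."""
    return "epoch {}: loss {:.4f} accuracy {:.4f}".format(epoch, loss, accuracy)

data_dir = os.environ.get("DATA_DIR", ".")      # where training data is mounted
result_dir = os.environ.get("RESULT_DIR", ".")  # where results are written

for epoch in range(1, 4):
    # ... train one epoch against the files under data_dir ...
    print(format_metrics(epoch, loss=1.0 / epoch, accuracy=1 - 0.5 / epoch))
# ... save the trained model under result_dir ...
```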
Create a new experiment
- Open a project that has the required service instances set up. If the correct service instances are not yet attached to the project, you will be prompted to create new service instances as part of defining the experiment details.
- Click New Experiment.
Define your data connection
Use Experiment Builder to access existing connections or create new connections to your data assets in IBM Cloud Object Storage. If a data connection doesn't already exist, you must create one.
Security note: Although it is possible to reuse IBM Cloud Object Storage connections, to maintain security, a connection should be used only for a specific experiment. Why? The credentials used for an experiment must be granted write access so that the assets generated during training can be stored. For this reason, reusing connections is not recommended.
- Type a name and description for this experiment.
- Select the Machine Learning service instance.
- In the IBM Cloud Object Storage section, click Select.
- Choose an IBM Cloud Object Storage connection or create a new one.
- Choose buckets for the training results and training source, or create a new bucket. Although you can use the same bucket for both, choose different buckets so that the source and target are kept separate, which makes managing large data sets easier. If you create a new bucket, you must upload the training data assets before initiating the training run.
- Click Create.
Associate training definitions
You must associate one or more training definitions with this experiment. The training definitions can be a mix of existing ones and ones that you create as part of the process.
- Click Add training definition.
Choose whether to create a new training definition or use an existing training definition.
- To create a new training definition, click the New training definition tab.
- To choose an existing training definition, click the Existing training definition tab. From the table, select the training definition, and then select the buckets to be used for your training results and training source. If you select an existing training definition, you cannot view or modify the training attributes; however, you can change the compute plan.
Define a new training definition
- Type a name and a description.
- Choose a .zip file that contains the Python code that you set up to deliver metrics for training runs. For more information about these requirements, see the coding guidelines.
- From the Framework box, select the appropriate framework. This must be compatible with the code you use in the Python file.
- In the Execution command box, type the command that is used to run the Python code.
- It must reference the .py file.
- It must indicate the data buckets.
- In the Training definition attributes section, from the Compute plan box, select a compute plan, which determines the number and size of GPUs to use for the experiment. The following compute tiers are available:
Lite plan users are limited to the
- From the Hyperparameter optimization method box, select a method.
- Click Create.
Select an existing training definition
You might have created a training definition by using the Neural Network Design Flow Editor. You can use that definition, or any compatible training definition, as part of your experiment. You must make several selections to identify it to the experiment.
- From the Existing training definition box, select a training definition.
- In the Training definition attributes section, from the Compute plan box, select a compute plan, which determines the number and size of GPUs to use for the experiment. The following compute tiers are available:
- From the Hyperparameter optimization method box, select a method.
- Click Select.
Create an HPO experiment
Hyperparameter optimization enables your experiment to run against an array of parameters and find the most accurate models for you to deploy. You can choose to run the following HPO options:
- None: No hyperparameters are used in the training runs. You must create all training runs manually to use in the experiment.
- rbfopt: Uses a technique called RBFOpt to explore the search space. Determining the hyperparameters of a neural network is a challenging problem because of the extremely large configuration space (for instance, how many nodes per layer, activation functions, learning rates, drop-out rates, filter sizes) and the computational cost of evaluating a proposed configuration; evaluating a single configuration can take hours to days. To address this problem, the rbfopt algorithm uses a model-based global optimization approach that does not require derivatives. Similarly to Bayesian optimization, which fits a Gaussian model to the unknown objective function, this approach fits a radial basis function model. The underlying optimization software for the rbfopt algorithm is open source. For more information, see RbfOpt: A blackbox optimization library in Python.
- Random: Implements a simple algorithm that randomly assigns hyperparameter values from the ranges specified for an experiment.
When you choose to run with HPO, the number of optimizer steps equates to the number of training runs that are executed.
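As a hedged illustration of the Random method, the sketch below draws one configuration per optimizer step from a made-up hyperparameter space; each drawn configuration corresponds to one training run. This is a didactic stand-in, not the service's actual implementation.

```python
# Hedged illustration of the Random method: each optimizer step draws one
# hyperparameter configuration at random, and each configuration becomes one
# training run. The hyperparameter space below is a made-up example.
import random

def random_search(space, steps, seed=0):
    """Draw `steps` configurations from a dict of {name: candidate_values}."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(steps)]

space = {
    "learning_rate": [0.001, 0.01, 0.1],  # distinct values
    "batch_size": [32, 64, 128],
}
runs = random_search(space, steps=4)
# The number of optimizer steps equals the number of training runs here,
# so len(runs) == 4.
```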
- Click Add hyperparameter.
Define your hyperparameters
- Type a name for this hyperparameter.
Choose whether to have distinct values or a range of values.
- To use distinct values, click Distinct values and then list the values separated by commas (,) in the Values box.
- To use a range of values, click Range, type the Lower bound and Upper bound, and then choose whether to traverse the range either by a power (exponential) or a step. You must then enter the exponent or step value.
Choose the data type of the range values or distinct values. Watson automatically reads the data and selects the most likely type; however, you can change the default. The following data types are available:
- To create this hyperparameter and then add another, click Add and Create Another.
- To create this hyperparameter and return to the training definition window, click Add.
Depending on what you specify, the number of training runs can grow exponentially with the number of hyperparameters. Experiment Builder tries to maximize accuracy without wasting runs.
- Examine the results of your work. You might have created only a single training definition, but because multiple hyperparameters are used, you might see a large set of training runs.
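The way a range expands into concrete values, and why runs multiply, can be sketched as follows. The step and power expansion rules below are assumptions for illustration, not the service's exact behavior.

```python
# Hedged sketch of how a range hyperparameter might expand into concrete
# values. The "step" and "power" traversal names come from the dialog; the
# exact expansion rules used by the service are assumptions here.
def expand_step(lower, upper, step):
    """Lower bound to upper bound, inclusive, advancing by a fixed step."""
    values, v = [], lower
    while v <= upper:
        values.append(v)
        v += step
    return values

def expand_power(lower, upper, base):
    """Powers of `base` that fall within [lower, upper]."""
    values, v = [], 1
    while v <= upper:
        if v >= lower:
            values.append(v)
        v *= base
    return values

# Two ranges combine into len(a) * len(b) training runs -- hence the
# warning about exponential growth as hyperparameters are added.
a = expand_step(10, 100, 30)  # -> [10, 40, 70, 100]
b = expand_power(1, 16, 2)    # -> [1, 2, 4, 8, 16]
```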
Create and run the experiment
After you define or select the training definition file, it appears in the list of training definitions. Because a training definition must have an execution command, you are given the option of setting a global execution command.
Choose whether to use a global execution command. Although training definition assets can have a saved execution command, you can use the global execution command setting to override the commands that are saved with the training definition files. It is also possible to define a training definition without specifying an execution command; in that case, the global execution command is used. Note that this setting overrides all training definitions, even ones with predefined execution commands.
- To use a global execution command, set the Use global execution command check box.
- To use the execution command specific to the training definition file, clear the Use global execution command check box.
- Click Create and run.
Training run results
After you create and run the training definition, you can find the new experiment asset listed in the IBM Watson Studio Projects area and also in the Watson Machine Learning service repository on IBM Cloud. The process of running the training definition also invokes the experiment run endpoint. It creates a manifest file and sends it to the IBM Watson deep learning service.
If you created a training definition without hyperparameter optimization, the runs are specific to each training definition that you provisioned. If you created a training definition with hyperparameter optimization, the service executes many training runs, based on the hyperparameter values and optimizer steps that you specified.
As the training run proceeds, the results are dynamically added to the display.
To see real-time results, in the In progress section, click a training run.
- On the Monitor tab, you can view metrics, such as accuracy, loss, and values.
- On the Overview tab, you can view the training definition, framework, and execution command that was used to create this run.
- On the Logs tab, you see a selection of logs. For performance reasons, only the most recent 500 log entries are displayed. To download the full logs, go to the training results bucket, find the folder that corresponds to the training run ID, and download the log files.
Return to the Overview area. To compare multiple runs, click Compare Runs. Here you can see all of the hyperparameters that were used.
- In the Completed section you can see the model metrics and compare them.
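Fetching the full logs can also be scripted against the results bucket. The sketch below is a hedged example, assuming an `ibm_boto3` ("s3") client and that log objects sit under a prefix matching the run ID; bucket names and layout are placeholders you should check against your own bucket.

```python
# Hedged sketch: downloading the full logs for a training run from the
# training results bucket with an ibm_boto3 ("s3") client. The bucket name,
# run ID, and the assumption that log objects sit under a prefix matching
# the run ID are placeholders -- check your bucket's actual layout.
def log_keys(listing, run_id):
    """Filter a bucket listing down to the log objects for one training run."""
    return [obj["Key"] for obj in listing
            if obj["Key"].startswith(run_id + "/") and obj["Key"].endswith(".log")]

# Example wiring (placeholders; not executed here):
# resp = cos.list_objects_v2(Bucket="training-results-bucket",
#                            Prefix="training-abc123/")
# for key in log_keys(resp.get("Contents", []), "training-abc123"):
#     cos.download_file(Bucket="training-results-bucket", Key=key,
#                      Filename=key.split("/")[-1])
```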
Add a training run to an experiment
Because an experiment is a live asset, you can add additional training runs to an experiment.
- From the Watson Studio action bar, click the experiment name.
- Click Add training definition.
Save as a model and deploy
After a job completes successfully, you can save it as a model and publish it to the IBM Watson Machine Learning service repository on IBM Cloud.
- Go to the Experiment Builder window, find the job, and click Actions > Save model.
- Type a name and description and click Save.
- Go to the Watson Studio Projects page.
- From the assets page, find and open the model.
- Review the model details.
- To deploy the model, navigate to the Deployment tab and click Create Deployment.