You can run your experiments with hyperparameter optimization (HPO) to find the best quality model automatically.
Attention: Support for Deep Learning as a Service and the Deep Learning Experiment Builder on Cloud Pak for Data as a Service is deprecated and will be discontinued on April 2, 2022. See Service plan changes and deprecations for details.
Prerequisites
Before you use hyperparameter optimization methods to create training runs, follow the coding standards outlined in the Coding guidelines for deep learning programs section so that your script can be processed without error.
Introduction to HPO
Hyperparameter Optimization (HPO) is a mechanism for automatically exploring a search space of potential Hyperparameters, building a series of models and comparing the models using metrics of interest. To use HPO you must specify ranges of values to explore for each Hyperparameter.
Optimization algorithms
Currently two HPO methods are supported.
random
implements a simple algorithm which will randomly assign Hyperparameter values from the ranges specified for an experiment.
rbfopt
uses a technique called RBFOpt to explore the search space. Determining the parameters of a neural network effectively is a challenging problem because of the extremely large configuration space (for instance, the number of nodes per layer, activation functions, learning rates, drop-out rates, and filter sizes) and the computational cost of evaluating a proposed configuration (evaluating a single configuration can take hours to days). To address this problem, the rbfopt algorithm uses a model-based global optimization algorithm that does not require derivatives. Similar to Bayesian optimization, which fits a Gaussian model to the unknown objective function, this approach fits a radial basis function model. The underlying optimization software is open source and available here: RbfOpt: A blackbox optimization library in Python
An example application of RbfOpt in the context of neural networks is available here: A. Fokoue, G. Diaz, G. Nannicini, H. Samulowitz. An effective algorithm for hyperparameter optimization of neural networks. IBM Journal of Research and Development, 61(4-5), 2017.
HPO requires you to set an upper limit on the number of models that it will build. If you are tuning N Hyperparameters, the rbfopt method uses the first N+1 models as baseline models and only then starts the actual optimization. If your budget does not allow more than N+1 models to be built, use the random method instead.
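The N+1 rule above can be checked before you submit an experiment. The following is a minimal sketch, not part of the service: the function name rbfopt_optimization_steps is hypothetical, and it only restates the arithmetic described in this section.

```python
def rbfopt_optimization_steps(num_hyperparameters, model_budget):
    """Return how many models remain for actual optimization with rbfopt.

    Hypothetical helper: rbfopt spends the first N+1 models on baselines,
    where N is the number of Hyperparameters being tuned.
    """
    if num_hyperparameters < 1:
        raise ValueError("at least one Hyperparameter must be tuned")
    if model_budget < 1:
        raise ValueError("the model budget must be at least 1")
    baseline_models = num_hyperparameters + 1
    # Anything left after the baselines is available for optimization.
    return max(0, model_budget - baseline_models)


# With 4 Hyperparameters and a budget of 4 models nothing is left for
# optimization, so the random method is the better choice.
if rbfopt_optimization_steps(num_hyperparameters=4, model_budget=4) == 0:
    print("budget too small for rbfopt, consider the random method")
```

A result of 0 means the whole budget would be spent on baseline models.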
Requirements
Write your code to use HPO
To use HPO, you can largely reuse the code from your non-HPO experiments and training runs. However, there are two important considerations.
Obtaining Hyperparameter values specified by the HPO algorithm
If your code is running as part of an HPO experiment, it needs to obtain the values that the HPO algorithm assigns to each Hyperparameter. The Hyperparameters are supplied as a JSON-formatted dictionary in a file called config.json, located in the current folder. The file can be read using the following example snippet (which expects Hyperparameters to be defined for initial_learning_rate and total_iterations):
import json
# Read the Hyperparameter values assigned by the HPO algorithm for this run.
with open("config.json") as f:
    hyper_params = json.load(f)
learning_rate = float(hyper_params["initial_learning_rate"])
training_iters = int(hyper_params["total_iterations"])
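When the same script also runs outside an HPO experiment, config.json may not exist. The following sketch shows one way to fall back to defaults in that case. The helper load_hyper_params and the values in DEFAULT_HYPER_PARAMS are assumptions made for illustration; they are not part of the service.

```python
import json
import os

# Hypothetical fallback values, used only when no config.json is present.
DEFAULT_HYPER_PARAMS = {"initial_learning_rate": 0.001, "total_iterations": 4}


def load_hyper_params(path="config.json", defaults=DEFAULT_HYPER_PARAMS):
    """Return the defaults, overridden by any values found in the JSON file."""
    params = dict(defaults)
    if os.path.exists(path):
        with open(path) as f:
            params.update(json.load(f))
    return params


hyper_params = load_hyper_params()
learning_rate = float(hyper_params["initial_learning_rate"])
training_iters = int(hyper_params["total_iterations"])
```

Values in config.json take precedence, so the values assigned by the HPO algorithm are used whenever the file is present.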
Logging metrics for collection by the HPO algorithm
At the end of your training run, your code needs to create a file called $RESULT_DIR/val_dict_list.json with the series of test metrics generated during each epoch. The HPO algorithm analyzes this file and uses the statistics in it to guide the choice of Hyperparameters in subsequent runs.
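As a framework-independent illustration, the sketch below writes such a file from a list of per-epoch metrics. The record layout (a list of dictionaries with a steps key plus one key per metric) is inferred from the tensorflow example in this document, and the helper name write_val_dict_list is hypothetical.

```python
import json
import os


def write_val_dict_list(test_metrics, result_dir):
    """Write per-epoch test metrics to <result_dir>/val_dict_list.json.

    test_metrics is a list of (epoch, {metric_name: value}) tuples.
    """
    records = []
    for epoch, metrics in test_metrics:
        record = {"steps": epoch}
        # Convert to plain floats so framework scalar types serialize cleanly.
        record.update({name: float(value) for name, value in metrics.items()})
        records.append(record)
    path = os.path.join(result_dir, "val_dict_list.json")
    with open(path, "w") as f:
        json.dump(records, f)
    return path


# Typical call at the end of a training run:
# write_val_dict_list(history, os.environ["RESULT_DIR"])
```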
Logging metrics - tensorflow and keras
For code using tensorflow or keras, you should log metrics using tensorboard as in the tensorflow example here (note the location of the tensorboard directory tb_directory) and create the val_dict_list.json file explicitly.
import json
import os
import tensorflow as tf  # TensorFlow 1.x API (tf.gfile, tf.summary.FileWriter)

# sess, merged, accuracy, x, y, mnist and total_epochs are defined elsewhere in the training program.
tb_directory = os.environ["LOG_DIR"]+"/"+os.environ["SUBID"]+"/logs/tb"
tf.gfile.MakeDirs(tb_directory)
test_writer = tf.summary.FileWriter(tb_directory+'/test')

test_metrics = []
for epoch in range(0,total_epochs):
    # FIRST perform training for epoch
    # THEN compute test metrics into test_summary for epoch and write them to the tensorboard log.
    test_summary, test_acc = sess.run([merged, accuracy], feed_dict={x: mnist.test.images, y: mnist.test.labels})
    test_writer.add_summary(test_summary, epoch)
    # FINALLY also record the test metrics for this epoch into a history
    test_metrics.append((epoch, {"accuracy": float(test_acc)}))

# once training is complete, write test metrics out to file $RESULT_DIR/val_dict_list.json
training_out = []
for test_metric in test_metrics:
    out = {'steps':test_metric[0]}
    for (metric,value) in test_metric[1].items():
        out[metric] = value
    training_out.append(out)

with open('{}/val_dict_list.json'.format(os.environ['RESULT_DIR']), 'w') as f:
    json.dump(training_out, f)
Logging metrics - pytorch
For code using pytorch APIs, use the following logging approach, which is based on the Python code in emetrics.py.
You can download this Python file and add it to your model definition zip, then import it and use it in your program as shown in the following snippet.
from emetrics import EMetrics
import os

with EMetrics.open(os.environ["SUBID"]) as em:
    for epoch in range(0,total_epochs):
        # perform training for epoch
        # compute training metrics and assign to values train_accuracy, train_loss etc
        em.record(EMetrics.TRAIN_GROUP,epoch,{'accuracy': train_accuracy, 'loss': train_loss})
        # compute test(validation) metrics and assign to values test_accuracy, test_loss etc
        # NOTE for these use the group EMetrics.TEST_GROUP so that the service will recognize these metrics as computed on test/validation data
        em.record(EMetrics.TEST_GROUP,epoch,{'accuracy': test_accuracy, 'loss': test_loss})
Note: the metrics collection object returned from EMetrics.open(...) must be closed when training is complete. If necessary, call its close() method explicitly. In the preceding example code, the object is closed implicitly when the with statement completes. When the object is closed, the file $RESULT_DIR/val_dict_list.json is generated and the information is passed back to the HPO algorithm.
Experiments with HPO
Define your experiments as described in working with experiments.
To use HPO you will need to create an experiment with a training_reference which includes an extra hyper_parameters_optimization section. This section provides the configuration for the HPO algorithm.
There are two sub-sections within hyper_parameters_optimization to consider:
hyper_parameters_optimization.method
is the first sub-section, which configures the HPO algorithm.
hyper_parameters_optimization.method.name
specifies the HPO algorithm to use, either random or rbfopt.
hyper_parameters_optimization.method.parameters
provides a list of parameters to the algorithm. You will need to provide the following parameters:
hyper_parameters_optimization:
  method:
    name: random
    parameters:
    - name: objective
      string_value: accuracy
    - name: maximize_or_minimize
      string_value: maximize
    - name: num_optimizer_steps
      int_value: 4
where objective specifies a test metric which your program records and which HPO will use to compare models, maximize_or_minimize specifies whether HPO should attempt to minimize or maximize the metric, and num_optimizer_steps sets an upper bound on the number of models which HPO will train.
hyper_parameters_optimization.hyper_parameters
is the second sub-section, which configures a list of named Hyperparameters. For each Hyperparameter, specify the ranges of values which HPO may select for that Hyperparameter when training models. Your code must be ready to read values for each of these named Hyperparameters from the config.json file, as discussed in Obtaining Hyperparameter values specified by the HPO algorithm.
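To illustrate how objective and maximize_or_minimize work together, the sketch below picks the best of several runs from their val_dict_list.json records. It is only an illustration: the helper best_run is hypothetical, and scoring each run by its last recorded value is an assumption, not a description of how the service scores runs.

```python
def best_run(runs, objective, maximize_or_minimize):
    """Pick the best run name.

    runs maps a run name to its list of val_dict_list.json records,
    for example [{"steps": 0, "accuracy": 0.8}, ...].
    """
    if maximize_or_minimize not in ("maximize", "minimize"):
        raise ValueError("maximize_or_minimize must be 'maximize' or 'minimize'")
    # ASSUMPTION: each run is scored by the objective value of its last record.
    scores = {name: records[-1][objective] for name, records in runs.items() if records}
    pick = max if maximize_or_minimize == "maximize" else min
    return pick(scores, key=scores.get)


example_runs = {
    "run_a": [{"steps": 0, "accuracy": 0.80}, {"steps": 1, "accuracy": 0.90}],
    "run_b": [{"steps": 0, "accuracy": 0.95}, {"steps": 1, "accuracy": 0.85}],
}
print(best_run(example_runs, "accuracy", "maximize"))
```

The metric name passed as objective must match a key that your program writes to val_dict_list.json.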
Defining ranges of values for Hyperparameters
Use double_range or int_range to define ranges of values for a Hyperparameter, with an optional step. The following entry in the hyper_parameters list specifies that the double values 0.005, 0.006, 0.007, 0.008, 0.009 and 0.01 may be used in the Hyperparameter called learning_rate.
- name: learning_rate
  double_range:
    min_value: 0.005
    max_value: 0.01
    step: 0.001
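The values implied by a stepped range can be enumerated locally to confirm that the range covers what you intend. This is a minimal sketch with a hypothetical helper, expand_double_range. It uses Decimal arithmetic so that repeated addition of the step does not accumulate floating-point error.

```python
from decimal import Decimal


def expand_double_range(min_value, max_value, step):
    """List the values from min_value to max_value (inclusive) in increments of step."""
    low, high, increment = (Decimal(str(v)) for v in (min_value, max_value, step))
    if increment <= 0:
        raise ValueError("step must be positive")
    values = []
    current = low
    while current <= high:
        values.append(float(current))
        current += increment
    return values


print(expand_double_range(0.005, 0.01, 0.001))
```

For the learning_rate entry above, this yields the six values listed in the text.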
An optional power attribute is also provided. In the following example, the fc Hyperparameter values can vary from 2^5 (32) to 2^10 (1024):
- name: fc
  int_range:
    min_value: 5
    max_value: 10
    power: 2
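A short sketch of how such a range could expand: the helper expand_power_range is hypothetical, and it assumes that the range bounds are exponents, that every integer exponent between them is a candidate, and that the default step is 1. The text above confirms only the two end points.

```python
def expand_power_range(min_value, max_value, power, step=1):
    """Raise power to each exponent from min_value to max_value (inclusive)."""
    # ASSUMPTION: intermediate exponents are candidates and the default step is 1.
    return [power ** exponent for exponent in range(min_value, max_value + 1, step)]


fc_values = expand_power_range(5, 10, 2)
print(fc_values[0], fc_values[-1])
```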
Use double_values, int_values or string_values to define a set of values that can be used:
- name: learning_rate
  double_values: [ 0.001, 0.002, 0.005 ]
The full experiment manifest example is:
settings:
  name: Experiment1
  description: This is a sample experiment
  author:
    name: Bob Smith
    email: [email protected]
training_references:
- name: model1
  training_definition_url: https://ibm-watson-ml.mybluemix.net/v3/ml_assets/training_definitions/5b21ea03-0936-4829-a96c-946638e84d6d
  command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
    --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
    --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 4
  compute_configuration:
    name: k80
  hyper_parameters_optimization:
    method:
      name: random
      parameters:
      - name: objective
        string_value: accuracy
      - name: maximize_or_minimize
        string_value: maximize
      - name: num_optimizer_steps
        int_value: 4
    hyper_parameters:
    - name: learning_rate
      double_range:
        min_value: 0.005
        max_value: 0.01
        step: 0.001
    - name: conv_filter_size1
      int_range:
        min_value: 5
        max_value: 6
    - name: conv_filter_size2
      int_range:
        min_value: 5
        max_value: 6
    - name: fc
      int_range:
        min_value: 5
        max_value: 10
        power: 2
training_results_reference:
  name: training-results-reference_name
  connection:
    endpoint_url: <URL>
    access_key_id: <ACCESS KEY>
    secret_access_key: <SECRET ACCESS KEY>
  target:
    bucket: training-data
  type: s3
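Before submitting a manifest like this one, it can help to compare the model budget with the size of the search space and with the N+1 baseline rule for rbfopt. The following sketch does that for the ranges in the manifest. The dictionary simply mirrors the hyper_parameters list, and count_values is a hypothetical helper that assumes an int_range without a step advances by 1.

```python
import math

# Mirrors the hyper_parameters list and num_optimizer_steps of the manifest.
hyper_parameters = [
    {"name": "learning_rate",
     "double_range": {"min_value": 0.005, "max_value": 0.01, "step": 0.001}},
    {"name": "conv_filter_size1", "int_range": {"min_value": 5, "max_value": 6}},
    {"name": "conv_filter_size2", "int_range": {"min_value": 5, "max_value": 6}},
    {"name": "fc", "int_range": {"min_value": 5, "max_value": 10, "power": 2}},
]
num_optimizer_steps = 4


def count_values(entry):
    """Count the distinct values one hyper_parameters entry can take."""
    for key in ("double_values", "int_values", "string_values"):
        if key in entry:
            return len(entry[key])
    for key in ("double_range", "int_range"):
        if key in entry:
            spec = entry[key]
            if key == "double_range" and "step" not in spec:
                raise ValueError("a double_range without a step is not a finite grid")
            step = spec.get("step", 1)  # ASSUMPTION: int_range steps by 1 by default
            span = spec["max_value"] - spec["min_value"]
            # The small epsilon absorbs floating-point error in span / step.
            return int(math.floor(span / step + 1e-9)) + 1
    raise ValueError("no range or value list in entry: %r" % (entry,))


search_space_size = math.prod(count_values(entry) for entry in hyper_parameters)  # Python 3.8+
baseline_models = len(hyper_parameters) + 1
print(search_space_size, num_optimizer_steps, baseline_models)
```

With these ranges the grid holds 6 x 2 x 2 x 6 = 144 combinations. A budget of 4 models is below the N+1 = 5 baseline models that rbfopt would need, which is consistent with the manifest's choice of the random method.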
Next Steps
- Check out the sample notebooks that use Keras with HPO.
- Create your own new training runs.
- Go in depth with the following Developer Works article: Introducing deep learning and long-short term memory networks.