Coding guidelines for deep learning programs

In general you should be able to easily run your existing deep learning programs using the service. There are a few caveats including how your program will read data and save a trained model, but the modifications required should be relatively straightforward.

Obtaining the input data

Input data files located in the specified COS bucket are made available to your program in the folder named by the environment variable DATA_DIR.

# if your code needs to open the file imagedata.csv
# from within your training_data_reference bucket
# use the following approach

import os

input_data_folder = os.environ["DATA_DIR"]

imagefile = open(os.path.join(input_data_folder, "imagedata.csv"))
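If you are unsure which files were copied over from the bucket, you can list the contents of the data folder. The following is a minimal sketch; the fallback to the current directory when DATA_DIR is unset is only for local testing and is not part of the service behaviour.

```python
import os

def list_input_files(folder):
    """Return the sorted names of the input files available in the data folder."""
    return sorted(os.listdir(folder))

# DATA_DIR is set by the service; fall back to the current directory for local testing
input_data_folder = os.environ.get("DATA_DIR", ".")
print(list_input_files(input_data_folder))
```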

Writing the trained model

Output your trained model to a folder named model under the folder specified in the environment variable RESULT_DIR. When you store a training run in the repository, this model folder is saved.

import os

output_model_folder = os.environ["RESULT_DIR"]
output_model_path = os.path.join(output_model_folder, "model")

mymodel = train_model()
# save mymodel into output_model_path using your framework's save function
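The save call itself is framework-specific (for example, Keras has model.save and TensorFlow has SavedModel export). As a framework-neutral sketch, the following uses pickle with a stand-in model object, and falls back to the current directory when RESULT_DIR is unset so it can be tried locally; both choices are illustrative assumptions, not service requirements.

```python
import os
import pickle

# RESULT_DIR is set by the service; fall back to the current directory for local testing
output_model_folder = os.environ.get("RESULT_DIR", ".")
output_model_path = os.path.join(output_model_folder, "model")
os.makedirs(output_model_path, exist_ok=True)

def save_model(model, model_dir):
    """Serialize a (pickle-able) model into the model folder."""
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)

# stand-in for a real trained model object
mymodel = {"weights": [0.1, 0.2, 0.3]}
save_model(mymodel, output_model_path)
```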

Writing to the log

Just write to stdout (for example in the Python programming language, use the print function) and the output will be collected by the service and made available when you monitor a running job or obtain the resulting log.

print("starting to train model...")
mymodel = train_model()
print("model training completed..., starting evaluation")
print("model evaluation completed.")

Disclaimer: Client’s use of the deep learning training process includes the ability to write to the training log files. Personal data must not be written to these training log files as they are accessible to other users within Client’s Enterprise as well as to IBM as necessary to support the Cloud Service.

Reading Hyperparameters

If your code is running as part of a larger experiment, it may need to obtain values for hyperparameters defined in the experiment. The hyperparameters are supplied as a JSON-formatted dictionary in a file called config.json, located in the current folder, and can be read using the following example snippet (which expects hyperparameters to be defined for initial_learning_rate and total_iterations):

import json

hyper_params = json.loads(open("config.json").read())
learning_rate = float(hyper_params["initial_learning_rate"])
training_iters = int(hyper_params["total_iterations"])
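When the same code also runs outside an experiment, config.json may not exist. One way to handle this is to fall back to defaults; the helper below and the default values in it are a sketch of this pattern, not part of the service API.

```python
import json
import os

def read_hyperparameters(path="config.json", defaults=None):
    """Read hyperparameters from config.json, falling back to defaults when absent."""
    params = dict(defaults or {})
    if os.path.exists(path):
        with open(path) as f:
            params.update(json.load(f))
    return params

# default values here are purely illustrative
hyper_params = read_hyperparameters(
    defaults={"initial_learning_rate": 0.001, "total_iterations": 1000})
learning_rate = float(hyper_params["initial_learning_rate"])
training_iters = int(hyper_params["total_iterations"])
```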

Computing and sending metrics

You can generate metrics of interest using techniques specific to the framework being used. You can simply print metric values to the logs in an unstructured way, but you can also send metrics to the service in a structured way, so that they can be extracted and monitored.

Computing and sending metrics for TensorFlow

For the TensorFlow framework, you can use TensorBoard-style summary metrics with the following configuration, where the TensorBoard summaries are written to a particular folder in the file system and collected from there by the service.

The folder should be named $LOG_DIR/logs/tb/<folder>, where <folder> is usually test (for test/validation metrics not computed on the training data) or train (for metrics computed on the same data used to train the model).

In your TensorFlow Python code, first configure the location where the metrics will be written, and create the folder as follows:

import os

tb_directory = os.environ["LOG_DIR"] + "/logs/tb"
os.makedirs(tb_directory, exist_ok=True)

Now in your code you can compute metrics and write them to an appropriately named sub-folder, for example to create a writer for test metrics, see the following snippet:

import tensorflow as tf

test_writer = tf.summary.FileWriter(tb_directory + '/test')

for epoch in range(0, total_epochs):
    # perform training for epoch
    # compute test metrics into test_summary for the epoch
    # and write them to the tensorboard log
    test_writer.add_summary(test_summary, epoch)

Computing and sending metrics for Keras

For metrics to be collected by the service from a Keras program running on the TensorFlow framework, they should be written to TensorBoard-style logs exactly as described for TensorFlow in the previous section.

For Keras this can be achieved using two TensorBoard callbacks and a simple splitter class called MetricsSplitter that routes metrics to the correct TensorBoard instance. The following code shows how to create the callbacks.

import os
from keras.callbacks import TensorBoard


# create TensorBoard instance for writing test/validation metrics
tb_directory_test = os.environ["LOG_DIR"]+"/logs/tb/test"
tensorboard_test = TensorBoard(log_dir=tb_directory_test)

# create TensorBoard instance for writing training metrics
tb_directory_train = os.environ["LOG_DIR"]+"/logs/tb/train"
tensorboard_train = TensorBoard(log_dir=tb_directory_train)


The MetricsSplitter class is defined here:

from keras.callbacks import Callback

class MetricsSplitter(Callback):

    def __init__(self, train_tb, test_tb):
        super(MetricsSplitter, self).__init__()
        self.test_tb = test_tb   # TensorBoard callback to handle test metrics
        self.train_tb = train_tb # TensorBoard callback to handle training metrics

    def set_model(self, model):
        self.test_tb.set_model(model)
        self.train_tb.set_model(model)

    def isTestMetric(self, metricName):
        # metrics whose names start with "val" are computed on validation/test data
        return metricName.startswith("val")

    def on_epoch_end(self, epoch, logs=None):
        # divide metrics up into test and train and route each group
        # to the appropriate TensorBoard instance
        logs = logs or {}
        train_logs = {}
        test_logs = {}
        for metric in logs.keys():
            if self.isTestMetric(metric):
                test_logs[metric] = logs[metric]
            else:
                train_logs[metric] = logs[metric]
        self.test_tb.on_epoch_end(epoch, test_logs)
        self.train_tb.on_epoch_end(epoch, train_logs)

    def on_train_end(self, x):
        self.test_tb.on_train_end(x)
        self.train_tb.on_train_end(x)
Computing and sending metrics for Caffe

Metrics output from your Caffe 1.0 program should be automatically collected by the service.

Computing and sending metrics for PyTorch

For the PyTorch framework, please use the following logging approach based on the Python code in emetrics.py. Download this Python file and add it to your training definition zip, then import it and use it in your program.

from emetrics import EMetrics

with EMetrics.open() as em:  # open() is assumed here; check emetrics.py for the exact factory method
    for epoch in range(0, total_epochs):
        # perform training for epoch
        # compute training metrics and assign to values train_metric1, train_metric2 etc
        em.record("training", epoch, {'metric1': train_metric1, 'metric2': train_metric2})
        # compute test metrics and assign to values test_metric1, test_metric2 etc
        # NOTE: use the group EMetrics.TEST_GROUP for these so that the service will
        # recognize them as computed on test/validation data
        em.record(EMetrics.TEST_GROUP, epoch, {'metric1': test_metric1, 'metric2': test_metric2})

Computing and sending metrics for code running in an HPO experiment

When preparing Python code to run within an experiment with Hyperparameter Optimization (HPO) please note the following requirements:

Direct TensorBoard logs to a folder incorporating the value of the SUBID environment variable. (SUBID is an environment variable set by WML to the HPO iteration number 0, 1, 2, …; the logs written by each iteration need to be kept separate.)

tb_directory = os.environ["LOG_DIR"]+"/"+os.environ["SUBID"]+"/logs/tb"
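The path construction above can be wrapped in a small helper. In the sketch below, the helper name and the fallbacks used when LOG_DIR or SUBID are unset are assumptions for local testing only; under the service both variables are always set.

```python
import os

def hpo_tb_directory(log_dir, subid):
    """Build the per-iteration TensorBoard log folder for an HPO run."""
    return os.path.join(log_dir, subid, "logs", "tb")

# LOG_DIR and SUBID are set by the service; the fallbacks are for local testing only
tb_directory = hpo_tb_directory(os.environ.get("LOG_DIR", "."),
                                os.environ.get("SUBID", "0"))
os.makedirs(tb_directory, exist_ok=True)
```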

At the end of the training run, write out a list of validation metric values computed at regular intervals as the model was trained; the final metric value should reflect the performance of the final model. The metrics should be written as a JSON array to the file $RESULT_DIR/val_dict_list.json, as in the following example (for a validation metric called accuracy, collected after each of 10 epochs):

[
   { "steps":1, "accuracy":0.07},
   { "steps":2, "accuracy":0.34},
   { "steps":3, "accuracy":0.45},
   { "steps":4, "accuracy":0.68},
   { "steps":5, "accuracy":0.67},
   { "steps":6, "accuracy":0.89},
   { "steps":7, "accuracy":0.93},
   { "steps":8, "accuracy":0.94},
   { "steps":9, "accuracy":0.94},
   { "steps":10, "accuracy":0.95}
]

The following code snippet can be used to write the JSON in this example:

import json
import os

# train the model
# obtain a list of validation metrics, collected during training after each epoch
# for example:
accuracies = [0.07, 0.34, 0.45, 0.68, 0.67, 0.89, 0.93, 0.94, 0.94, 0.95]

# now write the values out to a JSON formatted file
training_out = []
for i in range(len(accuracies)):
    training_out.append({'steps': (i + 1), 'accuracy': accuracies[i]})

with open('{}/val_dict_list.json'.format(os.environ['RESULT_DIR']), 'w') as f:
    json.dump(training_out, f)