Coding guidelines for deep learning programs

In general, you can run your existing deep learning programs on the service with little change. A few caveats apply, chiefly how your program reads data and saves a trained model, so some modifications might be required.

Attention: Support for Deep Learning as a Service and the Deep Learning Experiment Builder on Cloud Pak for Data as a Service is deprecated and will be discontinued on April 2, 2022. See Service plan changes and deprecations for details.

Best practices for training runs and data storage

This topic provides best practices and guidelines for coding a deep learning experiment.

Training runs

When you submit your training job, you provide a manifest file and some code in a zip file. The manifest describes the resources that your job needs to run, such as the hardware you want to run your job on, where your training data is located, and which deep learning library you want to use. When you submit the job, this information is used to prepare the environment where the contents of the zip file will execute.

Your zip file contains the code necessary to run the training job, and it can be anything that can be executed, but is usually Python code or scripts. After the environment is set up, the contents of the zip file are extracted, and the specified command is executed. When the command completes, the environment is torn down, and resources are freed for the next job. No remnants of the job remain on the machine where it ran, so it is important to put results and other output in the correct location to ensure it is not lost at the end of the job.

Tips:

The command you specify runs under a Linux bash shell, so you can use bash features or specify a bash script as the command to run. This enables you to run background processes along with your main job, define your own environment variables, or install libraries before your job runs.
Your job is billed based on the size of the machine and on the start and end times of your command. If you start any background processes in your command, make sure they exit with the main process, or your job could run much longer than you expected. For example, to measure memory usage, start top in the background with a single & and stop it when the main program ends: top -bc & python runMyJob.py; pkill -9 -f top
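
If your entry point is a Python script rather than a bash one-liner, the same idea can be sketched with the standard subprocess module. The function name run_with_monitor and both command arguments are hypothetical; this is a minimal sketch, not something the service provides.

```python
import subprocess


def run_with_monitor(monitor_cmd, main_cmd):
    """Run main_cmd while monitor_cmd runs in the background.

    The monitor is always stopped when the main command ends, even if the
    main command fails, so it cannot keep the (billed) job alive.
    """
    monitor = subprocess.Popen(monitor_cmd)
    try:
        # Blocks until the main command exits and returns its exit code.
        return subprocess.call(main_cmd)
    finally:
        monitor.terminate()
        try:
            monitor.wait(timeout=10)
        except subprocess.TimeoutExpired:
            # The monitor ignored the polite request; force it to stop.
            monitor.kill()
            monitor.wait()
```

For example, run_with_monitor(["top", "-bc"], ["python", "runMyJob.py"]) would mirror the bash command in the tip above.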

Data management

Use IBM Cloud Object Storage to get large amounts of data into and out of the service. Create an instance of Cloud Object Storage in the same region where your service runs (for example, US Dallas), and use the endpoints associated with that location. Use the public endpoint to push data into Cloud Object Storage; the service uses the private endpoint to read your data. Similarly, the service pushes training results to the private endpoint, and you can then download them using the public endpoint. Refer to Using Cloud Object Storage best practices for managing data storage.

Tip: Preprocess your training data before uploading it to object storage. For example, don't upload raw images in folders; instead, resize the images and store them with their labels in dataset files such as HDF5 or TFRecord, each roughly 100 MB or larger. This reduces costs by reducing the size of the data stored, and speeds up training by reducing both I/O and compute time.
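
One way to decide which preprocessed items go into which dataset file is to group them greedily until each group reaches the target size. The following is only a sketch of that grouping step; plan_shards and its arguments are hypothetical names, and writing the actual HDF5 or TFRecord files is left to the library you use.

```python
def plan_shards(file_sizes, target_bytes=100 * 1024 * 1024):
    """Group (name, size_in_bytes) pairs, in order, into shards.

    A shard is closed as soon as its total size reaches target_bytes,
    so every shard except possibly the last is at least that large.
    """
    shards = []
    current, current_size = [], 0
    for name, size in file_sizes:
        current.append(name)
        current_size += size
        if current_size >= target_bytes:
            shards.append(current)
            current, current_size = [], 0
    if current:
        # Leftover items form a final, smaller shard.
        shards.append(current)
    return shards
```

Each returned group lists the items to store together in one dataset file.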

Running training jobs

The zip file you upload when launching your job is limited to 4 MB and is unique to each job. You would not want to upload your training data set or other large unchanging files with every run, and you need a way to save information such as log files, checkpoints, and trained models. Both needs are met by mounting two buckets from Cloud Object Storage into your machine, which makes them visible as directories containing files in the local filesystem rather than as objects in an external object store.

The first bucket is identified in the manifest as training_data_reference, and the environment variable $DATA_DIR points to the location where it is mounted. Every object in the bucket is available to the running job. This is a good place to put not only training data, but also any code or libraries that change infrequently, to reduce the size of your zip file. The contents of this mount are cached locally, and the mount is optimized for reading larger files. You can also write or delete files in this location, and any changes made while your job runs are reflected in the object storage bucket immediately.

The second bucket is identified in the manifest as training_results_reference. As with the first bucket, this is mounted as a local directory, all contents of the bucket are available to your job, and changes made by your program are reflected in object storage immediately. Unlike the first bucket, this one is optimized for writing, and is not cached, but you can also read from it. The $RESULT_DIR environment variable will point to a folder under your bucket with the name of the training ID (for example, bucket/training-XLsT8Gzmg). $CHECKPOINT_DIR will point to a folder under the bucket named “_wml_checkpoints” (for example, bucket/_wml_checkpoints). Anything you write to these locations will be visible in Cloud Object Storage when the file is closed on the machine by your program.
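
A program can gather the three locations described above in one place at startup. The sketch below assumes a hypothetical helper named job_directories; the "./..." fallback paths are an assumption added so the same script can run on a local machine, and are not something the service defines.

```python
import os


def job_directories(environ=None):
    """Collect the three mount locations described above into one dictionary."""
    environ = os.environ if environ is None else environ
    return {
        # The "./..." fallbacks are assumptions for local testing only;
        # on the service these variables are set for you.
        "data": environ.get("DATA_DIR", "./data"),
        "results": environ.get("RESULT_DIR", "./results"),
        "checkpoints": environ.get("CHECKPOINT_DIR", "./checkpoints"),
    }
```

Passing a plain dictionary as environ makes the helper easy to exercise without touching the real environment.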

Tips:

If your job needs non-standard libraries, put the necessary files into object storage and install them at the start of your job. For example, to install a Python wheel, run a command such as: pip install file://$DATA_DIR/mypackage.whl && python3 myProgram.py. Similarly, for an Ubuntu package, run dpkg -i $DATA_DIR/mypackage.deb && caffe train -gpu all -solver mysolver.prototxt. In all cases, ensure that you include all dependencies. You can run a job that attempts the install to see if there are any errors, or use a command like pip list to see what is already present.
Create folders under $CHECKPOINT_DIR to segregate checkpoints for different types of experiments that run in parallel.
Write your program to look for a checkpoint at startup somewhere under $CHECKPOINT_DIR, and to write checkpoints to this same location at regular intervals. This ensures that if your job fails, or if you need to halt it for some reason, subsequent jobs can start from the checkpoints when appropriate.
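
The checkpoint tips above can be sketched as two small helpers: one that builds a per-experiment checkpoint path, and one that finds the most recent checkpoint at startup. The file naming scheme (ckpt-<step>.bin), the function names, and the experiment argument are all assumptions made for illustration; substitute whatever your framework writes.

```python
import os
import re

# Hypothetical naming scheme: ckpt-00000020.bin holds the state at step 20.
CKPT_PATTERN = re.compile(r"^ckpt-(\d+)\.bin$")


def checkpoint_path(checkpoint_root, experiment, step):
    """Return the file path for a checkpoint, creating the experiment folder."""
    folder = os.path.join(checkpoint_root, experiment)
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, "ckpt-{:08d}.bin".format(step))


def latest_checkpoint(checkpoint_root, experiment):
    """Return (step, path) of the highest-numbered checkpoint, or None."""
    folder = os.path.join(checkpoint_root, experiment)
    if not os.path.isdir(folder):
        return None
    best = None
    for name in os.listdir(folder):
        match = CKPT_PATTERN.match(name)
        if match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, os.path.join(folder, name))
    return best
```

At startup, a job would call latest_checkpoint(os.environ["CHECKPOINT_DIR"], "my_experiment") and resume from the returned step if the result is not None. The step number is taken from the file name rather than from file timestamps, which avoids depending on how the mounted bucket reports modification times.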

Obtaining the input data

Input data files located in the specified COS bucket are available to your program through the folder specified in the environment variable DATA_DIR.

import os

# if your code needs to open the file imagedata.csv
# from within your training_data_reference bucket
# use the following approach

input_data_folder = os.environ["DATA_DIR"]

imagefile = open(os.path.join(input_data_folder,"imagedata.csv"))

Writing the trained model

Output your trained model to a folder named model under the folder specified in the environment variable RESULT_DIR. When you store a training-run into the repository, this model folder is saved.

import os

output_model_folder = os.environ["RESULT_DIR"]
output_model_path = os.path.join(output_model_folder,"model")

mymodel = train_model()  # train_model() stands in for your own training code
mymodel.save(output_model_path)

Writing to the log

Write to stdout (for example in the Python programming language, use the print function) and the output will be collected by the service and made available when you monitor a running job or obtain the resulting log.

print("starting to train model...")
mymodel = train_model()
print("model training completed..., starting evaluation")
evaluate_model(mymodel)
print("model evaluation completed.")

Disclaimer: Client’s use of the deep learning training process includes the ability to write to the training log files. Personal data must not be written to these training log files as they are accessible to other users within Client’s Enterprise as well as to IBM as necessary to support the Cloud Service.

Reading Hyperparameters

If your code is running as part of a larger experiment, it might need to obtain values for hyperparameters defined in the experiment. The hyperparameters are supplied as a JSON-formatted dictionary in a file called config.json, located in the current folder. The following example snippet reads them (it expects hyperparameters to be defined for initial_learning_rate and total_iterations):

import json

with open("config.json") as f:
    hyper_params = json.load(f)
learning_rate = float(hyper_params["initial_learning_rate"])
training_iters = int(hyper_params["total_iterations"])
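
If the same script should also run outside an experiment, where config.json might not exist, one option is to fall back to default values. This is a sketch under that assumption; load_hyperparameters and its defaults argument are hypothetical names, and the fallback behavior is a design choice, not something the service requires.

```python
import json
import os


def load_hyperparameters(path="config.json", defaults=None):
    """Return hyperparameters from path, layered over optional defaults."""
    params = dict(defaults or {})
    if os.path.exists(path):
        with open(path) as f:
            # Values from the experiment override the defaults.
            params.update(json.load(f))
    return params
```

The values can then be converted as in the snippet above, for example float(params["initial_learning_rate"]).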

Computing and sending metrics

You can generate metrics of interest using techniques based on the framework being used. You can print metric values in an unstructured way to the logs, but you can also send metrics to the service in a structured way, so that the metrics can be extracted and monitored.

Computing and sending metrics for TensorFlow

For the TensorFlow framework, you can use TensorBoard-style summary metrics with the following configuration, in which the TensorBoard summaries are written to a particular folder in the file system where the service collects them.

The folder should be named $LOG_DIR/logs/tb/<folder> where <folder> is usually test (for test/validation metrics not computed on the training data) or train for metrics computed on the same data used to train the model.

In your TensorFlow Python code, first configure the location where the metrics will be written, and create the folder as follows:

import os
import tensorflow as tf

tb_directory = os.environ["LOG_DIR"]+"/logs/tb"
tf.gfile.MakeDirs(tb_directory)

Now your code can compute metrics and write them to an appropriately named sub-folder. For example, the following snippet creates a writer for test metrics:

test_writer = tf.summary.FileWriter(tb_directory+'/test')


for epoch in range(0,total_epochs):
    # perform training for epoch
    # compute test metrics into test_summary for epoch and write them to the tensorboard log
    test_writer.add_summary(test_summary, epoch)

Computing and sending metrics for Keras

For the service to collect metrics from a Keras program running in a TensorFlow framework, write them out to TensorBoard-style logs exactly as described for TensorFlow in the previous section.

For Keras this can be achieved using two TensorBoard callbacks and a simple splitter class called MetricsSplitter that routes metrics to the correct TensorBoard instance. The following code shows how to create the callbacks.

import os

from keras.callbacks import TensorBoard

...

# create TensorBoard instance for writing test/validation metrics
tb_directory_test = os.environ["LOG_DIR"]+"/logs/tb/test"
tensorboard_test = TensorBoard(log_dir=tb_directory_test)

# create TensorBoard instance for writing training metrics
tb_directory_train = os.environ["LOG_DIR"]+"/logs/tb/train"
tensorboard_train = TensorBoard(log_dir=tb_directory_train)

splitter=MetricsSplitter(tensorboard_train,tensorboard_test)

model.fit(...,callbacks=[splitter])

The MetricsSplitter class is defined here:

from keras.callbacks import Callback

class MetricsSplitter(Callback):

    def __init__(self, train_tb, test_tb):
        super(MetricsSplitter, self).__init__()
        self.test_tb = test_tb   # TensorBoard callback to handle test metrics
        self.train_tb = train_tb # TensorBoard callback to handle training metrics

    def set_model(self, model):
        self.test_tb.set_model(model)
        self.train_tb.set_model(model)

    def isTestMetric(self,metricName):
        return metricName.startswith("val") # metrics starting with val are computed on validation/test data

    def on_epoch_end(self, epoch, logs=None):
        # divide metrics up into test and train and route to the appropriate TensorBoard instance
        logs = logs or {}
        train_logs = {}
        test_logs = {}
        for metric in logs.keys():
            if self.isTestMetric(metric):
                test_logs[metric] = logs[metric]
            else:
                train_logs[metric] = logs[metric]
        self.test_tb.on_epoch_end(epoch,test_logs)
        self.train_tb.on_epoch_end(epoch,train_logs)

    def on_train_end(self, x):
        self.test_tb.on_train_end(x)
        self.train_tb.on_train_end(x)

Computing and sending metrics for PyTorch

For the PyTorch framework, use the following logging approach, which is based on the Python code in emetrics.py. Download this Python file and add it to your model definition zip, then import it and use it in your program.

from emetrics import EMetrics

with EMetrics.open() as em:
    for epoch in range(0,total_epochs):
        # perform training for epoch
        # compute training metrics and assign to values train_metric1, train_metric2 etc
        em.record("training",epoch,{'metric1': train_metric1, 'metric2': train_metric2})

        # compute test metrics and assign to values test_metric1, test_metric2 etc
        # NOTE for these use the group EMetrics.TEST_GROUP so that the service will recognize these metrics as computed on test/validation data
        em.record(EMetrics.TEST_GROUP,epoch,{'metric1': test_metric1, 'metric2': test_metric2})

Computing and sending metrics for code running in an HPO experiment

When preparing Python code to run within an experiment with Hyperparameter Optimization (HPO) please note the following requirements:

Direct TensorBoard logs to a folder that incorporates the value of the SUBID environment variable. WML sets SUBID to the HPO iteration number (0, 1, 2, and so on), and the logs written by each iteration must be kept separate.

tb_directory = os.environ["LOG_DIR"]+"/"+os.environ["SUBID"]+"/logs/tb"
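
To let one script work both inside and outside an HPO experiment, the folder can be built by a small helper that includes SUBID only when it is set. This is a sketch; tensorboard_dir is a hypothetical name, and treating a missing SUBID as "not an HPO run" is an assumption.

```python
import os


def tensorboard_dir(group, environ=None):
    """Build the TensorBoard log folder for a metrics group such as "test" or "train"."""
    environ = os.environ if environ is None else environ
    parts = [environ["LOG_DIR"]]
    subid = environ.get("SUBID")
    if subid:
        # Inside an HPO experiment: keep each iteration's logs separate.
        parts.append(subid)
    parts.extend(["logs", "tb", group])
    return "/".join(parts)
```

With LOG_DIR set to /l and SUBID set to 3, tensorboard_dir("test") yields /l/3/logs/tb/test; without SUBID it yields /l/logs/tb/test.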

At the end of the training run, write out a list of validation metric values computed at regular intervals as the model was trained. The final metric value should reflect the performance of the final model. The metrics should be written as a JSON array to the file $RESULT_DIR/val_dict_list.json, as in the following example (for a validation metric called accuracy, collected after each of 10 epochs):

[ 
   { "steps":1, "accuracy":0.07},
   { "steps":2, "accuracy":0.34},
   { "steps":3, "accuracy":0.45},
   { "steps":4, "accuracy":0.68},
   { "steps":5, "accuracy":0.67},
   { "steps":6, "accuracy":0.89},
   { "steps":7, "accuracy":0.93},
   { "steps":8, "accuracy":0.94},
   { "steps":9, "accuracy":0.94},
   { "steps":10, "accuracy":0.95}
]

The following code snippet can be used to write the JSON in this example:

import json
import os

# train the model
# obtain a list of validation metrics, collected during training after each epoch
# for example:
accuracies = [0.07,0.34,0.45,0.68,0.67,0.89,0.93,0.94,0.94,0.95]

# now write the values out to a JSON formatted file
training_out = []
for i, accuracy in enumerate(accuracies):
    training_out.append({'steps': i + 1, 'accuracy': accuracy})

with open(os.path.join(os.environ['RESULT_DIR'], 'val_dict_list.json'), 'w') as f:
    json.dump(training_out, f)
