End-to-end example for running a deep learning training run
This document assumes that you have created an account and uploaded your training data to IBM Cloud Object Storage, and that you are familiar with the basic concepts and know how to execute a training run. The example here walks through some of the commonly used concepts in training your model and saving your model for scoring by using IBM Watson Machine Learning.
Attention: Support for Deep Learning as a Service and the Deep Learning Experiment Builder on Cloud Pak for Data as a Service is deprecated and will be discontinued on April 2, 2022. See Service plan changes and deprecations for details.
Concepts
The goal of IBM Watson Machine Learning is to let you take a training that you can run on your own machine and run the exact same training on IBM Watson Machine Learning with little to no changes. The intent is to let you use more powerful hardware without needing to change your code.
Every IBM Watson Machine Learning training run takes in a definition file that describes which framework, version, and compute configuration to use, where the data and results are located, and what command to execute to process that data and produce the results. Following is an example of a basic definition:
model_definition:
  name: tf-mnist
  author:
    name: WML User
    email: [email protected]
  description: Simple MNIST model implemented in TF
  framework:
    name: tensorflow
    version: "1.13"
    runtimes:
      - name: python
        version: "3.6"
  execution:
    command: python3 mnist_e2e_example.py --MAX_STEPS=1000
    compute_configuration:
      name: k80
training_data_reference:
  name: MNIST image data files
  connection:
    endpoint_url: <auth-url>
    access_key_id: <username>
    secret_access_key: <password>
  source:
    bucket: mnist-training-data
  type: s3
training_results_reference:
  name: DL Model Storage
  connection:
    endpoint_url: <auth-url>
    access_key_id: <username>
    secret_access_key: <password>
  target:
    bucket: mnist-training-models
  type: s3
The preceding definition carries out a simple MNIST training using TensorFlow 1.13, as specified in the framework name and version fields. The command executed by this training is python3 mnist_e2e_example.py --MAX_STEPS=1000. The training runs on a single K80 GPU, as specified in the compute configuration section. Running it on more powerful hardware is as simple as changing the values in the compute configuration. For more information about the compute_configuration field and GPUs, see Using GPUs.
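For example, moving the same training to different hardware only requires editing that one section of the definition. The configuration name below is a hypothetical stand-in; see Using GPUs for the configuration names that your plan actually offers:

```yaml
execution:
  command: python3 mnist_e2e_example.py --MAX_STEPS=1000
  compute_configuration:
    name: v100   # illustrative value; consult Using GPUs for valid names
```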
The training_data_reference section identifies the bucket in IBM Cloud Object Storage that holds the MNIST training data, and the training_results_reference section describes how to connect to the results bucket where the results of this training are stored.
Here's a sample mnist_e2e_example.py that can be used for training:
"""
end to end mnist example demonstrating all the concepts of a single learner training using tensorflow
"""
import os, sys, time
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
learn = tf.contrib.learn
flags = tf.app.flags
flags.DEFINE_integer("MAX_STEPS", 200, "maximum number of steps to go through")
FLAGS = tf.app.flags.FLAGS
data_dir = os.getenv("DATA_DIR") #this is a reference to your training data bucket
result_dir = os.getenv("RESULT_DIR") #this is a reference to your current training_id/ folder under result bucket
base_checkpoint_dir = os.getenv("CHECKPOINT_DIR") #this is a reference to _wml_checkpoints/ folder under your result bucket
log_dir = os.getenv("LOG_DIR") # this is a reference to log dir that gets copied over to your result bucket at the end of training
model_path = os.path.join(result_dir, "model/") #where you want your models to be stored, your result bucket/training_id/model/
data_dir_mnist = os.path.join(data_dir, "mnist/") #location of our mnist data set
checkpoint_path = os.path.join(base_checkpoint_dir, "mnist_experiment_1/")
tboard_summaries = os.path.join(log_dir, "tboard/") # logging all the tensorboad metrics here
def main(_):
mnist = learn.datasets.mnist.read_data_sets(data_dir_mnist, one_hot=True)
x = tf.placeholder(tf.float32, [None,784], name="x_input")
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
y = tf.placeholder(tf.float32, [None,10])
model = tf.matmul(x, W) + b
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=model))
global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(cost, global_step=global_step)
prediction = tf.equal(tf.argmax(model,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
hooks = [tf.train.StopAtStepHook(last_step=FLAGS.MAX_STEPS)] #ask user for how many steps need to be executed
#tensorboard
writer = tf.summary.FileWriter(tboard_summaries)
saver = tf.train.Saver()
#store all your checkpoints under result bucket/_wml_checkpoints/mnist_experiment_1/ path
#if your training crashes, you can resume from the same path
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_path, hooks=hooks) as sess:
writer.add_graph(sess.graph)
ckpt = tf.train.get_checkpoint_state(checkpoint_path)
if ckpt and ckpt.model_checkpoint_path: #if existing checkpoint available then start from there
# Restores from checkpoint
print("restoring from path {} with checkpoint {}".format(ckpt.model_checkpoint_path, ckpt))
saver.restore(sess, ckpt.model_checkpoint_path)
# loop through data batches
while not sess.should_stop():
batch_xs, batch_ys = mnist.train.next_batch(1000)
_, step = sess.run([train_op, global_step], feed_dict={x: batch_xs, y: batch_ys})
if (step % 10 == 0) and (not sess.should_stop()):
loss, acc = sess.run([cost,accuracy], feed_dict={x: mnist.validation.images, y: mnist.validation.labels})
print("{:4d}".format(step) + ": " + "{:.6f}".format(loss) + ", accuracy=" + "{:.5f}".format(acc))
sys.stdout.flush()
# saving the model and preparing it for inferencing later
latest_checkpoint = tf.train.latest_checkpoint(checkpoint_path)
print(latest_checkpoint)
sys.stdout.flush()
with tf.Session() as sess:
saver.restore(sess,latest_checkpoint) # restore the latest checkpoint from checkpoints folder
acc = sess.run(accuracy, feed_dict={x: mnist.test.images,y: mnist.test.labels});
print("Test accuracy = "+"{:5f}".format(acc))
# output model
predictor = tf.argmax(model, 1, name="predictor")
inputs_classes = tf.saved_model.utils.build_tensor_info(x) # input an image
outputs_classes = tf.saved_model.utils.build_tensor_info(predictor) # output its class (0-9)
signature = (tf.saved_model.signature_def_utils.build_signature_def(inputs={tf.saved_model.signature_constants.CLASSIFY_INPUTS:inputs_classes},outputs={tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES:outputs_classes},method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME))
builder = tf.saved_model.builder.SavedModelBuilder(model_path) # where to store the model
legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
builder.add_meta_graph_and_variables(sess,[tf.saved_model.tag_constants.SERVING],signature_def_map={'predict_images': signature},legacy_init_op=legacy_init_op)
save_path = builder.save()
print("Model saved in file: %s" % save_path.decode("utf-8"))
if __name__ == "__main__":
tf.app.run()
There are a few concepts worth noting here. A handful of environment variables are available in the environment where the training is executed:
- DATA_DIR: your training data bucket mounted as a folder. The preceding example uses $DATA_DIR/mnist, which translates to training data bucket/mnist, where the MNIST data is located.
- RESULT_DIR: your current training_id/ folder under the result bucket. This is unique per training; use it to save your trained model. The preceding example writes the model to $RESULT_DIR/model, which translates to result bucket/<training id>/model.
- CHECKPOINT_DIR: refers to the _wml_checkpoints/ folder under your result bucket. You can append to this path to save your checkpoints, and resume from it if the training fails. The preceding example writes checkpoints under $CHECKPOINT_DIR/mnist_experiment_1, which translates to result bucket/_wml_checkpoints/mnist_experiment_1.
- LOG_DIR: refers to the logs directory. It is recommended to write your logs and summary metrics for TensorBoard under this directory. Currently, this directory is synced to your $RESULT_DIR, making logs and TensorBoard summary metrics available to you.
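To make the path logic concrete, here is a minimal sketch of how these variables are typically combined with os.path.join. The directory values below are hypothetical stand-ins for what the service injects at run time; in a real training run the variables are already set and you would only call os.getenv:

```python
import os

# Hypothetical values standing in for what the training environment injects.
os.environ["DATA_DIR"] = "/mnt/data"
os.environ["RESULT_DIR"] = "/mnt/results/training-abc123"
os.environ["CHECKPOINT_DIR"] = "/mnt/results/_wml_checkpoints"
os.environ["LOG_DIR"] = "/mnt/logs"

data_dir_mnist = os.path.join(os.getenv("DATA_DIR"), "mnist/")        # training bucket/mnist
model_path = os.path.join(os.getenv("RESULT_DIR"), "model/")          # result bucket/<training id>/model
checkpoint_path = os.path.join(os.getenv("CHECKPOINT_DIR"), "mnist_experiment_1/")
tboard_summaries = os.path.join(os.getenv("LOG_DIR"), "tboard/")      # TensorBoard event files

print(model_path)  # → /mnt/results/training-abc123/model/
```

Keeping all four paths derived from the environment, rather than hard-coded, is what lets the same script run unchanged on your laptop and on the service.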
Checkpointing and resuming from checkpoint
You can checkpoint and resume training using the following sample code snippet. For example, if we change the preceding command python3 mnist_e2e_example.py --MAX_STEPS=1000 to have MAX_STEPS=2000, the training resumes from the checkpoint created as part of the 1000th step, not from the beginning. This is a powerful concept that you can leverage to account for failures in your training or in the underlying infrastructure, making sure that your progress is saved and that you can resume from where you left off.
ckpt = tf.train.get_checkpoint_state(checkpoint_path)
if ckpt and ckpt.model_checkpoint_path:  # if an existing checkpoint is available, start from there
    # Restore from checkpoint
    print("restoring from path {} with checkpoint {}".format(ckpt.model_checkpoint_path, ckpt))
    saver.restore(sess, ckpt.model_checkpoint_path)
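The same resume-from-latest-checkpoint pattern can be sketched in a framework-agnostic way. The toy loop below is purely illustrative and not part of the WML API; it persists only a step counter, but the control flow mirrors the TensorFlow snippet above: check for an existing checkpoint, restore if present, then continue up to the new step limit.

```python
import json
import os
import tempfile

def train(workdir, max_steps):
    """Toy training loop that checkpoints its step counter and resumes if possible."""
    ckpt = os.path.join(workdir, "checkpoint.json")
    step = 0
    if os.path.exists(ckpt):  # if an existing checkpoint is available, start from there
        with open(ckpt) as f:
            step = json.load(f)["step"]
    while step < max_steps:
        step += 1  # stand-in for one real training step
        if step % 100 == 0 or step == max_steps:  # checkpoint periodically and at the end
            with open(ckpt, "w") as f:
                json.dump({"step": step}, f)
    return step

workdir = tempfile.mkdtemp()
train(workdir, 1000)            # first run: executes steps 1..1000
resumed = train(workdir, 2000)  # second run resumes at step 1000, not step 0
print(resumed)  # → 2000
```

In the real service, workdir corresponds to a path under $CHECKPOINT_DIR, so the checkpoint survives the failure of an individual training pod.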
Saving a model
To save a model so that you can score it or use it for inferencing, you can use code similar to the following sample. Notice the use of $RESULT_DIR here to save your trained model.
with tf.Session() as sess:
    saver.restore(sess, latest_checkpoint)  # restore the latest checkpoint from the checkpoints folder
    acc = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
    print("Test accuracy = {:5f}".format(acc))
    # output model
    predictor = tf.argmax(model, 1, name="predictor")
    inputs_classes = tf.saved_model.utils.build_tensor_info(x)  # input: an image
    outputs_classes = tf.saved_model.utils.build_tensor_info(predictor)  # output: its class (0-9)
    signature = tf.saved_model.signature_def_utils.build_signature_def(
        inputs={tf.saved_model.signature_constants.CLASSIFY_INPUTS: inputs_classes},
        outputs={tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES: outputs_classes},
        method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME)
    builder = tf.saved_model.builder.SavedModelBuilder(model_path)  # where to store the model
    legacy_init_op = tf.group(tf.tables_initializer(), name="legacy_init_op")
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={"predict_images": signature},
        legacy_init_op=legacy_init_op)
    save_path = builder.save()
    print("Model saved in file: %s" % save_path.decode("utf-8"))
Tensorboard
Summary metric writers intended to be consumed by TensorBoard can be constructed by passing a directory under the $LOG_DIR environment variable. These writers create event files that are copied to the Cloud Object Storage result directory at the end of the training.
# Construct summary metrics writer for Tensorboard consumption, and also enables WML monitoring
writer = tf.summary.FileWriter(os.path.join(os.getenv("LOG_DIR"), "tboard"))
After the training is done, you can download that directory and point your local TensorBoard server at it. For live monitoring, the scalar metrics from your summary events are available through WML monitoring.
Conclusion
This example demonstrates how you can take an example from a TensorFlow tutorial and, with little to no changes to the code, run that same example on IBM WML.
Parent topic: Training runs