Experiment builder with HPO tutorial: Build a TensorFlow model to recognize handwritten digits using the MNIST data set

This tutorial guides you through using the MNIST computer vision data set to train a TensorFlow model to recognize handwritten digits. In this tutorial, you will train, deploy, and test the model in IBM Watson Studio with experiment builder, using hyperparameter optimization (HPO).




Steps overview

This tutorial presents the basic steps for training a deep learning model with experiment builder in Watson Studio:

  1. Set up data files in IBM Cloud Object Storage
  2. Download sample code
  3. Train the model
  4. Monitor training progress and results
  5. Deploy the trained model
  6. Test the deployed model

This tutorial does not demonstrate distributed deep learning.


Step 1: Set up data files in Cloud Object Storage

Training a deep learning model with Watson Machine Learning relies on Cloud Object Storage both for reading input (such as training data) and for storing results (such as log files).

  1. Download MNIST sample data files to your local computer from here: MNIST sample files external link
    Note: Some browsers automatically uncompress the sample data files, which causes errors later in this tutorial. Follow instructions on the MNIST download page for verifying how your browser handled the files.

  2. Open the Cloud Object Storage GUI:

    1. From the Services menu in Watson Studio, choose "Data Services".
    2. Select "Manage in IBM Cloud" from the ACTIONS menu beside the service instance of Cloud Object Storage that is associated with your deep learning project. (This opens the Cloud Object Storage GUI.)

  3. Perform these steps in the Cloud Object Storage GUI:

    1. Create two buckets: one for storing training data, and one for storing training results.
      See: Creating a Cloud Object Storage bucket

    2. Upload all of the MNIST sample data files to the training data bucket.
      See: Uploading data to Cloud Object Storage
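If you suspect your browser uncompressed the sample files (see the note in step 1), you can check for the gzip magic bytes before uploading. This is a minimal sketch; the commented filenames assume the files sit in your download folder:

```python
# Gzip files begin with the magic bytes 0x1f 0x8b. If a downloaded MNIST
# file fails this check, your browser probably uncompressed it.
def is_gzip(path):
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Hypothetical usage with the MNIST filenames:
# for name in ("train-images-idx3-ubyte.gz", "t10k-images-idx3-ubyte.gz"):
#     print(name, "is gzip:", is_gzip(name))
```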


Step 2: Download sample code

Download sample TensorFlow model-building Python code from here: tf-model-hpo.zip external link.

tf-model-hpo.zip contains two files:

  • input_data.py - A "helper" file for reading the MNIST data files
  • convolutional_network.py - The model-building code

Point of interest: The sample file convolutional_network.py demonstrates using the environment variable $RESULT_DIR to cause extra output to be sent to the Cloud Object Storage results bucket:

model_path = os.environ["RESULT_DIR"]+"/model"
builder = tf.saved_model.builder.SavedModelBuilder(model_path)

In this case, the trained model is saved in protobuf format to the results bucket. You can send any output to the results bucket by writing it under the path given by the $RESULT_DIR variable in the same way.
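For example, any file written under that path is copied to the results bucket. A minimal sketch (the log filename and contents are illustrative, and the local fallback is only for running outside Watson Machine Learning):

```python
import os

# RESULT_DIR is set by Watson Machine Learning during a training run and
# maps to the results bucket; fall back to the current directory locally.
result_dir = os.environ.get("RESULT_DIR", ".")

# Anything written here ends up alongside the saved model in the bucket.
log_path = os.path.join(result_dir, "training_log.txt")  # illustrative name
with open(log_path, "w") as f:
    f.write("step 100: accuracy=0.97\n")  # illustrative content
```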


Step 3: Train the model

This tutorial demonstrates training the model using experiment builder in Watson Studio.


3.1 Define experiment details

  1. From the Assets page of your project in Watson Studio, click New experiment.

  2. Specify a name for the experiment.

  3. In the Machine Learning Service drop-down, select the Watson Machine Learning service instance that is associated with the project.

  4. Specify Cloud Object Storage details:

    1. In the area for Cloud Object Storage click Select.
    2. Click the New connection tab.
    3. In the Cloud Object Storage instance drop-down list, select the instance of Cloud Object Storage where you created the training data bucket and the training results bucket.
    4. From the drop-down lists, specify the training data and results buckets that you created before.
    5. Click Create.


3.2 Add a training definition

A training definition is made up of two things:

  • Model-building code in a .zip file
  • Metadata about how to run the training


Model-building code and basic details

  1. Click Add training definition.
  2. Give the training definition a name.
  3. Upload the sample code, tf-model-hpo.zip, where prompted.
  4. Specify the framework that is used in the model-building code: tensorflow 1.5.
  5. In the Execution command box, specify this command for running the model-building code:
    python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --trainingIters 200000
    Point of interest: This sample execution command demonstrates using the environment variable $DATA_DIR to cause data to be read from the Cloud Object Storage training data bucket.
  6. Select "1/2 x NVIDIA Tesla K80 (1 GPU)" for the compute plan.
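Inside convolutional_network.py, the command-line flags from the execution command above are parsed when the run starts. A hypothetical argparse sketch of how such flags could be handled (the actual sample code may parse them differently):

```python
import argparse
import os

# DATA_DIR is set by Watson Machine Learning; ${DATA_DIR} in the
# execution command expands to the mounted training-data bucket.
data_dir = os.environ.get("DATA_DIR", "/tmp/data")  # local fallback

parser = argparse.ArgumentParser()
parser.add_argument("--trainImagesFile")
parser.add_argument("--trainLabelsFile")
parser.add_argument("--testImagesFile")
parser.add_argument("--testLabelsFile")
parser.add_argument("--trainingIters", type=int, default=200000)

# Simulate two of the flags from the execution command above:
args = parser.parse_args([
    "--trainImagesFile", data_dir + "/train-images-idx3-ubyte.gz",
    "--trainingIters", "200000",
])
```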

Metadata, including hyperparameter optimization

In the model-building code, you can see several hyperparameters being used. Here are a few places in convolutional_network.py that use hyperparameters learning_rate, conv_filter_size1, conv_filter_size2, and fc:

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

# Store layers weight & bias
weights = {
    # conv layer 1: conv_filter_size1 x conv_filter_size1 filters, 1 input channel, 32 outputs
    'wc1': tf.Variable(tf.random_normal([conv_filter_size1, conv_filter_size1, 1, 32])),
    # conv layer 2: conv_filter_size2 x conv_filter_size2 filters, 32 inputs, 64 outputs
    'wc2': tf.Variable(tf.random_normal([conv_filter_size2, conv_filter_size2, 32, 64])),
    # fully connected, 7*7*64 inputs, fc outputs
    'wd1': tf.Variable(tf.random_normal([7*7*64, fc])),
    # fc inputs, 10 outputs (class prediction)
    'out': tf.Variable(tf.random_normal([fc, n_classes]))
}
Specify how these hyperparameters should be optimized in experiment builder:

  1. From the Hyperparameter optimization method drop-down list, select "random".

  2. Specify the basic hyperparameter optimization details:

    Table 1. Hyperparameter optimization details
    Option                                    | Value to specify
    Optimizer steps (Number of training runs) |
    Objective                                 | accuracy
    Maximize or minimize                      | maximize

  3. Add four "Range" hyperparameters with these details:

    Table 2. Hyperparameters
    Name              | Lower bound | Upper bound | Step/Power  | Data type
    learning_rate     | 0.005       | 0.01        | Step: 0.001 | Double
    conv_filter_size1 | 5           | 6           | Step: 1     | Integer
    conv_filter_size2 | 5           | 6           | Step: 1     | Integer
    fc                | 9           | 10          | Power: 2    | Integer


After entering the hyperparameter details, click Create and run.
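For intuition, the "random" method samples each trial's hyperparameter values independently from the ranges in Table 2. A hypothetical sketch of how one trial's values might be drawn (not WML's actual sampler):

```python
import random

def sample_trial(rng):
    """Draw one random-search trial from the Table 2 ranges."""
    # Step ranges sample on a grid from lower to upper bound.
    learning_rates = [round(0.005 + i * 0.001, 3) for i in range(6)]  # 0.005..0.010
    return {
        "learning_rate": rng.choice(learning_rates),
        "conv_filter_size1": rng.choice([5, 6]),
        "conv_filter_size2": rng.choice([5, 6]),
        # "Power: 2" means the bounds are exponents, so fc is 2**9 or 2**10.
        "fc": 2 ** rng.choice([9, 10]),
    }

trial = sample_trial(random.Random(0))
```

Each training run in the experiment then executes convolutional_network.py with one such combination, and the objective (accuracy) decides which run wins.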


Step 4: Monitor training progress and results

  • You can monitor the progress of a training run in the Training Runs tab of experiment builder.

  • When the training run is complete, you can view details of training results:

    • In the Compare Runs tab, you can see how the results for the training runs compare:
      Comparing results in experiment builder
    • Click on a training definition to view details of its training results, including a graph of the accuracy, logs and other output:
      Viewing training run results in experiment builder


Step 5: Deploy the trained model

You can use your trained model to classify new images only after the model has been deployed.

  1. In the Training Runs tab of experiment builder, under the ACTIONS menu for the training run with the highest accuracy, select "Save model". Give the model a name and click Save. This stores the model in the Watson Machine Learning repository.

  2. In the Assets page of your project in Watson Studio, click the new model in the Models section.

  3. Click the Deployments tab and then click Add Deployment.

  4. Choose "Web Service" as the deployment type, specify a name for the deployment, and then click Save.

  5. Click the new deployment to view the model details page.


Step 6: Test the deployed model

You can quickly test your deployed model from the deployment details page.

  1. On your local computer, download this sample payload JSON file with input data corresponding to the handwritten digits "5" and "4": tf-mnist-test-payload.json external link

  2. In the Test area of the deployment details page in Watson Studio, paste the value of the payload field from tf-mnist-test-payload.json. Then click Predict.

    Sample output:

      "values": [
    This output shows that the first input was correctly classified as the digit "5" and the second input was correctly classified as the digit "4".
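Beyond the GUI, a deployed web service can be scored over REST. A hedged sketch of such a request (the scoring URL and token below are placeholders from your deployment details page, and the zero-filled pixel rows stand in for the real data in tf-mnist-test-payload.json):

```python
import json
import urllib.request

# Placeholders: take the real scoring URL and token from the deployment
# details page and your IBM Cloud credentials.
scoring_url = "https://example.com/deployments/DEPLOYMENT_ID/online"
token = "REPLACE_WITH_TOKEN"

# The "values" field holds one flattened 28x28 image (784 pixels) per row
# to classify; the real pixel data is in tf-mnist-test-payload.json.
payload = {"values": [[0.0] * 784, [0.0] * 784]}

req = urllib.request.Request(
    scoring_url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + token},
)
# With real credentials you would send it:
# response = json.load(urllib.request.urlopen(req))
```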