CLI with HPO tutorial: Build a TensorFlow model to recognize handwritten digits using the MNIST data set

This tutorial guides you through using the MNIST computer vision data set to train a TensorFlow model to recognize handwritten digits. You will train, deploy, and test the model using the IBM Watson Machine Learning command line interface (CLI), and the training demonstrates hyperparameter optimization (HPO).

 

Prerequisite

Set up the IBM Watson Machine Learning command line interface (CLI) environment: install the CLI and configure it with your Watson Machine Learning service credentials. Every command in this tutorial is run through the CLI.

 

Steps overview

This tutorial presents the basic steps for training a deep learning model with Watson Machine Learning:

  1. Set up data files in IBM Cloud Object Storage
  2. Download sample code
  3. Train the model
  4. Monitor training progress and results
  5. Deploy the trained model
  6. Test the deployed model

This tutorial does not demonstrate distributed deep learning.

 

Step 1: Set up data files in Cloud Object Storage

Training a deep learning model with Watson Machine Learning relies on Cloud Object Storage both for reading input (such as training data) and for storing results (such as log files).

  1. Download the MNIST sample data files to your local computer from here: MNIST sample files
    Note: Some browsers automatically uncompress the sample data files, which causes errors later in this tutorial. Follow the instructions on the MNIST download page to verify how your browser handled the files.

  2. Create an instance of Cloud Object Storage.
    See: Creating a Cloud Object Storage service instance

  3. Perform these steps in the Cloud Object Storage GUI:

    1. Create two buckets: one for storing training data, and one for storing training results.
      *For each bucket, make a note of the endpoint_url and the bucket name.
      See: Creating a Cloud Object Storage bucket

    2. Upload all of the MNIST sample data files to the training data bucket. (If you prefer to script the upload, see the sketch after these steps.)
      See: Uploading data to Cloud Object Storage

    3. Generate new HMAC credentials for working through this tutorial (create new credentials with {"HMAC":true} in the inline configuration parameters).
      *Make a note of the access_key_id and the secret_access_key.
      See: Creating HMAC credentials for Cloud Object Storage
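If you prefer to script the upload from step 3.2 rather than use the GUI, the following is a minimal sketch using the boto3 S3 client with the endpoint URL and HMAC credentials noted above. The angle-bracket placeholders are yours to fill in; the file names are the standard MNIST archives.

    import boto3

    # Substitute the values you noted from the Cloud Object Storage GUI
    cos = boto3.client(
        "s3",
        endpoint_url="https://<endpoint_url>",
        aws_access_key_id="<access_key_id>",
        aws_secret_access_key="<secret_access_key>",
    )

    # Upload each MNIST sample file to the training data bucket
    for filename in [
        "train-images-idx3-ubyte.gz",
        "train-labels-idx1-ubyte.gz",
        "t10k-images-idx3-ubyte.gz",
        "t10k-labels-idx1-ubyte.gz",
    ]:
        cos.upload_file(filename, "<training-data-bucket>", filename)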

 

Step 2: Download sample code

There are three files to download:

  • tf-model-hpo.zip - Sample model-building code
  • tf-train-hpo.yaml - Training run manifest file, containing metadata specifying how to execute the sample model-building code
  • experiments-hpo.yaml - Experiment manifest file, containing metadata about optimizing hyperparameters in multiple training runs

 

2.1 Sample model-building code

Download sample TensorFlow model-building Python code from here: tf-model-hpo.zip.

tf-model-hpo.zip contains two files:

  • convolutional_network.py - Model-building Python code
  • input_data.py - A “helper” file for reading the MNIST data files

Points of interest:

  • The sample file convolutional_network.py demonstrates using the environment variable $RESULT_DIR to cause extra output to be sent to the Cloud Object Storage results bucket:

    model_path = os.environ["RESULT_DIR"]+"/model"
    ...
    builder = tf.saved_model.builder.SavedModelBuilder(model_path)
    

    In this case, the trained model is saved in protobuf format to the results bucket. You can send any other output to the results bucket the same way, by writing it under $RESULT_DIR (see the first sketch after this list).

  • In the model-building code, you can see several hyperparameters being used (the second sketch after this list shows how their values reach the code). Here are a few places in convolutional_network.py that use the hyperparameters learning_rate, conv_filter_size1, conv_filter_size2, and fc:

    # Define loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
    ...
    # Store layers weight & bias
    weights = {
        # conv_filter_size1 x conv_filter_size1 conv, 1 input, 32 outputs
        'wc1': tf.Variable(tf.random_normal([conv_filter_size1, conv_filter_size1, 1, 32])),
        # conv_filter_size2 x conv_filter_size2 conv, 32 inputs, 64 outputs
        'wc2': tf.Variable(tf.random_normal([conv_filter_size2, conv_filter_size2, 32, 64])),
        # fully connected, 7*7*64 inputs, fc outputs
        'wd1': tf.Variable(tf.random_normal([7*7*64, fc])),
        # fc inputs, 10 outputs (class prediction)
        'out': tf.Variable(tf.random_normal([fc, n_classes]))
    }
    ...
    
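To make the first point concrete, here is a minimal sketch of sending arbitrary extra output to the results bucket via $RESULT_DIR (the file name notes.txt is purely illustrative):

    import os

    # Anything written under $RESULT_DIR is uploaded to the results bucket;
    # "notes.txt" is an illustrative name, not part of the sample code
    notes_path = os.path.join(os.environ["RESULT_DIR"], "notes.txt")
    with open(notes_path, "w") as f:
        f.write("training run finished\n")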

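On the second point: during an HPO experiment, Watson Machine Learning delivers the hyperparameter values chosen for each training run to the program in a config.json file in its working directory. Here is a minimal sketch of reading them, assuming that convention and the hyperparameter names used in this tutorial:

    import json

    # WML HPO writes the values chosen for this training run to config.json
    with open("config.json") as f:
        hyper_params = json.load(f)

    learning_rate = float(hyper_params["learning_rate"])
    conv_filter_size1 = int(hyper_params["conv_filter_size1"])
    conv_filter_size2 = int(hyper_params["conv_filter_size2"])
    fc = int(hyper_params["fc"])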
 

2.2 Training run manifest file

Download a sample training run manifest file from here: tf-train-hpo.yaml.

Point of interest:

  • The command in the execution section demonstrates using the environment variable $DATA_DIR to cause data to be read from the Cloud Object Storage training data bucket, as sketched below.
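A minimal sketch of that pattern as it appears in model code, using the input_data.py helper bundled with the sample (the one_hot flag is illustrative):

    import os
    import input_data  # helper bundled in tf-model-hpo.zip

    # $DATA_DIR points at the contents of the training data bucket
    mnist = input_data.read_data_sets(os.environ["DATA_DIR"], one_hot=True)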

 

2.3 Experiment manifest file

Download a sample experiment manifest file from here: experiments-hpo.yaml.

Points of interest:

  • The method subsection in the hyper_parameters_optimization section specifies how the hyperparameter optimization will be performed in the experiment training runs:

    hyper_parameters_optimization:
        method:
          name: random                 <-- Optimization method: "random"
          parameters:
          - name: objective
            string_value: accuracy     <-- Optimization objective: "maximize" the "accuracy"
          - name: maximize_or_minimize <--
            string_value: maximize     <--
          - name: num_optimizer_steps
            int_value: 4               <-- Run four training runs
    

  • The hyper_parameters subsection in the hyper_parameters_optimization section specifies how the individual hyperparameters in convolutional_network.py vary across the experiment's training runs (a sketch of what this sampling means follows the listing):

        hyper_parameters:
        - name: learning_rate
          double_range:
            min_value: 0.005
            max_value: 0.01
            step: 0.001
        - name: conv_filter_size1
          int_range:
            min_value: 5
            max_value: 6
        - name: conv_filter_size2
          int_range:
            min_value: 5
            max_value: 6
        - name: fc
          int_range: 
            min_value: 9
            max_value: 10
            power: 2
    
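To make the listing concrete: with the random method, each of the four runs draws one value per hyperparameter from its declared range, and power: 2 means the fc values tried are powers of two (2^9 = 512 and 2^10 = 1024). The following is an illustrative sketch of that sampling, not Watson Machine Learning's actual implementation:

    import random

    def sample_config():
        """Draw one configuration from the ranges in experiments-hpo.yaml."""
        return {
            # double_range 0.005..0.01 in steps of 0.001
            "learning_rate": round(0.005 + 0.001 * random.randint(0, 5), 3),
            "conv_filter_size1": random.randint(5, 6),  # int_range 5..6
            "conv_filter_size2": random.randint(5, 6),  # int_range 5..6
            "fc": 2 ** random.randint(9, 10),           # power: 2 -> 512 or 1024
        }

    # num_optimizer_steps: 4 -> four training runs, each with its own draw
    configs = [sample_config() for _ in range(4)]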

 

Step 3: Train the model

This tutorial demonstrates running four training runs in an experiment.

 

3.1 Create and store a training definition

A training definition is made up of two things:

  • Model-building code in a .zip file
  • Metadata about how to run the training

Steps:

  1. In the training run manifest file, tf-train-hpo.yaml, update the training_data_reference section with details of the Cloud Object Storage instance that you are using for this tutorial:

    • Update the endpoint_url field
    • Update the access_key_id field and the secret_access_key field

  2. Store the training definition using the Machine Learning CLI:

    bx ml store training-definitions tf-model-hpo.zip tf-train-hpo.yaml
    

    Example output:

    Creating training-definition ...
    OK
    training-definition ID is 'b54925fe-e5d9-473d-9b8a-c1accc60b369' and version ID is '7ca0b7f5-3a73-440b-95df-24cbe4c261cb'
    

 

3.2 Create and store an experiment

  1. Update the sample experiment manifest file, experiments-hpo.yaml, with your details:

    • Replace <REPLACE WITH UUID> with the training-definition ID returned from the bx ml store training-definitions command.

    • Update the training_results_reference section with details of the Cloud Object Storage instance that you are using for this tutorial:

      • Update the endpoint_url field
      • Update the access_key_id field and the secret_access_key field

  2. Store the experiment using the Machine Learning CLI:

    bx ml store experiments experiments-hpo.yaml
    

    Example output:

    Creating experiment ...
    OK
    Experiment created with ID '44fc843c-33eb-4662-a46a-0f18e16e99a5'
    

    In this example output, "44fc843c-33eb-4662-a46a-0f18e16e99a5" is the experiment ID.

 

3.3 Run the experiment

Run the experiment using the Machine Learning CLI:

bx ml experiments run 44fc843c-33eb-4662-a46a-0f18e16e99a5

*The experiment ID in this example was returned from the previous bx ml store experiments command, “44fc843c-33eb-4662-a46a-0f18e16e99a5”. Replace that with the experiment ID that was returned for you.

Example output:

Starting to run the experiment with ID '44fc843c-33eb-4662-a46a-0f18e16e99a5' ...
OK
Experiment-run created with ID '626dc821-763d-4361-a91e-fac3373df932'

In this example output, “626dc821-763d-4361-a91e-fac3373df932” is the experiment-run ID. (The experiment ID and the experiment-run ID are not the same thing.)

 

Step 4: Monitor training progress and results

  • You can monitor the progress of the training runs using the CLI:

    bx ml list training-runs 44fc843c-33eb-4662-a46a-0f18e16e99a5 626dc821-763d-4361-a91e-fac3373df932
    

    *The IDs in this example were returned by previous commands:

    • The experiment ID was returned from the previous bx ml store experiments command, "44fc843c-33eb-4662-a46a-0f18e16e99a5"

    • The experiment-run ID was returned from the previous bx ml experiments run command, "626dc821-763d-4361-a91e-fac3373df932"

    Replace those with the experiment ID and the experiment-run ID that were returned for you.

    Sample output:

    Fetching the list of training runs ...
    SI No   Name        guid                  status      framework    version   submitted-at
    34      model1      training-6AqY_vVmg    completed   tensorflow   1.5       2018-05-23T17:58:21Z
    35      model1      training-6AqY_vVmg_0  completed   tensorflow   1.5       2018-05-23T17:59:33Z
    36      model1      training-6AqY_vVmg_1  completed   tensorflow   1.5       2018-05-23T17:59:33Z
    37      model1      training-6AqY_vVmg_2  completed   tensorflow   1.5       2018-05-23T17:59:33Z
    38      model1      training-6AqY_vVmg_3  running     tensorflow   1.5       2018-05-23T17:59:33Z
    

  • After the training runs finish, you can view log files and other output in the training results bucket of your Cloud Object Storage.
    See: Viewing results in Cloud Object Storage

 

Step 5: Deploy the trained model

You can use your trained model to classify new images only after the model has been deployed.

  1. Identify which training run achieved the best accuracy by viewing the metrics of each training run (a scripted version of this comparison is sketched after these steps):

    bx ml monitor training-runs training-6AqY_vVmg_2 metrics
    
    In this example, a training run guid returned from the previous bx ml list training-runs command, "training-6AqY_vVmg_2", is specified. Replace that with a training run guid that was returned for you.

    Sample output:

    OK
    Starting to fetch metrics for model-id 'training-6AqY_vVmg_2'
    ...
    [--METRICS]  name  accuracy   value  0.9534000158309937   current_iteration  179200   timestamp  2018-05-23T18:00:33Z
    [--METRICS]  name  accuracy   value  0.9578999876976013   current_iteration  192000   timestamp  2018-05-23T18:00:35Z
    

  2. Store the trained model of the best-performing training run in the Watson Machine Learning repository:

    bx ml store training-runs training-6AqY_vVmg_2
    
    In this example, a training run guid returned from the previous bx ml list training-runs command, "training-6AqY_vVmg_2", is specified. Replace that with a training run guid that was returned for you.

    Sample output:

    OK
    Model store successful. 
    Model-ID is 'a8379aaa-ea31-4c22-824d-89a01315dd6d'
    

  3. Deploy the model:

    bx ml deploy a8379aaa-ea31-4c22-824d-89a01315dd6d "my-hpo-deployment"
    
    In this example, the Model-ID returned from the bx ml store command, "a8379aaa-ea31-4c22-824d-89a01315dd6d" is specified. Replace that with the Model-ID that was returned for you.

    Sample output:

    Deploying the model with MODEL-ID 'a8379aaa-ea31-4c22-824d-89a01315dd6d'...
    DeploymentId       9d6a656c-e9d4-4d89-b335-f9da40e52179
    Scoring endpoint   https://2000ab8b-7e81-41b3-ad07-b70f849594f5...
    Name               my-hpo-deployment
    Type               tensorflow-1.5
    Runtime            python-3.5
    Created at         2018-05-23T19:46:19.770Z
    OK
    Deploy model successful
    
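If you have several runs to compare, you can script the metric check from step 1 instead of inspecting each run by hand. This hypothetical helper shells out to the same bx ml monitor command and parses the [--METRICS] lines shown above; the run GUIDs are the example values from Step 4, so substitute your own:

    import re
    import subprocess

    def final_accuracy(run_guid):
        """Return the last accuracy value logged by a training run, or None."""
        out = subprocess.run(
            ["bx", "ml", "monitor", "training-runs", run_guid, "metrics"],
            capture_output=True, text=True,
        ).stdout
        # Matches lines like: [--METRICS]  name  accuracy   value  0.9578999876976013 ...
        values = re.findall(r"name\s+accuracy\s+value\s+([0-9.]+)", out)
        return float(values[-1]) if values else None

    for guid in ["training-6AqY_vVmg", "training-6AqY_vVmg_0", "training-6AqY_vVmg_1",
                 "training-6AqY_vVmg_2", "training-6AqY_vVmg_3"]:
        print(guid, final_accuracy(guid))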

 

Step 6: Test the deployed model

You can quickly test your deployed model using the Watson Machine Learning CLI.

  1. Download this sample payload JSON file, with input data corresponding to the handwritten digits "5" and "4": tf-mnist-test-payload.json

  2. Update the sample payload file, tf-mnist-test-payload.json, with your model details (if you prefer to script this edit, see the sketch after these steps):

    • modelId: Specify the Model-ID returned from the bx ml store command
    • deploymentId: Specify the DeploymentId returned from the bx ml deploy command

  3. Test the model using the Watson Machine Learning CLI:

    bx ml score tf-mnist-test-payload.json
    
    Sample output:
    Fetching scoring results for the deployment '9d6a656c-e9d4-4d89-b335-f9da40e52179' ...
    {"classes":[5, 4]}
    OK
    Score request successful
    
    In this output, you can see that the first input was correctly classified as "5" and the second input was correctly classified as "4".
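As referenced in step 2, here is a minimal sketch of scripting the payload edit; the two IDs shown are the example values from Step 5, so substitute your own:

    import json

    # Rewrite the two ID fields described in step 2 (example IDs shown; use yours)
    with open("tf-mnist-test-payload.json") as f:
        payload = json.load(f)

    payload["modelId"] = "a8379aaa-ea31-4c22-824d-89a01315dd6d"       # from bx ml store
    payload["deploymentId"] = "9d6a656c-e9d4-4d89-b335-f9da40e52179"  # from bx ml deploy

    with open("tf-mnist-test-payload.json", "w") as f:
        json.dump(payload, f, indent=2)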