Create a training definition

Training definitions are the organizing principle for using deep learning functions in IBM Watson Machine Learning. A typical scenario might consist of dozens to hundreds of training definitions. Each training definition is defined individually and consists of the following parts: the neural network, defined by using one of the supported frameworks, and the location of the IBM Cloud Object Storage instance that contains your data set.

This document explains how training definitions can be set up and stored. You may also want to refer to the practical example described in Tutorial - Build a TensorFlow Model to recognize Handwritten Digits.

Creating a model definition .zip file

After you define the neural network and associated data handling by using one of the supported frameworks, package these files together by using the .zip format. For example, if the model is written in Torch, package your .lua files; if in Caffe, package the .prototxt files; if in TensorFlow, Keras, or MXNet, package your .py files. Other archive and compression formats, such as gzip or tar, are not supported. Consult the documentation for the deep learning framework that you want to use to prepare the model definition files.

For example, listing the contents of a zip file tf-model.zip that contains the model definition for TensorFlow might produce the following output:

unzip -l tf-model.zip

Sample Output:

Archive:  tf-model.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     7094  09-21-2017 11:38   convolutional_network.py
     5486  09-19-2017 13:49   input_data.py
---------                     -------
    12580                     2 files
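
The packaging step can also be scripted. The following is a minimal sketch in Python that uses only the standard library zipfile module; the function name build_model_zip and the choice to select files by suffix are assumptions made for illustration, not part of the service or its CLI.

```python
import zipfile
from pathlib import Path


def build_model_zip(source_dir, zip_path, suffixes=(".py",)):
    """Package model definition files from source_dir into a .zip archive.

    Hypothetical helper: only files directly inside source_dir whose suffix
    is listed in `suffixes` are added, and they are stored at the top level
    of the archive.
    """
    source_dir = Path(source_dir)
    files = sorted(
        path for path in source_dir.iterdir()
        if path.is_file() and path.suffix in suffixes
    )
    if not files:
        raise ValueError(f"no files with suffixes {suffixes} in {source_dir}")

    # The text above states that only the .zip format is accepted.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
    return [path.name for path in files]
```

For a TensorFlow model, a call such as build_model_zip("my-model", "tf-model.zip") would collect the .py files; for Caffe or Torch, pass suffixes=(".prototxt", ".json") or suffixes=(".lua",) instead.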

Additional requirements for a Caffe model

To submit a training job for a Caffe model, the model .zip file must contain the following files:

  • <network_definition_1>.prototxt: This file describes the layout of the neural network for training.
  • <network_definition_2>.prototxt: This file describes the layout of the neural network for scoring. It is used for online deployment and scoring.
  • <network_solver>.prototxt: This file describes the training parameters, such as the number of training iterations, the learning rate, the training checkpoints (called snapshots), and the prefix used to name the snapshots (snapshot_prefix).
    Note: The snapshot_prefix key must contain a value in the following format: ./model/<model_prefix>
  • deployment-meta.json: This file contains details such as the input layers, the output layers, the name of the network file, and the name of the network weights file. The scoring service uses this file to find the correct .prototxt and .caffemodel files for loading the model, and uses the input and output layers specified in it for making predictions.
    A sample deployment-meta.json looks as follows, where network_definitions_file_name names the scoring network file, <network_definition_2>.prototxt:
    {
    "input_layers": ["data"],
    "output_layers": ["probability"],
    "network_definitions_file_name": "lenet.prototxt",
    "weights_file_name": "lenet_iter_10000.caffemodel"
    }
    

For a Caffe model, the "command" field in the training-runs.yml/json file should be as follows:

mkdir $RESULT_DIR/model; cp *.prototxt $RESULT_DIR/model; \
cp deployment-meta.json $RESULT_DIR/model; ln -s $RESULT_DIR/model model; \
caffe train -solver <solver>.prototxt

The training process generates a weights file, which contains the weights for the neural network and is named in the format <snapshot_prefix>_iter_<n_iters>.caffemodel. The value of snapshot_prefix is taken from the solver .prototxt file, and n_iters is the training iteration at which the snapshot was written; for the final weights file, it equals the maximum number of training iterations. Specify this file name under the weights_file_name key in deployment-meta.json.
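
To keep deployment-meta.json consistent with the solver settings, the file can be generated rather than edited by hand. The sketch below is a hedged illustration: the helper names are hypothetical, it assumes the snapshot naming pattern shown in the sample file (lenet_iter_10000.caffemodel), and it assumes that weights_file_name holds only the base file name, as in that sample.

```python
import json
import posixpath


def weights_file_name(snapshot_prefix, max_iter):
    """Derive the final weights file name from the solver settings.

    Assumption: snapshots are named <prefix>_iter_<n>.caffemodel, which
    matches the sample value lenet_iter_10000.caffemodel.
    """
    # snapshot_prefix has the form ./model/<model_prefix>; keep the last part.
    model_prefix = posixpath.basename(snapshot_prefix)
    return f"{model_prefix}_iter_{max_iter}.caffemodel"


def build_deployment_meta(scoring_prototxt, snapshot_prefix, max_iter,
                          input_layers, output_layers):
    """Build the contents of deployment-meta.json as a dictionary."""
    return {
        "input_layers": list(input_layers),
        "output_layers": list(output_layers),
        "network_definitions_file_name": scoring_prototxt,
        "weights_file_name": weights_file_name(snapshot_prefix, max_iter),
    }


meta = build_deployment_meta(
    scoring_prototxt="lenet.prototxt",
    snapshot_prefix="./model/lenet",
    max_iter=10000,
    input_layers=["data"],
    output_layers=["probability"],
)
deployment_meta_json = json.dumps(meta, indent=4)
```

Writing deployment_meta_json to a file named deployment-meta.json before building the model .zip file would reproduce the sample shown earlier.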

Upload training data

Your training data must be uploaded to a compatible IBM Cloud Object Storage service instance. The credentials from that IBM Cloud Object Storage instance will be used in your manifest file. The object store is also used to store the trained model and log files at the end of your training run.

Creating a training definition manifest file

The manifest is a YAML-formatted file that contains fields describing the model to be trained, including the deep learning framework to use, the IBM Cloud Object Storage configuration, and several arguments required for model execution during training and testing. The fields of the training definition manifest for deep learning are described here, continuing the TensorFlow handwriting recognition example. Note that other fields, such as the frameworks.runtime field, are ignored.

  • name: You can provide any value for name to help identify your training run after it is launched. The value does not have to be unique, because the service assigns a unique training-definition-id to each training definition.
  • description: This is another field that you can use to describe the training definition.
  • framework: This field provides framework-specific information. The name and version must match one of the supported frameworks.
    • framework.name: Name of the framework.
    • framework.version: Version of the framework. This value must be a string, so enclose it in quotes, for example version: "0.8"
    • framework.runtimes: Optional. Provide a list of names and versions of relevant libraries or tools. For deep learning, the only relevant information currently is the version of Python. The version must be provided as a string. If this information is not provided, python3 is assumed.
  • command: This field identifies the main program file along with any arguments that deep learning needs to execute.
  • training_data_reference: This section specifies the object store and bucket from which the data files used to train the model are loaded.
    • name: A descriptive name for this object store and bucket.
    • connection: The connection variables for the data store.
    • type: Type of data store. Currently this can only be set to s3.
    • source.bucket: The bucket where the training data resides.

For example, the following training definition manifest can be used to create a training definition:

name: training-definition-1
description:  Simple MNIST model implemented in TF
framework:
  name: tensorflow
  version: '1.5'
  runtimes:
    name: python
    version: '3.5'
training_data_reference:
- name: MNIST image data files
  connection:
    endpoint_url: <auth-url>
    access_key_id: <username>
    secret_access_key: <password>
  source:
    bucket: mnist-training-models
    type: s3
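
Some of the rules above can be checked before the manifest is stored. The following Python sketch is an illustration only: check_manifest is a hypothetical helper, it operates on a manifest that has already been parsed into a dictionary (for example by a YAML library), and it assumes the layout of the example above, where type sits under source. One reason the version rule matters is that a YAML parser reads an unquoted 1.5 as a number rather than a string.

```python
def check_manifest(manifest):
    """Return a list of problems found in a parsed training definition manifest."""
    problems = []

    framework = manifest.get("framework") or {}
    if not framework.get("name"):
        problems.append("framework.name is missing")
    if not isinstance(framework.get("version"), str):
        problems.append("framework.version must be a string, for example '1.5'")

    # The field description mentions a list, while the example uses a single
    # mapping, so both shapes are accepted here.
    runtimes = framework.get("runtimes")
    if runtimes is not None:
        items = runtimes if isinstance(runtimes, list) else [runtimes]
        for item in items:
            if not isinstance(item.get("version"), str):
                problems.append("framework.runtimes version must be a string")

    for index, reference in enumerate(manifest.get("training_data_reference") or []):
        source = reference.get("source") or {}
        if not source.get("bucket"):
            problems.append(f"training_data_reference[{index}].source.bucket is missing")
        if source.get("type") != "s3":
            problems.append(f"training_data_reference[{index}] type must be s3")

    return problems


# The example manifest above, expressed as the dictionary a YAML parser would return.
example_manifest = {
    "name": "training-definition-1",
    "description": "Simple MNIST model implemented in TF",
    "framework": {
        "name": "tensorflow",
        "version": "1.5",
        "runtimes": {"name": "python", "version": "3.5"},
    },
    "training_data_reference": [
        {
            "name": "MNIST image data files",
            "connection": {
                "endpoint_url": "<auth-url>",
                "access_key_id": "<username>",
                "secret_access_key": "<password>",
            },
            "source": {"bucket": "mnist-training-models", "type": "s3"},
        }
    ],
}
example_problems = check_manifest(example_manifest)
```

An empty list means that none of these particular checks failed; it does not guarantee that the service will accept the manifest.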

Generate a sample training definition manifest file

A sample manifest file template can be generated by using the bx ml generate-manifest command:

bx ml generate-manifest training-definitions

Sample Output:

OK
A sample manifest file is generated under training-definitions.yml

Store a training definition

After you prepare the model definition .zip file and the training definition manifest file, store the training definition by using the bx ml store training-definitions command, which takes the form bx ml store training-definitions <path-to-model-definition-zip> <path-to-training-definitions-manifest-yaml>. For example:

bx ml store training-definitions tf-model.zip tf-train-def.yaml

When the command is submitted successfully, a unique training-definition ID is returned. For example, the following output shows a training-definition ID value of e73b7b9d-d44b-481a-80a9-8e8b13eb63a3:

Sample Output:

Creating training-definition ...
OK
training-definition ID is 'e73b7b9d-d44b-481a-80a9-8e8b13eb63a3' and version ID is '354950bb-a5e0-4e5b-8c39-f5439078a705'
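
When the store command is run from a script, the returned IDs can be extracted from the output text. The sketch below is a minimal Python example that assumes the output has exactly the wording shown in the sample above; parse_store_output is a hypothetical helper, and the CLI wording may differ between versions.

```python
import re

# Pattern based on the sample output line shown above.
_STORE_PATTERN = re.compile(
    r"training-definition ID is '([^']+)' and version ID is '([^']+)'"
)


def parse_store_output(output):
    """Return (training_definition_id, version_id) from the store command output."""
    match = _STORE_PATTERN.search(output)
    if match is None:
        raise ValueError("no training-definition ID found in the output")
    return match.group(1), match.group(2)


sample_output = """Creating training-definition ...
OK
training-definition ID is 'e73b7b9d-d44b-481a-80a9-8e8b13eb63a3' and version ID is '354950bb-a5e0-4e5b-8c39-f5439078a705'
"""
definition_id, version_id = parse_store_output(sample_output)
```

The extracted training-definition ID can then be passed to the show and delete commands described in the following sections.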

List training definitions

To list all training definitions, run the following command:

bx ml list training-definitions

Sample Output:

Fetching the list of training-definitions ...
SI No   Name                          guid                                   framework    created-at
1       caffe_training_definitionc1   0787fae0-3221-48c2-9270-db6f40e57d3b   caffe        2018-01-30T16:16:33.295Z
2       tf-training-definition2       512bb4a4-552d-4984-b404-2987b2600e3b   tensorflow   2018-01-30T16:16:54.516Z

2 records found.
OK
List all training-definitions successful
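
The tabular listing can be turned into records for further processing. The following Python sketch assumes that columns are separated by two or more spaces, as in the sample output above; parse_list_output is a hypothetical helper, and the approach would misread a name that itself contains consecutive spaces.

```python
import re

_COLUMN_GAP = re.compile(r"\s{2,}")


def parse_list_output(output):
    """Parse the table printed by the list command into a list of dictionaries."""
    lines = [line for line in output.splitlines() if line.strip()]

    # Locate the header row, which starts with the "SI No" column.
    header_index = next(
        (i for i, line in enumerate(lines) if line.startswith("SI No")), None
    )
    if header_index is None:
        raise ValueError("table header not found in the output")
    columns = _COLUMN_GAP.split(lines[header_index].strip())

    records = []
    for line in lines[header_index + 1:]:
        cells = _COLUMN_GAP.split(line.strip())
        # Stop at the first line that is not a numbered table row,
        # such as the "2 records found." summary.
        if len(cells) != len(columns) or not cells[0].isdigit():
            break
        records.append(dict(zip(columns, cells)))
    return records


sample_listing = """Fetching the list of training-definitions ...
SI No   Name                          guid                                   framework    created-at
1       caffe_training_definitionc1   0787fae0-3221-48c2-9270-db6f40e57d3b   caffe        2018-01-30T16:16:33.295Z
2       tf-training-definition2       512bb4a4-552d-4984-b404-2987b2600e3b   tensorflow   2018-01-30T16:16:54.516Z

2 records found.
OK
List all training-definitions successful
"""
records = parse_list_output(sample_listing)
tensorflow_guids = [r["guid"] for r in records if r["framework"] == "tensorflow"]
```

Each record is keyed by the column headings printed by the command, so a lookup such as the tensorflow_guids list above selects the IDs for one framework.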

To check the details of a particular training definition, use the CLI command bx ml show training-definitions <training-definition-id>:

bx ml show training-definitions 512bb4a4-552d-4984-b404-2987b2600e3b

Sample Output:

Fetching the training-definition details with ID '512bb4a4-552d-4984-b404-2987b2600e3b' ...
TrainingDefinitiontId   512bb4a4-552d-4984-b404-2987b2600e3b
name                    tf-training-definition2
framework               tensorflow
url                     https://ibm-watson-ml.mybluemix.net/v3/ml_assets/training_definitions/512bb4a4-552d-4984-b404-2987b2600e3b
created_at              2018-01-30T16:16:54.516Z
OK

Delete a training definition

To delete a training definition, run the bx ml delete training-definitions command with the ID of the training definition:

bx ml delete training-definitions fe119693-b42a-4516-9482-ee74c0c5ad8a

Sample Output:

Deleting the training-definition 'fe119693-b42a-4516-9482-ee74c0c5ad8a' ...
OK
Delete training-definition successful