Create a deep learning experiment

Last updated: Mar 04, 2022

In IBM Watson Machine Learning, a deep learning experiment is a logical grouping of one or more model definitions. When an experiment is run, it creates a training run for each model definition that is part of the experiment.

Attention: Support for Deep Learning as a Service and the Deep Learning Experiment Builder on Cloud Pak for Data as a Service is deprecated and will be discontinued on April 2, 2022. See Service plan changes and deprecations for details.

[Figure: relation of experiments to training runs]

This topic explains how to set up and run an experiment.

Prerequisites

Before you create an experiment, you must write the Python script that is used to deliver the model definition. For information on specific requirements, see Coding guidelines for deep learning programs and follow the coding standards that are outlined there to ensure that your script can be processed without error.

Creating an experiments manifest file

The manifest is a YAML-formatted file that contains fields describing the model definitions to be trained, the IBM Cloud Object Storage configuration, and several arguments required for model execution during training and testing. The following fields of the experiments file are available to use for deep learning. Other fields, such as the frameworks.runtime field, are ignored.

settings.name: A name that helps you identify the experiment after it is created. The name does not have to be unique; the service assigns a unique experiment ID to each experiment.
settings.description: A field that you can use to describe the experiment.
training_references: The model definitions that are part of the experiment.
- name: A descriptive name for this training run.
- compute_configuration.name: The resources that are allocated for training, for example k80 or v100x2. For the supported values and more information about GPUs, see Using GPUs.
training_results_reference: The object store where the resulting model files and logs are stored after training completes.
- name: A descriptive name for this object store and bucket.
- connection: The connection variables for the data store. The supported connection variables depend on the data store type.
- type: The type of data store. Currently, this can be set only to s3.
- target.bucket: The bucket where the training results are written.

For example, the following experiments manifest can be used to create an experiment:

settings:
  name: Sample Experiment
  description: This is a sample experiment
training_references:
- name: model-1
  training_definition_url: https://ibm-watson-ml.mybluemix.net/v3/ml_assets/training_definitions/6e973044-eadd-42f4-9657-45c0c56764863
  compute_configuration:
    name: k80
- name: model-2
  training_definition_url: https://ibm-watson-ml.mybluemix.net/v3/ml_assets/training_definitions/9e038155-eadd-42f4-9657-45c0c56764863
  compute_configuration:
    name: k80
training_results_reference:
  name: training-results-reference_name
  connection:
    endpoint_url: <auth-url>
    access_key_id: <username>
    secret_access_key: <password>
  target:
    bucket: experiment-results
  type: s3
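
As a rough illustration, the fields described above can be checked before the experiment is stored. The following Python sketch assumes the manifest has already been loaded into a dict (for example, with a YAML parser of your choice). The function name validate_experiment_manifest and the decision to treat every listed field as required are assumptions made for this example, not rules enforced by the service.

```python
def validate_experiment_manifest(manifest):
    """Return a list of problems found in an experiments manifest dict."""
    problems = []

    # settings.name is treated as required here (an assumption of this sketch).
    if not (manifest.get("settings") or {}).get("name"):
        problems.append("settings.name is missing")

    references = manifest.get("training_references") or []
    if not references:
        problems.append("training_references is empty")
    for index, reference in enumerate(references):
        prefix = f"training_references[{index}]"
        for key in ("name", "training_definition_url"):
            if not reference.get(key):
                problems.append(f"{prefix}.{key} is missing")
        if not (reference.get("compute_configuration") or {}).get("name"):
            problems.append(f"{prefix}.compute_configuration.name is missing")

    results = manifest.get("training_results_reference") or {}
    # The field list above states that s3 is the only supported type.
    if results.get("type") != "s3":
        problems.append("training_results_reference.type must be s3")
    if not (results.get("target") or {}).get("bucket"):
        problems.append("training_results_reference.target.bucket is missing")
    connection = results.get("connection") or {}
    for key in ("endpoint_url", "access_key_id", "secret_access_key"):
        if not connection.get(key):
            problems.append(f"training_results_reference.connection.{key} is missing")
    return problems


# The sample manifest from this topic, expressed as a Python dict.
SAMPLE_MANIFEST = {
    "settings": {"name": "Sample Experiment", "description": "This is a sample experiment"},
    "training_references": [
        {
            "name": "model-1",
            "training_definition_url": "https://ibm-watson-ml.mybluemix.net/v3/ml_assets/training_definitions/6e973044-eadd-42f4-9657-45c0c56764863",
            "compute_configuration": {"name": "k80"},
        },
        {
            "name": "model-2",
            "training_definition_url": "https://ibm-watson-ml.mybluemix.net/v3/ml_assets/training_definitions/9e038155-eadd-42f4-9657-45c0c56764863",
            "compute_configuration": {"name": "k80"},
        },
    ],
    "training_results_reference": {
        "name": "training-results-reference_name",
        "connection": {
            "endpoint_url": "<auth-url>",
            "access_key_id": "<username>",
            "secret_access_key": "<password>",
        },
        "target": {"bucket": "experiment-results"},
        "type": "s3",
    },
}

print(validate_experiment_manifest(SAMPLE_MANIFEST))
```

An empty list means that no problems were found. The check only looks at the structure of the manifest; it does not verify that the URLs, credentials, or bucket exist.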

Allocating memory to your deep learning jobs

Watson Machine Learning provides named GPU configurations that specify the GPUs, CPUs, and memory to allocate for a job. These configurations are easy to understand and help ensure efficient allocation of resources within deep learning compute nodes.

For sizes to specify in the manifest.yml file, refer to Using GPUs.

Specify the GPU configuration in the manifest.yml by using the following syntax:

execution:
  compute_configuration:
    name: v100x2

The preceding example shows the v100x2 compute tier, which allocates two GPUs to your deep learning job.

Generate a sample experiments manifest file

You can generate a sample manifest file template by using the bx ml generate-manifest command: bx ml generate-manifest experiments

Edit the generated manifest with appropriate values for the access_key_id, secret_access_key, training_definition_url, and bucket fields.

bx ml generate-manifest experiments

Sample Output:

OK
A sample manifest file is generated under experiments.yml

Store an experiment

After you prepare the experiments manifest file, store the experiment by using the bx ml store experiments command: bx ml store experiments <path-to-experiments-manifest-yaml>

bx ml store experiments experiments.yml

Sample Output:

When the command is submitted successfully, a unique experiment ID is returned. For example, the following output shows an experiment ID value of c2e94a92-cefe-45b7-bc99-56420abcaa1a:

Creating experiment ...
OK
Experiment created with ID 'c2e94a92-cefe-45b7-bc99-56420abcaa1a'

List experiments

To list all experiments, run the following command:

bx ml list experiments

Sample Output:

Fetching the list of experiments ...
SI No   Name                guid                                   created-at
1       sample experiment   422902ab-d384-4ce1-81aa-0350ca9e94b6   2018-01-30T16:18:17.265Z
2       tf-experiment       f5785fb5-a0bc-4db4-bd07-8a1cce4c9db4   2018-01-30T18:37:01.453Z

2 records found.
OK
List all experiments successful

To check the details of a particular experiment, use the CLI command bx ml show experiments <experiment-ID>:

bx ml show experiments 422902ab-d384-4ce1-81aa-0350ca9e94b6

Sample Output:

Fetching the experiment details with ID '422902ab-d384-4ce1-81aa-0350ca9e94b6' ...
ExperimentId   422902ab-d384-4ce1-81aa-0350ca9e94b6
name           sample_experiment11
url            https://ibm-watson-ml.mybluemix.net/v3/experiments/422902ab-d384-4ce1-81aa-0350ca9e94b6
created_at     2018-01-30T16:18:17.265Z
OK

Run an experiment

After you store the experiment, submit it for a run by using the bx ml experiments run command: bx ml experiments run <experiment-ID>. This starts the training of the training definitions that are included in the experiment.

bx ml experiments run c2e94a92-cefe-45b7-bc99-56420abcaa1a

Sample Output:

When the command is submitted successfully, a unique experiment run ID is returned. For example, the following output shows an experiment run ID value of 6d46291f-2266-4c4c-bb74-6de79f9b9b18:

Starting to run the experiment with ID 'c2e94a92-cefe-45b7-bc99-56420abcaa1a' ...
OK
Experiment-run created with ID '6d46291f-2266-4c4c-bb74-6de79f9b9b18'
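
The store and run steps can be chained in a script. The following Python sketch assumes that the bx CLI is installed and logged in, and that it prints the "created with ID" lines shown in the sample outputs. The helper names (extract_created_id, run_cli, store_and_run) are hypothetical, and the regular expression is inferred only from those sample outputs, so treat it as a starting point rather than a stable interface.

```python
import re
import subprocess

# Matches both "Experiment created with ID '...'" and
# "Experiment-run created with ID '...'" from the sample outputs.
CREATED_ID = re.compile(r"created with ID '([^']+)'")


def extract_created_id(cli_output):
    """Return the ID reported in a 'created with ID' line of CLI output."""
    match = CREATED_ID.search(cli_output)
    if match is None:
        raise ValueError("no 'created with ID' line found in CLI output")
    return match.group(1)


def run_cli(args):
    """Run a bx command and return its standard output as text."""
    completed = subprocess.run(
        ["bx", *args], capture_output=True, text=True, check=True
    )
    return completed.stdout


def store_and_run(manifest_path, run=run_cli):
    """Store an experiment from a manifest, then start a run of it."""
    store_output = run(["ml", "store", "experiments", manifest_path])
    experiment_id = extract_created_id(store_output)
    run_output = run(["ml", "experiments", "run", experiment_id])
    return experiment_id, extract_created_id(run_output)


# Parsing the sample output from this topic, without calling the CLI.
SAMPLE_RUN_OUTPUT = (
    "Starting to run the experiment with ID 'c2e94a92-cefe-45b7-bc99-56420abcaa1a' ...\n"
    "OK\n"
    "Experiment-run created with ID '6d46291f-2266-4c4c-bb74-6de79f9b9b18'\n"
)
print(extract_created_id(SAMPLE_RUN_OUTPUT))

# With the CLI available, the two steps could be chained like this:
# experiment_id, run_id = store_and_run("experiments.yml")
```

The run parameter exists so that the command runner can be replaced, for example with a stub that returns canned output while you test the parsing.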

List experiment runs

To list all runs under a particular experiment, use the bx ml list experiment-runs command: bx ml list experiment-runs <experiment-ID>

bx ml list experiment-runs c2e94a92-cefe-45b7-bc99-56420abcaa1a

Sample Output:

Fetching the list of experiment-runs ...
SI No   guid                                   state       created-at
1       6d46291f-2266-4c4c-bb74-6de79f9b9b18   completed   2018-02-01T09:14:09Z

1 records found.
OK
List all experiment-runs successful

List training runs under an experiment run

To list all training runs under a particular experiment run, use the bx ml list training-runs command: bx ml list training-runs <experiment-ID> <experiment-run-ID>

bx ml list training-runs c2e94a92-cefe-45b7-bc99-56420abcaa1a 6d46291f-2266-4c4c-bb74-6de79f9b9b18

Sample Output:

Fetching the list of training-runs in experiment-run with ID '6d46291f-2266-4c4c-bb74-6de79f9b9b18' ...
SI No   Name      guid                 state       submitted_at
1       model-1   training-5H2xmKCzR   completed   2018-02-01T09:14:14Z
2       model-2   training-8aBbiKCkg   completed   2018-02-01T09:14:19Z
OK
List training-runs successful
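
If you want to act on the list output in a script, for example to check whether every training run has completed, the table can be parsed into records. The following Python sketch assumes that the table header starts with "SI No" and that columns are separated by at least two spaces, as in the sample outputs in this topic. The function name parse_cli_table is hypothetical, and the CLI output format is not guaranteed to stay the same.

```python
import re

# Columns are assumed to be separated by two or more spaces, so that a
# name containing a single space (such as "sample experiment") stays intact.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_cli_table(cli_output):
    """Parse a 'bx ml list ...' style table into a list of dicts."""
    lines = [line.rstrip() for line in cli_output.splitlines()]
    header_index = next(
        (i for i, line in enumerate(lines) if line.startswith("SI No")), None
    )
    if header_index is None:
        return []
    headers = COLUMN_GAP.split(lines[header_index].strip())
    rows = []
    for line in lines[header_index + 1:]:
        if not line[:1].isdigit():
            break  # a blank line, "OK", or summary text ends the table
        rows.append(dict(zip(headers, COLUMN_GAP.split(line.strip()))))
    return rows


SAMPLE_OUTPUT = """Fetching the list of training-runs in experiment-run with ID '6d46291f-2266-4c4c-bb74-6de79f9b9b18' ...
SI No   Name                 guid                 state       submitted_at
1       model-1          training-5H2xmKCzR     completed   2018-02-01T09:14:14Z
2       model-2          training-8aBbiKCkg     completed   2018-02-01T09:14:19Z
OK
List training-runs successful
"""

training_runs = parse_cli_table(SAMPLE_OUTPUT)
all_completed = all(run["state"] == "completed" for run in training_runs)
print(all_completed)
```

The same function can be applied to the output of bx ml list experiments and bx ml list experiment-runs, because those tables use the same layout in the sample outputs.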

Monitor an experiment run

To continuously monitor the logs from an experiment run, use the CLI command bx ml monitor experiments <experiment-ID> <experiment-run-ID>:

bx ml monitor experiments c2e94a92-cefe-45b7-bc99-56420abcaa1a 0478fd57-887f-4e38-9068-a09fcc7c688d

Sample Output:

Starting to fetch status messages and metrics for experiment id 'c2e94a92-cefe-45b7-bc99-56420abcaa1a' and experiment-run id '0478fd57-887f-4e38-9068-a09fcc7c688d'
[--LOGS]      Training with training/test data and model at:

[--LOGS]
[--LOGS]        DATA_DIR: /job/caffe-training-data

[--LOGS]
[--LOGS]        MODEL_DIR: /job/model-code

[--LOGS]
[--LOGS]        TRAINING_JOB:

[--LOGS]
[--LOGS]        TRAINING_COMMAND: caffe train -solver lenet_solver.prototxt

[--LOGS]
[--LOGS]      ARMADA_OPS_PROM2GRAPHITE_PORT=tcp://172.21.176.2:39888

[--LOGS]
[--LOGS]      ARMADA_OPS_PROM2GRAPHITE_PORT_39888_TCP=tcp://172.21.176.2:39888

You can append logs or metrics to see only the log lines or only the metrics, for example bx ml monitor experiments <experiment-ID> <experiment-run-ID> logs or bx ml monitor experiments <experiment-ID> <experiment-run-ID> metrics.
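
When the monitor output is captured to a file or a pipe, the log messages can be separated from the surrounding status text. The following Python sketch assumes that log lines carry the [--LOGS] prefix shown in the sample output and that empty log lines can be dropped. The function name extract_log_messages is hypothetical, and the prefix used for metrics lines is not shown in this topic, so the sketch does not handle metrics.

```python
LOG_PREFIX = "[--LOGS]"


def extract_log_messages(monitor_output):
    """Return the non-empty log messages with the [--LOGS] prefix removed."""
    messages = []
    for line in monitor_output.splitlines():
        stripped = line.strip()
        if not stripped.startswith(LOG_PREFIX):
            continue  # status text or blank separator line
        message = stripped[len(LOG_PREFIX):].strip()
        if message:
            messages.append(message)
    return messages


SAMPLE_MONITOR_OUTPUT = """Starting to fetch status messages and metrics for experiment id 'c2e94a92-cefe-45b7-bc99-56420abcaa1a' and experiment-run id '0478fd57-887f-4e38-9068-a09fcc7c688d'
[--LOGS]      Training with training/test data and model at:

[--LOGS]
[--LOGS]        DATA_DIR: /job/caffe-training-data

[--LOGS]
[--LOGS]        MODEL_DIR: /job/model-code
"""

log_messages = extract_log_messages(SAMPLE_MONITOR_OUTPUT)
for message in log_messages:
    print(message)
```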

Delete an experiment run

To delete an experiment run, use the bx ml delete experiment-runs command: bx ml delete experiment-runs <experiment-ID> <experiment-run-ID>. This also deletes the training runs under this experiment run.

bx ml delete experiment-runs c2e94a92-cefe-45b7-bc99-56420abcaa1a 6d46291f-2266-4c4c-bb74-6de79f9b9b18

Sample Output:

Deleting the experiment-run '6d46291f-2266-4c4c-bb74-6de79f9b9b18' ...
OK
Delete experiment-run successful

Delete an experiment

To delete an experiment, use the bx ml delete experiments command: bx ml delete experiments <experiment-ID>

bx ml delete experiments 422902ab-d384-4ce1-81aa-0350ca9e94b6

Sample Output:

Deleting the experiment '422902ab-d384-4ce1-81aa-0350ca9e94b6' ...
OK
Delete experiment successful

Learn more

To train deep learning models with IBM Watson Machine Learning experiments, check out these sample notebooks.
