Distributed deep learning

Training deep learning models can be significantly accelerated with distributed computing on GPUs. IBM Watson Machine Learning supports speeding up training using data parallelism, with three types of distributed learning:

Restrictions:

  • Distributed deep learning is currently in beta.
  • The Lite plan of IBM Watson Machine Learning does not support distributed deep learning.
  • Only TensorFlow is supported at this stage.
  • Online deployment (scoring) is supported only for native TensorFlow.

TensorFlow

Refer to the documentation here to learn more about distributed TensorFlow. Out of the box, TensorFlow supports distribution using a parameter server and worker approach. Think of this as a master-worker setup, where the workers are responsible for carrying out the work and the master is responsible for sharing what is learned (the calculated weights) among the workers.

In our current approach, all nodes start as equals. Each node is assigned an ID, which you can read from the environment variable $LEARNER_ID, and a host name with the prefix $LEARNER_NAME_PREFIX; the full host name is $LEARNER_NAME_PREFIX-$LEARNER_ID. Similarly, you can find the total number of nodes using $NUM_LEARNERS. It is the user's responsibility to write code that designates some of these nodes as parameter servers (at least one is needed) and others as workers (at least one is needed). We provide a sample launcher script that shows one approach to extracting this information.
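
As an illustration, here is a minimal sketch of how a launcher might read these environment variables and assign roles. The convention of treating learner 1 as the parameter server, and the assumption that learner IDs run from 1 to $NUM_LEARNERS, are choices made for this example only, not platform requirements.

```python
import os

# Identifiers provided by the platform.
learner_id = int(os.environ["LEARNER_ID"])
num_learners = int(os.environ["NUM_LEARNERS"])
name_prefix = os.environ["LEARNER_NAME_PREFIX"]

# Full host name of every node: $LEARNER_NAME_PREFIX-$LEARNER_ID
# (assumes IDs run from 1 to $NUM_LEARNERS).
hosts = ["%s-%d" % (name_prefix, i) for i in range(1, num_learners + 1)]

# Example convention: the first learner acts as the parameter server,
# all remaining learners act as workers.
if learner_id == 1:
    job_name, task_index = "ps", 0
else:
    job_name, task_index = "worker", learner_id - 2

print("node %d of %d -> %s:%d" % (learner_id, num_learners, job_name, task_index))
```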

When a distributed training run starts, these nodes come up, a gRPC server is started on each node on port 2222 (as in a standard distributed TensorFlow setup), and the command provided by the user as part of the manifest is executed. Refer to the example to see how the launcher script can be used to provide the appropriate task ID and job name to each node, so that a given node acts as a worker or a parameter server depending on its learner ID.
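
Continuing the sketch above, the derived job name and task index can be fed into the standard TensorFlow 1.x cluster APIs, with every node addressed on port 2222. The learner-1-as-parameter-server convention and 1-based learner IDs are again assumptions for this example, and whether the platform or the user command creates the tf.train.Server may depend on your setup; this sketch shows the usual TensorFlow 1.x pattern.

```python
import os
import tensorflow as tf

learner_id = int(os.environ["LEARNER_ID"])
num_learners = int(os.environ["NUM_LEARNERS"])
name_prefix = os.environ["LEARNER_NAME_PREFIX"]
hosts = ["%s-%d:2222" % (name_prefix, i) for i in range(1, num_learners + 1)]

# Example convention: learner 1 is the parameter server, the rest are workers.
cluster = tf.train.ClusterSpec({"ps": hosts[:1], "worker": hosts[1:]})
if learner_id == 1:
    job_name, task_index = "ps", 0
else:
    job_name, task_index = "worker", learner_id - 2

# Start this node's gRPC server and branch on its role.
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
if job_name == "ps":
    server.join()  # parameter servers serve variables until the job ends
else:
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        pass  # build the model graph here and run a MonitoredTrainingSession
```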

Requirements

API

To run TensorFlow distributed deep learning:

  • FRAMEWORK_NAME should be set to tensorflow
  • FRAMEWORK_VERSION should be 1.5
  • COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, p100x2, or v100x2
  • The number of nodes should be higher than one

IBM Distributed Deep Learning Library for TensorFlow

The IBM Distributed Deep Learning (DDL) library for TensorFlow automatically distributes the computation across multiple nodes and multiple GPUs. Users start with single-node GPU training code and modify it with a few statements that activate DDL-based distribution, in order to leverage multi-GPU training.

Requirements

The user program must be written for single-GPU training.

API

To run DDL distributed deep learning:

  • FRAMEWORK_NAME should be set to tensorflow-ddl
  • FRAMEWORK_VERSION should be 1.5
  • COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, p100x2, or v100x2
  • The number of nodes should be higher than one

User code

See here for more details on DDL and a step-by-step process for modifying user code to enable DDL-based training and scoring in WML.

Horovod

Similar to IBM DDL, Horovod takes an approach with no parameter server: workers communicate amongst themselves and learn from each other. Horovod is installed and configured for you if you decide to use it. Using the documentation mentioned here, you can run an existing Horovod example. As a user, you do not need to install anything or run the underlying MPI commands to orchestrate the processes; you simply run your command, and we take care of setting up the underlying infrastructure and orchestration. Refer to a sample example [here] on how to run the examples listed [here](https://github.com/uber/horovod/tree/master/examples).
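
For orientation, here is a minimal sketch of what a typical Horovod TensorFlow 1.x training script looks like, following the patterns used in the examples linked above. The toy model, hyperparameters, and checkpoint directory are placeholders for this illustration, not anything WML requires.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one training process is launched per GPU

# Pin this process to the GPU matching its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model: fit y = 3x with a single weight (stand-in for a real network).
x = np.random.rand(1024, 1).astype(np.float32)
y = 3.0 * x
w = tf.Variable(tf.zeros([1, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(tf.constant(x), w) - tf.constant(y)))

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so all workers start identically;
# only rank 0 writes checkpoints.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]
checkpoint_dir = "./checkpoints" if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```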

Requirements

API

To run Horovod distributed deep learning:

  • FRAMEWORK_NAME should be set to tensorflow-horovod
  • FRAMEWORK_VERSION should be 1.5
  • COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, p100x2, or v100x2
  • The number of nodes should be higher than one

User code

Coming soon!

Next Steps

References