Distributed deep learning
Training deep learning models can be significantly accelerated with distributed computing on GPUs. IBM Watson Machine Learning supports speeding up training through data parallelism with three types of distributed deep learning:
Restrictions:
- Distributed deep learning is currently in beta.
- The Lite plan of IBM Watson Machine Learning does not support distributed deep learning.
- Only TensorFlow is supported at this stage.
- Online deployment (scoring) is only supported for native TensorFlow.
TensorFlow
TensorFlow supports distributed training using a parameter server and worker approach. Think of this as a master-worker setup: the workers are responsible for carrying out the training work, and the master (parameter server) is responsible for sharing the learnings (the calculated weights) among the workers.
In our current approach, all nodes start as equals. Each node is given an ID, which you can read from the environment variable $LEARNER_ID, and a host name prefix, available as $LEARNER_NAME_PREFIX; the full host name of a node is $LEARNER_NAME_PREFIX-$LEARNER_ID. The total number of nodes is available as $NUM_LEARNERS. It is the user's responsibility to write code that designates some of these nodes as parameter servers (at least one is needed) and the rest as workers (at least one is needed). We provide a sample launcher script that shows one approach to extracting this information.
When distributed learning is started, the nodes come up following the standard distributed TensorFlow setup: a gRPC server is started on each node on port 2222, and the command provided by the user as part of the manifest is executed. Refer to the sample launcher script to see how it can provide the appropriate task ID and job name to each node, so that a node acts as a worker or a parameter server depending on its learner ID.
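For illustration, here is a minimal launcher sketch in Python (the sample launcher script we provide may look different). It assumes a hypothetical user program train.py that accepts --ps_hosts, --worker_hosts, --job_name, and --task_index flags and builds the usual tf.train.ClusterSpec and tf.train.Server from them; designating learner 1 as the parameter server is just one possible convention.

```python
#!/usr/bin/env python
"""Minimal launcher sketch: derive this node's role from the environment
variables described above and start the training program accordingly."""
import os
import subprocess
import sys

learner_id = int(os.environ["LEARNER_ID"])        # node id (assumed 1-based here)
prefix = os.environ["LEARNER_NAME_PREFIX"]        # host name prefix
num_learners = int(os.environ["NUM_LEARNERS"])    # total number of nodes

# Convention used in this sketch only: learner 1 is the parameter server,
# all remaining learners are workers. Port 2222 is the gRPC port.
ps_hosts = ["%s-1:2222" % prefix]
worker_hosts = ["%s-%d:2222" % (prefix, i) for i in range(2, num_learners + 1)]

if learner_id == 1:
    job_name, task_index = "ps", 0
else:
    job_name, task_index = "worker", learner_id - 2

# train.py is a hypothetical user program that builds a tf.train.ClusterSpec
# and tf.train.Server from these flags and runs the training loop.
cmd = [
    sys.executable, "train.py",
    "--ps_hosts", ",".join(ps_hosts),
    "--worker_hosts", ",".join(worker_hosts),
    "--job_name", job_name,
    "--task_index", str(task_index),
]
sys.exit(subprocess.call(cmd))
```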
Requirements
API
To run TensorFlow distributed deep learning:
- FRAMEWORK_NAME should be set to tensorflow
- FRAMEWORK_VERSION should be 1.13
- COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, or v100x2
- Number of nodes should be higher than one
IBM Distributed Deep Learning Library for TensorFlow
The IBM Distributed Deep Learning (DDL) library for TensorFlow automatically distributes the computation across multiple nodes and multiple GPUs. Users are required to start with single-node GPU training code and modify it with a few statements to activate DDL-based distribution and leverage multi-GPU training.
Requirements
The user program must be written for single-GPU training; a minimal example of such a program is sketched below.
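The sketch uses a toy model and random data as placeholders; the DDL-specific statements described in the linked documentation are then added on top of a program of this shape.

```python
# Minimal single-GPU TensorFlow 1.x training sketch (toy model, random data).
# This is the kind of program that DDL starts from and then distributes.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])

logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        batch_x = np.random.rand(32, 784).astype(np.float32)  # placeholder data
        batch_y = np.random.randint(0, 10, size=32)
        _, loss_value = sess.run([train_op, loss],
                                 feed_dict={x: batch_x, y: batch_y})
```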
API
To run DDL distributed deep learning:
- FRAMEWORK_NAME should be set to tensorflow-ddl
- FRAMEWORK_VERSION should be 1.13
- COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, or v100x2
- Number of nodes should be higher than one
User code
See here for more details on DDL and a step-by-step process for modifying user code to enable DDL-based training and scoring in WML.
Horovod
Similar to IBM DDL, Horovod takes an approach with no parameter server: the workers talk among themselves and learn from each other. Horovod is installed and configured for you if you decide to use it, so there is no need to install anything yourself or to run the underlying MPI commands to orchestrate the processes. You simply run your command, and we take care of setting up the underlying infrastructure and orchestration.
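To give a flavor of what Horovod user code looks like, here is a minimal sketch with TensorFlow 1.x (toy model and random placeholder data). It uses Horovod's standard TensorFlow API; the platform launches it on every node for you, so no mpirun invocation is needed.

```python
# Minimal Horovod + TensorFlow 1.x sketch (toy model, random data).
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                   # initialize Horovod

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast the initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for step in range(100):
        batch_x = np.random.rand(32, 784).astype(np.float32)  # placeholder data
        batch_y = np.random.randint(0, 10, size=32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```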
Requirements
API
To run Horovod distributed deep learning:
- FRAMEWORK_NAME should be set to tensorflow-horovod
- FRAMEWORK_VERSION should be 1.13
- COMPUTE_CONFIGURATION name should be any of: k80x2, k80x4, or v100x2
- Number of nodes should be higher than one
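The three configurations above differ only in the FRAMEWORK_NAME value. As a quick sanity check, the hypothetical helper below (not part of the WML API) validates a proposed set of values against the constraints listed in the three API sections.

```python
# Hypothetical helper (not part of the WML API) that checks proposed training
# settings against the constraints listed in the API sections above.
ALLOWED_FRAMEWORKS = {"tensorflow", "tensorflow-ddl", "tensorflow-horovod"}
ALLOWED_COMPUTE = {"k80x2", "k80x4", "v100x2"}

def check_distributed_config(framework_name, framework_version, compute_name, nodes):
    """Raise ValueError if the settings cannot run distributed deep learning."""
    if framework_name not in ALLOWED_FRAMEWORKS:
        raise ValueError("FRAMEWORK_NAME must be one of %s" % sorted(ALLOWED_FRAMEWORKS))
    if framework_version != "1.13":
        raise ValueError("FRAMEWORK_VERSION must be 1.13")
    if compute_name not in ALLOWED_COMPUTE:
        raise ValueError("COMPUTE_CONFIGURATION name must be one of %s" % sorted(ALLOWED_COMPUTE))
    if nodes <= 1:
        raise ValueError("number of nodes must be higher than one")

check_distributed_config("tensorflow-horovod", "1.13", "v100x2", 2)  # passes
```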
Next Steps
- Get started using these sample training runs, or create your own training runs.
- Go in depth with the following Developer Works article: Introducing deep learning and long-short term memory networks.