Spark environments

If your notebook includes Spark APIs, or you create machine learning models or model flows with Spark runtimes, you need to associate the tool with a Spark service or environment. With Spark environments, you can configure the size of the Spark driver and the size and number of the executors. You can use Spark environments for notebooks, the model builder and Spark modeler flows.

Spark environments for model flows are still in beta.

Spark options

In Watson Studio, you can use:

  • Spark environments offered under Watson Studio.

    All Watson Studio users can create Spark environments with varying hardware and software configurations. Spark environments offer Spark kernels as a service (SparkR, PySpark and Scala). Each kernel gets a dedicated Spark cluster and Spark executors. Spark environments are offered under Watson Studio and, like default environments, consume capacity unit hours (CUHs) that are tracked.

  • Spark services offered through IBM Cloud.

    With IBM Analytics Engine, you are offered Hortonworks Data Platform on IBM Cloud. You get one VM per cluster node and your own local HDFS. You get Spark and the entire Hadoop ecosystem. You are given shell access and can also create notebooks. IBM Analytics Engine is not offered under Watson Studio; it must be purchased separately through IBM Cloud. See Add associated services.

Default Spark environment definitions

Watson Studio offers the following default Spark environments. You can use these default Spark environment definitions to quickly get started with Spark in Watson Studio tools without having to create your own Spark environment definitions.

  • Default Spark Scala 2.11
    2 Executors: 1 vCPU and 4 GB RAM, Driver: 1 vCPU and 4 GB RAM
  • Default Spark Python 3.5 XS
    2 Executors: 1 vCPU and 4 GB RAM, Driver: 1 vCPU and 4 GB RAM

For model builder and modeler flows, you must select the Default Spark Scala 2.11 environment definition.

Configuration options for Spark environments

When you create a Spark environment definition, you can select the following configuration options (a sketch showing how to check these settings from a running notebook follows the list):

  • Driver hardware configuration. The driver creates the SparkContext which distributes the execution of jobs on the Spark cluster.
    • 1 vCPU and 4 GB RAM
    • 2 vCPU and 8 GB RAM
  • Executor hardware configuration. The executor is the process in charge of running the tasks in a given Spark job.
    • 1 vCPU and 4 GB RAM
    • 2 vCPU and 8 GB RAM
  • Number of executors: Select from 1 to 10 executors.
  • Spark version: Spark 2.3
  • Software version:
    • Scala 2.11
    • Python 2.7
    • Python 3.5
    • R 3.4
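The following is a minimal sketch, in PySpark, of how you might check these settings from inside a notebook that runs in a Spark environment. It assumes the environment already provides a running Spark session, so getOrCreate() simply attaches to it; the property names are standard Spark configuration keys and may not all be set explicitly by the environment.

```python
from pyspark.sql import SparkSession

# In a Spark environment the session is typically already running,
# so getOrCreate() attaches to it instead of starting a new cluster.
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

print(spark.version)                                    # Spark version, e.g. 2.3
print(conf.get("spark.driver.memory", "not set"))       # driver RAM
print(conf.get("spark.executor.memory", "not set"))     # executor RAM
print(conf.get("spark.executor.instances", "not set"))  # number of executors
```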

Notebooks and Spark environments

When you create a notebook, you can select the Spark runtime you want the notebook to run in. After you have created a Spark environment definition on the Environments page of your project, you can select to run your notebook in that environment at the time you create the notebook. Spark environments are available for the following notebook languages:

  • Scala 2.11 with Spark
  • Python 2.7 with Spark
  • Python 3.5 with Spark
  • R 3.4 with Spark

You can create more than one notebook with the same Spark environment definition. Every notebook kernel has its own dedicated Spark cluster, and the resources are not shared. For example, if you create two notebooks using the same Spark environment definition, two kernels are started, one for each notebook, and two clusters are created, each with its own Spark driver and set of Spark executors.
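As a quick check, each kernel reports its own Spark application. A minimal sketch, assuming the notebook runs in a Spark environment with a session available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Two notebooks that use the same environment definition report different
# application IDs, because each kernel runs on its own dedicated cluster.
print(sc.applicationId)
print(sc.master)
```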

You can edit an existing Spark environment definition after a notebook is created; however, to use the changed environment configuration, you must restart the runtime.

You can restart a Spark runtime:

  • From the Environment tab, which you open by clicking the Notebook Info icon when the notebook is open in edit mode.
  • By restarting the kernel from the Jupyter menu in the notebook.

You can learn to use Spark environments in Watson Studio by opening one of the Spark sample notebooks.

File system on a Spark cluster for notebooks

You can't access files in the temporary file system of a Spark cluster in a Spark environment because /tmp isn't a shared file system and can't be accessed by the Spark executors.

If you want to share files or libraries across nodes and kernels, you can load those files to the NFS mount at /home/spark/shared/user-libs/spark2.

To share files across kernels (a sketch follows these steps):

  1. Download your custom JAR files to /home/spark/shared/user-libs/spark2.
  2. Restart the kernel from the notebook menu by clicking Kernel > Restart Kernel to enable access to your JAR files.
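As an illustration, the following sketch copies a JAR file into the shared NFS mount from a notebook cell. The download URL and file name are placeholders, not part of Watson Studio; only the target directory comes from the steps above.

```python
import os
import urllib.request

shared_dir = "/home/spark/shared/user-libs/spark2"
jar_url = "https://example.com/libs/my-custom-lib.jar"  # placeholder URL
target = os.path.join(shared_dir, "my-custom-lib.jar")  # placeholder file name

os.makedirs(shared_dir, exist_ok=True)   # the mount normally already exists
urllib.request.urlretrieve(jar_url, target)
print("Copied to", target)

# Restart the kernel (Kernel > Restart Kernel) so the executors pick up the new JAR.
```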

Model builder or Spark modeler flows and Spark environments

Model builder and the flow editor must run in a Spark runtime. When you create a model or a Spark modeler flow in Watson Studio, you can select the Spark runtime you want the model builder or flow editor to run in. After you have created a Spark environment on the Environments page of your project, it will appear in the list of Spark runtimes you can select from at the time you create the model or Spark modeler flow.

The software version of the Spark environment for model builder or modeler flows must be Scala 2.11.

Runtime logs

When a Spark runtime is stopped, the accumulated logs are added to the IBM Cloud Object Storage bucket associated with the project. If you want to view these logs, download them from the IBM Cloud Object Storage bucket.
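For example, you could fetch the logs programmatically with the IBM Cloud Object Storage SDK for Python (ibm-cos-sdk). This is only a sketch: the credentials, endpoint, bucket name, and log prefix below are placeholders, and the actual object key names depend on your project and runtime.

```python
import ibm_boto3
from ibm_botocore.client import Config

# Placeholder credentials and endpoint; copy the real values from the
# service credentials of your project's Cloud Object Storage instance.
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="<api-key>",
    ibm_service_instance_id="<cos-resource-instance-id>",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us.cloud-object-storage.appdomain.cloud",
)

bucket = "<project-bucket>"          # the bucket associated with the project
prefix = "<spark-runtime-log-path>"  # placeholder; key layout depends on the runtime

response = cos.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    key = obj["Key"]
    cos.download_file(bucket, key, key.split("/")[-1])  # save each log locally
    print("Downloaded", key)
```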

Next steps