Spark environments

If your notebook includes Spark APIs, or you want to create machine learning models or model flows with Spark runtimes, you need to associate the tool with a Spark service or environment. With Spark environments, you can configure the size of the Spark driver and the size and number of the executors.

Spark options

In Watson Studio, you can use:

  • Spark environments offered under Watson Studio.

    All Watson Studio users can create Spark environments with varying hardware and software configurations. Spark environments offer Spark kernels as a service (SparkR, PySpark and Scala). Each kernel gets a dedicated Spark cluster and Spark executors. Spark environments are offered under Watson Studio and, like default environments, consume capacity unit hours (CUHs) that are tracked.

  • Spark services offered through IBM Cloud.

    With IBM Analytics Engine, you are offered Hortonworks Data Platform on IBM Cloud. You get one VM per cluster node and your own local HDFS. You get Spark and the entire Hadoop ecosystem. You are given shell access and can also create notebooks. IBM Analytics Engine is not offered under Watson Studio; it must be purchased separately through IBM Cloud. See Add associated services.

Preset Spark environment definitions

Watson Studio offers the following preset Spark environment definitions. You can use the preset Spark environment definitions to quickly get started with Spark in Watson Studio tools without having to create your own Spark environment definitions.

For model builder and modeler flows, select a Spark Scala 2.11 environment definition. For Data Refinery flows, select a Spark R environment.

Environment | Hardware configuration
Default Spark Python 3.6 XS | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
Default Spark R 3.4 | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
Default Spark Scala 2.11 | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM

Notebooks and Spark environments

When you create a notebook, you can select the Spark runtime you want the notebook to run in. You can select a preset Spark environment definition or a Spark environment definition you created from the Environments page of your project.

You can create more than one notebook and select the same Spark environment definition. Every notebook associated with the environment has its own dedicated Spark cluster and no resources are shared. For example, if you create two notebooks using the same Spark environment definition, two Spark clusters are started, one for each notebook, which means that each notebook has its own Spark driver and set of Spark executors.
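The resource arithmetic behind this can be sketched in a few lines of Python. The SparkEnvironment class and the total_footprint function are hypothetical helpers for illustration only, not part of Watson Studio. The numbers are the preset hardware configuration listed earlier: 2 executors and a driver, each with 1 vCPU and 4 GB RAM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkEnvironment:
    """Hypothetical model of the hardware in a Spark environment definition."""
    executors: int
    executor_vcpus: int
    executor_ram_gb: int
    driver_vcpus: int
    driver_ram_gb: int

    def cluster_vcpus(self) -> int:
        # One cluster = all executors plus one driver.
        return self.executors * self.executor_vcpus + self.driver_vcpus

    def cluster_ram_gb(self) -> int:
        return self.executors * self.executor_ram_gb + self.driver_ram_gb


def total_footprint(env: SparkEnvironment, notebooks: int) -> tuple:
    """Return (vCPUs, GB RAM) when each notebook gets its own dedicated cluster."""
    return (notebooks * env.cluster_vcpus(), notebooks * env.cluster_ram_gb())


# Preset configuration: 2 executors and 1 driver, each 1 vCPU and 4 GB RAM.
PRESET = SparkEnvironment(
    executors=2,
    executor_vcpus=1,
    executor_ram_gb=4,
    driver_vcpus=1,
    driver_ram_gb=4,
)

print(total_footprint(PRESET, 1))  # (3, 12)
print(total_footprint(PRESET, 2))  # (6, 24)
```

Because no resources are shared, the footprint grows linearly with the number of running notebooks, even when they all use the same environment definition.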

You can edit an existing Spark environment definition and change the hardware configuration after a notebook is created. However, to make the notebook use the changed environment configuration, you must stop and then restart the Spark environment runtime.

To stop and restart the Spark runtime in an opened notebook:

  1. Save your notebook changes.
  2. Click the Notebook Info icon from the notebook toolbar and select Environment.
  3. Stop the active runtime.
  4. Select the Spark environment you changed.
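
After the restart, one way to check that the notebook picked up the changed hardware configuration is to look at the Spark configuration of the running session. The following is a minimal sketch, not an official procedure. The hardware_settings helper is hypothetical, and it assumes that the runtime reports the driver and executor sizing through the standard Spark property names listed in the code, which might not all be set in every environment.

```python
# Standard Spark property names for driver and executor sizing.
# Assumption: the runtime sets these; any that are missing are reported as such.
HARDWARE_KEYS = (
    "spark.driver.cores",
    "spark.driver.memory",
    "spark.executor.instances",
    "spark.executor.cores",
    "spark.executor.memory",
)


def hardware_settings(conf_pairs):
    """Pick the driver and executor sizing properties out of (key, value) pairs."""
    conf = dict(conf_pairs)
    return {key: conf.get(key, "<not set>") for key in HARDWARE_KEYS}


# Illustrative pairs standing in for the output of sc.getConf().getAll().
sample = [
    ("spark.app.name", "notebook"),
    ("spark.executor.instances", "2"),
    ("spark.executor.memory", "4g"),
    ("spark.driver.memory", "4g"),
]

for key, value in hardware_settings(sample).items():
    print(f"{key} = {value}")
```

In a PySpark notebook where a SparkContext is available as sc (an assumption about the kernel setup), you could call hardware_settings(sc.getConf().getAll()) instead of passing the sample list.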

You can learn to use Spark environments in Watson Studio by opening one of the following sample notebooks:

File system on a Spark cluster for notebooks

Files in the temporary file system (/tmp) of a Spark cluster in a Spark environment can’t be accessed by the Spark executors because /tmp isn’t a shared file system.

If you want to share files or libraries across nodes and kernels, you can load those files to the NFS mount at /home/spark/shared/user-libs/spark2.

To share files across kernels:

  1. Download your custom JAR files to /home/spark/shared/user-libs/spark2.
  2. Restart the kernel from the notebook menu by clicking Kernel > Restart Kernel so that the kernel can access your JAR files.
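
If the JAR file is already on the node where your notebook kernel runs, the first step can be done from a notebook cell. The following is a minimal sketch: stage_jar is a hypothetical helper, and it assumes that the kernel has write access to the NFS mount and that the source path you pass in exists.

```python
import shutil
from pathlib import Path

# NFS mount that is shared across nodes and kernels.
SHARED_LIB_DIR = Path("/home/spark/shared/user-libs/spark2")


def stage_jar(jar_path, shared_dir=SHARED_LIB_DIR):
    """Copy a JAR file into the shared directory and return the destination path."""
    source = Path(jar_path)
    if source.suffix != ".jar":
        raise ValueError(f"expected a .jar file, got {source.name}")
    destination = Path(shared_dir) / source.name
    # copy2 keeps the file's timestamps along with its contents.
    shutil.copy2(source, destination)
    return destination
```

For example, stage_jar("/tmp/my-udfs.jar") (a hypothetical file name) copies the file to the shared directory. You still need to restart the kernel afterward, as described in step 2.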

Model builder or Spark modeler flows and Spark environments

Model builder and the flow editor must run in a Spark Scala 2.11 environment runtime. When you create a model or a Spark modeler flow in Watson Studio, you can select the Spark runtime you want the model builder or flow editor to run in. If you create your own Spark environment definition from the Environments page of your project, it will appear in the list of Spark runtimes you can select from at the time you create the model or Spark modeler flow.

Runtime logs

When a Spark runtime is stopped, the accumulated logs are added to the IBM Cloud Object Storage bucket associated with the project. If you want to view these logs, download them from the IBM Cloud Object Storage bucket.

Limitations

The following limitation exists:

  • You can’t customize the software configuration of a Spark environment definition you created.

Next steps