Spark environments

If your notebook includes Spark APIs, or you want to create machine learning models or model flows with Spark runtimes, you need to associate the tool with a Spark service or environment. With Spark environments, you can configure the size of the Spark driver and the size and number of the executors.

Spark options

In Watson Studio, you can use:

  • Spark environments offered under Watson Studio.

    All Watson Studio users can create Spark environments with varying hardware and software configurations. Spark environments offer Spark kernels as a service (SparkR, PySpark, and Scala). Each kernel gets a dedicated Spark cluster with its own Spark executors. Spark environments consume capacity unit hours (CUHs), which are tracked.

  • Spark services offered through IBM Cloud.

    With IBM Analytics Engine, you are offered Hortonworks Data Platform on IBM Cloud. You get one VM per cluster compute node and your own local HDFS. You get Spark and the entire Hadoop ecosystem. You are given shell access and can also create notebooks. IBM Analytics Engine is not offered under Watson Studio; it must be purchased separately through IBM Cloud. See Add associated services.

Default environment definitions

Watson Studio offers default Spark environment definitions that you can use to quickly get started with Spark in Watson Studio tools without having to create your own Spark environment definitions.

Environment                         Hardware configuration
Default Spark 3.0 & Python 3.7      2 executors, each: 1 vCPU and 4 GB RAM; driver: 1 vCPU and 4 GB RAM
Default Spark 2.4 & Python 3.6      2 executors, each: 1 vCPU and 4 GB RAM; driver: 1 vCPU and 4 GB RAM
Default Spark 3.0 & R 3.6           2 executors, each: 1 vCPU and 4 GB RAM; driver: 1 vCPU and 4 GB RAM
Default Spark 2.4 & R 3.6           2 executors, each: 1 vCPU and 4 GB RAM; driver: 1 vCPU and 4 GB RAM
Default Spark 3.0 & Scala 2.12      2 executors, each: 1 vCPU and 4 GB RAM; driver: 1 vCPU and 4 GB RAM
Default Spark 2.4 & Scala 2.11      2 executors, each: 1 vCPU and 4 GB RAM; driver: 1 vCPU and 4 GB RAM

When you run a notebook in a Spark environment, you might expect the runtime to consume only the resources listed in the hardware configuration of the environment definition. In practice, it consumes more, because extra resources must be allocated for the Jupyter Enterprise Gateway, the Spark master, and the Spark worker daemons. This overhead amounts to 1 vCPU and 2 GB of RAM for the driver, plus 1 GB of RAM for each executor.

As an example: if you create a notebook and select Default Spark 3.0 & Python 3.7, the Spark cluster itself consumes 3 vCPU and 12 GB RAM. Because 1 vCPU and 4 GB RAM of overhead are also required, the total resource consumption is 4 vCPU and 16 GB RAM.
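The arithmetic above can be sketched as a small helper. This is purely illustrative (not a Watson Studio API); the overhead values are the ones stated above, and `total_consumption` is a hypothetical function name:

```python
def total_consumption(executors, exec_vcpu, exec_gb, driver_vcpu, driver_gb):
    """Return (vCPU, GB RAM) a Spark environment actually consumes.

    Overhead assumed from the docs above: 1 vCPU and 2 GB RAM for the
    driver (Jupyter Enterprise Gateway, Spark master, worker daemons),
    plus 1 GB RAM per executor.
    """
    driver_overhead_vcpu, driver_overhead_gb = 1, 2
    exec_overhead_gb = 1

    vcpu = driver_vcpu + driver_overhead_vcpu + executors * exec_vcpu
    gb = (driver_gb + driver_overhead_gb
          + executors * (exec_gb + exec_overhead_gb))
    return vcpu, gb

# Default Spark 3.0 & Python 3.7: 2 executors at 1 vCPU / 4 GB each,
# driver at 1 vCPU / 4 GB:
print(total_consumption(2, 1, 4, 1, 4))  # (4, 16)
```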

Notebooks and Spark environments

When you create a notebook, you can select the Spark runtime you want the notebook to run in. You can select a default Spark environment definition or a Spark environment definition you created from the Environments page of your project.

You can create more than one notebook and select the same Spark environment definition. Every notebook associated with the environment has its own dedicated Spark cluster and no resources are shared. For example, if you create two notebooks using the same Spark environment definition, two Spark clusters are started, one for each notebook, which means that each notebook has its own Spark driver and set of Spark executors.

You can edit an existing Spark environment definition and change the hardware configuration after a notebook is created. However, to make the notebook use the changed environment configuration, you must stop and then restart the Spark environment runtime.

To stop and restart the Spark runtime in an opened notebook:

  1. Save your notebook changes.
  2. Click the Notebook Info icon on the notebook toolbar and select Environment.
  3. Stop the active runtime.
  4. Select the Spark environment you changed.

You can learn to use Spark environments in Watson Studio by opening one of the following sample notebooks:

File system on a Spark cluster for notebooks

You can’t access files in the temporary file system of a Spark cluster in a Spark environment because /tmp isn’t a shared file system and can’t be accessed by the Spark executors.

If you want to share files or libraries across compute nodes and kernels, place those files on the NFS mount at /home/spark/shared/user-libs/spark2.

To share files across kernels:

  1. Download your custom JAR files to /home/spark/shared/user-libs/spark2.
  2. Restart the kernel from the notebook menu by clicking Kernel > Restart Kernel to enable access to your JAR files.
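The first step above can be sketched from inside a notebook. This is a minimal sketch, not an official API: the `share_jar` helper and the `my-library.jar` file name are hypothetical, and the target path is the NFS mount described above:

```python
import shutil
from pathlib import Path

# NFS mount shared across compute nodes and kernels (from the docs above).
SHARED_LIBS = Path("/home/spark/shared/user-libs/spark2")

def share_jar(jar_path, shared_dir=SHARED_LIBS):
    """Copy a local JAR into the shared directory; return the new path."""
    shared_dir = Path(shared_dir)
    shared_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(jar_path, shared_dir))

# Usage (hypothetical JAR name); afterwards, restart the kernel via
# Kernel > Restart Kernel so it picks up the new library:
# share_jar("my-library.jar")
```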

Runtime logs

When a Spark runtime is stopped, the accumulated logs are added to the IBM Cloud Object Storage bucket associated with the project. If you want to view these logs, download them from the IBM Cloud Object Storage bucket.

Next steps