Spark environments in a project | IBM Cloud Pak for Data as a Service

Last updated: Aug 31, 2021

Spark environments in a project

If your notebook includes Spark APIs, or you want to create machine learning models or model flows with Spark runtimes, you need to associate the tool with a Spark service or environment. With Spark environments, you can configure the size of the Spark driver and the size and number of the executors.

Spark options
Default environment definitions
Notebooks and Spark environments
File system on a Spark cluster
Runtime logs

Spark options

In Watson Studio, you can use:

Spark environments offered under Watson Studio.

All Watson Studio users can create Spark environments with varying hardware and software configurations. Spark environments offer Spark kernels as a service (SparkR, PySpark and Scala). Each kernel gets a dedicated Spark cluster and Spark executors. Spark environments consume capacity unit hours (CUHs) that are tracked.

Spark services offered through IBM Cloud.

With IBM Analytics Engine, you are offered Hortonworks Data Platform on IBM Cloud. You get one VM per cluster compute node and your own local HDFS. You get Spark and the entire Hadoop ecosystem. You are given shell access and can also create notebooks. IBM Analytics Engine is not offered under Watson Studio; it must be purchased separately through IBM Cloud. See Add associated services.

Default environment definitions

You can use the default Spark environment definitions to quickly get started with Spark notebooks in Watson Studio tools, without having to create your own environment definitions. The default environment definitions are listed on the project’s Environments page.

Environment	Hardware configuration
`Default Spark 3.0 & Python 3.7`	2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
`Default Spark 3.0 & R 3.6`	2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
`Default Spark 3.0 & Scala 2.12`	2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
`Default Spark 2.4 & Python 3.7`	2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
`Default Spark 2.4 & R 3.6`	2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
`Default Spark 2.4 & Scala 2.11`	2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
`Default Spark 2.3 & Scala 2.11`	2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM
`Default Spark 2.3 & R 3.4`	2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM

Note: When you start a Spark environment, extra resources are needed for the Jupyter Enterprise Gateway, the Spark Master, and the Spark worker daemons. These extra resources amount to 1 vCPU and 2 GB of RAM for the driver and 1 GB of RAM for each executor, so take them into account when you select the hardware size of a Spark environment. For example, if you create a notebook and select Default Spark 3.0 & Python 3.7, the Spark cluster consumes 3 vCPU and 12 GB RAM. Of these, 1 vCPU and 4 GB RAM go to the extra resources (2 GB for the driver plus 1 GB for each of the 2 executors), which leaves 2 vCPU and 8 GB RAM for the notebook.
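The arithmetic in the note can be sketched in a few lines of Python. This is only an illustration of the calculation, assuming the overhead figures quoted above; the names `Size`, `cluster_total`, `overhead`, and `available` are hypothetical helpers, not part of any Watson Studio API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    vcpu: int
    ram_gb: int


# Overheads quoted in the note: 1 vCPU and 2 GB RAM for the driver,
# plus 1 GB RAM for each executor.
DRIVER_OVERHEAD = Size(vcpu=1, ram_gb=2)
EXECUTOR_OVERHEAD_RAM_GB = 1


def cluster_total(driver: Size, executor: Size, num_executors: int) -> Size:
    """Resources consumed by the whole cluster (driver plus all executors)."""
    return Size(
        vcpu=driver.vcpu + executor.vcpu * num_executors,
        ram_gb=driver.ram_gb + executor.ram_gb * num_executors,
    )


def overhead(num_executors: int) -> Size:
    """Extra resources taken by the gateway, Spark Master, and worker daemons."""
    return Size(
        vcpu=DRIVER_OVERHEAD.vcpu,
        ram_gb=DRIVER_OVERHEAD.ram_gb + EXECUTOR_OVERHEAD_RAM_GB * num_executors,
    )


def available(driver: Size, executor: Size, num_executors: int) -> Size:
    """Resources left for the notebook after subtracting the overhead."""
    total = cluster_total(driver, executor, num_executors)
    extra = overhead(num_executors)
    return Size(vcpu=total.vcpu - extra.vcpu, ram_gb=total.ram_gb - extra.ram_gb)


# Default Spark 3.0 & Python 3.7: driver and 2 executors, each 1 vCPU and 4 GB RAM.
default = Size(vcpu=1, ram_gb=4)
print(cluster_total(default, default, 2))  # Size(vcpu=3, ram_gb=12)
print(available(default, default, 2))      # Size(vcpu=2, ram_gb=8)
```

The same helpers can be used to estimate what remains for a custom environment definition by passing its driver size, executor size, and executor count.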

Notebooks and Spark environments

When you create a notebook, you can select the Spark runtime you want the notebook to run in. You can select a default Spark environment definition or a Spark environment definition you created from the Environments page of your project.

You can create more than one notebook and select the same Spark environment definition. Every notebook associated with the environment has its own dedicated Spark cluster and no resources are shared. For example, if you create two notebooks using the same Spark environment definition, two Spark clusters are started, one for each notebook, which means that each notebook has its own Spark driver and set of Spark executors.

You can learn to use Spark environments in Watson Studio by opening the following sample notebook:

Use Spark ML and Scala to detect network intrusions

File system on a Spark cluster

If you want to share files across executors and the driver or kernel of a Spark cluster, you can use the shared file system at /home/spark/shared.
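As a minimal sketch of how the shared file system might be used from a notebook, the following Python helpers write a small JSON file under the shared directory and read it back. Only the path `/home/spark/shared` comes from this page; the helper names `save_shared` and `load_shared` and the file name in the usage note are hypothetical.

```python
import json
from pathlib import Path

# Shared file system path named in the text above.
SHARED_DIR = Path("/home/spark/shared")


def save_shared(name, payload, shared_dir=SHARED_DIR):
    """Write a JSON-serializable payload to a file in the shared directory."""
    target = Path(shared_dir) / name
    target.write_text(json.dumps(payload))
    return target


def load_shared(name, shared_dir=SHARED_DIR):
    """Read a JSON file back from the shared directory."""
    return json.loads((Path(shared_dir) / name).read_text())
```

For example, the driver could call `save_shared("lookup.json", {"DE": "Germany"})`, and code running on an executor could then call `load_shared("lookup.json")` to read the same file, assuming both see the shared file system at the same path.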

If you want to use your own custom libraries, you can store them under /home/spark/shared/user-libs/. The subdirectories under /home/spark/shared/user-libs/ are pre-configured to be made available to the Python, R, and Scala or Java runtimes.

The following table lists the pre-configured subdirectories where you can add your custom libraries.

Directory	Type of library
`/home/spark/shared/user-libs/python3/`	Python 3 libraries
`/home/spark/shared/user-libs/R/`	R packages
`/home/spark/shared/user-libs/spark2/`	Java or Scala JAR files

To share libraries across a Spark driver and executors:

Download your custom libraries or JAR files to the appropriate pre-configured directory.
Restart the kernel from the notebook menu by clicking Kernel > Restart Kernel. This loads your custom libraries or JAR files in Spark.

Note that these libraries are not persisted. When you stop the environment runtime and restart it again later, you need to load the libraries again.
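The first step above can be sketched in Python as follows. This is an illustrative helper, not an IBM-provided API: the function `stage_library` and the `kind` keys are hypothetical, and only the directory names are taken from the table above. The sketch copies a file that is already available locally into the matching pre-configured directory; you still need to restart the kernel afterwards.

```python
import shutil
from pathlib import Path

# Base directory and subdirectories as listed in the table above.
USER_LIBS = Path("/home/spark/shared/user-libs")
SUBDIRS = {
    "python": "python3",  # Python 3 libraries
    "r": "R",             # R packages
    "jar": "spark2",      # Java or Scala JAR files
}


def stage_library(src, kind, user_libs=USER_LIBS):
    """Copy a library file into the pre-configured directory for its type."""
    if kind not in SUBDIRS:
        raise ValueError(f"unknown library kind: {kind!r}")
    dest_dir = Path(user_libs) / SUBDIRS[kind]
    # The directories already exist in a Spark environment; creating them
    # here only keeps the sketch runnable elsewhere.
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir))
```

For example, `stage_library("my-udfs.jar", "jar")` would place a hypothetical JAR file in /home/spark/shared/user-libs/spark2/. Because the libraries are not persisted, a call like this has to be repeated each time the runtime is restarted.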

Runtime logs

When a Spark runtime is stopped, the accumulated logs are added to the IBM Cloud Object Storage bucket associated with the project. If you want to view these logs, download them from the IBM Cloud Object Storage bucket.
