Spark environments for Data Refinery

After you have created a Data Refinery flow with the steps to reshape the data in your data set, and you want to run that flow against your entire data set, you must select the compute runtime to run it in.

With a default Spark R 3.4 environment, each Data Refinery flow runs in a dedicated Spark cluster. The Spark environment runtime starts when you select to run the Data Refinery flow and stops after the run against your data set has finished.

Spark R environments for Data Refinery flows are currently still in beta.

Runtime options

When you run a Data Refinery flow, you can select to use a Spark R environment or you can continue using the Data Refinery Default.

  • Spark R environments

    With Spark R 3.4 environments, you can configure the size of the Spark driver and the size and number of the executors, depending on the size of your data set.

    Each Spark environment consists of one SparkR kernel as a service. The kernel has a dedicated Spark cluster and Spark executors. When you select to run a Data Refinery flow in a Spark R environment, it runs in its own cluster.

  • Data Refinery Default

    None - Use Data Refinery Default is the runtime that has always existed in Data Refinery. It is still the default runtime that is used when you refine data and create data flow steps in Data Refinery.

    One custom Spark cluster is shared by all Data Refinery users, both for the work that is done on data set samples in Data Refinery and for running all data flows.

    The None - Use Data Refinery Default runtime consumes 6 capacity units per hour. You are charged for this consumption.
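
As a rough, worked illustration of that rate: capacity unit hours are the hourly rate multiplied by how long the runtime is active. The 6 capacity units per hour figure comes from the description above; the 30-minute run duration in the sketch is only an assumed example.

```R
# Illustrative only: CUH consumed = hourly rate x run duration in hours.
# The 6 CUH per hour rate is the documented rate for the Data Refinery Default
# runtime; the 30-minute run duration is an assumed example value.
rate_cuh_per_hour  <- 6
run_duration_hours <- 30 / 60          # assume a run that takes 30 minutes

cuh_consumed <- rate_cuh_per_hour * run_duration_hours
cuh_consumed                           # 3 capacity unit hours
```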

Default Spark R environment

Watson Studio offers the following default Spark R environment for Data Refinery flows. You can use this default Spark environment to quickly get started with running data flows in Data Refinery without having to create your own Spark R environment.

  • Default Spark R 3.4 (Beta)
    2 Executors: 1 vCPU and 4 GB RAM, Driver: 1 vCPU and 4 GB RAM
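
Watson Studio provisions this hardware for you when the runtime starts, so there is nothing to configure in code. Purely as a point of reference, the following sketch shows how a comparable sizing could be expressed with standard Apache Spark configuration properties in a standalone SparkR session; it is not how the Default Spark R 3.4 (Beta) environment is actually defined.

```R
# Illustrative only: the Default Spark R 3.4 (Beta) sizing expressed as standard
# Apache Spark configuration properties in a standalone SparkR session.
# Watson Studio applies this sizing for you; you do not set these properties.
library(SparkR)

sparkR.session(
  appName = "sizing-example",             # hypothetical application name
  sparkConfig = list(
    spark.driver.memory      = "4g",      # driver: 1 vCPU and 4 GB RAM
    spark.driver.cores       = "1",
    spark.executor.memory    = "4g",      # each executor: 1 vCPU and 4 GB RAM
    spark.executor.cores     = "1",
    spark.executor.instances = "2"        # 2 executors
  )
)
```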

Configuration options for environments for data flows

You can create your own Spark R environment definitions for Data Refinery flows from the Environments tab in your project by clicking New environment definition.

You must have the Admin or Editor role within the project to create an environment definition.

When you create a Spark R environment definition in which to run your data flows, you can select the following configuration options:

  • Driver hardware configuration. The driver creates the SparkContext, which distributes the execution of jobs on the Spark cluster (see the sketch after this list).
    • 1 vCPU and 4 GB RAM
    • 2 vCPU and 8 GB RAM
  • Executor hardware configuration. The executor is the process in charge of running the tasks in a given Spark job.
    • 1 vCPU and 4 GB RAM
    • 2 vCPU and 8 GB RAM
  • Number of executors: Select from 1 to 10 executors.
  • Software version: You must select R 3.4.
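
To make the driver and executor roles in these options more concrete, here is a minimal standalone SparkR sketch (not code that you run inside Data Refinery): the driver creates the Spark session and its SparkContext and plans each job, and the executors run the job's tasks on partitions of the data. The file path, column name, and application name are hypothetical.

```R
# Minimal sketch of the driver/executor split in SparkR (illustrative only).
library(SparkR)

# The driver creates the Spark session (and with it the SparkContext),
# which plans jobs and distributes their tasks to the executors.
sparkR.session(appName = "driver-executor-example")   # hypothetical app name

# Read and filter a data set; the file path and column name are hypothetical.
df <- read.df("sales.csv", source = "csv", header = "true", inferSchema = "true")
big_orders <- filter(df, df$amount > 1000)

# count() triggers a Spark job: its tasks run in parallel on the executors,
# one task per partition, and the driver assembles the final result.
count(big_orders)

sparkR.session.stop()
```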

Your new environment definition is listed under Environment definitions on the Environments page of your project. It also appears in the list of Spark environments for you to choose from when you select to run your data flow.

Data flows and Spark environments

When you select to run a Data Refinery flow in a Spark R environment, a Spark runtime is started. On the data flow summary page, you can view the progress of the environment runtime initialization process.

As soon as the runtime has started, you can view the active Spark R runtime for the Data Refinery flow and monitor its capacity unit hour (CUH) usage from your project's Environments page.

If you need to cancel the data flow run while it's in progress, you can do this from the data flow's details page. Alternatively, you can select Stop from the ACTIONS menu for the active runtime on your project's Environments page.

For information on the capacity unit hours (CUHs) that are tracked for Spark R 3.4 environments, see CUH calculation for Spark environments.

Runtime logs

You can view the accumulated logs for the runs of a data flow by clicking the data flow from the project's Assets page and selecting the History tab.

Limitations for beta

The following limitations exist during beta:

  • If you schedule a data flow, the data flow will run in the None - Use Data Refinery Default runtime and not in a Spark R environment runtime, even if you associated a Spark R environment with your data flow in Data Refinery.
  • You can select to run your data flow in a Spark R 3.4 environment only if the data flow run is started from Data Refinery or from the data flow's details page. If you run a Data Refinery flow from your project's Assets page by clicking Run from the data flow's ACTIONS menu, the data flow is always run in the None - Use Data Refinery Default runtime, irrespective of the environment you chose for the data flow in Data Refinery.
  • The following operations in Data Refinery are not supported in a data flow that you run in a Spark R 3.4 environment:
    • All operations under Natural language
    • The Sample operation

FAQs

When should I use a Spark R 3.4 environment?

During beta, you can select to use either the None - Use Data Refinery Default or a Spark R 3.4 runtime. The Spark R 3.4 environment can be the Default Spark R 3.4 (Beta) or any Spark R 3.4 environment that you created.

If you are working on a small data set, you should select the None - Use Data Refinery Default runtime rather than a Spark R 3.4 environment. Although the Spark cluster in a Spark R 3.4 environment is fast and powerful, it takes time to create, and that startup overhead is noticeable when you run data flows on small data sets.