Data Refinery environments

After you have created a Data Refinery flow with the steps to reshape your data, you need to select the compute runtime in which to run the data flow against your entire data set.

With a default Spark R 3.4 environment, each Data Refinery flow runs in a dedicated Spark cluster. The Spark environment runtime starts when you run the Data Refinery flow and stops after the run against your data set finishes.

Environment options

When you run a Data Refinery flow, you can select one of the following environments:

  • Spark R 3.4 environments

    With a Spark R 3.4 environment, you can configure the size of the Spark driver, and the size and number of the executors, depending on the size of your data set (see the sizing sketch after this list). Always select a Spark R 3.4 environment to run Data Refinery flows that operate on large data sets.

    Each Spark environment consists of one SparkR kernel as a service. The kernel has a dedicated Spark cluster and Spark executors. When you run a Data Refinery flow in a Spark R environment, the flow runs in its own cluster.

    You can select the Default Spark R 3.4 environment definition provided by Watson Studio or create your own Spark R environment definition. The Spark R environments are HIPAA ready.

    A Spark R 3.4 runtime consumes capacity unit hours (CUHs) that are tracked. See CUH calculation for Spark environments. You are charged based on your Watson Studio service plan. For up-to-date information, see the Watson Studio pricing plans.

  • Data Refinery Default

    None - Use Data Refinery Default is the runtime that is used while you refine data and create data flow steps in Data Refinery. You can also select it as the environment runtime for your data flows. Select None - Use Data Refinery Default to run Data Refinery flows that operate on small data sets because this runtime is instantly available and doesn’t have to be started before the flow can run.

    One custom Spark cluster is shared by all Data Refinery users to run the data flows.

    This runtime consumes 6 capacity units per hour.
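
As a worked example of how capacity units accrue: a data flow that runs for 20 minutes on the Data Refinery Default runtime consumes 6 CU/hour × 20/60 hour = 2 capacity unit hours. Spark R 3.4 runtimes accrue CUHs at a rate that depends on their hardware configuration; see CUH calculation for Spark environments.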
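
If you create a custom Spark R environment definition, you choose the driver and executor sizes yourself. The following minimal Python sketch shows one way to reason about how many executors a data set might need; the suggested_executors function and its working-set multiplier are illustrative assumptions for this sketch, not Watson Studio parameters.

    import math

    def suggested_executors(dataset_gb, executor_ram_gb=4, working_set_multiplier=3):
        """Estimate the number of executors for a given data set size.

        Spark often needs a multiple of the raw data size in memory for
        shuffles and intermediate results; the multiplier here is a guess,
        not a Watson Studio recommendation.
        """
        required_gb = dataset_gb * working_set_multiplier
        return math.ceil(required_gb / executor_ram_gb)

    # An 8 GB data set with 4 GB executors suggests 6 executors.
    print(suggested_executors(8))

Under these assumptions, an 8 GB data set would call for about 6 executors of 4 GB RAM each, which is one reason to create a custom environment definition rather than rely on the two-executor preset described in the next section.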

Using the preset Spark R environment definition

Watson Studio offers the following preset Spark R environment definition, which you can select when you run Data Refinery flows. Selecting this preset definition helps you get started quickly with running data flows in Data Refinery, without having to create your own Spark R environment definition.

Environment         | Hardware configuration                                                   | Software version
Default Spark R 3.4 | Driver: 1 vCPU and 4 GB RAM; 2 executors, each with 1 vCPU and 4 GB RAM | R 3.4 with r-essentials

Data flows and Spark environments

After you have completed shaping your data in Data Refinery and are ready to run the data flow on the entire data set:

  1. Click the run Data Refinery flow icon on the Data Refinery toolbar.
  2. On the flow details page, select the environment to use and click Save and run. When you run a Data Refinery flow in a Spark R 3.4 environment, a Spark runtime is started. On the data flow summary page, you can view the progress of the environment runtime initialization process.

As soon as the runtime has started, you can see the active Spark R runtime for the Data Refinery flow and monitor its capacity unit hour (CUH) usage from your project’s Environments page.

If you need to cancel the data flow run while it’s in progress, you can do so from the data flow’s details page. Alternatively, you can select Stop from the ACTIONS menu for the active runtime on your project’s Environments page.

For information on the capacity unit hours (CUHs) that are tracked for Spark R 3.4 environments, see CUH calculation for Spark environments. You are charged based on your Watson Studio service plan. For up-to-date information, see the Watson Studio pricing plans.

Runtime logs

To view the accumulated logs for a data flow run:

  1. From the project’s Assets page, click the data flow for which you want to see logs.
  2. Select the data flow run and click View log from the Actions menu (three vertical dots).

Limitations

The following limitations exist:

  • Spark R 3.4 environments cannot be used to schedule data flow runs.
  • The manual stratified sampling operation in Data Refinery is not supported.

FAQs

When should I use a Spark R 3.4 environment?

If you are working on a small data set, select None - Use Data Refinery Default rather than a Spark R 3.4 environment. Although the SparkR cluster in a Spark R 3.4 environment is fast and powerful, it requires time to create, which is noticeable when you run Data Refinery flows on small data sets.
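
As an illustrative comparison (the timings here are assumptions, not documented figures): if starting a Spark cluster takes about 3 minutes and refining a small data set takes about 30 seconds, a run in a Spark R 3.4 environment spends most of its elapsed time on cluster startup, whereas the same run on the instantly available default runtime finishes in roughly the 30 seconds of actual processing.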

What is the difference between Data Refinery Spark R environments and Spark R environments in the project?

These environments are the same.

The Spark R environment can be either the Default Spark R 3.4 environment definition, which by default appears in the list of environments that you can select from when you run a Data Refinery flow, or a Spark R environment definition that you created before you run the Data Refinery flow.

Next steps