Data Refinery environments
After you create a Data Refinery flow with the steps that reshape your data, and you want to run it on your entire data set, you must select the compute runtime in which to run the data flow.
With a default Spark R 3.4 environment, each Data Refinery flow runs in a dedicated Spark cluster. The Spark environment runtime is started when you select to run the Data Refinery flow and is stopped after the data flow has finished running against your data set.
- Environment options
- Using the preset Spark R environment
- Data flows and Spark environments
- Runtime logs
Environment options
When you run a Data Refinery flow, you can select one of the following environments:
Spark R 3.4 environments
With a Spark R 3.4 environment, you can configure the size of the Spark driver and the size and number of the executors, depending on the size of the data set. You should always select a Spark R 3.4 environment to run Data Refinery flows that operate on large data sets.
Each Spark environment consists of one SparkR kernel as a service. The kernel has a dedicated Spark cluster and Spark executors. When you select to run a Data Refinery flow in a Spark R environment, it runs in its own cluster.
You can select the Default Spark R 3.4 environment definition provided by Watson Studio or create your own Spark R environment definition. The Spark R environments are HIPAA ready.
A Spark R 3.4 runtime consumes capacity unit hours (CUHs) that are tracked. See CUH calculation for Spark environments. You are charged based on your Watson Studio service plan. For up-to-date information, see the Watson Studio pricing plans.
Data Refinery Default
None - Use Data Refinery Default is the runtime that is used when you refine data and create data flow steps in Data Refinery. It can also be selected as the environment runtime for your data flows. You should select None - Use Data Refinery Default to run Data Refinery flows that operate on small data sets because the runtime is instantly available and doesn’t first have to be started before the flow can run.
One custom Spark cluster is shared by all Data Refinery users to run the data flows.
This runtime consumes 6 capacity units per hour.
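As a rough illustration of what the rate of 6 capacity units per hour means for a run, the following Python sketch estimates consumption for a given run time. The function name `estimate_capacity_units` is hypothetical, and the linear per-minute proration is an assumption; the billing granularity that Watson Studio actually applies may differ, so treat the result as an estimate only.

```python
# Rate stated in this topic for the Data Refinery Default runtime.
DATA_REFINERY_DEFAULT_UNITS_PER_HOUR = 6


def estimate_capacity_units(runtime_minutes: float,
                            units_per_hour: float = DATA_REFINERY_DEFAULT_UNITS_PER_HOUR) -> float:
    """Estimate capacity units consumed by a run.

    Assumption: usage is prorated linearly by the minute. The real
    metering granularity is not described here and may differ.
    """
    if runtime_minutes < 0:
        raise ValueError("runtime_minutes must not be negative")
    return units_per_hour * runtime_minutes / 60.0


if __name__ == "__main__":
    print(estimate_capacity_units(30))  # 3.0
    print(estimate_capacity_units(90))  # 9.0
```

The rate for Spark R 3.4 environments is not given in this topic, so pass it explicitly through `units_per_hour` once you have looked it up in CUH calculation for Spark environments.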
Using the preset Spark R environment definition
Watson Studio offers the following preset Spark R environment definition which you can select to use when you run Data Refinery flows. Selecting this Spark environment definition helps you to quickly get started with running data flows in Data Refinery without having to create your own Spark R environment definition.
|Environment|Hardware configuration|Software version|
|Default Spark R 3.4|2 executors, each with 1 vCPU and 4 GB RAM; driver with 1 vCPU and 4 GB RAM|R 3.4 with r-essentials|
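To see what the preset adds up to, the sketch below totals the vCPUs and RAM across the driver and executors listed in the table. The `NodeSize` class and `cluster_totals` function are hypothetical helpers written for this illustration and are not part of any Watson Studio API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSize:
    """Size of one Spark driver or executor (illustrative helper)."""
    vcpu: int
    ram_gb: int


def cluster_totals(driver: NodeSize, executor: NodeSize, num_executors: int) -> tuple[int, int]:
    """Return (total vCPUs, total RAM in GB) for a driver plus its executors."""
    if num_executors < 0:
        raise ValueError("num_executors must not be negative")
    total_vcpu = driver.vcpu + executor.vcpu * num_executors
    total_ram_gb = driver.ram_gb + executor.ram_gb * num_executors
    return total_vcpu, total_ram_gb


if __name__ == "__main__":
    # Values from the preset table: driver 1 vCPU / 4 GB, 2 executors of 1 vCPU / 4 GB each.
    preset = cluster_totals(NodeSize(1, 4), NodeSize(1, 4), num_executors=2)
    print(preset)  # (3, 12)
```

The same helper can be used to compare a custom environment definition against the preset before you create it.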
Data flows and Spark environments
After you have completed shaping your data in Data Refinery and are ready to run the data flow on the entire data set:
- Click from the Data Refinery toolbar.
- On the flow details page, select the environment to use and click Save and run. When you run a Data Refinery flow in a Spark R 3.4 environment, a Spark runtime is started. On the data flow summary page, you can view the progress of the environment runtime initialization process.
As soon as the runtime has started, you can see the active Spark R runtime for the Data Refinery flow and monitor its capacity unit hour (CUH) usage from your project’s Environments page.
If you need to cancel the data flow run while it’s in progress, you can do this from the data flow’s details page. Alternatively, you can select Stop from the ACTIONS menu for the active runtime on your project’s Environments page.
For information on the capacity unit hours (CUHs) that are tracked for Spark R 3.4 environments, see CUH calculation for Spark environments. You are charged based on your Watson Studio service plan. For up-to-date information, see the Watson Studio pricing plans.
Runtime logs
To view the accumulated logs for the run of a data flow:
- From the project’s Assets page, click the data flow for which you want to see logs.
- Select the data flow run and click View log from the Actions menu (three vertical dots).
The following limitations exist:
- Spark R 3.4 environments cannot be used to schedule data flow runs.
- The manual stratified sampling operation in Data Refinery is not supported.
When should I use a Spark R 3.4 environment?
If you are working on a small data set, you should select None - Non-distributed runtime and not a Spark R 3.4 environment. The reason is that, although the SparkR cluster in a Spark R 3.4 environment is fast and powerful, it requires time to create, which is noticeable when you run Data Refinery flows on small data sets.
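The guidance in this topic can be summarized as a simple decision rule, sketched below in Python. This is only an illustration: the topic does not define what counts as a "small" data set, so the `SMALL_DATA_SET_MAX_MB` cutoff is a made-up placeholder that you would tune yourself, and `suggest_environment` is a hypothetical helper, not a Watson Studio function. The rule also reflects the limitation that Spark R 3.4 environments cannot be used to schedule data flow runs.

```python
# Hypothetical cutoff: this topic does not say how large a "small" data set is.
SMALL_DATA_SET_MAX_MB = 100

# Labels used only for this illustration.
DEFAULT_RUNTIME = "Data Refinery Default"
SPARK_RUNTIME = "Spark R 3.4"


def suggest_environment(data_set_mb: float, scheduled: bool = False) -> str:
    """Suggest a runtime for a Data Refinery flow run."""
    if data_set_mb < 0:
        raise ValueError("data_set_mb must not be negative")
    if scheduled:
        # Spark R 3.4 environments cannot be used to schedule data flow runs.
        return DEFAULT_RUNTIME
    if data_set_mb <= SMALL_DATA_SET_MAX_MB:
        # The default runtime is instantly available, so no cluster start-up wait.
        return DEFAULT_RUNTIME
    # Large data sets benefit from a dedicated Spark cluster.
    return SPARK_RUNTIME


if __name__ == "__main__":
    print(suggest_environment(10))
    print(suggest_environment(5000))
    print(suggest_environment(5000, scheduled=True))
```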
What is the difference between Data Refinery Spark R environments and Spark R environments in the project?
These environments are the same.
The Spark R environment can be the Default Spark R 3.4 environment, which appears by default in the list of environments you can select from when you run a Data Refinery flow, or a Spark R environment definition that you created before running the Data Refinery flow.