Data Refinery environments

After you have created a Data Refinery flow with the steps to reshape your data and are ready to run the flow on your entire data set, you must select the compute runtime in which to run it.

With a default Spark R 3.4 environment, each Data Refinery flow runs in a dedicated Spark cluster. The Spark environment runtime is started when you run the Data Refinery flow and is stopped after the run against your data set finishes.

Using Spark R environments to run Data Refinery flows is currently in beta.

Environment options

When you run a Data Refinery flow, you can choose between a dedicated Spark R 3.4 environment and the Data Refinery Default runtime.

  • Spark R environments

    With Spark R 3.4 environments, you can configure the size of the Spark driver and the size and number of the executors, depending on the size of your data set. Select a dedicated Spark R environment to run your data flow when your data set is large.

    Each Spark environment consists of one SparkR kernel as a service, with its own dedicated Spark cluster and Spark executors. When you run a Data Refinery flow in a Spark R environment, the flow runs in its own cluster.

    You can select the preset default Spark R 3.4 environment definition provided by Watson Studio or create your own Spark R environment definition.

    A Spark R 3.4 runtime consumes capacity unit hours (CUHs) that are tracked. See Data flows and Spark environments.

  • Data Refinery Default

    None - Use Data Refinery Default is the runtime that has always existed in Data Refinery. It is still the default runtime that is used when you refine data and create data flow steps in Data Refinery.

    A single custom Spark cluster is shared by all Data Refinery users, both for the work done on data set samples in Data Refinery and for running all data flows.

    The None - Use Data Refinery Default runtime consumes 6 capacity units per hour (see the sketch that follows this list).
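
As a rough illustration of what an hourly rate means for a run, here is a minimal R sketch, assuming metering is simply linear in elapsed time (the exact calculation is described in CUH calculation for Spark environments):

    # Minimal sketch: capacity unit hours consumed by a run, assuming
    # linear metering (capacity units per hour multiplied by elapsed hours).
    cuh_consumed <- function(rate_per_hour, minutes) {
      rate_per_hour * (minutes / 60)
    }

    # A 30-minute run on the None - Use Data Refinery Default runtime:
    cuh_consumed(6, 30)   # 3 capacity unit hours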

Using the preset Spark R environment

Watson Studio offers the following preset Spark R environment definition that you can select when you run Data Refinery flows. Selecting this Spark environment definition helps you get started quickly with running data flows in Data Refinery, without having to create your own Spark R environment definition.

  • Default Spark R 3.4 (Beta)
    2 executors: 1 vCPU and 4 GB RAM each; driver: 1 vCPU and 4 GB RAM
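
In other words, the preset reserves 3 vCPUs and 12 GB RAM in total. A short sketch of that tally:

    # Total footprint of the Default Spark R 3.4 (Beta) preset,
    # computed from the specification listed above.
    executors  <- 2
    total_vcpu <- executors * 1 + 1   # 2 executor vCPUs + 1 driver vCPU = 3
    total_ram  <- executors * 4 + 4   # 8 GB executor RAM + 4 GB driver RAM = 12 GB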

Data flows and Spark environments

After you have completed shaping your data in Data Refinery and are ready to run the data flow on the entire data set:

  1. Click the run Data Refinery flow icon from the Data Refinery toolbar.
  2. On the flow details page, select the environment to use and click Save and run. When you run a Data Refinery flow in a Spark R 3.4 environment, a Spark runtime is started. On the data flow summary page, you can view the progress of the runtime initialization.

As soon as the runtime has started, you can see the active Spark R runtime for the Data Refinery flow and monitor its capacity unit hour (CUH) usage from your project’s Environments page.

If you need to cancel the data flow run while it’s in progress, you can do so from the data flow’s details page. Alternatively, you can select Stop from the ACTIONS menu for the active runtime on your project’s Environments page.

For information on the capacity unit hours (CUHs) that are tracked for Spark R 3.4 environments, see CUH calculation for Spark environments. You are charged based on your Watson Studio service plan. For up-to-date information, see the Watson Studio pricing plans.

Runtime logs

To view the accumulated logs for the run of a data flow:

  1. Click the data flow from the project’s Assets page and select the History tab.
  2. Select the flow run, and then select View log from the Actions menu (three vertical dots).

Limitations

The following limitations exist:

  • Spark R 3.4 environments cannot be used for scheduled data flow runs, or to run Data Refinery flows that are initiated from the project’s Assets page by clicking Run from the data flow’s ACTIONS menu.
  • The following operations in Data Refinery are not supported in a data flow that you run in a Spark R 3.4 environment:
    • All operations under Natural language
    • The Sample operation

FAQs

When should I use a Spark R 3.4 environment?

If you are working on a small data set, select the None - Use Data Refinery Default runtime rather than a Spark R 3.4 environment. Although the SparkR cluster in a Spark R 3.4 environment is fast and powerful, it takes time to create, and that startup time is noticeable when you run Data Refinery flows on small data sets (see the sketch that follows).
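
The sketch below makes the tradeoff concrete. The startup and throughput figures are hypothetical, illustrative assumptions, not measured values; the point is only that the fixed cluster-creation time dominates for small inputs and amortizes for large ones:

    # Illustrative sketch only: hypothetical startup and throughput figures.
    # Total run time = fixed cluster startup + time spent processing rows.
    total_minutes <- function(startup_min, rows, rows_per_min) {
      startup_min + rows / rows_per_min
    }

    total_minutes(3, 1e4, 1e5)   # small data set: 3.1 min, mostly startup
    total_minutes(3, 1e7, 1e5)   # large data set: 103 min, startup negligible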

What is the difference between Data Refinery Spark R environments and Spark R environments in the project?

These environments are the same. Using a Spark R environment to run a Data Refinery flow is in beta; the Spark R environment itself is not in beta.

You can select to use either the None - Use Data Refinery Default runtime or a Spark R 3.4 runtime. The Spark R environment can be the Default Spark R 3.4 (Beta), which appears by default in the list of environments that you can select from when you run a Data Refinery flow, or a Spark R environment definition that you created before running the Data Refinery flow.

Next steps