Data Refinery environments

After you have created a Data Refinery flow with the steps to reshape your data, you need to select the compute runtime in which to run the data flow against your entire data set.

With a default Spark R 3.4 environment, each Data Refinery flow runs in a dedicated Spark cluster. The Spark environment runtime starts when you run the Data Refinery flow and stops after the run against your data set finishes.

Environment options

When you run a Data Refinery flow, you can select one of the following environments:

  • Spark R 3.4 environments

    With a Spark R 3.4 environment, you can configure the size of the Spark driver, and the size and number of the executors, depending on the size of your data set (see the sizing sketch after this list). Always select a Spark R 3.4 environment to run Data Refinery flows that operate on large data sets.

    Each Spark environment consists of one SparkR kernel as a service. The kernel has a dedicated Spark cluster and Spark executors. When you run a Data Refinery flow in a Spark R environment, the flow runs in its own cluster.

    You can select the Default Spark R 3.4 environment definition provided by Watson Studio or create your own Spark R environment definition. The Spark R environments are HIPAA ready.

    A Spark R 3.4 runtime consumes capacity unit hours (CUHs) that are tracked. See CUH calculation for Spark environments. You are charged based on your Watson Studio service plan. For up-to-date information, see the Watson Studio pricing plans.

  • Data Refinery Default

    None - Use Data Refinery Default is the runtime that is used while you refine data and create data flow steps in Data Refinery. You can also select it as the environment runtime for your data flows. Select None - Use Data Refinery Default to run Data Refinery flows that operate on small data sets because this runtime is instantly available and doesn’t have to be started before the flow can run.

    One custom Spark cluster is shared by all Data Refinery users to run the data flows.

    This runtime consumes 6 capacity units per hour.
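
As a worked example of how capacity units accrue: a data flow that runs for 20 minutes on the Data Refinery Default runtime consumes 6 CU/hour × 20/60 hour = 2 capacity unit hours. Spark R 3.4 runtimes accrue CUHs at a rate that depends on their hardware configuration; see CUH calculation for Spark environments.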
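
If you create a custom Spark R environment definition, you choose the driver and executor sizes yourself. The following minimal Python sketch shows one way to reason about how many executors a data set might need; the suggested_executors function and its working-set multiplier are illustrative assumptions for this sketch, not Watson Studio parameters.

    import math

    def suggested_executors(dataset_gb, executor_ram_gb=4, working_set_multiplier=3):
        """Estimate the number of executors for a given data set size.

        Spark often needs a multiple of the raw data size in memory for
        shuffles and intermediate results; the multiplier here is a guess,
        not a Watson Studio recommendation.
        """
        required_gb = dataset_gb * working_set_multiplier
        return math.ceil(required_gb / executor_ram_gb)

    # An 8 GB data set with 4 GB executors suggests 6 executors.
    print(suggested_executors(8))

Under these assumptions, an 8 GB data set would call for about 6 executors of 4 GB RAM each, which is one reason to create a custom environment definition rather than rely on the two-executor preset described in the next section.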

Using the preset Spark R environment definition

Watson Studio offers the following preset Spark R environment definition, which you can select when you run Data Refinery flows. Selecting this preset definition helps you get started quickly with running data flows in Data Refinery, without having to create your own Spark R environment definition.

Environment         | Hardware configuration                                                   | Software version
Default Spark R 3.4 | Driver: 1 vCPU and 4 GB RAM; 2 executors, each with 1 vCPU and 4 GB RAM | R 3.4 with r-essentials

Data flows and Spark environments

After you have completed shaping your data in Data Refinery and are ready to run the data flow on the entire data set:

  1. Click the run Data Refinery flow icon on the Data Refinery toolbar.
  2. On the flow details page, select the environment to use and click Save and run. When you run a Data Refinery flow in a Spark R 3.4 environment, a Spark runtime is started. On the data flow summary page, you can view the progress of the environment runtime initialization process.

As soon as the runtime has started, you can see the active Spark R runtime for the Data Refinery flow and monitor its capacity unit hour (CUH) usage from your project’s Environments page.

If you need to cancel the data flow run while it’s in progress, you can do so from the data flow’s details page. Alternatively, you can select Stop from the ACTIONS menu for the active runtime on your project’s Environments page.

For information on the capacity unit hours (CUHs) that are tracked for Spark R 3.4 environments, see CUH calculation for Spark environments. You are charged based on your Watson Studio service plan. For up-to-date information, see the Watson Studio pricing plans.

Runtime logs

To view the accumulated logs for a data flow run:

  1. From the project’s Assets page, click the data flow for which you want to see logs.
  2. Select the data flow run and click View log from the Actions menu (three vertical dots).

Limitations

The following limitations exist:

  • Spark R 3.4 environments cannot be used to schedule data flow runs.
  • The manual stratified sampling operation in Data Refinery is not supported.

FAQs

When should I use a Spark R 3.4 environment?

If you are working on a small data set, select None - Use Data Refinery Default rather than a Spark R 3.4 environment. Although the SparkR cluster in a Spark R 3.4 environment is fast and powerful, it requires time to create, which is noticeable when you run Data Refinery flows on small data sets.
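
As an illustrative comparison (the timings here are assumptions, not documented figures): if starting a Spark cluster takes about 3 minutes and refining a small data set takes about 30 seconds, a run in a Spark R 3.4 environment spends most of its elapsed time on cluster startup, whereas the same run on the instantly available default runtime finishes in roughly the 30 seconds of actual processing.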

What is the difference between Data Refinery Spark R environments and Spark R environments in the project?

These environments are the same.

The Spark R environment can be either the Default Spark R 3.4 environment definition, which by default appears in the list of environments that you can select from when you run a Data Refinery flow, or a Spark R environment definition that you created before you run the Data Refinery flow.

Next steps