Data Refinery environments
In Data Refinery, a Spark R runtime is started when:
- You shape your data in Data Refinery
- You run a Data Refinery flow in a job
All runtimes consume capacity unit hours (CUHs) that are tracked.
- Shaping data in Data Refinery
- Running a Data Refinery flow
- Environment options in jobs
- Using the Default Spark R 3.4 environment
- Runtime logs for jobs
Shaping data in Data Refinery
When you choose to refine data in Data Refinery, a Default Data Refinery XS runtime is started automatically and is listed as an active runtime on the Environments page of your project.
Remember: When you close the refinery interface, the runtime isn’t stopped immediately. You must stop the runtime from the Environments page of your project. If you don’t explicitly stop the runtime, it is stopped for you after an idle time of 1 hour.
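The idle rule above can be sketched in a few lines of Python. This is only an illustration, not product code: the 1-hour limit comes from the text, while the function names and the assumption that idle time is measured from the last activity in the runtime are hypothetical.

```python
from datetime import datetime, timedelta

# The 1-hour idle limit is taken from the text above.
IDLE_LIMIT = timedelta(hours=1)


def auto_stop_time(last_activity: datetime) -> datetime:
    """Earliest time at which an idle runtime would be stopped for you.

    Assumption: idle time is counted from the last activity in the runtime.
    """
    return last_activity + IDLE_LIMIT


def is_auto_stopped(last_activity: datetime, now: datetime) -> bool:
    """True once the runtime has been idle for at least the idle limit."""
    return now - last_activity >= IDLE_LIMIT


# Hypothetical example: the refinery interface was closed at 09:00.
closed_at = datetime(2020, 1, 1, 9, 0)
print(auto_stop_time(closed_at))
print(is_auto_stopped(closed_at, closed_at + timedelta(minutes=30)))
```

Because the runtime keeps consuming CUHs until it stops, stopping it explicitly from the Environments page avoids paying for up to an hour of idle time.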
Running a Data Refinery flow
You can create a job in which to run your Data Refinery flow:
- Directly in Data Refinery, by creating a job from the Data Refinery toolbar
- From your project’s Jobs page
- From your project’s Assets page by selecting the Data Refinery flow and clicking ACTIONS > Create job
Environment options in jobs
When you create a job in which to run a Data Refinery flow, you can select one of the following environments:
Spark R 3.4 environment
With a Spark R 3.4 environment, the Data Refinery flow runs in its own Spark cluster. Each Spark environment consists of one SparkR kernel as a service. The kernel has a dedicated Spark cluster and Spark executors.
You can select the Default Spark R 3.4 environment definition provided by Watson Studio or create your own Spark R 3.4 environment definition.
If you create your own Spark R 3.4 environment definition, you can configure the size of the Spark driver and the size and number of the executors, depending on the size of the data set.
You should always select a Spark R 3.4 environment to run Data Refinery flows that operate on large data sets.
All Spark R 3.4 environments are HIPAA ready.
Default Data Refinery XS
The Default Data Refinery XS runtime is used when you refine data in Data Refinery. You can also select it as the environment runtime when you create a job in which to run your Data Refinery flow.
You should select the Default Data Refinery XS runtime to run Data Refinery flows that operate on small data sets because the runtime is instantly available and doesn’t have to be started before the job can run.
The Default Data Refinery XS runtime is HIPAA ready.
As soon as the runtime starts, it is listed as an active runtime on the Environments page of your project. The runtime is stopped when the Data Refinery job stops running.
Default Spark R 3.4
Watson Studio offers the following preset Spark R environment definition, which you can select when you create a job in which to run a Data Refinery flow. Selecting this environment definition lets you start running Data Refinery jobs quickly, without having to create your own Spark R environment definition.
Default Spark R 3.4 hardware configuration: 2 executors, each with 1 vCPU and 4 GB RAM; driver with 1 vCPU and 4 GB RAM
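To see what this preset adds up to, the following minimal sketch sums the driver and executor resources. The class and field names are hypothetical and are not part of any Watson Studio API; only the numbers come from the configuration listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Resources of a single Spark driver or executor."""
    vcpus: int
    ram_gb: int


@dataclass(frozen=True)
class SparkEnvironment:
    """Hypothetical model of a Spark R environment definition."""
    driver: Node
    executor: Node
    num_executors: int

    def total_vcpus(self) -> int:
        # One driver plus all executors.
        return self.driver.vcpus + self.executor.vcpus * self.num_executors

    def total_ram_gb(self) -> int:
        return self.driver.ram_gb + self.executor.ram_gb * self.num_executors


# Values from the Default Spark R 3.4 preset: 2 executors and 1 driver,
# each with 1 vCPU and 4 GB RAM.
default_spark_r = SparkEnvironment(
    driver=Node(vcpus=1, ram_gb=4),
    executor=Node(vcpus=1, ram_gb=4),
    num_executors=2,
)

print(default_spark_r.total_vcpus(), "vCPUs")   # 3 vCPUs
print(default_spark_r.total_ram_gb(), "GB RAM")  # 12 GB RAM
```

The same model can be reused to compare a custom environment definition against the preset before you decide on the number and size of executors.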
Runtime logs for jobs
To view the accumulated logs for a Data Refinery job:
- From the project’s Jobs page, click the job that ran the Data Refinery flow for which you want to see logs.
- Click the job run. You can view the log tail or download the complete log file.
When should I use a Spark R 3.4 environment?
If you are working on a small data set, you should select the Default Data Refinery XS runtime and not a Spark R 3.4 environment. Although the SparkR cluster in a Spark R 3.4 environment is fast and powerful, it takes time to create, and that startup delay is noticeable when you run Data Refinery jobs on small data sets.
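The guidance above amounts to a simple decision rule, sketched here in Python. The function and its parameters are hypothetical, and the documentation does not state a size threshold for "small", so the caller must supply one; only the two environment names come from this page.

```python
def choose_environment(data_size_mb: float, small_data_limit_mb: float) -> str:
    """Pick an environment name for a Data Refinery job.

    Assumption: `small_data_limit_mb` is a cutoff you choose yourself;
    this page does not define what counts as a small data set.
    """
    if data_size_mb < 0 or small_data_limit_mb < 0:
        raise ValueError("sizes must be non-negative")
    if data_size_mb <= small_data_limit_mb:
        # Small data: the XS runtime is available immediately.
        return "Default Data Refinery XS"
    # Large data: the cluster startup time is worth paying.
    return "Default Spark R 3.4"


# Hypothetical usage with a cutoff of 100 MB chosen purely for illustration.
print(choose_environment(data_size_mb=20, small_data_limit_mb=100))
print(choose_environment(data_size_mb=5000, small_data_limit_mb=100))
```

In practice, a sensible cutoff is best found by running the same flow in both environments and comparing total job time, including the time the Spark cluster takes to start.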
What is the difference between a Spark R 3.4 environment used in Data Refinery and the other Spark R environments in a project?
These environments are the same.
When you create a job in which to run a Data Refinery flow, you can select the Default Spark R 3.4 environment or any other Spark R environment definition that you created.