DataStage environments

In DataStage, a Default DataStage XS runtime is started when:

  • You shape your data in DataStage
  • You run a DataStage flow in a job

All runtimes consume capacity unit hours (CUHs) that are tracked.
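
As a rough illustration, a runtime's CUH consumption is its capacity-unit rate multiplied by the hours it stays active. The rates in the sketch below are hypothetical placeholders, not published values; check your service plan for the actual rates per environment:

    # Sketch of how capacity unit hours (CUHs) accrue. The rates below are
    # hypothetical placeholders, not published values for these runtimes.
    HYPOTHETICAL_RATES = {
        "Default DataStage XS": 0.5,       # capacity units per hour (assumed)
        "Default Spark 2.4 & R 3.6": 1.5,  # capacity units per hour (assumed)
    }

    def cuh_consumed(runtime_name: str, active_hours: float) -> float:
        """Capacity unit hours = capacity-unit rate * hours the runtime is active."""
        return HYPOTHETICAL_RATES[runtime_name] * active_hours

    # Two hours of an XS runtime at the assumed rate:
    print(cuh_consumed("Default DataStage XS", 2.0))  # 1.0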

Transforming data

When you choose to extract, transform, or load data in DataStage, a Default DataStage XS runtime is started automatically and is listed as an active runtime on the Environments page of your project.

Remember: When you close the flow interface, the runtime isn’t stopped immediately. Stop the runtime from the Environments page of your project. If you don’t explicitly stop the runtime, it is stopped for you after it has been idle for 1 hour.
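
The idle-stop rule is a simple time comparison. In the minimal sketch below, the one-hour limit comes from this page; everything else is illustrative:

    from datetime import datetime, timedelta

    IDLE_LIMIT = timedelta(hours=1)  # documented idle timeout for the runtime

    def should_auto_stop(last_activity: datetime, now: datetime) -> bool:
        """Models the documented rule: a runtime idle for more than one
        hour is stopped automatically."""
        return now - last_activity > IDLE_LIMIT

    # A runtime last used 90 minutes ago would be stopped:
    now = datetime.now()
    print(should_auto_stop(now - timedelta(minutes=90), now))  # True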

Running a flow

You can create a job in which to run your DataStage flow:

  • Directly in DataStage by clicking the Jobs icon on the DataStage toolbar and creating a job
  • From your project’s Jobs page
  • From your project’s Assets page by selecting the DataStage flow and clicking ACTIONS > Create job

Environment options in jobs

When you create a job in which to run a DataStage flow, you can select one of the following environments:

  • A Spark with R environment

    With a Spark with R environment, the DataStage flow runs in its own Spark cluster. Each Spark with R environment consists of one SparkR kernel as a service. The kernel has a dedicated Spark cluster and Spark executors.

    You can select a Spark with R environment definition provided by Watson Studio or create your own. If you create your own Spark with R environment definition, you can configure the size of the Spark driver and the size and number of the executors, depending on the size of the data set (a configuration sketch follows at the end of this section).

    You should always select a Spark with R environment to run DataStage flows that operate on large data sets.

    All Spark with R environments are HIPAA ready.

  • The Default DataStage XS environment

    The Default DataStage XS runtime is used when you extract, transform, and load data in DataStage. You can also select it as the environment runtime when you create a job in which to run your DataStage flow.

    You should select the Default DataStage XS runtime to run DataStage flows that operate on small data sets because the runtime is instantly available and doesn’t have to be started before the job can run.

    The Default DataStage XS runtime is HIPAA ready.

After the runtime is started, it is listed as an active runtime on the Environments page of your project. The runtime is stopped when the DataStage job finishes running.
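
The hardware configuration of a custom Spark with R environment definition comes down to a driver size plus an executor size and count. The minimal sketch below illustrates the idea only: the field names are assumptions made for this sketch, and real definitions are created through the Environments page rather than in code.

    # Illustrative shape of a custom Spark with R environment definition's
    # hardware configuration. Field names are assumptions made for this
    # sketch; real definitions are created from the Environments page.
    custom_spark_r_definition = {
        "name": "Spark 2.4 & R 3.6 (large data sets)",
        "driver": {"vcpu": 2, "ram_gb": 8},
        "executors": {"count": 4, "vcpu": 1, "ram_gb": 4},
    }

    def total_ram_gb(definition: dict) -> int:
        """Sum driver and executor memory -- a rough guide to how large a
        data set the environment can comfortably handle."""
        ex = definition["executors"]
        return definition["driver"]["ram_gb"] + ex["count"] * ex["ram_gb"]

    print(total_ram_gb(custom_spark_r_definition))  # 24

Scaling the executor count or memory is the lever for larger data sets; the Default DataStage XS runtime skips this sizing entirely, which is why it is instantly available but suits only small data sets.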

Default Spark with R environment definitions

Watson Studio offers the following default Spark with R environment definitions, which you can select when you create a job in which to run a DataStage flow. Selecting a default environment definition helps you get started quickly running DataStage jobs without having to create your own Spark with R environment definition. The default environment definitions are listed on the project’s Environments page.

Name                        Hardware configuration
Default Spark 2.4 & R 3.6   2 executors, each: 1 vCPU and 4 GB RAM; driver: 1 vCPU and 4 GB RAM
Default Spark 2.3 & R 3.4   2 executors, each: 1 vCPU and 4 GB RAM; driver: 1 vCPU and 4 GB RAM

Both default environment definitions have the same hardware configuration.

When you create a job in which to run a DataStage flow, you can select a default Spark with R environment (either Default Spark 2.3 & R 3.4 or Default Spark 2.4 & R 3.6) or any Spark with R environment definition that you created.

Runtime logs for jobs

To view the accumulated logs for a DataStage job:

  1. From the project’s Jobs page, click the job that ran the DataStage flow for which you want to see logs.
  2. Click the job run. You can view the log tail or download the complete log file.

Next steps