Data set

You can read data from or write data to a data set. You can use the data set as a source or target.

The data set can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode.

Parallel jobs use data sets to manage data within a job. Each link in a job carries a data set. The data set allows you to store data being operated on in a persistent form, which can then be used by other IBM® DataStage® jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs.

Double-click the data set to open the properties panel. The panel has up to three tabs, depending on whether you are reading or writing a data set:

Stage tab

You can specify the following Advanced properties:
  • Execution Mode. The stage can run in parallel mode or sequential mode. In parallel mode, the contents of the data set are processed by the available nodes as specified in the configuration file, and by any node constraints specified on the Advanced tab. In sequential mode, the entire contents of the data set are processed by the conductor node. A conceptual sketch of the two modes follows this list.
  • Combinability mode. This is Auto by default, which allows DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
  • Preserve partitioning. You can select Propagate, Set, or Clear. If you select Set, file read operations request that the next stage preserve the partitioning as is. Propagate takes the setting of the flag from the previous stage.
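
The following Python sketch is a conceptual illustration only, not DataStage code: it contrasts parallel mode, where each node processes its own partition of the data set, with sequential mode, where the conductor node processes the entire data set. The node names, partitions, and process_partition helper are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a data set, one per processing node named
# in the configuration file (node names and records are invented).
partitions = {
    "node1": [1, 2, 3],
    "node2": [4, 5, 6],
    "node3": [7, 8, 9],
}

def process_partition(records):
    """Stand-in for the work a downstream stage performs on one partition."""
    return [r * 10 for r in records]

# Parallel mode: each node processes only its own partition.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    parallel_results = list(pool.map(process_partition, partitions.values()))

# Sequential mode: the conductor node processes the entire data set.
all_records = [r for part in partitions.values() for r in part]
sequential_result = process_partition(all_records)

print(parallel_results)   # [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
print(sequential_result)  # [10, 20, 30, 40, 50, 60, 70, 80, 90]
```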

Input tab

The Input tab allows you to specify details about how data is written to a data set. The data set can have only one input link. The Target category includes two properties: File, which names the control file for the data set, and Update Policy, which specifies the action that is taken if the data set that you are writing to already exists.

Below is a description of each property on the Input tab:

File
The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention, the file has the suffix .ds.
Update policy
Specifies what action is taken if the data set you are writing to already exists; a sketch of these policies follows this list. Select one of the following options:
  • Append. Append any new data to the existing data.
  • Create (Error if exists). DataStage reports an error if the data set already exists.
  • Overwrite. Overwrites any existing data with new data.
  • Use existing (Discard records). Keeps the existing files listed in the descriptor file (for example, datasetname.ds or filesetname.fs) but discards the old records. You receive an error if a data set with a different schema already exists.
  • Use existing (Discard records and schema). Keeps existing files listed in a descriptor file (for example, datasetname.ds or filesetname.fs) but discards the old schema and records.

The default is Overwrite.
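
The following Python sketch is a conceptual model of the update policies; it is not DataStage code, and the file handling is simplified (a real data set is a control file plus data segment files, and its schema is stored internally). The write_data_set and read_schema helpers and the sidecar .schema file are invented for illustration.

```python
import json
import os

def read_schema(path):
    """Hypothetical helper: this sketch keeps the schema in a sidecar JSON file."""
    with open(path + ".schema") as f:
        return json.load(f)

def write_data_set(path, records, schema, policy="Overwrite"):
    """Illustrative model of the update policies, not the DataStage implementation."""
    exists = os.path.exists(path)

    if policy == "Create (Error if exists)" and exists:
        raise FileExistsError(f"Data set {path} already exists")

    if exists and policy.startswith("Use existing"):
        # Keep the existing files but discard the old records; with
        # "Discard records" alone, the schemas must still match.
        if policy == "Use existing (Discard records)" and read_schema(path) != schema:
            raise ValueError("Existing data set has a different schema")
        mode = "w"
    elif exists and policy == "Append":
        mode = "a"                      # keep existing records, add the new ones
    else:
        mode = "w"                      # Overwrite, the default policy

    with open(path, mode) as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    with open(path + ".schema", "w") as f:
        json.dump(schema, f)

# Example usage with invented data:
schema = {"id": "integer", "name": "varchar"}
write_data_set("example.ds", [{"id": 1, "name": "a"}], schema)            # Overwrite
write_data_set("example.ds", [{"id": 2, "name": "b"}], schema, "Append")  # adds a record
```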

Output tab

On the Output tab, you can specify details about how data is read from the data set. You can change the default buffer settings for the output link and view the column definitions.
File
The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention the file has the suffix .ds.
Missing Columns Mode
Use this option to specify how the stage behaves if columns defined in the stage are not present in the data set when the job runs; a sketch of these rules follows the options. Select one of the following options:
Ignore
The missing columns are ignored and the job does not fail at the Data Set stage. If runtime column propagation is off, the job warns at the Data Set stage. The job fails when a missing column is explicitly used by another stage.
Fail
The job fails at the Data Set stage, regardless of whether runtime column propagation is on or off.
Default Nullable Only
The job sets any missing columns that are marked as nullable to the null value. Any missing columns marked as not nullable will cause the job to fail.
Default Non-Nullable Only
The job sets any missing columns that are marked as not nullable to the default value for that data type (for example, an integer column defaults to 0). Any missing columns marked as nullable will cause the job to fail.
Default All
The job sets values for missing columns as follows:
  • Nullable columns are set to null.
  • Non-nullable columns are set to the default value for that data type (for example, an integer column defaults to 0).
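
The decision rules above can be summarized in a short Python sketch. This is a conceptual illustration only, not DataStage code; the column representation and the resolve_missing helper are invented for the example.

```python
# Conceptual sketch of the missing-columns rules, not DataStage code.
# A column is represented as (name, nullable, default_value_for_type).
stage_columns = [
    ("id", False, 0),          # non-nullable integer, type default 0
    ("comment", True, None),   # nullable string
]

def resolve_missing(stage_columns, dataset_columns, mode):
    """Return values for missing columns, or raise to mimic a job failure."""
    resolved = {}
    for name, nullable, type_default in stage_columns:
        if name in dataset_columns:
            continue                      # column is present; nothing to do
        if mode == "Fail":
            raise RuntimeError(f"Missing column {name}: job fails at the Data Set stage")
        if mode == "Ignore":
            continue                      # fails later only if another stage uses it
        if mode == "Default Nullable Only":
            if not nullable:
                raise RuntimeError(f"Missing non-nullable column {name}: job fails")
            resolved[name] = None
        elif mode == "Default Non-Nullable Only":
            if nullable:
                raise RuntimeError(f"Missing nullable column {name}: job fails")
            resolved[name] = type_default
        elif mode == "Default All":
            resolved[name] = None if nullable else type_default
    return resolved

# Example: both columns are missing from the data set.
print(resolve_missing(stage_columns, dataset_columns=set(), mode="Default All"))
# {'id': 0, 'comment': None}
```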