Sample stage in DataStage: Partitioning on input links

Sample stage: Partitioning on input links

The Partitioning section allows you to specify details about how the incoming data is partitioned or collected before the sample is performed.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, the stage will warn if it cannot preserve the partitioning of the incoming data.

If the Sample stage is operating in sequential mode, it will first collect the data before writing it to the file using the default auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

Whether the Sample stage is set to execute in parallel or sequential mode.
Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Sample stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning.

If the Sample stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

(Auto). IBM® DataStage® attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default method for the Sample stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the output of a random number generator.
Round Robin. The records are partitioned on a round robin basis as they enter the stage.
Same. Preserves the partitioning already in place.
Db2®. Replicates the Db2 partitioning method of a specific Db2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

(Auto). This is the default collection method for the Sample stage. Normally, when you are using Auto mode, IBM DataStage will eagerly read any row from any input partition as it becomes available.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the sample is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available for the default auto methods).

Select the check boxes as follows:

Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
Stable. Select this if you want to preserve previously sorted data sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.

You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.