Partitioning and collecting data

Use the Partitioning section in DataStage® stages or connectors that have Input tabs to specify details about how the stage or connector partitions or collects data on the current link before it processes the data or writes it to a data target.

You can also use the Partitioning section to sort data that is arriving on the input link before the data is processed or written to the data target. The availability of sorting depends on the partitioning or collecting method that is chosen. It is not available with the Auto methods. The Partitioning section provides basic sorting facilities. For a more complex sort operation, use the Sort stage.

Note: Partitioning is not currently available for the Transformer stage.
The Partitioning section contains the following controls and fields:
Partition type
Choose the partitioning type from the list.
The Partition type list is available if the Execution mode is set to parallel in the Stage tab. If you select a method from the list, the method overrides any current partitioning method.
The following partitioning types are available:
Auto
At run time, the engine attempts to work out the best partitioning method, depending on:
  • Whether the current and preceding stages are set to run in sequential mode or in parallel mode.
  • Whether previous stages in the job have the Preserve Partitioning option set.
  • How many nodes are specified in the configuration file.
Auto is the default method for most stages, but Auto is not available for the Lookup File Set stage or the Db2 Enterprise stage.
Entire
Every processing node receives the entire data set.
Random
The rows are partitioned randomly, based on the output of a random number generator.
Round Robin
The rows are partitioned on a round-robin basis as they enter the stage.
Same
This method preserves the current data partitions.
Modulus
The rows are partitioned by using a modulus function on the key column.
Hash
The rows are hashed into partitions based on the value of one or more key columns.
Range
This method divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often used as a preparatory step for performing a total sort on a data set. The range method requires you to specify the name of a sample range map, which you create by using the Write Range Map stage. To specify the range map, click Properties and enter or browse for the range map name in the window.
Note: Range Map is not supported for DataStage Version 4.0.5.
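The keyed and keyless methods above can be illustrated with a minimal Python sketch. This is not the engine's implementation: the function names, the CRC32 hash, and the hand-written boundary list standing in for a range map are all assumptions made for illustration only.

```python
import zlib
from bisect import bisect_right

def round_robin(rows, n):
    """Deal rows to n partitions in turn, as they arrive."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def modulus(rows, n, key):
    """Partition number = integer key column modulo the partition count."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts

def hash_partition(rows, n, keys):
    """Rows with equal key values always land in the same partition.
    CRC32 is an arbitrary stand-in; the engine's hash function is not shown here."""
    parts = [[] for _ in range(n)]
    for row in rows:
        raw = "|".join(str(row[k]) for k in keys).encode("utf-8")
        parts[zlib.crc32(raw) % n].append(row)
    return parts

def range_partition(rows, boundaries, key):
    """boundaries is a sorted list playing the role of a range map (hypothetical format).
    [2, 4] gives three ranges: key < 2, 2 <= key < 4, key >= 4."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts

# Hypothetical sample rows
regions = ["east", "west", "east", "north", "west", "east"]
rows = [{"id": i, "region": r} for i, r in enumerate(regions)]

print(round_robin(rows, 2))
print(modulus(rows, 3, "id"))
print(hash_partition(rows, 3, ["region"]))
print(range_partition(rows, [2, 4], "id"))
```

Round robin balances partition sizes but ignores row content, whereas the modulus, hash, and range sketches keep rows with the same key value together, which matters for key-based operations that run after partitioning.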
Collection type
Choose the collecting type from the list.

The Collection type list is available if the stage is set to execute in sequential mode, and the preceding stage is set to execute in parallel mode. If you select a method from the list, the method overrides the default collection method of Auto.

The following collection types are available:
Auto
The Auto method usually causes the stage to read any row from any input partition as the row becomes available, and it is the fastest collecting method. However, in some circumstances the stage uses a different collecting method even when Auto is set. For example, if the stage requires data to be sorted before it can operate, the stage sorts the data.
Ordered
This method reads all the rows from the first partition, then all the rows from the second partition, and so on.
Round Robin
This method reads a row from the first input partition, then a row from the second partition, and so on. After reaching the last partition, the stage starts again from the first partition.
Sort Merge
This method reads rows in an order based on one or more columns of the row.
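The deterministic collecting methods can be sketched in Python as follows. The function names are hypothetical, and two behaviors are assumptions rather than documented facts: the round-robin sketch skips partitions that have run out of rows, and the sort-merge sketch expects each partition to be already sorted on the key.

```python
import heapq
from itertools import chain, zip_longest

def ordered(partitions):
    """Read every row of the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))

def round_robin_collect(partitions):
    """Read one row from each partition in turn, then start again from the first.
    Exhausted partitions are skipped (an assumption of this sketch)."""
    skip = object()  # sentinel marking a partition with no rows left
    out = []
    for group in zip_longest(*partitions, fillvalue=skip):
        out.extend(row for row in group if row is not skip)
    return out

def sort_merge(partitions, key=None):
    """Merge partitions that are each already sorted on the key into one sorted stream."""
    return list(heapq.merge(*partitions, key=key))

# Hypothetical input: three partitions, each sorted on its single column
partitions = [[1, 5, 9], [2, 3], [4, 6, 7, 8]]

print(ordered(partitions))              # [1, 5, 9, 2, 3, 4, 6, 7, 8]
print(round_robin_collect(partitions))  # [1, 2, 4, 5, 3, 6, 9, 7, 8]
print(sort_merge(partitions))           # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Only the sort-merge sketch produces a globally ordered result, and only because its inputs are sorted within each partition.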
Sorting
Use these controls to specify how to sort the data. Data is always sorted within data partitions. If the stage is partitioning incoming data, the data is sorted after the partitioning. If the stage is collecting incoming data, the data is sorted before the collection.
Select Perform sort to sort data that comes in on the link.
Select Stable if you want to preserve previously sorted data sets. Stable is set by default.
Select Unique if you want to retain only one record per sorting key value. If multiple records have identical sorting key values, all but one are discarded. If Stable is also selected, the record that is retained is the first one with that sorting key value.
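As a rough illustration of how these options interact, the following Python sketch sorts each partition independently and, optionally, keeps one record per key value. The function name `sort_partition` and the sample rows are hypothetical. The sketch relies on Python's `sorted`, which is a stable sort, so with the unique option the retained record is the first one that arrived with each key.

```python
def sort_partition(rows, key, unique=False):
    """Sort one partition on key. sorted() is stable, so rows with equal keys
    keep their arrival order. With unique=True, keep only the first row per key."""
    result = sorted(rows, key=key)
    if not unique:
        return result
    kept = []
    for row in result:
        # Equal keys are adjacent after sorting, so compare with the last kept row.
        if not kept or key(kept[-1]) != key(row):
            kept.append(row)
    return kept

def by_code(row):
    return row[0]

# Hypothetical partitions of (code, sequence) rows; sorting happens within each one.
partitions = [
    [("b", 1), ("a", 2), ("b", 3), ("a", 4)],
    [("c", 5), ("a", 6), ("c", 7)],
]

stable = [sort_partition(p, by_code) for p in partitions]
deduped = [sort_partition(p, by_code, unique=True) for p in partitions]

print(stable)   # [[('a', 2), ('a', 4), ('b', 1), ('b', 3)], [('a', 6), ('c', 5), ('c', 7)]]
print(deduped)  # [[('a', 2), ('b', 1)], [('a', 6), ('c', 5)]]
```

Because the sort runs per partition, the key value "a" still appears in both partitions after deduplication. Removing duplicates across the whole data set would require partitioning on the sorting key first.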