Partitioning and collecting data in DataStage

Last updated: Nov 07, 2024

Use the Partitioning section in DataStage® stages or connectors that have Input tabs to specify details about how the stage or connector partitions or collects data on the current link before it processes the data or writes it to a data target.

Data partitioning is an approach to parallelism that involves breaking the record set into partitions, or subsets of records. If no resource constraints or other data skew issues exist, data partitioning can provide linear increases in application performance. DataStage automatically partitions data based on the type of partition that the stage requires.

You can also use the Partitioning section to sort data that is arriving on the input link before the data is processed or written to the data target. The availability of sorting depends on the partitioning or collecting method that is chosen. It is not available with the Auto methods. The Partitioning section provides basic sorting facilities. For a more complex sort operation, use the Sort stage.

The Partitioning section contains the following controls and fields:

Partitioning

Choose the partitioning type from the list.

The Partition type list is available if the Execution mode is set to parallel in the Stage tab. If you select a method from the list, the method overrides any current partitioning method.

The following partitioning types are available:

(Auto)

At run time, the engine attempts to work out the best partitioning method, depending on:

Whether the current and preceding stages are set to run in sequential mode or in parallel mode.
Whether previous stages in the job have the Preserve Partitioning option set.
How many nodes are specified in the configuration file.

Auto is the default method for most stages, but Auto is not available for the Lookup File Set stage or the Db2 Enterprise stage.

DB2 connector: This partition type is available only for the IBM Db2 for DataStage connector. When this method is specified, the connector determines the number of partitions in the table that is specified in the Partitioned reads method > Table name property and dynamically configures the number of nodes to match the number of partitions. For Db2® for LUW (Linux, Unix, and Windows) tables, this number is the number of DPF (Database Partitioning Feature) partitions that are in the table. For Db2 for z/OS® tables, this number is the number of table partitions in the table. The connector associates each node with one partition. For each node, the connector reads the rows that belong to the partition that is associated with that node.

Entire: Every processing node receives the entire data set.

Random: The rows are partitioned randomly, based on the output of a random number generator.

Round Robin: The rows are partitioned on a round-robin basis as they enter the stage.

Same: This method preserves the current data partitions.

Modulus: The rows are partitioned by using a modulus function on the key column.

Hash: The rows are hashed into partitions based on the value of one or more key columns.

Range: This method divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often used as a preparatory step for performing a total sort on a data set.

Collecting

Choose the collecting type from the list.

The Collection type list is available if the stage is set to execute in sequential mode, and the preceding stage is set to execute in parallel mode. If you select a method from the list, the method overrides the default collection method of Auto.

The following collection types are available:

(Auto): The Auto method usually causes the stage to read any row from any input partition as the row becomes available and is the fastest collecting method. However, the stage can use a different collecting method when Auto is set in some circumstances. For example, if the stage requires data to be sorted before it can operate, the stage sorts the data.

Ordered: This method reads all the rows from the first partition, then all the rows from the second partition, and so on.

Round Robin: This method reads a row from the first input partition, then a row from the second partition, and so on. After reaching the last partition, the stage starts again from the first partition.

Sort Merge: This method reads rows in an order based on one or more columns of the row.

Sorting

Use these controls to specify how to sort the data. Data is always sorted within data partitions. If the stage is partitioning incoming data, the data is sorted after the partitioning. If the stage is collecting incoming data, the data is sorted before the collection.

Sort: Select Perform sort to sort data that comes in on the link.

Stable: Select Stable if you want to preserve previously sorted data sets. Stable is set by default.

Unique: Select Unique if you want to retain only one record per sorting key value. If multiple records have identical sorting key values, all but one is discarded. If stable sort is also set, the first record with the sorting key value is the record that is retained.

Dynamically generated configuration files in DataStage-SaaS

DataStage-SaaS does not support user-generated configuration files. You can provide the number of partitions for dynamically generated configuration files by setting the partition count in the runtime environment or by setting the environment variable APT_WLM_PARTITION_COUNT for the number of partitions.

Was the topic helpful?

0/1000