Use the Partitioning section in DataStage® stages or connectors that have Input tabs to
specify details about how the stage or connector partitions or collects data on the current link
before it processes the data or writes it to a data target.
Data partitioning is an approach to parallelism that involves breaking the record set into
partitions, or subsets of records. If no resource constraints or other data skew issues exist, data
partitioning can provide linear increases in application performance. DataStage automatically partitions data based on the type of
partition that the stage requires.
You can also use the Partitioning section to sort data that is arriving on
the input link before the data is processed or written to the data target. The availability of
sorting depends on the partitioning or collecting method that is chosen. It is not available with
the Auto methods. The Partitioning section provides basic sorting facilities.
For a more complex sort operation, use the Sort stage.
The
Partitioning section contains the following controls and fields:
- Partitioning
- Choose the partitioning type from the list.
- The Partition type list is available if the Execution
mode is set to parallel in the Stage tab. If you select a method from the list, the
method overrides any current partitioning method.
- The following partitioning types are available:
- (Auto)
- At run time, the engine attempts to work out the best partitioning method, depending on:
- Whether the current and preceding stages are set to run in sequential mode or in parallel
mode.
- Whether previous stages in the job have the Preserve Partitioning option
set.
- How many nodes are specified in the configuration file.
- Auto is the default method for most stages, but Auto is not available for the Lookup File Set
stage or the Db2 Enterprise stage.
- DB2 connector
- This partition type is
available only for the IBM Db2 for DataStage connector. When
this method is specified, the connector determines the number of partitions in the table that is
specified in the property and dynamically configures the number of nodes to match the
number of partitions. For Db2® for LUW (Linux,
Unix, and Windows) tables, this number is the number of DPF (Database Partitioning Feature)
partitions that are in the table. For Db2 for z/OS®
tables, this number is the number of table partitions in the table. The connector associates each
node with one partition. For each node, the connector reads the rows that belong to the partition
that is associated with that node.
- Entire
- Every processing node receives the entire data set.
- Random
- The rows are partitioned randomly, based on the output of a random number generator.
- Round Robin
- The rows are partitioned on a round-robin basis as they enter the stage.
- Same
- This method preserves the current data partitions.
- Modulus
- The rows are partitioned by using a modulus function on the key
column.
- Hash
- The rows are hashed into partitions based on the value of one or more key
columns.
- Range
- This method divides a data set into approximately equal size partitions based on one or more
partitioning keys. Range partitioning is often used as a preparatory step for performing a total
sort on a data set.
- Collecting
- Choose the collecting type from the list.
-
The Collection type list is available if the stage is set to execute in
sequential mode, and the preceding stage is set to execute in parallel mode. If you select a method
from the list, the method overrides the default collection method of Auto.
- The following collection types are available:
- (Auto)
- The Auto method usually causes the stage to read any row from any input partition as the row
becomes available and is the fastest collecting method. However, the stage can use a different
collecting method when Auto is set in some circumstances. For example, if the stage requires data to
be sorted before it can operate, the stage sorts the data.
- Ordered
- This method reads all the rows from the first partition, then all the rows from the second
partition, and so on.
- Round Robin
- This method reads a row from the first input partition, then a row from the second partition,
and so on. After reaching the last partition, the stage starts again from the first partition.
- Sort Merge
- This method reads rows in an order based on one or more columns of the
row.
- Sorting
- Use these controls to specify how to sort the data. Data is always sorted within data
partitions. If the stage is partitioning incoming data, the data is sorted after the partitioning.
If the stage is collecting incoming data, the data is sorted before the collection.
- Sort
- Select Perform sort to sort data that comes in on the
link.
- Stable
- Select Stable if you want to preserve previously sorted data sets.
Stable is set by default.
- Unique
- Select Unique if you want to retain only one record per sorting key
value. If multiple records have identical sorting key values, all but one is discarded. If stable
sort is also set, the first record with the sorting key value is the record that is retained.
Dynamically generated configuration files in DataStage-SaaS
DataStage-SaaS does not support
user-generated configuration files. You can provide the number of partitions for dynamically
generated configuration files by setting the partition count in the runtime environment or by
setting the environment variable APT_WLM_PARTITION_COUNT
for the number of
partitions.