0 / 0
Join stage: Partitioning on input links

Join stage: Partitioning on input links

The Partitioning tab allows you to specify details about how the data on each of the incoming links is partitioned or collected before it is joined.

It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. Auto mode ensures that data being input to the Join stage is key partitioned and sorted.

If the Join stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

  • Whether the Join stage is set to execute in parallel or sequential mode.
  • Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Join stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partition type drop-down list. This will override any current partitioning.

If the Join stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collector type drop-down list. This will override the default collection method.

The following partitioning methods are available:

  • (Auto). InfoSphere® DataStage® attempts to work out the best partitioning method depending on execution modes of current and preceding stages and how many nodes are specified in the Configuration file. This is the default collection method for the Join stage.
  • Entire. Each file written to receives the entire data set.
  • Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.
  • Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.
  • Random. The records are partitioned randomly, based on the output of a random number generator.
  • Round Robin. The records are partitioned on a round robin basis as they enter the stage.
  • Same. Preserves the partitioning already in place.
  • Db2®. Replicates the Db2 partitioning method of a specific Db2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
  • Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

  • (Auto). This is the default collection method for the Join stage. Normally, when you are using Auto mode, InfoSphere DataStage will eagerly read any row from any input partition as it becomes available. In the case of a Join stage, Auto will also ensure that the collected data is sorted.
  • Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
  • Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
  • Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to explicitly specify that data arriving on the input link should be sorted before being joined (you might use this if you have selected a partitioning method other than auto or same). The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning or collecting method chosen (it is not available with the default auto methods).

Select the check boxes as follows:

  • Perform Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.
  • Stable. Select this if you want to preserve previously sorted data sets. This is the default.
  • Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

If NLS is enabled an additional button opens a dialog box allowing you to select a locale specifying the collate convention for the sort.

You can also specify sort direction, case sensitivity, whether sorted as ASCII or EBCDIC, and whether null columns will appear first or last for each column. Where you are using a keyed partitioning method, you can also specify whether the column is used as a key for sorting, for partitioning, or for both. Select the column in the Selected list and right-click to invoke the shortcut menu.

Generative AI search and answer
These answers are generated by a large language model in watsonx.ai based on content from the product documentation. Learn more