Sort stage in DataStage

Sort stage

Use the Sort stage to perform sort operations that are more complex than those available in the Partitioning section of the Input page of parallel job stage editors.

The Sort stage is a processing stage. You can also use the Sort stage to insert a more explicit simple sort operation where you want to make your job easier to understand. The Sort stage has a single input link which carries the data to be sorted, and a single output link carrying the sorted data.

You specify sorting keys as the criteria on which to perform the sort. A key is a column on which to sort the data. For example, if you had a name column, you might specify it as the sort key to produce an alphabetical list of names. The first column you specify as a key to the stage is the primary key, but you can specify additional secondary keys. If multiple rows have the same value for the primary key column, IBM® DataStage® uses the secondary key columns to sort these rows.
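The relationship between primary and secondary keys can be illustrated outside DataStage with a minimal Python sketch. The rows and the column names last_name and first_name are hypothetical and chosen only for illustration; this is not the stage's own implementation.

```python
from operator import itemgetter

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"last_name": "Smith", "first_name": "Zoe"},
    {"last_name": "Jones", "first_name": "Amy"},
    {"last_name": "Smith", "first_name": "Adam"},
]

# Primary key: last_name. Secondary key: first_name, which only decides
# the order of rows that share the same last_name value.
sorted_rows = sorted(rows, key=itemgetter("last_name", "first_name"))

for row in sorted_rows:
    print(row["last_name"], row["first_name"])
```

The two Smith rows tie on the primary key, so the secondary key alone determines their relative order.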

You can sort in sequential mode to sort an entire data set or in parallel mode to sort data within partitions, as shown in the following image:

Figure: A Sort stage sorting an entire data set sequentially, and a Sort stage sorting parallel data within partitions.

You might perform a sort for several reasons. For example, you might want to sort a data set by a zip code column, then by last name within the zip code. Once you have sorted the data set, you can filter the data set by comparing adjacent records and removing any duplicates.
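As a rough illustration of the sort-then-filter idea, the following Python sketch sorts records by zip code and then by last name, and removes duplicates by comparing each record with the one kept before it. The column names zip_code and last_name, the sample data, and the choice to treat records with equal keys as duplicates are all assumptions made for this example.

```python
def sort_key(record):
    """Key used for both sorting and duplicate detection (an assumption)."""
    return (record["zip_code"], record["last_name"])


def sort_and_dedupe(records):
    """Sort by zip code, then last name, and keep the first record per key."""
    ordered = sorted(records, key=sort_key)
    kept = []
    for record in ordered:
        # After sorting, records with equal keys are always adjacent,
        # so comparing with the last kept record is sufficient.
        if not kept or sort_key(record) != sort_key(kept[-1]):
            kept.append(record)
    return kept


# Hypothetical sample data.
records = [
    {"zip_code": "10001", "last_name": "Smith"},
    {"zip_code": "02139", "last_name": "Jones"},
    {"zip_code": "10001", "last_name": "Adams"},
    {"zip_code": "10001", "last_name": "Smith"},
]
unique_records = sort_and_dedupe(records)
```

The adjacent comparison works only because the sort has already brought equal keys together; on unsorted input it would miss duplicates that are not neighbors.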

However, you must be careful when processing a sorted data set: many types of processing, such as repartitioning, can destroy the sort order of the data. For example, assume you sort a data set on a system with four processing nodes and store the results to a data set stage. The data set therefore has four partitions. You then use that data set as input to a stage that runs on a different number of nodes, possibly because of node constraints. Unless you tell it not to, IBM DataStage automatically repartitions the data set to spread the data across all nodes in the system, which can destroy the sort order of the data. You can avoid this by specifying the Same partitioning method. The stage then does not repartition the input data set as it reads it, and the original partitions are preserved.
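The effect can be seen in a toy Python model. This is only a sketch: the function names are hypothetical, and the round-robin redistribution is an assumed stand-in for repartitioning, not a description of how the DataStage engine actually moves records.

```python
def is_sorted(values):
    """Return True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def round_robin_repartition(partitions, node_count):
    """Toy repartitioning: read the partitions one after another and
    deal the records out to node_count new partitions in turn."""
    flattened = [record for partition in partitions for record in partition]
    return [flattened[i::node_count] for i in range(node_count)]


def same_partitioning(partitions):
    """Toy model of the Same method: partitions are passed through as-is."""
    return [list(partition) for partition in partitions]


# Four partitions, each sorted, as if produced on four processing nodes.
sorted_partitions = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]

repartitioned = round_robin_repartition(sorted_partitions, 2)
preserved = same_partitioning(sorted_partitions)

print([is_sorted(p) for p in repartitioned])
print([is_sorted(p) for p in preserved])
```

In this model the two new partitions interleave records from different source partitions, so neither is sorted, while the pass-through partitions keep their order.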

You must also be careful when using a stage operating sequentially to process a sorted data set. A sequential stage executes on a single processing node to perform its action. When the data set has more than one partition, a sequential stage collects the data, which might also destroy the sort order of its input data set. You can overcome this by specifying the collection method as follows:

If the data was range partitioned before being sorted, you should use the ordered collection method to preserve the sort order of the data set. Using this collection method causes all the records from the first partition of a data set to be read first, then all records from the second partition, and so on.
If the data was hash partitioned before being sorted, you should use the sort merge collection method, specifying the same keys for collection as those on which the data was partitioned.
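The difference between the two collection methods can be sketched in Python. The function names and sample partitions below are hypothetical, and the code models only the ordering behavior described above, not the actual collectors.

```python
import heapq
from itertools import chain


def ordered_collect(partitions):
    """Read all records from the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def sort_merge_collect(partitions, key=None):
    """Merge partitions that are each already sorted on the same key."""
    return list(heapq.merge(*partitions, key=key))


# Range partitioned, then sorted: every value in one partition is lower
# than every value in the next, so reading partitions in order is enough.
range_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Hash partitioned, then sorted: each partition is sorted, but the
# value ranges overlap, so the partitions must be merged.
hash_partitions = [[3, 6, 9], [1, 4, 7], [2, 5, 8]]

print(ordered_collect(range_partitions))
print(sort_merge_collect(hash_partitions))
print(ordered_collect(hash_partitions))
```

Applying the ordered method to the hash-partitioned data, as in the last line, yields each partition's records in turn and loses the global order, which is why the sort merge method is the one to use in that case.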

By default, the stage sorts with the native IBM DataStage sorter, but you can also specify that it uses the UNIX sort command.

The stage editor has three pages:

Stage. This is always present and is used to specify general information about the stage.
Input. This is where you specify details about the data sets being sorted.
Output. This is where you specify details about the sorted data being output from the stage.
