Remove Duplicates Stage in DataStage

The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.

The Remove Duplicates stage is a processing stage. It can have a single input link and a single output link.

Removing duplicate records is a common way of cleansing a data set before you perform further processing. Two rows are considered duplicates if they are adjacent in the input data set and have identical values for the key column(s). A key column is any column you designate to be used in determining whether two rows are identical.
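The adjacency rule described above can be illustrated outside DataStage with a minimal Python sketch. This is not how the stage is implemented. The function name, the sample column names, and the choice to keep the first row of each run are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def remove_adjacent_duplicates(rows, key_columns):
    """Keep the first row of each run of adjacent rows with equal key values.

    Hypothetical helper: only the listed key columns are compared, so rows
    that differ in non-key columns still count as duplicates.
    """
    key = itemgetter(*key_columns)
    return [next(group) for _, group in groupby(rows, key=key)]


# Sample input, already sorted on the key columns (illustrative data only).
rows = [
    {"customer_id": 1, "city": "Leeds", "updated": "2020-01-01"},
    {"customer_id": 1, "city": "Leeds", "updated": "2020-02-01"},
    {"customer_id": 2, "city": "York", "updated": "2020-01-15"},
]

deduped = remove_adjacent_duplicates(rows, ["customer_id", "city"])
for row in deduped:
    print(row)
```

The two rows for customer 1 differ in the non-key column `updated`, yet only one of them survives, because only the key columns take part in the comparison.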

The data set input to the Remove Duplicates stage must be sorted so that all records with identical key values are adjacent. You can achieve this either by using the in-stage sort facilities on the Partitioning tab of the Input page, or by placing an explicit Sort stage ahead of the Remove Duplicates stage.
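To see why the sort requirement matters, the following Python sketch compares the effect of adjacent-only duplicate removal on unsorted and sorted key values. It is an illustration under assumptions, not DataStage code. The helper name and the sample keys are hypothetical.

```python
from itertools import groupby


def drop_adjacent_duplicates(keys):
    """Collapse each run of equal neighbouring values to a single value."""
    return [k for k, _ in groupby(keys)]


# Illustrative key values in arrival order; equal keys are not adjacent.
unsorted_keys = ["A", "B", "A", "C", "B"]

# Without a sort, no two equal keys are neighbours, so nothing is removed.
without_sort = drop_adjacent_duplicates(unsorted_keys)

# Sorting first brings equal keys together, so every duplicate is removed.
with_sort = drop_adjacent_duplicates(sorted(unsorted_keys))

print(without_sort)
print(with_sort)
```

On unsorted input the duplicates pass through untouched, which is the failure mode that the in-stage sort or a preceding Sort stage prevents.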

The stage editor has three pages:

Stage. This is always present and is used to specify general information about the stage.
Input. This is where you specify details about the data set whose duplicates are being removed.
Output. This is where you specify details about the processed data that is being output from the stage.