Difference stage in DataStage

Difference stage

The Difference stage performs a record-by-record comparison of two input data sets, which are different versions of the same data set designated the before and after data sets.

The Difference stage is a processing stage. It outputs a single data set whose records represent the difference between the two input data sets. The stage assumes that the input data sets have been key-partitioned and sorted in ascending order on the key columns you specify for the Difference stage comparison. You can achieve this by using the Sort stage or by using the built-in sorting and partitioning abilities of the Difference stage.

The comparison is performed based on a set of difference key columns. Two records are copies of one another if they have the same value for all difference keys. You can also optionally specify value columns. If two records have identical key columns, the value columns are compared to determine whether one record is an edited copy of the other.

The Difference stage is similar, but not identical, to the Change Capture stage. The Change Capture stage is intended to be used in conjunction with the Change Apply stage; it produces a change data set that contains the changes that must be applied to the before data set to turn it into the after data set. The Difference stage outputs the before and after rows to the output data set, plus a code indicating whether there are differences.

If the before and after data sets have the same column names, one data set effectively overwrites the other, so you see only one set of columns in the output. Which data set is output is controlled by the settings in the Link Order section of the Stage tab. If the before and after data sets have different column names, columns from both data sets are available for output, as set with the mapping options when you edit columns on the Output tab. Any columns that are designated as key or value columns in the input data sets must have the same names in both data sets.
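The effect of column names on the output can be pictured with a small Python analogy. The function `visible_columns` and its `keep` parameter are hypothetical: `keep` simply stands in for whichever data set the Link Order settings select, and the dictionary merge is a loose model of the behavior, not the stage's actual mechanism.

```python
def visible_columns(before_row: dict, after_row: dict, keep: str = "after") -> dict:
    """Model which columns appear in the output for one matched pair of rows.

    `keep` is a stand-in for the data set chosen by Link Order (assumption).
    """
    if keep == "after":
        winner, other = after_row, before_row
    else:
        winner, other = before_row, after_row
    # Columns with the same name collapse to the winner's value;
    # columns with different names all survive.
    return {**other, **winner}


# Same column names: only one set of columns is visible.
same = visible_columns({"id": 1, "name": "a"}, {"id": 1, "name": "A"})

# Different non-key column names: both are available for output.
both = visible_columns({"id": 1, "old_name": "a"}, {"id": 1, "new_name": "A"})
```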

When you double-click the Difference stage, the properties panel opens. The properties panel has three tabs:

  • Stage. This is always present and is used to specify general information about the stage.
  • Input. This is where you specify details about the before and after data sets being compared.
  • Output. This is where you specify details about the data set being output from the stage.