Sample stage in DataStage
The Sample stage samples an input data set.
The Sample stage can have a single input link and any number of output links when operating in percent mode, or a single input link and a single output link when operating in period mode. It is one of a number of stages that IBM DataStage provides to help you sample data. See also:
- Head stage
- Tail stage
- Peek stage
The Sample stage is a debug stage. It operates in two modes. In Percent mode, it extracts rows, selecting them by means of a random number generator, and writes a given percentage of these to each output data set. You specify the number of output data sets, the percentage written to each, and a seed value to start the random number generator. You can reproduce a given distribution by reusing the same number of outputs, the same percentages, and the same seed value.
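The Percent mode behavior described above can be sketched in Python. This is only an illustration, not the stage's actual implementation: the function name `percent_sample` is hypothetical, and the sketch assumes that each row goes to at most one output and that the percentages sum to at most 100.

```python
import random


def percent_sample(rows, percents, seed):
    """Route each row to at most one output using a seeded random draw.

    Assumption (not confirmed by the stage documentation): a row is written
    to at most one output, and the percentages sum to at most 100.
    """
    if sum(percents) > 100:
        raise ValueError("percentages must sum to at most 100")
    rng = random.Random(seed)  # same seed gives the same sequence of draws
    outputs = [[] for _ in percents]
    for row in rows:
        draw = rng.random() * 100  # value in [0, 100)
        upper = 0.0
        for out, pct in zip(outputs, percents):
            upper += pct
            if draw < upper:
                out.append(row)
                break  # rows whose draw exceeds every band are discarded
    return outputs


first = percent_sample(range(1000), [10, 20], seed=42)
second = percent_sample(range(1000), [10, 20], seed=42)
print(first == second)  # identical settings reproduce the same distribution
```

Because the generator is seeded, rerunning with the same outputs, percentages, and seed yields the same selection, which mirrors the reproducibility described above.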
In Period mode, it extracts every Nth row from each partition, where N is the period, which you supply. In this case, all rows are output to a single data set, so the stage can have only a single output link in this mode.
For both modes you can specify the maximum number of rows that you want to sample from each partition.
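Period mode with a per-partition row limit can be sketched in the same way. Again, this is a hedged illustration: `period_sample` and its parameters are hypothetical names, and the sketch assumes that counting starts so that rows N, 2N, 3N, and so on are kept, and that the maximum applies to the rows kept from each partition.

```python
from itertools import islice


def period_sample(partitions, period, max_rows=None):
    """Take every `period`-th row from each partition into one output.

    Assumptions (not confirmed by the stage documentation): rows N, 2N, ...
    are kept, and `max_rows` caps the rows kept per partition.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    output = []
    for partition in partitions:
        # Start at the N-th row (index period - 1) and step by the period.
        picked = islice(partition, period - 1, None, period)
        # Apply the per-partition cap; None means no cap.
        output.extend(islice(picked, max_rows))
    return output


partitions = [list(range(1, 11)), list(range(101, 107))]
print(period_sample(partitions, period=3, max_rows=2))
```

All partitions feed a single list here, matching the single output link that Period mode allows.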
Input tab
The Columns section specifies the column definitions of incoming data.
Output tab
In Percent mode, the stage can have any number of output links; in Period mode, it can have only one. Choose the link you want to work on from the Output Link drop-down list.
The Columns section specifies the column definitions of outgoing data. Click Edit at the bottom of the Columns section to specify mapping information. Mapping specifies the relationship between the columns being input to the Sample stage and the output columns. The Advanced section allows you to change the default buffering settings for the output links.
- Mapping output
Click Edit in the Columns section to map columns. The left pane shows the columns of the sampled data. These are read-only and cannot be modified on this tab. This pane shows the metadata from the incoming link.
The right pane shows the output columns for the output link. This pane has a Derivations field where you can specify how each column is derived. You can fill it in by dragging input columns over or by using the Auto-match facility.