Compress stage in DataStage
The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set from a sequence of records into a stream of raw binary data.
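As a rough illustration of what "a sequence of records into a stream of raw binary data" means, the following Python sketch compresses a few rows with the standard `gzip` module and expands them again. It is only an analogy and not how the stage is implemented: the sample rows, the comma-separated serialization, and the function names `compress_records` and `expand_records` are all assumptions made for this example.

```python
import gzip

# Hypothetical rows standing in for a data set with columns a and b.
records = [(1, "alpha"), (2, "beta"), (3, "gamma")]


def compress_records(rows):
    """Serialize rows as text lines, then compress them into one byte string."""
    serialized = "".join(f"{a},{b}\n" for a, b in rows).encode("utf-8")
    return gzip.compress(serialized)


def expand_records(blob):
    """Reverse compress_records: decompress, then parse each line into a row."""
    text = gzip.decompress(blob).decode("utf-8")
    rows = []
    for line in text.splitlines():
        a, b = line.split(",", 1)
        rows.append((int(a), b))
    return rows


blob = compress_records(records)   # opaque bytes, no visible rows or columns
restored = expand_records(blob)    # rows are only usable again after expansion
```

The compressed value is a single opaque byte string, which is why column-based processing has to wait until the data is expanded.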
The Compress stage is a processing stage. It can have a single input link and a single output link.
The complement to the Compress stage is the Expand stage, which is described in Expand stage in DataStage.
A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a Data Set stage. However, a compressed data set cannot be processed by many stages until it is expanded, that is, until its rows are returned to their normal format. Stages that do not perform column-based processing or reorder the rows can operate on compressed data sets. For example, you can use the Copy stage to create a copy of the compressed data set.
Because compressing a data set removes its normal record boundaries, the compressed data set must not be repartitioned before it is expanded.
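The effect of losing record boundaries can be sketched in Python. This is an analogy using the standard `gzip` module, not a model of how DataStage partitions data: the sample rows and the helper `try_expand` are assumptions for illustration. It cuts a compressed byte string in two, as a naive redistribution might, and shows that neither piece can be expanded on its own.

```python
import gzip
import zlib

# Hypothetical rows, serialized one per line before compression.
rows = "".join(f"{i},value-{i}\n" for i in range(1000)).encode("utf-8")
blob = gzip.compress(rows)

# Simulate a careless redistribution by cutting the compressed bytes in two.
half = len(blob) // 2
first, second = blob[:half], blob[half:]


def try_expand(chunk):
    """Return the decompressed bytes, or None if the chunk cannot be expanded."""
    try:
        return gzip.decompress(chunk)
    except (OSError, EOFError, zlib.error):
        return None


whole_result = try_expand(blob)     # the intact stream expands to the rows
first_result = try_expand(first)    # truncated stream: cannot be expanded
second_result = try_expand(second)  # starts mid-stream: cannot be expanded
```

Only the intact stream yields the original rows, because the cut does not fall on any boundary that the decompressor can recognize.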
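The Compress stage places the original schema as a subrecord inside a generic compressed schema. For example, given a data set with the following schema:
a: int32;
b: string[50];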
The schema for the compressed data set would be:
record
( t: tagged {preservePartitioning=no}
  ( encoded: subrec
    ( bufferNumber: dfloat;
      bufferLength: int32;
      bufferData: raw[32000];
    );
    schema: subrec
    ( a: int32;
      b: string[50];
    );
  )
)
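One way to picture the `encoded` subrecord is as a numbered, bounded chunk of the compressed byte stream. The Python sketch below is an assumption-heavy illustration: the field names and the 32000-byte capacity come from the schema above, but the packing logic, the `EncodedBuffer` class, and the `pack` and `unpack` helpers are hypothetical and do not describe the stage's internal behavior.

```python
import gzip
import random
from dataclasses import dataclass
from typing import List

BUFFER_CAPACITY = 32000  # mirrors raw[32000] in the schema above


@dataclass
class EncodedBuffer:
    buffer_number: float  # dfloat in the schema
    buffer_length: int    # number of meaningful bytes in buffer_data
    buffer_data: bytes    # at most BUFFER_CAPACITY bytes


def pack(blob: bytes) -> List[EncodedBuffer]:
    """Cut a compressed byte string into numbered, bounded buffers (assumed layout)."""
    buffers = []
    for number, start in enumerate(range(0, len(blob), BUFFER_CAPACITY)):
        chunk = blob[start:start + BUFFER_CAPACITY]
        buffers.append(EncodedBuffer(float(number), len(chunk), chunk))
    return buffers


def unpack(buffers: List[EncodedBuffer]) -> bytes:
    """Rejoin buffers in buffer_number order to recover the compressed bytes."""
    ordered = sorted(buffers, key=lambda b: b.buffer_number)
    return b"".join(b.buffer_data[:b.buffer_length] for b in ordered)


# Hypothetical payload: pseudo-random bytes compress poorly, so several buffers result.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(100000))

buffers = pack(gzip.compress(payload))
restored = gzip.decompress(unpack(buffers))
```

In this sketch no buffer holds a whole number of rows; the rows reappear only after every buffer has been rejoined in order and the result decompressed.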
Because the schema changes, a file that has been compressed must be read with the compressed schema, not with the schema that the data had before compression.
When you double-click the Compress stage, the properties panel opens. The properties panel has three tabs:
- Stage. This is always present and is used to specify general information about the stage.
- Input. This is where you specify details about the data set being compressed.
- Output. This is where you specify details about the compressed data being output from the stage.
Input tab
The Columns section specifies the column definitions of incoming data. The Advanced section allows you to change the default buffering settings for the input link.
Output tab
The Columns section specifies the column definitions of the data. The Advanced section allows you to change the default buffering settings for the output link.