Compress stage in DataStage

Compress stage

The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set from a sequence of records into a stream of raw binary data.
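
The idea of turning a sequence of records into a single stream of raw binary data can be sketched outside DataStage. The following Python example is only an illustration, not how the stage is implemented: the sample records, the comma-separated text layout, and the helper names `compress_records` and `expand_records` are all assumptions made for the sketch.

```python
import gzip

# Hypothetical records standing in for the rows of a data set.
records = [(1, "alpha"), (2, "beta"), (3, "gamma")]

def compress_records(rows):
    """Serialize rows as newline-delimited text and gzip the result."""
    text = "".join(f"{key},{value}\n" for key, value in rows)
    return gzip.compress(text.encode("utf-8"))

def expand_records(blob):
    """Reverse compress_records: gunzip, then split the text back into rows."""
    text = gzip.decompress(blob).decode("utf-8")
    rows = []
    for line in text.splitlines():
        key, value = line.split(",", 1)
        rows.append((int(key), value))
    return rows

compressed = compress_records(records)   # opaque bytes, no visible columns
restored = expand_records(compressed)    # rows are usable again
```

After compression, the result is a single `bytes` value with no column structure, which is why column-based processing is not possible until the data is expanded.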

The Compress stage is a processing stage. It can have a single input link and a single output link.

The complement to the Compress stage is the Expand stage, which is described in Expand stage.

A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a Data Set stage. However, a compressed data set cannot be processed by many stages until it is expanded, that is, until its rows are returned to their normal format. Stages that do not perform column-based processing or reorder the rows can operate on compressed data sets. For example, you can use the Copy stage to create a copy of the compressed data set.

Because compressing a data set removes its normal record boundaries, the compressed data set must not be repartitioned before it is expanded.
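
A rough way to see why repartitioning is unsafe is to cut a compressed stream in two and try to expand each piece. The Python sketch below is an analogy only: the sample rows, the single cut at the midpoint, and the helper `try_expand` are assumptions, and DataStage's own buffer format is not modeled.

```python
import gzip
import zlib

# Hypothetical rows for one partition, serialized as newline-delimited text.
rows = [f"{i},customer_{i}\n" for i in range(200)]
original = "".join(rows).encode("utf-8")
blob = gzip.compress(original)

def try_expand(data):
    """Return the decompressed bytes, or None if the data cannot be expanded."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        return None

# Simulate a repartition that cuts the compressed stream into two pieces.
middle = len(blob) // 2
first_piece, second_piece = blob[:middle], blob[middle:]

whole_ok = try_expand(blob) == original           # intact stream expands
first_ok = try_expand(first_piece) == original    # truncated stream does not
second_ok = try_expand(second_piece) == original  # headerless tail does not
```

Only the intact stream yields the original rows. Neither piece can be expanded on its own, because the cut does not fall on a record boundary that the compressed form preserves.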

DataStage® wraps the existing data set schema as a subrecord within a generic compressed schema. For example, given a data set with a schema of:

  a: int32;
  b: string[50];

The schema for the compressed data set would be:

  ( t: tagged {preservePartitioning=no}
    ( encoded: subrec
        ( bufferNumber: dfloat;
          bufferLength: int32;
          bufferData: raw[32000];
        );
      schema: subrec
        ( a: int32;
          b: string[50];
        );
    )
  )

Therefore, when you reuse a file that has been compressed, read it with the compressed schema rather than the schema of the data that went into the compression.
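
To make the wrapping concrete, the Python sketch below builds the text of a compressed schema from a list of original column definitions. It is a hypothetical helper, not a DataStage API: the function name `compressed_schema`, the exact whitespace, and the closing punctuation are assumptions based on the example above, and the function expects at least one column.

```python
def compressed_schema(fields):
    """Wrap original (name, type) column definitions in the generic
    compressed schema layout. Requires at least one field."""
    original = [f"{name}: {type_};" for name, type_ in fields]
    lines = [
        "( t: tagged {preservePartitioning=no}",
        "  ( encoded: subrec",
        "      ( bufferNumber: dfloat;",
        "        bufferLength: int32;",
        "        bufferData: raw[32000];",
        "      );",
        "    schema: subrec",
        "      ( " + original[0],
    ]
    # Remaining original columns line up under the first one.
    lines += ["        " + field for field in original[1:]]
    lines += ["      );", "  )", ")"]
    return "\n".join(lines)

schema_text = compressed_schema([("a", "int32"), ("b", "string[50]")])
print(schema_text)
```

The original columns appear only inside the `schema` subrecord, which is why the original schema alone does not describe the compressed file.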

When you double-click the Compress stage, the properties panel opens. The properties panel has three tabs:

  • Stage. This is always present and is used to specify general information about the stage.
  • Input. This is where you specify details about the data set being compressed.
  • Output. This is where you specify details about the compressed data being output from the stage.
