Bloom Filter stage in DataStage: Stage tab

The Bloom Filter stage tab enables you to control aspects of the Bloom Filter stage.

Double-click the stage to open the stage properties panel. The Properties section lets you specify what the stage does, and the Advanced section lets you specify how the stage executes. You can also enter an optional description of the stage.

Properties section

Use the Properties and Options sections to define what the stage actually does.

Mode

Select Create or Process. The Mode property is set to Create by default.

Create: This option specifies that the stage runs in create mode. The keys in the input data set are added to a bloom filter, which is written to memory after the last record in the data set. Use this option to create bloom filters from old static data for later jobs that run the stage in process mode.
Process: This option specifies that the stage runs in process mode. The keys in the input data set are looked up against the bloom filters that are loaded in memory.
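
The two modes can be illustrated with a minimal sketch in Python. This is a toy bloom filter, not the DataStage implementation: the class name SketchBloomFilter, the functions create and process, the SHA-256 based hashing, and the default sizes are all assumptions made for illustration.

```python
import hashlib


class SketchBloomFilter:
    """Toy bloom filter for illustration; not the DataStage implementation."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from one SHA-256 digest
        # (double hashing; an assumed scheme, chosen for brevity).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def create(keys, num_bits=1024, num_hashes=3):
    """Create mode: add every input key to a new filter."""
    bloom = SketchBloomFilter(num_bits, num_hashes)
    for key in keys:
        bloom.add(key)
    return bloom


def process(bloom, keys):
    """Process mode: look up each input key against an existing filter."""
    return [(key, bloom.might_contain(key)) for key in keys]
```

A key that was added in create mode is always reported as possibly seen in process mode. A key that was never added is usually reported as not seen, but can occasionally be a false positive.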

Fileset

Specify the path and name of the file set that is used to store the bloom filter information.

Size

Specify the number of unique entries that you expect to insert into the bloom filter. Overestimate the total number of entries when you specify the value for this option.
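
To see why overestimating is the safer choice, the following sketch applies the textbook bloom filter approximation for the false positive rate. How the stage converts the Size value into a bit array is not described here, so the 10 bits per expected entry and the 3 hash indexes per key are hypothetical values, as is the function name.

```python
import math


def estimated_false_positive_rate(num_bits: int, num_hashes: int, num_entries: int) -> float:
    """Textbook approximation: p = (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-num_hashes * num_entries / num_bits)) ** num_hashes


# Hypothetical sizing: 10 bits per expected entry, 3 hash indexes per key.
planned_entries = 1_000_000
num_bits = 10 * planned_entries

# Compare inserting fewer, exactly as many, and more entries than planned.
for actual_entries in (500_000, 1_000_000, 2_000_000):
    rate = estimated_false_positive_rate(num_bits, 3, actual_entries)
    print(f"{actual_entries:>9} entries -> estimated false positive rate {rate:.4%}")
```

Under these assumptions, the estimated rate rises as the number of inserted entries exceeds the planned size, which is the effect that an overestimated Size value guards against.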

Edit

Click Edit to specify a key. This option specifies the key to use for the lookup in either create or process mode. At least one key is required.

Additional properties (Create)

Date: This option specifies the date string, in yyyy-mm-dd format, that the incoming data set is associated with. This date is appended to the file name of the associated bloom filter and is used for dropping older filters. If you do not specify this option in create mode, the -previous_days option cannot be used in process mode.
Phases: This option specifies the number of hash indexes that each key group will produce. A higher number of phases lowers the false positive percentage, but raises the memory requirements. The phase count that you use must match the phase count that is used to create static filters.
Truncate: This option truncates the file set.

Additional properties (Process)

Date: This option specifies the date string, in yyyy-mm-dd format, that the incoming data set is associated with. This date is appended to the file name of the associated bloom filter and is used for dropping older filters. If you do not specify this option in create mode, the -previous_days option cannot be used in process mode.
Drop old: This option specifies that bloom filters older than the -previous_days count will be removed from the file set.
Duplicate flag: This option specifies that you want to flag duplicates when running the stage.
Phases: This option specifies the number of hash indexes that each key group will produce. A higher number of phases lowers the false positive percentage, but raises the memory requirements. The phase count that you use must match the phase count that is used to create static filters.
Previous days: This option specifies the number of days of old bloom filters to use for the lookup. If this option is not specified, all existing filters are used.
Reference date: This option is the reference date for the -previous_days option. Specify this value in yyyy-mm-dd format.
Truncate: This option truncates the file set.
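
The interaction between the Date, Reference date, Previous days, and Drop old properties can be sketched as a date window. The function name split_filters_by_age and the inclusive cutoff are assumptions for illustration; the stage's actual boundary rule may differ.

```python
from datetime import date, timedelta


def split_filters_by_age(filter_dates, reference_date, previous_days=None):
    """Return (kept, dropped) filter dates for a lookup window.

    Assumption: a filter is kept when its date is no more than
    previous_days before reference_date (inclusive cutoff).
    """
    if previous_days is None:
        # No window given: every existing filter is used.
        return sorted(filter_dates), []
    cutoff = reference_date - timedelta(days=previous_days)
    kept = sorted(d for d in filter_dates if d >= cutoff)
    dropped = sorted(d for d in filter_dates if d < cutoff)
    return kept, dropped


# Dates that were attached to three filters when they were created.
filters = [date.fromisoformat(s) for s in ("2024-01-01", "2024-01-05", "2024-01-09")]
kept, dropped = split_filters_by_age(filters, date(2024, 1, 10), previous_days=5)
```

In this sketch, kept holds the filters that would take part in the lookup, and dropped holds the filters that Drop old would remove from the file set.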

Advanced properties

The advanced properties section allows you to specify the following options:

Execution mode. The stage can execute in parallel mode or sequential mode. In parallel mode, the input data set is processed by the available nodes as specified in the configuration file, and by any node constraints specified in the Advanced section. In sequential mode, the entire data set is processed by the conductor node.
Combinability mode. This is Auto by default, which allows IBM DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is Set by default. You can select Set or Clear. If you select Set, the stage requests that the next stage in the job attempt to maintain the partitioning.