0 / 0

Investigate stage

Last updated: Mar 12, 2025
Investigate stage in DataStage

The Investigate stage shows the actual condition of source data and helps to identify and correct data problems before they corrupt new systems. Understanding your data is a necessary precursor to cleansing.

Investigation parses and analyzes free-form fields, counts unique values, and classifies or assigns a business meaning to each occurrence of a value within a field.

Investigation achieves these goals:

  • Uncovers trends, potential anomalies, metadata discrepancies, and undocumented business practices.
  • Identifies invalid or default values.
  • Reveals common terminology.
  • Verifies the reliability of fields that are proposed as matching criteria.

The Investigate stage takes a single input, which can be a link from any database connector that is supported by IBM DataStage, from a flat file or data set, or from any processing stage. Inputs to the Investigate stage can be fixed length or variable. The stage can have one or two output links, depending on the type of investigation that you specify.

The Word Investigation stage parses free-form data fields into individual tokens and analyzes them to create patterns. This stage also provides frequency counts on the tokens. To create the patterns in address data, for example, the Word Investigation stage uses a set of rules for classifying personal names, business names, and addresses. The stage provides pre-built rule sets for investigating patterns on names and postal addresses for a number of different countries. For example, for the United States the stage parses the following components:

USPREP
Name, address, and area if the data is not previous formatted
USNAME
Individual and organization names
USADDR
Street and mailing addresses
USAREA
City, state, ZIP code, and other related data

The test field 123 St. Virginia St. is analyzed in the following way:

  1. Field parsing breaks the address into the individual tokens of 123, St., Virginia, and St.
  2. Lexical analysis determines the business significance of each token:
    1. 123 = number
    2. St. = street type
    3. Virginia = alpha
    4. St. = Street type
  3. Context analysis identifies the various data structures and content as 123 St. Virginia, St.
    1. 123 = House number
    2. St. Virginia = Street address
    3. St. = Street type

The Character Investigation stage parses a single-domain field (one that contains one data element or token, such as Social Security number, telephone number, date, or ZIP code) to analyze and classify data. The Character Investigation stage provides a frequency distribution and pattern analysis of the tokens.

A pattern report is prepared for all types of investigations and displays the count, percentage of data that matches this pattern, the generated pattern, and sample data. This output can be presented in a wide range of formats to conform to standard reporting tools.

Investigate stage: Stage tab

You can specify aspects of the Investigate stage by double-clicking the stage and updating settings on the Stage tab.

The Properties section lets you specify what the stage does. The Advanced section allows you to specify how the stage executes.

Properties

Use the Properties section to define what the stage actually does.
Alternate locale
Optional. Lets you specify the international locale you want to use on the server to process the data.

This value needs to be set only if you are processing data for a language that is not the default language of the server. For example, the default language for your server is French and the data to be processed is Italian.

When you change the locale, InfoSphere QualityStage uses the appropriate collating sequence and decimal separators for the alternate language. The value required depends on the type of server and how it is configured.

If you are using a UNIX server, enter the following command to obtain a list of locales supported by your server:

locale -a

If you are using a Windows workstation, select your InfoSphere QualityStage server directory and the locale subdirectory. The local subdirectory contains folders that are listed alphabetically by the languages they support.

Investigation

Investigation type

Character . A character investigation analyzes and classifies data, parsing it into a single-pattern report.

Column Investigation selection
Click Edit to apply a column mask. You use column masks to choose which characters are included in the frequency count or pattern analysis and which characters are displayed as part of the samples in the pattern report.
Frequency cutoff
Patterns with a frequency of less than this number will not appear in the pattern or token reports. If desired, enter a higher number. For example, if you enter 4, any pattern that occurs three times or less does not appear in the report.
Number of samples
If desired, increase the number of samples that appear for each pattern in the pattern report. The default is 1.
Comparison Mode: Concatenate
Performs cross-column correlations between multiple columns to determine relationships. You can choose two noncontiguous columns from anywhere in the record to be investigated as a single data column.

Advanced

This section let's you specify the following:
Execution mode
The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
Combinability mode
This is Auto by default, which allows IBM® DataStage® to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserver partitioning
This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning.

Investigate stage: Input tab

The Input tab allows you to specify details about the incoming data sets. The Investigate stage expects one incoming data set.

The Columns section specifies the column definitions of incoming data. The Advanced section allows you to change the default buffering settings for the input link.

Investigate stage: Output tab

The Output tab allows you to specify details about data output from the Investigate stage.

The stage can have one or two output links, depending on the type of investigation that you specify. The Columns section specifies the column definitions of outgoing data. Click Edit at the bottom of the Columns section to specify mapping information. Mapping specifies the relationship between the columns being input to the Sample stage and the output columns. The Advanced section allows you to change the default buffering settings for the output links.