Investigate stage
The Investigate stage shows the actual condition of source data and helps to identify and correct data problems before they corrupt new systems. Understanding your data is a necessary precursor to cleansing.
Investigation parses and analyzes free-form fields, counts unique values, and classifies or assigns a business meaning to each occurrence of a value within a field.
Investigation achieves these goals:
- Uncovers trends, potential anomalies, metadata discrepancies, and undocumented business practices.
- Identifies invalid or default values.
- Reveals common terminology.
- Verifies the reliability of fields that are proposed as matching criteria.
The Investigate stage takes a single input, which can be a link from any database connector that is supported by IBM DataStage, from a flat file or data set, or from any processing stage. Inputs to the Investigate stage can be fixed length or variable. The stage can have one or two output links, depending on the type of investigation that you specify.
The Word Investigation stage parses free-form data fields into individual tokens and analyzes them to create patterns. This stage also provides frequency counts on the tokens. To create the patterns in address data, for example, the Word Investigation stage uses a set of rules for classifying personal names, business names, and addresses. The stage provides pre-built rule sets for investigating patterns on names and postal addresses for a number of different countries. For example, for the United States the stage parses the following components:
- USPREP
- Name, address, and area if the data is not previous formatted
- USNAME
- Individual and organization names
- USADDR
- Street and mailing addresses
- USAREA
- City, state, ZIP code, and other related data
The test field 123 St. Virginia St. is analyzed in the following way:
- Field parsing breaks the address into the individual tokens of 123, St., Virginia, and St.
- Lexical analysis determines the business significance of each token:
- 123 = number
- St. = street type
- Virginia = alpha
- St. = Street type
- Context analysis identifies the various data structures and content as 123 St. Virginia, St.
- 123 = House number
- St. Virginia = Street address
- St. = Street type
The Character Investigation stage parses a single-domain field (one that contains one data element or token, such as Social Security number, telephone number, date, or ZIP code) to analyze and classify data. The Character Investigation stage provides a frequency distribution and pattern analysis of the tokens.
A pattern report is prepared for all types of investigations and displays the count, percentage of data that matches this pattern, the generated pattern, and sample data. This output can be presented in a wide range of formats to conform to standard reporting tools.
Investigate stage: Stage tab
You can specify aspects of the Investigate stage by double-clicking the stage and updating settings on the Stage tab.
The Properties section lets you specify what the stage does. The Advanced section allows you to specify how the stage executes.
Properties
- Alternate locale
- Optional. Lets you specify the international locale you want to use on the server to process the
data.
This value needs to be set only if you are processing data for a language that is not the default language of the server. For example, the default language for your server is French and the data to be processed is Italian.
When you change the locale, InfoSphere QualityStage uses the appropriate collating sequence and decimal separators for the alternate language. The value required depends on the type of server and how it is configured.
If you are using a UNIX server, enter the following command to obtain a list of locales supported by your server:
locale -a
If you are using a Windows workstation, select your InfoSphere QualityStage server directory and the locale subdirectory. The local subdirectory contains folders that are listed alphabetically by the languages they support.
Investigation
- Investigation type
-
Character . A character investigation analyzes and classifies data, parsing it into a single-pattern report.
- Column Investigation selection
- Click Edit to apply a column mask. You use column masks to choose which characters are included in the frequency count or pattern analysis and which characters are displayed as part of the samples in the pattern report.
- Frequency cutoff
- Patterns with a frequency of less than this number will not appear in the pattern or token reports. If desired, enter a higher number. For example, if you enter 4, any pattern that occurs three times or less does not appear in the report.
- Number of samples
- If desired, increase the number of samples that appear for each pattern in the pattern report. The default is 1.
- Comparison Mode: Concatenate
- Performs cross-column correlations between multiple columns to determine relationships. You can choose two noncontiguous columns from anywhere in the record to be investigated as a single data column.
Advanced
- Execution mode
- The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
- Combinability mode
- This is Auto by default, which allows IBM® DataStage® to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
- Preserver partitioning
- This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that next stage in the job should attempt to maintain the partitioning.
Investigate stage: Input tab
The Input tab allows you to specify details about the incoming data sets. The Investigate stage expects one incoming data set.
The Columns section specifies the column definitions of incoming data. The Advanced section allows you to change the default buffering settings for the input link.
Investigate stage: Output tab
The Output tab allows you to specify details about data output from the Investigate stage.
The stage can have one or two output links, depending on the type of investigation that you specify. The Columns section specifies the column definitions of outgoing data. Click Edit at the bottom of the Columns section to specify mapping information. Mapping specifies the relationship between the columns being input to the Sample stage and the output columns. The Advanced section allows you to change the default buffering settings for the output links.