Investigate stage in DataStage

Investigate stage

The Investigate stage shows the actual condition of source data and helps to identify and correct data problems before they corrupt new systems. Understanding your data is a necessary precursor to cleansing.

Investigation parses and analyzes free-form fields, counts unique values, and classifies or assigns a business meaning to each occurrence of a value within a field.

Investigation achieves these goals:

Uncovers trends, potential anomalies, metadata discrepancies, and undocumented business practices.
Identifies invalid or default values.
Reveals common terminology.
Verifies the reliability of fields that are proposed as matching criteria.

The Investigate stage takes a single input, which can be a link from any database connector that is supported by IBM DataStage, from a flat file or data set, or from any processing stage. Inputs to the Investigate stage can be fixed length or variable length. The stage can have one or two output links, depending on the type of investigation that you specify.

The Word Investigation stage parses free-form data fields into individual tokens and analyzes them to create patterns. This stage also provides frequency counts on the tokens. To create the patterns in address data, for example, the Word Investigation stage uses a set of rules for classifying personal names, business names, and addresses. The stage provides pre-built rule sets for investigating patterns in names and postal addresses for a number of different countries. For example, for the United States the stage provides the following rule sets, each of which parses a different kind of data:

USPREP: Name, address, and area if the data is not previously formatted
USNAME: Individual and organization names
USADDR: Street and mailing addresses
USAREA: City, state, ZIP code, and other related data

The test field 123 St. Virginia St. is analyzed in the following way:

Field parsing breaks the address into the individual tokens of 123, St., Virginia, and St.
Lexical analysis determines the business significance of each token:
1. 123 = number
2. St. = street type
3. Virginia = alpha
4. St. = street type
Context analysis identifies the various data structures and content as 123 St. Virginia, St.
1. 123 = House number
2. St. Virginia = Street address
3. St. = Street type
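
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the stage or its rule sets are implemented: the STREET_TYPES table, the class labels, and the single context rule (a street type directly followed by an alpha token is treated as part of the street name) are assumptions made for this example.

```python
# Hypothetical classification table; real rule sets are far larger.
STREET_TYPES = {"st", "ave", "rd", "blvd"}

def tokenize(field):
    """Field parsing: split a free-form field into whitespace-separated tokens."""
    return field.split()

def classify(token):
    """Lexical analysis: assign a coarse class to a single token."""
    bare = token.rstrip(".").lower()
    if bare.isdigit():
        return "number"
    if bare in STREET_TYPES:
        return "street type"
    if bare.isalpha():
        return "alpha"
    return "other"

def contextualize(tokens, classes):
    """Context analysis: relabel tokens by looking at their neighbors."""
    parts = []
    i = 0
    while i < len(tokens):
        cls = classes[i]
        nxt = classes[i + 1] if i + 1 < len(tokens) else None
        if cls == "number" and i == 0:
            # A leading number is taken to be the house number.
            parts.append(("House number", tokens[i]))
            i += 1
        elif cls == "street type" and nxt == "alpha":
            # Assumed rule: "St." before a name belongs to the street name.
            parts.append(("Street address", tokens[i] + " " + tokens[i + 1]))
            i += 2
        elif cls == "street type":
            parts.append(("Street type", tokens[i]))
            i += 1
        else:
            parts.append((cls, tokens[i]))
            i += 1
    return parts

tokens = tokenize("123 St. Virginia St.")
classes = [classify(t) for t in tokens]
for label, text in contextualize(tokens, classes):
    print(f"{text} = {label}")
```

The point of the sketch is that the same token, St., receives the same lexical class both times, and only the context step can tell the two occurrences apart.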

The Character Investigation stage parses a single-domain field (one that contains one data element or token, such as Social Security number, telephone number, date, or ZIP code) to analyze and classify data. The Character Investigation stage provides a frequency distribution and pattern analysis of the tokens.
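
As a rough illustration of pattern analysis on a single-domain field, the following Python sketch replaces each character with a type symbol and counts how often each resulting pattern occurs. The symbols used here (n for a digit, a for a letter, b for a blank, other characters kept as they are) are a simplified convention assumed for this example, and the sample values are made up.

```python
from collections import Counter

def char_pattern(value):
    """Map each character to a type symbol to form a pattern string."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")   # numeric
        elif ch.isalpha():
            out.append("a")   # alphabetic
        elif ch == " ":
            out.append("b")   # blank
        else:
            out.append(ch)    # punctuation is kept as-is
    return "".join(out)

# Made-up values for a field that should hold Social Security numbers.
values = ["123-45-6789", "987-65-4321", "12345678", "N/A"]

# Frequency distribution of the generated patterns.
patterns = Counter(char_pattern(v) for v in values)
for pattern, count in patterns.most_common():
    print(pattern, count)
```

Values whose pattern differs from the dominant one, such as the eight-digit number and the N/A placeholder here, are the ones worth inspecting.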

A pattern report is prepared for all types of investigations. It displays the count, the percentage of data that matches each pattern, the generated pattern, and sample data. This output can be presented in a wide range of formats to conform to standard reporting tools.
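
To make the four report columns concrete, here is a minimal Python sketch that builds such rows from (pattern, value) pairs. The function name pattern_report, the row layout, and the input data are hypothetical, and the actual report produced by the stage may differ in layout and ordering.

```python
from collections import defaultdict

def pattern_report(pairs):
    """Build report rows: count, percentage, pattern, and one sample value."""
    groups = defaultdict(list)
    for pattern, value in pairs:
        groups[pattern].append(value)

    total = sum(len(samples) for samples in groups.values())
    rows = []
    for pattern, samples in groups.items():
        rows.append({
            "count": len(samples),
            "percent": round(100.0 * len(samples) / total, 2),
            "pattern": pattern,
            "sample": samples[0],  # first value seen with this pattern
        })
    # Most frequent patterns first.
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows

# Made-up ZIP code data, already paired with its generated pattern.
pairs = [
    ("nnnnn", "10001"),
    ("nnnnn", "94105"),
    ("nnnnn-nnnn", "94105-1234"),
    ("aaaaa", "UNKWN"),
]
report = pattern_report(pairs)
for row in report:
    print(row["count"], row["percent"], row["pattern"], row["sample"])
```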