Adding data matching to data classes

Last updated: Nov 27, 2024

You can add data matching to a data class to specify how the data class is assigned automatically to data assets during data analysis. To do so, select a matching method for the data class.

By default, matching data is set to "No automatic matching", which means that you can assign the data class to a column only manually. To enable a data class to be assigned automatically, you must define data matching.

To add a data matching method to a data class:

1. Open the data class and make sure Data matching is enabled in the data class overview.

   Note:
   A data class is not enabled for data matching if a parent data class has matching data disabled. Draft data classes cannot be used for data matching. Inactive data classes can be used to specify how to classify data but do not contribute to any action until they become active.
2. Click edit next to the Matching method field to choose how to specify matching criteria. Most methods include data and column matching criteria. Depending on the deployed services, the following matching methods are available:
   - No automatic matching
   - Match to a list of valid values - A dictionary of valid values is used to determine if each value of a database column belongs to the data class.
   - Match to reference data - Codes from a reference data set are used to determine if each value of a database column belongs to the data class.
   - Match to criteria in a regular expression - A regular expression is used to determine if each value of a database column belongs to the data class.
   - Other matching criteria - Matching is based only on the regular expression to be applied to the column name, on the specified data type of the column, or both. There are no additional criteria to evaluate the values of the column. Other matching criteria are applied before the main matching method. Column values are evaluated against the main matching criteria only if the column name, the column data type, or both match what is specified as other matching criteria.
3. Enter the information to define matching data and other matching criteria as required for your selected matching method and select a threshold value.
4. Optional: Set a matching priority. Select a value in the range -2147483648 to 2147483647 to determine the priority of the data class.
5. Publish the data class.

Notes on enabling and disabling matching data:

- A data class is not enabled for data matching if a parent data class has matching data disabled.
- If you disable matching data for a data class, matching is also disabled for its dependent data classes.

Parent data class

The parent data class organizes data classes in parent/child relationships. It also acts as a kind of "pre-filter" when an automatic matching method is used: if a parent data class has a matching method, the matching methods of its child data classes are evaluated only if the matching method of the parent data class returned a positive match. Defining a parent data class therefore affects the criteria that the data classification process uses to decide whether the data class is assigned to an analyzed data field.
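
The pre-filter behavior can be pictured with a short Python sketch. It is an illustration only, not the product's implementation: the `DataClass` model, the `is_candidate` function, and the two sample classes are hypothetical, and the sketch checks only the direct parent.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical model of a data class; all names here are illustrative only.
Matcher = Callable[[Sequence[str]], bool]


@dataclass
class DataClass:
    name: str
    matcher: Optional[Matcher] = None      # None stands for "No automatic matching"
    parent: Optional["DataClass"] = None


def is_candidate(data_class: DataClass, values: Sequence[str]) -> bool:
    """Return True if the data class matches, honoring the parent pre-filter."""
    parent = data_class.parent
    if parent is not None and parent.matcher is not None:
        # The parent has a matching method, so it must match first.
        if not parent.matcher(values):
            return False
    return data_class.matcher is not None and data_class.matcher(values)


# Made-up parent and child classes for the demonstration.
numeric = DataClass("Numeric code", matcher=lambda vals: all(v.isdigit() for v in vals))
nine_chars = DataClass(
    "Nine-character code",
    matcher=lambda vals: all(len(v) == 9 for v in vals),
    parent=numeric,
)

# Parent and child both match.
print(is_candidate(nine_chars, ["123456789", "987654321"]))
# The child criteria alone would match, but the parent rejects the column.
print(is_candidate(nine_chars, ["abcdefghi"]))
```

In the second call, the values satisfy the child's own criteria, but the child is never evaluated because the parent did not return a positive match.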

Threshold

This field represents the minimum confidence that a candidate data class must reach on a column for the data class to be assigned to that column. For example, you set the threshold of a data class to 90%. During the analysis, one column matches the data class with a confidence of 95%, and another column matches with a confidence of 89%. Because the threshold is 90%, the data class is assigned only to the first column.

Lower the threshold when you want the data class to be assigned even if not all data matches the data class. This is useful if the data quality is not perfect, or if you know that the matching method definition does not cover the whole domain of possible values. A good example is a classifier to detect city names. It is not practical to define an accurate list of values that contains all city names in the world, including the smallest locations. A more practical approach is to enter the list of the 100 largest cities and decrease the threshold. The lower threshold reflects that you do not expect every value of a column to be one of these 100 cities, but that the classification should be positive if enough values, though fewer than 100%, are found in the list.
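
The effect of lowering the threshold can be sketched in Python. The sketch assumes that the confidence is simply the share of column values found in the list, which might differ from how the product computes it. The city list, the column values, and the `match_fraction` function are made up for illustration.

```python
# Stand-in for a list of the 100 largest cities (shortened for illustration).
LARGEST_CITIES = {"tokyo", "delhi", "shanghai", "cairo", "mexico city"}


def match_fraction(values, valid_values):
    """Share of column values found in the valid-value list (case ignored)."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if v.strip().casefold() in valid_values)
    return hits / len(values)


# Made-up column: four of the five values are in the list.
column = ["Tokyo", "Delhi", "Springfield", "Cairo", "Mexico City"]
fraction = match_fraction(column, LARGEST_CITIES)

print(f"{fraction:.0%} of the values match")
print("assigned at a 90% threshold:", fraction >= 0.9)
print("assigned at a 50% threshold:", fraction >= 0.5)
```

Because one value is a smaller city that is not in the list, the column fails a 90% threshold but passes a 50% threshold.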

Setting a threshold is optional. For metadata enrichment, the threshold defined at the project level is used if you do not set a threshold on the data class directly. A threshold set on the data class always takes precedence over the project setting. See Data class assignment settings.
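
The precedence rule for metadata enrichment can be expressed as a small Python sketch. The function name and the sample percentages are hypothetical; only the rule itself comes from the description above.

```python
from typing import Optional


def effective_threshold(class_threshold: Optional[float], project_threshold: float) -> float:
    """Use the data class threshold if one is set; otherwise use the project setting."""
    if class_threshold is None:
        return project_threshold
    return class_threshold


# Made-up project-level threshold of 75%.
print(effective_threshold(None, 75))  # no threshold on the data class
print(effective_threshold(50, 75))    # threshold set on the data class
```

The check uses `is None` rather than truthiness so that a data class threshold of 0 would still take precedence over the project setting.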

The following predefined data classes have a default threshold set in the data class definition:

Default threshold settings
Data class	Threshold
City	50%
Person Name	50%
First Name	50%
Middle Name	50%
Last Name	50%
Organization Name	60%

Priority

The priority of the data class determines the order in which candidate data classes should become the inferred data class. Only data classes with a confidence above the confidence threshold will be assigned. When data match multiple data classes, the one with the highest priority and a confidence above the confidence threshold will be assigned.
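
The following Python sketch shows one way to read this selection rule. It is an assumption-laden illustration, not the product's algorithm: the `Candidate` type, the `infer_data_class` function, and the confidence and threshold values are made up, a confidence equal to the threshold is treated as sufficient, and ties between equal priorities are not addressed.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    confidence: float  # percent
    threshold: float   # percent
    priority: int


def infer_data_class(candidates: List[Candidate]) -> Optional[str]:
    """Pick the highest-priority candidate whose confidence reaches its threshold."""
    eligible = [c for c in candidates if c.confidence >= c.threshold]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.priority).name


# Made-up confidences and thresholds for three candidate data classes.
candidates = [
    Candidate("US State Code", confidence=40, threshold=75, priority=14),
    Candidate("Country Code", confidence=90, threshold=75, priority=13),
    Candidate("Text", confidence=100, threshold=75, priority=-10),
]
print(infer_data_class(candidates))
```

Here the candidate with the highest priority is skipped because its confidence is below the threshold, so the next-highest eligible candidate is selected even though another candidate has a higher confidence.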

Some predefined data classes have a priority set, as listed in the following table. Otherwise, the default priority is 10 for predefined data classes with the matching scope "value" and 0 for data classes with the matching scope "column". For a custom data class to take precedence over a predefined data class, it must be defined with a higher priority.

Default priority settings
Data class	Priority
Address Line 1	12
Address Line 2	12
Address Line 3	12
Boolean	16
Canada Province Code	14
Canada Province Name	12
City	7
Code	-10
Country Code	13
Country Name	12
First Name	10
Gender	16
Identifier	-10
Indicator	-10
Last Name	7
Middle Name	10
Organisation	7
Person Name	7
Quantity	-10
Text	-10
US County	8
US State Code	14
US State Name	12

Match to a list of valid values

When you match data to a list of valid values, you create a list of valid values which classify your data on the level of the values of a database column. You must provide the values one by one manually, so this method is recommended for a small set of values. For longer lists, you can use the Match to reference data method.

In the Match to list of valid values section, specify a list of valid values.

Text matching criteria:

Case sensitive: If selected, only the values that have the same case as the specified valid values are classified as matching the data class. If not selected, the case is ignored.
Exact spacing: If selected, only values with exactly the same spacing as the valid value are classified positively. If not selected, multiple white space characters are collapsed into a single space before the valid values are compared with the tested values. For example, if the valid value is New York and the tested value has several spaces between New and York, the tested value is classified as matching. The same applies if the valid value itself contains multiple white space characters. If the tested value is NewYork without a space, however, the tested value is classified as not matching.
Whole words: If selected, only exact matches are classified positively. If not selected, tested values that contain a valid value as a substring are also classified as matching the data class. For example, if the valid value is Paris and the tested value is Parisienne moonlight, the tested value is classified as matching.

Then specify the percentage of matching data values required to assign this data class.
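
The three text matching criteria can be approximated with the following Python sketch. It is an interpretation of the descriptions above, not the product's code, and the function name and parameters are hypothetical.

```python
import re


def matches_valid_value(tested, valid_values, case_sensitive=False,
                        exact_spacing=False, whole_words=False):
    """Check one tested value against a list of valid values."""

    def normalize(text):
        if not exact_spacing:
            # Collapse runs of white space into a single space.
            text = re.sub(r"\s+", " ", text)
        if not case_sensitive:
            text = text.casefold()
        return text

    candidate = normalize(tested)
    for valid in valid_values:
        target = normalize(valid)
        if whole_words:
            if candidate == target:      # only exact matches count
                return True
        elif target in candidate:        # substring matches count too
            return True
    return False


print(matches_valid_value("New   York", ["New York"]))                      # extra spaces collapsed
print(matches_valid_value("NewYork", ["New York"]))                         # missing space
print(matches_valid_value("Parisienne moonlight", ["Paris"]))               # substring match
print(matches_valid_value("Parisienne moonlight", ["Paris"], whole_words=True))
```

With all options cleared, the first and third calls match. Selecting Whole words turns the substring match in the last call into a non-match.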

Match to reference data

When you match data to a reference data set, you select a reference data set to classify your data on the level of the values of a database column. A reference data set at a minimum consists of the following columns:

Code
Value

Note that this matching method uses the code column in the reference data set to determine the data class.

Example CSV file with a sample of country codes:

code,value
"AND","Andorra"
"ARE","United Arab Emirates"
"AFG","Afghanistan"
"ATG","Antigua And Barbuda"
"AIA","Anguilla"
"ALB","Albania"
"ARM","Armenia"
...

The codes in this example, such as AND, ARE, AFG, can be used to determine the data class.
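
To illustrate that only the code column is used, the following Python sketch reads an excerpt of the sample CSV with the standard `csv` module and checks a made-up column against the codes. The function name and the column values are hypothetical.

```python
import csv
import io

# First rows of the sample reference data set shown above.
SAMPLE_CSV = '''code,value
"AND","Andorra"
"ARE","United Arab Emirates"
"AFG","Afghanistan"
'''


def load_codes(csv_text):
    """Collect the entries of the code column; the value column is ignored."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["code"] for row in reader}


codes = load_codes(SAMPLE_CSV)

# Made-up column values: three codes and one country name.
column = ["AND", "AFG", "Andorra", "ARE"]
matching = [v for v in column if v in codes]
print(matching)
```

The value Andorra is not counted as a match because it appears in the value column of the reference data set, not in the code column.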

Match to criteria in a regular expression

A regular expression is used to determine if each value of a database column belongs to the data class.

When you match to criteria in a regular expression, you create a regular expression which classifies your data on the level of the values of a database column. The regular expression must use JavaScript format.

The regular expression applies to data assets with clear structure, for example databases, tables, or columns.

You can copy and paste any of the following examples for regular expressions to Column name criteria. Then specify a column name to test the regular expression. You can also select the data type and length of the data value.

Note: When using any of these examples, it is strongly recommended that you experiment by using it in the Build Regular Expression tool, entering a variety of matching and non-matching values, so that you understand exactly what is being matched by the expression.

Example - Social Security Number

This regular expression matches a US Social Security number. The hyphens are required:

[0-9]{3}-[0-9]{2}-[0-9]{4}
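
You can try the expression outside the product, for example with the following Python sketch. This pattern behaves the same way in Python's `re` module as in JavaScript. Whether the product requires the whole value to match is not stated here, so the sketch shows both whole-value and substring matching. The sample values are made up.

```python
import re

SSN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def is_ssn(value):
    """True if the whole value has the form NNN-NN-NNNN."""
    return SSN.fullmatch(value) is not None


print(is_ssn("123-45-6789"))   # hyphenated form
print(is_ssn("123456789"))     # hyphens missing
print(is_ssn("123-45-67890"))  # one digit too many for a whole-value match

# A substring search still finds the pattern inside the longer value.
print(SSN.search("123-45-67890") is not None)
```

The last two calls differ only in how the expression is applied, which is why testing with both matching and non-matching values is worthwhile.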

Example - Phone Number (North America)

This regular expression matches:

3334445555
333.444.5555
333-444-5555
333 444 5555
(333) 444 5555
and all combinations thereof

\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}
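
The following Python sketch checks the listed formats against the expression. It assumes whole-value matching and uses Python's `re` module, which treats this pattern the same way as JavaScript. The additional sample values are made up.

```python
import re

PHONE = re.compile(r"\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}")


def is_phone(value):
    """True if the whole value matches the North America phone pattern."""
    return PHONE.fullmatch(value) is not None


listed_formats = [
    "3334445555",
    "333.444.5555",
    "333-444-5555",
    "333 444 5555",
    "(333) 444 5555",
]
for value in listed_formats:
    print(value, is_phone(value))

# "All combinations" includes mixed separators and an unbalanced parenthesis.
print(is_phone("(333-444.5555"))
# Wrong digit grouping does not match.
print(is_phone("333-44-5555"))
```

Because each parenthesis and each separator is optional on its own, the expression also accepts values such as `(333-444.5555`. Tighten the expression if such values must be rejected.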

Example - DOB (date of birth)

The following data class definition identifies DOB (date of birth) columns. Its column name filter contains a regular expression that matches typical column names for a date of birth in several languages:

<tns:DataClass id="DOB" name="%DOB.name" description="%DOB.description" provider="IBM" example="12-30-2015">
    <tns:JavaClassifier className="com.ibm.infosphere.classification.impl.DOBClassifier" />
    <tns:ColumnNameFilter>
        <tns:ColumnNameRegularExpression><![CDATA[dob$|birth(day)?|geburtsdatum|na(issance|cimiento|scita)|urodzenia|(生ま(れた日)?|誕生日)|出生(年月)?]]></tns:ColumnNameRegularExpression>
    </tns:ColumnNameFilter>
</tns:DataClass>
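
To see which column names the expression in this definition accepts, you can run it in a short Python sketch. The sketch assumes that the expression is searched for anywhere in the column name and that matching is case sensitive. Neither assumption is confirmed by the definition itself, and the column names are made up.

```python
import re

# Column name expression copied from the data class definition above.
DOB_COLUMN_NAME = re.compile(
    r"dob$|birth(day)?|geburtsdatum|na(issance|cimiento|scita)|urodzenia"
    r"|(生ま(れた日)?|誕生日)|出生(年月)?"
)


def is_dob_column_name(column_name):
    """True if the expression is found anywhere in the column name."""
    return DOB_COLUMN_NAME.search(column_name) is not None


for name in ["customer_dob", "date_of_birth", "fecha_nacimiento", "誕生日",
             "dob_verified", "order_id"]:
    print(name, is_dob_column_name(name))
```

The name `dob_verified` is rejected because the `dob$` alternative requires `dob` at the end of the column name.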

Other matching criteria

Matching is based on criteria about the name of the column, the data type of the column, or both. There are no additional criteria to evaluate the values of the column. These criteria are applied in addition to the initially selected matching method.

You can specify a regular expression to define matching column names and provide a sample column name to test it. The column data type can be any type, Boolean, date, or number. You can also define the minimum and maximum length of the data values.
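
The following Python sketch shows one possible reading of how these criteria gate the main matching method: every criterion that is specified must match before the column values are evaluated. The function, its parameters, and the sample columns are hypothetical, and the length limits are left out for brevity.

```python
import re
from typing import Optional


def passes_other_criteria(column_name: str, column_type: str,
                          name_pattern: Optional[str] = None,
                          required_type: Optional[str] = None) -> bool:
    """True if the column meets every criterion that is specified."""
    if name_pattern is not None and re.search(name_pattern, column_name) is None:
        return False
    if required_type is not None and column_type != required_type:
        return False
    return True


# Made-up columns checked against a name expression and a required data type.
print(passes_other_criteria("customer_dob", "date", name_pattern=r"dob$", required_type="date"))
print(passes_other_criteria("order_id", "date", name_pattern=r"dob$", required_type="date"))
print(passes_other_criteria("customer_dob", "number", name_pattern=r"dob$", required_type="date"))
```

Only a column that passes this check would go on to have its values evaluated against the main matching criteria.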

Anchoring example

The following example is anchored. Without anchors, a regular expression works the way the Search feature works in most software programs: it finds the text whether it stands by itself or is nested within other text. To match only the complete string, anchor your regular expression with this syntax:

^string$

The "^" and the "$" anchor the characters in the string. The "^" represents the beginning of the string and the "$" represents the end. The "^" has this meaning when it appears at the start of a pattern or of an alternative, and the "$" when it appears at the end of one. Inside a character class such as [^0-9], the "^" negates the class instead.
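
The difference between an anchored and an unanchored expression can be checked with a few lines of Python. The sketch uses the placeholder pattern `string` from the syntax above; the helper function and the test texts are made up, and for these inputs Python's `re` module behaves like JavaScript.

```python
import re


def found(pattern, text):
    """True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


for text in ["string", "substring", "strings"]:
    print(f"{text!r}: unanchored={found('string', text)}, "
          f"anchored={found('^string$', text)}")
```

The unanchored pattern is found in all three texts, while the anchored pattern accepts only the text that consists of exactly `string`.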

For example, if you want to verify that a property value has a specific string of characters, make sure that you anchor it. Suppose a label in an order form is "Order" if the customer has only one order, and is "Orders" if the customer has multiple orders, and you want to confirm that this customer has only one order. On the text property of the label, change the value to a regular expression: