Matching algorithms in IBM Match 360 with Watson

IBM Match 360 with Watson uses matching algorithms to resolve data records into master data entities. Data engineers can define different matching algorithms for each entity type in their data. The matching algorithms can then analyze the data to evaluate and compare records, and then collect matched records into entities.

There are two common reasons to run matching on your data:

  • For record deduplication and entity resolution, the matching process analyzes your data to determine whether any duplicate records exist in your data. Suspected duplicate records are merged into master data entities to establish a single, trusted, 360-degree view of your data.
  • To create other types of entity associations, the matching process analyzes your data to collect records into entities that represent different kinds of groupings, such as a household.

Matching to create more than one type of entity

IBM Match 360 matching algorithms are driven by the entity type of the associated data. You can define more than one entity type for each record type in the data model. For each entity type, configure and tune its corresponding matching algorithm to ensure that IBM Match 360 creates entities that meet your organization's requirements.

A single record can be part of more than one separate entity. If your data model includes more than one entity type, you can run different types of matching across the same data set. For example, consider a data set that includes person records from across your enterprise. If the Person record type includes definitions for a Person entity type and a Household entity type, then you can run the Person matching algorithm for entity resolution and deduplication, and also run the Household matching algorithm to create entities made up of person records that belong to the same household.

The matching process

The matching engine goes through a defined process to match records into entities. The matching process includes three major steps:

  1. Standardization. During this step, the algorithm standardizes the format of the data so that it can be processed by the matching engine.

  2. Bucketing. The algorithm sorts data into various categories or "buckets" so that it can compare like-to-like pieces of information.

  3. Comparison. The algorithm compares data to determine a final comparison score. The algorithm then uses the comparison score to determine whether the records are a match.

Each of these steps is defined and configured by the matching algorithm.
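These three steps can be sketched as a minimal pipeline. All names and rules below are illustrative only; they are not part of the IBM Match 360 API:

```python
# Illustrative sketch of the match pipeline: standardize -> bucket -> compare.
# None of these function names or rules come from the IBM Match 360 API.

def standardize(record):
    """Normalize attribute values so equivalent data compares equal."""
    return {k: str(v).strip().upper() for k, v in record.items()}

def bucket_key(record):
    """Group records that share coarse features (here: first 3 letters of name)."""
    return record.get("last_name", "")[:3]

def compare(a, b):
    """Score two standardized records; higher means more alike."""
    shared = [k for k in a if a[k] == b.get(k)]
    return len(shared)

records = [
    {"last_name": "Smith", "given_name": "Jon"},
    {"last_name": "smith ", "given_name": "JON"},
]
std = [standardize(r) for r in records]
# Only records that land in the same bucket are compared in full.
if bucket_key(std[0]) == bucket_key(std[1]):
    score = compare(std[0], std[1])
```

The point of bucketing is visible even in this toy: the expensive pairwise comparison runs only within buckets, not across the whole data set.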

Components of the matching algorithm

Three main types of components define an IBM Match 360 matching algorithm:

Standardizers

Standardizers define how data gets standardized. Standardization enables the matching algorithm to convert the values of different attributes into a standardized representation that can be processed by the matching engine.

The matching algorithm uses multiple standardizers. Each standardizer is suited to process specific attribute types found in record data.

Standardizers are defined by JSON objects. Each standardizer's JSON object definition contains three elements:

  • label - A label that identifies this standardizer.
  • inputs - The inputs list has one element, which is a JSON object. That JSON object has two elements: fields and attributes:
    • fields - The list of fields to use for standardization.
    • attributes - The list of attributes to use for standardization.
  • standardizer_recipe - A list of JSON objects in which each object represents one step to be run during the standardization process of the associated standardizer. Each object in the standardizer_recipe list consists of the following elements:

    • label - A label that identifies this step in the standardizer recipe.
    • method - The internal method used. This element is just for reference and must not be edited.
    • inputs - A single element of the inputs list defined one level above.
    • fields - A list of the fields to be used for this step. This is generally a subset of all the fields defined within the inputs list one level above. Not every step needs to process all of the inputs fields.
    • set_resource - The name of a set type customizable resource used for this step.
    • map_resource - The name of a map type customizable resource used for this step.

    Depending on the behavior of a step, there might be more configuration elements that are required in the corresponding JSON object.
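Putting the elements together, a standardizer definition has roughly the following shape. The top-level structure (label, inputs, standardizer_recipe) follows the description above, but the field names, labels, and resource name are invented, the internal method values are deliberately elided, and the 1-based inputs index is an assumption:

```python
import json

# Illustrative standardizer definition. The overall structure follows the
# elements described above; the field names, labels, and resource name are
# invented, and the internal "method" values are deliberately elided.
standardizer = {
    "label": "person_name_standardizer",
    "inputs": [
        {
            "fields": ["given_name", "last_name"],
            "attributes": ["legal_name"],
        }
    ],
    "standardizer_recipe": [
        {
            "label": "upper_case",
            "method": "...",  # internal method; for reference only, must not be edited
            "inputs": [1],    # refers to the single element of the inputs list (assumed 1-based)
            "fields": ["given_name", "last_name"],
        },
        {
            "label": "stop_token",
            "method": "...",
            "inputs": [1],
            "fields": ["last_name"],
            "set_resource": "name_stop_words",  # set type customizable resource
        },
    ],
}

# The definition is plain JSON, so it round-trips through a JSON encoder.
encoded = json.dumps(standardizer, indent=2)
```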

Preconfigured standardizers

The following standardizers are ready to use in IBM Match 360. The preconfigured standardizers are also customizable.

Person Name standardizer

This standardizer is used to standardize Person Name attribute values. It contains the following recipes, in sequence:

  1. Upper case - Converts the input field values to use their uppercase equivalents.
  2. Map character - Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
  3. Tokenizer - Tokenizes the input field value into multiple tokens, based on the defined list of delimiters.
  4. Parse token - Parses input field values to different tokens, depending on the predefined values in the IBM Match 360 resources. For example, you can use this recipe to parse suffix, prefix, and generation values into appropriate fields.
  5. Length - Discards tokens that are outside a given length range. Minimum and maximum values are defined in the IBM Match 360 resources.
  6. Stop token - Removes anonymous input values, as configured.
  7. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
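As a rough illustration of what this sequence does to an input value, the toy function below applies simplified versions of steps 1 and 3 through 7. Step 2 (character mapping) is skipped, and the suffix list, stop list, and length bounds are invented stand-ins for the configurable IBM Match 360 resources:

```python
# Toy walk-through of the Person Name recipe steps; every rule here is
# invented and far simpler than the configurable IBM Match 360 resources.

SUFFIXES = {"JR", "SR", "III"}      # parse-token targets (example values)
STOP_TOKENS = {"UNKNOWN", "N/A"}    # anonymous values to remove
MIN_LEN, MAX_LEN = 2, 20            # length-step bounds

def standardize_person_name(raw):
    value = raw.upper()                                # 1. upper case
    tokens = value.replace(",", " ").split()           # 3. tokenize on delimiters
    suffix = [t for t in tokens if t in SUFFIXES]      # 4. parse token (suffixes)
    tokens = [t for t in tokens if t not in SUFFIXES]
    tokens = [t for t in tokens if MIN_LEN <= len(t) <= MAX_LEN]  # 5. length
    tokens = [t for t in tokens if t not in STOP_TOKENS]          # 6. stop token
    return {"name_tokens": tokens, "suffix": suffix}   # 7. pick token

result = standardize_person_name("Smith, John P jr")
# The single-letter token is dropped by the length step, and "JR" is
# parsed into the separate suffix field.
```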
Organization Name standardizer

This standardizer is used to standardize Organization Name attribute values. It contains the following recipes, in sequence:

  1. Upper case - Converts the input field values to use their uppercase equivalents.
  2. Map character - Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
  3. Stop character - Removes unwanted input characters from name values.
  4. Map token - Generates nicknames or alternate names for the given input and stores the information in a separate new internal field.
  5. Tokenizer - Tokenizes the input field value into multiple tokens, based on the defined list of delimiters.
  6. Stop token - Removes anonymous input values, as configured.
  7. Acronym - Generates an acronym for the given organization name and stores the information in a separate new internal field. This acronym value is used during comparison to handle abbreviated names.
  8. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
Date standardizer

This standardizer is used to standardize Date attribute values. It supports many different date formats and contains the following recipes, in sequence:

  1. Map character - Converts slash characters (/) to dash characters (-).
  2. Date function - Converts date inputs in different formats to a standardized format.
  3. Stop token - Removes anonymous date values, as configured.
  4. Parse token - Parses input field values to different tokens, depending on certain regular expressions. For example, you can use this recipe to parse a full date input into day, month, and year tokens.
  5. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
Gender standardizer

This standardizer is used to standardize Gender attribute values. It contains the following recipes, in sequence:

  1. Map character - Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
  2. Upper case - Converts the input field values to use their uppercase equivalents.
  3. Stop token - Removes anonymous input gender values, as configured.
  4. Map token - Converts input token values to equivalent values, as configured in the IBM Match 360 resources.
  5. Parse token - Parses processed field values to an appropriate internal field.
  6. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
Address standardizer

This standardizer is used to standardize Address attribute values. Addresses can have several different formats, depending on the locales. This flexibility requires complex processing to convert addresses to a standardized form. The Address standardizer contains the following recipes, in sequence:

  1. Upper case - Converts the input field values to use their uppercase equivalents.
  2. Map character - Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
  3. Map token - Converts input token values to equivalent values, as configured in the IBM Match 360 resources. For example, "United States of America", "United States", and "US" can all be mapped to "USA". This mapping is common for country and province/state field values. In addition, delimiter characters configured in the resource are mapped to the space character.
  4. Tokenizer - Tokenizes the input field value into multiple tokens, based on the defined list of delimiters.
  5. Stop token - Removes anonymous input values, such as postal codes, as configured.
  6. Keep token - Allows only the defined list of values for a given field. For example, you might define a list of postal codes that are allowed during standardization. Input values that are not in the allowed list will be removed.
  7. Parse token - Parses input field values to appropriate internal fields depending on certain regular expressions and predefined values, as configured in the resources. You can use this recipe to truncate a given token to a certain length by using regular expressions. You can also define different alphanumeric pattern sets in the form of regular expressions to allow only certain patterns.
  8. Join fields - Joins two or more fields together to create a new combined value, assigned to an internal field. For example, latitude and longitude field values can be joined together to form a new internal field called lat_long.
  9. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
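The Join fields step (step 8) can be pictured as follows; the separator and field names are assumptions, not the product's actual behavior:

```python
# Sketch of the "Join fields" step: combine two standardized fields into a
# new internal field. The separator and field names are assumptions.
def join_fields(record, sources, target, sep="|"):
    record[target] = sep.join(str(record[s]) for s in sources)
    return record

addr = {"latitude": "41.88", "longitude": "-87.63"}
join_fields(addr, ["latitude", "longitude"], "lat_long")
# addr now carries a combined lat_long internal field.
```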
Phone standardizer

This standardizer is used to standardize Phone attribute values. It contains the following recipes, in sequence:

  1. Stop character - Removes unwanted input characters from phone values.
  2. Stop token - Removes anonymous phone values, as configured.
  3. Phone - Parses input phone numbers with different formats from different locales into a common format. This recipe can be configured to remove area codes and country codes from phone numbers. It can also retain a certain number of digits in a standardized phone number.
  4. Parse token - Parses processed field values to an appropriate internal field depending on certain regular expressions, as configured in the resources.
  5. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
Identification standardizer

This standardizer is used to standardize Identification attribute values. It contains the following recipes, in sequence:

  1. Map character - Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
  2. Upper case - Converts the input field values to use their uppercase equivalents.
  3. Stop character - Removes unwanted input characters from identification values.
  4. Stop token - Removes anonymous input values, as configured.
  5. Map token - Converts input token values to equivalent values, as configured in the IBM Match 360 resources.
  6. Parse token - Parses processed field values to an appropriate internal field depending on certain regular expressions, as configured in the resources.
  7. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
Email standardizer

This standardizer is used to standardize Email attribute values. It contains the following recipes, in sequence:

  1. Map character - Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
  2. Upper case - Converts the input field values to use their uppercase equivalents.
  3. Stop token - Removes anonymous input values, as configured.
  4. Map token - Converts input token values to equivalent values, as configured in the IBM Match 360 resources.
  5. Parse token - Parses processed field values to an appropriate internal field depending on certain regular expressions, as configured in the resources.
  6. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.
Social Media standardizer

This standardizer is used to standardize Social Media attribute values. It contains the following recipes, in sequence:

  1. Map character - Converts UNICODE input characters to equivalent English alphabet characters. Optionally, define the map in the IBM Match 360 resources.
  2. Upper case - Converts the input field values to use their uppercase equivalents.
  3. Stop token - Removes anonymous input values, as configured.
  4. Map token - Converts input token values to equivalent values, as configured in the IBM Match 360 resources.
  5. Parse token - Parses processed field values to an appropriate internal field depending on certain regular expressions, as configured in the resources.
  6. Pick token - Selects a subset (or all) of the tokens as the standardized data to use in bucketing and comparison.

Entity types (bucketing)

Within a single matching algorithm, each record type can have multiple entity type definitions (entity_type JSON objects). For example, in an algorithm defined for a person record type, you might need to create more than one entity type definition, such as person entity, household entity, location entity, and others.

Each entity type can be used to match and link records in different ways. An entity type defines how records are bucketed and compared during the matching process.

Each entity type definition (entity_type) in the matching algorithm has four JSON elements:

  • clerical_review_threshold - Records that have a comparison score lower than the clerical review threshold are considered non-matches. Records that score between the two thresholds are candidates for manual (clerical) review.
  • auto_link_threshold - Records that have a comparison score higher than the autolink threshold are considered strong enough matches to be automatically linked into the same entity.
  • bucket_generators - This section contains the definition of the bucket generators configured for an entity type. There are two types of bucket generators: buckets and bucket groups.

    • Buckets involve bucketing for only one attribute. Each bucket definition includes four elements:

      • label - A label that identifies the bucket generator.
      • maximum_bucket_size - A value that defines the size of large buckets. Any bucket hash with a bucket size greater than this value is not considered for candidate selection during matching.
      • inputs - For buckets, the inputs list has only one element, which is a JSON object. That JSON object has two elements: fields and attributes:
        • fields - The list of fields to use for bucketing.
        • attributes - The list of attributes to use for bucketing.
      • bucket_recipe - A bucket recipe list defines the steps for the bucket generator to complete during the bucketing process. Each bucket_recipe list has a number of subelements:
        • label - A label that identifies the bucket recipe element.
        • method - The internal method used. This element is just for reference and must not be edited.
        • inputs - A single element of the inputs list defined one level above.
        • fields - A list of the fields to be used for this bucket. This is generally a subset of all the fields defined within the inputs list one level above.
        • min_tokens - The minimum number of tokens to use when the recipe is forming a bucket hash.
        • max_tokens - The maximum number of tokens to use together when the recipe is forming a bucket hash.
        • count - A limit on the number of bucket hashes that a single bucket generator produces for one record. If a record generates more bucket hashes than this limit, only the number of hashes set by this element are used.
        • bucket_group - The sequence number for a bucket group that produces a bucket hash. Intermediary steps or recipes are not assigned a sequence number.
        • order - Specifies whether the tokens are sorted in lexicographical order when multiple tokens are combined to form a bucket hash.
        • maximum_bucket_size - A value that defines the size of large buckets. This element is the same as the one defined at the bucket generator level; defining it again at the bucket recipe level gives you finer control over individual large buckets.
    • Bucket groups involve bucketing for more than one attribute. Each bucket_group definition includes five elements:

      • label - A label that identifies the bucket generator.
      • maximum_bucket_size - A value that defines the size of large buckets. Any bucket hash with a bucket size greater than this value is not considered for candidate selection during matching.
      • inputs - For bucket groups, the inputs list has more than one JSON object element. The JSON objects each have two elements: fields and attributes:
        • fields - The list of fields to use for bucketing.
        • attributes - The list of attributes to use for bucketing.
      • bucket_recipe - A bucket recipe list defines the steps for the bucket generator to complete during the bucketing process. Each bucket_recipe list has a number of subelements:
        • label - A label that identifies the bucket recipe element.
        • method - The internal method used. This element is just for reference and must not be edited.
        • inputs - A single element of the inputs list defined one level above.
        • fields - A list of the fields to be used for this bucket. This is generally a subset of all the fields that are defined within the inputs list one level above.
        • min_tokens - The minimum number of tokens to use when the recipe is forming a bucket hash.
        • max_tokens - The maximum number of tokens to use together when the recipe is forming a bucket hash.
        • count - A limit on the number of bucket hashes for a single record that get generated out of a bucket generator. If a record generates many bucket hashes, only the number of hashes set by this element get picked up.
        • bucket_group - The sequence number for a bucket group that produces a bucket hash. Intermediary steps or recipes are not assigned a sequence number.
        • order - Specifies whether the tokens are sorted in lexicographical order when multiple tokens are combined to form a bucket hash.
        • maximum_bucket_size - A value that defines the size of large buckets. This element is the same as the one defined at the bucket generator level. Being able to define it at the bucket recipe level gives you finer control over large individual buckets.
        • set_resource - The name of a set type resource used for a bucket recipe.
        • map_resource - The name of a map type resource used for a bucket recipe.
        • output_fields - If this recipe produces new fields after it completes bucketing functions on the input fields, this element contains a list of the names of the generated fields.
      • bucket_group_recipe - A bucket group recipe section is typically used for defining buckets that consist of more than one attribute. Every element of a bucket_group_recipe list is a JSON object defining the construct for a single bucket group.

        • The inputs list within bucket_group_recipe has more than one element, which means it refers to more than one attribute defined in the inputs array one level above.
        • The fields element is a list of lists. Every inner list of fields is associated with the respective attributes list.
        • min_tokens and max_tokens lists have more than one element, with each element corresponding to respective attributes list.

        Note: In some bucketing recipe definitions, there is a property that is named search_only. By default, its value is false. If set to true, this property indicates that a bucket or bucket group is used only for probabilistic search scenarios and is not used for entity resolution (matching) scenarios.

  • compare_methods - Definitions of the comparison methods that are configured for an entity type. Each compare_methods JSON object consists of definitions of various compare methods. The matching algorithm adds up the scores from each compare method definition to get the final comparison score. Each compare method's JSON object contains three elements:

    • label - A label that identifies the compare method.
    • methods - A list of comparators that form a comparison group. Every element in this array represents one comparator, meant for one type of matching attribute. The matching algorithm considers the maximum of the scores from all the comparators in a methods list as the final score from this comparison group. Each comparator definition includes two elements:
      • inputs - For comparators, the inputs list has only one element, which is a JSON object. That JSON object has two elements: fields and attributes:
        • fields - The list of fields to use for comparison.
        • attributes - The list of attributes to use for comparison.
      • compare_recipe - This list is used mainly for defining the comparison steps. Typically, there is only one JSON element in this array, representing only one step for doing the comparison. This step has five elements:
        • label - A label that identifies the comparison step.
        • method - The internal method used. This element is just for reference and must not be edited.
        • inputs - A single element of the inputs list defined one level above.
        • fields - The fields to be used for this comparison out of all of the fields that are defined in the inputs list one level above.
        • comparison_resource - The name of a customizable comparison resource used for this comparison step.
    • weights - Each comparison that is done by a comparator results in a number score from 0 to 10. This number is called the distance or dis-similarity measure. A distance of 0 indicates that the values being compared are exactly the same. A distance of 10 indicates that they are completely different. Corresponding to the 11 distinct values (0 - 10), 11 weights are defined for each comparator. After calculating the distance, the compare method determines the corresponding weight value from the weights list, resulting in the total comparison score. Data engineers can customize the weights as needed, based on the data quality, distribution, or other factors.
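The bucket recipe elements min_tokens, max_tokens, order, and count described above govern how standardized tokens are combined into bucket hashes. A simplified sketch, with an invented stand-in for the real hashing:

```python
from itertools import combinations

# Sketch of bucket-hash generation: combine standardized tokens into hashes,
# bounded by min_tokens/max_tokens, optionally sorted ("order"), and capped
# by "count". Joining tokens with "|" is a stand-in for the real hashing.
def bucket_hashes(tokens, min_tokens=2, max_tokens=2, order=True, count=10):
    hashes = []
    for size in range(min_tokens, max_tokens + 1):
        for combo in combinations(tokens, size):
            key = sorted(combo) if order else list(combo)
            hashes.append("|".join(key))
    return hashes[:count]  # "count" caps the hashes produced per record

tokens = ["JOHN", "SMITH", "CHICAGO"]
hashes = bucket_hashes(tokens)
# 3 tokens taken 2 at a time yield 3 bucket hashes for this record.
```

Because order sorts the tokens before combining them, "JOHN SMITH" and "SMITH JOHN" land in the same bucket.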

Comparison functions

Comparison functions, sometimes called comparators, are one of the key components of the matching algorithm. Comparison functions are used by the matching engine to compare record data during the matching process. Essentially, record matching involves comparing different types of attributes between different records’ data.

For many of the commonly used attribute types in the person, organization, and location domains, the IBM Match 360 matching engine includes preconfigured comparison methods.

In IBM Match 360, comparison functions use an approach to comparison known as feature vectors. Different customizable feature definitions in IBM Match 360 are used for different comparison functions. Each comparison results in a measure of distance that indicates how dissimilar two given attribute values are.

In the matching algorithm, each discrete distance value is given a weight that determines how strongly to consider that value. The weight combines with the distance to produce a comparison score. The matching algorithm adds all of the comparison scores together to arrive at a final comparison score for the overall record-to-record comparison.
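The distance-to-weight lookup and score summation described above amount to simple table indexing and addition. In this sketch the 11 weight values are invented; real weights are tuned per algorithm:

```python
# Sketch of how a compare method turns a distance (0-10) into a score.
# The 11 weight values below are invented; real weights are tuned per
# deployment based on data quality and distribution.
WEIGHTS = [9.0, 8.0, 6.5, 5.0, 3.5, 2.0, 0.5, -1.0, -2.5, -4.0, -6.0]

def method_score(distance):
    """distance 0 = identical values, 10 = completely different."""
    return WEIGHTS[distance]

def comparison_group_score(distances):
    """A methods list takes the maximum score across its comparators."""
    return max(method_score(d) for d in distances)

# Final record-to-record score: sum the scores of the compare-method groups.
final = comparison_group_score([0, 3]) + comparison_group_score([5])
```

Note that the later weights are negative: a large distance actively pushes the pair of records below the thresholds rather than merely failing to add score.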

About features

A feature represents the fine-level details of a comparison function. Different types of attributes use different types of similarity checks, meaning that their features vary as well.

Feature definitions dictate the types of internal functions used for each comparison function. Examples of internal functions include exact match, edit distance, nickname, phonetic equivalent, or initial match.

Person name comparisons

Different fields within a person name attribute are handled differently. For fields like prefix, suffix, and generation values, exactness or non-matching is checked. Other fields such as given name, last name, and middle name primarily use the following features:

  • Exact match
  • Nickname match
  • Edit distance
  • Initials match
  • Phonetic matching
  • Misplacement of tokens
  • Extra tokens
  • Missing values

Organization name comparisons

For organization names, there is typically one field that contains the entire business name. That field is compared primarily by using the following features:

  • Exact match
  • Nickname match
  • Edit distance
  • Initials match
  • Phonetic matching
  • Misplacement of tokens
  • Extra tokens
  • Missing values

For organization names, the acronyms and nicknames are also compared for exactness.

Date comparisons

For dates, there are typically three fields to compare: day, month, and year.

The year field is compared using the following features:

  • Exactness
  • Edit distance
  • Non-matching
  • Missing

The day and month fields are compared using the following features:

  • Exactness
  • Non-matching
  • Missing

The date comparator also checks to see if the day and month fields have been transposed due to locale differences in date formatting.
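That transposition check can be sketched as follows (a hypothetical helper, not the product's implementation):

```python
# Sketch of the day/month transposition check: 03-04-1990 and 04-03-1990
# may be the same date written in different locale conventions.
def dates_match_with_transposition(a, b):
    if a == b:
        return "exact"
    if (a["year"] == b["year"]
            and a["day"] == b["month"] and a["month"] == b["day"]):
        return "transposed"
    return "different"

d1 = {"day": 3, "month": 4, "year": 1990}
d2 = {"day": 4, "month": 3, "year": 1990}
result = dates_match_with_transposition(d1, d2)  # "transposed"
```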

Gender comparisons

The gender attribute is compared using the following features:

  • Exactness
  • Non-matching

Address comparisons

Different fields within an address attribute are handled differently.

Fields like country, city, province/state, and subdivision are compared using the following features:

  • Exactness
  • Equivalency
  • Edit distance
  • Non-matching
  • Missing

Postal code fields are compared using the following features:

  • Exactness
  • Edit distance
  • Non-matching
  • Missing

Fields like street number, street name, street type, unit number, and direction are compared using the following features:

  • Exactness
  • Equivalency
  • Initials match
  • Edit distance
  • Non-matching
  • Misplacement of tokens
  • Missing

Phone comparisons

Phone number attributes are compared using the following features:

  • Exact match
  • Edit distance
  • Non-matching

Identifier comparisons

Identification number attributes are compared using the following features:

  • Exact match
  • Edit distance
  • Non-matching

Email comparisons

Email attributes consist of two parts: the unique ID (before the @ symbol) and the email domain (after the @ symbol). Both the ID and domain parts are compared, separately, using the following features:

  • Exact match
  • Edit distance
  • Non-matching

The outcomes of the two comparisons are combined in a weighted manner to produce an overall comparison score.
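The weighted combination might be pictured like this; both the per-part scoring and the 0.7/0.3 weight split are invented for illustration:

```python
# Sketch of email comparison: score the ID and domain parts separately,
# then combine with weights. The scoring and the weights are illustrative.
def part_score(a, b):
    return 1.0 if a == b else 0.0  # stand-in for exact/edit-distance logic

def email_score(email_a, email_b, id_weight=0.7, domain_weight=0.3):
    id_a, dom_a = email_a.lower().split("@")
    id_b, dom_b = email_b.lower().split("@")
    return id_weight * part_score(id_a, id_b) + domain_weight * part_score(dom_a, dom_b)

score = email_score("jsmith@example.com", "jsmith@example.org")
# Same ID, different domain: only the ID weight contributes.
```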

Social Media comparisons

Social media handle attributes are compared using the following features:

  • Exact match
  • Edit distance
  • Non-matching

Edit distance

The IBM Match 360 matching engine calculates edit distance as one of the internal functions during comparison and matching of various attributes. Edit distance is a measurement of how dissimilar two strings are from each other. It is calculated by counting the number of changes required to transform one string into the other.

There are different ways to define edit distance by using different sets of string operations. By default, IBM Match 360 uses a standard edit distance function that is publicly available in the literature. As an alternative, you can choose to use a specialized IBM Match 360 edit distance function.

  • The standard edit distance function provides better matching engine performance. For this reason, it is the default comparison configuration for all attributes except the Telephone attribute type.
  • The specialized edit distance function is built for hyper-precision use cases. This option takes into consideration typos or similar-looking characters, such as 8 and B, 0 and O, 5 and S, or 1 and I. When there is a mismatch in two compared values based on similar-looking characters, the assigned dissimilarity measure is less than what would be assigned by a standard edit distance function. As a result, these types of mismatches are not penalized as strongly by the specialized function.

    Important: The specialized edit distance function includes some complex calculations. As a result, choosing this option has an impact on system performance during the matching process.
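Standard edit distance counts the insertions, deletions, and substitutions that are needed to turn one string into the other. The sketch below implements that, plus an optional reduced substitution cost for look-alike characters in the spirit of the specialized function; the 0.5 cost is illustrative only, and the product's actual specialized function is not publicly specified:

```python
# Levenshtein-style edit distance, with an optional reduced substitution cost
# for similar-looking characters (8/B, 0/O, 5/S, 1/I). The 0.5 cost is an
# invented value; IBM Match 360's specialized function is not publicly specified.
SIMILAR = {frozenset(p) for p in [("8", "B"), ("0", "O"), ("5", "S"), ("1", "I")]}

def edit_distance(a, b, similar_aware=False):
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)          # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = float(j)          # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif similar_aware and frozenset((a[i - 1], b[j - 1])) in SIMILAR:
                sub = 0.5           # cheaper substitution for look-alikes
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution
    return d[m][n]

# "B0B" vs "BOB": the plain function charges a full substitution, while the
# similar-aware variant penalizes the 0/O mix-up less heavily.
```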

For information about customizing your matching algorithm, including using the API to customize the edit distance, see Customizing and strengthening your matching algorithm.

Parent topic: Managing master data