Advanced matching algorithm tuning using the IBM Match 360 REST API

To achieve an advanced level of customization, you can use the IBM Match 360 REST API to configure and tune your matching algorithm.

When working with the API, you must explicitly deploy the algorithm before running your matching jobs. Within the api-model microservice API, the POST /mdm/v1/algorithms/{record_type} method generates a matching algorithm based on the supplied attributes and fields.

You can further customize the matching algorithm by using the PUT /mdm/v1/algorithms/{record_type} method, which enables you to provide a fully defined matching algorithm in the method's payload.

Here is a sample payload for POST /mdm/v1/algorithms/{record_type} that defines the autolink threshold and a set of matching attributes and fields:

{
  "person_entity": {
    "auto_link_threshold": 0.4,
    "matching_attributes": [
      {"attributes": ["legal_name"]},
      {"attributes": ["primary_residence"]},
      {"attributes": ["mobile_telephone"]},
      {"attributes": ["birth_date"]},
      {"attributes": ["gender"]},
      {"attributes": ["personal_email"]}
    ]
  }
}
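As a rough illustration, the sample payload could be assembled and wrapped in a request object with a few lines of Python. This is a sketch only: the helper names, the base URL, the person record type, and the bearer-token header are placeholders chosen for illustration, not part of the IBM Match 360 SDK. See the API reference for the actual authentication steps.

```python
import json
import urllib.request

def build_algorithm_payload(entity_type, auto_link_threshold, attributes):
    """Build a payload shaped like the sample: one entry per matching attribute."""
    return {
        entity_type: {
            "auto_link_threshold": auto_link_threshold,
            "matching_attributes": [{"attributes": [name]} for name in attributes],
        }
    }

def build_post_request(base_url, record_type, payload, token):
    """Prepare (but do not send) POST /mdm/v1/algorithms/{record_type}."""
    # base_url and the Authorization header are assumptions for illustration.
    return urllib.request.Request(
        url=f"{base_url}/mdm/v1/algorithms/{record_type}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

payload = build_algorithm_payload(
    "person_entity",
    0.4,
    ["legal_name", "primary_residence", "mobile_telephone",
     "birth_date", "gender", "personal_email"],
)
request = build_post_request("https://example.invalid", "person", payload, "<token>")
```

The request object is only constructed here. Sending it (for example with urllib.request.urlopen) requires a real host and valid credentials.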

For more information about the IBM Match 360 REST API and the corresponding SDKs, including authentication instructions and full documentation of each method, see the IBM Match 360 API reference.

Remember: Any time you update the matching algorithm, even through the API, you must run matching afterwards to see the changes reflected in your match results.

In this topic:

Configuring multi-dimensional comparison filters
Switching the edit distance function
Configuring a glue record threshold
Configuring source-specific matching thresholds

Configuring multi-dimensional comparison filters

Fine-tune your matching algorithm even further by defining multi-dimensional comparison filters. Multi-dimensional filters compare attributes across records and adjust matching scores and weights up or down based on criteria that you define. They can reduce the number of false positive or false negative matches in your matching results.

You can also use multi-dimensional comparison filters to include your own deterministic matching rules that override the machine learning-based matching results.

Generating a multi-dimensional comparison filter

To generate a multi-dimensional comparison filter in your matching algorithm, update the matching engine configuration by using REST API commands:

Access and authenticate to the IBM Match 360 API interface.

Specify a POST /mdm/v1/algorithms/{record_type} payload that defines a filter, as in the following example:

{
  "person_entity": {
    "auto_link_threshold": 0.4,
    "matching_attributes": [
      {"attributes": ["legal_name"], "post_filter_methods": ["false_positive_filter"]},
      {"attributes": ["primary_residence"], "post_filter_methods": ["false_positive_filter"]},
      {"attributes": ["mobile_telephone"]},
      {"attributes": ["birth_date"], "post_filter_methods": ["false_positive_filter"]},
      {"attributes": ["gender"]},
      {"attributes": ["personal_email"]}
    ]
  }
}

In the sample payload, false_positive_filter is the name of the custom filter. It applies to each attribute in the payload that includes the filter name.

The sample API payload generates an algorithm that contains a false_positive_filter whose weights and penalties are all set to the default value of 0.

Optionally, you can customize the weights and penalties to meet your organization's requirements, and then deploy your updated algorithm using the PUT /mdm/v1/algorithms/{record_type} API.
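To illustrate how the filter name attaches to individual attributes, the following sketch adds post_filter_methods to selected entries of a POST payload. The add_post_filter helper is hypothetical and only manipulates a local dictionary. It does not call the API.

```python
import copy

def add_post_filter(payload, entity_type, filter_name, attribute_names):
    """Return a copy of the payload with filter_name attached to the chosen attributes."""
    updated = copy.deepcopy(payload)
    for entry in updated[entity_type]["matching_attributes"]:
        if any(name in attribute_names for name in entry["attributes"]):
            methods = entry.setdefault("post_filter_methods", [])
            if filter_name not in methods:
                methods.append(filter_name)
    return updated

base = {"person_entity": {"auto_link_threshold": 0.4, "matching_attributes": [
    {"attributes": ["legal_name"]},
    {"attributes": ["primary_residence"]},
    {"attributes": ["mobile_telephone"]},
    {"attributes": ["birth_date"]},
    {"attributes": ["gender"]},
    {"attributes": ["personal_email"]},
]}}

filtered = add_post_filter(
    base, "person_entity", "false_positive_filter",
    {"legal_name", "primary_residence", "birth_date"},
)
```

Attributes that are not named, such as mobile_telephone, are left without a post_filter_methods entry, which mirrors the sample payload.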

Understanding the parameters that define filters

To understand the configuration parameters that define multi-dimensional comparison filters, consider the example of the false_positive_filter created in the previous section.

Retrieve the current algorithm using the API command GET /mdm/v1/algorithms/{record_type}.

After you submit the POST request from the previous section with its example payload, the algorithm configuration contains the following generated section:

{
  "false_positive_filter": {
    "filter_recipe": [
      {
        "method": "FilterMethod.MultiDimFilter",
        "inputs": [1,2,3],
        "label": "Multi-Dim filter",
        "weights": [
          {
            "distances": [0,0],
            "values": [0,0,0,0,0,0]
          }
        ]
      }
    ],
    "inputs": [
      {"compare_method": "address_compare"},
      {"compare_method": "date_compare"},
      {"compare_method": "pername_compare"}
    ],
    "label": "false_positive_filter"
  }
}

The example false_positive_filter section includes the standard parameters that define multi-dimensional comparison filters:

filter_recipe - This section contains an array of parameters that provide the recipe for defining matching weights for each input.
- inputs - The filter_recipe.inputs section contains an index of the inputs that this filter recipe applies to. These are number values that correspond to the order of the compare methods listed in the inputs section. In this example, 1 corresponds to the address_compare method, 2 corresponds to the date_compare method, and 3 corresponds to the pername_compare method.
- weights - The weights section is an array of elements that define how each input is weighed for the three-dimensional comparison. The weights section includes distances and values definitions for the inputs. The default weight is 0 for any input that is not defined.
inputs - This section contains the compare methods for the matching attributes. These methods use the distances and weights that you define in the filter_recipe section.
max_distance - Optional (not shown). This parameter defines the maximum distance. The default maximum distance is 5, meaning that the filter_recipe.weights.values parameter can include 6 elements ("values":[0,1,2,3,4,5]).
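The relationship between these parameters can be checked locally before an algorithm is uploaded. The following sketch is not an official validator. The check_filter_recipe helper is hypothetical, and it assumes two rules read from the descriptions above: each filter_recipe.inputs index is a 1-based position in the inputs list, and a values array can hold at most max_distance + 1 elements.

```python
DEFAULT_MAX_DISTANCE = 5  # default stated in the max_distance description

def check_filter_recipe(filter_section):
    """Return a list of human-readable problems found in a filter section."""
    problems = []
    compare_inputs = filter_section.get("inputs", [])
    for recipe in filter_section.get("filter_recipe", []):
        max_distance = recipe.get("max_distance", DEFAULT_MAX_DISTANCE)
        for index in recipe.get("inputs", []):
            # Assumed: indexes are 1-based positions in the outer inputs list.
            if not 1 <= index <= len(compare_inputs):
                problems.append(f"input index {index} has no compare method")
        for weight in recipe.get("weights", []):
            if len(weight["values"]) > max_distance + 1:
                problems.append(
                    f"values for distances {weight['distances']} "
                    f"exceed max_distance {max_distance}"
                )
    return problems

generated = {
    "filter_recipe": [{
        "method": "FilterMethod.MultiDimFilter",
        "inputs": [1, 2, 3],
        "label": "Multi-Dim filter",
        "weights": [{"distances": [0, 0], "values": [0, 0, 0, 0, 0, 0]}],
    }],
    "inputs": [
        {"compare_method": "address_compare"},
        {"compare_method": "date_compare"},
        {"compare_method": "pername_compare"},
    ],
    "label": "false_positive_filter",
}

problems = check_filter_recipe(generated)
```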

Configuring custom filters

To customize existing compare methods for use with a multi-dimensional comparison filter:

Retrieve the current algorithm:
```
GET /mdm/v1/algorithms/{record_type}
```
Update the algorithm as needed. For example, you can:
- Add or update elements in the weights section to customize the weights for the listed inputs.
- Define the maximum distance by adding a max_distance parameter.
- Add compare methods as inputs that will use this filter instead of the default matching weights.
Overwrite the matching algorithm with your updated version:
```
PUT /mdm/v1/algorithms/{record_type}
```
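The edit step in the middle of this procedure can be scripted. The following is a minimal sketch, assuming that you have already extracted the filter section from the retrieved algorithm JSON. The customize_filter helper and the weight values are hypothetical, and the sketch changes only the first entry in filter_recipe.

```python
import copy

def customize_filter(filter_section, max_distance, weights):
    """Return a copy of a filter section with a new max_distance and weights list."""
    updated = copy.deepcopy(filter_section)
    recipe = updated["filter_recipe"][0]  # this sketch edits the first recipe only
    recipe["max_distance"] = max_distance
    recipe["weights"] = copy.deepcopy(weights)
    return updated

current = {
    "filter_recipe": [{
        "method": "FilterMethod.MultiDimFilter",
        "inputs": [1, 2, 3],
        "label": "Multi-Dim filter",
        "weights": [{"distances": [0, 0], "values": [0, 0, 0, 0, 0, 0]}],
    }],
    "inputs": [
        {"compare_method": "address_compare"},
        {"compare_method": "date_compare"},
        {"compare_method": "pername_compare"},
    ],
    "label": "false_positive_filter",
}

# Illustrative penalties only; choose values that suit your data.
new_weights = [{"distances": [0, 0], "values": [0, -2, -4, -6, -8, -10, -12, -14]}]
customized = customize_filter(current, 7, new_weights)
```

Working on a copy keeps the retrieved configuration intact, so you can compare the two versions before you overwrite the algorithm.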

Example 1: Use the following sample payload if you want to set the maximum distance to 9 and specify custom weights and penalties for different combinations of inputs and distances, as follows:

- input1 distance=0, input2 distance=0, input3 distance=[0,1,2,3,4,5,6,7,8,9]. In this case, the distance combination [0,0,3] gives a penalized score of -15.
- input1 distance=1, input2 distance=0, input3 distance=[0,1,2,3,4,5,6,7,8,9]. In this case, the distance combination [1,0,9] gives a penalized score of -30.

{
  "false_positive_filter": {
    "filter_recipe": [
      {
        "method": "FilterMethod.MultiDimFilter",
        "max_distance": 9,
        "inputs": [1,2,3],
        "label": "Multi-Dim filter",
        "weights": [
          {
            "distances": [0,0],
            "values": [0,-5,-10,-15,-20,-25,-30,-30,-30,-30]
          },
          {
            "distances": [1,0],
            "values": [0,-5,-10,-15,-20,-25,-30,-30,-30,-30]
          }
        ]
      }
    ],
    "inputs": [
      {"compare_method": "address_compare"},
      {"compare_method": "date_compare"},
      {"compare_method": "pername_compare"}
    ],
    "label": "false_positive_filter"
  }
}
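One way to read the weights in Example 1 is as a lookup table: the distances array fixes the distances of the first two inputs, and the values array is indexed by the distance of the third input. The following sketch implements that reading. It is an interpretation of the example, not the matching engine's actual code, and the lookup_filter_weight helper is hypothetical.

```python
def lookup_filter_weight(recipe, distances):
    """Look up the weight for one distance combination, for example [0, 0, 3]."""
    *leading, last = distances
    for weight in recipe["weights"]:
        if weight["distances"] == leading:
            values = weight["values"]
            # Positions that are not defined fall back to the default weight of 0.
            return values[last] if last < len(values) else 0
    # No entry for this combination of leading distances: default weight of 0.
    return 0

recipe = {
    "max_distance": 9,
    "weights": [
        {"distances": [0, 0], "values": [0, -5, -10, -15, -20, -25, -30, -30, -30, -30]},
        {"distances": [1, 0], "values": [0, -5, -10, -15, -20, -25, -30, -30, -30, -30]},
    ],
}

print(lookup_filter_weight(recipe, [0, 0, 3]))  # -15
print(lookup_filter_weight(recipe, [1, 0, 9]))  # -30
print(lookup_filter_weight(recipe, [2, 0, 0]))  # 0, because no entry is defined
```

For the combination [0,0,3], the lookup reads position 3 of the first values array, which is -15 in the payload above.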

Example 2: You can add your own customized compare methods and configure them to be excluded from contributing to the overall match score, as in the following sample payload. In this case, the custom methods would be used by the multi-dimensional comparison filter only.

In the following example, the given_name_only_compare compare method sets overall_score_contribution to false.

{
  "given_name_only_compare": {
    "methods": [
      {
        "inputs": [
          {
            "attributes": [
              "legal_name"
            ],
            "fields": [
              "given_name"
            ]
          }
        ],
        "compare_recipe": [
          {
            "comparison_resource": "person_person_entity_person_compare_spec_name",
            "method": "CompareMethod.NameCompare",
            "inputs": [
              1
            ],
            "label": "Given Name Only Match",
            "fields": [
              "given_name"
            ]
          } 
        ]
      }
    ],
    "overall_score_contribution" : false,
    "label": "Given Name Only Compare",
    "weights": [1,0,0,0,0,0,0,0,0,0,0]
  }
}

Switching the edit distance function

The IBM Match 360 matching engine calculates edit distance as one of the internal functions during comparison and matching of various attributes. Edit distance is a measurement of how dissimilar two strings are from each other. It is calculated by counting the number of changes required to transform one string into the other.
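For reference, the classic form of this measurement is the Levenshtein distance, which counts single-character insertions, deletions, and substitutions. The following sketch shows the textbook algorithm only. It is not the IBM Match 360 implementation, and the edit_distance name is illustrative.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(edit_distance("Smith", "Smyth"))     # 1
print(edit_distance("kitten", "sitting"))  # 3
```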

You can choose between the standard edit distance function or a specialized one. The standard edit distance is the default configuration to ensure faster performance during matching. For more information about the edit distance, see IBM Match 360 matching algorithms.

To change the active edit distance function, update the matching engine configuration by using REST API commands:

Access and authenticate to the IBM Match 360 API interface.
Retrieve the existing configuration JSON file for the comparison function, compare_spec_resource:
```
GET /mdm/v1/compare_spec_resources/{resource_name}
```
On your local machine, edit the JSON to add the line "similar_characters_enabled": true (or remove it if you want to switch back to the default edit distance setting).
Update the IBM Match 360 configuration by uploading your edited JSON:
```
PUT /mdm/v1/compare_spec_resources/{resource_name}
```
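The local edit in this procedure can be sketched as follows. The set_similar_characters helper is hypothetical, and the sketch assumes that similar_characters_enabled is a top-level key of the compare spec resource JSON. Check the JSON that you retrieved before relying on that placement.

```python
import copy

def set_similar_characters(compare_spec, enabled):
    """Return a copy of the compare spec with the specialized edit distance on or off."""
    updated = copy.deepcopy(compare_spec)
    if enabled:
        updated["similar_characters_enabled"] = True
    else:
        # Removing the key switches back to the default edit distance setting.
        updated.pop("similar_characters_enabled", None)
    return updated

retrieved = {"other_settings": {}}  # placeholder for the JSON returned by the GET call
specialized = set_similar_characters(retrieved, True)
restored = set_similar_characters(specialized, False)
```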

Configuring a glue record threshold

You can define a glue record threshold by using API commands to update the IBM Match 360 matching algorithm.

When IBM Match 360 forms entities through matching, some low quality records can act as glue records. Glue records get their name because they stick to many other records like glue. Because glue records include few or no detailed attribute values, they can appear to match with many different records. A glue record's matching behavior can inadvertently and incorrectly create very large entities that have only one low quality glue record in common.

As a simplified example, consider a low quality record that has no attributes other than a name, such as "John Smith". A record such as this can easily match with any other "John Smith" in the data set, causing other records that otherwise would not be matched to be included in a single "John Smith" entity.

By setting a glue record threshold in the matching algorithm for each entity type, data engineers can prevent glue records from causing the formation of large, poorly matched entities.

When a glue record threshold is configured, IBM Match 360 identifies glue records by using their self-match score. A self-match score is the matching score achieved by comparing a record to itself. A high self-match score indicates that the record has a good number of high quality matching attributes.

IBM Match 360 identifies glue records by checking whether their self-match score plus the value of the glue record threshold is less than the self-match score of the center record in the entity. If it is less, then the record is considered a glue record and is not included in the entity.
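The rule can be written as a single comparison. The following sketch restates it in Python. The is_glue_record name and the scores are made up for illustration.

```python
def is_glue_record(record_self_score, center_self_score, glue_threshold):
    """Apply the glue record rule: self score + threshold < center record's self score."""
    return record_self_score + glue_threshold < center_self_score

# Hypothetical scores: a sparse record compared with a rich center record.
print(is_glue_record(30, 120, 20))   # True: 30 + 20 is less than 120
print(is_glue_record(110, 120, 20))  # False: 110 + 20 is not less than 120
```

A larger threshold value makes the check more lenient, because the record's self-match score can fall further below the center record's score before the record is excluded.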

Glue record thresholds are optional, and are not set by default. Each entity type's glue record threshold must be defined separately.

To set a glue record threshold:

Access and authenticate to the IBM Match 360 API interface.
Retrieve the existing matching algorithm configuration (in JSON format) for the given record type:
```
GET /mdm/v1/algorithms/{record_type}
```

On your local machine, edit the JSON to add the glue_threshold parameter under the appropriate entity type. Provide a numerical threshold value. (Delete the parameter if you want to remove an existing glue record threshold.) For example:

locale: {...}
encryption: {...}
standardizers: {...}
entity_types:
  person_entity:
    bucket_generators: {...}
    auto_link_threshold: 65
    clerical_review_threshold: 55
    glue_threshold: 20
    compare_methods: {...}

Update the IBM Match 360 matching algorithm:
```
PUT /mdm/v1/algorithms/{record_type}
```
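The edit step can also be scripted. The following sketch assumes that the retrieved JSON follows the outline shown above, with entity types nested under entity_types. The set_glue_threshold helper is hypothetical.

```python
import copy

def set_glue_threshold(algorithm, entity_type, threshold=None):
    """Return a copy of the algorithm with glue_threshold set, or removed if None."""
    updated = copy.deepcopy(algorithm)
    entity = updated["entity_types"][entity_type]
    if threshold is None:
        entity.pop("glue_threshold", None)
    else:
        entity["glue_threshold"] = threshold
    return updated

algorithm = {
    "entity_types": {
        "person_entity": {
            "auto_link_threshold": 65,
            "clerical_review_threshold": 55,
        }
    }
}

with_threshold = set_glue_threshold(algorithm, "person_entity", 20)
without_threshold = set_glue_threshold(with_threshold, "person_entity")
```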

Configuring source-specific matching thresholds

Data engineers can define clerical review thresholds and autolink thresholds within the matching algorithm that are specific to various record sources. This enables your organization to handle matching differently depending on how trusted the source is.

Your organization might have records from different sources that each use different attributes and have varying levels of quality. By configuring record source-level matching thresholds, you can weigh the data from trusted sources more heavily than data from less trusted sources, or even exclude some sources from participating in matching. Sources that are excluded from matching can still be used as reference sources in the system.

Source-level thresholds are optional, and are not set by default.

Source-level thresholds must be defined separately for each entity type in your data model. As a reminder, each entity type has its own matching algorithm definition.

To set up source-level matching thresholds:

Access and authenticate to the IBM Match 360 API interface.
Retrieve the existing matching algorithm configuration file (in JSON format) for the entity type you want to configure:
```
GET /mdm/v1/algorithms/{record_type}
```

On your local machine, edit the JSON to add the source_level_thresholds object under the appropriate entity type (such as person_entity). For example:

"person_entity": {
  "auto_link_threshold": 150,
  "clerical_review_threshold": 120,
  "source_level_thresholds": {
    "src0": {
      "default": [165, 150],
      "srcxsrc": {
        "src0": [null, null],
        "src1": [160, 130],
        "src2": [123, 111],
        "src3": [null, null]
      }
    },
    "src1": {
      "srcxsrc": {
        "src1": [160, 130],
        "src2": [123, 111],
        "src3": [136, 120],
        "src4": [120, null]
      }
    }
  }
}

For more information about this example and guidance about how to define the source-level threshold JSON object, see Sample JSON object for source-level thresholds.

Update the IBM Match 360 matching algorithm:
```
PUT /mdm/v1/algorithms/{record_type}
```
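Before you upload an edited configuration, it can help to check that every threshold pair is ordered correctly, because when both values in a pair are given, the autolink threshold should be higher than the clerical review threshold. The following sketch performs that check on the source-level pairs only. The find_threshold_problems helper is hypothetical and is not part of the API.

```python
def find_threshold_problems(entity_config):
    """Return the locations of pairs where autolink is not above clerical review."""
    problems = []

    def check(label, pair):
        autolink, clerical = pair
        # A null (None) value disables that threshold, so there is nothing to compare.
        if autolink is not None and clerical is not None and autolink <= clerical:
            problems.append(label)

    for source, settings in entity_config.get("source_level_thresholds", {}).items():
        if "default" in settings:
            check(f"{source}.default", settings["default"])
        for other, pair in settings.get("srcxsrc", {}).items():
            check(f"{source}.srcxsrc.{other}", pair)
    return problems

person_entity = {
    "auto_link_threshold": 150,
    "clerical_review_threshold": 120,
    "source_level_thresholds": {
        "src0": {
            "default": [165, 150],
            "srcxsrc": {"src0": [None, None], "src1": [160, 130]},
        },
        "src1": {"srcxsrc": {"src3": [136, 120], "src4": [120, None]}},
    },
}

problems = find_threshold_problems(person_entity)
```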

For more information about source-level thresholds, see the following subsections:

Sample JSON object for source-level thresholds
Assessing source-level threshold results
Source-level thresholds and pair reviews

Sample JSON object for source-level thresholds

In the following JSON example, you can see a snippet of the matching algorithm configuration file that defines source-level thresholds for the Person entity.

"person_entity": {
  "auto_link_threshold": 150,
  "clerical_review_threshold": 120,
  "source_level_thresholds": {
    "src0": {
      "default": [165, 150],
      "srcxsrc": {
        "src0": [null, null],
        "src1": [160, 130],
        "src2": [123, 111],
        "src3": [null, null]
      }
    },
    "src1": {
      "srcxsrc": {
        "src1": [160, 130],
        "src2": [123, 111],
        "src3": [136, 120],
        "src4": [120, null]
      }
    }
  }
}

In the preceding example:

The default global autolink threshold is 150.
The default global clerical review threshold is 120.
src0, src1, src2, src3, and src4 are examples of source names.
Within the source_level_thresholds object, source-by-source thresholds are defined for two sources: src0 and src1.

General guidance:

Under each source in the source_level_thresholds object, you can optionally override the default global matching thresholds for that source by using the default parameter.
Under each source, you can define an array of source-to-source matching thresholds under the srcxsrc property. These thresholds are used when comparing records from the listed sources.
Within the array, the values provided in square brackets are in the following format: [autolink-threshold, clerical-threshold]. So [136, 120] indicates that for the given source-to-source comparison, the autolink threshold is 136 and the clerical review threshold is 120.
When both values are given, the autolink threshold should always be higher than the clerical review threshold.
If a value is given as null, then that threshold is disabled.
If both values in a pair are given as null, then both matching and linking between the two sources are disabled.
When both values are null and the two given sources are the same, then the source is considered a reference source only. For example, src0 is a reference source for src0 in the preceding example JSON. Any entity that only has records from reference sources is not viable.
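The guidance above can be sketched as a small lookup. The precedence used here is an assumption, not documented behavior: a pair-specific srcxsrc entry first, then the source's default, then the global thresholds. The sketch also looks up only the first source's settings, and how the matching engine treats the reverse direction is not covered here. Both helper names are hypothetical.

```python
def resolve_thresholds(entity_config, source_a, source_b):
    """Return (autolink, clerical) for comparing source_a records with source_b records."""
    global_pair = (
        entity_config["auto_link_threshold"],
        entity_config["clerical_review_threshold"],
    )
    settings = entity_config.get("source_level_thresholds", {}).get(source_a, {})
    pair = settings.get("srcxsrc", {}).get(source_b)
    if pair is None:
        # Assumed fallback order: the source's default, then the global thresholds.
        pair = settings.get("default", global_pair)
    return tuple(pair)

def describe_pair(source_a, source_b, pair):
    """Summarize what a threshold pair means, following the general guidance."""
    autolink, clerical = pair
    if autolink is None and clerical is None:
        if source_a == source_b:
            return "reference source only"
        return "matching and linking disabled"
    if autolink is None:
        return "autolink disabled"
    if clerical is None:
        return "clerical review disabled"
    return "both thresholds active"

person_entity = {
    "auto_link_threshold": 150,
    "clerical_review_threshold": 120,
    "source_level_thresholds": {
        "src0": {
            "default": [165, 150],
            "srcxsrc": {"src0": [None, None], "src1": [160, 130],
                        "src2": [123, 111], "src3": [None, None]},
        },
        "src1": {
            "srcxsrc": {"src1": [160, 130], "src2": [123, 111],
                        "src3": [136, 120], "src4": [120, None]},
        },
    },
}

print(resolve_thresholds(person_entity, "src0", "src1"))  # (160, 130)
print(resolve_thresholds(person_entity, "src0", "src4"))  # (165, 150)
print(describe_pair("src0", "src0", resolve_thresholds(person_entity, "src0", "src0")))
```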

Assessing source-level threshold results

If you have configured source-level thresholds in your custom matching algorithm, use the following REST API method to get scoring details.

POST /v1/compare/?details=debug&crn={CRN}&entity_type={entity_type}&record_type={record_type}

Use the information returned by this method to help you to assess the results and, if necessary, fine-tune your source-level threshold configuration.
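The query string for this method can be assembled with the Python standard library, as in the following sketch. The base URL, CRN value, and record type are placeholders, and the build_compare_url helper is hypothetical. The request body, which is not described in this topic, is not covered.

```python
from urllib.parse import urlencode

def build_compare_url(base_url, crn, entity_type, record_type):
    """Build the URL for POST /v1/compare/ with debug-level scoring details."""
    query = urlencode({
        "details": "debug",
        "crn": crn,
        "entity_type": entity_type,
        "record_type": record_type,
    })
    return f"{base_url}/v1/compare/?{query}"

url = build_compare_url("https://example.invalid", "crn:example", "person_entity", "person")
```

The urlencode function percent-encodes reserved characters such as the colons in a CRN, so the value can be passed as is.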

Source-level thresholds and pair reviews

Source-level thresholds can be overwritten if you accept the tuning recommendations generated by pair reviews. If your organization uses, or intends to use, the IBM Match 360 pair review capability to generate intelligent tuning recommendations, it is best to complete the pair review tasks before defining your source-level thresholds.

If you have already defined source-level thresholds in custom matching algorithms, disable the source-level threshold feature by editing the IBM Match 360 CR (mdm-cr). Use the following command to disable source-level thresholds in the CR:

oc patch mdm mdm-cr --type=merge -p '{"spec": {"mdm_matching": {"features": {"source_level_thresholds": {"enabled": false}}}}}'

It can take 20-30 minutes for the CR to reconcile itself after you make a change. The mdm-matching service pods must also be restarted to apply the updated configuration. If necessary, these pods must be restarted manually.

To re-enable source-level thresholds, run the following command:

oc patch mdm mdm-cr --type=merge -p '{"spec": {"mdm_matching": {"features": {"source_level_thresholds": {"enabled": true}}}}}'
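If you script these commands, the merge patch can be generated rather than typed by hand, which avoids quoting mistakes in the nested JSON. The following sketch only builds the argument list and prints the equivalent shell command. It does not run oc. The build_patch_command helper is hypothetical.

```python
import json
import shlex

def build_patch_command(enabled):
    """Build the oc patch argument list that turns source-level thresholds on or off."""
    patch = {"spec": {"mdm_matching": {"features": {
        "source_level_thresholds": {"enabled": enabled}}}}}
    return ["oc", "patch", "mdm", "mdm-cr", "--type=merge", "-p", json.dumps(patch)]

command = build_patch_command(False)
print(shlex.join(command))
```

The argument list can be passed to subprocess.run on a machine where the oc CLI is installed and logged in to the cluster.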
