Metadata enrichment default settings (Watson Knowledge Catalog)

To achieve useful metadata enrichment results, configure default settings for all metadata enrichment assets in a project. Default settings also help ensure consistent use of enrichment options.

Changes to these default settings apply to new metadata enrichment assets and to enrichment jobs that run after the settings are changed.

Required permissions
To configure metadata enrichment default settings, you must have the Admin role in the project. Any project collaborator can view the settings.

You can access the default settings from within an existing metadata enrichment asset. Edit the settings as required and save your changes when you're done. Clear changes discards all changes that you made since you last saved.

Profiling and term assignment

Set thresholds for profiling and business term assignment, select the methods for term assignment, and preselect categories. At any time, you can restore the default for any threshold setting that you changed.

Nullability

Data fields in a column or a flat file are nullable if they are allowed to have no value.

Null threshold: Determines whether a column or flat file field allows null values. If the field contains empty values, the percentage of empty values found is compared to the threshold. If that percentage is equal to or greater than the threshold, the field allows null values. If the field contains no null values or the percentage is less than the threshold, the field must have a value. The default setting is 5%.
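The comparison described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the product's implementation: the function name `is_nullable` is hypothetical, and treating both `None` and empty strings as "no value" is an assumption.

```python
from typing import Optional, Sequence


def is_nullable(values: Sequence[Optional[str]], null_threshold: float = 5.0) -> bool:
    """Return True if the field is inferred to allow null values.

    Hypothetical helper; null_threshold is a percentage (default 5%).
    """
    if not values:
        # Assumption: with no data, nothing can be inferred.
        return False
    # Assumption: None and empty strings both count as "no value".
    null_count = sum(1 for v in values if v is None or v == "")
    if null_count == 0:
        # No null values exist, so the field must have a value.
        return False
    null_pct = 100.0 * null_count / len(values)
    # Equal to or greater than the threshold means nullable.
    return null_pct >= null_threshold
```

For example, a field with 5 empty values out of 100 reaches the default 5% threshold and is treated as nullable, while a field with 4 empty values out of 100 is not.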

Cardinality

The cardinality of a column can be unique, constant, or not constrained. The percentage of distinct values and the frequency percentage of the most frequent value are compared to the respective thresholds. The cardinality type is unique or constant if the respective percentage is equal to or greater than its threshold. Otherwise, it’s not constrained.

Uniqueness threshold: Determines whether a data field contains unique values. A column or flat file field is considered unique if its percentage of distinct values is equal to or greater than the threshold that you set. The default is 95%.

Constant threshold: Determines whether a column or flat file field contains constant values. A field is considered constant if it has a single distinct value with a frequency percentage equal to or greater than the threshold that you set. The default is 99%.
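As a rough sketch of how the two cardinality thresholds interact, the following Python function classifies a list of values. The name `cardinality_type` is hypothetical, and checking uniqueness before constancy is an assumption; the text does not say which check takes precedence when both could apply (for example, a single-row column).

```python
from collections import Counter
from typing import Hashable, Sequence


def cardinality_type(
    values: Sequence[Hashable],
    uniqueness_threshold: float = 95.0,
    constant_threshold: float = 99.0,
) -> str:
    """Classify values as 'unique', 'constant', or 'not constrained'.

    Hypothetical helper; both thresholds are percentages.
    """
    if not values:
        # Assumption: an empty field has no cardinality constraint.
        return "not constrained"
    counts = Counter(values)
    total = len(values)
    # Percentage of distinct values in the field.
    distinct_pct = 100.0 * len(counts) / total
    # Frequency percentage of the single most frequent value.
    most_frequent_pct = 100.0 * counts.most_common(1)[0][1] / total
    # Assumption: uniqueness is checked first.
    if distinct_pct >= uniqueness_threshold:
        return "unique"
    if most_frequent_pct >= constant_threshold:
        return "constant"
    return "not constrained"
```

With the default thresholds, 100 distinct values out of 100 yield `unique`, a value that fills 99 of 100 rows yields `constant`, and anything in between is `not constrained`.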

Data class assignment

Data classes that are included in the metadata enrichment are assigned to a column during profiling. The thresholds determine the minimum confidence level for a data class to be assigned or suggested. The assignment threshold should be higher than the suggestion threshold.

Assignment threshold: Determines the minimum percentage of values for which the data class must match the criteria to be automatically assigned to a column. The default setting is 75%. This setting can be overridden by a threshold defined directly on the data class.

The following predefined data classes have a default threshold set:

  • City (50%)
  • Person Name (50%)
  • First Name (50%)
  • Middle Name (50%)
  • Last Name (50%)
  • Organization Name (60%)

See Adding data matching to data classes.

Suggestion threshold: Determines the minimum percentage of values for which the data class must match the criteria to be suggested for a column. The default setting is 25%.
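The interplay between the project-wide thresholds and a threshold set on a data class can be illustrated with a small Python sketch. The function `data_class_outcome` and the dictionary `CLASS_THRESHOLDS` are hypothetical names. The sketch assumes that a class-level threshold replaces only the assignment threshold, and that `match_pct` is the percentage of column values that match the data class.

```python
from typing import Dict, Optional

# Thresholds of the predefined data classes listed in the text (percent).
CLASS_THRESHOLDS: Dict[str, float] = {
    "City": 50.0,
    "Person Name": 50.0,
    "First Name": 50.0,
    "Middle Name": 50.0,
    "Last Name": 50.0,
    "Organization Name": 60.0,
}


def data_class_outcome(
    data_class: str,
    match_pct: float,
    assignment_threshold: float = 75.0,
    suggestion_threshold: float = 25.0,
    class_thresholds: Optional[Dict[str, float]] = None,
) -> str:
    """Return 'assigned', 'suggested', or 'none' for one data class.

    Hypothetical helper, not a product API.
    """
    overrides = CLASS_THRESHOLDS if class_thresholds is None else class_thresholds
    # Assumption: a threshold defined on the data class overrides the
    # project-wide assignment threshold only.
    effective_assignment = overrides.get(data_class, assignment_threshold)
    if match_pct >= effective_assignment:
        return "assigned"
    if match_pct >= suggestion_threshold:
        return "suggested"
    return "none"
```

Under these assumptions, a column where 55% of the values match City is assigned the class because of the 50% class-level threshold, whereas the same match percentage for a data class without its own threshold only produces a suggestion.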

Term assignment

Business terms that are included in the metadata enrichment (through category selection) can be automatically assigned to or suggested for a column. The thresholds determine the minimum confidence level for a term to be assigned or suggested. The assignment threshold should be higher than the suggestion threshold.

Assignment threshold: Determines the percentage of matching values that must be exceeded for a term to be automatically assigned to a column. The default setting is 90%.

Suggestion threshold: Determines the percentage of matching values that must be exceeded for a term to be suggested for a column. The default setting is 75%.

Determine which term assignment methods are used in the project to generate assignments and suggestions. Assignments and suggestions are made based on the highest confidence score that one of the methods returns. Select at least one of these methods:

  • Machine learning: Supervised machine learning models are used to assign terms. Project-specific models are trained with any published business terms and any available term assignments or removals on columns that were marked as reviewed in the project. Global models are trained with any published business terms and any term assignments available in the default catalog.
  • Data-class-based assignments: Terms are assigned based on the data class assignment for a column. Appropriate linkage between data classes and terms is a prerequisite for quality results here.
  • Linguistic name matching: Terms are assigned based on the similarity between a term and the name of the asset or column.

Use individual methods for testing and evaluating term assignments, for example, when you have a large set of custom data classes. This way, you can also find out the proper threshold settings for your project.
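To make the "highest confidence score wins" rule concrete, here is a minimal Python sketch. The function `term_outcome` and the method identifiers in `METHODS` are hypothetical, and the sketch assumes that confidence scores are expressed as percentages. Because the thresholds are described as values that must be exceeded, it uses a strict comparison.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical identifiers for the three methods described in the text.
METHODS = ("machine_learning", "data_class_based", "linguistic_name_matching")


def term_outcome(
    scores: Dict[str, float],
    enabled_methods: Iterable[str],
    assignment_threshold: float = 90.0,
    suggestion_threshold: float = 75.0,
) -> Tuple[str, Optional[str]]:
    """Return (outcome, method) for one term and column.

    scores maps a method name to its confidence score in percent.
    """
    enabled = list(enabled_methods)
    if not enabled:
        raise ValueError("Select at least one term assignment method")
    # Only scores from enabled methods are considered.
    candidates = {m: scores[m] for m in enabled if m in scores}
    if not candidates:
        return ("none", None)
    # The highest confidence score across the enabled methods decides.
    best_method = max(candidates, key=candidates.get)
    best_score = candidates[best_method]
    # Strict comparison: the threshold must be exceeded.
    if best_score > assignment_threshold:
        return ("assigned", best_method)
    if best_score > suggestion_threshold:
        return ("suggested", best_method)
    return ("none", None)
```

Passing a single method in `enabled_methods` mirrors the testing approach described above: with scores of 80 for machine learning and 92 for data-class-based assignment, enabling all methods assigns the term, while enabling only machine learning produces a suggestion.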

For more information, see Automatic term assignment.

Categories

You can limit the set of categories that users can select from to the ones that align with the purpose of the project. Preselect categories that are relevant for the project. The selected categories determine the business terms and data classes to be used for profiling and automatic term assignment. This selection does not limit users' options when assigning data classes or terms manually. For manual assignments, users can pick data classes or business terms from any category they have access to.

Important: The categories to choose from are limited to the categories to which the administrator has access. That might result in different category sets for different administrators.

Quality analysis

Data quality threshold: Determines the minimum data quality score that an asset must have to be considered of sufficient or good quality.

Learn more

Parent topic: Enriching your data assets