To achieve useful metadata enrichment results, configure default settings for all metadata enrichments in a project. Default settings also help ensure consistent use of enrichment options.
Changes to the threshold settings or the selected term assignment methods are applied to new metadata enrichments and to enrichment jobs that run after the settings changed. Changes to the set of categories are applied to new enrichments only.
- Required permissions
- To configure metadata enrichment default settings, you must have the Admin role in the project. Any project collaborator can view the settings.
You can access the default settings in one of these ways:
- Within an existing metadata enrichment asset, click Default settings.
- On the project's Manage page, go to Tools > Metadata enrichment.
Edit settings as required. Your changes are autosaved. For some settings, you can restore the system-defined default values at any time.
Configure default settings for these features:
- Profiling and term assignment
- Advanced profiling settings
- Basic quality analysis
- Data quality output
- Key relationship analysis
You can also create, update, or retrieve enrichment settings with APIs instead of the user interface. The links to the APIs are listed in the Learn more section.
Profiling and term assignment
Set thresholds for profiling and business term assignment, select the methods for term assignment, and preselect categories. At any time, you can restore the default for any threshold setting that you changed.
Nullability
Data fields in a column or a flat file are nullable if they are allowed to have no value.
- Null threshold
- Determines whether a column or flat file field allows null values. If a column or flat file has fields without values, the percentage of the empty fields found is compared to the set threshold. If it's equal to or greater than the nullability threshold, the field allows null values. If null values do not exist in the data field or the frequency percentage is less than the threshold, the data field must have a value. The default setting is 5%.
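The null threshold comparison can be sketched as follows. This is only an illustration of the rule described above, not the product's implementation: the function name `is_nullable` and the choice to treat Python `None` as the only "no value" marker are assumptions.

```python
def is_nullable(values, null_threshold_pct=5.0):
    """Return True if the share of missing values reaches the null threshold.

    Hypothetical helper: only None counts as "no value" in this sketch.
    """
    if not values:
        return False
    null_count = sum(1 for v in values if v is None)
    null_pct = 100.0 * null_count / len(values)
    # No nulls at all, or a percentage below the threshold,
    # means the field must have a value.
    return null_count > 0 and null_pct >= null_threshold_pct
```

With the default of 5%, a field with 1 missing value in 20 counts as nullable, while 1 missing value in 100 does not.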
Cardinality
The cardinality of a column can be unique, constant, or not constrained. The percentage of distinct values and the percentage of the most frequent value found are compared to the set thresholds. The cardinality type is unique or constant if the respective percentage is equal to or greater than the threshold percentage. Otherwise, it’s not constrained.
- Uniqueness threshold
- Determines whether a data field contains unique values. A column or flat file field is considered unique if its percentage of distinct values is equal to or greater than the threshold that you set. The default is 95%.
- Constant threshold
- Determines whether a column or flat file field contains constant values. A field is considered constant if it has a single distinct value with a frequency percentage equal to or greater than the constant threshold that you set. The default is 99%.
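As a rough illustration of how the two cardinality thresholds interact, the following sketch classifies a list of values. The function name `cardinality_type` is hypothetical, and checking uniqueness before constancy is an assumption, because the order of the checks is not stated here.

```python
from collections import Counter


def cardinality_type(values, uniqueness_pct=95.0, constant_pct=99.0):
    """Classify values as 'unique', 'constant', or 'not constrained'."""
    if not values:
        return "not constrained"
    counts = Counter(values)
    total = len(values)
    # Share of distinct values among all values.
    distinct_pct = 100.0 * len(counts) / total
    # Share of the single most frequent value.
    top_pct = 100.0 * counts.most_common(1)[0][1] / total
    if distinct_pct >= uniqueness_pct:  # assumed to be checked first
        return "unique"
    if top_pct >= constant_pct:
        return "constant"
    return "not constrained"
```

For example, 100 different values yield `unique`, 99 identical values plus one outlier yield `constant`, and an even mix of two values yields `not constrained`.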
Data class assignment
Data classes that are included in the metadata enrichment are automatically assigned to a column solely during profiling. Term assignments do not have an impact on data class assignments. The thresholds determine the minimum confidence level for a data class to be assigned or suggested. The assignment threshold should be higher than the suggestion threshold.
Related classifications can also be automatically assigned for automatically assigned data classes.
You can control this behavior by enabling or disabling the classification assignment option for data classes. See Classification assignment.
- Assignment threshold
- Determines the minimum percentage of values for which the data class must match the criteria to be automatically assigned to a column. The default setting is 75%. This setting can be overridden by a threshold defined directly on the data class.
The following predefined data classes have a default threshold set:
- City (50%)
- Person Name (50%)
- First Name (50%)
- Middle Name (50%)
- Last Name (50%)
- Organization Name (60%)
- Suggestion threshold
- Determines the minimum percentage of values for which the data class must match the criteria to be suggested for a column. The default setting is 25%.
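The interplay between the project-wide thresholds and the thresholds set on individual data classes might look like the following sketch. The function and constant names are hypothetical, and "Email Address" in the usage note is only an example of a class without its own threshold. The override values are the predefined ones listed above.

```python
DEFAULT_ASSIGNMENT_PCT = 75.0
DEFAULT_SUGGESTION_PCT = 25.0

# Thresholds defined directly on predefined data classes (from the list above).
CLASS_ASSIGNMENT_OVERRIDES = {
    "City": 50.0,
    "Person Name": 50.0,
    "First Name": 50.0,
    "Middle Name": 50.0,
    "Last Name": 50.0,
    "Organization Name": 60.0,
}


def data_class_outcome(data_class, match_pct):
    """Return 'assigned', 'suggested', or 'none' for a data class match."""
    # A threshold on the data class takes precedence over the default.
    assign_at = CLASS_ASSIGNMENT_OVERRIDES.get(data_class, DEFAULT_ASSIGNMENT_PCT)
    if match_pct >= assign_at:
        return "assigned"
    if match_pct >= DEFAULT_SUGGESTION_PCT:
        return "suggested"
    return "none"
```

Under these assumptions, a 55% match is enough to assign City, but only enough to suggest a class such as Email Address that relies on the 75% default.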
Primary keys
A primary key can consist of one or more columns and uniquely identifies each record in a table. Each table can have only one primary key.
- Suggestion threshold
- Defines the minimum confidence for a column or a combination of columns to be suggested as primary key. The default setting is 80%.
Display name
Based on a built-in glossary and on existing business term abbreviations in the categories selected for enrichment, fuzzy matching is used to produce semantic names for data assets and the columns that they contain as alternative names that are more descriptive than the source names. These alternative names can be automatically assigned or suggested. The thresholds determine the minimum confidence level for a semantic name to be assigned or suggested as display name. The assignment threshold should be higher than the suggestion threshold.
- Assignment threshold
- Determines the confidence that must be exceeded for a display name to be automatically assigned to a data asset or column. The default setting is 90%.
- Suggestion threshold
- Determines the confidence that must be exceeded for a display name to be suggested for a data asset or column. The default setting is 75%.
AI-generated description
Generative AI can produce descriptions for entire data assets and for the columns that a data asset contains. A granite.8b model considers the context of assets and columns to provide meaningful descriptions. These descriptions can be automatically assigned or suggested. The thresholds determine the minimum confidence level for a description to be assigned or suggested. The assignment threshold should be higher than the suggestion threshold.
- Assignment threshold
- Determines the confidence that must be exceeded for a generated description to be automatically assigned to a data asset or column. The default setting is 100%.
- Suggestion threshold
- Determines the confidence that must be exceeded for a generated description to be suggested for a data asset or column. The default setting is 75%.
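The following sketch shows one reading of these two thresholds. It assumes that "must be exceeded" means a strict greater-than comparison. The function name `description_outcome` is hypothetical, and rejecting a configuration where the assignment threshold is not higher than the suggestion threshold is a design choice of this sketch, not documented behavior.

```python
def description_outcome(confidence_pct, assignment_pct=100.0, suggestion_pct=75.0):
    """Return 'assigned', 'suggested', or 'none' for a generated description."""
    if assignment_pct <= suggestion_pct:
        # Design choice of this sketch; the text only says "should be higher".
        raise ValueError("assignment threshold should exceed suggestion threshold")
    if confidence_pct > assignment_pct:  # strict comparison is an assumption
        return "assigned"
    if confidence_pct > suggestion_pct:
        return "suggested"
    return "none"
```

If the comparison is indeed strict, the default assignment threshold of 100% means that generated descriptions are only ever suggested until you lower that threshold.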
Term assignment
Business terms that are included in the metadata enrichment (through category selection) can be automatically assigned to or suggested for a column. The thresholds determine the minimum confidence level for a term to be assigned or suggested. The assignment threshold should be higher than the suggestion threshold. Note that term assignments do not affect data class assignments. If a term that is associated with a data class is assigned to a column by an ML model or through name matching, the related data class is not automatically assigned as well.
Related classifications can also be automatically assigned for automatically assigned terms.
You can control this behavior by enabling or disabling the classification assignment option for terms. See Classification assignment.
- Assignment threshold
- Determines the percentage of matching values that must be exceeded for a term to be automatically assigned to a data asset or column. The default setting is 90%.
- Suggestion threshold
- Determines the percentage of matching values that must be exceeded for a term to be suggested for a data asset or column. The default setting is 75%.
Tip: If semantic term assignment is selected as one of the term-assignment methods, consider lowering this threshold to a value in the range 65%-70%. Otherwise, terms returned by this method might not be considered for term assignment because the confidence scores are usually lower than the scores for the other methods.
Determine which term assignment method is used in the project to generate assignments and suggestions. Assignments and suggestions are made based on the highest confidence score that one of the methods returns. Select at least one of these methods:
- Machine learning: A machine learning model is used to assign terms. You can define for each project whether this model is trained with assets from the project or with assets from a catalog of your choice.
- Data-class-based assignments: Terms are assigned based on the data class assignment for a column. Appropriate linkage between data classes and terms is a prerequisite for quality results here.
- Name matching: Terms are assigned based on the similarity between a term and the name of the asset or column.
- Semantic term assignment: Domain-specific business terms are assigned and suggested by using the slate.30m.semantic-automation.c2c model. The model takes into account names and descriptions of assets and columns, and semantically matches terms with that metadata. Thus, terms can be assigned even if they aren't exact matches.
Tip: The confidence scores for this method are usually lower than those for the other methods. Therefore, lower the suggestion threshold to a value in the range 65%-70% to have terms that are returned by the semantic term-assignment method considered for term assignment.
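To illustrate how the highest confidence score across the selected methods drives the outcome, consider this sketch. The function name, the method labels, the term names, and the scores are all made up for the example. The strict comparison follows the "must be exceeded" wording of the thresholds, and the adjustment for previous term rejections is not modeled.

```python
def rank_terms(scores_by_method, assignment_pct=90.0, suggestion_pct=75.0):
    """Map each term to (status, best confidence, method that returned it)."""
    best = {}
    for method, scores in scores_by_method.items():
        for term, confidence in scores.items():
            # Keep only the highest score that any method returned for a term.
            if term not in best or confidence > best[term][0]:
                best[term] = (confidence, method)
    outcome = {}
    for term, (confidence, method) in best.items():
        if confidence > assignment_pct:
            outcome[term] = ("assigned", confidence, method)
        elif confidence > suggestion_pct:
            outcome[term] = ("suggested", confidence, method)
    return outcome


# Hypothetical scores from two methods.
scores = {
    "name_matching": {"Customer ID": 93.0},
    "semantic": {"Customer ID": 70.0, "Client Segment": 68.0},
}
```

With the default thresholds, `rank_terms(scores)` keeps only Customer ID. Calling `rank_terms(scores, suggestion_pct=65.0)` also surfaces Client Segment as a suggestion, which is the effect that the tip about lowering the suggestion threshold describes.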
By default, the confidence scores that are returned by the selected term assignment methods are adjusted based on previous term rejections, which affects the overall confidence score.
If you don't want term rejections to affect the confidence score, you can disable this option.
You can enable or disable the option regardless of which term assignment methods you select. The training scope that you set applies to the model for term assignment and to the model for adjusting the confidence score.
Use individual methods for testing and evaluating term assignments, for example, when you have a large set of custom data classes. This way, you can also find out the proper threshold settings for your project.
For more information, see Automatic term assignment.
Classification assignment
Determine whether classifications are also assigned when a related data class or term is automatically assigned to a data asset or a column. You can configure this individually for data classes and terms.
For projects that were created before 23 August 2024, automatic classification assignment is disabled by default.
Categories
You can limit the set of categories from which users can select when they create new metadata enrichments to the categories that align with the purpose of the project. Note that this selection does not determine which categories are actually used in a metadata enrichment. Preselect categories that are relevant for the project. The selected categories determine the business terms and data classes that can be used for profiling and automatic term assignment. This selection does not limit users' options when assigning data classes or terms manually. For manual assignments, users can pick data classes or business terms from any category they have access to.
Any changes to this set are reflected in new metadata enrichments and when you edit an existing metadata enrichment.
Advanced profiling settings
These settings apply to advanced data profiling if a user enables the External output option and can be overwritten for each individual run.
Determine whether all distinct values or a maximum number of the most frequent distinct values are captured for each column. The default setting is to capture the 1,000 most frequent distinct values.
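The difference between capturing all distinct values and capturing only the most frequent ones can be sketched with the standard library. The function name `capture_distinct_values` is hypothetical, and the sketch says nothing about how the product stores the captured values.

```python
from collections import Counter


def capture_distinct_values(values, max_values=1000):
    """Return (value, frequency) pairs, most frequent first.

    Pass max_values=None to capture all distinct values.
    """
    return Counter(values).most_common(max_values)
```

For example, `capture_distinct_values(["a", "a", "b", "c", "c", "c"], max_values=2)` returns the two most frequent values, `c` and `a`, with their counts.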
Set the default output location for storing the captured values:
- Select a connection.
- Depending on the selected connection, select a schema and a table, or select a catalog, a schema, and a table. You can select from existing catalogs, schemas, and tables. You can also create a new table in an existing schema.
For information about which data sources are supported as output targets, see the Output tables column in Supported data sources. Schema and table names must follow this convention:
- The first character for the name must be an alphabetic character.
- The rest of the name can consist of alphabetic characters, numeric characters, or underscores.
- The name must not contain spaces.
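If you generate schema or table names in a script, you might check them against this convention before you enter them. The following is a minimal sketch. The function name `is_valid_output_name` is hypothetical, and restricting "alphabetic" to ASCII letters is an assumption.

```python
import re

# First character alphabetic, then letters, digits, or underscores only.
# ASCII-only letters are an assumption of this sketch.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_output_name(name):
    """Return True if the name follows the schema and table naming convention."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

Under this reading, `DQ_OUTPUT1` passes, whereas `1table`, `my table`, and `_staging` do not.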
Basic quality analysis
Set the data quality threshold and select the data quality checks to apply when users run quality analysis as part of metadata enrichment.
- Data quality threshold
- Determines the minimum required data quality score for an asset to be of sufficient or good quality. Data quality scores that are below the specified threshold are marked with a red dot in the enrichment results. Data quality scores that are equal to or exceed the specified threshold are marked green.
- Data quality checks
- Select the predefined data quality checks that you want to apply when you run quality analysis as part of metadata enrichment. Select at least one check. Each run of a metadata enrichment that is configured with the Run basic data quality analysis option contributes to the data quality dimension scores that are tied to the selected checks. For more information, see Predefined data quality checks.
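The marking rule for the data quality threshold is simple enough to state in code. In this sketch, the function name `quality_marker` is hypothetical, and the threshold is a required argument because no default value is given here.

```python
def quality_marker(score_pct, threshold_pct):
    """Return the marker color for a data quality score."""
    # Scores equal to or above the threshold are green; lower scores are red.
    return "green" if score_pct >= threshold_pct else "red"
```

For a threshold of 80%, a score of exactly 80% is marked green and a score of 79.9% is marked red.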
Data quality output
Set the default output location for storing data quality exceptions and determine the maximum number of exception records per data quality check. Writing data quality exceptions to a database table must be enabled in the metadata enrichment asset.
- Maximum number of exception output records
- Determines the maximum number of issues per column that are written to the output table for each data quality check. The default setting is 100.
- Output location
- Set the default output tables for storing data quality exceptions:
- Select a connection.
- Depending on the selected connection, select a schema and a table, or select a catalog, a schema, and a table for storing the exceptions.
- Optionally, select a table for storing the entire rows in which the issues were found (exception records). You can select an existing table from the schema where the exceptions table is created or create a new table in that schema.
You can select from existing schemas and tables or create new tables in an existing schema. For information about which data sources are supported as output targets, see the Output tables column in Supported data sources. Schema and table names must follow this convention:
- The first character for the name must be an alphabetic character.
- The rest of the name can consist of alphabetic characters, numeric characters, or underscores.
- The name must not contain spaces.
To create a new table for the output, enter a name instead of selecting from the available tables. Note that the table name must not contain any special characters.
For storing only the quality issues, a new table is created with the following column definitions:
asset_id VARCHAR(40), issue_type VARCHAR(64), column1 VARCHAR(128), value1 VARCHAR(64), column2 VARCHAR(128), value2 VARCHAR(64)
For storing the quality issues and the exception records, a new table for the quality issues is created with these column definitions:
asset_id VARCHAR(40), issue_type VARCHAR(64), column VARCHAR(128), row_id VARCHAR(64)
A new table for storing the exception records is created with these column definitions:
asset_id VARCHAR(40), row_id VARCHAR(64), row_data CLOB
If you select an existing table for either type of output, the selected table must have the appropriate structure for the intended output.
If the connection is locked, you are asked to enter your personal credentials. This is a one-time step that permanently unlocks the connection for you.
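If you prefer to prepare the output tables yourself, their structure must match the column definitions listed above. The following sketch creates all three variants in an in-memory SQLite database purely for illustration. SQLite is only a stand-in for your actual output target, and the table names are hypothetical. Because `column` is a reserved word in SQLite, it is quoted here. Quoting rules and type support vary by database.

```python
import sqlite3

# Hypothetical table names; column definitions follow the lists above.
DDL_STATEMENTS = [
    "CREATE TABLE dq_issues_only (asset_id VARCHAR(40), issue_type VARCHAR(64), "
    "column1 VARCHAR(128), value1 VARCHAR(64), "
    "column2 VARCHAR(128), value2 VARCHAR(64))",
    'CREATE TABLE dq_issues (asset_id VARCHAR(40), issue_type VARCHAR(64), '
    '"column" VARCHAR(128), row_id VARCHAR(64))',
    "CREATE TABLE dq_exception_records (asset_id VARCHAR(40), "
    "row_id VARCHAR(64), row_data CLOB)",
]


def create_output_tables(conn):
    """Create the three output table variants on the given connection."""
    for ddl in DDL_STATEMENTS:
        conn.execute(ddl)


def column_names(conn, table):
    """List the column names of a table by using SQLite's table_info pragma."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
create_output_tables(conn)
```

Calling `column_names(conn, "dq_issues")` afterward lets you confirm that a table has the expected columns before you select it as an output location.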
Key relationships
A key relationship consists of a primary key and a foreign key, and defines a relationship between two data assets in a relational database.
- Suggestion threshold
- Defines the minimum required confidence for relationships between primary and foreign keys to be suggested. The default setting is 80%.
This threshold is applied when you run a basic key relationship analysis; it is not applied to in-depth key relationship analysis or overlap analysis. You can set suggestion thresholds for these types of analysis for each individual run. See Identifying relationships.
To have relationships automatically assigned, select the Automatically assign option and set an assignment threshold.
- Assignment threshold
- Defines the minimum required confidence for relationships between primary and foreign keys to be automatically assigned. The default setting is 90%.
When a key relationship is automatically assigned, the corresponding primary key in a parent asset is also assigned automatically. However, a data asset can't have more than one primary key assigned. Therefore, only one relationship can be assigned if multiple key relationships with different primary keys are detected for an asset. The relationship candidate with the highest confidence score is assigned. This confidence score is calculated based on the confidence score of the primary key analysis. If all relationship candidates have the same confidence score, none of them is assigned.
These settings are applied when you run a basic key relationship analysis. They are not applied to in-depth key relationship analysis or overlap analysis. For these types of analysis, you can enable automatic assignment of relationships and set an assignment threshold for each individual run. See Identifying relationships.
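The rule for choosing among relationship candidates with different primary keys can be sketched as follows. The function name and the candidate tuples are hypothetical. Filtering by the assignment threshold before comparing candidates is an assumption. The text above covers only the case where all candidates have the same score, so treating any tie for the top score as "nothing assigned" is a conservative assumption of this sketch.

```python
def pick_relationship(candidates, assignment_pct=90.0):
    """Pick the one relationship candidate to assign for an asset, or None.

    candidates: list of (primary_key_name, confidence_pct) tuples, each
    with a different primary key for the same asset.
    """
    eligible = sorted(
        (c for c in candidates if c[1] >= assignment_pct),
        key=lambda c: c[1],
        reverse=True,
    )
    if not eligible:
        return None
    # A tie for the highest score leaves no clear winner (assumption for
    # partial ties; documented only for the case where all scores are equal).
    if len(eligible) > 1 and eligible[0][1] == eligible[1][1]:
        return None
    return eligible[0]
```

For example, candidates at 95% and 92% result in the 95% candidate being assigned, whereas two candidates at 95% result in no automatic assignment.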
Learn more
- Adding data matching to data classes
- Automatic term assignment
- Identifying primary keys
- Identifying relationships
- Adding a custom service for automatic term assignment
- IBM Knowledge Catalog API: Create or update metadata enrichment settings
- IBM Knowledge Catalog API: Retrieve metadata enrichment settings