Metadata enrichment default settings
To achieve useful metadata enrichment results, configure default settings for all metadata enrichments in a project. Default settings also help ensure consistent use of enrichment options.
Changes to the threshold settings or the selected term assignment methods apply to new metadata enrichments and to enrichment jobs that run after the settings are changed. Changes to the set of categories apply to new enrichments only.
- Required permissions
- To configure metadata enrichment default settings, you must have the Admin role in the project. Any project collaborator can view the settings.
You can access the default settings in one of these ways:
- Within an existing metadata enrichment asset, click Default settings.
- On the project's Manage page, go to Tools > Metadata enrichment.
Edit settings as required. Your changes are autosaved. For some settings, you can restore the system-defined default values at any time.
Configure default settings for these features:
- Profiling and term assignment
- Basic quality analysis
- Data quality output
You can also create, update, or retrieve enrichment settings with APIs instead of the user interface. The links to the APIs are listed in the Learn more section.
Profiling and term assignment
Set thresholds for profiling and business term assignment, select the methods for term assignment, and preselect categories. At any time, you can restore the default for any threshold setting that you changed.
Data fields in a column or a flat file are nullable if they are allowed to have no value.
- Null threshold
- Determines whether a column or flat file field allows null values. If a column or flat file field contains empty values, the percentage of empty values found is compared to the threshold that you set. If that percentage is equal to or greater than the null threshold, the field allows null values. If the field contains no null values or the percentage is less than the threshold, the field must have a value. The default setting is 5%.
The cardinality of a column can be unique, constant, or not constrained. The percentage of distinct values and the percentage of the most frequent value found are compared to the set thresholds. The cardinality type is unique or constant if the respective percentage is equal to or greater than the threshold percentage. Otherwise, it is not constrained.
- Uniqueness threshold
- Determines whether a column or flat file field contains unique values. A field is considered unique if its percentage of distinct values is equal to or greater than the threshold that you set. The default setting is 95%.
- Constant threshold
- Determines whether a column or flat file field contains constant values. A field is considered constant if a single distinct value occurs with a frequency percentage equal to or greater than the threshold that you set. The default setting is 99%.
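The threshold comparisons described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the product's implementation: the function and constant names are hypothetical, null values are counted like any other value when computing cardinality, and the order in which the constant and unique checks run is an assumption.

```python
from collections import Counter
from typing import Optional, Sequence

# Default thresholds quoted in the text, in percent. Names are hypothetical.
NULL_THRESHOLD = 5.0
UNIQUENESS_THRESHOLD = 95.0
CONSTANT_THRESHOLD = 99.0


def is_nullable(values: Sequence[Optional[str]],
                threshold: float = NULL_THRESHOLD) -> bool:
    """Return True if nulls exist and their share reaches the threshold."""
    if not values:
        return False
    null_count = sum(1 for v in values if v is None)
    null_pct = 100 * null_count / len(values)
    return null_count > 0 and null_pct >= threshold


def cardinality_type(values: Sequence[Optional[str]],
                     uniqueness_threshold: float = UNIQUENESS_THRESHOLD,
                     constant_threshold: float = CONSTANT_THRESHOLD) -> str:
    """Return 'unique', 'constant', or 'not constrained' for a field."""
    if not values:
        return "not constrained"
    counts = Counter(values)
    distinct_pct = 100 * len(counts) / len(values)
    top_pct = 100 * counts.most_common(1)[0][1] / len(values)
    # Assumption: if both conditions hold (for example, a single row),
    # the constant check wins. The text does not specify the order.
    if top_pct >= constant_threshold:
        return "constant"
    if distinct_pct >= uniqueness_threshold:
        return "unique"
    return "not constrained"
```

For example, a field with 5 empty values out of 100 is treated as nullable at the default threshold, because the comparison is "equal to or greater than".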
Data class assignment
Data classes that are included in the metadata enrichment are automatically assigned to a column solely during profiling. Term assignments do not have an impact on data class assignments. The thresholds determine the minimum confidence level for a data class to be assigned or suggested. The assignment threshold should be higher than the suggestion threshold.
- Assignment threshold
Determines the minimum percentage of values for which the data class must match the criteria to be automatically assigned to a column. The default setting is 75%. This setting can be overridden by a threshold defined directly on the data class.
The following predefined data classes have a default threshold set:
- City (50%)
- Person Name (50%)
- First Name (50%)
- Middle Name (50%)
- Last Name (50%)
- Organization Name (60%)
- Suggestion threshold
Determines the minimum percentage of values for which the data class must match the criteria to be suggested for a column. The default setting is 25%.
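As a rough illustration of how the two thresholds and the per-class overrides interact, consider the following sketch. All names are hypothetical, and it assumes that the class-level thresholds listed above replace only the assignment threshold, while the suggestion threshold stays at the project default.

```python
from typing import Dict, Optional

# Project-level defaults quoted in the text, in percent.
DEFAULT_ASSIGNMENT_THRESHOLD = 75.0
DEFAULT_SUGGESTION_THRESHOLD = 25.0

# Thresholds defined directly on predefined data classes, as listed above.
CLASS_THRESHOLDS: Dict[str, float] = {
    "City": 50.0,
    "Person Name": 50.0,
    "First Name": 50.0,
    "Middle Name": 50.0,
    "Last Name": 50.0,
    "Organization Name": 60.0,
}


def data_class_decision(data_class: str,
                        match_pct: float,
                        assignment: float = DEFAULT_ASSIGNMENT_THRESHOLD,
                        suggestion: float = DEFAULT_SUGGESTION_THRESHOLD,
                        overrides: Optional[Dict[str, float]] = None) -> str:
    """Return 'assigned', 'suggested', or 'none' for one data class."""
    if overrides is None:
        overrides = CLASS_THRESHOLDS
    # Assumption: a threshold defined on the class overrides only the
    # assignment threshold, not the suggestion threshold.
    assign_threshold = overrides.get(data_class, assignment)
    if match_pct >= assign_threshold:
        return "assigned"
    if match_pct >= suggestion:
        return "suggested"
    return "none"
```

With these defaults, a column in which 55% of the values match City would be assigned that data class, whereas the same percentage for a data class without its own threshold would only produce a suggestion.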
Business terms that are included in the metadata enrichment (through category selection) can be automatically assigned to or suggested for a column. The thresholds determine the minimum confidence level for a term to be assigned or suggested. The assignment threshold should be higher than the suggestion threshold. Note that term assignments do not affect data class assignments. If a term that is associated with a data class is assigned to a column by an ML model or through name matching, the related data class is not automatically assigned as well.
- Assignment threshold
- Determines the percentage of matching values that must be exceeded for a term to be automatically assigned to a data asset or column. The default setting is 90%.
- Suggestion threshold
- Determines the percentage of matching values that must be exceeded for a term to be suggested for a data asset or column. The default setting is 75%.
Select which term assignment methods are used in the project to generate assignments and suggestions. Assignments and suggestions are based on the highest confidence score that any of the selected methods returns. Select at least one of these methods:
- Machine learning: A machine learning model is used to assign terms. You can define for each project whether this model is trained with assets from the project or with assets from a catalog of your choice.
- Data-class-based assignments: Terms are assigned based on the data class assignment for a column. Appropriate linkage between data classes and terms is a prerequisite for quality results.
- Linguistic name matching: Terms are assigned based on the similarity between a term and the name of the asset or column.
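The "highest confidence score wins" rule, combined with the term thresholds, might look like the following sketch. The method keys and function name are hypothetical, the scores are assumed to be percentages, and the strict comparison reflects the "must be exceeded" wording of the threshold descriptions. The adjustment for previous term rejections is not modeled here.

```python
from typing import Mapping, Optional, Tuple

# Term thresholds quoted in the text, in percent.
TERM_ASSIGNMENT_THRESHOLD = 90.0
TERM_SUGGESTION_THRESHOLD = 75.0


def term_decision(scores: Mapping[str, float],
                  assignment: float = TERM_ASSIGNMENT_THRESHOLD,
                  suggestion: float = TERM_SUGGESTION_THRESHOLD
                  ) -> Tuple[str, Optional[str]]:
    """Return (decision, winning method) for one term and one column.

    `scores` maps a method name (hypothetical keys such as "ml",
    "data_class", "name_matching") to the confidence it returned.
    """
    if not scores:
        return ("none", None)
    # The method with the highest confidence determines the outcome.
    method, best = max(scores.items(), key=lambda item: item[1])
    # Assumption: thresholds must be strictly exceeded.
    if best > assignment:
        return ("assigned", method)
    if best > suggestion:
        return ("suggested", method)
    return ("none", method)
```

Running a single method at a time through such a function is one way to see how a given threshold setting would affect that method's results.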
By default, the confidence scores that are returned by the selected term assignment methods are adjusted based on previous term rejections, which affects the overall confidence score.
If you don't want term rejections to affect the confidence score, you can disable this option.
You can enable or disable the option regardless of which term assignment methods you select. The training scope that you set applies to the model for term assignment and to the model for adjusting the confidence score.
Use individual methods for testing and evaluating term assignments, for example, when you have a large set of custom data classes. This way, you can also find out the proper threshold settings for your project.
For more information, see Automatic term assignment.
Preselect the categories that are relevant for the project. This limits the set of categories from which users can select when they create new metadata enrichments to the categories that align with the purpose of the project. Note that this selection does not determine which categories are actually used in a metadata enrichment. The selected categories determine the business terms and data classes that can be used for profiling and automatic term assignment. This selection does not limit users' options when assigning data classes or terms manually. For manual assignments, users can pick data classes or business terms from any category they have access to.
Any changes to this set are reflected in new metadata enrichments and when you edit an existing metadata enrichment.
Basic quality analysis
- Data quality threshold
- Determines the minimum required data quality score for an asset to be of sufficient or good quality. Data quality scores that are below the specified threshold are marked with a red dot in the enrichment results. Data quality scores that are equal to or exceed the specified threshold are marked green.
- Data quality checks
- Select the predefined data quality checks that you want to apply when you run quality analysis as part of metadata enrichment. Select at least one check. Each run of a metadata enrichment that is configured with the Run basic data quality analysis option contributes to the data quality dimension scores that are tied to the selected checks. For more information, see Predefined data quality checks.
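The marker rule for the data quality threshold is a simple comparison in which a score equal to the threshold counts as sufficient. A minimal sketch, with a hypothetical function name and assuming the score and the threshold use the same scale:

```python
def quality_marker(score: float, threshold: float) -> str:
    """Return the marker color for a data quality score."""
    # Scores equal to or above the threshold are green; lower scores are red.
    return "green" if score >= threshold else "red"
```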
Data quality output
Set the default output location for storing data quality exceptions and determine the maximum number of exception records per data quality check. Writing data quality exceptions to a database table must be enabled in the metadata enrichment asset.
- Maximum number of exception records
Determines the maximum number of issues per column that are written to the output table for each data quality check. The default setting is 100.
- Output location
Set the default output table for data quality exceptions. Select a connection, a schema, and a table. You can select from existing schemas and tables or create a new table in an existing schema. For information about which data sources are supported as output targets, see the Output tables column in Supported data sources. Schema and table names must follow this convention:
- The first character for the name must be an alphabetic character.
- The rest of the name can consist of alphabetic characters, numeric characters, or underscores.
- The name must not contain spaces.
To create a new table for the output, enter a name instead of selecting from the available tables. Note that the table name must not contain any special characters. A new table is created with the following column definitions:
asset_id VARCHAR(40), issue_type VARCHAR(64), column1 VARCHAR(128), value1 VARCHAR(64), column2 VARCHAR(128), value2 VARCHAR(64)
If you select an existing table, this table must have the same structure.
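To check a schema or table name against the convention above before entering it, or to prepare an existing table with the expected structure, a sketch like the following can help. The helper names are hypothetical, "alphabetic" is assumed to mean ASCII letters, and the generated statement is generic SQL that might need adjusting for your data source.

```python
import re
from typing import List, Tuple

# Assumption: "alphabetic" means ASCII letters A-Z and a-z.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Column definitions quoted in the text above.
EXCEPTION_COLUMNS: List[Tuple[str, str]] = [
    ("asset_id", "VARCHAR(40)"),
    ("issue_type", "VARCHAR(64)"),
    ("column1", "VARCHAR(128)"),
    ("value1", "VARCHAR(64)"),
    ("column2", "VARCHAR(128)"),
    ("value2", "VARCHAR(64)"),
]


def is_valid_name(name: str) -> bool:
    """Check a schema or table name against the naming convention."""
    return NAME_PATTERN.fullmatch(name) is not None


def create_table_ddl(schema: str, table: str) -> str:
    """Build a generic CREATE TABLE statement for the exception table."""
    for name in (schema, table):
        if not is_valid_name(name):
            raise ValueError(f"Invalid name: {name!r}")
    columns = ", ".join(f"{col} {sql_type}" for col, sql_type in EXCEPTION_COLUMNS)
    return f"CREATE TABLE {schema}.{table} ({columns})"
```

For example, `is_valid_name("dq_exceptions_1")` passes, while names such as `1_exceptions` or `dq exceptions` are rejected because they start with a digit or contain a space.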