Creating a metadata enrichment asset
Configure and run metadata enrichment to add descriptive information to your data assets.
You can add several layers of metadata to a data asset:
- Profile the data to classify it and compile statistics about the values.
- Run predefined data quality checks for an initial quality assessment.
- Enrich assets with business vocabulary that describes the semantic meaning of the data for your organization.
Required permissions
To create and run a metadata enrichment, you must have the Admin or the Editor role in the project, and you must have at least view access to the categories that you want to use in the enrichment. Also, you must be authorized to access the connections to the data sources of the data assets to be enriched.
You can also create metadata enrichments with APIs instead of the user interface. The links to these APIs are listed in the Learn more section.
To create a metadata enrichment asset and a job for enriching data:
- Open a project and click New asset > Metadata Enrichment. After you create the first metadata enrichment in this way, you can add new metadata enrichment assets from the project's Asset page.
- Define details:
- Specify a name for the metadata enrichment.
- Optional: Provide a description.
- Optional: Select or create tags to be assigned to the metadata enrichment asset to simplify searching. You can create new tags by entering the tag name and pressing Enter.
- Set the data scope:
- Select the data assets that you want to enrich from Data assets.
The list shows all assets of the supported formats. You can enrich relational and structured data assets. You can select individual assets but you can also select metadata import assets to enrich the entire set of data assets from those metadata imports. However, you can't select data assets or metadata imports that are already included in a metadata enrichment. For individual data assets, you can hover over the asset name to see in which metadata enrichment the asset is included.
A metadata import asset is automatically excluded from the selection scope in these cases:
- It has a catalog as the import target.
- It was run on a connection that doesn't support access to the actual data.
See Importing metadata.
Remember: Each data asset or metadata import can be included in only one metadata enrichment per project. If you want to enrich a data asset several times with different enrichment options, you need to do that in separate projects.
- Review the selected scope. You can directly delete assets from the data scope or you can rework the entire scope by clicking Edit data scope.
- When you're done refining the data scope, click Next.
You can skip this step to create an empty metadata enrichment asset, and set the scope later.
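The selection rules for the data scope can be summarized in a short Python sketch. This is only an illustration of the rules stated above, not product code: the `Candidate` class, its field names, and the `can_select` function are hypothetical and do not correspond to any API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for a data asset or metadata import in a project."""
    name: str
    is_metadata_import: bool = False
    # Name of the metadata enrichment that already includes this item, if any.
    enrichment: Optional[str] = None
    # The next two fields are meaningful only for metadata imports.
    import_target_is_catalog: bool = False
    connection_supports_data_access: bool = True


def can_select(candidate: Candidate) -> bool:
    """Return True if the candidate can be added to a new enrichment's data scope."""
    # Each data asset or metadata import can be in only one enrichment per project.
    if candidate.enrichment is not None:
        return False
    if candidate.is_metadata_import:
        # Imports that target a catalog are excluded from the selection scope.
        if candidate.import_target_is_catalog:
            return False
        # Imports run on connections without access to the actual data are excluded.
        if not candidate.connection_supports_data_access:
            return False
    return True
```

For example, `can_select(Candidate("orders", enrichment="sales_enrichment"))` returns `False` because the asset already belongs to another metadata enrichment in the project.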
- Define the objective of this metadata enrichment asset:
- Determine the enrichment objective:
- Profile data
Provides statistics about the asset content, and assigns and suggests data classes.
For more information about the statistics, see Detailed profiling results.
Data classes describe the contents of the data in the column: for example, city, account number, or credit card number. Data classes can be used to mask data with data protection rules or to restrict access to data assets with policies. In addition, they can contribute to term assignments if a corresponding data class to term linkage exists.
The confidence of a data class is the percentage of nonnull values that match the data class. The confidence score for a data class to be assigned or suggested must at least equal the set threshold. See Data class assignment settings. If a threshold is set on a data class directly, this threshold takes precedence when data classes are assigned. It is not considered for suggestions. In addition to the confidence score, the priority of a data class is taken into account. See Adding data matching to data classes.
Several data classes are more generic identifiers that are detected and assigned at a column level. These data classes are assigned when a more specific data class could not be identified at a value level. Generic identifiers always have a confidence of 100% and include the following data classes: code, date, identifier, indicator, quantity, and text.
- Analyze data quality
Provides a data quality score for tables and columns. Data quality analysis can be done only in combination with profiling. Therefore, the Profile data option is automatically selected when you select to analyze data quality.
Data quality scores for individual columns in the data asset are computed based on quality dimensions. The overall quality score for the entire data asset is the average of the scores for all columns.
- Assign terms
Automatically assigns business terms to columns and entire assets, or suggests business terms for manual assignment. Those assignments or suggestions are generated by a set of services. See Automatic term assignment.
Depending on which term assignment services are active for your project, term assignment might require profiling.
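The two calculations described for the Profile data and Analyze data quality objectives can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the product's implementation: all function names are hypothetical, the matcher is any predicate you supply, and returning 0.0 for a column without nonnull values is an assumption that the text does not address.

```python
from typing import Callable, Iterable, Optional


def data_class_confidence(values: Iterable[object],
                          matches: Callable[[object], bool]) -> float:
    """Percentage of nonnull values that match the data class."""
    nonnull = [v for v in values if v is not None]
    if not nonnull:
        return 0.0  # assumption: no nonnull values means no evidence for the class
    hits = sum(1 for v in nonnull if matches(v))
    return 100.0 * hits / len(nonnull)


def is_assigned(confidence: float,
                assignment_threshold: float,
                class_threshold: Optional[float] = None) -> bool:
    """A threshold set directly on the data class takes precedence for assignment."""
    threshold = class_threshold if class_threshold is not None else assignment_threshold
    return confidence >= threshold


def is_suggested(confidence: float, suggestion_threshold: float) -> bool:
    """Thresholds set on a data class are not considered for suggestions."""
    return confidence >= suggestion_threshold


def asset_quality_score(column_scores: Iterable[float]) -> float:
    """Overall asset score: the average of the scores for all columns."""
    scores = list(column_scores)
    if not scores:
        raise ValueError("at least one column score is required")
    return sum(scores) / len(scores)
```

With the values `["10115", "80331", None, "abc", "20095"]` and a matcher that accepts five-digit strings, three of the four nonnull values match, so the confidence is 75%. The priority of a data class, which the text also mentions, is not modeled here.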
- Select categories to determine the data classes and business terms that can be applied during the enrichment. A project administrator might have limited the set of categories to choose from when you create an enrichment. This limitation does not apply when you edit the enrichment. In any case, you can choose only from categories where you are a collaborator with at least the Viewer role.
This selection applies to automatic assignments and suggestions only. When you manually assign terms or data classes, you can choose from all categories to which you have access.
Changes to the set of categories to choose from or the actual category selection take effect with the next enrichment run. However, existing assignments remain unchanged.
If your access to any of the selected categories is revoked after you ran the metadata enrichment and you don’t make any changes to the enrichment, any rerun still considers all selected categories for data class and term assignments.
- Select a sampling type:
- Basic: Basic sampling works with the smallest possible sample size to speed up the process: 1,000 rows per table are analyzed, and classification is done based on the most frequent 100 values per column.
- Moderate: Moderate sampling works with a medium sample size to provide reasonably accurate results without being too time-consuming: 10,000 rows per table are analyzed, and classification is done based on the most frequent 100 values per column.
- Comprehensive: Comprehensive sampling works with a large sample size to provide more accurate results: 100,000 rows per table are analyzed, and classification takes all values per column into account. However, this method is time and resource intensive.
- Custom: Define the sampling method, the sample size, and the basis for classification yourself:
- Choose between sequential and random sampling. With sequential sampling, the first rows of a data set are selected in a sequential order. With random sampling, the rows to be included are randomly selected. For both methods, the maximum number of rows to be selected is determined by the defined sample size. Random sampling is available only for data assets from data sources that support this type of sampling.
- Define the maximum size of the sample. You can set a fixed number of rows or specify what percentage of the rows in the data set you want to be analyzed. If you define the sample size as a percentage value, you can optionally set the minimum and maximum number of rows that the sample can include. You might want to set these values when you don't know the size of the data sets to be analyzed. The percentage of rows selected for the sample can only approximate the specified percentage.
- Select whether you want a data class to be assigned based on all values in a column or on the most frequent values in a column where you can specify the number of values you want to be taken into account.
Basic, moderate, or comprehensive sampling is sequential and starts at the top of the table.
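The sampling options can be pictured with a small Python sketch. The preset numbers are taken from the descriptions above; everything else is an assumption made for illustration, including the names `PRESETS`, `percentage_sample_size`, and `take_sample`, and the way the minimum and maximum bounds are applied to a percentage-based sample.

```python
import random
from typing import Optional, Sequence

# Preset values as stated in the text. None means "all values per column".
PRESETS = {
    "basic": {"rows": 1_000, "top_values": 100},
    "moderate": {"rows": 10_000, "top_values": 100},
    "comprehensive": {"rows": 100_000, "top_values": None},
}


def percentage_sample_size(total_rows: int,
                           percent: float,
                           min_rows: Optional[int] = None,
                           max_rows: Optional[int] = None) -> int:
    """Approximate a percentage-based sample size with optional row bounds."""
    size = round(total_rows * percent / 100)
    if min_rows is not None:
        size = max(size, min_rows)
    if max_rows is not None:
        size = min(size, max_rows)
    # A sample can never contain more rows than the data set has.
    return min(size, total_rows)


def take_sample(rows: Sequence, size: int,
                method: str = "sequential",
                seed: Optional[int] = None) -> list:
    """Select up to `size` rows, either the first rows or a random subset."""
    if method == "sequential":
        return list(rows[:size])  # starts at the top of the table
    if method == "random":
        rng = random.Random(seed)
        return rng.sample(list(rows), min(size, len(rows)))
    raise ValueError(f"unknown sampling method: {method}")
```

The bounds matter when table sizes are unknown: for a table with 5,000,000 rows, a 10% sample capped with `max_rows=100_000` analyzes 100,000 rows instead of 500,000.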
- Define whether you want to run scheduled enrichment jobs. If you don't set a schedule, you run the enrichment when you save the metadata enrichment asset. You can rerun the enrichment manually at any time.
If you select to run the enrichment on a specific schedule, define the date and time you want the job to run. You can schedule single and recurring runs. If you schedule a single run, the job will run exactly one time at the specified day and time. If you schedule recurring runs, the job will run for the first time at the timestamp indicated in the Repeat section.
Optionally, change the name of the enrichment job. The default name is metadata_enrichment_name job.
You can later access the enrichment job you create from the project's Jobs page. This page also provides easy access to the job logs. See Jobs.
If your data scope includes metadata import assets, the Schedule page also provides information about the schedules of the respective metadata import jobs. This information helps you coordinate your enrichment schedule with any import schedules.
- Select the data scope for the reruns of the enrichment, whether scheduled or run manually. The data scope can be all assets from the selected data scope, or only new or modified assets. New or modified assets are assets that were added to the data scope, assets where columns were added or removed, or assets where asset or column descriptions changed after the last run of the enrichment. Enrichment is always run on the entire data asset regardless of whether an asset is new or modified.
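The rerun scope options can be expressed as a simple filter. The following Python sketch is only an illustration of the rule described above: the `ScopedAsset` class, its timestamp fields, and both functions are hypothetical, and treating every asset as new when no previous run exists is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ScopedAsset:
    """Hypothetical record of the change timestamps relevant to reruns."""
    name: str
    added_to_scope_at: datetime
    columns_changed_at: Optional[datetime] = None       # columns added or removed
    descriptions_changed_at: Optional[datetime] = None  # asset or column descriptions


def is_new_or_modified(asset: ScopedAsset, last_run: Optional[datetime]) -> bool:
    """True if any relevant change happened after the last enrichment run."""
    if last_run is None:
        return True  # assumption: without a previous run, every asset counts as new
    changes = [asset.added_to_scope_at,
               asset.columns_changed_at,
               asset.descriptions_changed_at]
    return any(t is not None and t > last_run for t in changes)


def rerun_scope(assets: List[ScopedAsset],
                last_run: Optional[datetime],
                only_new_or_modified: bool) -> List[ScopedAsset]:
    """Return the assets to enrich; each selected asset is enriched as a whole."""
    if not only_new_or_modified:
        return list(assets)
    return [a for a in assets if is_new_or_modified(a, last_run)]
```

Note that the filter decides only which assets are picked up. As stated above, a selected asset is always enriched in its entirety.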
- Review the metadata enrichment configuration. To make changes, click the edit icon on the tile and update the settings.
- Click Create. The metadata enrichment asset is added to the project, and a metadata enrichment job is created. If you didn't configure a schedule, the enrichment is run immediately. If you configured a schedule, the enrichment will run on the defined schedule.
After the enrichment is complete, you can access a high-level overview of the enrichment results by viewing the metadata enrichment asset. From there, you can drill down into and work with the results for each asset. See Working with the enrichment results.
For information about how to update, rerun, or delete a metadata enrichment, see Managing an existing metadata enrichment.
Learn more