Managing metadata enrichment (Watson Knowledge Catalog)

Data assets can be enriched with information that helps users to find data faster, to decide whether the data is appropriate for the task at hand, whether they can trust the data, and how to work with the data. Such information includes, for example, terms that define the meaning of the data, rules that document ownership or determine quality standards, or reviews.

Data stewards create asset profiles to understand the meaning of data and to assess its quality. They also add business context to data by assigning terms. Metadata enrichment automates this process, which increases the data steward's productivity.

Data is useful only if its context, content, and quality are trusted. To keep it that way, data must be evaluated continuously and remediated when required. Data stewards can configure recurring jobs to continuously track changes to the content and structure of data and then analyze only the data that changed.

The information that is added to assets through metadata enrichment also helps to protect data because it can be used in data protection policies to mask data or to restrict access.

Required services
Watson Knowledge Catalog

Data format
  • Tables from relational and nonrelational data sources
  • Files uploaded from the local file system or from file-based connections to the data sources, with these formats: CSV, TSV, Avro, Parquet, Microsoft Excel (xls, xlsm, and xlsx; only the first sheet in a workbook is profiled)

These structured data files are not profiled:

  • Files within a connected folder asset. Files that are accessible from a connected folder asset are not treated as assets and are not profiled.
  • Files within an archive file. The archive file is referenced by the data asset and the compressed files are not profiled.

You can enrich data assets from the data sources listed in Supported data sources for metadata import and metadata enrichment.

Data size
Any; data sets from file-based connections cannot have more than 4,999 columns
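The format and size limits above can be expressed as a simple pre-check. The following is a minimal sketch, not part of the product: the helper name `can_be_profiled` is hypothetical, and mapping formats to file extensions is an assumption made for illustration.

```python
from pathlib import PurePosixPath

# File formats listed above as profiled. Mapping them to file extensions
# is an assumption of this sketch.
SUPPORTED_EXTENSIONS = {".csv", ".tsv", ".avro", ".parquet", ".xls", ".xlsm", ".xlsx"}

# Column limit stated above for data sets from file-based connections.
MAX_COLUMNS_FILE_BASED = 4999


def can_be_profiled(file_name: str, column_count: int) -> bool:
    """Hypothetical helper: check a file against the stated format and column limits."""
    extension = PurePosixPath(file_name).suffix.lower()
    return extension in SUPPORTED_EXTENSIONS and column_count <= MAX_COLUMNS_FILE_BASED
```

For example, `can_be_profiled("sales.csv", 120)` passes the check, while a Parquet file with 5,000 columns or a file inside a ZIP archive does not.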

Required permissions
To create and run a metadata enrichment, you must have the Admin or the Editor role in the project, and you must have at least view access to the categories that you want to use in the enrichment. Also, you must be authorized to access the connections to the data sources of the data assets to be enriched.

You can also create, edit, run, or delete metadata enrichments with APIs instead of the user interface. The links to these APIs are listed in the Learn more section.

Metadata enrichment overview

Enriching data assets involves the following process:

  • Identify the data assets that you want to enrich.
  • In a project, create a metadata enrichment asset to configure the enrichment details like the scope and the objective of the enrichment and the schedule for the enrichment job.
  • Run the enrichment job.
  • For each data asset included in the enrichment, check the results in the metadata enrichment asset:

    • Identify anomalies and quality issues and take appropriate measures to remediate any issues.
    • Check term assignments, and evaluate and act on term suggestions.
    • Manage data class assignments at the column level.

    You can also access the enrichment results and work with them in the profile of each individual asset. See Asset profiles.

  • Reevaluate the assets in question.

While you can add individual connected assets to a metadata enrichment, metadata enrichment is intended for bulk processing data assets added to the project through metadata import.

To ensure consistent use of enrichment options, you can configure default settings for all metadata enrichment assets in a project. You must have at least one metadata enrichment asset in your project to be able to configure those settings. To open the settings page, open an existing metadata enrichment asset and click Default settings.

Creating a metadata enrichment asset and enriching data

To create a metadata enrichment asset and a job for enriching data:

  1. Open a project and click New asset > Metadata Enrichment. After you create the first metadata enrichment in this way, you can add new metadata enrichment assets from the project's Asset page.

  2. Define details:

    • Specify a name for the metadata enrichment.
    • Optional: Provide a description.
    • Optional: Select or create tags to be assigned to the metadata enrichment asset to simplify searching. You can create new tags by entering the tag name and pressing Enter.
  3. Set the data scope:

    1. Select the data assets that you want to enrich from Data assets. The list shows all assets that can be included in the enrichment, that is, assets of the supported formats that are not included in any other metadata enrichment within the current project. You can enrich relational and structured data assets. You can select individual assets, or you can select metadata import assets to enrich the entire set of data assets from those metadata imports. Metadata import assets that have a catalog as the import target or that were run on connections that do not support access to the actual data are automatically excluded from the selection scope. See Importing metadata.

      Remember: Each data asset or metadata import can be included in only one metadata enrichment per project. If you want to enrich a data asset several times with different enrichment options, you'll need to do that in separate projects.
    2. Review the selected scope. You can directly delete assets from the data scope or you can rework the entire scope by clicking Edit data scope.

    3. When you're done refining the data scope, click Next.

    You can skip this step to create an empty metadata enrichment asset, and set the scope later.

  4. Define the objective of this metadata enrichment asset:

    1. Determine the enrichment objective:

      Profile data
      Provides statistics about the asset content, and assigns and suggests data classes.

      Statistics about the content include the following information:

      • The percentage of matching, mismatching, or missing data.
      • The frequency distribution for all values identified in a column.
      • For each column, the minimum, maximum, and mean values and the number of unique values in that column. Depending on a column’s data type, the statistics for each column will vary slightly. For example, statistics for a column of data type integer have minimum, maximum, and mean values while statistics for a column of data type string have minimum length, maximum length, and mean length values.
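How the statistics vary by data type can be illustrated with a small sketch. It is not the profiling service's implementation; the function name `column_statistics` and the returned keys are hypothetical.

```python
from statistics import mean
from typing import Any, Dict, List


def column_statistics(values: List[Any]) -> Dict[str, Any]:
    """Hypothetical sketch: value statistics for numeric columns, length statistics otherwise."""
    non_null = [v for v in values if v is not None]
    stats: Dict[str, Any] = {
        "unique": len(set(non_null)),
        "missing": len(values) - len(non_null),
    }
    if not non_null:
        return stats
    if all(isinstance(v, (int, float)) for v in non_null):
        # Numeric column: minimum, maximum, and mean of the values.
        stats.update(min=min(non_null), max=max(non_null), mean=mean(non_null))
    else:
        # String column: minimum, maximum, and mean of the value lengths.
        lengths = [len(str(v)) for v in non_null]
        stats.update(min_length=min(lengths), max_length=max(lengths),
                     mean_length=mean(lengths))
    return stats
```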

      Data classes describe the contents of the data in the column: for example, city, account number, or credit card number. Data classes can be used to mask data with data protection rules. Also, they can be used to restrict access to data assets with policies.

      The confidence of a data class is the percentage of nonnull values that match the data class. The confidence score for a data class to be assigned or suggested must at least equal the set threshold. See Data class assignment settings. If a threshold is set on a data class directly, this threshold takes precedence when data classes are assigned. It is not considered for suggestions. In addition to the confidence score, the priority of a data class is taken into account. See Adding data matching to data classes.

      Several data classes are more generic identifiers that are detected and assigned at a column level. These data classes are assigned when a more specific data class could not be identified at a value level. Generic identifiers always have a confidence of 100% and include the following data classes: code, date, identifier, indicator, quantity, and text.
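The confidence definition above (the percentage of nonnull values that match the data class, compared against a threshold) can be sketched as follows. The function names and the ZIP code matcher are hypothetical, and returning 0 for a column without nonnull values is an assumption of this sketch.

```python
from typing import Callable, Optional, Sequence


def data_class_confidence(values: Sequence[Optional[str]],
                          matches: Callable[[str], bool]) -> float:
    """Percentage of nonnull values that match the data class."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0  # assumption: no nonnull values means no confidence
    matched = sum(1 for v in non_null if matches(v))
    return 100.0 * matched / len(non_null)


def meets_threshold(confidence: float, threshold: float) -> bool:
    """The confidence must at least equal the threshold."""
    return confidence >= threshold


def looks_like_us_zip(value: str) -> bool:
    """Hypothetical matcher used only for this example."""
    return len(value) == 5 and value.isdigit()


column = ["10001", "94105", None, "abc", "60601"]
confidence = data_class_confidence(column, looks_like_us_zip)  # 3 of 4 nonnull values match
```

In this example, three of the four nonnull values match, so the confidence is 75%. The null value does not count against the data class.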

      Analyze data quality
      Provides a data quality score for tables and columns. Data quality analysis can be done only in combination with profiling. Therefore, the Profile data option is automatically selected when you select to analyze data quality.

      Data quality scores for individual columns in the data asset are computed based on quality dimensions. The overall quality score for the entire data asset is the average of the scores for all columns.
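As a minimal sketch of the averaging described above, assuming the per-column scores are already available (how they are derived from the quality dimensions is not modeled, and the function name is hypothetical):

```python
from statistics import mean
from typing import Mapping


def asset_quality_score(column_scores: Mapping[str, float]) -> float:
    """Overall score of a data asset: the average of its column scores."""
    if not column_scores:
        raise ValueError("at least one column score is required")
    return mean(list(column_scores.values()))


# Hypothetical column scores in percent.
score = asset_quality_score({"id": 100.0, "email": 80.0, "phone": 90.0})
```

With these example scores, the overall score of the asset is 90.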

      Assign terms
      Automatically assigns business terms to columns and entire assets, or suggests business terms for manual assignment. Those assignments or suggestions are generated by a set of services. See Automatic term assignment.

      Depending on which term assignment services are active for your project, term assignment might require profiling.

    2. Select categories to determine the data classes and business terms that can be applied during the enrichment. A project administrator might have limited the set of categories to choose from when you create an enrichment. This limitation does not apply when you edit the enrichment. In any case, you can choose only from categories where you are a collaborator with at least the Viewer role.

      This selection applies to automatic assignments and suggestions only. When you manually assign terms or data classes, you can choose from all categories to which you have access.

      Changes to the set of categories to choose from or the actual category selection take effect with the next enrichment run. However, existing assignments remain unchanged.

    3. Select a sampling type:

      • Basic: Basic sampling works with the smallest possible sample size to speed up the process: 1,000 rows per table are analyzed, and classification is done based on the most frequent 100 values per column.
      • Moderate: Moderate sampling works with a medium-sized sample size to provide reasonably accurate results without being too time-consuming: 10,000 rows per table are analyzed, and classification is done based on the most frequent 100 values per column.
      • Comprehensive: Comprehensive sampling works with a large sample size to provide more accurate results: 100,000 rows per table are analyzed, and classification takes all values per column into account. However, this method is time and resource intensive.
      • Custom: Define the sampling method, the sample size, and the basis for classification yourself:
        • Choose between sequential and random sampling. With sequential sampling, the first rows of a data set are selected in a sequential order. With random sampling, the rows to be included are randomly selected. For both methods, the maximum number of rows to be selected is determined by the defined sample size. Random sampling is available only for data assets from data sources that support this type of sampling.
        • Define the maximum size of the sample. You can set a fixed number of rows or specify the percentage of rows in the data set that you want to be analyzed. If you define the sample size as a percentage value, you can optionally set the minimum and maximum number of rows that the sample can include. You might want to set these values when you don't know the size of the data sets to be analyzed. The number of rows that is actually selected can only approximate the specified percentage.
        • Select whether you want a data class to be assigned based on all values in a column or on the most frequent values in a column where you can specify the number of values you want to be taken into account.

      Basic, moderate, or comprehensive sampling is sequential and starts at the top of the table.
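The sampling options above can be summarized in a short sketch. The preset values come from the descriptions in this step; the class and function names are hypothetical, and the way minimum and maximum row counts are applied to a percentage-based sample is an assumption, not documented service behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingPreset:
    max_rows: int
    # Number of most frequent values used for classification; None means all values.
    top_values_for_classification: Optional[int]


PRESETS = {
    "basic": SamplingPreset(1_000, 100),
    "moderate": SamplingPreset(10_000, 100),
    "comprehensive": SamplingPreset(100_000, None),
}


def custom_sample_size(total_rows: int, percent: float,
                       min_rows: Optional[int] = None,
                       max_rows: Optional[int] = None) -> int:
    """Hypothetical sketch: percentage-based sample size with optional row bounds."""
    size = round(total_rows * percent / 100)
    if min_rows is not None:
        size = max(size, min_rows)
    if max_rows is not None:
        size = min(size, max_rows)
    return min(size, total_rows)  # a sample cannot exceed the data set
```

For a table with 2,000,000 rows, a 10% sample with a maximum of 50,000 rows yields 50,000 rows in this sketch. For a table with 5,000 rows, a 10% sample with a minimum of 1,000 rows yields 1,000 rows.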

  5. Define whether you want to run scheduled enrichment jobs. If you don't set a schedule, you run the enrichment when you save the metadata enrichment asset. You can rerun the enrichment manually at any time.

    If you select to run the enrichment on a specific schedule, define the date and time you want the job to run. You can schedule single and recurring runs. If you schedule a single run, the job will run exactly one time at the specified day and time. If you schedule recurring runs, the job will run for the first time at the timestamp indicated in the Repeat section.

    Optionally, change the name of the enrichment job. The default name is metadata_enrichment_name job.

    You can later access the enrichment job you create from the project's Jobs page. This page also provides easy access to the job logs. See Jobs.

    If your data scope includes metadata import assets, the Schedule page also provides information about the schedules of the respective metadata import jobs. This information helps you coordinate your enrichment schedule with any import schedules.

  6. Select the data scope for the reruns of the enrichment, whether scheduled or run manually. The data scope can be all assets from the selected data scope, or only new or modified assets. New or modified assets are assets that were added to the data scope, assets where columns were added or removed, or assets where the asset or column descriptions changed after the last run of the enrichment. Enrichment is always run on the entire data asset regardless of whether an asset is new or modified.
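The rerun scope described in this step can be sketched as a comparison between the state of each asset at the last run and its current state. This is an illustration only: the `AssetSnapshot` type and both function names are hypothetical, and the service's actual change detection is not documented here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AssetSnapshot:
    description: str
    # Column name mapped to column description.
    columns: Dict[str, str] = field(default_factory=dict)


def is_new_or_modified(previous: Optional[AssetSnapshot], current: AssetSnapshot) -> bool:
    """True if the asset is new to the scope or its columns or descriptions changed."""
    if previous is None:
        return True  # asset was added to the data scope after the last run
    columns_changed = set(previous.columns) != set(current.columns)
    descriptions_changed = (previous.description != current.description
                            or previous.columns != current.columns)
    return columns_changed or descriptions_changed


def rerun_scope(last_run: Dict[str, AssetSnapshot],
                now: Dict[str, AssetSnapshot],
                only_new_or_modified: bool) -> List[str]:
    """Names of the assets to enrich on a rerun; each is enriched as a whole asset."""
    if not only_new_or_modified:
        return sorted(now)
    return sorted(name for name, snap in now.items()
                  if is_new_or_modified(last_run.get(name), snap))
```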

  7. Review the metadata enrichment configuration. To make changes, click the edit icon on the tile and update the settings.

  8. Click Create. The metadata enrichment asset is added to the project, and a metadata enrichment job is created. If you didn't configure a schedule, the enrichment is run immediately. If you configured a schedule, the enrichment will run on the defined schedule.

After the enrichment is complete, you can access a high-level overview of the enrichment results by viewing the metadata enrichment asset. From there, you can drill down into and work with the results for each asset. See Working with the enrichment results.

Editing the metadata enrichment

Metadata enrichment assets are listed in the Metadata enrichments section of the Assets page. To edit a metadata enrichment asset:

  • Open the metadata enrichment. In the Metadata enrichments section of the Assets page, click the asset's name or click View from the asset's overflow menu. Then, click Edit enrichment.
  • In the Metadata enrichments section of the Assets page, select Edit from the overflow menu next to the asset name.

You can change these configuration settings:

  • Asset details such as the asset name, the description, or tags. Note that changing the asset name does not change the name of the associated enrichment job.
  • The data scope.
  • The enrichment objectives, the category selection, and the sampling option.
  • The schedule.

Hint: The metadata enrichment does not run automatically when you save configuration changes. For example, even if you delete the schedule, you must manually run the metadata enrichment. See Running the enrichment manually.

Running the enrichment manually

You can manually run the metadata enrichment at any time for the entire set of assets or a subset of assets.

To run the enrichment for the entire set of assets:

  • Open the metadata enrichment asset and select Enrich all assets from the overflow menu next to the asset name.
  • Open the metadata enrichment asset. On the Assets tab, select all assets and select Enrich from the toolbar.
  • Go to the project's Jobs page and run the enrichment job from there. See Jobs.

To run the enrichment for a subset of the assets:

  • Open the metadata enrichment asset. On the Assets tab, select assets as required and select Enrich from the overflow menu next to the asset name.
  • Open the metadata enrichment asset. On the Assets tab, select assets as required and select Enrich from the toolbar.

If the enrichment ran at least once, the data scope that you selected for reruns also determines which assets are actually reenriched.

At any time, you can change the metadata enrichment configuration by updating the metadata enrichment asset before you run the enrichment. Assets are then profiled and analyzed according to the current enrichment configuration.

In case of a rerun, assets might not be available for reenrichment because they were deleted from the data source or were removed from the enrichment scope. For such assets, the timestamp of the asset profile will still show the date and time of the previous run.

Deleting a metadata enrichment asset

You can delete a metadata enrichment asset from a project in one of these ways:

  • Select the Delete option from the overflow menu for the asset on the project Assets page.
  • Open the asset and select Delete from the overflow menu next to the asset name.

The metadata enrichment configuration and its associated metadata enrichment job are deleted. Assets in the project or a catalog that were enriched with this metadata enrichment asset are not affected. You might need to refresh your browser to see the deletion reflected.

Basic enrichment information

A summary of relevant information about a metadata enrichment is provided in the side panel. This panel is open by default, but you can also access it by clicking the info icon.

Learn more
