Creating a metadata enrichment asset

Configure and run metadata enrichment to add descriptive information to your data assets.

You can add several layers of metadata to a data asset:

  • Profile the data to classify it and compile statistics about the values.
  • Run predefined data quality checks for an initial quality assessment.
  • Enrich assets with business vocabulary that describes the semantic meaning of the data for your organization.
  • Identify primary key and relationship candidates.

Required service

DataStage for key or relationship analysis and advanced profiling

Required permissions

To create and run a metadata enrichment, you must have the Admin or the Editor role in the project, and you must have at least view access to the categories that you want to use in the enrichment. Also, you must be authorized to access the connections to the data sources of the data assets to be enriched.

All operations that are run as part of a metadata enrichment require credentials for secure authorization. Typically, your user API key is used to execute such long-running operations without disruption. If credentials are not available when you create a metadata enrichment or try to run any type of enrichment, you are prompted to create an API key. That API key is then saved as your task credentials. See Managing the user API key.

You can also create metadata enrichments with APIs instead of the user interface. The links to these APIs are listed in the Learn more section.

To create a metadata enrichment asset and a job for enriching data:

  1. Open a project and click New asset > Metadata Enrichment. After you create the first metadata enrichment in this way, you can add new metadata enrichment assets from the project's Asset page.

  2. Define details:

    • Specify a name for the metadata enrichment.
    • Optional: Provide a description.
    • Optional: Select or create tags to be assigned to the metadata enrichment asset to simplify searching. You can create new tags by entering the tag name and pressing Enter.
  3. Set the data scope:

    1. Select the data assets that you want to enrich from Data assets.

      If necessary, enter your personal credentials for locked data connections, which are marked with a key icon. This is a one-time step that permanently unlocks the connection for you. After you unlock the connection, the key icon is no longer displayed.

      The list shows all assets of the supported formats. You can enrich relational and structured data assets. You can select individual assets, but you can also select metadata import assets to enrich the entire set of data assets from those metadata imports. However, you can't select data assets or metadata imports that are already included in a metadata enrichment. For individual data assets, you can hover over the asset name to see in which metadata enrichment the asset is included.

      A metadata import asset is automatically excluded from the selection scope in these cases:

      • It has a catalog as the import target.
      • It was run on a connection that doesn't support access to the actual data.

      See Importing metadata.

      Remember: Each data asset or metadata import can be included in only one metadata enrichment per project. If you want to enrich a data asset several times with different enrichment options, you need to do that in separate projects.
    2. Review the selected scope. You can directly delete assets from the data scope or you can rework the entire scope by clicking Edit data scope.

    3. When you're done refining the data scope, click Next.

    You can skip this step to create an empty metadata enrichment asset, and set the scope later.
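
The rule that each data asset or metadata import can belong to only one metadata enrichment per project can be checked before you assemble a scope. The following Python sketch is illustrative only: the lookup of existing enrichments and the function name are hypothetical and do not represent a product API.

```python
from typing import Dict, Iterable, Set


def find_scope_conflicts(
    requested: Iterable[str],
    existing_enrichments: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Map each requested asset to the enrichment that already includes it.

    `existing_enrichments` is a hypothetical lookup from enrichment name to
    the names of the assets in its scope within one project.
    """
    owner_by_asset = {
        asset: enrichment
        for enrichment, assets in existing_enrichments.items()
        for asset in assets
    }
    return {a: owner_by_asset[a] for a in requested if a in owner_by_asset}


existing = {"Sales enrichment": {"ORDERS", "CUSTOMERS"}}
conflicts = find_scope_conflicts(["CUSTOMERS", "PRODUCTS"], existing)
print(conflicts)  # {'CUSTOMERS': 'Sales enrichment'}
```

Any asset that is returned here would have to be enriched in a separate project if you need different enrichment options for it.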

  4. Define the objective of this metadata enrichment asset:

    1. Determine the enrichment objective:

      Profile data

      Provides basic statistics about the asset content, and assigns and suggests data classes.

      This type of profiling is fast but makes some approximations for certain metrics like frequency distribution and uniqueness. To get more exact results without approximation, run advanced profiling on selected data assets. See Advanced data profiling.

      For more information about the statistics, see Detailed profiling results.

      Data classes describe the contents of the data in the column: for example, city, account number, or credit card number. Data classes can be used to mask data with data protection rules or to restrict access to data assets with policies. In addition, they can contribute to term assignments if a corresponding data class to term linkage exists.

      The confidence of a data class is the percentage of nonnull values that match the data class. The confidence score for a data class to be assigned or suggested must at least equal the set threshold. See Data class assignment settings. If a threshold is set on a data class directly, this threshold takes precedence when data classes are assigned. It is not considered for suggestions. In addition to the confidence score, the priority of a data class is taken into account. See Adding data matching to data classes.
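
The confidence definition above can be sketched in a few lines of Python. This is a simplified illustration, not the product's implementation: the matcher, the function names, and the handling of an all-null column are assumptions, and the sketch ignores sampling, data class priority, and per-class thresholds.

```python
import re
from typing import Callable, Iterable, Optional


def data_class_confidence(
    values: Iterable[Optional[str]],
    matches: Callable[[str], bool],
) -> float:
    """Return the percentage of nonnull values that match the data class."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0  # assumption: no nonnull values means no evidence
    hits = sum(1 for v in non_null if matches(v))
    return 100.0 * hits / len(non_null)


def qualifies(confidence: float, threshold: float) -> bool:
    """The confidence must at least equal the set threshold."""
    return confidence >= threshold


def is_postal_code(value: str) -> bool:
    """Hypothetical matcher: exactly five digits."""
    return re.fullmatch(r"\d{5}", value) is not None


column = ["10115", "80331", None, "n/a", "20095"]
confidence = data_class_confidence(column, is_postal_code)
print(confidence)                   # 75.0
print(qualifies(confidence, 75.0))  # True
print(qualifies(confidence, 80.0))  # False
```

The null value is excluded from the denominator, so three matches out of four nonnull values give a confidence of 75%.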

      Several data classes are more generic identifiers that are detected and assigned at a column level. These data classes are assigned when a more specific data class could not be identified at a value level. Generic identifiers always have a confidence of 100% and include the following data classes: code, date, identifier, indicator, quantity, and text.

      Single-column primary keys are suggested based on profiling statistics. If primary key and foreign key constraints are already defined in your data, these keys are automatically assigned. Primary and foreign key information must explicitly be included in a metadata import.

      From the enrichment results, you can run a multi-column primary key analysis where the actual data is checked. For more information, see Identifying primary keys.

      Run basic quality analysis

      Runs predefined data quality checks on the columns of a data asset. The set of checks that is applied is defined in the enrichment settings. See Basic quality analysis settings and Predefined data quality checks. Each check can contribute to the asset's overall data quality scores. This type of data quality analysis can be done only in combination with profiling. Therefore, the Profile data option is automatically selected when you select to analyze data quality.

      You can choose whether you want to write the output of these checks to a database. Click Customize and enable the Write output to database option. If default settings exist, the sections are populated accordingly, and you can overwrite the settings. If no default settings exist, configure the output and the output location. For information about which data sources are supported as output targets, see the Output tables column in Supported data sources. Schema and table names must follow this convention:

      • The first character for the name must be an alphabetic character.
      • The rest of the name can consist of alphabetic characters, numeric characters, or underscores.
      • The name must not contain spaces.
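
As a quick pre-check, the naming convention can be expressed as a regular expression. The sketch below assumes that "alphabetic" and "numeric" mean ASCII letters and digits; the product might accept a wider character set, and the function name is hypothetical.

```python
import re

# Assumption: "alphabetic" and "numeric" are read as ASCII letters and digits.
# First character: a letter. Remaining characters: letters, digits, underscores.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_output_name(name: str) -> bool:
    """Check a schema or table name against the convention listed above."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["dq_output", "DQ2024", "2024_output", "dq output", "_dq"]:
    print(candidate, is_valid_output_name(candidate))
```

The first two candidates pass. The others fail because they start with a digit, contain a space, or start with an underscore.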

      If you select to write the exceptions or the rows in which the issues were found (exception records) to existing tables, make sure these tables have the required format. See Data quality output.

      If the connection that you pick is locked, you are asked to enter your personal credentials. This is a one-time step that permanently unlocks the connection for you.

      Assign terms

      Automatically assigns business terms to columns and entire assets, or suggests business terms for manual assignment. Those assignments or suggestions are generated by a set of services. See Automatic term assignment.

      Depending on which term assignment services are active for your project, term assignment might require profiling.

      Set relationships

      Uses profiling statistics and name similarities between columns to provide primary and foreign keys and to suggest relationships between assets and columns. The suggestion threshold that is set in the default enrichment settings is applied. This type of relationship analysis requires profiling.

      From the enrichment results, you can run a deeper relationship analysis where the actual data is checked. For more information, see Identifying relationships.

    2. Select categories to determine the data classes and business terms that can be applied during the enrichment. A project administrator might have limited the set of categories to choose from when you create an enrichment. This limitation does not apply when you edit the enrichment. In any case, you can choose only from categories where you are a collaborator with at least the Viewer role.

      This selection applies to automatic assignments and suggestions only. When you manually assign terms or data classes, you can choose from all categories to which you have access.

      Changes to the set of categories to choose from or the actual category selection take effect with the next enrichment run. However, existing assignments remain unchanged.

      If your access to any of the selected categories is revoked after you ran the metadata enrichment and you don’t make any changes to the enrichment, any rerun still considers all selected categories for data class and term assignments.

    3. Select a sampling type:

      • Basic: Basic sampling works with the smallest possible sample size to speed up the process: 1,000 rows per table are analyzed, and classification is done based on the most frequent 100 values per column.
      • Moderate: Moderate sampling works with a medium-sized sample size to provide reasonably accurate results without being too time-consuming: 10,000 rows per table are analyzed, and classification is done based on the most frequent 100 values per column.
      • Comprehensive: Comprehensive sampling works with a large sample size to provide more accurate results: 100,000 rows per table are analyzed, and classification takes all values per column into account. However, this method is time and resource intensive.
      • Custom: Define the sampling method, the sample size, and the basis for classification yourself:
        • Choose between sequential and random sampling. With sequential sampling, the first rows of a data set are selected in a sequential order. With random sampling, the rows to be included are randomly selected. For both methods, the maximum number of rows to be selected is determined by the defined sample size. Random sampling is available only for data assets from data sources that support this type of sampling.

        • Define the maximum size of the sample. You can set a fixed number of rows or specify the percentage of rows in the data set that you want to be analyzed. If you define the sample size as a percentage value, you can optionally set the minimum and maximum number of rows that the sample can include. You might want to set these values when you don't know the size of the data sets to be analyzed. The number or percentage of rows selected for the sample can only approximate the specified value.

          If the data source does not support fetching the actual record count of a data set, only a subset of the sampling options is available.

        • Select whether you want a data class to be assigned based on all values in a column or on the most frequent values in a column where you can specify the number of values you want to be taken into account.

      Basic, moderate, or comprehensive sampling is sequential and starts at the top of the table. To suppress sampling, use custom sampling that is configured with random sampling and a sample size of 100%.
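
To get a feel for how a percentage-based custom sample interacts with the optional minimum and maximum row counts, consider the following Python sketch. It is an approximation under stated assumptions: the function name, the rounding, and the order in which the bounds are applied are hypothetical, and the actual number of sampled rows can only approximate the computed target.

```python
import math
from typing import Optional

# Preset values as stated above: (rows per table, most frequent values per column).
# None means that all values in a column are used for classification.
PRESETS = {
    "basic": (1_000, 100),
    "moderate": (10_000, 100),
    "comprehensive": (100_000, None),
}


def target_sample_rows(
    total_rows: int,
    percent: float,
    min_rows: Optional[int] = None,
    max_rows: Optional[int] = None,
) -> int:
    """Estimate the target row count for a percentage-based custom sample."""
    rows = math.ceil(total_rows * percent / 100)
    if min_rows is not None:
        rows = max(rows, min_rows)  # raise small samples to the minimum
    if max_rows is not None:
        rows = min(rows, max_rows)  # cap large samples at the maximum
    return min(rows, total_rows)    # a sample cannot exceed the data set


print(target_sample_rows(250_000, 10, min_rows=1_000, max_rows=20_000))  # 20000
print(target_sample_rows(5_000, 10, min_rows=1_000))                     # 1000
print(target_sample_rows(500, 10, min_rows=1_000))                       # 500
```

Setting both bounds keeps the sample size predictable when the sizes of the data sets in the scope vary widely.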

  5. Define whether you want to run scheduled enrichment jobs. If you don't set a schedule, you run the enrichment when you save the metadata enrichment asset. You can rerun the enrichment manually at any time.

    If you choose to run the enrichment on a specific schedule, define the date and time you want the job to run. You can schedule single and recurring runs. If you schedule a single run, the job runs exactly once on the specified day and time. If you schedule recurring runs, the job runs for the first time at the timestamp indicated in the Repeat section.

    Optionally, change the name of the enrichment job. The default name is metadata_enrichment_name job, where metadata_enrichment_name is the name of the metadata enrichment.

    You can later access the enrichment job you create from the project's Jobs page. This page also provides easy access to the job logs. See Jobs.

    If your data scope includes metadata import assets, the Schedule page also provides information about the schedules of the respective metadata import jobs. This information helps you coordinate your enrichment schedule with any import schedules.

  6. Select the data scope for the reruns of the enrichment, whether scheduled or run manually. The data scope can be all assets from the selected data scope or a subset of assets. The default option is New and modified assets and assets not enriched in the previous run. With this option, assets are selected for enrichment as follows:

    • Assets that were added after the last run of the enrichment
    • Assets where columns were added or removed after the last run of the enrichment
    • Assets where asset or column descriptions changed after the last run of the enrichment
    • Assets for which the previous enrichment failed or was canceled

    Enrichment is always run on the entire data asset regardless of whether an asset is new or modified.
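
The selection criteria for the default rerun option can be summarized as a single predicate. The following Python sketch is only a model of the list above: the `AssetState` fields and the status strings are hypothetical and do not correspond to a documented data structure.

```python
from dataclasses import dataclass


@dataclass
class AssetState:
    """Hypothetical summary of what changed for an asset since the last run."""
    name: str
    added_since_last_run: bool = False
    columns_added_or_removed: bool = False
    descriptions_changed: bool = False
    last_enrichment_status: str = "finished"  # hypothetical status values


def needs_rerun(asset: AssetState) -> bool:
    """Return True if any of the four criteria listed above applies."""
    return (
        asset.added_since_last_run
        or asset.columns_added_or_removed
        or asset.descriptions_changed
        or asset.last_enrichment_status in {"failed", "canceled"}
    )


assets = [
    AssetState("ORDERS"),
    AssetState("CUSTOMERS", columns_added_or_removed=True),
    AssetState("PRODUCTS", last_enrichment_status="failed"),
    AssetState("REGIONS", added_since_last_run=True),
]
print([a.name for a in assets if needs_rerun(a)])
# ['CUSTOMERS', 'PRODUCTS', 'REGIONS']
```

An asset that is selected is enriched as a whole, not just the part that changed.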

  7. Review the metadata enrichment configuration. To make changes, click the edit icon on the tile and update the settings.

  8. Click Create. The metadata enrichment asset is added to the project, and several jobs are created:

    • A metadata enrichment job
    • A job for deep primary key analysis named metadata-enrichment-name (PK Detection)
    • A job for deep relationship analysis named metadata-enrichment-name (Relationship Detection)

    If you didn't configure a schedule, the enrichment is run immediately. If you configured a schedule, the enrichment will run on the defined schedule.
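
If you need to locate these jobs on the project's Jobs page, it can help to derive their expected names from the name of the metadata enrichment. The Python helper below is a hypothetical convenience that follows the naming patterns described in this topic. It assumes that you kept the default name for the enrichment job.

```python
from typing import Dict


def default_job_names(enrichment_name: str) -> Dict[str, str]:
    """Build the default job names for a metadata enrichment."""
    return {
        # Default name from the schedule step, unless you changed it.
        "enrichment": f"{enrichment_name} job",
        "primary_key_analysis": f"{enrichment_name} (PK Detection)",
        "relationship_analysis": f"{enrichment_name} (Relationship Detection)",
    }


names = default_job_names("Sales enrichment")
for purpose, job_name in names.items():
    print(f"{purpose}: {job_name}")
```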

After the enrichment is complete, you can access a high-level overview of the enrichment results by viewing the metadata enrichment asset. From there, you can drill down into and work with the results for each asset. See Working with the enrichment results.

Metadata enrichment is run on assets that are available in the project. Thus, the list of enriched assets might not correspond to the configured scope of included metadata import assets in these cases:

  • Metadata import was not yet complete when the enrichment started.
  • Metadata import failed for a set of assets or failed completely.

For information about how to update, rerun, or delete a metadata enrichment, see Managing an existing metadata enrichment.

Learn more
