Enriching your data | IBM Cloud Pak for Data as a Service

Enriching your data

Last updated: Dec 13, 2024

Enrich data assets with information that helps users to find data faster, to decide whether the data is appropriate for the task at hand, whether they can trust the data, and how to work with the data. Such information includes, for example, terms that define the meaning of the data, rules that document ownership or determine quality standards, or reviews.

Data stewards create asset profiles to understand the meaning of data and to assess its quality. They also add business context to data by assigning terms, and they identify relationships between tables. Metadata enrichment automates this process, which increases the data steward's productivity.

Data is useful only if its context, content, and quality are trusted. To keep it that way, data must be evaluated continuously and remediated when required. Data stewards can configure recurring jobs that continuously track changes to the content and structure of data and then analyze only the data that changed.
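The idea of analyzing only data that changed can be illustrated with a minimal sketch. It is not how the service detects changes internally. The `fingerprint` and `assets_to_reanalyze` helpers, the asset dictionaries, and the `content_marker` field are all hypothetical and exist only for this illustration.

```python
import hashlib
import json


def fingerprint(asset: dict) -> str:
    """Hash the structure (columns) and a content marker of one asset.

    Both fields are hypothetical stand-ins for whatever signals a real
    system would use to notice structural or content changes.
    """
    payload = json.dumps(
        {"columns": asset["columns"], "content_marker": asset["content_marker"]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assets_to_reanalyze(current: dict, previous_fingerprints: dict) -> list:
    """Return the names of assets that are new or whose fingerprint changed."""
    return sorted(
        name
        for name, asset in current.items()
        if previous_fingerprints.get(name) != fingerprint(asset)
    )


# State recorded after the previous run.
last_run = {
    "CUSTOMERS": {"columns": ["ID", "NAME"], "content_marker": "2024-12-01"},
    "ORDERS": {"columns": ["ID", "TOTAL"], "content_marker": "2024-12-01"},
}
previous = {name: fingerprint(asset) for name, asset in last_run.items()}

# State seen by the next scheduled run: ORDERS gained a column, RETURNS is new.
current = {
    "CUSTOMERS": {"columns": ["ID", "NAME"], "content_marker": "2024-12-01"},
    "ORDERS": {"columns": ["ID", "TOTAL", "CURRENCY"], "content_marker": "2024-12-10"},
    "RETURNS": {"columns": ["ID"], "content_marker": "2024-12-10"},
}

changed = assets_to_reanalyze(current, previous)
print(changed)  # ['ORDERS', 'RETURNS']
```

In this sketch, the unchanged CUSTOMERS asset is skipped, so a recurring run spends time only on assets whose structure or content differs from the previous run.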

The information that is added to assets through metadata enrichment also helps to protect data because it can be used in data protection policies to mask data or to restrict access.

Required services

IBM Knowledge Catalog
DataStage for advanced key or relationship analysis and advanced profiling

Data format

Tables from relational and nonrelational data sources

Files uploaded from the local file system or from file-based connections to the data sources, in these formats: CSV, TSV, Avro, Parquet, and Microsoft Excel (xls, xlsm, and xlsx). For Excel files uploaded from the local file system, only the first sheet in a workbook is profiled. These structured data files are not profiled:

Files within a connected folder asset. Files that are accessible from a connected folder asset are not treated as assets and are not profiled.
Files within an archive file, for example, a .zip file. The archive file is referenced by the data asset and the compressed files are not profiled.
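The format rules above could be applied as a pre-check before files are added to a project. The following sketch assumes that a file's format can be inferred from its extension, which is a simplification. The function name and the `inside_connected_folder` flag are hypothetical, and `.zip` is the only archive type listed because it is the only one the text names.

```python
from pathlib import PurePosixPath

# Extensions for the formats named above (assumption: extension implies format).
SUPPORTED_EXTENSIONS = {".csv", ".tsv", ".avro", ".parquet", ".xls", ".xlsm", ".xlsx"}
# The text gives .zip only as an example of an archive file.
ARCHIVE_EXTENSIONS = {".zip"}


def is_candidate_for_profiling(path: str, inside_connected_folder: bool = False) -> bool:
    """Return True if the file looks like one that would be profiled."""
    # Files reachable only through a connected folder asset are not profiled.
    if inside_connected_folder:
        return False
    file_path = PurePosixPath(path)
    # A file located inside an archive is not profiled.
    for parent in file_path.parts[:-1]:
        if PurePosixPath(parent).suffix.lower() in ARCHIVE_EXTENSIONS:
            return False
    return file_path.suffix.lower() in SUPPORTED_EXTENSIONS


print(is_candidate_for_profiling("sales/2024.csv"))        # True
print(is_candidate_for_profiling("exports.zip/2024.csv"))  # False
```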

You can enrich data assets from the data sources listed in Supported data sources for curation and data quality.

Data size

Any; data sets from file-based connections cannot have more than 4,999 columns
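For delimited files from file-based connections, the column limit can be checked before an enrichment is configured. This is a minimal sketch that assumes the first row of the file is the header row. The helper names are hypothetical.

```python
import csv
import io

# Limit stated above for data sets from file-based connections.
MAX_COLUMNS_FILE_BASED = 4999


def column_count(text: str, delimiter: str = ",") -> int:
    """Count the columns in the first row of delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])
    return len(header)


def within_column_limit(text: str, delimiter: str = ",") -> bool:
    """Return True if the data set has no more than 4,999 columns."""
    return column_count(text, delimiter) <= MAX_COLUMNS_FILE_BASED


print(within_column_limit("id,name,total\n1,Ann,10\n"))  # True
```

Pass `delimiter="\t"` for TSV files.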

Required permissions

To create, manage, and run a metadata enrichment, you must have the Admin or the Editor role in the project, and you must have at least view access to the categories that you want to use in the enrichment. Also, you must be authorized to access the connections to the data sources of the data assets to be enriched.

If any of these connections are locked, you are asked to enter your personal credentials. This is a one-time step that permanently unlocks the connections for you.
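The permission requirements above amount to three conditions that must all hold. The following sketch models them as a checklist. The `ProjectUser` class and `unmet_requirements` function are hypothetical and do not correspond to any product API. In the product, these checks are enforced by the service itself.

```python
from dataclasses import dataclass, field

# Project roles that may create, manage, and run a metadata enrichment.
ALLOWED_ROLES = {"Admin", "Editor"}


@dataclass
class ProjectUser:
    """Hypothetical view of one user's access in a project."""
    role: str
    viewable_categories: set = field(default_factory=set)
    accessible_connections: set = field(default_factory=set)


def unmet_requirements(user: ProjectUser, categories, connections) -> list:
    """Return the unmet requirements; an empty list means all are met."""
    problems = []
    if user.role not in ALLOWED_ROLES:
        problems.append(f"project role {user.role!r} is not Admin or Editor")
    for name in sorted(set(categories) - user.viewable_categories):
        problems.append(f"no view access to category {name!r}")
    for name in sorted(set(connections) - user.accessible_connections):
        problems.append(f"no access to connection {name!r}")
    return problems


steward = ProjectUser("Editor", {"Finance"}, {"db2-prod"})
print(unmet_requirements(steward, ["Finance"], ["db2-prod"]))  # []
```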

All operations that are run as part of a metadata enrichment require credentials for secure authorization. Typically, your user API key is used to execute such long-running operations without disruption. If credentials are not available when you create a metadata enrichment or try to run any type of enrichment, you are prompted to create an API key. That API key is then saved as your task credentials. See Managing the user API key.

You can also create, edit, run, or delete metadata enrichments with APIs instead of the user interface. The links to these APIs are listed in the Learn more section.
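When you call APIs from a script, a user API key is typically exchanged for a bearer token first. The sketch below only builds that token request and does not send it. It assumes the public IBM Cloud IAM token endpoint, which this topic does not name, so confirm the endpoint for your environment. The function name is hypothetical, and the metadata enrichment endpoints themselves are deliberately not shown.

```python
import urllib.parse
import urllib.request

# Assumption: the public IBM Cloud IAM token endpoint.
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_iam_token_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the request that exchanges an API key for a token."""
    body = urllib.parse.urlencode(
        {
            "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
            "apikey": api_key,
        }
    ).encode("ascii")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )


request = build_iam_token_request("my-api-key")  # placeholder key
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would return a JSON document that contains the token. Never hardcode a real API key in source files.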

Metadata enrichment overview

Enriching data assets involves the following process:

1. Identify the data assets that you want to enrich.
2. In a project, create a metadata enrichment asset to configure the enrichment details like the scope and the objective of the enrichment, and the schedule for the enrichment job.
3. Run the enrichment job.
4. For each data asset included in the enrichment, work with the results in the metadata enrichment asset:
   1. Identify anomalies and quality issues and take appropriate measures to remediate any issues.
   2. Review generated content such as display names or AI-generated descriptions.
   3. Check term assignments, and evaluate and act on term suggestions.
   4. Manage data class assignments at the column level.
   5. Manage classifications.
   6. Identify and set primary keys and relationships.
   7. Detect overlapping or redundant data.
   You can also access the enrichment results and work with them in the profile of each individual asset. See Asset profiles. Detailed quality information is available on an asset's Data quality tab.
5. Reevaluate the assets in question.
6. Publish the data assets with the results as required.
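A team that reviews many assets might track which of the review tasks listed above are still open for each asset. The following sketch is purely illustrative bookkeeping on the reviewer's side. The step labels paraphrase the list above, and the names `REVIEW_STEPS` and `remaining_steps` are hypothetical. It does not reflect any product feature or API.

```python
# Review tasks for one enriched asset, paraphrased from the list above.
REVIEW_STEPS = [
    "remediate anomalies and quality issues",
    "review generated content",
    "check term assignments and suggestions",
    "manage data class assignments",
    "manage classifications",
    "set primary keys and relationships",
    "detect overlapping or redundant data",
]


def remaining_steps(completed: set) -> list:
    """Return the review tasks not yet completed, in the documented order."""
    unknown = completed - set(REVIEW_STEPS)
    if unknown:
        raise ValueError(f"unknown review steps: {sorted(unknown)}")
    return [step for step in REVIEW_STEPS if step not in completed]


# Hypothetical progress for two assets.
progress = {
    "CUSTOMERS": {"review generated content", "manage classifications"},
    "ORDERS": set(REVIEW_STEPS),
}

for asset, done in progress.items():
    print(asset, len(remaining_steps(done)), "task(s) open")
```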

You can perform most tasks with APIs instead of the UI. Links to the IBM Knowledge Catalog API are listed for each applicable task.

While you can add individual connected assets to a metadata enrichment, metadata enrichment is intended for bulk processing of data assets that were added to the project through metadata import.

To ensure consistent use of enrichment options, you can configure default settings for all metadata enrichment assets in a project. To open the settings page, go to Manage > Metadata enrichment. Alternatively, you can open an existing metadata enrichment asset and click Default settings.

For workload management, running metadata enrichment jobs can be restricted to job execution windows. A project administrator can define such windows in Manage > Job execution windows.
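The effect of a job execution window can be pictured as a time check before a job starts. This sketch assumes that a window is described by a weekday, a start time, and an end time, and that a window whose end is not later than its start runs past midnight. That representation, the `ExecutionWindow` class, and the rule that an empty list means no restriction are assumptions made for illustration and are not the product's configuration format.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class ExecutionWindow:
    """Hypothetical window: starts on `weekday` (0 = Monday ... 6 = Sunday)."""
    weekday: int
    start: time
    end: time  # if end <= start, the window continues into the next day


def in_window(moment: datetime, window: ExecutionWindow) -> bool:
    """Return True if `moment` falls inside `window`."""
    now = moment.time()
    if window.start < window.end:
        return moment.weekday() == window.weekday and window.start <= now < window.end
    # Overnight window: late part on the start day, early part on the next day.
    if moment.weekday() == window.weekday and now >= window.start:
        return True
    return moment.weekday() == (window.weekday + 1) % 7 and now < window.end


def may_start(moment: datetime, windows: list) -> bool:
    """Assumption: with no windows defined, jobs are not restricted."""
    return not windows or any(in_window(moment, w) for w in windows)


# Saturday 22:00 until Sunday 04:00.
weekend_night = ExecutionWindow(weekday=5, start=time(22, 0), end=time(4, 0))

print(may_start(datetime(2024, 12, 14, 23, 0), [weekend_night]))  # True (Saturday night)
print(may_start(datetime(2024, 12, 15, 5, 0), [weekend_night]))   # False (Sunday 05:00)
```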

Learn more
