The process of curation includes creating data assets, assigning governance artifacts and other metadata to the data assets, publishing the data assets to a catalog, and then updating asset metadata as the underlying data or your business vocabulary changes. After your data stewards add high-quality, enriched data assets to catalogs, data consumers can find and use those data assets.
Although you can curate data assets individually, that process is not scalable. You can automate many curation tasks with the Metadata import and Metadata enrichment tools, which you can use to discover, create, enrich, and publish sets of data assets.
To automate data curation as much as possible, complete these tasks to set up a curation project, add curated data assets to a catalog, and update the data assets to keep metadata current:
Task | Mandatory? | Frequency |
---|---|---|
Set up a project | Yes | One-time |
Add connections to data sources | Yes | One-time |
Import metadata to create data assets | Yes | Recurring |
Enrich data assets with metadata and other information | Yes | Recurring |
Resolve entity data to create a 360-degree view of your data | No | Recurring |
Customize data quality analysis | No | Recurring |
Publish data assets to catalogs | Yes | Recurring |
When you create metadata import and metadata enrichment assets, you can schedule them to run automatically or run them on demand. You can set up job schedules in the UI or with APIs. For example, you can schedule a metadata import for a specific time and date. Then, you can schedule metadata enrichment for the same assets to run after the metadata import is complete. After metadata enrichment is complete, review the results, make the necessary adjustments, and then publish the updated data assets to the catalog.
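For example, the following Python sketch runs a metadata import job and then starts a metadata enrichment job for the same assets after the import completes. It assumes the Watson Data API jobs endpoints and placeholder IDs; the response field paths are assumptions, so verify them against the API reference for your release.

```python
import time
import requests

# Assumed values for illustration; replace with your own.
HOST = "https://api.dataplatform.cloud.ibm.com"
TOKEN = "<bearer-token>"                    # platform access token
PROJECT_ID = "<project-id>"
IMPORT_JOB_ID = "<metadata-import-job-id>"
ENRICH_JOB_ID = "<metadata-enrichment-job-id>"

headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

def start_job_run(job_id):
    """Start a job run; assumes the /v2/jobs/{job_id}/runs endpoint."""
    resp = requests.post(
        f"{HOST}/v2/jobs/{job_id}/runs",
        params={"project_id": PROJECT_ID},
        headers=headers,
        json={"job_run": {}},
    )
    resp.raise_for_status()
    return resp.json()["metadata"]["asset_id"]  # run ID (assumed response shape)

def wait_for_run(job_id, run_id, poll_seconds=30):
    """Poll the job run until it reaches a terminal state."""
    while True:
        resp = requests.get(
            f"{HOST}/v2/jobs/{job_id}/runs/{run_id}",
            params={"project_id": PROJECT_ID},
            headers=headers,
        )
        resp.raise_for_status()
        state = resp.json()["entity"]["job_run"]["state"]
        if state in ("Completed", "Failed", "Canceled"):
            return state
        time.sleep(poll_seconds)

# Run the metadata import first, then enrich the same assets.
run_id = start_job_run(IMPORT_JOB_ID)
if wait_for_run(IMPORT_JOB_ID, run_id) == "Completed":
    start_job_run(ENRICH_JOB_ID)
```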
Set up a project for curation
A project is a collaborative workspace where people work with data to fulfill a shared goal.
To improve consistency, you can create conventions for projects, such as:
- Project names: Identify projects in a consistent way, for example, by purpose, date range, or team.
- Project requirements: In the project readme file, describe requirements and link to tasks in external systems.
- Connection names: Identify connections in a consistent way, for example, by data source, table name, or purpose.
A data curation project typically contains the following types of items that are either explicitly added by data stewards or are created as the result of a process:
- Connection assets for the data sources that contain the data to curate
- Connected data assets that are created by metadata import
- Metadata import assets
- Metadata enrichment assets
- Data quality definition and rule assets
- DataStage flow assets that are created by running data quality rules
- Data assets that contain data quality rule output tables
- Data assets that contain frequency distribution tables that are created by metadata enrichment
- Jobs that are created by running assets
Learn more about creating projects
Add connections to data sources
Before your data stewards can import metadata to create connected data assets, they need the connection assets for the relevant data sources. Data sources can include databases, such as Db2, or file systems, such as IBM Cloud Object Storage.
Typically, organizations add connections to the Platform assets catalog so that all users can find and use them. For example, your data engineers can create the connection assets in the Platform assets catalog, and then all users can easily add those connections to their projects. Alternatively, you can create connections within a project.
When you create connections, you must decide how to handle connection credentials. By default, connection credentials are marked as shared, which allows all users to use the same credentials to access the data. If you want each user to enter their personal credentials, disable shared credentials when you create connections. However, if your connections require personal credentials, you must ensure that your data stewards have credentials for all the connections that they need for curation.
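As a minimal sketch, the following Python snippet creates a connection asset in the Platform assets catalog with personal credentials. It assumes the Watson Data API /v2/connections endpoint; the `flags` field, the datasource type ID, and the property names are assumptions for illustration, so check the exact schema in the API reference.

```python
import requests

# Assumed values for illustration; replace with your own.
HOST = "https://api.dataplatform.cloud.ibm.com"
TOKEN = "<bearer-token>"
CATALOG_ID = "<platform-assets-catalog-id>"

headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# A Db2 connection that requires each user to supply personal credentials.
connection = {
    "name": "sales-db2-prod",               # follows a source-purpose naming convention
    "datasource_type": "<db2-datasource-type-id>",  # assumed ID; look it up for your release
    "flags": ["personal_credentials"],       # assumed flag; omit for shared credentials
    "properties": {
        "host": "db2.example.com",
        "port": "50000",
        "database": "SALES",
    },
}

resp = requests.post(
    f"{HOST}/v2/connections",
    params={"catalog_id": CATALOG_ID},
    headers=headers,
    json=connection,
)
resp.raise_for_status()
print("Created connection:", resp.json()["metadata"]["asset_id"])
```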
Cloud Pak for Data includes many connection types, but not all of them are supported for metadata import, metadata enrichment, and data quality analysis.
Learn more about adding connections
Import metadata to create data assets
Metadata import detects all the tables or files that are accessible from a specified connection to a data source. You can choose to create connected data assets for all or a selection of the tables or files. The metadata import process also creates a metadata import asset that you can rerun or specify as an input for metadata enrichment.
Typically, organizations create multiple metadata import assets for a single data source. Each metadata import contains tables or files that have a similar frequency of changes to structure, schema, or data rows. You can then run each metadata import on a different schedule. For example, you might create metadata imports with the following characteristics, as sketched in code after the list:
- A metadata import for tables that have frequent updates that you schedule to run weekly.
- A metadata import for tables with infrequent updates that you schedule to run monthly.
- A metadata import for tables with rare updates that you manually run when necessary.
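For illustration, such a grouping might look like the following sketch. The import names are hypothetical; in practice, you attach each schedule to the corresponding job when you create or edit it in the UI or with the jobs API.

```python
# Hypothetical grouping of metadata imports by how often the source data changes.
# Schedules use standard 5-field cron syntax:
# minute hour day-of-month month day-of-week
import_schedules = {
    "mdi-sales-frequent":  "0 2 * * 1",   # weekly: Mondays at 02:00
    "mdi-finance-monthly": "0 3 1 * *",   # monthly: 1st of the month at 03:00
    "mdi-archive-rare":    None,          # no schedule: run manually when needed
}
```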
Rerun metadata import to detect the following types of changes in the data source:
- Assets that are added or removed
- Table schemas that are altered
- Updates to asset metadata, such as name changes or updated descriptions
After you rerun metadata import, rerun metadata enrichment.
Learn more about importing metadata
Enrich data assets with metadata and other information
Metadata enrichment adds information to your connected data assets. You can easily run metadata enrichment on all the tables or files that you created with metadata import by setting the metadata import as the data scope. The metadata enrichment process also creates a metadata enrichment job that you can rerun.
Typically, organizations create a metadata enrichment for each metadata import. You can then easily synchronize the schedules of metadata import and metadata enrichment. However, you can also create a metadata enrichment for a single connected data asset, such as a virtualized table.
When you run metadata enrichment on data assets, information is added based on the enrichment options that you select:
- Profiling only: Adds data classes and statistics, and suggests primary keys.
- Metadata expansion: Generates display names and descriptions.
- Quality analysis and profiling: Adds quality scores, data classes, and statistics.
- Term assignment: Assigns terms and classifications based on the selected methods. Term assignment that is based on relationships with data classes requires profiling. Term assignment that is based on generative AI also requires metadata expansion. Regardless of the method, terms can also be assigned by a machine learning algorithm and by name matching. The option dependencies are sketched in code after this list.
- Relationship generation: Identifies primary and foreign keys and suggests relationships between assets.
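The dependencies among these options can be summed up in a small check. This sketch is illustrative only; the option and method names are hypothetical and do not mirror the product API.

```python
def validate_enrichment_options(opts: dict) -> list[str]:
    """Check the dependencies among enrichment options described above.

    `opts` uses hypothetical keys: profiling, metadata_expansion,
    and term_assignment_methods (a set of method names).
    """
    problems = []
    methods = opts.get("term_assignment_methods", set())
    # Term assignment from data-class relationships needs profiling results.
    if "data_class_relationships" in methods and not opts.get("profiling"):
        problems.append("Data-class-based term assignment requires profiling.")
    # Generative AI term assignment also needs expanded metadata.
    if "generative_ai" in methods and not opts.get("metadata_expansion"):
        problems.append("Gen AI term assignment requires metadata expansion.")
    return problems

# Example: gen AI term assignment without metadata expansion is flagged.
print(validate_enrichment_options({
    "profiling": True,
    "metadata_expansion": False,
    "term_assignment_methods": {"generative_ai", "name_matching"},
}))
```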
You can balance accuracy against speed by setting the sampling size of the data. The larger the sample, the more accurate the data class and business term assignments and the data quality analysis, but the longer the metadata enrichment job runs.
Although data classes and business terms can be assigned automatically, you must review the results. Accurate assignments of data classes and business terms are critical. Otherwise, sensitive information might not be masked or protected by data protection rules. The more you run metadata enrichment and adjust the data class and business term assignments, the more accurate the automatic assignment algorithm becomes.
Rerun metadata enrichment and the standard data quality analysis in these circumstances:
- After you rerun metadata import. Depending on how many changes to the data you expect, rerun metadata enrichment on the entire data scope of the import, or only on new or changed data, for example, to pick up new tables or columns. Changes to the data values in a column might affect data quality scores or the data class and business term assignments.
- After changes to the available data classes and business terms. Changes to data classes and business terms might affect their assignments to columns.
Metadata enrichment jobs can take significant amounts of time, depending on the size of your data. They also consume compute resources that are billed to your account.
Learn more about enriching metadata
Resolve entity data to create a 360-degree view of your data
To ensure that your users and systems have a total, trusted, and unified view of your customer data, use IBM Match 360 to match and consolidate data from disparate sources and establish a 360-degree view of your data, known as master data.
Define the data model for your master data, then load data assets from across your enterprise and map them to your model. Next, start configuring the system to meet your organization's unique requirements. Configure the matching algorithm and run it to create master data entities. Review the provided statistics and graphs to evaluate the match results. Depending on your results, you can further tune the algorithm and improve your matching results by completing pair reviews or changing matching weights and thresholds.
When you have perfected your matching algorithm, business users can search and explore your master data to obtain key insights. Data stewards can edit, maintain, and remediate the data, and then export it as connected data or in CSV format for use elsewhere.
Learn more about resolving entity data
Customize data quality analysis
To customize your data quality analysis, you create and run data quality rules. Each data quality rule applies to the data assets from a single data source or to a single data asset from a file. You run your data quality rules as DataStage flows, which requires the DataStage service. With DataStage, you can run data quality rules in the supported regions. With DataStage as a Service Anywhere, you can run data quality rules outside of IBM Cloud by using remote engines. For more information about setting up remote engines, see the DataStage as a Service Anywhere documentation.
The format and the way that you define data quality rule conditions depend on the type of results that you want to receive.
Results | Format | Method |
---|---|---|
Returns the degree to which columns comply with rule conditions. | Data quality definitions | You create data quality definition assets that you reference in one or more data quality rules. You specify the rule logic by arranging block elements on a canvas or by entering an expression in a free form editor. |
Returns columns that fail rule conditions. | SQL statements | You enter SQL statements in each data quality rule. |
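To illustrate the difference between the two result types, the following standalone sketch evaluates one rule condition (a non-empty email column) both ways against a small in-memory table. It demonstrates the semantics only and is not product code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, ""), (3, None), (4, "d@example.com")],
)

# Style 1 (data quality definition): the degree to which the column
# complies with the condition "email is present and not empty".
total, passing = conn.execute(
    """SELECT COUNT(*),
              SUM(CASE WHEN email IS NOT NULL AND email <> '' THEN 1 ELSE 0 END)
       FROM customers"""
).fetchone()
print(f"Compliance: {passing}/{total} = {passing / total:.0%}")  # 2/4 = 50%

# Style 2 (SQL statement): return the rows that fail the condition.
failing = conn.execute(
    "SELECT id, email FROM customers WHERE email IS NULL OR email = ''"
).fetchall()
print("Failing rows:", failing)  # [(2, ''), (3, None)]
```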
If you create data quality rules that contain data quality definitions, you have the following options:
- Reuse the same data quality definition multiple times in a data quality rule.
- Include multiple data quality definitions in a data quality rule.
- Publish data quality definitions to a catalog and reuse them in multiple projects.
- Create simple rules that bind data directly and optionally create joins for bindings.
- Create complex rules where data is preprocessed in DataStage flows and output can be routed to DataStage output links.
- Create joins for bindings to use data from multiple tables in the output table.
- Create parameter sets in a project for managing the literal values and columns that you bind to rule variables. You can also publish the parameter set to a catalog and reuse it in multiple projects.
- Set the maximum number of records to evaluate and the sampling method.
You can choose to send the data quality rule output to an external database to maintain a detailed record of the rule results. For example, you might want to run reports or send the information to a data management team for quality remediation.
Learn more about data quality analysis
Publish data assets to a catalog
You can publish multiple enriched data assets to a catalog in one operation from within the metadata enrichment asset or from the Assets tab in the project.
The main differences between publishing from the Assets tab and from a metadata enrichment asset are in the handling of duplicate assets. The following table compares the choices that you have and their effects.
Publishing method | Bulk publishing? | Duplicate handling choices | Business term assignments |
---|---|---|---|
Assets tab | Yes, you can select multiple assets to publish together. | • Update original assets • Overwrite original assets • Allow duplicates (if the catalog settings include this option) • Preserve original assets and reject duplicates | Original business term assignments can be removed. |
Metadata enrichment asset | Yes, you can select multiple assets to publish together. | Update original assets | Business terms from the new asset are added to the original asset. No original business term assignments are removed. |
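If you publish with the API rather than the UI, a call per asset might look like the following sketch. The /v2/assets/{asset_id}/publish endpoint and the body shape are assumptions for illustration; verify them, and the available duplicate handling options, against the API reference for your release.

```python
import requests

# Assumed values for illustration; replace with your own.
HOST = "https://api.dataplatform.cloud.ibm.com"
TOKEN = "<bearer-token>"
PROJECT_ID = "<project-id>"
CATALOG_ID = "<catalog-id>"
ASSET_IDS = ["<asset-id-1>", "<asset-id-2>"]  # enriched data assets to publish

headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

for asset_id in ASSET_IDS:
    # Endpoint path and body fields are assumed; duplicate handling
    # behavior depends on the catalog settings.
    resp = requests.post(
        f"{HOST}/v2/assets/{asset_id}/publish",
        params={"project_id": PROJECT_ID},
        headers=headers,
        json={"catalog_id": CATALOG_ID},
    )
    resp.raise_for_status()
    print(f"Published {asset_id}")
```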
Learn more about publishing to a catalog
Parent topic: Planning to implement data governance