Data governance tutorial: Curate high quality data
Last updated: Nov 27, 2024

Take this tutorial to learn how to prepare trusted data with the Data governance use case of the data fabric trial. Your goal is to create trusted data assets by enriching your data and running data quality analysis.

Quick start: If you did not already create the sample project for this tutorial, access the Data governance sample project in the Resource hub.

The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data. As a Data Steward on the governance team, you must sort and organize the company's data to provide high-quality and protected data assets that data consumers can easily find in a self-service catalog.

The following animated image previews what you’ll accomplish by the end of this tutorial: you will import metadata from an external data source, enrich that data with auto-assigned business terms, view the enriched data, and publish the enriched data to a catalog. Click the image to view a larger image.

Animated image

Preview the tutorial

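In this tutorial, you will complete these tasks:

  • Set up the prerequisites
  • Task 1: Create a catalog
  • Task 2: Create a category
  • Task 3: Add business terms
  • Task 4: Import data to a project
  • Task 5: Enrich the imported data
  • Task 6: View the results of the metadata enrichment
  • Task 7: Publish data to a catalog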

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method to learn the concepts and tasks in this documentation.





Tips for completing this tutorial
Here are some tips for successfully completing this tutorial.

Use the video picture-in-picture

Tip: Start the video. As you scroll through the tutorial, the video moves to picture-in-picture mode so that you can follow the video while you complete the tasks. For the best experience with picture-in-picture, close the video table of contents. Click the timestamps for each task to follow along.

The following animated image shows how to use the video picture-in-picture and table of contents features:

How to use picture-in-picture and chapters

Get help in the community

If you need help with this tutorial, you can ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Set up your browser windows

For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Tip: If you encounter a guided tour while completing this tutorial in the user interface, click Maybe later.



Set up the prerequisites

Sign up for Cloud Pak for Data as a Service

You must sign up for Cloud Pak for Data as a Service and provision the necessary services for the Data governance use case.

  • If you have an existing Cloud Pak for Data as a Service account, then you can get started with this tutorial. If you have a Lite plan account, only one user per account can run this tutorial.
  • If you don't have a Cloud Pak for Data as a Service account yet, then sign up for a data fabric trial.

Watch the following video to learn about data fabric in Cloud Pak for Data.


Verify the necessary provisioned services

To preview this task, watch the video beginning at 01:05.

Follow these steps to verify or provision the necessary services:

  1. From the Navigation menu, choose Services > Service instances.

  2. Use the Product drop-down list to determine whether an IBM Knowledge Catalog service instance exists.

  3. If you need to create an IBM Knowledge Catalog service instance, click Add service.

    1. Select IBM Knowledge Catalog.

    2. Select the Lite plan.

    3. Click Create.

  4. Repeat these steps to verify or provision the Cloud Object Storage service.

Check your progress

The following image shows the provisioned service instances:

Provisioned services

Create the sample project

To preview this task, watch the video beginning at 01:38.

If you did not already create the sample project for this tutorial, follow these steps:

  1. Access the Data governance sample project in the Resource hub.

  2. Click Create project.

  3. If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.

  4. Click Create.

  5. Wait for the project import to complete, and then click View new project to verify that the project and assets were created successfully.

  6. Click the Assets tab to view the project's assets.

  7. From the Overflow menu at the end of the Banking.csv data asset row, choose Download, and save the file to your computer. You'll use that file in a later step.

Note: You might see a guided tour showing the tutorials that are included with this use case. The links in the guided tour will open these tutorial instructions.

Check your progress

The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.

Sample project




Task 1: Create a catalog

To preview this task, watch the video beginning at 02:49.

Before you start working with data, create a catalog where you will publish data to share it with your organization. With the IBM Knowledge Catalog Lite plan, you can create only two catalogs. If you already have a catalog, you can skip this step. Otherwise, follow these steps to create a catalog:

Tip: If this is your first time accessing catalogs, you might see a guided tour asking whether you want to take a tour of catalogs. For now, click Maybe later.

  1. From the Navigation menu, choose Catalogs > View all catalogs.

  2. If you see a catalog on the Catalogs page, then skip to Task 2: Create a category. Otherwise, follow these steps to create a new catalog:

  3. Click Create Catalog.

  4. For the Name, copy and paste the catalog name exactly as shown with no leading or trailing spaces:

    Mortgage Approval Catalog
    
  5. Select Enforce data protection rules, confirm the selection, and accept the defaults for the other fields.

  6. Click Create.
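
Step 4 asks for the catalog name exactly as shown. As a small illustration of what "exactly" rules out, the following Python sketch compares an entered name against the expected one and reports stray whitespace or differing text. The helper `check_catalog_name` is hypothetical and is not part of Cloud Pak for Data.

```python
EXPECTED_CATALOG_NAME = "Mortgage Approval Catalog"


def check_catalog_name(entered):
    """Return a list of problems with the entered name; an empty list means an exact match."""
    problems = []
    # Leading or trailing spaces are easy to pick up when copying text.
    if entered != entered.strip():
        problems.append("leading or trailing whitespace")
    # The comparison is case-sensitive and character-for-character.
    if entered.strip() != EXPECTED_CATALOG_NAME:
        problems.append("name text differs")
    return problems


print(check_catalog_name("Mortgage Approval Catalog "))
```

An empty list from `check_catalog_name` means the pasted text matches the name that later tasks in this tutorial refer to.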

Check your progress

The following image shows your catalog. You are now ready to share assets with your organization.

Mortgage Approval Catalog




Task 2: Create a category

To preview this task, watch the video beginning at 03:13.

You need a category to contain the business terms that you’ll import in the next task. Categories act like folders to organize your governance artifacts and the people who can author and manage those artifacts. Follow these steps to create a category:

  1. From the Cloud Pak for Data navigation menu, choose Governance > Categories.

  2. Click Add category > New category.

  3. For the name, type Banking.

  4. Click Create.

Check your progress

The following image shows the Banking category. You are now ready to import business terms.

Banking category




Task 3: Add business terms

To preview this task, watch the video beginning at 03:41.

Now import business terms into the new category. You’ll use them to enrich your data assets in a later step. Business terms are standardized definitions of business concepts so that your data is described in a uniform and easily understood way across your enterprise. Follow these steps to import the business terms from a file:

  1. From the Cloud Pak for Data navigation menu, choose Governance > Business terms.

  2. Click Add business term > Import from file.

  3. Click Drag and drop file here or upload.

    1. Select the Banking.csv file that you downloaded earlier.

    2. Click Open.

  4. Click Next.

  5. Select Replace all values, and click Next.

  6. Click Go to task to see the draft business terms. If you miss the notification, then from the Cloud Pak for Data navigation menu, choose Governance > Task inbox.

  7. Select the Publish business terms checkbox, and then click Publish. Click Publish to confirm.

  8. From the Cloud Pak for Data navigation menu, choose Governance > Business terms to view the published business terms.
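
Before you import, it can help to confirm that the downloaded file is a well-formed CSV. The following Python sketch is not part of the product; it only reads the header row and counts the data rows. The function name `summarize_csv` and the column names in the sample are hypothetical, because this tutorial does not list the columns in the file.

```python
import csv
import io


def summarize_csv(stream):
    """Return (header, row_count) for a CSV text stream.

    Blank lines are ignored; the first non-blank row is treated as the header.
    """
    reader = csv.reader(stream)
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return [], 0
    return rows[0], len(rows) - 1


# Hypothetical sample; the columns in the real file may differ.
sample = io.StringIO("Name,Description\nAddress,Postal address\nCity,City name\n")
header, count = summarize_csv(sample)
print(header, count)
```

To check the real file, pass an open text file instead of the sample, for example a file object opened with `newline=""` and `encoding="utf-8"` as the `csv` module documentation recommends.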

Check your progress

The following image shows the imported business terms. You are now ready to import the data to a project and then enrich with the imported business terms.

Imported business terms




Task 4: Import data to a project

To preview this task, watch the video beginning at 04:47.

The sample project includes a connection to a Db2 Warehouse instance, which contains the mortgage assets. You can import technical metadata that is associated with the data assets into a project or a catalog to inventory, evaluate, and catalog these assets. Technical metadata describes the structure of data objects. Follow these steps to import the data assets:

  1. From the Navigation menu, choose Projects > View all projects.

  2. Click the Data governance project.

  3. Click the Assets tab.

  4. Click New asset > Import metadata for data assets.

  5. For the name, copy and paste the following text:

    Mortgage data - metadata import
    
  6. Click Next to continue.

  7. On the Select target page, select This project, and click Next to continue.

  8. On the Select scope page, click Select connection.

    1. Select the Data Fabric Trial - Db2 Warehouse connection.

    2. Select the checkbox next to the WKC_MORTGAGE schema, then click the WKC_MORTGAGE schema name.

    3. Select the following tables:

      • COMMERCIAL_CLIENT
      • CREDIT_SCORE
      • HOUSE_PRICE
      • MORTGAGE_APPLICANTS
      • MORTGAGE_APPLICATION
    4. Review the list of assets in the side panel, and then click Select.

  9. Click Next to continue to the schedule. You can manually run the metadata import, so keep the schedule turned off.

  10. Click Next to continue to the Advanced options.

  11. Accept the default values on the Advanced options page, and click Next to continue to the review.

  12. Review the summary of the import, and click Create. The metadata import job starts.

  13. Click Refresh to watch the status change from Queued to In progress to Imported. When the job run is complete, you see the five assets listed.
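
The selections in this task amount to a small scope definition: one connection, one schema, and five tables. The following Python sketch records that scope as plain data and checks that a sequence of observed job statuses follows the order described in step 13. The names `IMPORT_SCOPE`, `STATUS_ORDER`, and `is_valid_progression` are illustrative only and do not correspond to any product API.

```python
# Scope chosen in the steps above, recorded as plain data (not a product API).
IMPORT_SCOPE = {
    "name": "Mortgage data - metadata import",
    "connection": "Data Fabric Trial - Db2 Warehouse",
    "schema": "WKC_MORTGAGE",
    "tables": [
        "COMMERCIAL_CLIENT",
        "CREDIT_SCORE",
        "HOUSE_PRICE",
        "MORTGAGE_APPLICANTS",
        "MORTGAGE_APPLICATION",
    ],
}

# Status order that the tutorial describes for the import job.
STATUS_ORDER = ["Queued", "In progress", "Imported"]


def is_valid_progression(observed):
    """Return True if the observed statuses never move backward in STATUS_ORDER.

    Repeated statuses (for example, several refreshes while "In progress")
    and skipped statuses are allowed; unknown statuses are rejected.
    """
    last = -1
    for status in observed:
        if status not in STATUS_ORDER:
            return False
        position = STATUS_ORDER.index(status)
        if position < last:
            return False
        last = position
    return True


print(len(IMPORT_SCOPE["tables"]))
print(is_valid_progression(["Queued", "In progress", "In progress", "Imported"]))
```

Writing the scope down this way makes it easy to compare against the five assets that appear when the job run is complete.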

Check your progress

The following image shows the completed metadata import. Your next task is to enrich the imported data assets with the imported business terms.

Metadata import asset




Task 5: Enrich the imported data

To preview this task, watch the video beginning at 06:07.

You can enrich data assets with information that helps users find data faster and decide whether the data is appropriate for the task at hand, whether they can trust the data, and how to work with the data. Such information includes, for example, terms that define the meaning of the data, rules that document ownership or determine quality standards, or reviews. Follow these steps to enrich the imported data:

  1. Click the Data governance project name in the navigation trail.
    Navigation trail

  2. On the Assets tab, click New asset > Enrich data assets with metadata.

  3. For the name, copy and paste the following text:

    Mortgage data - metadata enrichment
    
  4. Click Next to continue.

  5. Click Select data from project.

    1. Select Metadata import.

    2. Click the checkbox next to Mortgage data - metadata import. This asset includes the following assets:

      • COMMERCIAL_CLIENT
      • CREDIT_SCORE
      • HOUSE_PRICE
      • MORTGAGE_APPLICANTS
      • MORTGAGE_APPLICATION
    3. Click Select.

  6. Click Next to continue to the enrichment objective.

  7. Select all enrichment objectives:

    • Profile data
    • Assign terms
    • Run basic quality analysis
  8. For Categories, click Select categories.

    1. Select only [uncategorized] and Banking.

    2. Click Select.

  9. For the Sampling, select Basic.

  10. Click Next to continue to the schedule. You can manually run the enrichment, so keep the schedule turned off.

  11. Click Next to continue to the review.

  12. Click Create.

  13. The metadata enrichment asset displays, but the job might take several minutes to complete. Click Refresh to watch the status change from Not analyzed to In progress to Finished. When the job run is complete, you see the five assets listed.

Check your progress

The following image shows the completed metadata enrichment. Now you can explore the enriched data assets.

Metadata enrichment asset




Task 6: View the results of the metadata enrichment

To preview this task, watch the video beginning at 07:45.

After the metadata enrichment run is completed, follow these steps to view the enriched data:

  1. From the Mortgage data - metadata enrichment screen, click the Columns tab.

  2. In the list of Columns, locate the EMAIL_ADDRESS column for the MORTGAGE_APPLICANTS asset.

    1. At the end of the EMAIL_ADDRESS for MORTGAGE_APPLICANTS row, click the Overflow menu, and choose View column details.

    2. In the side panel on the Details tab, you see profiling information such as: Format, Frequency distribution, Statistics.

    3. In the side panel, click the Governance tab. This tab includes the data classes and business terms that were auto-assigned during the metadata enrichment. You might also see suggested business terms and data classes that you can assign manually.

    4. Review any suggested business terms or data classes and manually assign them. For example, you might see Address as a suggested business term.

      1. Click Suggested business terms.

      2. For Address, click Assign.

  3. At the end of the EMAIL_ADDRESS column for the MORTGAGE_APPLICANTS asset row, click the Overflow menu, and choose View data quality details.

    1. View the data quality information. IBM Knowledge Catalog automatically generates a data quality score for each column and data asset by analyzing every value in every record according to pre-built dimensions.

    2. Click the X to close the Data quality window.

  4. For the CITY column for the CREDIT_SCORE asset, click the Overflow menu, and choose Mark as reviewed.

  5. Click the Assets tab.

  6. In the list of Assets, for the MORTGAGE_APPLICANTS asset, click the Overflow menu, and choose View asset details.

    1. In the side panel, click the Governance tab to see business term auto assignment.

    2. Click Edit to manually assign business terms.

    3. Search for social. If you don't see any results, then make sure that the drop-down list is set to All terms instead of Suggested terms.

    4. Select Social Security Number.

    5. Click Assign.
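
To make the profiling terms on the Details tab more concrete, here is a minimal Python sketch that computes a format pattern, a frequency distribution, and a few statistics for a list of column values. It is an illustration under stated assumptions, not how IBM Knowledge Catalog computes profiles or quality scores: the sample values, the pattern scheme in `format_of`, and the `profile_column` function are all hypothetical.

```python
from collections import Counter


def format_of(value):
    """Map a value to a coarse pattern: letters -> 'A', digits -> '9', others kept."""
    pattern = []
    for ch in value:
        if ch.isalpha():
            pattern.append("A")
        elif ch.isdigit():
            pattern.append("9")
        else:
            pattern.append(ch)
    return "".join(pattern)


def profile_column(values):
    """Return simple profiling information for a list of string values.

    None and empty strings count as missing.
    """
    present = [v for v in values if v]
    lengths = [len(v) for v in present]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "frequency": Counter(present),
        "formats": Counter(format_of(v) for v in present),
    }


# Hypothetical values; not taken from the MORTGAGE_APPLICANTS table.
emails = ["ann@example.com", "bob@example.com", "ann@example.com", "", None]
profile = profile_column(emails)
print(profile["missing"], profile["distinct"], profile["frequency"].most_common(1))
```

In this sketch, a column whose values share one dominant format and have few missing entries would look consistent, which is the kind of signal that profiling surfaces for review.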

Check your progress

The following image shows the reviewed and enriched data assets. The next step is to publish the enriched data to a catalog to share with your organization.

Reviewed enriched data assets




Task 7: Publish data to a catalog

To preview this task, watch the video beginning at 09:06.

Now that you have enriched data, you want to publish those data assets to a catalog so data scientists and data analysts can use the enriched data assets. Follow these steps to store the enriched data assets in a catalog for others to have access to the trusted data:

  1. Click the Data governance project name in the navigation trail.

  2. Click the Assets tab.

  3. Select Data > Data assets.

  4. Select the COMMERCIAL_CLIENT, HOUSE_PRICE, MORTGAGE_APPLICANTS, and MORTGAGE_APPLICATION data assets from the list, and click Publish to catalog.

    1. For the Target catalog, select Mortgage Approval Catalog, and click Next.

    2. For the Tags, type the tag trusted, click + (plus sign), and then click Next.

    3. Review the assets, and click Publish.

  5. Clear all checked assets, then select the checkbox next to the CREDIT_SCORE asset from the list, and click Publish to catalog.

    1. For the Target catalog, select Mortgage Approval Catalog, and click Next.

    2. For the Tags, type the tag confidential, and click + (plus sign).

    3. Type the tag trusted, and click + (plus sign) to add a second tag.

    4. Select the option to Go to the catalog after publishing it, and click Next.

    5. Review the assets, and click Publish.

  6. Filter the assets in the Mortgage Approval Catalog.

    1. Click Filter.

    2. Expand the Tag section.

    3. Select trusted, and click Apply.

    4. Verify that the five data assets were added to the catalog.

  7. Change the name for the MORTGAGE_APPLICANTS data asset.

    1. Open the MORTGAGE_APPLICANTS asset.

    2. Click Edit name.

    3. Change the name to:

      MORTGAGE_APPLICANTS_TRUST
      
    4. Click Apply.
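
The tag filter in step 6 can be pictured as a simple membership test. The following Python sketch records the tags applied in steps 4 and 5 as plain data and filters by tag. The names `CATALOG_ASSETS` and `filter_by_tag` are hypothetical, the asset names reflect the state before the rename in step 7, and the sketch does not call any catalog API.

```python
# Tags applied in steps 4 and 5 of this task, recorded as plain data.
CATALOG_ASSETS = {
    "COMMERCIAL_CLIENT": {"trusted"},
    "HOUSE_PRICE": {"trusted"},
    "MORTGAGE_APPLICANTS": {"trusted"},
    "MORTGAGE_APPLICATION": {"trusted"},
    "CREDIT_SCORE": {"confidential", "trusted"},
}


def filter_by_tag(assets, tag):
    """Return the sorted names of assets that carry the given tag."""
    return sorted(name for name, tags in assets.items() if tag in tags)


print(filter_by_tag(CATALOG_ASSETS, "trusted"))
print(filter_by_tag(CATALOG_ASSETS, "confidential"))
```

Filtering on trusted returns all five assets, which matches the verification in step 6, while filtering on confidential returns only CREDIT_SCORE.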

Check your progress

The following image shows the enriched data assets published to a catalog. Now you have trusted data available through your company's catalog.

Published assets to the catalog



As a Data Steward on the governance team, you learned how to sort and organize the company's data to provide high-quality and protected data assets that data consumers can easily find in a self-service catalog.

Next steps

You are now ready to protect your data by creating data protection rules and masking flows to control access to your data. See the Protect your data tutorial.
