Data governance tutorial: Consume your data

Take this tutorial to work with your high-quality, protected data after you complete the Curate high quality data and Protect your data tutorials in the Data governance use case of the data fabric trial. Your goal is to evaluate, share, shape, and analyze data in the data fabric.

Quick start: If you did not already create the sample project for this tutorial, access the Data governance sample project in the Resource hub.

The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data. As a Data Analyst, you will need to search for and find the right data, understand and trust its content, and then prepare it for other data analysts and data scientists to use.

The following animated image provides a quick preview of what you'll accomplish by the end of this tutorial: viewing catalog assets, manually enriching assets and creating relationships, visualizing data, and filtering data to improve quality. Click the image to view a larger image.

Animated image

Preview the tutorial

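In this tutorial, you will complete these tasks:

  • Set up the prerequisites
  • Task 1: Understand data assets
  • Task 2: Enrich assets and create relationships
  • Task 3: Add enriched data to a project
  • Task 4: Visualize the data
  • Task 5: Prepare the data for analytics and AI
  • Cleanup (Optional)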

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method to learn the concepts and tasks in this documentation.





Tips for completing this tutorial

Use the video picture-in-picture

Tip: Start the video. As you scroll through the tutorial, the video moves to picture-in-picture mode so that you can follow along while you complete the tasks. For the best experience with picture-in-picture, close the video table of contents. Click the timestamps for each task to follow along.

The following animated image shows how to use the video picture-in-picture and table of contents features:

How to use picture-in-picture and chapters

Get help in the community

If you need help with this tutorial, you can ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Set up your browser windows

For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Tip: If you encounter a guided tour while completing this tutorial in the user interface, click Maybe later.



Set up the prerequisites

Complete prerequisite tutorials

To preview this task, watch the video beginning at 00:39.

Complete the Curate high quality data and Protect your data tutorials before you start this tutorial.




Task 1: Understand data assets

To preview this task, watch the video beginning at 01:12.

Data assets in catalogs are much more than pointers to data. They contain information about the format and meaning of the data and statistics about the data values. Follow these steps to understand the value of data assets:

  1. From the Cloud Pak for Data navigation menu, choose Catalogs > View all catalogs.

  2. Open the Mortgage Approval Catalog.

  3. Review the featured assets section. It shows Recently added assets, assets that Watson recommends (suggestions from AI and machine learning that are based on your past usage and on popularity), and Highly rated assets that catalog collaborators rated and reviewed.

  4. Click Hide featured assets to close that section.

  5. Search for mortgage.

  6. Click MORTGAGE_APPLICANTS_TRUST to view that catalog asset. The Overview tab and the side panel provide basic information about the asset such as the description, a rating, tags, where the asset is located, business terms, data classes, and related items.

  7. Click the Profile tab. The profile information helps you understand the content, the quality, and usability of the data.

  8. Scroll to the right to locate the ZIP_CODE column.

  9. The data class that was automatically assigned to the ZIP_CODE column is Commercial and Government Entity. Note that the automatically assigned data class may vary. Since the values are zip codes, you can easily reclassify this column. Click the drop-down list to see other possible data classes and their confidence levels. Select US Zip Code.

  10. Click the Asset tab to see a preview of the data.

  11. Return to the Overview tab to see more metadata about the columns. In the list of columns, search for the EMPLOYMENT_STATUS column to see the metadata including the assigned business terms.
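
The profile statistics and data class confidence levels that you viewed in these steps can be illustrated with a small sketch. The code below is not how IBM Knowledge Catalog profiles or classifies data. It assumes a hypothetical rule in which the confidence for a data class is the fraction of non-empty values that match a pattern. The names `US_ZIP`, `profile_column`, and `match_confidence`, and the sample values, are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern for a US ZIP code: 5 digits with an optional +4 suffix.
US_ZIP = re.compile(r"\d{5}(-\d{4})?")


def profile_column(values):
    """Return simple profile statistics for one column of string values."""
    non_empty = [v for v in values if v is not None and v.strip() != ""]
    counts = Counter(non_empty)
    return {
        "total": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len(counts),
    }


def match_confidence(values, pattern):
    """Fraction of non-empty values that fully match the pattern (assumed rule)."""
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return hits / len(non_empty)


# Made-up sample values, not taken from the tutorial data set.
zip_codes = ["94105", "10001", "", "60614-1234", "ABCDE"]

stats = profile_column(zip_codes)
confidence = match_confidence(zip_codes, US_ZIP)

print(stats)
print(f"US Zip Code confidence: {confidence:.2f}")
```

In this sketch, three of the four non-empty values look like ZIP codes, so the assumed confidence is 0.75. A low confidence for one class and a higher one for another is the kind of signal that might lead you to reclassify a column, as you did for ZIP_CODE.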

Check your progress

The following image shows the MORTGAGE_APPLICANTS_TRUST asset in the catalog. You explored the type of information that IBM Knowledge Catalog automatically adds to data assets during metadata enrichment. In the next task, you will manually enrich this data asset.

MORTGAGE_APPLICANTS_TRUST asset




Task 2: Enrich assets and create relationships

To preview this task, watch the video beginning at 02:49.

You can make assets more valuable by adding information to them. For example, you can add your opinion of the asset, update asset properties, and create relationships to link assets. Follow these steps to enrich assets and create relationships:

  1. For the MORTGAGE_APPLICANTS_TRUST catalog asset, click the Review tab. Rate and comment on this asset so that others can find the asset easily.

    1. Select 5 stars for the rating.

    2. For the review, copy and paste the following text:

      This contains high quality customer data from the mortgage system.
      
    3. Click Submit.

  2. Click the Overview tab.

  3. Click the Edit icon next to the asset name to edit the asset name.

    1. Change the name to:

      MORTGAGE_APPLICANTS_TRUST_PROTECT
      
    2. Click Apply.

  4. In the Description section in the right side panel, click the Add icon.

    Note:

    If this asset has an existing description, you will see an Edit icon instead of an Add icon.

    1. Copy and paste the following description:

      Mortgage applicants from the Mortgage System
      
    2. Click Apply.

  5. Because this asset relates to mortgage loans, next to Business terms, click the Add icon or the Edit icon.

    1. In the Search field, type loan.

      Note: It is not necessary to press Enter after typing the search term. You will see a list of results immediately after typing the search term.
    2. Select Loan.

    3. Click Save.

  6. Because this asset contains personal information, next to Classifications, click the Add icon or the Edit icon.

    1. Select Personally Identifiable Information.

    2. Click Save.

  7. Because this asset is related to other mortgage assets, next to Related items, click Add related items > Add related assets.

    1. Select Is related to, and click Next.

    2. Select the CREDIT_SCORE and MORTGAGE_APPLICATION assets, and click Add.

  8. Click MORTGAGE_APPLICATION to view that related asset.
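
To make the enrichment steps above more concrete, the following sketch models an asset's metadata as a plain data structure. It is purely illustrative and is not the IBM Knowledge Catalog API or data model. The `CatalogAsset` class and the `relate` function are hypothetical, and treating "Is related to" as a symmetric link is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogAsset:
    """Hypothetical in-memory model of a catalog asset's metadata."""
    name: str
    description: str = ""
    business_terms: set = field(default_factory=set)
    classifications: set = field(default_factory=set)
    reviews: list = field(default_factory=list)  # (stars, comment) tuples
    related: set = field(default_factory=set)    # names of related assets

    def add_review(self, stars, comment):
        if not 1 <= stars <= 5:
            raise ValueError("rating must be between 1 and 5 stars")
        self.reviews.append((stars, comment))

    def average_rating(self):
        if not self.reviews:
            return None
        return sum(stars for stars, _ in self.reviews) / len(self.reviews)


def relate(a, b):
    """Record an 'Is related to' link in both directions (assumed symmetric)."""
    a.related.add(b.name)
    b.related.add(a.name)


# Mirror the steps in this task: review, rename, describe, tag, and relate.
asset = CatalogAsset("MORTGAGE_APPLICANTS_TRUST")
asset.add_review(5, "This contains high quality customer data from the mortgage system.")
asset.name = "MORTGAGE_APPLICANTS_TRUST_PROTECT"
asset.description = "Mortgage applicants from the Mortgage System"
asset.business_terms.add("Loan")
asset.classifications.add("Personally Identifiable Information")

credit = CatalogAsset("CREDIT_SCORE")
application = CatalogAsset("MORTGAGE_APPLICATION")
relate(asset, credit)
relate(asset, application)

print(sorted(asset.related))
```

The sketch renames the asset before it creates relationships, in the same order as the steps, because this simplified model stores related assets by name.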

Check your progress

The following image shows the Overview tab for the MORTGAGE_APPLICANTS_TRUST_PROTECT asset in the catalog. You made this asset more valuable by reviewing it, updating its properties, and adding relationships to other assets. In the next task, you will add the enriched asset to a project.

MORTGAGE_APPLICANTS_TRUST with related assets




Task 3: Add enriched data to a project

To preview this task, watch the video beginning at 04:09.

The data analyst team needs the mortgage applicants data in the mortgage analysis project to refine, visualize, analyze, and use as training data for models. Follow these steps to add the enriched data to a project:

  1. Click Mortgage Approval Catalog in the navigation trail.
    Navigation trail

  2. At the end of the MORTGAGE_APPLICANTS_TRUST_PROTECT catalog asset row, click the Overflow menu, and choose Add to project.

    1. In the Target drop-down list, select the Data governance project.

    2. Click Add.

  3. When the notification displays, click Go to project. If you miss the notification, then:

    1. From the Cloud Pak for Data navigation menu, choose Projects > View all projects.

    2. Click the Data governance project.

  4. In the project, click the Assets tab to see the MORTGAGE_APPLICANTS_TRUST_PROTECT data asset.

Check your progress

The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT asset in the project. Now you are ready to visualize the data.

MORTGAGE_APPLICANTS_TRUST_PROTECT asset in the project




Task 4: Visualize the data

To preview this task, watch the video beginning at 04:39.

You need to cleanse and refine the mortgage applicants data to get it ready for your analytical tools and models. A quick and easy way to determine how it needs to be shaped is to visualize the data in Data Refinery. The visualization is based on the first 5,000 rows of the data. Follow these steps to visualize the data:

  1. Click the MORTGAGE_APPLICANTS_TRUST_PROTECT data asset to preview the data.

  2. Click Prepare data to open the data asset in Data Refinery, and wait for the data to be read and processed.

  3. In the About this asset panel, click the X to close the panel.

  4. In the Steps panel, click the X to close the panel.

  5. Click the Visualizations tab.

  6. For the Column to visualize, select EMPLOYMENT_STATUS.

  7. Click Visualize data. The tool selects a pie chart as the best chart type for this column, which shows the distribution of applicants by employment status. Notice the suggested chart types that are indicated by a blue dot next to bar, word cloud, and sunburst.

  8. For the Chart type, select the Bubble chart type. The Bubble chart is one easy way to quickly visualize the distribution of values in a particular data set.

  9. From the Chart type drop-down, select the Relationship chart type.

  10. This chart type requires two columns. Select these columns:

    1. For the first column, select EMPLOYMENT_STATUS.

    2. Click Add another column.

    3. For the second column, select EDUCATION.

  11. With the Relationship chart, you can select endpoints to see the relationships. For example, you can see applicants' employment status by level of education.
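
The charts in the steps above summarize counts of values. The following sketch shows the kind of aggregation that could sit behind a distribution chart for one column and a relationship chart for two columns. It is an illustration in plain Python, not Data Refinery code. The rows and the `SAMPLE_LIMIT` name are made up, and the limit only mirrors the 5,000-row sample that this task mentions.

```python
from collections import Counter
from itertools import islice

# This task states that the visualization uses the first 5,000 rows.
SAMPLE_LIMIT = 5000

# Made-up rows for illustration, not taken from the tutorial data set.
rows = [
    {"EMPLOYMENT_STATUS": "Employed", "EDUCATION": "Bachelor"},
    {"EMPLOYMENT_STATUS": "Employed", "EDUCATION": "Master"},
    {"EMPLOYMENT_STATUS": "Self-employed", "EDUCATION": "Bachelor"},
    {"EMPLOYMENT_STATUS": "Retired", "EDUCATION": "High School"},
    {"EMPLOYMENT_STATUS": "Employed", "EDUCATION": "Bachelor"},
]

# Keep only the first SAMPLE_LIMIT rows.
sample = list(islice(rows, SAMPLE_LIMIT))

# Distribution of one column: the data behind a pie or bubble chart.
status_counts = Counter(r["EMPLOYMENT_STATUS"] for r in sample)
total = sum(status_counts.values())
shares = {status: count / total for status, count in status_counts.items()}

# Co-occurrence of two columns: the data behind a relationship chart.
pair_counts = Counter((r["EMPLOYMENT_STATUS"], r["EDUCATION"]) for r in sample)

for status, share in shares.items():
    print(f"{status}: {share:.0%}")
for (status, education), count in pair_counts.items():
    print(f"{status} <-> {education}: {count}")
```

Each key in `pair_counts` corresponds to a link between two endpoints, which is why selecting an endpoint in a relationship chart can reveal, for example, the education levels of employed applicants.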

Check your progress

The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT asset visualized in Data Refinery. You are now ready to cleanse the data.

Relationship visualization




Task 5: Prepare the data for analytics and AI

To preview this task, watch the video beginning at 05:59.

You can't process applicants without a social security number, so you need to review the data and remove any applicants without social security numbers. To prepare the MORTGAGE_APPLICANTS_TRUST_PROTECT data, you will:

  • View the frequency of values in the Social_Security_Number column.
  • Filter the applicants with missing values from the Social_Security_Number column.

Follow these steps to prepare the data:

  1. In Data Refinery, click the Profile tab.

  2. Scroll to the right to locate the Social_Security_Number column. Notice several missing values.

  3. Click the Data tab to filter out these records. In the status bar at the bottom of the screen, Data Refinery indicates that the FULL DATA SET is 1101 rows.

  4. If the Steps panel is not visible, click Steps to open the panel.

  5. Click New step.

    1. In the Cleanse section, select Filter.

    2. In the Column field, select the Social_Security_Number column.

    3. In the Operator field, select Is not empty.

    4. Click Apply. Notice in the status bar at the bottom of the screen, Data Refinery now indicates that the FULL DATA SET is 1000 rows because the rows with missing Social Security Numbers are filtered out. Notice that a new step displays in the Steps panel showing the Filter operation.

  6. Click the Profile tab.

  7. Scroll to the right to locate the Social_Security_Number column. Notice that the missing values are gone.

  8. From the toolbar, click the Save icon.

  9. From the toolbar, click the Export icon, and choose Export current data to CSV.
    Export as csv icon

    1. Save the MORTGAGE_APPLICANTS_TRUST_PROTECT_shaped.csv to a local folder.

    2. Navigate to that folder, and open the CSV file. It contains 1000 rows, and no applicant is missing a social security number.

  10. Return to Cloud Pak for Data, and click the Data governance project in the navigation trail.
    Navigation trail

  11. Click All assets, and locate the new Data Refinery flow asset with the name MORTGAGE_APPLICANTS_TRUST_PROTECT_flow.

Tip: You can save the refined data set to the project or to an external data source, such as the Db2 Warehouse instance where the original data sets are stored. For more information, refer to Creating jobs in Data Refinery.
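
If you want to reason about what the Filter step and the CSV export do, the following sketch reproduces the idea outside Data Refinery with the Python standard library. It is a minimal illustration, not the code that Data Refinery runs. The input rows and the `is_not_empty` helper are made up, and treating whitespace-only values as empty is an assumption about the Is not empty operator.

```python
import csv
import io

# Made-up input with two missing values, standing in for the applicant data.
raw = io.StringIO(
    "ID,Social_Security_Number\n"
    "1,123-45-6789\n"
    "2,\n"
    "3,987-65-4321\n"
    "4,   \n"
)


def is_not_empty(value):
    """Assumed meaning of 'Is not empty': not missing and not only whitespace."""
    return value is not None and value.strip() != ""


reader = csv.DictReader(raw)
all_rows = list(reader)

# Keep only the applicants that have a social security number.
kept = [row for row in all_rows if is_not_empty(row["Social_Security_Number"])]

# Write the filtered rows back out as CSV, as the export step does.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(kept)

print(f"Rows before filter: {len(all_rows)}")
print(f"Rows after filter: {len(kept)}")
print(out.getvalue())
```

In the tutorial data, the same kind of filter reduces the data set from 1101 rows to 1000 rows.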

Check your progress

The following image shows the MORTGAGE_APPLICANTS_TRUST_PROTECT_shaped.csv file that you refined in Data Refinery. This data set contains the information about those mortgage applicants who provided a social security number.

Refined data asset



As a Data Analyst for Golden Bank, you learned how to search for and find the right data, understand and trust its content, and then prepare it for other data analysts and data scientists to use.

Cleanup (Optional)

If you would like to retake the tutorials in the Data governance use case, delete the following artifacts.

Artifact | How to delete
Imported business terms | Delete governance artifacts
Banking category | Delete a category
Data protection rules: Confidential Information and Redact Social Security Number | Delete data protection rules
Mortgage Approval Catalog | Delete a catalog
Data governance sample project | Delete a project

