Data governance and privacy tutorial: Govern virtualized data

Take this tutorial to govern data that was virtualized, after you complete the Trust your data tutorial, Protect your data tutorial, and Virtualize external data tutorial with the Multicloud data integration use case of the data fabric trial. Your goal is to protect the virtual data, which contains mortgage applicants, applications, and credit scores, from unauthorized access. Certain personal information, such as social security numbers, must be masked so that not all Golden Bank employees have access to it.

Quick start: If you did not already create the sample project for this tutorial, access the Data governance and privacy sample project in the gallery.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will add virtual data to your project, and then enrich that data with business terms, and see how Watson Knowledge Catalog data protection rules mask data through Cloud Pak for Data as a Service. Click the image to view a larger image.

Animated image

The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data that is stored across three external data sources. As a Data Steward on the governance team, you must enrich the virtualized data and ensure that the virtualized data is protected.

In this tutorial, you will complete these tasks:

  1. Enable governance of virtualized data.
  2. Run an SQL query on virtual tables.
  3. Copy virtualized data to your project.
  4. Enrich virtualized data.
  5. View the results of the metadata enrichment.
  6. Publish virtual tables to a catalog.
  7. Cleanup

If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Tip: For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Preview the tutorial

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method as an alternative to following the written steps in this documentation.

Tip: Start the video, then as you scroll through the tutorial, the video moves to picture-in-picture mode. Close the video table of contents for the best experience with picture-in-picture. You can use picture-in-picture mode so you can follow the video as you complete the tasks in this tutorial. Click the timestamps for each task to follow along.

Video timestamps


  • Watch this short video to see how to use the video picture-in-picture and table of contents.

Prerequisites

To preview this task, watch the video beginning at 00:27.

Complete the following tutorials:

  • Virtualize external data tutorial to create virtual tables and join views from data that is stored across three external sources.
  • Trust your data tutorial to import and enrich data assets and publish them to a catalog.
  • Protect your data tutorial to create data protection rules to protect data.
Tip: If you encounter a guided tour while completing this tutorial in the Cloud Pak for Data user interface, click Maybe later.

Task 1: Enable governance of virtualized data

Enabling governance of virtualized data requires two steps:

  • Enforce data protection rules in Watson Query.
  • Set up authorization between Watson Knowledge Catalog and Watson Query.

Enforce data protection rules

To preview this task, watch the video beginning at 01:02.

Follow these steps to enforce data protection rules in Watson Query:

  1. From the Cloud Pak for Data navigation menu, choose Data > Data virtualization.

  2. If you see a notification to Set up a primary catalog to enforce governance, click Go to Governance. If you don't see this message, then from the service menu, click Administration > Service settings, and then click the Governance tab.
    Watson Query Service menu

  3. Enable the Enforce policies within Data Virtualization option.

  4. From the service menu, return to Virtualization > Data sources.

Check your progress

The following image shows the Governance tab with policy enforcement enabled. Next, you need to set up authorization between Watson Knowledge Catalog and Watson Query.

Enforce policies

Set up authorization between Watson Knowledge Catalog and Watson Query

To preview this task, watch the video beginning at 01:40.

Follow these steps to set up authorization between Watson Knowledge Catalog and Watson Query:

  1. Visit the Authorizations page in the IBM Cloud console.

  2. Click Create.

  3. For the Source account, select This account.

  4. For the Source service, select Watson Knowledge Catalog.

  5. For How do you want to scope the access? to Watson Knowledge Catalog, select All resources.

  6. For the Target service, select Watson Query.

  7. For How do you want to scope the access? to Watson Query, select All resources.

  8. For Service access, select DataAccess (For Service to Service Authorization Only).

  9. Click Authorize.

Check your progress

The following image shows the Authorizations page in IBM Cloud with the authorization between Watson Knowledge Catalog and Watson Query. Now you are ready to query governed virtual tables in Watson Query.

Authorizations page

Task 2: Run an SQL query on governed virtual tables

To preview this task, watch the video beginning at 02:20.

With data protection rules in place, virtual tables are governed by those rules. Follow these steps to run an SQL query on a governed virtual table:

  1. From the Watson Query service menu, click Run SQL.
    Watson Query Service menu

  2. Copy and paste the following SELECT statement for the new query. Replace <your-schema> with the schema name that you noted earlier.

    SELECT * FROM <your-schema>.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'
    

    Your query looks similar to SELECT * FROM DV_IBMID_663002GN1Q.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'
    Select statement

  3. Click Run all.

  4. After the query completes, select the query on the History tab. On the Results tab, you can see that the table is filtered to only applicants from the state of California. The data protection rules apply in Watson Query, catalog preview, catalog download, Data Refinery, and Project Asset preview. The rule doesn’t apply to the person who created the rule or virtualized the data. Watch the video at 02:47 to see what other users see when they run the SQL query.
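
The effect of the SELECT statement in step 2 can also be tried outside Watson Query. The following is a minimal sketch in Python, assuming an in-memory SQLite database stands in for the virtual table. The schema name DV_IBMID_EXAMPLE, the ID column, and the sample rows are hypothetical, and SQLite is not Watson Query, so no data protection rules are involved. The sketch only shows how the <your-schema> placeholder is replaced and how the STATE_CODE filter narrows the rows.

```python
import re
import sqlite3

# Hypothetical schema name for illustration; use the schema that you noted earlier.
SCHEMA = "DV_IBMID_EXAMPLE"

# The SELECT statement from step 2, with its placeholder left in place.
QUERY_TEMPLATE = (
    "SELECT * FROM <your-schema>.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'"
)


def validate_schema(schema: str) -> str:
    """Accept only plain identifiers so the name is safe to inline into SQL."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema):
        raise ValueError(f"unexpected schema name: {schema!r}")
    return schema


def build_query(schema: str) -> str:
    """Replace the <your-schema> placeholder with a validated schema name."""
    return QUERY_TEMPLATE.replace("<your-schema>", validate_schema(schema))


def run_demo(schema: str = SCHEMA):
    """Run the query against a tiny stand-in table and return the matching rows."""
    schema = validate_schema(schema)
    conn = sqlite3.connect(":memory:")
    try:
        # Attaching a second in-memory database lets SQLite resolve schema.table names.
        conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")
        # ID is a hypothetical column; the sample rows are made up.
        conn.execute(
            f"CREATE TABLE {schema}.MORTGAGE_APPLICANT "
            "(ID TEXT, EMAIL_ADDRESS TEXT, STATE_CODE TEXT)"
        )
        conn.executemany(
            f"INSERT INTO {schema}.MORTGAGE_APPLICANT VALUES (?, ?, ?)",
            [
                ("A1", "ana@example.com", "CA"),
                ("A2", "bo@example.com", "NY"),
                ("A3", "cy@example.com", "CA"),
            ],
        )
        # LIKE with no % or _ wildcard matches only the exact value
        # (SQLite also ignores ASCII case for this comparison).
        return conn.execute(build_query(schema)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Validating the schema name before inlining it is a design choice of this sketch: identifiers cannot be passed as bound parameters, so the name is checked against a strict pattern instead.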

Check your progress

The following image shows the SQL query results from the perspective of another user. Now you are ready to copy the virtual tables to your project.

SQL query results

Task 3: Copy the virtual data to your project

To preview this task, watch the video beginning at 03:02.

In the Virtualize external data tutorial, you created virtual tables and virtual join views, and copied them to your Multicloud data integration project. If you would like to use that project to complete this tutorial, then skip to Task 4. If you would like to use your Data governance and privacy project to complete this tutorial, then follow these steps:

  1. From the service menu, click Virtualization > Virtualized data.
    Watson Query Service menu

  2. Select the following tables:

    • MORTGAGE_APPLICATION
    • MORTGAGE_APPLICANT
    • CREDIT_SCORE
    • APPLICANTS_APPLICATIONS_JOINED
    • APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
  3. Click Assign.

  4. For the Project, select Data governance and privacy.

  5. Click Assign.

  6. When the virtual objects are successfully assigned, click Go to project.

  7. In the Data governance and privacy project, click the Assets tab. The virtual data tables begin with DV_IBMID_<your schema>.

  8. Open any of the virtual data tables. For example, click the APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED virtual table to view it.

  9. Provide your credentials to access the data asset.

    1. For the Authentication method, select API Key.

    2. Paste the same API key that you created in the Virtualize external data tutorial.
      Paste API key

    3. Click Connect. The data protection rules apply in the catalog preview, catalog download, Data Refinery, and Project Asset preview. The rule doesn’t apply to the person who created the rule or virtualized the data. Watch the video at 04:09 to see what other users see when they try to access the virtual data table.

Check your progress

The following image shows the virtual table with a masked column in the project from the perspective of a different user. Now you are ready to enrich the data.

Virtual table in project
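
The difference between what you see and what another user sees can be illustrated with a small simulation. The following Python sketch is an assumption-laden toy, not how Watson Knowledge Catalog enforces rules: the column name SOCIAL_SECURITY_NUMBER, the user names, and the redaction style (every character replaced with X) are hypothetical. It only mirrors the behavior described in this task, where masking is skipped for the person who created the rule or virtualized the data.

```python
from typing import Dict, Set

# Hypothetical names for illustration only.
MASKED_COLUMNS: Set[str] = {"SOCIAL_SECURITY_NUMBER"}
EXEMPT_USERS: Set[str] = {"steward"}  # stands in for the rule creator or data virtualizer


def redact(value: str) -> str:
    """Replace every character with X, a simple redaction-style mask."""
    return "X" * len(value)


def preview(row: Dict[str, str], user: str) -> Dict[str, str]:
    """Return the row as the given user would see it in this toy model."""
    if user in EXEMPT_USERS:
        # Exempt users see the original values.
        return dict(row)
    return {
        column: redact(value) if column in MASKED_COLUMNS else value
        for column, value in row.items()
    }


if __name__ == "__main__":
    sample = {
        "ID": "A1",
        "SOCIAL_SECURITY_NUMBER": "123-45-6789",  # made-up value
        "STATE_CODE": "CA",
    }
    print(preview(sample, "steward"))
    print(preview(sample, "analyst"))
```

In this sketch the user named analyst receives a row whose SOCIAL_SECURITY_NUMBER value is a run of X characters, while the other columns are unchanged.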

Task 4: Enrich the virtual data tables

To preview this task, watch the video beginning at 04:21.

You can enrich data assets with information that helps users to find data faster. Users can use the enrichments to decide whether the data is appropriate for the task at hand, whether they can trust the data, and how to work with the data. Such information includes, for example, terms that define the meaning of the data, rules that document ownership or determine quality standards, or reviews. Follow these steps to enrich the virtual data tables:

  1. Click Data governance and privacy in the navigation trail to return to the project.
    Navigation trail

  2. From the Assets tab, click New asset.

  3. Select Metadata Enrichment.

  4. For the name, copy and paste the following text:

    Virtual mortgage data - metadata enrichment
    
  5. Click Next to continue.

  6. Click Select data from project.

    1. Select Data asset.

    2. Click the checkbox next to the following assets:

      • <your schema>.MORTGAGE_APPLICATION
      • <your schema>.MORTGAGE_APPLICANT
      • <your schema>.CREDIT_SCORE
      • <your schema>.APPLICANTS_APPLICATIONS_JOINED
      • <your schema>.APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
    3. Click Select.

  7. Click Next to continue to the enrichment objective.

  8. Select all enrichment objectives:

    • Profile data
    • Analyze quality
    • Assign terms
  9. For Categories, click Select categories.

    1. Select only [uncategorized] and Banking.

    2. Click Select.

  10. For the Sampling, select Basic.

  11. Click Next to continue to the schedule.

  12. Click Next to continue to the review.

  13. Click Create.

  14. The metadata enrichment asset displays, but the job might take several minutes to complete. Click the Refresh icon to watch the status change from Queued to In progress to Finished. When the job run is complete, you see the five assets listed.
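
The status progression in the last step (Queued, In progress, Finished) is a typical polling pattern. The following Python sketch shows that pattern in general terms. It assumes a caller-supplied fetch_status function, which is hypothetical and does not correspond to any Cloud Pak for Data API; in the tutorial itself you refresh the page manually.

```python
import time
from typing import Callable, List


def wait_for_finish(
    fetch_status: Callable[[], str],
    interval_seconds: float = 5.0,
    max_checks: int = 60,
) -> List[str]:
    """Poll until the status is Finished; return the distinct statuses seen, in order."""
    seen: List[str] = []
    for _ in range(max_checks):
        status = fetch_status()
        # Record a status only when it changes, for example Queued -> In progress.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == "Finished":
            return seen
        time.sleep(interval_seconds)
    last = seen[-1] if seen else "unknown"
    raise TimeoutError(f"job did not finish after {max_checks} checks; last status: {last}")


if __name__ == "__main__":
    # A canned sequence stands in for a real job.
    canned = iter(["Queued", "Queued", "In progress", "Finished"])
    print(wait_for_finish(lambda: next(canned), interval_seconds=0))
```

Bounding the number of checks keeps the loop from running forever if a job stalls.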

Check your progress

The following image shows the completed metadata enrichment. Now you can explore the enriched data assets.

Enriched data

Task 5: View the results of the metadata enrichment

To preview this task, watch the video beginning at 05:48.

After the metadata enrichment run is completed, follow these steps to view the enriched data:

  1. From the Virtual mortgage data - metadata enrichment screen, click the Columns tab.

  2. Search for mortgage_applicant.

  3. In the list of Columns, locate the EMAIL_ADDRESS column for the <your schema>.MORTGAGE_APPLICANT asset.

    1. Click the Overflow menu at the end of the EMAIL_ADDRESS row for the <your schema>.MORTGAGE_APPLICANT asset, and choose View column details.

    2. In the side panel on the Details tab, you see profiling information such as: Format, Frequency distribution, Statistics.

    3. In the side panel, click the Governance tab. This tab includes the data classes and business terms that were auto-assigned during the metadata enrichment. You might also see suggested business terms and data classes, which you can assign manually.

    4. To review the suggested terms and manually assign them:

      1. Click Suggested business terms.

      2. For Address, click Assign.

      3. Click Suggested data classes.

      4. For Text, click Assign.

  4. At the end of the EMAIL_ADDRESS row for the <your schema>.MORTGAGE_APPLICANT asset, click the Overflow menu, and choose View data quality details.

    1. View the data quality score. Watson Knowledge Catalog automatically generates a data quality score for each column and data asset by analyzing every value in every record according to pre-built dimensions.

    2. Click the X to close the Data quality window.

  5. Search for credit_score.

  6. For the CITY column of the <your schema>.CREDIT_SCORE asset, click the Overflow menu, and choose Mark as reviewed.

  7. Click the Assets tab.

  8. In the list of Assets, for the <your schema>.MORTGAGE_APPLICANT asset, click the Overflow menu, and choose View asset details.

    1. In the side panel, click the Governance tab to see any business terms that were auto-assigned.

    2. To manually assign business terms, click the Edit icon.

    3. Search for social. If you don't see any results, then make sure that the drop-down list is set to All terms instead of Suggested terms.

    4. Select Social Security Number.

    5. Click Assign.
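
To make the profiling information and the data quality score from the steps above more concrete, the following Python sketch computes a frequency distribution, a few basic statistics, and a toy quality score for a column of email addresses. This is an illustration under stated assumptions, not the algorithm that Watson Knowledge Catalog uses: the email pattern, the single "valid format" dimension, and the sample values are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# A deliberately loose, hypothetical email pattern used only for this illustration.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values: List[Optional[str]]) -> Dict[str, object]:
    """Summarize a column: frequency distribution, statistics, and a toy quality score."""
    non_null = [v for v in values if v is not None and v != ""]
    frequency = Counter(non_null)
    lengths = [len(v) for v in non_null]
    valid = sum(1 for v in non_null if EMAIL_PATTERN.match(v))
    return {
        "count": len(values),
        "missing": len(values) - len(non_null),
        "distinct": len(frequency),
        "most_common": frequency.most_common(1),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        # Toy score: percentage of all values that are present and well formed.
        "quality_score": round(100 * valid / len(values), 1) if values else 0.0,
    }


if __name__ == "__main__":
    sample = [
        "ana@example.com",
        "li@example.org",
        "ana@example.com",
        "not-an-email",
        None,
    ]
    for key, value in profile_column(sample).items():
        print(f"{key}: {value}")
```

A real quality score combines several dimensions; this sketch uses only one so that the arithmetic is easy to follow.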

Check your progress

The following image shows the reviewed and enriched data assets. The next step is to publish the enriched data to a catalog to share with your organization.

Reviewed enriched data assets

Task 6: Publish virtual tables to a catalog

To preview this task, watch the video beginning at 07:18.

Now that the virtualized data is enriched with business terms, follow these steps to publish the virtual tables to a catalog:

  1. Click Data governance and privacy in the navigation trail to return to the project.
    Navigation trail

  2. Click the Assets tab.

  3. Navigate to Data > Data assets.

  4. Click the checkbox next to the following assets:

    • <your schema>.MORTGAGE_APPLICATION
    • <your schema>.MORTGAGE_APPLICANT
    • <your schema>.CREDIT_SCORE
    • <your schema>.APPLICANTS_APPLICATIONS_JOINED
    • <your schema>.APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
  5. Click Publish to catalog.

  6. Select the Mortgage Approval Catalog (or your catalog name) from the list, and click Publish.

  7. From the Cloud Pak for Data navigation menu, choose Catalogs > View all catalogs.

  8. Open the Mortgage Approval Catalog.

  9. Search for DV_IBMID.

  10. Open one of the virtual tables. If prompted, provide your credentials:

    1. For the Authentication method, select API Key.

    2. Paste the same API key that you created in the Virtualize external data tutorial.

  11. Click the Asset tab to view the data. The data protection rules apply in the catalog preview, catalog download, Data Refinery, and Project Asset preview. The rule doesn’t apply to the person who created the rule or virtualized the data. Watch the video at 08:17 to see what other users see when they try to access the virtual data table in the catalog.

Check your progress

The following image shows the data preview of the virtual table in the catalog from the perspective of the user.

Catalog preview

As data engineers and data stewards at Golden Bank, you enriched the virtualized data and ensured that it is protected.

Cleanup (Optional)

If you would like to retake the tutorials in the Data governance and privacy use case, refer to the Cleanup section in each of the prerequisite tutorials.
