0 / 0
Multicloud data integration tutorial: Virtualize external data
Multicloud data integration tutorial: Virtualize external data

Multicloud data integration tutorial: Virtualize external data

Take this tutorial to virtualize data stored in three external data sources with the Multicloud data integration use case of the data fabric trial. Your goal is to use Watson Query to create virtual tables and to join the virtual tables together from the existing data that lies across three data sources: a Db2 Warehouse, a PostgreSQL database, and a MongoDB database. If you completed the Integrate data tutorial, then you did many of the same tasks using DataStage that this tutorial accomplishes using Watson Query.

Quick start: If you did not already create the sample project for this tutorial, access the Multicloud data integration sample project in the gallery.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will connect to external data sources, create virtual tables and views, and add them to a project. Click the image to view a larger image.

Animated image

The story for the tutorial is that Golden Bank needs to adhere to a new regulation where it cannot lend to underqualified loan applicants. You will use Watson Query to combine data from different data sources without data movement, and make the virtual data available to other data scientists and data engineers in a project.

In this tutorial, you will complete these tasks:

  1. Verify the Platform assets catalog.

  2. Add data connections to the Platform assets catalog.

  3. Add data sources to Watson Query.

  4. Virtualize data tables.

  5. Create virtual join views by joining virtual tables.

  6. Generate an API key.

  7. Access the virtual join view in the project.

  8. Cleanup (Optional)

If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Tip: For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Preview the tutorial

Watch Video Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method as an alternative to following the written steps in this documentation.

Tip: Start the video, then as you scroll through the tutorial, the video moves to picture-in-picture mode. Close the video table of contents for the best experience with picture-in-picture. You can use picture-in-picture mode so you can follow the video as you complete the tasks in this tutorial. Click the timestamps for each task to follow along.

Video timestamps


  • Watch this short video to see how to use the video picture-in-picture and table of contents.

Prerequisites

Sign up for Cloud Pak for Data as a Service

You must sign up for Cloud Pak for Data as a Service and provision the necessary services for the Multicloud data integration use case.

  • If you have an existing Cloud Pak for Data as a Service account, then you can get started with this tutorial. If you have a Lite plan account, only one user per account can run this tutorial.
  • If you don't have a Cloud Pak for Data as a Service account yet, then sign up for a data fabric trial.

Verify the necessary provisioned services

To preview this task, watch the video beginning at 01:06.

Follow these steps to verify or provision the necessary services:

  1. From the Cloud Pak for Data navigation menu Navigation menu, choose Services > Service instances.

  2. Use the Product drop-down list to determine whether an existing Watson Query service instance exists.

  3. If you need to create a Watson Query service instance, click Add service.

    1. Select Watson Query.

    2. Select the Lite plan.

    3. Click Create.

  4. Wait while the Watson Query service is provisioned, which might take a few minutes to complete.

  5. Repeat these steps to verify or provision the following additional services:

    • Watson Knowledge Catalog
    • Cloud Object Storage

Checkpoint for Provisioned services Check your progress

The following image shows the provisioned service instances:

Provisioned services

Create the sample project

To preview this task, watch the video beginning at 01:46.

If you already have the sample project for this tutorial, then skip to Task 1. Otherwise, follow these steps:

  1. Access the Multicloud data integration sample project in the gallery.

  2. Click Create project.

  3. If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.

  4. Click Create.

  5. Wait for the project import to complete, and then click View new project to verify that the project and assets were created successfully.

    Note: If this occasion is your first time accessing a project, you see a guided tour asking if you want a tour of projects. For now, click Maybe later.
  6. Click the Assets tab to see the connections and DataStage flow.

Note: You might see a guided tour showing the tutorials that are included with this use case. The links in the guided tour will open these tutorial instructions.

Checkpoint for Sample project Check your progress

The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.

Sample project

Tip: If you encounter a guided tour while you are completing this tutorial in the Cloud Pak for Data as a Service user interface, click Maybe later or close out the tour window.

Task 1: Verify the Platform assets catalog

To preview this task, watch the video beginning at 02:42.

You can add connections to external data sources at either the platform level or the service level. When you add the connections at the platform level by using the Platform assets catalog, you can easily include those connections in projects, catalogs, and Watson Query data sources. Follow these steps to verify the Platform assets catalog.

  1. From the Cloud Pak for Data navigation menu Navigation menu, choose Data > Platform connections.

  2. If you see existing connections, then you already have a Platform assets catalog, and you can skip to Task 2. If you don't see any connections, but you see an option to create a new connection, then you can skip to Task 2.
    New connection button

  3. If you don't have a Platform assets catalog, click Create catalog.
    Create Platform assets catalog

  4. Select a Cloud Object Storage from the list.

  5. Accept the default value for Duplicate asset handling.

  6. Click Create. The Platform connections page displays.

Checkpoint for Platform Connections catalog Check your progress

The following image shows the Access control tab in the Platform Connections catalog. On this tab, you can add collaborators. On the Connections tab, you can create connections. Since the sample project includes the connections, you can add the connections for the external data sources to this catalog from the sample project.

Platform Connections catalog

Task 2: Add data connections to the Platform assets catalog

To preview this task, watch the video beginning at 03:22.

The Multicloud data integration sample project includes three connections to external data sources. Next, you add those connections to the Platform assets catalog, and then you can make these connections available in Watson Query. Follow these steps to publish the connections from the sample project to the Platform assets catalog.

  1. From the Cloud Pak for Data navigation menu Navigation menu, choose Projects > View all projects.

  2. Click the Multicloud data integration project.

    Note: You might see a guided tour showing the tutorials that are included with this use case. The links in the guided tour will open these tutorial instructions. If you already have the tutorial instructions open in a new window, then close out this guided tour.
  3. Click the Assets tab.

  4. Under Asset types, click Data access > Connections.

  5. Select the following connection assets:

    • Data Fabric Trial - Db2 Warehouse
    • Data Fabric Trial - MongoDB
    • Data Fabric Trial - Databases for PostgreSQL
  6. Click Publish to catalog.

  7. For the Target, select Platform assets catalog.

  8. Click Publish.

  9. From the Cloud Pak for Data navigation menu Navigation menu, choose Data > Platform connections to see the three connections that are published to the catalog.

Checkpoint for Connections in the Platform connections catalog Check your progress

The following image shows the Platform assets catalog with three connections. You are now ready to add data sources.

Connections in the Platform connections catalog

Task 3: Add data sources to Watson Query

To preview this task, watch the video beginning at 04:05.

Now you can add these external data sources from the Platform assets catalog to Watson Query. Follow these steps to add the data sources:

  1. From the Cloud Pak for Data navigation menu Navigation menu, choose Data > Data virtualization.

    Note: If you see a notification to Set up a primary catalog to enforce governance, you can safely close this notification. Setting up a primary catalog is optional.
  2. On the Data sources page, in the Table view, click Add connection > Existing connection.

    Add existing connection

  3. Select Data Fabric Trial - Db2 Warehouse.

  4. Click Add.

  5. Repeat these steps to add the Data Fabric Trial - Mongo DB and Data Fabric Trial - Databases for PostgreSQL connections.

Checkpoint for Data sources Check your progress

The following image shows the data sources. You are now ready to create a virtual table from data stored in those external data sources.

Data sources

Task 4: Virtualize data tables

To preview this task, watch the video beginning at 04:40.

You want to virtualize the MORTGAGE_APPLICATION, MORTGAGE_APPLICANT, and CREDIT_SCORES tables. Later, you can join the first two virtual tables with the third table to create a new virtual join view. Follow these steps to virtualize the data tables:

  1. From the service menu, click Virtualization > Virtualize.
    Virtualize on the service menu

  2. If necessary, change to Tables view, and wait while the tables load, which might take up to 30 seconds. When you see Available tables, then all of the tables loaded.
    Virtualize list of available tables

  3. On the Tables tab, filter the tables based on the following criteria:

    1. Source type: IBM Db2 Warehouse and PostgreSQL

    2. Database: Data Fabric Trial - Db2 Warehouse and Data Fabric Trial - Databases for PostgreSQL

    3. Schema: BANKING

  4. Select the MORTGAGE_APPLICATION, MORTGAGE_APPLICANT, and CREDIT_SCORE tables to virtualize. You can hover over a table name to see the full name.

  5. Click Add to cart.

  6. Click View cart to view your selection. From here, you can edit the table and schema names, or remove a selection from your cart.

  7. For now, clear the checkbox next to Add to project. This action will make the virtual tables available on the Virtualized data page.

  8. Click Virtualize.

  9. Click Confirm to begin virtualizing the tables.

  10. When virtualization is complete, click Go to virtualized data to see your newly created table.

Checkpoint for Virtualized data Check your progress

The following image shows the Virtualized data page. You are now ready to create a virtual table by joining these virtual tables.

Virtualized data

Task 5: Create a virtual join view by joining virtual tables

You want to create a virtual join view by joining the MORTGAGE_APPLICANT and MORTGAGE_APPLICATION virtual tables. Then, you want to join the resulting virtual object with the CREDIT_SCORE virtual table to create a second virtual join view.

Virtual join view 1: Join the MORTGAGE_APPLICANT and MORTGAGE_APPLICATION virtual tables

To preview this task, watch the video beginning at 05:59.

Follow these steps to create the first virtual join view:

  1. From the Virtualized data page, select the MORTGAGE_APPLICATION and MORTGAGE_APPLICANT tables to virtualize.

  2. Make note of the schema name. You will need that name later to run an SQL query.

  3. Click Join.

  4. In the list of columns for MORTGAGE_APPLICATION table, drag to connect the ID column with the ID column in the MORTGAGE_APPLICANT table.
    Join virtual objects using ID for key

  5. Click Preview to see a preview of the joined tables.

  6. Close the preview window.

  7. Click Open in SQL editor, and then click Continue at the notice that you are not able to return to the join canvas. The SQL Editor lets you run queries on the data set. In this case, you want to preview what records the data set will contain when you filter on California applicants.

    Open SQL Editor

    1. Copy your schema, and then delete the existing query. You will need to insert your schema in the next SQL statement.

    2. Copy and paste the following SELECT statement for the new query. Replace <your schema> with the schema name that you noted earlier.

      SELECT * FROM <your-schema>.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'
      

      Your query looks similar to SELECT * FROM DV_IBMID_663002GN1Q.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'
      Select statement

    3. Click Run all.

    4. After the query completes, select the query on the History tab. On the Results tab, you can see that the table is filter to only applicants from the state of California.

    5. Click Back to close the SQL editor.

  8. Now that you previewed the data set filtered on California applicants, you will add this filter criteria to the virtual join view. For the MORTGAGE_APPLICANT table, copy and paste the following statement for the filter criteria. Replace <your schema> with the schema name that you noted earlier.

    "<your-schema>"."MORTGAGE_APPLICANT"."STATE_CODE"='CA'
    

    Your filter criteria looks similar to "DV_IBMID_663002GN1Q"."MORTGAGE_APPLICANT"."STATE_CODE"='CA'
    Filter criteria

  9. Click Next.

  10. You can edit the column names to differentiate between columns with the same name in both tables. In this case, keep the default column names, and click Next.

  11. On the Assign and review page, for the View name, type APPLICANTS_APPLICATIONS_JOINED.

  12. For now, clear the Assign to project option. Later, you create a virtual object and assign that to the Multicloud data integration project.

  13. Click Create view.

  14. When virtualization is complete, click Go to virtualized data to see your newly created join view.

Checkpoint for Virtualized data Check your progress

The following image shows the Virtualized data page. You are now ready to create a second virtual join view.

Virtualized data

Virtual join view 2: Join the APPLICANTS_APPLICATIONS_JOINED and CREDIT_SCORE virtual tables

To preview this task, watch the video beginning at 07:47.

Follow these steps to create the second virtual join view:

  1. From the Virtualized data page, select the APPLICANTS_APPLICATIONS_JOINED and CREDIT_SCORE tables to virtualize.

  2. Click Join.

  3. In the list of columns for APPLICANTS_APPLICATIONS_JOINED table, drag to connect the EMAIL_ADDRESS column with the EMAIL_ADDRESS column in the CREDIT_SCORE table.

  4. Click Preview to see a preview of the joined tables.

  5. Close the preview window.

  6. Click Next.

  7. Accept the default column names, and click Next.

  8. On the Assign and review page, for the View name, type APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED.

  9. Assign the virtualized view to Project, and select the Multicloud data integration project.

  10. Click Create view.

  11. When virtualization is complete, click Go to virtualized data to see your newly created join view.

Checkpoint for Virtualized data Check your progress

The following image shows the Virtualized data page. You are now ready to work with the virtual data in your project.

Virtualized data

Task 6: Generate an API key

To preview this task, watch the video beginning at 08:27.

You need to provide your personal credentials in the form of an API key to view virtualized assets. If you don't already have a saved API key, then follow these steps to create an API key.

  1. Access the API keys page in the IBM Cloud console. Log in if prompted.

  2. On the API keys page, click Create an IBM Cloud API key.

  3. Type a name and description.

  4. Click Create.

  5. Copy the API key.

  6. Download the API key for future use.

Checkpoint for API keys page Check your progress

The following image shows the API keys page. You are now ready to view the virtual table in the project.

API keys page

Task 7: Access the virtual join view in the project

To preview this task, watch the video beginning at 09:01.

The virtual table was added to your project along with a connection to Watson Query. Follow these steps to open the project to see the virtual data and the connection information that is required to access the virtual data.

  1. Switch back to Cloud Pak for Data. From the navigation menu Navigation menu, choose Projects > View all projects.

  2. Open the Multicloud data integration project.

  3. Click the Assets tab.

  4. Open any of the virtualized data. For example, click the APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED data asset to view it.

  5. Provide your credentials to access the data asset.

  6. For the Authentication method, select API Key.

  7. Paste your API key.
    Paste API key

  8. Click Connect.

  9. Scroll through the data asset to see all of the applicants from the state of California.

Checkpoint for View virtual table Check your progress

The following image shows the virtual data in the project. You are now ready to analyze the virtual data.

View virtual table

As a data engineer at Golden Bank, you used Watson Query to combine data from different data sources and with different types. You used SQL syntax and accessed and combined data without data movement.

Cleanup (Optional)

If you would like to retake the tutorials in the Multicloud data integration use case, delete the following artifacts.

Artifact How to delete
Connections in the Platform assets catalog Remove an asset from a catalog
Virtualized data Navigate to Data > Data virtualization; On the Virtualized data page, access the Overflow menu Overflow menu for a table, and select Remove.
Data sources Navigate to Data > Data virtualization; On the Data sources page, click the Delete icon Delete icon for a connection.
Multicloud data integration sample project Delete a project

Next steps

Learn more

Parent topic: Data fabric tutorials