0 / 0
Data integration tutorial: Integrate data
Last updated: Nov 27, 2024
Data integration tutorial: Integrate data

Take this tutorial to transform data stored in three external data sources with the Data integration use case of the data fabric trial. Your goal is to use DataStage to transform the data, and then deliver that transformed data to a single output file. If you completed the Virtualize external data tutorial, then you did many of the same tasks using Data Virtualization that this tutorial accomplishes using DataStage.

Quick start: If you did not already create the sample project for this tutorial, access the Data integration sample project in the Resource hub.

The story for the tutorial is that Golden Bank needs to adhere to a new regulation where it cannot lend to underqualified loan applicants. As a data engineer at Golden Bank, you currently use DataStage to aggregate your anonymized mortgage applications data with the mortgage applicants’ personally identifiable information. Your lenders use this information to help them decide whether they should approve or deny mortgage applications. Your leadership added some risk analysts who calculate daily what interest rate they recommend offering to borrowers in each credit score range. You need to integrate this information into the spreadsheet you share with the lenders. The spreadsheet includes credit score information for each applicant, the applicant’s total debt, and an interest-rate lookup table. Lastly, load your data into a target output CSV file.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will use DataStage to join applicant and application data, filter by state, join applicant credit scores, calculate total debt, look up the mortgage interest rate to offer based on credit score ranges, and output the result to a CSV file. Click the image to view a larger image.

Animated image

Preview the tutorial

In this tutorial, you will complete these tasks:

Watch Video Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method to learn the concepts and tasks in this documentation.





Tips for completing this tutorial
Here are some tips for successfully completing this tutorial.

Use the video picture-in-picture

Tip: Start the video, then as you scroll through the tutorial, the video moves to picture-in-picture mode. Close the video table of contents for the best experience with picture-in-picture. You can use picture-in-picture mode so you can follow the video as you complete the tasks in this tutorial. Click the timestamps for each task to follow along.

The following animated image shows how to use the video picture-in-picture and table of contents features:

How to use picture-in-picture and chapters

Get help in the community

If you need help with this tutorial, you can ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Set up your browser windows

For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Tip: If you encounter a guided tour while completing this tutorial in the user interface, click Maybe later.



Set up the prerequisites

Prerequisites

Sign up for Cloud Pak for Data as a Service

You must sign up for Cloud Pak for Data as a Service and provision the necessary services for the Data integration use case.

  • If you have an existing Cloud Pak for Data as a Service account, then you can get started with this tutorial. If you have a Lite plan account, only one user per account can run this tutorial.
  • If you don't have a Cloud Pak for Data as a Service account yet, then sign up for a data fabric trial.

Video icon Watch the following video to learn about data fabric in Cloud Pak for Data.

This video provides a visual method to learn the concepts and tasks in this documentation.

Verify the necessary provisioned services

preview tutorial video To preview this task, watch the video beginning at 01:08.

Important: The DataStage service is available in the Dallas and Frankfurt regions only. If necessary, switch to the Dallas or Frankfurt region before continuing.

Follow these steps to verify or provision the necessary services:

  1. In Cloud Pak for Data, verify that you are in the Dallas or Frankfurt region. If not, click the region drop down, and then select Dallas or Frankfurt.
    Change region

  2. From the Navigation menu Navigation menu, choose Services > Service instances.

  3. Use the Product drop-down list to determine whether a DataStage service instance exists.

  4. If you need to create a DataStage service instance, click Add service.

  5. Select DataStage.

    1. For the region, select Dallas or Frankfurt.

    2. Select the Lite plan.

    3. Click Create.

  6. Repeat these steps to verify or provision the following additional services:

    • IBM Knowledge Catalog
    • Cloud Object Storage

Checkpoint icon Check your progress

The following image shows the provisioned service instances:

Provisioned services

Create the sample project

preview tutorial video To preview this task, watch the video beginning at 01:44.

If you already have the sample project for this tutorial, then skip to Task 1. Otherwise, follow these steps:

  1. Access the Data integration sample project in the Resource hub.

  2. Click Create project.

  3. If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.

  4. Click Create.

  5. Wait for the project import to complete, and then click View new project to verify that the project and assets were created successfully.

  6. Click the Assets tab to see the connections and DataStage flow.

Note: You might see a guided tour showing the tutorials that are included with this use case. The links in the guided tour will open these tutorial instructions.

Checkpoint icon Check your progress

The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.

Sample project




Task 1: Run an existing DataStage flow

preview tutorial video To preview this task, watch the video beginning at 02:29.

Start with a basic DataStage flow that joins the mortgage applicants and mortgage applications data sets, and then outputs that result to a CSV file in the project. Follow these steps to run the DataStage flow:

  1. Start in the Data integration project. If you don't have the project open, follow these steps:

    1. From the Navigation menu Navigation menu, choose Projects > View all projects.

    2. Open the Data integration project.

  2. Click the Assets tab to see all of the assets in the project.

  3. Click Flows > DataStage flows.

    Tip: If you don't see any DataStage flows, then go back to view your service instances to verify your DataStage instance provisioned successfully. See Provision the necessary services.
  4. Click the Data integration flow in the list to open it. This flow joins the Mortgage Applicants and Mortgage Applications tables that are stored in Db2 Warehouse, filters the data to those records from the State of California, and creates a sequential file in CSV format as the output.

  5. Click the zoom in icon Zoom in and zoom out icon Zoom out on the toolbar to set your preferred view of the canvas.

  6. Double-click MORTGAGE_APPLICATIONS_1 node to view the settings.

    1. Expand the Properties section.

    2. Scroll down, and then click Preview data. This data set includes information that is captured on a mortgage application.

    3. Click Close.

  7. Double-click MORTGAGE_APPLICANTS_1 node to view the settings.

    1. Expand the Properties section.

    2. Scroll down, and click Preview data. This data set includes information about mortgage applicants who applied for a loan.

    3. Optional: Visualize the data.

      1. Click the Chart panel.

      2. In the Columns to visualize list, select STATE.

      3. Click Visualize data to see a pie chart showing the distribution of the data by state.

      4. Click the Treemap icon to see the same data in a treemap chart.

    4. Click Close.

  8. Double-click Join_on_ID node to view the settings.

    1. Expand the Properties section.

    2. Note that the join key is the ID column.
      Join_on_ID join key

    3. Click Cancel to close the settings.

  9. Click the Logs icon View log on the toolbar so you can watch the flow's progress.

  10. Click Compile, and then click Run. Alternatively, you can click Run which compiles and then runs the DataStage flow. The run can take about one minute to complete.

  11. View the logs. You can use the total rows and rows/sec for each step in the flow to visually verify that the filter is working as expected.

  12. When the run completes successfully, click Data integration in the navigation trail to return to the project.
    Navigation trail

  13. On the Assets tab, click Data > Data assets.

  14. Open the MORTGAGE_DATA.CSV file. You can see that this file contains the columns from both the mortgage applicants and mortgage applications data sets.

Checkpoint icon Check your progress

The following image shows resulting CSV file. The next task is to edit the DataStage flow.

CSV file




Overview: Edit the DataStage flow

Now that you joined the mortgage applicant and application data, you are ready to edit the DataStage flow to:

  • Task 2: Specify a key column for the Join stage.
  • Task 3: Add credit score data from a PostgreSQL database.
  • Task 4: Add a Join stage to join the credit score data with the applicant and application data.
  • Task 5: Add a Transformer stage to calculate total debt.
  • Task 6: Add interest rate data from a MongoDB database.
  • Task 7: Add a Lookup stage to look up interest rates for applicants based on their credit scores and Golden Bank's daily interest rate ranges.



Task 2: Specify the key column for the Join stage

preview tutorial video To preview this task, watch the video beginning at 04:42.

Identifying a key column indicates to DataStage that column contains unique values. The Join_on_ID node joins the mortgage applicants and mortgage applications data sets using the ID column for the join key. The next phase is to join the resulting data set with the credit score data. Later, you will join the resulting filtered data with the credit score data set. The second join will use the EMAIL_ADDRESS column as the join key. In this task, you edit the DataStage flow to specify the EMAIL_ADDRESS column as the key column for the resulting data set when it is joined with the credit score data.

The following animated image provides a visual representation as an alternative to the description of the two join nodes. Click the image to view a larger image.

Join nodes

Follow these steps to change the Join node settings:

  1. Click Data integration in the navigation trail to return to the project.
    Navigation trail

  2. On the Assets tab, click Flows > DataStage flows.

  3. Open the Data integration flow.

  4. Double-click the Join_on_ID node to edit the settings.

  5. Click the Output tab, and expand the Columns section to see a list of the columns in the joined data set.

  6. Click Edit.

  7. For the EMAIL_ADDRESS column name, select Key.

  8. Click Apply and return to return to the Join_on_ID node settings.

  9. Click Save to save the Join_on_ID node settings.

Checkpoint icon Check your progress

The following image shows the DataStage flow with the edited Join_on_id stage. Now that you identified the EMAIL_ADDRESS column as the key column, you can add the PostgreSQL data containing the applicants credit scores.

Join_on_id stage




Task 3: Add credit score data from a PostgreSQL database

preview tutorial video To preview this task, watch the video beginning at 05:23.

Follow these steps to add the credit score data that is stored in a PostgreSQL database to the DataStage flow:

  1. In the node palette, expand the Connectors section.

  2. Drag the Asset browser connector to the canvas beside the MORTGAGE_APPLICANTS_1 node.

  3. Locate the asset by selecting Connection > Data Fabric Trial - Databases for PostgreSQL > BANKING > CREDIT_SCORE.

    Note: Click the connection or schema name instead of the checkbox to expand the connection and schema.

    Credit score preview

  4. Click the Preview icon View to preview the credit score data for each applicant.

  5. Click Add.

Checkpoint icon Check your progress

The following image shows the DataStage flow with the credit score asset added. Now that you added the credit score data to the canvas, you need to join the applicant, application, and credit score data.

Credit score data asset




Task 4: Add a Join stage to join the credit score data with the applicant and application data

preview tutorial video To preview this task, watch the video beginning at 05:54.

Follow these steps to add another Join stage to join the filtered mortgage application and mortgage applicant joined data with the credit score data in the DataStage flow:

  1. In the node palette, expand the Stages section.

  2. Drag the Join stage on to the canvas, and drop the node on the link line between the Filter_State_Code and Sequential_file_1 nodes.

  3. Hover over the CREDIT_SCORE_1 connector to see the arrow. Connect the arrow to the Join stage.

  4. Double-click the CREDIT_SCORE_1 node to edit the settings.

    1. Click the Output tab, and expand the Columns section to see a list of the columns in the joined data set.

    2. Click Edit.

    3. For the EMAIL_ADDRESS and CREDIT_SCORE column names, select Key.

    4. Click Apply and return to return to the CREDIT_SCORE_1 node settings.

    5. Click Save to save the CREDIT_SCORE_1 node settings.

  5. Double-click the Join_1 node to edit the settings.

    1. Expand the Properties section.

    2. Click Add key.

      1. Click Add key again.

      2. Select EMAIL_ADDRESS from the list of possible keys.

      3. Click Apply.

    3. Click Apply and return to return to the Join_1 node settings.

    4. Change the Join_1 node name to Join_on_email.

    5. Click Save to save the Join_1 node settings.

Checkpoint icon Check your progress

The following image shows the DataStage flow with a second Join stage added. Now that you joined the application, applicant, and credit score data, you need to add a Transformer stage to calculate each applicant's total debt.

Join_on_email stage




Task 5: Add a Transformer stage to calculate total debt

preview tutorial video To preview this task, watch the video beginning at 07:08.

Follow these steps to add a Transformer stage that creates a new column by summing the LOAN_AMOUNT and CREDITCARD_DEBT columns:

  1. In the Stages section, drag the Transformer stage on to the canvas, and drop the node on the link line between the Join_on_email and Sequential_file_1 nodes.

  2. Double-click the Transformer node to edit the settings.

  3. Click the Output tab.

    1. Click Add column.

    2. Scroll down in the list of columns to see the new column.

    3. Name the column TOTAL_DEBT.

    4. Click the Edit icon Edit in the row’s Derivation column.

    5. Click the Calculator icon Calculator in the Derivation column to open the expression builder.

    6. Search for LOAN_AMOUNT, and double-click the column name to add it to the expression. Note that the link number is appended to the column name.

    7. Type a plus sign +.

    8. Search for CREDITCARD_DEBT, and then double-click the column name to add it to the expression. Note that the link number is appended to the column name.

    9. Verify that the final expression is Link_7.LOAN_AMOUNT + Link_7.CREDITCARD_DEBT.

      Note: Your link number may be different.
    10. Click Apply and return to return to the Transformer page.

    11. For the CREDIT_SCORE column name, select Key.

  4. Click the Stage tab.

    1. Select the Advanced page.

    2. Change the Execution mode to Sequential.

  5. Click Save and return to return to the canvas.

Checkpoint icon Check your progress

The following image shows the DataStage flow with the Transformer stage added. Now that you calcluated each applicant's total debt, you need to add the table of interest rates to offer based on credit score ranges.

Transformer stage




Task 6: Add interest rate data from a MongoDB database

preview tutorial video To preview this task, watch the video beginning at 08:15.

Follow these steps to include the interest rates in the flow by adding a data asset connector to a MongoDB database:

  1. In the node palette, expand the Connectors section.

  2. Drag the Asset browser connector on to the canvas beside the CREDIT_SCORE_1 node.

  3. Locate the asset by selecting Connection > Data Fabric Trial - Mongo DB > DOCUMENT > DS_INTEREST_RATES.

  4. Click the Preview icon Preview to preview interest rates for each credit score range.
    View data asset
    You can use the values in the STARTING_LIMIT and ENDING_LIMIT columns to look up the appropriate interest rate based on the applicant's credit score. The ID column is not needed, so you will delete that column in the next step.

  5. Click Add.

Checkpoint icon Check your progress

The following image shows the DataStage flow with the interest rates data asset added from the MongoDB external source. Now that you added the interest rates table, you can look up the appropriate interest rate for each applicant.

Interest rates data asset




Task 7: Add a Lookup stage to look up interest rates for applicants

preview tutorial video To preview this task, watch the video beginning at 09:00.

Based on each applicant's credit score, you want to look up the appropriate interest rate. Follow these steps to add a Lookup stage and specify the range for starting and ending credit score limits for each interest rate:

  1. In the Stages section, drag the Lookup stage on to the canvas, and drop the node on the link line between the Transformer_1 and Sequential_file_1 nodes.

  2. Connect the DS_INTEREST_RATES_1 connector to the Lookup_1 stage.

  3. Double-click the DS_INTEREST_RATES_1 node to edit the settings.

  4. Click the Output tab.

    1. Expand the Columns section, and click Edit.

    2. Select the _ID column.

    3. Click the Delete icon Trash to delete the _ID column.

    4. Click Apply and return to return to the DS_INTEREST_RATES_1 node settings.

    5. Click Save to save the changes to the DS_INTEREST_RATES_1 node.

  5. Double-click the Lookup_1 node to edit the settings.

  6. Expand the Properties section.

    1. For the Apply range to columns field, select CREDIT_SCORE. The Reference Links, Operator, and Range column fields display.

    2. For the Reference Links, select Link_9.

      Note: Your link number may be different.
    3. For the first Operator, select <=.

    4. For the first Range column, select ENDING_LIMIT.

    5. For the second Operator, select >=.

    6. For the second Range column, select STARTING_LIMIT.

  7. Click the Output tab.

    1. Expand the Columns section, and click Edit.

    2. Select the STARTING_LIMIT and ENDING_LIMIT columns.

    3. Click the Delete icon Trash to delete these unnecessary STARTING_LIMIT and ENDING_LIMIT columns.

    4. Click Apply and return to return to the Lookup_1 node settings.

    5. Click Save to save the changes to the Lookup_1 node.

Checkpoint icon Check your progress

The following image shows that the DataStage flow with the Lookup stage added. The DataStage flow is now complete. The last task before running the flow is to specify the name for the output file.

Lookup stage




Task 8: Edit the Sequential file node and run the DataStage flow

preview tutorial video To preview this task, watch the video beginning at 10:22.

Follow these steps to edit the Sequential file node to create a final output file as a data asset in the project, and then compile and run the DataStage flow:

  1. Double-click the Sequential_file_1 node to edit the settings.

  2. Click the Input tab.

  3. Expand the Properties section.

  4. For the Target File, copy and paste MORTGAGE_APPLICANTS_INTEREST_RATES.CSV for the file name.

  5. Select Create data asset.

  6. For the First line is column names field, select True.

  7. Click Save.

  8. Click Run which compiles and then runs the DataStage flow. The job takes about 1 minute to complete.

  9. Click Logs on the toolbar to watch the flow's progress. It is normal to see warnings during the run, and then you see that the flow ran successfully.

Checkpoint icon Check your progress

The following image shows that the DataStage flow ran successfully. Now that the DataStage flow created the output file, you need to create the catalog where you will publish the output file.

DataStage run complete




Task 9: Create a catalog to store the published data asset

preview tutorial video To preview this task, watch the video beginning at 11:00.

Other data engineers and business analysts at Golden Bank need access to the mortgage interest rates. With the IBM Knowledge Catalog Lite plan, you can create two catalogs. If you already have a catalog, skip this step. Otherwise, complete the following steps to create a catalog to which you can publish the interest rates data set.

  1. From the Navigation menu Navigation menu, choose Catalogs > View all catalogs.

  2. If you see a Mortgage Approval Catalog on the Catalogs page, then skip to Task 10: View the output and publish to a catalog. Otherwise, follow these steps to create a new catalog:

  3. Click Create Catalog.

  4. For the Name, copy and paste the catalog name exactly as shown with no leading or trailing spaces:

    Mortgage Approval Catalog
    
  5. Select Enforce data protection rules, confirm the selection, and accept the defaults for the other fields.

  6. Click Create.

Checkpoint icon Check your progress

The following image shows your catalog. Now that you the Mortgage Approval Catalog exists, you can publish the output file to the catalog.

Mortgage Approval Catalog




Task 10: View the output and publish to a catalog

preview tutorial video To preview this task, watch the video beginning at 11:31.

Follow these steps to view the output file in the project, and then publish it to a catalog:

  1. From the Navigation menu Navigation menu, choose Projects > View all projects.

  2. Open the Data integration project.

  3. On the Assets tab, click Data > Data assets.

  4. Open the MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file.

  5. Scroll to see all of the columns in your integrated data set with interest rates at the end of each data entry.

  6. Click Data integration in the navigation trail to return to the project.

  7. On the Assets tab, click the Overflow menu Overflow menu at the end of the row for the MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file, and choose Publish to catalog.

    1. Select the Mortgage Approval Catalog (or your catalog name) from the list, and click Next.

    2. Select the option to Go to the catalog after publishing it, and click Next.

    3. Review the assets, and click Publish.

  8. In the catalog, search for Mortgage.

  9. Open the MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file.

  10. Click Asset tab to view the data.

Checkpoint icon Check your progress

The following image shows the MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file in catalog. The data that lenders need to make mortgage decisions is now available.

MORTGAGE_APPLICANTS_INTEREST_RATES.CSV file in catalog



As a data engineer at Golden Bank, you integrated the mortgage applicant, application, credit rating, and credit score information, and published that data in a catalog.

Cleanup (Optional)

If you would like to retake the tutorials in the Data integration use case, delete the following artifacts.

Artifact How to delete
Mortgage Approval Catalog Delete a catalog
Data integration sample project Delete a project

Next steps

Learn more

Parent topic: Use case tutorials