Data integration tutorial: Orchestrate an AI pipeline with data integration
Last updated: Nov 27, 2024

Take this tutorial to create an end-to-end pipeline that delivers concise, pre-processed, and up-to-date data stored in an external data source with the data fabric trial. Your goal is to use Orchestration Pipelines to orchestrate that end-to-end workflow to generate automated, consistent, and repeatable outcomes. The pipeline uses DataStage and AutoAI. AutoAI automates several aspects of the model building process, such as feature engineering and hyperparameter optimization. AutoAI ranks candidate algorithms, and then selects the best model.

Quick start: If you did not already create the sample project for this tutorial, access the Orchestrate an AI pipeline sample project in the Resource hub.

The story for the tutorial is that Golden Bank wants to expand its business by offering special low-rate mortgage renewals for online applications. Online applications expand the bank’s customer reach and reduce the bank’s application processing costs. The team will use Orchestration Pipelines to create a data pipeline that delivers up-to-date data on all mortgage applicants that lenders can use for decision making. The data is stored in Db2 Warehouse. You need to prepare the data because it is potentially incomplete, outdated, and might be obfuscated or entirely inaccessible due to data privacy and sovereignty policies. Then, the team needs to build a mortgage approval model from trusted data, and then deploy and test the model in a pre-production environment.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will edit and run a pipeline to build and deploy a machine learning model. Click the image to view a larger image.

Animated image

Preview the tutorial

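In this tutorial, you will complete these tasks:

  • Set up the prerequisites
  • Task 1: View the assets in the sample project
  • Task 2: Explore an existing pipeline
  • Task 3: Add a node to the pipeline
  • Task 4: Run the pipeline
  • Task 5: View the assets, deployed model, and online deployment
  • Cleanup (Optional)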

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface that is shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method to learn the concepts and tasks in this documentation.





Tips for completing this tutorial
Here are some tips for successfully completing this tutorial.

Use the video picture-in-picture

Tip: Start the video. As you scroll through the tutorial, the video moves to picture-in-picture mode so that you can follow the video while you complete the tasks. For the best experience with picture-in-picture, close the video table of contents. Click the timestamps for each task to follow along.

The following animated image shows how to use the video picture-in-picture and table of contents features:

How to use picture-in-picture and chapters

Get help in the community

If you need help with this tutorial, you can ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Set up your browser windows

For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Tip: If you encounter a guided tour while completing this tutorial in the user interface, click Maybe later.



Set up the prerequisites

Sign up for Cloud Pak for Data as a Service

You must sign up for Cloud Pak for Data as a Service and provision the necessary services for the Data integration use case.

  • If you have an existing Cloud Pak for Data as a Service account, then you can get started with this tutorial. If you have a Lite plan account, only one user per account can run this tutorial.
  • If you don't have a Cloud Pak for Data as a Service account yet, then sign up for a data fabric trial.

Watch the following video to learn about data fabric in Cloud Pak for Data.


Verify the necessary provisioned services

To preview this task, watch the video beginning at 00:37.

Follow these steps to verify or provision the necessary services:

  1. From the Navigation menu, choose Services > Service instances.

  2. Use the Product drop-down list to determine whether an existing watsonx.ai Studio service instance exists.

  3. If you need to create a watsonx.ai Studio service instance, click Add service.

    1. Select watsonx.ai Studio.

    2. Select the Lite plan.

    3. Click Create.

  4. Wait while the watsonx.ai Studio service is provisioned, which might take a few minutes to complete.

  5. Repeat these steps to verify or provision the following additional services:

    • watsonx.ai Runtime
    • DataStage
    • Cloud Object Storage
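
The verification above amounts to comparing the services you have against the four that the tutorial needs. The following is a minimal sketch of that comparison, assuming you have typed the names of your provisioned services into a list by hand. The function name and the example account are hypothetical, and nothing here calls IBM Cloud.

```python
# Services that this tutorial asks you to verify or provision.
REQUIRED_SERVICES = {
    "watsonx.ai Studio",
    "watsonx.ai Runtime",
    "DataStage",
    "Cloud Object Storage",
}


def missing_services(provisioned):
    """Return the required services that are absent from `provisioned`, sorted by name."""
    return sorted(REQUIRED_SERVICES - set(provisioned))


# Hypothetical account that has only two of the four services.
example_account = ["watsonx.ai Studio", "Cloud Object Storage"]
print(missing_services(example_account))
```

An empty list means that every required service is present and you can move on to creating the sample project.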

Check your progress

The following image shows the provisioned service instances:

Provisioned services

Create the sample project

To preview this task, watch the video beginning at 01:14.

If you already have the sample project for this tutorial, then skip this task. Otherwise, follow these steps:

  1. Access the Orchestrate an AI pipeline sample project in the Resource hub.

  2. Click Create project.

  3. If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.

  4. Click Create.

  5. Wait for the project import to complete, and then click View new project to verify that the project and assets were created successfully.

  6. Click the Assets tab to see the connection, DataStage flows and data definition, and the pipeline.

Note: You might see a guided tour showing the tutorials that are included with this use case. The links in the guided tour will open these tutorial instructions.
Tip: If you don't see any DataStage flows, then go back to view your service instances to verify that your DataStage instance was provisioned successfully. See Verify the necessary provisioned services.

Check your progress

The following image shows the Assets tab in the sample project. You are now ready to associate the watsonx.ai Runtime service with the sample project.

Associate the watsonx.ai Runtime service with the sample project

To preview this task, watch the video beginning at 02:04.

You will use watsonx.ai Runtime to create and deploy the model, so follow these steps to associate your watsonx.ai Runtime service instance with the sample project.

  1. In the Orchestrate an AI pipeline project, click the Manage tab.

  2. Click the Services and Integrations page.

  3. Click Associate service.

  4. Check the box next to your watsonx.ai Runtime service instance.

  5. Click Associate.

  6. Click Cancel to return to the Services and Integrations page.

Check your progress

The following image shows the Services and Integrations page with the watsonx.ai Runtime service listed. You are now ready to start the tutorial.

Associate service with project




Task 1: View the assets in the sample project

To preview this task, watch the video beginning at 02:26.

The sample project includes several assets including a connection, data definition, two DataStage flows, and a pipeline. Follow these steps to view those assets:

  1. Click the Assets tab in the Orchestrate an AI pipeline project, and then view All assets.

  2. All of the data assets that are used in the DataStage flows and the pipeline are stored in a Data Fabric Trial - Db2 Warehouse connection in the AI_MORTGAGE schema. The following image shows the assets from that connection:

    Db2 Warehouse tables

  3. The Integrate Mortgage Data DataStage flow integrates data about each mortgage applicant, including personally identifiable information, with their application details, credit scores, status as a commercial buyer, and the price of each applicant’s chosen home. The flow then creates a sequential file named Mortgage_Data.csv in the project that contains the joined data. The following image shows the Integrate Mortgage Data DataStage flow.

    Tip: If you don't see any DataStage flows, then go back to view your service instances to verify that your DataStage instance was provisioned successfully. See Verify the necessary provisioned services.

    Integrate Mortgage Data flow

  4. The Integrate Mortgage Approvals DataStage flow uses the output from the first DataStage flow (Mortgage_Data.csv) and further enriches the data by integrating information about each mortgage application approval. The resulting data set is saved to the project with the name Mortgage_Data_with_Approvals.csv. The following image shows the Integrate Mortgage Approvals DataStage flow:

    Integrate Mortgage Approvals flow

  5. The Definition_Mortgage_Data data definition for the Mortgage_Data_with_Approvals.csv data asset is created by the Integrate Mortgage Approvals DataStage flow. The following image shows the data definition:

    Definition Mortgage Data
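
To make the idea of the two flows concrete, the following is a minimal sketch of joining applicant tables on a shared key and writing the result as CSV text. It is not how DataStage implements the flows. The join key (ID), the function names, and the two tiny tables are assumptions for illustration, and only the column names and sample values are borrowed from the test payload later in this tutorial.

```python
import csv
import io


def join_on_id(*tables):
    """Inner-join lists of row dicts on the "ID" column (the join key is an assumption)."""
    joined = {row["ID"]: dict(row) for row in tables[0]}
    for table in tables[1:]:
        by_id = {row["ID"]: row for row in table}
        # Keep only the IDs that appear in every table, merging the columns.
        joined = {key: {**row, **by_id[key]} for key, row in joined.items() if key in by_id}
    return list(joined.values())


def to_csv(rows):
    """Serialize row dicts to CSV text, using the keys of the first row as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical two-row tables standing in for the Db2 Warehouse tables.
applicants = [{"ID": 1, "EDUCATION": "Bachelor"}, {"ID": 2, "EDUCATION": "High School"}]
credit = [{"ID": 1, "CREDIT_SCORE": 437}, {"ID": 2, "CREDIT_SCORE": 706}]
print(to_csv(join_on_id(applicants, credit)))
```

The second flow follows the same pattern: it takes the joined file as one input and adds the approval columns from another table.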

Check your progress

The following image shows all of the assets in the sample project. You are now ready to explore the pipeline in the sample project.




Task 2: Explore an existing pipeline

To preview this task, watch the video beginning at 04:00.

The sample project includes a pipeline, built with Orchestration Pipelines, that automates the following tasks:

  • Run two existing DataStage jobs.

  • Create an AutoAI experiment.

  • Run the AutoAI experiment, using the output file from the DataStage job as the training data, and save the best performing model.

  • Create a deployment space.

  • Promote the saved model to the deployment space.

Follow these steps to explore the pipeline:

  1. From the Assets tab in the Orchestrate an AI pipeline project, view All assets.

  2. Click Mortgage approval pipeline to open the pipeline.

  3. In the beginning section of the pipeline, two DataStage jobs (Integrate Mortgage Data and Integrate Mortgage Approvals) run in sequence to combine various tables from the Db2 Warehouse on Cloud connection into a cohesive labeled data set that is used as the training data for the AutoAI experiment.

  4. Double-click the Check Status node to see the condition. This condition is a decision point in the pipeline to confirm the completion of the first DataStage job with a value of either Completed or Completed With Warnings. Click Cancel to return to the pipeline.

  5. Double-click the Create AutoAI experiment node to see the settings. This node creates an AutoAI experiment with the specified settings.

    1. Review the values for the following settings:

      • AutoAI experiment name

      • Scope

      • Prediction type

      • Prediction column

      • Positive class

      • Training data split ratio

      • Algorithms to include

      • Algorithms to use

      • Optimize metric

    2. Click Cancel to close the settings.

  6. Double-click the Run AutoAI experiment node to see the settings. This node runs the AutoAI experiment that the Create AutoAI experiment node created, using the output from the Integrate Mortgage Approvals DataStage job as the training data.

    1. Review the values for the following settings:

      • AutoAI experiment

      • Training Data Assets

      • Model name prefix

    2. Click Cancel to close the settings.

  7. Between the Run AutoAI experiment and Create Deployment Space nodes, double-click the Do you want to deploy model? node to see the condition. The value of True for this condition is a decision point in the pipeline to continue to create the deployment space. Click Cancel to return to the pipeline.

  8. Double-click the Create Deployment Space node to see the settings. This node creates a new deployment space with the specified name, and requires input for your Cloud Object Storage and watsonx.ai Runtime services.

    1. Review the value for the New space name setting.

    2. For the New space COS Instance CRN field, select your Cloud Object Storage instance from the list.

    3. For the New space WML Instance CRN field, select your watsonx.ai Runtime instance from the list.

    4. Click Save.

  9. Double-click the Promote Model to Deployment Space node to see the settings. This node promotes the best model from the Run AutoAI experiment node to the deployment space created from the Create Deployment Space node.

    1. Review the values for the following settings:

      • Source Assets

      • Target

    2. Click Cancel to close the settings.
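
The two decision points described in the steps above can be sketched as plain control flow. This is a simplified model, not the pipeline's actual condition syntax. It assumes that the Check Status condition sits between the two DataStage jobs and that the pipeline stops when a condition is not met. The function name is hypothetical.

```python
# Statuses that the Check Status condition accepts, per the tutorial text.
OK_STATUSES = {"Completed", "Completed With Warnings"}


def plan_pipeline(first_job_status, deploy_model):
    """Return the ordered node names that would run under the assumptions in the lead-in."""
    steps = ["Integrate Mortgage Data"]
    if first_job_status not in OK_STATUSES:
        # Assumption: the pipeline does not continue past a failed check.
        return steps
    steps += [
        "Integrate Mortgage Approvals",
        "Create AutoAI experiment",
        "Run AutoAI experiment",
    ]
    if deploy_model:
        # The "Do you want to deploy model?" condition evaluated to True.
        steps += ["Create Deployment Space", "Promote Model to Deployment Space"]
    return steps


print(plan_pipeline("Completed With Warnings", deploy_model=True))
```

Calling the function with deploy_model=False shows the shorter path, which ends at the Run AutoAI experiment node.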

Check your progress

The following image shows the initial pipeline. You are now ready to edit the pipeline to add a node.

Initial pipeline




Task 3: Add a node to the pipeline

To preview this task, watch the video beginning at 06:23.

The pipeline creates the model, creates a deployment space, and then promotes the model to that deployment space. You need to add a node to create an online deployment. Follow these steps to edit the pipeline to automate creating an online deployment:

  1. Add the Create Online Deployment node to the canvas:

    1. Expand the Create section in the node palette.

    2. Drag the Create online deployment node onto the canvas, and drop the node after the Promote Model to Deployment Space node.

  2. Hover over the Promote Model to Deployment Space node to see the arrow. Connect the arrow to the Create online deployment node.

    Note: The node names in your pipeline might differ from the following animated image.

    Pipeline connect nodes

  3. Connect the Create online deployment for promoted model comment to the Create online deployment node by connecting the circle on the comment box to the node.

    Note: The node names in your pipeline might differ from the following animated image.

    Pipeline comment

  4. Double-click the Create online deployment node to see the settings.

  5. Change the node name to Create Online Deployment.

  6. Next to ML asset, click Select from another node from the menu.

    Select from another node ML asset

  7. Select the Promote Model to Deployment Space node from the list. The node ID winning_model is selected.

  8. For the New deployment name, type mortgage approval model deployment.

  9. For Creation Mode, select Overwrite.

  10. Click Save to save the Create Online Deployment node settings.
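
As a quick reference, the values entered in the steps above can be collected in one place. The following sketch records them in a small data class and checks that none are blank. The class and its field names are hypothetical and do not correspond to any Orchestration Pipelines API. Only the values come from this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnlineDeploymentSettings:
    """Hypothetical container for the node settings entered in the UI."""
    node_name: str
    ml_asset_source_node: str
    ml_asset_output: str
    deployment_name: str
    creation_mode: str


settings = OnlineDeploymentSettings(
    node_name="Create Online Deployment",
    ml_asset_source_node="Promote Model to Deployment Space",
    ml_asset_output="winning_model",
    deployment_name="mortgage approval model deployment",
    creation_mode="Overwrite",
)


def unset_fields(node_settings):
    """Return the names of the settings that are empty or only whitespace."""
    return [name for name, value in vars(node_settings).items() if not value.strip()]


print(unset_fields(settings))
```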

Check your progress

The following image shows the completed pipeline. You are now ready to run the pipeline.

Completed pipeline




Task 4: Run the pipeline

To preview this task, watch the video beginning at 07:38.

Now that the pipeline is complete, follow these steps to run the pipeline:

  1. From the toolbar, click Run pipeline > Trial run.

  2. On the Define pipeline parameters page, select True for the deployment.

    • If set to True, then the pipeline verifies the deployed model and scores the model.

    • If set to False, then the pipeline verifies that the model was created in the project by the AutoAI experiment, and reviews the model information and training metrics.

  3. If this is your first time running a pipeline, you are prompted to provide an API key. Pipeline assets use your personal IBM Cloud API key to run operations securely without disruption.

    • If you have an existing API key, click Use existing API key, paste the API key, and click Save.

    • If you don't have an existing API key, click Generate new API key, provide a name, and click Save. Copy the API key, and then save the API key for future use. When you're done, click Close.

  4. Click Run to start running the pipeline.

  5. Scroll through consolidated logs while the pipeline is running. The trial run might take up to 10 minutes to complete.

  6. As each operation completes, select the node for that operation on the canvas.

  7. On the Node Inspector tab, view the details of the operation.

  8. Click the Node output tab to see a summary of the output for each node operation.

Check your progress

The following image shows the pipeline after it completed the trial run. You are now ready to review the assets that the pipeline created.

Completed run of pipeline




Task 5: View the assets, deployed model, and online deployment

To preview this task, watch the video beginning at 09:48.

The pipeline created several assets. Follow these steps to view the assets:

  1. Click the Orchestrate an AI pipeline project name in the navigation trail to return to the project.

    Navigation trail

  2. On the Assets tab, view All assets.

  3. View the data assets.

    1. Click the Mortgage_Data.csv data asset. The DataStage job created this asset.

    2. Click the project name in the navigation trail to return to the Assets tab.

    3. Click the Mortgage_Data_with_Approvals.csv data asset. The DataStage job created this asset.

    4. Click the project name in the navigation trail to return to the Assets tab.

  4. View the model.

    1. Click the machine learning model asset beginning with mortgage_approval_best_model. The AutoAI experiment generated several model candidates, and chose this as the best model.

    2. Scroll through the model information.

    3. Click the project name in the navigation trail to return to the Assets tab.

  5. Click the Jobs tab in the project to see information about the runs of the two DataStage jobs and the one pipeline job.

  6. From the Navigation menu, choose Deployments.

  7. Click the Spaces tab.

  8. Click the Mortgage approval deployment space.

  9. Click the Assets tab, and see the deployed model beginning with mortgage_approval_best_model.

  10. Click the Deployments tab.

  11. Click mortgage approval model deployment to view the deployment.

    1. View the information on the API reference tab.

    2. Click the Test tab.

    3. Click the JSON input tab, and replace the sample text with the following JSON text.

      {
         "input_data": [
             {
                     "fields": [
                             "ID",
                             "NAME",
                             "STREET_ADDRESS",
                             "CITY",
                             "STATE",
                             "STATE_CODE",
                             "ZIP_CODE",
                             "EMAIL_ADDRESS",
                             "PHONE_NUMBER",
                             "GENDER",
                             "SOCIAL_SECURITY_NUMBER",
                             "EDUCATION",
                             "EMPLOYMENT_STATUS",
                             "MARITAL_STATUS",
                             "INCOME",
                             "APPLIEDONLINE",
                             "RESIDENCE",
                             "YRS_AT_CURRENT_ADDRESS",
                             "YRS_WITH_CURRENT_EMPLOYER",
                             "NUMBER_OF_CARDS",
                             "CREDITCARD_DEBT",
                             "LOANS",
                             "LOAN_AMOUNT",
                             "CREDIT_SCORE",
                             "CRM_ID",
                             "COMMERCIAL_CLIENT",
                             "COMM_FRAUD_INV",
                             "FORM_ID",
                             "PROPERTY_CITY",
                             "PROPERTY_STATE",
                             "PROPERTY_VALUE",
                             "AVG_PRICE"
                     ],
                     "values": [
                             [
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     "Bachelor",
                                     "Employed",
                                     null,
                                     144306,
                                     null,
                                     "Owner Occupier",
                                     15,
                                     19,
                                     2,
                                     7995,
                                     1,
                                     1483220,
                                     437,
                                     null,
                                     false,
                                     false,
                                     null,
                                     null,
                                     null,
                                     111563
                             ],
                             [
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     null,
                                     "High School",
                                     "Employed",
                                     null,
                                     45283,
                                     null,
                                     "Private Renting",
                                     11,
                                     13,
                                     1,
                                     1232,
                                     1,
                                     7638,
                                     706,
                                     null,
                                     false,
                                     false,
                                     null,
                                     null,
                                     null,
                                     547262
                             ]
                     ]
             }
         ]
      }
      
    4. Click Predict. The results show that the first applicant would not be approved and the second applicant would be approved.
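
If you prefer to generate a payload like the one above rather than edit it by hand, the following sketch builds the same fields and values layout from one dictionary per applicant. It is an illustration only. The helper names are hypothetical, the column list is shortened, and the second applicant deliberately omits two columns to show how gaps become null.

```python
import json


def build_payload(records, fields):
    """Build a scoring payload in the fields/values layout.

    Columns that a record omits become None, which json.dumps writes as null.
    """
    values = [[record.get(name) for name in fields] for record in records]
    return {"input_data": [{"fields": list(fields), "values": values}]}


def row_length_problems(payload):
    """Return (block, row, row length, field count) for each row whose length differs from the field count."""
    problems = []
    for b, block in enumerate(payload["input_data"]):
        expected = len(block["fields"])
        for r, row in enumerate(block["values"]):
            if len(row) != expected:
                problems.append((b, r, len(row), expected))
    return problems


# Shortened column list for illustration; the tutorial payload has more columns.
fields = ["EDUCATION", "EMPLOYMENT_STATUS", "INCOME", "CREDIT_SCORE"]
applicants = [
    {"EDUCATION": "Bachelor", "EMPLOYMENT_STATUS": "Employed", "INCOME": 144306, "CREDIT_SCORE": 437},
    {"EDUCATION": "High School", "INCOME": 45283},
]
payload = build_payload(applicants, fields)
print(json.dumps(payload, indent=2))
print(row_length_problems(payload))
```

Because every row is built from the same field list, the rows always line up with the column names. The row_length_problems check is useful for hand-edited payloads, where a row with one value too few is easy to miss.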

Check your progress

The following image shows the results of the test.

Test results predictions
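
If you copy the JSON result from the Test tab, a few lines of code can pair each returned value with its column name. The following sketch assumes that the response mirrors the request layout, with a predictions list whose entries carry fields and values. The sample response is made up for illustration and is not output from the tutorial's model. The labels and probabilities that your model returns will differ.

```python
def rows_as_dicts(response):
    """Turn each predictions block's fields/values layout into one dict per scored row."""
    rows = []
    for block in response.get("predictions", []):
        for values in block["values"]:
            rows.append(dict(zip(block["fields"], values)))
    return rows


# Made-up response for illustration only; field names and values are assumptions.
sample_response = {
    "predictions": [
        {
            "fields": ["prediction", "probability"],
            "values": [[0, [0.9, 0.1]], [1, [0.2, 0.8]]],
        }
    ]
}

for number, row in enumerate(rows_as_dicts(sample_response), start=1):
    print(f"applicant {number}: prediction={row['prediction']}")
```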



Golden Bank's team used Orchestration Pipelines to create a data pipeline that delivers up-to-date data on all mortgage applicants and a machine learning model that lenders can use for decision making.


Cleanup (Optional)

If you would like to retake this tutorial, delete the following artifacts.

Artifact | How to delete
Mortgage Approval Model Deployment in the Mortgage approval deployment space | Delete a deployment
Mortgage approval deployment space | Delete a deployment space
Orchestrate an AI pipeline sample project | Delete a project
