0 / 0
Quick start: Use DataStage to load enterprise data into Snowflake

Quick start: Use DataStage to load enterprise data into Snowflake

DataStage is a modernized data integration tool that helps users build trusted data pipelines, orchestrate data across distributed landscapes, and move and transform data between cloud sources and data warehouses. It provides a Snowflake connector, among many others, to write, read, and load data into Snowflake and integrate it into the ETL job design. This quick start tutorial demonstrates how to load enterprise data into Snowflake quickly and efficiently through DataStage.

In this tutorial, you will complete these tasks:

This tutorial takes approximately 20 minutes to complete.

Preview the tutorial

Watch Video Watch this video to see how to create a simple DataStage flow.

This video provides a visual method as an alternative to following the written steps in this documentation.

Tip: Start the video, then as you scroll through the tutorial, the video moves to picture-in-picture mode. You can use picture-in-picture mode so you can follow the video as you complete the tasks in this tutorial.

Prerequisites

Sign up for the DataStage trial on Cloud Pak for Data as a Service

You must sign up for Cloud Pak for Data as a Service and provision the DataStage service. Go to the DataStage trial page. Using this link, the following services are provisioned:

  • DataStage
  • Cloud Object Storage

Checkpoint icon for Cloud Pak for Data home page Check your progress

The following image shows the Cloud Pak for Data home page:

Cloud Pak for Data home page

For more information on provisioning services, see Creating and managing IBM Cloud services.

Checkpoint icon for Provisioned services Check your progress

The following image shows the required provisioned service instances. You are now ready to sign up for the Snowflake trial.

Provisioned services

Sign up for a Snowflake trial account

  1. Go to https://www.snowflake.com/.

  2. Click START FOR FREE.

  3. Complete the signup form, and click Continue.

  4. On the START YOUR 30-DAY FREE TRIAL page, complete these steps:

    1. Choose a Snowflake edition.

    2. Choose a cloud provider.

    3. Click GET STARTED.

  5. Skip the questions until you see the Thanks for signing up with Snowflake message.

  6. Access your email account, open the email from Snowflake Computing, and click CLICK TO ACTIVATE.

  7. Provide a username and password, and click Get started.

Checkpoint icon for Snowflake dashboard Check your progress

The following image shows the Snowflake dashboard:

Snowflake dashboard

Task 1: Create a Snowflake data warehouse

Your goal is to use DataStage to load data into your Snowflake account. To accomplish that, you need a data warehouse in your Snowflake account. Follow these steps to create a data warehouse in your Snowflake account:

  1. Log in to your Snowflake trial account.

  2. In the navigation panel, click Admin > Warehouses.

    1. Click + Warehouse.

    2. For the Name, type: DATASTAGEDATA

    3. Accept the defaults for the rest of the fields, and click Create Warehouse.

  3. In the navigation panel, click Data.

    1. On the Databases page, click + Database.

    2. For the Name, type DATASTAGEDB, and click Create.

  4. Click the newly created DATASTAGEDB database in the list, and click + Schema.

    1. For the Schema name, type: MORTGAGE

    2. Click Create.

  5. In the list of databases, select DATASTAGEDB > MORTGAGE.

Checkpoint icon for DATASTAGEDB database in Snowflake Check your progress

The following image shows the DATASTAGEDB database in Snowflake. You are now ready to create the sample project in Cloud Pak for Data for the connection information and the DataStage flow.

DATASTAGEDB database in Snowflake

Task 2: Create the DataStage project

You need a project to store the connections to the external data sources and the DataStage flow. Follow these steps to create the sample project:

  1. Access the Data integration sample project in the Gallery.

  2. Click Create project.

  3. If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.

  4. Click Create.

  5. Wait for the project import to complete, and then click View new project.

  6. Click the Assets tab to verify that the project and assets were created successfully.

Checkpoint icon for The following image shows the sample project. Check your progress

The following image shows the sample project. Now you are ready to create the connection to Snowflake.

The following image shows the sample project.

Task 3: Create a connection to your Snowflake data warehouse

You need to add the connection information to your project so you can access the Snowflake data warehouse in your DataStage flow. Follow these steps to create a connection asset in your project:

  1. On the Assets tab, click New asset.

  2. In the Data access tools section, click Connection.

  3. Search for Snowflake in the Find connection types search field.

  4. Select the Snowflake connection type.

  5. Click Select.

  6. On the Create connection: Snowflake page, type Snowflake for the connection name.

  7. For the Connection details, complete the following fields by using the information from the Snowflake account that you just created:

    • Account name: Your account name is a combination of your account ID, your region, and your cloud provider. You can find this information in the URL when logged in to your Snowflake account. For example, if your login URL is https://app.snowflake.com/us-east4.gcp/iu68134, then your account name is iu68134.us-east4.gcp.

    • Database: Type DATASTAGEDB

    • Role: Type ACCOUNTADMIN

    • Warehouse: Type DATASTAGEDATA

    • Username: Type your Snowflake account username.

    • Password: Type your Snowflake account password.

  8. Click Test Connection to test the connection to your Snowflake account.

  9. If the test is successful, click Create. If prompted to create the connection without setting location and sovereignty, click Create. This action creates the Snowflake connector, which you can use to load the data from Db2 Warehouse into your Snowflake account.

Checkpoint icon for The following image shows the new connection information. Check your progress

The following image shows the new connection information. Now you are ready to create the DataStage flow.

The following image shows the new connection information.

Task 4: Create a DataStage flow 

Now you are ready to create a DataStage asset in the project. Follow these steps to create the DataStage flow:

  1. From the Assets tab, click New asset.

  2. In the Graphical builders section, click DataStage.

  3. For the Name, type: Load Db2 data to Snowflake

  4. Click Create.

Checkpoint icon for The following image shows the empty DataStage canvas. Check your progress

The following image shows empty DataStage canvas. Now you are ready to design the DataStage flow.

The following image shows the empty DataStage canvas.

Task 5: Design the DataStage flow

The DataStage flow contains two connector nodes: the Db2 Warehouse connector pointing to the source data asset and the Snowflake connector pointing to the target data asset, and several other nodes to join and filter data assets. Follow these steps to add the nodes to the canvas:

Add the two connector nodes

Follow these steps to add the two connector nodes to the canvas:

  1. In the node palette, expand the Connectors section.

  2. Drag the Asset browser connector and drop it anywhere on the empty canvas.

    Drag first asset browser node onto canvas

  3. When you drop the Asset Browser connector on the canvas, you are prompted to select the asset.

    1. To locate the asset, select Connection > Data Fabric Trial - Db2 Warehouse > BANKING > MORTGAGE_APPLICATION.

      Tip: To expand the connection and schema, click the connection or schema name instead of the checkbox.
    2. Click Add to drop the Db2 Warehouse data source onto the DataStage canvas.

  4. Double-click the MORTGAGE_APPLICATION node to see its settings.

    1. Click the Output tab.

    2. Check the Runtime column propagation option. DataStage is flexible about metadata. It can handle situations where the metadata is not fully defined. In this case, you select Runtime column propagation to ensure that if the DataStage job encounters extra columns that are not defined in the metadata when the job actually runs, it adopts these extra columns and propagates them through the rest of the job. This feature allows your flow design to be flexible for schema drift.

    3. Click Save.

      Because you are reading data from Db2 Warehouse into Snowflake, the Db2 Warehouse connector is positioned first in the flow. Your goal is to load the Db2 Warehouse data into Snowflake. Next, you add a Snowflake connector that reads the data from the Db2 Warehouse connector. Thus, the Snowflake connector is positioned second in the flow.

  5. In the Node palette, expand the Connectors section.

  6. Drag the Asset browser connector and drop it onto the canvas so it is positioned as the second node.

    Drag second asset browser node onto canvas

    1. To locate the asset, select Connection > Snowflake > MORTGAGE.

      Tip: Click the checkbox to select the MORTGAGE schema name.
    2. Click Add to drop the Snowflake connection onto the DataStage canvas.

  7. To link the nodes together, hover over the Mortgage_Application_1 node until you see an arrow. Drag the arrow to the Snowflake connection to connect the two nodes.

    Connect the two nodes on the canvas

  8. Double-click the MORTGAGE_DATA connector to see its settings.

    1. Change the node name to Snowflake_mortgage_data

    2. In the settings side panel, click the Input tab.

    3. Expand the Usage section.

    4. For Write mode, select Insert.

    5. For the Table name, add APPLICATION after the schema name, so the full table name reads MORTGAGE.APPLICATION.

    6. For the Table action, select Create. This setting creates the table in the specified database and schema in Snowflake, and then loads the enterprise data into that table.

    7. Accept the default values for all other fields in the Actions section.

    8. Click Save to update the changes, and return to the DataStage flow.

Add the nodes to join and filter data

Now you have a basic DataStage flow to load the data into Snowflake. Follow these steps to add several nodes to join and filter data:

Asset connector node

  1. In the node palette, expand the Connectors section.

  2. Drag the Asset browser connector on to the canvas close to the MORTGAGE_APPLICATION node.

  3. When you drop the Asset Browser connector on the canvas, you are prompted to select the asset.

    1. To locate the asset, select Connection > Data Fabric Trial - Db2 Warehouse > BANKING > MORTGAGE_APPLICANT.

      Tip: To expand the connection and schema, click the connection or schema name instead of the checkbox.
    2. Click Add to drop the Db2 Warehouse data source onto the DataStage canvas.

  4. Double-click the MORTGAGE_APPLICANT node to see its settings.

    1. Click the Output tab.

    2. Check the Runtime column propagation option. As mentioned previously, this option accommodates schema drift.

    3. Click Save.

Join stage node

  1. In the Node palette, expand the Stages section.

  2. In the Node palette, drag the Join stage on to the canvas, and drop the node on the link line between the MORTGAGE_APPLICATION and Snowflake_mortgage_data nodes. This action maintains links from the MORTGAGE_APPLICATION node to the JOIN node to the Snowflake_mortgage_data node.

  3. Hover over the MORTGAGE_APPLICANT connector to see the arrow. Connect the arrow to the Join stage.

  4. Double-click the Join_1 node to edit the settings.

    1. Expand the Properties section.

    2. Click Add key.

      1. Click Add key again.

      2. Select ID from the list of possible keys.

      3. Click Apply.

      4. Click Apply and return to return to the Join_1 node settings.

    3. Change the Join_1 node name to Join_on_ID.

    4. Click the Output tab.

    5. Check the Runtime column propagation option to accommodate schema drift.

    6. Click Save to save the Join_on_ID node settings.

Filter stage node

  1. In the Node palette, in the Stages section, drag the Filter node to the canvas, and drop the node on the link line between the Join_on_ID and Snowflake_mortgage_data nodes.

  2. Double-click the Filter_1 node to edit the settings.

    1. Expand the Properties section.

    2. Under Predicates, click Edit.

      1. Click the Edit Edit icon icon in the Where clause column, and type STATE_CODE='CA'. This clause filters mortgage applications to only California applicants.

      2. Click Apply and return.

    3. Click the Output tab.

      1. Check the Runtime column propagation option to accommodate schema drift.
    4. Click Save to save the Filter node settings.

Checkpoint icon for The following image shows the completed DataStage flow. Check your progress

The following image shows the completed DataStage flow. Now you are ready to run the DataStage job.

The following image shows the completed DataStage flow.

Task 6: Run the DataStage job

Now you are ready to compile and run the DataStage job to load the Mortgage Application data from Db2 Warehouse into Snowflake. Follow these steps to run the DataStage job:

  1. On the toolbar, click Compile. This action validates your DataStage flow.

  2. When the flow compiles successfully, click Run on the toolbar to start the DataStage job. The run might take a few minutes to complete.

  3. When the run completes, you see a message stating Run successful with warnings.

Checkpoint icon for The following image shows the successful run completed. Check your progress

The following image shows the successful run completed. Now that the DataStage job completed successfully, you can view the new table in Snowflake.

The following image shows the successful run completed.

Task 7: View the data asset in the Snowflake data warehouse

To check whether the data was loaded data into Snowflake correctly, you can go back to your Snowflake dashboard.

  1. Navigate to Data > Databases.

  2. Expand DATASTAGEDB > MORTGAGE > TABLES.

  3. Select the APPLICATION table.

  4. Under the table name, click the Data Preview tab.

  5. Select the DATASTAGEDATA warehouse.

  6. Click Preview to see a preview of the Mortgage Application data imported from DataStage.

Checkpoint icon for The following image shows the loaded table in Snowflake. Check your progress

The following image shows the loaded table in Snowflake.

The following image shows the loaded table in Snowflake.

You successfully loaded enterprise data from a Db2 Warehouse into Snowflake by using DataStage.

Next steps

Try other tutorials:

Learn more