Quick start: Use DataStage to load enterprise data into Snowflake
DataStage is a modernized data integration tool that helps users build trusted data pipelines, orchestrate data across distributed landscapes, and move and transform data between cloud sources and data warehouses. It provides a Snowflake connector, among many others, to write, read, and load data into Snowflake and integrate it into the ETL job design. This quick start tutorial demonstrates how to load enterprise data into Snowflake quickly and efficiently through DataStage.
In this tutorial, you will complete these tasks:
- Task 1: Create a Snowflake data warehouse.
- Task 2: Create the sample project and provision the DataStage service.
- Task 3: Create a connection to your Snowflake data warehouse.
- Task 4: Create a DataStage flow.
- Task 5: Design DataStage flow.
- Task 6: Run the DataStage flow.
- Task 7: View the data asset in the Snowflake data warehouse.
This tutorial takes approximately 20 minutes to complete.
Preview the tutorial
Watch this video to see how to create a simple DataStage flow.
This video provides a visual method as an alternative to following the written steps in this documentation.
Prerequisites
Sign up for the DataStage trial on Cloud Pak for Data as a Service
You must sign up for Cloud Pak for Data as a Service and provision the DataStage service. Go to the DataStage trial page. Using this link, the following services are provisioned:
- DataStage
- Cloud Object Storage
Check your progress
The following image shows the Cloud Pak for Data home page:
For more information on provisioning services, see Creating and managing IBM Cloud services.
Check your progress
The following image shows the required provisioned service instances. You are now ready to sign up for the Snowflake trial.
Sign up for a Snowflake trial account
-
Go to https://www.snowflake.com/.
-
Click START FOR FREE.
-
Complete the signup form, and click Continue.
-
On the START YOUR 30-DAY FREE TRIAL page, complete these steps:
-
Choose a Snowflake edition.
-
Choose a cloud provider.
-
Click GET STARTED.
-
-
Skip the questions until you see the Thanks for signing up with Snowflake message.
-
Access your email account, open the email from Snowflake Computing, and click CLICK TO ACTIVATE.
-
Provide a username and password, and click Get started.
Check your progress
The following image shows the Snowflake dashboard:
Task 1: Create a Snowflake data warehouse
Your goal is to use DataStage to load data into your Snowflake account. To accomplish that, you need a data warehouse in your Snowflake account. Follow these steps to create a data warehouse in your Snowflake account:
-
Log in to your Snowflake trial account.
-
In the navigation panel, click Admin > Warehouses.
-
Click + Warehouse.
-
For the Name, type:
DATASTAGEDATA
-
Accept the defaults for the rest of the fields, and click Create Warehouse.
-
-
In the navigation panel, click Data.
-
On the Databases page, click + Database.
-
For the Name, type
DATASTAGEDB
, and click Create.
-
-
Click the newly created DATASTAGEDB database in the list, and click + Schema.
-
For the Schema name, type:
MORTGAGE
-
Click Create.
-
-
In the list of databases, select DATASTAGEDB > MORTGAGE.
Check your progress
The following image shows the DATASTAGEDB database in Snowflake. You are now ready to create the sample project in Cloud Pak for Data for the connection information and the DataStage flow.
Task 2: Create the DataStage project
You need a project to store the connections to the external data sources and the DataStage flow. Follow these steps to create the sample project:
-
Access the Data integration sample project in the Gallery.
-
Click Create project.
-
If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.
-
Click Create.
-
Wait for the project import to complete, and then click View new project.
-
Click the Assets tab to verify that the project and assets were created successfully.
Check your progress
The following image shows the sample project. Now you are ready to create the connection to Snowflake.
Task 3: Create a connection to your Snowflake data warehouse
You need to add the connection information to your project so you can access the Snowflake data warehouse in your DataStage flow. Follow these steps to create a connection asset in your project:
-
On the Assets tab, click New asset.
-
In the Data access tools section, click Connection.
-
Search for Snowflake in the Find connection types search field.
-
Select the Snowflake connection type.
-
Click Select.
-
On the Create connection: Snowflake page, type
Snowflake
for the connection name. -
For the Connection details, complete the following fields by using the information from the Snowflake account that you just created:
-
Account name: Your account name is a combination of your account ID, your region, and your cloud provider. You can find this information in the URL when logged in to your Snowflake account. For example, if your login URL is
https://app.snowflake.com/us-east4.gcp/iu68134
, then your account name isiu68134.us-east4.gcp
. -
Database: Type
DATASTAGEDB
-
Role: Type
ACCOUNTADMIN
-
Warehouse: Type
DATASTAGEDATA
-
Username: Type your Snowflake account username.
-
Password: Type your Snowflake account password.
-
-
Click Test Connection to test the connection to your Snowflake account.
-
If the test is successful, click Create. If prompted to create the connection without setting location and sovereignty, click Create. This action creates the Snowflake connector, which you can use to load the data from Db2 Warehouse into your Snowflake account.
Check your progress
The following image shows the new connection information. Now you are ready to create the DataStage flow.
Task 4: Create a DataStage flow
Now you are ready to create a DataStage asset in the project. Follow these steps to create the DataStage flow:
-
From the Assets tab, click New asset.
-
In the Graphical builders section, click DataStage.
-
For the Name, type:
Load Db2 data to Snowflake
-
Click Create.
Check your progress
The following image shows empty DataStage canvas. Now you are ready to design the DataStage flow.
Task 5: Design the DataStage flow
The DataStage flow contains two connector nodes: the Db2 Warehouse connector pointing to the source data asset and the Snowflake connector pointing to the target data asset, and several other nodes to join and filter data assets. Follow these steps to add the nodes to the canvas:
Add the two connector nodes
Follow these steps to add the two connector nodes to the canvas:
-
In the node palette, expand the Connectors section.
-
Drag the Asset browser connector and drop it anywhere on the empty canvas.
-
When you drop the Asset Browser connector on the canvas, you are prompted to select the asset.
-
To locate the asset, select Connection > Data Fabric Trial - Db2 Warehouse > BANKING > MORTGAGE_APPLICATION.
Tip: To expand the connection and schema, click the connection or schema name instead of the checkbox. -
Click Add to drop the Db2 Warehouse data source onto the DataStage canvas.
-
-
Double-click the MORTGAGE_APPLICATION node to see its settings.
-
Click the Output tab.
-
Check the Runtime column propagation option. DataStage is flexible about metadata. It can handle situations where the metadata is not fully defined. In this case, you select Runtime column propagation to ensure that if the DataStage job encounters extra columns that are not defined in the metadata when the job actually runs, it adopts these extra columns and propagates them through the rest of the job. This feature allows your flow design to be flexible for schema drift.
-
Click Save.
Because you are reading data from Db2 Warehouse into Snowflake, the Db2 Warehouse connector is positioned first in the flow. Your goal is to load the Db2 Warehouse data into Snowflake. Next, you add a Snowflake connector that reads the data from the Db2 Warehouse connector. Thus, the Snowflake connector is positioned second in the flow.
-
-
In the Node palette, expand the Connectors section.
-
Drag the Asset browser connector and drop it onto the canvas so it is positioned as the second node.
-
To locate the asset, select Connection > Snowflake > MORTGAGE.
Tip: Click the checkbox to select the MORTGAGE schema name. -
Click Add to drop the Snowflake connection onto the DataStage canvas.
-
-
To link the nodes together, hover over the Mortgage_Application_1 node until you see an arrow. Drag the arrow to the Snowflake connection to connect the two nodes.
-
Double-click the MORTGAGE_DATA connector to see its settings.
-
Change the node name to
Snowflake_mortgage_data
-
In the settings side panel, click the Input tab.
-
Expand the Usage section.
-
For Write mode, select Insert.
-
For the Table name, add
APPLICATION
after the schema name, so the full table name readsMORTGAGE.APPLICATION
. -
For the Table action, select Create. This setting creates the table in the specified database and schema in Snowflake, and then loads the enterprise data into that table.
-
Accept the default values for all other fields in the Actions section.
-
Click Save to update the changes, and return to the DataStage flow.
-
Add the nodes to join and filter data
Now you have a basic DataStage flow to load the data into Snowflake. Follow these steps to add several nodes to join and filter data:
Asset connector node
-
In the node palette, expand the Connectors section.
-
Drag the Asset browser connector on to the canvas close to the MORTGAGE_APPLICATION node.
-
When you drop the Asset Browser connector on the canvas, you are prompted to select the asset.
-
To locate the asset, select Connection > Data Fabric Trial - Db2 Warehouse > BANKING > MORTGAGE_APPLICANT.
Tip: To expand the connection and schema, click the connection or schema name instead of the checkbox. -
Click Add to drop the Db2 Warehouse data source onto the DataStage canvas.
-
-
Double-click the MORTGAGE_APPLICANT node to see its settings.
-
Click the Output tab.
-
Check the Runtime column propagation option. As mentioned previously, this option accommodates schema drift.
-
Click Save.
-
Join stage node
-
In the Node palette, expand the Stages section.
-
In the Node palette, drag the Join stage on to the canvas, and drop the node on the link line between the MORTGAGE_APPLICATION and Snowflake_mortgage_data nodes. This action maintains links from the MORTGAGE_APPLICATION node to the JOIN node to the Snowflake_mortgage_data node.
-
Hover over the MORTGAGE_APPLICANT connector to see the arrow. Connect the arrow to the Join stage.
-
Double-click the Join_1 node to edit the settings.
-
Expand the Properties section.
-
Click Add key.
-
Click Add key again.
-
Select ID from the list of possible keys.
-
Click Apply.
-
Click Apply and return to return to the Join_1 node settings.
-
-
Change the Join_1 node name to
Join_on_ID
. -
Click the Output tab.
-
Check the Runtime column propagation option to accommodate schema drift.
-
Click Save to save the Join_on_ID node settings.
-
Filter stage node
-
In the Node palette, in the Stages section, drag the Filter node to the canvas, and drop the node on the link line between the Join_on_ID and Snowflake_mortgage_data nodes.
-
Double-click the Filter_1 node to edit the settings.
-
Expand the Properties section.
-
Under Predicates, click Edit.
-
Click the Edit
icon in the Where clause column, and type
STATE_CODE='CA'
. This clause filters mortgage applications to only California applicants. -
Click Apply and return.
-
-
Click the Output tab.
- Check the Runtime column propagation option to accommodate schema drift.
-
Click Save to save the Filter node settings.
-
Check your progress
The following image shows the completed DataStage flow. Now you are ready to run the DataStage job.
Task 6: Run the DataStage job
Now you are ready to compile and run the DataStage job to load the Mortgage Application data from Db2 Warehouse into Snowflake. Follow these steps to run the DataStage job:
-
On the toolbar, click Compile. This action validates your DataStage flow.
-
When the flow compiles successfully, click Run on the toolbar to start the DataStage job. The run might take a few minutes to complete.
-
When the run completes, you see a message stating Run successful with warnings.
Check your progress
The following image shows the successful run completed. Now that the DataStage job completed successfully, you can view the new table in Snowflake.
Task 7: View the data asset in the Snowflake data warehouse
To check whether the data was loaded data into Snowflake correctly, you can go back to your Snowflake dashboard.
-
Navigate to Data > Databases.
-
Expand DATASTAGEDB > MORTGAGE > TABLES.
-
Select the APPLICATION table.
-
Under the table name, click the Data Preview tab.
-
Select the DATASTAGEDATA warehouse.
-
Click Preview to see a preview of the Mortgage Application data imported from DataStage.
Check your progress
The following image shows the loaded table in Snowflake.
You successfully loaded enterprise data from a Db2 Warehouse into Snowflake by using DataStage.
Next steps
Try other tutorials: