Data Refinery tutorial: Shape raw data

This tutorial demonstrates a few of the many data shaping capabilities of Data Refinery.

About this tutorial

This tutorial is for users who are new to Data Refinery. You’ll start with a source CSV file that contains data about different airlines. You’ll create, save, and run a Data Refinery flow that refines the data to one airline with columns for the total arrival and delay times, grouped by year, month, and day, with a sorted column for the average (mean) of all the delay times.

The steps in this tutorial are largely based on the steps in the video below. After you load the data asset, you can follow the video as you do the steps. It will take you approximately 30 minutes to complete this tutorial.

Figure 1. Video iconShape raw data in the Data Refinery
This video shows you how to shape raw data in Data Refinery.

Prerequisites

  1. Activate IBM Watson apps. Instructions are here.

  2. Set up a project. Choose a Standard or a Data Engineering project. You can name the project anything you like. If you already have a project that includes Data Refinery, you can use it for this tutorial. Steps to set up a project are here.

Bring the data into Data Refinery

  1. Download the airline-data.csv file (1.5 MB) External link.

    Right-click the browser window, and then choose Save Page As or Save As depending on your browser. (For Safari: Choose Format: Page Source.) Make sure the downloaded file name is airline-data.csv.

  2. Add the airline-data.csv file to your project. Follow the steps to add a local file here.

    The airline-data.csv file is added to your project as a data asset.

  3. Click the airline-data.csv data asset to preview its contents.

  4. Click the Refine link to open a sample of the file in Data Refinery.

Review the data with Profile and Visualizations

  1. Click the Profile tab to review the frequency distribution of the data so that you can find the outliers. The statistics show the minimum, maximum, mean, and number of unique values in each column.

  2. Click the Visualizations tab to see the data in charts. Use the different perspectives available in the charts to identify patterns, connections, and relationships within the data.

Tip: Use the Profile and Visualizations pages to view changes in the data as you refine it.

Data Refinery operations

Data Refinery uses two kinds of operations to refine data, GUI operations and coding operations. You will use both kinds of operations in this tutorial.

Add Operation button and coding operation text box

  • GUI operations can consist of multiple steps. Choose a GUI operation from the + Operation button. GUI operations are also available from each column’s actions menu three vertical dots representing the actions menu.

  • Coding operations are dplyr R library operations, functions, and logical operators. Most of the operations have interactive help. Click the operation name in the command-line text box to see the coding operations and their syntax options.

Refine the data

  1. Go back to the Data tab.

  2. Select the Year column. Click the actions menu three vertical dots representing the actions menu and choose Sort descending.

  3. Focus on the delays for a specific airline:

    The video uses United Airlines (UA), but you can choose any airline.

    1. Click + Operation, and choose the GUI operation Filter.

    2. Choose the UniqueCarrier column.

    3. For Operator, choose Is equal to.

    4. For Value, enter the string for the airline for which you want to see delay information. For example, UA.
      Filter operation

    5. Click Apply.

  4. Create a new column that adds the arrival and departure delay times together.

    To do the math, you need numbers:

    1. Select the ArrDelay column. Click the actions menu, choose CONVERT COLUMN TYPE, and then choose Decimal.

    2. Select the DepDelay column. Click the actions menu, choose CONVERT COLUMN TYPE, and then choose Decimal.

    3. Click + Operation, and choose the GUI operation Calculate.

    4. Choose the ArrDelay column.

    5. For Operator, choose Addition.

    6. Select Column, then choose the DepDelay column.

    7. Select Create new column for results.

    8. For New column name, enter TotalDelay.
      Calculate operation

    9. Click Apply.

    The new column, TotalDelay, is added to the end of the list of columns.

  5. Optional: Use the transmute coding operation to create a new column that contains the data from the Month column and removes all the other columns. This step shows how you can quickly narrow down the data to only the information that interests you.

    1. In the command-line text box, choose the transmute operation.

    2. For the syntax options, choose:

      transmute(<new_or_existing_column> = `<column>`)

    3. Type newcol over the <new_or_existing_column> variable.

    4. Click the `<column>` variable, and then choose the Month column.

      When you finish, the command should look like this:

      transmute(newcol =`Month`)
      
    5. Click Apply.

      The data set now contains only one column with the values from the Month column. All the columns that were not specified in the transmute coding operation are removed.

    6. Important: Remove the last Data Refinery flow step to restore all the columns. Click the Undo icon Undo icon.
  6. Move the new TotalDelay column to the beginning of the data set:

    1. In the command-line text box, choose the select operation.

    2. Click select, and then choose: select(`<column>`, everything())

    3. Click `<column>`, and then choose the TotalDelay column.

      When you finish, the command should look like this:

      select(`TotalDelay`, everything())
      
    4. Click Apply.

      The TotalDelay column is now the first column.

  7. Reduce the data to four columns: the year, month, day and the TotalDelay. Use the group_by coding operation to divide the columns into groups of year, month, and day.

    1. In the command-line text box, choose the group_by operation.

    2. For <column>, choose the Year column.

    3. Before the closing parenthesis, type: ,Month,DayofMonth

      When you finish, the command should look like this:

      group_by(`Year`,Month,DayofMonth)
      
    4. Click Apply.

    5. Use the select coding operation for the TotalDelay column. In the command-line text box, select the select operation, For <column>, choose the TotalDelay column.

      The command should look like this:

      select(`TotalDelay`)
      
    6. Click Apply.

      The shaped data now consists of the Year, Month, DayofMonth, and TotalDelay columns.

      The first four rows of the Data Refinery flow with the Year, Month, DayofMonth, and TotalDelay columns.
      The first four rows of the Data Refinery flow with the Year, Month, DayofMonth, and TotalDelay columns

  8. Show the mean of the values of the TotalDelay column:

    These steps use the summarize coding operation with the aggregate function. The TotalDelay column is renamed to delay.

    1. In the command-line text box, choose the summarize operation. The following command opens:

      summarize(provide_new_column = <func>(`<column>`))
      
    2. Type delay over provide_new_column.

    3. Click <func> and then choose the AGGREGATE category.

    4. Choose the mean function.

    5. Click <column>, and select TotalDelay.

      When you finish, the command should look like this:

      summarize(delay =mean(`TotalDelay`))
      
    6. Click Apply.

      The new column delay is the average of all the delay times.

      The first four rows of the of the Data Refinery flow with the Year, Month, DayofMonth, and delay columns.
      The first four rows of the Data Refinery flow with the Year, Month, DayofMonth, and delay columns

Run the Data Refinery flow

  1. Click the Run icon Run icon.

    The DATA REFINERY FLOW DETAILS and DATA REFINERY FLOW OUTPUT information panels are displayed.

  2. Click Save and Run flow. This action might take some time because the Data Refinery flow is run on the full data set.

  3. In the What’s next dialog box, click View Flow to view the status of the flow.

Create another data asset from the Data Refinery flow

  1. When the run shows STATUS Completed, click Refine in the toolbar.

    The Data Refinery flow opens.

  2. Sort the delay column in descending order. Select the delay column, click the column actions menu three vertical dots representing the actions menu, and then select Sort descending.

  3. Click the Run icon Run icon to run the Data Refinery flow again.

  4. In the DATA REFINERY FLOW OUTPUT panel, click the Edit icon.

  5. Change the data set name to: airline-data_sorted.csv_shaped.csv

    Output changed

  6. Click the checkmark to apply the change.

  7. Click Save and Run flow.

View the data assets and your Data Refinery flow in your project

  1. When the Data Refinery flow completes, go to the project that contains the new Data Refinery flow.

  2. Click the Assets tab.

  3. Scroll down to Data assets. You’ll see the original data set that you uploaded and the output of the two Data Refinery flows.

    airline-data_sorted.csv_shaped.csv
    airline-data.csv_shaped.csv
    airline-data.csv

    If you click the airline-data.csv_shaped.csv data asset, you’ll see the mean delay unsorted. Click airline-data_sorted.csv_shaped.csv data asset to see the mean delay sorted in descending order.

    The Data Refinery flow section shows the new Data Refinery flow:

    airline-data.csv_flow