Data Refinery tutorial: Shape raw data

This tutorial demonstrates a few of the many data shaping capabilities of Data Refinery.

About this tutorial

This tutorial is for users who are new to Data Refinery. You’ll start with a source CSV file that contains data about different airlines. You’ll create a Data Refinery flow that refines the data to one airline with columns for the total arrival and delay times, grouped by year, month, and day, with a sorted column for the average (mean) of all the delay times.

This tutorial follows the steps in the video below. After you load the data asset, you can follow the video as you do the steps.

Figure 1. Video iconShape raw data in the Data Refinery
This video shows you how to shape raw data in Data Refinery.

It will take you approximately 30 minutes to complete this tutorial.

Prerequisites

  1. Activate IBM Watson apps. Instructions are here.

  2. Create a project:

    1. Choose Projects > View all projects from the menu and then click New project on the My Projects page.

    2. Select to create an empty project. You can name the project anything you like.

Bring the data into Data Refinery

  1. Download the airline-data.csv file (1.5 MB) External link.

    Right-click the browser window, and then choose Save Page As or Save As depending on your browser. (For Safari: Choose Format: Page Source.) Make sure the downloaded file name is airline-data.csv.

  2. Add the airline-data.csv file to your project:
    1. From your project’s Assets page, click Add to project > Data.

    2. In the Load pane that opens, browse to the airline-data.csv file. Stay on the page until the load completes.

    The airline-data.csv file is added to your project as a data asset.

  3. Click the airline-data.csv data asset to preview its contents.

  4. Click the Refine link to open a sample of the file in Data Refinery.

Review the data with Profile and Visualizations

  1. Click the Profile tab to review the frequency distribution of the data so that you can find the outliers. The statistics show the minimum, maximum, mean, and number of unique values in each column.
    Profile tab

  2. Click the Visualizations tab. Select columns to visualize, then click Visualize data. Suggested charts have a blue dot next to their icons. Use the different perspectives available in the charts to identify patterns, connections, and relationships within the data.
    Visualizations tab

Tip: Use the Profile and Visualizations pages to view changes in the data as you refine it.

Data Refinery operations

Data Refinery uses two kinds of operations to refine data, GUI operations and coding operations. You will use both kinds of operations in this tutorial.

Add Operation button and coding operation text box

  • GUI operations can consist of multiple steps. Choose a GUI operation from the + Operation button. A subset of the GUI operations are also available from each column’s actions menu Actions menu.

  • Coding operations are dplyr R library operations, functions, and logical operators. Most of the operations have interactive help. Click the operation name in the command-line text box to see the coding operations and their syntax options.

Refine the data

Refining data is a series of steps to build a Data Refinery flow. As you go through this tutorial, view the Steps panel to follow your progress. You can select a step to delete or edit it. If you make a mistake, you can also click the Undo icon Undo icon.

  1. Go back to the Data tab.

  2. Select the Year column. Click the actions menu Actions menu and choose Sort descending.

  3. Focus on the delays for a specific airline:

    This tutorial uses United Airlines (UA), but you can choose any airline.

    1. Click + Operation, and then choose the GUI operation Filter.

    2. Choose the UniqueCarrier column.

    3. For Operator, choose Is equal to.

    4. For Value, enter the string for the airline for which you want to see delay information. For example, UA.
      Filter operation

    5. Click Apply.

  4. Create a new column that adds the arrival and departure delay times together.

    To do the math, you need numbers:

    1. Select the ArrDelay column. Click the actions menu, choose CONVERT COLUMN TYPE, and then choose Decimal.

    2. Select the DepDelay column. Click the actions menu, choose CONVERT COLUMN TYPE, and then choose Decimal.

    3. Click + Operation, and then choose the GUI operation Calculate.

    4. Choose the ArrDelay column, and click Next.

    5. For Operator, choose Addition.

    6. Choose to specify “Column”, then choose the DepDelay column.

    7. Select Create new column for results.

    8. For New column name, enter TotalDelay.
      Calculate operation

    9. Click Apply.

    The new column, TotalDelay, is added to the end of the list of columns.

  5. Move the new TotalDelay column to the beginning of the data set:

    1. In the command-line text box, choose the select operation.

    2. Click the word select, and then choose: select(`<column>`, everything())

    3. Click `<column>`, and then choose the TotalDelay column.

      When you finish, the command should look like this:

      select(`TotalDelay`, everything())
      
    4. Click Apply.

      The TotalDelay column is now the first column.

  6. Reduce the data to four columns: Year, Month, DayofMonth, and TotalDelay. Use the group_by coding operation to divide the columns into groups of year, month, and day.

    1. In the command-line text box, choose the group_by operation.

    2. Click <column>, and then choose the Year column.

    3. Before the closing parenthesis, type: ,Month,DayofMonth

      When you finish, the command should look like this:

      group_by(`Year`,Month,DayofMonth)
      
    4. Click Apply.

    5. Use the select coding operation for the TotalDelay column. In the command-line text box, select the select operation.
      Click <column>, and choose the TotalDelay column.

      The command should look like this:

      select(`TotalDelay`)
      
    6. Click Apply.

      The shaped data now consists of the Year, Month, DayofMonth, and TotalDelay columns.

      The first four rows of the data.
      The first four rows of the Data Refinery flow with the Year, Month, DayofMonth, and TotalDelay columns

  7. Show the mean of the values of the TotalDelay column. Rename the TotalDelay column to delay:

    1. Click + Operation, and then choose the GUI operation Aggregate.

    2. Select the TotalDelay column, and click Next.

    3. For AGGREGATION 1, select Mean.

    4. For Name of the aggregated column, enter delay.
      Aggregate operation

    5. Click Apply.

      The new column delay is the average of all the delay times.

      The first four rows of the data.
      The first four rows of the Data Refinery flow with the Year, Month, DayofMonth, and delay columns

Run a job for the Data Refinery flow

When you run a job for the Data Refinery flow, the steps are run on the entire data set. You select the runtime and add a one-time or repeating schedule. The output of the Data Refinery flow is added to the data assets in the project.

  1. From the Data Refinery toolbar, click the jobs icon and select Save and create a job.
    Save and create a job

  2. Enter a name for the job, description, and select a runtime.
    create job

  3. Click Create and Run.

    The job page opens with the details for the job. In the Runs section, when the Status for the job is Completed, the output of the Data Refinery flow, airline-data.csv_shaped.csv, is added to the data assets in the project. The default name for the Data Refinery flow is airline-data.csv_flow.

Create another data asset from the Data Refinery flow

  1. Open the Data Refinery flow. On the job page, click the Data Refinery flow’s name under Associated Asset. Alternatively, go to the project page, scroll down to the Data Refinery flows section, and click the name of the Data Refinery flow.

    The Data Refinery flow opens.

  2. Sort the delay column in descending order. Select the delay column, click the column actions menu Actions menu, and then select Sort descending.

  3. In the Details panel, click Edit.

  4. In the DATA REFINERY FLOW OUTPUT panel, click the Edit icon. Change the DATA SET NAME to: airline-data_sorted.csv_shaped.csv

    Output changed

  5. Click the checkmark to save the change.

  6. Click Done.

  7. From the Data Refinery toolbar, click the jobs icon and select Save and view jobs.
    Save and view jobs

  8. Select the job for the airline data, and then click View.

  9. From the Job window toolbar, click the Run Job icon.
    Run jobs icon

View the data assets and your Data Refinery flow in your project

  1. When the job completes, go to the project page.

  2. Click the Assets tab.

  3. Scroll down to Data assets. You’ll see the original data set that you uploaded and the output of the two Data Refinery flows.

    airline-data_sorted.csv_shaped.csv
    airline-data.csv_shaped.csv
    airline-data.csv

    If you click the airline-data.csv_shaped.csv data asset, you’ll see the mean delay unsorted. Click airline-data_sorted.csv_shaped.csv data asset to see the mean delay sorted in descending order.

    The Data Refinery flows section shows the new Data Refinery flow:

    airline-data.csv_flow