Data Refinery tutorial: Shape raw data

Last updated: Jun 30, 2022

This tutorial demonstrates a few of the many data shaping capabilities of Data Refinery.

About this tutorial
Prerequisites
Bring the data into Data Refinery
Review the data with Profile and Visualizations
Data Refinery operations
Refine the data
Run a job for the Data Refinery flow
Create another data asset from the Data Refinery flow
View the data assets and your Data Refinery flow in your project

About this tutorial

This tutorial is for users who are new to Data Refinery. You'll start with a source CSV file that contains data about different airlines. You'll create a Data Refinery flow that refines the data to one airline with columns for the total arrival and delay times, grouped by year, month, and day, with a sorted column for the average (mean) of all the delay times.

This tutorial follows the steps in the Shape raw data in the Data Refinery video. After you load the data asset, you can follow the video as you do the steps.

This video provides a visual method as an alternative to following the written steps in this documentation.

It will take you approximately 30 minutes to complete this tutorial.

Prerequisites

Sign up for Cloud Pak for Data as a Service.
Create a project:
1. Choose Projects > View all projects from the menu and then click New project on the My Projects page.
2. Select to create an empty project. You can name the project anything you like.

Bring the data into Data Refinery

Download the airline-data.csv file (1.5 MB) .

Right-click the browser window, and then choose Save Page As or Save As depending on your browser. (For Safari: Choose Format: Page Source.) Make sure the downloaded file name is airline-data.csv.
Add the airline-data.csv file to your project:
1. From your project's Assets page, click Add to project > Data.
2. In the Load pane that opens, browse to the airline-data.csv file. Stay on the page until the load completes.
The airline-data.csv file is added to your project as a data asset.
Go to the project's Assets page, and click the airline-data.csv data asset to preview its contents.
Click Refine to open a sample of the file in Data Refinery.

Review the data with Profile and Visualizations

Click the Profile tab to review the frequency distribution of the data so that you can find the outliers. The statistics show the minimum, maximum, mean, and number of unique values in each column.
Click the Visualizations tab. Select columns to visualize, then click Visualize data. Suggested charts have a blue dot next to their icons. Use the different perspectives available in the charts to identify patterns, connections, and relationships within the data.

Tip: Use the Profile and Visualizations pages to view changes in the data as you refine it.

Data Refinery operations

Data Refinery uses two kinds of operations to refine data, GUI operations and coding operations. You will use both kinds of operations in this tutorial.

GUI operations can consist of multiple steps. Select an operation from New step. A subset of the GUI operations is also available from each column's overflow menu ().

When you open a file in Data Refinery, the Convert column type operation is automatically applied as the first step to convert any non-string data types to inferred data types (for example, to Integer, Date, Boolean, etc.). You can undo or edit this step.

Coding operations are interactive templates for coding operations, functions, and logical operators. Most of the operations have interactive help. Click the operation name in the command-line text box to see the coding operations and their syntax options.

Refine the data

Refining data is a series of steps to build a Data Refinery flow. As you go through this tutorial, view the Steps pane to follow your progress. You can select a step to delete or edit it. If you make a mistake, you can also click the Undo icon .

Go back to the Data tab.
Select the Year column. Click the the overflow menu () and choose Sort ascending.
Focus on the delays for a specific airline:

This tutorial uses United Airlines (UA), but you can choose any airline.
1. Click New step, and then choose the GUI operation Filter.
2. Choose the UniqueCarrier column.
3. For Operator, choose Is equal to.
4. For Value, enter the string for the airline for which you want to see delay information. For example, UA.
5. Click Apply.
Create a new column that adds the arrival and departure delay times together.

The Convert column type operation was automatically applied as the first step to convert the String data types in all the columns whose values are numbers to Integer data types.
1. Click New step, and then choose the GUI operation Calculate.
2. Choose the ArrDelay column, and click Next.
3. For Operator, choose Addition.
4. Specify "Column", then choose the DepDelay column.
5. Select Create new column for results.
6. For New column name, enter TotalDelay.
7. Click Apply.
The new column, TotalDelay, is added to the end of the list of columns.
Move the new TotalDelay column to the beginning of the data set:
1. In the command-line text box, choose the select operation.
2. Click the word select, and then choose: select(`<column>`, everything())
3. Click `<column>`, and then choose the TotalDelay column.
  
  When you finish, the command should look like this:
```
select(`TotalDelay`, everything())
```
4. Click Apply.
  
  The TotalDelay column is now the first column.
Reduce the data to four columns: Year, Month, DayofMonth, and TotalDelay. Use the group_by coding operation to divide the columns into groups of year, month, and day.
1. In the command-line text box, choose the group_by operation.
2. Click <column>, and then choose the Year column.
3. Before the closing parenthesis, type: ,Month,DayofMonth
  
  When you finish, the command should look like this:
```
group_by(`Year`,Month,DayofMonth)
```
4. Click Apply.
5. Use the select coding operation for the TotalDelay column. In the command-line text box, select the select operation.
  Click <column>, and choose the TotalDelay column.
  
  The command should look like this:
```
select(`TotalDelay`)
```
6. Click Apply.
  
  The shaped data now consists of the Year, Month, DayofMonth, and TotalDelay columns.
  
  The first four rows of the data.
Show the mean of the values of the TotalDelay column. Rename the TotalDelay column to delay:
1. Click New step, and then choose the GUI operation Aggregate.
2. Select the TotalDelay column, and click Next.
3. For AGGREGATION 1, select Mean.
4. For Name of the aggregated column, enter delay.
5. Click Apply.
  
  The new column delay is the average of all the delay times.
  
  The first four rows of the data.

Run a job for the Data Refinery flow

When you run a job for the Data Refinery flow, the steps are run on the entire data set. You select the runtime and add a one-time or repeating schedule. The output of the Data Refinery flow is added to the data assets in the project.

From the Data Refinery toolbar, click the jobs icon and select Save and create a job.
Enter a name for the job, description, and select a runtime.
Click Create and Run.

Go to the project's Jobs page and click the job's name. When the Status for the job is Completed, the output of the Data Refinery flow, airline-data_csv_shaped, is added to the Data assets section on the Assets page. The default name for the Data Refinery flow is airline-data_flow. It is added to the Data Refinery flows section.

Create another data asset from the Data Refinery flow

Open the Data Refinery flow. On the job page, click the Data Refinery flow's name under Associated Asset. Alternatively, go to the project's Assets page, scroll down to the Data Refinery flows section, and click the name of the Data Refinery flow.

The Data Refinery flow opens.
Sort the delay column in descending order. Select the delay column, click the column overflow menu (), and then select Sort descending.
In the Details pane, click Edit.
In the DATA REFINERY FLOW OUTPUT pane, click Edit Output. Change the DATA SET NAME to: airline-data_sorted_shaped.csv
Click the checkmark to save the change.
Click Done.
From the Data Refinery toolbar, click the jobs icon and select Save and view jobs.
Select the job for the airline data, and then click View.
From the Job window toolbar, click the Run Job icon.

View the data assets and your Data Refinery flow in your project

When the job completes, go to the project page.
Click the Assets tab.
Scroll down to Data assets. You'll see the original data set that you uploaded and the output of the two Data Refinery flows.

airline-data_sorted_shaped.csv
airline-data_csv_shaped
airline-data.csv

If you click the airline-data_shaped.csv data asset, you'll see the mean delay unsorted. Click airline-data_sorted_shaped.csv data asset to see the mean delay sorted in descending order.

The Data Refinery flows section shows the Data Refinery flow:

airline-data_flow

Parent topic: Refining data

Was the topic helpful?

0/1000