This tutorial demonstrates a few of the many data shaping capabilities of Data Refinery.
- About this tutorial
- Prerequisites
- Bring the data into Data Refinery
- Review the data with Profile and Visualizations
- Data Refinery operations
- Refine the data
- Run a job for the Data Refinery flow
- Create another data asset from the Data Refinery flow
- View the data assets and your Data Refinery flow in your project
About this tutorial
This tutorial is for users who are new to Data Refinery. You'll start with a source CSV file that contains data about different airlines. You'll create a Data Refinery flow that refines the data to one airline with columns for the total arrival and delay times, grouped by year, month, and day, with a sorted column for the average (mean) of all the delay times.
This tutorial follows the steps in the Shape raw data in the Data Refinery video. After you load the data asset, you can follow the video as you do the steps.
This video provides a visual method as an alternative to following the written steps in this documentation.
It will take you approximately 30 minutes to complete this tutorial.
Prerequisites
- Sign up for Cloud Pak for Data as a Service.
- Create a project:
- Choose Projects > View all projects from the menu and then click New project on the My Projects page.
- Select to create an empty project. You can name the project anything you like.
Bring the data into Data Refinery
-
Download the airline-data.csv file (1.5 MB)
.
Right-click the browser window, and then choose Save Page As or Save As depending on your browser. (For Safari: Choose Format: Page Source.) Make sure the downloaded file name is
airline-data.csv
. -
Add the
airline-data.csv
file to your project:-
From your project's Assets page, click Add to project > Data.
-
In the Load pane that opens, browse to the
airline-data.csv
file. Stay on the page until the load completes.
The
airline-data.csv
file is added to your project as a data asset. -
-
Go to the project's Assets page, and click the
airline-data.csv
data asset to preview its contents. -
Click Refine to open a sample of the file in Data Refinery.
Review the data with Profile and Visualizations
- Click the Profile tab to review the frequency
distribution of the data so that you can find the outliers. The statistics show the minimum, maximum, mean, and number of unique values in each column.
- Click the Visualizations tab. Select columns to visualize, then click Visualize data. Suggested
charts have a blue dot next to their icons. Use the different perspectives available in the charts to identify patterns, connections, and relationships within
the data.
Tip: Use the Profile and Visualizations pages to view changes in the data as you refine it.
Data Refinery operations
Data Refinery uses two kinds of operations to refine data, GUI operations and coding operations. You will use both kinds of operations in this tutorial.
-
GUI operations can consist of multiple steps. Select an operation from New step. A subset of the GUI operations is also available from each column's overflow menu (
).
When you open a file in Data Refinery, the Convert column type operation is automatically applied as the first step to convert any non-string data types to inferred data types (for example, to Integer, Date, Boolean, etc.). You can undo or edit this step.
- Coding operations are interactive templates for coding operations, functions, and logical operators. Most of the operations have interactive help. Click the operation name in the command-line text box to see the coding operations and their syntax options.
Refine the data
Refining data is a series of steps to build a Data Refinery flow. As you go through this tutorial, view the Steps pane to follow your progress. You can select a step to delete or edit it. If you make a mistake, you
can also click the Undo icon .
-
Go back to the Data tab.
-
Select the Year column. Click the the overflow menu (
) and choose Sort ascending.
-
Focus on the delays for a specific airline:
This tutorial uses United Airlines (UA), but you can choose any airline.
-
Click New step, and then choose the GUI operation Filter.
-
Choose the UniqueCarrier column.
-
For Operator, choose Is equal to.
-
For Value, enter the string for the airline for which you want to see delay information. For example,
UA
.
-
Click Apply.
-
-
Create a new column that adds the arrival and departure delay times together.
The Convert column type operation was automatically applied as the first step to convert the String data types in all the columns whose values are numbers to Integer data types.
-
Click New step, and then choose the GUI operation Calculate.
-
Choose the ArrDelay column, and click Next.
-
For Operator, choose Addition.
-
Specify "Column", then choose the DepDelay column.
-
Select Create new column for results.
-
For New column name, enter
TotalDelay
.
-
Click Apply.
The new column, TotalDelay, is added to the end of the list of columns.
-
-
Move the new TotalDelay column to the beginning of the data set:
-
In the command-line text box, choose the select operation.
-
Click the word select, and then choose: select(`
<column>
`, everything()) -
Click
`<column>`
, and then choose the TotalDelay column.When you finish, the command should look like this:
select(`TotalDelay`, everything())
-
Click Apply.
The TotalDelay column is now the first column.
-
-
Reduce the data to four columns: Year, Month, DayofMonth, and TotalDelay. Use the group_by coding operation to divide the columns into groups of year, month, and day.
-
In the command-line text box, choose the group_by operation.
-
Click
<column>
, and then choose the Year column. -
Before the closing parenthesis, type:
,Month,DayofMonth
When you finish, the command should look like this:
group_by(`Year`,Month,DayofMonth)
-
Click Apply.
-
Use the select coding operation for the TotalDelay column. In the command-line text box, select the select operation.
Click<column>
, and choose the TotalDelay column.The command should look like this:
select(`TotalDelay`)
-
Click Apply.
The shaped data now consists of the Year, Month, DayofMonth, and TotalDelay columns.
The first four rows of the data.
-
-
Show the mean of the values of the TotalDelay column. Rename the TotalDelay column to delay:
-
Click New step, and then choose the GUI operation Aggregate.
-
Select the TotalDelay column, and click Next.
-
For AGGREGATION 1, select Mean.
-
For Name of the aggregated column, enter
delay
.
-
Click Apply.
The new column delay is the average of all the delay times.
The first four rows of the data.
-
Run a job for the Data Refinery flow
When you run a job for the Data Refinery flow, the steps are run on the entire data set. You select the runtime and add a one-time or repeating schedule. The output of the Data Refinery flow is added to the data assets in the project.
-
From the Data Refinery toolbar, click the jobs icon and select Save and create a job.
-
Enter a name for the job, description, and select a runtime.
-
Click Create and Run.
Go to the project's Jobs page and click the job's name. When the Status for the job is Completed, the output of the Data Refinery flow,
airline-data_csv_shaped
, is added to the Data assets section on the Assets page. The default name for the Data Refinery flow isairline-data_flow
. It is added to the Data Refinery flows section.
Create another data asset from the Data Refinery flow
-
Open the Data Refinery flow. On the job page, click the Data Refinery flow's name under Associated Asset. Alternatively, go to the project's Assets page, scroll down to the Data Refinery flows section, and click the name of the Data Refinery flow.
The Data Refinery flow opens.
-
Sort the delay column in descending order. Select the delay column, click the column overflow menu (
), and then select Sort descending.
- In the Details pane, click Edit.
- In the DATA REFINERY FLOW OUTPUT pane, click Edit Output. Change the DATA SET NAME to:
airline-data_sorted_shaped.csv
-
Click the checkmark to save the change.
-
Click Done.
-
From the Data Refinery toolbar, click the jobs icon and select Save and view jobs.
-
Select the job for the airline data, and then click View.
- From the Job window toolbar, click the Run Job icon.
View the data assets and your Data Refinery flow in your project
- When the job completes, go to the project page.
- Click the Assets tab.
-
Scroll down to Data assets. You'll see the original data set that you uploaded and the output of the two Data Refinery flows.
airline-data_sorted_shaped.csv
airline-data_csv_shaped
airline-data.csv
If you click the
airline-data_shaped.csv
data asset, you'll see the mean delay unsorted. Clickairline-data_sorted_shaped.csv
data asset to see the mean delay sorted in descending order.The Data Refinery flows section shows the Data Refinery flow:
airline-data_flow
Parent topic: Refining data