Refine data

Refining data consists of cleansing and shaping it. When you cleanse data, you fix or remove data that is incorrect, incomplete, improperly formatted, or duplicated. And when you shape data, you customize it by filtering, sorting, combining or removing columns, and performing operations.

As you manipulate your data, you build a customized Data Refinery flow that you can modify in real time and save for future re-use. When you save the refined data set, you typically load it to a different location than where you read it from. In this way, your source data can remain untouched by the refinement process.

Prerequisites

Before you can refine data, you'll need a project that uses Cloud Object Storage. Select a Standard project starter or a Data Engineering project starter. Watch this video to see how to create a project for data preparation.

Figure 1. Video iconCreate a project for data preparation
This video shows you how to create a project for data preparation.


If you have data in cloud or on-premises data sources, you'll need connections to those sources and you'll need to add data assets from each connection. If you want to be able to save refined data to cloud or on-premises data sources, create connections for this purpose as well. Source connections can only be used to read data; target connections can only be used to load (save) data. When you create a target connection, be sure to use credentials that have Write permission or you won't be able to save your Data Refinery flow output to the target. Watch this video to see how to create a connection and add connected data to a project.

Figure 2. Video iconCreate a connection and add connected data to a project
This video shows you how to create a connection and add connected data to a project.

Refine your data

  1. Access Data Refinery from within a project. Click Add to project, and then choose Data Refinery flow. Then select the data you want to work with.

    Alternatively, from the Assets tab of a project page, you can perform one of the following actions:

    • Select Refine from the menu of an Avro, CSV, JSON, Parquet, or text data asset
    • Click an Avro, CSV, JSON, Parquet, or text data asset to preview it first and then click the Refine link
    • Click New Data Refinery flow from the Data Refinery flows section of the Assets tab and then select the data that you want to work with.

    Tip: If your data doesn't display in tabular form, specify the format of your data by clicking the Specify data format icon at the bottom of Data Refinery.

  2. On the Data tab, apply operations to cleanse, shape, and enrich your data. You can enter R code in the command line and let autocomplete assist you in getting the syntax right. Or you can use the Operations menu to browse operation categories or search for a specific operation, then let the UI guide you.

    The following video shows you how to refine data.

    Figure 3. Video iconRefine data
    This video shows you how to refine data.

    For information about manipulating operations and steps in the Data Refinery flow, see Data Refinery flows and steps, below. For complete information about Data Refinery flows, see Manage Data Refinery flows.

  3. Validate your data throughout the data refinement process.

  4. Use visualizations to gain insights into your data and to uncover patterns, trends, and correlations within your data.

  5. When you've refined the sample data set to suit your needs, click the Run Data Refinery flow icon in the toolbar to run the Data Refinery flow on the entire data set.

    By default, Data Refinery uses the name of the data source to name the Data Refinery flow and the target data set. You can change these names, but you can't change the project that these data assets belong to.

    a. In the Data Refinery flow details pane, optionally edit the name of the Data Refinery flow and enter a description for it. You can also add a one-time or repeating schedule for the Data Refinery flow.

    b. In the Data Refinery flow output pane, optionally edit the name of the target data set and enter a description for it. You can save the target data set to the project, to a connection, or to a connected data asset . If you save it to the project, you can save it as a new data asset (by default) or you can replace an existing data asset. To save the target data set to a connection or as a connected data asset or to replace an existing data asset, use Change Location.

    If you select an existing relational database table or view or you select a connected relational data asset as the target for your Data Refinery flow output, you have a number of options for impacting the existing data set.

    • Overwrite - Overwrite the rows in the existing data set with those in the Data Refinery flow output
    • Recreate - Delete the rows in the existing data set and replace them with the rows in the Data Refinery flow output
    • Insert - Append all rows of the Data Refinery flow output to the existing data set
    • Update - Update rows in the existing data set with the Data Refinery flow output; don’t insert any new rows
    • Upsert - Update rows in the existing data set and append the rest of the Data Refinery flow output to it

    For the Update and Upsert options, you'll need to select the columns in the output data set to compare to columns in the existing data set. The output and target data sets must have the same number of columns, and the columns must have the same names and data types in both data sets.

    If you select a file in a connection as the target for your Data Refinery flow output, you can select one of the following formats for that file:

    • Avro
    • CSV
    • JSON
    • Parquet

    c. Click Save and Run.

Tip: If you want to continue refining your data at a later time, simply open the Data Refinery flow from the project > Assets tab > Data Refinery flow section and pick up from where you left off.

Data set previews

Data Refinery provides support for large data sets, which can be time-consuming and unwieldy to refine. To enable you to work quickly and efficiently, it operates on a subset of rows in the data set while you interactively refine the data. When you run the Data Refinery flow, it operates on the entire data set.

Data Refinery flows and steps

As you apply operations to a data set, Data Refinery keeps track of them and builds a Data Refinery flow. For each operation that you apply, Data Refinery adds a step in the Steps tab.

To see what your data looked like at any point in time, simply click a step to put Data Refinery into snapshot view. For example, if you click the data source step, you'll see what your data looked like before you started refining it. Click any operation step to see what your data looked like after that operation was applied. To leave snapshot mode, click SNAPSHOT VIEW or toggle it off by clicking the same step that you selected to get into snapshot view.

You can undo and redo operations from the toolbar. You can also insert, edit, and delete operations from the Steps tab.

To insert an operation between two steps:

  1. Click the step before the position where you want to insert the new operation. Data Refinery shows you a snapshot view of the data set after that operation was applied.
  2. Select and apply the new operation. Data Refinery inserts a new step between the existing steps and it reruns all of the operations that follow the new step.

To edit an operation:

  1. Click the Edit icon on the step for the operation you want to edit. Data Refinery goes into edit mode and either displays the operation to be edited on the command line or in the Operation pane.
  2. Edit the operation or select a different operation to take its place.
  3. Apply the edited operation. Data Refinery updates the relevant step to reflect your changes and it reruns all of the operations that follow the edited one.

Learn more

Next steps