Refining data

Use Data Refinery to cleanse and shape tabular data with a graphical flow editor. You can also use dplyr R library operations, functions, and logical operators. When you cleanse data, you fix or remove data that is incorrect, incomplete, improperly formatted, or duplicated. When you shape data, you customize it by filtering, sorting, combining or removing columns, and performing operations.

You create a Data Refinery flow as a set of ordered operations on data. Data Refinery includes a graphical interface to profile your data to validate it and over 20 customizable charts that give you perspective and insights into your data. When you save the refined data set, you typically load it to a different location than where you read it from. In this way, your source data remains untouched by the refinement process.
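The "set of ordered operations" idea can be sketched outside the product. The following pandas snippet (Python is used purely as an analogy here; Data Refinery itself exposes a graphical editor and dplyr-style R) shows a flow of cleanse and shape steps that reads one source and writes the result to a different target, leaving the source untouched. All column and file names are illustrative.

```python
import pandas as pd

# Toy data standing in for a source asset; in Data Refinery the source
# itself is never modified -- refined output is saved elsewhere.
source = pd.DataFrame({
    "name": ["Ann", "Ann", "Bo", "Cy"],
    "email": ["a@x.co", "a@x.co", None, "c@x.co"],
    "notes": ["", "", "", ""],
})

# Ordered refinement steps, analogous to a Data Refinery flow:
refined = (
    source
    .drop_duplicates()            # cleanse: remove duplicated rows
    .dropna(subset=["email"])     # cleanse: drop incomplete records
    .sort_values("name")          # shape: sort rows
    .drop(columns=["notes"])      # shape: remove a column
)

refined.to_csv("refined_output.csv", index=False)  # load to a new target
```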

Required service: Watson Studio or Watson Knowledge Catalog
Data format: Avro, CSV, JSON, Parquet, or text data assets; tables in relational data sources
Data size: Any. Data Refinery operates on a sample subset of rows in the data set. However, when you run a job for the Data Refinery flow, the entire data set is processed.

For more information on choosing the right tool for your data and use case, see Choosing a tool.

Prerequisites

Before you can refine data, you’ll need a project that uses Cloud Object Storage.

Watch this video to see how to create a project.

Figure 1. Video: Create a project

If you have data in cloud or on-premises data sources, you’ll need connections to those sources and you’ll need to add data assets from each connection. If you want to be able to save refined data to cloud or on-premises data sources, create connections for this purpose as well. Source connections can only be used to read data; target connections can only be used to load (save) data. When you create a target connection, be sure to use credentials that have Write permission or you won’t be able to save your Data Refinery flow output to the target.

Watch this video to see how to create a connection and add connected data to a project.

Figure 2. Video: Create a connection and add connected data to a project

Refine your data

  1. Access Data Refinery from within a project. Click Add to project, and then choose Data Refinery flow. Then select the data you want to work with.

    Alternatively, from the Assets tab of a project page, you can perform one of the following actions:

    • Select Refine from the menu of an Avro, CSV, JSON, Parquet, or text data asset.
    • Click an Avro, CSV, JSON, Parquet, or text data asset to preview it first, and then click the Refine link.
    • If you already have a Data Refinery flow, click New Data Refinery flow from the Data Refinery flows section and then select the data that you want to work with.

    Tip: If your data doesn’t display in tabular form, go to the Data tab. Scroll down to the SOURCE FILE information at the bottom of the page. Click the Specify data format icon. For more information, see Specify the format of your data source.

  2. On the Data tab, apply operations to cleanse, shape, and enrich your data. You can enter R code in the command line and let autocomplete assist you in getting the correct syntax. Or you can use the Operations menu to browse operation categories or search for a specific operation, then let the UI guide you.

    The following video shows you how to refine data.

    Figure 3. Video: Refine data
  3. Click the Profile tab to validate your data throughout the data refinement process.

  4. Click the Visualizations tab to visualize the data in charts. Uncover patterns, trends, and correlations within your data.

  5. When you’ve refined the sample data set to suit your needs, optionally edit the Data Refinery flow details and output information, and then save and create a job that runs the Data Refinery flow on the full data set.

    By default, Data Refinery uses the name of the data source to name the Data Refinery flow and the target data set. You can change these names, but you can’t change the project that these data assets belong to.

    1. Optional: In the DATA REFINERY FLOW DETAILS panel, edit the name of the Data Refinery flow and enter a description for it.

    2. Optional: In the DATA REFINERY FLOW OUTPUT panel, edit the name of the target data set and enter a description for it. You can save the target data set to the project, to a connection, or to a connected data asset. If you save it to the project, you can save it as a new data asset (default) or you can replace an existing data asset. To save the target data set to a connection or as a connected data asset or to replace an existing data asset, click Change Location.

      If you select an existing relational database table or view, or a connected relational data asset, as the target for your Data Refinery flow output, select an option for handling the existing data set:

      • Overwrite - Overwrite the rows in the existing data set with those in the Data Refinery flow output
      • Recreate - Delete the rows in the existing data set and replace them with the rows in the Data Refinery flow output
      • Insert - Append all rows of the Data Refinery flow output to the existing data set
      • Update - Update rows in the existing data set with the Data Refinery flow output; don’t insert any new rows
      • Upsert - Update rows in the existing data set and append the rest of the Data Refinery flow output to it

      For the Update and Upsert options, you’ll need to select the columns in the output data set to compare to columns in the existing data set. The output and target data sets must have the same number of columns, and the columns must have the same names and data types in both data sets.

      If you select a file in a connection as the target for your Data Refinery flow output, you can select one of the following formats for that file:

      • Avro
      • CSV
      • JSON
      • Parquet
    3. Click Save and create a job or Save and view jobs in the toolbar to run the Data Refinery flow on the entire data set. Select the runtime and add a one-time or repeating schedule. For information about jobs, see Jobs in a project.
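For the Update and Upsert options described above, rows are matched on the key columns that you select. A minimal sketch of that matching logic, using pandas with a hypothetical `id` key column (an analogy only, not how Data Refinery is implemented):

```python
import pandas as pd

# Existing target table and Data Refinery flow output; "id" is the
# hypothetical key column chosen for row comparison.
target = pd.DataFrame({"id": [1, 2], "city": ["Oslo", "Lima"]})
output = pd.DataFrame({"id": [2, 3], "city": ["Kyiv", "Pune"]})

# Update: overwrite matching rows only; never insert new ones.
update = target.copy()
mask = update["id"].isin(output["id"])
update.loc[mask, "city"] = update.loc[mask, "id"].map(
    output.set_index("id")["city"]
)

# Upsert: update matching rows AND append the unmatched output rows.
upsert = pd.concat(
    [update, output[~output["id"].isin(target["id"])]],
    ignore_index=True,
)
```

Note that both cases require the output and target to share column names and types, matching the constraint stated for the Update and Upsert options.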

Tip: If you want to continue refining your data later, open the Data Refinery flow from the project’s Assets tab > Data Refinery flows section and pick up from where you left off.
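Step 2 of the procedure above lets you type dplyr-style R operations on the command line. As a rough illustration of that chained-operation style (shown in pandas rather than R; the column names are made up, and this is not Data Refinery's actual command syntax):

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 12, 30], "qty": [2, 1, 4]})

# dplyr-style chain: filter(price > 10) %>% mutate(total = price * qty)
# %>% arrange(desc(total)), expressed with the pandas equivalents:
result = (
    df[df["price"] > 10]                            # filter rows
    .assign(total=lambda d: d["price"] * d["qty"])  # derive a column
    .sort_values("total", ascending=False)          # arrange
)
```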

Data set previews

Data Refinery provides support for large data sets, which can be time-consuming and unwieldy to refine. To enable you to work quickly and efficiently, it operates on a subset of rows in the data set while you interactively refine the data. When you run a job for the Data Refinery flow, it operates on the entire data set.
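The sample-versus-full distinction can be pictured like this (a pandas sketch that assumes a simple head-of-file sample; Data Refinery's actual sampling is internal to the product):

```python
import pandas as pd

full = pd.DataFrame({"v": range(10_000)})

# Interactive refinement runs against a sample for responsiveness...
sample = full.head(1000)
shaped_sample = sample[sample["v"] % 2 == 0]   # fast preview of the steps

# ...while the scheduled job applies the same steps to every row.
shaped_full = full[full["v"] % 2 == 0]
```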

Data Refinery steps

As you apply operations to a data set, Data Refinery keeps track of them and builds a Data Refinery flow. For each operation that you apply, Data Refinery adds a step in the Steps tab.

To see what your data looked like at any point in time, click a step to put Data Refinery into snapshot view. For example, if you click the data source step, you’ll see what your data looked like before you started refining it. Click any operation step to see what your data looked like after that operation was applied. To leave snapshot view, click SNAPSHOT VIEW, or toggle it off by clicking the same step that you selected to enter snapshot view.
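Conceptually, the step list is an ordered pipeline, and snapshot view is the result of replaying only a prefix of it. A small sketch of that idea (illustrative only; not Data Refinery internals):

```python
import pandas as pd

df0 = pd.DataFrame({"x": [3, 1, 2, 2]})

# Each recorded step is a function on the data frame.
steps = [
    lambda d: d.drop_duplicates(),
    lambda d: d.sort_values("x"),
    lambda d: d.reset_index(drop=True),
]

def snapshot(source, steps, k):
    """Data as it looked after the first k steps (k=0 is the source)."""
    df = source
    for step in steps[:k]:
        df = step(df)
    return df
```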

If you want to change the source of the Data Refinery flow, click the edit icon next to Data Source (before the first step). For best results, the new data set should have a schema that is compatible with the original data set (for example, the same column names, number of columns, and data types). If the new data set has a different schema, operations that don’t work with the new schema show errors. You can edit or delete those operations, or change the source to one with a more compatible schema.
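The compatibility described here amounts to matching column names and data types. A rough sketch of such a check (pandas; not the product's actual validation):

```python
import pandas as pd

def schemas_compatible(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Rough check: same column names mapped to the same dtypes."""
    return dict(a.dtypes) == dict(b.dtypes)

# Hypothetical original source versus two replacement candidates:
old = pd.DataFrame({"id": [1], "name": ["a"]})
new_ok = pd.DataFrame({"id": [2], "name": ["b"]})   # same schema
new_bad = pd.DataFrame({"id": ["x"]})               # wrong type, missing column
```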

You can undo and redo operations from the toolbar. You can also insert, edit, and delete operations from the Steps tab.

To insert an operation between two steps:

  1. Click the step before the position where you want to insert the new operation. Data Refinery shows you a snapshot view of the data set after that operation was applied.
  2. Select and apply the new operation. Data Refinery inserts a new step between the existing steps and it reruns all of the operations that follow the new step.
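The insert-and-rerun behavior above can be modeled as inserting a function into the ordered step list and replaying the whole flow (an illustrative sketch, continuing the pipeline analogy):

```python
import pandas as pd

source = pd.DataFrame({"x": [3, 1, 2, 2]})
steps = [
    lambda d: d.drop_duplicates(),
    lambda d: d.sort_values("x"),
]

def run(source, steps):
    df = source
    for step in steps:   # every step after an insertion is rerun
        df = step(df)
    return df

# Insert a filter between the two existing steps, then replay the flow:
steps.insert(1, lambda d: d[d["x"] > 1])
result = run(source, steps)
```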

To edit an operation:

  1. Click the Edit icon on the step for the operation that you want to edit. Data Refinery goes into edit mode and displays the operation to be edited either on the command line or in the Operation pane.
  2. Edit the operation or select a different operation to take its place.
  3. Apply the edited operation. Data Refinery updates the relevant step to reflect your changes and it reruns all of the operations that follow the edited one.

Next steps

Learn more