Manage Data Refinery flows

A Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance data. As you refine your data by applying operations to a data set, you dynamically build a customized Data Refinery flow that you can modify in real time and save for future use.

Run Data Refinery flows

Data Refinery supports large data sets, which can be time-consuming and unwieldy to refine. So that you can work quickly and efficiently, Data Refinery operates on a sample of each data set, that is, a subset of its rows.

To run a Data Refinery flow against your entire data set, you can either:

  • Click the Run Data Refinery flow icon in the Data Refinery toolbar
  • Go to the project > Assets tab > Data Refinery flows section and select Run from the ACTIONS menu
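The relationship between refining a sample and running the flow against the entire data set can be sketched in a few lines of Python. This is only an illustration, not how Data Refinery is implemented: the step functions, the `apply_flow` and `preview` helpers, and the choice of "first N rows" as the sample are all assumptions made for the example.

```python
from itertools import islice

# Hypothetical steps: each takes a row (a dict) and returns a refined row.
def trim_name(row):
    return {**row, "name": row["name"].strip()}

def uppercase_region(row):
    return {**row, "region": row["region"].upper()}

# A flow modeled as an ordered list of steps; the order matters.
flow = [trim_name, uppercase_region]

def apply_flow(flow, rows):
    """Apply every step, in order, to each row."""
    for row in rows:
        for step in flow:
            row = step(row)
        yield row

def preview(flow, rows, sample_size=2):
    """Refine only the first sample_size rows (an assumed sampling strategy)."""
    return list(apply_flow(flow, islice(rows, sample_size)))

data = [
    {"name": " Ann ", "region": "emea"},
    {"name": "Bob  ", "region": "apac"},
    {"name": " Cy", "region": "amer"},
]

sample_result = preview(flow, data)         # fast feedback while refining
full_result = list(apply_flow(flow, data))  # the equivalent of a full run
```

The same ordered steps are used in both cases; only the number of rows they are applied to changes.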

Save Data Refinery flows

Data Refinery flows are saved to the project that you're working in. Save a Data Refinery flow so that you can continue refining a data set later, picking up where you left off.

Save a Data Refinery flow to the project at any time by clicking the Save Data Refinery flow icon in the Data Refinery toolbar. The Data Refinery flow is implicitly saved whenever you run it against a data set.

For a list of all the Data Refinery flows in a project, see the Data Refinery flows section on the Assets tab of the project.

Cancel Data Refinery flow runs

You can cancel a Data Refinery flow run when it's in progress, that is, when its status is Running. To cancel a run, select Cancel from the run's menu on the History tab of the Data Refinery flow details page.

Reopen Data Refinery flows to continue working

To reopen a Data Refinery flow and continue refining your data, you can either:

  • Click Refine in the toolbar of the Data Refinery flow details page
  • Go to the project > Assets tab > Data Refinery flows section and select Refine from the ACTIONS menu

When you save the Data Refinery flow again, you can either overwrite the existing flow or create a new flow by changing the name of the Data Refinery flow in the Details pane before you save it.

Change the source of a Data Refinery flow

You can change the source of a saved Data Refinery flow so that the same flow runs against a different source data asset. The new data set must have a schema that is compatible with the original data set (for example, column names, number of columns, and data types). Go to the project > Assets tab > Data Refinery flows section and click the Data Refinery flow. On the Data Refinery flow summary page, click the "Change the source data asset" icon (Change source) to select a different data source.
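Before you change the source, it can help to compare the two schemas yourself. The following Python sketch checks the three properties named above. It is not part of Data Refinery: the `schema_mismatches` function, the representation of a schema as (column name, data type) pairs, and the type names are assumptions made for illustration, and the product's own compatibility rules may differ.

```python
def schema_mismatches(original, candidate):
    """Return a list of differences between two schemas.

    Each schema is a list of (column_name, data_type) pairs in column order.
    An empty list means the candidate looks compatible under these assumptions.
    """
    problems = []
    if len(original) != len(candidate):
        problems.append(
            f"column count differs: {len(original)} vs {len(candidate)}"
        )
    candidate_types = dict(candidate)
    for name, data_type in original:
        if name not in candidate_types:
            problems.append(f"missing column: {name}")
        elif candidate_types[name] != data_type:
            problems.append(
                f"type of {name} differs: {data_type} vs {candidate_types[name]}"
            )
    return problems

# Hypothetical schemas for an original and a replacement data asset.
sales_2023 = [("id", "integer"), ("amount", "decimal"), ("region", "string")]
sales_2024 = [("id", "integer"), ("amount", "string"), ("region", "string")]

mismatches = schema_mismatches(sales_2023, sales_2024)
print(mismatches)
```

An empty result suggests that the new data asset is a reasonable candidate; any reported difference is worth resolving before you switch sources.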

View summary information

To view summary information for a Data Refinery flow, go to the project > Assets tab > Data Refinery flows section and click the Data Refinery flow.

Summary

The Summary section of the Data Refinery flow details page provides high-level information about the data source, the Data Refinery flow, and the target data set (the Data Refinery flow output). You can also preview the data source and the target data sets from here by clicking their eye icons.

Runs

The Runs section of the Data Refinery flow details page provides detailed historical information about any Data Refinery flow runs. It also provides information about the current schedule, which you can change.

History

The History tab displays the following information about each Data Refinery flow run:

  • Timestamp
  • Status
  • Duration
  • Number of rows read from the data source and number of rows written to the target
  • Size (amount of data processed, in MB)
  • Initiated by - The name of the person who initiated the run

To view detailed information about a run, select View log from the run's menu. The log file name is the name of the Data Refinery flow followed by the date and time (24-hour clock) when the Data Refinery flow was run. Characters that are invalid in file names are replaced with an underscore (_). When you no longer need the information about a run, select Delete from the run's menu.
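The log naming rule can be approximated with the Python sketch below. The exact timestamp layout, the separator between the flow name and the timestamp, and the set of characters treated as invalid are not specified here, so all three are assumptions, as is the `log_file_name` helper itself.

```python
import re
from datetime import datetime

# Assumed set of characters that are invalid in file names.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')

def log_file_name(flow_name, run_time):
    """Build a log name: the flow name, then the date and 24-hour time."""
    stamp = run_time.strftime("%Y-%m-%d %H:%M:%S")  # assumed layout
    # Replace every invalid character with an underscore.
    return INVALID_CHARS.sub("_", f"{flow_name} {stamp}")

# Hypothetical flow name and run time.
name = log_file_name("Sales/Q1 Flow", datetime(2024, 3, 5, 14, 7, 9))
print(name)
```

Under these assumptions, the slash in the flow name and the colons in the time are each replaced with an underscore.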

Schedule

The Schedule tab displays the date, day, and time of each scheduled Data Refinery flow. It also displays the start date and time, interval, and end date and time for the current schedule.

You can create, pause, resume, edit, or delete a schedule from this tab.
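A schedule defined by a start, an interval, and an end expands into a series of run times. The Python sketch below shows one way to compute that series. The `scheduled_runs` helper is hypothetical, and treating the end date and time as inclusive is an assumption; the product may handle the boundary differently.

```python
from datetime import datetime, timedelta

def scheduled_runs(start, interval, end):
    """Yield each run time from start up to and including end (assumed)."""
    current = start
    while current <= end:
        yield current
        current += interval

# Hypothetical schedule: daily at 09:00 for three days.
runs = list(
    scheduled_runs(
        start=datetime(2024, 1, 1, 9, 0),
        interval=timedelta(days=1),
        end=datetime(2024, 1, 3, 9, 0),
    )
)
for run in runs:
    print(run.strftime("%A %Y-%m-%d %H:%M"))
```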

Rename Data Refinery flows

By default, Data Refinery flows are given names similar to the source data that they're applied to. For example, if your data source is a local file named Sales.csv, your Data Refinery flow is named "Sales Flow".
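The Sales.csv example suggests a rule along the following lines. This Python sketch is a guess inferred from that single example, not the documented naming algorithm, and the `default_flow_name` helper is hypothetical.

```python
from pathlib import PurePath

def default_flow_name(source_name):
    """Guess a default flow name: the file stem plus ' Flow' (assumed rule)."""
    return f"{PurePath(source_name).stem} Flow"

flow_name = default_flow_name("Sales.csv")
print(flow_name)
```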

To rename a Data Refinery flow, click the Edit icon next to the Data Refinery flow name in the Details pane or in the DATA REFINERY FLOW DETAILS pane of the Save and Run page.

Remove Data Refinery flows

To remove a Data Refinery flow, go to the project > Assets tab > Data Refinery flows section and select Remove from the ACTIONS menu.