Managing Data Refinery flows

A Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance data. As you refine your data by applying operations to a data set, you dynamically build a customized Data Refinery flow that you can modify in real time and save for future use.

Save a Data Refinery flow

Save a Data Refinery flow by clicking the Save Data Refinery flow icon in the Data Refinery toolbar. Data Refinery flows are saved to the project that you’re working in. Save a Data Refinery flow so that you can continue refining a data set later.

The default output of the Data Refinery flow is saved as a data asset file in CSV format: source_file_name_shaped.csv. For example, if the source file is airline-data.csv, the default name and output for the Data Refinery flow is airline-data.csv_flow.

Run a Data Refinery flow

Data Refinery provides support for large data sets, which can be time-consuming and unwieldy to refine. To let you work quickly and efficiently, Data Refinery operates on a sample of each data set, that is, a subset of its rows.
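The idea of refining against a subset of rows can be illustrated with a minimal Python sketch. It assumes the sample is simply the first rows of a CSV source; how Data Refinery actually selects its sample is not described here, and the function name `sample_rows` and the example data are hypothetical.

```python
import csv
import io
from itertools import islice


def sample_rows(csv_text, max_rows):
    """Return the header and at most max_rows data rows from CSV text.

    Assumption: the sample is the first max_rows rows of the data set.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                   # first line holds the column names
    rows = list(islice(reader, max_rows))   # stop reading after max_rows rows
    return header, rows


# Hypothetical four-row data set; only two rows are used as the sample.
data = "carrier,delay\nAA,5\nUA,12\nDL,0\nWN,7\n"
header, sample = sample_rows(data, max_rows=2)
print(header)  # ['carrier', 'delay']
print(sample)  # [['AA', '5'], ['UA', '12']]
```

Operations built against the sample are applied to every row only when the flow is run against the entire data set.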

To run a Data Refinery flow against your entire data set, you can either:

  • Click the Run icon in the Data Refinery toolbar
  • Go to the project’s Assets tab, scroll down to the Data Refinery flows section and select Run from the ACTIONS menu

Cancel a Data Refinery flow run

You can cancel a Data Refinery flow run when it’s in progress, that is, when its status is Running. To cancel a run, select Cancel from the run’s menu on the History tab of the Data Refinery flow details page.

Reopen a Data Refinery flow to continue working

To reopen a Data Refinery flow and continue refining your data, go to the project’s Assets tab. Scroll down to the Data Refinery flows section and select Refine from the ACTIONS menu for the Data Refinery flow.

Change the Data Refinery flow output file

  1. Click the info icon to open the info pane, and then click the Details tab.
  2. Click the Edit button.
  3. In the DATA REFINERY FLOW OUTPUT pane, click the Edit icon to change any of the following properties:
    • Target location
    • Data set name and description
    • Relational database targets only: Whether to overwrite the data in the existing data set
    • File format
    • Column header information

Specify the runtime environment for the Data Refinery flow

  1. Click the info icon to open the info pane, and then click the Details tab.
  2. Click the Edit button.
  3. In the DATA REFINERY FLOW DETAILS pane, select the runtime environment.

Change the source of a Data Refinery flow

You can run a saved Data Refinery flow with a different source data asset. The new data set must have a schema that is compatible with the original data set (for example, column names, number of columns, and data types).

  1. Go to the project and click the Assets tab.
  2. Scroll down to the Data Refinery flows section and click the Data Refinery flow.
  3. On the Data Refinery flow summary page, click the “Change the source data asset” icon and select a different data source.
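Before changing the source, it can help to compare the two schemas yourself. The following Python sketch checks the three properties named above (column names, number of columns, and data types). It assumes that "compatible" means an exact, position-by-position match, which may be stricter than what Data Refinery requires; the function name `schema_mismatches` and the type labels are hypothetical.

```python
def schema_mismatches(original, replacement):
    """Compare two schemas given as ordered lists of (column_name, data_type).

    Returns a list of human-readable problems; an empty list means the
    schemas match under the strict-equality assumption used here.
    """
    problems = []
    if len(original) != len(replacement):
        problems.append(
            f"column count differs: {len(original)} vs {len(replacement)}"
        )
    # zip stops at the shorter schema, so only shared positions are compared.
    for position, ((o_name, o_type), (r_name, r_type)) in enumerate(
        zip(original, replacement)
    ):
        if o_name != r_name:
            problems.append(f"column {position}: name {o_name!r} vs {r_name!r}")
        if o_type != r_type:
            problems.append(f"column {position}: type {o_type} vs {r_type}")
    return problems


original = [("carrier", "string"), ("delay", "integer")]
same = [("carrier", "string"), ("delay", "integer")]
changed = [("carrier", "string"), ("delay", "string")]

print(schema_mismatches(original, same))     # []
print(schema_mismatches(original, changed))  # ['column 1: type integer vs string']
```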

Rename a Data Refinery flow

  1. Click the info icon to open the info pane, and then click the Details tab.
  2. Click the Edit icon next to the Data Refinery flow name in the DATA REFINERY FLOW DETAILS pane.

Remove a Data Refinery flow

To remove a Data Refinery flow, go to the project. Select the Assets tab. Scroll down to the Data Refinery flows section and select Remove from the ACTIONS menu.

View summary information

To view summary information for a Data Refinery flow, go to the project. Select the Assets tab. Scroll down to the Data Refinery flows section and click the Data Refinery flow.

Summary

The Summary section of the Data Refinery flow details page provides high-level information about the data source, the Data Refinery flow, and the target data set (the Data Refinery flow output). You can also preview the data source and the target data set from here by clicking their eye icons.

Runs

The Runs section of the Data Refinery flow details page provides detailed historical information about any Data Refinery flow runs. It also provides information about the current schedule, which you can change.

History

The History tab displays the following information about each Data Refinery flow run:

  • Timestamp
  • Status
  • Duration
  • Number of rows read from the data source and number of rows written to the target
  • Size (amount of data processed, in MB)
  • Initiated by: the name of the person who initiated the run

To view detailed information about a run, select View log from the run’s menu. The log file name is the name of the Data Refinery flow followed by the date and time (24-hour format) when the Data Refinery flow was run. Characters that are invalid in file names are changed to an underscore (_). When you no longer need the information about a run, select Delete from the run’s menu.
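The log file naming rule can be sketched in Python as follows. This is an illustration only: the timestamp layout, the separator between the flow name and the timestamp, and the set of characters treated as invalid are all assumptions, and the function name `log_file_name` is hypothetical.

```python
import re
from datetime import datetime

# Assumed set of characters that are invalid in file names, plus whitespace.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|\s]')


def log_file_name(flow_name, run_time):
    """Build a log file name: the flow name followed by the run date and time."""
    stamp = run_time.strftime("%Y-%m-%d %H:%M:%S")  # %H is the 24-hour clock
    # Replace every invalid character with an underscore.
    return INVALID_CHARS.sub("_", f"{flow_name} {stamp}")


name = log_file_name("airline-data.csv_flow", datetime(2019, 3, 5, 14, 30, 0))
print(name)  # airline-data.csv_flow_2019-03-05_14_30_00
```

In this sketch the space and the colons in the timestamp are the characters that get replaced, which is why the time portion appears as 14_30_00.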

Schedule

The Schedule tab displays the date, day, and time of each scheduled Data Refinery flow run. It also displays the start date and time, interval, and end date and time for the current schedule.

You can create, pause, resume, edit, or delete a schedule from this tab.
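To see how a start date and time, an interval, and an end date and time translate into individual scheduled runs, consider this minimal Python sketch. It assumes that the end date and time is inclusive and that the interval is fixed; neither detail is confirmed here, and the function name `scheduled_runs` is hypothetical.

```python
from datetime import datetime, timedelta


def scheduled_runs(start, interval, end):
    """List the run times from start to end (inclusive), one per interval."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    runs = []
    current = start
    while current <= end:  # assumption: a run exactly at the end time still occurs
        runs.append(current)
        current += interval
    return runs


# Hypothetical schedule: daily at 09:00 for three days.
runs = scheduled_runs(
    start=datetime(2019, 3, 4, 9, 0),
    interval=timedelta(days=1),
    end=datetime(2019, 3, 6, 9, 0),
)
for run in runs:
    # Date, day, and time, matching the fields shown for each scheduled run.
    print(run.strftime("%Y-%m-%d %A %H:%M"))
```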