Managing Data Refinery flows

A Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance data. As you refine your data by applying operations to a data set, you dynamically build a customized Data Refinery flow that you can modify in real time and save for future use.

Save a Data Refinery flow

To save a Data Refinery flow, click the Save Data Refinery flow icon in the Data Refinery toolbar. Data Refinery flows are saved to the project that you’re working in. Saving a Data Refinery flow lets you continue refining a data set later.

The default output of the Data Refinery flow is saved as a data asset file in CSV format, named source_file_name_shaped.csv. For example, if the source file is airline-data.csv, the default name of the Data Refinery flow is airline-data.csv_flow.

Run or schedule a job for a Data Refinery flow

Data Refinery supports large data sets, which can be time-consuming and unwieldy to refine. To enable you to work quickly and efficiently, Data Refinery operates on a sample subset of rows in each data set. When you run a job for the Data Refinery flow, the entire data set is processed. When you run the job, you select the runtime and add a one-time or repeating schedule.
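The difference between refining on a sample and running a job on the full data set can be sketched in a few lines of Python. This is only an illustration of the idea, not how Data Refinery is implemented: the step functions, the `FLOW` list, the sample size, and the inline data are all hypothetical.

```python
import csv
import io
from itertools import islice

# Hypothetical steps: each one takes a row (dict) and returns a new row.
def trim_whitespace(row):
    return {key: value.strip() for key, value in row.items()}

def uppercase_carrier(row):
    return {**row, "carrier": row["carrier"].upper()}

# A flow is modeled here as an ordered list of steps (an assumption).
FLOW = [trim_whitespace, uppercase_carrier]

def apply_flow(rows, flow):
    """Apply every step, in order, to each row."""
    for row in rows:
        for step in flow:
            row = step(row)
        yield row

def read_rows(text):
    return csv.DictReader(io.StringIO(text))

# Made-up source data standing in for a large data set.
SOURCE = "carrier,delay\n aa ,5\n ua ,12\n dl ,0\n"

# Interactive refinement: the flow is previewed on a sample of rows only.
preview = list(apply_flow(islice(read_rows(SOURCE), 2), FLOW))

# Job run: the same flow processes the entire data set.
full = list(apply_flow(read_rows(SOURCE), FLOW))
```

Because the same ordered steps are applied in both cases, the preview rows match the first rows of the full result; only the number of rows processed differs.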

In Data Refinery, click the Jobs icon in the Data Refinery toolbar, and then select Save and create a job or Save and view jobs.

If you’ve already saved the Data Refinery flow, you can create and run a job for it from these places in the Project page:

  • From the Assets tab, select the Data Refinery flow, and then click ACTIONS > Create job.
  • From the Jobs tab, click New job. In the Create a job page, select a Data Refinery flow as the asset and enter the details for the job.

For more information about jobs, see Jobs in a project.

Reopen a Data Refinery flow to continue working

To reopen a Data Refinery flow and continue refining your data, go to the project’s Assets tab. Scroll down to the Data Refinery flows section and click the Data Refinery flow name.

Change the Data Refinery flow output file

  1. In Data Refinery, click the info icon to open the info pane, and then click the Details tab.
  2. Click the Edit button.
  3. In the DATA REFINERY FLOW OUTPUT pane, click the Edit icon to change any of the following properties:
    • Target location
    • Data set name and description
    • Relational database targets only: Whether to overwrite the data in the existing data set
    • File format
    • Column header information

Change the source of a Data Refinery flow

You can change the source of a saved Data Refinery flow so that the same flow runs against a different source data set. In the Steps panel in Data Refinery, click the Edit icon next to Data Source to choose a different source data set.

For best results, the new data set should have a schema that is compatible with the original data set (for example, the same column names, number of columns, and data types). If the new data set has a different schema, operations that don’t work with the new schema show errors. You can edit or delete those operations, or change the source to one that has a more compatible schema.
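Before you switch sources, it can help to compare the two schemas yourself. The following is a minimal sketch, assuming each schema is available as a plain dictionary that maps column names to data type names; the `schema_diff` function and the example schemas are hypothetical and are not part of Data Refinery.

```python
def schema_diff(original, new):
    """Compare two schemas given as {column_name: data_type} dictionaries."""
    shared = original.keys() & new.keys()
    return {
        # Columns the flow may reference that the new source lacks.
        "missing": sorted(original.keys() - new.keys()),
        # Columns that exist only in the new source.
        "added": sorted(new.keys() - original.keys()),
        # Columns present in both sources but with a different data type.
        "retyped": sorted(col for col in shared if original[col] != new[col]),
    }

# Hypothetical schemas for the original and the replacement data set.
original_schema = {"carrier": "string", "delay": "integer"}
new_schema = {"carrier": "string", "delay": "string", "origin": "string"}

diff = schema_diff(original_schema, new_schema)
for kind, columns in diff.items():
    print(f"{kind}: {columns}")
```

Columns reported as missing or retyped are the ones most likely to be involved in operations that show errors after the source changes.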

Rename a Data Refinery flow

  1. In Data Refinery, click the info icon to open the info pane, and then click the Details tab.
  2. Click the Edit icon next to the Data Refinery flow name in the DATA REFINERY FLOW DETAILS pane.

Clone a Data Refinery flow

To create a copy of a Data Refinery flow, go to the project and select the Assets tab. Scroll down to the Data Refinery flows section, and then select Clone from the ACTIONS menu for the flow that you want to copy. The copy is added to the Data Refinery flows list as “original-name copy 1”.

Remove a Data Refinery flow

To remove a Data Refinery flow, go to the project and select the Assets tab. Scroll down to the Data Refinery flows section, and then select Remove from the ACTIONS menu for the flow that you want to remove.