Managing Data Refinery flows
A Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance data. As you refine your data by applying operations to a data set, you dynamically build a customized Data Refinery flow that you can modify in real time and save for future use.
- Save Data Refinery flows
- Run Data Refinery flows
- Cancel Data Refinery flow runs
- Reopen Data Refinery flows to continue working
- Change the source of a Data Refinery flow
- View summary information
- Rename Data Refinery flows
- Remove Data Refinery flows
- Schedule Data Refinery flows
Save Data Refinery flows
Data Refinery flows are saved to the project that you’re working in. Save a Data Refinery flow so that you can continue refining a data set later and pick up where you left off.
Save a Data Refinery flow to the project at any time by clicking the Save Data Refinery flow icon in the Data Refinery toolbar. The Data Refinery flow is also saved implicitly whenever you run it against a data set.
The output of the Data Refinery flow is saved as a data asset file in CSV format: shaped.csv. For example, if the source file is airline-data.csv, the default name for the Data Refinery flow is
For a list of all the Data Refinery flows in a project, see the Data Refinery flows section on the Assets tab of the project.
Run Data Refinery flows
Data Refinery supports large data sets, which can be time-consuming and unwieldy to refine. So that you can work quickly and efficiently, Data Refinery operates on a sample of each data set, that is, a subset of its rows.
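To illustrate the idea of refining a sample rather than the full data set, here is a minimal Python sketch. It is not Data Refinery's implementation: the function name `sample_rows`, the choice of taking the first rows of the file, and the sample size are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def sample_rows(csv_text, max_rows):
    """Return the header and at most max_rows data rows from CSV text.

    Assumption: the sample is simply the first max_rows rows. The real
    sampling strategy used by the product is not described here.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                   # first line holds the column names
    rows = list(islice(reader, max_rows))   # stop reading after max_rows rows
    return header, rows


# Hypothetical data set with four rows; only two are read for the sample.
data = "carrier,delay\nAA,5\nUA,12\nDL,0\nWN,7\n"
header, rows = sample_rows(data, 2)
print(header, rows)
```

Operations are designed against the small sample, and the complete data set is processed only when the flow is run.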
To run a Data Refinery flow against your entire data set, you can either:
- Click the Run icon in the Data Refinery toolbar
- Go to the project’s Assets tab, scroll down to the Data Refinery flows section and select Run from the ACTIONS menu
Cancel Data Refinery flow runs
You can cancel a Data Refinery flow run when it’s in progress, that is, when its status is Running. To cancel a run, select Cancel from the run’s menu on the History tab of the Data Refinery flow details page.
Reopen Data Refinery flows to continue working
To reopen a Data Refinery flow and continue refining your data, you can either:
- Click Refine in the toolbar of the Data Refinery flow details page
- Go to the project’s Assets tab. Scroll down to the Data Refinery flows section and select Refine from the ACTIONS menu
When you save the Data Refinery flow again, you can either overwrite the existing flow or create a new flow by changing the name of the Data Refinery flow in the Details pane before you save it.
Change the source of a Data Refinery flow
You can change the source of a saved Data Refinery flow so that the same flow runs against a different source data asset. The new data set must have a schema that is compatible with the original data set (for example, column names, number of columns, and data types).
To change the source, go to the project. Click the Assets tab. Scroll down to the Data Refinery flows section and click the Data Refinery flow. On the Data Refinery flow summary page, click the “Change the source data asset” icon to select a different data source.
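Before switching sources, it can help to compare the two schemas yourself. The following Python sketch shows one way to do that. It assumes a strict reading of "compatible" (same number of columns, same names in the same order, same data types); the exact rules that Data Refinery applies are not stated here, and the function name `schema_mismatches` and the type labels are hypothetical.

```python
def schema_mismatches(original, candidate):
    """Return a list of differences between two schemas.

    Each schema is a list of (column_name, data_type) pairs.
    Assumption: compatibility means identical names, order, and types.
    """
    problems = []
    if len(original) != len(candidate):
        problems.append(
            f"column count differs: {len(original)} vs {len(candidate)}"
        )
    # Compare the columns position by position, up to the shorter schema.
    for i, ((o_name, o_type), (c_name, c_type)) in enumerate(zip(original, candidate)):
        if o_name != c_name:
            problems.append(f"column {i}: name {o_name!r} vs {c_name!r}")
        elif o_type != c_type:
            problems.append(f"column {o_name!r}: type {o_type} vs {c_type}")
    return problems


# Hypothetical schemas: the candidate stores "delay" as a string.
original = [("carrier", "string"), ("delay", "integer")]
candidate = [("carrier", "string"), ("delay", "string")]
print(schema_mismatches(original, candidate))
```

An empty list means the two schemas match under these assumptions.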
View summary information
To view summary information for a Data Refinery flow, go to the project. Select the Assets tab. Scroll down to the Data Refinery flows section and click the Data Refinery flow.
The Summary section of the Data Refinery flow details page provides high-level information about the data source, the Data Refinery flow, and the target data set (the Data Refinery flow output). You can also preview the data source and the target data sets from here by clicking their eye icons.
The Runs section of the Data Refinery flow details page provides detailed historical information about any Data Refinery flow runs. It also provides information about the current schedule, which you can change.
The History tab displays the following information about each Data Refinery flow run:
- Number of rows read from the data source and number of rows written to the target
- Size (amount of data processed, in MB)
- Initiated by - The name of the person who initiated the flow
To view detailed information about a run, select View log from the run’s menu. The log file name is the name of the Data Refinery flow followed by the date and time (in 24-hour format) when the Data Refinery flow was run. Characters that are invalid in file names are replaced with an underscore (_). When you no longer need the information about a run, select Delete from the run’s menu.
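The log file naming rule described above can be sketched in Python as follows. This is an illustration only: the timestamp layout, the separator between the flow name and the timestamp, and the set of characters treated as invalid are assumptions, and `log_file_name` is a hypothetical helper rather than part of the product.

```python
import re
from datetime import datetime

# Assumption: characters commonly rejected in file names, plus whitespace.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|\s]')


def log_file_name(flow_name, run_time):
    """Build a log file name from a flow name and the time of the run."""
    stamp = run_time.strftime("%Y-%m-%d %H:%M:%S")  # %H is the 24-hour clock
    # Replace every invalid character with an underscore.
    return INVALID_CHARS.sub("_", f"{flow_name} {stamp}")


# Hypothetical flow name and run time.
name = log_file_name("my flow", datetime(2019, 3, 5, 14, 30, 0))
print(name)  # my_flow_2019-03-05_14_30_00
```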
The Schedule tab displays the date, day, and time of each scheduled Data Refinery flow. It also displays the start date and time, interval, and end date and time for the current schedule.
You can create, pause, resume, edit, or delete a schedule from this tab.
Rename Data Refinery flows
To rename a Data Refinery flow, click the Edit icon next to the Data Refinery flow name in the Details pane or in the DATA REFINERY FLOW DETAILS pane of the Save and Run page.
Remove Data Refinery flows
To remove a Data Refinery flow, go to the project. Select the Assets tab. Scroll down to the Data Refinery flows section and select Remove from the ACTIONS menu.