Managing Data Refinery flows
A Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance data. As you refine your data by applying operations to a data set, you dynamically build a customized Data Refinery flow that you can modify in real time and save for future use.
- Save a Data Refinery flow
- Run or schedule a job for a Data Refinery flow
- Reopen a Data Refinery flow to continue working
- Change the Data Refinery flow output file
- Change the source of a Data Refinery flow
- Rename a Data Refinery flow
- Clone a Data Refinery flow
- Remove a Data Refinery flow
Save a Data Refinery flow
Save a Data Refinery flow by clicking the Save Data Refinery flow icon in the Data Refinery toolbar. Data Refinery flows are saved to the project that you’re working in. Save a Data Refinery flow so that you can continue refining a data set later.
The default output of the Data Refinery flow is saved as a data asset file in CSV format: shaped.csv. For example, if the source file is airline-data.csv, the default name and output for the Data Refinery flow is
Run or schedule a job for a Data Refinery flow
Data Refinery supports large data sets, which can be time-consuming and unwieldy to refine. So that you can work quickly and efficiently, Data Refinery operates on a sample subset of rows in the data set. The sample size is 1 MB or 10,000 rows, whichever comes first. When you run a job for the Data Refinery flow, the entire data set is processed. When you run the job, you select the runtime and you can add a one-time or repeating schedule.
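The "whichever comes first" rule for the sample can be illustrated with a short Python sketch. This is not how Data Refinery is implemented. The function name `take_sample`, the reading of 1 MB as 1,000,000 bytes, and the choice to measure size as UTF-8 encoded row text are all assumptions made for illustration.

```python
from typing import Iterable, List

MAX_ROWS = 10_000
MAX_BYTES = 1_000_000  # assumption: 1 MB counted as 10^6 bytes


def take_sample(rows: Iterable[str],
                max_rows: int = MAX_ROWS,
                max_bytes: int = MAX_BYTES) -> List[str]:
    """Collect rows until either the row limit or the size limit is reached."""
    sample: List[str] = []
    used = 0
    for row in rows:
        size = len(row.encode("utf-8"))
        # Stop at whichever limit is hit first: row count or total size.
        if len(sample) >= max_rows or used + size > max_bytes:
            break
        sample.append(row)
        used += size
    return sample


# Narrow rows hit the row limit; wide rows hit the size limit.
narrow = take_sample(["a,b\n"] * 20_000)
wide = take_sample(["x" * 1000] * 5_000)
print(len(narrow), len(wide))
```

With narrow rows, the sample stops at the row limit. With wide rows, it stops at the size limit well before 10,000 rows. In both cases a job run still processes the entire data set.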
In Data Refinery, click the Jobs icon in the Data Refinery toolbar, and then select Save and create a job or Save and view jobs.
After you save a Data Refinery flow, you can also create a job for it from the Project page. Go to the Assets tab, select the Data Refinery flow, and choose Create job from the Actions menu.
To edit or run the job, you must have the Admin or Editor role for the project. With the Viewer role, you can only view the job details.
For more information about jobs, see Jobs in a project.
Reopen a Data Refinery flow to continue working
To reopen a Data Refinery flow and continue refining your data, go to the project’s Assets tab. Scroll down to the Data Refinery flows section and click the Data Refinery flow name.
Change the Data Refinery flow output file
- In Data Refinery, open the Information pane and click the Details tab.
- Click the Edit button.
- In the DATA REFINERY FLOW OUTPUT pane, click the Edit icon to change any of the following properties:
  - Target location. (The target data set must be different from the source data set.)
  - Data set name and description
  - Relational database targets only: Choose whether to overwrite the data in the existing data set. (If the target data set is not in a relational database, the target data is always overwritten.)
  - File format
  - Column header information
  - Encoding (UTF-8 or SJIS)
Change the source of a Data Refinery flow
You can change the source of a saved Data Refinery flow so that you can run the same flow with a different source data set. In the Steps pane in Data Refinery, click the Edit icon next to Data Source to choose a different source data set.
For best results, the new data set should have a schema that is compatible with that of the original data set (for example, the same column names, number of columns, and data types). If the new data set has a different schema, operations that don't work with that schema show errors. You can edit or delete those operations, or change the source to one that has a more compatible schema.
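Before you switch sources, you can compare the two schemas outside Data Refinery. The following Python sketch checks column names, column count, and roughly inferred data types for two CSV inputs. It is not part of Data Refinery. The helper names (`infer_type`, `describe_schema`, `schema_differences`) are hypothetical, the type inference is deliberately simple (int, float, or str), and the code assumes well-formed CSV text with a header row.

```python
import csv
import io


def infer_type(values):
    """Return 'int', 'float', or 'str' for a list of string cell values."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "str"


def describe_schema(csv_text):
    """Map each column name to an inferred type, in column order."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return {name: infer_type([row[i] for row in data])
            for i, name in enumerate(header)}


def schema_differences(original, new):
    """List the ways the new schema differs from the original schema."""
    problems = []
    if len(original) != len(new):
        problems.append(f"column count differs: {len(original)} vs {len(new)}")
    for name, dtype in original.items():
        if name not in new:
            problems.append(f"missing column: {name}")
        elif new[name] != dtype:
            problems.append(f"type changed for {name}: {dtype} -> {new[name]}")
    return problems


original = describe_schema("flight,delay\nAA1,5\nAA2,7\n")
candidate = describe_schema("flight,delay_minutes,carrier\nBB1,3.5,BB\n")
for problem in schema_differences(original, candidate):
    print(problem)
```

An empty result suggests that the operations in the flow are likely to apply to the new source. Each reported difference points to a column whose operations may need to be edited or deleted.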
Rename a Data Refinery flow
- In Data Refinery, open the Information pane and click the Details tab.
- Click the Edit icon next to the Data Refinery flow name.
Clone a Data Refinery flow
To create a copy of a Data Refinery flow, go to the project and select the Assets tab. Scroll down to the Data Refinery flows section and select Clone from the Actions menu for the Data Refinery flow. The copy is added to the Data Refinery flows list as “original-name copy 1”.
Remove a Data Refinery flow
To remove a Data Refinery flow, go to the project and select the Assets tab. Scroll down to the Data Refinery flows section and select Remove from the Actions menu for the Data Refinery flow.