GUI operations in Data Refinery

Data Refinery supports the following categories of GUI operations.

Select a GUI operation from the + Operation button.

A subset of the operations is available from each column’s actions menu (three vertical dots in the column header). You can rename a column by clicking the Edit icon in the column header.

FREQUENTLY USED

Calculate
Perform a calculation with another column or with a specified value. The operators are:

  • Addition
  • Division
  • Exponentiation
  • Is between two numbers
  • Is equal to
  • Is greater than
  • Is greater than or equal to
  • Is less than
  • Is less than or equal to
  • Is not equal to
  • Modulus
  • Multiplication
  • Subtraction

Convert column type

When you open a file in Data Refinery, the Convert column type operation is automatically applied as the first step if it detects any non-string data types in the data. Data types are automatically converted to inferred data types. Click Edit to change the automatic conversion for selected columns. As with any other operation, you can undo the step. The Convert column type operation is re-applied every time you open the file in Data Refinery.  Automatic conversion is applied as needed for file-based data sources only. (It does not apply to a data source from a database connection.)

You can manually apply the Convert column type operation to change the data type of a column at any point in the Data Refinery flow. You can create a new column to hold the result of this operation or you can overwrite the existing column.

Tip: A column’s data type determines the operations you can perform. Changing the data type can affect which operations are relevant for that column.

Filter
Filter rows by the selected columns. Keep rows with the selected column values; filter out all other rows.

The operators for numeric, string, and Boolean (logical), and date and timestamp columns are:

Operator Numeric String Boolean Date and timestamp
Contains      
Does not contain      
Does not end with      
Does not start with      
Ends with      
Is between two numbers      
Is empty  
Is equal to  
Is greater than    
Is greater than or equal to    
Is in    
Is less than    
Is less than or equal to    
Is not empty  
Is not equal to  
Is not in    
Is not null    
Is null  
Starts with      

Math

You can apply math operations only to numeric columns. You can create a new column to hold the result of an operation or you can overwrite the existing column.

Math > Absolute value
Get the absolute value of a number.
Example: The absolute value of both 4 and -4 is 4.

Math > Arc cosine
Get the arc cosine of an angle.

Math > Ceiling
Get the nearest integer of greater value, also known as the ceiling of the number.
Examples: The ceiling of 2.31 is 3. The ceiling of -2.31 is -2.

Math > Exponent
Get a number raised to the power of the column value.

Math > Floor
Get the nearest integer of lesser value, also known as the floor of the number.
Example: The floor of 2.31 is 2. The floor of -2.31 is -3.

Math > Round
Get the whole number nearest to the column value. If the column value is a whole number, return it.

Math > Square root
Get the square root of the column value.

Remove
Remove the selected column.

Rename
Rename the selected column.

Sort ascending
Sort rows by the selected column in ascending order.

Sort descending
Sort rows by the selected column in descending order.

Substitute
Obscure sensitive information from view by substituting a random string of characters for the actual data in the selected column.

Text

You can apply text operations only to string columns. You can create a new column to hold the result of an operation or you can overwrite the existing column.

Text > Collapse spaces
Collapse multiple, consecutive spaces in the text to a single space.

Text > Concatenate string
Link together any string to the text. You can prepend the string to the text, append the string to the text, or both.

Text > Lower case
Convert the text to lower case.

Text > Number of characters
Return the number of characters in the text.

Text > Pad characters
Pad the text with the specified string. Specify whether to pad the text on the left, right, or both the left and right.

Text > Substring
Create substrings from the text that start at the specified position and have the specified length.

Text > Title case
Convert the text to title case.

Text > Trim quotes
Remove single or double quotation marks from the text.

Text > Trim spaces
Remove leading, trailing, and extra spaces from the text.

Text > Upper case
Convert the text to upper case.

CLEANSE

Convert column value to missing
Convert values in the selected column to missing values if they match values in the specified column or they match a specified value.

Extract date or time value
Extract a selected portion of a date or time value from a column with a date or timestamp data type.

Remove duplicates
Remove rows with duplicate column values.

Remove empty rows
Remove rows that have a blank or missing value for the selected column.

Replace missing values
Replace missing values in the column with a specified value or with the value from a specified column in the same row.

Replace substring
Replace the specified substring with the specified text.

ORGANIZE

Aggregate
Apply summary calculations to the values of a column. Each aggregation creates a new column. Optionally select Group by columns to group the new column by another column that defines a characteristic of the group, for example, a department or an ID. You can group by multiple columns. You can combine multiple aggregations in a single operation.

The available aggregate operations depend on the data type.

Numeric data:

  • Count unique values
  • Minimum
  • Maximum
  • Sum
  • Standard deviation
  • Mean

String data:

  • Combine row values
  • Count unique values

Concatenate
Concatenate the values of two or more columns.

Conditional replace
Replace the values in a column based on conditions.

Join
Combine data from two data sets based on a comparison of the values in specified key columns. Specify the type of join to perform, select the columns (join keys) in both data sets that you want to compare, and select the columns that you want in the resulting data set.

The join types include:

Join type Description
Left join Returns all rows in the original data set and return only matching rows in the joining data set. Returns one row in the original data set for each matching row in the joining data set.
Right join Returns all rows in the joining data set and return only matching rows in the original data set. Returns one row in the joining data set for each matching row in the original data set.
Inner join Returns only the rows in each data set that match rows in the other data set. Returns one row in the original data set for each matching row in the joining data set.
Full join Returns all rows in both data sets. Blends rows in the original data set with matching rows in the joining data set.
Semi join Returns only the rows in the original data set that match rows in the joining data set. Returns one row in the original data set for all matching rows in the joining data set.
Anti join Returns only the rows in the original data set that do not match rows in the joining data set.

Watch this video to see how to join two data sets. The video shows examples of the join and filter operations.

Figure 1. Video iconJoin two data sets
This video shows you how to join two data sets.

Sample
Generate a subset of your data by using one of the following methods. Sampling steps from UI operations apply only when the flow is run.

  • Random sample: Each data record of the subset has an equal probability of being chosen.
  • Stratified sample: Divide the data into one or more subgroups called strata. Then generate one random sample that contains data from each subgroup.

Split column
Split the column by non-alphanumeric characters, position, pattern, or text.

NATURAL LANGUAGE

Remove stop words
Remove common words of the English language, such as “the” or “and.” Stop words usually have little semantic value for text analytics algorithms and models. Remove the stop words to reduce the data volume and to improve the quality of the data that you use to train machine learning models.

Optional: To confirm which words were removed, apply the Tokenize operation (by words) on the selected column, and then view the statistics for the words in the Profile tab. You can undo the Tokenize step later in the Data Refinery flow.

Tokenize
Break up English text into words, sentences, paragraphs, lines, characters, or by regular expression.