GUI operations in Data Refinery
Data Refinery supports the following categories of GUI operations.
Select a GUI operation from the + Operation button.
A subset of the operations is available from each column’s Actions menu (three vertical dots in the column header ). You can rename a column by clicking the Edit icon in the column header.
Perform a calculation with another column or with a specified value. The operators are:
- Is between two numbers
- Is equal to
- Is greater than
- Is greater than or equal to
- Is less than
- Is less than or equal to
- Is not equal to
Convert column type
When you open a file in Data Refinery, the Convert column type operation is automatically applied as the first step if it detects any non-string data types in the data. Data types are automatically converted to inferred data types. Click Edit to change the automatic conversion for selected columns. As with any other operation, you can undo the step. The Convert column type operation is reapplied every time that you open the file in Data Refinery. Automatic conversion is applied as needed for file-based data sources only. (It does not apply to a data source from a database connection.)
If the data is converted to an Integer or to a Decimal data type, you can specify the decimal symbol and the thousands grouping symbol for all applicable columns. Strings that are converted to the Decimal data type use a dot for the decimal symbol and a comma for the thousands grouping symbol. Alternatively, you can select comma for the decimal symbol and dot or a custom symbol for the thousands grouping symbol. The decimal symbol and the thousands grouping symbol cannot be the same.
The source data is read from left to right until a terminator or an unrecognized character is encountered. For example, if you are converting string data
12,834 to Decimal and you do not specify what to do with the comma (,), the data will be truncated to
12. Similarly, if the source data has multiple dots (.), and you select dot for the decimal symbol, the first dot is used as the decimal separator and the digits following the second dot are truncated. A source string of
1.834.230,000 is converted to a
The Convert column type operation automatically converts these date and timestamp formats:
You can manually apply the Convert column type operation to change the data type of a column at any point in the Data Refinery flow. You can create a new column to hold the result of this operation or you can overwrite the existing column.
Tip: A column’s data type determines the operations that you can perform. Changing the data type can affect which operations are relevant for that column.
Filter rows by the selected columns. Keep rows with the selected column values; filter out all other rows.
The operators for numeric, string, and Boolean (logical), and date and timestamp columns are:
|Operator||Numeric||String||Boolean||Date and timestamp|
|Does not contain||✓|
|Does not end with||✓|
|Does not start with||✓|
|Is between two numbers||✓|
|Is equal to||✓||✓||✓|
|Is greater than||✓||✓|
|Is greater than or equal to||✓||✓|
|Is less than||✓||✓|
|Is less than or equal to||✓||✓|
|Is not empty||✓||✓||✓|
|Is not equal to||✓||✓||✓|
|Is not in||✓||✓|
|Is not null||✓|
You can apply math operations only to numeric columns. You can create a new column to hold the result of an operation or you can overwrite the existing column.
Math > Absolute value
Get the absolute value of a number.
Example: The absolute value of both 4 and -4 is 4.
Math > Arc cosine
Get the arc cosine of an angle.
Math > Ceiling
Get the nearest integer of greater value, also known as the ceiling of the number.
Examples: The ceiling of 2.31 is 3. The ceiling of -2.31 is -2.
Math > Exponent
Get a number raised to the power of the column value.
Math > Floor
Get the nearest integer of lesser value, also known as the floor of the number.
Example: The floor of 2.31 is 2. The floor of -2.31 is -3.
Math > Round
Get the whole number nearest to the column value. If the column value is a whole number, return it.
Math > Square root
Get the square root of the column value.
Remove the selected column.
Rename the selected column.
Sort rows by the selected column in ascending order.
Sort rows by the selected column in descending order.
Obscure sensitive information from view by substituting a random string of characters for the actual data in the selected column.
You can apply text operations only to string columns. You can create a new column to hold the result of an operation or you can overwrite the existing column.
Text > Collapse spaces
Collapse multiple, consecutive spaces in the text to a single space.
Text > Concatenate string
Link together any string to the text. You can prepend the string to the text, append the string to the text, or both.
Text > Lower case
Convert the text to lower case.
Text > Number of characters
Return the number of characters in the text.
Text > Pad characters
Pad the text with the specified string. Specify whether to pad the text on the left, right, or both the left and right.
Text > Substring
Create substrings from the text that start at the specified position and have the specified length.
Text > Title case
Convert the text to title case.
Text > Trim quotes
Remove single or double quotation marks from the text.
Text > Trim spaces
Remove leading, trailing, and extra spaces from the text.
Text > Upper case
Convert the text to upper case.
Convert column value to missing
Convert values in the selected column to missing values if they match values in the specified column or they match a specified value.
Extract date or time value
Extract a selected portion of a date or time value from a column with a date or timestamp data type.
Remove rows with duplicate column values.
Remove empty rows
Remove rows that have a blank or missing value for the selected column.
Replace missing values
Replace missing values in the column with a specified value or with the value from a specified column in the same row.
Replace the specified substring with the specified text.
Apply summary calculations to the values of one or more columns. Each aggregation creates a new column. Optionally select Group by columns to group the new column by another column that defines a characteristic of the group, for example, a department or an ID. You can group by multiple columns. You can combine multiple aggregations in a single operation.
The available aggregate operations depend on the data type.
- Count unique values
- Standard deviation
- Combine row values
- Count unique values
Concatenate the values of two or more columns.
Replace the values in a column based on conditions.
Combine data from two data sets based on a comparison of the values in specified key columns. Specify the type of join to perform, select the columns (join keys) in both data sets that you want to compare, and select the columns that you want in the resulting data set.
The join key columns in both data sets need to be compatible data types. If the Join operation is the first step that you add, check whether the Convert column type operation automatically converted the data type of the join key columns in the first data set when you opened the file in Data Refinery. Also, depending where the Join operation is in the Data Refinery flow, you can use the Convert column type operation to ensure that the join key columns’ data types match. Click Steps to see the snapshot view of the steps.
The join types include:
|Left join||Returns all rows in the original data set and return only matching rows in the joining data set. Returns one row in the original data set for each matching row in the joining data set.|
|Right join||Returns all rows in the joining data set and return only matching rows in the original data set. Returns one row in the joining data set for each matching row in the original data set.|
|Inner join||Returns only the rows in each data set that match rows in the other data set. Returns one row in the original data set for each matching row in the joining data set.|
|Full join||Returns all rows in both data sets. Blends rows in the original data set with matching rows in the joining data set.|
|Semi join||Returns only the rows in the original data set that match rows in the joining data set. Returns one row in the original data set for all matching rows in the joining data set.|
|Anti join||Returns only the rows in the original data set that do not match rows in the joining data set.|
Watch this video to see how to join two data sets. The video shows examples of the join and filter operations.
Generate a subset of your data by using one of the following methods. Sampling steps from UI operations apply only when the flow is run.
- Random sample: Each data record of the subset has an equal probability of being chosen.
- Stratified sample: Divide the data into one or more subgroups called strata. Then generate one random sample that contains data from each subgroup.
Split the column by non-alphanumeric characters, position, pattern, or text.
Combine the rows from two data sets that share the same schema and filter out the duplicates. If you select Allow a different number of columns and allow duplicate values, the operation is a
UNION ALL command.
Tip: If you receive an error about incompatible schemas, check if the automatic Convert column type operation changed the data types of the first data set. Delete the Convert column type step and try again.
Remove stop words
Remove common words of the English language, such as “the” or “and.” Stop words usually have little semantic value for text analytics algorithms and models. Remove the stop words to reduce the data volume and to improve the quality of the data that you use to train machine learning models.
Optional: To confirm which words were removed, apply the Tokenize operation (by words) on the selected column, and then view the statistics for the words in the Profile tab. You can undo the Tokenize step later in the Data Refinery flow.
Break up English text into words, sentences, paragraphs, lines, characters, or by regular expression.