Use logical expressions to define rule logic in a data quality definition or column content in the output table of data quality rules.
These logical expressions follow a basic syntax where a variable, such as a word or term, is evaluated based on a specified condition or type of check.
A rule expression can consist of the following types of elements:
Each rule expression is made up of at least one check and can combine various checks. Also, check the provided samples to see how you can use the expressions.
Variables and literals
Variables and literals are elements that you combine with checks, operations, and general, date or time, mathematical, and string elements. You bind logical variables to physical data sources when you create a rule from the data quality definition. You can select the following types:
Element | Description |
---|---|
var |
A logical variable. Variable names are not case-sensitive. Var1 and var1 refer to the same variable. Names can contain the following characters:• Alphabetic characters • Unicode characters that belong to the Letter category • Numeric characters • Underscore (_) • Question mark (?) • Dollar sign ($) The first character of a variable name must be an alphabetic character. Names cannot contain periods and spaces. |
0.123 |
A numerical value. |
'value' |
A character or string of characters and numbers that must be enclosed in single quotation marks. If the string contains a single quotation mark, precede the single quotation mark with a backslash. If the string contains a backslash, precede the backslash with another backslash. |
null |
Checks for null values. |
{value1,value2} |
A list of values separated by commas. |
Logic
Combine logic blocks with checks logic. The basic logic is a single check, such as field1 exists
. You can expand this logic by adding conditions and joining them by using operators. You can enclose parts of the expression in parentheses
to specify priority. marital_status='married' and (age<18 or age>100)
is different from (marital_status='married' and age<18) or age>100
Block element | Expression | Description |
---|---|---|
Not [ ] | not logic | Negates the expression:not (ucase(PrimaryName) contains 'TEST') |
[ ] Or [ ] | logic1 or logic2 | Either logic1 or logic2 must be true:UnhandledNameData not exists OR len(trim(UnhandledNameData)) = 0 |
[ ] And [ ] | logic1 and logic2 | Both logic1 and logic2 must be true:IF Field1 exists AND len(trim(Field1)) <> 0 THEN Field1 is_numeric |
If [ ] Then [ ] | if check1 then check2 | Conditional expression:IF NamePrefix exists THEN len(trim(NamePrefix)) > 1 |
If [ ] Then [ ] Else [ ] | if check1 then check2 else check2 | Conditional expression with alternative path:if age >= 18 then credit_card exists else (not credit_card exists) |
Checks
Specify checks for various conditions: logical comparison, matching, occurrence. A check is a part of the expression that evaluates to a Boolean result.
Logical comparisons
Block element | Expression | Description |
---|---|---|
[ ] < [ ] | x < y | Checks if the first value is less than the second value. |
[ ] <= [ ] | x <= y | Checks if the first value is less than or equal to the second value. |
[ ] <> [ ] | x <> y | Checks if the first value is not equal to the second value. |
[ ] = [ ] | x = y | Checks if the first value is equal to the second value. For Boolean comparison, use 0 or 1 as the second value. |
[ ] > [ ] | x > y | Checks if the first value is greater than the second value. |
[ ] >= [ ] | x >= y | Checks if the first value is greater than or equal to the second value. |
The source data and reference data must have compatible data types. You can use an expression that is made up of scalar functions, arithmetic operations, and variables as a source data. The expression must evaluate to a character or string data type.
For string data, logical comparisons are case-sensitive and ignore all trailing spaces in the value you are checking. Spaces at the beginning of the values to be checked are not ignored. Additional spaces or characters or differences between uppercase and lowercase values might prevent a successful validation. You can use trimming and modifications to standardize the values.
Matching
Block element | Expression | Description |
---|---|---|
[ ] contains substring [ ] | x contains y | Checks if the first value contains the substring that you specify. Notes: • The containment check is case-sensitive. • Both source and reference data must be strings. • Quotation marks are required if the string is a hardcoded literal. • Both source and reference data can be a variable or the result of a scalar function. • You can use an expression that is made up of scalar functions, arithmetic operations, and variables as source data. If you do this, make sure that each expression evaluates to a character or string data type. |
[ ] exists | x exists | Checks if the value exists. |
[ ] exists in reference column [ ] | x in_reference_column y | Checks if the first value exists in the specified reference column. Source and reference data must have compatible data types. |
[ ] is a date using format [ ] | x is_date y | Checks if the first value is a date in the specified format. See allowed date formats. If you configure the expression manually, enclose the format in single quotation marks. |
[ ] is in a reference list [ ] | x in_reference_list y | Checks if the first value is in the specified reference list. Source and reference data must have compatible data types. |
[ ] is numeric | x is_numeric | Checks if the value is a numeric value. This check accepts nonformatted numbers or numbers that are formatted in a locale-specific way, such as "100,000.00." By default, the current locale is used. |
[ ] matches data class [ ] | x matches_dataclass y | Checks if the first value matches the data class identified by the class code that you specified as the second value. To find the class code for a specific data class, use the Watson Data API GET /v3/data_classes/{artifact_id}/versions/{version_id} . You can find the artifact and version IDs for the call in the data class URL. The class code is in the entity section of the response. You can use this check only with value-based data classes. Such data classes show Value as the scope of code in the Data matching section on the Overview tab. You can also check the scope information in Predefined data classes details. |
[ ] matches format [ ] | x matches_format y | Checks if the first value matches the specified format. The pattern explicitly defines what is acceptable in each specific character position. You can use the following operators to build the pattern: • 'A' for any uppercase letter • 'a' for any lowercase letter • '9' for any 0-9 digit • 'x' for any alphanumeric value, regardless of its case • 'I' for a CJKV (Chinese, Japanese, Korean, or Vietnamese) ideograph, as defined by the Unicode Standard Any other character in the pattern string indicates that you are explicitly looking for the exact character you entered. The source data must be of a character data type. |
[ ] matches regex [ ] | x matches_regex y | Checks if the first value matches the pattern specified by the regular expression. The Perl Regular Expression Library is used for these types of evaluations. |
Occurrences
Element | Expression | Description |
---|---|---|
[ ] is unique | x unique | Checks if the value is unique in the column. |
[ ] occurs [ ] time(s) | x occurs y | Checks if the first value occurs in the column the specified number of times. |
[ ] occurs at least [ ] time(s) | x occurs>= y | Checks if the first value occurs in the column at least the specified number of times. |
[ ] occurs at most [ ] time(s) | x occurs<= y | Checks if the first value occurs in the column at most the specified number of times. |
[ ] occurs less than [ ] time(s) | x occurs< y | Checks if the first value occurs in the column less than the specified number of times. |
[ ] occurs more than [ ] time(s) | x occurs> y | Checks if the first value occurs in the column more than the specified number of times. |
These checks imply the computation of aggregations on the source columns. Such operations can run much longer than checks that test only single records.
Operations
Combine operation elements with checks logic. You use variables and literals, general elements, date and time, mathematical, and string elements with operation elements.
Element | Description |
---|---|
x + y | Adds the first value to the second value. |
x - y | Subtracts the second value from the first value. |
x * y | Multiplies the specified values. |
x / y | Divides the specified values. |
x % y | Uses the first number as a dividend and the second number as a divisor. The modulo is the remainder from the division of the first number by the second number. For example, 7 % 2 computes to 1. |
x ^ y | The exponential value of a number. For example, 5 ^ 3 is raising 5 to the third power (or 555), which is equal to 125. |
Date and time
You can combine date and time elements with checks logic.
Valid date formats
No delimiter | Dash delimiter | Slash delimiter | Dot delimiter |
---|---|---|---|
%yyyy%mm%dd | %yyyy-%mm-%dd | %yyyy/%mm/%dd | %yyyy.%mm.%dd |
%yy%mm%dd | %yy-%mm-%dd | %yy/%mm/%dd | %yy.%mm.%dd |
%mm%dd%yyyy | %mm-%dd-%yyyy | %mm/%dd/%yyyy | %mm.%dd.%yyyy |
%mm%dd%yy | %mm-%dd-%yy | %mm/%dd/%yy | %mm.%dd.%yy |
%yyyy%dd%mm | %yyyy-%dd-%mm | %yyyy/%dd/%mm | %yyyy.%dd.%mm |
%yy%dd%mm | %yy-%dd-%mm | %yy/%dd/%mm | %yy.%dd.%mm |
%dd%mm%yyyy | %dd-%mm-%yyyy | %dd/%mm/%yyyy | %dd.%mm.%yyyy |
%dd%mm%yy | %dd-%mm-%yy | %dd/%mm/%yy | %dd.%mm.%yy |
Conversions
Element | Expression | Description |
---|---|---|
convert [ ] from format [ ] to another format [ ] | convertdate(x, y, z) | Converts the first value from the specified source format to the specified target format. |
convert [ ] to Gregorian date | julianDayToGregorian(x) | Converts the specified date in the Julian format to the Gregorian format. |
convert [ ] to Julian format | gregorianToJulianDay(x) | Converts the specified date in the Gregorian format to the Julian format. |
convert [ ] to date using format [ ] | datevalue(x,y) | Converts the first value to a date with the specified format. |
convert [ ] to time using format [ ] | timevalue(x,y) | Converts the first value to a time with the specified format. |
convert [ ] to timestamp using format [ ] | timestampvalue(x,y) | Converts the first value to a timestamp by using the specified format. |
Formatting
Element | Expression | Description |
---|---|---|
current date | date() | Returns the system date from the computer as a date value. |
current time | time() | Returns the system time from the computer as a time value. |
current timestamp | timestamp() | Returns the system time from the computer as a timestamp value. |
year of date [ ] | year(x) | Returns a number that represents the year for the specified date. |
month of date [ ] | month(x) | Returns a number that represents the month for the specified date. |
day of date [ ] | day(x) | Returns a number that represents the day of the month for the specified date. |
day of week for date [ ] | weekday(x) | Returns a number that represents the day of the week for the specified date, where 1 is Sunday. |
hour of time [ ] | hours(x) | Returns a number that represents the hours for the specified time value. |
minutes of time [ ] | minutes(x) | Returns a number that represents the minutes for the specified time value. |
seconds of time [ ] | seconds(x) | Returns a number that represents the seconds and milliseconds for the specified time value. |
Functions
Element | Expression | Description |
---|---|---|
add [ ] months to date [ ] | addmonthsdate(x, y) | Adds the specified number of months to the specified date. |
add [ ] months to timestamp [ ] | addmonthstimestamp(x, y) | Adds the specified number of months to the specified timestamp. |
number of days between dates [ ] and [ ] | datediff(x, y) | Returns the number of days between the two specified dates. |
round date [ ] using format [ ] | round_date(x, y) | Rounds the date value by using the specified format. |
round time [ ] using format [ ] | round_time(x, y) | Rounds the time value by using the specified format. |
round timestamp using format [ ] | round_timestamp(x, y) | Rounds the timestamp value by using the specified format. |
time difference between times [ ] and [ ] | timediff(x, y) | Returns the difference between the two specified times in the number of hours, minutes, and seconds. The first value is the earliest of the two times. The second value is the later of the two times. The returned value is a time value. |
truncate date [ ] using format [ ] | trunc_date(x, y) | Truncates the date by using the specified format. The first value must be either a variable (which must be bound to a column of type date), or an expression or function that returns a date. The format value must either be a string literal with the format of a string, or a variable that is bound to a string column or a string literal value. |
truncate time [ ] using format [ ] | trunc_time(x, y) | Truncates the time by using the specified format. The first value must be either a variable (which must be bound to a column of type time), or an expression or function that returns a time. The format value must either be a string literal with the format of a string, or a variable that is bound to a string column or a string literal value. |
truncate timestamp [ ] using format [ ] | trunc_timestamp(x, y) | Truncates the timestamp by using the specified format. The first value must be either a variable (which must be bound to a column of type timestamp), or an expression or function that returns a timestamp. The format value must either be a string literal with the format of a string, or a variable that is bound to a string column or a string literal value. |
General
You can combine general elements with checks, operations, and mathematical, string, and date and time elements.
Element | Expression | Description |
---|---|---|
data classes matched by [ ] | dataclassesof(x) | Finds data classes in your data that match the specified data class. This value is usually a variable that is bound to a column with values to classify. |
frequency of [ ] | occurrences(x) | Finds the frequency of the specified value. |
lookup [ ] from reference key [ ] and reference value [ ] | lookup(x, y, z) | Replaces a value with its corresponding value in a lookup table. To run this function, you must have a lookup table that contains a list of the values with a corresponding column that contains data that is related to the first column.
For example, if the original value you are looking for is contained in a column that contains keys, such as product codes, then the related lookup table contains one column that lists all the possible key values, and another column that
contains the corresponding values to use as a replacement for the keys. This function is generally used in systems where internal codes are used to represent values that occur in different places in your data sources. Note: The first parameter of the function is the value to look up. The second parameter must be bound to the reference column that contains the keys in the lookup table. The third parameter must be bound to the reference column that contains the values in the lookup table. Both reference columns must be in the same physical table. The result of the function is the value from the lookup table corresponding to the key given as first parameter. |
number of [ ] values that are not null grouped by [ ] | count_not_null(x, y) | Finds the number of null (nondistinct) values in a column. The first value must contain a variable that is bound to a column to count. The second value specifies an optional grouping column. If specified, the functions return the count of the null values in the first column for all rows where the values of the second column are the same. |
number of distinct values of [ ] grouped by [ ] | count(x, y) | Finds the number of distinct values in a column. The first value must contain a variable that is bound to a column to count. The second value specifies an optional grouping column. If specified, the functions return the count of the distinct values in the first column for all rows where the values of the second column are the same. |
replace missing value for [ ] with [ ] | coalesce(x, y) | Looks for null columns in the first value and replaces it with the second value. If the first value that you specify is not null, the value is not replaced. |
Mathematical
You can combine mathematical elements with checks logic.
Element | Expression | Description |
---|---|---|
absolute value of [ ] | abs(x) | Returns the absolute value of a numeric value that you specify. For example, the absolute value of [-13] would return 13. |
average value of [ ] grouped by [ ] | avg(x, y) | An aggregate function that returns the average value of the specified numeric column, grouped by the specified value. |
exponential value of [ ] | exp(x) | Returns the exponential value of the specified numeric value. |
maximum value of [ ] grouped by [ ] | max(x, y) | An aggregate function that returns the maximum value that is found in the specified numeric column, grouped by the specified value. For the first value, you can specify either a column or a variable (which must be bound to a column when using the rule), or an expression that contains at least one variable. |
minimum value of [ ] grouped by [ ] | min(x, y) | An aggregate function that returns the minimum value that is found in the specified numeric column, grouped by the specified value. |
normalize value [ ] grouped by [ ] | standardize(x, y) | Normalizes the values in a column into the number of standard deviations of the value around the mean value of the column. The normalized value is computed as = (value-mean)/stddev. If you specify a value for grouped by, the mean/standard deviation that is used is not the one of the whole column, but the one of all values that have the same value in the group by column. For example, if a column has a mean value of 10 and a standard deviation of 2, the normalized value of 12 is 1.0 (=mean+1stddev). The normalized value of 6 is -2.0 (=mean-2stddev). |
round number [ ] keeping [ ] decimals | round(x, y) | Returns a rounded value of a number that you specify. The optional decimal parameter specifies the number of places to the right of the decimal point to round to. If not specified, the value of decimal is zero. |
scale of number [ ] | scale(x*) | Returns the number of digits to the right of the decimal point. For example, you want to determine the scale of 2.3456789. The scale is 7. |
square root of [ ] | sqrt(x) | Returns the square root of the specified numeric value. |
standard deviation of [ ] grouped by [ ] | stddev(x) | An aggregate function that computes the standard deviation of a numerical column that you specify, grouped by the specified value. |
sum of [ ] grouped by [ ] | sum(x, y) | An aggregate function that returns the sum of all the values within the specified numeric column. |
truncate number [ ] keeping [ ] decimals | trunc(x, y) | Returns a truncated value of the specified number. The keeping parameter is the value to be truncated. The optional decimal parameter specifies the number of places to the right of the decimal point to truncate to. If not specified, the value of decimal value is zero. |
String
You can combine string elements with checks logic.
Identities
Element | Expression | Description |
---|---|---|
ASCII code of character [ ] | ascii(x) | Returns the ASCII character set value for a character value. |
character with ASCII code [ ] | char(x) | Returns the character value for the specified ASCII character. |
concatenate [ ] times the string [ ] | str(x, y) | Generates a string that concatenates the specified value the specified number of times. For example, if you specify concatenate [5] times the string [AB] , the data quality rule returns ABABABABAB. |
length of [ ] | len(x) | Returns the total number of characters (that is, the length) in the specified string. |
parse [ ] as a number | val(x) | Takes a string as an input and tries to parse it as a number. For example, if you specify 12.34, the results is the number 12.34. |
position of [ ] in string [ ] | index(x, y) | Returns the index of the first occurrence of a substring in another string. Returns the zero-based index of the occurrence if found, or -1 if the substring is not found. |
Modifications
Element | Expression | Description |
---|---|---|
convert [ ] to string using format [ ] | tostring(x, y) | Converts the specified value to a string in the specified format. |
first [ ] characters of [ ] | left(x, y) | Returns the first n characters of the specified string, where n is the number of characters to return. |
last [ ] characters of [ ] | right(x, y) | Returns the last n characters of the specified string, where n is the number of characters to return. |
lowercase of [ ] | lcase(x) | Converts all characters in the specified string to lowercase. |
substring of [ ] beginning at position [ ] and of length [ ] | substring(x, y, z) | Returns a substring of the specified string, beginning at the specified position, and of the specified length. The value that you specify for the beginning position is the index of the first character to retrieve (inclusive). 1 is the index of the first character in the string. For example, you want to use the three-digit (actual character positions four to six) value from each product code to determine which division is responsible for the product. |
uppercase of [ ] | ucase(x) | Converts all characters in the specified string to uppercase. |
Padding
Element | Expression | Description |
---|---|---|
add [ ] spaces after [ ] | rpad(x, y) | Adds the specified number of spaces after the specified string. |
add [ ] spaces before and [ ] spaces after [ ] | pad(x, y, z) | Adds the specified number of spaces before and after the specified string. |
insert [ ] spaces before [ ] | lpad(x, y) | Adds the specified number of spaces to the beginning of the specified string. |
Trimming
Element | Expression | Description |
---|---|---|
left trim of [ ] | ltrim(x) | Removes all spaces at the beginning of the specified string. |
right trim [ ] | rtrim(x) | Removes all spaces at the end of the specified string. |
trim [ ] | trim(x) | Removes all spaces at the beginning and end of the specified string. |
Learn more
Parent topic: Managing data quality definitions