Predefined data quality checks
Predefined data quality checks are automatically run when you run basic data quality analysis as part of metadata enrichment. These data quality checks identify basic quality problems with your data, both on the data asset and the column level.
The following data quality checks generate individual quality scores and contribute to the overall quality score of a data asset or a column:
- Data class violations
- Data type violations
- Format violations
- Inconsistent capitalization
- Inconsistent representation of missing values
- Suspect values
- Unexpected duplicated values
- Unexpected missing values
- Values out of range
Also, each data quality check is associated with a data quality dimension. See Data quality dimensions.
The results of these checks are shown as part of the data quality information for a data asset or a column. See Data quality analysis results. However, not every check is applied to all columns in a data asset. Which checks are run depends on the column's data type and the data that it contains.
The predefined checks validate individual values (value-based checks) within a column or the column metadata that was found during profiling (metadata-based checks).
Data class violations
A data class is the kind of data that is detected for a particular column. Examples of data class might include postal code, country, or credit card number. This check counts the number of values in a column that do not match the detected data class of that column. Each value that violates the class is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.
For example, a column has a data class 'credit card number' assigned. The expected value for that data class is a numeric string of 16 characters. If that column contains a value of 'MA,' then that value is identified as a violation of the data class. If that column has 100 values, 40 values do not match the class, the column has a quality score of 60% for this check because 40% of the values violate the column's data class.
Type of check: value-based check
Dimension: Validity
Data type violations
A data type defines the valid format for data in a particular column. Examples of data type might include text, numeric, or date. This metric counts the number of values in a column that do not match the detected or assigned data type of a column. Each value that does not match the inferred data type in length, precision, or scale, or violates the specified data type is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.
For example, a column has a data type DECIMAL (4,2) specified. That data type defines the format of the column as a numeric value with a total length of 4 digits with 2 of those digits following the decimal point. If that column contains a numeric value with too many digits, that value is identified as a violation of the data type. If that column has 100 values, 40 values do not match the type, the column has a quality score of 60% for this check because 40% of the values violate the column's data type.
Type of check: value-based check
Dimension: Validity
Format violations
Currently, not evaluated in metadata enrichment. Therefore, always a data quality score of 100% is shown.
Type of check: value-based check
Dimension: Validity
Inconsistent capitalization
This check examines whether the capitalization of values in a column is consistent. In columns of data type String, values can have any case, title case, sentence case, or can be uppercase or lowercase. If the check detects that most (over 95%) of the values have specific capitalization, the rest of the values are flagged as quality issues.
For example, a column has 100 values. Of these values, 90 are lowercase, and 10 are uppercase. So the check sets the column's quality score to 90% because 10% of the values are in a different case than the majority.
Addressing inconsistent capitalization violation: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.
Type of check: value-based check
Dimension: Consistency
Inconsistent representation of missing values
It is common for data assets to contain varying representations of missing data. One column in a data asset might contain several values of NULL, several others that read NA, and still others where the field is blank. All of these values might suggest missing information, but they are interpreted differently and can lead to inaccurate analysis. The inconsistent representation of missing values is detected by identifying columns with both null values and empty values. A column that contains both null values and empty values suggests that there is no standardized way to represent missing values. Often when a column contains null values, any empty values should also be represented as null.
Each value that matches this criteria in a column is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.
Addressing representation of missing values violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.
Type of check: value-based check
Dimension: Consistency
Suspect values
This check looks for suspect values that do not seem to match the majority of the other values in the column because their characteristics are different. It identifies outliers in numeric columns or string columns with numeric data. Results for string columns with string values are ignored. The quality score is based on the percentage of values identified subtracted from a percentage of 100.
For example, if a column contains 100 values, and 98 of those values are numeric strings in the range 5 - 9 characters in length, but two are 30-45 character text strings, those two values are identified as suspect because they do not match the characteristics of the other values. For this individual check, the column has a quality score of 98% because 2% of the values are suspect.
Addressing suspect values violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.
Type of check: value-based check
Dimension: Consistency
Unexpected duplicated values
This check identifies duplicated values in columns where most of the values are unique. All non-unique values are flagged as quality issues. The uniqueness threshold is set in the metadata enrichment settings. The default setting is 95%. See Uniqueness threshold.
The quality score is based on the percentage of values identified subtracted from a percentage of 100. For example, a set of patient data contains a column with social security numbers. Most of the values in the column appear only once because each patient is only associated with one SSN. Each duplicate value in this column is identified. If the column has 100 values, 3 values are duplicates, the column has a quality score of 97% for this check because 3% of the values are duplicates.
Type of check: metadata-based check
Dimension: Uniqueness
Unexpected missing values
This check looks for unexpected missing values in columns. If a column is close to having no null or empty values, rows with missing values are deemed incomplete. The null threshold determines when missing values are allowed and when missing values are considered unexpected. This threshold is set in the metadata enrichment settings. The default setting is 5%, which means that missing values in 5% or less of the rows in a column are considered unexpected missing values. See Nullability.
The quality score is based on the percentage of values in that column that are complete. For example, with the default setting, if a column has 100 values and 4 values are missing, the quality score for this check is 96%. If 9 values are missing, the quality score is 100% because that number of missing values is above the set threshold and missing values aren't considered unexpected.
Type of check: value-based check
Dimension: Completeness
Values out of range
Currently, not evaluated in metadata enrichment. Therefore, always a data quality score of 100% is shown.
Type of check: value-based check
Dimension: Validity
Learn more
Parent topic: Managing data quality