Data quality violations

Data quality analysis identifies quality problems with your data by analyzing quality dimensions, both on the data asset and the column level.

Results are provided for the data quality violations described in the following sections.

For each type of violation, the number of findings is shown, along with the percentage of evaluated records that show the violation.

Data class violations

A data class is the kind of data detected for a particular column. Examples of data classes include postal code, country, or credit card number. This metric counts the values in a column that do not match the detected data class of that column. Each value that violates the class is identified. The quality score is 100% minus the percentage of values identified.

For example, a column is assigned the data class 'credit card number'. The expected value for that data class is a numeric string of 16 characters. If that column contains the value 'MA', that value is identified as a violation of the data class. If that column has 100 values, 40 values do not match the class, and no other quality dimensions are identified, the column has a quality score of 60% because 40% of the values violate the column's data class.
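
The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the scoring rule, not the product's implementation: the function name `data_class_score` and the regular expression standing in for the 'credit card number' data class are assumptions made for the sketch.

```python
import re

# Assumed stand-in for the 'credit card number' data class:
# a numeric string of exactly 16 characters, as in the example above.
CREDIT_CARD = re.compile(r"\d{16}")

def data_class_score(values, pattern=CREDIT_CARD):
    """Return (violating values, quality score in percent) for one column."""
    violations = [v for v in values if not pattern.fullmatch(v)]
    # Score = 100% minus the percentage of values that violate the class.
    score = 100.0 - 100.0 * len(violations) / len(values)
    return violations, score

# 60 matching values and 40 values such as 'MA' that do not match.
column = ["4111111111111111"] * 60 + ["MA"] * 40
violations, score = data_class_score(column)
print(len(violations), score)
```

The sketch assumes a non-empty column and that no other quality dimension contributes to the score.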

Data type violations

A data type defines the valid format for data in a particular column. Examples of data types include text, numeric, or date. This metric counts the values in a column that do not match the detected or assigned data type of the column. Each value that does not match the inferred data type in length, precision, or scale, or that violates the specified data type, is identified. The quality score is 100% minus the percentage of values identified.

For example, a column has the data type DECIMAL(4,2) specified. That data type defines the format of the column as a numeric value with a total of 4 digits, 2 of which follow the decimal point. If that column contains a numeric value with too many digits, such as 123.45, that value is identified as a violation of the data type. If that column has 100 values, 40 values do not match the type, and no other quality dimensions are identified, the column has a quality score of 60% because 40% of the values violate the column's data type.
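
As a rough illustration of what "too many digits" means for DECIMAL(4,2), the following Python sketch checks string values against a precision and scale. The helper names `fits_decimal` and `data_type_score` are hypothetical, and the check is an assumption about how such a rule could be expressed, not a description of how metadata enrichment evaluates data types.

```python
from decimal import Decimal, InvalidOperation

def fits_decimal(value, precision=4, scale=2):
    """True if the string value fits DECIMAL(precision, scale)."""
    try:
        d = Decimal(value)
    except InvalidOperation:
        return False  # not numeric at all
    if not d.is_finite():
        return False  # NaN and Infinity are rejected
    _, digits, exponent = d.as_tuple()
    frac_digits = max(0, -exponent)                # digits after the point
    int_digits = max(0, len(digits) + exponent)    # digits before the point
    return frac_digits <= scale and int_digits <= precision - scale

def data_type_score(values, precision=4, scale=2):
    """Quality score: 100% minus the percentage of violating values."""
    bad = sum(1 for v in values if not fits_decimal(v, precision, scale))
    return 100.0 - 100.0 * bad / len(values)

column = ["12.34"] * 60 + ["123.45"] * 40
print(data_type_score(column))
```

Here 123.45 is rejected because it needs three digits before the decimal point, while DECIMAL(4,2) leaves room for only two.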

Format violations

Currently not evaluated in metadata enrichment.

Inconsistent capitalization

This dimension checks whether the use of uppercase and lowercase letters in the analyzed data asset is consistent.

For example, a column has values that are written in both lowercase and uppercase. If the column has 100 values, 90 of them are in lowercase, and 10 of them are in uppercase, and no other quality dimensions are identified, the column has a quality score of 90% because 10% of the values are in a different case than the majority.
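
A minimal Python sketch of this majority rule follows. The three-way classification into "upper", "lower", and "other" is an assumption made for illustration, and the function names are hypothetical; the product may distinguish more capitalization styles.

```python
from collections import Counter

def case_of(value):
    """Classify a string by its capitalization (assumed categories)."""
    if value.isupper():
        return "upper"
    if value.islower():
        return "lower"
    return "other"  # mixed case, or no cased characters

def capitalization_score(values):
    """Return (majority case, percentage of values in that case)."""
    counts = Counter(case_of(v) for v in values)
    majority, majority_count = counts.most_common(1)[0]
    return majority, 100.0 * majority_count / len(values)

column = ["boston"] * 90 + ["BOSTON"] * 10
print(capitalization_score(column))
```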

Addressing inconsistent capitalization violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.

Inconsistent representation of missing values

It is common for data assets to contain varying representations of missing data. One column in a data asset might contain several values of NULL, several others that read NA, and still others where the field is blank. All of these values might suggest missing information, but they are interpreted differently and can lead to inaccurate analysis. The inconsistent representation of missing values is detected by identifying columns with both null values and empty values. A column that contains both null values and empty values suggests that there is no standardized way to represent missing values. Often when a column contains null values, any empty values should also be represented as null.

Each value that matches these criteria in a column is identified. The quality score is 100% minus the percentage of values identified.
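
The detection condition described above, a column that contains both null values and empty values, can be sketched as follows. The function name is hypothetical, and representing null as Python's `None` and empty as a zero-length string is an assumption for the sketch. It only reports whether the condition holds and does not attempt to reproduce the score.

```python
def missing_value_representations(values):
    """Count null and empty values and flag columns that contain both."""
    nulls = sum(1 for v in values if v is None)
    empties = sum(1 for v in values if v == "")
    return {
        "null": nulls,
        "empty": empties,
        # Both representations present: no standardized way to mark missing data.
        "inconsistent": nulls > 0 and empties > 0,
    }

print(missing_value_representations(["a", None, "", "b", None]))
print(missing_value_representations(["a", None, "b"]))
```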

Addressing representation of missing values violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.

Suspect values

When the data class of a column cannot be determined, this metric looks for suspect values: values whose characteristics differ from those of the majority of the other values in the column. Each suspect value that violates the domain is identified. The quality score is 100% minus the percentage of values identified.

For example, if a column contains 100 values, and 98 of those values are numeric strings 5-9 characters in length, but two are text strings 30-45 characters in length, those two values are identified as suspect because they do not match the characteristics of the other values. If no other quality dimensions are identified, the column has a quality score of 98% because 2% of the values are suspect.
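
The idea can be illustrated with a deliberately crude Python sketch that profiles each value as digits-only or free text and flags whatever does not match the majority profile. The characteristics that metadata enrichment actually compares are not described here, so the `profile` function and its two categories are purely assumptions.

```python
from collections import Counter

def profile(value):
    """Crude characteristic of a value (assumed): digits-only or free text."""
    return "digits" if value.isdigit() else "text"

def suspect_score(values):
    """Return (suspect values, quality score in percent)."""
    counts = Counter(profile(v) for v in values)
    majority = counts.most_common(1)[0][0]
    suspects = [v for v in values if profile(v) != majority]
    return suspects, 100.0 - 100.0 * len(suspects) / len(values)

# 98 short numeric strings plus two long text strings.
column = [str(10000 + i) for i in range(98)] + ["x" * 30, "y" * 45]
suspects, score = suspect_score(column)
print(len(suspects), score)
```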

Addressing suspect values violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.

Unexpected duplicated values

This dimension identifies duplicated values in columns where most of the values are unique. The uniqueness threshold is set in the metadata enrichment settings. The default setting is 95%. See Uniqueness threshold. With the default setting, each duplicate value is identified in any column where at least 95% of the values are unique. The quality score is 100% minus the percentage of values identified.

For example, a set of patient data contains a column with social security numbers. The majority of the values in the column appear only once because each patient is only associated with one SSN. Each duplicate value in this column is identified. If the column has 100 values, 3 values are duplicates, and no other quality dimensions are identified, the column has a quality score of 97% because 3% of the values are duplicates.
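
The following Python sketch shows one way to read this rule. It rests on two assumptions that the text does not spell out: a value counts as a duplicate whenever it occurs more than once in the column, and a column that falls below the uniqueness threshold has no values flagged and therefore scores 100%. The function name `duplicate_score` is hypothetical.

```python
from collections import Counter

def duplicate_score(values, uniqueness_threshold=95.0):
    """Return (duplicate values, quality score in percent)."""
    counts = Counter(values)
    # Assumption: every occurrence of a repeated value counts as a duplicate.
    duplicates = [v for v in values if counts[v] > 1]
    unique_pct = 100.0 * (len(values) - len(duplicates)) / len(values)
    if unique_pct < uniqueness_threshold:
        # Assumption: duplicates are expected here, so nothing is flagged.
        return [], 100.0
    return duplicates, unique_pct

# 97 distinct SSN-like strings and one value that appears 3 times.
ssns = [f"{i:09d}" for i in range(97)] + ["999999999"] * 3
duplicates, score = duplicate_score(ssns)
print(len(duplicates), score)
```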

Unexpected missing values

This dimension looks for unexpected missing values in columns. If a column is close to having no null or empty values, rows with missing values are deemed incomplete. The null threshold determines when missing values are allowed and when missing values are considered unexpected. This threshold is set in the metadata enrichment settings. The default setting is 5%, which means that missing values in 5% or less of the rows in a column are considered unexpected missing values. See Nullability.

The quality score is based on the percentage of values in that column that are complete. For example, with the default setting, if a column has 100 values and 4 values are missing, the quality score for this check is 96%. If 9 values are missing, the quality score is 100% because that share of missing values is above the set threshold, so the missing values aren't considered unexpected.
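
The threshold behavior in this example can be sketched in Python as follows. The function name `missing_values_score` is hypothetical, and treating both `None` and the empty string as missing is an assumption based on the phrase "null or empty values".

```python
def missing_values_score(values, null_threshold=5.0):
    """Quality score in percent for the unexpected-missing-values check."""
    missing = sum(1 for v in values if v is None or v == "")
    missing_pct = 100.0 * missing / len(values)
    if missing_pct > null_threshold:
        # Above the threshold, missing values are allowed, not unexpected.
        return 100.0
    # At or below the threshold, each missing value lowers the score.
    return 100.0 - missing_pct

print(missing_values_score(["x"] * 96 + [None] * 4))
print(missing_values_score(["x"] * 91 + [None] * 9))
```

With the default threshold of 5%, the first column scores 96% and the second scores 100%, matching the two cases above.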

Values out of range

Currently not evaluated in metadata enrichment.
