Data quality violations

Data quality analysis identifies quality problems with your data by analyzing quality dimensions, both on the data asset and the column level.

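Results are provided for the following data quality violations: data class violations, data type violations, duplicated values, format violations, inconsistent capitalization, inconsistent representation of missing values, missing values, suspect values, and values out of range.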

For each type of violation, the results show the number of findings and the percentage of evaluated records that have this violation.

Data class violations

A data class is the kind of data detected for a particular column, such as postal code, country, or credit card number. This metric counts the values in a column that do not match the detected data class of that column. Each value that violates the class is identified. The quality score is 100% minus the percentage of values identified.

For example, a column is assigned the data class 'credit card number'. The expected value for that data class is a numeric string of 16 characters. If the column contains the value 'MA', that value is identified as a violation of the data class. If the column has 100 values, 40 values do not match the class, and no other quality violations are identified, the column has a quality score of 60% because 40% of the values violate the column's data class.
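The arithmetic in this example can be sketched in Python. This is only an illustration, not the product's classifier: the function name `data_class_score` and the pattern are hypothetical, and the data class is modeled as "exactly 16 digits" as described in the example.

```python
import re

# Assumption: the 'credit card number' class is modeled as exactly 16 digits,
# following the example in the text. The real classifier is not described here.
CREDIT_CARD_PATTERN = re.compile(r"\d{16}")


def data_class_score(values, pattern=CREDIT_CARD_PATTERN):
    """Return (violations, score_percent) for one column and one pattern."""
    violations = [v for v in values if not pattern.fullmatch(str(v))]
    if not values:
        return violations, 100.0
    # Quality score = 100% minus the percentage of violating values.
    score = 100.0 * (len(values) - len(violations)) / len(values)
    return violations, score


# 60 values that match the class and 40 that do not, as in the example.
column = ["4111111111111111"] * 60 + ["MA"] * 40
class_violations, class_score = data_class_score(column)
print(len(class_violations), class_score)
```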

Data type violations

A data type defines the valid format for data in a particular column, such as text, numeric, or date. This metric counts the values in a column that do not match the detected or assigned data type of that column. Each value that does not match the inferred data type in length, precision, or scale, or that violates the specified data type, is identified. The quality score is 100% minus the percentage of values identified.

For example, a column has the data type DECIMAL(4,2) specified. That data type defines the format of the column as a numeric value with a total of 4 digits, 2 of which follow the decimal point. If the column contains a numeric value with too many digits, that value is identified as a violation of the data type. If the column has 100 values, 40 values do not match the type, and no other quality violations are identified, the column has a quality score of 60% because 40% of the values violate the column's data type.
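As a rough sketch of the DECIMAL(4,2) check, the following Python uses the standard `decimal` module. The helper names `fits_decimal` and `data_type_score` are hypothetical, and the rule that trailing zeros count toward the scale is an assumption made for this illustration.

```python
from decimal import Decimal, InvalidOperation


def fits_decimal(value, precision=4, scale=2):
    """Check whether a value fits DECIMAL(precision, scale).

    Assumption: digits are counted as written, so '1.2300' has 4 fractional digits.
    """
    try:
        number = Decimal(str(value))
    except InvalidOperation:
        return False  # not a number at all
    if not number.is_finite():
        return False  # NaN and Infinity do not fit a DECIMAL column
    _, digits, exponent = number.as_tuple()
    fractional_digits = max(-exponent, 0)
    integer_digits = max(len(digits) + exponent, 0)
    return fractional_digits <= scale and integer_digits <= precision - scale


def data_type_score(values, precision=4, scale=2):
    """Return (violations, score_percent) for one column."""
    violations = [v for v in values if not fits_decimal(v, precision, scale)]
    if not values:
        return violations, 100.0
    score = 100.0 * (len(values) - len(violations)) / len(values)
    return violations, score


# 60 values that fit DECIMAL(4,2) and 40 with too many digits.
column = ["12.34"] * 60 + ["123.456"] * 40
type_violations, type_score = data_type_score(column)
print(len(type_violations), type_score)
```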

Duplicated values

This dimension identifies duplicated values in columns where most of the values are unique. In a column where at least 95% of the values are unique, each duplicate value is identified. The quality score is 100% minus the percentage of values identified.

For example, a set of patient data contains a column with social security numbers. The majority of the values in the column appear only once because each patient is only associated with one SSN. Each duplicate value in this column is identified. If the column has 100 values, 3 values are duplicates, and no other quality dimensions are identified, the column has a quality score of 97% because 3% of the values are duplicates.
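The 95% rule and the resulting score can be sketched as follows. The function name `duplicate_score` is hypothetical, and two details are assumptions: every occurrence of a repeated value is counted as a duplicate, and a column below the uniqueness threshold is simply not scored for this dimension.

```python
from collections import Counter


def duplicate_score(values, uniqueness_threshold=0.95):
    """Return (flagged_values, score_percent or None) for one column."""
    total = len(values)
    if total == 0:
        return [], None
    counts = Counter(values)
    unique = sum(1 for v in values if counts[v] == 1)
    if unique / total < uniqueness_threshold:
        # Assumption: the dimension is not evaluated for mostly non-unique columns.
        return [], None
    # Assumption: all occurrences of a repeated value are flagged.
    flagged = [v for v in values if counts[v] > 1]
    score = 100.0 * (total - len(flagged)) / total
    return flagged, score


# 97 unique identifiers plus one identifier that appears 3 times.
column = [f"id-{i:03d}" for i in range(97)] + ["id-dup"] * 3
dup_flagged, dup_score = duplicate_score(column)
print(len(dup_flagged), dup_score)
```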

Format violations

Currently not evaluated in metadata enrichment.

Inconsistent capitalization

This dimension checks whether uppercase and lowercase letters are used consistently in the analyzed data asset.

For example, a column has values that are written in both lowercase and uppercase. If the column has 100 values, 90 of them are in lowercase, and 10 of them are in uppercase, and no other quality dimensions are identified, the column has a quality score of 90% because 10% of the values are in a different case than the majority.
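One way to picture this check is a majority vote over the case style of each value. The sketch below is an assumption about how such a check could work, not the product's algorithm; `case_style` and `capitalization_score` are hypothetical names, and values without letters are ignored.

```python
from collections import Counter


def case_style(value):
    """Classify a string as 'lower', 'upper', 'other', or 'uncased'."""
    if not any(ch.isalpha() for ch in value):
        return "uncased"  # for example '123'; ignored by the check below
    if value.islower():
        return "lower"
    if value.isupper():
        return "upper"
    return "other"  # mixed case such as 'Boston'


def capitalization_score(values):
    """Return (majority_style, flagged_values, score_percent)."""
    styles = [case_style(v) for v in values]
    cased = [s for s in styles if s != "uncased"]
    if not cased:
        return None, [], 100.0
    majority = Counter(cased).most_common(1)[0][0]
    flagged = [v for v, s in zip(values, styles) if s not in (majority, "uncased")]
    score = 100.0 * (len(values) - len(flagged)) / len(values)
    return majority, flagged, score


# 90 lowercase values and 10 uppercase values, as in the example.
column = ["boston"] * 90 + ["BOSTON"] * 10
cap_majority, cap_flagged, cap_score = capitalization_score(column)
print(cap_majority, len(cap_flagged), cap_score)
```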

Addressing inconsistent capitalization violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.

Inconsistent representation of missing values

It is common for data assets to contain varying representations of missing data. One column in a data asset might contain several values of NULL, several others that read NA, and still others where the field is blank. All of these values might suggest missing information, but they are interpreted differently and can lead to inaccurate analysis. The inconsistent representation of missing values is detected by identifying columns with both null values and empty values. A column that contains both null values and empty values suggests that there is no standardized way to represent missing values. Often when a column contains null values, any empty values should also be represented as null.

Each value that matches these criteria in a column is identified. The quality score is 100% minus the percentage of values identified.
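The detection rule can be sketched in Python as follows. The text does not state exactly which values are counted, so this sketch assumes that the empty strings are the ones identified when the column also contains nulls. The function name `missing_representation_score` is hypothetical, null is modeled as `None`, and an empty value as `""`.

```python
def missing_representation_score(values):
    """Return (flagged_indexes, score_percent) for one column.

    Assumption: empty strings are flagged only if the column also contains nulls.
    """
    if not values:
        return [], 100.0
    has_null = any(v is None for v in values)
    empty_indexes = [i for i, v in enumerate(values) if v == ""]
    flagged = empty_indexes if has_null else []
    score = 100.0 * (len(values) - len(flagged)) / len(values)
    return flagged, score


# 5 nulls, 10 empty strings, and 85 populated values.
column = [None] * 5 + [""] * 10 + ["x"] * 85
rep_flagged, rep_score = missing_representation_score(column)
print(len(rep_flagged), rep_score)

# Without any nulls, the empty strings alone are not an inconsistency.
consistent_flagged, consistent_score = missing_representation_score([""] * 10 + ["x"] * 90)
print(len(consistent_flagged), consistent_score)
```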

Addressing representation of missing values violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.

Missing values

This dimension looks for missing values in a column. Rows with missing values are deemed incomplete. The quality score is based on the percentage of rows in that column that are complete.

For example, if a column has 100 values, 40 of those are missing values, and no other quality dimensions are identified, the quality score is 60% because 60 out of 100 values are identified as complete.
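The completeness calculation in this example reduces to a simple ratio. In the sketch below, the function name `completeness_score` and the set of missing markers (`None` and the empty string) are assumptions for illustration.

```python
def completeness_score(values, missing_markers=(None, "")):
    """Return (missing_count, score_percent) for one column."""
    if not values:
        return 0, 100.0
    missing = sum(1 for v in values if v in missing_markers)
    # Score = percentage of values that are complete.
    score = 100.0 * (len(values) - missing) / len(values)
    return missing, score


# 60 populated values and 40 missing values, as in the example.
column = ["value"] * 60 + [None] * 40
missing_count, complete_score = completeness_score(column)
print(missing_count, complete_score)
```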

Suspect values

When the data class of a column cannot be determined, this metric looks for suspect values, that is, values whose characteristics differ from those of the majority of the other values in the column. Each suspect value is identified. The quality score is 100% minus the percentage of values identified.

For example, if a column contains 100 values, 98 of which are numeric strings of 5 to 9 characters and 2 of which are text strings of 30 to 45 characters, those two values are identified as suspect because they do not match the characteristics of the other values. If no other quality violations are identified, the column has a quality score of 98% because 2% of the values are suspect.
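A very simple heuristic in the spirit of this example compares each value with the majority on two characteristics: whether it is numeric, and how long it is. This is a hypothetical sketch, not the product's detection logic; `suspect_score` and the `length_factor` threshold are assumptions.

```python
from collections import Counter
from statistics import median


def suspect_score(values, length_factor=3.0):
    """Return (suspect_values, score_percent) for a column of strings.

    Assumption: a value is suspect if its kind (numeric or text) differs from
    the majority kind, or if it is much longer than the median length.
    """
    if not values:
        return [], 100.0
    kinds = ["numeric" if v.isdigit() else "text" for v in values]
    majority_kind = Counter(kinds).most_common(1)[0][0]
    typical_length = median(len(v) for v in values)
    suspects = [
        v
        for v, kind in zip(values, kinds)
        if kind != majority_kind or len(v) > length_factor * typical_length
    ]
    score = 100.0 * (len(values) - len(suspects)) / len(values)
    return suspects, score


# 98 short numeric strings and 2 long text strings, as in the example.
column = [str(10000 + i) for i in range(98)]
column += ["this is a free-text comment, not an id"] * 2
suspects, suspects_score = suspect_score(column)
print(len(suspects), suspects_score)
```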

Addressing suspect values violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column.

Values out of range

Currently not evaluated in metadata enrichment.
