Data quality scores
A data quality score is displayed for the entire data asset and for all columns that the analyzed data asset contains. Data quality scores are computed based on the results of data quality checks that are run on the entire asset and its columns.
The following types of data quality checks provide data quality scores:
- Predefined data quality checks
  These checks are run when you run quality analysis as part of metadata enrichment. Each check is run on the entire asset but might not return results for all its columns, depending on the type of check.
  Each predefined data quality check is associated with a data quality dimension.
- Data quality rules (Managing data quality rules)
  Data quality rules validate specific conditions in your data source. They can be run manually or automatically on a schedule.
  A data quality rule can contribute to more than one dimension depending on the rule's configuration. If no dimension is set for a rule, its results are captured as dimension score None.
For each check, you can determine whether its results contribute to the overall data quality score. See Data quality analysis results.
You can also retrieve the data quality scores for individual assets by using the Watson Data API.
How data quality scores are calculated
The column score is calculated as a weighted average of the available dimension scores for the column, that is, the scores of all dimensions for which at least one data quality check was run and returned a result.
A dimension score, except for the Entity confidence dimension, is calculated by multiplying the probability numbers of all issues that the data quality checks looked for in this dimension, where an issue's probability number is (1 - frequency). For example, assume that a column has 2 different quality issues that are reported for the same dimension. Issue 1 occurs with a frequency of 10% and issue 2 with a frequency of 20%. Thus, the probability that a value in that column does not have issue 1 is 90%. For issue 2, it is 80%. So, the probability that a value in the column does not have any quality issue in that dimension is 72%, which is calculated as follows:
(1.0 - 0.1) × (1.0 - 0.2) = 0.9 × 0.8 = 0.72
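The multiplication above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the product's implementation; the function name `dimension_score` and its input format (a list of issue frequencies as fractions) are assumptions made for this example.

```python
from math import prod


def dimension_score(issue_frequencies):
    """Multiply the probability numbers (1 - frequency) of all issues.

    issue_frequencies: fractions between 0 and 1, one per issue that was
    reported for the dimension (hypothetical input format).
    """
    return prod(1.0 - frequency for frequency in issue_frequencies)


# Issue 1 occurs with a frequency of 10%, issue 2 with a frequency of 20%.
score = dimension_score([0.10, 0.20])
print(f"{score:.2%}")  # 72.00%
```

A check that ran and found nothing can be passed as a frequency of 0, which leaves the score at 100%.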
For the Entity confidence dimension, the dimension score represents the percentage of entities of the particular entity type that have no records with potential match issues as members.
Asset scores (the overall score or the dimension scores) are calculated as a weighted average of the corresponding scores of the asset's columns.
In projects, you can change what is considered for calculating the scores by changing the Contributes to overall score setting. This setting is on by default. You can exclude the results of entire columns, and the results for certain checks at column level or at asset level.
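The effect of the Contributes to overall score setting on a weighted average can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `weighted_average` is hypothetical, and the weights of 1 for results that contribute and 0 for excluded results are taken from the worked example later in this topic, not from a documented API.

```python
def weighted_average(scores, weights=None):
    """Return the weighted average of scores, or None if nothing contributes."""
    if weights is None:
        # Default setting: every result contributes with the same weight.
        weights = [1] * len(scores)
    total_weight = sum(weights)
    if total_weight == 0:
        return None
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


# Hypothetical column with four dimension scores, given as fractions.
dimension_scores = [0.90, 0.97, 1.00, 0.96]

# All four dimensions contribute.
all_on = weighted_average(dimension_scores)

# The fourth dimension is excluded from the calculation.
fourth_off = weighted_average(dimension_scores, [1, 1, 1, 0])

print(f"{all_on:.2%} {fourth_off:.2%}")
```

The same function applies at asset level, where the inputs are the column scores and the weights reflect which columns are excluded.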
In projects, the quality scores are recalculated in these cases:
- Data quality analysis is run in the context of metadata enrichment.
- Existing or new data quality rules are run on the asset.
- A data quality rule that contributed to the scores is deleted.
- A Contributes to overall score setting is changed.
- An IBM Match 360 entity data asset is updated.
In catalogs, the quality scores change when the asset is published again.
Score calculation example
Assume that a data asset has the columns ID, NAME, EMAIL, PHONE, and SALARY. All columns and all types of issues contribute to the overall scores (the default setting).
Initially, no data quality scores are available because no data quality check was run on the asset. To generate data quality information:
- IBM Match 360 analysis runs on the data asset and identifies these issues:
  - 10% matching entities for the data asset. This information is considered for the data quality dimension Entity confidence.
  The following scores at asset level are calculated:
  - Dimension score
    Entity confidence: (1 - 0.1) = 90%
  - Overall score: 90%
- Run data quality analysis as part of metadata enrichment. Quality analysis identifies these issues:
  - Missing values, which are considered for the data quality dimension Completeness:
    - 3% of the values in column NAME
    - 5% of the values in column EMAIL
    - 3% of the values in column PHONE
  - Data class violations, which are considered for the data quality dimension Validity:
    - 10% of the values in column EMAIL
    - 6% of the values in column PHONE
  - Outlier or suspect values, which are considered for the data quality dimension Consistency:
    - 4% of the values in column NAME
    - 1% of the values in column SALARY
  These findings result in the following scores for the individual columns:
  - Column ID
    - Dimension scores
      Entity confidence: 90% (unchanged)
      Completeness: 100% (The Unexpected missing values check didn't find any issues.)
      Validity: 100% (None of the predefined Validity checks found any issues.)
      Consistency: 100% (None of the predefined Consistency checks found any issues.)
    - Overall column score: (90% + 100% + 100% + 100%)/4 = 97.5%
  - Column NAME
    - Dimension scores
      Entity confidence: 90% (unchanged)
      Completeness: 100% - 3% = 97%
      Validity: 100%
      Consistency: 100% - 4% = 96%
    - Overall column score: (90% + 97% + 100% + 96%)/4 = 95.75%
  - Column EMAIL
    - Dimension scores
      Entity confidence: 90% (unchanged)
      Completeness: 100% - 5% = 95%
      Validity: 100% - 10% = 90%
      Consistency: 100%
    - Overall column score: (90% + 95% + 90% + 100%)/4 = 93.75%
  - Column PHONE
    - Dimension scores
      Entity confidence: 90% (unchanged)
      Completeness: 100% - 3% = 97%
      Validity: 100% - 6% = 94%
      Consistency: 100%
    - Overall column score: (90% + 97% + 94% + 100%)/4 = 95.25%
  - Column SALARY
    - Dimension scores
      Entity confidence: 90% (unchanged)
      Completeness: 100%
      Validity: 100%
      Consistency: 100% - 1% = 99%
    - Overall column score: (90% + 100% + 100% + 99%)/4 = 97.25%
  From these scores, the scores at asset level are calculated:
  - Dimension scores
    Entity confidence: (90% + 90% + 90% + 90% + 90%)/5 = 90%
    Completeness: (100% + 97% + 95% + 97% + 100%)/5 = 97.8%
    Validity: (100% + 100% + 90% + 94% + 100%)/5 = 96.8%
    Consistency: (100% + 96% + 100% + 100% + 99%)/5 = 99%
  - Overall score: (97.5% + 95.75% + 93.75% + 95.25% + 97.25%)/5 = 95.9%
- Run data quality rule Name_Complete, which is applied to column NAME to verify that it contains a given name and a surname. The rule is tied to the data quality dimension Completeness. That rule reports 1% violations in column NAME.
  The scores of the NAME column change as follows. The scores of the other columns remain unchanged.
  - Dimension scores
    Entity confidence: 90% (unchanged)
    Completeness: (1 - 0.03) × (1 - 0.01) = 0.9603 = 96.03%
    Validity: 100% (unchanged)
    Consistency: 96% (unchanged)
  - Overall score: (90% + 96.03% + 100% + 96%)/4 = 95.5%
  These changes also change the asset scores.
  - Dimension scores
    Entity confidence: 90% (unchanged)
    Completeness: (100% + 96.03% + 95% + 97% + 100%)/5 = 97.6%
    Validity: 96.8% (unchanged)
    Consistency: 99% (unchanged)
  - Overall score: (97.5% + 95.5% + 93.75% + 95.25% + 97.25%)/5 = 95.85%
- Run an additional data quality rule Phone_Valid, which is applied to column PHONE to verify that the phone number has the country code and prefix that correspond to the address. The rule is tied to the data quality dimension Validity. That rule reports 2% violations in column PHONE.
  The scores of the PHONE column change as follows. The scores of the other columns remain unchanged.
  - Dimension scores
    Entity confidence: 90% (unchanged)
    Completeness: 97% (unchanged)
    Validity: (1.0 - 0.06) × (1.0 - 0.02) = 0.9212 = 92.12%
    Consistency: 100% (unchanged)
  - Overall score: (90% + 97% + 92.12% + 100%)/4 = 94.78%
  These changes also result in changes of the asset scores.
  - Dimension scores
    Entity confidence: 90% (unchanged)
    Completeness: 97.6% (unchanged)
    Validity: (100% + 100% + 90% + 92.12% + 100%)/5 = 96.42%
    Consistency: 99% (unchanged)
  - Overall score: (97.5% + 95.5% + 93.75% + 94.78% + 97.25%)/5 = 95.76%
- Set all checks for the dimension Consistency to be ignored for score calculation. The dimension score for the dimension Consistency is no longer shown. All other dimension scores remain unchanged. The overall column and asset scores are recalculated.
  - Column scores
    Column ID: (1 × 90% + 1 × 100% + 1 × 100% + 0 × 100%)/(1 + 1 + 1 + 0) = 96.67%
    Column NAME: (1 × 90% + 1 × 96.03% + 1 × 100% + 0 × 96%)/(1 + 1 + 1 + 0) = 95.34%
    Column EMAIL: (1 × 90% + 1 × 95% + 1 × 90% + 0 × 100%)/(1 + 1 + 1 + 0) = 91.67%
    Column PHONE: (1 × 90% + 1 × 97% + 1 × 92.12% + 0 × 100%)/(1 + 1 + 1 + 0) = 93.04%
    Column SALARY: (1 × 90% + 1 × 100% + 1 × 100% + 0 × 99%)/(1 + 1 + 1 + 0) = 96.67%
  - Overall asset score: (96.67% + 95.34% + 91.67% + 93.04% + 96.67%)/5 = 94.68%
- Exclude the results for column SALARY from the score calculation. The column scores don't change. The overall and dimension scores for the asset are recalculated as follows:
  - Dimension scores
    Entity confidence: (1 × 90% + 1 × 90% + 1 × 90% + 1 × 90% + 0 × 90%)/(1 + 1 + 1 + 1 + 0) = 90%
    Completeness: (1 × 100% + 1 × 96.03% + 1 × 95% + 1 × 97% + 0 × 100%)/(1 + 1 + 1 + 1 + 0) = 97%
    Validity: (1 × 100% + 1 × 100% + 1 × 90% + 1 × 92.12% + 0 × 100%)/(1 + 1 + 1 + 1 + 0) = 95.53%
    Consistency: not shown
  - Overall asset score: (1 × 96.67% + 1 × 95.34% + 1 × 91.67% + 1 × 93.04% + 0 × 96.67%)/(1 + 1 + 1 + 1 + 0) = 94.18%
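The whole example can be recomputed with a short Python sketch. It is an illustration of the arithmetic only, not the product's implementation: the names `score_asset` and `weighted_average`, the dictionary layout of the issue frequencies, and the weights of 1 (contributes) and 0 (excluded) are assumptions made for this sketch. The issue frequencies and the Entity confidence score of 90% are taken from the example.

```python
from math import prod

DIMENSIONS = ["Entity confidence", "Completeness", "Validity", "Consistency"]
COLUMNS = ["ID", "NAME", "EMAIL", "PHONE", "SALARY"]
ENTITY_CONFIDENCE = 0.90  # from the IBM Match 360 step of the example


def weighted_average(scores, weights):
    """Average the values of scores, weighting each key by weights[key]."""
    total = sum(weights[key] for key in scores)
    return sum(weights[key] * value for key, value in scores.items()) / total


def score_asset(issues, dimension_weights, column_weights):
    """Return (column_scores, asset_dimension_scores, asset_score) as fractions."""
    column_dims = {}
    for column in COLUMNS:
        dims = {"Entity confidence": ENTITY_CONFIDENCE}
        for dim in DIMENSIONS[1:]:
            frequencies = issues.get(column, {}).get(dim, [])
            # Dimension score: product of (1 - frequency) over all issues.
            dims[dim] = prod(1.0 - f for f in frequencies)
        column_dims[column] = dims

    # Excluded columns still get a column score; they only drop out at asset level.
    column_scores = {
        column: weighted_average(dims, dimension_weights)
        for column, dims in column_dims.items()
    }
    asset_dims = {
        dim: weighted_average(
            {column: column_dims[column][dim] for column in COLUMNS}, column_weights
        )
        for dim in DIMENSIONS
        if dimension_weights[dim]  # excluded dimensions are not shown
    }
    asset_score = weighted_average(column_scores, column_weights)
    return column_scores, asset_dims, asset_score


# State after quality analysis, before any rule runs or exclusions.
after_analysis = {
    "NAME": {"Completeness": [0.03], "Consistency": [0.04]},
    "EMAIL": {"Completeness": [0.05], "Validity": [0.10]},
    "PHONE": {"Completeness": [0.03], "Validity": [0.06]},
    "SALARY": {"Consistency": [0.01]},
}
all_dimensions = {dim: 1 for dim in DIMENSIONS}
all_columns = {column: 1 for column in COLUMNS}
_, _, score_after_analysis = score_asset(after_analysis, all_dimensions, all_columns)

# Final state: both rules have run, Consistency and SALARY are excluded.
final_issues = {
    "NAME": {"Completeness": [0.03, 0.01], "Consistency": [0.04]},
    "EMAIL": {"Completeness": [0.05], "Validity": [0.10]},
    "PHONE": {"Completeness": [0.03], "Validity": [0.06, 0.02]},
    "SALARY": {"Consistency": [0.01]},
}
final_dimension_weights = {**all_dimensions, "Consistency": 0}
final_column_weights = {**all_columns, "SALARY": 0}
final_columns, final_dims, final_score = score_asset(
    final_issues, final_dimension_weights, final_column_weights
)

print(f"After quality analysis: {score_after_analysis:.2%}")
print(f"Final asset score:      {final_score:.2%}")
```

The first printed value should be 95.90%, the overall asset score after quality analysis. Changing the weight dictionaries is a quick way to explore how excluding a dimension or a column shifts the scores.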