Design data quality assets for analyzing and monitoring the data quality in a project.
You can have the following data quality assets in your project:
Before you start creating data quality definitions and rules, give the following topics some thought:
- What do you want to analyze and monitor?
- Which elements do you need to evaluate?
- What's the goal of the analysis, such as checking for completeness, validity, and so on?
- Required permissions
- To create, edit, or delete data quality assets, you must have the Manage data quality assets user permission and the Admin or the Editor role in the project.
Data quality definitions
A data quality definition represents a generic form of a data quality rule. It describes the rule evaluation or condition by using logical variables that are not tied to any actual data. Thus, it can be used in any number of data quality rules. If you change the data quality definition, you also change the validation logic for all rules derived from the definition.
You create and manage data quality definitions in projects. To make a data quality definition available for re-use in other projects, you can publish it to a catalog.
Properties for data quality definitions
In addition to the common asset properties, data quality definitions have the following properties in projects. Users with the appropriate permissions can edit all properties.
For more information about the common properties, see Common properties for assets.
Property | Description |
---|---|
Rule expression | The defined rule logic. Changes to the rule expression affect all rules derived from this data quality definition. |
Data quality dimension | Optional. The primary data quality metric for the rule logic. The selected dimension can be used as report category, for filtering, or for visualizing selected data. |
Governance artifacts | Optional. The business terms and governance rules that users assigned to the asset. |
Data quality rules
A data quality rule links or binds logical variables to actual data for evaluation. A rule is run against physical data to assess the quality of your data by evaluating and validating specific conditions. Each rule run provides statistics and information about potential exceptions as defined for the rule's output table.
You create, manage, and run data quality rules in projects.
You can create rules from one or more data quality definitions or you can create data quality rules with SQL statements. Rules built from data quality definitions capture which columns comply with the rule conditions and which don't. SQL-based rules are better suited to check for noncompliant records.
For example, you want to validate tax identifiers. So your concepts could be TaxID exists and Validate TaxID.
Now, you have these options:
-
Create rules from data quality definitions. For either concept, you can create a data quality definition with evaluation logic for the logical variable
tax_id
. The first condition is that the tax identifier (or TaxID) must exist, and the second condition is that the tax identifier must meet a defined format.Data quality definition TaxID exists:
tax_id exists
Data quality definition Validate TaxID:tax_id matches_format 'AA99-A999-9999'
Then, select one of these options:
- For each column that contains a tax identifier to be validated, define two data quality rules. The first rule binds the logical variable
tax_id
of the definition TaxID exists to the column. The second rule binds the logical variabletax_id
of the definition Validate TaxID to the column. - For each column that contains a tax identifier to be validated, define one data quality rule and use both data quality definitions in that rule. Bind the logical variable
tax_id
in either definition TaxID exists and Validate TaxID to the column. - Define one data quality rule and use both data quality definitions in that rule. Bind the logical variable
tax_id
in either definition TaxID exists and Validate TaxID to a parameter set of the type Parameter from column. Add all columns that contain a tax identifier to be validated to that parameter set.
- For each column that contains a tax identifier to be validated, define two data quality rules. The first rule binds the logical variable
-
Create an SQL-based rule:
select tax_id from taxschema.taxtable where tax_id is null or not regexp_like(tax_id, '^[a-zA-Z]{2}[0-9]{2}-[a-zA-Z][0-9]{3}-[0-9]{4}$')
Properties for data quality rules
In addition to the common asset properties, data quality rules have the following properties in projects. Users with the appropriate permissions can edit all properties.
For more information about the common properties, see Common properties for assets.
Property | Description | Rule type |
---|---|---|
Bound expressions | The list of rule expressions. For simple rules, you can see information about a binding target by hovering over the tag in the bound expression. | Simple rule Rule with external bindings |
SQL connection | The connection to the data source where the rule will be applied. | SQL-based rule |
SQL statements | The SQL query that makes up the rule. | SQL-based rule |
Data quality dimension | Optional. The primary data quality metric to which the data quality rule contributes. | Simple rule Rule with external bindings |
Related DataStage flow | The DataStage flow that provides the bindings for the rule. | Rule with external bindings |
Data quality definitions | The data quality definitions that provide the rule expressions. | Simple rule Rule with external bindings |
Governance artifacts | Optional. The business terms and governance rules that users assigned to the asset. | Simple rule Rule with external bindings SQL-based rules |
Relationships | Relationships appear in the Related objects section. Can be between the data quality rule and assets or columns in the same project, or between the rule and an artifact. The followingk relationships are automatically created: • For all rule types, an Is implemented by relationship with the associated DataStage flow after the first run of the rule • For rules with externally managed bindings, an Is implemented by relationship with the associated DataStage subflow • For simple rules, a Validates data quality of relationship with each bound column and with the asset that contains the column. • For simple rules and rules with externally managed bindings, an Implements relationship for each of the referenced data quality definitions Add related objects as required. The Data quality page shows aggregated information for each related asset or column that is listed here with a Validates the data quality of relationship. |
Simple rule Rule with external bindings SQL-based rules |
Selected output | Optional. If configured, the columns of the rule output table. | Simple rule Rule with external bindings SQL-based rules |
Learn more
Parent topic: Managing data quality