Creating SQL-based data quality rules
You can create SQL-based data quality rules for assessing the quality of the data in your project. Use such rules for measuring errors rather than for evaluating compliance with specific quality criteria.
To create an SQL-based data quality rule:
-
Open a project, click New asset, and select Data quality rule.
-
Define details:
- Specify a name for the data quality rule.
- Optional: Provide a description.
- Optional: Select a data quality dimension to describe the primary data quality metric for the rule logic in this asset. The selected dimension can be used as report category, for filtering, or for visualizing selected data.
-
Enable the Use SQL statements option.
-
Specify the source of the data to check by selecting an existing connection or creating a new one. For supported data sources, see Supported connectors for data quality rules.
If you select an existing connection and this connection was created with personal credentials, you will need to unlock the connection.
-
Enter your SQL statements.
When you write the query, make sure that the SELECT statement meets the following conditions:
- The statement returns only columns with unique names. Columns with duplicate names will cause validation errors.
- The statement returns the number of records that do not meet your data quality condition. SQL-based rules work in a different way than rules created from data quality definitions. They report the records that the SELECT statement returns
as
failed
orNot met
. Plus, the total number of records that is reported equals the number of returned records, not the number of checked records.
For example, assuming you have a table
db2admin.credit_card
with 31 rows and you want to check how many records with card type AMEX are in the table, the difference is as follows:Data quality rule from a data quality definition
You check for records where the card type is AMEX.
Expression in the data quality definition:Col = 'AMEX'
Bound expression in the data quality rule:credit_card.card_type = 'AMEX'
Sample result:Total: 31 | Met: 4 (12.9%) Not met: 27 (87.1%)
SQL-based data quality rule
You check for records where the card type is not AMEX.
SELECT statement:select card_type from db2admin.credit_card where card_type <> 'AMEX'
Sample result:Total: 27 | Met: 0 (0%) Not met: 27 (100%)
Also check the set of Sample SQL statements for data quality rules. These samples demonstrate how you can write SQL rules to return records that do not meet your quality criteria. You can copy the provided statements into your own data quality rules and adjust them as needed.
Tip: Table and schema names in PostgreSQL data sources are case-sensitive. You might need to enclose the names in double quotation marks like in this example: "schema"."table_name"At any time, you can test the SQL statements. Note that the test returns only the names of the columns selected by your query. No actual processing is done. A validity check is done when you click Next. You can't proceed unless your query passes this check.
-
Configure output settings and content.
Select whether you want rule output to be written to a database. If not, only some statistical information is provided in the rule's run history.
To generate a database table:
-
Enable the External output option and expand the section. Such output table is also added to your project as a data asset with the table name that you define in the Output type setting. You can access it from there or from the rule's run history.
Configure the following settings:
-
Output type: Specify a connection, a schema, and a table. You can select from existing schemas and tables or create a new table in an existing schema. To create a new table for the output, enter a name instead of selecting from the available tables.
For supported database types, see Supported connectors for data quality rules.
-
Output records: For SQL rules, only records that do not meet rule conditions are captured. See the information about defining the SQL statements.
- Maximum number of exception output records: You can include all records or set a maximum number.
- Update method: New output records can be appended to the existing content of the output table. If you want to keep only the output results from latest run, select to overwrite existing records.
When you change the output content for a data quality rule that ran already but still write to the same output table, make sure to use the update method Overwrite to replace the columns in output table with the newly defined output columns.
When you're done, collapse the section and proceed to configure the output content.
-
-
Configure the content of your output table. By default, all columns selected by the SQL query are included in the output table. You can remove selected or all of these columns, and add other content. Click Add output content and select one of these options:
-
Columns: Select the columns that you want to see in your output table. You can select from all columns that the SQL query returns.
-
Statistics and attributes: Select any additional attributes or statistics that you want to include in your output table:
- Record ID: Contains a unique key that identifies a record in the output.
- Rule name: Contains the name of the data quality rule.
- System date: Shows the system date when the rule was run. System date is the date in the time zone set on the server.
- System timestamp: Shows the system date and time when the rule was run. System date and time is the date and time in the time zone set on the server.
- Passing rules: Shows the number of rule conditions that the record met.
- Failing rules: Shows the number of rule conditions that the record didn't meet.
- Percent passing rules: Shows the percentage of rule conditions that were met.
- Percent failing rules: Shows the percentage of rule conditions that weren't met.
-
-
-
Review your configuration. To make sure your rule is configured properly, you can test it before you actually save it to the project. The output of the rule test is directly displayed and matches what you configured in the output settings.
To make changes to the configuration, click the edit (
) icon on the tile and update the settings. When your review is complete, click Create. The rule and its related DataStage flow are added to the project. The default name of the DataStage flow is
DataStage flow of data rule <rulename>
.
If your rule is configured properly without any information missing, it has the status Ready. This status means that the rule can be run. The rule status Not ready indicates that the rule can't be run because of SQL syntax errors, modified dependencies, or other rule definition issues. For example, the password for accessing the data source changed. This status is more likely to be seen for data quality rules that were created by using the Watson Data API: Create data quality rule. When you create data quality rules by using the API, make sure to also test and validate the rule.
Learn more
Next steps
Parent topic: Managing data quality rules