Creating SQL-based data quality rules
You can create SQL-based data quality rules for assessing the quality of the data in your project. Use such rules for measuring errors rather than for evaluating compliance with specific quality criteria.
To have an SQL-based data quality rule contribute to the data quality scores of an asset or a column, add that asset or column as a related item to the corresponding rule. Use the Validates the data quality of relationship type. The same score and issues are reported for all assets and columns that are linked with this relationship type.
To create an SQL-based data quality rule:
- Open a project, then click New asset > Measure and monitor data quality.
- Define details:
  - Specify a name for the data quality rule.
  - Optional: Provide a description.
  - Optional: Select a data quality dimension to describe the primary data quality metric for the rule logic in this asset. The selected dimension can be used as a report category, for filtering, or for visualizing selected data.
    When you create your rule from a data quality definition, a data quality dimension might already be set. You can keep it, remove the dimension setting, or select a different dimension.
- Enable the Use SQL statements option.
- Specify the source of the data to check by selecting an existing connection or creating a new one. For supported data sources, see Supported data sources for curation and data quality.
  If you select an existing connection that was created with personal credentials, you must unlock the connection.
- Enter your SQL statements.
When you write the query, make sure that the SELECT statement meets the following conditions:
- The statement returns only columns with unique names. Columns with duplicate names will cause validation errors.
- The statement returns the number of records that do not meet your data quality condition. SQL-based rules work differently from rules created from data quality definitions: they report the records that the SELECT statement returns as failed, or Not met. In addition, the total number of records that is reported equals the number of returned records, not the number of checked records.
For example, assume that you have a table db2admin.credit_card with 31 rows and you want to check how many records with card type AMEX are in the table. The difference is as follows:

Data quality rule from a data quality definition
You check for records where the card type is AMEX.
- Expression in the data quality definition: Col = 'AMEX'
- Bound expression in the data quality rule: credit_card.card_type = 'AMEX'
- Sample result: Total: 31 | Met: 4 (12.9%) Not met: 27 (87.1%)

SQL-based data quality rule
You check for records where the card type is not AMEX.
- SELECT statement: select card_type from db2admin.credit_card where card_type <> 'AMEX'
- Sample result: Total: 27 | Met: 0 (0%) Not met: 27 (100%)
Also check the set of Sample SQL statements for data quality rules. These samples demonstrate how you can write SQL rules to return records that do not meet your quality criteria. You can copy the provided statements into your own data quality rules and adjust them as needed.
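The counting difference described above can be illustrated with a small, self-contained sketch. This is not product code: it uses a throwaway SQLite table named credit_card with invented data (4 AMEX rows, 27 VISA rows) to mirror the 31-row example.

```python
# Illustration (not product code): how an SQL-based rule counts results
# compared with a definition-based rule, using a throwaway SQLite table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_card (card_type TEXT)")
# 4 AMEX rows and 27 non-AMEX rows, matching the 31-row example
conn.executemany("INSERT INTO credit_card VALUES (?)",
                 [("AMEX",)] * 4 + [("VISA",)] * 27)

# Definition-based rule: every record is checked against card_type = 'AMEX',
# so the reported total is the number of checked records.
total = conn.execute("SELECT COUNT(*) FROM credit_card").fetchone()[0]
met = conn.execute(
    "SELECT COUNT(*) FROM credit_card WHERE card_type = 'AMEX'").fetchone()[0]
print(f"Definition-based -> Total: {total}, Met: {met}, Not met: {total - met}")

# SQL-based rule: the SELECT statement returns only the failing records,
# so the reported total equals the number of returned rows.
failed = conn.execute(
    "SELECT card_type FROM credit_card WHERE card_type <> 'AMEX'").fetchall()
print(f"SQL-based -> Total: {len(failed)}, Met: 0, Not met: {len(failed)}")
```

Running the sketch prints Total: 31, Met: 4, Not met: 27 for the definition-based rule, and Total: 27, Not met: 27 for the SQL-based rule.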
- Consider these conventions for specifying column, table, and schema names in your SELECT statement:
  - Table and schema names in PostgreSQL data sources are case-sensitive. You might need to enclose the names in double quotation marks, as in this example: "schema"."table_name"
  - Avoid SELECT * queries. Such queries might cause validation errors when column names change. Narrow down the column selection.
  - If the column name does not start with an alphabetic character, or contains characters other than alphabetic characters, numeric characters, or underscores, use an alias for the column name.
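The aliasing convention can be demonstrated with a minimal sketch. The table and column names (orders, "order-id") are invented for illustration; the point is that a column whose name contains a character other than letters, digits, or underscores is exposed under a clean alias.

```python
# Illustration (not product code): aliasing a column whose name contains
# characters other than letters, digits, and underscores.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE orders ("order-id" INTEGER, amount REAL)')
conn.execute("INSERT INTO orders VALUES (1, 9.99)")

# "order-id" contains a hyphen, so expose it under an alias in the SELECT
cur = conn.execute('SELECT "order-id" AS order_id, amount FROM orders')
print([d[0] for d in cur.description])  # column names that the query returns
```

The query returns the column names order_id and amount, which both follow the naming convention.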
At any time, you can test the SQL statements. Note that the test returns only the names of the columns selected by your query. No actual processing is done. A validity check is done when you click Next. You can't proceed unless your query passes this check.
- Configure output settings and content.
Select whether you want rule output to be written to a database. If not, only some statistical information is provided in the rule's run history.
To generate a database table:
- Enable the External output option and expand the section.
Select one of these options:
  - Write to a new database table
Select a connection. Depending on the selected connection, select a schema, or select a catalog and a schema. Then, enter the name of the output table to be created.
When you run the rule, this new output table is also added to your project as a data asset.
  - Write to an existing database table
Select a connection. Depending on the selected connection, select a schema and an existing table, or select a catalog, a schema, and an existing table. The Output content section is populated with the columns of this table and you can map content to these columns.
If a corresponding data asset doesn't exist in your project, it is created when you run the rule.
For supported database types, see Supported data sources for curation and data quality. Schema and table names must follow this convention:
- The first character for the name must be an alphabetic character.
- The rest of the name can consist of alphabetic characters, numeric characters, or underscores.
- The name must not contain spaces.
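The three naming requirements above can be captured in a single pattern. This is a sketch of the stated convention, not the product's own validation logic; the function name and sample names are hypothetical.

```python
# Sketch (assumption: not the product's own validation code) of the naming
# convention for output schema and table names described above.
import re

# First character alphabetic; remaining characters letters, digits,
# or underscores; no spaces.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_output_name(name: str) -> bool:
    return NAME_RE.fullmatch(name) is not None

print(is_valid_output_name("dq_output_2024"))  # True
print(is_valid_output_name("2024_output"))     # False: starts with a digit
print(is_valid_output_name("dq output"))       # False: contains a space
```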
You can access the data asset that corresponds to the rule output table from the Assets page in your project or from the rule's run history.
Your SQL query determines which records are written to the output table, but you can configure the following settings:
  - Maximum number of exception output records: You can include all records or set a maximum number.
  - Update method: New output records can be appended to the existing content of the output table. If you want to keep only the output results from the latest run, select to overwrite existing records.
    For the update method Append, the table schema can't be changed; that is, you can't rename, add, or delete columns. If you want to change the output content for a data quality rule that writes to an existing output table, use the update method Overwrite to replace the columns in the output table with the newly defined output columns.
You can change the output type at any time. Depending on your new selection, any configured settings are reset or overwritten.
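The effect of the two update methods can be sketched with a throwaway SQLite table. This is an illustration of the described behavior, not product code; the table name rule_output and the helper function are invented.

```python
# Illustration (not product code): the difference between the Append and
# Overwrite update methods when writing rule output to an existing table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rule_output (record TEXT)")

def write_output(rows, method):
    if method == "overwrite":  # keep only the latest run's results
        conn.execute("DELETE FROM rule_output")
    conn.executemany("INSERT INTO rule_output VALUES (?)",
                     [(r,) for r in rows])

write_output(["run1-a", "run1-b"], "append")
write_output(["run2-a"], "append")      # rows from both runs accumulate
print(conn.execute("SELECT COUNT(*) FROM rule_output").fetchone()[0])  # 3

write_output(["run3-a"], "overwrite")   # only the latest run remains
print(conn.execute("SELECT COUNT(*) FROM rule_output").fetchone()[0])  # 1
```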
- Configure the content of your output table. By default, all columns selected by the SQL query are included in the output table. You can remove some or all of these columns, and add other content. Click Add output content and select one of these options:
  - Columns: Select the columns that you want to see in your output table. You can select from all columns that the SQL query returns.
  - Statistics and attributes: Select any additional attributes or statistics that you want to include in your output table:
- Record ID: Contains a unique key that identifies a record in the output.
- Rule name: Contains the name of the data quality rule.
- System date: Shows the system date when the rule was run. System date is the date in the time zone set on the server.
- System time: Shows the system date and time when the rule was run. System date and time is the date and time in the time zone set on the server.
- Passing rules: Shows the number of rule conditions that the record met.
- Failing rules: Shows the number of rule conditions that the record didn't meet.
- Percent passing rules: Shows the percentage of rule conditions that were met.
- Percent failing rules: Shows the percentage of rule conditions that weren't met.
- Review your configuration. To make sure that your rule is configured properly, you can test it before you actually save it to the project. The output of the rule test is displayed directly and matches what you configured in the output settings.
To make changes to the configuration, click the Edit icon on the tile and update the settings. When your review is complete, click Create. The rule and its related DataStage flow are added to the project. The default name of the DataStage flow is DataStage flow of data rule <rulename>.
If your rule is configured properly with no information missing, it has the status Ready, which means that the rule can be run. The status Not ready indicates that the rule can't be run because of SQL syntax errors, modified dependencies, or other issues with the rule definition, for example, a changed password for accessing the data source. This status is more likely to occur for data quality rules that were created by using the Watson Data API: Create data quality rule. When you create data quality rules by using the API, make sure to also test and validate the rule.
To confirm that a rule is still valid before you run it manually, you can check the rule status by selecting Validate from the overflow menu.