Creating rules from data quality definitions
You can create data quality rules from data quality definitions in your project.
You can apply more than one data quality definition to a table or bind the same definition to multiple columns of the same table in a single data quality rule.
At least one data quality definition must exist in the project. See Managing data quality definitions.
To create a data quality rule from data quality definitions:
-
Open a project and click New asset > Measure and monitor data quality.
Alternatively, you can create a rule directly from a data quality definition.
-
Define details:
-
Specify a name for the data quality rule.
-
Optional: Provide a description.
-
Select the data quality dimensions to which this data quality rule contributes. Data quality dimensions describe the data quality metrics for the rule logic in this asset. The selected dimensions can be used as report category, for filtering, or for visualizing selected data.
You can choose between these options:
- Apply all preset dimensions
- The rule contributes to the scores of all dimensions that are set on the data quality definitions used. This is the default setting.
- Apply only this dimension
- The rule contributes only to the score of the selected dimension. Dimension settings on the data quality definitions that are used in this rule are ignored. If you select this option but do not set a data quality dimension, the data quality scores of the rule's individual checks are captured in the None dimension.
-
Optional: Change the type of rule that you want to create to an SQL-based rule. In this case, proceed with the instructions in Creating an SQL-based rule.
-
-
Add at least one data quality definition. If you create the rule from a data quality definition directly, that definition is already preselected. However, you can delete this preselected data quality definition and select different ones.
To add data quality definitions, click Add and select all data quality definitions that you want to use to create multiple checks within the same rule. The Add button is available only when no data quality definition is selected. After you add at least one definition, you can add further definitions by using the plus icon.
For any data quality definition in the Select data quality definition dialog, the configured rule expression is displayed in the side panel to help you pick the most appropriate definition for your purpose.
If you want to apply the same definition to different columns of a table, you can duplicate the selected data quality definition as many times as needed.
Note that a separate entry in the output table is created for each passed or failed check, depending on the output configuration.
-
Configure bindings.
For each data quality definition, bind data to all variables in the rule expression. You can bind column data, literal values, or job parameters to a variable. Depending on the configured bindings, you might need to create joins as described in the next step.
To complete your bindings, you can move between the data quality definitions by using the Previous and Next arrows or the drop-down list. The bindings table shows all variables along with their data type. For each variable, select a binding type and the data to bind the variable to.
When you bind column data to a variable in the rule directly, you can use data from all data assets in the project that come from one of the supported connections. See Supported connectors for curation and data quality. If you want to bind data from a connection that was created with personal credentials, you must unlock the connection first. In addition to data assets from a connection, you can work with data assets from files in CSV format uploaded from the local file system or from file-based connections to the data sources.
However, the bindings might require preprocessing of data or you might want to include additional information in your output table. In this case, enable the Manage bindings externally option and activate DataStage. All existing bindings are removed and a DataStage flow is created. By default, the DataStage flow is named
<rule_name>_DataStage_flow
, but you can change that name. Configure the DataStage flow after you complete the rule configuration. When you create such complex rules and manage bindings externally, you can work with all data assets that come from connections supported by DataStage. See DataStage connectors. To have a data quality rule with external bindings contribute to the data quality score of an asset or a column, add that asset or column as a related item to the corresponding rule. Use the Validates the data quality of relationship type. The same score and issues are reported for all assets and columns that are linked with this relationship type.
In addition to binding a rule variable to a single literal value or column, you can work with project-level job parameters.
You can use the Parameter to literal option to bind your rule variables to literal values that are centrally managed and can be changed at run time. Such parameters usually represent a fact or a specific piece of data. By using a parameter instead of the actual value in a rule, you ensure that the rule always uses the most current value should the value change.
Before you can bind rule variables to job parameters, you must create a reusable DataStage parameter set:
- In your project, click New asset > Define reusable sets of parameters.
- Define parameters with default values or with value sets. For use in rules, you can define parameters of the type date, integer, string, float, time, or timestamp. The types encrypted, list, and path are not supported. See Creating and using parameters and parameter sets.
When you work with value sets, you can change the variable value for each job run. Edit the runtime parameter values and rerun the job.
You can also define columns for use in bindings as project-level job parameters for easier maintenance. Column parameters consist of an asset ID and a column name and are also stored in DataStage parameter sets. You can create a new parameter set or add column parameters to an existing parameter set. In this case, you must create a parameter of the type string and enter the required asset ID and column name manually as the default value. The easier way is to add such a parameter in the binding step when you create a rule.
- As Binding type, select Parameter from column. Then, click Select parameter.
- All available parameter sets are listed. Expand the one that you want to work with.
- To add a parameter, click the plus icon.
- Specify the parameter name. Skip the Prompt field. It is not used for column parameters.
- Select a data asset and a column. Your selection is set as the parameter's default value.
Note that value sets can't be used with column parameters. Also, you can't change column parameters at runtime.
If you update a column parameter that is used in more than one rule, you must rerun each of these data quality rules by opening it and clicking Run rule.
-
Create joins. If your bindings don't require joins, you can move to the next step. But if you want to use data from multiple tables in the output table, you must create joins to these tables. If you manage bindings externally, you can't create joins in the rule configuration. In that case, joins must be defined in the DataStage flow.
If your bindings require joins, the tables are listed. A checkmark in the Join complete column is shown after you set up a join. In the Join keys table, complete these steps for each join that you want to define:
-
Click Add key pair.
-
Click Key 1. Then, select the first item that you want to use in the join.
-
Click Key 2 and select the second item.
-
Select the type of join:
- Inner join
- Records where the selected columns contain equal values are transferred to the output data set.
- Left outer join
- All records for the column selected for key 1 are transferred to the output table. Records for the column selected for key 2 are transferred only where the values match.
- Right outer join
- All records for the column selected for key 2 are transferred to the output table. Records for the column selected for key 1 are transferred only where the values match.
- Full outer join
- All records from both tables are transferred to the output table.
You can change the join type at any time. However, if you want to change your selection for key 1 or key 2, you must delete the existing key pair and create a new one.
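The four join types above can be summarized by which key values survive into the output. The following sketch is an illustration of that semantics with two tiny in-memory tables; it is not part of the product and the table contents are invented for the example.

```python
# Two tiny tables keyed by customer_id; key1_rows is the table whose column
# is selected for Key 1, key2_rows the table for Key 2.
key1_rows = {1: "order-A", 2: "order-B", 3: "order-C"}
key2_rows = {2: "Ann", 3: "Bob", 4: "Eve"}

# Which key values are transferred to the output for each join type:
inner = sorted(set(key1_rows) & set(key2_rows))       # equal values only
left_outer = sorted(set(key1_rows))                   # all Key 1 records
right_outer = sorted(set(key2_rows))                  # all Key 2 records
full_outer = sorted(set(key1_rows) | set(key2_rows))  # all records from both

print(inner, left_outer, right_outer, full_outer)
# [2, 3] [1, 2, 3] [2, 3, 4] [1, 2, 3, 4]
```

For the outer joins, records from the other table are still transferred where the values match, as described above; the sketch only shows which key values appear at all.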
-
-
Optional: Configure sampling.
If you do not want or need to evaluate all rows of a data asset, enable data sampling to generate results based on a fraction of your data.
For regular connected data assets, sampling is done at the data source. In most databases, the order of the records is not deterministic. Therefore, the records included in the sample can vary from run to run, which means that the results and the content of the output table (if configured) can also change over time.
For query-based connected data assets, sampling is not done at the data source but in the Sample stage of the DataStage flow that is associated with the rule.
-
Set the maximum size of the sample. Select the maximum number of records that you want to include in your data sample. The default value is 1,000 records.
-
Select a sampling method:
- Sequential
- The sample includes the first x records of the data asset. Depending on the size of the data asset, the number x can be up to the value that you specified as the maximum allowed sample size. For example, if you have 1,000,000 records and you specify a maximum sample size of 2,000, the sample includes the first 2,000 records.
- Interval
- The sample includes every nth record until the maximum allowed sample size is reached. For example, if you have 1,000,000 records and specify a sample size of 2,000 with an interval of 10, then a maximum of 20,000 records are read (2,000*10) with every 10th record selected to retrieve the sample size of 2,000.
- Random
- The sample includes randomly selected records up to the maximum allowed sample size. The formula used to select records is (100/sample_percent)*sample_size*2. The number 2 is used in the formula to ensure that enough records are read to produce a valid random sample size. For example, if you have 1,000,000 records and you specify a sample size of 2,000 and a percentage of 5, the sample includes 2,000 records. To create the sample, at most 80,000 records ((100/5)*2,000*2=80,000) are read.
- In the Percentage field, specify the percentage that you want to use for creating the sample. Specify a value greater than 0 and up to 100.
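The arithmetic behind the three sampling methods can be sketched as follows. The functions mirror the documented examples and formula; they are illustrations, not the product's internal code.

```python
def sequential_sample(total_records, max_sample_size):
    # Sequential: the first x records, up to the maximum sample size.
    return min(total_records, max_sample_size)

def interval_records_read(max_sample_size, interval):
    # Interval: every nth record is kept, so up to sample_size * interval
    # records are read to fill the sample.
    return max_sample_size * interval

def random_records_read(max_sample_size, sample_percent):
    # Random: documented formula (100 / sample_percent) * sample_size * 2
    # gives the maximum number of records read.
    return int((100 / sample_percent) * max_sample_size * 2)

print(sequential_sample(1_000_000, 2_000))  # 2000
print(interval_records_read(2_000, 10))     # 20000
print(random_records_read(2_000, 5))        # 80000
```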
-
-
Configure output settings and content.
Select whether you want rule output to be written to an external location. If not, only some statistical information is provided in the rule's run history.
You can choose to write the rule output to a database table. If the bindings are managed externally, you also have the option to create up to 4 DataStage output links.
To generate a database table or output links:
-
Enable the External output option and expand the section.
-
Select the type of output you want to generate:
-
To write the output to a database table, select one of these options:
-
Write to a new database table
Select a connection. Depending on the selected connection, select a schema, or select a catalog and a schema. Then, enter the name of the output table to be created.
When you run the rule, this new output table is also added to your project as a data asset.
-
Write to an existing database table
Select a connection. Depending on the selected connection, select a schema and an existing table, or select a catalog, a schema, and an existing table. The Output content section is populated with the columns of this table and you can map content to these columns.
If a corresponding data asset doesn't exist in your project, it is created when you run the rule.
For supported database types, see Supported connectors for curation and data quality. Schema and table names must follow this convention:
- The first character for the name must be an alphabetic character.
- The rest of the name can consist of alphabetic characters, numeric characters, or underscores.
- The name must not contain spaces.
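The naming rules above can be expressed as a single pattern. This is a minimal sketch for pre-checking names before you enter them; the regex is an illustration, not taken from the product.

```python
import re

# First character alphabetic; the rest alphabetic, numeric, or underscores;
# no spaces anywhere.
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def is_valid_output_name(name):
    return bool(VALID_NAME.match(name))

print(is_valid_output_name("dq_output_1"))  # True
print(is_valid_output_name("1_output"))     # False: starts with a digit
print(is_valid_output_name("dq output"))    # False: contains a space
```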
You can access the data asset that corresponds to the rule output table from the Assets page in your project or from the rule's run history.
Configure the following settings:
-
Output records: Select whether you want to include all records in your output, only records that do not meet rule conditions (default setting), or only those records that meet the rule conditions.
-
Maximum number of exception output records: You can include all records or set a maximum number.
-
Update method: New output records can be appended to the existing content of the output table. If you want to keep only the output results from the latest run, select to overwrite existing records.
For the update method Append, the table schema can't be changed, that is, you can't rename, add, or delete columns. If you want to change the output content for a data quality rule and write to an existing output table, make sure to use the update method Overwrite to replace the columns in the output table with the newly defined output columns.
-
-
To create output links if the bindings are managed in a DataStage flow, select DataStage output links.
Configure up to 4 output links. Select which output should be routed to a specific link: all records, only records that do not meet rule conditions, only records that meet the rule conditions, or all violated rule conditions.
Also, define the maximum number of output records to be written per link.
The content of the output records is determined by what you configure in the next step. For violated rule conditions, 0 or more output records can be returned depending on the number of data quality definitions in the rule. Each output record has the following information:
- The record ID. This metric is automatically set as an output column.
- The definition ID of one of the definitions that the input record didn't pass.
- The number that uniquely identifies the failed definition in case of duplicated definitions.
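Put together, one violated-rule-condition output record might look like the following sketch. The field names are placeholders chosen for the example; the actual column names depend on your output content configuration.

```python
# Illustrative shape of one output record for violated rule conditions.
violated_condition_record = {
    "record_id": "rec-0042",      # automatically set as an output column
    "definition_id": "dqd-7f3a",  # a definition the input record didn't pass
    "disambiguator": 0,           # distinguishes duplicated definitions
}
print(sorted(violated_condition_record))
```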
To map the definition ID to a data quality definition in your project, use the Watson Data API.
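A lookup request could be built as in the following sketch. The path shape and query parameter are assumptions for illustration; verify them against the Watson Data API reference for Get data quality definition before use.

```python
def definition_url(host, definition_id, project_id):
    # Assumed endpoint shape -- confirm against the API reference.
    return (f"https://{host}/data_quality/v3/definitions/"
            f"{definition_id}?project_id={project_id}")

# Send a GET request to this URL with an "Authorization: Bearer <token>"
# header; the response describes the matching data quality definition.
print(definition_url("api.example.com", "dqd-7f3a", "proj-1"))
```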
The target nodes of these output links must be configured in the DataStage flow.
You can change the output type at any time. Depending on your new selection, any configured settings are reset or overwritten.
When you're done, collapse the section and proceed to configure the output content.
-
-
Configure the content of your output table.
-
If bindings are managed externally, you can include any additional columns that are provided through the DataStage input link in the output table. Such columns are not listed in the output table configuration. You can't include any variables that are used in rule bindings.
-
Click Add output content and select one of these options:
- Columns: Select the columns that you want to see in your output table. This option is not available if bindings are managed externally.
- Statistics and attributes: Select any additional attributes or statistics that you want to include in your output table:
-
Bound to asset ID: Lists the ID of the data asset to which the rule is bound. If this metric is selected, one output record is written for each data asset in a data quality rule. Thus, more than one output record might be written for an individual input record. The content of the output records for different assets of the same data quality rule varies only for these metrics for each individual input record: Data quality definition, Data quality definition ID, Bound to asset ID, and possibly Passing rules, Failing rules, Percent passing rules, and Percent failing rules.
This metric is not available for rules with externally managed bindings. Also, you can use this metric only in combination with the Data quality definition or Data quality definition ID metric. If the rule is associated with more than one data quality definition, the Disambiguator metric is automatically included in the output.
-
Bound to column: Lists the name of each bound column. If this metric is selected, one output record is written for each column in a data quality definition. Thus, more than one output record might be written for an individual input record. The content of the output records for different columns of the same data quality definition varies only for these metrics for each individual input record: Data quality definition, Data quality definition ID, Bound to column, and possibly Passing rules, Failing rules, Percent passing rules, and Percent failing rules.
This metric is not available for rules with externally managed bindings. Also, you can use this metric only in combination with the Data quality definition or Data quality definition ID metric. If the rule is associated with more than one data quality definition, the Disambiguator metric is automatically included in the output.
-
Data quality definition: Lists the name of the applied data quality definition. If this metric is selected, multiple output records might be written depending on the number of data quality definitions that the rule contains.
-
Data quality definition ID: Contains a unique key that identifies the applied data quality definition. If this metric is selected, multiple output records might be written depending on the number of data quality definitions that the rule contains.
-
Disambiguator: Contains a number that disambiguates the data quality definitions used in the rule, primarily when a data quality definition is used multiple times. Numbering starts at 0.
-
Failing rules: Shows the number of rule conditions that the record didn't meet.
-
Job ID: Contains a unique key that identifies the job that is associated with the DataStage flow for the rule.
-
Job run ID: Contains a unique key that identifies an individual run of the job that is associated with the DataStage flow for the rule.
-
Passing rules: Shows the number of rule conditions that the record met.
-
Percent passing rules: Shows the percentage of rule conditions that were met.
-
Percent failing rules: Shows the percentage of rule conditions that weren't met.
-
Project ID: Contains a unique key that identifies the project that contains the rule.
-
Record ID: Contains a unique key that identifies a record in the output. This metric is automatically included for output links for violated rule conditions.
-
Rule ID: Contains a unique key that identifies the data quality rule.
-
Rule name: Contains the name of the data quality rule.
-
System date: Shows the system date when the rule was run. System date is the date in the time zone set on the server.
-
System time: Shows the system date and time when the rule was run. System date and time is the date and time in the time zone set on the server.
-
- Variables: Select the variables from the rule logic that you want to include in your output table.
- Expressions: Add an expression that defines the content of an output column. You can give this column a descriptive name in the output content overview. You can use block elements to construct your expression. Select and combine elements as required. For more information about using block elements, see Managing data quality definitions. Alternatively, you can use the free form editor to construct the expression. See Building blocks for rule logic or rule output.
-
-
-
Review your configuration. To make sure your rule is configured properly, you can test it before you save it to the project. The output of the rule test is displayed directly and matches what you configured in the output settings.
To make changes to the configuration, click the Edit icon on the tile and update the settings.
When your review is complete, click Create. The rule and its related DataStage flow are added to the project. The default name of the DataStage flow is
DataStage flow of data rule <rulename>
. Do not edit this flow. If your rule is configured with externally managed bindings, the rule, the DataStage flow, and its subflows are added to the project when you click Create. However, your rule isn't ready to run. Before you can run the rule, you must edit the DataStage flow. You can also select Create & edit DataStage flow. In this case, the rule, the DataStage flow, and its subflows are also added to the project, but you are taken directly to the DataStage flow configuration. The naming of such DataStage flows follows the pattern
<rule_name>_DataStage_flow
and
DataStage subflow of data rule <rulename>
. For more information about configuring the flow, see DataStage flows.
If your rule is configured properly without any information missing, it has the status Ready. This status means that the rule can be run. The rule status Not ready indicates that the rule can't be run because some dependencies were modified. For example, the data quality definition was updated or a table that is used in the rule's bindings was removed. The status Not ready is also shown for rules with externally managed bindings if the associated DataStage flow isn't configured. After you configure the flow, you can validate the rule by selecting Validate from the overflow menu. If the validation is successful, the status is set to Ready, and you can run the rule.
A rule might no longer be valid after modifications to the data asset that the rule analyzes. Therefore, you might want to validate the rule before you run it manually.
Learn more
- Data quality assets
- Managing data quality definitions
- Supported connectors for curation and data quality
- Creating and using parameters and parameter sets
- Creating SQL-based rules
- Watson Data API: Create data quality rule
- Watson Data API: Validate data quality rule
- Watson Data API: List all data quality definitions or a subset of them
- Watson Data API: Get data quality definition
Next steps
Parent topic: Managing data quality rules