Assessing data quality | IBM Cloud Pak for Data as a Service

Assessing data quality

To determine whether your data is of good quality, check in how far the data meets your expectations and identify anomalies in the data. Evaluating your data for quality also helps you to understand the structure and content of your data.

Run data quality rules to evaluate data based on the defined conditions. The type of rule determines where the data can come from.

Rules that are created from data quality definitions

You can run complex rules with externally managed bindings on data assets from any connector that is supported by DataStage. See DataStage connectors.

For simple rules where you bind the data directly, the connections listed in Supported connectors are supported.

In addition, you can work with data assets from files in CSV format uploaded from the local file system or from file-based connections to the data sources.
SQL-based rules

For supported database types, see Supported connectors.

To have a data quality rule with external bindings or an SQL-based data quality rule contribute to the data quality scores of an asset or a column, add that asset or column as a related item to the corresponding rule. Use the Validates the data quality of relationship type.

Required services

IBM Knowledge Catalog
DataStage or DataStage as a Service Anywhere
With DataStage, you can run data quality rules in the supported regions. With DataStage as a Service Anywhere, you can run data quality rules outside of IBM Cloud by using remote engines. For more information about setting up remote engines, see the DataStage as a Service Anywhere documentation.

Required permissions

To run data quality rules, you must have the Admin or the Editor role in the project. In addition, you must have the Execute data quality rules user permission Also, you must be authorized to access the connections to the data sources of the data assets to be checked.

To view the data that caused data quality issues (the output table) from the rule run history or the Data quality page, you must have the Drill down to issue details user permission. However, the data asset in the project that is created for the output table is accessible by anyone who can access the connection. To limit access to this data asset, the connection to the data source where the output table is stored should be set up with personal credentials.

You can also complete the following tasks with APIs instead of the user interface. The links to these APIs are listed in the Learn more section.

Running data quality rules

Running a data quality rule requires a DataStage flow and subsequently a DataStage job. The job with default job settings is created automatically when you run the rule for the first time from within the asset. A DataStage job with the default name DataStage flow of data rule <rulename>.DataStage job is added to the project.

After the initial run, you can modify the job settings as required, for example, to set up scheduled runs. Or, you might want to adjust the number of warnings that are acceptable before the job ends, which is 100 by default. To change the job settings, go to the job's details page and click the pencil icon on the toolbar. You can get to the job's details page by clicking the job name in the rule's run history or on the project's Jobs page.

You can also create additional DataStage jobs for your rule manually, either from the rule's overflow menu in the project or, when you open the asset, from the overflow menu next to the asset name. See Creating jobs for running data quality rules.

To confirm that a rule is still valid before you run it manually, you can check the rule status by selecting Validate from the overflow menu.

You can run a rule in one of these ways:

Open the data quality rule and click Run rule. Use this option for the initial run of the rule to create the associated DataStage job.
Select Run from the rule overflow menu in the project.
Go to the project's Jobs page, open the job details, and run the job by clicking the Run icon from the action bar.

You can also automate quality checks by setting up jobs with a repeating schedule for running a rule.

Rules are run with IBM Cloud credentials. Typically, your personal IBM Cloud API key is used to execute such long-running operations without disruption. If credentials are not available when you create the job, you are prompted to create an API key. That API key is then saved as your task credentials.

Grouping rules

You can group certain data quality rules in a single DataStage flow for execution:

The data quality rules must be created from data quality definitions.
The rule variables must be bound to a single data asset in the project:
- A single file from one of these file-storage connectors: Amazon S3, Apache HDFS, Azure Data Lake Storage, or Google Cloud Storage
- A file that was uploaded from the local file system
- A single relational data asset

Depending on the configuration of the individual data quality rules that you group, running the rules might require multiple passes over the data.

You cannot group data quality rules that are bound to multiple data assets.

You can use the following API call to group rules for execution:

POST /data_quality/v3/projects/{project_id}/execute_rules

This API call requires the following parameters:

project_id

The ID of the project that contains the rules

Request body

The payload in the following format:

{
  "rules": [
    {
      "id": "<rule1_id>"
    },
    {
      "id": "<rule2_id>"
    }
  ]
}

Pushdown of processing in data quality rules

Certain aspects of data quality rule processing can be pushed down to the data source to reduce the amount of data that is transferred out of the source and to speed up processing. Column selection, creation of joins between different data assets, and sampling are pushed down to data sources with a Relational Database Management System (RDBMS), which means that they support SQL queries. for file-based data sources, no processing is pushed down. SQL-based data quality rules are always run at the data source.

With DataStage as a Service Anywhere, you can run data quality rules outside of IBM Cloud by using remote engines. For more information about setting up remote engines, see the DataStage as a Service Anywhere documentation.

Column selection

For RDBMS data sources, an SQL SELECT statement like select colA, colB from schema1.table1 is run on the data source to retrieve only the required columns from a table. Such queries can't be run on data assets from file-storage connections. For such files, all columns are retrieved and the DataStage Modify stage is used to filter the columns.

Joins

If a data quality rule has two or more variables that are bound to multiple data assets, these data assets must be joined on certain columns.

For RDBMS data sources, an SQL SELECT statement with a JOIN clause like SELECT col1, col2 FROM schema1.table1 INNER JOIN schema1.table2 ON table1.id = table2.id is run on the data source. With this query, the join of the data assets is done at the data source. Some RDBMS data sources do not support JOIN processing at all or support only certain types of JOIN clauses. For example, Google BigQuery doesn't support JOIN clauses at all.

Results of data quality rules with joins also depend on how values like null and empty strings are handled by the RDBMS data source during JOIN processing.

For a data asset from a file-storage connection, all records from each individual data asset are retrieved and the DataStage Join stage is used to join the data assets.

Sampling

For RDBMS data sources, random and sequential sampling are done at the data source. For sequential sampling, an RDBMS-specific clause for selecting the records is added to the SQL statement, for example, FETCH FIRST or LIMIT.

For a data asset from a file-storage connection, all records are retrieved and the DataStage Sample stage is used to create the sample.

Checking the run history

Each time you run a data rule, a run record is created. These run records are listed in the run history of a rule so that you can see how results changed with each run. To view the run records, open the data quality rule and go to the Run history tab. Each run record provides this information:

The start time of the rule run as a hyperlink. Click the link to access the job run retails.
The name of the corresponding DataStage job as a hyperlink. Click the link to access the job details.
The status of the run.
For rules that were created from data quality definitions:
- The number of records that were tested.
- The number of records and the percentage of tested records that met the rule.
- The number of records and the percentage of tested records that didn't meet the rule.
For SQL-based rules:
- The number of records returned by the select statement in the Rule not met column.

All run records are saved and stored until you delete them. Consider cleaning up the run history on a regular basis to save space. You can delete selected run records or all run records at once. When you delete a run record, the corresponding job run details are also deleted.

Checking the rule output table

If an output table is defined for the rule, rule output is written to a database table as configured. See the step for configuring output settings in Creating rules from data quality definitions or Creating SQL-based rules.

The output table is also added to the project as a data asset. You can access the output table in one of these ways:

Go to the rule's run history and click View output table. You can download the rule output as a CSV file, for example, for use in a spreadsheet program if you want to search or filter output that contains a large number of records. The output page also provides a link to the corresponding data asset in the project.
Open the output table in the project. Search for a data asset with the same name as the output table defined in the rule.
Access the table in the database by using native database queries.

Learn more

Parent topic: Managing data quality