Assessing data quality
To determine whether your data is of good quality, check how well the data meets your expectations and identify anomalies in it. Evaluating data quality also helps you understand the structure and content of your data.
Run data quality rules to evaluate data based on the defined conditions. The type of rule determines where the data can come from.
Rules that are created from data quality definitions
You can run complex rules with externally managed bindings on data assets from any connector that is supported by DataStage. See DataStage connectors.
For simple rules where you bind the data directly, the connections listed in Supported connectors for data quality rules are supported.
In addition, you can work with data assets from files in CSV format uploaded from the local file system or from file-based connections to the data sources.
SQL-based rules
For supported database types, see Supported connectors for data quality rules.
Required permissions
To run data quality rules, you must have the Admin or the Editor role in the project. You must also be authorized to access the connections to the data sources of the data assets that you want to check.
You can also complete the following tasks with APIs instead of the user interface. The links to these APIs are listed in the Learn more section.
Running data quality rules
Running a data quality rule requires a DataStage flow and, in turn, a DataStage job. The job is created automatically with default settings when you run the rule for the first time from within the asset. It is added to the project under the default name DataStage flow of data rule <rulename>.DataStage job.
After the initial run, you can modify the job settings as required, for example, to set up scheduled runs. Or, you might want to adjust the number of warnings that are acceptable before the job ends, which is 100 by default. To change the job settings, go to the job's details page and click the pencil icon on the toolbar. You can get to the job's details page by clicking the job name in the rule's run history or on the project's Jobs page.
You can also create additional DataStage jobs for your rule manually, either from the rule's overflow menu in the project or, when you open the asset, from the overflow menu next to the asset name. See Creating jobs for running data quality rules.
You can run a rule in one of these ways:
- Open the data quality rule and click Run rule. Use this option for the initial run of the rule to create the associated DataStage job.
- Go to the project's Jobs tab, open the job details, and run the job from the action bar.
You can also automate quality checks by setting up jobs with a repeating schedule for running a rule.
Rules are run with IBM Cloud credentials. Typically, your personal IBM Cloud API key is used to execute such long-running operations without disruption. If credentials are not available when you create the job, you are prompted to create an API key. That API key is then saved as your task credentials.
Checking the run history
Each time you run a data rule, a run record is created. These run records are listed in the run history of a rule so that you can see how results changed with each run. To view the run records, open the data quality rule and go to the Run history tab. Each run record provides this information:
- The start time of the rule run as a hyperlink. Click the link to access the job run details.
- The name of the corresponding DataStage job as a hyperlink. Click the link to access the job details.
- The status of the run.
- For rules that were created from data quality definitions:
  - The number of records that were tested.
  - The number of records and the percentage of tested records that met the rule.
  - The number of records and the percentage of tested records that didn't meet the rule.
- For SQL-based rules:
  - The number of records that the SELECT statement returned, shown in the Rule not met column.
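To make the run record figures for rules from data quality definitions concrete, the following minimal Python sketch shows how the tested, met, and not-met counts and their percentages relate to each other. It does not use the product's implementation or APIs. The names `RunSummary` and `summarize_run`, the sample records, and the age condition are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Mapping


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical container for the figures shown in a run record."""
    tested: int
    met: int
    not_met: int

    @property
    def met_pct(self) -> float:
        # Percentage of tested records that met the rule.
        return 100.0 * self.met / self.tested if self.tested else 0.0

    @property
    def not_met_pct(self) -> float:
        # Percentage of tested records that did not meet the rule.
        return 100.0 * self.not_met / self.tested if self.tested else 0.0


def summarize_run(
    records: Iterable[Mapping[str, Any]],
    condition: Callable[[Mapping[str, Any]], bool],
) -> RunSummary:
    """Evaluate a condition against every record and count the outcomes."""
    tested = 0
    met = 0
    for record in records:
        tested += 1
        if condition(record):
            met += 1
    return RunSummary(tested=tested, met=met, not_met=tested - met)


# Illustrative data and condition: age must be between 0 and 120.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 3, "age": 51},
    {"id": 4, "age": 130},
]
summary = summarize_run(records, lambda r: 0 <= r["age"] <= 120)
print(summary.tested, summary.met, summary.not_met)
print(f"{summary.met_pct:.1f}% met, {summary.not_met_pct:.1f}% not met")
```

In this sketch, the met and not-met counts always add up to the number of tested records, which is a convenient sanity check when you compare run records over time.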
Checking the rule output table
If an output table is defined for the rule, rule output is written to a database table as configured. See the step for configuring output settings in Creating rules from data quality definitions or Creating SQL-based rules.
The output table is also added to the project as a data asset. You can access the output table in one of these ways:
- Go to the rule's run history and click View output table. You can download the rule output as a CSV file, for example, for use in a spreadsheet program if you want to search or filter output that contains a large number of records. The output page also provides a link to the corresponding data asset in the project.
- Open the output table in the project. Search for a data asset with the same name as the output table defined in the rule.
- Access the table in the database by using native database queries.
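If you download the rule output as a CSV file, you can also filter it with a short script instead of a spreadsheet program. The following Python sketch uses only the standard library. The column names (`customer_id`, `country`, `email`), the sample rows, and the helper `filter_rows` are assumptions for illustration; the actual columns depend on the output settings that are configured for your rule.

```python
import csv
import io
from typing import Dict, List


def filter_rows(csv_text: str, column: str, value: str) -> List[Dict[str, str]]:
    """Return the rows whose given column equals the given value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row.get(column) == value]


# Hypothetical rule output; in practice, read the downloaded file instead,
# for example: csv_text = open("rule_output.csv", newline="").read()
sample = (
    "customer_id,country,email\n"
    "101,DE,\n"
    "102,US,a@example.com\n"
    "103,DE,b@example.com\n"
)

german_rows = filter_rows(sample, "country", "DE")
for row in german_rows:
    print(row["customer_id"], row["email"] or "<missing>")
```

For very large outputs, querying the output table directly in the database is likely to be more efficient than loading the whole CSV file into memory.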
Learn more
- Creating jobs for running data quality rules
- Creating rules from data quality definitions
- Creating SQL-based rules
- Watson Data API: Run data quality rule
- Watson Data API: List history of all data quality rule run results or a subset of them
- Watson Data API: Get the data quality rule run
Parent topic: Managing data quality