Drop in data consistency in Watson OpenScale drift metrics

As your data changes over time, the ability of your model to make accurate predictions might deteriorate. The Watson OpenScale drop in data consistency metric calculates the percentage of transactions at run time that are significantly different from the transactions in the training data.

How it works

Watson OpenScale analyzes each transaction for data inconsistency by comparing the run-time transactions with the patterns of the transactions in the training data. If a transaction violates one or more of the training data patterns, the transaction is identified as inconsistent. To calculate the drop in data consistency, Watson OpenScale divides the number of transactions that are identified as inconsistent by the total number of transactions. For example, if 10 transactions are identified as inconsistent from a set of 100 transactions, then the drop in data consistency is 10%.
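
The calculation can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only; the function name is hypothetical and is not part of Watson OpenScale.

```python
def drop_in_data_consistency(inconsistent_count: int, total_count: int) -> float:
    """Return the percentage of transactions that are identified as inconsistent."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 <= inconsistent_count <= total_count:
        raise ValueError("inconsistent_count must be between 0 and total_count")
    return 100.0 * inconsistent_count / total_count


# The example from the text: 10 inconsistent transactions out of 100.
print(drop_in_data_consistency(10, 100))  # 10.0
```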

Do the math

To identify data inconsistency, Watson OpenScale generates a schema when you configure drift detection. The schema is stored in a constraints.json file that specifies the rules that your input data must follow. Watson OpenScale uses the schema to evaluate your data for drift by identifying outliers that do not fit within the specified constraints. The schema is a JSON object with columns and constraints arrays that describe the training data, as shown in the following example:

{
    "columns": [
        {
            "name": "CheckingStatus",
            "dtype": "categorical",
            "count": 5000,
            "sparse": false,
            "skip_learning": false
        }
    ],
    "constraints": [
        {
            "name": "categorical_distribution_constraint",
            "id": "f0476d40-d7df-4095-9be5-82564511432c",
            "kind": "single_column",
            "columns": [
                "CheckingStatus"
            ],
            "content": {
                "frequency_distribution": {
                    "0_to_200": 1304,
                    "greater_200": 305,
                    "less_0": 1398,
                    "no_checking": 1993
                }
            }
        }
    ]
}
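
Because the schema is plain JSON, it can be inspected with any JSON parser. The following Python sketch parses a shortened, well-formed copy of the example and lists the constraint types that apply to each column. The variable names are illustrative and are not part of Watson OpenScale.

```python
import json

# A shortened, well-formed copy of the example schema.
schema_text = """
{
    "columns": [
        {"name": "CheckingStatus", "dtype": "categorical", "count": 5000,
         "sparse": false, "skip_learning": false}
    ],
    "constraints": [
        {
            "name": "categorical_distribution_constraint",
            "id": "f0476d40-d7df-4095-9be5-82564511432c",
            "kind": "single_column",
            "columns": ["CheckingStatus"],
            "content": {
                "frequency_distribution": {
                    "0_to_200": 1304, "greater_200": 305,
                    "less_0": 1398, "no_checking": 1993
                }
            }
        }
    ]
}
"""

schema = json.loads(schema_text)

# Group the constraint types by the columns that they refer to.
constraints_by_column = {}
for constraint in schema["constraints"]:
    for column in constraint["columns"]:
        constraints_by_column.setdefault(column, []).append(constraint["name"])

for column, names in constraints_by_column.items():
    print(column, names)
```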

Columns

Watson OpenScale specifies values for the name, dtype, count, sparse, and skip_learning keys to describe a column.

The name and dtype keys describe the label and the data type for a column. The dtype key takes one of the following values:

categorical
numeric_discrete
numeric_continuous

The data type determines whether additional statistical properties, such as min, max, and mean, are described for the column. For example, when the numeric_discrete or the numeric_continuous data type is specified, the properties are described as shown in the following example:

        {
            "name": "LoanDuration",
            "dtype": "numeric_discrete",
            "count": 5000,
            "sparse": false,
            "skip_learning": false,
            "min": 4,
            "max": 53,
            "mean": 21.28820697954272,
            "std": 10.999096037050032,
            "percentiles": [
                13.0,
                21.0,
                29.0
            ],
            "count_actual": 4986
        }

The count key specifies the number of rows for a column. Watson OpenScale specifies Boolean values to describe the sparse and skip_learning keys for a column. The sparse key specifies whether a column is sparse and the skip_learning key specifies whether a column skips learning any of the rules that are described in the schema. A column is sparse if the 25th and 75th percentiles have the same value.
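
The sparse rule can be expressed as a one-line comparison. The following sketch assumes that the three-element percentiles array in the column example holds the 25th, 50th, and 75th percentiles in that order; the text does not state this explicitly, and the function name is hypothetical.

```python
def is_sparse(percentiles):
    """Return True if the 25th and 75th percentiles have the same value.

    Assumes percentiles is [25th, 50th, 75th], which is an interpretation
    of the schema example and is not confirmed by the documentation.
    """
    p25, _p50, p75 = percentiles
    return p25 == p75


# The LoanDuration example: 13.0 and 29.0 differ, so the column is not sparse.
print(is_sparse([13.0, 21.0, 29.0]))  # False
```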

Constraints

The name key specifies the constraint type. The following values are specified to describe the constraint type:

categorical_distribution_constraint
numeric_range_constraint
numeric_distribution_constraint
catnum_range_constraint
catnum_distribution_constraint
catcat_distribution_constraint

The id key identifies constraints with a universally unique identifier (UUID). The kind key specifies whether the constraint is a single_column or two_column constraint.

The columns key specifies an array of column names. When Watson OpenScale specifies a single_column constraint with the kind key, the array contains the name of the column that the constraint describes. When Watson OpenScale specifies a two_column constraint with the kind key, the array contains the names of the two columns that contain related data.

The content key specifies attributes that describe the statistical characteristics of your data. The constraint type that is specified with the name key determines which attributes are specified in the content key, as shown in the following table:

Attribute	Constraints
frequency_distribution	categorical_distribution_constraint
ranges	numeric_range_constraint, catnum_range_constraint
distribution	numeric_distribution_constraint, catnum_distribution_constraint
rare_combinations	catcat_distribution_constraint
source_column	catcat_distribution_constraint, catnum_range_constraint, catnum_distribution_constraint
target_column	catcat_distribution_constraint, catnum_range_constraint, catnum_distribution_constraint
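
The table can be read as a lookup from constraint type to the attributes that are expected in the content key. The following Python sketch restates the table as a dictionary and reports any attribute that a constraint is missing. The helper name is hypothetical; only the attribute and constraint names come from the table.

```python
# Attributes expected in "content" for each constraint type, taken from the table.
EXPECTED_ATTRIBUTES = {
    "categorical_distribution_constraint": {"frequency_distribution"},
    "numeric_range_constraint": {"ranges"},
    "numeric_distribution_constraint": {"distribution"},
    "catnum_range_constraint": {"ranges", "source_column", "target_column"},
    "catnum_distribution_constraint": {"distribution", "source_column", "target_column"},
    "catcat_distribution_constraint": {"rare_combinations", "source_column", "target_column"},
}


def missing_attributes(constraint):
    """Return the sorted attribute names that the table expects but content lacks."""
    expected = EXPECTED_ATTRIBUTES[constraint["name"]]
    return sorted(expected - set(constraint["content"]))


example = {
    "name": "catnum_range_constraint",
    "content": {"source_column": "CheckingStatus", "ranges": {}},
}
print(missing_attributes(example))  # ['target_column']
```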

The following sections provide examples of how each constraint type is specified:

Categorical distribution constraint
Numeric range constraint
Numeric distribution constraint
Categorical-categorical distribution constraint
Categorical-numeric range constraint
Categorical-numeric distribution constraint

Categorical distribution constraint

        {
            "name": "categorical_distribution_constraint",
            "id": "f0476d40-d7df-4095-9be5-82564511432c",
            "kind": "single_column",
            "columns": [
                "CheckingStatus"
            ],
            "content": {
                "frequency_distribution": {
                    "0_to_200": 1304,
                    "greater_200": 305,
                    "less_0": 1398,
                    "no_checking": 1993
                }
            }
        }

In the training data, the CheckingStatus column contains four values that are specified with the frequency_distribution attribute. The frequency_distribution attribute specifies the frequency count for each category, such as 0_to_200. If Watson OpenScale finds records in the payload data that specify values that are not among the frequency_distribution attribute values, the records are identified as drift.
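
As a rough sketch, this check amounts to a membership test against the categories that were seen in the training data. The function name is hypothetical, and the exact rule that Watson OpenScale applies might differ.

```python
frequency_distribution = {
    "0_to_200": 1304,
    "greater_200": 305,
    "less_0": 1398,
    "no_checking": 1993,
}


def violates_categorical_distribution(value, frequency_distribution):
    """Return True if the value is not one of the categories from the training data."""
    return value not in frequency_distribution


print(violates_categorical_distribution("less_0", frequency_distribution))     # False
print(violates_categorical_distribution("overdrawn", frequency_distribution))  # True
```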

Numeric range constraint

        {
            "name": "numeric_range_constraint",
            "id": "79f3a1f5-30a1-4c7f-91a0-1613013ee802",
            "kind": "single_column",
            "columns": [
                "LoanAmount"
            ],
            "content": {
                "ranges": [
                    {
                        "min": 250,
                        "max": 11676,
                        "count": 5000
                    }
                ]
            }
        }

The LoanAmount column contains minimum and maximum values that are specified with the ranges attribute to set a range for the training data. The ranges attribute specifies the high-density regions of the column. Any ranges that rarely occur in the training data aren't included. If Watson OpenScale finds records in the payload data that do not fit within the range plus a pre-defined buffer, the records are identified as drift.
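
The range check can be sketched as follows. The size of the pre-defined buffer is not stated in this topic, so the buffer parameter is a placeholder assumption, and the function name is hypothetical.

```python
ranges = [{"min": 250, "max": 11676, "count": 5000}]


def violates_numeric_range(value, ranges, buffer=0.0):
    """Return True if the value falls outside every learned range.

    buffer widens each range on both sides. The actual buffer that
    Watson OpenScale uses is not documented here, so 0.0 is a placeholder.
    """
    return not any(r["min"] - buffer <= value <= r["max"] + buffer for r in ranges)


print(violates_numeric_range(5000, ranges))               # False
print(violates_numeric_range(20000, ranges))              # True
print(violates_numeric_range(11700, ranges, buffer=50))   # False
```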

Numeric distribution constraint

        {
            "name": "numeric_distribution_constraint",
            "id": "3a97494b-0cd7-483e-a1c6-adb7755c1cb0",
            "kind": "single_column",
            "columns": [
                "LoanAmount"
            ],
            "content": {
                "distribution": {
                    "name": "norm",
                    "parameters": {
                        "loc": 3799.62,
                        "scale": 1920.0640064678398
                    },
                    "p-value": 0.22617155797563282
                }
            }
        }

The distribution attribute specifies the normal distribution that the LoanAmount values follow in the training data. If Watson OpenScale finds records in the payload data that do not fit within the normal distribution, the records are identified as drift. Watson OpenScale tries to fit the training data to a uniform, exponential, or normal distribution. If the training data does not fit any of these distributions, this constraint is not learned.
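
This topic does not describe how Watson OpenScale decides that a record does not fit the distribution. One plausible way to illustrate the idea, offered only as an assumption, is to treat loc and scale as the mean and standard deviation of the normal distribution and to flag values whose two-sided tail probability falls below a threshold. The threshold value and the function names are hypothetical.

```python
from statistics import NormalDist

distribution = {
    "name": "norm",
    "parameters": {"loc": 3799.62, "scale": 1920.0640064678398},
}


def tail_probability(value, distribution):
    """Two-sided tail probability of the value under the learned normal distribution."""
    if distribution["name"] != "norm":
        raise ValueError("this sketch handles only the normal distribution")
    params = distribution["parameters"]
    # Assumption: loc is the mean and scale is the standard deviation.
    cdf = NormalDist(mu=params["loc"], sigma=params["scale"]).cdf(value)
    return 2 * min(cdf, 1 - cdf)


def violates_numeric_distribution(value, distribution, threshold=0.001):
    """Return True if the value is improbable; threshold is a hypothetical cutoff."""
    return tail_probability(value, distribution) < threshold


print(violates_numeric_distribution(3800, distribution))   # False
print(violates_numeric_distribution(50000, distribution))  # True
```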

Categorical-categorical distribution constraint

        {
            "name": "catcat_distribution_constraint",
            "id": "99468600-1924-44d9-852c-1727c9c414ee",
            "kind": "two_column",
            "columns": [
                "CheckingStatus",
                "CreditHistory"
            ],
            "content": {
                "source_column": "CheckingStatus",
                "target_column": "CreditHistory",
                "rare_combinations": [
                    {
                        "source_value": "no_checking",
                        "target_values": [
                            "no_credits"
                        ]
                    }
                ]
            }
        }

For the CheckingStatus and CreditHistory columns, the rare_combinations attribute specifies combinations of values that rarely occur in the training data. If Watson OpenScale finds records in the payload data that contain one of these combinations, the records are identified as drift.
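
The following sketch shows how a record could be compared with the rare_combinations attribute. The function name and the record layout (a dictionary keyed by column name) are assumptions for illustration.

```python
content = {
    "source_column": "CheckingStatus",
    "target_column": "CreditHistory",
    "rare_combinations": [
        {"source_value": "no_checking", "target_values": ["no_credits"]}
    ],
}


def is_rare_combination(record, content):
    """Return True if the record pairs a source value with one of its rare target values."""
    source_value = record[content["source_column"]]
    target_value = record[content["target_column"]]
    return any(
        combo["source_value"] == source_value and target_value in combo["target_values"]
        for combo in content["rare_combinations"]
    )


record = {"CheckingStatus": "no_checking", "CreditHistory": "no_credits"}
print(is_rare_combination(record, content))  # True
```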

Categorical-numeric range constraint

        {
            "name": "catnum_range_constraint",
            "id": "f252033c-1635-4974-8976-3f7904d0c37d",
            "kind": "two_column",
            "columns": [
                "CheckingStatus",
                "LoanAmount"
            ],
            "content": {
                "source_column": "CheckingStatus",
                "target_column": "LoanAmount",
                "ranges": {
                    "no_checking": [
                        {
                            "min": 250,
                            "max": 11676,
                            "count": 1993
                        }
                    ],
                    "less_0": [
                        {
                            "min": 250,
                            "max": 7200,
                            "count": 1398
                        }
                    ],
                    "0_to_200": [
                        {
                            "min": 250,
                            "max": 9076,
                            "count": 1304
                        }
                    ],
                    "greater_200": [
                        {
                            "min": 250,
                            "max": 9772,
                            "count": 305
                        }
                    ]
                }
            }
        }

For each CheckingStatus value, the ranges attribute specifies the minimum and maximum LoanAmount values that set a range for the training data. If Watson OpenScale finds records in the payload data with a LoanAmount value that does not fit within the range for the record's CheckingStatus value plus a pre-defined buffer, the records are identified as drift.
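
A minimal sketch of this two-column check follows, using two of the categories from the example. The buffer value is not documented in this topic and is a placeholder. Treating a category with no learned range as not violating the constraint is also an assumption of this sketch.

```python
content = {
    "source_column": "CheckingStatus",
    "target_column": "LoanAmount",
    "ranges": {
        "no_checking": [{"min": 250, "max": 11676, "count": 1993}],
        "less_0": [{"min": 250, "max": 7200, "count": 1398}],
    },
}


def violates_catnum_range(record, content, buffer=0.0):
    """Return True if the target value is outside the ranges for the record's category."""
    learned = content["ranges"].get(record[content["source_column"]])
    if learned is None:
        # Assumption: no range was learned for this category, so nothing is checked.
        return False
    value = record[content["target_column"]]
    return not any(r["min"] - buffer <= value <= r["max"] + buffer for r in learned)


# 8000 is within the no_checking range but above the less_0 maximum of 7200.
print(violates_catnum_range({"CheckingStatus": "no_checking", "LoanAmount": 8000}, content))  # False
print(violates_catnum_range({"CheckingStatus": "less_0", "LoanAmount": 8000}, content))       # True
```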

Categorical-numeric distribution constraint

        {
            "name": "catnum_distribution_constraint",
            "id": "3a97494b-0cd7-483e-a1c6-adb7755c1cb0",
            "kind": "two_column",
            "columns": [
                "CheckingStatus",
                "LoanAmount"
            ],
            "content": {
                "source_column": "CheckingStatus",
                "target_column": "LoanAmount",
                "distribution": {
                    "greater_200": {
                        "name": "norm",
                        "parameters": {
                            "loc": 3799.62,
                            "scale": 1920.0640064678398
                        },
                        "p-value": 0.22617155797563282
                    }
                }
            }
        }

For a CheckingStatus value, the distribution attribute specifies the distribution that the LoanAmount values follow in the training data. In this example, a normal distribution is specified for the greater_200 value. If Watson OpenScale finds records in the payload data with a LoanAmount value that does not fit within the normal distribution for the record's CheckingStatus value, the records are identified as drift.
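
In the example, a distribution is listed only for the greater_200 category, so a sketch of this check first looks up the distribution for the record's category. As with the single-column case, the fit test that Watson OpenScale applies is not described here. The tail-probability test, the threshold, and the function name below are assumptions.

```python
from statistics import NormalDist

content = {
    "source_column": "CheckingStatus",
    "target_column": "LoanAmount",
    "distribution": {
        "greater_200": {
            "name": "norm",
            "parameters": {"loc": 3799.62, "scale": 1920.0640064678398},
        }
    },
}


def violates_catnum_distribution(record, content, threshold=0.001):
    """Return True if the target value is improbable for the record's category."""
    learned = content["distribution"].get(record[content["source_column"]])
    if learned is None or learned["name"] != "norm":
        # Assumption: categories without a learned normal distribution are not checked.
        return False
    params = learned["parameters"]
    value = record[content["target_column"]]
    cdf = NormalDist(mu=params["loc"], sigma=params["scale"]).cdf(value)
    # Hypothetical rule: flag values with a small two-sided tail probability.
    return 2 * min(cdf, 1 - cdf) < threshold


print(violates_catnum_distribution({"CheckingStatus": "greater_200", "LoanAmount": 50000}, content))  # True
print(violates_catnum_distribution({"CheckingStatus": "less_0", "LoanAmount": 50000}, content))       # False
```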

Note:

To mitigate drift after it is detected by Watson OpenScale, you must build a new version of the model that fixes the problem. A good place to start is with the data points that are highlighted as reasons for the drift. Manually label the drifted transactions, and then use them to retrain the model.

Learn more

Reviewing drift results
