Drop in data consistency evaluation metric
The drop in data consistency metric compares run time transactions with the patterns of transactions in the training data to identify inconsistency.
Metric details
Drop in data consistency is a drift evaluation metric that can help determine how well your model predicts outcomes over time.
Scope
The drop in data consistency metric evaluates machine learning models only.
- Types of AI assets: Machine learning models
- Machine learning problem type:
- Binary classification
- Multiclass classification
Scores and values
The drop in data consistency metric score indicates the proportion of transactions that are inconsistent because they violate training data patterns.
Range of values: 0.0-1.0
Evaluation process
Each transaction is analyzed for data inconsistency by comparing the run-time transactions with the patterns of the transactions in the training data. If a transaction violates one or more of the training data patterns, the transaction is identified as inconsistent. To calculate the drop in data consistency, the number of transactions that are identified as inconsistent is divided by the total number of transactions. For example, if 10 transactions are identified as inconsistent from a set of 100 transactions, then the drop in data consistency is 10%.
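The calculation can be sketched in a few lines of Python. The function name `drop_in_data_consistency` is hypothetical and is used here only for illustration; it is not part of any product API.

```python
def drop_in_data_consistency(inconsistent_count: int, total_count: int) -> float:
    """Return the fraction of transactions flagged as inconsistent (0.0 to 1.0)."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 <= inconsistent_count <= total_count:
        raise ValueError("inconsistent_count must be between 0 and total_count")
    return inconsistent_count / total_count


# The example from the text: 10 inconsistent transactions out of 100.
score = drop_in_data_consistency(10, 100)
print(f"{score:.0%}")  # prints 10%
```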
To identify data inconsistency, a schema is generated when you configure drift detection by creating a `constraints.json` file to specify the rules that your input data must follow. The schema is used to evaluate your data for drift by identifying outliers that do not fit within the constraints that are specified. The schema is specified as a JSON object with `columns` and `constraints` arrays that describe the training data as shown in the following example:
{
"columns": [
{
"name": "CheckingStatus",
"dtype": "categorical",
"count": 5000,
"sparse": false,
"skip_learning": false
}
],
"constraints": [
{
"name": "categorical_distribution_constraint",
"id": "f0476d40-d7df-4095-9be5-82564511432c",
"kind": "single_column",
"columns": [
"CheckingStatus"
],
"content": {
"frequency_distribution": {
"0_to_200": 1304,
"greater_200": 305,
"less_0": 1398,
"no_checking": 1993
}
}
}
]
}
Values are specified for the `name`, `dtype`, `count`, `sparse`, and `skip_learning` keys to describe a column.
The `name` and `dtype` keys describe the label and the data type for a column. The following values that are specified with the `dtype` key describe the data type:
- `categorical`
- `numeric_discrete`
- `numeric_continuous`
The data type that is specified determines if more statistical properties are described with keys, such as `min`, `max`, and `mean`. For example, when the `numeric_discrete` or the `numeric_continuous` data type is specified, properties are described as shown in the following example:
{
"name": "LoanDuration",
"dtype": "numeric_discrete",
"count": 5000,
"sparse": false,
"skip_learning": false,
"min": 4,
"max": 53,
"mean": 21.28820697954272,
"std": 10.999096037050032,
"percentiles": [
13.0,
21.0,
29.0
],
"count_actual": 4986
}
The `count` key specifies the number of rows for a column. Boolean values are specified to describe the `sparse` and `skip_learning` keys for a column. The `sparse` key specifies whether a column is sparse and the `skip_learning` key specifies whether a column skips learning any of the rules that are described in the schema. A column is sparse if the 25th and 75th percentiles have the same value.
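The sparsity rule can be illustrated with the Python standard library. This is a minimal sketch: the helper name `is_sparse` is hypothetical, and `statistics.quantiles` with its default method is an assumption, because the source does not state how the percentiles are computed.

```python
from statistics import quantiles


def is_sparse(values: list[float]) -> bool:
    """Return True when the 25th and 75th percentiles of the values are equal."""
    if len(values) < 2:
        raise ValueError("at least two values are needed to compute quartiles")
    # quantiles(..., n=4) returns the three quartile cut points [Q1, Q2, Q3].
    q1, _, q3 = quantiles(values, n=4)
    return q1 == q3


mostly_zero = [0] * 90 + list(range(1, 11))  # 90% of the rows hold the same value
spread_out = list(range(100))

print(is_sparse(mostly_zero))  # True
print(is_sparse(spread_out))   # False
```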
The `name` key specifies the constraint type. The following values are specified to describe the constraint type:
- `categorical_distribution_constraint`
- `numeric_range_constraint`
- `numeric_distribution_constraint`
- `catnum_range_constraint`
- `catnum_distribution_constraint`
- `catcat_distribution_constraint`
The `id` key identifies constraints with a universally unique identifier (UUID). The `kind` key specifies whether the constraint is a `single_column` or `two_column` constraint.
The `columns` key specifies an array of column names. When a `single_column` constraint is specified with the `kind` key, the array contains a value that correlates with the column that you want to describe.
When a `two_column` constraint is specified with the `kind` key, the array contains values that correlate with columns that contain related data.
The `content` key specifies attributes that describe the statistical characteristics of your data. The constraint type that is specified with the `name` key determines which attribute is specified in the `content` key as shown in the following table:
Attribute | Constraints |
---|---|
frequency_distribution | categorical_distribution_constraint |
ranges | numeric_range_constraint, catnum_range_constraint |
distribution | numeric_distribution_constraint, catnum_distribution_constraint |
rare_combinations | catcat_distribution_constraint |
source_column | catcat_distribution_constraint, catnum_range_constraint, catnum_distribution_constraint |
target_column | catcat_distribution_constraint, catnum_range_constraint, catnum_distribution_constraint |
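The table can be read as a lookup from constraint type to the attributes expected under the `content` key. The following sketch encodes that lookup and reports any missing attributes; the names `EXPECTED_CONTENT_KEYS` and `missing_content_keys` are hypothetical and are not part of the schema or of any product API.

```python
# Attributes expected in "content" for each constraint type, taken from the table above.
EXPECTED_CONTENT_KEYS = {
    "categorical_distribution_constraint": {"frequency_distribution"},
    "numeric_range_constraint": {"ranges"},
    "numeric_distribution_constraint": {"distribution"},
    "catnum_range_constraint": {"source_column", "target_column", "ranges"},
    "catnum_distribution_constraint": {"source_column", "target_column", "distribution"},
    "catcat_distribution_constraint": {"source_column", "target_column", "rare_combinations"},
}


def missing_content_keys(constraint: dict) -> set[str]:
    """Return the expected content attributes that the constraint does not define."""
    expected = EXPECTED_CONTENT_KEYS[constraint["name"]]
    return expected - set(constraint.get("content", {}))


constraint = {
    "name": "catcat_distribution_constraint",
    "kind": "two_column",
    "columns": ["CheckingStatus", "CreditHistory"],
    "content": {"source_column": "CheckingStatus", "target_column": "CreditHistory"},
}
print(missing_content_keys(constraint))  # {'rare_combinations'}
```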
The following sections provide examples of how each constraint type is specified:
- Categorical distribution constraint
- Numeric range constraint
- Numeric distribution constraint
- Categorical-categorical distribution constraint
- Categorical-numeric range constraint
- Categorical-numeric distribution constraint
Categorical distribution constraint
{
"name": "categorical_distribution_constraint",
"id": "f0476d40-d7df-4095-9be5-82564511432c",
"kind": "single_column",
"columns": [
"CheckingStatus"
],
"content": {
"frequency_distribution": {
"0_to_200": 1304,
"greater_200": 305,
"less_0": 1398,
"no_checking": 1993
}
}
}
In the training data, the `CheckingStatus` column contains four values that are specified with the `frequency_distribution` attribute. The `frequency_distribution` attribute specifies the frequency counts with values for categories, such as `0_to_200`. If records are found in the payload data that specify values that are different from the `frequency_distribution` attribute values, the records are identified as drift.
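As a rough illustration of this check, the following sketch flags a payload record whose category was never seen in the training data. The function name `violates_categorical_distribution` is hypothetical, and treating a missing column value as a violation is an assumption.

```python
def violates_categorical_distribution(record: dict, constraint: dict) -> bool:
    """Return True when the record's category is absent from frequency_distribution."""
    column = constraint["columns"][0]
    known_categories = constraint["content"]["frequency_distribution"]
    # Assumption: a record without a value for the column also counts as a violation.
    return record.get(column) not in known_categories


constraint = {
    "name": "categorical_distribution_constraint",
    "kind": "single_column",
    "columns": ["CheckingStatus"],
    "content": {
        "frequency_distribution": {
            "0_to_200": 1304,
            "greater_200": 305,
            "less_0": 1398,
            "no_checking": 1993,
        }
    },
}

print(violates_categorical_distribution({"CheckingStatus": "less_0"}, constraint))   # False
print(violates_categorical_distribution({"CheckingStatus": "unknown"}, constraint))  # True
```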
Numeric range constraint
{
"name": "numeric_range_constraint",
"id": "79f3a1f5-30a1-4c7f-91a0-1613013ee802",
"kind": "single_column",
"columns": [
"LoanAmount"
],
"content": {
"ranges": [
{
"min": 250,
"max": 11676,
"count": 5000
}
]
}
}
The `LoanAmount` column contains minimum and maximum values that are specified with the `ranges` attribute to set a range for the training data. The `ranges` attribute specifies the high-density regions of the column. Any ranges that rarely occur in the training data aren't included. If records are found in the payload data that do not fit within the range and a pre-defined buffer, the records are identified as drift.
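The range check can be sketched as follows. The function name `violates_numeric_range` and the `buffer` parameter are hypothetical: the source mentions a pre-defined buffer but does not state its size or how it is applied, so this sketch assumes an absolute margin added to both ends of each range.

```python
def violates_numeric_range(record: dict, constraint: dict, buffer: float = 0.0) -> bool:
    """Return True when the value falls outside every learned range, widened by buffer."""
    column = constraint["columns"][0]
    value = record[column]
    for learned in constraint["content"]["ranges"]:
        # Assumption: the buffer is an absolute margin on both sides of the range.
        if learned["min"] - buffer <= value <= learned["max"] + buffer:
            return False
    return True


constraint = {
    "name": "numeric_range_constraint",
    "kind": "single_column",
    "columns": ["LoanAmount"],
    "content": {"ranges": [{"min": 250, "max": 11676, "count": 5000}]},
}

print(violates_numeric_range({"LoanAmount": 5000}, constraint))               # False
print(violates_numeric_range({"LoanAmount": 12000}, constraint))              # True
print(violates_numeric_range({"LoanAmount": 12000}, constraint, buffer=500))  # False
```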
Numeric distribution constraint
{
"name": "numeric_distribution_constraint",
"id": "3a97494b-0cd7-483e-a1c6-adb7755c1cb0",
"kind": "single_column",
"columns": [
"LoanAmount"
],
"content": {
"distribution": {
"name": "norm",
"parameters": {
"loc": 3799.62,
"scale": 1920.0640064678398
},
"p-value": 0.22617155797563282
}
}
}
The `LoanAmount` column contains values that are specified with the `distribution` attribute to set a normal distribution for the training data. If records are found in the payload data that do not fit within the normal distribution, the records are identified as drift. The constraint can be fitted to a uniform, exponential, or normal distribution. If the training data does not fit any of these distributions, the constraint is not learned.
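The source does not say how a record is judged to fit the distribution, so the following is only one possible reading: a value is flagged when its two-sided tail probability under the fitted normal distribution falls below a threshold. The function name `violates_normal_distribution` and the `alpha` threshold are hypothetical, and `loc` and `scale` are assumed to be the mean and standard deviation.

```python
from statistics import NormalDist


def violates_normal_distribution(value: float, loc: float, scale: float,
                                 alpha: float = 0.001) -> bool:
    """Return True when the value lies in the extreme tails of N(loc, scale)."""
    dist = NormalDist(mu=loc, sigma=scale)
    cdf = dist.cdf(value)
    # Two-sided tail probability: how unlikely a value at least this extreme is.
    tail_probability = 2 * min(cdf, 1 - cdf)
    return tail_probability < alpha


# Parameters copied from the example constraint above.
loc, scale = 3799.62, 1920.0640064678398

print(violates_normal_distribution(4000, loc, scale))   # False
print(violates_normal_distribution(20000, loc, scale))  # True
```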
Categorical-categorical distribution constraint
{
"name": "catcat_distribution_constraint",
"id": "99468600-1924-44d9-852c-1727c9c414ee",
"kind": "two_column",
"columns": [
"CheckingStatus",
"CreditHistory"
],
"content": {
"source_column": "CheckingStatus",
"target_column": "CreditHistory",
"rare_combinations": [
{
"source_value": "no_checking",
"target_values": [
"no_credits"
]
}
]
}
}
For the `CheckingStatus` and `CreditHistory` columns, the `rare_combinations` attribute specifies a combination of values that rarely occur in the training data. If records are found in the payload data that contain the combination, the records are identified as drift.
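A minimal sketch of this check, assuming a record is a plain dictionary keyed by column name. The function name `violates_rare_combinations` is hypothetical.

```python
def violates_rare_combinations(record: dict, constraint: dict) -> bool:
    """Return True when the record contains a source/target pair listed as rare."""
    content = constraint["content"]
    source_value = record.get(content["source_column"])
    target_value = record.get(content["target_column"])
    for combination in content["rare_combinations"]:
        if (combination["source_value"] == source_value
                and target_value in combination["target_values"]):
            return True
    return False


constraint = {
    "name": "catcat_distribution_constraint",
    "kind": "two_column",
    "columns": ["CheckingStatus", "CreditHistory"],
    "content": {
        "source_column": "CheckingStatus",
        "target_column": "CreditHistory",
        "rare_combinations": [
            {"source_value": "no_checking", "target_values": ["no_credits"]}
        ],
    },
}

rare = {"CheckingStatus": "no_checking", "CreditHistory": "no_credits"}
common = {"CheckingStatus": "less_0", "CreditHistory": "no_credits"}
print(violates_rare_combinations(rare, constraint))    # True
print(violates_rare_combinations(common, constraint))  # False
```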
Categorical-numeric range constraint
{
"name": "catnum_range_constraint",
"id": "f252033c-1635-4974-8976-3f7904d0c37d",
"kind": "two_column",
"columns": [
"CheckingStatus",
"LoanAmount"
],
"content": {
"source_column": "CheckingStatus",
"target_column": "LoanAmount",
"ranges": {
"no_checking": [
{
"min": 250,
"max": 11676,
"count": 1993
}
],
"less_0": [
{
"min": 250,
"max": 7200,
"count": 1398
}
],
"0_to_200": [
{
"min": 250,
"max": 9076,
"count": 1304
}
],
"greater_200": [
{
"min": 250,
"max": 9772,
"count": 305
}
]
}
}
}
The `ranges` attribute specifies minimum and maximum values for the `CheckingStatus` and `LoanAmount` columns that set a range for the training data. If records are found in the payload data with `LoanAmount` and `CheckingStatus` column values that do not fit within the range and a pre-defined buffer, the records are identified as drift.
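The two-column variant looks up the learned ranges for the record's category before it compares the numeric value. In this sketch, the function name `violates_catnum_range` and the `buffer` parameter are hypothetical, and a category with no learned ranges is assumed to produce no violation for this constraint.

```python
def violates_catnum_range(record: dict, constraint: dict, buffer: float = 0.0) -> bool:
    """Return True when the numeric value is outside the ranges learned for its category."""
    content = constraint["content"]
    category = record[content["source_column"]]
    value = record[content["target_column"]]
    learned_ranges = content["ranges"].get(category)
    if learned_ranges is None:
        # Assumption: unknown categories are left to the categorical constraint.
        return False
    return not any(
        learned["min"] - buffer <= value <= learned["max"] + buffer
        for learned in learned_ranges
    )


constraint = {
    "name": "catnum_range_constraint",
    "kind": "two_column",
    "columns": ["CheckingStatus", "LoanAmount"],
    "content": {
        "source_column": "CheckingStatus",
        "target_column": "LoanAmount",
        "ranges": {
            "no_checking": [{"min": 250, "max": 11676, "count": 1993}],
            "less_0": [{"min": 250, "max": 7200, "count": 1398}],
        },
    },
}

# 8000 is inside the no_checking range but outside the less_0 range.
print(violates_catnum_range({"CheckingStatus": "no_checking", "LoanAmount": 8000}, constraint))  # False
print(violates_catnum_range({"CheckingStatus": "less_0", "LoanAmount": 8000}, constraint))       # True
```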
Categorical-numeric distribution constraint
{
"name": "catnum_distribution_constraint",
"id": "3a97494b-0cd7-483e-a1c6-adb7755c1cb0",
"kind": "two_column",
"columns": [
"CheckingStatus",
"LoanAmount"
],
"content": {
"source_column": "CheckingStatus",
"target_column": "LoanAmount",
"distribution": {
"greater_200": {
"name": "norm",
"parameters": {
"loc": 3799.62,
"scale": 1920.0640064678398
},
"p-value": 0.22617155797563282
}
}
}
}
The `LoanAmount` and `CheckingStatus` columns contain values that are specified with the `distribution` attribute to set a normal distribution for the training data. If records are found in the payload data with `LoanAmount` and `CheckingStatus` column values that do not fit within the normal distribution, the records are identified as drift.
Next steps
To mitigate drift after it is detected, you must build a new version of the model that fixes the problem. A good place to start is with the data points that are highlighted as reasons for the drift. Introduce the new data to the predictive model after you manually label the drifted transactions and use them to retrain the model.