Managing training data in Watson OpenScale

You must connect your training data to Watson OpenScale so that it understands how to process your model. Watson OpenScale uses the training data schema that you provide in your deployment to verify that the information that identifies the data is accurate and corresponds to how the model interprets it. The following example shows how the training data schema must be formatted:

"training_data_references": [
        {
            "connection": {
                "endpoint_url": "",
                "access_key_id": "",
                "secret_access_key": ""
            },
            "location": {
                "bucket": "",
                "path": ""
            },
            "type": "fs",
            "schema": {
                "id": "4cdb0a0a-1c69-43a0-a8c0-3918afc7d45f",
                "fields": [
                    {
                        "metadata": {},
                        "name": "CheckingStatus",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "LoanDuration",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {},
                        "name": "CreditHistory",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "LoanPurpose",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "LoanAmount",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {
                            "modeling_role": "target",
                            "values": [
                                "No Risk",
                                "Risk"
                            ]
                        },
                        "name": "Risk",
                        "nullable": true,
                        "type": "string"
                    }
                ],
                "type": "struct"
            }
        }
    ]

Generating artifacts from training data

Training data is accessed for the following reasons:

  • To build a drift detection model that detects a drop in accuracy, which is model drift.
  • To learn data constraints that detect a drop in data consistency, which is data drift.
  • To generate the training data statistics, which are used to recommend a fairness configuration, such as the fairness attributes and the reference and monitored groups for those attributes.
  • To generate correlations between sensitive attributes (for example, gender or race) and the features that are used to train the model. These correlations are then used to find indirect bias in the model deployment.
  • To generate transaction perturbations from the training data statistics, which are used to produce transaction explanations.

The preceding artifacts can be generated through Watson OpenScale, or through a Jupyter Notebook.

Maintaining the privacy of your training data

When training data is accessed through a Jupyter Notebook, either from a stand-alone Jupyter or Python application or from Watson Studio, the previously listed artifacts are generated and uploaded to Watson OpenScale.

In this approach, Watson OpenScale does not require access to the training data at run time, because the artifacts that are needed to compute the fairness, explainability, drift, and quality metrics are already generated.

Model drift

In production environments, Watson OpenScale creates the drift detection model by looking at the data that was used to train and test the model. For example, if the model has an accuracy of 90% on the test data, it means that it provides incorrect predictions on 10% of the test data. Watson OpenScale builds a binary classification model that accepts a data point and predicts whether that data point is similar to the accurately (90%) or inaccurately predicted (10%) data.

After Watson OpenScale creates the drift detection model, at run time it uses this model to score all the data that the client model receives. For example, if the client model received 1000 records in the past 3 hours, Watson OpenScale runs the drift detection model on those same 1000 data points and calculates how many of them are similar to the 10% of records that the model predicted incorrectly at training time. If 200 of these records are similar to that 10%, the model accuracy is likely to be 80%. Because the model accuracy at training time was 90%, there is an accuracy drift of 10% in the model.
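
The following Python sketch illustrates this calculation on synthetic data with scikit-learn. It is an illustration of the idea only, not the Watson OpenScale implementation, and the model and variable names are assumptions.

# Illustrative sketch of accuracy (model) drift estimation on synthetic data.
# This is not the Watson OpenScale implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a client model and its training and test data.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
base_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 1. Mark each test record as correctly (0) or incorrectly (1) predicted.
errors = (base_model.predict(X_test) != y_test).astype(int)
trained_accuracy = 1.0 - errors.mean()

# 2. The drift detection model learns which records resemble the ones that
#    the base model predicted incorrectly.
drift_model = GradientBoostingClassifier().fit(X_test, errors)

# 3. At run time, score the records that the client model receives; here the
#    payload is simulated with a slice of the test data.
X_payload = X_test[:500]
estimated_accuracy = 1.0 - drift_model.predict(X_payload).mean()

# 4. Accuracy drift = training-time accuracy minus estimated accuracy.
print(f"accuracy drift: {trained_accuracy - estimated_accuracy:.1%}")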

Data drift

Data drift calculation and processing requires column statistics, such as the mean and percentiles, and the constraints or rules in the training data, which can span a single column or two columns. For online subscriptions, processing is done through Python libraries such as pandas, NumPy, or SciPy. For batch subscriptions, processing is done through the PySpark library.

Training data serves as the main input to the data drift flow. From the training data, the column statistics are calculated and then the single-column constraints are learned. If two-column constraints are enabled, there is an extra step to learn the two-column constraints before drift is calculated and processed.
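
The following pandas sketch shows what learning such constraints could look like. The constraint forms shown here, a numeric range for a single column and a per-category range for a column pair, are assumptions for illustration and are not the exact rules that Watson OpenScale learns.

# Illustrative sketch of single-column and two-column data constraints.
import pandas as pd

train = pd.DataFrame({
    "LoanPurpose": ["car", "car", "education", "furniture"],
    "LoanAmount":  [2500, 4000, 1200, 800],
})

# Single-column constraint: the observed numeric range of LoanAmount.
single = {"LoanAmount": (train["LoanAmount"].min(), train["LoanAmount"].max())}

# Two-column constraint: the LoanAmount range per LoanPurpose category.
two_col = train.groupby("LoanPurpose")["LoanAmount"].agg(["min", "max"])

# Check a payload record against the learned constraints.
record = {"LoanPurpose": "education", "LoanAmount": 9000}
low, high = single["LoanAmount"]
violates_single = not (low <= record["LoanAmount"] <= high)
low2, high2 = two_col.loc[record["LoanPurpose"]]
violates_two_col = not (low2 <= record["LoanAmount"] <= high2)
print(violates_single, violates_two_col)   # True True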

For each numerical column, the following statistics are processed (see the sketch after this list):

  • mean value in column
  • min value in column
  • max value in column
  • standard deviation in column
  • quartiles
  • actual count of data after dropping outliers [only in online subscriptions]
  • approximate count for the number of distinct values [only in batch subscriptions]
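
The following sketch computes the statistics in the preceding list with pandas and NumPy. The interquartile-range rule that is used to drop outliers is an assumption for illustration, and batch subscriptions would compute an approximate distinct count with Spark (for example, approx_count_distinct) rather than the exact count shown here.

# Illustrative sketch of the per-column statistics listed above.
import numpy as np
import pandas as pd

col = pd.Series([4, 7, 9, 12, 15, 18, 22, 400], name="LoanDuration")

stats = {
    "mean": col.mean(),
    "min": col.min(),
    "max": col.max(),
    "std": col.std(),
    "quartiles": np.percentile(col, [25, 50, 75]).tolist(),
}

# Count of data after dropping outliers (here: values outside 1.5 * IQR).
q1, q3 = np.percentile(col, [25, 75])
iqr = q3 - q1
inliers = col[(col >= q1 - 1.5 * iqr) & (col <= q3 + 1.5 * iqr)]
stats["count_without_outliers"] = int(inliers.count())

# Exact distinct count; Spark's approx_count_distinct would approximate this.
stats["distinct_count"] = int(col.nunique())
print(stats)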

Fairness

The training data for fairness is used to generate the training data statistics. These statistics then recommend fairness configuration (for example: fairness attributes, reference, and monitored groups) and calculate fairness (disparate impact) on the training data for the given fairness configuration.

The training data statistics contain the distribution of the fairness attribute columns across the values in the prediction column. For example, for the sample German Credit Risk model, the training data statistics for the features Sex and Age look as follows:

{
    "fields": [
        "feature",
        "feature_value",
        "label",
        "count",
        "is_favourable",
        "group"
    ],
    "values": [
        [
            "Sex",
            "male",
            "No Risk",
            1995,
            true,
            "reference"
        ],
        [
            "Age",
            [
                26,
                75
            ],
            "No Risk",
            2431,
            true,
            "reference"
        ]
    ]
}
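
The following pandas sketch shows how a distribution of this kind could be derived from training data. It is an illustration only; the statistics that Watson OpenScale stores are produced by the service or by the training statistics notebook.

# Illustrative sketch: distribution of a fairness attribute across the
# prediction column, similar in shape to the statistics shown above.
import pandas as pd

train = pd.DataFrame({
    "Sex":  ["male", "female", "male", "female", "male"],
    "Risk": ["No Risk", "Risk", "No Risk", "No Risk", "Risk"],
})

# Count each (feature value, label) combination for the fairness attribute.
counts = (
    train.groupby(["Sex", "Risk"])
         .size()
         .reset_index(name="count")
)
counts["is_favourable"] = counts["Risk"] == "No Risk"   # favourable outcome
print(counts)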

These training data statistics are stored in the fairness monitor instance. If there is a change in the fairness configuration, these statistics must be regenerated. If the training data location is given to Watson OpenScale as part of the subscription, Watson OpenScale automatically regenerates these statistics when the fairness configuration changes. Otherwise, you must regenerate them by using the training statistics notebook.

The training data is also used to generate correlations between the features that were used to train the model and the sensitive meta-features, such as gender or race. These correlations are then used to determine the correlated reference and monitored groups for finding indirect bias in the model deployment.

These statistics are not used to report or remove bias from the model deployment.
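
The following sketch illustrates one way to measure the association between a sensitive meta-feature and a model feature by computing Cramér's V from a contingency table. It illustrates only the idea of finding correlated (proxy) features; it is not the exact method that Watson OpenScale uses.

# Illustrative sketch: association between a sensitive meta-feature (Sex) and
# a model feature (Job), measured with Cramer's V.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

train = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Job": ["skilled", "skilled", "unskilled", "unskilled", "skilled", "unskilled"],
})

table = pd.crosstab(train["Sex"], train["Job"])
chi2, _, _, _ = chi2_contingency(table, correction=False)
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"Cramer's V between Sex and Job: {cramers_v:.2f}")   # 1.00 here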

Explainability

For explainability, the following statistics are computed from the training data (see the sketch after this list):

  • For each categorical feature column, the count of each categorical value in the training data
  • For each numeric feature column, the following values are computed:

    • The minimum, maximum, median, and standard deviation values for the column.
    • Bins (4 or 10) for the numeric column, with the minimum, maximum, mean, and standard deviation values for each bin, and the count of values in each bin.
  • For the label column, the list of label column values
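
The following pandas sketch computes the statistics in the preceding list. The use of 4 equal-width bins is an assumption for illustration; the actual binning strategy is determined by Watson OpenScale.

# Illustrative sketch of the explainability statistics listed above.
import pandas as pd

train = pd.DataFrame({
    "LoanPurpose": ["car", "car", "education", "furniture", "car"],
    "LoanAmount":  [1000, 2500, 1200, 4000, 3100],
    "Risk":        ["No Risk", "Risk", "No Risk", "No Risk", "Risk"],
})

# Categorical feature column: count of each value.
categorical_counts = train["LoanPurpose"].value_counts().to_dict()

# Numeric feature column: overall statistics ...
amount = train["LoanAmount"]
numeric_stats = {
    "min": amount.min(), "max": amount.max(),
    "median": amount.median(), "std": amount.std(),
}
# ... plus per-bin statistics and counts for 4 bins.
bins = pd.cut(amount, bins=4)
bin_stats = amount.groupby(bins, observed=True).agg(["min", "max", "mean", "std", "count"])

# Label column: the list of label values.
label_values = train["Risk"].unique().tolist()
print(categorical_counts, numeric_stats, label_values, bin_stats, sep="\n")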

How are these statistics used in explanations? During explanation generation, the statistics are used to generate perturbation values that follow the same distribution as the training data. They are also used to ensure that perturbations are generated within the boundaries of the feature values.
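
The following sketch shows how such statistics could drive perturbation generation: categorical values are sampled in proportion to their training counts, and numeric values are drawn around the column statistics and clipped to the observed boundaries. The sampling scheme and the statistic values are assumptions for illustration.

# Illustrative sketch: generating perturbations from stored statistics.
import numpy as np

rng = np.random.default_rng(0)

# Statistics of the kind computed above (illustrative values).
loan_purpose_counts = {"car": 3, "education": 1, "furniture": 1}
loan_amount_stats = {"mean": 2360.0, "std": 1222.5, "min": 1000, "max": 4000}

# Categorical perturbations follow the training distribution of the column.
values, counts = zip(*loan_purpose_counts.items())
probs = np.array(counts) / sum(counts)

perturbations = {
    "LoanPurpose": rng.choice(values, size=5, p=probs),
    # Numeric perturbations stay within the observed feature boundaries.
    "LoanAmount": np.clip(
        rng.normal(loan_amount_stats["mean"], loan_amount_stats["std"], size=5),
        loan_amount_stats["min"], loan_amount_stats["max"],
    ).round(),
}
print(perturbations)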

Sample training data schema

The following output shows a sample training data schema from the German Credit Risk model:

"training_data_references": [
        {
            "connection": {
                "endpoint_url": "",
                "access_key_id": "",
                "secret_access_key": ""
            },
            "location": {
                "bucket": "",
                "path": ""
            },
            "type": "fs",
            "schema": {
                "id": "4cdb0a0a-1c69-43a0-a8c0-3918afc7d45f",
                "fields": [
                    {
                        "metadata": {},
                        "name": "CheckingStatus",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "LoanDuration",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {},
                        "name": "CreditHistory",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "LoanPurpose",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "LoanAmount",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {},
                        "name": "ExistingSavings",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "EmploymentDuration",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "InstallmentPercent",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {},
                        "name": "Sex",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "OthersOnLoan",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "CurrentResidenceDuration",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {},
                        "name": "OwnsProperty",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "Age",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {},
                        "name": "InstallmentPlans",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "Housing",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "ExistingCreditsCount",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {},
                        "name": "Job",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "Dependents",
                        "nullable": true,
                        "type": "integer"
                    },
                    {
                        "metadata": {},
                        "name": "Telephone",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "ForeignWorker",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {
                            "modeling_role": "target",
                            "values": [
                                "No Risk",
                                "Risk"
                            ]
                        },
                        "name": "Risk",
                        "nullable": true,
                        "type": "string"
                    }
                ],
                "type": "struct"
            }
        }
    ]

Parent topic: Configure Watson OpenScale
