You can configure drift v2 evaluations to measure changes in your data over time to ensure consistent outcomes for your model. Use drift v2 evaluations to identify changes in your model output, the accuracy of your predictions, and the distribution of your input data.
The following sections describe how to configure drift v2 evaluations:
Configuring drift v2 evaluations for machine learning models
If you log payload data when you prepare for model evaluations, you can configure drift v2 evaluations for machine learning models to help you understand how changes in your data affect model outcomes.
Compute the drift archive
You must choose the method that you want to use to analyze your training data to determine the data distributions of your model features. If you connect training data and the size of your data is less than 500 MB, you can choose to compute the drift v2 archive.
If you don't connect your training data, or if the size of your data is larger than 500 MB, you must choose to compute the drift v2 archive in a notebook. You must also compute the drift v2 archive in notebooks if you want to evaluate image or text models.
You can specify a limit for the size of your training data by setting maximum sample sizes for the amount of training data that is used for scoring and computing the drift v2 archive. For non-watsonx.ai Runtime deployments, computing the drift v2 archive has a cost associated with scoring the training data against your model's scoring endpoint.
Set drift thresholds
You must set threshold values for each metric to identify issues with your evaluation results. The values that you set create alerts on the Insights dashboard that appear when metric scores violate your thresholds. You must set the values in the range of 0 to 1. The metric scores must be lower than the threshold values to avoid violations.
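As a sketch of how threshold violations behave, a metric triggers an alert when its score is not lower than its threshold. The helper and metric names below are hypothetical illustrations, not part of the watsonx.governance API:

```python
def check_violations(metric_scores, thresholds):
    """Return the metrics whose scores violate their thresholds.

    Scores must stay lower than their thresholds (both in the 0-1 range)
    to avoid violations, so a score equal to or above its threshold
    is flagged.
    """
    return [name for name, score in metric_scores.items()
            if score >= thresholds[name]]

# Hypothetical metric scores and thresholds:
scores = {"output_drift": 0.12, "feature_drift": 0.45}
thresholds = {"output_drift": 0.3, "feature_drift": 0.3}
assert check_violations(scores, thresholds) == ["feature_drift"]
```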
Select important features
For tabular models only, feature importance is calculated to determine the impact of feature drift on your model. To calculate feature importance, you can select the features from your model that have the biggest impact on your model outcomes.
When you configure SHAP explanations, the important features are automatically detected by using global explanations.
You can also upload a list of important features by uploading a JSON file. Sample snippets are provided that you can use to upload a JSON file. For more information, see Feature importance snippets.
Set sample size
Sample sizes determine how many transactions are processed during evaluations. You must set a minimum sample size to indicate the lowest number of transactions that you want to evaluate. You can also set a maximum sample size to indicate the highest number of transactions that you want to evaluate.
Configuring drift v2 evaluations for generative AI models
When you evaluate prompt templates, you can review a summary of drift v2 evaluation results for the following task types:
- Text summarization
- Text classification
- Content generation
- Entity extraction
- Question answering
- Retrieval Augmented Generation (RAG)
Set drift thresholds
To configure drift v2 evaluations with your own settings, you can set a minimum and maximum sample size for each metric. The minimum or maximum sample size indicates the minimum or maximum number of model transactions that you want to evaluate.
You can also configure baseline data and set threshold values for each metric. Threshold values create alerts on the evaluation summary page that appear when metric scores violate your thresholds. You must set the values in the range of 0 to 1. The metric scores must be lower than the threshold values to avoid violations.
Compute the drift archive
Watsonx.governance uses payload records to establish the baseline for drift v2 evaluations. You must configure the number of records that you want to use as your baseline data. You can use a notebook to generate your drift v2 baseline data archive to configure evaluations.
Compute the embeddings
To compute embedding drift metrics, you must provide embeddings with your test data. You can use notebooks to help generate and persist embeddings.
Supported drift v2 metrics
When you enable drift v2 evaluations for machine learning models or generative AI models, you can view a summary of evaluation results with metrics for the type of model that you're evaluating.
If you are evaluating machine learning models, you can view the results of your drift v2 evaluations on the Insights dashboard. For more information, see Reviewing drift v2 results.
The following metrics are supported by drift v2 evaluations:
Embedding drift
Embedding drift detects the percentage of records that are outliers when compared to the baseline data.
- How it works: You must provide embeddings with your baseline data when you enable the embeddings drift metric to generate evaluation results. Watsonx.governance builds an auto-encoder that processes the embeddings in your baseline data and computes pre-defined cosine and euclidean distance metrics for the model output. Watsonx.governance identifies the distribution of the distance metrics to set a threshold for outlier detection and detects drift if the distance metric value is higher than the threshold. For RAG tasks, the embeddings for all of the context columns in your model record are combined into a single vector to determine drift.
- Do the math: Watsonx.governance uses the following formulas to compute embedding drift:
- Supported models: LLMs
- Applies to prompt template evaluations: Yes
- Task types:
- Text summarization
- Text classification
- Content generation
- Entity extraction
- Question answering
- Retrieval Augmented Generation (RAG)
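The thresholding step described above can be illustrated with a small sketch. The auto-encoder and the exact distance computations are internal to watsonx.governance; this hypothetical example only shows how a threshold drawn from the baseline distance distribution flags production records as outliers (the helper name and the quantile value are assumptions):

```python
def embedding_drift_share(baseline_distances, production_distances, quantile=0.99):
    """Share of production records whose distance metric exceeds a
    threshold identified from the baseline distance distribution."""
    ranked = sorted(baseline_distances)
    # Threshold: an assumed high quantile of the baseline distances.
    threshold = ranked[int(quantile * (len(ranked) - 1))]
    outliers = sum(1 for d in production_distances if d > threshold)
    return outliers / len(production_distances)

baseline = [i / 100 for i in range(100)]   # distances 0.00 .. 0.99
production = [0.10, 0.50, 1.20, 2.00]      # two records exceed the threshold
assert embedding_drift_share(baseline, production) == 0.5
```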
Output drift
Output drift measures the change in the model confidence distribution.
- How it works: The amount that your model output changes from the time that you train the model is measured. For regression models, output drift is calculated by measuring the change in distribution of predictions on the training and payload data. For classification models, output drift is calculated for each class probability by measuring the change in distribution for class probabilities on the training and payload data. For multiclass classification models, output drift is aggregated for each class probability by measuring a weighted average.
- Do the math: The following formulas are used to calculate output drift:
- Supported models: traditional machine learning and LLMs
- Applies to prompt template evaluations: Yes
- Task types:
- Text summarization
- Text classification
- Content generation
- Entity extraction
- Question answering
Model quality drift
Model quality drift compares the estimated runtime accuracy to the training accuracy to measure the drop in accuracy.
- How it works: A drift detection model is built that processes your payload data when you configure drift v2 evaluations to predict whether your model generates accurate predictions without the ground truth. The drift detection model uses the input features and class probabilities from your model to create its own input features.
- Do the math: The following formula is used to calculate model quality drift:
The accuracy of your model is calculated as the `base_accuracy` by measuring the fraction of correctly predicted transactions in your training data. During evaluations, your transactions are scored against the drift detection model to measure the number of transactions that are likely predicted correctly by your model. These transactions are compared to the total number of transactions that are processed to calculate the `predicted_accuracy`. If the `predicted_accuracy` is less than the `base_accuracy`, a model quality drift score is generated.
- Supported models: traditional machine learning
- Applies to prompt template evaluations: No
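The accuracy comparison can be illustrated with a worked sketch. The counts and the subtraction used for the score below are illustrative assumptions, not the exact watsonx.governance formula:

```python
# base_accuracy: fraction of correctly predicted transactions in training data.
base_accuracy = 0.92

# During evaluations, the drift detection model marks transactions that are
# likely predicted correctly by your model (hypothetical counts).
likely_correct = 4100
total_scored = 5000
predicted_accuracy = likely_correct / total_scored   # 0.82

# A predicted_accuracy below base_accuracy signals model quality drift;
# the difference is one plausible way to express the drop.
drift = max(0.0, base_accuracy - predicted_accuracy)
assert abs(drift - 0.10) < 1e-9
```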
Feature drift
Feature drift measures the change in value distribution for important features.
- How it works: Drift is calculated for categorical and numeric features by measuring the probability distribution of continuous and discrete values. To identify discrete values for numeric features, a binary logarithm is used to compare the number of distinct values of each feature to the total number of values of each feature. The following binary logarithm formula is used to identify discrete numeric features:
If the `distinct_values_count` is less than the binary logarithm of the `total_count`, the feature is identified as discrete.
- Do the math: The following formulas are used to calculate feature drift:
- Supported models: traditional machine learning
- Applies to prompt template evaluations: No
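The binary logarithm check for discrete numeric features can be sketched as follows (the helper name is hypothetical):

```python
import math

def is_discrete(values):
    """Treat a numeric feature as discrete when its distinct-value count
    is less than the binary logarithm of its total value count."""
    distinct_values_count = len(set(values))
    total_count = len(values)
    return distinct_values_count < math.log2(total_count)

# 1,024 values but only 3 distinct ones: 3 < log2(1024) = 10, so discrete.
assert is_discrete([1, 2, 3] * 341 + [1])
# 1,024 unique values: 1024 >= 10, so the feature is continuous.
assert not is_discrete(list(range(1024)))
```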
Prediction drift
Prediction drift measures the change in distribution of the LLM predicted classes.
- Do the math: Watsonx.governance uses the Jensen Shannon distance formula to calculate prediction drift.
- Supported models: LLMs
- Applies to prompt template evaluations: Yes
- Task types: Text classification
Input metadata drift
Input metadata drift measures the change in distribution of the LLM input text metadata.
- How it works: Watsonx.governance calculates the following metadata with the LLM input text:
  - Character count: Total number of characters in the input text
  - Word count: Total number of words in the input text
  - Token count: Total number of tokens in the input text
  - Sentence count: Total number of sentences in the input text
  - Average word length: Average length of words in the input text
  - Total word length: Total length of words in the input text
  - Average sentence length: Average length of the sentences in the input text

  Watsonx.governance calculates input metadata drift by measuring the change in distribution of the metadata columns. The input token count column, if present in the payload, is also used to compute the input metadata drift. You can also choose to specify any meta fields while adding records to the payload table. These meta fields are also used to compute the input metadata drift. To identify discrete numeric input metadata columns, watsonx.governance uses the following binary logarithm formula:

  If the `distinct_values_count` is less than the binary logarithm of the `total_count`, the feature is identified as discrete. For discrete input metadata columns, watsonx.governance uses the Jensen Shannon distance formula to calculate input metadata drift. For continuous input metadata columns, watsonx.governance uses the total variation distance and overlap coefficient formulas to calculate input metadata drift.
- Applies to prompt template evaluations: Yes
- Task types:
  - Text summarization
  - Text classification
  - Content generation
  - Entity extraction
  - Question answering
- Supported models: LLMs
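A minimal sketch of the per-record metadata columns might look like the following. The function name is hypothetical, the sentence splitting is deliberately naive, and token count is omitted because it depends on the model's tokenizer:

```python
def input_text_metadata(text):
    """Compute simple metadata columns for one LLM input record (sketch)."""
    words = text.split()
    # Naive sentence split on terminal punctuation; real splitting differs.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    total_word_length = sum(len(w) for w in words)
    return {
        "character_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "total_word_length": total_word_length,
        "average_word_length": total_word_length / max(len(words), 1),
        "average_sentence_length": len(words) / max(len(sentences), 1),
    }

meta = input_text_metadata("Drift checks run nightly. Alerts fire on violations.")
assert meta["word_count"] == 8
assert meta["sentence_count"] == 2
```

Drift is then measured on the distribution of each of these columns across records, not on any single record.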
Output metadata drift
Output metadata drift measures the change in distribution of the LLM output text metadata.
- How it works: Watsonx.governance calculates the following metadata with the LLM output text:
  - Character count: Total number of characters in the output text
  - Word count: Total number of words in the output text
  - Token count: Total number of tokens in the output text
  - Sentence count: Total number of sentences in the output text
  - Average word length: Average length of words in the output text
  - Average sentence length: Average length of the sentences in the output text
  - Total word length: Total length of words in the output text

  Watsonx.governance calculates output metadata drift by measuring the change in distribution of the metadata columns. The output token count column, if present in the payload, is also used to compute the output metadata drift. You can also choose to specify any meta fields while adding records to the payload table. These meta fields are also used to compute the output metadata drift. To identify discrete numeric output metadata columns, watsonx.governance uses the following binary logarithm formula:

  If the `distinct_values_count` is less than the binary logarithm of the `total_count`, the feature is identified as discrete. For discrete output metadata columns, watsonx.governance uses the Jensen Shannon distance formula to calculate output metadata drift. For continuous output metadata columns, watsonx.governance uses the total variation distance and overlap coefficient formulas to calculate output metadata drift.
- Applies to prompt template evaluations: Yes
- Task types:
  - Text summarization
  - Text classification
  - Content generation
  - Question answering
- Supported models: LLMs
The following formulas are used to calculate drift v2 evaluation metrics:
Total variation distance
Total variation distance measures the maximum difference between the probabilities that two probability distributions, baseline (B) and production (P), assign to the same transaction as shown in the following formula:
If the two distributions are equal, the total variation distance between them becomes 0.
The following formula is used to calculate total variation distance:
- 𝑥 is a series of equidistant samples that span the range from the combined minimum of the baseline and production data to the combined maximum of the baseline and production data.
- Δ𝑥 is the difference between two consecutive 𝑥 samples.
- 𝑝(𝑥) is the value of the density function for production data at an 𝑥 sample.
- 𝑏(𝑥) is the value of the density function for baseline data at an 𝑥 sample.

The denominator represents the total area under the density function plots for production and baseline data. These summations approximate the integrations over the domain space; each term should be 1, so the total should be 2.
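The approximation can be sketched in code under the assumption that both density functions have already been sampled on a shared equidistant grid:

```python
import math

def total_variation_distance(p_density, b_density, dx):
    """Approximate total variation distance from density samples.

    p_density and b_density are the production and baseline density values
    at a common grid of equidistant x samples; dx is the grid step.
    """
    numerator = sum(abs(p - b) for p, b in zip(p_density, b_density)) * dx
    # Total area under both density plots; each term approximates 1.
    denominator = (sum(p_density) + sum(b_density)) * dx
    return numerator / denominator

# Equal distributions give a distance of 0.
density = [math.exp(-(i / 10 - 5) ** 2 / 2) for i in range(101)]
assert total_variation_distance(density, density, 0.1) == 0.0
# Fully disjoint distributions give a distance of 1.
assert total_variation_distance([1.0, 0.0], [0.0, 1.0], 1.0) == 1.0
```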
Overlap coefficient
The overlap coefficient is calculated by measuring the total area of the intersection between two probability distributions. To measure dissimilarity between distributions, the intersection or the overlap area is subtracted from 1 to calculate the amount of drift. The following formula is used to calculate the overlap coefficient:
- 𝑥 is a series of equidistant samples that span the range from the combined minimum of the baseline and production data to the combined maximum of the baseline and production data.
- Δ𝑥 is the difference between two consecutive 𝑥 samples.
- 𝑝(𝑥) is the value of the density function for production data at an 𝑥 sample.
- 𝑏(𝑥) is the value of the density function for baseline data at an 𝑥 sample.
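A matching sketch for the overlap coefficient, with the same sampled-density-grid assumptions as above:

```python
def overlap_drift(p_density, b_density, dx):
    """1 minus the intersection area of two sampled density functions."""
    overlap_area = sum(min(p, b) for p, b in zip(p_density, b_density)) * dx
    return 1.0 - overlap_area

# Identical distributions overlap completely, so drift is 0.
uniform = [0.5] * 20   # uniform density sampled on a width-2 interval
assert overlap_drift(uniform, uniform, 0.1) == 0.0
# Disjoint distributions do not overlap, so drift is 1.
assert overlap_drift([1.0, 0.0], [0.0, 1.0], 1.0) == 1.0
```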
Jensen Shannon distance
Jensen Shannon Distance is the normalized form of Kullback-Leibler (KL) Divergence that measures how much one probability distribution differs from another probability distribution. Jensen Shannon Distance is a symmetrical score and always has a finite value.
The following formula is used to calculate the Jensen Shannon distance for two probability distributions, baseline (B) and production (P):
where 𝐷 is the KL Divergence.
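The following sketch computes the Jensen Shannon distance for two discrete distributions, with M as the midpoint distribution of baseline B and production P:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), using log base 2."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(b, p):
    """Square root of the JS divergence of baseline b and production p."""
    m = [(bi + pi) / 2 for bi, pi in zip(b, p)]
    return math.sqrt((kl_divergence(b, m) + kl_divergence(p, m)) / 2)

baseline = [0.6, 0.3, 0.1]
production = [0.2, 0.5, 0.3]
d = jensen_shannon_distance(baseline, production)
# Symmetric, finite, and bounded between 0 and 1 with log base 2.
assert abs(d - jensen_shannon_distance(production, baseline)) < 1e-12
assert 0.0 <= d <= 1.0
assert jensen_shannon_distance(baseline, baseline) == 0.0
```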
Cosine distance
Cosine distance measures the difference between embedding vectors. The following formula is used to measure cosine distance:
The cosine distance ranges from 0, which indicates identical vectors, to 1, which indicates no correlation between the vectors, to 2, which indicates opposite vectors.
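A small sketch shows the stated range at its three landmark points:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

assert cosine_distance([1.0, 0.0], [1.0, 0.0]) == 0.0   # identical direction
assert cosine_distance([1.0, 0.0], [0.0, 1.0]) == 1.0   # no correlation
assert cosine_distance([1.0, 0.0], [-1.0, 0.0]) == 2.0  # opposite vectors
```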
Euclidean distance
Euclidean distance is the shortest distance between embedding vectors in the euclidean space. The following formula is used to measure euclidean distance:
The euclidean distance ranges from 0, which indicates completely identical vectors, to infinity. However, for vectors that are normalized to have unit length, the maximum euclidean distance is 2.
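The range can be verified with a short sketch:

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Identical vectors are at distance 0.
assert euclidean_distance([1.0, 2.0], [1.0, 2.0]) == 0.0
# For unit-length vectors, opposite vectors give the maximum distance, 2.
assert euclidean_distance([1.0, 0.0], [-1.0, 0.0]) == 2.0
```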
Parent topic: Configuring model evaluations