Evaluation metrics
Last updated: Feb 21, 2025

Evaluation metrics can help you continuously monitor the performance of your AI models to provide insights throughout the AI lifecycle. With watsonx.governance, you can use these metrics to help ensure compliance with regulatory requirements and identify how to make improvements to mitigate risks.

You can run evaluations in watsonx.governance to generate metrics with automated monitoring that provides actionable insights. You can use these metrics to help achieve the following goals:

  • Ensure compliance: Automatically track adherence to evolving regulations and organizational policies with alerts triggered when thresholds are breached.
  • Promote transparency: Generate detailed documentation to provide clear insights into model behavior, performance, and explainability of outcomes.
  • Mitigate risks: Detect and address issues like bias or accuracy drift through continuous evaluation and proactive risk assessments.
  • Protect privacy and security: Monitor for security vulnerabilities like personally identifiable information (PII) exposure and enforce guardrails to prevent misuse of sensitive data.

The metrics that you can use to provide insights about your model performance are determined by the type of evaluations that you enable. Each type of evaluation generates different metrics that you can analyze to gain insights.

You can also use the Python SDK to calculate metrics for evaluations in a notebook runtime environment or to offload the calculations as Spark jobs against IBM Analytics Engine. Some metrics might be available only with the Python SDK.

Drift evaluation metrics

Drift evaluation metrics can help you detect drops in accuracy and data consistency in your models to determine how well your model predicts outcomes over time. Watsonx.governance supports the following drift evaluation metrics for machine learning models:

Table 1. Drift evaluation metric descriptions
Metric Description
Drop in accuracy Estimates the drop in accuracy of your model at run time when compared to the training data
Drop in data consistency Compares run time transactions with the patterns of transactions in the training data to identify inconsistency
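
As a simplified illustration of the drop in accuracy idea, the following sketch compares accuracy on a window of labeled runtime transactions to the accuracy that was measured on training data. This is not the drift detection model that the evaluation builds internally, and the record layout and the training accuracy value are assumptions for the example.

training_accuracy = 0.82  # accuracy that was measured on the training data (assumed)

# A window of recent runtime transactions with ground-truth labels (assumed layout).
labeled_runtime_records = [
    {"prediction": "approved", "label": "approved"},
    {"prediction": "denied", "label": "approved"},
    {"prediction": "approved", "label": "approved"},
    {"prediction": "denied", "label": "denied"},
]

correct = sum(1 for record in labeled_runtime_records if record["prediction"] == record["label"])
runtime_accuracy = correct / len(labeled_runtime_records)

# Drop in accuracy: how much worse the model performs at run time than on training data.
drop_in_accuracy = max(0.0, training_accuracy - runtime_accuracy)
print(f"runtime accuracy: {runtime_accuracy:.2f}, drop in accuracy: {drop_in_accuracy:.2f}")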

Drift v2 evaluation metrics

Drift v2 evaluation metrics can help you measure changes in your data over time to ensure consistent outcomes for your model. You can use these metrics to identify changes in your model output, the accuracy of your predictions, and the distribution of your input data. Watsonx.governance supports the following drift v2 metrics:

Table 2. Drift v2 evaluation metric descriptions
Metric Description
Embedding drift Detects the percentage of records that are outliers when compared to the baseline data
Feature drift Measures the change in value distribution for important features
Input metadata drift Measures the change in distribution of the LLM input text metadata
Model quality drift Compares the estimated runtime accuracy to the training accuracy to measure the drop in accuracy
Output drift Measures the change in the model confidence distribution
Output metadata drift Measures the change in distribution of the LLM output text metadata
Prediction drift Measures the change in distribution of the LLM predicted classes
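
A common way to quantify a change in value distribution for a single feature is to bin the baseline (training) values and the recent payload values and compare the two histograms. The sketch below uses the Jensen-Shannon distance from SciPy as the comparison measure; the choice of measure, the bin count, and the 0.1 alert threshold are assumptions for illustration, not the exact computation that the drift v2 evaluation performs.

# Minimal sketch of feature drift: compare the runtime value distribution of one
# feature against its baseline (training) distribution.
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline_values = rng.normal(loc=50, scale=10, size=5_000)  # training data (assumed)
runtime_values = rng.normal(loc=55, scale=12, size=1_000)   # recent payload data (assumed)

# Bin both samples on a common grid and normalize to probability distributions.
bins = np.histogram_bin_edges(baseline_values, bins=20)
baseline_hist, _ = np.histogram(baseline_values, bins=bins)
runtime_hist, _ = np.histogram(runtime_values, bins=bins)
baseline_dist = baseline_hist / baseline_hist.sum()
runtime_dist = runtime_hist / runtime_hist.sum()

feature_drift = jensenshannon(baseline_dist, runtime_dist)
print(f"feature drift (Jensen-Shannon distance): {feature_drift:.3f}")
if feature_drift > 0.1:  # illustrative threshold, not a product default
    print("Distribution shift detected for this feature")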

Fairness evaluation metrics

Fairness evaluation metrics can help you determine whether your model produces biased outcomes. You can use these metrics to identify when your model shows a tendency to provide favorable outcomes more often for one group over another. Watsonx.governance supports the following fairness evaluation metrics:

Table 3. Fairness evaluation metric descriptions
Metric Description
Average absolute odds difference Compares the average of absolute difference in false positive rates and true positive rates between monitored groups and reference groups
Average odds difference Measures the difference in false positive and false negative rates between monitored and reference groups
Disparate impact Compares the percentage of favorable outcomes for a monitored group to the percentage of favorable outcomes for a reference group
Error rate difference Measures the difference between monitored and reference groups in the percentage of transactions that are incorrectly scored by your model
False discovery rate difference Measures the difference between monitored and reference groups in the number of false positive transactions as a percentage of all transactions with a positive outcome
False negative rate difference Measures the difference between monitored and reference groups in the percentage of positive transactions that are incorrectly scored as negative by your model
False omission rate difference Measures the difference between monitored and reference groups in the number of false negative transactions as a percentage of all transactions with a negative outcome
False positive rate difference Measures the difference between monitored and reference groups in the percentage of negative transactions that are incorrectly scored as positive by your model
Impact score Compares the rate at which monitored groups are selected to receive favorable outcomes to the rate at which reference groups are selected to receive favorable outcomes
Statistical parity difference Compares the percentage of favorable outcomes for monitored groups to the percentage of favorable outcomes for reference groups
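
Disparate impact and statistical parity difference can both be expressed in terms of the favorable-outcome rates of the monitored group and the reference group. The following sketch computes both from a small set of scored transactions; the column names, the group values, and the favorable outcome label are assumptions for the example.

# Minimal sketch: disparate impact and statistical parity difference computed
# from scored transactions. Column names, group values, and the "approved"
# favorable outcome are assumptions.
import pandas as pd

scored = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "prediction": ["approved", "denied", "approved", "approved", "approved", "approved", "denied"],
})

monitored = scored[scored["sex"] == "female"]  # monitored group (assumed)
reference = scored[scored["sex"] == "male"]    # reference group (assumed)

monitored_rate = (monitored["prediction"] == "approved").mean()
reference_rate = (reference["prediction"] == "approved").mean()

disparate_impact = monitored_rate / reference_rate                 # ratio of favorable rates
statistical_parity_difference = monitored_rate - reference_rate    # difference of favorable rates

print(f"disparate impact: {disparate_impact:.2f}")
print(f"statistical parity difference: {statistical_parity_difference:.2f}")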

Generative AI quality evaluation metrics

Generative AI quality evaluation metrics can help you measure how well your foundation model performs tasks. Watsonx.governance supports the following generative AI quality evaluation metrics:

Table 4. Generative AI quality evaluation metric descriptions
Metric Description
BLEU (Bilingual Evaluation Understudy) Compares translated sentences from machine translations to sentences from reference translations to measure the similarity between reference texts and predictions
Exact match Compares model prediction strings to reference strings to measure how often the strings match
METEOR (Metric for Evaluation of Translation with Explicit ORdering) Measures how well the text that is generated with machine translations matches the structure of the text from reference translations
Readability Determines how difficult the model's output is to read by measuring characteristics such as sentence length and word complexity
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Measures how well generated summaries or translations compare to reference outputs
SARI (system output against references and against the input sentence) Compares the predicted sentence output against the reference sentence output to measure the quality of words that the model uses to generate sentences
Sentence similarity Captures semantic information from sentence embeddings to measure the similarity between texts
Text quality Evaluates the output of a model against SuperGLUE datasets by measuring the F1 score, precision, and recall against the model predictions and its ground truth data
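
Reference-based scores in this family can also be reproduced outside of an evaluation run. The following sketch uses the open-source Hugging Face evaluate library to compute BLEU and ROUGE for a single prediction; the example strings are made up, and this is not the watsonx.governance implementation of these metrics.

# Minimal sketch: BLEU and ROUGE computed with the open-source `evaluate` library.
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# BLEU compares predicted n-grams to reference n-grams; ROUGE is recall-oriented.
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))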

Watsonx.governance also supports the following categories of generative AI quality metrics:

Answer quality metrics

You can use answer quality metrics to evaluate the quality of model answers. Answer quality metrics are calculated with LLM-as-a-judge models. To calculate the metrics with LLM-as-a-judge models, you can create a scoring function that calls the models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.

You can calculate the following answer quality metrics:

Table 5. Answer quality evaluation metric descriptions
Metric Description
Answer relevance Measures how relevant the answer in the model output is to the question in the model input
Answer similarity Measures how similar the answer or generated text is to the ground truth or reference answer to determine the quality of your model performance
Faithfulness Measures how grounded the model output is in the model context and provides attributions from the context to show the most important sentences that contribute to the model output
Unsuccessful requests Measures the ratio of questions that are answered unsuccessfully out of the total number of questions
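
The scoring function that calls an LLM-as-a-judge model can be as simple as a prompt that asks the judge for a numeric rating and a parser for its reply. The sketch below shows that shape for answer relevance; call_llm is a hypothetical placeholder for whichever inference call you use (for example, a watsonx.ai text generation request), and the prompt wording and 0 to 1 scale are assumptions, not the notebook's exact implementation.

# Hypothetical sketch of an LLM-as-a-judge scoring function for answer relevance.

def call_llm(prompt: str) -> str:
    """Placeholder for your judge model call; replace with a real inference request."""
    raise NotImplementedError

def score_answer_relevance(question: str, answer: str) -> float:
    """Ask the judge model to rate relevance on a 0-1 scale and parse the reply."""
    prompt = (
        "Rate how relevant the answer is to the question on a scale from 0 to 1. "
        "Reply with only the number.\n"
        f"Question: {question}\nAnswer: {answer}\nScore:"
    )
    reply = call_llm(prompt)
    try:
        return min(1.0, max(0.0, float(reply.strip())))
    except ValueError:
        return 0.0  # treat unparseable judge output as lowest relevance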

Content analysis metrics

You can use the following content analysis metrics to evaluate your model output against your model input or context:

Table 6. Content analysis evaluation metric descriptions
Metric Description
Abstractness Measures the ratio of n-grams in the generated text output that do not appear in the source content of the foundation model
Compression Measures how much shorter the summary is when compared to the input text by calculating the ratio between the number of words in the original text and the number of words in the foundation model output
Coverage Measures the extent that the foundation model output is generated from the model input by calculating the percentage of output text that is also in the input
Density Measures how extractive the summary in the foundation model output is from the model input by calculating the average of extractive fragments that closely resemble verbatim extractions from the original text
Repetitiveness Measures the percentage of n-grams that repeat in the foundation model output by calculating the number of repeated n-grams and the total number of n-grams in the model output
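
Compression and abstractness are simple ratios over the input text and the generated output. The sketch below computes both with whitespace tokens and unigrams; the real metrics use their own tokenization and n-gram settings, so treat this only as an illustration of the ratios involved.

# Minimal sketch of two content analysis metrics: compression and abstractness.

def compression(source_text: str, summary: str) -> float:
    """Ratio of the number of words in the original text to the number in the summary."""
    return len(source_text.split()) / max(1, len(summary.split()))

def abstractness(source_text: str, summary: str, n: int = 1) -> float:
    """Share of summary n-grams that never appear in the source text."""
    def ngrams(text: str) -> set:
        tokens = [token.strip(".,!?").lower() for token in text.split()]
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - ngrams(source_text)) / len(summary_ngrams)

source = "The quarterly report shows that revenue grew by ten percent year over year."
summary = "Revenue grew ten percent."
print(compression(source, summary))   # 3.25
print(abstractness(source, summary))  # 0.0: every summary word appears in the source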

Data safety metrics

You can use the following data safety metrics to identify whether your model's input or output contains harmful or sensitive information:

Table 7. Data safety evaluation metric descriptions
Metric Description
HAP Measures whether the model input or output data contains any toxic content that includes hate, abuse, or profanity
PII Measures whether your model input or output data contains any personally identifiable information by using the Watson Natural Language Processing entity extraction model
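
Purely as an illustration of the shape of a data safety check, the sketch below flags records that contain an email-like string with a naive regular expression. The PII metric itself uses the Watson Natural Language Processing entity extraction model, which this toy pattern does not replace.

# Naive illustration of a PII-style check; not the Watson NLP entity extraction model.
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def contains_email_pii(texts):
    """Fraction of records that contain at least one email-like string."""
    flagged = sum(1 for text in texts if EMAIL_PATTERN.search(text))
    return flagged / len(texts) if texts else 0.0

records = ["Contact me at jane.doe@example.com", "No sensitive data here"]
print(contains_email_pii(records))  # 0.5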

Multi-label/class metrics

You can use the following multi-label/class metrics to measure model performance for multi-label/multi-class predictions:

Table 8. Multi-label/class evaluation metric descriptions
Metric Description
Macro F1 score Calculates F1 scores independently for each class and averages the scores
Macro precision Calculates precision scores independently for each class and averages the scores
Macro recall Calculates recall scores independently for each class and averages the scores
Micro F1 score Aggregates all true positives, false positives, and false negatives across all classes to calculate F1 score
Micro precision Aggregates all true positives and false positives across all classes to calculate precision
Micro recall Aggregates all true positives and false negatives across all classes to calculate recall
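
These macro and micro averages match the standard scikit-learn definitions, so a quick way to sanity-check them is to compute the same averages locally. The labels in the following sketch are made up for the example.

# Minimal sketch: macro and micro averaged precision, recall, and F1 for a
# multi-class prediction task, computed with scikit-learn.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["cat", "dog", "bird", "dog", "cat", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "bird"]

for average in ("macro", "micro"):
    p = precision_score(y_true, y_pred, average=average, zero_division=0)
    r = recall_score(y_true, y_pred, average=average, zero_division=0)
    f1 = f1_score(y_true, y_pred, average=average, zero_division=0)
    print(f"{average}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")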

Retrieval quality metrics

You can use retrieval quality metrics to measure how well your retrieval system ranks relevant contexts. Retrieval quality metrics are calculated with LLM-as-a-judge models. To calculate the metrics with LLM-as-a-judge models, you can create a scoring function that calls the models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.

You can calculate the following retrieval quality metrics:

Table 9. Retrieval quality evaluation metric descriptions
Metric Description
Average precision Evaluates whether all of the relevant contexts are ranked highly by calculating the mean of the precision scores of the relevant contexts
Context relevance Measures how relevant the context that your model retrieves is to the question that is specified in the prompt
Hit rate Measures whether there is at least one relevant context among the retrieved contexts
Normalized Discounted Cumulative Gain Measures the ranking quality of the retrieved contexts
PII Measures if your model input or output data contains any personally identifiable information by using the Watson Natural Language Processing entity extraction model
Reciprocal rank Measures the reciprocal of the rank of the first relevant context that is retrieved
Retrieval precision Measures the quantity of relevant contexts from the total number of contexts that are retrieved
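
Several of these ranking metrics have short closed-form definitions over the ranked list of retrieved contexts. The following sketch computes hit rate, reciprocal rank, and average precision for a single query from a list of per-rank relevance judgments; in an evaluation those judgments would come from the LLM-as-a-judge model, and the example values here are made up.

# Minimal sketch: hit rate, reciprocal rank, and average precision for one query,
# given the relevance of each retrieved context in ranked order (assumed values).

relevant = [False, True, False, True]

def hit_rate(relevant):
    return 1.0 if any(relevant) else 0.0

def reciprocal_rank(relevant):
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def average_precision(relevant):
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(hit_rate(relevant), reciprocal_rank(relevant), average_precision(relevant))
# 1.0 0.5 0.5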

Model health monitor evaluation metrics

Model health monitor evaluation metrics can help you understand your model behavior and performance by determining how efficiently your model deployment processes your transactions. Model health evaluation metrics are enabled by default for machine learning model evaluations in production and generative AI asset deployments. Watsonx.governance supports the following model health monitor evaluation metrics:

Table 10. Model health monitor evaluation metric descriptions
Metric Description
Payload size The total, average, minimum, maximum, and median payload size, in kilobytes (KB), of the transaction records that your model deployment processes across scoring requests
Records The total, average, minimum, maximum, and median number of transaction records that are processed across scoring requests
Scoring requests The number of scoring requests that your model deployment receives
Users The number of users that send scoring requests to your model deployments
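
Each of these metrics is reported as a set of simple aggregations over the scoring requests in the evaluation window. As a minimal sketch, the following example computes the total, average, minimum, maximum, and median for a handful of made-up payload sizes.

# Minimal sketch: the aggregations that model health metrics report, applied to
# payload sizes in KB. The values are made up for illustration.
import statistics

payload_sizes_kb = [12.4, 8.1, 15.0, 9.3, 11.7]  # one value per scoring request (assumed)

summary = {
    "total": sum(payload_sizes_kb),
    "average": statistics.mean(payload_sizes_kb),
    "min": min(payload_sizes_kb),
    "max": max(payload_sizes_kb),
    "median": statistics.median(payload_sizes_kb),
}
print(summary)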

Watsonx.governance also supports the following categories of model health monitor evaluation metrics:

Token counts

The following token count metrics calculate the number of tokens that are processed across scoring requests for your model deployment:

Table 11. Model health monitor token count evaluation metric descriptions
Metric Description
Input token count Calculates the total, average, minimum, maximum, and median input token count across multiple scoring requests during evaluations
Output token count Calculates the total, average, minimum, maximum, and median output token count across scoring requests during evaluations

Throughput and latency

Model health monitor evaluations calculate latency by tracking the time, in milliseconds (ms), that it takes to process scoring requests and transaction records. Throughput is calculated by tracking the number of scoring requests and transaction records that are processed per second.

The following metrics are calculated to measure throughput and latency during evaluations:

Table 12. Model health monitor throughput and latency metric descriptions
Metric Description
API latency Time taken (in ms) to process a scoring request by your model deployment
API throughput Number of scoring requests processed by your model deployment per second
Record latency Time taken (in ms) to process a record by your model deployment
Record throughput Number of records processed by your model deployment per second
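
As a minimal sketch of how these values relate, the following example derives latency and throughput from per-request timings. The durations and record counts are made-up values for illustration, and a real deployment would aggregate wall-clock time rather than summing request durations if requests overlap.

# Minimal sketch: API latency and throughput derived from per-request timings.
import statistics

request_durations_ms = [120.0, 95.0, 140.0, 110.0]  # one entry per scoring request (assumed)
records_per_request = [32, 32, 64, 32]               # records processed per request (assumed)

api_latency_ms = statistics.mean(request_durations_ms)        # average time per request
total_seconds = sum(request_durations_ms) / 1000.0
api_throughput = len(request_durations_ms) / total_seconds     # scoring requests per second
record_throughput = sum(records_per_request) / total_seconds   # records per second

print(f"API latency: {api_latency_ms:.1f} ms")
print(f"API throughput: {api_throughput:.2f} requests/s")
print(f"record throughput: {record_throughput:.1f} records/s")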

Python SDK evaluation metrics

The Python SDK is a Python library that you can use to programmatically monitor, manage, and govern machine learning models and generative AI assets. You can use the Python SDK to automate the calculation of evaluation metrics. The Python SDK also provides algorithms that you can use to help measure performance. For more information, see Metrics computation with the Python SDK.

The following metrics are currently available only with Python SDK version 3.0.39 or later:

Table 13. Python SDK evaluation metric descriptions
Metric Description
Adversarial robustness Measures the robustness of your model and prompt template against adversarial attacks such as prompt injections and jailbreaks
Keyword inclusion Measures the similarity of nouns and pronouns between the foundation model output and the reference or ground truth
Prompt leakage risk Measures the risk of leaking the prompt template by calculating the similarity between the leaked prompt template and original prompt template
Question robustness Detects English-language spelling errors in the model input questions
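
To convey what keyword inclusion measures, the following from-scratch sketch computes the share of the reference's content words that also appear in the generated output. The SDK metric works on nouns and pronouns; this approximation treats all non-stopword tokens as keywords and is not the Python SDK's implementation.

# Illustration of the keyword inclusion idea with a naive keyword definition.

STOPWORDS = {"the", "a", "an", "of", "is", "are", "by", "at", "to", "in", "on", "and"}

def keywords(text: str) -> set:
    return {token.strip(".,!?").lower() for token in text.split()} - STOPWORDS - {""}

def keyword_inclusion(prediction: str, reference: str) -> float:
    """Fraction of reference keywords that also appear in the prediction."""
    reference_keywords = keywords(reference)
    if not reference_keywords:
        return 0.0
    return len(reference_keywords & keywords(prediction)) / len(reference_keywords)

print(keyword_inclusion(
    "The invoice total is due by the end of the month.",
    "Payment of the invoice total is due at month end.",
))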

The following metric category is also available only with the Python SDK:

Content validation metrics

Content validation metrics use string-based functions to analyze and validate generated LLM output text. The input must contain a list of generated text from your LLM to generate content validation metrics.

If the input does not contain transaction records, the metrics measure the ratio of successful content validations to the total number of validations. If the input contains transaction records, the metrics measure the same ratio and also calculate validation results for each specified record_id.

You can calculate the following content validation metrics:

Table 14. Content validation evaluation metric descriptions
Metric Description
Contains all Measures whether the rows in the prediction contain all of the specified keywords
Contains any Measures whether the rows in the prediction contain any of the specified keywords
Contains email Measures whether each row in the prediction contains emails
Contains JSON Measures whether the rows in the prediction contain JSON syntax
Contains link Measures whether the rows in the prediction contain any links
Contains none Measures whether the rows in the prediction do not contain any of the specified keywords
Contains string Measures whether each row in the prediction contains the specified string
Contains valid link Measures whether the rows in the prediction contain valid links
Ends with Measures whether the rows in the prediction end with the specified substring
Equals to Measures whether the rows in the prediction are equal to the specified substring
Fuzzy match Measures whether the prediction fuzzy matches the specified keyword
Is email Measures whether the rows in the prediction contain valid emails
Is JSON Measures whether the rows in the prediction contain valid JSON syntax
Length greater than Measures whether the length of each row in the prediction is greater than a specified value
Length less than Measures whether the length of each row in the prediction is less than a specified value
No invalid links Measures whether the rows in the prediction have no invalid links
Regex Measures whether the rows in the prediction match the specified regular expression
Starts with Measures whether the rows in the prediction start with the specified substring
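
The following sketch shows the string-based style of these validations and the ratio of successful validations over a list of generated outputs. The two validation functions mirror the ideas of the Contains any and Is JSON rows; they are not the Python SDK's implementations, and the example predictions and keywords are made up.

# Minimal sketch: string-based content validations and the ratio of successful
# validations over a list of generated outputs.
import json

def contains_any(text, keywords):
    return any(keyword.lower() in text.lower() for keyword in keywords)

def is_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

predictions = ['{"status": "ok"}', "The order was shipped today.", "not json"]

validations = [is_json(p) or contains_any(p, ["shipped", "delivered"]) for p in predictions]
success_ratio = sum(validations) / len(validations)
print(f"successful validations: {success_ratio:.2f}")  # 0.67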

Parent topic: Evaluating AI models