Evaluation metrics can help you continuously monitor the performance of your AI models to provide insights throughout the AI lifecycle. With watsonx.governance, you can use these metrics to help ensure compliance with regulatory requirements and identify how to make improvements to mitigate risks.
You can run evaluations in watsonx.governance to generate metrics with automated monitoring that provides actionable insights. You can use these metrics to help achieve the following AI governance goals:
- Ensure compliance: Automatically track adherence to evolving regulations and organizational policies with alerts triggered when thresholds are breached.
- Promote transparency: Generate detailed documentation to provide clear insights into model behavior, performance, and explainability of outcomes.
- Mitigate risks: Detect and address issues like bias or accuracy drift through continuous evaluation and proactive risk assessments.
- Protect privacy and security: Monitor for security vulnerabilities like personally identifiable information (PII) exposure and enforce guardrails to prevent misuse of sensitive data.
The metrics that you can use to provide insights about your model performance are determined by the type of evaluations that you enable. Each type of evaluation generates different metrics that you can analyze to gain insights.
You can also use the Python SDK to calculate metrics in a notebook runtime environment or offload the calculations as Spark jobs to IBM Analytics Engine. Some metrics might be available only with the Python SDK.
Drift evaluation metrics
Drift evaluation metrics can help you detect drops in accuracy and data consistency in your models to determine how well your model predicts outcomes over time. Watsonx.governance supports the following drift evaluation metrics for machine learning models:
Metric | Description |
---|---|
Drop in accuracy | Estimates the drop in accuracy of your model at run time when compared to the training data |
Drop in data consistency | Compares run time transactions with the patterns of transactions in the training data to identify inconsistency |
Drift v2 evaluation metrics
Drift v2 evaluation metrics can help you measure changes in your data over time to ensure consistent outcomes for your model. You can use these metrics to identify changes in your model output, the accuracy of your predictions, and the distribution of your input data. Watsonx.governance supports the following drift v2 metrics:
Metric | Description |
---|---|
Embedding drift | Detects the percentage of records that are outliers when compared to the baseline data |
Feature drift | Measures the change in value distribution for important features |
Input metadata drift | Measures the change in distribution of the LLM input text metadata |
Model quality drift | Compares the estimated runtime accuracy to the training accuracy to measure the drop in accuracy
Output drift | Measures the change in the model confidence distribution
Output metadata drift | Measures the change in distribution of the LLM output text metadata
Prediction drift | Measures the change in distribution of the LLM predicted classes
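As an illustration of what two of these checks measure, the following sketch approximates feature drift (change in a feature's value distribution, here with a population stability index) and model quality drift (estimated drop in accuracy) in plain Python. It is a simplified stand-in, not the watsonx.governance implementation, and the data values are hypothetical.

```python
# Illustrative sketch only -- not the watsonx.governance drift v2 implementation.
import numpy as np

def population_stability_index(baseline, runtime, bins=10):
    """Rough feature-drift score: compares binned value distributions."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    run_pct = np.histogram(runtime, bins=edges)[0] / len(runtime)
    # Floor the percentages to avoid division by zero and log(0).
    base_pct = np.clip(base_pct, 1e-6, None)
    run_pct = np.clip(run_pct, 1e-6, None)
    return float(np.sum((run_pct - base_pct) * np.log(run_pct / base_pct)))

def accuracy_drop(train_accuracy, runtime_labels, runtime_predictions):
    """Rough model-quality-drift score: training accuracy minus runtime accuracy."""
    runtime_accuracy = np.mean(np.array(runtime_labels) == np.array(runtime_predictions))
    return train_accuracy - float(runtime_accuracy)

baseline_age = np.random.normal(40, 10, 1_000)   # training-time feature values
runtime_age = np.random.normal(48, 12, 1_000)    # production feature values
print(population_stability_index(baseline_age, runtime_age))
print(accuracy_drop(0.91, [1, 0, 1, 1], [1, 0, 0, 1]))
```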
Fairness evaluation metrics
Fairness evaluation metrics can help you determine whether your model produces biased outcomes. You can use these metrics to identify when your model shows a tendency to provide favorable outcomes more often for one group over another. Watsonx.governance supports the following fairness evaluation metrics:
Metric | Description |
---|---|
Average absolute odds difference | Measures the average of the absolute differences in false positive rates and true positive rates between monitored groups and reference groups
Average odds difference | Measures the average difference in false positive rates and true positive rates between monitored groups and reference groups
Disparate impact | Compares the percentage of favorable outcomes for a monitored group to the percentage of favorable outcomes for a reference group |
Error rate difference | Compares the percentage of transactions that are incorrectly scored by your model for monitored groups and reference groups
False discovery rate difference | Compares the number of false positive transactions as a percentage of all transactions with a positive outcome for monitored groups and reference groups
False negative rate difference | Compares the percentage of positive transactions that are incorrectly scored as negative by your model for monitored groups and reference groups
False omission rate difference | Compares the number of false negative transactions as a percentage of all transactions with a negative outcome for monitored groups and reference groups
False positive rate difference | Compares the percentage of negative transactions that are incorrectly scored as positive by your model for monitored groups and reference groups
Impact score | Compares the rate that monitored groups are selected to receive favorable outcomes to the rate that reference groups are selected to receive favorable outcomes
Statistical parity difference | Compares the percentage of favorable outcomes for monitored groups to the percentage of favorable outcomes for reference groups
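The following sketch shows how two of these metrics relate to favorable-outcome rates. It is a simplified illustration in plain Python, not the watsonx.governance fairness monitor, and the group outcomes are hypothetical.

```python
# Illustrative sketch only -- not the watsonx.governance fairness monitor.
import numpy as np

def favorable_rate(outcomes):
    """Fraction of favorable outcomes (encoded here as 1)."""
    return float(np.mean(np.asarray(outcomes) == 1))

def disparate_impact(monitored_outcomes, reference_outcomes):
    """Ratio of favorable-outcome rates: monitored / reference (1.0 is parity)."""
    return favorable_rate(monitored_outcomes) / favorable_rate(reference_outcomes)

def statistical_parity_difference(monitored_outcomes, reference_outcomes):
    """Difference of favorable-outcome rates: monitored - reference (0.0 is parity)."""
    return favorable_rate(monitored_outcomes) - favorable_rate(reference_outcomes)

monitored = [1, 0, 0, 1, 0, 0, 0, 1]   # predictions for the monitored group
reference = [1, 1, 0, 1, 1, 0, 1, 1]   # predictions for the reference group
print(disparate_impact(monitored, reference))               # 0.5 -> potential bias
print(statistical_parity_difference(monitored, reference))  # -0.375
```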
Generative AI quality evaluation metrics
Generative AI quality evaluation metrics can help you measure how well your foundation model performs tasks. Watsonx.governance supports the following generative AI quality evaluation metrics:
Metric | Description |
---|---|
BLEU (Bilingual Evaluation Understudy) | Compares translated sentences from machine translations to sentences from reference translations to measure the similarity between reference texts and predictions |
Exact match | Compares model prediction strings to reference strings to measure how often the strings match
METEOR (Metric for Evaluation of Translation with Explicit ORdering) | Measures how well the text that is generated with machine translations matches the structure of the text from reference translations
Readability | Determines how difficult the model's output is to read by measuring characteristics such as sentence length and word complexity |
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) | Measures how well generated summaries or translations compare to reference outputs
SARI (system output against references and against the input sentence) | Compares the predicted sentence output against the reference sentence output to measure the quality of words that the model uses to generate sentences |
Sentence similarity | Captures semantic information from sentence embeddings to measure the similarity between texts |
Text quality | Evaluates the output of a model against SuperGLUE datasets by measuring the F1 score, precision, and recall against the model predictions and its ground truth data |
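For intuition about how string-overlap metrics such as exact match and text quality (F1 score, precision, and recall) are computed, the following hand-rolled sketch compares predictions to reference texts. It is illustrative only; the watsonx.governance implementation and its tokenization may differ.

```python
# Illustrative sketch only -- hand-rolled exact match and token-overlap F1.
from collections import Counter

def exact_match(predictions, references):
    """Fraction of predictions that match their reference string exactly."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(predictions)

def token_f1(prediction, reference):
    """Precision, recall, and F1 over shared tokens between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(exact_match(["Paris", "Berlin"], ["Paris", "Munich"]))   # 0.5
print(token_f1("the cat sat on the mat", "a cat sat on a mat"))
```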
Watsonx.governance also supports the following different categories of generative AI quality metrics:
Answer quality metrics
You can use answer quality metrics to evaluate the quality of model answers. Answer quality metrics are calculated with LLM-as-a-judge models. To calculate these metrics, you can create a scoring function that calls the judge models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following answer quality metrics:
Metric | Description |
---|---|
Answer relevance | Measures how relevant the answer in the model output is to the question in the model input |
Answer similarity | Measures how similar the answer or generated text is to the ground truth or reference answer to determine the quality of your model performance
Faithfulness | Measures how grounded the model output is in the model context and provides attributions from the context to show the most important sentences that contribute to the model output
Unsuccessful requests | Measures the ratio of questions that are answered unsuccessfully out of the total number of questions |
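The following sketch outlines the general shape of a scoring function that calls a judge model for a metric like faithfulness. The prompt, the generate callable, and the JSON parsing are assumptions for illustration; the exact function signature and judge model that the evaluation expects are shown in the referenced notebook.

```python
# Illustrative sketch only. The scoring-function contract expected by
# watsonx.governance is documented in the referenced notebook; the prompt,
# client, and parsing below are assumptions.
import json

JUDGE_PROMPT = """Rate from 0.0 to 1.0 how faithful the answer is to the context.
Context: {context}
Question: {question}
Answer: {answer}
Reply with JSON: {{"score": <float>}}"""

def faithfulness_judge(question, context, answer, generate):
    """Ask a judge LLM to score faithfulness; `generate` is any text-generation callable."""
    prompt = JUDGE_PROMPT.format(context=context, question=question, answer=answer)
    raw = generate(prompt)                     # e.g. a call to your LLM client
    try:
        return float(json.loads(raw)["score"])
    except (ValueError, KeyError):
        return None                            # treat unparseable judgments as missing

# Usage with a stand-in generator (replace with a real LLM client):
fake_llm = lambda prompt: '{"score": 0.8}'
print(faithfulness_judge("Who wrote Hamlet?", "Hamlet is a play by Shakespeare.",
                         "Shakespeare wrote Hamlet.", fake_llm))
```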
Content analysis metrics
You can use the following content analysis metrics to evaluate your model output against your model input or context:
Metric | Description |
---|---|
Abstractness | Measures the ratio of n-grams in the generated text output that do not appear in the source content of the foundation model |
Compression | Measures how much shorter the summary is when compared to the input text by calculating the ratio between the number of words in the original text and the number of words in the foundation model output |
Coverage | Measures the extent that the foundation model output is generated from the model input by calculating the percentage of output text that is also in the input |
Density | Measures how extractive the summary in the foundation model output is from the model input by calculating the average of extractive fragments that closely resemble verbatim extractions from the original text |
Repetitiveness | Measures the percentage of n-grams that repeat in the foundation model output by calculating the number of repeated n-grams and the total number of n-grams in the model output |
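The following sketch approximates three of the content analysis metrics above with simple word and n-gram counts. It is illustrative only; the production metrics use more refined tokenization and fragment matching.

```python
# Illustrative sketch only -- simplified versions of compression, coverage,
# and repetitiveness based on word and n-gram counts.
def _words(text):
    """Lowercase word tokens with basic punctuation stripped."""
    return [w.strip(".,;:!?").lower() for w in text.split()]

def compression(source_text, summary_text):
    """Ratio of source length to summary length in words (higher = shorter summary)."""
    return len(_words(source_text)) / len(_words(summary_text))

def coverage(source_text, output_text):
    """Fraction of output words that also appear in the source text."""
    source_words = set(_words(source_text))
    output_words = _words(output_text)
    return sum(w in source_words for w in output_words) / len(output_words)

def repetitiveness(output_text, n=3):
    """Fraction of n-grams in the output that occur more than once."""
    words = _words(output_text)
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return sum(ngrams.count(g) > 1 for g in ngrams) / len(ngrams)

source = "The quarterly report shows revenue grew by ten percent across all regions."
summary = "Revenue grew by ten percent."
print(compression(source, summary), coverage(source, summary), repetitiveness(summary))
```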
Data safety metrics
You can use the following data safety metrics to identify whether your model's input or output contains harmful or sensitive information:
Metric | Description |
---|---|
HAP | Measures whether the model input or output data contains toxic content such as hate, abuse, or profanity
PII | Measures if your model input or output data contains any personally identifiable information by using the Watson Natural Language Processing entity extraction model |
Multi-label/class metrics
You can use the following multi-label/class metrics to measure model performance for multi-label/multi-class predictions:
Metric | Description |
---|---|
Macro F1 score | Calculates F1 scores independently for each class and averages the scores |
Macro precision | Aggregates the precision scores calculated for each class separately to calculate the average |
Macro recall | The average of recall scores calculated separately for each class |
Micro F1 score | Aggregates all true positives, false positives, and false negatives across all classes to calculate F1 score |
Micro precision | Aggregates all true positives and false positives across all classes to calculate precision
Micro recall | Aggregates all true positives and false negatives across all classes to calculate recall |
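As a reference point, the macro and micro aggregation strategies described above correspond to the averaging options in scikit-learn, as the following sketch shows with hypothetical labels. This is an illustration, not the watsonx.governance implementation.

```python
# Illustrative sketch only -- scikit-learn equivalents of macro and micro aggregation.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["cat", "dog", "dog", "bird", "cat", "bird"]
y_pred = ["cat", "dog", "cat", "bird", "cat", "dog"]

# Macro: compute precision/recall/F1 per class, then average the per-class scores.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# Micro: pool true positives, false positives, and false negatives across classes first.
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0)

print(f"macro  P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}")
print(f"micro  P={micro_p:.2f} R={micro_r:.2f} F1={micro_f1:.2f}")
```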
Retrieval quality metrics
You can use retrieval quality metrics to measure how well the retrieval system ranks relevant contexts. Retrieval quality metrics are calculated with LLM-as-a-judge models. To calculate these metrics, you can create a scoring function that calls the judge models. For more information, see the Computing Answer Quality and Retrieval Quality Metrics using IBM watsonx.governance for RAG task notebook.
You can calculate the following retrieval quality metrics:
Metric | Description |
---|---|
Average precision | Evaluates whether all of the relevant contexts are ranked higher or not by calculating the mean of the precision scores of relevant contexts |
Context relevance | Measures how relevant the context that your model retrieves is to the question that is specified in the prompt
Hit rate | Measures whether there is at least one relevant context among the retrieved contexts
Normalized Discounted Cumulative Gain | Measures the ranking quality of the retrieved contexts |
PII | Measures if your model input or output data contains any personally identifiable information by using the Watson Natural Language Processing entity extraction model |
Reciprocal rank | The reciprocal rank of the first relevant context |
Retrieval precision | Measures the quantity of relevant contexts out of the total number of contexts that are retrieved
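The rank-based formulas behind several of these metrics can be sketched from binary relevance judgments for the retrieved contexts, as shown below. This is an illustration only; in the evaluation, relevance is derived with LLM-as-a-judge models as described above.

```python
# Illustrative sketch only -- rank-based retrieval quality formulas over
# binary relevance labels for retrieved contexts, best rank first.
import math

def hit_rate(relevance):
    """1.0 if at least one retrieved context is relevant, else 0.0."""
    return 1.0 if any(relevance) else 0.0

def reciprocal_rank(relevance):
    """1 / rank of the first relevant context (0.0 if none is relevant)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(relevance):
    """Mean of precision@k over the ranks k where a relevant context appears."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def ndcg(relevance):
    """Discounted cumulative gain normalized by the ideal (perfectly ordered) ranking."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

relevance = [0, 1, 1, 0, 1]   # relevance of 5 retrieved contexts
print(hit_rate(relevance), reciprocal_rank(relevance),
      average_precision(relevance), ndcg(relevance))
```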
Model health monitor evaluation metrics
Model health monitor evaluation metrics can help you understand your model behavior and performance by determining how efficiently your model deployment processes your transactions. Model health evaluation metrics are enabled by default for machine learning model evaluations in production and generative AI asset deployments. Watsonx.governance supports the following model health monitor evaluation metrics:
Metric | Description |
---|---|
Payload size | The total, average, minimum, maximum, and median payload size of the transaction records that your model deployment processes across scoring requests in kilobytes (KB) |
Records | The total, average, minimum, maximum, and median number of transaction records that are processed across scoring requests |
Scoring requests | The number of scoring requests that your model deployment receives |
Users | The number of users that send scoring requests to your model deployments |
Watsonx.governance also supports the following different categories of model health monitor evaluation metrics:
Token counts
The following token count metrics calculate the number of tokens that are processed across scoring requests for your model deployment:
Metric | Description |
---|---|
Input token count | Calculates the total, average, minimum, maximum, and median input token count across multiple scoring requests during evaluations |
Output token count | Calculates the total, average, minimum, maximum, and median output token count across scoring requests during evaluations |
Throughput and latency
Model health monitor evaluations calculate latency by tracking the time, in milliseconds (ms), that it takes to process scoring requests and transaction records. Throughput is calculated by tracking the number of scoring requests and transaction records that are processed per second.
The following metrics are calculated to measure throughput and latency during evaluations:
Metric | Description |
---|---|
API latency | Time taken (in ms) to process a scoring request by your model deployment
API throughput | Number of scoring requests processed by your model deployment per second |
Record latency | Time taken (in ms) to process a record by your model deployment |
Record throughput | Number of records processed by your model deployment per second |
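The following sketch shows how the latency and throughput summaries described above could be derived from per-request timing data. The timing values are hypothetical, and the aggregation is a simplified illustration of the model health monitor calculations.

```python
# Illustrative sketch only -- simplified latency and throughput aggregation.
import statistics

# Hypothetical timing data: (duration in ms, number of records) per scoring request.
requests = [(120.0, 10), (95.0, 8), (210.0, 25), (80.0, 5)]

latencies_ms = [duration for duration, _ in requests]
records = [count for _, count in requests]

api_latency = {
    "average": statistics.mean(latencies_ms),
    "min": min(latencies_ms),
    "max": max(latencies_ms),
    "median": statistics.median(latencies_ms),
}

# Throughput here is requests (or records) per second of total scoring time;
# a production monitor would typically measure over a wall-clock window.
total_seconds = sum(latencies_ms) / 1000.0
api_throughput = len(requests) / total_seconds
record_throughput = sum(records) / total_seconds

print(api_latency, round(api_throughput, 2), round(record_throughput, 2))
```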
Python SDK evaluation metrics
The Python SDK is a Python library that you can use to programmatically monitor, manage, and govern machine learning models and generative AI assets. You can use the Python SDK to automate calculations of evaluation metrics. The Python SDK also provides algorithms that you can use to help measure performance. For more information, see Metrics computation with the Python SDK.
The following metrics are currently available only with Python SDK version 3.0.39 or later:
Metric | Description |
---|---|
Adversarial robustness | Measures the robustness of your model and prompt template against adversarial attacks such as prompt injections and jailbreaks |
Keyword inclusion | Measures the similarity of nouns and pronouns between the foundation model output and the reference or ground truth |
Prompt leakage risk | Measures the risk of leaking the prompt template by calculating the similarity between the leaked prompt template and original prompt template |
Question robustness | Detects English-language spelling errors in the model input questions
The following metric category is also available only with the Python SDK:
Content validation metrics
Content validation metrics use string-based functions to analyze and validate generated LLM output text. The input must contain a list of generated text from your LLM to generate content validation metrics.
If the input does not contain transaction records, the metrics measure the ratio of successful content validations to the total number of validations. If the input contains transaction records, the metrics measure the same ratio and also calculate validation results for the specified record_id.
You can calculate the following content validation metrics:
Metric | Description |
---|---|
Contains all | Measures whether the rows in the prediction contain all of the specified keywords |
Contains any | Measures whether the rows in the prediction contain any of the specified keywords |
Contains email | Measures whether each row in the prediction contains emails |
Contains JSON | Measures whether the rows in the prediction contain JSON syntax
Contains link | Measures whether the rows in the prediction contain any links |
Contains none | Measures whether the rows in the prediction do not contain any of the specified keywords
Contains string | Measures whether each row in the prediction contains the specified string |
Contains valid link | Measures whether the rows in the prediction contain valid links |
Ends with | Measures whether the rows in the prediction end with the specified substring |
Equals to | Measures whether the rows in the prediction are equal to the specified substring |
Fuzzy match | Measures whether the rows in the prediction fuzzy match the specified keyword
Is email | Measures whether the rows in the prediction contain valid emails |
Is JSON | Measures whether the rows in the prediction contain valid JSON syntax
Length greater than | Measures whether the length of each row in the prediction is greater than a specified value
Length less than | Measures whether the length of each row in the prediction is less than a specified maximum value |
No invalid links | Measures whether the rows in the prediction have no invalid links |
Regex | Measures whether the rows in the prediction match the specified regular expression
Starts with | Measures whether the rows in the prediction start with the specified substring |
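To illustrate how string-based validations roll up into a ratio of successful validations, the following sketch applies a few checks that mirror metrics in the table above. The predictions, regular expressions, and keywords are hypothetical, and the real metrics are computed by the Python SDK rather than by this code.

```python
# Illustrative sketch only -- string-based checks reported as the ratio of
# prediction rows that pass each validation.
import re

predictions = [
    "Contact us at support@example.com for details.",
    "The answer is 42.",
    "See https://example.com/docs for the full guide.",
]

def validation_ratio(rows, check):
    """Fraction of rows for which the check function returns a truthy result."""
    return sum(bool(check(row)) for row in rows) / len(rows)

contains_email = lambda row: re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", row)
contains_link = lambda row: re.search(r"https?://\S+", row)
starts_with_the = lambda row: row.startswith("The")
contains_any_keywords = lambda row: any(k in row.lower() for k in ("answer", "guide"))

print(validation_ratio(predictions, contains_email))         # 1/3
print(validation_ratio(predictions, contains_link))          # 1/3
print(validation_ratio(predictions, starts_with_the))        # 1/3
print(validation_ratio(predictions, contains_any_keywords))  # 2/3
```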
Parent topic: Evaluating AI models