Evaluation metrics
Last updated: Feb 21, 2025

Evaluation metrics can help you continuously monitor the performance of your AI models to provide insights throughout the AI lifecycle. With watsonx.governance, you can use these metrics to help ensure compliance with regulatory requirements and identify how to make improvements to mitigate risks.

You can run evaluations in watsonx.governance to generate metrics through automated monitoring. These metrics provide actionable insights that can help you achieve the following AI governance goals:

  • Ensure compliance: Automatically track adherence to evolving regulations and organizational policies with alerts triggered when thresholds are breached.
  • Promote transparency: Generate detailed documentation to provide clear insights into model behavior, performance, and explainability of outcomes.
  • Mitigate risks: Detect and address issues like bias or accuracy drift through continuous evaluation and proactive risk assessments.
  • Protect privacy and security: Monitor for security vulnerabilities like personally identifiable information (PII) exposure and enforce guardrails to prevent misuse of sensitive data.

The metrics that you can use to provide insights about your model performance are determined by the type of evaluations that you enable. Each type of evaluation generates different metrics that you can analyze to gain insights.

You can also use the Python SDK to calculate metrics in a notebook runtime environment or to offload the calculations as Spark jobs against IBM Analytics Engine. Some metrics might be available only with the Python SDK.

Drift evaluation metrics

Drift evaluation metrics can help you detect drops in accuracy and data consistency in your models to determine how well your model predicts outcomes over time. watsonx.governance supports the following drift evaluation metrics for machine learning models:

Table 1. Drift evaluation metric descriptions
  • Drop in accuracy: Estimates the drop in accuracy of your model at run time when compared to the training data
  • Drop in data consistency: Compares runtime transactions with the patterns of transactions in the training data to identify inconsistency
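
The drop-in-accuracy concept can be illustrated with a few lines of Python. The following sketch simply compares training accuracy with runtime accuracy computed from labeled data by using scikit-learn. It illustrates the idea only; it is not the watsonx.governance drift algorithm, and it assumes that labeled runtime data is available.

```python
# Minimal sketch: comparing training accuracy with runtime accuracy.
# Illustration only; assumes labeled runtime data and scikit-learn.
from sklearn.metrics import accuracy_score

def accuracy_drop(y_train_true, y_train_pred, y_runtime_true, y_runtime_pred):
    """Return the drop in accuracy at run time relative to training."""
    train_acc = accuracy_score(y_train_true, y_train_pred)
    runtime_acc = accuracy_score(y_runtime_true, y_runtime_pred)
    return train_acc - runtime_acc  # a positive value indicates a drop

# Hypothetical toy labels
print(accuracy_drop([1, 0, 1, 1], [1, 0, 1, 1],   # training: 100% accurate
                    [1, 0, 1, 1], [1, 0, 0, 0]))  # runtime: 50% accurate -> 0.5
```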

Drift v2 evaluation metrics

Drift v2 evaluation metrics can help you measure changes in your data over time to ensure consistent outcomes for your model. You can use these metrics to identify changes in your model output, the accuracy of your predictions, and the distribution of your input data. watsonx.governance supports the following drift v2 evaluation metrics:

Table 2. Drift v2 evaluation metric descriptions
  • Feature drift: Measures the change in value distribution for important features
  • Model quality drift: Compares the estimated runtime accuracy to the training accuracy to measure the drop in accuracy
  • Output drift: Measures the change in the model confidence distribution
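
To make the feature drift idea concrete, the following sketch compares the binned value distribution of one feature in training data against its runtime distribution by using total variation distance. This is an illustration only; the statistical measures and binning that drift v2 evaluations use internally can differ.

```python
# Minimal sketch of the idea behind feature drift: compare the value
# distribution of a feature at training time with its runtime distribution.
import numpy as np

def feature_drift_tvd(train_values, runtime_values, bins=10):
    """Total variation distance between binned feature distributions (0 = identical, 1 = disjoint)."""
    edges = np.histogram_bin_edges(np.concatenate([train_values, runtime_values]), bins=bins)
    p, _ = np.histogram(train_values, bins=edges)
    q, _ = np.histogram(runtime_values, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()

# Hypothetical data: the runtime distribution is shifted and wider
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=1000)
runtime = rng.normal(loc=0.5, scale=1.2, size=1000)
print(round(feature_drift_tvd(train, runtime), 3))  # larger value = more drift
```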

Fairness evaluation metrics

Fairness evaluation metrics can help you determine whether your model produces biased outcomes. You can use these metrics to identify when your model shows a tendency to provide favorable outcomes more often for one group over another. watsonx.governance supports the following fairness evaluation metrics:

Table 3. Fairness evaluation metric descriptions
  • Average absolute odds difference: Compares the average of the absolute differences in false positive rates and true positive rates between monitored groups and reference groups
  • Average odds difference: Measures the difference in false positive rates and true positive rates between monitored groups and reference groups
  • Disparate impact: Compares the percentage of favorable outcomes for a monitored group to the percentage of favorable outcomes for a reference group
  • Error rate difference: Compares the percentage of transactions that are incorrectly scored by your model between monitored groups and reference groups
  • False discovery rate difference: Compares the amount of false positive transactions as a percentage of all transactions with a positive outcome between monitored groups and reference groups
  • False negative rate difference: Compares the percentage of positive transactions that are incorrectly scored as negative by your model between monitored groups and reference groups
  • False omission rate difference: Compares the number of false negative transactions as a percentage of all transactions with a negative outcome between monitored groups and reference groups
  • False positive rate difference: Compares the percentage of negative transactions that are incorrectly scored as positive by your model between monitored groups and reference groups
  • Impact score: Compares the rate at which monitored groups are selected to receive favorable outcomes to the rate at which reference groups are selected to receive favorable outcomes
  • Statistical parity difference: Compares the percentage of favorable outcomes for monitored groups to the percentage for reference groups
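
Several of these metrics are simple comparisons of favorable-outcome rates. The following sketch shows how disparate impact and statistical parity difference can be computed from hypothetical outcome lists for a monitored group and a reference group. It illustrates the formulas only; the watsonx.governance fairness evaluation works on scored transactions.

```python
# Minimal sketch of two fairness calculations based on favorable-outcome rates.
def favorable_rate(outcomes, favorable_label=1):
    """Fraction of transactions in a group that received the favorable outcome."""
    return sum(1 for o in outcomes if o == favorable_label) / len(outcomes)

def disparate_impact(monitored, reference):
    """Ratio of the monitored group's favorable rate to the reference group's (1.0 = parity)."""
    return favorable_rate(monitored) / favorable_rate(reference)

def statistical_parity_difference(monitored, reference):
    """Difference between the monitored and reference favorable rates (0.0 = parity)."""
    return favorable_rate(monitored) - favorable_rate(reference)

# Hypothetical outcomes: 1 = favorable, 0 = unfavorable
monitored = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]  # 30% favorable
reference = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # 60% favorable
print(disparate_impact(monitored, reference))               # 0.5
print(statistical_parity_difference(monitored, reference))  # -0.3
```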

Model health monitor evaluation metrics

Model health monitor evaluation metrics can help you understand your model behavior and performance by determining how efficiently your model deployment processes your transactions. Model health evaluation metrics are enabled by default for machine learning model evaluations in production. watsonx.governance supports the following model health monitor evaluation metrics:

Table 4. Model health monitor evaluation metric descriptions
  • Payload size: The total, average, minimum, maximum, and median payload size in kilobytes (KB) of the transaction records that your model deployment processes across scoring requests
  • Records: The total, average, minimum, maximum, and median number of transaction records that are processed across scoring requests
  • Scoring requests: The number of scoring requests that your model deployment receives
  • Users: The number of users that send scoring requests to your model deployments
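
As an illustration of the summary statistics in the table, the following sketch computes the total, average, minimum, maximum, and median payload size from a hypothetical list of per-request payload sizes in KB.

```python
# Minimal sketch of the payload size summary statistics.
# Illustration only; the payload sizes below are hypothetical.
import statistics

payload_kb = [12.4, 8.1, 15.0, 9.7, 22.3]  # payload size of each scoring request, in KB

summary = {
    "total": sum(payload_kb),
    "average": statistics.mean(payload_kb),
    "minimum": min(payload_kb),
    "maximum": max(payload_kb),
    "median": statistics.median(payload_kb),
}
print(summary)
```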

Throughput and latency

Model health monitor evaluations calculate latency by tracking the time, in milliseconds (ms), that it takes to process scoring requests and transaction records. Throughput is calculated by tracking the number of scoring requests and transaction records that are processed per second.

The following metrics are calculated to measure throughput and latency during evaluations:

Table 5. Model health monitor throughput and latency metric descriptions
  • API latency: Time taken (in ms) to process a scoring request by your model deployment
  • API throughput: Number of scoring requests processed by your model deployment per second
  • Record latency: Time taken (in ms) to process a record by your model deployment
  • Record throughput: Number of records processed by your model deployment per second
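
The following sketch illustrates how these four values can be derived from hypothetical request timings. It assumes that requests are processed sequentially, so total wall-clock time is approximated by the sum of request durations; the model health monitor's own bookkeeping can differ.

```python
# Minimal sketch of latency and throughput calculations from request timings.
# Illustration only; durations and record counts are hypothetical.
request_durations_ms = [120, 95, 140, 110]   # time to process each scoring request, in ms
records_per_request = [50, 40, 60, 45]       # transaction records scored in each request

total_seconds = sum(request_durations_ms) / 1000.0  # assumes sequential processing

api_latency_ms = sum(request_durations_ms) / len(request_durations_ms)
api_throughput = len(request_durations_ms) / total_seconds            # requests per second
record_latency_ms = sum(request_durations_ms) / sum(records_per_request)
record_throughput = sum(records_per_request) / total_seconds          # records per second

print(api_latency_ms, api_throughput, record_latency_ms, record_throughput)
```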

Python SDK evaluation metrics

The Python SDK is a Python library that you can use to programmatically monitor, manage, and govern machine learning models. You can use the Python SDK to automate calculations of evaluation metrics. The Python SDK also provides algorithms that you can use to help measure performance. For more information, see Metrics computation with the Python SDK.

The Smoothed empirical differential (SED) metric is available only with the Python SDK. It quantifies the differential in the probability of favorable and unfavorable outcomes between intersecting groups that are divided by features.
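
As a rough illustration of the idea, the following sketch computes a simplified, unofficial version of the metric: it applies Dirichlet smoothing to the favorable-outcome rate of each intersectional group and reports the largest absolute log-ratio between any two groups. The group names, counts, and concentration parameter are hypothetical, and the exact formula that the Python SDK implements can differ.

```python
# Hedged sketch of a smoothed differential between intersecting groups.
# Illustration only; not the Python SDK implementation of SED.
import math
from itertools import combinations

def smoothed_rate(favorable_count, total_count, concentration=1.0, n_outcomes=2):
    """Dirichlet-smoothed probability of the favorable outcome for one group."""
    return (favorable_count + concentration / n_outcomes) / (total_count + concentration)

def smoothed_differential(group_counts, concentration=1.0):
    """Largest absolute log-ratio of smoothed favorable rates across all group pairs.

    group_counts maps each intersectional group (for example, a tuple of
    feature values) to a (favorable_count, total_count) pair.
    """
    rates = {g: smoothed_rate(f, n, concentration) for g, (f, n) in group_counts.items()}
    return max(
        abs(math.log(rates[g1]) - math.log(rates[g2]))
        for g1, g2 in combinations(rates, 2)
    )

# Hypothetical counts per intersectional group: (favorable, total)
counts = {
    ("female", "18-25"): (12, 60),
    ("female", "26-40"): (30, 70),
    ("male", "18-25"): (25, 50),
    ("male", "26-40"): (40, 80),
}
print(round(smoothed_differential(counts), 3))  # larger value = larger differential
```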

Parent topic: Evaluating AI models