Evaluation metrics can help you continuously monitor the performance of your AI models to provide insights throughout the AI lifecycle. With watsonx.governance, you can use these metrics to help ensure compliance with regulatory requirements and identify how to make improvements to mitigate risks.
You can run evaluations in watsonx.governance to generate metrics through automated monitoring that provides actionable insights. You can use these metrics to help achieve the following AI governance goals:
- Ensure compliance: Automatically track adherence to evolving regulations and organizational policies with alerts triggered when thresholds are breached.
- Promote transparency: Generate detailed documentation to provide clear insights into model behavior, performance, and explainability of outcomes.
- Mitigate risks: Detect and address issues like bias or accuracy drift through continuous evaluation and proactive risk assessments.
- Protect privacy and security: Monitor for security vulnerabilities like personally identifiable information (PII) exposure and enforce guardrails to prevent misuse of sensitive data.
The metrics that you can use to provide insights about your model performance are determined by the type of evaluations that you enable. Each type of evaluation generates different metrics that you can analyze to gain insights.
You can also use the Python SDK to calculate metrics in a notebook runtime environment or to offload the calculations as Spark jobs against IBM Analytics Engine. Some metrics might be available only with the Python SDK.
Drift evaluation metrics
Drift evaluation metrics can help you detect drops in accuracy and data consistency in your models to determine how well your model predicts outcomes over time. Watsonx.governance supports the following drift evaluation metrics for machine learning models:
Metric | Description |
---|---|
Drop in accuracy | Estimates the drop in accuracy of your model at run time when compared to the training data |
Drop in data consistency | Compares run time transactions with the patterns of transactions in the training data to identify inconsistency |
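Conceptually, the data-consistency check compares runtime records against patterns learned from the training data. The following minimal sketch in plain Python illustrates that idea with simple value-range and category constraints; it is not the constraint-generation logic that watsonx.governance builds during evaluation configuration.

```python
# Illustrative sketch only: flags runtime records that break simple patterns
# learned from training data (value ranges for numeric features, observed
# categories for categorical features). Not the watsonx.governance implementation.
from typing import Any

def learn_constraints(training_rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Record the min/max of numeric features and the set of seen categories."""
    constraints: dict[str, Any] = {}
    for row in training_rows:
        for feature, value in row.items():
            if isinstance(value, (int, float)):
                lo, hi = constraints.get(feature, (value, value))
                constraints[feature] = (min(lo, value), max(hi, value))
            else:
                constraints.setdefault(feature, set()).add(value)
    return constraints

def drop_in_data_consistency(runtime_rows: list[dict[str, Any]],
                             constraints: dict[str, Any]) -> float:
    """Fraction of runtime records that violate at least one training constraint."""
    violations = 0
    for row in runtime_rows:
        for feature, value in row.items():
            rule = constraints.get(feature)
            if isinstance(rule, tuple) and not (rule[0] <= value <= rule[1]):
                violations += 1
                break
            if isinstance(rule, set) and value not in rule:
                violations += 1
                break
    return violations / len(runtime_rows) if runtime_rows else 0.0
```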
Drift v2 evaluation metrics
Drift v2 evaluation metrics can help you measure changes in your data over time to ensure consistent outcomes for your model. You can use these metrics to identify changes in your model output, the accuracy of your predictions, and the distribution of your input data. Watsonx.governance supports the following drift v2 metrics:
Metric | Description |
---|---|
Feature drift | Measures the change in value distribution for important features |
Model quality drift | Compares the estimated runtime accuracy to the training accuracy to measure the drop in accuracy
Output drift | Measures the change in the model confidence distribution |
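Conceptually, feature drift is a distance between the training and runtime value distributions of a feature. The sketch below uses the population stability index (PSI), one common way to quantify such a change; it illustrates the concept only and is not the drift v2 algorithm.

```python
# Illustrative sketch: quantify feature drift as the population stability
# index (PSI) between training and runtime distributions of a numeric feature.
import numpy as np

def feature_drift_psi(train_values, runtime_values, bins: int = 10) -> float:
    """PSI between the training and runtime value distributions of one feature."""
    # Bin edges come from the training data so both samples share one scale;
    # runtime values outside the training range fall outside the bins.
    edges = np.histogram_bin_edges(train_values, bins=bins)
    train_pct = np.histogram(train_values, bins=edges)[0] / len(train_values)
    runtime_pct = np.histogram(runtime_values, bins=edges)[0] / len(runtime_values)
    # Avoid division by zero and log(0) for empty bins.
    train_pct = np.clip(train_pct, 1e-6, None)
    runtime_pct = np.clip(runtime_pct, 1e-6, None)
    return float(np.sum((runtime_pct - train_pct) * np.log(runtime_pct / train_pct)))

# Example: a shifted runtime distribution produces a larger PSI.
rng = np.random.default_rng(0)
print(feature_drift_psi(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000)))
```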
Fairness evaluation metrics
Fairness evaluation metrics can help you determine whether your model produces biased outcomes. You can use these metrics to identify when your model shows a tendency to provide favorable outcomes more often for one group over another. Watsonx.governance supports the following fairness evaluation metrics:
Metric | Description |
---|---|
Average absolute odds difference | Compares the average of absolute difference in false positive rates and true positive rates between monitored groups and reference groups |
Average odds difference | Measures the difference in false positive and false negative rates between monitored and reference groups |
Disparate impact | Compares the percentage of favorable outcomes for a monitored group to the percentage of favorable outcomes for a reference group |
Error rate difference | Compares the percentage of transactions that are incorrectly scored by your model for monitored groups and reference groups
False discovery rate difference | Compares the number of false positive transactions as a percentage of all transactions with a positive outcome for monitored groups and reference groups
False negative rate difference | Compares the percentage of positive transactions that are incorrectly scored as negative by your model for monitored groups and reference groups
False omission rate difference | Compares the number of false negative transactions as a percentage of all transactions with a negative outcome for monitored groups and reference groups
False positive rate difference | Compares the percentage of negative transactions that are incorrectly scored as positive by your model for monitored groups and reference groups
Impact score | Compares the rate that monitored groups are selected to receive favorable outcomes to the rate that reference groups are selected to receive favorable outcomes
Statistical parity difference | Compares the percentage of favorable outcomes for monitored groups to reference groups
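Several of these metrics reduce to simple comparisons of favorable-outcome rates between a monitored group and a reference group. The following sketch shows the standard formulas for disparate impact and statistical parity difference; the favorable label and the example outcome lists are hypothetical placeholders that, in practice, come from your evaluation configuration.

```python
# Illustrative sketch: disparate impact and statistical parity difference
# computed from favorable-outcome rates of a monitored and a reference group.
def favorable_rate(outcomes: list[str], favorable: str = "approved") -> float:
    """Share of outcomes that match the favorable label."""
    return sum(o == favorable for o in outcomes) / len(outcomes)

def disparate_impact(monitored: list[str], reference: list[str]) -> float:
    """Ratio of favorable-outcome rates: monitored group / reference group."""
    return favorable_rate(monitored) / favorable_rate(reference)

def statistical_parity_difference(monitored: list[str], reference: list[str]) -> float:
    """Difference in favorable-outcome rates: monitored minus reference."""
    return favorable_rate(monitored) - favorable_rate(reference)

monitored = ["approved", "denied", "denied", "denied"]      # 25% favorable
reference = ["approved", "approved", "denied", "denied"]    # 50% favorable
print(disparate_impact(monitored, reference))               # 0.5
print(statistical_parity_difference(monitored, reference))  # -0.25
```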
Model health monitor evaluation metrics
Model health monitor evaluation metrics can help you understand your model behavior and performance by determining how efficiently your model deployment processes your transactions. Model health evaluation metrics are enabled by default for machine learning model evaluations in production. Watsonx.governance supports the following model health monitor evaluation metrics:
Metric | Description |
---|---|
Payload size | The total, average, minimum, maximum, and median payload size of the transaction records that your model deployment processes across scoring requests in kilobytes (KB) |
Records | The total, average, minimum, maximum, and median number of transaction records that are processed across scoring requests |
Scoring requests | The number of scoring requests that your model deployment receives |
Users | The number of users that send scoring requests to your model deployments |
Throughput and latency
Model health monitor evaluations calculate latency by tracking the time, in milliseconds (ms), that it takes to process scoring requests and transaction records. Throughput is calculated by tracking the number of scoring requests and transaction records that are processed per second.
The following metrics are calculated to measure throughput and latency during evaluations:
Metric | Description |
---|---|
API latency | Time taken (in ms) to process a scoring request by your model deployment
API throughput | Number of scoring requests processed by your model deployment per second |
Record latency | Time taken (in ms) to process a record by your model deployment |
Record throughput | Number of records processed by your model deployment per second |
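As an illustration, the latency and throughput figures above can be derived from a log of scoring requests, as in the sketch below. The log fields (timestamp, duration_ms, records) are hypothetical and are not a watsonx.governance API.

```python
# Illustrative sketch: derive latency and throughput statistics from a
# hypothetical log of scoring requests handled by a model deployment.
from statistics import mean, median

scoring_log = [
    {"timestamp": 0.0, "duration_ms": 42.0, "records": 10},
    {"timestamp": 0.4, "duration_ms": 55.0, "records": 25},
    {"timestamp": 1.1, "duration_ms": 38.0, "records": 5},
]

api_latency_ms = [r["duration_ms"] for r in scoring_log]                     # per request
record_latency_ms = [r["duration_ms"] / r["records"] for r in scoring_log]   # per record

window_seconds = scoring_log[-1]["timestamp"] - scoring_log[0]["timestamp"]
api_throughput = len(scoring_log) / window_seconds                           # requests per second
record_throughput = sum(r["records"] for r in scoring_log) / window_seconds  # records per second

print(f"API latency: mean={mean(api_latency_ms):.1f} ms, median={median(api_latency_ms):.1f} ms")
print(f"Record latency: mean={mean(record_latency_ms):.2f} ms")
print(f"Throughput: {api_throughput:.1f} requests/s, {record_throughput:.1f} records/s")
```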
Python SDK evaluation metrics
The Python SDK is a Python library that you can use to programmatically monitor, manage, and govern machine learning models. You can use the Python SDK to automate the calculation of evaluation metrics. The Python SDK also provides algorithms that you can use to help measure performance. For more information, see Metrics computation with the Python SDK.
The Smoothed empirical differential (SED) metric is available only with the Python SDK. It quantifies the differential in the probability of favorable and unfavorable outcomes between intersecting groups that are divided by features.
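As an illustration of the idea behind SED (not necessarily the SDK's exact computation), the following sketch applies Dirichlet smoothing to each intersectional subgroup's outcome counts and reports the largest pairwise differential in the log-probability of favorable and unfavorable outcomes. The subgroup names and counts are hypothetical.

```python
# Illustrative sketch of the idea behind SED (not the SDK's exact algorithm):
# smooth each intersectional subgroup's favorable-outcome probability with a
# Dirichlet prior, then report the largest pairwise differential in
# log-probability of favorable and unfavorable outcomes.
import math
from itertools import combinations

def smoothed_rates(groups: dict[str, tuple[int, int]], concentration: float = 1.0) -> dict[str, float]:
    """groups maps subgroup name -> (favorable count, total count)."""
    return {
        name: (favorable + concentration) / (total + 2 * concentration)
        for name, (favorable, total) in groups.items()
    }

def smoothed_empirical_differential(groups: dict[str, tuple[int, int]],
                                    concentration: float = 1.0) -> float:
    rates = smoothed_rates(groups, concentration)
    differentials = []
    for a, b in combinations(rates, 2):
        differentials.append(abs(math.log(rates[a]) - math.log(rates[b])))          # favorable
        differentials.append(abs(math.log(1 - rates[a]) - math.log(1 - rates[b])))  # unfavorable
    return max(differentials)

# Intersectional subgroups (for example, gender x age band) with (favorable, total) counts.
groups = {
    "f_under_40": (12, 60),
    "f_over_40": (25, 70),
    "m_under_40": (30, 65),
    "m_over_40": (28, 55),
}
print(smoothed_empirical_differential(groups))
```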
Parent topic: Evaluating AI models