You can track and measure outcomes from your AI assets to help ensure that they comply with business processes, no matter where your models are built or running.
You can use model evaluations as part of your AI governance strategies to ensure that models in deployment environments meet established compliance standards regardless of the tools and frameworks that are used to build and run the models. This
approach ensures that models are free from bias, can be easily explained and understood by business users, and are auditable in business transactions.
Required service: watsonx.ai Runtime

Training data format:
Relational: Tables in relational data sources
Tabular: Excel files (.xls or .xlsx), CSV files
Textual: In the supported relational tables or files

Connected data:
Cloud Object Storage (infrastructure)
Db2

Data size: Any
With watsonx.governance, you can evaluate generative AI assets and machine learning models to gain insights about model performance throughout the AI lifecycle.
You can run the following types of evaluations with watsonx.governance:
Quality
Evaluates how well your model predicts correct outcomes that match labeled test data.
Fairness
Evaluates whether your model produces biased outcomes that provide favorable results for one group over another.
Drift (supported models: machine learning)
Evaluates how your model changes in accuracy and data consistency by comparing recent transactions to your training data.
Drift v2
Evaluates changes in your model output, the accuracy of your predictions, and the distribution of your input data.
Model health
Evaluates how efficiently your model deployment processes your transactions.
Generative AI quality (supported models: generative AI)
Measures how well your foundation model performs tasks.
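If you configure evaluations programmatically, you can enable these evaluations with the ibm-watson-openscale Python SDK. The following is a minimal sketch, assuming that you already created a data mart and a subscription for your deployed model; the placeholder IDs, minimum sample size, and accuracy threshold are illustrative values only.

```python
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient
from ibm_watson_openscale.supporting_classes.enums import TargetTypes
from ibm_watson_openscale.base_classes.watson_open_scale_v2 import Target

# Authenticate and create the client (assumes an IBM Cloud API key).
authenticator = IAMAuthenticator(apikey="<your API key>")
wos_client = APIClient(authenticator=authenticator)

# Point the evaluation at an existing subscription for your deployment.
target = Target(
    target_type=TargetTypes.SUBSCRIPTION,
    target_id="<subscription id>",
)

# Enable a quality evaluation with an illustrative minimum feedback sample
# size and a lower accuracy threshold.
wos_client.monitor_instances.create(
    data_mart_id="<data mart id>",
    monitor_definition_id=wos_client.monitor_definitions.MONITORS.QUALITY.ID,
    target=target,
    parameters={"min_feedback_data_size": 50},
    thresholds=[{"metric_id": "accuracy", "type": "lower_limit", "value": 0.8}],
)
```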
When you enable evaluations, you can choose to run them continuously on the following default scheduled intervals:
Evaluation              Online subscription default schedule   Batch subscription default schedule
Quality                 1 hour                                 1 week
Fairness                1 hour                                 1 week
Drift                   3 hours                                1 week
Drift v2                1 day                                  NA
Model health            1 hour                                 NA
Generative AI quality   1 hour                                 NA
Model health evaluations are enabled by default when you provide payload data to evaluate generative AI assets and machine learning models.
Evaluating generative AI assets
You can evaluate generative AI assets to measure how well your model performs the following tasks:
Text classification
Categorize text into predefined classes or labels.
Text summarization
Summarize text accurately and concisely.
Content generation
Produce relevant and coherent text or other forms of content based on your input.
Question answering
Provide accurate and contextually relevant answers to your queries.
Entity extraction
Identify and categorize specific segments of information within text.
Retrieval-augmented generation
Retrieve and integrate external knowledge into your model outputs.
The type of evaluation that you can run is determined by the type of task that you want your model to perform. Generative AI evaluations calculate metrics that provide insights about your model's performance of these tasks. Fairness and quality
evaluations can measure performance only for text classification tasks. Drift v2 and generative AI quality evaluations can measure performance for any task type.
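To illustrate what a generative AI quality metric measures, the following sketch computes a ROUGE-L score for a summarization example with the open source rouge-score package. This is only a standalone illustration of an overlap-based quality metric, not the watsonx.governance implementation.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The report summarizes quarterly revenue growth across all regions."
candidate = "Quarterly revenue grew across all regions, according to the report."

# ROUGE-L measures the longest common subsequence overlap between the
# generated summary and the reference summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(reference, candidate)["rougeL"]
print(score.fmeasure)  # value between 0 and 1; higher indicates closer overlap
```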
You can evaluate prompt template assets to measure the performance of models that are built by IBM or evaluate detached prompt templates for models that are not created or hosted by IBM. You can run these evaluations in projects and deployment
spaces to gain insights about individual assets within your development environment.
If you want to evaluate and compare multiple assets simultaneously, you can run experiments with Evaluation Studio to help you identify the best-performing assets.
To run evaluations, you must manage data for model evaluations by providing test data that contains reference columns that include the input and expected model output for each asset. The type of test data that
you provide can determine the type of evaluation that you can run. You can provide feedback or payload data to enable evaluations for generative AI assets. To run quality evaluations, you must provide feedback data to measure performance for
text classification tasks. Fairness and drift v2 evaluations use payload data to measure your model performance. Generative AI quality evaluations use feedback data to measure performance for entity extraction tasks.
Generative AI quality evaluations can use payload and feedback data to calculate metrics for the following task types:
Text summarization
Content generation
Question answering
Retrieval-augmented generation
Payload data is required for retrieval-augmented generation tasks.
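As an illustration, feedback data for a question answering task might be a CSV file in which each row pairs the model input with the expected output. The column names below are hypothetical; the columns that you need depend on the prompt variables and output that you map when you configure the evaluation.

```python
import pandas as pd

# Hypothetical feedback records: input columns plus a reference output column.
feedback_data = pd.DataFrame(
    [
        {
            "context": "Our support desk is open 9 AM to 5 PM, Monday through Friday.",
            "question": "When can I reach support?",
            "generated_text": "Support is available 9 AM to 5 PM on weekdays.",
            "reference_output": "The support desk is open 9 AM to 5 PM, Monday through Friday.",
        }
    ]
)
feedback_data.to_csv("feedback_data.csv", index=False)
```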
Evaluating machine learning models
You can evaluate machine learning models to measure how well they predict outcomes. watsonx.governance supports evaluations for the following types of machine learning models:
Classification models
Predict categorical outcomes based on your input features
Binary classification: Predict one of two possible outcomes
Multiclass classification: Predict one of several outcomes
Regression models
Predict continuous numerical outcomes
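To illustrate the difference, a quality evaluation compares labeled outcomes with model predictions, and the metrics differ by model type: classification models are typically scored with metrics such as accuracy and F1, and regression models with error metrics such as mean squared error. The following sketch uses scikit-learn only to show the calculations; it is not the watsonx.governance evaluation itself.

```python
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Binary classification: known outcomes vs. model predictions.
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))  # 0.8
print(f1_score(y_true_cls, y_pred_cls))        # 0.8

# Regression: continuous outcomes vs. model predictions.
y_true_reg = [10.0, 12.5, 9.0]
y_pred_reg = [9.5, 13.0, 9.2]
print(mean_squared_error(y_true_reg, y_pred_reg))
```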
With watsonx.governance, you can evaluate machine learning models in deployment spaces. To run evaluations, you must prepare to evaluate models by providing model details about your training data and model output.
You must also manage data for model evaluations to determine the type of evaluation that you can run to generate metric insights. To run quality evaluations, you must provide feedback data that contains the same structure and prediction columns
from your training data with the known model outcome. To run fairness, drift, and drift v2 evaluations, you must provide payload data that matches the structure of the training data.
watsonx.governance logs these data types to calculate metrics for your evaluation results. You must send model transactions continuously to generate accurate results.
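A minimal sketch of payload logging with the ibm-watson-openscale Python SDK follows, assuming the wos_client from the earlier sketch and an existing payload logging data set; the scoring request and response shown are hypothetical.

```python
from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

# Hypothetical scoring request and response in the deployment's format.
request_data = {"fields": ["age", "income"], "values": [[42, 55000]]}
response_data = {"fields": ["prediction", "probability"], "values": [["approved", [0.82, 0.18]]]}

# Log the transaction so that fairness, drift, and drift v2 evaluations
# have payload data to score.
wos_client.data_sets.store_records(
    data_set_id="<payload logging data set id>",
    request_body=[PayloadRecord(request=request_data, response=response_data, response_time=120)],
)
```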
You can also create custom evaluations and metrics to generate a greater variety of insights about your model performance. For insights about how your model predicts outcomes, you can configure explainability.
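As a conceptual example, a custom metric can be any function that you compute from your logged transactions, such as the share of predictions that your model makes with high confidence. The sketch below only illustrates the idea; it does not use the watsonx.governance custom monitor API.

```python
def high_confidence_rate(transactions, threshold=0.9):
    """Return the fraction of transactions scored with high confidence."""
    if not transactions:
        return 0.0
    confident = [t for t in transactions if max(t["probability"]) >= threshold]
    return len(confident) / len(transactions)

scored = [
    {"prediction": "approved", "probability": [0.95, 0.05]},
    {"prediction": "denied", "probability": [0.40, 0.60]},
]
print(high_confidence_rate(scored))  # 0.5
```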