You can track and measure outcomes from your AI assets to help ensure that they comply with business processes, no matter where your models are built or running.
You can use model evaluations as part of your AI governance strategies to ensure that models in deployment environments meet established compliance standards regardless of the tools and frameworks that are used to build and run the models. This
approach ensures that models are free from bias, can be easily explained and understood by business users, and are auditable in business transactions.
Required service: watsonx.ai Runtime

Training data format:
Relational: Tables in relational data sources
Tabular: Excel files (.xls or .xlsx), CSV files
Textual: In the supported relational tables or files

Connected data:
Cloud Object Storage (infrastructure)
Db2

Data size: Any
With watsonx.governance, you can evaluate generative AI assets and machine learning models to gain insights about model performance throughout the AI lifecycle.
You can run the following types of evaluations with watsonx.governance:
Quality
Evaluates how well your model predicts correct outcomes that match labeled test data.
Fairness
Evaluates whether your model produces biased outcomes that provide favorable results for one group over another.
Drift (supported models: machine learning)
Evaluates how your model changes in accuracy and data consistency by comparing recent transactions to your training data.
Drift v2
Evaluates changes in your model output, the accuracy of your predictions, and the distribution of your input data.
Model health
Evaluates how efficiently your model deployment processes your transactions.
Generative AI quality (supported models: generative AI)
Measures how well your foundation model performs tasks.
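If you configure evaluations programmatically, you can enable these evaluations with the ibm-watson-openscale Python SDK. The following is a minimal sketch, assuming that you already created a data mart and a subscription for your deployed model; the placeholder IDs, minimum sample size, and accuracy threshold are illustrative values only.

```python
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient
from ibm_watson_openscale.supporting_classes.enums import TargetTypes
from ibm_watson_openscale.base_classes.watson_open_scale_v2 import Target

# Authenticate and create the client (assumes an IBM Cloud API key).
authenticator = IAMAuthenticator(apikey="<your API key>")
wos_client = APIClient(authenticator=authenticator)

# Point the evaluation at an existing subscription for your deployment.
target = Target(
    target_type=TargetTypes.SUBSCRIPTION,
    target_id="<subscription id>",
)

# Enable a quality evaluation with an illustrative minimum feedback sample
# size and a lower accuracy threshold.
wos_client.monitor_instances.create(
    data_mart_id="<data mart id>",
    monitor_definition_id=wos_client.monitor_definitions.MONITORS.QUALITY.ID,
    target=target,
    parameters={"min_feedback_data_size": 50},
    thresholds=[{"metric_id": "accuracy", "type": "lower_limit", "value": 0.8}],
)
```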
When you enable evaluations, you can choose to run them continuously on the following default scheduled intervals:
Evaluation              Online subscription default schedule   Batch subscription default schedule
Quality                 1 hour                                 1 week
Fairness                1 hour                                 1 week
Drift                   3 hours                                1 week
Drift v2                1 day                                  NA
Model health            1 hour                                 NA
Generative AI quality   1 hour                                 NA
Model health evaluations are enabled by default when you provide payload data to evaluate generative AI assets and machine learning models.
Evaluating generative AI assets
You can evaluate generative AI assets to measure how well your model performs the following tasks:
Text classification
Categorize text into predefined classes or labels.
Text summarization
Summarize text accurately and concisely.
Content generation
Produce relevant and coherent text or other forms of content based on your input.
Question answering
Provide accurate and contextually relevant answers to your queries.
Entity extraction
Identify and categorize specific segments of information within text.
Retrieval-augmented generation
Retrieve and integrate external knowledge into your model outputs.
The type of evaluation that you can run is determined by the type of task that you want your model to perform. Generative AI evaluations calculate metrics that provide insights about your model's performance of these tasks. Fairness and quality
evaluations can measure performance only for text classification tasks. Drift v2 and generative AI quality evaluations can measure performance for any task type.
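To illustrate what a generative AI quality metric measures, the following sketch computes a ROUGE-L score for a summarization example with the open source rouge-score package. This is only a standalone illustration of an overlap-based quality metric, not the watsonx.governance implementation.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The report summarizes quarterly revenue growth across all regions."
candidate = "Quarterly revenue grew across all regions, according to the report."

# ROUGE-L measures the longest common subsequence overlap between the
# generated summary and the reference summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(reference, candidate)["rougeL"]
print(score.fmeasure)  # value between 0 and 1; higher indicates closer overlap
```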
You can evaluate prompt template assets to measure the performance of models that are built by IBM or evaluate detached prompt templates for models that are not created or hosted by IBM. You can run these evaluations in projects and deployment
spaces to gain insights about individual assets within your development environment.
If you want to evaluate and compare multiple assets simultaneously, you can run experiments with Evaluation Studio to help you identify the best-performing assets.
To run evaluations, you must manage data for model evaluations by providing test data that contains reference columns that include the input and expected model output for each asset. The type of test data that
you provide can determine the type of evaluation that you can run. You can provide feedback or payload data to enable evaluations for generative AI assets. To run quality evaluations, you must provide feedback data to measure performance for
text classification tasks. Fairness and drift v2 evaluations use payload data to measure your model performance. Generative AI quality evaluations use feedback data to measure performance for entity extraction tasks.
Generative AI quality evaluations can use payload and feedback data to calculate metrics for the following task types:
Text summarization
Content generation
Question answering
Retrieval-augmented generation
Payload data is required for retrieval-augmented generation tasks.
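As an illustration, feedback data for a question answering task might be a CSV file in which each row pairs the model input with the expected output. The column names below are hypothetical; the columns that you need depend on the prompt variables and output that you map when you configure the evaluation.

```python
import pandas as pd

# Hypothetical feedback records: input columns plus a reference output column.
feedback_data = pd.DataFrame(
    [
        {
            "context": "Our support desk is open 9 AM to 5 PM, Monday through Friday.",
            "question": "When can I reach support?",
            "generated_text": "Support is available 9 AM to 5 PM on weekdays.",
            "reference_output": "The support desk is open 9 AM to 5 PM, Monday through Friday.",
        }
    ]
)
feedback_data.to_csv("feedback_data.csv", index=False)
```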
Evaluating machine learning models
You can evaluate machine learning models to measure how well they predict outcomes. watsonx.governance supports evaluations for the following types of machine learning models:
Classification models
Predict categorical outcomes based on your input features
Binary classification: Predict one of two possible outcomes
Multiclass classification: Predict one of several outcomes
Regression models
Predict continuous numerical outcomes
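To illustrate the difference, a quality evaluation compares labeled outcomes with model predictions, and the metrics differ by model type: classification models are typically scored with metrics such as accuracy and F1, and regression models with error metrics such as mean squared error. The following sketch uses scikit-learn only to show the calculations; it is not the watsonx.governance evaluation itself.

```python
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Binary classification: known outcomes vs. model predictions.
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))  # 0.8
print(f1_score(y_true_cls, y_pred_cls))        # 0.8

# Regression: continuous outcomes vs. model predictions.
y_true_reg = [10.0, 12.5, 9.0]
y_pred_reg = [9.5, 13.0, 9.2]
print(mean_squared_error(y_true_reg, y_pred_reg))
```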
With watsonx.governance, you can evaluate machine learning models in deployment spaces. To run evaluations, you must prepare to evaluate models by providing model details about your training data and model output.
You must also manage data for model evaluations to determine the type of evaluation that you can run to generate metric insights. To run quality evaluations, you must provide feedback data that contains the same structure and prediction columns
from your training data with the known model outcome. To run fairness, drift, and drift v2 evaluations, you must provide payload data that matches the structure of the training data.
watsonx.governance logs these data types to calculate metrics for your evaluation results. You must send model transactions continuously to generate accurate results.
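A minimal sketch of payload logging with the ibm-watson-openscale Python SDK follows, assuming the wos_client from the earlier sketch and an existing payload logging data set; the scoring request and response shown are hypothetical.

```python
from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

# Hypothetical scoring request and response in the deployment's format.
request_data = {"fields": ["age", "income"], "values": [[42, 55000]]}
response_data = {"fields": ["prediction", "probability"], "values": [["approved", [0.82, 0.18]]]}

# Log the transaction so that fairness, drift, and drift v2 evaluations
# have payload data to score.
wos_client.data_sets.store_records(
    data_set_id="<payload logging data set id>",
    request_body=[PayloadRecord(request=request_data, response=response_data, response_time=120)],
)
```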
You can also create custom evaluations and metrics to generate a greater variety of insights about your model performance. For insights about how your model predicts outcomes, you can configure explainability.
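As a conceptual example, a custom metric can be any function that you compute from your logged transactions, such as the share of predictions that your model makes with high confidence. The sketch below only illustrates the idea; it does not use the watsonx.governance custom monitor API.

```python
def high_confidence_rate(transactions, threshold=0.9):
    """Return the fraction of transactions scored with high confidence."""
    if not transactions:
        return 0.0
    confident = [t for t in transactions if max(t["probability"]) >= threshold]
    return len(confident) / len(transactions)

scored = [
    {"prediction": "approved", "probability": [0.95, 0.05]},
    {"prediction": "denied", "probability": [0.40, 0.60]},
]
print(high_confidence_rate(scored))  # 0.5
```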