Configuring generative AI quality evaluations

Last updated: Mar 11, 2025

You can configure generative AI quality evaluations to measure how well your foundation model performs tasks.

When you evaluate prompt templates, you can review a summary of generative AI quality evaluation results for the following task types:

  • Text summarization
  • Content generation
  • Entity extraction
  • Question answering
  • Retrieval Augmented Generation (RAG)

The summary displays scores and violations for metrics that are calculated with default settings.

To configure generative AI quality evaluations with your own settings, you can set a minimum sample size and threshold values for each metric, as shown in the following example:

Configure generative AI quality evaluations

The minimum sample size specifies the minimum number of model transaction records to evaluate. The threshold values create alerts when your metric scores violate them: a metric score must be higher than its lower threshold value to avoid a violation. Higher metric values indicate better scores.
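The threshold logic can be sketched in plain Python. The metric names, scores, and threshold values below are illustrative only; in a real evaluation they are computed from your model transaction records:

```python
# Minimal sketch of threshold-based violation checks.
# Metric names, scores, and thresholds are hypothetical examples.

def find_violations(scores, lower_thresholds, min_sample_size, sample_count):
    """Return the metrics whose score is not above the lower threshold.

    Metrics are only checked once the minimum sample size is reached,
    because small samples produce unreliable metric values.
    """
    if sample_count < min_sample_size:
        return []  # not enough transaction records to evaluate yet
    return [
        metric
        for metric, score in scores.items()
        if score <= lower_thresholds.get(metric, 0.0)
    ]

scores = {"rouge1": 0.42, "bleu": 0.18, "faithfulness": 0.91}
lower_thresholds = {"rouge1": 0.30, "bleu": 0.25, "faithfulness": 0.80}

violations = find_violations(scores, lower_thresholds,
                             min_sample_size=10, sample_count=25)
print(violations)  # bleu is the only score at or below its lower threshold
```

A score exactly equal to the lower threshold is treated as a violation here, matching the rule that scores must be higher than the threshold to pass.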

You can also configure settings to calculate metrics with LLM-as-a-judge models. LLM-as-a-judge models are LLMs that you can use to evaluate the performance of other models.

To calculate metrics with LLM-as-a-judge models, you must select Manage to add a generative AI evaluator when you configure your evaluation settings.
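Conceptually, LLM-as-a-judge evaluation sends each generated answer, together with grading instructions, to an evaluator model and parses a score from its reply. The sketch below uses a hypothetical `judge` callable as a stand-in for a real evaluator model call; the prompt wording and 1-to-5 scale are illustrative assumptions, not the product's actual prompts:

```python
# Conceptual sketch of LLM-as-a-judge scoring.
# `judge` stands in for a call to a real evaluator model;
# the grading prompt and 1-5 scale are illustrative assumptions.

def score_answer(question, answer, judge):
    """Ask an evaluator model to grade an answer and parse the score."""
    prompt = (
        "Rate the answer to the question on a scale of 1 (poor) to 5 (excellent). "
        "Reply with a single digit.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    reply = judge(prompt)
    digits = [ch for ch in reply if ch.isdigit()]
    if not digits:
        raise ValueError(f"could not parse a score from: {reply!r}")
    return int(digits[0])

# Stand-in judge for demonstration; a real deployment would call an LLM here.
def toy_judge(prompt):
    return "4"

print(score_answer("What is RAG?", "Retrieval Augmented Generation.", toy_judge))
```

In practice the evaluator model, its grading prompt, and the score scale are all configured for you when you add a generative AI evaluator in the evaluation settings.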

Add gen AI evaluator for LLM-as-a-judge model evaluations

You can select an evaluator to calculate answer quality and retrieval quality metrics.

Select gen AI evaluator for metric settings

You can also use a notebook to create an evaluator when you set up your prompt templates and review evaluation results for the RAG task in watsonx.governance.

Parent topic: Evaluating AI models