Generative AI quality evaluations

You can use watsonx.governance generative AI quality evaluations to measure how well your foundation model performs tasks.

When you evaluate prompt templates, you can review a summary of generative AI quality evaluation results for the following task types:

  • Text summarization
  • Content generation
  • Entity extraction
  • Question answering
  • Retrieval Augmented Generation (RAG)

The summary displays scores and violations for metrics that are calculated with default settings.

To configure generative AI quality evaluations with your own settings, you can set a minimum sample size and set threshold values for each metric as shown in the following example:

Configure generative AI quality evaluations

The minimum sample size indicates the minimum number of model transaction records that you want to evaluate. The threshold values create alerts when your metric scores violate your thresholds: for most metrics, higher values indicate better scores, and scores must be higher than the lower threshold values to avoid violations.
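
The following sketch illustrates how lower-threshold violations could be flagged once metric scores are available. It is illustrative only; the variable and function names are hypothetical and are not part of the watsonx.governance API.

```python
# Hypothetical sketch: flag metrics whose scores fall below configured lower thresholds.
# None of these names come from the watsonx.governance API.
MIN_SAMPLE_SIZE = 10                         # evaluate only when at least this many records exist
LOWER_THRESHOLDS = {"rouge1": 0.8, "exact_match": 0.8}

def find_violations(scores: dict, record_count: int) -> dict:
    """Return the metrics whose scores violate their lower thresholds."""
    if record_count < MIN_SAMPLE_SIZE:
        return {}                            # not enough transaction records to evaluate
    return {
        metric: score
        for metric, score in scores.items()
        if metric in LOWER_THRESHOLDS and score < LOWER_THRESHOLDS[metric]
    }

print(find_violations({"rouge1": 0.72, "exact_match": 0.9}, record_count=25))
# {'rouge1': 0.72}
```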

You must run a notebook to evaluate prompt templates before you can review evaluation results for the RAG task in watsonx.governance. For more information, see Watson OpenScale Python client samples.

Supported generative AI quality metrics

The following generative AI quality metrics are supported by watsonx.governance:

ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics measure how well generated summaries or translations compare to reference outputs. The generative AI quality evaluation calculates the rouge1, rouge2, and rougeLSum metrics.

  • Task types:

    • Text summarization
    • Content generation
    • Question answering
    • Entity extraction
    • Retrieval Augmented Generation (RAG)
  • Parameters:

    • Use stemmer: If true, uses the Porter stemmer to strip word suffixes. Defaults to false.
  • Thresholds:

    • Lower bound: 0.8
    • Upper bound: 1.0
  • How it works: Higher scores indicate higher similarity between the summary and the reference.
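
As a sketch, the same rouge1, rouge2, and rougeLsum scores can be reproduced with the open source Hugging Face evaluate library; this assumes the evaluate and rouge_score packages are installed and is not the watsonx.governance implementation.

```python
# Illustrative only: ROUGE with the Hugging Face evaluate library
# (pip install evaluate rouge_score).
import evaluate

rouge = evaluate.load("rouge")
predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print(scores)  # includes rouge1, rouge2, rougeL, and rougeLsum, each between 0 and 1
```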

SARI

SARI (system output against references and against the input sentence) compares the predicted output against both the input sentence and the reference outputs to measure the quality of the words that the model uses to generate sentences.

  • Task types:

    • Text summarization
  • Thresholds:

    • Lower bound: 0
    • Upper bound: 100
  • How it works: Higher scores indicate that higher-quality words are used to generate sentences.
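
As a sketch, SARI can be computed with the Hugging Face evaluate library, which needs the original input sentences in addition to the predictions and lists of references; this is an illustration, not the watsonx.governance implementation.

```python
# Illustrative only: SARI with the Hugging Face evaluate library.
import evaluate

sari = evaluate.load("sari")
sources = ["About 95 species are currently accepted."]
predictions = ["About 95 species are currently known."]
references = [["About 95 species are currently accepted.",
               "About 95 species are now accepted."]]

print(sari.compute(sources=sources, predictions=predictions, references=references))
# {'sari': ...} on a 0-100 scale
```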

METEOR

METEOR (Metric for Evaluation of Translation with Explicit ORdering) measures how well text that is generated by machine translation matches the structure of text from reference translations. It is calculated as the harmonic mean of precision and recall.

  • Task types:

    • Text summarization
    • Content generation
  • Parameters:

    • Alpha: Controls relative weights of precision and recall
    • Beta: Controls shape of penalty as a function of fragmentation.
    • Gamma: The relative weight assigned to fragmentation penalty.
  • Thresholds:

    • Lower bound: 0
    • Upper bound: 1
  • How it works: Higher scores indicate that machine translations match more closely with references.
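
A sketch with the Hugging Face evaluate implementation (which requires nltk) shows the alpha, beta, and gamma parameters in use; the values passed here are that library's defaults, which might differ from the defaults that watsonx.governance uses.

```python
# Illustrative only: METEOR with the Hugging Face evaluate library.
import evaluate

meteor = evaluate.load("meteor")
predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

scores = meteor.compute(predictions=predictions, references=references,
                        alpha=0.9, beta=3, gamma=0.5)
print(scores)  # {'meteor': ...} between 0 and 1
```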

Text quality

Text quality evaluates the output of a model against SuperGLUE datasets by measuring the F1 score, precision, and recall of the model predictions against its ground truth data. It is calculated by normalizing the input strings and identifying the number of shared tokens between the predictions and references.

  • Task types:

    • Text summarization
    • Content generation
  • Thresholds:

    • Lower bound: 0.8
    • Upper bound: 1
  • How it works: Higher scores indicate higher similarity between the predictions and references.
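
The following sketch shows a token-overlap calculation of precision, recall, and F1 in the spirit of the description above; the exact normalization that watsonx.governance applies may differ.

```python
# Illustrative only: token-overlap precision, recall, and F1.
from collections import Counter
import string

def normalize(text: str) -> list:
    """Lowercase, strip punctuation, and split into tokens."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

def token_f1(prediction: str, reference: str) -> dict:
    """Precision, recall, and F1 over tokens shared by the prediction and reference."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return {"precision": precision, "recall": recall,
            "f1": 2 * precision * recall / (precision + recall)}

print(token_f1("The cat sat on the mat.", "A cat sat on a mat."))
```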

BLEU

BLEU (Bilingual Evaluation Understudy) compares translated sentences from machine translations to sentences from reference translations to measure the similarity between reference texts and predictions.

  • Task types:

    • Text summarization
    • Content generation
    • Question answering
    • Retrieval Augmented Generation (RAG)
  • Parameters:

    • Max order: Maximum n-gram order to use when computing the BLEU score
    • Smooth: Whether to apply a smoothing function to remove noise from the data
  • Thresholds:

    • Lower bound: 0.8
    • Upper bound: 1
  • How it works: Higher scores indicate more similarity between reference texts and predictions.
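
A sketch with the Hugging Face evaluate library shows the max order and smooth parameters in use; it is an illustration, not the watsonx.governance implementation.

```python
# Illustrative only: BLEU with the Hugging Face evaluate library.
import evaluate

bleu = evaluate.load("bleu")
predictions = ["The cat sat on the mat."]
references = [["A cat was sitting on the mat."]]  # each prediction can have several references

scores = bleu.compute(predictions=predictions, references=references,
                      max_order=4, smooth=True)
print(scores["bleu"])  # between 0 and 1; higher means closer to the references
```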

Sentence similarity

Sentence similarity captures semantic information from sentence embeddings to measure the similarity between texts. It measures Jaccard similarity and Cosine similarity.

  • Task types: Text summarization

  • Thresholds:

    • Lower limit: 0.8
    • Upper limit: 1
  • How it works: Higher scores indicate that the texts are more similar.
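
The following sketch computes Jaccard similarity over token sets and cosine similarity over sentence embeddings. The embedding model shown (all-MiniLM-L6-v2 from the sentence-transformers library) is an assumption for illustration, not the model that watsonx.governance uses.

```python
# Illustrative only: Jaccard and cosine sentence similarity.
from sentence_transformers import SentenceTransformer, util

def jaccard_similarity(a: str, b: str) -> float:
    """Ratio of shared tokens to all distinct tokens across both texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

generated = "The report summarizes quarterly revenue growth."
reference = "Quarterly revenue growth is summarized in the report."

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode([generated, reference])
cosine = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Jaccard: {jaccard_similarity(generated, reference):.2f}, Cosine: {cosine:.2f}")
```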

PII

PII measures if your model input or output data contains any personally identifiable information by using the Watson Natural Language Processing Entity extraction model.

  • Task types:

    • Text summarization
    • Content generation
    • Question answering
    • Retrieval Augmented Generation (RAG)
  • Thresholds:

    • Upper limit: 0
  • How it works: Higher scores indicate that a higher percentage of personally identifiable information exists in the input or output data.
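
As a simplified illustration only, the following sketch flags records that match a couple of common PII patterns with regular expressions; watsonx.governance uses the Watson Natural Language Processing Entity extraction model, not this approach.

```python
# Illustrative only: a naive regex-based PII check, not the Watson NLP model.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def pii_rate(records: list) -> float:
    """Fraction of records in which at least one PII pattern is detected."""
    flagged = sum(
        1 for text in records
        if any(pattern.search(text) for pattern in PII_PATTERNS.values())
    )
    return flagged / len(records)

records = ["Contact me at jane.doe@example.com", "The weather is nice today"]
print(pii_rate(records))  # 0.5 -> any value above the 0 upper limit is a violation
```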

HAP

HAP measures if there is any toxic content that contains hate, abuse, or profanity in the model input or output data.

  • Task types:

    • Text summarization
    • Content generation
    • Question answering
    • Retrieval Augmented Generation (RAG)
  • Thresholds:

    • Upper limit: 0
  • How it works: Higher scores indicate that a higher percentage of toxic content exists in the model input or output.
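
As a rough stand-in for illustration, the following sketch scores records with a publicly available toxicity classifier from Hugging Face (unitary/toxic-bert); it is not the HAP detector that watsonx.governance uses.

```python
# Illustrative only: scoring records with a public toxicity classifier,
# not the watsonx.governance HAP detector.
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

records = ["Have a great day!", "You are a wonderful colleague."]
for text, result in zip(records, classifier(records)):
    print(text, "->", result["label"], round(result["score"], 3))
```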

Readability

Readability determines how difficult the model's output is to read by measuring characteristics such as sentence length and word complexity.

  • Task types:

    • Text summarization
    • Content generation
  • Thresholds:

    • Lower limit: 60
  • How it works: Higher scores indicate that the model's output is easier to read.
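
The lower limit of 60 is consistent with a Flesch reading-ease style score, so the following sketch uses the open source textstat package as a stand-in; the formula that watsonx.governance applies internally may differ.

```python
# Illustrative only: a Flesch reading-ease style readability check with textstat
# (pip install textstat).
import textstat

output_text = (
    "The model generated this summary. It uses short sentences. "
    "Short sentences and common words raise the reading-ease score."
)
score = textstat.flesch_reading_ease(output_text)
print(f"Reading ease: {score:.1f}")  # in this sketch, scores of 60 or above avoid a violation
```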

Exact match

Exact match compares model prediction strings to reference strings to measure how often the strings match.

  • Task types:

    • Question answering
    • Entity extraction
    • Retrieval Augmented Generation (RAG)
  • Parameters:

    • Regexes to ignore: Regular expressions that match characters to ignore when calculating exact matches.
    • Ignore case: If True, turns everything to lowercase so that capitalization differences are ignored.
    • Ignore punctuation: If True, removes punctuation before comparing strings.
    • Ignore numbers: If True, removes all digits before comparing strings.
  • Thresholds:

    • Lower limit: 0.8
    • Upper limit: 1
  • How it works: Higher scores indicate that model prediction strings match reference strings more often.
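
A sketch with the Hugging Face evaluate library shows how the normalization parameters affect the score; it is an illustration, not the watsonx.governance implementation.

```python
# Illustrative only: exact match with the Hugging Face evaluate library.
import evaluate

exact_match = evaluate.load("exact_match")
predictions = ["Paris!", "42 degrees"]
references = ["paris", "41 degrees"]

score = exact_match.compute(predictions=predictions, references=references,
                            ignore_case=True, ignore_punctuation=True,
                            ignore_numbers=False,
                            regexes_to_ignore=None)  # or a list of patterns to strip first
print(score)  # {'exact_match': 0.5} -> half of the predictions match their references
```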

Multi-label/class metrics

Multi-label/class metrics measure model performance for multi-label/multi-class predictions.

  • Metrics:
    • Micro F1 score
    • Macro F1 score
    • Micro precision
    • Macro precision
    • Micro recall
    • Macro recall
  • Task types: Entity extraction
  • Thresholds:
    • Lower limit: 0.8
    • Upper limit: 1
  • How it works: Higher scores indicate that predictions are more accurate.
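
The following sketch computes micro and macro averaged F1, precision, and recall with scikit-learn for a set of predicted entity labels; it illustrates the metrics only and is not the watsonx.governance implementation.

```python
# Illustrative only: micro and macro averaged F1, precision, and recall with scikit-learn.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["PERSON", "ORG", "LOC", "ORG", "PERSON"]  # ground truth entity labels
y_pred = ["PERSON", "ORG", "ORG", "ORG", "LOC"]     # model predictions

for average in ("micro", "macro"):
    print(average,
          "F1:", round(f1_score(y_true, y_pred, average=average), 2),
          "precision:", round(precision_score(y_true, y_pred, average=average, zero_division=0), 2),
          "recall:", round(recall_score(y_true, y_pred, average=average, zero_division=0), 2))
```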

Parent topic: Configuring model evaluations
