Evaluating prompt templates in deployment spaces

Last updated: Jan 13, 2025

You can evaluate prompt templates in deployment spaces with the watsonx.governance service to measure the performance of foundation model tasks and understand how your model generates responses.

Evaluations in deployment spaces measure how effectively your foundation models generate responses for the following task types:

Classification
Summarization
Generation
Question answering
Entity extraction
Retrieval-augmented generation

Prompt templates are saved prompt inputs for foundation models. You can evaluate prompt template deployments in pre-production and production spaces.

You can evaluate prompt templates to measure the performance of custom (BringYourOwnModel) or tuned foundation models.

Before you begin

Required permissions
You must have the following roles to evaluate prompt templates:
Admin or Editor role in a deployment space

In your project, you must also create and save a prompt template and promote it to a deployment space. You must specify at least one variable when you create a prompt template to enable evaluations.
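
The following sketch shows one way to check locally that a prompt template contains at least one variable before you promote it. It assumes that variables are written as `{name}` placeholders in the template text. The function names `template_variables` and `can_be_evaluated` are hypothetical and are not part of any watsonx.governance API.

```python
from string import Formatter


def template_variables(template: str) -> list[str]:
    """Return the distinct named placeholders in a template, in order of appearance."""
    names = []
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def can_be_evaluated(template: str) -> bool:
    """Assumed rule from the text above: at least one variable is required."""
    return len(template_variables(template)) > 0


if __name__ == "__main__":
    example = "Summarize the following text:\n{document}\nSummary:"
    print(template_variables(example))
    print(can_be_evaluated("Summarize the news."))
```

A template that has no placeholders fails this check, which mirrors the requirement that evaluations are enabled only for prompt templates with at least one variable.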

Evaluating prompt templates with custom or tuned models

You can evaluate prompt templates that use custom or tuned foundation model deployments in a deployment space. You can also manage and deploy these models when you move them between different spaces. For more information, see Deploying a prompt template programmatically.

The following sections describe how to evaluate prompt templates in deployment spaces and review your evaluation results:

Evaluating prompt templates in pre-production spaces

Run evaluation

To run prompt template evaluations, open a deployment and click Evaluate on the Evaluations tab to open the Evaluate prompt template wizard. You can run evaluations only if you are assigned the Admin or Editor role for your deployment space.

If your watsonx.governance instance does not have an associated database, you must associate one before you can run evaluations. To connect to a database, click Associate database in the Database required dialog box. You must be assigned the Admin role for both your deployment space and your watsonx.governance instance to associate databases.

Select dimensions

The Evaluate prompt template wizard displays the dimensions that are available to evaluate for the task type that is associated with your prompt. You can expand the dimensions to view the list of metrics that are used to evaluate the dimensions that you select.

Watsonx.governance automatically configures evaluations for each dimension with default settings. To configure evaluations with different settings, you can select Advanced settings to set sample sizes and select the metrics that you want to use to evaluate your prompt template.

You can also set threshold values for each metric that you select for your evaluations.

Select test data

To select test data, you can browse to upload a CSV file or you can select an asset from your deployment space. The test data that you select must contain reference columns and columns for each prompt variable.
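
Before you upload a CSV file, you can check locally that it contains the required columns. The following sketch assumes a single reference column. The function name `missing_columns` and the column names in the example are hypothetical.

```python
import csv
import io


def missing_columns(csv_text, prompt_variables, reference_column):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    required = list(prompt_variables) + [reference_column]
    return [name for name in required if name not in header]


if __name__ == "__main__":
    # Illustrative test data for a summarization prompt with one variable.
    sample = 'document,reference_summary\n"A long article","A short summary"\n'
    print(missing_columns(sample, ["document"], "reference_summary"))
```

An empty result means that every prompt variable and the reference column have a matching column in the test data.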

Map variables

You must map prompt variables to the associated columns from your test data.
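
Conceptually, the mapping tells the evaluation which column supplies the value for each prompt variable. The following sketch illustrates the idea with a plain dictionary. It assumes `{name}` placeholders in the template; the `render_prompt` function and the column names are hypothetical and do not represent how the wizard is implemented.

```python
def render_prompt(template, variable_to_column, row):
    """Fill each prompt variable with the value from its mapped test data column."""
    # A KeyError here means that a mapped column is missing from the row.
    values = {variable: row[column] for variable, column in variable_to_column.items()}
    return template.format(**values)


if __name__ == "__main__":
    template = "Classify the sentiment of this review: {review}"
    mapping = {"review": "review_text"}  # prompt variable -> test data column
    row = {"review_text": "Great battery life.", "label": "positive"}
    print(render_prompt(template, mapping, row))
```

In this example, the `label` column is not mapped to a variable because it holds the reference output rather than a prompt input.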

Review and evaluate

You can review the selections for the prompt task type, the uploaded test data, and the type of evaluation that runs. You must select Evaluate to run the evaluation.

Reviewing evaluation results

When your evaluation finishes, you can review a summary of your evaluation results on the Evaluations tab in watsonx.governance to gain insights about your model performance. The summary provides an overview of metric scores and violations of default score thresholds for your prompt template evaluations.
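
A threshold violation occurs when a metric score falls outside the limit that is configured for that metric. The following sketch shows the comparison under the assumption that each threshold is either a lower limit or an upper limit. The metric names, scores, and limits are illustrative only, and the `find_violations` function is hypothetical.

```python
def find_violations(scores, thresholds):
    """Return the metrics whose scores violate their configured thresholds.

    thresholds maps a metric name to a (kind, limit) pair, where kind is
    "lower" (the score must not fall below the limit) or "upper" (the score
    must not exceed the limit). Metrics without a threshold are skipped.
    """
    violations = []
    for metric, score in scores.items():
        rule = thresholds.get(metric)
        if rule is None:
            continue
        kind, limit = rule
        if (kind == "lower" and score < limit) or (kind == "upper" and score > limit):
            violations.append(metric)
    return violations


if __name__ == "__main__":
    # Illustrative values, not defaults from watsonx.governance.
    scores = {"quality_score": 0.62, "toxicity_rate": 0.01}
    thresholds = {"quality_score": ("lower", 0.8), "toxicity_rate": ("upper", 0.05)}
    print(find_violations(scores, thresholds))
```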

To analyze results, you can click the arrow next to your prompt template evaluation to view data visualizations of your results over time. You can also analyze results from the model health evaluation that is run by default during prompt template evaluations to understand how efficiently your model processes your data.

The Actions menu also provides the following options to help you analyze your results:

Evaluate now: Run an evaluation with a different test data set.
All evaluations: Display a history of your evaluations to understand how your results change over time.
Configure monitors: Configure evaluation thresholds and sample sizes.
View model information: View details about your model to understand how your deployment environment is set up.
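
If you record metric scores from several evaluations, you can compute the change between consecutive runs to see how results move over time. The following sketch assumes that each run is a dictionary of metric scores. The `metric_deltas` function and the metric name are hypothetical.

```python
def metric_deltas(history):
    """Return the per-metric change between each pair of consecutive runs."""
    deltas = []
    for previous, current in zip(history, history[1:]):
        # Compare only the metrics that are present in both runs.
        deltas.append(
            {name: round(current[name] - previous[name], 4)
             for name in current if name in previous}
        )
    return deltas


if __name__ == "__main__":
    # Illustrative scores from three evaluation runs, oldest first.
    history = [{"quality_score": 0.70}, {"quality_score": 0.80}, {"quality_score": 0.75}]
    print(metric_deltas(history))
```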

If you track your prompt templates, you can review evaluation results to gain insights about your model performance throughout the AI lifecycle.

Evaluating prompt templates in production spaces

Activate evaluation

To run prompt template evaluations, open a deployment and click Activate on the Evaluations tab to open the Evaluate prompt template wizard.

If your deployment space does not have an associated watsonx.governance instance, you must select Associate a service instance in the Associate a service instance dialog box before you can run evaluations. In the Associate instance for evaluation window, choose the watsonx.governance instance that you want to use and select Associate a service instance. You must be assigned the Admin role for your deployment space to associate instances.

Select dimensions

The Evaluate prompt template wizard displays the dimensions that are available to evaluate for the task type that is associated with your prompt. You can provide a label column name for the reference output that you specify in your feedback data. You can also expand the dimensions to view the list of metrics that are used to evaluate the dimensions that you select.

You can also set threshold values for each metric that you select for your evaluations.

Review and evaluate

You can review the selections for the prompt task type and the type of evaluation that runs. You can also select View payload schema or View feedback schema to validate that your column names match the prompt variable names in the prompt template. You must select Activate to run the evaluation.

To generate evaluation results, wait for the evaluation summary page to display, and then select Evaluate now in the Actions menu to open the Import test data window.

Import test data

In the Import test data window, you can select Upload payload data or Upload feedback data to upload a CSV file that contains labeled columns that match the columns in your payload and feedback schemas.
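
The following sketch shows one way to produce a feedback CSV file whose header matches a set of expected columns. It assumes that the feedback schema consists of the prompt variables plus the label column; confirm the actual columns with View feedback schema. The `build_feedback_csv` function and the column names are hypothetical.

```python
import csv
import io


def build_feedback_csv(records, prompt_variables, label_column):
    """Serialize records to CSV text with one column per variable plus the label."""
    fieldnames = list(prompt_variables) + [label_column]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only the expected columns, in schema order.
        writer.writerow({name: record[name] for name in fieldnames})
    return buffer.getvalue()


if __name__ == "__main__":
    records = [{"question": "What is the capital of France?", "answer": "Paris"}]
    print(build_feedback_csv(records, ["question"], "answer"))
```

Writing the header from the same list that defines the expected columns keeps the file and the schema aligned, and a record that lacks an expected column raises a `KeyError` instead of producing an incomplete file.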

When your upload completes successfully, you can select Evaluate now to run your evaluation.

Reviewing evaluation results

The Actions menu provides the following options to help you analyze your results:

Evaluate now: Run an evaluation with a different test data set.
Configure monitors: Configure evaluation thresholds and sample sizes.
View model information: View details about your model to understand how your deployment environment is set up.

If you track your prompt templates, you can review evaluation results to gain insights about your model performance throughout the AI lifecycle.