Quick start: Compare prompt performance
Last updated: Jan 16, 2025

Follow this tutorial to learn how to compare multiple prompts in Evaluation Studio. With Evaluation Studio, you can evaluate and compare your generative AI assets by using quantitative metrics and customizable criteria that fit your use cases. You can evaluate the performance of multiple assets simultaneously and view comparative analyses of the results to identify the best solutions.

Required services
  • watsonx.ai
  • watsonx.governance
  • watsonx.ai Runtime

Required roles
  • watsonx.governance service level access: Reader role
  • For your project: Admin or Editor role
  • Cloud Object Storage bucket used for your project: Writer role

Your basic workflow includes these tasks:

  1. Open a project that contains the prompt templates to evaluate. Projects are where you can collaborate with others to work with assets.
  2. Create an Evaluation Studio experiment.
  3. Review the results.

Read about Evaluation Studio

You can use Evaluation Studio to streamline your generative AI development by automating the process of evaluating multiple AI assets for various task types. Instead of individually reviewing each prompt template and manually comparing their performance, you can configure a single experiment to evaluate multiple prompt templates simultaneously, which can save time during development.

The following features are included in Evaluation Studio to help you evaluate and compare prompt templates to identify the best-performing assets for your needs:

  • Customizable experiment setup
  • Flexible results analysis

Read more about Evaluation Studio

Watch a video about Evaluation Studio

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method to learn the concepts and tasks in this documentation.


Try a tutorial with Evaluation Studio

In this tutorial, you will complete these tasks:

  • Task 1: Create the sample project
  • Task 2: Create the Evaluation Studio experiment
  • Task 3: Review the results in Evaluation Studio

Tips for completing this tutorial
Here are some tips for successfully completing this tutorial.

Use the video picture-in-picture

Tip: Start the video, then as you scroll through the tutorial, the video moves to picture-in-picture mode. Close the video table of contents for the best picture-in-picture experience. Picture-in-picture mode lets you follow the video as you complete the tasks in this tutorial. Click the timestamps for each task to follow along.

The following animated image shows how to use the video picture-in-picture and table of contents features:

How to use picture-in-picture and chapters

Get help in the community

If you need help with this tutorial, you can ask a question or find an answer in the watsonx Community discussion forum.

Set up your browser windows

For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Tip: If you encounter a guided tour while completing this tutorial in the user interface, click Maybe later.



Task 1: Create the sample project

To preview this task, watch the video beginning at 00:16.

The Resource hub includes a sample project that contains sample prompt templates that you can compare in the Evaluation Studio. Follow these steps to create the project based on a sample:

  1. From the home screen, click the Create a new project icon.

  2. Select Sample.

  3. Search for Getting started with watsonx.governance, select that sample project, and click Next.

  4. Choose an existing object storage service instance or create a new one.

  5. Click Create.

  6. Wait for the project import to complete, and then click View new project.

  7. Associate a watsonx.ai Runtime service with the project. For more information, see watsonx.ai Runtime.

    1. When the project opens, click the Manage tab, and select the Services and integrations page.

    2. On the IBM services tab, click Associate service.

    3. Select your watsonx.ai Runtime instance. If you don't have a watsonx.ai Runtime service instance provisioned yet, follow these steps:

      1. Click New service.

      2. Select watsonx.ai Runtime.

      3. Click Create.

      4. Select the new service instance from the list.

    4. Click Associate service.

    5. If necessary, click Cancel to return to the Services and integrations page.

  8. Click the Assets tab in the project to see the sample assets.

For more information or to watch a video, see Creating a project. For more information on associated services, see Adding associated services.

Check your progress

The following image shows the project Assets tab. You are now ready to create the experiment.

Sample project assets




Task 2: Create the Evaluation Studio experiment

To preview this task, watch the video beginning at 01:11.

To compare prompt performance, you need to create an Evaluation Studio experiment. Follow these steps to create the experiment:

  1. From the Assets tab, click New asset > Evaluate and compare prompts.

  2. On the Setup page, type Summarization Evaluation experiment for the name.

  3. Select a task type. In this case, you want to compare summarization prompt templates, so select Summarization.

  4. Click Next to continue to the Prompt templates page.

  5. Select the Insurance claim summarization, 2 Insurance claim summarization, and 3 Insurance claim summarization prompt templates.

    Notice that all three of these prompt templates include Input variables, which is a requirement for the Evaluation Studio.

  6. Click Next to continue to the Metrics page.

  7. Expand the Generative AI Quality and Model health sections to review the metrics that will be used in the evaluation.

  8. Click Next to continue to the Test data page.

  9. Select the test data:

    1. Click Select data from project.

    2. Select Project file > Insurance claim summarization test data.csv.

      The test data that you upload must contain an input column for each prompt variable and a reference output column. Reference output columns are used to calculate reference-based metrics such as ROUGE and BLEU. For a simple illustration of this structure, see the sketch after these steps.

    3. Click Select.

    4. For the Input column, select Insurance_Claim.

    5. For the Reference output column, select Summary.

  10. Click Next to continue to the Review and run page.

  11. Review the configuration, and click Run evaluation. Evaluations can take a few minutes to complete.
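
The file name and column names in step 9 come from this sample project; the following optional sketch only illustrates what reference-based evaluation means for data in that shape. Each row pairs an input column (Insurance_Claim) with a reference output column (Summary), and a metric such as ROUGE scores a generated summary against that reference. The generate_summary placeholder and the simplified unigram-overlap score below are assumptions for illustration; they are not how Evaluation Studio computes its metrics.

    import csv

    def rouge1_f1(candidate: str, reference: str) -> float:
        # Simplified unigram-overlap (ROUGE-1-style) F1 score, for illustration only.
        cand_tokens = candidate.lower().split()
        ref_tokens = reference.lower().split()
        if not cand_tokens or not ref_tokens:
            return 0.0
        ref_counts = {}
        for tok in ref_tokens:
            ref_counts[tok] = ref_counts.get(tok, 0) + 1
        overlap = 0
        for tok in cand_tokens:
            if ref_counts.get(tok, 0) > 0:
                overlap += 1
                ref_counts[tok] -= 1
        precision = overlap / len(cand_tokens)
        recall = overlap / len(ref_tokens)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def generate_summary(claim_text: str) -> str:
        # Placeholder for a call to your prompt template and foundation model.
        return claim_text[:200]

    # Each row of the test data provides the prompt input and the reference output.
    with open("Insurance claim summarization test data.csv", newline="") as f:
        for row in csv.DictReader(f):
            claim_text = row["Insurance_Claim"]   # input column selected in step 9
            reference_summary = row["Summary"]    # reference output column selected in step 9
            print(rouge1_f1(generate_summary(claim_text), reference_summary))

In Evaluation Studio, these reference-based scores are calculated for you when the experiment runs.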

Check your progress

The following image shows the results of the evaluation. Now you can review the results.

Prompt template evaluation test results




Task 3: Review the results in Evaluation Studio

To preview this task, watch the video beginning at 02:26.

Now, you are ready to evaluate and compare the AI assets. Follow these steps to review the results in Evaluation Studio:

  1. When your evaluation completes, view the Metric comparison visualizations.

    The charts compare the results for each prompt template that you selected. The visualizations display whether scores violate the thresholds for each metric.

  2. Click the Records list to select a different metric. For example, select Content Analysis to see how the chart updates based on the selected metric.

    Content Analysis Metrics Comparison

  3. Hover over a bar in the chart to see the details.

  4. Review the table below the visualization that shows the three prompt templates. Notice that each of the prompts uses a different foundation model.

  5. To make comparisons, click the Set as reference icon next to a prompt template.

    Setting the reference template highlights columns in the table to show whether other assets are performing better or worse than the asset that you select.

  6. Click the Custom ranking icon.

    Create custom ranking

    To analyze results, you can also create a custom ranking of metrics across different groups by specifying weight factors and a ranking formula to determine which prompt templates perform best. When you create a custom ranking, you select the metrics that are relevant for your ranking and assign a weight factor to each of them. A minimal sketch of this idea appears after these steps. Click Cancel.

  7. To run the evaluations again, click the Adjust settings icon. Use the Evaluation details pane to update the test data or reconfigure the metrics.

  8. To edit the experiment, click the Assets icon to add or remove assets in your evaluation and change your comparison.

  9. From the table, click the Overflow menu next to a prompt template, and choose View AI Factsheet. Factsheets capture details about the asset for each stage of the AI lifecycle to help you meet governance and compliance goals.

  10. Close the AI Factsheet page to return to the Evaluation Studio.

  11. From here, you can start tracking a prompt template in an AI use case. From the table, click the Overflow menu next to a prompt template, and choose Track in AI use case.
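
Step 6 describes custom rankings only at a high level. As a rough illustration of the idea, the following minimal sketch ranks the three prompt templates by a weighted sum of per-metric scores. The metric names, weights, and score values are made-up assumptions for the example; they are not results from the sample project or the formula that Evaluation Studio uses.

    # Illustrative weighted ranking of prompt templates (not Evaluation Studio's formula).
    # The metric scores and weight factors below are made-up example values.
    scores = {
        "Insurance claim summarization":   {"rouge1": 0.42, "bleu": 0.18},
        "2 Insurance claim summarization": {"rouge1": 0.47, "bleu": 0.21},
        "3 Insurance claim summarization": {"rouge1": 0.39, "bleu": 0.25},
    }
    weights = {"rouge1": 0.7, "bleu": 0.3}  # weight factor assigned to each selected metric

    def weighted_score(metric_values: dict) -> float:
        # Weighted sum of metric values; higher is better for both example metrics.
        return sum(weights[name] * metric_values[name] for name in weights)

    # Rank templates from best to worst by their weighted score.
    ranking = sorted(scores, key=lambda name: weighted_score(scores[name]), reverse=True)
    for rank, name in enumerate(ranking, start=1):
        print(f"{rank}. {name}: {weighted_score(scores[name]):.3f}")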

Check your progress

The following image shows the results of the evaluation.

Prompt template evaluation test results




Learn more

For more information, refer to the following topics:

Next steps

Try one of the other tutorials:

Additional resources

  • View more videos.

  • Find sample data sets, projects, models, prompts, and notebooks in the Resource hub to gain hands-on experience:

    Notebooks that you can add to your project to get started analyzing data and building models.

    Projects that you can import containing notebooks, data sets, prompts, and other assets.

    Data sets that you can add to your project to refine, analyze, and build models.

    Prompts that you can use in the Prompt Lab to prompt a foundation model.

    Foundation models that you can use in the Prompt Lab.

Parent topic: Quick start tutorials