You can evaluate detached prompt templates in projects to measure the performance of foundation models that are not created or hosted by IBM.
When you evaluate detached prompt templates in projects, you can evaluate how effectively your external model generates responses for the following task types:
- Text summarization
- Text classification
- Question answering
- Entity extraction
- Content generation
- Retrieval augmented generation
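Each of these task types corresponds to a task identifier that you can supply in the task_ids field of the prompt template payload. The mapping in the following Python sketch is illustrative only; apart from retrieval_augmented_generation, which appears in the example later in this topic, the identifier strings are assumptions that you should verify against the API reference.

# Illustrative mapping of evaluation task types to assumed task_ids values.
# Only retrieval_augmented_generation is confirmed by the example in this topic.
TASK_IDS = {
    "Text summarization": "summarization",
    "Text classification": "classification",
    "Question answering": "question_answering",
    "Entity extraction": "extraction",
    "Content generation": "generation",
    "Retrieval augmented generation": "retrieval_augmented_generation",
}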
Before you begin
Required permissions
You must have one of the following roles in the project to evaluate prompt templates:
- Admin
- Editor
Before you evaluate detached prompt templates in your project, you must create a detached prompt template that connects your external model to watsonx.governance. When you create the detached prompt template, you must specify prompt variables and provide connection details, such as the name of your external model and its URL. The following example shows how to create a detached prompt template with the API:
{
  "name": "prompt name",
  "description": "prompt description",
  "model_version": {
    "number": "2.0.0-rc.7",
    "tag": "my prompt tag",
    "description": "my description"
  },
  "prompt_variables": {
    "var1": {},
    "var2": {}
  },
  "task_ids": [
    "retrieval_augmented_generation"
  ],
  "input_mode": "detached",
  "prompt": {
    "model_id": "",
    "input": [
      [
        "Some input",
        ""
      ]
    ],
    "data": {},
    "external_information": {
      "external_prompt_id": "external prompt",
      "external_model_id": "external model",
      "external_model_provider": "external provider",
      "external_prompt": {
        "url": "https://asdfasdf.com?asd=a&32=1",
        "additional_information": [
          {
            "additional_key": "additional settings"
          }
        ]
      },
      "external_model": {
        "name": "An external model",
        "url": "https://asdfasdf.com?asd=a&32=1"
      }
    }
  }
}
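You can send this payload to the prompt template REST API with any HTTP client. The following Python sketch shows one possible approach with the requests library; the endpoint path, project ID, bearer token, and payload file name are placeholders and assumptions, not confirmed values, so substitute the URL and credentials from the API reference for your environment.

# Minimal sketch: POST the detached prompt template payload to the service.
# The endpoint URL, project ID, token, and file name below are assumed placeholders.
import json
import requests

API_URL = "https://<watsonx-host>/v1/prompts"   # assumption: confirm the path in the API reference
PROJECT_ID = "<your-project-id>"
TOKEN = "<your-bearer-token>"

with open("detached_prompt_template.json", "r", encoding="utf-8") as f:
    payload = json.load(f)   # the JSON document shown above

response = requests.post(
    API_URL,
    params={"project_id": PROJECT_ID},
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())   # the created prompt template, including its ID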
Running evaluations
To run detached prompt template evaluations in your project, open a saved detached prompt template on the Assets tab, and then select Evaluate on the Evaluations tab in watsonx.governance to open the Evaluate prompt template wizard. You can run evaluations only if you are assigned the Admin or Editor role for your project.
Select dimensions
The Evaluate prompt template wizard displays the dimensions that are available to evaluate for the task type that is associated with your prompt. You can expand each dimension to view the metrics that are used to evaluate it.
By default, watsonx.governance configures evaluations for each dimension with standard settings. To configure evaluations with different settings, select Advanced settings to set minimum sample sizes and threshold values for each metric.
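The following Python sketch illustrates the kind of per-metric settings that Advanced settings exposes. The field names, metric keys, and values are assumptions for illustration only; configure the actual values in the Evaluate prompt template wizard.

# Illustrative sketch of per-dimension evaluation settings.
# Field names and metric keys are assumptions; set real values in the wizard.
generative_quality_settings = {
    "min_sample_size": 10,   # smallest number of records before metrics are computed
    "metrics": {
        "rouge_score": {"lower_limit": 0.8},       # flag a violation when the score drops below 0.8
        "answer_relevance": {"lower_limit": 0.7},
    },
}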
Select test data
You must upload a CSV file that contains test data with columns for your prompt inputs and reference columns with the expected model output. To enable detached deployment evaluations, the test data must also contain the generated model output. When the upload completes, you must map each prompt variable to the associated column in your test data.
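The following Python sketch shows one way to assemble such a CSV, assuming the prompt variables var1 and var2 from the earlier example. The column names and sample rows are illustrative; you map your own columns to prompt variables during the upload step.

# Sketch of test data for a detached RAG prompt with variables var1 and var2.
# Column names are examples; map them to your prompt variables in the wizard.
import pandas as pd

test_data = pd.DataFrame(
    {
        "var1": ["What is the refund policy?"],            # prompt variable, for example the question
        "var2": ["Refunds are accepted within 30 days."],  # prompt variable, for example the context
        "reference_output": ["Refunds are accepted within 30 days of purchase."],
        "generated_text": ["You can request a refund within 30 days."],  # output from your external model
    }
)

test_data.to_csv("detached_prompt_test_data.csv", index=False)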
Review and evaluate
You can review the selections for the prompt task type, the uploaded test data, and the type of evaluation that runs. You must select Evaluate to run the evaluation.
Reviewing evaluation results
When your evaluation finishes, you can review a summary of your evaluation results on the Evaluations tab in watsonx.governance to gain insights about your model performance. The summary provides an overview of metric scores and violations of default score thresholds for your prompt template evaluations.
If you are assigned the Viewer role for your project, you can select Evaluate from the asset list on the Assets tab to view evaluation results.
To analyze results, you can click the arrow next to your prompt template evaluation to view data visualizations of your results over time. You can also analyze results from the model health evaluation that is run by default during prompt template evaluations to understand how efficiently your model processes your data.
The Actions menu also provides the following options to help you analyze your results:
- Evaluate now: Run an evaluation with a different test data set.
- All evaluations: Display a history of your evaluations to understand how your results change over time.
- Configure monitors: Configure evaluation thresholds and sample sizes.
- View model information: View details about your model to understand how your deployment environment is set up.
Next steps
You can promote your detached prompt templates to deployment spaces and evaluate them there to gain insights about your model performance throughout the AI lifecycle.
Learn more
If you are tracking the detached deployment in an AI use case, details about the model and evaluation results are recorded in a factsheet.