You can evaluate detached prompt templates in projects to measure the performance of foundation models that are not created or hosted by IBM.
When you evaluate detached prompt templates in projects, you can evaluate how effectively your external model generates responses for the following task types:
- Text summarization
- Text classification
- Question answering
- Entity extraction
- Content generation
- Retrieval augmented generation
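Each of these task types corresponds to a task identifier that you can supply in the task_ids field of the prompt template payload. The mapping in the following Python sketch is illustrative only; apart from retrieval_augmented_generation, which appears in the example later in this topic, the identifier strings are assumptions that you should verify against the API reference.

# Illustrative mapping of evaluation task types to assumed task_ids values.
# Only retrieval_augmented_generation is confirmed by the example in this topic.
TASK_IDS = {
    "Text summarization": "summarization",
    "Text classification": "classification",
    "Question answering": "question_answering",
    "Entity extraction": "extraction",
    "Content generation": "generation",
    "Retrieval augmented generation": "retrieval_augmented_generation",
}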
Before you begin
Required permissions
You must have one of the following roles in the project to evaluate prompt templates:
- Admin
- Editor
Before you evaluate detached prompt templates in your project, you must create a detached prompt template that connects your external model to watsonx.governance. When you create the detached prompt template, you must specify prompt variables and provide connection details, such as the name of your external model and its URL. The following example shows how to create a detached prompt template with the API:
{
  "name": "prompt name",
  "description": "prompt description",
  "model_version": {
    "number": "2.0.0-rc.7",
    "tag": "my prompt tag",
    "description": "my description"
  },
  "prompt_variables": {
    "var1": {},
    "var2": {}
  },
  "task_ids": [
    "retrieval_augmented_generation"
  ],
  "input_mode": "detached",
  "prompt": {
    "model_id": "",
    "input": [
      [
        "Some input",
        ""
      ]
    ],
    "data": {},
    "external_information": {
      "external_prompt_id": "external prompt",
      "external_model_id": "external model",
      "external_model_provider": "external provider",
      "external_prompt": {
        "url": "https://asdfasdf.com?asd=a&32=1",
        "additional_information": [
          {
            "additional_key": "additional settings"
          }
        ]
      },
      "external_model": {
        "name": "An external model",
        "url": "https://asdfasdf.com?asd=a&32=1"
      }
    }
  }
}
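You can send this payload to the prompt template REST API with any HTTP client. The following Python sketch shows one possible approach with the requests library; the endpoint path, project ID, bearer token, and payload file name are placeholders and assumptions, not confirmed values, so substitute the URL and credentials from the API reference for your environment.

# Minimal sketch: POST the detached prompt template payload to the service.
# The endpoint URL, project ID, token, and file name below are assumed placeholders.
import json
import requests

API_URL = "https://<watsonx-host>/v1/prompts"   # assumption: confirm the path in the API reference
PROJECT_ID = "<your-project-id>"
TOKEN = "<your-bearer-token>"

with open("detached_prompt_template.json", "r", encoding="utf-8") as f:
    payload = json.load(f)   # the JSON document shown above

response = requests.post(
    API_URL,
    params={"project_id": PROJECT_ID},
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())   # the created prompt template, including its ID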
Running evaluations
To run detached prompt template evaluations in your project, open a saved detached prompt template on the Assets tab, and then select Evaluate on the Evaluations tab in watsonx.governance to open the Evaluate prompt template wizard. You can run evaluations only if you are assigned the Admin or Editor role for your project.
Select dimensions
The Evaluate prompt template wizard displays the dimensions that are available to evaluate for the task type that is associated with your prompt. You can expand each dimension to view the metrics that are used to evaluate it.
By default, watsonx.governance configures evaluations for each dimension with standard settings. To configure evaluations with different settings, select Advanced settings to set minimum sample sizes and threshold values for each metric.
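The following Python sketch illustrates the kind of per-metric settings that Advanced settings exposes. The field names, metric keys, and values are assumptions for illustration only; configure the actual values in the Evaluate prompt template wizard.

# Illustrative sketch of per-dimension evaluation settings.
# Field names and metric keys are assumptions; set real values in the wizard.
generative_quality_settings = {
    "min_sample_size": 10,   # smallest number of records before metrics are computed
    "metrics": {
        "rouge_score": {"lower_limit": 0.8},       # flag a violation when the score drops below 0.8
        "answer_relevance": {"lower_limit": 0.7},
    },
}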
Select test data
You must upload a CSV file that contains test data with columns for your prompt inputs and reference columns with the expected model output. To enable detached deployment evaluations, the test data must also contain the generated model output. When the upload completes, you must map each prompt variable to the associated column in your test data.
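The following Python sketch shows one way to assemble such a CSV, assuming the prompt variables var1 and var2 from the earlier example. The column names and sample rows are illustrative; you map your own columns to prompt variables during the upload step.

# Sketch of test data for a detached RAG prompt with variables var1 and var2.
# Column names are examples; map them to your prompt variables in the wizard.
import pandas as pd

test_data = pd.DataFrame(
    {
        "var1": ["What is the refund policy?"],            # prompt variable, for example the question
        "var2": ["Refunds are accepted within 30 days."],  # prompt variable, for example the context
        "reference_output": ["Refunds are accepted within 30 days of purchase."],
        "generated_text": ["You can request a refund within 30 days."],  # output from your external model
    }
)

test_data.to_csv("detached_prompt_test_data.csv", index=False)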
Review and evaluate
You can review the selections for the prompt task type, the uploaded test data, and the type of evaluation that runs. You must select Evaluate to run the evaluation.
Reviewing evaluation results
When your evaluation finishes, you can review a summary of your evaluation results on the Evaluations tab in watsonx.governance to gain insights about your model performance. The summary provides an overview of metric scores and violations of default score thresholds for your prompt template evaluations.
If you are assigned the Viewer role for your project, you can select Evaluate from the asset list on the Assets tab to view evaluation results.
To analyze results, you can click the arrow next to your prompt template evaluation to view data visualizations of your results over time. You can also analyze results from the model health evaluation that is run by default during prompt template evaluations to understand how efficiently your model processes your data.
The Actions menu also provides the following options to help you analyze your results:
- Evaluate now: Run an evaluation with a different test data set.
- All evaluations: Display a history of your evaluations to understand how your results change over time.
- Configure monitors: Configure evaluation thresholds and sample sizes.
- View model information: View details about your model to understand how your deployment environment is set up.
Next steps
You can promote your detached prompt templates to deployment spaces and evaluate them there to gain insights about your model performance throughout the AI lifecycle.
Learn more
If you are tracking the detached deployment in an AI use case, details about the model and evaluation results are recorded in a factsheet.