Agentic AI evaluation
Last updated: Mar 05, 2025

The agentic AI evaluation module is part of the ibm-watsonx-gov Python SDK. It computes quantitative metrics that measure the performance of agentic AI tools, helping you automate and accelerate evaluation tasks, streamline your workflows, and manage regulatory compliance risks for your use case.

The agentic AI evaluation module uses the following evaluators to measure performance for agentic RAG use cases:

  • evaluate_context_relevance: Computes the context relevance metric for your content retrieval tool
  • evaluate_faithfulness: Computes the faithfulness metric for your answer generation tool. This metric does not require ground truth
  • evaluate_answer_similarity: Computes the answer similarity metric for your answer generation tool. This metric requires ground truth for computation

To use the agentic AI evaluation module, you must install the ibm-watsonx-gov Python SDK with the agentic extras:

pip install "ibm-watsonx-gov[agentic]"
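
After installation, you can optionally confirm that the agentic components are available by importing the classes that are used in the examples on this page. This is only a quick sanity check; the import paths are the ones shown in the examples below:

# Optional sanity check that the agentic extras are installed
from ibm_watsonx_gov.entities.state import EvaluationState
from ibm_watsonx_gov.evaluate.agentic_evaluation import AgenticEvaluation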

Examples

You can evaluate agentic AI tools with the agentic AI evaluation module as shown in the following examples:

Set up the state

The ibm-watsonx-gov Python SDK provides a Pydantic-based state class, EvaluationState, that you can extend:

from ibm_watsonx_gov.entities.state import EvaluationState

class AppState(EvaluationState):
    pass
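
If your application needs to carry additional data between tools, you can add fields to the subclass as you would with any Pydantic model. The following sketch is illustrative only; the retrieved_context and debug_notes fields are hypothetical and are not defined by the SDK:

class AppState(EvaluationState):
    # Hypothetical, application-specific fields; not part of the SDK
    retrieved_context: list[str] = []
    debug_notes: str = ""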

Set up the evaluator

To evaluate agentic AI applications, you must instantiate the AgenticEvaluation class, which provides the evaluator decorators that compute the different metrics:

from ibm_watsonx_gov.evaluate.agentic_evaluation import AgenticEvaluation

evaluator = AgenticEvaluation()

Add your evaluators

Compute the context relevance metric by defining the retrieval_node tool and decorating it with the evaluate_context_relevance evaluator:

from langchain_core.runnables import RunnableConfig

@evaluator.evaluate_context_relevance
def retrieval_node(state: AppState, config: RunnableConfig):
    # Retrieve the context for the input question and update the state
    pass
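
For reference, a filled-in retrieval node might look something like the following sketch. The retrieval logic is a placeholder, and retrieved_context is the hypothetical field from the earlier state sketch; the exact fields that the evaluator reads are defined by EvaluationState in the SDK, so adapt the names to your state class:

@evaluator.evaluate_context_relevance
def retrieval_node(state: AppState, config: RunnableConfig):
    # Placeholder retrieval logic: look up passages for the question in
    # state.input_text and return them as a state update. The field name
    # retrieved_context is hypothetical and comes from the sketch above.
    passages = ["<retrieved passage 1>", "<retrieved passage 2>"]
    return {"retrieved_context": passages}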

You can also stack evaluators to compute multiple metrics for a single tool. The following example shows the generate_node tool decorated with the evaluate_faithfulness and evaluate_answer_similarity evaluators to compute answer quality metrics:

@evaluator.evaluate_faithfulness
@evaluator.evaluate_answer_similarity
def generate_node(state: AppState, config: RunnableConfig):
    # Generate an answer from the retrieved context and update the state
    pass
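
The invocation examples that follow call a compiled application named rag_app, which this page does not show being built. One common way to assemble it, assuming you are using LangGraph (which the RunnableConfig parameter suggests), is to wire the decorated tools into a StateGraph:

from langgraph.graph import StateGraph, START, END

# Wire the decorated tools into a simple retrieval -> generation graph
graph = StateGraph(AppState)
graph.add_node("retrieval_node", retrieval_node)
graph.add_node("generate_node", generate_node)
graph.add_edge(START, "retrieval_node")
graph.add_edge("retrieval_node", "generate_node")
graph.add_edge("generate_node", END)
rag_app = graph.compile()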

Make an invocation

When you invoke the application for a row of data, add a record_id key to the inputs to track individual rows and associate the computed metrics with each row:

result = rag_app.invoke({"input_text": "What is concept drift?", "ground_truth": "Concept drift occurs when the statistical properties of the target variable change over time, causing a machine learning model’s predictions to become less accurate.", "record_id": "12"})
evaluator.get_metrics_df()

The invocation generates a result as shown in the following example:

Table 1. Single invocation result

name               method                 value      record_id  tool_name       execution_count
answer_similarity  sentence_bert_mini_lm  0.930133   12         generate_node   1
faithfulness       sentence_bert_mini_lm  0.258931   12         generate_node   1
tool_latency (s)                          12.777696  12         generate_node   1
context_relevance  sentence_bert_mini_lm  0.182579   12         retrieval_node  1
tool_latency (s)                          1.730439   12         retrieval_node  1
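
Assuming that get_metrics_df() returns a pandas DataFrame with the columns shown in Table 1, you can post-process the results directly. For example, the following sketch flags rows whose faithfulness score falls below a threshold; the 0.5 threshold is illustrative, not an SDK default:

metrics_df = evaluator.get_metrics_df()

# Flag faithfulness scores below an illustrative threshold of 0.5
low_faithfulness = metrics_df[
    (metrics_df["name"] == "faithfulness") & (metrics_df["value"] < 0.5)
]
print(low_faithfulness[["record_id", "tool_name", "value"]])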

Invoke the graph on multiple rows

To run a batch invocation, you can define a dataframe with questions and the corresponding ground truths:

import pandas as pd

question_bank_df = pd.read_csv("https://raw.githubusercontent.com/IBM/ibm-watsonx-gov/refs/heads/samples/notebooks/data/agentic/medium_question_bank.csv")
question_bank_df["record_id"] = question_bank_df.index.astype(str)
result = rag_app.batch(inputs=question_bank_df.to_dict("records"))
evaluator.get_metrics_df()

The dataframe index is used as the record_id to uniquely identify each row.

The invocation generates a result as shown in the following example:

Table 2. Batch invocation result

name               method                 value      record_id  tool_name       execution_count
answer_similarity  sentence_bert_mini_lm  0.921843   0          generate_node   1
faithfulness       sentence_bert_mini_lm  0.887591   0          generate_node   1
tool_latency (s)                          3.420483   0          generate_node   1
context_relevance  sentence_bert_mini_lm  0.707973   0          retrieval_node  1
tool_latency (s)                          0.777236   0          retrieval_node  1
answer_similarity  sentence_bert_mini_lm  0.909655   1          generate_node   1
faithfulness       sentence_bert_mini_lm  0.783347   1          generate_node   1
tool_latency (s)                          1.327022   1          generate_node   1
context_relevance  sentence_bert_mini_lm  0.706106   1          retrieval_node  1
tool_latency (s)                          0.936945   1          retrieval_node  1
answer_similarity  sentence_bert_mini_lm  0.864697   2          generate_node   1
faithfulness       sentence_bert_mini_lm  0.868233   2          generate_node   1
tool_latency (s)                          2.326283   2          generate_node   1
context_relevance  sentence_bert_mini_lm  0.763274   2          retrieval_node  1
tool_latency (s)                          0.842586   2          retrieval_node  1
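
To summarize the metrics across the batch, you can aggregate or pivot the returned dataframe with standard pandas operations. This sketch assumes that get_metrics_df() returns a pandas DataFrame with the columns shown in Table 2:

metrics_df = evaluator.get_metrics_df()

# Average each metric across all records
summary = metrics_df.groupby("name")["value"].mean().round(4)
print(summary)

# Or view one row per record with each metric as a column
pivot = metrics_df.pivot_table(index="record_id", columns="name", values="value")
print(pivot)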

For more information, see the sample notebook.

Parent topic: Metrics computation using Python SDK