Metrics computation with the Python SDK
Last updated: Mar 05, 2025

The ibm-watsonx-gov Python SDK is a Python library that you can use to programmatically monitor, manage, and govern machine learning models and generative AI assets. For model evaluations, you can use the Python SDK to calculate metrics and run algorithms either in a notebook runtime environment or as Spark jobs that are offloaded to IBM Analytics Engine.

Use the ibm-watsonx-gov Python SDK to calculate evaluation metrics and generate insights. You can automate these tasks by using modules and integrating them with your application. You can also use sample notebooks to compute metrics.

Modules

The Python SDK supports the following modules that can help you automate tasks for model evaluations and generate insights:

Metrics

The Python SDK supports metrics that help you evaluate traditional machine learning models and prompt templates for generative AI assets. For more information, see Evaluation metrics.

The following metrics are currently available only with the Python SDK:

Table 13. Python SDK evaluation metric descriptions

| Metric | Description |
| --- | --- |
| Adversarial robustness | Measures the robustness of your model and prompt template against adversarial attacks such as prompt injections and jailbreaks |
| Keyword inclusion | Measures the similarity of nouns and pronouns between the foundation model output and the reference or ground truth |
| Prompt leakage risk | Measures the risk of leaking the prompt template by calculating the similarity between the leaked prompt template and the original prompt template |
| Question robustness | Detects English-language spelling errors in the model input questions |
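As an illustration of the similarity idea behind the prompt leakage risk metric, the following sketch compares an original prompt template with text that a model returned. It does not use the SDK: the function name `prompt_leakage_similarity` is hypothetical, and the character-level similarity from the standard library's `difflib.SequenceMatcher` is only a stand-in, because this page does not specify which similarity measure the SDK applies.

```python
from difflib import SequenceMatcher


def prompt_leakage_similarity(original_template: str, leaked_text: str) -> float:
    """Hypothetical helper: similarity in [0, 1] between two strings.

    Assumption: difflib's ratio stands in for whatever similarity
    measure the SDK actually uses, which is not documented here.
    """
    return SequenceMatcher(None, original_template, leaked_text).ratio()


original = "You are a support agent. Answer using only the context: {context}"

# A model response that repeats the template verbatim (a full leak).
verbatim_leak = original
# A model response that refuses and reveals nothing.
refusal = "I cannot share that."

verbatim_score = prompt_leakage_similarity(original, verbatim_leak)
refusal_score = prompt_leakage_similarity(original, refusal)

print(f"verbatim leak: {verbatim_score:.2f}")
print(f"refusal:       {refusal_score:.2f}")
```

A higher score means that the returned text is closer to the original template, which in this sketch is read as a higher leakage risk.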

The following metric category is also available only with the Python SDK:

Content validation metrics

Content validation metrics use string-based functions to analyze and validate generated LLM output text. To generate content validation metrics, the input must contain a list of generated texts from your LLM.

If the input does not contain transaction records, the metrics measure the ratio of successful content validations to the total number of validations. If the input contains transaction records, the metrics measure the same ratio and also calculate validation results for the specified record_id.
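The ratio described above can be sketched in plain Python, without the SDK. The helper name `validation_ratio`, its parameters, and the shape of the returned dictionary are assumptions made for illustration, as is the reading that record-level results are one pass or fail outcome per `record_id`. The type hints assume Python 3.9 or later.

```python
from typing import Callable, Optional


def validation_ratio(
    predictions: list[str],
    check: Callable[[str], bool],
    record_ids: Optional[list[str]] = None,
) -> dict:
    """Hypothetical helper: share of predictions that pass `check`."""
    if not predictions:
        raise ValueError("predictions must contain at least one generated text")
    if record_ids is not None and len(record_ids) != len(predictions):
        raise ValueError("record_ids must align one-to-one with predictions")

    # One boolean outcome per generated text.
    outcomes = [bool(check(text)) for text in predictions]
    result = {"value": sum(outcomes) / len(outcomes)}

    # Assumption: with transaction records, each outcome is also
    # reported against its record_id.
    if record_ids is not None:
        result["record_level"] = [
            {"record_id": rid, "passed": ok}
            for rid, ok in zip(record_ids, outcomes)
        ]
    return result


predictions = ["Order shipped.", "Please email support@example.com", "Order delayed."]


def starts_with_order(text: str) -> bool:
    return text.startswith("Order")


# Without transaction records: only the aggregate ratio.
summary = validation_ratio(predictions, starts_with_order)

# With transaction records: the ratio plus a result per record_id.
detailed = validation_ratio(
    predictions, starts_with_order, record_ids=["r1", "r2", "r3"]
)

print(summary)
print(detailed["record_level"])
```

In this example two of the three texts pass the check, so the aggregate value is 2/3 in both calls; only the second call carries per-record detail.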

You can calculate the following content validation metrics:

Table 14. Content validation evaluation metric descriptions

| Metric | Description |
| --- | --- |
| Contains all | Measures whether the rows in the prediction contain all of the specified keywords |
| Contains any | Measures whether the rows in the prediction contain any of the specified keywords |
| Contains email | Measures whether each row in the prediction contains emails |
| Contains JSON | Measures whether the rows in the prediction contain JSON syntax |
| Contains link | Measures whether the rows in the prediction contain any links |
| Contains none | Measures whether the rows in the prediction do not contain any of the specified keywords |
| Contains string | Measures whether each row in the prediction contains the specified string |
| Contains valid link | Measures whether the rows in the prediction contain valid links |
| Ends with | Measures whether the rows in the prediction end with the specified substring |
| Equals to | Measures whether the rows in the prediction are equal to the specified string |
| Fuzzy match | Measures whether the rows in the prediction fuzzy match the specified keyword |
| Is email | Measures whether the rows in the prediction contain valid emails |
| Is JSON | Measures whether the rows in the prediction contain valid JSON syntax |
| Length greater than | Measures whether the length of each row in the prediction is greater than a specified minimum value |
| Length less than | Measures whether the length of each row in the prediction is less than a specified maximum value |
| No invalid links | Measures whether the rows in the prediction have no invalid links |
| Regex | Measures whether the rows in the prediction contain the specified regular expression |
| Starts with | Measures whether the rows in the prediction start with the specified substring |
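To make a few of these string-based checks concrete, the following sketch implements some of them with the Python standard library only. It does not call the SDK. All function names are hypothetical, and details such as case-sensitive matching and treating any text that `json.loads` accepts as valid JSON are assumptions, not documented SDK behavior.

```python
import json
import re


def contains_all(text: str, keywords: list[str]) -> bool:
    # Assumption: case-sensitive substring matching.
    return all(keyword in text for keyword in keywords)


def contains_any(text: str, keywords: list[str]) -> bool:
    return any(keyword in text for keyword in keywords)


def contains_none(text: str, keywords: list[str]) -> bool:
    return not contains_any(text, keywords)


def is_json(text: str) -> bool:
    # Assumption: "valid JSON" means json.loads parses the whole text.
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


def matches_regex(text: str, pattern: str) -> bool:
    return re.search(pattern, text) is not None


def length_less_than(text: str, max_length: int) -> bool:
    return len(text) < max_length


predictions = ['{"status": "ok"}', "status: failed", "ok"]

# Each check is applied to every generated text in the list.
checks = {
    "contains_all": lambda t: contains_all(t, ["status", "ok"]),
    "contains_any": lambda t: contains_any(t, ["failed", "error"]),
    "contains_none": lambda t: contains_none(t, ["failed", "error"]),
    "is_json": is_json,
    "regex": lambda t: matches_regex(t, r"^status"),
    "length_less_than": lambda t: length_less_than(t, 5),
}

results = {name: [check(p) for p in predictions] for name, check in checks.items()}

for name, outcomes in results.items():
    print(f"{name}: {outcomes}")
```

Each list of boolean outcomes can then be reduced to the ratio of successful validations to the total number of validations, which is how this page describes the metric value.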