The ibm-watsonx-gov Python SDK is a library that you can use to programmatically monitor, manage, and govern machine learning models and generative AI assets. You can use the Python SDK to run metric calculations for model evaluations in a notebook runtime environment, or offload them as Spark jobs against IBM Analytics Engine.
Use the ibm-watsonx-gov Python SDK to calculate evaluation metrics and generate insights. You can automate these tasks by using modules and integrating them with your application. You can also use sample notebooks to compute metrics.
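As a starting point, here is a minimal sketch of computing metrics in a notebook. It assumes the package is installed from PyPI under the same name; the evaluator entry point, call signature, and column names are assumptions for illustration, so check the SDK API reference for the exact interface.

```python
# Install the SDK first (assumed PyPI package name matching the SDK name):
#   pip install ibm-watsonx-gov

import pandas as pd

# NOTE: this entry point is an assumption for illustration; check the
# ibm-watsonx-gov API reference for the exact module path and class name.
from ibm_watsonx_gov.evaluators import MetricsEvaluator

# A minimal evaluation dataset: model inputs paired with generated outputs.
# The column names here are assumptions, not a documented schema.
data = pd.DataFrame({
    "input_text": ["What is the capital of France?"],
    "generated_text": ["The capital of France is Paris."],
})

evaluator = MetricsEvaluator()               # may require credentials
result = evaluator.evaluate(data=data)       # assumed call signature
print(result.to_df())                        # assumed result accessor
```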
Modules
The Python SDK supports the following modules that can help you automate tasks for model evaluations and generate insights:
Metrics
The Python SDK supports metrics that help you with traditional machine learning model evaluations and with prompt template evaluations for generative AI assets. For more information, see Evaluation metrics.
The following metrics are currently available only with the Python SDK:
| Metric | Description |
| --- | --- |
| Adversarial robustness | Measures the robustness of your model and prompt template against adversarial attacks such as prompt injections and jailbreaks |
| Keyword inclusion | Measures the similarity of nouns and pronouns between the foundation model output and the reference or ground truth |
| Prompt leakage risk | Measures the risk of leaking the prompt template by calculating the similarity between the leaked prompt template and the original prompt template |
| Question robustness | Detects English-language spelling errors in the model input questions |
The following metric category is also available only with the Python SDK:
Content validation metrics
Content validation metrics use string-based functions to analyze and validate generated LLM output text. To generate content validation metrics, the input must contain a list of generated text from your LLM.
If the input does not contain transaction records, the metrics measure the ratio of successful content validations to the total number of validations. If the input contains transaction records, the metrics measure the same ratio and also calculate validation results for each specified record_id.
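For illustration, here is a minimal plain-Python sketch of the ratio calculation described above. This is not the SDK's implementation; the keyword check is a stand-in for any of the content validation functions listed below.

```python
# A list of generated text from the LLM, the input the metrics expect.
predictions = [
    "Contact us at support@example.com for help.",
    "No contact details are available.",
    "Email sales@example.com to get a quote.",
]

# Stand-in validation: does a row contain any of the specified keywords?
keywords = ["support@example.com", "sales@example.com"]
successes = sum(
    any(keyword in row for keyword in keywords) for row in predictions
)

# Metric value: successful validations relative to the total validations.
ratio = successes / len(predictions)
print(ratio)  # 2 of 3 rows pass, so the ratio is 0.666...
```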
You can calculate the following content validation metrics:
| Metric | Description |
| --- | --- |
| Contains all | Measures whether the rows in the prediction contain all of the specified keywords |
| Contains any | Measures whether the rows in the prediction contain any of the specified keywords |
| Contains email | Measures whether each row in the prediction contains emails |
| Contains JSON | Measures whether the rows in the prediction contain JSON syntax |
| Contains link | Measures whether the rows in the prediction contain any links |
| Contains none | Measures whether the rows in the prediction do not contain any of the specified keywords |
| Contains string | Measures whether each row in the prediction contains the specified string |
| Contains valid link | Measures whether the rows in the prediction contain valid links |
| Ends with | Measures whether the rows in the prediction end with the specified substring |
| Equals to | Measures whether the rows in the prediction are equal to the specified substring |
| Fuzzy match | Measures whether the rows in the prediction fuzzy match the specified keyword |
| Is email | Measures whether the rows in the prediction contain valid emails |
| Is JSON | Measures whether the rows in the prediction contain valid JSON syntax |
| Length greater than | Measures whether the length of each row in the prediction is greater than a specified minimum value |
| Length less than | Measures whether the length of each row in the prediction is less than a specified maximum value |
| No invalid links | Measures whether the rows in the prediction have no invalid links |
| Regex | Measures whether the rows in the prediction match the specified regular expression |
| Starts with | Measures whether the rows in the prediction start with the specified substring |
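When transaction records are supplied, per-record results keyed by record_id can be illustrated in the same way. Again, this is a plain-Python sketch rather than the SDK's API; the record structure is an assumption, and the check mirrors the Is JSON metric above.

```python
import json

# Hypothetical transaction records: one generated text per record_id.
records = [
    {"record_id": "r1", "generated_text": '{"status": "ok"}'},
    {"record_id": "r2", "generated_text": "plain text, not JSON"},
]

def is_json(text: str) -> bool:
    """Stand-in for the 'Is JSON' check: does the text parse as valid JSON?"""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

# One validation result per record, plus the overall ratio of successes.
results = {r["record_id"]: is_json(r["generated_text"]) for r in records}
ratio = sum(results.values()) / len(results)
print(results)  # {'r1': True, 'r2': False}
print(ratio)    # 0.5
```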