The model risk evaluation engine measures foundation model risks by computing metrics for a set of risk dimensions, helping you identify the models that best meet your risk tolerance goals.
The model risk evaluation engine is a module in the ibm-watsonx-gov Python SDK that helps you understand generative AI risks and establish effective methods for measuring and mitigating them. The module provides quantitative risk assessments of foundation models and supports evaluations of watsonx.ai large language models and external models from other providers.
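The following minimal sketch shows how you might install the SDK and import the entry points that are used later in this topic. The pip command and any optional extras are assumptions; check the SDK documentation for the installation options that apply to your environment.
# Install the SDK first. The exact pip options or extras are an assumption here.
#   pip install ibm-watsonx-gov
from ibm_watsonx_gov.config.model_risk_configuration import ModelRiskConfiguration
from ibm_watsonx_gov.evaluate import evaluate_model_risk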
Evaluations calculate metrics for the following risk dimensions:
- Toxic output
- Harmful output
- Prompt leaking
- Hallucination
- Prompt injection
- Jailbreaking
- Output bias
- Harmful code generation
The risk dimensions are a collection of risks that can occur when you work with generative AI assets and machine learning models. For more information, see the AI Risk Atlas. The available risk dimensions are a subset of the risks in the AI Risk Atlas.
You must provide watsonx.ai credentials to calculate metrics for the jailbreaking, prompt leaking, harmful code generation, and prompt injection risks. For each risk dimension, one or more standardized data sets are used to evaluate the risk level. These data sets are stored in Unitxt cards.
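The credentials are passed to the evaluation call that is shown in the Examples section. The following sketch is only an illustration: the credentials class name, import path, and fields are assumptions rather than documented SDK surface, so consult the SDK reference for the exact form that evaluate_model_risk expects.
# Hypothetical sketch of providing watsonx.ai credentials. The class name,
# import path, and fields below are assumptions -- check the SDK reference.
# from ibm_watsonx_gov.entities.credentials import Credentials
# credentials = Credentials(api_key="<IBM_CLOUD_API_KEY>")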
After risk assessments are complete, you can save the results in the Governance console or export the results as a PDF report that summarizes the metrics that are calculated.
You can use the model risk evaluation engine to complete the following tasks:
- Compute metrics with watsonx.ai as the inference engine.
- Compute risk metrics for foundation models in watsonx.ai.
- Compute metrics for foundation models that are not hosted on watsonx.ai by implementing your own scoring function for the model.
- Store computed metrics in the Governance console (OpenPages).
- Retrieve computed metrics from the Governance console (OpenPages).
- Generate a PDF report of computed metrics.
- Display the metrics in a notebook cell in a table or chart format.
Input
You can specify the following input parameters when you use the model risk evaluation engine:
Parameter | Description |
---|---|
wx_gc_configuration | The Governance console configuration for storing the computed metrics. Storing the evaluation results in the Governance console prevents recomputing the metrics during the next evaluation; the evaluation engine retrieves the saved metrics instead. |
foundation_model_name | The name of the foundation model under evaluation. |
risk_dimensions | A list of risks to evaluate. If not provided, all available risks are evaluated. |
max_sample_size | The maximum number of data instances to use for evaluation. Specify a smaller value (for example, 50) to speed up the evaluation, or set it to None to use all data, which takes longer but ensures meaningful results. |
model_details | The foundation model details. The value can be WxAIFoundationModel or CustomFoundationModel, where WxAIFoundationModel is an object that represents the logic to invoke inferencing for watsonx.ai and CustomFoundationModel is an object that contains the logic to invoke an external LLM. |
pdf_report_output_path | The file path where the generated PDF report is saved. |
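The following sketch shows one way that these inputs might be assembled before you build the configuration in the Examples section. The constructor arguments for WxAIFoundationModel and WxGovConsoleConfiguration and the report file name are assumptions; check the SDK reference for the exact fields.
# Sketch of the input parameters that are described in the preceding table.
risk_dimensions = None                             # None evaluates all available risk dimensions
max_sample_size = 50                               # small sample for a faster evaluation; None uses all data
pdf_report_output_path = "model_risk_report.pdf"   # hypothetical report file name

# model_details identifies the model under evaluation:
#   WxAIFoundationModel for a model that is hosted on watsonx.ai, or
#   CustomFoundationModel to wrap your own scoring function for an external LLM.
# model_details = WxAIFoundationModel(...)               # constructor arguments: see the SDK reference

# Optional: configuration for storing results in the Governance console (OpenPages).
# wx_gc_configuration = WxGovConsoleConfiguration(...)   # constructor arguments: see the SDK reference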
Output
The model risk evaluation engine computes metrics for each risk dimension as output. You can display this output in a notebook cell, store it in the Governance console (OpenPages), or export it as a PDF report. Metrics are computed for the following risk dimensions:
Risk | Description |
---|---|
Toxic output | The model produces hateful, abusive, and profane (HAP) or obscene content. |
Harmful output | The model might generate language that leads to physical harm or language that includes overtly violent, covertly dangerous, or otherwise indirectly unsafe statements. |
Hallucination | The model generates content that is factually inaccurate or untruthful with respect to its training data or input. This risk is also sometimes referred to as lack of faithfulness or lack of groundedness.
Prompt injection | An attack that forces a generative model that takes a prompt as input to produce unexpected output by manipulating the structure, instructions, or information contained in its prompt. |
Jailbreaking | An attack that attempts to break through the guardrails that are established in the model to perform restricted actions. |
Output bias | Generated content might unfairly represent certain groups or individuals. |
Prompt leaking | An attempt to extract a model's system prompt.
Harmful code generation | The model might generate code that causes harm or unintentionally affects other systems.
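After you run an evaluation (see the Examples section), you can display the computed metrics in a notebook cell as a table or chart. The conversion below is a sketch that assumes evaluation_results.risks can be turned into tabular rows; adapt it to the actual structure of the result object.
import pandas as pd

print(evaluation_results.risks)                # raw view of the computed risk metrics
# Assumption: the risks object converts cleanly into tabular rows.
# df = pd.DataFrame(evaluation_results.risks)
# df.plot.bar(x="risk", y="value")             # hypothetical column names for a chart view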
Examples
You can run evaluations and generate results with the model risk evaluation engine as shown in the following examples:
Step 1: configuration
Create a model risk evaluation engine configuration:
from ibm_watsonx_gov.config.model_risk_configuration import ModelRiskConfiguration, WxGovConsoleConfiguration
configuration = ModelRiskConfiguration(
model_details=model_details,
risk_dimensions=risk_dimensions,
max_sample_size=max_sample_size,
pdf_report_output_path=pdf_report_output_path,
# wx_gc_configuration=wx_gc_configuration, # uncomment this line if the result should be pushed to Governance Console (OpenPages)
)
Step 2: run evaluation
Run an evaluation to measure risks:
from ibm_watsonx_gov.evaluate import evaluate_model_risk
evaluation_results = evaluate_model_risk(
configuration=configuration,
credentials=credentials,
)
print(evaluation_results.risks)
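In addition to the risks attribute, the returned evaluation_results object exposes the location of the generated PDF report, which Step 3 uses:
print(evaluation_results.output_file_path)   # path of the generated PDF report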
Step 3: generate PDF report
Export the evaluated data and metrics as a PDF report:
from ibm_wos_utils.joblib.utils.notebook_utils import create_download_link_for_file
pdf_file = create_download_link_for_file(evaluation_results.output_file_path)
display(pdf_file)
For more information, see the Model Risk Evaluation Engine notebook.
Parent topic: Metrics computation using Python SDK