You can configure model health monitor evaluations to help you understand your model behavior and performance. You can use model health metrics to determine how efficiently your model deployment processes your transactions.
Model health evaluations are enabled by default for machine learning model evaluations in production and for all types of generative AI asset deployments. When model health evaluations are enabled, a model health data set is created in the data mart for the service that you use. The model health data set stores details about your scoring requests that are used to calculate model health metrics.
To configure model health monitor evaluations, you can set threshold values for each metric.
Model health evaluations are not supported for pre-production and batch machine learning model deployments.
Supported model health metrics
Model health monitor evaluations support the following metric categories. Each category contains metrics that provide details about your model performance.
Scoring requests
Model health monitor evaluations calculate the number of scoring requests that your model deployment receives.
- Supported models: machine learning and LLMs
Records
Model health monitor evaluations calculate the total, average, minimum, maximum, and median number of transaction records that are processed across scoring requests.
- Supported models: machine learning and LLMs
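The aggregations for this and the other metric categories follow the same pattern: a statistic is computed over values collected from your payload records. The following sketch is illustrative only and is not the monitor's internal implementation; it assumes a hypothetical list of per-request record counts:

```python
import statistics

# Hypothetical record counts for five scoring requests, as might be read from payload records
records_per_request = [4, 1, 10, 2, 7]

summary = {
    "scoring_requests": len(records_per_request),              # Scoring requests metric
    "total_records": sum(records_per_request),                 # Records: total
    "average_records": statistics.mean(records_per_request),   # Records: average
    "minimum_records": min(records_per_request),               # Records: minimum
    "maximum_records": max(records_per_request),               # Records: maximum
    "median_records": statistics.median(records_per_request),  # Records: median
}
print(summary)
```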
Throughput and latency
Model health monitor evaluations calculate latency by tracking the time, in milliseconds (ms), that it takes to process scoring requests and transaction records. Throughput is calculated by tracking the number of scoring requests and transaction records that are processed per second.
To calculate throughput and latency, the `response_time` value from your scoring requests is used to track the time that your model deployment takes to process scoring requests.
For watsonx.ai Runtime deployments, the `response_time` value is automatically detected when you configure evaluations.
For external and custom deployments, you must specify the `response_time` value when you send scoring requests to calculate throughput and latency, as shown in the following example from the Python SDK:
```python
from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

client.data_sets.store_records(
    data_set_id=payload_data_set_id,
    request_body=[
        PayloadRecord(
            scoring_id=<uuid>,
            request=openscale_input,
            response=openscale_output,
            response_time=<response_time>,
            user_id=<user_id>)
    ]
)
```
The following metrics are calculated to measure throughput and latency during evaluations:
- API latency: Time taken (in ms) to process a scoring request by your model deployment.
- API throughput: Number of scoring requests processed by your model deployment per second.
- Record latency: Time taken (in ms) to process a record by your model deployment.
- Record throughput: Number of records processed by your model deployment per second.
The average, maximum, median, and minimum throughput and latency for scoring requests and transaction records are calculated during model health monitor evaluations.
- Supported models: machine learning and LLMs
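The following sketch illustrates how latency and throughput relate to the `response_time` values in your payload records. It is an approximation for illustration only, using hypothetical values and the assumption that throughput can be derived from latency (1000 ms per second); the monitor calculates these metrics for you during evaluations:

```python
import statistics

# Hypothetical payload records: response_time in milliseconds plus the number of records per request
payload_records = [
    {"response_time": 120, "record_count": 4},
    {"response_time": 80, "record_count": 1},
    {"response_time": 200, "record_count": 10},
]

# Latency: time (in ms) to process a scoring request, or a single record within it
api_latency = [r["response_time"] for r in payload_records]
record_latency = [r["response_time"] / r["record_count"] for r in payload_records]

# Throughput: requests or records processed per second (1000 ms in a second)
api_throughput = [1000 / latency for latency in api_latency]
record_throughput = [1000 / latency for latency in record_latency]

print("API latency (ms): avg", statistics.mean(api_latency),
      "min", min(api_latency), "max", max(api_latency), "median", statistics.median(api_latency))
print("API throughput (requests/s): avg", statistics.mean(api_throughput))
print("Record throughput (records/s): avg", statistics.mean(record_throughput))
```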
Payload size
Model health monitor evaluations calculate the total, average, minimum, maximum, and median payload size, in kilobytes (KB), of the transaction records that your model deployment processes across scoring requests. Payload size metrics are calculated for traditional models only and are not supported for image models.
- Supported models: machine learning
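As a rough illustration of how payload size can be measured, the following sketch serializes hypothetical scoring payloads to JSON and reports their size in kilobytes (KB). This is an assumed approximation for illustration, not the monitor's exact definition of payload size:

```python
import json
import statistics

# Hypothetical request/response payloads for a traditional (tabular) model
payloads = [
    {"request": {"fields": ["age", "income"], "values": [[42, 50000]]},
     "response": {"fields": ["prediction"], "values": [[1]]}},
    {"request": {"fields": ["age", "income"], "values": [[23, 31000], [57, 88000]]},
     "response": {"fields": ["prediction"], "values": [[0], [1]]}},
]

# Size of each serialized payload in kilobytes (KB)
sizes_kb = [len(json.dumps(p).encode("utf-8")) / 1024 for p in payloads]

print("total KB:", sum(sizes_kb))
print("average KB:", statistics.mean(sizes_kb))
print("median KB:", statistics.median(sizes_kb))
```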
Users
Model health monitor evaluations calculate the number of users that send scoring requests to your model deployments.
To calculate the number of users, the `user_id` value from scoring requests is used to identify the users that send the scoring requests that your model receives.
For watsonx.ai Runtime deployments, the `user_id` value is automatically detected when you configure evaluations.
For external and custom deployments, you must specify the `user_id` value when you send scoring requests to calculate the number of users, as shown in the following example from the Python SDK:
```python
from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

client.data_sets.store_records(
    data_set_id=payload_data_set_id,
    request_body=[
        PayloadRecord(
            scoring_id=<uuid>,
            request=openscale_input,
            response=openscale_output,
            response_time=<response_time>,
            user_id=<user_id>)  # user_id value to be supplied by you
    ]
)
```
When you review an evaluation summary for the Users metric, you can use the real-time view to see the total number of users and the aggregated views to see the average number of users.
- Supported models: machine learning and LLMs
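Conceptually, the Users metric counts the distinct `user_id` values across the scoring requests in an evaluation window. The following sketch uses hypothetical user IDs for illustration only:

```python
# Hypothetical user_id values taken from the payload records of several scoring requests
user_ids = ["analyst-1", "analyst-2", "analyst-1", "batch-service", "analyst-2"]

# Total number of users = number of distinct user_id values
print("total users:", len(set(user_ids)))
```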
Token counts
If you are using watsonx.governance, model health monitor evaluations calculate the number of tokens that are processed across scoring requests for your model deployment. This metric category is supported for foundation models only.
watsonx.governance calculates the following metrics to measure token count during evaluations:
- Input token count: Calculates the total, average, minimum, maximum, and median input token count across multiple scoring requests during evaluations
- Output token count: Calculates the total, average, minimum, maximum, and median output token count across scoring requests during evaluations
- Supported models: LLMs
To calculate custom token count metrics, you must specify the `generated_token_count` and `input_token_count` fields when you send scoring requests with the Python SDK, as shown in the following example:
```python
request = {
    "fields": ["comment"],
    "values": [
        ["Customer service was friendly and helpful."]
    ]
}

response = {
    "fields": [
        "generated_text",
        "generated_token_count",
        "input_token_count",
        "stop_reason",
        "scoring_id",
        "response_time"
    ],
    "values": [
        ["1", 2, 73, "eos_token", "MRM_7610fb52-b11d-4e20-b1fe-f2b971cae4af-50", 3558],
        ["0", 3, 62, "eos_token", "MRM_7610fb52-b11d-4e20-b1fe-f2b971cae4af-51", 3778]
    ]
}

from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

client.data_sets.store_records(
    data_set_id=payload_data_set_id,
    request_body=[
        PayloadRecord(
            scoring_id=<uuid>,
            request=request,
            response=response,
            response_time=<response_time>,
            user_id=<user_id>)  # user_id value to be supplied by you
    ]
)
```
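To see how the `generated_token_count` and `input_token_count` fields map to the token count metrics, the following sketch aggregates them from the `response` dictionary in the previous example. It is illustrative only; the monitor performs these aggregations for you during evaluations:

```python
import statistics

# Locate the token-count columns by name in the response fields
in_idx = response["fields"].index("input_token_count")
out_idx = response["fields"].index("generated_token_count")

input_tokens = [row[in_idx] for row in response["values"]]    # [73, 62]
output_tokens = [row[out_idx] for row in response["values"]]  # [2, 3]

print("Input token count  - total:", sum(input_tokens),
      "average:", statistics.mean(input_tokens),
      "median:", statistics.median(input_tokens))
print("Output token count - total:", sum(output_tokens),
      "average:", statistics.mean(output_tokens),
      "median:", statistics.median(output_tokens))
```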