API latency evaluation metric

Last updated: Mar 14, 2025

The API latency metric measures the time taken (in ms) to process a scoring request by your model deployment.

Metric details

API latency is a throughput and latency metric for model health monitor evaluations. It calculates latency by tracking the time, in milliseconds (ms), that your model deployment takes to process scoring requests.

Scope

The API latency metric is calculated for generative AI assets and machine learning models.

  • Generative AI tasks:
    • Text summarization
    • Text classification
    • Content generation
    • Entity extraction
    • Question answering
    • Retrieval Augmented Generation (RAG)
  • Machine learning problem types:
    • Binary classification
    • Multiclass classification
    • Regression
  • Supported languages: English

Evaluation process

The average, maximum, median, and minimum API latency for scoring requests and transaction records are calculated during model health monitor evaluations.
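
For illustration, these aggregates might be derived from the recorded per-request latencies as in the following minimal sketch. The sample values and the summarize_latency function are assumptions for illustration only and are not part of the model health monitor implementation.

    import statistics

    def summarize_latency(response_times_ms):
        # Aggregate per-request latencies (in ms) into the statistics
        # reported by a model health monitor evaluation.
        return {
            "average": statistics.mean(response_times_ms),
            "maximum": max(response_times_ms),
            "median": statistics.median(response_times_ms),
            "minimum": min(response_times_ms),
        }

    # Example: latencies (ms) recorded for four scoring requests
    print(summarize_latency([120.5, 98.2, 143.7, 101.9]))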

To calculate the API latency metric, the response_time value from your scoring requests is used to track the time that your model deployment takes to process them.

For watsonx.ai Runtime deployments, the response_time value is automatically detected when you configure evaluations.

For external and custom deployments, you must specify the response_time value when you send scoring requests to calculate throughput and latency, as shown in the following Python SDK example:

    from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

    client.data_sets.store_records(
        data_set_id=payload_data_set_id,
        request_body=[
            PayloadRecord(
                scoring_id=<uuid>,
                request=openscale_input,
                response=openscale_output,
                response_time=<response_time>,
                user_id=<user_id>,
            )
        ],
    )
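
For external and custom deployments, one way to obtain the response_time value is to time the scoring call yourself before storing the payload record. The following minimal sketch assumes a hypothetical score_external_model function that calls your deployment's scoring endpoint, and reuses the client, payload_data_set_id, and openscale_input from the example above; it is not part of the SDK.

    import time
    import uuid

    from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

    # Time the scoring call to the external deployment
    # (score_external_model is a hypothetical placeholder).
    start = time.perf_counter()
    openscale_output = score_external_model(openscale_input)
    response_time_ms = (time.perf_counter() - start) * 1000  # seconds to ms

    client.data_sets.store_records(
        data_set_id=payload_data_set_id,
        request_body=[
            PayloadRecord(
                scoring_id=str(uuid.uuid4()),
                request=openscale_input,
                response=openscale_output,
                response_time=int(response_time_ms),
            )
        ],
    )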

Parent topic: Evaluation metrics