API latency evaluation metric
The API latency metric measures the time, in milliseconds (ms), that your model deployment takes to process a scoring request.
Metric details
API latency is a throughput and latency metric for model health monitor evaluations that measures latency by tracking the time, in milliseconds (ms), that your model deployment takes to process scoring requests.
Scope
The API latency metric evaluates generative AI assets and machine learning models.
- Generative AI tasks:
    - Text summarization
    - Text classification
    - Content generation
    - Entity extraction
    - Question answering
    - Retrieval Augmented Generation (RAG)
- Machine learning problem types:
    - Binary classification
    - Multiclass classification
    - Regression
- Supported languages: English
Evaluation process
The average, maximum, median, and minimum API latency for scoring requests and transaction records are calculated during model health monitor evaluations.
To calculate the API latency metric, the response_time value from your scoring requests is used to track the time that your model deployment takes to process scoring requests.
For watsonx.ai Runtime deployments, the response_time value is automatically detected when you configure evaluations.
For external and custom deployments, you must specify the response_time value when you send scoring requests to calculate throughput and latency, as shown in the following example from the Python SDK:
from ibm_watson_openscale.supporting_classes.payload_record import PayloadRecord

# Store the payload record with the measured response time so that
# throughput and latency can be calculated during evaluations
client.data_sets.store_records(
    data_set_id=payload_data_set_id,
    request_body=[
        PayloadRecord(
            scoring_id=<uuid>,
            request=openscale_input,
            response=openscale_output,
            response_time=<response_time>,
            user_id=<user_id>
        )
    ]
)
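For illustration only, the following sketch shows how the average, maximum, median, and minimum latency statistics described earlier can be derived from a set of response_time values. It is not part of the SDK, and the sample values are hypothetical.

import statistics

# Hypothetical response_time values, in milliseconds, collected from scoring requests
response_times_ms = [112, 98, 143, 105, 127]

print("average:", statistics.mean(response_times_ms))    # mean latency
print("maximum:", max(response_times_ms))                 # slowest request
print("median:", statistics.median(response_times_ms))    # middle value
print("minimum:", min(response_times_ms))                 # fastest request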
Parent topic: Evaluation metrics