Learn the terms and concepts that are used for evaluating machine learning models.
Acceptable fairness
The percentage of favorable outcomes that a monitored group must receive to meet the fairness threshold. It is calculated by multiplying perfect equality by the fairness threshold.
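The calculation can be sketched in a couple of lines (the numbers below are illustrative, not defaults):

```python
def acceptable_fairness(perfect_equality: float, fairness_threshold: float) -> float:
    """Acceptable fairness = perfect equality multiplied by the fairness threshold.

    Both arguments are expressed as fractions (0.80 means 80%).
    """
    return perfect_equality * fairness_threshold

# Illustrative values: if perfect equality is 70% and the fairness
# threshold is 80%, the monitored group must receive at least 56%
# favorable outcomes to meet the threshold.
print(round(acceptable_fairness(0.70, 0.80), 2))  # → 0.56
```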
Alert
A notification that a performance metric is outside of the acceptable range specified by configured monitors.
API Key
A unique identifier issued by IBM Cloud for connecting to resources. To obtain:
- Open https://cloud.ibm.com/resources
- Find and expand the resource (such as a storage service)
- Copy the value for apikey without the quotation marks
Balanced data set
A data set that includes the scoring requests received by the model for the selected hour and the perturbed records.
Baseline data
Previous data that is collected before an intervention or modification. This data serves as the foundation against which future data is compared.
Batch deployment
A method of deploying models that processes input data from a file, data connection, or connected data in a storage bucket, and writes the output to a selected destination.
Batch processing
A method of processing records in groups rather than individually. Batch processing is suggested for monitoring deployments that involve large volumes of payload or feedback data.
Bias
When a machine learning model produces a result for a monitored person, group, or thing that is considered to be unfair when compared to a reference result. Bias can be caused by a problem with the training data for a model. The Fairness monitor can detect bias when the fairness score falls below a threshold that you set. Related term: Debiasing.
Cloud Object Storage
A service offered by IBM for storing and accessing data. If Cloud Object Storage is the repository for machine learning assets, the associated service credentials must be used to connect to the assets for model evaluations.
See also: Resource ID, API key.
Confidence score
The probability that a machine learning model's prediction is correct. A higher score indicates a higher probability that the predicted outcome matches the actual outcome.
Contrastive explanation
An explanation that indicates the minimal set of changes to feature column values that would change the model prediction. It is computed for a single data point.
Data mart
The workspace where all the metadata for model evaluations is saved. Behind the scenes, it is connected to a database that provides the persistence layer for the metadata.
Debiased transactions
The transactions for which a debiased outcome is generated.
Debiasing
The process of mitigating bias that the Fairness monitor detects. When a monitored group receives biased outcomes, you can take steps to mitigate the bias automatically or manually.
Deployment
You deploy a model to make an endpoint available so that you can send new data (the request) to the model and get a score, or response. A model deployment can be in a pre-production environment for testing or in a production environment for actual use.
Drift
When model accuracy declines over time. Drift can be caused by a change in model input data that leads to model performance deterioration. To monitor for drift, alerts can be created for when the model accuracy drops below a specified acceptable threshold.
Evaluation
The process of using metrics to assess a machine learning model and measure how well the model performs in areas such as fairness and accuracy. Monitors can assess a model for the areas that are important to your goals.
Explanation
An insight into the evaluation of a particular measurement of a model. An explanation helps you understand model evaluation results and also experiment with what-if scenarios to help address issues.
Fairness
The determination of whether a model produces biased outcomes that favor a monitored group over a reference group. The fairness evaluation checks whether the model shows a tendency to provide a favorable outcome more often for one group than for another. Typical categories to monitor are age, sex, and race.
Features
List of dataset column names (feature columns) used to train a machine learning model.
Example: In a model that predicts whether a person qualifies for a loan, the features for employment status and credit history might be given greater weight than zip code.
Feedback data
Labeled data that matches the schema and structure of the data used to train a machine learning model (including the target) but that was not used for training. Because the actual outcomes are already known, the Quality monitor uses this data to measure the accuracy of a deployed model by determining whether predictions are accurate when measured against the known outcomes.
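A minimal sketch of how feedback data drives an accuracy measurement (the labels and helper function are illustrative, not the Quality monitor's actual implementation):

```python
def accuracy(actual_outcomes, predictions):
    """Fraction of predictions that match the known (actual) outcomes."""
    matches = sum(a == p for a, p in zip(actual_outcomes, predictions))
    return matches / len(actual_outcomes)

# Hypothetical feedback records for a loan-approval model: the labeled
# outcomes are already known, so they can be compared to the model output.
actual    = ["approved", "denied", "approved", "denied"]
predicted = ["approved", "denied", "denied",   "denied"]
print(accuracy(actual, predicted))  # → 0.75
```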
Global explanation
An explanation of a model's predictions across a sample of data.
Headless subscription
A subscription that has a real-time deployment behind the scenes. Through a headless subscription, users can monitor the deployment by using the payload and feedback data that is supplied to it, without supplying a scoring URL.
Labeled data
Data that is labeled in a uniform manner for the machine learning algorithms to recognize during model training.
Example: A table of data with labeled columns is typical for supervised machine learning. Images can also be labeled for use in a machine learning problem.
Local explanation
Explains a model's prediction by using specific, individual examples.
Meta-fields
Specialized data fields that are unique to each product.
Monitor
A component that tracks performance results for different model evaluations.
Example: Fairness, drift, quality, explainability.
Monitored group
When evaluating fairness, the monitored group represents the values that are most at risk for biased outcomes.
Example: In the sex feature, Female and Nonbinary can be set as monitored groups.
Online deployment
Method of accessing a deployment through an API endpoint that provides a real-time score or solution on new data.
Payload data
Any real-time data supplied to a model. Consists of requests to a model (input) and responses from a model (output).
Payload logging
The process of persisting payload data so that it can be used for model evaluations.
Perfect equality
The percentage of favorable outcomes delivered to all reference groups. For the balanced and debiased data sets, the calculation includes monitored group transactions that were altered to become reference group transactions.
Perturbations
Data points that are simulated around real data points during the computation of metrics that are associated with monitors, such as fairness and explainability.
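One simple way to generate such simulated points is to jitter the numeric features of a real record; this sketch is illustrative and not the exact perturbation strategy that the monitors use:

```python
import random

def perturb(record, numeric_features, n=5, scale=0.1):
    """Simulate n data points around a real record by jittering its
    numeric features within +/- scale of their original values."""
    simulated = []
    for _ in range(n):
        copy = dict(record)
        for feature in numeric_features:
            copy[feature] = record[feature] * (1 + random.uniform(-scale, scale))
        simulated.append(copy)
    return simulated

neighbors = perturb({"age": 40, "income": 52000}, ["age", "income"])
print(len(neighbors))  # → 5
```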
Pre-production space
An environment that is used to test models and validate their data before they are deployed to production.
Prediction column
The variable that a supervised machine learning model (trained with labeled data) predicts when presented with new data.
See also: Target.
Probability
The confidence with which a model predicts the output. Applicable for classification models.
Production space
A deployment space used for operationalizing machine learning models. Deployments from a production space are evaluated for comparison of actual performance against specified metrics.
Quality
A monitor that evaluates how well a model predicts accurate outcomes based on the evaluation of feedback data. It uses a set of standard data science metrics to evaluate how well the model predicts outcomes that match the actual outcomes
in the labeled data set.
Records
Transactions on which monitors are evaluated.
Reference group
When evaluating fairness, the reference group represents the values that are least at risk for biased outcomes.
Example: For the Age feature, you can set 30-55 as the reference group and compare results for other cohorts to that group.
Relative weight
The relative weight that a feature has on predicting the target variable. A higher weight indicates more importance. Knowing the relative weight helps explain the model results.
Resource ID
The unique identifier for a resource stored in Cloud Object Storage. To obtain:
- Open https://cloud.ibm.com/resources
- Find and expand the resource (such as a storage service)
- Copy the value for Resource ID without the quotation marks
Response time
The time that the model deployment takes to process a scoring request.
Runtime data
Data that is generated by running a model throughout its lifecycle.
Scoring endpoint
The HTTPS endpoint that users can call to receive the scoring output of a deployed model.
Scoring request
The input to a deployment.
See also: Payload.
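As a sketch of what a scoring request body can look like (the URL and the field/value schema are hypothetical; the exact shape depends on the machine learning provider that hosts the deployment):

```python
import json

# Hypothetical scoring endpoint; a real deployment supplies its own URL.
scoring_url = "https://example.com/ml/v4/deployments/DEPLOYMENT_ID/predictions"

# Input rows are commonly sent as parallel "fields" and "values" lists.
payload = {
    "input_data": [{
        "fields": ["age", "income", "credit_history"],
        "values": [[40, 52000, "good"]],
    }]
}

body = json.dumps(payload)  # this body would be POSTed to scoring_url
```

The response to such a request is the scoring output: typically the prediction and its probability for each input row.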
Scoring
In model inferencing, the action of sending a request to a model and receiving a response.
Self-managed
Model transactions that are stored in your own data warehouse and evaluated by your own Spark analytics engine.
Service credentials
The access IDs required to connect to IBM Cloud resources.
Service Provider
A machine learning provider (typically a model engine, such as WML, AWS, Azure, or a custom engine) that hosts the deployments.
Subscription
A deployment that is being monitored. There is a one-to-one mapping between deployments and subscriptions.
System-managed
Model transactions that are stored in a system-provided database and evaluated by using system-provided computing resources.
Target
The feature or column of a data set that the trained model predicts. The model is trained by using pre-existing data to learn patterns and discover relationships between the features of the data set and the target.
See also: Prediction column.
Threshold
A benchmark that establishes an acceptable range of outcomes when monitors are configured to evaluate a machine learning model. When an outcome falls outside the configured threshold, an alert is triggered so that you can assess and remedy the situation.
Training data
Data used to teach and train a model's learning algorithm.
Transactions
The records for machine learning model evaluations that are stored in the payload logging table.
Unlabeled data
Data that is not associated with labels that identify characteristics, classifications, and properties. Unstructured data that is not labeled in a uniform manner.
Example: Email or unlabeled images are typical of unlabeled data. Unlabeled data can be used in unsupervised machine learning.
User ID
The ID of the user who is associated with the scoring request.
Parent topic: Evaluating AI models with Watson OpenScale