

This glossary provides terms and definitions for Cloud Pak for Data as a Service.




accountability

The expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate, or deploy, in accordance with their roles and applicable regulatory frameworks. This includes determining who is responsible for an AI mistake, which may require legal experts to determine liability on a case-by-case basis.

active learning

A model for machine learning in which the system requests more labeled data only when it needs it.

active metadata

Metadata that is automatically updated based on analysis by machine learning processes. For example, profiling and data quality analysis automatically update metadata for data assets.

active runtime

An instance of an environment that is running to provide compute resources to analytical assets.


AI

See artificial intelligence.

AI ethics

A multidisciplinary field that studies how to optimize AI's beneficial impact while reducing risks and adverse outcomes. Examples of AI ethics issues are data responsibility and privacy, fairness, explainability, robustness, transparency, environmental sustainability, inclusion, moral agency, value alignment, accountability, trust, and technology misuse.

AI governance

An organization's act of governing, through its corporate instructions, staff, processes, and systems, to direct, evaluate, monitor, and take corrective action throughout the AI lifecycle, to provide assurance that the AI system is operating as the organization intends, as its stakeholders expect, and as required by relevant regulation.

AI safety

The field of research aiming to ensure that artificial intelligence systems operate in a manner that is beneficial to humanity and do not inadvertently cause harm, addressing issues such as reliability, fairness, transparency, and alignment of AI systems with human values.

AI system

See artificial intelligence system.


algorithm

A formula applied to data to determine optimal ways to solve analytical problems.


analytics

The science of studying data in order to find meaningful patterns in the data and draw conclusions based on those patterns.

artificial intelligence (AI)

The emulation of natural intelligence by a machine.

artificial intelligence system (AI system)

A system that can make predictions, recommendations or decisions that influence physical or virtual environments, and whose outputs or behaviors are not necessarily pre-determined by its developer or user. AI systems are typically trained with large quantities of structured or unstructured data, and might be designed to operate with varying levels of autonomy or none, to achieve human-defined objectives.


asset

An item in a project or catalog that contains metadata about data or data analysis.

AutoAI experiment

An automated training process that considers a series of training definitions and parameters to create a set of ranked pipelines as model candidates.


batch deployment

A method to deploy models that processes input data from a file, data connection, or connected data in a storage bucket, then writes the output to a selected destination.


bias

Systematic error in an AI system that has been designed, intentionally or not, in a way that may generate unfair decisions. Bias can be present both in the AI system and in the data used to train and test it. AI bias can emerge in an AI system as a result of cultural expectations; technical limitations; or unanticipated deployment contexts. See also fairness.

bias detection

The process of calculating fairness metrics to detect when AI models are delivering unfair outcomes based on certain attributes.

binary classification

A classification model with two classes. Predictions are a binary choice of one of the two classes.
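The idea can be sketched as a one-feature threshold classifier; the feature name and threshold below are purely illustrative.

```python
# A minimal sketch of binary classification: every prediction is a
# choice between exactly two classes. Feature and threshold invented.

def classify(tumor_size_mm: float, threshold: float = 20.0) -> str:
    """Predict exactly one of two classes from a single feature."""
    return "malignant" if tumor_size_mm >= threshold else "benign"

predictions = [classify(size) for size in [5.0, 31.2, 19.9, 20.0]]
```

Real binary classifiers learn the decision boundary from labeled data rather than hard-coding it, but the output space is the same two classes.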

business term

A word or phrase that defines a business concept in a standard way for an enterprise. Terms can be used to enrich the metadata of data assets and to define the criteria of data protection rules.

business vocabulary

The set of governance artifacts, such as business terms and data classes, that describe and enrich data assets.



catalog

A repository of assets that an organization shares. Assets in catalogs can be governed by data protection rules and enriched by other governance artifacts, such as classifications, data classes, and business terms. Catalogs can store structured and unstructured data, references to data in external data sources, and other analytical assets, like machine learning models.


category

In IBM Knowledge Catalog, a collaborative workspace for organizing and managing governance artifacts.


classification

In IBM Knowledge Catalog, a governance artifact that describes the sensitivity level of the data in a data asset.


cleanse

To ensure that all values in a data set are consistent and correctly recorded.


collaborator

A member of a group of people who are working together toward a common goal.

combinatorial problem

A problem that is difficult to solve because it requires multiple decisions to be made involving too many combinations of possible choices. Some examples are finding a grouping, ordering, or the assignment of objects.

compute resources

The hardware and software resources defined by an environment definition to run analytical assets.

confusion matrix

A performance measurement that determines the accuracy between a model's positive and negative predicted outcomes compared to positive and negative actual outcomes.
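A minimal sketch of how the four cells of a binary confusion matrix are counted from actual versus predicted labels (the label lists below are invented):

```python
# Count true/false positives and negatives for a binary classifier.

def confusion_matrix(actual, predicted, positive=1):
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

counts = confusion_matrix(actual=[1, 0, 1, 1, 0], predicted=[1, 1, 0, 1, 0])
# Accuracy can then be derived as (TP + TN) / total.
```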

connected data

A data set that is accessed through a connection to an external data source.


connection

The information required to connect to a database. The actual information that is required varies according to the DBMS and connection method.


constraint

In Decision Optimization, a condition that must be satisfied by the solution of a problem.

continuous learning

Automating the tasks of monitoring model performance, retraining with new data, and redeploying to ensure prediction quality.

Core ML deployment

The process of downloading a deployment in Core ML format for use in iOS apps.


curate

  • To select, collect, preserve, and maintain content relevant to a specific topic. Curation establishes, maintains, and adds value to data; it transforms data into trusted information and knowledge.
  • To create a data asset and prepare it to be published in a catalog. Curation can include enriching the data asset by assigning governance artifacts such as business terms, classifications, and data classes, and analyzing the quality of the data in the data asset.


data asset

An asset that points to data, for example, to an uploaded file. Connections and connected data assets are also considered data assets.

data class

A governance artifact that categorizes columns in relational data sets according to the type of the data and how the data is used.

data integration

The combination of technical and business processes that are used to combine data from disparate sources into meaningful and valuable information.

data lake

A large-scale data storage repository that stores raw data in any format in a flat architecture. Data lakes hold structured and unstructured data as well as binary data for the purpose of processing and analysis.

data lakehouse

A unified data storage and processing architecture that combines the flexibility of a data lake with the structured querying and performance optimizations of a data warehouse, enabling scalable and efficient data analysis for AI and analytics applications.

data mining

The process of collecting critical business information from a data source, correlating the information, and uncovering associations, patterns, and trends. See also predictive analytics.

data model

A visualization of data elements, their relationships, and their attributes.

data product

A collection of optimized data or data-related assets that are packaged for reuse and distribution with controlled access. Data products contain data as well as models, dashboards, and other computational asset types. Unlike data assets in governance catalogs, data products are managed as products with multiple purposes to provide business value.

data protection rule

A governance artifact that specifies what data to control and how to control it. A data protection rule contains criteria and an action.

data quality analysis

The analysis of data against the quality dimensions of accuracy, completeness, consistency, timeliness, uniqueness, and validity.

data quality definition

A reusable description of a rule evaluation or condition that data quality rules are based on.

data quality rule

A rule that, during data quality analysis, assesses whether data meets specific conditions and identifies records that do not meet the conditions as rule violations.
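As an illustration, a rule condition can be checked record by record, with failures collected as violations; the age-range condition and the records below are invented:

```python
# A minimal sketch of a data quality rule: records that fail the
# condition are flagged as rule violations.

records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},     # violates the condition
    {"id": 3, "age": 200},    # violates the condition
]

def rule_age_in_range(record):
    """The condition: age must lie between 0 and 120 inclusive."""
    return 0 <= record["age"] <= 120

# Records that do not meet the condition are the rule violations.
violations = [r for r in records if not rule_age_in_range(r)]
```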

data science

The analysis and visualization of structured and unstructured data to discover insights and knowledge.

data set

A collection of data, usually in the form of rows (records) and columns (fields) and contained in a file or database table.

data source

A repository, queue, or feed for reading data, such as a Db2 database.

data table

A collection of data, usually in the form of rows (records) and columns (fields) and contained in a table.

data warehouse

A large, centralized repository of data collected from various sources that is used for reporting and data analysis. It primarily stores structured and semi-structured data, enabling businesses to make informed decisions.

Decision Optimization model

A prescriptive model that can be solved with optimization to provide the best solution to a Decision Optimization problem.

decision variable

One of a set of variables representing decisions to be made, whose values are determined by the optimization engine while ensuring that all constraints are satisfied and the objective is optimized.


deployment

A model or application package that is available for use.

deployment space

A workspace where models are deployed and deployments are managed.


DevOps

A software methodology that integrates application development and IT operations so that teams can deliver code faster to production and iterate continuously based on market feedback.


DOcplex

A Python API for modeling and solving Decision Optimization problems.


endpoint URL

A network destination address that identifies resources, such as services and objects. For example, an endpoint URL is used to identify the location of a model or function deployment when a user sends payload data to the deployment.


environment

The compute resources for running jobs.

environment runtime

An instantiation of the environment template to run analytical assets.

environment template

A definition that specifies hardware and software resources to instantiate environment runtimes.


explainability

  • The ability of human users to trace, audit, and understand predictions that are made in applications that use AI systems.
  • The ability of an AI system to provide insights that humans can use to understand the causes of the system's predictions.



fairness

In an AI system, the equitable treatment of individuals or groups of individuals. The choice of a specific notion of equity for an AI system depends on the context in which it is used. See also bias.


feature

A property or characteristic of an item within a data set, for example, a column in a spreadsheet. In some cases, features are engineered as combinations of other features in the data set.

feature engineering

The process of selecting, transforming, and creating new features from raw data to improve the performance and predictive power of machine learning models.
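For example (with invented column names), a date column can be transformed into a numeric age, and two raw columns can be combined into a ratio:

```python
# A minimal sketch of feature engineering: deriving new features
# from raw columns. All column names and values are invented.
from datetime import date

raw = {"birth_date": date(1990, 6, 1), "income": 52000.0, "debt": 13000.0}

engineered = {
    # Transform a date column into a numeric age feature.
    "age": (date(2024, 6, 1) - raw["birth_date"]).days // 365,
    # Combine two raw columns into a ratio that can be more
    # predictive than either column alone.
    "debt_to_income": raw["debt"] / raw["income"],
}
```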

feature selection

Identifying the columns of data that best support an accurate prediction or score in a machine learning model.

feature store

A centralized repository or system that manages and organizes features, providing a scalable and efficient way to store, retrieve, and share feature data across machine learning pipelines and applications.

feature transformation

In AutoAI, a phase of pipeline creation that applies algorithms to transform and optimize the training data to achieve the best outcome for the model type.

federated learning

The training of a common machine learning model that uses multiple data sources that are not moved, joined, or shared. The result is a better-trained model without compromising data security.


flow

A collection of nodes that define a set of steps for processing data or training a model.


Gantt chart

A graphical representation of a project timeline and duration in which schedule data is displayed as horizontal bars along a time scale.

governance artifact

Governance items that enrich or control data assets. Governance artifacts include business terms, classifications, data classes, policies, rules, and reference data sets.

governance rule

A governance artifact that provides a natural-language description of the criteria that are used to determine whether data assets are compliant with business objectives.

governance workflow

A task-based process to control the creating, modifying, and deleting of governance artifacts.

governed catalog

A catalog that has enforcement of data protection rules enabled.


GPU

See graphics processing unit.

graphical builder

A tool for creating analytical assets visually on a canvas, an area on which objects or nodes are placed and connected to create a flow.

graphics processing unit (GPU)

A specialized processor designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs are heavily utilized in machine learning due to their parallel processing capabilities.


hold-out set

A set of labeled data that is intentionally withheld from both the training and validation sets, serving as an unbiased assessment of the final model's performance on unseen data.
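The three-way split can be sketched as follows; the 60/20/20 proportions are an arbitrary choice for illustration:

```python
# Split labeled data into training, validation, and hold-out sets.
# The hold-out rows are never used for training or tuning.
import random

data = list(range(100))          # stand-in for 100 labeled examples
random.seed(0)                   # reproducible shuffle
random.shuffle(data)

train, validation, hold_out = data[:60], data[60:80], data[80:]
```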

human oversight

Human involvement in reviewing decisions rendered by an AI system, enabling human autonomy and accountability for decisions.


hyperparameter

In machine learning, a parameter whose value is set before training as a way to increase model accuracy.



A software package that contains a set of libraries.


ingest

  • To feed data into a system for the purpose of creating a base of knowledge.
  • To continuously add a high volume of real-time data to a database.


insight

An accurate or deep understanding of something. Insights are derived using cognitive analytics to provide current snapshots and predictions of customer behaviors and attitudes.


intent

A purpose or goal expressed by customer input to a chatbot, such as answering a question or processing a bill payment.



job

A separately executable unit of work.

Jupyter notebook

See notebook.


labeled data

Raw data that is assigned labels to add context or meaning so that it can be used to train machine learning models. For example, numeric values might be labeled as zip codes or ages to provide context for model inputs and outputs.

large language model

A language model with a large number of parameters, trained on a large quantity of text.


lineage

  • The history of the flow of data through assets.
  • The history of the events performed on an asset.

logical model

A logical representation of data objects that are related to a business domain.


machine learning (ML)

A branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving the accuracy of AI models.

machine learning framework

The libraries and runtime for training and deploying a model.

machine learning model

An AI model that is trained on a set of data to develop algorithms that it can use to analyze and learn from new data.


mask

To replace sensitive data values in a column of a data set. Masking methods vary in data utility and privacy, from providing similarly formatted replacement values that retain referential integrity to providing the same replacement value for the entire column.
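Two of these methods can be sketched in plain Python; the column of social security numbers below is invented:

```python
# Redacting replaces every value with the same string (least data
# utility); obfuscating keeps the column's 3-2-4 digit format.
import random

ssns = ["123-45-6789", "987-65-4321"]

# Same replacement value for the entire column.
redacted = ["XXXXXXXXX" for _ in ssns]

# Similarly formatted random replacement values.
rng = random.Random(42)
obfuscated = [
    "-".join(str(rng.randint(0, 10**n - 1)).zfill(n) for n in (3, 2, 4))
    for _ in ssns
]
```

A production masking method that must retain referential integrity would additionally map each distinct input to the same replacement every time.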

masking flow

A flow that produces permanently masked copies of data.

master data

  • For model training, reference data that remains the same for several jobs on the same model but that can be changed, if necessary.
  • In Match 360, a consolidated view of data from the disparate sources.

mathematical programming (MP)

A field of mathematics, or operations research, used to model and solve Decision Optimization problems. This encompasses linear, integer, mixed-integer, and non-linear programming.

metadata import

A method of importing metadata that is associated with data assets, including process metadata that describes the lineage of data assets and technical metadata that describes the structure of data assets.


misalignment

A discrepancy between the goals or behaviors that an AI system is optimized to achieve and the true, often complex, objectives of its human users or designers.


ML

See machine learning.


MLOps

  • The practice of collaboration between data scientists and operations professionals to help manage the production machine learning (or deep learning) lifecycle. MLOps seeks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements. It involves model development, training, validation, deployment, monitoring, and management, and uses methods like CI/CD.
  • A methodology that takes a machine learning model from development to production.


model

  • In a machine learning context, a set of functions and algorithms that have been trained and tested on a data set to provide predictions or decisions.
  • In Decision Optimization, a mathematical formulation of a problem that can be solved with CPLEX optimization engines using different data sets.

model formulation

In Decision Optimization, the mathematical formulation of a model expressed as a list of decision variables, one or more objective functions to be maximized or minimized, and some constraints to be satisfied.


ModelOps

A methodology for managing the full lifecycle of an AI model, including training, deployment, scoring, evaluation, retraining, and updating.


MP

See mathematical programming.


natural language

A modeling syntax that resembles natural human language (in English) and is used to formulate models.

natural language processing (NLP)

A field of artificial intelligence and linguistics that studies the problems inherent in the processing and manipulation of natural language, with an aim to increase the ability of computers to understand human languages.

natural language processing library

A library that provides basic natural language processing functions for syntax analysis and out-of-the-box pre-trained models for a wide variety of text processing tasks.

neural network

A mathematical model for predicting or classifying cases by using a complex mathematical scheme that simulates an abstract version of brain cells. A neural network is trained by presenting it with a large number of observed cases, one at a time, and allowing it to update itself repeatedly until it learns the task.
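The training loop can be illustrated with the smallest possible network, a single neuron (perceptron) learning the AND function. This is a sketch of the general idea, not any product's implementation:

```python
# Present observed cases one at a time and update the weights
# repeatedly until the neuron learns the task (here, logical AND).

cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, lr = [0.0, 0.0], 0.0, 0.1

def predict(x):
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

for _ in range(20):                      # repeated passes over the cases
    for x, target in cases:
        error = target - predict(x)      # 0 when the case is classified right
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
        bias += lr * error
```

Practical neural networks stack many such units in layers and use gradient-based updates, but the learn-by-repeated-correction principle is the same.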


NLP

See natural language processing.


node

The graphical representation of a data operation in a stream or flow. Different types of nodes have different shapes to indicate the type of operation that they perform.


notebook

An interactive document that contains executable code, descriptive text for that code, and the results of any code that is run.

notebook kernel

The part of the notebook editor that executes code and returns the computational results.



obfuscate

To replace data in a column with similarly formatted values that match the original format. A form of masking.

object storage

A method of storing data, typically used in the cloud, in which data is stored as discrete units, or objects, in a storage pool or repository that does not use a file hierarchy but that stores all objects at the same level.

objective function

In Decision Optimization and operations research, an expression to optimize (that is, either to minimize or to maximize) while satisfying other constraints of the problem.

one-shot learning

A model for deep learning that is based on the premise that most human learning takes place upon receiving just one or two examples. This model is similar to unsupervised learning.

online deployment

Method of accessing a model or Python code deployment through an API endpoint as a web service to generate predictions online, in real time.


ontology

An explicit formal specification of the representation of the objects, concepts, and other entities that can exist in some area of interest and the relationships among them.

operational asset

An asset that runs code in a tool or a job.

OPL model

A model formulation expressed in OPL modeling language.

optimal solution

In operations research, a solution to a problem that optimizes the objective function (whether linear or quadratic) and satisfies all the other constraints of the problem.


optimization

The process of finding the most appropriate solution to a precisely defined problem while respecting the imposed constraints and limitations. For example, determining how to allocate resources or how to find the best elements or combinations from a large set of alternatives.
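A brute-force sketch of the idea (with invented item weights, values, and capacity): enumerate the combinations, discard those that violate the constraint, and keep the one with the best objective value. Real solvers such as CPLEX avoid this enumeration, but the problem shape is the same.

```python
# Pick the best feasible combination of items under a capacity limit.
from itertools import combinations

items = {"a": (3, 60), "b": (4, 40), "c": (5, 50)}   # name: (weight, value)
capacity = 8                                          # the constraint

best_value, best_pick = 0, ()
for r in range(len(items) + 1):
    for pick in combinations(items, r):
        weight = sum(items[i][0] for i in pick)
        value = sum(items[i][1] for i in pick)
        if weight <= capacity and value > best_value:  # feasible and better
            best_value, best_pick = value, pick
```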


orchestration

The process of creating an end-to-end flow that can train, run, deploy, test, and evaluate a machine learning model, and that uses automation to coordinate the system, often using microservices.



parameter

A configurable part of a model that is internal to the model and whose values are estimated or learned from data. Parameters are aspects of the model that are adjusted during the training process to help the model accurately predict the output. The model's performance and predictive power largely depend on the values of these parameters.


party

In Federated Learning, an entity that contributes data for training a common model. The data is not moved or combined, but each party gets the benefit of the federated training.


payload

The data that is passed to a deployment to get back a score, prediction, or solution.
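As an illustration, a scoring payload is often a JSON document listing field names and rows of values. The `input_data`/`fields`/`values` shape below follows the Watson Machine Learning online-scoring convention; the field names and values are invented:

```python
# Build the JSON body sent to a deployment's endpoint URL.
import json

payload = {
    "input_data": [{
        "fields": ["age", "income"],              # invented feature names
        "values": [[34, 52000.0], [51, 48000.0]]  # one row per prediction
    }]
}

body = json.dumps(payload)
```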

payload logging

The capture of payload data and deployment output to monitor ongoing health of AI in business applications.

physical model

A definition of the physical structures and relationships of data.


pipeline

  • In Orchestration Pipelines, an end-to-end flow of assets from creation through deployment.
  • In AutoAI, a candidate model.

pipeline leaderboard

In AutoAI, a table that shows the list of automatically generated candidate models, as pipelines, ranked according to the specified criteria.


placeholder

A field or variable to be replaced with a value.


policy

  • A strategy or rule that an agent follows to determine the next action based on the current state.
  • A set of rules that protect data by controlling access to data assets or anonymizing sensitive data within data assets.
  • A governance artifact that consists of one or more data protection and governance rules.

predictive analytics

A business process and a set of related technologies that are concerned with the prediction of future possibilities and trends. Predictive analytics applies such diverse disciplines as probability, statistics, machine learning, and artificial intelligence to business problems to find the best action for a specific situation. See also data mining.

pretrained model

An AI model that was previously trained on a large data set to accomplish a specific task. Pretrained models are used instead of building a model from scratch.

primary category

In IBM Knowledge Catalog, the category that contains the governance artifact. A category is similar to a folder or directory that organizes a user's governance artifacts.


privacy

Assurance that information about an individual is protected from unauthorized access and inappropriate use.


profile

The generated metadata and statistics about the textual content of data.


project

A collaborative workspace for working with data and other assets.


publish

To copy an asset into a catalog.


Python

A programming language that is used in data science and AI.

Python DOcplex model

A model formulation expressed in Python.

Python function

A function that contains Python code to support a model in production.


quality rule

One or more conditions required for a data record to meet quality standards. During data quality analysis, data records are checked against these conditions.



R

An extensible scripting language that is used in data science and AI that offers a wide variety of analytic, statistical, and graphical functions and techniques.


read

To copy data into an application to manipulate or analyze it.


redact

To replace all data values in a column with the same string to hide sensitive values, data format, and any relationships between values. A form of masking.

reference data set

A governance artifact that defines values for specific types of columns.


refine

To cleanse and shape data.

reinforcement learning

A machine learning technique in which an agent learns to make sequential decisions in an environment to maximize a reward signal. Inspired by trial and error learning, agents interact with the environment, receive feedback, and adjust their actions to achieve optimal policies.


reward

A signal used to guide an agent, typically a reinforcement learning agent, that provides feedback on the goodness of a decision.


rule

In IBM Knowledge Catalog, a governance artifact that contains information, criteria, or logic to analyze or protect data. Some rules are enforced and some are informational.

runtime environment

The predefined or custom hardware and software configuration that is used to run tools or jobs, such as notebooks.



scoring

  • In machine learning, the process of measuring the confidence of a predicted outcome.
  • The process of computing how closely the attributes for an incoming identity match the attributes of an existing entity.


script

A file that contains Python or R scripts to support a model in production.

secondary category

An optional category that references the governance artifact.


self-attention

An attention mechanism that uses information from the input data itself to determine what parts of the input to focus on when generating output.

self-supervised learning

A machine learning training method in which a model learns from unlabeled data by masking tokens in an input sequence and then trying to predict them. An example is "I like ________ sprouts".

sensitive data

Data that contains information that should not be visible to all users. For example, personally identifiable information or other information that is restricted by privacy regulations.


shape

To customize data by filtering, sorting, or removing columns; joining tables; and performing operations that include calculations, data groupings, hierarchies, and more.

small data

Data that is accessible and comprehensible by humans. See also structured data.

SQL pushback

In SPSS Modeler, the process of performing many data preparation and mining operations directly in the database through SQL code.

structured data

Data that resides in fixed fields within a record or file. Relational databases and spreadsheets are examples of structured data. See also unstructured data, small data.

structured information

Items stored in structured resources, such as search engine indices, databases, or knowledge bases.


substitute

To replace data in a column with values that don't match the original format but retain referential integrity.


SuperNode

An SPSS Modeler node that shrinks a data stream by encapsulating several nodes into one.

supervised learning

A machine learning training method in which a model is trained on a labeled dataset to make predictions on new data.


text classification

A model that automatically identifies and classifies text into specified categories.

time series

A set of values of a variable at periodic points in time.
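For example, a short series and a 3-point moving average, a common smoothing operation (all values invented):

```python
# A time series: one value per periodic point in time.
series = [12, 15, 14, 18, 21, 19]

window = 3
smoothed = [
    sum(series[i:i + window]) / window
    for i in range(len(series) - window + 1)
]
```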

trained model

A model that is trained with actual data and is ready to be deployed to predict outcomes when presented with new data.


training

The initial stage of model building, involving a subset of the source data. The model learns by example from the known data. The model can then be tested against a further, different subset for which the outcome is already known.

training data

A set of annotated documents that can be used to train machine learning models.

training set

A set of labeled data that is used to train a machine learning model by exposing it to examples and their corresponding labels, enabling the model to learn patterns and make predictions.

transfer learning

A machine learning strategy in which a trained model is applied to a completely new problem.


transparency

Sharing appropriate information with stakeholders on how an AI system has been designed and developed. Examples of this information are what data is collected, how it will be used and stored, and who has access to it; and test results for accuracy, robustness, and bias.

Turing test

Proposed by Alan Turing in 1950, a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human.


unbounded problem

A Decision Optimization problem where an infinite number of solutions exists and the objective can take values up to infinity. Unbounded problems are often caused by missing constraints in the model formulation.

unstructured data

Any data that is stored in an unstructured format rather than in fixed fields. Data in a word processing document is an example of unstructured data. See also structured data.

unstructured information

Data that is not contained in a fixed location, such as a natural-language text document.

unsupervised learning

A machine learning training method in which a model is not provided with labeled data and must find patterns or structure in the data on its own.


validation set

A separate set of labeled data that is used to evaluate the performance and generalization ability of a machine learning model during the training process, assisting in hyperparameter tuning and model selection.

virtual agent

A pretrained chat bot that can process natural language to respond and complete simple business transactions, or route more complicated requests to a human with subject matter expertise.


visualization

A graph, chart, plot, table, map, or any other visual representation of data.



weight

A coefficient for a node that transforms input data within the network's layer. Weight is a parameter that an AI model learns through training, adjusting its value to reduce errors in the model's predictions.
