This glossary provides terms and definitions for Cloud Pak for Data as a Service.
The expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate, or deploy, in accordance with their roles and applicable regulatory frameworks. This includes determining who is responsible for an AI mistake, which may require legal experts to assign liability on a case-by-case basis.
A model for machine learning in which the system requests more labeled data only when it needs it.
Metadata that is automatically updated based on analysis by machine learning processes. For example, profiling and data quality analysis automatically update metadata for data assets.
An instance of an environment that is running to provide compute resources to analytical assets.
A multidisciplinary field that studies how to optimize AI's beneficial impact while reducing risks and adverse outcomes. Examples of AI ethics issues are data responsibility and privacy, fairness, explainability, robustness, transparency, environmental sustainability, inclusion, moral agency, value alignment, accountability, trust, and technology misuse.
An organization's act of governing, through its corporate instructions, staff, processes and systems to direct, evaluate, monitor, and take corrective action throughout the AI lifecycle, to provide assurance that the AI system is operating as the organization intends, as its stakeholders expect, and as required by relevant regulation.
The field of research aiming to ensure artificial intelligence systems operate in a manner that is beneficial to humanity and don't inadvertently cause harm, addressing issues like reliability, fairness, transparency, and alignment of AI systems with human values.
A formula applied to data to determine optimal ways to solve analytical problems.
The science of studying data in order to find meaningful patterns in the data and draw conclusions based on those patterns.
artificial intelligence (AI)
The emulation of natural intelligence by a machine.
artificial intelligence system (AI system)
A system that can make predictions, recommendations or decisions that influence physical or virtual environments, and whose outputs or behaviors are not necessarily pre-determined by its developer or user. AI systems are typically trained with large quantities of structured or unstructured data, and might be designed to operate with varying levels of autonomy or none, to achieve human-defined objectives.
An item in a project or catalog that contains metadata about data or data analysis.
An automated training process that considers a series of training definitions and parameters to create a set of ranked pipelines as model candidates.
A method to deploy models that processes input data from a file, data connection, or connected data in a storage bucket, then writes the output to a selected destination.
Systematic error in an AI system that has been designed, intentionally or not, in a way that may generate unfair decisions. Bias can be present both in the AI system and in the data used to train and test it. AI bias can emerge in an AI system as a result of cultural expectations; technical limitations; or unanticipated deployment contexts. See also fairness.
The process of calculating fairness metrics to detect when AI models are delivering unfair outcomes based on certain attributes.
A classification model with two classes. Predictions are a binary choice of one of the two classes.
A word or phrase that defines a business concept in a standard way for an enterprise. Terms can be used to enrich the metadata of data assets and to define the criteria of data protection rules.
The set of governance artifacts, such as business terms and data classes, that describe and enrich data assets.
A repository of assets that an organization shares. Assets in catalogs can be governed by data protection rules and enriched by other governance artifacts, such as classifications, data classes, and business terms. Catalogs can store structured and unstructured data, references to data in external data sources, and other analytical assets, like machine learning models.
In IBM Knowledge Catalog, a collaborative workspace for organizing and managing governance artifacts.
In IBM Knowledge Catalog, a governance artifact that describes the sensitivity level of the data in a data asset.
To ensure that all values in a data set are consistent and correctly recorded.
A member of a group of people who are working together toward a common goal.
A problem that is difficult to solve because it requires multiple decisions to be made involving too many combinations of possible choices. Some examples are finding a grouping, ordering, or the assignment of objects.
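A tiny combinatorial problem of the assignment kind described above can be sketched by brute force. This is a minimal illustration with made-up costs, feasible only because the problem is small; real Decision Optimization problems use dedicated solvers rather than enumeration.

```python
# Assign three tasks to three workers so that total cost is minimized,
# by checking every permutation of task assignments (illustrative only).
from itertools import permutations

cost = [[4, 2, 8],   # cost[worker][task], made-up values
        [4, 3, 7],
        [3, 1, 6]]

best_total, best_assignment = min(
    (sum(cost[w][t] for w, t in enumerate(perm)), perm)
    for perm in permutations(range(3))
)
# best_assignment[w] is the task given to worker w
```

The number of permutations grows factorially, which is exactly why such problems become hard as they scale.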
The hardware and software resources defined by an environment definition to run analytical assets.
A performance measurement that compares a model's predicted positive and negative outcomes with the actual positive and negative outcomes.
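The comparison of predicted and actual outcomes is typically tabulated as a confusion matrix. A minimal sketch with made-up binary labels (1 = positive, 0 = negative):

```python
# Count true/false positives and negatives for binary outcomes.
def confusion_matrix(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]
cm = confusion_matrix(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
```

Metrics such as accuracy, precision, and recall are all derived from these four counts.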
A data set that is accessed through a connection to an external data source.
The information required to connect to a database. The actual information that is required varies according to the DBMS and connection method.
In Decision Optimization, a condition that must be satisfied by the solution of a problem.
Automating the tasks of monitoring model performance, retraining with new data, and redeploying to ensure prediction quality.
Core ML deployment
The process of downloading a deployment in Core ML format for use in iOS apps.
- To select, collect, preserve, and maintain content relevant to a specific topic. Curation establishes, maintains, and adds value to data; it transforms data into trusted information and knowledge.
- To create a data asset and prepare it to be published in a catalog. Curation can include enriching the data asset by assigning governance artifacts such as business terms, classification, and data classes, and analyzing the quality of the data in the data asset.
An asset that points to data, for example, to an uploaded file. Connections and connected data assets are also considered data assets.
A governance artifact that categorizes columns in relational data sets according to the type of the data and how the data is used.
The combination of technical and business processes that are used to combine data from disparate sources into meaningful and valuable information.
A large-scale data storage repository that stores raw data in any format in a flat architecture. Data lakes hold structured and unstructured data as well as binary data for the purpose of processing and analysis.
A unified data storage and processing architecture that combines the flexibility of a data lake with the structured querying and performance optimizations of a data warehouse, enabling scalable and efficient data analysis for AI and analytics applications.
The process of collecting critical business information from a data source, correlating the information, and uncovering associations, patterns, and trends. See also predictive analytics.
A visualization of data elements, their relationships, and their attributes.
A collection of optimized data or data-related assets that are packaged for reuse and distribution with controlled access. Data products contain data as well as models, dashboards, and other computational asset types. Unlike data assets in governance catalogs, data products are managed as products with multiple purposes to provide business value.
data protection rule
A governance artifact that specifies what data to control and how to control it. A data protection rule contains criteria and an action.
data quality analysis
The analysis of data against the quality dimensions accuracy, completeness, consistency, timeliness, uniqueness, and validity.
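Two of the quality dimensions listed above, completeness and uniqueness, can be sketched for a single column of made-up values:

```python
# Completeness: share of non-null values in the column.
# Uniqueness: share of distinct values among the non-null values.
column = ["10001", "10002", None, "10002", "10003"]

non_null = [v for v in column if v is not None]
completeness = len(non_null) / len(column)
uniqueness = len(set(non_null)) / len(non_null)
```

The other dimensions (accuracy, consistency, timeliness, validity) require reference data or rules, so they cannot be computed from the column alone.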
data quality definition
A reusable description of an evaluation or condition that can be applied in data quality rules.
data quality rule
A rule that assesses during data quality analysis whether data meets specific conditions and identifies records that do not meet the conditions as rule violations.
The analysis and visualization of structured and unstructured data to discover insights and knowledge.
A collection of data, usually in the form of rows (records) and columns (fields) and contained in a file or database table.
A repository, queue, or feed for reading data, such as a Db2 database.
A collection of data, usually in the form of rows (records) and columns (fields) and contained in a table.
A large, centralized repository of data collected from various sources that is used for reporting and data analysis. It primarily stores structured and semi-structured data, enabling businesses to make informed decisions.
Decision Optimization model
A prescriptive model that can be solved with optimization to provide the best solution to a Decision Optimization problem.
One of a set of variables representing decisions to be made, whose values are determined by the optimization engine while ensuring that all constraints are satisfied and the objective optimized.
A model or application package that is available for use.
A workspace where models are deployed and deployments are managed.
A software methodology that integrates application development and IT operations so that teams can deliver code faster to production and iterate continuously based on market feedback.
A Python API for modeling and solving Decision Optimization problems.
A network destination address that identifies resources, such as services and objects. For example, an endpoint URL is used to identify the location of a model or function deployment when a user sends payload data to the deployment.
The compute resources for running jobs.
An instantiation of the environment template to run analytical assets.
A definition that specifies hardware and software resources to instantiate environment runtimes.
- The ability of human users to trace, audit, and understand predictions that are made in applications that use AI systems.
- The ability of an AI system to provide insights that humans can use to understand the causes of the system's predictions.
In an AI system, the equitable treatment of individuals or groups of individuals. The choice of a specific notion of equity for an AI system depends on the context in which it is used. See also bias.
A property or characteristic of an item within a data set, for example, a column in a spreadsheet. In some cases, features are engineered as combinations of other features in the data set.
The process of selecting, transforming, and creating new features from raw data to improve the performance and predictive power of machine learning models.
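A minimal sketch of deriving a new feature from raw fields, with made-up records (the BMI formula is one common example of a feature engineered as a combination of other features):

```python
# Derive a new feature (body mass index) from two raw fields.
records = [
    {"height_m": 1.8, "weight_kg": 81.0},
    {"height_m": 1.6, "weight_kg": 64.0},
]

for r in records:
    r["bmi"] = r["weight_kg"] / r["height_m"] ** 2  # engineered feature
```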
Identifying the columns of data that best support an accurate prediction or score in a machine learning model.
A centralized repository or system that manages and organizes features, providing a scalable and efficient way to store, retrieve, and share feature data across machine learning pipelines and applications.
In AutoAI, a phase of pipeline creation that applies algorithms to transform and optimize the training data to achieve the best outcome for the model type.
The training of a common machine learning model that uses multiple data sources that are not moved, joined, or shared. The result is a better-trained model without compromising data security.
A collection of nodes that define a set of steps for processing data or training a model.
A graphical representation of a project timeline and duration in which schedule data is displayed as horizontal bars along a time scale.
Governance items that enrich or control data assets. Governance artifacts include business terms, classifications, data classes, policies, rules, and reference data sets.
A governance artifact that provides a natural-language description of the criteria that are used to determine whether data assets are compliant with business objectives.
A task-based process to control the creating, modifying, and deleting of governance artifacts.
A catalog that has enforcement of data protection rules enabled.
A tool for creating analytical assets by visually coding. A canvas is an area on which to place objects or nodes that can be connected to create a flow.
graphics processing unit (GPU)
A specialized processor designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs are heavily utilized in machine learning due to their parallel processing capabilities.
A set of labeled data that is intentionally withheld from both the training and validation sets, serving as an unbiased assessment of the final model's performance on unseen data.
Human involvement in reviewing decisions rendered by an AI system, enabling human autonomy and accountability for decisions.
In machine learning, a parameter whose value is set before training as a way to increase model accuracy.
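A minimal sketch of a hyperparameter, using a tiny 1-D nearest-neighbors classifier with made-up points: the value of k is chosen before training rather than learned from the data, and changing it can change the prediction.

```python
# Classify a query point by majority vote of its k nearest 1-D neighbors.
def knn_predict(train, query, k):
    nearest = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

train = [(1.0, "a"), (1.5, "a"), (3.0, "b"), (3.5, "b"), (4.0, "b")]
label_k1 = knn_predict(train, 1.9, k=1)  # only the single nearest point votes
label_k5 = knn_predict(train, 1.9, k=5)  # all five points vote
```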
A software package that contains a set of libraries.
- To feed data into a system for the purpose of creating a base of knowledge.
- To continuously add a high volume of real-time data to a database.
An accurate or deep understanding of something. Insights are derived using cognitive analytics to provide current snapshots and predictions of customer behaviors and attitudes.
A purpose or goal expressed by customer input to a chatbot, such as answering a question or processing a bill payment.
A separately executable unit of work.
Raw data that is assigned labels to add context or meaning so that it can be used to train machine learning models. For example, numeric values might be labeled as zip codes or ages to provide context for model inputs and outputs.
large language model
A language model with a large number of parameters, trained on a large quantity of text.
- The history of the flow of data through assets.
- The history of the events performed on an asset.
A logical representation of data objects that are related to a business domain.
machine learning (ML)
A branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving the accuracy of AI models.
machine learning framework
The libraries and runtime for training and deploying a model.
machine learning model
An AI model that is trained on a set of data to develop algorithms that it can use to analyze and learn from new data.
To replace sensitive data values in a column of a data set. Masking methods vary in data utility and privacy from providing similarly formatted replacement values that retain referential integrity to providing the same replacement value for the entire column.
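The two ends of the utility-versus-privacy range mentioned above can be sketched with made-up values: redaction replaces every value with the same string, while deterministic substitution maps each distinct value to a stable pseudonym so that referential integrity is preserved (equal inputs yield equal outputs).

```python
# Two illustrative masking methods applied to a column of values.
import hashlib

emails = ["ana@example.com", "bo@example.com", "ana@example.com"]

# Redaction: same replacement string for the entire column.
redacted = ["XXXXXXXX" for _ in emails]

# Deterministic substitution: stable pseudonym per distinct value.
def substitute(value):
    return "user_" + hashlib.sha256(value.encode()).hexdigest()[:8]

substituted = [substitute(v) for v in emails]
```

The substitution scheme here is a hypothetical sketch; production masking uses methods that can also preserve the original format.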
A flow that produces permanently masked copies of data.
- For model training, reference data that remains the same for several jobs on the same model but that can be changed, if necessary.
- In Match 360, a consolidated view of data from the disparate sources.
mathematical programming (MP)
A field of mathematics, or operations research, used to model and solve Decision Optimization problems. This encompasses linear, integer, mixed-integer, and nonlinear programming.
A method of importing metadata that is associated with data assets, including process metadata that describes the lineage of data assets and technical metadata that describes the structure of data assets.
A discrepancy between the goals or behaviors that an AI system is optimized to achieve and the true, often complex, objectives of its human users or designers.
See machine learning.
- The practice of collaboration between data scientists and operations professionals to help manage the production machine learning (or deep learning) lifecycle. MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements. It involves model development, training, validation, deployment, monitoring, and management and uses methods like CI/CD.
- A methodology that takes a machine learning model from development to production.
- In a machine learning context, a set of functions and algorithms that have been trained and tested on a data set to provide predictions or decisions.
- In Decision Optimization, a mathematical formulation of a problem that can be solved with CPLEX optimization engines using different data sets.
In Decision Optimization, the mathematical formulation of a model expressed as a list of decision variables, one or more objective functions to be maximized or minimized, and some constraints to be satisfied.
A methodology for managing the full lifecycle of an AI model, including training, deployment, scoring, evaluation, retraining, and updating.
A modeling syntax that resembles natural human language (in English) to formulate models.
natural language processing (NLP)
A field of artificial intelligence and linguistics that studies the problems inherent in the processing and manipulation of natural language, with an aim to increase the ability of computers to understand human languages.
natural language processing library
A library that provides basic natural language processing functions for syntax analysis and out-of-the-box pre-trained models for a wide variety of text processing tasks.
A mathematical model for predicting or classifying cases by using a complex mathematical scheme that simulates an abstract version of brain cells. A neural network is trained by presenting it with a large number of observed cases, one at a time, and allowing it to update itself repeatedly until it learns the task.
The graphical representation of a data operation in a stream or flow. Different types of nodes have different shapes to indicate the type of operation that they perform.
An interactive document that contains executable code, descriptive text for that code, and the results of any code that is run.
The part of the notebook editor that executes code and returns the computational results.
To replace data in a column with similarly formatted values that match the original format. A form of masking.
A method of storing data, typically used in the cloud, in which data is stored as discrete units, or objects, in a storage pool or repository that does not use a file hierarchy but that stores all objects at the same level.
In Decision Optimization and operations research, an expression to optimize (that is, either to minimize or to maximize) while satisfying other constraints of the problem.
A model for deep learning that is based on the premise that most human learning takes place upon receiving just one or two examples. This model is similar to unsupervised learning.
A method of accessing a model or Python code deployment through an API endpoint as a web service to generate predictions online, in real time.
An explicit formal specification of the representation of the objects, concepts, and other entities that can exist in some area of interest and the relationships among them.
An asset that runs code in a tool or a job.
A model formulation expressed in OPL modeling language.
In operations research, a solution to a problem that optimizes the objective function (whether linear or quadratic) and satisfies all the other constraints of the problem.
The process of finding the most appropriate solution to a precisely defined problem while respecting the imposed constraints and limitations. For example, determining how to allocate resources or how to find the best elements or combinations from a large set of alternatives.
The process of creating an end-to-end flow that can train, run, deploy, test, and evaluate a machine learning model, using automation to coordinate the system, often with microservices.
A configurable part of the model that is internal to a model and whose values are estimated or learned from data. Parameters are aspects of the model that are adjusted during the training process to help the model accurately predict the output. The model's performance and predictive power largely depend on the values of these parameters.
In Federated Learning, an entity that contributes data for training a common model. The data is not moved or combined but each party gets the benefit of the federated training.
The data that is passed to a deployment to get back a score, prediction, or solution.
The capture of payload data and deployment output to monitor ongoing health of AI in business applications.
A definition of the physical structures and relationships of data.
- In Watson Pipelines, an end-to-end flow of assets from creation through deployment.
- In AutoAI, a candidate model.
In AutoAI, a table that shows the list of automatically generated candidate models, as pipelines, ranked according to the specified criteria.
A field or variable to be replaced with a value.
- A strategy or rule that an agent follows to determine the next action based on the current state.
- A set of rules that protect data by controlling access to data assets or anonymizing sensitive data within data assets.
- A governance artifact that consists of one or more data protection and governance rules.
A business process and a set of related technologies that are concerned with the prediction of future possibilities and trends. Predictive analytics applies such diverse disciplines as probability, statistics, machine learning, and artificial intelligence to business problems to find the best action for a specific situation. See also data mining.
An AI model that was previously trained on a large data set to accomplish a specific task. Pretrained models are used instead of building a model from scratch.
In IBM Knowledge Catalog, the category that contains the governance artifact. A category is similar to a folder or directory that organizes a user's governance artifacts.
Assurance that information about an individual is protected from unauthorized access and inappropriate use.
The generated metadata and statistics about the textual content of data.
A collaborative workspace for working with data and other assets.
To copy an asset into a catalog.
A programming language that is used in data science and AI.
Python DOcplex model
A model formulation expressed in Python.
A function that contains Python code to support a model in production.
One or more conditions required for a data record to meet quality standards. During data quality analysis, data records are checked against these conditions.
An extensible scripting language that is used in data science and AI that offers a wide variety of analytic, statistical, and graphical functions and techniques.
To copy data into an application to manipulate or analyze it.
To replace all data values in a column with the same string to hide sensitive values, data format, and any relationships between values. A form of masking.
reference data set
A governance artifact that defines values for specific types of columns.
To cleanse and shape data.
A machine learning technique in which an agent learns to make sequential decisions in an environment to maximize a reward signal. Inspired by trial and error learning, agents interact with the environment, receive feedback, and adjust their actions to achieve optimal policies.
A signal used to guide an agent, typically a reinforcement learning agent, that provides feedback on the goodness of a decision.
In IBM Knowledge Catalog, a governance artifact that contains information, criteria, or logic to analyze or protect data. Some rules are enforced and some are informational.
The predefined or custom hardware and software configuration that is used to run tools or jobs, such as notebooks.
- In machine learning, the process of measuring the confidence of a predicted outcome.
- The process of computing how closely the attributes for an incoming identity match the attributes of an existing entity.
A file that contains Python or R scripts to support a model in production.
An optional category that references the governance artifact.
An attention mechanism that uses information from the input data itself to determine what parts of the input to focus on when generating output.
A machine learning training method in which a model learns from unlabeled data by masking tokens in an input sequence and then trying to predict them. An example is "I like ________ sprouts".
Data that contains information that should not be visible to all users. For example, personally identifiable information or other information that is restricted by privacy regulations.
To customize data by filtering, sorting, and removing columns; joining tables; and performing operations that include calculations, data groupings, hierarchies, and more.
Data that is accessible and comprehensible by humans. See also structured data.
In SPSS Modeler, the process of performing many data preparation and mining operations directly in the database through SQL code.
Items stored in structured resources, such as search engine indices, databases, or knowledge bases.
To replace data in a column with values that don't match the original format but retain referential integrity.
An SPSS Modeler node that shrinks a data stream by encapsulating several nodes into one.
A machine learning training method in which a model is trained on a labeled dataset to make predictions on new data.
A model that automatically identifies and classifies text into specified categories.
A set of values of a variable at periodic points in time.
A model that is trained with actual data and is ready to be deployed to predict outcomes when presented with new data.
The initial stage of model building, involving a subset of the source data. The model learns by example from the known data. The model can then be tested against a further, different subset for which the outcome is already known.
A set of annotated documents that can be used to train machine learning models.
A set of labeled data that is used to train a machine learning model by exposing it to examples and their corresponding labels, enabling the model to learn patterns and make predictions.
A machine learning strategy in which a trained model is applied to a completely new problem.
Sharing appropriate information with stakeholders on how an AI system has been designed and developed. Examples of this information are what data is collected, how it will be used and stored, and who has access to it; and test results for accuracy, robustness and bias.
Proposed by Alan Turing in 1950, a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human.
A Decision Optimization problem where an infinite number of solutions exists and the objective can take values up to infinity. Unbounded problems are often caused by missing constraints in the model formulation.
Any data that is stored in an unstructured format rather than in fixed fields. Data in a word processing document is an example of unstructured data. See also structured data.
Data that is not contained in a fixed location, such as a natural-language text document.
A machine learning training method in which a model is not provided with labeled data and must find patterns or structure in the data on its own.
A separate set of labeled data that is used to evaluate the performance and generalization ability of a machine learning model during the training process, assisting in hyperparameter tuning and model selection.
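The relationship between the training, validation, and holdout sets defined in this glossary can be sketched as one split of a labeled data set. The 60/20/20 proportions and the data are made up for illustration:

```python
# Split one labeled data set into training, validation, and holdout subsets.
import random

labeled = [(i, i % 2) for i in range(100)]  # made-up (feature, label) pairs
rng = random.Random(42)                     # fixed seed for reproducibility
rng.shuffle(labeled)

train = labeled[:60]         # used to fit the model
validation = labeled[60:80]  # used for hyperparameter tuning, model selection
holdout = labeled[80:]       # withheld until the final, unbiased evaluation
```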
A pretrained chat bot that can process natural language to respond and complete simple business transactions, or route more complicated requests to a human with subject matter expertise.
A graph, chart, plot, table, map, or any other visual representation of data.
A coefficient for a node that transforms input data within the network's layer. Weight is a parameter that an AI model learns through training, adjusting its value to reduce errors in the model's predictions.
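How a weight transforms input data within a layer can be sketched as a weighted sum for a single node. The input values, weights, and bias here are illustrative, not learned:

```python
# One node: each input is scaled by its weight, then summed with a bias.
inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.3, -0.5]  # one weight per input; adjusted during training
bias = 0.1

weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
```

During training, an optimizer adjusts the weight values to reduce the error in the model's predictions.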