Data governance is the process of tracking and controlling data assets based on asset metadata. Catalogs are workspaces where you provide controlled access to governed assets.
Required service
IBM Knowledge Catalog
A catalog contains assets and collaborators. Collaborators are the people who add assets into the catalog and the people who need to use the assets. You can customize data governance to enrich and control data assets in catalogs.
You can set up data governance in an iterative manner. You can start with a simple implementation of data governance that relies on predefined artifacts and default features. Then, as your needs change, you can customize your data governance
framework to better describe and protect your data assets.
To see the tools that you can use to govern data, open the tools and services map and click Governance in the tasks section.
Simplest implementation of data governance
You use a catalog to share assets across your organization. A catalog can act as a feature store by containing data sets with columns that are used as features (inputs) in machine learning models. An IBM Knowledge Catalog administrator creates
the catalog for sharing assets and adds data engineers, data scientists, and business analysts as collaborators. Catalog collaborators can work with catalog assets by copying them into projects and can publish assets that they create in
projects into the catalog.
Catalog collaborators can add assets to the catalog to share with others or find and use assets in the following ways:
Data engineers create cleansed data, virtualized data, and integrated data assets in projects and then publish the assets into the catalog.
Data engineers import tables or files from a data source into the catalog.
Data scientists and business analysts find data assets in catalogs and then add the assets to projects to work with the data.
Data assets accumulate metadata over time in the following ways:
Data assets are profiled, which automatically assigns predefined data classes that describe the format of the data, as the sketch after this list illustrates.
Catalog collaborators add tags, predefined business terms, data classes, classifications, relationships, and ratings to assets.
All actions on assets are automatically saved in the asset history.
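To make the profiling step concrete, the following standalone Python sketch shows one simple way a profiler can assign a data class by matching column values against format patterns. The data classes, the patterns, and the 80% match threshold are invented for illustration; this is not IBM Knowledge Catalog's profiling logic.

```python
import re

# Invented data classes and formats for illustration only; the predefined
# data classes in IBM Knowledge Catalog are far more extensive.
DATA_CLASS_PATTERNS = {
    "Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US Phone Number": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "Date (ISO 8601)": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def profile_column(values, threshold=0.8):
    """Assign a data class when enough non-empty values match its format."""
    non_empty = [v for v in values if v]
    for name, pattern in DATA_CLASS_PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.match(v))
        if non_empty and matches / len(non_empty) >= threshold:
            return name
    return None

column = ["alice@example.com", "bob@example.com", "carol@example.com"]
print(profile_column(column))  # Email Address
```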
You can add or update any of these custom options in your data governance implementation at any time. Your governance team can establish your business vocabulary, import and enrich data with your vocabulary, analyze the data quality, define
rules to protect data, and then publish the data assets to a catalog where data consumers can find them. When your data changes, you can reimport metadata about the tables or files and enrich your data assets with your business vocabulary
and data quality analysis. You can create increasingly precise rules to protect data as you expand your business vocabulary. Throughout the data governance cycle, your data scientists and other data consumers can find trusted data in catalogs.
Data governance is a continuous cycle of refreshing the metadata for data assets to reflect changes in the data and changes in your business vocabulary.
Establish your business vocabulary
Your governance team can establish a business vocabulary that describes the meaning of data with business terms and the format of data with data classes. A business vocabulary helps your business users find what they are looking for by using nontechnical terms.
Your team can quickly establish your business vocabulary by importing your existing business vocabulary or by importing Knowledge Accelerators, which provide from dozens to thousands of governance artifacts.
Your IBM Knowledge Catalog administrator can customize the workflow, organization, properties, and relationships of governance artifacts.
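As a rough illustration of what a vocabulary import involves, the Python sketch below loads business terms from a small CSV export into memory. The column layout is hypothetical; the formats that IBM Knowledge Catalog actually accepts for import may differ.

```python
import csv
import io

# Hypothetical CSV layout for a business vocabulary; the real import
# format may differ.
GLOSSARY_CSV = """term,description,related_data_class
Customer Email,Primary email address of a customer,Email Address
Order Date,Date on which an order was placed,Date (ISO 8601)
"""

def load_glossary(text):
    """Index business terms by name for quick lookup during enrichment."""
    return {row["term"]: row for row in csv.DictReader(io.StringIO(text))}

glossary = load_glossary(GLOSSARY_CSV)
print(sorted(glossary))  # ['Customer Email', 'Order Date']
```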
Import and enrich data assets with your business vocabulary
Data stewards can regularly run metadata import and enrichment jobs that update the catalog with changes to tables or files from your data sources and automatically assign the appropriate business terms and data classes.
When your team adds governance artifacts, the metadata enrichment jobs suggest the new artifacts for new or updated data assets.
When data stewards confirm or adjust business term assignments during metadata enrichment, the machine learning algorithms for term assignment become more accurate for your data.
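The product's term-assignment models are more sophisticated than this, but the idea of suggesting terms and ranking them by confidence can be sketched with plain string similarity. Everything in this Python example, including the glossary terms and the 0.5 cutoff, is invented for illustration.

```python
from difflib import SequenceMatcher

# Invented glossary terms for illustration.
BUSINESS_TERMS = ["Customer Email", "Order Date", "Postal Code"]

def suggest_terms(column_name, threshold=0.5):
    """Rank glossary terms by name similarity to a column name."""
    cleaned = column_name.replace("_", " ").lower()
    scored = (
        (term, SequenceMatcher(None, cleaned, term.lower()).ratio())
        for term in BUSINESS_TERMS
    )
    suggestions = [(t, round(s, 2)) for t, s in scored if s >= threshold]
    return sorted(suggestions, key=lambda pair: pair[1], reverse=True)

print(suggest_terms("cust_email"))  # [('Customer Email', 0.83)]
```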
Data stewards can configure metadata import and enrichment to run only when changes are detected.
You can use generative AI-based enrichment capabilities to generate descriptive asset and column names, generate meaningful descriptions for assets and columns, and assign business terms.
Data stewards can analyze data quality with default settings during metadata enrichment. Data quality analysis is applied to each asset as a whole and to columns in tables.
Data stewards can create custom data quality definitions and apply them in data quality rules, or apply SQL-based data quality rules.
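As a simplified picture of what a data quality definition checks, the following Python sketch scores one column for completeness and format validity and applies a pass/fail rule. The 95% and 99% thresholds and the date format are assumptions for the example, not product defaults.

```python
import re

def completeness(values):
    """Fraction of values that are present (not None or empty)."""
    return sum(1 for v in values if v not in (None, "")) / len(values)

def format_validity(values, pattern):
    """Fraction of present values that match the expected format."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return 1.0
    return sum(1 for v in present if re.match(pattern, v)) / len(present)

column = ["2024-01-05", "2024-02-11", "", "2024-03-30"]
scores = {
    "completeness": completeness(column),
    "format_validity": format_validity(column, r"^\d{4}-\d{2}-\d{2}$"),
}
# Hypothetical rule: pass only if both dimensions meet their thresholds.
passed = scores["completeness"] >= 0.95 and scores["format_validity"] >= 0.99
print(scores, "pass" if passed else "fail")  # completeness 0.75 -> fail
```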
Your governance team can create a plan for data protection rules by writing policies that document your organization’s standards and guidelines for protecting and managing data. For example, a policy can describe a specific regulation
and how a data protection rule ensures compliance with that regulation.
Your governance team can create data protection rules that define how to keep private information private. Data protection rules are automatically evaluated for enforcement every time a user attempts to access a data asset in any governed
catalog on the platform. Data protection rules can define how to control access to data, mask sensitive values, or filter rows from data assets.
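The following toy Python sketch shows the shape of that evaluation: on every read, a rule checks who is asking and then masks a sensitive column and filters out certain rows. The user group, the masked column, and the filter condition are all invented; the platform's rule engine works differently in detail.

```python
ROWS = [
    {"name": "Alice", "email": "alice@example.com", "age": 34},
    {"name": "Bob", "email": "bob@example.com", "age": 16},
]

def enforce(rows, user_groups):
    """Evaluate a toy data protection rule at access time."""
    if "hr" in user_groups:
        return rows  # rule does not apply; return the data unchanged
    result = []
    for row in rows:
        if row["age"] < 18:
            continue  # row filter: exclude records for minors
        masked = dict(row)
        masked["email"] = "XXXXXXXXXX"  # mask the sensitive value
        result.append(masked)
    return result

print(enforce(ROWS, user_groups={"analyst"}))
# [{'name': 'Alice', 'email': 'XXXXXXXXXX', 'age': 34}]
```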
Your team can start with data protection rules that are based on custom tags, users, or predefined data classes, business terms, and classifications. When your governance team adds governance artifacts, the team can define data protection
rules based on your business vocabulary.
Data engineers can enforce data protection rules on virtualized data.
Data engineers can permanently mask data in data assets with masking flows.
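Unlike rule-based masking at read time, a masking flow writes a masked copy of the data. One common irreversible technique is salted hashing, sketched below in Python; it stands in for the product's advanced masking options rather than reproducing them.

```python
import hashlib

def pseudonymize(value, salt="demo-salt"):
    """Replace a value with a repeatable but irreversible token."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return "user-" + digest[:10]

# The masked copy can be shared; the original values cannot be recovered
# from it, although equal inputs still map to equal tokens.
originals = ["alice@example.com", "bob@example.com"]
print([pseudonymize(v) for v in originals])
```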
The tasks to get started with IBM Knowledge Catalog depend on your goal. The actions that you can take are defined by your Cloud Pak for Data service access roles. Some actions also have workspace role requirements, such as being a collaborator
in a catalog or category.
Use this interactive map to learn about the relationships between your tasks, the tools you need, the services that provide the tools, and where you use the tools.
Select any task, tool, service, or workspace to learn what you need, how to get it, and where to use it.
Some tools perform the same tasks but have different features and levels of automation.
Jupyter notebook editor
Tasks: Prepare data, Visualize data, Build models, Deploy assets
Create a notebook in which you run Python, R, or Scala code to prepare, visualize, and analyze data, or build a model.

AutoAI
Tasks: Build models
Automatically analyze your tabular data and generate candidate model pipelines customized for your predictive modeling problem.

SPSS Modeler
Tasks: Prepare data, Visualize data, Build models
Create a visual flow that uses modeling algorithms to prepare data and build and train a model, using a guided approach to machine learning that doesn’t require coding.

Decision Optimization
Tasks: Build models, Visualize data, Deploy assets
Create and manage scenarios to find the best solution to your optimization problem by comparing different combinations of your model, data, and solutions.

Data Refinery
Tasks: Prepare data, Visualize data
Create a flow of ordered operations to cleanse and shape data. Visualize data to identify problems and discover insights.

Orchestration Pipelines
Tasks: Prepare data, Build models, Deploy assets
Automate the model lifecycle, including preparing data, training models, and creating deployments.

RStudio
Tasks: Prepare data, Build models, Deploy assets
Work with R notebooks and scripts in an integrated development environment.

Federated learning
Tasks: Build models
Create a federated learning experiment to train a common model on a set of remote data sources. Share training results without sharing data.

Deployments
Tasks: Deploy assets, Monitor models
Deploy and run your data science and AI solutions in a test or production environment.

Catalogs
Tasks: Catalog data, Governance
Find and share your data and other assets.

Metadata import
Tasks: Prepare data, Catalog data, Governance
Import asset metadata from a connection into a project or a catalog.

Metadata enrichment
Tasks: Prepare data, Catalog data, Governance
Enrich imported asset metadata with business context, data profiling, and quality assessment.

Data quality rules
Tasks: Prepare data, Governance
Measure and monitor the quality of your data.

Masking flow
Tasks: Prepare data
Create and run masking flows to prepare copies of data assets that are masked by advanced data protection rules.

Governance
Tasks: Governance
Create your business vocabulary to enrich assets and rules to protect data.

Data lineage
Tasks: Governance
Track data movement and usage for transparency and to determine data accuracy.

AI factsheet
Tasks: Governance, Monitor models
Track AI models from request to production.

DataStage flow
Tasks: Prepare data
Create a flow with a set of connectors and stages to transform and integrate data. Provide enriched and tailored information for your enterprise.

Data virtualization
Tasks: Prepare data
Create a virtual table to segment or combine data from one or more tables.

OpenScale
Tasks: Monitor models
Measure outcomes from your AI models and help ensure the fairness, explainability, and compliance of all your models.

Data replication
Tasks: Prepare data
Replicate data to target systems with low latency, transactional integrity, and optimized data capture.

Master data
Tasks: Prepare data
Consolidate data from the disparate sources that fuel your business and establish a single, trusted, 360-degree view of your customers.
Services you can use
Services add features and tools to the platform.
watsonx.ai Studio
Develop powerful AI solutions with an integrated collaborative studio and industry-standard APIs and SDKs. Formerly known as Watson Studio.
watsonx.ai Runtime
Quickly build, run and manage generative AI and machine learning applications with built-in performance and scalability. Formerly known as Watson Machine Learning.
IBM Knowledge Catalog
Discover, profile, catalog, and share trusted data in your organization.
DataStage
Create ETL and data pipeline services for real-time, micro-batch, and batch data orchestration.
Data Virtualization
View, access, manipulate, and analyze your data without moving it.
Watson OpenScale
Monitor your AI models for bias, fairness, and trust with added transparency on how your AI models make decisions.
Data Replication
Provide efficient change data capture and near real-time data delivery with transactional integrity.
Match360 with Watson
Improve trust in AI pipelines by identifying duplicate records and providing reliable data about your customers, suppliers, or partners.
Manta Data Lineage
Increase data pipeline transparency so you can determine data accuracy throughout your models and systems.
Where you'll work
Collaborative workspaces contain tools for specific tasks.
Project
Where you work with data.
> Projects > View all projects
Catalog
Where you find and share assets.
> Catalogs > View all catalogs
Space
Where you deploy and run assets that are ready for testing or production.
> Deployments
Categories
Where you manage governance artifacts.
> Governance > Categories
Data virtualization
Where you virtualize data.
> Data > Data virtualization
Master data
Where you consolidate data into a 360-degree view.