Data governance is the process of tracking and controlling data assets based on asset metadata. Catalogs are workspaces where you provide controlled access to governed assets.
Required service: IBM Knowledge Catalog
A catalog contains assets and collaborators. Collaborators are the people who add assets into the catalog and the people who need to use the assets. You can customize data governance to enrich and control data assets in catalogs.
Learn more about governance or get started with catalogs and governance in the following sections.
Data governance approaches
You can set up data governance in an iterative manner. You can start with a simple implementation of data governance that relies on predefined artifacts and default features. Then, as your needs change, you can customize your data governance framework to better describe and protect your data assets.
To see the tools that you can use to govern data, open the tools and services map and click Governance in the tasks section.
Simplest implementation of data governance
You use a catalog to share assets across your organization. A catalog can act as a feature store by containing data sets with columns that are used as features (inputs) in machine learning models. An IBM Knowledge Catalog administrator creates the catalog for sharing assets and adds data engineers, data scientists, and business analysts as collaborators. Catalog collaborators can work with catalog assets by copying them into projects, and they can publish assets that they create in projects into the catalog.
Catalog collaborators can add assets to the catalog to share with others or find and use assets in the following ways:
- Data engineers create cleansed data, virtualized data, and integrated data assets in projects and then publish the assets into the catalog.
- Data engineers import tables or files from a data source into the catalog.
- Data scientists and business analysts find data assets in catalogs and then add the assets to projects to work with the data.
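For example, after a data scientist adds a catalog asset to a project, the asset's columns can serve as model features, which is the feature-store pattern mentioned earlier. The following is a minimal sketch with made-up data: in practice the data asset comes from the catalog and is loaded with the data access tooling for its connection type, and the use of `pandas` and `scikit-learn` here is an assumption of this example, not a requirement of the platform.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for a governed data asset that was copied from the catalog into a project.
churn = pd.DataFrame({
    "tenure_months": [1, 24, 36, 3, 48, 6],
    "monthly_charges": [70.0, 35.5, 29.9, 80.0, 25.0, 65.0],
    "churned": [1, 0, 0, 1, 0, 1],
})

# The asset's columns act as features (inputs) for a machine learning model.
feature_columns = ["tenure_months", "monthly_charges"]
model = LogisticRegression().fit(churn[feature_columns], churn["churned"])
print(model.predict(churn[feature_columns].head(1)))  # predict for the first row
```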
Data assets accumulate metadata over time in the following ways:
- Data assets are profiled, which automatically assigns predefined data classes that describe the format of the data. A simplified sketch of this kind of format-based classification follows this list.
- Catalog collaborators add tags, predefined business terms, data classes, classifications, relationships, and ratings to assets.
- All actions on assets are automatically saved in the asset history.
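The profiling step works by matching the format of column values. The following sketch illustrates the general idea only; the data class names, patterns, and match threshold are made up for this example and are not the predefined data classes that profiling actually uses.

```python
import re

# Illustrative patterns only; not the product's predefined data classes.
DATA_CLASS_PATTERNS = {
    "Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US Phone Number": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
}

def assign_data_class(values: list[str], threshold: float = 0.8) -> str | None:
    """Return the data class whose pattern matches at least `threshold` of the values."""
    best_class, best_rate = None, 0.0
    for name, pattern in DATA_CLASS_PATTERNS.items():
        rate = sum(bool(pattern.match(v)) for v in values) / len(values)
        if rate >= threshold and rate > best_rate:
            best_class, best_rate = name, rate
    return best_class

# 4 of the 5 sample values look like email addresses, so the column is classed as Email Address.
print(assign_data_class(["ada@example.com", "grace@example.org", "bob@example.net", "n/a", "kay@example.io"]))
```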
See Creating a catalog.
Customization options for data governance
You can add or update any of the custom options in your data governance implementation at any time. Your governance team can establish your business vocabulary, import and enrich data with your vocabulary, analyze the data quality, define rules to protect data, and then publish the data assets to a catalog where data consumers can find them. When your data changes, you can reimport metadata about the tables or files and enrich your data assets with your business vocabulary and data quality analysis. You can create increasingly precise rules to protect data as you expand your business vocabulary. Throughout the data governance cycle, your data scientists and other data consumers can find trusted data in catalogs. Data governance is a continuous cycle of refreshing the metadata for data assets to reflect changes in the data and changes in your business vocabulary.
Establish your business vocabulary
- Your governance team can establish a business vocabulary that describes the meaning of data with business terms and the format of data with data classes. A business vocabulary helps your business users more easily find what they are looking for using nontechnical terms.
- Your team can quickly establish your business vocabulary by importing your existing vocabulary or by importing Knowledge Accelerators, which provide from dozens to thousands of governance artifacts.
- Your IBM Knowledge Catalog administrator can customize the workflow, organization, properties, and relationships of governance artifacts.
Import and enrich data assets with your business vocabulary
- Data stewards can regularly run metadata import and enrichment jobs that update the catalog with changes to tables or files from your data sources and automatically assign the appropriate business terms and data classes.
- When your team adds governance artifacts, metadata enrichment jobs suggest the new artifacts for new or updated data assets.
- When data stewards confirm or adjust business term assignments during metadata enrichment, the machine learning algorithms for term assignment become more accurate for your data.
- Data stewards can configure metadata import and enrichment to run only when changes are detected, as sketched after this list.
- You can use generative AI-based enrichment capabilities to generate descriptive asset and column names, generate meaningful descriptions for assets and columns, and assign business terms.
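The following is a minimal sketch of the change-detection idea, assuming a metadata enrichment job already exists in a project and that it can be started through the Watson Data API jobs endpoint. The API host, request payload, and fingerprinting approach are assumptions of this example; verify the exact paths and payloads in the IBM Knowledge Catalog API reference.

```python
import hashlib
import json
import requests

API_HOST = "https://api.dataplatform.cloud.ibm.com"   # assumed public API host
PROJECT_ID = "<project-id>"                            # placeholder
ENRICHMENT_JOB_ID = "<metadata-enrichment-job-id>"     # placeholder

def source_fingerprint(table_metadata: list[dict]) -> str:
    """Hash table names, columns, and modification times to detect source changes."""
    canonical = json.dumps(table_metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def run_enrichment_if_changed(token: str, table_metadata: list[dict], last_fingerprint: str) -> str:
    """Start a run of the enrichment job only when the source fingerprint changed."""
    current = source_fingerprint(table_metadata)
    if current == last_fingerprint:
        print("No changes detected; skipping the enrichment run.")
        return current
    response = requests.post(
        f"{API_HOST}/v2/jobs/{ENRICHMENT_JOB_ID}/runs",
        params={"project_id": PROJECT_ID},
        headers={"Authorization": f"Bearer {token}"},
        json={"job_run": {}},  # payload shape is an assumption; check the API reference
        timeout=30,
    )
    response.raise_for_status()
    print("Enrichment job run started:", response.status_code)
    return current
```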
Analyze data quality
- Data stewards can analyze data quality with default settings during metadata enrichment. Data quality analysis is applied to each asset as a whole and to columns in tables.
- Data stewards can create custom data quality definitions and apply them in data quality rules, or apply SQL-based data quality rules.
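The logic behind an SQL-based data quality rule can be pictured as counting the rows that fail a condition and turning the result into a pass percentage. The following sketch uses an in-memory SQLite table with made-up names; it illustrates the idea only and is not the product's rule engine.

```python
import sqlite3

def email_format_quality(conn: sqlite3.Connection) -> float:
    """Return the percentage of CUSTOMER rows whose EMAIL column looks like an email address."""
    total = conn.execute("SELECT COUNT(*) FROM CUSTOMER").fetchone()[0]
    failing = conn.execute(
        "SELECT COUNT(*) FROM CUSTOMER WHERE EMAIL NOT LIKE '%_@_%._%'"
    ).fetchone()[0]
    return 100.0 if total == 0 else 100.0 * (total - failing) / total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER (ID INTEGER, EMAIL TEXT)")
conn.executemany(
    "INSERT INTO CUSTOMER VALUES (?, ?)",
    [(1, "ada@example.com"), (2, "not-an-email"), (3, "grace@example.org")],
)
print(f"Email format rule score: {email_format_quality(conn):.1f}%")  # 66.7%
```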
Protect your data with rules
- Your governance team can create a plan for data protection rules by writing policies that document your organization’s standards and guidelines for protecting and managing data. For example, a policy can describe a specific regulation and how a data protection rule ensures compliance with that regulation.
- Your governance team can create data protection rules that define how to keep private information private. Data protection rules are automatically evaluated for enforcement every time a user attempts to access a data asset in any governed catalog on the platform. Data protection rules can define how to control access to data, mask sensitive values, or filter rows from data assets. A conceptual sketch of this condition-and-action pattern follows this list.
- Your team can start with data protection rules that are based on custom tags, users, or predefined data classes, business terms, and classifications. When your governance team adds governance artifacts, the team can define data protection rules based on your business vocabulary.
- Data engineers can enforce data protection rules on virtualized data.
- Data engineers can permanently mask data in data assets with masking flows.
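Data protection rules pair a condition (for example, a column's data class or business term) with an action such as masking values or denying access. The following minimal, conceptual sketch shows that pattern with made-up names; actual enforcement happens in the platform whenever a governed asset is accessed, not in user code.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    data_class: str           # condition: the governed data class assigned to a column
    allowed_groups: set[str]  # users in these groups see unmasked values
    action: str               # "mask" or "deny"

def apply_rules(value: str, column_data_class: str, user_groups: set[str], rules: list[Rule]) -> str:
    """Return the value that a user sees after the first matching rule is applied."""
    for rule in rules:
        if rule.data_class == column_data_class and not (user_groups & rule.allowed_groups):
            if rule.action == "mask":
                return "X" * max(len(value) - 4, 0) + value[-4:]  # keep the last 4 characters
            return "[access denied]"
    return value

ssn_rule = Rule(data_class="US Social Security Number", allowed_groups={"hr-admins"}, action="mask")
print(apply_rules("123-45-6789", "US Social Security Number", {"data-scientists"}, [ssn_rule]))
# XXXXXXX6789
```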
Getting started with IBM Knowledge Catalog
The tasks to get started with IBM Knowledge Catalog depend on your goal. The actions that you can take are defined by your Cloud Pak for Data service access roles. Some actions also have workspace role requirements, such as being a collaborator in a catalog or category.
To check your service access roles, see Determining your IBM Cloud account and service access roles. To understand your IBM Knowledge Catalog roles, see user roles and permissions.
The following table shows common goals, the required Cloud Pak for Data service access roles, and links to information to get you started.
Goal | Required Cloud Pak for Data service access role | More information |
---|---|---|
Set up or administer IBM Knowledge Catalog | Manager | • Planning to implement data governance • Setting up IBM Knowledge Catalog • Managing IBM Knowledge Catalog |
Find assets or features in a catalog | Any role | • Finding assets in a catalog • Searching for assets across the platform • Adding a catalog asset to a project |
Curate data | Data Steward or Data Engineer | • Curating data • Planning to curate data |
Manage data quality | Data Steward or Data Engineer | • Managing data quality |
Create governance artifacts | Data Steward or Data Engineer | • Managing governance artifacts • Importing Knowledge Accelerators • Planning to implement a governance framework |
Create data protection rules | Data Steward or Data Engineer | • Data protection rules • Planning to protect data with rules |
Run IBM Knowledge Catalog APIs | The same role that is required to perform the task in the UI | • IBM Knowledge Catalog API |
Generate reports on IBM Knowledge Catalog | Reporting administrator | • Setting up reporting |
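As a starting point for the API route in the table, the following sketch exchanges an IBM Cloud API key for a bearer token and lists the catalogs that the caller can access. The IAM token endpoint is standard for IBM Cloud; the `/v2/catalogs` path and the response field names shown here are assumptions to be confirmed against the IBM Knowledge Catalog API reference.

```python
import os
import requests

IAM_URL = "https://iam.cloud.ibm.com/identity/token"
API_HOST = "https://api.dataplatform.cloud.ibm.com"    # assumed public API host

def get_token(api_key: str) -> str:
    """Exchange an IBM Cloud API key for an IAM bearer token."""
    response = requests.post(
        IAM_URL,
        data={"grant_type": "urn:ibm:params:oauth:grant-type:apikey", "apikey": api_key},
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["access_token"]

token = get_token(os.environ["IBM_CLOUD_API_KEY"])
response = requests.get(
    f"{API_HOST}/v2/catalogs",                           # path assumed; see the API reference
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()
for catalog in response.json().get("catalogs", []):      # response field names assumed
    print(catalog.get("entity", {}).get("name"))
```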