Profiles of data assets | IBM Cloud Pak for Data as a Service

Profiles of data assets

An asset profile includes generated metadata and statistics about the asset content, and helps you understand what actions to take to improve the data quality. You can see the profile on an asset's Profile page.

Profiles can be created for data assets that contain relational or structured data.

Requirements and restrictions
Ways to create a profile
What is analyzed during profiling?
Profile information

Requirements and restrictions

You can view the profile of assets under the following circumstances.

Required service

Profiling requires the IBM Knowledge Catalog service.

Required permissions

Your roles determine how you can interact with profiles:

To view an asset's Profile page, you can have any role in the project or catalog.
To create or update a profile or to run metadata enrichment in a project, you must have the Admin or Editor role in the project.
To create or update a profile in a catalog, you must have the Admin role in the catalog, or you must have the Editor role and must be an asset owner or an asset member.
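The permission rules above can be read as a small decision function. The following is a minimal sketch for illustration only: the `Member` class, the function name, and the `"Viewer"` role label are hypothetical and are not part of any IBM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical model of a collaborator in a project or catalog."""
    role: str                      # "Admin", "Editor", or any other role
    is_asset_owner: bool = False
    is_asset_member: bool = False


def can_create_or_update_profile(member: Member, workspace: str) -> bool:
    """Apply the rules listed above; workspace is "project" or "catalog"."""
    if workspace == "project":
        # Projects: Admin or Editor is enough.
        return member.role in {"Admin", "Editor"}
    if workspace == "catalog":
        # Catalogs: Admin always; Editor only as asset owner or asset member.
        if member.role == "Admin":
            return True
        return member.role == "Editor" and (
            member.is_asset_owner or member.is_asset_member
        )
    raise ValueError(f"unknown workspace type: {workspace!r}")
```

Viewing needs no such check, because any role in the workspace is sufficient to view the profile.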

Workspaces

You can view the asset profile in these workspaces:

Projects
Catalogs

Types of assets

These types of assets have a profile:

Data assets from relational or nonrelational databases from a connection to the data sources, except Cloudant
Data assets from partitioned data sets, where a partitioned data set consists of multiple files and is represented by a single folder uploaded from the local file system or from file-based connections to the data sources
Data assets from files uploaded from the local file system or from file-based connections to the data sources, with these formats:
- CSV
- XLS, XLSM, XLSX (Only the first sheet in a workbook is profiled.)
- TSV
- Avro
- Parquet
However, structured data files are not profiled when data assets do not explicitly reference them, such as in these circumstances:
- The files are within a connected folder asset. Files that are accessible from a connected folder asset are not treated as assets and are not profiled.
- The files are within an archive file, for example, a .zip file. The archive file is referenced by the data asset and the compressed files are not profiled.

Ways to create a profile

Asset profiles can be created in different ways:

In governed catalogs, profiles for individual data assets are created automatically when the data assets are added to the catalog with these exceptions:
- You disabled automatic profiling for the catalog.
- The asset comes from a connection that is configured to use personal credentials.
- The asset was profiled through metadata enrichment before it was published. Such assets already have a profile that's added to the catalog along with the asset.
In projects and in catalogs without data protection rule enforcement, you can manually create profiles for individual data assets. You can also create a profile manually in a governed catalog if the asset wasn't profiled before.
In projects, you can create and run a metadata enrichment asset to profile large sets of data assets in one go. These asset profiles are available in the project. You can publish the enriched assets with their profiles to any type of catalog. See Managing metadata enrichment.

Within one account, profiling results are copied with the data asset when you publish an asset from a project to a catalog or add it from a catalog to a project. However, if the catalog and the project belong to different accounts, the profiles aren't copied because the set of available data classes might be different.

You can update an individual asset profile from the asset's Profile page in a project or a catalog. If you manually update a profile of a data asset that is included in a metadata enrichment, the profile and analysis information is also reflected in the respective enrichment results. Profiles are also updated when new enrichment results are published.

When you update an existing profile, you can change the data classes to include in the profile. If you exclude a data class that was previously assigned to a column, the updated profile shows Class excluded (from profile) for the respective column unless a different data class was assigned. You will also see Class excluded (from profile) for any columns where you don't have access to the assigned data class.

What is analyzed during profiling?

If you create or update an asset profile from the Profile page in a project or a catalog, columns are analyzed.

When a single asset is profiled in a project or a catalog, the profile is by default created based on the first 5,000 rows of data. If the data asset has more than 250 columns, the profile is created based on the first 1,000 rows of data. If the profile is created through metadata enrichment, sampling is determined by the metadata enrichment settings.
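As an illustration of the default sampling rule for a single asset, the following sketch picks the row limit from the column count. The function and constant names are hypothetical; only the numbers come from the text above, and profiles created through metadata enrichment follow the enrichment settings instead.

```python
from itertools import islice
from typing import Iterable, Iterator, Sequence

DEFAULT_SAMPLE_ROWS = 5_000          # default for single-asset profiling
WIDE_ASSET_SAMPLE_ROWS = 1_000       # used when the asset is "wide"
WIDE_ASSET_COLUMN_THRESHOLD = 250    # more columns than this counts as wide


def profile_sample_size(column_count: int) -> int:
    """Return how many leading rows are profiled for a single asset."""
    if column_count < 0:
        raise ValueError("column_count must not be negative")
    if column_count > WIDE_ASSET_COLUMN_THRESHOLD:
        return WIDE_ASSET_SAMPLE_ROWS
    return DEFAULT_SAMPLE_ROWS


def sample_for_profiling(rows: Iterable[Sequence], column_count: int) -> Iterator[Sequence]:
    """Yield only the first rows of the data, up to the applicable limit."""
    return islice(rows, profile_sample_size(column_count))
```

An asset with exactly 250 columns still uses the 5,000-row default. The smaller sample applies only when the asset has more than 250 columns.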

To identify the structure and content of your data and to classify it, analysis includes the following tasks:

Compute statistics about the data of each analyzed column.
Compute data types for columns and data types distribution.
Compute data formats for columns and formats distribution.
Classify the data and compute data class candidates for columns.
Capture frequency distributions.
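To make the idea of type, format, and frequency distributions concrete, here is a rough sketch for one column of string values. It is not the algorithm that the service uses: the type inference rules and the format mask notation (`9` for a digit, `A` for a letter) are assumptions chosen for illustration, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def infer_type(value: str) -> str:
    """Very simple type inference: integer, then decimal, else string."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        return "string"


def format_mask(value: str) -> str:
    """Replace digits with 9 and letters with A; keep other characters."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def column_distributions(values: Iterable[Optional[str]]) -> Dict[str, Counter]:
    """Count values, inferred types, and format masks, ignoring nulls."""
    present = [v for v in values if v is not None]
    return {
        "frequency": Counter(present),
        "types": Counter(infer_type(v) for v in present),
        "formats": Counter(format_mask(v) for v in present),
    }
```

For example, the values `"12"`, `"7"`, `"3.5"`, `"abc"`, and a second `"12"` give three integer values, one decimal value, and one string value in the type distribution, and the format mask `99` occurs twice.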

Profile information

The profile of a data asset shows information about each column in the data asset.

The Profile tab provides some general information and an overview of the analysis results:

When the profile was created or last updated.
How many columns and rows were analyzed.
The inferred data class for each column and the confidence for that data class. Data classes describe the contents of the data in the column: for example, city, account number, or credit card number. Data classes can be used to mask data or to restrict access to data assets with data protection rules. The data classes appear for each column on the asset's Overview page and on the Profile page.

The confidence of a data class is the percentage of non-null values that match the data class.

Several data classes are more generic identifiers that are detected and assigned at a column level. These data classes are assigned when a more specific data class could not be identified at a value level. Generic identifiers always have a confidence of 100% and include the following data classes: code, date, identifier, indicator, quantity, and text.
The percentage of matching, mismatching, or missing data for each column.
The frequency distribution for all values identified in a column.
Statistics about the data for each column such as the number of distinct values, the percentage of unique values, minimum, maximum, or mean, and sometimes the standard deviation in that column. The number of distinct values indicates how many different values exist in the sampled data for the column. The percentage of unique values indicates the percentage of distinct values that appear only once in the column.
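The definitions of confidence, distinct values, and unique values in the list above can be worked through on a small sample. The sketch below is illustrative only. The postal code matcher is a hypothetical stand-in for a data class, nulls are assumed to be excluded from the distinct count, and the percentage of unique values uses the number of distinct values as the denominator, which is one literal reading of the definition.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple


def confidence(values: Iterable[Optional[str]], matches: Callable[[str], bool]) -> float:
    """Percentage of non-null values that match the data class."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return 100.0 * sum(1 for v in non_null if matches(v)) / len(non_null)


def distinct_and_unique(values: Iterable[Optional[str]]) -> Tuple[int, float]:
    """Return (number of distinct values, percentage that occur only once)."""
    counts = Counter(v for v in values if v is not None)
    distinct = len(counts)
    if distinct == 0:
        return 0, 0.0
    unique = sum(1 for count in counts.values() if count == 1)
    return distinct, 100.0 * unique / distinct


# Hypothetical data class matcher: exactly five digits.
def is_postal_code(value: str) -> bool:
    return re.fullmatch(r"\d{5}", value) is not None


sample = ["10115", "80331", None, "Berlin", "10115"]
print(confidence(sample, is_postal_code))   # 3 of 4 non-null values match: 75.0
print(distinct_and_unique(sample)[0])       # 3 distinct values
```

In this sample, two of the three distinct values (`80331` and `Berlin`) appear only once, so the percentage of unique values is about 66.7% under the assumption stated above.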

Depending on a column's data type, the statistics vary slightly. For example, statistics for a column of data type integer have minimum, maximum, and mean values and a standard deviation value, while statistics for a column of data type string have minimum length, maximum length, and mean length values.
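The difference between integer and string statistics can be sketched as follows. This is an assumption-laden illustration, not the service's implementation: the function name and dictionary keys are hypothetical, and the population standard deviation is used because the text does not say which variant is reported.

```python
from statistics import mean, pstdev
from typing import Any, Dict, Iterable


def column_statistics(values: Iterable[Any]) -> Dict[str, float]:
    """Compute value statistics for integers, length statistics otherwise."""
    present = [v for v in values if v is not None]
    if not present:
        return {}
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return {
            "min": min(present),
            "max": max(present),
            "mean": mean(present),
            "std_dev": pstdev(present),  # assumption: population std deviation
        }
    lengths = [len(str(v)) for v in present]
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
    }
```

Null values are skipped in both branches, so they affect neither the numeric statistics nor the length statistics.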

More detailed information about column data is available when you click the column name. See Detailed profiling results.

The latest asset profile is retained and shown while the data asset exists in the catalog or in the project even if the original data in the data source is temporarily or permanently not available. To remove the profile information, you have these options:

You can manually delete the profile on the Profile page. This option is not available if the asset is subject to any data protection rules.
You can manually delete the data asset from the project or the catalog.
If the asset was added through metadata import, you can rerun the metadata import with the appropriate Delete on reimport option set.
