Profiles of data assets
An asset profile includes generated metadata and statistics about the asset content, and helps you understand what actions to take to improve the data quality. You can see the profile on an asset's Profile page.
Profiles can be created for data assets that contain relational or structured data.
- Requirements and restrictions
- Ways to create a profile
- What is analyzed during profiling?
- Profile information
Requirements and restrictions
You can view the profile of assets under the following circumstances.
- Required service
- Watson Knowledge Catalog service.
- Required permissions
- To view the Profile page, you can have any role in the project or catalog.
- To create or update a profile or to run metadata enrichment in a project, you must have the Admin or Editor role in the project.
- To create or update a profile in a catalog, you must have the Admin role in the catalog, or you must have the Editor role and must be an asset owner or an asset member.
- Workspaces
- You can view the asset profile in these workspaces:
- Projects
- Catalogs
- Types of assets
- These types of assets have a profile:
- Data assets from relational or nonrelational databases from a connection to the data sources, except Cloudant
- Data assets from partitioned data sets, where a partitioned data set consists of multiple files and is represented by a single folder uploaded from the local file system or from file-based connections to the data sources
- Data assets from files uploaded from the local file system or from file-based connections to the data sources, with these formats:
- CSV
- XLS, XLSM, XLSX (Only the first sheet in a workbook is profiled.)
- TSV
- Avro
- Parquet
However, structured data files are not profiled when data assets do not explicitly reference them, such as in these circumstances:
- The files are within a connected folder asset. Files that are accessible from a connected folder asset are not treated as assets and are not profiled.
- The files are within an archive file. The archive file is referenced by the data asset and the compressed files are not profiled.
- Data assets that contain documents with unstructured data. Documents with a size of up to 100 MB can be profiled. Larger documents are not profiled. These types of documents can be profiled:
- Microsoft Word documents with these mime types:
- application/msword
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- PDF documents with the mime type application/pdf
- Plain text documents with the mime type text/plain
- HTML documents with the mime type text/html
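The document restrictions above (supported mime types and the 100 MB limit) can be expressed as a simple pre-check. This is a minimal sketch, not product code: the function name and constants are hypothetical, and reading "100 MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
# Hypothetical helper for illustration; not part of any product API.
SUPPORTED_MIME_TYPES = {
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/pdf",
    "text/plain",
    "text/html",
}

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_DOCUMENT_BYTES = 100 * 1024 * 1024


def is_profilable_document(mime_type: str, size_bytes: int) -> bool:
    """Return True if the document has a listed mime type and is within the size limit."""
    return mime_type in SUPPORTED_MIME_TYPES and size_bytes <= MAX_DOCUMENT_BYTES
```

For example, a 2 MB PDF passes the check, while a PNG image or a 150 MB text file does not.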
Ways to create a profile
Profiles are created differently for data assets with relational or structured data than for data assets with unstructured data.
Relational and structured data
Profiles for data assets that contain structured or relational data can be created in different ways:
- In governed catalogs, profiles for individual data assets are created automatically when the data assets are added to the catalog, with these exceptions:
- You disabled automatic profiling for the catalog.
- The asset comes from a connection that is configured to use personal credentials.
- The asset was profiled through metadata enrichment before it was published. Such assets already have a profile that's added to the catalog along with the asset.
- In projects and in catalogs without data protection rule enforcement, you can manually create profiles for individual data assets. You can also create a profile manually in a governed catalog if the asset wasn't profiled before.
- In projects, you can create and run a metadata enrichment asset to profile large sets of data assets at once. These asset profiles are available in the project. You can publish the enriched assets with their profiles to any type of catalog. See Managing metadata enrichment.
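The rules for automatic profiling in governed catalogs, with their three exceptions, can be summarized as a small decision function. This is a sketch under stated assumptions: the class and field names are hypothetical, and treating catalogs without rule enforcement as manual-only is an inference from the list above, not a documented guarantee.

```python
from dataclasses import dataclass


@dataclass
class AssetContext:
    """Hypothetical inputs; the field names are illustrative only."""
    catalog_is_governed: bool
    automatic_profiling_enabled: bool
    uses_personal_credentials: bool
    already_profiled_by_enrichment: bool


def is_profiled_automatically(ctx: AssetContext) -> bool:
    """Return True if a profile would be created when the asset is added to the catalog."""
    if not ctx.catalog_is_governed:
        # Assumption: without rule enforcement, profiles are created manually.
        return False
    if not ctx.automatic_profiling_enabled:
        return False  # automatic profiling was disabled for the catalog
    if ctx.uses_personal_credentials:
        return False  # connection is configured to use personal credentials
    if ctx.already_profiled_by_enrichment:
        return False  # the existing profile is added along with the asset
    return True
```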
Within one account, profiling results are copied with the data asset when you publish an asset from a project to a catalog or add it from a catalog to a project. However, if the catalog and the project belong to different accounts, the profiles aren't copied because the set of available data classes might be different.
You can update an individual asset profile from the asset's Profile page in a project or a catalog. If you manually update a profile of a data asset that is included in a metadata enrichment, the profile and analysis information is also reflected in the respective enrichment results. Profiles are also updated when new enrichment results are published.
When you update an existing profile, you can change the data classes to include in the profile. If you exclude a data class that was previously assigned to a column, the updated profile shows Class excluded (from profile) for the respective column unless a different data class was assigned. You will also see Class excluded (from profile) for any columns where you don't have access to the assigned data class.
Unstructured data
Profiles for unstructured data assets are always created automatically. However, the data assets must be uploaded directly to the project or catalog. Unstructured documents that are added as connected assets are not profiled.
What is analyzed during profiling?
Data assets with relational or structured data and data assets with unstructured data are analyzed differently.
Relational and structured data
If you create or update a profile for a data asset with structured or relational data from the Profile page in a project or a catalog, columns are analyzed.
When a single asset is profiled in a project or a catalog, the profile is by default created based on the first 5,000 rows of data. If the data asset has more than 250 columns, the profile is created based on the first 1,000 rows of data. If the profile is created through metadata enrichment, sampling is determined by the metadata enrichment settings.
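The default sampling rule for a single asset can be written as a one-line decision. The following sketch only restates the numbers given above; the function and constant names are hypothetical, and it does not cover metadata enrichment, where sampling follows the enrichment settings.

```python
DEFAULT_SAMPLE_ROWS = 5_000          # default for a single profiled asset
WIDE_ASSET_SAMPLE_ROWS = 1_000       # used when the asset has many columns
WIDE_ASSET_COLUMN_THRESHOLD = 250    # "more than 250 columns"


def default_sample_rows(column_count: int) -> int:
    """Return how many leading rows the default profile is based on."""
    if column_count > WIDE_ASSET_COLUMN_THRESHOLD:
        return WIDE_ASSET_SAMPLE_ROWS
    return DEFAULT_SAMPLE_ROWS
```

An asset with exactly 250 columns still uses the first 5,000 rows; the smaller sample applies from 251 columns on.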
To identify the structure and content of your data and to classify it, analysis includes the following tasks:
- Compute statistics about the data of each analyzed column.
- Compute data types for columns and the distribution of data types.
- Compute data formats for columns and the distribution of formats.
- Classify the data and compute data class candidates for columns.
- Capture frequency distributions.
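To make the analysis tasks above more concrete, the following sketch computes type, format, and frequency distributions for one column of string values. It is an illustration only and not the algorithm the service uses: the type inference (integer, decimal, string) and the format notation (9 for a digit, A for a letter) are simplifying assumptions, and all function names are hypothetical.

```python
from collections import Counter


def infer_type(value: str) -> str:
    """Classify a raw string value by trying progressively looser casts."""
    for caster, name in ((int, "integer"), (float, "decimal")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_format(value: str) -> str:
    """Map digits to 9 and letters to A; keep all other characters as they are."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def profile_column(values):
    """Return type, format, and frequency distributions for the non-null values."""
    non_null = [v for v in values if v is not None and v != ""]
    return {
        "type_distribution": Counter(infer_type(v) for v in non_null),
        "format_distribution": Counter(infer_format(v) for v in non_null),
        "frequency_distribution": Counter(non_null),
    }
```

Calling `profile_column(["10115", "80331", "N/A", None, "10115"])` counts three integer values and one string value, and the frequency distribution records that `"10115"` occurs twice.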
Unstructured data
For profiling unstructured data assets, the plain text is extracted from the document and the first 5 MB of the extracted text is analyzed. During profiling, several patterns are applied to the extracted document content to identify certain types of information. To detect such information, the structure of the information, nearby context, the entire extracted content, and the language the document is written in are considered. The results are then mapped to predefined data classes. For example, if bank account numbers are detected, the data class IBAN is assigned to the document. Or, if the document contains city names, the data class city is assigned.
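The general idea of limiting the analyzed text and combining a pattern with nearby context can be sketched as follows. This is a deliberately simplified illustration, not the detection logic of the service: the regular expression, the context words, and the function names are all hypothetical, and reading "5 MB" as 5 * 1024 * 1024 bytes of UTF-8 text is an assumption.

```python
import re

# Assumption: "5 MB" is interpreted as 5 * 1024 * 1024 bytes of UTF-8 text.
MAX_ANALYZED_BYTES = 5 * 1024 * 1024

# Hypothetical, loose pattern: country code, two check digits, 11-30 alphanumerics.
IBAN_LIKE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")
# Hypothetical context words that make an IBAN match more plausible.
CONTEXT_WORDS = ("iban", "account", "bank")


def truncate_text(text: str, limit: int = MAX_ANALYZED_BYTES) -> str:
    """Keep only the first `limit` bytes of the text, dropping any cut-off character."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def detect_data_classes(text: str) -> set:
    """Return the data classes suggested by pattern and context for the analyzed text."""
    sample = truncate_text(text)
    classes = set()
    has_context = any(word in sample.lower() for word in CONTEXT_WORDS)
    if has_context and IBAN_LIKE.search(sample):
        classes.add("IBAN")
    return classes
```

A real detector would also validate check digits and take the document language into account. This sketch shows only why both the value structure and the surrounding text matter.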
Keep in mind that detection logic applied to unstructured data cannot be expected to be 100% accurate, so some classifications might be wrong.
The assigned data classes cannot be used to block access to or mask data in unstructured data assets with policies.
Profile information
The content of the profile depends on whether the data asset contains relational or structured data or unstructured data.
Relational and structured data
The profile of a data asset that contains relational or structured data shows information about each column in the data set.
The Profile tab provides some general information and an overview of the analysis results:
- When the profile was created or last updated.
- How many columns and rows were analyzed.
- The inferred data class for each column and the confidence for that data class. Data classes describe the contents of the data in the column: for example, city, account number, or credit card number. Data classes can be used to mask data or to restrict access to data assets with data protection rules. The data classes appear for each column on the asset's Overview page and on the Profile page.
The confidence of a data class is the percentage of non-null values that match the data class.
Several data classes are more generic identifiers that are detected and assigned at a column level. These data classes are assigned when a more specific data class could not be identified at a value level. Generic identifiers always have a confidence of 100% and include the following data classes: code, date, identifier, indicator, quantity, and text.
- The percentage of matching, mismatching, or missing data for each column.
- The frequency distribution for all values identified in a column.
- Statistics about the data for each column, such as the number of distinct values, the percentage of unique values, minimum, maximum, or mean, and sometimes the standard deviation in that column. The number of distinct values indicates how many different values exist in the sampled data for the column. The percentage of unique values indicates the percentage of distinct values that appear only once in the column.
Depending on a column’s data type, the statistics vary slightly. For example, statistics for a column of data type integer have minimum, maximum, and mean values and a standard deviation value, while statistics for a column of data type string have minimum length, maximum length, and mean length values.
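The column statistics described above can be illustrated with a short sketch. It rests on two assumptions that the text does not settle: the percentage of unique values is computed as values occurring exactly once divided by the number of distinct values, and the standard deviation is the population standard deviation. The function name and result keys are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def column_statistics(values):
    """Compute distinct and unique counts plus type-dependent statistics."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    distinct = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    stats = {
        "distinct_values": distinct,
        # Assumption: share of distinct values that occur exactly once.
        "percent_unique": 100.0 * once / distinct if distinct else 0.0,
    }
    if non_null and all(type(v) is int for v in non_null):
        stats.update(
            minimum=min(non_null),
            maximum=max(non_null),
            mean=mean(non_null),
            std_dev=pstdev(non_null),  # assumption: population standard deviation
        )
    elif non_null and all(isinstance(v, str) for v in non_null):
        lengths = [len(v) for v in non_null]
        stats.update(
            min_length=min(lengths),
            max_length=max(lengths),
            mean_length=mean(lengths),
        )
    return stats
```

For the integer column `[1, 2, 2, 3]`, this yields three distinct values, of which two occur only once. For a string column, the length-based statistics are returned instead of minimum, maximum, and mean values.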
More detailed information about column data is available when you click the column name. See Detailed profiling results.
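The confidence definition given earlier in this section (the percentage of non-null values that match the data class) can be written out as a small calculation. The matcher here is a hypothetical stand-in for a data class: a five-digit pattern chosen only for illustration, not a predefined data class of the product.

```python
import re
from typing import Callable, Iterable, Optional


def data_class_confidence(
    values: Iterable[Optional[str]],
    matches: Callable[[str], bool],
) -> float:
    """Return the percentage of non-null values for which `matches` is true."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0  # assumption: no non-null values means no confidence
    matched = sum(1 for v in non_null if matches(v))
    return 100.0 * matched / len(non_null)


def looks_like_five_digit_code(value: str) -> bool:
    """Hypothetical matcher: exactly five digits."""
    return re.fullmatch(r"\d{5}", value) is not None
```

With the values `["10115", "80331", None, "unknown"]`, two of the three non-null values match, so the confidence is about 66.7%. The null value is ignored rather than counted as a mismatch.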
Unstructured data
The profile of a data asset that contains a document with unstructured data shows information that allows some high-level assessment of the document content for risk: assigned data classes, value statistics, and metadata such as language, file size, or word count.
Learn more
- Profiling an asset
- Managing metadata enrichment
- Predefined data classes
- Detailed profiling results
- Masking data