Profiles of assets

Data assets that contain textual data have profiles. The profile of a data asset includes generated metadata and statistics about the textual content of the data. You can see the profile when you open the data asset in a catalog or project and go to the asset’s Profile page. All catalog or project members can see data asset profiles.

You must have Watson Knowledge Catalog to see a profile when you view a data asset.

The contents of the profile depend on the type of data:

Relational and structured data

The profile of a data asset that contains relational or structured data shows information about each column in the data set. By default, the profile is created based on the first 5,000 rows of data. However, if the data asset has more than 250 columns, the profile is created based on the first 1,000 rows of data. The profile shows the frequency of the inferred data classes and statistics about the data for each column. Data classes describe the contents of the data in the column: for example, city, account number, or credit card number. Data classes are necessary to mask data with policies. The data classes appear for each column on the asset’s Overview page as well as on the Profile page.
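The row-sampling rule above can be sketched as a small function. This is only an illustration of the stated thresholds, not the service's implementation; the function and constant names are hypothetical, and capping the result at the total number of rows is an assumption for data sets smaller than the sample size.

```python
# Thresholds taken from the text above; the names are illustrative only.
DEFAULT_SAMPLE_ROWS = 5_000
WIDE_SAMPLE_ROWS = 1_000
WIDE_COLUMN_THRESHOLD = 250


def profile_sample_rows(column_count: int, total_rows: int) -> int:
    """Return how many leading rows a profile would be based on."""
    if column_count > WIDE_COLUMN_THRESHOLD:
        limit = WIDE_SAMPLE_ROWS
    else:
        limit = DEFAULT_SAMPLE_ROWS
    # Assumption: a data set with fewer rows than the limit is profiled in full.
    return min(limit, total_rows)


if __name__ == "__main__":
    print(profile_sample_rows(column_count=40, total_rows=120_000))
    print(profile_sample_rows(column_count=300, total_rows=120_000))
```

Note that a data asset with exactly 250 columns still falls under the default sample size, because the smaller sample applies only to assets with more than 250 columns.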

These types of relational and structured data are profiled by column:

  • Data assets from a connection to the data sources listed here, except Cloudant.
  • Partitioned data assets that consist of partitioned files in a folder of the local file system.
  • Data assets from files in Cloud Object Storage or from connections to Cloud Object Storage (S3 API) or Object Storage OpenStack Swift with these formats:
    • CSV
    • Avro
    • Parquet

However, structured data files are not profiled when data assets do not explicitly reference them, such as in these circumstances:

  • The files are within a folder asset. Files that are accessible from a folder asset are not treated as assets and are not profiled.
  • The files are within an archive file. The data asset references the archive file itself, so the compressed files inside it are not profiled.

In catalogs with policy enforcement, profiles for structured data assets are created automatically.

In projects and in catalogs without data protection rule enforcement, you must create profiles for structured data assets.

Unstructured data

The profile of a data asset that contains a document with unstructured data shows the semantic features that are inferred about the text in the document by IBM Watson Natural Language Understanding. In catalogs, profiles for unstructured data assets are created automatically. In projects, you must create profiles for unstructured data assets.
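The rules for when a profile is created automatically can be summarized in a short sketch. The function name, parameter names, and string values are hypothetical. The sketch also assumes that "policy enforcement" and "data protection rule enforcement" refer to the same catalog setting, which the text does not state explicitly.

```python
def profile_is_automatic(data_kind: str, workspace: str,
                         enforces_rules: bool = False) -> bool:
    """Return True if the profile is created automatically, False if you
    must create it yourself.

    data_kind:      "structured" or "unstructured"
    workspace:      "catalog" or "project"
    enforces_rules: whether the catalog enforces policies (ignored for projects)
    """
    if data_kind not in ("structured", "unstructured"):
        raise ValueError(f"unknown data kind: {data_kind!r}")
    if workspace not in ("catalog", "project"):
        raise ValueError(f"unknown workspace: {workspace!r}")

    if workspace == "project":
        # In projects, profiles are always created manually.
        return False
    if data_kind == "unstructured":
        # In catalogs, unstructured data assets are profiled automatically.
        return True
    # Structured data in a catalog: automatic only with enforcement.
    return enforces_rules
```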

Profiling of unstructured data with IBM Watson Natural Language Understanding is currently available only when you provision Watson Knowledge Catalog in the Dallas (US-South) service region on IBM Cloud.

IBM Watson Natural Language Understanding has a document size limit of 50,000 characters. Documents with more than 50,000 characters are not profiled.

These types of documents are profiled for semantic features:

  • Microsoft Word documents with these mime types:
    • application/msword
    • application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • PDF documents with the mime type application/pdf
  • Plain text documents with the mime type text/plain
  • HTML documents with the mime type text/html
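As a rough illustration, the supported mime types and the 50,000-character limit can be combined into one eligibility check. The function and constant names are hypothetical, and how the service counts characters (for example, in the raw file or in the extracted text) is not specified here, so the caller is assumed to supply the count.

```python
# Mime types listed above as profiled for semantic features.
PROFILED_MIME_TYPES = frozenset({
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/pdf",
    "text/plain",
    "text/html",
})

# Document size limit stated for Natural Language Understanding.
NLU_CHARACTER_LIMIT = 50_000


def is_profiled_for_semantic_features(mime_type: str, character_count: int) -> bool:
    """Return True if a document meets both the type and size conditions."""
    if mime_type not in PROFILED_MIME_TYPES:
        return False
    # Documents with more than 50,000 characters are not profiled.
    return character_count <= NLU_CHARACTER_LIMIT
```

For example, a 12,000-character PDF document passes the check, while a 60,000-character plain text document or a spreadsheet of any size does not.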

Semantic features include:

  • Categories: A five-level hierarchy of subject categories. See Categories hierarchy.
  • Concepts: A list of high-level concepts that aren’t necessarily directly referenced in the text.
  • Sentiment: The overall sentiment conveyed by the document.
  • Emotion: The emotions conveyed by the document.

On the Profile page, you can switch between each type of semantic feature.

Natural Language Understanding supports these languages.
