Profiles of assets

Data assets that contain textual data have profiles. The profile of a data asset includes generated metadata and statistics about the textual content of the data. You can see the profile when you open the data asset in a catalog or project and go to the asset’s Profile page. All catalog or project members can see data asset profiles.

The content of the profile depends on the type of data:

Relational and structured data

The profile of a data asset that contains relational or structured data shows information about each column in the data set, based on the first 5000 rows of data. The profile shows the frequency of the inferred attribute classifiers and statistics about the data for each column. Attribute classifiers describe the contents of the data in the column: for example, city, account number, or credit card number. Attribute classifiers are necessary to anonymize data with data policies. The attribute classifiers appear for each column on the asset’s Overview page as well as on the Profile page.
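To make the idea of a column profile concrete, the following is a minimal sketch of counting classifiers and basic statistics per column over the first 5000 rows of a CSV file. It is an illustration only, not the profiling service's implementation: the classifier names and regular expressions (`email`, `integer`), the `classify` and `profile_columns` functions, and the shape of the returned dictionary are all assumptions made for this example.

```python
import csv
import io
import re
from collections import Counter
from itertools import islice

# The profile is based on the first 5000 rows of data.
PROFILE_ROW_LIMIT = 5000

# Hypothetical classifiers, for illustration only. The real attribute
# classifiers are inferred by the profiling service.
CLASSIFIERS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "integer": re.compile(r"^-?\d+$"),
}


def classify(value):
    """Return the name of the first matching classifier, or 'unclassified'."""
    for name, pattern in CLASSIFIERS.items():
        if pattern.match(value):
            return name
    return "unclassified"


def profile_columns(csv_text, row_limit=PROFILE_ROW_LIMIT):
    """Build per-column classifier frequencies and simple statistics."""
    reader = csv.DictReader(io.StringIO(csv_text))
    profile = {}
    # Only the first row_limit data rows contribute to the profile.
    for row in islice(reader, row_limit):
        for column, value in row.items():
            if column is None:
                continue  # extra fields beyond the header are ignored
            value = value or ""  # short rows yield None for missing fields
            stats = profile.setdefault(
                column, {"rows": 0, "distinct": set(), "classifiers": Counter()}
            )
            stats["rows"] += 1
            stats["distinct"].add(value)
            stats["classifiers"][classify(value)] += 1
    return profile
```

Calling `profile_columns(text)` on the text of a CSV file returns one entry per column. The `classifiers` counter in each entry plays the role of the classifier frequencies that the Profile page shows for that column.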

These types of relational and structured data are profiled by column:

  • Data assets from a connection to the data sources listed here, except Cloudant.
  • Partitioned data assets that consist of partitioned files in an IBM Cloud Object Storage folder.
  • Data assets from files in Cloud Object Storage or from connections to Cloud Object Storage (S3 API) or Object Storage OpenStack Swift with these formats:
    • CSV
    • Avro
    • Parquet

However, structured data files are not profiled when data assets do not explicitly reference them, such as in these circumstances:

  • The files are within a folder asset. Files that are accessible from a folder asset are not treated as assets and are not profiled.
  • The files are within an archive file. The data asset references the archive file itself, so the compressed files inside it are not profiled.

In catalogs with data protection policy enforcement, profiles for structured data assets are created automatically.

In projects and in catalogs without data protection policy enforcement, you must create profiles for structured data assets.

Unstructured data

The profile of a data asset that contains a document with unstructured data shows the semantic features that are inferred about the text in the document by IBM Watson Natural Language Understanding. In catalogs, profiles for unstructured data assets are created automatically. In projects, you must create profiles for unstructured data assets.
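The rules for when a profile is created automatically and when you must create it yourself can be summarized as a small decision function. This is a sketch of the rules stated on this page, not a product API: the function name, the parameter names, and the string values `"catalog"`, `"project"`, `"structured"`, and `"unstructured"` are assumptions chosen for the example.

```python
def profile_created_automatically(workspace, data_kind, policy_enforcement=False):
    """Return True if the profile is created automatically, False if you
    must create it yourself.

    workspace: "catalog" or "project" (hypothetical labels)
    data_kind: "structured" or "unstructured" (hypothetical labels)
    policy_enforcement: whether the catalog enforces data protection policies
    """
    if workspace not in ("catalog", "project"):
        raise ValueError(f"unknown workspace: {workspace!r}")
    if data_kind not in ("structured", "unstructured"):
        raise ValueError(f"unknown data kind: {data_kind!r}")

    # In projects, you must always create profiles yourself.
    if workspace == "project":
        return False
    # In catalogs, unstructured data assets are always profiled automatically.
    if data_kind == "unstructured":
        return True
    # In catalogs, structured data assets are profiled automatically only
    # when data protection policy enforcement is on.
    return policy_enforcement
```

For example, `profile_created_automatically("catalog", "structured")` returns `False`, because the default assumes a catalog without policy enforcement.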

These types of documents are profiled for semantic features:

  • Microsoft Word documents with these mime types:
    • application/msword
    • application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • PDF documents with the mime type application/pdf
  • Plain text documents with the mime type text/plain
  • HTML documents with the mime type text/html
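If you want to check ahead of time whether a document falls into one of the types listed above, a simple lookup on the MIME type is enough. The following is a minimal sketch; the `PROFILED_MIME_TYPES` table and the `document_kind` function are hypothetical names for this example and are not part of any product API.

```python
# MIME types of documents that are profiled for semantic features,
# mapped to the document type named in the list above.
PROFILED_MIME_TYPES = {
    "application/msword": "Microsoft Word",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Microsoft Word",
    "application/pdf": "PDF",
    "text/plain": "Plain text",
    "text/html": "HTML",
}


def document_kind(mime_type):
    """Return the document type for a profiled MIME type, or None otherwise."""
    # Drop parameters such as "; charset=utf-8" and normalize the case,
    # because MIME type names are case-insensitive.
    base = mime_type.split(";", 1)[0].strip().lower()
    return PROFILED_MIME_TYPES.get(base)
```

A value such as `text/html; charset=utf-8` resolves to `"HTML"`, while a type that is not in the list, such as `image/png`, resolves to `None`.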

Semantic features include:

  • Categories: A five-level hierarchy of subject categories. See Categories hierarchy.
  • Concepts: A list of high-level concepts that aren’t necessarily directly referenced in the text.
  • Sentiment: The overall sentiment conveyed by the document.
  • Emotion: The emotions conveyed by the document.

On the Profile page, you can switch between the types of semantic features.

Natural Language Understanding supports these languages.

Learn more