Automatic profiling for relational or structured data in Watson Knowledge Catalog
Automatic profiling is based on the attribute classifiers that are provided by IBM. It checks all attribute classifiers that are enabled.
The profile of a data asset that contains relational or structured data shows information about each column in the data set, based on the first 5000 rows of data.
To create or update a profile, you must be assigned the Admin role for the Watson Knowledge Catalog app.
The following steps show how to profile data:
- Open Catalog > View all catalogs.
- Open a catalog.
- Preview an asset. When you preview an asset on the Overview page, the best-matching classifier, if available, is automatically shown next to each column title.
- Click the Profile tab to automatically create a data profile, or click Update Profile to view the latest classification details for this asset.
- Click View All to display the list of classifiers that are analyzed for the selected column. You can also search for a classifier within this pop-up window to see the percentage of matching information.
- Click View Log to view the log file that is created when processing the data profile.
The algorithms analyze the asset and create a data profile that shows the:
- Total number of classifiers that are used when creating the profile.
- Creation date and time of the profile.
- Total number of columns and rows.
- Data type that is detected for each column.
- Matching percentage of all classifiers that have been evaluated for each column.
- Frequency of the 10 most frequent data values contained in a column.
- Statistics details.
Classifiers are listed for each column in descending order of matching percentage, so the best inferred match is shown at the top of each column.
Current limitation: If you manually update an inferred column classification on the Profile page, the new classification is not recognized by data policies.
The colors of the percentage bar and its numbers indicate the extent to which the values match a classifier. The best match resulting from the analysis is shown for each column when you preview an asset. If no match is found or the matching percentage is low, the preview shows that no classifier is detected because classifiers could not be inferred automatically. For details on color coding, click on the left side of the Profile page.
For example, profiling can determine whether a column contains a name, address, email, phone, SSN, date of birth, or credit card number. The platform can then classify an asset as containing sensitive data. The results of profiling and classification can be used when specifying the rule conditions.
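The matching percentage can be pictured with a minimal sketch. It assumes that a classifier is a simple regular expression and that the percentage is the share of sampled values that match it. The classifier names, the patterns, and the `match_percentages` function are hypothetical illustrations, not the IBM-provided classifiers or a product API.

```python
import re

# Hypothetical classifiers for illustration only; the classifiers that
# IBM provides are not regular expressions defined like this.
CLASSIFIERS = {
    "Email Address": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "US Social Security Number": re.compile(r"\d{3}-\d{2}-\d{4}"),
}

SAMPLE_ROWS = 5000  # the profile is based on the first 5000 rows of data


def match_percentages(values, classifiers=CLASSIFIERS, sample_rows=SAMPLE_ROWS):
    """Return (classifier name, matching percentage) pairs, best match first."""
    sample = [str(v) for v in values[:sample_rows]]
    if not sample:
        return []
    results = []
    for name, pattern in classifiers.items():
        # Count the values that the pattern matches completely.
        hits = sum(1 for v in sample if pattern.fullmatch(v))
        results.append((name, 100.0 * hits / len(sample)))
    # Descending order, so the best match is listed first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


column = ["a@example.com", "b@example.org", "123-45-6789", "c@example.net"]
ranked = match_percentages(column)
print(ranked)
```

In this sketch, three of the four sample values match the email pattern, so that classifier is listed first, which mirrors the descending order that the Profile page uses.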
Frequency shows the 10 most frequent data values that are contained in a column, such as how often the name John Smith is found in a Person Name column.
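As a rough illustration of the frequency view, the following sketch counts how often each value occurs in a column and keeps the 10 most frequent ones. The `top_values` function and the sample names are assumptions made for this example and do not describe how the product computes the result.

```python
from collections import Counter


def top_values(values, limit=10):
    """Return up to `limit` (value, count) pairs, most frequent first."""
    return Counter(values).most_common(limit)


# Hypothetical contents of a Person Name column.
names = ["John Smith", "Jane Doe", "John Smith", "Ana Lima", "John Smith", "Jane Doe"]
top = top_values(names)
print(top)
```

Here the name John Smith occurs three times, so it appears at the top of the result.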
- For numeric-based columns, the following information is determined:
  - The number of unique values.
  - The minimum and maximum values.
  - The mean value and the standard deviation, which shows how much the values of this column differ from the mean value of this column.
- For any other columns, the following information is determined:
  - The number of unique strings.
  - The minimum and maximum length of the strings.
  - The mean length of the strings.
- View all shows the frequency for up to:
  - 200 bins for numeric-based columns. Numeric values are grouped into bins, and a scale at the bottom of the Frequency window shows the range of each bin, for example, 1 - 32.
  - 100 values for any other columns.
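The statistics and the binning described above can be sketched as follows. This is an illustration under stated assumptions: it uses the population standard deviation and equal-width bins, neither of which is confirmed for the product, and the function names `numeric_stats`, `string_stats`, and `bin_counts` are hypothetical.

```python
import statistics


def numeric_stats(values):
    """Statistics for a numeric-based column."""
    return {
        "unique": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        # Assumption: population standard deviation.
        "std_dev": statistics.pstdev(values),
    }


def string_stats(values):
    """Statistics for any other column, based on string lengths."""
    lengths = [len(v) for v in values]
    return {
        "unique": len(set(values)),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": statistics.fmean(lengths),
    }


def bin_counts(values, max_bins=200):
    """Group numeric values into equal-width bins (an assumed binning scheme)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so a single bin covers them.
        return [((lo, hi), len(values))]
    width = (hi - lo) / max_bins
    counts = [0] * max_bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), max_bins - 1)
        counts[idx] += 1
    return [((lo + i * width, lo + (i + 1) * width), c) for i, c in enumerate(counts)]


num = numeric_stats([2, 4, 4, 4, 5, 5, 7, 9])
text = string_stats(["ab", "abcd", "ab"])
bins = bin_counts([1, 2, 3, 4], max_bins=2)
print(num, text, bins)
```

Each entry that `bin_counts` returns pairs a bin range with the number of values in that range, similar to the scale shown at the bottom of the Frequency window.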
- If you are allowed to edit an asset in a catalog, you can manually classify the asset or its columns on the Overview page:
  - Click Classification to change the asset classification.
  - Each asset column shows the automatically inferred attribute classifier and its matching percentage.
  - To modify the automatic preselection for columns:
    - Click a classifier to view the list of inferred classifiers and their matching percentages for this column.
    - Select the attribute classifier that best fits this column.