Identifying relationships | IBM Cloud Pak for Data as a Service

Identifying relationships

To identify key relationships between data assets, or to identify overlapping or redundant data in columns, run a relationship analysis.

Relationships are connections between data assets that contain related information. You create such relationships by indicating that key values in one data asset correspond to key values in one or more other data assets. Identifying these connections helps you find out how your data fits together.

A key relationship analysis identifies pairs of primary and foreign keys with a one-to-many relationship between two data assets in a relational database. A primary key is a column or a set of columns that uniquely identifies each record in a data asset. A foreign key is a column or a set of columns in one data asset that references the primary key columns in another data asset. You can use the relationship information to determine how data assets can be joined.

An overlap is when content is available from more than one source. Overlap analysis looks for identical or similar structures in data assets by comparing column values within one data asset or across data assets. Thus, overlap analysis helps you determine whether the data is unique or similar and how much the data might overlap.

The results of relationship analyses are cumulative. Each time that you run an analysis, the analysis adds new relationships or updates relationships that were identified in previous analysis runs.

Prerequisites
Running a key relationship analysis
Assigning, unassigning, or removing key relationships
Running an overlap analysis
Checking overlap analysis results
Accessing job run details

Prerequisites

The DataStage service must be deployed for running relationship analyses.

All operations that are run as part of a metadata enrichment require credentials for secure authorization. Typically, your user API key is used to execute such long-running operations without disruption. If credentials are not available when you try to run a key relationship analysis or an overlap analysis, you are prompted to create an API key. That API key is then saved as your task credentials. See Managing the user API key.

If any of the connections to the data sources are locked, you are asked to enter your personal credentials. This one-time step permanently unlocks the connections for you.

You can run deep relationship analyses without creating asset profiles first. Deep key or overlap relationship analysis does not rely on profile information such as the frequency distribution of column values. Instead, column data is analyzed directly.

Running a key relationship analysis

Run a key relationship analysis to identify single-column key relationships or compound key relationships. The analysis determines potential relationship candidates and suggests the best relationships between data assets based on the confidence for each potential relationship. For computing the confidence of the relationship, the following information is considered:

The number of values from the primary key present in the foreign key
The number of orphan values in the foreign key
The similarity between the names of the columns between the primary key and foreign key
The position of the columns in the data assets.

Relationships can be detected between columns of different data types as follows:

Data types BIGINT and SMALLINT
Data types BIGINT and DECIMAL
Data types BIGINT and CHAR
Data types BIGINT and VARCHAR
Data types DECIMAL and VARCHAR
Data types CHAR and VARCHAR
Data types VARCHAR and LONG VARCHAR

The pairs with a confidence that exceeds the defined suggestion threshold are listed as candidate relationships for an asset. You can also enable automatic assignment of relationships. For more information about the thresholds, see Key relationships.

This type of relationship analysis checks single columns in the selected data assets for similarity of the column name and the profiled data. If you do not run an in-depth analysis, the thresholds that are set in the default enrichment settings are applied.

If you select to run an in-depth analysis, the actual column data is checked to identify primary and foreign keys. You can also choose to identify compound key relationships or to run the analysis on a sample of data instead of the entire data asset. For in-depth analysis, the thresholds that are defined in the enrichment settings do not take effect. Instead, you can set thresholds for each individual analysis run.

In-depth analysis takes longer to complete than the default key relationship analysis. If you also choose to identify compound keys, processing time increases significantly. The analysis might run for days or longer. Only adjacent columns are checked where adjacent means directly next to one another if the data type of the columns allows for key analysis or close to one another when columns in between are skipped because their data types don't allow for key analysis.

To run a key relationship analysis for one or more assets:

Open the metadata enrichment asset.
On the Assets tab, select assets as required.
Select Enrich > Identify relationships from the toolbar. The default analysis type is Key.

You run this type of relationship analysis usually on multiple columns in the selected data assets. By default, the compound key length is set to 2.

In-depth key relationship analysis checks the actual column data and thus is a time-consuming process. You might want to run the analysis on a sample of data instead of the entire data asset.

The default enrichment settings for key relationships do not take effect. Different default settings for suggestions and automatic assignment apply to in-depth relationship analysis:
- The default suggestion threshold is 25%.
- Automatic assignment of relationships is enabled. The default assignment threshold is 90%.
You can overwrite these default settings for each individual analysis run.

To run a shallow key relationship analysis on single columns in your data assets, you can use the Set relationships enrichment option.
Click Run.

Assigning, unassigning, or removing key relationships

After key relationship analysis is complete, you can view and manage relationships here:

On the Keys tab of the asset's details panel, you see a plus icon if candidate relationships were detected. Click the plus icon, check the suggested relationships for this asset, and assign the most suitable ones. You can also unassign or remove relationships for this asset.
The View key relationships link in the enrichment area is active. Click the link to see a list of all detected key relationships on the Keys tab. Check and assign suggested relationships or unassign relationships. You can also remove suggested and assigned relationships.

To see only relationships with an assigned primary key, filter the list by Primary key.

From the key relationship table, you can also open a graphical view of each key's relationships to check and manage the relationships. From the overflow menu for a key, select Open diagram, or select View details and then click View diagram in the side panel.

If you unassign a relationship, its status is reset to Suggested. If you remove a relationship, it's entirely removed. A removed relationship might be suggested again when you rerun the analysis.

Running an overlap analysis

Run an overlap analysis to identify overlapping or potentially redundant data in the columns of one or more data assets.

To run a key relationship analysis for one or more assets:

Open the metadata enrichment asset.
On the Assets tab, select assets as required.
Select Enrich > Identify relationships from the toolbar.
Select Overlap as the analysis type. Optionally, you can adjust the sampling setting. By default, all records of a data asset are analyzed, which can be time-consuming. You might want to choose a lower percentage of data be checked. When you reduce the sample size, the records that are included in the sample are picked at random.

The default enrichment settings do not take effect. Different default settings for suggestions and automatic assignment apply to overlap analysis:
- The default suggestion threshold is 25%.
- Automatic assignment of relationships is enabled. The default assignment threshold is 90%.
You can overwrite these default settings for each individual analysis run.

Checking overlap analysis results

View the analysis results to determine how column data is related.

To check and manage overlap relationships:

Click View key relationships link in the enrichment area. On the Overlap tab, you can see the following information for each detected relationship:
- The base asset and the paired asset
- The relationship status
- The maximum number of distinct values
- The total number and percentage of common column values
Expand an entry to see more information:
- The names of the base column and the paired column that show overlapping data
- The overlap type for the base and the paired column:
  
  Same
  
  Base and paired column have the same values. The percentages for both are 100% and 100%.
  
  Subset
  
  All values in this column are in the other column but the other column contains more values. The percentage for the first column is 100%, but the percentage for the other column is less than 100%.
  
  Superset
  
  All values in the other column are in this column but not all values in this column are in the other column. The percentage for the other column is 100%, but the percentage for the first column is less than 100%.
  
  Overlap
  
  Some of the values in the columns have some overlap but not all of the values in either column are common. The percentages of both are less than 100%.
- The total number of distinct values in the base column, and the number and percentage of values that are common to the paired column
- The total number of distinct values in the paired column, and the number and percentage of values that are common to the base column
Assign relationships as required.

Accessing job run details

While you can access the job details for general metadata enrichment from the side panel, you must go to the project's Jobs tab to see the relationship analysis job. The job is of the type Key Analysis for Metadata Enrichment Assets. The job name follows the pattern metadata_enrichment_name (Relationship Detection).

Learn more

Identifying primary keys

Parent topic: Enriching your data assets