Mask data with data protection rules (IBM Knowledge Catalog)
To mask data, the data must conform to these requirements:
- The data is structured. The data must be in relational tables or CSV, Avro, partitioned data, or Parquet files.
- The column headers contain only alphanumeric characters (a-z, A-Z, 0-9). Column headers can't contain unsupported characters, such as multi-byte characters or special characters.
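The header requirement can be checked before you create a rule. The following is a minimal sketch, not part of IBM Knowledge Catalog; the function name `unsupported_headers` is hypothetical, and the pattern reads the requirement literally, so underscores and spaces are treated as unsupported.

```python
import re

# Only ASCII letters and digits are accepted (a literal reading of the requirement).
_HEADER_PATTERN = re.compile(r"[A-Za-z0-9]+")


def unsupported_headers(headers):
    """Return the headers that contain anything other than a-z, A-Z, 0-9."""
    return [h for h in headers if not _HEADER_PATTERN.fullmatch(h)]


print(unsupported_headers(["CustomerID", "phone_number", "Straße", "zip5"]))
# prints ['phone_number', 'Straße']
```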
When you choose the masking action, you must specify the masking criteria and the masking method.
Masking criteria
The masking criteria identify the columns to mask. You select the type of column property and specify one or more values of that property. The values are logically combined with the OR operator.
Type of column property | Description | Specific values |
---|---|---|
Business term | A business term that is assigned to the column. | Search for and then select one or more published business terms. |
Data class | The data class that is assigned to the column. | Search for and then select one or more published data classes. |
Tag | A tag that is assigned to a column in the asset. | Enter one or more tags, separated by commas. |
Column name | The name of a column. | Enter one or more column names, separated by commas. |
For example, suppose you choose the column property of Data class and the specific values of California State Driver's License and Nevada State Driver's License. Values are then masked in columns that are assigned either the California State Driver's License or the Nevada State Driver's License data class.
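The OR logic in this example can be sketched as follows. This is an illustration only: the `columns` dictionary, the `data_class` key, and the function `columns_to_mask` are hypothetical and do not reflect how the product stores column metadata.

```python
def columns_to_mask(columns, property_type, values):
    """Return the names of columns whose property matches ANY of the values."""
    wanted = set(values)
    return [
        name
        for name, props in columns.items()
        # A non-empty intersection means at least one value matched (logical OR).
        if wanted & set(props.get(property_type, []))
    ]


# Hypothetical column metadata: column name -> assigned properties.
columns = {
    "ca_license": {"data_class": ["California State Driver's License"]},
    "nv_license": {"data_class": ["Nevada State Driver's License"]},
    "email": {"data_class": ["Email Address"]},
}

selected = columns_to_mask(
    columns,
    "data_class",
    ["California State Driver's License", "Nevada State Driver's License"],
)
print(selected)  # prints ['ca_license', 'nv_license']
```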
Overview of masking methods
The main difference between the masking methods is how much of the original characteristics of the data remains. The more of the original characteristics that are retained, the more useful, but the less secure, the masked data becomes. When you choose a masking method, consider these factors:
- Data integrity: Whether to repeat the same masked value for a repeated original value to maintain referential integrity between tables.
- Data format: Whether to retain the format of the original data. Preserving the format means that letters are replaced by letters with the same case, digits are replaced by digits, and the number of characters is the same.
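The definition of format preservation can be expressed as a small check. This is a sketch under one assumption that the text does not state: characters that are neither letters nor digits, such as hyphens, must stay unchanged. The function names are hypothetical.

```python
def _kind(ch):
    """Classify a character as a digit, an uppercase letter, or a lowercase letter."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "upper" if ch.isupper() else "lower"
    # Assumption: any other character (for example "-") must match exactly.
    return ch


def same_format(original, masked):
    """Return True if masked has the same length and character kinds as original."""
    return len(original) == len(masked) and all(
        _kind(a) == _kind(b) for a, b in zip(original, masked)
    )


print(same_format("510-555-1234", "415-987-6543"))  # prints True
print(same_format("510-555-1234", "XXXXXXXXXX"))    # prints False
```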
The following table describes how each masking method affects these characteristics.
Method | Description | Preserves integrity? | Preserves data format? |
---|---|---|---|
Redact | By default, replaces values with ten X characters. The most secure method. You can also redact data by using advanced masking options. You can customize the replacement character and the number of replacement characters. For columns that have some assigned data classes, you can choose partial replacement. | No | No: If you are not using advanced masking options. Yes: If you are using advanced masking options. |
Substitute | Replace values with randomly generated values that preserve referential integrity. | Yes | No |
Obfuscate | Replace values with values that preserve referential integrity and the original data format. The least secure method. | Yes | Yes |
For virtual data, the masking behavior is slightly different, based on the data field definition. See Masking virtual data.
Redact
You can redact data using two different methods.
- The basic redact method replaces each data value with a string of exactly ten X characters. With redacted data, the format of the data and data integrity are not preserved. Redact is the most secure masking method, but results in the least useful masked data.
  For example, the phone number 510-555-1234 is replaced with XXXXXXXXXX. All other phone numbers are replaced with the same value.
- You can specify advanced redaction options for criteria that are based on data classes with advanced masking options. Unlike the default redact method, the replacement characters that are used to mask data depend on the specific characters that you configure to redact the data. You can also specify the number of replacement characters. With advanced redacted data, the format of the data is not preserved, but the data integrity is preserved.
  For example, if a column type is an integer and 0 is configured for redacting integers, the data is redacted with 0000000000. If a column type is a string and X is configured for redacting strings, the data is redacted with XXXXXXXXXX. If a column type is a date and 2022-06-30 is configured for redacting dates, the data is redacted with 2022-06-30.
However, advanced masking options are not enforced automatically. You must apply them to selected data assets in a project and then publish the masked assets to a catalog.
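The two redaction behaviors described above can be sketched as follows. This is an illustration, not the product's implementation; the function names, the `config` dictionary, and its keys (`integer`, `string`, `date`) are hypothetical and chosen to mirror the examples in this section.

```python
def redact_basic(value):
    """Basic redaction: every value becomes exactly ten X characters."""
    return "X" * 10


def redact_advanced(column_type, config, length=10):
    """Advanced redaction: the replacement depends on the configured column type.

    config maps a column type name to its configured replacement (hypothetical).
    """
    replacement = config[column_type]
    if column_type == "date":
        # A configured date replaces the whole value.
        return replacement
    # A configured character is repeated for the requested number of characters.
    return replacement * length


config = {"integer": "0", "string": "X", "date": "2022-06-30"}

print(redact_basic("510-555-1234"))        # prints XXXXXXXXXX
print(redact_advanced("integer", config))  # prints 0000000000
print(redact_advanced("date", config))     # prints 2022-06-30
```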
Substitute
The substitute method replaces data with values that don't match the original format. However, it does preserve referential integrity for repeated values for all assets in the catalog. The substituted values are meaningless and the original format of the values can't be determined. Substitute provides security and data usefulness in between the Redact and Obfuscate methods.
For example, the phone number 510-555-1234 is always replaced with 500ddcc98133703531re3456.
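The documentation does not describe the algorithm behind the substitute method. One common way to get the same two properties (repeated values map to the same output, and the output reveals nothing about the original format) is a keyed hash. The sketch below uses HMAC-SHA256 as a stand-in technique; the key, the function name, and the 24-character length are assumptions for illustration.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would be managed securely.
SECRET_KEY = b"example-key-not-for-production"


def substitute(value, key=SECRET_KEY, length=24):
    """Map a value to a fixed-length token that is stable for repeated inputs."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating the hex digest hides the length and format of the original.
    return digest[:length]


# The same input always yields the same token, which keeps joins across tables working.
print(substitute("510-555-1234") == substitute("510-555-1234"))  # prints True
```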
Obfuscate
The obfuscate method replaces data values with values that match the original format, and it preserves referential integrity for repeated values. Because the obfuscated values keep the original format, they can be valid values. Obfuscate is the least secure masking method, but results in the most useful masked data.
For example, the phone number 510-555-1234 is always replaced with 415-987-6543.
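The product's obfuscation algorithm is not described here. As a rough illustration of the idea, the sketch below derives a repeatable random generator from a keyed hash of the value and then replaces each digit with a digit and each ASCII letter with a letter of the same case. The key and function name are hypothetical, and leaving other characters such as hyphens unchanged is an assumption.

```python
import hashlib
import hmac
import random
import string

# Hypothetical key for illustration; a real key would be managed securely.
SECRET_KEY = b"example-key-not-for-production"


def obfuscate(value, key=SECRET_KEY):
    """Replace characters kind-for-kind, repeatably for the same input and key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # Seeding from the keyed digest makes the output stable for repeated values.
    rng = random.Random(int.from_bytes(digest, "big"))
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)  # separators and other characters are kept as-is
    return "".join(out)


masked = obfuscate("510-555-1234")
print(masked == obfuscate("510-555-1234"))  # prints True
```

This sketch does not guarantee that different inputs produce different outputs, so it is weaker than a true format-preserving scheme and is shown only to make the concept concrete.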
However, the obfuscate method is limited to data values in columns that have assigned data classes with the following types of information:
- Personal information, for example, basic attributes of an individual, such as honorific or name suffix.
- Contact details, for example, email addresses, phone numbers, state, postal addresses, latitude, or longitude.
- Financial accounts, for example, credit cards, banking, or other financial account numbers.
- Government identities, for example, personal identification numbers issued by governments, such as SSN (US social security numbers) and CCN (credit card numbers).
- Personal demographic information, for example, religion, ethnicity, marital status, hobbies, or employee status.
- Connectivity data, for example, IP addresses or MAC addresses.
If you create a rule to obfuscate data and the rule is enforced on data that is not assigned a data class that supports obfuscation, the substitute method is used instead.
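The fallback rule can be summarized in a few lines. The set of data classes and the function name below are hypothetical placeholders; the real list of data classes that support obfuscation is defined by the product.

```python
# Hypothetical examples of data classes that support obfuscation.
OBFUSCATION_CAPABLE = {"Phone Number", "Email Address"}


def effective_method(requested_method, data_class):
    """Return the masking method that is actually applied to a column."""
    if requested_method == "obfuscate" and data_class not in OBFUSCATION_CAPABLE:
        # No supporting data class is assigned: substitute is used instead.
        return "substitute"
    return requested_method


print(effective_method("obfuscate", "Phone Number"))  # prints obfuscate
print(effective_method("obfuscate", None))            # prints substitute
```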
You can specify advanced obfuscation options for masking criteria that are based on data classes with advanced data masking. However, advanced data masking is not enforced automatically. You must apply it to selected data assets in a project and then publish the masked assets to a catalog.
Learn more
- Designing data protection rules
- Data protection rules evaluation
- Managing data protection rules
- Advanced masking options