Parquet modular encryption

If your data is stored in columnar format, you can use Parquet modular encryption to encrypt sensitive columns when writing Parquet files, and decrypt these columns when reading the encrypted files. Encrypting data at the column level enables you to decide which columns to encrypt and how to control access to each column.

Besides ensuring privacy, Parquet modular encryption also protects the integrity of stored data. Any tampering with file contents is detected and triggers a reader-side exception.

Key features include:

- Parquet modular encryption and decryption are performed on the Spark cluster. Therefore, sensitive data and the encryption keys are not visible to the storage.
- Standard Parquet features, such as encoding, compression, columnar projection and predicate push-down, continue to work as usual on files with Parquet modular encryption format.
- You can choose one of two encryption algorithms that are defined in the Parquet specification. Both algorithms support column encryption, however:
  - The default algorithm AES-GCM provides full protection against tampering with data and metadata parts in Parquet files.
  - The alternative algorithm AES-GCM-CTR supports partial integrity protection of Parquet files. Only metadata parts are protected against tampering, not data parts. An advantage of this algorithm is that it has a lower throughput overhead compared to the AES-GCM algorithm.
- You can choose which columns to encrypt. Other columns won't be encrypted, reducing the throughput overhead.
- Different columns can be encrypted with different keys.
- By default, the main Parquet metadata module (the file footer) is encrypted to hide the file schema and list of sensitive columns. However, you can choose not to encrypt the file footers in order to enable legacy readers (such as other Spark distributions that don't yet support Parquet modular encryption) to read the unencrypted columns in the encrypted files.
- Encryption keys can be managed in one of two ways:
  - Directly by your application. See Key management by application.
  - By a key management system (KMS) that generates, stores and destroys encryption keys used by the Spark service. These keys never leave the KMS server, and therefore are invisible to other components, including the Spark service. See Key management by KMS.

  Note: Only master encryption keys (MEKs) need to be managed by your application or by a KMS.

For each sensitive column, you must specify which master key to use for encryption. Also, a master key must be specified for the footer of each encrypted file (data frame). By default, the footer key will be used for footer encryption. However, if you choose a plain text footer mode, the footer won’t be encrypted, and the key will be used only for integrity verification of the footer.

The encryption parameters can be passed via the standard Spark Hadoop configuration, for example by setting configuration values in the Hadoop configuration of the application's SparkContext:
```
sc.hadoopConfiguration.set("<parameter name>", "<parameter value>")
```
Alternatively, you can pass parameter values through write options:
```
<data frame name>.write
  .option("<parameter name>", "<parameter value>")
  .parquet("<write path>")
```
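In PySpark, the write-option pattern can be wrapped in a small helper. The following is a minimal sketch and not part of any product API: `apply_options` is a hypothetical name, and the only assumption made about the writer object is that it exposes a chainable `option(name, value)` method, as PySpark's `DataFrameWriter` does.

```python
def apply_options(writer, params):
    """Chain writer.option(name, value) for every entry in params.

    `writer` is expected to behave like PySpark's DataFrameWriter, whose
    option() method returns a writer that can be chained further.
    """
    for name, value in params.items():
        writer = writer.option(name, value)
    return writer


# Parameter names as documented in this topic; the key IDs are sample values.
encryption_params = {
    "encryption.footer.key": "k1",
    "encryption.column.keys": "k2:SSN,Address;k3:CreditCard",
}

# Usage with a real data frame (requires a running Spark session):
# apply_options(dataFrame.write, encryption_params).parquet("<write path>")
```

Keeping the parameters in one dictionary makes it easier to reuse the same settings across several writes.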

Running with Parquet modular encryption

Parquet modular encryption is available only in Spark notebooks that are run in an IBM Analytics Engine service instance. Parquet modular encryption is not supported in notebooks that run in a Spark environment.

To enable Parquet modular encryption, set the following Spark classpath properties to point to the Parquet jar files that implement Parquet modular encryption, and to the key management jar file:

Navigate to Ambari > Spark > Config > Custom spark2-default.

Add the following two parameters to point explicitly to the location of the JAR files. Make sure that you edit the paths to use the actual versions of the JAR files on the cluster.

```
spark.driver.extraClassPath=/home/common/lib/parquetEncryption/ibm-parquet-kms-<latestversion>-jar-with-dependencies.jar:/home/common/lib/parquetEncryption/parquet-format-<latestversion>.jar:/home/common/lib/parquetEncryption/parquet-hadoop-<latestversion>.jar

spark.executor.extraClassPath=/home/common/lib/parquetEncryption/ibm-parquet-<latestversion>-jar-with-dependencies.jar:/home/common/lib/parquetEncryption/parquet-format-<latestversion>.jar:/home/common/lib/parquetEncryption/parquet-hadoop-<latestversion>.jar
```
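Because each classpath value is a colon-separated list of absolute paths, it can be less error-prone to assemble it from the individual file names. The sketch below is illustrative only: `build_classpath` is a hypothetical helper, the version strings are made up, and the file names follow the driver property shown above. Substitute the names that actually exist on your cluster.

```python
def build_classpath(lib_dir, jar_names):
    """Join JAR file names under lib_dir into a colon-separated classpath."""
    base = lib_dir.rstrip("/")
    return ":".join(f"{base}/{name}" for name in jar_names)


LIB_DIR = "/home/common/lib/parquetEncryption"

# Hypothetical versions; check the directory on the cluster for the real ones.
driver_jars = [
    "ibm-parquet-kms-1.0.0-jar-with-dependencies.jar",
    "parquet-format-1.0.0.jar",
    "parquet-hadoop-1.0.0.jar",
]

print("spark.driver.extraClassPath=" + build_classpath(LIB_DIR, driver_jars))
```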

Mandatory parameters

The following parameters are required for writing encrypted data:

List of columns to encrypt, with the master encryption keys:

```
parameter name: "encryption.column.keys"
parameter value: "<master key ID>:<column>,<column>;<master key ID>:<column>,.."
```

The footer key:
```
parameter name: "encryption.footer.key"
parameter value: "<master key ID>"
```
For example:
```
dataFrame.write
  .option("encryption.footer.key", "k1")
  .option("encryption.column.keys", "k2:SSN,Address;k3:CreditCard")
  .parquet("<path to encrypted files>")
```
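The value of `encryption.column.keys` is easy to mistype by hand, so it can help to generate it from a mapping of master key IDs to column names. This is a minimal Python sketch. `format_column_keys` is a hypothetical helper name, and it relies only on the value format documented above.

```python
def format_column_keys(columns_by_key):
    """Build "<master key ID>:<column>,<column>;<master key ID>:<column>"."""
    for key_id, columns in columns_by_key.items():
        if not columns:
            raise ValueError(f"master key {key_id!r} has no columns")
    return ";".join(
        f"{key_id}:{','.join(columns)}"
        for key_id, columns in columns_by_key.items()
    )


value = format_column_keys({"k2": ["SSN", "Address"], "k3": ["CreditCard"]})
print(value)  # k2:SSN,Address;k3:CreditCard
```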
Important:
If neither the encryption.column.keys parameter nor the encryption.footer.key parameter is set, the file will not be encrypted. If only one of these parameters is set, an exception is thrown, because these parameters are mandatory for encrypted files.
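The both-or-neither rule can also be checked in application code before a write is attempted. The following sketch mirrors the rule as stated above. `encryption_requested` is a hypothetical helper, and the `ValueError` it raises stands in for whatever exception the Parquet writer itself throws.

```python
COLUMN_KEYS = "encryption.column.keys"
FOOTER_KEY = "encryption.footer.key"


def encryption_requested(params):
    """Return True if the file would be encrypted, False if it would not.

    Raises ValueError when only one of the two mandatory parameters is set.
    """
    has_columns = bool(params.get(COLUMN_KEYS))
    has_footer = bool(params.get(FOOTER_KEY))
    if has_columns != has_footer:
        missing = FOOTER_KEY if has_columns else COLUMN_KEYS
        raise ValueError(f"{missing} must also be set for encrypted files")
    return has_columns
```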

Optional parameters

The following optional parameters can be used when writing encrypted data:

The encryption algorithm AES-GCM-CTR

By default, Parquet modular encryption uses the AES-GCM algorithm that provides full protection against tampering with data and metadata in Parquet files. However, as Spark 2.3.0 runs on Java 8, which doesn’t support AES acceleration in CPU hardware (this was only added in Java 9), the overhead of data integrity verification can affect workload throughput in certain situations.

To compensate for this, you can switch off data integrity verification and write the encrypted files with the alternative algorithm AES-GCM-CTR. This algorithm verifies the integrity of the metadata parts only, not that of the data parts, and has a lower throughput overhead than the AES-GCM algorithm.
```
parameter name: "encryption.algorithm"
parameter value: "AES_GCM_CTR_V1"
```
Plain text footer mode for legacy readers

By default, the main Parquet metadata module (the file footer) is encrypted to hide the file schema and list of sensitive columns. However, you can decide not to encrypt the file footers in order to enable other Spark and Parquet readers (that don't yet support Parquet modular encryption) to read the unencrypted columns in the encrypted files. To switch off footer encryption, set the following parameter:
```
parameter name: "encryption.plaintext.footer"
parameter value: "true"
```
Important:
The encryption.footer.key parameter must also be specified in the plain text footer mode. Although the footer is not encrypted, the key is used to sign the footer content, so readers that support Parquet modular encryption can verify its integrity. Legacy readers are not affected by the addition of the footer signature.
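The mandatory and optional parameters can be collected in one place before writing. The sketch below is an illustration, not a product API: `build_write_options` and its argument names are hypothetical, while the parameter names and values are the ones documented in this topic.

```python
def build_write_options(footer_key, column_keys,
                        use_gcm_ctr=False, plaintext_footer=False):
    """Assemble the mandatory and optional encryption parameters into a dict."""
    options = {
        # Required even in plain text footer mode, where it signs the footer.
        "encryption.footer.key": footer_key,
        "encryption.column.keys": column_keys,
    }
    if use_gcm_ctr:
        # Trade data integrity verification for lower throughput overhead.
        options["encryption.algorithm"] = "AES_GCM_CTR_V1"
    if plaintext_footer:
        # Leave the footer readable for legacy readers.
        options["encryption.plaintext.footer"] = "true"
    return options


opts = build_write_options("k1", "k2:SSN,Address;k3:CreditCard",
                           use_gcm_ctr=True, plaintext_footer=True)

# Usage with a real data frame (requires a running Spark session):
# dataFrame.write.options(**opts).parquet("<path to encrypted files>")
```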

Usage examples

The following sample code snippets for Python show how to create data frames, write them to encrypted Parquet files, and read them back from encrypted Parquet files.

Python: Writing encrypted data:

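```
from pyspark.sql import Row

# `spark` (SparkSession) and `sc` (SparkContext) are assumed to be provided by the notebook.
squaresDF = spark.createDataFrame(
    sc.parallelize(range(1, 6))
    .map(lambda i: Row(int_column=i, square_int_column=i ** 2)))

# Master keys managed by the application: "<key ID>: <base64-encoded key>, ..."
sc._jsc.hadoopConfiguration().set("encryption.key.list",
    "key1: AAECAwQFBgcICQoLDA0ODw==, key2: AAECAAECAAECAAECAAECAA==")
# Encrypt square_int_column with key1; int_column is not listed and stays unencrypted.
sc._jsc.hadoopConfiguration().set("encryption.column.keys",
    "key1:square_int_column")
# Use key2 for the file footer.
sc._jsc.hadoopConfiguration().set("encryption.footer.key", "key2")

encryptedParquetPath = "squares.parquet.encrypted"
squaresDF.write.parquet(encryptedParquetPath)
```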

Python: Reading encrypted data:

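```
# The reader needs the master keys that were used when the file was written.
sc._jsc.hadoopConfiguration().set("encryption.key.list",
    "key1: AAECAwQFBgcICQoLDA0ODw==, key2: AAECAAECAAECAAECAAECAA==")

encryptedParquetPath = "squares.parquet.encrypted"
# Encrypted columns are decrypted as part of the normal read.
parquetFile = spark.read.parquet(encryptedParquetPath)
parquetFile.show()
```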
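The sample keys in these snippets are predictable test values: the first one is simply the bytes 0x00 to 0x0f in base64. For anything beyond a demo, you would generate random keys instead. The following sketch assumes that `encryption.key.list` accepts base64-encoded 128-bit keys in the `"<key ID>: <key>, ..."` form used above. `new_master_key` and `format_key_list` are hypothetical helper names.

```python
import base64
import secrets


def new_master_key():
    """Return a random 128-bit key, base64-encoded like the sample keys."""
    return base64.b64encode(secrets.token_bytes(16)).decode("ascii")


def format_key_list(keys_by_id):
    """Build the "id1: key1, id2: key2" string used for encryption.key.list."""
    return ", ".join(f"{key_id}: {key}" for key_id, key in keys_by_id.items())


# The first sample key decodes to the 16 bytes 0x00..0x0f.
assert base64.b64decode("AAECAwQFBgcICQoLDA0ODw==") == bytes(range(16))

key_list = format_key_list({"key1": new_master_key(), "key2": new_master_key()})

# Usage in a notebook (requires a SparkContext named sc):
# sc._jsc.hadoopConfiguration().set("encryption.key.list", key_list)
```

The same key list must be available when the files are read, so the generated keys need to be stored somewhere safe outside the notebook.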

The contents of the Python job file InMemoryKMS.py are as follows:

```
from pyspark.sql import Row
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("InMemoryKMS") \
        .getOrCreate()
    sc = spark.sparkContext

    # Register the master keys in the Hadoop configuration.
    print("Setup InMemoryKMS")
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set("encryption.key.list",
              "key1: AAECAwQFBgcICQoLDA0ODw==, key2: AAECAAECAAECAAECAAECAA==")
    encryptedParquetFullName = "testparquet.encrypted"

    # Encrypt the ssn column with key1 and use key2 for the footer.
    print("Write Encrypted Parquet file")
    btDF = spark.createDataFrame(
        sc.parallelize(range(1, 6)).map(lambda i: Row(ssn=i, value=i ** 2)))
    btDF.write.mode("overwrite") \
        .option("encryption.column.keys", "key1:ssn") \
        .option("encryption.footer.key", "key2") \
        .parquet(encryptedParquetFullName)

    # Read the file back and query it through a temporary view.
    print("Read Encrypted Parquet file")
    encrDataDF = spark.read.parquet(encryptedParquetFullName)
    encrDataDF.createOrReplaceTempView("bloodtests")
    queryResult = spark.sql("SELECT ssn, value FROM bloodtests")
    queryResult.show(10)

    # Stopping the session also stops the underlying SparkContext.
    spark.stop()
```

Internals of encryption key handling

When writing a Parquet file, a random data encryption key (DEK) is generated for each encrypted column and for the footer. These keys are used to encrypt the data and the metadata modules in the Parquet file.

The data encryption key is then encrypted with a key encryption key (KEK), which is also generated inside Spark/Parquet, one for each master key. The key encryption key is in turn encrypted locally with the master encryption key (MEK).

Encrypted data encryption keys and key encryption keys are stored in the Parquet file metadata, along with the master key identity. Each key encryption key has a unique identity (generated locally as a secure random 16-byte value), also stored in the file metadata.

When reading a Parquet file, the following items are extracted from the file metadata: the identifier of the master encryption key (MEK), the encrypted key encryption key (KEK) with its identifier, and the encrypted data encryption key (DEK).

The key encryption key is decrypted with the master encryption key locally. Then the data encryption key (DEK) is decrypted locally, using the key encryption key (KEK).
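The three-level key hierarchy described in this section can be sketched in a few lines of Python. This is a toy illustration of the wrapping order only, not the Parquet implementation. `wrap` and `unwrap` are hypothetical helpers that use a SHA-256 keystream plus an HMAC tag as a stand-in for AES-GCM so that the example runs with the standard library alone. The construction is not secure and the metadata layout is made up.

```python
import hashlib
import hmac
import secrets


def _keystream(key, nonce, length):
    """Derive `length` pseudo-random bytes from key and nonce (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def wrap(wrapping_key, plaintext):
    """Encrypt and authenticate plaintext under wrapping_key (AES-GCM stand-in)."""
    nonce = secrets.token_bytes(12)
    stream = _keystream(wrapping_key, nonce, len(plaintext))
    body = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(wrapping_key, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def unwrap(wrapping_key, blob):
    """Verify the tag, then decrypt; raise ValueError if the blob was altered."""
    nonce, body, tag = blob[:12], blob[12:-32], blob[-32:]
    expected = hmac.new(wrapping_key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    stream = _keystream(wrapping_key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


# Write side: the DEK and KEK are generated locally; only the MEK is managed externally.
mek = secrets.token_bytes(16)  # master encryption key, identified as "key1"
kek = secrets.token_bytes(16)  # key encryption key, one per master key
dek = secrets.token_bytes(16)  # data encryption key, one per column or footer

file_metadata = {
    "master_key_id": "key1",
    "kek_id": secrets.token_bytes(16),  # random 16-byte KEK identity
    "wrapped_kek": wrap(mek, kek),      # KEK encrypted with the MEK
    "wrapped_dek": wrap(kek, dek),      # DEK encrypted with the KEK
}

# Read side: look up the MEK by its ID, then unwrap the KEK, then the DEK.
kek_read = unwrap(mek, file_metadata["wrapped_kek"])
dek_read = unwrap(kek_read, file_metadata["wrapped_dek"])
assert dek_read == dek
```

Because only the wrapped keys and the master key identity are stored with the file, a reader without access to the master key cannot recover the data encryption keys.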

