Key management by KMS
Last updated: Oct 09, 2024

Parquet modular encryption can work with arbitrary Key Management Service (KMS) servers. You must provide the Analytics Engine Powered by Apache Spark instance with a custom KMS client class that can communicate with the chosen KMS server. This class must implement the KmsClient interface, which is part of the Parquet modular encryption API. Analytics Engine Powered by Apache Spark includes VaultClient, a KmsClient implementation that you can use out of the box if Hashicorp Vault is the KMS server for your master keys. If you use or plan to use a different KMS system, you can develop a custom KmsClient class, taking the VaultClient code as an example.

Custom KmsClient class

Parquet modular encryption provides a simple interface called org.apache.parquet.crypto.keytools.KmsClient with the following two main functions that you must implement:

// Wraps a key - encrypts it with the master key, encodes the result and 
// potentially adds KMS-specific metadata.
public String wrapKey(byte[] keyBytes, String masterKeyIdentifier)
// Decrypts (unwraps) a key with the master key.
public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier)

In addition, the interface provides the following initialization function that passes KMS parameters and other configuration:

public void initialize(Configuration configuration, String kmsInstanceID, String kmsInstanceURL, String accessToken)

See Example of KmsClient implementation to learn how to implement a KmsClient.
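
As a rough illustration of the wrap and unwrap contract, the following Java sketch mirrors the two method signatures shown above. It is not the VaultClient code and does not talk to a KMS server. The class name InMemoryWrapSketch, the addMasterKey helper, the in-memory master key map, and the choice of AES-GCM with a Base64-encoded result are all assumptions made for illustration. A real client would declare that it implements KmsClient, add the initialize method, and delegate both operations to the KMS server so that master keys never leave it.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: same wrapKey/unwrapKey signatures as KmsClient,
// but master keys live in local memory instead of a KMS server.
class InMemoryWrapSketch {
    private static final int NONCE_LENGTH = 12; // bytes
    private static final int TAG_LENGTH = 128;  // bits

    private final Map<String, byte[]> masterKeys = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    // Illustration only: registers a 16-, 24-, or 32-byte AES master key.
    public void addMasterKey(String masterKeyIdentifier, byte[] masterKey) {
        masterKeys.put(masterKeyIdentifier, masterKey.clone());
    }

    // Encrypts the key with the master key and encodes nonce + ciphertext.
    public String wrapKey(byte[] keyBytes, String masterKeyIdentifier) {
        byte[] nonce = new byte[NONCE_LENGTH];
        random.nextBytes(nonce);
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, masterKey(masterKeyIdentifier),
                    new GCMParameterSpec(TAG_LENGTH, nonce));
            byte[] cipherText = cipher.doFinal(keyBytes);
            byte[] out = ByteBuffer.allocate(nonce.length + cipherText.length)
                    .put(nonce).put(cipherText).array();
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Failed to wrap key", e);
        }
    }

    // Decodes the wrapped key and decrypts it with the master key.
    public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier) {
        byte[] in = Base64.getDecoder().decode(wrappedKey);
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, masterKey(masterKeyIdentifier),
                    new GCMParameterSpec(TAG_LENGTH, in, 0, NONCE_LENGTH));
            return cipher.doFinal(in, NONCE_LENGTH, in.length - NONCE_LENGTH);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Failed to unwrap key", e);
        }
    }

    private SecretKeySpec masterKey(String masterKeyIdentifier) {
        byte[] key = masterKeys.get(masterKeyIdentifier);
        if (key == null) {
            throw new IllegalArgumentException("Unknown master key: " + masterKeyIdentifier);
        }
        return new SecretKeySpec(key, "AES");
    }
}
```

The property that matters is the round trip: unwrapKey applied to the output of wrapKey with the same master key identifier must return the original key bytes.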

After you have developed the custom KmsClient class, add it to a jar supplied to Analytics Engine Powered by Apache Spark, and pass its full name in the Spark Hadoop configuration, for example:

sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "full.name.of.YourKmsClient")

Key management by Hashicorp Vault

If you decide to use Hashicorp Vault as the KMS server, you can use the pre-packaged VaultClient:

sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "com.ibm.parquet.key.management.VaultClient")

Creating master keys

Consult the Hashicorp Vault documentation for the specifics of each action on Vault. To create master keys:

  1. Enable the Transit Engine either at the default path or providing a custom path.
  2. Create named encryption keys.
  3. Configure access policies with which a user or machine is allowed to access these named keys.

Writing encrypted data

  1. Pass the following parameters:

    • Set "parquet.encryption.kms.client.class" to "com.ibm.parquet.key.management.VaultClient":

      sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "com.ibm.parquet.key.management.VaultClient")
      
    • Optional: Set "parquet.encryption.kms.instance.id" to the custom path of your transit engine:

      sc.hadoopConfiguration.set("parquet.encryption.kms.instance.id" , "north/transit1")
      
    • Set "parquet.encryption.kms.instance.url" to the URL of your Vault instance:

      sc.hadoopConfiguration.set("parquet.encryption.kms.instance.url" , "https://<hostname>:8200")
      
    • Set "parquet.encryption.key.access.token" to a valid access token with the access policy attached, which provides access rights to the required keys in your Vault instance:

      sc.hadoopConfiguration.set("parquet.encryption.key.access.token" , "<token string>")
      
    • If the token is located in a local file, load it:

      val token = scala.io.Source.fromFile("<token file>").mkString
      sc.hadoopConfiguration.set("parquet.encryption.key.access.token" , token)
      
  2. Specify which columns need to be encrypted, and with which master keys. You must also specify the footer key. For example:

    val k1 = "key1"
    val k2 = "key2"
    val k3 = "key3"
    dataFrame.write
    .option("parquet.encryption.footer.key" , k1)
    .option("parquet.encryption.column.keys" , k2+":SSN,Address;"+k3+":CreditCard")
    .parquet("<path to encrypted files>")
    

    Note: If either the "parquet.encryption.column.keys" or the "parquet.encryption.footer.key" parameter is not set, an exception will be thrown.
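
The value of "parquet.encryption.column.keys" in the example above follows the pattern of a master key ID, a colon, and a comma-separated list of columns, with entries separated by semicolons. When many columns are involved, building the string from a map can be less error-prone than concatenating it by hand. The following Java sketch shows one way to do that; the ColumnKeys class and its format method are hypothetical helpers, not part of Parquet or Analytics Engine.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for building the column keys option value.
class ColumnKeys {
    // Formats {masterKeyId -> columns} as "<id>:<col>,<col>;<id>:<col>".
    // Entries appear in the map's iteration order.
    static String format(Map<String, List<String>> columnsByMasterKey) {
        return columnsByMasterKey.entrySet().stream()
                .map(e -> e.getKey() + ":" + String.join(",", e.getValue()))
                .collect(Collectors.joining(";"));
    }
}
```

Passing a LinkedHashMap that maps "key2" to SSN and Address, and "key3" to CreditCard, yields the same string as the example above: key2:SSN,Address;key3:CreditCard.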

Reading encrypted data

The required metadata, including the ID and URL of the Hashicorp Vault instance, is stored in the encrypted Parquet files.

To read the encrypted data:

  1. Set the KMS client to the Vault client implementation:

    sc.hadoopConfiguration.set("parquet.encryption.kms.client.class", "com.ibm.parquet.key.management.VaultClient")
    
  2. Provide an access token whose attached policy grants access to the relevant keys:

    sc.hadoopConfiguration.set("parquet.encryption.key.access.token" , "<token string>")
    
  3. Call the regular Parquet read commands, such as:

    val dataFrame = spark.read.parquet("<path to encrypted files>")
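
Both the write and read paths take the access token as a plain string. If the token is kept in a local file, that file often ends with a newline, which would become part of the token if the content is used verbatim. The following Java sketch reads a token file and strips surrounding whitespace. It assumes the file contains only the token; the TokenFiles class and readToken method are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Parquet or Analytics Engine.
class TokenFiles {
    // Reads the whole file as UTF-8 and strips leading and trailing
    // whitespace, such as a final newline.
    static String readToken(String tokenFile) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(tokenFile));
        return new String(raw, StandardCharsets.UTF_8).trim();
    }
}
```

The returned string can then be passed as the value of "parquet.encryption.key.access.token".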
    

Key rotation

If key rotation is required, an administrator with access rights to the KMS key rotation actions must rotate master keys in Hashicorp Vault using the procedure described in the Hashicorp Vault documentation. Thereafter the administrator can trigger Parquet key rotation by calling:

public static void KeyToolkit.rotateMasterKeys(String folderPath, Configuration hadoopConfig)

To enable Parquet key rotation, the following Hadoop configuration properties must be set:

  • The parameters "parquet.encryption.key.access.token" and "parquet.encryption.kms.instance.url" must be set, and optionally "parquet.encryption.kms.instance.id".
  • The parameter "parquet.encryption.key.material.store.internally" must be set to "false".
  • The parameter "parquet.encryption.kms.client.class" must be set to "com.ibm.parquet.key.management.VaultClient".

For example:

sc.hadoopConfiguration.set("parquet.encryption.kms.instance.url" , "https://<hostname>:8200")
sc.hadoopConfiguration.set("parquet.encryption.key.access.token" , "<token string>")
sc.hadoopConfiguration.set("parquet.encryption.kms.client.class","com.ibm.parquet.key.management.VaultClient")
sc.hadoopConfiguration.set("parquet.encryption.key.material.store.internally", "false")
KeyToolkit.rotateMasterKeys("<path to encrypted files>", sc.hadoopConfiguration)
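
Because rotation depends on several properties being set together, it can help to check them before calling rotateMasterKeys. The following Java sketch checks the requirements listed above against a plain map of property names to values, which stands in for the Hadoop configuration so that the example needs only the JDK. The RotationPrecheck class and its problems method are hypothetical and not part of the Parquet API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check for the key rotation settings.
class RotationPrecheck {
    static final String VAULT_CLIENT = "com.ibm.parquet.key.management.VaultClient";

    // Returns one message per setting that does not meet the requirements;
    // an empty list means all required settings are in place.
    static List<String> problems(Map<String, String> conf) {
        List<String> problems = new ArrayList<>();
        for (String required : Arrays.asList(
                "parquet.encryption.key.access.token",
                "parquet.encryption.kms.instance.url")) {
            String value = conf.get(required);
            if (value == null || value.trim().isEmpty()) {
                problems.add(required + " is not set");
            }
        }
        if (!"false".equals(conf.get("parquet.encryption.key.material.store.internally"))) {
            problems.add("parquet.encryption.key.material.store.internally must be \"false\"");
        }
        if (!VAULT_CLIENT.equals(conf.get("parquet.encryption.kms.client.class"))) {
            problems.add("parquet.encryption.kms.client.class must be " + VAULT_CLIENT);
        }
        return problems;
    }
}
```

The optional "parquet.encryption.kms.instance.id" is deliberately not checked, since rotation can proceed without it.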

