Kerberos authentication on Cloud for Data Virtualization

Last updated: Mar 17, 2025

To connect to Apache Hive, Apache Impala, and Apache Spark SQL with Kerberos authentication, you must provide the Kerberos configuration file to Data Virtualization before you create the connection.

Before you begin

You must have a Remote Agent set up with an IBM Cloud Satellite Connector connection. For more information, see Configuring the Data Virtualization Remote Agent.

About this task

Kerberos is a network authentication protocol, created at MIT, that authenticates users with tickets rather than by sending passwords over the network. Many organizations use it for single sign-on (SSO) to securely transmit user identity data to applications.

Data Virtualization on-premises supports Kerberos authentication for Apache Hive, Apache Impala, and Apache Spark SQL. You must upload a keytab file, or an encrypted file that the data source generates, which is used for Kerberos authentication.

Note: Kerberos authentication is not available in the Data Virtualization web client due to a file upload restriction.

Procedure

For each of your Apache Hive, Apache Impala, and Apache Spark SQL data sources, open a new text file and then complete the following steps to create a configuration file.

Copy and paste the following information into your new text file, and then replace the variables, which are enclosed in angle brackets (< >).

# To opt out of the system crypto-policies configuration of krb5, remove the
# symlink at /etc/krb5.conf.d/crypto-policies which will not be recreated.
includedir /etc/krb5.conf.d/

[logging]
    default = FILE:/var/log/krb5libs.log
    kdc = FILE:/var/log/krb5kdc.log
    admin_server = FILE:/var/log/kadmind.log

[libdefaults]
    dns_lookup_realm = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    rdns = false
    pkinit_anchors = FILE:/etc/pki/tls/certs/ca-bundle.crt
    spake_preauth_groups = edwards25519
    dns_canonicalize_hostname = fallback
    qualify_shortname = ""
    default_realm = <DEFAULT_DOMAIN_REALM>
    default_ccache_name = KEYRING:persistent:%{uid}

[realms]
 <KERBEROS_REALM> = {
     kdc = <KDC_SERVER>
     admin_server = <ADMIN_SERVER>
 }

[domain_realm]
 <SUBDOMAIN_REALM> = <DOMAIN_REALM>
 <DOMAIN_TO_REALM> = <SUBDOMAIN_TO_REALM>

Consider the following text as an example.

# To opt out of the system crypto-policies configuration of krb5, remove the
# symlink at /etc/krb5.conf.d/crypto-policies which will not be recreated.
includedir /etc/krb5.conf.d/

[logging]
    default = FILE:/var/log/krb5libs.log
    kdc = FILE:/var/log/krb5kdc.log
    admin_server = FILE:/var/log/kadmind.log

[libdefaults]
    dns_lookup_realm = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    rdns = false
    pkinit_anchors = FILE:/etc/pki/tls/certs/ca-bundle.crt
    spake_preauth_groups = edwards25519
    dns_canonicalize_hostname = fallback
    qualify_shortname = ""
    default_realm = EXAMPLE.COM
    default_ccache_name = KEYRING:persistent:%{uid}

[realms]
 EXAMPLE.COM = {
     kdc = kerberos.example.com
     admin_server = kerberos.example.com
 }

[domain_realm]
 .example.com = EXAMPLE.COM
 example.com = EXAMPLE.COM

Save the configuration file.
- For Apache Hive, save the file as hive_krb5.conf.
- For Apache Impala, save the file as Impala_krb5.conf.
- For Apache Spark SQL, save the file as spark_krb5.conf.
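
Before you continue, it can help to confirm that the saved file has no leftover template variables. The following Python sketch is one possible check, not part of the product: the function name `check_krb5_conf` is hypothetical, and the rule that `default_realm` should appear in `[realms]` is a heuristic based on the example above rather than a Kerberos requirement.

```python
import re

# Matches template variables such as <KDC_SERVER> that were never replaced.
PLACEHOLDER = re.compile(r"<[A-Za-z_]+>")


def check_krb5_conf(text):
    """Return a list of possible problems found in krb5.conf text."""
    problems = []
    section = None
    default_realm = None
    realms = set()
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if PLACEHOLDER.search(line):
            problems.append(f"line {lineno}: unreplaced placeholder in {line!r}")
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            section = header.group(1)
            continue
        if section == "libdefaults" and line.startswith("default_realm"):
            default_realm = line.split("=", 1)[1].strip()
        elif section == "realms" and line.endswith("{"):
            # A realm block opens with "REALM = {".
            realms.add(line.split("=", 1)[0].strip())
    if default_realm is None:
        problems.append("no default_realm in [libdefaults]")
    elif default_realm not in realms:
        # Heuristic only: flag it for review.
        problems.append(f"default_realm {default_realm} has no entry in [realms]")
    return problems
```

To use it, read the saved file (for example, hive_krb5.conf) into a string and pass it to the function. An empty list means that the sketch found nothing to review.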

Open the datavirtualization.env file on your Remote Agent.

vi /root/dv_endpoint/datavirtualization.env

Verify that the datavirtualization.env file contains the following variables.
- JAVA_HOME: The path where Java is installed on your machine.
- DATAVIRTUALIZATION_INSTALL: The path of the directory that contains datavirtualization.env.
- KRB5_CONFIG: The path of the Kerberos configuration file that you created, for example hive_krb5.conf.

The following is an example of the text that your file might contain.

```
JAVA_HOME="/root/jdk-21.0.3+9"
DATAVIRTUALIZATION_INSTALL="/root/dv_endpoint"
KRB5_CONFIG=/etc/hive_krb5.conf
```
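
The verification of datavirtualization.env can also be scripted. The following Python sketch assumes that the file consists of simple KEY=VALUE lines, as in the example. The helper names `parse_env` and `missing_variables` are hypothetical and are not part of Data Virtualization.

```python
# Variables that the env file is expected to define.
REQUIRED = ("JAVA_HOME", "DATAVIRTUALIZATION_INSTALL", "KRB5_CONFIG")


def parse_env(text):
    """Parse simple KEY=VALUE lines into a dict, ignoring blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        # Drop surrounding single or double quotes from the value.
        values[key] = value.strip().strip("\"'")
    return values


def missing_variables(text):
    """Return the required variable names that are absent or empty."""
    values = parse_env(text)
    return [name for name in REQUIRED if not values.get(name)]
```

To use it, read /root/dv_endpoint/datavirtualization.env into a string and pass it to `missing_variables`. An empty list means that all three variables are set. You might also confirm that the KRB5_CONFIG path points to an existing file.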

Replace the parameters in the following stored procedure call, and then run it in Run SQL. Replace <Data_source> with Hive, Impala, or SparkSQL.

call dvsys.setrdbcx('<Data_source>', '<host_name>', <db_port>, '<database_name>', '', '', '', <use_SSL>, <validate_cert>, '', '<SSL_certificate>', '<RemoteAgentName:Port>', 'UserPrincipal=<User_principal>,ServicePrincipal=<Service_principal>,Keytab=<Keytab_info>', ?, ?, ?)

For more information about the parameters, see setRdbcX stored procedure (Variation 2).

The following is an example of the stored procedure.

call dvsys.setrdbcx('SparkSQL', 'krbds-hive.fyre.ibm.com', 10000, 'sparkdb01', '', '', '', 0, 0, '', '', 'RA_FOR_KRB:6415', 'UserPrincipal=spark/xxxx.fyre.ibm.com@IBM.COM,ServicePrincipal=hive/xxx.fyre.ibm.com@IBM.COM,Keytab=XXXXXXEQAQcheKq6W+vSDlrJ1GSZAITwAAAAIAAABMAAIAB0lCTS5DT00ABXNwYXJrABdrcmJkcy1oaXZlLmZ5cmUuaWJtLmNvbQAAAAFlHD9NAgAXABBTo30Yd3yTHr8rzj8V9lGKAAAAAgAAAFwAAgAHSUJNLkNPTQAFc3BhcmsAF2tyYmRzLWhpdmUuZnlyZS5pYm0uY29tAAAAAWUcP00CABoAIJNL0pQT6SkPC+JfILB+yq3rcCQo/6uRfLuBSPUmlS6XAAAAAgAAAEwAAgAHSUJNLkNPTQAFc3BhcmsAF2tyYmRzLWhpdmUuZnlyZS5pYm0uY29tAAAAAWUcP00CABkAEAa0R7FrW9AX+Q4GfmCLiG4AAAAC', ?, ?, ?)
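
Because the call has many positional arguments, assembling it by hand is error-prone. The following Python sketch builds the statement text from named values, following the argument order of the template above. The function names `sql_literal` and `build_setrdbcx_call` are hypothetical helpers for illustration. The sketch only produces the SQL text to paste into Run SQL and does not connect to anything.

```python
ALLOWED_SOURCES = ("Hive", "Impala", "SparkSQL")


def sql_literal(value):
    """Quote a string as an SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_setrdbcx_call(data_source, host, port, database, use_ssl,
                        validate_cert, ssl_certificate, remote_agent,
                        user_principal, service_principal, keytab):
    """Return the text of a dvsys.setrdbcx call in the template's argument order."""
    if data_source not in ALLOWED_SOURCES:
        raise ValueError(f"data_source must be one of {ALLOWED_SOURCES}")
    # The Kerberos settings travel together in one comma-separated argument.
    options = (f"UserPrincipal={user_principal},"
               f"ServicePrincipal={service_principal},"
               f"Keytab={keytab}")
    args = [
        sql_literal(data_source), sql_literal(host), str(int(port)),
        sql_literal(database),
        "''", "''", "''",                          # left empty, as in the template
        str(int(use_ssl)), str(int(validate_cert)),
        "''", sql_literal(ssl_certificate), sql_literal(remote_agent),
        sql_literal(options),
        "?", "?", "?",                             # output parameter markers
    ]
    return "call dvsys.setrdbcx(" + ", ".join(args) + ")"
```

For example, passing the values from the SparkSQL example, with an empty string for the SSL certificate, reproduces that statement.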

Check whether the stored procedure succeeded by selecting the Results tab and checking the Output value column.
- A value of 1 indicates success.
- A value of 0 indicates failure. In that case, review the previous configuration steps.