Kerberos authentication on Cloud for Data Virtualization

Last updated: Mar 17, 2025

To connect to Apache Hive, Apache Impala, and Apache Spark SQL with Kerberos authentication, you must provide the Kerberos configuration file to Data Virtualization before you create the connection.

Before you begin

You must have a Remote Agent set up with an IBM Cloud Satellite Connector connection. For more information, see Configuring the Data Virtualization Remote Agent.

About this task

Kerberos is a network authentication protocol that MIT developed. Instead of sending passwords over the network, it uses encrypted tickets to verify user identities. Many organizations use it today for single sign-on (SSO); it securely transmits user identity data to applications, providing both authentication and secure communication.

As with on-premises deployments, Data Virtualization supports Kerberos authentication for Apache Hive, Apache Impala, and Apache Spark SQL. Kerberos authentication requires a keytab file, an encrypted file that the data source generates and that is used to authenticate the connection.
Note: Kerberos authentication is not available in the Data Virtualization web client due to a file upload restriction.
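
For example, you can inspect a keytab file with the standard MIT Kerberos tools before you configure Data Virtualization. The following sketch lists the principals that a keytab contains and confirms that it can obtain a ticket; the keytab path and principal shown are hypothetical.
    # List the principals and key versions stored in the keytab (hypothetical path).
    klist -k -t /etc/security/keytabs/hive.service.keytab
    # Confirm that the keytab can obtain a ticket-granting ticket for its principal.
    kinit -kt /etc/security/keytabs/hive.service.keytab hive/host01.example.com@EXAMPLE.COM
    klist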

Procedure

  1. For each of your Apache Hive, Apache Impala, and Apache Spark SQL data sources, open a new text file and then complete the following steps to create a configuration file.
    1. Copy and paste the following information into your new text file, and then modify the variables, which are indicated by angle brackets (< >).
      # To opt out of the system crypto-policies configuration of krb5, remove the
      # symlink at /etc/krb5.conf.d/crypto-policies which will not be recreated.
      includedir /etc/krb5.conf.d/
      
      [logging]
          default = FILE:/var/log/krb5libs.log
          kdc = FILE:/var/log/krb5kdc.log
          admin_server = FILE:/var/log/kadmind.log
      
      [libdefaults]
          dns_lookup_realm = false
          ticket_lifetime = 24h
          renew_lifetime = 7d
          forwardable = true
          rdns = false
          pkinit_anchors = FILE:/etc/pki/tls/certs/ca-bundle.crt
          spake_preauth_groups = edwards25519
          dns_canonicalize_hostname = fallback
          qualify_shortname = ""
          default_realm = <DEFAULT_DOMAIN_REALM>
          default_ccache_name = KEYRING:persistent:%{uid}
      
      [realms]
       <KERBEROS_REALM> = {
           kdc = <KDC_SERVER>
           admin_server = <ADMIN_SERVER>
       }
      
      [domain_realm]
        .<DOMAIN> = <KERBEROS_REALM>
        <DOMAIN> = <KERBEROS_REALM>
      Consider the following text as an example.
      # To opt out of the system crypto-policies configuration of krb5, remove the
      # symlink at /etc/krb5.conf.d/crypto-policies which will not be recreated.
      includedir /etc/krb5.conf.d/
      
      [logging]
          default = FILE:/var/log/krb5libs.log
          kdc = FILE:/var/log/krb5kdc.log
          admin_server = FILE:/var/log/kadmind.log
      
      [libdefaults]
          dns_lookup_realm = false
          ticket_lifetime = 24h
          renew_lifetime = 7d
          forwardable = true
          rdns = false
          pkinit_anchors = FILE:/etc/pki/tls/certs/ca-bundle.crt
          spake_preauth_groups = edwards25519
          dns_canonicalize_hostname = fallback
          qualify_shortname = ""
          default_realm = EXAMPLE.COM
          default_ccache_name = KEYRING:persistent:%{uid}
      
      [realms]
       EXAMPLE.COM = {
           kdc = kerberos.example.com
           admin_server = kerberos.example.com
       }
      
      [domain_realm]
       .example.com = EXAMPLE.COM
       example.com = EXAMPLE.COM
    2. Save the configuration file.
      • For Apache Hive, save the file as hive_krb5.conf.
      • For Apache Impala, save the file as impala_krb5.conf.
      • For Apache Spark SQL, save the file as spark_krb5.conf.
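    3. Optional: Validate the new configuration file by pointing the MIT Kerberos tools at it before you continue. In this sketch, the keytab path and principal are hypothetical.
      # Use the new configuration file for this command only.
      KRB5_CONFIG=/etc/hive_krb5.conf kinit -kt /etc/security/keytabs/hive.service.keytab hive/host01.example.com@EXAMPLE.COM
      klist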
  2. Open the datavirtualization.env file in your remote agent.
    vi /root/dv_endpoint/datavirtualization.env
  3. Verify that the datavirtualization.env file contains the following information.
    • JAVA_HOME: This is the path where Java is installed on your machine.
    • DATAVIRTUALIZATION_INSTALL: This is the directory path of your remote agent installation, which contains the datavirtualization.env file.
    • KRB5_CONFIG: This is the file path of your newly created krb5.conf configuration file.
    The following is an example of the text that your file might contain.
    JAVA_HOME="/root/jdk-21.0.3+9"
    DATAVIRTUALIZATION_INSTALL="/root/dv_endpoint"
    KRB5_CONFIG="/etc/hive_krb5.conf"
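
    To confirm that these values resolve on the remote agent, you can source the file in a shell and check each path. This is an optional sanity check; the paths shown are the ones from the example above.
    # Load the environment file and inspect the values it defines.
    . /root/dv_endpoint/datavirtualization.env
    echo "$JAVA_HOME" "$DATAVIRTUALIZATION_INSTALL" "$KRB5_CONFIG"
    "$JAVA_HOME/bin/java" -version    # the agent's Java runtime should respond
    ls -l "$KRB5_CONFIG"              # the Kerberos configuration file should exist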
  4. Replace the parameters in the following stored procedure, and then run it in Run SQL. Replace <Data_source> with Hive, Impala, or SparkSQL.
    call dvsys.setrdbcx('<Data_source>', '<host_name>', <db_port>, '<database_name>', '', '', '', <use_SSL>, <validate_cert>, '', '<SSL_certificate>', '<RemoteAgentName:Port>', 'UserPrincipal=<User_principal>,ServicePrincipal=<Service_principal>,Keytab=<Keytab_info>', ?, ?, ?)

    For more information on the parameters, see setRdbcX stored procedure (Variation 2).

    The following is an example of the stored procedure.
    call dvsys.setrdbcx('SparkSQL', 'krbds-hive.fyre.ibm.com', 10000, 'sparkdb01', '', '', '', 0, 0, '', '', 'RA_FOR_KRB:6415', 'UserPrincipal=spark/xxxx.fyre.ibm.com@IBM.COM,ServicePrincipal=hive/xxx.fyre.ibm.com@IBM.COM,Keytab=XXXXXXEQAQcheKq6W+vSDlrJ1GSZAITwAAAAIAAABMAAIAB0lCTS5DT00ABXNwYXJrABdrcmJkcy1oaXZlLmZ5cmUuaWJtLmNvbQAAAAFlHD9NAgAXABBTo30Yd3yTHr8rzj8V9lGKAAAAAgAAAFwAAgAHSUJNLkNPTQAFc3BhcmsAF2tyYmRzLWhpdmUuZnlyZS5pYm0uY29tAAAAAWUcP00CABoAIJNL0pQT6SkPC+JfILB+yq3rcCQo/6uRfLuBSPUmlS6XAAAAAgAAAEwAAgAHSUJNLkNPTQAFc3BhcmsAF2tyYmRzLWhpdmUuZnlyZS5pYm0uY29tAAAAAWUcP00CABkAEAa0R7FrW9AX+Q4GfmCLiG4AAAAC', ?, ?, ?)
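
    The Keytab value in this example appears to be a base64-encoded copy of the keytab file. Assuming that is the expected encoding, you can produce such a single-line string on Linux as follows; the keytab path is hypothetical.
    # Emit the keytab as one base64 line suitable for the Keytab= parameter (assumed format).
    base64 -w 0 /etc/security/keytabs/hive.service.keytab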
    
    
  5. Check whether the stored procedure was successful by selecting the Results tab, and then checking the Output value column.
    • A successful call returns an output value of 1.
    • An unsuccessful call returns an output value of 0. If you see 0, review the previous configuration steps.
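
    If the call succeeded, you can also confirm that the new connection is registered. Assuming that your instance exposes the DVSYS.LISTRDBC catalog view, which lists remote database connections in Data Virtualization, run a query such as the following in Run SQL.
    -- The new Kerberos data source should appear in the list of remote connections.
    SELECT * FROM DVSYS.LISTRDBC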