Apache HDFS connection
Last updated: Nov 21, 2024

To access your data in Apache HDFS, create a connection asset for it.

Apache Hadoop Distributed File System (HDFS) is a distributed file system that is designed to run on commodity hardware. Apache HDFS was formerly Hortonworks HDFS.

Create a connection to Apache HDFS

To create the connection asset, you need the following connection details. Only the WebHDFS URL is required.
The properties that are available in the connection form depend on whether you select Connect to Apache Hive, an option that lets you write tables to the Hive data source.

  • WebHDFS URL: The URL to access HDFS.
  • Hive host: Hostname or IP address of the Apache Hive server.
  • Hive database: The database in Apache Hive.
  • Hive port number: The port number of the Apache Hive server. The default value is 10000.
  • Hive HTTP path: The path of the endpoint, such as gateway/default/hive, when the server is configured for HTTP transport mode.
  • SSL certificate (if required by the Apache Hive server).
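
To illustrate how the Hive fields above relate to one another, the following minimal sketch composes them into a URL in the standard HiveServer2 JDBC format. It is an illustration only: this page does not state how the connection asset uses these values internally, and the function name hive_jdbc_url and the host hive.example.com are hypothetical.

```python
def hive_jdbc_url(host, database, port=10000, http_path=None, use_ssl=False):
    """Compose a HiveServer2 JDBC URL from the Hive connection fields.

    Hypothetical helper for illustration; not part of any product API.
    """
    # Hive host, Hive port number (default 10000), and Hive database
    url = f"jdbc:hive2://{host}:{port}/{database}"
    # Hive HTTP path applies only when the server uses HTTP transport mode
    if http_path:
        url += f";transportMode=http;httpPath={http_path}"
    # Enable SSL when the Hive server requires a certificate
    if use_ssl:
        url += ";ssl=true"
    return url


# Example with a hypothetical host and the sample path from the list above
print(hive_jdbc_url("hive.example.com", "default",
                    http_path="gateway/default/hive", use_ssl=True))
```

When no HTTP path is given, the sketch produces a plain binary-transport URL on the default port 10000.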

Credentials

The username is required.

  • Username and password
  • Hive user and password if you connect to Apache Hive
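
Before you enter the WebHDFS URL and username in the connection form, it can help to see how the two combine in a WebHDFS REST request. The sketch below builds a LISTSTATUS request URL. It is an assumption-laden illustration: the base URL, port, and user alice are hypothetical, and the user.name query parameter applies only to clusters that use simple (non-secured) authentication. Clusters that require a password usually authenticate differently, for example through a gateway.

```python
from urllib.parse import quote, urlencode


def webhdfs_request_url(base_url, hdfs_path, op="LISTSTATUS", user=None):
    """Build a WebHDFS REST URL for an operation on an HDFS path.

    Hypothetical helper for illustration; base_url is the WebHDFS URL
    that ends in /webhdfs/v1.
    """
    params = {"op": op}
    # user.name is honored only when Hadoop security is off (simple auth)
    if user:
        params["user.name"] = user
    # Percent-encode the HDFS path but keep the slashes between segments
    path = quote(hdfs_path.lstrip("/"))
    return f"{base_url.rstrip('/')}/{path}?{urlencode(params)}"


# Hypothetical NameNode address; substitute your own WebHDFS URL
print(webhdfs_request_url("https://namenode.example.com:9871/webhdfs/v1",
                          "/user/alice", user="alice"))
```

Sending a GET request to the resulting URL with any HTTP client is a quick way to confirm that the WebHDFS endpoint is reachable before you create the connection asset.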

Private connectivity: To connect to a database that is not exposed to the internet (for example, one behind a firewall), you must set up a secure connection.

Choose the method for creating a connection based on where you are in the platform

In a project
Click Assets > New asset > Connect to a data source. See Adding a connection to a project.
In a catalog
Click Add to catalog > Connection. See Adding a connection asset to a catalog.
In a deployment space
Click Import assets > Data access > Connection. See Adding data assets to a deployment space.
In the Platform assets catalog
Click New connection. See Adding platform connections.

Next step: Add data assets from the connection

Where you can use this connection

You can use Apache HDFS connections in the following workspaces and tools:

Projects

  • Data quality rules (IBM Knowledge Catalog)
  • Data Refinery (watsonx.ai Studio or IBM Knowledge Catalog)
  • DataStage (DataStage service). See Connecting to a data source in DataStage.
  • Decision Optimization (watsonx.ai Studio and watsonx.ai Runtime)
  • Metadata enrichment (IBM Knowledge Catalog)
  • Metadata import (IBM Knowledge Catalog)
  • SPSS Modeler (watsonx.ai Studio)

Catalogs

  • Platform assets catalog
  • Other catalogs (IBM Knowledge Catalog)

Apache HDFS setup

Install and set up a Hadoop cluster

Supported file types

The Apache HDFS connection supports these file types: Avro, CSV, Delimited text, Excel, JSON, ORC, Parquet, SAS, SAV, SHP, and XML.

Table formats

In addition to Flat file, the Apache HDFS connection supports these Data Lake table formats: Delta Lake and Iceberg.

Learn more

Apache HDFS Users Guide

Parent topic: Supported connections
