Supported data sources for curation and data quality

You can connect to many data sources for one or more of these purposes: importing metadata, running metadata enrichment or data quality rules, creating dynamic views, and writing the output of data quality rules.

File-storage connectors
Database connectors
Connectors and other data sources specific to metadata import

A dash (—) in any of the columns indicates that the data source is not supported for this purpose.

By default, data quality rules and the underlying DataStage flows support standard platform connections. Not all connectors that were supported in traditional DataStage and potentially used in custom DataStage flows are supported in IBM Knowledge Catalog.

Requirements and restrictions

Understand the requirements and restrictions for connections to be used in data curation and data quality assessment.

Required permissions

Users must be authorized to access the connections to the data sources. For metadata import, the user running the import must have the SELECT permission or a similar permission on the databases in question.

General prerequisites

Connection assets must exist in the project for connections that are used in these cases:

For running metadata enrichment including advanced analysis (in-depth primary key analysis, in-depth relationship analysis, or advanced data profiling) on assets in a metadata enrichment
For running data quality rules
For creating query-based data assets (dynamic views)
For writing output of data quality checks or frequency distribution tables

Supported source data formats

In general, metadata import, metadata enrichment, and data quality rules support the following data formats:

All: Tables from relational and nonrelational data sources. For Amazon S3, the Delta Lake table format.
Metadata import: Any format from file-based connections to the data sources. For Microsoft Excel workbooks, each sheet is imported as a separate data asset. The data asset name equals the name of the Excel sheet.
Metadata enrichment: Tabular: CSV, TSV, Avro, Parquet, Microsoft Excel (For workbooks uploaded from the local file system, only the first sheet in a workbook is profiled.)
Data quality rules: Tabular: Avro, CSV, Parquet, ORC; for data assets uploaded from the local file system, CSV only
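The format list above can be expressed as a small lookup. The following is a minimal sketch, not part of the product: the function name, the purpose keys, and the file extensions (in particular `xls` and `xlsx` for Microsoft Excel) are assumptions made for illustration.

```python
from pathlib import PurePath

# Formats taken from the list above. The extension spellings are assumptions;
# the documentation names formats, not file extensions.
SUPPORTED_FORMATS = {
    "metadata_enrichment": {"csv", "tsv", "avro", "parquet", "xls", "xlsx"},
    "data_quality_rules": {"avro", "csv", "parquet", "orc"},
}

# Data assets uploaded from the local file system: CSV only for rules.
LOCAL_UPLOAD_RULE_FORMATS = {"csv"}


def is_format_supported(file_name, purpose, local_upload=False):
    """Return True if the file's extension is listed for the given purpose."""
    ext = PurePath(file_name).suffix.lower().lstrip(".")
    if purpose == "data_quality_rules" and local_upload:
        return ext in LOCAL_UPLOAD_RULE_FORMATS
    return ext in SUPPORTED_FORMATS[purpose]
```

For example, `is_format_supported("sales.orc", "data_quality_rules")` is true, while the same file is not in the list for metadata enrichment.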

Database support for analysis output tables

In general, output tables that are generated during analysis can be written to these databases:

If a specific database connector also supports output tables, the Target for output tables column shows a checkmark.

File-storage connectors

Supported file-based connectors
Connector	Metadata import	Metadata enrichment	Definition-based rules
Amazon S3	✓	✓	✓
Apache HDFS	✓	✓	✓
Box	✓	✓ ¹	—
Generic S3	✓	✓ ¹	—
IBM Cloud Object Storage	✓	✓	—
IBM Match 360	✓	✓	✓
Microsoft Azure Data Lake Storage	✓	✓ ¹	✓

Notes:

¹ Advanced analysis is not supported for this data source.
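If you want to work with a support table programmatically, the cell conventions (checkmark, dash, and superscript note markers) are simple to parse. The following is a minimal sketch that uses three rows copied from the table above; the function names and the returned structure are hypothetical.

```python
# Header and sample rows copied from the file-storage table above.
HEADER = "Connector\tMetadata import\tMetadata enrichment\tDefinition-based rules"
ROWS = [
    "Amazon S3\t✓\t✓\t✓",
    "Box\t✓\t✓ ¹\t—",
    "IBM Cloud Object Storage\t✓\t✓\t—",
]

FOOTNOTE_MARKS = "¹²³⁴⁵⁶⁷⁸⁹"


def parse_cell(cell):
    """Return (supported, footnotes) for one table cell."""
    notes = [ch for ch in cell if ch in FOOTNOTE_MARKS]
    supported = "✓" in cell  # a dash means "not supported for this purpose"
    return supported, notes


def parse_matrix(header, rows):
    """Map each connector name to {column name: (supported, footnotes)}."""
    columns = header.split("\t")[1:]
    matrix = {}
    for row in rows:
        name, *cells = row.split("\t")
        matrix[name.strip()] = {
            col: parse_cell(cell) for col, cell in zip(columns, cells)
        }
    return matrix


matrix = parse_matrix(HEADER, ROWS)
```

With this structure, `matrix["Box"]["Metadata enrichment"]` carries both the support flag and the note marker, so the restriction on advanced analysis is not lost.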

Database connectors

Supported connections
Connector	Metadata import (assets)	Metadata import (lineage)	Metadata enrichment	Definition-based rules	SQL-based rules	SQL-based data assets	Target for output tables
Amazon RDS for MySQL	✓	—	✓	—	—	—	—
Amazon RDS for Oracle	✓	—	—	✓	✓	—	—
Amazon RDS for PostgreSQL	✓	—	✓	—	—	—	—
Amazon Redshift	✓	—	✓ ¹	✓	✓	✓	—
Apache Cassandra	✓	—	✓	✓	✓	✓	—
Apache Hive	✓	—	✓	✓	✓	✓	✓ ⁵
Apache Impala with Apache Kudu	✓	—	✓	✓	✓	✓	—
Dremio	✓	—	✓	✓	✓	✓	—
Google BigQuery	✓	—	✓	✓	✓	✓	✓ ⁶
Greenplum	✓	—	✓	✓	✓	✓	—
IBM Cloud Data Engine	✓	—	✓	—	—	—	—
IBM Cloud Databases for MongoDB	✓	—	✓	—	—	—	—
IBM Cloud Databases for MySQL	✓	—	✓	—	—	—	—
IBM Cloud Databases for PostgreSQL	✓	—	✓	—	—	—	—
IBM Data Virtualization	✓	—	✓	✓	✓	✓	—
IBM Data Virtualization Manager for z/OS ²	✓	—	✓	—	—	—	—
IBM Db2	✓	—	✓	✓	✓	✓	✓
IBM Db2 Big SQL	✓	—	✓	—	—	—	—
IBM Db2 for z/OS	✓	—	✓	—	—	—	—
IBM Db2 on Cloud	✓	—	✓	✓	✓	—	✓
IBM Db2 Warehouse	✓	—	✓	—	—	—	—
IBM Informix	✓	—	✓	—	—	—	—
IBM Netezza Performance Server	✓	—	✓	✓	✓	—	—
MariaDB	✓	—	✓	—	—	—	—
Microsoft Azure Databricks ⁷	✓	—	✓	✓	✓	✓	✓
Microsoft Azure SQL Database	✓	✓	✓ ¹	✓	✓	✓	—
Microsoft SQL Server	✓	✓	✓	✓	✓	✓	✓
MongoDB	✓	—	✓	✓	✓	—	—
MySQL	✓	—	✓	✓	✓	✓	—
Oracle ³	✓	—	✓	✓	✓	✓	✓
PostgreSQL	✓	—	✓	✓	✓	✓	✓
Salesforce.com	✓	—	✓ ¹ ⁴	—	—	—	—
SAP ASE	✓	—	✓ ¹	✓	✓	✓	—
SAP OData (authentication method: username and password)	✓	—	✓ ⁸	✓	—	—	—
SingleStoreDB	✓	—	✓	✓	✓	✓	✓
Snowflake	✓	✓	✓ ¹	✓	✓	✓	—
Teradata	✓	—	✓	✓	✓	✓	✓

Notes:

¹ Advanced analysis is not supported for this data source.

² With Data Virtualization Manager for z/OS, you add data and COBOL copybook assets from mainframe systems to catalogs in IBM Cloud Pak for Data. Copybooks are files that describe the data structure of a COBOL program. Data Virtualization Manager for z/OS helps you create virtual tables and views from COBOL copybook maps. You can then use these virtual tables and views to import and catalog mainframe data in IBM Cloud Pak for Data in the form of data assets and COBOL copybook assets.

The following types of COBOL copybook maps are not imported: ACI, Catalog, Natural.

Restriction: You can't import COBOL copybooks larger than 1 MB.

When the import is finished, you can go to the catalog to review the imported assets, including the COBOL copybook maps, virtual tables, and views. You can use these assets in the same ways as other assets in Cloud Pak for Data.

For more information, see Adding COBOL copybook assets.

³ Table and column descriptions are imported only if the connection is configured with one of the following metadata import advanced options:

No synonyms
Remarks and synonyms

⁴ Some objects in the SFORCE schema are not supported. See Salesforce.com.

⁵ To create metadata-enrichment output tables in Apache Hive at an earlier version than 3.0.0, you must apply the workaround described in Writing metadata enrichment output to an earlier version of Apache Hive than 3.0.0.

⁶ Output tables for advanced profiling: If you rerun advanced profiling too frequently, results might accumulate because the data might not be updated fast enough in Google BigQuery. Wait at least 90 minutes before rerunning advanced profiling with the same output target. For more information, see Stream data availability. Alternatively, you can define a different output table.

⁷ Supported with Hive Metastore and Unity Catalog.

⁸ Information about whether the data asset is a table or a view cannot be retrieved and is therefore not shown in the enrichment results.
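The 90-minute waiting period for Google BigQuery output tables (note ⁶) can be enforced in a scheduling script before a rerun is triggered. The following is a minimal sketch: the function name is hypothetical, and measuring the interval from the end of the previous run is an assumption, because the note does not say which point in time the interval starts from.

```python
from datetime import datetime, timedelta, timezone

# Minimum interval from note 6 for reruns that use the same output target.
MIN_RERUN_INTERVAL = timedelta(minutes=90)


def can_rerun_advanced_profiling(last_run_end, now=None):
    """Return True if at least 90 minutes have passed since the last run.

    Assumption: the interval is counted from the end of the previous run.
    Both arguments are expected to be timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_run_end >= MIN_RERUN_INTERVAL
```

If the check fails, the alternative named in the note still applies: define a different output table instead of waiting.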

Connectors and other data sources specific to metadata import

You can import asset and lineage metadata from additional data sources.

Data source	Metadata import (assets)	Metadata import (lineage)
IBM DataStage for Cloud Pak for Data ¹	—	✓
InfoSphere DataStage ¹	—	✓
Microsoft Power BI (Azure)	—	✓

Notes:

¹ To import lineage metadata from these sources, you must provide an input file. See Configuring metadata import for data integration assets.
