Data Virtualization supports the following relational and nonrelational data sources on IBM Cloud Pak® for Data as a Service.
You cannot connect to a data source as an existing Platform connection if the Platform connection uses Cloud integration, Secure Gateway, or Satellite Link. These features are not supported by Data Virtualization when you connect to Platform connections; you see an error message similar to `Cannot reach the network destination of the data source`. Instead, set up the data source by using Cloud integration, Secure Gateway, or Satellite Link, and provide the hostname or IP endpoint directly to Data Virtualization as a new connection.
- Size limits
- Data Virtualization supports virtualization of tables with a row size up to 1 MB, and up to 2048 columns in a table. However, the number of columns that Data Virtualization can preview depends on many factors, such as the data types of the columns. Currently, preview is limited to 200 columns.
- Comment attributes
- When virtual tables are created, Data Virtualization does not include comment attributes that were assigned to data source objects. This limitation applies to all data sources.
- Data types
- Some data types in your data source might not be supported in Data Virtualization. These limitations are documented in the following tables. Data Virtualization might also map some data types in your data source to alternative data types. These mappings are based on underlying Db2® Big SQL mappings. For more information, see Data types in Db2 Big SQL.
IBM data sources
The following table lists the IBM® data sources that you can connect to from Data Virtualization.
Connector | Limitations | More information |
---|---|---|
IBM Cloud® Databases for MongoDB | IBM Cloud Databases for MongoDB is available as beta. The following MongoDB data types are supported in Data Virtualization: INT32, INT64, DOUBLE, STRING, BOOLEAN, DATE, and BINARY. | |
IBM Cloud Databases for MySQL | | |
IBM Cloud Databases for PostgreSQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Cloud Object Storage | | This connection requires special consideration in Data Virtualization. See Connecting to IBM Cloud Object Storage in Data Virtualization. For limitations, see Data sources in object storage in Data Virtualization. |
IBM Data Virtualization Manager for z/OS® | You cannot connect to Data Virtualization Manager for z/OS if the connection uses Cloud integration, Secure Gateway, or Satellite Link. These features are not supported by Data Virtualization. You see an error message similar to `Cannot reach the network destination of the data source`. | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 Big SQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 for i | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 for z/OS | | |
IBM Db2 on Cloud | The NCHAR and NVARCHAR data types are not supported in Data Virtualization. | |
IBM Db2 Warehouse | | |
IBM Informix® | The INTERVAL, BIGINT, and BIGSERIAL data types are not supported in Data Virtualization. For more information, see Exceptions occur when using the Connect for JDBC Informix driver. | |
IBM Netezza® Performance Server | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Planning Analytics | | |
IBM Data Virtualization | Important: Do not create a connection to your Data Virtualization instance. | |
Third-party data sources
The following table lists the third-party data sources that you can connect to from Data Virtualization.
Connector | Limitations | More information |
---|---|---|
Amazon RDS for MySQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon RDS for Oracle | | This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon RDS for PostgreSQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon Redshift | SPATIAL, SKETCH, and SUPER data types are converted to CLOB in Data Virtualization. | This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon S3 | | This connection requires special consideration in Data Virtualization. See Connecting to Amazon S3 in Data Virtualization. |
Apache Derby | | This connection is optimized to take advantage of the query capabilities in this data source. |
Apache Hive | | This connection is optimized to take advantage of the query capabilities in this data source. |
Ceph | | This connection requires special consideration in Data Virtualization. See Connecting to Ceph in Data Virtualization. For limitations, see Data sources in object storage in Data Virtualization. |
Cloudera Impala | | This connection is optimized to take advantage of the query capabilities in this data source. |
Generic S3 | | For limitations, see Data sources in object storage in Data Virtualization. |
Google BigQuery | | This connection requires special consideration in Data Virtualization. See Connecting to Google BigQuery in Data Virtualization. This connection is optimized to take advantage of the query capabilities in this data source. |
Greenplum | | This connection is optimized to take advantage of the query capabilities in this data source. |
MariaDB | | This connection is optimized to take advantage of the query capabilities in this data source. |
Microsoft Azure Data Lake Storage | | |
Microsoft Azure SQL Database | | |
Microsoft SQL Server | | This connection is optimized to take advantage of the query capabilities in this data source. |
MongoDB | | |
MySQL (MySQL Community Edition and MySQL Enterprise Edition) | | This connection is optimized to take advantage of the query capabilities in this data source. |
Oracle | | This connection is optimized to take advantage of the query capabilities in this data source. |
PostgreSQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
Salesforce.com | | This connection is optimized to take advantage of the query capabilities in this data source. |
SAP ASE | | |
SAP OData | You cannot preview or query nonreadable tables. | |
Snowflake | | This connection requires special consideration in Data Virtualization. See Connecting to Snowflake in Data Virtualization. This connection is optimized to take advantage of the query capabilities in this data source. |
Teradata | | Teradata JDBC Driver 17.00 Copyright (C) 2024 by Teradata. All rights reserved. IBM provides embedded usage of the Teradata JDBC Driver under license from Teradata solely for use as part of the IBM Watson® service offering. This connection is optimized to take advantage of the query capabilities in this data source. |
Data sources in object storage in Data Virtualization
You can use data that is stored as files on IBM Cloud Object Storage, Amazon S3, Ceph, or Generic S3 data sources to create virtual tables. To access data that is stored in cloud object storage, you must create a connection to the data source where the files are located.
You can segment or combine data from one or more files to create a virtual table. Accessing files in object storage in Data Virtualization is built on Db2 Big SQL capabilities that use Hadoop external table support. For more information, see CREATE TABLE (HADOOP) statement.
Terminology
- A bucket is a logical abstraction that is used to provide a container for data. There is no folder concept in object storage; only buckets and keys. Buckets can be created only in the interface of the object storage data source. They cannot be created in Data Virtualization. Bucket names must be unique and adhere to the rules of the object storage provider. These rules often include restricting the name to 3 - 63 characters with lowercase letters, numbers, and dashes only. Bucket names must begin and end with a lowercase letter or number. When Data Virtualization accesses data in object storage, the bucket name must be unique across all object storage connections.
- A file path is the complete path to the file where you want to store data. The S3 file system implementation allows zero-length files to be treated like directories, and file names that contain a forward slash (/) are treated like nested directories. The file path includes the bucket name, an optional file path, and a file name. In object storage, the file path is used when a table is created. All files in the same path contribute to the table data. You can add more data by adding another file to the file path.
- A partition is data that is grouped by a common attribute in the schema. Partitioning divides the data into multiple file paths, which are treated like directories. Data Virtualization can discover and use partitions to reduce the amount of data that queries must process, which can improve performance of queries that use predicates on the partitioning columns.
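As an illustration, the common bucket-naming rules described above can be captured in a small validator. The exact constraints depend on your object storage provider, so treat this regex as a sketch of typical S3-style rules, not a definitive check:

```python
import re

# Sketch of common S3-style bucket naming rules: 3-63 characters,
# lowercase letters, digits, and dashes only, beginning and ending
# with a letter or digit. Providers may impose additional rules.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return bool(BUCKET_RE.match(name))

print(is_valid_bucket_name("my-data-bucket"))  # True
print(is_valid_bucket_name("My_Bucket"))       # False: uppercase and underscore
print(is_valid_bucket_name("ab"))              # False: shorter than 3 characters
```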
Best practices
- File formats
- Data Virtualization supports PARQUET (or PARQUETFILE), ORC (optimized row columnar), CSV
(comma-separated values), TSV (tab-separated values), and JSON file formats. No other file formats
are supported.
- For PARQUET (or PARQUETFILE), file extensions are not required. Metadata is extracted from the data file.
- For ORC, file extensions are not required. Metadata is extracted from the data file.
- For CSV and TSV files:
  - The appropriate .csv or .tsv file extension is required, as follows:
    - CSV: The .csv file extension is required and the contents of the file must follow the specifications for comma-separated values.
    - TSV: The .tsv file extension is required and the contents of the file must follow the specifications for tab-separated values.
  - An optional parameter can be used to specify a string delimiter character (`quoteChar`) that surrounds field values in CSV and TSV files.
    - The performance of querying the data might be negatively impacted if `quoteChar` is specified.
    - The default value is no delimiter (not specified).
    - The value for `quoteChar` must be a single character that cannot be a space (blank), backslash, tab, comma, or newline character (`\n`).
    - If the string value contains the string delimiter (`quoteChar`), the backslash (`\`) can be used to escape the string delimiter.
- For JSON files, the .json file extension is required. JSON files must be coded so that each line is a valid JSON object. Lines must be separated by a newline character (`\n`). The JSON Lines text format, also called newline-delimited JSON, is the only supported JSON format. This format stores structured data that can be processed one record at a time.
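A minimal sketch of the JSON Lines layout that the .json requirement describes, using only the Python standard library:

```python
import json

# Write records in JSON Lines (newline-delimited JSON) format:
# one complete JSON object per line, separated by "\n".
records = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)

# Reading back: each line parses independently, so the file can be
# processed one record at a time.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```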
Note: All other file formats return an error. For more information, see Error message when you try to use an unsupported file format in Cloud Object Storage.
- Organizing data
-
- Avoid using nonalphanumeric characters in column names because they can interfere with Hive compatibility. Any character that is not an alphanumeric character or the underscore character is encoded as _xNNNN, where NNNN is the hex value of the character. If you want to view such column names unencoded, enable the `allownonalphanumeric` option by completing these steps:
  - Access the head pod in the Data Virtualization instance (c-db2u-dv-db2u-0).
  - Run the following command to edit the configuration to include the `allownonalphanumeric` option:

        db2uctl adm bigsql config --key bigsql.catalog.identifier.mappingrules --value allownonalphanumeric,allowleadingdigits,allowreservedwords
  - Run the following commands to restart Big SQL:

        su - db2inst1
        bigsql stop ; bigsql start
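The _xNNNN mapping can be illustrated with a small sketch. The 4-digit uppercase hex form used here is an assumption for illustration; this shows the rule as documented, not the actual Big SQL implementation:

```python
def encode_column_name(name: str) -> str:
    # Encode each character that is not alphanumeric or underscore as
    # _xNNNN, where NNNN is the character's hex code point. The 4-digit
    # uppercase form is assumed for illustration.
    out = []
    for ch in name:
        if ch.isalnum() or ch == "_":
            out.append(ch)
        else:
            out.append("_x%04X" % ord(ch))
    return "".join(out)

print(encode_column_name("unit price"))   # unit_x0020price
print(encode_column_name("total-cost"))   # total_x002Dcost
print(encode_column_name("col_1"))        # col_1 (unchanged)
```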
- If your object storage data is accessed through a virtualized table, the files that you want to virtualize must be within a single file path and within a single bucket, and the bucket must include at least one file that you add to the cart. All the files in this file path are part of the virtualized table. When more data is added to the table (new files are created in the file path), the data is visible when you access the virtualized table. All files in the file path must use the same file format so that they are virtualized as one table.
- If you want to virtualize the files in multiple file paths as one table, you can virtualize the bucket that contains all of the files. For example, if you have file paths A/B/C/T1a, A/B/C/T1b, A/B/D/T1c, and A/B/D/T1d, you can virtualize the file path A/B/. All the files in that path and nested paths will be part of the accessible object.
- Do not create two objects (tables, schemas, or columns) with the same name, even if you use delimited identifiers and mixed case. For example, you cannot have a table t1 and another table that is named T1. These names are considered to be duplicate names in object storage (Hive). For more information, see Identifiers.
- Db2 supports a wider range of valid delimited identifiers than Hive supports. Some identifier names that are specified when you create virtualized tables over object store might be adjusted before they can be accepted into the Hive catalog. The mapping is done automatically. For more information, see Identifiers.
- When new data is added to the file path for a virtualized table, consider running the following command to ensure that the metadata cache is updated to see the new data:

      CALL SYSHADOOP.HCAT_CACHE_SYNC(<schema>, <object>)

  For more information, see the HCAT_CACHE_SYNC stored procedure.
- When new partitions are added to the file path for the virtualized table, click Refresh partitions in the overflow menu on the Virtualized data page to identify new partitions. You can also run the following command in the SQL interface to identify the new partitions that were added:

      MSCK REPAIR TABLE <table-name>

  For more information, see MSCK REPAIR TABLE.
- Optimizing query performance
-
- Use a compact file format such as ORC or Parquet to minimize network traffic, which improves query performance.
- Don't use the STRING or TEXT data type. Use the VARCHAR(n) data type, with n set to a value that is appropriate for the column data. Use the following command to alter the table to define an appropriate length for the column:

      ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
- Partition your data by using Hive-style partitioning. Partitioned data is grouped by a common attribute. Data Virtualization can use partitions to reduce the amount of data that queries must process. Querying the entire data set might not be possible or even necessary. You can use predicates in your queries that include the partitioning columns to improve performance. For example, a school_records table that is partitioned on a year column segregates values by year into separate file paths. A `WHERE` condition such as `YEAR=1993`, `YEAR IN (1996,1995)`, or `YEAR BETWEEN 1992 AND 1996` scans only the data in the appropriate file paths to resolve the query.
- Define partitioned column types accurately. By default, partitioned character columns are assumed to be STRING type, which is not recommended. Redefine the partitioned columns to an appropriate data type.
- Collect statistics on the data that is being queried. Data Virtualization uses the ANALYZE command to collect statistics on virtualized tables over object storage. You can collect statistics in the web client or by using SQL. For more information, see Collecting statistics in Data Virtualization.
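The school_records partitioning example can be sketched as follows. The file names are hypothetical; the point is that only paths whose partition value matches the predicate need to be scanned:

```python
# Sketch of Hive-style partition pruning: data partitioned on a year
# column lives in year=<value>/ file paths, and a predicate on year
# selects only the matching paths. File names are hypothetical.
paths = [
    "school_records/year=1992/part-0000.parquet",
    "school_records/year=1993/part-0000.parquet",
    "school_records/year=1995/part-0000.parquet",
    "school_records/year=1996/part-0000.parquet",
]

def partition_year(path: str) -> int:
    # Extract the partition value from the Hive-style path segment.
    segment = [s for s in path.split("/") if s.startswith("year=")][0]
    return int(segment.split("=")[1])

# YEAR IN (1996, 1995) scans only two of the four file paths.
pruned = [p for p in paths if partition_year(p) in (1995, 1996)]
print(pruned)
```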
Limitations
- Only UTF-8 character encoding is supported in Data Virtualization for text files in CSV, TSV, or JSON format. Cloud Object Storage binary formats such as ORC or PARQUET are unaffected because they transparently encode character types.
- Data Virtualization does not support the TIME data type in a virtualized table over object storage.
- Preview of assets in cloud object storage shows only the first 200 columns of the table.
- Before you remove a connection to Cloud Object Storage, you must remove all virtualized tables in the object storage connection. If a connection is removed and you try to remove a table in that connection, you see an error. See Credential error message when you remove a virtualized table in object storage.
- If the bucket is not specified in the connection properties, the connection is global. In this case, include the bucket name in the file path. You can specify up to one global connection in a Data Virtualization instance.
See also Restrictions in CREATE TABLE (HADOOP) statement.