Data Virtualization supports the following relational and nonrelational data sources on IBM Cloud Pak® for Data as a Service.
You cannot connect to a data source as an existing Platform connection if the Platform connection uses Cloud integration, Secure Gateway, or Satellite Link. These features are not supported by Data Virtualization when you connect to Platform connections; you see an error message similar to `Cannot reach the network destination of the data source`. Instead, set up the data source by using Cloud integration, Secure Gateway, or Satellite Link, and provide the hostname or IP endpoint directly to Data Virtualization as a new connection.
- Size limits
- Data Virtualization supports virtualization of tables with a row size up to 1 MB, and up to 2048 columns in a table. However, the number of columns that Data Virtualization can preview depends on many factors, such as the data types of the columns. Currently, preview is limited to 200 columns.
- Comment attributes
- When virtual tables are created, Data Virtualization does not include comment attributes that were assigned to data source objects. This limitation applies to all data sources.
- Data types
- Some data types in your data source might not be supported in Data Virtualization. These limitations are documented in the following tables. Data Virtualization might also map some data types in your data source to alternative data types. These mappings are based on underlying Db2® Big SQL mappings. For more information, see Data types in Db2 Big SQL.
IBM data sources
The following table lists the IBM® data sources that you can connect to from Data Virtualization.
Connector | Limitations | More information |
---|---|---|
IBM Cloud® Databases for MongoDB | IBM Cloud Databases for MongoDB is available as beta. The following MongoDB data types are supported in Data Virtualization: INT32, INT64, DOUBLE, STRING, BOOLEAN, DATE, and BINARY. | |
IBM Cloud Databases for MySQL | | |
IBM Cloud Databases for PostgreSQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Cloud Object Storage | | This connection requires special consideration in Data Virtualization. See Connecting to IBM Cloud Object Storage in Data Virtualization. For limitations, see Data sources in object storage in Data Virtualization. |
IBM Data Virtualization Manager for z/OS® | You cannot connect to Data Virtualization Manager for z/OS if the connection uses Cloud integration, Secure Gateway, or Satellite Link. These features are not supported by Data Virtualization. You see an error message similar to `Cannot reach the network destination of the data source`. | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 Big SQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 for i | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 for z/OS | | |
IBM Db2 on Cloud | The NCHAR and NVARCHAR data types are not supported in Data Virtualization. | |
IBM Db2 Warehouse | | |
IBM Informix® | The INTERVAL, BIGINT, and BIGSERIAL data types are not supported in Data Virtualization. For more information, see Exceptions occur when using the Connect for JDBC Informix driver. | |
IBM Netezza® Performance Server | | This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Planning Analytics | | |
IBM Data Virtualization | Important: Do not create a connection to your Data Virtualization instance. | |
Third-party data sources
The following table lists the third-party data sources that you can connect to from Data Virtualization.
Connector | Limitations | More information |
---|---|---|
Amazon RDS for MySQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon RDS for Oracle | | This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon RDS for PostgreSQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon Redshift | SPATIAL, SKETCH, and SUPER data types are converted to CLOB in Data Virtualization. | This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon S3 | | This connection requires special consideration in Data Virtualization. See Connecting to Amazon S3 in Data Virtualization. |
Apache Derby | | This connection is optimized to take advantage of the query capabilities in this data source. |
Apache Hive | | This connection is optimized to take advantage of the query capabilities in this data source. |
Ceph | | This connection requires special consideration in Data Virtualization. See Connecting to Ceph in Data Virtualization. For limitations, see Data sources in object storage in Data Virtualization. |
Cloudera Impala | | This connection is optimized to take advantage of the query capabilities in this data source. |
Generic S3 | | For limitations, see Data sources in object storage in Data Virtualization. |
Google BigQuery | | This connection requires special consideration in Data Virtualization. See Connecting to Google BigQuery in Data Virtualization. This connection is optimized to take advantage of the query capabilities in this data source. |
Greenplum | | This connection is optimized to take advantage of the query capabilities in this data source. |
MariaDB | | This connection is optimized to take advantage of the query capabilities in this data source. |
Microsoft Azure Data Lake Storage | | |
Microsoft Azure SQL Database | | |
Microsoft SQL Server | | This connection is optimized to take advantage of the query capabilities in this data source. |
MongoDB | | |
MySQL (MySQL Community Edition and MySQL Enterprise Edition) | | This connection is optimized to take advantage of the query capabilities in this data source. |
Oracle | | This connection is optimized to take advantage of the query capabilities in this data source. |
PostgreSQL | | This connection is optimized to take advantage of the query capabilities in this data source. |
Salesforce.com | | This connection is optimized to take advantage of the query capabilities in this data source. |
SAP ASE | | |
SAP OData | You cannot preview or query nonreadable tables. | |
Snowflake | | This connection requires special consideration in Data Virtualization. See Connecting to Snowflake in Data Virtualization. This connection is optimized to take advantage of the query capabilities in this data source. |
Teradata | | Teradata JDBC Driver 17.00 Copyright (C) 2024 by Teradata. All rights reserved. IBM provides embedded usage of the Teradata JDBC Driver under license from Teradata solely for use as part of the IBM Watson® service offering. This connection is optimized to take advantage of the query capabilities in this data source. |
Data sources in object storage in Data Virtualization
You can use data that is stored as files on IBM Cloud Object Storage, Amazon S3, Ceph, or Generic S3 data sources to create virtual tables. To access data that is stored in cloud object storage, you must create a connection to the data source where the files are located.
You can segment or combine data from one or more files to create a virtual table. Accessing files in object storage in Data Virtualization is built on Db2 Big SQL capabilities that use Hadoop external table support. For more information, see CREATE TABLE (HADOOP) statement.
Terminology
- A bucket is a logical abstraction that is used to provide a container for data. There is no folder concept in object storage; only buckets and keys. Buckets can be created only in the interface of the object storage data source. They cannot be created in Data Virtualization. Bucket names must be unique and adhere to the rules of the object storage provider. These rules often include restricting the name to 3 - 63 characters with lowercase letters, numbers, and dashes only. Bucket names must begin and end with a lowercase letter or number. When Data Virtualization accesses data in object storage, the bucket name must be unique across all object storage connections.
- A file path is the complete path to the file where you want to store data. The S3 file system implementation allows zero-length files to be treated like directories, and file names that contain a forward slash (/) are treated like nested directories. The file path includes the bucket name, an optional file path, and a file name. In object storage, the file path is used when a table is created. All files in the same path contribute to the table data. You can add more data by adding another file to the file path.
- A partition is data that is grouped by a common attribute in the schema. Partitioning divides the data into multiple file paths, which are treated like directories. Data Virtualization can discover and use partitions to reduce the amount of data that queries must process, which can improve performance of queries that use predicates on the partitioning columns.
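As an illustration, the common bucket-naming rules described above can be captured in a small validator. The exact constraints depend on your object storage provider, so treat this regex as a sketch of typical S3-style rules, not a definitive check:

```python
import re

# Sketch of common S3-style bucket naming rules: 3-63 characters,
# lowercase letters, digits, and dashes only, beginning and ending
# with a letter or digit. Providers may impose additional rules.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return bool(BUCKET_RE.match(name))

print(is_valid_bucket_name("my-data-bucket"))  # True
print(is_valid_bucket_name("My_Bucket"))       # False: uppercase and underscore
print(is_valid_bucket_name("ab"))              # False: shorter than 3 characters
```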
Best practices
- File formats
- Data Virtualization supports PARQUET (or PARQUETFILE), ORC (optimized row columnar), CSV
(comma-separated values), TSV (tab-separated values), and JSON file formats. No other file formats
are supported.
- For PARQUET (or PARQUETFILE), file extensions are not required. Metadata is extracted from the data file.
- For ORC, file extensions are not required. Metadata is extracted from the data file.
- For CSV and TSV files:
  - The appropriate .csv or .tsv file extension is required, as follows:
    - CSV: The .csv file extension is required and the contents of the file must follow the specifications for comma-separated values.
    - TSV: The .tsv file extension is required and the contents of the file must follow the specifications for tab-separated values.
  - An optional parameter can be used to specify a string delimiter character (`quoteChar`) that surrounds field values in CSV and TSV files.
    - The performance of querying the data might be negatively impacted if `quoteChar` is specified.
    - The default value is no delimiter (not specified).
    - The value for `quoteChar` must be a single character that cannot be a space (blank), backslash, tab, comma, or newline character (`\n`).
    - If the string value contains the string delimiter (`quoteChar`), the backslash (`\`) can be used to escape the string delimiter.
- For JSON files, the .json file extension is required. JSON files must be coded so that each line is a valid JSON object. Lines must be separated by a newline character (`\n`). The JSON Lines text format, also called newline-delimited JSON, is the only supported JSON format. This format stores structured data that can be processed one record at a time.
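A minimal sketch of the JSON Lines layout that the .json requirement describes, using only the Python standard library:

```python
import json

# Write records in JSON Lines (newline-delimited JSON) format:
# one complete JSON object per line, separated by "\n".
records = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)

# Reading back: each line parses independently, so the file can be
# processed one record at a time.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```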
Note: All other file formats return an error. For more information, see Error message when you try to use an unsupported file format in Cloud Object Storage.
- Organizing data
-
- Avoid using nonalphanumeric characters in column names because they can interfere with Hive compatibility. Any character that is not an alphanumeric character or the underscore character is encoded as _xNNNN, where NNNN is the hex value of the character. If you want to view such column names unencoded, enable the `allownonalphanumeric` option by completing these steps:
  - Access the head pod in the Data Virtualization instance (c-db2u-dv-db2u-0).
  - Run the following command to edit the configuration to include the `allownonalphanumeric` option:

        db2uctl adm bigsql config --key bigsql.catalog.identifier.mappingrules --value allownonalphanumeric,allowleadingdigits,allowreservedwords
  - Run the following commands to restart Big SQL:

        su - db2inst1
        bigsql stop ; bigsql start
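The _xNNNN mapping can be illustrated with a small sketch. The 4-digit uppercase hex form used here is an assumption for illustration; this shows the rule as documented, not the actual Big SQL implementation:

```python
def encode_column_name(name: str) -> str:
    # Encode each character that is not alphanumeric or underscore as
    # _xNNNN, where NNNN is the character's hex code point. The 4-digit
    # uppercase form is assumed for illustration.
    out = []
    for ch in name:
        if ch.isalnum() or ch == "_":
            out.append(ch)
        else:
            out.append("_x%04X" % ord(ch))
    return "".join(out)

print(encode_column_name("unit price"))   # unit_x0020price
print(encode_column_name("total-cost"))   # total_x002Dcost
print(encode_column_name("col_1"))        # col_1 (unchanged)
```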
- If your object storage data is accessed through a virtualized table, the files that you want to virtualize must be within a single file path and within a single bucket, and the bucket must include at least one file that you add to the cart. All the files in this file path are part of the virtualized table. When more data is added to the table (new files are created in the file path), the data is visible when you access the virtualized table. All files in the file path must use the same file format so that they are virtualized as one table.
- If you want to virtualize the files in multiple file paths as one table, you can virtualize the bucket that contains all of the files. For example, if you have file paths A/B/C/T1a, A/B/C/T1b, A/B/D/T1c, and A/B/D/T1d, you can virtualize the file path A/B/. All the files in that path and nested paths will be part of the accessible object.
- Do not create two objects (tables, schemas, or columns) with the same name, even if you use delimited identifiers and mixed case. For example, you cannot have a table t1 and another table that is named T1. These names are considered to be duplicate names in object storage (Hive). For more information, see Identifiers.
- Db2 supports a wider range of valid delimited identifiers than Hive supports. Some identifier names that are specified when you create virtualized tables over object store might be adjusted before they can be accepted into the Hive catalog. The mapping is done automatically. For more information, see Identifiers.
- When new data is added to the file path for a virtualized table, consider running the following command to ensure that the metadata cache is updated to see the new data:

      CALL SYSHADOOP.HCAT_CACHE_SYNC(<schema>, <object>)

  For more information, see the HCAT_CACHE_SYNC stored procedure.
- When new partitions are added to the file path for the virtualized table, click Refresh partitions in the overflow menu on the Virtualized data page to identify new partitions. You can also run the following command in the SQL interface to identify the new partitions that were added:

      MSCK REPAIR TABLE <table-name>

  For more information, see MSCK REPAIR TABLE.
- Optimizing query performance
-
- Use a compact file format such as ORC or Parquet to minimize network traffic, which improves query performance.
- Don't use the STRING or TEXT data type. Use the VARCHAR(n) data type, with n set to a value that is appropriate for the column data. Use the following command to alter the table to define an appropriate length for the column:

      ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
- Partition your data by using Hive-style partitioning. Partitioned data is grouped by a common attribute. Data Virtualization can use partitions to reduce the amount of data that queries must process. Querying the entire data set might not be possible or even necessary. You can use predicates in your queries that include the partitioning columns to improve performance. For example, a school_records table that is partitioned on a year column segregates values by year into separate file paths. A `WHERE` condition such as `YEAR=1993`, `YEAR IN (1996,1995)`, or `YEAR BETWEEN 1992 AND 1996` scans only the data in the appropriate file paths to resolve the query.
- Define partitioned column types accurately. By default, partitioned character columns are assumed to be STRING type, which is not recommended. Redefine the partitioned columns to an appropriate data type.
- Collect statistics on the data that is being queried. Data Virtualization uses the ANALYZE command to collect statistics on virtualized tables over object storage. You can collect statistics in the web client or by using SQL. For more information, see Collecting statistics in Data Virtualization.
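The school_records partitioning example can be sketched as follows. The file names are hypothetical; the point is that only paths whose partition value matches the predicate need to be scanned:

```python
# Sketch of Hive-style partition pruning: data partitioned on a year
# column lives in year=<value>/ file paths, and a predicate on year
# selects only the matching paths. File names are hypothetical.
paths = [
    "school_records/year=1992/part-0000.parquet",
    "school_records/year=1993/part-0000.parquet",
    "school_records/year=1995/part-0000.parquet",
    "school_records/year=1996/part-0000.parquet",
]

def partition_year(path: str) -> int:
    # Extract the partition value from the Hive-style path segment.
    segment = [s for s in path.split("/") if s.startswith("year=")][0]
    return int(segment.split("=")[1])

# YEAR IN (1996, 1995) scans only two of the four file paths.
pruned = [p for p in paths if partition_year(p) in (1995, 1996)]
print(pruned)
```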
Limitations
- Only UTF-8 character encoding is supported in Data Virtualization for text files in CSV, TSV, or JSON format. Cloud Object Storage binary formats such as ORC or PARQUET are unaffected because they transparently encode character types.
- Data Virtualization does not support the TIME data type in a virtualized table over object storage.
- Preview of assets in cloud object storage shows only the first 200 columns of the table.
- Before you remove a connection to Cloud Object Storage, you must remove all virtualized tables in the object storage connection. If a connection is removed and you try to remove a table in that connection, you see an error. See Credential error message when you remove a virtualized table in object storage.
- If the bucket is not specified in the connection properties, the connection is global. In this case, include the bucket name in the file path. You can specify up to one global connection in a Data Virtualization instance.
See also Restrictions in CREATE TABLE (HADOOP) statement.