0 / 0
Supported data sources and data types in Watson Query

Supported data sources in Watson Query

Watson Query supports the following relational and nonrelational data sources on IBM Cloud Pak® for Data as a Service.

You cannot connect to a data source as a connection to an existing Platform connection if the Platform connection uses Cloud integration, Secure Gateway, or Satellite Link. These features are not supported by Watson Query when you connect to Platform connections. You see an error message that is similar to Cannot reach the network destination of the data source. You can set up the data source by using Cloud integration, Secure Gateway, or Satellite Link and provide the hostname or IP endpoint directly to Watson Query as a new connection.

Size limits: Watson Query supports virtualization of tables with a row size up to 1 MB, and up to 2048 columns in a table. However, the number of columns that Watson Query can preview depends on many factors, such as the data types of the columns. Currently, preview is limited to 200 columns.
Comment attributes: When virtual tables are created, Watson Query does not include comment attributes that were assigned to data source objects. This limitation applies to all data sources.
Data types: Some data types in your data source might not be supported in Watson Query. These limitations are documented in the following tables. Watson Query might also map some data types in your data source to alternative data types. These mappings are based on underlying Db2® Big SQL mappings. For more information, see Data types in Db2 Big SQL.

IBM data sources

The following table lists the IBM® data sources that you can connect to from Watson Query.

Table 1. Supported IBM data sources
Connector Limitations More information
IBM Cloud Compose for MySQL    
IBM Cloud Databases for MongoDB Databases for MongoDB is available as beta. The following MongoDB data types are supported in Watson Query: INT32, INT64, DOUBLE, STRING, BOOLEAN, DATE, and BINARY.  
IBM Cloud Databases for PostgreSQL  

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Cloud Object Storage

To connect to this data source, see Connecting to Cloud Object Storage in Watson Query.

For limitations, see Data sources in object storage in Watson Query.
 
IBM Watson™ Query
Important: Do not create a connection to your Watson Query instance.
 
IBM Data Virtualization Manager for z/OS® You cannot connect to Data Virtualization Manager for z/OS if the connection uses Cloud integration, Secure Gateway, or Satellite Link. These are features are not supported by Watson Query. You will see an error message that is similar to Cannot reach the network destination of the data source.
IBM Db2
  • The NCHAR and NVARCHAR types are not supported in Watson Query.
  • Db2 uses a DECFLOAT type, which Watson Query defaults to DOUBLE. This mapping influences how specific values are returned.
  • Db2 use DISTINCT types, which Watson Query defaults to BLOB.

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Db2 Big SQL  

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Db2 for i  
IBM Db2 for z/OS  
IBM Db2 on Cloud The NCHAR and NVARCHAR types are not supported in Watson Query.  
IBM Db2 Warehouse    
IBM Informix®
IBM Netezza® Performance Server
  • BLOB, XML, and CLOB Netezza data types are not supported in Watson Query.
  • When you use a SELECT statement or preview LOB data that is greater than 64 kB, Watson Query truncates the data to 64 K bytes only.

This connection is optimized to take advantage of the query capabilities in this data source.

IBM Planning Analytics

Watson Query can virtualize only TM1® views with one column dimension.

Aggregation cannot be pushed down to TM1. If a query against TM1 contains an aggregation operation such as COUNT, SUM, AVG, MAX, MIN, or GROUP BY, you can add the following server option:

alter server qplex options (add DB2_GROUP_BY 'N') 

Watson Query supports only VARCHAR and DOUBLE data types in TM1.

TM1 doesn't recognize SQL. The native query language of TM1 is MDX, which doesn't have the concept of pushdown. Therefore, this data source is not configured to support optimal predicate pushdown.

 

Third-party data sources

The following table lists the third-party data sources that you can connect to from Watson Query.

Table 2. Supported third-party data sources
Connector Limitations More information
Amazon RDS for MySQL

Since MySQL does not support database isolation, you must create only one connection to a database on a MySQL data source to avoid duplicate rows on the Virtualize page.

The BIT types are not supported in Watson Query.

 
Amazon RDS for Oracle
  • When you use a SELECT statement or preview LOB data that is greater than 64 kB, Watson Query truncates the data to 64 K bytes only.
  • The following data types are converted in Watson Query:
    • INTERVAL and JSON convert to VARCHAR.
    • National character types convert to CHAR, VARCHAR, or CLOB.
    • TIMESTAMP and TIMESTAMP WITH LOCAL TIME ZONE convert to TIMESTAMP.
    • XML convert to CLOB.
    • SMALLINT, INTEGER, BIGINT convert to DECIMAL 4.0, DECIMAL 9.0, and DECIMAL 18.0.
    • SMALLINT, INTEGER, BIGINT convert to DECIMAL 4.0, DECIMAL 9.0, and DECIMAL 18.0.

This connection is optimized to take advantage of the query capabilities in this data source.

Amazon RDS for PostgreSQL  

This connection is optimized to take advantage of the query capabilities in this data source.

Amazon Redshift SPATIAL, SKETCH, and SUPER data types are converted to CLOB in Watson Query.

This connection is optimized to take advantage of the query capabilities in this data source.

Amazon S3

To connect to this data source, see Connecting to Amazon S3 in Watson Query.

Specific rules apply for naming buckets in Amazon S3. For more information, see Bucket naming rules in the Amazon S3 documentation.

If you include dots in a bucket's name, you can't use virtual-host-style addressing over HTTPS, unless you perform your own certificate validation. The security certificates that are used for virtual hosting of buckets don't work for buckets with dots in their names.

For more limitations, see Data sources in object storage in Watson Query.

 
Apache Derby  

This connection is optimized to take advantage of the query capabilities in this data source.

Apache Hive
  • In virtualized tables, you can list only the value of complex data types, such as array, map, struct, and union type. Any other operation on the value of these complex data types is not supported.
  • When you use a SELECT statement or preview LOB data that is greater than 64 kB, Watson Query truncates the data to 64 K bytes only.
  • String and Binary types in Hive data sources now resolve to VARCHAR(32672) and VARBINARY(32672) by default, instead of CLOB and BLOB. A Watson Query Admin can configure HIVE_MAX_STRING_SIZE and HIVE_MAX_BINARY_SIZE. This update optimizes SQL performance for Hive data sources. A Watson Query Admin can configure HIVE_MAX_STRING_SIZE and HIVE_MAX_BINARY_SIZE by calling the DVSYS.SETCONFIGPROPERTY procedure.

    For example, to change the default maximum size for the Hive BINARY data type to 2000, which resolve to VARBINARY(2000), run the following command.

    db2 "call dvsys.setconfigproperty ('HIVE_MAX_BINARY_SIZE', '2000', '', ?, ?)"

    The minimum size is 1 and the maximum recommended size is 32672. You can exceed 32672; however, STRING and BINARY types will resolve to CLOB and BLOB types, which can cause certain queries to fail.

This connection is optimized to take advantage of the query capabilities in this data source.

Ceph

To connect to this data source, see Connecting to Ceph in Watson Query.

For limitations, see Data sources in object storage in Watson Query.

 
Cloudera Impala    
Generic S3
Note: This connector appears in the user interface; however, it is not supported.
 
Google BigQuery

To connect to this data source, see Connecting to Google BigQuery in Watson Query.

  • In the Google BigQuery data source, Watson Query does not support the use of the SELECT statement for columns with record data types.

This connection is optimized to take advantage of the query capabilities in this data source.

Greenplum    
MariaDB

The BIT, LONGTEXT, and LONGBLOB types are not supported in Watson Query.

Since MariaDB does not support database isolation, you must create only one connection to a database on a MariaDB data source to avoid duplicate rows on the Virtualize page.

You must have admin permission for the MariaDB data source for advanced operations such as collecting statistics.

This connection is optimized to take advantage of the query capabilities in this data source.

Microsoft Azure Data Lake Store
Note: This connector appears in the user interface; however, it is not supported.
 
Microsoft Azure SQL Database    
Microsoft SQL Server
  • When you use a SELECT statement or preview LOB data that is greater than 64 kB, Watson Query truncates the data to 64 K bytes only.

This connection is optimized to take advantage of the query capabilities in this data source.

MongoDB

The BIT, LONGTEXT, and LONGBLOB types are not supported in Watson Query.

You must have admin permission for the MongoDB data source for advanced operations such as collecting statistics.

 
MySQL
(My SQL Community Edition)

(My SQL Enterprise Edition)

Since MySQL does not support database isolation, you must create only one connection to a database on a MySQL data source to avoid duplicate rows on the Virtualize page.

The BIT types are not supported in Watson Query.

 
Oracle
  • When you use a SELECT statement or preview LOB data that is greater than 64 kB, Watson Query truncates the data to 64 K bytes only.
  • The following data types are converted in Watson Query:
    • INTERVAL and JSON convert to VARCHAR.
    • National character types convert to CHAR, VARCHAR, or CLOB.
    • TIMESTAMP and TIMESTAMP WITH LOCAL TIME ZONE convert to TIMESTAMP.
    • XML convert to CLOB.
    • SMALLINT, INTEGER, BIGINT convert to DECIMAL 4.0, DECIMAL 9.0, and DECIMAL 18.0 respectively.

This connection is optimized to take advantage of the query capabilities in this data source.

PostgreSQL  

This connection is optimized to take advantage of the query capabilities in this data source.

Salesforce.com    
SAP ASE    
SAP OData
You cannot preview or query nonreadable tables due to the following reasons:
  • The SAP OData data source might have write-only access. In such cases, changing the user permissions does not avoid this issue.
  • The SAP OData data source has read access, but requires filters. This limitation means that you cannot preview data, but you can read it if you specify filters.
 
Snowflake

To connect to this data source, see Connecting to Snowflake in Watson Query.

  • When you virtualize a Snowflake table that contains data types STRING, TEXT, and VARCHAR that exceed the default maximum string length of 32K for the MaxStringSize parameter, the data type of the column is converted to CLOB.
  • ARRAY, GEOGRAPHY, and VARIANT data types are converted to VARCHAR.
  • Watson Query does not support the Okta URL endpoint.

This connection is optimized to take advantage of the query capabilities in this data source.

Teradata

Teradata JDBC Driver 17.00 Copyright (C) 2015 - 2017 by Teradata. All rights reserved. IBM provides embedded usage of the Teradata JDBC Driver under license from Teradata solely for use as part of the IBM Watson® service offering.

  • The XML Teradata data type is not supported in Watson Query. The XML data type is converted to CLOB.
  • DECIMAL data is supported to 34 digits with DECFLOAT. Data columns that exceed this limit must be edited to DOUBLE during virtualization.
  • The following data types are converted in Watson Query:
    • VARBYTE is converted to VARCHAR for bit data.
    • BYTE is converted to BINARY.
    • BYTEINT is converted to SMALLINT.
    • TIME ZONE BEARING is converted to TIME and TIMESTAMP.
    • PERIOD, DISTINCT, DATASET, and ARRAY are converted to BLOB.
    • NUMBER is converted to DOUBLE.
    • YEAR, YEAR TO MONTH, DAY, DAY TO MINUTE, HOUR TO MINUTE, MONTH, HOUR, and MINUTE is converted to INTEGER.
    • DAY TO SECOND, HOUR TO SECOND, MINUTE TO SECOND, SECOND is converted to DECIMAL.
  • When you use a SELECT statement or preview LOB data that is greater than 64 kB, Watson Query truncates the data to 64 K bytes only.

This connection is optimized to take advantage of the query capabilities in this data source.

Data sources in object storage in Watson Query

You can use data that is stored as files on IBM Cloud Object Storage, Amazon S3, or Ceph data sources to create virtual tables. To access data that is stored in IBM Cloud Object Storage, you must create a connection to the data source where the files are located.

You can segment or combine data from one or more files to create a virtual table. Accessing files in object storage in Watson Query is built on Db2 Big SQL capabilities that use Hadoop external table support. For more information, see CREATE TABLE (HADOOP) statement.

Usage notes

Buckets and partitions play important roles in data organization. The following components are key elements of object storage:
  • A bucket is a logical abstraction that is used to provide a container for data. There is no folder concept in object storage; only buckets and keys. Buckets can be created only in IBM Cloud Object Storage, Amazon S3, or Ceph interfaces. They cannot be created in Watson Query.
  • A file path is the complete path to the file where you want to store data. The S3 file system implementation allows zero-length files to be treated like directories, and file names that contain a forward slash (/) are treated like nested directories. The file path includes the bucket name, an optional file path, and a file name. In object storage, the file path is used when a table is created. All files in the same path contribute to the table data. You can add more data by adding another file to the file path.

If your object storage data will be accessed through a virtualized table, the files that you want to virtualize must be within a single file path and within a single bucket, and the bucket must include at least one file that will be added to the cart. All the files in this file path will be part of the virtualized table. When more data is added to the table (new files are created in the file path), the data will be visible when you access the virtualized table. All files in the file path must use the same file format so that they are virtualized as one table.

If you want to virtualize the files in multiple file paths as one table, you can virtualize the bucket that contains all of the files. For example, if you have file paths A/B/C/T1a, A/B/C/T1b, A/B/D/T1c, and A/B/D/T1d, you can virtualize the file path A/B/. All the files in that path and nested paths will be part of the accessible object.

Limitations

  • Watson Query supports PARQUET (or PARQUETFILE), ORC (optimized row columnar), CSV (comma-separated values), TSV (tab-separated values), and JSON file formats. No other file formats are supported.
    • For PARQUET (or PARQUETFILE), file extensions are not required. Metadata is extracted from the data file.
    • For ORC, file extensions are not required. Metadata is extracted from the data file.
    • For CSV and TSV files, the .csv or .tsv file extension is required.
    • For JSON files, the .json file extension is required. JSON files must be coded so that each line is a valid JSON object. Lines must be separated by a newline character (\n). The JSON Lines text format, also called newline-delimited JSON, is the only supported JSON format. This format stores structured data that can be processed one record at a time.
    • All other file formats return an error. For more information, see Error message when you try to use an unsupported file format in Cloud Object Storage.
  • For text files in CSV, TSV, or JSON format, only UTF-8 character encoding is supported in Watson Query. Cloud Object Storage binary formats such as ORC or PARQUET are unaffected because they transparently encode character types.
  • You cannot create two objects (tables, schemas, or columns) with the same name, even if you use delimited identifiers and mixed case. For example, you cannot have a table named t1 and another table named T1. These names are considered to be duplicate names in object storage (Hive). For more information, see Identifiers.
  • Watson Query does not support the TIME data type in a virtualized table over object storage.
  • Watson Query does not support quotechar in IBM Cloud Object Storage. Preview has quotation mark characters within column data.
  • When you use data types such as STRING or TEXT for columns in a virtualized table, the columns are defined as VARCHAR(32672), which can impact performance and increase the memory requirements for statements that reference the table. To improve performance, use VARCHAR(n) with a specific length instead of STRING. Use the following syntax to alter the table to define an appropriate length for the column.
    ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
  • There is no support for compressed text files.
  • See also the Restrictions section in CREATE TABLE (HADOOP) statement.
  • Before you remove a connection to Cloud Object Storage, you must remove all virtualized tables in the object storage connection. If a connection is removed and you try to remove a table in that connection, you see an error. See Credential error message when you remove a virtualized table in object storage.

Improving performance

To optimize query performance, you can collect statistics on the data that is being queried. Watson Query uses the ANALYZE command to collect statistics on virtualized tables over object storage.

Data type considerations
To improve query performance, don't use the STRING or TEXT data type. Use the VARCHAR(n) data type, with n set to a value that is appropriate for the column data.
Partitioning considerations
Partition your data to improve query performance. Partitioned data is grouped by a common attribute. Watson Query can discover and use partitions to reduce the amount of data that queries must process. Querying the entire data set might not be possible or even necessary. To improve query performance, query just the partition that you need.

By default, partitioned character columns are assumed to be STRING columns, and it is best practice to accurately define the partitioned column types.