Supported data sources in Watson Query
Watson Query supports the following relational and nonrelational data sources on IBM Cloud Pak® for Data as a Service.
You cannot connect to a data source as a connection to an existing Platform connection if the Platform connection uses Cloud integration, Secure Gateway, or Satellite Link. These features are not supported by Watson Query when you connect to Platform connections. You see an error message that is similar to Cannot reach the network destination of the data source. You can set up the data source by using Cloud integration, Secure Gateway, or Satellite Link and provide the hostname or IP endpoint directly to Watson Query as a new connection.
IBM data sources
The following table lists the IBM® data sources that you can connect to from Watson Query.
Connector | Limitations | More information |
---|---|---|
IBM Cloud Compose for MySQL | ||
IBM Cloud Databases for MongoDB | Databases for MongoDB is available as beta. The following MongoDB data types are supported in Watson Query: INT32, INT64, DOUBLE, STRING, BOOLEAN, DATE, and BINARY. | |
IBM Cloud Databases for PostgreSQL |
This connection is optimized to take advantage of the query capabilities in this data source. |
|
IBM Cloud Object Storage | To connect to this data source, see Connecting to Cloud Object Storage in Watson Query. For limitations, see Data sources in object storage in Watson Query. |
|
IBM Watson™ Query |
Important: Do not create a connection to your Watson Query instance.
|
|
IBM Data Virtualization Manager for z/OS® | You cannot connect to Data Virtualization Manager for z/OS if the connection uses Cloud integration, Secure Gateway, or Satellite Link. These are features are not supported by Watson Query. You will see an error message that is similar to Cannot reach the network destination of the data source. | |
IBM Db2 |
|
This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Db2 Big SQL |
This connection is optimized to take advantage of the query capabilities in this data source. |
|
IBM Db2 for i | ||
IBM Db2 for z/OS | ||
IBM Db2 on Cloud | The NCHAR and NVARCHAR types are not supported in Watson Query. | |
IBM Db2 Warehouse | ||
IBM Informix® |
|
|
IBM Netezza® Performance Server |
|
This connection is optimized to take advantage of the query capabilities in this data source. |
IBM Planning Analytics |
Watson Query can virtualize only TM1® views with one column dimension. Aggregation cannot be pushed down to TM1. If a query against TM1 contains an aggregation operation such as COUNT, SUM, AVG, MAX, MIN, or GROUP BY, you can add the following server option:
Watson Query supports only VARCHAR and DOUBLE data types in TM1. TM1 doesn't recognize SQL. The native query language of TM1 is MDX, which doesn't have the concept of pushdown. Therefore, this data source is not configured to support optimal predicate pushdown. |
Third-party data sources
The following table lists the third-party data sources that you can connect to from Watson Query.
Connector | Limitations | More information |
---|---|---|
Amazon RDS for MySQL |
Since MySQL does not support database isolation, you must create only one connection to a database on a MySQL data source to avoid duplicate rows on the Virtualize page. The BIT types are not supported in Watson Query. |
|
Amazon RDS for Oracle |
|
This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon RDS for PostgreSQL |
This connection is optimized to take advantage of the query capabilities in this data source. |
|
Amazon Redshift | SPATIAL, SKETCH, and SUPER data types are converted to CLOB in Watson Query. |
This connection is optimized to take advantage of the query capabilities in this data source. |
Amazon S3 |
To connect to this data source, see Connecting to Amazon S3 in Watson Query. Specific rules apply for naming buckets in Amazon S3. For more information, see Bucket naming rules in the Amazon S3 documentation. If you include dots in a bucket's name, you can't use virtual-host-style addressing over HTTPS, unless you perform your own certificate validation. The security certificates that are used for virtual hosting of buckets don't work for buckets with dots in their names. For more limitations, see Data sources in object storage in Watson Query. |
|
Apache Derby |
This connection is optimized to take advantage of the query capabilities in this data source. |
|
Apache Hive |
|
This connection is optimized to take advantage of the query capabilities in this data source. |
Ceph |
To connect to this data source, see Connecting to Ceph in Watson Query. For limitations, see Data sources in object storage in Watson Query. |
|
Cloudera Impala | ||
Generic S3 |
Note: This connector appears in the user interface; however, it is not supported.
|
|
Google BigQuery |
To connect to this data source, see Connecting to Google BigQuery in Watson Query.
|
This connection is optimized to take advantage of the query capabilities in this data source. |
Greenplum | ||
MariaDB |
The BIT, LONGTEXT, and LONGBLOB types are not supported in Watson Query. Since MariaDB does not support database isolation, you must create only one connection to a database on a MariaDB data source to avoid duplicate rows on the Virtualize page. You must have admin permission for the MariaDB data source for advanced operations such as collecting statistics. |
This connection is optimized to take advantage of the query capabilities in this data source. |
Microsoft Azure Data Lake Store |
Note: This connector appears in the user interface; however, it is not supported.
|
|
Microsoft Azure SQL Database | ||
Microsoft SQL Server |
|
This connection is optimized to take advantage of the query capabilities in this data source. |
MongoDB |
The BIT, LONGTEXT, and LONGBLOB types are not supported in Watson Query. You must have admin permission for the MongoDB data source for advanced operations such as collecting statistics. |
|
MySQL (My SQL Community Edition)
(My SQL Enterprise Edition) |
Since MySQL does not support database isolation, you must create only one connection to a database on a MySQL data source to avoid duplicate rows on the Virtualize page. The BIT types are not supported in Watson Query. |
|
Oracle |
|
This connection is optimized to take advantage of the query capabilities in this data source. |
PostgreSQL |
This connection is optimized to take advantage of the query capabilities in this data source. |
|
Salesforce.com | ||
SAP ASE | ||
SAP OData |
You cannot preview or query nonreadable tables due to the following reasons:
|
|
Snowflake |
To connect to this data source, see Connecting to Snowflake in Watson Query.
|
This connection is optimized to take advantage of the query capabilities in this data source. |
Teradata Teradata JDBC Driver 17.00 Copyright (C) 2015 - 2017 by Teradata. All rights reserved. IBM provides embedded usage of the Teradata JDBC Driver under license from Teradata solely for use as part of the IBM Watson® service offering. |
|
This connection is optimized to take advantage of the query capabilities in this data source. |
Data sources in object storage in Watson Query
You can use data that is stored as files on IBM Cloud Object Storage, Amazon S3, or Ceph data sources to create virtual tables. To access data that is stored in IBM Cloud Object Storage, you must create a connection to the data source where the files are located.
You can segment or combine data from one or more files to create a virtual table. Accessing files in object storage in Watson Query is built on Db2 Big SQL capabilities that use Hadoop external table support. For more information, see CREATE TABLE (HADOOP) statement.
Usage notes
- A bucket is a logical abstraction that is used to provide a container for data. There is no folder concept in object storage; only buckets and keys. Buckets can be created only in IBM Cloud Object Storage, Amazon S3, or Ceph interfaces. They cannot be created in Watson Query.
- A file path is the complete path to the file where you want to store data. The S3 file system implementation allows zero-length files to be treated like directories, and file names that contain a forward slash (/) are treated like nested directories. The file path includes the bucket name, an optional file path, and a file name. In object storage, the file path is used when a table is created. All files in the same path contribute to the table data. You can add more data by adding another file to the file path.
If your object storage data will be accessed through a virtualized table, the files that you want to virtualize must be within a single file path and within a single bucket, and the bucket must include at least one file that will be added to the cart. All the files in this file path will be part of the virtualized table. When more data is added to the table (new files are created in the file path), the data will be visible when you access the virtualized table. All files in the file path must use the same file format so that they are virtualized as one table.
If you want to virtualize the files in multiple file paths as one table, you can virtualize the bucket that contains all of the files. For example, if you have file paths A/B/C/T1a, A/B/C/T1b, A/B/D/T1c, and A/B/D/T1d, you can virtualize the file path A/B/. All the files in that path and nested paths will be part of the accessible object.
Limitations
- Watson Query supports PARQUET (or PARQUETFILE), ORC (optimized row columnar), CSV
(comma-separated values), TSV (tab-separated values), and JSON file formats. No other file formats
are supported.
- For PARQUET (or PARQUETFILE), file extensions are not required. Metadata is extracted from the data file.
- For ORC, file extensions are not required. Metadata is extracted from the data file.
- For CSV and TSV files, the .csv or .tsv file extension is required.
- For JSON files, the .json file extension is required. JSON files must be
coded so that each line is a valid JSON object. Lines must be separated by a newline character
(
\n
). The JSON Lines text format, also called newline-delimited JSON, is the only supported JSON format. This format stores structured data that can be processed one record at a time. - All other file formats return an error. For more information, see Error message when you try to use an unsupported file format in Cloud Object Storage.
- For text files in CSV, TSV, or JSON format, only UTF-8 character encoding is supported in Watson Query. Cloud Object Storage binary formats such as ORC or PARQUET are unaffected because they transparently encode character types.
- You cannot create two objects (tables, schemas, or columns) with the same name, even if you use delimited identifiers and mixed case. For example, you cannot have a table named t1 and another table named T1. These names are considered to be duplicate names in object storage (Hive). For more information, see Identifiers.
- Watson Query does not support the TIME data type in a virtualized table over object storage.
- Watson Query does not support
quotechar
in IBM Cloud Object Storage. Preview has quotation mark characters within column data. - When you use data types such as STRING or TEXT for columns in a virtualized table, the columns
are defined as VARCHAR(32672), which can impact performance and increase the memory requirements for
statements that reference the table. To improve performance, use VARCHAR(n) with
a specific length instead of STRING. Use the following syntax to alter the table to define an
appropriate length for the
column.
ALTER TABLE <schema>.<table> ALTER COLUMN <col> SET DATA TYPE VARCHAR(<size>)
- There is no support for compressed text files.
- See also the
Restrictions
section in CREATE TABLE (HADOOP) statement. - Before you remove a connection to Cloud Object Storage, you must remove all virtualized tables in the object storage connection. If a connection is removed and you try to remove a table in that connection, you see an error. See Credential error message when you remove a virtualized table in object storage.
Improving performance
To optimize query performance, you can collect statistics on the data that is being queried. Watson Query uses the ANALYZE command to collect statistics on virtualized tables over object storage.
- Data type considerations
- To improve query performance, don't use the STRING or TEXT data type. Use the VARCHAR(n) data type, with n set to a value that is appropriate for the column data.
- Partitioning considerations
- Partition your data to improve query performance. Partitioned data is grouped by a common
attribute. Watson Query can discover and use partitions to reduce the amount of data that
queries must process. Querying the entire data set might not be possible or even necessary. To
improve query performance, query just the partition that you need.
By default, partitioned character columns are assumed to be STRING columns, and it is best practice to accurately define the partitioned column types.