0 / 0
Limitations for collecting statistics in Data Virtualization
Last updated: Nov 26, 2024
Data source specific limitations for collecting statistics in Data Virtualization

Data sources differ in their support of local statistics collection and support different column statistics. They also support different methods for collecting statistics.

Data Virtualization supports two statistics collection types.
remote-catalog
This type of statistics collection is only supported for virtualized tables in remote data sources that support a local method of statistics collection. Statistics that are stored in the catalog tables at the remote data source are retrieved and then stored in the Data Virtualization statistics catalog. It is critical to ensure that accurate statistics are available in the remote data source. The remote-catalog type of statistics collection is not supported for grouped tables.

The following table identifies the data sources that support local statistics collection and shows which column statistics are available.

Data source Basic statistics available Supports
remote-catalog
Supports
remote-query
Amazon Redshift CARD, COLCARD, NUMNULLS
Amazon S3 See Collecting statistics for data in object storage. No
Ceph® See Collecting statistics for data in object storage. No
Cloudera Impala CARD, COLCARD, NUMNULLS. Recent versions of Impala (1.4 and later) do not collect NUMNULLS.
Databases for PostgreSQL CARD, COLCARD, NUMNULLS
Db2® CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS
Db2 Big SQL CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS
Db2 for z/OS CARD, COLCARD, HIGH2KEY, LOW2KEY
Db2 on Cloud CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS
Generic S3 See Collecting statistics for data in object storage. No
Greenplum CARD, COLCARD, NUMNULLS
Hive CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS
(Not recommended)
IBM® Cloud Object Storage See Collecting statistics for data in object storage. No
Informix CARD, COLCARD
MariaDB CARD, HIGH2KEY, LOW2KEY, NUMNULLS
Microsoft SQL Server CARD, COLCARD
MySQL CARD, COLCARD
Netezza Performance Server CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS
Oracle CARD, COLCARD, HIGH2KEY, LOW2KEY, NUMNULLS
SAP ASE CARD
SAP HANA CARD, COLCARD
Snowflake CARD
Teradata CARD, COLCARD, NUMNULLS

Amazon Redshift

To collect local table statistics, use the ANALYZE command, as described in the Amazon Redshift product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

Amazon Redshift does not support the HIGH2KEY or LOW2KEY statistic. Also, the ANALYZE command does not collect statistics for CHAR and NCHAR column data if the length of the column data exceeds 1020 bytes. If a column contains only NULL values, column statistics are not collected.

If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Cloudera Impala

To collect local table statistics, use the COMPUTE STATS statement, as described in the Cloudera Impala product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

Cloudera Impala does not support the HIGH2KEY or LOW2KEY statistic, and more recent versions of Impala do not support the NUMNULLS statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Databases for PostgreSQL

To collect local table statistics, use the ANALYZE command, as described in the PostgreSQL product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

PostgreSQL does not support the HIGH2KEY or LOW2KEY statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Db2 and Db2 on Cloud

To collect local table statistics for these data sources, use the RUNSTATS command. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With this type, the HIGH2KEY and LOW2KEY values (representing the second highest and second lowest values in the table) are copied from the Db2 catalog. With the remote-query type, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.

Db2 Big SQL

To collect local table statistics, use the ANALYZE command. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With both types, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.

Db2 for z/OS

To collect local table statistics, use the RUNSTATS command. Statistics are not collected for LOB types. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With this method, the HIGH2KEY and LOW2KEY values (representing the second highest and second lowest values in the table) are copied from the Db2 catalog. With the remote-query collection type, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.

Db2 for z/OS® does not support the NUMNULLS statistic. If your queries would benefit from this statistic, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Greenplum

To collect local table statistics, use the ANALYZE command, as described in the Greenplum product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

Greenplum does not support the HIGH2KEY or LOW2KEY statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Hive

To collect local table statistics, use the ANALYZE command, as described in the Hive product documentation. The remote-catalog statistics collection type in Hive is not recommended because of known issues with statistics collection in Hive. Use the remote-query statistics collection type instead. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Informix

To collect local table statistics, use the UPDATE STATISTICS statement, as described in the Informix® product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

Informix does not support the HIGH2KEY, LOW2KEY, or NUMNULLS statistic. It collects only CARD (table cardinality) and COLCARD (column cardinality). If your queries would benefit from these statistics, use the remote-query statistics collection type.. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

MariaDB

To collect local table statistics, use the ANALYZE command, as described in the MariaDB product documentation. Statistics for TEXT type or BLOB type columns are not collected. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. You must hold administrative privileges to access catalog statistics, which is required when you use the remote-catalog statistics collection type.

MariaDB does not support the COLCARD statistic. If your queries would benefit from this statistic, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Microsoft SQL Server

To collect local table statistics, use the UPDATE STATISTICS statement, as described in the Microsoft SQL Server product documentation. Statistics for XML type columns are not collected. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

Microsoft SQL Server collects only CARD (table cardinality) and COLCARD (column cardinality). It does not support the HIGH2KEY, LOW2KEY, or NUMNULLS statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

MySQL

To collect local table statistics, use the ANALYZE TABLE statement, as described in the MySQL product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

MySQL collects only CARD (table cardinality) and COLCARD (column cardinality). It does not support the HIGH2KEY, LOW2KEY, or NUMNULLS statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Netezza Performance Server

To collect local table statistics, use the GENERATE STATISTICS command, as described in the Netezza® product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With both types, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.

Oracle

To collect local table statistics, use the ANALYZE command, as described in the Oracle product documentation. Alternatively, you can collect statistics by using the DBMS_STATS.GATHER_TABLE_STATS procedure. You can also configure automatic statistics collection in the Oracle remote data source. Statistics are not collected for LOB types. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type. With both types, the HIGH2KEY and LOW2KEY values represent the highest and lowest values in the table.

SAP ASE

To collect local table statistics, use the UPDATE STATISTICS statement, as described in the SAP ASE product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

SAP ASE collects only CARD (table cardinality). It does not support the COLCARD, HIGH2KEY, LOW2KEY, or NUMNULLS statistic. Use the remote-query statistics collection type for key columns that are used in predicates. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

SAP HANA

To collect local table statistics, use the CREATE STATISTICS ON statement, as described in the SAP HANA product documentation. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

SAP HANA collects only CARD (table cardinality) and COLCARD (column cardinality). It does not support the HIGH2KEY, LOW2KEY, or NUMNULLS statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Snowflake

Snowflake does not support an explicit method for collecting statistics. However, the ROW_COUNT column in the INFORMATION_SCHEMA.TABLES catalog table stores the table cardinality (CARD) value as part of Data Virtualization statistics collection. Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

Snowflake collects only CARD (table cardinality). It does not support the COLCARD, HIGH2KEY, LOW2KEY, or NUMNULLS statistic. Use the remote-query statistics collection type for key columns that are used in predicates. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

Teradata

To collect local table statistics, use the COLLECT STATISTICS command, as described in the Teradata product documentation. Specify the THRESHOLD option to avoid redundant statistics collection. With this option, if the amount of data that changed since the last statistics collection is under a specified threshold, or the statistics are newer than a specified age, Teradata does not recollect statistics. Statistics are not collected for the PERIOD type (except for the BEGIN and END expressions), the XML, BLOB, or CLOB type, or any user-defined type (other than a geospatial type). Best practice is to ensure that statistics are collected in the remote data source and to use the remote-catalog statistics collection type.

Teradata does not support the HIGH2KEY or LOW2KEY statistic. If your queries would benefit from these statistics, use the remote-query statistics collection type. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.

IBM Cloud Object Storage, Amazon S3, Ceph, or Generic S3 data sources

To collect statistics on virtualized tables in IBM Cloud Object Storage, Amazon S3, Ceph, or Generic S3 data sources, use the ANALYZE command.

If the table has a large amount of data, specify the TABLESAMPLE option to collect statistics on a subset of the data so that the analyze operation can complete sooner.

See also Collecting statistics for data in object storage.