0 / 0
Collecting statistics in Data Virtualization
Last updated: Nov 26, 2024
Collecting statistics in Data Virtualization

To optimize query performance, you can collect statistics on the data that is being queried.

Decisions by the cost-based optimizer are crucial to query performance. The optimizer makes its decisions by using statistical information about the data that is being queried. Accurate and up-to-date statistics ensure optimal query performance. Collect statistics whenever the following conditions apply.

You can read more about optimizers and query performance in Optimization profiles and guidelines.

  • A new table is created and populated with data.
  • An existing table’s data undergoes significant changes, such as the following conditions:
    • New data is added.
    • Old data is removed.
    • Existing data is updated.

Data Virtualization collects statistics for the following items so that the optimizer has enough information to build efficient execution plans:

  • Any table that is referenced in a query.
  • Any table that is referenced in a view. When you create a virtual view, be sure to collect statistics for all the tables that are referenced in the view. Statistics are not collected on views.
  • All columns that are referenced in predicates (including join predicates) and aggregation functions that are referenced in queries.

You don't need to collect statistics for columns that appear only in the first SELECT list of a query.

Important: Collecting statistics is critical for good query performance, so it is important to have up-to-date statistics. Statistics collection can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs at the same time.
The following basic statistics are collected in Data Virtualization:
Table cardinality (CARD)
The number of rows in the table.
Column cardinality (COLCARD)
The number of distinct values in the column.
High key (HIGH2KEY)
Either the highest or second highest data value for the column, depending on the remote data source of the virtualized table and the collection type that is used.
Low key (LOW2KEY)
Either the lowest or second lowest data value for the column, depending on the remote data source of the virtualized table and the collection type that is used. For many data sources, Data Virtualization collects the lowest key value, which can be a blank value. A blank does not indicate that the statistics were not collected.
Number of null values (NUMNULLS)
The number of null values in the column. This number is the only statistic that is collected for LOB-type columns.

Statistics collection types

Data Virtualization supports two statistics collection types:
remote-catalog
This type of statistics collection is only supported for virtualized tables in remote data sources that support a local method of statistics collection. Statistics that are stored in the catalog tables at the remote data source are retrieved and then stored in the Data Virtualization statistics catalog. It is critical to ensure that accurate statistics are available in the remote data source. The remote-catalog type of statistics collection is not supported for grouped tables.
remote-query
This type of statistics collection uses SQL queries against the virtualized table to compute the statistics. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. To improve performance and conserve resources, you can collect statistics with data sampling by specifying the TABLESAMPLE option on the COLLECT_STATISTICS stored procedure in Data Virtualization or use the ANALYZE command for data sources in cloud object storage.

Best practices

For remote data sources that support local statistics collection tools, such as IBM® Db2® and Oracle, the best practice for statistics collection in Data Virtualization is to ensure that local statistics at the remote data source are accurate and up to date, especially basic column statistics for those key columns that are used in query predicates. It is a good idea to create a statistics collection job in the Data Virtualization web client for this purpose.