0 / 0
Collecting statistics in Watson Query

Collecting statistics in Watson Query

To optimize query performance, make sure that you are collecting statistics on the data that is being queried.

The decisions that are made by the cost-based optimizer are crucial to determine the performance of queries. The optimizer makes its decisions by using statistical information that it has about the data that is being queried. Accurate and up-to-date statistics on the data are required to ensure optimal query performance. Collect statistics whenever the following conditions apply.

  • A new table is created and populated with data.
  • An existing table’s data undergoes significant changes such as the following conditions:
    • New data is added.
    • Old data is removed.
    • Existing data is updated.

For the optimizer to have enough information to build efficient execution plans, statistics must be collected for:

  • Any table that is referenced in a query.
  • All columns that are referenced in predicates (including join predicates) and aggregation functions that are referenced in queries.

You don't need to collect statistics for columns that appear only in the first SELECT list of a query.

Note: Collecting statistics is critical for good query performance, so it is important to have up-to-date statistics. Statistics collection can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. Consider collecting statistics when the system is less busy. Avoid scheduling many statistics collection jobs in the same timeframe.

For most virtualized tables, you can use SYSPROC.NNSTAT to collect statistics. However, for virtualized tables that are created in object storage, use the ANALYZE command instead. To determine which virtualized tables are created in object storage, you can filter virtualized objects by type. Select object storage, Table, or View in the Filter menu in Data > Data virtualization > Virtualized data.

Virtualized tables (non-object storage)

When a table is virtualized, it is important to collect statistics on the virtualized table to improve the performance of queries that access the data. Also, if the remote table undergoes significant changes, the virtualized table statistics should be updated. The SYSPROC.NNSTAT procedure should be used to collect statistics on the virtualized table by using the data-based collection method. For more information, see the SYSPROC.NNSTAT stored procedure.

For example, see Collecting missing statistics for data in non object storage.

Virtualize tables (object storage)

When a file in object storage is virtualized, it is important to collect statistics on the virtualized object to improve the performance of queries that access the data. To collect statistics, you can use the ANALYZE command.

For example, see Collecting missing statistics for data in object storage.

Virtualized files

When a file is virtualized, it is important to collect statistics on the virtualized object for improving the performance of queries that access that data. The SYSPROC.NNSTAT procedure should be called to collect the key statistics for this new table by using the data-based method. If the file undergoes a significant change, the SYSPROC.NNSTAT procedure should be called again to collect the new statistics.

Views

When you create a virtual view, it is important to make sure that statistics are collected for all the tables that are referenced in the view, as described in the previous sections. Statistics are not collected on views.

Checking for missing statistics

To determine whether table level statistics are missing for a table, you can use the following query:

select cast(TABNAME as varchar(25)),  TYPE, COLCOUNT,  STATS_TIME from SYSCAT.TABLES 
           where TABSCHEMA='<schema>'   
                   and CARD = -1;

Replace <schema>with the schema name of the virtual table. The command lists all the objects in the schema that have a CARD of -1, which is an indication that table level statistics have not been collected.

A similar query can be used to check for missing column statistics:

select cast (COLNAME as varchar(25)) from SYSSTAT.COLUMNS 
          where TABSCHEMA = '<schema>' 
                  and TABNAME = '<table_name>' 
                  and COLCARD = -1; 

Replace <schema>with the schema name and <table_name> with the name of the virtual table. The command lists all the objects in the schema that have a COLCARD of -1, which is an indication that column level statistics have not been collected.

Note: Not all columns in a table require column level statistics.

Collecting missing statistics for data in non object storage

The following example demonstrates a call to collect statistics with the data-based SYSPROC.NNSTAT procedure. This procedure is used to collect statistics for all virtualized objects except objects that are defined in object storage.

Important: Do not use the catalog-based statistics collection method of SYSPROC.NNSTAT on a virtualized table. The catalog-based method deletes any previously collected column level statistics, which has an adverse impact on query performance.
call SYSPROC.NNSTAT(NULL, 
          '<schema>',
          '<table_name>',
          '<list_of_columns_commonly_used_in_predicates>',
          '',
          2,
          '/tmp/<table_name>.log',
          ?,
          1);

This call collects table level statistics and column level statistics for the specified columns in <list_of_columns_commonly_used_in_predicates>. If the table has few columns, you can collect statistics for all columns in the table by passing NULL for the column list argument. If you don't want to collect column statistics, specify the empty string, '', as the fourth parameter.

For more information, see the SYSPROC.NNSTAT stored procedure.

Collecting missing statistics for data in object storage

The following example demonstrates a call to collect statistics for a virtualized table in object storage.

ANALYZE TABLE <schema>.<table_name> 
COMPUTE STATISTICS FOR COLUMNS 
<list_of_columns_commonly_used_in_predicates>;

If the table has few columns, you can collect statistics for all columns in the table by using the FOR ALL COLUMNS clause.

For example, to collect statistics for all columns on the table schema.tablename, run the following command.
ANALYZE TABLE schema.tablename
COMPUTE STATISTICS FOR ALL COLUMNS;

If the table has a large amount of data, you can collect statistics on a subset of the data so that the analyze operation can complete sooner.

For example, to collect statistics by using 10% sampling of the table data, run the following command.
ANALYZE TABLE schema.tablename
COMPUTE STATISTICS FOR ALL COLUMNS
TABLESAMPLE SYSTEM (10);

For more information, see the ANALYZE command reference.

Generative AI search and answer
These answers are generated by a large language model in watsonx.ai based on content from the product documentation. Learn more