Collecting statistics in Watson Query
To optimize query performance, make sure that you are collecting statistics on the data that is being queried.
The decisions that are made by the cost-based optimizer are crucial to determine the performance of queries. The optimizer makes its decisions by using statistical information that it has about the data that is being queried. Accurate and up-to-date statistics on the data are required to ensure optimal query performance. Collect statistics whenever the following conditions apply.
- A new table is created and populated with data.
- An existing table’s data undergoes significant changes such as the following conditions:
- New data is added.
- Old data is removed.
- Existing data is updated.
For the optimizer to have enough information to build efficient execution plans, statistics must be collected for:
- Any table that is referenced in a query.
- All columns that are referenced in predicates (including join predicates) and aggregation functions that are referenced in queries.
You don't need to collect statistics for columns that appear only in the first SELECT list of a query.
For most virtualized tables, you can use SYSPROC.NNSTAT to collect statistics. However, for virtualized tables that are created in object storage, use the ANALYZE command instead. To determine which virtualized tables are created in object storage, you can filter virtualized objects by type. Select object storage, Table, or View in the Filter menu in .
- Virtualized tables (non-object storage)
-
When a table is virtualized, it is important to collect statistics on the virtualized table to improve the performance of queries that access the data. Also, if the remote table undergoes significant changes, the virtualized table statistics should be updated. The
SYSPROC.NNSTAT
procedure should be used to collect statistics on the virtualized table by using the data-based collection method. For more information, see the SYSPROC.NNSTAT stored procedure. - Virtualize tables (object storage)
-
When a file in object storage is virtualized, it is important to collect statistics on the virtualized object to improve the performance of queries that access the data. To collect statistics, you can use the ANALYZE command.
- Virtualized files
-
When a file is virtualized, it is important to collect statistics on the virtualized object for improving the performance of queries that access that data. The
SYSPROC.NNSTAT
procedure should be called to collect the key statistics for this new table by using the data-based method. If the file undergoes a significant change, theSYSPROC.NNSTAT
procedure should be called again to collect the new statistics. - Views
-
When you create a virtual view, it is important to make sure that statistics are collected for all the tables that are referenced in the view, as described in the previous sections. Statistics are not collected on views.
Checking for missing statistics
To determine whether table level statistics are missing for a table, you can use the following query:
select cast(TABNAME as varchar(25)), TYPE, COLCOUNT, STATS_TIME from SYSCAT.TABLES
where TABSCHEMA='<schema>'
and CARD = -1;
Replace <schema>with the schema name of the virtual table. The command
lists all the objects in the schema that have a CARD
of -1, which is an indication
that table level statistics have not been collected.
A similar query can be used to check for missing column statistics:
select cast (COLNAME as varchar(25)) from SYSSTAT.COLUMNS
where TABSCHEMA = '<schema>'
and TABNAME = '<table_name>'
and COLCARD = -1;
Replace <schema>with the schema name and <table_name>
with the name of the virtual table. The command lists all the objects in the schema that have a
COLCARD
of -1, which is an indication that column level statistics have not been
collected.
Collecting missing statistics for data in non object storage
The following example demonstrates a call to collect statistics with the data-based
SYSPROC.NNSTAT
procedure. This procedure is used to collect statistics for all
virtualized objects except objects that are defined in object
storage.
SYSPROC.NNSTAT
on a virtualized table. The catalog-based method deletes any
previously collected column level statistics, which has an adverse impact on query
performance.call SYSPROC.NNSTAT(NULL,
'<schema>',
'<table_name>',
'<list_of_columns_commonly_used_in_predicates>',
'',
2,
'/tmp/<table_name>.log',
?,
1);
This call collects table level statistics and column level statistics for the specified columns
in <list_of_columns_commonly_used_in_predicates>. If the table has few
columns, you can collect statistics for all columns in the table by passing NULL
for the column list argument. If you don't want to collect column statistics, specify the empty
string, ''
, as the fourth parameter.
For more information, see the SYSPROC.NNSTAT stored procedure.
Collecting missing statistics for data in object storage
The following example demonstrates a call to collect statistics for a virtualized table in object storage.
ANALYZE TABLE <schema>.<table_name>
COMPUTE STATISTICS FOR COLUMNS
<list_of_columns_commonly_used_in_predicates>;
If the table has few columns, you can collect statistics for all columns in the table by using
the FOR ALL COLUMNS
clause.
schema.tablename
, run the following
command.ANALYZE TABLE schema.tablename
COMPUTE STATISTICS FOR ALL COLUMNS;
If the table has a large amount of data, you can collect statistics on a subset of the data so that the analyze operation can complete sooner.
ANALYZE TABLE schema.tablename
COMPUTE STATISTICS FOR ALL COLUMNS
TABLESAMPLE SYSTEM (10);
For more information, see the ANALYZE command reference.