0 / 0
COLLECT_STATISTICS stored procedure in Data Virtualization
Last updated: Nov 26, 2024
COLLECT_STATISTICS stored procedure in Data Virtualization

Gathers statistics for virtualized tables in Data Virtualization. The Data Virtualization optimizer uses these statistics to determine the optimal access plans for processing your queries efficiently. The schema is DVSYS.

With more table statistics, the optimizer can make better decisions to provide the best possible access plans. When you run COLLECT_STATISTICS on a table, subsequent queries against that table usually run significantly faster. For more information, see Collecting statistics in Data Virtualization.

Important: Use the ANALYZE command (instead of the COLLECT_STATISTICS procedure) to collect statistics on virtualized tables over object storage. For more information, see ANALYZE command. To determine which virtualized tables are created over object storage, you can filter virtualized objects by type. Select Object Storage, Table, or View in the Filter menu in Data > Data virtualization > Virtualized data.

Authorization

To run the COLLECT_STATISTICS procedure, you must be a Data Virtualization Manager or Engineer. For more information, see Managing roles for users in Data Virtualization.

To collect statistics, you must have the appropriate authorization at the remote data source and in Data Virtualization. .

Input parameters

virtschema
The type of this required parameter is VARCHAR(128). Specifies the schema name of the virtualized table.
virtname
The type of this required parameter is VARCHAR(128). Specifies the name of the virtualized table.
virtcolumns
The type of this optional parameter is VARCHAR(32672). Specifies a comma-separated list of names of the columns that statistics are to be collected for. A null value specifies that statistics are to be collected for all columns. An empty string specifies that no column statistics are to be collected. In this case, only table cardinality is collected. If a column name includes any special characters, the name must be enclosed by double quotation marks.
collection_type
The type of this required parameter is SMALLINT. Specifies the statistics collection type. Valid values are 1 (remote-catalog type) and 2 (remote-query type):
remote-catalog
This type of statistics collection is only supported for virtualized tables in remote data sources that support a local method of statistics collection. Statistics that are stored in the catalog tables at the remote data source are retrieved and then stored in the Data Virtualization statistics catalog. It is critical to ensure that accurate statistics are available in the remote data source. The remote-catalog type of statistics collection is not supported for grouped tables.
remote-query
This type of statistics collection uses SQL queries against the virtualized table to compute the statistics. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. To improve performance and conserve resources, you can collect statistics with data sampling by specifying the TABLESAMPLE option on the COLLECT_STATISTICS stored procedure in Data Virtualization or use the ANALYZE command for data sources in cloud object storage.
options
The type of this optional parameter is VARCHAR(32672). Specifies a comma-delimited list of extra parameters.
TABLESAMPLE
Specifies a double value between 0 and 99 inclusive. The value represents the percentage of the table to sample when computing statistics. The default value of 0 specifies that no sampling is to be done. This option is valid only with the remote-query statistics collection type.
SAMPLING_THRESHOLD
Specifies an integer value greater than 0. The value represents the minimum number of rows that a data source must contain before sampling (if specified) can be used. The default value is 1000. If the number of rows is less than this threshold, sampling is not used when column statistics are computed. This option is valid only with the remote-query statistics collection type.

Output parameters

diagnostics
The type of this parameter is VARCHAR(32672). Represents diagnostic output if a failure occurs and a summary of the collected statistics with abbreviated results.

Usage notes

  • Run the COLLECT_STATISTICS procedure either directly or through the Data Virtualization web client whenever you make significant changes to the data in the remote data source.
  • If a column name includes any special characters, the name must be enclosed by double quotation marks.
  • If the remote data source supports tools for gathering local statistics, ensure that local statistics are gathered and that the remote-catalog type is used for collecting statistics on the virtualized table.
  • If the remote data source does not support tools for gathering local statistics, the remote-query type is the only option available. If the virtualized table has many rows, use the TABLESAMPLE option. A sampling rate of 20% is recommended in most cases.
  • If you do not specify the TABLESAMPLE option, the SAMPLING_THRESHOLD option has no effect.
  • If you specify the TABLESAMPLE option and the number of rows in the virtualized table is less than the SAMPLING_THRESHOLD (default 1000), sampling does not occur because the resulting statistics would be suboptimal.
  • When you specify the TABLESAMPLE option, statistics collection might be suboptimal if the sampling rate is too low. However, increasing the sampling rate will consume more resources and reduce the performance of statistics collection.
  • When you specify the TABLESAMPLE option, statistics collection might be suboptimal if the virtualized table has a limited number of rows. In this case, increase the TABLESAMPLE value or don't sample the data.
  • If the virtualized table references a view at the remote data source, the remote-query type is the only option available and the TABLESAMPLE option is not supported.
  • If you specify the TABLESAMPLE option with a sampling rate close to 100, but statistics collection is still suboptimal, consider changing the SAMPLING_THRESHOLD value.
  • Only the NUMNULLS statistic is collected for LOB-type columns.

Examples

The following examples use virtualized tables over tables that are part of the SAMPLE database in IBM Db2®. For more information, see The SAMPLE database.

  • Use the remote-catalog type to collect statistics for all the columns in the DEPARTMENT table.
    call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', null, 1, null, ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : Collected statistics for table "SAMPLE  "."DEPARTMENT":
    Table Cardinality = 14
    Column "LOCATION" [CHAR(16)]:  colcard=1, numnulls=14, highkey="", lowkey=""
    Column "ADMRDEPT" [CHAR(3)]:  colcard=3, numnulls=0, highkey="E01", lowkey="A00"
    Column "DEPTNAME" [VARCHAR(36)]:  colcard=14, numnulls=0, highkey="SPIFFY COMPUTER SERVICE DIV.", lowkey="BRANCH OFFICE F2"
    Column "MGRNO" [CHAR(6)]:  colcard=9, numnulls=6, highkey="000100", lowkey="000020"
    Column "DEPTNO" [CHAR(3)]:  colcard=14, numnulls=0, highkey="I22", lowkey="B01"
    
    Return Status = 0
  • Use the remote-query type to collect statistics for all the columns in the DEPARTMENT table.
    call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', null, 2, null, ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : Collected statistics for table "SAMPLE  "."DEPARTMENT":
    Table Cardinality = 14
    Column "LOCATION" [CHAR(16)]:  colcard=1, numnulls=14, highkey="", lowkey=""
    Column "ADMRDEPT" [CHAR(3)]:  colcard=3, numnulls=0, highkey="E01", lowkey="A00"
    Column "DEPTNAME" [VARCHAR(36)]:  colcard=14, numnulls=0, highkey="SUPPORT SERVICES", lowkey="ADMINISTRATION SYSTEMS"
    Column "MGRNO" [CHAR(6)]:  colcard=9, numnulls=6, highkey="000100", lowkey="000010"
    Column "DEPTNO" [CHAR(3)]:  colcard=14, numnulls=0, highkey="J22", lowkey="A00"
    
    Return Status = 0
  • Use the remote-query type to collect statistics for some of the columns in the DEPARTMENT table.
    call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', 'DEPTNO,DEPTNAME,LOCATION', 2, null, ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : Collected statistics for table "SAMPLE  "."DEPARTMENT":
    Table Cardinality = 14
    Column "LOCATION" [CHAR(16)]:  colcard=1, numnulls=14, highkey="", lowkey=""
    Column "DEPTNAME" [VARCHAR(36)]:  colcard=14, numnulls=0, highkey="SUPPORT SERVICES", lowkey="ADMINISTRATION SYSTEMS"
    Column "DEPTNO" [CHAR(3)]:  colcard=14, numnulls=0, highkey="J22", lowkey="A00"
    
    Return Status = 0
  • Use the remote-query type to collect only DEPARTMENT table statistics.
    call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', '', 2, null, ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : Collected statistics for table "SAMPLE  "."DEPARTMENT":
    Table Cardinality = 14
    
    Return Status = 0
  • Use the remote-query type to try to collect statistics for a column that is not defined on the DEPARTMENT table.
    call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', 'DEPTNO,FIRSTNME', 2, null, ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : ERROR: VALIDATE COLUMN FILTER LIST -- Invalid column "FIRSTNME" in virtColumns
    
    Return Status = 0
  • Use the remote-catalog type to collect statistics when the remote catalog has no statistics for the local table (DEPARTMENT).
    call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', null, 1, null, ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : WARNING: No statistics found in remote catalog for table "SAMPLE  "."DEPARTMENT"
    
    Return Status = 0
  • Use the remote-catalog type to collect statistics when columns with special characters are specified.
    call dvsys.collect_statistics('SAMPLE', 'SpecialChars', '"Col,1","Col""2"', 2, null, ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : Collected statistics for table "SAMPLE  "."SpecialChars":
    Table Cardinality = 4
    Column "Col,1" [INTEGER]:  colcard=3, numnulls=0, highkey="3", lowkey="1"
    Column "Col"2" [INTEGER]:  colcard=3, numnulls=1, highkey="2", lowkey="1"
    
    Return Status = 0
  • Use the remote-catalog type and a sampling rate of 20% to collect statistics for all columns in the SALES table. Extra rows have been added to the SALES table to facilitate sampling.
    call dvsys.collect_statistics('SAMPLE', 'SALES', null, 2, 'TABLESAMPLE=20', ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : Collected statistics for table "SAMPLE  "."SALES":
    Table Cardinality = 1304
    Column "SALES_DATE" [DATE]:  colcard=247, numnulls=0, highkey="2007-03-31", lowkey="2005-12-31"
    Column "SALES_PERSON" [VARCHAR(15)]:  colcard=3, numnulls=0, highkey="LUCCHESSI", lowkey="GOUNOT"
    Column "SALES" [INTEGER]:  colcard=19, numnulls=0, highkey="19", lowkey="1"
    Column "REGION" [VARCHAR(15)]:  colcard=4, numnulls=0, highkey="Quebec", lowkey="Manitoba"
    
    Return Status = 0
  • Use the remote-catalog type, a sampling rate of 20%, and a SAMPLING_THRESHOLD of 40 rows to collect statistics for all columns in the EMP table, which has 42 rows by default. If SAMPLING_THRESHOLD is not specified, the TABLESAMPLE option is ignored because the default sampling threshold is 1000 rows.
    call dvsys.collect_statistics('SAMPLE', 'EMP', null, 2, 'TABLESAMPLE=20,SAMPLING_THRESHOLD=40', ?)
    Value of output parameters
    --------------------------
    Parameter Name  : DIAGS
    Parameter Value : Collected statistics for table "SAMPLE  "."EMP":
    Table Cardinality = 42
    Column "EDLEVEL" [SMALLINT]:  colcard=5, numnulls=0, highkey="19", lowkey="12"
    Column "PHONENO" [CHAR(4)]:  colcard=10, numnulls=0, highkey="8953", lowkey="1793"
    Column "SEX" [CHAR(1)]:  colcard=2, numnulls=0, highkey="M", lowkey="F"
    Column "FIRSTNME" [VARCHAR(12)]:  colcard=11, numnulls=0, highkey="VINCENZO", lowkey="DIAN"
    Column "MIDINIT" [CHAR(1)]:  colcard=10, numnulls=0, highkey="V", lowkey=" "
    Column "BIRTHDATE" [DATE]:  colcard=11, numnulls=0, highkey="2003-05-26", lowkey="1955-09-15"
    Column "COMM" [DECIMAL(9,2)]:  colcard=11, numnulls=0, highkey="4220.00", lowkey="1272.00"
    Column "SALARY" [DECIMAL(9,2)]:  colcard=11, numnulls=0, highkey="96170.00", lowkey="31840.00"
    Column "LASTNAME" [VARCHAR(15)]:  colcard=9, numnulls=0, highkey="YAMAMOTO", lowkey="ADAMSON"
    Column "WORKDEPT" [CHAR(3)]:  colcard=6, numnulls=0, highkey="E21", lowkey="A00"
    Column "HIREDATE" [DATE]:  colcard=9, numnulls=0, highkey="2006-02-23", lowkey="1979-08-17"
    Column "BONUS" [DECIMAL(9,2)]:  colcard=4, numnulls=0, highkey="800.00", lowkey="300.00"
    Column "EMPNO" [CHAR(6)]:  colcard=9, numnulls=0, highkey="200340", lowkey="000050"
    Column "JOB" [CHAR(8)]:  colcard=5, numnulls=0, highkey="OPERATOR", lowkey="CLERK   "
    
    Return Status = 0