Gathers statistics for virtualized tables in Data Virtualization. The Data Virtualization optimizer uses these statistics to determine the optimal access plans for processing your queries
efficiently. The schema is DVSYS
.
With more table statistics, the optimizer can make better decisions to provide the best possible
access plans. When you run COLLECT_STATISTICS
on a table, subsequent queries
against that table usually run significantly faster. For more information, see Collecting statistics in Data Virtualization.
COLLECT_STATISTICS
procedure) to collect statistics on virtualized tables over
object storage. For more information, see ANALYZE command. To determine which virtualized tables are created over object storage, you
can filter virtualized objects by type. Select Object Storage,
Table, or View in the Filter
menu in .Authorization
To run the COLLECT_STATISTICS
procedure, you must be a Data Virtualization
Manager or Engineer. For more information, see Managing roles for users in Data Virtualization.
To collect statistics, you must have the appropriate authorization at the remote data source and in Data Virtualization. .
Input parameters
- virtschema
- The type of this required parameter is VARCHAR(128). Specifies the schema name of the virtualized table.
- virtname
- The type of this required parameter is VARCHAR(128). Specifies the name of the virtualized table.
- virtcolumns
- The type of this optional parameter is VARCHAR(32672). Specifies a comma-separated list of names of the columns that statistics are to be collected for. A null value specifies that statistics are to be collected for all columns. An empty string specifies that no column statistics are to be collected. In this case, only table cardinality is collected. If a column name includes any special characters, the name must be enclosed by double quotation marks.
- collection_type
- The type of this required parameter is SMALLINT. Specifies the statistics collection type. Valid
values are 1 (remote-catalog type) and 2
(remote-query type):
- remote-catalog
- This type of statistics collection is only supported for virtualized tables in remote data sources that support a local method of statistics collection. Statistics that are stored in the catalog tables at the remote data source are retrieved and then stored in the Data Virtualization statistics catalog. It is critical to ensure that accurate statistics are available in the remote data source. The remote-catalog type of statistics collection is not supported for grouped tables.
- remote-query
- This type of statistics collection uses SQL queries against the virtualized table to compute the statistics. This type can be resource-intensive and take a long time to complete if the virtualized table has many rows or statistics are gathered for many columns. To improve performance and conserve resources, you can collect statistics with data sampling by specifying the TABLESAMPLE option on the COLLECT_STATISTICS stored procedure in Data Virtualization or use the ANALYZE command for data sources in cloud object storage.
- options
- The type of this optional parameter is VARCHAR(32672). Specifies a comma-delimited list of extra parameters.
- TABLESAMPLE
- Specifies a double value between 0 and 99 inclusive. The value represents the percentage of the table to sample when computing statistics. The default value of 0 specifies that no sampling is to be done. This option is valid only with the remote-query statistics collection type.
- SAMPLING_THRESHOLD
- Specifies an integer value greater than 0. The value represents the minimum number of rows that a data source must contain before sampling (if specified) can be used. The default value is 1000. If the number of rows is less than this threshold, sampling is not used when column statistics are computed. This option is valid only with the remote-query statistics collection type.
Output parameters
- diagnostics
- The type of this parameter is VARCHAR(32672). Represents diagnostic output if a failure occurs and a summary of the collected statistics with abbreviated results.
Usage notes
- Run the COLLECT_STATISTICS procedure either directly or through the Data Virtualization web client whenever you make significant changes to the data in the remote data source.
- If a column name includes any special characters, the name must be enclosed by double quotation marks.
- If the remote data source supports tools for gathering local statistics, ensure that local statistics are gathered and that the remote-catalog type is used for collecting statistics on the virtualized table.
- If the remote data source does not support tools for gathering local statistics, the remote-query type is the only option available. If the virtualized table has many rows, use the TABLESAMPLE option. A sampling rate of 20% is recommended in most cases.
- If you do not specify the TABLESAMPLE option, the SAMPLING_THRESHOLD option has no effect.
- If you specify the TABLESAMPLE option and the number of rows in the virtualized table is less than the SAMPLING_THRESHOLD (default 1000), sampling does not occur because the resulting statistics would be suboptimal.
- When you specify the TABLESAMPLE option, statistics collection might be suboptimal if the sampling rate is too low. However, increasing the sampling rate will consume more resources and reduce the performance of statistics collection.
- When you specify the TABLESAMPLE option, statistics collection might be suboptimal if the virtualized table has a limited number of rows. In this case, increase the TABLESAMPLE value or don't sample the data.
- If the virtualized table references a view at the remote data source, the remote-query type is the only option available and the TABLESAMPLE option is not supported.
- If you specify the TABLESAMPLE option with a sampling rate close to 100, but statistics collection is still suboptimal, consider changing the SAMPLING_THRESHOLD value.
- Only the NUMNULLS statistic is collected for LOB-type columns.
Examples
The following examples use virtualized tables over tables that are part of the SAMPLE database in IBM Db2®. For more information, see The SAMPLE database.
- Use the remote-catalog type to collect statistics for all the columns in
the DEPARTMENT
table.
call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', null, 1, null, ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : Collected statistics for table "SAMPLE "."DEPARTMENT": Table Cardinality = 14 Column "LOCATION" [CHAR(16)]: colcard=1, numnulls=14, highkey="", lowkey="" Column "ADMRDEPT" [CHAR(3)]: colcard=3, numnulls=0, highkey="E01", lowkey="A00" Column "DEPTNAME" [VARCHAR(36)]: colcard=14, numnulls=0, highkey="SPIFFY COMPUTER SERVICE DIV.", lowkey="BRANCH OFFICE F2" Column "MGRNO" [CHAR(6)]: colcard=9, numnulls=6, highkey="000100", lowkey="000020" Column "DEPTNO" [CHAR(3)]: colcard=14, numnulls=0, highkey="I22", lowkey="B01" Return Status = 0
- Use the remote-query type to collect statistics for all the columns in the
DEPARTMENT
table.
call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', null, 2, null, ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : Collected statistics for table "SAMPLE "."DEPARTMENT": Table Cardinality = 14 Column "LOCATION" [CHAR(16)]: colcard=1, numnulls=14, highkey="", lowkey="" Column "ADMRDEPT" [CHAR(3)]: colcard=3, numnulls=0, highkey="E01", lowkey="A00" Column "DEPTNAME" [VARCHAR(36)]: colcard=14, numnulls=0, highkey="SUPPORT SERVICES", lowkey="ADMINISTRATION SYSTEMS" Column "MGRNO" [CHAR(6)]: colcard=9, numnulls=6, highkey="000100", lowkey="000010" Column "DEPTNO" [CHAR(3)]: colcard=14, numnulls=0, highkey="J22", lowkey="A00" Return Status = 0
- Use the remote-query type to collect statistics for some of the columns in
the DEPARTMENT
table.
call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', 'DEPTNO,DEPTNAME,LOCATION', 2, null, ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : Collected statistics for table "SAMPLE "."DEPARTMENT": Table Cardinality = 14 Column "LOCATION" [CHAR(16)]: colcard=1, numnulls=14, highkey="", lowkey="" Column "DEPTNAME" [VARCHAR(36)]: colcard=14, numnulls=0, highkey="SUPPORT SERVICES", lowkey="ADMINISTRATION SYSTEMS" Column "DEPTNO" [CHAR(3)]: colcard=14, numnulls=0, highkey="J22", lowkey="A00" Return Status = 0
- Use the remote-query type to collect only DEPARTMENT table
statistics.
call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', '', 2, null, ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : Collected statistics for table "SAMPLE "."DEPARTMENT": Table Cardinality = 14 Return Status = 0
- Use the remote-query type to try to collect statistics for a column that is
not defined on the DEPARTMENT
table.
call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', 'DEPTNO,FIRSTNME', 2, null, ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : ERROR: VALIDATE COLUMN FILTER LIST -- Invalid column "FIRSTNME" in virtColumns Return Status = 0
- Use the remote-catalog type to collect statistics when the remote catalog
has no statistics for the local table
(DEPARTMENT).
call dvsys.collect_statistics('SAMPLE', 'DEPARTMENT', null, 1, null, ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : WARNING: No statistics found in remote catalog for table "SAMPLE "."DEPARTMENT" Return Status = 0
- Use the remote-catalog type to collect statistics when columns with special
characters are
specified.
call dvsys.collect_statistics('SAMPLE', 'SpecialChars', '"Col,1","Col""2"', 2, null, ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : Collected statistics for table "SAMPLE "."SpecialChars": Table Cardinality = 4 Column "Col,1" [INTEGER]: colcard=3, numnulls=0, highkey="3", lowkey="1" Column "Col"2" [INTEGER]: colcard=3, numnulls=1, highkey="2", lowkey="1" Return Status = 0
- Use the remote-catalog type and a sampling rate of 20% to collect
statistics for all columns in the SALES table. Extra rows have been added to the SALES table to
facilitate
sampling.
call dvsys.collect_statistics('SAMPLE', 'SALES', null, 2, 'TABLESAMPLE=20', ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : Collected statistics for table "SAMPLE "."SALES": Table Cardinality = 1304 Column "SALES_DATE" [DATE]: colcard=247, numnulls=0, highkey="2007-03-31", lowkey="2005-12-31" Column "SALES_PERSON" [VARCHAR(15)]: colcard=3, numnulls=0, highkey="LUCCHESSI", lowkey="GOUNOT" Column "SALES" [INTEGER]: colcard=19, numnulls=0, highkey="19", lowkey="1" Column "REGION" [VARCHAR(15)]: colcard=4, numnulls=0, highkey="Quebec", lowkey="Manitoba" Return Status = 0
- Use the remote-catalog type, a sampling rate of 20%, and a
SAMPLING_THRESHOLD of 40 rows to collect statistics for all columns in the EMP table, which has 42
rows by default. If SAMPLING_THRESHOLD is not specified, the TABLESAMPLE option is ignored because
the default sampling threshold is 1000
rows.
call dvsys.collect_statistics('SAMPLE', 'EMP', null, 2, 'TABLESAMPLE=20,SAMPLING_THRESHOLD=40', ?)
Value of output parameters -------------------------- Parameter Name : DIAGS Parameter Value : Collected statistics for table "SAMPLE "."EMP": Table Cardinality = 42 Column "EDLEVEL" [SMALLINT]: colcard=5, numnulls=0, highkey="19", lowkey="12" Column "PHONENO" [CHAR(4)]: colcard=10, numnulls=0, highkey="8953", lowkey="1793" Column "SEX" [CHAR(1)]: colcard=2, numnulls=0, highkey="M", lowkey="F" Column "FIRSTNME" [VARCHAR(12)]: colcard=11, numnulls=0, highkey="VINCENZO", lowkey="DIAN" Column "MIDINIT" [CHAR(1)]: colcard=10, numnulls=0, highkey="V", lowkey=" " Column "BIRTHDATE" [DATE]: colcard=11, numnulls=0, highkey="2003-05-26", lowkey="1955-09-15" Column "COMM" [DECIMAL(9,2)]: colcard=11, numnulls=0, highkey="4220.00", lowkey="1272.00" Column "SALARY" [DECIMAL(9,2)]: colcard=11, numnulls=0, highkey="96170.00", lowkey="31840.00" Column "LASTNAME" [VARCHAR(15)]: colcard=9, numnulls=0, highkey="YAMAMOTO", lowkey="ADAMSON" Column "WORKDEPT" [CHAR(3)]: colcard=6, numnulls=0, highkey="E21", lowkey="A00" Column "HIREDATE" [DATE]: colcard=9, numnulls=0, highkey="2006-02-23", lowkey="1979-08-17" Column "BONUS" [DECIMAL(9,2)]: colcard=4, numnulls=0, highkey="800.00", lowkey="300.00" Column "EMPNO" [CHAR(6)]: colcard=9, numnulls=0, highkey="200340", lowkey="000050" Column "JOB" [CHAR(8)]: colcard=5, numnulls=0, highkey="OPERATOR", lowkey="CLERK " Return Status = 0