When is the right time to refresh statistics? The appropriate time is just before a query or load process breaks, that is, just before the optimizer changes the access plan from a good plan to a bad one. The more relevant question, however, is: how can I determine the right time to refresh the statistics of any table or join index? To answer this question, there are a few things that are important to know and understand first:

1. The system workload, especially the workload of the critical processes and reports, which are usually the ones associated with an SLA. DBQL and ResUsage data provide the information needed to understand the system workload as well as the critical processes and reports.

2. The statistics collection recommendations to be applied to any table or join index in your environment. The summary below comes from the article "Statistics Collection Recommendations for Teradata 12" by Carrie Ballinger.

a. Collect Full Statistics
- Non-indexed columns used in constraints and joins
- All NUSIs with uneven distribution of values
- NUSIs used in join steps
- USIs/UPIs if used in range constraints
- Value-ordered NUSIs
- NUPIs
- Relevant columns and indexes on small tables

b. Collect Sample Statistics (very large tables)
- Unique index columns
- Nearly-unique columns or indexes

c. Collect Multi-Column Statistics
- Groups of columns used in constraints with equality predicates
- Groups of columns used in joins or aggregations

d. Collect PPI Statistics
- PARTITION
- Partition column

e. Collect PPI Statistics (TD12)
- (PARTITION, PI)
- (PARTITION, PI, Partition column)

3. How often do the statistics need to be refreshed? This needs to be answered with the system workload in mind, but there are some recommendations based on best practices:

a. After every load: tables that are fully refreshed by the load
b. Daily: PARTITION; Partition column; value-ordered NUSIs; NUSIs associated with critical processes/reports; sample statistics; DBC tables
c. After a 10% change in size: any table or join index that has changed 10% in size
d. Every 90 days: rolling tables; join indexes associated with rolling tables; any statistics that are 90 days old (almost static tables)
e. Zero statistics: any statistics showing zero unique values
f. No statistics: any index with no statistics defined

4. What is the impact of the collect statistics process for a column, an index, or a multi-column set on any table or join index, especially the very large ones? The impact can be determined from DBQL data and from the Explain output of the COLLECT STATISTICS statements themselves. High CPU impact is associated with large tables; the other main factor is the uniqueness of the column. A DBQL query like the sketch below, for example, can surface the most expensive COLLECT STATISTICS statements.
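The following is a minimal sketch, assuming query logging is enabled on your system; view and column availability vary by release, and QueryText is typically truncated in DBQL, so very long statements may appear cut off.

/* Hedged sketch: rank COLLECT STATISTICS statements by CPU cost
   using DBQL (assumes logging is enabled and DBC.QryLog exists). */
SELECT StartTime
     , AMPCPUTime
     , TotalIOCount
     , QueryText
FROM   DBC.QryLog
WHERE  QueryText LIKE '%COLLECT STATISTICS%'
ORDER BY AMPCPUTime DESC;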
a. UPI or NUPI

The impact is high for very large tables or join indexes because Teradata does an all-rows scan to collect the statistics.

COLLECT STATISTICS ON GDW_AGG.RPT_ITEM INDEX ( ITEM_ID );

Explanation
1) First, we lock GDW_AGG.RPT_ITEM for access.
2) Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_ITEM by way of an all-rows scan into Spool 1 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 1 (Last Use) into Spool 3, which is built locally on the AMP derived from DBC.TVFields by way of the primary index.
4) We lock DBC.TVFields for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 3 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 4 (one-amp), which is built locally on the AMPs.
6) We do a single-AMP MERGE into DBC.TVFields from Spool 4 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.

b. USI or NUSI

The impact is low even on very large tables or join indexes because Teradata uses the secondary index subtable to collect the statistics.

COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 INDEX ( DISTRICT_ID );

Explanation
1) First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 for access.
2) Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 by way of a traversal of index # 8 without accessing the base table into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.TVFields by way of the primary index.
4) We lock DBC.TVFields for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
6) We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.

c. PARTITION

The impact is low even on very large tables because Teradata reads the cylinder headers to determine the partitions and collect the statistics, although the Explain output does not reflect this and still shows an all-rows scan.

COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809 COLUMN PARTITION;

Explanation
1) First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809 for access.
2) Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809 by way of an all-rows scan into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.Indexes by way of the primary index.
4) We lock DBC.Indexes for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.Indexes from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
6) We do a single-AMP MERGE into DBC.Indexes from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.
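As listed in recommendation 2e above, TD12 also allows PARTITION to be combined with other columns in a single multi-column statistic. The statement below is a hypothetical sketch on the same table; LOCATION_ID stands in for the primary index column and is an assumption, not taken from the actual table definition.

/* Hedged sketch of recommendation 2e: multi-column PARTITION
   statistics on TD12 (LOCATION_ID is a hypothetical PI column). */
COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809 COLUMN ( PARTITION, LOCATION_ID );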
d. Partition Column

The impact is high for very large tables or join indexes because Teradata does an all-rows scan to collect the statistics.

COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809 COLUMN INV_DT;

Explanation
1) First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809 for access.
2) Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809 by way of an all-rows scan into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.TVFields by way of the primary index.
4) We lock DBC.TVFields for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
6) We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.

e. Partition Column with NUSI

The impact is low even on very large tables or join indexes because Teradata uses the secondary index subtable to collect the statistics.

COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN BUSINESSDAY_DT;

Explanation
1) First, we lock GDW_DM.SALES_DETAIL for access.
2) Next, we do a COLLECT STATISTICS step from GDW_DM.SALES_DETAIL by way of a traversal of index # 12 without accessing the base table into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.TVFields by way of the primary index.
4) We lock DBC.TVFields for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
6) We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.

f. Multi-Column

The impact is high for very large tables because Teradata does an all-rows scan to collect the statistics. Also keep in mind that multi-column statistics values are truncated at 16 bytes, which limits the accuracy of the histogram for wide column combinations.

COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN (

Explanation
1) First, we lock GDW_DM.SALES_DETAIL for access.
2) Next, we do a COLLECT STATISTICS step from GDW_DM.SALES_DETAIL by way of an all-rows scan into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the AMP derived from DBC.Indexes by way of the primary index.
4) We lock DBC.Indexes for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.Indexes from Spool 4 (Last Use) by way of a RowHash match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
6) We do a single-AMP MERGE into DBC.Indexes from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.

5. What is the best way to refresh statistics on a table or a join index?

The answer usually depends on the size of the table and on how many CPU cycles are available for collecting statistics. Consider the following best practices, which also limit the work that is lost if a collect statistics process has to be aborted (see the sketch after this list):

Refresh at the table level
- Global temporary tables
- Small tables

Refresh at the index/column/multi-column level
- All tables and join indexes
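A minimal sketch of the two refresh levels follows; GDW_STG.SMALL_LOOKUP is a hypothetical table name, while the second statement reuses the index from section 4a. A bare table-level COLLECT STATISTICS re-collects every statistic already defined on the table, whereas naming a single index or column re-collects only that one statistic.

/* Table-level refresh: re-collects all statistics previously
   defined on the table (GDW_STG.SMALL_LOOKUP is hypothetical). */
COLLECT STATISTICS ON GDW_STG.SMALL_LOOKUP;

/* Index-level refresh: re-collects one statistic at a time, so
   an aborted job on a large table loses less completed work. */
COLLECT STATISTICS ON GDW_AGG.RPT_ITEM INDEX ( ITEM_ID );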
6. Are there windows of time when the system usually runs light?

This question can be answered with ResUsage data (for example, node-level CPU utilization by hour from the ResUsage SPMA table), which identifies the periods when the system is lightly loaded and statistics can be collected with the least contention.

The next step is to create a process that automatically takes all of the items above into account and collects statistics efficiently, without wasting system resources.

To be continued…