When is the right time to refresh statistics?

The appropriate time to refresh statistics is just before a query or load process breaks, that is, just before the optimizer changes the access plan from a good plan to a bad one. The more relevant question, however, is: how can I determine the right time to refresh statistics on a given table or join index? To answer it, there are a few items that are important to know and understand first, as listed below:
1. The system workload, especially the workload of the critical processes and reports. These are usually the processes and reports associated with an SLA. DBQL and ResUsage data provide the information needed to understand the system workload and to identify the critical processes and reports (a sample DBQL query follows this item).
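As a starting point, a query along the lines of the sketch below can surface the heaviest consumers in DBQL. It assumes query logging is enabled and uses the DBC.QryLog view with the commonly available UserName, AMPCPUTime, TotalIOCount, and StartTime columns; verify the view and column names on your release.

-- Hedged sketch: top CPU consumers over the last 7 days, from DBQL.
-- Assumes DBQL logging is enabled; DBC.QryLog and its columns should be
-- verified against your Teradata release.
SELECT  UserName
      , COUNT(*)          AS QueryCnt
      , SUM(AMPCPUTime)   AS TotalAMPCPU
      , SUM(TotalIOCount) AS TotalIO
FROM    DBC.QryLog
WHERE   StartTime >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY UserName
ORDER BY TotalAMPCPU DESC;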
2. The statistics collection recommendations to be applied to each table/join index in your environment. The summary below comes from the article “Statistics Collection Recommendations for Teradata 12” by Carrie Ballinger (illustrative statements follow the list).
a. Collect Full Statistics
   - Non-indexed columns used in constraints and joins
   - All NUSIs with uneven distribution of values
   - NUSIs used in join steps
   - USIs/UPIs if used in range constraints
   - Value-ordered NUSIs
   - NUPIs
   - Relevant columns and indexes on small tables
b. Collect Sample Statistics (very large tables)
   - Unique index columns
   - Nearly-unique columns or indexes
c. Collect Multi-Column Statistics
   - Groups of columns used in constraints with equality predicates
   - Groups of columns used in joins or aggregations
d. Collect PPI Statistics
   - PARTITION
   - Partition column
e. Collect PPI Statistics (TD12)
   - (PARTITION, PI)
   - (PARTITION, PI, Partition column)
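To make these categories concrete, the statements below sketch what each recommendation looks like in practice. The syntax is the TD12 style used elsewhere in this article; GDW_DM.SALES_DETAIL and BUSINESSDAY_DT come from the examples below, while STORE_ID, ITEM_ID, TRANSACTION_ID, and SALES_DTL_ID are placeholder column names for illustration only.

-- Illustrative only; column names not shown elsewhere in the article are placeholders.
-- a. Full statistics on a non-indexed column used in joins/constraints
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN STORE_ID;
-- b. Sample statistics on a nearly-unique column of a very large table
COLLECT STATISTICS USING SAMPLE ON GDW_DM.SALES_DETAIL COLUMN TRANSACTION_ID;
-- c. Multi-column statistics on columns used together in joins
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN (STORE_ID, ITEM_ID);
-- d. PPI statistics: system-derived PARTITION column and the partitioning column
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN PARTITION;
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN BUSINESSDAY_DT;
-- e. TD12 PPI statistics on (PARTITION, PI) and (PARTITION, PI, partitioning column)
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN (PARTITION, SALES_DTL_ID);
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN (PARTITION, SALES_DTL_ID, BUSINESSDAY_DT);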
3. How often do the statistics need to be refreshed? This question has to be answered with the system workload in mind, but there are some recommendations based on best practices (a sample growth-check query follows this item).
a. After every load
   - Tables that are fully refreshed (reloaded) by the load
b. Daily
   - PARTITION
   - Partition column
   - Value-ordered NUSIs
   - NUSIs associated with critical processes/reports
   - Sample statistics
   - DBC tables
c. After a 10% change in size
   - Any table or join index that has changed 10% in size
d. Every 90 days
   - Rolling tables
   - Join indexes associated with rolling tables
   - Any statistics that are 90 days old (nearly static tables)
e. Zero Statistics
   - Any statistics showing zero unique values
f. No Statistics
   - Any index with no statistics defined
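One way to apply the 10%-growth rule is to keep a baseline of permanent space captured at the last statistics refresh and compare it with DBC.TableSize. In the sketch below, DBC.TableSize and its CurrentPerm column are standard, but STATS_ADMIN.TBL_SIZE_BASELINE is a hypothetical helper table you would maintain yourself.

-- Hedged sketch: tables whose current perm grew more than 10% since the
-- baseline captured at the last statistics refresh.
-- STATS_ADMIN.TBL_SIZE_BASELINE is a hypothetical table with columns
-- (DatabaseName, TableName, BaselinePerm, CaptureDate).
SELECT  cur.DatabaseName
      , cur.TableName
      , base.BaselinePerm
      , cur.CurrentPerm
FROM   (SELECT  DataBaseName     AS DatabaseName
              , TableName
              , SUM(CurrentPerm) AS CurrentPerm
        FROM    DBC.TableSize
        GROUP BY 1, 2) cur
JOIN    STATS_ADMIN.TBL_SIZE_BASELINE base
  ON    base.DatabaseName = cur.DatabaseName
 AND    base.TableName    = cur.TableName
WHERE   cur.CurrentPerm > base.BaselinePerm * 1.10;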
4. What is the impact of collecting statistics on a column, an index, or a multi-column set of a table or join index, especially a very large one? The impact can be determined by checking DBQL data and by looking at the Explain output of the COLLECT STATISTICS statements (a sample DBQL query follows the examples below). The CPU impact is highest for large tables; the uniqueness of the column is another factor.
a. UPI or NUPI
The impact is high for very large tables or join indexes because an all-rows scan is needed to collect the statistics.
COLLECT STATISTICS ON GDW_AGG.RPT_ITEM INDEX( ITEM_ID);
Explanation
1) First, we lock GDW_AGG.RPT_ITEM for access.
2) Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_ITEM by way of an all-rows scan into
Spool 1 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 1 (Last Use) into Spool 3, which is built locally on the
AMP derived from DBC.TVFields by way of the primary index.
4) We lock DBC.TVFields for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 3 (Last Use) by way of a RowHash
match scan. New updated rows are built and the result goes into Spool 4 (one-amp), which is built locally on the
AMPs.
6) We do a single-AMP MERGE into DBC.TVFields from Spool 4 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.
b. USI or NUSI
The impact is low even on very large tables or join indexes because it uses the secondary index subtable to collect statistics.
COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 INDEX ( DISTRICT_ID );
Explanation
1) First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 for access.
2) Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809_AJX31 by
way of a traversal of index # 8 without accessing the base table into Spool 3 (all_amps), which is built
locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the
AMP derived from DBC.TVFields by way of the primary index.
4) We lock DBC.TVFields for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash
match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the
AMPs.
6) We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.
c. PARTITION
The impact is low even on very large tables because it uses the cylinder headers to determine the
partitions and collect the statistics, although the Explain output does not reflect that.
COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809 COLUMN PARTITION;
Explanation
1) First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809 for access.
2) Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809 by way of
an all-rows scan into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the
AMP derived from DBC.Indexes by way of the primary index.
4) We lock DBC.Indexes for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.Indexes from Spool 4 (Last Use) by way of a RowHash match
scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
6) We do a single-AMP MERGE into DBC.Indexes from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.
d. Partition Column
The impact is high for very large tables or join indexes because an all-rows scan is needed to collect the statistics.
COLLECT STATISTICS ON GDW_AGG.RPT_LOC_INV_HST_RCLS0809 COLUMN INV_DT;
Explanation
1) First, we lock GDW_AGG.RPT_LOC_INV_HST_RCLS0809 for access.
2) Next, we do a COLLECT STATISTICS step from GDW_AGG.RPT_LOC_INV_HST_RCLS0809 by way of
an all-rows scan into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the
AMP derived from DBC.TVFields by way of the primary index.
4) We lock DBC.TVFields for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash
match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the
AMPs.
6) We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.
e. Partition Column with NUSI
The impact is low even on very large tables or join indexes because it uses the secondary index subtable to collect statistics.
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN BUSINESSDAY_DT;
Explanation
1) First, we lock GDW_DM.SALES_DETAIL for access.
2) Next, we do a COLLECT STATISTICS step from GDW_DM.SALES_DETAIL by way of a traversal of index
# 12 without accessing the base table into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the
AMP derived from DBC.TVFields by way of the primary index.
4) We lock DBC.TVFields for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.TVFields from Spool 4 (Last Use) by way of a RowHash
match scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the
AMPs.
6) We do a single-AMP MERGE into DBC.TVFields from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.
f. Multi-Column
The impact is high for very large tables because an all-rows scan is needed to collect the statistics.
Also keep in mind that multi-column statistics values are truncated to the first 16 bytes, so the trailing columns may not be represented when the leading columns are wide.
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN (
Explanation
1) First, we lock GDW_DM.SALES_DETAIL for access.
2) Next, we do a COLLECT STATISTICS step from GDW_DM.SALES_DETAIL by way of an all-rows scan
into Spool 3 (all_amps), which is built locally on the AMPs.
3) Then we save the UPDATED STATISTICS from Spool 3 (Last Use) into Spool 4, which is built locally on the
AMP derived from DBC.Indexes by way of the primary index.
4) We lock DBC.Indexes for write on a RowHash.
5) We do a single-AMP MERGE DELETE to DBC.Indexes from Spool 4 (Last Use) by way of a RowHash match
scan. New updated rows are built and the result goes into Spool 5 (one-amp), which is built locally on the AMPs.
6) We do a single-AMP MERGE into DBC.Indexes from Spool 5 (Last Use).
7) We spoil the parser's dictionary cache for the table.
8) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> No rows are returned to the user as the result of statement 1.
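As a companion to the Explain outputs above, a DBQL query along these lines can quantify the CPU and I/O that each COLLECT STATISTICS request actually consumed. It assumes query logging captures at least the first characters of the request text in QueryText; verify the DBC.QryLog view and column names on your release.

-- Hedged sketch: CPU and I/O consumed by recent COLLECT STATISTICS requests.
-- Assumes DBQL logging is on; QueryText typically holds only the first
-- portion of the request text.
SELECT  StartTime
      , UserName
      , AMPCPUTime
      , TotalIOCount
      , QueryText
FROM    DBC.QryLog
WHERE   UPPER(QueryText) LIKE 'COLLECT STAT%'
  AND   StartTime >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
ORDER BY AMPCPUTime DESC;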
5. What is the best way to refresh statistics on a table or a join index? The answer usually depends on the size of the table and on how many CPU cycles are available for collecting statistics. To avoid wasting resources if the collect statistics process needs to be aborted, consider the following best practices (a short sketch follows this item):
   - Refresh at the table level
     - Global temporary tables
     - Small tables
   - Refresh at the index/column/multi-column level
     - All tables and join indexes
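The difference between the two approaches is mainly one of granularity, as the sketch below shows: a bare COLLECT STATISTICS on the table re-collects every statistic already defined on it in a single request, while naming an index or column limits the work (and the effort lost if the request has to be aborted). The object names are taken from the examples above.

-- Table-level refresh: re-collects all statistics previously defined on the
-- table in one request (fine for small tables and global temporary tables).
COLLECT STATISTICS ON GDW_AGG.RPT_ITEM;

-- Index/column-level refresh: one statistic at a time, so an abort only
-- loses the work of the current statement (preferred for large tables).
COLLECT STATISTICS ON GDW_AGG.RPT_ITEM INDEX (ITEM_ID);
COLLECT STATISTICS ON GDW_DM.SALES_DETAIL COLUMN BUSINESSDAY_DT;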
6. Are there windows of time when the system usually runs light? This question can be answered with ResUsage data, which shows the periods when the system is running light.
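A rough hourly CPU-utilization profile from ResUsage is usually enough to spot those light windows. The sketch below assumes ResUsage SPMA logging is enabled and that the standard DBC.ResUsageSpma CPU columns are available; on most releases TheTime is stored as a float in HHMMSS form, so adjust the hour calculation if your system differs.

-- Hedged sketch: average CPU busy percentage by hour of day over the last
-- 30 days, to find windows where the system runs light.
SELECT  CAST(TheTime / 10000 AS INTEGER)                              AS HourOfDay
      , AVG(100 * (CPUUServ + CPUUExec)
            / NULLIFZERO(CPUUServ + CPUUExec + CPUIdle + CPUIoWait))  AS AvgCpuBusyPct
FROM    DBC.ResUsageSpma
WHERE   TheDate >= CURRENT_DATE - 30
GROUP BY 1
ORDER BY 1;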

The next step is to create a process that automatically considers all items mentioned above and
collects statistics efficiently without wasting system resources. To be continued…