Informix Chat with the Labs Unlocking the Mysteries Behind Update Statistics STSM

advertisement
Unlocking the Mysteries Behind Update Statistics
John F. Miller III
STSM
Informix Chat with the Labs
1
© IBM Corporation 2006
The Dice Problem
• Throw dice, how many will be 1?
2
© IBM Corporation 2006
Questions about the Dice
• How many dice are you throwing?
• How many sides does each dice have?
• Are all the dice the same?
The better the information,
the more accurate the estimate.
3
© IBM Corporation 2006
What does Update Statistics do?
• Collects information for the optimizer
– Statistics
– Distributions
LOW
MEDIUM & HIGH
• Drop Distributions
• Compile stored procedures
4
© IBM Corporation 2006
Statistics Collected
•
•
•
•
•
•
systables
systables
syscolumns
syscolumns
sysindexes
sysindexes
Number of Rows
Number of pages to store the data
Second largest value for a column
Second smallest value for a column
# of unique values for the lead key
How highly clustered the values for
the lead key
5
© IBM Corporation 2006
Update Statistics Low
Basic Algorithm
• Walk the leaf pages in each index
• Submit btree cleaner requests when deleted
items are found causing re-balancing of indexes
• Collects the following information
–
–
–
–
Number of unique items
Number of leave pages
How clustered the data is
Second highest and lowest value
6
© IBM Corporation 2006
How to Read Distributions
--- DISTRIBUTION --(
-1
1: ( 868317,
70,
75)
2: ( 868317,
24,
100)
3: ( 868317,
12,
116)
4: ( 868317,
30,
147)
5: ( 868317,
39,
194)
6: ( 868317,
28,
222)
--- OVERFLOW --1: ( 779848,
43)
2: ( 462364,
45)
# of rows represented in this bin
# of unique values
Highest Value in this bin
To get the range of values
look at the highest value
in the previous bin.
The value
# of rows for this value
© IBM Corporation 2006
7
Example - Approximating a Value
--- DISTRIBUTION --(
-1
1: ( 868317,
70,
75)
2: ( 868317,
24,
100)
3: ( 868317,
12,
116)
4: ( 868317,
30,
147)
5: ( 868317,
39,
194)
6: ( 868317,
28,
222)
--- OVERFLOW --1: ( 779848,
43)
2: ( 462364,
45)
• There are 868317 rows
containing a value between
-1 and 75
• There are 70 unique values
in this range
• The optimizer will deduce
868317 / 70 = 12,404
records for each value
between -1 and 75
8
© IBM Corporation 2006
Example - Dealing with Data Skew
--- DISTRIBUTION --(
-1
1: ( 868317,
70,
75)
2: ( 868317,
24,
100)
3: ( 868317,
12,
116)
4: ( 868317,
30,
147)
5: ( 868317,
39,
194)
6: ( 868317,
28,
222)
--- OVERFLOW --1: ( 779848,
43)
2: ( 462364,
45)
• Data skew
• For the value 43 how many
records will the optimizer
estimate will exist?
• Answer 779848 values
• Any value that exceeds 25%
of the bin size will be
placed in an overflow bin
9
© IBM Corporation 2006
Basic Algorithm for Distributions
• Develop scan plan
based on available
resources
• Scan table
– High = All rows
– Medium = Sample of
rows
• Sort each column
• Build distributions
• Begin transaction
– Delete old columns
distributions
– Insert new columns
distributions
• Commit transaction
10
© IBM Corporation 2006
Sample Size
• HIGH
– All rows in the table
• Medium
– Misconception about the number of
rows sampled is based on the number
of rows in the table, this is incorrect.
– The number of samples depends on
the Confidence and Resolution.
– If the sample size is greater than the
number of row in the table Medium
turns into High mode
11
© IBM Corporation 2006
Update Statistics Medium
Sample Size
Resolution Confidence
Samples
Resolution Confidence Samples
2.5
.95
2,963
0.25
.95
296,255
2.5
.99
4,273
0.25
.99
425,104
1.0
.95
18,516
0.1
.95
1,851,593
1.0
.99
26,569
0.1
.99
2,656,900
0.5
.95
74,064
0.05
.95
7,406,375
0.5
.99
106,276
0.05
.99
10,627,600
12
© IBM Corporation 2006
How Much Information is
Enough??
The better the information,
the more accurate the estimate.
13
© IBM Corporation 2006
Examining the Running Query
No Statistics VS Medium Statistics
No Statistics
Medium Statistics
QUERY:
-----select * from t1 where c1 > 20200
QUERY:
-----select * from t1 where c1 > 20200
Estimated Cost: 20888
Estimated # of Rows Returned: 6760
Estimated Cost: 21
Estimated # of Rows Returned: 19
1) miller3.t1: SEQUENTIAL SCAN
Filters: miller3.t1.c1 > 20200
1) miller3.t1: INDEX PATH
(1) Index Keys: c1
(Serial, fragments: ALL)
Lower Index Filter: t1.c1 > 20250
Overall performance improved
The estimates were more accurate
The query plan changed
14
© IBM Corporation 2006
Examining the Running Query
Medium Statistics VS High Statistics
Medium Statistics
High Statistics
QUERY:
-----select * from t1 where c1 > 20200
QUERY:
-----select * from t1 where c1 > 20200
Estimated Cost: 21
Estimated # of Rows Returned: 19
Estimated Cost: 33
Estimated # of Rows Returned: 30
1) miller3.t1: INDEX PATH
(1) Index Keys: c1
Lower Index Filter: t1.c1 > 20250
1) miller3.t1: INDEX PATH
(1) Index Keys: c1
Lower Index Filter: t1.c1 > 20250
Overall performance did not change
The estimates were slightly more accurate
The query plan did not change
15
© IBM Corporation 2006
Version of Update Statistics
Improvements
•
•
•
•
All version of 9.40 and 10.00
9.30.UC3
9.21 Not fixed
7.31.UD2
16
© IBM Corporation 2006
Improvements in Update Statistics
• Update statistics can not allocated memory
between 4MB and 100MB of sort memory
– The default has been raised from 4MB to 15MB
– User can now configure the amount of memory
• Use DBUPSPACE has been augmented to include memory
• Format of DBUPSPACE
– {max disk space}:{default memory}
– To increase the memory to 35 MB, set DBUPSPACE=0:35.
• Allow update statistics to use light scans when
scanning a a table
– Implemented light scans
– Set oriented reads
17
© IBM Corporation 2006
Improvements in update statistics
• Information about building data distributions is not
viewable by the DBA
– Set explain will now print the scan path and resource
usage when building data distributions
• Update statistics low on fragmented tables does not
run in parallel
– With PDQ turned on each index fragment will be
scanned in parallel
– PDQ at 1 means 10% of the index fragments
scanned in parallel, while PDQ at 10 means all the
index fragments will be scanned in parallel
18
© IBM Corporation 2006
Improvements in Update Statistics
• Various errors (126, 312, 100,…) when executing update
statistics
– Errors when trying to insert the distributions because set lock mode
to wait was not handled properly inside update statistics
• Range scanning a fragmented index is slow Replace the next
loop merge with a binary search merge when ordering
items from index fragments
– Most noticeable when the number of fragments in an index is large
19
© IBM Corporation 2006
Update Statistics Medium
Memory Requirements
Confidence .99
Confidence .99
Row
Size
Resolution
2.5
2.0
1.5
1.0
Resolution
Row
Size
2.5
2.0
1.5
1.0
100
.96 MB 1.2 MB 1.8 MB 3.5 MB
600
3.0 MB 4.5 MB 7.6 MB
16.5 MB
200
1.3 MB 1.9 MB 3.0 MB 6.1 MB
700
3.4 MB 5.1 MB 8.7 MB
19.1 MB
300
1.8 MB 2.5 MB 4.2 MB 8.7 MB
800
3.8 MB 5.8 MB 9.9 MB
21.7 MB
400
2.2 MB 3.2 MB 5.3 MB 11.3 MB
900
4.3 MB 6.4 MB 11.1 MB 24.2 MB
500
2.6 MB 3.9 MB 6.4 MB 13.9 MB
1000 4.7 MB 7.1 MB 12.2 MB 26.9 MB
20
© IBM Corporation 2006
Update Statistics High Memory
Requirements
• In memory sort
– Approximate Memory = number of rows *
sum(column widths + 2 * sizeof(pointer) )
21
© IBM Corporation 2006
Memory Rules
• Estimated Update Stats memory is below 100MB
– Hard coded limit of 4MB
– Attempts to minimize the scans by fitting as many
columns into 4MB
• Estimated Update Stats memory is above 100MB
– Memory is requested from MGM
– Attempt to minimize the scans by fitting as many
columns in the MGM memory
22
© IBM Corporation 2006
Examples
• Customer Table
Cust_id integer
Fname
char(50)
Lname
char(50)
Address1 char(200)
Address2 char(200)
State
char(2)
zipcode integer
• Number of Rows 500,000
23
© IBM Corporation 2006
Examples
Memory for Incore Sort
Column
Data Type
Size
Sort Memory
Cust_id
Integer
4 bytes
2 MB
Fname
Char(50)
50 bytes
25 MB
Lname
Char(50)
50 bytes
25 MB
Address1
Char(200)
200 bytes
100 MB
Address2
Char(200)
200 bytes
100 MB
State
Char(2)
2 bytes
1 MB
Zipcode
Integer
4 bytes
2 MB
24
© IBM Corporation 2006
Examples
Number of Table Scans
PDQPRIORITY 0
Scan #1
Cust_id
State
Scan #2
Fname
PDQPRIORITY 0
With 100 MB of Memory
Scan #1
Cust_id Fname
Lname State
ZipCode
Scan #2
Address1
Scan #3
Lname
Scan #3
Scan #4
Address1
Scan #5
Address2
Scan #6
ZipCode
Address2
25
© IBM Corporation 2006
Confidence
• A factor in the number of samples used by
update statistics medium
26
© IBM Corporation 2006
Resolution
• Percentage of data that is represented in a
distribution bin
• Example
– 100,000 rows in the table
– Resolution of 2%
– Each bin will represent 2,000 rows
27
© IBM Corporation 2006
Example
• Following Example
– Table size 215,000 rows
– Row size 445 bytes
– Uniprocessor
28
© IBM Corporation 2006
Example of the current update
statistics
Table:
jmiller.t9
Mode:
HIGH
Number of Bins: 267 Bin size 1082
Sort data 101.4 MB
Sort memory granted 4.0 MB
Estimated number of table scans 10
PASS #1 c9
PASS #2 c5
PASS #3 c7
PASS #4 c6
…..
PASS #10 c4
Completed pass 1 in 0 minutes 24 seconds
Completed pass 2 in 0 minutes 20 seconds
Completed pass 3 in 0 minutes 17 seconds
Completed pass 4 in 0 minutes 17 seconds
Completed pass 5 in 0 minutes 17 seconds
Completed pass 6 in 0 minutes 15 seconds
Completed pass 7 in 0 minutes 14 seconds
Completed pass 8 in 0 minutes 15 seconds
Completed pass 9 in 0 minutes 16 seconds
Completed pass 10 in 0 minutes 14 seconds
Total Time 146 seconds
29
© IBM Corporation 2006
The New Defaults
Table:
jmiller.t9
Mode:
HIGH
Number of Bins: 267 Bin size 1082
Sort data
101.4 MB
Sort memory granted
15.0 MB
Estimated number of table scans 7
PASS #1 c9,c8,c10,c5,c7
PASS #2 c6,c1
PASS #3 c3
PASS #4 c2
PASS #5 c4
Completed pass 1 in 0 minutes 34 seconds
Completed pass 2 in 0 minutes 19 seconds
Completed pass 3 in 0 minutes 16 seconds
Completed pass 4 in 0 minutes 14 seconds
Completed pass 5 in 0 minutes 15 seconds
Total Time 98 seconds
New Memory Default
30
© IBM Corporation 2006
Enabling PDQ with
Update Statistics
Table:
jmiller.t9
PDQ Memory
Mode:
HIGH
Number of Bins: 267 Bin size 1082
Sort data
101.4 MB
Features Enabled
PDQ memory granted
106.5 MB
Estimated number of table scans 1
PASS #1 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
Index scans disabled
Light scans enabled
Completed pass 1 in 0 minutes 29 seconds
Total Time 29 seconds
31
© IBM Corporation 2006
Tuning with the New Statistics
• Turn on PDQ when running update statistics, but only for
tables
– Avoid PDQ when updating statistics for procedures
• When running high or medium increase the memory update
statistics has to work with
• Enable parallel sorting (i.e. PSORT_NPROCS)
32
© IBM Corporation 2006
Considerations
• Change the RESOLUTION to 1.5
– Increasing the number of bins for the distributions
– Increasing the sample size for update statistics medium
33
© IBM Corporation 2006
Old Recommendations
• Start one update statistics for each column of a table
Fname
Lname
Address
Three sequential
scans of the table
34
© IBM Corporation 2006
New Recommendations
• Start one update statistics for ALL columns giving it
more resources (memory)
• Requires only one scan of the table to produce
distributions on several columns.
Fname
Lname
Address
One scans of the table
35
© IBM Corporation 2006
Other Information
• An Overview of the IBM Informix Dynamic Server Optimizer
www.ibm.com/developerworks/db2/zones/informix/library/techarticle/0211desai/0211desai.
html
• Understanding and Tuning Update Statistics
www.ibm.com/developerworks/db2/zones/informix/library/techarticle/miller/0203miller.ht
ml
• Predicate Inference in Informix Dynamic Server
www.ibm.com/developerworks/db2/zones/informix/library/techarticle/0206goswami/0206g
oswami.html
• IBM Informix Performance Manual
• IBM Informix SQL Reference Manual
36
© IBM Corporation 2006
Questions
37
© IBM Corporation 2006
Download