Parallel Execution Plans
Joe Chang
jchang6@yahoo.com
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-references index use with SQL execution plans
Performance monitoring, Profiler/Trace aggregation
So you bought a 64+ core box
Now learn all about parallel execution:
All guns (cores) blazing
Negative scaling – yes, this can happen; how will you know?
Super-scaling – no, I have not been smoking pot
High degree of parallelism & small SQL
Anomalies, execution plan changes, etc.
Compression – how much CPU do I pay for this?
Partitioning – great management tool; what else?
Parallel Execution Plans
Reference: Adam Machanic PASS
Execution Plan Quickie
I/O and CPU Cost components
Estimated Execution Plan (F4 shows operator details)
Cost is duration in seconds on some reference platform
IO cost for a scan: 1 = 10,800KB/s, so 810 implies 8,748,000KB
IO in a Nested Loops Join: 1 = 320 IOs/s, i.e., a multiple of 0.003125 per IO
A sketch for viewing these cost components in text form follows.
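A minimal sketch for examining the IO and CPU cost components without the
graphical plan, assuming the TPC-H LINEITEM table used later in this deck:

    SET SHOWPLAN_ALL ON;
    GO
    -- The EstimateIO, EstimateCPU, and TotalSubtreeCost columns carry
    -- the cost components discussed above
    SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM;
    GO
    SET SHOWPLAN_ALL OFF;
    GO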
Index + Key Lookup vs Scan

Actual                  Key Lookup    Scan
CPU                           1919    8736
Time (data in memory)         1919    8727

Plan cost cross-over: (926.67 - 323655 * 0.0001581) / 0.003125 = 280,160 rows (86.6%)
True cross-over: approx 1,400,000 rows (1 row per page)
Scan IO cost: 1,093,729 pages / 1350 = 810.17 (8,748MB)
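A hedged sketch for probing this cross-over yourself, forcing each plan shape
with index hints (table and index names are hypothetical):

    -- Nonclustered index seek + key lookup plan
    SELECT SUM(DataCol)
    FROM dbo.TestTable WITH (INDEX(IX_TestTable_SearchCol))
    WHERE SearchCol <= 280160;

    -- Full scan plan; INDEX(0) forces a clustered index (table) scan
    SELECT SUM(DataCol)
    FROM dbo.TestTable WITH (INDEX(0))
    WHERE SearchCol <= 280160;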
Index + Key Lookup vs Scan (actual CPU and time)

Actual     Key Lookup     Scan
CPU              2138    18622
Time              321      658

Scan IO cost: 8,748,000KB / 8 / 1350 = 810
Cross-over: (817 - 280326 * 0.0001581) / 0.003125 = 247,259 rows (88%)
Actual Execution Plan
Estimated vs Actual: note the Actual Number of Rows, Rebinds, and Rewinds.
Row Count and Executions
For a Loop Join inner source and Key Lookup,
Actual Number of Rows = Number of Executions × Number of Rows per execution
(e.g., 323,655 executions at 1 row each show as 323,655 actual rows).
Parallel Plans
Parallelism Operations
Distribute Streams: non-parallel source, parallel destination
Repartition Streams: parallel source and destination
Gather Streams: destination is non-parallel
Parallel Execution Plans
Note the gold circle with double arrows and the Parallelism operations.
Parallel Scan (and Index Seek)
[Plan cost shown at DOP 1, 2, 4, and 8 (CPU reduced 2X, 4X, 8X); a second
slide shows DOP 16]
IO cost is the same at every DOP.
CPU cost is reduced by the degree of parallelism, except no further reduction
at DOP 16.
IO contributes most of the cost!
Hash Match Aggregate: CPU cost only reduces by 2X.
Parallel Scan: IO cost is the same; CPU cost is reduced in proportion to the
degree of parallelism, last 2X excluded?
On a weak storage system, a single thread can saturate the IO channel;
additional threads will not increase IO (or reduce IO duration).
A very powerful storage system can provide IO proportional to the number of
threads. It might be nice if this were an optimizer option.
The IO component can be a very large portion of the overall plan cost.
Not reducing the IO cost in a parallel plan may inhibit generating the
favorable plan, i.e., the CPU savings are not sufficient to offset the
contribution from the Parallelism operations.
A parallel execution plan is more likely on larger systems (-P to fake it?).
A sketch for observing this follows.
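A minimal sketch, comparing the estimated plan for the same query at two DOP
settings; the Scan operator's EstimateIO should be unchanged while
EstimateCPU drops (LINEITEM is the TPC-H table used later; the MAXDOP values
are illustrative):

    SET SHOWPLAN_ALL ON;
    GO
    -- Serial plan: full CPU cost on one thread
    SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM OPTION (MAXDOP 1);
    -- Parallel plan: per-thread CPU cost drops, EstimateIO stays the same
    SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM OPTION (MAXDOP 8);
    GO
    SET SHOWPLAN_ALL OFF;
    GO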
Actual Execution Plan - Parallel
More Parallel Plan Details
Parallel Plan - Actual
Parallelism – Hash Joins
Hash Join Cost
[Plan cost shown at DOP 1, 2, 4, and 8]
Search "Understanding Hash Joins" for the in-memory, grace, and recursive
variants.
Hash Join Cost
CPU cost is linear with the number of rows, outer and inner source
(see BOL on Hash Joins for In-Memory, Grace, and Recursive).
IO cost is zero for small intermediate data sizes; beyond a set point,
proportional to server memory(?), IO is proportional to the excess data
(beyond the in-memory limit).
Parallel plan: memory allocation is per thread!
Summary: Hash Join plan cost depends on memory if the IO component is
nonzero, in which case the cost is disproportionately lower with parallel
plans. Does this reflect real cost? A sketch of such a join follows.
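A hedged sketch of this kind of query, forcing a hash join at a chosen DOP
(ORDERS and LINEITEM are the TPC-H tables used in this deck; the hints are
illustrative):

    -- Force a hash join at DOP 8; in a parallel plan each thread
    -- receives its own share of the memory grant
    SELECT O.O_ORDERDATE, SUM(L.L_EXTENDEDPRICE)
    FROM ORDERS O
    JOIN LINEITEM L ON L.L_ORDERKEY = O.O_ORDERKEY
    GROUP BY O.O_ORDERDATE
    OPTION (HASH JOIN, MAXDOP 8);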
Parallelism Repartition Streams
[Plans shown at DOP 2, 4, and 8]
Bitmap
BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering
SQL Server uses the Bitmap operator to implement bitmap filtering in parallel
query plans. Bitmap filtering speeds up query execution by eliminating rows
with key values that cannot produce any join records before passing rows
through another operator such as the Parallelism operator. A bitmap filter
uses a compact representation of a set of values from a table in one part of
the operator tree to filter rows from a second table in another part of the
tree. Essentially, the filter performs a semi-join reduction; that is, only
the rows in the second table that qualify for the join to the first table are
processed. By removing unnecessary rows early in the query, subsequent
operators have fewer rows to work with, and the overall performance of the
query improves. The optimizer determines when a bitmap is selective enough to
be useful and in which operators to apply the filter.
Parallel Execution Plan Summary
Queries with high IO cost may show little plan cost reduction from parallel
execution.
Plans in which hash or sort operations are a large portion of the cost show a
large parallel plan cost reduction.
Parallel plans may be inhibited by a high row count in Parallelism
Repartition Streams.
Watch out for (parallel) Merge Joins!
Scaling Theory
Parallel Execution Strategy
Partition work into little pieces:
ensures each thread has the same amount of work,
but high overhead to coordinate.
Partition into big pieces:
may produce uneven distribution of work between threads.
Small table joined to big table:
one thread for each row from the small table.
Partitioned table options.
What Should Scale?
Trivially parallelizable:
1) Split a large chunk of work among threads
2) Each thread works independently
3) Small amount of coordination to consolidate threads
More difficult:
1) Split a large chunk of work among threads
2) Each thread works on the first stage
3) Large coordination effort between threads
4) More work, then consolidate
Partitioned Tables
Regular table vs partitioned tables: with partitioned tables, no Repartition
Streams operations! A partitioning sketch follows.
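A minimal partitioning sketch for reference; the boundary values and names
are hypothetical, and the tests later in this deck used a 30-partition
variant:

    -- Hypothetical range partitioning on the join key
    CREATE PARTITION FUNCTION pfOrderKey (int)
    AS RANGE RIGHT FOR VALUES (2000000, 4000000, 6000000);
    GO
    CREATE PARTITION SCHEME psOrderKey
    AS PARTITION pfOrderKey ALL TO ([PRIMARY]);
    GO
    -- Tables created ON psOrderKey(<key column>) with matching partitioning
    -- can be joined partition to partition, without a Repartition Streams
    -- exchange in the parallel plan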
Scaling Reality
8-way Quad-Core Opteron
Windows Server 2008 R2
SQL Server 2008 SP1 + HF 27
Test Queries
TPC-H SF 10 database
Standard, Compressed, Partitioned (30 partitions)
Line Item table SUM: 59M rows, 8.75GB (queries sketched below)
Orders table: 15M rows
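A sketch of the "Sum 1 column" and "Sum 2 columns" test queries described
above; the exact column list is an assumption:

    -- Sum 1 column
    SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM OPTION (MAXDOP 8);
    -- Sum 2 columns
    SELECT SUM(L_EXTENDEDPRICE), SUM(L_DISCOUNT)
    FROM LINEITEM OPTION (MAXDOP 8);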
[Chart: CPU-sec to SUM 1 or 2 columns of Line Item, standard table, at
DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: CPU-sec to SUM 1 or 2 columns of Line Item, compressed table, at
DOP 1, 2, 4, 8, 16, 24, 30, 32]
Speed Up
[Chart: speedup on the standard table for Sum 1, Sum 2, S2 Group, and S2 Join
at DOP 1, 2, 4, 8, 16, 24, 32]
[Chart: speedup on the compressed table for Sum 1 column, Sum 2 columns,
S2 Group, and S2 Join at DOP 1, 2, 4, 8, 16, 24, 32]
Line Item sum 1 column
[Chart: CPU-sec for Sum 1 on the standard, compressed, and partitioned tables
at DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: speedup relative to DOP 1 for Sum 1 on the standard, compressed, and
partitioned tables at DOP 1, 2, 4, 8, 16, 24, 30, 32]
Line Item Sum w/Group By
[Chart: CPU-sec for Group Std, Compressed, and Hash at DOP 1, 2, 4, 8, 16,
24, 32]
[Chart: speedup for Group Std, Compressed, and Hash at DOP 1, 2, 4, 8, 16,
24, 32]
Hash Join
[Chart: CPU-sec for Join Std, Compressed, and Partitioned at DOP 1, 2, 4, 8,
16, 24, 32]
[Chart: speedup for Join Std, Compressed, and Partitioned at DOP 1, 2, 4, 8,
16, 24, 32]
Key Lookup and Table Scan
[Chart: CPU-sec at 1.4M rows for Key Lookup standard, Key Lookup compressed,
Table Scan uncompressed, and Table Scan compressed at DOP 1, 2, 4, 8, 16, 24,
30, 32]
[Chart: speedup for the same four series at DOP 1, 2, 4, 8, 16, 24, 30, 32]
Parallel Execution Summary
Contention in queries with low cost per page (simple scan).
High cost per page improves scaling! (multiple aggregates, hash join,
compression)
Table partitioning – alternative query plans.
Loop joins – broken at high DOP.
Merge join – seriously broken (parallel).
Scaling DW Summary
Massive IO bandwidth.
Parallel options for data loads, updates, etc.
Investigate parallel execution plans:
scaling from DOP 1, 2, 4, 8, 16, 32, etc.,
scaling with and without HT.
Strategy for limiting DOP with multiple users (see the sketch below).
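One way to implement such a DOP-limiting strategy in SQL Server 2008 is
Resource Governor; a minimal sketch, with hypothetical names and an
illustrative MAX_DOP value (a classifier function is still needed to route
sessions to the group):

    -- Cap DOP for one workload without changing the server-wide setting
    CREATE RESOURCE POOL poolReports;
    GO
    CREATE WORKLOAD GROUP groupReports
    WITH (MAX_DOP = 4)
    USING poolReports;
    GO
    ALTER RESOURCE GOVERNOR RECONFIGURE;
    GO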
Fixes from Microsoft Needed
Contention issues in parallel execution (table scan, Nested Loops).
Better plan cost model for scaling.
Back off on parallelism if the gain is negligible.
Fix throughput degradation with multiple users running big DW queries;
in Sybase and Oracle TPC-H results, Throughput is close to Power or better.
Test Systems
2-way quad-core Xeon 5430 2.66GHz
Windows Server 2008 R2, SQL 2008 R2
8-way dual-core Opteron 2.8GHz
Windows Server 2008 SP1, SQL 2008 SP1
8-way quad-core Opteron 2.7GHz Barcelona
Windows Server 2008 R2, SQL 2008 SP1 build 2789
The 8-way systems were configured for AD (Active Directory) – not good!
Test Methodology
Boot with all processors.
Run queries at MAXDOP 1, 2, 4, 8, etc.
This is not the same as running on a 1-way, 2-way, or 4-way server;
interpret results with caution. A measurement sketch follows.
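A minimal sketch of one step of the DOP sweep, measuring CPU and elapsed time
(repeat with each MAXDOP value):

    -- Reports CPU time and elapsed time for each statement
    SET STATISTICS TIME ON;
    GO
    SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM OPTION (MAXDOP 2);
    GO
    SET STATISTICS TIME OFF;
    GO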
References
Adam Machanic, PASS presentations (search "Adam Machanic PASS")