SQL 64bit Launch Deck

System Architecture:
Big Iron (NUMA)
Joe Chang
jchang6@yahoo.com
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
Scaling SQL on NUMA Topics
OLTP – Thomas Kejser session
“Designing High Scale OLTP Systems”
Data Warehouse
Ongoing Database Development
Bulk Load – SQL CAT paper + TK session
“The Data Loading Performance Guide”
Other Sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R Meyyappan
(SQLBits 6) Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the field: High Performance Storage, John Langford
SQL Server Storage – 1000GB Level, Brent Ozar
Server Systems and Architecture
Symmetric Multi-Processing
[Diagram: four CPUs on a shared system bus to the MCH (memory controller hub), with PXH and ICH I/O hubs below]
In SMP, processors are not dedicated to specific tasks (as in ASMP); there is a single OS image, and each processor can access all memory.
SMP makes no reference to
memory architecture?
Not to be confused to Simultaneous Multi-Threading (SMT)
Intel calls SMT Hyper-Threading (HT), which is not to be
confused with AMD Hyper-Transport (also HT)
Non-Uniform Memory Access
[Diagram: four nodes, each with four CPUs, a memory controller, and a node controller, connected by a shared bus or crossbar]
NUMA Architecture - Path to memory is not uniform
1) Node: Processors, Memory, Separate or combined
Memory + Node Controllers
2) Nodes connected by shared bus, cross-bar, ring
Traditionally, 8-way+ systems
Local memory latency ~150ns, remote node memory ~300-400ns, can
cause erratic behavior if OS/code is not NUMA aware
AMD Opteron
[Diagram: four Opteron sockets connected by HyperTransport, with HT2100/HT1100 I/O bridges]
Technically, Opteron is NUMA, but remote node memory latency is low: no negative impact or erratic behavior!
For practical purposes it behaves like an SMP system.
Local memory latency ~50ns, 1 hop ~100ns, two hop 150ns?
Actual: more complicated because of snooping (cache coherency traffic)
8-way Opteron Sys Architecture
[Diagram: 8-way Opteron topology, CPU 0 through CPU 7 connected by HyperTransport links]
Each Opteron processor (prior to Magny-Cours) has 3 HyperTransport links.
In the 8-way layout, the top and bottom-right processors use 2 HT links to connect
to other processors and the 3rd HT for IO;
CPU 1 and CPU 7 are 3 hops from each other.
http://www.techpowerup.com/img/09-08-26/17d.jpg
Nehalem System Architecture
Intel Nehalem generation processors have Quick Path Interconnect (QPI)
Xeon 5500/5600 series have 2, Xeon 7500 series have 4 QPI
8-way Glue-less is possible
NUMA Local and Remote Memory
Local memory is closer than remote
Physical access time is shorter
What is actual access time?
With cache coherency requirement!
HT Assist – Probe Filter
part of L3 cache used as directory cache
ZDNET
Source Snoop Coherency
From HP PREMA Architecture
whitepaper:
All reads result in snoops to all
other caches, …
Memory controller cannot return
the data until it has collected all
the snoop responses and is sure
that no cache provided a more
recent copy of the memory line
DL980G7
From HP PREMA Architecture whitepaper:
Each node controller stores information about* all data in the processor
caches, minimizes inter-processor coherency communication, reduces
latency to local memory
(*only cache tags, not cache data)
HP ProLiant DL980 Architecture
Node Controllers reduce effective memory latency
Superdome 2 – Itanium, sx3000
Agent – Remote Ownership Tag + L4 cache tags; 64M eDRAM L4 cache data
IBM x3850 X5 (Glue-less)
Connect two 4-socket Nodes to make 8-way system
OS Memory Models
SUMA: Sufficiently Uniform Memory Access
Memory interleaved across nodes
[Diagram: SUMA – consecutive memory addresses (0-31) interleaved round-robin across Nodes 0-3]
NUMA: first interleaved within a node, then spanned across nodes
[Diagram: NUMA – addresses 0-7 on Node 0, 8-15 on Node 1, 16-23 on Node 2, 24-31 on Node 3]
The memory stripe is then spanned across nodes.
Windows OS NUMA Support
Memory models
SUMA – Sufficiently Uniform Memory Access
[Diagram: SUMA – memory addresses striped round-robin across Nodes 0-3]
Memory is striped across
NUMA nodes
NUMA – separate memory pools by Node
[Diagram: NUMA – each node owns a contiguous block of memory addresses]
Memory Model Example: 4 Nodes
SUMA Memory Model
memory access uniformly distributed
25% of memory accesses local, 75% remote
NUMA Memory Model
Goal is better than 25% local node access
True local access time also needs to be faster
Cache coherency may increase local access time
Architecting for NUMA
End to End Affinity
App server group   TCP Port   CPU / NUMA node   Memory & table partitions
North East         1440       Node 0            0-0, 0-1 (NE)
Mid Atlantic       1441       Node 1            1-0, 1-1 (MidA)
South East         1442       Node 2            2-0, 2-1 (SE)
Central            1443       Node 3            3-0, 3-1 (Cen)
Texas              1444       Node 4            4-0, 4-1 (Tex)
Mountain           1445       Node 5            5-0, 5-1 (Mnt)
California         1446       Node 6            6-0, 6-1 (Cal)
Pacific NW         1447       Node 7            7-0, 7-1 (PNW)
The web tier determines the port for each user by group (but should not be by geography!)
Affinitize each port to a NUMA node; each node then accesses localized data (partition?)
OS may allocate a substantial chunk from Node 0?
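A minimal sketch of the port-to-node affinity above (assuming the TCP-port-to-NUMA-node mapping syntax supported by SQL Server Configuration Manager; ports and node masks are the ones in the table, the server name is illustrative):

  -- SQL Server Configuration Manager > Network Configuration > TCP/IP > IP Addresses:
  -- in the TCP Port field, append a NUMA node bitmask to each port, e.g.
  --   1440[0x1],1441[0x2],1442[0x4],1443[0x8],1444[0x10],1445[0x20],1446[0x40],1447[0x80]
  -- Each application-server group then connects on its own port, e.g.
  --   Server=tcp:dbserver,1440   (North East group -> Node 0)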
HP-UX LORA
HP-UX – Not Microsoft Windows
Locality-Optimized Resource Alignment
12.5% Interleaved Memory
87.5% NUMA node Local Memory
System Tech Specs
Processors         Cores  DIMMs  PCI-E G2       Total cores  Max memory  Base
2 x Xeon X56x0     6      18     5 x8 + 1 x4    12           192G*       $7K
4 x Opteron 6100   12     32     5 x8 + 1 x4    48           512G        $14K
4 x Xeon X7560     8      64     4 x8 + 6 x4†   32           1TB         $30K
8 x Xeon X7560     8      128    9 x8 + 5 x4‡   64           2TB         $100K

Memory pricing: 8GB at ~$400 ea (18 x 8G = 144GB, $7,200); 16GB at ~$1,100 ea (12 x 16G = 192GB, $13K); 64 x 8G = 512GB, $26K; 64 x 16G = 1TB, $70K
* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB
† Dell R910 and HP DL580G7 have different PCI-E layouts
‡ ProLiant DL980G7 can have 3 IOHs for additional PCI-E slots
Software Stack
Operating System
Windows Server 2003 RTM, SP1
Network limitations (default settings) impact OLTP
Scalable Networking Pack (912222)
Windows Server 2008
Windows Server 2008 R2 (64-bit only)
Breaks 64 logical processor limit
Search: MSI-X
NUMA IO enhancements?
Do not bother trying to do DW on 32-bit OS or 32-bit SQL Server
Don’t try to do DW on SQL Server 2000
SQL Server version
SQL Server 2000
Serious disk IO limitations (1GB/sec ?)
Problematic parallel execution plans
SQL Server 2005 (fixed most S2K problems)
64-bit on X64 (Opteron and Xeon)
SP2 – performance improvement 10%(?)
SQL Server 2008 & R2
Compression, Filtered Indexes, etc
Star join, Parallel query to partitioned table
Configuration
SQL Server Startup Parameter: E
Trace Flags 834, 836, 2301
Auto_Date_Correlation
Order date < A, Ship date > A
Implied: Order date > A-C, Ship date < A+C
Port Affinity – mostly OLTP
Dedicated processor ?
for log writer ?
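A minimal sketch of applying the settings above (database name is illustrative; 834 and 836 are startup-only trace flags):

  -- Startup parameters (SQL Server Configuration Manager): -E;-T834;-T836;-T2301
  -- Trace flag 2301 can also be enabled at runtime for testing:
  DBCC TRACEON (2301, -1);
  -- Date correlation between the Orders and Line Item date columns:
  ALTER DATABASE tpch SET DATE_CORRELATION_OPTIMIZATION ON;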
Storage Performance
for Data Warehousing
Joe Chang
jchang6@yahoo.com
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
Storage
Organization Structure
In many large IT departments
DB and Storage are in separate groups
Storage usually has own objectives
Bring all storage into one big system under full
management (read: control)
Storage as a Service, in the Cloud
One size fits all needs
Usually have zero DB knowledge
Of course we do high bandwidth, 600MB/sec good enough for you?
Data Warehouse Storage
OLTP – Throughput with Fast Response
DW – Flood the queues for maximum
through-put
Do not use shared storage
for data warehouse!
Storage system vendors like to give the impression the SAN is a magical,
immensely powerful box that can meet all your needs.
Just tell us how much capacity you need and don’t worry about anything else.
My advice: stay away from shared storage, controlled by different team.
Nominal and Net Bandwidth
PCI-E Gen 2 – 5 Gbit/sec signaling
x8 = 5GB/s, net BW 4GB/s, x4 = 2GB/s net
SAS 6Gbit/s – 6 Gbit/s
x4 port: 3GB/s nominal, 2.2GB/sec net?
Fibre Channel 8 Gbit/s nominal
~780MB/s point-to-point,
~680MB/s from host to SAN to back-end loop
SAS RAID controller, x8 PCI-E G2, 2 x4 6G ports:
~2.8GB/s – depends on the controller, and will change!
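A quick arithmetic check of the x8 figure, assuming the nominal-to-net gap is PCI-E Gen 2's 8b/10b encoding overhead: 8 lanes x 5 Gbit/s x 0.8 / 8 bits per byte = 4 GB/s net.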
Storage – SAS Direct-Attach
[Diagram: several RAID controllers, each in a PCI-E x8 slot with two x4 SAS ports fanning out to disk enclosures – many fat pipes, very many disks; plus a RAID controller and 2 x 10GbE adapters in PCI-E x4 slots]
Option A: 24 disks in one enclosure for each x4 SAS port, two x4 SAS ports per controller.
Option B: split each enclosure over 2 x4 SAS ports on one controller.
Balance by pipe bandwidth. Don't forget fat network pipes.
Storage – FC/SAN
[Diagram: multiple 8Gb FC HBAs – quad-port in PCI-E x8 Gen 2 slots, dual-port in x4 slots – fanning out to SAN disk enclosures; plus 2 x 10GbE adapters in x4 slots]
PCI-E x8 Gen 2 slot with quad-port 8Gb FC.
If a quad-port 8Gb HBA is not supported, consider a system with many x4 slots, or consider SAS!
SAN systems typically offer 3.5in 15-disk enclosures; difficult to get high spindle count with density.
1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?
Storage – SSD / HDD Hybrid
[Diagram: SAS controllers in PCI-E x8 slots, each x4 SAS port feeding an enclosure that mixes a few SSDs with HDD bays; plus a RAID controller and 2 x 10GbE adapters in x4 slots]
No RAID w/SSD?
Storage enclosures typically hold 12 disks per channel, which can only support the bandwidth of a few SSDs. Use the remaining bays for extra HDD storage; no point expending valuable SSD space on backups and flat files.
Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes. Multi DB – SSD, otherwise too many RAID1 pairs for logs.
SSD
Current: mostly 3Gbps SAS/SATA SSD
Some 6Gbps SATA SSD
Fusion-IO – direct PCI-E Gen2 interface
320GB-1.2TB capacity, 200K IOPS, 1.5GB/s
No RAID?
HDD is fundamentally a single point of failure
SSD could be built with redundant components
HP reported problems with SSD on RAID controllers; Fujitsu did not?
Big DW Storage – iSCSI
Are you nuts?
Storage Configuration - Arrays
Shown:
two 12-disk Arrays
per 24-disk
enclosure
Options: between
6-16 disks per
array
SAN systems may
recommend R10
4+4 or R5 7+1
Very Many Spindles
Comment on Meta LUN
Data Consumption Rate: Xeon
TPC-H Query 1 (Line Item scan); Line Item at SF1 is 1GB (875MB on SQL Server 2008 with DATE)

Processors    GHz   Cores  Mem GB  SQL   Q1 sec  SF    Total MB/s  MB/s per core
2 Xeon 5355   2.66  8      64      5sp2  85.4    100   1,165.5     145.7    Conroe
2 Xeon 5570   2.93  8      144     8sp1  42.2    100   2,073.5     259.2    Nehalem
2 Xeon 5680   3.33  12     192     8r2   21.0    100   4,166.7     347.2    Westmere
4 Xeon 7560   2.26  32     640     8r2   37.2    300   7,056.5     220.5    Nehalem-EX
8 Xeon 7560   2.26  64     512     8r2   183.8   3000  14,282      223.2    Nehalem-EX
Data consumption rate is much higher for current generation
Nehalem and Westmere processors than Core 2 referenced
in Microsoft FTDW document. TPC-H Q1 is more compute
intensive than the FTDW light query.
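The MB/s columns follow directly from the Line Item size per scale factor divided by the Q1 time; for example (2 x Xeon 5570 row, assuming ~875MB per SF with DATE columns): 100 x 875 MB / 42.2 sec ≈ 2,073 MB/s total, or ≈ 259 MB/s per core across 8 cores.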
Data Consumption Rate: Opteron
TPC-H Query 1 (Line Item scan); Line Item at SF1 is 1GB (875MB on SQL Server 2008 with DATE)

Processors    GHz  Cores  Mem GB  SQL   Q1 sec  SF    Total MB/s  MB/s per core
4 Opt 8220    2.8  8      128     5rtm  309.7   300   868.7       121.1
8 Opt 8360    2.5  32     256     8rtm  91.4    300   2,872.0     89.7     Barcelona
8 Opt 8384    2.7  32     256     8rtm  72.5    300   3,620.7     113.2    Shanghai
8 Opt 8439    2.8  48     256     8sp1  49.0    300   5,357.1     111.6    Istanbul
8 Opt 8439    2.8  48     512     8rtm  166.9   1000  5,242.7     109.2    Istanbul
2 Opt 6176    2.3  24     192     8r2   20.2    100   4,331.7     180.5    Magny-Cours
4 Opt 6176    2.3  48     512     8r2   31.8    300   8,254.7     172.0    Magny-Cours
Expected Istanbul to have better performance per core than Shanghai
due to HT Assist. Magny-Cours has much better performance per core!
(at 2.3GHz versus 2.8 for Istanbul), or is this Win/SQL 2K8 R2?
Data Consumption Rate
TPC-H Query 1 (Line Item scan); Line Item at SF1 is 1GB (875MB on SQL Server 2008 with DATE)

Processors    GHz   Cores  Mem GB  SQL   Q1 sec  SF    Total MB/s  MB/s per core
2 Xeon 5355   2.66  8      64      5sp2  85.4    100   1165.5      145.7
2 Xeon 5570   2.93  8      144     8sp1  42.2    100   2073.5      259.2
2 Xeon 5680   3.33  12     192     8r2   21.0    100   4166.7      347.2
2 Opt 6176    2.3   24     192     8r2   20.2    100   4331.7      180.5
4 Opt 8220    2.8   8      128     5rtm  309.7   300   868.7       121.1
8 Opt 8360    2.5   32     256     8rtm  91.4    300   2872.0      89.7     Barcelona
8 Opt 8384    2.7   32     256     8rtm  72.5    300   3620.7      113.2    Shanghai
8 Opt 8439    2.8   48     256     8sp1  49.0    300   5357.1      111.6    Istanbul
4 Opt 6176    2.3   48     512     8r2   31.8    300   8254.7      172.0    Magny-Cours
8 Xeon 7560   2.26  64     512     8r2   183.8   3000  14282       223.2
2U disk enclosure: 24 x 73GB 15K 2.5in disks, $14K ($600 per disk)
Storage Targets
Processors     Total Cores  BW/Core Target  Target MB/s  PCI-E x8-x4  SAS HBA  Storage Units - Disks   Actual Bandwidth
2 Xeon X5680   12           350             4,200        5-1          2        2 - 48 or 4 - 96        5 GB/s
4 Opt 6176     48           175             8,400        5-1          4        4 - 96 or 8 - 192       10 GB/s
4 Xeon X7560   32           250             8,000        6-4          6        6 - 144 or 12 - 288     15 GB/s
8 Xeon X7560   64           225             14,400       9-5          11†      10 - 240 or 20 - 480    26 GB/s

† 8-way: 9 controllers in x8 slots with 24 disks per x4 SAS port, plus 2 controllers in x4 slots with 12 disks
24 15K disks per enclosure:
12 disks per x4 SAS port requires 100MB/sec per disk – possible but not always practical
24 disks per x4 SAS port requires 50MB/sec per disk – more achievable in practice
Think: Shortest path to metal (iron-oxide)
Your Storage and the Optimizer
Model      Disks  BW (KB/s)   Sequential IOPS  "Random" IOPS  Sequential:Random IO ratio
Optimizer  -      10,800      1,350            320            4.22
SAS 2x4    24     2,800,000   350,000          9,600          36.5
SAS 2x4    48     2,800,000   350,000          19,200         18.2
FC 4G      30     360,000     45,000           12,000         3.75
SSD        8      2,800,000   350,000          280,000        1.25
Assumptions: 2.8GB/sec per SAS 2 x4 adapter (could be 3.2GB/sec per PCI-E G2 x8).
HDD: 400 IOPS per disk – big-query key lookup or loop join at high queue depth, short-stroked, possibly skip-seek. SSD: 35,000 IOPS.
The SQL Server query optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio.
A DW-configured storage system has an 18-36 ratio; 30 disks per 4G FC roughly matches the optimizer's model, and SSD sits in the other direction.
Data Consumption Rates
[Chart: TPC-H SF100 data consumption rates for Q1, Q9, Q18, Q21 – Xeon 5355 (5sp2), Xeon 5570 (8sp1), Xeon 5680 (8R2), Opteron 6176 (8R2)]
[Chart: TPC-H SF300 data consumption rates for Q1, Q9, Q18, Q21 – Opteron DC 2.8GHz (5rtm), QC 2.5GHz (8rtm), QC 2.7GHz (8rtm), 6C 2.8GHz (8sp1), 12C 2.3GHz (8R2), Xeon 7560 (R2, 640GB)]
Fast Track Reference Architecture
My Complaints
Several Expensive SAN systems (11 disks)
Each must be configured independently
Scripting?
$1,500-2,000 amortized per disk
Too many 2-disk Arrays
2 LUN per Array, too many data files
Build Indexes with MAXDOP 1
Is this brain dead?
Designed around 100MB/sec per disk
Not all DW is single scan, or sequential
Fragmentation
Weak Storage System
1) Fragmentation could degrade IO performance,
2) Defragmenting very large table on a weak storage
system could render the database marginally to
completely non-functional for a very long time.
Powerful Storage System
3) Fragmentation has very little impact.
4) Defragmenting has mild impact, and completes
within night time window.
What is the correct conclusion?
[Diagram: storage stack – Table, File, Partition, LUN, Disk]
Operating System View of Storage
Operating System Disk View
[Diagram: Disks 2-7, each Basic, 396GB, Online – one per controller port: Controller 1 Ports 0/1, Controller 2 Ports 0/1, Controller 3 Ports 0/1]
Additional disks not shown; Disk 0 is the boot drive, Disk 1 the install source?
File Layout
[Diagram: on each data disk (Disks 2-7), Partition 0 holds one file (Files 1-6) of the file group for the big table, Partition 1 holds one file of the small file group for all other tables, Partition 2 holds a tempdb file, and Partition 4 holds a backup-and-load file]
Each file group is distributed across all data disks.
Log disks not shown; tempdb shares a common pool with data.
File Groups and Files
Dedicated File Group for largest table
Never defragment
One file group for all other regular tables
Load file group?
Rebuild indexes to different file group
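A minimal sketch of this layout (database, file names, drive letters, and sizes are illustrative):

  ALTER DATABASE DW ADD FILEGROUP BigTableFG;
  ALTER DATABASE DW ADD FILE
    (NAME = BigTable_1, FILENAME = 'E:\DW\BigTable_1.ndf', SIZE = 100GB),
    (NAME = BigTable_2, FILENAME = 'F:\DW\BigTable_2.ndf', SIZE = 100GB),
    (NAME = BigTable_3, FILENAME = 'G:\DW\BigTable_3.ndf', SIZE = 100GB)
  TO FILEGROUP BigTableFG;
  -- Repeat for the "all other tables" file group, the load file group, and tempdb files,
  -- each spread across the same set of data LUNs.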
Partitioning - Pitfalls
[Diagram: common partitioning strategy – the partition scheme maps Table Partitions 1-6 to File Groups 1-6, each file group on its own disk (Disks 2-7)]
What happens in a table scan?
Read first from Part 1
then 2, then 3, … ?
SQL 2008 HF to read from
each partition in parallel?
What if partitions have
disparate sizes?
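A minimal sketch of the strategy in the diagram (boundary values and file group names are illustrative):

  CREATE PARTITION FUNCTION pfOrderKey (bigint)
    AS RANGE RIGHT FOR VALUES (10000000, 20000000, 30000000, 40000000, 50000000);
  CREATE PARTITION SCHEME psOrderKey
    AS PARTITION pfOrderKey TO (FG1, FG2, FG3, FG4, FG5, FG6);
  -- Each partition, and therefore each file group / disk set, is touched in turn by a scan
  -- unless the partitions are read in parallel.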
Parallel Execution Plans
Joe Chang
jchang6@yahoo.com
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
So you bought a 64+ core box
Now
Learn all about Parallel Execution
All guns (cores) blazing
Negative scaling – yes, this can happen; how will you know?
Super-scaling – no, I have not been smoking pot
High degree of parallelism & small SQL
Anomalies, execution plan changes, etc.
Compression – how much CPU do I pay for this?
Partitioning – great management tool; what else?
Parallel Execution Plans
Reference: Adam Machanic PASS
Execution Plan Quickie
I/O and CPU Cost components
F4
Estimated Execution
Plan
Cost is duration in seconds on some reference platform
IO Cost for scan: 1 = 10,800KB/s,
810 implies 8,748,000KB
IO in Nested Loops Join: 1 = 320/s, multiple of 0.003125
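A quick check of these cost units against the scan example that follows: 8,748,000KB / 10,800KB per cost unit = 810 (the scan IO cost), and 1 / 320 = 0.003125 (the per-row IO increment for the loop-join key lookup).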
Index + Key Lookup - Scan
Actual                  Key Lookup   Scan
CPU                     1919         8736
Time (data in memory)   1919         8727

(926.67 - 323655 x 0.0001581) / 0.003125 = 280,160 rows (86.6%)
True cross-over approx 1,400,000 rows at 1 row per page
1,093,729 pages / 1350 = 810.17 (8,748MB)
Index + Key Lookup - Scan
Actual   Key Lookup   Scan
CPU      2138         18622
Time     321          658

8,748,000KB / 8 / 1350 = 810
(817 - 280326 x 0.0001581) / 0.003125 = 247,259 (88%)
Actual Execution Plan
Estimated
Actual
Note Actual Number
of Rows, Rebinds,
Rewinds
Actual
Estimated
Row Count and Executions
Outer
Inner
Source
For Loop Join inner source and Key Lookup,
Actual Num Rows = Num of Exec × Num of Rows
Parallel Plans
Parallelism Operations
Distribute Streams
Non-parallel source, parallel destination
Repartition Streams
Parallel source and destination
Gather Streams
Destination is non-parallel
Parallel Execution Plans
Note: gold circle with double arrow,
and parallelism operations
Parallel Scan (and Index Seek)
[Plan cost comparison at DOP 1, 2, 4, 8: CPU cost drops roughly 2X, 4X, 8X]
IO cost is the same; CPU cost is reduced by the degree of parallelism, except there is no further reduction at DOP 16.
IO contributes most of the cost!
Parallel Scan 2
At DOP 16, the Hash Match Aggregate CPU cost only reduces by 2X.
Parallel Scan: IO cost is the same; CPU cost is reduced in proportion to the degree of parallelism, with the last 2X excluded?
On a weak storage system, a single thread can saturate the IO channel,
Additional threads will not increase IO (reduce IO duration).
A very powerful storage system can provide IO proportional to the number
of threads. It might be nice if this were an optimizer option?
The IO component can be a very large portion of the overall plan cost
Not reducing IO cost in parallel plan may inhibit generating favorable plan,
i.e., not sufficient to offset the contribution from the Parallelism operations.
A parallel execution plan is more likely on larger systems (-P to fake it?)
Actual Execution Plan - Parallel
More Parallel Plan Details
Parallel Plan - Actual
Parallelism – Hash Joins
Hash Join Cost
DOP 4
DOP 1
Search: Understanding Hash Joins
For In-memory, Grace, Recursive
DOP 2
DOP 8
Hash Join Cost
CPU Cost is linear with number of rows, outer and inner source
See BOL on Hash Joins for In-Memory, Grace, Recursive
IO Cost is zero for small intermediate data size,
beyond set point proportional to server memory(?)
IO is proportional to excess data (beyond in-memory limit)
Parallel Plan: Memory allocation is per thread!
Summary: Hash Join plan cost depends on memory if the IO
component is not zero, in which case it is disproportionately
lower with parallel plans. Does not reflect real cost?
Parallelism Repartition Streams
DOP 2
DOP 4
DOP 8
Bitmap
BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering
A bitmap filter uses a compact representation of a set of values from a table in one
part of the operator tree to filter rows from a second table in another part of the
tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in
the second table that qualify for the join to the first table are processed.
SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering
speeds up query execution by eliminating rows with key values that cannot produce any join records before
passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact
representation of a set of values from a table in one part of the operator tree to filter rows from a second table in
another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer
rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap
is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing
Data Warehouse Query Performance Through Bitmap Filtering.
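A join shape where a parallel plan may introduce a Bitmap operator on the Line Item side (a sketch only; whether the optimizer adds the filter depends on its selectivity estimates):

  SELECT SUM(L.L_EXTENDEDPRICE * (1 - L.L_DISCOUNT))
  FROM LINEITEM L
  JOIN PART P ON P.P_PARTKEY = L.L_PARTKEY
  WHERE P.P_BRAND = 'Brand#23';   -- selective filter on the small table feeds the bitmap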
Parallel Execution Plan Summary
Queries with high IO cost may show little
plan cost reduction on parallel execution
Plans with high portion hash or sort cost
show large parallel plan cost reduction
Parallel plans may be inhibited by high row
count in Parallelism Repartition Streams
Watch out for (Parallel) Merge Joins!
Scaling Theory
Parallel Execution Strategy
Partition work into little pieces
Ensures each thread has same amount
High overhead to coordinate
Partition into big pieces
May have uneven distribution between threads
Small table join to big table
Thread for each row from small table
Partitioned table options
What Should Scale?
Trivially parallelizable: 1) split a large chunk of work among threads, 2) each thread works independently, 3) a small amount of coordination consolidates the threads' results.
More difficult: 1) split a large chunk of work among threads, 2) each thread works on the first stage, 3) a large coordination effort between threads, 4) more work, ..., consolidate.
Partitioned Tables
No Repartition Streams
Regular Table
Partitioned Tables
No Repartition Streams operations!
Scaling Reality
8-way Quad-Core Opteron
Windows Server 2008 R2
SQL Server 2008 SP1 + HF 27
Test Queries
TPC-H SF 10 database
Standard, Compressed, Partitioned (30)
Line Item Table SUM, 59M rows, 8.75GB
Orders Table 15M rows
[Chart: CPU-sec to SUM 1 or 2 columns of Line Item versus DOP 1-32, standard table]
[Chart: CPU-sec to SUM 1 or 2 columns of Line Item versus DOP 1-32, compressed table]
Speed Up
[Chart: speedup relative to DOP 1 versus DOP 1-32, standard table – Sum 1, Sum 2, S2 Group, S2 Join]
[Chart: speedup relative to DOP 1 versus DOP 1-32, compressed table – same series]
Line Item sum 1 column
[Chart: CPU-sec versus DOP 1-32 – Sum 1 standard, compressed, partitioned]
[Chart: speedup relative to DOP 1 versus DOP 1-32 – Sum 1 standard, compressed, partitioned]
Line Item Sum w/Group By
[Chart: CPU-sec versus DOP 1-32 – Group By standard, compressed, hash]
[Chart: speedup relative to DOP 1 versus DOP 1-32 – Group By standard, compressed, hash]
Hash Join
[Chart: CPU-sec versus DOP 1-32 – Hash Join standard, compressed, partitioned]
[Chart: speedup relative to DOP 1 versus DOP 1-32 – Hash Join standard, compressed, partitioned]
Key Lookup and Table Scan
[Chart: CPU-sec versus DOP 1-32 for 1.4M rows – Key Lookup standard, Key Lookup compressed, Table Scan uncompressed, Table Scan compressed]
[Chart: speedup relative to DOP 1 versus DOP 1-32 – same four series]
Parallel Execution Summary
Contention in queries w/low cost per page
Simple scan,
High Cost per Page – improves scaling!
Multiple Aggregates, Hash Join, Compression
Table Partitioning –
alternative query plans
Loop Joins – broken at high DOP
Merge Join – seriously broken (parallel)
Scaling DW Summary
Massive IO bandwidth
Parallel options for data load, updates etc
Investigate Parallel Execution Plans
Scaling from DOP 1, 2, 4, 8, 16, 32 etc
Scaling with and w/o HT
Strategy for limiting DOP with multiple
users
Fixes from Microsoft Needed
Contention issues in parallel execution
Table scan, Nested Loops
Better plan cost model for scaling
Back-off on parallelism if gain is negligible
Fix throughput degradation with multiple
users running big DW queries
Sybase and Oracle, Throughput is close to
Power or better
Test Systems
Test Systems
2-way quad-core Xeon 5430 2.66GHz
Windows Server 2008 R2, SQL 2008 R2
8-way dual-core Opteron 2.8GHz
Windows Server 2008 SP1, SQL 2008 SP1
8-way quad-core Opteron 2.7GHz Barcelona
Windows Server 2008 R2, SQL 2008 SP1
Build 2789
8-way systems were configured for AD- not good!
Test Methodology
Boot with all processors
Run queries at MAXDOP 1, 2, 4, 8, etc
Not the same as running on 1-way, 2-way,
4-way server
Interpret results with caution
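A sketch of the measurement loop (the aggregate is a simplified stand-in for the Line Item tests):

  SET STATISTICS TIME ON;
  SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM OPTION (MAXDOP 1);
  SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM OPTION (MAXDOP 2);
  SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM OPTION (MAXDOP 4);
  -- ... repeat through MAXDOP 8, 16, 24, 32, recording CPU and elapsed time for each run.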
References
Search Adam Machanic PASS
SQL Server Scaling on Big Iron
(NUMA) Systems TPC-H
Joe Chang
jchang6@yahoo.com
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
TPC-H
TPC-H
DSS – 22 queries, geometric mean
60X range plan cost, comparable actual range
Power – single stream
Tests ability to scale parallel execution plans
Throughput – multiple streams
Scale Factor 1 – Line item data is 1GB
875MB with DATE instead of DATETIME
Only single column indexes allowed, Ad-hoc
Observed Scaling Behaviors
Good scaling, leveling off at high DOP
Perfect Scaling ???
Super Scaling
Negative Scaling
especially at high DOP
Execution Plan change
Completely different behavior
TPC-H Published Results
TPC-H SF 100GB
2-way Xeon 5355, 5570, 5680, Opt 6176
[Chart: Power, Throughput, and QphH – Xeon 5355, 5570 HDD/SSD/Fusion-IO, 5680 SSD, Opteron 6176]
Among the 2-way Xeon 5570 results all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power.
Westmere and Magny-Cours, both with 192GB memory, are very close.
TPC-H SF 300GB
8x QC/6C & 4x12C Opteron
[Chart: Power, Throughput, and QphH – Opteron 8360 4C, 8384 4C, 8439 6C, 6176 12C, Xeon 7560 8C]
6C Istanbul improved over 4C Shanghai by 45% in Power, 73% in Throughput, 59% overall.
4 x 12C 2.3GHz improved 17% over 8 x 6C 2.8GHz.
TPC-H SF 1000
[Chart: Power, Throughput, and QphH – Opteron 8439 SQL Server, Opteron 8439 Sybase, Superdome, Superdome 2]
TPC-H SF 3TB
X7460 & X7560
[Chart: Power, Throughput, and QphH – 16 x Xeon 7460, 8 x Xeon 7560, POWER6]
Nehalem-EX with 64 cores is better than 96 Core 2 cores.
TPC-H SF 100GB, 300GB & 3TB
[Charts: Power and Throughput at SF100 (2-way systems), SF300 (8 x QC/6C and 4 x 12C Opteron, 4 x Xeon 7560), and SF3TB (16 x X7460, 8 x X7560, 32 x POWER6)]
SF100: Westmere and Magny-Cours are very close. Among the 2-way Xeon 5570 results all are close; HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power.
SF300: 6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall; 4 x 12C 2.3GHz improved 17% over 8 x 6C 2.8GHz.
SF3TB: Nehalem-EX with 64 cores is better than 96 Core 2 cores.
TPC-H Published Results
SQL Server excels in Power
Limited by Geometric mean, anomalies
Trails in Throughput
Other DBMS get better throughput than power
SQL Server throughput below Power
by wide margin
Speculation – SQL Server does not throttle
back parallelism with load?
TPC-H SF100
Processors     GHz   Cores  Mem GB  SQL   SF   Power     Throughput  QphH
2 Xeon 5355    2.66  8      64      5sp2  100  23,378.0  13,381.0    17,686.7
2x5570 HDD     2.93  8      144     8sp1  100  67,712.9  38,019.1    50,738.4
2x5570 SSD     2.93  8      144     8sp1  100  70,048.5  37,749.1    51,422.4
5570 Fusion    2.93  8      144     8sp1  100  72,110.5  36,190.8    51,085.6
2 Xeon 5680    3.33  12     192     8r2   100  99,426.3  55,038.2    73,974.6
2 Opt 6176     2.3   24     192     8r2   100  94,761.5  53,855.6    71,438.3
TPC-H SF300
Processors   GHz   Cores  Mem GB  SQL   SF   Power      Throughput  QphH
4 Opt 8220   2.8   8      128     5rtm  300  25,206.4   13,283.8    18,298.5
8 Opt 8360   2.5   32     256     8rtm  300  67,287.4   41,526.4    52,860.2
8 Opt 8384   2.7   32     256     8rtm  300  75,161.2   44,271.9    57,684.7
8 Opt 8439   2.8   48     256     8sp1  300  109,067.1  76,869.0    91,558.2
4 Opt 6176   2.3   48     512     8r2   300  129,198.3  89,547.7    107,561.2
4 Xeon 7560  2.26  32     640     8r2   300  152,453.1  96,585.4    121,345.6

All of the above are HP results? Sun result for Opt 8384, sp1: Power 67,095.6, Throughput 45,343.5, QphH 55,157.5
TPC-H 1TB
Processors    GHz   Cores  Mem GB  SQL    SF    Power      Throughput  QphH
8 Opt 8439    2.8   48     512     8R2?   1000  95,789.1   69,367.6    81,367.6
8 Opt 8439    2.8   48     384     ASE    1000  108,436.8  96,652.7    102,375.3
Itanium 9350  1.73  64     512     O11R2  1000  139,181.0  141,188.1   140,181.1
TPC-H 3TB
Processors     GHz   Cores  Mem GB  SQL     SF    Power      Throughput  QphH
16 Xeon 7460   2.66  96     1024    8r2     3000  120,254.8  87,841.4    102,254.8
8 Xeon 7560    2.26  64     512     8r2     3000  185,297.7  142,685.6   162,601.7
Itanium 9350   1.73  64     512     Sybase  1000  142,790.7  171,607.4   156,537.3
POWER6         5.0   64     512     Sybase  3000  142,790.7  171,607.4   156,537.3
TPC-H Published Results
Processors   GHz   Cores  Mem GB  SQL   SF    Power      Throughput  QphH
2 Xeon 5355  2.66  8      64      5sp2  100   23,378     13,381      17,686.7
2 Xeon 5570  2.93  8      144     8sp1  100   72,110.5   36,190.8    51,085.6
2 Xeon 5680  3.33  12     192     8r2   100   99,426.3   55,038.2    73,974.6
2 Opt 6176   2.3   24     192     8r2   100   94,761.5   53,855.6    71,438.3
4 Opt 8220   2.8   8      128     5rtm  300   25,206.4   13,283.8    18,298.5
8 Opt 8360   2.5   32     256     8rtm  300   67,287.4   41,526.4    52,860.2
8 Opt 8384   2.7   32     256     8rtm  300   75,161.2   44,271.9    57,684.7
8 Opt 8439   2.8   48     256     8sp1  300   109,067.1  76,869.0    91,558.2
4 Opt 6176   2.3   48     512     8r2   300   129,198.3  89,547.7    107,561.2
8 Xeon 7560  2.26  64     512     8r2   3000  185,297.7  142,685.6   162,601.7
SF100 2-way Big Queries (sec)
[Chart: query time in seconds for Q1, Q9, Q13, Q18, Q21 – 5570 HDD, 5570 SSD, 5570 Fusion-IO, 5680 SSD, 6176 SSD]
Xeon 5570 with SATA SSD is poor on Q9, reason unknown.
Both Xeon 5680 and Opteron 6176 are a big improvement over Xeon 5570.
SF100 Middle Q
[Chart: query time in seconds for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q22 – 5570 HDD, 5570 SSD, 5680 SSD, 6176 SSD, 5570 Fusion-IO]
Xeon 5570 HDD and 5680 SSD are poor on Q12, reason unknown.
Opteron 6176 is poor on Q11.
SF100 Small Queries
[Chart: query time in seconds for Q2, Q4, Q6, Q14, Q15, Q17, Q19, Q20 – 5570 HDD, 5570 SSD, 5680 SSD, 6176 SSD, 5570 Fusion-IO]
Xeon 5680 and Opteron are poor on Q20.
Note limited scaling on Q2 and Q17.
SF300 32+ cores Big Queries
[Chart: query time in seconds for Q1, Q9, Q13, Q18, Q21 – 8 x 8360 QC 2M, 8 x 8384 QC 6M, 8 x 8439 6C, 4 x 6176 12C, 4 x 7560 8C]
Opteron 6176 is poor relative to 8439 on Q9 and Q13, with the same number of total cores.
SF300 Middle Q
[Chart: query time in seconds for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q19, Q20, Q22 – same five systems]
Opteron 6176 is much better than 8439 on Q11 and Q19, worse on Q12.
SF300 Small Q
[Chart: query time in seconds for Q2, Q4, Q6, Q14, Q15, Q17 – same five systems]
Opteron 6176 is much better on Q2, and even with 8439 on the others.
SF1000
[Chart: SF1000 per-query comparison across Q1-Q22]
SF1000
[Chart: SF1000 query time in seconds for Q1, Q9, Q13, Q18, Q21 – SQL Server vs Sybase]
SF1000
[Chart: SF1000 query time in seconds for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q17, Q19 – SQL Server vs Sybase]
SF1000
[Chart: SF1000 query time in seconds for Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22 – SQL Server vs Sybase]
SF1000 Itanium - Superdome
[Chart: SF1000 Itanium Superdome per-query comparison across Q1-Q22]
SF 3TB – 8×7560 versus 16×7460
[Chart: per-query time ratio, 8 x 7560 relative to 16 x 7460, across Q1-Q22; one query reaches 5.6X]
Broadly 50% faster overall, 5X+ on one, slower on 2, comparable on 3
64 cores, 7560 relative to PWR6
[Chart: per-query time ratio of 8 x 7560 (64 cores) relative to POWER6 across Q1-Q22]
[Charts: query time in seconds for big, middle, and small queries – Unisys 16x6 (X7460), DL980 8x8 (X7560), POWER6]
TPC-H Summary
Scaling is impressive on some SQL
Limited ability (value) is scaling small Q
Anomalies, negative scaling
TPC-H Queries
Q1 Pricing Summary Report
Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q3
Q6 Forecasting Revenue Change
Q7 Volume Shipping
Q8 National Market Share
Q9 Product Type Profit Measure
Q11 Important Stock Identification
Non-Parallel
Parallel
Q12 Random IO?
Q13 Why does Q13 have perfect scaling?
Q17 Small Quantity Order Revenue
Q18 Large Volume Customer
Non-Parallel
Parallel
Q19
Q20?
Date functions are usually written as
because Line Item date columns are “date” type
CAST helps DOP 1 plan, but get bad plan for parallel
This query may get a poor execution plan
Q21 Suppliers Who Kept Orders Waiting
Note 3 references to Line Item
Q22
TPC-H Studies
Joe Chang
jchang6@yahoo.com
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring,
Profiler/Trace aggregation
TPC-H
TPC-H
DSS – 22 queries, geometric mean
60X range plan cost, comparable actual range
Power – single stream
Tests ability to scale parallel execution plans
Throughput – multiple streams
Scale Factor 1 – Line item data is 1GB
875MB with DATE instead of DATETIME
Only single column indexes allowed, Ad-hoc
SF 10, test studies
Not valid for publication
Auto-Statistics enabled,
Excludes compile time
Big Queries – Line Item Scan
Super Scaling – Mission Impossible
Small Queries & High Parallelism
Other queries, negative scaling
Did not apply T2301, or disallow page locks
Big Q: Plan Cost vs Actual
[Chart: plan cost at 10GB for Q1, Q9, Q13, Q18, Q21 at DOP 1, 2, 4, 8, 16]
Plan cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20%.
[Chart: actual query time in seconds for Q1, Q9, Q13, Q18, Q21 at DOP 1-32]
Memory affects the onset of Hash IO cost.
Plan cost says scaling is poor except for Q18; Q18 and Q21 are more than 3X Q1 and Q9.
Plan cost is a poor indicator of true parallelism scaling.
Big Query: Speed Up and CPU
[Chart: speedup relative to DOP 1 ("the holy grail") for Q1, Q9, Q13, Q18, Q21 at DOP 1-32]
[Chart: CPU time in seconds for Q1, Q9, Q13, Q18, Q21 at DOP 1-32]
Q13 has slightly better than perfect scaling?
In general, excellent scaling to DOP 8-24, weak afterwards.
Super Scaling
Suppose at DOP 1, a query runs for 100
seconds, with one CPU fully pegged
CPU time = 100 sec, elapsed time = 100 sec
What is best case for DOP 2?
Assuming nearly zero Repartition Threads cost
CPU time = 100 sec, elapsed time = 50?
Super Scaling: CPU time decreases going from
Non-Parallel to Parallel plan!
No, I have not started drinking, yet
Super Scaling
[Chart: CPU time normalized to DOP 1 for Q7, Q8, Q11, Q21, Q22 at DOP 1-32]
CPU-sec goes down from DOP 1 to DOP 2 and higher (typically through DOP 8).
[Chart: speedup relative to DOP 1 for Q7, Q8, Q11, Q21, Q22 at DOP 1-32]
3.5X speedup from DOP 1 to DOP 2 (normalized to DOP 1).
[Charts: CPU time and query time in seconds for Q7, Q8, Q11, Q21, Q22 at DOP 1-32]
Super Scaling Summary
Most probable cause
Bitmap Operator in Parallel Plan
Bitmap Filters are great,
Question for Microsoft:
Can I use Bitmap Filters in OLTP systems with
non-parallel plans?
Small Queries – Plan Cost vs Act
[Chart: plan cost for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1-16]
Queries 3 and 16 have lower plan cost than Q17, but are not included.
[Chart: query time for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1-32]
Q4, Q6, and Q17 scale well to DOP 4, then weakly; negative scaling also occurs.
Small Queries CPU & Speedup
[Chart: CPU time for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1-32]
What did I get for all that extra CPU? Interpretation: a sharp jump in CPU means poor scaling; a disproportionate jump means negative scaling.
[Chart: speedup for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1-32]
Query 2 goes negative at DOP 2, Q4 is good, Q6 gets a speedup but at a CPU premium, Q17 and Q20 go negative after DOP 8.
High Parallelism – Small Queries
Why? Almost no value – though sometimes you do get lucky.
TPC-H geometric mean scoring: small queries have as much impact as large ones (a linear sum would weight toward the large queries).
OLTP with 32, 64+ cores
Parallelism good if super-scaling
Default max degree of parallelism 0
Seriously bad news, especially for small Q
Increase cost threshold for parallelism?
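An illustrative way to cap instance-wide parallelism (the values shown are examples, not recommendations):

  EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
  EXEC sp_configure 'max degree of parallelism', 8;       -- instead of the default 0 (all cores)
  EXEC sp_configure 'cost threshold for parallelism', 25; -- keep small queries on serial plans
  RECONFIGURE;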
Q that go Negative
[Chart: query time for Q17, Q19, Q20, Q22 at DOP 1-32]
[Chart: "speedup" for Q17, Q19, Q20, Q22 at DOP 1-32]
[Chart: CPU for Q17, Q19, Q20, Q22 at DOP 1-32]
Other Queries – CPU & Speedup
[Chart: CPU time for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1-32]
[Chart: speedup for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1-32]
Q3 has problems beyond DOP 2.
Other - Query Time seconds
[Chart: query time in seconds for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1-32]
Scaling Summary
Some queries show excellent scaling
Super-scaling, better than 2X
Sharp CPU jump on last DOP doubling
Need strategy to cap DOP
To limit negative scaling
Especially for some smaller queries?
Other anomalies
Compression
PAGE
Compression Overhead - Overall
[Chart: overall query time, compressed relative to uncompressed, at DOP 1-32]
Roughly 40% overhead for compression at low DOP, 10% overhead at max DOP???
[Chart: overall CPU time, compressed relative to uncompressed, at DOP 1-32]
[Chart: per-query (Q1-Q22) query time, compressed relative to uncompressed, at DOP 1-32]
[Chart: per-query (Q1-Q22) CPU time, compressed relative to uncompressed, at DOP 1-32]
Compressed Table
LINEITEM – real data may be more compressible
Uncompressed: 8,749,760KB, Average Bytes per row: 149
Compressed: 4,819,592KB, Average Bytes per row: 82
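A minimal sketch of estimating and then applying PAGE compression (schema and table names as in the test database; the NULLs mean all indexes and partitions):

  EXEC sp_estimate_data_compression_savings 'dbo', 'LINEITEM', NULL, NULL, 'PAGE';
  ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);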
Partitioning
Orders and Line Item on Order Key
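A sketch of aligned partitioning of both tables on the order key (boundary values are illustrative; the tests used 30 partitions):

  CREATE PARTITION FUNCTION pfOrders (bigint)
    AS RANGE RIGHT FOR VALUES (2000000, 4000000 /* ... 29 boundaries for 30 partitions */);
  CREATE PARTITION SCHEME psOrders AS PARTITION pfOrders ALL TO ([PRIMARY]);
  CREATE CLUSTERED INDEX CIX_Orders   ON ORDERS   (O_ORDERKEY) ON psOrders (O_ORDERKEY);
  CREATE CLUSTERED INDEX CIX_LineItem ON LINEITEM (L_ORDERKEY) ON psOrders (L_ORDERKEY);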
Partitioning Impact - Overall
[Chart: overall query time, partitioned relative to not partitioned, at DOP 1-32]
[Chart: overall CPU time, partitioned relative to not partitioned, at DOP 1-32]
[Chart: per-query (Q1-Q22) query time, partitioned relative to not partitioned, at DOP 1-32]
[Chart: per-query (Q1-Q22) CPU time, partitioned relative to not partitioned, at DOP 1-32]
Plan for Partitioned Tables
Scaling DW Summary
Massive IO bandwidth
Parallel options for data load, updates etc
Investigate Parallel Execution Plans
Scaling from DOP 1, 2, 4, 8, 16, 32 etc
Scaling with and w/o HT
Strategy for limiting DOP with multiple
users
Fixes from Microsoft Needed
Contention issues in parallel execution
Table scan, Nested Loops
Better plan cost model for scaling
Back-off on parallelism if gain is negligible
Fix throughput degradation with multiple
users running big DW queries
Sybase and Oracle, Throughput is close to
Power or better
Query Plans
Big Queries
Q1 Pricing Summary Report
Q1 Plan
Non-Parallel
Parallel plan 28% lower
than scalar, IO is 70%,
no parallel plan cost
reduction
Parallel
Q9 Product Type Profit Measure
Non-Parallel
Parallel
IO from 4 tables contribute
58% of plan cost, parallel
plan is 39% lower
Q9 Non-Parallel Plan
Table/Index Scans comprise 64%, IO
from 4 tables contribute 58% of plan
cost
Join sequence: Supplier, (Part,
PartSupp), Line Item, Orders
Q9 Parallel Plan
Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders
Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp
Q9 Non-Parallel Plan details
Table Scans comprise 64%,
IO from 4 tables contribute
58% of plan cost
Q9 Parallel reg vs Partitioned
Q13 Why does Q13 have perfect scaling?
Q18 Large Volume Customer
Non-Parallel
Parallel
Q18 Graphical Plan
Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan
Q18 Plan Details
Non-Parallel
Parallel
Non-Parallel Plan Hash Match cost is 1245 IO, 494.6 CPU
DOP 16/32: size is below IO threshold, CPU reduced by >10X
Q21 Suppliers Who Kept Orders Waiting
Non-Parallel
Note 3 references to Line Item
Parallel
Q21 Non-Parallel Plan
H3
H2
H3
H1
H2
H1
Q21 Parallel
Q21
3 full Line Item clustered index scans
Plan cost is approx 3X Q1, single “scan”
Super Scaling
Q7 Volume Shipping
Non-Parallel
Parallel
Q7 Non-Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q7 Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q8 National Market Share
Non-Parallel
Parallel
Q8 Non-Parallel Plan
Join sequence: Part, Line Item, Orders, Customer
Q8 Parallel Plan
Join sequence: Part, Line Item,
Orders, Customer
Q11 Important Stock Identification
Non-Parallel
Parallel
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Small Queries
Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q2
Clustered Index Scan on Part and PartSupp have highest cost (48%+42%)
Q2
PartSupp is now Index Scan + Key Lookup
Q6 Forecasting Revenue Change
Not sure why this blows CPU
Scalar values are pre-computed, pre-converted
Q20?
Date functions are usually written as
because Line Item date columns are “date” type
CAST helps DOP 1 plan, but get bad plan for parallel
This query may get a poor execution plan
Q20
Q20
Q20 alternate - parallel
Statistics estimation
error here
Penalty for mistake
applied here
Other Queries
Q3
Q3
Q12 Random IO?
Will this generate random IO?
Query 12 Plans
Parallel
Non-Parallel
Queries that go Negative
Q17 Small Quantity Order Revenue
Q17
Table Spool is concern
Q17
the usual suspects
Q19
Q19
Q22
Q22
[Chart: speedup from the DOP 1 query time for Q1-Q22 and total, at DOP 2-32]
[Chart: CPU relative to DOP 1 for Q1-Q22 and total, at DOP 2-32]