System Architecture: Big Iron (NUMA)
Joe Chang
jchang6@yahoo.com
www.qdpma.com

About Joe Chang
SQL Server execution plan cost model: true cost structure by system architecture.
Decoding statblob (distribution statistics); SQL Clone – statistics-only database.
Tools: ExecStats – cross-references index use by SQL/execution plan; performance monitoring, Profiler/Trace aggregation.

Scaling SQL on NUMA – Topics
OLTP – Thomas Kejser session "Designing High Scale OLTP Systems".
Data Warehouse.
Ongoing database development.
Bulk Load – SQL CAT paper + TK session "The Data Loading Performance Guide".
Other sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6)
Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the field: High Performance Storage, John Langford
SQL Server Storage – 1000GB Level, Brent Ozar

Server Systems and Architecture

Symmetric Multi-Processing (SMP)
[Diagram: four CPUs on a shared system bus to the MCH, with PXH and ICH below]
In SMP, processors are not dedicated to specific tasks (as in ASMP); there is a single OS image, and each processor can access all memory.
SMP makes no reference to memory architecture?
Not to be confused with Simultaneous Multi-Threading (SMT). Intel calls SMT Hyper-Threading (HT), which in turn is not to be confused with AMD Hyper-Transport (also HT).

Non-Uniform Memory Access (NUMA)
[Diagram: four nodes, each with CPUs, a memory controller, and a node controller, connected by a shared bus or crossbar]
NUMA architecture – the path to memory is not uniform:
1) Node: processors and memory, with separate or combined memory + node controllers.
2) Nodes connected by shared bus, crossbar, or ring.
Traditionally 8-way and larger systems.
Local memory latency ~150ns, remote node memory ~300-400ns; this can cause erratic behavior if the OS/code is not NUMA-aware.

AMD Opteron
[Diagram: Opterons connected by Hyper-Transport, with HT2100/HT1100 IO chips]
Technically, Opteron is NUMA, but remote node memory latency is low, with no negative impact or erratic behavior. For practical purposes it behaves like an SMP system.
Local memory latency ~50ns, 1 hop ~100ns, two hops ~150ns?
Actual behavior is more complicated because of snooping (cache coherency traffic).

8-way Opteron System Architecture
[Diagram: CPUs 0-7 connected in a ladder topology]
Opteron processors (prior to Magny-Cours) have 3 Hyper-Transport links. In the 8-way layout, the top and bottom right processors use 2 HT links to connect to other processors and the 3rd HT link for IO; CPUs 1 and 7 require 3 hops to reach each other.
http://www.techpowerup.com/img/09-08-26/17d.jpg

Nehalem System Architecture
Intel Nehalem-generation processors have Quick Path Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4. An 8-way glue-less system is possible.

NUMA Local and Remote Memory
Local memory is closer than remote memory, so the physical access time is shorter.
But what is the actual access time once the cache coherency requirement is included?
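A minimal T-SQL sketch (assuming SQL Server 2008 or later) of how to see the NUMA layout SQL Server has detected, which is a useful sanity check before reasoning about local versus remote memory; only a few columns I am confident exist are selected, the rest is left to exploration.

-- Sketch: inspect the SQLOS view of the NUMA nodes.
-- sys.dm_os_nodes has one row per SQLOS (NUMA) node; the highest node is the hidden DAC node.
SELECT node_id, node_state_desc, online_scheduler_count
FROM sys.dm_os_nodes;

-- sys.dm_os_memory_nodes shows memory allocation per memory node;
-- SELECT * is used here rather than guessing at the full column list.
SELECT * FROM sys.dm_os_memory_nodes;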
HT Assist – Probe Filter
Part of the L3 cache is used as a directory cache. (Source: ZDNET)

Source Snoop Coherency
From the HP PREMA Architecture whitepaper: all reads result in snoops to all other caches; the memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line.

DL980 G7
From the HP PREMA Architecture whitepaper: each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, and reduces latency to local memory. (*only cache tags, not cache data)

HP ProLiant DL980 Architecture
Node controllers reduce effective memory latency.

Superdome 2 – Itanium, sx3000
Agent: remote ownership tags + L4 cache tags; 64MB eDRAM L4 cache data.

IBM x3850 X5 (glue-less)
Connects two 4-socket nodes to make an 8-way system.

OS Memory Models
SUMA: Sufficiently Uniform Memory Access – memory is interleaved across nodes, so consecutive memory lines rotate Node 0, 1, 2, 3, ...
NUMA: memory is first interleaved within a node, then spanned across nodes, so each node presents a contiguous stripe.
[Diagram: 32 memory lines laid out across 4 nodes under the SUMA and NUMA models]

Windows OS NUMA Support – Memory Models
SUMA – Sufficiently Uniform Memory Access: memory is striped across NUMA nodes.
NUMA – separate memory pools by node.
[Diagram: the same 4-node interleave patterns as above]

Memory Model Example: 4 Nodes
SUMA memory model: memory access is uniformly distributed – 25% of memory accesses are local, 75% remote.
NUMA memory model: the goal is better than 25% local-node access.
True local access time also needs to be faster; cache coherency may increase local access time.

Architecting for NUMA – End-to-End Affinity
App server groups (North East, Mid Atlantic, South East, Central, Texas, Mountain, California, Pacific NW) map to TCP ports 1440-1447, which map to CPU nodes 0-7, which map to memory and table partitions (0-0/0-1 NE, 1-0/1-1 MidA, 2-0/2-1 SE, 3-0/3-1 Cen, 4-0/4-1 Tex, 5-0/5-1 Mnt, 6-0/6-1 Cal, 7-0/7-1 PNW).
The web/app tier determines the port for each user by group (but the grouping should not be by geography!).
Affinitize each port to a NUMA node so that each node accesses localized data (a partition?).
The OS may still allocate a substantial chunk of memory from Node 0?
(A hedged configuration sketch follows below.)
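A sketch of the affinity plumbing behind the end-to-end slide, not from the deck itself: the server-level affinity statement is SQL Server 2008 R2 syntax, and the bracketed port-to-node mask format in the comments is from memory and should be verified against the "map TCP/IP ports to NUMA nodes" documentation.

-- 1) Pin SQL Server schedulers to the NUMA nodes explicitly (SQL Server 2008 R2+).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0 TO 7;

-- 2) Map TCP ports to NUMA nodes. This is configured outside T-SQL, in SQL Server
--    Configuration Manager (TCP/IP properties): append a node-affinity mask in
--    brackets to each port, e.g.
--      1440[0x1],1441[0x2],1442[0x4], ... ,1447[0x80]
--    so that connections arriving on port 1441 are handled by node 1, and so on.
--    The exact bracket syntax is an assumption here; verify before relying on it.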
HP-UX LORA
HP-UX (not Microsoft Windows) Locality-Optimized Resource Alignment:
12.5% interleaved memory, 87.5% NUMA-node local memory.

System Tech Specs
Base system        Cores/socket  DIMMs  PCI-E G2 slots  Total cores  Max memory  Price
2 x Xeon X56x0     6             18     5 x8 + 1 x4     12           192GB*      $7K
4 x Opteron 6100   12            32     5 x8 + 1 x4     48           512GB       $14K
4 x Xeon X7560     8             64     4 x8 + 6 x4†    32           1TB         $30K
8 x Xeon X7560     8             128    9 x8 + 5 x4‡    64           2TB         $100K
Memory pricing: 8GB ~$400 each (18 x 8GB = 144GB, ~$7,200); 16GB ~$1,100 each (12 x 16GB = 192GB, ~$13K); 64 x 8GB = 512GB, ~$26K; 64 x 16GB = 1TB, ~$70K.
* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB.
† Dell R910 and HP DL580G7 have different PCI-E layouts.
‡ ProLiant DL980G7 can have 3 IOHs for additional PCI-E slots.

Software Stack – Operating System
Windows Server 2003 RTM, SP1: network limitations (by default) impact OLTP; Scalable Networking Pack (KB 912222).
Windows Server 2008.
Windows Server 2008 R2 (64-bit only): breaks the 64-logical-processor limit. Search: MSI-X. NUMA IO enhancements?
Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server. Don't try to do DW on SQL Server 2000.

SQL Server Version
SQL Server 2000: serious disk IO limitations (1GB/sec?), problematic parallel execution plans.
SQL Server 2005: fixed most SQL 2000 problems; 64-bit on x64 (Opteron and Xeon); SP2 – performance improvement ~10%(?).
SQL Server 2008 & R2: compression, filtered indexes, etc.; star join, parallel query to partitioned tables.

Configuration
SQL Server startup parameter: -E.
Trace flags 834, 836, 2301.
Auto date correlation: Order date < A implies Ship date > A; implied: Order date > A-C, Ship date < A+C.
Port affinity – mostly OLTP. Dedicated processor(?) for the log writer(?)
(A hedged sketch of these settings follows at the end of this section.)

Storage Performance for Data Warehousing
Joe Chang, jchang6@yahoo.com, www.qdpma.com

About Joe Chang (bio slide repeated from the first section.)

Storage Organization Structure
In many large IT departments, DB and storage are in separate groups, and storage usually has its own objectives: bring all storage into one big system under full management (read: control); storage as a service, in the cloud; one size fits all needs.
They usually have zero DB knowledge: "Of course we do high bandwidth; is 600MB/sec good enough for you?"

Data Warehouse Storage
OLTP – throughput with fast response. DW – flood the queues for maximum throughput.
Do not use shared storage for a data warehouse! Storage system vendors like to give the impression that the SAN is a magical, immensely powerful box that can meet all your needs: just tell us how much capacity you need and don't worry about anything else.
My advice: stay away from shared storage controlled by a different team.

Nominal and Net Bandwidth
PCI-E Gen 2 – 5 Gbit/s signaling: x8 = 5GB/s nominal, ~4GB/s net; x4 = ~2GB/s net.
SAS 6Gbit/s – one x4 port: 3GB/s nominal, ~2.2GB/s net?
Fibre Channel 8 Gbit/s nominal: ~780MB/s point-to-point, ~680MB/s from host through the SAN to the back-end loop.
SAS RAID controller, x8 PCI-E G2 with 2 x4 6G ports: ~2.8GB/s. Depends on the controller, and will change!

Storage – SAS Direct-Attach
[Diagram: several RAID controllers in PCI-E x8 slots, each with two x4 SAS ports fanning out to very many disks – many fat pipes]
Option A: 24 disks in one enclosure for each x4 SAS port, two x4 SAS ports per controller.
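A hedged sketch of the items named on the Configuration slide, assuming SQL Server 2008; the database name is hypothetical and each flag should be verified for the specific build before production use.

-- Startup parameters (set via Configuration Manager / service startup options):
--   -E                    : the startup parameter named on the slide (larger extent allocation per file)
--   -T834 -T836 -T2301    : the slide's trace flags, applied as startup flags
-- TF 2301 (advanced decision-support optimizations) can also be enabled for testing at runtime:
DBCC TRACEON (2301, -1);

-- Date correlation between Orders/LineItem-style date columns, as described on the slide
-- (TPCH is a hypothetical database name):
ALTER DATABASE TPCH SET DATE_CORRELATION_OPTIMIZATION ON;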
Option B: split each enclosure over 2 x4 SAS ports, one controller per enclosure.
Balance by pipe bandwidth, and don't forget fat network pipes (2 x 10GbE).

Storage – FC/SAN
[Diagram: PCI-E x8 Gen 2 slots with quad-port 8Gb FC HBAs, or many x4 slots with dual-port HBAs, feeding the SAN and its 15-disk enclosures]
If a quad-port 8Gb FC HBA is not supported, consider a system with many x4 slots, or consider SAS!
SAN systems typically offer 3.5in 15-disk enclosures; it is difficult to get high spindle count with density.
1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?

Storage – SSD / HDD Hybrid
[Diagram: groups of SSDs spread across many x4 SAS ports, with the remaining bays filled with HDDs]
No RAID with SSD?
Storage enclosures typically have 12 disks per channel and can only support the bandwidth of a few SSDs. Use the remaining bays for extra storage with HDD; there is no point expending valuable SSD space on backups and flat files.
Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes. Multiple DBs – SSD, otherwise there are too many RAID 1 pairs for logs.

SSD
Current: mostly 3Gbps SAS/SATA SSD; some 6Gbps SATA SSD.
Fusion-IO – direct PCI-E Gen 2 interface, 320GB-1.2TB capacity, 200K IOPS, 1.5GB/s.
No RAID? HDD is fundamentally a single point of failure; SSD could be built with redundant components.
HP reported problems with SSD on RAID controllers; Fujitsu did not?

Big DW Storage – iSCSI
Are you nuts?

Storage Configuration – Arrays
Shown: two 12-disk arrays per 24-disk enclosure. Options: between 6 and 16 disks per array.
SAN systems may recommend RAID 10 4+4 or RAID 5 7+1. Very many spindles. Comment on MetaLUN.

Data Consumption Rate: Xeon
TPC-H Query 1 – Lineitem scan; SF1 Lineitem is 1GB (875MB in SQL 2008 with DATE).
Processors               GHz   Cores  Mem GB  SQL   Q1 sec  SF    Total MB/s  MB/s per core
2 Xeon 5355 (Conroe)     2.66  8      64      5sp2  85.4    100   1,165.5     145.7
2 Xeon 5570 (Nehalem)    2.93  8      144     8sp1  42.2    100   2,073.5     259.2
2 Xeon 5680 (Westmere)   3.33  12     192     8r2   21.0    100   4,166.7     347.2
4 Xeon 7560 (Nehalem-EX) 2.26  32     640     8r2   37.2    300   7,056.5     220.5
8 Xeon 7560 (Nehalem-EX) 2.26  64     512     8r2   183.8   3000  14,282      223.2
The data consumption rate is much higher for current-generation Nehalem and Westmere processors than for the Core 2 referenced in the Microsoft FTDW document. TPC-H Q1 is more compute-intensive than the FTDW light query.

Data Consumption Rate: Opteron
TPC-H Query 1 – Lineitem scan; SF1 Lineitem is 1GB (875MB in SQL 2008 with DATE).
Processors               GHz   Cores  Mem GB  SQL   Q1 sec  SF    Total MB/s  MB/s per core
4 Opt 8220               2.8   8      128     5rtm  309.7   300   868.7       121.1
8 Opt 8360 (Barcelona)   2.5   32     256     8rtm  91.4    300   2,872.0     89.7
8 Opt 8384 (Shanghai)    2.7   32     256     8rtm  72.5    300   3,620.7     113.2
8 Opt 8439 (Istanbul)    2.8   48     256     8sp1  49.0    300   5,357.1     111.6
8 Opt 8439 (Istanbul)    2.8   48     512     8rtm  166.9   1000  5,242.7     109.2
2 Opt 6176 (Magny-Cours) 2.3   24     192     8r2   20.2    100   4,331.7     180.5
4 Opt 6176 (Magny-Cours) 2.3   48     512     8r2   31.8    300   8,254.7     172.0
Istanbul was expected to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8GHz for Istanbul) – or is this Windows/SQL 2008 R2?
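The per-core rates in these tables follow directly from the Lineitem size and the Q1 time. A minimal T-SQL sketch of the arithmetic (not part of the deck), using the SQL Server 2008 figure of 875MB per scale factor and the 2 x Xeon 5680 row as the example:

-- Data consumption rate = (SF x Lineitem MB per SF) / Q1 seconds, then divided by cores.
DECLARE @sf int = 100,
        @mb_per_sf numeric(9,1) = 875,
        @q1_sec numeric(9,1) = 21.0,
        @cores int = 12;

SELECT @sf * @mb_per_sf / @q1_sec          AS total_mb_per_sec,     -- ~4,167 MB/s
       @sf * @mb_per_sf / @q1_sec / @cores AS mb_per_sec_per_core;  -- ~347 MB/s per core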
Data Consumption Rate (combined)
TPC-H Query 1 – Lineitem scan; SF1 Lineitem is 1GB (875MB in SQL 2008 with DATE).
Processors               GHz   Cores  Mem GB  SQL   Q1 sec  SF    Total MB/s  MB/s per core
2 Xeon 5355              2.66  8      64      5sp2  85.4    100   1,165.5     145.7
2 Xeon 5570              2.93  8      144     8sp1  42.2    100   2,073.5     259.2
2 Xeon 5680              3.33  12     192     8r2   21.0    100   4,166.7     347.2
2 Opt 6176               2.3   24     192     8r2   20.2    100   4,331.7     180.5
4 Opt 8220               2.8   8      128     5rtm  309.7   300   868.7       121.1
8 Opt 8360 (Barcelona)   2.5   32     256     8rtm  91.4    300   2,872.0     89.7
8 Opt 8384 (Shanghai)    2.7   32     256     8rtm  72.5    300   3,620.7     113.2
8 Opt 8439 (Istanbul)    2.8   48     256     8sp1  49.0    300   5,357.1     111.6
4 Opt 6176 (Magny-Cours) 2.3   48     512     8r2   31.8    300   8,254.7     172.0
8 Xeon 7560              2.26  64     512     8r2   183.8   3000  14,282      223.2

2U disk enclosure: 24 x 73GB 15K 2.5in disks, ~$14K ($600 per disk).

Storage Targets
Processors    Total cores  BW/core MB/s  Target MB/s  PCI-E (x8-x4)  SAS HBAs  Units/disks  Units/disks  Actual bandwidth
2 Xeon X5680  12           350           4,200        5-1            2         2 / 48       4 / 96       5 GB/s
4 Opt 6176    48           175           8,400        5-1            4         4 / 96       8 / 192      10 GB/s
4 Xeon X7560  32           250           8,000        6-4            6         6 / 144      12 / 288     15 GB/s
8 Xeon X7560  64           225           14,400       9-5            11†       10 / 240     20 / 480     26 GB/s
† 8-way: 9 controllers in x8 slots with 24 disks per x4 SAS port, plus 2 controllers in x4 slots with 12 disks.
With 24 15K disks per enclosure: 12 disks per x4 SAS port requires 100MB/sec per disk – possible but not always practical; 24 disks per x4 SAS port requires 50MB/sec – more achievable in practice.
Think: shortest path to the metal (iron oxide).

Your Storage and the Optimizer
Model      Disks  BW (KB/s)   Sequential IOPS  "Random" IOPS  Sequential/Random IO ratio
Optimizer  -      10,800      1,350            320            4.22
SAS 2x4    24     2,800,000   350,000          9,600          36.5
SAS 2x4    48     2,800,000   350,000          19,200         18.2
FC 4G      30     360,000     45,000           12,000         3.75
SSD        8      2,800,000   350,000          280,000        1.25
Assumptions: 2.8GB/sec per SAS 2 x4 adapter (could be 3.2GB/sec per PCI-E G2 x8); HDD 400 IOPS per disk – big-query key lookups, loop join at high queue depth, short-stroked, possibly skip-seek; SSD 35,000 IOPS per device.
The SQL Server query optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio. A DW-configured storage system has an 18-36 ratio; 30 disks per 4G FC port roughly matches the optimizer model, and SSD is off in the other direction.

Data Consumption Rates
[Charts: TPC-H SF100 and SF300 data consumption rates by query (Q1, Q9, Q18, Q21) across the Xeon and Opteron configurations above]

Fast Track Reference Architecture – My Complaints
Several expensive SAN systems (11 disks each); each must be configured independently (scripting?); $1,500-2,000 amortized per disk.
Too many 2-disk arrays, 2 LUNs per array, too many data files.
Build indexes with MAXDOP 1 – is this brain dead?
Designed around 100MB/sec per disk; not all DW work is single-scan or sequential.

Fragmentation
Weak storage system: 1) fragmentation could degrade IO performance; 2) defragmenting a very large table on a weak storage system could render the database marginally to completely non-functional for a very long time.
Powerful storage system: 3) fragmentation has very little impact; 4) defragmenting has mild impact and completes within the nightly window.
What is the correct conclusion?
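Before deciding whether the defragmentation pass the slide warns about is worth its IO cost, it helps to measure fragmentation first. A minimal sketch (not from the deck), using the standard SQL Server 2008 DMF; LIMITED mode keeps the measurement itself cheap:

-- Measure logical fragmentation for the larger objects in the current database.
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       ps.index_id,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
WHERE ps.page_count > 10000          -- ignore small objects
ORDER BY ps.page_count DESC;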
Operating System View of Storage
Table -> File -> Partition -> LUN -> Disk.
[Diagram: OS disk view – Disk 2 (Basic, 396GB, Online) on Controller 1 Port 0, Disk 3 on Controller 1 Port 1, Disk 4 on Controller 2 Port 0, Disk 5 on Controller 2 Port 1, Disk 6 on Controller 3 Port 0, Disk 7 on Controller 3 Port 1; additional disks not shown. Disk 0 is the boot drive, Disk 1 the install source?]

File Layout
Disks 2-7, Partition 0: file group for the big table, files 1-6 – each file group is distributed across all data disks.
Disks 2-7, Partition 1: small file group for all other tables, files 1-6.
Disks 2-7, Partition 2: tempdb, files 1-6.
Disks 2-7, Partition 4: backup and load, files 1-6.
Log disks not shown; tempdb shares a common pool with data.

File Groups and Files
Dedicated file group for the largest table – never defragment.
One file group for all other regular tables.
Load file group?
Rebuild indexes to a different file group.

Partitioning – Pitfalls
Common partitioning strategy: the partition scheme maps partitions to file groups – Disk 2 / File Group 1 / Table Partition 1, Disk 3 / File Group 2 / Table Partition 2, ... Disk 7 / File Group 6 / Table Partition 6. (A hedged sketch of this mapping follows after this section.)
What happens in a table scan? Does it read first from partition 1, then 2, then 3, ...?
Is there a SQL 2008 hotfix to read from each partition in parallel? What if partitions have disparate sizes?

Parallel Execution Plans
Joe Chang, jchang6@yahoo.com, www.qdpma.com

About Joe Chang (bio slide repeated from the first section.)

So You Bought a 64+ Core Box
Now learn all about parallel execution:
All guns (cores) blazing.
Negative scaling – yes, this can happen; how will you know?
Super-scaling – no, I have not been smoking pot.
High degree of parallelism and small SQL – anomalies, execution plan changes, etc.
Compression – how much CPU do I pay for this?
Partitioning – a great management tool; what else?
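A sketch of the partition-to-file-group mapping described in the pitfalls slide, with hypothetical names and boundary values (not from the deck); five boundaries produce six partitions, one per file group:

CREATE PARTITION FUNCTION pf_orderkey (bigint)
    AS RANGE RIGHT FOR VALUES (10000000, 20000000, 30000000, 40000000, 50000000);

CREATE PARTITION SCHEME ps_orderkey
    AS PARTITION pf_orderkey
    TO (FG1, FG2, FG3, FG4, FG5, FG6);   -- six file groups, one per partition

-- A table created on the scheme is partitioned by its leading key value, e.g.
-- CREATE TABLE LINEITEM ( ... ) ON ps_orderkey (L_ORDERKEY);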
Parallel Execution Plans
Reference: Adam Machanic, PASS.

Execution Plan Quickie
IO and CPU cost components (F4 shows operator properties).
Estimated execution plan cost is duration in seconds on some reference platform.
IO cost for a scan: 1.0 = 10,800KB/s, so a cost of 810 implies 8,748,000KB of IO.
IO in a nested loops join: 1.0 = 320 IO/s, i.e. a multiple of 0.003125 per lookup.
(A sketch reproducing this arithmetic follows at the end of this page.)

Index + Key Lookup versus Scan
Actual (data in memory): key lookup CPU 1919, time 1919; scan CPU 8736, time 8727.
(926.67 - 323655 * 0.0001581) / 0.003125 = 280,160 rows (86.6%) – the plan cross-over point; the true cross-over is approximately 1,400,000 rows at 1 row per page.
1,093,729 pages / 1350 = 810.17 (8,748MB).
Second case: key lookup CPU 2138, time 321; scan CPU 18622, time 658.
8,748,000KB / 8 / 1350 = 810. (817 - 280326 * 0.0001581) / 0.003125 = 247,259 rows (88%).

Actual Execution Plan
Compare estimated with actual: note the actual number of rows, rebinds, and rewinds against the estimated row count and executions.
For a loop join inner source and key lookup: actual number of rows = number of executions x number of rows.

Parallel Plans – Parallelism Operations
Distribute Streams: non-parallel source, parallel destination.
Repartition Streams: parallel source and destination.
Gather Streams: destination is non-parallel.
In a parallel plan, note the gold circle with the double arrow and the parallelism operations.

Parallel Scan (and Index Seek)
At DOP 2, 4, 8, 16: the IO cost stays the same; the CPU cost is reduced by the degree of parallelism, except there is no further reduction at DOP 16.
IO contributes most of the cost!

Parallel Scan and Hash Match Aggregate
The Hash Match Aggregate CPU cost only reduces by 2X; the parallel scan IO cost is unchanged, and scan CPU cost is reduced in proportion to the degree of parallelism, with the last 2X excluded(?).
On a weak storage system a single thread can saturate the IO channel, so additional threads will not increase IO (will not reduce IO duration). A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this were an optimizer option.
The IO component can be a very large portion of the overall plan cost. Not reducing the IO cost in a parallel plan may inhibit generation of a favorable plan, i.e., the reduction is not sufficient to offset the contribution of the parallelism operations.
A parallel execution plan is more likely on larger systems (-P to fake it?).

Actual Execution Plan – Parallel
More parallel plan details; parallel plan – actual.

Parallelism – Hash Joins
Hash join cost at DOP 1, 2, 4, 8. Search: "Understanding Hash Joins" (see BOL) for in-memory, grace, and recursive hash joins.
CPU cost is linear with the number of rows in the outer and inner sources.
IO cost is zero for small intermediate data sizes; beyond a set point it is proportional to the excess data over the in-memory limit, which scales with server memory(?).
In a parallel plan, memory allocation is per thread!
Summary: hash join plan cost depends on memory if the IO component is not zero, in which case it is disproportionately lower with parallel plans. Does this reflect real cost?

Parallelism – Repartition Streams
DOP 2, 4, 8.
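A small T-SQL sketch (not from the deck) that reproduces the cost-model arithmetic quoted above: sequential scan IO costed at 1,350 pages (10,800KB) per second, and random IO at 320 per second, using the slide's page count:

-- Reference-platform cost constants from the Execution Plan Quickie slide.
DECLARE @pages bigint = 1093729;   -- clustered index pages in the slide's example

SELECT @pages / 1350.0 AS scan_io_cost,       -- ~810.2, matching the slide
       @pages * 8      AS scan_size_kb,       -- ~8,749,832KB, the slide's ~8,748,000KB
       1 / 320.0       AS cost_per_random_io; -- 0.003125 per lookup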
Bitmap
BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering.
A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed. SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans.
Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.
(A hedged example of a query shape that can benefit from a bitmap filter follows at the end of this page.)

Parallel Execution Plan Summary
Queries with a high IO cost may show little plan cost reduction from parallel execution.
Plans with a high portion of hash or sort cost show a large parallel plan cost reduction.
Parallel plans may be inhibited by a high row count in Parallelism Repartition Streams.
Watch out for (parallel) merge joins!

Scaling Theory – Parallel Execution Strategy
Partition work into little pieces: ensures each thread has the same amount, but high overhead to coordinate.
Partition into big pieces: may have uneven distribution between threads.
Small table joined to big table: one thread for each row from the small table.
Partitioned table options.

What Should Scale?
Trivially parallelizable: 1) split a large chunk of work among threads, 2) each thread works independently, 3) a small amount of coordination to consolidate threads.
More difficult: 1) split a large chunk of work among threads, 2) each thread works on the first stage, 3) a large coordination effort between threads, 4) more work to consolidate.

Partitioned Tables
Regular table: Repartition Streams operations appear in the plan.
Partitioned tables: no Repartition Streams operations!
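A TPC-H-shaped sketch (not from the deck) of the kind of parallel hash join where the optimizer can introduce a Bitmap operator: the filtered ORDERS side builds the bitmap, which may then discard non-joining LINEITEM rows before they are repartitioned. Whether the bitmap actually appears depends on estimated selectivity; the MAXDOP hint is only there because bitmap filters show up in parallel plans.

SELECT o.O_ORDERPRIORITY,
       SUM(l.L_EXTENDEDPRICE * (1 - l.L_DISCOUNT)) AS revenue
FROM ORDERS o
JOIN LINEITEM l ON l.L_ORDERKEY = o.O_ORDERKEY
WHERE o.O_ORDERDATE >= '1994-01-01'
  AND o.O_ORDERDATE <  '1995-01-01'
GROUP BY o.O_ORDERPRIORITY
OPTION (MAXDOP 8);   -- bitmap filters are implemented only in parallel plans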
Scaling Reality
8-way quad-core Opteron, Windows Server 2008 R2, SQL Server 2008 SP1 + hotfix.
27 test queries against a TPC-H SF10 database: standard, compressed, and partitioned (30).
Line Item table: SUM over 59M rows, 8.75GB; Orders table: 15M rows.
(A sketch of the test query pattern follows at the end of this page.)

[Charts: CPU-sec to SUM 1 or 2 columns of Line Item at DOP 1-32, standard versus compressed, plus speedup relative to DOP 1 for the sum, group, and join variants]
[Charts: Line Item sum of 1 column – CPU-sec and speedup relative to DOP 1 for standard, compressed, and partitioned tables]
[Charts: Line Item sum with GROUP BY – CPU-sec and speedup, standard versus compressed (hash aggregate)]
[Charts: hash join – CPU-sec and speedup for standard, compressed, and partitioned tables]
[Charts: key lookup (1.4M rows) and table scan – CPU-sec and speedup, standard versus compressed]

Parallel Execution Summary
Contention appears in queries with a low cost per page (simple scan); a high cost per page – multiple aggregates, hash join, compression – improves scaling!
Table partitioning produces alternative query plans.
Loop joins – broken at high DOP. Merge join – seriously broken (parallel).

Scaling DW Summary
Massive IO bandwidth; parallel options for data load, updates, etc.
Investigate parallel execution plans: scaling from DOP 1, 2, 4, 8, 16, 32, etc.; scaling with and without HT; a strategy for limiting DOP with multiple users.

Fixes from Microsoft Needed
Contention issues in parallel execution: table scan, nested loops.
A better plan cost model for scaling; back off on parallelism if the gain is negligible.
Fix throughput degradation with multiple users running big DW queries – with Sybase and Oracle, throughput is close to power or better.

Test Systems
2-way quad-core Xeon 5430 2.66GHz – Windows Server 2008 R2, SQL Server 2008 R2.
8-way dual-core Opteron 2.8GHz – Windows Server 2008 SP1, SQL Server 2008 SP1.
8-way quad-core Opteron 2.7GHz (Barcelona) – Windows Server 2008 R2, SQL Server 2008 SP1 build 2789.
The 8-way systems were configured for AD – not good!
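A sketch of the test query pattern (not the exact test scripts): the "sum 1 column" and "sum 2 columns" aggregates over Line Item, plus the GROUP BY variant, each run repeatedly with the DOP stepped through the values on the charts.

-- Sum of one column, serial baseline:
SELECT SUM(L_EXTENDEDPRICE) FROM LINEITEM OPTION (MAXDOP 1);

-- Sum of two columns at a chosen DOP:
SELECT SUM(L_EXTENDEDPRICE), SUM(L_QUANTITY) FROM LINEITEM OPTION (MAXDOP 8);

-- Sum with GROUP BY (hash aggregate variant):
SELECT L_RETURNFLAG, SUM(L_EXTENDEDPRICE)
FROM LINEITEM
GROUP BY L_RETURNFLAG
OPTION (MAXDOP 16);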
Test Methodology
Boot with all processors; run queries at MAXDOP 1, 2, 4, 8, etc.
This is not the same as running on a 1-way, 2-way, or 4-way server – interpret the results with caution.

References
Search: Adam Machanic, PASS; SQL Server Scaling on Big Iron (NUMA) Systems.

TPC-H
Joe Chang, jchang6@yahoo.com, www.qdpma.com

About Joe Chang (bio slide repeated from the first section.)

TPC-H
TPC-H is DSS – 22 queries, scored by geometric mean; a 60X range in plan cost, and a comparable actual range.
Power – single stream: tests the ability to scale parallel execution plans.
Throughput – multiple streams.
Scale Factor 1 – Line Item data is 1GB (875MB with DATE instead of DATETIME).
Only single-column indexes are allowed; queries are ad hoc.

Observed Scaling Behaviors
Good scaling, leveling off at high DOP. Perfect scaling??? Super-scaling.
Negative scaling, especially at high DOP. Execution plan changes. Completely different behavior.

TPC-H Published Results

TPC-H SF 100GB – 2-way Xeon 5355, 5570, 5680, Opteron 6176
[Chart: Power, Throughput, and QphH for Xeon 5355, 5570 HDD, 5570 SSD, 5680 SSD, 5570 Fusion-IO, Opt 6176]
Among the 2-way Xeon 5570 results, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power. Westmere and Magny-Cours, both with 192GB memory, are very close.

TPC-H SF 300GB – 8 x QC/6C and 4 x 12C Opteron
[Chart: Power, Throughput, and QphH for Opt 8360 4C, Opt 8384 4C, Opt 8439 6C, Opt 6176 12C, Xeon 7560 8C]
6-core Istanbul improved over 4-core Shanghai by 45% in Power, 73% in Throughput, 59% overall. 4 x 12C at 2.3GHz improved 17% over 8 x 6C at 2.8GHz.

TPC-H SF 1000
[Chart: Power, Throughput, and QphH for Opt 8439 SQL Server, Opt 8439 Sybase, Superdome, and Superdome 2]

TPC-H SF 3TB – X7460 and X7560
[Chart: Power, Throughput, and QphH for 16 x X7460, 8 x X7560, and POWER6]
Nehalem-EX with 64 cores is better than 96 Core 2 cores.

TPC-H SF 100GB, 300GB & 3TB (recap of the three charts above, with 32 x POWER6 added to the 3TB chart).

TPC-H Published Results – Summary
SQL Server excels in Power, but is limited by the geometric mean and anomalies.
SQL Server trails in Throughput: other DBMS get better throughput than power, while SQL Server throughput is below power by a wide margin.
Speculation – SQL Server does not throttle back parallelism under load? (A hedged sketch for capping DOP under concurrency follows below.)
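One way to impose the kind of throttling the speculation above asks for is Resource Governor (SQL Server 2008+). A hedged sketch, not from the deck: the group, function, and login names are hypothetical, and the classifier function must be created in master.

-- Cap parallelism for a reporting workload while leaving other sessions unconstrained.
CREATE WORKLOAD GROUP wg_reports
    WITH (MAX_DOP = 8)            -- cap DOP for this group only
    USING "default";
GO
-- Classifier: route the reporting login into the capped group (create in master).
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    RETURN (CASE WHEN SUSER_SNAME() = N'report_user'   -- hypothetical login
                 THEN N'wg_reports' ELSE N'default' END);
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;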
TPC-H SF100
Processors       GHz   Cores  Mem GB  SQL   SF   Power      Throughput  QphH
2 Xeon 5355      2.66  8      64      5sp2  100  23,378.0   13,381.0    17,686.7
2 x 5570 HDD     2.93  8      144     8sp1  100  67,712.9   38,019.1    50,738.4
2 x 5570 SSD     2.93  8      144     8sp1  100  70,048.5   37,749.1    51,422.4
2 x 5570 Fusion  2.93  8      144     8sp1  100  72,110.5   36,190.8    51,085.6
2 Xeon 5680      3.33  12     192     8r2   100  99,426.3   55,038.2    73,974.6
2 Opt 6176       2.3   24     192     8r2   100  94,761.5   53,855.6    71,438.3

TPC-H SF300
Processors       GHz   Cores  Mem GB  SQL   SF   Power      Throughput  QphH
4 Opt 8220       2.8   8      128     5rtm  300  25,206.4   13,283.8    18,298.5
8 Opt 8360       2.5   32     256     8rtm  300  67,287.4   41,526.4    52,860.2
8 Opt 8384       2.7   32     256     8rtm  300  75,161.2   44,271.9    57,684.7
8 Opt 8439       2.8   48     256     8sp1  300  109,067.1  76,869.0    91,558.2
4 Opt 6176       2.3   48     512     8r2   300  129,198.3  89,547.7    107,561.2
4 Xeon 7560      2.26  32     640     8r2   300  152,453.1  96,585.4    121,345.6
All of the above are HP results? Sun result for Opt 8384, sp1: Power 67,095.6, Throughput 45,343.5, QphH 55,157.5.

TPC-H SF1000 (1TB)
Processors       GHz   Cores  Mem GB  SQL    SF    Power      Throughput  QphH
8 Opt 8439       2.8   48     512     8R2?   1000  95,789.1   69,367.6    81,367.6
8 Opt 8439       2.8   48     384     ASE    1000  108,436.8  96,652.7    102,375.3
Itanium 9350     1.73  64     512     O11R2  1000  139,181.0  141,188.1   140,181.1

TPC-H SF3000 (3TB)
Processors       GHz   Cores  Mem GB  SQL     SF    Power      Throughput  QphH
16 Xeon 7460     2.66  96     1024    8r2     3000  120,254.8  87,841.4    102,254.8
8 Xeon 7560      2.26  64     512     8r2     3000  185,297.7  142,685.6   162,601.7
Itanium 9350     1.73  64     512     Sybase  1000  142,790.7  171,607.4   156,537.3
POWER6           5.0   64     512     Sybase  3000  142,790.7  171,607.4   156,537.3

TPC-H Published Results (selected)
Processors       GHz   Cores  Mem GB  SQL   SF    Power      Throughput  QphH
2 Xeon 5355      2.66  8      64      5sp2  100   23,378     13,381      17,686.7
2 Xeon 5570      2.93  8      144     8sp1  100   72,110.5   36,190.8    51,085.6
2 Xeon 5680      3.33  12     192     8r2   100   99,426.3   55,038.2    73,974.6
2 Opt 6176       2.3   24     192     8r2   100   94,761.5   53,855.6    71,438.3
4 Opt 8220       2.8   8      128     5rtm  300   25,206.4   13,283.8    18,298.5
8 Opt 8360       2.5   32     256     8rtm  300   67,287.4   41,526.4    52,860.2
8 Opt 8384       2.7   32     256     8rtm  300   75,161.2   44,271.9    57,684.7
8 Opt 8439       2.8   48     256     8sp1  300   109,067.1  76,869.0    91,558.2
4 Opt 6176       2.3   48     512     8r2   300   129,198.3  89,547.7    107,561.2
8 Xeon 7560      2.26  64     512     8r2   3000  185,297.7  142,685.6   162,601.7

SF100 2-way Big Queries (Q1, Q9, Q13, Q18, Q21)
[Chart: query time in seconds for 5570 HDD, 5570 SSD, 5570 Fusion-IO, 5680 SSD, 6176 SSD]
Xeon 5570 with SATA SSD is poor on Q9, reason unknown. Both Xeon 5680 and Opteron 6176 are a big improvement over Xeon 5570.

SF100 Middle Queries (Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q22)
[Chart: query time in seconds for the same configurations]
Xeon 5570-HDD and 5680-SSD are poor on Q12, reason unknown. Opteron 6176 is poor on Q11.

SF100 Small Queries (Q2, Q4, Q6, Q14, Q15, Q17, Q19, Q20)
[Chart: query time in seconds for the same configurations]
Xeon 5680 and Opteron are poor on Q20. Note the limited scaling on Q2 and Q17.

SF300 32+ Core Big Queries (Q1, Q9, Q13, Q18, Q21)
[Chart: query time in seconds for 8 x 8360 QC 2M, 8 x 8384 QC 6M, 8 x 8439 6C, 4 x 6176 12C, 4 x 7560 8C]
Opteron 6176 is poor relative to 8439 on Q9 and Q13, with the same total core count.

SF300 Middle Queries (Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q19, Q20, Q22)
[Chart: query time in seconds for the same configurations]
Opteron 6176 is much better than 8439 on Q11 and Q19, worse on Q12.

SF300 Small Queries (Q2, Q4, Q6, Q14, Q15, Q17)
[Chart: query time in seconds for the same configurations]
Opteron 6176 is much better on Q2, even with 8439 on the others.

SF1000
[Chart: per-query comparison, Q1-Q22]
SF1000: SQL Server versus Sybase
[Charts: query times in seconds for the big (Q1, Q9, Q13, Q18, Q21), middle, and small query groups]

SF1000: Itanium – Superdome
[Chart: per-query ratios, Q1-Q22]

SF 3TB – 8 x 7560 versus 16 x 7460
[Chart: per-query ratio, Q1-Q22; one query reaches 5.6X]
Broadly 50% faster overall: 5X+ on one query, slower on 2, comparable on 3.

64 Cores: Xeon 7560 relative to POWER6
[Charts: per-query ratios Q1-Q22, plus query times in seconds for Unisys 16x6, DL980 8x8, and POWER6 across the big, middle, and small query groups]

TPC-H Summary
Scaling is impressive on some SQL. There is limited ability (and value) in scaling the small queries. Anomalies and negative scaling occur.

TPC-H Queries
Q1 Pricing Summary Report.
Q2 Minimum Cost Supplier – wordy, but only touches the small tables; second lowest plan cost (after Q15).
Q3; Q6 Forecasting Revenue Change; Q7 Volume Shipping; Q8 National Market Share; Q9 Product Type Profit Measure.
Q11 Important Stock Identification (non-parallel and parallel plans).
Q12 – random IO?
Q13 – why does Q13 have perfect scaling?
Q17 Small Quantity Order Revenue.
Q18 Large Volume Customer (non-parallel and parallel plans).
Q19; Q20(?) – date predicates are usually written as functions, but the Line Item date columns are of "date" type; a CAST helps the DOP 1 plan but gets a bad plan for parallel execution. This query may get a poor execution plan.
Q21 Suppliers Who Kept Orders Waiting – note the 3 references to Line Item.
Q22.

TPC-H Studies
Joe Chang, jchang6@yahoo.com, www.qdpma.com

About Joe Chang (bio slide repeated from the first section.)

TPC-H
TPC-H is DSS – 22 queries, scored by geometric mean; a 60X range in plan cost, and a comparable actual range.
Power – single stream: tests the ability to scale parallel execution plans. Throughput – multiple streams.
Scale Factor 1 – Line Item data is 1GB (875MB with DATE instead of DATETIME). Only single-column indexes allowed; ad hoc.

SF10 Test Studies (not valid for publication)
Auto-statistics enabled; compile time excluded.
Big queries – Line Item scan. Super-scaling – mission impossible. Small queries and high parallelism. Other queries, negative scaling.
Did not apply TF 2301, or disallow page locks.

Big Queries: Plan Cost versus Actual
[Chart: plan cost at 10GB for Q1, Q9, Q13, Q18, Q21 at DOP 1, 2, 4, 8, 16]
Plan cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20% (memory affects the hash IO onset).
Plan cost says scaling is poor except for Q18; Q18 and Q21 are > 3X Q1 and Q9.
[Chart: actual query time in seconds at DOP 1-32]
Plan cost is a poor indicator of true parallelism scaling.

Big Query: Speedup and CPU
[Charts: speedup relative to DOP 1 (the holy grail) and CPU time in seconds for Q1, Q9, Q13, Q18, Q21 at DOP 1-32]
(A hedged sketch for collecting these CPU and elapsed-time numbers follows below.)
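A sketch (not the deck's actual harness) of how per-query CPU and elapsed time can be collected to build speedup tables like these, assuming SQL Server 2005 or later:

-- Either wrap each run:
SET STATISTICS TIME ON;   -- prints CPU time and elapsed time per statement
-- ... run the TPC-H query here at the chosen MAXDOP ...
SET STATISTICS TIME OFF;

-- ... or pull aggregates from the plan cache afterwards:
SELECT TOP (20)
       qs.total_worker_time / 1000  AS total_cpu_ms,
       qs.total_elapsed_time / 1000 AS total_elapsed_ms,
       qs.execution_count,
       SUBSTRING(st.text, 1, 80)    AS query_start
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;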
Q13 has slightly better than perfect scaling? In general, excellent scaling to DOP 8-24, weak afterwards.

Super Scaling
Suppose at DOP 1 a query runs for 100 seconds with one CPU fully pegged: CPU time = 100 sec, elapsed time = 100 sec.
What is the best case for DOP 2, assuming nearly zero repartition-streams cost? CPU time = 100 sec, elapsed time = 50 sec.
Super-scaling: CPU time actually decreases going from the non-parallel to the parallel plan! No, I have not started drinking, yet.
[Charts: Q7, Q8, Q11, Q21, Q22 at DOP 1-32 – CPU normalized to DOP 1 (CPU-sec goes down from DOP 1 to 2 and higher, typically to DOP 8), speedup relative to DOP 1 (3.5X speedup from DOP 1 to 2), and CPU and query time in seconds]

Super Scaling Summary
The most probable cause: the Bitmap operator in the parallel plan.
Bitmap filters are great. Question for Microsoft: can I use bitmap filters in OLTP systems with non-parallel plans?

Small Queries – Plan Cost versus Actual
[Chart: plan cost for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1-16; queries 3 and 16 have lower plan cost than Q17 but are not included]
[Chart: query time at DOP 1-32]
Q4, Q6, and Q17 show great scaling to DOP 4, then weak. Negative scaling also occurs.

Small Queries – CPU and Speedup
[Charts: CPU time and speedup for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1-32]
What did I get for all that extra CPU? Interpretation: a sharp jump in CPU means poor scaling; a disproportionate jump means negative scaling.
Q2 goes negative at DOP 2; Q4 is good; Q6 gets a speedup, but at a CPU premium; Q17 and Q20 go negative after DOP 8.

High Parallelism – Small Queries: Why?
Almost no value – but sometimes you do get lucky.
TPC-H geometric mean scoring means small queries have as much impact as large ones; a linear sum would weight the large queries.
OLTP with 32, 64+ cores: parallelism is good only if it super-scales.
The default max degree of parallelism of 0 is seriously bad news, especially for small queries. Increase the cost threshold for parallelism? (A hedged sketch of these settings follows below.)

Queries that Go Negative
[Charts: query time, "speedup", and CPU for Q17, Q19, Q20, Q22 at DOP 1-32]

Other Queries – CPU and Speedup
[Charts: CPU time and speedup for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1-32]
Q3 has problems beyond DOP 2.
[Chart: query time in seconds for the same queries]

Scaling Summary
Some queries show excellent scaling; super-scaling, better than 2X.
Sharp CPU jump on the last DOP doubling.
Need a strategy to cap DOP, to limit negative scaling, especially for some smaller queries.
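A sketch (not from the deck) of the two server-wide knobs for capping DOP and keeping small queries serial; the values are illustrative only, and big DW queries can still override the cap with a hint.

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max degree of parallelism', 8;       -- cap DOP rather than 0 (unbounded)
EXEC sp_configure 'cost threshold for parallelism', 25; -- keep cheap queries non-parallel
RECONFIGURE;

-- Individual big DW queries can still request more:
-- SELECT ... OPTION (MAXDOP 32);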
Other anomalies.

Compression
PAGE Compression Overhead – Overall
[Charts: query time and CPU time with compression relative to uncompressed, overall and per query Q1-Q22, at DOP 1-32]
Roughly 40% overhead for compression at low DOP, but only ~10% overhead at max DOP???

Compressed Table
LINEITEM – real data may be more compressible.
Uncompressed: 8,749,760KB, average 149 bytes per row. Compressed: 4,819,592KB, average 82 bytes per row.
(A hedged compression sketch appears at the end of this page.)

Partitioning
Orders and Line Item partitioned on Order Key.

Partitioning Impact – Overall
[Charts: query time and CPU time partitioned relative to not partitioned, overall and per query Q1-Q22, at DOP 1-32]

Plan for Partitioned Tables

Scaling DW Summary (repeated from the Parallel Execution Plans section)
Massive IO bandwidth; parallel options for data load, updates, etc.
Investigate parallel execution plans: scaling from DOP 1, 2, 4, 8, 16, 32; scaling with and without HT; a strategy for limiting DOP with multiple users.

Fixes from Microsoft Needed (repeated)
Contention issues in parallel execution (table scan, nested loops); a better plan cost model for scaling; back off on parallelism if the gain is negligible; fix throughput degradation with multiple users running big DW queries – with Sybase and Oracle, throughput is close to power or better.

Query Plans – Big Queries

Q1 Pricing Summary Report
Q1 plan, non-parallel and parallel: the parallel plan cost is 28% lower than the scalar plan; IO is 70% of the cost, and there is no parallel plan cost reduction on the IO.

Q9 Product Type Profit Measure
Non-parallel and parallel: IO from 4 tables contributes 58% of the plan cost; the parallel plan is 39% lower.
Q9 non-parallel plan: table/index scans comprise 64% of the cost; join sequence: Supplier, (Part, PartSupp), Line Item, Orders.
Q9 parallel plan join sequence: Nation, Supplier, (Part, Line Item), Orders, PartSupp.
Q9 parallel plan, regular versus partitioned.

Q13
Why does Q13 have perfect scaling?
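A sketch (not from the deck) of estimating and then applying PAGE compression on LINEITEM, SQL Server 2008 Enterprise Edition assumed; the sampled estimate should land near the roughly 45% reduction reported above, and MAXDOP on the rebuild controls how much CPU the rebuild itself uses.

-- Estimate first:
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo', @object_name = 'LINEITEM',
     @index_id = NULL, @partition_number = NULL,
     @data_compression = 'PAGE';

-- Then rebuild with compression:
ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE, MAXDOP = 8);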
Q18 Large Volume Customer
Non-parallel and parallel plans. In the non-parallel plan, 66% of the cost is in the Hash Match, reduced to 5% in the parallel plan.
Q18 plan details: the non-parallel plan Hash Match cost is 1245 IO, 494.6 CPU; at DOP 16/32 the hash size is below the IO threshold, and CPU is reduced by more than 10X.

Q21 Suppliers Who Kept Orders Waiting
Non-parallel and parallel plans; note the 3 references to Line Item (hash joins H1, H2, H3).
Q21 does 3 full Line Item clustered index scans, so its plan cost is approximately 3X that of Q1, which has a single scan.

Super Scaling Queries
Q7 Volume Shipping: non-parallel and parallel plans; join sequence in both: Nation, Customer, Orders, Line Item.
Q8 National Market Share: non-parallel and parallel plans; join sequence in both: Part, Line Item, Orders, Customer.
Q11 Important Stock Identification: non-parallel and parallel plans; join sequence in both: A) Nation, Supplier, PartSupp; B) Nation, Supplier, PartSupp.

Small Queries
Q2 Minimum Cost Supplier: wordy, but only touches the small tables; second lowest plan cost (after Q15).
In the Q2 plan, the clustered index scans on Part and PartSupp have the highest cost (48% + 42%); in the alternate plan, PartSupp becomes an index scan + key lookup.
Q6 Forecasting Revenue Change: not sure why this blows CPU – scalar values are pre-computed and pre-converted.
Q20(?): date predicates are usually written as functions, but the Line Item date columns are of "date" type; a CAST helps the DOP 1 plan but gets a bad plan for parallel execution. This query may get a poor execution plan.
Q20 alternate – parallel: the statistics estimation error occurs here, and the penalty for the mistake is applied here.

Other Queries
Q3.
Q12 – random IO? Will this plan generate random IO? Query 12 plans, parallel and non-parallel.

Queries that Go Negative
Q17 Small Quantity Order Revenue: the Table Spool is a concern, plus the usual suspects.
Q19.
Q22.

[Charts: speedup from DOP 1 in query time, and CPU relative to DOP 1, for Q1-Q22 and the total, at DOP 2-32]