SQL Server Scaling on Big Iron (NUMA) Systems
TPC-H
Joe Chang
jchang6@yahoo.com
www.qdpma.com

About Joe Chang
  - SQL Server Execution Plan Cost Model
  - True cost structure by system architecture
  - Decoding statblob (distribution statistics)
  - SQL Clone: statistics-only database
  - Tools
      - ExecStats: cross-reference index use by SQL execution plan
      - Performance Monitoring, Profiler/Trace aggregation

TPC-H
  - DSS: 22 queries, geometric mean
      - 60X range in plan cost, comparable actual range
  - Power: single stream
      - Tests ability to scale parallel execution plans
  - Throughput: multiple streams
  - Scale Factor 1: Line item data is 1GB
      - 875MB with DATE instead of DATETIME
  - Only single column indexes allowed, ad hoc

Observed Scaling Behaviors
  - Good scaling, leveling off at high DOP
  - Perfect scaling ???
  - Super scaling
  - Negative scaling, especially at high DOP
  - Execution plan change: completely different behavior

TPC-H Published Results

TPC-H SF 100GB: 2-way Xeon 5355, 5570, 5680, Opt 6176
[Chart: Power, Throughput and QphH for Xeon 5355, 5570 HDD, 5570 SSD, 5680 SSD, 5570 Fusion, Opt 6176]
  - Among the 2-way Xeon 5570 results, all are close: HDD has the best throughput, SATA SSD has the best composite, and Fusion-IO has the best power.
  - Westmere and Magny-Cours, both with 192GB memory, are very close.

TPC-H SF 300GB: 8x QC/6C & 4x12C Opt
[Chart: Power, Throughput and QphH for Opt 8360 4C, Opt 8384 4C, Opt 8439 6C, Opt 6176 12C, X7560 8C]
  - 6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall.
  - 4x12C 2.3GHz improved 17% over 8x6C 2.8GHz.

TPC-H SF 1000
[Chart: Power, Throughput and QphH for Opt 8439 SQL, Opt 8439 Sybase, Superdome, Superdome 2]
  - Oracle RAC, 64 nodes, 128 Xeon 5450 quad-core 3.0GHz processors
  - Power 782,608, 5.6X higher than Superdome 2 with 64 cores

TPC-H SF 3TB: X7460 & X7560
[Chart: Power, Throughput and QphH for 16 x X7460, 8 x 7560, POWER6, M9000]
  - Nehalem-EX with 64 cores is better than Core 2 with 96 cores.
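The composite QphH shown next to Power and Throughput in these charts is, by the TPC-H metric definition, the geometric mean of the two. A minimal Python sketch of that relationship, using the SF300 Opteron 8384 and 8439 figures from the published results; the function names are illustrative and not from the talk:

```python
import math

def qphh(power, throughput):
    """Composite TPC-H metric: geometric mean of Power and Throughput."""
    return math.sqrt(power * throughput)

def pct_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0

# Published SF300 results for the 8-way Opteron systems (from the slides).
shanghai = {"power": 75161.2, "throughput": 44271.9}    # 8 x Opt 8384, 4C
istanbul = {"power": 109067.1, "throughput": 76869.0}   # 8 x Opt 8439, 6C

q_old = qphh(**shanghai)
q_new = qphh(**istanbul)

print(f"QphH 8384: {q_old:,.1f}")
print(f"QphH 8439: {q_new:,.1f}")
print(f"Power gain:      {pct_gain(istanbul['power'], shanghai['power']):.1f}%")
print(f"Throughput gain: {pct_gain(istanbul['throughput'], shanghai['throughput']):.1f}%")
print(f"Composite gain:  {pct_gain(q_new, q_old):.1f}%")
```

The computed composites land within a few units of the published QphH values, and the gains come out close to the 45%, 73% and 59% quoted for Istanbul over Shanghai. Because the composite is a geometric mean, a system that leads on Power but trails on Throughput is pulled toward the middle.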

TPC-H SF 100GB, 300GB & 3TB
[Combined charts of Power, Throughput and QphH: SF100 2-way (Xeon 5355, 5570 HDD, 5570 SSD, 5680 SSD, 5570 Fusion, Opt 6176); SF300 8x QC/6C & 4x12C (Opt 8360 4C, Opt 8384 4C, Opt 8439 6C, Opt 6176 12C, X7560 8C); SF 3TB (16 x X7460, 8 x 7560, 32 x Pwr6)]

TPC-H Published Results
  - SQL Server excels in Power
      - Limited by geometric mean, anomalies
  - Trails in Throughput
      - Other DBMS get better throughput than power
      - SQL Server throughput is below Power by a wide margin
      - Speculation: SQL Server does not throttle back parallelism with load?
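One way to read "limited by geometric mean, anomalies": a geometric mean weights every query equally, so a single anomalous query costs the same share of the Power score whether it is a large or a small query. The sketch below illustrates this with made-up query times. It is a simplification, assuming the score is 3600 x SF divided by the geometric mean of the 22 query times and leaving out the two refresh functions of the official formula:

```python
import math

def geometric_mean(times):
    """Geometric mean of positive query times (seconds)."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

def power_score(times, scale_factor):
    """Simplified Power-style score: 3600 * SF / geometric mean.

    Assumption: the refresh functions in the official TPC-H formula
    are omitted, so this is not the audited metric.
    """
    return 3600.0 * scale_factor / geometric_mean(times)

# Hypothetical run of 22 queries; the times are invented for illustration.
baseline = [40.0] * 5 + [5.0] * 9 + [1.0] * 8

# Anomaly: one small query (1 s) becomes 10x slower.
anomaly = list(baseline)
anomaly[-1] *= 10.0

base = power_score(baseline, scale_factor=100)
hit = power_score(anomaly, scale_factor=100)

print(f"total elapsed time grows by {sum(anomaly) / sum(baseline):.3f}x")
print(f"power score falls to {hit / base:.3f} of baseline")
```

In this example the anomaly adds 9 seconds to a 253-second run, under 4% of elapsed time, yet the score drops by about 10%, since a 10x slowdown on one of 22 queries divides the score by 10^(1/22).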

TPC-H SF100
Processors      GHz    Cores  Mem GB  SQL     SF     Power       Throughput   QphH
2 Xeon 5355     2.66   8      64      5sp2    100    23,378.0    13,381.0     17,686.7
2x5570 HDD      2.93   8      144     8sp1    100    67,712.9    38,019.1     50,738.4
2x5570 SSD      2.93   8      144     8sp1    100    70,048.5    37,749.1     51,422.4
5570 Fusion     2.93   8      144     8sp1    100    72,110.5    36,190.8     51,085.6
2 Xeon 5680     3.33   12     192     8r2     100    99,426.3    55,038.2     73,974.6
2 Opt 6176      2.3    24     192     8r2     100    94,761.5    53,855.6     71,438.3

TPC-H SF300
Processors      GHz    Cores  Mem GB  SQL     SF     Power       Throughput   QphH
4 Opt 8220      2.8    8      128     5rtm    300    25,206.4    13,283.8     18,298.5
8 Opt 8360      2.5    32     256     8rtm    300    67,287.4    41,526.4     52,860.2
8 Opt 8384      2.7    32     256     8rtm    300    75,161.2    44,271.9     57,684.7
8 Opt 8439      2.8    48     256     8sp1    300    109,067.1   76,869.0     91,558.2
4 Opt 6176      2.3    48     512     8r2     300    129,198.3   89,547.7     107,561.2
4 Xeon 7560     2.26   32     640     8r2     300    152,453.1   96,585.4     121,345.6
  - All of the above are HP results?
  - Sun result: Opt 8384, sp1, Power 67,095.6, Throughput 45,343.5, QphH 55,157.5

TPC-H 1TB
Processors      GHz    Cores  Mem GB  SQL     SF     Power       Throughput   QphH
8 Opt 8439      2.8    48     512     8R2?    1000   95,789.1    69,367.6     81,367.6
8 Opt 8439      2.8    48     384     ASE     1000   108,436.8   96,652.7     102,375.3
Itanium 9140    1.6    64     384     O11g    1000   111,557.0   128,259.1    123,323.1
Itanium 9350    1.73   64     512     O11R2   1000   139,181.0   141,188.1    140,181.1
Xeon 5450       3.0    512    2048    O RAC   1000   782,608.7   1,740,122    1,166,977

TPC-H 3TB
Processors      GHz    Cores  Mem GB  SQL     SF     Power       Throughput   QphH
16 Xeon 7460    2.66   96     1024    8r2     3000   120,254.8   87,841.4     102,254.8
8 Xeon 7560     2.26   64     512     8r2     3000   185,297.7   142,685.6    162,601.7
POWER6          5.0    64     512     Sybase  3000   142,790.7   171,607.4    156,537.3
SPARC           2.88   128    512     O11R2   3000   182,350.7   216,967.7    198,907.5

TPC-H Published Results
Processors      GHz    Cores  Mem GB  SQL     SF     Power       Throughput   QphH
2 Xeon 5355     2.66   8      64      5sp2    100    23,378      13,381       17,686.7
2 Xeon 5570     2.93   8      144     8sp1    100    72,110.5    36,190.8     51,085.6
2 Xeon 5680     3.33   12     192     8r2     100    99,426.3    55,038.2     73,974.6
2 Opt 6176      2.3    24     192     8r2     100    94,761.5    53,855.6     71,438.3
4 Opt 8220      2.8    8      128     5rtm    300    25,206.4    13,283.8     18,298.5
8 Opt 8360      2.5    32     256     8rtm    300    67,287.4    41,526.4     52,860.2
8 Opt 8384      2.7    32     256     8rtm    300    75,161.2    44,271.9     57,684.7
8 Opt 8439      2.8    48     256     8sp1    300    109,067.1   76,869.0     91,558.2
4 Opt 6176      2.3    48     512     8r2     300    129,198.3   89,547.7     107,561.2
8 Xeon 7560     2.26   64     512     8r2     3000   185,297.7   142,685.6    162,601.7

SF100 Big Queries (sec)
[Chart: query time in sec for Q1, Q9, Q13, Q18, Q21; series 5570 HDD, 5570 SSD, 5570 FusionIO, 5680 SSD, 6176 SSD]
  - Xeon 5570 with SATA SSD poor on Q9, reason unknown
  - Both Xeon 5680 and Opteron 6176 show a big improvement over Xeon 5570

SF100 Middle Q
[Chart: query time in sec for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q22; series 5570 HDD, 5570 SSD, 5680 SSD, 6176 SSD, 5570 FusionIO]
  - Xeon 5570-HDD and 5680-SSD poor on Q12, reason unknown
  - Opteron 6176 poor on Q11

SF100 Small Queries
[Chart: query time in sec for Q2, Q4, Q6, Q14, Q15, Q17, Q19, Q20; series 5570 HDD, 5570 SSD, 5680 SSD, 6176 SSD, 5570 FusionIO]
  - Xeon 5680 and Opteron poor on Q20
  - Note limited scaling on Q2 and Q17

SF300 Big Queries
[Chart: query time in sec for Q1, Q9, Q13, Q18, Q21; series 8 x 8360 QC 2M, 8 x 8384 QC 6M, 8 x 8439 6C, 4 x 6176 12C, 4 x 7560 8C]
  - Opteron 6176 poor relative to 8439 on Q9 and Q13, with the same number of total cores

SF300 Middle Q
[Chart: query time in sec for Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q19, Q20, Q22; series 8x8360 QC 2M, 8x8384 QC 6M, 8x8439 6C, 4x6176 12C, 4x7560 8C]
  - Opteron 6176 much better than 8439 on Q11 and Q19
  - Worse on Q12

SF300 Small Q
[Chart: query time in sec for Q2, Q4, Q6, Q14, Q15, Q17; series 8 x 8360 QC 2M, 8 x 8384 QC 6M, 8 x 8439 6C, 4 x 6176 12C, 4 x 7560 8C]
  - Opteron 6176 much better on Q2, even with 8439 on others

SF1000 Sybase vs. SQL Server
[Chart: query time for Q1 through Q22, Sybase relative to SQL Server, both on DL785 48-core]

SF1000 Large Queries
[Chart: Q1, Q9, Q13, Q18, Q21; series SQL Server, Sybase]

SF1000 Middle Queries
[Chart: Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q17, Q19; series SQL Server, Sybase]

SF1000 Small Queries
[Chart: Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22; series SQL Server, Sybase]

SF1000 Itanium - Superdome
[Chart: query time for Q1 through Q22, Superdome 2 versus Superdome, 16-way quad-core and 32-way dual-core]

512-core C2 RAC vs. 64-core It2
[Chart: query time for Q1 through Q22, Superdome 2 versus RAC, 16-way quad-core (64 cores) and 64-node 2-way quad-core (512 cores)]
  - Oracle RAC 5.6X higher Power

SF 3TB: 8×7560 versus 16×7460
[Chart: relative query time for Q1 through Q22, with one bar labeled 5.6X]
  - Broadly 50% faster overall, 5X+ on one, slower on 2, comparable on 3

64 cores, PWR6 vs. Xeon 7560
[Chart: query time for Q1 through Q22, POWER6 relative to X7560]
  - Overall, Xeon 7560 is 30% faster on power, but with wide variations on individual queries, some with Pwr6 faster

SF3000 Big Queries
[Chart: Q1, Q9, Q13, Q18, Q21; series Uni 16x6, DL980 8x8, Pwr6, M9000]

SF3000 Middle and Small Q
[Chart: Q3, Q5, Q7, Q8, Q10, Q11, Q12, Q16, Q17, Q19; series Uni 16x6, DL980 8x8, Pwr6, M9000]
[Chart: Q2, Q4, Q6, Q14, Q15, Q16, Q20, Q22; series Uni 16x6, DL980 8x8, Pwr6, M9000]

TPC-H Summary
  - Scaling is impressive on some SQL
  - Limited ability (value) in scaling small Q
  - Anomalies, negative scaling

TPC-H Queries

Q1 Pricing Summary Report

Query 2 Minimum Cost Supplier
  - Wordy, but only touches the small tables, second lowest plan cost (Q15)

Q3

Q6 Forecasting Revenue Change

Q7 Volume Shipping

Q8 National Market Share

Q9 Product Type Profit Measure

Q11 Important Stock Identification
[Execution plans: Non-Parallel, Parallel]

Q12 Random IO?

Q13 Why does Q13 have perfect scaling?

Q17 Small Quantity Order Revenue

Q18 Large Volume Customer
[Execution plans: Non-Parallel, Parallel]

Q19

Q20?
  - Date functions are usually written as [...] because Line Item date columns are "date" type
  - CAST helps the DOP 1 plan, but gets a bad plan for parallel
  - This query may get a poor execution plan

Q21 Suppliers Who Kept Orders Waiting
  - Note 3 references to Line Item

Q22
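
The scaling behaviors named in this deck (good scaling that levels off, perfect, super, and negative scaling) can be told apart by comparing the elapsed time at each degree of parallelism (DOP) with the DOP 1 time. The following Python sketch shows one way to label them. The 10% tolerance, the label names, and the sample timings are assumptions made for illustration, not measurements from the talk:

```python
def classify_scaling(times_by_dop, tol=0.1):
    """Label each DOP > 1 by its speedup relative to DOP 1.

    times_by_dop maps DOP -> elapsed seconds and must include DOP 1.
    tol is an assumed tolerance band around linear speedup.
    """
    t1 = times_by_dop[1]
    labels = {}
    prev_time = t1
    for dop in sorted(d for d in times_by_dop if d > 1):
        t = times_by_dop[dop]
        speedup = t1 / t
        if t > prev_time:
            labels[dop] = "negative"      # slower than the previous DOP
        elif speedup > dop * (1 + tol):
            labels[dop] = "super"         # better than linear
        elif speedup >= dop * (1 - tol):
            labels[dop] = "perfect"       # roughly linear
        else:
            labels[dop] = "leveling off"  # still faster, but sub-linear
        prev_time = t
    return labels

# Hypothetical elapsed times (seconds) for one query at several DOP settings.
sample = {1: 100.0, 2: 50.0, 4: 20.0, 8: 18.0, 16: 25.0}

for dop, label in classify_scaling(sample).items():
    print(f"DOP {dop:>2}: speedup {sample[1] / sample[dop]:.2f} -> {label}")
```

A jump from sub-linear to super-linear between adjacent DOP values, or a sudden negative step, is often a hint that the execution plan changed rather than that the same plan scaled differently.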