System Architecture: Big Iron (NUMA)
Joe Chang
jchang6@yahoo.com
www.qdpma.com

About Joe Chang
- SQL Server execution plan cost model
- True cost structure by system architecture
- Decoding statblob (distribution statistics)
- SQL Clone – statistics-only database
Tools
- ExecStats – cross-reference index use by SQL/execution plan
- Performance monitoring, Profiler/Trace aggregation

Scaling SQL on NUMA: Topics
- OLTP – Thomas Kejser session "Designing High Scale OLTP Systems"
- Data Warehouse
- Ongoing database development
- Bulk load – SQL CAT paper + TK session "The Data Loading Performance Guide"
Other sessions with common coverage:
- Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6)
- Inside the SQL Server Query Optimizer, Conor Cunningham
- Notes from the Field: High Performance Storage, John Langford
- SQL Server Storage – 1000GB Level, Brent Ozar

Server Systems and Architecture

Symmetric Multi-Processing
[Diagram: four CPUs on a shared system bus to the MCH, with PXH and ICH attached]
- In SMP, processors are not dedicated to specific tasks (that would be ASMP); there is a single OS image, and each processor can access all memory
- SMP makes no reference to memory architecture
- Not to be confused with Simultaneous Multi-Threading (SMT); Intel calls SMT Hyper-Threading (HT), which in turn is not to be confused with AMD Hyper-Transport (also HT)

Non-Uniform Memory Access
[Diagram: four nodes, each with four CPUs, a memory controller, and a node controller, joined by a shared bus or crossbar]
NUMA architecture – the path to memory is not uniform:
1) Node: processors and memory, with separate or combined memory + node controllers
2) Nodes connected by a shared bus, crossbar, or ring
- Traditionally 8-way and larger systems
- Local memory latency ~150ns, remote node memory ~300-400ns; this can cause erratic behavior if the OS/code is not NUMA-aware

AMD Opteron
[Diagram: four Opterons linked by Hyper-Transport, with HT2100 and HT1100 IO hubs]
- Technically, Opteron is NUMA, but remote node memory latency is low, so there is no negative impact or erratic behavior; for practical purposes it behaves like an SMP system
- Local memory latency ~50ns, 1 hop ~100ns, two hops ~150ns(?)
- Actual latency is more complicated because of snooping (cache coherency traffic)

8-way Opteron System Architecture
[Diagram: CPUs 0-7; see http://www.techpowerup.com/img/09-08-26/17d.jpg]
- The Opteron processor (prior to Magny-Cours) has 3 Hyper-Transport links
- In the 8-way layout, the top and bottom right processors use 2 HT links to connect to other processors and the 3rd HT link for IO; CPUs 1 and 7 require 3 hops to reach each other

Nehalem System Architecture
- Intel Nehalem generation processors have Quick Path Interconnect (QPI)
- Xeon 5500/5600 series have 2 QPI links; the Xeon 7500 series has 4
- 8-way glue-less systems are possible

NUMA Local and Remote Memory
- Local memory is closer than remote, so physical access time is shorter
- But what is the actual access time once the cache coherency requirement is included? (A T-SQL sketch for inspecting the node layout follows.)
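Where the deck asks about actual access behavior, it helps to first see the node layout SQL Server itself builds on top of the hardware. A minimal T-SQL sketch, assuming SQL Server 2008 or later (DMV column names vary slightly across versions):

    -- CPU nodes SQL Server created over the hardware NUMA topology
    -- (the extra node with node_state_desc 'ONLINE DAC' is the hidden
    -- dedicated-admin-connection node).
    SELECT node_id, node_state_desc, memory_node_id,
           cpu_affinity_mask, online_scheduler_count
    FROM sys.dm_os_nodes;

    -- Memory reserved/committed per memory node; on a NUMA-aware setup
    -- this should be spread across nodes, not concentrated on node 0.
    SELECT memory_node_id,
           virtual_address_space_reserved_kb,
           virtual_address_space_committed_kb
    FROM sys.dm_os_memory_nodes;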
HT Assist – Probe Filter
- Part of the L3 cache is used as a directory cache (source: ZDNET)

Source Snoop Coherency
From the HP PREMA Architecture whitepaper: all reads result in snoops to all other caches; the memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line.

DL980G7
From the HP PREMA Architecture whitepaper: each node controller stores information about* all data in the processor caches, minimizing inter-processor coherency communication and reducing latency to local memory (*only cache tags, not cache data).

HP ProLiant DL980 Architecture
- Node controllers reduce effective memory latency

Superdome 2 – Itanium, sx3000
- Agent: Remote Ownership Tag + L4 cache tags; 64MB eDRAM for L4 cache data

IBM x3850 X5 (Glue-less)
- Connects two 4-socket nodes to make an 8-way system

Fujitsu R900
- 4 IOHs; 14 x8 PCI-E slots, 2 x4, 1 x8 internal

OS Memory Models
- SUMA: Sufficiently Uniform Memory Access – memory is interleaved across nodes, so consecutive addresses are striped round-robin over all nodes
- NUMA: memory is first interleaved within a node; the memory stripe is then spanned across nodes
[Diagram residue removed: grids of memory addresses laid out across Node 0 through Node 3 for each model]

Windows OS NUMA Support
Memory models:
- SUMA – Sufficiently Uniform Memory Access: memory is striped across NUMA nodes
- NUMA – separate memory pools by node

Memory Model Example: 4 Nodes
- SUMA memory model: memory access is uniformly distributed; 25% of memory accesses are local, 75% remote
- NUMA memory model: the goal is better than 25% local node access
- True local access time also needs to be faster
- Cache coherency may increase local access time

Architecting for NUMA: End to End Affinity

App server group   TCP port   CPU node   Memory table
North East         1440       Node 0     0-0, 0-1  NE
Mid Atlantic       1441       Node 1     1-0, 1-1  MidA
South East         1442       Node 2     2-0, 2-1  SE
Central            1443       Node 3     3-0, 3-1  Cen
Texas              1444       Node 4     4-0, 4-1  Tex
Mountain           1445       Node 5     5-0, 5-1  Mnt
California         1446       Node 6     6-0, 6-1  Cal
Pacific NW         1447       Node 7     7-0, 7-1  PNW

- The web tier determines the port for each user by group (but the grouping should not be by geography!)
- Affinitize each port to a NUMA node (a configuration sketch follows)
- Each node then accesses localized data (a partition?)
- The OS may still allocate a substantial chunk of memory from Node 0?
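The port-to-node mapping sketched in the table can be expressed with SQL Server's bracket syntax for TCP ports (SQL Server 2005 and later), set in SQL Server Configuration Manager rather than in T-SQL. The exact masks below are an illustrative assumption for this table's layout; the verification query uses the node_affinity column of sys.dm_exec_connections:

    -- In SQL Server Configuration Manager (TCP/IP properties), a port can
    -- carry a NUMA node bitmask in brackets; for the first four rows of
    -- the table above (0x1 = node 0, 0x2 = node 1, 0x4 = node 2, ...):
    --   TCP Port: 1440[0x1],1441[0x2],1442[0x4],1443[0x8]
    -- Then verify from T-SQL which node each TCP connection landed on:
    SELECT session_id, local_tcp_port, node_affinity
    FROM sys.dm_exec_connections
    WHERE net_transport = 'TCP';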
HP-UX LORA
- HP-UX (not Microsoft Windows): Locality-Optimized Resource Alignment
- 12.5% interleaved memory, 87.5% NUMA-node local memory

System Tech Specs

Processors         Cores  DIMMs  PCI-E G2      Total cores  Max memory  Base price
2 x Xeon X56x0     6      18     5 x8+, 1 x4   12           192GB*      $7K
4 x Opteron 6100   12     32     5 x8, 1 x4    48           512GB       $14K
4 x Xeon X7560     8      64     4 x8, 6 x4†   32           1TB         $30K
8 x Xeon X7560     8      128    9 x8, 5 x4‡   64           2TB         $100K

Memory pricing: 8GB DIMMs ~$400 each (18 x 8GB = 144GB for $7,200; 64 x 8GB = 512GB for $26K); 16GB DIMMs ~$1,100 each (12 x 16GB = 192GB for $13K; 64 x 16GB = 1TB for $70K)
* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB
† Dell R910 and HP DL580G7 have different PCI-E configurations
‡ ProLiant DL980G7 can have 3 IOHs for additional PCI-E slots

Software Stack: Operating System
- Windows Server 2003 RTM, SP1: network limitations (at default settings) impact OLTP; see the Scalable Networking Pack (KB 912222)
- Windows Server 2008
- Windows Server 2008 R2 (64-bit only): breaks the 64-logical-processor limit; search: MSI-X; NUMA IO enhancements?
- Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server
- Don't try to do DW on SQL Server 2000

SQL Server Version
- SQL Server 2000: serious disk IO limitations (1GB/sec?); problematic parallel execution plans
- SQL Server 2005 (fixed most S2K problems): 64-bit on X64 (Opteron and Xeon); SP2 brought a performance improvement of ~10%(?)
- SQL Server 2008 & R2: compression, filtered indexes, etc.; star join; parallel query to partitioned tables

Configuration (a hedged sketch follows)
- SQL Server startup parameter: -E
- Trace flags 834, 836, 2301
- Auto date correlation: given Order date < A and Ship date > A, the correlation implies Order date > A - C and Ship date < A + C (C being the correlation window)
- Port affinity – mostly OLTP
- Dedicated processor for the log writer?
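A hedged sketch of the configuration items above, assuming SQL Server 2005/2008. The trace flags would normally be set as -T startup parameters alongside -E in SQL Server Configuration Manager; the database name below is a placeholder:

    -- Startup parameters (set in SQL Server Configuration Manager):
    --   -E     allocate more extents per file of a filegroup (helps DW scans)
    --   -T834  use Windows large-page allocations for the buffer pool (64-bit)
    --   -T836  size the buffer pool at startup from max server memory (32-bit)
    --   -T2301 enable advanced decision-support optimizations
    DBCC TRACESTATUS(834, 836, 2301);  -- confirm which flags are active

    -- The auto date correlation item corresponds to this database option,
    -- which lets the optimizer infer the implied date bounds noted above:
    ALTER DATABASE YourDW SET DATE_CORRELATION_OPTIMIZATION ON;  -- 'YourDW' is a placeholder name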