Storage Performance 2013
Joe Chang
www.qdpma.com
#SQLSatRiyadh

About Joe
• SQL Server consultant since 1999
• Query Optimizer execution plan cost formulas (2002)
• True cost structure of SQL plan operations (2003?)
• Database with distribution statistics only, no data (2004)
• Decoding statblob/stats_stream – writing your own statistics
• Disk IO cost structure
• Tools for system monitoring, execution plan analysis – see ExecStats on www.qdpma.com

Storage Performance Chain
(Diagram: SQL Server Engine → SQL Server Extent → SQL Server File → Direct-attach/SAN → Pool → SAS/FC → RAID Group → SAS HDD/SSD)
• All elements must be correct – no weak links
• Perfect on 6 out of 7 elements and 1 not correct = bad IO performance

Storage Performance Overview
• System architecture – PCI-E, SAS, HBA/RAID controllers
• SSD, NAND, flash controllers, standards – form factors, endurance, ONFI, interfaces
• SLC, MLC performance
• Storage system architecture – direct-attach, SAN
• Database – SQL Server files, filegroups

Sandy Bridge EN & EP
(Block diagrams: EN and EP two-socket layouts – cores, LLC, QPI links, PCI-E lanes, DMI2, PCH)
• Xeon E5-2400, Socket B2 (1356 pins): 1 QPI 8 GT/s, 3 DDR3 memory channels, 24 PCI-E 3.0 lanes @ 8GT/s, DMI2 (x4 @ 5GT/s)
– E5-2470 8 core 2.3GHz 20M 8.0GT/s (3.1)
– E5-2440 6 core 2.4GHz 15M 7.2GT/s (2.9)
– E5-2407 4c/4t 2.2GHz 10M 6.4GT/s (n/a)
• Xeon E5-2600, Socket R (2011 pins): 2 QPI, 4 DDR3, 40 PCI-E 3.0 lanes @ 8GT/s, DMI2
• Two EP sockets: 80 PCI-E gen 3 lanes + 8 gen 2 possible
– Dell T620: 4 x16, 2 x8, 1 x4
– Dell R720: 1 x16, 6 x8
– HP DL380 G8p: 2 x16, 3 x8, 1 x4
– Supermicro X9DRX+F: 10 x8, 1 x4 g2
• Model, cores, clock, LLC, QPI, (Turbo):
– E5-2690 8 core 2.9GHz 20M 8.0GT/s (3.8)*
– E5-2680 8 core 2.7GHz 20M 8.0GT/s (3.5)
– E5-2670 8 core 2.6GHz 20M 8.0GT/s (3.3)
– E5-2667 6 core 2.9GHz 15M 8.0GT/s (3.5)*
– E5-2665 8 core 2.4GHz 20M 8.0GT/s (3.1)
– E5-2660 8 core 2.2GHz 20M 8.0GT/s (3.0)
– E5-2650 8 core 2.0GHz 20M 8.0GT/s (2.8)
– E5-2643 4 core 3.3GHz 10M 8.0GT/s (3.5)*
– E5-2640 6 core 2.5GHz 15M 7.2GT/s (3.0)
• Disable cores in BIOS/UEFI?

Xeon E5-4600
(Block diagram: four-socket E5-4600 – QPI links between sockets, PCI-E lanes per socket, DMI2)
• Socket R (2011 pins): 2 QPI, 4 DDR3, 40 PCI-E 3.0 lanes @ 8GT/s, DMI2 per socket
• Four sockets: 160 PCI-E gen 3 lanes + 16 gen 2 possible
– Dell R820: 2 x16, 4 x8, 1 internal
– HP DL560 G8p: 2 x16, 3 x8, 1 x4
– Supermicro X9QR: 7 x16, 1 x8
• Model, cores, clock, LLC, QPI, (Turbo):
– E5-4650 8 core 2.70GHz 20M 8.0GT/s (3.3)*
– E5-4640 8 core 2.40GHz 20M 8.0GT/s (2.8)
– E5-4620 8 core 2.20GHz 16M 7.2GT/s (2.6)
– E5-4617 6c/6t 2.90GHz 15M 7.2GT/s (3.4)
– E5-4610 6 core 2.40GHz 15M 7.2GT/s (2.9)
– E5-4607 6 core 2.20GHz 12M 6.4GT/s (n/a)
– E5-4603 4 core 2.00GHz 10M 6.4GT/s (n/a)
• High-frequency 6-core gives up HT; no high-frequency 4-core

PCI-E, SAS & RAID CONTROLLERS

PCI-E gen 1, 2 & 3
Gen      Raw bit rate  Unencoded  BW per direction  BW x8 per direction  Net BW x8
PCIe 1   2.5GT/s       2Gbps      ~250MB/s          2GB/s                1.6GB/s
PCIe 2   5.0GT/s       4Gbps      ~500MB/s          4GB/s                3.2GB/s
PCIe 3   8.0GT/s       8Gbps      ~1GB/s            8GB/s                6.4GB/s?
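A quick sanity check on the net x8 column (my arithmetic, using the encoding schemes and the ~20% protocol-overhead allowance cited on the next slide):

```latex
% PCIe 2.0 x8: 8b/10b encoding, then ~20% protocol overhead
5\,\text{GT/s} \times \tfrac{8}{10} = 4\,\text{Gb/s} = 0.5\,\text{GB/s per lane};\quad
8 \times 0.5\,\text{GB/s} = 4\,\text{GB/s};\quad 4 \times 0.8 \approx 3.2\,\text{GB/s}

% PCIe 3.0 x8: 128b/130b encoding, same ~20% allowance
8\,\text{GT/s} \times \tfrac{128}{130} \approx 7.9\,\text{Gb/s} \approx 0.98\,\text{GB/s per lane};\quad
8 \times 0.98\,\text{GB/s} \approx 7.9\,\text{GB/s};\quad 7.9 \times 0.8 \approx 6.3\text{–}6.4\,\text{GB/s}
```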
• PCIe 1.0 & 2.0 encoding scheme: 8b/10b; PCIe 3.0 encoding scheme: 128b/130b
• Simultaneous bi-directional transfer
• Protocol overhead – sequence/CRC, header – 22 bytes (20%?)
• Net realizable bandwidth appears to be ~20% less (1.6GB/s of 2.0GB/s)
• Adaptec Series 7: 6.6GB/s, 450K IOPS
(Diagram: PCI-E packet structure)

PCIe Gen2 & SAS/SATA 6Gbps
• SATA 6Gbps – single lane, net BW 560MB/s
• SAS 6Gbps – x4 lanes, net BW 2.2GB/s
– Dual-port, SAS protocol only; not supported by SATA
(Diagram: PCIe g2 x8 HBA at 3.2GB/s feeding two SAS x4 6G ports at 2.2GB/s each, dual paths A and B)
• Some bandwidth mismatch is OK, especially on the downstream side

PCIe 3 & SAS
• SAS 12Gbps – coming soon? Slowly? Infrastructure will take more time
(Diagrams: PCIe g3 x8 HBA with 4 SAS x4 6Gbps ports; PCIe g3 x8 HBA with 2 SAS x4 12Gbps ports feeding SAS x4 6Gbps expanders)
• A PCIe 3.0 x8 HBA can drive 2 SAS x4 12Gbps ports, or 4 SAS x4 6Gbps ports if the HBA can support 6GB/s

PCIe Gen3 & SAS 6Gbps
• LSI 12Gbps SAS 3008

PCIe RAID Controllers?
• 2 x4 SAS 6Gbps ports (2.2GB/s per x4 port)
– 1st generation PCIe 2 – 2.8GB/s?
– Adaptec: PCIe g3 can do 4GB/s
– 3 x4 SAS 6Gbps ports would bandwidth-match PCIe 3.0 x8
• 6 x4 SAS 6Gbps – Adaptec Series 7, PMC
– 1 chip: x8 PCIe g3 and 24 SAS 6Gbps lanes
– Because they could
(Diagram: PCIe g3 x8 HBA with 6 SAS x4 6G ports)

SSD, NAND, FLASH CONTROLLERS

SSD Evolution
• HDD replacement – using existing HDD infrastructure
– PCI-E card form factor lacks expansion flexibility
• Storage system designed around SSD
– PCI-E interface with HDD-like form factor?
– Storage enclosure designed for SSD
• Rethink computer system memory & storage
– Re-do the software stack too!

SFF-8639 & Express Bay
• SCSI Express – storage over PCI-E, NVMe

New Form Factors – NGFF
• Enterprise 10K/15K HDD – 15mm
• SSD storage enclosure could be 1U, 75 x 5mm devices?
• SATA Express Card (NGFF), Crucial mSATA, M.2

SSD – NAND Flash
• NAND – SLC, MLC regular and high-endurance
– eMLC could mean endurance MLC or embedded MLC – these differ
• Controller interfaces NAND to SATA or PCI-E
• Form factor
– SATA/SAS interface in 2.5in HDD or new form factor
– PCI-E interface and form factor, or HDD-like form factor
– Complete SSD storage system

NAND Endurance
• Intel – High Endurance Technology (HET) MLC

NAND Endurance – Write Performance
(Chart: relative endurance and write performance for SLC, MLC-e, MLC)
• Cost structure: MLC = 1, MLC-e = 1.3, SLC = 3
• Process dependent: 34nm, 25nm, 20nm – write perf?

NAND P/E cycles – Micron; NAND P/E cycles – IBM
(Charts: program/erase cycle ratings by process node)
• 34 or 25nm MLC NAND is probably good
• Database can support the cost structure

Write Endurance
• Vendors commonly cite a single spec for a range of models (120, 240, 480GB)
– Should it vary with raw capacity? Depends on over-provisioning?
• 3-year life is OK for the MLC cost structure, maybe even 2 years
• MLC 20TB lifetime = 10GB/day for 2000 days (5+ years), or 20GB/day for 3 years
• Vendors now cite 72TB write endurance for 120-480GB capacities?

NAND
• SLC – fast writes, high endurance
• eMLC – slow writes, medium endurance
• MLC – medium writes, low endurance
• MLC cost structure of $1/GB @ 25nm – eMLC 1.4X, SLC 2X?

ONFI – Open NAND Flash Interface organization
• 1.0 (2006) – 50MB/s
• 2.0 (2008) – 133MB/s
• 2.1 (2009) – 166 & 200MB/s
• 3.0 (2011) – 400MB/s – Micron has 200 & 333MHz products
• ONFI 1.0 – 6 channels to support 3Gbps SATA (260MB/s); ONFI 2.0 – 4+ channels to support 6Gbps SATA (560MB/s)

NAND write performance
• MLC: 85MB/s per 4-die channel (128GB)
• 340MB/s over 4 channels (512GB)?
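Making the channel-matching arithmetic explicit (my numbers, derived from the ONFI speeds and net interface bandwidths above):

```latex
\text{ONFI 1.0: } 6 \times 50\,\text{MB/s} = 300\,\text{MB/s} \;\gtrsim\; 260\,\text{MB/s (SATA 3Gbps net)}
\text{ONFI 2.0: } 4 \times 133\,\text{MB/s} \approx 530\,\text{MB/s} \;\approx\; 560\,\text{MB/s (SATA 6Gbps net)}
\text{MLC writes: } 4 \times 85\,\text{MB/s} = 340\,\text{MB/s over 4 channels}
\text{ONFI 3.0: } 16 \times 400\,\text{MB/s} = 6.4\,\text{GB/s} \;\approx\; \text{PCIe 3.0 x8 net}
```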
Controller Interface: PCIe vs. SATA
(Diagram: NAND packages behind a controller with a PCIe or SATA uplink)
• Some bandwidth mismatch/overkill is OK
• ONFI 2 – 8 channels at 133MHz to SATA 6Gbps (560MB/s) is a good match
• But ONFI 3.0 overwhelms SATA 6Gbps
– 6-8 channels at 400MB/s to match 2.2GB/s x4 SAS?
– 16+ channels at 400MB/s to match 6.4GB/s x8 PCIe 3
• CPU access efficiency and scaling – Intel & NVM Express
(Diagram: PCIe controller vs. SATA controller, each with DRAM and multiple NAND channels)

PCIe NAND Controller Vendors
Vendor     Channels  PCIe interface
IDT        32        x8 Gen3, NVMe
Micron     32        x8 Gen2
Fusion-IO  3 x 4?    x8 Gen2?

SATA & PCI-E SSD Capacities
• 64 Gbit MLC NAND die, 150mm², 25nm
• 8 x 64 Gbit die in 1 package = 64GB
• SATA controller – 8 channels, 8 packages x 64GB = 512GB
• PCI-E controller – 32 channels x 64GB = 2TB
(Die photos: 2 x 32 Gbit 34nm, 1 x 64 Gbit 25nm, 1 x 64 Gbit 29nm)

PCI-E vs. SATA/SAS
• SATA/SAS controllers have 8 NAND channels
– No economic benefit in fewer channels?
– 8 channels is a good match for 50MB/s NAND to SATA 3Gbps (approx 280MB/s realizable BW)
– 8 channels is also a good match for 100MB/s NAND to SATA 6Gbps (560MB/s realizable BW)
– NAND is now at 200 & 333MB/s
• PCI-E – 32 channels practical (1500 pins)
– 333MHz NAND is a good match to gen 3 x8 – 6.4GB/s BW

Crucial/Micron P400m & P400e (preliminary – need to update)

Crucial P400m            100GB      200GB      400GB
Raw capacity             168GB      336GB      672GB
Seq read (up to)         380MB/s    380MB/s    380MB/s
Seq write (up to)        200MB/s    310MB/s    310MB/s
Random read              52K        54K        60K
Random write             21K        26K        26K
Endurance (2M-hr MTBF)   1.75PB     3.0PB      7.0PB
Price                    $300?      $600?      $1000?

Crucial P400e            100GB      200GB      400GB
Raw capacity             128GB      256GB      512GB
Seq read (up to)         350MB/s    350MB/s    350MB/s
Seq write (up to)        140MB/s    140MB/s    140MB/s
Random read              50K        50K        50K
Random write             7.5K       7.5K       7.5K
Endurance (1.2M-hr MTBF) 175TB      175TB      175TB
Price                    $176       $334       $631

• P410m (SAS) specs are slightly different
• EE (endurance) MLC: higher endurance, write perf not lower than MLC?

Crucial m4 & m500 (preliminary – need to update)

Crucial m500             120GB      240GB      480GB      960GB
Raw capacity             128GB      256GB      512GB      1024GB
Seq read (up to)         500MB/s    500MB/s    500MB/s
Seq write (up to)        130MB/s    250MB/s    400MB/s
Random read              62K        72K        80K
Random write             35K        60K        80K
Endurance (1.2M-hr MTBF) 72TB       72TB       72TB
Price                    $130       $220       $400       $600

Crucial m4               128GB      256GB      512GB
Raw capacity             128GB      256GB      512GB
Seq read (up to)         415MB/s    415MB/s    415MB/s
Seq write (up to)        175MB/s    260MB/s    260MB/s
Random read              40K        40K        40K
Random write             35K        50K        50K
Endurance                72TB       72TB       72TB
Price                    $112       $212       $400

Micron & Intel SSD Pricing (2013-02)
(Chart: price ($0–$1,000) vs. capacity class 100/128, 200/256, 400/512GB for m400, P400e, P400m, S3700)
• P400m raw capacities are 168, 336 and 672GB (pricing retracted)
• Intel SSD DC S3700 pricing: $235, $470, $940 and $1880 (800GB) respectively

4K Write IOPS
(Chart: 4K write K IOPS, 0-60K, vs. capacity class 100/128, 200/256, 400/512GB for m400, P400e, P400m, S3700)

SSD Summary
• MLC is possible with careful write strategy
– Partitioning to minimize index rebuilds
– Avoid full database restore to SSD
• Endurance (HET) MLC – write perf?
– Standard DB practices work – but avoid frequent index defrags?
• SLC – only for extreme write intensive?
– Lower volume product – higher cost
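Relating the Write Endurance figures to this summary (my arithmetic on the deck's numbers):

```latex
20\,\text{TB rated writes} \div 10\,\text{GB/day} = 2000\ \text{days}\ (>5\ \text{years});\quad
20\,\text{TB} \div 20\,\text{GB/day} = 1000\ \text{days}\ (\approx 3\ \text{years})

72\,\text{TB rated writes} \div 20\,\text{GB/day} = 3600\ \text{days}\ (\approx 10\ \text{years});\quad
72\,\text{TB} \div 3\ \text{years} \approx 65\,\text{GB/day}
```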
DIRECT ATTACH STORAGE

Full IO Bandwidth
(Diagram: 2-socket system, 192GB per socket; 10 PCIe g3 x8 slots holding RAID controllers with mixed SSD + HDD, InfiniBand, 10GbE; misc devices on PCIe x4)
• 10 PCIe g3 x8 slots possible – Supermicro only
– HP, Dell systems have 5-7 x8+ slots + 1 x4?
• 4GB/s per slot with 2 x4 SAS, 6GB/s with 4 x4
• Mixed SSD + HDD – reduce wear on MLC
• Misc devices on 2 x4 PCIe g2: internal boot disks, 1GbE or 10GbE, graphics

System Storage Strategy
(Diagram: 4 RAID controllers, each with SSD and HDD, plus 10GbE and IB)
• Dell & HP only have 5-7 slots; 4 controllers @ 4GB/s each is probably good enough?
• Few practical products can use PCIe G3 x16 slots
• Capable of 16GB/s with initial capacity – 4 HBAs at 4-6GB/s each
– With allowance for capacity growth
– And mixed SSD + HDD

Clustered SAS Storage
(Diagram: Node 1 and Node 2, each with multiple HBAs, attached to Dell MD3220 enclosures with dual controllers – IOC + 2GB cache, PCIe switch, SAS expander – holding SSD and HDD)
• Dell MD3220 supports clustering
• Up to 4 nodes without an external switch (extra nodes not shown)

Alternate SSD/HDD Strategy
(Diagram: primary system with PCI-E SSD and RAID controllers; backup system with HDD)
• Primary system – all SSD for data & tempdb; logs may be on HDD
• Secondary system – HDD for backup and restore testing

System Storage – Mixed SSD + HDD
(Diagram: 4 HBAs, each driving SSD and HDD volumes on shared channels)
• Each RAID group/volume should not exceed the 2GB/s BW of x4 SAS; 2-4 volumes per x8 PCIe G3 slot
• SATA SSD: read 350-500MB/s, write 140MB/s+; 8 per volume allows for some overkill
• 16 SSD per RAID controller; 64 SATA/SAS SSDs to deliver 16-24GB/s
• The 4-HDD-per-volume rule does not apply
• HDD for local database backup, restore tests, and DW flat files
• SSD & HDD on a shared channel – simultaneous bi-directional IO

SSD/HDD System Strategy
• MLC is possible with careful write strategy
– Partitioning to minimize index rebuilds
– Avoid full database restore to SSD
• Hybrid SSD + HDD system, full-duplex signalling
• Endurance (HET) MLC – write perf?
– Standard DB practices work; avoid index defrags
• SLC – only for extreme write intensive?
– Lower volume product – higher cost
• HDD – for restore testing
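A rough check on the 16-24GB/s target from the Mixed SSD + HDD slide above (my arithmetic from the deck's figures): the HBA/SAS side, not the SSDs, is the limiter.

```latex
\text{Devices: } 64 \times 350\text{–}500\,\text{MB/s} \approx 22\text{–}32\,\text{GB/s}
\text{HBAs: } 4 \times 4\,\text{GB/s (2 x4 SAS ports each)} = 16\,\text{GB/s};\quad
4 \times 6\,\text{GB/s (PCIe g3 x8 limit)} = 24\,\text{GB/s}
```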
Storage Infrastructure – designed for HDD
(Diagram: 2U enclosure, 24 x 15mm bays; each SAS expander has 2 x4 ports to hosts, 1 x4 for expansion, 24 x1 to disks; expansion ports not shown)
• 2 SAS expanders for dual-port support
– 1 x4 upstream (to host), 1 x4 downstream (expansion)
– 24 x1 for bays

Mixed HDD + SSD Enclosure
• Current 2U: 24 x 15mm = 360mm + spacing
• Proposed: 16 x 15mm = 240mm + 16 x 7mm = 120mm

Enclosure: current 24 x 15mm and proposed
(Diagram: host with PCIe x8 HBAs connected by SAS x4 6Gbps (2.2GB/s) and SAS x4 12Gbps (4GB/s) links to the enclosure SAS expanders)
• Current 2U enclosure, 24 x 15mm bays – HDD or SSD
– 2 SAS expanders, 32 lanes each: 4 lanes upstream to host, 4 lanes downstream for expansion, 24 lanes for bays
– 2 RAID groups for SSD, 2 for HDD; 1 SSD volume on path A, 1 SSD volume on path B
• New SAS 12Gbps: 16 x 15mm + 16 x 7mm bays
– 2 SAS expanders, 40 lanes each: 4 lanes upstream to host, 4 lanes downstream for expansion, 32 lanes for bays

Alternative Expansion
(Diagram: host PCIe x8 HBA with SAS x4 links daisy-chained through expanders in Enclosures 1-4)
• Each SAS expander – 40 lanes: 8 lanes upstream to host with no expansion, or 4 lanes upstream and 4 lanes downstream for expansion; 32 lanes for bays

PCI-E with Expansion
(Diagram: host PCIe x8 links to a PCI-E switch feeding SAS HBAs and expanders; SAS x4 6Gbps = 2.2GB/s; Express bay form factor?)
• PCI-E slot SSD is suitable for a known, fixed capacity
• 48 & 64 lane PCI-E switches available – few x8 ports or many x4 ports

Enclosure for SSD (+ HDD?)
• 2 x4 upstream on each expander – 4GB/s
– No downstream ports for expansion?
• 32 ports for device bays – 16 SSD (7mm) + 16 HDD (15mm)
• 40 lanes total with no expansion; 48 lanes with expansion

Large SSD Array
• Large number of devices, large capacity
– Downstream from the CPU has excess bandwidth
• Do not need SSD firmware peak performance
– 1) no stoppages, 2) consistency is nice
• Mostly static data – some write intensive
– Careful use of partitioning to avoid index rebuilds and defragmentation
– If 70% is static, 10% is write intensive
• Does wear leveling work?

DATABASE – SQL SERVER

Database Environment
• OLTP + DW databases are very high value
– Software license + development cost is huge
– 1 or more full-time DBAs, several application developers, and help desk personnel
– Can justify any reasonable expense
– Full knowledge of the data (where the writes are)
– Full control of the data (where the writes are)
– Can adjust practices to avoid writes to SSD

Database – Storage Growth
• 10GB per day data growth – big company
– 10M items at 1KB per row (or 4 x 250-byte rows)
– 18TB over 5 years (1831 days)
– Database log can stay on HDD
• Heavy system – 64-128 x 256/512GB (raw) SSD
– Each SSD can support 20GB/day (36TB lifetime?)
• With partitioning – few full index rebuilds (see the sketch below)
• Can replace MLC SSD every 2 years if required
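A minimal T-SQL sketch of the partitioning idea above: only the active partition is ever rebuilt, so index maintenance touches a small slice of the SSDs. The table, filegroup name and monthly boundary values are hypothetical, not from the deck.

```sql
-- Hypothetical monthly partitioning; names and boundary values are illustrative.
CREATE PARTITION FUNCTION pf_OrderDate (datetime)
AS RANGE RIGHT FOR VALUES ('20130101', '20130201', '20130301');

CREATE PARTITION SCHEME ps_OrderDate
AS PARTITION pf_OrderDate ALL TO (FG_A);   -- FG_A = a data filegroup spread over all volumes

CREATE TABLE dbo.Orders (
    OrderID   bigint   NOT NULL,
    OrderDate datetime NOT NULL,
    Amount    money    NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderDate, OrderID)
) ON ps_OrderDate (OrderDate);

-- Rebuild only the newest partition instead of the whole multi-TB table.
ALTER INDEX PK_Orders ON dbo.Orders REBUILD PARTITION = 4;
```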
Extra Capacity – Maintenance
• Storage capacity will be 2-3X the database size
– It would be really stupid if you cannot update the application for lack of space to modify a large table
• SAN environment
– Only the required storage capacity is allocated
– May not be able to perform maintenance operations if the SAN admin does not allocate extra space

SSD/HDD Component Pricing 2013
• MLC consumer: <$1.0K/TB
• MLC Micron P400e: <$1.2K/TB
• MLC endurance: <$2.0K/TB
• SLC: $4K/TB???
• HDD 600GB 10K: $400

Database Storage Cost
• 8 x 256GB (raw) SSD per x4 SAS channel = 2TB
• 2 x4 ports per RAID controller = 4TB per controller
• 4 RAID controllers per 2-socket system = 16TB
• 32TB with 512GB SSD, 64TB with 1TB SSD
– 64 SSD per system at $250 (MLC): $16K
– 64 HDD 10K 600GB at $400: $26K
– Server 2 x E5, 24 x 16GB, qty 2: $12K each
– SQL Server 2012 EE at $6K x 16 cores: $96K
• HET MLC and even SLC premium is OK
• Server/Enterprise premium – high validation effort, low volume, high support expectations

OLTP & DW
• OLTP – backup to local HDD
– Super-fast backup: read 10GB/s, write 3GB/s (RAID 5)
– Writes to data are blocked during backup
– Recovery requires log replay
• DW – example: 10TB data, 16TB SSD
– Flat files on HDD
– Tempdb will generate intensive writes (1TB)
• Database (real) restore testing
– Force transaction roll forward/back, i.e., need an HDD array

SQL Server Storage Configuration
• IO system must have massive IO bandwidth
– IO over several channels
• Database must be able to use all channels simultaneously
– Multiple files per filegroup
• Volumes / RAID groups on each channel
– Each volume comprised of several devices

HDD, RAID versus SQL Server
• HDD – pure sequential IO is not practical, impossible to maintain
– Large block 256K is good enough; 64K OK
• RAID controller – 64K to 256K stripe size
• SQL Server – default extent allocation: 64K per file
– With -E, 4 consecutive extents – why not 16???

File Layout – Physical View
(Diagram: 2-socket system, 192GB per socket, 4 HBAs plus 10GbE)
• Each filegroup and tempdb has 1 data file on every data volume (see the T-SQL sketch below)
• IO to any object is distributed over all paths and all disks

Filegroup & File Layout
(Layout: Disks 2-9, one per controller port from Controller 1 Port 0 through Controller 4 Port 1; each disk holds one FileGroup A file, one FileGroup B file and one tempdb file – files 1 through 8 respectively)
• As shown, 2 RAID groups per controller, 1 per port; can be 4 RAID groups/volumes per controller
• OS and log disks not shown
• Each filegroup has 1 file on each data volume
• Each object is distributed across all data "disks"
• Tempdb data files share the same volumes
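A minimal sketch of this layout in T-SQL, assuming hypothetical drive letters (E:, F: for data volumes, L: for the log) and showing only two of the eight volumes to keep it short:

```sql
-- One data file per filegroup on each data volume, so every filegroup
-- (and therefore every object) spans all controllers and paths.
CREATE DATABASE HeavyDB
ON PRIMARY
    (NAME = HeavyDB_sys, FILENAME = 'C:\SQLData\HeavyDB_sys.mdf', SIZE = 1GB),
FILEGROUP FG_A
    (NAME = FG_A_1, FILENAME = 'E:\SQLData\FG_A_1.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
    (NAME = FG_A_2, FILENAME = 'F:\SQLData\FG_A_2.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
FILEGROUP FG_B
    (NAME = FG_B_1, FILENAME = 'E:\SQLData\FG_B_1.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
    (NAME = FG_B_2, FILENAME = 'F:\SQLData\FG_B_2.ndf', SIZE = 100GB, FILEGROWTH = 10GB)
LOG ON
    (NAME = HeavyDB_log, FILENAME = 'L:\SQLLog\HeavyDB_log.ldf', SIZE = 50GB);
```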
RAID versus SQL Server Extents
(Diagram: 4 volumes – Disks 2-5, 1112GB each, on Controller 1/2 Ports 0/1 – with extents 1-48 striped one extent per file in round robin)
• Default: allocate extent 1 from file 1, extent 2 from file 2, and so on
• Disk IO is 64K – only 1 disk in each RAID group is active

Consecutive Extents (-E)
(Diagram: same 4 volumes, extents allocated 4 at a time per file)
• Allocate 4 consecutive extents from each file; the OS issues 256K disk IO
• Each HDD in the RAID group sees 64K IO; up to 4 disks in the RAID group get IO

Storage Summary
• OLTP – endurance MLC or consumer MLC?
• DW – MLC with higher over-provisioning
• QA – consumer MLC or endurance MLC?
• Tempdb – possibly SLC
• Single log – HDD; multiple logs – SSD?
• Backups/test restore/flat files – HDD
• No caching, no auto-tiering
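Since the file-layout slides give tempdb a data file on every data volume (and the summary above puts tempdb on its own SSD budget), a minimal sketch of adding those files; paths and sizes are hypothetical, and the files should be kept equally sized so allocation stays even:

```sql
-- Add one tempdb data file per data volume (drive letters are hypothetical).
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'F:\SQLData\tempdb_2.ndf', SIZE = 20GB, FILEGROWTH = 0);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = 'G:\SQLData\tempdb_3.ndf', SIZE = 20GB, FILEGROWTH = 0);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev4, FILENAME = 'H:\SQLData\tempdb_4.ndf', SIZE = 20GB, FILEGROWTH = 0);
```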
SAN

Software Cache + Tier
• Cache + auto-tier is a good idea if: 1) no knowledge, 2) no control
• In the database we have: 1) full knowledge, 2) full control
– Virtual file stats (query below), filegroups, partitioning

Common SAN Vendor Configuration
(Diagram: Node 1 and Node 2, 768GB each, through 8Gbps FC or 10Gbps FCoE switches to SP A and SP B with 24GB cache each, x4 SAS 2GB/s back end; main volume, log volume, SSD, 10K and 7.2K drives, hot spares)
• Multi-path IO: preferred port / alternate port
• Single large volume for data, additional volumes for log, tempdb, etc.
• All data IO on a single FC port – 700MB/s IO bandwidth
• Path and component fault tolerance, but poor IO performance

Multiple Paths & Volumes
(Diagram: both nodes with multiple quad-port FC HBAs and local SSD, through dual 8Gb FC switches to SP A and SP B – 24GB each, x4 SAS 2GB/s – serving Data volumes 1-16, SSD volumes 1-4 and Log volumes 1-4)
• Multiple local SSD for tempdb
• Multiple quad-port FC HBAs
• Many SAS ports
• Data files must also be evenly distributed
• Optional SSD volumes

8Gbps FC rules
• 4-5 HDD RAID groups/volumes
– SQL Server with -E only allocates 4 consecutive extents
• 2+ volumes per FC port
– Target 700MB/s per 8Gbps FC port
• SSD volumes
– Limited by 700-800MB/s per 8Gbps FC port
– Too many ports required for serious bandwidth
– Management headache from too many volumes

SQL Server
• Table scan on a heap generates 512K IO – easy to hit 100MB/s per disk
• Table scan on a (clustered) index generates 64K IO – 30-50MB/s per disk likely
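The virtual file stats mentioned above are the easiest way to confirm that IO really is spread evenly across files and paths, and whether reads are closer to the 64K or 512K sizes just discussed. A minimal sketch (figures are cumulative since startup; diff two snapshots for MB/s):

```sql
-- Cumulative IO per data file: is the load even across volumes/paths,
-- and what is the average read size?
SELECT  DB_NAME(vfs.database_id)                    AS database_name,
        mf.name                                     AS file_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_bytes_read / 1048576             AS read_MB,
        vfs.num_of_bytes_read / NULLIF(vfs.num_of_reads, 0) AS avg_bytes_per_read,
        vfs.io_stall_read_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON mf.database_id = vfs.database_id
       AND mf.file_id = vfs.file_id
ORDER BY vfs.num_of_bytes_read DESC;
```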
EMC VNX 5300 FT DW Reference Architecture – iSCSI & file structure
(Diagram: multiple x4 / 10GbE paths, RJ45 and SFP+; Controller 1 and Controller 2 each serving DB1 and DB2 files, with each database's files split across both controllers)

EMC VMAX – original and 2nd generation
(Diagram: front-end and back-end ports, CPU complex, global memory, CMI-II interconnect)
• Original: 2.3GHz Xeon (Harpertown), 16 CPU cores, 128GB cache memory (maximum), dual Virtual Matrix, PCIe Gen1
• 2nd gen: 2.8GHz Xeon with turbo (Westmere), 24 CPU cores, 256GB cache memory (maximum), quad Virtual Matrix, PCIe Gen2

EMC VMAX 10K, 20K & 40K
(Diagrams: VMAX directors and engines – FC HBAs, SAS, IOH and VMI connections into the Virtual Matrix)
• VMAX 10K (new): up to 4 engines, 1 x 6-core 2.8GHz per director, 50GB/s VM BW?, 16 x 8Gbps FC per engine
• VMAX 20K engine: 4 x quad-core 2.33GHz, 128GB; Virtual Matrix BW 24GB/s
– System: 8 engines, 1TB, VM BW 192GB/s, 128 FE ports
• VMAX 40K engine: 4 x six-core 2.8GHz, 256GB; Virtual Matrix BW 50GB/s
– System: 8 engines, 2TB, VM BW 400GB/s, 128 FE ports
• RapidIO IPC: 3.125GHz, 2.5Gb/s after 8b/10b, 4 lanes per connection
– 10Gb/s = 1.25GB/s per direction, 2.5GB/s full duplex; 4 connections per engine = 10GB/s
• 36 PCI-E lanes per IOH, 72 combined: 8 FE, 8 BE, 16 VMI 1 / 32 VMI 2

SQL Server Default Extent Allocation
(Diagram: extents 1-48 striped round robin across data files 1-4)
• Allocate 1 extent per file in round robin; proportional fill
• EE/SE table scan tries to stay 1024 pages ahead?
• SQL can read 64 contiguous pages from 1 file; the storage engine reads index pages serially in key order
• Partitioned table support for heap organization desired?

SAN
(Diagram: Node 1 and Node 2, 768GB each, through 8Gb FC switches to SP A and SP B – 24GB each, x4 SAS 2GB/s – serving Data volumes 1-16, a log volume, SSDs 1-8, a log SSD and 10K drives)

Clustered SAS
(Diagram: nodes with multiple HBAs connected through SAS switches to enclosures with dual controllers – IOC, 2GB cache, PCIe switch, SAS expander – holding SSD and HDD; SAS in/out ports for expansion; 10GbE and IB for networking)

Fusion-IO ioScale
(Product slide)
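Round-robin/proportional fill (from the Default Extent Allocation slide above) only stays even when the files in a filegroup have similar size and free space. A minimal check, run in the database of interest:

```sql
-- Size and space used per data file; keep files within a filegroup equal
-- so proportional fill spreads new extents evenly across all volumes.
SELECT  fg.name                                   AS filegroup_name,
        df.name                                   AS file_name,
        df.size / 128                             AS size_MB,        -- size is in 8KB pages
        FILEPROPERTY(df.name, 'SpaceUsed') / 128  AS space_used_MB
FROM    sys.database_files AS df
LEFT JOIN sys.filegroups AS fg
       ON fg.data_space_id = df.data_space_id
WHERE   df.type_desc = 'ROWS'
ORDER BY fg.name, df.name;
```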