Storage Performance 2013
Joe Chang
www.qdpma.com
#SQLSatRiyadh
About Joe
• SQL Server consultant since 1999
• Query Optimizer execution plan cost formulas (2002)
• True cost structure of SQL plan operations (2003?)
• Database with distribution statistics only, no data (2004)
• Decoding statblob/stats_stream – writing your own statistics
• Disk IO cost structure
• Tools for system monitoring, execution plan analysis
See ExecStats on www.qdpma.com
Storage Performance Chain
• All elements must be correct – no weak links
• Perfect on 6 out of 7 elements and 1 not correct = bad IO performance
[Diagram: the chain – SQL Server engine → SQL Server extent → SQL Server file → direct-attach or SAN pool → SAS/FC → RAID group → SAS HDD/SSD]
Storage Performance Overview
• System Architecture
– PCI-E, SAS, HBA/RAID controllers
• SSD, NAND, Flash Controllers, Standards
– Form Factors, Endurance, ONFI, Interfaces
• SLC, MLC Performance
• Storage system architecture
– Direct-attach, SAN
• Database
– SQL Server Files, FileGroup
Sandy Bridge EN & EP
EN
[Block diagram: two Xeon E5-2400 sockets, 8 cores sharing an LLC each, linked by one QPI; each socket has 3 DDR3 memory channels and 24 PCI-E 3.0 lanes; DMI2 (x4) to the PCH]
Xeon E5-2400, Socket B2, 1356 pins
1 QPI 8 GT/s, 3 DDR3 memory channels
24 PCI-E 3.0 8GT/s, DMI2 (x4 @ 5GT/s)
E5-2470 8 core 2.3GHz 20M 8.0GT/s (3.1)
E5-2440 6 core 2.4GHz 15M 7.2GT/s (2.9)
E5-2407 4c – 4t 2.2GHz 10M 6.4GT/s (n/a)
EP
[Block diagram: two Xeon E5-2600 sockets, 8 cores sharing an LLC each, linked by two QPI; each socket has 4 DDR3 memory channels and 40 PCI-E 3.0 lanes; DMI2 to the PCH]
80 PCI-E gen 3 lanes + 8 gen 2 possible
Dell T620: 4 x16, 2 x8, 1 x4
Dell R720: 1 x16, 6 x8
HP DL380 G8p: 2 x16, 3 x8, 1 x4
Supermicro X9DRX+F: 10 x8, 1 x4 g2
Xeon E5-2600, Socket R, 2011 pins
2 QPI, 4 DDR3, 40 PCI-E 3.0 8GT/s, DMI2
Model, cores, clock, LLC, QPI, (Turbo)
E5-2690 8 core 2.9GHz 20M 8.0GT/s (3.8)*
E5-2680 8 core 2.7GHz 20M 8.0GT/s (3.5)
E5-2670 8 core 2.6GHz 20M 8.0GT/s (3.3)
E5-2667 6 core 2.9GHz 15M 8.0GT/s (3.5)*
E5-2665 8 core 2.4GHz 20M 8.0GT/s (3.1)
E5-2660 8 core 2.2GHz 20M 8.0GT/s (3.0)
E5-2650 8 core 2.0GHz 20M 8.0GT/s (2.8)
E5-2643 4 core 3.3GHz 10M 8.0GT/s (3.5)*
E5-2640 6 core 2.5GHz 15M 7.2GT/s (3.0)
Disable cores in BIOS/UEFI?
Xeon E5-4600
[Block diagram: four Xeon E5-4600 sockets, 8 cores sharing an LLC each, linked by QPI; each socket has 4 DDR3 memory channels and 40 PCI-E 3.0 lanes; DMI2 to the PCH]
Xeon E5-4600, Socket R, 2011 pins
2 QPI, 4 DDR3, 40 PCI-E 3.0 8GT/s, DMI2 per socket
160 PCI-E gen 3 lanes + 16 gen 2 possible
Dell R820: 2 x16, 4 x8, 1 int
HP DL560 G8p: 2 x16, 3 x8, 1 x4
Supermicro X9QR: 7 x16, 1 x8
Model, cores, clock, LLC, QPI, (Turbo)
E5-4650 8 core 2.70GHz 20M 8.0GT/s (3.3)*
E5-4640 8 core 2.40GHz 20M 8.0GT/s (2.8)
E5-4620 8 core 2.20GHz 16M 7.2GT/s (2.6)
E5-4617 6c – 6t 2.90GHz 15M 7.2GT/s (3.4)
E5-4610 6 core 2.40GHz 15M 7.2GT/s (2.9)
E5-4607 6 core 2.20GHz 12M 6.4GT/s (n/a)
E5-4603 4 core 2.00GHz 10M 6.4GT/s (n/a)
No high-frequency 4-core; the high-frequency 6-core (E5-4617) gives up HT
PCI-E, SAS & RAID CONTROLLERS
PCI-E gen 1, 2 & 3
Gen | Raw bit rate | Unencoded | Bandwidth per direction | BW x8 per direction | Net bandwidth per direction x8
PCIe 1 | 2.5GT/s | 2Gbps | ~250MB/s | 2GB/s | 1.6GB/s
PCIe 2 | 5.0GT/s | 4Gbps | ~500MB/s | 4GB/s | 3.2GB/s
PCIe 3 | 8.0GT/s | 8Gbps | ~1GB/s | 8GB/s | 6.4GB/s?
• PCIe 1.0 & 2.0 encoding scheme: 8b/10b
• PCIe 3.0 encoding scheme: 128b/130b
• Simultaneous bi-directional transfer
• Protocol overhead – sequence/CRC, header – 22 bytes (20%?)
[Diagram: PCI-E packet structure]
Net realizable bandwidth appears to be 20% less (1.6GB/s of 2.0GB/s)
Adaptec Series 7: 6.6GB/s, 450K IOPS
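A rough check of the net numbers above (approximate; actual packet overhead varies with payload size):
PCIe 2.0 x8: 5.0 GT/s x 8b/10b = 4 Gbps = 500 MB/s per lane; x8 lanes = 4 GB/s raw; less ~20% protocol overhead ≈ 3.2 GB/s
PCIe 3.0 x8: 8.0 GT/s x 128b/130b ≈ 7.9 Gbps ≈ 985 MB/s per lane; x8 lanes ≈ 7.9 GB/s raw; less ~20% ≈ 6.3-6.4 GB/s (hence the question mark)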
PCIe Gen2 & SAS/SATA 6Gbps
• SATA 6Gbps – single lane, net BW 560MB/s
• SAS 6Gbps, x 4 lanes, net BW 2.2GB/s
– Dual-port, SAS protocol only
• Not supported by SATA
[Diagram: PCIe g2 x8 (3.2GB/s) HBA with two SAS x4 6G ports (2.2GB/s each), dual-port paths A and B]
Some bandwidth mismatch is OK, especially on downstream side
PCIe 3 & SAS
• 12Gbps – coming soon? Slowly?
– Infrastructure will take more time
[Diagram: PCIe g3 x8 HBA with four SAS x4 6G ports, vs. PCIe g3 x8 HBA with two SAS x4 12G ports fanning out through SAS expanders to SAS x4 6Gb links]
PCIe 3.0 x8 HBA: 2 SAS x4 12Gbps ports, or 4 SAS x4 6Gbps ports if the HBA can sustain 6GB/s
PCIe Gen3 & SAS 6Gbps
LSI 12Gbps SAS 3008
PCIe RAID Controllers?
• 2 x4 SAS 6Gbps ports (2.2GB/s per x4 port)
– 1st generation PCIe 2 – 2.8GB/s?
– Adaptec: PCIe g3 can do 4GB/s
– 3 x4 SAS 6Gbps bandwidth match PCIe 3.0 x8
• 6 x4 SAS 6Gbps – Adaptec Series 7, PMC
– 1 Chip: x8 PCIe g3 and 24 SAS 6Gbps lanes
• Because they could
[Diagram: PCIe g3 x8 HBA with six SAS x4 6G ports]
SSD, NAND, FLASH CONTROLLERS
SSD Evolution
• HDD replacement
– using existing HDD infrastructure
– PCI-E card form factor lacks expansion flexibility
• Storage system designed around SSD
– PCI-E interface with HDD like form factor?
– Storage enclosure designed for SSD
• Rethink computer system memory & storage
• Re-do the software stack too!
SFF-8639 & Express Bay
SCSI Express – storage over PCI-E, NVMe
New form factors – NGFF
[Images: enterprise 10K/15K HDD (15mm), SATA Express card (NGFF), Crucial mSATA, M.2]
SSD storage enclosure could be 1U, 75 x 5mm devices?
SSD – NAND Flash
• NAND
– SLC, MLC regular and high-endurance
– eMLC can mean endurance MLC or embedded MLC – the two differ
• Controller interfaces NAND to SATA or PCI-E
• Form factor
– SATA/SAS interface in a 2.5in HDD or new form factor
– PCI-E interface and form factor, or HDD-like form factor
– Complete SSD storage system
NAND Endurance
Intel – High Endurance Technology MLC
NAND Endurance – Write Performance
[Chart: relative endurance and write performance by NAND type – SLC, MLC-e, MLC]
Cost structure: MLC = 1, MLC EE = 1.3, SLC = 3
Process dependent: 34nm, 25nm, 20nm – write perf?
NAND P/E - Micron
34 or 25nm MLC NAND is probably good; a database can support the cost structure
NAND P/E - IBM
34 or 25nm MLC NAND is probably good; a database can support the cost structure
Write Endurance
Vendors commonly cite a single spec for a range of models: 120, 240, 480GB
Should vary with raw capacity? Depends on overprovisioning?
3 year life is OK for the MLC cost structure, maybe even 2 years
MLC 20TB lifetime = 10GB/day for 2000 days (5 years+), or 20GB/day for 3 years
Vendors now cite 72TB write endurance for 120-480GB capacities?
NAND
• SLC – fast writes, high endurance
• eMLC – slow writes, medium endurance
• MLC – medium writes, low endurance
• MLC cost structure of $1/GB @ 25nm
– eMLC 1.4X, SLC 2X?
ONFI
Open NAND Flash Interface organization
• 1.0 2006 – 50MB/s
• 2.0 2008 – 133MB/s
• 2.1 2009 – 166 & 200MB/s
• 3.0 2011 – 400MB/s
– Micron has 200 & 333MHz products
ONFI 1.0 – 6 channels to support 3Gbps SATA (260MB/s)
ONFI 2.0 – 4+ channels to support 6Gbps SATA (560MB/s)
NAND write performance
MLC: 85MB/s per 4-die channel (128GB); 340MB/s over 4 channels (512GB)?
Controller Interface PCIe vs. SATA
[Diagram: NAND packages connected through a controller to PCIe or SATA – multiple lanes?]
Some bandwidth mismatch/overkill is OK
ONFI 2 – 8 channels at 133MHz to SATA 6Gbps (560 MB/s) is a good match
But ONFI 3.0 overwhelms SATA 6Gbps?
6-8 channels at 400MB/s to match 2.2GB/s x4 SAS?
16+ channels at 400MB/s to match 6.4GB/s x8 PCIe 3
CPU access efficiency and scaling: Intel & NVM Express
Controller Interface PCIe vs. SATA
[Diagram: PCIe controller with DRAM and many NAND channels vs. SATA controller with DRAM and fewer NAND channels]
PCIe NAND Controller Vendors
Vendor | Channels | PCIe gen
IDT | 32 | x8 Gen3, NVMe
Micron | 32 | x8 Gen2
Fusion-IO | 3x4? | x8 Gen2?
SATA & PCI-E SSD Capacities
64 Gbit MLC NAND die: 150mm2 at 25nm
8 x 64 Gbit die in 1 package = 64GB
SATA controller – 8 channels, 8 packages x 64GB = 512GB
PCI-E controller – 32 channels x 64GB = 2TB
[Die images: 2 x 32 Gbit 34nm, 1 x 64 Gbit 25nm, 1 x 64 Gbit 20nm]
PCI-E vs. SATA/SAS
• SATA/SAS controllers have 8 NAND channels
– No economic benefit in fewer channels?
– 8 ch. good match for 50MB/s NAND to SATA 3G
• 3Gbps – approx 280MB/s realizable BW
– 8 ch also good match for 100MB/s to SATA 6G
• 6Gbps – 560MB/s realizable BW
– NAND is now at 200 & 333MB/s
• PCI-E – 32 channels practical – 1500 pins
– 333MHz good match to gen 3 x8 – 6.4GB/s BW
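The channel-count arithmetic behind these matches, approximately: 8 channels x 50 MB/s = 400 MB/s against ~280 MB/s net SATA 3Gbps; 8 channels x 100 MB/s = 800 MB/s against ~560 MB/s net SATA 6Gbps; 32 channels x 200 MB/s = 6.4 GB/s, in line with net PCIe gen 3 x8.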
Crucial/Micron P400m & e
Crucial P400m | 100GB | 200GB | 400GB
Raw capacity | 168GB | 336GB | 672GB
Seq read (up to) | 380MB/s | 380MB/s | 380MB/s
Seq write (up to) | 200MB/s | 310MB/s | 310MB/s
Random read | 52K | 54K | 60K
Random write | 21K | 26K | 26K
Endurance (2M-hr MTBF) | 1.75PB | 3.0PB | 7.0PB
Price | $300? | $600? | $1000?

Crucial P400e | 100GB | 200GB | 400GB
Raw capacity | 128GB | 256GB | 512GB
Seq read (up to) | 350MB/s | 350MB/s | 350MB/s
Seq write (up to) | 140MB/s | 140MB/s | 140MB/s
Random read | 50K | 50K | 50K
Random write | 7.5K | 7.5K | 7.5K
Endurance (1.2M-hr MTBF) | 175TB | 175TB | 175TB
Price | $176 | $334 | $631

Preliminary – need to update
P410m (SAS) specs are slightly different
EE MLC: higher endurance; write perf not lower than MLC?
Crucial m4 & m500
Crucial m500 | 120GB | 240GB | 480GB | 960GB
Raw capacity | 128GB | 256GB | 512GB | 1024GB
Seq read (up to) | 500MB/s | 500MB/s | 500MB/s |
Seq write (up to) | 130MB/s | 250MB/s | 400MB/s |
Random read | 62K | 72K | 80K |
Random write | 35K | 60K | 80K |
Endurance (1.2M-hr MTBF) | 72TB | 72TB | 72TB |
Price | $130 | $220 | $400 | $600

Crucial m4 | 128GB | 256GB | 512GB
Raw capacity | 128GB | 256GB | 512GB
Seq read (up to) | 415MB/s | 415MB/s | 415MB/s
Seq write (up to) | 175MB/s | 260MB/s | 260MB/s
Random read | 40K | 40K | 40K
Random write | 35K | 50K | 50K
Endurance | 72TB | 72TB | 72TB
Price | $112 | $212 | $400

Preliminary – need to update
Micron & Intel SSD Pricing (2013-02)
[Chart: price ($0–$1,000) vs. capacity class (100/128, 200/256, 400/512GB) for m400, P400e, P400m, S3700]
P400m raw capacities are 168, 336 and 672GB (pricing retracted)
Intel SSD DC S3700 pricing: $235, $470, $940 and $1880 (800GB) respectively

4K Write K IOPS
[Chart: 4K write IOPS (0–60K) vs. capacity class (100/128, 200/256, 400/512GB) for m400, P400e, P400m, S3700]
P400m raw capacities are 168, 336 and 672GB (pricing retracted)
Intel SSD DC S3700 pricing: $235, $470, $940 and $1880 (800GB) respectively
SSD Summary
• MLC is possible with careful write strategy
– Partitioning to minimize index rebuilds (see the sketch below)
– Avoid full database restore to SSD
• Endurance (HET) MLC – write perf?
– Standard DB practice work
– But avoid frequent index defrags?
• SLC – only extreme write intensive?
– Lower volume product – higher cost
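A minimal sketch of the partition-switch pattern behind "partitioning to minimize index rebuilds" (the table, archive table, filegroup, and boundary values are hypothetical): aging data out by switching a partition and merging its boundary is a metadata operation, so the whole index never has to be rebuilt and rewritten to the SSDs.

-- Hypothetical monthly partitioning on an order-date key
CREATE PARTITION FUNCTION pfOrderDate (date)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');
CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([FG_Data]);

-- Age out the oldest month: switch that partition to an empty, identically
-- structured archive table on the same filegroup, then merge the boundary.
-- dbo.Orders is assumed to be partitioned on psOrderDate.
ALTER TABLE dbo.Orders SWITCH PARTITION 2 TO dbo.OrdersArchive;
ALTER PARTITION FUNCTION pfOrderDate() MERGE RANGE ('2012-01-01');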
3
DIRECT ATTACH STORAGE
Full IO Bandwidth
[Diagram: 2-socket system (192GB per socket) with 10-11 PCIe slots: 8 RAID controllers driving SSD and HDD arrays, 2 InfiniBand, 10GbE, and misc devices]
• 10 PCIe g3 x8 slots possible – Supermicro only
– HP, Dell systems have 5-7 x8+ slots + 1 x4?
• 4GB/s per slot with 2 x4 SAS,
– 6GB/s with 4 x4
• Mixed SSD + HDD – reduce wear on MLC
Misc devices on 2 x4 PCIe g2, Internal boot disks, 1GbE or 10GbE, graphics
System Storage Strategy
[Diagram: 2-socket system (192GB per socket) with 4 RAID controllers on PCIe x8 slots, each driving SSD and HDD groups; plus 10GbE and InfiniBand]
Dell & HP only have 5-7 slots
4 controllers @ 4GB/s each is probably good enough?
Few practical products can use PCIe g3 x16 slots
• Capable of 16GB/s with initial capacity
– 4 HBA, 4-6GB/s each
• with allowance for capacity growth
– And mixed SSD + HDD
Clustered SAS Storage
[Diagram: two (of up to four) cluster nodes, 192GB and two HBAs each, connected to four Dell MD3220 enclosures; each MD3220 controller has 4 SAS host ports, an IOC with 2GB cache, a PCIe switch, a SAS expander, and a mix of SSD and HDD]
Dell MD3220 supports clustering, up to 4 nodes without an external switch (extra nodes not shown)
Alternate SSD/HDD Strategy
[Diagram: primary 2-socket system (192GB per socket) with 4 RAID controllers driving all-SSD arrays; backup system with HDD arrays; InfiniBand and 10GbE links between them]
• Primary System
– All SSD for data & temp,
– logs may be HDD
• Secondary System
– HDD for backup and restore testing
System Storage Mixed SSD + HDD
Each RAID group/volume should not exceed the 2GB/s bandwidth of x4 SAS; 2-4 volumes per x8 PCIe g3 slot
[Diagram: 4 RAID controllers on PCIe x8 slots, each with 16 SSDs in two volumes of 8, plus HDD volumes sharing the same channels; 10GbE and InfiniBand also shown]
SATA SSD: read 350-500MB/s, write 140MB/s+
8 per volume allows for some overkill; 16 SSD per RAID controller
64 SATA/SAS SSDs to deliver 16-24GB/s
The 4-HDD-per-volume rule does not apply
HDD for local database backup, restore tests, and DW flat files
SSD & HDD on a shared channel – simultaneous bi-directional IO
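Rough arithmetic behind the 16-24GB/s figure (approximate): 4 RAID controllers x 4-6 GB/s per PCIe g3 x8 slot = 16-24 GB/s at the host, while 64 SSDs x 350-400 MB/s sequential read ≈ 22-26 GB/s at the devices, so the controllers and slots are the limiting factor.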
SSD/HDD System Strategy
• MLC is possible with careful write strategy
– Partitioning to minimize index rebuilds
– Avoid full database restore to SSD
• Hybrid SSD + HDD system, full-duplex signalling
• Endurance (HET) MLC – write perf?
– Standard DB practice work, avoid index defrags
• SLC – only extreme write intensive?
– Lower volume product – higher cost
• HDD – for restore testing
SAS Expander
2 x4 to hosts
1 x4 for expansion
24 x1 for disks
Disk enclosure expansion ports not shown
Storage Infrastructure – designed for HDD
15mm
2U
• 2 SAS Expanders for dual-port support
– 1 x4 upstream (to host), 1 x4 downstream (expansion)
– 24 x1 for bays
Mixed HDD + SSD Enclosure
15mm
2U
• Current: 24 x 15mm = 360mm + spacing
• Proposed: 16 x 15mm = 240mm + 16 x 7mm = 120mm
Enclosure 24x15mm and proposed
[Diagram: host (384GB) with PCIe x8 HBAs connected to enclosure SAS expanders; current enclosure over SAS x4 6Gbps (2.2GB/s) links, proposed enclosure over SAS x4 12Gbps (4GB/s) links]
Current 2U enclosure: 24 x 15mm bays – HDD or SSD; 2 SAS expanders with 32 lanes each – 4 lanes upstream to host, 4 lanes downstream for expansion, 24 lanes for bays
2 RAID groups for SSD, 2 for HDD; 1 SSD volume on path A, 1 SSD volume on path B
New SAS 12Gbps: 16 x 15mm + 16 x 7mm bays; 2 SAS expanders with 40 lanes each – 4 lanes upstream to host, 4 lanes downstream for expansion, 32 lanes for bays
Alternative Expansion
[Diagram: host with a PCIe x8 HBA; SAS x4 links daisy-chained through the expanders of enclosures 1-4]
Each SAS expander has 40 lanes: 8 lanes upstream to host with no expansion, or 4 lanes upstream and 4 lanes downstream for expansion; 32 lanes for bays
PCI-E with Expansion
[Diagram: host (384GB) with PCIe x8 links to a PCI-E switch and SAS expanders; SAS x4 6Gbps (2.2GB/s) links to the device bays]
Express Bay form factor?
Few x8 ports or many x4 ports?
• PCI-E slot SSD suitable for known capacity
• 48 & 64 lanes PCI-E switches available
– x8 or x4 ports
Enclosure for SSD (+ HDD?)
• 2 x4 on each expander upstream – 4GB/s
– No downstream ports for expansion?
• 32 ports for device bays
– 16 SSD (7mm) + 16 HDD (15mm)
• 40 lanes total, no expansion
– 48 lanes with expansion
Large SSD Array
• Large number of devices, large capacity
– Downstream from CPU has excess bandwidth
• Do not need SSD firmware peak performance
– 1) no stoppages, 2) consistency is nice
• Mostly static data – some write intensive
– Careful use of partitioning to avoid index rebuilds and defragmentation
– If 70% is static, 10% is write intensive
• Does wear leveling work?
DATABASE – SQL SERVER
Database Environment
• OLTP + DW Databases are very high value
– Software license + development is huge
– 1 or more full-time DBAs, several application developers, and help desk personnel
– Can justify any reasonable expense
– Full knowledge of data (where the writes are)
– Full control of data (where the writes are)
– Can adjust practices to avoid writes to SSD
Database – Storage Growth
• 10GB per day data growth – big company
– 10M items at 1KB per row (or 4 x 250 byte rows)
– 18 TB for 5 years (1831 days)
– Database log can stay on HDD
• Heavy system
– 64-128 x 256/512GB (raw) SSD
– Each SSD can support 20GB/day (36TB lifetime?)
• With Partitioning – few full index rebuilds
• Can replace MLC SSD every 2 years if required
Extra Capacity - Maintenance
• Storage capacity will be 2-3X database size
– It will be really stupid if you cannot update the application for lack of space to modify a large table
– SAN environment
• Only the required storage capacity is allocated
• May not be able to perform maintenance ops
– If the SAN admin does not allocate extra space
SSD/HDD Component Pricing 2013
• MLC consumer: <$1.0K/TB
• MLC Micron P400e: <$1.2K/TB
• MLC endurance: <$2.0K/TB
• SLC: $4K???
• HDD 600GB 10K: $400
Database Storage Cost
• 8 x 256GB (raw) SSD per x4 SAS channel = 2TB
• 2 x4 ports per RAID controller = 4TB per controller
• 4 RAID controllers per 2-socket system = 16TB
• 32TB with 512GB SSD, 64TB with 1TB
– 64 SSD per system at $250 (MLC): $16K
– 64 HDD 10K 600GB at $400: $26K
– Server 2 x E5, 24 x 16GB, qty 2: $12K each
– SQL Server 2012 EE $6K x 16 cores: $96K
HET MLC and even SLC premium OK
Server/Enterprise premium – high validation effort, low volume, high support expectations
OLTP & DW
• OLTP – backup to local HDD (see the striped backup sketch below)
– Superfast backup, read 10GB/s, write 3GB/s (R5)
– Writes to data blocked during backup
– Recovery requires log replay
• DW – example: 10TB data, 16TB SSD
– Flat files on HDD
– Tempdb will generate intensive writes (1TB)
• Database (real) restore testing
– Force tx roll forward/back, i.e., need HDD array
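A minimal sketch of striping an OLTP backup across several local HDD volumes so the write load spreads over multiple channels (the database name, drive letters, and tuning values are hypothetical):

-- Stripe the backup over four files, one per HDD volume / RAID group
BACKUP DATABASE MyDB
TO  DISK = 'H:\backup\MyDB_1.bak',
    DISK = 'I:\backup\MyDB_2.bak',
    DISK = 'J:\backup\MyDB_3.bak',
    DISK = 'K:\backup\MyDB_4.bak'
WITH COMPRESSION,
     BUFFERCOUNT = 64,              -- more buffers helps keep all targets busy
     MAXTRANSFERSIZE = 4194304;     -- 4MB transfers for large sequential writes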
SQL Server Storage Configuration
• IO system must have massive IO bandwidth
– IO over several channels
• Database must be able to use all channels simultaneously
– Multiple files per filegroup (example below)
• Volumes / RAID Groups on each channel
– Volume comprised of several devices
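A minimal sketch of this layout (drive letters, file names, and sizes are hypothetical): one data file per data volume in each filegroup, so IO to any object is spread over all controllers and paths.

-- Four data volumes (D:, E:, F:, G:), one file of filegroup FG_A on each;
-- log on its own volume
CREATE DATABASE MyDB
ON PRIMARY
    (NAME = MyDB_sys, FILENAME = 'C:\SQL\MyDB_sys.mdf', SIZE = 256MB),
FILEGROUP FG_A
    (NAME = FG_A_1, FILENAME = 'D:\SQL\FG_A_1.ndf', SIZE = 100GB),
    (NAME = FG_A_2, FILENAME = 'E:\SQL\FG_A_2.ndf', SIZE = 100GB),
    (NAME = FG_A_3, FILENAME = 'F:\SQL\FG_A_3.ndf', SIZE = 100GB),
    (NAME = FG_A_4, FILENAME = 'G:\SQL\FG_A_4.ndf', SIZE = 100GB)
LOG ON
    (NAME = MyDB_log, FILENAME = 'L:\SQL\MyDB_log.ldf', SIZE = 50GB);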
HDD, RAID versus SQL Server
• HDD
– pure sequential – not practical,
• impossible to maintain
– Large block 256K good enough
• 64K OK
• RAID Controller – 64K to 256K stripe size
• SQL Server
– Default extent allocation: 64K per file
– With –E, 4 consecutive extents – why not 16???
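One way to check whether -E is among the startup parameters, assuming SQL Server 2008 R2 SP1 or later where sys.dm_server_registry is available:

-- Lists the instance's startup parameters (SQLArg0, SQLArg1, ...);
-- a row with value_data = '-E' means larger extent allocations are enabled
SELECT value_name, value_data
FROM sys.dm_server_registry
WHERE value_name LIKE N'SQLArg%';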
File Layout Physical View
[Diagram: 2-socket system (192GB per socket) with 4 HBAs on PCIe x8 slots, two volumes per HBA, plus 10GbE]
Each filegroup and tempdb has 1 data file on every data volume
IO to any object is distributed over all paths and all disks
Filegroup & File Layout
Disk 2 (Basic) – Controller 1 Port 0: FileGroup A File 1, FileGroup B File 1, Tempdb File 1
Disk 3 (Basic) – Controller 1 Port 1: FileGroup A File 2, FileGroup B File 2, Tempdb File 2
Disk 4 (Basic) – Controller 2 Port 0: FileGroup A File 3, FileGroup B File 3, Tempdb File 3
Disk 5 (Basic) – Controller 2 Port 1: FileGroup A File 4, FileGroup B File 4, Tempdb File 4
Disk 6 (Basic) – Controller 3 Port 0: FileGroup A File 5, FileGroup B File 5, Tempdb File 5
Disk 7 (Basic) – Controller 3 Port 1: FileGroup A File 6, FileGroup B File 6, Tempdb File 6
Disk 8 (Basic) – Controller 4 Port 0: FileGroup A File 7, FileGroup B File 7, Tempdb File 7
Disk 9 (Basic) – Controller 4 Port 1: FileGroup A File 8, FileGroup B File 8, Tempdb File 8
As shown, 2 RAID groups per controller, 1 per port; can be 4 RAID groups/volumes per controller
OS and log disks not shown
Each filegroup has 1 file on each data volume
Each object is distributed across all data "disks"
Tempdb data files share the same volumes (see the tempdb sketch below)
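A minimal sketch of laying out tempdb the same way (file names, paths, and sizes are hypothetical), so tempdb IO also spreads across all data volumes:

-- One tempdb data file per data volume; moving the primary file (tempdev)
-- takes effect only after the SQL Server service restarts
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev,  FILENAME = 'D:\SQL\tempdb_1.mdf', SIZE = 16GB);
ALTER DATABASE tempdb ADD FILE    (NAME = tempdb_2, FILENAME = 'E:\SQL\tempdb_2.ndf', SIZE = 16GB);
ALTER DATABASE tempdb ADD FILE    (NAME = tempdb_3, FILENAME = 'F:\SQL\tempdb_3.ndf', SIZE = 16GB);
ALTER DATABASE tempdb ADD FILE    (NAME = tempdb_4, FILENAME = 'G:\SQL\tempdb_4.ndf', SIZE = 16GB);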
RAID versus SQL Server Extents
Disk 2 (Basic, 1112GB, Online) – Controller 1 Port 0
Disk 3 (Basic, 1112GB, Online) – Controller 1 Port 1
Disk 4 (Basic, 1112GB, Online) – Controller 2 Port 0
Disk 5 (Basic, 1112GB, Online) – Controller 2 Port 1
[Diagram: extents 1-48 allocated round-robin, one extent per file in turn across the four volumes]
Default: allocate extent 1 from file 1, extent 2 from file 2, and so on
Disk IO – 64K
Only 1 disk in each RAID group is active
Consecutive Extents -E
Disk 2 (Basic, 1112GB, Online) – Controller 1 Port 0
Disk 3 (Basic, 1112GB, Online) – Controller 1 Port 1
Disk 4 (Basic, 1112GB, Online) – Controller 2 Port 0
Disk 5 (Basic, 1112GB, Online) – Controller 2 Port 1
[Diagram: extents 1-48 allocated four at a time per file – extents 1-4 from file 1, 5-8 from file 2, and so on]
Allocate 4 consecutive extents from each file; OS issues 256K disk IO
Each HDD in the RAID group sees 64K IO
Up to 4 disks in the RAID group get IO
Storage Summary
• OLTP – endurance MLC or consumer MLC?
• DW – MLC w/ higher OP
• QA – consumer MLC or endurance MLC?
• Tempdb – possibly SLC
• Single log – HDD; multiple logs – SSD?
• Backups/test restores/flat files – HDD
• No caching, no auto-tiers
SAN
Software Cache + Tier
Cache + auto-tier is a good idea if: 1) no knowledge, 2) no control
In the database we have: 1) full knowledge, 2) full control – virtual file stats, filegroups, partitioning (see the query below)
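A minimal example of reading the virtual file stats referred to above; sys.dm_io_virtual_file_stats reports cumulative reads, writes, and IO stalls per database file:

-- IO and stall totals per file, worst stalls first
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name                  AS file_name,
       vfs.num_of_reads,  vfs.num_of_bytes_read,    vfs.io_stall_read_ms,
       vfs.num_of_writes, vfs.num_of_bytes_written, vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;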
Common SAN Vendor Configuration
[Diagram: two nodes (768GB each) through 8Gbps FC or 10Gbps FCoE switches to SP A and SP B (24GB cache each), x4 SAS 2GB/s to the shelves; a single large main volume for data, a log volume, SSD, 10K and 7.2K HDD, hot spares]
Multi-path IO: preferred port / alternate port
Single large volume for data, additional volumes for log, tempdb, etc.
All data IO on a single FC port – 700MB/s IO bandwidth
Path and component fault tolerance, but poor IO performance
Multiple Paths & Volumes
[Diagram: two nodes (768GB each) with multiple quad-port 8Gb FC HBAs and local SSD for tempdb, through dual FC switches to SP A and SP B (24GB cache each), x4 SAS 2GB/s; 16 data volumes, 4 optional SSD volumes, 4 log volumes]
Multiple local SSD for tempdb
Multiple quad-port FC HBAs
Many SAS ports
Data files must also be evenly distributed
Optional SSD volumes
8Gbps FC rules
• 4-5 HDD RAID Group/Volumes
– SQL Server with –E only allocates 4 consecutive extents
• 2+ Volumes per FC port
– Target 700MB/s per 8Gbps FC port
• SSD Volumes
– Limited by 700-800MB/s per 8Gbps FC port
– Too many ports required for serious BW
– Management headache from too many volumes
SQL Server
• SQL Server table scan:
– on a heap, generates 512K IO – easy to hit 100MB/s per disk
– on a (clustered) index, 64K IO – 30-50MB/s per disk likely
EMC VNX 5300 FT DW Ref Arch
iSCSI & File structure
[Diagram: three iSCSI storage units, each with x4 10GbE connections (RJ45 and SFP+); units 1 and 2: Controller 1 – DB1 files, Controller 2 – DB2 files; unit 3: Controller 1 – DB1 file 1 and DB2 file 1, Controller 2 – DB1 file 2 and DB2 file 2]
EMC VMAX – original and 2nd gen
[Diagram: VMAX engine with front-end and back-end ports, CPU complex, global memory, CMI-II]
Original: 2.3 GHz Xeon (Harpertown), 16 CPU cores, 128 GB cache memory (maximum), dual Virtual Matrix, PCIe Gen1
2nd gen: 2.8 GHz Xeon w/turbo (Westmere), 24 CPU cores, 256 GB cache memory (maximum), quad Virtual Matrix, PCIe Gen2
EMC VMAX 10K
EMC VMAX Virtual Matrix
EMC VMAX Director
[Diagram: VMAX directors connected through the Virtual Matrix; each director has FC HBAs (front end), SAS (back end), IOH, and VMI links]
VMAX 10K (new): up to 4 engines, 1 x 6-core 2.8GHz per director, 50GB/s VM BW?, 16 x 8Gbps FC per engine
VMAX 20K engine: 4 quad-core (QC) 2.33GHz, 128GB; Virtual Matrix BW 24GB/s; system – 8 engines, 1TB, VM BW 192GB/s, 128 FE ports
VMAX 40K engine: 4 six-core (SC) 2.8GHz, 256GB; Virtual Matrix BW 50GB/s; system – 8 engines, 2TB, VM BW 400GB/s, 128 FE ports
RapidIO IPC: 3.125GHz, 2.5Gb/s (8b/10b encoding), 4 lanes per connection; 10Gb/s = 1.25GB/s, 2.5GB/s full duplex; 4 connections per engine – 10GB/s
36 PCI-E lanes per IOH, 72 combined: 8 FE, 8 BE; 16 VMI 1, 32 VMI 2
SQL Server Default Extent Allocation
Data file 1: extents 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45
Data file 2: extents 2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46
Data file 3: extents 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47
Data file 4: extents 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48
Allocate 1 extent per file in round robin, with proportional fill
EE/SE table scan tries to stay 1024 pages ahead?
SQL can read 64 contiguous pages from 1 file. The storage engine reads index pages serially in key order.
Partitioned table support for heap organization desired?
SAN
[Diagram: two nodes (768GB each, multiple HBAs) through dual 8Gb FC switches to SP A and SP B (24GB cache each), x4 SAS 2GB/s; 16 data volumes plus a log volume on 10K HDD, and SSD 1-8 with a log SSD]
Clustered SAS
[Diagram: clustered SAS storage – nodes (768GB / 192GB, multiple HBAs, plus InfiniBand and 10GbE) connected to dual-controller SAS enclosures; each controller has 4 SAS host ports, an IOC with 2GB cache, a PCIe switch and SAS expander, with SAS in/out ports for expansion, and RAID groups of SSD and HDD]
Fusion-IO ioScale