System Architecture

System Architecture:
Big Iron (NUMA)
Joe Chang
jchang6@yahoo.com
www.qdpma.com
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-reference index use by SQL execution plan
Performance Monitoring, Profiler/Trace aggregation
Scaling SQL on NUMA Topics
OLTP – Thomas Kejser session
“Designing High Scale OLTP Systems”
Data Warehouse
Ongoing Database Development
Bulk Load – SQL CAT paper + TK session
“The Data Loading Performance Guide”
Other Sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6)
Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the field: High Performance Storage, John Langford
SQL Server Storage – 1000GB Level, Brent Ozar
Server Systems and Architecture
Symmetric Multi-Processing
[Diagram: four CPUs on a shared System Bus, connected to the MCH (memory controller hub), two PXH PCI-X hubs, and the ICH]
In SMP, processors are not dedicated to specific tasks
(unlike ASMP); there is a single OS image and
each processor can access all memory.
SMP makes no reference to the memory architecture.
Not to be confused with Simultaneous Multi-Threading (SMT).
Intel calls SMT Hyper-Threading (HT), which is not to be
confused with AMD HyperTransport (also HT)
Non-Uniform Memory Access
[Diagram: four nodes, each with four CPUs, a Memory Controller, and a Node Controller, connected by a shared bus or crossbar]
NUMA Architecture – the path to memory is not uniform:
1) Node: processors and memory, with separate or combined memory and node controllers
2) Nodes connected by a shared bus, cross-bar, or ring
Traditionally 8-way and larger systems.
Local memory latency ~150ns, remote node memory ~300-400ns; this can
cause erratic behavior if the OS/code is not NUMA aware.
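As a quick check of how SQL Server sees the NUMA layout, the SQLOS node DMV can be queried. This is an illustrative sketch assuming SQL Server 2008 or later; the exact column set varies by version.

-- NUMA nodes and schedulers as reported by SQLOS (illustrative)
SELECT node_id, node_state_desc, memory_node_id,
       online_scheduler_count, active_worker_count
FROM sys.dm_os_nodes;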
AMD Opteron
[Diagram: four Opteron processors connected by HyperTransport, with two HT2100 and one HT1100 I/O bridges]
Technically, Opteron is NUMA,
but remote node memory latency is low,
so there is no negative impact or
erratic behavior.
For practical purposes it behaves like an SMP system.
Local memory latency ~50ns, 1 hop ~100ns, two hops ~150ns?
Actual behavior is more complicated because of snooping (cache coherency traffic).
8-way Opteron Sys Architecture
[Diagram: eight Opteron processors, CPU 0 through CPU 7, connected by HyperTransport links]
Opteron processors (prior to Magny-Cours) have 3 HyperTransport links.
Note that in the 8-way layout the top and bottom-right processors use 2 HT links to connect
to other processors and the 3rd HT link for IO;
CPUs 1 & 7 require 3 hops to reach each other.
http://www.techpowerup.com/img/09-08-26/17d.jpg
Nehalem System Architecture
Intel Nehalem generation processors have the QuickPath Interconnect (QPI).
Xeon 5500/5600 series have 2 QPI links; the Xeon 7500 series has 4.
8-way glue-less systems are possible.
NUMA Local and Remote Memory
Local memory is closer than remote, so the physical access time is shorter.
What is the actual access time? It depends on the cache coherency requirement.
HT Assist – Probe Filter: part of the L3 cache is used as a directory cache
(image source: ZDNET)
Source Snoop Coherency
From HP PREMA Architecture
whitepaper:
All reads result in snoops to all
other caches, …
Memory controller cannot return
the data until it has collected all
the snoop responses and is sure
that no cache provided a more
recent copy of the memory line
DL980G7
From the HP PREMA Architecture whitepaper:
Each node controller stores information about* all data in the processor
caches, minimizes inter-processor coherency communication, reduces
latency to local memory
(*only cache tags, not cache data)
HP ProLiant DL980 Architecture
Node controllers reduce the effective memory latency
Superdome 2 – Itanium, sx3000
[Diagram: Agent holds Remote Ownership Tags plus L4 cache tags; 64MB eDRAM holds the L4 cache data]
IBM x3850 X5 (Glue-less)
Connect two 4-socket nodes to make an 8-way system
Fujitsu R900
4 IOH
14 x8 PCI-E slots, 2 x4, 1 x8 internal
OS Memory Models
SUMA: Sufficiently Uniform Memory Access – memory is interleaved across nodes
[Diagram: consecutive physical addresses striped round-robin across the four nodes]
NUMA: memory is first interleaved within a node, then spanned across nodes
[Diagram: each node holds a contiguous block of addresses; the memory stripe is then spanned across nodes]
Windows OS NUMA Support
Memory models
SUMA – Sufficiently Uniform Memory Access: memory is striped across the NUMA nodes
[Diagram: consecutive addresses interleaved round-robin over Node 0 through Node 3]
NUMA – separate memory pools by node
[Diagram: each of Node 0 through Node 3 owns a contiguous block of addresses]
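To see how much memory has been reserved and committed on each node (i.e. whether the OS presented per-node memory pools), a memory-node DMV query along these lines can be used. This is an illustrative sketch assuming SQL Server 2008 or later; column names should be verified for your version.

-- Per-node memory as seen by SQLOS (illustrative)
SELECT memory_node_id,
       virtual_address_space_reserved_kb,
       virtual_address_space_committed_kb
FROM sys.dm_os_memory_nodes;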
Memory Model Example: 4 Nodes
SUMA memory model: memory access is uniformly distributed – 25% of memory accesses are local, 75% remote.
NUMA memory model: the goal is better than 25% local node access.
True local access time also needs to be faster.
Cache coherency may increase local access time (a rough average-latency calculation follows below).
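As a rough illustration using the latency figures quoted earlier (local ~150ns, remote ~300-400ns; take ~350ns as a midpoint): under SUMA with 4 nodes the average access is about 0.25 × 150ns + 0.75 × 350ns ≈ 300ns, whereas a NUMA-aware layout achieving, say, 75% local access averages about 0.75 × 150ns + 0.25 × 350ns ≈ 200ns. These figures are illustrative only; actual latencies depend on the coherency protocol and platform.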
Architecting for NUMA
End to End Affinity

App Server group    TCP Port    CPU       Memory / Table
North East          1440        Node 0    0-0, 0-1  (NE)
Mid Atlantic        1441        Node 1    1-0, 1-1  (MidA)
South East          1442        Node 2    2-0, 2-1  (SE)
Central             1443        Node 3    3-0, 3-1  (Cen)
Texas               1444        Node 4    4-0, 4-1  (Tex)
Mountain            1445        Node 5    5-0, 5-1  (Mnt)
California          1446        Node 6    6-0, 6-1  (Cal)
Pacific NW          1447        Node 7    7-0, 7-1  (PNW)

Web determines the port for each user by group (but should not be by geography!)
Affinitize each port to a NUMA node
Each node accesses localized data (partition?)
OS may allocate a substantial chunk from Node 0?
(a configuration sketch follows below)
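A minimal sketch of the port-to-node plumbing, assuming SQL Server 2008 R2: the bracketed affinity-mask syntax in the TCP port field (SQL Server Configuration Manager, TCP/IP properties) maps a listening port to one or more NUMA nodes. The ports and masks below are just the example values from the diagram above; verify the syntax against the "Map TCP IP Ports to NUMA Nodes" topic for your version.

-- TCP Port values entered in SQL Server Configuration Manager (illustrative):
--   1440[0x1]    port 1440 affinitized to node 0
--   1441[0x2]    port 1441 affinitized to node 1
--   ...
--   1447[0x80]   port 1447 affinitized to node 7
-- Connections arriving on a given port are then handled by that node,
-- so each application-server group lands on the node holding its partition.

-- Verify which node each connection was assigned to (column availability varies by version):
SELECT session_id, local_tcp_port, node_affinity
FROM sys.dm_exec_connections
WHERE net_transport = 'TCP';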
HP-UX LORA
HP-UX – Not Microsoft Windows
Locality-Optimized Resource Alignment
12.5% Interleaved Memory
87.5% NUMA node Local Memory
System Tech Specs

Processors          Cores  DIMM  PCI-E G2      Total Cores  Max memory  Base
2 x Xeon X56x0      6      18    5 x8+, 1 x4   12           192G*       $7K
4 x Opteron 6100    12     32    5 x8, 1 x4    48           512G        $14K
4 x Xeon X7560      8      64    4 x8, 6 x4†   32           1TB         $30K
8 x Xeon X7560      8      128   9 x8, 5 x4‡   64           2TB         $100K

Memory pricing: 8GB $400 ea (18 x 8G = 144GB, $7200); 16GB $1100 ea (12 x 16G = 192GB, $13K);
64 x 8G = 512GB – $26K; 64 x 16G = 1TB – $70K
* Max memory for 2-way Xeon 5600 is 12 x 16 = 192GB
† Dell R910 and HP DL580G7 have different PCI-E configurations
‡ ProLiant DL980G7 can have 3 IOHs for additional PCI-E slots
Software Stack
Operating System
Windows Server 2003 RTM, SP1
Network limitations (by default) impact OLTP
Scalable Networking Pack (912222)
Windows Server 2008
Windows Server 2008 R2 (64-bit only)
Breaks 64 logical processor limit
Search: MSI-X
NUMA IO enhancements?
Do not bother trying to do DW on 32-bit OS or 32-bit SQL Server
Don’t try to do DW on SQL Server 2000
SQL Server version
SQL Server 2000
Serious disk IO limitations (1GB/sec ?)
Problematic parallel execution plans
SQL Server 2005 (fixed most S2K problems)
64-bit on X64 (Opteron and Xeon)
SP2 – performance improvement 10%(?)
SQL Server 2008 & R2
Compression, filtered indexes, etc.
Star join, parallel query to partitioned tables (examples sketched below)
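For illustration, the 2008-era features above are enabled with ordinary DDL; the table and index names here are hypothetical.

-- Page compression on a large fact table (SQL Server 2008 Enterprise)
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Filtered index covering only unshipped orders (hypothetical schema)
CREATE NONCLUSTERED INDEX IX_Orders_Unshipped
ON dbo.Orders (OrderDate)
WHERE ShipDate IS NULL;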
Configuration
SQL Server startup parameter: -E
Trace flags 834, 836, 2301
Auto_Date_Correlation:
Order date < A, Ship date > A
Implied: Order date > A - C, Ship date < A + C
Port affinity – mostly OLTP
Dedicated processor for the log writer?
(a configuration sketch follows below)
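A minimal configuration sketch for the items above (verify each flag against your edition and workload before use; trace flags 834/836 only take effect as startup parameters, and the database name is hypothetical):

-- Startup parameters, added via SQL Server Configuration Manager (illustrative):
--   -E       allocate more extents per file to each object (helps large scans)
--   -T834    large-page allocations for the buffer pool (64-bit)
--   -T836    size the buffer pool at startup based on max server memory (32-bit)
--   -T2301   advanced decision-support optimizations

-- Date correlation between related date columns (e.g. order date / ship date):
ALTER DATABASE MyDW SET DATE_CORRELATION_OPTIMIZATION ON;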