HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
CSC 7600 Lecture 9: SMP Nodes, Spring 2011

Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
February 15, 2011
Topics
• Introduction
• SMP Context
• Performance: Amdahl's Law
• SMP System structure
• Processor core
• Memory System
• Chip set
• South Bridge – I/O
• Performance Issues
• Summary – Material for the Test
Opening Remarks
• This week is about supercomputer architecture
  – Last time: end of cooperative computing
  – Today: capability computing with the modern microprocessor and multicore SMP node
• As we've seen, there is a diversity of HPC system types
• The most common systems are either SMPs or ensembles of SMP nodes
• "SMP" stands for "Symmetric Multi-Processor"
• System performance is strongly influenced by SMP node performance
• Understanding the structure, functionality, and operation of SMP nodes enables effective programming
The take-away message
• Primary structure and elements that make up an SMP node
• Primary structure and elements that make up the modern multicore microprocessor component
• The factors that determine microprocessor delivered performance
• The factors that determine overall SMP sustained performance
• Amdahl's law and how to use it
• Calculating CPI
• Reference: J. Hennessy & D. Patterson, "Computer Architecture: A Quantitative Approach", 3rd Edition, Morgan Kaufmann, 2003
SMP Context
• A standalone system
  – Incorporates everything needed for:
    • Processors
    • Memory
    • External I/O channels
    • Local disk storage
    • User interface
  – Enterprise server and institutional computing market
• Exploits economy of scale to enhance performance to cost
• Substantial performance
  – Target for ISVs (Independent Software Vendors)
• Shared memory, multiple thread programming platform (see the sketch after this list)
  – Easier to program than distributed memory machines
  – Enough parallelism to fully employ system threads (processor cores)
• Building block for ensemble supercomputers
  – Commodity clusters
  – MPPs
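To make the "shared memory, multiple thread programming platform" point concrete, here is a minimal C/OpenMP sketch (OpenMP itself is covered in later lectures); the compile line, e.g. gcc -fopenmp smp_hello.c, is an illustrative assumption and not part of the slides:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each core of the SMP node can run one thread; all threads
       share the same physical memory (the "symmetric" in SMP). */
    #pragma omp parallel
    {
        printf("thread %d of %d sees the same address space\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Every thread reads and writes the same address space, which is exactly the property that makes an SMP node easier to program than a distributed-memory machine.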
Performance: Amdahl's Law
Baton Rouge to Houston
• from my house on East Lakeshore Dr. to the downtown Hyatt Regency
• distance of 271 miles
• in-air flight time: 1 hour
• door-to-door time to drive: 4.5 hours
• cruise speed of Boeing 737: 600 mph
• cruise speed of BMW 528: 60 mph
Amdahl's Law: drive or fly?
• Peak performance gain: 10X
  – BMW cruise approx. 60 MPH
  – Boeing 737 cruise approx. 600 MPH
• Time door to door
  – BMW
    • Google estimates 4 hours 30 minutes
  – Boeing 737
    • Time to drive to BTR from my house = 15 minutes
    • Wait time at BTR = 1 hour
    • Taxi time at BTR = 5 minutes
    • Continental estimates BTR to IAH = 1 hour
    • Taxi time at IAH = 15 minutes (assuming gate available)
    • Time to get bags at IAH = 25 minutes
    • Time to get rental car = 15 minutes
    • Time to drive to Hyatt Regency from IAH = 45 minutes
    • Total time = 4.0 hours
• Sustained performance gain: 4.5 / 4.0 = 1.125X
Amdahl's Law

[Figure: a computation of total time T_O, of which a portion T_F can be accelerated; with acceleration applied, that portion takes T_F / g and the whole computation takes T_A.]

Definitions:
  T_O = time for the non-accelerated computation
  T_A = time for the accelerated computation
  T_F = time of the portion of the computation that can be accelerated
  g   = peak performance gain for the accelerated portion of the computation
  f   = fraction of the non-accelerated computation that can be accelerated
  S   = speedup of the computation with acceleration applied

  S = T_O / T_A
  f = T_F / T_O

  T_A = (1 - f) T_O + (f / g) T_O

  S = T_O / [ (1 - f) T_O + (f / g) T_O ]
    = 1 / [ (1 - f) + f / g ]
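A small C sketch of the formula just derived; the sample f and g values are illustrative assumptions, not taken from the slides:

#include <stdio.h>

/* Amdahl's law: S = 1 / ((1 - f) + f/g), where f is the fraction of the
   original run time that can be accelerated and g is the peak gain on
   that fraction (see the definitions above). */
static double amdahl(double f, double g)
{
    return 1.0 / ((1.0 - f) + f / g);
}

int main(void)
{
    /* Even with g -> infinity, the speedup is capped at 1/(1 - f). */
    printf("f=0.50 g=10  -> S = %.3f\n", amdahl(0.50, 10.0));
    printf("f=0.90 g=10  -> S = %.3f\n", amdahl(0.90, 10.0));
    printf("f=0.90 g=1e6 -> S = %.3f\n", amdahl(0.90, 1e6));
    return 0;
}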
Amdahl's Law and Parallel Computers
• Amdahl's Law (FracX: fraction of the original execution to be sped up)
  Speedup = 1 / [ (FracX / SpeedupX) + (1 - FracX) ]
• A portion is sequential => limits parallel speedup
  – Speedup <= 1 / (1 - FracX)
• Example: what fraction can be sequential to get 80X speedup from 100 processors? Assume either 1 processor or all 100 are fully used.
  80 = 1 / [ (FracX / 100) + (1 - FracX) ]
  0.8 FracX + 80 (1 - FracX) = 80 - 79.2 FracX = 1
  FracX = (80 - 1) / 79.2 = 0.9975
• Only 0.25% sequential!
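The same relation can be inverted to ask what parallel fraction a target speedup requires; a short C sketch reproducing the numbers above:

#include <stdio.h>

/* Fraction F that must be parallelizable to reach speedup S on N
   processors, from S = 1 / ((F/N) + (1 - F)). */
static double required_fraction(double S, double N)
{
    return (1.0 - 1.0 / S) / (1.0 - 1.0 / N);
}

int main(void)
{
    double F = required_fraction(80.0, 100.0);
    /* Prints F = 0.9975, i.e. only 0.25% of the work may stay sequential. */
    printf("F = %.4f  (sequential part = %.2f%%)\n", F, (1.0 - F) * 100.0);
    return 0;
}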
Amdahl's Law with Overhead

[Figure: the accelerable portion is split into n segments t_F1 ... t_Fn; after acceleration each segment takes v + t_F / g, where v is the per-segment overhead.]

  T_F = sum over i of t_Fi
  v   = overhead of each accelerated work segment
  V   = total overhead for accelerated work = sum over i of v_i = n * v

  T_A = (1 - f) T_O + (f / g) T_O + n * v

  S = T_O / T_A = T_O / [ (1 - f) T_O + (f / g) T_O + n * v ]

  S = 1 / [ (1 - f) + f / g + (n * v) / T_O ]
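A short C sketch of the overhead form of the law; the f, g, n, v and T_O values below are illustrative assumptions chosen only to show how per-segment overhead erodes the ideal speedup:

#include <stdio.h>

/* Amdahl's law with overhead v applied to each of n accelerated work
   segments: S = 1 / ((1 - f) + f/g + n*v/T_O). */
static double amdahl_overhead(double f, double g, double n,
                              double v, double T_O)
{
    return 1.0 / ((1.0 - f) + f / g + n * v / T_O);
}

int main(void)
{
    /* Illustrative values: f = 0.9, g = 10, T_O = 100 time units,
       50 accelerated segments. */
    printf("no overhead : S = %.3f\n", amdahl_overhead(0.9, 10.0,  0.0, 0.0, 100.0));
    printf("v = 0.1     : S = %.3f\n", amdahl_overhead(0.9, 10.0, 50.0, 0.1, 100.0));
    printf("v = 0.5     : S = %.3f\n", amdahl_overhead(0.9, 10.0, 50.0, 0.5, 100.0));
    return 0;
}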
SMP Node Diagram

[Figure: four microprocessors (MP), each with private L1 and L2 caches, sharing L3 caches; a memory controller connects memory banks M1 ... Mn; local storage (S), network interface cards (NIC), PCI-e, JTAG, Ethernet, USB and other peripherals complete the node.]

Legend:
  MP : MicroProcessor
  L1, L2, L3 : Caches
  M1, M2, … : Memory Banks
  S : Storage
  NIC : Network Interface Card
SMP System Examples

Vendor & name | Processor | Number of cores | Cores per proc. | Memory | Chipset | PCI slots
IBM eServer p5 595 | IBM Power5, 1.9 GHz | 64 | 2 | 2 TB | Proprietary GX+, RIO-2 | ≤240 PCI-X (20 standard)
Microway QuadPuter-8 | AMD Opteron, 2.6 GHz | 16 | 2 | 128 GB | Nvidia nForce Pro 2200+2050 | 6 PCIe
Ion M40 | Intel Itanium 2, 1.6 GHz | 8 | 2 | 128 GB | Hitachi CF-3e | 4 PCIe, 2 PCI-X
Intel Server System SR870BN4 | Intel Itanium 2, 1.6 GHz | 8 | 2 | 64 GB | Intel E8870 | 8 PCI-X
HP Proliant ML570 G3 | Intel Xeon 7040, 3 GHz | 8 | 2 | 64 GB | Intel 8500 | 4 PCIe, 6 PCI-X
Dell PowerEdge 2950 | Intel Xeon 5300, 2.66 GHz | 8 | 4 | 32 GB | Intel 5000X | 3 PCIe
Sample SMP Systems

[Photos: DELL PowerEdge, HP Proliant, Intel Server System, Microway Quadputer, IBM p5 595]
HyperTransport-based SMP System
Source: http://www.devx.com/amd/Article/17437
Comparison of Opteron and Xeon SMP
Systems
Source: http://www.devx.com/amd/Article/17437
Multi-Chip Module (MCM) Component of
IBM Power5 Node
Major Elements of an SMP Node
• Processor chip
• DRAM main memory cards
• Motherboard chip set
• On-board memory network
  – North bridge
• On-board I/O network
  – South bridge
• PCI industry standard interfaces
  – PCI, PCI-X, PCI-express
• System Area Network controllers
  – e.g. Ethernet, Myrinet, Infiniband, Quadrics, Federation Switch
• System Management network
  – Usually Ethernet
  – JTAG for low-level maintenance
• Internal disk and disk controller
• Peripheral interfaces
Itanium™ Processor Silicon
(Copyright: Intel at Hotchips '00)

[Die photo: core processor die with IA-32 control, IA-64 control, FPU, integer units, instruction fetch & decode, TLB, on-die caches and bus logic, plus 4 x 1MB L3 cache.]
Multicore Microprocessor Component
Elements
• Multiple processor cores
– One or more processors
• L1 caches
– Instruction cache
– Data cache
• L2 cache
– Joint instruction/data cache
– Dedicated to individual core processor
• L3 cache
– Not all systems
– Shared among multiple cores
– Often off die but in same package
• Memory interface
– Address translation and management (sometimes)
– North bridge
• I/O interface
– South bridge
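As an aside, on a Linux/glibc system the core count and per-core cache sizes just listed can be queried at run time; a minimal sketch (the _SC_LEVEL*_CACHE_SIZE names are glibc extensions and may be unavailable or report 0 on other systems):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* sysconf() returns -1 or 0 when a value is unknown on this system. */
    printf("online cores : %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache     : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache     : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}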
Comparison of Current Microprocessors

Processor | Clock rate | Caches (per core) | ILP (each core) | Cores per chip | Process & die size | Power | Linpack TPP (one core)
AMD Opteron | 2.6 GHz | L1I: 64KB, L1D: 64KB, L2: 1MB | 2 FPops/cycle, 3 Iops/cycle, 2 LS/cycle | 2 | 90nm, 220mm² | 95W | 3.89 Gflops
IBM Power5+ | 2.2 GHz | L1I: 64KB, L1D: 32KB, L2: 1.875MB, L3: 18MB | 4 FPops/cycle, 2 Iops/cycle, 2 LS/cycle | 2 | 90nm, 243mm² | 180W (est.) | 8.33 Gflops
Intel Itanium 2 (9000 series) | 1.6 GHz | L1I: 16KB, L1D: 16KB, L2I: 1MB, L2D: 256KB, L3: 3MB or more | 4 FPops/cycle, 4 Iops/cycle, 2 LS/cycle | 2 | 90nm, 596mm² | 104W | 5.95 Gflops
Intel Xeon Woodcrest | 3 GHz | L1I: 32KB, L1D: 32KB, L2: 2MB | 4 FPops/cycle, 3 Iops/cycle, 1L+1S/cycle | 2 | 65nm, 144mm² | 80W | 6.54 Gflops
Processor Core Micro Architecture
• Execution Pipeline
  – Stages of functionality to process issued instructions
  – Hazards are conflicts with continued execution
  – Forwarding supports closely associated operations exhibiting precedence constraints
• Out-of-Order Execution
  – Uses reservation stations
  – Hides some core latencies and provides fine-grain asynchronous operation supporting concurrency
• Branch Prediction
  – Permits computation to proceed past a conditional branch point prior to resolving the predicate value
  – Overlaps follow-on computation with predicate resolution
  – Requires roll-back or equivalent to correct false guesses
  – Sometimes both paths are followed, several branches deep
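A classic way to see branch prediction at work is to time the same data-dependent branch over unsorted and then sorted data; a hedged C micro-benchmark sketch (not from the slides; timings and the size of the effect are machine- and compiler-dependent):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* The branch "v[i] >= 128" is taken essentially at random for unsorted
   data, defeating the branch predictor; sorting the array first makes
   the branch pattern predictable, and the same loop typically runs
   noticeably faster. */

static volatile long sink;   /* keeps the summation from being optimized away */

static int cmp(const void *a, const void *b)
{
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

static double time_sum(const unsigned char *v)
{
    clock_t t0 = clock();
    long sum = 0;
    for (int rep = 0; rep < 100; rep++)
        for (long i = 0; i < N; i++)
            if (v[i] >= 128)            /* conditional branch under test */
                sum += v[i];
    sink = sum;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    static unsigned char v[N];
    for (long i = 0; i < N; i++)
        v[i] = (unsigned char)(rand() & 0xff);

    printf("unsorted: %.2f s\n", time_sum(v));
    qsort(v, N, 1, cmp);
    printf("sorted  : %.2f s\n", time_sum(v));
    return 0;
}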
Recap: Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Figure, 1980-2000: processor performance ("Moore's Law") improves ~60%/yr (2X/1.5 yr) while DRAM improves ~9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50% per year.]

Copyright 2001, UCB, David Patterson
What is a cache?
• Small, fast storage used to improve average access time to slow memory
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information

[Figure: hierarchy pyramid — Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape; levels get faster toward the top and bigger toward the bottom.]

Copyright 2001, UCB, David Patterson
Levels of the Memory Hierarchy

Level | Capacity | Access time | Cost | Staging transfer unit | Managed by
CPU Registers | 100s of bytes | < 0.5 ns (typically 1 CPU cycle) | | instruction operands, 1-8 bytes | program/compiler
Cache (L1) | 10s-100s of KBytes | 1-5 ns | $10/MByte | cache blocks, 8-128 bytes | cache controller
Main Memory | a few GBytes | 50-150 ns | $0.02/MByte | pages, 512-4K bytes | OS
Disk | 100s-1000s of GBytes | 500,000-1,500,000 ns | $0.25/GByte | files, MBytes | user/operator
Tape | infinite | sec-min | $0.0014/MByte | |

(upper levels are faster; lower levels are larger)

Copyright 2001, UCB, David Patterson
Cache Measures
• Hit rate: fraction of accesses found in that level
  – Usually so high that we talk about the miss rate instead
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from the lower level, including time to deliver it to the CPU
  – access time: time to reach the lower level = f(latency to lower level)
  – transfer time: time to transfer the block = f(BW between upper & lower levels)

Copyright 2001, UCB, David Patterson
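The average memory-access time formula above is easy to exercise numerically; a minimal C sketch with illustrative hit-time and miss-penalty values (not from the slides):

#include <stdio.h>

/* Average memory-access time (units just have to match):
   AMAT = hit_time + miss_rate * miss_penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Illustrative numbers: a 1-cycle L1 hit, 100-cycle miss penalty. */
    printf("miss rate 2%%  -> AMAT = %.1f cycles\n", amat(1.0, 0.02, 100.0));
    printf("miss rate 10%% -> AMAT = %.1f cycles\n", amat(1.0, 0.10, 100.0));
    return 0;
}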
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)

[Figure: the processor exchanges data with block Blk X in upper-level memory; on a miss, block Blk Y is brought in from lower-level memory.]

Copyright 2001, UCB, David Patterson
Cache Performance

  T = Icount × CPI × Tcycle
  Icount = I_ALU + I_MEM
  CPI = (I_ALU / Icount) × CPI_ALU + (I_MEM / Icount) × CPI_MEM

where:
  T = total execution time
  Tcycle = time for a single processor cycle
  Icount = total number of instructions
  I_ALU = number of ALU instructions (e.g. register-register)
  I_MEM = number of memory access instructions (e.g. load, store)
  CPI = average cycles per instruction
  CPI_ALU = average cycles per ALU instruction
  CPI_MEM = average cycles per memory instruction
  r_miss = cache miss rate
  r_hit = cache hit rate
  CPI_MEM-MISS = cycles per cache miss
  CPI_MEM-HIT = cycles per cache hit
  M_ALU = instruction mix for ALU instructions
  M_MEM = instruction mix for memory access instructions
Cache Performance (continued)

Instruction mix:
  M_ALU = I_ALU / Icount
  M_MEM = I_MEM / Icount
  M_ALU + M_MEM = 1

  CPI = (M_ALU × CPI_ALU) + (M_MEM × CPI_MEM)
  T = Icount × [ (M_ALU × CPI_ALU) + (M_MEM × CPI_MEM) ] × Tcycle

(symbols as defined on the previous slide)
Cache Performance (continued)

  CPI_MEM = CPI_MEM-HIT + r_MISS × CPI_MEM-MISS
  T = Icount × [ (M_ALU × CPI_ALU) + M_MEM × (CPI_MEM-HIT + r_MISS × CPI_MEM-MISS) ] × Tcycle

(symbols as defined on the previous slides)
Cache Performance: Example

Given:
  Icount = 10^11
  I_MEM = 2 × 10^10
  CPI_ALU = 1
  CPI_MEM-HIT = 1
  CPI_MEM-MISS = 100
  Tcycle = 0.5 ns
  Two caches: r_hit,A = 0.9 and r_hit,B = 0.5

Instruction mix:
  I_ALU = Icount - I_MEM = 8 × 10^10
  M_ALU = I_ALU / Icount = 8 × 10^10 / 10^11 = 0.8
  M_MEM = I_MEM / Icount = 2 × 10^10 / 10^11 = 0.2

Memory CPI:
  CPI_MEM,A = CPI_MEM-HIT + r_MISS,A × CPI_MEM-MISS = 1 + (1 - 0.9) × 100 = 11
  CPI_MEM,B = CPI_MEM-HIT + r_MISS,B × CPI_MEM-MISS = 1 + (1 - 0.5) × 100 = 51

Execution time:
  T_A = 10^11 × ((0.8 × 1) + (0.2 × 11)) × 5 × 10^-10 = 150 sec
  T_B = 10^11 × ((0.8 × 1) + (0.2 × 51)) × 5 × 10^-10 = 550 sec
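A small C sketch that reproduces the worked example above, so the 150 s vs. 550 s figures can be checked mechanically:

#include <stdio.h>

/* T = Icount * (M_ALU*CPI_ALU + M_MEM*(CPI_hit + r_miss*CPI_miss)) * Tcycle
   for the two hit rates used in the example. */
int main(void)
{
    const double Icount  = 1e11;
    const double Imem    = 2e10;
    const double Ialu    = Icount - Imem;      /* 8e10 */
    const double Malu    = Ialu / Icount;      /* 0.8  */
    const double Mmem    = Imem / Icount;      /* 0.2  */
    const double CPIalu  = 1.0;
    const double CPIhit  = 1.0;
    const double CPImiss = 100.0;
    const double Tcycle  = 0.5e-9;             /* 0.5 ns */

    double hit_rates[2] = { 0.9, 0.5 };
    for (int i = 0; i < 2; i++) {
        double CPImem = CPIhit + (1.0 - hit_rates[i]) * CPImiss;
        double T = Icount * (Malu * CPIalu + Mmem * CPImem) * Tcycle;
        printf("hit rate %.1f: CPI_MEM = %5.1f, T = %.0f s\n",
               hit_rates[i], CPImem, T);
    }
    return 0;
}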
Motherboard Chipset
• Provides the core functionality of the motherboard
• Embeds low-level protocols to facilitate efficient communication between local components of the computer system
• Controls the flow of data between the CPU, system memory, on-board peripheral devices, expansion interfaces and the I/O subsystem
• Also responsible for power management features, retention of non-volatile configuration data and real-time measurement
• Typically consists of:
  – Northbridge (Memory Controller Hub, MCH), managing traffic between the processor, RAM, GPU, southbridge and optionally PCI Express slots
  – Southbridge (I/O Controller Hub, ICH), coordinating a slower set of devices, including the traditional PCI bus, ISA bus, SMBus, IDE (ATA), DMA and interrupt controllers, real-time clock, BIOS memory, ACPI power management, LPC bridge (providing fan control, floppy disk, keyboard, mouse, MIDI interfaces, etc.), and optionally Ethernet, USB, IEEE 1394, audio codecs and RAID interfaces
Major Chipset Vendors
• Intel
– http://developer.intel.com/products/chipsets/index.htm
• Via
– http://www.via.com.tw/en/products/chipsets
• SiS
– http://www.sis.com/products/product_000001.htm
• AMD/ATI
– http://ati.amd.com/products/integrated.html
• Nvidia
– http://www.nvidia.com/page/mobo.html
Chipset Features Overview
Motherboard
• Also referred to as main board, system board, backplane
• Provides mechanical and electrical support for pluggable
components of a computer system
• Constitutes the central circuitry of a computer,
distributing power and clock signals to target devices,
and implementing communication backplane for data
exchanges between them
• Defines expansion possibilities of a computer system
through slots accommodating special purpose cards,
memory modules, processor(s) and I/O ports
• Available in many form factors and with various
capabilities to match particular system needs, housing
capacity and cost
Motherboard Form Factors
• Refer to standardized motherboard sizes
• Most popular form factor used today is ATX, evolved
from now obsolete AT (Advanced Technology) format
• Examples of other common form factors:
– MicroATX, miniaturized version of ATX
– WTX, large form factor designated for use in high power
workstations/servers featuring multiple processors
– Mini-ITX, designed for use in thin clients
– PC/104 and ETX, used in embedded systems and single
board computers
– BTX (Balanced Technology Extended), introduced by Intel as
a possible successor to ATX
Motherboard Manufacturers
Abit, Albatron, Aopen, ASUS, Biostar, DFI, ECS, Epox, FIC, Foxconn, Gigabyte, IBM, Intel, Jetway, MSI, Shuttle, Soyo, SuperMicro, Tyan, VIA
Populated CPU Socket
Source: http://www.motherboards.org
DIMM Memory Sockets
Source: http://www.motherboards.org
Motherboard on Arete
SuperMike Motherboard:
Tyan Thunder i7500 (S720)
Source: http://www.tyan.com
PCI enhanced systems
http://arstechnica.com/articles/paedia/hardware/pcie.ars/1
PCI-express

Lane width | Clock speed | Throughput (duplex, bits) | Throughput (duplex, bytes) | Initial expected uses
x1 | 2.5 GHz | 5 Gbps | 400 MBps | Slots, Gigabit Ethernet
x2 | 2.5 GHz | 10 Gbps | 800 MBps |
x4 | 2.5 GHz | 20 Gbps | 1.6 GBps | Slots, 10 Gigabit Ethernet, SCSI, SAS
x8 | 2.5 GHz | 40 Gbps | 3.2 GBps |
x16 | 2.5 GHz | 80 Gbps | 6.4 GBps | Graphics adapters

Source: http://www.redbooks.ibm.com/abstracts/tips0456.html
PCI-X

Variant | Bus width | Clock speed | Features | Bandwidth
PCI-X 66 | 64 bits | 66 MHz | Hot plugging, 3.3 V | 533 MB/s
PCI-X 133 | 64 bits | 133 MHz | Hot plugging, 3.3 V | 1.06 GB/s
PCI-X 266 | 64 bits, optional 16 bits only | 133 MHz, double data rate | Hot plugging, 3.3 & 1.5 V, ECC supported | 2.13 GB/s
PCI-X 533 | 64 bits, optional 16 bits only | 133 MHz, quad data rate | Hot plugging, 3.3 & 1.5 V, ECC supported | 4.26 GB/s
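The PCI and PCI-X bandwidths in these tables follow directly from bus width × clock × transfers per clock; a small C sketch of that arithmetic (PCI Express is a serial link with 8b/10b encoding, so this simple formula does not apply to it):

#include <stdio.h>

/* Peak bandwidth of a parallel bus in MB/s:
   (width_bits / 8) * clock_MHz * transfers_per_clock. */
static double bus_mbs(double width_bits, double clock_mhz, double xfers_per_clock)
{
    return width_bits / 8.0 * clock_mhz * xfers_per_clock;
}

int main(void)
{
    printf("PCI   64-bit/66 MHz: %6.0f MB/s\n", bus_mbs(64, 66.67, 1));   /* ~533  */
    printf("PCI-X 133          : %6.0f MB/s\n", bus_mbs(64, 133.33, 1));  /* ~1066 */
    printf("PCI-X 266 (DDR)    : %6.0f MB/s\n", bus_mbs(64, 133.33, 2));  /* ~2133 */
    printf("PCI-X 533 (QDR)    : %6.0f MB/s\n", bus_mbs(64, 133.33, 4));  /* ~4266 */
    return 0;
}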
Bandwidth Comparisons

Connection | Bits | Bytes
PCI 32-bit/33 MHz | 1.06666 Gbit/s | 133.33 MB/s
PCI 64-bit/33 MHz | 2.13333 Gbit/s | 266.66 MB/s
PCI 32-bit/66 MHz | 2.13333 Gbit/s | 266.66 MB/s
PCI 64-bit/66 MHz | 4.26666 Gbit/s | 533.33 MB/s
PCI 64-bit/100 MHz | 6.39999 Gbit/s | 799.99 MB/s
PCI Express (x1 link) | 2.5 Gbit/s | 250 MB/s
PCI Express (x4 link) | 10 Gbit/s | 1 GB/s
PCI Express (x8 link) | 20 Gbit/s | 2 GB/s
PCI Express (x16 link) | 40 Gbit/s | 4 GB/s
PCI Express 2.0 (x32 link) | 80 Gbit/s | 8 GB/s
PCI-X DDR 16-bit | 4.26666 Gbit/s | 533.33 MB/s
PCI-X 133 | 8.53333 Gbit/s | 1.06666 GB/s
PCI-X QDR 16-bit | 8.53333 Gbit/s | 1.06666 GB/s
PCI-X DDR | 17.066 Gbit/s | 2.133 GB/s
PCI-X QDR | 34.133 Gbit/s | 4.266 GB/s
AGP 8x | 17.066 Gbit/s | 2.133 GB/s
HyperTransport: Context
• The Northbridge-Southbridge device connection facilitates communication over a fast processor bus between system memory, graphics adapter and CPU
• The Southbridge operates several I/O interfaces, reached through the Northbridge over another proprietary connection
• This approach is potentially limited by emerging bandwidth demands over inadequate I/O buses
• HyperTransport is one of many technologies aimed at improving I/O
• High data rates are achieved by using enhanced, low-swing, 1.2 V Low Voltage Differential Signaling (LVDS) that employs fewer pins and wires, consequently reducing cost and power requirements
• HyperTransport also helps in communication between multiple AMD Opteron CPUs
Source: http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html
Hyper-Transport (continued)
• Point-to-point parallel topology uses 2
unidirectional links (one each for upstream and
downstream)
• HyperTransport technology chunks data into
packets to reduce overhead and improve efficiency
of transfers.
• Each HyperTransport technology link also contains
8-bit data path that allows for insertion of a control
packet in the middle of a long data packet, thus
reducing latency.
• In Summary : “HyperTransport™ technology
delivers the raw throughput and low latency
necessary for chip-to-chip communication. It
increases I/O bandwidth, cuts down the number of
different system buses, reduces power
consumption, provides a flexible, modular bridge
architecture, and ensures compatibility with PCI. “
Source: http://www.amd.com/us-en/Processors/ComputingSolutions/0,,30_288_13265_13295%5E13340,00.html
Performance Issues
• Cache behavior
  – Hit/miss rate
  – Replacement strategies
• Prefetching
• Clock rate
• ILP
• Branch prediction
• Memory
  – Access time
  – Bandwidth
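Several of the factors listed above — cache hit rate, prefetching, memory bandwidth — show up directly in how a loop nest traverses memory; a hedged C sketch (not from the slides; array size and timings are illustrative and machine-dependent):

#include <stdio.h>
#include <time.h>

#define N 4096

/* The same N x N summation touches memory with unit stride (row-major
   order, cache and prefetcher friendly) or with stride N (column-major
   order, roughly one cache miss per element once the matrix exceeds
   the last-level cache). */
static double a[N][N];

int main(void)
{
    clock_t t0;
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];                 /* unit stride */
    printf("row-major   : %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];                 /* stride of N doubles */
    printf("column-major: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    printf("checksum %.0f\n", sum);         /* keeps the sums live */
    return 0;
}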
Summary – Material for the Test
• Please make sure that you have addressed all points outlined on slide 5
• Understand content on slide 7
• Understand concepts, equations, problems on slides 11, 12, 13
• Understand content on slides 21, 24, 26, 29
• Understand concepts on slides 32, 33, 34, 35, 36
• Understand content on slides 39, 57
• Required reading material:
  http://arstechnica.com/articles/paedia/hardware/pcie.ars/1