Master's Degree Program (Laurea Magistrale) in Computer Science and Networking
High Performance Computing
Multiprocessor architectures
• Ref: Sections 10.1, 10.2, 15, 16, 17 (except 17.5), 18.
• Background (Appendix): firmware, firmware communications, memories,
caching, process management; see also Section 11 for memory and caching.
Contents
Shared memory architecture = multiprocessor
• Multicore technology (Chip MultiProcessor – CMP)
• Functional and performance features of external memory and
caching in multiprocessors
• Interconnection networks
• Multiprocessor taxonomy
• Local I/O
Sections 15, 16, 17 and 18 contain several 'descriptive-style' parts (classifications, technologies, products, etc.), which students can easily read on their own.
During the lectures we'll concentrate on the most critical issues from the conceptual and technical point of view, through examples and exercises.
Abstract architecture and physical architecture

The cost model of the specific parallel program executed on the specific parallel architecture is built on two abstractions:
• Abstract Processing Elements (PEs), having all the main features of real PEs (processor, assembler, memory hierarchy, external memory, I/O, etc.); one PE for each process. They are used for the evaluation of the calculation times Tcalc.
• An abstract interconnection network: all the needed direct links, corresponding to the interprocess communication channels. It is an abstraction of the physical interconnect, the memory hierarchy, I/O, the process run-time support, the process mapping onto PEs, etc. All the physical details are condensed into a small number of parameters used to evaluate Lcom.
Multiple Instruction stream, Multiple Data stream (MIMD) architectures
Parallelism between processes
Shared memory vs distributed memory

Multiprocessor (shared memory): currently the main technology for multicore and multicore-based systems. We'll start with shared memory multiprocessors.

Multicomputer (distributed memory): currently the main technology for clusters and data-centres. Processing nodes are multiprocessors.
Levels and shared memory

Levels: Applications, Processes, Assembler, Firmware, Hardware.

Shared physical memory (multiprocessor): any processor can address (thus can access directly) any location of the physical memory, which contains all instructions and data (private and shared) of the processes.
Levels and shared memory

Different shared data exist at different levels:
• Processes level: graphs of cooperating processes expressed by a concurrent language (e.g. LC), cooperating via message passing or via shared data. With message passing there is no sharing at this level, yet it is implemented by sharing at the levels below.
• Assembler level: the RTS of the concurrent language is based on shared data structures (communication channel descriptors, process descriptors, etc.) exploiting the shared physical memory.
• Firmware level: shared physical memory (multiprocessor): any processor can address (thus can access directly) any location of the physical memory, which contains all instructions and data (private and shared) of the processes.
Generic scheme of multiprocessor

[Figure] External shared main memory composed of modules M0 … Mj … Mm-1, connected through an N x m interconnect (interconnection networks performing routing and flow control) to the Processing Elements PE0 … PEi … PEN-1. Each PE contains:
• a CPU (processor units, MMUs, caches) plus local I/O;
• a PE interface unit W ('wrapping' unit), which decouples the CPU from the interconnect technology.
Typical PE

[Figure] A Processing Node (PE) contains the PE interface unit W, connected on one side (through the external memory interface) to the interconnection network, to/from the external memory modules and the other PEs, and on the other side to the CPU: secondary cache, primary cache (instr. + data), MMUs and processor units, plus an I/O interface with interrupt arbiter and the local I/O units (UC).
Shared memory basics - 1
Example: an elementary multiprocessor

Just to understand / review basic concepts and techniques, which will be extended to real multiprocessor architectures.

[Figure] Two processing elements PE0 and PE1 (each with processor P of clock cycle t, caches C1 and C2, and interface unit W) are connected to a shared memory M (cycle time tM). At the abstract level, process P runs on abstract PE0 and process Q on abstract PE1.

Question: which kinds of requests are sent from a PE to M, and which reply is sent from M to a PE? [true/false]
1. Copy a message from P_msg to Q_vtg
2. Request a message to be assigned to Q_vtg
3. A single-word read
4. A single-word write
5. A C1-block read
6. A C1-block write
Example: an elementary multiprocessor

(Same system as in the previous slide.)

Question: which kinds of requests are sent from a PE to M, and which reply is sent from M to a PE? [true/false]
1. Copy a message from P_msg to Q_vtg
2. Request a message to be assigned to Q_vtg
3. A single-word read
4. A single-word write: yes, if Write-Through
5. A C1-block read: yes
6. A C1-block write: yes, if Write-Back

Question: what is the format (configuration of bits) of a request PE-M and of a reply M-PE?

Question: what happens if a request from PE0 and a request from PE1 arrive 'simultaneously' at M?
Behavior of the memory unit M
• The processing module is a unifying concept at the various levels: processing unit at the firmware level, process at the process level.
• All the mechanisms studied for process cooperation (LC) are applied at the firmware level too, though with different implementations and performances.
• Communication through RDY-ACK interfaces.
• Nondeterminism: test simultaneously, in the same clock cycle, all the RDYs of the input interfaces; select one of the ready requests, possibly applying a fair priority strategy, as sketched below.
• Nondeterminism may be implemented as real parallelism in the same clock cycle: if the input requests are compatible and the memory bandwidth is sufficient, several requests can be served simultaneously.
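A minimal sketch of the selection step, assuming a round-robin notion of fairness (the interface count N_IFS and the single-winner policy are illustrative; a real unit evaluates all RDYs in hardware within one clock cycle):

#include <stdbool.h>

#define N_IFS 8                      /* number of input interfaces (assumption) */

static int last_served = N_IFS - 1;  /* rotating priority pointer */

/* Select one ready input interface with fair (round-robin) priority;
   models the nondeterministic choice M makes in one clock cycle. */
int select_ready(const bool rdy[N_IFS]) {
    for (int k = 1; k <= N_IFS; k++) {
        int i = (last_served + k) % N_IFS;   /* rotate the priority order */
        if (rdy[i]) {
            last_served = i;
            return i;                        /* serve interface i */
        }
    }
    return -1;                               /* no pending request */
}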
Behavior of the memory unit M
• A further feature can be defined for a shared memory unit: indivisible sequences of memory accesses.
• An additional bit (INDIV) is associated with each memory request: if it is set to 1, once the associated request is accepted by M, the other requests are left pending in the input interfaces (simple waiting-queue mechanism), until INDIV is reset to 0.
• During an indivisible sequence of memory accesses, the behavior of M is deterministic.
• At the end, the nondeterministic/parallel behavior is resumed (possibly by serving a waiting request).
• This mechanism is provided by some machines through proper instructions (e.g. TEST_AND_SET) or through an annotation in LOAD and STORE instructions; a sketch of its typical use follows.
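A minimal sketch of the typical software-level use of such indivisible read-modify-write sequences: a spin lock built on C11's atomic test-and-set (the names lock_acquire/lock_release are illustrative):

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* shared lock variable */

void lock_acquire(void) {
    /* atomic_flag_test_and_set performs an indivisible
       read-modify-write on the shared location */
    while (atomic_flag_test_and_set(&lock_flag))
        ;                                   /* busy wait until free */
}

void lock_release(void) {
    atomic_flag_clear(&lock_flag);          /* reset the flag */
}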
In general

[Figure] Nondeterminism and parallelism in the behavior of the memory units M0 … Mj … Mm-1 and of the network switching units: the memory modules are connected through the interconnection network(s) to PE0 … PEi … PEN-1, each CPU behind its interface unit W.
Technology overview and multicore
CPU technology

[Figure] Pipelined and multithreaded processor technology, general view (Sections 12, 13): external memory interface (MINF), C2, MMU, C1, processor P.

In the simplified cost model adopted for this course, this structure is invisible: it is abstracted by the equivalent service time per instruction Tinstr (e.g. 2t).
Pipelined / vectorized Execution Unit

[Figure] The EU_Master (fed from DM and IU, replying to IU) distributes the short operations and LOADs, and holds the general, floating-point and vector registers (RG and RFP registers). Functional units: INT pipelined Mul/Div, FP pipelined Add/Sub, FP pipelined Mul/Div, plus vectorization facilities; a collector unit gathers the results.
Multithreading ('hardware' multithreading)

[Figure] Example of a 2-thread CPU (e.g. Hyperthreading): on the CPU chip, IM (instruction C1) and DM (data C1) feed two instruction units IU0 and IU1 and two masters EU_Master0 and EU_Master1; a switch connects them to the functional units FU0 … FU3; C2 and the memory interface (MINF) connect to the external memory M and to I/O.

Ideally, an equivalent number of q x N PEs is available, where q is the multithreading degree. In practice, a x q x N, with a < 1.
Multicore technology: Chip MultiProcessor (CMP)

[Figure] A CMP on a single chip: PE 0, PE 1, …, PE N-1 connected by an internal interconnect, with external memory interfaces (MINF) and I/O interfaces (I/O INF). Each PE / core contains W, C2, C1 (instr + data), a pipelined processor, and local I/O coprocessors.

For our purposes, the terms 'multicore' and 'manycore' are synonymous. We use the more general and rigorous term 'Chip MultiProcessor (CMP)'.
Internal interconnect examples for CMP

[Figure] Three examples: a ring of PEs, a 2D toroidal mesh (PEs attached to switching units 'sw'), and a crossbar. Switching Unit (or, simply, Switch): performs routing and flow control.
Example of single-CMP system

[Figure] A CMP (PE 0 … PE N-1 on the internal interconnect) is connected through its MINFs and interleaved memory interfaces (IM) to a high-bandwidth main memory (macro-modules of modules M0 … M7), and through its I/O INFs to I/O and networking interconnects: I/O chassis (Ethernet, Fibre Channel, graphics, video), a router towards LANs / WANs / other subnets, and SCSI-attached RAID subsystems and a storage server.
Example of multiple-CMP system

[Figure] Several CMPs (CMP0, CMP1, …, CMPm-1, each containing several PEs) are connected by an external interconnect to a high-bandwidth shared main memory composed of several modules M.
Intel Xeon (4-16 PEs) and Tilera Tile64 (64 PEs)
Intel Xeon Phi (64 PEs)

• Bidirectional ring interconnect.
• Internal local memory (GDDR5 technology), up to 16 GB (3rd-level cache-like).
• PE: pipelined, in-order, vectorized arithmetic, 4-thread, 2-level cache, ring interface.
Shared memory basics - 2
Memory bandwidth and latency
High bandwidth of M (BM) is needed in order to:
1. minimize the latency of cache-block transfers (Tfault);
2. minimize the contention of PEs for memory accesses.
Minimize the latency of cache-block transfers

• If BM = 1 word/tM, the cache is quite useless for programs characterized by locality only (or mainly by locality).
• BM = s1 words/tM is the best offered bandwidth: exploitable if the remaining subsystems (interconnect, PEs) are able to sustain it.
• Solutions:
1. Interleaved macro-modules (hopefully m = s1, e.g. m = 8): macro-module 0 = M0, M1, …, M7; macro-module 1 = M8, M9, …, M15; and so on. A sketch of interleaved addressing follows.
2. High-bandwidth firmware communications from M to the interconnect and the PEs. Notice: s1-wide links are not realistic, hence 1-word links.
• Pipelined communications and wormhole flow control: next week.
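A minimal sketch of word-level interleaved addressing across m modules (word-granularity interleaving is an assumption; block- or macro-module-level interleaving works analogously on higher address bits):

/* With m interleaved modules, consecutive word addresses map to
   consecutive modules; the quotient is the address inside the module. */
unsigned module_of(unsigned word_addr, unsigned m) {
    return word_addr % m;          /* which module serves this word */
}

unsigned offset_in_module(unsigned word_addr, unsigned m) {
    return word_addr / m;          /* local address inside that module */
}

With m = s1 = 8, the eight words of a cache block reside in eight distinct modules and can be read in parallel.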
Cost model of FW communications (Sect. 10.1, 10.2)

Tid = max(Tcalc, Ts-com)

Single buffering (figure). Communication latency:
Lcom = 2 (t + Ttr)
This also expresses the communication service time:
Ts-com = Lcom = 2 (t + Ttr)

Sender:: wait ACK; write msg into OUT; set RDY, reset ACK, …
Receiver:: wait RDY; use IN; set ACK, reset RDY, …

Double buffering (alternate use of the two interfaces):
Lcom = 2 (t + Ttr)
Ts-com = Lcom / 2 = t + Ttr

[Timing diagram] Tid = Tcalc + Tcom, where Tcalc is the calculation time and Tcom is the communication time NOT overlapped with (i.e. not masked by) the internal calculation; t is the clock cycle (for calculation only, or for calculation and communication), Ttr the transmission latency (link only), Lcom = 2 (t + Ttr) the communication latency. On chip: Ttr ≈ 0. If Tcalc ≥ Lcom, then Tcom = 0.
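The two buffering regimes in executable form; a trivial sketch, with t and Ttr expressed in the same time unit:

/* Firmware communication cost model: latency and service time. */
double lcom(double tau, double Ttr)          { return 2.0 * (tau + Ttr); }
double tscom_single(double tau, double Ttr)  { return 2.0 * (tau + Ttr); }
double tscom_double(double tau, double Ttr)  { return tau + Ttr; }

For example, with t = 1 and Ttr = 2 (off-chip-like values; on chip Ttr ≈ 0), single buffering gives Ts-com = 6, double buffering Ts-com = 3.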
Elementary system example: memory latency

[Figure] PE0 and PE1 (each with C1, C2 and W, clock cycle t) access an interleaved memory macro-module M = M0 … M7 (cycle time tM) through the memory interface unit IM (clock cycle t). Links are double buffered. The request (e.g. 48 – 112 bits) is sent in parallel; the block is pipelined word-by-word, so the reply is a stream of s1 one-word packets taking s1 (t + Ttr).

MEMORY ACCESS LATENCY per block:
RQ0 = tM + (s1 + 3) Ttr + (s1 + 6) t

RQ0 is the BASE memory access latency per block, i.e. without the impact of contention: optimistic, or exact for a single PE.

A more general, more accurate, and easier-to-use cost model will be studied for fully pipelined communications.

Possible optimization (not so popular): the processor could re-start as soon as the first word of the stream arrives, if that word has the address that generated the fault.
Elementary system example: memory latency

For all units, except M: Tcalc = t, hence Tid = Ts-com = t + Ttr.
Memory service time: TM = tM / s1.

If Tid ≤ TM, then M (the stream generator) is the bottleneck.
Example: tM = 32t, s1 = 8, Ttr = 2t:
RQ0 = tM + (s1 + 3) Ttr + (s1 + 6) t = 68 t

If Tid > TM, then the IM-network-PE path is the bottleneck, and the first term tM is replaced by the time s1 (t + Ttr) at which the stream is drained.
Example: tM = 32t, s1 = 8, Ttr = 4t:
RQ0 = s1 (t + Ttr) + (s1 + 3) Ttr + (s1 + 6) t = 98 t

[Timing diagrams: successive block transfers, each of duration s1 (t + Ttr), paced by tM when M is the bottleneck.]
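The two cases condensed into one hedged helper (unit: clock cycles t; the branch on the bottleneck follows the two examples above, reproducing 68t and 98t):

#include <stdio.h>

/* Base memory access latency RQ0 per cache block, in units of tau. */
double rq0(double tauM, int s1, double Ttr) {
    double Tid = 1.0 + Ttr;                    /* service time of IM/net/PE stages */
    double TM  = tauM / s1;                    /* memory service time per word */
    double first = (Tid <= TM) ? tauM          /* M is the bottleneck */
                               : s1 * (1.0 + Ttr); /* IM-net-PE is the bottleneck */
    return first + (s1 + 3) * Ttr + (s1 + 6);
}

int main(void) {
    printf("RQ0 = %.0f tau (expected 68)\n", rq0(32.0, 8, 2.0));
    printf("RQ0 = %.0f tau (expected 98)\n", rq0(32.0, 8, 4.0));
    return 0;
}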
Tfault

[Figure: the same elementary system, with macro-module M0 … M7 behind the memory interface unit IM; process P on abstract PE0, process Q on abstract PE1.]

From now on, the UNDER-LOAD memory access latency per block is RQ ≥ RQ0, and:

Tfault = Nfault · RQ

Initially, we assume RQ = RQ0.

Example of the first week, for Q:
Tcalc = TQ0 + Tfault, with TQ0 = 8 M t.
For M = 5 Mega, no reuse can be exploited, so RQ = 68 t.
Nfault = M / s1
Tfault = Nfault · RQ = 8.5 M t
Tcalc = TQ0 + Tfault = 16.5 M t
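Spelling out the arithmetic (with s1 = 8 and RQ = 68t from the previous slide):

\[
T_{fault} = \frac{M}{\sigma_1}\, R_Q = \frac{M}{8}\cdot 68\tau = 8.5\,M\,\tau, \qquad
T_{calc} = T_{Q0} + T_{fault} = 8M\tau + 8.5M\tau = 16.5\,M\,\tau .
\]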
Elementary system example

The evaluation of RQ0 (M bottleneck) is optimistic, because the request part of the timing diagram contains a rough simplification: the request (e.g. 48 – 112 bits) is assumed to be sent in parallel, while in fact the request too will be pipelined word-by-word, the request link being 1 word wide.

Exercises
1. Why is Tcalc = t for all units except M?
2. Explain the RQ0 evaluation in the example where M is not the bottleneck.
In general

[Figure] Interleaved macro-modules (macro-module 0 = M0 … M7, macro-module 1 = M8 … M15, …), each behind its own IM, are connected through the network to the W - C2 - C1 chain of each PE.

s1 words are read in parallel by the macro-module, and sent in pipeline one word at a time through IM - network - W - C2 - C1. Reverse path for block writing.

Target: s1 / tM ≤ Bnetwork, and in general Bnetwork ≤ Bcache.
Memory bandwidth and contention

Single internally interleaved macro-module:
BM = 1 / tM blocks/sec
Only one PE at a time can be served.

Two externally interleaved macro-modules (i.e. interleaved with each other), each macro-module internally interleaved (inside the macro-module):
BM-max = 2 / tM blocks/sec
Two PEs at a time can be served, if they are not in conflict for the same macro-module.

[Figure] M(0) = M0 … M7 and M(1) = M8 … M15, each behind its IM, are connected by the interconnect to PE0 and PE1 (each with W, C2, C1). W selects the macro-module M(j) according to the index j contained in the physical address. The timing diagrams show block transfers of duration s1 (t + Ttr): serialized under conflict for the same macro-module, overlapped when there is no conflict.
In general

[Figure] m interleaved macro-modules (M0 … M7; M8 … M15; …), each behind its own IM, connected through the network to the N Processing Elements. The destination macro-module name belongs to the routing information set (inserted by W).
A first idea of contention effect

For an (externally) interleaved memory, the probability that a generic processor accesses any (macro-)module is approximated by 1/m. With this assumption, the probability of having PEs in conflict for the same macro-module is distributed according to the binomial law. We can find (Section 17.3.5):

[Figure: interleaved memory bandwidth as a function of the number of processors N (0 … 64), for m = 4, 8, 16, 32, 64 macro-modules.]

Simplified evaluation:
• only a subclass of multiprocessor architectures (SMP),
• no network effect on latency and conflicts,
• no impact of parallel program structures.
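A rough sketch of this kind of evaluation, using the standard binomial approximation E[busy modules] = m (1 - (1 - 1/m)^N) for N processors addressing m modules uniformly at random (the exact expression of Section 17.3.5 may differ):

#include <math.h>
#include <stdio.h>

/* Expected number of macro-modules busy in a memory cycle when each
   of N processors addresses one of m modules with probability 1/m. */
double interleaved_bandwidth(int m, int N) {
    return m * (1.0 - pow(1.0 - 1.0 / m, N));
}

int main(void) {
    const int ms[] = {4, 8, 16, 32, 64};
    for (int i = 0; i < 5; i++)
        printf("m = %2d, N = 64: %.2f modules busy\n",
               ms[i], interleaved_bandwidth(ms[i], 64));
    return 0;
}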
A more general client-server model will be derived

Contention arises in memory AND in the network: hence the importance of high-bandwidth and low-latency networks.
Caching
• Caching is even more important in multiprocessors,
– for latency and contention reduction,
– provided that reuse is intensively exploited.
• For shared data, intensive reuse can exist, with a proper design of the process RTS.
• However, the CACHE COHERENCE problem arises (studied in the second part of the semester).
Multiprocessor taxonomy
SMP vs NUMA architectures

Symmetric MultiProcessor (SMP): [figure: memory modules M0 … Mj … Mm-1 connected through one interconnection network to CPU0 … CPUN-1, each behind its W]. The base latency is independent of the specific PE and memory macro-module. Also called UMA (Uniform Memory Access).

Non Uniform Memory Access (NUMA): [figure: each node couples CPUi and Wi with a local memory Mi; the nodes are connected by the interconnection network]. The base latency depends (heavily) on the specific PE and on the referred macro-module. The local memories are shared: each of them can be internally interleaved, but they are organized sequentially with respect to each other. Local accesses have lower latency than remote ones. All private information is allocated in the local memory.

Target: contention is reduced, at the expense of the base latency for shared data (optimizations are needed).
SMP-like single-CMP architecture

[Figure] A CMP (PE 0 … PE N-1 on the internal interconnect, plus I/O INFs) whose MINFs are connected, through the interleaved memory interfaces IM, to external interleaved macro-modules (M0 … M7 each).
SMP and NUMA multiple-CMP architectures

a) Multiple-CMP SMP architecture: [figure] the shared memory modules, grouped behind IM0 … IMm-1, are connected by the external interconnect to CMP0 … CMPN-1 (each CMP containing several PEs, behind its W units).

b) Multiple-CMP NUMA architecture: [figure] each CMPi has its own local memory modules behind IMi; the CMPs communicate through the external interconnect.
Process_to_Processor Mapping

Anonymous Processors (originally SMP):
• Dynamic mapping (low-level scheduling).
• Multiprogrammed mapping: several processes dynamically share the same PE; context-switch overhead.
• Typical of 'traditional' computing servers, data-centres (?), cloud (?).

Dedicated Processors (originally NUMA):
• Static mapping.
• Exclusive mapping: one-to-one.
• Typical of parallel applications dedicated to specific domains.

Exercise: give an approximate evaluation of the context-switch calculation time.
Interconnection networks
Two extreme cases of networks

[Figure: old-style bus vs crossbar.]

• The bus is no longer applicable to highly parallel systems: cheap, but no parallelism in memory accesses, hence minimum bandwidth and maximum latency.
• Crossbar = fully interconnected with N^2 dedicated links: maximum parallelism and bandwidth, minimum latency, but applicable to limited parallelism only (e.g., N = 8) because of link cost and pin-count reasons.
• Limited-degree networks for highly parallel systems: much lower cost than the crossbar, by reducing the number of links and interfaces (pin count), at the expense of latency; the maximum bandwidth can equally be achieved.
‘High-performance’ networks

• Many of the limited-degree networks studied for multiprocessors are used in distributed memory systems and in high-performance multicomputers too.
– The firmware level is the same, or very similar, for different architectures.
• The main difference lies in the implementation of the routing and flow control protocols:
– Notable industrial examples: Infiniband, Myrinet, QS-net, etc.
– In multiprocessors and high-performance multicomputers, the primitive protocols at the firmware level are (can be) used directly in the RTS of applications, without the additional software layers, like TCP-IP, of traditional networks.
– The overhead imposed by traditional TCP-IP implementations amounts to several orders of magnitude (e.g. msecs vs nsecs of latency!): no/scarce firmware support (the NIC is used for the physical layers only), and execution in kernel mode on top of operating systems (e.g., Linux).
• We'll see that the modern network systems cited above also make the primitive firmware protocols visible:
– for high-performance distributed applications, unless TCP-IP is forced by binary portability reasons of 'old'/legacy products;
– moreover, such networks implement TCP-IP with intensive firmware support (mainly in the NIC) and in user mode: 1-2 orders of magnitude of overhead are saved.
Firmware messages as streams
• Messages are packets transmitted as streams of elementary
data units, typically words.
• Example: a cache block transmitted from the main memory as a stream of s1 words.
Evaluation metrics

At least evaluated as order of magnitude O(f(N)).

• Cost of links: bus O(1); crossbar O(N^2), the absolute maximum; typical limited-degree networks: O(1), O(N), O(N lg N).
• Maximum bandwidth: bus O(1); crossbar O(N), the absolute maximum; typical limited-degree networks: O(N).
• Complexity of design to achieve the maximum bandwidth (nondeterminism vs parallelism): bus O(1); crossbar O(c^N), the absolute maximum (monolithic design); typical limited-degree networks: O(c^2), i.e. O(1) for any N (modular design).
• Latency (≈ distance): bus O(N); crossbar O(1), the absolute minimum; typical limited-degree networks: O(N), O(√N), O(lg N), the last being the best except O(1).
From crossbars to limited-degree networks

[Figure] A monolithic (single-unit) N x N crossbar has N bidirectional interfaces. The monolithic 2 x 2 crossbar, with two input interfaces and two output interfaces connected to the input/output links through two MUXes, is assumed as the elementary building block for N x N modular designs.

Exercise: describe the firmware behavior of the 2 x 2 switch, and prove that the maximum bandwidth is given by 1 / (t + Ttr) (single buffering) or by 2 / (t + Ttr) (double buffering).
Modular design for limited-degree networks

[Figure] A 4 x 4 limited-degree network implemented by the limited-degree interconnection of 2 x 2 elementary crossbars (four switches, N = 4): a binary butterfly with dimension n = lg2 N (2 in the example). It is a notable example of a multi-stage network (2 stages in the example); the network dimension n = number of stages.
Modular crossbar as a butterfly

[Figure: butterflies of dimension n = 1, 2, 3 built from 2 x 2 switches.]

• 'Straight' links: to the next stage, same level.
• 'Oblique' links: to the next stage; the base-2 representations of the source and destination levels differ only in the source-stage-index bit.
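A minimal sketch of the per-stage routing decision this implies, under the assumption that stage i (i = 0 … n-1) handles bit n-1-i of the level number (the bit ordering is a convention and may differ in the notes):

/* Deterministic binary-butterfly routing: at each stage, compare the
   stage's bit of the current level with the destination's; if they
   differ, take the oblique link, otherwise the straight one. */
int butterfly_step(unsigned level, unsigned dest, int stage, int n) {
    unsigned bit = 1u << (n - 1 - stage);   /* bit handled at this stage */
    return ((level ^ dest) & bit) ? 1 : 0;  /* 1 = oblique, 0 = straight */
}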
k-ary n-fly networks

Butterfly of ariety k and dimension n. [Figure: PEs on one side and memory modules M on the other, connected through n stages of switches 'sw'; typical utilization: SMP.] Extendable to any ariety k, though k must be 'low' for limited-degree networks.

• Number of processing nodes = 2N, with N = k^n.
• Node degree = 2k.
• Latency ≈ distance = n = lg_k N.
• Number of links and switches = O(N lg N): respectively (n - 1) 2^n and n 2^(n-1) in the binary case.
• Maximum bandwidth = O(N).
• Complexity for maximum bandwidth = O(1), once the elementary crossbar is available.
• Simple deterministic routing, based on the binary representations of sender and destination, the current stage index, and the straight/oblique links.
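A quick numeric check of the counting formulas in the binary case (k = 2, n = 3, N = 8):

\[
\text{switches} = n\,2^{\,n-1} = 3 \cdot 4 = 12, \qquad
\text{inter-stage links} = (n-1)\,2^{\,n} = 2 \cdot 8 = 16, \qquad
\text{distance} = n = \lg_2 8 = 3 .
\]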
Fat tree

A tree structure (typical for NUMA) has logarithmic mean latency (e.g. ≈ n or 2n, with n = lg2 N the number of tree levels) and other properties similar to those of butterflies. Routing algorithm: common-ancestor based.

In NUMA, the process mapping must be chosen properly, in order to minimize distances.

However, contention in the switches is too high with simple trees. In order to minimize contention, the link and switch bandwidth increases from level to level, e.g. it doubles: fat tree.

Problem: the cost and complexity of the switches also increase from level to level! Modular crossbars cannot be used, otherwise the latency increases.

[Figures: a butterfly, a simple tree of switches with the PEs at the leaves, and a fat tree.]
Generalized fat tree

[Figure: PEs at the leaves; first-level, second-level and third-level crossbars of increasing size.]

Modest increase of contention. Suitable both for NUMA and for SMP, if the switches behave according to the butterfly routing or to the tree routing.
k-ary n-cubes

Toroidal structures: rings in each dimension. [Figures: 4-ary 1-cube, 4-ary 2-cube, 4-ary 3-cube; each node contains a switch unit.]

• Number of processing nodes = N = k^n.
• Node degree = 2n.
• Latency ≈ distance = O(k n) = O(N^(1/n)) for small n, = O(lg_k N) for large n. However, process mapping is critical.
• Number of links and switches = O(k^n) = O(N).
• Maximum bandwidth = O(N).
• Complexity for maximum bandwidth = O(c^n) for minimum latency, otherwise O(1).
• Simple deterministic routing (dimensional).
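A minimal sketch of dimensional (dimension-order) routing on such a torus: correct one coordinate at a time, moving along that dimension's ring in the shorter direction (coordinate layout and direction encoding are illustrative choices):

/* Compute the next hop: returns 1 and sets *dim/*step (step = +1 or -1
   along the ring of dimension *dim) while the destination is not
   reached, 0 on arrival. Coordinates are in 0 .. k-1. */
int next_hop(const int cur[], const int dst[], int n, int k,
             int *dim, int *step) {
    for (int d = 0; d < n; d++) {
        if (cur[d] != dst[d]) {
            int fwd = (dst[d] - cur[d] + k) % k; /* hops in the '+' direction */
            *dim  = d;
            *step = (fwd <= k - fwd) ? +1 : -1;  /* choose the shorter way */
            return 1;
        }
    }
    return 0;
}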
Local Input-Output
Interprocessor communications

• In a multiprocessor, the main mode of processor cooperation for the process RTS is via shared memory.
• However, there are some cases in which asynchronous events are needed, and are more efficiently signaled through direct interprocessor communications, i.e. via Input-Output.
• Examples:
– processor synchronization (locking, notify),
– low-level scheduling (process wake-up),
– cache coherence strategies, etc.
• In such cases, signaling and testing the presence of asynchronous events via shared memory is very time consuming in terms of latency, bandwidth and contention.
Local I/O

Each PE contains an on-chip local I/O unit (UC), with its local I/O memory (MUC), to send and receive interprocessor event messages. The same interconnection structure, or a dedicated one, is used. A traditional I/O bus makes no sense for performance reasons: instead, dedicated on-chip links connect the UC with the CPU and W.

[Figure] Inside a PE (core): W exchanges Load/Store requests and input/output interprocessor messages with the interconnection structure; the instruction C1 (IM) and data C1 (DM) serve IU and EU; the UC is connected to the interrupt interface of IU (interrupt message, Int, Ackint) and delivers Load data to EU.

• To start an interprocessor communication, a CPU uses the I/O instructions: Memory Mapped I/O.
• The associated UC forwards the event message to the UC of the destination PE, in the form of a word stream, through the W units and the interconnect.
• W is able to distinguish memory access requests/replies from interprocessor communications.
• The receiving UC uses the interrupt mechanism to forward the event message to the destination CPU.

There is no request-reply behavior; instead, it is a purely asynchronous mechanism.
Example 1

Assume that the event message is composed of the event_code and of two data words (data_1, data_2), and that the process running on the destination PE inserts the tuple (event_code, data_1, data_2) into a queue associated with the event.

The source CPU executes the following Memory Mapped I/O instructions:

STORE RUC, 0, PE_dest
STORE RUC, 1, event_code
STORE RUC, 2, data_1
STORE RUC, 3, data_2

where RUC means ...

Interrupt message from UC to CPU: (event, parameter_1, parameter_2).

The destination CPU executes the following interrupt handler:

HANDLER: STORE Rbuffer_ev, Rbuffer_pointer, Revent
STORE Rbuffer_1, Rbuffer_pointer, Rparameter_1
STORE Rbuffer_2, Rbuffer_pointer, Rparameter_2
…
GOTO Rret_interrupt
…

Exercise:
1. What happens in a Memory Mapped I/O instruction if the I/O unit doesn't contain a physical local memory?
2. Can the STORE instructions executed by the source CPU be replaced by LOAD instructions?
Example 2

Alternative behavior: the process running in the destination PE is busy-waiting for the event message, executing the special instruction:

WAITINT Rmask, Revent, Rparameter_1, Rparameter_2

or, if the WAITINT instruction is not primitive, a simple busy waiting loop like:

MASKINT Rmask
WAIT: GOTO WAIT
EI

(no real handler)
Synchronous vs asynchronous event notification

[Figure] Asynchronous wait (Example 1): the interrupt arrives while the process is executing its instructions; an interrupt handler performs the event registration. Synchronous wait (Example 2): the process waits for the event message explicitly.