Synchronization and Communication in the T3E Multiprocessor

Background
• T3E is the second of Cray’s massively scalable multiprocessors (after the T3D)
• Both are scalable up to 2048 processing elements
• Shared-memory systems, but programmable using message passing (PVM or MPI, “more portable”) or shared memory (HPF)
Challenges
• T3E (and T3D) attempted to overcome the inherent limitations of employing commodity microprocessors in very large multiprocessors
• Memory interface – a cache-line-based design makes references to single words inefficient
• Typical address spaces are too small for use in big systems
• Non-cached references are often desirable (e.g. a message to another processor)
T3D Strengths (used in T3E)
• External structure in each PE to expand the address space
• Shared address space
• 3D torus interconnect
• Pipelined remote memory access with prefetch queue and non-cached stores
T3D: Room for improvement
• Overblown barrier network
• One outstanding cache-line fill at a time (low load bandwidth)
• Too many ways to access remote memory
• Low single-node performance
• Unoptimized special hardware features (block transfer engine, DTB Annex, dedicated message queues and registers)
T3E Overview
• Each PE contains an Alpha 21164, local memory, and control and routing chips
• Network links time-multiplexed at 5X the system frequency
• Self-hosted, running Unicos/mk
• No remote caching or board-level caches
E-Registers
• Extend the physical address space
• Increase attainable memory pipelining
• Enable high single-word bandwidth
• Provide mechanisms for data distribution, messaging, and atomic memory operations
• In general, they improve on the inefficient individual structures of the T3D
Operations with E-Registers
• Appropriate operands are stored in the appropriate E-registers by the processor
• Processor then issues another store command to initiate the operation (sketched below)
– Address specifies the command and the source or destination E-register
– Data specifies a pointer to the already-stored operands and a remote address index
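A minimal C sketch of this two-store sequence, for illustration only: the register-window addresses, command code, and bit positions below are hypothetical, not the actual T3E memory map. It shows the protocol the slide describes – operands are first stored into E-registers, then one more store whose address names the command and target E-register and whose data word points at the operand block and carries the remote address index.

#include <stdint.h>

/* Hypothetical memory-mapped windows (NOT the real T3E addresses). */
#define ER_BASE      ((volatile uint64_t *)0x4000000000ULL) /* E-register window */
#define ER_CMD_BASE  ((volatile uint64_t *)0x4800000000ULL) /* command window    */
#define CMD_GET      0x1UL                                  /* hypothetical code */

/* Issue a single-word Get whose result lands in E-register 'dst_er'. */
static void eregister_get(unsigned dst_er, unsigned operand_block,
                          uint64_t centrifuge_mask, uint64_t base_addr,
                          uint64_t remote_index)
{
    /* Step 1: store the operands (centrifuge mask, base address, ...)
     * into an aligned block of E-registers with ordinary stores.     */
    ER_BASE[operand_block + 0] = centrifuge_mask;
    ER_BASE[operand_block + 1] = base_addr;

    /* Step 2: one more store triggers the operation.  The *address*
     * encodes the command and the destination E-register; the *data*
     * word encodes the operand-block pointer and the remote address
     * index that is combined with the operands to form the global
     * address.  Field widths here are invented for illustration.     */
    ER_CMD_BASE[(CMD_GET << 11) | dst_er] =
        ((uint64_t)operand_block << 50) | remote_index;
}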
Address Translation
• Global virtual addresses and virtual PE numbers are formed outside the processor
• Centrifuge used for efficient data distribution (sketched below)
• Specifying the memory location on the data bus enables a bigger address space
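A small C sketch of the centrifuge idea (conceptual only, not the hardware implementation): the bits of a global index selected by a mask are compacted to form the virtual PE number, and the remaining bits are compacted to form the offset within that PE.

#include <stdint.h>

typedef struct { uint64_t pe; uint64_t offset; } centrifuged_t;

/* Split a global index into (virtual PE, per-PE offset) under 'mask'. */
static centrifuged_t centrifuge(uint64_t index, uint64_t mask)
{
    centrifuged_t r = {0, 0};
    unsigned pe_bit = 0, off_bit = 0;

    for (unsigned b = 0; b < 64; b++) {
        uint64_t bit = (index >> b) & 1;
        if ((mask >> b) & 1)
            r.pe     |= bit << pe_bit++;   /* masked bits -> PE number  */
        else
            r.offset |= bit << off_bit++;  /* other bits  -> PE offset  */
    }
    return r;
}

With a mask covering low-order index bits this yields a cyclic distribution of words across PEs; a mask covering high-order bits yields a blocked distribution, and mixed masks give block-cyclic layouts.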
Remote Reads/Writes
• All operations are done by reading into E-registers (Gets) or writing from E-registers to memory (Puts)
• Vector forms transfer 8 words with arbitrary stride (e.g. every 3rd word)
• Large number of E-registers allows significant pipelining of Gets/Puts (sketched below)
– Limited by the bus interface (256B/26.7ns)
• Single-word load bandwidth is high – words can be loaded into contiguous E-registers, then moved into cache (instead of fetching each cache line)
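A hedged sketch of how this looks to a programmer, written with OpenSHMEM-style calls (an assumption – the T3E-era library was Cray SHMEM, whose header and routine names differ slightly). On the T3E, puts and gets like these are implemented with E-register operations, so a contiguous Put and a strided Get can be heavily pipelined.

#include <shmem.h>

#define N 1024
static double remote_buf[N];   /* symmetric array, same address on every PE */

int main(void)
{
    double local[N];
    shmem_init();

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    int peer = (me + 1) % npes;

    /* Contiguous Put: write N doubles into the peer's remote_buf. */
    for (int i = 0; i < N; i++) local[i] = me + i;
    shmem_double_put(remote_buf, local, N, peer);

    /* Strided Get: fetch every 3rd word of the peer's remote_buf,
     * packing it contiguously into local[] (cf. the vector Get forms). */
    shmem_barrier_all();
    shmem_double_iget(local, remote_buf, 1 /*dst stride*/, 3 /*src stride*/,
                      N / 3, peer);

    shmem_finalize();
    return 0;
}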
Atomic Memory Operations
• Fetch_&_Inc, Fetch_&_Add, Compare_&_Swap, Masked_Swap
• Can be performed on any memory location
• Performed like any other E-register operation (sketched below)
– Operands placed in E-registers
– Triggered via a store, sent over the network
– Result sent back and stored in the specified E-register
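A hedged sketch of a common use of Fetch_&_Inc – claiming work items from a shared counter – again in OpenSHMEM style (an assumption; the counter name and task count are illustrative). On the T3E, the fetch-and-increment travels to the owning PE's memory as an E-register AMO and the old value is returned.

#include <stdio.h>
#include <shmem.h>

static long next_task = 0;          /* symmetric counter; PE 0's copy is used */
#define NUM_TASKS 100

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    for (;;) {
        /* Atomically fetch PE 0's counter and increment it. */
        long task = shmem_long_atomic_fetch_inc(&next_task, 0);
        if (task >= NUM_TASKS)
            break;
        printf("PE %d got task %ld\n", me, task);
    }

    shmem_finalize();
    return 0;
}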
Messaging
T3D: Specific queue location of fixed size
T3E: Arbitrary number of queues, mapped to normal memory, of any size up to 128 MB
T3D: All incoming messages generated interrupts, adding significant penalties
T3E: Three options – interrupt, don’t interrupt (detected via polling), and interrupt after a threshold number of messages
Messaging Specifics
• Message queues consist of Message Queue Control Words (MQCW)
• Messages are assembled into 8 E-registers, then a SEND is issued with the address of the MQCW
• Message queue is managed in software – avoids the OS if polling is used (sketched below)
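A conceptual C sketch of the software-managed, polled queue (all structure and function names here are hypothetical, and the hardware details of SEND and the MQCW are omitted): messages land in a queue that lives in ordinary memory, arrival advances a tail pointer, and the receiving PE polls instead of taking an interrupt.

#include <stdint.h>
#include <stdbool.h>

#define MSG_WORDS   8            /* T3E messages are 8 64-bit words      */
#define QUEUE_SLOTS 1024         /* queue size is a software choice      */

typedef struct {
    uint64_t words[MSG_WORDS];
} message_t;

typedef struct {
    volatile uint64_t tail;          /* advanced as messages arrive      */
    uint64_t          head;          /* advanced by the polling receiver */
    message_t         slots[QUEUE_SLOTS];
} msg_queue_t;

/* Poll the queue: returns true and copies out one message if available. */
static bool poll_queue(msg_queue_t *q, message_t *out)
{
    if (q->head == q->tail)
        return false;                          /* nothing new yet        */
    *out = q->slots[q->head % QUEUE_SLOTS];    /* consume oldest entry   */
    q->head++;                                 /* free the slot          */
    return true;
}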
Synchronization
• Support for barriers and eurekas (a message from one processor to a group)
• 32 barrier synchronization units (BSUs) at each processor, accessed as memory-mapped registers
• Synchronization packets use a dedicated high-priority virtual channel
– Propagated through a logical tree embedded in the 3D torus interconnect
Synchronization
• Simple barrier operation involves 2 states (usage sketched below)
– First arms all processors in the group (S_ARM)
– Once all are armed, the network notifies everyone of completion and the processors return to S_BAR
• Eureka requires 3 states to ensure one eureka is received before the next one is issued
– Eureka notification is immediately followed by a barrier
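A hedged sketch of barrier use from the programmer's side, in OpenSHMEM style (an assumption about the API; on the T3E a call like this could be backed by the hardware barrier network rather than a software tree of memory operations).

#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    /* Phase 1: every PE does some local work. */
    printf("PE %d finished phase 1\n", me);

    /* All PEs must arrive here before any PE proceeds to phase 2,
     * mirroring the S_ARM -> (all armed) -> S_BAR sequence above. */
    shmem_barrier_all();

    printf("PE %d starting phase 2\n", me);
    shmem_finalize();
    return 0;
}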
Performance
Increasing the number of E-registers allows greater pipelining and bandwidth (limited by control logic)
Effective bandwidth greatly increases with larger transfer sizes due to the effects of overhead and startup latency
Performance
Transfer bandwidth is independent of stride, except when data happens to be loaded from the same bank(s) (strides that are multiples of 4 or 8)
Several million AMOs/sec are required to saturate the memory system and increase latency
Performance
Very high message bandwidth is supported without a latency increase
Hardware barrier is many times faster than an efficient software barrier (a factor of about 15 at 1024 PEs)
Conclusions
• E-registers allow a highly pipelined memory system and provide a common interface for all global memory operations
• Both messaging and standard shared-memory operations are supported
• Fast hardware barrier supported with almost no extra cost
• No remote caching eliminates the need for bulky coherence mechanisms and helps allow 2048-PE systems
• The paper provides no means of quantitative comparison to alternative systems