CS 258
Parallel Computer Architecture
Lecture 2
Convergence of Parallel Architectures
January 28, 2008
Prof John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs258
Review
• Industry has decided that Multiprocessing is the
future/best use of transistors
– Every major chip manufacturer now making MultiCore chips
• History of microprocessor architecture is one of increasing parallelism
– translates area and density into performance
• The Future is higher levels of parallelism
– Parallel Architecture concepts apply at many levels
– Communication also on exponential curve
• Proper way to compute speedup
– Incorrect way to measure:
» Compare parallel program on 1 processor to parallel program
on p processors
– Instead:
» Should compare uniprocessor program on 1 processor to
parallel program on p processors
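As a formula (standard definition, stated here for concreteness):
Speedup(p) = Time(best sequential program on 1 processor) / Time(parallel program on p processors)
e.g. if the best uniprocessor code takes 160 s and the parallel code takes 20 s on 16 processors, Speedup(16) = 160/20 = 8, not 16.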
History
• Parallel architectures tied closely to programming
models
– Divergent architectures, with no predictable pattern of
growth.
– Mid 80s renaissance
[Figure: divergent architectures (Systolic Arrays, SIMD, Message Passing, Dataflow, Shared Memory), each with its own application software, system software, and architecture layers]
Plan for Today
• Look at major programming models
– Where did they come from?
– The 80s architectural renaissance!
– What do they provide?
– How have they converged?
• Extract general structure and fundamental issues
[Figure: Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory converging toward a Generic Architecture]
Programming Model
• Conceptualization of the machine that programmer
uses in coding applications
– How parts cooperate and coordinate their activities
– Specifies communication and synchronization operations
• Multiprogramming
– no communication or synch. at program level
• Shared address space
– like bulletin board
• Message passing
– like letters or phone calls, explicit point to point
• Data parallel:
– more regimented, global actions on data
– Implemented with shared address space or message passing
Shared Memory ⇒ Shared Addr. Space
[Figure: several processors all connected to one shared memory]
• Range of addresses shared by all processors
– All communication is Implicit (Through memory)
– Want to communicate a bunch of info? Pass pointer.
• Programming is “straightforward”
– Generalization of multithreaded programming
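A minimal sketch (mine, not from the slides) of the “pass a pointer” idea using POSIX threads; the array name and size are invented for illustration:

#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double data[N];                  /* lives in the one shared address space */

/* The worker receives only a pointer into the shared array: no data is copied. */
static void *worker(void *arg) {
    double *chunk = (double *)arg;      /* "communication" = passing a pointer   */
    for (int i = 0; i < N / 2; i++)
        chunk[i] *= 2.0;                /* ordinary loads/stores to shared memory */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = i;

    pthread_t t;
    pthread_create(&t, NULL, worker, &data[0]);      /* thread scales 1st half  */
    for (int i = N / 2; i < N; i++) data[i] *= 2.0;  /* main scales 2nd half    */
    pthread_join(&t, NULL);                          /* synchronize before use  */

    printf("data[N-1] = %f\n", data[N - 1]);
    return 0;
}

The only explicit coordination is the join; everything else moves implicitly through loads and stores to the shared array.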
Historical Development
• “Mainframe” approach
– Motivated by multiprogramming
– Extends crossbar used for Mem and I/O
– Processor cost-limited => crossbar
– Bandwidth scales with p
– High incremental cost
» use multistage instead
[Figure: crossbar connecting processors (P) and I/O controllers (C) to memory modules (M)]
• “Minicomputer” approach
– Almost all microprocessor systems have bus
– Motivated by multiprogramming, TP
– Used heavily for parallel computing
– Called symmetric multiprocessor (SMP)
– Latency larger than for uniprocessor
– Bus is bandwidth bottleneck
» caching is key: coherence problem
– Low incremental cost
[Figure: bus-based SMP: processors (P) with caches ($), memory modules (M), and I/O controllers (C) on a shared bus]
Adding Processing Capacity
[Figure: memory modules (Mem), I/O controllers and devices, and processors all attached to a shared interconnect]
• Memory capacity increased by adding modules
• I/O by controllers and devices
• Add processors for processing!
– For higher-throughput multiprogramming, or parallel programs
Shared Physical Memory
• Any processor can directly reference any location
– Communication operation is load/store
– Special operations for synchronization
• Any I/O controller - any memory
• Operating system can run on any processor, or all.
– OS uses shared memory to coordinate
• What about application processes?
Shared Virtual Address Space
• Process = address space plus thread of control
• Virtual-to-physical mapping can be established so that processes share portions of address space.
– User-kernel or multiple processes
• Multiple threads of control on one address space.
– Popular approach to structuring OS’s
– Now standard application capability (ex: POSIX threads)
• Writes to shared address visible to other threads
• Natural extension of uniprocessor model
– conventional memory operations for communication
– special atomic operations for synchronization
» also load/stores
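A minimal sketch (mine, not from the slides) of “conventional loads/stores for data, special atomic operations for synchronization”, using C11 atomics for a one-shot producer/consumer flag; the variable names are invented:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int shared_value;                 /* ordinary data: plain loads/stores */
static atomic_int ready = 0;             /* synchronization: atomic flag      */

static void *producer(void *arg) {
    (void)arg;
    shared_value = 42;                                       /* ordinary store */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish        */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    /* Consumer spins on the atomic flag, then reads the data with a plain load. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                                    /* busy-wait       */
    printf("%d\n", shared_value);                            /* sees 42         */

    pthread_join(&t, NULL);
    return 0;
}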
Structured Shared Address Space
[Figure: virtual address spaces for a collection of processes communicating via shared addresses: each process P0..Pn has a private portion and a shared portion; loads and stores to the shared portion map to common physical addresses in the machine physical address space]
• Ad hoc parallelism used in system code
• Most parallel applications have structured SAS
• Same program on each processor
– shared variable X means the same thing to each thread
Cache Coherence Problem
[Figure: several processors have cached the value 4 from a memory location; one processor writes a new value (write-through?) while others read; do their cached copies ($4) return stale data, or miss and see the update?]
• Caches are aliases for memory locations
• Does every processor eventually see new value?
• Tightly related: Cache Consistency
– In what order do writes appear to other processors?
• Buses make this easy: every processor can snoop on
every write
– Essential feature: Broadcast
Engineering: Intel Pentium Pro Quad
[Figure: P-Pro processor modules (each: CPU, 256-KB L2 $, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with PCI bridges to PCI buses and I/O cards, and a memory controller/MIU to 1-, 2-, or 4-way interleaved DRAM]
– All coherence and
multiprocessing glue in
processor module
– Highly integrated, targeted
at high volume
– Low latency and bandwidth
Engineering: SUN Enterprise
[Figure: CPU/mem cards (two processors, each with $ and $2, plus memory controller behind a bus interface/switch) and I/O cards (bus interface to 3 SBUS slots, 2 FiberChannel, 100bT, SCSI) on the Gigaplane bus (256 data, 41 address, 83 MHz)]
• Proc + mem card - I/O card
– 16 cards of either type
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus
Quad-Processor Xeon Architecture
• All sharing through pairs of front side busses (FSB)
– Memory traffic/cache misses through single chipset to memory
– Example “Blackford” chipset
Scaling Up
[Figure: “Dance hall” organization: processors with caches ($) on one side of a network (e.g. an Omega network or a general network), memory modules (M) on the other. Distributed memory: each node pairs a processor, cache, and local memory on the network]
– Problem is interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
» latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
» Construct shared address space out of simple message
transactions across a general-purpose network (e.g. read-request, read-response)
– Caching shared (particularly nonlocal) data?
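A rough sketch of the read-request/read-response idea from the NUMA bullet above; every message field and helper function here is hypothetical, standing in for whatever the node's communication hardware provides (the network-layer functions are only declared, not implemented):

#include <stdint.h>

/* Hypothetical message types and helpers, for illustration only. */
enum { READ_REQUEST, READ_RESPONSE };

typedef struct {
    int      type;        /* READ_REQUEST or READ_RESPONSE   */
    int      src_node;    /* who asked                       */
    uint64_t addr;        /* global address being read       */
    uint64_t data;        /* payload (valid in the response) */
} msg_t;

/* Assumed primitives provided by the (hypothetical) network layer. */
void     net_send(int dest_node, msg_t *m);
msg_t    net_recv_matching(int type, uint64_t addr);
uint64_t local_load(uint64_t local_addr);
int      home_node_of(uint64_t global_addr);
int      my_node(void);

/* A shared-address-space load built from message transactions. */
uint64_t global_load(uint64_t global_addr) {
    int home = home_node_of(global_addr);
    if (home == my_node())
        return local_load(global_addr);          /* local: just a load          */

    msg_t req = { READ_REQUEST, my_node(), global_addr, 0 };
    net_send(home, &req);                        /* request goes to home node   */
    msg_t rsp = net_recv_matching(READ_RESPONSE, global_addr);
    return rsp.data;                             /* response carries the value  */
}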
Stanford DASH
• Clusters of 4 processors share 2nd-level cache
• Up to 16 clusters tied together with 2-dim mesh
• 16-bit directory associated with every memory line
– Each memory line has home cluster that contains DRAM
– The 16-bit vector says which clusters (if any) have read copies
– Only one writer permitted at a time
[Figure: a DASH cluster: four processors, each with an L1-$, sharing an L2-Cache]
• Never got more than 12 clusters (48 processors)
working at one time: Asynchronous network probs!
The MIT Alewife Multiprocessor
• Cache-coherent Shared Memory
– Partially in Software!
– Limited Directory + software overflow
• User-level Message-Passing
• Rapid Context-Switching
• 2-dimensional Asynchronous network
• One node/board
• Got 32 processors (+ I/O boards) working
Engineering: Cray T3E
[Figure: T3E node: processor (P) with cache ($), local memory, and a memory controller/NI attached to an X/Y/Z switch; external I/O]
– Scale up to 1024 processors, 480MB/s links
– Memory controller generates request message for non-local references
– No hardware mechanism for coherence
» SGI Origin etc. provide this
AMD Direct Connect
• Communication over general interconnect
– Shared memory/address space traffic over network
– I/O traffic to memory over network
– Multiple topology options (seems to scale to 8 or 16 processor
chips)
What is underlying Shared Memory??
[Figure: generic nodes (P, $, M) on a scalable network; Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory all converging toward this Generic Architecture]
• Packet switched networks better utilize available
link bandwidth than circuit switched networks
• So, network passes messages around!
Message Passing Architectures
• Complete computer as building block, including
I/O
– Communication via Explicit I/O operations
• Programming model
– direct access only to private address space (local memory),
– communication via explicit messages (send/receive)
• High-level block diagram
– Communication integration?
» Mem, I/O, LAN, Cluster
– Easier to build and scale than SAS
[Figure: complete computers (processor, cache, memory) connected by a network]
• Programming model more removed from basic hardware operations
– Library or OS intervention
Message-Passing Abstraction
[Figure: process P issues Send(X, Q, t) and process Q issues Receive(Y, P, t); the matching send/receive pair copies data from address X in P's local address space to address Y in Q's local address space]
– Send specifies buffer to be transmitted and receiving process
– Recv specifies sending process and application storage to receive into
– Memory to memory copy, but need to name processes
– Optional tag on send and matching rule on receive
– User process names local data and entities in process/tag space too
– In simplest form, the send/recv match achieves pairwise synch event
» Other variants too
– Many overheads: copying, buffer management, protection
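In MPI terms (my sketch; MPI is not mentioned on this slide), the match above looks like this, with process P as rank 0 and process Q as rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 17;                 /* the "t" in the matching rule */
    double X[4] = {1, 2, 3, 4};         /* sender's local buffer        */
    double Y[4];                        /* receiver's local buffer      */

    if (rank == 0) {
        /* Process P: send buffer X to process Q (= rank 1) with tag t. */
        MPI_Send(X, 4, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process Q: receive from P (= rank 0) with the same tag into Y. */
        MPI_Recv(Y, 4, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Q got %g %g %g %g\n", Y[0], Y[1], Y[2], Y[3]);
    }

    MPI_Finalize();
    return 0;
}

The (source, tag, communicator) triple is the matching rule; the transfer is a memory-to-memory copy between the two private address spaces.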
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
– HW close to prog. Model;
– synchronous ops
– topology central (hypercube algorithms)
[Figure: 3-D hypercube with nodes labeled 000 through 111]
CalTech Cosmic Cube (Seitz, CACM Jan 1985)
MIT J-Machine (Jelly-bean machine)
• 3-dimensional network topology
– Non-adaptive, E-cubed routing
– Hardware routing
– Maximize density of communication
• 64-nodes/board, 1024 nodes total
• Low-powered processors
– Message passing instructions
– Associative array primitives to aid in synthesizing shared-address space
• Extremely fine-grained communication
– Hardware-supported Active Messages
Diminishing Role of Topology?
• Shift to general links
– DMA, enabling non-blocking ops
» Buffered by system at destination
until recv
– Store&forward routing
• Fault-tolerant, multi-path
routing:
• Diminishing role of topology
Intel iPSC/1 -> iPSC/2 -> iPSC/860
– Any-to-any pipelined routing
– node-network interface dominates
communication time
» Network fast relative to overhead
» Will this change for ManyCore?
– Simplifies programming
– Allows richer design space
» grids vs hypercubes
Example Intel Paragon
[Figure: Intel Paragon node: two i860 processors, each with an L1 $, on a 64-bit, 50 MHz memory bus with memory controller, DMA, driver, NI, and 4-way interleaved DRAM; nodes attach to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links. Photo: Sandia's Intel Paragon XP/S-based supercomputer]
Building on the mainstream: IBM SP-2
• Made out of
essentially
complete RS6000
workstations
• Network
interface
integrated in
I/O bus (bw
limited by I/O
bus)
[Figure: IBM SP-2 node: Power 2 CPU with L2 $ on the memory bus, memory controller to 4-way interleaved DRAM, and a MicroChannel bus holding I/O and an i860-based NIC (DMA, NI, DRAM); nodes connect through a general interconnection network formed from 8-port switches]
Berkeley NOW
• 100 Sun Ultra2
workstations
• Intelligent
network
interface
– proc + mem
• Myrinet Network
– 160 MB/s per link
– 300 ns per hop
Data Parallel Systems
• Programming model
– Operations performed in parallel on each element of data structure
– Logically single thread of control, performs sequential or parallel steps
– Conceptually, a processor associated with each data element
• Architectural model
– Array of many simple, cheap processors with little memory each
» Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synchronization
• Original motivations
– Matches simple differential equation solvers
– Centralize high cost of instruction fetch/sequencing
[Figure: control processor broadcasting instructions to a 2-D grid of PEs]
Application of Data Parallelism
– Each PE contains an employee record with his/her salary
If salary > 100K then
salary = salary *1.05
else
salary = salary *1.10
– Logically, the whole operation is a single step
– Some processors enabled for arithmetic operation, others disabled
• Other examples:
– Finite differences, linear algebra, ...
– Document searching, graphics, image processing, ...
• Some recent machines:
– Thinking Machines CM-1, CM-2 (and CM-5)
– Maspar MP-1 and MP-2,
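The salary example above, written as ordinary C (a sequential sketch of the logically single data-parallel step; the array size is invented, and on a SIMD machine the if/else becomes an enable mask over the PEs):

#define N_EMPLOYEES 1024   /* illustrative size */

void adjust_salaries(double salary[N_EMPLOYEES]) {
    /* Logically a single data-parallel step: every element processed "at once".
       On a SIMD machine the condition enables one set of PEs, then the other. */
    for (int i = 0; i < N_EMPLOYEES; i++) {
        if (salary[i] > 100000.0)
            salary[i] *= 1.05;          /* PEs enabled for this branch */
        else
            salary[i] *= 1.10;          /* then the other set enabled  */
    }
}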
Connection Machine
(Tucker, IEEE Computer, Aug. 1988)
NVidia Tesla Architecture
Combined GPU and general CPU
Components of NVidia Tesla architecture
• SM has 8 SP thread processor cores
– 32 GFLOPS peak at 1.35 GHz
– IEEE 754 32-bit floating point
– 32-bit, 64-bit integer
– 2 SFU special function units
• Scalar ISA
– Memory load/store/atomic
– Texture fetch
– Branch, call, return
– Barrier synchronization instruction
• Multithreaded Instruction Unit
– 768 independent threads per SM
– HW multithreading & scheduling
• 16KB Shared Memory
– Concurrent threads share data
– Low latency load/store
• Full GPU
– Total performance > 500GOps
Evolution and Convergence
• SIMD Popular when cost savings of centralized
sequencer high
– 60s when CPU was a cabinet
– Replaced by vectors in mid-70s
» More flexible w.r.t. memory layout and easier to manage
– Revived in mid-80s when 32-bit datapath slices just fit on chip
• Simple, regular applications have good locality
• Programming model converges with SPMD (single
program multiple data)
– need fast global synchronization
– Structured global address space, implemented with either SAS
or MP
CM-5
• Repackaged
SparcStation
– 4 per board
• Fat-Tree
network
• Control network
for global
synchronization
[Figure: Systolic Arrays, SIMD, Message Passing, Dataflow, and Shared Memory converging toward a Generic Architecture]
Dataflow Architectures
• Represent computation as a
graph of essential dependences
– Logical processor at each node,
activated by availability of operands
– Message (tokens) carrying tag of next
instruction sent to next processor
– Tag compared with others in matching store; a match fires execution
a = (b + 1) × (b - c)
d = c × e
f = a × d
[Figure: dataflow graph for the example, and the Monsoon (MIT) pipeline: tokens arrive from the network into a token queue, pass through waiting-matching, instruction fetch (program store), execute, and form-token stages, then go back onto the network; a token store backs the matching stage]
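For concreteness, the example as straight-line C (my rendering, not Monsoon code); the comments note which operations a dataflow machine could fire as soon as their operand tokens arrive:

double dataflow_example(double b, double c, double e) {
    double t1 = b + 1.0;     /* fires once b arrives                    */
    double t2 = b - c;       /* independent of t1: can fire in parallel */
    double a  = t1 * t2;     /* fires when t1 and t2 tokens match       */
    double d  = c * e;       /* independent of a: can fire in parallel  */
    double f  = a * d;       /* final node: needs both a and d          */
    return f;
}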
Evolution and Convergence
• Key characteristics
– Ability to name operations, synchronization, dynamic scheduling
• Problems
– Operations have locality across them, useful to group together
– Handling complex data structures like arrays
– Complexity of matching store and memory units
– Expose too much parallelism (?)
• Converged to use conventional processors and memory
– Support for large, dynamic set of threads to map to processors
– Typically shared address space as well
– But separation of progr. model from hardware (like data-parallel)
• Lasting contributions:
– Integration of communication with thread (handler) generation
– Tightly integrated communication and fine-grained synchronization
– Remained useful concept for software (compilers etc.)
Systolic Architectures
• VLSI enables inexpensive special-purpose chips
– Represent algorithms directly by chips connected in regular pattern
– Replace single processor with array of regular processing elements
– Orchestrate data flow for high throughput with less memory access
[Figure: a memory (M) driving a single PE versus a memory driving a linear array of PEs]
• Different from pipelining
– Nonlinear array structure, multidirection data flow, each PE may
have (small) local instruction and data memory
• SIMD? : each PE may do something different
Systolic Arrays (contd.)
Example: Systolic array for 1-D convolution
y(i) = w1 × x(i) + w2 × x(i + 1) + w3 × x(i + 2) + w4 × x(i + 3)
[Figure: inputs x1..x8 stream through weight cells w4, w3, w2, w1 while partial results y1, y2, y3 flow in the opposite direction; each cell computes xout = x, x = xin, yout = yin + w × xin]
– Practical realizations (e.g. iWARP) use quite general processors
» Enable variety of algorithms on same hardware
– But dedicated interconnect channels
» Data transfer directly from register to register across channel
– Specialized, and same problems as SIMD
» General purpose systems work well for same algorithms (locality
etc.)
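A software rendering (mine, not iWARP code) of the cell equations above: each iteration of the inner loop plays the role of one weight cell applying yout = yin + w × xin as the x stream slides past it. The caller must supply ny + NW - 1 input samples.

#define NW 4      /* number of weight cells, as in the slide */

/* Computes y(i) = w[0]*x(i) + w[1]*x(i+1) + w[2]*x(i+2) + w[3]*x(i+3). */
void conv1d(const double w[NW], const double x[], double y[], int ny) {
    for (int i = 0; i < ny; i++) {
        double yin = 0.0;                 /* partial sum entering the array */
        for (int c = 0; c < NW; c++)
            yin = yin + w[c] * x[i + c];  /* one cell: yout = yin + w*xin   */
        y[i] = yin;                       /* value emerging from last cell  */
    }
}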
Toward Architectural Convergence
• Evolution and role of software have blurred boundary
– Send/recv supported on SAS machines via buffers
– Can construct global address space on MP
(GA -> P | LA: node id plus local address; see the sketch below)
– Page-based (or finer-grained) shared virtual memory
• Hardware organization converging too
– Tighter NI integration even for MP (low-latency, high-bandwidth)
– Hardware SAS passes messages
• Even clusters of workstations/SMPs are parallel systems
– Emergence of fast system area networks (SAN)
• Programming models distinct, but organizations converging
– Nodes connected by general network and communication assists
– Implementations also converging, at least in high-end machines
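A tiny sketch of the GA -> P | LA decomposition mentioned above; the 8-bit node field and the 64-bit address width are illustrative assumptions, not any particular machine's format:

#include <stdint.h>

/* Assume the top bits of a 64-bit global address name the home node
   and the low bits are an offset into that node's local memory.
   The 8/56 split is made up for illustration.                        */
#define NODE_BITS 8

static inline int      node_of(uint64_t ga)  { return (int)(ga >> (64 - NODE_BITS)); }
static inline uint64_t local_of(uint64_t ga) { return ga & ((UINT64_C(1) << (64 - NODE_BITS)) - 1); }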
Convergence: Generic Parallel Architecture
[Figure: generic node: processor (P), cache ($), and memory (Mem) with a communication assist (CA) attached to a scalable network]
• Node: processor(s), memory system, plus
communication assist
– Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, within
framework
– Integration of assist with node, what operations, how efficiently...
Flynn’s Taxonomy
• # instruction streams x # data streams
– Single Instruction Single Data (SISD)
– Single Instruction Multiple Data (SIMD)
– Multiple Instruction Single Data (MISD)
– Multiple Instruction Multiple Data (MIMD)
• Everything is MIMD!
• However – Question is one of efficiency
– How easily (and at what power!) can you do certain operations?
– GPU solution from NVIDIA good at graphics – is it good in
general?
• As (More?) Important: communication architecture
– How do processors communicate with one another
– How does the programmer build correct programs?
Any hope for us to do research
in multiprocessing?
• Yes: FPGAs as New Research Platform
• As ~ 25 CPUs can fit in Field Programmable
Gate Array (FPGA), 1000-CPU system from
~ 40 FPGAs?
• 64-bit simple “soft core” RISC at 100MHz in 2004 (VirtexII)
• FPGA generations every 1.5 yrs; 2X CPUs, 2X clock rate
• HW research community does logic design
(“gate shareware”) to create an out-of-the-box Massively Parallel
Processor that runs standard binaries of OS, apps
– Gateware: Processors, Caches, Coherency, Ethernet
Interfaces, Switches, Routers, … (IBM, Sun have donated
processors)
– E.g., 1000 processor, IBM Power binary-compatible,
cache-coherent supercomputer @ 200 MHz; fast enough for
research
RAMP
• Since goal is to ramp up research in
multiprocessing, called Research
Accelerator for Multiple Processors
– To learn more, read “RAMP: Research Accelerator for
Multiple Processors - A Community Vision for a Shared
Experimental Parallel HW/SW Platform,” Technical
Report UCB//CSD-05-1412, Sept 2005
– Web page ramp.eecs.berkeley.edu
• Project Opportunities?
– Many
– Infrastructure development for research
– Validation against simulators/real systems
– Development of new communication features
– Etc….
Why RAMP Good for Research?
Grades by platform, in the order SMP / Cluster / Simulate / RAMP:
– Cost (1000 CPUs): F ($40M) / C ($2M) / A+ ($0M) / A ($0.1M)
– Cost of ownership: A / D / A / A
– Scalability: C / A / A / A
– Power/Space (kilowatts, racks): D (120 kw, 12 racks) / D (120 kw, 12 racks) / A+ (.1 kw, 0.1 racks) / A (1.5 kw, 0.3 racks)
– Community: D / A / A / A
– Observability: D / C / A+ / A+
– Reproducibility: B / D / A+ / A+
– Flexibility: D / C / A+ / A+
– Credibility: A+ / A+ / F / A
– Perform. (clock): A (2 GHz) / A (3 GHz) / F (0 GHz) / C (0.2 GHz)
– GPA: C / B- / B / A-
RAMP 1 Hardware
• Completed Dec. 2004 (14x17 inch 22-layer
PCB)
• Module:
– FPGAs, memory,
10GigE conn.
– Compact Flash
– Administration/
maintenance
ports:
» 10/100 Enet
» HDMI/DVI
» USB
– ~$4K/module w/o FPGAs or DRAM
• Called “BEE2” for Berkeley Emulation Engine 2
RAMP Blue Prototype (1/07)
• 8 MicroBlaze cores / FPGA
• 8 BEE2 modules (32 “user” FPGAs)
x 4 FPGAs/module
= 256 cores @ 100MHz
• Full star-connection between
modules
• It works; runs NAS benchmarks
• CPUs are softcore MicroBlazes
(32-bit Xilinx RISC architecture)
Vision: Multiprocessing Watering Hole
RAMP
[Figure: research topics gathered around RAMP: parallel file systems, dataflow languages/computers, data center in a box, thread scheduling, security enhancements, Internet in a box, multiprocessor switch design, router design, compile to FPGA, fault insertion to check dependability, parallel languages]
• RAMP attracts many communities to shared
artifact
⇒ Cross-disciplinary interactions
⇒ Accelerate innovation in multiprocessing
• RAMP as next Standard Research Platform?
(e.g., VAX/BSD Unix in 1980s, x86/Linux in
1990s)
Conclusion
• Several major types of communication:
– Shared Memory
– Message Passing
– Data-Parallel
– Systolic
– DataFlow
• Is communication “Turing-complete”?
– Can simulate each of these on top of the other!
• Many tradeoffs in hardware support
• Communication is a first-class citizen!
– How to perform communication is essential
» IS IT IMPLICIT or EXPLICIT?
– What to do with communication errors?
– Does locality matter???
– How to synchronize?