Communication Models for Parallel Computer Architectures
• Two distinct models have been proposed for how CPUs in a parallel computer system should communicate.
  - In the first model, all CPUs share a common physical memory. This kind of system is called a multiprocessor or shared memory system.
  - In the second design, each CPU has its own private memory. Such a design is called a multicomputer or distributed memory system.
Multiprocessors
• Consider a program to find all of the objects in a bitmap image.
  - One copy of the image is kept in memory.
  - Each CPU runs a single process which inspects one section of the image.
  - Some objects occupy multiple sections, so it is essential that each process have access to the entire image (see the sketch below).
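As a rough illustration of this shared-memory style, here is a minimal sketch in C with POSIX threads standing in for per-CPU processes. The image dimensions, thread count, and the object test are assumptions invented for the example, not details from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4          /* assumed number of CPUs */
    #define HEIGHT   1024       /* assumed image size */
    #define WIDTH    1024

    /* One copy of the image, visible to every thread (the shared memory). */
    static unsigned char image[HEIGHT][WIDTH];

    static void *scan_section(void *arg) {
        int id = *(int *)arg;
        int rows = HEIGHT / NTHREADS;
        long set = 0;
        /* Each thread inspects its own band of rows, but because the whole
           array is shared it can also follow an object past the band edge. */
        for (int r = id * rows; r < (id + 1) * rows; r++)
            for (int c = 0; c < WIDTH; c++)
                if (image[r][c])    /* stand-in for the real object test */
                    set++;
        printf("thread %d saw %ld set pixels\n", id, set);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, scan_section, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }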
• Example multiprocessors include:
  - Sun Enterprise 1000
  - Sequent NUMA-Q
  - SGI Origin 2000
  - HP/Convex Exemplar
Multicomputers
• In a multicomputer solving the same problem, each CPU has a section of the image in its local memory.
  - If the CPUs need to follow an object across the border, they must request the information from a neighboring CPU. This is done via message passing.
• Programming multicomputers is more difficult than programming multiprocessors, but multicomputers are more scalable.
  - Building a multicomputer with 10,000 CPUs is straightforward.
Multicomputers
• Example multicomputers include:
  - IBM SP/2
  - Intel/Sandia Option Red
  - Wisconsin COW
• Much research focuses on hybrid systems combining the best of both worlds.
• Shared memory might be implemented at a higher level than the hardware.
  - The operating system might simulate shared memory by providing a single system-wide paged shared address space. This approach is called DSM (Distributed Shared Memory).
Shared Memory
• Each machine has its own virtual memory and its own page table.
• When a CPU does a LOAD or STORE on a page it does not have, a trap to the OS occurs.
• The OS locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network.
• When it arrives, the page is mapped in and the faulting instruction is restarted (see the sketch below).
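The fault path just described can be sketched as follows. This is conceptual C, not a real kernel API: every helper is a hypothetical stand-in for OS machinery, stubbed out so the sketch compiles.

    typedef unsigned long vaddr_t;

    /* Hypothetical stand-ins for OS machinery, stubbed for illustration. */
    static int   locate_owner(vaddr_t page)              { (void)page; return 1; }
    static void  ask_owner_to_unmap(int node, vaddr_t p) { (void)node; (void)p; }
    static void *receive_page_over_network(vaddr_t p)    { (void)p; return 0; }
    static void  map_page(vaddr_t p, void *frame)        { (void)p; (void)frame; }
    static void  restart_faulting_instruction(void)      { }

    /* Invoked on the trap caused by a LOAD or STORE to an absent page. */
    void dsm_page_fault(vaddr_t page) {
        int owner = locate_owner(page);                /* OS locates the page   */
        ask_owner_to_unmap(owner, page);               /* holder unmaps it...   */
        void *frame = receive_page_over_network(page); /* ...and ships it over  */
        map_page(page, frame);                         /* map the page in       */
        restart_faulting_instruction();                /* retry the LOAD/STORE  */
    }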
• A third possibility is to have a user-level runtime system implement a form of shared memory.
Shared Memory
• The programming language provides a shared memory abstraction implemented by the compiler and runtime system.
  - The Linda model is based on the abstraction of a shared space of tuples. Processes can input a tuple from the shared tuple space or output a tuple to the shared tuple space.
  - The Orca model allows shared generic objects. Processes can execute object-specific methods on shared objects.
    - When a change occurs to the internal state of some object, it is up to the runtime system to simultaneously update all copies of the object.
Interconnection Networks
• Multicomputers are held together by interconnection networks which move packets between CPUs and memory.
  - The CPUs and memory modules of multiprocessors are also interconnected.
• Interconnection networks consist of:
  - CPUs
  - Memory modules
  - Interfaces
  - Links
  - Switches
Interconnection Networks
• The links are the physical channels over which bits move. They can be:
  - electrical or optical fiber
  - serial or parallel
  - simplex, half-duplex, or full-duplex
• The switches are devices with multiple input ports and multiple output ports.
  - When a packet arrives at an input port on a switch, some bits are used to select the output port to which the packet is sent.
Switching
• An interconnection network consists of switches and wires connecting them.
• Consider an example network in which:
  - Each switch has four input ports and four output ports.
  - In addition, each switch has some CPUs and interconnect circuitry.
  - The job of the switch is to accept packets arriving on any input port and send each one out on the correct output port.
  - Each output port is connected to an input port of another switch by a parallel or serial line.
Switching
• Several switching strategies are possible.
  - In circuit switching, before a packet is sent, the entire path from the source to the destination is reserved in advance. All ports and buffers are claimed, so that when transmission starts, all necessary resources are guaranteed to be available and the bits can move at full speed from the source, through the switches, to the destination.
  - In store-and-forward packet switching, no advance reservation is needed. The source sends a complete packet to the first switch, where it is stored in its entirety. The switches may need to buffer packets if an output port is busy.
Communication Methods
• When a program is split up into pieces, the pieces (processes) often need to communicate with one another.
• This communication can be done in one of two ways:
  - shared variables
  - explicit message passing
• Logical sharing of variables is possible even on a multicomputer.
• Message passing is easy to implement on a multiprocessor by simply copying from the sender to the receiver.
Message Passing Modes
• Messaging systems can be either persistent or transient.
  - Are messages retained when the senders and/or receivers stop executing?
• They can also be either synchronous or asynchronous.
  - Blocking vs. non-blocking.
Persistent Communication
• An example of persistent communication: letters back in the days of the Pony Express.
Persistence and Synchronicity in Communication
a) Persistent asynchronous communication
b) Persistent synchronous communication
c) Transient asynchronous communication
d) Receipt-based transient synchronous communication
e) Delivery-based transient synchronous communication at message delivery
f) Response-based transient synchronous communication
Remote Procedure Call (RPC)
• Developed by Birrell and Nelson (1984).
• In distributed systems the client code for copying a file is quite different from the normal centralized (uniprocessor) code.
• Let's make the client-server request-reply look like a normal procedure call and return.
• Notice that getchar in the centralized version turns into a read system call. The following is for Unix:
  - read looks like a normal procedure to its caller.
Remote Procedure Call (RPC)
• read is a user-mode routine.
• read manipulates registers and then does a trap to the kernel.
• After the trap, the kernel manipulates registers and then runs a C-language routine, and lots of work gets done (drivers, disks, etc.).
• After the I/O, the process gets unblocked, the kernel read manipulates registers and returns. The user-mode read manipulates registers and returns to the original caller.
• Let's do something similar with request-reply:
Remote Procedure Call (RPC)
• The user (client) does a subroutine call to getchar (or read).
  - The client knows nothing about messages.
• We link in a user-mode routine called the client stub (analogous to the user-mode read above; sketched below). The client stub:
  - takes the parameters to read and converts them to a message (marshals the arguments);
  - sends the message to the machine containing the server, directed to a server stub;
  - does a blocking receive (of the reply message).
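A hedged sketch of what a hand-written client stub for read might look like. The message layout, opcode, and transport calls (msg_send, msg_recv) are invented for illustration; real stubs are generated automatically and use a real transport.

    #include <string.h>

    /* Invented message layout and transport, stubbed so the sketch compiles. */
    struct rpc_msg { int opcode, fd, nbytes; char data[1024]; };
    static void msg_send(int server, struct rpc_msg *m) { (void)server; (void)m; }
    static void msg_recv(struct rpc_msg *m)             { (void)m; }

    #define OP_READ 3   /* assumed operation code */

    /* Same signature as the local call, so the client knows nothing
       about messages. */
    int read_stub(int server, int fd, char *buf, int nbytes) {
        struct rpc_msg m;
        m.opcode = OP_READ;                     /* marshal the parameters   */
        m.fd     = fd;
        m.nbytes = nbytes;
        msg_send(server, &m);                   /* send to the server stub  */
        msg_recv(&m);                           /* blocking receive of reply */
        memcpy(buf, m.data, (size_t)m.nbytes);  /* unmarshal the result     */
        return m.nbytes;
    }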
Remote Procedure Call (RPC)
• The server stub is linked with the server.
  - It receives the message from the client stub.
  - It unmarshals the arguments and calls the server (as a subroutine).
• The server procedure does what it does and returns (to the server stub).
  - The server knows nothing about messages.
• The server stub now marshals the result and converts it to a reply message sent to the client stub.
Remote Procedure Call (RPC)
• The client stub unblocks and receives the reply.
  - It unmarshals the result.
  - It returns to the client.
• The client believes (correctly) that the routine it called has returned just like a normal procedure does.
Passing Value Parameters (1)
• Steps involved in doing remote computation through RPC.
Remote Procedure Call (RPC)
• Heterogeneity: machines have different data formats. How can we handle these differences in RPC?
  - Have conversions between all possibilities, done during marshalling and unmarshalling.
  - Adopt a standard and convert to/from it (see the sketch below).
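One concrete instance of "adopt a standard" is the Internet's network byte order; the standard htonl/ntohl routines do the conversion. A minimal example:

    #include <arpa/inet.h>   /* htonl/ntohl */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t value = 5;            /* parameter to marshal */
        uint32_t wire  = htonl(value); /* sender converts to the standard
                                          (big-endian network order) */
        uint32_t back  = ntohl(wire);  /* receiver converts to its own order */
        printf("%u\n", back);          /* prints 5 on Pentium and SPARC alike */
        return 0;
    }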
Passing Value Parameters (2)
a) Original message on the Pentium
b) The message after receipt on the SPARC
c) The message after being inverted (the little numbers in boxes indicate the address of each byte)
Remote Procedure Call (RPC)
• Pointers: avoid them for RPC!
  - Can put the object pointed to into the message itself (assuming you know its length).
  - Convert call-by-reference to copy-in/copy-out. If we have in or out parameters (instead of in out), we can eliminate one of the copies.
  - Change the server to handle pointers in a special way (callback to the client stub).
Registering and name servers
• As we said before, we can use a name server.
• This permits the server to move, using the following process:
  - deregister from the name server
  - move
  - reregister
• This is sometimes called dynamic binding.
Registering and name servers
• The client stub calls the name server (binder) the first time to get a handle to use for the future.
  - There is a callback from the binder to the client stub if the server deregisters; alternatively, the attempt to use the handle fails, so that the client stub will go to the binder again.
How does a programmer create a program with RPC?
• uuidgen generates a unique identifier for the RPC.
• Include it in an IDL (interface definition language) file, and describe the interface for the RPC in the file as well.
• Write the client and server code.
• Client and server stubs are generated from the IDL file automatically.
• Link things together and run on the desired machines.
Writing a Client and a Server
• The steps in writing a client and a server in DCE RPC.
Processor Allocation
• Decide which processes should run on which processors.
  - Could also be called process allocation.
• We assume that any process can run on any processor.
Processor Allocation
• Often the only difference between different processors is:
  - CPU speed
  - CPU speed and amount of memory
• What if the processors are not homogeneous?
  - Assume that we have binaries for all the different architectures.
• What if not all machines are directly connected?
  - Send the process via intermediate machines.
Processor Allocation
• If we have only PowerPC binaries, restrict the process to PowerPC machines.
• If we need machines very close together for fast communication, restrict the processes to a group of close machines.
• Can you move a running process, or are processor allocations done only at process creation time?
  - Migratory allocation algorithms vs. non-migratory.
Processor Allocation
• What is the figure of merit, i.e., what do we want to optimize in order to find the best allocation of processes to processors?
  - Similar to CPU scheduling in centralized operating systems.
• Minimizing response time is one possibility.
Processor Allocation
• We are not assuming all machines are equally fast.
  - Consider two processes: P1 executes 100 million instructions, P2 executes 10 million instructions.
  - Both processes enter the system at time t=0.
  - Consider two machines: A executes 100 MIPS, B 10 MIPS.
  - If we run P1 on A and P2 on B, each takes 1 second, so the average response time is 1 sec.
  - If we run P1 on B and P2 on A, P1 takes 10 seconds and P2 takes 0.1 sec, so the average response time is 5.05 sec.
  - If we run P2 then P1, both on A, they finish at times 0.1 and 1.1, so the average response time is 0.6 seconds!
Processor Allocation
• Minimize response ratio.
  - The response ratio is the time to run on some machine divided by the time to run on a standardized (benchmark) machine, assuming the benchmark machine is unloaded.
  - This takes into account the fact that long jobs should take longer.
• Maximize CPU utilization.
• Maximize throughput.
  - Jobs per hour
  - Weighted jobs per hour
Processor Allocation
• If the weighting is CPU time, we get CPU utilization.
  - This is the way to justify CPU utilization (user centric).
• Design issues
  - Deterministic vs. heuristic: use deterministic for embedded applications, when all requirements are known a priori (e.g., patient monitoring in a hospital, nuclear reactor monitoring).
  - Centralized vs. distributed: we have a tradeoff of accuracy vs. fault tolerance and bottlenecks.
Processor Allocation
• Optimal vs. best effort
  - Optimal normally requires offline processing.
  - Similar requirements as for deterministic.
  - Usual tradeoff of system effort vs. result quality.
• Transfer policy
  - Does a process decide to shed jobs based only on its own load, or does it have (and use) knowledge of other loads?
  - Also called local vs. global.
  - Usual tradeoff of system effort (gathering data) vs. result quality.
Processor Allocation
• Location policy
  - Sender-initiated (e.g., uploading programs to a compute server) vs. receiver-initiated (e.g., downloading Java applets).
  - Look for help vs. look for work: both are done.
Processor Allocation
• Implementation issues
  - Determining the local load: normally use a weighted mean of recent loads, with more recent loads weighted higher (see the sketch below).
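A minimal sketch of such a weighted mean: an exponentially weighted moving average in C. The smoothing constant ALPHA is an assumption, not a value from the slides.

    /* New samples count for ALPHA of the estimate, so older samples decay
       geometrically: the more recent the load reading, the higher its weight. */
    #define ALPHA 0.5

    static double load_estimate = 0.0;

    void record_load_sample(double sample) {
        load_estimate = ALPHA * sample + (1.0 - ALPHA) * load_estimate;
    }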
Processor Allocation
• Example algorithms
• Min-cut deterministic algorithm
  - Define a graph with processes as nodes and IPC traffic as arcs.
  - Goal: cut the graph (i.e., some arcs) into pieces so that:
    - all nodes in one piece can be run on one processor (subject to memory constraints and processor completion times);
    - the values on the cut arcs are minimized.
Processor Allocation
• Minimize the max: minimize the maximum traffic for a process pair.
• Minimize the sum: minimize total traffic.
• Minimize the sum to/from a piece: don't overload a processor.
• Minimize the sum between pieces: minimize traffic for a processor pair.
• This tends to get hard as you get more realistic.
Processor Allocation
• Up-down centralized algorithm
  - A centralized table keeps "usage" data for each user, where the users are defined to be the workstation owners. Call this the score for the user.
  - The goal is to give each user a fair share.
  - When a user requests a remote job, a workstation is assigned if one is available.
  - For each process a user has running remotely, the user's score increases by a fixed amount each time interval.
Processor Allocation
  - When a user has an unsatisfied request pending (and none being satisfied), the score decreases (it can go negative).
  - If no requests are pending and none are being satisfied, the score is bumped towards zero.
  - When a processor becomes free, assign it to the requesting user with the lowest score (see the sketch below).
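A sketch of one clock-tick of the up-down bookkeeping in C. The constants, field names, and the exact drift-toward-zero rule are assumptions chosen for the example.

    struct user {
        double score;        /* accumulated "usage" for this workstation owner */
        int running_remote;  /* processes this user has running remotely */
        int pending;         /* unsatisfied requests */
    };

    #define UP_PER_PROCESS 1.0   /* fixed increase per remote process */
    #define DOWN_PENDING   1.0   /* decrease while a request goes unsatisfied */
    #define DECAY          0.5   /* drift toward zero when idle */

    void tick(struct user *u) {
        if (u->running_remote > 0)              /* score goes up per process  */
            u->score += UP_PER_PROCESS * u->running_remote;
        else if (u->pending > 0)                /* pending, none satisfied:   */
            u->score -= DOWN_PENDING;           /* down (may go negative)     */
        else if (u->score > DECAY)              /* idle: bump toward zero     */
            u->score -= DECAY;
        else if (u->score < -DECAY)
            u->score += DECAY;
        else
            u->score = 0.0;
    }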
Processor Allocation
• Hierarchical algorithm
  - Goal: assign multiple processors to a job.
  - Quick idea of the algorithm:
    - Processors are arranged in a tree.
    - Requests go up the tree until a subtree has enough resources.
    - The request is split and the parts go back down the tree.
  - Arrange the processors in a hierarchy (tree).
    - This is a logical tree, independent of how the machines are physically connected.
    - Each node keeps (imperfect) track of how many available processors are below it. If a processor can run more than one process, the bookkeeping must be more sophisticated: track how many processes can be allocated (without overload) in the subtree below.
Processor Allocation
• If a new request appears in the tree, the current node sees if it can be satisfied by the processors below it (plus itself).
  - If so, do it.
  - If not, pass the request up the tree.
  - Actually, since machines may be down or the data on availability may be out of date, you try to find more processors than requested.
• Once a request has gone high enough to be satisfied, the current node splits the request into pieces and sends each piece to the appropriate child.
• What if a node dies?
  - Promote one of its children, say C.
  - Now C's children are peers with the previous peers of C.
Processor Allocation
  - If this is considered too unbalanced, we can promote one of C's children to take C's place.
• How can we decide which child C to promote?
  - The peers of the dead node hold an election.
  - The children of the dead node hold an election.
  - The parent of the dead node decides.
• What if the root dies?
  - Must use the children, since the root has no peers or parent.
  - If we want to use peers, then we do not have a single root; i.e., the top level of the hierarchy is a collection of roots that communicate. This is a forest, not a tree.
• What if multiple requests are generated simultaneously?
Processor Allocation
• It gets hard fast, as information gets stale and potential race conditions and deadlocks are possible.
• Distributed heuristic algorithm
  - Goal: find a lightly loaded processor to migrate the job to.
  - Send a probe to a random processor.
  - If the remote load is low, ship the job.
  - If the remote load is high, try another random probe.
  - After k probes (k is a parameter of the implementation) all say the load is too high, give up and run the job locally (see the sketch below).
  - Modelled analytically and seen to work fairly well.
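A sketch of the probe loop in C. The probe transport, threshold, and processor count are invented for the example and stubbed so the sketch compiles.

    #include <stdlib.h>

    #define K         5      /* give up after K probes (implementation parameter) */
    #define THRESHOLD 3.0    /* assumed "low load" cutoff */
    #define NPROC     64     /* assumed number of processors */

    /* Stubs standing in for the real messaging layer. */
    static double probe_load(int p) { (void)p; return 10.0; }
    static void   ship_job(int p)   { (void)p; }
    static void   run_locally(void) { }

    void place_job(void) {
        for (int i = 0; i < K; i++) {
            int target = rand() % NPROC;           /* probe a random processor */
            if (probe_load(target) < THRESHOLD) {  /* remote load is low */
                ship_job(target);
                return;
            }                                      /* otherwise try another */
        }
        run_locally();   /* K probes all said "too high": give up */
    }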
Scheduling
• The general goal is to have processes that communicate frequently run simultaneously.
• If they don't, and we use busy waiting for messages, we will have a huge disaster.
• Even if we use context switching, we may have a small disaster, as only one message transfer can occur per scheduling time slot.
• Co-scheduling (a.k.a. gang scheduling): processes belonging to a job are scheduled together.
Scheduling
• Time slots are coordinated among the processors.
• Some slots are for gangs; other slots are for regular processes.
Taxonomy of Parallel Computers
• Although many researchers have tried to come up with a taxonomy of parallel computers, the only one which is widely used is that of Flynn (1972).
• This classification is based on two concepts:
  - instruction streams, corresponding to a program counter
  - data streams, consisting of a set of operands
• Crossing the two concepts gives Flynn's four categories: SISD, SIMD, MISD, and MIMD.
Memory Semantics
• Even though all multiprocessors present the CPUs with the image of a single shared address space, often there are many memory modules present, each holding some portion of the physical memory.
  - The CPUs and memories are often interconnected by a complex interconnection network.
  - Several CPUs may be attempting to read a memory word at the same time several other CPUs are attempting to write the same word.
  - Multiple copies of some blocks may be in caches.
Memory Semantics
• One view of memory semantics is as a contract between the software and the memory hardware.
  - The rules are called consistency models, and many different ones have been proposed and implemented.
  - For example, suppose that CPU 0 writes the value 1 to some memory word and a little later CPU 1 writes the value 2 to the same word. Now CPU 2 reads the word and gets the value 1. Is this an error?
Memory Semantics
• The simplest model is strict consistency.
  - With this model, any read to a location x always returns the value of the most recent write to x.
  - This model is great for programmers, but almost impossible to implement.
• The next best model is called sequential consistency.
  - The basic idea is that in the presence of multiple read and write requests, some interleaving of all the requests is chosen by the hardware (nondeterministically), but all CPUs see the same order (see the sketch below).
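A small worked example of what sequential consistency does and does not allow, with two assumed shared variables (a sketch, not material from the slides):

    /* x and y are shared and start at 0; each function is one CPU's program. */
    int x = 0, y = 0, r0, r1;

    void cpu0(void) { x = 1; r0 = y; }
    void cpu1(void) { y = 1; r1 = x; }

    /* A sequentially consistent memory executes the four accesses in some
       single interleaving that every CPU sees identically. So after both
       programs run, (r0, r1) can be (0,1), (1,0), or (1,1), but never (0,0),
       since no interleaving of the two programs produces it. (Real CPUs with
       weaker models can and do produce (0,0).) */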
Memory Semantics
• A looser consistency model, but one that is easier to implement on large multiprocessors, is processor consistency. It has two properties:
  - Writes by any CPU are seen by all CPUs in the order they were issued.
  - For every memory word, all CPUs see all writes to it in the same order.
• If CPU 1 issues writes with values 1A, 1B, and 1C to some memory location in that sequence, then all other processors see them in that order too.
• Every memory word has an unambiguous value after several CPUs write to it and stop.
Memory Semantics
• Weak consistency does not even guarantee that writes from a single CPU are seen in that order.
  - One CPU might see 1A before 1B and another CPU might see 1A after 1B.
  - However, to add some order, weakly consistent memories have synchronization variables (or a single synchronization variable).
    - When a synchronization is executed, all pending writes are finished, and no new ones are started until all the old ones are done and the synchronization itself is done.
    - In effect, a synchronization "flushes the pipeline" and brings the memory to a stable state with no operations pending.
    - Time is divided into epochs delimited by the synchronizations.
Memory Semantics
• Weak consistency has the problem that it is quite inefficient, because it must finish off all pending memory operations and hold all new ones until the current ones are done.
• Release consistency improves matters by adopting a model akin to critical sections.
  - The idea behind this model is that when a process exits a critical region, it is not necessary to force all writes to complete immediately. It is only necessary to make sure that they are done before any process enters the critical region again.
Memory Semantics
• In this model, the synchronization operation offered by weak consistency is split into two different operations (see the sketch below):
  - To read or write a shared data variable, a CPU must first do an acquire operation on the synchronization variable to get exclusive access to the shared data.
  - When it is done, the CPU does a release operation on the synchronization variable to indicate that it is finished.
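The acquire/release pattern maps naturally onto a lock. Below is a sketch using a POSIX mutex as the synchronization variable; this is an analogy at the software level, not the hardware mechanism itself.

    #include <pthread.h>

    static pthread_mutex_t sync_var = PTHREAD_MUTEX_INITIALIZER;
    static int shared_data;   /* the protected shared variable */

    void update(int v) {
        pthread_mutex_lock(&sync_var);    /* acquire: get exclusive access   */
        shared_data = v;                  /* writes here may be deferred...  */
        pthread_mutex_unlock(&sync_var);  /* release: they must complete
                                             before the next acquire succeeds */
    }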
UMA Bus-Based SMP Architectures
• The simplest multiprocessors are based on a single bus.
  - Two or more CPUs and one or more memory modules all use the same bus for communication.
  - If the bus is busy when a CPU wants to read memory, it must wait.
  - Adding more CPUs results in more waiting.
  - This can be alleviated by having a private cache for each CPU.
Snooping Caches
• With caches, a CPU may have stale data in its private cache.
• This problem is known as the cache coherence or cache consistency problem.
• The problem can be controlled by algorithms called cache coherence protocols.
  - In all solutions, the cache controller is specially designed to allow it to eavesdrop on the bus, monitoring all bus requests and taking action in certain cases.
  - These devices are called snooping caches.
MESI Cache Coherence Protocol
• When a protocol has the property that not all writes go directly through to memory (a bit is set instead and the cache line is eventually written to memory), we call it a write-back protocol.
• One popular write-back protocol is called the MESI protocol.
  - It is used by the Pentium II and other CPUs.
  - Each cache entry can be in one of four states:
    - Invalid: the cache entry does not contain valid data.
    - Shared: multiple caches may hold the line; memory is up to date.
MESI Cache Coherence Protocol
    - Exclusive: no other cache holds the line; memory is up to date.
    - Modified: the entry is valid; memory is invalid; no copies exist.
  - Initially all cache entries are invalid.
  - The first time memory is read, the cache line is marked E (exclusive).
  - If some other CPU reads the data, the first CPU sees this on the bus, announces that it holds the data as well, and both entries are marked S (shared).
  - If one of the CPUs writes the cache entry, it tells all other CPUs to invalidate their entries (I), and its entry is now in the M (modified) state.
MESI Cache Coherence Protocol
  - If some other CPU now wants to read the modified line, the cached copy is sent to memory, and all CPUs needing it read it from memory. The entries are marked S.
  - If we write to an uncached line and write-allocate is in use, we will load the line, write to it, and mark it M.
  - If write-allocate is not in use, the write goes directly to memory and the line is not cached anywhere (see the sketch below).
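The transitions above can be summarized as a small state machine. The sketch below, in C, gives one cache's next state for local and snooped bus events; the event names and the write-allocate assumption are invented for the example.

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
    typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

    mesi_t next_state(mesi_t s, event_t e, int other_caches_hold_line) {
        switch (e) {
        case LOCAL_READ:    /* first read: E if we are alone, S otherwise */
            return (s == INVALID)
                 ? (other_caches_hold_line ? SHARED : EXCLUSIVE) : s;
        case LOCAL_WRITE:   /* tell the others to invalidate; we go to M
                               (assumes write-allocate on a miss) */
            return MODIFIED;
        case BUS_READ:      /* another CPU reads: we announce our copy and
                               both ends mark the line S */
            return (s == INVALID) ? INVALID : SHARED;
        case BUS_WRITE:     /* another CPU writes: our copy is now stale */
            return INVALID;
        }
        return s;
    }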
UMA Multiprocessors Using Crossbar Switches
• Even with all possible optimizations, the use of a single bus limits the size of a UMA multiprocessor to about 16 or 32 CPUs.
  - To go beyond that, a different kind of interconnection network is needed.
  - The simplest circuit for connecting n CPUs to k memories is the crossbar switch.
    - Crossbar switches have long been used in telephone switches.
    - At each intersection is a crosspoint: a switch that can be opened or closed.
    - The crossbar is a nonblocking network.
Sun Enterprise 1000
• An example of a UMA multiprocessor based on a crossbar switch is the Sun Enterprise 1000.
  - This system consists of a single cabinet with up to 64 CPUs.
  - The crossbar switch is packaged on a circuit board with eight plug-in slots on each side.
  - Each slot can hold up to four UltraSPARC CPUs and 4 GB of RAM.
  - Data is moved between memory and the caches on a 16 x 16 crossbar switch.
  - There are four address buses used for snooping.
UMA Multiprocessors Using Multistage Switching Networks
• In order to go beyond the limits of the Sun Enterprise 1000, we need a better interconnection network.
• We can use 2 x 2 switches to build large multistage switching networks.
  - One example is the omega network.
  - The wiring pattern of the omega network is called the perfect shuffle.
  - The labels of the memories can be used for routing packets in the network (see the sketch below).
  - The omega network is a blocking network.
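Routing by memory label works stage by stage: each 2 x 2 switch examines one bit of the destination label. A sketch for an assumed 8-memory (three-stage) network:

    #include <stdio.h>

    #define STAGES 3   /* log2 of the number of memories; 8 memories assumed */

    /* At stage i the switch examines bit i of the label (most significant
       first): 0 means "use the upper output", 1 means "use the lower output". */
    void route(unsigned dest) {
        for (int stage = 0; stage < STAGES; stage++) {
            int bit = (dest >> (STAGES - 1 - stage)) & 1;
            printf("stage %d: %s output\n", stage, bit ? "lower" : "upper");
        }
    }

    int main(void) {
        route(6);   /* label 110: lower, lower, upper */
        return 0;
    }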
NUMA Multiprocessors
• To scale to more than 100 CPUs, we have to give up uniform memory access time.
• This leads to the idea of NUMA (NonUniform Memory Access) multiprocessors.
  - They share a single address space across all the CPUs, but unlike UMA machines, local access is faster than remote access.
  - All UMA programs run without change on NUMA machines, but the performance is worse.
  - When the access time to the remote machine is not hidden (by caching), the system is called NC-NUMA.
NUMA Multiprocessors
  - When coherent caches are present, the system is called CC-NUMA.
  - CC-NUMA is also sometimes known as hardware DSM, since it is basically the same as software distributed shared memory but implemented by the hardware using a small page size.
• One of the first NC-NUMA machines was the Carnegie Mellon Cm*.
  - This system was implemented with LSI-11 CPUs (the LSI-11 was a single-chip version of the DEC PDP-11).
  - A program running out of remote memory took ten times as long as one using local memory.
  - Note that there is no caching in this type of system, so there is no need for cache coherence protocols.
Cache Coherent NUMA Multiprocessors
• Not having a cache is a major handicap.
• One of the most popular approaches to building large CC-NUMA (Cache Coherent NUMA) multiprocessors currently is the directory-based multiprocessor.
  - Maintain a database telling where each cache line is and what its status is.
  - The database is kept in special-purpose hardware that responds in a fraction of a bus cycle.
DASH Multiprocessor
• The first directory-based CC-NUMA multiprocessor, DASH (Directory Architecture for SHared Memory), was built at Stanford University as a research project.
  - It has heavily influenced a number of commercial products, such as the SGI Origin 2000.
  - The prototype consists of 16 clusters, each one containing a bus, four MIPS R3000 CPUs, 16 MB of global memory, and some I/O equipment.
  - Each CPU snoops on its local bus, but not on any other buses, so global coherence needs a different mechanism.
DASH Multiprocessor
• Each cluster has a directory that keeps track of which clusters currently have copies of its lines.
• Each cluster in DASH is connected to an interface that allows the cluster to communicate with other clusters.
  - The interfaces are connected in a rectangular grid.
  - A cache line can be in one of three states: UNCACHED, SHARED, or MODIFIED.
  - The DASH protocols are based on ownership and invalidation.
DASH Multiprocessor
• At every instant each cache line has a unique owner.
  - For UNCACHED or SHARED lines, the line's home cluster is the owner.
  - For MODIFIED lines, the cluster holding the one and only copy is the owner.
• Requests for a cache line work their way out from the cluster to the global network.
• Maintaining memory consistency in DASH is fairly complex and slow.
  - A single memory access may require a substantial number of packets to be sent.
Sequent NUMA-Q Multiprocessor
• The DASH was an important project, but it was never a commercial system.
• As an example of a commercial CC-NUMA multiprocessor, consider the Sequent NUMA-Q 2000.
  - It uses an interesting and important cache coherence protocol called SCI (Scalable Coherent Interface).
  - The NUMA-Q is based on the standard quad board sold by Intel, containing four Pentium Pro CPU chips and up to 4 GB of RAM.
    - All these caches are kept coherent by using the MESI protocol.
Sequent NUMA-Q Multiprocessor
• Each quad board is extended with an IQ-Link board plugged into a slot designed for network controllers.
  - The IQ-Link primarily implements the SCI protocol.
  - It holds 32 MB of cache, a directory for the cache, a snooping interface to the local quad board bus, and a custom chip called the data pump that connects it with other IQ-Link boards.
    - The data pump pumps data from the input side to the output side, keeping data aimed at its node and passing other data along unmodified.
  - Together all the IQ-Link boards form a ring.
Distributed Shared Memory
• A collection of CPUs sharing a common paged virtual address space is called DSM (Distributed Shared Memory).
  - When a CPU accesses a page in its own local RAM, the read or write just happens without any further delay.
  - If the page is in a remote memory, a page fault is generated.
  - The runtime system or OS sends a message to the node holding the page to unmap it and send it over.
  - Read-only pages may be shared.
Distributed Shared Memory
• Pages, however, are an unnatural unit for sharing, so other approaches have been tried.
• Linda provides processes on multiple machines with a highly structured distributed shared memory.
  - The memory is accessed through a small set of primitive operations that can be added to existing languages such as C and FORTRAN.
  - The unifying concept behind Linda is that of an abstract tuple space.
  - Four operations are provided on tuples:
    - out puts a tuple into the tuple space.
    - in retrieves a tuple from the tuple space. Tuples are addressed by content, rather than by name.
    - read is like in, but it does not remove the tuple from the tuple space.
    - eval causes its parameters to be evaluated in parallel and the resulting tuple to be deposited in the tuple space (the four operations are sketched below).
• Various implementations of Linda exist on multicomputers.
  - Broadcasting and directories are used for distributing the tuples.
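A sketch of the four operations in a C-Linda flavor. The declarations are hypothetical stand-ins (real C-Linda is a language extension, and its "?formal" matching syntax is not legal C, so it appears only in comments):

    /* Hypothetical variadic stand-ins for the Linda primitives. */
    static void out(const char *tag, ...) { (void)tag; /* deposit a tuple  */ }
    static void in (const char *tag, ...) { (void)tag; /* withdraw a match */ }

    void example(void) {
        out("point", 12, 67);  /* put the tuple ("point", 12, 67) in the space */
        in("point", 12, 67);   /* retrieve it; tuples are matched by content   */
        /* Real C-Linda also allows formals: in("point", ?x, ?y) binds x and y.
           read(...) is like in(...) but leaves the tuple in the space.
           eval("sq", f(4)) evaluates f(4) in parallel and deposits the
           resulting tuple. */
    }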
Distributed Shared Memory
• Orca uses full-blown objects rather than tuples as the unit of sharing.
• Objects consist of internal state plus operations for changing the state.
• Each Orca method consists of a list of (guard, block-of-statements) pairs.
  - A guard is a Boolean expression that does not contain any side effects, or the empty guard, which is simply true.
  - When an operation is invoked, all of its guards are evaluated in an unspecified order.
Distributed Shared Memory
  - If all of them are false, the invoking process is delayed until one becomes true.
  - When a guard is found that evaluates to true, the block of statements following it is executed.
• Orca has a fork statement to create a new process on a user-specified processor.
• Operations on shared objects are atomic and sequentially consistent.
• Orca integrates shared data and synchronization in a way not present in page-based DSM systems.