Distributed Shared Memory over IP Networks - Revisited

"Which would you rather use to plow a field: two strong oxen, or 1024 chickens?"
– Seymour Cray

Dan Gibson
Chuck Tsen
CS/ECE 757 Spring 2005
Instructor: Mark Hill
DSM over IP

• Abstract a single, large, powerful machine from a group of many commodity machines
  – Workstations can be idle for long periods
  – Networking infrastructure already exists
  – Increase performance through parallel execution
  – Increase utilization of existing hardware
  – Fault tolerance, fault isolation
Overview - 1

• Source-level distributed shared memory system for x86-Linux
  – Code written with the DSM/IP API
  – Memory is identically named at the high level (C/C++)
• Only explicitly shared variables are shared
• User-level library checks coherence on each access
• Similar to programming in an SMP environment
Overview - 2

• Evaluated on a small (Ethernet) network of heterogeneous machines
  – 4 x86-based machines of varying quality
  – Small, simulated workloads
• Application-dependent performance
  – Communication latency must be tolerated
  – Relaxed consistency models improve performance for some workloads
• Heterogeneity vastly affects performance
  – All machines operate as slowly as the slowest machine in barrier-synchronized workloads
  – A Pentium 4 at 4.0 GHz is faster than a Pentium 3 at 600 MHz!
Motivation

• DSMoIP is similar to building an SMP on an unordered interconnect
  – Shared memory over message passing
  – Race conditions possible
  – Coherence concerns
• DSMoIP offers a tangible result
  – Runs on real HW
Challenges

• IP on Ethernet is slow!
  – Point-to-point peak bandwidth < 12 MB/s
  – Latency ~100 µs, RTT ~200 µs
• Common case must be fast
  – Software check of coherence on every access
  – Many coherence changes require at least one RTT
• IP semantics make correctness difficult
  – Packets can be:
    · Dropped
    · Delivered more than once
    · Delivered very late, etc.
Related Projects

• TreadMarks
  – DSM over UDP/IP, page-level granularity, source-level compatibility, release consistency
• Brazos (Rice)
  – DSM using multicast IP, MPI compatibility, scope consistency, source-level compatibility
• Plurix (Universität Ulm)
  – Distributed OS for shared memory, Java-source compatible
More Related Projects

• Quarks (Utah)
  – DSM over IP, user-level package interface, source-level compatibility, many coherence options
• SHRIMP (Princeton), StarDust (IRISA)
• And more…
  – DSM over IP is not a new idea
  – Most have demonstrated speedup for some workloads
Implementation Overview

• All participating machines have space reserved for each shared object at all times
  – Each shared object has accompanying coherence data
• Depending on coherence state, the object may or may not be accessible
  – If permissions are needed on a shared object, network communication is initiated
  – Identical semantics to a long-latency memory operation
Implementation Overview

• Shared-memory objects use altered syntax for declaration and access
• User-level DSMoIP library, consisting of:
  – Communication backbone
  – Coherence engine
• Programmer uses a small API for setup and synchronization
API – Accessing Shared Objects

SMP:
  a = x + b;
  x = y - 1;
  z = array[i] - 7;
  array[i]++;

DSMoIP:
  a = x.Read() + b;
  x.Write(y.Read() - 1);
  z = array.Read(i) - 7;
  array.Write(array.Read(i) + 1, i);

Variables x, y, z, and array are shared. All others are unshared.
API – Accessing Shared Objects

• Naming of shared objects is syntactically different, but semantically unchanged
• Use of .Read() and .Write() functions allows the DSM to interpose if necessary
• Use of C/C++ operator overloading could remove some of the changes in syntax (not implemented; illustrated below)
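To make the interposition concrete, here is a minimal C++ sketch of what such a wrapper could look like. Only the .Read()/.Write() names come from the slides; the class shape, the Perm states, and the Acquire() hook are illustrative assumptions, and the operator overloads show the unimplemented syntax-hiding idea.

  namespace dsm {

  enum class Perm { None, Read, Write };  // assumed permission states

  // Hypothetical shared-scalar wrapper; interposes on every access.
  template <typename T>
  class shared {
   public:
    T Read() {
      if (perm_ == Perm::None) Acquire(Perm::Read);    // may block on the network
      return value_;
    }
    void Write(const T& v) {
      if (perm_ != Perm::Write) Acquire(Perm::Write);  // may block on the network
      value_ = v;
    }
    // The overloads the slide mentions (but does not implement) would let
    // unmodified SMP code such as "a = x + b;" compile against shared objects:
    operator T() { return Read(); }
    shared& operator=(const T& v) { Write(v); return *this; }

   private:
    void Acquire(Perm p) { /* ask the coherence engine for permission */ perm_ = p; }
    T value_{};
    Perm perm_ = Perm::None;
  };

  }  // namespace dsm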
Communication Backbone (CB)

• A substrate on which to build coherence mechanisms
• Provides primitives for communication (sketched below):
  – Guaranteed delivery over an unreliable network
  – At-most-once delivery
  – Arbitrary message passing to a logical thread, or broadcast to all threads
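A sketch of how these guarantees might be layered on UDP, assuming a stop-and-wait scheme per sender: retransmission until acknowledgement gives guaranteed delivery, and a per-sender highest-sequence table filters retransmitted duplicates for at-most-once semantics. The slides do not specify a wire format; every name below is hypothetical.

  #include <cstdint>
  #include <map>

  // Hypothetical wire header for CB messages.
  struct Msg {
    uint32_t sender;  // logical thread id of the sender
    uint32_t seq;     // per-sender sequence number (stop-and-wait: one in flight)
    // ... payload follows
  };

  class Backbone {
   public:
    // Sender side: retransmit until the receiver acknowledges (guaranteed delivery).
    void Send(const Msg& m) {
      // while (!AckReceived(m.seq)) { UdpSend(m); WaitWithTimeout(); }
    }

    // Receiver side: acknowledge everything, but deliver each (sender, seq) once.
    bool ShouldDeliver(const Msg& m) {
      // UdpSendAck(m);  // ack duplicates too, or the sender retransmits forever
      uint32_t& last = last_seq_[m.sender];
      if (m.seq <= last) return false;  // duplicate of an already-delivered message
      last = m.seq;                     // stop-and-wait keeps seq monotone per sender
      return true;                      // at-most-once: safe to deliver now
    }

   private:
    std::map<uint32_t, uint32_t> last_seq_;  // highest seq delivered, per sender
  };

Broadcast can reuse the same machinery by running one such exchange per peer.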
Coherence Engine (CE)

• Uses primitives in the communication backbone to abstract shared memory from a message-passing system
• Swappable, based on the needs of the application (interface sketched below)
  – CE determines the consistency model
  – CE plays a major role in performance
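Swappability suggests the library dispatches through a common engine interface. The slides name the engines but not the API, so the following shape is a guess:

  // Hypothetical interface implemented by ENGINE_NC, ENGINE_OU, and ENGINE_MSI.
  class CoherenceEngine {
   public:
    virtual ~CoherenceEngine() = default;
    // Block (or not, for ENGINE_NC) until the caller may access the object.
    virtual void AcquireRead(int object_id) = 0;
    virtual void AcquireWrite(int object_id) = 0;
    // Service a permission request arriving from a remote machine over the CB.
    virtual void HandleRemoteRequest(int object_id, int requester) = 0;
  };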
CE/CB Interaction Example

[Diagram: Machine 1 holds object a in state None; Machine 2 holds a in state Read. A Read() on Machine 1 stalls while the CE requests read permission over the CB; Machine 2 grants it, Machine 1 moves a to Read, and the Read() commits.]
CE/CB Interaction Example

[Diagram: Both machines hold a in state Read. A Write() on Machine 1 stalls while the CE requests write permission; Machine 2 downgrades its copy to None and grants the upgrade, Machine 1 moves a to Write, and the Write() commits.]
CE/CB Interaction Example

[Diagram: Machine 1 holds a in state Write; Machine 2 holds None. A Read() on Machine 1 already has sufficient permission, so it commits immediately with no network traffic.]
CE/CB Interaction Example

[Diagram: Machine 2 (a in None) requests write permission from Machine 1 (a in Write). The request interrupts Machine 1 (INT), which downgrades a to None, grants the permission, and resumes (IRET); Machine 2 moves a to Write.]
CE/CB Interaction Example

[Diagram: Machine 1 holds a in None; Machine 2 holds Write. A Read() on Machine 1 requests read permission; Machine 2 downgrades its copy from Write to Read and grants it, so both machines end in Read and the Read() commits.]
Coherence Engines Implemented

• ENGINE_NC
  – Fast, simple engine that pays no attention to consistency
    · Reads can occur at any time
    · Writes are broadcast, can be observed by any processor in any order, and are not guaranteed to ever become visible at all processors
  – Communicates only on writes; non-blocking
Coherence Engines Implemented

• ENGINE_OU
  – Owned/Unowned
  – Sequentially consistent engine
    · Ensures SC by actually enforcing a total order of accesses
    · Each shared object is valid at only one processor for the duration of execution
    · Slow, cumbersome (nearly every access generates traffic)
  – Blocking, latent reads and writes
Coherence Engines Implemented

• ENGINE_MSI
  – Based on the MSI cache coherence protocol (transitions sketched below)
    · Shared objects can exist in multiple locations in the S state
    · Writing is exclusive to a single, M-state machine
  – Communicates as needed by a typical MSI protocol
  – Sequentially consistent
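For reference, the per-object transitions a typical MSI engine performs. The protocol is standard, but this encoding of it is mine, not from the slides:

  enum class MSI { Modified, Shared, Invalid };

  // Local side (invoked from .Read()/.Write()):
  //   Read  in M or S: hit, no traffic.   Read in I: fetch data, enter S.
  //   Write in M:      hit, no traffic.   Write in S or I: invalidate others, enter M.
  void OnLocalWrite(MSI& state) {
    if (state != MSI::Modified) {
      // broadcast invalidations via the CB and await acknowledgements ...
      state = MSI::Modified;
    }
  }

  // Remote side (invoked when another machine requests the object):
  void OnRemoteRead(MSI& state)  { if (state == MSI::Modified) state = MSI::Shared; }
  void OnRemoteWrite(MSI& state) { state = MSI::Invalid; }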
Evaluation

• DSM/IP evaluated using four heterogeneous x86/Linux machines
  – Microbenchmarks test correctness
  – Simulated workloads explore performance issues
• Early results indicate that DSM/IP overwhelms a 100 Mb/s LAN
  – 1-10k broadcast packets/second per machine
  – TCP/IP streams in the subnet are suppressed by UDP packets in the absence of Quality of Service measures
  – The process is self-throttling, but this makes DSM/IP unattractive to network administrators
Measurement

• Event counting
  – Regions of the DSM library augmented with event counters
• Timers
  – gettimeofday(), setitimer() for low-granularity timing
  – Built-in x86 cycle timers for high-granularity measurement
    · Validated with gettimeofday()
Timer Validation

• Assumed system-call-based timers are accurate
• x86 cycle timers validated for long executions against gettimeofday() (snippet below)
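That validation is easy to reproduce: time a long busy loop with both clocks and check that the implied cycle rate matches the machine's nominal clock. This sketch is x86-Linux-specific and assumes the TSC ticks at a constant rate:

  #include <cstdint>
  #include <cstdio>
  #include <sys/time.h>
  #include <x86intrin.h>   // __rdtsc()

  static double NowSeconds() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec + tv.tv_usec * 1e-6;
  }

  int main() {
    double t0 = NowSeconds();
    uint64_t c0 = __rdtsc();
    for (volatile long i = 0; i < 500000000L; ++i) {}   // a long execution
    uint64_t c1 = __rdtsc();
    double t1 = NowSeconds();
    // Cycles/second should match the nominal clock rate if the TSC is trustworthy.
    std::printf("TSC rate: %.0f Hz over %.2f s\n", (c1 - c0) / (t1 - t0), t1 - t0);
  }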
Microbenchmarks

• message_test
  – Each thread sends a specified number of messages to the next thread
  – Attempts to overload the network/backbone to produce an incorrect execution
• barrier_test
  – Each thread iterates over a specified number of barriers
  – Attempts to cause failures in the DSM_Barrier() mechanism
• lock_test
  – All threads attempt to acquire a lock and increment a global variable
  – Puts maximal pressure on the DSM_Lock primitive (sketched below)
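A sketch of what lock_test plausibly looks like, reusing the shared<T> wrapper sketched earlier. DSM_Lock and DSM_Barrier are named on the slides; DSM_Unlock, the argument shapes, and the shared counter are assumptions:

  // Names DSM_Lock/DSM_Barrier come from the slides; the rest is hypothetical.
  void DSM_Lock(int lock_id);
  void DSM_Unlock(int lock_id);      // assumed counterpart to DSM_Lock
  void DSM_Barrier(int barrier_id);

  extern dsm::shared<long> g_counter;   // the global variable under contention
  const int kLock = 0;

  void lock_test(long nIncs) {
    for (long i = 0; i < nIncs; ++i) {
      DSM_Lock(kLock);                          // every thread hammers one lock
      g_counter.Write(g_counter.Read() + 1);    // read-modify-write under the lock
      DSM_Unlock(kLock);
    }
    DSM_Barrier(0);
    // With N threads under ENGINE_OU (SC), g_counter must end at N * nIncs.
  }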
Microbenchmarks – Validation

• message_test
  – nMessages = 1M, N = [1,4], no lost messages (runtime ~2 minutes)
• barrier_test
  – nBarriers = 1M, N = [1,4], consistent synchronization (runtime ~4 minutes)
• lock_test
  – nIncs = 1M, ENGINE_OU (SC), N = [1,4], final value of variable = N million (correct)
Simulated Workloads

• Three simulated workloads, ordered along a "More Communication" axis:
  – sea, similar to OCEAN from SPLASH-2
  – genetic, a typical iterative genetic algorithm
  – xstate, an exhaustive solver for complex expressions
• Implementations are relatively simplified, for quick development
Simulated Workloads - sea

• Integer version of SPLASH-2's OCEAN
  – Simple iterative simulation, taking averages of surrounding points (sketched below)
  – Uses a large integer array
  – Coherence granularity depends on the number of threads
  – Uses DSM_Barrier() primitives for synchronization
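The kernel presumably resembles a neighbor-averaging stencil. This sketch is guessed from the description above; dsm::shared_array, the row partitioning, and the in-place update are assumptions, while the Read(i)/Write(value, i) signatures follow the API slide:

  #include <algorithm>

  // One step of a 4-point integer stencil over a shared, row-major grid.
  // Each thread owns a contiguous band of rows [row_lo, row_hi).
  // dsm::shared_array<int> is assumed: Read(i) / Write(value, i) as on the API slide.
  void sea_step(dsm::shared_array<int>& grid, int rows, int cols,
                int row_lo, int row_hi) {
    for (int r = std::max(row_lo, 1); r < std::min(row_hi, rows - 1); ++r) {
      for (int c = 1; c < cols - 1; ++c) {
        int avg = (grid.Read((r - 1) * cols + c) + grid.Read((r + 1) * cols + c) +
                   grid.Read(r * cols + c - 1) + grid.Read(r * cols + c + 1)) / 4;
        grid.Write(avg, r * cols + c);   // in-place update, for brevity
      }
    }
    DSM_Barrier(0);   // all threads finish this step before the next begins
  }

Band boundaries are where coherence traffic concentrates, which is presumably why the slide notes that coherence granularity depends on the number of threads.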
Simulated Workloads - genetic

• Multithreaded genetic algorithm
  – Uses a distributed genetic process to "breed" a specific integer from a random population
  – Iterates in two phases:
    · Breeding: generate new potential solutions from the most fit members of the current population
    · Survival of the fittest: remove the least fit solutions from the population
  – Uses DSM_Barrier() for synchronization (outlined below)
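The two-phase loop in outline; breed() and cull() are hypothetical stand-ins for the phases named above:

  void genetic_worker(int tid, int nGenerations) {
    for (int g = 0; g < nGenerations; ++g) {
      breed(tid);        // phase 1: new candidates from the fittest members
      DSM_Barrier(0);    // all threads see the enlarged population
      cull(tid);         // phase 2: drop the least fit solutions
      DSM_Barrier(1);    // population is stable before the next generation
    }
  }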
Simulated Workloads - xstate

• Integer solver for arbitrary expressions
  – Uses space exploration to find fixed points of a hard-coded expression
  – Employs a global list of solutions, protected with a DSM_Lock primitive
Methodology

• Each simulated workload run with each coherence engine
• Normalized parallel-phase runtime (versus the uniprocessor case) measured with high-resolution counters
  – Does not include startup overhead for the DSM system (~1.5 seconds)
  – Does not include other initialization overheads
Results - SEA

  Engine   ABT      %Wait
  OU       0.6 s    82%
  NC       0.6 s    10%
  MSI      0.05 s   15%
Results - GENETIC

  Engine   ABT      %Wait
  OU       0.02 s   20%
  NC       0.02 s   9%
  MSI      0.01 s   1%
Results - XSTATE

  Engine   ABT      %Wait
  OU       3.1 s    8%
  NC       2.7 s    2%
  MSI      2.6 s    9%
Observations

• ENGINE_NC
  – Speedup for all workloads (but no consistency)
  – Scalability concerns: broadcasts on every write!
• ENGINE_OU
  – Speedup for the workload with the lightest communication load
  – Scalability concerns: unicast on every read & write!
• ENGINE_MSI
  – Reduces network traffic & has speedup for some workloads
  – Impressive performance on SEA
• Effect of heterogeneity is significant!
  – Est. fastest machine spends 25-35% of its time waiting for other machines to arrive at barriers
Conclusions

• Performance for the DSM/IP implementation is marginal for small networks (N <= 4)
• Coherence engine significantly affects performance
• Naïve implementation of SC has far too much overhead
Conclusions

• Common case might not be fast enough
  – Best case: a single load or store becomes a function call + ~10 instrs.
  – Worst case: a single load or store becomes an RTT on the network (~1M-100M instruction times; see the arithmetic below)
  – Average case (depends on engine): a LD/ST becomes ~100-1K+ instruction times?
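A rough sanity check on these magnitudes (my arithmetic, not from the slides), using the ~200 µs RTT from the Challenges slide and a machine retiring on the order of 10^9 instructions per second:

  200 µs × 10^9 instructions/s = 2 × 10^5 instruction times for one clean round trip

Retransmission timeouts, queuing on a saturated 100 Mb/s LAN, and protocol processing plausibly account for the additional one to three orders of magnitude in the worst case quoted above.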
More Questions

• How might other engine implementations behave?
  – Is the communication substrate to blame for all performance problems?
• How would DSM/IP perform for N = 8, 16, 64, 128?
Questions
SEA Demo
Extra Slides
Initial Synchronization

[Diagram series: three backup slides stepping through the DSM/IP initial synchronization]