Distributed Shared Memory over IP Networks - Revisited

"Which would you rather use to plow a field: two strong oxen, or 1024 chickens?" – Seymour Cray

Dan Gibson, Chuck Tsen
CS/ECE 757, Spring 2005
Instructor: Mark Hill

Abstract
Building a single, large, powerful machine from a group of many commodity machines:
– Workstations can be idle for long periods
– Networking infrastructure already exists
– Increase performance through parallel execution
– Increase utilization of existing hardware
– Fault tolerance, fault isolation

Overview - 1
Source-level distributed shared memory system for x86-Linux:
– Code written with the DSM/IP API
– Memory is identically named at high level (C/C++)
– Only explicitly shared variables are shared
– User-level library checks coherence on each access
– Similar to programming in an SMP environment

Overview - 2
Evaluated on a small (Ethernet) network of heterogeneous machines:
– 4 x86-based machines of varying quality
– Small, simulated workloads
Application-dependent performance:
– Communication latency must be tolerated
– Relaxed consistency models improve performance for some workloads
Heterogeneity vastly affects performance:
– All machines operate as slowly as the slowest machine in barrier-synchronized workloads
– A 4.0 GHz Pentium 4 is faster than a 600 MHz Pentium 3!

Motivation
DSMoIP is similar to building an SMP with an unordered interconnect:
– Shared memory over message passing
– Race conditions possible
– Coherence concerns
DSMoIP offers a tangible result:
– Runs on real hardware

Challenges
IP on Ethernet is slow!
– Point-to-point peak bandwidth < 12 Mb/s
– Latency ~100 us, RTT ~200 us
The common case must be fast:
– Software check of coherence on every access
– Many coherence changes require at least one RTT
IP semantics make correctness difficult:
– Packets can be dropped, delivered more than once, delivered very late, etc.

Related Projects
TreadMarks:
– DSM over UDP/IP, page-level granularity, source-level compatibility, release consistency
Brazos (Rice):
– DSM using multicast IP, MPI compatibility, scope consistency, source-level compatibility
Plurix (Universität Ulm):
– Distributed OS for shared memory, Java-source compatible

More Related Projects
Quarks (Utah):
– DSM over IP, user-level package interface, source-level compatibility, many coherence options
SHRIMP (Princeton), StarDust (IRISA), and more…
– DSM over IP is not a new idea
– Most have demonstrated speedup for some workloads

Implementation Overview
All participating machines have space reserved for each shared object at all times:
– Each shared object has accompanying coherence data
Depending on coherence state, the object may or may not be accessible:
– If permissions are needed on a shared object, network communication is initiated
– Identical semantics to a long-latency memory operation

Implementation Overview (cont.)
Shared-memory objects use altered syntax for declaration and access.
User-level DSMoIP library, consisting of:
– Communication backbone
– Coherence engine
The programmer uses a small API for setup and synchronization.

API – Accessing Shared Objects

    SMP:                   DSMoIP:
    a = x + b;             a = x.Read() + b;
    x = y - 1;             x.Write(y.Read() - 1);
    z = array[i] - 7;      z = array.Read(i) - 7;
    array[i]++;            array.Write(array.Read(i) + 1, i);

Variables x, y, z, and array are shared; all others are unshared.

API – Accessing Shared Objects (cont.)
– Naming of shared objects is syntactically different, but semantically unchanged
– Use of .Read() and .Write() functions allows the DSM to interpose if necessary
– Use of C/C++ operator overloading could remove some of the changes in syntax (not implemented); see the sketch below
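The wrapper below is not part of the actual DSM/IP library; it is a minimal C++ sketch of what such an API could look like, with hypothetical names (Shared, acquire) and a stubbed-out coherence call. It also shows how the operator overloading mentioned above could hide the Read()/Write() syntax.

    // Hypothetical sketch of a DSM/IP-style shared-object wrapper.
    // acquire() stands in for the library's real coherence check,
    // which may block for one or more network round trips.
    enum class Perm { None, Read, Write };

    template <typename T>
    class Shared {
    public:
        T Read() {
            if (perm_ == Perm::None)        // no local permission:
                acquire(Perm::Read);        //   fetch over the network
            return val_;                    // otherwise read locally
        }
        void Write(const T& v) {
            if (perm_ != Perm::Write)       // need exclusive permission first
                acquire(Perm::Write);
            val_ = v;
        }
        // The operator overloading mentioned above (not implemented in
        // the actual system) would let SMP-style code compile unchanged:
        operator T() { return Read(); }
        Shared& operator=(const T& v) { Write(v); return *this; }
    private:
        void acquire(Perm p) { /* coherence engine + network call */ perm_ = p; }
        Perm perm_ = Perm::None;
        T val_{};
    };

    // With overloading:   Shared<int> x, y;
    //                     int a = x + 5;   // implicit x.Read()
    //                     x = y - 1;       // implicit y.Read(), x.Write()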
Communication Backbone (CB)
A substrate on which to build coherence mechanisms. Provides primitives for communication:
– Guaranteed delivery over an unreliable network
– At-most-once delivery
– Arbitrary message passing to a logical thread, or broadcast to all threads
(A sketch of such a delivery mechanism appears after the coherence-engine slides below.)

Coherence Engine (CE)
Uses primitives in the communication backbone to abstract shared memory from a message-passing system. Swappable, based on the needs of the application:
– CE determines the consistency model
– CE plays a major role in performance

CE/CB Interaction Examples (diagram slides)
– Read() with no permission: Machine 1 (a/None) requests read permission from Machine 2 (a/Read); Read() commits when permission arrives, and both machines hold a in Read.
– Write() without write permission: Machine 1 (a/Read) requests write permission; Machine 2 downgrades a from Read to None, and Machine 1's Write() commits with a/Write.
– Read() with permission already held: Machine 1 (a/Write) commits immediately, with no communication (Machine 2: a/None).
– Remote request (INT/IRET): Machine 2 requests write permission; Machine 1 is interrupted, downgrades a from Write to None, and grants the permission (Machine 2: a/None to Write).
– Read() after losing permission: Machine 1 (a/None) requests read permission from Machine 2 (a/Write); both end with a in Read.

Coherence Engines Implemented
ENGINE_NC:
– Fast, simple engine that pays no attention to consistency
– Reads can occur at any time
– Writes are broadcast, can be observed by any processor in any order, and are not guaranteed to ever become visible at all processors
– Communicates only on writes; non-blocking

ENGINE_OU (Owned/Unowned):
– Sequentially consistent engine; ensures SC by actually enforcing a total order of accesses
– Each shared object is valid at only one processor for the duration of execution
– Slow, cumbersome (nearly every access generates traffic)
– Blocking, latent reads and writes

ENGINE_MSI:
– Based on the MSI cache-coherence protocol
– Shared objects can exist in multiple locations in S state; writing is exclusive to a single, M-state machine
– Communicates as needed by the typical MSI protocol
– Sequentially consistent
(Sketches of the backbone and of this engine follow below.)
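The slides do not show how the backbone meets its guarantees; a standard approach is per-sender sequence numbers with acknowledgment and retransmission. The sketch below assumes a stop-and-wait sender, and every function it calls (send_udp, wait_ack, send_ack, deliver) is a hypothetical placeholder, not part of the real system.

    // Hypothetical sketch of guaranteed, at-most-once delivery over
    // unreliable UDP: the sender retransmits until ACKed; the receiver
    // ACKs everything but drops duplicate sequence numbers.
    #include <cstdint>
    #include <unordered_map>

    struct Packet { std::uint32_t src, seq; /* payload elided */ };

    void send_udp(const Packet&);                          // placeholder: raw UDP send
    bool wait_ack(std::uint32_t seq, int timeout_us);      // placeholder: await ACK
    void send_ack(std::uint32_t src, std::uint32_t seq);   // placeholder
    void deliver(const Packet&);                           // placeholder: upcall to CE

    // Sender: stop-and-wait retransmission defeats dropped packets.
    void reliable_send(const Packet& p) {
        do {
            send_udp(p);                        // may be lost in flight
        } while (!wait_ack(p.seq, 2000));       // timeout well above ~200 us RTT
    }

    // Receiver: per-sender expected sequence number defeats duplicates
    // and very late retransmissions (at-most-once delivery).
    std::unordered_map<std::uint32_t, std::uint32_t> next_seq;

    void on_receive(const Packet& p) {
        send_ack(p.src, p.seq);                 // ACK even duplicates,
        if (p.seq < next_seq[p.src]) return;    //   but deliver only once
        next_seq[p.src] = p.seq + 1;
        deliver(p);                             // hand to the coherence engine
    }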
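ENGINE_MSI's behavior can be pictured as a per-object state machine. The following is a hedged sketch of those transitions, not the project's actual code: the message functions are placeholders, and a real engine must also serialize racing requests from multiple machines.

    // Hypothetical sketch of ENGINE_MSI's per-object logic, following
    // the textbook MSI protocol named in the slides.
    enum class MSIState { Invalid, Shared, Modified };

    struct DsmObject {
        MSIState state = MSIState::Invalid;
        long     value = 0;
    };

    long fetch_copy(DsmObject&);             // placeholder: unicast to current owner
    void invalidate_others(DsmObject&);      // placeholder: other copies -> Invalid

    long ce_read(DsmObject& o) {
        if (o.state == MSIState::Invalid) {  // miss: costs >= 1 network RTT
            o.value = fetch_copy(o);
            o.state = MSIState::Shared;      // many machines may hold S copies
        }
        return o.value;                      // S or M: read locally, no traffic
    }

    void ce_write(DsmObject& o, long v) {
        if (o.state != MSIState::Modified) {
            invalidate_others(o);            // gain exclusivity before writing
            o.state = MSIState::Modified;    // single M-state writer
        }
        o.value = v;
    }

    // A remote machine wants to write: give up our copy, as in the
    // INT/IRET interaction diagram earlier.
    void on_remote_write_request(DsmObject& o) {
        o.state = MSIState::Invalid;
    }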
Evaluation
DSM/IP evaluated using four heterogeneous x86/Linux machines:
– Microbenchmarks test correctness
– Simulated workloads explore performance issues
Early results indicate that DSM/IP overwhelms a 100 Mb/s LAN:
– 1-10k broadcast packets per second per machine
– TCP/IP streams in the subnet are suppressed by UDP packets in the absence of Quality-of-Service measures
– The process is self-throttling, but this makes DSM/IP unattractive to network administrators

Measurement
Event counting:
– Regions of the DSM library augmented with event counters
Timers:
– gettimeofday(), setitimer() for low-granularity timing
– Built-in x86 cycle timers for high-granularity measurement, validated with gettimeofday()

Timer Validation
– Assumed system-call-based timers are accurate
– x86 cycle timers validated for long executions against gettimeofday()

Microbenchmarks
message_test:
– Each thread sends a specified number of messages to the next thread
– Attempts to overload the network/backbone to produce an incorrect execution
barrier_test:
– Each thread iterates over a specified number of barriers
– Attempts to cause failures in the DSM_Barrier() mechanism
lock_test:
– All threads attempt to acquire a lock and increment a global variable
– Puts maximal pressure on the DSM_Lock primitive

Microbenchmarks – Validation
– message_test: nMessages = 1M, N = [1,4], no lost messages (runtime ~2 minutes)
– barrier_test: nBarriers = 1M, N = [1,4], consistent synchronization (runtime ~4 minutes)
– lock_test: nIncs = 1M, ENGINE_OU (SC), N = [1,4], final value of variable = N million (correct)

Simulated Workloads
Three simulated workloads, spanning a range of communication intensity:
– sea, similar to OCEAN from SPLASH-2
– genetic, a typical iterative genetic algorithm
– xstate, an exhaustive solver for complex expressions
Implementations are relatively simplified, for quick development.

Simulated Workloads - sea
Integer version of SPLASH-2's OCEAN:
– Simple iterative simulation, taking averages of surrounding points
– Uses a large integer array
– Coherence granularity depends on the number of threads
– Uses DSM_Barrier() primitives for synchronization (see the sketch after the extra slides)

Simulated Workloads - genetic
Multithreaded genetic algorithm:
– Uses a distributed genetic process to "breed" a specific integer from a random population
– Iterates in two phases:
  – Breeding: generate new potential solutions from the most fit members of the current population
  – Survival of the fittest: remove the least fit solutions from the population
– Uses DSM_Barrier() for synchronization

Simulated Workloads - xstate
Integer solver for arbitrary expressions:
– Uses space exploration to find fixed points of a hard-coded expression
– Employs a global list of solutions, protected with a DSM_Lock primitive

Methodology
– Each simulated workload run with each coherence engine
– Normalized parallel-phase runtime (versus the uniprocessor case) measured with high-resolution counters
– Does not include startup overhead for the DSM system (~1.5 seconds)
– Does not include other initialization overheads

Results - SEA

    Engine   ABT      %Wait
    OU       0.6 s    82%
    NC       0.6 s    10%
    MSI      0.05 s   15%

Results - GENETIC

    Engine   ABT      %Wait
    OU       0.02 s   20%
    NC       0.02 s   9%
    MSI      0.01 s   1%

Results - XSTATE

    Engine   ABT      %Wait
    OU       3.1 s    8%
    NC       2.7 s    2%
    MSI      2.6 s    9%

Observations
ENGINE_NC:
– Speedup for all workloads (but no consistency)
– Scalability concerns: broadcasts on every write!
ENGINE_OU:
– Speedup for the workload with the lightest communication load
– Scalability concerns: unicast on every read & write!
ENGINE_MSI:
– Reduces network traffic & shows speedup for some workloads
– Impressive performance on SEA
The effect of heterogeneity is significant!
– Est. the fastest machine spends 25-35% of its time waiting for other machines to arrive at barriers

Conclusions
– Performance of the DSM/IP implementation is marginal for small networks (N <= 4)
– The coherence engine significantly affects performance
– A naïve implementation of SC has far too much overhead

Conclusions (cont.)
The common case might not be fast enough:
– Best case: a single load or store becomes a function call + ~10 instructions
– Worst case: a single load or store becomes an RTT on the network (~1M-100M instruction times)
– Average case (depends on engine): a LD/ST becomes ~100-1K+ instruction times?

More Questions
– How might other engine implementations behave?
– Is the communication substrate to blame for all performance problems?
– How would DSM/IP perform for N = 8, 16, 64, 128?

Questions

SEA Demo

Extra Slides
– Initial Synchronization (diagram slides)
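For concreteness, here is a sketch (not from the original deck) of how a barrier-synchronized workload shaped like sea might look in the API described earlier. The SharedArray stub and the DSM_Barrier signature are assumptions; only the names Read, Write, and DSM_Barrier come from the slides.

    // Hypothetical sea-style iteration: each thread averages its strip
    // of a shared N x N integer grid, with barriers between passes.
    struct SharedArray {
        long Read(int i);             // may trigger coherence traffic
        void Write(long v, int i);    // argument order as in the API slide
    };
    void DSM_Barrier();               // synchronization primitive from the slides

    const int N = 256, ITERS = 100;

    void sea_thread(int tid, int nthreads, SharedArray& grid, SharedArray& next) {
        int lo = tid * N / nthreads, hi = (tid + 1) * N / nthreads;
        if (lo < 1) lo = 1;                  // skip fixed boundary rows
        if (hi > N - 1) hi = N - 1;
        for (int it = 0; it < ITERS; ++it) {
            for (int i = lo; i < hi; ++i)    // this thread's strip of rows
                for (int j = 1; j < N - 1; ++j) {
                    int idx = i * N + j;     // average point + 4 neighbors,
                                             // reading through the DSM API
                    long sum = grid.Read(idx)
                             + grid.Read(idx - 1) + grid.Read(idx + 1)
                             + grid.Read(idx - N) + grid.Read(idx + N);
                    next.Write(sum / 5, idx);
                }
            DSM_Barrier();                   // all threads finish the pass
            // (grid/next swap elided)
        }
    }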