Lightweight Logging For Lazy Release Consistent DSM Costa, et. al. CS 717 - 11/01/01 Definition of a SDSM In a software distributed shared memory (SDSM), each node runs its own operating system, and has a local physical memory Each node runs a local process. The these processes form the parallel application The union of the local memory of each of the local processes form the global memory of the application The global memory appears as one virtual address space – a process accesses all memory locations in the same manner, using standard load and stores Basic Implementation of a SDSM The virtual add. space is divided among different memory pages, which are distributed among the local memory of the different processes Each node has a copy of the page to node assignments We use the hardware’s virtual memory support to provide the appearance of SM (page table and faults) The SDSM system is implemented as fault handler routines Such a system is also called a SVM system Illustration N1 P1 N2 P3 P5 N3 P4 P5 P2 The same virtual page might appear in multiple physical pages, on multiple nodes SDSM Operation If N2 attempts to write x on P2 P2 is marked as invalid on N2’s page table, so access will cause a fault Fault handler checks page-node map, and then requests that N3 send it P2 N3 sends page, and notifies all nodes of the change N3 sets page access to “invalid” N2 sets page access to “read/write” Handler returns Multiple N’s can have the same P in their physical add. space, if P is “read-only” for all of them, but only one N can have a copy of P if it is “read/write” Page Size Granularity Memory access is managed at the granularity of an OS page Easy to implement Can be very inefficient If N exhibits poor spatial locality, a lot of unnecessary data transfer If both x and y are on the same page, P, and N1 is repeatedly writing to x while N2 is writing to y, P will be continually sent back and forth between N1 and N2 – false sharing Sequential Consistency Defined by Lamport as: A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program Is this SDSM Sequentially Consistent? Assume a and b are on P1 and P2 respectively N1 a=1 b=1 N2 print b print a If N2 does not invalidate its copy of P1, but does invalidate P2, the output will be <1,0> which is invalid under SC Ensuring Sequential Consistency For the system to be SC, N1 must ensure that N2 invalidated its copy of a page before it can write to that page Before a write, N1 must tell N2 to invalidate its copy of the page, and then wait for N2 to acknowledge that it has done so Of course, if we know that N2’s copy is already invalidated, we don’t need to do this N2 could not have re-obtained access with out N1’s copy being invalidated Ping-Pong Effect SC, combined with the large sharing granularity (OS page), can lead to the ping-pong effect Substantial, expensive, communication cost due to falsesharing A Problem With SC N1 is continually writing to x while N2 is cont. reading from y, both on the same P N2 has P in “read-only”, N1 has P in “r-o” N1 attempts to write to x, faults, tells N2 to go to “invalid” N1 waits for N2 to go to “invalid”, N1 goes to “r/w”, N1 does write N2 tries to read, faults, tells N1 to go to “r-o”, and send current copy of P, N2 goes to “r-o” N2 gets P, does read Ping-Pong Effect N1 R/W R/O inval ack R/O R/W … req reply N2 R/O inval R/O inval Relaxing the Consistency Model The memory consistency model specifies constraints on the order in which memory operations appear to to execute wrt. each other Can we relax the consistency model to improve performance? Release Consistency Certain operations are specified as ‘acquire’ and ‘release’ operations Code below an acquire can never be moved above the acquire Code above the release can never be moved below the release As long as there are no race-conditions, behavior of program same under RC or SC RC Illustration I acq II rel III acq I II rel III Lazy Release Consistency (LRC) In order for a system to be RC, it must ensure that all memory writes above a release become visible before that release is visible i.e., before issuing a release, it must invalidate all other copies of the same page Can we relax this further? LRC LRC is a further relaxation: Lets not invalidate pages until absolutely necessary N1: I, acquire, II, release N2: III, acquire, IV, release Only when N2 is about to issue an acquire, does N1 ensure that all changes it make before its release are visible N1 invalidates N2’s copy of the pages before N2 does its acquire Illustration N1 RC A I R inval A ack II R inval ack N2 N1 A… LRC A I R A II R inval N2 ack A… TreadMarks A high performance SDSM Implements LRC Keller, Cox, Zwaenepoel 1994 Intervals The execution of each process is divided into intervals, beginning at a synchronization access (acq. or release) These form a partial order: intervals on the same process are totally ordered intval. x precedes y if the release that ended x corresponds to the acquire that began y When a process begins a new interval, it creates a new IntervalRecord Vector Clocks Each process also keeps a current vector clock, VC, <…,L,M,N,O,…> If VCN is process N’s vector clock, VCN(M) is the most recent interval of process M that process N knows about VCN(N) is therefore the current interval of process N Interval Records An IntervalRecord is a structure containing: The pid of the process that created this record The vector-clock timestamp of when this interval was created A list of WriteNotices Write Notices A WriteNotice is a record containing: The page number of the page written to A diff showing the changes made to this page A pointer to the corresponding IntervalRecord Acquiring A Lock When N1 wants to acquire a lock, it sends its current vector clock to the Lock Manager The Lock Manager forwards this message to the last process that acquired this lock (assume N2) N2 replies (to N1) with all the IntervalRecords that have a timestamp between the VC sent by N1 and the VC of the IR that ended with the most recent release of that lock N1 received IntervalRecords from N2 N1 stores these IntervalRecord in volatile memory N1 invalidates all pages for which it received a WriteNotice (in the IRs) On a page fault, N1 obtains a copy of the page, and then applies all the diffs for that page in interval order If N1 is about to write to that page, it makes a copy of it (so that it can compute the diff of its changes) Example N1 <0,0,0> acq <1,0,0> write P rel Request <0,0,0> IR/DIFF <1,0,0> <1,1,0> N2 <0,0,0> acq Apply diff write P rel IR/DIFF <1,0,0> Request <0,0,0> N3 IR/DIFF <1,1,0> <1,1,1> <0,0,0> acq Apply diff write P rel Example (cont.) If N1 were to issue another acquire, it would only have to apply the diffs in the IR of time <1,1,1> and <1,1,0>, because its current VC was <1,0,0> Improvement: Garbage Collection Each N is keeping a log of all shared memory writes that it made, along with all writes that it needed to know about At a barrier, Ns can synchronize, so that each N has the most up to date copy of its pages, and the logs could then be discarded Improvement: Sending Diffs You might notice that if N1 writes to pages P1, P2, P3 during an interval, and N2 acquires the lock next, N1 needs to send the three diffs to N2, regardless if N2 will actually need those pages In truth, N1 does not send the diffs, it sends a pointer to its local memory, where the diff is located If N2 needs to apply that diff, it will request that diff from N1, using that pointer Adding Fault Tolerance Assume we would like the ability to survive single node failure (only one fails at a time, but multiple failures may occur during the running of the application) What information would we need to log, and where? Remember, we already log IntervalRecords and WriteNotices as part of the usual operation of TreadMarks Ni fails and then restarts If it acquires a lock, it must see the same version of the page that it saw during the original run Therefore Nj must send it the same WriteNotices (diffs) as before, even though Nj’s current version of the page might be very different, and Nj’s vector clock has also changed Example <0,0,0> ACQ/WRI/REL <1,0,0> N1 IR <1,0,0> N2 <0,0,0> ACQ/WRI/REL<1,1,0> IR <1,0,0> IR <1,1,0> IR <1,1,1> <0,0,0> ACQ/WRI/REL <1,1,1> N3 <1,1,0> ACQ/WRI/REL<1,2,1> X If N3 is restarted, when it reissues the acquire, it must receive the same set of WriteNotices as it had during its original run. If we run the algorithm un-modified, N3 would receive <1,0,0><1,1,0><1,1,1><1,2,1>, and the application would be incorrect Send Log Therefore, N2 needs some way of logging which IntervalRecords it had sent to N3 It does this by storing the VC of N3 when it issued the acquire (this was sent to it with the request) and the VC of N2 when it received the request This is stored in N2’s send-log From these two VC’s, N2 can determine which IntervalRecords it had sent to N1 Example Send-Log: {N2, <0,0,0><1,0,0>} N1 IR <1,0,0> Send-Log: {N3, <0,0,0><1,1,0>} N2 <0,0,0> ACQ WRI REL<1,1,0> IR <1,0,0> IR <1,1,0> X N3 <0,0,0> ACQ WRI REL <1,1,1> Restart When N3 restarts, it will request the acquire at time <0,0,0> N2 will look in its send log, and see that when it received an acquire request from N3 at <0,0,0>, it was at time <1,1,0>, so it will send the IR of all the intervening intervals Therefore, N3 receives the same diffs as it did before Logging, cont. Is the send-log sufficient to provide the level of fault-tolerance that we wanted? Imagine N2 had failed, and then restarted, could we then survive the failure of N3? Logging No, we could not survive the subsequent failure of N3, because N2 no longer had its send-log We also need a way to recreate N2’s send log Receive-Log On every acquire, N, logs its vector time, before the acquire and its new vector time after seeing the IntervalRecords sent to it by M in N’s receive-log If M fails, M’s send-log can be recreated from N’s receive-log Example Send-Log: {N2, <0,0,0><1,0,0>} N1 IR <1,0,0> N2 Send-Log: {N3, <0,0,0><1,1,0>} Recv-Log: {N1, <0,0,0><1,0,0>} IR <1,0,0> IR <1,1,0> N3 Recv-Log: {N2, <0,0,0><1,1,0>} X If N2 were to fail, it would get restarted N1’s send-log will ensure that N2 sees the same page as it did originally When, in the future, N3 sees a VC time later than that in its receive log (wrt. N2) it will forward the information in its receive-log to N2 N2 will recreate its send-log We could now survive future failures Checkpointing When we arrive at garbage collection point, we could checkpoint all processes Minimize rollback Survive concurrent failures Empty logs Results Results 2 Appl. Log Size (MB) Water 3.10 Avg. Ckpt. Size (MB) 3.05 SOR .33 7.84 TSP .05 2.49 Results 3