LOTS: A Software DSM Supporting Large Object Space
Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau
Department of Computer Science, The University of Hong Kong
September, 2004

Presentation Outline
• Why LOTS? (Objectives)
• DSM Background and Related Work
• Design of LOTS
• Performance Testing and Results
• Conclusion and Future Work

The Problem in Current DSM
• Lack of shared object (memory) space
  – Another major problem apart from performance
  – Fixed address mapping in virtual memory
  – Shared object space size < process space
    • TreadMarks: ~ min RAM size among all machines
    • JIAJIA V1.0: 128 MB
  – 32-bit machines: max 4 GB shared space
  – Unscalable: fixed regardless of # machines
  – Large problems (with > 4GB shared memory need) can't be run directly
The programmer needs to change the application code to reduce the memory utilization.

Objectives of LOTS
• Using 64-bit machines is not a total solution!
• 32-bit machines are dominating the market (poor man's clusters :<)
• Problems keep increasing memory consumption
[Figure: a rich man's cluster and a poor man's cluster]

Objectives of LOTS
• Hence we introduce LOTS:
  – Large Shared Object Space > 4GB
  – Dynamic run-time memory mapping technique
  – Local disk as the backing store for temporarily unused objects
  – Shared space size now limited by disk space
  – Lazy disk read/write for reasonable performance

Some DSM Background
• Memory Consistency Issues:
  – Memory Consistency Models
    • Sequential Consistency (IVY) performs poorly
    • Relaxed models reduce redundant data traffic
    • Lazy Release Consistency (TreadMarks)
    • Scope Consistency (JIAJIA)
[Figure: two processes synchronizing on lock L]
    P: Y=5; Acq(L); X=3; Rel(L)
    Q: Acq(L); X=? Y=?; Rel(L)
  In Scope, Q sees X to be 3, but Y may not be 5

Some DSM Background
• More Memory Consistency Issues:
  – Coherence Protocols
    • Home-based (JIAJIA) vs Homeless (TreadMarks) vs Migrating-Home (JUMP)
    • Write-update vs Write-invalidate
    • Adaptive Protocol (DOSA, ADSM)
  – Coherence Protocol has to match with memory model for higher efficiency
• No DSM deals with Large Object Space!

Related Work
• Large object space support:
  – Pointer swizzling
    • Artificial, invalid addresses are translated to machine-addressable form during access
    • Used in persistent store (QuickStore, Thor-1)
[Figure: process space under pointer swizzling]
  Compiler-generated addresses cause page fault at runtime and are translated to valid ones.
  Unused objects free their virtual addresses and are swapped out (i.e., swizzled out) to hard disk.

Design of LOTS
• Dynamic Memory Mapping (DMM)
  – Uses C++ Operator Overloading as the interface
    • Overloads [], +, -, *, /, ++, --, >=, <=, !=, etc.
  – Purely runtime
[Figure: the program statement A[5]=7; accesses Array A through A->ctrl in the Heap Area; object data lives in the DMM Area of the Virtual Memory Area, on the Local Hard Disk, or as Remote Memory Data across the Network]
  (1) Access invokes mapping mechanism
  (2) Bring in object from disk / network
  (3) Internal structure points to object data for access

LOTS Shared Objects Creation
• Through the LOTS memory allocator
  – Exists as a C++ class
• Memory allocation through alloc() function
• Put data into specific part of process space
• Object control info in heap area
[Figure: Process Space layout, top to bottom]
  0xffffffff
    Kernel Reserved
  0xc0000000
    C++ Stack
  0xb0000000
    DSM Control Area
  0x90000000
    Twin Area
  0x70000000
    DMM Area
  0x50000000
    Heap Area (Array A, A->Ctrl)
    DAT Segment
    TXT Segment
  0x00000000

LOTS Memory Allocator
• Bypass Doug Lea's Memory Allocator used in original C/C++
• Uses mmap() to get physical memory, and map the shared object data to the process space.
  – Free queues and used queues
  – Small & large objects allocated separately
[Figure: free queue and used queue, each indexed by block size (8, 16, 24, 32, 40, 48, …, 1M, 2M, …, ½G), with used and free blocks placed in the DMM Area between 0x50000000 and 0x70000000; the Heap Area lies below it and the Twin and Control Area above it]

Shared Memory Behavior
• Goal: Reduce redundant data traffic
  – Memory Consistency Model: Scope
  – Memory Coherence Protocol: Mixed
    • Lock-synchronized objects: Homeless + write-update
    • Barrier-synchronized objects: Migrating-Home + write-invalidate
• Principle: To eliminate as much all-to-all data communication as possible

Mixed Coherence Protocol
• An Example:
[Figure: timeline of processes P0 to P3 before and after a barrier, showing updates movement and home token movement for objects X and Y; accesses to x1 and y1 are guarded by lock L1 and accesses to x2 by lock L2; after the barrier the home of X and Y moves to a new home and the other processes receive Inv X, Y]
When the processes arrive at the barrier, the process that holds the token of the object will become the new home of that object, and other processes will send the updates to the home.
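
The barrier rule above can be sketched in a few lines of C++. This is a minimal illustration, not LOTS code: the ObjectState struct, its fields, and barrier_migrate() are hypothetical names, and the choice to invalidate every non-home copy whenever any other process wrote the object is an assumption about how write-invalidate is applied at the barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-object bookkeeping (illustrative names, not LOTS internals).
struct ObjectState {
    int home;                 // process currently acting as home of the object
    int token_holder;         // process holding the home token at the barrier
    std::vector<bool> valid;  // valid[p]: process p holds a valid copy
};

// Applies the barrier rule to one object.
// wrote[p] says whether process p modified the object since the last barrier.
// Returns the number of update messages sent to the new home.
int barrier_migrate(ObjectState& obj, const std::vector<bool>& wrote) {
    assert(wrote.size() == obj.valid.size());
    obj.home = obj.token_holder;  // the token holder becomes the new home

    int updates = 0;
    for (std::size_t p = 0; p < wrote.size(); ++p) {
        if (static_cast<int>(p) != obj.home && wrote[p]) {
            ++updates;            // a non-home writer sends its updates to the home
        }
    }
    // Assumption: once any non-home process has written, every non-home copy
    // is stale and is invalidated (write-invalidate); the home copy is valid.
    for (std::size_t p = 0; p < obj.valid.size(); ++p) {
        if (static_cast<int>(p) == obj.home) {
            obj.valid[p] = true;
        } else if (updates > 0) {
            obj.valid[p] = false;
        }
    }
    return updates;
}
```

With four processes, the token at process 2, and writes by processes 0 and 2, the call makes process 2 the home, counts one update message (from process 0), and leaves only the home copy valid. Updates flow to a single home instead of to every process, which is the all-to-all traffic the protocol aims to avoid.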

Making LOTS More Efficient
• Eliminating Diff Accumulation Problem
  – Lock and timestamp info in DSM control area
  – Calculate diff on request, no redundancy
[Figure: updates to values X1 to X8 over time]
  T=1 (len=6): X1 X2 X3 X6 X7 X8
  T=2 (len=4): X1 X3 X5 X8
  T=3 (len=4): X2 X5 X7 X8
  T=4 (len=3): X3 X5 X7

  Value:              X1 X2 X3 X4 X5 X6 X7 X8
  Last Updated Time:   2  3  4  0  4  1  4  3

  Traditional Method: all updates above need to be sent (17 units data + 8 units of control)
  LOTS Method: only send 7 units data + 8 units of control data
    Time 1, Length 1: X6
    Time 2, Length 1: X1
    Time 3, Length 2: X2 X8
    Time 4, Length 3: X3 X5 X7

Other Components in LOTS
• C++ runtime library in Linux
• Minimal set of functions as interface
  – Retains as much C++ syntax as possible to improve programmability
• Synchronization: Locks and Barriers
  – Barriers: With/Without memory effect
• Communication: Sockets with UDP/IP
• SIGIO handler for incoming messages

Performance Testing
• Two Kinds of Testing
  1 Without invoking large object space support
    • Compare performance with another DSM (JIAJIA V1.0, as both have similar communication protocols)
    • Report no. of messages and bytes sent
    • Calculate large object space support overhead
    • 16 Pentium IV 2GHz machines with 100Mbps Fast Ethernet connection, 128MB mem, Linux Fedora
  2 With large object space support
    • Use an application with large memory demand
    • Run on different platforms for analysis
    • Expect disk read/write overhead to dominate

Test 1: Timing Performance
[Figure: four panels of execution time in seconds (y-axis) against problem size (x-axis); three panels are annotated LOTS<JiaJia and one LOTS>JiaJia. Legend: LOTS: LOTS enabled; LOTS-x: LOTS disabled]

Performance Results Summary
• LOTS beats JIAJIA V1.0 in most applications
  – Mixed protocol + "Diff accumulation elimination" reduce data traffic
• Large object space support and access checking incur a considerable overhead
  – about 5-15% of total execution time (application dependent)

Test 2: Large Object Space
• Using 4-node PC and server clusters

  CPU (MHz)                 OS        RAM (MB)  # Shared Objs (X)  Per Obj Size (MB)  Total Shared Obj Size (GB)  Exec Time (sec)
  P4-2GHz                   Fedora 1  512        33                128                  4.125                       142
  P4-2GHz                   Fedora 1  512       132                128                 16.50                        373
  Xeon P3 500 MHz x4 (SMP)  Fedora 1  1024      132                128                 16.50                        507
  Xeon P3 500 MHz x4 (SMP)  Fedora 1  1024      132                511                 65.87                       2112
  Xeon P3 500 MHz x4 (SMP)  Fedora 1  1024      201                511                100.30                       3227
  Xeon P3 500 MHz x4 (SMP)  Fedora 1  1024      236                511                117.77                       3839

• Test program: simple matrix operations
• With 120GB (SCSI) hard disk in each machine, able to claim 117.77GB Shared Object Space
• Disk read and write time is closely related to the OS version.

Conclusions
• LOTS succeeds in:
  – Providing a shared object space larger than the local process space during runtime
  – Performing reasonably well by reducing data traffic through Scope Consistency, mixed coherence protocol and "diff accumulation elimination" technique
  – Offering a programming interface similar to C++

Future Work
• A Number of Optimizations:
  – Further increase shared object space
    • "the minimum hard disk space x number of processes / 2".
    • Recent progress: 64GB (4GB x 16) of shared objects can be allocated in 16 machines, each having a 9GB hard disk.
  – Reduce disk overhead
  – Reduce operator-overloading overhead (access check)
  – Load-aware migrating-home protocol: coherence protocol adapting to network traffic and processor loading (e.g., avoid too many "homes" in a single machine)

Questions?

Test 1: No. of Messages Sent
The percentage is obtained by dividing the number of messages sent in LOTS by that in JIAJIA for the same application.
[Figure: bar chart; y-axis: % (0 to 90); x-axis: No. of procs (p) = 2, 4, 8, 16; series: FL (n=2048), LU (n=1024), ME (n=8192), RX (n=8192), RB (n=2048)]
Due to the mixed protocol, LOTS sends fewer messages through the network than JIAJIA.

Test 1: No. of Bytes Sent
The percentage is obtained by dividing the number of bytes sent in LOTS by that in JIAJIA for the same application.
[Figure: bar chart; y-axis: % (0 to 100); x-axis: No. of procs (p) = 2, 4, 8, 16; series: FL (n=2048), LU (n=1024), ME (n=8192), RX (n=8192), RB (n=2048)]

Test 2: Large Object Space
• Allocate shared objects with total size > 4GB, and another process accesses each of them once (array addition with p=4)

int main(int argc, char **argv)
{
    int i, j, pp, local[4];

    // 2D int array
    Pointer <Pointer <int> > a;

    lots_init();                // init LOTS

    // shared memory allocation
    a.alloc(X);
    for (i = 0; i < X; i++)
        a[i].alloc(size);

    nm_barrier();               // barrier

    // array addition
    for (j = 0; j < linec; j++) {
        pp = (dsmid + j) % linec;
        for (i = pp; i < X; i += 4) {
            acq(i);
            a[i][0] += rand();
            rel(i);
        }
    }

    nm_barrier();
    return 0;
}
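
The test program reaches shared data only through Pointer objects, so every a[i][0] passes through an overloaded operator[] that can check whether the object is currently mapped. The sketch below shows one way such an access check might look. It is a single-process toy under stated assumptions: only the names Pointer, alloc() and operator[] come from the program above, while Ctrl, backing, mapped, swap_out() and loads() are hypothetical, and two in-memory vectors stand in for the local disk and the DMM area.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for a LOTS-style shared pointer (not the real LOTS class).
template <typename T>
class Pointer {
    // Hypothetical control block, playing the role of the A->ctrl structure.
    struct Ctrl {
        std::vector<T> backing;  // simulates the object's copy on local disk
        std::vector<T> mapped;   // simulates the data mapped into the DMM area
        bool is_mapped = false;  // is the object currently in virtual memory?
        int loads = 0;           // how many times the object was brought in
    };
    std::shared_ptr<Ctrl> ctrl_;

public:
    // Allocate an object of n elements; it starts out unmapped.
    void alloc(std::size_t n) {
        ctrl_ = std::make_shared<Ctrl>();
        ctrl_->backing.assign(n, T());
    }

    // Access check: map the object in on first use, then return the element.
    // The returned reference is only valid until the next swap_out().
    T& operator[](std::size_t i) {
        assert(ctrl_ && i < ctrl_->backing.size());
        if (!ctrl_->is_mapped) {
            ctrl_->mapped = ctrl_->backing;  // "bring in object from disk"
            ctrl_->is_mapped = true;
            ++ctrl_->loads;
        }
        return ctrl_->mapped[i];
    }

    // Write the object back and release its mapped copy.
    void swap_out() {
        if (ctrl_ && ctrl_->is_mapped) {
            ctrl_->backing = ctrl_->mapped;
            ctrl_->mapped.clear();
            ctrl_->mapped.shrink_to_fit();
            ctrl_->is_mapped = false;
        }
    }

    int loads() const { return ctrl_ ? ctrl_->loads : 0; }
};
```

For example, after `Pointer<int> a; a.alloc(8); a[5] = 7; a.swap_out();` a later read of `a[5]` maps the object in a second time and still yields 7. The check inside operator[] runs on every access, which is the access-check overhead listed under Future Work.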