PPT - The University of Hong Kong

LOTS: A Software DSM
Supporting Large Object Space
Benny Wang-Leung Cheung,
Cho-Li Wang, and Francis Chi-Moon Lau
Department of Computer Science
The University of Hong Kong
September, 2004
Presentation Outline
• Why LOTS? (Objectives)
• DSM Background and Related Work
• Design of LOTS
• Performance Testing and Results
• Conclusion and Future Work
The Problem in Current DSM
• Lack of shared object (memory) space
– Another major problem apart from performance
– Fixed address mapping in virtual memory
– Shared object space size < process space
• TreadMarks: ~ min RAM size among all machines
• JIAJIA V1.0: 128 MB
– 32-bit machines → max 4 GB shared space
– Unscalable: Fixed regardless of # machines
– Large problems (needing > 4 GB of shared memory) can’t be run directly → the programmer needs to change the application code to reduce memory utilization.
Objectives of LOTS
• Using 64-bit machines is not a total solution!
• 32-bit machines are dominating the market (poor man’s clusters :<)
• Problems keep increasing memory consumption
[Figure: a rich man’s cluster next to a poor man’s cluster]
Objectives of LOTS
• Hence we introduce LOTS:
– Large Shared Object Space > 4GB
– Dynamic run-time memory mapping technique
– Local disk as the backing store for temporarily
unused objects
– Shared space size now limited by disk space
– Lazy disk read/write → reasonable performance
Some DSM Background
• Memory Consistency Issues:
– Memory Consistency Models
• Sequential Consistency (IVY) performs poorly
• Relaxed models reduce redundant data traffic
• Lazy Release Consistency (TreadMarks)
• Scope Consistency (JIAJIA)
[Figure: process P executes Y=5; Acq(L); X=3; Rel(L). Process Q then executes Acq(L); reads X and Y; Rel(L). In Scope, Q sees X to be 3, but Y may not be 5.]
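The P/Q example can be replayed with a toy model. This is only a sketch, not how JIAJIA or LOTS implement Scope Consistency: the `Lock` and `Process` types are hypothetical, each process keeps a private copy of the variables, and a lock carries only the writes made while it was held.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy model: a lock remembers the writes made inside its scope.
struct Lock {
    std::map<std::string, int> scope_updates;
};

struct Process {
    std::map<std::string, int> local;  // this process's private view
    Lock* held = nullptr;              // at most one lock held (simplification)

    // Acquire: apply only the updates recorded in this lock's scope.
    void acquire(Lock& l) {
        assert(held == nullptr);
        held = &l;
        for (const auto& kv : l.scope_updates) local[kv.first] = kv.second;
    }

    // Write: always visible locally; recorded in the lock only if inside its scope.
    void write(const std::string& var, int v) {
        local[var] = v;
        if (held) held->scope_updates[var] = v;
    }

    void release() {
        assert(held != nullptr);
        held = nullptr;
    }

    // Unknown variables read as 0 (stale/initial value in this model).
    int read(const std::string& var) const {
        auto it = local.find(var);
        return it == local.end() ? 0 : it->second;
    }
};
```

In this model P's write Y=5 happens outside the scope of L, so Q's Acq(L) brings in X=3 but leaves Q's copy of Y at its old value.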
Some DSM Background
• More Memory Consistency Issues:
– Coherence Protocols
• Home-based (JIAJIA) vs Homeless (TreadMarks) vs
Migrating-Home (JUMP)
• Write-update vs Write-invalidate
• Adaptive Protocol (DOSA, ADSM)
– Coherence Protocol has to match with memory
model for higher efficiency
• No DSM deals with Large Object Space!
Related Work
• Large object space support:
– Pointer swizzling
• Artificial, invalid addresses are translated to
machine-addressable form during access
• Used in persistent store (QuickStore, Thor-1)
[Figure: process space. Compiler-generated addresses cause a page fault at runtime and are translated to valid ones. Unused objects free their virtual addresses and are swapped out (i.e., swizzled out) to hard disk.]
Design of LOTS
• Dynamic Memory Mapping (DMM)
– Uses C++ Operator Overloading as the interface
• Overloads [], +, -, *, /, ++, --, >=, <=, !=, etc.
– Purely runtime;
[Figure: the program statement A[5]=7; accesses Array A. Its control structure A->ctrl sits in the Heap Area; object data sits in the Data (DMM) Area of the Virtual Memory Area. (1) Access invokes mapping mechanism. (2) Bring in object from disk / network (Local Hard Disk, or Remote Memory over the Network). (3) Internal structure points to object data for access.]
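The three steps of the mapping mechanism can be sketched with an overloaded operator[]. This is a minimal illustration, not the LOTS Pointer class: `MappedArray`, `BackingStore`, `map_in` and `map_out` are hypothetical names, and an in-memory map stands in for the local disk.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the local disk: object id -> saved contents.
using BackingStore = std::unordered_map<int, std::vector<int>>;

class MappedArray {
public:
    MappedArray(int id, std::size_t n, BackingStore& disk)
        : id_(id), size_(n), disk_(disk) {}

    // (1) Every element access goes through the overloaded operator.
    int& operator[](std::size_t i) {
        assert(i < size_);
        if (!data_) map_in();   // (2) bring the object in on first touch
        return (*data_)[i];     // (3) control structure points to the data
    }

    // Swap the object out: save its contents and drop the in-memory copy.
    void map_out() {
        if (data_) {
            disk_[id_] = *data_;
            data_.reset();
        }
    }

    bool is_mapped() const { return data_ != nullptr; }

private:
    void map_in() {
        auto it = disk_.find(id_);
        if (it != disk_.end())
            data_.reset(new std::vector<int>(it->second));  // reload saved copy
        else
            data_.reset(new std::vector<int>(size_, 0));    // fresh, zero-filled
    }

    int id_;
    std::size_t size_;
    BackingStore& disk_;
    std::unique_ptr<std::vector<int>> data_;  // null while swapped out
};
```

The access check in operator[] runs on every access, which is the kind of overhead the deck later attributes to access checking.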
LOTS Shared Objects Creation
• Through the LOTS memory allocator
– Exists as a C++ class
• Memory allocation through alloc() function
• Put data into specific part of process space
• Object control info in heap area
[Figure: process space layout, listed from 0xffffffff down to 0x00000000: Kernel Reserved, C++ Stack, DSM Control Area, Twin Area, DMM Area (Array A), Heap Area (A->Ctrl), DAT Segment, TXT Segment. Boundary labels shown: 0xc0000000, 0xb0000000, 0x90000000, 0x70000000, 0x50000000.]
LOTS Memory Allocator
• Bypass Doug Lea’s Memory Allocator used in
original C/C++
• Uses mmap() to get physical memory, and map
the shared object data to the process space.
– Free queues and used queues
– Small & large objects allocated separately
[Figure: the DMM Area (0x50000000 to 0x70000000), drawn alongside the Heap Area and the Twin and Control Area. A free queue and a used queue each index blocks by size: 8, 16, 24, 32, 40, 48, …, 1M, 2M, …, ½G. Legend: used block, free block.]
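A sketch of the free-queue / used-queue idea for small objects follows. It assumes the 8, 16, 24, … sizes in the figure mean rounding requests up to a multiple of 8 bytes; `QueueAllocator`, `size_class` and `release` are hypothetical names, addresses are plain numbers rather than mmap()ed memory, and the large-object side is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>

// Round a request up to the next multiple of 8 bytes (assumed size classes).
inline std::size_t size_class(std::size_t bytes) {
    assert(bytes > 0);
    return (bytes + 7) / 8 * 8;
}

// Hypothetical allocator that hands out addresses from a base upward.
class QueueAllocator {
public:
    explicit QueueAllocator(std::size_t base) : next_(base) {}

    std::size_t alloc(std::size_t bytes) {
        std::size_t cls = size_class(bytes);
        std::deque<std::size_t>& fq = free_[cls];
        std::size_t addr;
        if (!fq.empty()) {          // reuse a free block of the same class
            addr = fq.front();
            fq.pop_front();
        } else {                    // otherwise carve a fresh block
            addr = next_;
            next_ += cls;
        }
        used_[addr] = cls;          // record the block as used
        return addr;
    }

    void release(std::size_t addr) {
        auto it = used_.find(addr);
        assert(it != used_.end());
        free_[it->second].push_back(addr);  // back onto its class's free queue
        used_.erase(it);
    }

private:
    std::size_t next_;
    std::map<std::size_t, std::deque<std::size_t>> free_;  // class -> free queue
    std::map<std::size_t, std::size_t> used_;              // address -> class
};
```

Keeping one queue per size class means a freed block is only reused for requests of the same class, which avoids searching or splitting at allocation time.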
Shared Memory Behavior
• Goal: Reduce redundant data traffic
– Memory Consistency Model: Scope
– Memory Coherence Protocol: Mixed
• Lock-synchronized objects: Homeless + write-update
• Barrier-synchronized objects: Migrating-Home + write-invalidate
• Principle: To eliminate as much all-to-all data
communication as possible
Mixed Coherence Protocol
• An Example:
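[Figure: timeline of four processes P0 to P3 sharing objects X and Y, with arrows for Updates Movement and Home Token Movement. Before the Barrier, lock-synchronized updates travel with the locks: Acq(L1); x1=1; y1=5; Rel(L1) is followed on another process by Acq(L1); x1++; y1++; Rel(L1), and Acq(L2); x2=3; Rel(L2) is followed by Acq(L2); x2++; Rel(L2). At the Barrier, the Home of X and Y moves to a New Home, which ends with x1 = 2, x2 = 4, and Inv X, Y is sent to the other processes.]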
When the processes arrive at the barrier, the process that holds the token
of the object will become the new home of that object, and other
processes will send the updates to the home.
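The barrier rule can be sketched as a small state update. This is an assumption-laden illustration, not the LOTS protocol code: `ObjectState` and `migrate_home_at_barrier` are hypothetical names, how the token is passed between processes is not modeled, and invalidation is applied only when some process modified the object.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ObjectState {
    int home;                 // process currently acting as the object's home
    int token_holder;         // process currently holding the object's token
    std::vector<bool> valid;  // valid[p]: process p holds an up-to-date copy
};

// Barrier step for one barrier-synchronized object.
// wrote[p] says whether process p modified the object since the last barrier.
// Returns the processes that must send their updates to the new home.
std::vector<int> migrate_home_at_barrier(ObjectState& obj,
                                         const std::vector<bool>& wrote) {
    assert(wrote.size() == obj.valid.size());
    assert(obj.token_holder >= 0 &&
           static_cast<std::size_t>(obj.token_holder) < wrote.size());

    obj.home = obj.token_holder;  // the token holder becomes the new home

    std::vector<int> senders;
    bool modified = false;
    for (std::size_t p = 0; p < wrote.size(); ++p) {
        if (!wrote[p]) continue;
        modified = true;
        if (static_cast<int>(p) != obj.home)
            senders.push_back(static_cast<int>(p));
    }

    if (modified) {  // write-invalidate: only the home's copy stays valid
        for (std::size_t p = 0; p < obj.valid.size(); ++p)
            obj.valid[p] = (static_cast<int>(p) == obj.home);
    }
    return senders;
}
```

Only writers other than the new home send data, and they send it to one place, which is how the scheme avoids all-to-all communication at the barrier.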
Making LOTS More Efficient
• Eliminating Diff Accumulation Problem
– Lock and timestamp info in DSM control area
– Calculate diff on request, no redundancy
Traditional Method:
Time   Length   Updated elements
T=1    6        X1 X2 X3 X6 X7 X8
T=2    4        X1 X3 X5 X8
T=3    4        X2 X5 X7 X8
T=4    3        X3 X5 X7
All updates above need to be sent (17 units data + 8 units of control)

LOTS Method:
Value               X1 X2 X3 X4 X5 X6 X7 X8
Last Updated Time   2  3  4  0  4  1  4  3
Time   Length   Elements sent
1      1        X6
2      1        X1
3      2        X2 X8
4      3        X3 X5 X7
Only send 7 units data + 8 units of control data
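The slide's numbers can be reproduced with one timestamp per element and a diff computed on request. This is a sketch of the idea only; `TimestampedObject`, `write` and `diff_since` are hypothetical names, and the real system keeps this information in the DSM control area.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct TimestampedObject {
    std::vector<int> value;
    std::vector<unsigned> last_updated;  // 0 means never written

    explicit TimestampedObject(std::size_t n)
        : value(n, 0), last_updated(n, 0) {}

    // Record a write at logical time `now`; an older value is simply overwritten.
    void write(std::size_t i, int v, unsigned now) {
        assert(i < value.size());
        value[i] = v;
        last_updated[i] = now;
    }

    // Diff on request: every element written after time `since`,
    // as (index, latest value) pairs. Each element appears at most once.
    std::vector<std::pair<std::size_t, int>> diff_since(unsigned since) const {
        std::vector<std::pair<std::size_t, int>> out;
        for (std::size_t i = 0; i < value.size(); ++i)
            if (last_updated[i] > since) out.emplace_back(i, value[i]);
        return out;
    }
};
```

Replaying the four update rounds from the slide leaves timestamps 2 3 4 0 4 1 4 3, so a requester that has seen nothing receives 7 elements, against 6 + 4 + 4 + 3 = 17 when every stored diff is forwarded.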
Other Components in LOTS
• C++ runtime library in Linux
• Minimal set of functions as interface
– Retains as much C++ syntax as possible to
improve programmability
• Synchronization: Locks and Barriers
– Barriers: With/Without memory effect
• Communication: Sockets with UDP/IP
• SIGIO handler for incoming messages
Performance Testing
• Two Kinds of Testing
1. Without invoking large object space support
• Compare performance with other DSM (JIAJIA V1.0, as both have similar communication protocol)
• Report no. of messages and bytes sent
• Calculate large object space support overhead
• 16 Pentium IV 2GHz machines with 100Mbps Fast Ethernet connection, 128MB memory, Linux Fedora
2. With large object space support
• Use an application with large memory demand
• Run on different platforms for analysis
• Expect disk read/write overhead to dominate
Test 1: Timing Performance
[Figure: four execution-time plots, three annotated LOTS<JiaJia and one annotated LOTS>JiaJia. x-axis: problem size; y-axis: execution time in seconds. Legend: LOTS = LOTS enabled; LOTS-x = LOTS disabled.]
Performance Results Summary
• LOTS beat JIAJIA V1.0 in most
applications
– Mixed protocol + “Diff accumulation
elimination” reduce data traffic
• Large object space support and access
checking incur a considerable overhead
– about 5-15% of total execution time
(application dependent)
Test 2: Large Object Space
• Using 4-node PC and server clusters
CPU                        OS        RAM (MB)  # Shared Objs (X)  Per Obj Size (MB)  Total Shared Obj Size (GB)  Exec Time (sec)
P4-2GHz                    Fedora 1  512       33                 128                4.125                       142
P4-2GHz                    Fedora 1  512       132                128                16.50                       373
Xeon P3 500 MHz x4 (SMP)   Fedora 1  1024      132                128                16.50                       507
Xeon P3 500 MHz x4 (SMP)   Fedora 1  1024      132                511                65.87                       2112
Xeon P3 500 MHz x4 (SMP)   Fedora 1  1024      201                511                100.30                      3227
Xeon P3 500 MHz x4 (SMP)   Fedora 1  1024      236                511                117.77                      3839
• Test program: simple matrix operations
• With 120GB (SCSI) hard disk in each machine, able to claim
117.77GB Shared Object Space
• Disk read and write time is closely related to the OS version.
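The totals in the table follow from the object count and per-object size. The helper below is a hypothetical check, assuming 1 GB = 1024 MB, which is the conversion that reproduces the reported figures.

```cpp
#include <cassert>
#include <cmath>

// Total shared object size in GB for `count` objects of `per_object_mb` MB each,
// taking 1 GB = 1024 MB (assumption that matches the table).
inline double total_shared_gb(int count, int per_object_mb) {
    assert(count >= 0 && per_object_mb >= 0);
    return static_cast<double>(count) * per_object_mb / 1024.0;
}

// Whether the allocation exceeds the 4 GB reachable with 32-bit addresses.
inline bool exceeds_32bit_space(int count, int per_object_mb) {
    return total_shared_gb(count, per_object_mb) > 4.0;
}
```

For example, 33 objects of 128 MB give 4224 MB, i.e. 4.125 GB, which is already past the 4 GB limit of a 32-bit process space.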
Conclusions
• LOTS succeeds in:
– Providing a large shared object space larger
than the local process space during runtime
– Performing reasonably well by reducing
data traffic through Scope Consistency,
mixed coherence protocol and “diff
accumulation elimination” technique
– Offering a programming interface similar to C++
Future Work
• A Number of Optimizations:
– Further increase shared object space
• → “the minimum hard disk space × number of processes / 2”.
• Recent progress: 64GB (4GB x 16) of shared objects can
be allocated in 16 machines, each having a 9GB hard
disk.
– Reduce disk overhead
– Reduce operator-overloading overhead (access check)
– Load-aware migrating-home protocol: coherence
protocol adapting to network traffic and processor
loading (e.g., avoid too many “homes” in a single
machine)
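Read literally, the quoted target is a one-line formula. The helper below is a hypothetical sketch that assumes one process per machine and sizes given in GB.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Target from the slide, read literally:
// minimum hard disk space x number of processes / 2.
// disk_gb holds one entry per process (assumes one process per machine).
inline double shared_space_target_gb(const std::vector<double>& disk_gb) {
    assert(!disk_gb.empty());
    double smallest = *std::min_element(disk_gb.begin(), disk_gb.end());
    return smallest * static_cast<double>(disk_gb.size()) / 2.0;
}
```

With 16 machines and 9GB disks this gives 72GB, so the reported 64GB (4GB x 16) sits just under that target.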
Questions?
Test 1: No. of Messages Sent
The percentage is obtained by dividing the number of messages sent in LOTS by that in JIAJIA for the same application.
Due to the mixed protocol, LOTS sends fewer messages through the network than JIAJIA.
[Figure: messages sent by LOTS as a percentage of JIAJIA (y-axis, %) against no. of procs p = 2, 4, 8, 16 (x-axis), for FL (n=2048), LU (n=1024), ME (n=8192), RX (n=8192), RB (n=2048).]
Test 1: No. of Bytes Sent
The percentage is obtained by dividing the number of bytes sent in LOTS by that in JIAJIA for the same application.
[Figure: bytes sent by LOTS as a percentage of JIAJIA (y-axis, %) against no. of procs p = 2, 4, 8, 16 (x-axis), for FL (n=2048), LU (n=1024), ME (n=8192), RX (n=8192), RB (n=2048).]
Test 2: Large Object Space
• Allocate shared objects with total size > 4GB,
and another process accesses each of them
once (array addition with p=4)
int main(int argc, char **argv)
{
    int i, j, pp, local[4];
    Pointer<Pointer<int> > a;   // 2D int array

    lots_init();                // init LOTS

    // shared memory allocation
    a.alloc(X);
    for (i = 0; i < X; i++)
        a[i].alloc(size);
    nm_barrier();               // barrier

    // array addition
    for (j = 0; j < linec; j++) {
        pp = (dsmid + j) % linec;
        for (i = pp; i < X; i += 4) {
            acq(i);
            a[i][0] += rand();
            rel(i);
        }
    }
    nm_barrier();
    return 0;
}