The MILLIPEDE Project
Technion, Israel
Windows-NT based Distributed Virtual Parallel Machine
http://www.cs.technion.ac.il/Labs/Millipede
What is Millipede?
A strong Virtual Parallel Machine: it employs non-dedicated distributed environments.
(Architecture diagram, top to bottom:)
• Programs, written in parallel programming paradigms: SPLASH, Cilk/Calipso, Java, ParC, CC++, CParPar, ParFortran90, others, or the “Bare Millipede” interface directly
• Millipede, an implementation of parallel programming languages on a distributed environment: Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM)
• Communication packages: U-Net, Transis, Horus, …
• Operating system services and software packages: communication, user-mode threads, page protection, I/O
So, what’s in a VPM?
Check list:
• Uses a non-dedicated cluster of PCs (+ SMPs)
• Multi-threaded
• Shared memory
• User-mode
• Strong support for weak memory consistency
• Dynamic page- and job-migration
• Load sharing for maximal locality of reference
• Convergence to the optimal level of parallelism
Using a non-dedicated cluster
• Dynamically identify idle machines
• Move work to idle machines
• Evacuate busy machines
• Do everything transparently to the native user
• Co-existence of several parallel applications
Multi-Threaded Environments
• Well known benefits:
– Better utilization of resources
– An intuitive and high level of abstraction
– Latency hiding by overlapping computation and communication
• Natural for parallel programming paradigms & environments:
– Programmer-defined maximum level of parallelism
– Actual level of parallelism is set dynamically; applications scale up and down
– Nested parallelism
– SMPs
Convergence to Optimal Speedup
• The tradeoff: a higher level of parallelism vs. better locality of memory reference
• Optimal speedup is not necessarily achieved with the maximal number of computers
• The achieved level of parallelism depends on the program's needs and on the capabilities of the system
No/Explicit/Implicit Access to Shared Memory

PVM (explicit message passing):
/* Receive data from master */
msgtype = 0;
pvm_recv(-1, msgtype);
pvm_upkint(&nproc, 1, 1);
pvm_upkint(tids, nproc, 1);
pvm_upkint(&n, 1, 1);
pvm_upkfloat(data, n, 1);
/* Determine which slave I am (0..nproc-1) */
for (i = 0; i < nproc; i++)
    if (mytid == tids[i]) { me = i; break; }
/* Do calculations with data */
result = work(me, n, data, tids, nproc);
/* Send result to master */
pvm_initsend(PvmDataDefault);
pvm_pkint(&me, 1, 1);
pvm_pkfloat(&result, 1, 1);
msgtype = 5;
master = pvm_parent();
pvm_send(master, msgtype);
/* Exit PVM before stopping */
pvm_exit();

C-Linda (explicit tuple-space access):
/* Retrieve data from DSM */
rd("init data", ?nproc, ?n, ?data);
/* Worker id is given at creation; no need to compute it now */
/* Do calculation, put result in DSM */
out("result", id, work(id, n, data, nproc));

“Bare” Millipede (implicit shared memory):
result = work(milGetMyId(), n, data, milGetTotalIds());
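For contrast, a slightly fuller Millipede worker might look like the sketch below. Because memory is shared through the DSM, the worker just indexes into global data; only milGetMyId() and milGetTotalIds() appear on the slide, and everything else (N, MAX_IDS, the arrays, worker) is an illustrative assumption.

/* Hypothetical sketch: shared state lives in the DSM, so no
   unpacking or tuple retrieval is needed. */
#define N       1024   /* problem size (assumed) */
#define MAX_IDS   64   /* max number of worker jobs (assumed) */

float data[N];          /* shared through the DSM */
float results[MAX_IDS]; /* shared through the DSM */

void worker(void)
{
    int me    = milGetMyId();
    int total = milGetTotalIds();
    /* no messages: just read and write shared memory */
    results[me] = work(me, N, data, total);
}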
Relaxed Consistency
(Avoiding false sharing and ping-pong)
• Per-page consistency protocols: Sequential, CRUW, Sync(var), Arbitrary-CW Sync
• Multiple relaxations for different shared variables within the same program
• No broadcast, no central address servers (so it can work efficiently on interconnected LANs)
• New protocols welcome (user defined?!)
• Step-by-step optimization towards maximal parallelism
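To make the per-variable relaxation concrete, a program might select protocols roughly as in the sketch below; milMalloc, milSetConsistency, and the protocol constants are hypothetical names, since the slides do not show the actual DSM API.

/* Hypothetical sketch only: illustrates choosing a different
   consistency protocol per shared variable. */
float *grid   = (float *)milMalloc(N * N * sizeof(float));
long  *global = (long *)milMalloc(sizeof(long));

milSetConsistency(grid,   MIL_SEQUENTIAL);   /* strict default */
milSetConsistency(global, MIL_ARBITRARY_CW); /* relaxed for a heavily shared counter */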
Reducing Consistency
LU decomposition of a 1024x1024 matrix written in SPLASH. Advantages gained when reducing the consistency of a single variable (the Global structure):
(Charts: speedup vs. number of hosts (1-5), original vs. reduced consistency; number of page migrations per host for page #4 (0-70), original vs. reduced.)
MJEC - Millipede Job Event Control
An open mechanism with which various synchronization methods can be implemented:
• A job has a unique system-wide id
• Jobs communicate and synchronize by sending events
• Although a job is mobile, its events follow it and reach its event queue wherever it goes
• Event handlers are context-sensitive
MJEC (con't)
• Modes:
– In Execution-Mode: arriving events are enqueued
– In Dispatching-Mode: events are dequeued and handled by a user-supplied dispatching routine
MJEC Interface
Registration and entering dispatch mode:
milEnterDispatchingMode((FUNC)foo, void *context)
Post event:
milPostEvent(id target, int event, int data)
Dispatcher routine syntax:
int foo(id origin, int event, int data, void *context)
Dispatching loop (from the slide's flowchart): the dispatcher is first called with INIT, then once per arriving event (waiting when none is pending) until it returns EXIT, and finally once with EXIT:
    ret := func(INIT, context)
    while ret != EXIT:
        wait for a pending event
        ret := func(event, context)
    ret := func(EXIT, context)
Experience with MJEC
• ParC: ~250 lines; SPLASH: ~120 lines
• Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
• Implementation of location-dependent services (e.g., graphical display)
Example - Barriers with MJEC
(Diagram: jobs call BARRIER(...), posting ARR events to the barrier server and waiting in their dispatchers.)
Job side:
Barrier() {
    milPostEvent(BARSERV, ARR, 0);
    milEnterDispatchingMode(wait_in_barrier, 0);
}

wait_in_barrier(src, event, context) {
    if (event == DEP)
        return EXIT_DISPATCHER;
    else
        return STAY_IN_DISPATCHER;
}
Example - Barriers with MJEC (con't)
(Diagram: once all jobs have arrived, the barrier server posts DEP events back to the waiting jobs.)
Server side:
BarrierServer() {
    milEnterDispatchingMode(barrier_server, info);
}

barrier_server(src, event, context) {
    if (event == ARR)
        enqueue(context.queue, src);
    if (should_release(context))
        while (context.cnt > 0) {
            milPostEvent(dequeue(context.queue), DEP, 0);
            context.cnt--;
        }
    return STAY_IN_DISPATCHER;
}
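The locks mentioned under "Experience with MJEC" could plausibly follow the same pattern. The sketch below uses the slides' pseudocode style; LOCKSERV, ACQ, REL, GRANT, and the queue helpers are illustrative names, not the published API.

/* Hypothetical sketch: a mutex in the style of the barrier example. */
Lock() {
    milPostEvent(LOCKSERV, ACQ, 0);
    milEnterDispatchingMode(wait_for_grant, 0);  /* block until granted */
}

wait_for_grant(src, event, data, context) {
    if (event == GRANT)
        return EXIT_DISPATCHER;
    return STAY_IN_DISPATCHER;
}

/* Server-side dispatcher: grant immediately if free, else queue. */
lock_server(src, event, data, context) {
    if (event == ACQ) {
        if (!context.held) { context.held = 1; milPostEvent(src, GRANT, 0); }
        else enqueue(context.queue, src);
    } else if (event == REL) {
        if (queue_empty(context.queue)) context.held = 0;
        else milPostEvent(dequeue(context.queue), GRANT, 0);
    }
    return STAY_IN_DISPATCHER;
}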
Dynamic Page- and Job-Migration
• Migration may occur in case of:
– Remote memory access
– Load imbalance
– User comes back from lunch
– Improving locality by location rearrangement
• Sometimes migration should be disabled:
– by the system: ping-pong, critical section
– by the programmer: control system
Locality of memory reference is THE dominant efficiency factor.
Migration Can Help Locality
(Diagrams: only job migration; only page migration; page & job migration.)
Load Sharing + Max. Locality = Minimum-Weight Multiway Cut
(Diagram: an access graph of threads and pages cut among hosts p, q, r.)
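Spelling the model out: threads and pages are the vertices of an access graph, the hosts are the terminals, and each edge is weighted by the number of accesses between its endpoints. The cost of a placement is the total weight of edges crossing between hosts, i.e. the remote accesses it induces, which is exactly what a minimum-weight multiway cut minimizes. A minimal sketch with illustrative names, not Millipede code:

/* Weight of a placement = sum of access-graph edges whose endpoints
   are mapped to different hosts. */
long placement_weight(int n_edges, const int from[], const int to[],
                      const long weight[], const int host_of[])
{
    long w = 0;
    for (int e = 0; e < n_edges; e++)
        if (host_of[from[e]] != host_of[to[e]])
            w += weight[e];   /* edge crosses the cut: remote accesses */
    return w;
}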
Problems with the multiway cut model
• NP-hard for #cuts > 2, and we have n > X,000,000. Polynomial 2-approximations are known
• Not optimized for load balancing
• Page replicas
• The graph changes dynamically
• Only external accesses are recorded ==> only partial information is available
Our Approach
(Diagram: a per-job record of remote accesses to pages 0, 1, and 2.)
• Record the history of remote accesses
• Use this information when making decisions concerning load balancing/load sharing
• Save old information to avoid repeating bad decisions (learn from mistakes)
• Detect and resolve ping-pong situations
• Do everything by piggybacking on communication that is taking place anyway
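As a sketch of how such a history might be kept without extra traffic, consider the following; the structure and names are assumptions, not the actual Millipede data structures.

/* Hypothetical sketch: a bounded per-page history of remote accesses,
   updated when a DSM message arrives, so the bookkeeping piggybacks
   on communication that happens anyway. */
#define HISTORY_LEN 8

typedef struct {
    int  host;   /* remote host that accessed the page */
    long when;   /* timestamp of the access */
} access_record;

typedef struct {
    access_record ring[HISTORY_LEN];  /* old entries are overwritten */
    int head;
} page_history;

void record_remote_access(page_history *h, int host, long now)
{
    h->ring[h->head].host = host;
    h->ring[h->head].when = now;
    h->head = (h->head + 1) % HISTORY_LEN;
}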
Ping Pong
Detection (local):
1. Local threads attempt to use the page a short time after it leaves the local host
2. The page leaves the host shortly after arrival
Treatment (by the ping-pong server):
• Collect information regarding all participating hosts and threads
• Try to locate an underloaded target host
• Stabilize the system by locking in pages/threads
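A minimal sketch of the two local tests above, assuming per-page arrival/departure/fault timestamps are available; the threshold value and all names are assumptions.

/* Flag a page as ping-ponging if (1) a local thread faults on it soon
   after it left this host, or (2) it left soon after arriving. */
#define PP_WINDOW_MS 10   /* "short time" threshold; assumed value */

int looks_like_ping_pong(long arrived, long departed, long faulted)
{
    if (faulted >= departed && faulted - departed < PP_WINDOW_MS)
        return 1;   /* test 1: wanted again right after leaving */
    if (departed >= arrived && departed - arrived < PP_WINDOW_MS)
        return 1;   /* test 2: left right after arriving */
    return 0;
}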
TSP - Effect of Locality Optimization
15 cities, Bare Millipede
(Chart: execution time in seconds (0-4000) vs. number of hosts (1-6) for the NO-FS, OPTIMIZED-FS, and FS configurations.)
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.
TSP on 6 hosts
k = number of threads falsely sharing a page

k | optimized? | # DSM-related messages | # ping-pong treatment msgs | # thread migrations | execution time (sec)
--|------------|------------------------|----------------------------|---------------------|---------------------
2 | Yes        |   5100                 | 290                        |  68                 |  645
2 | No         | 176120                 |   0                        |  23                 | 1020
3 | Yes        |   4080                 | 279                        |  87                 |  620
3 | No         | 160460                 |   0                        |  32                 | 1514
4 | Yes        |   5060                 | 343                        |  99                 |  690
4 | No         | 155540                 |   0                        |  44                 | 1515
5 | Yes        |   6160                 | 443                        | 139                 |  700
5 | No         | 162505                 |   0                        |  55                 | 1442
Ping Pong Detection Sensitivity
TSP-1
(Chart: execution time (0-1000 sec) vs. detection sensitivity (2-20).)
Best results are achieved at maximal sensitivity, since all pages are accessed frequently.
TSP-2
(Chart: execution time (0-1100 sec) vs. detection sensitivity (2-20).)
Since some pages are accessed frequently and others only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.
Applications
• Numerical computations: Multigrid
• Model checking: BDDs
• Compute-intensive graphics: Ray-Tracing, Radiosity
• Games, search trees, pruning, tracking, CFD, ...
Performance Evaluation
Tuning parameters:
• L - underloaded
• H - overloaded
• Delta(ms) - lock-in time
• t/o delta - polling (MGS, DSM)
• msg delta - system pages delta
• T_epoch - max history time
• ??? - remove old histories / refresh old histories
• L_epoch - history length
Open questions:
• page histories vs. job histories
• migration heuristic: which function?
• ping-pong: what is the initial noise? at what frequency is it ping-pong?
LU decomposition of a 1024x1024 matrix written in SPLASH: performance improves when there are few threads on each host.
(Chart: speedup (1-6) vs. number of hosts (1-6), with 1 thread/host and 3 threads/host.)
LU decomposition of a 2048x2048 matrix written in SPLASH: super-linear speedups due to the caching effect.
(Chart: speedup vs. number of hosts; 4.47 on 3 hosts, 7.18 on 6 hosts.)
Jacobi relaxation on a 512x512 matrix (using 2 matrices, no false sharing) written in ParC.
(Chart: execution time (0-180 sec) and speedup (0-3.5) vs. number of hosts (1-4).)
Overheads
Overhead of ParC/Millipede on a single host, tested with a tracking algorithm:
(Chart: relative execution time (0.97-1.05) vs. number of targets (1, 10, 20) for Pure C, "Bare" Millipede, and ParC on Millipede.)
Info...
http://www.cs.technion.ac.il/Labs/Millipede
millipede@cs.technion.ac.il
A release is available for download at the Millipede site!