The MILLIPEDE Project
Technion, Israel
Windows-NT based Distributed Virtual Parallel Machine
http://www.cs.technion.ac.il/Labs/Millipede

What is Millipede?
A strong Virtual Parallel Machine: employs non-dedicated distributed environments.

[Architecture diagram, top to bottom:]
• Programs
• Implementation of parallel programming languages and paradigms: SPLASH, Cilk/Calipso, Java, ParC, CC++, CParPar, ParFortran90, "Bare Millipede"
• Millipede core: Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM)
• Communication packages: U-Net, Transis, Horus, ...
• Operating system services: communication, threads, page protection, I/O, user-mode threads

So, what's in a VPM? Checklist:
• Uses a non-dedicated cluster of PCs (+ SMPs)
• Multi-threaded
• Shared memory
• User-mode
• Strong support for weak memory
• Dynamic page- and job-migration
• Load sharing for maximal locality of reference
• Convergence to the optimal level of parallelism

Using a Non-Dedicated Cluster
• Dynamically identify idle machines
• Move work to idle machines
• Evacuate busy machines
• Do everything transparently to the native user
• Co-existence of several parallel applications

Multi-Threaded Environments
• Well-known benefits:
– Better utilization of resources
– An intuitive and high level of abstraction
– Latency hiding by overlapping computation and communication
• Natural for parallel programming paradigms & environments:
– Programmer-defined maximal level of parallelism
– Actual level of parallelism set dynamically; applications scale up and down
– Nested parallelism
– SMPs

Convergence to Optimal Speedup
• The tradeoff: a higher level of parallelism vs. better locality of memory reference
• Optimal speedup is not necessarily achieved with the maximal number of computers
• The achieved level of parallelism depends on the program's needs and on the capabilities of the system

No/Explicit/Implicit Access to Shared Memory

PVM:

    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);

    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }

    /* Do calculations with data */
    result = work(me, n, data, tids, nproc);

    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);

    /* Exit PVM before stopping */
    pvm_exit();

C-Linda:

    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);

    /* Worker id is given at creation -
       no need to compute it now */

    /* Do calculation; put result in DSM */
    out("result", id, work(id, n, data, nproc));

"Bare" Millipede:

    result = work(milGetMyId(), n, data, milGetTotalIds());

Relaxed Consistency (Avoiding False Sharing and Ping-Pong)
• Sequential, CRUW, page Sync(var), Arbitrary-CW Sync
• Multiple relaxations for different shared variables within the same program
• No broadcast and no central address servers (so it can work efficiently on interconnected LANs)
• New protocols welcome (user-defined?!)
• Step-by-step optimization towards maximal parallelism

Reducing Consistency
LU Decomposition, 1024x1024 matrix, written in SPLASH.
Advantages gained when reducing the consistency of a single variable (the Global structure):
[Charts: speedups on 1-5 hosts, original vs. reduced consistency; number of page migrations per host for page #4, original vs. reduced.]

MJEC - Millipede Job Event Control
An open mechanism with which various synchronization methods can be implemented.
• A job has a unique system-wide id
• Jobs communicate and synchronize by sending events
• Although a job is mobile, its events follow it and reach its event queue wherever it goes
• Event handlers are context-sensitive

MJEC (cont.)
• Modes:
– In Execution Mode: arriving events are enqueued
– In Dispatching Mode: events are dequeued and handled by a user-supplied dispatching routine

MJEC Interface
[Flowchart: milEnterDispatchingMode(func, context) first calls ret := func(INIT, context); then, while ret != EXIT, it waits for a pending event and calls ret := func(event, context); finally it calls func(EXIT, context) and returns to Execution Mode.]
• Registration and entering Dispatching Mode: milEnterDispatchingMode((FUNC)foo, void *context)
• Posting an event: milPostEvent(id target, int event, int data)
• Dispatcher routine syntax: int foo(id origin, int event, int data, void *context)

Experience with MJEC
• ParC: ~250 lines; SPLASH: ~120 lines
• Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
• Implementation of location-dependent services (e.g., a graphical display)

Example - Barriers with MJEC
Each job posts an ARR (arrival) event to the barrier server and waits in its dispatcher until a DEP (departure) event arrives:

    Barrier() {
        milPostEvent(BARSERV, ARR, 0);
        milEnterDispatchingMode(wait_in_barrier, 0);
    }

    wait_in_barrier(src, event, context) {
        if (event == DEP) return EXIT_DISPATCHER;
        else return STAY_IN_DISPATCHER;
    }

Example - Barriers with MJEC (cont.)
The barrier server enqueues arrivals and, when the barrier should be released, posts DEP events to all waiting jobs:

    BarrierServer() {
        milEnterDispatchingMode(barrier_server, info);
    }

    barrier_server(src, event, context) {
        if (event == ARR) enqueue(context.queue, src);
        if (should_release(context))
            while (context.cnt-- > 0)
                milPostEvent(dequeue(context.queue), DEP, 0);
        return STAY_IN_DISPATCHER;
    }

Dynamic Page- and Job-Migration
• Migration may occur in case of:
– Remote memory access
– Load imbalance
– A user coming back from lunch
– Improving locality by rearranging locations
• Sometimes migration should be disabled:
– by the system: ping-pong, critical sections
– by the programmer: e.g., a control system
Locality of memory reference is THE dominant efficiency factor.

Migration Can Help Locality
[Diagrams: only job migration; only page migration; page & job migration.]

Load Sharing + Max. Locality = Minimum-Weight Multiway Cut
[Diagram: threads p, q, r and their pages partitioned across hosts by a multiway cut.]

Problems with the Multiway Cut Model
• NP-hard for #cuts > 2, and we have n > X,000,000; polynomial 2-approximations are known
• Not optimized for load balancing
• Page replicas
• The graph changes dynamically
• Only external accesses are recorded ==> only partial information is available

Our Approach
[Diagram: threads on several hosts accessing page 0, page 1, page 2.]
• Record the history of remote accesses
• Use this information when taking decisions concerning load balancing/load sharing
• Save old information to avoid repeating bad decisions (learn from mistakes)
• Detect and resolve ping-pong situations
• Do everything by piggybacking on communication that is taking place anyway

Ping Pong
Detection (local):
1. Local threads attempt to use the page a short time after it leaves the local host
2.
The page leaves the host shortly after arrival

Treatment (by the ping-pong server):
• Collect information regarding all participating hosts and threads
• Try to locate an underloaded target host
• Stabilize the system by locking-in pages/threads

TSP - Effect of Locality Optimization
15 cities, Bare Millipede.
[Chart: execution time (sec) on 1-6 hosts for NO-FS, OPTIMIZED-FS, and FS.]
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.

TSP on 6 hosts (k = number of threads falsely sharing a page):

    k  optimized?  #DSM-related  #ping-pong       #thread     execution
                   msgs          treatment msgs   migrations  time (sec)
    2  Yes         5100          290              68          645
    2  No          176120        0                23          1020
    3  Yes         4080          279              87          620
    3  No          160460        0                32          1514
    4  Yes         5060          343              99          690
    4  No          155540        0                44          1515
    5  Yes         6160          443              139         700
    5  No          162505        0                55          1442

Ping Pong Detection Sensitivity
[Chart TSP-1: execution time vs. sensitivity settings 2-20.] Best results are achieved at maximal sensitivity, since all pages are accessed frequently.
[Chart TSP-2: execution time vs. sensitivity settings 2-20.] Since part of the pages are accessed frequently and part only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.

Applications
• Numerical computations: Multigrid
• Model checking: BDDs
• Compute-intensive graphics: Ray-Tracing, Radiosity
• Games, search trees, pruning, tracking, CFD, ...

Performance Evaluation (parameters and open questions)
• L - underloaded; H - overloaded
• Delta(ms) - lock-in time
• t/o delta - polling (MGS, DSM)
• msg delta - system pages delta
• T_epoch - max history time
• ??? - remove old histories / refresh old histories
• L_epoch - history length
• Page histories vs. job histories
• Migration heuristic - which function?
• Ping-pong - what is the initial noise? At what frequency is it ping-pong?
LU Decomposition
1024x1024 matrix, written in SPLASH.
Performance improvements when there are few threads on each host:
[Chart: speedup on 1-6 hosts, 1 thread/host vs. 3 threads/host.]

LU Decomposition
2048x2048 matrix, written in SPLASH.
Super-linear speedups due to the caching effect:
[Chart: speedup 4.47 on 3 hosts, 7.18 on 6 hosts.]

Jacobi Relaxation
512x512 matrix (using 2 matrices, no false sharing), written in ParC.
[Chart: execution time and speedup on 1-4 hosts.]

Overhead of ParC/Millipede on a single host, tested with a tracking algorithm:
[Chart: relative overheads of Pure C, "Bare" Millipede, and ParC on Millipede for 1, 10, and 20 targets; all within a few percent of each other.]

Info...
http://www.cs.technion.ac.il/Labs/Millipede
millipede@cs.technion.ac.il
Release available at the Millipede site!