Large-Scale Parallel Computing on Grids
Henri Bal, bal@cs.vu.nl
Vrije Universiteit Amsterdam
7th Int. Workshop on Parallel and Distributed Methods in VerifiCation (PDMC)
29 March 2008

The 'Promise of the Grid'
● Efficient and transparent (i.e. easy-to-use) wall-socket computing over a distributed set of resources [Sunderam ICCS'2004, based on Foster/Kesselman]

Parallel computing on grids
● Mostly limited to
  ● trivially parallel applications
    ● parameter sweeps, SETI@home, RSA-155
  ● applications that run on one cluster at a time
    ● use the grid to schedule the application on a suitable cluster
● Our goal: run real parallel applications on a large-scale grid, using co-allocated resources

Reality: 'Problems of the Grid'
● Performance & scalability
● Heterogeneity
● Low-level & changing programming interfaces
● Connectivity issues
● Fault tolerance
● Malleability
=> Writing & deploying grid applications is hard

It's a jungle out there
  It's a jungle out there
  Disorder and confusion everywhere
  No one seems to care
  Well I do
  Hey, who's in charge here?
  It's a jungle out there
  ... (Randy Newman)

Some solutions
● Performance & scalability
  ● Fast optical private networks (OPNs)
  ● Algorithmic optimizations
● Heterogeneity
  ● Use Virtual Machine technology (e.g. the Java VM)
● Grid programming systems dealing with connectivity, fault tolerance, ...
  ● Ibis programming system
  ● GAT (Grid Application Toolkit)

Outline
● Grid infrastructures
  ● DAS: a Computer Science grid with a 40 Gb/s optical wide-area network
● Parallel algorithms and applications on grids
  ● Search algorithms, Awari
  ● DiVinE model checker (early experiences)
● Programming environments
  ● Ibis and JavaGAT

Distributed ASCI Supercomputer (DAS)
[Map: DAS-1 (1997), DAS-2 (2002) and DAS-3 (2006) sites at the VU, Leiden, U. Amsterdam and Delft, connected by SURFnet]

DAS-3
● 272 nodes (AMD Opterons), 792 cores, 1 TB memory
● LAN: Myrinet 10G, Gigabit Ethernet
● WAN: 40 Gb/s OPN
● Heterogeneous:
  ● 2.2-2.6 GHz, single/dual-core
  ● Delft has no Myrinet

Wide-area algorithms
● Key problem is fighting high latency
  ● LAN: 1-100 usec, WAN: 1-100 msec
● Experience from the Albatross project: traditional optimizations can be very effective for grids
  ● Use asynchronous communication
    ● Allows latency hiding & message combining
  ● Exploit the hierarchical structure of grids
    ● E.g. load balancing, group communication

Example 1: heuristic search
● Parallel search: partition the search space
● Many applications have transpositions
  ● Reach the same position through different moves
● Transposition table:
  ● Hash table storing positions that have been evaluated before
  ● Shared among all processors

Distributed transposition tables
● Partitioned tables
  ● 10,000s of synchronous messages per second
  ● Poor performance even on a cluster
● Replicated tables
  ● Broadcasting doesn't scale (many updates)

Transposition Driven Scheduling
● Send the job asynchronously to the owner of the table entry
  ● Can be overlapped with computation
  ● Random (= good) load balancing
● Jobs are delayed & combined into fewer large messages
  => Bulk transfers are far more efficient on most networks
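The job-combining idea behind Transposition Driven Scheduling can be summarized in a small, self-contained sketch. This is not the actual TDS/Awari code (which runs on MPI over Myrinet); the names TdsSketch, Job, NUM_PROCS and COMBINE_THRESHOLD are invented for illustration, and the network send is left as a placeholder.

import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of Transposition Driven Scheduling (TDS): a job (a position
 * to expand) is sent asynchronously to the processor that owns its
 * transposition-table entry, and jobs are buffered per destination so they
 * can be shipped as bulk messages. All names here are illustrative only.
 */
class TdsSketch {
    static final int NUM_PROCS = 64;           // assumed worker count
    static final int COMBINE_THRESHOLD = 1024; // jobs per bulk message

    static class Job {
        final long position;                   // encoded game position
        Job(long position) { this.position = position; }
    }

    // One outgoing buffer per destination: message combining.
    private final List<List<Job>> outgoing = new ArrayList<>();

    TdsSketch() {
        for (int i = 0; i < NUM_PROCS; i++) outgoing.add(new ArrayList<>());
    }

    // The owner of a table entry follows from hashing the position,
    // which also gives random (= good) load balancing.
    static int owner(long position) {
        return Math.floorMod(Long.hashCode(position), NUM_PROCS);
    }

    // A "spawn" is only a buffer append; the caller never waits for a reply,
    // so communication overlaps with the local computation.
    void spawn(long position) {
        int dest = owner(position);
        List<Job> buf = outgoing.get(dest);
        buf.add(new Job(position));
        if (buf.size() >= COMBINE_THRESHOLD) flush(dest);
    }

    // In the real system this hands one large message to the network layer;
    // here it just clears the buffer.
    void flush(int dest) {
        List<Job> bulk = outgoing.get(dest);
        // network.send(dest, bulk);  // placeholder for the actual bulk transfer
        bulk.clear();
    }
}

Because a spawn is only an append to a per-destination buffer, the sender never waits for a wide-area round trip; the cost of a WAN message is paid once per bulk transfer instead of once per position.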
Speedups for Rubik's cube
[Chart: speedup (0-70) on a 16-node cluster, 4x 16-node wide-area clusters, and a 64-node cluster; single Myrinet cluster vs. TDS on wide-area DAS-2]
● The latency-insensitive algorithm works well even on a grid
  ● Despite the huge amount of communication

Example 2: Solving awari
● Solved by John Romein [IEEE Computer, Oct. 2003]
  ● Computed on a Myrinet cluster (DAS-2)
  ● Recently also run on the wide-area DAS-3 [CCGrid, May 2008]
● Determined the score for 889,063,398,406 positions
● The game is a draw
● Andy Tanenbaum: "You just ruined a perfectly fine 3500-year-old game"

Rules of awari
[Board diagram: two rows of six pits, North and South]
● Start with 4*12 = 48 stones
● Sow counterclockwise
● Capture if the last stone lands in an enemy pit that then contains 2-3 stones
● Goal: capture the majority of the stones

(We don't use) MiniMax
[Diagram: game tree with alternating Max and Min levels]
● Depth-first search (e.g., alpha-beta)

Retrograde analysis
[Diagram: game tree with values propagating from the final states up to the initial state]
● Start from final positions with known outcome
  ● Like mate and stalemate in chess
● Propagate information upwards ("unmoves")

Parallel retrograde analysis
● Partition the database, like transposition tables
  ● Random distribution (hashing), good load balance
● Repeatedly send results to parent nodes
  ● Asynchronous, combined into bulk transfers
● Extremely communication intensive:
  ● 1 Pbit of data in 51 hours
● What happens on a grid?
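Retrograde analysis itself can be illustrated by a small sequential sketch that classifies positions as won or lost by propagating values backwards from final positions over "unmoves". The real solver computes exact scores and partitions the table over the processors, combining updates into bulk transfers as described above; the graph representation and all names below (RetrogradeSketch, solve, unresolved, ...) are illustrative only.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Minimal sequential sketch of retrograde analysis for a two-player game
 * without draws: values propagate backwards from final positions via
 * predecessor ("unmove") edges. Illustrative only; the awari solver
 * computes exact scores and distributes this table.
 */
class RetrogradeSketch {
    enum Value { UNKNOWN, WIN, LOSS }   // from the viewpoint of the player to move

    // succ.get(p) lists the positions reachable from p in one move.
    static Value[] solve(List<List<Integer>> succ) {
        int n = succ.size();
        Value[] value = new Value[n];
        int[] unresolved = new int[n];            // successors not yet known to be WIN
        List<List<Integer>> pred = new ArrayList<>();
        for (int p = 0; p < n; p++) pred.add(new ArrayList<>());

        Deque<Integer> queue = new ArrayDeque<>();
        for (int p = 0; p < n; p++) {
            value[p] = Value.UNKNOWN;
            unresolved[p] = succ.get(p).size();
            for (int s : succ.get(p)) pred.get(s).add(p);
            if (succ.get(p).isEmpty()) {          // final position: player to move has lost
                value[p] = Value.LOSS;
                queue.add(p);
            }
        }

        // Propagate known values upwards via "unmoves" (predecessor edges).
        while (!queue.isEmpty()) {
            int p = queue.poll();
            for (int q : pred.get(p)) {
                if (value[q] != Value.UNKNOWN) continue;
                if (value[p] == Value.LOSS) {
                    value[q] = Value.WIN;          // one move leads to a lost position
                    queue.add(q);
                } else if (--unresolved[q] == 0) {
                    value[q] = Value.LOSS;         // every move leads to a won position
                    queue.add(q);
                }
            }
        }
        return value;
    }
}

A position becomes a WIN as soon as one successor is known to be a LOSS, and a LOSS once all of its successors are known to be WINs; each resolved position is enqueued exactly once, so the work is linear in the number of edges.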
Awari on the DAS-3 grid
● Implementation on a single big cluster
  ● 144 cores (1 cluster of 72 nodes, 2 cores/node)
  ● Myrinet (MPI)
● Naïve implementation on 3 small clusters
  ● 144 cores (3 clusters of 24 nodes, 2 cores/node)
  ● Myrinet + 10G light paths (OpenMPI)

Initial insights
● The single-cluster version has high performance, despite the high communication rate
  ● Up to 28 Gb/s cumulative network throughput
● The naïve grid version has flow-control problems
  ● Faster CPUs overwhelm slower CPUs with work
  ● Unrestricted job-queue growth
  => Add regular global synchronizations (barriers)
● The naïve grid version is 2.1x slower than 1 big cluster

Optimizations
● Scalable barrier synchronization algorithm
  ● Ring algorithm has too much latency on a grid
  ● Tree algorithm for barrier & termination detection
  ● Better flushing strategy during the termination phase
● Reduce host overhead
  ● CPU overhead for MPI message handling/polling
  ● Don't just look at latency and throughput
● Optimize grain size per network (LAN vs. WAN)
  ● Large messages (much combining) have lower host overhead but higher load imbalance

Performance
● The optimizations improved grid performance by 50%
● The grid version is only 15% slower than 1 big cluster
  ● Despite the huge amount of communication (14.8 billion messages for the 48-stone database)

DiVinE
● Developed at Masaryk University Brno (ParaDiSe Labs; Jiri Barnat, main architect)
● Distributed Verification Environment
  ● Tool for distributed state-space analysis and LTL model checking
  ● Development and fast prototyping environment
● Ported to DAS-3 by Kees Verstoep

DiVinE experiments on DAS-3
● Tried difficult problems from the BEEM database
  ● Maximal Accepting Predecessor version
● Experiments:
  ● 1 cluster, 10 Gb/s Myrinet
  ● 1 cluster, 1 Gb/s Ethernet
  ● 3 clusters, 10 Gb/s light paths
● Up to 144 cores (72 nodes) in total

Elevator-1, Elevator-1: scaling problem size, Anderson
[Charts: results for the Elevator-1 and Anderson problems on the three configurations; the Elevator-1 run times appear in the table at the end of these slides]

Preliminary insights
● Many similarities between Awari and DiVinE
  ● Similar communication patterns, data rates, random state distribution
● Largest problem solved so far:
  ● 1.6 billion states on 128 (64+32+32) nodes
  ● 350 GB memory, 40 minutes runtime, peak WAN capacity close to 10G
● We need bigger problems
  ● Industry-strength challenges

Outline
● Grid infrastructures
  ● DAS: a Computer Science grid with a 40 Gb/s optical wide-area network
● Parallel algorithms and applications on grids
  ● Search algorithms, Awari
  ● DiVinE model checker (early experiences)
● Programming environments
  ● Ibis and JavaGAT

Research at the VU

The Ibis Project
● Goal:
  ● Drastically simplify grid programming/deployment
  ● Write and go!
● Used for many large-scale grid experiments
  ● Non-trivially parallel applications running on many heterogeneous resources (co-allocation)
  ● Up to 1200 CPUs (Grid'5000 + DAS-3)

Approach (1)
● Write & go: minimal assumptions about the execution environment
  ● Virtual Machines (Java) deal with heterogeneity
● Use middleware-independent APIs
  ● Mapped automatically onto the middleware
● Different programming abstractions
  ● Low-level message passing
  ● High-level divide-and-conquer

Approach (2)
● Designed to run in a dynamic/hostile grid environment
  ● Handles fault tolerance and malleability
  ● Solves connectivity problems automatically (SmartSockets)
● Modular and flexible: Ibis components can be replaced by external ones
  ● Scheduling: Zorilla P2P system or an external broker

Global picture
[Architecture diagram]

Ibis Portability Layer (IPL)
● A "run-anywhere" communication library
  ● Sent along with the application (jar files)
● Supports malleability & fault tolerance
  ● The number of machines can change during the run
● Efficient communication
  ● Configured at startup, based on capabilities (multicast, ordering, reliability, callbacks)
  ● Can use native libraries (e.g., MPI on MX)
  ● Optimizations using bytecode rewriting

Satin: divide-and-conquer
[Diagram: fib(5) call tree partitioned over cpu 1, cpu 2 and cpu 3]
● Divide-and-conquer is inherently hierarchical
● More general than master/worker
● Cilk-like primitives (spawn/sync) in Java
● Fault tolerant (grids often crash)

Gene sequence comparison in Satin (on DAS-3)
[Charts: speedup on 1 cluster, up to 256 cores; run times on 5 clusters, up to 576 cores]
● Divide-and-conquer scales much better than master-worker

JavaGAT
● GAT: Grid Application Toolkit
  ● Makes grid applications independent of the underlying grid infrastructure
● Used by applications to access grid services
  ● File copying, resource discovery, job submission & monitoring, user authentication
● The API is currently being standardized (SAGA)
  ● SAGA is being implemented on top of JavaGAT

Grid Applications with GAT
[Diagram: a grid application calls the GAT API (File.copy(...), submitJob(...)) for remote files, monitoring, the info service and resource management; the GAT engine dispatches each call intelligently to adaptors such as GridLab, Globus, Unicore, SSH, P2P or local, e.g. gridftp or globus]
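The "intelligent dispatching" in the diagram above boils down to trying the available middleware adaptors behind a single API until one of them can carry out the request. The sketch below shows that adaptor pattern in plain Java; it is not the JavaGAT API itself, and all names (FileAdaptor, DispatchEngine, canHandle, ...) are invented for illustration.

import java.net.URI;
import java.util.List;

/**
 * Sketch of the adaptor-dispatch idea behind the GAT engine: the application
 * programs against one interface, and the engine tries the available
 * middleware adaptors until one succeeds. NOT the JavaGAT API; all names
 * are invented for illustration.
 */
interface FileAdaptor {
    boolean canHandle(URI src, URI dest);
    void copy(URI src, URI dest) throws Exception;
}

class DispatchEngine {
    private final List<FileAdaptor> adaptors;   // e.g. a gridftp-, ssh- and local-adaptor

    DispatchEngine(List<FileAdaptor> adaptors) { this.adaptors = adaptors; }

    // "Intelligent dispatching": try each adaptor that claims to handle the
    // request; the first one that succeeds wins, later ones act as fallback.
    void copy(URI src, URI dest) throws Exception {
        Exception last = null;
        for (FileAdaptor a : adaptors) {
            if (!a.canHandle(src, dest)) continue;
            try {
                a.copy(src, dest);
                return;
            } catch (Exception e) {
                last = e;                        // remember and try the next adaptor
            }
        }
        throw new Exception("no adaptor could copy " + src + " to " + dest, last);
    }
}

An engine like this can order the adaptors by user preferences or by the URI scheme; the fallback structure is what keeps the application independent of any particular middleware.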
Global picture
[Architecture diagram]

Ibis applications
● Multimedia content analysis
● SAT solver
● Brain imaging
● Mass spectroscopy
● N-body simulations
● ...
● Other programming systems
  ● Workflow engine for astronomy (D-grid), grid file system, ProActive, Jylab, ...

Multimedia content analysis
● Analyzes video streams to recognize objects
● Extracts feature vectors from images
  ● Describe properties (color, shape)
  ● Data-parallel task implemented with C++/MPI
● Computes on consecutive images
  ● Task parallelism on a grid
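The task-parallel part (one server working on each consecutive frame) can be sketched as a simple asynchronous farm. In the real system the client is written in Java on Ibis and the servers are data-parallel Horus programs in C++/MPI (see the next slide); here a thread pool merely stands in for the set of servers, and FrameFarmSketch, FeatureVector and extractFeatures are illustrative names.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Sketch of the task-parallel structure of an MMCA-style client:
 * consecutive video frames are handed out asynchronously, each pool slot
 * standing in for one data-parallel server on the grid. Illustrative only.
 */
class FrameFarmSketch {
    static class FeatureVector { /* color/shape descriptors would go here */ }

    // Stand-in for the data-parallel feature extraction done by one server.
    static FeatureVector extractFeatures(byte[] frame) { return new FeatureVector(); }

    static List<FeatureVector> process(List<byte[]> frames, int numServers) throws Exception {
        ExecutorService servers = Executors.newFixedThreadPool(numServers);
        List<Future<FeatureVector>> pending = new ArrayList<>();
        for (byte[] frame : frames) {
            // submit() returns immediately, so the client keeps handing out
            // consecutive frames while earlier frames are still being analyzed.
            Callable<FeatureVector> task = () -> extractFeatures(frame);
            pending.add(servers.submit(task));
        }
        List<FeatureVector> results = new ArrayList<>();
        for (Future<FeatureVector> f : pending) results.add(f.get());
        servers.shutdown();
        return results;
    }
}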
MMCA application
[Diagram: a Java client on a local desktop machine, a Java broker (on any machine world-wide), and parallel Horus servers (C++) on the grid, connected through Ibis]

MMCA with Ibis
● The initial implementation with TCP was unstable
● Ibis simplifies communication and fault tolerance
● SmartSockets solves the connectivity problems
● Clickable deployment interface
● Demonstrated at many conferences (SC'07)
  ● 20 clusters on 3 continents, 500-800 cores
  ● Frame rate increased from 1/30 to 15 frames/sec
● MMCA received the 'Most Visionary Research' award at AAAI 2007 (Frank Seinstra et al.)

Summary
● Goal:
  ● Parallel applications on large-scale grids
● Need efficient distributed algorithms for grids
  ● Latency-tolerant (asynchronous)
  ● Available bandwidth is increasing
● Early experiments with DiVinE are promising
● High-level grid programming environments
  ● Heterogeneity, connectivity, fault tolerance, ...

Acknowledgements
Current and past members: Rob van Nieuwpoort, Jason Maassen, Thilo Kielmann, Frank Seinstra, Niels Drost, Ceriel Jacobs, Kees Verstoep, John Romein, Gosia Wrzesinska, Rutger Hofman, Maik Nijhuis, Olivier Aumage, Fabrice Huet, Alexandre Denis, Roelof Kemp, Kees van Reeuwijk

More information
● Ibis can be downloaded from
  ● http://www.cs.vu.nl/ibis
● Ibis tutorials
  ● Next one at CCGrid 2008 (19 May, Lyon)

EXTRA

StarPlane
● Applications can dynamically allocate light paths
● Applications can change the topology of the WAN
● Joint work with Cees de Laat (Univ. of Amsterdam)
[Diagram: DAS-3 sites and the SURFnet6 NOC]

DAS-3 and Grid'5000
[Map: DAS-3 and Grid'5000 sites]
● 10 Gb/s light path Paris - Amsterdam
● First step towards a European Computer Science grid?
● See Cappello's keynote @ CCGrid 2007

Configuration
                    LU            TUD           UvA-VLe       UvA-MN        VU            TOTALS
Head node
  storage           10TB          5TB           2TB           2TB           10TB          29TB
  CPU               2x2.4GHz DC   2x2.4GHz DC   2x2.2GHz DC   2x2.2GHz DC   2x2.4GHz DC
  memory            16GB          16GB          8GB           16GB          8GB           64GB
  Myri 10G          1             -             1             1             1
  10GE              1             1             1             1             1
Compute nodes       32            68            40 (1)        46            85            271
  storage           400GB         250GB         250GB         2x250GB       250GB         84TB
  CPU               2x2.6GHz      2x2.4GHz      2x2.2GHz DC   2x2.4GHz      2x2.4GHz DC   1.9THz
  memory            4GB           4GB           4GB           4GB           4GB           1048GB
  Myri 10G          1             -             1             1             1
Myrinet
  10G ports         33 (7)        -             41            47            86 (2)
  10GE ports        8             -             8             8             8             320 Gb/s
Nortel
  1GE ports         32 (16)       136 (8)       40 (8)        46 (2)        85 (11)
  10GE ports        1 (1)         9 (3)         2             2             1 (1)         339 Gb/s

DAS-3 networks
[Diagram (VU site): 85 compute nodes, connected by 1 Gb/s Ethernet (85x) to a Nortel 5530 + 3x5510 Ethernet switch with a 1 or 10 Gb/s campus uplink, and by 10 Gb/s Myrinet (85x) to a Myri-10G switch with a 10 Gb/s Ethernet blade; head node with 10 TB mass storage; 8x 10 Gb/s Ethernet fiber into a Nortel OME 6500 with DWDM blade, 80 Gb/s DWDM to SURFnet6]

Current SURFnet6 network
[Map: SURFnet6 topology over Dutch cities, divided into five subnetworks: 1 green, 2 dark blue, 3 red, 4 blue/azur, 5 grey]
● 4 DAS-3 sites happen to be on subnetwork 1 (olive green)
● Only the VU needs an extra connection
Source: Erik-Jan Bos (Siren'06 talk)

Adding 10G waves for DAS-3
[Diagram: DWDM bands and Wavelength Selective Switches on the SURFnet6 ring connecting Amsterdam, Leiden, Den Haag, Delft, Utrecht and Hilversum]
● Band 5 (up to 8 channels) added at all participating nodes
● Wavelength Selective Switches (WSS) added to support reconfigurability
● "Spur" (hardwired connection) to connect the VU
● Full photonic mesh now possible between the DAS-3 sites
Source: Erik-Jan Bos (Siren'06 talk)

Sequential Fibonacci
public long fib(int n) {
    if (n < 2) return n;
    long x = fib(n - 1);
    long y = fib(n - 2);
    return x + y;
}

Parallel Fibonacci
// Mark methods as Spawnable: they can run in parallel.
interface FibInterface extends ibis.satin.Spawnable {
    public long fib(int n);
}

public long fib(int n) {
    if (n < 2) return n;
    long x = fib(n - 1);
    long y = fib(n - 2);
    sync();   // wait until the spawned methods are done
    return x + y;
}
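The same spawn/sync pattern carries over to more realistic divide-and-conquer programs, such as the gene sequence comparison mentioned earlier. The sketch below applies it to a recursive array sum in the style of the Fibonacci slide; it is only an illustration (SumInterface, Sum and THRESHOLD are invented names), assuming the usual Satin convention that the class extends ibis.satin.SatinObject, which provides sync().

// Illustrative Satin-style divide-and-conquer sum over an array segment,
// written in the same style as the Fibonacci example above. Class and
// interface names are invented for this sketch.
interface SumInterface extends ibis.satin.Spawnable {
    public long sum(long[] a, int from, int to);
}

class Sum extends ibis.satin.SatinObject implements SumInterface {
    static final int THRESHOLD = 1024;   // below this, don't bother spawning

    public long sum(long[] a, int from, int to) {
        if (to - from <= THRESHOLD) {    // small leaf: compute sequentially
            long s = 0;
            for (int i = from; i < to; i++) s += a[i];
            return s;
        }
        int mid = (from + to) / 2;
        long left = sum(a, from, mid);   // spawned: may run on another machine
        long right = sum(a, mid, to);    // spawned
        sync();                          // wait for the spawned halves
        return left + right;
    }
}

The recursive calls are spawned because sum is declared in a Spawnable interface, and their results may only be used after sync(); the program keeps its sequential semantics, so without Satin's bytecode rewriting the spawns are ordinary method calls.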
Elevator-1 times
Nodes   Grid version            Ethernet version   Myrinet version
  9     424.324                 596.386            401.882
 18     223.127                 284.167            215.056
 36     125.208                 179.068            111.812
 72      89.3937 (4.76x, 60%)   106.425             71.7411 (5.64x, 70%)

Problem            DVE-1   SPIN-1   DVE-64.2
lamport.6           55.9      1.5      2.7
lamport.7          310.9      9.7      6.2
lamport.8          560.3     16.3      9.8
msmie.4             85.2     16.2      2.0
fischer.6          110.0     23.9      2.4
hanoi.3            139.0     42.6     24.5
telephony.4        148.3     34.8      3.3
telephony.7        286.7     67.8      5.9
leader_filters.7   175.4     46.5      4.5
sorter.4           184.1     29.4      7.6
frogs.4            196.4     30.4      3.2
phils.6            264.8    117.0      5.4
iprotocol.7        514.9    159.0     13.3
szymanski.5        904.9    155.0     14.7