Large-Scale Parallel Computing
on Grids
Henri Bal
bal@cs.vu.nl
Vrije Universiteit Amsterdam
7th Int. Workshop on Parallel and Distributed
Methods in VerifiCation (PDMC)
29 March 2008
The ‘Promise of the Grid’
Efficient and transparent (i.e. easy-to-use)
wall-socket computing over a distributed set
of resources [Sunderam ICCS’2004, based on Foster/Kesselman]
Parallel computing on grids
● Mostly limited to:
  ● trivially parallel applications
    ● parameter sweeps, SETI@home, RSA-155
  ● applications that run on one cluster at a time
    ● use grid to schedule application on a suitable cluster
● Our goal: run real parallel applications on a
  large-scale grid, using co-allocated resources
Reality: ‘Problems of the Grid’
[Diagram: the user facing wide-area grid systems and their obstacles]
● Performance & scalability
● Heterogeneous
● Low-level & changing programming interfaces
● Connectivity issues
● Fault tolerance
● Malleability
=> writing & deploying grid applications is hard
It’s a jungle out there
“It's a jungle out there
Disorder and confusion everywhere
No one seems to care
Well I do
Hey, who's in charge here?
It's a jungle out there …”
(Randy Newman)
Some solutions
● Performance & scalability
  ● Fast optical private networks (OPNs)
  ● Algorithmic optimizations
● Heterogeneity
  ● Use Virtual Machine technology (e.g. Java VM)
● Grid programming systems dealing with
  connectivity, fault tolerance, …
  ● Ibis programming system
  ● GAT (Grid Application Toolkit)
Outline
● Grid infrastructures
  ● DAS: a Computer Science grid with a 40 Gb/s
    optical wide-area network
● Parallel algorithms and applications on grids
  ● Search algorithms, Awari
  ● DiVinE model checker (early experiences)
● Programming environments
  ● Ibis and JavaGAT
Distributed ASCI Supercomputer (DAS)
[Map: DAS sites — VU, U. Amsterdam, Leiden, Delft — connected by SURFnet]
● DAS-1 (1997)
● DAS-2 (2002)
● DAS-3 (2006)
DAS-3
● 272 nodes (AMD Opterons)
● 792 cores
● 1 TB memory
● LAN: Myrinet 10G, Gigabit Ethernet
● WAN: 40 Gb/s OPN
● Heterogeneous:
  ● 2.2-2.6 GHz
  ● Single/dual-core
  ● Delft: no Myrinet
Wide-area algorithms
● Key problem is fighting high latency
  ● LAN: 1-100 usec, WAN: 1-100 msec
● Experience from the Albatross project: traditional
  optimizations can be very effective for grids
  ● Use asynchronous communication
    ● Allows latency hiding & message combining
  ● Exploit the hierarchical structure of grids
    ● E.g. load balancing, group communication (see the sketch below)
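To make the last point concrete, here is a minimal Java sketch of cluster-aware load balancing in the spirit of hierarchical work stealing (as used in systems like Satin): an idle worker first tries to steal from peers in its own cluster and only rarely crosses the wide-area link. All class and method names are illustrative assumptions, not the actual Ibis/Satin code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of cluster-aware random work stealing:
// prefer cheap local (LAN) steals, fall back to expensive wide-area steals.
class ClusterAwareStealer {
    private final Deque<Runnable> localQueue = new ArrayDeque<>();
    private final List<ClusterAwareStealer> sameCluster;   // LAN peers
    private final List<ClusterAwareStealer> otherClusters; // WAN peers
    private final Random rnd = new Random();

    ClusterAwareStealer(List<ClusterAwareStealer> sameCluster,
                        List<ClusterAwareStealer> otherClusters) {
        this.sameCluster = sameCluster;
        this.otherClusters = otherClusters;
    }

    synchronized void spawn(Runnable job) { localQueue.push(job); }

    synchronized Runnable tryGiveWork() {
        return localQueue.isEmpty() ? null : localQueue.pollLast();
    }

    /** Called when the local queue runs empty. */
    Runnable steal() {
        // First try a few random peers inside our own cluster (low latency).
        for (int i = 0; i < 3 && !sameCluster.isEmpty(); i++) {
            Runnable job = sameCluster.get(rnd.nextInt(sameCluster.size())).tryGiveWork();
            if (job != null) return job;
        }
        // Only then try one random peer in a remote cluster (high latency).
        if (!otherClusters.isEmpty()) {
            return otherClusters.get(rnd.nextInt(otherClusters.size())).tryGiveWork();
        }
        return null;
    }
}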
Example 1: heuristic search
● Parallel search: partition the search space
● Many applications have transpositions
  ● Reach the same position through different moves
● Transposition table (see the sketch below):
  ● Hash table storing positions that have been evaluated before
  ● Shared among all processors
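For reference, a transposition table is essentially a fixed-size hash table indexed by a hash of the position. Below is a minimal single-process sketch in Java; the entry layout and "replace always" policy are illustrative assumptions, not the code from the talk.

// Minimal transposition-table sketch: a fixed-size, direct-mapped hash table
// keyed by a 64-bit position hash (e.g. a Zobrist hash). Illustrative only.
class TranspositionTable {
    static final class Entry {
        long key;       // full hash, to detect different positions mapping to the same slot
        int score;      // previously computed evaluation
        int depth;      // search depth at which the score was computed
        boolean valid;
    }

    private final Entry[] slots;

    /** size must be a power of two so the mask below works. */
    TranspositionTable(int size) {
        slots = new Entry[size];
        for (int i = 0; i < slots.length; i++) slots[i] = new Entry();
    }

    /** Returns the stored entry for this position, or null if unknown. */
    Entry lookup(long positionHash) {
        Entry e = slots[(int) (positionHash & (slots.length - 1))];
        return (e.valid && e.key == positionHash) ? e : null;
    }

    /** Stores a result, overwriting whatever occupied the slot ("replace always"). */
    void store(long positionHash, int score, int depth) {
        Entry e = slots[(int) (positionHash & (slots.length - 1))];
        e.key = positionHash; e.score = score; e.depth = depth; e.valid = true;
    }
}

In the distributed versions on the next slides, the same hash additionally determines which processor owns an entry.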
Distributed transposition tables
● Partitioned tables
  ● 10,000s of synchronous messages per second
  ● Poor performance even on a cluster
● Replicated tables
  ● Broadcasting doesn’t scale (many updates)
Transposition Driven Scheduling (see the sketch below)
● Send the job asynchronously to the owner of the table entry
  ● Can be overlapped with computation
● Random (= good) load balancing
● Messages are delayed & combined into fewer large messages
  => Bulk transfers are far more efficient on most networks
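Below is a minimal Java sketch of the transposition-driven idea: the position hash determines which processor owns the entry, and instead of asking the owner synchronously, the whole search job is appended to a per-owner buffer that is shipped in bulk. Names, buffer sizes and the send() placeholder are illustrative assumptions, not the actual implementation.

import java.util.ArrayList;
import java.util.List;

// Sketch of Transposition Driven Scheduling: work migrates asynchronously to
// the processor that owns the transposition-table entry for the position.
class TdsScheduler {
    static final int FLUSH_THRESHOLD = 1024;  // jobs per combined message (illustrative)

    private final int numProcs;
    private final int myRank;
    private final List<List<long[]>> outgoing = new ArrayList<>(); // one buffer per destination

    TdsScheduler(int numProcs, int myRank) {
        this.numProcs = numProcs;
        this.myRank = myRank;
        for (int p = 0; p < numProcs; p++) outgoing.add(new ArrayList<>());
    }

    int ownerOf(long positionHash) {
        return (int) Math.floorMod(positionHash, (long) numProcs); // random-looking => balanced
    }

    /** Enqueue a search job for the owner of its position; no reply is awaited. */
    void submit(long positionHash, long encodedJob) {
        int owner = ownerOf(positionHash);
        if (owner == myRank) {
            processLocally(encodedJob);
            return;
        }
        List<long[]> buf = outgoing.get(owner);
        buf.add(new long[] { positionHash, encodedJob });
        if (buf.size() >= FLUSH_THRESHOLD) flush(owner);  // bulk transfer
    }

    void flush(int owner) {
        List<long[]> buf = outgoing.get(owner);
        send(owner, buf.toArray(new long[0][]));          // placeholder for the real network send
        buf.clear();
    }

    void processLocally(long encodedJob) { /* expand node, look up local table entry, ... */ }
    void send(int dest, long[][] combinedJobs) { /* asynchronous network send goes here */ }
}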
Speedups for Rubik’s cube
[Chart: speedups (y-axis up to 70) on a single Myrinet cluster vs. TDS on the
wide-area DAS-2, for a 16-node cluster, 4x 16-node clusters, and a 64-node
cluster]
● Latency-insensitive algorithm works well even on a grid
  ● Despite huge amount of communication
Example 2: Solving awari
● Solved by John Romein [IEEE Computer, Oct. 2003]
  ● Computed on a Myrinet cluster (DAS-2)
  ● Recently used wide-area DAS-3 [CCGrid, May 2008]
● Determined the score for 889,063,398,406 positions
● The game is a draw
● Andy Tanenbaum:
  “You just ruined a perfectly fine 3500-year-old game”
Rules of awari
[Board diagram: two rows of six pits, North and South]
● Start with 4*12 = 48 stones
● Sow counterclockwise
● Capture if the last stone lands in an enemy pit that then contains 2-3 stones
● Goal: capture the majority of the stones
(We don’t use) MiniMax
[Figure: example MiniMax game tree with alternating Max and Min levels and
leaf values]
● Depth-first search (e.g., alpha-beta), sketched below
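For contrast with the retrograde analysis that follows, here is a minimal, generic depth-first minimax search with alpha-beta pruning in Java. The GameState interface is a hypothetical placeholder for a concrete game; as the slide title says, awari itself is not solved this way.

import java.util.List;

// Generic depth-first minimax with alpha-beta pruning (illustrative sketch).
interface GameState {
    boolean isTerminal();
    int evaluate();                // score from the maximizing player's point of view
    List<GameState> successors();  // positions reachable in one move
}

final class AlphaBeta {
    static int search(GameState s, int depth, int alpha, int beta, boolean maximizing) {
        if (depth == 0 || s.isTerminal()) return s.evaluate();
        if (maximizing) {
            int best = Integer.MIN_VALUE;
            for (GameState child : s.successors()) {
                best = Math.max(best, search(child, depth - 1, alpha, beta, false));
                alpha = Math.max(alpha, best);
                if (alpha >= beta) break;   // beta cut-off: Min will never allow this line
            }
            return best;
        } else {
            int best = Integer.MAX_VALUE;
            for (GameState child : s.successors()) {
                best = Math.min(best, search(child, depth - 1, alpha, beta, true));
                beta = Math.min(beta, best);
                if (alpha >= beta) break;   // alpha cut-off
            }
            return best;
        }
    }
}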
Retrograde analysis
[Figure: the same game tree, now with the final states at the bottom known]
● Start from final positions with known outcome
  ● Like mate, stalemate in chess
● Propagate information upwards (“unmoves”), as sketched below
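A minimal sequential sketch of retrograde analysis in Java: seed the table with the known final positions and repeatedly propagate values to predecessors via "unmoves". The Game interface and the simple win/loss propagation rule are illustrative assumptions; the real awari solver tracks exact scores and runs distributed.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sequential sketch of retrograde analysis for a two-player game.
// Values are from the point of view of the player to move: +1 win, -1 loss, 0 unknown/draw.
// The sketch assumes the position space fits in a single Java array.
interface Game {
    long numPositions();
    boolean isFinal(long pos);
    int finalValue(long pos);            // value of a final position for the player to move
    List<Long> predecessors(long pos);   // positions from which 'pos' is reachable ("unmoves")
    List<Long> successors(long pos);
}

final class Retrograde {
    static byte[] solve(Game g) {
        byte[] value = new byte[(int) g.numPositions()];
        Deque<Long> queue = new ArrayDeque<>();

        // 1. Seed with final positions (like mate/stalemate in chess).
        for (long p = 0; p < g.numPositions(); p++) {
            if (g.isFinal(p)) { value[(int) p] = (byte) g.finalValue(p); queue.add(p); }
        }
        // 2. Propagate upwards via unmoves until nothing changes.
        while (!queue.isEmpty()) {
            long pos = queue.poll();
            for (long pred : g.predecessors(pos)) {
                if (value[(int) pred] != 0) continue;          // already decided
                if (value[(int) pos] < 0) {
                    value[(int) pred] = 1;                     // moving into an opponent loss is a win
                    queue.add(pred);
                } else if (allSuccessorsWinForOpponent(g, value, pred)) {
                    value[(int) pred] = -1;                    // every move hands the opponent a win
                    queue.add(pred);
                }
            }
        }
        return value;
    }

    private static boolean allSuccessorsWinForOpponent(Game g, byte[] value, long pred) {
        for (long s : g.successors(pred)) if (value[(int) s] <= 0) return false;
        return true;
    }
}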
Retrograde analysis (continued)
[Two more animation steps of the same figure: values propagate from the final
states up towards the initial state]
Parallel retrograde analysis
● Partition the database, like the transposition tables
  ● Random distribution (hashing), good load balance
● Repeatedly send results to parent nodes
  ● Asynchronous, combined into bulk transfers
● Extremely communication intensive:
  ● 1 Pbit of data in 51 hours
  ● What happens on a grid?
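For a rough sense of scale (my arithmetic, not from the slides): 1 Pbit spread over 51 hours is about 10^15 bits / (51 x 3600 s), i.e. on the order of 5-6 Gbit/s of sustained network traffic averaged over the entire run, before even considering peaks.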
Awari on DAS-3 grid
● Implementation on a single big cluster
  ● 144 cores (1 cluster of 72 nodes, 2 cores/node)
  ● Myrinet (MPI)
● Naïve implementation on 3 small clusters
  ● 144 cores (3 clusters of 24 nodes, 2 cores/node)
  ● Myrinet + 10G light paths (OpenMPI)
Initial insights
● Single-cluster version has high performance,
  despite the high communication rate
  ● Up to 28 Gb/s cumulative network throughput
● Naïve grid version has flow-control problems
  ● Faster CPUs overwhelm slower CPUs with work
  ● Unrestricted job-queue growth
  => Add regular global synchronizations (barriers), as sketched below
● Naïve grid version is 2.1x slower than 1 big cluster
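The flow-control fix can be pictured with a small shared-memory analogue (illustrative only; the real code coordinates MPI processes over the WAN, not threads): each worker processes at most a bounded batch of jobs per phase and then waits at a global barrier, so faster workers cannot run arbitrarily far ahead and the job queues stay bounded.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Thread-based analogue of phased execution with global barriers.
public class PhasedWorkers {
    public static void main(String[] args) throws InterruptedException {
        final int WORKERS = 4, BATCH_PER_PHASE = 1000, PHASES = 20;  // illustrative sizes

        List<ConcurrentLinkedQueue<Integer>> queues = new ArrayList<>();
        for (int i = 0; i < WORKERS; i++) {
            ConcurrentLinkedQueue<Integer> q = new ConcurrentLinkedQueue<>();
            for (int j = 0; j < 10_000; j++) q.add(j);   // some initial work
            queues.add(q);
        }
        CyclicBarrier barrier = new CyclicBarrier(WORKERS);

        Thread[] threads = new Thread[WORKERS];
        for (int w = 0; w < WORKERS; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                try {
                    for (int phase = 0; phase < PHASES; phase++) {
                        for (int n = 0; n < BATCH_PER_PHASE; n++) {
                            Integer job = queues.get(id).poll();
                            if (job == null) break;
                            // process(job): may generate new jobs for other workers' queues
                        }
                        barrier.await();   // regular global synchronization
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("all phases completed");
    }
}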
Optimizations
● Scalable barrier synchronization algorithm
  ● Ring algorithm has too much latency on a grid
  ● Tree algorithm for barrier & termination detection
    (a shared-memory analogue is sketched below)
  ● Better flushing strategy while in the termination phase
● Reduce host overhead
  ● CPU overhead for MPI message handling/polling
  ● Don’t just look at latency and throughput
● Optimize grain size per network (LAN vs. WAN)
  ● Large messages (much combining) have lower host
    overhead but higher load imbalance
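As a shared-memory stand-in for the tree-shaped barrier (illustrative only; the real implementation synchronizes processes across clusters), Java's tiered Phasers arrange barrier participants in a tree, so arrivals combine inside a group before a single notification reaches the root, much like combining barrier traffic within a cluster before crossing the WAN.

import java.util.concurrent.Phaser;

// Tree-shaped barrier via tiered Phasers: one child phaser per "cluster",
// all attached to a single root phaser ("wide-area" level).
public class TreeBarrierDemo {
    public static void main(String[] args) throws InterruptedException {
        final int CLUSTERS = 3, NODES_PER_CLUSTER = 4, PHASES = 5;

        Phaser root = new Phaser();                                 // root of the tree
        Thread[] threads = new Thread[CLUSTERS * NODES_PER_CLUSTER];
        int t = 0;
        for (int c = 0; c < CLUSTERS; c++) {
            Phaser cluster = new Phaser(root, NODES_PER_CLUSTER);   // per-cluster level
            for (int n = 0; n < NODES_PER_CLUSTER; n++) {
                threads[t++] = new Thread(() -> {
                    for (int phase = 0; phase < PHASES; phase++) {
                        // ... do this phase's share of the work ...
                        cluster.arriveAndAwaitAdvance();  // arrivals combine up the tree
                    }
                    cluster.arriveAndDeregister();        // termination propagates upwards
                });
            }
        }
        for (Thread th : threads) th.start();
        for (Thread th : threads) th.join();
        System.out.println("all " + threads.length + " workers finished " + PHASES + " phases");
    }
}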
Performance
● Optimizations improved grid performance by 50%
● Grid version only 15% slower than 1 big cluster
  ● Despite huge amount of communication
    (14.8 billion messages for the 48-stone database)
DiVinE
● Developed at Masaryk University Brno
  (ParaDiSe Labs, Jiri Barnat – main architect)
● Distributed Verification Environment
  ● Tool for distributed state-space analysis and
    LTL model checking
  ● Development and fast-prototyping environment
● Ported to DAS-3 by Kees Verstoep
DiVinE experiments on DAS-3
● Tried difficult problems from the BEEM database
  ● Maximal Accepting Predecessor version
● Experiments:
  ● 1 cluster, 10 Gb/s Myrinet
  ● 1 cluster, 1 Gb/s Ethernet
  ● 3 clusters, 10 Gb/s light paths
● Up to 144 cores (72 nodes) in total
[Charts: run times for the Elevator-1 model, Elevator-1 with scaled problem
size, and the Anderson model on the three DAS-3 configurations]
Preliminary insights
● Many similarities between Awari and DiVinE
  ● Similar communication patterns, data rates,
    random state distribution
● Largest problem solved so far:
  ● 1.6 billion states on 128 (64+32+32) nodes
  ● 350 GB memory, 40 minutes runtime, peak WAN
    capacity close to 10G
● We need bigger problems
  ● Industry-strength challenges
Outline
● Grid infrastructures
  ● DAS: a Computer Science grid with a 40 Gb/s
    optical wide-area network
● Parallel algorithms and applications on grids
  ● Search algorithms, Awari
  ● DiVinE model checker (early experiences)
● Programming environments
  ● Ibis and JavaGAT
Research at the VU
The Ibis Project
● Goal:
  ● drastically simplify grid programming/deployment
  ● write and go!
● Used for many large-scale grid experiments
  ● Non-trivially parallel applications running on
    many heterogeneous resources (co-allocation)
  ● Up to 1200 CPUs (Grid’5000 + DAS-3)
Approach (1)
● Write & go: minimal assumptions about the
  execution environment
  ● Virtual Machines (Java) deal with heterogeneity
● Use middleware-independent APIs
  ● Mapped automatically onto middleware
● Different programming abstractions
  ● Low-level message passing
  ● High-level divide-and-conquer
Approach (2)
● Designed to run in a dynamic/hostile grid
  environment
  ● Handle fault tolerance and malleability
  ● Solve connectivity problems automatically
    (SmartSockets)
● Modular and flexible: Ibis components can be
  replaced by external ones
  ● Scheduling: Zorilla P2P system or an external broker
Global picture
[System overview diagram]
Ibis Portability Layer (IPL)
● A “run-anywhere” communication library
  ● Sent along with the application (jar files)
● Supports malleability & fault tolerance
  ● Change the number of machines during the run
● Efficient communication
  ● Configured at startup, based on capabilities
    (multicast, ordering, reliability, callbacks)
  ● Can use native libraries (e.g., MPI on MX)
  ● Optimizations using bytecode rewriting
Satin: divide-and-conquer
[Figure: fib(5) spawn tree, with subtrees executed on different CPUs]
● Divide-and-conquer is inherently hierarchical
● More general than master/worker
● Cilk-like primitives (spawn/sync) in Java
  (see the Fibonacci example in the EXTRA slides)
● Fault tolerant (grids often crash)
Gene sequence comparison in Satin (on DAS-3)
[Charts: speedup on 1 cluster, up to 256 cores; run times on 5 clusters,
up to 576 cores]
● Divide&conquer scales much better than master-worker
JavaGAT
● GAT: Grid Application Toolkit
  ● Used by applications to access grid services
  ● File copying, resource discovery, job submission
    & monitoring, user authentication
  ● Makes grid applications independent of the
    underlying grid infrastructure
● API is currently being standardized (SAGA)
  ● SAGA is being implemented on JavaGAT
  (a rough usage sketch follows)
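To give a flavour of the two operations named on the next slide (File.copy, submitJob), here is a rough JavaGAT-style sketch written from memory of the JavaGAT examples; class names, package layout and exact signatures may differ between JavaGAT versions, and the hosts and paths are placeholders.

import org.gridlab.gat.GAT;
import org.gridlab.gat.GATContext;
import org.gridlab.gat.URI;
import org.gridlab.gat.io.File;
import org.gridlab.gat.resources.Job;
import org.gridlab.gat.resources.JobDescription;
import org.gridlab.gat.resources.ResourceBroker;
import org.gridlab.gat.resources.SoftwareDescription;

// Rough JavaGAT-style sketch (signatures from memory, may not match your version).
// The "any://" scheme lets the GAT engine pick a suitable adaptor (GridFTP, SSH, ...).
public class GatSketch {
    public static void main(String[] args) throws Exception {
        GATContext context = new GATContext();

        // Remote file copy, independent of the underlying middleware.
        File file = GAT.createFile(context, new URI("any://host1//home/user/input.dat"));
        file.copy(new URI("any://host2//home/user/input.dat"));

        // Job submission through a resource broker.
        SoftwareDescription sd = new SoftwareDescription();
        sd.setExecutable("/bin/hostname");
        JobDescription jd = new JobDescription(sd);
        ResourceBroker broker = GAT.createResourceBroker(context, new URI("any://host2"));
        Job job = broker.submitJob(jd);
        System.out.println("job state: " + job.getState());

        GAT.end();
    }
}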
Grid Applications with GAT
[Architecture diagram: a grid application calls the GAT API (File.copy(...),
submitJob(...)) for remote files, monitoring, info services and resource
management; the GAT engine dispatches intelligently to adaptors such as
GridLab, Globus (gridftp), Unicore, SSH, P2P, or local]
Global picture
[System overview diagram, revisited]
Ibis applications
● Multimedia content analysis
● SAT solver
● Brain imaging
● Mass spectroscopy
● N-body simulations
● ……
● Other programming systems
  ● Workflow engine for astronomy (D-grid), grid file
    system, ProActive, Jylab, …
Multimedia content analysis
● Analyzes video streams to recognize objects
● Extract feature vectors from images
  ● Describe properties (color, shape)
  ● Data-parallel task implemented with C++/MPI
● Compute on consecutive images
  ● Task parallelism on a grid (see the sketch below)
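A minimal shared-memory analogue of the task parallelism over consecutive images (illustrative only; in the real system each frame is farmed out to a data-parallel C++/MPI server on some cluster): every frame becomes an independent task handed to a pool of workers.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative task-parallel structure: consecutive video frames are independent
// tasks; each task would internally run the data-parallel feature extraction.
public class FrameFarm {
    static double[] extractFeatureVector(int frameNumber) {
        // Placeholder for the data-parallel C++/MPI feature extraction.
        return new double[] { frameNumber * 0.1, frameNumber * 0.2 };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);  // stand-in for grid servers
        List<Future<double[]>> results = new ArrayList<>();

        for (int frame = 0; frame < 100; frame++) {
            final int f = frame;
            Callable<double[]> task = () -> extractFeatureVector(f);
            results.add(pool.submit(task));
        }
        for (Future<double[]> r : results) {
            double[] features = r.get();   // collect results in frame order
        }
        pool.shutdown();
        System.out.println("processed " + results.size() + " frames");
    }
}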
MMCA application
[Architecture diagram: a Java client on a local desktop machine connects,
via a Java broker running on any machine world-wide, to parallel Horus
servers (C++) on the grid, using Ibis (Java) for communication]
MMCA with Ibis
● Initial implementation with TCP was unstable
● Ibis simplifies communication, fault tolerance
● SmartSockets solves connectivity problems
● Clickable deployment interface
● Demonstrated at many conferences (SC’07)
  ● 20 clusters on 3 continents, 500-800 cores
  ● Frame rate increased from 1/30 to 15 frames/sec
MMCA
● ‘Most Visionary Research’ award at AAAI 2007 (Frank Seinstra et al.)
Summary
● Goal: parallel applications on large-scale grids
● Need efficient distributed algorithms for grids
  ● Latency-tolerant (asynchronous)
  ● Available bandwidth is increasing
  ● Early experiments with DiVinE are promising
● High-level grid programming environments
  ● Heterogeneity, connectivity, fault tolerance, …
Acknowledgements
Current and past members:
Rob van Nieuwpoort, Jason Maassen, Thilo Kielmann, Frank Seinstra,
Niels Drost, Ceriel Jacobs, Kees Verstoep, John Romein, Gosia Wrzesinska,
Rutger Hofman, Maik Nijhuis, Olivier Aumage, Fabrice Huet, Alexandre Denis,
Roelof Kemp, Kees van Reeuwijk
More information
● Ibis can be downloaded from
  http://www.cs.vu.nl/ibis
● Ibis tutorials
  ● Next one at CCGrid 2008 (19 May, Lyon)
EXTRA
StarPlane
[Diagram: DAS-3 sites connected by dedicated light paths through the SURFnet
NOC]
● Applications can dynamically allocate light paths
● Applications can change the topology of the WAN
● Joint work with Cees de Laat (Univ. of Amsterdam)
DAS-3 and Grid’5000
[Map: the DAS-3 and Grid’5000 sites]
● 10 Gb/sec light path Paris - Amsterdam
● First step towards a European Computer Science grid?
● See Cappello’s keynote @ CCGrid2007
Configuration

                  LU           TUD          UvA-VLe      UvA-MN       VU           TOTALS
Head
 * storage        10TB         5TB          2TB          2TB          10TB         29TB
 * CPU            2x2.4GHz DC  2x2.4GHz DC  2x2.2GHz DC  2x2.2GHz DC  2x2.4GHz DC
 * memory         16GB         16GB         8GB          16GB         8GB          64GB
 * Myri 10G       1            -            1            1            1
 * 10GE           1            1            1            1            1
Compute (nodes)   32           68           40 (1)       46           85           271
 * storage        400GB        250GB        250GB        2x250GB      250GB        84 TB
 * CPU            2x2.6GHz     2x2.4GHz     2x2.2GHz DC  2x2.4GHz     2x2.4GHz DC  1.9 THz
 * memory         4GB          4GB          4GB          4GB          4GB          1048 GB
 * Myri 10G       1            -            1            1            1
Myrinet                                                                            320 Gb/s
 * 10G ports      33 (7)       -            41           47           86 (2)
 * 10GE ports     8            -            8            8            8
Nortel                                                                             339 Gb/s
 * 1GE ports      32 (16)      136 (8)      40 (8)       46 (2)       85 (11)
 * 10GE ports     1 (1)        9 (3)        2            2            1 (1)
DAS-3 networks
[Diagram (VU shown as example): 85 compute nodes, each with 1 Gb/s Ethernet
and 10 Gb/s Myrinet; a Nortel 5530 + 3x 5510 ethernet switch with a 1 or
10 Gb/s campus uplink; a Myri-10G switch with a 10 Gb/s ethernet blade and
8x 10 Gb/s ethernet fiber uplinks into SURFnet6 (Nortel OME 6500 with DWDM
blade, 80 Gb/s DWDM); headnode with 10 TB mass storage]
Current SURFnet6 network
[Map: SURFnet6 subnetworks 1 (green), 2 (dark blue), 3 (red), 4 (blue azur)
and 5 (grey), spanning Amsterdam, Leiden, Delft, Den Haag, Utrecht and other
cities]
● 4 DAS-3 sites happen to be on Subnetwork 1 (olive green)
● Only VU needs an extra connection
Source: Erik-Jan Bos (Siren’06 talk)
Adding 10G waves for DAS-3
[Diagram: SURFnet6 photonic nodes (Amsterdam, Leiden, Den Haag, Delft,
Utrecht, Hilversum) with WSS split/select stages and the DAS-3 site
connections]
● Band 5 (up to 8 channels) added at all participating nodes
● Wavelength Selective Switches (WSS) added to support reconfigurability
● “Spur” to connect VU (“spur” = hardwired connections)
● Full photonic mesh now possible between DAS-3 sites
Source: Erik-Jan Bos (Siren’06 talk)
Sequential Fibonacci

public long fib(int n) {
    if (n < 2) return n;
    long x = fib(n - 1);
    long y = fib(n - 2);
    return x + y;
}
Parallel Fibonacci (Satin)

interface FibInterface extends ibis.satin.Spawnable {
    public long fib(int n);
}

// Enclosing class added for completeness (assumption based on the Satin
// documentation): spawnable methods live in a class that extends
// ibis.satin.SatinObject and implements the marker interface; sync() is inherited.
class Fib extends ibis.satin.SatinObject implements FibInterface {
    public long fib(int n) {
        if (n < 2) return n;
        // Methods marked as Spawnable can run in parallel.
        long x = fib(n - 1);
        long y = fib(n - 2);
        // Wait until the spawned methods are done.
        sync();
        return x + y;
    }
}
Elevator-1 times

Nodes   Grid version           Ethernet version   Myrinet version
  9     424.324                596.386            401.882
 18     223.127                284.167            215.056
 36     125.208                179.068            111.812
 72     89.3937 (4.76x, 60%)   106.425            71.7411 (5.64x, 70%)
Problem            DVE-1   SPIN-1   DVE-64.2
lamport.6           55.9     1.5      2.7
lamport.7          310.9     9.7      6.2
lamport.8          560.3    16.3      9.8
msmie.4             85.2    16.2      2.0
fischer.6          110.0    23.9      2.4
hanoi.3            139.0    42.6     24.5
telephony.4        148.3    34.8      3.3
telephony.7        286.7    67.8      5.9
leader_filters.7   175.4    46.5      4.5
sorter.4           184.1    29.4      7.6
frogs.4            196.4    30.4      3.2
phils.6            264.8   117.0      5.4
iprotocol.7        514.9   159.0     13.3
szymanski.5        904.9   155.0     14.7