Parallelizing Spacetime
Discontinuous Galerkin Methods
Jonathan Booth
University of Illinois at Urbana-Champaign
In conjunction with: L. Kale, R. Haber, S. Thite,
J. Palaniappan
This research made possible via NSF grant
DMR 01-21695
http://charm.cs.uiuc.edu
Parallel Programming Lab
• Led by Professor Laxmikant Kale
• Application-oriented
– Research is driven by real applications and their needs
• NAMD
• CSAR Rocket Simulation (Roc*)
• Spacetime Discontinuous Galerkin
• Petaflops Performance Prediction (Blue Gene)
– Focus on scalable performance for real applications
Charm++ Overview
• In development for roughly ten years
• Based on C++
• Runs on many platforms
– Desktops
– Clusters
– Supercomputers
• Built on a C layer called Converse
– Allows multiple parallel languages to work together
Charm++: Programmer View
[Figure: a system of objects distributed over processors; legend: Processor, Object/Task]
• System of objects
• Asynchronous
communication via
method invocation
• Use an object identifier to
refer to an object.
• User sees each object
execute its methods
atomically
– As if on its own processor
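A minimal sketch of this view in Charm++ terms (the chare and method names are hypothetical, and the .ci interface file that Charm++ also requires is omitted):

  // Sketch of a chare whose entry methods are invoked asynchronously.
  // A matching solver.ci interface file (not shown) declares the chare
  // and its entry methods; all names here are illustrative.
  #include "solver.decl.h"   // generated from the hypothetical solver.ci

  class Solver : public CBase_Solver {
  public:
    Solver() {}

    // Entry method: runs atomically with respect to this object's other
    // entry methods, as if the object had a processor to itself.
    void solvePatch(int patchId) {
      CkPrintf("Solving patch %d on PE %d\n", patchId, CkMyPe());
    }
  };

  // Caller side: a proxy hides the object-to-processor mapping.
  // proxy.solvePatch(42) returns immediately; the runtime delivers
  // the message wherever the object currently lives.

Invoking through the proxy is what makes the communication asynchronous: the call is a message send, not a blocking function call.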
Charm++: System View
• Set of objects invoked
by messages
• Set of processors of
the physical machine
• Runtime tracks the object-to-processor mapping
• Runtime routes messages between objects
[Figure: objects mapped onto the physical processors; legend: Processor, Object/Task]
Charm++ Benefits
• Program is not tied to a fixed number of
processors
– No problem if the program needs 128 processors and only 45 are available
– Called processor virtualization
• Load balancing accomplished
automatically
– User writes a short routine to transfer an object between processors (see the sketch below)
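In Charm++ that short routine is a pup (pack/unpack) method; the PUP framework is real, while the class and fields below are hypothetical:

  // Sketch: the user-written routine that lets the runtime move an
  // object between processors. One method serializes the object on
  // the source processor and reconstructs it on the destination.
  #include "pup_stl.h"   // PUP support for STL containers

  class Solver : public CBase_Solver {
    int patchId;
    std::vector<double> state;
  public:
    Solver() {}
    Solver(CkMigrateMessage* m) {}   // migration constructor

    void pup(PUP::er& p) {
      p | patchId;   // same code packs, unpacks, or sizes the object
      p | state;
    }
  };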
Load Balancing - Green Process Starts Heavy Computation
[Figure: objects spread across processors A, B, and C; the green object begins heavy computation]
Yellow Processes Migrate Away – System Handles Message Routing
[Figure: processors A, B, and C after the yellow objects migrate off the loaded processor; messages are rerouted transparently]
Load Balancing
• Load balancing isn’t solely dependent on CPU usage
• Balancers consider network usage as well
– Can move objects to lessen network
bandwidth usage
• Migrating an object to disk instead of to another processor gives checkpoint/restart and out-of-core execution
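A hedged sketch of how an object opts into this automatic balancing: usesAtSync, AtSync, and ResumeFromSync are Charm++'s standard load-balancing hooks, while the class itself is hypothetical:

  // Sketch: a migratable array element that periodically lets the
  // runtime rebalance. The balancing strategy chooses destinations
  // from measured CPU load and, for communication-aware strategies,
  // from measured message traffic between objects.
  class Worker : public CBase_Worker {
  public:
    Worker() { usesAtSync = true; }   // opt in to sync-based balancing
    Worker(CkMigrateMessage* m) {}

    void step() {
      // ... one unit of work ...
      AtSync();                        // pause here; balancer may migrate us
    }
    void ResumeFromSync() {
      thisProxy[thisIndex].step();     // continue after rebalancing
    }
  };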
Parallel Spacetime Discontinuous
Galerkin
• Mesh generation is an advancing front algorithm
– Adds an independent set of elements called patches
to the mesh
• Spacetime methods are set up in such a way that they are easy to parallelize
– Each patch depends only on inflow elements
• The cone constraint ensures there are no other dependencies
– Amount of data per patch is small
• Inexpensive to send a patch and its inflow elements to
another processor
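A sketch of what makes a patch cheap to ship; the layout and sizes below are illustrative assumptions, not the project's actual data structures:

  // Sketch: the unit of work shipped to another processor. A patch
  // carries only its own elements plus copies of its inflow elements;
  // the cone constraint guarantees nothing else is needed to solve it.
  #include <vector>

  struct Element {
    double vertices[3][2];   // 1D space + time: a triangle in spacetime
    double coeffs[8];        // solution coefficients (size assumed)
  };

  struct Patch {
    int id;
    std::vector<Element> elements;  // elements to solve
    std::vector<Element> inflow;    // already-solved upstream data
  };
  // The total payload is a few KB at most, so the round trip to a
  // slave processor is cheap; the 2048-byte figure used later is a
  // conservative estimate of this.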
Mesh Generation
[Figure sequence: the advancing front adds patches to the mesh; panels show unsolved patches, then solved vs. unsolved patches, then refinement]
Parallelization Method (1D)
• Master-Slave method
– Centralized mesh generation
– Distributed physics solver code
– Simplistic implementation
• But fast to get running
• Provides object migration sanity check
• No “time-step”
– As soon as a solved patch returns, the master generates any new patches it can and sends them off to be solved (see the sketch below)
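A sketch of that master-side logic; Master, Slave, and the mesh interface are hypothetical names for illustration:

  // Sketch of the master chare's event-driven loop. There is no global
  // time-step barrier: each returned patch immediately unlocks whatever
  // new patches the advancing front can now generate.
  class Master : public CBase_Master {
    CProxy_Slave slaves;   // proxy to the distributed solver objects
  public:
    // Entry method invoked when a slave returns a solved patch
    // (Patch must be pup-able so it can travel in a message).
    void patchDone(const Patch& solved) {
      mesh.commit(solved);                // advance the front
      while (mesh.canGeneratePatch()) {
        Patch p = mesh.generatePatch();   // every now-independent patch...
        slaves[nextSlave()].solve(p);     // ...is shipped out immediately
      }
    }
  };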
Results - Patches / Second
[Figure: patches/second vs. number of processors; x-axis 0-40 processors, y-axis 0-250 patches/second]
Scaling Problems
• Speedup is ideal up to 4 slave processors
• Beyond 4 slaves, speedup diminishes
• Possible sources:
– Network bandwidth overload
– Charm++ system overhead (grainsize control)
– Mesh generator overload
• The problem doesn’t scale down
– Adding more processors doesn’t slow the computation down
Network Bandwidth
• Size of a patch to send both ways is 2048
bytes (very conservative estimate)
• Can compute 36 patches/(second*CPU)
• Each CPU needs 72 Kbytes/second
• 100 Mbit Ethernet provides ~10 Mbytes/sec
• The network could support ~130 CPUs
– So the bottleneck is not a lack of network bandwidth (see the check below)
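As a quick check of the arithmetic, a self-contained snippet (constants taken from the figures above):

  // Verifies the bandwidth estimate: bytes per patch times patches per
  // second per CPU gives per-CPU bandwidth; dividing the link rate by
  // that gives the number of CPUs the network can feed.
  #include <cstdio>

  int main() {
    const double bytesPerPatch   = 2048.0;  // round trip, conservative
    const double patchesPerCpu   = 36.0;    // patches/(second*CPU)
    const double linkBytesPerSec = 10.0e6;  // ~10 Mbytes/sec on 100 Mbit

    double perCpu = bytesPerPatch * patchesPerCpu;   // 73,728 bytes/sec
    std::printf("per-CPU need: %.0f bytes/sec\n", perCpu);
    std::printf("CPUs supported: %.0f\n", linkBytesPerSec / perCpu);  // ~135
    return 0;
  }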
Charm++ System Overhead
(Grainsize Control)
• Grainsize is a measure of the smallest unit of work
• Too small and overhead dominates
– Network latency overhead
– Object creation overhead
• Each patch takes 1.7 ms of connection setup to send (both ways)
• So we can send ~550 patches/sec to remote processors
– Again, higher than the observed patches/second rate
• Grainsize can be increased (and per-patch overhead reduced) by sending multiple patches at once (see the sketch below)
– This speeds up the computation, but speedup still flattens out after 8 processors
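A sketch of that batching fix; BATCH, pending, and solveBatch are assumed names:

  // Sketch: grainsize control by batching. Rather than one message per
  // patch, the master buffers BATCH patches per slave and ships them in
  // a single entry-method invocation, amortizing the ~1.7 ms
  // per-message setup cost across the whole batch.
  static const int BATCH = 8;   // assumed batch size; tune empirically

  void Master::dispatch(const Patch& p, int slave) {
    pending[slave].push_back(p);             // pending: per-slave buffer
    if ((int)pending[slave].size() >= BATCH) {
      slaves[slave].solveBatch(pending[slave]);  // one message, many patches
      pending[slave].clear();
    }
  }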
Mesh Generation
• With 0 slave processors: 31 ms/patch
• With 1 slave processor: 27 ms/patch
• The geometry code takes 4 ms to generate a patch
– The mesh generator needs a bit more than that per patch due to Charm++ message-sending overhead
• This caps throughput below 250 patches/second (1 patch / 4 ms = 250/sec)
• Can’t trivially speed this up
– Would have to parallelize mesh generation
– Parallel mesh generation would also lighten the network load if the mesh were fully distributed to the slave nodes
Testing the Mesh Generator
Bottleneck
• Does speeding up the mesh generator give better results?
• That leaves the question of how to speed up the mesh generator
– The cluster used is built from 500 MHz Pentium III Xeons
– So run the mesh generator on something faster (a 2.8 GHz Pentium 4)
– Everything stays on the 100 Mbit network
Fast Mesh Generator Results
[Figure: patches/second vs. number of processors; x-axis 0-35 processors, y-axis 0-900 patches/second]
Future Directions
• Parallelize geometry/mesh generation
– Easy to do in theory
– More complex in practice with refinement,
coarsening
– Lessens network bandwidth consumption
• Only the border elements of each distributed mesh would need to be sent
• Compared to all elements being sent right now
– Better cache performance
More Future Directions
• Send only necessary data
– Currently send everything, needed or not
• Use migration to balance load rather than slaves
– Means we’ll also get checkpoint/restart and out-of-core execution for free
– Also means we can load balance away some of the
network communication
• Integrate 2D mesh generation/physics code
– Nothing in the parallel code knows the dimensionality
Download
http://charm.cs.uiuc.edu