The HPCGAP Project:
Delivering Usable Scalable Parallelism to Computational Pure Mathematicians

Steve Linton, University of St Andrews
[email protected]
Friday, 25 June 2010
Overview

• Computational Group Theory and GAP
• HPCGAP project
  • overall software model
  • use cases
• a couple of technical highlights
  • dataspaces
  • challenges for our skeletons
GAP: Groups, Algorithms, Programming

• The mathematician’s handle on symmetry
• Key objects in pure and applied mathematics
• Early adopters of computation in pure maths
• Groups of interest are often infinite, or very large indeed
• Study the group by computing with just a (carefully selected) few of its elements

“There will be positively no internal alterations to be made even if we wish suddenly to switch from calculating the energy levels of the neon atom to the enumeration of groups of order 720.”
— Alan Turing (1945)

808017424794512875886459904961710757005754368000000000
(the order of the Monster group)
Groups, Algorithms, Programming

• Given a concise description of a group
  • generating permutations or matrices
  • finite presentation
• Calculate global properties of the group:
  • size
  • composition factors
  • membership test
  • character table
• Search for elements of the group with special properties
  • “find an element that moves this to that”
  • find all the unipotent matrices in the group

Rubik’s Cube Group:
• Generated by 5 permutations of 48 small squares
• Size = 2^27 · 3^14 · 5^3 · 7^2 · 11
• Structure: (2^11 × 3^7) : (A8 × A12) : 2
• No element that just twists one corner
GAP History

• Development began in Aachen, mid-80s
  • Neubüser, Schönert, others
• 1997, Neubüser retired
  • international project coordinated from St Andrews till 2005
  • coordination now shared with three other centres
• Free Software under GPL
• Widely used and extended

GAP Numbers

• 174K lines of C
• 450K lines of GAP in core system
• 4000+ operations
• 10000+ methods
• 1M lines of GAP in 92 contributed packages
• 100MB+ of data libraries
• 1350 pages in reference manual
• over 1000 citations
GAP In Action

• Qn: is there a non-trivial group whose elements have integer average order?
• Define a function and run over some groups from the database
• Database includes polycyclically-presented and permutation groups

gap> AvgOrder :=
> g->Sum(ConjugacyClasses(g),
> c-> Size(c)*Order(Representative(c)))/
> Size(g);
function( g ) ... end
gap> AvgOrder(MathieuGroup(11));
53131/7920
gap> ForAny(AllSmallGroups([2..100]),
> g->IsInt(AvgOrder(g)));
false

• Note:
  • generic operations like Size and ConjugacyClasses
  • higher-order functions like Sum, ForAny
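For readers unfamiliar with GAP, the same quantity can be sketched in Python by brute force over an explicit list of permutations (a naive illustration; `perm_mul`, `elem_order` and `avg_order` are invented names, and GAP's version is far cheaper because it sums over conjugacy classes rather than over all elements):

```python
from fractions import Fraction
from itertools import permutations

def perm_mul(p, q):
    # Compose permutations given as tuples: (p*q)(i) = p(q(i)).
    return tuple(p[q[i]] for i in range(len(p)))

def elem_order(p):
    # Order of p: smallest n >= 1 with p^n = identity.
    identity = tuple(range(len(p)))
    q, n = p, 1
    while q != identity:
        q = perm_mul(q, p)
        n += 1
    return n

def avg_order(elements):
    # Average element order, kept as an exact fraction as GAP would.
    return Fraction(sum(elem_order(g) for g in elements), len(elements))

# S3 has element orders 1, 2, 2, 2, 3, 3, so the average is 13/6.
s3 = list(permutations(range(3)))
print(avg_order(s3))  # 13/6
```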
HPCGAP

• EPSRC project (Infrastructures programme): St Andrews, EPCC, Aberdeen, Heriot-Watt
• 2009–2013
• Goal to support parallelism in GAP at all levels for users and programmers
  • Supporting use of multiple processors on a single computer within a single GAP session
  • Supporting distributed computing on clusters and supercomputers
• Providing some benefit for non-expert users who happen to have a multicore computer
• Providing easy ways for domain experts to describe and implement parallel algorithms in a relatively pain-free way on a full range of computers
• Allowing real experts to explore a wide range of parallel programming models
• Taking the opportunity to do a lot of necessary reengineering
HPCGAP Model

• Low level: primitives
  • Shared memory: threads, mutexes, dataspaces
  • Distributed: MPI binding, binary object marshalling
• Mid level: common primitives
  • Channels, single-assignment variables, barriers, knowledge bases
• High level: skeletons
  • ParList, ParSum, ParReduce, DFS, Chain Reduction, First to Finish, ...
• Thread-safe library
  • Full GAP library -- some parts may have to be protected by “big locks”
• Parallelised library: key bottleneck routines
  • Exact linear algebra, base and strong generating set, backtrack search
• Applications: exemplars
  • Computations in E8, design search, homology, ...
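To make the mid-level primitives concrete, here is a Python sketch of a channel and a single-assignment variable (class and method names are assumptions for illustration; the GAP-level API was still being designed at the time of this talk):

```python
import threading
import queue

class Channel:
    """A channel: a thread-safe FIFO that threads send objects through."""
    def __init__(self):
        self._q = queue.Queue()
    def send(self, obj):
        self._q.put(obj)
    def receive(self):
        return self._q.get()  # blocks until something arrives

class SingleAssignment:
    """A single-assignment variable: written at most once, read many
    times; readers block until the value has been assigned."""
    def __init__(self):
        self._event = threading.Event()
        self._value = None
    def assign(self, value):
        if self._event.is_set():
            raise RuntimeError("already assigned")
        self._value = value
        self._event.set()
    def read(self):
        self._event.wait()
        return self._value

# A producer thread hands back a result through both primitives.
ch = Channel()
sa = SingleAssignment()
t = threading.Thread(target=lambda: (ch.send(42), sa.assign("done")))
t.start()
print(ch.receive())  # 42
print(sa.read())     # done
t.join()
```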
Use(r) Cases

• Naive user
  • uses GAP just as before, but on a multicore system
  • sees speed-up when he happens to hit a parallelised library routine
  • maybe backgrounds computations from the command-line
• Sequential programmer
  • existing GAP code will run unchanged in a single thread
  • if run in multiple threads, it is more likely to give a meaningful error message than unpredictable wrong results
Use(r) Cases 2

• Skeleton user
  • can write, run and debug parallel programs using our skeletons without too much pain
  • skeletons should cover a decent range of common algebraic computations
• Parallel programmer
  • developing specialised shared- or distributed-memory applications
  • using mid-level abstractions directly
• Skeleton developer
  • building efficient skeleton implementations for a variety of environments
  • tuning new mid- and high-level abstractions for performance
Dataspaces 1

• Goal: to allow as much sequential code as possible to be “automatically” thread-safe
  • minimise surprises for the programmer
  • make the surprises that do happen understandable and repeatable errors, not unrepeatable wrong answers
• Key observation: most long-lived and widely shared objects in a GAP workspace are mathematically immutable
  • practically, they only change by acquiring information
• Idea:
  • newly created mutable objects are only accessible to the creating thread, unless explicitly shared
  • shared mutable objects can only be accessed when you hold an appropriate lock
  • these rules are enforced by the kernel
• Major complication: when to treat subobjects as part of an object and when as distinct objects
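The discipline can be sketched in Python (a hypothetical illustration: `Guarded`, `share` and the error messages are invented here, and in HPCGAP the enforcement happens inside the kernel rather than in library code). The point is that a violation raises a repeatable error instead of silently corrupting data:

```python
import threading

class Guarded:
    """A mutable object that is private to its creating thread until
    explicitly shared; once shared, access requires holding its lock."""
    def __init__(self, value):
        self.value = value
        self._owner = threading.get_ident()  # thread-local until shared
        self._lock = None

    def share(self):
        # Migrate from the creator's dataspace to a shared dataspace.
        self._owner = None
        self._lock = threading.Lock()

    def __enter__(self):
        if self._lock is None:
            if threading.get_ident() != self._owner:
                raise RuntimeError("object is private to another thread")
        else:
            self._lock.acquire()
        return self.value

    def __exit__(self, *exc):
        if self._lock is not None:
            self._lock.release()
        return False

obj = Guarded([1, 2, 3])
with obj as xs:      # the creating thread has free access
    xs.append(4)
obj.share()          # migrate to a shared dataspace

def worker():
    with obj as xs:  # other threads now access under the lock
        xs.append(5)

t = threading.Thread(target=worker)
t.start()
t.join()
print(obj.value)     # [1, 2, 3, 4, 5]
```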
Dataspaces 2

[Diagram: each thread has free access to its own thread-local dataspace, where newly created non-global objects (lists such as [1,2,3], records such as rec(a := 3, b := (1,6))) live; mutable objects can migrate into shared dataspaces, which may be accessed only when the corresponding lock is held; the global dataspace is accessed via atomic kernel operations and holds all objects for which the kernel permits only atomic operations -- mainly immutable objects (e.g. the group S10, a large integer) and channels.]
Some of My Favourite Skeletons

• ParList(list, func)
  • Inputs: a list of objects; a function of one argument with no side-effects
  • Outputs: the list of results of applying the function to the objects
  • Note: no guarantee of what order the function executions will happen in (or where)
  • Captures “trivial” or “embarrassing” parallelism -- lots of completely independent tasks
  • Still challenges with regularity, scaling, robustness
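A minimal Python sketch of the ParList contract (assuming a thread pool; the real skeleton must also cope with distribution, irregular task sizes and failures). Results come back in the original order even though the individual calls may run in any order, on any worker:

```python
from concurrent.futures import ThreadPoolExecutor

def par_list(lst, func, max_workers=4):
    """Apply a side-effect-free function to every element of lst in
    parallel, returning the results in the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, lst))

print(par_list([1, 2, 3, 4, 5], lambda x: x * x))  # [1, 4, 9, 16, 25]
```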
• DFS: find all the nodes of a (directed) graph given by seed nodes and a function for finding (out-)neighbours of a node
  • Variations of this handle orbit and orbit-stabilizer computations
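The sequential core of this skeleton can be sketched in Python (`reachable` and the dictionary-based permutation are illustrative names, not HPCGAP API). The `seen` set is what keeps track of where you have been; in a parallel implementation it becomes a distributed data structure:

```python
def reachable(seeds, neighbours):
    """Enumerate all nodes reachable from the seed nodes via a
    user-supplied neighbour function."""
    seen = set(seeds)
    work = list(seeds)
    while work:
        node = work.pop()
        for nb in neighbours(node):
            if nb not in seen:
                seen.add(nb)
                work.append(nb)
    return seen

# Orbit of the point 0 under the 5-cycle (0 1 2 3 4), expressed as a
# neighbour function: each point maps to its image under the permutation.
perm = {0: 1, 1: 2, 2: 3, 3: 4, 4: 0}
print(sorted(reachable([0], lambda p: [perm[p]])))  # [0, 1, 2, 3, 4]
```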
• Chain Reduction: given an ordered list of objects, reduce each object against each earlier object in some way
  • Gaussian elimination is one application
• Pair Completion: makes new objects from pairs of old ones, reduces new objects wrt old ones and possibly vice versa
  • Knuth-Bendix, Buchberger’s algorithm
• First to Finish: try a number of algorithms to solve a problem -- report the first solution
  • hard part seems to be to stop the other ones
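First-to-finish can be sketched in Python with a future-based race (`first_to_finish` is an invented name). Note how the slide's "hard part" shows up: `Future.cancel` only helps for tasks that have not yet started, so a real implementation needs cooperative cancellation:

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def first_to_finish(algorithms, problem):
    """Race several algorithms on the same problem; return whichever
    answer arrives first and make a best-effort attempt to stop the rest."""
    with ThreadPoolExecutor(max_workers=len(algorithms)) as pool:
        futures = [pool.submit(alg, problem) for alg in algorithms]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # only works for tasks not yet running
        return next(iter(done)).result()

# Two "algorithms" that agree on the answer but differ in speed.
fast = lambda n: n * 2
slow = lambda n: (time.sleep(0.2), n * 2)[1]
print(first_to_finish([fast, slow], 21))  # 42
```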
• (For DFS, the hard part is to recognise where you have been.)
• Duplicate Elimination: given two lists of keys and data, find, and deal specially with, all the keys present in both lists; return a list of the results and the non-duplicates
  • This handles addition of sparse vectors inter alia
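The sparse-vector case can be sketched as a merge of two key-sorted (key, value) lists (`merge_sparse` is an invented name; the skeleton itself is more general about how duplicates are "dealt with"). Keys present in both lists get special treatment, here addition with zero results dropped, while non-duplicates pass through unchanged:

```python
def merge_sparse(a, b):
    """Add two sparse vectors given as key-sorted (index, value) lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ka, va = a[i]
        kb, vb = b[j]
        if ka < kb:
            result.append((ka, va)); i += 1
        elif kb < ka:
            result.append((kb, vb)); j += 1
        else:                       # duplicate key: combine the entries
            if va + vb != 0:        # drop entries that cancel
                result.append((ka, va + vb))
            i += 1; j += 1
    result.extend(a[i:])            # remaining non-duplicates
    result.extend(b[j:])
    return result

u = [(0, 1), (3, 2), (7, 5)]
v = [(3, -2), (5, 4), (7, 1)]
print(merge_sparse(u, v))  # [(0, 1), (5, 4), (7, 6)]
```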
Skeleton Challenges 1 -- Irregularity

gap> ForAny(AllSmallGroups([2..100]),
> g->IsInt(AvgOrder(g)));

[Chart: number of groups (log scale, 1 to 10000) against group order (50 to 350); the counts vary wildly from order to order.]

• If we try to parallelise the outer loop here and go a bit beyond 100, the size of the cases becomes very irregular
• Slightly silly example, but the problem is inherent
  • many algorithms are actually randomised, so runtime varies between calls
• Also typically encounter irregularly sparse matrices
  • blocks of zeros, ultrasparse blocks, dense blocks, unevenly sized
• So our skeleton implementations need to schedule dynamically, and perhaps even repartition workload if one section turns out hard
• Good results with skeletons driven by implicit parallelism in Haskell
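Dynamic scheduling can be sketched with a shared work queue (a minimal illustration; `run_dynamic` is an invented name, and real skeletons also repartition oversized tasks). Workers pull one task at a time, so irregular task sizes balance out automatically, in contrast to static partitioning where one worker can end up with all the hard cases:

```python
import threading
import queue

def run_dynamic(tasks, func, nworkers=4):
    """Apply func to each task, with workers pulling tasks one at a time
    from a shared queue; results keep the input order."""
    q = queue.Queue()
    for i, t in enumerate(tasks):
        q.put((i, t))
    results = [None] * q.qsize()

    def worker():
        while True:
            try:
                i, t = q.get_nowait()
            except queue.Empty:
                return              # no work left: this worker retires
            results[i] = func(t)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

print(run_dynamic(range(6), lambda n: n * n))  # [0, 1, 4, 9, 16, 25]
```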
Skeleton Challenges 2 -- Scalability and Robustness

• We aim to build software that will support applications at all scales up to national supercomputers (HECToR and successors)
  • Not expecting application scaling up to that level to be “free”, but we’d like it to be possible
• Have to work hard to avoid central chokepoints in skeletons
  • distribute work through a hierarchy of managers
  • local managers also need to manage checkpointing, restart, node failure
• Distributed data structure library and associated skeletons to avoid having to reunite data
• May mean limiting use of MPI to local subclusters, unless MPI becomes robust
Concluding Remarks

• We’re parallelising a system, not an application
  • other people will build parallel programs on our software
  • lots of domain-expert (but not programming-expert) users with a huge base of legacy code that we don’t even know about
• Hide explicit parallel programming for most users inside skeletons
  • aim for most legacy code to be thread-safe as is -- therefore safe to plug into skeletons
• For shared memory, try to combine the speed of passing pointers between threads with the safety of having distributed data
• Try to provide a framework for managing robustness etc. separate from mathematical programming
• We’ll let you know how we get on.