The HPCGAP Project: Delivering Usable Scalable Parallelism to Computational Pure Mathematicians
Steve Linton, University of St Andrews
sal@cs.st-and.ac.uk
Friday, 25 June 2010

Overview
• Computational Group Theory and GAP
• The HPCGAP project
  • overall software model
  • use cases
  • a couple of technical highlights
    • dataspaces
    • challenges for our skeletons

GAP: Groups, Algorithms, Programming
• The mathematician's handle on symmetry
• Key objects in pure and applied mathematics
• Early adopters of computation in pure maths
• Groups of interest are often infinite, or very large indeed:
  808017424794512875886459904961710757005754368000000000
• Study the group by computing with just a (carefully selected) few of its elements

“There will be positively no internal alterations to be made even if we wish suddenly to switch from calculating the energy levels of the neon atom to the enumeration of groups of order 720.” -- Alan Turing (1945)

Groups, Algorithms, Programming
• Given a concise description of a group:
  • generating permutations or matrices
  • a finite presentation
• Calculate global properties of the group: size, composition factors, membership test, character table
• Search for elements of the group with special properties:
  • “find an element that moves this to that”
  • find all the unipotent matrices in the group

The Rubik's Cube Group:
• Generated by 5 permutations of 48 small squares
• Size = 2^27 × 3^14 × 5^3 × 7^2 × 11
• Structure: (2^11 × 3^7):(A8 × A12):2
• No element that just twists one corner

GAP History
• Development began in Aachen, mid-80s: Neubüser, Schönert, others
• 1997: Neubüser retired; international project coordinated from St Andrews till 2005
• Coordination now shared with three other centres
• Free software under the GPL
• Widely used and extended

GAP Numbers
• 174K lines of C
• 450K lines of GAP in the core system
• 4000+ operations
• 10000+ methods
• 1M lines of GAP
in 92 contributed packages
• 100MB+ of data libraries
• 1350 pages in the reference manual
• over 1000 citations

GAP in Action
• Question: is there a non-trivial group whose elements have integer average order?
• Define a function and run it over some groups from the database:

gap> AvgOrder :=
> g -> Sum(ConjugacyClasses(g),
>      c -> Size(c) * Order(Representative(c))) /
>      Size(g);
gap> AvgOrder(MathieuGroup(11));
53131/7920
gap> ForAny(AllSmallGroups([2..100]),
> g -> IsInt(AvgOrder(g)));
false

• The database includes polycyclically presented and permutation groups
• Note the function syntax: g -> ... (or function( g ) ... end)
• Generic operations like Size and ConjugacyClasses
• Higher-order functions like Sum and ForAny

HPCGAP
• EPSRC project (Infrastructures programme): St Andrews, EPCC, Aberdeen, Heriot-Watt; 2009–2013
• Goal: to support parallelism in GAP at all levels, for users and programmers
  • supporting use of multiple processors on a single computer within a single GAP session
  • supporting distributed computing on clusters and supercomputers
  • providing some benefit for non-expert users who happen to have a multicore computer
  • providing easy ways for domain experts to describe and implement parallel algorithms in a relatively pain-free way on a full range of computers
  • allowing real experts to explore a wide range of parallel programming models
  • taking the opportunity to do a lot of necessary reengineering

HPCGAP Model
• Low level -- primitives
  • Shared memory: threads, mutexes, dataspaces
  • Distributed: MPI binding, binary object marshalling
• Mid level -- common primitives: channels, single-assignment variables, barriers, knowledge bases
• High level -- skeletons: ParList, ParSum, ParReduce, DFS, Chain Reduction, First to Finish, ...
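The skeleton layer is easiest to grasp by analogy. As a sketch of the contract ParList offers -- deliberately not HPCGAP's actual GAP-level implementation; the name par_list and the thread-pool strategy are illustrative assumptions -- here is a Python analogue:

```python
# Illustrative sketch only: HPCGAP's ParList is a GAP-level skeleton.
# This Python analogue shows just the contract: a side-effect-free
# function, a list in, the list of results out, execution order unspecified.
from concurrent.futures import ThreadPoolExecutor

def par_list(items, func, workers=4):
    """Apply a side-effect-free func to every item, possibly in parallel.

    Results come back in list order, even though the individual calls
    may run in any order and on any worker.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))

print(par_list(range(1, 6), lambda n: n * n))  # [1, 4, 9, 16, 25]
```

The freedom to reorder and relocate the calls is exactly what a real implementation exploits; the caller only sees the ordered result list.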
• Thread-safe library: the full GAP library -- some parts may have to be protected by “big locks”
• Parallelised library -- key bottleneck routines: exact linear algebra, base and strong generating set, backtrack search
• Applications -- exemplars: computations in E8, design search, homology, ...

Use(r) Cases
• Naive user
  • uses GAP just as before, but on a multicore system
  • sees speed-up when he happens to hit a parallelised library routine
  • maybe backgrounds computations from the command line
• Sequential programmer
  • existing GAP code will run unchanged in a single thread
  • if run in multiple threads, it is more likely to give a meaningful error message than unpredictable wrong results

Use(r) Cases 2
• Skeleton user
  • can write, run and debug parallel programs using our skeletons without too much pain
  • skeletons should cover a decent range of common algebraic computations
• Parallel programmer
  • developing specialised shared- or distributed-memory applications using the mid-level abstractions directly
• Skeleton developer
  • building efficient skeleton implementations for a variety of environments
  • tuning new mid- and high-level abstractions for performance
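For the parallel programmer, the mid-level abstractions are the working vocabulary. As a rough sketch of the channel idea -- using Python's queue.Queue as a stand-in, with a hypothetical squarer worker; HPCGAP's channels are kernel-level GAP objects -- one thread sends values and another receives them:

```python
# Sketch of a channel: a thread-safe FIFO with blocking send/receive.
# Python's queue.Queue provides the same contract as a stand-in.
import queue
import threading

def squarer(inbox, outbox):
    """Receive numbers from inbox until a None sentinel; send squares to outbox."""
    while True:
        n = inbox.get()
        if n is None:
            outbox.put(None)   # propagate end-of-stream to the receiver
            return
        outbox.put(n * n)

inbox, outbox = queue.Queue(), queue.Queue()
threading.Thread(target=squarer, args=(inbox, outbox)).start()

for n in [1, 2, 3]:
    inbox.put(n)
inbox.put(None)                # no more work

results = []
item = outbox.get()
while item is not None:
    results.append(item)
    item = outbox.get()
print(results)  # [1, 4, 9]
```

Because the worker drains a FIFO, the results arrive in send order here; with several workers sharing one inbox, arrival order would no longer be guaranteed.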
Dataspaces 1
• Goal: to allow as much sequential code as possible to be “automatically” thread-safe
  • minimise surprises for the programmer
  • make the surprises that do happen understandable and repeatable errors, not unrepeatable wrong answers
• Key observation: most long-lived and widely shared objects in a GAP workspace are mathematically immutable
  • practically, they only change by acquiring information
• Idea:
  • newly created mutable objects are only accessible to the creating thread, unless explicitly shared
  • shared mutable objects can only be accessed when you hold an appropriate lock
  • these rules are enforced by the kernel
• Major complication: when to treat subobjects as part of an object and when as distinct objects

Dataspaces 2
[Diagram: each thread has free access to its own thread-local dataspace, where newly created non-global objects such as lists and records live, and from which they can migrate; shared dataspaces are accessible only when a lock is held; the global dataspace holds all objects for which the kernel permits only atomic operations -- mainly immutable objects such as permutations and large integers, plus channels.]

Some of My Favourite Skeletons
• ParList( list, func )
  • Inputs: a list of objects; a function of one argument with no side-effects
  • Outputs: the list of results of applying the function to the objects
  • Note: no guarantee of what order the function executions will happen in (or where)
  • Captures “trivial” or “embarrassing” parallelism -- lots of completely independent tasks
  • Still challenges with regularity, scaling, robustness
• DFS
  • Find all the nodes of a (directed) graph given by seed nodes and a function for finding the (out-)neighbours of a node
  • Variations of this handle orbit and orbit-stabilizer computations
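The DFS skeleton's sequential specification is just a graph search that remembers where it has been; variations of this are exactly the orbit computations mentioned above. A minimal Python sketch of that specification (the name orbit and the toy neighbour function are assumptions for illustration, not HPCGAP code):

```python
# Sequential meaning of the DFS/orbit skeleton: explore everything
# reachable from the seeds via a neighbour function. A parallel
# implementation distributes the frontier; this is the spec it must match.
def orbit(seeds, neighbours):
    seen = set(seeds)
    frontier = list(seeds)
    while frontier:
        x = frontier.pop()
        for y in neighbours(x):
            if y not in seen:      # the hard part: recognising old nodes
                seen.add(y)
                frontier.append(y)
    return seen

# Toy example: the orbit of 1 under doubling mod 7
print(sorted(orbit([1], lambda x: [(2 * x) % 7])))  # [1, 2, 4]
```

In group-theoretic use, the "neighbours" of a point are its images under the group generators, so the visited set is precisely the orbit.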
My Favourite Skeletons, continued
• Chain Reduction
  • Given an ordered list of objects, reduce each object against each earlier object in some way
  • Gaussian elimination is one application
• Pair Completion
  • Makes new objects from pairs of old ones; reduces new objects with respect to old ones, and possibly vice versa
  • Knuth-Bendix, Buchberger's algorithm
• First to Finish
  • Try a number of algorithms to solve a problem -- report the first solution
  • the hard part seems to be to stop the other ones
• For DFS, the hard part is to recognise where you have been
• Duplicate Elimination
  • Given two lists of keys and data: find, and deal specially with, all the keys present in both lists; return a list of the results and the non-duplicates
  • This handles addition of sparse vectors, inter alia

Skeleton Challenges 1: Irregularity

gap> ForAny(AllSmallGroups([2..100]),
> g -> IsInt(AvgOrder(g)));

[Chart: number of groups of each order up to 350, on a log scale from 1 to 10000 -- the counts vary wildly with the order.]

• If we try to parallelise the outer loop here and go a bit beyond 100, the sizes of the cases become very irregular
• A slightly silly example, but the problem is inherent
  • many algorithms are actually randomised, so runtime varies between calls
• We also typically encounter irregularly sparse matrices
  • blocks of zeros, ultrasparse blocks, dense blocks, unevenly sized
• So our skeleton implementations need to schedule dynamically, and perhaps even repartition the workload if one section turns out to be hard
• Good results with skeletons driven by implicit parallelism in Haskell

Skeleton Challenges 2: Scalability and Robustness
• We aim to build software that will support applications at all scales, up to national supercomputers (HECToR and successors)
  • not expecting application scaling up to that level to be “free”, but we'd like it to be possible
• Have to work hard to avoid central chokepoints in skeletons
  • distribute work through a hierarchy of managers
  • local managers also need to manage checkpointing, restart, node failure
• A distributed data structure
library and associated skeletons, to avoid having to reunite data
• may mean limiting our use of MPI to local subclusters, unless MPI becomes robust

Concluding Remarks
• We're parallelising a system, not an application
  • other people will build parallel programs on our software
  • lots of domain-expert (but not programming-expert) users, with a huge base of legacy code that we don't even know about
• Hide explicit parallel programming for most users inside skeletons
• Aim for most legacy code to be thread-safe as is -- therefore safe to plug into skeletons
• For shared memory, try to combine the speed of passing pointers between threads with the safety of having distributed data
• Try to provide a framework for managing robustness etc. separate from the mathematical programming
• We'll let you know how we get on.