Citation: Williams, John R., Holmes, David, and Tilke, Peter (2010) Multicore strategies for particle methods. ISBN 978-0-9551179-8-5. In: Proceedings of the Fifth International Conference on Discrete Element Methods, 25-26 August 2010, London, UK.
As Published: http://www.sems.qmul.ac.uk/conferences/dem5/downloads/programme.pdf
Publisher: Queen Mary, University of London
Version: Author's final manuscript
Citable Link: http://hdl.handle.net/1721.1/67703
Terms of Use: Creative Commons Attribution-Noncommercial-Share Alike 3.0
Detailed Terms: http://creativecommons.org/licenses/by-nc-sa/3.0/

DISCRETE ELEMENT METHODS

Multi-core Strategies For Particle Methods

John R. Williams, David Holmes and Peter Tilke*
Massachusetts Institute of Technology, *Schlumberger-Doll Research Laboratory

This paper discusses the implementation of particle based numerical methods on multi-core machines. In contrast to cluster computing, where memory is distributed across machines, a multi-core machine can share memory across all cores. Here general strategies are developed for spatial management of particles and sub-domains that optimize computation on shared memory machines. In particular, we extend cell hashing so that cells bundle particles into orthogonal tasks that can be safely distributed across cores, avoiding the use of “memory locks” while still protecting against race conditions. Adjusting task size provides for optimal load balancing and maximizes cache hits. Additionally, the way in which tasks are mapped to execution threads has a significant influence on the memory footprint, and it is shown that minimizing memory usage is one of the most important factors in achieving execution speed and performance on multi-core.
A novel algorithm called H-Dispatch is used to pipeline tasks to processing cores. The performance is demonstrated in speed-up and efficiency tests on a smooth particle hydrodynamics (SPH) flow simulator. An efficiency of over 90% is achieved on a 24-core machine.

Simulations of Discontinua – Theory and Applications

INTRODUCTION

Recent trends in computational physics suggest a rapidly growing interest in particle based numerical techniques. While particle based methods allow robust handling of mass advection, this comes at the computational cost of managing the spatial interactions of particles. For problems involving millions of particles, it is necessary to develop efficient strategies for parallel implementation. A variety of authors have reported parallel particle codes on clusters of machines (see for example Nelson et al. [1], Morris et al. [2], Sbalzarini et al. [3], Walther and Sbalzarini [4], Ferrari et al. [5]).

The key difference between parallel computing on clusters and on multi-core is that memory can be shared across all cores on today’s multi-core machines. However, there are well known challenges of thread safety and limited memory bandwidth.

A variety of approaches to programming on multi-core have been proposed to date. Concurrency tools from traditional cluster computing, like MPI [6] and OpenMP [7], have been used to achieve fast take-up of the new technology. If one runs an MPI process on each core, the operating system guarantees memory isolation across processes. Communication of information across process boundaries requires MPI messages to be sent, which is relatively slow compared with direct memory sharing. In response to this, Chrysanthakopoulos and co-workers [7, 8] have implemented multi-core concurrency libraries using port based abstractions.
These mimic the functionality of a message passing library like MPI, but use shared memory as the medium for data exchange, rather than exchanging serialized packets over TCP/IP as is the case for MPI. Such an approach provides extensibility in program structure, while still capitalizing on the speed advantages of shared memory. In an earlier paper, Holmes et al. [9] showed that a programming model developed using such port based techniques provides significant performance advantages over MPI and OpenMP. In this paper, we apply the programming model proposed in [9] to the parallelization of particle based numerical methods on multi-core.

PARTICLE BASED NUMERICAL METHODS

Mesh based numerical methods have been the cornerstone of computational physics for decades. Here, integration points are positioned according to some topological connectivity or mesh to ensure compatibility of the numerical interpolation. Examples of Eulerian mesh based methods include finite difference (FD) and the lattice Boltzmann method (LBM), while Lagrangian examples include the finite element method (FEM). While mesh based methods are powerful for a wide range of problems, their limitations for problems involving large deformation and complex material interfaces have led to significant developments in meshless and particle based methodologies. For such methods, integration points are positioned freely in space and are capable of advecting with material in a Lagrangian sense. For methods like molecular dynamics (MD) and the discrete element method (DEM), such points represent literal particles (atoms and molecules for MD and discrete grains for DEM), while for methods like dissipative particle dynamics (DPD) [10], smooth particle hydrodynamics (SPH) [11], wavelets [12], and the reproducing kernel particle method (RKPM) [13], the particle analogy is largely figurative.
For such methods, particles provide positions at which to enforce a partition of unity. By partitioning unity across the particles, continuity can be imposed without a defined mesh, allowing such methods to represent a continuum in a generalized way. In Eulerian mesh based methods, such as FD and LBM, continuity is inherently provided by the static mesh, while for Lagrangian mesh based approaches like FEM, continuity is enforced through the use of element shape functions. The partition of unity imposed on mesh-free particle methods can be seen to be a generalization of shape functions for arbitrary integration point arrangements. From Li and Liu [14]:

“...meshfree methods are the natural extension of finite element methods, they provide a perfect habitat for a more general and more appealing computational paradigm - the partition of unity.”

The advantage of partition of unity methods is that any expression related to a field quantity can be imposed on the continuum. Here, for a bounded domain in Euclidean space, a set of nonnegative, compactly supported functions $\varphi_k(x)$ sums to unity, $\sum_k \varphi_k(x) = 1$. Correspondingly, the value of some field function can be determined from its value at all other points via

$$f(x) = \sum_{k} \varphi_k(x)\, f(x_k), \qquad k = 0, 1, 2, \ldots, n$$

The function $f(x)$ can be related to any physical field expression: hydrodynamic, mechanical, electrical, chemical, magnetic, etc. Such versatility is a key advantage of particle methods.

Strategies for Programming on Multi-Core

By eliminating the need for ghost regions and high volume to surface ratio sub-domains, shared memory enables a near arbitrary selection of spatial divisions by which to distribute a numerical task. Correspondingly, it becomes convenient to repurpose a spatial division strategy already in use in one form or another in most particle based numerical codes.
Spatial hashing techniques have been used widely to achieve algorithmically efficient interaction detection in particle codes (see for example DEM [15, 16]).

SPATIAL HASHING IN PARTICLE METHODS

The central idea behind spatial hashing is to overlay some regular spatial structure, e.g. an array of equally sized cells, over the randomly positioned particles. We can then perform spatial reasoning on the cells rather than on the particles themselves.

Spatial hashing assigns particles to cells or ‘bins’ based on a hash of particle coordinates. The numerical expense of such an algorithm is O(N), where N is the number of particles [15, 16]. A variety of programmatic implementations of hash cell methods have been used (for example, linked lists). In this work a dictionary hash table is used, where a generic list of particles is stored for each cell and indexed by an integer key unique to that cell, i.e.

$$\mathrm{key} = (k\, n_y + j)\, n_x + i$$

where $n_x$ and $n_y$ are the total number of cells in the x and y dimensions and i, j and k are the integer cell coordinates in the x, y and z dimensions.

Managing Thread Safety

The cells also provide a means for defining task packages for the cores. By assigning a number of cells to each core we assure “task orthogonality”, in that each core is operating on a different set of particles. Traditional software applications for shared memory parallel architectures have utilized locks to avoid thread contention. Here locks are avoided: a core may “read” the memory of particles belonging to surrounding cells but may not update them. We execute a single loop in which global memory for both previous and current field values ($v_{t_n}$ and $v_{t_{n+1}}$) is stored for each particle. Gradient terms can then be calculated as functions of values in previous memory, while updates are written to the current value memory, in the same loop.
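The cell key, the dictionary hash table and the lock-free read-previous/write-current update described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names (`cell_key`, `build_hash`, `neighbour_keys`, `update_cell`, `step`) are hypothetical, the update rule (mean of previous values within one cell width) is a stand-in for a real field update, and all particles are assumed to lie inside the grid. Python threads are used only to show the memory access pattern; in CPython they do not provide the multi-core speed-up discussed in this paper.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def cell_key(i, j, k, nx, ny):
    """Unique integer key of cell (i, j, k): key = (k*ny + j)*nx + i."""
    return (k * ny + j) * nx + i


def build_hash(positions, h, nx, ny):
    """Dictionary hash table mapping cell key -> list of particle indices."""
    cells = defaultdict(list)
    for p, (x, y, z) in enumerate(positions):
        i, j, k = int(x // h), int(y // h), int(z // h)
        cells[cell_key(i, j, k, nx, ny)].append(p)
    return dict(cells)


def neighbour_keys(key, dims):
    """Keys of the cell itself and of its adjacent cells inside the grid."""
    nx, ny, nz = dims
    i, j, k = key % nx, (key // nx) % ny, key // (nx * ny)
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        a, b, c = i + di, j + dj, k + dk
        if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
            yield cell_key(a, b, c, nx, ny)


def update_cell(key, cells, positions, prev, cur, h, dims):
    """One task: read `prev` from nearby cells, write `cur` for own particles only."""
    candidates = [q for nk in neighbour_keys(key, dims) for q in cells.get(nk, [])]
    for p in cells[key]:
        near = [q for q in candidates
                if sum((a - b) ** 2 for a, b in zip(positions[p], positions[q])) <= h * h]
        # Toy update rule (an assumption): mean of previous values within h.
        # `near` always contains p itself, so the division is safe.
        cur[p] = sum(prev[q] for q in near) / len(near)


def step(cells, positions, prev, cur, h, dims, workers=4):
    """Run every cell as an independent task and return the swapped buffers."""
    def task(key):
        update_cell(key, cells, positions, prev, cur, h, dims)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, cells))  # consuming the iterator re-raises task errors
    # Leaving the `with` block is the only synchronization point of the step.
    return cur, prev  # `cur` becomes the next step's `prev`, and vice versa
```

Because each task writes only to the entries of `cur` that belong to its own cell and only reads from `prev`, no two tasks write to the same location and no lock is required. Calling `prev, cur = step(cells, positions, prev, cur, h, dims)` advances one step and swaps the roles of the two buffers without copying.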
In parallel code, minimizing the frequency of so called synchronization points has advantages for performance, and we utilize this “rolling memory algorithm” for that reason. It allows the previous and updated terms to be maintained without needing to replace the former with the latter at the end of each step, thus ensuring a single synchronization per step.

Counter Intuitive Design

In a typical SPH simulator, two operations must be done on particles in each cell per step, separated by a synchronization: the first to determine the particle number density of each particle, and the second to perform the field variable updates. Interacting particles must be known for each of these two stages. Using a standard SPH formulation, the performance of two structure variations is compared in this case study:

A. Interacting particles are determined in the first stage for all cells and stored in a global list for use in the second.
B. Interacting particles are determined as needed in each stage for each cell, i.e. twice per cell per time step. Recalculation of interacting particles means they need only be kept in a local thread list that is overwritten with each newly dispatched cell.

While differences in execution memory are to be expected between the two code versions, the differences in execution time are more surprising. For low core counts (< 10 cores), as would be expected, the single search variant (A) solves more quickly than the double search variant (B) because it performs fewer computations. Beyond this point, however, the double search (B) is shown to provide marked improvements in speed over (A) (up to 50%). This can be attributed to better cache blocking in the second approach and the significantly smaller amount of data experiencing latency when being loaded from RAM to cache.
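The structural difference between variants (A) and (B) can be sketched in a few lines. The following is a serial, one-dimensional illustration under stated assumptions, not the MIT SPH simulator: the hat-shaped `kernel`, the `smooth` update and all function names are hypothetical stand-ins, and the count of stored neighbour entries is used only as a rough proxy for the memory held by each variant.

```python
from collections import defaultdict


def build_cells(xs, h):
    """Hash 1D positions into cells of width h: cell index -> particle indices."""
    cells = defaultdict(list)
    for p, x in enumerate(xs):
        cells[int(x // h)].append(p)
    return dict(cells)


def kernel(r, h):
    """Hat-shaped weight, used only as a stand-in for an SPH kernel."""
    return max(0.0, 1.0 - r / h)


def search(c, cells, xs, h):
    """Interacting particles for each particle of cell c: {p: [q, ...]}."""
    cand = [q for d in (-1, 0, 1) for q in cells.get(c + d, [])]
    return {p: [q for q in cand if abs(xs[p] - xs[q]) <= h] for p in cells[c]}


def density(p, nbrs, xs, h):
    """Stage 1: particle number density of p."""
    return sum(kernel(abs(xs[p] - xs[q]), h) for q in nbrs)


def smooth(p, nbrs, xs, f, n, h):
    """Stage 2: toy field update that needs the densities of the neighbours."""
    return sum(f[q] * kernel(abs(xs[p] - xs[q]), h) / n[q] for q in nbrs)


def variant_a(xs, f, h):
    """Single search: lists found in stage 1 are kept globally for stage 2."""
    cells = build_cells(xs, h)
    pairs, n, out = {}, [0.0] * len(xs), [0.0] * len(xs)
    for c in cells:                       # stage 1
        pairs.update(search(c, cells, xs, h))
        for p in cells[c]:
            n[p] = density(p, pairs[p], xs, h)
    # (a parallel code would synchronize here)
    for c in cells:                       # stage 2
        for p in cells[c]:
            out[p] = smooth(p, pairs[p], xs, f, n, h)
    return out, sum(len(v) for v in pairs.values())


def variant_b(xs, f, h):
    """Double search: a local list is rebuilt per cell in each stage."""
    cells = build_cells(xs, h)
    n, out, peak = [0.0] * len(xs), [0.0] * len(xs), 0
    for c in cells:                       # stage 1
        local = search(c, cells, xs, h)   # overwritten for every cell
        peak = max(peak, sum(len(v) for v in local.values()))
        for p in cells[c]:
            n[p] = density(p, local[p], xs, h)
    # (a parallel code would synchronize here)
    for c in cells:                       # stage 2
        local = search(c, cells, xs, h)   # searched a second time
        peak = max(peak, sum(len(v) for v in local.values()))
        for p in cells[c]:
            out[p] = smooth(p, local[p], xs, f, n, h)
    return out, peak
```

Both functions return the same field values. Variant (A) holds the neighbour lists of every particle at once, whereas variant (B) never holds more than the lists of a single cell, at the cost of running `search` twice per cell per step. This sketch shows only the difference in data held; it does not reproduce the timing behaviour reported here.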
The fact that such performance gains only manifest when more than 10 cores are used suggests that for fewer than 10 cores, RAM pipeline bandwidth is sufficient to handle a global interaction list.

Fig 1. (a) Memory vs. Number of Particles (b) Speed-up vs. Number of Cores

The MIT SPH simulator (see http://geonumerics.mit.edu) has been validated for single and multi-phase flow. Figure 2 shows results for two phase Rayleigh-Taylor instability [17].

Fig 2. Rayleigh-Taylor instability example showing extremely complex fluid phase interface geometries; clear liquid is water, green liquid is oil [9].

REFERENCES

1. M. T. Nelson, W. Humphrey, A. Gursoy, A. Dalke, L. V. Kale, R. D. Skeel, K. Schulten, NAMD: a parallel, object-oriented molecular dynamics program, International Journal of High Performance Comp. Applications 10 (1996) 251-268.
2. J. P. Morris, Y. Zhu, P. J. Fox, Parallel simulations of pore-scale flow through porous media, Computers and Geotechnics 25 (1999) 227-246.
3. I. F. Sbalzarini, J. H. Walther, M. Bergdorf, S. E. Hieber, E. M. Kotsalis, P. Koumoutsakos, PPM - a highly efficient parallel particle-mesh library for the simulation of continuum systems, Journal of Computational Physics 215 (2006) 566-588.
4. J. H. Walther, I. F. Sbalzarini, Large-scale parallel discrete element simulations of granular flow, Engineering Computations 26 (2009) 688-697.
5. A. Ferrari, M. Dumbser, E. F. Toro, A. Armanini, A new 3D parallel SPH scheme for free surface flows, Computers and Fluids 38 (2009) 1203-1217.
6. W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming With the Message-Passing Interface, MIT Press, Cambridge, 1999.
7. X. Qiu, G. C. Fox, H. Yuan, S.-H. Bae, G. Chrysanthakopoulos, H. F. Nielsen, Parallel data mining on multicore clusters, in: 7th International Conference on Grid and Cooperative Computing GCC2008.
8. G. Chrysanthakopoulos, S. Singh, An asynchronous messaging library for C#, in: Proceedings of the Workshop on Synchronization and Concurrency in Object-Oriented Languages, OOPSLA 2005, San Diego, 2005, pp. 89-97.
9. D. W. Holmes, J. R. Williams, P. Tilke, An events based algorithm for distributing concurrent tasks on multi-core architectures, Computer Physics Communications 181 (2010) 341-354.
10. M. Liu, P. Meakin, H. Huang, Dissipative particle dynamics simulation of pore-scale multiphase fluid flow, Water Resources Research 43 (2007) W04411.
11. J. J. Monaghan, Smoothed particle hydrodynamics, Annual Review of Astronomy and Astrophysics 30 (1992) 543-574.
12. W. K. Liu, Y. Chen, Wavelet and multiple scale reproducing kernel methods, International Journal for Numerical Methods in Fluids 21 (1995) 901-931.
13. W. K. Liu, S. Hao, T. Belytschko, S. Li, C. T. Chang, Multi-scale methods, International Journal for Numerical Methods
14. S. Li, W. K. Liu, Meshfree Particle Methods, Springer-Verlag, Berlin Heidelberg, 2004.
15. A. Munjiza, K. R. F. Andrews, NBS contact detection algorithm for bodies of similar size, International Journal for Numerical Methods in Engineering 43 (1998) 131-149.
16. J. R. Williams, E. Perkins, B. Cook, A contact algorithm for partitioning N arbitrary sized objects, Engineering Computations 21 (2004) 235-248.
17. A. M. Tartakovsky, P. Meakin, A smoothed particle hydrodynamics model for miscible flow in three-dimensional fractures and the two-dimensional Rayleigh-Taylor instability, Journal of Computational Physics 207 (2005) 610-624.