Physics of the Earth and Planetary Interiors 171 (2008) 122–136 Contents lists available at ScienceDirect Physics of the Earth and Planetary Interiors journal homepage: www.elsevier.com/locate/pepi Fractional Steps methods for transient problems on commodity computer architectures M. Krotkiewski ∗ , M. Dabrowski, Y.Y. Podladchikov Physics of Geological Processes, University of Oslo, Pb 1048 Blindern, 0316 Oslo, Norway a r t i c l e i n f o Article history: Received 30 October 2007 Received in revised form 15 July 2008 Accepted 4 August 2008 Keywords: ADI Fractional Steps Locally One-Dimensional Parabolic Hyperbolic Commodity a b s t r a c t Fractional Steps methods are suitable for modeling transient processes that are central to many geological applications. Low memory requirements and modest computational complexity facilitates calculations on high-resolution three-dimensional models. An efficient implementation of Alternating Direction Implicit/Locally One-Dimensional schemes for an Opteron-based shared memory system is presented. The memory bandwidth usage, the main bottleneck on modern computer architectures, is specially addressed. High efficiency of above 2 GFlops per CPU is sustained for problems of 1 billion degrees of freedom. The optimized sequential implementation of all 1D sweeps is comparable in execution time to copying the used data in the memory. Scalability of the parallel implementation on up to 8 CPUs is close to perfect. Performing one timestep of the Locally One-Dimensional scheme on a system of 10003 unknowns on 8 CPUs takes only 11 s. We validate the LOD scheme using a computational model of an isolated inclusion subject to a constant far field flux. Next, we study numerically the evolution of a diffusion front and the effective thermal conductivity of composites consisting of multiple inclusions and compare the results with predictions based on the differential effective medium approach. Finally, application of the developed parabolic solver is suggested for a real-world problem of fluid transport and reactions inside a reservoir. © 2008 Elsevier B.V. All rights reserved. 1. Introduction Geological systems are usually heterogeneous and exhibit large material property contrasts. They are often formed by multiphysics processes interacting on many temporal and spatial scales. In order to understand these systems numerical models are frequently employed. Appropriate resolution of the behavior of these heterogeneous systems, without the (over-)simplifications of a priori applied homogenization techniques, requires numerical models capable of efficiently and accurately dealing with high resolution models. A popular technique is the finite element method (FEM) combined with unstructured meshes capable of dealing with the geometrical complexities of geological problems. In these methods a linear system of equations is often assembled into a sparse matrix and solved. While it can be successfully done for twodimensional models with high resolution even on a modern desktop computer (Dabrowski et al., 2008), three-dimensional problems require supercomputers and sophisticated numerical methods. Direct solvers are unfeasible due to enormous memory ∗ Corresponding author. E-mail address: marcink@fys.uio.no (M. Krotkiewski). 0031-9201/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.pepi.2008.08.008 requirements and large computational times. 
Iterative solvers (like Conjugate Gradients) are an available alternative, but they require good, problem-dependent preconditioners that can be efficiently parallelized. Methods based on structured meshes, although less accurate in terms of geometry representation, are often employed. The known structure of the mesh makes them cheaper in terms of memory requirements and can significantly decrease the computational cost. The state-of-the-art method for structured meshes is multigrid (Brandt, 1977; Hackbusch, 1981; Wesseling, 2004). For the steady-state Poisson equation on uniform grids it converges in a few iterations, regardless of the resolution. Applications to parabolic problems have also been widely studied (Larsson et al., 1995; Lubich and Ostermann, 1987). However, the formulation for heterogeneous materials and variable grids is more complicated, as is an efficient parallel implementation (Overman and Rosendale, 1993). For relatively simple, well-conditioned transient problems a lighter method may be more suitable. Classical explicit finite difference methods are much simpler, both in terms of understanding and of efficient parallel implementation. Usually, however, they are impractical for high-resolution problems due to a severe timestep restriction. Operator splitting techniques try to overcome this limitation. The general idea is to divide a single timestep into a sequence of one-dimensional implicit solves along the spatial dimensions of the domain. The computational cost of a single timestep of such schemes is comparable to that of explicit methods, but the timestep restriction can be avoided.

We first present the class of Fractional Steps methods for transient problems, such as heat diffusion (parabolic) and wave propagation (hyperbolic). The suitability of a contemporary shared memory, Opteron-based commodity architecture for this approach is later investigated. We focus on high resolution 3D problems with up to 1000^3 degrees of freedom and heterogeneous material properties. We use a second-order space discretization of the underlying equations, which for the Fractional Steps methods results in one-dimensional, tridiagonal systems of linear equations. An optimized algorithm for efficient computation and solution of such systems on an eight-way Dual-core Opteron machine is presented.

2. Fractional step method

Consider an initial-boundary value parabolic problem of heat conduction:

$$\rho c_p\,\frac{\partial T}{\partial t} = \operatorname{div}(k\,\operatorname{grad} T) + f \quad \text{in } \Omega\times\Theta,\qquad
T = \bar{T} \quad \text{on } \partial\Omega\times\Theta,\qquad
T = T_0 \quad \text{in } \Omega \text{ for } t=0 \tag{1}$$

where T denotes temperature, k is the thermal conductivity, f is the heat generation term, and ρc_p is the product of density and specific heat capacity. The thermal field T (= T̄) is prescribed on the boundary ∂Ω × Θ, where Ω denotes the spatial domain and Θ is the temporal domain. The initial conditions T_0 are given in the whole Ω. In the case of three-dimensional homogeneous media, the finite difference (FD) discretization A of the operator (ρc_p)^{-1} div(k grad ...) on a uniform Cartesian grid yields:

$$A = A_x + A_y + A_z,\qquad A_x = \kappa\,\frac{\delta_x\nabla_x}{(\Delta h)^2} \tag{2}$$

where κ = k/(ρc_p) is the thermal diffusivity, Δh = Δx = Δy = Δz is the grid spacing, and (δ_x T)_i = T_i − T_{i−1}, (∇_x T)_i = T_{i+1} − T_i denote the backward and forward difference operators in the x direction, respectively. Here, the subscript i indexes the discrete grid points in the X dimension.
The resulting spatial discretization of (1) in the one-dimensional case is

$$\frac{\partial T_i}{\partial t} = \kappa\,\frac{T_{i-1} - 2T_i + T_{i+1}}{(\Delta h)^2} \tag{3}$$

The operators A_y, A_z are analogous to A_x, and in three dimensions A is the standard 7-point stencil for the Laplacian scaled by κ/(Δh)^2. For heterogeneous materials and non-uniform grids we use the finite volume spatial discretization, which in the one-dimensional case reads

$$(\rho c_p)_i\,\frac{\Delta x_{i+1/2}+\Delta x_{i-1/2}}{2}\,\frac{\partial T_i}{\partial t}
= \frac{k_{i-1/2}}{\Delta x_{i-1/2}}\,T_{i-1}
- \frac{\Delta x_{i-1/2}\,k_{i+1/2}+\Delta x_{i+1/2}\,k_{i-1/2}}{\Delta x_{i-1/2}\,\Delta x_{i+1/2}}\,T_i
+ \frac{k_{i+1/2}}{\Delta x_{i+1/2}}\,T_{i+1}
+ \frac{\Delta x_{i+1/2}+\Delta x_{i-1/2}}{2}\,f_i \tag{4}$$

where fractional subscript indices correspond to the center of the edge between two neighboring points in the x direction, and Δx_{i−1/2} denotes the spatial distance between the two grid points x_i, x_{i−1}. Since in our approach the conductivity k is defined in the centers of the 3D cells, it has to be averaged at that mid-edge point. The simplest approach is the weighted arithmetic average:

$$k_{i-1/2} = \frac{
k_{i-1/2,j-1/2,k-1/2}\,\Delta y_{j-1/2}\Delta z_{k-1/2}
+ k_{i-1/2,j+1/2,k-1/2}\,\Delta y_{j+1/2}\Delta z_{k-1/2}
+ k_{i-1/2,j-1/2,k+1/2}\,\Delta y_{j-1/2}\Delta z_{k+1/2}
+ k_{i-1/2,j+1/2,k+1/2}\,\Delta y_{j+1/2}\Delta z_{k+1/2}}
{(\Delta y_{j-1/2}+\Delta y_{j+1/2})(\Delta z_{k-1/2}+\Delta z_{k+1/2})} \tag{5}$$

and k_{i+1/2} is computed analogously. In the above formula, the subscript indices i, j and k are the spatial indices of the nodes in a three-dimensional Cartesian grid. The classical explicit scheme used to integrate (1):

$$\frac{T^{t+1}-T^{t}}{\tau} = A\,T^{t} + f \tag{6}$$

is second-order accurate in the spatial coordinates, first-order in time, and is stable under the restriction τ ≤ (Δh)^2/(6κ) (Courant et al., 1967). The superscript index t is used to denote the subsequent timesteps. The maximum admissible integration step τ_cr becomes very restrictive for refined grids, and in the case of non-uniform grid spacing it is determined by the size of the smallest cell. Moreover, in heterogeneous materials τ_cr is restricted by the strongest heterogeneity, even if it is insignificant in size. The computational complexity of explicit methods per integration step is small, but due to the timestep restriction they may require a large number of iterations to integrate the evolution of the system. In order to alleviate the timestep restriction an implicit method can be used. The second-order accurate in time Crank–Nicholson scheme:

$$\frac{T^{t+1}-T^{t}}{\tau} = \frac{1}{2}A\left(T^{t+1}+T^{t}\right) + f \tag{7}$$

is unconditionally stable (no restriction on τ). It requires solving a system of linear equations:

$$\left(I-\frac{\tau}{2}A\right)T^{t+1} = \left(I+\frac{\tau}{2}A\right)T^{t} + \tau f \tag{8}$$

A in our case is symmetric and sparse, i.e. most of the matrix entries are zeros. For 1D problems A is a tridiagonal matrix and the system can easily be solved using the Thomas algorithm. In the 2D case it is practical to use variants of the well-known Gaussian elimination, e.g. the Cholesky factorization for symmetric positive definite systems, LU factorization with pivoting for non-symmetric systems, or a generalized Thomas algorithm for block tridiagonal matrices. Sparse direct solvers are easy to use, robust and very efficient on modern computers. Unfortunately, they cannot be applied to large 3D problems because of extreme memory requirements and computational complexity. The number of new non-zero entries (fill) introduced during the factorization is much larger than the number of the original non-zeros. Various iterative methods, like Conjugate Gradients usually combined with some preconditioner, can be used instead. However, for strongly heterogeneous problems the convergence rate deteriorates. Moreover, preconditioners are problem dependent and finding a good, parallelizable one is often difficult.
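For illustration, the following is a minimal C sketch of one explicit step (6) on a uniform grid with homogeneous diffusivity; the function and macro names (explicit_step, IDX) are ours, the array layout follows the X-fastest convention used throughout the paper, and boundary values are simply left untouched.

```c
/* One explicit Euler step (6) for the homogeneous 3D heat equation.
   Tnew and Told are nx*ny*nz arrays stored with the X index fastest,
   kappa is the thermal diffusivity, h the grid spacing,
   and tau must satisfy tau <= h*h/(6*kappa). */
#include <stddef.h>

#define IDX(i, j, k, nx, ny) (((size_t)(k) * (ny) + (j)) * (nx) + (i))

void explicit_step(float *Tnew, const float *Told, const float *f,
                   int nx, int ny, int nz, float kappa, float h, float tau)
{
    float c = kappa * tau / (h * h);
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                size_t p = IDX(i, j, k, nx, ny);
                /* 7-point Laplacian stencil, cf. eqs. (2)-(3) */
                float lap = Told[IDX(i - 1, j, k, nx, ny)] + Told[IDX(i + 1, j, k, nx, ny)]
                          + Told[IDX(i, j - 1, k, nx, ny)] + Told[IDX(i, j + 1, k, nx, ny)]
                          + Told[IDX(i, j, k - 1, nx, ny)] + Told[IDX(i, j, k + 1, nx, ny)]
                          - 6.0f * Told[p];
                Tnew[p] = Told[p] + c * lap + tau * f[p];
            }
}
```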
The fractional step methods described in this paper combine the advantages of the two mentioned strategies for time integration of system (1): low computational cost of the explicit scheme, and the stability of the implicit approach. The general idea is to replace a complex operator A by simpler ones that are used in sequence (Fractional Steps) during integration of the parabolic system like (1). In the context of the heat diffusion operator the split is naturally dictated by the spatial components Ax , Ay , Az . During every fractional step many one-dimensional, and in our case tridiagonal systems of equations have to be solved. 124 M. Krotkiewski et al. / Physics of the Earth and Planetary Interiors 171 (2008) 122–136 2.1. Alternating Direction Implicit schemes The Alternating Direction Implicit (ADI) scheme for twodimensional parabolic problems was proposed by Peaceman and Rachford (1955) and Douglas (1955): 1 T t+1/2 − T t = (Ax T t+1/2 + Ay T t ), 2 1 T t+1 − T t+1/2 = (Ax T t+1/2 + Ay T t+1 ) 2 (9) In the above formula and throughout the paper the fractional superscript time index (e.g. Tt+1/2 ) is used to describe the fractional (incomplete) steps. The values of T defined at those intermediate steps has no physical meaning. Only T at full timesteps denoted by integer superscript indices gives an approximation to the temperature field. To define the stability, we introduce the difference step operator C(,h) that is defined as the action of the scheme in the whole step: T t+1 = C(, h)T t (10) The stability of the scheme requires that ||C(, h)|| ≤ 1. The algorithm (9) is unconditionally stable, i.e. it is stable for any ≥ 0. In addition to the truncation error of the Crank–Nicholson implicit scheme (7), the error term related to the splitting is O( 2 ). Thus, ADI approximates the original problem with the same order of accuracy. However, the additional error term may be large and the improvements to the original scheme were suggested (Douglas and Kim, 2001). The 2D ADI scheme is also applicable as an iterative solver for the steady-state variant of the problem (1). In whole steps, using two-layer difference scheme notation (Janenko, 1971) it is given by T t+1 − T t = ˝1 T t+1 + ˝2 T t , ˝2 = ˝1 = Ax + Ay + Ax Ay 2 4 Ax + Ay − Ax Ay , 2 4 (11) It is clearly seen that for the 2D ADI scheme the relation A = ˝1 + ˝2 is satisfied for any , which assures that the scheme converges to the steady-state solution independently of the value of the pseudotimestep. This condition is referred to as complete consistency. The choice of an optimal parameter sequence 1 , . . ., n as well as other techniques accelerating the convergence are summarized, e.g. in Marchuk (1990). The operators Ax , Ay obtained for the homogeneous media on uniform grids commute, i.e. Ax Ay = Ay Ax (12) In practice the commutativity condition (12) proves to be important in deriving properties of the fractional step schemes like stability and consistency. The convergence of the two-dimensional ADI scheme without the requirement of the commutativity of the operators Ax , Ay was discussed by Birkhoff and Varga (1959). Further considerations related to the rate of convergence and parameter choice can be found in Pearcy (1962) and Widlund (1966). The simple extension of the ADI scheme to three-dimensional cases results in the loss of unconditional stability (Janenko, 1971). 
The stable version was proposed by Douglas and Rachford (1956):

$$\frac{T^{t+1/3}-T^{t}}{\tau} = A_x T^{t+1/3} + A_y T^{t} + A_z T^{t},\qquad
\frac{T^{t+2/3}-T^{t+1/3}}{\tau} = A_y\left(T^{t+2/3}-T^{t}\right),\qquad
\frac{T^{t+1}-T^{t+2/3}}{\tau} = A_z\left(T^{t+1}-T^{t}\right) \tag{13}$$

The scheme is proven to be unconditionally stable for homogeneous media and is completely consistent. The second-order accurate in time version was suggested later by Douglas (1962). Pair-wise commutativity is required for the stability of the classical ADI schemes in 3D, thus they are not stable for heterogeneous materials. This limits their use as iterative solvers for steady state problems. For non-commuting positive-definite operators a multistage alternating direction method was suggested (Douglas et al., 1966).

2.2. Locally One-Dimensional schemes

The splitting algorithms belong to another class of fractional step techniques:

$$\frac{T^{t+1/3}-T^{t}}{\tau} = \frac{1}{2}A_x\left(T^{t}+T^{t+1/3}\right),\qquad
\frac{T^{t+2/3}-T^{t+1/3}}{\tau} = \frac{1}{2}A_y\left(T^{t+1/3}+T^{t+2/3}\right),\qquad
\frac{T^{t+1}-T^{t+2/3}}{\tau} = \frac{1}{2}A_z\left(T^{t+2/3}+T^{t+1}\right) \tag{14}$$

In the above, each of the equations involves only one-dimensional difference operators and the scheme is therefore categorized as Locally One-Dimensional (LOD). A similar, fully implicit (backward Euler) variant of (14) is possible. Both algorithms are unconditionally stable (Janenko, 1971). The two-dimensional variant of the scheme (14) is second-order accurate in time for homogeneous media, but this property is lost for non-commutative operators A_x, A_y. It can be restored by introducing two-cycle splitting (xy sweeps followed by reversed-order yx sweeps) (Marchuk, 1990). The two-dimensional version of the LOD scheme (14) for homogeneous media is identical in whole steps with ADI. However, this equivalence holds only until boundary conditions are considered. Imposing boundary conditions onto the one-dimensional sweeps in (14) leads to a finite approximation error at points located next to the boundaries. The resulting scheme in whole steps yields:

$$\left(I-\frac{\tau}{2}A_x\right)\left(I-\frac{\tau}{2}A_y\right)T^{t+1}
- \left(I+\frac{\tau}{2}A_x\right)\left(I+\frac{\tau}{2}A_y\right)T^{t} = R^{n} \tag{15}$$

where R^n = 0 everywhere inside the computational domain, except for the points near the boundaries. The method of undetermined functions is an elegant and efficient way of imposing boundary conditions. An auxiliary right-hand side vector is introduced that is determined by the requirement of vanishing R^n. Similar considerations apply to the heat generation term, which needs to be modified before it enters the LOD scheme. The modified LOD scheme is strongly consistent (Janenko, 1971). Similar approaches have been proposed for hyperbolic types of equations in Marchuk (1990). The problem of acoustic wave propagation and the Locally One-Dimensional scheme in this case are given by

$$\frac{\partial^{2}\varphi}{\partial t^{2}} - \sum_{n=1}^{3}\frac{\partial}{\partial x_{n}}\left(k_{n}(x,t)\,\frac{\partial\varphi}{\partial x_{n}}\right) = 0,$$
$$\frac{\varphi_x^{t+1}-\varphi_y^{t}-\varphi_z^{t}+\varphi_x^{t}}{3\tau^{2}} = \frac{1}{2}A_x\left(\varphi_x^{t+1}+\varphi_x^{t}\right),\qquad
\frac{\varphi_y^{t+1}-\varphi_x^{t+1}-\varphi_z^{t}+\varphi_y^{t}}{3\tau^{2}} = \frac{1}{2}A_y\left(\varphi_y^{t+1}+\varphi_y^{t}\right),$$
$$\frac{\varphi_z^{t+1}-\varphi_x^{t+1}-\varphi_y^{t+1}+\varphi_z^{t}}{3\tau^{2}} = \frac{1}{2}A_z\left(\varphi_z^{t+1}+\varphi_z^{t}\right),\qquad
\varphi_x^{t}=\varphi^{t+1/3},\quad \varphi_y^{t}=\varphi^{t+2/3},\quad \varphi_z^{t}=\varphi^{t+1} \tag{16}$$

Note the additional indexing of φ^t by x, y and z. This notation is introduced solely for the clarity of the formula and represents the fractional (intermediate) steps of the method, e.g. φ_x^t = φ^{t+1/3}.

3. Tridiagonal systems of equations

Fractional step methods consist of a sequence of implicit one-dimensional sweeps through the domain. In our case of a second-order spatial discretization of (1), a symmetric tridiagonal system of linear equations needs to be solved.
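For concreteness, in the fully implicit variant of (14) on a uniform grid with homogeneous κ, each grid line in x yields the tridiagonal system

$$-\,r\,T_{i-1}^{t+1/3} + (1+2r)\,T_{i}^{t+1/3} - r\,T_{i+1}^{t+1/3} = T_{i}^{t},\qquad r=\frac{\kappa\tau}{(\Delta h)^{2}},\quad i=2,\dots,n-1,$$

with the first and last rows modified by the boundary conditions. In the heterogeneous case (4) the off-diagonal entries become −τ k_{i∓1/2}/((ρc_p)_i Δx_{i∓1/2} Δx̄_i), with Δx̄_i = (Δx_{i−1/2}+Δx_{i+1/2})/2, and the diagonal equals one minus the sum of the off-diagonal entries.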
However, a straightforward application of both Dirichlet and Neumann boundary conditions results in a loss of the symmetry. For this reason, and since no additional complexity is introduced to the solver, we study the more general case of a non-symmetric system:

$$A = \begin{pmatrix}
b_1 & c_1 & & & 0\\
a_2 & b_2 & c_2 & & \\
& a_3 & b_3 & \ddots & \\
& & \ddots & \ddots & c_{n-1}\\
0 & & & a_n & b_n
\end{pmatrix} \tag{17}$$

where n denotes the number of points in the discretization. The simplest algorithm to solve such a system is the Thomas algorithm (TA), which is in fact the simplest case of Gaussian elimination, or LU decomposition (Conte and Boor, 1980).

3.1. Thomas algorithm

The Thomas algorithm consists of two phases, forward elimination and backward substitution, described by the first-order linear and non-linear recurrences presented in Eqs. (18) and (19), e.g. p_i = f(p_{i−1}). A basic implementation of TA requires 8n floating point operations, two vectors of length n (x, the result, and rhs, the right-hand side), two auxiliary vectors of length n (p and q) used during the factorization process, and the tridiagonal matrix A of size 3n. TA cannot be trivially executed in parallel due to the existence of the first-order recurrences. The forward elimination phase is described by the following:

$$p_i = \frac{c_i}{b_i - a_i\,p_{i-1}},\quad p_1 = \frac{c_1}{b_1},\qquad
q_i = \frac{rhs_i - a_i\,q_{i-1}}{b_i - a_i\,p_{i-1}},\quad q_1 = \frac{rhs_1}{b_1} \tag{18}$$

and the backward substitution phase can be written as

$$x_i = q_i - x_{i+1}\,p_i,\qquad x_n = q_n \tag{19}$$

The above formulas directly reflect the steps taken during Gaussian elimination applied to this kind of matrix (a C sketch of both phases is given below).

3.2. Parallel tridiagonal solvers

The drawback of the basic TA is that it is strictly sequential. Many parallel algorithms for the solution of tridiagonal systems of equations have been developed. Cyclic reduction was proposed by Hockney (1965) and has since been widely used on distributed machines (Allmann et al., 2001). The idea is to recursively reduce the number of equations by a factor of 2 until a system of only 2 equations is obtained. Even in the sequential case it is often preferred to the TA because of its natural ability to handle periodic boundary conditions. The divide and conquer algorithm is a similar approach (Gander and Golub, 1997). Recursive doubling was proposed by Stone (1973) and its suitability for hypercube architectures was investigated by Egecioglu et al. (1989). This algorithm is based on recursive doubling solutions of linear recurrence relations, which in the case of tridiagonal systems allows the LU decomposition of the matrix to be computed in O(log₂(p)) parallel steps, where p is the number of processors. Solving tridiagonal systems on distributed architectures in the context of ADI-type methods has been studied by Wakatani (2004), who proposed a pre-propagation scheme for solving first-order recurrences. As in the case of all the previously mentioned approaches, parallelization of the solver is achieved at the expense of at least doubling the computational complexity and increasing the memory requirements. Other known methods include iterative relaxation schemes, like Gauss–Seidel, Jacobi, red-black line relaxation and segment relaxation. In cases where many independent tridiagonal systems of equations have to be solved (as in ADI-type methods), there have been attempts to parallelize TA for distributed machines through pipelining (Povitsky, 1999). For a single 1D tridiagonal system divided between 2 CPUs, the first CPU performs half of the forward elimination phase and sends the required data to the second CPU.
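As referenced above, the following is a minimal sequential C sketch of the two phases (18) and (19); the function name is ours, and no pivoting or zero-divisor checks are included.

```c
/* Sequential Thomas algorithm for the tridiagonal system (17):
   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = rhs[i], i = 0..n-1
   (a[0] and c[n-1] are unused). p and q are work arrays of length n. */
void thomas_solve(int n, const float *a, const float *b, const float *c,
                  const float *rhs, float *x, float *p, float *q)
{
    /* forward elimination, eq. (18) */
    p[0] = c[0] / b[0];
    q[0] = rhs[0] / b[0];
    for (int i = 1; i < n; i++) {
        float denom = b[i] - a[i] * p[i - 1];
        p[i] = c[i] / denom;
        q[i] = (rhs[i] - a[i] * q[i - 1]) / denom;
    }
    /* backward substitution, eq. (19) */
    x[n - 1] = q[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = q[i] - p[i] * x[i + 1];
}
```

The loop bodies contain exactly the 8 floating point operations per point (8n in total) counted in the text above.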
The second CPU finishes the forward elimination phase, performs half of the backward substitution phase and sends the required data to the first CPU. Although for a single system of equations this approach is sequential, because only 1 CPU computes at any given time, it works when we have many independent 1D equations. For a comparison of the performance of many of the above algorithms, see Hofhaus and VandeVelde (1996). Since our interest are ADI-type of methods for 3D problems on shared memory architectures, for our parallel implementation we chose to use the TA because of its optimal computational complexity and modest memory requirements. As shown later in the performance analysis section, we were able to obtain a scalable parallel implementation without the need to use the pipelining approach. 4. Implementation Large class of numerical methods discretizing PDEs on structured meshes, such as Finite Differences implemented as stencil operations, only require a small number of floating point operations per a memory access. During the recent years a growing discrepancy between the memory and the CPU speeds is observed (McCalpin, 1991–2007), which results in the memory bandwidth being a bottleneck, rather than the CPU speed. The theoretical peak performances of modern CPUs are only achievable for algorithms that are computationally heavy compared to the memory bandwidth requirements (e.g. BLAS Level3 operations (Whaley et al., 2001)). In the following section we demonstrate how to efficiently implement a wide class of ADI/LOD methods for three-dimensional problems, with special attention paid to the memory performance. For all our performance tests we use an eight-way Opteron system with 2.4 GHz DualCore CPUs equipped with 64 kb L1 cache, 1 Mb L2 cache and DDR333 memory. The code is written in C. Sequential version is compiled using Intel and GNU gcc-3.2.3C compilers. The parallel version is only compiled using the (better performing) Intel compiler. Compiler options are shown in the Appendix I. Some technical knowledge is useful in order to fully understand our optimized 3D Fractional Steps/ADI implementation. Appendix I presents the general idea of efficient traversal of multidimensional arrays using the loop order notation commonly found in the computational literature. In short, 3D arrays are accessed using three nested loops iterating over the three dimensions: i, j, k for X, Y and Z dimensions respectively. There are six possible arrangements of the loop nest structure: ijk, ikj, jik, jki, kij, kji. Loop structures with i index as the innermost loop (array traversal along the X dimension) achieve what we call a linear memory access, which is crucial for good performance. In this paper we show how to efficiently 126 M. Krotkiewski et al. / Physics of the Earth and Planetary Interiors 171 (2008) 122–136 handle Fractional Steps/ADI type of algorithms, where 1D implicit sweeps need to access the data along all the dimensions, which is in general not consistent with the linear memory layout. Appendix II introduces the idea of SSE vectorization and the prerequisites for its use in numerical codes. Appendix III describes general issues associated with the cache reuse and shows why in certain cases it is of less importance than the linear memory access, which enables efficient use of memory prefetching and thus hides the memory access time. 
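In the following sections, sweep times are repeatedly compared against the time of simply copying the same amount of data in memory. A minimal sketch of how such a memcpy baseline can be measured is given below; the array size (one 1000^3 single precision field, copied once) and the timing facility are illustrative, not the benchmark code used for the reported numbers.

```c
/* Illustrative memcpy baseline: time to stream one nx*ny*nz float array
   through memory (read + write), reported as seconds and GB/s. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t n = 1000UL * 1000UL * 1000UL;   /* 1000^3 grid points, 4 GB per array */
    float *src = malloc(n * sizeof *src);
    float *dst = malloc(n * sizeof *dst);
    if (!src || !dst) return 1;                  /* reduce n on machines with less RAM */
    memset(src, 0, n * sizeof *src);             /* touch the pages before timing */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, n * sizeof *src);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    double gb = 2.0 * n * sizeof *src / 1e9;     /* read + write traffic */
    printf("moved %.1f GB in %.2f s (%.2f GB/s)\n", gb, s, gb / s);
    free(src);
    free(dst);
    return 0;
}
```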
The main goal of this part of the paper is to provide general guidelines on how to efficiently implement a wide class of Fractional Steps-type methods on modern RISC CPUs. We do not concentrate on cache reuse. Rather, we stress the linear memory access and the code rearrangements that make SSE2 vectorization and effective memory prefetching possible. In this sense our results are general and do not require any architecture-specific, compile-time parameters, such as block sizes. They apply equally well to most modern CPUs (Intel, AMD, IBM), although the exact performance numbers will differ between different architectures and different compilers.

4.1. Performance measurements

A common measure of computational efficiency is flops (floating point operations per second). Different approaches can be used to estimate this value. The number of floating point operations required by the best known algorithm can be computed, but since the operations are often not the bottleneck, this can be meaningless. Another way is to add up all the operations performed explicitly in the source code. Unfortunately, because of heavy optimizations performed by the compiler this number may differ from what the CPU actually executes. Hardware counters capable of tracking performance statistics like cache misses and the actual number of flops performed by a CPU have recently become popular (Browne et al., 2000). They are very useful when identifying performance problems in the code, but this analysis can be complex for multi-core CPUs. In this paper we chose to present values estimated directly from the fastest obtained sequential implementation. In particular, we include all the compulsory operations that cannot be removed by the compiler (e.g. 8n in the Thomas algorithm), and we include operations that could potentially be pre-computed before the execution and stored in auxiliary arrays. The latter is not done, since it would either significantly increase the total memory requirements (threefold in the case of the conductivity parameter k), or even increase the execution time due to higher memory bandwidth requirements. As we show later, averaging the conductivity on the fly does not increase the execution time much. Since our numbers are probably slightly higher than the real ones due to compiler optimizations, we also provide the absolute CPU time spent on the computations and compare it to the simplest, unimprovable case of making a copy of a memory area.

In our performance tests we use single precision floating point numbers because they require half the memory of double precision, and the floating point computations are faster. In the case of our implementation and the inspected architecture this results in a more than twofold speedup. Using single precision for computations wherever possible grows more and more popular, especially with the introduction of high performance GPUs, FPGAs, or the Cell processor, on which the performance difference reaches one order of magnitude (Langou et al., 2006). Although single precision floats may not provide enough accuracy for certain problems, it is now widely studied how to combine them with double precision computations only for the accuracy-critical parts of the code, and still obtain a full double precision result (Kurzak and Dongarra, 2007; Strzodka and Göddeke, 2006). With some care this approach can also be applied to ADI-type methods. However, it is beyond the scope of this paper. For now, our implementation can run in full double precision mode when required. All our performance considerations also hold for this case.

5. LOD/ADI

A general engine of any ADI/LOD type of algorithm consists of three successive one-dimensional implicit sweeps (X, Y and Z) over the domain, performed in every timestep. A simple implementation of every 1D sweep consists of three nested loops over the i, j and k indices for the X, Y and Z dimensions respectively, where the innermost loop is iterated along the direction of the sweep. Here we study the simplest, backward Euler (as opposed to Crank–Nicholson) variant of the LOD scheme presented in Eq. (14). In our implementation, during the tridiagonal matrix assembly the material parameters K assigned to grid cells are averaged on the fly, which in the general case allows for transient material properties. The other material property, ρcp, is either averaged from the surrounding cells to the nodes before the computations, or simply defined at the grid nodes. The right-hand side consists of the old temperature values (since we study a backward Euler scheme) and the source/sink term F, which is applied only during the X sweep. Effectively, during the X sweep we operate on four 3D arrays of size nx × ny × nz (where temperatures are both read and written), and during the Y/Z sweeps three arrays are used. A lower bound on the execution time, estimated just from the time required to read and write the above mentioned data in memory, is around 10 s for the X sweep and around 8 s for the Y/Z sweeps (see Appendix I for the explanation). We perform ca. 50 operations per point, where most of the operations are related to the averaging of the K properties from the cell centers to the mid-edge points (5). While this step could be performed before running the solver, it would increase the total memory requirements for the K array threefold, and increase the per-sweep memory requirements three times for the schemes requiring derivatives in the other dimensions. Consequently, precomputing these values would actually increase the execution time. We show that computing these values during the sweeps does not affect the execution times much.

For the X sweep the best order of the loops is kji (see Appendix I for the detailed explanation of the loop nest notation and the impact loop order has on performance). For the Y sweep we can choose between ikj and kij (the innermost loop is done along the dimension of the sweep). Similarly, for the Z sweep we can use either the ijk or the jik implementation. Only in the case of the X sweep is the memory accessed linearly. Performance results of all these implementations for a model of size nx = ny = nz = 1000 are presented in Table 1.

Fig. 1. In order to obtain linear memory access during the Z sweep a whole plane of systems of equations has to be simultaneously created and solved. While solving a single equation in the Z dimension requires k to be the innermost loop, solving a plane of systems of equations makes it possible to have i as the innermost loop.

Improving the Y and Z sweeps requires accessing the memory linearly, i.e. implementing the loop order as kji and jki, respectively. The idea is presented in Fig. 1. Instead of building and solving a single tridiagonal system, we build and solve a whole plane of systems of equations simultaneously. The described loop transformation assures linear memory access during the Y and Z sweeps, and allows the use of SSE vectorization for the matrix computation. The structure of the Z sweep is sketched below.
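The listing below is a minimal C sketch of this plane-wise Z sweep for the backward Euler variant with homogeneous material properties; the actual implementation additionally averages the cell-wise conductivities (5) on the fly and treats the boundary rows explicitly, as discussed in the text. Array and function names are illustrative.

```c
/* Plane-wise Z sweep of the backward Euler LOD variant for homogeneous
   diffusivity kappa (boundary rows are not treated specially here).
   T (nx*ny*nz, X index fastest) is updated in place; p and q are scratch
   arrays of nz*nx values, so the innermost loop always runs over i. */
#include <stddef.h>

void lod_sweep_z(float *T, int nx, int ny, int nz,
                 float kappa, float h, float tau, float *p, float *q)
{
    const float r = kappa * tau / (h * h);      /* off-diagonal -r, diagonal 1+2r */
    const size_t sy = (size_t)nx, sz = (size_t)nx * ny;

    for (int j = 0; j < ny; j++) {              /* solve one X-Z plane of systems */
        /* forward elimination, eq. (18), for all nx systems simultaneously */
        for (int i = 0; i < nx; i++) {
            p[i] = -r / (1.0f + 2.0f * r);
            q[i] = T[j * sy + i] / (1.0f + 2.0f * r);
        }
        for (int k = 1; k < nz; k++)
            for (int i = 0; i < nx; i++) {      /* linear memory access, vectorizable */
                size_t m = (size_t)k * nx + i;
                float denom = (1.0f + 2.0f * r) + r * p[m - nx];
                p[m] = -r / denom;
                q[m] = (T[k * sz + j * sy + i] + r * q[m - nx]) / denom;
            }
        /* backward substitution, eq. (19), again with i as the innermost loop */
        for (int i = 0; i < nx; i++)
            T[(size_t)(nz - 1) * sz + j * sy + i] = q[(size_t)(nz - 1) * nx + i];
        for (int k = nz - 2; k >= 0; k--)
            for (int i = 0; i < nx; i++) {
                size_t m = (size_t)k * nx + i;
                T[k * sz + j * sy + i] = q[m] - p[m] * T[(k + 1) * sz + j * sy + i];
            }
    }
}
```

The Y sweep is analogous, with the roles of j and k exchanged and the loop order kji.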
Moreover, as a natural consequence of solving a plane of systems of equations, SSE can also be used to implement a vectorized Thomas algorithm, i.e. to solve many equations laid out correctly in the memory at the same time (see Appendix II for a better understanding of SSE vectorization). In the X sweep vectorization can only be used during the matrix computation. Using a vectorized TA in this case requires explicit rearrangement and copying of the data in the memory, which is inefficient and slows down the code. The importance of using vectorization wherever possible can be verified by compiling and running the code with and without SSE2 support (see Table 2).

The presented approach requires additional storage for nx tridiagonal matrices, right-hand side vectors and the factor (p) vectors. Since these arrays are of considerable size and do not fit into the CPU cache for large problems, they decrease the overall performance of the algorithm. From Eq. (18) it can be noticed that every entry of the A matrix is only used once, during the forward elimination stage of TA. Hence, it is beneficial to factorize it on the fly by including the forward elimination stage in the matrix building loop. This optimization eliminates the need to store and later read the entries of A, which speeds up the code considerably. Doing so also allows us to use the rhs vector in place of the q vector from Eq. (18). Finally, in the fully implicit case the temperature values themselves can be used as the rhs vector. These improvements can also be applied in the Crank–Nicholson schemes and in the schemes that require explicit derivatives in the other dimensions by using an auxiliary vector.

Table 1
Performance of the naive implementation of the implicit LOD method (Intel compiler)

              Time (s)   MFlops
X sweep, kji     29       1750
Y sweep, ikj    288        169
Y sweep, kij     85        572
Z sweep, ijk    337        144
Z sweep, jik    212        229

The performance results for the methods described above are summarized in Table 3. It is worth noting that the best Y and Z sweep implementations are faster than the X sweep. This is due to the SSE vectorization of the TA, which is a natural consequence of the presented approach, but cannot be efficiently implemented for the X sweep. For comparison, the table also includes the time needed to perform a single step of our implementation of the explicit finite difference scheme with variable material coefficients, SSE vectorization and proper memory access. Table 4 presents the performance results of the same code compiled with gcc and icc. Clearly icc, being two times faster, utilizes the vector capabilities of the Opteron much better.

Fig. 2 presents the flops performance of the optimized code depending on the system size. High efficiency is sustained for large problems and for sweeps in all spatial directions. For small systems the flops performance is up to 30% higher, which is due to the cache reuse of the K properties array between the solves done on subsequent planes, and of the factor and rhs arrays during the forward and backward stages of the TA. For large problems, the performance is memory bandwidth bounded and therefore does not depend on factors like CPU cache size. The CPU theoretically 128 M. Krotkiewski et al.
/ Physics of the Earth and Planetary Interiors 171 (2008) 122–136 Table 2 The impact of SSE vectorization on execution time (Intel compiler) Time (s) MFlops X sweep not vectorized X sweep SSE vectorized Y sweep, kji not vectorized Y sweep, kji SSE vectorized Z sweep, jki not vectorized Z sweep, jki SSE vectorized 48 1039 29 1750 45 1073 32 1491 47 1028 33 1452 Table 3 Performance of SSE vectorized code with integrated TA solver (Intel compiler) Time (s) MFlops memcpy Explicit scheme X sweep, kji vectorized Y sweep, kji Vectorized Y sweep, kji integrated TA vectorized Z sweep, jki vectorized Z sweep, jki integrated TA ∼8 to 10 0 37 29 1750 32 1491 20 2428 33 1452 22 2229 could compute faster (it does so for smaller systems), but can not due to data starvation. This shows that there is only so much to gain by optimizing the cache reuse in the sequential version of the code. 5.1. Parallel implementation In the simplest approach ADI/LOD methods are parallelized by dividing the outermost loop between the threads. This assures locality of most of the data used during the sweeps in two dimensions, the exception being the material properties K, for which neighboring CPUs share the border. On the other hand, the sweep in the third dimension accesses the data across all the CPUs, which involves very heavy communication and a performance hit. On a shared memory system this penalty is considerably lower than on distributed memory architectures. Starting from a close to optimal sequential performance we show that this simple approach yields very good parallel results for the studied computer architecture. On shared memory machines the whole memory of the system can be allocated and addressed directly as a single array. This way there is no need to explicitly program the communication, which is performed automatically by the CPU while accessing the required data in the global memory space. On NUMA (Non-Uniform Memory Access) capable architectures every CPU has its private memory bank, which assures parallelization of not only the computations, but also the memory bandwidth. On these systems in order to obtain a scalable code it is important to assure that the data accessed by a thread belongs to the private memory bank of the CPU the thread is executed on. For scientific applications it is commonly achieved by binding threads to specific CPUs and using a technique called first touch to allocate the data. Basically, every thread initializes (e.g. sets to 0) only this part of an array, which it will later use during the computations. The operating system then assigns all parts of the array to the proper memory banks. Fig. 3 presents scaling of the optimized code for a model of size nx = ny = nz = 1000. Speedup of the X and Y sweeps on up to 8 CPUs is linear and close to perfect. The Z sweep suffers a small Fig. 2. Flops performance of the optimized sequential implementation of 1D sweeps in all special directions depending on the problem size. but acceptable performance penalty due to the communication between CPUs. Using more CPUs requires switching to the second core, in which case limited speedup can only be observed for the X sweep. On a multi-core NUMA system all cores of a CPU share Table 4 Performance comparison of the fully optimized code compiled with gcc and icc Time (s) MFlops X sweep, icc X sweep, gcc Y sweep, icc Y sweep, gcc Z sweep, icc Z sweep, gcc 29 1750 52 965 20 2428 44 1093 22 2229 45 1062 Fig. 3. 
Flops performance of parallel execution of the optimized implementation for a constant problem size of nx = ny = nz = 1000. After 8 CPUs (vertical line) two cores were used on every CPUs. M. Krotkiewski et al. / Physics of the Earth and Planetary Interiors 171 (2008) 122–136 129 parameters matrix K at the CPU boundaries. This however would require allocation of a separate K matrix for the Z dimension. For small enough problems some speedup is also observed for all the sweeps when using the second core. In this case the code spends relatively more time on computations, and, due to cache reuse, leaves some memory bandwidth unused. It brings us to a conclusion that although cache reuse plays much smaller role for sequential codes on single cores, it may yet prove to be important for utilizing the parallel computational power of multi-core systems. 6. Numerical examples 6.1. Effective properties of heterogeneous media Fig. 4. Parallel efficiency of the method on 8 Opteron CPUs depending on problem size. the same memory bus, which results in the threads competing for the memory bandwidth. Since in the case of heavily optimized Y and Z sweeps most of it was already used by a single thread, no speedup is to be expected. On the other hand, the TA solver in the sequential X sweep was not vectorized, and the computations took relatively longer time. Effectively, some memory bandwidth was left to be utilized by the second core. For studied problem size, the X, Y and Z sweeps on 8 CPUs take 3.8, 3.0 and 4.2 s, respectively. Fig. 4 shows parallel efficiency of the code for 8 CPUs. An interesting observation can be made for small problems (nx = ny = nz = ∼200), for which not only the sequential performance of Y and Z sweeps is higher (see Fig. 2), but also the parallel efficiency. This means that the cache reuse also limits the data exchange between the CPUs. It indicates possible improvements to the parallel code involving duplication of the material In this section we present an application of the previously described Locally One-Dimensional scheme (14) to a numerical study of diffusion fronts in heterogeneous media. Firstly, we validate our numerical method and analyze the average steady state thermal gradient inside a single spherical heterogeneity subjected to a uniform thermal flux at the model boundaries. We compare the numerical results to an analytical prediction derived for an inclusion embedded in an infinite host. Next, the method is used to compute the time evolution of the diffusion front in a medium consisting of multiple spherical inclusions. The inclusion concentration, number and configuration are systematically varied during our simulations. The system evolution is integrated until the steady state is reached. At that time we compute the average thermal flux in the entire domain determining the effective conductivity of the system. The numerical results are compared to the estimates obtained from an effective property scheme based on the differential effective medium (DEM) approach. Finally, we resolve the complex structure of the local diffusion fronts in such heterogeneous media and show how the effective conductivity model provides an excellent description framework on a larger scale even for the transient problem. 
Table 5 Impact of the grid resolution on the result quality in a single inclusion benchmark: (5a) weak inclusion case and (5b) strong inclusion case Kinc 0.1 (a) Resolution n3 points 51 101 151 201 251 301 501 Analytical prediction Kinc 0.01 Kinc 0.001 Grad Err% Grad Err% Grad Err% 1.489 1.508 1.488 1.476 1.466 1.461 1.448 1.4285 4.2 5.6 4.2 3.3 2.6 2.3 1.4 2.000 1.764 1.671 1.626 1.597 1.580 1.544 1.4925 34.0 18.2 12.0 8.9 7.0 5.9 3.5 2.108 1.812 1.703 1.650 1.618 1.599 1.558 1.4992 40.6 20.9 13.6 10.1 7.9 6.7 3.9 Kinc 10 Kinc 100 Kinc 1000 Grad Err% Grad Err% Grad 0.494 0.353 0.314 0.296 0.285 0.279 0.267 0.25 97.6 41.2 25.6 18.4 14.0 11.6 6.8 0.0836 0.0497 0.0416 0.0382 0.0361 0.0349 0.0322 0.029 188.3 71.4 43.4 31.7 24.5 20.3 11.0 0.00897 0.00520 0.00428 0.00388 0.00364 0.00361 – 0.0030 Err% 3 (b) Resolution n points 51 101 151 201 251 301 501 Analytical prediction 199.0 73.3 42.7 29.3 21.3 20.3 – Table shows the average value of temperature gradient in the inclusion at steady-state and the deviation from the expected result. The size of the domain is 1, inclusion radius is 0.05. Timestep used was 75 times the maximum explicit timestep allowed. The error estimate is calculated as: 100 × abs(numerical result − analytical result)/ abs(analytical result). 130 M. Krotkiewski et al. / Physics of the Earth and Planetary Interiors 171 (2008) 122–136 Table 6 Impact of grid resolution on the result quality Resolution n3 grid points 51 101 151 201 251 501 Concentration 0.1 Concentration 0.3 Weak Strong Weak Strong 0.832 0.846 0.851 0.852 0.854 0.856 0.0218 0.0165 0.0152 0.0146 0.0143 0.0135 0.584 0.593 0.596 0.598 0.599 0.601 0.0696 0.0460 0.0382 0.0344 0.0323 0.0281 128 inclusions with concentrations 0.1 and 0.3. Timestep used was 75 times the maximum explicit timestep allowed. Weak case: kincl = 0.01, khost = 1, strong case: kincl = 1, khost = 0.01. 6.2. Single inclusion benchmark A spherical inclusion of radius 0.05 is placed in the center of a unit cube. The thermal conductivity of the host material is set to 1 in all our simulations and we systematically vary the conductivity of the inclusion phase. For the plane X = 0 we set the temperature to 1, and for X = 1 we set it to 0. At all the other walls of the cube we apply the zero-flux boundary condition. The initial temperature is set to zero in the whole domain, except for the X = 0 plane. The domain is discretized using a uniform Cartesian grid with an equal number of grid points in every dimension. The cell conductivity is based on whether the cell center is placed inside or outside the inclusion. The cell conductivity is not averaged in any way. Starting from time t = 0 we let the system evolve with a fixed timestep dt until no significant temperature changes can be observed in the model. Thus, we obtain the approximation to the steady-state solution. At this point we compute the average temperature gradient in the X direction inside the inclusion. For an explicit time integration scheme the Courant– Fredrichs–Levy restriction on the timestep states that the maximum admissible timestep yields dtCFL = (h)2 /6kmax , where h denotes the uniform grid spacing and kmax is a maximum conductivity value in the model. In all the performed numerical experiments we used dt = 75 dtCFL to verify the advantages of the unconditional stability of the method. As described in previous sections, if the steady-state is the goal of the computations one should Fig. 6. Average temperature profiles along the X dimension at time t = 0.05. 
Three different concentrations of 128 inclusions are studied on a 5013 resolution grid with dt = 75 times the maximum explicit timestep. The time is normalized with the effective conductivity of the medium. Obtained 1D profiles are compared to the analytical solution of 1D heat conduction equation in a homogeneous medium. modify the timestep during the iterations to accelerate the convergence. However, our main focus in this study is the transient problem of thermal diffusion and therefore we keep the timestep constant throughout the simulation. It is well known that both the thermal flux and mechanical strains are uniform inside an ellipsoidal inclusion for the constant far-field thermal and mechanical loads (Eshelby, 1957). In the ther incl is given by mal case the flux inside a spherical inclusion q incl = q 3kincl /khost ∞ q 2 + kincl /khost (20) where kincl and khost are the thermal conductivities of the inclusion ∞ is the far field flux. and the host phase respectively, and q Table 5 presents the results of our numerical experiments, where we study the impact of the numerical resolution on the inclusion flux for different inclusion conductivities. (5a) presents the results for a weak inclusion, whereas (5b)—for a strong inclusion. It has to be noted that the analytical result (20) is derived for the case when the boundaries are infinitely far from the inclusion. In our case, although the inclusion is quite small, the boundaries are at a finite distance and hence have some influence on the result. Also, the Locally One-Dimensional scheme (14) is not completely consistent and converged solutions are not exactly the solution of the steady-state thermal problem. 6.3. Multiple inclusions—effective conductivity Fig. 5. Effective conductivity of a host filled with different concentrations of weak and strong spherical inclusions. The dots represent obtained numerical results for weak (empty dots) and strong (filled dots) inclusions. Dotted and dashed lines are given by the DEM upscaling scheme. In this section we present a study of the effective properties of heterogeneous media consisting of randomly distributed spherical inclusions. In the numerical experiments we analyze different numbers and various concentrations of both weak and strong inclusions. All the inclusions have the same conductivity kincl , and the conductivity of the matrix is denoted as khost . The setup used in our numerical experiments is similar to that presented in the previous section. The computational domain is a unit cube and the boundary conditions are the same: Dirichlet boundary conditions on the X = 0 and 1 planes with temperature values 0 and 1, and zero-flux M. Krotkiewski et al. / Physics of the Earth and Planetary Interiors 171 (2008) 122–136 131 Fig. 7. A snapshot of the diffusion fronts during the simulation. The left-most column presents 128 inclusions for the three studied concentrations: 0.1, 0.2 and 0.3. The middle column presents the same real-time snapshot for all the simulations. The two three-dimensional iso-contours are plotted for temperature values 0.7 and 0.1. The color on these iso-surfaces denotes the temperature gradient. The 2D cut at the back of the 3D cube shows temperatures on a chosen X–Y plane. The same plane is later shown alone in the last column. boundary conditions on the remaining walls of the box. We randomly place a given number of equally sized spherical inclusions so that their total volume adds up to the required concentration. 
We allow the system to converge to the steady-state using the timestep that is 75 times larger than the maximal value admissible in explicit integration. We then compute an average flux in the X direction through the entire domain. Table 6 presents a study of the impact of the computational resolution on the result. The runs have been performed for 128 inclusions. Concentrations 0.1 and 0.3 are considered for both weak (kincl = 0.01, khost = 1) and strong (kincl = 1, khost = 0.01) inclusions. To validate the numerical prediction we have repeated the experiments for 6 different configurations of 128 inclusions, and for different numbers of inclusions (from 32 up to 500) with the same concentration 0.1. The computational resolution was 5013 and the dt was 75 times the maximum explicit timestep. For 6 different distributions of 128 inclusions the maximum difference between the results was 0.2% for weak inclusions and 0.5% for strong inclusions. For a varying number of weak inclusions the obtained results are already quite good for a relatively small number of heterogeneities. 32 and 500 inclusions gave an average flux of 0.8525 and 0.8566, respectively. For the strong case the result differed more: 0.0133 and 0.0143, respectively. This can be attributed to a coarser discretization of individual, smaller inclusions. The results obtained numerically are compared with the prediction obtained with the differential effective medium (DEM) upscaling scheme in Fig. 5. The numerical results represented by dots are computed for 128 weak (empty dots) and strong (filled dots) inclusions with different concentrations. The resolution of the grid was 5013 points. The differential effective medium schemes are known to exhibit a very good predictive power over a wide range of concentrations and material property contrasts (e.g. Weber et al., 2003). The classical differential effective medium scheme for a composite consisting of spherical inclusions leads to the following approximation (Bruggeman, 1936): kincl − keff kincl − khost k host keff 1/3 =1−f (21) 132 M. Krotkiewski et al. / Physics of the Earth and Planetary Interiors 171 (2008) 122–136 As seen in Fig. 5, the obtained numerical results are in very good agreement with the analytical prediction for both weak and strong inclusions. 6.4. Multiple inclusions—diffusion fronts In the above sections we presented an analysis of the accuracy of the LOD method based on the approximation of the steadystate solution of the thermal diffusion problem. Here, we look at the diffusion fronts and the transient stages of the simulation. We study 128 weak inclusions with kincl = 0.01 discretized on a 5013 points grid. The timestep dt is 75 times the maximum explicit value. Fig. 6 presents computed thermal profiles along the X dimension averaged on the Y–Z planes for all the studied concentrations. For each concentration the time is normalized by the previously obtained effective thermal diffusivity of the medium and the constant unit domain size in the X direction. The results are shown for t = 0.05. The averaged profiles agree very well with the analytical solution of the corresponding one-dimensional thermal diffusion problem. This shows that the effective material property approach is well applicable to the transient part of the heat diffusion problems. The strength of our direct numerical simulation approach is that we can explicitly resolve the local three-dimensional structure of a diffusion front. Fig. 
7 presents snapshots of the time evolution of the studied setup. The left-most column presents 128 inclusions for the three studied concentrations. The middle column presents the same real-time snapshot for all the simulations. The two threedimensional iso-contours are plotted for temperature values 0.7 and 0.1. The color on these iso-surfaces denotes the temperature gradient. The 2D cut at the back of the 3D cube shows temperatures on a chosen X–Y plane. The same plane is later shown alone in the last column. where i = 1, 2, 3 is the spatial coordinate index, repeated indexes eq imply summation, c, , ˛, ˜ cf , D, and v are the concentration of a trace component, density, kinetic reaction constant, equilibrium concentration, diffusion coefficient, porosity and velocity, respectively. Subscript “f” refers to fluid phase and subscript “s” refers to only solids that contain the trace component. First two equations are the mass balances of the trace element in fluid and solid phases. Third equation is the total mass balance of fluid and solid matter in which the net volumetric effect of the dissolution–precipitation reactions is neglected. We aim at resolving stiffness of the final system of equations arising due to fast kinetics while keeping diffusion terms to be able to study transient effects of reactive transport. To keep this example simple, we set solid velocity and diffusion coefficient to zero as negligible compare to the fluid’s velocity and diffusivity: ∂(cf · f ) eq = −qi · ∇ i cf + ∇ i · (f · Df · ∇ i cf ) + ˛ ˜ · (cf − cf ), ∂t ∂Cs eq = −˛ ˜ · (cf − cf ), q = f · f · vif , ∂t s Cs = cs · s · , f = const. f (23) Finally, using characteristic length scale of reservoir, L, and choosing time scale to eliminate diffusion coefficient yields following nondimensional system of equations: ∂(cf · f ) eq = −qi · ∇ i cf + ∇ i · (f · ∇ i cf ) + ˛ · (cf − cf ), ∂t ∂Cs eq = −˛ · (cf − cf ) ∂t (24) First two terms of the right side of the first equation describe the transport (advection and diffusion). The last term represents the reactions part. Reaction is possible only if solid is present: ˛L ˜ 2 Df 0 if Cs > 0 6.5. Reactive transport ˛= In this section we briefly present an application of the parabolic fractional step solver as a component of a reactive transport solver. Subsurface fluid circulation brings fluids out of chemical equilibrium and causes rich variety of phenomena ranging from channelling and fingering instabilities to fluidization and explosive eruptions. A number of issues have to be resolved by numerical modelling. Both thermal and volumetric effects of chemical reactions are of primary importance. Porosity alteration by dissolution–precipitation processes results in strong and nonlinear variation of key model properties, such as permeability. A much simple model of the reactive transport problem considered here is stated as Generally, f can change during the simulation since chemical reactions may enhance and reduce porosity. We choose to resolve initial heterogeneity of porosity but we do not change it with time for simplicity reason. ∂(cf · f · f ) ∂t = −∇ i · (cf · f · f · vif ) + ∇ i · (f · f · Df · ∇ i cf ) + ˛ ˜ · f · (cf − cf ), eq ∂(cs · s · s ) ∂t ˜ · f · (cf − cf ), = −∇ i · (cs · s · s · vis )+∇ i · (s · s · Ds · ∇ i cs )− ˛ eq ⎛⎛ ∇ i · ⎝⎝ ⎞ ⎞ · ⎠ · vis + f · f · (vif − vis )⎠ all phases =− ∂ all phases ∂t · ≈0 (22) (25) if Cs = 0 6.6. 
Numerical results A straight-forward spatial discretization of equations (264) with minimum grid spacing dx that includes all three operators (advection, diffusion and reaction) in a single system of equations can be done using either the finite difference (FD), or the finite element (FEM) method. This approach is often called a full physics model and is employed in many industry-standard reservoir simulators. The time integration of equations (274) can be computed using an explicit Euler scheme. However, the time step restriction due to explicit treatment of advection, diffusion and reaction are of the order of dx/max(q), dx2 and 1/˛, respectively. Effectively, time integration requires a prohibitively large number of time steps for high resolution and fast kinetics problems we aim at. Unconditionally stable implicit schemes remove the time step restriction. However, the numerical solution of large systems of equations for three-dimensional discretizations is too CPU and memory expensive. A commonly used approach is to apply fractional step idea to the individual physical processes that are part of equations (284). Previously in this paper we referred to Fractional Steps only in the context of spatial operator splitting. Here, operator splitting is also M. Krotkiewski et al. / Physics of the Earth and Planetary Interiors 171 (2008) 122–136 133 Fig. 8. A comparison of the results of a reaction–advection–diffusion solver for two different spatial resolutions. Pictures present the dissolution front at two different moments during the simulation. Iso-contours denote different values of concentration c (blue: c = 0; red: c = maximum concentration). The upper frames present a 400 × 500 × 45 model, the lower ones—1600 × 2000 × 140. The dt used is equal to min(dx, dy, dz). Pictures on the left present an early stage of the dissolution process, the ones on the right present a later stage. Advection flux q = (dx, 0, 0). Reaction rate is fast compared to the diffusion rate. applied according to the physical processes. In short, diffusion, advection and reaction can be solved successively. In the case of equation (294) one time step consists of the following three phases (Fractional Steps): 1. Apply diffusion operator c(n + 1/3) = diffusion(c(n)). 2. Apply advection operator c(n + 2/3) = advection(c(n + 1/3)). 3. Compute reaction c(n + 1) = reaction(C(n + 2/3), phi(n)). This approach is first-order accurate in time. The severe restriction on the time step introduced by the diffusion operator for high resolution models can be removed by employing an unconditionally stable solver for the diffusion part. Our choice is to use the Fractional Steps method (14). Reaction part is an ordinary differential equation. It can be solve by unconditionally stable implicit backward Euler scheme without solving large system of equations. Hence, we may now run the simulation with time step of order of dx/max(q). Fig. 8 presents the results of an example study. We have used real-world porosity data of the Gullfaks oil reservoir to compute the effective diffusivity field D. In the initial stage the fluid is in equilibrium with the solid, i.e. the fluid is a fully saturated solution. The model is flushed with pure fluid (cf = 0) from the left side, which is implemented as a Dirichlet concentration boundary condition. For simplicity, the advection flux q is constant in time and uniform in space, i.e. the fluid is transported along the X direction. 
Fig. 8 presents the results of an example study. We have used real-world porosity data of the Gullfaks oil reservoir to compute the effective diffusivity field D. In the initial stage the fluid is in equilibrium with the solid, i.e. the fluid is a fully saturated solution. The model is flushed with pure fluid (c_f = 0) from the left side, which is implemented as a Dirichlet concentration boundary condition. For simplicity, the advection flux q is constant in time and uniform in space, i.e. the fluid is transported along the X direction. Here we assume that the reaction speed is large compared to the diffusivity D, i.e. whenever fluid that is out of equilibrium flows through the solid, the reaction takes place instantaneously until equilibrium is reached or all the substance is dissolved. This is implemented with a large parameter α = 250, which results in a rather sharp, narrow dissolution front.

Fig. 8. A comparison of the results of the reaction–advection–diffusion solver for two different spatial resolutions. The pictures present the dissolution front at two different moments during the simulation. Iso-contours denote different values of the concentration c (blue: c = 0; red: c = maximum concentration). The upper frames present a 400 × 500 × 45 model, the lower ones a 1600 × 2000 × 140 model. The dt used is equal to min(dx, dy, dz). The pictures on the left present an early stage of the dissolution process, the ones on the right a later stage. The advection flux is q = (dx, 0, 0). The reaction rate is fast compared to the diffusion rate.

7. Conclusions

We have presented an efficient implementation of ADI/LOD type methods for three-dimensional problems on modern commodity SMP architectures. Special attention has been paid to optimizing the memory bandwidth usage, which proved much more important than optimizing the cache reuse. Our approach is largely applicable to other modern cache-based architectures, such as those from Intel, AMD and IBM. Our tests have been performed on an eight-way Opteron system with DDR333 memory. The optimized sequential implementation of the Y and Z sweeps is comparable in execution time to just copying the data in the memory. The time needed to perform one complete LOD timestep is approximately twice that of an explicit scheme. The efficiency of the parallel implementation on 8 CPUs is close to perfect. Scalability when using the second core is limited by memory bandwidth starvation. Computing one timestep of the LOD scheme on a system of 1000³ unknowns on 8 CPUs takes 11 s.

The LOD scheme has been validated using a computational model of an isolated inclusion subject to a constant far field flux as a benchmark problem. The effective thermal conductivity of composites consisting of multiple inclusions was studied numerically and found to be in perfect agreement with analytical predictions based on the differential effective medium approach. Our implementation of the LOD scheme allows us to resolve the complex structure of the diffusion front in a strongly heterogeneous medium with numerous inclusions. The results show that the effective material property approach is also suitable for the transient part of heat diffusion problems. Finally, we have applied the Fractional Steps approach to a reaction–advection–diffusion problem, effectively removing the severe timestep restriction introduced by the diffusion operator.

Appendix A. Operations on multidimensional data

Three-dimensional arrays are stored linearly in the computer's memory, as shown in Fig. 9. The data is transferred from the main memory to the CPU cache in blocks called cache lines (usually 64 bytes), thus the cost of accessing a whole cache line is the same as that of accessing a single value. Using all the values from a given cache line at least once before it is evicted from the cache back to the main memory significantly decreases the access cost per value.

Fig. 9. Three-dimensional array placement in the computer's memory.

The pseudo-code fragments sketched below show two possible implementations of the difference operator in the Z dimension of a 3D Cartesian domain; the only difference between them is the order of the loops. Because of the loop order, the first implementation will be referred to as ijk and the second one as kji. The traversal of the 3D array for these two implementations is presented in Figs. 10 and 11, respectively. The innermost loop in the kji version is consistent with the linear memory layout, thus it is made sure that all the values from every loaded cache line are used at least once.
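A minimal sketch of the two loop orders follows. We assume an nx × ny × nz array stored with the i (X) index fastest in memory, as in Fig. 9, so that element (i, j, k) is found at offset (k*ny + j)*nx + i; the function and array names are illustrative, and the stencil is a simple backward difference in Z.

/* "ijk" version: the innermost loop runs over k, i.e. it jumps through
   memory with a stride of nx*ny elements for every single update.          */
void diff_z_ijk(float *B, const float *A, int nx, int ny, int nz)
{
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            for (int k = 1; k < nz; k++)
                B[(k*ny + j)*nx + i] =
                    A[(k*ny + j)*nx + i] - A[((k - 1)*ny + j)*nx + i];
}

/* "kji" version: the innermost loop runs over i and walks linearly through
   memory, so every value of each loaded cache line is used and hardware
   prefetching stays active.                                                 */
void diff_z_kji(float *B, const float *A, int nx, int ny, int nz)
{
    for (int k = 1; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                B[(k*ny + j)*nx + i] =
                    A[(k*ny + j)*nx + i] - A[((k - 1)*ny + j)*nx + i];
}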
The importance of this practice is best reflected in the following performance comparison. For a 3D array of size nx = ny = nz = 1000 and single precision floating point numbers, the kji implementation of the difference operator in the Z direction takes 6.9 s, while the ijk implementation takes 95 s. For the difference operator in the X direction the execution time is 4.9 s, the same as just copying an array of this size from one place to another in the memory.

On modern architectures the cost of accessing RAM is further decreased by assuring that the data is continuously read while the CPU is busy performing computations. This technique is known as prefetching and is activated automatically by the CPU or the compiler, provided that subsequent cache lines are processed in order. Throughout the paper we refer to this approach as linear memory access. In the kji implementation we traverse the array along the direction of the memory layout (the i index), therefore prefetching is used during execution.

Fig. 10. Direction of memory access during the Z difference operator, ijk loop order.

Fig. 11. Direction of memory access during the Z difference operator, kji loop order.

Appendix B. SSE vectorization

Most modern CPUs are capable of limited vectorization of floating point operations, as long as the vector elements are located next to each other in the memory. On the x86 architecture this is realized through the SSE2 instruction set, which makes it possible to perform four single precision or two double precision operations at the same time. This feature usually has to be activated using special compiler flags: with icc we use the '-xW -O3' flags, and with gcc the equivalent would be '-O3 -msse2'. In a common vector notation, the second (kji) loop from Appendix A is executed by the CPU four single precision elements at a time. To make vectorization possible, the vectors operated on need to be stored linearly in the memory. Although the performance improvement is negligible for a simple difference operator, it becomes much more pronounced in more computationally heavy code.

Appendix C. Cache reuse

As noted in Appendix A, one should always use all the data from a cache line at least once. However, this simple approach does not assure that a given cache line will not be loaded from RAM again during later computations. A common efficient programming paradigm is based on maximizing the cache reuse, i.e. using every cache line in all or most of the required computations. In the case of stencil operations this leads to so-called blocking or tiling (Rivera and Tseng, 2000). In short, a 3D grid is divided into smaller cubes that fit completely into the CPU's cache, and all required operations on a given block are performed before moving to the next one. In the past years a lot of work has been done on designing cache-friendly algorithms. As described by Kamil et al. (2005), many of them have become ineffective on modern architectures due to the increased impact of data prefetching and linear memory access on the overall performance. Our results agree with these findings. In the simple example of the difference operator, optimal cache reuse in the Z direction could at best decrease the execution time from 6.9 to 4.9 s. For problems with relatively more computations per data access the possible gains are even smaller. Therefore, in our approach we do not explicitly optimize cache reuse and only rely on linear memory access for improving performance.
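For comparison, a cache-blocked (tiled) variant of the kji Z-difference operator, in the spirit of Rivera and Tseng (2000), is sketched below; the tile sizes BJ and BK and all names are our illustrative assumptions, not the authors' code. As discussed above, for this simple stencil the achievable gain is limited to the difference between 6.9 s and the 4.9 s copy-bound time, which is why our implementation relies on linear memory access instead.

/* Sketch: tiled kji Z-difference operator. BJ and BK are chosen so that a
   (BJ x BK) tile of grid lines fits in the CPU cache; values illustrative.  */
#define BJ 16
#define BK 16

void diff_z_kji_tiled(float *B, const float *A, int nx, int ny, int nz)
{
    for (int kk = 1; kk < nz; kk += BK)
        for (int jj = 0; jj < ny; jj += BJ)
            /* work on one (j,k) tile; i stays innermost so memory is still
               traversed linearly within the tile                            */
            for (int k = kk; k < nz && k < kk + BK; k++)
                for (int j = jj; j < ny && j < jj + BJ; j++)
                    for (int i = 0; i < nx; i++)
                        B[(k*ny + j)*nx + i] =
                            A[(k*ny + j)*nx + i] - A[((k - 1)*ny + j)*nx + i];
}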
References

Allmann, S., Rauber, T., Runger, G., 2001. Cyclic reduction on distributed shared memory machines. In: Proceedings of the Ninth Euromicro Workshop on Parallel and Distributed Processing, Mantova, Italy, pp. 290–297.
Birkhoff, G., Varga, R.S., 1959. Implicit alternating direction methods. Transactions of the American Mathematical Society 92 (1), 13.
Brandt, A., 1977. Multilevel adaptive solutions to boundary-value problems. Mathematics of Computation 31 (138), 333–390.
Browne, S., Dongarra, J., Garner, N., Ho, G., Mucci, P., 2000. A portable programming interface for performance evaluation on modern processors. International Journal of High Performance Computing Applications 14 (3), 189–204.
Bruggeman, D.A.G., 1936. Calculation of various physical constants of heterogeneous substances. II. Dielectric constants and conductivity of non-regular multi-crystal systems. Annalen der Physik 25 (7), 645–672.
Conte, S.D., de Boor, C., 1980. Elementary Numerical Analysis: An Algorithmic Approach. McGraw-Hill Higher Education, 408 pp.
Courant, R., Friedrichs, K., Lewy, H., 1967. On the partial difference equations of mathematical physics. IBM Journal of Research and Development 11 (2), 215.
Dabrowski, M., Krotkiewski, M., Schmid, D.W., 2008. MILAMIN: MATLAB-based finite element method solver for large problems. Geochemistry Geophysics Geosystems 9.
Douglas, J., 1955. On the numerical integration of \partial^2 u/\partial x^2 + \partial^2 u/\partial y^2 = \partial u/\partial t by implicit methods. Journal of the Society for Industrial and Applied Mathematics 3 (1), 42–65.
Douglas, J., 1962. Alternating direction methods for three space variables. Numerische Mathematik 4 (1), 41.
Douglas Jr., J., Garder, A.O., Pearcy, C., 1966. Multistage alternating direction methods. SIAM Journal on Numerical Analysis 3 (4), 570.
Douglas Jr., J., Rachford Jr., H.H., 1956. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society 82 (2), 421.
Douglas, J., Kim, S., 2001. Improved accuracy for locally one-dimensional methods for parabolic equations. Mathematical Models & Methods in Applied Sciences 11 (9), 1563–1579.
Egecioglu, O., Koc, C.K., Laub, A.J., 1989. A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors. Journal of Computational and Applied Mathematics 27, 95–108.
Eshelby, J.D., 1957. The determination of the elastic field of an ellipsoidal inclusion, and related problems. Proceedings of the Royal Society of London Series A: Mathematical and Physical Sciences 241 (1226), 376–396.
Gander, W., Golub, G.H., 1997. Cyclic reduction: history and applications. In: Proceedings of the Workshop on Scientific Computing. Springer-Verlag, Hong Kong.
Hackbusch, W., 1981. Fast numerical solution of time-periodic parabolic problems by a multigrid method. SIAM Journal on Scientific and Statistical Computing 2 (2), 198–206.
Hockney, R.W., 1965. A fast direct solution of Poisson's equation using Fourier analysis. Journal of the ACM 12 (1), 95–113.
Hofhaus, J., Van de Velde, E.F., 1996. Alternating-direction line-relaxation methods on multicomputers. SIAM Journal on Scientific Computing 17 (2), 454–478.
Janenko, N.N., 1971. The Method of Fractional Steps: The Solution of Problems of Mathematical Physics in Several Variables, vol. VIII. Springer, Berlin, 160 pp.
Kamil, S., Husbands, P., Oliker, L., Shalf, J., Yelick, K., 2005. Impact of modern memory subsystems on cache optimizations for stencil computations. In: Proceedings of the 2005 Workshop on Memory System Performance. ACM, Chicago, IL.
Kurzak, J., Dongarra, J., 2007. Implementation of mixed precision in solving systems of linear equations on the Cell processor. Concurrency and Computation: Practice and Experience 19 (10), 1371–1385.
Langou, J., et al., 2006. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems). In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. ACM, Tampa, Florida.
Larsson, S., Thomee, V., Zhou, S.Z., 1995. On multigrid methods for parabolic problems. Journal of Computational Mathematics 13 (3), 193–205.
Lubich, C., Ostermann, A., 1987. Multigrid dynamic iteration for parabolic equations. BIT 27 (2), 216–234.
Marchuk, G.I., 1990. Splitting and alternating direction methods. In: Handbook of Numerical Analysis, vol. 1. Elsevier, Amsterdam, pp. 197–462.
McCalpin, J.D., 1991–2007. STREAM: Sustainable Memory Bandwidth in High Performance Computers. A continually updated technical report.
Overman, A., Rosendale, J.V., 1993. Mapping robust parallel multigrid algorithms to scalable memory architectures. In: Melson, N.D., Manteuffel, T.A., McCormick, S.F. (Eds.), Sixth Copper Mountain Conference on Multigrid Methods. NASA, Hampton, VA, pp. 635–647.
Peaceman, D.W., Rachford, H.H., 1955. The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics 3 (1), 28–41.
Pearcy, C., 1962. On convergence of alternating direction procedures. Numerische Mathematik 4 (1), 172.
Povitsky, A., 1999. Parallelization of pipelined algorithms for sets of linear banded systems. Journal of Parallel and Distributed Computing 59 (1), 68–97.
Rivera, G., Tseng, C.-W., 2000. Tiling optimizations for 3D scientific computations. In: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CD-ROM). IEEE Computer Society, Dallas, TX, United States.
Stone, H.S., 1973. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM 20 (1), 27–38.
Strzodka, R., Göddeke, D., 2006. Mixed precision methods for convergent iterative schemes. In: Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures.
Wakatani, A., 2004. A parallel and scalable algorithm for ADI method with prepropagation and message vectorization. Parallel Computing 30 (12), 1345–1359.
Weber, L., Dorn, J., Mortensen, A., 2003. On the electrical conductivity of metal matrix composites containing high volume fractions of non-conducting inclusions. Acta Materialia 51 (11), 3199–3211.
Wesseling, P., 2004. An Introduction to Multigrid Methods.
Whaley, R.C., Petitet, A., Dongarra, J.J., 2001. Automated empirical optimization of software and the ATLAS project. Parallel Computing 27 (1/2), 3–35.
Widlund, O.B., 1966. On the rate of convergence of an alternating direction implicit method in a noncommutative case. Mathematics of Computation 20 (96), 500.