
Physics of the Earth and Planetary Interiors 171 (2008) 122–136
Fractional Steps methods for transient problems on commodity
computer architectures
M. Krotkiewski ∗ , M. Dabrowski, Y.Y. Podladchikov
Physics of Geological Processes, University of Oslo, Pb 1048 Blindern, 0316 Oslo, Norway
Article history:
Received 30 October 2007
Received in revised form 15 July 2008
Accepted 4 August 2008
Keywords:
ADI
Fractional Steps
Locally One-Dimensional
Parabolic
Hyperbolic
Commodity
Abstract
Fractional Steps methods are suitable for modeling transient processes that are central to many geological
applications. Low memory requirements and modest computational complexity facilitate calculations
on high-resolution three-dimensional models. An efficient implementation of Alternating Direction
Implicit/Locally One-Dimensional schemes for an Opteron-based shared memory system is presented. The
memory bandwidth usage, the main bottleneck on modern computer architectures, is specially addressed.
High efficiency of above 2 GFlops per CPU is sustained for problems of 1 billion degrees of freedom. The
optimized sequential implementation of all 1D sweeps is comparable in execution time to copying the
used data in the memory. Scalability of the parallel implementation on up to 8 CPUs is close to perfect. Performing one timestep of the Locally One-Dimensional scheme on a system of 1000³ unknowns on 8 CPUs
takes only 11 s. We validate the LOD scheme using a computational model of an isolated inclusion subject to a constant far field flux. Next, we study numerically the evolution of a diffusion front and the
effective thermal conductivity of composites consisting of multiple inclusions and compare the results
with predictions based on the differential effective medium approach. Finally, application of the developed parabolic solver is suggested for a real-world problem of fluid transport and reactions inside a
reservoir.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction
Geological systems are usually heterogeneous and exhibit large
material property contrasts. They are often formed by multiphysics processes interacting on many temporal and spatial scales.
In order to understand these systems, numerical models are frequently employed. Appropriate resolution of the behavior of these heterogeneous systems, without the (over-)simplifications of a priori applied homogenization techniques, requires numerical methods capable of efficiently and accurately dealing with high-resolution models.
A popular technique is the finite element method (FEM) combined with unstructured meshes capable of dealing with the
geometrical complexities of geological problems. In these methods a linear system of equations is often assembled into a sparse
matrix and solved. While this can be successfully done for two-dimensional models with high resolution even on a modern
desktop computer (Dabrowski et al., 2008), three-dimensional
problems require supercomputers and sophisticated numerical
methods. Direct solvers are infeasible due to enormous memory
requirements and large computational times. Iterative solvers (like
Conjugate Gradients) are an available alternative, but they require
good, problem dependent preconditioners that can be efficiently
parallelized.
Methods based on structured meshes, although less accurate in
terms of geometry representation, are often employed. The known
structure of the mesh makes them cheaper in terms of memory
requirements, and can significantly decrease the computational
cost. The state-of-the-art method for structured meshes is multigrid (Brandt, 1977; Hackbusch, 1981; Wesseling, 2004). For the
steady state Poisson equation on uniform grids it converges in a few
iterations, regardless of the resolution. Applications to parabolic
problems have also been widely studied (Larsson et al., 1995; Lubich
and Ostermann, 1987). However, the formulation for heterogeneous materials and variable grids is more complicated, as is an
efficient parallel implementation (Overman and Rosendale, 1993).
For relatively simple, well conditioned transient problems a lighter
method may be more suitable.
Classical explicit finite difference methods are much simpler
both in terms of understanding and efficient parallel implementation. Usually, however, they are impractical for high-resolution
problems due to severe timestep restriction. Operator splitting
techniques try to overcome this limitation. The general idea is
to divide a single timestep into a sequence of one-dimensional
implicit solves along the spatial dimensions of the domain. The
computational cost of a single timestep of such schemes is comparable to that of explicit methods, but the timestep restriction can
be avoided.
We first present the class of Fractional Steps methods for
transient problems, such as heat diffusion (parabolic) and wave
propagation (hyperbolic). The suitability of a contemporary shared
memory, Opteron-based commodity architecture for this approach
is later investigated. We focus on high resolution 3D problems with
up to 1000³ degrees of freedom and heterogeneous material properties. We use the second-order space discretization of the underlying equations, which for the Fractional Steps methods results in
one-dimensional, tridiagonal systems of linear equations. An optimized algorithm for efficient computation and solution of such
systems on an eight-way Dual-core Opteron machine is presented.
2. Fractional step method
Consider an initial-boundary value parabolic problem of heat conduction:

$$\rho c_p \frac{\partial T}{\partial t} = \operatorname{div}\!\big(k\,\operatorname{grad} T\big) + f \quad \text{in } \Omega \times \Theta,\qquad T = \bar{T} \quad \text{on } \partial\Omega \times \Theta,\qquad T = T_0 \quad \text{in } \Omega \text{ for } t = 0 \tag{1}$$

where T denotes temperature, k is the thermal conductivity, f is the heat generation term, and ρcp is the product of density and specific heat capacity. The thermal field $\bar{T}$ is prescribed on the boundary ∂Ω × Θ, where Ω denotes the spatial domain and Θ is the temporal domain. The initial conditions T0 are given in the whole Ω.
In the case of three-dimensional homogeneous media, the finite difference (FD) discretization A of the operator (ρcp)⁻¹ div(k grad …) on the uniform Cartesian grid yields:

$$A = A_x + A_y + A_z,\qquad A_x = \kappa\,\frac{\Delta_x \nabla_x}{(\Delta h)^2} \tag{2}$$

where κ = k/(ρcp) is the thermal diffusivity, Δh = Δx = Δy = Δz is the grid spacing, and (∇x T)i = Ti − Ti−1, (Δx T)i = Ti+1 − Ti denote the backward and forward difference operators in the x direction, respectively. Here, the subscript i indexes the discrete grid points in the X dimension.
The resulting spatial discretization of (1) in the one-dimensional case is

$$\frac{\partial T_i}{\partial t} = \kappa\,\frac{T_{i-1} - 2T_i + T_{i+1}}{(\Delta h)^2} \tag{3}$$

The operators Ay, Az are analogous to Ax, and in three dimensions A is the standard 7-point stencil for the Laplacian scaled by κ/(Δh)². For heterogeneous materials and non-uniform grids we use the finite volume spatial discretization, which in the one-dimensional case is stated as
$$(\rho c_p)_i\,\frac{\Delta x_{i+1/2} + \Delta x_{i-1/2}}{2}\,\frac{\partial T_i}{\partial t}
= \frac{k_{i-1/2}}{\Delta x_{i-1/2}}\,T_{i-1}
- \frac{\Delta x_{i-1/2}\,k_{i+1/2} + \Delta x_{i+1/2}\,k_{i-1/2}}{\Delta x_{i-1/2}\,\Delta x_{i+1/2}}\,T_i
+ \frac{k_{i+1/2}}{\Delta x_{i+1/2}}\,T_{i+1}
+ \frac{\Delta x_{i+1/2} + \Delta x_{i-1/2}}{2}\,f_i \tag{4}$$
where the fractional subscript indices correspond to the center of the edge between two neighboring points in the x direction, and Δx_{i−1/2} denotes the spatial distance between the grid points x_i and x_{i−1}. Since in our approach the conductivity k is defined in the centers of the 3D cells, it has to be averaged at that mid-edge point. The simplest approach is the arithmetic average:

$$k_{i-1/2} = \frac{1}{4\,(\Delta y_{j-1/2} + \Delta y_{j+1/2})(\Delta z_{k-1/2} + \Delta z_{k+1/2})}
\big(k_{i-1/2,\,j-1/2,\,k-1/2}\,\Delta y_{j-1/2}\,\Delta z_{k-1/2}
+ k_{i-1/2,\,j+1/2,\,k-1/2}\,\Delta y_{j+1/2}\,\Delta z_{k-1/2}
+ k_{i-1/2,\,j-1/2,\,k+1/2}\,\Delta y_{j-1/2}\,\Delta z_{k+1/2}
+ k_{i-1/2,\,j+1/2,\,k+1/2}\,\Delta y_{j+1/2}\,\Delta z_{k+1/2}\big) \tag{5}$$
and k_{i+1/2} is computed analogously. In the above formula, the subscript indices i, j and k are the spatial indices of the nodes in a three-dimensional Cartesian grid.
The classical explicit scheme used to integrate (1):

$$\frac{T^{t+1} - T^t}{\tau} = A T^t + f \tag{6}$$

is second-order accurate in the spatial coordinates, first-order in time, and is stable under the restriction τ ≤ (Δh)²/6κ (Courant et al., 1967). The superscript index t is used to denote the subsequent timesteps. The maximum admissible integration step τcr becomes very restrictive for refined grids, and in the case of non-uniform grid spacing it is determined by the size of the smallest cell. Moreover, in heterogeneous materials τcr is restricted by the strongest heterogeneity, even if it is insignificant in size. The computational complexity of explicit methods per integration step is small, but due to the timestep restriction they may require a large number of iterations to integrate the evolution of the system.
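For illustration, one explicit update (6) for the homogeneous, uniform-grid case is little more than a 7-point stencil sweep. The following C sketch (array and macro names are ours, not taken from the paper's code; boundary handling is omitted) makes explicit the low ratio of floating point work to memory traffic that motivates the later discussion of memory bandwidth.

```c
#include <stddef.h>

/* Index of grid point (i,j,k) in an array stored with i (X) fastest. */
#define IDX(i, j, k, nx, ny) ((size_t)(i) + (size_t)(nx) * ((size_t)(j) + (size_t)(ny) * (size_t)(k)))

/* One explicit step of Eq. (6): Tnew = T + tau * (kappa/h^2 * Laplacian(T) + f),
   homogeneous medium, uniform spacing h; boundary points are left untouched. */
void explicit_step(const float *T, float *Tnew, const float *f,
                   int nx, int ny, int nz, float kappa, float h, float tau)
{
    const float c = kappa / (h * h);
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                size_t p = IDX(i, j, k, nx, ny);
                float lap = T[IDX(i-1, j, k, nx, ny)] + T[IDX(i+1, j, k, nx, ny)]
                          + T[IDX(i, j-1, k, nx, ny)] + T[IDX(i, j+1, k, nx, ny)]
                          + T[IDX(i, j, k-1, nx, ny)] + T[IDX(i, j, k+1, nx, ny)]
                          - 6.0f * T[p];
                Tnew[p] = T[p] + tau * (c * lap + f[p]);
            }
}
```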
In order to alleviate the timestep restriction an implicit method can be used. The second-order accurate in time Crank–Nicolson scheme:

$$\frac{T^{t+1} - T^t}{\tau} = \frac{1}{2} A\big(T^{t+1} + T^t\big) + f \tag{7}$$

is unconditionally stable (no restriction on τ). It requires solving a system of linear equations:

$$\Big(I - \frac{\tau}{2} A\Big) T^{t+1} = \Big(I + \frac{\tau}{2} A\Big) T^t + \tau f \tag{8}$$
A in our case is symmetric and sparse, i.e. most of the matrix entries are zeros. For 1D problems A is a tridiagonal matrix and the system can be easily solved using the Thomas Algorithm. In the 2D case it is practical to use variants of the well-known Gaussian elimination, e.g. the Cholesky factorization for symmetric positive definite systems, LU factorization with pivoting for non-symmetric systems, or a generalized Thomas Algorithm for block tridiagonal matrices. Sparse direct solvers are easy to use, robust and very efficient on modern computers. Unfortunately, they cannot be applied to large 3D problems because of extreme memory requirements and computational complexity: the number of new non-zero entries (fill) introduced during the factorization is much larger than the number of the original non-zeros. Various iterative methods, like Conjugate Gradients, usually combined with some preconditioner, can be used instead. However, for strongly heterogeneous problems the convergence rate deteriorates. Moreover, preconditioners are problem dependent and finding a good, parallelizable one is often difficult.
The fractional step methods described in this paper combine the
advantages of the two mentioned strategies for time integration of
system (1): low computational cost of the explicit scheme, and the
stability of the implicit approach. The general idea is to replace the complex operator A by simpler operators that are applied in sequence (Fractional Steps) during the integration of a parabolic system like (1). In the context of the heat diffusion operator the split is naturally dictated by the spatial components Ax, Ay, Az. During every fractional step many one-dimensional (in our case tridiagonal) systems of equations have to be solved.
2.1. Alternating Direction Implicit schemes
The Alternating Direction Implicit (ADI) scheme for two-dimensional parabolic problems was proposed by Peaceman and
Rachford (1955) and Douglas (1955):
$$\frac{T^{t+1/2} - T^t}{\tau} = \frac{1}{2}\big(A_x T^{t+1/2} + A_y T^t\big),\qquad
\frac{T^{t+1} - T^{t+1/2}}{\tau} = \frac{1}{2}\big(A_x T^{t+1/2} + A_y T^{t+1}\big) \tag{9}$$
In the above formula and throughout the paper, a fractional superscript time index (e.g. T^{t+1/2}) is used to describe the fractional (incomplete) steps. The values of T defined at those intermediate steps have no physical meaning; only T at full timesteps, denoted by integer superscript indices, gives an approximation to the temperature field. To define the stability, we introduce the difference step operator C(τ, h), defined as the action of the scheme over the whole step:

$$T^{t+1} = C(\tau, h)\,T^t \tag{10}$$

The stability of the scheme requires that ||C(τ, h)|| ≤ 1. The algorithm (9) is unconditionally stable, i.e. it is stable for any τ ≥ 0. In addition to the truncation error of the Crank–Nicolson implicit scheme (7), the error term related to the splitting is O(τ²). Thus, ADI approximates the original problem with the same order of accuracy. However, the additional error term may be large, and improvements to the original scheme have been suggested (Douglas and Kim, 2001).
The 2D ADI scheme is also applicable as an iterative solver for the steady-state variant of problem (1). In whole steps, using the two-layer difference scheme notation (Janenko, 1971), it is given by

$$\frac{T^{t+1} - T^t}{\tau} = \Omega_1 T^{t+1} + \Omega_2 T^t,\qquad
\Omega_1 = \frac{A_x + A_y}{2} - \frac{\tau}{4}\,A_x A_y,\qquad
\Omega_2 = \frac{A_x + A_y}{2} + \frac{\tau}{4}\,A_x A_y \tag{11}$$

It is clearly seen that for the 2D ADI scheme the relation A = Ω1 + Ω2 is satisfied for any τ, which assures that the scheme converges to the steady-state solution independently of the value of the pseudo-timestep. This condition is referred to as complete consistency. The choice of an optimal parameter sequence τ1, . . ., τn, as well as other techniques accelerating the convergence, are summarized, e.g., in Marchuk (1990).
The operators Ax, Ay obtained for homogeneous media on uniform grids commute, i.e.

$$A_x A_y = A_y A_x \tag{12}$$
In practice the commutativity condition (12) proves to be important
in deriving properties of the fractional step schemes like stability and consistency. The convergence of the two-dimensional ADI
scheme without the requirement of the commutativity of the operators Ax , Ay was discussed by Birkhoff and Varga (1959). Further
considerations related to the rate of convergence and parameter
choice can be found in Pearcy (1962) and Widlund (1966).
The simple extension of the ADI scheme to three-dimensional
cases results in the loss of unconditional stability (Janenko, 1971).
The stable version was proposed by Douglas and Rachford (1956):
$$\frac{T^{t+1/3} - T^t}{\tau} = A_x T^{t+1/3} + A_y T^t + A_z T^t,\qquad
\frac{T^{t+2/3} - T^{t+1/3}}{\tau} = A_y\big(T^{t+2/3} - T^t\big),\qquad
\frac{T^{t+1} - T^{t+2/3}}{\tau} = A_z\big(T^{t+1} - T^t\big) \tag{13}$$
The scheme is proven to be unconditionally stable for homogeneous
media and is completely consistent. The second-order accurate in
time version was suggested later by Douglas (1962).
The pair-wise commutativity is required for the stability of the
classical ADI schemes in 3D, thus they are not stable for heterogeneous materials. This limits their use as iterative solvers for steady
state problems. For non-commuting positive-definite operators a multistage alternating direction method was suggested (Douglas et al., 1966).
2.2. Locally One-Dimensional schemes
Splitting algorithms belong to another class of fractional step techniques:
$$\frac{T^{t+1/3} - T^t}{\tau} = \frac{1}{2} A_x\big(T^t + T^{t+1/3}\big),\qquad
\frac{T^{t+2/3} - T^{t+1/3}}{\tau} = \frac{1}{2} A_y\big(T^{t+1/3} + T^{t+2/3}\big),\qquad
\frac{T^{t+1} - T^{t+2/3}}{\tau} = \frac{1}{2} A_z\big(T^{t+2/3} + T^{t+1}\big) \tag{14}$$
In the above, each of the equations involves only one-dimensional
difference operators and the scheme is therefore categorized as
Locally One-Dimensional (LOD). A similar, fully implicit (backward
Euler) variant of (14) is possible. Both algorithms are unconditionally stable (Janenko, 1971).
The two-dimensional variant of the scheme (14) is second-order
accurate in time for homogeneous media, but this property is lost
for non-commutative operators Ax , Ay . It can be restored by introducing two-cycle splitting (xy sweeps followed by reversed order
yx sweeps) (Marchuk, 1990).
The two-dimensional version of the LOD scheme (14) for homogeneous media is identical in whole steps with ADI. However, this
equivalence holds only until boundary conditions are considered.
Imposing boundary conditions onto the one-dimensional sweeps
in (14) leads to the finite approximation error at points located next
to the boundaries. The resulting scheme in whole steps yields:

$$\Big(I - \frac{\tau}{2} A_x\Big)\Big(I - \frac{\tau}{2} A_y\Big) T^{t+1}
- \Big(I + \frac{\tau}{2} A_x\Big)\Big(I + \frac{\tau}{2} A_y\Big) T^t = R^n \tag{15}$$
where R^n = 0 everywhere inside the computational domain, except for the points near the boundaries. The method of undetermined functions is an elegant and efficient way of imposing boundary conditions: an auxiliary right-hand side vector is introduced that is determined by the requirement that R^n vanish. Similar considerations apply to the heat generation term, which needs to be modified before it enters the LOD scheme. The modified LOD scheme is strongly consistent (Janenko, 1971).
Similar approaches have been proposed for hyperbolic equations (Marchuk, 1990). The acoustic wave propagation problem and the corresponding Locally One-Dimensional scheme are given by
$$\frac{\partial^2 \varphi}{\partial t^2} - \sum_{n=1}^{3} \frac{\partial}{\partial x_n}\Big(k_n(x, t)\,\frac{\partial \varphi}{\partial x_n}\Big) = 0,$$

$$\frac{\varphi_x^{t+1} - \varphi_y^{t} - \varphi_z^{t} + \varphi_x^{t}}{3\tau^2} = \frac{1}{2} A_x\big(\varphi_x^{t+1} + \varphi_x^{t}\big),\qquad
\frac{\varphi_y^{t+1} - \varphi_x^{t+1} - \varphi_z^{t} + \varphi_y^{t}}{3\tau^2} = \frac{1}{2} A_y\big(\varphi_y^{t+1} + \varphi_y^{t}\big),\qquad
\frac{\varphi_z^{t+1} - \varphi_x^{t+1} - \varphi_y^{t+1} + \varphi_z^{t}}{3\tau^2} = \frac{1}{2} A_z\big(\varphi_z^{t+1} + \varphi_z^{t}\big),\qquad
\varphi_x^{t} = \varphi^{t+1/3},\quad \varphi_y^{t} = \varphi^{t+2/3},\quad \varphi_z^{t} = \varphi^{t+1} \tag{16}$$
M. Krotkiewski et al. / Physics of the Earth and Planetary Interiors 171 (2008) 122–136
Note the additional indexing of φ^t by x, y and z. This notation is introduced solely for the clarity of the formula and represents the fractional (intermediate) steps of the method, e.g. φ_x^t = φ^{t+1/3}.
3. Tridiagonal systems of equations
Fractional step methods consist of a sequence of implicit
one-dimensional sweeps through the domain. In our case of a
second-order spatial discretization of (1) a symmetric tridiagonal
system of linear equations needs to be solved. However, a straightforward application of both Dirichlet and Neumann boundary
conditions results in loss of the symmetry. For this reason, and since
no additional complexity is introduced to the solver, we will study
a more general case of a non-symmetric system:
$$A = \begin{pmatrix}
b_1 & c_1 &     &        & 0\\
a_2 & b_2 & c_2 &        & \\
    & a_3 & b_3 & \ddots & \\
    &     & \ddots & \ddots & c_{n-1}\\
0   &     &     & a_n    & b_n
\end{pmatrix} \tag{17}$$
where n denotes the number of points in the discretization. The
simplest algorithm to solve such a system is the Thomas algorithm
(TA), which is in fact the simplest case of the Gaussian elimination,
or LU decomposition (Conte and Boor, 1980).
3.1. Thomas algorithm
The Thomas algorithm consists of two phases: forward elimination and backward substitution, described by first-order linear
and non-linear recurrences presented in Eqs. (18) and (19), e.g.
pi = f(pi−1 ). Basic implementation of TA requires 8n floating point
operations, two vectors of length n (x, the result and rhs, the righthand side), two auxiliary vectors of length n (p and q) used during
the factorization process, and the tridiagonal matrix A of size 3n. TA
cannot be trivially executed in parallel due to the existence of the
first-order recurrences. The forward elimination phase is described
by the following:
$$p_1 = \frac{c_1}{b_1},\qquad p_i = \frac{c_i}{b_i - a_i\,p_{i-1}},\qquad
q_1 = \frac{rhs_1}{b_1},\qquad q_i = \frac{rhs_i - a_i\,q_{i-1}}{b_i - a_i\,p_{i-1}} \tag{18}$$

and the backward substitution phase can be written as

$$x_n = q_n,\qquad x_i = q_i - x_{i+1}\,p_i \tag{19}$$
The above formulas directly reflect the steps taken during the Gaussian elimination applied to this kind of matrices.
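As a reference, Eqs. (18) and (19) translate almost line-by-line into the following C routine. This is a minimal sketch with illustrative names, not the optimized, fused variant developed in Section 5.

```c
/* Thomas algorithm for the tridiagonal system (17):
   a = sub-diagonal (a[0] unused), b = diagonal, c = super-diagonal (c[n-1] unused),
   rhs = right-hand side, x = solution; p and q are the auxiliary vectors of Eq. (18). */
void thomas(const float *a, const float *b, const float *c,
            const float *rhs, float *x, float *p, float *q, int n)
{
    /* forward elimination, Eq. (18) */
    p[0] = c[0] / b[0];
    q[0] = rhs[0] / b[0];
    for (int i = 1; i < n; i++) {
        float denom = b[i] - a[i] * p[i - 1];
        p[i] = c[i] / denom;
        q[i] = (rhs[i] - a[i] * q[i - 1]) / denom;
    }
    /* backward substitution, Eq. (19) */
    x[n - 1] = q[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = q[i] - p[i] * x[i + 1];
}
```

Counting the multiplications, subtractions and divisions of the two loops gives the 8n floating point operations quoted above.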
3.2. Parallel tridiagonal solvers
The drawback of the basic TA is that it is strictly sequential.
Many parallel algorithms for the solution of tridiagonal systems
of equations have been developed. Cyclic reduction was proposed
by Hockney (1965) and has since been widely used on distributed
machines (Allmann et al., 2001). The idea is to recursively reduce
the number of equations by 2 until a system of only 2 equations
is obtained. Even in the sequential case it is often preferred to
the TA because of its natural ability to handle periodic boundary
conditions. The divide-and-conquer algorithm is a similar approach
(Gander and Golub, 1997). Recursive doubling was proposed by
Stone (1973) and its suitability for hypercube architectures was
investigated by Egecioglu et al. (1989). This algorithm is based on
recursive doubling solutions of linear recurrence relations, which in the case of tridiagonal systems allow the LU decomposition of the matrix to be computed in O(log2(p)) parallel steps, where p is the number
of processors. Solving tridiagonal systems on distributed architectures in the context of ADI-type of methods has been studied
by Wakatani (2004), who proposed a pre-propagation scheme for
solving first-order recurrences. As in the case of all the previously
mentioned approaches, parallelization of the solver has been done
at the expense of at least doubling the computational complexity
and increasing the memory requirements. Other known methods
include iterative relaxation schemes, like Gauss–Seidel, Jacobi, red-black line relaxation and segment relaxation. In the cases where
many independent tridiagonal systems of equations have to be
solved (like ADI-type of methods), there have been attempts to parallelize TA for distributed machines through pipelining (Povitsky,
1999). For a single 1D tridiagonal system divided between 2 CPUs, the first CPU performs half of the forward elimination phase and sends the required data to the second CPU. The second CPU finishes the forward elimination phase, performs half of the backward substitution phase and sends the required data back to the first CPU. Although for a single system of equations this approach is sequential, because only one CPU computes at any given time, it works well when many independent 1D systems have to be solved. For a comparison of the performance of
many of the above algorithms, see Hofhaus and VandeVelde (1996).
Since our interest is in ADI-type methods for 3D problems on shared memory architectures, we chose the TA for our parallel implementation because of its optimal computational complexity and modest memory requirements. As shown later in the
performance analysis section, we were able to obtain a scalable
parallel implementation without the need to use the pipelining
approach.
4. Implementation
A large class of numerical methods discretizing PDEs on structured meshes, such as Finite Differences implemented as stencil operations, only requires a small number of floating point operations per memory access. In recent years a growing discrepancy between memory and CPU speed has been observed (McCalpin, 1991–2007), which makes the memory bandwidth, rather than the CPU speed, the bottleneck. The theoretical peak
performances of modern CPUs are only achievable for algorithms
that are computationally heavy compared to the memory bandwidth requirements (e.g. BLAS Level3 operations (Whaley et al.,
2001)).
In the following section we demonstrate how to efficiently
implement a wide class of ADI/LOD methods for three-dimensional
problems, with special attention paid to the memory performance.
For all our performance tests we use an eight-way Opteron system with 2.4 GHz dual-core CPUs equipped with a 64 KB L1 cache, a 1 MB L2 cache and DDR333 memory. The code is written in C. The sequential version is compiled using the Intel and GNU gcc 3.2.3 C compilers. The parallel version is compiled only with the (better performing) Intel compiler. Compiler options are listed in Appendix I.
Some technical knowledge is useful in order to fully understand
our optimized 3D Fractional Steps/ADI implementation. Appendix I
presents the general idea of efficient traversal of multidimensional
arrays using the loop order notation commonly found in the computational literature. In short, 3D arrays are accessed using three
nested loops iterating over the three dimensions: i, j, k for X, Y and
Z dimensions respectively. There are six possible arrangements of
the loop nest structure: ijk, ikj, jik, jki, kij, kji. Loop structures with
i index as the innermost loop (array traversal along the X dimension) achieve what we call a linear memory access, which is crucial
for good performance. In this paper we show how to efficiently
handle Fractional Steps/ADI type of algorithms, where 1D implicit
sweeps need to access the data along all the dimensions, which is
in general not consistent with the linear memory layout. Appendix
II introduces the idea of SSE vectorization and the prerequisites for
its use in numerical codes. Appendix III describes general issues
associated with the cache reuse and shows why in certain cases it
is of less importance than the linear memory access, which enables
efficient use of memory prefetching and thus hides the memory
access time.
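To make the loop-order notation concrete, the sketch below (our illustration, assuming the X index i is the fastest-varying one in memory) contrasts a kji nest, whose innermost loop walks the array with unit stride, with an ijk nest, whose innermost loop jumps by nx·ny elements per iteration.

```c
#include <stddef.h>

/* Element (i,j,k) of a 3D array stored with i (X) contiguous in memory. */
#define IDX(i, j, k, nx, ny) ((size_t)(i) + (size_t)(nx) * ((size_t)(j) + (size_t)(ny) * (size_t)(k)))

/* kji nest: innermost loop over i -- linear (unit-stride) memory access. */
void traverse_kji(float *T, int nx, int ny, int nz)
{
    for (int k = 0; k < nz; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                T[IDX(i, j, k, nx, ny)] += 1.0f;
}

/* ijk nest: innermost loop over k -- every access strides by nx*ny elements,
   which defeats hardware prefetching and wastes memory bandwidth. */
void traverse_ijk(float *T, int nx, int ny, int nz)
{
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            for (int k = 0; k < nz; k++)
                T[IDX(i, j, k, nx, ny)] += 1.0f;
}
```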
The main goal of this part of the paper is to provide general guidelines on how to efficiently implement a wide class of Fractional Steps-type methods on modern RISC CPUs. We do not concentrate on the cache reuse. Instead, we stress the linear memory access and the code rearrangements that make SSE2 vectorization and effective memory prefetching possible. In this sense our results are general and do not require any architecture-specific, compile-time parameters, such as block sizes. They apply equally well to most modern CPUs (Intel, AMD, IBM), although the exact performance numbers will differ between architectures and compilers.
4.1. Performance measurements
A common measure of computational efficiency is flops (floating point operations per second). Different approaches can be used
to estimate this value. The number of floating point operations
required by the best known algorithm can be computed, but since the operations are often not the bottleneck, this number can be misleading. Another way is to count all the operations performed explicitly
in the source code. Unfortunately, because of heavy optimizations
performed by the compiler this number may differ from what the
CPU actually executes. Hardware counters capable of tracking performance statistics like cache misses and the actual number of flops
performed by a CPU recently became popular (Browne et al., 2000).
They are very useful when identifying performance problems in the
code, but this analysis can be complex for multi-core CPUs. In our
paper we chose to present values estimated directly from the fastest
obtained sequential implementation. In particular, we include all the compulsory operations that cannot be removed by the compiler (e.g. 8n in the Thomas algorithm), and we also include operations that could potentially be pre-computed before the execution and stored in auxiliary arrays; this is not done since it would either significantly increase the total memory requirements (threefold in the case of the conductivity parameter k), or even increase the execution time due to higher memory bandwidth requirements.
As we show later, averaging the conductivity on the fly does not
increase execution time much. Since our numbers are probably
slightly higher than the real ones due to compiler optimizations,
we also provide the absolute CPU time spent on the computations
and compare it to the simplest unimprovable case of making a copy
of a memory area.
In our performance tests we use single precision floating point numbers because they require half the memory of double precision numbers, and the floating point computations are faster. In the case of our implementation and the inspected architecture this results in more than a twofold speedup. Using single precision for computations wherever possible is growing more and more popular, especially with the introduction of high performance GPUs, FPGAs, or the Cell processor, on which the performance difference reaches one order of magnitude (Langou et al., 2006). Although single precision floats may not provide enough accuracy for certain problems, it has recently been widely studied how to combine them with double precision computations applied only to the accuracy-critical parts of the code and still obtain a full double precision result (Kurzak and Dongarra, 2007; Strzodka and Göddeke, 2006). With some care this approach can also be applied to ADI-type methods; however, it is beyond the scope of this paper. For now, our implementation can run in full double precision mode when required, and all our performance considerations also hold for this case.

5. LOD/ADI

A general engine of any ADI/LOD type of algorithm looks as follows:
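(A minimal sketch; the function names and signatures are illustrative, assuming the fully implicit backward Euler LOD variant of (14) studied below, with the source term applied in the X sweep.)

```c
/* Hypothetical sweep routines; each assembles and solves the tridiagonal systems of
   one fractional step along the given dimension (see the plane-wise Z sweep below). */
void sweep_x(float *T, const float *K, const float *rho_cp, const float *f,
             int nx, int ny, int nz, float tau);
void sweep_y(float *T, const float *K, const float *rho_cp,
             int nx, int ny, int nz, float tau);
void sweep_z(float *T, const float *K, const float *rho_cp,
             int nx, int ny, int nz, float tau);

/* One LOD timestep: three one-dimensional implicit sweeps applied in sequence.
   T is updated in place and serves as the right-hand side of every sweep. */
void lod_timestep(float *T, const float *K, const float *rho_cp, const float *f,
                  int nx, int ny, int nz, float tau)
{
    sweep_x(T, K, rho_cp, f, nx, ny, nz, tau);  /* (I - tau*Ax) T*      = T^t + tau*f */
    sweep_y(T, K, rho_cp,    nx, ny, nz, tau);  /* (I - tau*Ay) T**     = T*          */
    sweep_z(T, K, rho_cp,    nx, ny, nz, tau);  /* (I - tau*Az) T^{t+1} = T**         */
}
```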
A simple implementation of every 1D sweep consists of three
nested loops over i, j and k indices for X, Y and Z dimensions respectively, where the innermost loop is iterated along the direction of
the sweep. Here we study the simplest, backward Euler (as opposed
to Crank–Nicolson) variant of the LOD scheme presented in Eq. (14). In our implementation, during the tridiagonal matrix assembly the material parameters K assigned to the grid cells are averaged on the fly, which in the general case allows for transient material properties. Another material property, ρcp, is either averaged from the
surrounding cells to the nodes before the computations, or simply defined at the grid nodes. The right-hand side consists of the
old temperature values (since we study a backward Euler scheme)
and the source/sink term F is applied only during the X sweep.
Effectively during the X sweep we operate on four 3D arrays of
size nx × ny × nz (where temperatures are both read and written),
and during the Y/Z sweeps three arrays are used. The best achievable performance, estimated solely from the time required to read and write the above-mentioned data in memory, is around 10 s for the X sweep and around 8 s for the Y/Z sweeps (see Appendix I for the explanation). We perform ca. 50 operations per point, where most of the operations are related to averaging the K properties from the cell centers to the mid-edge points (5). While this step could be performed before running the solver, it would increase the total memory requirements for the K array threefold, and increase the per-sweep memory requirements three times for the schemes requiring derivatives in the other dimensions. Consequently, precomputing these values would actually increase the execution time. We show
that computing these values during the sweeps does not affect the
execution times much.
For the X sweep the best order of the loops is kji (see Appendix I
for the detailed explanation of the loop nest notation and the impact
loop order has on performance). For the Y sweep we can choose
between ikj and kij (the innermost loop is done along the dimension
of the sweep). Similarly, for the Z sweep we can use either ijk, or
jik implementation. Only in the case of the X sweep the memory is
accessed linearly. Performance results of all these implementations
for a model of size nx = ny = nz = 1000 are presented in Table 1.
Improving the Y and Z sweeps requires accessing the memory
linearly, i.e. implementing the loop order as kji and jki, respectively.
The idea is presented in Fig. 1. Instead of building and solving a
Fig. 1. In order to obtain linear memory access during the Z sweep a whole plane of systems of equations has to be simultaneously created and solved. While solving a single
equation in the Z dimension requires k to be the innermost loop, solving a plane of systems of equations makes it possible to have i as the innermost loop.
single tridiagonal system, we build and solve a whole plane of
systems of equations simultaneously. The pseudocode for the Z
sweep is presented below:
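(The following is a simplified sketch rather than the authors' original listing: the fully implicit variant is assumed, and for brevity the constant coefficients of the homogeneous, uniform-grid case are assembled instead of the averaged conductivities (5); s = τκ/(Δh)² and zero-flux ends are used, and p is an auxiliary factor array of nz·nx floats. The innermost loops run over i, so the memory is accessed linearly and the compiler can vectorize them with SSE.)

```c
/* Plane-wise Z sweep in jki order: for every X-Z plane (fixed j) a whole plane of
   tridiagonal systems -- one system per i, running along k -- is factorized and
   solved at once. T doubles as right-hand side and solution. */
void sweep_z_plane(float *T, float *p, int nx, int ny, int nz, float s)
{
    const size_t plane = (size_t)nx * (size_t)ny;   /* distance between k and k+1 */
    for (int j = 0; j < ny; j++) {
        /* forward elimination (18), fused with the matrix assembly */
        for (int k = 0; k < nz; k++)
            for (int i = 0; i < nx; i++) {
                float a = (k == 0)      ? 0.0f : -s;   /* sub-diagonal   */
                float c = (k == nz - 1) ? 0.0f : -s;   /* super-diagonal */
                float b = 1.0f - a - c;                /* diagonal       */
                size_t idx = (size_t)i + (size_t)nx * (size_t)j + plane * (size_t)k;
                float denom, q;
                if (k == 0) {
                    denom = b;
                    q     = T[idx];
                } else {
                    denom = b - a * p[(k - 1) * nx + i];
                    q     = T[idx] - a * T[idx - plane];   /* T[idx-plane] holds q_{k-1} */
                }
                p[k * nx + i] = c / denom;
                T[idx] = q / denom;                        /* store q_k in place of T */
            }
        /* backward substitution (19): x_k = q_k - p_k * x_{k+1} */
        for (int k = nz - 2; k >= 0; k--)
            for (int i = 0; i < nx; i++) {
                size_t idx = (size_t)i + (size_t)nx * (size_t)j + plane * (size_t)k;
                T[idx] -= p[k * nx + i] * T[idx + plane];
            }
    }
}
```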
The described loop transformation assures linear memory access during the Y and Z sweeps, and allows the use of SSE vectorization
for the matrix computation. Moreover, as a natural consequence
of solving a plane of systems of equations, SSE can also be used to
implement a vectorized Thomas algorithm, i.e. solve many equations
laid out correctly in the memory at the same time (see Appendix II
for better understanding of SSE vectorization). In the X sweep vectorization can only be used during the matrix computation. Using
a vectorized TA in this case requires explicit rearrangement and
copying of the data in the memory, which is inefficient and slows
down the code. The importance of using vectorization wherever
possible can be verified by compiling and running the code with
and without SSE2 support (see Table 2).
The presented approach requires additional storage for nx tridiagonal matrices, right-hand side vectors and the factor. Since these arrays are of considerable size and do not fit into the CPU's cache for large problems, they decrease the overall performance of the algorithm. From Eq. (18) it can be noticed that every entry of the A matrix is used only once during the forward elimination stage of the TA. Hence, it is beneficial to factorize it on the fly by including the forward elimination stage into the matrix building loop.
Table 1
Performance of the naive implementation of the implicit LOD method (Intel compiler)

              X sweep, kji   Y sweep, ikj   Y sweep, kij   Z sweep, ijk   Z sweep, jik
Time (s)      29             288            85             337            212
MFlops        1750           169            572            144            229
This optimization eliminates the need to store and later read the entries of
A, which speeds up the code considerably. Doing so also allows us
to use the rhs vector in place of the q vector from Eq. (18). Finally,
in the fully implicit case the temperature values themselves can be
used as the rhs vector. These improvements can also be applied in
the Crank–Nicolson schemes and the schemes that require explicit
derivatives in the other dimensions by using an auxiliary vector.
The performance results for the methods described above are
summarized in Table 3. It is worth noting that the best Y and Z
sweeps implementations are faster than the X sweep. This is due to
the SSE vectorization of the TA, which is a natural consequence of the presented approach, but cannot be efficiently implemented for the
X sweep. For comparison, the table also includes the time needed
to perform a single step of our implementation of the explicit finite
difference scheme with variable material coefficients, SSE vectorization and proper memory access.
Table 4 presents the performance results of the same code compiled with gcc and icc. Clearly icc, being two times faster, utilizes the vector abilities of the Opteron much better.
Fig. 2 presents the flops performance of the optimized code
depending on the system size. High efficiency is sustained for
large problems and for sweeps in all spatial directions. For small
systems the flops performance is up to 30% higher, which is
due to the cache reuse of the K properties array between the
solves done on subsequent planes, and the factor and rhs during
forward and backward stages of the TA. For large problems, the
performance is memory bandwidth bounded and therefore does
not depend on factors like the CPU cache size.
Table 2
The impact of SSE vectorization on execution time (Intel compiler)

              X sweep,         X sweep,         Y sweep, kji,    Y sweep, kji,    Z sweep, jki,    Z sweep, jki,
              not vectorized   SSE vectorized   not vectorized   SSE vectorized   not vectorized   SSE vectorized
Time (s)      48               29               45               32               47               33
MFlops        1039             1750             1073             1491             1028             1452
Table 3
Performance of SSE vectorized code with integrated TA solver (Intel compiler)

              memcpy     Explicit   X sweep,   Vectorized Y   Vectorized Y sweep,    Vectorized Z   Vectorized Z sweep,
                         scheme     kji        sweep, kji     kji, integrated TA     sweep, jki     jki, integrated TA
Time (s)      ~8 to 10   37         29         32             20                     33             22
MFlops        0          –          1750       1491           2428                   1452           2229
The CPU could theoretically compute faster (it does so for smaller systems), but cannot
due to data starvation. This shows that there is only so much to
gain by optimizing the cache reuse in the sequential version of the
code.
5.1. Parallel implementation
In the simplest approach ADI/LOD methods are parallelized by
dividing the outermost loop between the threads. This assures
locality of most of the data used during the sweeps in two dimensions, the exception being the material properties K, for which
neighboring CPUs share the border. On the other hand, the sweep
in the third dimension accesses the data across all the CPUs, which
involves very heavy communication and a performance hit. On a
shared memory system this penalty is considerably lower than on
distributed memory architectures. Starting from a close to optimal sequential performance we show that this simple approach
yields very good parallel results for the studied computer architecture.
On shared memory machines the whole memory of the system can be allocated and addressed directly as a single array. This
way there is no need to explicitly program the communication,
which is performed automatically by the CPU while accessing the
required data in the global memory space. On NUMA (Non-Uniform
Memory Access) capable architectures every CPU has its private
memory bank, which assures parallelization of not only the computations, but also the memory bandwidth. On these systems in
order to obtain a scalable code it is important to assure that the
data accessed by a thread belongs to the private memory bank
of the CPU the thread is executed on. For scientific applications
it is commonly achieved by binding threads to specific CPUs and
using a technique called first touch to allocate the data. Basically,
every thread initializes (e.g. sets to 0) only the part of an array that it will later use during the computations. The operating
system then assigns all parts of the array to the proper memory
banks.
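A minimal sketch of the first-touch allocation (our illustration using OpenMP, assuming threads are bound to CPUs, e.g. with numactl or sched_setaffinity, and that the same static decomposition over the outermost k index is used later in the sweeps):

```c
#include <stdlib.h>

/* Allocate an nx*ny*nz array and let each thread touch (initialize) exactly the
   k-slab it will later work on; on a NUMA system the operating system then maps
   those pages to the memory bank of the CPU running that thread. */
float *alloc_first_touch(size_t nx, size_t ny, size_t nz)
{
    float *T = malloc(nx * ny * nz * sizeof *T);
    if (T == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long k = 0; k < (long)nz; k++)
        for (size_t j = 0; j < ny; j++)
            for (size_t i = 0; i < nx; i++)
                T[i + nx * (j + ny * (size_t)k)] = 0.0f;
    return T;
}
```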
Fig. 2. Flops performance of the optimized sequential implementation of 1D sweeps in all spatial directions depending on the problem size.

Fig. 3 presents the scaling of the optimized code for a model of size nx = ny = nz = 1000. The speedup of the X and Y sweeps on up to 8 CPUs is linear and close to perfect. The Z sweep suffers a small but acceptable performance penalty due to the communication between CPUs. Using more CPUs requires switching to the second core, in which case a limited speedup can only be observed for the X sweep. On a multi-core NUMA system all cores of a CPU share the same memory bus, which results in the threads competing for the memory bandwidth. Since in the case of the heavily optimized Y and Z sweeps most of it was already used by a single thread, no speedup is to be expected. On the other hand, the TA solver in the sequential X sweep was not vectorized and the computations took relatively more time; effectively, some memory bandwidth was left to be utilized by the second core. For the studied problem size, the X, Y and Z sweeps on 8 CPUs take 3.8, 3.0 and 4.2 s, respectively. Fig. 4 shows the parallel efficiency of the code for 8 CPUs.

Table 4
Performance comparison of the fully optimized code compiled with gcc and icc

              X sweep, icc   X sweep, gcc   Y sweep, icc   Y sweep, gcc   Z sweep, icc   Z sweep, gcc
Time (s)      29             52             20             44             22             45
MFlops        1750           965            2428           1093           2229           1062

Fig. 3. Flops performance of parallel execution of the optimized implementation for a constant problem size of nx = ny = nz = 1000. After 8 CPUs (vertical line) two cores were used on every CPU.

Fig. 4. Parallel efficiency of the method on 8 Opteron CPUs depending on problem size.

An interesting observation can be made for small problems (nx = ny = nz ≈ 200), for which not only the sequential performance of the Y and Z sweeps is higher (see Fig. 2), but also the parallel efficiency. This means that the cache reuse also limits the data exchange between the CPUs. It indicates possible improvements to the parallel code involving duplication of the material parameters matrix K at the CPU boundaries. This would, however, require allocation of a separate K matrix for the Z dimension. For small enough problems some speedup is also observed for all the sweeps when using the second core. In this case the code spends relatively more time on computations and, due to cache reuse, leaves some memory bandwidth unused. This brings us to the conclusion that although cache reuse plays a much smaller role for sequential codes on single cores, it may yet prove to be important for utilizing the parallel computational power of multi-core systems.

6. Numerical examples

6.1. Effective properties of heterogeneous media
In this section we present an application of the previously
described Locally One-Dimensional scheme (14) to a numerical
study of diffusion fronts in heterogeneous media. Firstly, we validate our numerical method and analyze the average steady state
thermal gradient inside a single spherical heterogeneity subjected
to a uniform thermal flux at the model boundaries. We compare
the numerical results to an analytical prediction derived for an
inclusion embedded in an infinite host. Next, the method is used
to compute the time evolution of the diffusion front in a medium
consisting of multiple spherical inclusions. The inclusion concentration, number and configuration are systematically varied during
our simulations. The system evolution is integrated until the steady
state is reached. At that time we compute the average thermal
flux in the entire domain determining the effective conductivity
of the system. The numerical results are compared to the estimates obtained from an effective property scheme based on the
differential effective medium (DEM) approach. Finally, we resolve
the complex structure of the local diffusion fronts in such heterogeneous media and show how the effective conductivity model
provides an excellent description framework on a larger scale even
for the transient problem.
Table 5
Impact of the grid resolution on the result quality in a single inclusion benchmark: (a) weak inclusion case and (b) strong inclusion case

(a) Resolution n³ points    Kinc = 0.1           Kinc = 0.01          Kinc = 0.001
                            Grad      Err%       Grad      Err%       Grad      Err%
51                          1.489     4.2        2.000     34.0       2.108     40.6
101                         1.508     5.6        1.764     18.2       1.812     20.9
151                         1.488     4.2        1.671     12.0       1.703     13.6
201                         1.476     3.3        1.626     8.9        1.650     10.1
251                         1.466     2.6        1.597     7.0        1.618     7.9
301                         1.461     2.3        1.580     5.9        1.599     6.7
501                         1.448     1.4        1.544     3.5        1.558     3.9
Analytical prediction       1.4285               1.4925               1.4992

(b) Resolution n³ points    Kinc = 10            Kinc = 100           Kinc = 1000
                            Grad      Err%       Grad      Err%       Grad      Err%
51                          0.494     97.6       0.0836    188.3      0.00897   199.0
101                         0.353     41.2       0.0497    71.4       0.00520   73.3
151                         0.314     25.6       0.0416    43.4       0.00428   42.7
201                         0.296     18.4       0.0382    31.7       0.00388   29.3
251                         0.285     14.0       0.0361    24.5       0.00364   21.3
301                         0.279     11.6       0.0349    20.3       0.00361   20.3
501                         0.267     6.8        0.0322    11.0       –         –
Analytical prediction       0.25                 0.029                0.0030

The table shows the average value of the temperature gradient in the inclusion at steady state and the deviation from the expected result. The size of the domain is 1, the inclusion radius is 0.05. The timestep used was 75 times the maximum explicit timestep allowed. The error estimate is calculated as 100 × abs(numerical result − analytical result)/abs(analytical result).
Table 6
Impact of grid resolution on the result quality

                              Concentration 0.1        Concentration 0.3
Resolution n³ grid points     Weak        Strong       Weak        Strong
51                            0.832       0.0218       0.584       0.0696
101                           0.846       0.0165       0.593       0.0460
151                           0.851       0.0152       0.596       0.0382
201                           0.852       0.0146       0.598       0.0344
251                           0.854       0.0143       0.599       0.0323
501                           0.856       0.0135       0.601       0.0281

128 inclusions with concentrations 0.1 and 0.3. The timestep used was 75 times the maximum explicit timestep allowed. Weak case: kincl = 0.01, khost = 1; strong case: kincl = 1, khost = 0.01.
6.2. Single inclusion benchmark
A spherical inclusion of radius 0.05 is placed in the center of a
unit cube. The thermal conductivity of the host material is set to 1 in
all our simulations and we systematically vary the conductivity of
the inclusion phase. For the plane X = 0 we set the temperature to 1,
and for X = 1 we set it to 0. At all the other walls of the cube we apply
the zero-flux boundary condition. The initial temperature is set to
zero in the whole domain, except for the X = 0 plane. The domain
is discretized using a uniform Cartesian grid with an equal number
of grid points in every dimension. The cell conductivity is based on
whether the cell center is placed inside or outside the inclusion. The
cell conductivity is not averaged in any way. Starting from time t = 0
we let the system evolve with a fixed timestep dt until no significant
temperature changes can be observed in the model. Thus, we obtain
the approximation to the steady-state solution. At this point we
compute the average temperature gradient in the X direction inside
the inclusion.
For an explicit time integration scheme the Courant–Friedrichs–Lewy restriction on the timestep states that the maximum admissible timestep is dtCFL = (Δh)²/(6 kmax), where Δh denotes the uniform grid spacing and kmax is the maximum conductivity value in the model. In all the performed numerical experiments we used dt = 75 dtCFL to verify the advantages of the unconditional stability of the method. As described in the previous sections, if the steady state is the goal of the computations one should modify the timestep during the iterations to accelerate the convergence. However, our main focus in this study is the transient problem of thermal diffusion and therefore we keep the timestep constant throughout the simulation.

Fig. 6. Average temperature profiles along the X dimension at time t = 0.05. Three different concentrations of 128 inclusions are studied on a 501³ resolution grid with dt = 75 times the maximum explicit timestep. The time is normalized with the effective conductivity of the medium. The obtained 1D profiles are compared to the analytical solution of the 1D heat conduction equation in a homogeneous medium.
It is well known that both the thermal flux and the mechanical strains are uniform inside an ellipsoidal inclusion for constant far-field thermal and mechanical loads (Eshelby, 1957). In the thermal case the flux inside a spherical inclusion, q_incl, is given by

$$\vec{q}_{\text{incl}} = \frac{3\,k_{\text{incl}}/k_{\text{host}}}{2 + k_{\text{incl}}/k_{\text{host}}}\;\vec{q}_{\infty} \tag{20}$$

where kincl and khost are the thermal conductivities of the inclusion and the host phase, respectively, and q_∞ is the far field flux.
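Since the flux and the temperature gradient inside the inclusion are related by q = −k grad T, Eq. (20) also fixes the uniform temperature gradient inside the inclusion, which is the quantity reported in Table 5:

$$|\nabla T|_{\text{incl}} = \frac{|\vec{q}_{\text{incl}}|}{k_{\text{incl}}} = \frac{3}{2 + k_{\text{incl}}/k_{\text{host}}}\,|\nabla T|_{\infty}$$

For the unit far-field temperature gradient imposed in this benchmark, kincl/khost = 0.1 gives 3/2.1 ≈ 1.4286 and kincl/khost = 10 gives 3/12 = 0.25, consistent with the analytical values listed in Table 5.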
Table 5 presents the results of our numerical experiments,
where we study the impact of the numerical resolution on the
inclusion flux for different inclusion conductivities. Table 5a presents the results for a weak inclusion, whereas Table 5b presents those for a strong inclusion.
It has to be noted that the analytical result (20) is derived for the
case when the boundaries are infinitely far from the inclusion. In
our case, although the inclusion is quite small, the boundaries are
at a finite distance and hence have some influence on the result.
Also, the Locally One-Dimensional scheme (14) is not completely
consistent and converged solutions are not exactly the solution of
the steady-state thermal problem.
6.3. Multiple inclusions—effective conductivity
Fig. 5. Effective conductivity of a host filled with different concentrations of weak
and strong spherical inclusions. The dots represent obtained numerical results for
weak (empty dots) and strong (filled dots) inclusions. Dotted and dashed lines are
given by the DEM upscaling scheme.
In this section we present a study of the effective properties of
heterogeneous media consisting of randomly distributed spherical inclusions. In the numerical experiments we analyze different
numbers and various concentrations of both weak and strong inclusions. All the inclusions have the same conductivity kincl , and the
conductivity of the matrix is denoted as khost . The setup used in our
numerical experiments is similar to that presented in the previous
section. The computational domain is a unit cube and the boundary conditions are the same: Dirichlet boundary conditions on the
X = 0 and 1 planes with temperature values 0 and 1, and zero-flux
Fig. 7. A snapshot of the diffusion fronts during the simulation. The left-most column presents 128 inclusions for the three studied concentrations: 0.1, 0.2 and 0.3. The middle
column presents the same real-time snapshot for all the simulations. The two three-dimensional iso-contours are plotted for temperature values 0.7 and 0.1. The color on
these iso-surfaces denotes the temperature gradient. The 2D cut at the back of the 3D cube shows temperatures on a chosen X–Y plane. The same plane is later shown alone
in the last column.
boundary conditions on the remaining walls of the box. We randomly place a given number of equally sized spherical inclusions
so that their total volume adds up to the required concentration. We
allow the system to converge to the steady-state using the timestep
that is 75 times larger than the maximal value admissible in explicit
integration. We then compute an average flux in the X direction
through the entire domain.
Table 6 presents a study of the impact of the computational
resolution on the result. The runs have been performed for 128
inclusions. Concentrations 0.1 and 0.3 are considered for both
weak (kincl = 0.01, khost = 1) and strong (kincl = 1, khost = 0.01) inclusions.
To validate the numerical prediction we have repeated the
experiments for 6 different configurations of 128 inclusions, and
for different numbers of inclusions (from 32 up to 500) with the
same concentration 0.1. The computational resolution was 501³
and the dt was 75 times the maximum explicit timestep. For
6 different distributions of 128 inclusions the maximum difference between the results was 0.2% for weak inclusions and 0.5%
for strong inclusions. For a varying number of weak inclusions
the obtained results are already quite good for a relatively small
number of heterogeneities. 32 and 500 inclusions gave an average flux of 0.8525 and 0.8566, respectively. For the strong case the
result differed more: 0.0133 and 0.0143, respectively. This can be
attributed to a coarser discretization of individual, smaller inclusions.
The results obtained numerically are compared with the prediction obtained with the differential effective medium (DEM)
upscaling scheme in Fig. 5. The numerical results represented by
dots are computed for 128 weak (empty dots) and strong (filled
dots) inclusions with different concentrations. The resolution of the
grid was 501³ points. The differential effective medium schemes are
known to exhibit a very good predictive power over a wide range
of concentrations and material property contrasts (e.g. Weber et
al., 2003). The classical differential effective medium scheme for a
composite consisting of spherical inclusions leads to the following
approximation (Bruggeman, 1936):
$$\frac{k_{\text{incl}} - k_{\text{eff}}}{k_{\text{incl}} - k_{\text{host}}}\left(\frac{k_{\text{host}}}{k_{\text{eff}}}\right)^{1/3} = 1 - f \tag{21}$$

where f is the volume fraction (concentration) of the inclusion phase and keff is the effective conductivity of the composite.
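Equation (21) is implicit in keff. A simple way to evaluate the DEM prediction numerically (a sketch, not the authors' implementation) is bisection on the interval between khost and kincl, where the physically relevant root lies and the residual changes sign:

```c
#include <math.h>

/* Residual of the Bruggeman DEM relation (21) for a trial effective conductivity keff. */
static double dem_residual(double keff, double kincl, double khost, double f)
{
    return (kincl - keff) / (kincl - khost) * cbrt(khost / keff) - (1.0 - f);
}

/* Solve (21) for keff by bisection; works for both weak and strong inclusions,
   0 <= f < 1, kincl != khost. */
double dem_keff(double kincl, double khost, double f)
{
    double lo = fmin(kincl, khost), hi = fmax(kincl, khost);
    for (int it = 0; it < 200; it++) {
        double mid = 0.5 * (lo + hi);
        /* keep the sub-interval over which the residual changes sign */
        if (dem_residual(lo, kincl, khost, f) * dem_residual(mid, kincl, khost, f) <= 0.0)
            hi = mid;
        else
            lo = mid;
    }
    return 0.5 * (lo + hi);
}
```

The DEM curves of Fig. 5 can then be reproduced by evaluating such a routine over the range of concentrations f used in the simulations.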
As seen in Fig. 5, the obtained numerical results are in very good
agreement with the analytical prediction for both weak and strong
inclusions.
6.4. Multiple inclusions—diffusion fronts
In the above sections we presented an analysis of the accuracy
of the LOD method based on the approximation of the steadystate solution of the thermal diffusion problem. Here, we look at
the diffusion fronts and the transient stages of the simulation. We
study 128 weak inclusions with kincl = 0.01 discretized on a 501³
points grid. The timestep dt is 75 times the maximum explicit
value. Fig. 6 presents computed thermal profiles along the X dimension averaged on the Y–Z planes for all the studied concentrations.
For each concentration the time is normalized by the previously
obtained effective thermal diffusivity of the medium and the constant unit domain size in the X direction. The results are shown for
t = 0.05. The averaged profiles agree very well with the analytical
solution of the corresponding one-dimensional thermal diffusion
problem. This shows that the effective material property approach
is well applicable to the transient part of the heat diffusion problems.
The strength of our direct numerical simulation approach is that
we can explicitly resolve the local three-dimensional structure of
a diffusion front. Fig. 7 presents snapshots of the time evolution
of the studied setup. The left-most column presents 128 inclusions
for the three studied concentrations. The middle column presents
the same real-time snapshot for all the simulations. The two three-dimensional iso-contours are plotted for temperature values 0.7
and 0.1. The color on these iso-surfaces denotes the temperature
gradient. The 2D cut at the back of the 3D cube shows temperatures
on a chosen X–Y plane. The same plane is later shown alone in the
last column.
6.5. Reactive transport

In this section we briefly present an application of the parabolic fractional step solver as a component of a reactive transport solver. Subsurface fluid circulation brings fluids out of chemical equilibrium and causes a rich variety of phenomena ranging from channelling and fingering instabilities to fluidization and explosive eruptions. A number of issues have to be resolved by numerical modelling. Both thermal and volumetric effects of chemical reactions are of primary importance. Porosity alteration by dissolution–precipitation processes results in a strong and nonlinear variation of key model properties, such as permeability. A much simplified model of the reactive transport problem considered here is stated as

$$\begin{aligned}
\frac{\partial (c_f \rho_f \phi_f)}{\partial t} &= -\nabla_i\cdot\big(c_f \rho_f \phi_f v_i^f\big) + \nabla_i\cdot\big(\rho_f \phi_f D_f \nabla_i c_f\big) + \tilde{\alpha}\,\rho_f\,(c_f^{eq} - c_f),\\
\frac{\partial (c_s \rho_s \phi_s)}{\partial t} &= -\nabla_i\cdot\big(c_s \rho_s \phi_s v_i^s\big) + \nabla_i\cdot\big(\rho_s \phi_s D_s \nabla_i c_s\big) - \tilde{\alpha}\,\rho_f\,(c_f^{eq} - c_f),\\
\nabla_i\cdot\Big(\Big(\sum_{\text{all phases}} \rho\,\phi\Big) v_i^s + \rho_f \phi_f \big(v_i^f - v_i^s\big)\Big) &= -\frac{\partial}{\partial t}\sum_{\text{all phases}} \rho\,\phi \;\approx\; 0
\end{aligned} \tag{22}$$

where i = 1, 2, 3 is the spatial coordinate index, repeated indices imply summation, and c, ρ, $\tilde{\alpha}$, c_f^eq, D, φ and v are the concentration of a trace component, density, kinetic reaction constant, equilibrium concentration, diffusion coefficient, porosity and velocity, respectively. Subscript "f" refers to the fluid phase and subscript "s" refers only to solids that contain the trace component. The first two equations are the mass balances of the trace element in the fluid and solid phases. The third equation is the total mass balance of fluid and solid matter, in which the net volumetric effect of the dissolution–precipitation reactions is neglected. We aim at resolving the stiffness of the final system of equations arising due to fast kinetics, while keeping the diffusion terms to be able to study transient effects of reactive transport. To keep this example simple, we set the solid velocity and diffusion coefficient to zero, as negligible compared to the fluid's velocity and diffusivity:

$$\begin{aligned}
\frac{\partial (c_f \phi_f)}{\partial t} &= -q_i \nabla_i c_f + \nabla_i\cdot\big(\phi_f D_f \nabla_i c_f\big) + \tilde{\alpha}\,(c_f^{eq} - c_f),\\
\frac{\partial C_s}{\partial t} &= -\tilde{\alpha}\,(c_f^{eq} - c_f),\qquad
q_i = \rho_f \phi_f v_i^f,\qquad
C_s = c_s \rho_s \frac{\phi_s}{\rho_f},\qquad
\rho_f = \text{const.}
\end{aligned} \tag{23}$$

Finally, using the characteristic length scale of the reservoir, L, and choosing the time scale so as to eliminate the diffusion coefficient yields the following non-dimensional system of equations:

$$\frac{\partial (c_f \phi_f)}{\partial t} = -q_i \nabla_i c_f + \nabla_i\cdot\big(\phi_f \nabla_i c_f\big) + \alpha\,(c_f^{eq} - c_f),\qquad
\frac{\partial C_s}{\partial t} = -\alpha\,(c_f^{eq} - c_f) \tag{24}$$

The first two terms on the right side of the first equation describe the transport (advection and diffusion). The last term represents the reaction part. Reaction is possible only if solid is present:

$$\alpha = \begin{cases} \dfrac{\tilde{\alpha} L^2}{D_f} & \text{if } C_s > 0\\[2mm] 0 & \text{if } C_s = 0 \end{cases} \tag{25}$$

Generally, φf can change during the simulation, since chemical reactions may enhance or reduce porosity. We choose to resolve the initial heterogeneity of porosity, but for simplicity we do not change it with time.
6.6. Numerical results
A straightforward spatial discretization of equations (24) with minimum grid spacing dx that includes all three operators (advection, diffusion and reaction) in a single system of equations can be done using either the finite difference (FD) or the finite element (FEM) method. This approach is often called a full physics model and is employed in many industry-standard reservoir simulators. The time integration of equations (24) can be computed using an explicit Euler scheme. However, the time step restrictions due to the explicit treatment of advection, diffusion and reaction are of the order of dx/max(q), dx² and 1/α, respectively. Effectively, the time integration requires a prohibitively large number of time steps for the high resolution and fast kinetics problems we aim at. Unconditionally stable implicit schemes remove the time step restriction. However, the numerical solution of large systems of equations for three-dimensional discretizations is too CPU and memory expensive.
A commonly used approach is to apply the fractional step idea to the individual physical processes that are part of equations (24).
Previously in this paper we referred to Fractional Steps only in the
context of spatial operator splitting. Here, operator splitting is also applied according to the physical processes.

Fig. 8. A comparison of the results of a reaction–advection–diffusion solver for two different spatial resolutions. Pictures present the dissolution front at two different moments during the simulation. Iso-contours denote different values of the concentration c (blue: c = 0; red: c = maximum concentration). The upper frames present a 400 × 500 × 45 model, the lower ones a 1600 × 2000 × 140 model. The dt used is equal to min(dx, dy, dz). Pictures on the left present an early stage of the dissolution process, the ones on the right a later stage. The advection flux q = (dx, 0, 0). The reaction rate is fast compared to the diffusion rate.

In short, diffusion,
advection and reaction can be solved successively. In the case of
equation (294) one time step consists of the following three phases
(Fractional Steps):
1. Apply diffusion operator c(n + 1/3) = diffusion(c(n)).
2. Apply advection operator c(n + 2/3) = advection(c(n + 1/3)).
3. Compute reaction c(n + 1) = reaction(c(n + 2/3), phi(n)) (see the sketch below).
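A minimal C sketch of such a driver is given below. The routine and parameter names (rad_timestep, diffusion_lod_step, advect_step, react_step, cf, Cs, phi, q, alpha, cf_eq) are our own illustrative choices and do not reproduce the interface of the actual implementation.

/* Hypothetical prototypes of the three substeps (illustration only). */
void diffusion_lod_step(float *cf, const float *phi,
                        int nx, int ny, int nz, float dt);
void advect_step(float *cf, const float *q,
                 int nx, int ny, int nz, float dt);
void react_step(float *cf, float *Cs, const float *phi,
                int n, float dt, float alpha, float cf_eq);

/* One operator-split (Fractional Steps) time step of the
   reaction-advection-diffusion system, following phases 1-3 above. */
void rad_timestep(float *cf, float *Cs, const float *phi, const float *q,
                  int nx, int ny, int nz, float dt, float alpha, float cf_eq)
{
    diffusion_lod_step(cf, phi, nx, ny, nz, dt);           /* c(n+1/3) */
    advect_step(cf, q, nx, ny, nz, dt);                    /* c(n+2/3) */
    react_step(cf, Cs, phi, nx*ny*nz, dt, alpha, cf_eq);   /* c(n+1)   */
}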
This approach is first-order accurate in time. The severe restriction on the time step introduced by the diffusion operator for high-resolution models can be removed by employing an unconditionally stable solver for the diffusion part. Our choice is to use the Fractional Steps method (14). The reaction part is an ordinary differential equation; it can be solved with the unconditionally stable implicit backward Euler scheme without solving a large system of equations. Hence, we may now run the simulation with a time step of the order of dx/max(q).
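For illustration, a possible pointwise backward Euler reaction update is sketched below; it assumes the porosity is frozen during the substep and uses the same hypothetical react_step interface as above, so it is a sketch under our own assumptions rather than the code used in the paper. Solving phi*(c_new - c_old)/dt = alpha*(cf_eq - c_new) for c_new at every grid point requires only a scalar division, which is why no large system of equations arises.

/* Pointwise implicit (backward Euler) reaction substep for eq. (24).
   cf : fluid concentration, Cs : solid amount, phi : porosity.       */
void react_step(float *cf, float *Cs, const float *phi,
                int n, float dt, float alpha, float cf_eq)
{
    for (int i = 0; i < n; i++) {
        if (Cs[i] <= 0.0f) continue;  /* eq. (25): no solid left, no reaction */
        /* backward Euler: phi*(c_new - cf[i])/dt = alpha*(cf_eq - c_new)     */
        float c_new = (phi[i]*cf[i] + dt*alpha*cf_eq) / (phi[i] + dt*alpha);
        float dCs   = dt*alpha*(cf_eq - c_new);  /* solid dissolved this step */
        if (dCs > Cs[i]) dCs = Cs[i];            /* simplified mass bookkeeping */
        Cs[i] -= dCs;
        cf[i]  = c_new;
    }
}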
Fig. 8 presents the results of an example study. We have used
real-world porosity data of the Gullfaks oil reservoir to compute the
effective diffusivity field D. In the initial stage the fluid is in equilibrium with the solid, i.e. the fluid is a fully saturated solution. The
model is flushed with pure fluid (cf = 0) from the left side, which is
implemented as a Dirichlet concentration boundary condition. For
simplicity, the advection flux q is constant in time and uniform in
space, i.e. the fluid is transported along the X direction. Here we
assume that the reaction speed is large compared to the diffusivity D, i.e. whenever fluid that is out of equilibrium flows through the solid, the reaction takes place instantaneously until equilibrium is reached or all the solid substance is dissolved. This is implemented through a large reaction parameter α = 250, which results in a rather sharp, narrow dissolution
front.
7. Conclusions
We have presented an efficient implementation of ADI/LOD type
of methods for three-dimensional problems on modern commodity
SMP architectures. Special attention has been paid to optimization
of the memory bandwidth usage, which is much more important
than optimizing the cache reuse. Our approach is largely applicable to all modern RISC architectures like Intel, AMD and IBM SP.
Our tests have been performed on an eight-way Opteron system
with DDR333 memory. Optimized sequential implementation of
the Y and Z sweeps is comparable in execution time to just copying
the data in the memory. The time needed to perform one complete LOD timestep is approximately twice the time of an explicit
scheme. Efficiency of the parallel implementation on 8 CPUs is
close to perfect. Scalability when using the second core is limited
because of memory bandwidth starvation. Computing one timestep
of the LOD scheme on a system of 10003 unknowns on 8 CPUs takes
11 s.
The LOD scheme has been validated using a computational
model of an isolated inclusion subject to a constant far field flux as a
benchmark problem. The effective thermal conductivity of composites consisting of multiple inclusions was studied numerically and
found to be in perfect agreement with analytical predictions based
on the differential effective medium approach. Our implementation
of the LOD scheme allows us to resolve the complex structure of the
diffusion front in a strongly heterogeneous medium with numerous inclusions. The results show that the effective material property
approach is also suitable for the transient part of heat diffusion
problems.
Finally, we have applied the Fractional Steps approach to a
reaction–advection–diffusion problem, effectively removing the
severe timestep restriction introduced by the diffusion solver.
Appendix A. Operations on multidimensional data
Three-dimensional arrays are stored linearly in the computer's
memory, as shown in Fig. 9. The data is transferred from the main
memory to the CPU cache in blocks called cache lines (usually 64
bytes), thus the cost of accessing a whole cache line is the same as
that of accessing a single value. Using all the values from a given
cache line at least once before it is removed from the cache back
to the main memory significantly decreases the access cost per
value. The following pseudo-code fragments show two possible
implementations of the difference operator in the Z dimension of a
3D Cartesian domain. The only difference is the order of the loops:
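(The listing itself is not reproduced in this version of the text; the following is a minimal C reconstruction of the two variants. The names diff_z_ijk, diff_z_kji, a and dz are illustrative, and the array layout follows Fig. 9 with the i index varying fastest.)

/* Variant 1 (ijk): the innermost loop runs along the slowest-varying index k. */
void diff_z_ijk(float *dz, const float *a, int nx, int ny, int nz)
{
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            for (int k = 0; k < nz - 1; k++)
                dz[i + j*nx + k*nx*ny] =
                    a[i + j*nx + (k+1)*nx*ny] - a[i + j*nx + k*nx*ny];
}

/* Variant 2 (kji): the innermost loop runs along the fastest-varying index i,
   i.e. along the linear memory layout.                                        */
void diff_z_kji(float *dz, const float *a, int nx, int ny, int nz)
{
    for (int k = 0; k < nz - 1; k++)
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                dz[i + j*nx + k*nx*ny] =
                    a[i + j*nx + (k+1)*nx*ny] - a[i + j*nx + k*nx*ny];
}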
Fig. 9. Three-dimensional array placement in computer’s memory.
Because of the loop order, the first implementation will be referred to as ijk and the second one as kji. Traversal of the 3D array for both of
these implementations is presented in Figs. 10 and 11, respectively.
The innermost loop in the kji version is consistent with the linear memory layout, which ensures that all the values from every loaded cache line are used at least once. The importance of this
practice is best reflected in the following performance comparison.
For a 3D array of size nx = ny = nz = 1000 and single precision floating
point numbers, the kji implementation of the difference operator
in the Z direction takes 6.9 s, while the ijk takes 95 s. For a difference
operator in the X direction, the execution time is 4.9 s, the same as
just copying an array of this size from one place to another in the
memory.
On modern architectures the cost of accessing RAM is further
decreased by assuring that the data is constantly read while the
CPU is busy performing computations. This technique is known as
prefetching and is automatically activated by the CPU or a compiler, provided that subsequent cache lines are processed in order.
Throughout the paper we refer to this approach as the linear memory access. In the kji implementation we traverse the array along
the direction of the memory (i index), therefore prefetching will be
used during execution.
Appendix B. SSE vectorization
Most modern CPUs are capable of limited vectorization of the
floating point operations as long as the vector elements are located
next to each other in the memory. On x86 architecture it is realized through the SSE2 instruction set, which makes it possible to
perform four single precision or two double precision operations at
the same time. This feature usually has to be activated using special
compiler flags. With icc we use the ‘-xW -O3’ flags. With gcc the equivalent would be ‘-O3 -msse2’. In a common vector notation, the second loop from Appendix A would be executed by the CPU as
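(The original listing is likewise missing here. The following sketch shows, under our own assumptions, the inner i loop of the kji difference operator processing four single precision values per instruction, first in vector notation and then with explicit SSE intrinsics. The function name diff_z_row_sse and the alignment assumptions are ours.)

/* Vector notation (four floats per operation):
     dz(i:i+3) = a_up(i:i+3) - a(i:i+3)                              */

#include <xmmintrin.h>   /* SSE intrinsics */

/* Explicitly vectorized inner loop of the kji Z-difference; nx is
   assumed to be a multiple of 4 and the arrays 16-byte aligned.     */
void diff_z_row_sse(float *dz, const float *a, const float *a_up, int nx)
{
    for (int i = 0; i < nx; i += 4) {
        __m128 v  = _mm_load_ps(&a[i]);     /* four values of a            */
        __m128 vu = _mm_load_ps(&a_up[i]);  /* four values, next z plane   */
        _mm_store_ps(&dz[i], _mm_sub_ps(vu, v));
    }
}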
To make vectorization possible, the vectors operated on need
to be stored linearly in the memory. Although the performance
improvement is negligible for a simple difference operator, it
becomes much more pronounced in a more computationally heavy
code.
Appendix C. Cache reuse
As noted in Appendix A, one should always use all the data from
a cache line at least once. However, this simple approach does
not assure that a given cache line will not be loaded from RAM
again during later computations. A common efficient programming
Fig. 10. Direction of memory access during the Z difference operator, ijk loop order.
Fig. 11. Direction of memory access during the Z difference operator, kji loop order.
paradigm is based on maximizing the cache reuse, i.e. using every
cache line in all or most of the required computations. In the case
of stencil operations this leads to so-called blocking or tiling (Rivera and Tseng, 2000). In short, a 3D grid is divided into smaller cubes that fit completely into the CPU's cache. All required operations on
a given block are performed before moving to the next one.
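A minimal sketch of such a blocked traversal is given below; the block size B and the per-point kernel process_point are hypothetical placeholders, and real tiled stencil codes are considerably more elaborate.

/* Hypothetical per-point kernel (illustration only). */
void process_point(float *a, int i, int j, int k, int nx, int ny, int nz);

#define B 16   /* block edge, chosen so that a block fits in the cache */

/* Blocked (tiled) traversal of a 3D grid: each B x B x B block is
   processed completely before moving on, so its cache lines can be
   reused by all operations that touch the block.                    */
void process_blocked(float *a, int nx, int ny, int nz)
{
    for (int kk = 0; kk < nz; kk += B)
        for (int jj = 0; jj < ny; jj += B)
            for (int ii = 0; ii < nx; ii += B)
                for (int k = kk; k < kk + B && k < nz; k++)
                    for (int j = jj; j < jj + B && j < ny; j++)
                        for (int i = ii; i < ii + B && i < nx; i++)
                            process_point(a, i, j, k, nx, ny, nz);
}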
Over the past years a lot of work has been done on designing cache-friendly algorithms. As described by Kamil et al. (2005), many of them have become ineffective on modern architectures due to the increased impact of data prefetching and linear memory access on the overall performance. Our results agree with
these findings. In the simple example of the difference operator,
with optimal cache reuse for the Z direction one could only hope
to decrease the execution time from 6.9 to 4.9 s. For problems with
relatively more computations per data access the possible gains
are even smaller. Therefore, in our approach we do not explicitly
optimize cache reuse, and instead we only consider linear memory
access for improving performance.
References
Allmann, S., Rauber, T., Runger, G., 2001. Cyclic reduction on distributed
shared memory machines. In: Proceedings of the Ninth Euromicro Workshop on Parallel and Distributed Processing, Mantova, Italy, pp. 290–
297.
Birkhoff, G., Varga, R.S., 1959. Implicit alternating direction methods. Transactions
of the American Mathematical Society 92 (1), 13.
Brandt, A., 1977. Multilevel adaptive solutions to boundary-value problems. Mathematics of Computation 31 (138), 333–390.
Browne, S., Dongarra, J., Garner, N., Ho, G., Mucci, P., 2000. A portable programming interface for performance evaluation on modern processors. International
Journal of High Performance Computing Applications 14 (3), 189–204.
Bruggeman, D.A.G., 1936. Calculation of various physical constants of heterogeneous substances. II. Dielectricity constants and conductivity of non-regular multi-crystal systems. Annalen der Physik 25 (7), 645–672.
Conte, S.D., Boor, C.W.D., 1980. Elementary Numerical Analysis: An Algorithmic
Approach. McGraw-Hill Higher Education, 408 pp.
Courant, R., Friedric, K., Lewy, H., 1967. On partial difference equations of mathematical physics. IBM Journal of Research and Development 11 (2), 215.
Dabrowski, M., Krotkiewski, M., Schmid, D.W., 2008. MILAMIN: MATLAB-based finite
element method solver for large problems. Geochemistry Geophysics Geosystems, 9.
Douglas, J., 1955. On the numerical integration of ∂²u/∂x² + ∂²u/∂y² = ∂u/∂t by implicit methods. Journal of the Society for Industrial and Applied Mathematics 3 (1), 42–65.
Douglas, J., 1962. Alternating direction methods for three space variables.
Numerische Mathematik 4 (1), 41.
Douglas Jr., J., Garder, A.O., Pearcy, C., 1966. Multistage alternating direction methods.
SIAM Journal on Numerical Analysis 3 (4), 570.
Douglas Jr., J., Rachford Jr., H.H., 1956. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American
Mathematical Society 82 (2), 421.
Douglas, J., Kim, S., 2001. Improved accuracy for locally one-dimensional methods
for parabolic equations. Mathematical Models & Methods in Applied Sciences
11 (9), 1563–1579.
Egecioglu, O., Koc, C.K., Laub, A.J., 1989. Recursive doubling algorithm for solution
of tridiagonal systems on hypercube multiprocessors. Journal of Computational
and Applied Mathematics 27, 95–108.
Eshelby, J.D., 1957. The determination of the elastic field of an ellipsoidal inclusion, and related problems. Proceedings of the Royal Society of London Series A:
Mathematical and Physical Sciences 241 (1226), 376–396.
Gander, W., Golub, G.H., 1997. Cyclic reduction—history and applications. In: Proceedings of the Workshop on Scientific Computing. Springer-Verlag, Hong Kong.
Hackbusch, W., 1981. Fast numerical-solution of time-periodic parabolic problems
by a multigrid method. SIAM Journal on Scientific and Statistical Computing 2
(2), 198–206.
Hockney, R.W., 1965. A fast direct solution of Poisson's equation using Fourier analysis. Journal of the ACM 12 (1), 95–113.
Hofhaus, J., VandeVelde, E.F., 1996. Alternating-direction line-relaxation methods
on multicomputers. Siam Journal on Scientific Computing 17 (2), 454–478.
Janenko, N.N., 1971. The Method of Fractional Steps: The Solution of Problems of Mathematical Physics in Several Variables, vol. VIII. Springer, Berlin, 160 pp.
Kamil, S., Husbands, P., Oliker, L., Shalf, J., Yelick, K., 2005. Impact of modern memory
subsystems on cache optimizations for stencil computations. In: Proceedings of
the 2005 Workshop on Memory System Performance. ACM, Chicago, IL.
Kurzak, J., Dongarra, J., 2007. Implementation of mixed precision in solving systems
of linear equations on the cell processor. Concurrency and Computation-Practice
& Experience 19 (10), 1371–1385.
Langou, J., et al., 2006. Exploiting the performance of 32 bit floating point arithmetic
in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems).
In: Proceedings of the 2006 ACM/IEEE conference on Supercomputing. ACM,
Tampa, Florida.
Larsson, S., Thomee, V., Zhou, S.Z., 1995. On multigrid methods for parabolic problems. Journal of Computational Mathematics 13 (3), 193–205.
Lubich, C., Ostermann, A., 1987. Multigrid dynamic iteration for parabolic equations.
BIT 27 (2), 216–234.
Marchuk, G.I., 1990. Splitting and Alternating Direction Methods. Handbook of
Numerical Analysis, vol. 1. Elsevier, Amsterdam, pp. 197–462.
McCalpin, J.D., 1991–2007. STREAM: Sustainable Memory Bandwidth in High Performance Computers. A Continually Updated Technical Report.
Overman, A., Rosendale, J.V., 1993. Mapping robust parallel multigrid algorithms to
scalable memory architectures. In: McCormick, N.D.M.a.T.A.M.a.S.F. (Ed.), Sixth
Copper Mountain Conference on Multigrid Methods. NASA, Hampton, VA, pp.
635–647.
Peaceman, D.W., Rachford, H.H., 1955. The numerical solution of parabolic and
elliptic differential equations. Journal of the Society for Industrial and Applied
Mathematics 3 (1), 28–41.
Pearcy, C., 1962. On convergence of alternating direction procedures. Numerische
Mathematik 4 (1), 172.
Povitsky, A., 1999. Parallelization of pipelined algorithms for sets of linear
banded systems. Journal of Parallel and Distributed Computing 59 (1), 68–
97.
Rivera, G., Tseng, C.-W., 2000. Tiling optimizations for 3D scientific computations.
In: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CD-ROM).
IEEE Computer Society, Dallas, TX, United States.
Stone, H.S., 1973. An efficient parallel algorithm for the solution of a tridiagonal
linear system of equations. Journal of ACM 20 (1), 27–38.
Strzodka, R., Göddeke, D., 2006. Mixed precision methods for convergent iterative
schemes. In: Proceedings of the 2006 Workshop on Edge Computing Using New
Commodity Architectures.
Wakatani, A., 2004. A parallel and scalable algorithm for ADI method with prepropagation and message vectorization. Parallel Computing 30 (12), 1345–1359.
Weber, L., Dorn, J., Mortensen, A., 2003. On the electrical conductivity of metal matrix
composites containing high volume fractions of non-conducting inclusions. Acta
Materialia 51 (11), 3199–3211.
Wesseling, P., 2004. An Introduction to Multigrid Methods.
Whaley, R.C., Petitet, A., Dongarra, J.J., 2001. Automated empirical optimization of
software and the ATLAS project. Parallel Computing 27 (1/2), 3–35.
Widlund, O.B., 1966. On the rate of convergence of an alternating direction implicit
method in a noncommutative case. Mathematics of Computation 20 (96),
500.