On-the-fly Doppler Broadening using Multipole
Representation for Monte Carlo Simulations on
Heterogeneous Clusters
by
Sheng Xu
B.S., Physics, Peking University (2010)
Submitted to the Department of Nuclear Science and Engineering and
the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degrees of
Master of Science in Nuclear Science and Engineering
and
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2013
© Massachusetts Institute of Technology 2013. All rights reserved.
Signature of Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Nuclear Science and Engineering and
the Department of Electrical Engineering and Computer Science
August 19, 2013
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Kord S. Smith
KEPCO Professor of the Practice of Nuclear Science and Engineering
Thesis Supervisor
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Benoit Forget
Associate Professor of Nuclear Science and Engineering
Thesis Supervisor
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Srini Devadas
Webster Professor of Electrical Engineering and Computer Science
Thesis Reader
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Mujid S. Kazimi
TEPCO Professor of Nuclear Engineering
Chair, NSE Committee on Graduate Students
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, EECS Committee on Graduate Students
On-the-fly Doppler Broadening using Multipole
Representation for Monte Carlo Simulations on
Heterogeneous Clusters
by
Sheng Xu
Submitted to the Department of Nuclear Science and Engineering and
the Department of Electrical Engineering and Computer Science
on August 19, 2013, in partial fulfillment of the
requirements for the degrees of
Master of Science in Nuclear Science and Engineering
and
Master of Science in Electrical Engineering and Computer Science
Abstract
In order to use Monte Carlo methods for reactor simulations beyond benchmark
activities, the traditional way of preparing and using nuclear cross sections needs to
be changed, since large datasets of cross sections at many temperatures are required
to account for Doppler effects, which can impose an unacceptably high overhead in
computer memory. In this thesis, a novel approach, based on the multipole representation, is proposed to reduce the memory footprint for the cross sections with little
loss of efficiency.
The multipole representation transforms resonance parameters into a set of poles
only some of which exhibit resonant behavior. A strategy is introduced to preprocess
the majority of the poles so that their contributions to the cross section over a small
energy interval can be approximated with a low-order polynomial, while only a small
number of poles are left to be broadened on the fly. This new approach can reduce
the memory footprint of the cross sections by one to two orders of magnitude over comparable
techniques. In addition, it can provide accurate cross sections with an efficiency
comparable to current methods: depending on the machines used, the speed of the
new approach ranges from being faster than the latter, to being less than 50% slower.
Moreover, it has better scalability features than the latter.
The significant reduction in memory footprint makes it possible to deploy the
Monte Carlo code for realistic reactor simulations on heterogeneous clusters with
GPUs in order to utilize their massively parallel capability. In the thesis, a CUDA
version of this new approach is implemented for a slowing down problem to examine
its potential performance on GPUs. Through some extensive optimization efforts, the
CUDA version can achieve around 22 times speedup compared to the corresponding
serial CPU version.
Thesis Supervisor: Kord S. Smith
Title: KEPCO Professor of the Practice of Nuclear Science and Engineering
Thesis Supervisor: Benoit Forget
Title: Associate Professor of Nuclear Science and Engineering
Acknowledgments
I would firstly like to thank my supervisors, Prof. Kord Smith and Prof. Benoit
Forget, for their invaluable guidance and insights throughout this project, and for the
support and freedom they gave me to pursue the research topic that I am interested
in.
I would also like to thank Prof. Srini Devadas for being my thesis reader. His
expertise in computer architecture, especially in GPU, has helped me tremendously
during this project. I wish I could have more time to learn from him.
I am grateful to Luiz Leal of ORNL for introducing us to the multipole representation, and Roger Blomquist for his help in getting the WHOPPER code. This work
was supported by the Office of Advanced Scientific Computing Research, Office of
Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357.
Due to the nature of this project, I have also received help from many other people
of both departments and I would like to thank all of them, among whom are: Dr.
Paul Romano, Dr. Koroush Shirvan, Bryan Herman, Jeremy Roberts, Nick Horelik,
Nathan Gilbson and Will Boyd of NSE, and Prof. Charles Leiserson, Prof. Nir Shavit,
Haogang Chen and Ilia Lebedev of EECS.
Furthermore, I want to express my gratitude to Prof. Mujid Kazimi, for his
guidance during my first year and a half here at MIT, and for his continuous patience,
understanding and support over my entire three years here.
In addition, I would like to thank Clare Egan and Heather Barry of NSE and
Janet Fischer of EECS, for the many administrative processes that they have helped
me through from the application to the completion of the dual degrees. I also wish
to give my thanks to all other people at MIT who have provided direct or indirect
support to my studies over the last three years.
Lastly, I would like to thank my family and friends for their love and care to
me. Special thanks go to my wife, Hengchen Dai, for her constant love, support and
encouragement, which helped me go through each and every hard time during the
years that we have been together.
Contents

List of Figures
List of Tables

1 Introduction
   1.1 Motivation
   1.2 Objectives
   1.3 Thesis organization

2 Background and Review
   2.1 Existing methods for Doppler broadening
       2.1.1 Cullen's method
       2.1.2 Regression model
       2.1.3 Explicit temperature treatment method
       2.1.4 Other methods
   2.2 General purpose computing on GPU
       2.2.1 GPU architecture
       2.2.2 CUDA Programming Model
       2.2.3 GPU performance pitfalls
       2.2.4 Floating point precision support

3 Approximate Multipole Method
   3.1 Multipole representation
       3.1.1 Theory of multipole representation
       3.1.2 Doppler broadening
       3.1.3 Characteristics of poles
       3.1.4 Previous efforts on reducing poles to broaden
   3.2 Approximate multipole method
       3.2.1 Properties of Faddeeva function and the implications
       3.2.2 Overlapping energy domains strategy
   3.3 Outer and inner window size
       3.3.1 Outer window size
       3.3.2 Inner window size
   3.4 Implementation of Faddeeva function

4 Implementation and Performance Analysis on CPU
   4.1 Implementation
   4.2 Test setup
       4.2.1 Table lookup
       4.2.2 Cullen's method
       4.2.3 Approximate multipole method
   4.3 Test one
       4.3.1 Serial performance
       4.3.2 Revisit inner window size
       4.3.3 Parallel performance
   4.4 Test two
   4.5 Summary

5 Implementation and Performance Analysis on GPU
   5.1 Test setup and initial implementation
       5.1.1 Implementation
       5.1.2 Hardware specification
       5.1.3 Results
   5.2 Optimization efforts
       5.2.1 Profiling
       5.2.2 Floating point precision
       5.2.3 Global memory efficiency
       5.2.4 Shared memory and register usage
   5.3 Results and discussion

6 Summary and Future Work
   6.1 Summary
   6.2 Future work

A Whopper Input Files
   A.1 U238
   A.2 U235
   A.3 Gd155

References
List of Figures

1-1  U238 capture cross section at 6.67 eV resonance for different temperatures.
2-1  Evolution of GPU and CPU throughput [1].
2-2  Architecture of an Nvidia GF100 card (Courtesy of Nvidia).
2-3  GPU memory hierarchy (Courtesy of Nvidia).
2-4  CUDA thread, block and grid hierarchy [1].
3-1  Poles distribution for U238. Black and red dots represent the poles with l = 0 and with positive and negative real parts, respectively, and green dots represent the poles with l > 0.
3-2  Relative error of U238 total cross section at 3000 K broadening only principal poles compared with NJOY data.
3-3  Relative error of U235 total cross section at 3000 K broadening only principal poles against NJOY data.
3-4  Faddeeva function in the upper half plane of the complex domain.
3-5  Demonstration of overlapping window.
3-6  Relative error of U238 total cross section at 3000 K calculated with constant outer window size of twice the average resonance spacing.
3-7  Relative error of background cross section approximation for U238 total cross section at 3000 K with constant outer window size of twice the average resonance spacing.
3-8  Relative error of U238 total cross section at 3000 K calculated with outer window size from Eq. 3.22.
3-9  Relative error of background cross section approximation for U238 total cross section at 3000 K with outer window size from Eq. 3.22.
3-10 Relative error of U235 total cross section at 3000 K calculated with outer window size from Eq. 3.22.
3-11 Demo for the six-point bivariate interpolation scheme.
3-12 Relative error of modified W against scipy.special.wofz.
3-13 Relative error of modified QUICKW against scipy.special.wofz.
4-1  Strong scaling study of table lookup and approximate multipole methods for 300 nuclides. The straight line represents perfect scalability.
4-2  Strong scaling study of table lookup and approximate multipole methods for 3000 nuclides. The straight line represents perfect scalability.
4-3  Weak scaling study of table lookup and approximate multipole methods with 300 nuclides per thread.
4-4  OpenMP scalability of table lookup and multipole methods for neutron slowing down. The straight line represents perfect scalability.
List of Tables

2.1  Latency and throughput of different types of memory in GPU [2].
3.1  Information related to outer window size for U238 and U235 total cross section at different temperatures.
3.2  Number of terms needed for background cross section and the corresponding storage size for various inner window sizes (multiple of average resonance spacing).
4.1  Resonance information of U235, U238 and Gd155.
4.2  Performance results of the serial version of different methods for test one.
4.3  Runtime breakdown of approximate multipole method with modified QUICKW.
4.4  Performance and storage size of background information with varying inner window size.
4.5  k_eff (with standard deviation) and the average runtime per neutron history for both table lookup and approximate multipole methods.
5.1  GPU specifications.
5.2  Performance statistics of the initial implementation on Quadro 4000 from the Nvidia visual profiler nvvp.
5.3  k_eff (with standard deviation) and the average runtime per neutron history for different cases.
5.4  Speedup of GPU vs. serial CPU version on both GPU cards.
5.5  Performance statistics of the optimized single precision version on Quadro 4000 from nvvp.
Chapter 1
Introduction
1.1 Motivation
Most Monte Carlo neutron transport codes rely on point-wise neutron cross section
data, which provides an efficient way to evaluate cross sections by linear interpolation with a pre-defined level of accuracy. Due to the effect of thermal motion
of the target nucleus, or the Doppler effect, cross sections usually vary with temperature,
which is particularly significant for the resonance capture and fission cross sections,
as demonstrated in Fig. 1-1. This temperature dependence must be taken into consideration for realistic, detailed Monte Carlo reactor simulations, especially if the Monte
Carlo method is to be used for multi-physics simulations with thermal-hydraulic
feedback impacting temperatures and material densities.
The traditional way of dealing with this problem is to generate cross sections for
each nuclide at specific reference temperatures by Doppler broadening the corresponding 0 K cross sections with nuclear data processing codes such as NJOY [3], and then
to use linear interpolation between temperature points to get the cross sections for the
desired temperature. However, this method can require a prohibitively large amount
of cross section data to cover a large temperature range.

[Figure 1-1: U238 capture cross section at the 6.67 eV resonance for different temperatures (0 K, 300 K, 3000 K); cross section in barns versus E in eV.]

For example, Trumbull [4]
studied the issue and suggested that using a temperature grid of 5 − 10 K spacing
for the cross sections would provide the level of accuracy needed. Considering that
the temperature range for an operating reactor (including the accident scenario) is
between 300 and 3000 K, and the data size of cross sections at a single temperature
for 300 nuclides of interest is approximately 1 GB, this means that around 250 − 500
GB of data would need to be stored and accessed, which usually exceeds the size of
the main memory of most single computer nodes and thus degrades the performance
tremendously.
On the other hand, the exceedingly high power density that comes with faster transistors
limits the continuing increase in the speed of the central processing unit (CPU). Consequently, more and more cores have to be placed on a single processor to maintain
the performance increases through wider parallelism. One direct result is that each
processor core will have limited access to fast memory (or cache) which can cause a
large disparity between processor power and available memory. This is especially true
for the graphics processing unit (GPU), which is at the leading edge of this trend of
embracing wider parallelism (although for different reasons, which will be discussed in
the next chapter). Therefore, reducing the memory footprint of a program can play a
significant role in increasing its efficiency with new trends of computer architectures.
As a result, to enable the Monte Carlo reactor simulation beyond the benchmark
activities and to prepare it for the new computer architectures, the memory footprint of the temperature-dependent cross section data must be reduced. This can be
done through evaluating the cross section on the fly, but classic Doppler broadening
methods are usually too expensive, since they were all developed for preprocessing
purposes. A few recent efforts on this [5, 6] have shown some improvements, but
either the memory footprint is still too high for massively parallel architectures, or
the efficiency is significantly degraded relative to current methods. Details on these
methods will be presented in the next chapter.
In this thesis, a new method of evaluating the cross section on the fly is proposed
based on the multipole representation developed in [7], to both reduce the memory
footprint of cross section data and to preserve the computational efficiency. In particular, the energy range above 1 eV in the resolved resonance region will be the focus
of this thesis, since the cross sections in this range dominate the storage.
1.2 Objectives
The main goal of this work is to develop a method that can dramatically reduce the
memory footprint for the temperature-dependent nuclear cross section data so that
the Monte Carlo method can be used for realistic reactor simulations. At the same
time, the efficiency must be maintained at least comparable to the current method.
In addition, this new method will be implemented on the GPU to utilize its massively
parallel capability.
1.3 Thesis organization
The remaining parts of the thesis are organized as follows:
Chapter 2 reviews the existing methods for Doppler broadening, especially those
that are developed to be used on the fly. It also provides some background information
for general purpose computing on GPU.
Chapter 3 describes the theory of the multipole representation and subsequently
the formulation of the new method, the approximate multipole method [8]. The
underlying strategy used for this new method will be discussed, as well as some
important parameters and functions that can impact the performance or memory
footprint thereof.
Chapter 4 presents the implementation of the approximate multipole method on
CPU and its performance comparison against the current method for both the serial
and parallel version.
Chapter 5 presents the implementation and performance analysis of the approximate multipole method on GPU. Specifically, the performance bottlenecks and optimization efforts in the GPU implementation are discussed.
Chapter 6 summarizes the work in this thesis, followed by a discussion of possible
future work and directions.
Chapter 2
Background and Review
This chapter first reviews the existing methods for Doppler broadening, with a focus
on recent efforts for broadening on the fly. The general purpose computing on GPU
is then reviewed, mainly on the architectural aspect of GPU, the programming model
and performance issues related to GPU.
2.1 Existing methods for Doppler broadening
Doppler broadening of nuclear cross sections has long been an important problem in
nuclear reactor analysis and methods. The traditional methods are mainly for the
purpose of preprocessing and generating cross section libraries. Recently the concept
of doing the Doppler broadening on the fly has gained popularity due to the prohibitively large storage size of temperature-dependent cross section data needed by
Monte Carlo simulation with coupled thermal-hydraulic feedback, and a few methods in this category have been developed. In this section, both types of Doppler
broadening methods are presented, with a focus on the latter type.
2.1.1 Cullen's method
The well-known Doppler broadening method developed by Cullen [9] uses a detailed
integration of the integral equation defining the effective cross section due to the
relative motion of target nucleus. For the ideal gas model, where the target motion
is isotropic and its velocity obeys the Maxwell-Boltzmann distribution, the effective
Doppler-broadened cross section at temperature T takes the following form
\bar{\sigma}(v) = \frac{\sqrt{\alpha}}{\sqrt{\pi}\, v^2} \int_0^\infty dV\, \sigma^0(V)\, V^2 \left[ e^{-\alpha (V-v)^2} - e^{-\alpha (V+v)^2} \right],    (2.1)

where \alpha = M/(2 k_B T), with k_B being the Boltzmann constant and M the mass of the
where α =
the neutron and the target nucleus, and σ 0 (v) represents the 0 K cross section for
neutron with velocity v.
As Cullen’s method will be used as a reference method in Chapter 4, it is helpful
to present the algorithm for evaluating Eq. 2.1 here. To begin, Eq. 2.1 can be broken
into two parts:
\bar{\sigma}(v) = \sigma^*(v) - \sigma^*(-v),    (2.2)

where

\sigma^*(v) = \frac{\sqrt{\alpha}}{\sqrt{\pi}\, v^2} \int_0^\infty dV\, \sigma^0(V)\, V^2\, e^{-\alpha (V-v)^2}.    (2.3)

The exponential term in Eq. 2.3 limits the significant part of the integral to the range

v - \frac{4}{\sqrt{\alpha}} < V < v + \frac{4}{\sqrt{\alpha}},    (2.4)

while for \sigma^*(-v), the range of significance becomes

0 < V < \frac{4}{\sqrt{\alpha}}.    (2.5)
The numerical evaluation of Eq. 2.3 developed in [9] assumes that the 0 K cross
sections can be represented by a piecewise linear function of energy with acceptable
accuracy, which is just the form of NJOY PENDF files [3]. By defining the reduced
variables x = \sqrt{\alpha}\, V and y = \sqrt{\alpha}\, v, the 0 K cross sections can be expressed as

\sigma^0(x) = \sigma_i^0 + s_i (x^2 - x_i^2),    (2.6)

with slope s_i = (\sigma_{i+1}^0 - \sigma_i^0)/(x_{i+1}^2 - x_i^2) for the i-th energy interval. As a result, Eq. 2.3 becomes

\sigma^*(y) = \frac{1}{\sqrt{\pi}\, y^2} \sum_{i=0}^{N} \int_{x_i}^{x_{i+1}} \sigma^0(x)\, x^2\, e^{-(x-y)^2}\, dx    (2.7)

           = \sum_i \left[ A_i\, (\sigma_i^0 - s_i x_i^2) + B_i\, s_i \right],    (2.8)

where N denotes the number of energy intervals that fall in the significance range as determined by Eq. 2.4 and 2.5, and

A_i = \frac{1}{y^2} H_2 + \frac{2}{y} H_1 + H_0,
B_i = \frac{1}{y^2} H_4 + \frac{4}{y} H_3 + 6 H_2 + 4 y H_1 + y^2 H_0,

where H_n is shorthand for H_n(x_i - y, x_{i+1} - y), defined by

H_n(a, b) = \frac{1}{\sqrt{\pi}} \int_a^b z^n e^{-z^2}\, dz.    (2.9)
(2.9)
To compute H_n(a, b), one can write it in the form

H_n(a, b) = F_n(a) - F_n(b),    (2.10)

where F_n(x) is defined by

F_n(x) = \frac{1}{\sqrt{\pi}} \int_x^\infty z^n e^{-z^2}\, dz,    (2.11)

and satisfies the recursive relations

F_0(x) = \frac{1}{2}\,\mathrm{erfc}(x),    (2.12)

F_1(x) = \frac{1}{2\sqrt{\pi}}\, e^{-x^2},    (2.13)

F_n(x) = \frac{n-1}{2} F_{n-2}(x) + x^{n-1} F_1(x),    (2.14)

with erfc(x) being the complementary error function

\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^\infty e^{-z^2}\, dz.    (2.15)
If a and b are very close to each other, the difference in Eq. 2.10 may lose
significance. To avoid this, one can use a different method based on direct Taylor
expansion (see [10]). However, with the use of double precision floating point numbers,
this problem did not show up during the course of this thesis work.
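To make the algorithm concrete, the following Python sketch evaluates Eq. 2.2-2.15 for a 0 K cross section given as a piecewise linear function of energy (the PENDF form). It is an illustrative re-derivation of Cullen's scheme under the stated equations, not the code used in this thesis; the function names and the 0 K grid are placeholders.

# Illustrative sketch of Cullen-style Doppler broadening (Eq. 2.2-2.15); the grid,
# nuclide data and function names here are hypothetical, not the thesis code.
import numpy as np
from scipy.special import erfc

def F(n, x):
    """F_n(x) = (1/sqrt(pi)) * integral_x^inf z^n exp(-z^2) dz, via Eq. 2.12-2.14."""
    if n == 0:
        return 0.5 * erfc(x)
    if n == 1:
        return np.exp(-x * x) / (2.0 * np.sqrt(np.pi))
    return 0.5 * (n - 1) * F(n - 2, x) + x ** (n - 1) * F(1, x)

def H(n, a, b):
    """H_n(a, b) = F_n(a) - F_n(b) (Eq. 2.10)."""
    return F(n, a) - F(n, b)

def broaden(x, sigma0, y):
    """sigma*(y) from Eq. 2.8 for a 0 K cross section that is piecewise linear in
    energy (i.e. in x^2), with reduced momenta x = sqrt(alpha)*V, y = sqrt(alpha)*v."""
    total = 0.0
    for i in range(len(x) - 1):
        # Skip intervals outside the +/- 4 range of significance (Eq. 2.4-2.5).
        if x[i + 1] < y - 4.0 or x[i] > y + 4.0:
            continue
        s = (sigma0[i + 1] - sigma0[i]) / (x[i + 1] ** 2 - x[i] ** 2)   # slope, Eq. 2.6
        a, b = x[i] - y, x[i + 1] - y
        A = (H(2, a, b) + 2.0 * y * H(1, a, b) + y * y * H(0, a, b)) / (y * y)
        B = (H(4, a, b) + 4.0 * y * H(3, a, b) + 6.0 * y * y * H(2, a, b)
             + 4.0 * y ** 3 * H(1, a, b) + y ** 4 * H(0, a, b)) / (y * y)
        total += A * (sigma0[i] - s * x[i] ** 2) + B * s
    return total

def sigma_bar(x, sigma0, y):
    """Effective cross section at reduced speed y (Eq. 2.2)."""
    return broaden(x, sigma0, y) - broaden(x, sigma0, -y)

The cost per evaluation is dominated by the erfc calls for every grid interval inside the range of significance, which is exactly the expense discussed below.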
Traditionally Cullen’s method is used in NJOY to Doppler broaden the 0 K cross
section and generate cross section libraries for reference temperatures. Any cross
section needed is then evaluated with this library by linear interpolation, which will
be denoted “table lookup” method henceforth in this thesis. As will be shown in
Chapter 4 (and also in other places such as [5]), Cullen’s method is not very practical
to broaden the cross section on the fly since it requires an unacceptable amount of
computation time, mainly due to the cost of evaluating complementary error functions
for the many energy points that fall in the range of significance, especially at high
energy.
2.1.2 Regression model
In [5], Yesilyurt et al. developed a new regression model to perform on-the-fly Doppler
broadening based on series expansion of the multi-level Adler-Adler formalism with
temperature dependence. Take the total cross section as an example (similar analysis
applies to other types of cross section), its expression in Adler-Adler formalism is
\sigma_t(E, \xi_R) = 4\pi\bar{\lambda}^2 \sin^2\phi_0 + \sqrt{\pi}\,\bar{\lambda}^2 \sqrt{E}\, \Big\{ \sum_R \frac{2}{\Gamma_{R,t}} \big[ (G_R \cos 2\phi_0 + H_R \sin 2\phi_0)\,\psi(x, \xi_R) + (H_R \cos 2\phi_0 - G_R \sin 2\phi_0)\,\chi(x, \xi_R) \big] + A_1 + \frac{A_2}{E} + \frac{A_3}{E^2} + \frac{A_4}{E^3} + B_1 E + B_2 E^2 \Big\}    (2.16)

where

\psi(x, \xi_R) = \frac{\xi_R}{2\sqrt{\pi}} \int_{-\infty}^{\infty} \frac{\exp[-\tfrac{1}{4}(x - y)^2 \xi_R^2]\, dy}{1 + y^2},    (2.17)

\chi(x, \xi_R) = \frac{\xi_R}{2\sqrt{\pi}} \int_{-\infty}^{\infty} \frac{\exp[-\tfrac{1}{4}(x - y)^2 \xi_R^2]\, y\, dy}{1 + y^2},    (2.18)

\bar{\lambda} = \frac{1}{k} = \frac{1}{k_0 \sqrt{E}}, \qquad k_0 = \frac{\sqrt{2 m_n}}{\hbar}\,\frac{awri}{awri + 1},    (2.19)

x = \frac{2(E - E_R)}{\Gamma_T},    (2.20)

\xi_R = \Gamma_T \sqrt{\frac{awri}{4 k_B E_R T}},    (2.21)
and the other symbols are: GR , symmetric total parameter; HR , asymmetric total
parameter; Ai and Bi , coefficients of the total background correction; φ0 , phase shift;
ER , energy of resonance R; ΓT , total resonance width; mn , neutron mass; and awri,
the mass ratio between the nuclide and the neutron. Since the only temperature
dependence of Eq. 2.16 is in ξR, and ξR only appears in ψ(x, ξR) and χ(x, ξR), by
writing ψ(x, ξR) and χ(x, ξR) as ψR(T) and χR(T) and expanding them in
terms of T ,
\psi_R(T) = \sum_i a_{R,i}\, f_i(T),    (2.22)

\chi_R(T) = \sum_i b_{R,i}\, h_i(T),    (2.23)

one can arrive at an expression for the total cross section in terms of a series expansion in T:

\sigma_t(E, T) = A_R + \sum_i a''_i f_i(T) + \sum_i b''_i h_i(T),    (2.24)

where a''_i and b''_i are parameters specific to a given energy and nuclide.
Through analyzing the asymptotic expansions of both \psi(x, \xi_R) and \chi(x, \xi_R), augmented by numerical investigation [5], a final form for \sigma_t(E, T),

\sigma_t(E, T) = \sum_{i=1}^{6} \frac{a_i}{T^{i/2}} + \sum_{i=1}^{6} b_i\, T^{i/2} + c,    (2.25)
where ai , bi and c are parameters unique to energy, reaction type and nuclide, was
found to give good accuracy for a number of nuclides examined over the temperature
ranges of 77 − 3200K.
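A minimal sketch of how the 13 coefficients of Eq. 2.25 could be fitted and then evaluated on the fly at one energy point is shown below; the reference temperatures and cross section values are synthetic placeholders, not data from [5] or [11].

# Minimal sketch of the regression form in Eq. 2.25: sigma(E, T) is represented at
# each energy point by 13 coefficients (a_1..a_6, b_1..b_6, c).
import numpy as np

def design_matrix(T):
    """Columns: T^(-1/2)..T^(-3), T^(1/2)..T^3, and a constant term."""
    T = np.asarray(T, dtype=float)
    cols = [T ** (-i / 2.0) for i in range(1, 7)]
    cols += [T ** (i / 2.0) for i in range(1, 7)]
    cols.append(np.ones_like(T))
    return np.column_stack(cols)

def fit_point(T_ref, sigma_ref):
    """Least-squares fit of the 13 coefficients from broadened reference data."""
    coeff, *_ = np.linalg.lstsq(design_matrix(T_ref), sigma_ref, rcond=None)
    return coeff

def evaluate_point(coeff, T):
    """On-the-fly evaluation of sigma(T) at one energy point from its coefficients."""
    return design_matrix(np.atleast_1d(T)) @ coeff

# Example with synthetic reference data (purely illustrative):
T_ref = np.linspace(77.0, 3200.0, 40)
sigma_ref = 10.0 + 50.0 / np.sqrt(T_ref)       # stand-in for NJOY-broadened values
coeff = fit_point(T_ref, sigma_ref)
print(evaluate_point(coeff, 900.0))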
As shown in [11], for production MCNP code, the overhead in performance to
incorporate this regression model method over the traditional table lookup method
is only 10% − 20%. However, for each 0 K energy point of each cross section
type of any nuclide, 13 parameters are needed, which suggests that the total
size of data for this method is around 13 times that of the standard 0 K cross sections,
or about 13 GB. Although reduced significantly from the table lookup method, this
size is still taxing for many hardware systems, such as GPUs.
2.1.3 Explicit temperature treatment method
In essence, the explicit temperature treatment method is not a method for Doppler
broadening, since there is no broadening of any cross section at any point during
the process. However, due to the fact that it does solve the problem of temperature
dependence of cross sections in a clever way, it is also included in this category.
As described in [6], the whole method is based on a concept similar to that of
Woodcock delta-tracking method [12], so essentially it is a type of rejection sampling
method. Specifically, for incident neutron energy E at ambient temperature T , a
majorant cross section for a nuclide n is defined as
\Sigma_{maj,n}(E) = g_n(E, T, awri_n) \max_{E' \in [(\sqrt{E} - 4/\lambda_n(T))^2,\ (\sqrt{E} + 4/\lambda_n(T))^2]} \Sigma^0_{tot,n}(E'),    (2.26)

where g_n(E, T, awri_n) is a correction factor for the temperature-initiated increase in the potential scattering cross section, of the form

g_n(E, T, awri_n) = \left[ 1 + \frac{1}{2\lambda_n(T)^2 E} \right] \mathrm{erf}[\lambda_n(T)\sqrt{E}] - \frac{e^{-\lambda_n(T)^2 E}}{\sqrt{\pi}\,\lambda_n(T)\sqrt{E}},    (2.27)

\lambda_n(T) = \sqrt{\frac{awri_n}{k_B T}},    (2.28)

\Sigma^0_{tot,n}(E) = N_n\, \sigma^0_{tot,n}(E),    (2.29)

and N_n and \sigma^0_{tot,n} are the number density and the 0 K total microscopic cross section of nuclide n. A majorant cross section for a material region with n different nuclides and a maximum temperature T_m is then defined as

\Sigma_{maj}(E) = \sum_n \Sigma_{maj,n}(E).    (2.30)
The neutron transport process can subsequently be simulated with the tracking
scheme shown in Algorithm 1.
Algorithm 1 Tracking scheme for explicit temperature treatment method
while true do
    starting at position r⃗, sample a path length l⃗ based on the majorant cross section at r⃗, Σ_maj(E, r⃗)
    get a temporary collision point, r⃗′ = r⃗ + l⃗
    sample the target nuclide at position r⃗′, with the probability for nuclide n given by P_n = Σ_maj,n(E, r⃗′) / Σ_maj(E, r⃗′)
    sample the target velocity from the Maxwellian distribution with temperature T_m(r⃗′); get the energy E′ corresponding to the relative velocity between the neutron and the target nucleus
    rejection sampling with criterion ξ < Σ⁰_tot,n(E′) / Σ_maj,n(E, r⃗′):
    if rejected then
        continue
    else
        set the collision point: r⃗ ← r⃗′
        sample the reaction type with energy E′ and the 0 K microscopic cross sections
        continue
    end if
end while
Combining the definition of majorant cross section and the tracking scheme, it is
clear that the material majorant cross section is defined such that every single nucleus
in this material is assigned with the maximum possible microscopic cross section (if
the high energy Maxwellian tail can be ignored), and the rejection sampling is then
performed based on the real microscopic cross section from Maxwell distribution.
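The core of one tracking iteration can be sketched as follows. This is an illustrative Python version of the nuclide-sampling and rejection step of Algorithm 1 under simplified assumptions (isotropic Maxwellian target motion, cross sections supplied as callables); names such as sigma0_tot and Sigma_maj are hypothetical placeholders, not the reference implementation.

# Sketch of the rejection step in Algorithm 1 after a tentative collision point is found.
import numpy as np

def sample_collision(E, rng, nuclides, T_m, kB=8.617333e-5):
    """nuclides: list of dicts with keys 'awri', 'N' (number density),
    'sigma0_tot' (0 K microscopic xs, callable of E) and
    'Sigma_maj' (majorant macroscopic xs, callable of E).
    Returns (accepted, nuclide index, relative energy E')."""
    # Sample the target nuclide from the majorant cross sections.
    maj = np.array([nuc['Sigma_maj'](E) for nuc in nuclides])
    n = rng.choice(len(nuclides), p=maj / maj.sum())
    nuc = nuclides[n]
    # Sample the target velocity from a Maxwellian at T_m and form the relative energy
    # (units: E in eV, kB in eV/K, neutron mass taken as 1 in awri units).
    v_n = np.sqrt(2.0 * E)
    v_t = rng.normal(0.0, np.sqrt(kB * T_m / nuc['awri']), size=3)
    v_rel = np.linalg.norm(np.array([v_n, 0.0, 0.0]) - v_t)
    E_rel = 0.5 * v_rel ** 2
    # Rejection test against the 0 K macroscopic total at the relative energy.
    accept = rng.random() < nuc['N'] * nuc['sigma0_tot'](E_rel) / maj[n]
    return accept, n, E_rel

# Toy usage with a constant 0 K cross section (illustrative values only):
nuc = {'awri': 238.0, 'N': 0.0223, 'sigma0_tot': lambda E: 10.0, 'Sigma_maj': lambda E: 0.25}
print(sample_collision(6.67, np.random.default_rng(0), [nuc], T_m=900.0))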
The dataset needed for this method are the 0 K cross sections and the majorant
microscopic cross sections for the material temperatures of interest, and thus the
storage requirement is on the order of a few GB. However, due to the use of delta
tracking and rejection sampling, the efficiency can be impacted. In fact, as reported
in [13], the runtime of explicit temperature treatment method is 2 − 4 times slower
than the standard table lookup method for a few test cases.
2.1.4 Other methods
There are also some other methods that have been developed for Doppler broadening.
They will be discussed briefly below.
The psi-chi method [14] is a single level resonance representation that introduces
a few substantial approximations to Eq. 2.1, including using the single-level Breit-Wigner formula for the 0 K cross section, omitting the second exponential term,
approximating the exponent in the first exponential term with Taylor series expansion and extending the lower limit of the integration to −∞. The final form of the
Doppler-broadened cross section uses the functions ψ(x, ξR ) and χ(x, ξR ) as defined
in Eq. 2.17 and 2.18. Due to the extensive approximations used, psi-chi method is
not very accurate for cross section evaluation, especially for low energies. Besides,
the single-level Breit-Wigner formalism is now considered obsolete as the resonance
representation for major nuclides.
The Fast-Doppler-Broadening method developed in [15] uses a two-point Gauss-Legendre quadrature for the integration in Eq. 2.7 when xi+1 − xi > 1.0, and uses
Eq. 2.8 otherwise. This method was found to be 2 − 3 times faster than Cullen’s
method, thus still not efficient enough for on-the-fly Doppler broadening.
Another method that was proposed during the course of this thesis makes use
of the fact that the speed of the target nucleus is usually much smaller than the
incident neutron speed. Consequently, the energy corresponding to the relative
velocity between the neutron and the target nucleus can be approximated as a function of vT µ,
where vT is the speed of the target nucleus and µ is the cosine of the angle between the neutron
velocity and the target velocity. It can be proved that vT µ obeys a Gaussian distribution;
as a result, the effective Doppler-broadened cross section at temperature T can be
expressed as a convolution of the 0 K cross section and a Gaussian distribution
\bar{\sigma}(E) = \int_{-\infty}^{\infty} \sigma^0[E_{rel}(u)]\, G(u, T)\, du,    (2.31)
where u = vT µ and G(u, T ) represents a Gaussian distribution for temperature T .
With an energy cut-off strategy similar to that used in Cullen's method, and by using a
tabulated cumulative distribution function (cdf) for the Gaussian distribution, the above
integral can be evaluated numerically and requires far fewer operations than
Cullen's method for each energy grid point. However, the inherent inaccuracy at low
energies and the still-high computational cost make it unattractive, especially
when compared to the approximate multipole method that will be discussed in the
next chapter; as a result, it was abandoned.
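For completeness, a minimal numerical evaluation of the convolution in Eq. 2.31 might look like the sketch below, using a first-order approximation for E_rel(u) and a simple quadrature over the Gaussian kernel; the 1/v test cross section and all constants are illustrative only.

# Quick numerical evaluation of Eq. 2.31 (illustrative, not the abandoned prototype).
import numpy as np

def broadened_xs(E, sigma0, T, awri, kB=8.617333e-5, n=2001, cutoff=4.0):
    v = np.sqrt(2.0 * E)                          # neutron speed, m_n = 1, E in eV
    s = np.sqrt(kB * T / awri)                    # std. dev. of u = v_T * mu
    u = np.linspace(-cutoff * s, cutoff * s, n)   # energy cut-off as in Cullen's method
    E_rel = 0.5 * (v - u) ** 2                    # relative energy, first order in u/v
    G = np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return np.sum(sigma0(E_rel) * G) * (u[1] - u[0])

print(broadened_xs(6.67, lambda E: 1.0 / np.sqrt(E), T=900.0, awri=236.0))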
2.2 General purpose computing on GPU
GPUs started as fixed-function hardware dedicated to handling the manipulation and display of 2D graphics, in order to offload computationally complex graphical calculations from the CPU. Due to the inherently parallel nature of displaying computer
graphics, more and more parallel processors have been added to a single
GPU card over the years, making it massively parallel and giving it tremendously high floating
point throughput. Nowadays, a single GPU card is capable of performing trillions
of floating point operations per second (teraflops), much
higher than high-end CPUs (see Fig. 2-1). In order to utilize this huge
computational power in other areas of computing, efforts have been made to
transform GPUs into fully programmable processors capable of general-purpose computing, among which is NVidia's CUDA (Compute Unified Device Architecture) [16],
a general-purpose parallel computing architecture. Ever since then, many scientific
applications have been accelerated with GPUs, including some cases for Monte Carlo
neutron transport such as [17, 18, 19, 20]. In the remaining part of this section, the
architectural aspects of GPUs and the CUDA programming model, as well as some
of the performance considerations for GPUs, are briefly reviewed.
Figure 2-1: Evolution of GPU and CPU throughput [1].
2.2.1 GPU architecture
GPU architecture is very different from conventional multi-core CPU design, since
GPUs are fundamentally geared towards high-throughput (versus low-latency) computing. As a parallel architecture, it is designed to process large quantities of concurrent, fine-grained tasks.
Fig. 2-2 illustrates the architecture of an NVidia GPU. A typical high performance
GPU usually contains a number of streaming multiprocessors (SMs), each of which
comprises tens of homogeneous processing cores. SMs use single instruction multiple
thread (SIMT) and simultaneous multithreading (SMT) techniques to map threads
of execution onto these cores.
SIMT techniques are architecturally efficient in that one hardware unit for instruction issue can service many data paths while different threads execute in lock step fashion.
However, to avoid the scalability issues of signal propagation delay and underutilization that may happen with a single instruction stream for the entire SM [21], GPUs
typically implement fixed-size SIMT groupings of threads called warps, and the width of a warp is usually 32. Distinct warps are not run in lock step and may diverge.

Figure 2-2: Architecture of an Nvidia GF100 card (Courtesy of Nvidia).
Using SMT techniques, each SM maintains and schedules the execution contexts of
many warps. This style of SMT enables GPUs to hide latency by switching amongst
warp contexts when architectural, data, and control hazards would normally introduce stalls. This leads to a more efficient utilization of the available physical resource,
and the maximal instruction throughput occurs when the number of thread contexts
is much greater than the aggregate number of SIMT lanes per SM.
Communication between threads is achieved by reading and writing data to various shared memory spaces. As shown in Fig. 2-3, GPUs have three levels of explicitly
managed storage that vary in terms of visibility and latency: per-thread registers,
shared memory local to a collection of warps running on the same SM, and a large
global (device) memory in off-chip DRAM that is accessible by all threads.
Figure 2-3: GPU memory hierarchy (Courtesy of Nvidia).
Unlike traditional CPU architecture, GPUs do not implement data caches for
the purpose of maintaining the program's working set in nearby, low-latency storage.
Rather, the cumulative register file comprises the bulk of on-chip storage, and a
much smaller cache hierarchy often exists for the primary purpose of smoothing over
irregular memory access patterns. In addition, there are also two special types of read-only memory on GPUs: constant and texture memory. Both are part of the global
memory, but each of them has a corresponding cache that can facilitate the memory
access. In general, constant memory is most efficient when all threads in a warp
access the same memory location, while texture memory is good for memory access
with spatial locality. Since different types of memory have very different latency and
bandwidth, as shown in Table 2.1, a proper use of the memory hierarchy is essential
in achieving a good performance for GPU programs.
Table 2.1: Latency and throughput of different types of memory in GPU [2].
                          Registers         Shared memory   Global memory   Constant memory
Latency (clock cycles)    ~1                ~5              ~500            ~5 with caching
Bandwidth                 Extremely high    High            Modest          High
2.2.2 CUDA Programming Model
CUDA is both a hardware architecture and a software platform to expose that hardware at a high level for general purpose computing. CUDA gives developers access
to the virtual instruction set and memory of the parallel computational elements in
GPUs. This platform is accessible through extensions to industry-standard programming languages, including C/C++ and FORTRAN, and CUDA C is used in current
work.
A typical CUDA C program consists of two parts of code: host code and device
code. The host code is similar to any common C program, except that it needs to
take care of memory allocation on GPU, data transfer between CPU and GPU and
offload the computation to the GPU. Device code, as suggested by the name, is the code
that executes on the device (GPU). The device code is started with a kernel launch
from the host code, through which a kernel function is executed by a collection of
logical threads on the GPU. These logical threads are mapped onto hardware threads
by a scheduling runtime, either in software or hardware.
In CUDA, all logical threads are organized into a two-level hierarchy, block and
grid, for efficient thread management. Specifically, a block contains a number of
cooperative threads that will be assigned to the same SM, and all the blocks of a
kernel form a grid, as shown in Fig. 2-4. During a kernel launch, both block and
grid dimensions have to be specified. The thread execution can then be specialized
by its identifier in the hierarchy, allowing threads to determine which portion of the
problem to operate on.
Figure 2-4: CUDA thread, block and grid hierarchy [1].
The threads within a block share the same local shared memory, since they are
assigned on the same SM, and some synchronization instructions exist to ensure
memory coherence for this per-block shared memory. On the other hand, global
memory, which can be accessed by all the threads within a grid, or threads for a
kernel, is only guaranteed to be consistent at the boundaries between sequential kernel
invocations.
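The thesis implementation is written in CUDA C; the sketch below uses Python's Numba CUDA bindings purely to illustrate, in a compact runnable form, the host/device split, the block/grid dimensions and the per-thread indexing discussed above. The SAXPY kernel and all sizes are made up for illustration.

# Illustration of the CUDA kernel-launch pattern (Numba CUDA, not the thesis code).
import numpy as np
from numba import cuda

@cuda.jit
def saxpy_kernel(a, x, y, out):
    # Each logical thread identifies its element from its position in the grid.
    i = cuda.grid(1)
    if i < out.size:                 # guard threads past the end of the array
        out[i] = a * x[i] + y[i]

# Host code: allocate device memory, launch with explicit block/grid dimensions,
# then copy the result back.
n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
d_x, d_y = cuda.to_device(x), cuda.to_device(y)
d_out = cuda.device_array_like(x)

threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
saxpy_kernel[blocks_per_grid, threads_per_block](np.float32(2.0), d_x, d_y, d_out)
out = d_out.copy_to_host()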
2.2.3 GPU performance pitfalls
Although GPUs can deliver very high throughput on parallel computations, they
require large amounts of fine-grained concurrency to do so. In addition, due to the fact
that GPUs have long been a special-purpose processor for graphics, the underlying
hardware can penalize algorithms having irregular and dynamic behavior not typical
of computations related to graphics. In fact, GPU architecture is particularly sensitive
to load imbalance among processing elements. Here, two major performance pitfalls
relevant to current work are discussed: variable memory access cost from SIMT access
patterns and thread divergence.
Non-uniform memory access
Due to the SIMT nature of the GPU programming model, for each load/store instruction executed by a warp in one lock step, there are probably a few accesses to
different memory locations. Different types of physical memory may respond differently to such collective accesses, but they all share one thing in common, that is, each
of them is optimized for specific access patterns. A mismatched access pattern
often leads to significant underutilization of the physical resource that can be as high
as an order of magnitude.
For global memory, the individual accesses for each thread within a warp can be
combined/bundled together by the memory subsystem into a single memory transaction, if every reference falls within the same contiguous global memory segment. This
is called “memory coalescing”. As a result, if each thread references a memory
location of a distinct segment, then many transactions have to be made, which is very
wasteful of the memory bandwidth.
For local shared memory, performance is highest when no threads within the same
warp access different words in a same memory bank. Otherwise the memory accesses
(to the same bank) will be serialized by the hardware, which is called a “bank conflict”. Bank
conflicts are a serious performance hurdle for shared memory and need to be avoided to
achieve good performance.
Thread divergence
In the GPU programming model, logical threads are grouped into warps of execution.
A single program counter is shared by all threads within the warp. Warps, not threads,
are free to pursue independent paths through the kernel program.
To provide the illusion of individualized control flow, the execution model must
transparently handle branch divergence. This situation occurs when a conditional
branch instruction, like “If...Else...”, would redirect a subset of threads down the “If”
path, leaving the others to continue the “Else” path. Because threads within the
warp proceed in lock step fashion, the warp must necessarily execute both halves of
the branch, making some of the threads idle where appropriate. This mechanism
can lead to an inherently unfair scheduling of logical threads. In the worst case,
only one logical thread may be active while all others threads perform no work. The
GPU’s relatively large SIMT width exacerbates the problem and branch divergence
can impose an order of magnitude slowdown in overall computing throughput.
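The two pitfalls can be illustrated with the toy Numba CUDA kernels below; these are hypothetical examples, not code from this work. The first issues strided global memory accesses that cannot be coalesced, and the second contains a branch that splits every warp.

# Toy kernels illustrating non-coalesced access and thread divergence (Numba CUDA).
import numpy as np
from numba import cuda

@cuda.jit
def strided_copy(src, dst, stride):
    # Pitfall 1: neighbouring threads touch addresses `stride` elements apart,
    # so a warp's accesses cannot be coalesced into a single memory transaction.
    i = cuda.grid(1)
    if i < dst.size:
        dst[i] = src[(i * stride) % src.size]

@cuda.jit
def divergent_kernel(x, out):
    # Pitfall 2: the branch below splits each warp; both sides are executed in
    # turn with part of the warp masked off, reducing useful throughput.
    i = cuda.grid(1)
    if i < x.size:
        if i % 2 == 0:
            out[i] = x[i] * x[i]
        else:
            out[i] = -x[i]

n = 1 << 20
src = cuda.to_device(np.random.rand(n).astype(np.float32))
dst = cuda.device_array(n, dtype=np.float32)
threads, blocks = 256, (n + 255) // 256
strided_copy[blocks, threads](src, dst, 32)
divergent_kernel[blocks, threads](src, dst)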
2.2.4 Floating point precision support
For the early generation of GPU cards, there was only support for single precision
floating point numbers and the corresponding arithmetic, since accuracy is not a big issue for
image processing and graphics display. However, with the advent of general purpose
GPU computing for scientific purposes, the need for support of double precision floating point arithmetic becomes imperative. As a result, since the introduction of the GT200
card and CUDA Compute Capability 1.3, double precision arithmetic has been supported
on most Nvidia high performance GPU cards.
For Nvidia GPUs earlier than Compute Capability 2.0, each SM has only one special
function unit (SFU) that can perform double precision computation, while there are
eight units for single precision computation. As a result, on these GPU cards, the
theoretical peak performance of double precision computation is only one eighth of
that of single precision computation. Since CUDA Compute Capability 2.0, more units
that support double precision computation have been added to GPUs, and nowadays the
theoretical peak performance of double precision computation is usually a half of
that of single precision on a normal Nvidia high performance GPU card.
Chapter 3
Approximate Multipole Method
This chapter starts with the description of multipole representation. The formulation of the approximate multipole method for on-the-fly Doppler broadening is then
presented, along with the overlapping energy domains strategy. A systematic study
of a few important parameters in the proposed strategy then follows. The chapter
concludes with a discussion about an important function in the multipole method.
3.1 Multipole representation

3.1.1 Theory of multipole representation
In resonance theory, the reaction cross section for any incident channel c and exit
channel c′ can be expressed in terms of the collision matrix U_{cc′} as

\sigma_{cc'} = \pi\bar{\lambda}^2 g_c\, |\delta_{cc'} - U_{cc'}|^2,    (3.1)

where g_c and \delta_{cc'} are the statistical spin factor and the Kronecker delta, respectively, and \bar{\lambda} is as defined in Eq. 2.19. Similarly, the total cross section can be derived as

\sigma_t = \sum_{c'} \sigma_{cc'} = 2\pi\bar{\lambda}^2 g_c\, (1 - \mathrm{Re}\, U_{cc'}).    (3.2)
The collision matrix can be described by R-matrix representation, which has four
practical formalisms, i.e., single-level Breit-Wigner (SLBW), multilevel Breit-Wigner
(MLBW), Adler-Adler, and Reich-Moore. Because of the rigor of the Reich-Moore
formalism in representing the energy behavior of the cross section, it is used extensively for major actinides in the current ENDF/B format. In Reich-Moore formalism,
the collision matrix can be represented by the transmission probability[22]
U_{cc'} = e^{-i(\phi_c + \phi_{c'})}\, (\delta_{cc'} - 2\rho_{cc'}),    (3.3)
where \rho_{cc'} is the transmission probability from channel c to c′, and \phi_c and \phi_{c'} are the
hard-sphere phase shifts of channels c and c′, respectively.
Due to the physical condition that the collision matrix must be single-valued and
meromorphic1 in momentum space, the collision matrix and thus the transmission
probability can be rigorously represented by rational functions with simple poles in
the \sqrt{E} domain [7]. This is a generalization of the rationale suggested by de Saussure
and Perez [23] for the s-wave resonances, and lays the foundation for the multipole
representation as described in [7]. In this representation, the neutron-neutron and the
generic neutron to channel c transmission probabilities of the Reich-Moore formalism
for all N resonances with angular momentum number l can be written as
\rho_{nn} = \frac{P_n^{2M-1}(\sqrt{E})}{P^{2M}(\sqrt{E})} = \sum_{\lambda=1}^{2M} \frac{R_{n\lambda}}{p_\lambda - \sqrt{E}},    (3.4)

|\rho_{nc}|^2 = \frac{|P_c^{2M-1}(\sqrt{E})|^2}{|P^{2M}(\sqrt{E})|^2} = \sum_{\lambda=1}^{2M} \frac{2 R_{c\lambda}}{p_\lambda - \sqrt{E}},    (3.5)

¹A meromorphic function is a function that is well behaved except at isolated points. In contrast, a holomorphic function is a well-behaved function in the whole domain.
where ρnn and ρnc are the transmission probabilities, P represents a holomorphic
function, pλ ’s are the poles of the complex function while Rnλ and Rcλ are the corresponding residues for the transmission probability, and M = (l + 1)N . These two
equations suggest that a resonance with angular momentum number l corresponds
to 2(l + 1) poles. Finding the poles is complicated since they are the roots of a
high order complex polynomial with roots that are often quite close to each other
in momentum space. A code package, WHOPPER, was developed by Argonne National Laboratory to find all the complex poles, making use of good initial guesses
and quadruple precision [7]. Once all poles and residues have been obtained, the 0 K
neutron cross-sections can be computed by substituting Eq. 3.3 - 3.5 into Eq. 3.1 - 3.2, which yields
\sigma_x(E) = \frac{1}{E} \sum_{l,J} \sum_{\lambda=1}^{N} \sum_{j=1}^{2(l+1)} \mathrm{Re}\!\left[ \frac{-i\, R^{(x)}_{l,J,\lambda,j}}{p_\lambda^{(j)*} - \sqrt{E}} \right],    (3.6)

\sigma_t(E) = \sigma_p(E) + \frac{1}{E} \sum_{l,J} \sum_{\lambda=1}^{N} \sum_{j=1}^{2(l+1)} \mathrm{Re}\!\left[ e^{-2i\phi_l}\, \frac{-i\, R^{(t)}_{l,J,\lambda,j}}{p_\lambda^{(j)*} - \sqrt{E}} \right],    (3.7)

where p_\lambda^{(j)*} is the complex conjugate of the j-th pole of resonance \lambda, R^{(x)}_{l,J,\lambda,j} and R^{(t)}_{l,J,\lambda,j} are the residues for reaction type x and the total cross section, respectively, and the potential cross section \sigma_p(E) is given by

\sigma_p(E) = \sum_{l,J} 4\pi\bar{\lambda}^2 g_J \sin^2\phi_l,    (3.8)

with \phi_l being the phase shift. In this form, the cross sections can be computed by
summations over angular momentum of the channel (l), channel spin (J), number of
resonances (N ) and number of poles associated to a given resonance(2(l + 1)).
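As a concrete illustration of the 0 K form in Eq. 3.6, the short Python sketch below sums the pole contributions for one (l, J) sequence; the pole and residue values are fictitious and are not taken from any ENDF/B evaluation.

# Toy evaluation of the 0 K multipole sum in Eq. 3.6 (illustrative data only).
import numpy as np

def sigma_x_0K(E, poles, residues):
    """sigma_x(E) = (1/E) * sum_j Re[ -i R_j / (p_j* - sqrt(E)) ]  (Eq. 3.6)."""
    sqrtE = np.sqrt(E)
    terms = -1j * residues / (np.conj(poles) - sqrtE)
    return terms.real.sum() / E

# Two fictitious poles near sqrt(E) = 2.58 (E ~ 6.67 eV) and their residues.
poles = np.array([2.5826 + 1.4e-4j, 5.123 + 3.0e-4j])
residues = np.array([1.2e-3 + 4.5e-4j, 8.0e-4 + 2.1e-4j])
print(sigma_x_0K(6.67, poles, residues))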
3.1.2 Doppler broadening
By casting the expression for cross section into the form of Lorentzian-like terms
as shown in Eq. 3.6 and 3.7, the Doppler broadened cross section can be derived in
analytical forms consisting of well-known functions, as demonstrated in [7]. Specifically,
with the use of the Solbrig kernel [24],

S(\sqrt{E}, \sqrt{E'}) = \frac{\sqrt{E'}}{\Delta_m \sqrt{\pi E}} \left\{ e^{-(\sqrt{E} - \sqrt{E'})^2 / \Delta_m^2} - e^{-(\sqrt{E} + \sqrt{E'})^2 / \Delta_m^2} \right\},    (3.9)

where \Delta_m = \sqrt{k_B T / awri} is the Doppler width in momentum space and k_B is the Boltzmann constant, Eq. 3.6 and 3.7 can be Doppler broadened to take the following form:

\sigma_x(E, T) = \frac{1}{E} \sum_{l,J} \sum_{\lambda=1}^{N} \sum_{j=1}^{2(l+1)} \frac{1}{\Delta_m} \mathrm{Re}\!\left[ \sqrt{\pi}\, R^{(x)}_{l,J,\lambda,j}\, W(z_0) + \frac{i R^{(x)}_{l,J,\lambda,j}}{\sqrt{\pi}}\, C\!\left( \frac{\sqrt{E}}{\Delta_m}, \frac{p_\lambda^{(j)*}}{\Delta_m} \right) \right],    (3.10)

\sigma_t(E, T) = \sigma_p(E) + \frac{1}{E} \sum_{l,J} \sum_{\lambda=1}^{N} \sum_{j=1}^{2(l+1)} \frac{1}{\Delta_m} \mathrm{Re}\!\left\{ e^{-2i\phi_l} \left[ \sqrt{\pi}\, R^{(t)}_{l,J,\lambda,j}\, W(z_0) + \frac{i R^{(t)}_{l,J,\lambda,j}}{\sqrt{\pi}}\, C\!\left( \frac{\sqrt{E}}{\Delta_m}, \frac{p_\lambda^{(j)*}}{\Delta_m} \right) \right] \right\},    (3.11)

where

z_0 = \frac{\sqrt{E} - p_\lambda^{(j)*}}{\Delta_m},    (3.12)

W(z) is the Faddeeva function, defined as

W(z) = \frac{i}{\pi} \int_{-\infty}^{\infty} \frac{e^{-t^2}\, dt}{z - t},    (3.13)

and the quantity related to C can be regarded as a low energy correction term, with the full expression

C\!\left( \frac{\sqrt{E}}{\Delta_m}, \frac{p_\lambda^{(j)*}}{\Delta_m} \right) = \frac{-2i\, p_\lambda^{(j)*}}{\sqrt{\pi}\, \Delta_m}\, e^{-E/\Delta_m^2} \int_0^\infty dt\, \frac{e^{-t^2 - 2\sqrt{E}\, t/\Delta_m}}{[p_\lambda^{(j)*}/\Delta_m]^2 - t^2}.    (3.14)
Since this correction term is generally negligible for energies above the thermal region [25], and only energies above 1 eV are of interest in this thesis, this correction term will be ignored from now on. As a result, the above equations become

\sigma_x(E, T) = \frac{1}{E} \sum_{l,J} \sum_{\lambda=1}^{N} \sum_{j=1}^{2(l+1)} \frac{\sqrt{\pi}}{\Delta_m} \mathrm{Re}\!\left[ R^{(x)}_{l,J,\lambda,j}\, W(z_0) \right],    (3.15)

\sigma_t(E, T) = \sigma_p(E) + \frac{1}{E} \sum_{l,J} \sum_{\lambda=1}^{N} \sum_{j=1}^{2(l+1)} \frac{\sqrt{\pi}}{\Delta_m} \mathrm{Re}\!\left[ e^{-2i\phi_l}\, R^{(t)}_{l,J,\lambda,j}\, W(z_0) \right],    (3.16)
and Doppler broadening a cross section at a given energy E is thus reduced to a
summation over all poles, each with a separate Faddeeva function evaluation.
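The broadened sum of Eq. 3.15 maps directly onto one Faddeeva evaluation per pole, for example via scipy.special.wofz (the same reference routine the thesis compares against). The sketch below reuses the fictitious poles from the previous example; the Doppler width assumes a hypothetical nuclide with awri = 236 at 3000 K.

# Doppler broadening of the toy pole set via Eq. 3.15 (illustrative data only).
import numpy as np
from scipy.special import wofz

def sigma_x_T(E, poles, residues, delta_m):
    """sigma_x(E,T) = (1/E) * sum_j (sqrt(pi)/Delta_m) Re[R_j W(z0)],
    with z0 = (sqrt(E) - p_j*) / Delta_m (Eq. 3.12 and 3.15)."""
    z0 = (np.sqrt(E) - np.conj(poles)) / delta_m
    terms = np.sqrt(np.pi) / delta_m * residues * wofz(z0)
    return terms.real.sum() / E

poles = np.array([2.5826 + 1.4e-4j, 5.123 + 3.0e-4j])
residues = np.array([1.2e-3 + 4.5e-4j, 8.0e-4 + 2.1e-4j])
kB, T, awri = 8.617333e-5, 3000.0, 236.0        # eV/K, K, mass ratio
delta_m = np.sqrt(kB * T / awri)                # Doppler width in momentum space
print(sigma_x_T(6.67, poles, residues, delta_m))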
3.1.3 Characteristics of poles
As mentioned in 3.1.1, each resonance with angular momentum l corresponds to in total 2(l + 1) poles. Among them, two are called “s-wavelike” poles: their real parts are of opposite sign but equal magnitude, which is close to the square root of the resonance energy, while their imaginary parts are very small. The other 2l poles behave like l conjugate pairs featuring large imaginary parts with a characteristic magnitude of 1/(k_0 a), where a is the channel radius and k_0 is defined in Eq. 2.19. Fig. 3-1 shows the pole distribution of U238, which has both l = 0 and l = 1 resonances, to demonstrate the relative magnitudes of the different types of poles.
[Figure 3-1: Poles distribution for U238 in the complex momentum plane (Re(p) versus Im(p)). Black and red dots represent the poles with l = 0 and with positive and negative real parts, respectively, and green dots represent the poles with l > 0.]

In ENDF/B resonance parameters, for most major nuclides, there are usually a number of artificial resonances (called “external resonances”) outside of the resolved resonance regions. Since they are general s-wave resonances, each of them also corresponds to two poles, the same as an ordinary s-wave resonance. Besides, for the
external resonances that are above the upper bound of resolved resonance region, the
value of poles also obey the same rule as that for the ordinary s-wave, while for those
that have negative resonance energies, the two poles are of opposite sign and the absolute value of the real part is very small and that of the imaginary part is relatively
large, as demonstrated by the black and red dots (almost) along the imaginary axis
in Fig. 3-1.
3.1.4 Previous efforts on reducing poles to broaden
To reduce the number of poles to be broadened when evaluating cross sections for
elevated temperatures, an approach was proposed to replace half of the first type of
poles with a few pseudo-poles[25]. Specifically, it was noticed that the first type of
poles with negative real part have very smooth contributions to the resolved resonance
energy range; therefore, the summation of the contributions from these poles is also
smooth and can be approximated by a few smooth functions. Moreover, it turned
out that rational functions of the same form as the pole representation (Eq.
3.6 and 3.7) work very well as the fitting functions. Each of these rational functions
effectively defines a “pseudo-pole”. It was found that only three such pseudo-poles are necessary to achieve good accuracy for the ENDF/B-VI U238 evaluation
[25]. In addition, the contribution from the second type of poles were found to be
temperature independent due to their exceedingly large Doppler widths, therefore
they do not need to be broadened for elevated temperatures. As a result, only half
of the first type of poles, which is the same as the number of resonances (including
the external resonances), plus three pseudo-poles, are left to be broadened, which
effectively reduces the number of poles to be broadened.
For the new ENDF/B-VII U238 evaluation, there are 3343 resonances in total.
A preliminary study shows that around 10 pseudo-poles are needed to approximate
the contributions from the first type of poles with negative real part. Therefore, a
total of around 3353 poles need to be broadened.
In general, this is a good strategy in increasing the computational efficiency of
Doppler broadening process. However, the number of poles to be broadened is still
too large to be practical for performing Doppler broadening on the fly. Therefore, a
new strategy is proposed during the course of this thesis to further reduce the number
of poles to be broadened and will be discussed in the following sections.
3.2 Approximate multipole method

3.2.1 Properties of Faddeeva function and the implications
From the definition of the Faddeeva function in Eq. 3.13, it can be approximated by a Gauss-Hermite quadrature,

W(z) \approx \frac{i}{\pi} \sum_{j=1}^{M} \frac{a_j}{z - t_j},    (3.17)
where aj and tj are the Gauss-Hermite weights and nodes, respectively. As demonstrated in [26], the following number of quadrature points are sufficient to ensure a
relative error less than 10−5 :
M = \begin{cases} 6, & 3.9 < |z| < 6.9 \\ 4, & 6.9 < |z| < 20.0 \\ 2, & 20.0 < |z| < 100 \\ 1, & |z| \ge 100. \end{cases}    (3.18)
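As a quick numerical check of Eq. 3.17 and 3.18, the sketch below builds W(z) from Gauss-Hermite nodes and weights and compares it with scipy.special.wofz. It is illustrative only; the region |z| ≤ 3.9 is left to whatever low-|z| treatment is used elsewhere (e.g. the table-based QUICKW routine discussed later).

# Gauss-Hermite approximation of the Faddeeva function (Eq. 3.17-3.18), checked
# against scipy.special.wofz. Purely an illustrative verification script.
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.special import wofz

def order_for(z):
    """Quadrature order from Eq. 3.18."""
    r = abs(z)
    if r >= 100.0:
        return 1
    if r > 20.0:
        return 2
    if r > 6.9:
        return 4
    return 6

def W_quad(z):
    t, a = hermgauss(order_for(z))          # Gauss-Hermite nodes and weights
    return 1j / np.pi * np.sum(a / (z - t))

z = 5.0 + 0.3j
print(W_quad(z), wofz(z), abs(W_quad(z) - wofz(z)) / abs(wofz(z)))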
The nodes and weights for the Gauss-Hermite quadrature of degree one are 0 and \sqrt{\pi}, respectively. Therefore, if \left| (\sqrt{E} - p_\lambda^{(j)*}) / \Delta_m \right| \ge 100, then

\frac{\sqrt{\pi}}{\Delta_m}\, W\!\left[ \frac{\sqrt{E} - p_\lambda^{(j)*}}{\Delta_m} \right] \approx \frac{\sqrt{\pi}}{\Delta_m} \cdot \frac{i}{\pi} \cdot \frac{\sqrt{\pi}\, \Delta_m}{\sqrt{E} - p_\lambda^{(j)*}} = \frac{-i}{p_\lambda^{(j)*} - \sqrt{E}}.
By comparing to the 0 K expression in Eq. 3.6, this suggests that, from a practical
point of view, the Doppler broadening effect is only significant within a range of
∼100 Doppler widths away from a pole in the complex momentum domain [25]. In other
words, to calculate the cross section at any energy point for a certain temperature,
one can first sum up the contribution at 0 K from all those poles that are more than 100 Doppler
widths away from this energy, since this part is independent of temperature, and then
broaden the other neighboring poles for the desired temperature similar to Eq. 3.15
or Eq. 3.16. Taking the cross section evaluation of U238 at 1 keV and 3000 K as an
example, 100 Doppler widths corresponds to about three in the momentum domain,
within which there are only around 470 poles. Compared to the total number to be
evaluated from 3.1.4, it is clear that this strategy can effectively reduce the number
of poles to be broadened on the fly.
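The splitting described above can be sketched as follows: poles beyond a chosen number of Doppler widths contribute through their temperature-independent 0 K form, and only the remaining poles are broadened on the fly. The pole data are again fictitious placeholders.

# Sketch of the faraway/nearby pole split (Eq. 3.6 vs. Eq. 3.15); illustrative only.
import numpy as np
from scipy.special import wofz

def sigma_x_split(E, poles, residues, delta_m, n_widths=100.0):
    sqrtE = np.sqrt(E)
    far = np.abs(sqrtE - np.conj(poles)) / delta_m >= n_widths
    # Temperature-independent part (Eq. 3.6 form) for the faraway poles.
    cold = (-1j * residues[far] / (np.conj(poles[far]) - sqrtE)).real.sum()
    # On-the-fly broadened part (Eq. 3.15 form) for the nearby poles only.
    z0 = (sqrtE - np.conj(poles[~far])) / delta_m
    hot = (np.sqrt(np.pi) / delta_m * residues[~far] * wofz(z0)).real.sum()
    return (cold + hot) / E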
As mentioned in 3.1.3, the second type of poles usually have very large imaginary
parts that are much greater than the Doppler width; therefore, they can be treated
as temperature independent. As to the first type of poles, the ones with negative real
parts can in general also be treated as temperature independent², since cross sections
are always computed in the complex domain at points with positive real parts. In addition,
the poles from negative resonances can also be treated as temperature independent,
since they usually have large imaginary parts. As a result, one is left with only
the first-type poles with positive real parts that arise from positive resonances, which are
denoted as “principal poles” henceforth.

²Strictly speaking, this property does not hold for poles that are very close to the origin and when evaluating cross sections at very low energies, but this energy range is not of interest in the current work.
Fig. 3-2 and 3-3 show the relative errors of the total cross sections of U238 and U235
at 3000 K, calculated by broadening only the principal poles, against the NJOY
data. All results are based on the ENDF/B-VII resonance parameters. In general,
the agreement between the multipole representation and NJOY is very good, except
at some local minimum points (i.e., very low absolute cross section values) in U238,
which have little practical importance. From now on, the cross sections calculated by
broadening all principal poles will be used as the reference multipole representation
cross sections, unless otherwise noted.
Figure 3-2: Relative error of U238 total cross section at 3000 K, broadening only principal poles, compared with NJOY data.

An attractive property of the principal poles is that the input arguments to the
Faddeeva function from these poles always fall in the upper half plane of the complex
domain, where the Faddeeva function is well-behaved, as shown in Fig. 3-4. In this
region, due to the presence of the exponential term, the Faddeeva function decreases
rapidly with increasing |z|. This property indicates that, everything else being the
same, faraway poles in general make a much smaller contribution to a given energy
point than nearby poles. Therefore, even though some poles lie within the 100-Doppler-width
range, if their contribution is very small and its variation with temperature is even
smaller, these poles may also be treated as temperature independent. This can further
reduce the number of poles to be broadened. However, since the relative importance
of the contributions from different poles also depends on the values of their residues,
this effect has to be studied on a nuclide-by-nuclide basis, which will be discussed later.
Figure 3-3: Relative error of U235 total cross section at 3000 K, broadening only principal poles, against NJOY data.
3.2.2 Overlapping energy domains strategy
Since the imaginary parts of the principal poles are very small, the value of the real
part is the main factor deciding whether a pole is close to or far away from an
energy point to be evaluated, and thus whether it is significant in terms of the Doppler
broadening effect. As a result, for any given energy E at which the cross section needs
to be evaluated, the principal poles that are close to it in momentum space are
consecutive in the energy domain. Therefore, if the principal poles are sorted by their real
parts, the poles that are close to E can be specified with a start and an end index.
This property makes the storage of this information very convenient.
As a result, one can divide the resolved resonance region of any nuclide into many
equal-sized small energy intervals (so that a direct index fetch instead of a binary
search can be used).

Figure 3-4: Faddeeva function in the upper half plane of the complex domain. (a) Real part; (b) Imaginary part.

For each interval, there are only a certain number of local poles
(including those inside the interval) that have a broadening effect on this interval, and
these poles can be recorded with their indices. For the other poles, their contribution
to this interval can be pre-calculated. Since the Faddeeva function is smooth away
from the origin, the accumulated contribution will also be smooth and can be
approximated with a low-order polynomial. Since the potential part of the total cross
section is very smooth as well, it can also be included in the background cross section.
Therefore, when evaluating the cross section, one needs only to broaden those local
poles, and to add the background cross section approximated from the polynomials.
This can be expressed as
\[
\sigma_x(E) = p_x(E) + \frac{1}{E} \sum_{p_\lambda \in \Omega} \frac{\sqrt{\pi}}{\Delta_m}\, \mathrm{Re}\!\left[ R^{(x)}_{l,J,\lambda,j}\, W(z_0) \right] \tag{3.19}
\]
\[
\sigma_t(E) = p_t(E) + \frac{1}{E} \sum_{p_\lambda \in \Omega} \frac{\sqrt{\pi}}{\Delta_m}\, \mathrm{Re}\!\left[ e^{-2i\phi_l} R^{(t)}_{l,J,\lambda,j}\, W(z_0) \right] \tag{3.20}
\]
where px (E) and pt (E) represent the polynomial approximation of background cross
section, and pλ ∈ Ω represent the poles that need to be broadened on the fly.
To ensure the smoothness and temperature independence of the background cross
section in the energy interval (the inner window), the poles that are within a certain
distance of both edges of the energy interval also have to be broadened on
the fly. To this end, an overlapping energy domains strategy is used: outside of
the inner window, an equal-sized window in energy (the outer window) is chosen on each
side, and the poles within both the inner and outer windows need to be broadened on the fly, as
demonstrated in Fig. 3-5. The sizes of the outer and inner windows affect the number
of poles to be broadened on the fly, as well as the storage size and accuracy, which
will be studied in later parts of the thesis.
Figure 3-5: Demonstration of the overlapping window: an inner window flanked by two outer windows, with the background p(E).
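Concretely, the direct index fetch and background evaluation described above might look like the following C sketch; the struct layout and names (window_entry, N_COEF, etc.) are illustrative only and not taken from the thesis code.

    #define N_COEF 3

    typedef struct {
        int    pole_start, pole_end;   /* indices of poles broadened on the fly */
        double coefs[N_COEF];          /* polynomial background coefficients    */
    } window_entry;

    /* Locate the equal-sized inner window containing E (width w, starting at
     * e_min) and evaluate its background polynomial. */
    static double background_xs(const window_entry *win, double e_min, double w,
                                double E, int *start, int *end)
    {
        int iw = (int)((E - e_min) / w);        /* direct index fetch, no search */
        const window_entry *entry = &win[iw];

        double p = 0.0;
        for (int k = N_COEF - 1; k >= 0; k--)   /* Horner evaluation */
            p = p * E + entry->coefs[k];

        *start = entry->pole_start;
        *end   = entry->pole_end;
        return p;
    }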
3.3 Outer and inner window size
In the current overlapping energy domains strategy, to evaluate the cross section for
a given energy, one needs to broaden a number of poles lying inside the inner and
outer windows, as well as to calculate the background cross section. It is anticipated
that the major time component for this method will be the broadening of poles, so
the number of poles is an important metric to be monitored.
In general, the outer window mainly determines the number of poles to be evaluated; therefore, a smaller outer window is always preferred, provided the accuracy
criterion is met. As to the inner window size, although it also affects the number of
poles to be broadened, this effect is usually less significant than that of the outer window
size. Instead, its major impact is on the storage size of the supplemental information
other than poles and residues. As mentioned before, for each inner window, two indices of the poles to be broadened need to be stored, in addition to the polynomial
coefficients. The number of inner windows depends both on the length of the resolved
resonance region and on the inner window size. Since the resolved resonance region is
fixed for each nuclide, the larger the inner window, the fewer inner windows there are,
and thus the less data to be stored. However, as the inner window gets larger, a higher-order
polynomial may be necessary to ensure a certain accuracy for the background
cross section, which may also increase the storage size. As a result, there is a tradeoff
between performance and storage size in choosing the inner window size, which adds complexity.
As a starting point, the inner window size is set to the average spacing of the
resonances for each nuclide (e.g. 5.98 eV for U238 and 0.70 eV for U235). There
are two advantages to this choice: first, there is in general only one pole lying in
the inner window, which imposes a very small performance overhead; second, the
number of inner windows is bounded by the number of resonances of each nuclide,
which is at most a few thousand in the current ENDF/B resonance data, even though
the resolved resonance regions of different nuclides can span from a few hundred eV
to a few MeV. For the background cross section, a second-order polynomial is used
for the approximation, unless otherwise noted. To account for the 1/v behavior of cross
sections at low energy, a smaller inner window size, currently 0.2 eV, is used for
energies below 10 eV.
In this study of inner and outer window size, two nuclides, U238 and U235, are
used, due both to their importance in nuclear reactor simulation and to their large
number of resonances, which poses a challenge to the current method.
To quantify the accuracy, a set of cross sections for these two nuclides with the
same energy grid as that from NJOY (with the same temperature) are prepared as
the reference, by broadening all the principal poles for the desired temperatures and
accumulating the 0 K contributions from all other poles. For each nuclide, the cross
sections are evaluated for a fixed set of equal-lethargy energy points (∆ξ = 0.0001)
between 1 eV and the upper bound of the resolved resonance of this nuclide using the
overlapping energy domains strategy with the specific inner and outer window size.
The root mean square (RMS) of the relative cross section error with respect to the
reference, the RMS of the relative error weighted by the absolute difference in cross
section, and the maximum relative error are used together as the error quantification.
Assuming the total number of cross section points evaluated is $N$, then they can be expressed as
\[
\sqrt{\frac{\sum_{i=1}^{N} (\mathrm{RelErr}_i)^2}{N}}, \qquad
\sqrt{\frac{\sum_{i=1}^{N} \mathrm{AbsErr}_i\,(\mathrm{RelErr}_i)^2}{N}}, \qquad \text{and} \qquad
\max_{i=1,\dots,N} \mathrm{RelErr}_i ,
\]
respectively. The second metric is introduced mainly to put less emphasis on cross
sections that are very small, since a low level of relative error is not usually necessary
for these cross sections in real application.
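For concreteness, the three metrics can be computed as in the following short C sketch (array names are illustrative):

    #include <math.h>

    /* rel[] and abserr[] hold the relative and absolute errors at the N points. */
    void error_metrics(const double *rel, const double *abserr, int N,
                       double *rms, double *weighted_rms, double *max_rel)
    {
        double s = 0.0, sw = 0.0, m = 0.0;
        for (int i = 0; i < N; i++) {
            s  += rel[i] * rel[i];
            sw += abserr[i] * rel[i] * rel[i];   /* weighted by |absolute difference| */
            if (fabs(rel[i]) > m) m = fabs(rel[i]);
        }
        *rms = sqrt(s / N);
        *weighted_rms = sqrt(sw / N);
        *max_rel = m;
    }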
3.3.1 Outer window size
From the analysis of 3.2.1, the Doppler effect of a resonance (with pole p at energy
ER ) to an energy point E0 depends mainly on the magnitude of the corresponding
input argument to the Faddeeva function, which in this case is
\[
\frac{\sqrt{E_0} - p^*}{\Delta_m} \approx \frac{\sqrt{E_0} - \sqrt{E_R}}{\sqrt{kT/\mathrm{awri}}}
\;\propto\; \frac{\Delta \sqrt{E}}{\sqrt{T}}
\;\propto\; \frac{\Delta E}{\sqrt{ET}} , \tag{3.21}
\]
This suggests that, everything else being the same, for a given nuclide, the outer
window size (in the energy domain) should increase proportionally to √(ET) to ensure that
all those poles with significant Doppler effect are taken into account. However, as
pointed out before, as the outer window size increases, the contribution from the poles
outside the window becomes very small, and the varying contribution due to Doppler
broadening may become negligible. Therefore, the exact functional dependence on
energy or temperature may be of a different form. In the remaining part of this
section, the energy effect alone will first be studied, followed by the temperature
effect.
Energy Dependence
Since the cross section at an energy point is mostly affected by nearby resonances,
the average spacing of the resonances also seems to be a good choice for the characteristic
length of the outer window. To get an idea of the minimum size of the outer window,
a constant outer window size of twice the average resonance spacing is used to evaluate the total cross
section of U238 at 3000 K over the whole resolved resonance region. Fig. 3-6 shows the
relative error of the total cross section calculated with the overlapping energy domains
strategy compared with the corresponding reference cross section. It is clear that this
outer window size is large enough to achieve very good accuracy for energies up to
1 KeV, but the error increases significantly at higher energies, which indicates that a
larger outer window size is needed for this energy range.
To confirm that the large error at high energy indeed comes from the Doppler
effect of poles that should have been included in the outer window, the relative error
of the background cross section approximation is also analyzed.

Figure 3-6: Relative error of U238 total cross section at 3000 K calculated with a constant outer window size of twice the average resonance spacing.

The approximation is very accurate for energies up to 10 KeV, and then there is a jump in relative error, as
shown in Fig. 3-7. However, the maximum relative error in background cross section
here is only around 1%, which is much lower than that of the whole cross section.
Besides, a closer comparison of the absolute difference in the background cross section
and whole cross section confirms that the major reason for the large relative error in
the whole cross section is the Doppler effect of poles outside of the outer window.
It is speculated that the large error for the background cross section comes from
the fact that the contributions from some of the poles outside of the outer window
are not very smooth, since they may still be quite close to the inner window region
due to the small outer window size. In fact, increasing the order of polynomial does
not help reduce the error much in this case.
Figure 3-7: Relative error of the background cross section approximation for the U238 total cross section at 3000 K with a constant outer window size of twice the average resonance spacing.

To take into consideration the larger outer window size needed at higher energy,
and at the same time to control the speed at which the outer window size increases
with the energy, a logarithmic function on energy is chosen as the scaling factor for
the outer window size. Besides, it is found that an additional exponential factor can
help to fine tune the outer window size to achieve overall good accuracy. As a result,
the final form of the scaling factor, Dout , for determining the outer window size is
\[
D_{out} = \max\!\left(a \cdot b^{E/E_{ub}} \cdot \log_{10}(E),\; 1\right), \tag{3.22}
\]
where Eub is the upper bound of resolved resonance range, a and b are two parameters
to be determined for each nuclide. Specifically, a is directly related to the lower bound
of the outer window size at low energy, while b mainly affects the speed at which the
outer window size increases with energy. The major advantage of the scaling factor
in Eq. 3.22, compared to the √E dependence mentioned above, is that with
the former the outer window size at high energy is bounded to only a few times
that at low energy, whereas the latter can result in a difference of a few hundred times.
The procedure for setting these two parameters is as follows: for a specific nuclide,
a is first determined so that the cross sections in the low energy range meet the desired
accuracy criterion, and b is then determined by examining the cross section behavior
at high energy.
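As a small illustration, the scaling factor of Eq. 3.22 can be evaluated as below; how the factor is converted into an actual window width (e.g. by multiplying a characteristic length such as the average resonance spacing) is left to the caller, since the text does not spell that step out here. The function name and parameters are placeholders.

    #include <math.h>

    /* Eq. 3.22: scaling factor for the outer window size; a and b are the
     * per-nuclide parameters (e.g. a = 2, b = 1.5 for U238 at 3000 K). */
    static double outer_window_scale(double E, double E_ub, double a, double b)
    {
        double d = a * pow(b, E / E_ub) * log10(E);
        return d > 1.0 ? d : 1.0;      /* max(., 1) */
    }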
For U238, a study shows that a = 2 and b = 1.5 are appropriate to achieve a
maximum relative error of around 0.1% for most cross sections at 3000 K. The relative
errors for the total cross section are shown in Fig. 3-8. For comparison purposes, the
relative error of the background cross section approximation is also presented in Fig. 3-9.
In this case, a second-order polynomial is used for the background cross section
approximation, and the results demonstrate clearly that an appropriate outer window
size helps to smooth the background cross section inside the inner window.
Similar results are presented for U235 in Fig. 3-10, where a = 3 and b = 1.5.
Temperature Dependence
To study the temperature effect, the two parameters in Eq. 3.22 can be tuned for
different temperatures to achieve the same accuracy level. Table 3.1 shows the parameters
for different temperatures for U238 and U235 that achieve the same accuracy level as shown in 3.3.1.
Besides the error quantification, the average number of poles to be broadened over all the
energy points is also presented in the table, to reflect the performance at each temperature.
As shown in the tables, in general a and b both increase with increasing temperature, as does
the average number of poles. However, it is hard to extract a functional form that is appropriate for
both nuclides.
Figure 3-8: Relative error of U238 total cross section at 3000 K calculated with outer window size from Eq. 3.22.
Since a higher temperature usually means larger outer window size throughout
the energy regions considered, when preprocessing the background cross section, the
outer window size should be chosen such that the criterion for accuracy is satisfied
for the highest possible temperature in the desired application. This may result in
evaluating more poles for lower temperatures, thus degrading performance. As an
alternative, one may choose to preprocess different sets of background cross section
for different temperature ranges.
Figure 3-9: Relative error of background cross section approximation for U238 total cross section at 3000 K with outer window size from Eq. 3.22.

3.3.2 Inner window size

As noted before, the major impact of the inner window size is on the storage of the supplemental
information, which includes the polynomial coefficients of the background cross section
and the indices of poles to be broadened on the fly. In general, the storage size is
inversely proportional to the inner window size, since the larger the inner window,
the fewer inner windows there are. However, the storage size is also
related to the number of terms needed to achieve the desired level of accuracy for the
background cross section. Table 3.2 shows the impact of varying the inner window size
on the number of terms necessary to achieve the same level of accuracy for the total
cross section as that of the 3000 K data in Table 3.1, as well as the corresponding
storage for the background information. For the storage, each coefficient is assumed to
take 8 bytes, while each index takes 4 bytes.
On a separate note, as suggested in 3.3.1, the number of polynomial terms for
background cross section also depends on the outer window size, since it may affect
the smoothness of the background cross section inside the inner window.

Figure 3-10: Relative error of U235 total cross section at 3000 K calculated with outer window size from Eq. 3.22.

For example,
with the same set of parameters for the outer window size in Table 3.1b and the same inner
window size for U235, reducing the number of terms for the background cross section can
have a different impact on accuracy at different temperatures. For 3000 K, if the number of terms is reduced to
two, the relative error of the background cross section is still around 0.1%–0.2%,
but for 300 K the relative error can reach as high as 1%, which causes the maximum
relative error of the whole cross section to also be around 1%. This difference mainly
comes from the fact that the outer windows for 300 K are smaller than those for 3000
K.
As to the effect of increasing inner window size on performance, given that the
minimum number of poles to be broadened at low energy for both U235 and U238
is around 5 − 6, it is anticipated that the maximum performance degradation from
doubling the inner window size will be below 20%, since on average only one more pole
needs to be broadened for each cross section evaluation. The detailed performance analysis will be discussed in Chapter 4.

Table 3.1: Information related to outer window size for U238 and U235 total cross section at different temperatures

(a) U238

T (K)   a     b     Avg. number   Max. number   RMS of       Weighted RMS    Max.
                    of poles      of poles      Rel. Err.    of Rel. Err.    Rel. Err.
300     1.2   1     6.85          17            2.4×10^-4    8.8×10^-5       2.0×10^-3
1000    1.5   1.2   8.56          23            2.6×10^-4    8.9×10^-5       2.1×10^-3
2000    1.8   1.4   10.08         28            2.4×10^-4    9.0×10^-5       1.8×10^-3
3000    2     1.5   11.18         32            2.6×10^-4    8.7×10^-5       1.4×10^-3

(b) U235

T (K)   a     b     Avg. number   Max. number   RMS of       Weighted RMS    Max.
                    of poles      of poles      Rel. Err.    of Rel. Err.    Rel. Err.
300     1.5   1.2   7.70          19            3.5×10^-4    5.1×10^-4       2.9×10^-3
1000    2.2   1.5   11.07         31            3.3×10^-4    4.7×10^-4       3.3×10^-3
2000    2.6   1.5   12.94         37            3.8×10^-4    5.3×10^-4       2.3×10^-3
3000    3     1.5   14.73         42            3.5×10^-4    5.2×10^-4       3.0×10^-3
In summary, for the temperature range that is of interest to reactor simulations
(300 − 3000 K), to achieve an overall accuracy level of 0.1% for the cross sections,
around 5 to 42 poles need to be broadened on the fly for both U238 and U235,
depending on the energy at which the cross section is evaluated. The storage size for
the related data is around 300 - 500 KB for both nuclides, which includes the poles,
the associated residues and angular momentum numbers, and background information
for each of three reaction types (total, capture and fission). By contrast, for U238
and U235, the point-wise cross section data can take on the order of 10 MB just for
a single temperature. For the performance analysis in Chapter 4 and 5, the outer
window size corresponding to the entries of 3000 K in Table 3.1, and the inner window
size of one average resonance spacing will be used for both U238 and U235, unless
otherwise noted.

Table 3.2: Number of terms needed for background cross section and the corresponding storage size for various inner window sizes (multiples of the average resonance spacing).

Inner window size   U238: Number of terms   U238: Storage size (KB)   U235: Number of terms   U235: Storage size (KB)
One                 3                       105.8                     3                       100.8
Two                 4                       67.0                      3                       51.1
Three               4                       45.3                      3                       34.5
Four                5                       41.3                      3                       26.3
3.4 Implementation of the Faddeeva function
To evaluate a cross section on the fly, the poles within both the outer and inner
windows have to be broadened, and each of them requires a Faddeeva function
evaluation. It is anticipated that the Faddeeva function evaluation will be one performance bottleneck of the approximate multipole method, and a preliminary performance
study confirms this. Therefore, the efficiency of evaluating the Faddeeva function is
a very important part of the current method and needs to be studied in detail.
First, the Faddeeva function has a set of symmetric properties that can be utilized
to simplify the implementation [27]
\[
W(z^*) = W^*(-z), \qquad W(-z) = 2e^{-z^2} - W(z).
\]
As a result, when implementing the Faddeeva function, only the part in the first
quadrant of the complex domain is essential, while the part in other quadrants can
be derived directly from the first quadrant.
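As an illustration, the reduction to the first quadrant can be written as the small C sketch below; w_first_quadrant() stands for whatever first-quadrant kernel (series, table, or asymptotic form) is actually used and is an assumed helper, not taken from the thesis. In the cross section application, the arguments in practice stay in the upper half plane, so the lower-half branch is shown only for completeness.

    #include <complex.h>

    extern double complex w_first_quadrant(double complex z);  /* assumed kernel */

    /* Map an arbitrary argument onto the first quadrant using the two
     * symmetry relations quoted above. */
    double complex faddeeva_any(double complex z)
    {
        if (cimag(z) < 0.0)                 /* W(z) = 2 exp(-z^2) - W(-z)  */
            return 2.0 * cexp(-z * z) - faddeeva_any(-z);
        if (creal(z) < 0.0)                 /* W(z) = conj(W(-conj(z)))    */
            return conj(w_first_quadrant(-conj(z)));
        return w_first_quadrant(z);
    }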
In the WHOPPER code, a subroutine for the Faddeeva function (W) was implemented
which uses an iterative series expansion method with a stopping criterion of 1.0 × 10^-6
for |z| < 6, and otherwise uses asymptotic expressions derived from various low-order
Gauss-Hermite quadratures, depending on the magnitude of the input argument.
Another algorithm that has been demonstrated to be very efficient is the QUICKW routine
used in the MC²-2 code [26]. For small |z|, it uses a six-point bivariate interpolation
of a pre-calculated table of Faddeeva function values with a rectangular grid on the
complex domain. For the grid shown in Fig. 3-11, the interpolation scheme can be
expressed as
\[
\begin{aligned}
f(x_0 + ph,\, y_0 + qh) ={} & \frac{q(q-1)}{2} f(x_0, y_0 - h) + \frac{p(p-1)}{2} f(x_0 - h, y_0) \\
& + (1 + pq - p^2 - q^2)\, f(x_0, y_0) + \frac{p(p - 2q + 1)}{2} f(x_0 + h, y_0) \\
& + \frac{q(q - 2p + 1)}{2} f(x_0, y_0 + h) + pq\, f(x_0 + h, y_0 + h) + O(h^3) .
\end{aligned} \tag{3.23}
\]

Figure 3-11: Demo for the six-point bivariate interpolation scheme (grid spacing h, interpolation point (x0 + ph, y0 + qh)).
For larger |z|, QUICKW uses asymptotic expressions similar to those in the WHOPPER implementation.
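For illustration, the interpolation step of Eq. 3.23 might be coded as follows; the table layout, bounds handling and names are assumptions for the sketch rather than the actual QUICKW implementation.

    #include <complex.h>

    /* f is an N x N table of pre-computed Faddeeva values with grid spacing h;
     * the interior point (x, y) is assumed to satisfy 1 <= i, j <= N - 2. */
    double complex interp6(const double complex *f, int N, double h,
                           double x, double y)
    {
        int    i = (int)(x / h), j = (int)(y / h);
        double p = x / h - i,    q = y / h - j;
    #define F(a, b) f[(i + (a)) * N + (j + (b))]
        return 0.5 * q * (q - 1.0)           * F(0, -1)
             + 0.5 * p * (p - 1.0)           * F(-1, 0)
             + (1.0 + p * q - p * p - q * q) * F(0, 0)
             + 0.5 * p * (p - 2.0 * q + 1.0) * F(1, 0)
             + 0.5 * q * (q - 2.0 * p + 1.0) * F(0, 1)
             + p * q                         * F(1, 1);
    #undef F
    }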
In the current work, both implementations have been examined for use when broadening the poles on the fly, and some modifications have been made to the original
implementations. Specifically, for the WHOPPER version of W, the region of series
expansion is changed to |z| < 4, and the stopping criterion is relaxed to 1 × 10^-4 for
better performance; for the QUICKW version, the upper bounds of the real and imaginary
parts of z are both set to 4, the grid spacing is reduced to 0.05 for better accuracy, and
the symmetry properties of the Faddeeva function are added. Fig. 3-12 and Fig. 3-13
show the relative errors of the modified versions of both implementations against the
scipy.special.wofz function [28], with values below 1 × 10^-4 filtered out. As shown in
the figures, both implementations are in general accurate to within 0.1%, which is
good enough for the desired accuracy level of the cross sections.
A performance study shows that the modified W version is about 4.5 times slower
than the modified QUICKW version for one billion evaluations with random input
arguments satisfying Re(z) < 4, Im(z) < 4 (since points inside this region are
expected to take longer to evaluate). The test was run on a 2 GHz Intel CPU, and
the total runtimes for the two implementations were 20.83 s and 4.55 s, respectively.
Figure 3-12: Relative error of modified W against scipy.special.wofz. (a) Real part; (b) Imaginary part.

Figure 3-13: Relative error of modified QUICKW against scipy.special.wofz. (a) Real part; (b) Imaginary part.
Chapter 4
Implementation and Performance
Analysis on CPU
This chapter describes the details of the implementation of the approximate multipole
method on CPU. Its performance is then analyzed on a few test cases and compared
to that of some reference methods. The scalability of the method is also studied and
presented.
4.1 Implementation
To use the approximate multipole method for cross section evaluation, a library
that provides the necessary data for each nuclide first needs to be generated. The main
data needed are the poles and the associated residues and angular momentum numbers,
the background information along with the indices of poles to be broadened, as well as some
nuclide properties such as the channel radius and atomic weight. The poles and
residues can be obtained by running the WHOPPER code for each nuclide. After
that, Python scripts were used to process the poles, generate the background cross
section information and lump the data together into a binary file. This binary file serves
as the library for the approximate multipole method.
The main code for the approximate multipole method is written in C. Therefore,
once the input file is read, the data has to be stored into C-compatible data structures.
In general, the data is stored nuclide by nuclide, and the code snippet below shows
the major data structures used.
Code 4.1: Data structure for approximate multipole method

    // data related to each pole
    typedef struct {
        double   pole[2];   // complex pole (real, imaginary)
        double   resi[6];   // residues: (total, fission, capture)
        int32_t  l;         // angular momentum number
    } pole_residue;

    // background information for each interval
    typedef struct {
        // the following struct stores the coefficients for the background
        // cross section, as well as indices to the poles to be broadened
        // NUM_COEF:   number of coefficients for the background cross section
        // entries[3]: total, fission, capture
        struct { int32_t ind[2]; double coefs[NUM_COEF]; } entries[3];
    } bkgrd_entry;

    // all information for a nuclide
    typedef struct {
        isoprop       props;     // nuclide properties
        int32_t       Nbkg;      // number of bkgrd_entry
        int32_t       Nprs;      // number of poles (and residues)
        bkgrd_entry  *bkgrds;    // array of bkgrd_entry
        pole_residue *prs;       // array of poles (and residues)
    } nucdata;
During the course of the thesis work, a different data structure for the background
information was first used. In this scheme, the background information of all nuclides
for the same energy are lumped together. The advantage of this scheme lies in the fact
that during Monte Carlo simulation, the cross sections of all (or some of) the nuclides
are usually needed for the same energy. Therefore, by arranging the background
information according to energy, the data to be fetched for a certain energy are close
to each other and this improves cache efficiency. This was confirmed by performance
results. In fact, it is around 15% faster than the alternative strategy. However, this
strategy has a few significant drawbacks. First, the inner window size of all nuclides
must be the same to be aligned, which leads to either very large storage size, if a small
inner window size is chosen, or significant performance degradation for some nuclides,
if a large inner window size is chosen. Second, the resolved resonance region varies a
lot among nuclides, with some as high as a few MeV. This results in either keeping lots
of unnecessary information for some nuclides with small resolved resonance region,
or both additional positional information and searching overhead. In addition, this
scheme is not very flexible when preparing the cross section library. As a result, it
was replaced by the current data structure.
With the per-nuclide data structure shown above, the algorithm to evaluate the
cross sections for all nuclides at a given energy, a major component in the Monte
Carlo reactor simulation, is presented in Algorithm 2.
Algorithm 2 Cross section evaluation with approximate multipole method.
  for each nuclide do
    get the index to bkgrds
    calculate the background cross section
    for each pole to be broadened do
      calculate phase shift (φl)
      evaluate Faddeeva function
      accumulate the contribution to whole cross section
    end for
    whole cross section = background + contribution from poles
  end for
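A condensed C sketch of this loop for the total cross section alone is given below, using the structures of Code 4.1. It is only an illustration of the idea, not the actual implementation: the helpers faddeeva() and phase_shift(), the isoprop fields e_min and win_size, the residue layout, and the half-open index convention are all assumptions.

    #include <complex.h>
    #include <math.h>

    extern double complex faddeeva(double complex z);                   /* assumed helper */
    extern double phase_shift(int32_t l, double E, const nucdata *nuc); /* assumed helper */

    double total_xs(const nucdata *nuc, double E, double dm /* Doppler width */)
    {
        /* direct index fetch into the background entries (equal-sized inner
         * windows; e_min and win_size are assumed fields of isoprop) */
        int iw = (int)((E - nuc->props.e_min) / nuc->props.win_size);
        const bkgrd_entry *bkg = &nuc->bkgrds[iw];

        /* polynomial background cross section (total = entries[0]) */
        double sigma = 0.0;
        for (int k = NUM_COEF - 1; k >= 0; k--)
            sigma = sigma * E + bkg->entries[0].coefs[k];

        /* broaden the local poles per Eq. 3.20 (index range assumed half-open) */
        for (int i = bkg->entries[0].ind[0]; i < bkg->entries[0].ind[1]; i++) {
            const pole_residue *pr = &nuc->prs[i];
            double complex p  = pr->pole[0] + I * pr->pole[1];
            double complex R  = pr->resi[0] + I * pr->resi[1];  /* assumed total residue */
            double complex z0 = (sqrt(E) - conj(p)) / dm;       /* assumed form of z0 */
            double complex ph = cexp(-2.0 * I * phase_shift(pr->l, E, nuc));
            sigma += (1.0 / E) * (sqrt(M_PI) / dm) * creal(ph * R * faddeeva(z0));
        }
        return sigma;
    }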
4.2 Test setup
To test the performance of the approximate multipole method against the commonly
used table lookup method, as well as Cullen's method, two test cases are set up.
Three nuclides, U238, U235 and Gd155, are chosen and replicated to represent the
300 nuclides usually used in nuclear reactor simulations. The reasons for choosing these
three nuclides are: 1) they are important nuclides in a nuclear reactor; 2) their diversity
in resolved resonance region is representative of most nuclides under consideration,
as shown in Table 4.1.
Table 4.1: Resonance information of U235, U238 and Gd155

Nuclide   Upper bound of resolved   Number of    Average spacing   Number of energy grid
          resonance region          resonances   of resonances     points at 300 K (NJOY)
U235      2.25 KeV                  3193         ~0.7 eV           76075
U238      20.0 KeV                  3343         ~6 eV             589126
Gd155     183 eV                    92           ~0.5 eV           12704
4.2.1 Table lookup
In this method, a few sets of cross section libraries are first generated from NJOY for
some specified reference temperatures. For each nuclide at a single temperature point,
since the energy grid for the different reaction types is the same with the default NJOY
settings, there is one 1-D array of energies and one 2-D array of cross sections, where
the first dimension is energy and the second is reaction type (three in total). The
evaluation of a cross section is a two-step linear interpolation in both energy
and temperature, shown below for a given energy E at a given temperature T:
\[
\sigma_{T_1} = \sigma_{E_1^{(1)},T_1} + \frac{\sigma_{E_2^{(1)},T_1} - \sigma_{E_1^{(1)},T_1}}{E_2^{(1)} - E_1^{(1)}} \left(E - E_1^{(1)}\right)
\]
\[
\sigma_{T_2} = \sigma_{E_1^{(2)},T_2} + \frac{\sigma_{E_2^{(2)},T_2} - \sigma_{E_1^{(2)},T_2}}{E_2^{(2)} - E_1^{(2)}} \left(E - E_1^{(2)}\right)
\]
\[
\sigma_E = \sigma_{T_1} + \frac{\sigma_{T_2} - \sigma_{T_1}}{T_2 - T_1} (T - T_1)
\]
where $T_1$ and $T_2$ are the temperature points bracketing $T$, and $E_{1,2}^{(i)}$ represent the energy points bracketing $E$ for temperature $T_i$.
At each temperature, one binary search is needed to find the energy interval for the
given energy; as a result, there are in general two binary searches associated with one
cross section evaluation (note that since the three reaction types share the same energy
grid, a single binary search is enough to get all of them). If a unionized energy
grid is used, then a direct index fetch can be used instead, which is more efficient
than a binary search, but this would increase the size of the data set tremendously and
thus is not considered in this study.
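A minimal C sketch of this two-step lookup (binary search in energy at each bracketing temperature, then linear interpolation in temperature) is shown below; array names and the bracketing convention are illustrative, not the test code.

    #include <stddef.h>

    /* index i such that egrid[i] <= E < egrid[i+1] */
    static size_t bsearch_energy(const double *egrid, size_t n, double E)
    {
        size_t lo = 0, hi = n - 1;
        while (hi - lo > 1) {
            size_t mid = (lo + hi) / 2;
            if (egrid[mid] <= E) lo = mid; else hi = mid;
        }
        return lo;
    }

    static double interp_energy(const double *egrid, const double *xs, size_t n, double E)
    {
        size_t i = bsearch_energy(egrid, n, E);
        double f = (E - egrid[i]) / (egrid[i + 1] - egrid[i]);
        return xs[i] + f * (xs[i + 1] - xs[i]);
    }

    /* Two temperature points T1 < T2 bracketing T, each with its own grid. */
    static double table_lookup_xs(const double *eg1, const double *xs1, size_t n1,
                                  const double *eg2, const double *xs2, size_t n2,
                                  double T1, double T2, double T, double E)
    {
        double s1 = interp_energy(eg1, xs1, n1, E);
        double s2 = interp_energy(eg2, xs2, n2, E);
        return s1 + (s2 - s1) * (T - T1) / (T2 - T1);
    }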
In our case, the total amount of data for the three nuclides at 300 K is about
20.7 MB, and replicating it 100 times results in around 2 GB of memory. In order
to include the temperature effect, this set of 300 K cross sections is again replicated
to mimic the cross sections at different temperatures. As a result, the total memory
storage for this method is 2 GB times the number of temperature points. In
the following studies, only two temperature points, 300 K and 3000 K, are used, and
thus the total memory requirement is around 4 GB.
4.2.2 Cullen's method
For Cullen's method, a data structure similar to that of the table lookup is used for the 0 K
cross sections, except that the cross sections for each nuclide are now arranged in
three 1-D arrays instead of one 2-D array. Cullen's method is then used directly to
broaden the cross sections from 0 K to the specified temperature at the selected
energy. An in-house code that implements Cullen's method was developed and is
used for the performance tests. The total amount of data in this case is 5.7 GB for 300
nuclides. This is larger than that at 300 K because the 0 K cross sections have a denser
energy grid.
4.2.3 Approximate multipole method
The implementation detailed in 4.1 is used for the approximate multipole method.
The number of poles varies among nuclides, and each pole has three pairs of residues
corresponding to three reaction types and one angular momentum number. The total
size of these data is around 43 MB for 300 nuclides.
As to the background information, currently three coefficients are used for the
background cross section approximation for each cross section type, and a separate
pair of indices are used for each cross section type. The size of the inner window is
set to be the average spacing of resonances for each nuclide. As a result, the total
size related to background information is around 62 MB.
During the evaluation of cross sections, one Faddeeva function evaluation is needed
to broaden each pole, and both of the two implementations of the Faddeeva function
discussed in Chapter 3 are used in the tests. For the modified QUICKW version, a
82 × 82 table of complex numbers is needed, which amounts to about 105 KB of data.
Therefore, the total memory needed for the approximate multipole method is
approximately 105 MB for 300 nuclides, much less than the other two methods.
4.3 Test one
Test one is mainly set up to examine the speed of cross section evaluation for the
approximate multipole method, without dealing with other parts of Monte Carlo
simulation. In order to both cover the resolved resonance region of most nuclides and
to simulate the neutron slowing down behavior in a real nuclear reactor setting, 300
equal lethargy points are chosen for energies between 19 KeV and 1 eV. For each
energy, the total cross sections of all nuclides are evaluated for a random temperature
between 300 K and 3000 K, and the nuclide that the neutron will "interact with" is
chosen according to the total cross sections (although this information is not used
at all in the current test). The method for evaluating the total cross section is selected from
the three different methods mentioned above, i.e., table lookup, Cullen's method
and the approximate multipole method, in order to compare performance. The average
run time to evaluate the cross sections of 300 nuclides over 300 energy points is
recorded. This value is then averaged over 100 different runs and is used as the
performance metric. One thing to note is that for all methods, a total cross section is only
evaluated when the incident neutron energy is within the nuclide's resolved resonance
region; otherwise a constant value (three barns) is provided.
The main hardware system used for the tests in this chapter is a server cluster
consisting of nodes with two six-core Intel Xeon E5-2620 CPUs of clock speed 2 GHz
and 24GB of RAM. The L1, L2, and L3 cache sizes are 15 KB, 256 KB and 15 MB,
respectively.
4.3.1 Serial performance
The run times for the serial version of test one are shown in Table 4.2. Also presented
in the table is some other performance-related information, such as instructions per
cycle and the L1 data cache miss rate, obtained from running "perf", a commonly used
performance counter profiling tool on Linux.
Table 4.2: Performance results of the serial version of different methods for test one.

                          Table lookup   Cullen's method   Multipole            Multipole
                                                           (Modified QUICKW)    (Modified W)
Run time (µs)             156            4.55 × 10^4       231                  717
Instructions per cycle    0.60           1.04              1.02                 0.86
L1 cache miss rate        59.5%          0.17%             3.5%                 1.6%
From the table, it is obvious that Cullen's method is much slower than the others,
by nearly two orders of magnitude; therefore, it will not be considered in the following
performance analysis. The table lookup method is the fastest, but it
suffers from a high cache miss rate, which can become a bottleneck on nodes with
large numbers of cores. The approximate multipole method in general is only a few
times slower than table lookup, and with the modified QUICKW version of the
Faddeeva function, it is only about 50% slower. As a matter of fact, test one was
also run on a different computer node, which has two quad-core Intel Xeon E5620
CPUs with a clock speed of 2.4 GHz and 16 GB of RAM, with L1, L2 and L3 cache sizes
of 12 KB, 256 KB and 12 MB, respectively. This time, the approximate multipole
method with modified QUICKW is even faster than the table lookup method by a few
percent. Clearly, the faster CPU speed and the smaller caches favor the approximate
multipole method. In addition, memory access is not much of a problem for the
approximate multipole method, which makes it potentially more desirable for large-scale
deployment. Since the modified QUICKW for the Faddeeva function is much faster
than the modified W, it will be chosen as the default Faddeeva function implementation
henceforth.
To get a better understanding of the performance of the approximate multipole
method, Table 4.3 shows a rough runtime breakdown for the major components in
Algorithm 2, with the modified QUICKW for Faddeeva function. These data are
also based on the output from “perf”. As can be seen from the table, the Faddeeva
function is a major hotspot of the code, taking more than a third of the total runtime.
In addition, the memory access time of poles and background information also takes
a large portion of the total runtime, as reflected by the time for “first access cache
miss”, that is, the cache miss caused by accessing the first piece of data associated
with each pole or each background entry. This should account for most of the cache
misses related to poles and background information, since the size of the data for each
pole and each background entry is either about the same as or smaller than the L1
cache line size (unless there is misalignment, and the resulting extra cache misses
would fall in the category of "others"). Finally, for the "others" category, the
major contributions come from the computations other than W(z0) in Eq. 3.20.
Table 4.3: Runtime breakdown of approximate multipole method with modified QUICKW.

Components                                                      Fraction of runtime
Faddeeva function                                               35.3%
First access cache miss for poles and background information    30.1%
Others                                                          34.6%
4.3.2 Revisiting the inner window size
As discussed in 3.3, the inner window size affects not only the storage size of the background
information, but also the overall performance of the approximate multipole
method. Here a parametric study of the inner window size is performed to examine
its effect on the performance as well as on the overall storage size of the background
information, with the results shown in Table 4.4. Note that for the storage size of the
background information, the numbers of coefficients needed for the different inner window
sizes listed in Table 3.2 are used. From the results in the table, it is clear that increasing the inner window size does not have much impact on the overall performance, while
it can significantly reduce the storage size of the background information. Nevertheless,
for all the remaining tests in this chapter and the next, the
inner window size is kept at one average resonance spacing for each nuclide.
Table 4.4: Performance and storage size of background information with varying inner window size.

Inner window size   Runtime (µs)   Storage size of background information (MB)
One                 231            61.7
Two                 236            35.4
Three               242            24.1
Four                251            20.4

4.3.3 Parallel performance
To test the parallel performance of each method, OpenMP [29] is used to parallelize
the work of evaluating the total cross section for all 300 nuclides at each energy point.
Essentially, each thread will be responsible for a certain number of nuclides, and since
all nuclides are replicated from those three nuclides, it is expected that there is no
load balancing issue with this parallel scheme (at least for a small number of threads).
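In outline, the parallelization over nuclides can be as simple as the following sketch; total_xs() stands for the per-nuclide evaluation routine (e.g. the earlier sketch), and this is not the actual test code.

    #include <omp.h>

    extern double total_xs(const nucdata *nuc, double E, double dm);

    /* Evaluate the total cross section of every nuclide at one energy point,
     * splitting the nuclide loop among OpenMP threads. */
    void eval_all_nuclides(const nucdata *nucs, int n_nuc, double E,
                           double dm, double *sigma_t)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n_nuc; i++)
            sigma_t[i] = total_xs(&nucs[i], E, dm);
    }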
Fig. 4-1 shows the results of the strong scaling study of both the table lookup method
and the approximate multipole method with modified QUICKW, where the number of
threads increases while the workload remains the same. At first glance, it seems
that neither of these two methods has good scalability. However, a closer look at
the figure shows that both of them achieve almost linear speedup when the
number of threads is very small. Therefore, it is speculated that the amount of work
is not large enough to hide the parallelization overhead of OpenMP, not that the
methods themselves are not inherently scalable.
Figure 4-1: Strong scaling study of table lookup and approximate multipole methods for 300 nuclides. The straight line represents perfect scalability.
In order to confirm this, two different test cases are run. The first one is again
a strong scaling study. However, this time the number of nuclides is increased to
3000 and the cross section data are correspondingly replicated, just to increase the
amount of work to be parallelized. Note, since the memory needed for the table
lookup method exceeds 24 GB for 3000 nuclides, a different computer node was used.
This new node has 128 GB of RAM, but except for that, it is identical to the other
nodes used. This time, the overhead of parallelization seems to be negligible and both
methods show a good scalability behavior (see Figure 4-2).
Figure 4-2: Strong scaling study of table lookup and approximate multipole methods for 3000 nuclides. The straight line represents perfect scalability.
The second test is a weak scaling study, where the number of nuclides, i.e., the
number of cross sections to evaluate at each energy point, is kept constant (300)
per thread, so that as the number of threads increases, the total number of
nuclides, and thus the workload, increases proportionally. For this study, the
percentage increase in runtime as a function of the number of threads is chosen as the
performance metric, and the results for both methods are shown in Fig. 4-3.
The average runtime increases as the number of threads goes up, as expected,
but the percentage increase in runtime for the approximate multipole method is
small compared to that of the table lookup method, indicating that the
scalability of the approximate multipole method is better.
Figure 4-3: Weak scaling study of table lookup and approximate multipole methods with 300 nuclides per thread.
Combining the results of the test cases above, it is clear that with enough workload,
both methods have very good scalability, and the approximate multipole method is
better since it shows better behavior in the weak scaling study. However, because
the total number of nuclides in real reactor simulation is limited to about 300, the
overhead from parallelizing the work to evaluate all the cross sections at a single
energy point may overwhelm the benefit. As a result, it may be better to do this
portion of work serially and parallelize the outer loop, which will be discussed in the
next section.
4.4 Test two
Most Monte Carlo codes are parallelized for different particle histories, or the outer
loop. In addition, as demonstrated in the previous section, the amount of work of
evaluating cross sections for 300 nuclides at each energy is not large enough to fully
benefit from parallelization. Therefore, in this section, the scalability features of both
methods will be studied where the outer loop is parallelized.
To this end, a simple Monte Carlo code is written to simulate a mono-energetic
neutron source with energy 19.9 KeV slowing down in a homogeneous material. The
material consists of 300 nuclides generated as before, as well as a pure scatterer with
20000 barns of scattering cross section. This scatterer is chosen such that the system
is essentially the same as a homogeneous mixture of four nuclides, U235, U238, Gd155
and H1, with number density ratio of 1:1:1:10, since the scattering cross section of
H1 is about 20 barns over most of energy range and the other three nuclides are each
replicated 100 times. The cut-off energy is set to 1 eV and only fission events are
tallied to get k_eff. Again, if the neutron energy is outside of the resolved resonance region of
a certain nuclide, a set of constant cross sections is provided, with the total cross section
being three barns and the other three major cross section types being one barn each.
The material temperature is set to 300 K. For the approximate multipole method,
the inner and outer window sizes appropriate for 3000 K are used, to reflect the fact
that the grid for the highest temperature must always be used (unless there are multiple sets
of them). For the table lookup method, it is ensured that a cross section evaluation
always involves interpolation between two temperatures yet the cross section is the
same as that corresponding to 300 K. Algorithm 3 shows the logic of the slowing
down code used for the test.

Algorithm 3 Neutron slowing down code on CPU
  for a number of neutrons to simulate do
    initialize a neutron
    while neutron energy is above the cut-off energy do
      evaluate the total cross section for all nuclides for the incident neutron energy
      based on the total cross section, determine the nuclide to react with, and the reaction type
      if captured then
        break
      else if fission then
        increment the fission events, break
      else
        change neutron energy according to the elastic scattering formula
      end if
    end while
  end for

First, to check the accuracy of the approximate multipole method, the slowing down code is run for one billion neutron histories with both the table lookup and approximate multipole methods. The values and standard deviations of k_eff for both methods are tabulated in Table 4.5. Also listed is the average time to run a whole neutron history, for performance comparison. Note that the large size of the problem of running one billion neutron histories requires a mixture of MPI [30] and OpenMP to distribute the work across different computer nodes, with each node using multiple (12)
threads. However, the runtime data in this table are obtained by running one million
neutron histories with a serial version of the code. The results for k_eff show very
good agreement between the two methods. In addition, the approximate multipole method is about 50% slower than table lookup, which is consistent with the
performance results from test one.
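For reference, history-level OpenMP parallelism of the kind used here can be sketched as below; run_one_history() stands for the body of Algorithm 3, the per-thread seeding is only indicative, and the MPI distribution across nodes is omitted.

    #include <omp.h>

    extern int run_one_history(unsigned long *seed);   /* returns 1 on fission, assumed helper */

    long tally_fissions(long n_histories)
    {
        long n_fission = 0;

        #pragma omp parallel reduction(+ : n_fission)
        {
            /* each thread gets its own RNG stream (illustrative seeding) */
            unsigned long seed = 1234u + 17u * (unsigned long)omp_get_thread_num();

            #pragma omp for schedule(static)
            for (long i = 0; i < n_histories; i++)
                n_fission += run_one_history(&seed);
        }
        return n_fission;
    }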
Table 4.5: k_eff (with standard deviation) and the average runtime per neutron history for both table lookup and approximate multipole methods.

Method                  k_eff (with standard deviation)   Average runtime per neutron history (ms)
Table lookup            0.764287 ± 0.000044               1.062
Approximate multipole   0.764289 ± 0.000044               1.598
For comparison of parallel performance, OpenMP is once again used to parallelize
the outer loop on neutron histories. For each case, one million neutron histories
are run for both methods. Fig. 4-4 shows the scalability of both methods. The
approximate multipole method exhibits a nearly perfect linear speedup, and it is
clearly better than table lookup.

Figure 4-4: OpenMP scalability of table lookup and multipole methods for neutron slowing down. The straight line represents perfect scalability.
4.5 Summary
As shown from the results of both tests, the approximate multipole method has little
computation overhead compared with the standard table lookup method, which in
general is less than 50%, and on some hardware systems, much less. The major
reason for this low computation overhead comes from the fact that there is much less
memory required and many fewer cache misses.
The storage requirement of the approximate multipole method for cross sections
in the resolved resonance range is only around 100 MB for 300 nuclides, which is much
less than that of the table lookup and Cullen's methods, and is also one to two
orders of magnitude less than that of the regression model and the explicit temperature treatment
method discussed in Chapter 2. In addition, increasing the inner window size in the
approximate multipole method can further reduce the storage without much loss in
efficiency.
From the scalability tests, it was found that the amount of work of evaluating
cross sections for 300 nuclides at one energy is not large enough to fully benefit
from parallelization for either the table lookup or the approximate multipole method.
However, if parallelized for different neutron histories, as are most Monte Carlo codes,
the approximate multipole method shows very good scalability, better than that of
table lookup, thus making it more desirable for massively parallel deployment.
Chapter 5
Implementation and Performance
Analysis on GPU
This chapter mainly describes the implementation and performance of the Monte
Carlo slowing down problem with the approximate multipole method on GPUs. Section 5.1 presents a simple version of the code. Section 5.2 discusses the performance
bottlenecks and the subsequent optimization efforts. Section 5.3 concludes the chapter with the final performance results and a discussion.
5.1 Test setup and initial implementation
Since the memory requirement of the table lookup method obviously exceeds what a GPU
can provide when there are multiple temperatures, only the approximate multipole
method is implemented on the GPU. In addition, the modified QUICKW is chosen as
the Faddeeva function implementation on the GPU since it is the fastest.
As shown in the previous chapter, the work of evaluating 300 nuclides is not large
enough to fully benefit from CPU parallelization; therefore, it is speculated that it may
not be entirely beneficial to offload this part of the work to the GPU either. As a matter
of fact, a simple implementation during the early stage of the work demonstrated
that the GPU version can achieve only about a threefold speedup over the
serial CPU version for the pure computation work. Moreover, there is also additional
time associated with transferring the evaluated cross sections back to the CPU and with the
kernel launch overhead, which can be as high as 20% of the pure computation time.
Therefore, there is not much advantage to offloading the cross section evaluation to the GPU
instead of doing it on the CPU, and thus this approach is abandoned.
As a result, an approach similar to test two in the previous chapter is taken,
that is, to run the whole Monte Carlo simulation on GPU, instead of just offloading
the evaluation of cross section to GPU. The remaining part of this chapter will focus
on this approach.
5.1.1 Implementation
To simulate the neutron slowing down process on GPU, a CUDA version of the slowing
down code is implemented, with the same setup as in the previous chapter for nuclides,
cut-off energy etc. However, the original algorithm (Algorithm 3 of Chapter 4) is no
longer suitable to run on GPU, mainly due to the expected high branch divergence
associated with the inner while loop on neutron termination. The branch divergence
mainly comes from the fact that some neutrons may be terminated very quickly, while
others may take very long. Consequently, for all the threads within a warp, the run
time is determined by the longest path, and this holds for every batch of neutrons
simulated. To avoid this apparent performance drawback, a new algorithm is
used, which allocates a roughly equal number of neutrons to each GPU thread, as shown
in Algorithm 4. In addition, due to the lock-step execution of GPU warps, all threads in a
warp always evaluate the cross sections for the same nuclide at the same time. Since
the cross section evaluation is the hotspot in the code, this feature helps to make the
behavior of the threads in a warp more uniform.
Algorithm 4 Neutron slowing down code on GPU
  set n_run = 0; initialize a neutron
  while n_run < neutron histories to run per thread do
    evaluate the total cross section for all nuclides for the incident neutron energy
    based on the total cross section, determine the nuclide to react with, and the reaction type
    if captured then
      increment n_run; initialize a new neutron
    else if fission then
      increment the fission events; increment n_run; initialize a new neutron
    else
      change neutron energy according to elastic scattering formula
      if neutron energy is below cut-off energy then
        increment n_run; initialize a new neutron
      end if
    end if
  end while

The same data structures as shown in Code 4.1 of Chapter 4 are used in the initial implementation of the GPU version of the slowing down code, since CUDA provides
the necessary support for them, although care must be taken to pass consistent
device pointers when initializing the data for the GPU. Since the sizes of the
poles and residues, as well as of the background information, are on the order of tens of
megabytes, both can only be stored in global memory. The per-nuclide
data, such as nuclide properties and pointers to poles and residues, are placed
in constant memory because they are read-only and small enough to fit into constant
memory.
As for the tabulated Faddeeva function values, since they are accessed in a fashion
that exhibits spatial locality, the texture memory should be a good fit, especially
since the table cannot fit into constant memory¹. However, currently CUDA only
supports texture memory for single precision floating point numbers. To use texture
memory for double precision floating point numbers, some additional measures have
to be taken. Specifically, the special CUDA data type of int2, which represents a
structure consisting of two 4-byte integers, is chosen for the texture memory. The
bits of a double precision number are cut into two halves, with the upper 32 bits
saved into the first integer of the int2 and the lower 32 bits into the second.
When referencing the double precision number, both integers are fetched, and
their bits are combined and converted back to the original double
precision number.

¹In fact, both texture memory and constant memory were tried for a smaller table (42×42), which can fit into constant memory, and it turned out that texture memory gave better performance.
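The bit manipulation itself can be illustrated in plain C as follows; the CUDA texture fetch and the int2 type are replaced here by a portable stand-in, so this is only a sketch of the idea, not the kernel code.

    #include <stdint.h>
    #include <string.h>

    typedef struct { uint32_t lo, hi; } packed_double;   /* stand-in for int2 */

    /* split a double into two 32-bit halves for storage */
    static packed_double pack(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        packed_double p = { (uint32_t)(bits & 0xFFFFFFFFu), (uint32_t)(bits >> 32) };
        return p;
    }

    /* reassemble the double after fetching the two halves */
    static double unpack(packed_double p)
    {
        uint64_t bits = ((uint64_t)p.hi << 32) | p.lo;
        double x;
        memcpy(&x, &bits, sizeof x);
        return x;
    }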
Last, the random library shipped with CUDA, cuRAND [1], is used as the parallel
random number generator in the kernel function. CuRAND uses the XORWOW
algorithm [31], which has a period of 2^192 − 2^32, and each thread in the kernel function
is initialized with a unique random number sequence.
5.1.2 Hardware specification
During the course of the thesis work, two different types of GPUs have been used,
Quadro 4000 and Tesla M2050. The former is resident on an in-house cluster, while
the latter is accessed through Amazon Elastic Compute Cloud. Table 5.1 lists some
specifications of both GPUs. Since both GPUs are of CUDA Compute Capability 2.0, one
thread can use at most 63 32-bit registers, and a maximum of 1536
threads, or 48 warps, can be scheduled on one SM at the same time.

Table 5.1: GPU specifications.

                                               Quadro 4000                  Tesla M2050
Number of SMs                                  8                            14
Number of processing cores                     256                          448
Single precision performance (peak, Gigaflops) 486.4                        1030
Double precision performance (peak, Gigaflops) 243.2                        515
Memory                                         2GB GDDR5                    3GB GDDR5
Memory interface                               256-bit                      384-bit
Memory bandwidth (GB/s)                        89.6                         148
Shared memory per SM                           16KB / 48KB (configurable)   16KB / 48KB (configurable)
Number of 32-bit registers                     32768                        32768
5.1.3 Results
The initial implementation was tested only on the Quadro 4000 card. After
compiling the CUDA code with the "-Xptxas=-v" flag for the nvcc compiler, the output
shows that the kernel function needs more than the 63 registers that are available
to a single GPU thread. As a result, the number of threads that can be scheduled
on an SM concurrently is lower than what the hardware can support. In fact, the
maximum number of threads that can be scheduled on an SM is 1536, as showed
above, but due to the limit in the total number of registers available for an SM, only
86
Table 5.1: GPU specifications.
Number of SMs
Number of processing cores
Single precision floating point
performance (peak, Gigaflops)
Double precision floating point
performance (peak, Gigaflops)
Memory
Memory Interface
Memory Bandwidth (GB/s)
Shared memory per SM
Number of 32-bit registers
Quadro 4000
8
256
486.4
Tesla M2050
14
448
1030
243.2
515
2GB GDDR5
256-bit
89.6
16KB
48KB
(Configurable)
32768
3GB GDDR5
384-bit
148
16KB
48KB
(Configurable)
32768
32768/63 ≈ 520 threads, or 16 warps, can actually be scheduled. This low level of
This low level of occupancy (33%) limits the ability of the GPU to switch amongst different warp contexts to hide latency, and thus may hurt the overall performance. An additional effect of the high register usage on performance is that some register contents have to be stored to local memory and loaded back when needed (known as "register spilling"). This can also hurt performance, since local memory (which resides in global memory) is much slower than registers in terms of both latency and bandwidth.
Other factors that can also affect the occupancy level are the maximum number of concurrent blocks that can be scheduled and the maximum size of shared memory available on an SM. In the initial implementation there is not much use of shared memory; therefore, the shared memory limit is not a problem. As to the number of blocks, the only limitation it imposes is that the number of threads per block must be between 64 and 512 to achieve the maximum level of occupancy allowed by the register usage, that is, 16 concurrent warps on an SM.
The kernel is timed using the event synchronization provided by CUDA. With 256 blocks and 64 threads per block for one million neutron histories, the average runtime for one neutron history is around 0.432 ms, which represents a speedup of about 3.7 times over the serial CPU version. In addition, since the bulk of the computation is now on the GPU and there is little interaction between the CPU and the GPU apart from initialization and finalization, the communication overhead is negligible.
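A minimal sketch of this timing approach is shown below; it is illustrative rather than the thesis code, and the kernel name and arguments are assumptions.

    /* Sketch only: timing a kernel launch with CUDA events. */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    slow_down_kernel<<<256, 64>>>(/* device pointers, RNG states, ... */);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                      /* wait for the kernel to finish */

    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start, stop);  /* elapsed time in milliseconds */

    cudaEventDestroy(start);
    cudaEventDestroy(stop);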
5.2 Optimization efforts

5.2.1 Profiling
The speedup of the initial implementation is not as good as desired, so some optimization is necessary to improve performance. The first step is to find the performance bottlenecks of the code. Nvidia ships profiling tools that can provide information on some of the key performance metrics, among them the standalone visual profiler nvvp [32]. Table 5.2 lists some performance-related metrics from nvvp for the initial implementation. Brief explanations of the listed items and their implications follow:
• The DRAM utilization is the fraction of the global memory bandwidth used by the code. The low level of DRAM utilization indicates that the memory bandwidth is severely underutilized.
• The global load efficiency is the ratio between the number of memory load transactions requested by the code and the global memory transactions actually issued, while the global memory replay overhead is the percentage of instruction issues due to replays of non-coalesced global memory accesses. The low global load efficiency and the high global memory replay overhead together suggest that there are many non-coalesced global memory accesses.
• The global cache replay overhead is similar to the global memory replay overhead, but it is caused by L1 cache misses.
• The local memory overhead represents the percentage of memory traffic caused by local memory accesses, which mainly come from register spilling.
• The branch divergence overhead is self-explanatory, and the result indicates that there are many divergent branches during the kernel execution.
Table 5.2: Performance statistics of the initial implementation on Quadro 4000 from
the Nvidia visual profiler nvvp.

  DRAM utilization                 12.1%
  Global load efficiency            1.3%
  Global memory replay overhead    42.2%
  Global cache replay overhead     14.8%
  Local memory overhead            11.4%
  Branch divergence overhead       67.5%
Since most cross section evaluations need to access anywhere between 5 and 42 poles and their associated residues, which is far more than the accesses to any other data, it is speculated that loading poles and residues is the main cause of the non-coalesced global memory accesses. As a matter of fact, although different threads within a warp always evaluate the cross sections of the same nuclide at the same time, the poles (as well as the residues and angular momentum numbers) loaded by different threads differ because of possible differences in neutron energy, and are very likely to be scattered in memory. In addition, the different number of poles to be broadened by different threads, again due to the energy differences, may also be one reason for the high branch divergence rate seen in the profiling results.
Another source of branch divergence may be the Faddeeva function evaluation, since depending on where in the complex domain the input argument falls, the function is evaluated in one of two ways, either by table lookup or by asymptotic expansion. Given that the neutron energies across different threads may differ, the input arguments of the Faddeeva function across different threads are very likely to differ as well and to fall in different evaluation regions, thus causing divergent branches.
The large number of registers needed by each thread is mainly due to the complexity of the kernel function, which is somewhat inevitable. As mentioned in the previous section, this not only leads to a low occupancy level on the SMs, but also causes register spilling and consequently local memory overhead.
Now that some of the performance bottlenecks have been identified, the remainder of this section discusses measures to avoid or mitigate these problems.
5.2.2 Floating point precision
For GPU computing, the first consideration for performance is almost always the floating point precision, since even for the most recent generation of Nvidia GPUs the peak theoretical throughput of single precision is still about twice that of double precision, not to mention the older generations where the double precision throughput is even lower. In addition, there are other benefits of moving from double to single precision. First, for the same number of variables in the kernel function, single precision requires fewer registers per thread; this reduces register pressure, which can increase the occupancy level and reduce or avoid register spilling. Second, using single precision numbers also reduces the data size and thus the memory bandwidth needed for loads and stores. Since in GPU programs the memory bandwidth (especially the global memory bandwidth) is often the major performance bottleneck, as in our case, this aspect of single precision is also very desirable.
To use single precision, one must make sure that there is no compromise in the accuracy of the program, or that the level of accuracy degradation is acceptable. In general, one common case where high precision floating point numbers are necessary is the subtraction of numbers close in value, which is exactly the situation in the original multipole method, where poles that are (nearly) symmetric with respect to the origin have opposite contributions over some energy ranges. However, with the approximate multipole method, since the contributions from all faraway poles have been preprocessed and only those from the localized poles need to be accumulated on the fly, this problem becomes minimal. Table 5.3 shows the k_eff from both the double precision and single precision versions of the slowing down code with 1×10^8 neutron histories. As in Chapter 4, these cases are run with MPI enabled to distribute the work to different compute nodes, each of which has one CUDA-enabled GPU installed. The runtime results, on the other hand, are obtained by running ten million neutron histories on one GPU card to avoid the possible overhead associated with MPI. The results confirm that the reduced floating point precision has little effect on accuracy. From the compiler output, using single precision does reduce the total number of registers needed by the kernel function; however, it is still above the limit of 63, and therefore, although the register spilling and thus the local memory overhead are mitigated, the occupancy level remains at the low level of 33%. The speedup from double precision to single precision can be as high as 40%, as suggested by Table 5.3.
In addition, for single precision arithmetic CUDA also provides optimized versions of some common mathematical functions, such as division and the trigonometric functions, which are faster but less accurate than the standard ones. As demonstrated in Table 5.3, the use of these fast mathematical functions in the single precision version does not compromise the accuracy; rather, it provides another performance increase of nearly 40%.
Table 5.3: k_eff (with standard deviation) and the average runtime per neutron history
for different cases.

                                     k_eff                 average runtime (ms)
  Double precision                   0.764256±0.000138     0.432
  Single precision w/o fast math     0.764264±0.000138     0.308
  Single precision w/ fast math      0.764274±0.000138     0.223
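The fast mathematical functions can be selected either globally through a compiler flag or per call through CUDA's single precision intrinsics. The following is a minimal sketch, illustrative rather than the thesis code; the function name is an assumption.

    /* Sketch only: fast single-precision math in CUDA.
       Compiling with "nvcc -use_fast_math ..." maps sinf/cosf/expf and single
       precision division to the fast intrinsics automatically; alternatively
       the intrinsics can be called explicitly: */
    __device__ float fast_ratio(float x, float y)
    {
        /* __sinf, __cosf, and __fdividef are faster but less accurate than
           the standard sinf, cosf, and floating point division. */
        return __fdividef(__sinf(x), __cosf(y));
    }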
5.2.3 Global memory efficiency
As discussed in Section 5.2.1, the difference in neutron energy may cause the threads within a warp to load different poles and residues from global memory, which results in non-coalesced memory accesses and low utilization of the global memory bandwidth. Although there is no easy way to avoid the energy differences, it is possible to mitigate the extent to which non-coalesced memory access occurs and to increase the memory bandwidth utilization. Presented below are two techniques that work well.
Data layout
The first technique is rearranging the data layout of the poles and residues to increase memory coalescing. Specifically, the original data structure for the per-nuclide struct "nucdata", as shown in Code 4.1 of Chapter 4, arranges the data for the poles and residues (as well as the angular momentum numbers l) in an "array of structures" (AoS) fashion. This strategy is efficient for CPUs: in the CPU code the poles are accessed one by one, and for each pole the data for the pole, the residues, and l are referenced consecutively, so grouping together the data associated with each pole works very well with the CPU cache. Additionally, for the parallel version of the CPU code, since all threads are essentially independent of each other during the slowing down process, and the cache is large enough that different threads can access different cache lines (or even different caches) without much contention, this strategy also works well. The AoS arrangement is sketched below for reference.
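The following sketch shows an AoS layout as it is described here; it is an assumption about the shape of Code 4.1, not a copy of it, and the field names are illustrative.

    /* Sketch only: an "array of structures" arrangement, where all data
       belonging to one pole (pole, three residues, angular momentum number)
       sit next to each other in memory. */
    typedef struct {
        myFloat pole_re, pole_im;        /* the pole                        */
        myFloat res_re[3], res_im[3];    /* residues of the three reactions */
        int32_t L;                       /* angular momentum number         */
    } pole_entry;

    /* The per-nuclide struct would then hold:  pole_entry *poles;  */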
For GPUs, however, things are quite different. The major difference comes from the fact that the on-chip cache of the GPU is too small for the number of threads, so cached data cannot persist very long. In our case, the additional pole and residue data fetched together with the first piece of data accessed may be flushed before they get used, and they have to be fetched again when needed. This is a huge waste of memory bandwidth, since for every 8 bytes (or 4 bytes in single precision) needed, 128 bytes are loaded, of which at most 16 bytes are useful (assuming no caching benefit). To avoid this underutilization of the global memory bandwidth, a different data structure, called a "structure of arrays" (SoA), is implemented for "nucdata" and shown in Code 5.1.
Code 5.1: Data structure for approximate multipole method on GPU

    // all information for a nuclide
    typedef struct {
        isoprop      props;    // nuclide properties
        int32_t      Nbkg;     // number of bkgrd entries
        int32_t      Nprs;     // number of poles (and residues)
        bkgrd_entry  *bkgrds;  // array of bkgrd entries
        // myFloat can be either float or double
        myFloat      *prs;     // array of poles and residues
        int32_t      *Ls;      // array of L associated with poles
    } nucdata;
In this new data structure, the poles and residues are organized into a single array of 8·Nprs floating point numbers, where the first Nprs numbers correspond to the real parts of all the poles and the second Nprs to the imaginary parts, while the remaining 6·Nprs numbers correspond to the real and imaginary parts of the three types of residues, arranged in a similar fashion as the poles. The angular momentum numbers are also organized as a separate array. This way, when accessing one data type (real part of the pole, imaginary part of the pole, etc.), all the threads within a warp are confined to memory locations of the same type, which increases the probability of different loads falling into the same memory segment, reducing the number of memory transactions and increasing the bandwidth utilization. As a matter of fact, with only this modification, a speedup of nearly 100% is observed for both the single precision and the double precision versions of the code on both GPU cards.
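As an illustration of how the blocked SoA array would be indexed, a small sketch follows; it reflects the layout described above rather than the thesis code, and the ordering of the residue blocks beyond the poles is an assumption.

    /* Sketch only: indexing the SoA array, where prs holds 8*Nprs values and
       the first two blocks are the real and imaginary parts of the poles. */
    __device__ myFloat pole_re(const nucdata *nuc, int i)
    {
        return nuc->prs[i];                  /* real parts occupy [0, Nprs)            */
    }

    __device__ myFloat pole_im(const nucdata *nuc, int i)
    {
        return nuc->prs[nuc->Nprs + i];      /* imaginary parts occupy [Nprs, 2*Nprs)  */
    }
    /* The six residue blocks are assumed to follow the same blocked pattern,
       so all threads of a warp reading the same component stay within one
       contiguous region, improving the chance that their loads fall in the
       same memory segment. */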
Data parallelism
One additional technique that can be exploited to increase the global memory efficiency is the data parallelism supported by GPUs. More concretely, instead of loading the necessary data for each pole one by one when needed, we can load all the data for a few poles at once, store them in shared memory, and then read them from shared memory as needed. Since the memory latency of shared memory is much lower than that of global memory, and its bandwidth much higher (see Table 2.1 of Chapter 2), there is not much overhead from the additional loads and stores associated with shared memory (although care must be taken to avoid bank conflicts). The advantages of this strategy are twofold. On one hand, since the instructions to load the poles and residues are independent of each other, they can all be issued at once, effectively hiding the latency of the global memory accesses. On the other hand, successive accesses to the same data type may be combined into the same memory transaction if they fall in the same memory segment, which essentially increases the memory coalescing.
Since the maximum size of shared memory available to an SM is 48 KB, the number of poles that can be loaded into shared memory is limited. To maintain the current occupancy, only four poles can be loaded at once in the single precision version and two in the double precision version. Varying levels of speedup are seen for the different floating point precisions and GPU cards, ranging from a few percent to some tens of percent. A sketch of this preloading scheme is given below.
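The following is a minimal sketch of the preloading idea; it is an assumption about the implementation, not the thesis code, and NPRELOAD, the per-thread slot layout, and the function name are illustrative (single precision case).

    /* Sketch only: staging a small batch of poles in shared memory before use.
       first and npoles are the thread-dependent range of localized poles for
       the current neutron energy; s_buf must be sized at launch time as
       2 * NPRELOAD * blockDim.x * sizeof(float). */
    #define NPRELOAD 4                      /* poles per batch */

    __device__ float broaden_poles(const nucdata nuc, int first, int npoles)
    {
        extern __shared__ float s_buf[];
        float *s_re = s_buf;                          /* real parts       */
        float *s_im = s_buf + blockDim.x * NPRELOAD;  /* imaginary parts  */
        int tid = threadIdx.x;
        float sigma = 0.0f;

        for (int base = first; base < first + npoles; base += NPRELOAD) {
            int n = min(NPRELOAD, first + npoles - base);
            /* Issue the independent global loads for the whole batch back to
               back, so their latencies overlap; each thread uses its own slots. */
            for (int k = 0; k < n; ++k) {
                s_re[k * blockDim.x + tid] = nuc.prs[base + k];
                s_im[k * blockDim.x + tid] = nuc.prs[nuc.Nprs + base + k];
            }
            for (int k = 0; k < n; ++k) {
                /* ... broaden pole (s_re[k*blockDim.x+tid], s_im[k*blockDim.x+tid])
                       and accumulate its contribution into sigma ... */
            }
        }
        return sigma;
    }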
5.2.4 Shared memory and register usage
As discussed above, the high register requirement of each thread not only lowers the occupancy, thus reducing the parallelism of the kernel and making it less capable of hiding latency, but also causes register spilling and hence the high local memory overhead. Short of completely restructuring the kernel and/or using different algorithms, there is no good way to reduce the number of temporary variables in the kernel.
One approach that may help reduce the number of registers, though, is to store some temporary variables in shared memory. Various parts of the code were therefore modified to place different temporary variables in shared memory, and the compiler output confirms that the number of spilled registers decreased. However, the performance became worse. In addition, since the required number of registers was always well above the hardware limit, whether for the double precision or the single precision version, the count never decreased to a level that could increase the occupancy. Therefore, this method was abandoned.
Without success in improving performance by reducing the register usage, the next thing tried was to reduce the local memory overhead directly. Given that local memory actually resides in global memory, and that the spilled register contents by default have very good access patterns, the only method left to reduce the overhead seems to be increasing the L1 cache size for global memory.
For both cards used in this work, there is a configurable on-chip memory of size 64 KB, which is shared by the L1 cache and the shared memory and can be divided into 16 KB + 48 KB segments. By default, the shared memory is configured to 48 KB. It was found that if this memory was configured in favor of the L1 cache, the performance could be improved on both cards when there was not much shared memory usage (for example, without pre-loading poles and residues into shared memory). However, with substantial shared memory usage, increasing the L1 cache would decrease the size of the shared memory and might hurt the overall performance. In fact, if the pre-loaded poles and residues were stored in registers (which would increase the extent of register spilling), and the cache was configured in favor of L1 to accommodate the spilled registers (as well as other global memory accesses), then for the Quadro 4000 card the performance actually improved, while for the Tesla M2050 card it degraded.
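A minimal sketch of how this split can be chosen is shown below; it is illustrative rather than the thesis code, and the kernel name is an assumption.

    /* Sketch only: choosing the 64 KB on-chip split between L1 cache and
       shared memory for a particular kernel. */
    cudaFuncSetCacheConfig(broaden_poles, cudaFuncCachePreferL1);     /* 48 KB L1 / 16 KB shared */
    /* or, when shared memory preloading is used: */
    cudaFuncSetCacheConfig(broaden_poles, cudaFuncCachePreferShared); /* 16 KB L1 / 48 KB shared */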
5.3 Results and discussion
With all the optimization efforts explored in the previous section, the final speedup of the CUDA version of the slowing down code compared with the corresponding serial CPU version is shown in Table 5.4. For the purpose of comparison, the same set of performance-related metrics is also listed in Table 5.5 for the optimized single precision version on the Quadro 4000.

Table 5.4: Speedup of GPU vs. serial CPU version on both GPU cards.

                        Quadro 4000    Tesla M2050
  Double precision      6.0            10.7
  Single precision      13.3           21.6
A comparison of Table 5.2 and Table 5.5 shows that all performance metrics except the local memory and branch divergence overheads have improved significantly, which means that the optimization efforts are quite successful. The increase in local memory overhead is anticipated, since in the Quadro 4000 version poles and residues are preloaded into registers, which may cause more register spilling.

Table 5.5: Performance statistics of the optimized single precision version on Quadro
4000 from nvvp.

  DRAM utilization                 39.0%
  Global load efficiency            3.5%
  Global memory replay overhead    10%
  Global cache replay overhead      3.0%
  Local memory overhead            25.8%
  Branch divergence overhead       75.4%
The increase in branch divergence overhead is speculated to also be caused by the additional operations associated with preloading the poles and residues. Some values in Table 5.5 are still not very good, such as the DRAM utilization and the global load efficiency, which suggests either that some aspects of the problem are inherently unsuited to GPU deployment, or that further optimization is needed. The main hurdles to the performance of the code can be summarized as follows:
1) Even though the new algorithm (Algorithm 4) for the slowing down process avoids some potential divergent branches, divergence is still present in the code due to the random nature of Monte Carlo. It arises mainly in two places: i) due to differences in incoming neutron energy, the number and sequence of poles to be broadened are different; ii) branches in the Faddeeva function, which is a hotspot of the code.
2) The non-uniform access to global memory discussed in Section 5.2.3 is still a big issue, which also mainly comes from the randomness of Monte Carlo.
3) The large number of registers required by the complex kernel function leads to low occupancy, which limits the GPU's ability to hide the memory latency that mainly comes from 2). In addition, the associated register spilling may still hurt the performance.
With the current performance results, it is almost certain that Monte Carlo methods cannot take full advantage of the massively parallel capability that GPUs potentially provide. In fact, the peak theoretical throughput of the CPU used in Chapter 4 is about 192 Gigaflops, which is about a fifth of that of the Tesla M2050 card using single precision floating point numbers. However, since the slowing down code achieves almost linear scalability on the CPU, there is only about a factor of two difference in performance between one such CPU and one Tesla M2050 card.
Chapter 6
Summary and Future Work
6.1 Summary
In this thesis, a new approach based on the multipole representation is proposed to address the issue of prohibitively large memory requirements for nuclear cross sections in Monte Carlo reactor simulations.
The multipole representation transforms the resonance parameters of each nuclide into a set of poles and residues, and then broadens these poles to obtain the cross section at any temperature. These poles show distinct differences in their contributions to the energy ranges of interest. Some of them have very smooth contributions and thus exhibit little Doppler effect, while others show fluctuating behavior that is localized. As a result, an overlapping energy domain strategy is proposed to reduce the number of poles that need to be broadened on the fly, which forms the basis for the new approximate multipole method. Specifically, the majority of the poles, those with a smooth contribution over an energy interval, are preprocessed so that their contributions can be approximated with low-order polynomials. Therefore, only a small number of locally fluctuating poles are left to be broadened on the fly. The parameters associated with this strategy, i.e., the outer and inner window sizes, and their effects on both the performance and the memory requirement are studied, and a set of values is recommended to achieve the desired level of accuracy while maintaining good efficiency. In general, for major nuclides such as U238 and U235, which have more than 3000 resonances, the number of poles to be broadened on the fly can be anywhere between 5 and around 42, depending on the energy at which the cross section needs to be evaluated.
The approximate multipole method is then implemented on the CPU and its performance is compared against that of the traditional table lookup method as well as Cullen's method. It was found that this new method has little computational overhead relative to the standard table lookup method, in general less than 50%, and on some hardware systems it is even faster. The major reason for this low computational overhead is that there is much less memory overhead from cache misses. In addition, the approximate multipole method also shows good scalability, better than that of the table lookup method, making it more desirable for massively parallel deployment.
The major advantage of the approximate multipole method is the large reduction in the memory footprint of the resolved resonance cross section data. From the data in Chapters 1, 2 and 4, it is clear that the new method can reduce the memory footprint by three orders of magnitude compared with the traditional method, and by one to two orders of magnitude compared with comparable techniques. This huge reduction in memory can play a significant role in high performance computing, where the number of cores continues to increase to the detriment of the available memory per core. In particular, this new method makes it possible to run Monte Carlo codes on GPUs for realistic reactor simulations, in order to utilize their massively parallel capability.
In this thesis, the approximate multipole method is also implemented on the GPU for a neutron slowing down problem in a homogeneous material consisting of 301 nuclides. Two different types of GPU cards are used to form a good understanding of the performance on the GPU. Through extensive optimization efforts, the GPU version achieves a speedup of up to 22 times compared with a serial CPU version. The main factors that contribute to this speedup are reduced floating point precision for higher throughput, faster but less accurate mathematical functions, a data structure that favors structure of arrays over array of structures for a better memory access pattern, and pre-loading data from global memory into the faster on-chip shared memory. The major performance bottleneck, on the other hand, comes from the randomness of the Monte Carlo method, which manifests itself in branch divergence and non-coalesced global memory accesses. In addition, the register pressure due to the complexity of the kernel also hurts the performance. With the current performance results, it is almost certain that Monte Carlo methods in reactor simulation cannot take full advantage of the massively parallel capability that GPUs can potentially provide.
6.2 Future work
To generate an entire library with the proposed approximate multipole method for direct cross section evaluation, work still remains.
First, the WHOPPER code used in this thesis to generate the poles works exclusively with resonance parameters in the Reich-Moore format. ENDF/B-VII currently has 50 nuclides in that format, 250 nuclides in the MLBW format, and 100 nuclides with no resonance file. A processing tool for the MLBW format already exists [33], but work remains on how to proceed with the remaining nuclides.
Second, in Chapter 3, a systematic way of determining the outer and inner window sizes (mainly the outer window size) is proposed, based both on the observed functional dependence of the Doppler broadening range on energy and temperature and on numerical investigation. Because the outer window size is not needed during cross section evaluation and the inner and outer windows are essentially decoupled, an alternative approach may be pursued. Specifically, after an inner window size is decided, the optimal outer window size of each separate inner window can be determined iteratively so as to achieve the specified accuracy level. This process can also be repeated to determine the optimal inner window size.
In addition, the efficiency of the approximate multipole method could also be improved by relaxing the accuracy criterion in regions of low cross section values. In fact, the cross sections in these regions usually demand the largest number of poles to be broadened on the fly, yet their impact on the overall neutronic behavior may be insignificant.
Last, the focus of this thesis is only on the resolved resonance region above 1 eV. To extend the approximate multipole method to the low energy region, the contribution from the correction term in Eqs. 3.10 and 3.11 needs to be taken into account, as well as the integration with low energy scattering treatments such as S(α, β) [34]. It is also worth exploring whether the multipole representation can be applied in the unresolved resonance region.
Appendix A
Whopper Input Files
The resonance parameters of the input files listed below all come from http://t2.lanl.gov/nis/data/endf/endfvii.1-n.html
A.1 U238
1
RESONANCES OF U238 ENDFB-VII.1
0
1
1
1
0
9.223800+4 2.360058+2
0
9.223800+4 1.000000+0
0
1.000000-5 2.000000+4
1
0.000000+0 9.480000-1
0
2.360058+2 0.000000+0
0
1
926
3 0.94800000
-4.405250+3 5.000000-1 1.393500+2
-4.133000+2 5.000000-1 5.215449-2
-3.933000+2 5.000000-1 4.993892-2
-3.733000+2 5.000000-1 4.764719-2
-3.533000+2 5.000000-1 4.527354-2
-3.333000+2 5.000000-1 4.281115-2
-3.133000+2 5.000000-1 4.025348-2
-2.933000+2 5.000000-1 3.759330-2
-2.733000+2 5.000000-1 2.551450-2
-2.533000+2 5.000000-1 2.397198-2
0
1
0
1
0
0
3
0
0
1
2
1
2
138
2.300000-2
2.300000-2
2.300000-2
2.300000-2
2.300000-2
2.300000-2
2.300000-2
2.300000-2
2.300000-2
2.300000-2
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
1
0
09237
09237
09237
29237
459237
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
2
2
2
2
2
-2.333000+2 5.000000-1 2.234626-2 2.300000-2
-2.133000+2 5.000000-1 2.062684-2 2.300000-2
-1.933000+2 5.000000-1 1.879962-2 2.300000-2
-1.733000+2 5.000000-1 1.685164-2 2.300000-2
-1.533000+2 5.000000-1 1.476751-2 2.300000-2
-1.333000+2 5.000000-1 1.253624-2 2.300000-2
-1.133000+2 5.000000-1 1.015824-2 2.300000-2
-9.330000+1 5.000000-1 7.658435-3 2.300000-2
-7.330000+1 5.000000-1 5.086118-3 2.300000-2
-5.330000+1 5.000000-1 2.932955-3 2.300000-2
-3.330000+1 5.000000-1 1.004548-2 2.300000-2
-7.000000+0 5.000000-1 1.685000-4 2.300000-2
6.673491+0 5.000000-1 1.475792-3 2.300000-2
2.087152+1 5.000000-1 1.009376-2 2.286379-2
... 895 resonance parameters omitted here
2.000895+4 5.000000-1 1.947255+0 2.300000-2
2.002445+4 5.000000-1 7.114126-1 2.300000-2
2.003658+4 5.000000-1 1.617239+0 2.300000-2
2.009290+4 5.000000-1 1.016504+0 2.300000-2
2.012000+4 5.000000-1 7.876800-2 2.300000-2
2.017500+4 5.000000-1 4.742300-1 2.300000-2
2.440525+4 5.000000-1 2.900960+2 2.300000-2
2.360058+2 0.000000+0
1
0
2
851
3 0.9480000
1.131374+1 5.000000-1 4.074040-7 2.300000-2
4.330947+1 5.000000-1 6.190440-7 2.300000-2
... 846 resonance parameters omitted here
1.988630+4 5.000000-1 1.328216-2 1.273902-2
1.988734+4 5.000000-1 5.155808-2 1.229675-2
2.000188+4 5.000000-1 1.355880-2 2.300000-2
1566
3 0.9480000
4.407476+0 1.500000+0 5.553415-8 2.300000-2
7.675288+0 1.500000+0 9.416455-9 2.300000-2
... 1561 resonance parameters omitted here
1.998370+4 1.500000+0 3.780048-2 2.677523-2
1.999255+4 1.500000+0 3.232645-3 2.300000-2
2.000376+4 1.500000+0 5.208511-2 2.300000-2
0.00001
20000.0
0.001
300.0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
2.010000-6
0.000000+0
0.000000+0
5.420000-8
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
9.990000-9
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
372
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
0.000000+0
459237 2
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
A.2 U235
1
RESONANCES OF U235 ENDFB-VII.1
0
1
1
1
0
0
1
0
1
1
0
9.223500+4 2.330248+2
0
2
0
09228
9.223500+4 1.000000+0
0
0
2
09228
1.000000-5 2.250000+3
1
3
1
09228
3.500000+0 9.602000-1
0
0
2
29228
2.330248+2 0.000000+0
0
0
12772
31939228
2
1449
3 0.96020000
-2.038300+3 3.000000+0 1.970300-2 3.379200-2-4.665200-2-1.008800-1
-1.812100+3 3.000000+0 8.574000-4 3.744500-2 7.361700-1-7.418700-1
-1.586200+3 3.000000+0 8.284500-3 3.443900-2 1.536500-1-9.918600-2
-1.357500+3 3.000000+0 5.078700-2 3.850600-2-1.691400-1-3.862200-1
-5.158800+2 3.000000+0 2.988400+0 3.803000-2-8.128500-1-8.180500-1
-7.476600+1 3.000000+0 3.837500-1 5.208500-2-8.644000-1-7.865200-1
-3.492800+0 3.000000+0 8.539000-8 3.779100-2-6.884400-3 1.297700-2
-1.504300+0 3.000000+0 8.533300-8 3.782800-2-7.039700-3 1.168600-2
-5.609800-1 3.000000+0 2.997400-4 2.085500-2 9.564400-2-1.183900-2
2.737933-1 3.000000+0 4.248600-6 4.620300-2 1.177100-1 3.484800-4
... 1431 resonance parameters omitted here
2.246138+3 3.000000+0 3.145300-3 3.820000-2 5.781200-2-8.390000-2
2.250300+3 3.000000+0 2.240500-2 6.842500-2-4.117100-1-1.227300-1
2.254200+3 3.000000+0 2.551800-2 9.486300-2 2.674300-2 4.203200-2
2.256200+3 3.000000+0 1.423000-2 4.937900-2 2.501300-2 3.631000-2
2.283800+3 3.000000+0 7.159000+0 9.988600-2 8.765300-1 4.688800-1
2.630400+3 3.000000+0 7.853400+0 4.516400-2 7.068000-1 5.364700-1
3.330800+3 3.000000+0 1.205700+1 4.722800-2 4.744200-1 5.712900-1
4.500900+3 3.000000+0 6.143900+0 3.368100-2 2.866200-1 3.641400-1
1744
3 0.96020000
-1.132100+3 4.000000+0 1.714400+0 3.979400-2 4.770100-1-4.693700-1
-7.223900+2 4.000000+0 2.503600+0 3.612200-2 7.749400-1-8.300900-1
-3.243600+2 4.000000+0 1.519600-1 3.893400-2 7.608300-1-7.751100-1
-3.360400+0 4.000000+0 5.427700-3 2.624000-2 1.784100-1-7.486200-2
-1.818200-1 4.000000+0 3.664300-6 2.058000-2 1.563300-1-2.970900-2
3.657500-5 4.000000+0 6.46080-11 4.000000-2-5.091200-4 9.353600-4
1.134232+0 4.000000+0 1.451900-5 3.855000-2 5.184600-5 1.284500-1
... 1728 resonance parameters omitted here
2.247883+3 4.000000+0 1.001200-2 3.820000-2 1.147400-1 1.332300-1
2.247927+3 4.000000+0 1.646500-2 3.820000-2-1.139700-1-7.628300-2
2.257300+3 4.000000+0 5.543700-2 2.692900-1 4.637500-2 2.524300-1
2
2
2
2
2
2.657600+3 4.000000+0 4.491400-1 5.531500-2 3.829100-2 1.621300-1
3.142000+3 4.000000+0 2.419400-2 4.651300-2-9.506100-2-6.489900-2
3.588700+3 4.000000+0 4.682100-2 3.930100-2-5.121800-2-2.476700-3
3.819200+3 4.000000+0 2.097300-1 3.849400-2-5.116600-1 6.770900-2
4.038900+3 4.000000+0 6.034300-2 3.869100-2-1.141800-1-7.143200-1
4.274700+3 4.000000+0 1.498700-2 3.725500-2 1.604500-2-1.079600-2
0.00001
20000.0
0.001
300.0
A.3 Gd155
1
RESONANCES OF Gd155 ENDFB-VII.1
0
1
0
1
0
0
1
6.415500+4 1.535920+2
1
0
6.415500+4 1.000000+0
1
0
1.000000-5 1.833000+2
1
3
1.500000+0 7.900000-1
0
0
1.535920+2 0.000000+0
0
0
2
37
3 0.79000000
2.008000+0 1.000000+0 3.706666-4 1.100000-1
3.616000+0 1.000000+0 4.400000-5 1.300000-1
... 32 resonance parameters omitted here
1.683000+2 1.000000+0 3.013333-2 1.098000-1
1.780000+2 1.000000+0 9.733333-3 1.098000-1
1.804000+2 1.000000+0 1.466667-2 1.098000-1
55
3 0.7900000
2.680000-2 2.000000+0 1.040000-4 1.080000-1
2.568000+0 2.000000+0 1.744000-3 1.110000-1
... 50 resonance parameters omitted here
1.714000+2 2.000000+0 9.200000-3 1.098000-1
1.735000+2 2.000000+0 3.280000-2 1.098000-1
1.756000+2 2.000000+0 2.080000-3 1.098000-1
0.00001
20000.0
0.001
300.0
0
1
1
1
1
2
492
1
0
46434
06434
06434
26434
456434
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
0.000000+0 0.000000+0
2
2
2
2
2
References
[1] NVIDIA Corp. CUDA Compute Unified Device Architecture Programming
Guide Version 5.5, 2013.
[2] W. Hwu. Lecture notes 4.1 of Heterogeneous Parallel Programming on Coursera.
https://www.coursera.org/course/hetero, 2012.
[3] R.E. MacFarlane and D.W. Muir. NJOY99.0 - Code System for Producing Pointwise and Multigroup Neutron and Photon Cross Sections from ENDF/B Data.
PSR-480/NJOY99.00, Los Alamos National Laboratory, 2000.
[4] T. H. Trumbull. Treatment of Nuclear Data for Transport Problems Containing
Detailed Temperature Distributions. Nucl. Tech., 156(1):75–86, 2006.
[5] G. Yesilyurt, W.R. Martin, and F.B. Brown. On-the-Fly Doppler Broadening
for Monte Carlo Codes. Nucl. Sci. and Eng., 171(3):239–257, 2012.
[6] T. Viitanen and J. Leppanen. Explicit Treatment of Thermal Motion in
Continuous-Energy Monte Carlo Tracking Routines. Nucl. Sci. and Eng.,
171(2):165–173, 2012.
[7] R.N. Hwang. A Rigorous Pole Representation of Multilevel Cross Sections and
Its Practical Applications. Nucl. Sci. and Eng., 96(3):192–209, 1987.
[8] B. Forget, S. Xu, and K. Smith. Direct Doppler Broadening in Monte Carlo
Simulations using the Multipole Representation. Submitted June 2013.
[9] D.E. Cullen and C.R. Weisbin. Exact Doppler Broadening of Tabulated Cross
Sections. Nucl. Sci. and Eng., 60:199–229, 1976.
[10] R. E. MacFarlane and A. C. Kahler. Methods for Processing ENDF/B-VII with NJOY. Nuclear Data Sheets, 111(12):2739–2890, 2010.
[11] F.B. Brown, W.R. Martin, G. Yesilyurt, and S. Wilderman. Progress with On-The-Fly Neutron Doppler Broadening in MCNP. Transactions of the American Nuclear Society, Vol. 106, June 2012.
[12] E. Woodcock et al. Techniques Used in the GEM Code for Monte Carlo Neutronics Calculations in Reactors and Other Systems of Complex Geometry. ANL-7050, Argonne National Laboratory, 1965.
[13] T. Viitanen and J. Leppanen. Explicit Temperature Treatment in Monte Carlo
Neutron Tracking Routines – First Results. In PHYSOR 2012, Knoxville, Tennessee, USA, April 2012. American Nuclear Society, LaGrange Park, IL.
[14] J. Duderstadt and L. Hamilton. Nuclear Reactor Analysis. John Wiley & Sons,
Inc, 1976.
[15] S. Li, K. Wang, and G. Yu. Research on Fast-Doppler-Broadening of Neutron
Cross Sections. In PHYSOR 2012, Knoxville, Tennessee, USA, April 2012. American Nuclear Society, LaGrange Park, IL.
[16] NVIDIA. CUDA Technology; http://www.nvidia.com/CUDA, 2007.
[17] A.G. Nelson. Monte Carlo Methods for Neutron Transport on Graphics Processing Units Using CUDA. Master's thesis, Pennsylvania State University, Department of Mechanical and Nuclear Engineering, December 2009.
[18] A. Heimlich, A.C.A. Mol, and C.M.N.A. Pereira. GPU-based Monte Carlo Simulation in Neutron Transport and Finite Differences Heat Equation. Prog. in Nucl. Energy, 53:229–239, 2011.
[19] T. Liu, A. Ding, W. Ji, and G. Xu. A Monte Carlo Neutron Transport Code
for Eigenvalue Calculations on a Dual-GPU System and CUDA Environment.
In PHYSOR 2012, Knoxville, Tennessee, USA, April 2012. American Nuclear
Society, LaGrange Park, IL.
[20] B. Yang, K. Lu, J. Liu, X. Wang, and C. Gong. GPU Accelerated Monte Carlo
Simulation of Deep Penetration Neutron Transport. In IEEE Intl Conf. on Parallel, Dist., and Grid Computing, Solan, India, 2012.
[21] D.G. Merrill. Allocation-oriented Algorithm Design with Application to GPU Computing. PhD thesis, University of Virginia, School of Engineering and Applied Science, December 2011.
[22] L.C. Leal. Brief Review of the R-Matrix Theory. http://ocw.mit.edu/courses/nuclear-engineering/22-106-neutron-interactions-and-applications-spring-2010/lecture-notes/MIT22_106S10_lec04b.pdf, 2010.
[23] G. de Saussure and R.B. Perez. POLLA: A Fortran Program to Convert R-Matrix-Type Multilevel Resonance Parameters into Equivalent Kapur-Peierls-Type Parameters. ORNL-2599, Oak Ridge National Laboratory, 1969.
[24] A. W. Solbrig. Doppler Broadening of Low-Energy Resonances. Nucl. Sci. and
Eng., 10:167–168, 1961.
[25] R.N. Hwang. An Extension of the Rigorous Pole Representation of Cross Sections
for Reactor Applications. Nucl. Sci. and Eng., 111:113–131, 1992.
[26] H. Henryson, B.J. Toppel, and C.G. Stenberg. MC2-2: A Code to Calculate Fast-Neutron Spectra and Multigroup Cross Sections. ANL-8144, Argonne National Laboratory, 1976.
[27] R.N. Hwang. Resonance Theory in Reactor Applications. In Y. Azmy and E. Sartori, editors, Nuclear Computational Science: A Century in Review, chapter 5,
page 235. Springer, 2010.
[28] SciPy v0.12 Reference Guide (DRAFT).
http://docs.scipy.org/doc/scipy/reference/generated/scipy.special.wofz.html.
[29] OpenMP. http://openmp.org/wp/.
[30] The Message Passing Interface (MPI) Standard. http://www.mcs.anl.gov/research/projects/mpi/.
[31] George Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14):1–6,
2003.
[32] CUDA Visual Profiler. http://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual-profiler.
[33] C. Jammes and R.N. Hwang. Conversion of Single- and Multilevel Breit-Wigner
Resonance Parameters to Pole Representation Parameters. Nucl. Sci. and Eng.,
134(1):37–49, 2000.
[34] K.H. Beckurts and K. Wirtz. Neutron Physics. Springer, Berlin, 1964.