On-the-fly Doppler Broadening using Multipole Representation for Monte Carlo Simulations on Heterogeneous Clusters by Sheng Xu B.S., Physics, Peking University (2010) Submitted to the Department of Nuclear Science and Engineering and the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degrees of Master of Science in Nuclear Science and Engineering and Master of Science in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 2013 c Massachusetts Institute of Technology 2013. All rights reserved. Signature of Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Department of Nuclear Science and Engineering and the Department of Electrical Engineering and Computer Science August 19, 2013 Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kord S. Smith KEPCO Professor of the Practice of Nuclear Science and Engineering Thesis Supervisor Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Benoit Forget Associate Professor of Nuclear Science and Engineering Thesis Supervisor Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Srini Devadas Webster Professor of Electrical Engineering and Computer Science Thesis Reader Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mujid S. Kazimi TEPCO Professor of Nuclear Engineering Chair, NSE Committee on Graduate Students Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Leslie A. Kolodziejski Professor of Electrical Engineering and Computer Science Chair, EECS Committee on Graduate Students 2 On-the-fly Doppler Broadening using Multipole Representation for Monte Carlo Simulations on Heterogeneous Clusters by Sheng Xu Submitted to the Department of Nuclear Science and Engineering and the Department of Electrical Engineering and Computer Science on August 19, 2013, in partial fulfillment of the requirements for the degrees of Master of Science in Nuclear Science and Engineering and Master of Science in Electrical Engineering and Computer Science Abstract In order to use Monte Carlo methods for reactor simulations beyond benchmark activities, the traditional way of preparing and using nuclear cross sections needs to be changed, since large datasets of cross sections at many temperatures are required to account for Doppler effects, which can impose an unacceptably high overhead in computer memory. In this thesis, a novel approach, based on the multipole representation, is proposed to reduce the memory footprint for the cross sections with little loss of efficiency. The multipole representation transforms resonance parameters into a set of poles only some of which exhibit resonant behavior. A strategy is introduced to preprocess the majority of the poles so that their contributions to the cross section over a small energy interval can be approximated with a low-order polynomial, while only a small number of poles are left to be broadened on the fly. 
This new approach can reduce the memory footprint of the cross sections by one to two orders over comparable techniques. In addition, it can provide accurate cross sections with an efficiency comparable to current methods: depending on the machines used, the speed of the new approach ranges from being faster than the latter, to being less than 50% slower. Moreover, it has better scalability features than the latter. The significant reduction in memory footprint makes it possible to deploy the Monte Carlo code for realistic reactor simulations on heterogeneous clusters with GPUs in order to utilize their massively parallel capability. In the thesis, a CUDA version of this new approach is implemented for a slowing down problem to examine its potential performance on GPUs. Through some extensive optimization efforts, the CUDA version can achieve around 22 times speedup compared to the corresponding serial CPU version. Thesis Supervisor: Kord S. Smith Title: KEPCO Professor of the Practice of Nuclear Science and Engineering Thesis Supervisor: Benoit Forget Title: Associate Professor of Nuclear Science and Engineering 3 4 Acknowledgments I would firstly like to thank my supervisors, Prof. Kord Smith and Prof. Benoit Forget, for their invaluable guidance and insights throughout this project, and for the support and freedom they gave me to pursue the research topic that I am interested in. I would also like to thank Prof. Srini Devadas for being my thesis reader. His expertise in computer architecture, especially in GPU, has helped me tremendously during this project. I wish I could have more time to learn from him. I am grateful to Luiz Leal of ORNL for introducing us to the multipole representation, and Roger Blomquist for his help in getting the WHOPPER code. This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357. Due to the nature of this project, I have also received help from many other people of both departments and I would like to thank all of them, among whom are: Dr. Paul Romano, Dr. Koroush Shirvan, Bryan Herman, Jeremy Roberts, Nick Horelik, Nathan Gilbson and Will Boyd of NSE, and Prof. Charles Leiserson, Prof. Nir Shavit, Haogang Chen and Ilia Lebedev of EECS. Furthermore, I want to express my gratitude to Prof. Mujid Kazimi, for his guidance during my first year and a half here at MIT, and for his continuous patience, understanding and support over my entire three years here. In addition, I would like to thank Clare Egan and Heather Barry of NSE and Janet Fischer of EECS, for the many administrative processes that they have helped me through from the application to the completion of the dual degrees. I also wish to give my thanks to all other people at MIT who have provided direct or indirect support to my studies over the last three years. Lastly, I would like to thank my family and friends for their love and care to me. Special thanks go to my wife, Hengchen Dai, for her constant love, support and 5 encouragement, which helped me go through each and every hard time during the years that we have been together. 6 Contents Contents 7 List of Figures 11 List of Tables 13 1 Introduction 15 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.3 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
18 2 Background and Review 2.1 2.2 19 Existing methods for Doppler broadening . . . . . . . . . . . . . . . . 19 2.1.1 Cullen’s method . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.1.2 Regression model . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.1.3 Explicit temperature treatment method . . . . . . . . . . . . . 25 2.1.4 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . 27 General purpose computing on GPU . . . . . . . . . . . . . . . . . . 28 2.2.1 GPU architecture . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2.2 CUDA Programming Model . . . . . . . . . . . . . . . . . . . 32 2.2.3 GPU performance pitfalls . . . . . . . . . . . . . . . . . . . . 34 2.2.4 Floating point precision support . . . . . . . . . . . . . . . . . 35 7 3 Approximate Multipole Method 3.1 3.2 3.3 3.4 37 Multipole representation . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.1.1 Theory of multipole representation . . . . . . . . . . . . . . . 37 3.1.2 Doppler broadening . . . . . . . . . . . . . . . . . . . . . . . . 40 3.1.3 Characteristics of poles . . . . . . . . . . . . . . . . . . . . . . 41 3.1.4 Previous efforts on reducing poles to broaden . . . . . . . . . 42 Approximate multipole method . . . . . . . . . . . . . . . . . . . . . 44 3.2.1 Properties of Faddeeva function and the implications . . . . . 44 3.2.2 Overlapping energy domains strategy . . . . . . . . . . . . . . 47 Outer and inner window size . . . . . . . . . . . . . . . . . . . . . . . 49 3.3.1 Outer window size . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.2 Inner window size . . . . . . . . . . . . . . . . . . . . . . . . . 56 Implementation of Faddeeva function . . . . . . . . . . . . . . . . . . 60 4 Implementation and Performance Analysis on CPU 65 4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 Test setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.2.1 Table lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2.2 Cullen’s method . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.2.3 Approximate multipole method . . . . . . . . . . . . . . . . . 70 Test one . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.1 Serial performance . . . . . . . . . . . . . . . . . . . . . . . . 72 4.3.2 Revisit inner window size . . . . . . . . . . . . . . . . . . . . 74 4.3.3 Parallel performance . . . . . . . . . . . . . . . . . . . . . . . 74 4.4 Test two . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.3 8 5 Implementation and Performance Analysis on GPU 5.1 5.2 5.3 83 Test setup and initial implementation . . . . . . . . . . . . . . . . . . 83 5.1.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.1.2 Hardware specification . . . . . . . . . . . . . . . . . . . . . . 86 5.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Optimization efforts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.2.1 Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.2.2 Floating point precision . . . . . . . . . . . . . . . . . . . . . 90 5.2.3 Global memory efficiency . . . . . . . . . . . . . . . . . . . . . 92 5.2.4 Shared memory and register usage . . . . . . . . . . . . . . . 95 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6 Summary and Future Work 99 6.1 Summary . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . 6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 A Whopper Input Files 99 103 A.1 U238 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 A.2 U235 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 A.3 Gd155 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 References 107 9 10 List of Figures 1-1 U238 capture cross section at 6.67 eV resonance for different temperatures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2-1 Evolution of GPU and CPU throughput [1]. . . . . . . . . . . . . . . 29 2-2 Architecture of an Nvidia GF100 card (Courtesy of Nvidia). . . . . . 30 2-3 GPU memory hierarchy (Courtesy of Nvidia). . . . . . . . . . . . . . 31 2-4 CUDA thread, block and grid hierarchy [1]. 33 . . . . . . . . . . . . . . 3-1 Poles distribution for U238. Black and red dots represent the poles with l = 0 and with positive and negative real parts, respectively, and green dots represent the poles with l > 0. . . . . . . . . . . . . . . . . 42 3-2 Relative error of U238 total cross section at 3000K broadening only principal poles compared with NJOY data. . . . . . . . . . . . . . . . 46 3-3 Relative error of U235 total cross section at 3000 K broadening only principal poles against NJOY data. . . . . . . . . . . . . . . . . . . . 3-4 Faddeeva function in the upper half plane of complex domain 47 . . . . 48 3-5 Demonstration of overlapping window. . . . . . . . . . . . . . . . . . 49 3-6 Relative error of U238 total cross section at 3000 K calculated with constant outer window size of twice of average resonance spacing. . . 11 53 3-7 Relative error of background cross section approximation for U238 total cross section at 3000 K with constant outer window size of twice of average resonance spacing. . . . . . . . . . . . . . . . . . . . . . . . . 54 3-8 Relative error of U238 total cross section at 3000 K calculated with outer window size from Eq. 3.22. . . . . . . . . . . . . . . . . . . . . 56 3-9 Relative error of background cross section approximation for U238 total cross section at 3000 K with outer window size from Eq. 3.22. . . . . 57 3-10 Relative error of U235 total cross section at 3000 K calculated with outer window size from Eq. 3.22. . . . . . . . . . . . . . . . . . . . . 58 3-11 Demo for the six-point bivariate interpolation scheme. . . . . . . . . . 61 3-12 Relative error of modified W against scipy.special.wofz. . . . . . . . . 63 3-13 Relative error of modified QUICKW against scipy.special.wofz. . . . . 64 4-1 Strong scaling study of table lookup and approximate multipole methods for 300 nuclides. The straight line represents perfect scalability. . 75 4-2 Strong scaling study of table lookup and approximate multipole methods for 3000 nuclides. The straight line represents perfect scalability. 76 4-3 Weak scaling study of table lookup and approximate multipole methods with 300 nuclides per thread. . . . . . . . . . . . . . . . . . . . . 77 4-4 OpenMP scalability of table lookup and multipole methods for neutron slowing down. The straight line represents perfect scalability. . . . . . 12 80 List of Tables 2.1 Latency and throughput of different types of memory in GPU [2]. . . 3.1 Information related to outer window size for U238 and U235 total cross section at different temperatures . . . . . . . . . . . . . . . . . . . . . 
3.2 32 59 Number of terms needed for background cross section and the corresponding storage size for various inner window size (multiple of average resonance spacing). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.1 Resonance Information of U235, U238 and Gd155 . . . . . . . . . . . 68 4.2 Performance results of the serial version of different methods for test one. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Runtime breakdown of approximate multipole method with modified QUICKW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 73 Performance and storage size of background information with varying inner window size. 4.5 72 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 kef f (with standard deviation) and the average runtime per neutron history for both table lookup and approximate multipole methods. . . 79 5.1 GPU specifications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.2 Performance statistics of the initial implementation on Quadro 4000 from Nvidia visual profiler nvvp . . . . . . . . . . . . . . . . . . . . . 13 89 5.3 kef f (with standard deviation) and the average runtime per neutron history for different cases. . . . . . . . . . . . . . . . . . . . . . . . . 92 5.4 Speedup of GPU vs. serial CPU version on both GPU cards. . . . . . 96 5.5 Performance statistics of the optimized single precision version on Quadro 4000 from nvvp. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 97 Chapter 1 Introduction 1.1 Motivation Most Monte Carlo neutron transport codes rely on point-wise neutron cross section data, which provides an efficient way to evaluate cross sections by linear interpolation with a pre-defined level of accuracy. Due to the effect of thermal motion from target nucleus, or Doppler effect, cross sections usually vary with temperature, which is particularly significant for the resonance capture and fission cross section, as demonstrated in Fig. 1-1. This temperature dependence must be taken into consideration for realistic, detailed Monte Carlo reactor simulations, especially if Monte Carlo method is to be used for the multi-physics simulations with thermal-hydraulic feedback impacting temperatures and material densities. The traditional way of dealing with this problem is to generate cross sections for each nuclide at specific reference temperatures by Doppler broadening the corresponding 0 K cross sections with nuclear data processing codes such as NJOY [3], and then to use linear interpolation between temperature points to get the cross sections for the desired temperature. However, this method can require a prohibitively large amount of cross section data to cover a large temperature range. For example, Trumbull [4] 15 0K 300 K 3000 K 4 Cross section (barn) 10 3 10 2 10 1 10 6 6.2 6.4 6.6 6.8 7 7.2 7.4 E (eV) Figure 1-1: U238 capture cross section at 6.67 eV resonance for different temperatures. studied the issue and suggested that using a temperature grid of 5 − 10 K spacing for the cross sections would provide the level of accuracy needed. 
Considering that the temperature range for an operating reactor (including the accident scenario) is between 300 and 3000 K, and the data size of cross sections at a single temperature for 300 nuclides of interest is approximately 1 GB, this means that around 250 − 500 GB of data would need to be stored and accessed, which usually exceeds the size of the main memory of most single computer nodes and thus degrades the performance tremendously. On the other hand, the exceedingly high power density with faster transistors limits the continuing increase in the speed of the central processor unit (CPU). Consequently, more and more cores have to be placed on a single processor to maintain the performance increases through wider parallelism. One direct result is that each processor core will have limited access to fast memory (or cache) which can cause a large disparity between processor power and available memory. This is especially true for the graphics processing unit (GPU), which is at the leading edge of this trend of embracing wider parallelism (although for different reasons, which will be discussed in 16 the next chapter). Therefore, reducing the memory footprint of a program can play a significant role in increasing its efficiency with new trends of computer architectures. As a result, to enable the Monte Carlo reactor simulation beyond the benchmark activities and to prepare it for the new computer architectures, the memory footprint of the temperature-dependent cross section data must be reduced. This can be done through evaluating the cross section on the fly, but classic Doppler broadening methods are usually too expensive, since they were all developed for preprocessing purposes. A few recent efforts on this [5, 6] have showed some improvements, but either the memory footprint is still too high for massively parallel architectures, or the efficiency is significantly degraded relative to current methods. Details on these methods will be presented in the next chapter. In this thesis, a new method of evaluating the cross section on the fly is proposed based on the multipole representation developed in [7], to both reduce the memory footprint of cross section data and to preserve the computational efficiency. In particular, the energy range above 1 eV in the resolved resonance region will be the focus of this thesis, since the cross sections in this range dominate the storage. 1.2 Objectives The main goal of this work is to develop a method that can dramatically reduce the memory footprint for the temperature-dependent nuclear cross section data so that the Monte Carlo method can be used for realistic reactor simulations. At the same time, the efficiency must be maintained at least comparable to the current method. In addition, this new method will be implemented on the GPU to utilize its massively parallel capability. 17 1.3 Thesis organization The remaining parts of the thesis are organized as follows: Chapter 2 reviews the existing methods for Doppler broadening, especially those that are developed to be used on the fly. It also provides some background information for general purpose computing on GPU. Chapter 3 describes the theory of the multipole representation and subsequently the formulation of the new method, the approximate multipole method [8]. The underlying strategy used for this new method will be discussed, as well as some important parameters and functions that can impact the performance or memory footprint thereof. 
Chapter 4 presents the implementation of the approximate multipole method on CPU and its performance comparison against the current method for both the serial and parallel version. Chapter 5 presents the implementation and performance analysis of the approximate multipole method on GPU. Specifically, the performance bottlenecks and optimization efforts in the GPU implementation are discussed. Chapter 6 summarizes the work in this thesis, followed by a discussion of possible future work and directions. 18 Chapter 2 Background and Review This chapter first reviews the existing methods for Doppler broadening, with a focus on recent efforts for broadening on the fly. The general purpose computing on GPU is then reviewed, mainly on the architectural aspect of GPU, the programming model and performance issues related to GPU. 2.1 Existing methods for Doppler broadening Doppler broadening of nuclear cross sections has long been an important problem in nuclear reactor analysis and methods. The traditional methods are mainly for the purpose of preprocessing and generating cross section libraries. Recently the concept of doing the Doppler broadening on the fly has gained popularity due to the prohibitively large storage size of temperature-dependent cross section data needed by Monte Carlo simulation with coupled thermal-hydraulic feedback, and a few methods in this category have been developed. In this section, both types of Doppler broadening methods are presented, with a focus on the latter type. 19 2.1.1 Cullen’s method The well-known Doppler broadening method developed by Cullen [9] uses a detailed integration of the integral equation defining the effective cross section due to the relative motion of target nucleus. For the ideal gas model, where the target motion is isotropic and its velocity obeys the Maxwell-Boltzmann distribution, the effective Doppler-broadened cross section at temperature T takes the following form √ Z∞ α 2 2 dV σ 0 (V )V 2 [e−α(V −v) − eα(V +v) ], σ̄(v) = √ 2 πv 0 (2.1) M , with kB being the Boltzmann’s constant and M the mass of the 2kB T target nucleus, v is the incident neutron velocity, V is the relative velocity between where α = the neutron and the target nucleus, and σ 0 (v) represents the 0 K cross section for neutron with velocity v. As Cullen’s method will be used as a reference method in Chapter 4, it is helpful to present the algorithm for evaluating Eq. 2.1 here. To begin, Eq. 2.1 can be broken into two parts: σ̄(v) = σ ∗ (v) − σ ∗ (−v), (2.2) √ Z∞ α 2 dV σ 0 (V )V 2 e−α(V −v) . σ (v) = √ 2 πv 0 (2.3) where ∗ The exponential term in Eq. 2.3 limits the significant part of the integral to the range 4 4 v−√ <V <v+√ , α α 20 (2.4) while for σ ∗ (v), the range of significance becomes 4 0<V < √ . α (2.5) The numerical evaluation of Eq. 2.3 developed in [9] assumes that the 0 K cross sections can be represented by a piecewise linear function of energy with acceptable accuracy, which is just the form of NJOY PENDF files [3]. By defining the reduced √ √ variables x = αV and y = αv, the 0 K cross sections can be expressed as σ 0 (x) = σi0 + si (x2 − x2i ), (2.6) 0 − σi0 )/(xi+1 − xi ) for the i-th energy interval. As a result, Eq. with slope si = (σi+1 2.3 becomes N Z xi+1 1 X 2 σ (y) = √ 2 σ 0 (x)x2 e−(x−y) dx πy i=0 xi (2.7) σ ∗ (y) = (2.8) ∗ X [Ai (σi0 − si x2i ) + Bi si ], i where N denotes the number of energy intervals that fall in the significance range as determined by Eq. 
2.4 and 2.5, and 1 H2 + y2 1 = 2 H4 + y Ai = Bi 2 H1 + H0 , y 4 H3 + 6H2 + 4yH1 + y 2 H0 , y where Hn is shorthand for Hn (xi − y, xi+1 − y), defined by 1 Z b n −z2 Hn (a, b) = √ z e dz. π a 21 (2.9) To compute Hn (a, b), one can write it in the form Hn (a, b) = Fn (a) − Fn (b), (2.10) 1 Z ∞ n −z2 Fn (x) = √ z e dz. π x (2.11) where Fn (x) is defined by and satisfies a recursive relation 1 erfc(x), 2 1 2 F1 (x) = √ e−a 2 π n−1 Fn−2 (x) + xn−1 F1 (x), Fn (x) = 2 F0 (x) = (2.12) (2.13) (2.14) with erfc(x) being the complementary error function 2 Z ∞ −z2 e dz. erfc(x) = √ π a (2.15) If a and b are very close with each other, the difference in Eq. 2.10 may loose significance. To avoid this, one can use a different method based on direct Taylor expansion (see [10]). However, with the use of double precision floating point numbers, this problem did not show up during the course of this thesis work. Traditionally Cullen’s method is used in NJOY to Doppler broaden the 0 K cross section and generate cross section libraries for reference temperatures. Any cross section needed are then evaluated with this library by linear interpolation, which will be denoted “table lookup” method henceforth in this thesis. As will be shown in Chapter 4 (and also in other places such as [5]), Cullen’s method is not very practical to broaden the cross section on they fly since it requires an unacceptable amount of 22 computation time, mainly due to the cost of evaluating complementary error functions for the many energy points that fall in the range of significance, especially at high energy. 2.1.2 Regression model In [5], Yesilyurt et al. developed a new regression model to perform on-the-fly Doppler broadening based on series expansion of the multi-level Adler-Adler formalism with temperature dependence. Take the total cross section as an example (similar analysis applies to other types of cross section), its expression in Adler-Adler formalism is σt (E, ξR ) = 4πλ̄2 sin2 φ0 √ X 2 [(GR cos2φ0 + HR sin2φ0 )ψ(x, ξR ) + πλ̄2 E{ R ΓR,t + (HR cos2φ0 − GR sin2φ0 )χ(x, ξR )] + A1 + A2 A3 A4 + 2 + 3 + B1 E + B2 E 2 } E E E (2.16) where ψ(x, ξR ) = χ(x, ξR ) = λ̄ = x = ξR = ξR Z ∞ exp[− 14 (x − y)2 ξR2 ]dy √ 2 π ∞ 1 + y2 ξR Z ∞ exp[− 14 (x − y)2 ξR2 ]ydy √ 2 π ∞ 1 + y2 √ 1 1 2mn awri = √ , k0 = , k h̄ awri + 1 k0 E 2(E − ER ) , Γ s T awri ΓT , 4kB ER T (2.17) (2.18) (2.19) (2.20) (2.21) and the other symbols are: GR , symmetric total parameter; HR , asymmetric total parameter; Ai and Bi , coefficients of the total background correction; φ0 , phase shift; ER , energy of resonance R; ΓT , total resonance width; mn , neutron mass; and awri, 23 the mass ratio between the nuclide and the neutron. Since the only temperature dependence of Eq. 2.24 is in ξR , and ξR only appears in ψ(x, ξR ) and χ(x, ξR ), as a result, by writing ψ(x, ξR ) and χ(x, ξR ) as ψR (T ) and χR (T ) and expanding them in terms of T , ψR (T ) = X aR,i fi (T ), (2.22) bR,i hi (T ), (2.23) i χR (T ) = X i one can arrive at an expression for the total cross section in terms of series expansion of T σt (E, T ) = AR + X a00i fi (T ) + i X b00i hi (T ), (2.24) i where a00i and b00i are parameters specific for a given energy and nuclide. 
Through analyzing the asymptotic expansions of both ψ(x, ξR ) and χ(x, ξR ), augmented by numerical investigation [5], a final form for σt (E, T ), 6 X ai + bi T i/2 + c, σt (E, T ) = i/2 i=1 i=1 T 6 X (2.25) where ai , bi and c are parameters unique to energy, reaction type and nuclide, was found to give good accuracy for a number of nuclides examined over the temperature ranges of 77 − 3200K. As shown in [11], for production MCNP code, the overhead in performance to incorporate this regression model method over the traditional table lookup method is only 10% − 20%. However, since for each 0 K energy point of each cross section type of any nuclide, there are 13 parameters needed, which suggests that the total size of data for this method is around 13 times of that of standard 0 K cross sections, or about 13 GB. Although reduced significantly from the table lookup method, this 24 size is still taxing for many hardware systems, such as GPUs. 2.1.3 Explicit temperature treatment method In essence, the explicit temperature treatment method is not a method for Doppler broadening, since there is no broadening of any cross section at any point during the process. However, due to the fact that it does solve the problem of temperature dependence of cross sections in a clever way, it is also included in this category. As described in [6], the whole method is based on a concept similar to that of Woodcock delta-tracking method [12], so essentially it is a type of rejection sampling method. Specifically, for incident neutron energy E at ambient temperature T , a majorant cross section for a nuclide n is defined as Σmaj,n (E) = gn (E, T, awrin ) Σ0tot,n (E 0 ), max √ √ E 0 ∈[( E−4/λn (T ))2 ,( E+4/λn (T ))2 ] (2.26) where gn (E, T, awrin ) is a correction factor for the temperature-initiated increase in potential scattering cross section of the form √ 1 e−λn (T ) E √ , gn (E, T, awrin ) = [1 + ]erf[λn (T ) E] − √ 2λn (T )2 E πλn (T ) E 2 s λn (T ) = awrin , kB T (2.27) (2.28) 0 (E) Σ0tot,n (E) = Nn σtot,n (2.29) 0 and Nn and σtot,n are the number density and the 0 K total microscopic cross section of nuclide n. A majorant cross section for a material region with n different nuclides and a maximum temperature Tm is then defined as Σmaj (E) = X Σmaj,n (E) n 25 (2.30) The neutron transport process can subsequently be simulated with the tracking scheme shown in Algorithm 1. 
Algorithm 1 Tracking scheme for explicit temperature treatment method while true do starting at position ~r, sample a path length ~l based on the majorant cross section at ~r, Σmaj (E, ~r) get a temporary collision point, ~r0 = ~r + ~l sample target nuclide at position ~r0 , with probability for nuclide n as Σmaj,n (E, ~r0 ) Pn = Σmaj (E, ~r0 ) sample the target velocity from the Maxwellian distribution with temperature Tm (~r0 ), get the energy E 0 corresponding to the relative velocity between neutron and the target nucleus Σ0tot,n (E 0 ) : rejection sampling with criterion ξ < Σmaj,n (E, ~r0 ) if rejected then continue else set the collision point: ~r ← ~r0 sample the reaction type with energy E 0 and the 0 K microscopic cross section continue end if end while Combining the definition of majorant cross section and the tracking scheme, it is clear that the material majorant cross section is defined such that every single nucleus in this material is assigned with the maximum possible microscopic cross section (if the high energy Maxwellian tail can be ignored), and the rejection sampling is then performed based on the real microscopic cross section from Maxwell distribution. The dataset needed for this method are the 0 K cross sections and the majorant microscopic cross sections for the material temperatures of interest, and thus the storage requirement is on the order of a few GB. However, due to the use of delta tracking and rejection sampling, the efficiency can be impacted. In fact, as reported in [13], the runtime of explicit temperature treatment method is 2 − 4 times slower than the standard table lookup method for a few test cases. 26 2.1.4 Other methods There are also some other methods that have been developed for Doppler broadening. They will be discussed briefly below. The psi-chi method [14] is a single level resonance representation that introduces a few substantial approximations to Eq. 2.1, including using the single-level BreitWigner formula for the 0 K cross section, omitting the second exponential term, approximating the exponent in the first exponential term with Taylor series expansion and extending the lower limit of the integration to −∞. The final form of the Doppler-broadened cross section uses the functions ψ(x, ξR ) and χ(x, ξR ) as defined in Eq. 2.17 and 2.18. Due to the extensive approximations used, psi-chi method is not very accurate for cross section evaluation, especially for low energies. Besides, the single-level Breit-Wigner formalism is now considered obsolete as the resonance representation for major nuclides. The Fast-Doppler-Broadening method developed in [15] uses a two-point GaussLegendre quadrature for the integration in Eq. 2.7 when xi+1 − xi > 1.0, and uses Eq. 2.8 otherwise. This method was found to be 2 − 3 times faster than Cullen’s method, thus still not efficient enough for on-the-fly Doppler broadening. Another method that was proposed during the course of this thesis makes use of the fact that the velocity of target nucleus is usually much smaller than that of incident neutron velocity. Consequently , the energy corresponding to the relative velocity between neutron and target nucleus can be approximated as a function of vT µ, where vT is the velocity of the target nucleus and µ is the cosine between the neutron velocity and target velocity. 
It can be proved that vT µ obeys Gaussian distribution, as a result, the effective Doppler-broadened cross section at temperature T can be expressed as a convolution of the 0 K cross section and a Gaussian distribution σ̄(E) = Z ∞ −∞ σ 0 [Erel (u)]G(u, T )du 27 (2.31) where u = vT µ and G(u, T ) represents a Gaussian distribution for temperature T . With the similar energy cut-off strategy as used in Cullen’s method, and by using a tabulated cumulative distribution function (cdf) for Gaussian distribution, the above integration can be evaluated numerically and requires much fewer operations than the Cullen’s method for each energy grid point. However, the inherent inaccuracy for low energy ranges and the still high computational time makes it unattractive, especially when compared to the approximate multipole method that will be discussed in the next chapter, as a result, it was abandoned. 2.2 General purpose computing on GPU GPUs started as fixed-function hardware dedicated to handle the manipulation and display of 2D graphics in order to offload the computationally complex graphical calculations from the CPU. Due to the inherent parallel nature of displaying computer graphics, over the years, more and more parallel processors are added to a single GPU card and thus make it massively parallel and of tremendously high floating point operations throughput. Nowadays, GPUs are capable of performing trillions of floating point operations per second (Teraflops) from a single GPU card, much higher than that of the high-end CPUs (see Fig. 2-1). In order to utilize this huge computational power in other areas of computing, some efforts have been made to transfer GPUs into fully-programmable processors capable of general-purpose computing, among which is NVidia’s CUDA (Compute Unified Device Architecture)[16], a general-purpose parallel computing architecture. Ever since then, many scientific applications have been accelerated with GPUs, including some cases for Monte Carlo neutron transport such as [17, 18, 19, 20]. In the remaining part of this section, the architectural aspects of GPUs and the CUDA programming model, as well as some of the performance considerations for GPUs, are briefly reviewed. 28 Figure 2-1: Evolution of GPU and CPU throughput [1]. 2.2.1 GPU architecture GPU architecture is very different from conventional multi-core CPU design, since GPUs are fundamentally geared towards high-throughput (versus low-latency) computing. As a parallel architecture, it is designed to process large quantities of concurrent, fine-grained tasks. Fig. 2-2 illustrates the architecture of an NVidia GPU. A typical high performance GPU usually contains a number of streaming multiprocessors (SMs), each of which comprises tens of homogeneous processing cores. SMs use single instruction multiple thread (SIMT) and simultaneous multithreading (SMT) techniques to map threads of execution onto these cores. SIMT techniques are architecturally efficient in that one hardware unit for instructionissue can service many data paths and different threads execute in lock step fashion. However, to avoid the scalability issues of signal propagation delay and underutilization that may happen with a single instruction stream for the entire SM [21], GPUs 29 Figure 2-2: Architecture of an Nvidia GF100 card (Courtesy of Nvidia). typically implement fixed-size SIMT groupings of threads called warps, and the width of a warp is usually 32. Distinct warps are not run in lock step and may diverge. 
Using SMT techniques, each SM maintains and schedules the execution contexts of many warps. This style of SMT enables GPUs to hide latency by switching amongst warp contexts when architectural, data, and control hazards would normally introduce stalls. This leads to a more efficient utilization of the available physical resource, and the maximal instruction throughput occurs when the number of thread contexts is much greater than the aggregate number of SIMT lanes per SM. Communication between threads is achieved by reading and writing data to various shared memory spaces. As shown in Fig. 2-3, GPUs have three levels of explicitly managed storage that vary in terms of visibility and latency: per-thread registers, 30 shared memory local to a collection of warps running on the same SM, and a large global (device) memory in off-chip DRAM that is accessible by all threads. Figure 2-3: GPU memory hierarchy (Courtesy of Nvidia). Unlike traditional CPU architecture, GPUs do not implement data caches for the purpose of maintaining the programs working set in nearby, low-latency storage. Rather, the cumulative register file comprises the bulk of on-chip storage, and a much smaller cache hierarchy often exists for the primary purpose of smoothing over irregular memory access patterns. In addition, there are also two special types of readonly memory on GPUs: constant and texture memory. Both are part of the global memory, but each of them has a corresponding cache that can facilitate the memory access. In general, constant memory is most efficient when all threads in a warp access the same memory location, while texture memory is good for memory access with spatial locality. Since different types of memory have very different latency and 31 bandwidth, as shown in Table 2.1, a proper use of the memory hierarchy is essential in achieving a good performance for GPU programs. Table 2.1: Latency and throughput of different types of memory in GPU [2]. Registers Shared Global Constant memory memory memory Latency (unit: ∼ 1 ∼5 ∼ 500 ∼ 5 with clock cycle) caching Bandwidth Extremely High Modest High high 2.2.2 CUDA Programming Model CUDA is both a hardware architecture and a software platform to expose that hardware at a high level for general purpose computing. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in GPUs. This platform is accessible through extensions to industry-standard programming languages, including C/C++ and FORTRAN, and CUDA C is used in current work. A typical CUDA C program consists of two parts of code: host code and device code. The host code is similar to any common C program, except that it needs to take care of memory allocation on GPU, data transfer between CPU and GPU and offload the computation to GPU. Device code, as suggested by the name, are the code that executes on the device (GPU). The device code is started with a kernel launch from the host code, through which a kernel function is executed by a collection of logical threads on the GPU. These logical threads are mapped onto hardware threads by a scheduling runtime, either in software or hardware. In CUDA, all logical threads are organized into a two-level hierarchy, block and grid, for efficient thread management. Specifically, a block contains a number of cooperative threads that will be assigned to the same SM, and all the blocks of a 32 kernel form a grid, as shown in Fig. 2-4. During a kernel launch, both block and grid dimensions have to be specified. 
The thread execution can then be specialized by its identifier in the hierarchy, allowing threads to determine which portion of the problem to operate on. Figure 2-4: CUDA thread, block and grid hierarchy [1]. The threads within a block share the same local shared memory, since they are assigned on the same SM, and some synchronization instructions exist to ensure memory coherence for this per-block shared memory. On the other hand, global memory, which can be accessed by all the threads within a grid, or threads for a kernel, is only guaranteed to be consistent at the boundaries between sequential kernel invocations. 33 2.2.3 GPU performance pitfalls Although GPUs can deliver very high throughput on parallel computations, they require large amounts of fine-grained concurrency to do so. In addition, due to the fact that GPUs have long been a special-purpose processor for graphics, the underlying hardware can penalize algorithms having irregular and dynamic behavior not typical of computations related to graphics. In fact, GPU architecture is particularly sensitive to load imbalance among processing elements. Here, two major performance pitfalls relevant to current work are discussed: variable memory access cost from SIMT access patterns and thread divergence. Non-uniform memory access Due to the SIMT nature of the GPU programming model, for each load/store instruction executed by a warp in one lock step, there are probably a few accesses to different memory locations. Different types of physical memory may respond differently to such collective accesses, but they all share one thing in common, that is, each of them are optimized for some specific access patterns. A mismatched access pattern often leads to significant underutilization of the physical resource that can be as high as an order of magnitude. For global memory, the individual accesses for each thread within a warp can be combined/bundled together by the memory subsystem into a single memory transaction, if every reference falls within the same contiguous global memory segment. This is called “memory coalescing”. A a result, if each of the thread references a memory location of a distinct segment, then many transactions have to be made, which is very wasteful of the memory bandwidth. For local shared memory, performance is highest when no threads within the same warp access different words in a same memory bank. Otherwise the memory accesses (to the same bank) will be serialized by the hardware, called “bank conflicts”. Bank 34 conflicts is a serious performance hurdle for shared memory and needs to be avoid to achieve good performance. Thread divergence In the GPU programming model, logical threads are grouped into warps of execution. A single program counter is shared by all threads within the warp. Warps, not threads, are free to pursue independent paths through the kernel program. To provide the illusion of individualized control flow, the execution model must transparently handle branch divergence. This situation occurs when a conditional branch instruction, like “If...Else...”, would redirect a subset of threads down the “If” path, leaving the others to continue the “Else” path. Because threads within the warp proceed in lock step fashion, the warp must necessarily execute both halves of the branch, making some of the threads idle where appropriate. This mechanism can lead to an inherently unfair scheduling of logical threads. In the worst case, only one logical thread may be active while all others threads perform no work. 
The GPU’s relatively large SIMT width exacerbates the problem and branch divergence can impose an order of magnitude slowdown in overall computing throughput. 2.2.4 Floating point precision support For the early generation of GPU cards, there was only support for single precision floating numbers and corresponding arithmetic, since accuracy is not a big issue for image processing and graphics display. However, with the advent of general purpose GPU computing for scientific purposes, the need for support of double precision floating point arithmetic becomes imperative. As a result, since the introduction of GT200 card and CUDA Computability 1.3, double precision arithmetic has been supported on most Nvidia high performance GPU cards. For Nvidia GPUs earlier than Computability 2.0, each SM has only one special 35 function unit (SFU) that can perform double precision computation, while there are eight units for single precision computation. As a result, on these GPU cards, the theoretical peak performance of double precision computation is only one eighth of that of single precision computation. Since CUDA Computability 2.0, more SFUs that support double precision computation are added to GPUs and nowadays the theoretical peak performance of double precision computation is usually a half of that of single precision on a normal Nvidia high performance GPU card. 36 Chapter 3 Approximate Multipole Method This chapter starts with the description of multipole representation. The formulation of the approximate multipole method for on-the-fly Doppler broadening is then presented, along with the overlapping energy domains strategy. A systematic study of a few important parameters in the proposed strategy then follows. The chapter concludes with a discussion about an important function in the multipole method. 3.1 3.1.1 Multipole representation Theory of multipole representation In resonance theory, the reaction cross section for any incident channel c and exit channel c0 can be expressed in terms of collision matrix Ucc0 σcc0 = πλ̄2 gc |δcc0 − Ucc0 |2 , 37 (3.1) where gc and δcc0 are the statistical spin factor and the Kronecker delta, respectively, and λ̄ is as defined in 2.19. Similarly, the total cross section can be derived as σt = X σcc0 = 2πλ̄2 gc (1 − ReUcc0 ). (3.2) c0 The collision matrix can be described by R-matrix representation, which has four practical formalisms, i.e., single-level Breit-Wigner (SLBW), multilevel Breit-Wigner (MLBW), Adler-Adler, and Reich-Moore. Because of the rigor of the Reich-Moore formalism in representing the energy behavior of the cross section, it is used extensively for major actinides in the current ENDF/B format. In Reich-Moore formalism, the collision matrix can be represented by the transmission probability[22] Ucc0 = e−i(φc +φc0 ) (δcc0 − 2ρcc0 ), (3.3) where ρcc0 is the transmission probability from channel c to c0 , and φc and φc0 are the hard-sphere phase shift of channel c and c0 , respectively. Due to the physical condition that the collision matrix must be single-valued and meromorphic1 in momentum space, the collision matrix and thus the transmission probability can be rigorously represented by rational functions with simple poles in √ E domain [7]. This is a generalization of the rationale suggested by de Saussure and Perez [23] for the s-wave resonances, and lays the foundation for the multipole representation as described in [7]. 
In this representation, the neutron-neutron and the generic neutron to channel c transmission probabilities of the Reich-Moore formalism for all N resonances with angular momentum number l can be written as ρnn √ 2M Pn2M −1 ( E) X Rnλ √ √ = = 2M P ( E) E λ=1 pλ − 1 (3.4) A meromorphic function is a function that is well behaved except at isolated points. In contrast, a holomorphic function is a well-behaved function in the whole domain. 38 √ 2M X |Pc2M −1 ( E)|2 2Rcλ √ √ |ρnc | = = |P 2M ( E)|2 E λ=1 pλ − 2 (3.5) where ρnn and ρnc are the transmission probabilities, P represents a holomorphic function, pλ ’s are the poles of the complex function while Rnλ and Rcλ are the corresponding residues for the transmission probability, and M = (l + 1)N . These two equations suggest that a resonance with angular momentum number l corresponds to 2(l + 1) poles. Finding the poles is complicated since they are the roots of a high order complex polynomial with roots that are often quite close to each other in momentum space. A code package, WHOPPER, was developed by Argonne National Laboratory to find all the complex poles, making use of good initial guesses and quadruple precision [7]. Once all poles and residues have been obtained, the 0 K neutron cross-sections can be computed by substituting Eq. 3.3 - 3.5 into Eq. 3.1 3.2, which yields (j)∗ where pλ (x) N 2(l+1) X −iRl,J,λ,j 1 XX Re[ (j)∗ √ ] σx (E) = E l,J λ=1 j=1 pλ − E (3.6) (t) N 2(l+1) X −iRl,J,λ,j 1 XX −2iφl σt (E) = σp (E) + Re[e √ ] (j)∗ E l,J λ=1 j=1 pλ − E (3.7) (x) is the complex conjugate of the j-th pole of resonance λ, and Rl,J,λ,j and (t) Rl,J,λ,j are the residues for reaction type x and total cross section, respectively, and the potential cross-section σp (E) is given by σp (E) = X 4πλ̄2 gJ sin2 φl (3.8) l,J with φl being the the phaseshift. In this form, the cross-sections can be computed by summations over angular momentum of the channel (l), channel spin (J), number of resonances (N ) and number of poles associated to a given resonance(2(l + 1)). 39 3.1.2 Doppler broadening By casting the expression for cross section into the form of Lorentzian-like terms as shown in Eq. 3.6 and 3.7, the Doppler broadened cross section can be derived in analytical forms consisting of well-known functions, as demonstrated in [7]. Specially, with the use of Solbrig kernel [24], √ √ S( E, E 0 ) = √ √ √ √ √ ( E− E 0 )2 ( E+ E 0 )2 E0 [− ] [− ] 2 2 ∆ ∆ m m √ {e −e } ∆m πE (3.9) s kB T is the Doppler width in momentum space, and kB is the Boltzawri mann constant, Eq. 3.6 and 3.7 can be Doppler broadened to take the following where ∆m = form (x) √ (j)∗ √ (x) iRl,J,λ,j E p N 2(l+1) X Re[ πRl,J,λ,j W(z0 ) + √π C( ∆m , ∆λm )] 1 XX σx (E, T ) = E l,J λ=1 j=1 ∆m σt (E, T ) = σp (E) + 1 E N 2(l+1) XX X √ (t) Re{e−2iφl [ πRl,J,λ,j W(z0 ) + (3.10) √ (j)∗ iRl,J,λ,j E pλ √ C( , )]} π ∆m ∆m (t) ∆m l,J λ=1 j=1 (3.11) where √ z0 = (j)∗ E − pλ , ∆m (3.12) W(z) is the Faddeeva function, defined as 2 i Z ∞ e−t dt W(z) = , π −∞ z − t 40 (3.13) and the quantity related to C can be regarded as the low energy correction term, with the full expression √ √ 2 (j)∗ (j)∗ −2i pλ − ∆E2 Z ∞ E pλ e−t −2 Et/∆m , )= √ e m C( dt (j)∗ . ∆m ∆m π∆m ∆m 0 [pλ /∆m ]2 − t2 (3.14) Since this correction term is generally negligible for energy above the thermal region [25], and only energy above 1 eV are of interest in this thesis, this correction term will be ignored from now on. 
As a result, the above equations become √ N 2(l+1) X π 1 XX (t) Re[Rl,J,λ,j W(z0 )], σx (E) = E l,J λ=1 j=1 ∆m √ N 2(l+1) X 1 XX π (t) σt (E) = σp (E) + Re[e−2iφl Rl,J,λ,j W(z0 )], E l,J λ=1 j=1 ∆m (3.15) (3.16) and Doppler broadening a cross section at a given energy E is thus reduced to a summation over all poles, each with a separate Faddeeva function evaluation. 3.1.3 Characteristics of poles As mentioned in 3.1.1, each resonance with angular momentum l corresponds to in total 2(l + 1) poles. Among them, two are called “s-wavelike” poles, and their real part are of the opposite sign but have the same value, which is close to the square root of the resonance energy, while their imaginary part are very small; the other 2l poles behave like l conjugate pairs featuring large imaginary part with a characteristic 1 magnitude of , where a is the channel radius and k0 is defined in Eq. 2.19. Fig. k0 a 3-1 shows the pole distribution of U238, which has both l = 0 and l = 1 resonances, to demonstrate the relative magnitude of different type of poles. In ENDF/B resonance parameters, for most major nuclides, there are usually a number of artificial resonances (called “external resonances”) outside of the resolved resonance regions. Since they are general s-wave resonances, each of them also cor41 600 400 Im(p) 200 0 −200 −400 −600 −200 −150 −100 −50 0 Re(p) 50 100 150 200 Figure 3-1: Poles distribution for U238. Black and red dots represent the poles with l = 0 and with positive and negative real parts, respectively, and green dots represent the poles with l > 0. responds to two poles, the same as an ordinary s-wave resonance. Besides, for the external resonances that are above the upper bound of resolved resonance region, the value of poles also obey the same rule as that for the ordinary s-wave, while for those that have negative resonance energies, the two poles are of opposite sign and the absolute value of the real part is very small and that of the imaginary part is relatively large, as demonstrated by the black and red dots (almost) along the imaginary axis in Fig. 3-1. 3.1.4 Previous efforts on reducing poles to broaden To reduce the number of poles to be broadened when evaluating cross sections for elevated temperatures, an approach was proposed to replace half of the first type of 42 poles with a few pseudo-poles[25]. Specifically, it was noticed that the first type of poles with negative real part have very smooth contributions to the resolved resonance energy range, therefore, the summation of the contribution from these poles are also smooth and can be approximated by a few smooth functions. Moreover, it turned out that rational functions that are of the same form as the pole representation (Eq. 3.6 and 3.7) works very well as the fitting function. Each of these rational functions effectively defines a “pseudo-pole”. It was found that only three of such pseudopoles are necessary to achieve a good accuracy for the ENDF/B-VI U238 evaluation [25]. In addition, the contribution from the second type of poles were found to be temperature independent due to their exceedingly large Doppler widths, therefore they do not need to be broadened for elevated temperatures. As a result, only half of the first type of poles, which is the same as the number of resonances (including the external resonances), plus three pseudo-poles, are left to be broadened, which effectively reduces the number of poles to be broadened. For the new ENDF/B-VII U238 evaluational, there are 3343 resonances in total. 
A preliminary study shows that around 10 pseudo-poles are needed to approximate the contributions from the first type of poles with negative real part. Therefore, a total of around 3353 poles need to be broadened. In general, this is a good strategy in increasing the computational efficiency of Doppler broadening process. However, the number of poles to be broadened is still too large to be practical for performing Doppler broadening on the fly. Therefore, a new strategy is proposed during the course of this thesis to further reduce the number of poles to be broadened and will be discussed in the following sections. 43 3.2 3.2.1 Approximate multipole method Properties of Faddeeva function and the implications From the definition of Faddeeva function in Eq. 3.13, it can be approximated as a Gauss-Hermite quadrature M aj i X , W(z) ≈ π j=1 z − tj (3.17) where aj and tj are the Gauss-Hermite weights and nodes, respectively. As demonstrated in [26], the following number of quadrature points are sufficient to ensure a relative error less than 10−5 : M = 6, 3.9 < |z| < 6.9 4, 6.9 < |z| < 20.0 (3.18) 2, 20.0 < |z| < 100 1, |z| ≥ 100 . The nodes and weights for Gauss-Hermite quadrature of degree one are 0 and √ respectively. Therefore, if | √ π, (j)∗ E−pλ ∆m | ≥ 100, then √ √ (j)∗ π E − pλ W[ ] ∆m ∆m √ ≈ = √ i π · ∆m · √ ∆m π E − p(j)∗ λ −i √ . (j)∗ pλ − E π By comparing to the 0 K expression in Eq. 3.6, this suggests that from a practical point of view, the Doppler broadening effect is only significant to within a range of ∼ 100 Doppler width away from a pole in the complex momentum domain [25]. In other words, to calculate the cross section of any energy point for a certain temperature, one can first sum up the contribution at 0 K from all those poles that are 100 Doppler 44 widths away from this energy, since this part is independent of temperature; and then broaden the other neighboring poles for the desired temperature similar to Eq. 3.15 or Eq. 3.16. Taking the cross section evaluation of U238 at 1 KeV at 3000 K as an example, 100 Doppler widths corresponds to about three in the momentum domain, within which there are only around 470 poles. Compared to the total number to be evaluated from 3.1.4, it is clear that this strategy can effectively reduce the number of poles to be broadened on the fly. As mentioned in 3.1.3, the second type of poles usually have very large imaginary parts that are much greater than the Doppler width, therefore, they can be treated as temperature independent. As to the first type of poles, the ones with negative real parts in general can be treated as temperature independent too2 , since cross sections are always computed in the complex domain with positive real parts. In addition, those poles from negative resonances can also be treated as temperature independent, since they usually have large imaginary parts. As a result, one is now left with only those first type of poles with positive real part from positive resonances, which are denoted as “principal poles” henceforth. Fig. 3-2 and 3-3 show the relative errors of total cross section of U238 and U235 at 3000 K calculated with broadening only the principal poles against the NJOY data. All results are based on the ENDF/B-VII resonance parameters. In general, the agreement between the multipole representation and NJOY is very good, except for some local minimum points (e.g. very low absolute cross section values) in U238 which have little practical importance. 
From now on, the cross sections calculated by broadening all principal poles will be used as the reference multipole representation cross sections, unless otherwise noted.

Figure 3-2: Relative error of U238 total cross section at 3000 K, broadening only principal poles, compared with NJOY data.

An attractive property of the principal poles is that the input arguments to the Faddeeva function arising from these poles always fall in the upper half plane of the complex domain, where the Faddeeva function is well-behaved, as shown in Fig. 3-4. In this region, due to the presence of the exponential term, the Faddeeva function decreases rapidly with increasing |z|. This property indicates that, everything else being equal, faraway poles in general have a much smaller contribution than nearby poles to a given energy point. Therefore, even for some poles that do not lie outside the 100-Doppler-width range, their contribution is so small, and its variation with temperature even smaller, that they may also be treated as temperature independent. This can further reduce the number of poles to be broadened. However, since the relative importance of the contributions from different poles also depends on the values of their residues, this effect has to be studied on a nuclide-by-nuclide basis, which will be discussed later.

Figure 3-3: Relative error of U235 total cross section at 3000 K, broadening only principal poles, against NJOY data.

3.2.2 Overlapping energy domains strategy

Since the imaginary parts of the principal poles are very small, the value of the real part is the major deciding factor of whether a pole is close to or far away from an energy point to be evaluated, and thus whether it is significant in terms of the Doppler broadening effect. As a result, for any given energy E at which the cross section needs to be evaluated, the principal poles that are close to it in momentum space are consecutive in the energy domain; therefore, if the principal poles are sorted by their real parts, the poles that are close to E can be specified with a start and an end index. This property makes the storage of this information very convenient.

As a result, one can divide the resolved resonance region of any nuclide into many equal-sized small energy intervals (so that a direct index fetch instead of a binary search can be used). For each interval, there are only a certain number of local poles (including those inside the interval) that have a broadening effect on this interval, and these poles can be recorded with their indices. For the other poles, their contribution to this interval can be pre-calculated.

Figure 3-4: Faddeeva function in the upper half plane of the complex domain. (a) Real part. (b) Imaginary part.
Because the Faddeeva function is smooth away from the origin, the accumulated contribution will also be smooth and can be approximated with a low-order polynomial. Since the potential part of the total cross section is very smooth as well, it can also be included in the background cross section. Therefore, when evaluating the cross section, one only needs to broaden those local poles and to add the background cross section approximated by the polynomials. This can be expressed as

\[
\sigma_x(E) = p_x(E) + \frac{1}{E}\sum_{p_\lambda\in\Omega} \mathrm{Re}\!\left[R^{(x)}_{l,J,\lambda,j}\,\frac{\sqrt{\pi}}{\Delta_m}\,W(z_0)\right], \qquad (3.19)
\]

\[
\sigma_t(E) = p_t(E) + \frac{1}{E}\sum_{p_\lambda\in\Omega} \mathrm{Re}\!\left[e^{-2i\phi_l}\,R^{(t)}_{l,J,\lambda,j}\,\frac{\sqrt{\pi}}{\Delta_m}\,W(z_0)\right], \qquad (3.20)
\]

where p_x(E) and p_t(E) represent the polynomial approximations of the background cross section, and p_λ ∈ Ω denotes the poles that need to be broadened on the fly.

To ensure the smoothness and temperature independence of the background cross section in the energy interval (inner window), the poles that are within a certain distance of either edge of the energy interval also have to be broadened on the fly. To this end, an overlapping energy domains strategy is used: outside of the inner window, an equal-sized window (outer window) in energy is chosen on each side, and the poles within both the inner and outer windows are broadened on the fly, as demonstrated in Fig. 3-5. The sizes of the outer and inner windows affect the number of poles to be broadened on the fly, as well as the storage size and accuracy, and will be studied in later parts of the thesis.

Figure 3-5: Demonstration of the overlapping window: an inner window flanked by an outer window on each side, with p(E) denoting the background approximation over the inner window.

3.3 Outer and inner window size

In the current overlapping energy domains strategy, to evaluate the cross section at a given energy, one needs to broaden a number of poles lying inside the inner and outer windows, as well as to calculate the background cross section. It is anticipated that the major time component of this method will be the broadening of poles, so the number of poles is an important metric to monitor. In general, the outer window mainly determines the number of poles to be evaluated; therefore, a smaller outer window is always preferred, provided the accuracy criterion is met.

As to the inner window size, although it also affects the number of poles to be broadened, this effect is usually less significant than that of the outer window size. Instead, its major impact is on the storage size of the supplemental information other than the poles and residues. As mentioned before, for each inner window, two indices of the poles to be broadened need to be stored, in addition to the polynomial coefficients. The number of inner windows depends both on the length of the resolved resonance region and on the inner window size. Since the resolved resonance region is fixed for each nuclide, the larger the inner window, the fewer inner windows there are, and thus the less data to be stored. However, as the inner window gets larger, a higher-order polynomial may be necessary to ensure a certain accuracy for the background cross section, which may in turn increase the storage size. As a result, there is a tradeoff between performance and storage size in choosing the inner window size, which adds complexity. As a starting point, the inner window size is set to the average spacing of the resonances for each nuclide (e.g. 5.98 eV for U238 and 0.70 eV for U235); a sketch of how a cross section is evaluated with these windows is given below.
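For illustration, a minimal C sketch of Eq. 3.19 follows. The window layout (window_t), the array names, and the quadratic form of the background polynomial are assumptions standing in for the actual library format described in Chapter 4, and faddeeva() denotes whichever implementation from Section 3.4 is used.

    /* Hypothetical per-window record: quadratic background plus the index range
     * of the poles (inner + outer window) that must still be broadened on the fly. */
    #include <complex.h>
    #include <math.h>

    #define PI 3.14159265358979323846

    typedef struct {
        double coef[3];            /* background polynomial p_x(E) = c0 + c1*E + c2*E^2 */
        int    pole_lo, pole_hi;   /* half-open range [lo, hi) into the sorted poles    */
    } window_t;

    extern double complex faddeeva(double complex z);   /* modified W or QUICKW */

    /* Evaluate one reaction cross section per Eq. 3.19 (for the total cross
     * section, Eq. 3.20, each term would also carry the phase factor e^{-2i*phi_l}). */
    double sigma_x(double E, double dm /* Doppler width in momentum space */,
                   double E_min, double win_width, const window_t *win,
                   const double complex *pole, const double complex *residue)
    {
        const window_t *w = &win[(int)((E - E_min) / win_width)];  /* direct index fetch */
        double sig = w->coef[0] + E*(w->coef[1] + E*w->coef[2]);   /* background p_x(E)  */

        double sqE = sqrt(E);
        for (int k = w->pole_lo; k < w->pole_hi; k++) {            /* local poles only   */
            double complex z0 = (sqE - conj(pole[k])) / dm;
            sig += creal(residue[k] * (sqrt(PI)/dm) * faddeeva(z0)) / E;
        }
        return sig;
    }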
There are two advantages to this choice: first, there is in general only one pole lying in the inner window, which imposes very little performance overhead; second, the number of inner windows is bounded by the number of resonances of each nuclide, which is at most a few thousand in the current ENDF/B resonance data, even though the resolved resonance regions of different nuclides can span from a few hundred eV to a few MeV. For the background cross section, a second-order polynomial is used for the approximation, unless otherwise noted. To account for the 1/v behavior of cross sections at low energy, a smaller inner window size, currently 0.2 eV, is used for energies below 10 eV.

In this study of the inner and outer window sizes, two nuclides, U238 and U235, are used, due both to their importance in nuclear reactor simulation and to the large number of resonances they have, which poses a challenge to the current method. To quantify the accuracy, a set of cross sections for these two nuclides on the same energy grid as that from NJOY (at the same temperature) is prepared as the reference, by broadening all the principal poles for the desired temperatures and accumulating the 0 K contributions from all other poles. For each nuclide, the cross sections are evaluated for a fixed set of equal-lethargy energy points (Δξ = 0.0001) between 1 eV and the upper bound of the resolved resonance region of this nuclide, using the overlapping energy domains strategy with the specified inner and outer window sizes. The root mean square (RMS) of the relative cross section error with respect to the reference, the RMS of the relative error weighted by the absolute difference in cross section, and the maximum relative error are used together as the error quantification. Assuming the total number of cross section points evaluated is N, they can be expressed as

\[
\sqrt{\frac{\sum_{i=1}^{N}(\mathrm{RelErr}_i)^2}{N}}, \qquad
\sqrt{\frac{\sum_{i=1}^{N}\mathrm{AbsErr}_i\,(\mathrm{RelErr}_i)^2}{N}}, \qquad
\max_{i=1,\dots,N}\mathrm{RelErr}_i,
\]

respectively. The second metric is introduced mainly to put less emphasis on cross sections that are very small, since a low level of relative error is not usually necessary for these cross sections in real applications.

3.3.1 Outer window size

From the analysis of 3.2.1, the Doppler effect of a resonance (with pole p at energy E_R) on an energy point E_0 depends mainly on the magnitude of the corresponding input argument to the Faddeeva function, which in this case is

\[
\frac{\sqrt{E_0}-p^{*}}{\Delta_m} \approx \frac{\sqrt{E_0}-\sqrt{E_R}}{\sqrt{kT/\mathrm{awri}}} \;\propto\; \frac{\Delta\sqrt{E}}{\sqrt{T}} \;\propto\; \frac{\Delta E}{\sqrt{ET}}. \qquad (3.21)
\]

This suggests that, everything else being the same, for a given nuclide the outer window size (in the energy domain) should increase proportionally to \sqrt{ET} to ensure that all poles with a significant Doppler effect are taken into account. However, as pointed out before, as the outer window size increases, the contribution from the poles outside the window becomes very small, and the varying contribution due to Doppler broadening may become negligible. Therefore, the exact functional dependence on energy or temperature may take a different form. In the remaining part of this section, the energy effect alone will first be studied, followed by the temperature effect.

Energy Dependence

Since the cross section at an energy point is mostly affected by nearby resonances, the average spacing of the resonances also seems to be a good choice for the characteristic length of the outer window.
To get an idea of the minimum size of the outer window, a constant outer window size of twice the average resonance spacing is used to evaluate the total cross section of U238 at 3000 K over the whole resolved resonance region. Fig. 3-6 shows the relative error of the total cross section calculated with the overlapping energy domains strategy compared with the corresponding reference cross section. It is clear that this outer window size is large enough to achieve very good accuracy for energies up to 1 keV, but the error increases significantly at higher energies, which indicates that a larger outer window size is needed in that energy range.

Figure 3-6: Relative error of U238 total cross section at 3000 K calculated with a constant outer window size of twice the average resonance spacing.

To confirm that the large error at high energy indeed comes from the Doppler effect of poles that should have been included in the outer window, the relative error of the background cross section approximation is also analyzed. The approximation is very accurate for energies up to 10 keV, and then there is a jump in the relative error, as shown in Fig. 3-7. However, the maximum relative error of the background cross section here is only around 1%, which is much lower than that of the whole cross section. Moreover, a closer comparison of the absolute differences in the background cross section and in the whole cross section confirms that the major reason for the large relative error in the whole cross section is the Doppler effect of poles outside of the outer window. It is speculated that the large error in the background cross section comes from the fact that the contributions from some of the poles outside of the outer window are not very smooth, since they may still be quite close to the inner window region due to the small outer window size. In fact, increasing the order of the polynomial does not help reduce the error much in this case.

Figure 3-7: Relative error of the background cross section approximation for the U238 total cross section at 3000 K with a constant outer window size of twice the average resonance spacing.

To account for the larger outer window size needed at higher energy, and at the same time to control the speed at which the outer window size increases with energy, a logarithmic function of energy is chosen as the scaling factor for the outer window size. In addition, an extra exponential factor is found to help fine-tune the outer window size to achieve good overall accuracy. As a result, the final form of the scaling factor, D_out, for determining the outer window size is

\[
D_{out} = \max\!\left(a \cdot b^{E/E_{ub}} \cdot \log_{10}(E),\; 1\right), \qquad (3.22)
\]

where E_ub is the upper bound of the resolved resonance range, and a and b are two parameters to be determined for each nuclide. Specifically, a is directly related to the lower bound of the outer window size at low energy, while b mainly affects the speed at which the outer window size increases with energy.
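A direct transcription of Eq. 3.22 is shown below; the function name is illustrative, and the resulting factor is presumably applied to the characteristic length of the outer window (the average resonance spacing).

    /* Scaling factor for the outer window size, Eq. 3.22.  a and b are the
     * per-nuclide parameters and E_ub is the upper bound of the resolved
     * resonance range. */
    #include <math.h>

    double outer_window_scale(double E, double E_ub, double a, double b)
    {
        double d = a * pow(b, E / E_ub) * log10(E);
        return d > 1.0 ? d : 1.0;
    }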
The major advantage of the scaling factor in Eq. 3.22 over the \sqrt{E} dependence mentioned above is that with the former the outer window size at high energy can be bounded to only a few times that at low energy, whereas the latter can result in a difference of a few hundred times. The procedure for setting these two parameters is as follows: for a specific nuclide, a is first determined so that the cross sections in the low energy range meet the desired accuracy criterion, and b is then determined by examining the cross section behavior at high energy. For U238, the study shows that a = 2 and b = 1.5 are appropriate to achieve a maximum relative error of around 0.1% for most cross sections at 3000 K. The relative errors for the total cross section are shown in Fig. 3-8. For comparison purposes, the relative error of the background cross section approximation is also presented in Fig. 3-9. In this case, a second-order polynomial is used for the background cross section approximation, and the results demonstrate clearly that an appropriate outer window size helps smooth the background cross section inside the inner window. Similar results are presented for U235 in Fig. 3-10, where a = 3 and b = 1.5.

Temperature Dependence

To study the temperature effect, the two parameters in Eq. 3.22 can be tuned for different temperatures to achieve the same accuracy level. Table 3.1 shows the parameters for different temperatures for U238 and U235, chosen to achieve the same accuracy level as in 3.3.1. Besides the error quantification, the average number of poles to be broadened over all the energy points is also presented in the table, to reflect the performance at each temperature. As shown in the table, both a and b generally increase with temperature, as does the average number of poles. However, it is hard to extract a functional form that is appropriate for both nuclides.

Figure 3-8: Relative error of U238 total cross section at 3000 K calculated with the outer window size from Eq. 3.22.

Since a higher temperature usually means a larger outer window size throughout the energy regions considered, when preprocessing the background cross section the outer window size should be chosen such that the accuracy criterion is satisfied for the highest temperature expected in the desired application. This may result in evaluating more poles than necessary at lower temperatures, thus degrading performance. As an alternative, one may choose to preprocess different sets of background cross sections for different temperature ranges.

Figure 3-9: Relative error of the background cross section approximation for the U238 total cross section at 3000 K with the outer window size from Eq. 3.22.

3.3.2 Inner window size

As noted before, the major impact of the inner window size is on the storage of the supplemental information, which includes the polynomial coefficients of the background cross section and the indices of the poles to be broadened on the fly. In general, the storage size is inversely proportional to the inner window size, since the larger the inner window, the fewer inner windows there are.
However, the storage size is also related to the number of polynomial terms needed to achieve the desired level of accuracy for the background cross section. Table 3.2 shows the impact of varying the inner window size on the number of terms necessary to achieve the same level of accuracy for the total cross section as the 3000 K data in Table 3.1, as well as the corresponding storage for the background information. For the storage estimate, each coefficient is assumed to take 8 bytes, while each index takes 4 bytes.

On a separate note, as suggested in 3.3.1, the number of polynomial terms for the background cross section also depends on the outer window size, since the latter may affect the smoothness of the background cross section inside the inner window. For example, with the same set of outer window parameters as in Table 3.1b and the same inner window size for U235, reducing the number of terms for the background cross section has a different impact on accuracy at different temperatures. At 3000 K, if the number of terms is reduced to two, the relative error of the background cross section is still around 0.1%-0.2%, but at 300 K the relative error can reach as high as 1%, which causes the maximum relative error of the whole cross section to also be around 1%. This difference mainly comes from the fact that the outer windows for 300 K are smaller than those for 3000 K.

Figure 3-10: Relative error of U235 total cross section at 3000 K calculated with the outer window size from Eq. 3.22.

As to the effect of increasing the inner window size on performance, given that the minimum number of poles to be broadened at low energy for both U235 and U238 is around 5-6, it is anticipated that the maximum performance degradation from doubling the inner window size will be below 20%, since on average only one more pole needs to be broadened for each cross section evaluation. The detailed performance analysis will be discussed in Chapter 4.

Table 3.1: Information related to the outer window size for the U238 and U235 total cross sections at different temperatures.

(a) U238
T (K)   a     b     Average number of poles   Maximum number of poles
300     1.2   1     6.85                      17
1000    1.5   1.2   8.56                      23
2000    1.8   1.4   10.08                     28
3000    2     1.5   11.18                     32

(b) U235
T (K)   a     b     Average number of poles   Maximum number of poles
300     1.5   1.2   7.70                      19
1000    2.2   1.5   11.07                     31
2000    2.6   1.5   12.94                     37
3000    3     1.5   14.73                     42

For both nuclides and all four temperatures, the resulting RMS and weighted RMS of the relative error are a few times 10^-4 or below, and the maximum relative error is between roughly 1×10^-3 and 3×10^-3.

In summary, for the temperature range that is of interest to reactor simulations (300-3000 K), to achieve an overall accuracy level of 0.1% for the cross sections, around 5 to 42 poles need to be broadened on the fly for both U238 and U235, depending on the energy at which the cross section is evaluated. The storage size for the related data is around 300-500 KB for both nuclides, which includes the poles, the associated residues and angular momentum numbers, and the background information for each of the three reaction types (total, capture and fission). By contrast, for U238 and U235, the point-wise cross section data can take on the order of 10 MB for just a single temperature.
For the performance analysis in Chapters 4 and 5, the outer window size corresponding to the 3000 K entries of Table 3.1 and an inner window size of one average resonance spacing will be used for both U238 and U235, unless otherwise noted.

Table 3.2: Number of terms needed for the background cross section and the corresponding storage size for various inner window sizes (in multiples of the average resonance spacing).

Inner window size   U238: Number of terms   U238: Storage size (KB)   U235: Number of terms   U235: Storage size (KB)
One                 3                       105.8                     3                       100.8
Two                 4                       67.0                      3                       51.1
Three               4                       45.3                      3                       34.5
Four                5                       41.3                      3                       26.3

3.4 Implementation of the Faddeeva function

To evaluate a cross section on the fly, the poles within both the outer and inner windows have to be broadened, each requiring a Faddeeva function evaluation. It is anticipated that the Faddeeva function evaluation will be one performance bottleneck of the approximate multipole method, and a preliminary performance study confirms this. Therefore, the efficiency of evaluating the Faddeeva function is a very important part of the current method and needs to be studied in detail.

First, the Faddeeva function has a set of symmetry properties that can be utilized to simplify the implementation [27]:

\[
W(z^{*}) = W^{*}(-z), \qquad W(-z) = 2e^{-z^{2}} - W(z).
\]

As a result, when implementing the Faddeeva function, only the part in the first quadrant of the complex domain is essential, while the other quadrants can be derived directly from the first.

In the WHOPPER code, a subroutine for the Faddeeva function (W) was implemented which uses an iterative series expansion with a stopping criterion of 1.0×10^-6 for |z| < 6, and otherwise uses asymptotic expressions derived from various low-order Gauss-Hermite quadratures, again depending on the magnitude of the input argument. Another algorithm that has been demonstrated to be very efficient is QUICKW, used in the MC²-2 code [26]. For small |z|, it uses a six-point bivariate interpolation of a pre-calculated table of Faddeeva function values on a rectangular grid in the complex domain. For the grid shown in Fig. 3-11, the interpolation scheme can be expressed as

\[
\begin{aligned}
f(x_0 + ph,\, y_0 + qh) ={}& \frac{q(q-1)}{2}\, f(x_0, y_0-h) + \frac{p(p-1)}{2}\, f(x_0-h, y_0) \\
&+ (1 + pq - p^2 - q^2)\, f(x_0, y_0) + \frac{p(p-2q+1)}{2}\, f(x_0+h, y_0) \\
&+ \frac{q(q-2p+1)}{2}\, f(x_0, y_0+h) + pq\, f(x_0+h, y_0+h) + O(h^3),
\end{aligned} \qquad (3.23)
\]

and a small code rendering of this formula is given at the end of this section.

Figure 3-11: Grid for the six-point bivariate interpolation scheme, with grid spacing h, base point (x0, y0) and interpolation point (x0 + ph, y0 + qh).

For larger |z|, QUICKW uses asymptotic expressions similar to those in the WHOPPER implementation.

In the current work, both implementations have been examined for use when broadening the poles on the fly, and some modifications have been made to the original implementations. Specifically, for the WHOPPER version of W, the region of series expansion is changed to |z| < 4 and the stopping criterion to 1×10^-4 for better performance; for the QUICKW version, the upper bounds of the real and imaginary parts of z are both set to 4, the grid spacing is reduced to 0.05 for better accuracy, and the symmetry properties of the Faddeeva function were added. Fig. 3-12 and Fig. 3-13 show the relative errors of the modified versions of both implementations against the scipy.special.wofz function [28], with values below 1×10^-4 filtered out. As shown in the figures, both implementations are in general accurate to within 0.1%, which is good enough for the desired accuracy level of the cross sections.
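As a concrete rendering of Eq. 3.23, the sketch below evaluates the six-point formula for one tabulated quantity; in QUICKW it would be applied to the real and imaginary parts of the tabulated Faddeeva values, and the function name is illustrative.

    /* Six-point bivariate interpolation on a square grid of spacing h (Eq. 3.23).
     * p and q are the fractional offsets from the base point (x0, y0), and
     * fIJ denotes the tabulated value at (x0 + I*h, y0 + J*h). */
    double interp6(double f00, double f0m1, double fm10,
                   double f10, double f01, double f11,
                   double p, double q)
    {
        return 0.5*q*(q - 1.0)*f0m1
             + 0.5*p*(p - 1.0)*fm10
             + (1.0 + p*q - p*p - q*q)*f00
             + 0.5*p*(p - 2.0*q + 1.0)*f10
             + 0.5*q*(q - 2.0*p + 1.0)*f01
             + p*q*f11;
    }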
A performance study shows that the modified W version is about 4.5 times slower than the modified QUICKW version for one billion evaluations with random input arguments satisfying Re(z) < 4 and Im(z) < 4 (points inside this region are expected to take longer to evaluate). The test was run on a 2 GHz Intel CPU, and the total runtimes of the two implementations are 20.83 s and 4.55 s, respectively.

Figure 3-12: Relative error of modified W against scipy.special.wofz. (a) Real part. (b) Imaginary part.

Figure 3-13: Relative error of modified QUICKW against scipy.special.wofz. (a) Real part. (b) Imaginary part.

Chapter 4  Implementation and Performance Analysis on CPU

This chapter describes the details of the implementation of the approximate multipole method on CPU. Its performance is then analyzed on a few test cases and compared to that of some reference methods. The scalability of the method is also studied and presented.

4.1 Implementation

To use the approximate multipole method for cross section evaluation, a proper library that provides the necessary data for each nuclide needs to be generated first. The main data needed are the poles and the associated residues and angular momentum numbers, the background information along with the indices of the poles to be broadened, as well as some nuclide properties such as the channel radius and atomic weight. The poles and residues are obtained by running the WHOPPER code for each nuclide. After that, Python scripts are used to process the poles, generate the background cross section information and lump the data together into a binary file. This binary file serves as the library for the approximate multipole method.

The main code for the approximate multipole method is written in C. Therefore, once the input file is read, the data has to be stored in C-compatible data structures. In general, the data is stored nuclide by nuclide, and the code snippet below shows the major data structures used.

Code 4.1: Data structures for the approximate multipole method

    // data related to each pole
    typedef struct {
        double  pole[2];
        double  resi[6];   // (total, fission, capture)
        int32_t l;         // angular momentum number
    } pole_residue;

    // background information for each interval
    typedef struct {
        // the following struct stores the coefficients for the background cross
        // section, as well as the indices of the poles to be broadened
        // NUM_COEF:   number of coefficients for the background cross section
        // entries[3]: total, fission, capture
        struct {
            int32_t ind[2];
            double  coefs[NUM_COEF];
        } entries[3];
    } bkgrd_entry;

    // all information for a nuclide
    typedef struct {
        isoprop       props;    // nuclide properties
        int32_t       Nbkg;     // number of bkgrd_entry
        int32_t       Nprs;     // number of poles (and residues)
        bkgrd_entry  *bkgrds;   // array of bkgrd_entry
        pole_residue *prs;      // array of poles (and residues)
    } nucdata;

During the course of the thesis work, a different data structure for the background information was used first. In that scheme, the background information of all nuclides for the same energy is lumped together. The advantage of this scheme lies in the fact that, during a Monte Carlo simulation, the cross sections of all (or some of) the nuclides are usually needed at the same energy.
Therefore, by arranging the background information according to energy, the data to be fetched for a given energy are close to each other, which improves cache efficiency. This was confirmed by performance results; in fact, that layout is around 15% faster than the alternative strategy. However, it has a few significant drawbacks. First, the inner window size of all nuclides must be the same in order to be aligned, which leads either to a very large storage size, if a small inner window size is chosen, or to significant performance degradation for some nuclides, if a large inner window size is chosen. Second, the resolved resonance region varies a lot among nuclides, reaching as high as a few MeV for some. This results in either keeping a lot of unnecessary information for nuclides with small resolved resonance regions, or storing additional positional information and incurring search overhead. In addition, this scheme is not very flexible when preparing the cross section library. As a result, it was replaced by the current data structure.

With the per-nuclide data structure shown above, the algorithm to evaluate the cross sections of all nuclides at a given energy, a major component of a Monte Carlo reactor simulation, is presented in Algorithm 2.

Algorithm 2: Cross section evaluation with the approximate multipole method
    for each nuclide do
        get the index into bkgrds
        calculate the background cross section
        for each pole to be broadened do
            calculate the phase shift (φ_l)
            evaluate the Faddeeva function
            accumulate the contribution to the whole cross section
        end for
        whole cross section = background + contribution from poles
    end for

4.2 Test setup

To test the performance of the approximate multipole method against the commonly used table lookup method, as well as Cullen's method, two test cases are set up. Three nuclides, U238, U235 and Gd155, are chosen and replicated to represent the 300 nuclides usually used in nuclear reactor simulations. The reasons for choosing these three nuclides are: 1) they are important nuclides in nuclear reactors; 2) the diversity of their resolved resonance regions is representative of most nuclides under consideration, as shown in Table 4.1.

Table 4.1: Resonance information for U235, U238 and Gd155.

Nuclide   Upper bound of resolved resonance region   Number of resonances   Average spacing of resonances   Number of energy grid points at 300 K (NJOY)
U235      2.25 keV                                   3193                   ~0.7 eV                         76075
U238      20.0 keV                                   3343                   ~6 eV                           589126
Gd155     183 eV                                     92                     ~0.5 eV                         12704

4.2.1 Table lookup

In this method, a few sets of cross section libraries are first generated with NJOY at specified reference temperatures. For each nuclide at a single temperature point, since the energy grids of the different reaction types are the same with the default NJOY settings, there is one 1-D array of energies and one 2-D array of cross sections, where the first dimension is energy and the second is the reaction type (three in total). The evaluation of a cross section is a two-step linear interpolation in both energy and temperature, shown below for a given energy E at a given temperature T:

\[
\sigma_{T_1} = \sigma_{E_1^{(1)},T_1} + \frac{\sigma_{E_2^{(1)},T_1} - \sigma_{E_1^{(1)},T_1}}{E_2^{(1)} - E_1^{(1)}}\,\bigl(E - E_1^{(1)}\bigr),
\]
\[
\sigma_{T_2} = \sigma_{E_1^{(2)},T_2} + \frac{\sigma_{E_2^{(2)},T_2} - \sigma_{E_1^{(2)},T_2}}{E_2^{(2)} - E_1^{(2)}}\,\bigl(E - E_1^{(2)}\bigr),
\]
\[
\sigma = \sigma_{T_1} + \frac{\sigma_{T_2} - \sigma_{T_1}}{T_2 - T_1}\,(T - T_1),
\]

where T_1 and T_2 are the temperature points bracketing T, and E_{1,2}^{(i)} are the energy points bracketing E for temperature T_i.
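A minimal C sketch of this two-step interpolation is given below; the function and array names are illustrative, and the inner routine performs the per-temperature energy interpolation using the binary search discussed next.

    /* Two-step (energy, then temperature) linear interpolation used by the
     * table lookup method; names are illustrative. */
    static double interp_in_energy(const double *E, const double *sig, int n, double x)
    {
        int lo = 0, hi = n - 1;
        while (hi - lo > 1) {                 /* binary search for the bracketing grid points */
            int mid = (lo + hi) / 2;
            if (E[mid] <= x) lo = mid; else hi = mid;
        }
        double f = (x - E[lo]) / (E[hi] - E[lo]);
        return sig[lo] + f * (sig[hi] - sig[lo]);
    }

    double xs_table_lookup(double E, double T,
                           const double *E1, const double *sig1, int n1, double T1,
                           const double *E2, const double *sig2, int n2, double T2)
    {
        double sT1 = interp_in_energy(E1, sig1, n1, E);   /* sigma at (E, T1) */
        double sT2 = interp_in_energy(E2, sig2, n2, E);   /* sigma at (E, T2) */
        return sT1 + (sT2 - sT1) * (T - T1) / (T2 - T1);  /* interpolate in temperature */
    }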
At each temperature, one binary search is needed to find the energy interval containing the given energy; as a result, there are in general two binary searches per cross section evaluation (note that since the three reaction types share the same energy grid, a single binary search per temperature is enough to get all of them). If a unionized energy grid is used, a direct index fetch can be used instead, which is more efficient than a binary search, but this increases the size of the data set tremendously and is therefore not considered in this study.

In our case, the total amount of data for the three nuclides at 300 K is about 20.7 MB, and replicating them 100 times results in around 2 GB of memory. In order to include the temperature effect, this set of 300 K cross sections is again replicated to mimic the cross sections at different temperatures. As a result, the total memory required by this method is 2 GB times the number of temperature points. In the following studies, only two temperature points, 300 K and 3000 K, are used, and thus the total memory requirement is around 4 GB.

4.2.2 Cullen's method

For Cullen's method, a data structure similar to that of the table lookup method is used for the 0 K cross sections, except that the cross sections of each nuclide are now arranged in three 1-D arrays instead of one 2-D array. Cullen's method is then used directly to broaden the cross sections from 0 K to the specified temperature at the selected energy. An in-house code that implements Cullen's method was developed and is used for the performance tests. The total amount of data in this case is 5.7 GB for 300 nuclides. This is larger than that at 300 K because the 0 K cross sections have a denser energy grid.

4.2.3 Approximate multipole method

The implementation detailed in 4.1 is used for the approximate multipole method. The number of poles varies among nuclides, and each pole has three pairs of residues corresponding to the three reaction types, plus one angular momentum number. The total size of these data is around 43 MB for 300 nuclides. As to the background information, currently three coefficients are used for the background cross section approximation of each cross section type, and a separate pair of indices is used for each cross section type. The size of the inner window is set to the average spacing of resonances for each nuclide. As a result, the total size of the background information is around 62 MB. During the evaluation of cross sections, one Faddeeva function evaluation is needed to broaden each pole, and both implementations of the Faddeeva function discussed in Chapter 3 are used in the tests. For the modified QUICKW version, an 82 × 82 table of complex numbers is needed, which amounts to about 105 KB of data. Therefore, the total memory needed for the approximate multipole method is approximately 105 MB for 300 nuclides, much less than for the other two methods.

4.3 Test one

Test one is mainly set up to examine the speed of cross section evaluation for the approximate multipole method, without dealing with the other parts of a Monte Carlo simulation. In order both to cover the resolved resonance regions of most nuclides and to simulate the neutron slowing down behavior in a realistic nuclear reactor setting, 300 equal-lethargy points are chosen for energies between 19 keV and 1 eV.
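The equal-lethargy grid can be generated as sketched below; this is an assumed construction consistent with the stated endpoints, and the function name is illustrative.

    /* Equal-lethargy energy points between Emax = 19 keV and Emin = 1 eV. */
    #include <math.h>

    void lethargy_grid(double *E, int n, double Emax, double Emin)
    {
        double du = log(Emax / Emin) / (n - 1);   /* constant lethargy step */
        for (int i = 0; i < n; i++)
            E[i] = Emax * exp(-i * du);
    }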
For each energy, the total cross sections of all nuclides are evaluated at a random temperature between 300 K and 3000 K, and the nuclide that the neutron will "interact with" is chosen according to the total cross sections (although this information is not used further in the current test). The method for evaluating the total cross section is selected from the three methods described above, i.e., table lookup, Cullen's method and the approximate multipole method, in order to compare their performance. The average run time to evaluate the cross sections of 300 nuclides over the 300 energy points is recorded; this value is then averaged over 100 different runs and used as the performance metric. One thing to note is that, for all methods, a total cross section is only evaluated when the incident neutron energy is within the nuclide's resolved resonance region; otherwise a constant value (three barns) is returned.

The main hardware system used for the tests in this chapter is a server cluster consisting of nodes with two six-core Intel Xeon E5-2620 CPUs with a clock speed of 2 GHz and 24 GB of RAM. The L1, L2, and L3 cache sizes are 15 KB, 256 KB and 15 MB, respectively.

4.3.1 Serial performance

The run times of the serial versions of the different methods for test one are shown in Table 4.2. Also presented in the table is some other performance-related information, such as the instructions per cycle and the L1 data cache miss rate, obtained by running "perf", a commonly used performance counter profiling tool on Linux.

Table 4.2: Performance results of the serial versions of the different methods for test one.

                          Table lookup   Cullen's method   Multipole (modified QUICKW)   Multipole (modified W)
Run time (µs)             156            4.55 × 10⁴        231                           717
Instructions per cycle    0.60           1.04              1.02                          0.86
L1 cache miss rate        59.5%          0.17%             3.5%                          1.6%

From the table, it is obvious that Cullen's method is much slower than the others, by nearly two orders of magnitude; therefore, it will not be considered in the following performance analysis. The table lookup method is the fastest, but it suffers from a high cache miss rate, which can become a bottleneck on nodes with large numbers of cores. The approximate multipole method is in general only a few times slower than table lookup, and with the modified QUICKW version of the Faddeeva function it is only about 50% slower.

As a matter of fact, test one was also run on a different computer node, which has two quad-core Intel Xeon E5620 CPUs with a clock speed of 2.4 GHz and 16 GB of RAM, the L1, L2 and L3 cache sizes being 12 KB, 256 KB and 12 MB, respectively. On that node, the approximate multipole method with modified QUICKW is even faster than the table lookup method by a few percent. Clearly, the faster CPU speed and the smaller caches favor the approximate multipole method. In addition, memory access is not much of a problem for the approximate multipole method, which makes it potentially more desirable for large-scale deployment. Since the modified QUICKW version of the Faddeeva function is much faster than the modified W version, it is chosen as the default Faddeeva function implementation henceforth.

To get a better understanding of the performance of the approximate multipole method, Table 4.3 shows a rough runtime breakdown of the major components in Algorithm 2, with modified QUICKW as the Faddeeva function. These data are also based on the output of "perf". As can be seen from the table, the Faddeeva function is a major hotspot of the code, taking more than a third of the total runtime.
In addition, the memory access time for the poles and background information also takes a large portion of the total runtime, as reflected by the time for "first access cache misses", that is, the cache misses caused by accessing the first piece of data associated with each pole or each background entry. This should account for most of the cache misses related to poles and background information, since the size of the data for each pole and each background entry is either about the same as or smaller than the L1 cache line size (unless there is misalignment, in which case the resulting extra cache misses would fall in the category of "others"). Finally, for the "others" category, the major contributions come from the computations other than W(z_0) in Eq. 3.20.

Table 4.3: Runtime breakdown of the approximate multipole method with modified QUICKW.

Component                                                       Fraction of runtime
Faddeeva function                                               35.3%
First access cache miss for poles and background information    30.1%
Others                                                          34.6%

4.3.2 Revisiting the inner window size

As discussed in 3.3, the inner window size affects not only the storage size of the background information, but also the overall performance of the approximate multipole method. Here a parametric study of the inner window size is performed to examine its effect on the performance as well as on the overall storage size of the background information, with the results shown in Table 4.4. Note that, for the storage size of the background information, the numbers of coefficients needed for the different inner window sizes listed in Table 3.2 are used.

Table 4.4: Performance and storage size of background information for varying inner window sizes (in multiples of the average resonance spacing).

Inner window size   Runtime (µs)   Storage size of background information (MB)
One                 231            61.7
Two                 236            35.4
Three               242            24.1
Four                251            20.4

From the results in the table, it is clear that increasing the inner window size does not have much impact on the overall performance, while it can significantly reduce the storage size of the background information. Nevertheless, to keep things simple, for all the remaining tests in this chapter and the next, the inner window size is kept at one average resonance spacing for each nuclide.

4.3.3 Parallel performance

To test the parallel performance of each method, OpenMP [29] is used to parallelize the work of evaluating the total cross sections of all 300 nuclides at each energy point. Essentially, each thread is responsible for a certain number of nuclides, and since all nuclides are replicated from the same three nuclides, no load balancing issue is expected with this parallel scheme (at least for a small number of threads).

Fig. 4-1 shows the results of the strong scaling study of both the table lookup method and the approximate multipole method with modified QUICKW, where the number of threads increases while the workload remains the same. At first glance, it seems that neither of these two methods has good scalability. However, a closer look at the figure shows that both of them achieve an almost linear speedup when the thread count is very small. Therefore, it is speculated that the amount of work is not large enough to hide the parallelization overhead of OpenMP, not that the methods themselves are not inherently scalable.

Figure 4-1: Strong scaling study of the table lookup and approximate multipole methods for 300 nuclides. The straight line represents perfect scalability.

In order to confirm this, two different test cases are run; the per-nuclide OpenMP loop that they exercise is sketched below.
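This is a minimal sketch of the per-nuclide OpenMP scheme, not the actual test harness; xs_multipole() stands in for the real evaluation routine of Algorithm 2, and nucdata is the structure of Code 4.1.

    #include <omp.h>

    extern double xs_multipole(const nucdata *nuc, double E, double T);

    /* Evaluate the total cross sections of all nuclides at one energy point,
     * splitting the nuclide loop among OpenMP threads. */
    void eval_all_nuclides(const nucdata *nuclides, int n_nuc,
                           double E, double T, double *sigma_t)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n_nuc; i++)
            sigma_t[i] = xs_multipole(&nuclides[i], E, T);
    }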
The first one is again a strong scaling study, but this time the number of nuclides is increased to 3000 and the cross section data are replicated correspondingly, simply to increase the amount of work to be parallelized. Note that, since the memory needed by the table lookup method exceeds 24 GB for 3000 nuclides, a different computer node was used. This node has 128 GB of RAM but is otherwise identical to the other nodes used. This time, the overhead of parallelization appears to be negligible and both methods show good scalability (see Fig. 4-2).

Figure 4-2: Strong scaling study of the table lookup and approximate multipole methods for 3000 nuclides. The straight line represents perfect scalability.

The second test is a weak scaling study, where the number of nuclides, i.e., the number of cross sections to evaluate at each energy point, is kept constant (300) per thread; as the number of threads increases, the total number of nuclides, and thus the workload, increases proportionally. For this study, the percentage increase in runtime as a function of the number of threads is chosen as the performance metric, and the results for both methods are shown in Fig. 4-3. The average runtime increases as the number of threads goes up, as expected, but the percentage increase in runtime for the approximate multipole method is small compared to that of the table lookup method, indicating that the scalability of the approximate multipole method is better.

Figure 4-3: Weak scaling study of the table lookup and approximate multipole methods with 300 nuclides per thread.

Combining the results of the test cases above, it is clear that with enough workload both methods have very good scalability, with the approximate multipole method the better of the two, since it shows better behavior in the weak scaling study. However, because the total number of nuclides in a real reactor simulation is limited to about 300, the overhead of parallelizing the work of evaluating all the cross sections at a single energy point may overwhelm the benefit. As a result, it may be better to do this portion of the work serially and to parallelize the outer loop, which is discussed in the next section.

4.4 Test two

Most Monte Carlo codes are parallelized over different particle histories, i.e., the outer loop. In addition, as demonstrated in the previous section, the amount of work involved in evaluating the cross sections of 300 nuclides at each energy is not large enough to fully benefit from parallelization. Therefore, in this section, the scalability of both methods is studied with the outer loop parallelized. To this end, a simple Monte Carlo code is written to simulate a mono-energetic neutron source with energy 19.9 keV slowing down in a homogeneous material. The material consists of the 300 nuclides generated as before, as well as a pure scatterer with a 20000 barn scattering cross section. This scatterer is chosen such that the system is essentially the same as a homogeneous mixture of four nuclides, U235, U238, Gd155 and H1, with a number density ratio of 1:1:1:10, since the scattering cross section of H1 is about 20 barns over most of the energy range and the other three nuclides are each replicated 100 times.
The cut-off energy is set to 1 eV and only fission events are tallied to obtain k_eff. Again, if the neutron energy is outside the resolved resonance region of a certain nuclide, a set of constant cross sections is provided, with the total cross section being three barns and each of the other three major cross section types being one barn. The material temperature is set to 300 K. For the approximate multipole method, the inner and outer window sizes appropriate for 3000 K are used, to reflect the fact that the settings for the highest temperature must always be used (unless there are multiple sets of them). For the table lookup method, it is ensured that a cross section evaluation always involves interpolation between two temperatures, yet the resulting cross section is the same as that corresponding to 300 K. Algorithm 3 shows the logic of the slowing down code used for the test.

Algorithm 3: Neutron slowing down code on CPU
    for each neutron to simulate do
        initialize a neutron
        while neutron energy is above the cut-off energy do
            evaluate the total cross sections of all nuclides at the incident neutron energy
            based on the total cross sections, determine the nuclide to react with, and the reaction type
            if captured then
                break
            else if fission then
                increment the fission events; break
            else
                change the neutron energy according to the elastic scattering formula
            end if
        end while
    end for

First, to check the accuracy of the approximate multipole method, the slowing down code is run for one billion neutron histories with both the table lookup and approximate multipole methods. The value and standard deviation of k_eff for both methods are tabulated in Table 4.5; also listed is the average time to run a whole neutron history, for performance comparison. Note that the large size of the problem when running one billion neutron histories requires a mixture of MPI [30] and OpenMP to distribute the work across different computer nodes, with each node using multiple (12) threads. The runtime data in this table, however, are obtained by running one million neutron histories with a serial version of the code. The k_eff results show very good agreement between the two methods. In addition, the approximate multipole method is about 50% slower than table lookup, which is consistent with the performance results from test one.

Table 4.5: k_eff (with standard deviation) and the average runtime per neutron history for the table lookup and approximate multipole methods.

Method                  k_eff (with standard deviation)   Average runtime per neutron history (ms)
Table lookup            0.764287 ± 0.000044               1.062
Approximate multipole   0.764289 ± 0.000044               1.598

For the comparison of parallel performance, OpenMP is once again used, this time to parallelize the outer loop over neutron histories. For each case, one million neutron histories are run with both methods. Fig. 4-4 shows the scalability of both methods. The approximate multipole method exhibits a nearly perfect linear speedup and is clearly better than table lookup.

Figure 4-4: OpenMP scalability of the table lookup and multipole methods for neutron slowing down. The straight line represents perfect scalability.

4.5 Summary

As shown by the results of both tests, the approximate multipole method has little computational overhead compared with the standard table lookup method: in general it is less than 50% slower, and on some hardware systems much less.
The major reason for this low computational overhead is that much less memory is required and many fewer cache misses are incurred. The storage requirement of the approximate multipole method for the cross sections of the resolved resonance range is only around 100 MB for 300 nuclides, which is much less than that of the table lookup and Cullen's methods, and is also one to two orders of magnitude less than that of the regression model and the explicit temperature treatment method discussed in Chapter 2. In addition, increasing the inner window size in the approximate multipole method can further reduce the storage without much loss of efficiency.

From the scalability tests, it was found that the amount of work involved in evaluating the cross sections of 300 nuclides at one energy is not large enough to fully benefit from parallelization, for either the table lookup or the approximate multipole method. However, when parallelized over different neutron histories, as most Monte Carlo codes are, the approximate multipole method shows very good scalability, better than that of table lookup, making it more desirable for massively parallel deployment.

Chapter 5  Implementation and Performance Analysis on GPU

This chapter describes the implementation and performance of the Monte Carlo slowing down problem with the approximate multipole method on GPUs. Section 5.1 presents a simple version of the code. Section 5.2 discusses the performance bottlenecks and the subsequent optimization efforts. Section 5.3 concludes the chapter with the final performance results and a discussion.

5.1 Test setup and initial implementation

Since the memory requirement of the table lookup method obviously exceeds what a GPU can provide when there are multiple temperatures, only the approximate multipole method is implemented on the GPU. In addition, the modified QUICKW is chosen as the Faddeeva function implementation on the GPU since it is the fastest. As shown in the previous chapter, the work of evaluating 300 nuclides is not large enough to fully benefit from CPU parallelization; therefore, it was speculated that it may not be entirely beneficial to offload only this part of the work to the GPU either. As a matter of fact, a simple implementation during the early stage of the work demonstrated that such a GPU version achieves only about a three times speedup over the serial CPU version for the pure computation work. Moreover, there is additional time associated with transferring the evaluated cross sections back to the CPU and with the kernel launch overhead, which can be as high as 20% of the pure computation time. Therefore, there is not much advantage in offloading only the cross section evaluation to the GPU rather than doing it on the CPU, and this approach was abandoned. As a result, only the approach similar to test two of the previous chapter is taken, that is, running the whole Monte Carlo simulation on the GPU instead of just offloading the evaluation of cross sections. The remaining part of this chapter focuses on this approach.

5.1.1 Implementation

To simulate the neutron slowing down process on the GPU, a CUDA version of the slowing down code is implemented, with the same setup for nuclides, cut-off energy, etc. as in the previous chapter. However, the original algorithm (Algorithm 3 of Chapter 4) is no longer suitable for the GPU, mainly due to the expected high branch divergence associated with the inner while loop on neutron termination.
The branch divergence mainly comes from the fact that some neutrons may terminate very quickly, while others may take very long. Consequently, for all the threads within a warp, the run time is determined by the longest history among them, and this holds for every warp of neutrons to be simulated. To avoid this obvious performance drawback, a new algorithm is used which allocates a roughly equal number of neutrons to each GPU thread, shown in Algorithm 4. In addition, due to the lock-step execution on the GPU, all threads in a warp always evaluate the cross sections of the same nuclide at the same time. Since the cross section evaluation is the hotspot of the code, this feature helps make the behavior of the threads in a warp more uniform.

Algorithm 4: Neutron slowing down code on GPU
    set n_run = 0; initialize a neutron
    while n_run < neutron histories to run per thread do
        evaluate the total cross sections of all nuclides at the incident neutron energy
        based on the total cross sections, determine the nuclide to react with, and the reaction type
        if captured then
            increment n_run; initialize a new neutron
        else if fission then
            increment the fission events; increment n_run; initialize a new neutron
        else
            change the neutron energy according to the elastic scattering formula
            if neutron energy is below the cut-off energy then
                increment n_run; initialize a new neutron
            end if
        end if
    end while

The same data structures as shown in Code 4.1 of Chapter 4 are used in the initial implementation of the GPU slowing down code, since CUDA provides the necessary support for them, although care must be taken to pass consistent device pointers when initializing the data on the GPU. Since the sizes of the poles and residues, as well as of the background information, are on the order of tens of megabytes, both can only be stored in global memory. The per-nuclide data, such as the nuclide properties and the pointers to poles and residues, are placed in constant memory, because they are read-only and small enough to fit. As for the tabulated Faddeeva function values, since they are accessed in a fashion that exhibits spatial locality, texture memory should be a good fit, especially since the table cannot fit into constant memory (in fact, both texture memory and constant memory were tried for a smaller 42×42 table, which does fit into constant memory, and texture memory gave better performance).

However, CUDA currently only supports texture memory for single precision floating point numbers. To use texture memory for double precision floating point numbers, some additional measures have to be taken. Specifically, the special CUDA data type int2, which represents a structure consisting of two 4-byte integers, is chosen for the texture memory. The bits of a double precision number are cut into two halves, with the upper 32 bits saved into the first integer of the int2 and the lower 32 bits into the second. When referencing the double precision number, both integers are fetched, and their bits are combined and converted back to the original double precision value.

Last, the random number library shipped with CUDA, cuRAND [1], is used as the parallel random number generator in the kernel function. cuRAND uses the XORWOW algorithm [31], which has a period of 2^192 − 2^32, and each thread in the kernel function is initialized with a unique random number sequence.
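The fragment below illustrates the int2-based texture fetch and the per-thread cuRAND initialization described above. It is a simplified CUDA C sketch, not the thesis kernel: the texture reference name and kernel signature are chosen for illustration, while the CUDA calls themselves (tex1Dfetch, __hiloint2double, curand_init, curand_uniform_double) are standard.

    // Double-precision values stored in a texture of int2 (upper 32 bits in .x,
    // lower 32 bits in .y), reassembled on the device with __hiloint2double().
    // The texture is bound on the host (e.g. with cudaBindTexture) to the table
    // reinterpreted as int2 data.
    #include <curand_kernel.h>

    texture<int2, cudaTextureType1D, cudaReadModeElementType> w_table_tex;

    __device__ double fetch_w_table(int idx)
    {
        int2 v = tex1Dfetch(w_table_tex, idx);   // two 4-byte halves of one double
        return __hiloint2double(v.x, v.y);       // recombine into the original double
    }

    __global__ void slowdown_kernel(unsigned long long seed, int histories_per_thread)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        curandState rng;
        curand_init(seed, tid, 0, &rng);              // unique XORWOW sequence per thread

        for (int n = 0; n < histories_per_thread; n++) {
            double xi = curand_uniform_double(&rng);  // e.g. used to select the collision nuclide
            (void)xi;  // slowing down logic of Algorithm 4 goes here
        }
    }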
5.1.2 Hardware specification

During the course of the thesis work, two different types of GPUs have been used: the Quadro 4000 and the Tesla M2050. The former resides on an in-house cluster, while the latter is accessed through the Amazon Elastic Compute Cloud. Table 5.1 lists some specifications of both GPUs. Since both GPUs are of CUDA Compute Capability 2.0, one thread can use at most 63 32-bit registers, and a maximum of 1536 threads, or 48 warps, can be scheduled on one SM at the same time.

Table 5.1: GPU specifications.

                                                       Quadro 4000                  Tesla M2050
Number of SMs                                          8                            14
Number of processing cores                             256                          448
Single precision floating point performance (peak)     486.4 Gigaflops              1030 Gigaflops
Double precision floating point performance (peak)     243.2 Gigaflops              515 Gigaflops
Memory                                                 2 GB GDDR5                   3 GB GDDR5
Memory interface                                       256-bit                      384-bit
Memory bandwidth                                       89.6 GB/s                    148 GB/s
Shared memory per SM                                   16 KB / 48 KB (configurable)  16 KB / 48 KB (configurable)
Number of 32-bit registers per SM                      32768                        32768

5.1.3 Results

The initial implementation was tested only on the Quadro 4000 card. After compiling the CUDA code with the "-Xptxas=-v" flag of the nvcc compiler, the output shows that the kernel function needs more registers than the 63 available to a single GPU thread. As a result, the number of threads that can be scheduled concurrently on an SM is lower than what the hardware can support. In fact, the maximum number of threads that can be scheduled on an SM is 1536, as shown above, but due to the limit on the total number of registers available on an SM, only 32768/63 ≈ 520 threads, or 16 warps, can actually be scheduled. This low level of occupancy (33%) limits the ability of the GPU to switch among different warp contexts to hide latency, and thus may hurt the overall performance. An additional effect of the high register usage on performance is that some registers have to be stored to local memory and loaded back when needed ("register spilling"). This can also hurt performance, since local memory (which resides in global memory) is much slower than registers in terms of both latency and bandwidth.

Other factors that can also affect the occupancy level are the maximum number of concurrent blocks that can be scheduled and the maximum size of the shared memory available on an SM. The initial implementation makes little use of shared memory, so the shared memory limit is not a problem. As to the number of blocks, the only constraint it imposes is that the number of threads per block must be between 64 and 512 to achieve the maximum level of occupancy allowed by the register usage, that is, 16 concurrent warps on an SM.

The timing of the kernel is obtained using the event synchronization provided by CUDA. With 256 blocks and 64 threads per block for one million neutron histories, the average runtime for one neutron history is around 0.432 ms, which represents a ~3.7 times speedup over the serial CPU version. In addition, since the bulk of the computation is now on the GPU and there is not much interaction between the CPU and the GPU except for initialization and finalization, the communication overhead is negligible.

5.2 Optimization efforts

5.2.1 Profiling

The speedup of the initial implementation is not as good as desired, so some optimization is necessary to improve performance. The first step is to find the performance bottlenecks of the code.
Nvidia ships profiling tools that report some of the key performance metrics, among them the standalone visual profiler nvvp [32]. Table 5.2 lists some performance-related metrics from nvvp for the initial implementation.

Table 5.2: Performance statistics of the initial implementation on Quadro 4000 from the Nvidia visual profiler nvvp.

    DRAM utilization                  12.1%
    Global load efficiency             1.3%
    Global memory replay overhead     42.2%
    Global cache replay overhead      14.8%
    Local memory overhead             11.4%
    Branch divergence overhead        67.5%

Here are some brief explanations of the listed items and their implications:

• The DRAM utilization is the fraction of the global memory bandwidth used by the code. The low DRAM utilization indicates that the memory bandwidth is severely underutilized.

• The global load efficiency is the ratio between the memory load transactions requested by the code and the global memory transactions actually performed, while the global memory replay overhead is the percentage of instruction issues due to replays of non-coalesced global memory accesses. The low global load efficiency and the high global memory replay overhead together suggest that there are many non-coalesced global memory accesses.

• The global cache replay overhead is analogous to the global memory replay overhead, but it is caused by L1 cache misses.

• The local memory overhead is the percentage of memory traffic caused by local memory accesses, which mainly comes from register spilling.

• The branch divergence overhead is self-explanatory, and the result indicates that many divergent branches occur during kernel execution.

Since most cross section evaluations need to access anywhere between 5 and 42 poles and the associated residues, which is far more data than any other access, it is speculated that loading the poles and residues is the main cause of the non-coalesced global memory accesses. Although different threads within a warp always evaluate the cross sections for the same nuclide at the same time, the poles (as well as the residues and angular momentum numbers) to be loaded differ from thread to thread because of differences in neutron energy, and are therefore very likely to be scattered in memory. In addition, the varying number of poles to be broadened by different threads, again due to the energy differences, is likely one reason for the high branch divergence rate seen in the profiling results. Another source of branch divergence is the Faddeeva function evaluation: depending on where in the complex plane the input argument falls, the function is evaluated either by table lookup or by asymptotic expansion. Since the neutron energies across the threads of a warp may differ, their Faddeeva arguments are very likely to differ as well, fall into different evaluation regions, and thus cause divergent branches.

The large number of registers needed by each thread is mainly due to the complexity of the kernel function, which is somewhat inevitable. As mentioned in the previous section, this not only leads to a low occupancy level on the SMs, but also causes register spilling and consequently the local memory overhead. With some of the performance bottlenecks identified, the remainder of this section discusses the measures taken to avoid or mitigate these problems.
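To make the Faddeeva branching concrete, the sketch below shows the two-way structure just described; it is an illustration rather than the thesis code, with the region boundary, the table handling, and the expansion all reduced to placeholders (only the leading asymptotic term w(z) ≈ i/(√π z) is kept).

    #include <cuComplex.h>

    #define TABLE_RADIUS 6.0f   // hypothetical boundary between the two regions

    // Threads of one warp whose arguments fall in different regions take
    // different branches here and are serialized by the hardware.
    __device__ cuFloatComplex faddeeva(cuFloatComplex z, const cuFloatComplex *table)
    {
        if (cuCabsf(z) < TABLE_RADIUS) {
            // near the origin: interpolate in the tabulated values
            // (placeholder: nearest stored entry only)
            return table[(int)cuCabsf(z)];
        } else {
            // far from the origin: leading term of the asymptotic expansion,
            // w(z) ~ i / (sqrt(pi) * z)
            cuFloatComplex i_over_sqrt_pi = make_cuFloatComplex(0.0f, 0.56418958f);
            return cuCdivf(i_over_sqrt_pi, z);
        }
    }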
5.2.2 Floating point precision

For GPU computing, the first consideration for performance is almost always the precision of the floating point numbers: even on the most recent generation of Nvidia GPUs, the peak theoretical throughput of single precision arithmetic is still about twice that of double precision, and on older generations the double precision throughput is lower still. Switching from double to single precision brings other benefits as well. First, for the same number of variables in the kernel function, single precision numbers require fewer registers per thread, which relieves register pressure and can either increase the occupancy level or reduce or avoid register spilling. Second, single precision numbers halve the data size and thus the memory bandwidth needed for loads and stores. Since global memory bandwidth is usually the major performance bottleneck in GPU programs, as it is here, this aspect is also very desirable.

To use single precision, one must make sure that there is no compromise in the accuracy of the program, or that the level of accuracy degradation is acceptable. In general, one common case where high precision is necessary is the subtraction of numbers close in value, which is exactly the situation in the original multipole method, where poles that are (nearly) symmetric with respect to the origin make opposing contributions over some energy ranges. With the approximate multipole method, however, the contributions from all faraway poles have been preprocessed and only those from the localized poles are accumulated on the fly, so this problem becomes minimal.

Table 5.3 shows the k_eff from both the double precision and the single precision versions of the slowing down code with 1 × 10^8 neutron histories. As in Chapter 4, these cases are run with MPI enabled to distribute the work to different compute nodes, each of which has one CUDA-enabled GPU installed. The runtime results, on the other hand, are obtained by running ten million neutron histories on one GPU card to avoid any overhead associated with MPI. The results confirm that the reduced floating point precision has little effect on accuracy. According to the compiler output, single precision does reduce the total number of registers needed by the kernel function, but the count is still above the limit of 63; therefore, although the register spilling and thus the local memory overhead are mitigated, the occupancy level remains at the low value of 33%. The speedup from double to single precision can be as high as 40%, as suggested by Table 5.3.

In addition, for single precision arithmetic CUDA also provides optimized versions of some common mathematical functions, such as division and the trigonometric functions, which are faster but less accurate than the standard ones. As demonstrated in Table 5.3, using these fast mathematical functions in the single precision version does not compromise the accuracy, and it brings another increase in performance of nearly 40%.
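A minimal sketch of how such a precision switch might look is given below; myFloat is the floating point typedef used by the cross-section data structures (see Code 5.1 below), while USE_SINGLE and fast_div are illustrative names introduced here. With single precision, CUDA's intrinsic __fdividef can replace the standard division, and the nvcc flag --use_fast_math applies such substitutions globally.

    #ifdef USE_SINGLE
    typedef float myFloat;
    // Fast, less accurate single precision division intrinsic.
    __device__ inline myFloat fast_div(myFloat a, myFloat b) { return __fdividef(a, b); }
    #else
    typedef double myFloat;
    // No fast-math equivalent in double precision: fall back to standard division.
    __device__ inline myFloat fast_div(myFloat a, myFloat b) { return a / b; }
    #endif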
Table 5.3: k_eff (with standard deviation) and the average runtime per neutron history for the different cases.

                                        k_eff                   average runtime (ms)
    Double precision                    0.764256 ± 0.000138     0.432
    Single precision w/o fast math      0.764264 ± 0.000138     0.308
    Single precision w/ fast math       0.764274 ± 0.000138     0.223

5.2.3 Global memory efficiency

As discussed in Section 5.2.1, differences in neutron energy may cause the threads within a warp to load different poles and residues from global memory, which results in non-coalesced memory accesses and low utilization of the global memory bandwidth. Although there is no easy way to avoid the energy differences, it is possible to limit the extent to which non-coalesced accesses occur and to increase the bandwidth utilization. Two techniques that work well are presented below.

Data layout

The first technique is rearranging the data layout of the poles and residues to increase memory coalescing. The original per-nuclide struct, "nucdata", as shown in Code 4.1 of Chapter 4, arranges the data for the poles and residues (as well as the angular momentum numbers l) as an "array of structures" (AoS). This strategy is efficient for CPUs: in the CPU code the poles are accessed one by one, and for each pole the data for the pole, the residues, and l are referenced consecutively, so grouping together the data associated with each pole works very well with the CPU cache. For the parallel version of the CPU code, the threads are essentially independent of each other during the slowing down process, and the cache is large enough that different threads can use different cache lines (or even different caches) without much contention, so the strategy still works well.

For GPUs, however, the situation is quite different. The major difference is that the on-chip cache of a GPU is far too small for the number of threads, so cached data cannot persist very long. In our case, the additional pole and residue data fetched along with the first piece of data accessed may be flushed before they are used, and they have to be fetched again when needed. This is a huge waste of memory bandwidth: for every 8 bytes needed (or 4 bytes in single precision), 128 bytes are loaded, of which at most 16 bytes are useful (assuming no caching benefit). To avoid this underutilization of global memory bandwidth, a different data structure, a "structure of arrays" (SoA), is implemented for "nucdata" and shown in Code 5.1.

Code 5.1: Data structure for the approximate multipole method on GPU

    // all information for a nuclide
    typedef struct {
        isoprop props;        // nuclide properties
        int32_t Nbkg;         // number of bkgrd entries
        int32_t Nprs;         // number of poles (and residues)
        bkgrd_entry *bkgrds;  // array of bkgrd entries
        // myFloat can be either float or double
        myFloat *prs;         // array of poles and residues
        int32_t *Ls;          // array of L associated with the poles
    } nucdata;

In this new data structure, the poles and residues are organized into a single array of 8·Nprs floating point numbers, where the first Nprs numbers are the real parts of all the poles and the second Nprs numbers their imaginary parts, while the remaining 6·Nprs numbers hold the real and imaginary parts of the three types of residues arranged in the same fashion as the poles. The angular momentum numbers are likewise stored in a separate array.
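As an illustration of how this layout is indexed, the sketch below reads one pole in the SoA arrangement just described; load_pole is a hypothetical helper, not part of the thesis code.

    // Read pole j of a nuclide laid out as in Code 5.1. When the threads of a
    // warp each read the same component (e.g. the real part) of their own pole,
    // all of their loads fall inside one contiguous sub-array, which raises the
    // chance that they share memory segments and coalesce.
    __device__ void load_pole(const nucdata *nuc, int j, myFloat *re, myFloat *im)
    {
        const myFloat *prs = nuc->prs;
        int n = nuc->Nprs;
        *re = prs[j];        // real parts occupy prs[0 .. Nprs-1]
        *im = prs[n + j];    // imaginary parts occupy prs[Nprs .. 2*Nprs-1]
        // the three residue types follow in the same way, starting at prs[2*Nprs]
    }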
This way, when one data type is accessed (the real parts of the poles, the imaginary parts, etc.), all the threads within a warp are confined to memory locations of that type, which increases the probability of different loads falling into the same memory segment, reducing the number of memory transactions and increasing the bandwidth utilization. In fact, with this modification alone, a speedup of nearly 100% is observed for both the single precision and the double precision versions of the code on both GPU cards.

Data parallelism

One additional technique that can be exploited to increase the global memory efficiency is the data parallelism supported by GPUs. More concretely, instead of loading the data for each pole one by one as needed, the data for a few poles can be loaded at once, stored in shared memory, and then read from shared memory as needed. Since shared memory has much lower latency and much higher bandwidth than global memory (see Table 2.1 of Chapter 2), the additional loads and stores through shared memory add little overhead (although care must be taken to avoid bank conflicts). The advantage of this strategy is twofold. On one hand, since the instructions that load and store the poles and residues are independent of each other, they can all be issued at once, effectively hiding the latency of the global memory accesses. On the other hand, successive accesses to the same data type may be combined into the same memory transaction if they fall into the same memory segment, which further increases memory coalescing. Since the maximum amount of shared memory available to an SM is 48 KB, the number of poles that can be loaded into shared memory is limited: to maintain the current occupancy, only four poles can be loaded at once in the single precision version and two in the double precision version. The resulting speedup varies with the floating point precision and the GPU card, ranging from a few percent to some tens of percent.

5.2.4 Shared memory and register usage

As discussed above, the high register requirement of each thread not only lowers the occupancy, thus reducing the parallelism of the kernel and its ability to hide latency, but also causes register spilling and hence the high local memory overhead. Short of completely restructuring the kernel and/or using different algorithms, there is no good way to reduce the number of temporary variables in the kernel. One approach that may help reduce the register count, though, is to store some temporary variables in shared memory. Various parts of the code were therefore modified to move different temporary variables into shared memory, and the compiler output confirms that the number of spilled registers decreased. However, the performance became worse. In addition, since the required number of registers was always far above the hardware limit, for both the double and the single precision versions, it never dropped to a level that would have increased the occupancy. This method was therefore abandoned. With no performance improvement gained from reducing the register usage, the next attempt was to reduce the local memory overhead directly.
Given that local memory actually resides in global memory, and that the spilled register contents have, by default, very good access patterns, the only remaining way to reduce this overhead appears to be to increase the L1 cache size for global memory. On both cards used in this work there is a configurable on-chip memory of 64 KB, shared between the L1 cache and the shared memory, which can be split into segments of 16 KB and 48 KB; by default, the shared memory gets 48 KB. It was found that when little shared memory is used (for example, when the poles and residues are not preloaded into shared memory), configuring this memory in favor of the L1 cache improves performance on both cards. With substantial shared memory usage, however, favoring the L1 cache shrinks the shared memory and may hurt the overall performance. In fact, if the preloaded poles and residues are kept in registers (which increases the extent of register spilling) and the on-chip memory is configured in favor of the L1 cache to accommodate the spilled registers (as well as other global memory accesses), then the performance improves on the Quadro 4000 card but degrades on the Tesla M2050 card.

5.3 Results and discussion

With all the optimization efforts explored in the previous section, the final speedups of the CUDA version of the slowing down code over the corresponding serial CPU version are shown in Table 5.4. For comparison, the same set of performance-related metrics is listed in Table 5.5 for the optimized single precision version on the Quadro 4000.

Table 5.4: Speedup of the GPU versus the serial CPU version on both GPU cards.

                       Double precision    Single precision
    Quadro 4000        6.0                 13.3
    Tesla M2050        10.7                21.6

Table 5.5: Performance statistics of the optimized single precision version on Quadro 4000 from nvvp.

    DRAM utilization                  39.0%
    Global load efficiency             3.5%
    Global memory replay overhead     10%
    Global cache replay overhead       3.0%
    Local memory overhead             25.8%
    Branch divergence overhead        75.4%

Comparing Table 5.2 and Table 5.5, it is clear that all performance metrics except the local memory overhead and the branch divergence overhead have improved significantly, which indicates that the optimization efforts were quite successful. The increase in local memory overhead is expected, since in the Quadro version the poles and residues are preloaded into registers, which can cause more register spilling. The increase in branch divergence overhead is speculated to come from the additional operations associated with this preloading as well. Some values in Table 5.5 are still not very good, such as the DRAM utilization and the global load efficiency, which suggests either that there are inherent features of the algorithm that are not well suited to GPUs, or that further optimization is needed. The main hurdles to the performance of the code can be summarized as follows:

1) Even though the new algorithm for the slowing down process (Algorithm 4) avoids some potential divergent branches, divergence is still present in the code because of the random nature of Monte Carlo. It arises mainly in two places: i) due to differences in incoming neutron energy, the number and sequence of poles to be broadened differ among threads; ii) branches inside the Faddeeva function, which is a hotspot of the code.
2) The non-coalesced global memory access discussed in Section 5.2.3 is still a big issue, and it too comes mainly from the randomness of Monte Carlo.

3) The large number of registers required by the complex kernel function leads to a low occupancy, which limits the GPU's ability to hide the memory latency that mainly comes from 2). In addition, the associated register spilling may still hurt the performance.

With the current performance results, it is almost certain that Monte Carlo methods cannot take full advantage of the massively parallel capability that GPUs potentially provide. In fact, the peak theoretical throughput of the CPU used in Chapter 4 is about 192 Gigaflops, roughly one fifth that of the Tesla M2050 card in single precision. However, since the slowing down code achieves almost linear scalability on the CPU, there is only about a factor of two difference in performance between one such CPU and one Tesla M2050 card.

Chapter 6

Summary and Future Work

6.1 Summary

In this thesis, a new approach based on the multipole representation is proposed to address the prohibitively large memory requirements for nuclear cross sections in Monte Carlo reactor simulations. The multipole representation transforms the resonance parameters of each nuclide into a set of poles and residues, and these poles are then broadened to obtain the cross section at any temperature. The poles show distinct differences in their contributions to the energy ranges of interest: some have very smooth contributions and thus exhibit little Doppler effect, while others show fluctuating behavior that is localized. An overlapping energy domain strategy is therefore proposed to reduce the number of poles that need to be broadened on the fly, and it forms the basis of the new approximate multipole method. Specifically, the majority of the poles, whose contributions over an energy interval are smooth, are preprocessed so that their contributions can be approximated with a low-order polynomial; only a small number of locally fluctuating poles are left to be broadened on the fly. The parameters associated with this strategy, i.e., the outer and inner window sizes, and their effects on both the performance and the memory requirement are studied, and a set of values is recommended that achieves the desired level of accuracy while maintaining good efficiency. In general, for major nuclides such as U238 and U235, which have more than 3000 resonances, the number of poles to be broadened on the fly ranges from 5 to around 42, depending on the energy at which the cross section needs to be evaluated.

The approximate multipole method is then implemented on the CPU and its performance is compared against that of the traditional table lookup method as well as Cullen's method. It was found that the new method has little computational overhead relative to the standard table lookup method, in general less than 50%, and on some hardware systems it is even faster. The main reason for this low overhead is that there is much less memory overhead from cache misses. The approximate multipole method also shows good scalability, better than that of the table lookup method, which makes it more attractive for massively parallel deployment. Its major advantage, however, is the large reduction in the memory footprint of the resolved resonance cross section data.
From the data in Chapters 1, 2 and 4, it is clear that the new method can reduce the memory footprint by three orders of magnitude relative to the traditional method, and by one to two orders of magnitude relative to comparable techniques. This large reduction in memory can play a significant role in high performance computing, where the number of cores continues to increase to the detriment of the memory available per core. In particular, the new method makes it possible to run Monte Carlo codes on GPUs for realistic reactor simulations, in order to exploit their massively parallel capability.

In this thesis, the approximate multipole method is also implemented on the GPU for a neutron slowing down problem in a homogeneous material consisting of 301 nuclides. Two different types of GPU cards are used to form a good understanding of the performance on the GPU. Through extensive optimization efforts, the GPU version achieves a 22 times speedup compared with the serial CPU version. The main factors that contribute to this speedup are the reduced floating point precision for higher throughput, the faster but less accurate mathematical functions, a data structure that favors structure of arrays over array of structures for a better memory access pattern, and the preloading of data from global memory into the faster on-chip shared memory. The major performance bottleneck, on the other hand, comes from the randomness of the Monte Carlo method, which manifests itself in branch divergence and non-coalesced global memory accesses. In addition, the register pressure due to the complexity of the kernel also hurts the performance. With the current performance results, it is almost certain that Monte Carlo methods in reactor simulation cannot take full advantage of the massively parallel capability that GPUs can potentially provide.

6.2 Future work

To generate an entire library with the proposed approximate multipole method for direct cross section evaluation, work still remains. First, the WHOPPER code used in this thesis to generate the poles works exclusively with resonance parameters in the Reich-Moore format. ENDF/B-VII currently has 50 nuclides in that format, while 250 nuclides are in the MLBW format and 100 nuclides have no resonance file. A processing tool for the MLBW format already exists [33], but it remains to be decided how to proceed with the remaining nuclides. Second, Chapter 3 proposes a systematic way of determining the outer and inner window sizes (mainly the outer window size), based on the observed functional dependence of the Doppler broadening range on energy and temperature as well as on numerical investigation. Because the outer window size is not needed during cross section evaluation, and because the inner and outer windows are essentially decoupled, an alternative approach may be pursued: once an inner window size is chosen, the optimal outer window size of each inner window can be determined iteratively so as to achieve the specified accuracy level, and this process can be repeated to determine the optimal inner window size. In addition, the efficiency of the approximate multipole method could be improved by relaxing the accuracy criteria in regions of low cross section values; the cross sections in these regions usually demand the largest number of poles to be broadened on the fly, yet their impact on the overall neutronic behavior may be insignificant. Last, the focus of this thesis is only on the resolved resonance region above 1 eV.
To extend the approximate multipole method to the low energy region, the contribution from the correction term in Eq. 3.10 and 3.11 needs to be taken into account, as well as the integration with low energy scattering such as S(α, β) [34]. It is also worth exploring whether the multipole representation can be applied in the unresolved resonance region. 102 Appendix A Whopper Input Files The resonance parameters of the input files listed below all come from http://t2.lanl. gov/nis/data/endf/endfvii.1-n.html A.1 U238 1 RESONANCES OF U238 ENDFB-VII.1 0 1 1 1 0 9.223800+4 2.360058+2 0 9.223800+4 1.000000+0 0 1.000000-5 2.000000+4 1 0.000000+0 9.480000-1 0 2.360058+2 0.000000+0 0 1 926 3 0.94800000 -4.405250+3 5.000000-1 1.393500+2 -4.133000+2 5.000000-1 5.215449-2 -3.933000+2 5.000000-1 4.993892-2 -3.733000+2 5.000000-1 4.764719-2 -3.533000+2 5.000000-1 4.527354-2 -3.333000+2 5.000000-1 4.281115-2 -3.133000+2 5.000000-1 4.025348-2 -2.933000+2 5.000000-1 3.759330-2 -2.733000+2 5.000000-1 2.551450-2 -2.533000+2 5.000000-1 2.397198-2 0 1 0 1 0 0 3 0 0 1 2 1 2 138 2.300000-2 2.300000-2 2.300000-2 2.300000-2 2.300000-2 2.300000-2 2.300000-2 2.300000-2 2.300000-2 2.300000-2 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 103 1 0 09237 09237 09237 29237 459237 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 2 2 2 2 2 -2.333000+2 5.000000-1 2.234626-2 2.300000-2 -2.133000+2 5.000000-1 2.062684-2 2.300000-2 -1.933000+2 5.000000-1 1.879962-2 2.300000-2 -1.733000+2 5.000000-1 1.685164-2 2.300000-2 -1.533000+2 5.000000-1 1.476751-2 2.300000-2 -1.333000+2 5.000000-1 1.253624-2 2.300000-2 -1.133000+2 5.000000-1 1.015824-2 2.300000-2 -9.330000+1 5.000000-1 7.658435-3 2.300000-2 -7.330000+1 5.000000-1 5.086118-3 2.300000-2 -5.330000+1 5.000000-1 2.932955-3 2.300000-2 -3.330000+1 5.000000-1 1.004548-2 2.300000-2 -7.000000+0 5.000000-1 1.685000-4 2.300000-2 6.673491+0 5.000000-1 1.475792-3 2.300000-2 2.087152+1 5.000000-1 1.009376-2 2.286379-2 ... 895 resonance parameters ommitted here 2.000895+4 5.000000-1 1.947255+0 2.300000-2 2.002445+4 5.000000-1 7.114126-1 2.300000-2 2.003658+4 5.000000-1 1.617239+0 2.300000-2 2.009290+4 5.000000-1 1.016504+0 2.300000-2 2.012000+4 5.000000-1 7.876800-2 2.300000-2 2.017500+4 5.000000-1 4.742300-1 2.300000-2 2.440525+4 5.000000-1 2.900960+2 2.300000-2 2.360058+2 0.000000+0 1 0 2 851 3 0.9480000 1.131374+1 5.000000-1 4.074040-7 2.300000-2 4.330947+1 5.000000-1 6.190440-7 2.300000-2 ... 846 resonance parameters ommitted here 1.988630+4 5.000000-1 1.328216-2 1.273902-2 1.988734+4 5.000000-1 5.155808-2 1.229675-2 2.000188+4 5.000000-1 1.355880-2 2.300000-2 1566 3 0.9480000 4.407476+0 1.500000+0 5.553415-8 2.300000-2 7.675288+0 1.500000+0 9.416455-9 2.300000-2 ... 
1561 resonance parameters ommitted here 1.998370+4 1.500000+0 3.780048-2 2.677523-2 1.999255+4 1.500000+0 3.232645-3 2.300000-2 2.000376+4 1.500000+0 5.208511-2 2.300000-2 0.00001 20000.0 0.001 300.0 104 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 2.010000-6 0.000000+0 0.000000+0 5.420000-8 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 9.990000-9 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 372 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 459237 2 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 A.2 U235 1 RESONANCES OF U235 ENDFB-VII.1 0 1 1 1 0 0 1 0 1 1 0 9.223500+4 2.330248+2 0 2 0 09228 9.223500+4 1.000000+0 0 0 2 09228 1.000000-5 2.250000+3 1 3 1 09228 3.500000+0 9.602000-1 0 0 2 29228 2.330248+2 0.000000+0 0 0 12772 31939228 2 1449 3 0.96020000 -2.038300+3 3.000000+0 1.970300-2 3.379200-2-4.665200-2-1.008800-1 -1.812100+3 3.000000+0 8.574000-4 3.744500-2 7.361700-1-7.418700-1 -1.586200+3 3.000000+0 8.284500-3 3.443900-2 1.536500-1-9.918600-2 -1.357500+3 3.000000+0 5.078700-2 3.850600-2-1.691400-1-3.862200-1 -5.158800+2 3.000000+0 2.988400+0 3.803000-2-8.128500-1-8.180500-1 -7.476600+1 3.000000+0 3.837500-1 5.208500-2-8.644000-1-7.865200-1 -3.492800+0 3.000000+0 8.539000-8 3.779100-2-6.884400-3 1.297700-2 -1.504300+0 3.000000+0 8.533300-8 3.782800-2-7.039700-3 1.168600-2 -5.609800-1 3.000000+0 2.997400-4 2.085500-2 9.564400-2-1.183900-2 2.737933-1 3.000000+0 4.248600-6 4.620300-2 1.177100-1 3.484800-4 ... 1431 resonance parameters ommitted here 2.246138+3 3.000000+0 3.145300-3 3.820000-2 5.781200-2-8.390000-2 2.250300+3 3.000000+0 2.240500-2 6.842500-2-4.117100-1-1.227300-1 2.254200+3 3.000000+0 2.551800-2 9.486300-2 2.674300-2 4.203200-2 2.256200+3 3.000000+0 1.423000-2 4.937900-2 2.501300-2 3.631000-2 2.283800+3 3.000000+0 7.159000+0 9.988600-2 8.765300-1 4.688800-1 2.630400+3 3.000000+0 7.853400+0 4.516400-2 7.068000-1 5.364700-1 3.330800+3 3.000000+0 1.205700+1 4.722800-2 4.744200-1 5.712900-1 4.500900+3 3.000000+0 6.143900+0 3.368100-2 2.866200-1 3.641400-1 1744 3 0.96020000 -1.132100+3 4.000000+0 1.714400+0 3.979400-2 4.770100-1-4.693700-1 -7.223900+2 4.000000+0 2.503600+0 3.612200-2 7.749400-1-8.300900-1 -3.243600+2 4.000000+0 1.519600-1 3.893400-2 7.608300-1-7.751100-1 -3.360400+0 4.000000+0 5.427700-3 2.624000-2 1.784100-1-7.486200-2 -1.818200-1 4.000000+0 3.664300-6 2.058000-2 1.563300-1-2.970900-2 3.657500-5 4.000000+0 6.46080-11 4.000000-2-5.091200-4 9.353600-4 1.134232+0 4.000000+0 1.451900-5 3.855000-2 5.184600-5 1.284500-1 ... 
1728 resonance parameters ommitted here 2.247883+3 4.000000+0 1.001200-2 3.820000-2 1.147400-1 1.332300-1 2.247927+3 4.000000+0 1.646500-2 3.820000-2-1.139700-1-7.628300-2 2.257300+3 4.000000+0 5.543700-2 2.692900-1 4.637500-2 2.524300-1 105 2 2 2 2 2 2.657600+3 3.142000+3 3.588700+3 3.819200+3 4.038900+3 4.274700+3 0.00001 300.0 A.3 4.000000+0 4.000000+0 4.000000+0 4.000000+0 4.000000+0 4.000000+0 20000.0 4.491400-1 2.419400-2 4.682100-2 2.097300-1 6.034300-2 1.498700-2 0.001 5.531500-2 3.829100-2 1.621300-1 4.651300-2-9.506100-2-6.489900-2 3.930100-2-5.121800-2-2.476700-3 3.849400-2-5.116600-1 6.770900-2 3.869100-2-1.141800-1-7.143200-1 3.725500-2 1.604500-2-1.079600-2 Gd155 1 RESONANCES OF Gd155 ENDFB-VII.1 0 1 0 1 0 0 1 6.415500+4 1.535920+2 1 0 6.415500+4 1.000000+0 1 0 1.000000-5 1.833000+2 1 3 1.500000+0 7.900000-1 0 0 1.535920+2 0.000000+0 0 0 2 37 3 0.79000000 2.008000+0 1.000000+0 3.706666-4 1.100000-1 3.616000+0 1.000000+0 4.400000-5 1.300000-1 ... 32 resonance parameters ommitted here 1.683000+2 1.000000+0 3.013333-2 1.098000-1 1.780000+2 1.000000+0 9.733333-3 1.098000-1 1.804000+2 1.000000+0 1.466667-2 1.098000-1 55 3 0.7900000 2.680000-2 2.000000+0 1.040000-4 1.080000-1 2.568000+0 2.000000+0 1.744000-3 1.110000-1 ... 50 resonance parameters ommitted here 1.714000+2 2.000000+0 9.200000-3 1.098000-1 1.735000+2 2.000000+0 3.280000-2 1.098000-1 1.756000+2 2.000000+0 2.080000-3 1.098000-1 0.00001 20000.0 0.001 300.0 106 0 1 1 1 1 2 492 1 0 46434 06434 06434 26434 456434 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 0.000000+0 2 2 2 2 2 References [1] NVIDIA Corp. CUDA Compute Unified Device Architecture Programming Guide Version 5.5, 2013. [2] W. Hwu. Lecture notes 4.1 of Heterogeneous Parallel Programming on Coursera. https://www.coursera.org/course/hetero, 2012. [3] R.E. MacFarlane and D.W. Muir. NJOY99.0 - Code System for Producing Pointwise and Multigroup Neutron and Photon Cross Sections from ENDF/B Data. PSR-480/NJOY99.00, Los Alamos National Laboratory, 2000. [4] T. H. Trumbull. Treatment of Nuclear Data for Transport Problems Containing Detailed Temperature Distributions. Nucl. Tech., 156(1):75–86, 2006. [5] G. Yesilyurt, W.R. Martin, and F.B. Brown. On-the-Fly Doppler Broadening for Monte Carlo Codes. Nucl. Sci. and Eng., 171(3):239–257, 2012. [6] T. Viitanen and J. Leppanen. Explicit Treatment of Thermal Motion in Continuous-Energy Monte Carlo Tracking Routines. Nucl. Sci. and Eng., 171(2):165–173, 2012. [7] R.N. Hwang. A Rigorous Pole Representation of Multilevel Cross Sections and Its Practical Applications. Nucl. Sci. and Eng., 96(3):192–209, 1987. [8] B. Forget, S. Xu, and K. Smith. Direct Doppler Broadening in Monte Carlo Simulations using the Multipole Representation. Submitted June 2013. [9] D.E. Cullen and C.R. Weisbin. Exact Doppler Broadening of Tabulated Cross Sections. Nucl. Sci. and Eng., 60:199–229, 1976. [10] R. E. MacFarlane and A. C. Kahler. Methods for Processing ENDF/B-VII with NJOY . Nuclear Data Sheets, 111(12):2739–2890, 2010. [11] F.B. Brown, W.R. Martin, G. Yesilyurt, and S. Wilderman. Progress with OnThe-Fly Neutron Doppler Broadening in MCNP. Transaction of the American Nuclear Society, Vol. 106, June 2012. [12] E. Woodcock et al. Techniques Used in the GEM Code for Monte Carlo Neutronics Calculations in Reactors and Other Systems of Complex Geometry. 
ANL-7050, Argonne National Laboratory, 1965.

[13] T. Viitanen and J. Leppanen. Explicit Temperature Treatment in Monte Carlo Neutron Tracking Routines – First Results. In PHYSOR 2012, Knoxville, Tennessee, USA, April 2012. American Nuclear Society, LaGrange Park, IL.

[14] J. Duderstadt and L. Hamilton. Nuclear Reactor Analysis. John Wiley & Sons, Inc, 1976.

[15] S. Li, K. Wang, and G. Yu. Research on Fast-Doppler-Broadening of Neutron Cross Sections. In PHYSOR 2012, Knoxville, Tennessee, USA, April 2012. American Nuclear Society, LaGrange Park, IL.

[16] NVIDIA. CUDA Technology; http://www.nvidia.com/CUDA, 2007.

[17] A.G. Nelson. Monte Carlo Methods for Neutron Transport on Graphics Processing Units Using CUDA. Master's thesis, Pennsylvania State University, Department of Mechanical and Nuclear Engineering, December 2009.

[18] A. Heimlich, A.C.A. Mol, and C.M.N.A. Pereira. GPU-based Monte Carlo simulation in neutron transport and finite differences heat equation. Prog. in Nucl. Energy, 53:229–239, 2011.

[19] T. Liu, A. Ding, W. Ji, and G. Xu. A Monte Carlo Neutron Transport Code for Eigenvalue Calculations on a Dual-GPU System and CUDA Environment. In PHYSOR 2012, Knoxville, Tennessee, USA, April 2012. American Nuclear Society, LaGrange Park, IL.

[20] B. Yang, K. Lu, J. Liu, X. Wang, and C. Gong. GPU Accelerated Monte Carlo Simulation of Deep Penetration Neutron Transport. In IEEE Intl Conf. on Parallel, Dist., and Grid Computing, Solan, India, 2012.

[21] D.G. Merrill. Allocation-oriented Algorithm Design with Application to GPU Computing. PhD thesis, University of Virginia, School of Engineering and Applied Science, December 2011.

[22] L.C. Leal. Brief Review of the R-Matrix Theory. http://ocw.mit.edu/courses/nuclear-engineering/22-106-neutron-interactions-and-applications-spring-2010/lecture-notes/MIT22_106S10_lec04b.pdf, 2010.

[23] G. de Saussure and R.B. Perez. POLLA: A Fortran Program to Convert R-Matrix-Type Multilevel Resonance Parameters into Equivalent Kapur-Peierls-Type Parameters. ORNL-2599, Oak Ridge National Laboratory, 1969.

[24] A.W. Solbrig. Doppler Broadening of Low-Energy Resonances. Nucl. Sci. and Eng., 10:167–168, 1961.

[25] R.N. Hwang. An Extension of the Rigorous Pole Representation of Cross Sections for Reactor Applications. Nucl. Sci. and Eng., 111:113–131, 1992.

[26] H. Henryson, B.J. Toppel, and C.G. Stenberg. MC2-2: A Code to Calculate Fast-Neutron Spectra and Multigroup Cross Sections. ANL-8144, Argonne National Laboratory, 1976.

[27] R.N. Hwang. Resonance Theory in Reactor Applications. In Y. Azmy and E. Sartori, editors, Nuclear Computational Science: A Century in Review, chapter 5, page 235. Springer, 2010.

[28] SciPy v0.12 Reference Guide (DRAFT). http://docs.scipy.org/doc/scipy/reference/generated/scipy.special.wofz.html.

[29] OpenMP. http://openmp.org/wp/.

[30] The Message Passing Interface (MPI) Standard. http://www.mcs.anl.gov/research/projects/mpi/.

[31] G. Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14):1–6, 2003.

[32] CUDA Visual Profiler. http://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual-profiler.

[33] C. Jammes and R.N. Hwang. Conversion of Single- and Multilevel Breit-Wigner Resonance Parameters to Pole Representation Parameters. Nucl. Sci. and Eng., 134(1):37–49, 2000.

[34] K.H. Beckurts and K. Wirtz. Neutron Physics. Springer, Berlin, 1964.