FLUKA Pre-Optimizer for a Monte Carlo Treatment Planning System

September 2015

Author: Guido Arnau Antoniucci
Supervisors: Vasilis Vlachoudis, Wioletta Kozlowska

CERN openlab Summer Student Report 2015

Project Specification

The intention of this project is to explore the capabilities of the Intel Xeon Phi coprocessor by creating a pre-optimizer based on the output of FLUKA to support a Monte Carlo Treatment Planning System (MCTPS) for proton therapy. FLUKA will provide 2D axis-symmetric R-Phi-Z plots of energy deposition for the discrete beam energies, positions and directions that the gantry can provide. Using the Intel Xeon Phi, the plots will be recombined into a three-dimensional energy deposition mesh from which the necessary DVH can be extracted. A genetic algorithm and a steepest descent algorithm running at a higher layer will adjust the number of particles emitted by the beams in order to maximize the dose received by the Planning Target Volume (PTV) and minimize the dose received by the Organs At Risk (OAR).

Abstract

The EN-STI-EET group is currently working on implementing a Monte Carlo Treatment Planning System (MCTPS) for proton therapy based on FLUKA. Since physics simulations are extremely CPU intensive, a full Monte Carlo based TPS would be almost unaffordable unless sophisticated optimization is employed. In this project the Intel Xeon Phi coprocessor will be tested as a tool to speed up the pre-optimization of a Treatment Plan. Firstly, a tool to evaluate the fitness score of a solution will be built and, in the second part of the project, a genetic algorithm and a steepest descent algorithm will be implemented to optimize the treatment.

Table of Contents

1 Introduction
  1.1 Treatment Planning System
  1.2 FLUKA
  1.3 The Pre-Optimizer
2 Pre-Optimizer Design
  2.1 Higher Layer: Optimization
    2.1.1 Genetic Algorithm
    2.1.2 Steepest Descent
  2.2 Lower Layer: Generation of the 3D Mesh and Fitness Computation
    2.2.1 Fitness Function
  2.3 The Intel MIC architecture
    2.3.1 OpenMP
3 Results
  3.1 Optimization Algorithms
  3.2 Intel Xeon Phi Performance
4 Conclusions
5 Future work
6 Bibliography
1 Introduction

1.1 Treatment Planning System

Proton therapy is a type of particle therapy that consists of irradiating a diseased tissue or tumour with beams of protons in order to destroy the cells forming it. In practice, when this type of therapy is applied to a patient, a dosimetrist will prescribe a dose to be applied to the cancer or Planning Target Volume (PTV). A Treatment Planning System (TPS) will then try to find the best disposition and number of beams that need to be used in order to:

- Apply the prescribed dose to the tumour.
- Minimize the radiation received by the healthy organs or Organs At Risk (OAR) of the patient.

The TPS simulates multiple solutions, tuning the number of beams, the angles from which each beam is delivered, the number of protons shot by each beam, etc., in order to find the disposition that fits the prescribed dose. A CT scan of the patient is used to produce a more accurate simulation by providing information about the environment through which the particles travel. TPSs using the Monte Carlo (MC) technique have served for decades as reference tools for accurate dose computations due to their superior accuracy compared to the analytical algorithms that are most commonly used because of their shorter execution times. MC methods allow the physical processes involved in the therapy to be simulated in greater detail.

1.2 FLUKA

FLUKA is a general-purpose tool for calculations of particle transport and interactions with matter using Monte Carlo simulations. FLUKA can be used to build a MCTPS: each beam configuration would be simulated using FLUKA and, depending on the dose –the deposited energy– in the PTV and OAR, the solution would be scored accordingly. The problem with this approach is that simulating even a single beam with FLUKA takes a lot of computational power: each simulation would take approximately one hour.
We would need to simulate a great number of solutions with thousands of beams, making the computation last far too long. FLUKA simulations are thus really computationally intensive, and a shortcut needs to be found in order to build a practical MCTPS.

1.3 The Pre-Optimizer

In order to be able to use a MCTPS built with FLUKA, we need to introduce a pre-optimizer which is able to approximate a good treatment without using FLUKA and thus reduce the overall execution time of the MCTPS.

The Pre-Optimizer will receive as input a series of 2D cylindrical projections of the proton dose map generated by FLUKA. This means the optimizer will not be able to modify the number of beams in the treatment or their directions. FLUKA will use a patient CT scan as the environment through which the protons shot by the beams travel. The program will output an array containing a weight for each beam representing its "intensity", that is, the number of protons shot by each beam. This series of weights will be a good approximation of a beam disposition that applies the necessary dose to the PTV and minimizes the dose received by the OAR.

Figure 1. Diagram of the process followed by the MCTPS

2 Pre-Optimizer Design

The Pre-Optimizer has two distinct software layers which interface with each other: a lower layer in charge of merging all the 2D plots into a 3D energy deposition mesh and calculating the fitness score of the resulting beam configuration, and a higher layer which runs the optimization algorithms. The higher layer generates a series of weights (one for each beam) and passes it to the lower layer, which merges all the Usrbins –the dose depositions generated by FLUKA– into the mesh and calculates the fitness score the treatment would obtain if the beams shot the number of protons indicated by the weights.
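Conceptually, the lower layer therefore implements a single scoring contract: given one weight per beam, accumulate the weighted per-beam dose maps into the 3D mesh and score the PTV and OAR voxels. The following is a minimal Python/NumPy sketch of that contract, not the real C++/OpenMP implementation; names such as `beam_doses`, `ptv` and `oar` are illustrative placeholders, and the per-beam doses are assumed to be already projected onto the voxel grid:

```python
import numpy as np

def fitness(weights, beam_doses, ptv, oar, d_ptv, d_oar):
    """weights: one multiplier per beam; beam_doses: per-beam dose maps
    on the voxel grid; ptv/oar: boolean masks selecting the two regions;
    d_ptv: prescribed PTV dose; d_oar: maximum dose the OAR may receive."""
    # Merge the weighted dose contribution of every beam into one mesh.
    mesh = np.zeros_like(beam_doses[0])
    for w, dose in zip(weights, beam_doses):
        mesh += w * dose
    # Score only the voxels belonging to the PTV and OAR regions
    # (see the fitness function of section 2.2.1): PTV voxels are
    # penalized for any deviation from the prescription, OAR voxels
    # only when their dose exceeds the allowed maximum.
    ptv_term = np.sum((d_ptv - mesh[ptv]) ** 2)
    over = mesh[oar] > d_oar
    oar_term = np.sum((mesh[oar][over] - d_oar) ** 2)
    return float(ptv_term + oar_term)
```

A lower return value means less damage to the OAR and a PTV dose closer to the prescription, which is exactly the quantity the higher layer tries to minimize.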
The lower the score, the less damage is done to the OAR and the closer the dose received by the PTV is to the prescribed one. The higher layer is discussed in section 2.1; the fitness function used in this project is described in section 2.2.1.

The lower layer is written in C++ and is parallelized using OpenMP. It uses the existing Flair1 codebase –FLUKA's extension that provides a graphical interface– in order to work with the Voxel and Usrbin files generated by FLUKA. The implementation is discussed in section 2.2.

1 http://www.fluka.org/flair/index.html

Figure 2. Diagram of the Pre-Optimizer

The higher layer consists of an adapted genetic algorithm and a steepest descent algorithm. It is written in the Python programming language. More details about its implementation can be found in section 2.1. This layer interfaces with the lower layer through a wrapper written using the Python/C API, which allows C and C++ functions to be called seamlessly from Python. This wrapper is also in charge of casting the weights and fitness scores transferred between layers to a variable type compatible with the receiving system.

2.1 Higher Layer: Optimization

2.1.1 Genetic Algorithm

A generic Genetic Algorithm was implemented to serve as a randomized optimization algorithm, a class of algorithms that has historically worked well for Monte Carlo simulations. The algorithm is based on the concept of randomized evolution: an initial randomly generated population is optimized towards better solutions by creating new populations derived from the previous ones. The new populations are generated from the older ones by crossover and mutation operators and selected based on their fitness score, which is calculated by the lower layer. A population is a group of individuals currently being evaluated in order to produce the next generation of the algorithm.
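One generation of the evolutionary loop just described can be sketched in Python as follows. This is illustrative only: the rates are placeholder values, chromosomes are lists of beam weights, and `fitness` stands for the lower-layer scoring function (lower is better):

```python
import random

def tournament(population, scores, k=3):
    """Tournament selection: pick the fittest of k randomly drawn individuals."""
    picks = random.sample(range(len(population)), k)
    return population[min(picks, key=lambda i: scores[i])]

def two_point_crossover(a, b):
    """Splice the middle segment of parent b into parent a."""
    i, j = sorted(random.sample(range(len(a)), 2))
    return a[:i] + b[i:j] + a[j:]

def mutate(chromosome, rate=0.05, scale=0.1):
    """Randomly perturb some genes (beam weights), keeping them non-negative."""
    return [max(0.0, g + random.gauss(0.0, scale))
            if random.random() < rate else g
            for g in chromosome]

def next_generation(population, fitness, crossover_rate=0.8):
    """Produce one new generation via tournament selection, crossover and mutation."""
    scores = [fitness(ind) for ind in population]
    new_pop = []
    for _ in population:
        if random.random() < crossover_rate:
            child = two_point_crossover(tournament(population, scores),
                                        tournament(population, scores))
        else:
            child = tournament(population, scores)[:]
        new_pop.append(mutate(child))
    return new_pop
```

Skewing towards the fittest individuals happens inside `tournament`: the more copies of an individual win tournaments, the more offspring it contributes.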
In our case, an individual is represented by an array of weights –the intensity of each beam– which is defined as its chromosome. A step of the algorithm consists of applying a mutation operator over some individuals and a crossover operator over a group of randomly selected individuals, skewing the selection towards the fittest ones.

The mutation operator consists of randomly changing some of the genes –the weights of the beams– of the chromosome of an individual. The mutation rate can be adjusted in order to limit the number of individuals over which this operator is applied. The crossover operator consists of taking the best genes out of two individuals –the parents– and merging them, creating a new individual –the offspring–. This operator can be applied using different methods. For our program, the following operators were implemented:

- a two-point crossover operator,
- a uniform crossover operator (1),
- and a selection crossover operator (2).

The individuals on which the crossover operator is applied are chosen using the tournament selection method (3), based on a fixed crossover rate which can be tuned to make the convergence of the fitness function faster. The algorithm runs this process multiple times –each step is considered a generation– until the number of iterations exceeds the set limit or the fitness score falls below the threshold set by the user.

2.1.2 Steepest Descent

This algorithm is based on the idea that the negative gradient of a multivariable function is a vector pointing in the direction of steepest descent, that is, the direction in which the function decreases the most. If we follow this vector, and repeat the process, we will end up in a local minimum of the function.
Since our fitness function is a differentiable multivariable function –it has as many variables as there are beams, representing the weights– we can differentiate it at each point, approximating the value using numerical differentiation. That is, we can calculate the derivative of the fitness function with respect to each weight wᵢ, with i = 1, …, n (n being the number of beams), using the central difference:

    f′(wᵢ) = (f(wᵢ + h) − f(wᵢ − h)) / (2h)

The parameter h is fixed and has to be tuned in order to find the most suitable value that gives the best results. Computing this derivative for each weight results in an n-dimensional vector with as many components as there are beams in the treatment:

    ∇F = (∂F/∂w₁, ∂F/∂w₂, …, ∂F/∂wₙ)

The next step of the algorithm is to move the present point where the function was differentiated –composed of the weights being evaluated– along the negative of this vector to obtain a new point in the function space:

    wᵢ′ = wᵢ − t × ∂F/∂wᵢ

Before doing that we need to determine the distance to travel along that direction, that is, the parameter t. We need to find a t_min such that the new point has the lowest fitness score along the direction −∇F. To find this value we use the Golden Section Search method (4), an algorithm that successively narrows down the range in which the minimum of the function lies until it is found.

Once the value is found, the old set of weights is displaced in the direction of −∇F by a distance of t_min. This results in a new point where the whole process is applied again. The algorithm starts at a random point of the function space and stops once the maximum number of iterations is reached or the fitness score reaches a threshold set by the user.
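The steps above –central-difference gradient, golden-section line search for t, update of the weights– can be sketched in Python as follows. This is a minimal sketch assuming a generic fitness function `f`; the step size h, the tolerance and the search bracket for t are illustrative placeholder values:

```python
import math

def gradient(f, w, h=1e-3):
    """Central-difference approximation of the gradient of f at point w."""
    g = []
    for i in range(len(w)):
        wp, wm = w[:], w[:]
        wp[i] += h
        wm[i] -= h
        g.append((f(wp) - f(wm)) / (2.0 * h))
    return g

def golden_section(phi, a, b, tol=1e-5):
    """Narrow the bracket [a, b] around the minimum of the 1-D function phi."""
    inv = (math.sqrt(5.0) - 1.0) / 2.0   # reciprocal of the golden ratio
    c, d = b - inv * (b - a), a + inv * (b - a)
    while abs(b - a) > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - inv * (b - a)
        else:
            a, c = c, d
            d = a + inv * (b - a)
    return (a + b) / 2.0

def descend_step(f, w, t_max=1.0):
    """One steepest-descent iteration: move w along -grad(f) by t_min."""
    grad = gradient(f, w)
    along = lambda t: f([wi - t * gi for wi, gi in zip(w, grad)])
    t_min = golden_section(along, 0.0, t_max)
    return [wi - t_min * gi for wi, gi in zip(w, grad)]
```

Each call to `descend_step` produces the next point, and the outer loop repeats it until the iteration limit or the fitness threshold is reached.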
The biggest difference between the Steepest Descent algorithm and the genetic algorithm is that the former is not randomized, resulting in a supposedly faster convergence to the optimal fitness score, but at the expense of more resource-intensive computations. The algorithm also suffers from the problem of getting stuck in local minima, failing to find the lowest minimum of the function.

2.2 Lower Layer: Generation of the 3D Mesh and Fitness Computation

The objective of this layer is to generate a three-dimensional Cartesian mesh containing the dose deposited by the beams in each voxel –each cell of a three-dimensional matrix– belonging to the PTV or OAR. Each of those areas is then traversed in order to calculate the fitness function of the current solution, which is scored based on the previously mentioned dose depositions.

Figure 3. Process followed by the lower layer of the Pre-Optimizer

The program loads a series of Usrbin files, each containing a 2D cylindrical projection of the dose deposited by one beam, and merges them into a 3D Cartesian mesh, which is set based on the coordinate system used by the Voxel file where the three-dimensional locations of the PTV and the OAR are stored. The Usrbin files are generated by FLUKA using a CT scan of the patient as the environment in which the particle transport is simulated.

The beams are represented as two-dimensional matrices belonging to a cylindrical space. If we rotated such a matrix over itself, we would end up with a cylinder corresponding to a beam. In order to project the beams into three-dimensional space, we need to transpose this matrix into the correct position inside the three-dimensional Voxel coordinate system. The algorithm that produces this transformation traverses, for each of the beams, the areas of the three-dimensional matrix corresponding to the PTV and OAR.
Each of those positions is transposed into the coordinate system used to generate the Usrbins –which is different for each beam– and checked for being inside the cylinder described by the beam. If it is inside, the corresponding dose is added to the three-dimensional matrix, resulting in a matrix where all the doses deposited by the beams have been accumulated. It must be noted that this process is independent for each beam and thus can be done in parallel, as explained in section 2.3.

The reason the beams are saved as separate two-dimensional projections, instead of creating the three-dimensional matrix directly with FLUKA, is the huge amount of memory a three-dimensional matrix would occupy: the matrix can have close to a hundred million elements and thus occupy gigabytes of memory. If we needed to store one of these matrices for each beam, the memory requirements of this software would not be practical.

The lower layer is written in C++ and reuses existing Flair code in order to work with the data structures needed to perform the required operations. Therefore, we can visualize the Usrbin projections with Flair, as shown in Figure 4. The fact that it is written in C++ will help us parallelize the program with OpenMP in order to run it on the Intel Xeon Phi2.

Figure 4. Visualization of the beams in Flair

2 Refer to section 2.3 for more information about the Intel Xeon Phi.

Since we only need to look at the PTV and OAR areas in order to score the fitness function, we can optimize our program by only adding the dose on the voxels belonging to the PTV or OAR areas instead of scanning the whole Voxel. In Figure 5 we can observe that only the PTV has accumulated data:

Figure 5.
Dose received by the PTV, visualized in Flair

2.2.1 Fitness Function

The fitness function used is a simplification of the one found in (5), where Dᵢ is the dose received by voxel i, D̂_PTV is the prescribed dose for the PTV and D̂_OAR is the maximum dose the OAR can receive:

    χ(N) = Σ_{i ∈ PTV} (D̂_PTV − Dᵢ)² + Σ_{j ∈ OAR} (D̂_OAR − Dⱼ)² × θ(Dⱼ − D̂_OAR)

This function has to be minimized. Doses that go over or under the prescribed one in the PTV area are penalized with a higher fitness score. The voxels in the OAR areas contribute only when the dose they receive exceeds D̂_OAR: otherwise, the Heaviside step function θ prevents the voxel from contributing towards the fitness score.

2.3 The Intel MIC architecture

As mentioned in section 2.2, the addition of each beam into the three-dimensional matrix can be done in parallel. Since there can be thousands of beams, this software can benefit from massive parallelization. In this project, this computing power is provided by the Intel Xeon Phi: the lower layer described in section 2.2 runs completely on it.

The Intel Many Integrated Core (MIC) architecture combines many Intel CPU cores running at 1.1 GHz on a single chip. Each core is able to handle up to 4 threads using hyper-threading, achieving a theoretical peak of more than 1 teraFLOPS in double precision. A key attribute of the microarchitecture is that it is built to provide a general-purpose programming environment similar to the one offered by the Intel Xeon processor. Intel Xeon Phi coprocessors based on the Intel MIC architecture run a full-service Linux operating system, support the x86 memory-order model and are capable of running applications written in Fortran, C and C++. Thus, a wide array of multiprocessing libraries is supported, such as MPI, OpenMP and OpenCL.
A program written in OpenMP3 can be compiled to run both on a regular multicore Intel Xeon processor and on the Intel Xeon Phi. Unlike with regular GPUs, there is no need to use external multiprocessing APIs unique to the target platform.

The Intel Xeon Phi is connected to an Intel Xeon processor, also known as the "host", through a PCI Express (PCIe) bus. Any user can connect to the coprocessor through a secure shell and directly run individual jobs or submit batch jobs to it. The coprocessor also supports heterogeneous applications wherein a part of the application executes on the host while another part executes on the coprocessor, as shown in Figure 6.

Figure 6. Programming Models of the Intel Xeon Phi. Author: Tim Cramer, RWTH Aachen University.

3 Refer to section 2.3.1

In this project, we needed to offload the Usrbin classes containing the two-dimensional projections of the beams, so the offload model could not be used because of the inability of OpenMP to offload custom C++ classes. The solution was to separate the two layers of the Pre-Optimizer: the lower layer runs as a TCP/IP server on the Intel Xeon Phi, and the higher layer, running on the host, queries the server through a TCP/IP socket: each time the server is queried with a series of weights, it returns the fitness score of that dose deposition. Two extra layers had to be written in order to interface both layers with this model: one for the server and another for the client, written in C++ and Python respectively.

2.3.1 OpenMP

OpenMP is a directive-based multiprocessing API for C, C++ and Fortran. In order to parallelize a sequential program with OpenMP, adding a few pragmas is enough. In our case, some of the added pragmas were the following.
In order to parallelize the main loop, which merges each beam into the three-dimensional matrix, the following code was introduced (note that with default(none) every variable referenced in the loop, including the loop bound and the Usrbin array, must be listed explicitly in a data-sharing clause):

    #pragma omp parallel for default(none) shared(weight, usrbinArray, num_beams) schedule(static)
    for (int i = 0; i < num_beams; i++) {
        mergeUsrbin(*usrbinArray[i], weight[i]);
    }

In order to protect the three-dimensional matrix from data races when the program writes into it, an atomic directive must be set on the line where the operation occurs:

    #pragma omp atomic
    _voxel[pos] += weight * gy;

The directive causes the instruction to be executed atomically, that is, only one thread at a time can execute it, ensuring that the result accumulated in the variable is correct.

3 Results

3.1 Optimization Algorithms

In order to verify the reduction of the fitness score provided by the optimization algorithms, a plot of the fitness function was produced. The Genetic Algorithm yields the plot shown in Figure 7. The fitness score decreases and, after 2100 evaluations of the fitness function, it converges to the minimum found. In the graph, the yellow area shows the difference between the minimum and maximum scores found in each generation, with the average plotted in blue. As the generations are created, this difference decreases, as expected.

Figure 7. Decrease of the fitness score after each generation of the Genetic Algorithm.

The Steepest Descent algorithm generates the plot found in Figure 8. We can observe that the convergence is much slower than in the genetic algorithm. While the Genetic algorithm reduced the fitness score by 3×10⁶, the Steepest Descent algorithm only reduced it by about 100 fitness points. While the Genetic algorithm performed about 2100 evaluations before finding a minimum, the Steepest Descent algorithm performed about 170000 evaluations before getting stuck in a local minimum, with a disappointing reduction in the fitness score.
This means the time the Steepest Descent algorithm took to reduce the score is eight times longer than the time taken by the genetic algorithm to find a minimum. Clearly the parameter h used to approximate the derivative and the threshold set for the Golden Section Search need to be tuned in order to deliver better performance.

Figure 8. Decrease of the fitness score produced by the Steepest Descent Algorithm.

3.2 Intel Xeon Phi Performance

The performance of the Intel Xeon Phi was measured against two Intel Xeon E5-2690 v3 processors running in the cluster provided by the department. The plot shown in Figure 9 compares the execution time of a test case on the Intel Xeon Phi with the execution time yielded by the two Intel Xeon E5-2690 v3 as the number of threads increases. The Intel Xeon Phi can run up to 240 threads, since it has 60 cores and each core can run up to 4 threads. Each Intel Xeon E5-2690 v3 has 12 cores able to run 2 threads each, totalling 48 threads for the cluster.

The program run in order to perform the test is the lower layer of the pre-optimizer –described in section 2.2– which is fed a series of pre-established weights and returns the corresponding fitness score. The sequential code present in the lower layer will ideally be executed only once, versus the thousands of times the parallel code needs to be executed. Therefore, only the execution times of the parallel code are measured in the plot; no sequential code is present in the measurements.

The plot shows that the Intel Xeon Phi, even running with the maximum number of threads, cannot achieve the performance yielded by the two Intel Xeon E5-2690 v3: the execution time the Intel Xeon Phi converges to is three times longer than the one given by the two Intel Xeon E5-2690 v3.

Figure 9.
Speedup plots on the Intel Xeon Phi and two Intel Xeon E5-2690 v3

The executions of the program on the Intel Xeon Phi with 1 to 10 cores were excluded because of the extremely long execution times yielded. In an environment where an Intel Xeon Phi is available, it makes no sense to run the program with only a few cores.

4 Conclusions

We can conclude that a randomized algorithm like the Genetic algorithm used in this project is a better fit for this type of Monte Carlo simulation. The Genetic algorithm converged faster than the Steepest Descent algorithm and gave a bigger decrease in fitness score, although some performance tuning still needs to be done on the latter.

The Intel Xeon Phi performance was considerably inferior to the one yielded by high-end processors like the Intel Xeon E5-2690 v3 for this type of application. Software written for the Intel Xeon Phi needs to be heavily tuned for the platform. It is very hard to get good results with the Intel Xeon Phi on applications written to run both on multicore processors and on the Intel MIC architecture, even on simple programs with few calculations like the one tested.

5 Future work

The optimization algorithms still need to be tested with different initialization parameters in order to tune them for this particular case: for example, in the genetic algorithm, a thorough test needs to be done on the best crossover operator to use, as well as testing other new crossover operators. Other mutation/crossover ratios or elitism values can also be tested. In the Steepest Descent algorithm, other values of the parameter h used in the derivative approximation can be tested in order to improve the results gathered.

The code developed is also centred on the patient case provided for this experiment. Therefore, improvements could be made to the source code in order to generalize the software for any kind of treatment.
To do that, extra information must be appended to the Usrbin data of each beam in order to know the point on which the beam is centred and the vector along which the beam is directed. The Voxel file also needs to contain information about the rotations and translations required to transform the coordinate system used by the Usrbin files into the coordinate system used by the Voxel file.

6 Bibliography

1. Crossover (genetic algorithm). Wikipedia, The Free Encyclopedia. [Online] 6 June 2014. [Cited: 2 September 2015.] https://en.wikipedia.org/wiki/Crossover_%28genetic_algorithm%29.
2. Vekaria, Kant and Clack, Chris. Selective crossover in genetic algorithms: An empirical study. Parallel Problem Solving from Nature – PPSN V. Springer, 1998.
3. Tournament selection. Wikipedia, The Free Encyclopedia. [Online] 7 August 2015. [Cited: 2 September 2015.] https://en.wikipedia.org/w/index.php?title=Tournament_selection.
4. Golden section search. Wikipedia, The Free Encyclopedia. [Online] 13 May 2015. [Cited: 2 September 2015.] https://en.wikipedia.org/w/index.php?title=Golden_section_search.
5. Mairani, A, et al. A Monte Carlo-based treatment planning tool for proton therapy. Physics in Medicine and Biology, Vol. 58, No. 8. IOP Publishing, 2013.