
FLUKA Pre-Optimizer for a Monte Carlo Treatment Planning System
September 2015
Author:
Guido Arnau Antoniucci
Supervisors:
Vasilis Vlachoudis
Wioletta Kozlowska
CERN openlab Summer Student Report 2015
Project Specification
The intention of this project is to explore the Intel Xeon Phi coprocessor capabilities by creating a
pre-optimizer based on the output of FLUKA to support a Monte Carlo Treatment Planning
System (MCTPS) for proton therapy. FLUKA will provide 2D axis symmetric R-Phi-Z plots of
energy deposition for the discrete beam energies, positions and directions that the gantry can
provide. Using the Intel Xeon Phi, the plots will be recombined into a three-dimensional energy deposition mesh from which the necessary dose-volume histograms (DVH) can then be extracted.
A genetic algorithm and a steepest descent algorithm running at a higher layer will adjust the number of particles emitted by the beams in order to maximize the dose received by the Planning Target Volume (PTV) and minimize the dose received by the Organs At Risk (OAR).
Abstract
The EN-STI-EET group is currently working on implementing a Monte Carlo Treatment
Planning System (MCTPS) for proton therapy based on FLUKA. Since physics simulations are
extremely CPU intensive, a full Monte Carlo based TPS would be almost unaffordable unless sophisticated optimization is employed.
In this project the Intel Xeon Phi coprocessor will be tested as a tool to speed up the pre-optimization of a Treatment Plan. Firstly, a tool to evaluate the fitness score of a solution will be built and, in the second part of the project, a genetic algorithm and a steepest descent algorithm will be implemented to optimize the treatment.
Table of Contents
1 Introduction
  1.1 Treatment Planning System
  1.2 FLUKA
  1.3 The Pre-Optimizer
2 Pre-Optimizer Design
  2.1 Higher Layer: Optimization
    2.1.1 Genetic Algorithm
    2.1.2 Steepest Descent
  2.2 Lower Layer: Generation of the 3D Mesh and Fitness Computation
    2.2.1 Fitness Function
  2.3 The Intel MIC architecture
    2.3.1 OpenMP
3 Results
  3.1 Optimization Algorithms
  3.2 Intel Xeon Phi Performance
4 Conclusions
5 Future work
6 Bibliography
1 Introduction
1.1 Treatment Planning System
Proton therapy is a type of particle therapy that consists of using beams of protons to irradiate diseased tissue or a tumour in order to destroy the cells forming it.
In practice, when this type of therapy is applied to a patient, a dosimetrist will prescribe a dose to
be applied to the cancer or Planning Target Volume (PTV). A Treatment Planning System (TPS)
will then try to find the best disposition and number of beams that need to be used in order to:
• Apply the prescribed dose to the tumour.
• Minimize the radiation received by the healthy organs or Organs At Risk (OAR) of the patient.
The TPS simulates multiple solutions, tuning the number of beams, the angles from which each beam is delivered, the number of protons shot by each beam, etc., in order to find the disposition that fits the prescribed dose. A CT scan of the patient will be used to produce a more accurate simulation by providing information about the environment through which the particles travel.
TPSs using the Monte Carlo (MC) technique have served for decades as reference tools for accurate dose computation because of their superior accuracy compared to the analytical algorithms that are more commonly used for their shorter execution times. MC methods simulate the physical processes involved in the therapy in greater detail.
1.2 FLUKA
FLUKA is a general purpose tool for calculations of particle transport and interactions with
matter using Monte Carlo simulations.
FLUKA can be used to build a MCTPS. Each beam configuration would be simulated using
FLUKA and, depending on the dose -the deposited energy- in the PTV and OAR, the solution
would be scored accordingly.
The problem with this approach is that just simulating a single beam with FLUKA takes a lot of computational power: each simulation would take approximately one hour. We would need to simulate a great number of solutions with thousands of beams, making the computation last far too long.
Thus, FLUKA simulations are really computationally intensive and a shortcut needs to be found
in order to build a practical MCTPS.
1.3 The Pre-Optimizer
In order to make an MCTPS built with FLUKA usable, we need to introduce a pre-optimizer that can approximate a good treatment without using FLUKA, thus reducing the overall execution time of the MCTPS.
The Pre-Optimizer will receive as input a series of 2D cylindrical projections of the proton dose
map generated by FLUKA. That means the optimizer will not be able to modify the number of
beams in the treatment or the direction of those beams. FLUKA will use a patient CT scan as the
environment through which the protons shot by the beams travel. The program will output an array containing one weight per beam, representing the “intensity” of that beam, that is, the number of protons it shoots. This series of weights will be a good approximation of a
beam disposition that applies the necessary dose to the PTV and minimizes the dose received by
the OAR.
Figure 1. Diagram of the process followed by the MCTPS
2 Pre-Optimizer Design
The Pre-Optimizer has two distinct software layers which interface with each other: the lower layer, in charge of merging all the 2D plots into a 3D energy deposition mesh and calculating the fitness score of the input beam configuration, and a higher layer, which runs the optimization algorithms.
The higher layer will generate a series of weights (one for each beam) and pass it to the lower layer, which will merge all the Usrbins –the dose depositions generated by FLUKA– into the mesh and calculate the fitness score of the treatment as if the beams shot the number of protons indicated by the weights. The lower the score, the less damage is done to the OAR and the closer the dose received by the PTV is to the prescribed one. The optimization algorithms of the higher layer are discussed in section 2.1; the fitness function used in this project is described in section 2.2.1.
The lower layer is written in C++ and is parallelized using OpenMP. It uses the existing Flair¹ codebase –FLUKA’s extension to include a graphical interface– in order to work with the Voxel and Usrbin files generated by FLUKA. The implementation is discussed in section 2.2.
¹ http://www.fluka.org/flair/index.html
Figure 2. Diagram of the Pre-Optimizer
The higher layer consists of an adapted genetic algorithm and a steepest descent algorithm. It is written in the Python programming language; more details about its implementation can be found in section 2.1. This layer interfaces with the lower layer through a wrapper written with the Python/C API, which allows C or C++ functions to be called seamlessly from Python. This wrapper is also in charge of converting the weights and fitness scores transferred between layers into the proper variable types for the system that receives them.
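As a sketch of how the two layers could interact through this wrapper, the Python fragment below generates a random set of weights and asks the lower layer for its fitness score. The module and function names (preoptimizer, compute_fitness) are hypothetical, chosen for illustration only.

import random

import preoptimizer  # hypothetical C++ extension module built with the Python/C API

num_beams = 1000                                      # one weight per beam
weights = [random.random() for _ in range(num_beams)]

# The wrapper converts the Python list into the C++ types expected by the lower
# layer, which merges the Usrbins and returns the fitness score as a float.
score = preoptimizer.compute_fitness(weights)
print("fitness score:", score)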
2.1 Higher Layer: Optimization
2.1.1 Genetic Algorithm
A generic Genetic Algorithm was implemented to serve as a randomized optimization algorithm, a class of methods that has worked distinctly well for Monte Carlo simulations.
The algorithm is based on the concept of randomized evolution: an initial randomly generated
population is optimized towards better solutions by creating new populations derived from the
previous ones. The new populations are generated from the older ones by crossover and mutation
operators and selected based on their fitness score, which is calculated by the lower layer.
A population is a group of individuals currently being evaluated in order to produce the next
generation of the algorithm. In our case, an Individual will be represented by an array of weights
–the intensity of each beam–, which is defined as its chromosome.
A step of the algorithm consists of applying a mutation operator to some individuals and a crossover operator to a group of randomly selected individuals, skewing the selection towards the fittest ones. The mutation operator consists of randomly changing some of the genes –the weights of the beams– of an individual’s chromosome. The mutation rate can be adjusted in order to limit the number of individuals to which this operator is applied.
The crossover operator consists of taking the best genes of two individuals –the parents– and merging them, creating a new individual –the offspring–. This operator can be applied using different methods. For our program, the following operators were implemented:
• A two-point crossover operator,
• A uniform crossover operator (1),
• And a selective crossover operator (2).
The individuals to which the crossover operator is applied are selected using the tournament selection method (3) with a fixed crossover rate, which can be tuned to speed up the convergence of the fitness function.
The algorithm repeats this process –each step is considered a generation– until the number of iterations exceeds the set limit or the fitness score falls below the threshold set by the user.
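As an illustration, the sketch below (in Python, mirroring the higher layer) shows what one generation of the algorithm described above could look like, combining tournament selection, a two-point crossover and a random mutation of the beam weights. The population handling, the rates and the fitness callback (provided in the real program by the lower layer) are assumptions made for the example, not the actual implementation.

import random

def tournament(population, scores, k=3):
    # Pick the fittest of k randomly chosen individuals (lower score is better).
    contenders = random.sample(range(len(population)), k)
    return population[min(contenders, key=lambda i: scores[i])]

def two_point_crossover(parent_a, parent_b):
    # Swap the segment between two random cut points to create an offspring.
    i, j = sorted(random.sample(range(len(parent_a)), 2))
    return parent_a[:i] + parent_b[i:j] + parent_a[j:]

def mutate(individual, rate=0.05, scale=0.1):
    # Randomly perturb some genes (beam weights) of the chromosome.
    return [max(0.0, w + random.gauss(0.0, scale)) if random.random() < rate else w
            for w in individual]

def next_generation(population, fitness):
    # Build a new population of the same size from the current one.
    scores = [fitness(individual) for individual in population]  # evaluated by the lower layer
    new_population = []
    while len(new_population) < len(population):
        parent_a = tournament(population, scores)
        parent_b = tournament(population, scores)
        new_population.append(mutate(two_point_crossover(parent_a, parent_b)))
    return new_population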
2.1.2 Steepest Descent
This algorithm is based on the idea that the gradient of a multivariable function points in the direction of steepest ascent, so its opposite is the direction of steepest descent, that is, the direction in which the function decreases the most. If we repeatedly step in the descent direction, we end up in a local minimum of the function.
Since our fitness function is a differentiable multivariable function –it has as many variables as there are beams, representing the weights– we can differentiate it at each point, approximating the value by numerical differentiation. That is, we can calculate the partial derivative of the fitness function with respect to each weight w_i, with i = 1, …, n (n being the number of beams), using the central difference:
f'(w_i) = \frac{f(w_i + h) - f(w_i - h)}{2h}
The parameter h is fixed and has to be tuned in order to find the most suitable value that gives the
best results.
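A minimal Python sketch of this numerical differentiation, assuming a fitness callback provided by the lower layer and an illustrative value of h, could look as follows (the same loop also yields the gradient vector introduced below):

def partial_derivative(fitness, weights, i, h=1.0):
    # Central-difference estimate of dF/dw_i; the step h is a tunable assumption.
    plus, minus = list(weights), list(weights)
    plus[i] += h
    minus[i] -= h
    return (fitness(plus) - fitness(minus)) / (2.0 * h)

def gradient(fitness, weights, h=1.0):
    # Numerical gradient: one central difference per beam weight (2n fitness evaluations).
    return [partial_derivative(fitness, weights, i, h) for i in range(len(weights))]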
Computing the partial derivative for each weight results in an n-dimensional vector with as many components as there are beams in the treatment:

\nabla F = \left\{ \frac{\partial F}{\partial w_1}, \frac{\partial F}{\partial w_2}, \dots, \frac{\partial F}{\partial w_n} \right\}
The next step in the algorithm is to move the current point –composed of the weights being evaluated– along the direction of this vector to get a new point in the function space. To do that we just need to add the scaled vector to the point:

w_i' = w_i + t \times \nabla F
Before doing that we need to determine the distance to travel along the direction of the vector, that is, the parameter t. We need to find a t_min such that the new point has the lowest fitness score along the direction ∇F.
To find this value we will use the Golden Section Search method (4), an algorithm that successively narrows down the range in which the minimum of the function lies until it is found.
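A compact Python sketch of the Golden Section Search is given below; the bracketing interval [a, b] and the tolerance are assumptions, and g(t) would evaluate the fitness of the weights displaced by t along the gradient direction.

GOLDEN = (5 ** 0.5 - 1) / 2  # inverse golden ratio, about 0.618

def golden_section_search(g, a, b, tol=1e-3):
    # Successively narrow the interval [a, b] that contains the minimum of g.
    c = b - GOLDEN * (b - a)
    d = a + GOLDEN * (b - a)
    while abs(b - a) > tol:
        if g(c) < g(d):
            b = d        # the minimum lies in [a, d]
        else:
            a = c        # the minimum lies in [c, b]
        c = b - GOLDEN * (b - a)
        d = a + GOLDEN * (b - a)
    return (a + b) / 2   # t_min, the centre of the final interval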
Once the value is found, the old set of weights is displaced along the direction of ∇F by a distance of t_min. This results in a new point where the whole process is applied again. The algorithm starts at a random point of the function space and stops once the maximum number of iterations is reached or the fitness score reaches a threshold set by the user.
The biggest difference between the Steepest Descent algorithm and the genetic algorithm is that it is not randomized, which should result in faster convergence to the optimal fitness score, but at the expense of more resource-intensive computations.
The algorithm also suffers from the problem of getting stuck in local minima and failing to find the global minimum of the function.
2.2 Lower Layer: Generation of the 3D Mesh and Fitness Computation
The objective of this layer is to generate a three dimensional Cartesian mesh containing the dose
deposited by the beams in each voxel –each cell of a three-dimensional matrix– belonging to the
PTV or OAR. Then, each of those areas will be traversed in order to calculate the fitness function
of the current solution, which is scored based on the previously mentioned dose depositions.
Figure 3. Process followed by the lower layer of the Pre-Optimizer
The program will load a series of Usrbin files, each containing a 2D cylindrical projection of the dose deposited by one beam, and merge them into a 3D Cartesian mesh. The mesh is defined in the coordinate system used by the Voxel file, where the three-dimensional locations of the PTV and the OAR are stored. The Usrbin files are generated by FLUKA using a CT scan of the patient as the environment in which the particle transport is simulated.
The beams are represented as two-dimensional matrices in a cylindrical space. If we rotated such a matrix around the beam axis, we would end up with a cylinder corresponding to the beam. In order to project the beams into three-dimensional space, we need to map this matrix into the correct position inside the three-dimensional Voxel coordinate system.
The algorithm that produces this transformation traverses, for each of the beams, the areas of the three-dimensional matrix corresponding to the PTV and OAR. Each of those positions is transformed into the coordinate system used to generate that beam’s Usrbin –which is different for each beam– and checked to see whether it lies inside the cylinder described by the beam. If it does, the corresponding dose is added to the voxel of the three-dimensional matrix, resulting in a matrix where all the doses deposited by the beams have been accumulated.
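The actual implementation is written in C++ on top of the Flair classes, but the per-voxel accumulation can be sketched in Python as below; the array layout, the binning scheme and all names are illustrative assumptions rather than the Flair/FLUKA ones.

import numpy as np

def merge_beam(dose3d, roi_voxels, usrbin2d, beam_origin, beam_dir, r_max, z_max, weight):
    # Accumulate one beam's 2D (R, Z) dose projection into the 3D mesh,
    # visiting only the voxels belonging to the PTV or OAR.
    origin = np.asarray(beam_origin, dtype=float)
    axis = np.asarray(beam_dir, dtype=float)           # unit vector along the beam
    nr, nz = usrbin2d.shape
    for index, pos in roi_voxels:                       # (matrix index, spatial position) pairs
        rel = np.asarray(pos, dtype=float) - origin
        z = float(np.dot(rel, axis))                    # distance along the beam axis
        r = float(np.linalg.norm(rel - z * axis))       # radial distance from the axis
        if 0.0 <= z < z_max and r < r_max:              # inside the cylinder described by the beam
            ir = int(r / r_max * nr)                    # radial bin of the 2D projection
            iz = int(z / z_max * nz)                    # longitudinal bin
            dose3d[index] += weight * usrbin2d[ir, iz]  # accumulate the weighted dose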
It must be noted that this process is independent for each beam and thus, can be done in parallel,
as explained in section 2.3.
The reason each beam is saved as a separate two-dimensional projection, instead of creating the three-dimensional matrix directly with FLUKA, is the huge amount of memory a three-dimensional matrix would occupy: the matrix can have close to a hundred million cells and thus occupy gigabytes of memory. If we needed to store one of these matrices for each beam, the memory requirements of the software would not be practical.
The lower layer is written in C++ and reuses existing Flair code in order to work with the data structures needed to perform the required operations. Therefore, we can visualize the Usrbin projections with Flair, as shown in Figure 4. The fact that it is written in C++ will help us parallelize the program with OpenMP in order to run it on the Intel Xeon Phi².
Figure 4. Visualization of the beams in Flair
² Refer to section 2.3 for more information about the Intel Xeon Phi.
Since we only need to look at the PTV and OAR areas in order to score the fitness function, we
can optimize our program by only adding the dose to the voxels belonging to the PTV or OAR
areas instead of scanning the whole Voxel. In Figure 5 we can observe that only the PTV has
accumulated data:
Figure 5. Dose received by the PTV, visualized in Flair
2.2.1 Fitness Function
The Fitness Function used is a simplification of the one found in (5), where D_i is the dose received by the voxel i, D̂_PTV is the prescribed dose for the PTV and D̂_OAR is the maximum dose the OAR can receive.

\chi(N) = \sum_{i \in PTV} \left( \hat{D}_{PTV} - D_i \right)^2 + \sum_{j \in OAR} \left( \hat{D}_{OAR} - D_j \right)^2 \times \theta\left( \hat{D}_{OAR} - D_j \right)
This function has to be minimized. This means that doses above or below the prescribed one in the PTV area are penalized with a higher fitness score. The voxels in the OAR areas show the same behaviour except when the dose received by a voxel is bigger than D̂_OAR; in this case, the Heaviside function prevents the voxel from contributing to the fitness score.
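A direct Python/NumPy transcription of this simplified fitness function, with illustrative names (the real implementation is part of the C++ lower layer), could read:

import numpy as np

def fitness(dose3d, ptv_mask, oar_mask, d_ptv, d_oar):
    # chi(N) to be minimized; dose3d is the accumulated 3D dose matrix,
    # ptv_mask/oar_mask are boolean masks selecting the PTV and OAR voxels,
    # d_ptv is the prescribed PTV dose and d_oar the maximum allowed OAR dose.
    ptv_term = np.sum((d_ptv - dose3d[ptv_mask]) ** 2)
    oar_dose = dose3d[oar_mask]
    # Heaviside theta(d_oar - D_j): OAR voxels contribute only while D_j <= d_oar,
    # exactly as the formula above is written.
    oar_term = np.sum(((d_oar - oar_dose) ** 2) * (oar_dose <= d_oar))
    return ptv_term + oar_term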
2.3 The Intel MIC architecture
As mentioned in section 2.2, each addition of the beams into the three dimensional matrix can be
done in parallel. Since we can have thousands of beams, this software can benefit from massive
parallelization. In this project, this computing power will be provided by the Intel Xeon Phi. The
lower layer described in section 2.2 will be run completely on it.
The Intel Many Integrated Core (MIC) architecture combines many Intel CPU cores running at
1.1 GHz onto a single chip. Each core is able to handle up to 4 threads using hyperthreading,
achieving a theoretical peak of more than 1 teraFLOPS in double precision.
A key attribute of the microarchitecture is that it is built to provide a general-purpose
programming environment similar to the one offered by the Intel Xeon processor. The Intel Xeon
Phi coprocessors based on the Intel MIC architecture run a full-service Linux operating system, support the x86 memory ordering model and are capable of running applications written in Fortran, C, and C++. Thus, a wide array of multiprocessing libraries is supported, such as MPI, OpenMP and OpenCL.
A program written with OpenMP³ can be compiled to run both on a regular multicore Intel Xeon processor and on the Intel Xeon Phi. Unlike with regular GPUs, there is no need to use external multiprocessing APIs unique to the target platform.
The Intel Xeon Phi is connected to an Intel Xeon processor, also known as the “host”, through a
PCI Express (PCIe) bus. Any user can connect to the coprocessor through a secure shell and
directly run individual jobs or submit batch jobs to it. The coprocessor also supports
heterogeneous applications wherein a part of the application executes on the host while a part
executes on the coprocessor, as shown in Figure 6.
Figure 6. Programming Models of the Intel Xeon Phi. Author: Tim Cramer, RWTH Aachen University.
³ Refer to section 2.3.1.
In this project, we needed to offload the Usrbin classes containing the two-dimensional projections of the beams, so the offload model could not be used, because OpenMP is unable to offload custom C++ classes.
The solution was to separate the two layers of the Pre-Optimizer: the lower layer runs as a TCP/IP server on the Intel Xeon Phi, and the higher layer, running on the host, queries the server through a TCP/IP socket: each time the server is queried with a series of weights, it returns the fitness score of that beam configuration.
Two extra components had to be written in order to interface both layers with this model: one for the server and one for the client, written in C++ and Python respectively.
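A minimal Python sketch of the client side is shown below; the host name, port and wire format (a packed count followed by doubles, answered by a single double) are assumptions made for illustration, since the actual protocol is defined by the C++ server.

import socket
import struct

def query_fitness(weights, host="mic0", port=5000):
    # Send one set of beam weights to the lower-layer server and read back the score.
    payload = struct.pack("<I%dd" % len(weights), len(weights), *weights)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)
        reply = sock.recv(8)                  # a single little-endian double: the fitness score
        (score,) = struct.unpack("<d", reply)
    return score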
2.3.1 OpenMP
OpenMP is a directive-based multiprocessing API for C, C++ and Fortran.
In order to parallelize a sequential program with OpenMP, adding a few pragmas is enough. In our case, some of the added pragmas were the following.
In order to parallelize the main loop, which merges each beam into the three-dimensional matrix, the following code was introduced:

#pragma omp parallel for default(none) shared(weight) schedule(static)
for (int i = 0; i < num_beams; i++) {
    // Each iteration accumulates one beam's Usrbin, scaled by its weight,
    // into the 3D dose matrix; iterations are independent of each other.
    mergeUsrbin(*usrbinArray[i], weight[i]);
}
In order to protect the three-dimensional matrix from data races when the program writes into it, an atomic directive must be placed on the line where the operation occurs:

#pragma omp atomic
_voxel[pos] += weight*gy;  // accumulate this beam's weighted dose into the voxel

The directive causes the instruction to be executed atomically, that is, only one thread at a time can execute it, ensuring that the result accumulated in the variable is correct.
3 Results
3.1 Optimization Algorithms
In order to verify that the optimization algorithms correctly reduce the fitness score, the fitness function was plotted over the course of each run.
The Genetic Algorithm yields the plot shown in Figure 7. The fitness score decreases and, after 2100 evaluations of the fitness function, it converges to the minimum found. In the graph, the yellow area shows the spread between the minimum and maximum scores found in each generation, with the average plotted in blue. As new generations are created, this spread decreases as expected.
Figure 7. Decrease of the fitness score after each generation of the Genetic Algorithm.
The Steepest Descent algorithm generates the plot shown in Figure 8. We can observe that the convergence is much slower than in the genetic algorithm. While the Genetic algorithm reduced the fitness score by 3×10⁶, the Steepest Descent algorithm only reduced it by about 100 points.
While the Genetic algorithm performed about 2100 evaluations before finding a minimum, the
Steepest Descent algorithm performed about 170000 evaluations before getting stuck in a local
minimum, with a disappointing reduction in the fitness score. This means the time the steepest descent algorithm took to reduce the score is eight times longer than the time taken by the genetic algorithm to find a minimum. Clearly, the h parameter used to approximate the derivative and the threshold set for the Golden Section Search need to be tuned in order to deliver better performance.
Figure 8. Decrease of the fitness score produced by the Steepest Descent Algorithm.
3.2 Intel Xeon Phi Performance
The performance of the Intel Xeon Phi was measured against two Intel Xeon E5-2690 v3 running
in the cluster provided by the department.
In the plot shown in Figure 9, we can observe the execution time of a test case on the Intel Xeon
Phi compared to the execution time yielded by the two Intel Xeon E5-2690 v3 as the number of threads used increases. The Intel Xeon Phi can run up to 240 threads, since it has 60 cores and each core can run up to 4 threads. Each Intel Xeon E5-2690 v3 has 12 cores able to run 2 threads each, for a total of 48 threads for the cluster.
The program run in order to perform the test is the lower layer of the pre-optimizer –described in section 2.2– which is fed a series of pre-established weights and returns the corresponding fitness score. The sequential code present in the lower layer will ideally be executed only once, versus the thousands of times the parallel code will need to be executed. Therefore, only the execution times of the parallel code are measured in the following plot; no sequential code is present in the measurements.
In the plot it can be observed that the Intel Xeon Phi, even running with the maximum number of threads, cannot achieve the performance yielded by the two Intel Xeon E5-2690 v3. The maximum speedup at which the Intel Xeon Phi converges is, however, three times higher than the one reached by the two Intel Xeon E5-2690 v3.
Figure 9. Speedup plots on the Intel Xeon Phi and two Intel Xeon E5-2690 v3
The executions of the program on the Intel Xeon Phi with 1 to 10 cores were excluded because of the extremely high execution times yielded. In an environment where an Intel Xeon Phi is
available, it makes no sense to run the program with only a few cores.
4 Conclusions
We can conclude that a randomized algorithm like the Genetic algorithm used in this project is a better fit for this type of Monte Carlo simulation. The Genetic algorithm converged faster than the Steepest Descent algorithm and gave a bigger decrease in fitness score, although some performance tuning still needs to be done on the latter.
The Intel Xeon Phi's performance was considerably inferior to that of high-end processors like the Intel Xeon E5-2690 v3 for this type of application. Software written for the Intel Xeon Phi needs to be heavily tuned for the platform. It is very hard to get good results with the Intel Xeon Phi on applications written to run on both multicore processors and the Intel MIC architecture, even on simple programs with few calculations like the one tested.
5 Future work
The optimization algorithms still need to be tested with different initialization parameters in order
to tune them for this particular case: for example, in the genetic algorithm, a thorough test needs to be done to determine the best crossover operator to use, as well as trying other new crossover operators. Other mutation/crossover ratios or elitism values can also be tested.
In the Steepest Descent algorithm, other values of the parameter h used in the derivative approximation can be tested in order to improve the results obtained.
The code developed is also centred on the patient case provided for this experiment. Therefore, the source code could be improved in order to generalize the software for any kind of treatment. To do that, extra information about each beam must be appended to the Usrbin data in order to know the point on which the beam is centred and the direction vector along which the beam points. The Voxel file also needs to contain the rotations and translations that must be applied to transform the coordinate system used by the Usrbin files into the coordinate system used by the Voxel file.
6 Bibliography
1. Crossover (genetic algorithm). [Online] Wikipedia, The Free Encyclopedia, 6 June 2014. [Cited: 2 September 2015.] https://en.wikipedia.org/wiki/Crossover_%28genetic_algorithm%29.
2. Vekaria, Kant and Clack, Chris. Selective crossover in genetic algorithms: An empirical study. In: Parallel Problem Solving from Nature – PPSN V. Springer, 1998.
3. Tournament selection. [Online] Wikipedia, The Free Encyclopedia, 7 August 2015. [Cited: 2 September 2015.] https://en.wikipedia.org/w/index.php?title=Tournament_selection.
4. Golden section search. [Online] Wikipedia, The Free Encyclopedia, 13 May 2015. [Cited: 2 September 2015.] https://en.wikipedia.org/w/index.php?title=Golden_section_search.
5. Mairani, A., et al. A Monte Carlo-based treatment planning tool for proton therapy. Physics in Medicine and Biology, Vol. 58, No. 8. IOP Publishing, 2013.