Parallel Monte Carlo for Proton Therapy: Using GEANT4 for Plan Valida=on Hywel Owen and Andrew Green School of Physics and Astronomy, University of Manchester & CockcroH Ins=tute for Accelerator Science and Technology Proton Therapy Treatment Planning • Arrange your proton beams such that the target is irradiated. • Either spot scanning (point, shoot, step) or passive scaOered (compensators, bolus etc.). • Avoid cri=cal structures. • In the UK, we will have spot scanning, so focus on that. • Typically require thousands of beam spots to ensure good coverage and conformality. What does a spot look like? • Beam from accelerator (usually a cyclotron) ~ 250 MeV • Degraded to the correct energy. (70-­‐250 MeV) • Usually fairly Gaussian (shape and energy). • Width usually a few mm (sigma). • Usually a ~MeV sigma (energy dependent). • Each spot has a given weight needed to produce the correct dose distribu=on. • Spots are collected into beams – varying energy varies depth. • Easily modeled directly in a Monte Carlo. The Role of Monte Carlo in Proton Therapy • • • • • • Algorithms used in planning tools can have drawbacks. Monte Carlo is too slow to be used directly for planning (maybe not, see later…) Use Monte Carlo to validate plans before delivery Must fit planning workflow Workflow: Import pa=ent CT file as a geometry of voxels. – • Pencil beam algorithm (left) vs Monte Carlo (right). Arrow indicates a range error due to MCS in heterogenous boundary. (Adapted from Paganetti (2008)) Usually a few million, more in head and neck Track protons through the voxels recording where the energy goes. Must simulate enough protons to get good uncertainty (< 1%). – Must be confident in the physics implementa=on. – Needs to fit within a clinical workflow – minutes not hours. – Development elsewhere • Lots of work on GPU-­‐based codes etc. – gPMC (Giantsoudi et. al 2015) – Simplified Monte Carlo (Kohno et. al. 2011) – Track repea=ng (Yepes et. al. 2010) • In all codes simplified subset physics is used, e.g. removing certain processes, such as: – inelas=c nuclear interac=ons – secondary par=cle transport, e.g. neutrons • General feeling: More work needed in valida=ng the physics, par=cularly nuclear interac=ons gPMC – an important compe?tor code! • • • Image grid 1 x 1 x 1.5 mm Dose grid 2 x 2 x 2.5 mm Simula=on rate – gPMC: ~2.6s/MP – TOPAS: 4h/MP • Simula=on =me – 50-­‐500 MP (histories) – gPMC: ~130s to 1300s • Random uncertain=es (depends on site): – gPMC: 0.5-­‐2.4% – TOPAS: 1-­‐2% • Dose difference gPMC/TOPAS: – Around 1% on D98,D50,D02 • This problem is in the simula=on of nuclear interac=ons, exacerbated by the higher energies required in prostate treatment • gPMC can already calculate good accuracy in a few minutes! Giantsoudi et. al. PMB60, 2257 (2015) Simplified Monte Carlo • Energy loss at a given depth from experimental data. • Automa=cally includes all physics except scaOering. • ScaOering calculated from Highland’s formula. • Good agreement with measurement, though not in clinical cases (Kohno et. al. 2011, Kohno et. al. 2002). Dose distribution measured with a silicon strip detector (left) compared with the prediction using a simplified Monte Carlo algorithm (right). Adapted from Kohno, R. 2002. Track Repea?ng Dose distribution projected into each plane. Differences between Geant4 and the GPU track repeating implementation are around 2 % • Create a database with 1 million proton histories. • Stores step length, angle rela=ve to previous step, energy lost and energy deposited for a 251 MeV beam in water and 41 other biological materials. • This data is then scaled for different energies. • Validated against Geant4 – accurate to ~2% • 5.4s per million par=cles (on older hardware). What about Intel? • XeonPhi MIC – Many Integrated Core • 61 cores/244 threads (today) • Only 16GB on-­‐card RAM (today) • Challenging target of <67MB/ thread. • BUT: x86 architecture – – – – – Same source code Same compilers Same development cycle Less debugging! Keep all physics! Model 3120A 7120P # Cores (threads) 57 (228) 61 (224) RAM 6 GB 16 GB Clock Speed 1.100 GHz 1.238 GHz Cost $1695-­‐$1960 $4129 Intel Xeon Phi -­‐ Peculiari?es Xeon Phi ‘Standard’ Xeon • In-­‐order execu=on • Out-­‐of-­‐order execu=on – One instruc=on stalls (e.g. cache miss), the whole program stalls. • Vector Processing – 512-­‐bit SIMD unit – need to make effec=ve use of it! • Low clock speed – 1.100 – 1.238 GHz • Huge thread count – 244 threads – code needs to be lock free. – If an instruc=on stalls, and we can s=ll do something, then do it. • Vector Processing – 128-­‐bit SIMD unit – vectorisa=on will improve performance, but not essen=al. • Higher clock speed – 1.8 – 3.6 GHz • Moderate thread count – Code s=ll needs to be lock free, but less important. GEANT4MT TOPAS – TOol for PArticle Simulation, a proton therapy specific Monte Carlo based on Geant4 (Perl et. al., 2012) GATE – Geant4 Application for Tomographic Emission. Originally written for PET, repurposed to do radiotherapy. (Jan et. al., 2004) • Well known in HEP, well validated (see e.g. (Yarba, 2012)) • Lots of front ends for Medical Physics (TOPAS, GATE, GAMOS etc) • Valida=on in medical applica=ons is good (e.g. (Testa et. al. 2013)) • Recent addi=on of mul=threading opens the possibility of running on Xeon Phi – Per process memory required is ~100s MB – Per thread can be ~10s MB • Useful to have a common code base on all systems • We used stock Geant4MTv10 Sample Calcula?on • • • Used a radiotherapy phantom (Aitkenhead et. al. 2013) to avoid data protec=on issues CT scan image converted to materials 16M voxels Image grid 1 x 1 x 2 mm Dose grid 2 x 2 x 2 mm • Simula=on =me • • • – 50-­‐500 MP (histories) – gPMC: ~130s to 1300s • Two beam angles, each with c.1500 spots; ~3000 GPS sources • Simula=on of 10MP: The Geant4 General Par=cle Source is the obvious choice for simula=ng spots: – Any energy distribu=on. – Any source shape/size. – Any par=cle. – Weighted sampling. However, not previously op=mized for mul=threading – lots of memory required. – ~1% uncertainty in high dose region The dose distribution: 50 cm3 sphere roughly 7 cm below surface of head Memory Problems • Version 10.0 of Geant4 exceeds the 16 GB memory limit with just 46 threads. • Problem is in GPS code Per thread memory required by the code. Extrapolating to 244 threads with version 10.0 requires 86 GB of RAM. – Large C-­‐style arrays used for calcula=ng blackbody spectrum – Copying of en=re distribu=on of sources into each thread • Neutron cross sec=ons also eat memory Memory Solu?ons • Reduce the size of the class: – Switch C-­‐style arrays for STL containers where possible – Use dynamic alloca=on otherwise • Share data between threads: – Things that don’t change during a run – Source posi=ons, energies, intensi=es etc. • Use an approxima=on to neutron cross sec=ons – negligible impact on accuracy (see Asai et. al. 2014) Calcula?on Speed • Now the simula=on can run, how fast does it go? – One 7120P card – 1h 54m for 10MP – Two 7120P cards – 1h 1m – Best performance – 366 s/MP – (cf. 2.6s/MP for gMPC) – 2 x XeonPhi is $13,500 Speedup as a function of number of threads. Ratio shows the speedup over the number of physical cores – it should be 1 until hyper-threading starts. • Speed Comparisons: – AMD Opteron machine (48 threads, 2.6 GHz) gives similar speed at $6,000 – Single card gPMC about 100x faster Bottom Line: Lose about 100x speed cf. GPU, but keeps all physics. Other Op?misa?ons • Vectorised PRNG from the Intel MKL – Small difference: ~3% faster. • Compiler op=ons – Kind of a dark art – Poten=ally large improvements in performance for very liOle effort. – Can cause increase in compila=on =me Future Direc?ons – Lazy Approach • New Xeon Phi model due out in 2015: – Improved out-­‐of-­‐order execu=on – Higher core count, faster clock speed – More cache • Compiler improvements – Auto-­‐vectorisa=on is con=nuously improving – New op=miza=on op=ons may have considerable impact Model Name Knights Corner Knights Landing Max # cores (threads) 61 (244) >60 (?) Interface PCI-­‐e Socket, PCI-­‐e, InfiniBand RAM 16 GB 16 GB ‘near’, up to 384GB ‘far’ Peak TFLOPs (doubles) ~1 ~3 (?) If we do nothing, we can expect a 2-3x improvement in performance by the end of the year. Bottom Line: Wait 6 months and XeonPhi will be as good as a cluster, $ for $ Future Direc?ons – Ac?ve Approach • Improvements to boundary crossing algorithm, about 3x • Variance reduc=on techniques • Further memory op=miza=on: – Minimize cache misses – Minimize footprint • Vectorisa=on: – Vectorised geometry naviga=on (eg USolids) – Vectorised PRNG (either MKL or home-­‐brewed). • Should be beneficial for all plaxorms/applica=ons Apostolakis et al. (CHEP 2013), arxiv:1312.0816 Summary • Have op=mised GEANT4MT (10.1) for XeonPhi • Performance per card today slower than gPMC – but keeps all physics! – Speed/cost will be faster than cluster by end of 2015 • Expect following improvements: – – – – Hardware improvements: x3 (end 2015) Boundary crossing algorithm: x3 (work in progress) Variance reduc=on: x? BeOer vectorisa=on: x1.1? • Overall will be only 10x slower than GPU in about 6 months’ =me – Worth paying for to get full accuracy – Iden=cal calcula=on on any machine install References Aitkenhead et. al. 2013, “Marvin: an anatomical phantom for dosimetric evalua=on of complex radiotherapy of the head and neck.” Physics in medicine and biology 58(19), 6915. Asai et. al. 2014, “Recent developments in Geant4”, Annals of Nuclear Energy. Giantsoudi, D et. al. 2015, “Valida=on of a GPU-­‐based Monte Carlo code (gPMC) for proton radia=on therapy: clinical cases study.” Physics in medicine and biology, 60(6), 2257. Jan, S. et. al. 2004, “GATE: a simula=on toolkit for PET and SPECT.” Physics in medicine and biology, 49(19) 4543. Kohno, R. et. al. 2011, “Clinical implementa=on of a GPU-­‐based simplified Monte Carlo method for a treatment planning system of proton beam therapy.” Physics in medicine and biology 56(22), N287. Kohno, R. et. al. 2002, “Simplified Monte Carlo dose calcula=on for therapeu=c proton beams.” Japanese journal of applied physics, 41(3A), L294 Pagane|, H 2008, “Clinical implementa=on of full Monte Carlo dose calcula=on in proton beam therapy.” Physics in medicine and biology, 53(17), 4825. Perl, J et. al. 2012, “TOPAS: an innova=ve Monte Carlo plaxorm for research and clinical applica=ons.” Medical Physics, 39(11) 6818-­‐6837. Testa, M. et. al. 2013 “Experimental valida=on of the TOPAS Monte Carlo system for passive scaOering proton therapy.” Medical physics, 40(12) 121719. Yarba, J. 2012, “Recent developments and valida=on of Geant4 hadronic physics.” In Journal of Physics: Conference series (Vol. 396, No. 2, p 022060). Yepes, P et. al. 2010, “A GPU implementa=on of a track-­‐repea=ng algorithm for proton radiotherapy dose calcula=ons.” Physics in medicine and biology, 55(23), 7107. Usolids library -­‐ hOp://aidasoH.web.cern.ch/USolids