Kalman Filter improvements using GPGPU and autovectorization for online LHCb triggers

August 2015

Author: Jimmy Aguilar Mena (Master Student in High Performance Computing, ICTP-SISSA)

Supervisors: Daniel Hugo Campora Perez, Manuel Tobias Schiller, Niko Neufeld

CERN openlab Summer Student Report 2015

Project Specification

This project concerns the fields of autovectorization and GPGPU programming for the Gaudi framework of the LHCb experiment at CERN. This report summarises the results and progress of several autovectorized, OpenCL and CUDA® implementations of a typical Kalman Filter function that could improve current or future versions of Gaudi.

Abstract

LHCb is a single-arm forward spectrometer at the LHC collider, designed to do precision studies of beauty and charm decays, among others. The first step is the reconstruction of tracks in the vertex detector with a Kalman Filter. Reducing the time spent in this step means freeing up resources to do a more sophisticated reconstruction of events with a displaced vertex, leading to more efficient triggers. These challenges will become even more important for the LHCb upgrade.

Gaudi is an architecture and framework for event processing applications in the LHCb experiment at CERN. The Kalman Filter routine is an important section of the code that can take around 10% of the calculation time in some cases, and it is extensively used in many other applications. This project is an initial study of some proposed optimizations and modifications to improve the performance of the Kalman Filter routine in a Gaudi function, using autovectorization and general-purpose GPU programming with OpenCL and CUDA®.

Table of Contents

1 Introduction
  1.1 Gaudi Framework
  1.2 GPGPU
  1.3 Kalman Filter
  1.4 Objectives
2 Porting the Code
3 Contiguous memory implementation (SOA)
  3.1 Implementation details
  3.2 Benchmarking
4 GPGPU implementations
  4.1 Benchmarking
5 OpenCL vs serial code on CPU
6 Improving the code
7 Conclusions
8 Bibliography
1 Introduction

LHCb is a single-arm forward spectrometer at the LHC collider, designed to do precision studies of beauty and charm decays, among others. These decays feature a displaced vertex as their signature. In order to isolate these interesting decay signatures among the huge number of collisions, sophisticated software triggers are required. The first step towards identifying these secondary vertices is the reconstruction of tracks in the vertex detector with a Kalman Filter, which takes around 10% of the CPU time available in the high level trigger (HLT). Reducing this means freeing up resources to do a more sophisticated reconstruction of events with a displaced vertex, leading to a more efficient trigger. These challenges will become even more important for the LHCb upgrade.

1.1 Gaudi Framework

Gaudi is an architecture and framework for event processing applications (simulation, reconstruction, etc.). Initially developed for LHCb, it has been adopted and extended by ATLAS and adopted by several other experiments, including GLAST and HARP (Clemencic, 2015).

Currently Gaudi is being updated to use modern hardware more efficiently, while performance and compatibility tests are being developed. The upgrades under consideration include not only the usage of features like the SSE4.2 and AVX instruction sets on the CPU, but also running on different architectures, like general-purpose computing on graphics processing units (GPGPU).

1.2 GPGPU

GPGPU is the use of a graphics processing unit (GPU) to perform computation in applications traditionally handled by the central processing unit (CPU) (Fung, Tang, & Mann, 2001). The use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. In addition, even a single GPU-CPU framework provides advantages that multiple CPUs on their own do not offer, due to the specialization of each chip.

Some test versions of Gaudi are in development using CUDA® and OpenCL. Fundamental to the performance of Gaudi's code are the different implementations of the Kalman Filter, since it currently takes between 10% and 25% of the total computation time.

1.3 Kalman Filter

The Kalman Filter (Kalman, 1960) addresses the general problem of trying to estimate the state x ∈ ℝ^n of a discrete-time controlled process that is governed by the linear stochastic difference equation

    x_k = A x_{k-1} + B u_{k-1} + w_{k-1}    (1.1)

where:

- A is the state transition matrix, which applies the effect of each system state parameter at step k-1 to the system state at step k;
- B is the control input matrix, which applies the effect of each control input parameter in the vector u;
- u_k is the vector containing any control inputs;
- w_k is the vector containing the process noise terms for each parameter in the state vector;

with a measurement z ∈ ℝ^m that is

    z_k = H x_k + v_k    (1.2)

where H is the transformation matrix that maps the state vector parameters into the measurement domain. The random variables w_k and v_k represent the process and measurement noise respectively. They are assumed to be independent of each other, with normal probability distributions (Welch & Bishop, 2006):

    p(w) = N(0, Q)    (1.3)
    p(v) = N(0, R)    (1.4)
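To make the recursion above concrete, the following is a minimal sketch of one predict/update cycle for a one-dimensional state, with variable names mirroring the symbols in equations (1.1)-(1.4). It illustrates the general algorithm only, not the Gaudi implementation.

```cpp
#include <cstdio>

// State estimate and its covariance (a plain variance in 1D).
struct KalmanState {
    double x; // state estimate
    double P; // estimate variance
};

// One predict/update step of a scalar Kalman Filter. A, B and H are
// scalars here; Q and R are the process and measurement noise
// variances of equations (1.3) and (1.4).
KalmanState kalmanStep(KalmanState s, double u, double z,
                       double A, double B, double H,
                       double Q, double R)
{
    // Predict: x_k = A x_{k-1} + B u_{k-1}  (noise-free part of eq. 1.1)
    double xPred = A * s.x + B * u;
    double PPred = A * s.P * A + Q;

    // Update with the measurement z_k = H x_k + v_k  (eq. 1.2)
    double K = PPred * H / (H * PPred * H + R); // Kalman gain
    s.x = xPred + K * (z - H * xPred);
    s.P = (1.0 - K * H) * PPred;
    return s;
}

int main()
{
    KalmanState s{0.0, 1.0};
    // Feed a few noisy measurements of a constant state (A=1, B=0, H=1).
    const double zs[] = {0.9, 1.1, 1.05, 0.95};
    for (double z : zs)
        s = kalmanStep(s, 0.0, z, 1.0, 0.0, 1.0, 1e-4, 1e-2);
    std::printf("estimate = %f, variance = %f\n", s.x, s.P);
    return 0;
}
```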
1.4 Objectives

The main objective of this work is to implement, optimize and benchmark some Kalman Filter codes using different optimization techniques and tools for CPU and GPU. In order to accomplish this objective, the following tasks were assigned:

- Port the code of interest out of the Gaudi framework, guaranteeing compatibility and consistency of the results.
- Implement and benchmark an improved version for the CPU that can be interfaced easily with a GPU implementation.
- Implement and benchmark versions of the Kalman Filter using OpenCL and CUDA®, and compare their performance and accuracy.
- Improve or correct the GPU implementations of the code where possible, and compare them with the CPU versions.

2 Porting the Code

The current version of Gaudi implements a PrPixel class that reconstructs tracks using a Kalman Filter. This was the filter selected for the case study, and the original code was modified to create a file with the input and output data of the filter, so that the implementations could be tested after porting.

The Gaudi framework relies heavily on Object Oriented Programming (OOP), and the implementation of the PrTrack class consists basically of a std::vector of objects that contain the position data (x, y, z) and the error data (tx, ty). (See the Gaudi doxygen documentation for more details.)

The serialization requires different levels of detail, because the default serialization does not save the information contained inside arrays or vectors. Normal text files were used for both outputs to facilitate the accuracy tests.

The original class-based implementations and all their dependencies were ported outside Gaudi (copied) or re-implemented (cloned), and the unused features were deleted. Only one extra function was inserted, to import the data from the output file generated in Gaudi; the same serialization function for the output was reused here. (See the bad.h and bad.cpp files in the final project repository on bitbucket.org.)

As both codes were run on the same hardware, the results should be identical, so the consistency test was made easily using the diff tool.

3 Contiguous memory implementation (SOA)

3.1 Implementation details

The main performance hit in the original code was due to the way the data are stored in memory --in an array of structures-- and to the excessive use of getter/setter methods, which can have a significant impact on performance. These functions cannot always be inlined by the compiler, mainly when they handle pointers or references, and the generated code also depends on the optimization level and the encapsulation. Irrespective of the compiler used, autovectorization cannot give good results, since the storage layout does not match the hardware capabilities. As the arrays of structs (AOS) are used at different levels, there are many internal pointers potentially aliasing the same memory location, which prevents autovectorization or reordering strategies by the compiler.

The first version of the program was improved for the CPU using structs of arrays (SOA) and declaring friendship relations between the most closely related objects. All the hits were rearranged in memory to be contiguous. In this way autovectorization and cache usage are improved simultaneously.
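As an illustration of this layout change, the sketch below contrasts the original AOS storage with an SOA arrangement. The type and field names are hypothetical, chosen only to mirror the hit data described above (x, y, z, tx, ty).

```cpp
#include <cstddef>
#include <vector>

// Original style: array of structures (AOS). Reading the x coordinate
// of consecutive hits strides over whole structs, which defeats
// autovectorization and wastes cache lines.
struct Hit { double x, y, z, tx, ty; };
using HitsAOS = std::vector<Hit>;

// Contiguous layout: structure of arrays (SOA). Each field is its own
// contiguous buffer, so a loop over hits[i].x becomes a unit-stride
// loop the compiler can vectorize.
struct HitsSOA {
    std::vector<double> x, y, z, tx, ty;
};

// Rearranging the hits to be contiguous (the conversion step whose
// cost is benchmarked together with the filter below).
HitsSOA rearrange(const HitsAOS& in)
{
    HitsSOA out;
    for (const Hit& h : in) {
        out.x.push_back(h.x);  out.y.push_back(h.y);
        out.z.push_back(h.z);
        out.tx.push_back(h.tx); out.ty.push_back(h.ty);
    }
    return out;
}

// A loop the autovectorizer can handle with the SOA layout: unit
// stride, no aliasing between the distinct field buffers.
double sumX(const HitsSOA& hits)
{
    double s = 0.0;
    for (std::size_t i = 0; i < hits.x.size(); ++i)
        s += hits.x[i];
    return s;
}
```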
This is also better for the GPU, because the number of copy operations to the device is reduced, which is very important considering that the latency of a host-device copy operation is huge. This kind of storage also enables the code to process many events at the same time on the CPU or on the GPU. From the GPGPU point of view this makes sense as long as the data fit in memory, because it reduces the number of times the kernels have to be loaded and initialized, the host-device copy operations, and the associated latencies.

Since the data come from outside in the same layout used by the original code (hit by hit), the time needed to rearrange the data to be contiguous, plus the filter algorithm time, was also benchmarked.

3.2 Benchmarking

The time needed to apply the Kalman Filter to a known number of tracks is a good measure of the efficiency of the code. The number of tracks per event is not very big, and for processing on the GPU it is better to use as many tracks as possible; hence processing many events at a time makes much more sense.

Figures 3.1 and 3.2 show how the time and the speedup depend on the number of events for the original code and the contiguous memory implementation. The baseline used to calculate the speedup is the original implementation of the code cloned outside Gaudi. As can be seen, the best solution is to change the default storage layout from the beginning, so that the conversion step is no longer needed just for the filter. The performance improvement would then benefit all the code, and the copy operations would be negligible compared with the other operations.

Figure 3.1: Time vs number of tracks in serial code using different implementations on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz.

Figure 3.2: Speedup with respect to the original code using serial implementations on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz.

4 GPGPU implementations

GPGPU codes were implemented using CUDA® and OpenCL, reusing the implementation based on aligned memory. At this stage, two possible versions were considered for both languages, taking into account that in the original code the Kalman Filter can be applied to the x and y projections independently. The first version processes one track per thread and applies both filters serially, while in the second version each axis is handled by a different thread, so two threads are needed to process a full track (a kernel sketch follows section 4.1 below). The second version was expected to perform better, because the number of global memory accesses and of calculations per thread is smaller than in the first implementation. Every thread copies into private memory the values it needs before starting the calculations; this is not a problem, because the number of hits is always lower than 24 and the data fit in private memory.

4.1 Benchmarking

In this measurement the copy time to and from the GPU was not included, because in the real implementation the data are supposed to be already in device memory when the filter is applied, and other calculations still need to be performed afterwards. The time was measured using both CPU and GPU methods, but the kernel execution measurement methods differ between CUDA® and OpenCL, and they also depend on the implementation and the hardware, mainly for OpenCL. For this report only the standard CPU measurement is used, but all the data collected for this benchmark are available in the project repository.
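The kernel sketch referenced above covers the first scheme (one track per thread, both projections filtered serially). It is a hypothetical CUDA® fragment: the buffer names, the offset convention and the simplified per-hit update are assumptions for illustration, with the real filter update replacing the stand-in lines. In the second scheme the same grid would be launched with two threads per track, each handling a single projection.

```cpp
// Hypothetical one-track-per-thread kernel. The hits are assumed to be
// stored in contiguous SOA buffers already resident on the device, with
// trackOffsets[t]..trackOffsets[t+1] delimiting the hits of track t.
__global__ void kalmanPerTrack(const double* xs, const double* ys,
                               const int* trackOffsets, int nTracks,
                               double* outX, double* outY)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nTracks) return;

    int begin = trackOffsets[t];
    int end   = trackOffsets[t + 1]; // always fewer than 24 hits

    // Both projections are filtered serially by the same thread.
    double x = xs[begin];
    double y = ys[begin];
    for (int i = begin + 1; i < end; ++i) {
        x = 0.5 * (x + xs[i]); // stand-in for the x-projection update
        y = 0.5 * (y + ys[i]); // stand-in for the y-projection update
    }
    outX[t] = x;
    outY[t] = y;
}
```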
Figure 4.1: Execution time for the Kalman Filter implementations on a GPU (CPU time) on an nVidia® GeForce GTX 690 graphics card.

In Figure 4.1 there are three interesting details to take into account:

1. The first implementation using OpenCL is more efficient than the second, contrary to what was expected.
2. The behaviour using CUDA® is exactly as expected: the second implementation is notably better than the first.
3. When the number of threads is not too big (~5000 threads), the OpenCL performance is always better.

5 OpenCL vs serial code on CPU

As many devices have OpenCL support, it is possible to run the OpenCL code on the same CPU as the serial code and compare the two performances (Figure 5.1). A comparison between the GPU and CPU executions makes no sense from the code evaluation point of view, because they are quite different pieces of hardware.

The big variations in the speedup values (Figure 5.2) are related to statistical fluctuations and to the fact that on the current system there is no way to guarantee exclusive use of the server's resources. All of this hurts the precision of the runtime measurements. In general, however, it is possible to conclude that for a very big number of tracks the speedup is around 20x (±10x) with respect to the initial algorithm on a node with 40 cores.

Figure 5.1: Execution time vs number of tracks on the CPU using OpenCL and all the serial code versions on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz.

Figure 5.2: Kalman Filter speedup using OpenCL on the CPU with respect to the original code on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz.

6 Improving the code

Some improvements were proposed after the previous results in order to obtain better calculation times. The unexpected behaviour of OpenCL with more parallelization may be associated with the number of registers in the device.

Many other tests were made with the code, but no important differences in performance were observed. The only modification that made a real difference was changing the loop inside the kernel that copies the hit information from global to private memory in every thread: it was replaced with a pointer to global memory, because every value is accessed only once. The improvement can be associated with the fact that some calculations can be overlapped with global memory accesses during kernel execution. It is important to remark that the memory access model used in the implementation is not the most efficient one, but the most practical one for the possible real application.

From Figure 6.2 it can be seen that, using pointers, the performance of the doubly parallel CUDA® code is very close to both OpenCL versions. The difference in performance between the codes is reduced using OpenCL, but the inverted behaviour with respect to the CUDA® one still persists.

Figure 6.1: Kernel execution time comparison using the copy loop vs a pointer to global memory in CUDA®.

Figure 6.2: Kernel execution time comparison using the copy loop vs a pointer to global memory in OpenCL.
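The two access patterns compared in Figures 6.1 and 6.2 can be sketched as follows. Both kernels are hypothetical CUDA® fragments with assumed names and a stand-in update; only the way the hit data are read differs.

```cpp
#define MAX_HITS 24 // the number of hits per track never reaches 24

// Original pattern: every thread first copies its hits from global
// into private (local) memory, paying all the loads up front.
__global__ void filterWithPrivateCopy(const double* xs, const int* off,
                                      int nTracks, double* out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nTracks) return;

    double local[MAX_HITS];
    int n = off[t + 1] - off[t];
    for (int i = 0; i < n; ++i)
        local[i] = xs[off[t] + i];

    double acc = local[0];
    for (int i = 1; i < n; ++i)
        acc = 0.5 * (acc + local[i]); // stand-in for the filter update
    out[t] = acc;
}

// Improved pattern: every value is accessed only once, so a pointer
// into global memory suffices, and the loads can overlap with the
// computation instead of being paid up front.
__global__ void filterWithGlobalPointer(const double* xs, const int* off,
                                        int nTracks, double* out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nTracks) return;

    const double* hit = xs + off[t];
    int n = off[t + 1] - off[t];

    double acc = hit[0];
    for (int i = 1; i < n; ++i)
        acc = 0.5 * (acc + hit[i]); // stand-in for the filter update
    out[t] = acc;
}
```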
7 Conclusions

The needed code was ported out of Gaudi using very simple techniques. This made it easier to develop and benchmark the newly implemented code, and it made the benchmark results more precise.

- The SOA implementation of the code can be easily interfaced with other implementations, but it also gives a good performance improvement with the fully serial code.
- For this routine, the GPU implementations using CUDA® and OpenCL can provide very good speedups compared with the initial serial code, not taking into account the time to move the data between host and device.
  o The OpenCL code also gives good performance compared with the serial code running on the same hardware.
  o The OpenCL and CUDA® codes show very different behaviours when the parallelization levels of the code are changed.
- The different memory access optimizations impact the performance much more than the calculation optimizations.

8 Bibliography

Clemencic, M. (2015, January 14). LHCb Software Tutorials. Retrieved July 2015, from https://twiki.cern.ch/twiki/bin/view/LHCb/LHCbSoftwareTutorials

Fung, J., Tang, F., & Mann, S. (2001). Mediated Reality Using Computer Graphics Hardware for Computer Vision. Proceedings of the International Symposium on Wearable Computing 2002 (pp. 83-89). Seattle, Washington, USA.

Kalman, R. E. (1960). A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME - Journal of Basic Engineering, 35-45.

Welch, G., & Bishop, G. (2006). An Introduction to the Kalman Filter. Retrieved 2015, from Department of Computer Science, University of North Carolina: http://www.cs.unc.edu/~tracker/ref/s2001/kalman/