Kalman Filter improvements using
GPGPU and autovectorization for
online LHCb triggers
August 2015
Author:
Jimmy Aguilar Mena*
*Master Student in High Performance Computing ICTP-SISSA
Supervisor(s):
Daniel Hugo Campora Perez
Manuel Tobias Schiller
Niko Neufeld
CERN openlab Summer Student Report 2015
Project Specification
This project concerns the field of autovectorization and GPGPU programming for the Gaudi
framework of the LHCb experiment at CERN. This report summarises the results and progress of
some autovectorized, OpenCL, and CUDA® implementations of a typical Kalman Filter function
that could improve current or future versions of Gaudi.
Abstract
LHCb is a single-arm forward spectrometer at the LHC collider, designed to do precision studies
of beauty and charm decays, among others. The first step is the reconstruction of tracks in the
vertex detector with a Kalman Filter. Reducing the cost of this reconstruction means freeing up
resources to do a more sophisticated reconstruction in events with a displaced vertex, leading to
a more efficient trigger. These challenges will become even more important for the LHCb upgrade.
Gaudi is an architecture and framework for event processing applications in the LHCb
experiment at CERN. The Kalman filter routine is an important section of the code, which can
take around 10% of the calculation time in some cases, and it is extensively used in many other
applications. This project is an initial study of some proposed optimizations and modifications to
improve the performance of the Kalman filter routine in a Gaudi function, using
autovectorization and general-purpose GPU programming with OpenCL and CUDA®.
Table of Contents
1 Introduction
  1.1 Gaudi Framework
  1.2 GPGPU
  1.3 Kalman Filter
  1.4 Objectives
2 Porting the Code
3 Contiguous memory implementation (SOA)
  3.1 Implementation details
  3.2 Benchmarking
4 GPGPU implementations
  4.1 Benchmarking
5 OpenCL vs serial code on CPU
6 Improving the code
7 Conclusions
8 Bibliography
1 Introduction
LHCb is a single-arm forward spectrometer at the LHC collider, designed to do precision studies
of beauty and charm decays, among others. These decays feature a displaced vertex as a signature.
In order to isolate these interesting decay signatures among the huge number of collisions,
sophisticated software triggers are required.
The first step towards identifying these secondary vertices is the reconstruction of tracks in the
vertex detector with a Kalman Filter. This takes around 10% of the CPU time available in the
High Level Trigger (HLT).
Reducing this means freeing up resources to do a more sophisticated reconstruction in events with
a displaced vertex, leading to a more efficient trigger. These challenges will become even more
important for the LHCb upgrade.
1.1 Gaudi Framework
Gaudi is an architecture and framework for event processing applications (simulation,
reconstruction, etc.). Initially developed for LHCb, it has been adopted and extended by ATLAS
and adopted by several other experiments, including GLAST and HARP (Clemencic, 2015).
Currently Gaudi is being updated to use modern hardware more efficiently, while performance
and compatibility tests are being developed. The upgrades being considered include not only the
use of features like the SSE4.2 and AVX instruction sets on the CPU, but also running on different
architectures, like general-purpose computing on graphics processing units (GPGPU).
1.2 GPGPU
GPGPU is the use of a graphics processing unit (GPU) to perform computation in applications
traditionally handled by the central processing unit (CPU) (Fung, Tang, & Mann, 2001). The
use of multiple graphics cards in one computer, or large numbers of graphics chips, further
parallelizes the already parallel nature of graphics processing. In addition, even a single
GPU-CPU framework provides advantages that multiple CPUs on their own do not offer, due to
the specialization of each chip.
Some test versions of Gaudi are in development using CUDA® and OpenCL. The different
implementations of the Kalman Filter are fundamental to the performance of Gaudi's code, since
it currently takes between 10% and 25% of the total computation time.
1.3 Kalman Filter
The Kalman Filter (Kalman, 1960) addresses the general problem of trying to estimate the state
x ∈ ℝⁿ of a discrete-time controlled process that is governed by the linear stochastic difference
equation:

x_k = A x_{k-1} + B u_{k-1} + w_{k-1}    (1.1)

where:

A is the state transition matrix, which applies the effect of each system state parameter at
step k-1 on the system state at step k.
B is the control input matrix, which applies the effect of each control input parameter in
the vector u.
u_k is the vector containing any control inputs.
w_k is the vector containing the process noise terms for each parameter in the state vector.

With a measurement z ∈ ℝᵐ that is:

z_k = H x_k + v_k    (1.2)

where:

H is the transformation matrix that maps the state vector parameters into the
measurement domain.

The random variables w_k and v_k represent the process and measurement noise, respectively.
They are assumed to be independent of each other, with normal probability distributions
(Welch & Bishop, 2006):

p(w) = N(0, Q)    (1.3)
p(v) = N(0, R)    (1.4)
1.4 Objectives
The main objective of this work is to implement, optimize and benchmark some Kalman
Filter codes using different optimization techniques and tools for CPU and GPU.
In order to accomplish this objective, the following tasks were assigned:

• Port the code of interest out of the Gaudi framework, guaranteeing compatibility and
consistency of the results.
• Implement and benchmark an improved version for CPU that can be easily interfaced
with a GPU implementation.
• Implement and benchmark versions of the Kalman Filter using OpenCL and CUDA®,
and compare their performance and accuracy.
• Improve or correct the GPU implementations where possible and compare them with
the CPU versions.
2 Porting the Code
The current version of Gaudi implements a PrPixel class that reconstructs tracks using a Kalman
Filter. This was the filter selected for the case study, and the original code was modified to create
an output file with the input and output data of the filter, in order to test the implementations after
porting.
The Gaudi framework relies heavily on Object Oriented Programming (OOP), and the
implementation of the PrTrack class consists basically of a std::vector of objects that contain the
position data (x, y, z) and the error data (tx, ty). (See the Gaudi doxygen documentation
for more details.)
The serialization requires different levels of detail because the default serialization does not save
the information contained inside arrays or vectors. Normal text files were used for both outputs to
facilitate the accuracy tests.
The original class-based implementations and all their dependencies were ported outside Gaudi
(copied) or re-implemented (cloned), and the unused features were deleted. Only an extra function
was inserted to import the data from the output file generated in Gaudi, and the same serialization
function for the output was reused here. (See the bad.h and bad.cpp files in the final project
repository on bitbucket.org.) As both codes were run on the same hardware, the accuracy should
be the same, and the consistency test was made easily using the diff tool.
3 Contiguous memory implementation (SOA)
3.1 Implementation details
The main performance hit in the original code was due to the storage of the data in memory --in an
array of structures-- and the excessive use of getter/setter methods, which can have a significant
impact on performance. These functions cannot always be inlined by the compiler, mainly when
they handle pointers or references.
The generated code also depends on the optimization level and encapsulation. Irrespective of the
compiler used, autovectorization cannot give good results, since the storage layout does not
match the hardware capabilities. As the arrays of structs (AOS) are used on different levels, there
are many internal pointers potentially aliasing the same memory location; this prevents
autovectorization or reordering strategies by the compiler.
The first version of the program was improved for the CPU using structs of arrays (SOA) and
declaring friendship relations between the most closely related objects. All the hits were rearranged
in memory to be contiguous. This way autovectorization and cache usage are improved
simultaneously. This is also better for the GPU, because the number of copy operations to the
device is reduced, which is very important taking into account that the latency of host-device
copy operations is huge.
This kind of storage enables the code to process many events at the same time on the CPU or the
GPU. This makes sense from the GPGPU point of view if the data fit in memory, because it
reduces the number of times the kernels have to be loaded and initialized, the host-device copy
operations, and the associated latencies.
Since the data arrives from outside in the same way it is stored in the original code (hit by hit), a
benchmark of the time needed to rearrange the data to be contiguous, plus the filter algorithm
time, was also made.
3.2 Benchmarking
The time needed to apply the Kalman Filter to a known number of tracks is a good measurement
of the efficiency of the code. The number of tracks in every event is not very big, and for processing
on the GPU it is better to use as many tracks as possible. Hence processing many events at a
time makes much more sense.
Figure 3.1 and Figure 3.2 show how the time and speedup depend on the number of
events for the original code and the contiguous-memory implementation. The baseline used to
calculate the speedup is the original implementation of the code cloned outside Gaudi. As can be
seen, the best solution is to change the default storage layout from the beginning, so the
conversion step is no longer needed just for the filter. The performance improvement will then
benefit all the code, and the copy operation becomes negligible compared with the other operations.
Figure 3.1: Time vs number of tracks in serial code using different implementations on Intel(R) Xeon(R) CPU
E5-2670 v2 @ 2.50GHz.
Figure 3.2: Speedup with respect to the original code using serial implementations on Intel(R) Xeon(R) CPU
E5-2670 v2 @ 2.50GHz.
4 GPGPU implementations
GPGPU codes were implemented using CUDA® and OpenCL, and the already implemented code
using aligned memory was reused. At this stage two possible versions in both languages were
considered, taking into account that in the original code the Kalman filter can be applied to the x
and y projections independently.
The first version processes one track per thread and applies both filters serially, while in
the second version every axis is handled by a different thread, so two threads are needed to process
a full track.
The expected behaviour is that the second version will perform better, because the accesses to
global memory and the number of calculations per thread are smaller than in the first
implementation. Every thread copies into private memory the values it needs before starting the
calculations. This is not a problem, because the number of hits is always lower than 24 and the
data fit in private memory.
4.1 Benchmarking
In this measurement the copy time to and from the GPU was not included, because in the real
implementation the data are supposed to be already in device memory when the filter is applied,
and some other calculations still need to be made there. The time was measured using both CPU
and GPU methods, but the kernel execution measurement methods differ between CUDA® and
OpenCL, and also depend on the implementation or the hardware, mainly for OpenCL. For this
report only the standard CPU measurement is used, but all the data collected for this benchmark
are available in the project repository.
Figure 4.1: Execution time for the Kalman Filter implementations on a GPU (CPU time) on an nVidia® GeForce
GTX 690 graphics card.
In Figure 4.1 there are 3 interesting details to take into account.
1. The first implementation using OpenCL is more efficient than the second, contrary to
what was expected.
2. The behaviour using CUDA® is exactly as expected: the second implementation is
notably better than the first.
3. When the number of threads is not too big (~5000 threads), the OpenCL performance is
always better.
5 OpenCL vs serial code on CPU
As many devices have OpenCL support, it is possible to run the implemented OpenCL code
on the same CPU as the serial code and compare the two performances (Figure 5.1). The
comparison between the GPU and CPU executions makes no sense from the code evaluation
point of view, because they are quite different hardware.
The big variations in the speedup values (Figure 5.2) are related to statistical fluctuations and
to the fact that in the current system there is no way to guarantee exclusive use of the server's
resources. All this has a bad impact on the precision of the runtime measurements. In general,
though, it is possible to conclude that for a very big number of tracks the speedup is around
20x (±10x) with respect to the initial algorithm on a node with 40 cores.
Figure 5.1: Execution time vs number of tracks on the CPU using OpenCL and all the serial code versions on
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz.
Figure 5.2: Kalman Filter speedup using OpenCL on the CPU with respect to the original code on Intel(R)
Xeon(R) CPU E5-2670 v2 @ 2.50GHz.
6 Improving the code
Some improvements were proposed after the previous results in order to obtain better calculation
times. The unexpected behaviour of OpenCL with more parallelization can be associated with the
number of registers in the device. Many other tests were made with the code, but no important
differences in performance were observed.
The only modification that made a real difference was replacing the per-thread copy loop that
moved the hit information from global to private memory inside the kernel. It was substituted
with a pointer to global memory, because every value is accessed only once. The improvement
can be associated with the fact that some calculations can be overlapped with global memory
accesses during the kernel execution. It is important to remark that the memory access model
used in the implementation is not the most efficient one, but the most practical one for the
possible real application.
From the graph (Figure 6.2) it can be seen that, using pointers, the performance of the doubly
parallel CUDA® code is very close to both OpenCL versions. The performance difference between
the codes is reduced using OpenCL, but the inverted behaviour with respect to CUDA® still
persists.
Figure 6.1: Kernel execution time comparison using a copy loop vs a pointer to global memory in CUDA®.
Figure 6.2: Kernel execution time comparison using a copy loop vs a pointer to global memory in OpenCL.
7 Conclusions
• The needed code from Gaudi was ported using very simple techniques. This made it
easier to develop and benchmark the newly implemented code and made the benchmark
results more precise.
• The SOA implementation of the code can be easily interfaced with other
implementations, and it also gives a good performance improvement for fully serial
code.
• For this routine, the GPU implementations using CUDA® and OpenCL can provide a
very good speedup compared with the initial serial code, not taking into account the
time to move the data between host and device.
  o The OpenCL code also gives good performance compared with the serial code
running on the same hardware.
  o The OpenCL and CUDA® codes show very different behaviours when the
parallelization level of the code is changed.
• The different memory access optimizations impact the performance much more than
the calculation optimizations.
8 Bibliography
Clemencic, M. (2015, January 14). LHCb Software Tutorials. Retrieved July 2015, from
https://twiki.cern.ch/twiki/bin/view/LHCb/LHCbSoftwareTutorials
Fung, J., Tang, F., & Mann, S. (2001). Mediated Reality Using Computer Graphics Hardware for
Computer Vision. Proceedings of the International Symposium on Wearable Computing
2002 (pp. 83-89). Seattle, Washington, USA.
Kalman, R. E. (1960). A New Approach to Linear Filtering and Prediction Problems. Transactions
of the ASME - Journal of Basic Engineering, 35-45.
Welch, G., & Bishop, G. (2006). An Introduction to the Kalman Filter. Retrieved 2015, from
Department of Computer Science, University of North Carolina at Chapel Hill:
http://www.cs.unc.edu/~tracker/ref/s2001/kalman/