Implementing the Unscented Kalman Filter on an Embedded System: a Lesson Learnt

Conference Paper, March 2015. DOI: 10.1109/ICIT.2015.7125391

Vito Mario Fico∗, César Pecharromán Arribas∗, Álvaro Ricca Soaje∗, María Ángeles Martin Prats∗, senior IEEE, Sebastián Ramiro Utrera†, Antonio Leopoldo Rodríguez Vázquez†, Luis Miguel Parrilla Casquet†
∗ Departamento de Ingeniería Electrónica, Escuela Técnica Superior de Ingeniería, Universidad de Sevilla
† Skylife Engineering, Seville, Spain
Email: vfico@zipi.us.es

Abstract: This work presents the development process of an Unscented Kalman Filter (UKF) on an embedded platform for navigation purposes. This type of filter is usually executed in real time, so high processing speed is required. The authors outline the steps followed to allow a cheap microcontroller unit to run a full-state UKF at a maximum frequency of about 200 Hz. Some important parts of the developed algorithm are explained in order to make clear where the computational bottlenecks are. This work can serve as a guide for other implementations, showing how to develop and deploy a complex algorithm on a microprocessor. Two different microprocessor architectures are used to explain some platform-dependent optimisations: a Digital Signal Processor (DSP) and an ARM microcontroller. General advice and hardware-oriented optimisations are presented that lead to the execution time reduction achieved in this paper.
The results of the improvements made on both platforms are shown in this paper, and the minimum requirements are listed as a conclusion.

978-1-4799-7800-7/15/$31.00 ©2015 IEEE

I. INTRODUCTION

Over the last years, embedded systems have been characterised by a drive towards extreme miniaturisation. Integrated navigation systems have followed this trend by switching to low-cost, small MEMS sensors. At the same time, this drastic reduction in size made it possible to fit multiple sensors in the same volume and to add new inputs to attitude and navigation systems, such as cameras and odometers. This translates into a higher computational load on the microprocessors and, even though these are ever more powerful, into the need for intense optimisation of the data fusion software.

The main disadvantage of MEMS sensors is their high noise-to-signal ratio. Hence the need to use them together with a data fusion algorithm that improves the overall navigation solution. To achieve high cost-effectiveness, it is proposed to use a low-cost embedded DSP and an ARM microcontroller to run the data fusion algorithm, a non-linear Kalman Filter, specifically the Unscented Kalman Filter.

The present work introduces the optimisations the authors made to the code in order to get the maximum execution speed, compares two microprocessors, and defines the minimum requirements for running a data fusion algorithm in real time.

II. ALGORITHM DESCRIPTION

The EKF and the UKF are the most widely used non-linear data fusion algorithms. They are used in many engineering fields, above all in navigation. The Extended Kalman Filter (EKF) is a generalisation of the linear Kalman Filter to nonlinear systems. Specifically, it approximates the propagation and/or measurement equations by the first-order term of their Taylor expansion; hence it needs the computation of a Jacobian matrix.
It works well if the integration time is much shorter than the time scale of the aircraft dynamics, but if the initial estimate of the state is wrong, or if the process is modelled incorrectly, the filter may quickly diverge owing to its linearisation. The UKF originates from a different idea, namely that:

It is easier to approximate a Gaussian distribution than it is to approximate an arbitrary nonlinear function or transformation. [1]

This filter propagates the distribution through the exact nonlinear model using a set of special points named sigma points, which form the minimum set of points that represents the distribution properties to at least the second order [1]. The EKF only preserves the first-order statistical properties of the distribution (the mean). Furthermore, the UKF does not need the computation of a Jacobian matrix, resulting in an easier implementation and improved accuracy. On the other hand, it has a higher computational load.

To justify this last statement, a little insight into the UKF algorithm is needed. Being a recursive algorithm, it needs at least the previous state estimate $\hat{x}_{k-1}$ and the state covariance $S_{k-1}$. Using these data it creates the $2L + 1$ sigma points, $L$ being the number of states, as:

$$
\begin{aligned}
\chi_0 &= \hat{x} \\
\chi_i &= \hat{x} + \left(\sqrt{(L+\lambda)S}\right)_{(:,i)}, & i &= 1, \dots, L \\
\chi_i &= \hat{x} - \left(\sqrt{(L+\lambda)S}\right)_{(:,i-L)}, & i &= L+1, \dots, 2L
\end{aligned}
\qquad (1)
$$

with $\left(\sqrt{(L+\lambda)S}\right)_{(:,i)}$ the $i$-th column of the scaled square root matrix of $S_{k-1}$. The algorithm generally used to compute the square root of the matrix $S$ is the Cholesky decomposition. The sigma points are then propagated through the nonlinear kinematic model, and the estimated mean and covariance are computed using eq. (2) and (3). This implies storing a matrix composed of $2L + 1$ sigma points and executing the propagation (non-linear equations) $2L + 1$ times, whereas the EKF performs it only once. The same applies when computing the estimated observables using the non-linear measurement model.
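The sigma-point construction described above can be sketched in a few lines. This is a minimal illustration rather than the authors' embedded code: it assumes that S is a symmetric positive definite covariance matrix, that the second half of the points uses the usual minus sign of the symmetric sigma-point set, and that λ = α²(L + κ) − L as defined for the weights. The names `cholesky_lower` and `sigma_points` are hypothetical.

```python
import math


def cholesky_lower(a):
    """Return the lower-triangular C with C * C^T == a (a must be SPD)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(a[i][i] - s)
            else:
                c[i][j] = (a[i][j] - s) / c[j][j]
    return c


def sigma_points(x_hat, cov, alpha, kappa):
    """Build the 2L+1 sigma points from a mean and a covariance (assumed SPD)."""
    L = len(x_hat)
    lam = alpha ** 2 * (L + kappa) - L
    root = cholesky_lower(cov)
    gamma = math.sqrt(L + lam)  # sqrt((L + lam) * S) == gamma * chol(S)
    points = [list(x_hat)]      # chi_0 is the mean itself
    for sign in (1.0, -1.0):
        for i in range(L):
            # add or subtract column i of the scaled square-root matrix
            points.append([x_hat[r] + sign * gamma * root[r][i] for r in range(L)])
    return points


# Toy example with L = 2 (illustrative values, not taken from the paper).
example = sigma_points([1.0, 2.0], [[4.0, 2.0], [2.0, 3.0]], alpha=0.3, kappa=10.0)
print(len(example))  # 2L + 1 points
```

Each call yields 2L + 1 state vectors, and every one of them has to be pushed through the nonlinear model, which is where the extra cost compared with the single EKF propagation comes from.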
Indeed, the UKF generally needs $O(L^3)$ operations while the EKF needs only $O(L^2)$ operations [4]. It is clear that for small state spaces the difference in computation time is negligible, but this is not the case for navigation state estimation. To represent the system properly, a state of about 20 variables is needed, generally composed of:
• 3 or 4 variables to represent the aircraft attitude
• 3 variables to represent the aircraft speed
• 3 variables to represent the aircraft position
• 9 variables to represent sensor biases

The theoretical model of the UKF also suffers from numerical instabilities, because the state covariance matrix $P$ could lose its positive definiteness if finite-precision arithmetic errors accumulate. To avoid this issue, the algorithm implemented in the present paper makes use of the Scaled Unscented Transform (Julier [2]) and of the Square Root UKF (Merwe [4]). The latter uses a combination of the QR factorisation and the Cholesky update instead of the Cholesky factorisation to update the state covariance matrix. This, together with a highly optimised QR factorisation performed with Householder reflections, made the algorithm quicker and more stable.

III. OPTIMISATIONS

The UKF implementation requires a considerable programming effort, as it has to run in real time. Hence, the data flow has to be controlled and minimised to comply with time requirements and memory restrictions.

To optimise execution performance, some changes were made to a standard Unscented Kalman Filter code. Above all, the main reduction in execution time was obtained by using float variables in order to take advantage of the floating point hardware unit (FPU) available in some microcontrollers. As described in section IV-A, the FPU can be used only with float variables. Using double variables therefore raises the execution time, because these operations have to be carried out by the microcontroller in software instead of in the hardware unit. The second step was obviously the selection of a microcontroller implementing an FPU. This improvement can be seen in Fig. 2, compared with the other optimisations.

In addition, float variables need fewer memory accesses per operation. Hence, changing all variables to float makes every single operation run faster on the microprocessor. As said, this change had a great impact on the execution time but, as the code uses fewer bits in every operation, the computational precision suffered a great degradation during the mean and covariance computation. To achieve the same precision as before, some key parameters of the Unscented Kalman Filter had to be changed. To make these changes a little clearer, the equations used to compute the propagated mean and covariance of the distribution, respectively, are written below.

$$\hat{x}_k^- = \sum_{i=0}^{2L} W_i^m \, \chi_{i,k|k-1} \qquad (2)$$

$$P_k^- = \sum_{i=0}^{2L} W_i^c \, [\chi_{i,k|k-1} - \hat{x}_k^-][\chi_{i,k|k-1} - \hat{x}_k^-]^T \qquad (3)$$

where $\chi$ are the sigma points and $W_i^m$ and $W_i^c$ the weights for the mean and for the covariance. The weights are computed as follows:

$$W_0^m = \lambda/(L+\lambda) \qquad (4)$$

$$W_0^c = \lambda/(L+\lambda) + (1 - \alpha^2 + \beta) \qquad (5)$$

$$W_i^m = W_i^c = 1/(2(L+\lambda)) \qquad (6)$$

with $\lambda = \alpha^2 (L + \kappa) - L$. Using the standard values for the UKF parameters $\alpha = 10^{-3}$, $\kappa = 0$, $\beta = 2$ and $L = 21$ gives the following values for $W_i$ using double:

$$W_0^m = -9.9999900 \cdot 10^5, \qquad W_0^c = -9.9999600 \cdot 10^5, \qquad W_i^m = W_i^c = 2.3809524 \cdot 10^4$$

and the following using float:

$$W_0^m = -1.0009124 \cdot 10^6, \qquad W_0^c = -1.0009094 \cdot 10^6, \qquad W_i^m = W_i^c = 2.3831273 \cdot 10^4$$

The difference does not seem noticeable, but when the weighted terms are multiplied and summed sequentially it gives a substantial difference in the final result (of the order of metres in position in the executed navigation algorithm). The formulae used to compute the weights are susceptible to numerical errors for low values of α.
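This sensitivity can be reproduced on a desktop machine with a short sketch. It is an illustration under stated assumptions, not the embedded code: single precision is emulated by rounding every intermediate result to IEEE 754 binary32 with the standard `struct` module, and the evaluation order inside `ukf_weights` is one plausible choice (a compiler may order the operations differently). The helper names `f32` and `ukf_weights` are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest IEEE 754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def ukf_weights(L, alpha, kappa, beta, rnd=lambda v: v):
    """Weights of eqs. (4)-(6); rnd is applied after every arithmetic step."""
    a2 = rnd(rnd(alpha) * rnd(alpha))
    lam = rnd(rnd(a2 * rnd(L + kappa)) - L)   # lambda = alpha^2 (L + kappa) - L
    denom = rnd(L + lam)                      # L + lambda, tiny for small alpha
    w0m = rnd(lam / denom)
    w0c = rnd(w0m + rnd(rnd(1.0 - a2) + beta))
    wi = rnd(1.0 / rnd(2.0 * denom))
    return w0m, w0c, wi


for label, rnd in (("double", lambda v: v), ("float ", f32)):
    w0m, w0c, wi = ukf_weights(21, 1e-3, 0.0, 2.0, rnd)
    print(label, f"W0m={w0m:.7e}  W0c={w0c:.7e}  Wi={wi:.7e}")
```

With α = 10⁻³ the term L + λ is about 2.1·10⁻⁵, while λ itself is close to −21, where single precision can only resolve steps of roughly 1.9·10⁻⁶. The cancellation therefore leaves only a couple of significant digits in the denominator, which is why the emulated float weights should land close to the float values quoted above rather than the double ones.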
To improve this behaviour and allow the use of float variables, the UKF parameter values were changed to $\alpha = 3 \cdot 10^{-1}$, $\kappa = 10$, $\beta = 2$, reducing the order of magnitude of the $W_i$ to $10^1$ and the difference between the weights computed using float and double to 0. These changes did not introduce any noticeable degradation or change in the state estimation results.

IV. TEST

A. Platforms Description

The present UKF was developed to run on an embedded system that could be carried on board a vehicle such as a plane, ship or car. To test the algorithm speed, two platforms from two major electronics manufacturers were used: Texas Instruments and STMicroelectronics. Both platforms are single core and implement specific hardware to increase processing performance. A summary of both embedded systems is reported in Table I [8], [10].

The TMS320F28335 microcontroller is a Digital Signal Processor (DSP) with C2000 architecture. The floating-point CPU of the TMS320F28335 implements an arithmetic logic unit (ALU) to perform 2s-complement arithmetic and Boolean logic operations, a floating point unit (FPU) to perform IEEE single-precision floating-point operations, and an address register arithmetic unit (ARAU) that generates data memory addresses and increments or decrements pointers in parallel with ALU operations. The C28x+FPU memory map is accessible outside the CPU through the memory interface, which connects the CPU logic to memories, peripherals, or other interfaces. The memory interface includes separate buses for program space and data space. This means that an instruction can be fetched from program memory while data memory is being accessed [10].

TABLE I: Hardware comparison

                          eZdsp F28335 [10]          STM32F4Discovery [8]
Manufacturer              Texas Instruments          STMicroelectronics
Processor (Family)        TMS320F28335 (C2000)       STM32F407VG (ARM 32-bit Cortex M4)
Operating Speed           150 MHz                    168 MHz
On-Chip Memory            34 KB RAM, 512 KB Flash    192 KB SRAM, 1 MB Flash
Peripherals               UART, SPI bus, I2C bus, 12-bit ADC, Inertial Measurement Unit (IMU)
Price at November 2014    ≈ 430 €                    ≈ 10 €

The STM32F407VG microprocessor has an ARM Cortex M4 architecture with a single-cycle DSP MAC. That is, the STM32F407VG is an ARM microcontroller with DSP capabilities. The microcontroller integrates a single-cycle multiply-accumulate (MAC) unit capable of performing 16-bit and 32-bit multiplies with 32- and 64-bit accumulations in a single cycle. Furthermore, the Cortex-M4 provides a set of single-cycle SIMD instructions such as add, subtract, multiply, and multiply-and-accumulate, which are used to perform the matrix addition, subtraction, and multiplication commonly found in UKF implementations. These high-performance units, which are included in the FPU, make digital signal processing more efficient and reduce the consumption of CPU resources [9].

Both microcontrollers have to be programmed with a compiler provided by the manufacturer. It is worth pointing out that the C2000 compiler is stricter than the ARM compiler, making the implementation more arduous. Besides, a microcontroller needs some auxiliary hardware if it is meant to be used in a system other than the development board, because it has to be programmed in some way once it is integrated in another circuit. Note that the TMS320F28335 needs a JTAG programmer, which has to be bought in addition to the microcontroller. The STM32F407VG can be programmed with the same development board used to develop the code, so this option is cheaper.

B. Microcontroller Optimisations

A UKF implementation needs some hardware-oriented optimisations in order to meet real-time execution requirements. One of the main problems of implementing a UKF on a microcontroller is that it needs a lot of memory. As stated in section II, many matrices and variables are needed to make the algorithm work.

It is important to control where the variables are stored. Declared variables are in fast RAM, while defined variables (preprocessor defines) are in Flash memory. So all defined variables were changed to equivalent variables in RAM, improving data access speed. Some defined variables cannot be changed to declared variables because they are used in several code files. Even so, the float type was forced onto them, because the compiler interprets defined values as double, which prevents the microcontroller from using the FPU.

On the TMS320F28335 a memory issue prevented the RAM optimisations from being applied. The main problem was the memory allocation of the variables. The developed UKF has around 16 KB of executable code and 40 KB of data. The TMS320F28335 has 34 KB of RAM [10], so it was not possible to allocate the entire UKF inside the RAM. The Flash memory had to be used for the matrices that could not be stored in the faster RAM. It was decided to allocate the most used matrices in RAM in order to minimise the use of the Flash memory, which is the execution speed bottleneck. The Flash memory is much slower than the RAM: every operation in Flash memory needs at least six clock cycles, far more than the direct access used for RAM. This shows the importance of memory optimisation during code development, so that the code can be deployed on different embedded systems.

Furthermore, although the data bus is 32 bits wide, all memories in the TMS320F28335 have a width of 16 bits.
Because of this, every operation on a float (32 bits) in any memory needed two accesses. Hence, mainly due to these two drawbacks, the complete execution on the TMS320F28335 was much slower.

The Texas Instruments compiler does not allow the programmer to use inline functions [10]. Indeed, on this microcontroller inlining is an automatic optimisation, so it was impossible for the authors to verify where the compiler applied it. On the other hand, Texas Instruments offers the fastRTS library [11], which reduces the CPU time by around 700 cycles for each float operation.

First of all, the preliminary code was executed on the STM32F407VG and the measured time was 44.8 ms, while the same code on the TMS320F28335 took 128.8 ms. The execution was about three times slower, as expected.

To improve that result, some changes were made to reduce the execution time, taking care of the memory occupation. The main optimisations were:
• ARM optimised matrix libraries.
• Inline functions.
• Replacing divisions with multiplications.

First, some function declarations were changed to inline functions to allow the compiler to insert them in the code. That is, by using inline functions, the context switching involved in each function call is reduced at critical points of the code. Context switches are expensive in execution time because of the matrix sizes. This improvement was applied to some critical functions, chosen because they are called repeatedly inside loops. The choice of functions is critical: memory is a limited resource, so time optimisation and memory occupation had to be balanced here. Neither a very long function nor a large number of functions can be inlined.

Furthermore, some minor optimisations were made to reduce the execution time a little further.
On the STM32F407VG a multiplication takes seven clock cycles fewer than a division, so it was beneficial to replace every division with a multiplication by the inverse. Besides, all numeric constants and defines were changed to declared variables in memory. Allocating all data in the same place improves execution performance.

The algorithm contains some trigonometric functions. The C math library provides versions of these functions optimised for the float type. They were used by calling all trigonometric functions with an f suffix, that is, calling sinf, cosf, tanf instead of sin, cos, tan. On the STM32F407VG this is an important change because it allows the microcontroller to use the FPU.

All these changes translated into a reduction of about 32% in the execution time, so the final measured time was 27.9 ms on the STM32F407VG, as shown in Figure 2.

Fig. 1: Execution time in STM32F407VG

Fig. 2: Importance of each optimisation (float variables 68 %, other optimisations 32 %)

V. RESULTS

With the developed UKF code implemented on the platforms described in IV-A, the execution time was measured in order to demonstrate the advantages of the authors' optimisations.

When the UKF was implemented on the TMS320F28335 microcontroller, the measured execution time was 169.01 ms. This result was mainly due to memory restrictions.

The preliminary test on the STM32F407VG microcontroller was done with double variables, resulting in an execution time of 220.4 ms. When the variables were changed to float, without any of the microcontroller optimisations explained in section IV-B, the measured execution time was 44.8 ms. With all the optimisations, the measured execution time was 27.9 ms. Fig. 1 summarises the execution times measured during the code optimisation.

Looking at these results, it is immediately clear that some changes reduce the execution time more than others.
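As a quick cross-check, the STM32F407VG times quoted in this section can be turned into speed-up factors. The sketch below only restates the reported measurements; the constant names and the `speedup` helper are hypothetical.

```python
# Execution times in milliseconds, copied from the measurements quoted above.
STM32_DOUBLE_MS = 220.4     # STM32F407VG, double variables
STM32_FLOAT_MS = 44.8       # STM32F407VG, float variables, no further optimisation
STM32_OPTIMISED_MS = 27.9   # STM32F407VG, float variables and all optimisations


def speedup(before_ms, after_ms):
    """Return how many times faster the 'after' configuration runs."""
    return before_ms / after_ms


print(f"double -> float:        x{speedup(STM32_DOUBLE_MS, STM32_FLOAT_MS):.2f}")      # x4.92
print(f"float  -> all optimis.: x{speedup(STM32_FLOAT_MS, STM32_OPTIMISED_MS):.2f}")   # x1.61
print(f"double -> all optimis.: x{speedup(STM32_DOUBLE_MS, STM32_OPTIMISED_MS):.2f}")  # x7.90
```

Expressed this way, the switch from double to float accounts for most of the overall gain on this platform, with the remaining optimisations contributing a further factor of about 1.6.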
Specifically, changing the variable type is the most effective optimisation, as shown in Fig. 2. Furthermore, the optimisations grouped as Other optimisations are not the same for both microcontrollers. These percentages may change depending on the developed code, as well as on where inline functions can be used.

The percentage due to Matrix libraries in Fig. 3 strongly depends on the type of microcontroller used. Each manufacturer has its own mathematical libraries, and they also vary with the microcontroller architecture. The percentage shown here is based on the improvement obtained on the STM32F407VG.

Fig. 3: Other optimisations in detail (Matrix libraries 59 %, Minor optimisations 6 %, Inline functions 35 %)

VI. CONCLUSIONS

To achieve this execution performance, the minimum requirements for implementing an optimised UKF on an embedded system can be listed. The recommended requirements are:
• 57 KB RAM memory.
• Floating-point unit.
• Optimised math libraries.
• C language optimisation tools.

Furthermore, reaching this execution time required a significant programming effort in order to keep the number of operations and the memory occupation to a minimum.

VII. FUTURE WORK

As future work, the authors would like to implement the UKF on an FPGA. Low-level programming is the best way to control and improve the execution time of an algorithm, but it is more arduous to develop. Hardware-based computation will greatly speed up the filter, allowing the use of other inputs such as data from video cameras. Real-time video feature extraction requires a great amount of computational power, unobtainable even with current high-end microcontrollers.

ACKNOWLEDGMENT

The authors would like to thank the Group TIC-109 of the Department of Electronic Engineering of the University of Seville and Skylife Engineering for making the test platforms available and for their help during the implementation phase.

REFERENCES

[1] S. J. Julier and J. K. Uhlmann, A general method for approximating nonlinear transformations of probability distributions, Journal of Oxford, Oxford, OC1 3PJ United, 1996.
[2] S. Julier, The scaled unscented transformation, Control Conf. 2002. Proc., 2002.
[3] J. Crassidis, F. Markley, and Y. Cheng, Survey of nonlinear attitude estimation methods, J. Guid. Control., 2007.
[4] R. Van Der Merwe and E. Wan, The square-root unscented Kalman filter for state and parameter-estimation, Acoust. Speech, Signal, 2001.
[5] M. Grewal, L. Weill, and A. Andrews, Global positioning systems, inertial navigation, and integration, 2007.
[6] A. Sloss, D. Symes, and C. Wright, ARM system developer's guide: designing and optimizing system software, Elsevier/Morgan Kaufman, 2004.
[7] R. Oshana, Digital signal processing for embedded and real-time systems, Oxford; Boston, 2012.
[8] STMicroelectronics, RM0090 Reference manual, Version 7, 2014.
[9] STMicroelectronics, UM1472 User manual, Version 4, 2014.
[10] Texas Instruments, SPRS439M Data Manual, Revised 2012, 2012.
[11] Texas Instruments, SPRCA75: FastRTS Module User Guide, Version 1.00, 2010.