UKF Implementation on Embedded Systems: A Lesson Learnt

Implementing the Unscented Kalman Filter on
an embedded system: A lesson learnt
Conference Paper · March 2015
DOI: 10.1109/ICIT.2015.7125391
Vito Mario Fico∗, César Pecharromán Arribas∗, Álvaro Ricca Soaje∗, María Ángeles Martin Prats∗, Senior Member, IEEE,
Sebastián Ramiro Utrera†, Antonio Leopoldo Rodríguez Vázquez†, Luis Miguel Parrilla Casquet†
∗ Departamento de Ingeniería Electrónica, Escuela Técnica Superior de Ingeniería, Universidad de Sevilla
† Skylife Engineering, Seville, Spain
Email: vfico@zipi.us.es
Abstract—The development process of an Unscented Kalman
Filter (UKF) on an embedded platform for navigation purposes
is presented in this work. This type of filter is usually executed
in real time, so a high processing speed is required. The
authors outline the steps followed to allow a cheap
microcontroller unit to run a full-state UKF at a maximum
frequency of about 200 Hz. Some important parts of the developed
algorithm are explained in order to make clear where the
computational bottlenecks are. This work can serve as a
reference experience for other implementations, providing a guide
for developing and deploying a complex algorithm on a microprocessor.
Two different microprocessor architectures are used to
explain some platform-dependent optimisations: a Digital Signal
Processor (DSP) and an ARM microcontroller. General advice and
hardware-oriented optimisations are presented that lead to the
execution time reduction achieved in this paper.
The results of the improvements made on both platforms are shown,
and the minimum requirements are listed.
Over the last few years, embedded systems have been characterised
by a drive towards extreme miniaturisation. Integrated navigation
systems have followed this trend by switching
to low-cost, small-size MEMS sensors. At the same time, this
drastic reduction in size has made it possible to fit multiple
sensors in the same volume and to add new inputs to attitude and
navigation systems, such as cameras and odometers. This translates
into a greater computational load on the microprocessors and,
even though they are becoming ever more powerful, into the need for
intensive optimisation of the data fusion software.
The main disadvantage of MEMS sensors is their
high noise-to-signal ratio. Hence the need to use them
in conjunction with a data fusion algorithm that improves
the overall navigation solution.
To achieve high cost-effectiveness, it is proposed to use a low-cost embedded DSP and an ARM microcontroller to run the data fusion algorithm,
a non-linear Kalman filter, specifically the Unscented Kalman Filter (UKF).
The present work introduces the optimisations the authors
applied to the code to obtain the maximum execution
speed, compares two microprocessors, and
defines the minimum requirements for running a data
fusion algorithm in real time.
The Extended Kalman Filter (EKF) and the UKF are the most widely used
non-linear data fusion algorithms. They are used in many engineering
fields, above all in navigation.
The EKF is a generalisation of the linear Kalman Filter to
non-linear systems. Specifically, it approximates the propagation
and/or the measurement equations by the first-order term of their
Taylor expansion, hence it needs the computation of a Jacobian
matrix. It works well if the integration time is much shorter
than the time scale of the aircraft dynamics, but if the initial estimate of the
state is wrong, or if the process is modelled incorrectly, the
filter may quickly diverge, owing to its linearisation.
The UKF originates from a different idea, namely that:
It is easier to approximate a Gaussian distribution
than it is to approximate an arbitrary nonlinear
function or transformation. [1]
This filter propagates the distribution through the exact non-linear model using a set of special points named sigma points,
which form the minimum set of points that represents
the distribution properties to at least the second order [1]. The
EKF, by contrast, preserves only the first-order statistical properties of the
distribution (the mean). Furthermore, the UKF does not need
the computation of a Jacobian matrix, which results in an easier
implementation and improved accuracy. On the other hand,
it has a higher computational load.
To justify this last statement, a brief look
at the UKF algorithm is needed. Being a recursive algorithm, it
needs at least the previous state estimate $\hat{x}_{k-1}$ and the previous
state covariance $S_{k-1}$. From these data it creates the $2L + 1$
sigma points, $L$ being the number of states, as:

\[
\begin{aligned}
\chi_0 &= \hat{x} \\
\chi_i &= \hat{x} + \left(\sqrt{(L+\lambda)S}\right)_{(:,i)}, & i &= 1, \dots, L \\
\chi_i &= \hat{x} - \left(\sqrt{(L+\lambda)S}\right)_{(:,i-L)}, & i &= L+1, \dots, 2L
\end{aligned}
\tag{1}
\]

with $\left(\sqrt{(L+\lambda)S}\right)_{(:,i)}$ the $i$-th column of the square root matrix
of $(L+\lambda)S_{k-1}$. The algorithm generally used to compute the square
root form of the matrix $S$ is the Cholesky decomposition.
The sigma points are then propagated through the non-linear kinematic model, and the estimated mean and covariance
are computed using eq. (2) and (3). This implies storing
a matrix composed of 2L + 1 sigma points and executing
the propagation (non-linear equations) 2L + 1 times, whereas
the EKF performs it only once. The same happens again
when computing the estimated observables with the non-linear
measurement model.
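The propagation and recombination steps can be sketched as follows, to show where the 2L + 1 model evaluations come from. The names `ukf_model_fn`, `ukf_propagate`, `ukf_mean_cov`, and `toy_model` are hypothetical, the toy model is only a placeholder for a real kinematic model, and the weights are passed in by the caller; none of this is taken from the authors' code.

```c
#include <stddef.h>

/* Signature assumed for the process model: maps one n-element state to the next. */
typedef void (*ukf_model_fn)(const float *in, float *out, size_t n);

/* Hypothetical two-state model used only for illustration:
 * position advances by velocity over a unit time step. */
static void toy_model(const float *in, float *out, size_t n)
{
    (void)n;
    out[0] = in[0] + in[1];
    out[1] = in[1];
}

/* Evaluate the model once per sigma point: 2n+1 calls in total. */
static void ukf_propagate(const float *chi, float *chi_out, size_t n,
                          ukf_model_fn f)
{
    for (size_t i = 0; i < 2 * n + 1; ++i) {
        f(&chi[i * n], &chi_out[i * n], n);
    }
}

/* Weighted mean and covariance of the propagated sigma points.
 * chi:      (2n+1) x n row-major, one propagated sigma point per row
 * wm0, wc0: weights of the central point for the mean and the covariance
 * wi:       common weight of the remaining 2n points
 * mean:     output, n elements
 * cov:      output, n x n row-major */
static void ukf_mean_cov(const float *chi, size_t n,
                         float wm0, float wc0, float wi,
                         float *mean, float *cov)
{
    const size_t nsig = 2 * n + 1;

    for (size_t r = 0; r < n; ++r) {
        float acc = wm0 * chi[r];
        for (size_t i = 1; i < nsig; ++i) {
            acc += wi * chi[i * n + r];
        }
        mean[r] = acc;
    }

    for (size_t r = 0; r < n; ++r) {
        for (size_t c = 0; c < n; ++c) {
            float acc = 0.0f;
            for (size_t i = 0; i < nsig; ++i) {
                const float w = (i == 0) ? wc0 : wi;
                acc += w * (chi[i * n + r] - mean[r])
                         * (chi[i * n + c] - mean[c]);
            }
            cov[r * n + c] = acc;
        }
    }
}
```

The covariance loop touches every sigma point for every matrix element, which gives a feel for why the cost grows quickly with the number of states. Because the toy model is linear, the recombined mean and covariance can be checked by hand against the usual linear propagation formulas.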
Indeed, the UKF generally needs $O(L^3)$ operations, while
the EKF needs only $O(L^2)$ operations [4]. For small state spaces
the difference in computation time is
negligible, but this is not the case for navigation state estimation.
Representing the system properly requires a state
of about 20 variables, generally composed of:
3 or 4 variables to represent the aircraft attitude
3 variables to represent the aircraft speed
3 variables to represent the aircraft position
9 variables to represent sensors biases
The theoretical model of the UKF also suffers from numerical
instabilities, because the state covariance matrix P
can stop being positive definite as finite-precision arithmetic errors accumulate. To avoid this issue, the algorithm implemented
in the present paper makes use of the Scaled Unscented
Transform (Julier [2]) and of the Square Root UKF (Merwe).
This last method uses a combination of the QR factorisation
and the Cholesky update, instead of the Cholesky factorisation,
to update the state covariance matrix. This, together with
a highly optimised QR factorisation performed with
Householder reflections, helped to make the algorithm faster
and more stable.
Implementing the UKF requires considerable programming effort,
since it has to run in real time. Hence, the data flow has to
be controlled and minimised to comply with the time requirements
and memory restrictions.
To optimise execution performance, some changes were
made to a standard Unscented Kalman Filter code. Above all,
the main reduction in execution time was obtained by using float
variables in order to take advantage of the hardware floating-point
unit (FPU) available in some microcontrollers. As
described in section IV-A, the FPU can be used only with
float variables. Using double variables therefore raises the
execution time, because these operations have to be carried out in software by the
microcontroller instead of in the hardware unit. The second step
was, naturally, the selection of a microcontroller that implements
an FPU. This improvement can be seen in Fig. 2, compared
with the other optimisations.
In addition, float variables need fewer memory accesses
per operation. Hence, changing all variables to
float makes every single operation run faster on the microprocessor. As said, this change had a great impact on the execution
time but, as the code uses fewer bits in every operation, the
computational precision suffered a great degradation during the
mean and covariance computation. To achieve the same
precision as before, some key parameters of the Unscented
Kalman Filter had to be changed. To make these changes a
little clearer, the equations used to compute the
propagated mean and covariance of the distribution, respectively, are written as:

\[
\hat{x}_k^- = \sum_{i=0}^{2L} W_i^m \, \chi_{i,k|k-1}
\tag{2}
\]
\[
P_k^- = \sum_{i=0}^{2L} W_i^c \, [\chi_{i,k|k-1} - \hat{x}_k^-][\chi_{i,k|k-1} - \hat{x}_k^-]^T
\tag{3}
\]

where $\chi$ are the sigma points and $W_i^m$ and $W_i^c$ are the weights
for the mean and for the covariance. The weights are computed
in this way:

\[
\begin{aligned}
W_0^m &= \lambda/(L + \lambda) \\
W_0^c &= \lambda/(L + \lambda) + (1 - \alpha^2 + \beta) \\
W_i^m &= W_i^c = 1/(2(L + \lambda)), \quad i = 1, \dots, 2L
\end{aligned}
\]
with $\lambda = \alpha^2 (L + \kappa) - L$. Using the standard values for the
UKF parameters, $\alpha = 10^{-3}$, $\kappa = 0$, $\beta = 2$, and $L = 21$, gives
the following values for $W_i$ using double:

\[
\begin{aligned}
W_0^m &= -9.9999900 \cdot 10^{5} \\
W_0^c &= -9.9999600 \cdot 10^{5} \\
W_i^m &= W_i^c = 2.3809524 \cdot 10^{4}
\end{aligned}
\]

and the following using float:

\[
\begin{aligned}
W_0^m &= -1.0009124 \cdot 10^{6} \\
W_0^c &= -1.0009094 \cdot 10^{6} \\
W_i^m &= W_i^c = 2.3831273 \cdot 10^{4}
\end{aligned}
\]

The difference does not seem significant, but when the terms are multiplied
and summed sequentially it produces a substantial difference in
the final result (of the order of metres in position in the executed
navigation algorithm).
The formulae used to compute the weights are susceptible to
numerical errors for low values of $\alpha$. To improve this behaviour
and allow the use of float variables, the UKF parameter values
were changed to $\alpha = 3 \cdot 10^{-1}$, $\kappa = 10$, $\beta = 2$, reducing the
order of magnitude of the $W_i$ to $10^{1}$ and the difference between the weights
computed using float and double to approximately 0.
On the other hand, these changes in value did not introduce
any noticeable degradation or change in the state estimation.
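The sensitivity of the weights to the arithmetic precision can be reproduced with a short C sketch that evaluates the same formulae once in double and once in float. The type and function names (`ukf_weights_d`, `ukf_weights_f`, `ukf_weights_double`, `ukf_weights_float`) are assumptions made for the example, and the float results depend on the target following IEEE-754 single precision without excess intermediate precision.

```c
/* Weights of the scaled unscented transform, in two precisions. */
typedef struct { double w0m, w0c, wi; } ukf_weights_d;
typedef struct { float  w0m, w0c, wi; } ukf_weights_f;

static ukf_weights_d ukf_weights_double(double alpha, double beta,
                                        double kappa, double L)
{
    const double lambda = alpha * alpha * (L + kappa) - L;
    ukf_weights_d w;
    w.w0m = lambda / (L + lambda);
    w.w0c = w.w0m + (1.0 - alpha * alpha + beta);
    w.wi  = 1.0 / (2.0 * (L + lambda));
    return w;
}

static ukf_weights_f ukf_weights_float(float alpha, float beta,
                                       float kappa, float L)
{
    /* For small alpha, lambda is very close to -L, so L + lambda is the
     * difference of two nearly equal numbers and loses most of its
     * significant bits in single precision. */
    const float lambda = alpha * alpha * (L + kappa) - L;
    ukf_weights_f w;
    w.w0m = lambda / (L + lambda);
    w.w0c = w.w0m + (1.0f - alpha * alpha + beta);
    w.wi  = 1.0f / (2.0f * (L + lambda));
    return w;
}
```

With α = 10⁻³, κ = 0, β = 2, and L = 21, the quantity L + λ is about 2.1 · 10⁻⁵, while the spacing between adjacent float values near 21 is about 1.9 · 10⁻⁶, so only a handful of distinct values of L + λ are representable and the weights shift in the third or fourth significant digit. With α = 0.3 and κ = 10, L + λ is about 2.79, the cancellation disappears, and the float and double weights should agree to roughly single-precision accuracy.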
A. Platforms Description
The present UKF was developed to run on an embedded
system that could be carried on board a vehicle such as a plane, ship, or
car. To test the speed of the algorithm, it was decided to use
two different platforms from two major electronics manufacturers:
Texas Instruments and STMicroelectronics. Both platforms are
single core and implement specific hardware to increase
processing performance. A summary of both embedded systems
is given in Table I [8], [10].
The TMS320F28335 microcontroller is a Digital Signal
Processor (DSP) with the C2000 architecture. The floating-point
CPU of the TMS320F28335 implements an arithmetic logic unit (ALU) to perform 2s-complement arithmetic
and Boolean logic operations, a floating-point unit (FPU) to
perform IEEE single-precision floating-point operations, and
an address register arithmetic unit (ARAU) that generates data
memory addresses and increments or decrements pointers in
parallel with ALU operations. The C28x+FPU memory map
is accessible outside the CPU through the memory interface, which
connects the CPU logic to memories, peripherals, or other
interfaces. The memory interface includes separate buses for
program space and data space, which means that an instruction
can be fetched from program memory while data memory is
being accessed [10].
TABLE I: Hardware comparison

                          eZdsp F28335 [10]          STM32F4Discovery [8]
Manufacturer              Texas Instruments          STMicroelectronics
Processor (Family)        TMS320F28335 (C2000)       STM32F407VG (ARM 32-bit Cortex M4)
Operating Speed           150 MHz
On-Chip Memory            34 KB RAM, 512 KB Flash    192 KB SRAM, 1 MB Flash
Price at November 2014    ≈ 430 €                    ≈ 10 €

Peripherals listed: SPI bus, I2C bus, 12-bit ADC, Inertial Measurement Unit (IMU)
The STM32F407VG microprocessor has an ARM Cortex-M4
architecture with a single-cycle DSP MAC. That is, the
STM32F407VG is an ARM microcontroller with DSP capabilities. The microcontroller integrates a single-cycle multiply-accumulate (MAC) unit, which is capable of performing
16-bit and 32-bit multiplications with 32- and 64-bit
accumulations in a single cycle. Furthermore, the Cortex-M4
provides a set of single-cycle SIMD instructions, such
as add, subtract, multiply, and multiply-accumulate, which are
used to perform the matrix addition, subtraction, and multiplication commonly found in UKF implementations.
These high-performance units, together with the FPU,
make digital signal processing more efficient and reduce the
consumption of CPU resources [9].
Both microcontrollers have to be programmed with a
compiler provided by the manufacturer. It is important to
point out that the C2000 compiler is stricter than the
ARM compiler, which makes the implementation more arduous.
Besides, a microcontroller needs some auxiliary hardware if
it is meant to be used in a system other than
the development board, because it
has to be programmed in some way once it is integrated in
another circuit. Note that the TMS320F28335
needs a JTAG programmer, which has to be bought in
addition to the microcontroller. The STM32F407VG can
be programmed with the same development board used to
develop the code, so this option is cheaper.
B. Microcontroller Optimisations
A UKF implementation needs some hardware-oriented
optimisations in order to meet real-time execution requirements.
One of the main problems in implementing a UKF on a
microcontroller is that it needs a lot of memory. As stated in
section II, many matrices and variables are needed
to make the algorithm work.
It is important to control where the variables are stored.
Declared variables reside in fast RAM, while defined
constants (#define) reside in Flash memory. All defined constants were therefore
changed to equivalent variables in RAM, which improves the data
access speed.
Some defined constants cannot be changed to declared
variables because they are used in several code files. Even
so, the float type was forced on them, because the compiler
interprets defined constants as double, which prevents the
microcontroller from using the FPU.
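The effect of the literal type can be illustrated with a small C sketch. The macro and function names (`GRAVITY_D`, `GRAVITY_F`, `gravity_ram`, `accel_via_double`, `accel_via_float`, `accel_via_ram`) are hypothetical, and where a given toolchain actually places each object depends on its linker script and start-up code, so the placement comments describe typical behaviour only.

```c
#define GRAVITY_D 9.80665    /* unsuffixed literal: type double */
#define GRAVITY_F 9.80665f   /* 'f' suffix: type float */

/* A plain initialised global is typically copied into RAM at start-up. */
float gravity_ram = 9.80665f;

/* The float operand is promoted to double before the multiplication, so on
 * a core with a single-precision-only FPU the product is computed by
 * software floating-point routines. */
static float accel_via_double(float a_g)
{
    return (float)(a_g * GRAVITY_D);
}

/* Both operands are float: the whole expression stays in single precision. */
static float accel_via_float(float a_g)
{
    return a_g * GRAVITY_F;
}

/* Same arithmetic, with the constant read from a variable instead. */
static float accel_via_ram(float a_g)
{
    return a_g * gravity_ram;
}
```

The three functions return practically the same value for ordinary inputs; the difference lies in the type of the intermediate product, which can be checked with `sizeof(a_g * GRAVITY_D)` against `sizeof(a_g * GRAVITY_F)`.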
On the TMS320F28335 there was a memory issue that
did not allow the RAM optimisations to be applied. The
main problem was the memory allocation of the variables.
The developed UKF has around 16 KB of executable code
and 40 KB of data. The TMS320F28335 has 34 KB of RAM
[10], so it was not possible to allocate the entire UKF
in RAM. It was necessary to use the Flash
memory to hold the matrices that could not be stored
in the faster RAM. It was decided to allocate the most
used matrices in RAM in order to minimise the use of the Flash
memory, which is the execution speed bottleneck.
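One way to keep such a split under control is to let the compiler check that the most used matrices fit in the RAM set aside for them. The following is a minimal sketch under stated assumptions: the array names, the choice of which matrices count as most used, and the 16 KB budget are all hypothetical, and the actual placement of each array in RAM or Flash still has to be done with the toolchain's own section mechanism, which is not shown here.

```c
#include <stddef.h>

#define UKF_L     21u                  /* number of states, as in the text */
#define UKF_NSIG  (2u * UKF_L + 1u)    /* number of sigma points */

/* Hypothetical budget: fast RAM reserved for the most used matrices. */
#define FAST_RAM_BUDGET_BYTES (16u * 1024u)

/* Matrices assumed to be touched on every filter step. */
float ukf_sigma[UKF_NSIG * UKF_L];     /* sigma points */
float ukf_sqrt_cov[UKF_L * UKF_L];     /* square-root covariance */
float ukf_state[UKF_L];                /* state estimate */

enum {
    UKF_HOT_BYTES = sizeof(ukf_sigma) + sizeof(ukf_sqrt_cov) + sizeof(ukf_state)
};

/* Compile-time check (C11): the build stops if the set outgrows the budget. */
_Static_assert(UKF_HOT_BYTES <= FAST_RAM_BUDGET_BYTES,
               "most used UKF matrices do not fit in the fast RAM budget");

/* Run-time accessor, convenient for logging the footprint. */
static size_t ukf_hot_bytes(void)
{
    return (size_t)UKF_HOT_BYTES;
}
```

With 4-byte floats the sigma-point matrix alone takes 43 × 21 × 4 = 3612 bytes, which shows how quickly a 21-state filter consumes a few tens of kilobytes once all working matrices are counted.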
Flash memory is much slower than RAM: every
operation in Flash memory needs at least six clock cycles,
far more than the direct access used for RAM.
This makes clear the importance of memory optimisation in
the development of the code, so that it can be deployed on
different embedded systems.
Furthermore, although the data bus is 32 bits
wide, all memories in the TMS320F28335 are
16 bits wide. Because of this, every float (32-bit) operation
on any memory needed two accesses. Hence, mainly owing to
these two drawbacks, the complete execution on the
TMS320F28335 was much slower.
The Texas Instruments compiler does not allow the programmer to request inline functions explicitly [10]. On this microcontroller inlining is an automatic optimisation, so it was impossible
for the authors to verify where the compiler applied it. On the
other hand, Texas Instruments offers the fastRTS library [11],
which reduces CPU time by around 700 cycles for each float
First of all, the preliminary code was executed on the
STM32F407VG and the measured time was 44.8 ms, while
the same code running on the TMS320F28335 took
128.8 ms. The execution was about three times
slower, as expected.
To improve on that result, some changes were made to reduce the execution time while keeping the memory occupation under control.
The main optimisations were:
ARM optimized matrix libraries.
Inline functions.
Change divisions to multiplications.
First, some function declarations were changed to inline
functions to allow the compiler to insert their bodies directly in the calling code.
That is, inline functions reduce the function-call overhead (context switching) at
critical points of the code. These context switches are expensive
in execution time because of the matrix sizes.
This improvement was applied to some critical functions, chosen because they are called repeatedly
inside loops. It is important to point out that the
choice of functions is critical. Memory is a limited
resource, so time optimisation and memory occupation
had to be balanced here. Neither a very long function nor a
large number of functions can be inlined.
Furthermore, some minor optimisations were made
to improve the execution time a little more. On the
STM32F407VG a multiplication takes seven clock cycles fewer
than a division, so it was beneficial to replace every division with a
multiplication by the reciprocal. Besides, all numeric constants
and defined constants were changed to declared variables in
memory. Allocating all data in the same place improves
execution performance.
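The inlining and division-to-multiplication changes can be illustrated with a short C sketch. The helper names `vec_scale` and `vec_div` are hypothetical, and `static inline` is only a request that the compiler is free to ignore; whether a given toolchain honours it has to be checked in its documentation or in the generated code.

```c
#include <stddef.h>

/* Multiply every element of v by s. Small and called inside loops, so it
 * is a reasonable candidate for inlining. */
static inline void vec_scale(float *v, size_t n, float s)
{
    for (size_t i = 0; i < n; ++i) {
        v[i] *= s;
    }
}

/* Divide every element of v by d using a single division: the reciprocal
 * is computed once and then reused as a multiplier.
 * The caller must ensure that d is not zero. */
static inline void vec_div(float *v, size_t n, float d)
{
    const float inv = 1.0f / d;
    vec_scale(v, n, inv);
}
```

Multiplying by a precomputed reciprocal can differ from a true division in the last bit of the result, so this substitution trades a very small amount of accuracy for speed. It pays off most when the same divisor is applied to many elements, as in the normalisation of a vector or a matrix row.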
Fig. 1: Execution time in STM32F407VG
The algorithm contains some trigonometric functions.
The C standard library provides versions of these
functions for the float type, which are selected by calling
the trigonometric functions with an f suffix, that is, sinf, cosf, and
tanf instead of sin, cos, and tan. On the STM32F407VG this is an
important change because it allows the microcontroller to use
the FPU.
All these changes translated into a reduction of about 32%
in the execution time, so the final measured time was 27.9 ms
on the STM32F407VG, as shown in Fig. 2.
Fig. 2: Importance of each optimisation (Float variables: 68 %; Other optimisations: 32 %)
With the developed UKF code implemented on the platforms described in IV-A, the execution time was measured in order
to demonstrate the advantages of the authors’ optimisations.
When the UKF was implemented on the TMS320F28335 microcontroller, the measured execution time was 169.01 ms. This
result was due mainly to memory restrictions.
The preliminary test on the STM32F407VG microcontroller was carried out with double variables, resulting in an
execution time of 220.4 ms. When the variables were changed to
float, without any of the microcontroller optimisations
explained in section III, the measured execution time was
44.8 ms. With all the optimisations implemented, the measured execution time was 27.9 ms.
Fig. 1 shows a summary of the execution times measured during the
code optimisation.
Looking at these results, it is immediately clear that some
changes reduce the execution time more than others. Specifically, changing the variable type is the most significant optimisation that can be
made, as shown in Fig. 2. Furthermore, the optimisations
grouped as Other Optimisations are not the same for both microcontrollers.
These percentages could change because they depend on the
developed code, as well as on where the inline functions can be used.
The percentage due to Matrix libraries in Fig. 3 strongly
depends on the type of microcontroller used. Each manufacturer has its own mathematical libraries, and these also vary
with the microcontroller architecture. The percentage shown here is based
on the improvement obtained on the STM32F407VG.
Fig. 3: Other optimisations in detail (Matrix Libraries, Inline Functions, Minor optimisations; labelled shares: 59 %, 35 %)
To achieve this execution performance, the
minimum requirements for implementing an
optimised UKF on an embedded system can be listed. The recommended
requirements are:
57KB RAM memory.
Floating-point unit.
Math optimized libraries.
C language optimization tools.
Furthermore, reaching this execution time required a
significant programming effort to keep
the number of operations and the memory occupation to a minimum.
As future work, the authors would like to implement the
UKF on an FPGA. Low-level programming is the best way
to control and improve the execution time of an algorithm, but
it is more arduous to develop. Hardware-based computation
will speed up the calculations considerably, allowing the use
of other inputs to the filter, such as data from video cameras.
Real-time video feature extraction requires a great amount of
computational power, unobtainable even using current high-end
The authors would like to thank the Group TIC-109 of
the Department of Electronic Engineering of the University of
Seville and Skylife Engineering for making the test platforms
available and for their help during the implementation phase.
