Computationally Efficient Pixel-level Image Fusion
V Petrović, Prof. C Xydeas
Manchester Avionics Research Center (MARC)
Department of Electrical Engineering
University of Manchester
Oxford Road, Manchester, M13 9PL, UK
With the recent rapid developments in
the field of sensing technologies, multisensor
systems have become a reality in a growing
number of applications. The resulting increase
in the amount of data available is increasingly
being treated by image fusion. Image fusion
algorithms provide an effective way of reducing
the total amount of information presented
without perceptual loss of image quality or
content information. In this paper we present a
novel approach for fusion of image information
using a multiscale image processing technique
with a reduced number of levels. Emphasis is
placed on two points: i) design of a
computationally efficient fusion process that
operates on pre-registered input images to
provide a monochrome fused image and ii) the
minimisation/elimination of “aliasing” effects,
found in conventional multi-resolution fusion
algorithms. Fusion is achieved by combining
image information at two different ranges of
scales with spectral decomposition being
performed using adaptive size averaging
templates. Larger size features are fused using
simple arithmetic fusion methods while
efficient feature selection is applied to fuse the
finer details. The quality of fusion achieved by
this efficient scheme is equivalent to or better
than that obtained from more complex, conventional
image fusion techniques. To support this,
subjective performance results are provided as
well as complexity performance evaluations.
1. Introduction
With the recent rapid developments in
the field of sensing technologies, such as the
emergence of the third generation imaging
sensors, leading to enhanced performance at a
cheaper price, multisensor systems have
become a reality in a growing number of
applications. Earth imaging, civilian avionics
and medical imaging are just some of the areas
benefiting from such systems in addition to the
battlefield applications for which they were first
developed. Larger and spectrally more
independent sensor arrays provide for increased
spatial resolution and better spectral
discrimination of the image data available for
these applications. However, implementation of
such sensor arrays has resulted in a significant
increase in the raw amount of image data which
needs to be processed. Most of the conventional
image processing software has been designed
for optimal operation on single images and its
application in multisensor arrays must be
backed up by a large increase in computational
power. An alternative to this costly solution is
in the form of image fusion algorithms which
provide an effective way of reducing the total
amount of information presented without
perceptual loss of image quality or content
information. Other advantages of image fusion
such as improving situational awareness [10] and
night pilotage assistance [7] have also been
documented in the literature.
Although they achieve data amount
reduction, image fusion algorithms still operate
on very large input information sets and as a
result their computational complexity can be
prohibitively high for fast, real or near-real time
vision system operation. It is, therefore,
imperative to develop simple and efficient
fusion techniques if the implementation of
image fusion is to become a reality. Image
fusion systems can be differentiated from one
another according to the processing level
at which information fusion takes place.
Generally, fusion can take place at symbol,
object or pixel level. In this report we will restrict
ourselves to the basic signal, or pixel, level
image fusion where information fusion is
performed using raw signal data as input.
During the past decade, a number of
fusion algorithms have been developed and
presented in the literature. The majority of
them are based on multiresolution techniques of
image processing. Toet et al. [8,9] proposed
image fusion based on the contrast or Ratio of
Low-Pass (RoLP) pyramids, derived as the ratios
between corresponding samples at subsequent
levels of the Gaussian multiresolution pyramid.
Burt and Kolczynski [1] presented a system
based on the gradient pyramid using a saliency
and similarity measurement to determine how
pyramid coefficients are to be fused.
Development of the Discrete Wavelet
Transform presented another useful framework
for multiresolution image fusion. Chipman et
al. [2] introduced a common platform for wavelet
based image fusion, and Li et al. [4] and Petrović
and Xydeas [6] published systems using QMF
wavelet decomposition and area or
cross-subband based feature selection algorithms.
However, although some of these systems
exhibit a high level of fused image quality and a
good degree of robustness, they remain
conventional, computationally expensive
solutions. Ulug and McCullough [11], in contrast,
presented a real-time fusion system based on
linewise fusion using disjunctive functions,
which traded some robustness for efficiency.
In this paper we present a novel
approach for fusion of image information using
an adaptive multiscale image processing
technique with a reduced number of levels, that
achieves fusion quality better than or
equivalent to conventional multiresolution
schemes at a fraction of their complexity. The
system can be successfully applied to both
image sequences and single still input image
sets. It operates on pre-registered monochrome
images and gives a monochrome fused result.
The general theory of the fusion system is
presented in Section 2, with the internal
structures of its elements described in more
detail in Section 3. Fusion results and
performance evaluation are dealt with in
Section 4 and we conclude in Section 5.
2. Image Fusion System Theory
The fusion system proposed in this work
is based on an adaptive, multiresolution
approach with a reduced number of levels. The
aim is to preserve, at a reduced computational
complexity, the robustness and high image
quality of multiresolution fusion, and eliminate
reconstruction errors generated by such
systems. Conventional multiresolution
approaches, like the Gaussian pyramid and the
DWT, decompose the original signal into
sub-bands of logarithmically decreasing size. In
DWT for example, at every level the upper half
of the image spectrum is decomposed into four
sub-band signals a quarter size of the input
signal. However, this rigid structure is not
always optimal and for particular spectral
power distributions results in sub-band signals
that carry little or no useful information.
Reducing the number of
decomposition levels directly eliminates this
processing redundancy, resulting in decreased
complexity. Fewer levels also means fewer
sub-band signals, which reduces the possibility
of reconstruction errors as fewer discontinuities
are introduced during their fusion.
The general structure of the proposed
fusion system is given in Figure 1. The
multiscale structure is simplified into two
levels of scale only: the background and the
foreground levels. The background signal
contains the DC component and the surrounding
baseband and represents large scale features,
such as the position of the horizon and clouds,
the form of the terrain ahead, large obstacles
and any other information necessary for a
general description of the surrounding
environment. It also carries information
responsible for the natural appearance of the
fused image [3]. The foreground signal, on the
other hand, contains the upper parts of the
original spectrum, which means small scale
features, like markings, patterns and small
objects, vital for tasks such as object
recognition and classification. This form of
spectral division, however, does not necessarily
support the division into background and
foreground information that would be perceived
by a human operator; for example, objects that
appear large because they are close are
classified as background. We nevertheless keep
this terminology for simplicity.
Signal fusion is performed at both
levels independently. Background signals,
obtained as the direct product of the average
filtering, are combined using an arithmetic
fusion approach. Foreground signals, in
contrast, produced as the difference between the
original and background signals and exhibiting
a higher degree of feature localisation, are
fused using a simple pixel-level feature
selection technique. Finally, the resulting fused
foreground and background signals are summed
to produce the fused image. In order to achieve
optimal fusion performance an adaptive
feedback loop is implemented to optimise system
parameters for the current image set. The
statistical analysis box determines the template
size, the coefficients of the background
arithmetic fusion and possible input image
inversion, and distributes this information to
the relevant parts of the system. It does this
on the basis of the spectral power distributions,
standard deviation values and signal correlations
evaluated from the input and background signals.
Figure 1: General structure of the proposed
fusion system
3. Fusion System Description
In this section we will examine in more
detail the separate parts of the proposed fusion
system such as the spectral decomposition,
background and foreground signal fusion and
the statistical analysis box.
3.1. Spectral Decomposition
The spectral decomposition employed in
our system, introduced in the previous section,
represents a simplified version of the
conventional Gaussian-Laplacian pyramid
approach. The original signal is decomposed
into two sub-band signals using two-dimensional
average filtering with templates of adaptively
varying size. Although they do not possess
enviable spectral characteristics, averaging
templates are used since they require only a
fraction of the computational effort needed for
templates with better spectral characteristics
such as the Gaussian window. The spectral
characteristic of an averaging template can be
seen in Figure 2. The low-pass nature of its
response is evident in this plot, with the
pass-band being concentrated closely around the
zero frequency. Stop-band characteristics,
however, are not perfect. This is especially
noticeable at purely horizontal and vertical
frequencies where the stop-band response still
remains high, around 0.1, close to the maximum
frequency. For more diagonal frequencies, the
stop-band response is considerably better.
Although these imperfections indicate that some
of the high frequency information will be
present in the background signal, they do not
make a significant impact on the performance of
the fusion process and we can accept this
approximation as a compromise in favour of
efficient implementation. The pass-band
frequency is determined, for a fixed image size,
by the size of the template used.
Figure 2: Spectral amplitude response of an
averaging template
During fusion, the input image
signals are filtered using the averaging
templates to produce the low-pass,
background, signals. These background
signals are then subtracted from the original
image signals to obtain foreground signals.
No sub-sampling is performed, meaning the
foreground and background signals remain
of the same resolution and size as the
original input signals.
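To illustrate why averaging templates are cheap, the decomposition can be sketched with a running-sum (integral image) filter, where the cost per pixel does not depend on the template size. This is a minimal sketch, not the implementation used in this work: the names `box_mean` and `decompose`, the use of NumPy, the restriction to odd template sizes and the edge-replicating border handling are all assumptions.

```python
import numpy as np


def box_mean(img, size):
    """Mean over a size x size window (size odd); borders are edge-replicated."""
    r = size // 2
    padded = np.pad(np.asarray(img, dtype=np.float64), r, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = np.shape(img)
    # Each window sum needs four look-ups, whatever the template size.
    total = (ii[size:size + h, size:size + w] - ii[:h, size:size + w]
             - ii[size:size + h, :w] + ii[:h, :w])
    return total / (size * size)


def decompose(img, size):
    """Split an image into background (low-pass) and foreground (residual)."""
    background = box_mean(img, size)
    foreground = np.asarray(img, dtype=np.float64) - background
    return background, foreground
```

Because no sub-sampling takes place, both outputs have the shape of the input, and adding them back together returns the original image.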
3.2. Background Fusion
Background signals are fused using
simple arithmetic fusion methods. They contain
large scale information which is vital for spatial
awareness and the natural appearance of the
images [3]. However, precisely what amount of
information will be within the baseband of our
input image signals also depends on the nature
of the sensor used and the prevailing
conditions. In daylight and good visibility
conditions, for example, visible spectrum
sensors will usually have much more
background information than infrared. At night
and in foggy conditions, the situation may be
reversed. This kind of behaviour restricts us in
our design of background fusion as we cannot
rely on generalisations about sensor behaviour
to narrow our options to a single simple
arithmetic fusion method. Instead, signal
statistics are monitored in order to choose the
best possible fusion method. In principle, the
ideal solution would be to keep all of the
information from both images; however, this is
almost impossible when using arithmetic fusion,
because of the information loss introduced by
effects such as destructive superposition.
Avoiding these effects completely is impossible,
but a simple solution that can improve
performance in such cases is to use the
photographic negative of an input image instead
of the original image itself. This does not
completely remove the problem of destructive
superposition but offers a significant reduction
in its effect. Note, however, that care has to be
taken as to which image can be inverted, as any
significant change in the appearance of visible
spectrum images produces a fused image of
unnatural appearance to a human observer.
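The inversion step might be sketched as follows. The text only states that the decision rests on correlation measurements, so the rule used here (invert when the correlation coefficient is negative) is an assumption, as are the names `negative` and `maybe_invert` and the 8-bit maximum of 255. The first argument plays the role of the image that must keep its appearance, for example the visible band.

```python
import numpy as np


def negative(img, max_value=255.0):
    """Photographic negative of an image with values in [0, max_value]."""
    return max_value - img


def maybe_invert(reference, other, max_value=255.0):
    """Return (`other`, flag); `other` is inverted if negatively correlated.

    The sign test on the correlation coefficient is an assumed rule,
    not one given in the text.
    """
    r = np.corrcoef(reference.ravel(), other.ravel())[0, 1]
    if r < 0:
        return negative(other, max_value), True
    return other, False
```

Only the second input is ever inverted, which keeps the reference image unchanged.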
In our system we use two different
arithmetic fusion approaches for background
signals. Each gives its optimal results for a
complementary set of statistical conditions.
When, as mentioned earlier, one background
image dominates, that is, one of the inputs
contains significantly more background
information, we employ the direct elimination
approach. The dominant background image
becomes the fused background image while the
other background image is ignored. Otherwise,
in cases where the energies of the input
background signals have similar values, the
fused signal is constructed as the sum of the
non-DC, or zero-mean, input background
signals and the average of the input image mean
values. This relationship is given in Equation 1,
where $A_b$, $B_b$ and $F_b$ represent the two
input and the fused background signals
respectively and $\mu_A$ and $\mu_B$ signify the
input signal means.
$$F_b(x,y) = \left[A_b(x,y) - \mu_A\right] + \left[B_b(x,y) - \mu_B\right] + \frac{\mu_A + \mu_B}{2} \qquad (1)$$
This fusion mechanism ensures that all
the background information present in the input
signals gets transferred into the fused image. In
addition, if destructive superposition becomes a
problem, it is reduced by fusing an inverted
input. In the direct elimination case, however,
when the secondary, less active, background
image is sacrificed for the sake of
computational efficiency, we can ensure that it
does not result in any significant information
loss by optimising the criterion which decides
which of these two mechanisms is employed.
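The two background fusion mechanisms described in this section can be sketched as below. The function and argument names are hypothetical. The arithmetic variant follows the verbal description given above: the zero-mean background signals are summed and the average of the two input means is added back.

```python
import numpy as np


def fuse_background_arithmetic(ab, bb, mean_a, mean_b):
    """Sum of the zero-mean backgrounds plus the average of the input means."""
    return (ab - mean_a) + (bb - mean_b) + 0.5 * (mean_a + mean_b)


def fuse_background_elimination(ab, bb, a_dominates):
    """Direct elimination: keep the dominant background, ignore the other."""
    return ab if a_dominates else bb
```

With this construction the mean of the fused background equals the average of the two input means, so the overall brightness stays between those of the inputs.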
3.3. Foreground Fusion
Fusion of the foreground signals is
implemented using a simple feature selection
fusion mechanism. The signals contain small
scale, and usually high contrast, information
from the input images. Important information is
easier to localise than in the case of the
background signals and a feature selection
mechanism can be implemented at pixel level.
This form of selection increases the robustness
of the fusion system in comparison to the simple
arithmetic fusion methods used for background
signals. For each pixel in the fused foreground
image we choose the corresponding pixel with
the highest absolute value from the input
foreground images. This is also shown in
Equation 2, where $A_f$, $B_f$ and $F_f$ are the
input and fused foreground images respectively.
$$F_f(x,y) = \begin{cases} A_f(x,y), & |A_f(x,y)| > |B_f(x,y)| \\ B_f(x,y), & \text{otherwise} \end{cases} \qquad (2)$$
By definition, the background signal
contains the local mean of the input image at
every location. This means the difference
between the input image and the local mean,
contained in the foreground signal, can be taken
to represent the local contrast of the input image.
Accordingly, the fusion mechanism described
above can also be considered as a form of
maximisation of the local contrast.
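The pixel-level selection of this section reduces to a single vectorised comparison. The sketch below assumes NumPy arrays of equal shape. The name `fuse_foreground` is hypothetical, and resolving ties in favour of the second input is an arbitrary choice made here.

```python
import numpy as np


def fuse_foreground(af, bf):
    """Per pixel, keep the foreground sample with the larger absolute value."""
    return np.where(np.abs(af) > np.abs(bf), af, bf)
```

The fused image is then obtained by adding the fused foreground to the fused background, as described in Section 2.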
3.4. Statistical Analysis
Statistical analysis of the input signals
is necessary to determine optimal parameters
for other parts of the fusion system. These
include the average filter template size, whether
to use input signal inversion and which of the
background fusion algorithms to apply. The
input image inversion decision is made on the
basis of correlation measurements but, due to
the temporally constant nature of signal
characteristics for the majority of sensors,
inversion decisions need to be made only seldom
for a particular pair of input sensors. The rest of
the parameters in the current system are
determined using the standard deviation,
$\sigma$, of the signals. The spectral
decomposition is influenced by the size of the
averaging template. For a given image size it
determines the relative boundary between the
pass and stop bands. Oversize filtering
separates too much low frequency information
into the foreground signal, compromising the
important zero local mean property. Selective
fusion of such foreground signals can produce
undesirable effects in the form of false edges in
image areas where no significant information
resides but the selection decision switches from
one image to the other. Similarly, undersize
filtering means that real foreground information
is fused using sub-optimal arithmetic fusion.
Optimally, we would like the standard deviation
of the background signal to fall to 80% of the
standard deviation of the original input signal.
In the case of sequence fusion we can use
subsequent frames to increase or decrease the
template size according to the distance from the
desired ratio.
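For sequence fusion, the frame-to-frame adjustment of the template size could look like the sketch below. Only the 80% target comes from the text. The proportional update, the `gain` value, the size limits and the name `next_template_size` are assumptions made for illustration.

```python
def next_template_size(size, sigma_background, sigma_input,
                       target=0.8, gain=20.0, min_size=3, max_size=63):
    """Move an odd template size towards the target standard deviation ratio."""
    if sigma_input == 0:
        return size  # flat input: nothing to adapt to
    ratio = sigma_background / sigma_input
    # A larger template smooths more and lowers the ratio, so the size
    # grows when the ratio is above the target and shrinks when below.
    step = int(round(gain * (ratio - target)))
    new_size = size + 2 * step  # steps of two keep the size odd
    return max(min_size, min(max_size, new_size))
```

Called once per frame with the statistics of the previous frame, this moves the template size further the larger the distance from the desired ratio.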
Finally, the decision on which of the
background fusion approaches to use is made
on the basis of the relative sizes of the standard
deviations of the two input background signals.
If one of the background images has a standard
deviation at least twice that of the other, then it
is taken as the fused background image.
Otherwise, if the standard deviations remain
within 50% of each other, both of the background
signals are inputs to the arithmetic fusion
method given in Equation 1.
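The decision rule above can be written as a short function. The name `choose_background_method`, the returned labels and the treatment of a ratio of exactly two as dominance are assumptions.

```python
import numpy as np


def choose_background_method(ab, bb, dominance=2.0):
    """Select the background fusion approach from the two standard deviations."""
    sigma_a = float(np.std(ab))
    sigma_b = float(np.std(bb))
    if sigma_a >= dominance * sigma_b:
        return "use_a"       # A dominates and becomes the fused background
    if sigma_b >= dominance * sigma_a:
        return "use_b"       # B dominates and becomes the fused background
    return "arithmetic"      # comparable activity: fuse both arithmetically
```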
4. Fusion Results
Although there have been as many
attempts as there have been fusion algorithms,
as yet no universally accepted standard has
emerged for evaluating image fusion
performance. In our work we restricted
ourselves to determining performance levels in
two main aspects of image fusion: fused image
quality and computational complexity. In the
literature, subjective image viewing tests have
been a relatively standard way of determining
the relative quality of fused imagery [7,10]. We
used them in our work to compare the
performance of our system with an established
fusion algorithm. As for computational
complexity evaluation, we can be considerably
more exact. McDaniel et al. [5] compared the
complexity of a small number of fusion systems
on the basis of the number of operations per
pixel needed to fuse a pair of input images.
They counted every arithmetic operation as one
operation and a memory call as three
operations. Complexity results calculated using
this method, together with some from McDaniel
et al. [5], are given in Table 1. The value given
for our system, which we name CEMIF here, is
for image sequence fusion and exhibits a great
reduction in computational effort compared to
other, more conventional fusion methods.
Fusion System
Ops / pixel
Image Averaging
Laplacian Pyramid Fusion
Video Boost and Add
QMF Multiresolution Fusion
LME&M Cross
LME&M Morphological
Table 1: Computational complexity values
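The counting convention used for this comparison (one unit per arithmetic operation, three per memory call) can be expressed as a small helper. The tally in the example is hypothetical and is not one of the values from Table 1.

```python
ARITHMETIC_COST = 1  # every arithmetic operation counts as one
MEMORY_COST = 3      # every memory call counts as three


def ops_per_pixel(arithmetic_ops, memory_calls):
    """Weighted operation count for one output pixel."""
    return ARITHMETIC_COST * arithmetic_ops + MEMORY_COST * memory_calls


# Hypothetical tally for selecting between two foreground samples:
# two absolute values and one comparison, two reads and one write.
selection_cost = ops_per_pixel(arithmetic_ops=3, memory_calls=3)
```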
4.1. Subjective Image Viewing
The human visual system is still the most
complex and able vision system known to man.
On this basis, relative subjective quality
tests use human subjects to determine the
relative perceived image quality of a number of
fusion systems. They involve presenting
subjects with a set of the input images and a
fused image from each fusion system to be
evaluated. In subjective tests performed as part
of this project in late August of 1999, we
compared our proposed system with an
established DWT multiresolution fusion system
using area based feature selection [4] (referred to
here as WMRF). A wide range of input scenes
was selected to encompass the largest number
of possible fusion scenarios. Figure 3 shows an
example of an image set used for this test. The
input images, which are channel images of an
AMS Daedalus hyperspectral scanner, are on the
top row, a) and b), and represent an aerial view
of an industrial facility. The image on the
bottom left of the set, c), represents the fusion
result of the WMRF system. Finally, the fused
image produced by our own CEMIF algorithm is
in the bottom right, d). Relative positions of the
fused images in other test sets were randomised
to avoid any bias. Considering the images in
Figure 3 it is relatively easy to spot the
advantages of our multiscale approach, d).
Image features are all clear and there are none
of the perceivable reconstruction errors that
plague the image produced by the conventional
WMRF fusion, c).
Figure 3: Fusion quality subjective testing, input images a) and b) and fused images, WMRF c) and
our CEMIF d)
In total nine subjects took part in the
subjective tests. For each of the twelve
presented sets of images, subjects were asked to
express their preference for one or none of the
presented fused images based on perceived
image quality. The results of this test are
summarised in the form of the total number of
preferences expressed by subjects for each
system and are shown on the bar chart in Figure
4. This chart, again, clearly indicates the
advantage of our system (CEMIF) over the
conventional multiresolution method (WMRF).
Overall, out of 9*12=108 preference votes, 57
or 52.8% were for our fusion system with 43 or
39.8% for WMRF and 8 (7.4%) undecided
votes. Pairwise, the situation is similar, with
subjects showing preference for our fused image
in 7 out of 12 (58.3%) image sets with 4 sets
(33.3%) against and one (8.3%) remaining
undecided. The main reason for these results
was the ‘ringing’ artifacts present in the
conventional method and visible in Figure 3 c),
which are not present in images produced by
our system.
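The overall vote shares can be recomputed directly from the counts quoted above. This is a small arithmetic check. The dictionary keys are labels chosen here.

```python
votes = {"CEMIF": 57, "WMRF": 43, "undecided": 8}

total = sum(votes.values())  # nine subjects times twelve image sets
# Share of all preference votes, in percent, rounded to one decimal place.
shares = {name: round(100.0 * count / total, 1)
          for name, count in votes.items()}
```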
Figure 4: Subjective preferences for particular
fusion system
5. Conclusion
In this report we presented a novel,
efficient, multiscale approach to image fusion.
Fusion quality equivalent to or better than that of
conventional multiresolution fusion approaches
at less than 10 % of the computational
complexity warrants further investigation into
such simplistic and efficient approaches to what
is essentially a robust and effective image
processing framework. The use of averaging
templates also indicates a relatively low
sensitivity of multiscale fusion approaches to
imperfect filter spectral responses. In the case
of this actual system, further research to
determine its robustness for a large number of
fusion applications is required before any
serious implementation can be considered.
Acknowledgements
The authors gratefully acknowledge all
the members of the Manchester Avionics
Research Center at MU and British Aerospace
Military Aircraft and Aerostructures, Warton,
Lancashire for their support during this work.
Thanks are also due to the American Government
AMPS programme for providing the input imagery.
References
[1] P Burt, R Kolczynski, “Enhanced
Image Capture Through Fusion”,
Proceedings of the Fourth Int.
Conference on Computer Vision,
Berlin, May 1993, pp 173-182
[2] L Chipman, T Orr, L Graham,
“Wavelets and image fusion”, Proc.
SPIE, Vol. 2569, 1995, pp 208-219
[3] W Hendee, P Wells, “The Perception of
Visual Information”, Springer, New
York 1997
[4] H Li, B Manjunath, S Mitra,
“Multisensor Image Fusion Using the
Wavelet Transform”, Graphical
Models and Image Proc., Vol. 57, No.
3, 1995, pp 235-245
[5] R McDaniel, D Scribner, W Krebs, P
Warren, N Ockman, J McCarley,
“Image Fusion for Tactical
Applications”, Proc. SPIE, Vol. 3436,
1998, pp 685-695
[6] V Petrović, C Xydeas, “Multiresolution
image fusion using cross band feature
selection”, Proc. SPIE, Vol. 3719,
1999, pp 319-326
[7] D Ryan, R Tinkler, “Night Pilotage
Assessment of Image Fusion”, Proc.
SPIE, Vol. 2465, 1995, pp 50-67
[8] A Toet, L van Ruyven, J Valeton,
“Merging thermal and visual images by
a contrast pyramid”, Opt. Engineering,
1989, Vol. 28, No. 7, pp 789-792
[9] A Toet, “Hierarchical Image Fusion”,
Machine Vision & Apps., 1990, Vol. 3,
pp 1-11
[10] A Toet, J Ijspeert, A Waxman, M
Aguilar, “Fusion of Visible and
Thermal Imagery Improves Situational
Awareness”, Proc. SPIE, Vol. 3088,
1997, pp 177-188
[11] M Ulug, L McCullough, “Feature and
data level fusion of infrared and visual
images”, Proc. SPIE, Vol. 3719, pp