Computationally Efficient Pixel-level Image Fusion

V Petrović, Prof. C Xydeas
Manchester Avionics Research Center (MARC), Department of Electrical Engineering, University of Manchester, Oxford Road, Manchester, M13 9PL, UK

Abstract

With the recent rapid developments in sensing technology, multisensor systems have become a reality in a growing number of applications. The resulting increase in the amount of available data is increasingly being addressed by image fusion. Image fusion algorithms provide an effective way of reducing the total amount of information presented without perceptual loss of image quality or content. In this paper we present a novel approach to the fusion of image information using a multiscale image processing technique with a reduced number of levels. Emphasis is placed on two points: i) the design of a computationally efficient fusion process that operates on pre-registered input images to provide a monochrome fused image, and ii) the minimisation or elimination of the "aliasing" effects found in conventional multiresolution fusion algorithms. Fusion is achieved by combining image information at two different ranges of scale, with spectral decomposition performed using adaptively sized averaging templates. Larger features are fused using simple arithmetic fusion methods, while efficient feature selection is applied to fuse the finer details. The quality of fusion achieved by this efficient scheme is equivalent to or better than that obtained from more complex, conventional image fusion techniques. To support this, subjective quality results are provided as well as computational complexity evaluations.

1. Introduction

With the recent rapid developments in sensing technology, such as the emergence of third-generation imaging sensors offering enhanced performance at lower cost, multisensor systems have become a reality in a growing number of applications. Earth imaging, civilian avionics and medical imaging are just some of the areas benefiting from such systems, in addition to the battlefield applications for which they were first developed. Larger and spectrally more independent sensor arrays provide increased spatial resolution and better spectral discrimination of the image data available to these applications. However, the implementation of such sensor arrays has resulted in a significant increase in the raw amount of image data that needs to be processed. Most conventional image processing software has been designed for optimal operation on single images, and its application to multisensor arrays must be backed by a large increase in computational power. An alternative to this costly solution comes in the form of image fusion algorithms, which provide an effective way of reducing the total amount of information presented without perceptual loss of image quality or content. Other advantages of image fusion, such as improved situational awareness [10] and night pilotage assistance [7], have also been documented in the literature. Although they reduce the amount of data, image fusion algorithms still operate on very large input information sets, and as a result their computational complexity can be prohibitively high for fast, real-time or near-real-time vision system operation. It is therefore imperative to develop simple and efficient fusion techniques if the implementation of image fusion is to become a reality.
Image fusion systems can be differentiated according to the processing level at which information fusion takes place: generally, fusion can be performed at symbol, object or pixel level. In this paper we restrict ourselves to basic signal, or pixel, level image fusion, where information fusion is performed using raw signal data as input. During the past decade, a number of fusion algorithms have been developed and presented in the literature, the majority of them based on multiresolution image processing techniques. Toet et al. [8, 9] proposed image fusion based on the contrast, or Ratio of Low-Pass (RoLP), pyramid, derived as the ratios between corresponding samples at subsequent levels of the Gaussian multiresolution pyramid. Burt and Kolczynski [1] presented a system based on the gradient pyramid, using saliency and similarity measurements to determine how pyramid coefficients are to be fused. The development of the Discrete Wavelet Transform (DWT) provided another useful framework for multiresolution image fusion. Chipman et al. [2] introduced a common platform for wavelet based image fusion, and Li et al. [4] and Petrović and Xydeas [6] published systems using QMF wavelet decomposition and area based or cross-subband feature selection algorithms. However, although some of these systems exhibit a high level of fused image quality and a good degree of robustness, their conventional multiresolution approaches represent computationally expensive solutions. Ulug and McCullough [11], in contrast, presented a real-time fusion system based on line-wise fusion using disjunctive functions, which traded some robustness for efficiency. In this paper we present a novel approach to the fusion of image information using an adaptive multiscale image processing technique with a reduced number of levels, which achieves fusion quality better than or equivalent to conventional multiresolution schemes at a fraction of their complexity. The system can be successfully applied to both image sequences and single still input image sets. It operates on pre-registered monochrome images and gives a monochrome fused result. The general theory of the fusion system is presented in Section 2, with the internal structure of its elements described in more detail in Section 3. Fusion results and performance evaluation are dealt with in Section 4, and we conclude in Section 5.

2. Image Fusion System Theory

The fusion system proposed in this work is based on an adaptive multiresolution approach with a reduced number of levels. The aim is to preserve, at a reduced computational complexity, the robustness and high image quality of multiresolution fusion, and to eliminate the reconstruction errors generated by such systems. Conventional multiresolution approaches, like the Gaussian pyramid and the DWT, decompose the original signal into subbands of logarithmically decreasing size. In the DWT, for example, at every level the upper half of the image spectrum is decomposed into four sub-band signals, each a quarter of the size of the input signal. However, this rigid structure is not always optimal and, for particular spectral power distributions, results in sub-band signals whose energy approaches zero (no information). Reducing the number of decomposition levels directly eliminates this processing redundancy, resulting in decreased complexity. Fewer levels also mean fewer sub-band signals, which reduces the possibility of reconstruction errors, as fewer discontinuities are introduced during their fusion.
The general structure of the proposed fusion system is given in Figure 1. The multiscale structure is simplified into only two levels of scale: the background and the foreground levels. The background signal contains the DC component and the surrounding baseband, and represents large scale features such as the position of the horizon and clouds, the form of the terrain ahead, large obstacles and any other information necessary for a general description of the surrounding environment. It also carries the information responsible for the natural appearance of the fused image [3]. The foreground signal, on the other hand, contains the upper parts of the original spectrum, that is, small scale features such as markings, patterns and small objects, vital for tasks such as object recognition and classification. This form of spectral division, however, does not necessarily match the division into background and foreground information that would be perceived by a human operator, e.g. objects that appear large because they are close would be classified as background; we nevertheless keep this terminology for simplicity.

Signal fusion is performed at both levels independently. Background signals, obtained as the direct product of the average filtering, are combined using an arithmetic fusion approach. Foreground signals, in contrast, produced as the difference between the original and background signals and exhibiting a higher degree of feature localisation, are fused using a simple pixel-level feature selection technique. Finally, the resulting fused foreground and background signals are summed to produce the fused image. In order to achieve optimal fusion performance, an adaptive feedback loop is implemented to optimise the system parameters for the current image set. The statistical analysis box determines the template size, the coefficients of the background arithmetic fusion and possible input image inversion, and distributes this information to the relevant parts of the system. It does this on the basis of the spectral power distributions, standard deviation values and signal correlations evaluated from the input and background signals.

Figure 1: General structure of the proposed fusion system

3. Fusion System Description

In this section we examine in more detail the separate parts of the proposed fusion system: the spectral decomposition, the background and foreground signal fusion, and the statistical analysis box.

3.1. Spectral Decomposition

The spectral decomposition employed in our system, introduced in the previous section, represents a simplified version of the conventional Gaussian-Laplacian pyramid approach. The original signal is decomposed into two sub-band signals using two-dimensional average filtering with templates of adaptively varying size. Although they do not possess enviable spectral characteristics, averaging templates are used since they require only a fraction of the computational effort needed for templates with better spectral characteristics, such as the Gaussian window. The spectral characteristic of an averaging template can be seen in Figure 2. The low-pass nature of its response is evident in this plot, with the pass band concentrated closely around zero frequency. The stop-band characteristics, however, are not perfect. This is especially noticeable at purely horizontal and vertical frequencies, where the stop-band response remains high, around 0.1, close to the maximum frequency. For more diagonal frequencies the stop-band response is considerably better. Although these response imperfections indicate that some high frequency information will be present in the background signal, they do not have a significant impact on the performance of the fusion process, and we accept this approximation as a compromise in favour of an efficient implementation. The pass-band cut-off frequency is determined, for a fixed image size, by the size of the template used.

Figure 2: Spectral amplitude response of an averaging template

During fusion, the input image signals are filtered using the averaging templates to produce the low-pass, background, signals. These background signals are then subtracted from the original image signals to obtain the foreground signals. No sub-sampling is performed, meaning the foreground and background signals remain of the same resolution and size as the original input signals.
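As a rough illustration of this decomposition, the following Python/NumPy sketch computes a background signal with a box (averaging) filter and takes the foreground as the residual. The function name, the use of SciPy's uniform_filter and the default template size are assumptions made for this sketch and are not taken from the original system.

```python
import numpy as np
from scipy.ndimage import uniform_filter  # 2-D box (averaging) filter

def decompose(image, template_size=15):
    """Two-scale decomposition sketch (Section 3.1).

    The background is the local mean obtained with a square averaging
    template; the foreground is the residual, so it has zero local mean by
    construction.  No sub-sampling is performed, so both signals keep the
    resolution of the input.  The default template size of 15 is only a
    placeholder; in the real system the size is chosen adaptively.
    """
    image = np.asarray(image, dtype=np.float64)
    background = uniform_filter(image, size=template_size, mode='nearest')
    foreground = image - background
    return background, foreground
```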
3.2. Background Fusion

Background signals are fused using simple arithmetic fusion methods. They contain the large scale information which is vital for spatial awareness and the natural appearance of the images [3]. However, precisely how much information falls within the baseband of the input image signals also depends on the nature of the sensor used and the prevailing conditions. In daylight and good visibility, for example, visible spectrum sensors will usually carry much more background information than infrared ones; at night and in foggy conditions the situation may be reversed. This kind of behaviour restricts the design of the background fusion, as we cannot rely on generalisations of sensor behaviour to reduce our options to a single simple arithmetic fusion method. Instead, signal statistics are monitored in order to choose the best possible fusion method. In principle, the ideal solution would be to keep all of the information from both images; however, this is almost impossible when using arithmetic fusion, because of the information loss introduced by effects such as destructive superposition. Avoiding these effects completely is impossible, but a simple measure that can improve performance in such cases is to use the photographic negative of an input image instead of the original image itself. This does not completely remove the problem of destructive superposition but offers a significant reduction in its effect. Note, however, that care has to be taken as to which image is inverted, as any significant change in the appearance of visible spectrum images produces a fused image of unnatural appearance to a human observer.

In our system we use two different arithmetic fusion approaches for the background signals. They give their optimal results for complementary sets of statistical conditions. In cases when, as mentioned earlier, one background image dominates, i.e. one of the inputs contains significantly more background information, we employ the direct elimination approach: the dominant background image becomes the fused background image while the other background image is ignored. Otherwise, when the energies of the input background signals have similar values, the fused signal is constructed as the sum of the non-DC (zero mean) input background signals and the average of the input image mean values. This relationship is given in Equation 1, where A_b, B_b and F_b represent the two input and the fused background signals respectively, and \bar{A} and \bar{B} signify the input signal means.

F_b(x, y) = \left[ A_b(x, y) - \bar{A} \right] + \left[ B_b(x, y) - \bar{B} \right] + \frac{\bar{A} + \bar{B}}{2}    (Eq. 1)

This fusion mechanism ensures that all the background information present in the input signals is transferred into the fused image. In addition, if destructive superposition becomes a problem, it is reduced by fusing an inverted input. In the former case, however, when the secondary, less active, background image is sacrificed for the sake of computational efficiency, we can ensure that no significant information is lost by optimising the criterion that decides which of the two mechanisms is employed.

3.3. Foreground Fusion

Foreground signal fusion is implemented using a simple feature selection mechanism. The foreground signals contain small scale, and usually high contrast, information from the input images. Important information is easier to localise than in the background signals, so a feature selection mechanism can be implemented at pixel level. This form of selection increases the robustness of the fusion system in comparison to the simple arithmetic fusion methods used for the background signals. For each pixel in the fused foreground image we choose the corresponding pixel with the highest absolute value from the input foreground images, as shown in Equation 2, where A_f, B_f and F_f are the input and fused foreground images respectively.

F_f(x, y) = \begin{cases} A_f(x, y), & \text{if } |A_f(x, y)| > |B_f(x, y)| \\ B_f(x, y), & \text{otherwise} \end{cases}    (Eq. 2)

By definition, the background signal contains the local mean of the input image at every location. This means that the difference between the input image and its local mean, contained in the foreground signal, can be taken to represent the local contrast of the input image. Accordingly, the fusion mechanism described above can also be considered a form of local contrast maximisation.
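The two fusion rules and the final reconstruction can be sketched along the same lines. As before, this is only an illustrative fragment under our own naming and parameter assumptions; it implements Equations 1 and 2 directly and then sums the fused background and foreground signals.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fuse_background(a_bg, b_bg):
    """Arithmetic background fusion (Eq. 1): sum of the zero-mean background
    signals plus the average of the two means.  Because each background is the
    local mean of its input, its global mean equals the input image mean."""
    a_mean, b_mean = a_bg.mean(), b_bg.mean()
    return (a_bg - a_mean) + (b_bg - b_mean) + (a_mean + b_mean) / 2.0

def fuse_foreground(a_fg, b_fg):
    """Pixel-level feature selection (Eq. 2): keep the foreground sample with
    the larger absolute value, i.e. the larger local contrast."""
    return np.where(np.abs(a_fg) > np.abs(b_fg), a_fg, b_fg)

def fuse(a, b, template_size=15):
    """End-to-end sketch: decompose both inputs (Section 3.1), fuse each
    level independently and sum the fused background and foreground."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a_bg = uniform_filter(a, size=template_size, mode='nearest')
    b_bg = uniform_filter(b, size=template_size, mode='nearest')
    return fuse_background(a_bg, b_bg) + fuse_foreground(a - a_bg, b - b_bg)
```

In sequence fusion, the template size passed to such a routine would be updated from frame to frame by the statistical analysis described in the following subsection.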
3.4. Statistical Analysis

Statistical analysis of the input signals is necessary to determine the optimal parameters for the other parts of the fusion system. These include the averaging template size, whether to use input signal inversion and which of the background fusion algorithms to apply. The input image inversion decision is made on the basis of correlation measurements but, due to the temporally constant nature of the signal characteristics of most sensors, inversion decisions are made only seldom for a particular pair of input sensors. The remaining system parameters are determined using standard deviation, σ, measurements.

Fusion performance is heavily influenced by the size of the averaging template. For a given image size it determines the relative boundary between the pass and stop bands. Oversize filtering separates too much low frequency information into the foreground signal, compromising its important zero local mean property. Selective fusion of such foreground signals can produce undesirable effects in the form of false edges in image areas where no significant information resides but the selection decision switches from one image to the other. Conversely, undersize filtering means that real foreground information is fused using the suboptimal arithmetic fusion. Ideally, we would like the standard deviation of the background signal to fall to 80% of the standard deviation of the original input signal. In the case of sequence fusion we can use subsequent frames to increase or decrease the template size according to the distance from the desired ratio.

Finally, the decision on which of the background fusion approaches to use is made on the basis of the relative sizes of the standard deviations of the two input background signals. If one of the background images has a standard deviation twice that of the other, then it is taken as the fused background image. Otherwise, if the standard deviations remain within 50% of each other, both background signals are fused using the arithmetic method given in Equation 1.
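A minimal sketch of these two statistical decisions is given below. The 2:1 standard deviation threshold and the 80% target ratio come from the text above; the per-frame template adjustment step is not specified in the paper and is purely an assumption made for this sketch.

```python
import numpy as np

def choose_background_mode(a_bg, b_bg):
    """Decide between direct elimination and the arithmetic fusion of Eq. 1,
    based on the relative standard deviations of the two background signals
    (Section 3.4)."""
    sa, sb = a_bg.std(), b_bg.std()
    if sa >= 2.0 * sb:
        return 'use_a'        # A's background dominates: direct elimination
    if sb >= 2.0 * sa:
        return 'use_b'        # B's background dominates: direct elimination
    return 'arithmetic'       # comparable energies: apply Eq. 1

def adapt_template_size(image, background, size, step=2):
    """Nudge the averaging template size towards the point where the
    background standard deviation is about 80% of the input's.  The step of
    2 pixels per frame is an assumption; it simply keeps the size odd."""
    ratio = background.std() / image.std()
    if ratio > 0.8:                   # background keeps too much detail:
        return size + step            # the template is too small, enlarge it
    if ratio < 0.8:                   # too much energy in the foreground:
        return max(3, size - step)    # the template is too large, shrink it
    return size
```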
4. Fusion Results

Although there have been almost as many evaluation attempts as there have been fusion algorithms, no universally accepted standard has yet emerged for evaluating image fusion performance. In our work we restricted ourselves to determining performance in two main aspects of image fusion: fused image quality and computational complexity. In the literature, subjective image viewing tests have been a relatively standard way of determining the relative quality of fused imagery [7, 10], and we used them to compare the performance of our system with an established fusion algorithm. Computational complexity evaluation can be considerably more exact. McDaniel et al. [5] compared the complexity of a small number of fusion systems on the basis of the number of operations per pixel needed to fuse a pair of input images, assuming each arithmetic operation to be equivalent to one operation and each memory call to three operations. Complexity results calculated using this method, together with some taken from McDaniel et al. [5], are given in Table 1. The value given for our system, named CEMIF here, is for image sequence fusion and exhibits a great reduction in computational effort compared to other, more conventional, fusion methods.

Fusion System                  Ops / pixel
Image Averaging                        10
Laplacian Pyramid Fusion              520
Video Boost and Add                   945
QMF Multiresolution Fusion           1200
LME&M Cross                          2250
LME&M Morphological                  4200
CEMIF                                  90

Table 1: Computational complexity values

4.1. Subjective Image Viewing

The human visual system is still the most complex and capable vision system known. On this basis, relative subjective quality tests use human subjects to determine the relative perceived image quality of a number of fusion systems. They involve presenting subjects with a set of input images and a fused image from each fusion system to be evaluated. In subjective tests performed as part of this project in late August 1999, we compared our proposed system with an established DWT multiresolution fusion system using area based feature selection [4] (referred to here as WMRF). A wide range of input scenes was selected to cover the largest number of possible fusion scenarios. Figure 3 shows an example of an image set used for this test. The input images, which are channel images of an AMS Daedalus hyperspectral scanner, are on the top row, a) and b), and represent an aerial view of an industrial facility. The image on the bottom left of the set, c), is the fusion result of the WMRF system, and the fused image produced by our own CEMIF algorithm is on the bottom right, d). The relative positions of the fused images in other test sets were randomised to avoid any bias. Considering the images in Figure 3, it is relatively easy to spot the advantages of our multiscale approach, d). Image features are all clear and there are none of the perceivable reconstruction errors that plague the image produced by the conventional WMRF fusion, c).

Figure 3: Fusion quality subjective testing; input images a) and b), and fused images WMRF c) and our CEMIF d)

In total, nine subjects took part in the subjective tests. For each of the twelve presented sets of images, subjects were asked to express their preference for one or none of the presented fused images based on perceived image quality. The results of this test are summarised as the total number of preferences expressed by the subjects for each system and are shown in the bar chart in Figure 4. This chart again clearly indicates the advantage of our system (CEMIF) over the conventional multiresolution method (WMRF). Overall, out of 9*12 = 108 preference votes, 57 (52.7%) were for our fusion system, 43 (39.8%) for WMRF and 8 (7.4%) were undecided. Pairwise, the situation is similar, with subjects preferring our fused image in 7 out of 12 (58.3%) image sets, with 4 sets (33.3%) against and one (8.3%) remaining undecided. The main reason for these results was the 'ringing' artefacts present in the conventional method, visible in Figure 3 c), which are not present in images produced by our system.

Figure 4: Subjective preferences for each fusion system

5. Conclusion

In this paper we presented a novel, efficient, multiscale approach to image fusion. The achieved image quality, better than that of conventional multiresolution fusion approaches at less than 10% of the computational complexity, warrants further investigation into such simple and efficient approaches to what is essentially a robust and effective image processing framework. The use of averaging templates also indicates a relatively low sensitivity of multiscale fusion approaches to imperfect filter spectral responses. In the case of this particular system, further research to determine its robustness over a large number of fusion applications is required before any serious implementation can be considered.

Acknowledgements

The authors gratefully acknowledge all the members of the Manchester Avionics Research Center at MU and British Aerospace Military Aircraft and Airstructures, Warton, Lancashire, for their support during this work, and the American Government AMPS programme for providing the input imagery.

References

[1] P Burt, R Kolczynski, "Enhanced Image Capture Through Fusion", Proceedings of the Fourth Int. Conference on Computer Vision, Berlin, May 1993, pp 173-182
[2] L Chipman, T Orr, L Graham, "Wavelets and image fusion", Proc. SPIE, Vol. 2569, 1995, pp 208-219
[3] W Hendee, P Wells, "The Perception of Visual Information", Springer, New York, 1997
[4] H Li, S Manjunath, S Mitra, "Multisensor Image Fusion Using the Wavelet Transform", Graphical Models and Image Proc., Vol. 57, No. 3, 1995, pp 235-245
[5] R McDaniel, D Scribner, W Krebs, P Warren, N Ockman, J McCarley, "Image Fusion for Tactical Applications", Proc. SPIE, Vol. 3436, 1998, pp 685-695
[6] V Petrović, C Xydeas, "Multiresolution image fusion using cross band feature selection", Proc. SPIE, Vol. 3719, 1999, pp 319-326
[7] D Ryan, R Tinkler, "Night Pilotage Assessment of Image Fusion", Proc. SPIE, Vol. 2465, 1995, pp 50-67
[8] A Toet, L van Ruyven, J Valeton, "Merging thermal and visual images by a contrast pyramid", Opt. Engineering, Vol. 28, No. 7, 1989, pp 789-792
[9] A Toet, "Hierarchical Image Fusion", Machine Vision & Apps., Vol. 3, 1990, pp 1-11
[10] A Toet, J Ijspeert, A Waxman, M Aguilar, "Fusion of Visible and Thermal Imagery Improves Situational Awareness", Proc. SPIE, Vol. 3088, 1997, pp 177-188
[11] M Ulug, L McCullough, "Feature and data level fusion of infrared and visual images", Proc. SPIE, Vol. 3719, 1999, pp 312-318