Multi-Scale CNN for Multi-Focus Image Fusion

Multi-scale convolutional neural network for multi-focus image fusion
Hafiz Tayyab Mustafa ⁎, Jie Yang, Masoumeh Zareapoor
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
Article history:
Received 23 February 2019
Accepted 5 March 2019
Available online 21 March 2019
Keywords:
Multi-focus image fusion
Convolutional neural network
Structure similarity
Abstract
In this study, we present a new deep learning (DL) method for fusing multi-focus images. Current multi-focus image fusion (MFIF) approaches based on DL mainly treat MFIF as a classification task: they use a convolutional neural network (CNN) as a classifier to label pixels as focused or defocused. However, because labeled data for training such networks are unavailable, existing supervised DL-based models for MFIF add Gaussian blur to focused images to produce training data. Existing unsupervised DL-based models are too simple and have only been applied to fusion tasks other than MFIF. To address these issues, we propose a new MFIF method that learns the feature extraction, fusion, and reconstruction components together, yielding a fully unsupervised, end-to-end trainable deep CNN. To enhance the feature extraction capability of the CNN, we introduce a Siamese multi-scale feature extraction module. In the proposed network, multi-scale convolutions and skip connections are used to extract more useful common features from a multi-focus image pair. Instead of a basic loss function, our model uses the structural similarity (SSIM) measure as the training loss. Moreover, the fused images are reconstructed in a multi-scale manner to restore image details more accurately. The proposed model can process images of variable size during testing and validation. Experimental results on various test images show that the proposed method yields fused images of better quality than those generated by the compared state-of-the-art image fusion methods.
1. Introduction
Pixel-level image fusion aims to combine two or more input images into a fused image that is more informative for human visual perception than any of the source images. Pixel-level fusion plays a significant role in image fusion and is used in many applications, which makes it an active topic in the image processing field. It is difficult for imaging devices to capture an image in which all objects are in focus. With typical device settings, the limited depth of field of camera lenses makes it very difficult to capture a single image in which all parts of a scene at different depths are sharp. Hence some areas of the image are blurred, and such images are unsuitable for most image processing tasks. Fusing images of the same scene taken with different focus settings into one all-in-focus image is called multi-focus image fusion (MFIF) [1]. Image fusion has been studied for decades, and many algorithms have been developed for it. Based on the fusion strategy, image fusion methods can be roughly categorized into two groups [2]: transform domain methods and spatial domain methods.
E-mail address: mustafa.tayyab@hotmail.com (H.T. Mustafa).
In previous image fusion research, transform domain methods were mostly based on multi-scale transforms (MST). These methods include the Laplacian pyramid [3], discrete wavelet decomposition [4], stationary wavelet decomposition [5], curvelet [6], nonsubsampled contourlet transform (NSCT) [7], dual-tree complex wavelet [8], and others. MST-based MFIF methods decompose the input images into multi-scale representations (MSRs) and then obtain a fused MSR by merging the MSRs of the different images according to a specific fusion rule [9]. Apart from MST-based methods, feature domain-based MFIF methods have also been introduced in recent years, for instance those based on sparse representation [10], independent component analysis [11], and robust principal component analysis [12].
Spatial domain methods can be classified into three groups: block-based, region-based, and pixel-based methods. Block-based and region-based methods [13,14] first divide the source images into blocks or regions according to some strategy and then blend them. Region-based image fusion methods [15,16] perform fusion on irregularly shaped regions obtained by segmentation. A limitation of these methods is that they depend on precise segmentation of the source images. Recently, pixel-based MFIF methods have become popular in image fusion research. These methods extract features from the input images while preserving the spatial consistency of the final fused image [17], and they apply complex fusion strategies as well. Pixel-based methods such as image matting [18], guided filtering [19], and dense scale-invariant feature transform [17] have also achieved
H.T. Mustafa et al. / Image and Vision Computing 85 (2019) 26–35
promising results in MFIF. Most conventional image fusion methods rely on manually designed components, such as the image transform, activity level measurement, and fusion rule, to achieve state-of-the-art performance. However, it is not easy to build an ideal design manually, because such methods are difficult to implement and computationally expensive.
Owing to its strong capability in feature extraction and data representation, deep learning (DL) has attracted much attention and achieved many advances in image processing and computer vision tasks. To overcome the issues of handcrafted methods, DL-based image fusion approaches have recently been introduced for different applications. For instance, in digital photography [20–22], multi-modality imaging [23–25], and remote sensing imaging [26–28], DL-based methods have achieved state-of-the-art results and are gaining popularity in image fusion research. An important reason for this popularity is the strong ability of DL models to extract image features automatically, which overcomes the limitations of manually designed models [29]. In addition, user-friendly DL libraries such as Theano, Keras, and TensorFlow, together with open-source large-scale image datasets such as CIFAR-10 [30], PASCAL VOC [31], and ImageNet, facilitate research on image fusion.
DL-based multi-focus fusion methods use a convolutional neural network (CNN) as part of the fusion algorithm, and so far CNN-based models have treated MFIF as a classification problem. Liu et al. [20] proposed a CNN-based method for multi-focus image fusion. They used a popular image classification dataset as training data and applied multi-scale Gaussian filtering with different standard deviations to random patches of grayscale images to simulate multi-focus images. Their model classifies pixels as focused or unfocused to obtain a focus map of the same size as the input images. Binary segmentation of the focus map then yields a decision map, and finally a pixel-wise weighted-average strategy is used to obtain the fused image. Following Liu et al. [20], Tang et al. [21] introduced a pixel CNN (p-CNN) for multi-focus image fusion. They used the CIFAR-10 [30] dataset as training images and obtained defocused images by automatically adding Gaussian blur to the original images. The model is trained to learn three probabilities for each pixel (focused, defocused, and unknown) from the information of its neighboring pixels. The source images are used to form a score matrix, a decision map is obtained by comparing the values of the score matrix, and the fused image is finally obtained from the final decision map. Du and Gao [32] introduced a segmentation-based multi-focus image fusion network that uses multi-scale input; its architecture follows the model used in [19]. However, a common limitation of these supervised learning methods is the lack of labeled data for image fusion: they simulate blur on image patches from popular image classification datasets to create training samples, which makes the training data unrealistic. Furthermore, these methods use only the output of the last layers, which loses useful information captured by the middle layers. Prabhakar et al. [22] introduced a CNN-based unsupervised approach for multi-exposure fusion (MEF). They presented a simple CNN with two convolutional layers for encoding the source images; the encoding network generates feature map sequences, which are added to obtain fused maps. Finally, the image is reconstructed by a three-layer decoding network. Although their method performs well for MEF, its feature extraction architecture is very simple and may lose useful features, which makes it unsuitable for multi-focus images.
To overcome these issues, we propose a multi-scale convolutional neural network for fusing multi-focus images. We introduce Siamese multi-scale feature extraction so that the network learns more effective convolutional filters for feature extraction and reconstruction, which results in more accurate fusion. Multiple convolutions within a single convolutional layer allow the CNN to learn more precise image details. Additionally, we use skip connections between layers to ensure a smooth training process. The proposed network learns the feature extraction, fusion, and reconstruction processes jointly, resulting in an end-to-end network. As the training loss function, we use the structural similarity (SSIM) image quality metric, which here does not require a ground-truth fused image. To the best of our knowledge, this is the first end-to-end multi-scale CNN for fusing multi-focus image pairs.
In brief, our contributions are as follows:
• We propose a new multi-scale CNN-based unsupervised DL model for multi-focus image fusion.
• We introduce a Siamese multi-scale feature extraction module that extracts more accurate features from the input images; skip connections are added to ensure smooth training of the network.
• We use the structural similarity (SSIM) image quality metric, computed without a ground-truth fused image, as the network training loss function.
• The network is flexible in that it can process images of variable size during testing.
2. The proposed method
As mentioned earlier, convolutional neural networks (CNNs) have been successfully applied to various image fusion tasks and have achieved state-of-the-art performance. In image fusion, activity level measurement is of great significance and can be regarded as feature extraction, so it is essential to extract image features precisely. A CNN provides a good solution for image fusion because feature extraction, fusion, and reconstruction can all be learned by a single model. Our proposed network consists of 11 trainable layers and is divided into three main components: Siamese multi-scale feature extraction, feature fusion, and multi-scale reconstruction. Fig. 1 gives an overview of the proposed network. To obtain more detailed feature extraction and more precise reconstruction of fused images, we use two multi-scale convolutional blocks in the feature extraction module and one in the reconstruction module; these are a slightly modified version of the blocks used for image super-resolution [33]. Fig. 2 illustrates the main difference between a basic convolutional layer and a multi-scale convolutional layer. The following sections describe the network architecture in detail.
Fig. 1. The detailed architecture of the proposed method.
Fig. 2. Difference between the basic convolutional layer of a CNN and a convolutional layer with multi-scale convolutions.
2.1. Multi-scale feature extraction
The convolutional layers of a CNN automatically learn filters from training images to extract local features of the source images. The choice of convolutional kernel size is crucial for feature extraction. Conventionally, small kernels extract short edges or low-frequency content, but high-frequency details or other useful details of the images cannot be extracted at the same time. Similarly, large kernels extract larger features but miss the low-frequency content of the images. Using the same filter size, or even alternating sizes, in every convolutional layer makes the CNN deeper, and the resulting complex computations slow down training. This inspired us to apply multi-scale convolutions within the same layer of the CNN to capture both the low-frequency and the high-frequency details of the source images. Our proposed multi-scale feature extraction module is a Siamese CNN consisting of two identical subnetworks that share the same weights. The pair of source images $I_1$ and $I_2$ are the inputs to the two feature extraction subnetworks. All convolutional layers are followed by a bias and a rectified linear unit (ReLU) layer for nonlinearity. The first layer of the network is a single-scale convolutional layer that remaps image features. Each single-scale convolutional layer together with its ReLU layer can be expressed as

$$F_i = \max(0,\; W_i * F_{i-1} + b_i), \tag{1}$$

where $W_i$ and $b_i$ represent the convolutional kernel and the bias of the $i$-th layer, respectively. $F_i$ is the output feature map of the $i$-th convolutional layer, and $F_0$ denotes the original multi-focus input images. For convenience, ‘*’ denotes the convolution operation in this paper.
2.1.1. Multi-scale convolutional block
In our proposed CNN model, multi-scale feature extraction consists of two multi-scale convolutional blocks (MSCBs). Each MSCB consists of a multi-scale convolutional (MS-conv) layer and a flat convolutional layer. The MS-conv layer comprises three simultaneous convolution operations with different kernel sizes. Multi-scale convolutions within the same layer of the CNN help the model capture both the low-frequency and the high-frequency details of the source images. Each MS-conv layer in the network can be expressed as

$$F_m^{[l]} = \max\left(0,\; \mathrm{Concat}\left(f_1^{[l]}, f_2^{[l]}, f_3^{[l]}\right)\right), \tag{2}$$

where $F_m^{[l]}$ is the output feature map of the $l$-th MS-conv layer with depth $m = 3$, and $f_1^{[l]}$, $f_2^{[l]}$, and $f_3^{[l]}$ are the feature maps obtained after the multi-scale convolutions, which can be expressed as

$$\begin{cases} f_1^{[l]} = w_1^{[l]} * F^{[l-1]} + b_1^{[l]}, \\ f_2^{[l]} = w_2^{[l]} * F^{[l-1]} + b_2^{[l]}, \\ f_3^{[l]} = w_3^{[l]} * F^{[l-1]} + b_3^{[l]}. \end{cases} \tag{3}$$

Three sets of convolutional filters $w_m^{[l]}$ of size 3 × 3, 5 × 5, and 7 × 7 are convolved with the feature map $F^{[l-1]}$, and $b_m^{[l]}$ is the bias added to each feature map of the $l$-th MS-conv layer. Each convolution operation outputs feature maps, which are then merged by concatenation along the spectral (channel) dimension. The next layer of the MSCB is a 3 × 3 convolutional layer that reduces the spectral dimension of the feature maps. We do not use pooling layers in the proposed network, because pooling shrinks the image while eliminating image details that are helpful for reconstruction later. The Siamese design makes the network learn the same kind of feature maps for each input image, so that the output feature maps of the two branches are of the same type. To facilitate gradient flow during training, and inspired by [34], a skip connection is used after every two layers in the network. The complete architecture of the proposed multi-scale convolutional block is illustrated in Fig. 3.
Fig. 3. Proposed multi-scale convolutional block.
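As an illustration of the multi-scale convolutional block described above, the following plain-Python sketch applies three parallel zero-padded convolutions (3 × 3, 5 × 5, and 7 × 7), concatenates their outputs along the channel axis, applies ReLU, and then reduces the channel count with a flat 3 × 3 convolution. It is a minimal sketch under stated assumptions, not the authors' TensorFlow implementation: the helper names, the random filter values, the number of feature maps per branch, and the per-kernel "same" padding are all illustrative choices.

```python
import random

def conv2d_same(image, kernel):
    """Zero-padded 'same' filtering of one channel (cross-correlation, as in most DL libraries)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:  # pixels outside the image count as zero
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out

def conv_layer(channels, filters, biases):
    """filters[o][c] is the k x k kernel linking input channel c to output channel o."""
    h, w = len(channels[0]), len(channels[0][0])
    outputs = []
    for kernels, bias in zip(filters, biases):
        acc = [[bias] * w for _ in range(h)]
        for channel, kernel in zip(channels, kernels):
            response = conv2d_same(channel, kernel)
            for y in range(h):
                for x in range(w):
                    acc[y][x] += response[y][x]
        outputs.append(acc)
    return outputs

def relu(channels):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in channels]

def ms_conv(channels, branch_filters, branch_biases):
    """Three parallel convolutions, concatenated along the channel axis, then ReLU."""
    merged = []
    for filters, biases in zip(branch_filters, branch_biases):
        merged.extend(conv_layer(channels, filters, biases))
    return relu(merged)

def random_filters(n_out, n_in, k, rng):
    """Hypothetical stand-in for trained weights."""
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(k)]
             for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8 x 8 single-channel input
maps_per_branch = 2                                            # illustrative width, not from the paper
branch_filters = [random_filters(maps_per_branch, 1, k, rng) for k in (3, 5, 7)]
branch_biases = [[0.0] * maps_per_branch for _ in (3, 5, 7)]

ms_out = ms_conv([image], branch_filters, branch_biases)       # 3 branches x 2 maps = 6 channels
flat_filters = random_filters(2, len(ms_out), 3, rng)          # flat 3 x 3 conv: 6 channels -> 2
block_out = relu(conv_layer(ms_out, flat_filters, [0.0, 0.0]))
print(len(ms_out), len(block_out), len(block_out[0]), len(block_out[0][0]))  # prints: 6 2 8 8
```

Because no pooling is used and every branch is zero-padded to the input size, the spatial size of the feature maps stays 8 × 8 throughout, while only the channel count changes.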
2.2. Feature fusion
As mentioned earlier, the multi-scale feature extraction module makes the network produce feature maps of a similar type for each input image. To fuse the corresponding levels of features from each source image, the extracted features are merged by a concatenation operation, $F_M = \mathrm{Concat}(f_1^m, f_2^m)$, where $f_1^m$ and $f_2^m$ represent the feature maps obtained by feature extraction from the source images $I_1$ and $I_2$, respectively, and $F_M$ denotes the fused feature map (representation). This fused representation is then used as the input of the reconstruction module to restore the fused image.
2.3. Reconstruction
Our proposed multi-scale reconstruction module consists of 6 trainable convolutional layers interleaved with ReLU layers. To alleviate gradient problems during training, the first four convolutional layers are connected via skip connections. The first four convolutional layers use 3 × 3 kernels, and the subsequent multi-scale convolutional layer consists of three convolution operations with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, respectively. All of these convolution operations are performed simultaneously, and their feature maps are merged. The last layer is a convolutional layer with a 3 × 3 kernel that reconstructs the all-in-focus fused image.
2.4. Implementation details
For precise reconstruction of fused images, it is essential to choose a suitable loss function. Several image quality metrics are widely used in image processing applications, such as the mean square error (MSE), root mean square error (RMSE), and peak signal-to-noise ratio (PSNR). They are generally used to compute the loss between an input image and a reference image. However, they do not match human visual perception well, because signal error is not the same as the degradation of visual quality in the human visual system (HVS). Moreover, our proposed network is an unsupervised CNN: reference (ground-truth) images are lacking and are not easy to acquire in practice, so computing the training loss with commonly used loss functions becomes challenging. To overcome these difficulties, we use structural similarity [35] as the loss function of our network. The structural similarity (SSIM) index extracts structural information from images by comparing sliding windows at corresponding positions of the images. The window moves across the image pixel by pixel, and SSIM is calculated within each local window. SSIM separates three largely independent components: luminance, contrast, and structure. We aim to learn a mapping function that generates a fused image that is the same as the desired all-in-focus image. Suppose we have a reference image $r$ and a test image $t$; then SSIM [35] can be defined as

$$\mathrm{SSIM}(r, t \mid x) = \frac{(2\bar{x}_r \bar{x}_t + C_1)(2\sigma_{x_r x_t} + C_2)}{(\bar{x}_r^2 + \bar{x}_t^2 + C_1)(\sigma_{x_r}^2 + \sigma_{x_t}^2 + C_2)}, \tag{4}$$

where $C_1$ and $C_2$ are small constants, $x_r$ is a sliding window in the reference image $r$, $\bar{x}_r$ is the mean of $x_r$, and $\sigma_{x_r}^2$ and $\sigma_{x_r x_t}$ are the variance of $x_r$ and the covariance of $x_r$ and $x_t$, respectively. To compute SSIM in local windows, we first calculate $\mathrm{SSIM}(r_1, \hat{t} \mid x)$ and $\mathrm{SSIM}(r_2, \hat{t} \mid x)$ from Eq. (4). In our method, the constants $C_1$ and $C_2$ are set to 0.0001 and 0.0009, respectively. The sliding window size is set to 11 × 11, and the window moves pixel by pixel across the image from top-left to bottom-right. The SSIM loss function [36] on an image patch $I_p$ can be defined as

$$L_{SSIM}(I_p) = \frac{1}{n} \sum_{\hat{I}_p \in I_p} \left(1 - \mathrm{SSIM}(\hat{I}_p)\right), \tag{5}$$
where $n$ represents the total number of sliding windows, and the computed loss is back-propagated to train the network. The above equation can be simplified as

$$L_{SSIM}(I_p) = 1 - \mathrm{SSIM}(\tilde{I}_p), \tag{6}$$

where $\tilde{I}_p$ is the center pixel of patch $I_p$. The pixel loss is calculated as

$$L_p = \lVert Y - I \rVert_2, \tag{7}$$

where $Y$ is the desired generated all-in-focus image and $I$ is the input image. The final loss is the combination of the structural similarity loss (Eq. (6)) and the pixel loss (Eq. (7)):

$$L = L_{SSIM}(I_p) + L_p. \tag{8}$$
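To make the loss concrete, the sketch below evaluates the SSIM of Eq. (4) on a single pair of windows and combines the resulting SSIM loss with the pixel loss of Eq. (7). It is a simplified illustration under stated assumptions, not the training code: windows are flat lists of intensities in [0, 1], statistics are computed with uniform (not Gaussian) weights, only one reference image is handled (how the two per-source SSIM values are combined is not modeled), and all function names are hypothetical.

```python
import math

C1, C2 = 0.0001, 0.0009  # constants reported in the text

def mean(values):
    return sum(values) / len(values)

def ssim_window(xr, xt):
    """SSIM of two equally sized windows, following the form of Eq. (4)."""
    mr, mt = mean(xr), mean(xt)
    var_r = mean([(a - mr) ** 2 for a in xr])
    var_t = mean([(b - mt) ** 2 for b in xt])
    cov = mean([(a - mr) * (b - mt) for a, b in zip(xr, xt)])
    numerator = (2 * mr * mt + C1) * (2 * cov + C2)
    denominator = (mr ** 2 + mt ** 2 + C1) * (var_r + var_t + C2)
    return numerator / denominator

def ssim_loss(ref_windows, fused_windows):
    """Average of 1 - SSIM over the n sliding windows."""
    n = len(ref_windows)
    return sum(1.0 - ssim_window(r, t) for r, t in zip(ref_windows, fused_windows)) / n

def pixel_loss(generated, source):
    """L2 norm of the difference between two images given as flat lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(generated, source)))

def total_loss(ref_windows, fused_windows, generated, source):
    """Sum of the SSIM loss and the pixel loss."""
    return ssim_loss(ref_windows, fused_windows) + pixel_loss(generated, source)

# Toy example: one 2 x 2 window flattened to four values (the paper uses 11 x 11 windows).
ref = [0.2, 0.4, 0.6, 0.8]
fused = [0.25, 0.35, 0.65, 0.75]
print(round(total_loss([ref], [fused], fused, ref), 4))
```

When the fused window equals the reference window, both terms vanish and the total loss is zero, which is the behavior the combined loss is designed to reward.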
Deeper networks are more likely than shallow networks to extract accurate and abstract features from images [37,38]. However, it is very challenging to train a deep CNN while maintaining the gradient flow through its layers so that it converges in an affordable time. As the number of layers increases, the accuracy of a network commonly starts to decrease, and degradation eventually occurs. Moreover, when the loss is back-propagated through a deep network from output to input, its gradient tends to become smaller and smaller, which causes problems such as vanishing gradients and long convergence times. Residual learning [34] provides one of the most efficient solutions to these issues, allowing deep CNNs to be trained with improved accuracy and fast convergence. In residual learning, a few layers of the network are bypassed by a skip connection (shortcut), so that the network can directly learn an identity function through the skip connection alone. Mathematically, the normal convolutional filtering $x^{[n]} = y^{[n]}(x^{[n-1]})$ is replaced with $x^{[n]} = y^{[n]}(x^{[n-1]}) + x^{[n-1]}$, so that the residual $x^{[n]} - x^{[n-1]}$ becomes the prediction target. It is much easier for the network to learn the residual image than the original input image. In our proposed network, we use skip connections with identity mapping to add the feature maps $F_i$ and $F_{i+2}$.
Before training, the parameters of the convolutional layers are initialized randomly, and zero padding of 1 pixel is applied around the boundaries, which keeps the size of the feature maps identical to that of the source images. We use the open-source image dataset MS COCO [39] as the training dataset, which contains more than 80,000 RGB images. All images are resized to 200 × 200 for training, and the learning rate is set to 0.0001. For network training, we use the TensorFlow framework (1.5.0) on an Nvidia workstation with two GPUs, an Intel Core i7 CPU, and 256 GB of RAM.
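The residual formulation discussed in this section can be illustrated with a few lines of Python. The sketch below adds the input of a two-layer stack to its output, mirroring the addition of the feature maps two layers apart; the `scale` helper is a hypothetical stand-in for a trainable convolutional layer, and feature maps are flattened to short lists for brevity.

```python
def two_layer_skip(features, layer_a, layer_b):
    """Return layer_b(layer_a(F)) + F, i.e. an identity skip connection over two layers."""
    transformed = layer_b(layer_a(features))
    return [t + f for t, f in zip(transformed, features)]

def scale(factor):
    """Hypothetical stand-in for a trainable layer: multiplies every value by `factor`."""
    return lambda values: [factor * v for v in values]

features = [1.0, -2.0, 0.5]

# If the two layers output zeros (residual = 0), the block reduces to the identity mapping.
identity_out = two_layer_skip(features, scale(0.0), scale(1.0))

# Otherwise the layers only have to model the difference between output and input.
residual_out = two_layer_skip(features, scale(2.0), scale(0.25))
print(identity_out, residual_out)
```

The first call shows why an identity mapping is easy to represent with a skip connection: the stacked layers merely need to output zeros.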
3. Experimental results
To validate the effectiveness of our proposed multi-scale convolutional neural network (MSCNN) model, in this section we present a detailed comparison of our method with several state-of-the-art multi-focus image fusion (MFIF) methods. We select five recently proposed MFIF methods for comparison: image matting (IM) [18], guided filtering (GF) [19], the multi-scale weighted gradient-based method (MSWG) [40], boundary finding (BF) [41], and the convolutional neural network (CNN) method [20]. We perform experiments on 30 pairs of multi-focus images: 20 pairs come from the publicly available “Lytro” dataset [42], and the remaining 10 pairs have been used extensively in MFIF research.
Fig. 4. Fused results of different image fusion methods and the proposed method on the “Baby” image set: (a) source image 1, (b) source image 2, (c) IM, (d) GF, (e) MSWG, (f) BF, (g) CNN, (h) proposed.
3.1. Comparison with other methods
Based on visual perception, we compare the performance of various MFIF methods to validate our proposed MSCNN-based method. We provide some examples here to show the differences among the selected MFIF methods. Fig. 4 shows the “Baby” source image pair and the fused results obtained by the different methods and the proposed method. Fig. 4(a) and (b) are the pair of source images, whereas Fig. 4(c)–(h) are the fused images output by IM, GF, MSWG, BF, CNN, and our proposed method, respectively. The fused image produced by our method is free from obvious artifacts, whereas the fused images from the other methods contain artifacts around edges, and their background details are less clear than in our result. Fig. 5 compares the result of our method on the “Clock” source image set with those of the other methods mentioned above. From the figure, it can be seen that the fused image generated by our proposed model provides the best fusion result. In another example, detailed results for the “Horse” images are shown in Fig. 6; the results obtained from IM, GF, MSWG, BF, and CNN contain some blur artifacts in the background region, whereas the results achieved by our method are free from such artifacts. Some of the multi-focus image pairs used in our experiments are displayed in Fig. 7. For a more detailed comparison, the fused results for 10 multi-focus image pairs are shown in Fig. 8. From the figure, it can be seen that our proposed model achieves better fusion results than the other methods.
Fig. 5. Detailed fused results of the “Clock” image set: (a) source image 1, (b) source image 2, (c) IM, (d) GF, (e) MSWG, (f) BF, (g) CNN, (h) proposed.
Fig. 6. The “Horse” source images and fused results using different MFIF methods and our proposed method: (a) source image 1, (b) source image 2, (c) IM, (d) GF, (e) MSWG, (f) BF, (g) CNN, (h) proposed.
3.2. Quantitative evaluation
Objective evaluation plays a significant role in image fusion, but it is not an easy task because the ideal all-in-focus fused image is not always available. Many quantitative metrics have been proposed for evaluating image fusion performance; however, no single metric is regarded as an ideal benchmark. Therefore, it is essential to use several metrics to verify the performance of the proposed method. Liu et al. [43] presented a survey of image fusion evaluation metrics and classified them into four categories: 1) information theory-based metrics, 2) image feature-based metrics, 3) image structural similarity-based metrics, and 4) human perception-based metrics. In this study, we evaluate our results using four metrics, one from each category. These metrics are the normalized mutual information $Q_{MI}$ [44], phase congruency $Q_{PC}$ [45], image structural similarity $Q_{IS}$ [46], and the human perception-based metric $Q_{HP}$ [47]. For all metrics, a higher value indicates a better fusion result. These metrics are briefly introduced below.
Fig. 7. Some multi-focus image pairs used in our experiments.
Fig. 8. Fusion results of different methods on the source images in Fig. 7. Left to right: fused results of IM, GF, MSWG, BF, CNN, and ours.
3.2.1. Information theory-based metric (QMI)
Normalized mutual information $Q_{MI}$ [44] is an information theory-based image fusion metric that measures the amount of mutual information between the source images and the fused image. $Q_{MI}$ can be computed as

$$Q_{MI} = 2\left[\frac{MI(F, I_1)}{H(F) + H(I_1)} + \frac{MI(F, I_2)}{H(F) + H(I_2)}\right],$$

where $I_1$ and $I_2$ represent the source images, $F$ denotes the fused image of $I_1$ and $I_2$, $MI(\cdot,\cdot)$ denotes the mutual information between two images, and $H(F)$, $H(I_1)$, and $H(I_2)$ are the entropies of $F$, $I_1$, and $I_2$, respectively.
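The normalized mutual information metric can be sketched with histogram-based entropies. The example below is an illustrative implementation under stated assumptions rather than the reference code of [44]: images are flattened to lists of discrete intensity values, probabilities are estimated from raw counts, and mutual information is obtained from the identity MI(A, B) = H(A) + H(B) − H(A, B). The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    """MI(A, B) = H(A) + H(B) - H(A, B) for two equally long sequences."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def q_mi(fused, src1, src2):
    """Normalized mutual information between the fused image and the two sources.

    Assumes none of the images is constant; otherwise a denominator could be zero.
    """
    h_f = entropy(fused)
    term1 = mutual_information(fused, src1) / (h_f + entropy(src1))
    term2 = mutual_information(fused, src2) / (h_f + entropy(src2))
    return 2 * (term1 + term2)

# Toy flattened "images" with four gray levels.
src1 = [0, 1, 2, 3, 0, 1, 2, 3]
src2 = [0, 1, 2, 3, 3, 2, 1, 0]
fused = [0, 1, 2, 3, 0, 1, 2, 0]
print(round(q_mi(fused, src1, src2), 4))
```

With this normalization, a fused image identical to both sources reaches the maximum value of 2, and a fused image that is statistically independent of both sources scores 0.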
3.2.2. Image feature-based metric (QPC)
The phase congruency metric $Q_{PC}$ [45] is an image feature-based metric that compares the phase congruency features of the source images with those of the fused image. $Q_{PC}$ can be defined as

$$Q_{PC} = (P_p)^{\alpha} (P_M)^{\beta} (P_m)^{\gamma},$$

where the subscripts $p$, $M$, and $m$ refer to the phase congruency and the maximum and minimum moments, respectively, computed between the source images and the fused image. The exponential parameters in the above equation are set as $\alpha = \beta = \gamma = 1$.
3.2.3. Image structural similarity-based metric (QIS)
Image structural similarity (SSIM) [35] measures the similarity between images and is widely used to compare them. Yang et al. [46] introduced a modified SSIM to evaluate the performance of image fusion. $Q_{IS}$ is a structural similarity-based metric that evaluates how well the structural information of the source images is preserved in the fused image. It can be computed as

$$Q_{IS} = \begin{cases} \lambda(w)\,\mathrm{SSIM}(I_1, F \mid w) + (1 - \lambda(w))\,\mathrm{SSIM}(I_2, F \mid w), & \mathrm{SSIM}(I_1, I_2 \mid w) \ge 0.75, \\ \max\{\mathrm{SSIM}(I_1, F \mid w),\, \mathrm{SSIM}(I_2, F \mid w)\}, & \mathrm{SSIM}(I_1, I_2 \mid w) < 0.75, \end{cases}$$

where SSIM is the structural similarity [35] and $\lambda(w)$ is a local weight computed as

$$\lambda(w) = \frac{s(I_1 \mid w)}{s(I_1 \mid w) + s(I_2 \mid w)},$$

where $w$ is a 7 × 7 local window, and $s(I_1 \mid w)$ and $s(I_2 \mid w)$ are the variances of the source images $I_1$ and $I_2$ within the local window $w$, respectively.
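The piecewise rule for the structural similarity-based metric can be written as a small per-window function. The sketch below takes precomputed SSIM values and local variances for one window as inputs; it is an illustration under stated assumptions, with hypothetical function names, and the handling of perfectly flat windows (both variances zero) is an assumption that the text does not specify.

```python
def local_weight(var1, var2):
    """Weight of source image 1 within a window, from the two local variances."""
    total = var1 + var2
    if total == 0:
        return 0.5  # assumption: equal weights when both windows are flat
    return var1 / total

def q_is_window(ssim_1f, ssim_2f, ssim_12, var1, var2, threshold=0.75):
    """Structural similarity-based score for a single local window."""
    if ssim_12 >= threshold:
        # The sources agree locally: variance-weighted average of both similarities.
        lam = local_weight(var1, var2)
        return lam * ssim_1f + (1 - lam) * ssim_2f
    # The sources disagree locally: keep the better-preserved one.
    return max(ssim_1f, ssim_2f)

# Window where the sources are similar (SSIM 0.8 >= 0.75) and source 1 has more detail.
similar = q_is_window(ssim_1f=0.9, ssim_2f=0.5, ssim_12=0.8, var1=3.0, var2=1.0)
# Window where the sources differ (SSIM 0.6 < 0.75).
different = q_is_window(ssim_1f=0.9, ssim_2f=0.5, ssim_12=0.6, var1=3.0, var2=1.0)
print(similar, different)
```

In the first window the weight of source 1 is 3 / (3 + 1) = 0.75, so the score is 0.75 × 0.9 + 0.25 × 0.5 = 0.8; in the second window the larger of the two similarities, 0.9, is used.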
Table 1. The objective assessment of various fusion methods for the fusion of five pairs of validation multi-focus source images.
Fig. 9. Graphical comparison of image structural similarity-based metric values of 10 fused images from other methods and our proposed method.
3.2.4. Human perception-based metric (QHP)
The human perception-based metric $Q_{HP}$ is a contrast-based image fusion metric proposed by Chen and Blum [47]. This metric uses major features of a human visual system model and compares the contrast features of the input images with those of the fused image. The global quality map is computed as

$$Q_{GQM}(x, y) = \lambda_{I_1}(x, y)\, Q_{I_1 F}(x, y) + \lambda_{I_2}(x, y)\, Q_{I_2 F}(x, y),$$

where $I_1$ and $I_2$ represent the source images and $F$ denotes the fused image of $I_1$ and $I_2$. $Q_{I_1 F}$ and $Q_{I_2 F}$ represent the contrast information preserved from the source images $I_1$ and $I_2$ in the fused image $F$, respectively, and $\lambda_{I_1}$ and $\lambda_{I_2}$ are the saliency maps of $Q_{I_1 F}$ and $Q_{I_2 F}$, respectively. Finally, $Q_{HP}$ is computed as the average of the global quality map, $Q_{HP} = \overline{Q}_{GQM}$.
The average scores of the fused images obtained by our method and by the other fusion methods for the five multi-focus image pairs are listed in Table 1. The highest values are shown in bold. The results show that our method achieves better performance in most cases. Fig. 9 shows a graphical comparison of the image structural similarity-based metric $Q_{IS}$ values for 10 fused images obtained from IM, GF, MSWG, BF, CNN, and our proposed method.
3.3. Comparison of computational efficiency
To evaluate the computational efficiency of our proposed method against other MFIF methods, Table 2 lists the average running time of each method. The source codes of the compared methods (IM, GF, MSWG, and CNN) are available online and can be downloaded from the website [48]. All compared methods are implemented in MATLAB, and our method is implemented in TensorFlow. For this experiment, we use a computer with an Intel Core i7-5960X CPU, an Nvidia Titan X GPU, and 32 GB of RAM. Ten pairs of multi-focus source images of variable size are used for testing. The table shows that the handcrafted methods IM, GF, MSWG, and BF are computationally more efficient, whereas our proposed method is more efficient than the CNN-based method. Our proposed CNN-based method requires complex computations at test time but yields an effective end-to-end solution for image fusion.
Table 2. The average running time (in seconds) of different methods and the proposed method.
Fig. 10. Multi-exposure fusion results using the proposed method: (a) underexposed image, (b) overexposed image, (c) fused image, (d) underexposed image, (e) overexposed image, (f) fused image.
3.4. Application to multi-exposure fusion
Here we discuss the possibility of applying our proposed method to other image fusion applications. When a scene contains shadows or highlighted regions, photographers face the challenge of choosing a suitable exposure. Under some lighting conditions, the image becomes too bright or too dark. In such cases, multi-exposure fusion (MEF) techniques are applied to fuse images taken with different exposures. The MEF problem is similar to MFIF, except that the source images vary in exposure rather than in focus. To test the generalizability of the CNN, we use the already trained network, without fine-tuning, to fuse multi-exposure images. Fig. 10 shows that our proposed model successfully fuses images with variable exposure. This suggests that the CNN model is generic and could be used in digital photography applications of image fusion.
4. Conclusion
In this paper, we introduced a new method for multi-focus image fusion (MFIF) based on a multi-scale convolutional neural network (MSCNN). The proposed network learns all modules together, yielding a fully unsupervised, end-to-end trainable deep MSCNN. To the best of our knowledge, this is the first time that an MSCNN has been applied to multi-focus image fusion. In the feature extraction module, we applied multi-scale convolutional filters to extract more accurate features from the source images. For precise reconstruction of the fused images, our model uses the structural similarity (SSIM) image quality metric, computed without a ground-truth fused image, as the loss. Finally, the fused images are reconstructed in a multi-scale manner to restore image details more accurately. We trained our model on an open-source image dataset and performed extensive experiments, with both quantitative and qualitative evaluations, to validate the proposed method. The proposed CNN model could also be used in other image fusion applications, such as multi-exposure fusion and infrared and visible image fusion. In the future, we aim to make our model more robust and generic so that it can fuse more than two images as well as images containing moving objects.
Conflict of interest
We hereby confirm that there is no conflict of interest between authors to declare.
Acknowledgment
This research is partly supported by NSFC, China (No: 61876107, 61572315, U1803261) and 973 Plan, China (No. 2015CB856004).
