Image and Vision Computing 85 (2019) 26–35

Multi-scale convolutional neural network for multi-focus image fusion

Hafiz Tayyab Mustafa*, Jie Yang, Masoumeh Zareapoor
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
*Corresponding author. E-mail address: mustafa.tayyab@hotmail.com (H.T. Mustafa).
https://doi.org/10.1016/j.imavis.2019.03.001

Article history: Received 23 February 2019; Accepted 5 March 2019; Available online 21 March 2019.
Keywords: Multi-focus image fusion; Convolutional neural network; Unsupervised; Structure similarity.

Abstract

In this study, we present a new deep learning (DL) method for fusing multi-focus images. Current DL-based multi-focus image fusion (MFIF) approaches mainly treat MFIF as a classification task and use a convolutional neural network (CNN) as a classifier that labels pixels as focused or defocused. Because labeled training data are unavailable, existing supervised DL models for MFIF add Gaussian blur to focused images to produce training samples, while existing unsupervised DL models are too simple and are designed for fusion tasks other than MFIF. To address these issues, we propose a new MFIF method that learns the feature extraction, fusion, and reconstruction components jointly, yielding a complete unsupervised, end-to-end trainable deep CNN. To enhance the feature extraction capability of the CNN, we introduce a Siamese multi-scale feature extraction module: multi-scale convolutions, together with skip connections, extract more useful common features from a multi-focus image pair. Instead of a basic loss function, our model uses the structural similarity (SSIM) measure as the training loss, and the fused image is reconstructed in a multi-scale manner to guarantee more accurate restoration. The proposed model can process images of variable size during testing and validation. Experimental results on various test images confirm that our method yields fused images of better quality than those generated by the compared state-of-the-art image fusion methods.
© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Pixel-level image fusion aims to combine two or more input images into a single fused image that is more informative for human visual perception than the source images. It plays a significant role in many applications, which makes it an active topic in image processing. It is difficult for an imaging device to capture an image in which all objects are clearly focused: with a typical camera setting, the limited depth of focus of the optical lens makes it very hard to capture a scene containing objects at different depths entirely in focus. Some areas of the image therefore become blurred, and such images are unsuitable for most image processing tasks. Fusing images of the same scene taken with different focus settings into one all-in-focus image is called multi-focus image fusion (MFIF) [1]. Image fusion has been studied for decades, and many algorithms have been developed to merge images for fusion tasks.
Based on the fusion strategy, image fusion methods can be roughly divided into two groups [2]: transform domain methods and spatial domain methods. In earlier image fusion research, transform domain methods were mostly based on multi-scale transforms (MST), including the Laplacian pyramid [3], discrete wavelet decomposition [4], stationary wavelet decomposition [5], the curvelet transform [6], the nonsubsampled contourlet transform (NSCT) [7], and the dual-tree complex wavelet transform [8], among others. MST-based MFIF methods decompose the input images into multi-scale representations (MSRs) and then obtain a fused MSR by combining the MSRs of the different images according to a specific fusion rule [9]. Apart from MST-based methods, feature-domain MFIF methods have also been introduced in recent years, for instance sparse representation [10], independent component analysis [11], and robust principal component analysis [12].

Spatial domain methods can be classified into three groups: block-based, pixel-based, and region-based methods. Block-based methods [13,14] first divide the source images into blocks or regions using some strategy and then blend the images. Region-based methods [15,16] perform fusion on irregularly shaped regions obtained by segmentation; their limitation is that they depend on precise segmentation of the source images. Recently, pixel-based MFIF methods have become popular in image fusion research. These methods extract features from the input images while keeping the spatial consistency of the final fused image [17], and they often apply complex fusion strategies; examples include image matting [18], guided filtering [19], and the dense scale-invariant feature transform [17], all of which achieve promising MFIF results. Most conventional image fusion methods manually design their components, such as the image transform, the activity level measurement, and the fusion rule, to achieve state-of-the-art performance. However, it is not easy to build an ideal design by hand because of implementation difficulty and high computational cost.

Owing to its strong capability in feature extraction and data representation, deep learning (DL) has attracted much attention and driven advances in several image processing and computer vision tasks. To overcome the issues of handcrafted methods, DL-based image fusion approaches have recently been introduced for different fusion applications, for instance digital photography [20–22], multi-modality imaging [23–25], and remote sensing [26–28], where they achieve state-of-the-art results and are gaining popularity in image fusion research. A significant reason for this popularity is the ability of DL models to extract image features automatically, which removes the limitation of manual model design [29]. In addition, the availability of user-friendly DL libraries such as Theano, Keras, and TensorFlow, and of open-source large-scale image datasets such as CIFAR-10 [30], PASCAL VOC [31], and ImageNet, supports research on the image fusion topic.
DL-based multi-focus fusion methods use a convolutional neural network (CNN) as part of the fusion algorithm, and so far CNN-based models treat MFIF as a classification problem. Liu et al. [20] proposed a CNN-based MFIF method that uses a popular image classification dataset as training data: multi-scale Gaussian filtering with different standard deviations is applied to random patches of grayscale images to simulate multi-focus images. Their model classifies pixels as focused or defocused to obtain a focus map of the same size as the input images; binary segmentation of the focus map yields a decision map, and a pixel-wise weighted-average strategy finally produces the fused image. Following Liu et al. [20], Tang et al. [21] introduced a pixel CNN (p-CNN) for multi-focus image fusion. They used the CIFAR-10 [30] dataset as training images and obtained defocused images by automatically adding Gaussian blur to the originals. The model is trained to predict three probabilities (focused, defocused, and unknown pixels) from the information in each pixel's neighborhood; the source images form a score matrix, comparison of the score values yields a decision map, and the fused image is obtained from the final decision map. Du and Gao [32] introduced a segmentation-based MFIF network that uses multi-scale inputs and follows the same architecture as [19]. However, a common limitation of these supervised methods is the unavailability of labeled data for image fusion: they simulate blur on image patches taken from popular classification datasets to create training samples, which makes the training data unrealistic. Furthermore, they only use the result computed by the last layers, which loses useful information captured by the intermediate layers. Prabhakar et al. [22] introduced an unsupervised CNN-based approach for multi-exposure fusion (MEF). They use a simple CNN with two convolutional layers to encode the source images; the encoder generates feature map sequences that are added to obtain fused maps, and the image is finally reconstructed by a three-layer decoding network. Although this method achieves good MEF performance, its feature extraction architecture is very simple and may lose useful features, which makes it unsuitable for multi-focus images.

To overcome these issues, we propose a multi-scale convolutional neural network for fusing multi-focus images. We introduce Siamese multi-scale feature extraction to learn more effective convolutional filters for feature extraction and reconstruction, which results in more accurate fusion of images. Multiple convolutions within a single convolutional layer allow the CNN to learn image details more efficiently and precisely. In addition, we use skip connections between layers to ensure a smooth training process. Our network learns feature extraction, fusion, and reconstruction jointly, resulting in an end-to-end network. To minimize the training loss, we use the structural similarity metric, which requires no ground-truth reference image, as the loss function. To the best of our knowledge, this is the first end-to-end multi-scale CNN for fusing multi-focus image pairs. In brief, our contributions are as follows:

• A new unsupervised multi-scale CNN-based DL model for multi-focus image fusion.
• A Siamese multi-scale feature extraction module that extracts more accurate features from the input images, with skip connections added to ensure smooth training of the network.
• The structural similarity metric, which requires no reference image, is used as the network training loss.
• The network is flexible and can process images of variable size during testing.

2. The proposed method

As mentioned earlier, convolutional neural networks (CNNs) have been successfully applied to various image fusion tasks and achieve state-of-the-art performance. In image fusion, activity level measurement, which can be regarded as feature extraction, is of great significance, and it is essential to extract image features precisely. A CNN offers a good solution for image fusion because feature extraction, fusion, and reconstruction can all be learned by a single model. Our proposed network consists of 11 trainable layers and is divided into three main components: Siamese multi-scale feature extraction, feature fusion, and multi-scale reconstruction. Fig. 1 gives a detailed overview of the proposed network. To improve detailed feature extraction and the precise reconstruction of fused images, we introduce two multi-scale convolutional blocks in the feature extraction module and one in the reconstruction module, with slight modifications of the block used for image super-resolution in [33]. Fig. 2 illustrates the main difference between a basic convolutional layer and a convolutional layer with multi-scale convolutions. The following sections describe the network architecture in detail.

[Fig. 1. The detailed architecture of the proposed method.]
[Fig. 2. Difference between the basic convolutional layer of a CNN and a convolutional layer with multi-scale convolutions.]

2.1. Multi-scale feature extraction

The convolutional layers of a CNN automatically learn filters from the training images to extract local features of the source images. The choice of kernel size is crucial for feature extraction. Small kernels extract short edges and low-frequency content but cannot capture the high-frequency details and other suitable details of the images at the same time, whereas large kernels extract bigger features while missing the low-frequency content. Using the same, or even alternating, filter sizes in every convolutional layer makes the CNN deeper, and the resulting computation slows down training. This inspired us to apply multi-scale convolutions within the same layer of the CNN to capture both low-frequency and high-frequency details of the source images. The proposed multi-scale feature extraction module is a Siamese CNN consisting of two identical subnetworks that share the same weights. The source image pair I_1 and I_2 is fed to the two feature extraction branches. All convolutional layers are followed by a bias and a rectified linear unit (ReLU) for nonlinearity. The first layer of the network is a single-scale convolutional layer that remaps the image into feature space. Each single-scale convolutional layer with its ReLU can be expressed as

F_i = \max(0, W_i * F_{i-1} + b_i),   (1)

where W_i and b_i denote the convolutional kernel and the bias of the i-th layer, respectively, F_i is the output feature map of the i-th convolutional layer, and F_0 denotes the original multi-focus input images. For convenience, '*' denotes the convolution operation throughout this paper.
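As a minimal illustration, Eq. (1) corresponds to a convolution-plus-bias layer followed by ReLU. The sketch below uses the Keras API of TensorFlow (the framework adopted later in Section 2.4); the 3 × 3 kernel and the 16 output channels are assumptions made only for the example, since the channel widths are not stated here.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Eq. (1): F_i = max(0, W_i * F_{i-1} + b_i), i.e. convolution with bias followed
# by ReLU. The 3x3 kernel and 16 output channels are illustrative assumptions.
single_scale = layers.Conv2D(filters=16, kernel_size=3, padding='same',
                             use_bias=True, activation='relu')

features = single_scale(tf.zeros([1, 200, 200, 1]))  # one 200x200 grayscale image
print(features.shape)  # (1, 200, 200, 16)
```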
2.1.1. Multi-scale convolutional block

The multi-scale feature extraction of our proposed CNN consists of two multi-scale convolutional blocks (MSCB). Each MSCB contains a multi-scale convolutional (MS-conv) layer and a flat convolutional layer. The MS-conv layer comprises three simultaneous convolution operations with different kernel sizes; applying multi-scale convolutions within the same layer of the CNN helps the model capture both low-frequency and high-frequency details of the source images. Each MS-conv layer in the network can be expressed as

F_m^{[l]} = \max\big(0, \mathrm{Concat}(f_1^{[l]}, f_2^{[l]}, f_3^{[l]})\big),   (2)

where F_m^{[l]} is the output feature map of the l-th MS-conv layer with depth m = 3, and f_1^{[l]}, f_2^{[l]}, and f_3^{[l]} are the feature maps obtained from the multi-scale convolutions:

f_1^{[l]} = w_1^{[l]} * F^{[l-1]} + b_1^{[l]},
f_2^{[l]} = w_2^{[l]} * F^{[l-1]} + b_2^{[l]},   (3)
f_3^{[l]} = w_3^{[l]} * F^{[l-1]} + b_3^{[l]}.

Three sets of convolutional filters w_m^{[l]} of size 3 × 3, 5 × 5, and 7 × 7 are convolved with the feature map F^{[l-1]}, and b_m^{[l]} is the bias added to each feature map of the l-th MS-conv layer. Each convolution produces a feature map, and the maps are merged by concatenation along the spectral (channel) dimension. The next layer of the MSCB is a 3 × 3 convolutional layer that reduces the spectral dimension of the concatenated feature maps. We do not use pooling layers in the proposed network, because pooling shrinks the image and removes essential details that are needed for image reconstruction later. The Siamese design means both branches apply the same learned mapping, so the feature maps extracted from the two inputs are directly comparable. To facilitate gradient flow during training, a skip connection inspired by [34] is used after every two layers of the network. The complete architecture of the proposed multi-scale convolutional block is illustrated in Fig. 3.

[Fig. 3. Proposed multi-scale convolutional block.]

2.2. Feature fusion

As mentioned earlier, the multi-scale feature extraction module produces the same type of feature maps for each input image. To fuse the corresponding levels of features from the two source images, the extracted features are merged by a concatenation operation, F_M = Concat(f_1^m, f_2^m), where f_1^m and f_2^m denote the feature maps extracted from source images I_1 and I_2, respectively, and F_M denotes the fused feature map (fused representation). This fused representation is then passed to the reconstruction module as input for restoring the fused image.

2.3. Reconstruction

Our multi-scale reconstruction module consists of six trainable convolutional layers interleaved with ReLU layers. To avoid gradient problems during training, the first four convolutional layers are connected via skip connections. The first four layers use 3 × 3 kernels; the subsequent multi-scale convolutional layer performs three convolutions with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, carried out simultaneously, and the resulting feature maps are merged. The last layer is a final 3 × 3 convolutional layer that reconstructs the all-in-focus fused image.
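To make the structure of Sections 2.1–2.3 concrete, the following sketch expresses the Siamese multi-scale feature extraction, concatenation fusion, and reconstruction with the Keras API of TensorFlow. It is a minimal illustration, not the authors' released implementation: the channel widths, the exact number of reconstruction layers, and the placement of the skip connections are assumptions, and only the ingredients named in the text (parallel 3 × 3 / 5 × 5 / 7 × 7 convolutions, channel concatenation, a flat 3 × 3 convolution, weight sharing between branches, no pooling) are taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def ms_conv_block(x, filters=16):
    # MS-conv layer, Eqs. (2)-(3): parallel 3x3, 5x5 and 7x7 convolutions whose
    # outputs are concatenated along the channel (spectral) dimension.
    f1 = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    f2 = layers.Conv2D(filters, 5, padding='same', activation='relu')(x)
    f3 = layers.Conv2D(filters, 7, padding='same', activation='relu')(x)
    cat = layers.Concatenate(axis=-1)([f1, f2, f3])
    # Flat 3x3 convolution that reduces the spectral dimension again.
    out = layers.Conv2D(filters, 3, padding='same', activation='relu')(cat)
    # Skip connection (identity mapping) to ease gradient flow (see Section 2.4).
    return layers.Add()([out, x])

def build_branch(filters=16):
    # One feature-extraction branch: a single-scale remapping layer (Eq. (1))
    # followed by two multi-scale convolutional blocks (Section 2.1.1).
    x_in = layers.Input(shape=(None, None, 1))      # variable-size grayscale input
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x_in)
    x = ms_conv_block(x, filters)
    x = ms_conv_block(x, filters)
    return Model(x_in, x, name='siamese_branch')

def build_fusion_net(filters=16):
    i1 = layers.Input(shape=(None, None, 1))
    i2 = layers.Input(shape=(None, None, 1))
    branch = build_branch(filters)                  # one instance => shared weights
    f1, f2 = branch(i1), branch(i2)
    fused = layers.Concatenate(axis=-1)([f1, f2])   # feature fusion (Section 2.2)
    # Simplified multi-scale reconstruction (Section 2.3): plain 3x3 layers,
    # one multi-scale block, and a final 3x3 layer producing the fused image.
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(fused)
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(y)
    y = ms_conv_block(y, filters)
    out = layers.Conv2D(1, 3, padding='same')(y)
    return Model([i1, i2], out, name='mscnn_fusion')

model = build_fusion_net()
model.summary()
```

Because a single branch instance is applied to both inputs, the two images are processed with identical weights, which is what makes the extracted feature maps comparable before they are concatenated for fusion.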
2.4. Implementation details

For precise reconstruction of the fused image, it is essential to choose a loss function that the network can minimize effectively. Several image quality metrics are widely used in image processing, such as the mean squared error (MSE), the root mean squared error (RMSE), and the peak signal-to-noise ratio (PSNR); they are generally used to compute the loss between an output image and a reference image. However, they do not match human visual perception, because signal error is not the same as the degradation of visual quality perceived by the human visual system (HVS). Moreover, our network is an unsupervised CNN: ground-truth all-in-focus images are not available and are difficult to acquire in practice, so computing the training loss with commonly used reference-based loss functions is not possible. To overcome these difficulties, we use structural similarity [35] as the loss function of our network. The structural similarity (SSIM) index extracts the structural information of images by sliding a window over corresponding positions of the images being compared; the window moves across the image pixel by pixel, and SSIM is computed within each local window. SSIM separates the comparison into largely independent terms for luminance, contrast, and structure. Our goal is to learn a mapping that generates a fused image as close as possible to the desired all-in-focus image. Given a reference image r and a test image t, SSIM [35] is defined as

SSIM(r, t \mid x) = \frac{(2 \bar{x}_r \bar{x}_t + C_1)(2 \sigma_{x_r x_t} + C_2)}{(\bar{x}_r^2 + \bar{x}_t^2 + C_1)(\sigma_{x_r}^2 + \sigma_{x_t}^2 + C_2)},   (4)

where C_1 and C_2 are small constants, x_r is a sliding window in the reference image r, \bar{x}_r is the mean of x_r, and \sigma_{x_r}^2 and \sigma_{x_r x_t} are the variance of x_r and the covariance of x_r and x_t, respectively. To compute SSIM in local windows, Eq. (4) is first evaluated as SSIM(r_1, \hat{t} \mid x) and SSIM(r_2, \hat{t} \mid x). In our method the constants C_1 and C_2 are set to 0.0001 and 0.0009, respectively. The sliding window size is 11 × 11, and it moves one pixel at a time across the image from the top-left to the bottom-right. The SSIM loss [36] on an image patch I_p is then defined as

L_{SSIM}(I_p) = \frac{1}{n} \sum_{\hat{I}_p \in I_p} \big(1 - SSIM(\hat{I}_p)\big),   (5)

where n is the total number of sliding windows; the computed loss is back-propagated to train the network. The above equation can be simplified to

L_{SSIM}(I_p) = 1 - SSIM(I_p),   (6)

evaluated at the center pixel of the patch I_p. The pixel loss is computed as

L_p = \lVert Y - I \rVert_2,   (7)

where Y is the generated all-in-focus image and I is the input image. The final loss combines the structural similarity loss of Eq. (6) and the pixel loss of Eq. (7):

L = L_{SSIM}(p) + L_p.   (8)
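As a rough illustration of Eqs. (5)–(8), the loss can be written with TensorFlow's built-in SSIM. With images scaled to [0, 1], tf.image.ssim's default constants k1 = 0.01 and k2 = 0.03 give C1 = 0.0001 and C2 = 0.0009, matching the values above, and its default window size is 11. Two caveats: evaluating the loss against both source images is our assumption about how the unsupervised setting is handled, and tf.image.ssim uses a Gaussian-weighted window rather than the plain sliding window described in the text.

```python
import tensorflow as tf

def fusion_loss(fused, src1, src2):
    """Sketch of the training loss L = L_SSIM + L_p (Eq. (8)) for a batch of
    fused outputs and the two source images, all scaled to [0, 1]."""
    # SSIM term, Eqs. (5)-(6): 1 - SSIM, averaged over windows and both sources.
    ssim1 = tf.image.ssim(src1, fused, max_val=1.0, filter_size=11, k1=0.01, k2=0.03)
    ssim2 = tf.image.ssim(src2, fused, max_val=1.0, filter_size=11, k1=0.01, k2=0.03)
    l_ssim = 1.0 - 0.5 * tf.reduce_mean(ssim1 + ssim2)
    # Pixel term: mean-squared difference as a stand-in for ||Y - I||_2 in Eq. (7).
    l_pix = 0.5 * (tf.reduce_mean(tf.square(fused - src1)) +
                   tf.reduce_mean(tf.square(fused - src2)))
    return l_ssim + l_pix   # Eq. (8)
```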
Deeper networks have a higher chance of extracting more accurate and abstract features than shallow networks [37,38]. However, training a deep CNN while maintaining gradient flow through its layers so that it converges in affordable time is very challenging: as the number of layers increases, accuracy typically begins to decrease and degradation eventually occurs, and during back-propagation from output to input the gradient of the loss becomes smaller and smaller, causing problems such as vanishing gradients and slow convergence. To deal with these issues, residual learning [34] provides one of the most efficient ways to train deep CNNs with improved accuracy and fast convergence. The authors of [34] bypass a few layers with skip connections (shortcuts), so that the network can directly learn an identity function through those connections. Mathematically, the normal convolutional mapping x^{[n]} = y^{[n]}(x^{[n-1]}) is replaced in residual learning by x^{[n]} = y^{[n]}(x^{[n-1]}) + x^{[n-1]}, so the residual x^{[n]} - x^{[n-1]} becomes the target of prediction; it is much easier for the network to learn the residual image than the original mapping. In our network we use a skip connection with identity mapping to add the feature maps F_i and F_{i+2}.

The parameters of the convolutional layers are initialized randomly, and zero padding of one pixel is applied around the boundaries before each convolution, which keeps the feature maps the same size as the source images. We use the open-source MS COCO dataset [39], which contains more than 80,000 RGB images, for training. All images are resized to 200 × 200, and the learning rate is set to 0.0001. The network is trained with the TensorFlow framework (version 1.5.0) on an Nvidia workstation with two GPUs, an Intel Core i7 CPU, and 256 GB of RAM.

3. Experimental results

To validate the effectiveness of the proposed multi-scale convolutional neural network (MSCNN), this section presents a detailed comparison with state-of-the-art multi-focus image fusion (MFIF) methods. We select five recently proposed methods for comparison: image matting (IM) [18], guided filtering (GF) [19], the multi-scale weighted gradient-based method (MSWG) [40], boundary finding (BF) [41], and the CNN-based method of [20]. Experiments are performed on 30 pairs of multi-focus images: 20 pairs from the publicly available "Lytro" dataset [42] and 10 pairs that have been used extensively in MFIF research.

[Fig. 4. Fused results of different image fusion methods and the proposed method on the "Baby" image set: (a) source image 1, (b) source image 2, (c) IM, (d) GF, (e) MSWG, (f) BF, (g) CNN, (h) proposed.]

3.1. Comparison with other methods

We first compare the MFIF methods based on visual perception, using several examples to exhibit the differences among them. Fig. 4 shows the "Baby" source image pair and the fused results of the different methods and the proposed method. Fig. 4(a) and (b) are the source images, and Fig. 4(c)–(h) are the fused images produced by IM, GF, MSWG, BF, CNN, and our method, respectively. The fused image of our method is free from obvious artifacts, whereas the fused images of the other methods contain artifacts around edges and their background details are less clear than those of the proposed method. Fig. 5 compares the results of our method on the "Clock" image set with those of the methods mentioned above.

[Fig. 5. Detailed fused results on the "Clock" image set: (a) source image 1, (b) source image 2, (c) IM, (d) GF, (e) MSWG, (f) BF, (g) CNN, (h) proposed.]
As the figure shows, the fused image generated by our model provides the best fusion result. As another example, detailed results for the "Horse" images are shown in Fig. 6, which demonstrates that the results of IM, GF, MSWG, BF, and CNN contain blur artifacts in the background region, whereas the result of our method is free from such artifacts. Some of the multi-focus image sets used in our experiments are displayed in Fig. 7, and for a more detailed comparison the fused results of ten multi-focus image pairs are shown in Fig. 8. The figure shows that our model achieves better fusion results than the other methods.

[Fig. 6. The "Horse" source images and fused results of different MFIF methods and the proposed method: (a) source image 1, (b) source image 2, (c) IM, (d) GF, (e) MSWG, (f) BF, (g) CNN, (h) proposed.]
[Fig. 7. Some multi-focus image pairs used in our experiments.]
[Fig. 8. Fusion results of different methods on the source images in Fig. 7; from left to right: IM, GF, MSWG, BF, CNN, and ours.]

3.2. Quantitative evaluation

Objective evaluation plays a significant role in image fusion, and it is not an easy task because an ideal all-in-focus fused image is not always available. Many quantitative metrics have been proposed for evaluating fusion performance, but no single benchmark can completely identify the best method, so it is essential to use several metrics to verify the performance of the proposed method. Liu et al. [43] surveyed image fusion evaluation metrics and classified them into four categories: 1) information theory-based metrics, 2) image feature-based metrics, 3) image structural similarity-based metrics, and 4) human perception-based metrics. In this study we evaluate our results with one metric from each category: normalized mutual information Q_MI [44], phase congruency Q_PC [45], image structural similarity Q_IS [46], and the human perception-based metric Q_HP [47]. For all metrics, higher values indicate better fusion results. The metrics are briefly introduced below.

3.2.1. Information theory-based metric (Q_MI)

Normalized mutual information Q_MI [44] is an information theory-based fusion metric that measures the amount of mutual information between the source images and the fused image. It is computed as

Q_{MI} = 2 \left[ \frac{MI(F, I_1)}{H(F) + H(I_1)} + \frac{MI(F, I_2)}{H(F) + H(I_2)} \right],   (9)

where I_1 and I_2 are the source images, F is their fused image, and H(F), H(I_1), and H(I_2) are the entropies of F, I_1, and I_2, respectively.
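A histogram-based estimate of Q_MI in Eq. (9) can be sketched as follows; the 256-bin histograms and the base-2 logarithm are implementation assumptions, not details fixed by [44].

```python
import numpy as np

def _entropy(hist):
    # Shannon entropy (in bits) of a histogram.
    p = hist.astype(np.float64)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def _mutual_information(a, b, bins=256):
    # MI(a, b) = H(a) + H(b) - H(a, b), estimated from (joint) histograms.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    h_a = _entropy(np.histogram(a, bins=bins)[0])
    h_b = _entropy(np.histogram(b, bins=bins)[0])
    return h_a + h_b - _entropy(joint), h_a, h_b

def q_mi(i1, i2, fused, bins=256):
    # Normalized mutual information Q_MI, Eq. (9).
    mi1, h1, hf = _mutual_information(i1, fused, bins)
    mi2, h2, _ = _mutual_information(i2, fused, bins)
    return 2.0 * (mi1 / (hf + h1) + mi2 / (hf + h2))
```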
3.2.2. Image feature-based metric (Q_PC)

The phase congruency metric Q_PC [45] is an image feature-based metric that compares the phase congruency features of the source images with those of the fused image. It is defined as

Q_{PC} = (P_p)^{\alpha} (P_M)^{\beta} (P_m)^{\gamma},   (10)

where P_p, P_M, and P_m are the phase congruency and the maximum and minimum moments computed between the source images and the fused image. The exponents are set to \alpha = \beta = \gamma = 1.

3.2.3. Image structural similarity-based metric (Q_IS)

The structural similarity index (SSIM) [35] measures the similarity of different images and is widely used to compare them. Yang et al. [46] introduced a modified SSIM to evaluate image fusion performance. Q_IS is a structural similarity-based metric that evaluates how well the structural information of the source images is preserved in the fused image. It is computed as

Q_{IS} = \begin{cases} \lambda(w)\, SSIM(I_1, F \mid w) + (1 - \lambda(w))\, SSIM(I_2, F \mid w), & SSIM(I_1, I_2 \mid w) \ge 0.75, \\ \max\{ SSIM(I_1, F \mid w),\, SSIM(I_2, F \mid w) \}, & SSIM(I_1, I_2 \mid w) < 0.75, \end{cases}   (11)

where SSIM is the structural similarity [35] and \lambda(w) is a local weight computed as

\lambda(w) = \frac{s(I_1 \mid w)}{s(I_1 \mid w) + s(I_2 \mid w)},   (12)

where w is a 7 × 7 local window and s(I_1 \mid w) and s(I_2 \mid w) are the variances of the source images I_1 and I_2 within w.

3.2.4. Human perception-based metric (Q_HP)

The human perception-based metric Q_HP is a contrast-based fusion metric proposed by Chen and Blum [47]. It uses major features of the human visual system model and compares the contrast features of the input images with those of the fused image. A global quality map is first computed as

Q_{GQM}(x, y) = \lambda_{I_1}(x, y)\, Q_{I_1 F}(x, y) + \lambda_{I_2}(x, y)\, Q_{I_2 F}(x, y),   (13)

where I_1 and I_2 are the source images and F is their fused image, Q_{I_1 F} and Q_{I_2 F} represent the contrast information preserved from I_1 and I_2 in F, and \lambda_{I_1} and \lambda_{I_2} are the corresponding saliency maps. Q_HP is then obtained as the average of the global quality map, Q_{HP} = \overline{Q_{GQM}}.

The average scores obtained by our method and the other fusion methods on the five validation multi-focus image pairs are listed in Table 1. The results show that our method achieves the best performance in most cases. Fig. 9 gives a graphical comparison of the structural similarity-based metric Q_IS on ten fused images obtained with IM, GF, MSWG, BF, CNN, and the proposed method.

Table 1. Objective assessment of various fusion methods on five pairs of validation multi-focus source images.

Method     Q_MI     Q_PC     Q_IS     Q_HP
IM         1.1436   0.7931   0.9211   0.7591
GF         1.1187   0.8057   0.9427   0.7618
MSWG       1.1523   0.8136   0.9681   0.7753
BF         1.1728   0.8144   0.9814   0.7862
CNN        1.1714   0.8153   0.9836   0.7925
Proposed   1.1715   0.8182   0.9883   0.7994

[Fig. 9. Graphical comparison of the image structural similarity-based metric values of 10 fused images from the other methods and the proposed method.]
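For completeness, a sketch of the window-based metric of Eqs. (11)–(12) is given below. It reuses scikit-image's windowed SSIM (with a 7 × 7 uniform window) as a stand-in for the per-window SSIM maps of [46] and assumes grayscale float images in [0, 1]; it is meant to illustrate the combination rule, not to reproduce the reference implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from skimage.metrics import structural_similarity

def q_is(i1, i2, fused, win=7):
    # Local SSIM maps between each source and the fused image, and between the
    # two sources themselves (grayscale float images in [0, 1] assumed).
    _, s1f = structural_similarity(i1, fused, win_size=win, data_range=1.0, full=True)
    _, s2f = structural_similarity(i2, fused, win_size=win, data_range=1.0, full=True)
    _, s12 = structural_similarity(i1, i2, win_size=win, data_range=1.0, full=True)

    # Local variances s(I | w) over the same 7x7 window, for lambda(w) in Eq. (12).
    def local_var(x):
        m = uniform_filter(x.astype(np.float64), win)
        return uniform_filter(x.astype(np.float64) ** 2, win) - m ** 2

    v1, v2 = local_var(i1), local_var(i2)
    lam = v1 / (v1 + v2 + 1e-12)

    # Eq. (11): weighted combination where the sources are similar, max otherwise.
    weighted = lam * s1f + (1.0 - lam) * s2f
    fallback = np.maximum(s1f, s2f)
    return float(np.mean(np.where(s12 >= 0.75, weighted, fallback)))
```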
3.3. Comparison of computational efficiency

To evaluate the computational efficiency of the proposed method against the other MFIF methods, Table 2 lists the average running time consumed by each method. The source codes of all compared methods (IM, GF, MSWG, BF, and CNN) are available online and can be downloaded from [48]. The compared methods are implemented in the MATLAB environment, while our method is implemented in TensorFlow. For this experiment we use a computer with an Intel Core i7-5960X CPU, an Nvidia Titan X GPU, and 32 GB of RAM; ten pairs of multi-focus source images of variable size are used for testing. The table shows that the handcrafted methods IM, GF, MSWG, and BF are computationally more efficient, whereas in comparison with the CNN-based method [20] our method has higher computational efficiency. Our CNN-based method requires complex computations at test time but yields an effective end-to-end solution for image fusion.

Table 2. Average running time of the different methods and the proposed method.

Method          IM     GF     MSWG   BF     CNN    Proposed
Time (seconds)  0.83   0.51   1.54   1.98   3.78   3.56

3.4. Application to multi-exposure fusion

Here we discuss the possibility of applying the proposed method to other image fusion applications. When a photograph of a scene containing shadows or highlighted regions is taken, the photographer faces the challenge of choosing a suitable exposure; under different lighting conditions the image may become too bright or too dark. In such cases, multi-exposure fusion (MEF) techniques are applied to fuse images taken with different exposures. The MEF problem is similar to MFIF, except that the source images vary in exposure rather than in focus. To test the generalizability of the CNN, we use the already trained network, without fine-tuning, to fuse multi-exposure images. Fig. 10 shows that our model successfully fuses images with different exposures, which demonstrates that the model is generic and could be used in digital photography applications of image fusion.

[Fig. 10. Multi-exposure fusion results using the proposed method: (a) underexposed image, (b) overexposed image, (c) fused image; (d) underexposed image, (e) overexposed image, (f) fused image.]

4. Conclusion

In this paper, we introduced a new method for multi-focus image fusion (MFIF) based on a multi-scale convolutional neural network (MSCNN). The proposed network learns all of its modules jointly, producing a complete unsupervised, end-to-end trainable deep MSCNN. To the best of our knowledge, this is the first time an MSCNN has been applied to multi-focus image fusion. In the feature extraction module, multi-scale convolutional filters extract more accurate features from the source images. For precise reconstruction of the fused image, the model uses the structural similarity (SSIM) metric, which requires no reference image, to compute the loss, and the fused image is reconstructed in a multi-scale manner to guarantee more accurate restoration. We trained the model on an open-source image dataset and performed extensive qualitative and quantitative evaluations to validate its effectiveness. The proposed CNN model could also be used in other image fusion applications such as multi-exposure fusion and infrared-visible image fusion. In the future, we aim to make the model more robust and generic so that it can fuse more than two images, as well as scenes with moving objects.

Conflict of interest

We hereby confirm that there is no conflict of interest to declare.

Acknowledgment

This research is partly supported by NSFC, China (No. 61876107, 61572315, U1803261) and the 973 Plan, China (No. 2015CB856004).

References

[1] M. Nejati, S. Samavi, S. Shirani, Multi-focus image fusion using dictionary-based sparse representation, Information Fusion 25 (2015) 72–84.
[2] A.A. Goshtasby, S. Nikolov, Image fusion: advances in the state of the art, Information Fusion 8 (2) (2007) 114–118.
[3] P. Burt, E. Adelson, The Laplacian pyramid as a compact image code, IEEE Trans. Commun. 31 (4) (1983) 532–540.
[4] H. Li, B. Manjunath, S. Mitra, Multisensor image fusion using the wavelet transform, Graph. Models Image Process. 57 (3) (1995) 235–245.
[5] S. Li, J.T. Kwok, Y. Wang, Using the discrete wavelet frame transform to merge Landsat TM and SPOT panchromatic images, Information Fusion 3 (1) (2002) 17–23.
[6] F. Nencini, A. Garzelli, S. Baronti, L. Alparone, Remote sensing image fusion using the curvelet transform, Information Fusion 8 (2) (2007) 143–156.
[7] Q. Zhang, B. Guo, Multifocus image fusion using the nonsubsampled contourlet transform, Signal Process. 89 (7) (2009) 1334–1346.
[8] J.J. Lewis, R.J. O'Callaghan, S.G. Nikolov, D.R. Bull, Pixel- and region-based image fusion with complex wavelets, Information Fusion 8 (2) (2007) 119–130.
[9] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: a survey of the state of the art, Information Fusion 33 (2017) 100–112.
[10] B. Yang, S. Li, Multifocus image fusion and restoration with sparse representation, IEEE Trans. Instrum. Meas. 59 (4) (2010) 884–892.
[11] N. Mitianoudis, T. Stathaki, Image fusion schemes using ICA bases, in: Image Fusion: Algorithms and Applications, 2005, pp. 85–118.
[12] T. Wan, C. Zhu, Z. Qin, Multifocus image fusion based on robust principal component analysis, Pattern Recogn. Lett. 34 (9) (2013) 1001–1008.
[13] V. Aslantas, R. Kurban, Fusion of multi-focus images using differential evolution algorithm, Expert Syst. Appl. 37 (12) (2010) 8861–8870.
[14] I. De, B. Chanda, Multi-focus image fusion using a morphology-based focus measure in a quad-tree structure, Information Fusion 14 (2) (2013) 136–146.
[15] M. Li, W. Cai, Z. Tan, A region-based multi-sensor image fusion scheme using pulse-coupled neural network, Pattern Recogn. Lett. 27 (16) (2006) 1948–1956.
[16] S. Li, B. Yang, Multifocus image fusion using region segmentation and spatial frequency, Image Vis. Comput. 26 (2008) 971–979.
[17] Y. Liu, S. Liu, Z. Wang, Multi-focus image fusion with dense SIFT, Information Fusion 23 (2015) 139–155.
[18] S. Li, X. Kang, J. Hu, B. Yang, Image matting for fusion of multi-focus images in dynamic scenes, Information Fusion 14 (2) (2013) 147–162.
[19] S. Li, X. Kang, J. Hu, Image fusion with guided filtering, IEEE Trans. Image Process. 22 (7) (2013) 2864–2875.
[20] Y. Liu, X. Chen, H. Peng, Z. Wang, Multi-focus image fusion with a deep convolutional neural network, Information Fusion 36 (2017) 191–207.
[21] H. Tang, B. Xiao, W. Li, G. Wang, Pixel convolutional neural network for multi-focus image fusion, Inf. Sci. 433 (2018) 125–141.
[22] K.R. Prabhakar, V.S. Srikar, R.V. Babu, DeepFuse: a deep unsupervised approach for exposure fusion with extreme exposure image pairs, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4714–4722.
[23] B. Yang, J. Zhong, Y. Li, Z. Chen, Multi-focus image fusion and super-resolution with convolutional neural network, Int. J. Wavelets Multiresolution Inf. Process. 15 (4) (2017) 1750037.
[24] Y. Liu, X. Chen, J. Cheng, H. Peng, A medical image fusion method based on convolutional neural networks, in: 20th International Conference on Information Fusion (Fusion), IEEE, 2017, pp. 1–7.
[25] Y. Liu, X. Chen, R.K. Ward, Z.J. Wang, Image fusion with convolutional sparse representation, IEEE Signal Process. Lett. 23 (12) (2016) 1882–1886.
[26] G. Masi, D. Cozzolino, L. Verdoliva, G. Scarpa, Pansharpening by convolutional neural networks, Remote Sens. 8 (7) (2016) 594.
[27] A. Azarang, H. Ghassemian, A new pansharpening method using multi resolution analysis framework and deep neural networks, in: 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA), IEEE, 2017, pp. 1–6.
[28] F. Palsson, J.R. Sveinsson, M.O. Ulfarsson, Multispectral and hyperspectral image fusion using a 3-D-convolutional neural network, IEEE Geosci. Remote Sens. Lett. 14 (5) (2017) 639–643.
[29] Y. Liu, X. Chen, Z. Wang, Z.J. Wang, R.K. Ward, X. Wang, Deep learning for pixel-level image fusion: recent advances and future prospects, Information Fusion 42 (2018) 158–173.
[30] A. Krizhevsky, V. Nair, G. Hinton, The CIFAR-10 dataset, online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
[31] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[32] C. Du, S. Gao, Image segmentation-based multi-focus image fusion through multi-scale convolutional neural network, IEEE Access 5 (2017) 15750–15761.
[33] Y. Wang, L. Wang, H. Wang, P. Li, End-to-end image super-resolution via deep and shallow convolutional networks, arXiv preprint arXiv:1607.07680, 2016.
[34] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[35] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[36] H. Zhao, O. Gallo, I. Frosio, J. Kautz, Loss functions for image restoration with neural networks, IEEE Trans. Comput. Imaging 3 (1) (2017) 47–57.
[37] D. Erhan, Y. Bengio, A. Courville, P. Vincent, Visualizing higher-layer features of a deep network, Technical Report 1341, University of Montreal, 2009.
[38] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, H. Lipson, Understanding neural networks through deep visualization, arXiv preprint arXiv:1506.06579, 2015.
[39] T.-Y. Lin, et al., Microsoft COCO: common objects in context, in: European Conference on Computer Vision, Springer, Cham, 2014, pp. 740–755.
[40] Z. Zhou, S. Li, B. Wang, Multi-scale weighted gradient-based fusion for multi-focus images, Information Fusion 20 (2014) 60–72.
[41] Y. Zhang, X. Bai, T. Wang, Boundary finding based multi-focus image fusion through multi-scale morphological focus-measure, Information Fusion 35 (2017) 81–101.
[42] Lytro multi-focus dataset, http://mansournejati.ece.iut.ac.ir/content/lytro-multi-focus-dataset.
[43] Z. Liu, E. Blasch, Z. Xue, J. Zhao, R. Laganiere, W. Wu, Objective assessment of multiresolution image fusion algorithms for context enhancement in night vision: a comparative study, IEEE Trans. Pattern Anal. Mach. Intell. 34 (1) (2012) 94–109.
[44] M. Hossny, S. Nahavandi, D. Creighton, A. Bhatti, Image fusion performance metric based on mutual information and entropy driven quadtree decomposition, Electron. Lett. 46 (18) (2010) 1266–1268.
[45] J. Zhao, R. Laganiere, Z. Liu, Performance assessment of combinative pixel-level image fusion based on an absolute feature measurement, Int. J. Innov. Comput. Inf. Control 3 (6) (2007) 1433–1447.
[46] C. Yang, J.-Q. Zhang, X.-R. Wang, X. Liu, A novel similarity-based quality metric for image fusion, Information Fusion 9 (2) (2008) 156–160.
[47] Y. Chen, R.S. Blum, A new automated quality assessment algorithm for image fusion, Image Vis. Comput. 27 (10) (2009) 1421–1432.
[48] Image fusion source code collection, https://github.com/budaoxiaowanzi/image-fusion.