Explainable Computer Vision: A Critical Literature Review of Methods and Applications

Pham Thi Minh Anh
Student Number: 220880356
ec22313@qmul.ac.uk

I. INTRODUCTION

Recent research in deep learning methodology has produced a variety of complex modeling techniques in computer vision (CV) that reach or even outperform human performance. Although these black-box deep learning models have obtained astounding results, they are limited in their interpretability and transparency, which are critical for taking learning machines to the next step and including them in sensitive decision-support systems involving human supervision. Hence, the development of explainable techniques for computer vision (XCV) has recently attracted increasing attention. This work aims to 1) provide a thorough overview of the current state of this emerging field and explain its theoretical methods; 2) compare the approaches using several evaluation criteria; and 3) problematize challenges and avenues for future research.

II. PROBLEM DEFINITION

As deep learning-based methods are increasingly used in visual perception tasks, high prediction accuracy alone may not be sufficient in practice. For instance, in healthcare, models that can be interpreted and verified by medical experts are an absolute necessity. Similarly, in self-driving cars, a single incorrect prediction can be very costly. Applying explanation techniques to these nested, black-box models is a prerequisite for meeting four critical needs: 1) verification of the system [3]; 2) improvement of the system [6]; 3) transfer of the system's learned knowledge to humans; and 4) compliance with legislation. These factors demonstrate that knowledge of present XCV methods, understanding the differences between their approaches, and applying the appropriate explanatory method to different types of models are crucial and will play a pivotal role in developing future CV systems.

III. KEY WORDS

Computer vision (CV); black-box models; deep learning; explainable artificial intelligence (XAI); explainable computer vision (XCV); interpretability; transparency; neural networks.

IV. PRACTICAL XCV METHODS

This review focuses on three major explanation techniques: occlusion analysis, gradient-based techniques, and layer-wise relevance propagation (LRP).

A. Occlusion Analysis

Occlusion analysis [7] is a particular type of perturbation analysis in which we repeatedly test the effect on the neural network output of occluding patches or individual features in the input image, as in:

R_i = f(x) − f(x ⊙ (1 − m_i))    (1)

where m_i is an indicator vector for the removed patch or feature. A heatmap (R_i)_i can be built from these scores, highlighting the locations where occlusion causes the strongest decrease of the function. This post-hoc method is suitable for non-differentiable black-box models and even for functions that are locally flat, with no or only a very small gradient.

B. Gradient-Based Techniques

Integrated gradients (IG) [5] explain a prediction by integrating the gradient ∇f(x) along a trajectory in input space connecting a root point x̃ to the data point x. Integrated gradients quantify the importance of each input variable i (e.g., image pixel) as:

R_i(x) = (x_i - \tilde{x}_i) \cdot \int_0^1 \big[ \nabla f(\tilde{x} + t \cdot (x - \tilde{x})) \big]_i \, dt    (2)

This measure assumes that the most relevant input features are those to which the output is most sensitive. Because gradient computation is a core, highly optimized operation in deep learning frameworks, the method can be implemented very efficiently.
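As a concrete illustration of Eqs. (1) and (2), the following minimal sketch, assuming a PyTorch classifier f that maps an image batch to class logits, computes an occlusion heatmap and an integrated-gradients attribution for a single image. The function names, the patch size, and the zero baseline used as the "removed" value are illustrative assumptions, not choices prescribed by the reviewed works.

import torch

def occlusion_heatmap(f, x, target, patch=16, stride=16, baseline=0.0):
    # Eq. (1): R_i = f(x) - f(x with patch i occluded), one score per patch.
    # x: input image of shape (1, C, H, W); f returns class logits of shape (1, K).
    _, _, H, W = x.shape
    with torch.no_grad():
        full_score = f(x)[0, target].item()
        heat = torch.zeros(H // stride, W // stride)
        for i, top in enumerate(range(0, H - patch + 1, stride)):
            for j, left in enumerate(range(0, W - patch + 1, stride)):
                x_occ = x.clone()
                x_occ[:, :, top:top + patch, left:left + patch] = baseline  # occlude one patch
                heat[i, j] = full_score - f(x_occ)[0, target].item()        # drop in the target score
    return heat

def integrated_gradients(f, x, target, x_root=None, steps=50):
    # Eq. (2): (x - x_root) times the path integral of the gradient, here a Riemann sum.
    x_root = torch.zeros_like(x) if x_root is None else x_root
    total_grad = torch.zeros_like(x)
    for t in torch.linspace(0.0, 1.0, steps):
        x_t = (x_root + t * (x - x_root)).detach().requires_grad_(True)  # point on the straight path
        f(x_t)[0, target].backward()                                      # gradient of the target logit
        total_grad += x_t.grad
    return (x - x_root) * total_grad / steps

Note that the occlusion loop issues one forward pass per patch, whereas integrated gradients issue one backward pass per integration step; this difference reappears in the runtime comparison of Section VI.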
C. Layer-Wise Relevance Propagation

The LRP method [1] redistributes the prediction f(x) backward through the network using local redistribution rules until a relevance score R_i is assigned to each input variable (e.g., each image pixel). The relevance score R_i quantifies how much this variable has contributed to the prediction. To describe the LRP redistribution process for feed-forward neural networks, let x_j be the neuron activations at layer l, R_k the relevance scores associated with the neurons at layer l + 1, and w_jk the weight connecting neuron j to neuron k. The simple LRP rule redistributes relevance from layer l + 1 to layer l as:

R_j = \sum_k \frac{x_j w_{jk}}{\epsilon + \sum_j x_j w_{jk}} R_k    (3)

where a small stabilization term ϵ is added to prevent division by zero. For different layers of networks, various suitable propagation rules have been proposed.
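To make the redistribution rule of Eq. (3) concrete, the following minimal sketch, assuming plain NumPy arrays and a single fully connected layer, propagates relevance from layer l + 1 back to layer l. The function name and the sign-aware stabilizer are illustrative choices, not code from [1].

import numpy as np

def lrp_epsilon(x, W, R_upper, eps=1e-6):
    # x:       activations x_j at layer l,          shape (J,)
    # W:       weights w_jk from layer l to l+1,    shape (J, K)
    # R_upper: relevance scores R_k at layer l+1,   shape (K,)
    z = x @ W                                    # z_k = sum_j x_j * w_jk
    z = z + eps * np.where(z >= 0, 1.0, -1.0)    # stabilizer keeps the denominator away from zero
    s = R_upper / z                              # relevance per unit of pre-activation
    return x * (W @ s)                           # R_j = x_j * sum_k w_jk * R_k / (z_k + eps), cf. Eq. (3)

Applying such a rule layer by layer, from the output relevance R = f(x) down to the input, yields the pixel-wise LRP heatmap; as noted above, different layer types call for different propagation rules.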
V. EVALUATION CRITERIA

In the experiments, we use the VGG-16 and ResNet-50 deep neural networks to classify images from ImageNet. Different explanation methods lead to explanations of different quality, which we evaluate with respect to 1) faithfulness (the decrease of the output score under pixel-flipping), 2) human-interpretability (the explanation file size in bytes), and 3) practical applicability to a computer vision model (explanations per second).

VI. DISCUSSION

A. Comparing Explanation Methods

1) Faithfulness: A practical technique to assess how faithfully an explanation reflects the model's decision structure is "pixel-flipping" [4], which runs from the most to the least relevant input features, iteratively removing them and monitoring the evolution of the neural network output. Applying pixel-flipping to the three XCV methods in Fig. 1, we observe that removing relevant features quickly destroys class evidence for all explanation approaches, as compared to a random explanation baseline. LRP performs better on VGG-16 than on ResNet-50. This can be explained by VGG-16 having a more explicit structure (pooling operations in VGG-16 versus strided convolutions in ResNet-50). The second observation in Fig. 1 is that integrated gradients have by far the highest decay rate initially but stagnate in the later phase, because IG focuses on the pixels to which the network is most sensitive, without being able to identify the relevant pattern in the image comprehensively. We note that occlusion and LRP do not run into such adversarial problems. For these methods, pixel-flipping steadily and comprehensively removes features until the class evidence has totally disappeared.

Fig. 1. Pixel-flipping experiment for testing the faithfulness of the explanations.

2) Human-Interpretability: For image classification, interpretability can be quantified in terms of the amount of information contained in the heatmap. Table I shows the average explanation file sizes associated with the three explanation techniques on the two models. We observe that occlusion produces the lowest file size and is, hence, the most "interpretable", because it only presents to the user rough localization information without delving into which specific feature has supported the decision, as done, e.g., by LRP. With integrated gradients, every single pixel carries information, which is clearly overwhelming to a human observer.

TABLE I. AVERAGE EXPLANATION FILE SIZES (IN BYTES)

            Occlusion   Integrated Gradients   LRP
VGG-16      698.4       5795.0                 1828.3
ResNet-50   693.6       5978.0                 2928.2

3) Applicability and Runtime: Occlusion-based explanations are the easiest to implement. They can be obtained for any neural network, even one that is not differentiable, or a network for which we do not have the source code and can only access predictions through some online server. Occlusion can, therefore, be used to understand the predictions of third-party models. Integrated gradients require access to the neural network gradient. Given that most ML models are differentiable, this method is also highly applicable to complex neural networks such as ResNets or SqueezeNets. LRP has stronger requirements: to redistribute the prediction, it needs explicit access to the different layers of a neural network with a canonical sequence of layers, for instance, an alternation of convolution, ReLU, and pooling layers.

A runtime comparison of the three explanation methods is given in Table II. Since each occluded patch requires a new evaluation of the function, occlusion is the slowest technique. The number of occlusion evaluations for image data grows quadratically with the inverse of the step size, making it computationally impractical to generate high-resolution explanations. Integrated gradients inherit pixel-wise resolution from the gradient computation, which is O(1), but require multiple iterations for the integration. LRP is the fastest approach because it can exploit the network structure, using the network weights and the neuron activations created by the forward pass to propagate the output back through the network up to the input layer. This makes LRP especially practical for large-scale analysis.

TABLE II. RUNTIME COMPARISON (IN EXPLANATIONS PER SECOND)

            Occlusion   Integrated Gradients   LRP
VGG-16      2.4         5.8                    204.1
ResNet-50   4.0         8.7                    188.7
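To complement the comparison above, the following minimal sketch, assuming a PyTorch classifier and a precomputed per-pixel relevance map, implements the kind of pixel-flipping protocol [4] used in the faithfulness experiment. The step count and the zero value used for "removed" pixels are illustrative assumptions.

import torch

def pixel_flipping_curve(f, x, relevance, target, steps=100, baseline=0.0):
    # x: image (1, C, H, W); relevance: per-pixel scores (H, W); f returns class logits.
    H, W = relevance.shape
    order = torch.argsort(relevance.flatten(), descending=True)   # most relevant pixels first
    per_step = max(1, order.numel() // steps)
    x_cur = x.clone()
    scores = []
    with torch.no_grad():
        for s in range(steps):
            idx = order[s * per_step:(s + 1) * per_step]
            x_cur[:, :, idx // W, idx % W] = baseline              # remove the next batch of pixels
            scores.append(f(x_cur)[0, target].item())              # track the target class score
    return scores  # a faithful explanation makes this curve drop quickly

Plotting these scores against the fraction of removed pixels for each explanation method, together with a random-order baseline, produces the kind of comparison shown in Fig. 1.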
B. Challenges and Outlooks

While XCV has made astounding conceptual and technical advances recently, foundational theoretical work in XCV has so far been limited [2]. It remains unclear how to weigh the model and the data distribution in the explanation, in particular, whether an explanation should be based on all attributes the model locally reacts to or only on those that are expressed locally. These significant problems will be clarified with the help of a deeper formalization and theoretical understanding of XCV. Further challenges arise because XAI software must keep up with highly predictive models that are becoming ever more complex in terms of the number of parameters and of model structure.

VII. CONCLUSION

Rapid advances in XCV have made almost any of these complex computer vision models interpretable to the user. Consequently, predictive power does not have to be sacrificed in favor of interpretability, and strong nonlinear ML models can be fully utilized in practical applications. In this review, we have attempted to critically evaluate XCV methods such as occlusion, integrated gradients, and LRP. Occlusion analysis is easy to implement and suitable for small datasets and non-differentiable models. Integrated gradients are widely applicable to most complex neural networks; however, they do not explain feature interactions and combinations. LRP is especially practical for large-scale analysis of neural networks with a canonical sequence of layers. While XCV has seen exponential growth, there remain many open problems and challenges, with ample opportunities to contribute. In conclusion, we strongly believe that XCV will ultimately become a crucial practical component for obtaining transparent and impartial learning models.

REFERENCES

[1] S. Bach, A. Binder, G. Montavon, F. Klauschen, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS ONE, vol. 10, no. 7, Jul. 2015, Art. no. e0130140.
[2] S. M. Lundberg and S. Lee, "A unified approach to interpreting model predictions," in Proc. Adv. Neural Inf. Process. Syst. 30, 2017, pp. 4765–4774.
[3] A. M. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 427–436.
[4] W. Samek, A. Binder, G. Montavon, and K.-R. Müller, "Evaluating the visualization of what a deep neural network has learned," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 11, pp. 2660–2673, Nov. 2017.
[5] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 3319–3328.
[6] B. Ustun, A. Spangher, and Y. Liu, "Actionable recourse in linear classification," in Proc. Conf. Fairness, Accountability, Transparency, 2019, pp. 10–19.
[7] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 818–833.