Explainable Computer Vision: A Critical Literature
Review of Methods and Applications
Pham Thi Minh Anh
Student Number: 220880356
ec22313@qmul.ac.uk
I. INTRODUCTION
Recent research in deep learning methodology has led to a
variety of complex modeling techniques in computer vision
(CV) that reach or even outperform human performance.
Although these black-box deep learning models have obtained
astounding results, they are limited in their interpretability and
transparency, which are critical for taking learning machines to
the next step and including them in sensitive decision-support
systems involving human supervision. Hence, the development
of explainable techniques for computer vision (XCV) has
recently attracted increasing attention. This work aims to 1)
provide a thorough overview of the current state of this
emerging field and explain its theoretical methods; 2) compare
approaches using several evaluation criteria; and 3) identify
challenges and avenues for future research.
II. PROBLEM DEFINITION
As deep learning-based methods are increasingly used in
visual perception tasks, high prediction accuracy alone may not
be sufficient in practice. For instance, in healthcare, models
that can be interpreted and verified by medical experts are an
absolute necessity; the same holds for self-driving cars, where
a single incorrect prediction can be very costly. Applying
explainable techniques to these nested, black-box models is a
prerequisite for providing such guarantees and addresses four
critical needs: 1) verification of the system [3]; 2) improvement
of the system [6]; 3) the ability to transfer a system's learned
knowledge to humans; and 4) legislative compliance of the
systems. These factors demonstrate that knowledge of the
present XCV methods, understanding the difference between
their approaches, and applying the appropriate explanatory
method to different types of models are crucial and will play a
pivotal role in developing future CV systems.
III. KEYWORDS
Computer vision (CV); black-box models; deep learning;
explainable artificial intelligence (XAI); explainable computer
vision (XCV); interpretability; transparency; neural networks.
IV. PRACTICAL XCV METHODS
This review focuses on three major explanation techniques:
occlusion analysis, gradient-based techniques, and layer-wise
relevance propagation (LRP).
A. Occlusion Analysis
Occlusion analysis [7] is a particular type of perturbation
analysis in which we repeatedly test the effect that occluding
patches or individual features of the input image has on the
neural network output, as in:

$R_i = f(x) - f(x \odot (1 - m_i))$    (1)
where mi is an indicator vector for the removed patch or
feature. A heatmap (Ri)i can be built from these scores
highlighting locations where the occlusion has caused the
strongest decrease of the function. This post-hoc method is
suitable for non-differentiable black-box models and even for
functions that are locally flat, with no or only a very small
gradient.
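For concreteness, the following is a minimal sketch of Eq. (1) in Python, assuming a PyTorch image classifier `model`, a preprocessed input tensor `x` of shape (1, 3, H, W), and a target class index `target`; the patch size, stride, and zero baseline are illustrative choices rather than values prescribed by [7].

import torch

def occlusion_heatmap(model, x, target, patch=16, stride=16, baseline=0.0):
    """Relevance R_i = f(x) - f(x with patch i occluded), cf. Eq. (1)."""
    model.eval()
    with torch.no_grad():
        f_x = model(x)[0, target].item()              # unperturbed score f(x)
        _, _, H, W = x.shape
        tops = list(range(0, H - patch + 1, stride))
        lefts = list(range(0, W - patch + 1, stride))
        heatmap = torch.zeros(len(tops), len(lefts))
        for i, top in enumerate(tops):
            for j, left in enumerate(lefts):
                x_occ = x.clone()
                x_occ[:, :, top:top + patch, left:left + patch] = baseline
                heatmap[i, j] = f_x - model(x_occ)[0, target].item()
    return heatmap  # large values mark patches whose occlusion hurts the prediction most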
B. Gradient-based techniques
Integrated gradients (IG) [5] explain a prediction by integrating
the gradient ∇f(x) along a trajectory in input space connecting
some root point x̃ to the data point x. Integrated gradients
quantify the importance of each input variable i (e.g., an image
pixel) as:

$R_i(x) = (x_i - \tilde{x}_i) \cdot \int_0^1 \left[ \nabla f\big(\tilde{x} + t \cdot (x - \tilde{x})\big) \right]_i \, dt$    (2)
This measure assumes that the most relevant input features are
those to which the output is most sensitive. Since gradient
computations are fundamental to deep learning, highly efficient
implementations of this method are readily available.
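A minimal sketch of Eq. (2) is shown below, approximating the integral with a Riemann sum; it assumes a differentiable PyTorch classifier `model`, an input tensor `x` of shape (1, 3, H, W), a target class index `target`, and a black image as the root point x̃, all of which are illustrative assumptions rather than requirements of [5].

import torch

def integrated_gradients(model, x, target, x_root=None, steps=50):
    """Approximate Eq. (2) with a Riemann sum over `steps` interpolation points."""
    model.eval()
    if x_root is None:
        x_root = torch.zeros_like(x)                   # black image as root point
    grad_sum = torch.zeros_like(x)
    for t in torch.linspace(0.0, 1.0, steps):
        x_t = (x_root + t * (x - x_root)).detach().requires_grad_(True)
        model(x_t)[0, target].backward()               # gradient of f at the interpolation point
        grad_sum += x_t.grad
    return (x - x_root) * grad_sum / steps             # R_i(x) for every input variable

A batched implementation that evaluates all interpolation points in a single forward/backward pass is considerably faster; the loop above favors clarity.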
C. Layer-wise relevance propagation
The LRP method [1] redistributes the prediction f (x)
backward using local redistribution rules until it assigns a
relevance score Ri to each input variable (e.g., image pixel).
The relevance score Ri of each input variable determines how
much that variable has contributed to the prediction. In the LRP
redistribution process for feed-forward neural networks, let xj
be the neuron activations at layer l, Rk be the relevance scores
associated with the neurons at layer l + 1 and wjk be the weight
connecting neuron j to neuron k. The simple LRP rule
redistributes relevance from layer l + 1 to layer l as:
$R_j = \sum_k \frac{x_j w_{jk}}{\sum_j x_j w_{jk} + \epsilon} \, R_k$    (3)
where a small stabilization term 𝜖 is added to prevent division
by zero. For different layers of networks, various suitable
propagation rules have been proposed.
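The following is a minimal sketch of the rule in Eq. (3) for a single fully connected layer, assuming activations `a` (the xj at layer l), a weight matrix `W` of shape (n_in, n_out), and the relevance vector `R_next` (the Rk at layer l + 1); a full LRP implementation would apply a rule of this kind to every layer, from the output back to the input.

import torch

def lrp_simple_rule(a, W, R_next, eps=1e-6):
    """One step of Eq. (3): redistribute relevance from layer l + 1 to layer l.

    a      : activations x_j at layer l, shape (n_in,)
    W      : weights w_jk connecting layer l to layer l + 1, shape (n_in, n_out)
    R_next : relevance scores R_k at layer l + 1, shape (n_out,)
    """
    z = a @ W + eps        # stabilized denominator: sum_j x_j w_jk + eps, one value per neuron k
    s = R_next / z         # relevance "messages" per upper-layer neuron k
    return a * (W @ s)     # R_j = x_j * sum_k w_jk R_k / z_k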
V. EVALUATION CRITERIA
In the experiments, we use the VGG-16 and ResNet-50 deep
neural networks to classify images from ImageNet. Different
explanation methods lead to explanations of different quality,
which we evaluate in terms of 1) faithfulness (the decrease of
the output score under pixel-flipping), 2) human interpretability
(the explanation file size in bytes), and 3) the practical
applicability to a computer vision model (explanations per
second).
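As an illustration of this setup, the snippet below loads the two pretrained classifiers from torchvision (the API of recent torchvision versions is assumed) and prepares a single image for explanation; the image path is a placeholder and the preprocessing constants are the standard ImageNet statistics.

import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained ImageNet classifiers used in the comparison
vgg16 = models.vgg16(weights="IMAGENET1K_V1").eval()
resnet50 = models.resnet50(weights="IMAGENET1K_V1").eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")    # placeholder image path
x = preprocess(img).unsqueeze(0)                  # shape (1, 3, 224, 224)

with torch.no_grad():
    target = vgg16(x).argmax(dim=1).item()        # predicted class to be explained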
VI. DISCUSSION
A. COMPARING EXPLANATION METHODS
1) Faithfulness
A practical technique to assess the reliability of the
model's decision structure is “pixel-flipping” [4] which runs
from the most to the least relevant input features, iteratively
removing them and monitoring the evolution of the neural
network output. Applying pixel-flipping to the three XCV
methods in Fig. 1, we observe that removing relevant
features quickly destroys class evidence for all explanation
approaches, as compared to a random explanation baseline.
LRP performs better on VGG-16 than on ResNet-50, which can
be explained by VGG-16 having a more explicit structure
(pooling operations in VGG-16 versus strided convolutions in
ResNet-50). The second observation in Fig. 1 is that integrated
gradients have by far the highest decay rate initially but
stagnate in the later phase, because IG focuses on the pixels to
which the network is most sensitive, without being able to fully
and comprehensively identify the relevant pattern in the image.
We note that occlusion and LRP do not run into such
adversarial problems; for these methods, pixel-flipping steadily
and comprehensively removes features until the class evidence
has totally disappeared.

Fig. 1. Pixel-flipping experiment for testing the faithfulness of the explanations.
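A minimal sketch of the pixel-flipping procedure is given below, assuming a PyTorch classifier `model`, an input tensor `x` of shape (1, C, H, W), a target class index `target`, and a pixel-level relevance map tensor `R` of shape (H, W) produced by any of the three methods; setting flipped pixels to zero is one of several possible removal strategies, not necessarily the one used in [4].

import torch

def pixel_flipping_curve(model, x, target, R, n_steps=100):
    """Remove pixels from most to least relevant and record the target score after each step."""
    model.eval()
    order = torch.argsort(R.flatten(), descending=True)       # most relevant pixels first
    x_flip = x.clone()
    flat = x_flip.view(x_flip.shape[0], x_flip.shape[1], -1)  # (1, C, H*W) view sharing storage
    per_step = max(1, order.numel() // n_steps)
    scores = []
    with torch.no_grad():
        for step in range(n_steps):
            idx = order[step * per_step:(step + 1) * per_step]
            flat[:, :, idx] = 0.0                              # "flip" the next batch of pixels
            scores.append(model(x_flip)[0, target].item())
    return scores  # a faster decay indicates a more faithful explanation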
2) Human-interpretability
For image classification, interpretability can be
quantified in terms of the amount of information contained in
the heatmap. Table I shows the average file sizes associated
with the three explanation techniques on the two models.
We observe that occlusion produces the lowest file size and
is, hence, the most "interpretable": it presents to the user only
rough localization information, without delving into which
specific features have supported the decision as done, e.g., by
LRP. In integrated gradients explanations, every single pixel
carries information, which is clearly overwhelming for a
human observer.

TABLE I. AVERAGE EXPLANATION FILE SIZES (IN BYTES)

            Occlusion   Integrated Gradients   LRP
VGG-16      698.4       5795.0                 1828.3
ResNet-50   693.6       5978.0                 2928.2
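As a sketch of how the file sizes in Table I can be measured, the snippet below renders a relevance map as an 8-bit grayscale PNG (a lossless format) and reads back its size in bytes; the min-max normalization and the output path are illustrative assumptions.

import os
import numpy as np
from PIL import Image

def heatmap_file_size(R, path="heatmap.png"):
    """Save relevance map R of shape (H, W) as a grayscale PNG and return its size in bytes."""
    R = np.asarray(R, dtype=np.float64)
    R = (R - R.min()) / (R.max() - R.min() + 1e-12)         # min-max normalize to [0, 1]
    Image.fromarray((R * 255).astype(np.uint8)).save(path)  # PNG compression is lossless
    return os.path.getsize(path)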
3) Applicability and Runtime
Occlusion-based explanations are the easiest to
implement. These explanations can be obtained for any
neural network, even those that are not differentiable or
networks for which we do not have the source code and
where we can only access their prediction through some
online server. Occlusion can, therefore, be used to
understand the predictions of third-party models. Integrated
gradients require access to the neural network gradient.
Given that most ML models are differentiable, this method is
highly applicable also to complex neural networks, such as
ResNets or SqueezeNets. With stronger requirements to
redistribute the prediction, LRP requires explicitly accessing
the different layers of a neural network with a canonical
sequence of layers, for instance, an alternation of kernel
methods, convolution, ReLU layers, and pooling layers.
A runtime comparison of the three explanation methods is
given in Table II. Since each occluded patch requires a new
evaluation of the function, occlusion is the slowest technique;
its runtime on image data grows quadratically with the
resolution of the explanation, making it computationally
impractical to generate high-resolution explanations.
Integrated gradients inherit pixelwise resolution from the
gradient computation, which is O(1), but require multiple
iterations for the integration. LRP is the fastest approach
because it can take advantage of the network structure, using
the network weights and the neural activations created by the
forward pass to propagate the output back through the network
up to the input layer. This makes LRP especially practical for
large-scale analysis.

TABLE II. RUNTIME COMPARISON (IN EXPLANATIONS PER SECOND)

            Occlusion   Integrated Gradients   LRP
VGG-16      2.4         5.8                    204.1
ResNet-50   4.0         8.7                    188.7
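The throughput figures in Table II can be measured with a simple timing loop such as the sketch below, which assumes an `explain(model, x, target)` callable wrapping any of the three methods; absolute numbers naturally depend on hardware, batch handling, and implementation details.

import time

def explanations_per_second(explain, model, x, target, n_runs=20):
    """Average throughput of an explanation method, in explanations per second."""
    explain(model, x, target)                  # warm-up run (lazy initialization, caching)
    start = time.perf_counter()
    for _ in range(n_runs):
        explain(model, x, target)
    elapsed = time.perf_counter() - start
    return n_runs / elapsed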
B. CHALLENGES AND OUTLOOKS
While XCV has made astounding conceptual and
technical advancements recently, foundational theoretical
work in XCV has so far been limited [2]. It remains unclear
how to weigh the model and the data distribution into the
explanation, in particular, whether an explanation should be
based on all attributes the model locally reacts to or simply
those that are expressed locally. These significant problems
will be clarified with the help of a deeper formalization and
theoretical understanding of XCV.
Further challenges arise as XAI software must keep up with
highly predictive models, which are becoming ever more
complex in terms of their number of parameters and their
structure.
VII. CONCLUSION
Rapid advancements in XCV have made almost any complex
computer vision model interpretable to the user. Consequently,
predictive power no longer needs to be sacrificed in favor of
interpretability, and strong nonlinear ML models can be fully
utilized in practical applications. In this review, we have
attempted to critically evaluate XCV methods such as
occlusion, integrated gradients, and LRP. Occlusion analysis is
easy to implement and suitable for small-scale data and
non-differentiable models. Integrated gradients are widely
applicable to most complex neural networks; however, they do
not explain feature interactions and combinations. LRP is
especially practical for large-scale analysis of neural networks
with a canonical sequence of layers. While XCV has seen
exponential growth, there are many open problems and
challenges with ample opportunities to contribute. In
conclusion, we strongly believe that XCV will ultimately
become a crucial practical component for obtaining transparent
and impartial learning models.
REFERENCES
[1] S. Bach, A. Binder, G. Montavon, F. Klauschen, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS ONE, vol. 10, Jul. 2015, Art. no. e0130140.
[2] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proc. Adv. Neural Inf. Process. Syst. 30, 2017, pp. 4765–4774.
[3] A. M. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 427–436.
[4] W. Samek, A. Binder, G. Montavon, and K.-R. Müller, "Evaluating the visualization of what a deep neural network has learned," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 11, pp. 2660–2673, Nov. 2017.
[5] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 3319–3328.
[6] B. Ustun, A. Spangher, and Y. Liu, "Actionable recourse in linear classification," in Proc. Conf. Fairness, Accountability, Transparency, 2019, pp. 10–19.
[7] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 818–833.