
ICECE 2022 Draft

Visual Robustness Analysis in Visual Question Answering
Ishmam Tashdeed
Md Farhan Ishmam
Asir Sadaat Nipun
Computer Science and Engineering
Islamic University of Technology
Gazipur, Bangladesh
Computer Science and Engineering
Islamic University of Technology
Gazipur, Bangladesh
Computer Science and Engineering
Islamic University of Technology
Gazipur, Bangladesh
Abstract—The domain of Visual Question Answering (VQA)
focuses on producing an answer from a combination of
visual and textual inputs. As VQA combines the modalities of both
vision and language, it is susceptible to adversarial disturbances
in both modalities. Robustness is the ability of a model to resist
adversarial attacks and has been a research interest in VQA.
Although the linguistic robustness of VQA methods has been a
common field of interest, there has yet to be any significant work
on visual robustness. In this work, we will present a series of
experiments that focus on challenging the visual robustness of
multiple VQA models by applying image processing and
noisy transformations to standard VQA datasets. By comparing
the accuracy across these transformations, we hypothesize the role of
different image features in generating accurate predictions by the
model and test the model’s robustness to varying levels of noise.
We observed that [result of analysis]. We intend our method to
be used for evaluating the stability of various VQA methods.
Visual Question Answering has always been a unique problem domain as it combines the domains of Computer Vision
and Natural Language Processing. The general VQA problem
consists of answering any question with an image given as the
context. The problem domain of VQA is similar to context-based textual question answering as seen in [1] and can be
thought of as an extension of contextual QA. By the end of the
last decade, the field of visual question answering experienced
rapid growth, primarily due to the advent of revolutionary
architectures like transformers [2] in processing sequential
data and outperforming Recurrent Neural Networks (RNNs)
and their variants. Recently, we have seen Vision Transformers
[3] outperforming Convolutional Neural Networks (CNNs)
resulting in the creation of completely transformer-based architectures like ViLT [4] in VQA.
As VQA is a multimodal task, the model needs to perform
inferences from images and textual data. [5] explored the
robustness of a model in dealing with adversarial attacks which
can target the question and/or the image. While the robustness
of the textual sub-model has been thoroughly explored by
experimenting with the questions fed to the model in [5], there
has been no analysis of the robustness of the visual sub-model
by experimenting with the images given as context. In this
paper, we will delve into the model’s robustness in dealing
with visual data by performing standard image processing experiments on common datasets and evaluating the performance
of the model on the transformed datasets. Since the image
serves as a context to the question, any substantial change
to the image will affect the model’s predicted outcome. We
performed a series of image processing operations on standard
VQA datasets and tested the accuracy of the transformer-based
model ViLT [4]. Our experiments confirmed that [explanation
of experimental results].
Fig. 1. Overview of our framework for evaluating visual robustness.
Adversarial attacks, first encountered in [6], are considered
to be a shortcoming of modern deep learning methods. The
weakness was initially uncovered in the context of image
classification, after which the phenomenon was observed in
various sub-tasks of both computer vision and natural language
processing. Adding adversarial perturbations to the inputs in
order to evaluate the performance of models is an established practice
in [6]–[9] and many others. The effects of adversarial perturbations were soon noticed in language processing tasks
[10]. Thus, it is no surprise that multimodal tasks such as
Visual Question Answering (VQA) [11] are vulnerable to such
attacks. Assessing the robustness of VQA methods to such
adversarial attacks is therefore essential.
Visual Question Answering [11] is one of the most challenging tasks being researched in multimodal learning. Excelling
in this task requires a deep understanding of the visual elements
in the given context, i.e., an image or video, as well as processing
the given question. Works such as [12]–[16] paved the way
for further research as they steadily improved upon the
accuracy score. However, these methods contained inherent biases
either from the distribution of the datasets [17] or from the
modalities such as the question given [12], [15], [18], [19].
This frequently resulted in models answering correctly but for
the wrong reasons.
Transformers [2] and Vision Transformers [3] ushered in a
new wave of transformer-based methods for VQA such as
[4], [20]–[23]. These methods used pre-training for
language processing [21], patch-based image processing with
self-attention [4], or unified vision-language pre-training [23].
Our paper focuses on ViLT [4], which uses a combined vision-language transformer to handle multiple vision-language
tasks. ViLT focuses on eliminating the usage of region or grid
features, which would require using a CNN (Convolutional
Neural Network) backbone resulting in compromising the
Approaches for measuring the robustness of VQA methods can
be observed in [5], [24]–[26]. However, these assessments mostly
emphasize testing robustness with respect to the questions only. [5] focuses on
generating questions with certain levels of noise and creating an evaluation metric to compute similarity scores. [25]
uses counterfactual augmentations to questions, which convert
them to “yes/no” questions. [26] provides a novel evaluation
benchmark using an adversarial human-and-model-in-the-loop
procedure. [24] provides only one type of visual perturbation while introducing six types of textual perturbations.
This visual perturbation obfuscates segments of the image not directly
relevant to the question. While we acknowledge that the
aforementioned methods provide a baseline for determining
the robustness of VQA methods, they largely ignore a key component
of all VQA methods: the visual input. Our approach provides a baseline for
testing the robustness of the visual component in VQA methods.
B. Noisy Transformation Methods
We experimented with multiple types of noise, such as
Gaussian, Poisson, Speckle, and Salt & Pepper, and finally
settled on using “Speckle noise” for adding noise to images.
We drew a random assortment of values from a normal
distribution matching the size and channels of the image and
multiplied it with the image. This gave us a noise mask which we added to
the original image to get a noisy image. We can also control
the level of noise by controlling the noise mask.
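The noise procedure described above can be sketched in NumPy as follows. This is a minimal illustration and not the exact implementation used in the experiments: the function name `add_speckle_noise`, the `level` parameter (taken here to be the standard deviation of a zero-mean normal distribution), the scaling of pixel values to [0, 1], and the final clipping step are all assumptions made for the example.

```python
import numpy as np


def add_speckle_noise(image, level, rng=None):
    """Return a speckle-noised copy of a uint8 image.

    `level` is a hypothetical knob: the standard deviation of the
    zero-mean normal distribution the noise values are drawn from.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Work in floating point on a [0, 1] scale (an assumption).
    img = image.astype(np.float64) / 255.0
    # Random values matching the size and channels of the image.
    noise = rng.normal(loc=0.0, scale=level, size=img.shape)
    # Multiplying the noise with the image gives the noise mask.
    mask = img * noise
    # Adding the mask to the original image gives the noisy image.
    noisy = img + mask
    # Clip back to the valid range and restore the uint8 dtype.
    return (np.clip(noisy, 0.0, 1.0) * 255.0).round().astype(np.uint8)
```

Under these assumptions, a larger `level` yields a stronger noise mask, so several noise levels can be produced by calling the function with increasing values. Because the mask is multiplicative, fully black pixels are left unchanged.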
C. Evaluation Metric
The evaluation metric we used in our method is accuracy.
It considers a prediction to be correct if it matches with the
most frequent answer annotated by humans. In that case, the
accuracy would be:
\[
\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left[ a_i = \operatorname{mode}(T_i) \right]
\]
where $N$ is the total number of questions, $a_i$ is the prediction
for the $i$-th question, $\mathbb{I}[\cdot]$ is the indicator function, which equals 1
when the prediction matches the most frequent answer and 0 otherwise, and $T_i$ is the list of answers
for the $i$-th question.
In the case of VQAv2.0 [27], we matched the ith prediction with the multiple choice answer field provided. For
DAQUAR [29], we matched the prediction with the single answer
provided with the dataset.
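The accuracy metric defined above can be sketched in plain Python as follows. The helper names `mode_answer` and `accuracy` are illustrative, and two details are assumptions rather than part of the definition: ties between equally frequent answers are broken in favor of the answer that appears first, and answers are compared as exact strings without any normalization.

```python
from collections import Counter


def mode_answer(answers):
    """Most frequent human answer; ties go to the first one seen (assumption)."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, annotations):
    """Fraction of predictions equal to the mode of their answer list.

    predictions: one predicted answer per question.
    annotations: one list of human answers per question.
    """
    if len(predictions) != len(annotations):
        raise ValueError("predictions and annotations must have equal length")
    if not predictions:
        raise ValueError("at least one question is required")
    # Indicator term: 1 when the prediction matches the mode, else 0.
    correct = sum(
        pred == mode_answer(answers)
        for pred, answers in zip(predictions, annotations)
    )
    return correct / len(predictions)
```

For a dataset with a single answer per question, each answer list holds one element and the mode reduces to that answer.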
For testing the accuracy on the transformed datasets, we
will be using ViLT [4] on VQAv2.0 [27], VizWiz [28],
DAQUAR [29], and AdVersarial [26]. Each dataset will have
four types of image augmentations and five levels of
noisy transformations. For each category, the accuracy based
on misclassification is measured.
We will be primarily focusing on the robustness of a
transformer-based model named ViLT [4] in dealing with
transformed or noisy images. We have used four datasets for
the evaluation of visual robustness: VQAv2.0 [27], VizWiz [28],
DAQUAR [29], and AdVersarial [26]. For each dataset, we
calculate the baseline accuracy and accuracy for different
categories of image processing and noise.
A. Image Processing Methods
We performed three primary categories of image processing
operations on the datasets: greyscaling, color inversion, and
edge filtering. Greyscaling was performed by capturing the
luminance in a single channel rather than three separate RGB
channels. Inversion was done simply by subtracting the current
color value from the maximum color value. We also combined greyscaling and color inversion to perform a compound
transformation. For edge detection, Canny [30] was used. The
Canny edge detection algorithm has two threshold parameters
for hysteresis thresholding in order to pick strong edges and
these parameters are set accordingly based on experimental
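
The greyscaling, color inversion, and compound transformations described in this subsection can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, images are assumed to be uint8 arrays in RGB channel order with a maximum color value of 255, and the luminance weights (0.299, 0.587, 0.114) are the common ITU-R BT.601 coefficients, which may differ from the ones used in the experiments. Edge filtering is left out because the Canny threshold values are not stated here.

```python
import numpy as np

# Assumed luminance weights for the R, G, and B channels (ITU-R BT.601).
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def to_greyscale(image):
    """Collapse an (H, W, 3) RGB image into a single luminance channel."""
    grey = image.astype(np.float64) @ LUMA_WEIGHTS
    return np.clip(grey.round(), 0, 255).astype(np.uint8)


def invert(image):
    """Subtract each color value from the maximum color value (255)."""
    return np.uint8(255) - image


def greyscale_then_invert(image):
    """Compound transformation: greyscaling followed by inversion."""
    return invert(to_greyscale(image))
```

Applying `invert` twice returns the original image, which is a quick sanity check that no color information is destroyed by inversion, in contrast to greyscaling.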
Fig. 2. Different levels of noise
A significant drop in accuracy for greyscale images
compared to inverted images. Worse performance in edge
Greyscaled images lose the color component, which
invalidates the labels of questions based on color.
For example, if a greyscaled image of a boy holding a red ball
is shown and the question “What is the color of the
ball?” is asked, then the model is expected to predict “Grey” since
the color red has been greyscaled to a shade of grey.
However, the labels were annotated for colored images,
so the label will be “Red” for the given question. Hence, the
predicted answer “Grey” will be incorrect and counted
as a misclassification. A better methodology would be to
change the labels of color-based questions and then
use the updated labels to compute the accuracy.
Color-inverted images face mislabeling problems similar
to those of greyscale images, but the model seems to perform
far better with color-inverted images. A reasonable assumption would be that the model extracts meaningful
features from the color of the image and uses it to
answer questions unrelated to color. E.g. the model might
associate the color yellow with a shape like a banana
and when greyscaled the model won’t be able to predict
the name of the fruit as the color component has been
lost. However, it may still be able to predict correctly for
inverted images as the color component is not completely
Edge filtering results in a higher loss of information,
as information about the texture of the shapes is lost
along with the color of the image. Both color and
texture are important features for predicting a particular
object, and unsurprisingly, removing both resulted in
worse performance. However, the drop in performance
between the greyscaling and edge filtering transformations is
not as high as the drop in performance between the baseline
and greyscaled transformations, suggesting that colors make a
higher contribution to accurate predictions than the textures
of a shape. This could be properly verified by performing a separate
transformation that removes the textures and keeps only the
basic colors, and then comparing the
color-only and greyscaled datasets.
Fig. 3. Accuracy for different levels of noise
[Write conclusion]
[1] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, and
L. Zettlemoyer, “Quac: Question answering in context,” arXiv preprint
arXiv:1808.07036, 2018.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in
neural information processing systems, vol. 30, 2017.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al.,
“An image is worth 16x16 words: Transformers for image recognition
at scale,” arXiv preprint arXiv:2010.11929, 2020.
[4] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer
without convolution or region supervision,” in International Conference
on Machine Learning, pp. 5583–5594, PMLR, 2021.
[5] J.-H. Huang, C. D. Dao, M. Alfadly, and B. Ghanem, “A novel
framework for robustness analysis of visual qa models,” in Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8449–
8456, 2019.
[6] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow,
and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint
arXiv:1312.6199, 2013.
[7] J. Hendrik Metzen, M. Chaithanya Kumar, T. Brox, and V. Fischer,
“Universal adversarial perturbations against semantic image segmentation,” in Proceedings of the IEEE international conference on computer
vision, pp. 2755–2764, 2017.
[8] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 1765–1773, 2017.
[9] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple
and accurate method to fool deep neural networks,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 2574–
2582, 2016.
[10] R. Jia and P. Liang, “Adversarial examples for evaluating reading
comprehension systems,” arXiv preprint arXiv:1707.07328, 2017.
[11] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and
D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE
international conference on computer vision, pp. 2425–2433, 2015.
[12] R. Cadene, C. Dancette, M. Cord, D. Parikh, et al., “Rubi: Reducing
unimodal biases for visual question answering,” Advances in neural
information processing systems, vol. 32, 2019.
[13] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and
M. Rohrbach, “Multimodal compact bilinear pooling for visual question
answering and visual grounding,” arXiv preprint arXiv:1606.01847, 2016.
[14] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, “Mutan: Multimodal tucker fusion for visual question answering,” in Proceedings of
the IEEE international conference on computer vision, pp. 2612–2620, 2017.
[15] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, “Don’t just assume;
look and answer: Overcoming priors for visual question answering,” in
Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 4971–4980, 2018.
[16] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention
networks for image question answering,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 21–29, 2016.
[17] A. Agrawal, A. Kembhavi, D. Batra, and D. Parikh, “C-vqa: A compositional split of the visual question answering (vqa) v1. 0 dataset,” arXiv
preprint arXiv:1704.08243, 2017.
[18] S. Ramakrishnan, A. Agrawal, and S. Lee, “Overcoming language priors
in visual question answering with adversarial regularization,” Advances
in Neural Information Processing Systems, vol. 31, 2018.
[19] C. Clark, M. Yatskar, and L. Zettlemoyer, “Don’t take the easy way
out: Ensemble based methods for avoiding known dataset biases,” arXiv
preprint arXiv:1909.03683, 2019.
[20] H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder
representations from transformers,” arXiv preprint arXiv:1908.07490, 2019.
[21] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic
visiolinguistic representations for vision-and-language tasks,” Advances
in neural information processing systems, vol. 32, 2019.
[22] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visualbert:
A simple and performant baseline for vision and language,” arXiv
preprint arXiv:1908.03557, 2019.
[23] W. Wang, H. Bao, L. Dong, and F. Wei, “Vlmo: Unified vision-language pre-training with mixture-of-modality-experts,” arXiv preprint
arXiv:2111.02358, 2021.
[24] C. E. Jimenez, O. Russakovsky, and K. Narasimhan, “Carets: A consistency and robustness evaluative test suite for vqa,” arXiv preprint
arXiv:2203.07613, 2022.
[25] D. Rosenberg, I. Gat, A. Feder, and R. Reichart, “Are vqa systems rad?
measuring robustness to augmented data with focused interventions,”
arXiv preprint arXiv:2106.04484, 2021.
[26] L. Li, J. Lei, Z. Gan, and J. Liu, “Adversarial vqa: A new benchmark
for evaluating the robustness of vqa models,” in Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp. 2042–
2051, 2021.
[27] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making
the v in vqa matter: Elevating the role of image understanding in
visual question answering,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 6904–6913, 2017.
[28] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo,
and J. P. Bigham, “Vizwiz grand challenge: Answering visual questions
from blind people,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 3608–3617, 2018.
[29] M. Malinowski and M. Fritz, “A multi-world approach to question
answering about real-world scenes based on uncertain input,” Advances
in neural information processing systems, vol. 27, 2014.
[30] Z. Xu, X. Baojie, and W. Guoxin, “Canny edge detection based on
open cv,” in 2017 13th IEEE international conference on electronic
measurement & instruments (ICEMI), pp. 53–56, IEEE, 2017.