Visual Robustness Analysis in Visual Question Answering

Ishmam Tashdeed, Md Farhan Ishmam, Asir Sadaat Nipun
Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
ishmamtashdeed@iut-dhaka.edu, farhanishmam@iut-dhaka.edu, asirsaadat@iut-dhaka.edu

Abstract—Visual Question Answering (VQA) is the task of producing an answer by combining visual and textual inputs. As VQA combines the vision and language modalities, it is susceptible to adversarial disturbances in either modality. Robustness, the ability of a model to resist adversarial attacks, has been a research interest in VQA. Although the linguistic robustness of VQA methods has been a common field of interest, there has yet to be any significant work on visual robustness. In this work, we present a series of experiments that challenge the visual robustness of multiple VQA models by performing image processing and noisy transformations on standard VQA datasets. By comparing the accuracy under these transformations, we hypothesize the role of different image features in generating accurate predictions and test the model's robustness at varying levels of noise. We observed that [result of analysis]. We intend our method to be used for evaluating the stability of various VQA methods.

I. INTRODUCTION

Visual Question Answering has always been a unique problem domain as it combines the domains of Computer Vision and Natural Language Processing. The general VQA problem consists of answering a question with an image given as the context. The problem domain of VQA is similar to context-based textual question answering, as seen in [1], and can be thought of as an extension of contextual QA. By the end of the last decade, the field of visual question answering experienced rapid growth, primarily due to the advent of revolutionary architectures like transformers [2], which process sequential data and outperform Recurrent Neural Networks (RNNs) and their variants. More recently, Vision Transformers [3] have outperformed Convolutional Neural Networks (CNNs), resulting in the creation of completely transformer-based architectures like ViLT [4] for VQA.

As VQA is a multimodal task, the model needs to perform inference on both images and textual data. [5] explored the robustness of a model in dealing with adversarial attacks that can target the question and/or the image. While the robustness of the textual sub-model has been thoroughly explored by experimenting with the questions fed to the model [5], there has been no analysis of the robustness of the visual sub-model by experimenting with the images given as context. In this paper, we delve into the model's robustness in dealing with visual data by performing standard image processing operations on common datasets and evaluating the performance of the model on the transformed datasets. Since the image serves as the context for the question, any substantial change to the image will affect the model's predicted outcome. We performed a series of image processing operations on standard VQA datasets and tested the accuracy of the transformer-based model ViLT [4]. Our experiments confirmed that [explanation of experimental results].

Fig. 1. Overview of our framework for evaluating visual robustness.
Adversarial attacks, first encountered in [6], are considered a shortcoming of modern deep learning methods. The weakness was initially uncovered in the context of image classification, which led to the phenomenon being observed in various sub-tasks of both computer vision and natural language processing. Adding adversarial perturbations to the inputs in order to evaluate the performance of models is an established practice in [6]–[9] and many other works. The effects of adversarial perturbations were soon noticed in language processing tasks [10]. Thus, it is no surprise that multimodal tasks such as Visual Question Answering (VQA) [11] are vulnerable to such attacks, so assessing the robustness of VQA methods to such adversarial attacks is essential.

II. RELATED WORK

Visual Question Answering [11] is one of the most challenging tasks being researched in multimodal learning. Excelling in this task requires a greater understanding of the visual elements in the given context, i.e., an image or video, as well as processing the given question. Works such as [12]–[16] paved the way for further research as they steadily improved upon the accuracy score. However, these methods contained inherent biases, either from the distribution of the datasets [17] or from the modalities, such as the given question [12], [15], [18], [19]. This frequently resulted in models answering correctly but for the wrong reasons. Transformers [2] and Vision Transformers [3] ushered in a new wave of transformer-based methods for VQA such as [4], [20]–[23]. These methods use pre-training for language processing [21], patched image processing with self-attention [4], or unified vision-language pre-training [23]. Our paper focuses on ViLT [4], which uses a combined vision-language transformer to handle multiple vision-language tasks. ViLT eliminates the use of region or grid features, which would require a CNN (Convolutional Neural Network) backbone and compromise inference speed.

Approaches for measuring the robustness of VQA methods can be observed in [5], [24]–[26]. However, these assessments emphasize testing robustness with respect to the questions only. [5] focuses on generating questions with certain levels of noise and creating an evaluation metric to compute similarity scores. [25] uses counterfactual augmentations to questions, which convert them to "yes/no" questions. [26] provides a novel evaluation benchmark using an adversarial human-and-model-in-the-loop procedure. [24] provides only one type of visual perturbation, which obfuscates segments of the image not directly relevant to the question, while introducing six types of textual perturbations. While we acknowledge that the aforementioned methods provide a baseline for determining the robustness of VQA methods, they ignore a key component of all VQA methods: the visual input. Our approach provides a baseline for testing the robustness of the visual component in VQA methods.

III. METHODOLOGY

A. Image Processing Methods

We performed three primary categories of image processing operations on the dataset: greyscaling, color inversion, and edge filtering. Greyscaling was performed by capturing the luminance in a single channel rather than three separate RGB channels. Inversion was done by subtracting the current color value from the maximum color value. We also combined greyscaling and color inversion into a compound transformation. For edge detection, Canny [30] was used. The Canny edge detection algorithm has two threshold parameters for hysteresis thresholding in order to pick strong edges; these parameters were set based on experimental results.

B. Noisy Transformation Methods

We experimented with multiple types of noise, such as Gaussian, Poisson, Speckle, and Salt & Pepper, and finally settled on speckle noise for adding noise to the images. We drew a random assortment of values from a normal distribution matching the size and channel count of the image and multiplied it with the image. This gave us a noise mask, which we added to the original image to obtain a noisy image. The level of noise can be controlled by scaling the noise mask.
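To make the transformation pipeline concrete, the sketch below implements the four image augmentations of Section III-A (greyscaling, color inversion, their compound, and Canny edge filtering) and the speckle-noise transformation of Section III-B using OpenCV and NumPy. It is a minimal illustration rather than the exact implementation used in our experiments; in particular, the Canny hysteresis thresholds and the noise level shown are assumed placeholder values, and the function names are ours.

```python
import cv2
import numpy as np


def greyscale(img: np.ndarray) -> np.ndarray:
    # Capture luminance in a single channel, then replicate it to three
    # channels so the model still receives an RGB-shaped input.
    grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    return cv2.cvtColor(grey, cv2.COLOR_GRAY2RGB)


def invert(img: np.ndarray) -> np.ndarray:
    # Subtract each pixel value from the maximum color value (255 for uint8).
    return 255 - img


def grey_invert(img: np.ndarray) -> np.ndarray:
    # Compound transformation: greyscaling followed by color inversion.
    return invert(greyscale(img))


def edge_filter(img: np.ndarray, low: int = 100, high: int = 200) -> np.ndarray:
    # Canny edge detection; `low` and `high` are the hysteresis thresholds
    # (placeholder values, not the tuned settings from the experiments).
    grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(grey, low, high)
    return cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB)


def speckle_noise(img: np.ndarray, level: float = 0.1, seed: int = 0) -> np.ndarray:
    # Draw a Gaussian mask matching the image's size and channels, scale it by
    # `level`, multiply it with the image to form the noise mask, and add the
    # mask back to the original image (multiplicative speckle noise).
    rng = np.random.default_rng(seed)
    gauss = rng.normal(0.0, 1.0, img.shape)
    noisy = img.astype(np.float64) + img.astype(np.float64) * gauss * level
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Five increasing values of `level` would then produce the five noise levels referred to in Section IV; the specific values are not fixed by this sketch.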
C. Evaluation Metric

The evaluation metric we used in our method is accuracy. It considers a prediction to be correct if it matches the most frequent answer annotated by humans. In that case, the accuracy is:

\[ \text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left[ a_i = \operatorname{mode}(T_i) \right] \]

where N is the total number of questions, a_i is the predicted answer for the i-th question, \mathbb{I}[\cdot] is the indicator function that matches the prediction with the answer, and T_i is the list of human-annotated answers for the i-th question. In the case of VQAv2.0 [27], we matched the i-th prediction with the provided multiple-choice answer field, and for DAQUAR [29], we matched the prediction with the single answer provided with the dataset.

IV. EXPERIMENTS AND ANALYSIS

We primarily focus on the robustness of the transformer-based model ViLT [4] in dealing with transformed or noisy images. We evaluate visual robustness on four datasets: VQAv2.0 [27], VizWiz [28], DAQUAR [29], and AdVersarial [26]. Each dataset is subjected to four types of image augmentations and five levels of noisy transformations. For each dataset, we calculate the baseline accuracy and the accuracy under each category of image processing and noise, where accuracy is measured based on misclassification.

Fig. 2. Different levels of noise
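As an illustration of how the accuracy metric above can be computed against a VQA model, the sketch below queries a ViLT checkpoint through the Hugging Face transformers library and scores its predictions against the mode of the human-annotated answers. The checkpoint name (dandelin/vilt-b32-finetuned-vqa), the lower-casing normalization, and the (image, question, answers) sample format are assumptions for illustration and are not specified in this paper.

```python
from collections import Counter

import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Assumed public ViLT checkpoint fine-tuned for VQA; the paper does not name one.
CKPT = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(CKPT)
model = ViltForQuestionAnswering.from_pretrained(CKPT)
model.eval()


def predict(image: Image.Image, question: str) -> str:
    # Encode the (image, question) pair and return the highest-scoring answer class.
    encoding = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits
    return model.config.id2label[logits.argmax(-1).item()]


def accuracy(samples) -> float:
    # samples: iterable of (PIL image, question string, list of annotated answers T_i).
    # A prediction a_i counts as correct only if it equals mode(T_i).
    correct, total = 0, 0
    for image, question, answers in samples:
        gold = Counter(a.strip().lower() for a in answers).most_common(1)[0][0]
        pred = predict(image, question).strip().lower()
        correct += int(pred == gold)
        total += 1
    return correct / max(total, 1)
```

The same loop can be pointed at the transformed datasets by running each image through the transformation functions sketched earlier before calling predict.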
We observed:
• A significant drop in accuracy for greyscale images compared to inverted images.
• Worse performance for edge-filtered images.

Greyscaled images lose the color component, which invalidates the labels of color-based questions. For example, if a greyscaled image of a boy holding a red ball is shown and the question "What is the color of the ball?" is asked, the model is expected to predict "Grey", since the red ball has been greyscaled to a shade of grey. However, the labels were annotated on the colored images, so the label for the given question remains "Red". Hence, the predicted answer "Grey" is incorrect and counted as a misclassification. A better methodology would be to relabel the color-based questions and then compute the accuracy.

Color-inverted images face mislabeling problems similar to greyscale images, but the model seems to perform far better on color-inverted images. A reasonable assumption is that the model extracts meaningful features from the color of the image and uses them to answer questions unrelated to color. For example, the model might associate the color yellow with a banana-like shape; when the image is greyscaled, the model cannot predict the name of the fruit because the color component has been lost. However, it may still predict correctly on inverted images because the color component is not completely lost.

Edge filtering results in a higher loss of information, as information about the texture of the shapes is lost along with the color of the image. Both color and texture are important features for predicting a particular object, and unsurprisingly, removing both resulted in worse performance. However, the drop in performance between the greyscaling and edge filtering transformations is not as large as the drop between the baseline and greyscaled transformations, suggesting that color contributes more to accurate predictions than the texture of a shape. This could be properly verified by performing a separate transformation that removes the textures and keeps only the basic colors, and then comparing the resulting color-only dataset with the greyscaled dataset.

Fig. 3. Accuracy for different levels of noise

V. CONCLUSION

[Write conclusion]

REFERENCES

[1] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, and L. Zettlemoyer, "Quac: Question answering in context," arXiv preprint arXiv:1808.07036, 2018.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[4] W. Kim, B. Son, and I. Kim, "Vilt: Vision-and-language transformer without convolution or region supervision," in International Conference on Machine Learning, pp. 5583–5594, PMLR, 2021.
[5] J.-H. Huang, C. D. Dao, M. Alfadly, and B. Ghanem, "A novel framework for robustness analysis of visual qa models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8449–8456, 2019.
[6] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[7] J. Hendrik Metzen, M. Chaithanya Kumar, T. Brox, and V. Fischer, "Universal adversarial perturbations against semantic image segmentation," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2755–2764, 2017.
[8] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, "Universal adversarial perturbations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1765–1773, 2017.
[9] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "Deepfool: a simple and accurate method to fool deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.
[10] R. Jia and P. Liang, "Adversarial examples for evaluating reading comprehension systems," arXiv preprint arXiv:1707.07328, 2017.
[11] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "Vqa: Visual question answering," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015.
[12] R. Cadene, C. Dancette, M. Cord, D. Parikh, et al., "Rubi: Reducing unimodal biases for visual question answering," Advances in Neural Information Processing Systems, vol. 32, 2019.
[13] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, "Multimodal compact bilinear pooling for visual question answering and visual grounding," arXiv preprint arXiv:1606.01847, 2016.
[14] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, "Mutan: Multimodal tucker fusion for visual question answering," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2612–2620, 2017.
[15] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, "Don't just assume; look and answer: Overcoming priors for visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4971–4980, 2018.
[16] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29, 2016.
[17] A. Agrawal, A. Kembhavi, D. Batra, and D. Parikh, "C-vqa: A compositional split of the visual question answering (vqa) v1.0 dataset," arXiv preprint arXiv:1704.08243, 2017.
[18] S. Ramakrishnan, A. Agrawal, and S. Lee, "Overcoming language priors in visual question answering with adversarial regularization," Advances in Neural Information Processing Systems, vol. 31, 2018.
[19] C. Clark, M. Yatskar, and L. Zettlemoyer, "Don't take the easy way out: Ensemble based methods for avoiding known dataset biases," arXiv preprint arXiv:1909.03683, 2019.
[20] H. Tan and M. Bansal, "Lxmert: Learning cross-modality encoder representations from transformers," arXiv preprint arXiv:1908.07490, 2019.
[21] J. Lu, D. Batra, D. Parikh, and S. Lee, "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," Advances in Neural Information Processing Systems, vol. 32, 2019.
[22] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, "Visualbert: A simple and performant baseline for vision and language," arXiv preprint arXiv:1908.03557, 2019.
[23] W. Wang, H. Bao, L. Dong, and F. Wei, "Vlmo: Unified vision-language pre-training with mixture-of-modality-experts," arXiv preprint arXiv:2111.02358, 2021.
[24] C. E. Jimenez, O. Russakovsky, and K. Narasimhan, "Carets: A consistency and robustness evaluative test suite for vqa," arXiv preprint arXiv:2203.07613, 2022.
[25] D. Rosenberg, I. Gat, A. Feder, and R. Reichart, "Are vqa systems rad? measuring robustness to augmented data with focused interventions," arXiv preprint arXiv:2106.04484, 2021.
[26] L. Li, J. Lei, Z. Gan, and J. Liu, "Adversarial vqa: A new benchmark for evaluating the robustness of vqa models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2042–2051, 2021.
[27] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, "Making the v in vqa matter: Elevating the role of image understanding in visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913, 2017.
[28] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham, "Vizwiz grand challenge: Answering visual questions from blind people," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3617, 2018.
[29] M. Malinowski and M. Fritz, "A multi-world approach to question answering about real-world scenes based on uncertain input," Advances in Neural Information Processing Systems, vol. 27, 2014.
[30] Z. Xu, X. Baojie, and W. Guoxin, "Canny edge detection based on open cv," in 2017 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI), pp. 53–56, IEEE, 2017.