Automatic Question Generation from Video

T. Janani Priya, K. P. Sabari Priya, L. Raxxelyn Jenneyl, and K. V. Uma(B)
Department of Information Technology, Thiagarajar College of Engineering, Madurai, Tamil Nadu 625015, India
kvuit@tce.edu

Abstract. We live in a world of information, and lecture videos available online have become a major source of it. Evaluating the knowledge gained requires questions to be generated from the videos referred to. This paper aims to generate questions from online educational videos to facilitate self-learning through a continuous evaluation process, and to deliver a lightweight, faster video question generator model, as the existing methodologies are heavier, costlier and require high computational power. The paper integrates a state-of-the-art technology in NLP, the T5 transformer, to incorporate transfer learning and to make the model more flexible. The system can generate yes/no questions as well as who, what, where, when, why, which and how (Wh) questions.

Keywords: Question generation · T5 transformer · Wh questions · Yes/no questions

1 Introduction

The World Economic Forum stated in April 2020 that more than 1.2 billion students were unable to attend classes due to the outbreak of the coronavirus. As an upshot of this pandemic, distance learning has emerged and the procedures of educational institutions have shifted rapidly. Moreover, several tele-learning platforms have offered free access to their educational resources. In the aftermath of the coronavirus pandemic, online education has opened a new edge in the education system and has recently been used as an alternative solution to cover losses in the education field. While online education was widely treated as a discretionary means of education before the pandemic, the pandemic period made it obligatory, securing its position as the primary and predominant mode of the teaching-learning process.
Therefore, non-conventional online education platforms and resources are used by institutions to reinforce the learning process of students. This ultimate change in the education system has led to the normalization of self-directed remote learning and an increase in interest in taking up courses on online learning platforms, which requires a continuous evaluation process for self-evaluation. This new dimension has made most educational content and materials digital. Videos have emerged as a vital source for learning: knowledge at every level is provided in the form of videos on various learning platforms and social media. Self-learning has taken a leap during this pandemic, and online lecture videos have become the major source for gaining knowledge. Some learning platforms which offer various courses have their own lecture videos and means of evaluation for the learners, but other sources, for instance learning from YouTube videos, require an additional source for evaluation, and it is not often possible to get related questions. The traditional learning methods also demand a continuous supply of new questions. Question generation is an intricate task for a computer system, and the task depends on humans for generating reasonable questions [5].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. K. Das et al. (Eds.): CIPR 2022, LNNS 480, pp. 366–372, 2022.
https://doi.org/10.1007/978-981-19-3089-8_35

2 Literature Survey

Yu et al. [3] have proposed the Compositional Attention Network (CAN) with two-stream features. A Uniform Sampling Stream (USS) and an Action Pooling Stream (APS) are used for extracting visual features in the two-stream mechanism for a better representation. Guo et al.
[2] have proposed a new framework for single-turn video question generation which brings in an attention mechanism to process the dialog history and a selection mechanism to select the most relevant questions from the candidate questions generated during each iteration of the dialog history. A video question answering model is also incorporated to predict the answer and answer quality, and the framework is fine-tuned using reinforcement learning. Khurana et al. [8] have surveyed captioning techniques for videos and reviewed them based on datasets and evaluation metrics, along with the dependency of the attention mechanism on video question-answering techniques to obtain relevant results and generate visual questions. Patil et al. [7] have also proposed a similar work on the attention module, comparing the results obtained from a simple encoder-decoder module and evaluating attention for the task. Su et al. [6] have proposed a novel Generator-Pretester network for end-to-end video question-answer generation, which generates question-answer pairs and validates the generated questions by trying to answer them. Ch et al. [9] have formulated a workflow for automatic generation of multiple-choice questions (MCQs), studying and comparing question generation techniques and the evaluation techniques for validating the quality and relevance of the automatically generated questions.

3 Objectives

The main objectives of the proposed model are as follows:

• To generate questions from online educational videos to facilitate self-learning through a continuous evaluation process.
• To deliver a lightweight and faster video question generator model, as the existing methodologies are heavier, costlier and require high computational power.
• To integrate the state-of-the-art technology in NLP, the T5 transformer, to incorporate transfer learning and to make the model more flexible.
4 Proposed Methodology

The proposed system is a lightweight system compared to existing systems, and it also ensures that the generated questions are relevant and natural. The system works as two modules. The first module, text extraction, uses the moviepy library for audio conversion and the IBM Watson speech-to-text API for text extraction. The second module, question generation, uses transformer library models fine-tuned on the T5 transformer to generate questions. The proposed system architecture is as shown in Fig. 1.

Fig. 1. System design architecture

4.1 System Requirements and Tools

The system requires a minimum installed memory of 8 GB; an Intel(R) Core(TM) 2 Duo CPU E7500 processor is used. It is run on a 64-bit operating system. The tools used for implementation are Python 3 and Jupyter Notebook.

4.2 System Design

The workflow of the system is as shown in Fig. 2. The proposed system generates questions as follows:

• The user is presented with a web application.
• The UI of the application allows users to upload a video and choose the type of questions to be generated (Wh or yes/no).
• The user uploads the video from which the questions are to be generated.
• The uploaded video is converted into an audio file using the moviepy library.
• The audio file is transcribed using the IBM Watson speech-to-text API and the text is stored.
• If the user has selected Wh questions, the Wh question generator function runs and generates questions using a Hugging Face library model fine-tuned on the T5 transformer.
• If the user has selected yes/no questions, the yes/no question generator function runs and generates questions using a question generation model fine-tuned on the T5 transformer.
• The generated questions are displayed to the user.

Fig. 2. System flow diagram

4.3 Text Extraction

The video that is uploaded is handled by the moviepy library in Python.
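The video-to-audio-to-text step can be sketched as follows. This is a minimal sketch assuming the moviepy and ibm-watson packages; the function name, the intermediate audio file name, and the credential parameters are illustrative, not the system's actual identifiers.

```python
def video_to_text(video_path: str, apikey: str, service_url: str) -> str:
    """Convert an uploaded video into a text passage: video -> audio -> text.
    Illustrative sketch; apikey/service_url stand for real IBM Watson credentials."""
    from moviepy.editor import VideoFileClip
    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    # 1. Extract the audio track with moviepy's write_audiofile.
    audio_path = "lecture_audio.wav"  # illustrative file name
    VideoFileClip(video_path).audio.write_audiofile(audio_path)

    # 2. Transcribe the audio with the IBM Watson speech-to-text API.
    stt = SpeechToTextV1(authenticator=IAMAuthenticator(apikey))
    stt.set_service_url(service_url)
    with open(audio_path, "rb") as audio:
        result = stt.recognize(audio=audio, content_type="audio/wav").get_result()

    # 3. Concatenate the per-utterance transcripts into one passage.
    return " ".join(r["alternatives"][0]["transcript"] for r in result["results"])
```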
The audio conversion is carried out using the write_audiofile function of the moviepy library. The audio is then converted into a text passage using the IBM Watson speech-to-text API.

4.4 Transfer Learning

A model is first pre-trained to perform a particular task on a rich dataset and is then fine-tuned to perform another task. This approach has led to a new wave of state-of-the-art results in Natural Language Processing (NLP). As the knowledge of a pretrained model is transferred into a new model, the training time of the new model is reduced.

4.5 T5 Transformer

The Text-To-Text Transfer Transformer (T5) is a model pre-trained on a large unlabelled pre-training dataset called the Colossal Clean Crawled Corpus (C4). T5 is an encoder-decoder model that outperforms decoder-only models. It uses a shared text-to-text framework with a pretrain-then-fine-tune approach. This makes the model flexible enough to be utilized for the task of question generation. The transformer has achieved state-of-the-art performance on many NLP benchmarks.

4.6 Hugging Face Transformer Library

Hugging Face is a large open-source community that quickly became an enticing hub for pre-trained deep learning models, mainly aimed at NLP. Their core mode of operation for natural language processing revolves around the use of transformers. The library provides various models for different NLP tasks fine-tuned on the T5 transformer.

4.7 Datasets

The datasets used are the BoolQ dataset [4] and the Stanford Question Answering Dataset (SQuAD) [1]. BoolQ contains 15942 examples of yes/no questions; each example consists of a question, a passage and an answer. The SQuAD dataset consists of question-answer pairs posed by crowdworkers on Wikipedia articles. The answers to these questions are a sequence of tokens from the corresponding article.
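Under the shared text-to-text framework, both question types reduce to the same interface and differ only in the fine-tuned checkpoint and the task prefix in the input text. A minimal sketch with the transformers library follows; the default checkpoint name and the "generate questions:" prefix are illustrative assumptions, not the exact ones used by the system.

```python
def generate_questions(passage: str, model_name: str = "t5-base",
                       num_questions: int = 3):
    """Cast question generation as text-to-text: prefix the passage,
    encode it, and decode candidate questions with beam search.
    Checkpoint name and prefix are illustrative placeholders."""
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # The task is signalled purely through the input text.
    inputs = tokenizer("generate questions: " + passage,
                       return_tensors="pt", truncation=True, max_length=512)

    # Beam search keeps the n most likely hypotheses at each decoding step.
    outputs = model.generate(**inputs, num_beams=4, max_length=64,
                             num_return_sequences=num_questions)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```

Swapping in a yes/no checkpoint versus a Wh checkpoint leaves this calling code unchanged, which is the flexibility the text-to-text framing provides.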
As the questions are posed by humans and the answers come from the corresponding passage, the SQuAD dataset highly contributes to the naturalness of the questions generated by the proposed system.

4.8 Yes or No Question Generation

The model is trained to perform the specific task of yes/no question generation by fine-tuning the T5 transformer on the BoolQ dataset. The T5-base model is used considering the memory requirements. The effective batch size for training is calculated from the parameters gradient_accumulation_steps and train_batch_size. The Hugging Face transformer-based beam search algorithm is used for decoding; it traces the n most likely hypotheses (beams) at each decoding step and chooses the hypothesis with the highest probability.

4.9 WH Question Generation

The Wh question generation module is based on the Hugging Face transformer library with pretrained models, using simple and straightforward pipelines to simplify the tasks of training scripts and data modeling. The model is fine-tuned on a T5 transformer using the SQuAD dataset. End-to-end QG is used to generate multiple questions upon being given the context; the generated questions are separated by <sep> tokens. The model temperature, which is responsible for randomness in question generation rather than a greedy approach, and the top-p value, which controls diversity in the questions, are tuned accordingly. Each generation cycle stops when a delimiter is generated or the maximum token length is reached.

5 Results and Discussions

The extracted passage has been evaluated using automatic speech recognition evaluation metrics. The metrics used for evaluation are Word Information Lost (WIL), Match Error Rate (MER) and Word Error Rate (WER). The Word Information Lost (WIL) is calculated using Eq. (1).

WIL = 1 − H² / ((H + S + D)(H + S + I))    (1)

The Match Error Rate (MER) is calculated using Eq. (2).
MER = (S + D + I) / (H + S + D + I) = 1 − H/N    (2)

The Word Error Rate (WER) is calculated using Eq. (3).

WER = (S + D + I) / (H + S + D)    (3)

where S is the number of substitutions, D the number of deletions, I the number of insertions, H the number of hits (correctly recognized words, C) and N the total number of words processed; N = H + S + D + I for MER, while WER is normalized by the number of words in the reference, H + S + D.

The extracted passage has been pre-processed by converting it into lowercase, removing white spaces, and reducing it to a list of words. The minimum error rate of speech recognition is around 25%. An error rate of 0.31 is achieved for the extracted passage after pre-processing. The evaluation results are as shown in Table 1.

Table 1. Evaluation using the metrics MER, WIL and WER.

Metrics    MER      WIL      WER
Sample 1   0.4825   0.5618   0.3112
Sample 2   0.5102   0.5923   0.3333

The system performs a subjective task which cannot be evaluated accurately using a single evaluation measure, so model generations were also evaluated via human annotation. The questions generated by the system were presented to volunteers, who rated them on three criteria: 1) naturalness, 2) difficulty and 3) relevance to the context. A survey was conducted by circulating forms among students, faculty and novice users (graduates), and their responses were recorded and analyzed. Two YouTube videos were selected for the survey; the context of each video and the questions generated by the system were displayed, and the users were asked to rate them on the basis of naturalness, relevance and difficulty. The difficulty of the questions varied from video to video. The survey results are as shown in Table 2.

Table 2. Human evaluation.
Measures   Naturalness   Difficulty   Relevance   Score
Sample 1   3.78          2.29         4.14        0.89
Sample 2   3.94          3.15         4.38        0.94

6 Conclusion

The model has been tested with various videos from YouTube and lecture videos. The video is uploaded and the system generates Wh or yes/no questions as per the user's needs. The quality of the generated questions is found to be highly dependent on the quality of the uploaded video. Videos also represent information in visuals, which aid the audio. As the system relies only on the audio for context extraction, the information available in the visuals is not utilized. These visuals could also aid the speech-to-text conversion: in some cases the speech-to-text converter fails to guess the correct terms, and in such cases the visuals in the video could be useful. In the future, the system will comprehend the visuals and incorporate that information with the audio to enrich the context, and thus increase the quality of the questions.

References

1. Rajpurkar, P., et al.: SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). https://doi.org/10.48550/arXiv.1606.05250
2. Guo, Z., et al.: Multi-turn video question generation via reinforced multi-choice attention network. IEEE Trans. Circuits Syst. Video Technol. 31(5), 1697–1710 (2020)
3. Yu, T., et al.: Compositional attention networks with two-stream fusion for video question answering. IEEE Trans. Image Process. 29, 1204–1218 (2019)
4. Clark, C., et al.: BoolQ: exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044 (2019). https://doi.org/10.48550/arXiv.1905.10044
5. Nie, L., et al.: Beyond text QA: multimedia answer generation by harvesting web information. IEEE Trans. Multimed. 15(2), 426–441 (2012)
6. Su, H.-T., et al.: End-to-end video question-answer generation with generator-pretester network. IEEE Trans. Circuits Syst. Video Technol.
31(11), 4497–4507 (2021)
7. Patil, C., Kulkarni, A.: Attention-based visual question generation. In: 2021 International Conference on Emerging Smart Computing and Informatics (ESCI). IEEE (2021). https://doi.org/10.1109/ESCI50559.2021.9396956
8. Khurana, K., Deshpande, U.: Video question-answering techniques, benchmark datasets and evaluation metrics leveraging video captioning: a comprehensive survey. IEEE Access (2021). https://doi.org/10.1109/ACCESS.2021.3058248
9. Ch, D.R., Saha, S.K.: Automatic multiple choice question generation from text: a survey. IEEE Trans. Learn. Technol. 13(1), 14–25 (2018)