
Automatic Question Generation from Video using T5 Transformers

T. Janani Priya, K. P. Sabari Priya, L. Raxxelyn Jenneyl, and K. V. Uma(B)
Department of Information Technology, Thiagarajar College of Engineering, Madurai,
Tamil Nadu 625015, India
kvuit@tce.edu
Abstract. We live in a world of information, and lecture videos available online
have become a major source of it. Evaluating the knowledge gained requires
questions to be generated from the videos referred to. This paper aims to generate
questions from online educational videos to facilitate self-learning through a
continuous evaluation process, and to deliver a lightweight, faster video question
generator model, as the existing methodologies are heavier, costlier and require
high computational power. The paper integrates a state-of-the-art NLP technology,
the T5 transformer, to incorporate transfer learning and to make the model more
flexible. The system can generate yes/no questions and who, what, where, when,
why, which, and how (Wh) questions.
Keywords: Question generation · T5 transformer · Wh questions · Yes or no
questions
1 Introduction
In April 2020, the World Economic Forum stated that more than 1.2 billion students
were unable to attend classes due to the outbreak of the coronavirus. As an upshot of
this pandemic and the emergence of distance learning, the procedures of educational
institutions have shifted rapidly. Moreover, several tele-learning platforms have offered
free access to their educational resources. In the aftermath of the coronavirus pandemic,
online education has opened a new edge in the education system, recently used as an
alternative solution to cover losses in the education field. While online education was
widely treated as a discretionary means of education before the pandemic, the pandemic
period made it obligatory, securing its position as the primary and predominant mode
of the teaching-learning process. Therefore, non-conventional online education platforms
and resources are used by institutions to reinforce the learning of students. This
ultimate change in the education system has led to the normalization of self-directed
remote learning and an increase in interest in taking up courses on online learning
platforms, which requires a continuous evaluation process for self-evaluation.
This new dimension has made most of the educational content or materials digital.
Videos have emerged as a vital source for learning. Knowledge at every level is provided
in the form of videos in various learning platforms and social media. Self-learning has
taken a leap in this pandemic and online lecture videos have become the major source
for gaining knowledge. Some learning platforms which offer various courses have their
own lecture videos and means of evaluation for the learners, but with other sources, for
instance learning from YouTube videos, an additional source is required for evaluation,
and related questions for evaluation are often not available. Traditional learning
methods also demand a continuous supply of new questions. Question generation
is an intricate task for a computer system, which otherwise depends on humans for
generating reasonable questions [5].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. K. Das et al. (Eds.): CIPR 2022, LNNS 480, pp. 366–372, 2022.
https://doi.org/10.1007/978-981-19-3089-8_35
2 Literature Survey
Yu et al. [3] have proposed the Compositional Attention Network (CAN) with two-stream
features. A Uniform Sampling Stream (USS) and an Action Pooling Stream
(APS) are used to extract visual features in the two-stream mechanism for a better
representation. Guo et al. [2] proposed a new framework for single-turn video question
generation which brings in an attention mechanism to process the dialog history and a
selection mechanism to select the most relevant questions from the candidate questions
generated during each iteration of the dialog history. A video question answering model
is also incorporated to predict the answer and answer quality, and the framework is
fine-tuned using reinforcement learning.
Khurana et al. [8] have surveyed video captioning techniques, reviewing them based
on datasets and evaluation metrics, and examined how video question-answering
techniques depend on the attention mechanism to obtain relevant results and generate
visual questions. Patil et al. [7] have proposed a similar attention-based module,
comparing its results against a simple encoder-decoder module and evaluating the
contribution of attention to the task.
Su et al. [6] have proposed a novel Generator-Pretester network for end-to-end video
question-answer generation, which generates question-answer pairs and validates the
generated questions by trying to answer them. Ch and Saha [9] have formulated a
workflow for the automatic generation of multiple-choice questions (MCQs), studying
a comparison of question generation techniques and of the evaluation techniques for
validating the quality and relevance of the automatically generated MCQs.
3 Objectives
The main objective of the proposed model is as follows:
• To generate questions from online educational videos to facilitate self-learning through
a continuous evaluation process.
• To deliver a light-weight and faster Video Question generator model as the existing
methodologies are heavier, costlier and require high computational power.
• To integrate the state-of-the-art technology in NLP, the T5 transformers, to incorporate
transfer learning and to make the model more flexible.
4 Proposed Methodology
The proposed system is lightweight compared to existing systems, and it also ensures
that the generated questions are relevant and natural. The system works as two
modules. The first module, text extraction, uses the moviepy library for audio conversion
and the IBM Watson speech-to-text API for text extraction. The second module,
question generation, uses a Hugging Face transformer library model fine-tuned on the
T5 transformer to generate questions. The proposed system architecture is shown in Fig. 1.
Fig. 1. System design architecture
4.1 System Requirements and Tools
The system requires a minimum of 8 GB of installed memory; an Intel(R) Core(TM) 2
Duo CPU E7500 processor running a 64-bit operating system is used. The tools used
for implementation are Python3 and Jupyter notebook.
4.2 System Design
The workflow of the system is shown in Fig. 2. The proposed system generates
questions as follows:
• The user is presented with a web application.
• The UI of the application allows users to upload a video and choose the type of
questions to be generated (Wh or yes/no).
• The user uploads a video from which the questions are to be generated.
• The uploaded video is converted into an audio file using moviepy library.
• The audio file is transcribed using IBM Watson speech to text API and the text is
stored.
• If the user has selected Wh questions, the Wh question generator function runs
and generates questions using a Hugging Face library model fine-tuned on the T5
transformer.
• If the user has selected yes/no questions, the yes/no question generator function
runs and generates questions using a question generation model fine-tuned on the T5
transformer.
• The generated questions are displayed to the user.
Fig. 2. System flow diagram
4.3 Text Extraction
The video that is uploaded is handled by the moviepy library in python. The audio
conversion is carried out using the write_audiofile function of the moviepy library. The
audio is then converted to text passage using IBM Watson speech-to-text API.
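A minimal sketch of this extraction step is given below, assuming a local video file and placeholder IBM Cloud credentials; the file names, the `audio_path_for` helper, and the task of joining segment transcripts are illustrative, and the third-party imports (moviepy, ibm_watson) are kept inside the functions so the path helper can be used on its own.

```python
import os


def audio_path_for(video_path):
    """Derive the target WAV file name from the video name (hypothetical helper)."""
    base, _ = os.path.splitext(video_path)
    return base + ".wav"


def extract_audio(video_path):
    """Convert the uploaded video into a WAV file with moviepy's write_audiofile."""
    from moviepy.editor import VideoFileClip  # third-party dependency

    audio_path = audio_path_for(video_path)
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path)
    clip.close()
    return audio_path


def transcribe(audio_path, api_key, service_url):
    """Transcribe the audio with the IBM Watson speech-to-text API."""
    from ibm_watson import SpeechToTextV1  # third-party dependency
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    stt = SpeechToTextV1(authenticator=IAMAuthenticator(api_key))
    stt.set_service_url(service_url)
    with open(audio_path, "rb") as audio_file:
        result = stt.recognize(audio=audio_file, content_type="audio/wav").get_result()
    # Join the best alternative of every recognized segment into one passage.
    return " ".join(r["alternatives"][0]["transcript"] for r in result["results"])
```

The stored passage returned by `transcribe` is what the question generation module consumes.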
4.4 Transfer Learning
In transfer learning, a model is first pre-trained on a rich dataset to perform one task
and is then fine-tuned to perform another. This approach has led to a new wave of
state-of-the-art results in Natural Language Processing (NLP). As the knowledge of a
pretrained model is transferred into the new model, the training time for the new model
is reduced.
4.5 T5 Transformer
Text To Text Transfer Transformer (T5) is a model pre-trained on a large pre-training
unlabelled dataset called the Colossal Clean Crawled Corpus (C4). T5 is an encoderdecoder model that outperforms decoder-only models. It uses a shared Text-to-Text
framework with a pretrain then fine tune approach. This makes the model flexible enough
to be utilized for the task of question generation. The transformer has achieved state of
the art performance in many NLP benchmarks.
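In the text-to-text framework, every task is cast as feeding a task-prefixed input string to the model and reading the answer off as a generated string. The sketch below illustrates this casting for question generation; the prefix wording is an assumption for illustration, not the exact prefix used during fine-tuning.

```python
def to_text2text(task_prefix, context, answer=None):
    """Cast a question generation example into T5's text-to-text input format.

    For training, the target question is supplied separately as the decoder's
    output string; only the input side is built here.
    """
    if answer is not None:
        # Answer-aware question generation: the answer is part of the input.
        return f"{task_prefix}: answer: {answer} context: {context}"
    return f"{task_prefix}: {context}"


# Example input string for generating a question from a passage.
example = to_text2text("generate question", "T5 was pre-trained on the C4 corpus.")
```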
4.6 Hugging Face Transformer Library
Hugging Face is a large open-source community that quickly became an enticing hub for
pre-trained deep learning models, mainly aimed at NLP. Their core mode of operation
for natural language processing revolves around the use of Transformers. This library
provides various models for different NLP tasks fine tuned on the T5 transformer.
4.7 Datasets
The datasets used are the BoolQ dataset [4] and the Stanford Question Answering Dataset
(SQuAD) [1]. BoolQ contains 15,942 examples of yes/no questions; each example consists
of a question, a passage and an answer. The SQuAD dataset consists of question-answer
pairs posed by crowdworkers on Wikipedia articles. The answer to each question is a
span of tokens from the corresponding article. As the questions are posed by humans
and the answers come from the corresponding passage, this dataset contributes strongly
to the naturalness of the questions generated by the proposed system.
4.8 Yes or No Question Generation
The model is trained to perform the specific task of yes/no question generation by
fine-tuning the T5 transformer on the BoolQ dataset. The T5-base model is used
considering the memory requirements. The effective batch size for training is calculated
from the parameters gradient_accumulation_steps and train_batch_size. The Hugging Face
transformer-based beam search algorithm is used for decoding; it traces the n most
likely hypotheses at each decoding step and chooses the hypothesis with the highest
probability.
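As a sketch, the effective batch size described above is the product of the two parameters, and beam-search decoding maps onto the `generate` method of the Hugging Face transformers library; the task prefix and parameter values below are illustrative assumptions, not the exact fine-tuning configuration.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps):
    """Gradients are accumulated over several small batches before each
    optimizer step, so the effective batch size is their product."""
    return train_batch_size * gradient_accumulation_steps


def generate_yesno_question(model, tokenizer, passage, num_beams=4):
    """Beam-search decoding with a fine-tuned T5 model (sketch).

    'generate boolean question:' is an assumed task prefix.
    """
    inputs = tokenizer("generate boolean question: " + passage, return_tensors="pt")
    # num_beams hypotheses are tracked at each decoding step; the most
    # probable finished hypothesis is returned.
    output_ids = model.generate(**inputs, num_beams=num_beams, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```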
4.9 WH Question Generation
The Wh question generation module is based on the Hugging Face transformers library
with pretrained models, using simple and straightforward pipelines that simplify the
training scripts and data modeling. The model is fine-tuned on the T5 transformer using
the SQuAD dataset. End-to-end QG is used to generate multiple questions from a given
context. The generated questions are separated by <sep> tokens. The model temperature,
which introduces randomness into question generation rather than a greedy approach,
and the top-p value, which provides diversity in the questions, are tuned accordingly.
Each generation cycle stops when a delimiter is generated or the maximum token length
is reached.
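The post-processing described above, splitting the model's raw output on the <sep> delimiter and discarding empty fragments, can be sketched as a small helper; the helper name and example strings are illustrative.

```python
def split_questions(raw_output, delimiter="<sep>"):
    """Split an end-to-end QG output string into individual questions.

    The model emits all questions in one sequence separated by <sep> tokens;
    generation stops at a final delimiter or the token limit, so any trailing
    empty fragment is discarded.
    """
    parts = [p.strip() for p in raw_output.split(delimiter)]
    return [p for p in parts if p]


questions = split_questions(
    "What is T5 pre-trained on? <sep> Which dataset is used for Wh questions? <sep>"
)
# questions == ['What is T5 pre-trained on?', 'Which dataset is used for Wh questions?']
```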
5 Results and Discussions
The extracted passage has been evaluated using automatic speech recognition evaluation
metrics: Word Information Lost (WIL), Match Error Rate (MER) and Word Error Rate (WER).
The Word Information Lost (WIL) is calculated using Eq. (1).
WIL = 1 − H² / ((H + S + D)(H + S + I)) (1)
Automatic Question Generation from Video
371
The Match Error Rate (MER) is calculated using Eq. (2).
MER = (S + D + I)/N = 1 − H/N, where N = H + S + D + I (2)
The Word Error Rate (WER) is calculated using Eq. (3).
WER = (S + D + I)/N₁ = (S + D + I)/(H + S + D) (3)
where,
S - number of substitutions
D - number of deletions
I - number of insertions
C - number of correct words
H - number of hits (successes), equal to C
N - total number of aligned words, N = H + S + D + I
N₁ - number of words in the reference, N₁ = H + S + D
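The three formulas can be computed directly from the alignment counts; a minimal sketch follows, in which the count values of the example are illustrative rather than taken from the reported experiments.

```python
def asr_metrics(H, S, D, I):
    """Compute WER, MER and WIL from alignment counts, per Eqs. (1)-(3).

    H: hits, S: substitutions, D: deletions, I: insertions.
    """
    N1 = H + S + D           # number of words in the reference
    N = H + S + D + I        # total number of aligned words
    wer = (S + D + I) / N1
    mer = (S + D + I) / N    # equivalently 1 - H / N
    wil = 1 - H * H / ((H + S + D) * (H + S + I))
    return {"WER": wer, "MER": mer, "WIL": wil}


scores = asr_metrics(H=7, S=1, D=1, I=1)
# e.g. MER = 3/10 = 0.3, WER = 3/9, WIL = 1 - 49/81
```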
The extracted passage has been pre-processed by converting it into lowercase, removing
extra white space, and splitting it into a list of words. The minimum error rate of
speech recognition is around 25%; an error rate of 0.31 is achieved for the extracted
passage after pre-processing. The evaluation results are shown in Table 1.
Table 1. Evaluation using metrics MER, WIL and WER.

Metrics    MER      WIL      WER
Sample 1   0.4825   0.5618   0.3112
Sample 2   0.5102   0.5923   0.3333
The system performs a subjective task which cannot be evaluated accurately using a
single evaluation measure, so the model generations were also evaluated via human
annotation. The questions generated by the system were presented to volunteers, who
rated them on three criteria: (1) naturalness, (2) difficulty and (3) relevance to the
context.
A survey was conducted by circulating forms among students, faculty and novice users
(graduates), and their responses were recorded and analyzed. Two YouTube videos were
selected for the survey; the context of each video and the questions generated by the
system were displayed, and the users were asked to rate them on the basis of naturalness,
relevance and difficulty. The difficulty of the questions varied from video to video.
The survey results are shown in Table 2.
Table 2. Human evaluation.

Measures   Naturalness   Difficulty   Relevance   Score
Sample 1   3.78          2.29          4.14        0.89
Sample 2   3.94          3.15          4.38        0.94
6 Conclusion
The model has been tested with various videos from YouTube as well as lecture videos.
A video is uploaded and the system generates Wh or yes/no questions as per user needs.
The quality of the generated questions is found to depend highly on the quality of the
uploaded video. Videos carry information in their visuals, and the visuals presented in
a video complement the audio. As the system relies only on the audio for context
extraction, the information available in the visuals is not utilized. These visuals
could also aid the speech-to-text conversion: in some cases the speech-to-text converter
fails to guess the correct terms, and the visuals in the video could then be useful.
In future work, the system will comprehend the visuals and combine that information with
the audio to enrich the context and thus increase the quality of the generated questions.
References
1. Rajpurkar, P., et al.: SQuAD: 100,000+ questions for machine comprehension of text. arXiv
preprint arXiv:1606.05250 (2016). https://doi.org/10.48550/arXiv.1606.05250
2. Guo, Z., et al.: Multi-turn video question generation via reinforced multi-choice attention
network. IEEE Trans. Circuits Syst. Video Technol. 31(5), 1697–1710 (2020)
3. Yu, T., et al.: Compositional attention networks with two-stream fusion for video question
answering. IEEE Trans. Image Process. 29, 1204–1218 (2019)
4. Clark, C., et al.: BoolQ: exploring the surprising difficulty of natural yes/no questions. arXiv
preprint arXiv:1905.10044 (2019). https://doi.org/10.48550/arXiv.1905.10044
5. Nie, L., et al.: Beyond text QA: multimedia answer generation by harvesting web information.
IEEE Trans. Multimed. 15(2), 426–441 (2012)
6. Su, H.-T., et al.: End-to-end video question-answer generation with generator-pretester
network. IEEE Trans. Circuits Syst. Video Technol. 31(11), 4497–4507 (2021)
7. Patil, C., Kulkarni, A.: Attention-based visual question generation. In: 2021 International Conference on Emerging Smart Computing and Informatics (ESCI). IEEE (2021). https://doi.org/
10.1109/ESCI50559.2021.9396956
8. Khurana, K., Deshpande, U.: Video question-answering techniques, benchmark datasets and
evaluation metrics leveraging video captioning: a comprehensive survey. IEEE Access (2021).
https://doi.org/10.1109/ACCESS.2021.3058248
9. Ch, D.R., Saha, S.K.: Automatic multiple choice question generation from text: a survey. IEEE
Trans. Learn. Technol. 13(1), 14–25 (2018)