
Semantic Attribute Enriched Storytelling from a Sequence of Images

Zainy M. Malakan (a,b), Ghulam Mubashar Hassan (a), Mohammad A. A. K. Jalwana (a), Nayyer Aafaq (a) and Ajmal Mian (a)

(a) Department of Computer Science and Software Engineering, The University of Western Australia, Perth, Australia
(b) Department of Information Science, College of Computer and Information Systems, Umm Al-Qura University, Makkah, Kingdom of Saudi Arabia
2021 Digital Image Computing: Techniques and Applications (DICTA). DOI: 10.1109/DICTA52665.2021.9647213. ©2021 IEEE.
{zainy.aljawy, nayyer.aafaq}@research.uwa.edu.au,
{ghulam.hassan, mohammad.jalwana, ajmal.mian}@uwa.edu.au
Abstract—Visual storytelling (VST) pertains to the task of generating story-based sentences from an ordered sequence of images. Contemporary techniques suffer from several limitations, such as inadequate encapsulation of the visual variance and context among the input sequence. Consequently, the stories generated by such techniques often lack coherence, context and semantic information. In this research, we devise a ‘Semantic Attribute Enriched Storytelling’ (SAES) framework to mitigate these issues. To that end, we first extract the visual features of the input image sequence and the noun entities present in the visual input by employing an off-the-shelf object detector. The two features are concatenated to encapsulate the visual variance of the input sequence. The features are then passed through a Bidirectional-LSTM sequence encoder to capture the past and future context of the input image sequence, followed by an attention mechanism to enhance the discriminability of the input to the language model, i.e., the Mogrifier-LSTM. Additionally, we incorporate semantic attributes, e.g., nouns, to complement the semantic context of the generated story. Detailed experimental and human evaluations are performed to establish the competitive performance of the proposed technique. We achieve up to 1.4% improvement on the BLEU metric over recent state-of-the-art methods.
Index Terms—Storytelling, Computer Vision, Image and Video
Captioning, and Object Detection.
Fig. 1. Comparison between image captioning and visual storytelling. The black (middle) box shows the isolated individual description for each image, while the green (bottom) box shows a single story comprising five sentences. Additionally, the objects detected in each image are presented in the blue (top) box.
I. INTRODUCTION
Visual storytelling (VST) aims to create a meaningful story from a set of images. Although a trivial task for humans, it is very challenging for machines. To build a coherent and meaningful story, an algorithm needs to understand both modalities, i.e., the visual input and the associated semantic information [1]. For instance, it requires an understanding of the various activities, objects, and places in the visual input to create a relevant story. Additionally, the generated story should be semantically meaningful and syntactically correct at the same time.
Research problems similar to visual storytelling include image captioning [2]–[5] and video captioning [6]–[13]. In conventional image captioning techniques, an isolated description is generated for every input image. Consequently, these techniques fail to capture contextual information, as reflected by the resulting non-coherent sentences. In contrast, visual storytelling (VST) extracts and leverages the contextual relationship among the stream of input images to produce a coherent story. Further, VST algorithms are significantly more complex than video description techniques, as they require narrative context rather than literal captioning on the language side [14]. Coherency in the story description is a challenging problem because the images have high perceptual differences. This is specifically highlighted in Fig. 1. Existing VST techniques are mostly end-to-end trainable deep pipelines [15]–[17] that are able to generate grammatically correct stories; however, they lack the long-term consistency among story sentences that may be aided by additional information from the input images.
In this study, we propose a novel visual storytelling framework based on the encoder-decoder paradigm. The encoder captures the contextual relationship among the images through a Bidirectional-LSTM over the visual features, followed by a self-attention mechanism. The Bidirectional-LSTM passes the images in both forward and backward order to capture the context in both directions. The visual features fed to the encoder are the concatenation of features from a pretrained classifier, i.e., a 2D-CNN, and an off-the-shelf object detector [18]. The concatenated features are passed through the Bidirectional-LSTM to incorporate contextual information. Before feeding the features into the language model, a self-attention mechanism is applied to enhance their discriminability. These attentive features are then modulated with the input word embeddings via the Mogrifier-LSTM [19] before being processed by the LSTM. The Mogrifier leverages the rich features from the classifier and object detector to enable intra-sentence coherence, and feeds the rich context-aware information to the language model to improve the coherence and relevance of the generated sentences.
In summary, our contributions are as follows:
• We propose a novel ‘Semantic Attribute Enriched Storytelling’ (SAES) framework that leverages visual features and object information to generate a relevant story from an input image stream.
• Our framework captures both past and future context and employs an attention mechanism over the contextualised features to enhance their discriminability, enabling the language model to generate semantically rich dense captions.
• We incorporate the Mogrifier-LSTM to capture the relationship between visual and semantic information for relevant story generation from the enriched features.
• We demonstrate the efficacy of our model by performing extensive evaluation on the popular Visual Storytelling Dataset (VIST). We show that our technique outperforms state-of-the-art (SoTA) techniques by up to 1.4% on multiple automatic evaluation metrics (see §V for details) and by 4% in human evaluations.
II. RELATED WORK
This section provides an overview of recent trends in visual storytelling. Due to their objective similarity to VST, we also discuss image and video captioning before summarising the recently proposed VST literature.
A. Image and Video Captioning
Image captioning (IC) [20] is a limiting case of visual storytelling where the input sequence is reduced to a single image. Thus, progress in IC can be utilized in storytelling research. State-of-the-art methods in IC are based on deep learning pipelines [21]. While promising, these pipelines are known for their thirst for large datasets. Therefore, research in deep learning in general, and image captioning in particular, gained momentum after the introduction of the seminal ImageNet ILSVRC dataset [22]. Prominent directions in this area include subject and object modeling [23], reinforcement learning [24], and attention techniques [25]. The common evaluation metrics used for IC methods include BLEU [26], CIDEr [27], METEOR [28], ROUGE [29], and SPICE [30].
Video captioning may be considered an extension of image captioning that describes multiple frames (i.e., a video) in a single sentence. Similar to visual storytelling methods, most recent video captioning techniques adopt the encoder-decoder framework. The encoder consists of a 2D/3D-CNN that extracts visual features from the stream of input frames. These features are then translated into natural language sentences by a decoder, i.e., a language model based on either a recurrent neural network [31], [32] or a transformer [33]. Recent video captioning methods incorporate various techniques to enhance performance, including object and action modeling [2], [34], [35], Fourier transform [7], attention mechanisms [8], [10], and semantic attribute learning [11], [12].
B. Visual Storytelling (VST)
In the visual storytelling (VST) task, a system learns to produce grammatically correct sentences as a story by relying on an ordered stream of images [36]. The pioneering study of Park et al. [37] proposed a framework for multi-frame to multi-sentence modeling in VST, in which coherence in the textual model was used to resolve entity transition patterns commonly found between sentences. The model was tested on the NYC and Disneyland datasets [38], which are not currently available in the public domain.
Recent research efforts predominantly rely on the encoder-decoder framework [39], [40]. This deep learning design blends a Convolutional Neural Network (CNN) as a feature extractor with a recurrent network as a sequence encoder and a language model as a decoder. In between, an attention mechanism is utilised, which allows these models to achieve competitive results. The performance of the proposed frameworks was evaluated on the publicly available VIST dataset. Similarly, Malakan et al. [41] proposed a framework that utilizes the Mogrifier-LSTM as a language model to enhance the quality of the generated story. A few other studies improved other aspects of the framework, involving stacked recurrent neural networks (RNNs) [17], a hierarchical context-based network [42], concept detection [43], and an imagine-reason-write framework [44]. These enhancements significantly improved the performance of VST models on automatic evaluation metrics, especially METEOR.
In this article, we propose a novel framework that enriches the semantic features by incorporating an object detector. Extensive experimentation establishes the enhanced performance of our model on both automatic evaluation metrics and human evaluation.
III. METHODOLOGY
Our proposed model is based on the well-known encoder-decoder paradigm. The encoder consists of hierarchical layers of feature extractors and a Bidirectional-LSTM with attention. Recent literature has highlighted that deep visual models learn rich semantic features [45], [46] that can be extracted from their last layers [47]. This motivated us to aggregate our feature set from the second-last layer of a pretrained classifier and the output of an object detector. The concatenated features are subsequently encoded by the Bidirectional-LSTM layers with attention. Our decoder module consists of a Mogrifier LSTM with five modulation rounds and a language model that modulates the encoded features to generate a coherent story composed of five sentences. Each of the sub-modules is discussed in detail below.

Fig. 2. The proposed architecture, which is based on an encoder-decoder design. First, ResNet-152 extracts the image features, and the object labels are detected using the state-of-the-art Yolo-v5. The decoder part comprises a Mogrifier with five rounds and a language model to generate a natural, human-like coherent story.
A. Feature Extractor
The feature extraction module combines the semantic features from a visual classifier and an object detector, as shown in Figure 3. Among several off-the-shelf options, we selected the pretrained ResNet-152 available in the TorchVision package [48] and a public implementation of the state-of-the-art object detector Yolov5 [18]. For the visual classifier, the activations from the second-last layer ($c_i \in \mathbb{R}^{1024}$) were utilized as salient features. For the object detector, a dictionary composed of the top two hundred common words was prepared offline to encode detection labels as a multi-hot vector ($d_i \in \mathbb{R}^{128}$). Therefore, for a given set of input images $I$,

$$I = [I_1, I_2, \dots, I_N] \quad \text{s.t.} \quad I_i \in \mathbb{R}^{224 \times 224 \times 3}, \tag{1}$$

we define the features as the combination of the classifier and object detector features,

$$\phi = [(c_1 \oplus d_1), \dots, (c_N \oplus d_N)] \quad \text{s.t.} \quad c_i \in \mathbb{R}^{1024},\ d_i \in \mathbb{R}^{128}, \tag{2}$$

where $(c_i \oplus d_i) \in \mathbb{R}^{1152}$ indicates the concatenation of $c_i$ with $d_i$. These features are then fed to the Bidirectional-LSTM for temporal encoding.
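As a concrete illustration of this feature construction, the following PyTorch sketch concatenates a pooled ResNet-152 activation with a multi-hot vector built from detected object labels. The label dictionary is a hypothetical stand-in for the offline top-200 word dictionary, and since TorchVision's pooled ResNet-152 activation is 2048-dimensional, the exact layer yielding the 1024-dimensional $c_i$ of the paper is left as an assumption here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical stand-in for the offline dictionary of the 200 most common
# detection labels; only the first 128 indices form the multi-hot vector.
label_to_idx = {"person": 0, "car": 1, "dog": 2}  # ... truncated for illustration

class FeatureExtractor(nn.Module):
    """Concatenates classifier features c_i with a multi-hot object vector d_i."""

    def __init__(self, multi_hot_dim: int = 128):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)  # torchvision >= 0.13
        # Keep everything up to global average pooling; the pooled activation
        # (2048-d here) stands in for the salient classifier feature c_i.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.multi_hot_dim = multi_hot_dim

    def encode_labels(self, labels):
        """Multi-hot encoding d_i of the object labels returned by the detector."""
        d = torch.zeros(self.multi_hot_dim)
        for lbl in labels:
            idx = label_to_idx.get(lbl)
            if idx is not None and idx < self.multi_hot_dim:
                d[idx] = 1.0
        return d

    @torch.no_grad()
    def forward(self, image, labels):
        # image: (1, 3, 224, 224); labels: list of detected label strings
        c = self.backbone(image).flatten(1).squeeze(0)  # classifier feature c_i
        d = self.encode_labels(labels)                  # multi-hot vector d_i
        return torch.cat([c, d], dim=0)                 # concatenated phi_i
```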
B. Bidirectional Encoder with Attention
Given the extracted high-level visual features $\phi$, we utilize them to encode the context of all the images. A trivial approach could be the concatenation or some weighted combination of all the features. However, such approaches cannot adequately preserve the temporal relationship amongst the images. Therefore, we model the sequential information with a recurrent neural network (RNN). Specifically, a Bidirectional-LSTM network is employed to summarize the sequential information of the $N$ images in the forward as well as backward directions. At each time step $t$, our sequence encoder accepts the image feature vector $\phi_i$, where $i \in \{1, 2, \dots, N\}$. At the last time step $N$, the sequence encoder has encoded the whole stream of images and encapsulates the contextual information in the last hidden state, denoted as $h_{se} = [\overrightarrow{h}_{se}; \overleftarrow{h}_{se}]$.
Describing images in a context-free manner often yields unrelated sentences that significantly degrade the quality of a story. We mitigate this issue by utilizing the context alongside context-conditioned individual image features to define our encoded set of features, which are subsequently refined by a self-attention mechanism to preserve the most salient parts. Formally,

$$\omega_i = W_a^T \cdot \tanh(W_\phi\, [\phi_i, h_{se_i}] + b), \tag{3}$$

where $\phi_i$ is the feature vector of the $i$-th image and $h_{se_i}$ is the hidden state of the sequence encoder after the $i$-th image has been fed. The attention weights are then normalized as

$$\gamma_i = \frac{\exp(\omega_i)}{\sum_{k=1}^{N} \exp(\omega_k)}, \tag{4}$$

where $N$ is the length of the visual feature sequence. Finally, the attentive representation becomes

$$\zeta_i = \sum_{i=1}^{N} \gamma_i \cdot [\phi_i, h_{se_i}]. \tag{5}$$

The final representations serve as the decoder inputs, which attend to both image-specific (low-level) and stream-specific (high-level) information.
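A minimal PyTorch sketch of this sequence encoder and the attention of Eqs. (3)-(5) is given below; the hidden size and the exact parameterisation of $W_\phi$, $W_a$ and $b$ are assumptions rather than the settings used in the paper.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionEncoder(nn.Module):
    """Bidirectional-LSTM sequence encoder with additive attention (Eqs. 3-5)."""

    def __init__(self, feat_dim: int = 1152, hidden_dim: int = 512):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        ctx_dim = feat_dim + 2 * hidden_dim           # [phi_i, h_se_i]
        self.W_phi = nn.Linear(ctx_dim, ctx_dim)      # W_phi with bias b
        self.w_a = nn.Linear(ctx_dim, 1, bias=False)  # W_a^T

    def forward(self, phi):
        # phi: (B, N, feat_dim) concatenated classifier + object features
        h_seq, _ = self.bilstm(phi)                   # (B, N, 2*hidden_dim)
        ctx = torch.cat([phi, h_seq], dim=-1)         # context-conditioned features
        omega = self.w_a(torch.tanh(self.W_phi(ctx))) # Eq. (3): attention scores
        gamma = torch.softmax(omega, dim=1)           # Eq. (4): normalised weights
        zeta = (gamma * ctx).sum(dim=1)               # Eq. (5): attentive summary
        return zeta, ctx
```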
C. Image and Objects modulation
Recurrent networks often struggle to model sequential inputs where coherence and relevance are vital [41], such as in visual storytelling. These problems depend on the interaction of the model inputs and the context in which they occur. We remedy these issues by exploiting the modulation technique of the Mogrifier LSTM [19] and employ it as the decoder in our framework.
Fig. 3. Concatenation of the image features and the detected object labels in our proposed model. First, we extract labels from Yolo and create the story map as multi-hot vectors of size 128; these are then concatenated with the feature maps extracted by ResNet-152 as the CNN.
Before describing the Mogrifier LSTM, we first discuss how a standard LSTM [32] generates the current hidden state $h^{\langle t \rangle}$, given the previous hidden state $h_{prev}$, and updates its memory state $c^{\langle t \rangle}$. It uses input gates $\Gamma_i$, forget gates $\Gamma_f$, and output gates $\Gamma_o$ that are computed as follows:

$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[h_{prev}, x_t] + b_f), \tag{6}$$
$$\Gamma_i^{\langle t \rangle} = \sigma(W_i[h_{prev}, x_t] + b_i), \tag{7}$$
$$\tilde{c}^{\langle t \rangle} = \tanh(W_c[h_{prev}, x_t] + b_c), \tag{8}$$
$$c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} \odot c^{\langle t-1 \rangle} + \Gamma_i^{\langle t \rangle} \odot \tilde{c}^{\langle t \rangle}, \tag{9}$$
$$\Gamma_o^{\langle t \rangle} = \sigma(W_o[h_{prev}, x_t] + b_o), \tag{10}$$
$$h^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} \odot \tanh(c^{\langle t \rangle}), \tag{11}$$

where $W_*$ indicates the learnable transformation matrices in all cases, $x$ is the input word embedding vector at time step $t$ (we omit $t$ for readability), $b_*$ represents the biases, $\sigma$ is the logistic sigmoid function, and $\odot$ represents the Hadamard product of the vectors. In our decoder network, the LSTM hidden state $h$ is initialized by the attentive vector $\zeta_i$ from the encoder output.
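For reference, Eqs. (6)-(11) amount to the following single step; the dictionaries W and b holding per-gate weights and biases are illustrative names, not the paper's parameterisation.

```python
import torch

def lstm_step(x, h_prev, c_prev, W, b):
    """One standard LSTM step written out as in Eqs. (6)-(11)."""
    z = torch.cat([h_prev, x], dim=-1)            # [h_prev, x_t]
    gamma_f = torch.sigmoid(z @ W["f"] + b["f"])  # Eq. (6): forget gate
    gamma_i = torch.sigmoid(z @ W["i"] + b["i"])  # Eq. (7): input gate
    c_tilde = torch.tanh(z @ W["c"] + b["c"])     # Eq. (8): candidate memory
    c = gamma_f * c_prev + gamma_i * c_tilde      # Eq. (9): memory update
    gamma_o = torch.sigmoid(z @ W["o"] + b["o"])  # Eq. (10): output gate
    h = gamma_o * torch.tanh(c)                   # Eq. (11): new hidden state
    return h, c
```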
A close inspection of the above equations reveals that the input gate $\Gamma_i$ scales the rows of the weight matrix $W_c$ (here we deliberately ignore the non-linearity in $\tilde{c}$ for clarity). This operation is modified in the Mogrifier LSTM such that the columns of all its weight matrices $W_*$ are scaled by gated modulation. Before entering the LSTM, the two inputs $x$ and $h_{prev}$ modulate each other alternately. Formally, $x$ is gated conditioned on the output of the previous step $h_{prev}$; likewise, the gated input is used in a similar fashion to gate the previous time step's output. After five gating rounds, the highest-indexed updated $x$ and $h_{prev}$ are fed into the LSTM unit, as presented in Fig. 4.

Fig. 4. High-level architecture illustrating how the object and noun features are fed to the Mogrifier modulation to obtain a highly context-dependent input to the LSTM unit.

This can be expressed as $\text{Mogrify}(x, c_{prev}, h_{prev}) = \text{LSTM}(x^{\uparrow}, c_{prev}, h^{\uparrow}_{prev})$, where $x^{\uparrow}$ and $h^{\uparrow}_{prev}$ are the modulated inputs, i.e., the highest-indexed $x^i$ and $h^i_{prev}$ respectively. Formally,

$$x^i = 2\sigma(W^i_{xh}\, h^{i-1}_{prev}) \odot x^{i-2}, \quad \text{for odd } i \in [1, 2, \dots, r], \tag{12}$$
$$h^i_{prev} = 2\sigma(W^i_{hx}\, x^{i-1}) \odot h^{i-2}_{prev}, \quad \text{for even } i \in [1, 2, \dots, r], \tag{13}$$

where $\odot$ is the Hadamard product, $x^{-1} = x$, $h^{0}_{prev} = h_{prev} = \zeta_i$, and $r$ denotes the number of modulation rounds, treated as a hyperparameter. Setting $r = 0$ recovers the standard LSTM without any gated modulation at the input. Multiplication by the constant 2 ensures that the matrices $W^i_{xh}$ and $W^i_{hx}$ yield transformations close to identity.
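A sketch of this gating, with PyTorch's nn.LSTMCell standing in for the standard LSTM of Eqs. (6)-(11), is shown below; module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MogrifierLSTMCell(nn.Module):
    """Mogrifier gating of Eqs. (12)-(13) wrapped around a standard LSTM cell."""

    def __init__(self, input_dim: int, hidden_dim: int, rounds: int = 5):
        super().__init__()
        self.rounds = rounds                       # r = 5 in our framework
        self.lstm = nn.LSTMCell(input_dim, hidden_dim)
        # W_xh gates x from h (odd rounds); W_hx gates h from x (even rounds)
        self.W_xh = nn.ModuleList([nn.Linear(hidden_dim, input_dim, bias=False)
                                   for _ in range(rounds)])
        self.W_hx = nn.ModuleList([nn.Linear(input_dim, hidden_dim, bias=False)
                                   for _ in range(rounds)])

    def mogrify(self, x, h):
        for i in range(1, self.rounds + 1):
            if i % 2 == 1:
                x = 2 * torch.sigmoid(self.W_xh[i - 1](h)) * x  # Eq. (12)
            else:
                h = 2 * torch.sigmoid(self.W_hx[i - 1](x)) * h  # Eq. (13)
        return x, h

    def forward(self, x, state):
        h_prev, c_prev = state                     # h_prev initialised with zeta_i
        x_up, h_up = self.mogrify(x, h_prev)       # highest-indexed x and h_prev
        return self.lstm(x_up, (h_up, c_prev))     # Mogrify(x, c_prev, h_prev)
```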
D. Proposed framework Variants
We have explored several design choices for the semantic
enrichment of our proposed framework. These variants are:
• W/Encoder OD: This setting provided the best performance and is illustrated in Fig. 2. Its detailed mathematical formulation is presented in the previous subsections. All other variants are explained relative to it.
• W/Encoder Decoder OD: In this setting, we modified the ‘W/Encoder OD’ architecture by feeding the Mogrifier with additional features from the object detector. These additional features are concatenated alongside the attentive features.
• W/Encoder OD & Noun w/Decoder: This variant is similar to ‘W/Encoder OD’, except that the top-objects sorted list is replaced by a list of ground-truth noun attributes.
TABLE I
AN EXPERIMENT OVER FIVE SPLITS OF THE VISUAL STORYTELLING DATASET (VIST). THE PERPLEXITY ON THE TRAINING, VALIDATION, AND TEST SETS IS REPORTED TO SHOW THE STRENGTH OF EACH SPLIT.

| Models | 1st Split | 2nd Split | 3rd Split | 4th Split | 5th Split |
|---|---|---|---|---|---|
| Epoch | 21 | 21 | 21 | 21 | 21 |
| Train Perplexity | 12.19 | 11.90 | 12.05 | 12.12 | 11.50 |
| Val Perplexity | 16.77 | 16.80 | 16.79 | 16.72 | 16.83 |
| Test Perplexity | 16.54 | 16.22 | 15.92 | 17.03 | 18.05 |
TABLE II
HUMAN EVALUATION SURVEY RESULTS FOR 15 GROUND-TRUTH STORIES, 15 STORIES GENERATED BY OUR PROPOSED MODEL, AND 15 STORIES GENERATED BY THE AREL MODEL, WITH A TOTAL OF 250 STREAMS OF IMAGES. SCORES ARE RANKED 1-5 (WORST-BEST).

| Story Type | Relevance | Coherence | Informative Story |
|---|---|---|---|
| Ground Truth | 3.20 | 3.20 | 3.33 |
| AREL | 3.07 | 3.13 | 3.13 |
| Ours | 3.16 | 3.23 | 3.28 |
E. Sentence Generation
After five rounds of attentive feature modulation, the Mogrifier LSTM produces the highest-indexed feature vectors corresponding to each image. These vectors are sequentially fed to the LSTM language model to generate the complete visual story. Upon the <start> token, the model begins to receive the feature vectors from the Mogrifier and generates the sentence word by word until the LSTM emits the <end> token, which completes the sentence for the first image; the process is repeated for the whole story of five sentences, as shown in Figure 2.
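A greedy-decoding sketch of this loop is given below; `cell`, `embed`, `out_proj` and `vocab` are assumed components (e.g., a Mogrifier cell like the one sketched earlier, a word embedding table, a linear projection to the vocabulary, and a word-to-id mapping), not the paper's exact interfaces.

```python
import torch

@torch.no_grad()
def generate_sentence(cell, embed, out_proj, zeta_i, vocab, max_len=30):
    """Greedily decode one sentence: start at <start>, stop at <end>."""
    h = zeta_i                               # (1, hidden): init with attentive vector
    c = torch.zeros_like(zeta_i)             # (1, hidden): memory cell
    token = vocab["<start>"]
    words = []
    for _ in range(max_len):
        x = embed(torch.tensor([token]))     # (1, embed_dim): previous word
        h, c = cell(x, (h, c))               # Mogrifier-modulated LSTM step
        token = out_proj(h).argmax(dim=-1).item()
        if token == vocab["<end>"]:
            break                            # sentence for this image is complete
        words.append(token)
    return words                             # repeated per image for the 5-sentence story
```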
Each of our variants generates five enriched sentences as one story for the five images. Figure 5 demonstrates these generated stories alongside the quantitative metric scores. The variant that concatenates the object labels with the image features in the encoder obtains the highest scores on the automatic evaluation metrics.
IV. MODEL EVALUATION
We empirically and qualitatively evaluated the performance of our scheme on the publicly available Visual Storytelling dataset (VIST). This section provides details about the dataset, preprocessing, evaluation metrics, and a discussion of comparative results.
A. Dataset
To the best of our knowledge, the Visual Storytelling dataset (VIST) [49] is the only publicly available dataset that can be utilized for supervised learning in storytelling problems. It consists of 210,819 unique photographs that belong to 10,117 Flickr albums and is organized in groups of five images. Each group is accompanied by two types of stories. One, called ‘Description-in-Isolation’, includes individual descriptions of the images and can be useful for research in image captioning. The other, called ‘Story-in-Sequence’, is relevant to our research problem and consists of a coherent story of exactly five sentences. It is important to mention that the names of persons are replaced by [male] and [female], locations by [location], and organizations by [organization].
B. Cross Validation
For a fair evaluation of the proposed framework, we followed 5-fold cross validation on the VIST dataset. In this process, we utilized all the data in VIST and created five splits: all test and train samples were first grouped together and then re-split. Our model was iteratively trained on one split and tested on the others, and we measured the model's perplexity during training on the train, validation, and test samples of the five different splits. It is important to mention that the first split is the original split presented in [49]. The scores are reported in Table I, which shows that the fifth split performs best during training, achieving a perplexity of 11.50, while the third split performs best during testing, achieving a perplexity of 15.92. Overall, these experiments show that our proposed model is stable and performs consistently, as there are no significant differences across the splits.
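The perplexity reported in Table I is the exponential of the average per-word cross-entropy of the language model; a minimal helper is sketched below (the padding index and reduction are assumptions).

```python
import math
import torch.nn.functional as F

def perplexity(logits, targets, pad_idx=0):
    """Perplexity = exp(mean cross-entropy over non-padding target words).
    logits: (num_words, vocab_size); targets: (num_words,) word ids."""
    nll = F.cross_entropy(logits, targets, ignore_index=pad_idx, reduction="mean")
    return math.exp(nll.item())
```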
C. Automatic Evaluation Metrics
Automatic evaluation metrics provide a convenient way to quantify and compare the performance of different algorithms. Each metric has its own strengths, therefore a comparative analysis over several metrics is vital to ascertain the strength of an algorithm. Following the popular choices in the relevant literature, we include several metrics to evaluate our proposed model: BLEU, ROUGE, METEOR and CIDEr. BLEU stands for Bilingual Evaluation Understudy and measures n-gram overlap between the generated text and the ground-truth (reference) text; we include the four variants BLEU-1, 2, 3, and 4 [26]. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation [29] and performs the comparison in three different ways: n-grams, word sequences, and word pairs. METEOR stands for Metric for Evaluation of Translation with Explicit ORdering [28] and is better suited for sentence-level evaluation since it also matches synonyms of words against the reference text. Similarly, CIDEr stands for Consensus-based Image Description Evaluation [27] and is designed to compare machine-generated text with multiple human captions (i.e., reference texts).
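As an illustration of how the BLEU-n variants are computed, a short NLTK sketch follows; the example sentences are made up, and the scores reported in this paper are presumably produced with the standard VIST evaluation scripts rather than this helper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["we had a great time at the beach with the family".split()]
candidate = "the family had a great time at the beach".split()

smooth = SmoothingFunction().method1
for n in range(1, 5):
    # BLEU-n: uniform weights over 1..n-gram precisions, zero beyond n
    weights = tuple(1.0 / n for _ in range(n)) + tuple(0.0 for _ in range(4 - n))
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```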
V. FINDINGS AND RESULTS
In this section, we compare the performance of our proposed model with state-of-the-art models. In addition, human evaluation results are reported. Finally, an example of a generated story is discussed as a case study.
Fig. 5. An example stream of images from the VIST dataset, followed by the ground-truth story and the stories generated by our proposed Semantic Attribute Enriched Storytelling (SAES) model and its variants. In addition, four metrics are presented for each story: relevance, coherence and informativeness from human evaluation, and the BLEU score from the automatic evaluation metrics.
TABLE III
COMPARISON OF OUR PROPOSED MODEL WITH RECENT METHODOLOGIES ON THE VISUAL STORYTELLING DATASET (VIST). WE USED SEVEN AUTOMATIC EVALUATION METRICS AS QUANTITATIVE RESULTS. "-" MEANS THE SCORE IS NOT RECORDED BY THE AUTHORS OF THE RESPECTIVE METHOD. THE BOLD SCORES REPRESENT THE HIGHEST RESULT.

| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | ROUGE-L | METEOR |
|---|---|---|---|---|---|---|---|
| AREL 2018 (baseline) [16] | 0.536 | 0.315 | 0.173 | 0.099 | 0.038 | 0.286 | 0.352 |
| GLACNet 2018 (baseline) [39] | 0.568 | 0.321 | 0.171 | 0.091 | 0.041 | 0.264 | 0.306 |
| HCBNet 2019 [36] | 0.593 | 0.348 | 0.191 | 0.105 | 0.051 | 0.274 | 0.34 |
| HCBNet (without prev. sent. attention) [36] | 0.598 | 0.338 | 0.180 | 0.097 | 0.057 | 0.271 | 0.332 |
| HCBNet (without description attention) [36] | 0.584 | 0.345 | 0.194 | 0.108 | 0.043 | 0.271 | 0.337 |
| HCBNet (VGG19) [36] | 0.591 | 0.34 | 0.186 | 0.104 | 0.051 | 0.269 | 0.334 |
| VSCMR 2019 [50] | 0.638 | - | - | 0.143 | **0.090** | 0.302 | **0.355** |
| MLE 2020 [51] | - | - | - | 0.143 | 0.072 | 0.300 | 0.348 |
| BLEU-RL 2020 [51] | - | - | - | 0.144 | 0.067 | 0.301 | 0.352 |
| ReCo-RL 2020 [51] | - | - | - | 0.124 | 0.086 | 0.299 | 0.339 |
| VS with MPJA 2021 [52] | - | - | - | 0.082 | 0.042 | **0.303** | 0.344 |
| CAMT 2021 [41] | - | - | - | **0.184** | 0.042 | **0.303** | 0.335 |
| Rand+RNN 2021 [43] | - | - | - | 0.061 | 0.022 | 0.272 | 0.311 |
| Proposed model w/Encoder Decoder OD | 0.64 | 0.363 | 0.196 | 0.106 | 0.051 | 0.294 | 0.330 |
| Proposed model w/Encoder OD & Noun w/Decoder | 0.63 | 0.357 | 0.195 | 0.109 | 0.048 | 0.299 | 0.331 |
| Proposed model w/Encoder OD | **0.65** | **0.372** | **0.204** | 0.12 | 0.054 | **0.303** | 0.335 |
A. Proposed Model Comparison
In our comparison, we include all the automatic evaluation metrics discussed in Section IV-C, namely BLEU-1, 2, 3, and 4, CIDEr, ROUGE-L, and METEOR. Table III presents the comparison between our proposed Semantic Attribute Enriched Storytelling (SAES) model and recently published popular methods, which include: AREL¹, a method for implicit reward learning with imitation learning [16];
GLACNet², an approach that learns an attention cascading network [39]; HCBNet, an approach that uses image descriptions as a hierarchy for the sequence of images [36]; VSCMR, a method of cross-modal rule mining [50]; ReCo-RL³, an approach that designs composite rewards [51]; VS with MPJA, a method of steganographic visual storytelling with mutual-perceived joint attention [52]; CAMT, a technique based on highest-indexed gating during modulation [41]; and Rand+RNN, a method of concept integration [43]. In summary, our proposed model (SAES) mostly outperforms all the recent visual storytelling models mentioned above, except on the metrics of BLEU-4, CIDEr, and METEOR.

¹ https://github.com/eric-xw/AREL.git
² https://github.com/tkim-snu/GLACNet
³ https://github.com/JunjieHu/ReCo-RL
B. Human Evaluation
Besides the importance of automatic evaluation metrics, human judgement has always been considered the gold standard for evaluating the performance of computer algorithms. To further assess the performance of our technique, we asked human participants to evaluate our proposed model, the ground-truth stories, and the AREL model. First, 250 streams of images were randomly selected from the VIST dataset. Then, participants were asked to rank each story from 1 to 5 (worst to best). Our human evaluation survey focuses on three aspects: relevance, coherence, and informativeness. Relevance indicates how well the story's sentences represent the images, coherence indicates the connection between the sentences themselves as a story, and informativeness indicates the ability to predict more relevant vocabulary in the story. The results presented in Table II show that our generated stories are the most coherent, and are more relevant and informative than those of AREL, approaching the ground-truth stories written by humans.
C. Case Study
Although the ground-truth story obtains a high score on the automatic evaluation metrics against its five references, human evaluation provides a better understanding of story quality. Fig. 5 presents an example sequence of images with the ground-truth story, the stories generated by different variants of our proposed SAES method, and four different metrics for each story: relevance, coherence, informativeness, and BLEU score.
It can be observed from Fig. 5 that the ground-truth story obtains the highest score on the BLEU metric. However, the participants of the human evaluation survey found our generated story to be better than the ground-truth story: they found it more relevant, coherent and informative. The sentences highlighted in green in Fig. 5 show that our proposed model is able to provide more information in its sentences to build a coherent and relevant story from the provided stream of images. In contrast, the sentence highlighted in red, "I had a great time", did not contribute to the generated story and affected the automatic BLEU evaluation negatively.
VI. CONCLUSION AND FUTURE WORK
This study introduced a novel end-to-end encoder-decoder framework enhanced by object detection and noun attributes. The encoder concatenates the image features extracted by ResNet-152 with the object labels detected by Yolo-v5 (as multi-hot vectors). These concatenated features are then fed to a two-layer Bidirectional-LSTM. The decoder comprises a Mogrifier with five rounds for better modulation and a language model. In between, an attention mechanism layer is added, which enables our model to generate a more coherent, relevant and informative story. Our proposed model outperforms existing methods on popular automatic evaluation metrics and in human evaluation. Our future work aims to create a new dataset for storytelling problems with more diverse ground truth and scenarios. In addition, an integrated concept model will be investigated to improve the quality of the generated stories.
VII. ACKNOWLEDGMENT
This research was fully funded by the Australian Government through the Australian Research Council (DP190102443).
REFERENCES
[1] Y. Liu, J. Fu, T. Mei, and C. W. Chen, “Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention
recurrent neural networks,” in Thirty-First AAAI Conference on Artificial
Intelligence, 2017.
[2] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C.
Niebles, “Spatio-temporal graph for video captioning with knowledge
distillation,” in IEEE CVPR, 2020.
[3] Y. Feng, L. Ma, W. Liu, and J. Luo, “Unsupervised image captioning,”
in Proceedings of the IEEE CVPR, June 2019.
[4] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs
for image captioning,” in Proceedings of IEEE CVPR, 2019.
[5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional
networks for visual recognition and description,” in Proceedings of the
IEEE CVPR, 2015, pp. 2625–2634.
[6] J. Zhang and Y. Peng, “Video captioning with object-aware spatiotemporal correlation and aggregation,” IEEE Trans. on Image Processing, vol. 29, pp. 6209–6222, 2020.
[7] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani, and A. Mian, “Spatio-temporal
dynamics and semantic attribute enriched visual encoding for video
captioning,” in Proceedings of IEEE CVPR, 2019, pp. 12 487–12 496.
[8] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai, “Stat:
spatial-temporal attention mechanism for video captioning,” IEEE Trans.
on Multimedia, vol. 22, pp. 229–241, 2019.
[9] S. Liu, Z. Ren, and J. Yuan, “Sibnet: Sibling convolutional encoder
for video captioning,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, pp. 1–1, 2020.
[10] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and
A. Courville, “Describing videos by exploiting temporal structure,” in
Proceedings of the IEEE ICCV, 2015, pp. 4507–4515.
[11] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and
L. Deng, “Semantic compositional networks for visual captioning,” in
Proceedings of the IEEE CVPR, 2017, pp. 5630–5639.
[12] Y. Pan, T. Yao, H. Li, and T. Mei, “Video captioning with transferred
semantic attributes,” in IEEE CVPR, 2017.
[13] A. Nayyer, N. Akhtar, W. Liu, and A. Mian, “Empirical autopsy of deep
video captioning frameworks,” preprint arXiv:1911.09345, 2019.
[14] J. Wang, J. Fu, J. Tang, Z. Li, and T. Mei, “Show, reward and tell:
Automatic generation of narrative paragraph from photo stream by
adversarial training,” in Thirty-Second AAAI Conference on Artificial
Intelligence, 2018.
[15] P. Yang, F. Luo, P. Chen, L. Li, Z. Yin, X. He, and X. Sun, “Knowledgeable storyteller: A commonsense-driven generative model for visual
storytelling.” in IJCAI, 2019, pp. 5356–5362.
[16] X. Wang, W. Chen, Y.-F. Wang, and W. Y. Wang, “No metrics are perfect: Adversarial reward learning for visual storytelling,” arXiv preprint
arXiv:1804.09160, 2018.
[17] Y. Jung, D. Kim, S. Woo, K. Kim, S. Kim, and I. S. Kweon, “Hide-and-tell: learning to bridge photo streams for visual storytelling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 213–11 220.
[18] D. Thuan, “Evolution of yolo algorithm and yolov5: the state-of-the-art
object detection algorithm,” 2021.
[19] G. Melis, T. Kočiskỳ, and P. Blunsom, “Mogrifier lstm,” in Proceedings
of the ICLR, 2020.
[20] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, “A comprehensive survey of deep learning for image captioning,” ACM Computing
Surveys (CsUR), vol. 51, no. 6, pp. 1–36, 2019.
[21] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, “Attention on attention for
image captioning,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 4634–4643.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[23] C. Sur, “aitpr: attribute interaction-tensor product representation for
image caption,” Neural Processing Letters, vol. 53, no. 2, pp. 1229–
1251, 2021.
[24] X. Shen, B. Liu, Y. Zhou, J. Zhao, and M. Liu, “Remote sensing image
captioning via variational autoencoder and reinforcement learning,”
Knowledge-Based Systems, p. 105920, 2020.
[25] J. Yuan, L. Zhang, S. Guo, Y. Xiao, and Z. Li, “Image captioning with
a joint attention mechanism by visual concept samples,” ACM Transactions on Multimedia Computing, Communications, and Applications
(TOMM), vol. 16, no. 3, pp. 1–22, 2020.
[26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for
automatic evaluation of machine translation,” in Proceedings of the 40th
annual meeting of the Association for Computational Linguistics, 2002,
pp. 311–318.
[27] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[28] S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings
of the acl workshop on intrinsic and extrinsic evaluation measures for
machine translation and/or summarization, 2005, pp. 65–72.
[29] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,”
in Text summarization branches out, 2004, pp. 74–81.
[30] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic
propositional image caption evaluation,” in European Conference on
Computer Vision. Springer, 2016, pp. 382–398.
[31] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the
properties of neural machine translation: Encoder-decoder approaches,”
arXiv preprint arXiv:1409.1259, 2014.
[32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in Neural IPS, 2017, pp. 5998–6008.
[34] Q. Zheng, C. Wang, and D. Tao, “Syntax-aware action targeting for
video captioning,” in Proceedings of the IEEE/CVF CVPR, June 2020.
[35] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z.-J. Zha,
“Object relational graph with teacher-recommended learning for video
captioning,” in Proceedings of the IEEE/CVF CVPR, June 2020.
[36] M. S. Al Nahian, T. Tasrin, S. Gandhi, R. Gaines, and B. Harrison, “A
hierarchical approach for visual storytelling using image description,” in
International Conference on Interactive Digital Storytelling. Springer,
2019, pp. 304–317.
[37] C. C. Park and G. Kim, “Expressing an image stream with a sequence
of natural sentences,” in Advances in neural information processing
systems, 2015, pp. 73–81.
[38] G. Kim, S. Moon, and L. Sigal, “Ranking and retrieval of image
sequences from multiple paragraph queries,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2015, pp.
1993–2001.
[39] T. Kim, M.-O. Heo, S. Son, K.-W. Park, and B.-T. Zhang, “Glac
net: Glocal attention cascading networks for multi-image cued story
generation,” arXiv preprint arXiv:1805.10973, 2018.
[40] Q. Huang, Z. Gan, A. Celikyilmaz, D. Wu, J. Wang, and X. He,
“Hierarchically structured reinforcement learning for topically coherent
visual story generation,” in Proceedings of the AAAI Conference on
Artificial Intelligence, vol. 33, 2019, pp. 8465–8472.
[41] Z. M. Malakan, N. Aafaq, G. M. Hassan, and A. Mian, “Contextualise,
attend, modulate and tell: Visual storytelling,” 2021.
[42] B. Harrison, “A hierarchical approach for visual storytelling using image
description,” in Interactive Storytelling: 12th International Conference
on Interactive Digital Storytelling, ICIDS 2019, Little Cottonwood
Canyon, UT, USA, November 19–22, 2019, Proceedings, vol. 11869.
Springer Nature, 2019, p. 304.
[43] H. Chen, Y. Huang, H. Takamura, and H. Nakayama, “Commonsense
knowledge aware concept selection for diverse and informative visual
storytelling,” arXiv preprint arXiv:2102.02963, 2021.
[44] C. Xu, M. Yang, C. Li, Y. Shen, X. Ao, and R. Xu, “Imagine, reason and
write: Visual storytelling with graph knowledge and relational reasoning,” in Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 35, no. 4, 2021, pp. 3022–3029.
[45] N. Akhtar, M. Jalwana, M. Bennamoun, and A. S. Mian, “Attack to fool
and explain deep networks,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2021.
[46] Z. M. Malakan and H. A. Albaqami, “Classify, detect and tell: Real-time American sign language,” in 2021 National Computing Colleges Conference (NCCC). IEEE, 2021, pp. 1–6.
[47] M. A. Jalwana, N. Akhtar, M. Bennamoun, and A. Mian, “Cameras:
Enhanced resolution and sanity preserving class activation mapping
for image saliency,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2021, pp. 16 327–16 336.
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[49] T.-H. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra et al., “Visual storytelling,”
in Proceedings of the 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language
Technologies, 2016, pp. 1233–1239.
[50] J. Li, H. Shi, S. Tang, F. Wu, and Y. Zhuang, “Informative visual
storytelling with cross-modal rules,” in Proceedings of the 27th ACM
International Conference on Multimedia, 2019, pp. 2314–2322.
[51] J. Hu, Y. Cheng, Z. Gan, J. Liu, J. Gao, and G. Neubig, “What makes
a good story? designing composite rewards for visual storytelling.” in
AAAI, 2020, pp. 7969–7976.
[52] Y. Guo, H. Wu, and X. Zhang, “Steganographic visual story with
mutual-perceived joint attention,” EURASIP Journal on Image and Video
Processing, vol. 2021, no. 1, pp. 1–14, 2021.