Semantic Attribute Enriched Storytelling from a Sequence of Images

Zainy M. Malakan (a, b), Ghulam Mubashar Hassan (a), Mohammad A. A. K. Jalwana (a), Nayyer Aafaq (a) and Ajmal Mian (a)
(a) Department of Computer Science and Software Engineering, The University of Western Australia, Perth, Australia
(b) Department of Information Science, College of Computer and Information Systems, Umm Al-Qura University, Makkah, Kingdom of Saudi Arabia
{zainy.aljawy, nayyer.aafaq}, {ghulam.hassan, mohammad.jalwana, ajmal.mian}

2021 Digital Image Computing: Techniques and Applications (DICTA) | 978-1-6654-1709-9/21/$31.00 ©2021 IEEE | DOI: 10.1109/DICTA52665.2021.9647213

Abstract: Visual storytelling (VST) is the task of generating story-based sentences from an ordered sequence of images. Contemporary techniques suffer from several limitations, such as inadequate encapsulation of visual variance and weak context capturing across the input sequence. Consequently, the story generated by such techniques often lacks coherence, context and semantic information. In this research, we devise a ‘Semantic Attribute Enriched Storytelling’ (SAES) framework to mitigate these issues. To that end, we first extract the visual features of the input image sequence and the noun entities present in the visual input by employing an off-the-shelf object detector. The two features are concatenated to encapsulate the visual variance of the input sequence. The features are then passed through a Bidirectional-LSTM sequence encoder to capture the past and future context of the input image sequence, followed by an attention mechanism to enhance the discriminability of the input to the language model, i.e., a Mogrifier-LSTM. Additionally, we incorporate semantic attributes, e.g., nouns, to complement the semantic context in the generated story. Detailed experimental and human evaluations are performed to establish the competitive performance of the proposed technique. We achieve up to 1.4% improvement on the BLEU metric over recent state-of-the-art methods.
Index Terms: Storytelling, Computer Vision, Image and Video Captioning, and Object Detection.

Fig. 1. Comparison between image captioning and visual storytelling. The black (middle) box shows the isolated individual description for each image, while the green (bottom) box points to a single story having five sentences. Additionally, the objects detected in each image are presented in the blue (top) box.

I. INTRODUCTION

Visual storytelling (VST) aims to create a meaningful story from a set of images. Although a trivial task for humans, it is very challenging for machines. To build a coherent and meaningful story, an algorithm needs to understand both modalities, namely the visual input and the associated semantic information [1]. For instance, it requires an understanding of the various activities, objects, and places in the visual input to create a relevant story. Additionally, the generated story should be semantically meaningful and syntactically correct at the same time.

Research problems similar to visual storytelling include image captioning [2]–[5] and video captioning [6]–[13]. In conventional image captioning techniques, an isolated description is generated for every input image. Consequently, these techniques fail to capture contextual information, as reflected by their non-coherent sentences. In contrast, visual storytelling extracts and leverages the contextual relationship among the stream of input images to produce a coherent story. Further, these algorithms are significantly more complex than video description techniques, as they require narrative context rather than literal captioning on the language side [14]. Coherence in story description is a challenging problem because the images have high perceptual differences. This is specifically highlighted in Fig. 1.
Existing VST techniques are mostly end-to-end trainable deep pipelines [15]–[17] that are able to generate grammatically correct stories; however, they lack long-term consistency among story sentences, which may be aided by additional information from the input images. In this study, we propose a novel visual storytelling framework based on the Encoder-Decoder paradigm. The encoder captures the contextual relationship among images with a Bidirectional-LSTM over the visual features, followed by a self-attention mechanism. The Bidirectional-LSTM passes over the images in both forward and backward order to capture the context in both directions. The visual features used by the encoder are concatenated features from a pretrained classifier, i.e., a 2D-CNN, and an off-the-shelf object detector [18]. The concatenated features are then passed through the Bidirectional-LSTM to incorporate the contextual information. Before the features are fed into the language model, a self-attention mechanism is applied to enhance their discriminability. These attentive features are then modulated with the input word embeddings via a Mogrifier-LSTM [19] before being processed by the LSTM. The Mogrifier leverages the rich features from the classifier and object detector to enable intra-sentence coherence, and feeds the rich context-aware information to the language model to improve the coherence and relevance of the generated sentences.

In summary, our contributions are as follows:
• We propose a novel ‘Semantic Attribute Enriched Storytelling (SAES)’ framework that leverages visual features and object information to generate a relevant story from an input image stream.
• Our framework captures both past and future context and employs an attention mechanism over the contextualised features to enhance their discriminability, so that the language model generates semantically rich dense captions.
• We incorporate a Mogrifier-LSTM to capture the relationship between visual and semantic information for relevant story generation from the enriched features.
• We demonstrate the efficacy of our model by performing extensive evaluation on the popular Visual Storytelling Dataset (VIST). We show that our technique improves on state-of-the-art (SoTA) techniques by up to 1.4% on automatic evaluation metrics (see §V for details) and by 4% in human evaluation.

II. RELATED WORK

This section provides an overview of recent trends in visual storytelling. Because image and video captioning pursue similar objectives, we discuss them first and then summarise the recent VST literature.

A. Image and Video Captioning

Image captioning (IC) [20] is a limiting case of visual storytelling in which the input sequence is reduced to a single image. Thus, progress in IC can be utilized in storytelling research. State-of-the-art methods in IC are based on deep learning pipelines [21]. While promising, these pipelines are known for their appetite for large datasets. Research in deep learning in general, and image captioning in particular, therefore gained momentum after the introduction of the seminal ImageNet ILSVRC dataset [22]. The prominent algorithms in this direction include subject and object modeling [23], reinforcement learning [24], and attention techniques [25]. The common evaluation metrics used for IC methods include BLEU [26], CIDEr [27], METEOR [28], ROUGE [29], and SPICE [30]. Video captioning may be considered an extension of image captioning that describes multiple frames (i.e., a video) in a single sentence.
Similar to visual storytelling methods, most recent techniques in video captioning adopt an encoder-decoder framework. The encoder consists of a 2D/3D-CNN that extracts visual features from the stream of input images. These features are then translated into natural language sentences by a decoder, i.e., a language model based on either a recurrent neural network [31], [32] or a transformer [33]. Most recent video captioning methods incorporate various techniques to enhance the performance of the framework, including object and action modeling [2], [34], [35], the Fourier transform [7], attention mechanisms [8], [10], and semantic attribute learning [11], [12].

B. Visual Storytelling (VST)

In the visual storytelling (VST) task, the system learns to produce grammatically correct sentences as a story by relying on an ordered stream of images [36]. The pioneering study of Park et al. [37] proposed a novel framework of multi-frame to multi-sentence modeling in VST, in which coherence in the textual model was used to resolve entity transition patterns commonly found between sentences. The model was tested on the NYC and Disneyland datasets [38], which are not currently available in the public domain. Recent research efforts predominantly rely on the encoder-decoder framework [39], [40]. This framework is a deep learning design that blends a Convolutional Neural Network (CNN) as a feature extractor with a recurrent network as a sequence encoder and a language model as a decoder. In between, these methods utilise an attention mechanism, which allows them to achieve competitive results. The performance of the proposed frameworks was evaluated on the publicly available VIST dataset. Similarly, Malakan et al. [41] proposed a novel framework that utilizes a Mogrifier-LSTM as the language model to enhance the quality of the generated story.
A few other studies improved other aspects of the framework, involving stacked recurrent neural networks (RNNs) [17], a hierarchical context-based network [42], concept detection [43], and an imagine-reason-write framework [44]. These enhancements significantly improved the performance of VST models on automatic evaluation metrics, especially METEOR. In this article, we propose a novel framework that enriches the semantic features by incorporating an object detector. Extensive experimentation establishes the enhanced performance of our model when evaluated with automatic evaluation metrics and human evaluation.

III. METHODOLOGY

Our proposed model is based on the well-known Encoder-Decoder paradigm. The encoder consists of hierarchical layers of feature extractors and a Bidirectional-LSTM with attention. Recent literature has highlighted that deep visual models learn rich semantic features [45], [46] that can be extracted from their last layers [47]. This motivated us to aggregate our feature set from the second-last layer of a pretrained classifier and the output of an object detector. The concatenated features are subsequently encoded by the Bidirectional-LSTM layers with attention. Our decoder module consists of a Mogrifier LSTM with five rounds and a language model that modulates the encoded features to generate a coherent story composed of a sequence of five coherent sentences. Each of the sub-modules is discussed in detail below.

Fig. 2. The proposed architecture, which is based on an encoder-decoder algorithm. First, ResNet-152 extracts the image features, and the object labels are detected using the state-of-the-art Yolo-v5. Then, the decoder part comprises a Mogrifier with five rounds feeding a language model to generate a natural, human-like coherent story.

A. Feature Extractor

The feature extraction module combines the semantic features from a visual classifier and an object detector, as shown in Figure 3. Among several off-the-shelf options, we selected the pretrained ResNet-152 available in the TorchVision package [48] and a public implementation of the state-of-the-art object detector Yolov5 [18]. For the visual classifier, activations from the second-last layer ($c_i \in \mathbb{R}^{1024}$) were utilized as salient features. For the object detector, a dictionary composed of the top two hundred common words was prepared offline to encode detection labels as a multi-hot vector ($d_i \in \mathbb{R}^{128}$). Therefore, for a given set of images as input ($I$),

$$I = [I_1, I_2, \ldots, I_N] \quad \text{s.t.} \quad I_i \in \mathbb{R}^{224 \times 224 \times 3}, \tag{1}$$

we define the features as the combination of features from the classifier and the object detector,

$$\phi = [(c_1 \oplus d_1), \ldots, (c_N \oplus d_N)] \quad \text{s.t.} \quad c_i \in \mathbb{R}^{1024},\ d_i \in \mathbb{R}^{128}, \tag{2}$$

where $(c_i \oplus d_i) \in \mathbb{R}^{1152}$ indicates the concatenation of $c_i$ with $d_i$. These features are then fed to the Bidirectional-LSTM for temporal encoding.

B. Bidirectional Encoder with Attention

Given the extracted high-level visual features ($\phi$), we utilize them to encode the context of all the images. A trivial approach would be the concatenation or some weighted combination of all the features. However, such approaches cannot adequately preserve the temporal relationship amongst images. Therefore, we model the sequential information with a recurrent neural network (RNN). Specifically, a Bidirectional-LSTM network is employed to summarize the sequential information of the $N$ images in the forward as well as the backward direction. At each time step, our sequence encoder accepts an image feature vector $\phi_i$, where $i \in \{1, 2, \ldots, N\}$. At the last time step $N$, the sequence encoder has encoded the whole stream of images and encapsulates the contextual information in the last hidden state, denoted as $h_{se} = [\overrightarrow{h_{se}}; \overleftarrow{h_{se}}]$. Describing images in a context-free manner often yields unrelated sentences that significantly degrade the quality of a story.
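The feature construction of Eq. (2) can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the dimensions 1024 and 128 are taken from the text, while the four-word dictionary, the random stand-ins for classifier activations, and the helper names `multi_hot` and `build_features` are hypothetical.

```python
import random

FEAT_DIM = 1024   # size of the classifier feature c_i, as stated in the text
VOCAB_DIM = 128   # size of the multi-hot object vector d_i, as stated in the text


def multi_hot(labels, vocab, dim=VOCAB_DIM):
    """Encode detected object labels as a multi-hot vector d_i."""
    d = [0.0] * dim
    for label in labels:
        idx = vocab.get(label)  # labels outside the dictionary are ignored
        if idx is not None:
            d[idx] = 1.0
    return d


def build_features(cnn_feats, detections, vocab):
    """Return phi: one 1152-dimensional row per image, c_i followed by d_i."""
    return [list(c) + multi_hot(labels, vocab)
            for c, labels in zip(cnn_feats, detections)]


# Hypothetical toy inputs: a tiny dictionary and random stand-ins for CNN features.
random.seed(0)
vocab = {"person": 0, "dog": 1, "car": 2, "cake": 3}
cnn_feats = [[random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(5)]
detections = [["person", "dog"], ["person"], [], ["cake", "person"], ["car", "unicorn"]]

phi = build_features(cnn_feats, detections, vocab)
print(len(phi), len(phi[0]))  # number of images, feature size per image
```

An image with no detections simply contributes an all-zero object vector, so every row of `phi` keeps the same length regardless of how many objects were found.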
We mitigate this issue by utilizing the context alongside context-conditioned individual image features to define our encoded set of features, which are subsequently refined by a self-attention mechanism to preserve the most salient parts. Formally,

$$\omega_i = W_a^{\top} \cdot \tanh(W_{\phi} [\phi_i, h_{se_i}] + b), \tag{3}$$

where $\phi_i$ is the feature vector of the $i$th image and $h_{se_i}$ is the hidden state of the sequence encoder after the $i$th image has been fed. The attention weights are

$$\gamma_i = \frac{\exp(\omega_i)}{\sum_{k=1}^{N} \exp(\omega_k)}, \tag{4}$$

where $N$ is the length of the feature sequence, i.e., the visual stream. Finally, the attentive feature representation becomes

$$\zeta_i = \sum_{i=1}^{N} \gamma_i \cdot [\phi_i, h_{se_i}]. \tag{5}$$

The final representations serve as the decoder inputs, which attend to both image-specific (low-level) and stream-specific (high-level) information.

Fig. 3. In the decoder part of our proposed model, image features and the detected object labels are concatenated. First, we extract labels from Yolo and then create the story map as multi-hot vectors with a size of 128. During feature extraction using ResNet-152 as the CNN, we concatenate the feature map.

C. Image and Objects Modulation

Recurrent networks often struggle to model sequential inputs where coherence and relevance are vital [41], such as in visual storytelling. These problems depend on the interaction of the model inputs and the context in which they occur. We remedy these issues by exploiting the modulating technique of the Mogrifier LSTM [19] and employ it as the decoder in our framework. Before describing the Mogrifier LSTM, we first discuss the way a standard LSTM [32] generates the current hidden state, $h^{\langle t \rangle}$, given the previous hidden state $h_{prev}$, and updates its memory state $c^{\langle t \rangle}$.
It uses input gates $\Gamma_i$, forget gates $\Gamma_f$, and output gates $\Gamma_o$ that are computed as follows:

$$\Gamma_f^{\langle t \rangle} = \sigma(W_f [h_{prev}, x_t] + b_f), \tag{6}$$
$$\Gamma_i^{\langle t \rangle} = \sigma(W_i [h_{prev}, x_t] + b_i), \tag{7}$$
$$\tilde{c}^{\langle t \rangle} = \tanh(W_c [h_{prev}, x_t] + b_c), \tag{8}$$
$$c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} \odot c^{\langle t-1 \rangle} + \Gamma_i^{\langle t \rangle} \odot \tilde{c}^{\langle t \rangle}, \tag{9}$$
$$\Gamma_o^{\langle t \rangle} = \sigma(W_o [h_{prev}, x_t] + b_o), \tag{10}$$
$$h^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} \odot \tanh(c^{\langle t \rangle}), \tag{11}$$

where $W_*$ indicates the learnable transformation matrix in all cases, $x$ is the input word embedding vector at time step $t$ (we omit $t$ for readability), $b_*$ represents the biases, $\sigma$ is the logistic sigmoid function, and $\odot$ represents the Hadamard product of the vectors. In our decoder network, the LSTM hidden state $h$ is initialized by the attentive vector $\zeta_i$ from the encoder output.

A close inspection of the above equations reveals that the input gate $\Gamma_i$ scales the rows of the weight matrix $W_c$ (here we have deliberately ignored the non-linearity in $\tilde{c}$ for clarity). This operation is modified in the Mogrifier LSTM such that the columns of all its weight matrices $W_*$ are scaled by gated modulation. Before being input to the LSTM, the two inputs $x$ and $h_{prev}$ modulate each other alternately. Formally, $x$ is gated conditioned on the output of the previous step, $h_{prev}$. Likewise, the gated input is utilised in a similar fashion to gate the previous time step's output. After five gating rounds, the highest-indexed updated $x$ and $h_{prev}$ are fed into the LSTM unit, as presented in Fig. 4. This can be expressed as

$$\text{Mogrify}(x, c_{prev}, h_{prev}) = \text{LSTM}(x^{\uparrow}, c_{prev}, h_{prev}^{\uparrow}),$$

Fig. 4. High-level architecture illustrating how we feed the object and noun features to the Mogrifier modulation to obtain a highly context-dependent input to the LSTM unit.

where $x^{\uparrow}$ and $h_{prev}^{\uparrow}$ are the modulated inputs, being the highest-indexed $x^i$ and $h_{prev}^i$ respectively.
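The alternating gating described above, followed by a standard LSTM step, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions rather than the authors' code: the toy dimensions, the random weights, and the helper names (`mogrify`, `lstm_step`, `mogrifier_lstm_step`) are hypothetical, and the weight lists follow the convention of the original Mogrifier LSTM paper, in which odd rounds update the input and even rounds update the previous hidden state.

```python
import math
import random


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def mogrify(x, h_prev, W_xh, W_hx):
    """Alternately gate x and h_prev; the number of rounds is len(W_xh) + len(W_hx).

    W_xh[k] has shape (len(x), len(h_prev)) and is used in odd rounds.
    W_hx[k] has shape (len(h_prev), len(x)) and is used in even rounds.
    """
    rounds = len(W_xh) + len(W_hx)
    for i in range(1, rounds + 1):
        if i % 2 == 1:  # odd round: update x, gated by the latest h_prev
            gate = sigmoid(matvec(W_xh[i // 2], h_prev))
            x = [2.0 * g * a for g, a in zip(gate, x)]
        else:           # even round: update h_prev, gated by the latest x
            gate = sigmoid(matvec(W_hx[i // 2 - 1], x))
            h_prev = [2.0 * g * a for g, a in zip(gate, h_prev)]
    return x, h_prev


def lstm_step(x, h_prev, c_prev, params):
    """One standard LSTM step; params maps a gate name to its (W, b) pair."""
    z = h_prev + x  # list concatenation, i.e. [h_prev, x]

    def affine(name):
        W, b = params[name]
        return [s + bias for s, bias in zip(matvec(W, z), b)]

    f = sigmoid(affine("f"))                        # forget gate
    i = sigmoid(affine("i"))                        # input gate
    c_tilde = [math.tanh(a) for a in affine("c")]   # candidate memory
    o = sigmoid(affine("o"))                        # output gate
    c = [fa * ca + ia * ta for fa, ca, ia, ta in zip(f, c_prev, i, c_tilde)]
    h = [oa * math.tanh(ca) for oa, ca in zip(o, c)]
    return h, c


def mogrifier_lstm_step(x, h_prev, c_prev, W_xh, W_hx, params):
    """Modulate the inputs first, then run the ordinary LSTM step on them."""
    x_up, h_up = mogrify(x, h_prev, W_xh, W_hx)
    return lstm_step(x_up, h_up, c_prev, params)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes: a 3-dimensional input and a 4-dimensional hidden state.
random.seed(0)
DX, DH = 3, 4
W_xh = [rand_matrix(DX, DH) for _ in range(3)]  # rounds 1, 3, 5
W_hx = [rand_matrix(DH, DX) for _ in range(2)]  # rounds 2, 4
params = {g: (rand_matrix(DH, DH + DX), [0.0] * DH) for g in "fico"}

x = [0.5, -0.2, 0.1]
h0 = [0.1, 0.0, -0.1, 0.2]
c0 = [0.0] * DH
h1, c1 = mogrifier_lstm_step(x, h0, c0, W_xh, W_hx, params)
```

With three matrices in `W_xh` and two in `W_hx`, the sketch performs five rounds, matching the setting used in this work. Passing empty lists gives zero rounds and reduces the step to a plain LSTM.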
Formally,

$$x^{i} = 2\sigma(W_{xh}^{i} h_{prev}^{i-1}) \odot x^{i-2}, \quad \text{for odd } i \in [1, 2, \ldots, r], \tag{12}$$
$$h_{prev}^{i} = 2\sigma(W_{hx}^{i} x^{i-1}) \odot h_{prev}^{i-2}, \quad \text{for even } i \in [1, 2, \ldots, r], \tag{13}$$

where $\odot$ is the Hadamard product, $x^{-1} = x$, $h_{prev}^{0} = h_{prev} = \zeta_i$, and $r$ denotes the number of modulation rounds, treated as a hyperparameter. Setting $r = 0$ recovers the standard LSTM without any gated modulation at the input. Multiplication by the constant 2 ensures that randomly initialized matrices $W_{xh}^{i}$ and $W_{hx}^{i}$ result in transformations close to identity.

D. Proposed Framework Variants

We have explored several design choices for the semantic enrichment of our proposed framework. These variants are:
• W/Encoder OD: This setting provided the best performance and is illustrated in Fig. 2. The detailed mathematical formulation is presented in the previous subsections. All other variants are explained relative to it.
• W/Encoder Decoder OD: In this setting, we modified the ‘W/Encoder OD’ architecture by feeding the Mogrifier with additional features from the object detector. These additional features are concatenated alongside the attentive features.
• W/Encoder OD & Noun w/Decoder: This variant is similar to ‘W/Encoder OD’, except that the sorted list of top objects is replaced by a list of ground truth noun attributes.

E. Sentence Generation

After five rounds of modulation of the attentive features, the Mogrifier LSTM generates the highest-indexed feature vectors corresponding to each image. These vectors are sequentially fed to the LSTM unit tokenizer to generate a complete visual story. At the <start> sign, the tokenizer starts to receive the feature vectors from the Mogrifier and produces the sentence word by word until the LSTM reaches the <end> sign, which completes the sentence for the first image. The same procedure is repeated for each image, giving a whole story of five sentences, as shown in Figure 2. Each of our variants generates five enriched sentences as one story for five images. Figure 5 shows these generated stories alongside the quantitative metric scores. The variant that concatenates object labels with the image features in the encoder obtains the highest score on the automatic evaluation metrics.

IV. MODEL EVALUATION

We empirically and qualitatively evaluated the performance of our scheme on the publicly available visual storytelling dataset (VIST). This section provides details about the dataset, preprocessing, and evaluation metrics, and a discussion of the comparative results.

A. Dataset

To the best of our knowledge, VIST (Visual Storytelling dataset) [49] is the only publicly available dataset that can be utilized for supervised learning techniques in storytelling problems. It consists of 210,819 unique photographs that belong to 10,117 Flickr albums and is organized in groups of five images. Each group is accompanied by two types of stories. One is called ‘Description-in-Isolation’ and includes individual descriptions of the images, which can be useful for research in image captioning. The other is called ‘Story-in-Sequence’, which is relevant to our research problem and consists of a coherent story of exactly five sentences. It is important to mention that the names of persons are replaced by [male] or [female], locations by [location], and organizations by [organization].

B. Cross Validation

For a fair evaluation of the proposed framework, we followed 5-fold cross validation over the VIST dataset. In this process, we utilized all the data in VIST and created five splits. Our model was iteratively trained over one split and tested over the other splits. First, all the test and train samples were grouped together, and then the group was split again. Hence, we measure our model's perplexity score during training over the train, validation, and test samples in five different splits. It is important to mention that the first split is the original split presented in [49]. The scores are reported in Table I, which shows that the fifth split is the best performing split during training, achieving a perplexity of 11.50. In comparison, the third split performs best during testing, achieving a perplexity of 15.92. Overall, the dataset experiments show that our proposed model is stable and performs well, because the differences between splits are not significant.

TABLE I. An experiment over five splits on the Visual Storytelling dataset (VIST). The perplexity of the training, validation, and testing is reported to show the strength of each split.

Models      Epoch   Train Perplexity   Val Perplexity   Test Perplexity
1st Split   21      12.19              16.77            16.54
2nd Split   21      11.90              16.80            16.22
3rd Split   21      12.05              16.79            15.92
4th Split   21      12.12              16.72            17.03
5th Split   21      11.50              16.83            18.05

TABLE II. Human evaluation survey results for 15 ground truth stories, 15 generated stories by our proposed model, and 15 generated stories by the AREL model, with a total of 250 streams of images. Rank 1-5 (worst-best).

Story Type     Relevance   Coherence   Informative Story
Ground Truth   3.20        3.20        3.33
AREL           3.07        3.13        3.13
Ours           3.16        3.23        3.28

C. Automatic Evaluation Metrics

Automatic evaluation metrics provide a convenient way to quantify and compare the performance of different algorithms. Each metric has its own strengths; therefore, comparative analysis over several metrics is vital to ascertain the strength of an algorithm. Following the popular choices in the relevant literature, we have included several metrics to evaluate our proposed model: BLEU, ROUGE, METEOR and CIDEr. BLEU stands for Bilingual Evaluation Understudy and scores the generated text by comparing its n-grams with those of the ground truth text (i.e., the reference text). We have included four variants called BLEU-1, 2, 3, and 4 [26].
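To make the n-gram comparison behind BLEU concrete, the following is a simplified sentence-level sketch: clipped n-gram precision combined with a brevity penalty, without smoothing. It is an illustration only and not the evaluation script used for the reported scores; the function names and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most as
    many times as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


candidate = "the dog ran in the park".split()
reference = "the dog ran in the park today".split()
print(round(bleu(candidate, [reference], max_n=1), 4))  # BLEU-1
print(round(bleu(candidate, [reference], max_n=4), 4))  # BLEU-4
```

In this example every candidate n-gram also occurs in the reference, so all precisions equal 1 and the score is determined by the brevity penalty alone, which penalises the candidate for being one word shorter than the reference.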
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation [29] and compares texts in three different ways: n-grams, word sequences, and word pairs. METEOR stands for Metric for Evaluation of Translation with Explicit ORdering [28] and is better suited to sentence-level evaluation, since it also matches synonyms of words against the reference text. Similarly, CIDEr stands for Consensus-based Image Description Evaluation [27] and is designed to compare machine-generated text with multiple human captions (i.e., reference texts).

V. FINDINGS AND RESULTS

In this section, we compare the performance of our proposed model with state-of-the-art models. In addition, a human evaluation measure is reported. Finally, an example of a generated story is discussed as a case study.

Fig. 5. An example of a stream of images from the VIST dataset, followed by the ground truth story and stories generated by our proposed Semantic Attribute Enriched Storytelling (SAES) model and its variants. In addition, four metrics are presented, which include relevance, coherence and informativeness from human evaluation, and the BLEU score from the automatic evaluation metrics.

TABLE III. Comparison of our proposed model with recent methodologies on the Visual Storytelling dataset (VIST). We used seven automatic evaluation metrics as quantitative results. “-” means the score is not recorded by the authors of the respective method. The “bold” scores represent the highest result.

Model                                            BLEU-1   BLEU-2   BLEU-3   BLEU-4   CIDEr   ROUGE-L   METEOR
AREL 2018 (baseline) [16]                        0.536    0.315    0.173    0.099    0.038   0.286     0.352
GLACNet 2018 (baseline) [39]                     0.568    0.321    0.171    0.091    0.041   0.264     0.306
HCBNet 2019 [36]                                 0.593    0.348    0.191    0.105    0.051   0.274     0.34
HCBNet (without prev. sent. attention) [36]      0.598    0.338    0.180    0.097    0.057   0.271     0.332
HCBNet (without description attention) [36]      0.584    0.345    0.194    0.108    0.043   0.271     0.337
HCBNet (VGG19) [36]                              0.591    0.34     0.186    0.104    0.051   0.269     0.334
VSCMR 2019 [50]                                  0.638    -        -        0.143    0.090   0.302     0.355
MLE 2020 [51]                                    -        -        -        0.143    0.072   0.300     0.348
BLEU-RL 2020 [51]                                -        -        -        0.144    0.067   0.301     0.352
ReCo-RL 2020 [51]                                -        -        -        0.124    0.086   0.299     0.339
VS with MPJA 2021 [52]                           0.601    0.325    0.133    0.082    0.042   0.303     0.344
CAMT 2021 [41]                                   0.641    0.361    0.201    0.184    0.042   0.303     0.335
Rand+RNN 2021 [43]                               -        -        0.133    0.061    0.022   0.272     0.311
Proposed model w/Encoder Decoder OD              0.64     0.363    0.196    0.106    0.051   0.294     0.330
Proposed model w/Encoder OD & Noun w/Decoder     0.63     0.357    0.195    0.109    0.048   0.299     0.331
Proposed model w/Encoder OD                      0.65     0.372    0.204    0.12     0.054   0.303     0.335

A. Proposed Model Comparison

In our comparison, we include all the automatic evaluation metrics discussed in Section IV-C, which are: BLEU-1, 2, 3, and 4, CIDEr, ROUGE-L, and METEOR. Table III presents the comparison between our proposed Semantic Attribute Enriched Storytelling (SAES) model and recently published popular methods, which include: AREL, a method for an implicit reward with imitation learning [16]; GLACNet, an approach to learn an attention cascading network [39]; HCBNet, an approach that uses image descriptions as a hierarchy for the sequence of images [36]; VSCMR, a method of cross-modal rule mining [50]; ReCo-RL, an approach of designing composite rewards [51]; VS with MPJA, a method of steganographic visual storytelling with mutual-perceived joint attention [52]; CAMT, a technique based on highest-indexed gated modulation [41]; and Rand+RNN, a method of concept integration [43].
In summary, our proposed model (SAES) outperforms the recent visual storytelling models mentioned above on most metrics, the exceptions being BLEU-4, CIDEr, and METEOR.

B. Human Evaluation

Besides automatic evaluation metrics, human judgement has always been considered a gold standard for evaluating the performance of computer algorithms. In order to assess the performance of our technique, we asked human participants to evaluate our proposed model, the ground truth story and the AREL model. First, 250 streams of images are randomly selected from the VIST dataset. Then, participants are asked to rank each story from 1 to 5 (worst to best). The survey focuses on three aspects: relevance, coherence, and informativeness. Relevance indicates how well the story's sentences represent the images, coherence indicates the connection between the sentences themselves as a story, and informativeness indicates the ability to predict more relevant vocabulary in the story. The results presented in Table II show that our generated stories are the most coherent, and that they are more relevant and informative than those of AREL, coming close to the ground truth stories written by humans.

C. Case Study

Although the ground truth story obtains a high score on automatic evaluation metrics against its five references, human evaluation provides a better understanding of the quality of a story. Fig. 5 presents an example sequence of images with the ground truth story, stories generated by different variants of our proposed SAES method, and four different metrics for each story: relevance, coherence, informativeness, and BLEU score. It can be observed from Fig. 5 that the ground truth story obtained the highest score in the BLEU metric.
However, participants in the human evaluation survey found that our generated story is better than the ground truth story. They found it to be more relevant, coherent and informative. The sentences highlighted in green in Fig. 5 show that our proposed model is able to provide more information in its sentences to build a coherent and relevant story based on the provided stream of images. In contrast, the sentence highlighted in red, “I had a great time”, did not contribute to the generated story and negatively affected the BLEU score.

VI. CONCLUSION AND FUTURE WORK

This study introduced a novel end-to-end encoder-decoder framework enhanced by object detection and noun attribute methods. The encoder concatenates the image features extracted by ResNet-152 with the object labels detected by Yolo-v5 (multi-hot vectors). These concatenated features are then fed to a Bidirectional-LSTM of two layers. The decoder comprises a Mogrifier with five rounds for better modulation and a language model. In between, an attention mechanism layer is added, which enables our model to generate a more coherent, relevant and informative story. Our proposed model outperforms existing methods on several popular automatic evaluation metrics and in human evaluation. Our future work aims to create a new dataset for storytelling problems with diverse ground truth and scenarios. In addition, an integrated concept model will be investigated to improve the quality of generated stories.

VII. ACKNOWLEDGMENT

This research was fully funded by the Australian Government through the Australian Research Council (DP190102443).

REFERENCES

[1] Y. Liu, J. Fu, T. Mei, and C. W. Chen, “Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[2] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C.
Niebles, “Spatio-temporal graph for video captioning with knowledge distillation,” in IEEE CVPR, 2020.
[3] Y. Feng, L. Ma, W. Liu, and J. Luo, “Unsupervised image captioning,” in Proceedings of the IEEE CVPR, June 2019.
[4] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of IEEE CVPR, 2019.
[5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE CVPR, 2015, pp. 2625–2634.
[6] J. Zhang and Y. Peng, “Video captioning with object-aware spatio-temporal correlation and aggregation,” IEEE Trans. on Image Processing, vol. 29, pp. 6209–6222, 2020.
[7] N. Aafaq, N. Akhtar, W. Liu, S. Z. Gilani, and A. Mian, “Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning,” in Proceedings of IEEE CVPR, 2019, pp. 12487–12496.
[8] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai, “Stat: spatial-temporal attention mechanism for video captioning,” IEEE Trans. on Multimedia, vol. 22, pp. 229–241, 2019.
[9] S. Liu, Z. Ren, and J. Yuan, “Sibnet: Sibling convolutional encoder for video captioning,” IEEE Trans. on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
[10] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE ICCV, 2015, pp. 4507–4515.
[11] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” in Proceedings of the IEEE CVPR, 2017, pp. 5630–5639.
[12] Y. Pan, T. Yao, H. Li, and T. Mei, “Video captioning with transferred semantic attributes,” in IEEE CVPR, 2017.
[13] N. Aafaq, N. Akhtar, W. Liu, and A. Mian, “Empirical autopsy of deep video captioning frameworks,” preprint arXiv:1911.09345, 2019.
[14] J. Wang, J. Fu, J. Tang, Z.
Li, and T. Mei, “Show, reward and tell: Automatic generation of narrative paragraph from photo stream by adversarial training,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [15] P. Yang, F. Luo, P. Chen, L. Li, Z. Yin, X. He, and X. Sun, “Knowledgeable storyteller: A commonsense-driven generative model for visual storytelling.” in IJCAI, 2019, pp. 5356–5362. [16] X. Wang, W. Chen, Y.-F. Wang, and W. Y. Wang, “No metrics are perfect: Adversarial reward learning for visual storytelling,” arXiv preprint arXiv:1804.09160, 2018. [17] Y. Jung, D. Kim, S. Woo, K. Kim, S. Kim, and I. S. Kweon, “Hideand-tell: learning to bridge photo streams for visual storytelling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 213–11 220. [18] D. Thuan, “Evolution of yolo algorithm and yolov5: the state-of-the-art object detection algorithm,” 2021. [19] G. Melis, T. Kočiskỳ, and P. Blunsom, “Mogrifier lstm,” in Proceedings of the ICLR, 2020. [20] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, “A comprehensive survey of deep learning for image captioning,” ACM Computing Surveys (CsUR), vol. 51, no. 6, pp. 1–36, 2019. Authorized licensed use limited to: University of Science & Technology of China. Downloaded on April 06,2023 at 03:24:45 UTC from IEEE Xplore. Restrictions apply. [21] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, “Attention on attention for image captioning,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4634–4643. [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105. [23] C. Sur, “aitpr: attribute interaction-tensor product representation for image caption,” Neural Processing Letters, vol. 53, no. 2, pp. 1229– 1251, 2021. [24] X. Shen, B. Liu, Y. Zhou, J. Zhao, and M. 
Liu, “Remote sensing image captioning via variational autoencoder and reinforcement learning,” Knowledge-Based Systems, p. 105920, 2020. [25] J. Yuan, L. Zhang, S. Guo, Y. Xiao, and Z. Li, “Image captioning with a joint attention mechanism by visual concept samples,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 3, pp. 1–22, 2020. [26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318. [27] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensusbased image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575. [28] S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72. [29] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81. [30] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in European Conference on Computer Vision. Springer, 2016, pp. 382–398. [31] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014. [32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural IPS, 2017, pp. 5998–6008. [34] Q. Zheng, C. Wang, and D. 
Tao, “Syntax-aware action targeting for video captioning,” in Proceedings of the IEEE/CVF CVPR, June 2020. [35] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z.-J. Zha, “Object relational graph with teacher-recommended learning for video captioning,” in Proceedings of the IEEE/CVF CVPR, June 2020. [36] M. S. Al Nahian, T. Tasrin, S. Gandhi, R. Gaines, and B. Harrison, “A hierarchical approach for visual storytelling using image description,” in International Conference on Interactive Digital Storytelling. Springer, 2019, pp. 304–317. [37] C. C. Park and G. Kim, “Expressing an image stream with a sequence of natural sentences,” in Advances in neural information processing systems, 2015, pp. 73–81. [38] G. Kim, S. Moon, and L. Sigal, “Ranking and retrieval of image sequences from multiple paragraph queries,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1993–2001. [39] T. Kim, M.-O. Heo, S. Son, K.-W. Park, and B.-T. Zhang, “Glac net: Glocal attention cascading networks for multi-image cued story generation,” arXiv preprint arXiv:1805.10973, 2018. [40] Q. Huang, Z. Gan, A. Celikyilmaz, D. Wu, J. Wang, and X. He, “Hierarchically structured reinforcement learning for topically coherent visual story generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8465–8472. [41] Z. M. Malakan, N. Aafaq, G. M. Hassan, and A. Mian, “Contextualise, attend, modulate and tell: Visual storytelling,” 2021. [42] B. Harrison, “A hierarchical approach for visual storytelling using image description,” in Interactive Storytelling: 12th International Conference on Interactive Digital Storytelling, ICIDS 2019, Little Cottonwood Canyon, UT, USA, November 19–22, 2019, Proceedings, vol. 11869. Springer Nature, 2019, p. 304. [43] H. Chen, Y. Huang, H. Takamura, and H. 
Nakayama, “Commonsense knowledge aware concept selection for diverse and informative visual storytelling,” arXiv preprint arXiv:2102.02963, 2021. [44] C. Xu, M. Yang, C. Li, Y. Shen, X. Ao, and R. Xu, “Imagine, reason and write: Visual storytelling with graph knowledge and relational reasoning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3022–3029. [45] N. Akhtar, M. Jalwana, M. Bennamoun, and A. S. Mian, “Attack to fool and explain deep networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [46] Z. M. Malakan and H. A. Albaqami, “Classify, detect and tell: Realtime american sign language,” in 2021 National Computing Colleges Conference (NCCC). IEEE, 2021, pp. 1–6. [47] M. A. Jalwana, N. Akhtar, M. Bennamoun, and A. Mian, “Cameras: Enhanced resolution and sanity preserving class activation mapping for image saliency,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16 327–16 336. [48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [49] T.-H. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra et al., “Visual storytelling,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1233–1239. [50] J. Li, H. Shi, S. Tang, F. Wu, and Y. Zhuang, “Informative visual storytelling with cross-modal rules,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2314–2322. [51] J. Hu, Y. Cheng, Z. Gan, J. Liu, J. Gao, and G. Neubig, “What makes a good story? designing composite rewards for visual storytelling.” in AAAI, 2020, pp. 7969–7976. [52] Y. Guo, H. Wu, and X. 
Zhang, “Steganographic visual story with mutual-perceived joint attention,” EURASIP Journal on Image and Video Processing, vol. 2021, no. 1, pp. 1–14, 2021. Authorized licensed use limited to: University of Science & Technology of China. Downloaded on April 06,2023 at 03:24:45 UTC from IEEE Xplore. Restrictions apply.