IMAGE CAPTION GENERATOR USING DEEP LEARNING

Ch. Sneha(1), B. Premanvitha(2), B. Shanmukh(3), Kavitha Chaduvula(4)
(1,2,3) UG Student, (4) Professor and Head, IT Department, Gudlavalleru Engineering College, Gudlavalleru-521356
The International Journal of Analytical and Experimental Modal Analysis, ISSN: 0886-9367, Volume XIII, Issue VII, July 2021

Abstract- Computer vision has become ubiquitous in our society, with applications in several fields. In this paper, we focus on one of the visual recognition facets of computer vision, namely image captioning. The problem of generating language descriptions for visual data has been studied for a long time, but mostly in the field of videos. In the past few years, the emphasis has shifted to describing still images with natural text. Due to recent advances in object detection, the task of describing the scene in an image has become easier. The aim of the project was to train convolutional neural networks (ResNet, VGG) with several hundreds of hyperparameters on a large dataset of images, and to combine the output of this image encoder with a recurrent neural network to generate a caption for the classified image. In this report we present the detailed architecture of the model we used.

Keywords- Convolutional Neural Network, Recurrent Neural Network, BLEU (Bilingual Evaluation Understudy) score, Long Short-Term Memory

I. INTRODUCTION
Image caption generation is a popular research area of Artificial Intelligence that deals with image understanding and a language description for that image. Generating well-formed sentences requires both syntactic and semantic understanding of the language. Being able to describe the content of an image using accurately formed sentences is a very challenging task, but it can also have a great impact, for instance by helping visually impaired people better understand the content of images. Artificial Intelligence (AI) is now at the heart of the innovation economy, and it forms the basis for this paper as well. In the recent past, a field of AI called Deep Learning has turned a lot of heads due to its impressive accuracy when compared to existing machine learning algorithms. The task of image captioning is significantly harder than that of image classification, which has been the main focus of the computer vision community.
A description of an image must capture the relationships between the objects in the image. In addition to this visual understanding of the image, the semantic knowledge has to be expressed in a natural language such as English, which means that a language model is needed. Attempts made in the past have therefore stitched these two models together.

II. RELATED WORK
We introduce a synthesized output generator which localizes and describes objects, attributes and relationships in an image in natural language form. A simple CNN architecture for four-class classification is shown in Fig. 1. To build our image caption generator, we merge two architectures into what is also called a CNN-RNN model: the CNN is used to extract features from the image, and the LSTM uses the information from the CNN to help generate a description of the image.

CNN: Convolutional Neural Networks are specialized deep neural networks that can process data with a 2D, matrix-like input shape. Images are easily represented as a 2D matrix, so CNNs are very useful for working with images. A CNN is basically used for image classification, for example identifying whether an image shows a bird, a plane or Superman. It scans the image from left to right and top to bottom to pull out important features and combines them to classify the image. It can handle images that have been translated, rotated, scaled or viewed from a different perspective.

LSTM: LSTM stands for Long Short-Term Memory. LSTMs are a type of RNN (recurrent neural network) well suited to sequence prediction problems: based on the previous text, we can predict what the next word will be. The LSTM has proven more effective than the traditional RNN by overcoming its short-term memory limitation. An LSTM can carry relevant information throughout the processing of the input and, with a forget gate, it discards non-relevant information.

Fig 1: A simple CNN architecture

III. METHODOLOGY
To develop any system, certain methodologies and techniques are used. The methodology used in this paper is image processing with a CNN for encoding the image and an LSTM for generating the captions.

3.1 Convolutional Neural Network
Convolutional Neural Networks (ConvNets or CNNs) are a category of artificial neural networks which have proven very effective in image recognition and classification. They have been used extensively for object detection, self-driving cars, image captioning and more. The first ConvNet was introduced around 1990 by Yann LeCun, and its architecture was called LeNet. A basic ConvNet is shown in Fig. 1. The entire architecture of a ConvNet can be explained using four main operations:
i. Convolution
ii. Non-linearity (ReLU)
iii. Pooling or sub-sampling
iv. Classification (fully connected layer)
These operations are the basic building blocks of every Convolutional Neural Network, so understanding how they work is an important step towards a sound understanding of ConvNets. We discuss each of these operations in detail below.
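Before discussing each operation in detail, the following minimal Keras sketch shows how these four building blocks are typically composed into one classifier. This is a generic illustration, not the exact network used in this work; the input shape, filter counts and class count are assumptions.

```python
from tensorflow.keras import layers, models

def build_small_cnn(input_shape=(64, 64, 3), num_classes=4):
    """Illustrative CNN: convolution + ReLU, pooling, and a softmax classifier."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),    # convolution + ReLU non-linearity
        layers.MaxPooling2D((2, 2)),                     # pooling / sub-sampling
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"), # fully connected classification layer
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    build_small_cnn().summary()
```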
Essentially, every image can be represented as a matrix of pixel values. An image from a standard digital camera has three channels – red, green and blue – which can be imagined as three 2D matrices stacked on top of each other (one per colour), each with pixel values in the range 0 to 255, as shown in Fig. 2.

Fig 2: A grayscale image as a matrix of numbers

Convolution Operation
The purpose of the convolution operation is to extract features from an image. We consider filters smaller than the dimensions of the image. The operation can be understood with the example below. Consider a small 2-dimensional 5×5 image with binary pixel values and another 3×3 matrix, shown in Fig. 3.

Fig 3: Image (in green) and Filter (in orange)

We slide the orange 3×3 matrix over the original image one pixel at a time, compute the element-wise multiplication of the orange matrix with the corresponding sub-matrix of the original image, and add the products to obtain a single integer, which forms one element of the output matrix (shown in pink in Fig. 4).

Fig 4: Convolution operation

The 3×3 matrix is called a filter, kernel or feature detector, and the matrix formed by sliding the filter over the image and computing the dot product is called the Convolved Feature, Activation Map or Feature Map. The number of pixels by which we slide the filter over the original image is known as the stride.

Introducing Non-Linearity
An additional operation is applied after every convolution operation. The most commonly used non-linear function for images is ReLU, which stands for Rectified Linear Unit. ReLU is an element-wise operation that replaces the negative values in the feature map with zero. Most real-world data is non-linear, whereas the output of the convolution operation is linear (element-wise multiplication and addition), which is why a non-linearity is introduced. The output of the ReLU operation is shown in Fig. 5. Other commonly used non-linearities are sigmoid and tanh.

Fig 5: Output after a ReLU operation

Spatial Pooling
The pooling operation reduces the dimensionality of the feature map while preserving its important features. The most common pooling technique is max pooling. In max pooling, a window of size n×n (where n is smaller than the side of the feature map) is slid over the input with a given stride length, and the maximum value within each window is kept. The complete process is shown in Fig. 6.

Fig 6: Max pooling operation

Fully-Connected Layer
The fully connected layer is a multi-layer perceptron that uses the softmax activation function in the output layer. The term "fully connected" refers to the fact that every neuron in the previous layer is connected to every neuron in the next layer. The convolution and pooling operations generate features of an image; the task of the fully connected layer is to map these feature vectors to the classes in the training data. A fully connected layer of a CNN with 4 classes is shown in Fig. 7.

Fig 7: An example of a fully connected layer with 4 classes
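The convolution, ReLU and max pooling operations described above can be made concrete with a short NumPy sketch of the worked example. The 5×5 binary image and 3×3 filter values below are assumptions, since the actual values appear only in the figures.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Valid convolution (cross-correlation, as used in CNNs) of a 2-D image."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then add
    return out

def relu(x):
    return np.maximum(x, 0)                      # replace negative values with zero

def max_pool(feature_map, size=2, stride=1):
    oh = (feature_map.shape[0] - size) // stride + 1
    ow = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = feature_map[i*stride:i*stride+size,
                                    j*stride:j*stride+size].max()
    return out

# Illustrative 5x5 binary image and 3x3 filter (values assumed, not from the figures).
image  = np.array([[1, 1, 1, 0, 0],
                   [0, 1, 1, 1, 0],
                   [0, 0, 1, 1, 1],
                   [0, 0, 1, 1, 0],
                   [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

feature_map = relu(convolve2d(image, kernel, stride=1))  # 3x3 feature map
print(feature_map)
print(max_pool(feature_map))                             # pooled 2x2 map
```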
The pretrained CNN models used in this work are VGG16 and ResNet50.

VGG16 architecture
VGG stands for Visual Geometry Group, a group of researchers at Oxford who developed this architecture. The VGG architecture consists of blocks, where each block is composed of 2D convolution and max pooling layers. VGGNet comes in two flavours, VGG16 and VGG19, where 16 and 19 are the number of layers in each of them respectively. We only used VGG16, shown in Fig. 8, although not with exactly the same dimensions: we use the layer with dimension 7×7×512 to extract features from the images. Instead of saving all the features in one file, we dump them to a numpy file with the same name as the image so that we can use them later.

Fig 8: VGG16 model

3.2 Recurrent Neural Networks
Convolutional and fully connected layers are designed to process the input in a single time step, without temporal context. Nonetheless, some tasks concern sequences in which the data are temporally interdependent. For these, the Recurrent Neural Network (RNN), an extension of fully connected layers, was introduced. RNNs are neural networks that carry information from previous time steps. RNNs are used in a variety of tasks: transforming a static input into a sequence (e.g. image captioning); processing a sequence into a static output (e.g. video labelling); or transforming sequences into sequences (e.g. automatic translation).

A simple recurrent network is typically designed by taking the layer's output from the previous step and concatenating it with the current step's input:

y_t = f(x_t, y_{t-1})

The function f is a standard fully connected layer that processes both inputs indistinctly as one vector. Due to its simplicity, this approach is not sufficient and does not yield promising results. Thus, in past years, a great number of designs have been tested, the idea has been advanced and the architectures have become more complex. For example, an inner state vector was introduced to convey information between time steps:

h_t, y_t = f(x_t, h_{t-1})

The most popular architecture nowadays is the Long Short-Term Memory (LSTM), a rather complex design, yet one that outperforms the others.

Long Short-Term Memory
The control flow of an LSTM is shown in Fig. 9.

Fig 9: Long Short-Term Memory control flow
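Before moving on to the LSTM gates, the simple recurrence y_t = f(x_t, y_{t-1}) described above can be sketched in a few lines of NumPy. The dimensions and random weights are purely illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One step of a simple recurrent layer: concatenate [x_t ; h_{t-1}],
    apply one fully connected layer with a tanh non-linearity."""
    z = np.concatenate([x_t, h_prev])
    h_t = np.tanh(W @ z + b)   # the new state, also usable as the output y_t
    return h_t

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 8, 16, 5
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for t in range(steps):
    x_t = rng.normal(size=input_dim)   # stand-in for the real input at step t
    h = rnn_step(x_t, h, W, b)
print(h.shape)                         # (16,)
```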
LSTMs have internal mechanisms called gates that can regulate the flow of information. These gates can learn which data in a sequence is important to keep or to throw away, so the network can pass relevant information down the long chain of sequences to make predictions. Almost all state-of-the-art results based on recurrent neural networks are achieved with these two architectures: LSTMs and GRUs can be found in speech recognition, speech synthesis and text generation, and they can even be used to generate captions for videos. The gates and symbols in the LSTM control flow are shown in Fig. 10.

Fig 10: LSTM gates

An LSTM has a similar control flow to a recurrent neural network: it processes data, passing information on as it propagates forward. The differences are the operations within the LSTM's cells, which allow the LSTM to keep or forget information. The core concepts of the LSTM are the cell state and its various gates. The cell state acts as a transport highway that carries relevant information all the way down the sequence chain; you can think of it as the "memory" of the network. The cell state can, in theory, carry relevant information throughout the processing of the sequence, so even information from earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from it via the gates. The gates are different neural networks that decide which information is allowed on the cell state; during training they learn which information is relevant to keep or forget.

Sigmoid
Gates contain sigmoid activations. A sigmoid activation is similar to the tanh activation, but instead of squashing values between -1 and 1, it squashes them between 0 and 1. This is helpful for updating or forgetting data: any number multiplied by 0 is 0, causing values to disappear or be "forgotten", while any number multiplied by 1 keeps the same value, so it stays the same or is "kept". The network can thus learn which data is unimportant and can be forgotten, and which data is important to keep.

Let us dig a little deeper into what the various gates are doing. Three different gates regulate information flow in an LSTM cell: a forget gate, an input gate and an output gate.

Forget gate
First, we have the forget gate. This gate decides what information should be thrown away or kept. Information from the previous hidden state and from the current input is passed through the sigmoid function, producing values between 0 and 1: the closer to 0, the more is forgotten; the closer to 1, the more is kept. The forget gate of the LSTM is shown in Fig. 11.

Fig 11: LSTM forget gate

Input gate
To update the cell state, we have the input gate. First, we pass the previous hidden state and the current input into a sigmoid function, which decides which values will be updated by transforming them to be between 0 and 1 (0 means not important, 1 means important). We also pass the hidden state and the current input into the tanh function, which squashes values between -1 and 1 to help regulate the network. Then we multiply the tanh output by the sigmoid output; the sigmoid output decides which information from the tanh output is important to keep. The input gate of the LSTM is shown in Fig. 12.

Fig 12: LSTM input gate

Cell state
Now we have enough information to calculate the cell state. First, the cell state is pointwise multiplied by the forget vector, which can drop values in the cell state if they are multiplied by values near 0. Then we take the output of the input gate and perform a pointwise addition, which updates the cell state with the new values that the network finds relevant. That gives us the new cell state, shown in Fig. 13.

Fig 13: LSTM cell state
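The forget gate, input gate and cell-state update described so far can be summarised in a short NumPy sketch (the output gate, discussed next, is omitted here). The weight shapes and random values are illustrative assumptions, not the parameters of any trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_update(x_t, h_prev, c_prev, Wf, Wi, Wg, bf, bi, bg):
    """Forget gate, input gate and cell-state update of one LSTM step."""
    z = np.concatenate([h_prev, x_t])   # previous hidden state + current input
    f_t = sigmoid(Wf @ z + bf)          # forget gate: near 0 = forget, near 1 = keep
    i_t = sigmoid(Wi @ z + bi)          # input gate: which values to update
    g_t = np.tanh(Wg @ z + bg)          # candidate values squashed to [-1, 1]
    c_t = f_t * c_prev + i_t * g_t      # drop irrelevant info, add new relevant info
    return c_t

hidden, inp = 4, 3
rng = np.random.default_rng(1)
Wf, Wi, Wg = (rng.normal(scale=0.1, size=(hidden, hidden + inp)) for _ in range(3))
bf = bi = bg = np.zeros(hidden)

c_new = lstm_cell_update(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden),
                         Wf, Wi, Wg, bf, bi, bg)
print(c_new)
```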
Output gate
Last, we have the output gate, which decides what the next hidden state should be. Remember that the hidden state contains information on previous inputs and is also used for predictions. First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly modified cell state to the tanh function and multiply the tanh output by the sigmoid output to decide what information the hidden state should carry. The output is the new hidden state. The new cell state and the new hidden state are then carried over to the next time step. The output gate of the LSTM is shown in Fig. 14.

Fig 14: LSTM output gate

3.3 Implementation
The steps involved in the implementation are:
i. Get dataset
ii. Prepare photo data
iii. Prepare text data
iv. Load data
v. Encode text data
vi. Define model
vii. Fit model
viii. Evaluate model
ix. Generate captions

Get Dataset
We decided to use the Flickr8k dataset. It has 8092 images and 5 captions for each image; each image has 5 captions because there are different ways to caption an image. The dataset has predefined training, validation and test subsets of 6000, 1000 and 1000 images respectively.

Prepare Photo Data
We use two pretrained CNN models to extract features from the images: VGG16 and ResNet50. We remove the last layers of these models because we are not interested in classifying the images but in their representation. Instead of extracting the features every time we need them, we compute all of them once and save them to a file; the features extracted with each model are saved in a different file, features_vgg16.pkl and features_resnet.pkl. These models require images of a fixed size (224×224 pixels), so the images had to be resized, converted to arrays and reshaped. The extracted features are vectors of size 2048. The pretrained CNN models (VGG16 and ResNet50) and the LSTM are shown in Fig. 15, Fig. 16 and Fig. 17. (A minimal sketch of this feature-extraction step is given below, after the data-loading step.)

Fig 15: LSTM+ResNet50 architecture
Fig 16: VGG16 architecture
Fig 17: ResNet50 architecture

Prepare Text Data
Firstly, we load all the descriptions of the images and create a dictionary that maps image names to descriptions. To prepare the text data, we needed to clean the descriptions: we converted all the words to lowercase and removed all punctuation, words one character long, and words containing numbers. Next, we create a vocabulary with the unique words of the descriptions; the size of the created vocabulary is 8,763 for the VGG16 pipeline and 1,848 for the ResNet50 pipeline. Finally, we save the descriptions to the file descriptions.txt so that we can use them later.

Load Data
We define functions to load the data corresponding to each of the subsets: train, validation and test. These subsets are predefined in the files Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt and Flickr_8k.testImages.txt. In the notebook without attention we load the image keys, descriptions and features; in the case of attention we do not load the features, because we extract them later. The model will generate a caption word by word, taking the previous words into account, so we need an initial and a final word to start and end the generation. That is why we added <start> and <end> to the descriptions as the initial and final words.
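As a concrete sketch of the photo-preparation step described above (extracting features with a pretrained VGG16 and saving them to features_vgg16.pkl), the following code shows one common way to do it in Keras. The image directory name and the choice of where to cut the network are assumptions: the text mentions both a 7×7×512 feature map and fixed-length feature vectors, so the exact cut point used in the project may differ.

```python
from os import listdir
from os.path import join
from pickle import dump

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Keep everything up to the penultimate layer (one common choice); the
# classification layer is removed because only the representation is needed.
base = VGG16()
feature_model = Model(inputs=base.inputs, outputs=base.layers[-2].output)

features = {}
image_dir = "Images"                          # assumed local Flickr8k image folder
for name in listdir(image_dir):
    img = load_img(join(image_dir, name), target_size=(224, 224))  # VGG16 input size
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    features[name.split(".")[0]] = feature_model.predict(x, verbose=0)

dump(features, open("features_vgg16.pkl", "wb"))  # saved once, reused later
```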
Encode Text Data
Here we use a tokenizer to create a map from the words of the vocabulary to integers. We save this tokenizer to the file tokenizer.pkl to use it later. We also calculate the size of the vocabulary and the maximum length of the descriptions, to use them later.

Define Model
We decided to try different recurrent neural network architectures and compare the results. As can be seen in Fig. 18, the models are different, but both use a Long Short-Term Memory network as the recursive network. We set the maximum number of epochs to 20 and the batch size to 32, and we calculated the number of steps needed in each epoch for the training and validation data. We use a data generator to produce the training arrays progressively, one batch at a time; we used this approach to avoid reaching RAM and GPU limits, since otherwise we were not able to train the model with the RAM available in Google Colaboratory. The data generator receives the training data shuffled and works with image ids to take the images that correspond to the captions. We created one data generator for the training set and another one for the validation set.

RNN Model 1: It has two inputs. The text submodel has Embedding, Dropout and LSTM layers; the image submodel has a Dropout and a Dense layer. The two submodels are then added, and finally there are two Dense layers, the last one of vocabulary size with a softmax activation. The model receives the image features, a sequence of words and the output word. Instead of building these sequences for all the data at once, we create them for one image at a time: for each image and caption we create all the possible input sequences and output words by adding one word at a time until we reach <end>.

Fit Model
LSTM+VGG16: In this part, we fit the model to the data. We monitor the training and validation loss and save the model with the lowest validation loss in order to use it later, because fitting can take very long. If the validation loss increases in two consecutive epochs, we stop training to save time and avoid overfitting. We tried all the possible combinations of the different CNNs with the RNNs, and as the loss curves in Fig. 19 and Fig. 20 show, the best result (with regard to the loss) was obtained with the second RNN. Finally, we visualize the training and validation loss to better understand how our model is learning. The layers added for the trained RNN model are shown in Fig. 18.

Fig 18: Trained model architecture

LSTM+ResNet50: This model architecture uses the ReLU and softmax activation functions and was trained for 10 epochs, reaching a loss of 2.7.

The loss versus validation loss across epochs for the trained models (RNN1-VGG16 and RNN2-VGG16) is shown in Fig. 19 and Fig. 20.

Fig 19: Loss and validation loss values over epochs for RNN1 and VGG16
Fig 20: Loss and validation loss values over epochs for RNN2 and VGG16
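For concreteness, the two-input architecture described under "Define Model" (RNN Model 1) can be sketched in Keras as follows. The feature size, embedding dimension and number of LSTM units are assumptions, since the paper does not list them explicitly; the feature size depends on the CNN and the layer chosen for extraction.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length, feature_size=4096):
    """Sketch of RNN Model 1: image branch + text branch, added together."""
    # image feature sub-model: Dropout + Dense
    inputs1 = Input(shape=(feature_size,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation="relu")(fe1)

    # text sub-model: Embedding + Dropout + LSTM
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # add the two sub-models, then two Dense layers (softmax over the vocabulary)
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation="relu")(decoder1)
    outputs = Dense(vocab_size, activation="softmax")(decoder2)

    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```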
Evaluate Model
We used two different ways to generate descriptions of images: sampling and beam search. Sampling consists of taking the best word at each time step until the end is reached. Beam search keeps the k best sentences at each time step and predicts the next words for each of them. Generally, making k bigger increases the chance of obtaining the description with the highest probability, but it also increases the time needed to generate captions. We tried to evaluate the whole test set with beam search, but it took too long, so we only used the beam search function when generating example captions.

Since we created multiple models, we needed an effective way to evaluate them, so we chose to calculate BLEU scores on the test data, which requires comparing the original captions with the generated ones. The BLEU scores obtained with sampling for the VGG16 models are:

Sampling, Model 1 + VGG16
• BLEU-1: 54.114390
• BLEU-2: 33.489586
• BLEU-3: 24.785442
• BLEU-4: 13.031321

Sampling, Model 2 + VGG16
• BLEU-1: 59.5796
• BLEU-2: 36.9997
• BLEU-3: 27.2431
• BLEU-4: 14.4684

Generate Captions
We generate captions for some images of the test set to view the performance of our models on real examples, using the same methods as for evaluation. Instead of printing the captions as they are, we clean them to improve readability by removing the <start> and <end> words and all other tokens. For each image, we show the 5 original captions, the result of sampling, and the results of beam search with k=3 and k=5. We also show BLEU-1 scores as a reference for all the generated captions. On the one hand, the captions generated without attention are quite good: they contain some mistakes, but they capture some aspects of the images correctly. On the other hand, the captions generated with the attention model are very poor, which correlates with its low BLEU scores.

Below are some results obtained with images from the test dataset, using VGG16 and the second RNN model. One of the best captioned images for the trained model is shown in Fig. 21.

Fig 21: One of the best captionings
Original 1: brown and white dog stands outside while it snows
Original 2: dog is looking at something near the water
Original 3: furry dog attempts to dry itself by shaking the water off its coat
Original 4: white and brown dog shaking its self dry
Original 5: the large brown and white dog shakes off water
Sampling (BLEU-1: 58.4101): dog is swimming in the water
Beam Search k=3 (BLEU-1: 72.7273): white and white dog is swimming in the water
Beam Search k=5 (BLEU-1: 70.0000): brown and white dog swims in shallow water

While training, we have seen that we obtained the best results with VGG16 and the second RNN model, and also with LSTM and ResNet50. But there is still a lot of room for improvement, as the example captioning shows: even when we obtain a fairly correct description, it is not very close to any of the original ones.
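The sampling procedure described above (taking the best word at each step until <end> is reached) can be sketched as a simple greedy loop. Here model, tokenizer, photo_feature and max_length are assumed to come from the earlier steps, and the tokenizer is assumed to be configured so that the <start> and <end> markers are not stripped by its default filters.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    """Greedy sampling: repeatedly pick the most probable next word.
    photo_feature is the pre-extracted CNN feature with shape (1, feature_size)."""
    in_text = "<start>"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_feature, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(yhat)))  # best word at this step
        if word is None:
            break
        in_text += " " + word
        if word == "<end>":       # stop once the final marker is generated
            break
    return in_text
```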
IV RESULTS
The Convolutional Neural Network and Long Short-Term Memory algorithms are used to build the model for generating captions for pictures. A total of 8000 images and their related captions are used. We have trained and tested the model, and it generates captions for the images; the model achieved an accuracy of around 70%. The user interface is built with a Flask API to provide an interactive UI for the user. The output is seen through the user interface, which consists of a button to upload an image and a dropdown to choose the model: given an uploaded image, the caption is generated by the chosen trained model. The working of the image caption generator project is shown in Fig. 23, Fig. 24 and Fig. 25. (A minimal sketch of such an interface is given after the figure and table lists below.)

Accuracies
Accuracy is the measure used to evaluate the performance of the models. Here RNN1, RNN2 and LSTM are the models that generate the captions for the images, while VGG16 and ResNet50 are the pretrained CNN models used to process the images and extract the features. The accuracies of the models are shown in Table 1.

MODEL NAME      ACCURACY (%)
Rnn1-vgg16      70
Rnn2-vgg16      71
RsNet50-LSTM    72

Table 1: Accuracies of the trained models

The RsNet50-LSTM model performs best among the three. A comparison of the accuracies of the trained models is shown in Fig. 22.

Fig 22: The accuracies of the trained models
Fig 23: Application
Fig 24: After selecting a model and uploading an image, click on the Predict button
Fig 25: The caption generated by the selected model for the uploaded image

Dataset folders
The Flickr8k dataset has around 8000 images and two folders: one for storing the images (the Images folder) and one for storing the text data related to those images, as shown in Fig. 26. Each image has five related captions, because each annotator has a different perspective on how to caption an image.

Fig 26: Dataset folders

The flow of the image captioning project is shown in Fig. 27.

Fig 27: Flow of the project

LIST OF FIGURES
Fig 1: A simple CNN architecture
Fig 2: A grayscale image as a matrix of numbers
Fig 3: Image (in green) and filter (in orange)
Fig 4: Convolution operation
Fig 5: Output after a ReLU operation
Fig 6: Max pooling operation
Fig 7: An example of a fully connected layer with four classes
Fig 8: VGG16 model
Fig 9: Long Short-Term Memory control flow
Fig 10: LSTM gates
Fig 11: LSTM forget gate
Fig 12: LSTM input gate
Fig 13: LSTM cell state
Fig 14: LSTM output gate
Fig 15: LSTM+ResNet50 architecture
Fig 16: VGG16 architecture
Fig 17: ResNet50 architecture
Fig 18: Trained model architecture
Fig 19: Loss and validation loss values over epochs for RNN1-VGG16
Fig 20: Loss and validation loss values over epochs for RNN2-VGG16
Fig 21: One of the best captionings
Fig 22: The accuracies of the trained models
Fig 23: Application
Fig 24: After selecting a model and uploading an image, click on the Predict button
Fig 25: Caption generated by the selected model for the uploaded image
Fig 26: Dataset folders
Fig 27: Flow of the project

LIST OF TABLES
Table 1: Accuracies of the trained models
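As a complement to the user interface described in the Results section above, the following is a hypothetical, minimal Flask sketch of such an application: an upload field, a dropdown to pick the trained model, and the predicted caption. The route, form fields and the predict_caption() helper are illustrative assumptions rather than the project's actual code; the helper stands in for feature extraction plus the caption generation loop sketched earlier.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

FORM = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="image">
  <select name="model">
    <option value="rnn1-vgg16">RNN1 + VGG16</option>
    <option value="rnn2-vgg16">RNN2 + VGG16</option>
    <option value="resnet50-lstm">ResNet50 + LSTM</option>
  </select>
  <button type="submit">Predict</button>
</form>
<p>{{ caption }}</p>
"""

def predict_caption(image_file, model_name):
    # Placeholder for the real pipeline: save/resize the uploaded image,
    # extract CNN features with the chosen pretrained model, then run the
    # caption generator (e.g. the generate_caption sketch shown earlier).
    return f"[caption generated by {model_name}]"

@app.route("/", methods=["GET", "POST"])
def index():
    caption = ""
    if request.method == "POST":
        image_file = request.files["image"]    # uploaded image
        model_name = request.form["model"]     # model chosen in the dropdown
        caption = predict_caption(image_file, model_name)
    return render_template_string(FORM, caption=caption)

if __name__ == "__main__":
    app.run(debug=True)
```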
VII FUTURE ENHANCEMENT
We are going to extend our work to the next level by enhancing our model to generate captions even for live video frames. Our present model generates captions only for still images; it is completely GPU-based, and captioning live video frames is not possible with general-purpose CPUs. Video captioning is a popular research area that is going to change people's lifestyles, with use cases that are widely applicable in almost every domain; it automates major tasks such as video surveillance and other security work. The model's accuracy can be boosted by training it on a larger dataset, so that the number of words in the model's vocabulary increases significantly. The use of relatively newer architectures, like GoogleNet, can also increase the accuracy of the classification task, thereby reducing the error rate in language generation.