VISVESVARAYA TECHNOLOGICAL UNIVERSITY
JNANASANGAMA, BELAGAVI – 590018

Project Phase - 1 Report on
"IMAGE TO CAPTION GENERATOR USING DEEP LEARNING"

Submitted in partial fulfillment for the award of the degree of
Bachelor of Engineering in Computer Science and Engineering

Submitted by
PRIYANKA B G      1BG18CS134
ROOPASHREE M S    1BG18CS133

B.N.M. Institute of Technology
An Autonomous Institution under VTU
Approved by AICTE, Affiliated to VTU, Accredited as Grade A Institution by NAAC. All UG branches – CSE, ECE, EEE, ISE & Mech.E accredited by NBA for academic years 2018-19 to 2020-21 and valid up to 30.06.2021.
Post Box No. 7087, 27th Cross, 12th Main, Banashankari 2nd Stage, Bengaluru – 560070, INDIA
Ph: 91-80-26711780/81/82   Email: principal@bnmit.in   www.bnmit.org

Department of Computer Science and Engineering
2021-22

CERTIFICATE

Certified that the report entitled IMAGE TO CAPTION GENERATOR USING DEEP LEARNING, carried out by Ms. PRIYANKA B G (USN 1BG18CS134) and Ms. ROOPASHREE M S (USN 1BG18CS133), bonafide students of VII Semester B.E., B.N.M. Institute of Technology, is in partial fulfillment for the award of the degree of Bachelor of Engineering in COMPUTER SCIENCE AND ENGINEERING of Visvesvaraya Technological University, Belagavi, during the year 2021-22. It is certified that all corrections/suggestions indicated for Internal Assessment have been incorporated in the report. The report has been approved as it satisfies the academic requirements in respect of Project Work Phase - 1 prescribed for the said degree.

Prof. Vani M, Assistant Professor, Department of CSE, BNMIT, Bengaluru
Dr. Sahana D. Gowda, Professor and HOD, Department of CSE, BNMIT, Bengaluru

ACKNOWLEDGEMENT

The success and final outcome of this project required a great deal of guidance and assistance from many people, and we are extremely privileged to have received this support throughout the completion of the project. We would like to thank Shri. Narayana Rao R Maanay, Secretary, BNMEI, Bengaluru, for providing an excellent environment and infrastructure in the college. We would like to sincerely thank Prof. T J Rama Murthy, Director, BNMIT, Bengaluru, for his constant support and encouragement during the course of this project. We would also like to sincerely thank Dr. S Y Kulkarni, Additional Director, BNMIT, Bengaluru, for his constant support and encouragement. We would like to express our gratitude to Prof. Eishwar N Mannay, Dean, BNMIT, Bengaluru, for his relentless support and encouragement, and to Dr. Krishnamurthy G N, Principal, BNMIT, Bengaluru, for his constant encouragement. We would like to thank Dr. Sahana D. Gowda, Professor and Head of the Department of Computer Science and Engineering, for her encouragement and motivation. We would also like to thank Prof. Vani M, Assistant Professor, Department of Computer Science and Engineering, for her valuable insight and guidance throughout the course of the project and its successful completion.
Priyanka B G   1BG18CS134
Roopashree M S 1BG18CS133

ABSTRACT

In this project, we use a CNN and an LSTM to generate captions for images. As deep learning techniques mature, large datasets and greater computing power make it possible to build models that can generate captions for an image. This is what we implement in this Python-based project, using deep learning techniques such as CNNs and RNNs. Image caption generation is a process that combines natural language processing and computer vision to recognize the context of an image and describe it in English. Following some of the core concepts of image captioning, the model is trained so that, given an input image, it generates a caption that closely describes the image. The accuracy of the model, and the fluency of the language it learns from image descriptions, are tested on different datasets. Deep learning is a relatively young field that has attracted a great deal of attention because it recognizes objects with higher precision than ever before. NLP is another field that has had an immense impact on our lives, from producing coherent summaries of text to supporting the analysis of mental illness. The image captioning task combines NLP and deep learning: describing images in a meaningful manner can be done using image captioning. Describing an image does not just mean recognizing objects; to describe an image properly, we first need to identify the objects present in the picture and then the relationships between them. In this work we use a CNN-LSTM based framework: the CNN is used to extract the features of the image, while the LSTM is used to generate the caption.

TABLE OF CONTENTS

CHAPTER 1
  1.1 Introduction
  1.2 Motivation
  1.3 Problem Statement
  1.4 Objectives
  1.5 Summary
CHAPTER 2
  2.1 Introduction
  2.2 Literature Survey
  2.3 Proposed Methodology
  2.4 Summary
REFERENCES

LIST OF FIGURES

Fig. 2.3.1.1  VGG 16 Model
Fig. 2.3.1.2  VGG 16 Architecture
Fig. 2.3.2.1  Attention Model
Fig. 2.3.3.1  LSTM Model
Fig. 2.3.3.2  Forget Gate
Fig. 2.3.3.3  Input Gate
Fig. 2.3.3.4  Output Gate

CHAPTER 1

1.1 INTRODUCTION

Making a computer system detect objects and describe them using natural language processing (NLP) is an age-old problem in Artificial Intelligence, and until recently it was considered an almost impossible task by computer vision researchers. With the growing advancements in deep learning techniques, the availability of vast datasets, and increased computational power, models can now be built that generate captions for an image. Image caption generation is a task that combines image processing and natural language processing concepts to recognize the context of an image and describe it in a natural language such as English. While human beings can do this easily, it takes a strong algorithm and a lot of computational power for a computer system to do so. Many attempts have been made to simplify this problem by breaking it down into simpler sub-problems such as object detection, image classification, and text generation. A computer system takes input images as two-dimensional arrays of pixel values, and a mapping is learned from images to captions or descriptive sentences.
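As a small illustration of this input representation (a sketch of our own, not part of the proposed system; the file name sample.jpg is only a placeholder), an image can be loaded and viewed as an array of pixel values as follows:

```python
# Minimal sketch: load an image and view it as an array of pixel values.
# Assumes Pillow and NumPy are installed; "sample.jpg" is a placeholder path.
import numpy as np
from PIL import Image

image = Image.open("sample.jpg").convert("RGB")   # read the image
pixels = np.asarray(image)                        # shape: (height, width, 3)
grayscale = np.asarray(image.convert("L"))        # two-dimensional array (height, width)

print(pixels.shape, grayscale.shape)
```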
In recent years, a lot of attention has been drawn to the task of automatically generating captions for images. However, while new datasets often spur considerable innovation, benchmark datasets also require fast, accurate, and competitive evaluation metrics to encourage rapid progress. Being able to automatically describe the content of a picture using properly formed English sentences is a very challenging task, but it could have a great impact, for example by helping visually impaired people better understand the content of images online. This task is significantly harder than, for instance, the well-studied image classification or object recognition tasks, which have been a main focus of the computer vision community. Deep learning methods have demonstrated state-of-the-art results on caption generation problems. What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption given a photograph, rather than requiring sophisticated data preparation or a pipeline of specifically designed models.

1.2 MOTIVATION

Generating captions for images is an important task at the intersection of Computer Vision and Natural Language Processing. Mimicking the human ability to describe images with a machine is itself a remarkable step along the line of Artificial Intelligence. The main challenge of this task is to capture how objects relate to each other in the image and to express those relationships in a natural language (such as English). Traditionally, computer systems have used predefined templates for generating text descriptions of images. However, this approach does not provide the variety required for lexically rich descriptions. This shortcoming has been overcome by the increased capability of neural networks: many state-of-the-art models use neural networks to generate captions, taking the image as input and predicting the next lexical unit of the output sentence.

1.3 PROBLEM STATEMENT

To develop a system for users that can automatically generate the description of an image using a CNN together with an LSTM.

1.4 OBJECTIVES

• To develop a web-based interface through which users can obtain a description of an image.
• To build a classification system that differentiates images according to their descriptions.
• To apply computer vision and natural language processing concepts to recognize the context of an image and describe it in a natural language such as English.
• To build a working model of an image caption generator by implementing a CNN with an LSTM.

1.5 SUMMARY

Although image captioning can be applied to image retrieval, and a variety of image captioning systems are available today, experimental results show that this task still has room for better-performing systems and further improvement. It mainly faces three challenges: first, how to generate complete natural language sentences as a human being would; second, how to make the generated sentences grammatically correct; and third, how to make the caption semantics as clear as possible and consistent with the given image content. For future work, we propose four possible improvements. First, an image is often rich in content, so the model should be able to generate description sentences covering multiple main objects for images with multiple targets, instead of describing only a single target object.
Second, for description corpora in different languages, a general image description system capable of handling multiple languages should be developed. Third, evaluating the output of natural language generation systems is a difficult problem: the best way to evaluate the quality of automatically generated text is subjective assessment by linguists, which is hard to obtain, so the evaluation metrics should be optimized to agree more closely with the assessments of human experts. Fourth, a very practical concern is speed: training, testing, and sentence generation should all be optimized to improve performance.

CHAPTER 2

2.1 INTRODUCTION

A literature survey in a project report is the section that presents the various analyses and research carried out in the field of interest, together with the results already published, taking into account the relevant parameters and the scope of the project. A literature survey draws on books and research papers related to the topic, as well as any material from the internet that has helped the student enhance the report, the calculations, the analysis, and the tabulation. It is one of the most important parts of the report, as it gives the students a direction for their research and helps them set a goal for the analysis, thus leading to the problem statement. When one writes a literature review for a project, one summarizes the research done by various analysts, their methodology, and the conclusions they arrived at, and also gives an account of how this research has influenced the present work. Literature surveys are needed:

• To see what has and has not been investigated.
• To identify data sources that other researchers have used.
• To learn how others have defined and measured key concepts.
• To demonstrate one's understanding of, and ability to critically evaluate, research in the field.
• To develop alternative research projects.
• To contribute to the field by moving research forward. Reviewing the literature lets one see what came before, and what did and did not work for other researchers.

2.2 LITERATURE SURVEY

Image feature extraction:

R. Subash [1] proposed a model in which reasonable sentences are framed and captions are produced using deep-learning-based Convolutional Neural Networks and Natural Language Processing (NLP) techniques. The dataset used in this model is MS-COCO. The results show that the proposed model pairs the output of a convolutional neural network with a Long Short-Term Memory network to generate descriptive captions for the image. The model also does not require a huge dataset to produce image captions.

Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares [2] proposed the Object Relation Transformer model, which focuses on the spatial relationships between objects in images through the use of Faster R-CNN with a ResNet backbone, with the main aim of better modelling the relationships between objects. The dataset used in this model is MS-COCO, with the PyCharm IDE.
The results show that the proposed model encodes the positions, sizes, and relationships between detected objects in images, extracting features by building upon the bottom-up and top-down image captioning approach with a CNN.

Goutam Dutta [3] collected an input dataset of images with five sets of sentence descriptions per image. During the training stage, the images are fed into the model as input to an RNN, and the RNN is used to predict the sequence of words forming a sentence. During the prediction stage, a held-out set of images is passed to the RNN, which generates the sentence word by word from the extracted features. A Convolutional Neural Network is used to extract features from the image, and a ranking model is used to detect the class in the image. Based on the ranking and the extracted features, a caption is generated for each image.

N Komal Kumar [4] notes that there are two main approaches to image captioning: bottom-up and top-down. The main challenge in the field of image captioning is overfitting the training data. In the CNN-LSTM architecture, modeled after the NIC architecture, a top-down approach is taken: a deep convolutional neural network generates a vectorized representation of the image, and a Long Short-Term Memory (LSTM) network then generates the caption. A generative CNN-LSTM model was implemented, and overfitting was alleviated by hyperparameter tuning of the dropout rate and the number of LSTM layers.

Ali Ashraf Mohamed [5] describes CNN image classification, which takes an input image, processes it, and classifies it into certain categories. The network scans the image from left to right and top to bottom to pull out important features and combines them to classify the image. A deep learning approach for image captioning was implemented using the sequential API of Keras with TensorFlow as the backend, achieving an effective BLEU score of 55.01% with the Xception model.

Keshav, Muley, and M. Kolhekar [6] extract the image features of the input image and use them as the input of a Region Proposal Network (RPN) to generate the set of object region proposals. For each object region, the model generates a bounding box denoting the position of the object. In addition, another RPN trained with the ground-truth region bounding boxes is used to generate the caption region proposals. Non-maximum suppression (NMS) is applied to reduce the number of object and caption ROI (region of interest) proposals, filtering out overlapping objects whose bounding boxes have an IoU greater than 0.8 (a minimal sketch of this IoU test is given after the next survey item). Finally, the object proposals are combined to form the relationship region proposals, and an ROI pooling layer is used to extract the object, relationship, and caption features corresponding to each proposal region.

Girshick [7] proposed an image detection algorithm with semantic image segmentation by combining region proposals with CNN features. Known as R-CNN, their system extracted region proposals and employed a CNN for subsequent feature learning to generate a fixed-dimensional feature vector for each proposal. Linear Support Vector Machines (SVMs) were used to classify each region. Their system was able to outperform the sliding-window CNN for object detection.
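As a minimal illustrative sketch (our own, not code taken from [6] or [7]), the IoU test used to discard heavily overlapping region proposals, with the 0.8 threshold quoted above, could be written as follows:

```python
# Minimal sketch of the IoU test used to filter overlapping proposals.
# Boxes are (x1, y1, x2, y2); the 0.8 threshold follows the value quoted in [6].

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two proposals that overlap heavily would be suppressed as duplicates:
print(iou((10, 10, 100, 100), (12, 12, 100, 100)) > 0.8)  # True
```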
Philip Kinghorn, Li Zhang, and Ling Shao [8] use subsets of ImageNet for training RNN-based human and object attribute prediction. A slightly modified version of AlexNet (Krizhevsky et al.) without its classification layers is used to extract CNN features for training the RNNs. This network extracts 4096 image features from previously cropped images of the desired objects or people, which are then paired with the attribute labels from the respective attribute dataset for training.

2.2.1 Sequence processor:

Dr S Ramacharan [9] describes a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer. The sequence processor model expects input sequences of a pre-defined length, which are fed into an Embedding layer that uses a mask to ignore padded values. Both the feature extractor and the sequence processor output a fixed-length vector; these are merged together and processed by a Dense layer to make the final prediction. The photo feature extractor model expects the input photo features to be a vector of 4,096 elements, which are processed by a Dense layer to produce a 256-element representation of the photo.

Goutam Dutta [10] uses a sequence generator to produce the sequence of indices from the description text. Each index represents a unique word, and sequences are padded at the end before being passed to the model. The target word is predicted from the input sequence and returned to the data generator for predicting the next word. The image captioning model used in this thesis consists of two parts: an image extractor that produces a vector, and a language processor that handles the sequence. The image vector representation comes from a convolutional neural network (CNN); pre-trained models are used to extract features from the images, and captions are predicted by feeding the extracted features into a dense layer. The language processor, a recurrent neural network (RNN), takes a summary of the previous words to generate the next word.

Venugopalan [11] describes image captioning as a language translation problem. Previously, language translation was complicated and involved several different tasks, but recent work has shown that the task can be achieved much more efficiently using Recurrent Neural Networks. However, regular RNNs suffer from the vanishing gradient problem, which matters for our application. The solution is to use LSTMs or GRUs, which contain internal mechanisms and gates that retain information for longer and pass on only the useful information.
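As a minimal Keras sketch of the merge architecture described in [9] (our own illustration: the 4,096-element photo feature vector and the 256-unit layer sizes follow the description above, while vocab_size and max_length are placeholder values that would come from the actual dataset):

```python
# Minimal sketch of the merge model described in [9]: photo features and a
# partial caption are encoded separately, merged, and decoded to the next word.
# vocab_size and max_length are placeholders; real values come from the dataset.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 7579   # placeholder vocabulary size
max_length = 34     # placeholder maximum caption length

# Photo feature extractor branch: 4,096-element vector -> 256-element representation.
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)

# Sequence processor branch: embedded, masked word sequence -> 256-element state.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Decoder: merge both fixed-length vectors and predict the next word.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```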
2.3 PROPOSED METHODOLOGY

2.3.1 Image Feature Extraction by CNN:

A convolutional neural network (CNN) is a neural network designed to process multi-dimensional data such as images and time series; it performs the feature extraction. VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman of the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes, and it was one of the well-known models submitted to ILSVRC-2014.

Fig. 2.3.1.1 VGG 16 Model

The input to the conv1 layer is a fixed-size 224 x 224 RGB image. The image is passed through a stack of convolutional (conv.) layers whose filters have a very small receptive field: 3×3 (the smallest size able to capture the notion of left/right, up/down, and centre). One of the configurations also uses 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by a non-linearity). The convolution stride is fixed to 1 pixel, and the spatial padding of the conv. layer input is chosen so that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for the 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not every conv. layer is followed by max-pooling). Max-pooling is performed over a 2×2 pixel window with stride 2.

Fig. 2.3.1.2 VGG 16 Architecture

Three fully connected (FC) layers follow the stack of convolutional layers (which has a different depth in the different architectures): the first two have 4096 channels each, and the third performs the 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all the networks. All hidden layers are equipped with the rectification (ReLU) non-linearity. It is also noted that none of the networks (except for one) contain Local Response Normalization (LRN); such normalization does not improve performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.
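A minimal Keras sketch of how a pre-trained VGG16 network could serve as the photo feature extractor (our own illustration, assuming ImageNet weights and taking the 4,096-element output of the second fully connected layer; sample.jpg is a placeholder path):

```python
# Minimal sketch: use a pre-trained VGG16 network (ImageNet weights) as a
# feature extractor by taking the 4,096-element output of its second FC layer.
# "sample.jpg" is a placeholder image path.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

image = load_img("sample.jpg", target_size=(224, 224))   # VGG16 input size
array = img_to_array(image)
array = preprocess_input(np.expand_dims(array, axis=0))  # add batch dimension

features = extractor.predict(array)
print(features.shape)  # (1, 4096)
```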
2.3.2 ATTENTION MECHANISM:

Fig. 2.3.2.1 Attention Model

The attention model is a method that takes n arguments y1, ..., yn (in the preceding example, the yi would be the hi) and a context c. It returns a vector z which is a "summary" of the yi, focusing on the information linked to the context c. More formally, it returns a weighted arithmetic mean z = a1*y1 + ... + an*yn, where the weights ai sum to one and are chosen according to the relevance of each yi given the context c. In the example presented before, the context is the beginning of the generated sentence, the yi are the representations of the parts of the image (the hi), and the output is a representation of the filtered image, with the focus placed on the part of the image that is interesting for the word currently being generated. One useful feature of the attention model is that the weights of the arithmetic mean are accessible and can be plotted. A neural network can be seen as an effort to mimic human brain actions in a simplified manner; the attention mechanism is likewise an attempt to implement, in deep neural networks, the same action of selectively concentrating on a few relevant things while ignoring the rest.

2.3.3 Long Short-Term Memory (LSTM):

Fig. 2.3.3.1 LSTM Model

Recurrent Neural Networks suffer from short-term memory: if a sequence is long enough, they have a hard time carrying information from earlier time steps to later ones. So, when processing a paragraph of text to make predictions, an RNN may leave out important information from the beginning. During back-propagation, recurrent neural networks also suffer from the vanishing gradient problem. Gradients are the values used to update the network's weights; the vanishing gradient problem occurs when the gradient shrinks as it is back-propagated through time. If a gradient value becomes extremely small, it does not contribute much to learning.

An LSTM has a similar control flow to a recurrent neural network: it processes data and passes information on as it propagates forward. The difference lies in the operations inside the LSTM's cells. The core concepts of the LSTM are the cell state and its various gates. The cell state acts as a transport highway that carries relevant information all the way down the sequence chain; it can be thought of as the "memory" of the network. In theory, the cell state can carry relevant information throughout the processing of the sequence, so even information from early time steps can reach later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information is added or removed via gates. The gates are small neural networks that decide which information is allowed onto the cell state, and they learn which information is relevant to keep or forget during training.

Fig. 2.3.3.2 Forget Gate

First, there is the forget gate. This gate decides what information should be thrown away or kept. Information from the previous hidden state and from the current input is passed through the sigmoid function, which outputs values between 0 and 1: the closer to 0, the more is forgotten; the closer to 1, the more is kept.

Fig. 2.3.3.3 Input Gate

To update the cell state, there is the input gate. First, the previous hidden state and the current input are passed through a sigmoid function, which decides which values will be updated by mapping them to values between 0 and 1 (0 means not important, 1 means important). The hidden state and current input are also passed through the tanh function, squashing the values between -1 and 1 to help regulate the network. The tanh output is then multiplied by the sigmoid output, so that the sigmoid output decides which information from the tanh output is important to keep.

Fig. 2.3.3.4 Output Gate

Cell state: at this point there is enough information to compute the new cell state. First, the cell state is point-wise multiplied by the forget vector, which can drop values in the cell state when they are multiplied by values near 0. Then the output of the input gate is added point-wise, updating the cell state to the new values that the network finds relevant. This gives the new cell state.

Last is the output gate, which decides what the next hidden state should be. Remember that the hidden state contains information about previous inputs and is also used for predictions. First, the previous hidden state and the current input are passed through a sigmoid function. Then the newly modified cell state is passed through the tanh function, and the tanh output is multiplied by the sigmoid output to decide what information the hidden state should carry. The output is the new hidden state. The new cell state and the new hidden state are then carried over to the next time step.
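The gate computations described above can be summarised in a short NumPy sketch (our own illustration with random placeholder weights, not code from the proposed system):

```python
# Minimal NumPy sketch of one LSTM step, matching the gate description above.
# Weight matrices and inputs are random placeholders purely for illustration.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step: x_t is the input, h_prev/c_prev the previous hidden/cell state."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate values
    c_t = f_t * c_prev + i_t * g_t          # new cell state: forget, then add
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.normal(size=(n_hid, n_in)) for k in "fiog"}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in "fiog"}
b = {k: np.zeros(n_hid) for k in "fiog"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
print(h.shape, c.shape)  # (3,) (3,)
```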
Flickr8k Dataset:

Generating a caption for a given image is a challenging problem in the deep learning domain. In this work, we use techniques from computer vision and NLP to recognize the context of an image and describe it in a natural language such as English, and we build a working model of the image caption generator using CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) units. For training our model we use the Flickr8k dataset, which consists of 8,000 unique images, each mapped to five different sentences that describe the image.
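As a minimal sketch of how the Flickr8k caption annotations could be organised for training (our own illustration; it assumes the commonly distributed Flickr8k.token.txt file with one "image_name#index<TAB>caption" entry per line, which is an assumption about the dataset layout rather than a detail stated above):

```python
# Minimal sketch: build a dictionary mapping each Flickr8k image to its five captions.
# Assumes the common "image.jpg#0<TAB>caption" layout; the file path is a placeholder.
from collections import defaultdict

def load_captions(token_path="Flickr8k.token.txt"):
    captions = defaultdict(list)
    with open(token_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t")
            image_name = image_id.split("#")[0]           # drop the "#0".."#4" suffix
            captions[image_name].append(caption.lower())  # simple normalisation
    return captions

captions = load_captions()
print(len(captions))              # roughly 8,000 images
print(list(captions.items())[0])  # one image and its five captions
```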
2.4 SUMMARY

Image captioning has recently gathered a lot of attention, particularly in the natural language domain. There is a pressing need for context-based natural language descriptions of images. This may once have seemed far-fetched, but recent developments in fields such as neural networks, computer vision, and natural language processing have paved the way for accurately describing images, i.e. representing their visually grounded meaning. We leverage state-of-the-art techniques such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), together with appropriate datasets of images and their human-provided descriptions, to achieve this. We demonstrate that our alignment model produces results in retrieval experiments on datasets such as Flickr8k.

REFERENCES

[1] Wu, Q., Shen, C., Liu, L., Dick, A., & van den Hengel, A. (2016). What value do explicit high level concepts have in vision to language problems? Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 203-212.
[2] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.
[3] Cho, K., Courville, A., & Bengio, Y. (2015). Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, Vol. 17.
[4] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi, "Understanding of a convolutional neural network", IEEE, 2017.
[5] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and Tell: A Neural Image Caption Generator", CVPR, 2015.
[6] Ma, Ningning, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design", Proceedings of the European Conference on Computer Vision (ECCV), pp. 116-131, 2018.
[7] Rehab Alahmadi, Chung Hyuk Park, and James Hahn, "Sequence-to-sequence image caption generator", ICMV, 2018.
[8] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A Comprehensive Survey of Deep Learning for Image Captioning", ACM Computing Surveys, 2019.
[9] Haoran Wang, Yue Zhang, and Xiaosheng Yu, "An Overview of Image Caption Generation Methods", Computational Intelligence and Neuroscience, 2020.
[10] Krishnakumar, Koushalya, Gokul, Karthikeyan, and Kaviarasi, "Image Caption Generator Using Deep Learning", International Journal of Advanced Science and Technology, 2020.
[11] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A Comprehensive Survey of Deep Learning for Image Captioning", ACM Computing Surveys, 2019.
[12] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A Comprehensive Survey of Deep Learning for Image Captioning", ACM Computing Surveys, 2019.
[13] Priyanka Kalena, Nishi Malde, Aromal Nair, Saurabh Parkar, and Grishma Sharma, "Visual Image Caption Generator Using Deep Learning", ICAST, 2019.
[14] Hossain, MD Zakir, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A comprehensive survey of deep learning for image captioning", ACM Computing Surveys (CSUR) 51, no. 6, 2019.
[15] Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al., "ImageNet large scale visual recognition challenge", International Journal of Computer Vision 115, no. 3, 2015.
[16] Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin, "Convolutional sequence to sequence learning", arXiv preprint arXiv:1705.03122, 2017.
[17] Yu, Jun, Jing Li, Zhou Yu, and Qingming Huang, "Multimodal transformer with multi-view visual representation for image captioning", IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[18] J. H. Tan, C. S. Chan, and J. H. Chuah, "Image Captioning with Sparse Recurrent Neural Network", arXiv preprint arXiv:1908.10797, 2019.
[19] Q. Wang and A. B. Chan, "CNN+CNN: Convolutional decoders for image captioning", arXiv preprint arXiv:1805.09019, 2018.