VISVESVARAYA TECHNOLOGICAL UNIVERSITY
JNANASANGAMA, BELAGAVI – 590018
Project Phase - 1 Report
On
“IMAGE TO CAPTION GENERATOR USING
DEEP LEARNING”
Submitted in partial fulfillment for the award of the degree of
Bachelor of Engineering
In
Computer Science and Engineering
Submitted by
PRIYANKA B G 1BG18CS134
ROOPASHREE M S 1BG18CS133
B.N.M. Institute of Technology
An Autonomous Institution under VTU
Approved by AICTE, Affiliated to VTU, Accredited as grade A Institution by NAAC.
All UG branches – CSE, ECE, EEE, ISE & Mech.E accredited by NBA for academic years 2018-19 to 2020-21 & valid
upto 30.06.2021
Post box no. 7087, 27th cross, 12th Main, Banashankari 2nd Stage, Bengaluru- 560070, INDIA
Ph: 91-80-26711780/81/82, Email: principal@bnmit.in, www.bnmit.org
Department of Computer Science and Engineering
2021-22
B.N.M. Institute of Technology
An Autonomous Institution under VTU
Approved by AICTE, Affiliated to VTU, Accredited as grade A Institution by NAAC.
All UG branches – CSE, ECE, EEE, ISE & Mech.E accredited by NBA for academic years 2018-19 to 2020-21 & valid upto
30.06.2021
Post box no. 7087, 27th cross, 12th Main, Banashankari 2nd Stage, Bengaluru- 560070, INDIA
Ph: 91-80-26711780/81/82, Email: principal@bnmit.in, www.bnmit.org
Department of Computer Science and Engineering
CERTIFICATE
Certified that the report entitled IMAGE TO CAPTION GENERATOR USING DEEP
LEARNING, carried out by Ms. PRIYANKA B G, USN 1BG18CS134, and Ms.
ROOPASHREE M S, USN 1BG18CS133, bona fide students of VII Semester B.E.,
B.N.M. Institute of Technology, is in partial fulfillment for the award of the degree of
Bachelor of Engineering in COMPUTER SCIENCE AND ENGINEERING of the
Visvesvaraya Technological University, Belagavi, during the year 2021-22. It is certified
that all corrections/suggestions indicated for Internal Assessment have been incorporated
in the report. The report has been approved as it satisfies the academic requirements in
respect of Project Work Phase-1 prescribed for the said Degree.
Prof. Vani M
Assistant Professor
Department of CSE
BNMIT, Bengaluru

Dr. Sahana D. Gowda
Professor and HOD
Department of CSE
BNMIT, Bengaluru
ACKNOWLEDGEMENT
The success and final outcome of this project required a lot of guidance and
assistance from many people and I am extremely privileged to have got this
all along the completion of my project.
I would like to thank Shri. Narayana Rao R Maanay, Secretary, BNMEI,
Bengaluru, for providing an excellent environment and infrastructure in
the college.
I would like to sincerely thank Prof. T J Rama Murthy, Director, BNMIT,
Bengaluru for having extended his constant support and encouragement
during the course of this project.
I would like to sincerely thank Dr. S Y Kulkarni, Additional Director,
BNMIT, Bengaluru for having extended his constant support and
encouragement during the course of this project.
I would like to express my gratitude to Prof. Eishwar N Mannay, Dean,
BNMIT, Bengaluru for his relentless support and encouragement.
I would like to thank Dr. Krishnamurthy G N, Principal, BNMIT,
Bengaluru for his constant encouragement.
I would like to thank Dr. Sahana D. Gowda, Professor & Head of the
Department of Computer Science and Engineering for the encouragement
and motivation she provides.
I would also like to thank Prof. Vani M, Assistant Professor, Department
of Computer Science and Engineering for providing me with her valuable
insight and guidance wherever required throughout the course of the project
and its successful completion.
Priyanka B G 1BG18CS134
Roopashree M S 1BG18CS133
ABSTRACT
In this project, we use a CNN and an LSTM to generate captions for images. As
deep learning techniques mature, large datasets and greater computing power make it
possible to build models that can generate captions for an image. This is what we
implement in this Python-based project, using deep learning techniques such as CNNs
and RNNs. Image caption generation is a process that combines natural language
processing and computer vision concepts to recognize the context of an image and
present it in English. In this report, we follow some of the core concepts of image
captioning: the model is trained in such a way that, given an input image, it generates
captions that closely describe the image. The accuracy of the model, and the fluency of
the language it learns from image descriptions, are tested on different datasets. Deep
learning is a relatively new field that has attracted a great deal of attention because it
recognizes objects with higher precision than ever before. NLP is another field that has
had an enormous impact on our lives, ranging from generating coherent summaries of
texts to the analysis of mental illness. The image captioning task combines both NLP
and deep learning. Describing pictures in a meaningful way is possible using image
captioning. Describing a picture does not simply mean recognizing objects; to describe
a picture properly we first need to identify the objects present in the picture and then
the relationships between those objects. In this work we use a CNN-LSTM based
framework: the CNN is used to extract features of the image, while the LSTM is used
to generate the caption.
Table of Contents

CONTENTS                                 Page No.
1. CHAPTER 1
   1.1 Introduction                      1
   1.2 Motivation                        2
   1.3 Problem Statement                 2
   1.4 Objectives                        2
   1.5 Summary                           3
2. CHAPTER 2
   2.1 Introduction                      4
   2.2 Literature Survey                 5
   2.3 Proposed Methodology              8
   2.4 Summary                           14
3. REFERENCES                            15
List of Figures

Figure No.   Figure Name            Page No.
2.3.1.1      VGG 16 Model           8
2.3.1.2      VGG 16 Architecture    9
2.3.2.1      Attention Model        10
2.3.3.1      LSTM Model             11
2.3.3.2      Forget Gate            12
2.3.3.3      Input Gate             12
2.3.3.4      Output Gate            13
Chapter 1
1.1 INTRODUCTION
Making a computer system detect objects and describe them using natural language
processing (NLP) is an age-old problem of Artificial Intelligence, long considered impossible by
computer vision researchers. With the growing advancements in deep learning techniques, the
availability of vast datasets, and increased computational power, models can now be built that
generate captions for an image. Image caption generation is a task that involves image
processing and natural language processing concepts to recognize the context of an image and
describe it in a natural language such as English. While human beings are able to do this easily,
it takes a strong algorithm and a lot of computational power for a computer system to do so.
Many attempts have been made to simplify this problem and break it down into simpler
problems such as object detection, image classification, and text generation. A computer system
takes input images as two-dimensional arrays, and a mapping is learned from images to captions
or descriptive sentences.
In recent years a lot of attention has been drawn to the task of automatically generating
captions for images. However, while new datasets often spur considerable innovation, benchmark
datasets also require fast, accurate, and competitive evaluation metrics to encourage rapid
progress. Being able to automatically describe the content of a picture using properly formed
English sentences is a very challenging task, but it could have an excellent impact, for example
by helping visually impaired people better understand the content of images online. This task is
significantly harder than, for instance, the well-studied image classification or visual
perception tasks, which are a main focus within the computer vision community.
Deep learning methods have demonstrated state-of-the-art results on caption generation problems.
What is most impressive about these methods is that a single end-to-end model can be defined to
predict a caption, given a photograph, rather than requiring sophisticated data preparation or a
pipeline of specifically designed models.
1.2 MOTIVATION
Generating captions for images is a vital task relevant to the area of both Computer Vision and
Natural Language Processing. Mimicking the human ability of providing descriptions for images
by a machine is itself a remarkable step along the line of Artificial Intelligence. The main
challenge of this task is to capture how objects relate to each other in the image and to express
them in a natural language (like English).Traditionally, computer systems have been using
predefined templates for generating text descriptions for images. However, 1 this approach does
not provide sufficient variety required for generating lexically rich text descriptions. This
shortcoming has been suppressed with the increased efficiency of neural networks. Many state of
art models use neural networks for generating captions by taking image as input and predicting
next lexical unit in the output sentence
1.3 PROBLEM STATEMENT
To develop a system for users that can automatically generate the description of an image
using a CNN along with an LSTM.
1.4 OBJECTIVES
• To develop a web-based interface for users to obtain the description of an image.
• To build a classification system that differentiates images according to their descriptions.
• To apply computer vision and natural language processing concepts to recognize the context
of an image and describe it in a natural language such as English.
• To build a working model of an image caption generator by implementing a CNN with an LSTM.
1.5 SUMMARY
Although image captioning can be applied to image retrieval, and a variety of image captioning
systems are available today, experimental results show that the task still leaves considerable
room for improvement. It mainly faces three challenges: first, how to generate complete
natural language sentences like a human being; second, how to make the generated sentences
grammatically correct; and third, how to make the caption semantics as clear as possible and
consistent with the given image content. For future work, we propose the following four possible
improvements.
An image is often rich in content. The model should be able to generate description sentences
corresponding to multiple main objects for images with multiple target objects, instead of
describing only a single target object.
For description corpora in different languages, a general image description system capable of
handling multiple languages should be developed.
Evaluating the output of natural language generation systems is a difficult problem. The best way
to evaluate the quality of automatically generated text is subjective assessment by linguists,
which is hard to achieve. In order to improve system performance, the evaluation indicators
should be optimized to bring them more in line with the assessments of human experts.
A very real problem is speed: the training, testing, and sentence-generation stages of the model
should be optimized to improve performance.
Chapter 2
2.1 INTRODUCTION
A literature survey in a project report is the section that presents the various analyses and
research carried out in the field of interest and the results already published, taking into
account the various parameters and the extent of the project. A literature survey refers to
gathering content from books and research papers related to the topic of the project. Any
material from the internet that is related to the project, that is valuable to the student, and
that has helped the student enhance the report, as well as the calculations, analysis, and
tabulation, is reflected in the survey.
It is necessary to emphasize that this is one of the most important parts of the project report,
as it gives the students a direction in their area of research. It helps the students set a goal
for their analysis, thus giving them their problem statement.
When one writes a literature review for a project, one has to describe the research carried out
by various analysts, their methodology, and the conclusions they arrived at. One should also
give an account of how this research has influenced the thesis.
Literature surveys are needed:
• To see what has and has not been investigated.
• To identify data sources that other researchers have used.
• To learn how others have defined and measured key concepts.
• To demonstrate one's understanding of, and ability to critically evaluate, research in the field.
• To develop alternative research projects.
• To contribute to the field by moving research forward. Reviewing the literature lets one see
what came before, and what did and did not work for other researchers.
2.2 LITERATURE SURVEY
Image feature extraction:
R. Subash [1] proposed a model in which deep-learning-based Convolutional Neural Networks and
Natural Language Processing (NLP) techniques are used to frame reasonable sentences and produce
captions. The dataset used in this model is MS-COCO. Results show that in the proposed model the
output of the convolutional neural network is fed to a Long Short-Term Memory network, which
helps generate descriptive captions for the image. The model also does not require a huge dataset
to produce captions for images.
Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares [2] proposed the Object Relation
Transformer model, which focuses on the spatial relationships between objects in images through
the use of Faster R-CNN with ResNet, with the main goal of improving the modelling of
relationships between objects. The dataset used in this model is MS-COCO, with the PyCharm IDE.
Results show that the proposed model encodes the positions, sizes, and relationships of detected
objects in images and extracts features by building upon the bottom-up and top-down image
captioning approach and a CNN.
Goutam Dutta [3] collected an input dataset of images, each with five sentence descriptions.
During the training stage, the images are fed into the model as input to an RNN, and the RNN is
used to predict the sequence of words in the form of a sentence. During the prediction stage, a
held-out set of images is passed to the RNN, which generates the sentence word by word from the
extracted features. A Convolutional Neural Network is used to extract features from the image,
and a ranking model is used to detect the class in the image. Based on the ranking and the
extracted features, a caption is generated for each image.
N Komal Kumar [4] notes that there are two main approaches to image captioning: bottom-up and
top-down. The main challenge in the field of image captioning is overfitting the training data.
In a CNN-LSTM architecture modelled after the NIC architecture, a top-down approach is taken: a
deep convolutional neural network is used to generate a vectorized representation of an image,
and a Long Short-Term Memory (LSTM) network then generates the caption. A generative CNN-LSTM
model was implemented, and overfitting was alleviated through hyperparameter tuning using
dropout and the number of LSTM layers. The deep convolutional neural network generates a
vectorized representation of the image, and finally the LSTM network generates the captions.
Ali Ashraf Mohamed [5] describes how CNN image classification takes an input image, processes
it, and classifies it under certain categories. It scans the image from left to right and top to
bottom to pull out important features and combines those features to classify the image. A deep
learning approach for image captioning was implemented: the sequential API of Keras was used
with TensorFlow as a backend to implement the deep learning architecture, achieving an effective
BLEU score of 55.01% with the Xception model.
Keshav, Muley, and M. Kolhekar [6] extract the image features of the input image and use them as
the input of a Region Proposal Network (RPN) to generate the object region proposal set. For
each object region, the model generates a bounding box denoting the position of the object. On
the other hand, another RPN trained with the ground-truth region bounding boxes is used to
generate the caption region proposals. Non-maximum suppression (NMS) is applied to reduce the
number of object and caption region-of-interest (ROI) proposals and to filter out overlapping
objects whose bounding-box IoU is greater than 0.8. Finally, the object proposals are combined
to form the relationship region proposals, and an ROI pooling layer is used to extract the
object, relationship, and caption features corresponding to each proposal region.
Girshick [7] proposed an image detection algorithm with semantic image segmentation by
combining region proposals with CNN features. Known as R-CNN, their system extracted region
proposals and employed a CNN for subsequent feature learning to generate a fixed-dimensional
feature vector for each proposal. Linear Support Vector Machines (SVMs) were used to classify
each region. Their system was able to outperform the sliding-window CNN for object detection.
Philip Kinghorn, Li Zhang, and Ling Shao [8] use subsets of ImageNet for training RNN-based
human and object attribute prediction, respectively. A slightly modified version of AlexNet
from Krizhevsky et al., without classification layers, is implemented to extract CNN features
for training the RNNs. This network extracts 4096 image features from previously cropped images
of the desired objects or people, which are then paired with the attribute labels from the
respective attribute dataset for training.
2.2.1 Sequence processor:
Dr. S. Ramacharan [9] describes a word embedding layer for handling the text input, followed by
a Long Short-Term Memory (LSTM) recurrent neural network layer. The sequence processor model
expects input sequences of a pre-defined length, which are fed into an Embedding layer that uses
a mask to ignore padded values. Both the feature extractor and the sequence processor output a
fixed-length vector. These are merged together and processed by a Dense layer to make the final
prediction. The photo feature extractor model expects the input photo features to be a vector of
4,096 elements; these are processed by a Dense layer to produce a 256-element representation of
the photo.
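A minimal Keras sketch of this merge architecture, assuming the layer sizes described above
(4,096-element photo features, 256-unit Embedding/LSTM/Dense layers), is shown below; vocab_size
and max_length are placeholders to be set from the training captions.

# Sketch of the merge (feature extractor + sequence processor) captioning model,
# assuming photo features are pre-extracted 4096-element vectors (e.g. VGG16 fc2).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Photo feature branch: 4096-element vector -> 256-element representation
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # Sequence processor branch: word indices -> Embedding (mask padded zeros) -> LSTM
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # Decoder: merge both fixed-length vectors and predict the next word
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model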
Goutam Dutta [10] uses a sequence generator to produce a sequence of indices from the
description text. Each index represents a unique word, and padding is added at the end of each
sequence. The target word sequence is predicted from the input sequence and is returned to the
data generator for predicting the next word. The image captioning model used in this work
consists of two parts: an image feature extractor that represents the image as a vector, and a
language processor that handles the caption as a sequence. The image vector representation comes
from a convolutional neural network (CNN); pre-trained models are used to extract features from
the images, and captions are predicted by feeding the extracted features into a dense layer. The
language processor, a recurrent neural network (RNN), takes a summary of the previous words to
generate the next word.
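As an illustration of this word-by-word prediction, a minimal greedy decoding sketch is given
below. It assumes a trained merge model as sketched earlier, a fitted Keras Tokenizer, a (1, 4096)
photo feature array, and 'startseq'/'endseq' marker tokens wrapping each training caption; these
names are assumptions for illustration only.

# Sketch: generate a caption word by word from pre-extracted photo features.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    text = 'startseq'                                      # assumed start-of-caption marker
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]      # words seen so far -> indices
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_feature, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':               # assumed end-of-caption marker
            break
        text += ' ' + word
    return text.replace('startseq', '').strip()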
Venugopalan [11] has described image captioning as a language translation problem. Previously,
language translation was complicated and included several different tasks, but recent work has
shown that the task can be achieved much more efficiently using Recurrent Neural Networks.
However, regular RNNs suffer from the vanishing gradient problem, which is critical for our
application. The solution is to use LSTMs and GRUs, which contain internal mechanisms and logic
gates that retain information for a longer time and pass on only useful information.
2.3 PROPOSED METHODOLOGY
2.3.1 Image Feature Extraction by CNN:
A neural network designed to process multi-dimensional data, such as images and time-series
data, is called a convolutional neural network (CNN). It performs feature extraction. VGG16 is a
convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University
of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The
model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images
belonging to 1000 classes, and is one of the well-known models submitted to ILSVRC-2014.
Fig. 2.3.1.1 VGG 16 Model
The input to the conv1 layer is a fixed-size 224 x 224 RGB image. The image is passed through a
stack of convolutional (conv.) layers, where the filters use a very small receptive field:
3×3 (the smallest size that captures the notion of left/right, up/down, and centre). In one of
the configurations, it also utilizes 1×1 convolution filters, which can be seen as a linear
transformation of the input channels (followed by a non-linearity). The convolution stride is
fixed to 1 pixel; the spatial padding of the conv. layer input is such that the spatial
resolution is preserved after convolution, i.e. the padding is 1 pixel for the 3×3 conv. layers.
Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers
(not all of the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2
pixel window, with stride 2.
Fig. 2.3.1.2 VGG 16 Architecture
Three fully-connected (FC) layers follow the stack of convolutional layers (which has a
different depth in different architectures): the first two have 4096 channels each, and the
third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each
class). The final layer is the soft-max layer. The configuration of the fully connected layers
is the same in all networks. All hidden layers are equipped with the rectification (ReLU)
non-linearity. It is also noted that none of the networks (except for one) contain Local
Response Normalization (LRN); such normalization does not improve performance on the ILSVRC
dataset, but leads to increased memory consumption and computation time.
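A minimal sketch of how the 4,096-element VGG16 features described above could be extracted with
Keras is shown below. The image path is a placeholder, and the 1000-way classification layer is
dropped so that the second fully-connected (fc2) layer serves as the image representation.

# Sketch: extract a 4096-element feature vector from one image using pre-trained VGG16.
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Load VGG16 trained on ImageNet and re-wire it to output the second fully-connected
# (fc2, 4096-channel) layer instead of the final 1000-way softmax.
base = VGG16(weights='imagenet')
extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

image = load_img('example.jpg', target_size=(224, 224))   # placeholder image path
x = img_to_array(image).reshape((1, 224, 224, 3))
x = preprocess_input(x)                                    # VGG-style preprocessing
feature = extractor.predict(x)                             # shape (1, 4096)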
2.3.2 ATTENTION MECHANISM:
Fig. 2.3.2.1 Attention Model
The attention model is a method that takes n arguments y1, …, yn (in the preceding examples, the
yi would be the hi) and a context c. It returns a vector z which is intended to be a "summary"
of the yi, focusing on the information linked to the context c. More formally, it returns a
weighted arithmetic mean of the yi, where the weights are chosen according to the relevance of
each yi given the context c. In the example presented before, the context is the beginning of
the generated sentence, the yi are the representations of the parts of the image (the hi), and
the output is a representation of the filtered image, with a filter putting the focus on the
part relevant to the word currently being generated. One interesting feature of the attention
model is that the weights of the arithmetic mean are accessible and can be plotted. A neural
network can be considered an effort to mimic human brain actions in a simplified manner; the
attention mechanism is likewise an attempt to implement the same action of selectively
concentrating on a few relevant things, while ignoring others, in deep neural networks.
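As a minimal sketch, not tied to any particular library, the weighted arithmetic mean described
above can be written as follows; the relevance scores are illustrative dot products between the
context c and each yi.

# Sketch of soft attention: z is a weighted mean of y_1..y_n, weighted by relevance to c.
import numpy as np

def attend(Y, c):
    """Y: (n, d) array of vectors y_i; c: (d,) context vector."""
    scores = Y @ c                             # relevance of each y_i to the context
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax -> attention weights
    z = weights @ Y                            # weighted arithmetic mean of the y_i
    return z, weights                          # the weights can be plotted for inspection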
2.3.3 Long Short-Term Memory (LSTM):
Fig. 2.3.3.1 LSTM Model
Recurrent Neural Networks suffer from short-term memory. If a sequence is long enough, they have
a hard time carrying information from earlier time steps to later ones. So if you are trying to
process a paragraph of text to make predictions, RNNs may leave out important information from
the beginning. During back-propagation, recurrent neural networks suffer from the vanishing
gradient problem. Gradients are the values used to update a neural network's weights. The
vanishing gradient problem occurs when the gradient shrinks as it back-propagates through time;
if a gradient value becomes extremely small, it does not contribute much to learning.
An LSTM has a control flow similar to that of a recurrent neural network: it processes data,
passing information on as it propagates forward. The differences are the operations within the
LSTM's cells. The core concepts of LSTMs are the cell state and its various gates. The cell
state acts as a transport highway that transfers relevant information all the way down the
sequence chain; you can think of it as the "memory" of the network. The cell state, in theory,
can carry relevant information throughout the processing of the sequence, so even information
from earlier time steps can make its way to later time steps, reducing the effects of short-term
memory. As the cell state goes on its journey, information gets added to or removed from the
cell state via gates. The gates are different neural networks that decide which information is
allowed on the cell state; they can learn what information is relevant to keep or forget during
training.
Fig. 2.3.3.2 Forget Gate
First, we have the forget gate. This gate decides what information should be thrown away or
kept. Information from the previous hidden state and from the current input is passed through
the sigmoid function, and values come out between 0 and 1. The closer to 0, the more the
information is forgotten; the closer to 1, the more it is kept.
Fig. 2.3.3.3 Input gate
To update the cell state, we have the input gate. First, we pass the previous hidden state and current
input into a sigmoid function. That decides which values will be updated by transforming the
values to be between 0 and 1. 0 means not important and 1 means important. You also pass the
hidden state and current input into the tanh function to squish values between -1 and 1 to help
regulate the network. Then you multiply the tanh output with the sigmoid output. The sigmoid
output will decide which information is important to keep from the tanh output.
Fig. 2.3.3.4 Output gate
Cell State:
Now we have enough information to calculate the cell state. First, the cell state is pointwise
multiplied by the forget vector; this can drop values in the cell state if they are multiplied
by values near 0. Then we take the output from the input gate and do a pointwise addition, which
updates the cell state with the new values that the neural network finds relevant. That gives us
our new cell state.
Last, we have the output gate. The output gate decides what the next hidden state should be.
Remember that the hidden state contains information on previous inputs; the hidden state is also
used for predictions. First, we pass the previous hidden state and the current input into a
sigmoid function. Then we pass the newly modified cell state to the tanh function. We multiply
the tanh output with the sigmoid output to decide what information the hidden state should
carry. The output is the hidden state. The new cell state and the new hidden state are then
carried over to the next time step.
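The three gates and the cell-state update described above can be summarised in the following
minimal sketch of a single LSTM time step; the weight matrices and biases are assumed to be
already learned, and each gate sees the previous hidden state concatenated with the current
input.

# Sketch of one LSTM time step with forget (f), input (i) and output (o) gates.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    z = np.concatenate([h_prev, x_t])     # previous hidden state + current input
    f = sigmoid(Wf @ z + bf)              # forget gate: what to keep from the old cell state
    i = sigmoid(Wi @ z + bi)              # input gate: which candidate values to add
    c_tilde = np.tanh(Wc @ z + bc)        # candidate values squashed to [-1, 1]
    c_t = f * c_prev + i * c_tilde        # new cell state
    o = sigmoid(Wo @ z + bo)              # output gate: what the next hidden state carries
    h_t = o * np.tanh(c_t)                # new hidden state
    return h_t, c_t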
Flickr8k Dataset:
Generating a caption for a given image is a challenging problem in the deep learning domain. In
this project, we use techniques from computer vision and NLP to recognize the context of an
image and describe it in a natural language such as English. We build a working model of the
image caption generator using CNN (Convolutional Neural Network) and LSTM (Long Short-Term
Memory) units.
For training our model we use the Flickr8k dataset. It consists of 8,000 unique images, and each
image is mapped to five different sentences that describe it.
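A minimal sketch of loading the Flickr8k captions is shown below. It assumes the standard
Flickr8k token file (the file name is an assumption), in which each line holds an image name, a
caption index (0-4), and the caption text, separated by a tab.

# Sketch: build a mapping image_id -> list of its five captions from the Flickr8k token file.
def load_captions(token_path='Flickr8k.token.txt'):        # assumed file name from the dataset
    captions = {}
    with open(token_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_tag, caption = line.split('\t')           # "name.jpg#0" <tab> "a child ..."
            image_id = image_tag.split('#')[0]
            captions.setdefault(image_id, []).append(caption.lower())
    return captions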
2.4 SUMMARY
Image captioning has recently gathered a lot of attention, specifically in the natural language
domain. There is a pressing need for context-based natural language description of images. This
may seem a bit far-fetched, but recent developments in fields like neural networks, computer
vision, and natural language processing have paved the way for accurately describing images,
i.e. representing their visually grounded meaning. We leverage state-of-the-art techniques such
as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and appropriate
datasets of images paired with human-written descriptions to achieve this. We demonstrate that
our alignment model produces results in retrieval experiments on datasets such as Flickr8k.
REFERENCES
[1] Wu, Q., Shen, C., Liu, L., Dick, A., & van den Hengel, A. (2016). What value do explicit
high level concepts have in vision to language problems? Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 203–212.
[2] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[3] Cho, K., Courville, A., & Bengio, Y. (2015). Describing multimedia content using
attention-based encoder-decoder networks. IEEE Transactions on Multimedia, Vol. 17.
[4] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi, "Understanding of a convolutional
neural network", IEEE, 2017.
[5] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and Tell: A
Neural Image Caption Generator", CVPR, 2015.
[6] Ma, Ningning, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. "ShuffleNet V2: Practical
guidelines for efficient CNN architecture design." In Proceedings of the European Conference
on Computer Vision (ECCV), pp. 116–131, 2018.
[7] Rehab Alahmadi, Chung Hyuk Park, and James Hahn, "Sequence-to-sequence image caption
generator", ICMV, 2018.
[8] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A
Comprehensive Survey of Deep Learning for Image Captioning", ACM, 2019.
[9] Haoran Wang, Yue Zhang, and Xiaosheng Yu, "An Overview of Image Caption Generation
Methods", CIN, 2020.
[10] Krishnakumar, Koushalya, Gokul, Karthikeyan, and Kaviarasi, "Image Caption Generator
Using Deep Learning", International Journal of Advanced Science and Technology, 2020.
[11] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A
Comprehensive Survey of Deep Learning for Image Captioning", ACM, 2019.
[12] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A
Comprehensive Survey of Deep Learning for Image Captioning", ACM, 2019.
[13] Priyanka Kalena, Nishi Malde, Aromal Nair, Saurabh Parkar, and Grishma Sharma, "Visual
Image Caption Generator Using Deep Learning", ICAST, 2019.
[14] Hossain, MD Zakir, Ferdous Sohel, Mohd Fairuz Shiratuddin. "A comprehensive survey of
deep learning for image captioning." ACM Computing Surveys (CSUR) 51, no. 6 (2019).
[14] Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang et al. "ImageNet large scale visual recognition challenge." International Journal of
Computer Vision 115, no. 3 (2015).
[15] Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin.
"Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122, 2017.
[17] Yu, Jun, Jing Li, Zhou Yu, and Qingming Huang. "Multimodal transformer with multi-view
visual representation for image captioning." IEEE Transactions on Circuits and Systems for
Video Technology, 2019.
[18] J. H. Tan, C. S. Chan, and J. H. Chuah, "Image Captioning with Sparse Recurrent Neural
Network," arXiv preprint arXiv:1908.10797, 2019.
[19] Q. Wang and A. B. Chan, "CNN+CNN: convolutional decoders for image captioning," arXiv
preprint arXiv:1805.09019, 2018.