The International journal of analytical and experimental modal analysis
ISSN NO:0886-9367
IMAGE CAPTION GENERATOR USING DEEP LEARNING

Ch. Sneha¹, B. Premanvitha², B. Shanmukh³, Kavitha Chaduvula⁴
¹,²,³ UG Student, ⁴ Professor and Head
IT Department, Gudlavalleru Engineering College, Gudlavalleru-521356
Abstract- Computer vision has become ubiquitous in our society, with applications in several fields. In this paper, we focus on one of the visual recognition facets of computer vision, i.e. image captioning. The problem of generating language descriptions for visual data has been studied for a long time, but mainly in the field of videos. In the last few years, emphasis has been laid on describing still images with natural text. Due to recent advancements in the field of object detection, the task of describing the scene of an image has become easier. The aim of the project was to train convolutional neural networks with several hundreds of hyperparameters on a huge dataset of images (ResNet, VGG), and combine the results of this image classifier with a recurrent neural network to generate a caption for the classified image. In this report we present the detailed architecture of the model we used.

Keywords- Convolutional Neural Network, Recurrent Neural Network, BLEU (Bilingual Evaluation Understudy) score, Long Short-Term Memory

I. INTRODUCTION

Image caption generation is a popular research area of Artificial Intelligence that deals with understanding an image and producing a language description for it. Generating well-formed sentences requires both syntactic and semantic understanding of the language. Being able to describe the content of an image using accurately formed sentences is a very challenging task, but it could also have a great impact, by helping visually impaired people better understand the content of images.

Artificial Intelligence (AI) is now at the heart of the innovation economy, and it is the basis for this paper as well. In the recent past a field of AI, namely Deep Learning, has turned a lot of heads due to its impressive results in terms of accuracy when compared to already existing machine learning algorithms. Generating a meaningful sentence from an image is a difficult task, but it can have great impact, for instance by helping the visually impaired gain a better understanding of the content of images.

The task of image captioning is significantly harder than that of image classification, which has been the main focus of the computer vision community. A description of an image must
capture the relationships between the objects in the image. In addition to this visual understanding of the image, the semantic knowledge has to be expressed in a natural language such as English, which means that a language model is needed as well. The attempts made in the past have all been to stitch these two models together.

II. RELATED WORK

We introduce a synthesized output generator which localizes and describes objects, attributes, and relationships in an image in natural language form. A simple CNN architecture for four-class classification is shown in Fig. 1.

Fig 1: A simple CNN architecture

CNN: Convolutional Neural Networks are specialized deep neural networks that can process data with a 2D-matrix input shape. Images are easily represented as a 2D matrix, so CNNs are very useful for working with images. A CNN is basically used for image classification, identifying whether an image shows a bird, a plane or Superman, etc. It scans an image from left to right and top to bottom to pull out important features and combines them to classify the image. It can handle images that have been translated, rotated, scaled, or changed in perspective.

LSTM stands for Long Short-Term Memory. LSTMs are a type of RNN (recurrent neural network) well suited to sequence prediction problems: based on the previous text, they can predict what the next word will be. The LSTM has proven more effective than the traditional RNN by overcoming the RNN's limitation of short-term memory. An LSTM can carry relevant information throughout the processing of its inputs and, with a forget gate, it discards non-relevant information.

So, to make our image caption generator model, we merge these two architectures into what is also called a CNN-RNN model: the CNN is used for extracting features from the image, and the LSTM uses the information from the CNN to help generate a description of the image.

III. METHODOLOGY

To develop any system, certain methodologies and techniques are used. The methodology used in this paper is image processing, using a CNN for extracting image features and an LSTM for generating the captions.

3.1 Convolutional Neural Network

Convolutional Neural Networks (ConvNets or CNNs) are a category of Artificial Neural Networks which have proven to be very effective in the field of image recognition and classification. They have been used extensively for tasks such as object detection, self-driving cars and image captioning. The first ConvNet was introduced around 1990 by Yann LeCun, and the architecture of the model
was called the LeNet architecture. A basic ConvNet is shown in Fig. 1.

The entire architecture of a ConvNet can be explained using four main operations, namely:
1. Convolution
2. Non-Linearity (ReLU)
3. Pooling or Sub-Sampling
4. Classification (Fully Connected Layer)

These operations are the basic building blocks of every Convolutional Neural Network, so understanding how they work is an important step in developing a sound understanding of ConvNets. We will discuss each of these operations in detail below.

Essentially, every image can be represented as a matrix of pixel values. An image from a standard digital camera will have three channels – red, green and blue – which you can imagine as three 2D matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255, as shown in Fig. 2.

Fig 2: A grayscale image as a matrix of numbers

Convolution Operator

The purpose of the convolution operation is to extract features from an image. We consider filters of size smaller than the dimensions of the image. The entire operation of convolution can be understood with the example below. Consider a small 2-dimensional 5x5 image with binary pixel values, and another 3x3 matrix, shown in Fig. 3.

Fig 3: Image (in green) and Filter (in orange)

We slide this orange 3x3 matrix over the original image by 1 pixel, calculate the element-wise multiplication of the orange matrix with the covered sub-matrix of the original image, and add the multiplication outputs to get a single integer, which forms one element of the output matrix, shown in Fig. 4 as the pink matrix.

Fig 4: Convolution operation

The 3x3 matrix is called a filter, kernel or feature detector, and the matrix formed by sliding the filter over the image and computing the dot product is called the Convolved Feature, Activation Map or Feature Map. The number of pixels by which we slide the filter over the original image is known as the stride.
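To make this sliding-window arithmetic concrete, the short numpy sketch below convolves the standard 5x5 binary example image with a 3x3 filter at stride 1. The values are the usual textbook ones for this illustration, not taken from the paper's figures.

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

def convolve2d(img, k, stride=1):
    """Slide the filter over the image; each output element is the sum of
    the element-wise product of the filter and the covered sub-matrix."""
    out_h = (img.shape[0] - k.shape[0]) // stride + 1
    out_w = (img.shape[1] - k.shape[1]) // stride + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + k.shape[0],
                        j * stride:j * stride + k.shape[1]]
            out[i, j] = np.sum(patch * k)
    return out

print(convolve2d(image, kernel))   # 3x3 feature map: [[4,3,4],[2,4,3],[2,3,4]]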
Introducing Non-Linearity

An additional operation is applied after every convolution operation. The most commonly used non-linear function for images is ReLU, which stands for Rectified Linear Unit. The ReLU operation is an element-wise operation which replaces the negative pixel values in the feature map with zero. It is needed because most real-life data is non-linear, while the output of the convolution operation is linear (the operation applied is element-wise multiplication and addition). The output of the ReLU operation is shown in Fig. 5 below.

Fig 5: Output after a ReLU operation

Some other commonly used non-linearity functions are sigmoid and tanh.

Spatial Pooling

The pooling operation reduces the dimensionality of the image but preserves its important features. The most common type of pooling is max pooling. In max pooling, you slide a window of n x n (where n is smaller than the side of the image), take the maximum value in that window, and then shift the window by the given stride length. The complete process is shown in Fig. 6.

Fig 6: Max pooling operation

Fully-Connected Layer

The fully connected layer is a multi-layer perceptron that uses the SoftMax activation function in the output layer. The term "fully-connected" refers to the fact that all the neurons in the previous layer are connected to all the neurons of the next layer. The convolution and pooling operations generate features of an image; the task of the fully connected layer is to map these feature vectors to the classes in the training data. A fully connected layer of a CNN with 4 classes is shown in Fig. 7.

Fig 7: An example of a fully connected layer of data with 4 classes

The pretrained CNN models used in this work are VGG16 and ResNet50.
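Both ReLU and max pooling are simple element-wise and windowed operations. The following numpy sketch applies them to a small invented 4x4 feature map (illustrative values only):

import numpy as np

def relu(x):
    # Element-wise: replace negative values with zero.
    return np.maximum(0, x)

def max_pool(x, size=2, stride=2):
    # Slide a size x size window and keep the maximum in each window.
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

fmap = np.array([[ 1, -2,  3,  0],
                 [-1,  5, -6,  2],
                 [ 4, -3,  2, -1],
                 [ 0,  1, -2,  6]])
print(relu(fmap))              # negatives become 0
print(max_pool(relu(fmap)))    # 2x2 output: [[5, 3], [4, 6]]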
VGG16 architecture:

VGG stands for Visual Geometry Group (a group of researchers at Oxford who developed this architecture). The VGG architecture consists of blocks, where each block is composed of 2D Convolution and Max Pooling layers. VGGNet comes in two flavors, VGG16 and VGG19, where 16 and 19 are the number of layers in each of them, respectively.

We only used VGG16, shown in Fig. 8, though not with exactly the same dimensions. We use the layer with dimension 7x7x512 to extract features from images. Instead of saving all the features in the same file, we dump them to a numpy file with the same name as the image, so that we can use them later.

Fig 8: VGG16 model

3.2 Recurrent Neural Networks

Convolutional and fully connected layers are designed to process the input in one time step, without temporal context. Nonetheless, some tasks require handling sequences where data are temporally interdependent. For that, the Recurrent Neural Network (RNN), an extension of fully connected layers, has been introduced. RNNs are neural networks that take information from previous time steps into account.

RNNs are used in a variety of tasks: transforming a static input into a sequence (e.g. image captioning); processing sequences into a static output (e.g. video labelling); or transforming sequences into sequences (e.g. automatic translation).

A simple recurrent network is typically designed by taking the layer's output from the previous step and concatenating it with the current step input: y_t = f(x_t, y_{t-1}). The function f is a standard fully-connected layer that processes both inputs indistinctly as one vector. Due to its simplicity, this approach is not sufficient and does not yield promising results. Thus, in past years, a great number of meaningful designs have been tested and the designs have become more complex. For example, an inner state vector was introduced to convey information between time steps: h_t, y_t = f(x_t, h_{t-1}). The most popular architecture nowadays is Long Short-Term Memory (LSTM), a rather complex design, yet one outperforming the others.

Long Short-Term Memory

The control flow of an LSTM is shown in Fig. 9.

Fig 9: Long Short-Term Memory control flow
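To make the two recurrences above concrete, here is a toy numpy sketch of one step of such a simple recurrent layer; the weights and dimensions are illustrative placeholders, not the paper's configuration:

import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 8
Wx = rng.normal(size=(dim_h, dim_x))   # input-to-hidden weights
Wh = rng.normal(size=(dim_h, dim_h))   # hidden-to-hidden weights
b = np.zeros(dim_h)

def rnn_step(x_t, h_prev):
    """One step of a simple recurrent layer: a fully connected layer that
    processes the current input and the previous state as one vector."""
    h_t = np.tanh(Wx @ x_t + Wh @ h_prev + b)
    return h_t, h_t   # here the output y_t is simply the new state h_t

h = np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):  # a toy sequence of 5 inputs
    h, y = rnn_step(x, h)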
LSTMs have internal mechanisms called gates that can regulate the flow of information. These gates can learn which data in a sequence are important to keep or throw away. By doing that, the network can pass relevant information down the long chain of sequences to make predictions. Almost all state-of-the-art results based on recurrent neural networks are achieved with these two networks: LSTMs and GRUs can be found in speech recognition, speech synthesis, and text generation. You can even use them to generate captions for videos. The gates and symbols in the LSTM control flow are shown in Fig. 10.

Fig 10: LSTM Gates

An LSTM has a control flow similar to that of a basic recurrent neural network: it processes data, passing on information as it propagates forward. The differences are the operations within the LSTM's cells.

LSTM Cell and Its Operations

These operations are used to allow the LSTM to keep or forget information. Looking at these operations can get a little overwhelming, so we will go over them step by step.

The core concepts of the LSTM are the cell state and its various gates. The cell state acts as a transport highway that transfers relative information all the way down the sequence chain. You can think of it as the "memory" of the network. The cell state, in theory, can carry relevant information throughout the processing of the sequence, so even information from earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from the cell state via gates. The gates are different neural networks that decide which information is allowed on the cell state; they can learn during training what information is relevant to keep or forget.

Sigmoid: Gates contain sigmoid activations. A sigmoid activation is similar to the tanh activation, but instead of squishing values between -1 and 1, it squishes values between 0 and 1. That is helpful for updating or forgetting data, because any number multiplied by 0 is 0, causing values to disappear or be "forgotten", while any number multiplied by 1 keeps the same value and is "kept". The network can thus learn which data are not important and can be forgotten, and which data are important to keep.

Let's dig a little deeper into what the various gates are doing. There are three different gates that regulate information flow in an LSTM cell: a forget gate, an input gate, and an output gate.

Forget gate

First, we have the forget gate. This gate decides what information should be thrown away or kept. Information from the previous hidden state and
information from the current input is passed through the sigmoid function. Values come out between 0 and 1: the closer to 0, the more is forgotten; the closer to 1, the more is kept. The forget gate of the LSTM is shown in Fig. 11.

Fig 11: LSTM Forget gate

Input Gate

To update the cell state, we have the input gate. First, we pass the previous hidden state and the current input into a sigmoid function, which decides which values will be updated by transforming them to be between 0 and 1 (0 means not important, 1 means important). You also pass the hidden state and the current input into the tanh function, which squishes values between -1 and 1 to help regulate the network. Then you multiply the tanh output with the sigmoid output; the sigmoid output decides which information from the tanh output is important to keep. The input gate of the LSTM is shown in Fig. 12.

Fig 12: LSTM Input Gate

Cell State

Now we have enough information to calculate the cell state. First, the cell state gets pointwise multiplied by the forget vector, which may drop values in the cell state if they get multiplied by values near 0. Then we take the output of the input gate and do a pointwise addition, which updates the cell state with the new values the neural network finds relevant. That gives us our new cell state, shown in Fig. 13.

Fig 13: LSTM Cell State
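The gate arithmetic of this section, together with the output gate described next, can be summarized in a single LSTM step. The numpy sketch below uses random placeholder weights and toy sizes; it is only illustrative, not the paper's implementation:

import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # hidden/cell size (toy value)
Wf, Wi, Wc, Wo = (rng.normal(size=(d, 2 * d)) for _ in range(4))

def sigmoid(z):
    # Squishes values into [0, 1]: 0 means "forget", 1 means "keep".
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])    # previous hidden state + current input
    f = sigmoid(Wf @ z)                  # forget gate
    i = sigmoid(Wi @ z)                  # input gate: which values to update
    c_tilde = np.tanh(Wc @ z)            # candidate values in [-1, 1]
    c_t = f * c_prev + i * c_tilde       # pointwise multiply, then pointwise add
    o = sigmoid(Wo @ z)                  # output gate: what the hidden state carries
    h_t = o * np.tanh(c_t)               # new hidden state
    return h_t, c_t

x_t = rng.normal(size=d)                 # toy input vector
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(x_t, h, c)              # one time step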
Output Gate

Last, we have the output gate. The output gate decides what the next hidden state should be. Remember that the hidden state contains information on previous inputs; the hidden state
is also used for predictions. First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly modified cell state to the tanh function. We multiply the tanh output with the sigmoid output to decide what information the hidden state should carry. The output is the hidden state. The new cell state and the new hidden state are then carried over to the next time step. The output gate of the LSTM is shown in Fig. 14.

Fig 14: LSTM Output Gate

3.3 Implementation

The steps involved in the implementation are:
i. Get Dataset
ii. Prepare Photo Data
iii. Prepare Text Data
iv. Load Data
v. Encode Text Data
vi. Define Model
vii. Fit Model
viii. Evaluate Model
ix. Generate Captions

Get Dataset

We decided to use the Flickr8k dataset. It has 8,092 images and 5 captions for each image; each image has 5 captions because there are different ways to caption an image. This dataset has predefined training, testing and evaluation subsets of 6,000, 1,000 and 1,000 images respectively.

Prepare Photo Data

We use two pretrained CNN models to extract features from images: VGG16 and ResNet50. We remove the last layers of these models because we are not interested in classifying images; we are interested in the representation of the images. Instead of extracting the features every time we need them, we compute all of them once and save them to a file, with the features extracted by each model in a different file: features_vgg16.pkl and features_resnet.pkl. These models require images of a concrete size, 224x224 pixels, so the images had to be resized, then converted to an array and reshaped. The extracted features are vectors of size 2048. The pretrained CNN models VGG16 and ResNet50 and the LSTM are shown in Figs. 15, 16 and 17.

Fig 15: LSTM+ResNet50 architecture
Fig 16: VGG16 architecture
Fig 17: ResNet50 architecture
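As a rough sketch of this feature-extraction step, the snippet below uses the standard Keras ResNet50 (the VGG16 pipeline is analogous). The folder name is an illustrative assumption, not the authors' exact code:

import os, pickle
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# pooling='avg' drops the classifier head and yields a 2048-d feature vector.
model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def extract_features(image_dir):
    features = {}
    for name in os.listdir(image_dir):
        img = load_img(os.path.join(image_dir, name),
                       target_size=(224, 224))     # resize to the required size
        x = img_to_array(img)                       # convert to array
        x = preprocess_input(x[np.newaxis, ...])    # reshape to a batch of one
        features[name.split('.')[0]] = model.predict(x, verbose=0)[0]
    return features

features = extract_features('Flickr8k_Dataset')     # hypothetical folder name
with open('features_resnet.pkl', 'wb') as f:        # one pickle file per CNN
    pickle.dump(features, f)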
Prepare Text Data

Firstly, we load all the descriptions of the images and create a dictionary that maps image names to descriptions. To prepare the text data, we needed to clean the descriptions of the images. For this, we converted all the words to lowercase and removed all the punctuation, words one character long, and words containing numbers.

Next, we create a vocabulary with the unique words of the descriptions; for the VGG16 pipeline the size of the created vocabulary is 8,763, and for the ResNet50 pipeline it is 1,848. Finally, we save the descriptions to the file descriptions.txt so that we can use them later.

Load Data

We define functions to load the data corresponding to each of the subsets: train, validation and test. These subsets are predefined in the files Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt and Flickr_8k.testImages.txt. In the notebook without attention we load the image keys, descriptions and features. In the case of attention we do not load features, because we extract them later.

The model will generate a caption word by word, taking into account the previous words. So, we need an initial and a final word to start and end the generation. That is why we added <start> and <end> to the descriptions as the initial and final words.

Encode Text Data

Here we use a tokenizer to create a map from the words of the vocabulary to integers. We save this tokenizer to the file tokenizer.pkl to use it later. We also calculate the size of the vocabulary and the maximum length of the descriptions to use them later.
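A minimal sketch of the cleaning and encoding just described, assuming the Keras Tokenizer; the cleaning rules mirror the text above, and the example caption is invented:

import string, pickle
from tensorflow.keras.preprocessing.text import Tokenizer

def clean_caption(caption):
    # Lowercase, drop punctuation, one-letter words and words with numbers,
    # then wrap the description with the <start> and <end> tokens.
    table = str.maketrans('', '', string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return '<start> ' + ' '.join(words) + ' <end>'

captions = ['A dog runs through the 2nd field!']   # toy example caption
cleaned = [clean_caption(c) for c in captions]

tokenizer = Tokenizer(filters='')   # empty filters keep the < and > tokens
tokenizer.fit_on_texts(cleaned)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(c.split()) for c in cleaned)

with open('tokenizer.pkl', 'wb') as f:   # saved for later use
    pickle.dump(tokenizer, f)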
Define Model

We decided to try different Recurrent Neural Network architectures and compare the results. As we can see in Figure 18, the models are different, but both have a Long Short-Term Memory network as the recursive network.

RNN Model 1: It has two inputs. The text submodel has Embedding, Dropout and LSTM layers. The image submodel has a Dropout and a Dense layer. The model then adds these two submodels, and finally there are two Dense layers, the last one of vocabulary size with a softmax activation.

The model receives image features, a sequence of words, and the output word. Instead of doing this with all the data at once, we create these sequences for one image at a time: for each image and caption we create all the possible input sequences and output words by adding one word at a time until we reach <end> (a sketch of this step is given below). We created a data generator for the training set and another one for the validation set. The data generator produces arrays of these sequences progressively, in batches of the given size. We used this approach to avoid reaching RAM and GPU limits; otherwise we were not able to train the model with the RAM available in Google Colaboratory. The data generator receives the training data shuffled and works with image ids to take the images that correspond to the captions.

Fit Model

LSTM+VGG16

In this part, we fit the model to the data we have. We defined the maximum number of epochs as 20 and the batch size as 32, and we calculated the number of steps needed in each epoch for the training and validation data. We monitor the training and validation loss, and we save the model with the lowest validation loss in order to use it later, because fitting can take very long. If the validation loss increases in two consecutive epochs, we stop training to save time and avoid overfitting. Finally, we visualize the training and validation loss to understand better how our model is learning.

We tried all the possible combinations between the different CNNs and RNNs, and we can see in Figs. 19 and 20 that the best result (regarding the loss) was obtained with the second RNN.
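Below is a hedged Keras sketch of RNN Model 1 and of the per-image sequence creation described in the Define Model step above; the layer sizes, the maximum length and the variable names are illustrative assumptions:

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

feature_size = 2048                  # CNN feature vector (paper's reported size)
vocab_size, max_length = 8763, 34    # illustrative values

# Image submodel: Dropout + Dense.
img_in = Input(shape=(feature_size,))
img = Dropout(0.5)(img_in)
img = Dense(256, activation='relu')(img)

# Text submodel: Embedding + Dropout + LSTM.
txt_in = Input(shape=(max_length,))
txt = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt = Dropout(0.5)(txt)
txt = LSTM(256)(txt)

merged = add([img, txt])                              # add the two submodels
out = Dense(256, activation='relu')(merged)
out = Dense(vocab_size, activation='softmax')(out)    # next-word distribution

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')

def create_sequences(tokenizer, max_length, caption, feature, vocab_size):
    """For one image and caption, create every (input sequence -> next word)
    pair by adding one word at a time until <end> is reached."""
    X1, X2, y = [], [], []
    seq = tokenizer.texts_to_sequences([caption])[0]
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
        X1.append(feature); X2.append(in_seq); y.append(out_word)
    return X1, X2, y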
The added layers of the trained RNN model are shown in Fig. 18.

Fig 18: Trained Model Architecture

LSTM+ResNet50

The model architecture for this case uses the ReLU and softmax activation functions and 10 epochs, reaching a loss of 2.7.

The graphs of loss versus validation loss over the epochs for the trained models (RNN1-VGG16, RNN2-VGG16) are shown in Figs. 19 and 20.

Fig 19: Loss and validation loss values on different epochs for RNN1 and VGG16

Fig 20: Loss and validation loss values on different epochs for RNN2 and VGG16

Evaluate Model

We used two different ways to generate descriptions of images: Sampling and Beam Search. Sampling consists of taking the best word at each time step until the end is reached. Beam Search considers the k best sentences at each time step and predicts the next words for each one of them. Generally, making k bigger increases the chance of getting the description with the highest probability, but it also increases the time needed to generate captions. We tried to evaluate the whole test set with beam search, but it took too long; so we only used the Beam Search function to generate captions.

Taking into account that we have created multiple models, we needed an effective way to evaluate them. So, we chose to calculate the BLEU scores on the test data to evaluate the models. This requires comparing the original captions with the generated ones.
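The BLEU computation can be sketched with NLTK's corpus_bleu as below; the reference and generated captions here are toy stand-ins, not the paper's data:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy stand-ins: in the paper, `actual` holds the reference captions of each
# test image and `predicted` the caption generated by sampling.
actual = [[['dog', 'runs', 'through', 'the', 'water'],
           ['brown', 'dog', 'splashes', 'in', 'the', 'water']]]
predicted = [['dog', 'is', 'swimming', 'in', 'the', 'water']]

smooth = SmoothingFunction().method1   # avoids zero scores on tiny samples

# Cumulative n-gram BLEU scores, as reported in the paper.
for n, w in [(1, (1.0, 0, 0, 0)), (2, (0.5, 0.5, 0, 0)),
             (3, (1/3, 1/3, 1/3, 0)), (4, (0.25, 0.25, 0.25, 0.25))]:
    score = corpus_bleu(actual, predicted, weights=w, smoothing_function=smooth)
    print('BLEU-%d: %f' % (n, score))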
The BLEU scores we got from sampling with these models are the following:

Sampling Model 1 VGG16
• BLEU-1: 54.114390
• BLEU-2: 33.489586
• BLEU-3: 24.785442
• BLEU-4: 13.031321

Sampling Model 2 VGG16
• BLEU-1: 59.5796
• BLEU-2: 36.9997
• BLEU-3: 27.2431
• BLEU-4: 14.4684

Generate Captions

We generate captions for some images of the test set to view the performance of our models on real examples. We used the same methods that were used for evaluation, but instead of printing the captions as they are, we clean them to improve readability. This is done by removing the <start> and <end> words, as well as all the other tokens. For each image, we show the 5 original captions, the result of sampling, and the result of beam search with k=3 and k=5. We also show BLEU-1 scores as a reference for all the generated captions.

These are some results we obtained with images from the test dataset, using VGG16 and the second RNN model. One of the best-captioned images for the trained model is shown in Fig. 21.

Fig 21: One of the best captionings

Original 1: brown and white dog stands outside while it snows
Original 2: dog is looking at something near the water
Original 3: furry dog attempts to dry itself by shaking the water off its coat
Original 4: white and brown dog shaking its self dry
Original 5: the large brown and white dog shakes off water

Sampling (BLEU-1: 58.4101): dog is swimming in the water
Beam Search k=3 (BLEU-1: 72.7273): white and white dog is swimming in the water
Beam Search k=5 (BLEU-1: 70.0000): brown and white dog swims in shallow water

On the one hand, the captions generated without attention are quite good: they have some mistakes, but they capture some aspects of the images correctly. On the other hand, the captions are very bad with the attention model, which correlates with its low BLEU scores.

While training, we have seen that we had the best results with VGG16 and the 2nd RNN model, and also with LSTM and ResNet50. But there is still a lot of room for improvement, as we can see in the captioning example above: even when we obtain a fairly correct description, it is not very close to any of the original ones.
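A minimal sketch of the sampling strategy (greedy decoding) used above, assuming the model and tokenizer defined in the earlier steps; this is illustrative, not the authors' exact function:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, feature, max_length):
    """Take the most probable word at each time step until <end>
    (or the maximum length) is reached."""
    text = '<start>'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([feature[np.newaxis, :], seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(yhat)))
        if word is None or word == '<end>':
            break
        text += ' ' + word
    # Clean the output for readability: drop the <start> token.
    return text.replace('<start> ', '')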
IV. RESULTS

The Convolutional Neural Network algorithm and Long Short-Term Memory are used to build the model for generating captions for pictures. In this model a total of 8,000 images and their related captions are taken. We have trained and tested the model, and it generates captions for the images. The model achieved an accuracy of 70%.

Accuracies:

Accuracy is the measure used to evaluate the performance of the model. Here RNN1, RNN2 and LSTM are the models for generating the captions for the images, while VGG16 and ResNet50 are the pretrained CNN models used for processing the images and extracting the features. The accuracies of the models are shown in Table 1.

MODEL NAME      ACCURACY (%)
RNN1-VGG16      70
RNN2-VGG16      71
ResNet50-LSTM   72

Table 1: Accuracies of the trained models

The ResNet50-LSTM is the best performing among the three models. The comparison between the accuracies of the trained models is shown in Figure 22.

Fig 22: The accuracies of the model

The user interface is built with the Flask API to ensure an interactive UI for the user. The output is seen through the user interface, which consists of a button to upload the image and a dropdown button to choose the model. It produces the output by taking as input the uploaded image and the chosen model: given an image, it generates the caption using the chosen trained model. The working of the image caption generator project is shown in Figs. 23, 24 and 25.

Fig 23: Application
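The interface described above can be approximated with a few lines of Flask. This is a hedged sketch with an invented page template and a stubbed prediction helper, not the authors' application:

from flask import Flask, request, render_template_string

app = Flask(__name__)
MODEL_NAMES = ['rnn1-vgg16', 'rnn2-vgg16', 'resnet50-lstm']

PAGE = '''
<form method="post" enctype="multipart/form-data">
  <input type="file" name="image">
  <select name="model">
    {% for m in models %}<option>{{ m }}</option>{% endfor %}
  </select>
  <button type="submit">Predict</button>
</form>
<p>{{ caption }}</p>
'''

def predict_caption(model_name, image_file):
    # Placeholder: extract CNN features from the uploaded file and run the
    # chosen caption model (e.g. the generate_caption sketch above).
    return 'dog is swimming in the water'

@app.route('/', methods=['GET', 'POST'])
def index():
    caption = ''
    if request.method == 'POST':
        caption = predict_caption(request.form['model'], request.files['image'])
    return render_template_string(PAGE, models=MODEL_NAMES, caption=caption)

if __name__ == '__main__':
    app.run(debug=True)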
Fig 24: After selecting the model and uploading the image, click on the Predict button

Fig 25: The caption generated by the selected model for the uploaded image

Dataset folders:

The Flickr8k dataset has around 8,000 images and two folders: one for storing the images (the Images folder) and one for storing the text data related to those images, as shown in Fig. 26. For each image there are five related captions, because everyone has a different perspective when captioning an image.

Fig 26: Dataset Folders
The flow of the image captioning project is shown in Fig. 27.

Fig 27: Flow of the project
LIST OF FIGURES

Fig. 1: A Simple CNN Architecture
Fig. 2: A grayscale image as a matrix of numbers
Fig. 3: Image (in green) and Filter (in orange)
Fig. 4: Convolution Operation
Fig. 5: Output after a ReLU operation
Fig. 6: Max pooling operation
Fig. 7: An example of a fully connected layer of data with four classes
Fig. 8: VGG16 Model
Fig. 9: Long Short-Term Memory Control Flow
Fig. 10: LSTM Gates
Fig. 11: LSTM Forget gate
Fig. 12: LSTM Input gate
Fig. 13: LSTM Cell state
Fig. 14: LSTM Output gate
Fig. 15: LSTM+ResNet50 Architecture
Fig. 16: VGG16 Architecture
Fig. 17: ResNet50 Architecture
Fig. 18: Trained Model Architecture
Fig. 19: Loss and validation loss values on different epochs for RNN1-VGG16
Fig. 20: Loss and validation loss values on different epochs for RNN2-VGG16
Fig. 21: One of the best Captionings
Fig. 22: The Accuracies of the trained model
Fig. 23: Application
Fig. 24: After selecting the model and uploading the image, click on the Predict button
Fig. 25: Caption generated by the selected model for the uploaded image
Fig. 26: Dataset Folders
Fig. 27: Flow of the Project

LIST OF TABLES

Table 1: Accuracies of the trained models
VII. FUTURE ENHANCEMENT

We are going to extend our work to the next level by enhancing our model to generate captions even for live video frames. Our present model generates captions only for still images. It is completely GPU based, and captioning live video frames is not possible with general CPUs. Video captioning is a popular research area that is going to change people's lifestyles, with use cases widely applicable in almost every domain; it automates major tasks like video surveillance and other security tasks.

The model's accuracy can be boosted by training it on a larger dataset, so that the number of words in the vocabulary of the model increases significantly. The use of relatively newer architectures, like GoogleNet, can also increase the accuracy of the classification task, thus reducing the error rate in the language generation.