Automatic Image Caption Generation Using Deep
Learning
Project Proposal
By:
Muhammad Shahzad Nazir
Roll no #2348
Session 2020-2024
CGPA: 2.70
Muhammad Ahtasham Aslam
Roll no #2344
Session 2020-2024
CGPA: 2.27
Muhammad Jawad Ishaq
Roll no #2336
Session 2020-2024
CGPA: 2.83
Supervised by:
Dr. Hina Ashraf
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF MODERN LANGUAGES ISLAMABAD
Project Proposal
Project Overview:
Automatically describing the content of images in natural language is a fundamental and challenging task with applications in diverse domains such as image retrieval and organizing and locating images of users' interest, and it has great potential impact. For example, it could help visually impaired people better understand the content of images on the web, and it could provide more accurate and compact information about images and videos in scenarios such as image sharing on social networks or video surveillance systems.
We have adopted the encoder-decoder model proposed by (Verma, Yadav, Kumar, & Yadav, 2022), which is capable of generating grammatically correct captions for images. The model uses a CNN (VGG16 Hybrid Places 1365) as the encoder and an RNN (LSTM) as the decoder: the CNN converts the input image into a 1-D feature vector, and the RNN acts as a language model that generates the caption word by word. The model will be trained on the labeled Flickr8k and MSCOCO Captions datasets, whose human-annotated captions serve as the ground truth. After training, the model will be deployed on a website and an Android-based mobile app.
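The decoding loop of such an encoder-decoder pipeline can be sketched as follows. This is a toy illustration only: `next_word_probs` is a hypothetical stand-in for the trained LSTM decoder step, and the vocabulary and image feature are placeholders, not outputs of VGG16.

```python
import numpy as np

# Toy greedy decoding loop for an encoder-decoder captioner. The CNN encoder
# is replaced by a fixed feature vector and the LSTM decoder by the
# hypothetical `next_word_probs` stub, so this shows only the control flow.
VOCAB = ["<start>", "a", "dog", "runs", "<end>"]

def next_word_probs(image_feature, prev_word):
    # Hypothetical stand-in for one LSTM decoder step: returns a probability
    # distribution over VOCAB given the image feature and the previous word.
    transitions = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}
    probs = np.full(len(VOCAB), 0.01)
    probs[VOCAB.index(transitions.get(prev_word, "<end>"))] = 0.96
    return probs / probs.sum()

def generate_caption(image_feature, max_len=10):
    # Start from the <start> token and greedily pick the most likely next
    # word until <end> is produced or the length limit is reached.
    word, caption = "<start>", []
    for _ in range(max_len):
        word = VOCAB[int(np.argmax(next_word_probs(image_feature, word)))]
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption)

caption = generate_caption(np.zeros(4096))  # → "a dog runs"
```

In the real system the stub would be replaced by the trained LSTM conditioned on the VGG16 feature vector, and greedy search could be swapped for beam search.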
Purpose:
Automatically describing the content of images in natural language is a fundamental and challenging task with great potential impact. Our area of interest lies in deep learning, so this project is a good fit for our field, as it implements a model based on deep neural networks. The technology can be applied in many emerging fields, such as assisting the visually impaired, medical image analysis, and geospatial image analysis. For example, it could help visually impaired people better understand the content of images on the web, and it could provide more accurate and compact information about images and videos in scenarios such as image sharing on social networks or video surveillance systems.
NUML-S20-21950
NUML-S20-26401
NUML-S20-25478
Page 2
Evaluation of Existing System:
(Amritkar & Jabade, 2018) created an image caption generator using deep neural networks combining a CNN and an LSTM; they report a BLEU score of 0.53356 on the Flickr8k dataset. In (Kulkarni, et al., 2013), the authors used a Conditional Random Field (CRF) based technique to derive the objects (people, cars, etc., or things like trees, roads, etc.), attributes, and prepositions; the model is evaluated on the PASCAL dataset using BLEU, and the best BLEU score obtained in that work was 0.18. (Chu, Yue, Yu, Sergei, & Wang, 2020) put forward a model using ResNet50 and LSTM with soft attention that produced a BLEU-1 score of 0.619 on the Flickr8k dataset. In our model, we will use a deep neural network with a CNN (VGG16 Hybrid Places 1365) as the encoder and an LSTM as the decoder. The CNN has been pre-trained on both the ImageNet and Places datasets. Using this model, we will generate captions for the images.
Proposed System:
In our project, a deep neural network-based method of image caption generation is proposed. An encoder-decoder deep neural network will be designed with a convolutional neural network (VGG16 Hybrid Places 1365) as the encoder and a recurrent neural network (LSTM) as the decoder to generate captions. The VGG16 Hybrid Places 1365 model is pre-trained on both the ImageNet and Places datasets (containing 1000 and 365 classes respectively). Our model will be trained on the labeled Flickr8k and MSCOCO Captions datasets. Further, the model will be evaluated using standard metrics such as BLEU, METEOR, GLEU, and ROUGE-L.
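As an illustration of the simplest of these metrics, the following is a minimal sketch of BLEU-1 (clipped unigram precision with a brevity penalty). A real evaluation would use a library implementation such as NLTK's, which also handles higher-order n-grams and smoothing.

```python
from collections import Counter
import math

# Minimal BLEU-1 sketch: clipped unigram precision times a brevity penalty.
def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Clipped matches: each candidate word is credited at most as many
    # times as it appears in the reference.
    matches = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = matches / len(cand)
    # Brevity penalty discourages captions shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("a dog runs on grass", "a dog runs on the grass")
```

Here every candidate word appears in the reference (precision 1.0), but the caption is one word shorter, so the brevity penalty reduces the score to exp(-0.2) ≈ 0.82.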
In the CNN, the input layer receives the input picture, which may be color or grayscale. The picture is then passed to a convolutional layer, where filters are applied to it; this procedure is called convolution. An activation function is then applied; here, the ReLU activation function is used. The output becomes the input to a pooling layer, which helps reduce the image's resolution, as shown in fig (). Subsequent convolutional layers follow the same procedure, and it is feasible to have more than a single convolutional layer. As we progress through the convolutional layers, the
patterns detected get more complex, such as eyes, faces, and birds. The resulting matrix is then flattened and passed to a fully connected layer, which helps classify the items in the supplied picture. Class probabilities are calculated using the softmax function. Finally, the output layer gives the items that were present in the picture.
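The stages described above (convolution, ReLU, max pooling, flattening, softmax) can be sketched in NumPy as follows. The shapes and filter values are illustrative toys, not those of VGG16 Hybrid Places 1365.

```python
import numpy as np

def conv2d(image, kernel):
    # Valid-mode 2-D convolution: slide the filter over the picture and
    # take the elementwise product-sum at each position.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    # ReLU activation: negative responses are zeroed out.
    return np.maximum(0, x)

def max_pool(x, size=2):
    # Max pooling reduces resolution by keeping the largest value per patch.
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    # Converts scores to class probabilities that sum to one.
    e = np.exp(z - z.max())
    return e / e.sum()

image = np.arange(36, dtype=float).reshape(6, 6)       # toy "picture"
edge_filter = np.array([[-1.0, 1.0], [-1.0, 1.0]])     # toy horizontal-edge filter
features = max_pool(relu(conv2d(image, edge_filter)))  # conv -> ReLU -> pool
flat = features.flatten()                              # flatten for the FC layer
probs = softmax(np.ones(3))                            # toy class probabilities
```

A real CNN stacks many such conv/ReLU/pool blocks with learned filters; this sketch only mirrors the data flow of one block.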
Workflow of CNN:
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to just flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through: a value of zero means "let nothing through," while a value of one means "let everything through."
An LSTM has three of these gates, to protect and control the cell state.
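One step of an LSTM cell, with the three gates described above, can be sketched as follows. The weights here are random placeholders, not trained values, and the packed-weight layout is one common convention, not the only one.

```python
import numpy as np

def sigmoid(x):
    # Gate activation: squashes values into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # Concatenate previous hidden state and current input; W packs the
    # forget, input, output, and candidate weight matrices row-wise.
    z = W @ np.concatenate([h_prev, x]) + b
    n = h_prev.size
    f = sigmoid(z[0:n])          # forget gate: how much of c_prev to keep
    i = sigmoid(z[n:2 * n])      # input gate: how much new candidate to add
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much of the cell to expose
    g = np.tanh(z[3 * n:4 * n])  # candidate cell values
    c = f * c_prev + i * g       # conveyor-belt cell state update
    h = o * np.tanh(c)           # new hidden state
    return h, c

rng = np.random.default_rng(0)
n, d = 4, 3                                    # hidden size, input size
W = rng.standard_normal((4 * n, n + d)) * 0.1  # placeholder weights
b = np.zeros(4 * n)
h, c = lstm_step(rng.standard_normal(d), np.zeros(n), np.zeros(n), W, b)
```

In the caption decoder, `x` would be the embedding of the previous word (and the image feature at the first step), and `h` would feed a softmax over the vocabulary.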
LSTM Working:
User Description:
Users will be able to sign up and log in on the website or Android app. They will be able to upload an image and will receive grammatically correct captions for it, helping them better understand the context of the image.
Hardware Requirements:

- Computer with Intel Core i5 2.3 GHz or higher
- 8 GB RAM
- 512 GB hard disk
- Mouse
- Keyboard
- NVIDIA GeForce GTX 1080 Ti
Software Requirements:

- Windows 7 or above
- VS Code 1.70
- Python
- Flutter
- Jupyter Notebook
- Google Colab
Project Deliverables:
We will provide web and Android-based applications in which users can upload an image and our model will generate grammatically correct captions for it. The project depends on user input and on our model's ability to generate captions.
Model Diagram:
Flow Chart:
System Diagram:
Functional Requirements:
1. Login Management
In this module, the user will be able to log in and sign up on the website and mobile app.

Sr. No. | Description | Type
1 | The system should validate logins and passwords. | Evident
2 | The system should provide privileges according to login type. | Evident
3 | It should check the correct format of the username and password; no special characters are allowed except _ and *. | Evident
2. Image Caption Generator
In this module, the user will upload an image on the web or mobile app; the system will analyze the image using our model and generate grammatically correct captions, which will be displayed on the user's screen.

Sr. No. | Description | Type
1 | The user should be able to upload an image in the form. | Evident
2 | The system should be able to generate captions for the uploaded image. | Evident
References
Amritkar, C., & Jabade, V. (2018). Image Caption Generation Using Deep Learning Technique. Institute of Electrical and Electronics Engineers. doi:https://doi.org/10.1109/ICCUBEA.2018.8697360
Aote, S. S. (2022). Image Caption Generation using Deep Learning Technique. Journal of Algebraic Statistics, 2260-2267.
Chu, Y., Yue, X., Yu, L., Sergei, M., & Wang, Z. (2020). Automatic Image Captioning Based on ResNet50 and LSTM with Soft Attention. Hindawi. doi:https://doi.org/10.1155/2020/8909458
Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., . . . Berg, T. (2013). BabyTalk: Understanding and Generating Simple Image Descriptions. Institute of Electrical and Electronics Engineers. doi:https://doi.org/10.1109/TPAMI.2012.162
Verma, V., Yadav, A. K., Kumar, M., & Yadav, D. (2022). Automatic Image Caption Generation Using Deep Learning. Research Square. doi:https://doi.org/10.21203/rs.3.rs-1282936/v1
Project Proposal Approval Certificate
Dated: ___________
Final Approval
It is certified that the project proposal submitted by Muhammad Shahzad Nazir, Muhammad Jawad Ishaq, and Muhammad Ahtasham Aslam for the partial fulfillment of the requirement of the Masters in Computer Science degree is approved.
COMMITTEE
HoD CS:
Mr. Sajjad Haider
Signature: ____________________
Head Project Committee:
Ms. Mehwish Sabih
Signature: ____________________
Supervisor Name:
__________________
Signature: ____________________
Evaluator Name:
__________________
Signature: ____________________
Evaluator Name:
__________________
Signature: ____________________
Evaluator Name:
__________________
Signature: ____________________