Automatic Image Caption Generation Using Deep Learning

Project Proposal

By:
Muhammad Shahzad Nazir, Roll No. 2348, Session 2020-2024, CGPA: 2.70
Muhammad Ahtasham Aslam, Roll No. 2344, Session 2020-2024, CGPA: 2.27
Muhammad Jawad Ishaq, Roll No. 2336, Session 2020-2024, CGPA: 2.83

Supervised by: Dr. Hina Ashraf

DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF MODERN LANGUAGES, ISLAMABAD

Project Overview:
Automatically describing the content of images in natural language is a fundamental and challenging task with applications in diverse domains such as image retrieval and organizing and locating images of interest to users. It has great potential impact: it could help visually impaired people better understand the content of images on the web, and it could provide more accurate and compact information about images and videos in scenarios such as image sharing on social networks or video surveillance systems. We have adopted the encoder-decoder model proposed by (Verma, Yadav, Kumar, & Yadav, 2022), which is capable of generating grammatically correct captions for images. The model uses a CNN (VGG16 Hybrid Places 1365) as the encoder and an RNN (LSTM) as the decoder: the CNN converts the input image into a 1-D feature vector, and the RNN acts as a language model that generates the caption from that vector. The model will be trained on the Flickr8k and MSCOCO Captions datasets, whose human-written captions serve as ground truth. After training, the model will be deployed on a website and an Android-based mobile app.

Purpose:
Our area of interest lies in deep learning, so this project is a natural fit for our field, as it implements a model built on deep neural networks. Beyond the uses noted above, this technology applies to many emerging fields, such as assisting the visually impaired, medical image analysis, and geospatial image analysis.

Evaluation of Existing System:
(Amritkar & Jabade, 2018) created an image caption generator using deep neural networks, combining a CNN and an LSTM; their model achieved a BLEU score of 0.53356 on the Flickr8k dataset. In (Kulkarni, et al., 2013), the authors used a Conditional Random Field (CRF) based technique to derive the objects (people, cars, etc., or things like trees and roads), attributes, and prepositions in an image. The model was evaluated on the PASCAL dataset using BLEU, and the best BLEU score obtained in that work was 0.18. (Chu, Yue, Yu, Sergei, & Wang, 2020) put forward a model using ResNet50 and LSTM with soft attention that produced a BLEU-1 score of 0.619 on the Flickr8K dataset. In our model, we will use a deep neural network with a CNN (VGG16 Hybrid Places 1365) as the encoder and an LSTM as the decoder. The encoder has been pre-trained on both the ImageNet and Places datasets. Using this model, we will generate captions for images.
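Since the systems above are compared by BLEU-style metrics, the following is a minimal sketch of how such scores can be computed with NLTK. The reference and candidate captions are made-up examples, and a real evaluation would average scores over an entire test split.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu

# Hypothetical ground-truth captions and one generated caption (tokenized).
references = [
    "a brown dog is running on the grass".split(),
    "a dog runs across a grassy field".split(),
]
candidate = "a dog is running on the grass".split()

# BLEU-1 (unigram precision) with smoothing, as commonly reported for captioning.
smooth = SmoothingFunction().method1
bleu1 = sentence_bleu(references, candidate,
                      weights=(1.0, 0, 0, 0), smoothing_function=smooth)

# GLEU, one of the other metrics listed later in this proposal.
gleu = sentence_gleu(references, candidate)

print(f"BLEU-1: {bleu1:.4f}  GLEU: {gleu:.4f}")
```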
Proposed System:
In our project, we propose a deep neural network based method of image caption generation. An encoder-decoder model will be designed using a convolutional neural network (VGG16 Hybrid Places 1365) as the encoder and a recurrent neural network (LSTM) as the decoder to generate captions. The VGG16 Hybrid Places 1365 model is pre-trained on both the ImageNet and Places365 datasets (containing 1,000 and 365 classes, respectively). Our model will then be trained on the labeled Flickr8k and MSCOCO Captions datasets and evaluated using the standard metrics BLEU, METEOR, GLEU, and ROUGE-L. (Code sketches of the CNN workflow, the encoder, and the decoder follow at the end of this section.)

In the CNN, the input layer receives the picture, which may be color or grayscale. The picture then passes to the convolutional layer, where filters are applied to it; this procedure is called convolution. An activation function is then applied; here, the ReLU activation function is used. The output becomes the input to the pooling layer, which reduces the image's resolution, as shown in the workflow figure below. A network may contain more than one convolutional layer, each going through the same procedure, and as we progress through the convolutional layers, the detected patterns grow more complex, such as eyes, faces, and birds. The resulting matrix is then flattened before the fully connected layer, which helps classify the items in the supplied picture. The probabilities of the classes are calculated using the softmax function, and the output layer finally reports the items present in the picture.

Workflow of CNN:
[Figure: workflow of the CNN encoder]

The key to LSTMs is the cell state, the horizontal line running along the top of the LSTM diagram below. The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only minor linear interactions, so it is very easy for information to flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through: a value of zero means "let nothing through," while a value of one means "let everything through." An LSTM has three of these gates to protect and control the cell state.

LSTM Working:
[Figure: LSTM cell with its gates]
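The CNN workflow described above (input, convolution, ReLU, pooling, flatten, fully connected layer, softmax) can be made concrete with a toy Keras model. The layer sizes and the 10-class output below are arbitrary illustration values, not part of the proposed encoder.

```python
from tensorflow.keras import layers, models

# Toy CNN mirroring the workflow in the text; sizes are illustrative only.
cnn = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # input layer: a color picture
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU activation
    layers.MaxPooling2D((2, 2)),                   # pooling reduces the resolution
    layers.Conv2D(64, (3, 3), activation="relu"),  # deeper layer: more complex patterns
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # flatten before the fully connected layer
    layers.Dense(128, activation="relu"),          # fully connected classification layer
    layers.Dense(10, activation="softmax"),        # softmax gives class probabilities
])
cnn.summary()
```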
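For the encoder itself, a minimal feature-extraction sketch is shown below. The VGG16 Hybrid Places 1365 weights are distributed by the Places365 project rather than bundled with Keras, so the stock ImageNet VGG16 serves here as a stand-in, and "example.jpg" is a placeholder path.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

# Stand-in encoder: stock ImageNet VGG16 (the Hybrid Places 1365 weights would
# be loaded the same way once converted to a Keras-compatible format).
base = VGG16(weights="imagenet")
# Drop the final softmax; keep the 4096-d fc2 layer as the image embedding.
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

img = image.load_img("example.jpg", target_size=(224, 224))  # placeholder image
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = encoder.predict(x)
print(features.shape)  # (1, 4096): the 1-D representation fed to the decoder
```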
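Finally, a minimal sketch of an encoder-decoder "merge" model of the kind described: the image embedding and a partial caption are combined to predict the next word. Keras's LSTM layer implements the gated cell state internally, and vocab_size and max_len are assumed placeholder values.

```python
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

vocab_size, max_len = 8000, 34  # assumed values; set from the training captions

# Image-feature branch: compress the 4096-d encoder output.
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)

# Partial-caption branch: embed the words seen so far, run them through the LSTM.
inputs2 = Input(shape=(max_len,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge both branches and predict the next word of the caption.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```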
User Description:
Users will be able to sign up and log in on the website or Android app, upload an image, and receive grammatically correct captions for the uploaded image, helping them better understand its context.

Hardware Requirements:
- Computer with an Intel Core i5 2.3 GHz processor or higher
- 8 GB RAM
- 512 GB hard disk
- Mouse and keyboard
- NVIDIA GeForce GTX 1080 Ti

Software Requirements:
- Windows 7 or above
- VS Code 1.70
- Python
- Flutter
- Jupyter Notebook
- Google Colab

Project Deliverables:
We will provide web and Android-based applications in which users can upload an image and our model will generate grammatically correct captions for it. This project depends on the user's input and on our model's ability to generate captions.

Model Diagram:
[Figure: model diagram]

Flow Chart:
[Figure: flow chart]

System Diagram:
[Figure: system diagram]

Functional Requirements:

1. Login Management
In this module, the user will be able to log in and sign up on the website and mobile app. (A sketch of the username/password format check appears after this section.)

Sr. No. | Description | Type
1 | The system should validate logins and passwords. | Evident
2 | The system should provide privileges according to the login type. | Evident
3 | The system should check the correct format of the username and password; no special characters should be allowed except _ and *. | Evident

2. Image Caption Generator
The user will upload an image on the web or mobile app; the system will analyze the image using our model and display grammatically correct captions for it on the user's screen. (A sketch of the upload endpoint appears after this section.)

Sr. No. | Description | Type
1 | The user should be able to upload an image in the form. | Evident
2 | The system should be able to generate captions for the uploaded image. | Evident
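Functional requirement 3 of Login Management restricts usernames and passwords to ordinary characters plus _ and *. A minimal sketch of that check follows; the exact policy (allowed letters and digits, 3-32 character length) is an assumption for illustration.

```python
import re

# Allowed: letters, digits, underscore, and asterisk; length 3-32 is assumed.
VALID = re.compile(r"^[A-Za-z0-9_*]{3,32}$")

def is_valid_credential(text: str) -> bool:
    """Return True if the username/password matches the allowed format."""
    return VALID.fullmatch(text) is not None

print(is_valid_credential("shahzad_23"))  # True
print(is_valid_credential("bad name!"))   # False: space and '!' are not allowed
```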
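The Image Caption Generator requirement (upload an image, get a caption back) could be served to both the website and the Flutter app through a single HTTP endpoint. The sketch below uses Flask as one plausible choice (the proposal names Python but not a web framework), and generate_caption() is a hypothetical stand-in for the trained model.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_caption(image_bytes: bytes) -> str:
    """Placeholder for inference with the trained encoder-decoder model."""
    return "a placeholder caption"

@app.route("/caption", methods=["POST"])
def caption():
    # The web page or Flutter app posts the picture as multipart form data.
    file = request.files.get("image")
    if file is None:
        return jsonify(error="no image uploaded"), 400
    return jsonify(caption=generate_caption(file.read()))

if __name__ == "__main__":
    app.run(debug=True)
```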
References

Amritkar, C., & Jabade, V. (2018). Image Caption Generation Using Deep Learning Technique. Institute of Electrical and Electronics Engineers. doi:https://doi.org/10.1109/ICCUBEA.2018.8697360

Aote, S. S. (2022). Image Caption Generation using Deep Learning Technique. Journal of Algebraic Statistics, 2260-2267.

Chu, Y., Yue, X., Yu, L., Sergei, M., & Wang, Z. (2020). Automatic Image Captioning Based on ResNet50 and LSTM with Soft Attention. Hindawi. doi:https://doi.org/10.1155/2020/8909458

Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., ... Berg, T. (2013). BabyTalk: Understanding and Generating Simple Image Descriptions. Institute of Electrical and Electronics Engineers. doi:https://doi.org/10.1109/TPAMI.2012.162

Verma, V., Yadav, A. K., Kumar, M., & Yadav, D. (2022). Automatic Image Caption Generation Using Deep Learning. Research Square. doi:https://doi.org/10.21203/rs.3.rs-1282936/v1

Project Proposal Approval Certificate

Dated: ___________

Final Approval
It is certified that the project proposal submitted by Muhammad Shahzad Nazir, Muhammad Jawad Ishaq, and Muhammad Ahtasham Aslam in partial fulfillment of the requirements of the Master's in Computer Science degree is approved.

COMMITTEE

HoD CS: Mr. Sajjad Haider
Signature: ____________________

Head Project Committee: Ms. Mehwish Sabih
Signature: ____________________

Supervisor Name: __________________
Signature: ____________________

Evaluator Name: __________________
Signature: ____________________

Evaluator Name: __________________
Signature: ____________________

Evaluator Name: __________________
Signature: ____________________