Image Captioning
Sujan Babu Shrestha 17031245
Contents
AI concepts used
• CNN
• RNN(LSTM)
• VGG16
• Embeddings
Research Evidence
Solution
• How the program works
• Achieved results
• How it solves the problem in the real world
• Pseudo Code
• Flowchart
What is Artificial Intelligence?
The term artificial intelligence was introduced by John McCarthy at the Dartmouth conference in 1956, but the concept of a machine being able to think is much older. Artificial Intelligence is an approach to making a machine, i.e. a computer or robot, think and respond like humans.
Some of the fields using AI are:
Gaming
Natural Language Processing
Expert Systems
Speech Recognition
Handwriting Recognition
What is Machine Learning?
Machine learning is a subset of Artificial Intelligence that allows a system to learn and improve from experience automatically, without being explicitly programmed. The term Machine Learning was coined in 1959 by Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence. It works on a set of algorithms and differs from traditional computational methods: the algorithms used in machine learning allow computers to train on data inputs and use statistical analysis to output values that fall within a specific range.
Types of Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
What is Natural Language Processing?
NLP is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language. To identify and extract natural language, rules are applied, such as converting unstructured data into a form that computers can understand; meaningful text is extracted from every sentence and the essential data is collected from it.
Techniques used in natural language processing, by stage:
Syntax: lemmatization, morphological segmentation, word segmentation, part-of-speech tagging, parsing, sentence breaking, stemming.
Semantics: named entity recognition (NER), word sense disambiguation, natural language generation.
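A few of the syntax-stage techniques can be illustrated with the NLTK library; a minimal sketch (NLTK is not part of the captioning model itself, and the downloadable resource names may vary slightly across NLTK versions):

# Minimal sketch of a few syntax-stage NLP techniques with NLTK (illustrative only).
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("punkt")                        # tokenizer models
nltk.download("averaged_perceptron_tagger")   # POS tagger model
nltk.download("wordnet")                      # lemmatizer dictionary

sentence = "Two dogs are running across the field"
tokens = nltk.word_tokenize(sentence)                                 # word segmentation
tags = nltk.pos_tag(tokens)                                           # part-of-speech tagging
lemmas = [WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens]  # lemmatization
stems = [PorterStemmer().stem(t) for t in tokens]                     # stemming
print(tags)    # e.g. ('dogs', 'NNS'), ('running', 'VBG'), ...
print(lemmas)  # "running" -> "run" when treated as a verb
print(stems)   # "dogs" -> "dog", "running" -> "run"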
Deep Learning:
Deep learning is a subset of machine learning concerned with algorithms based on artificial neural networks (ANNs), which are inspired by the human brain. It is capable of learning unsupervised from data that is unstructured or unlabeled.
ANNs have nodes called neurons, interconnected by links. Each node receives data, performs an operation, and passes the new data to another node through a link. Each link carries weights or biases that influence the next node's operation. Backpropagation is performed to minimize the loss function, and in some cases, as in RNNs, backpropagation through time is performed.
Structure of an ANN.
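As a concrete illustration (a generic example, not the captioning model itself), a minimal dense ANN in Keras with arbitrary layer sizes:

# Minimal feed-forward ANN sketch in Keras (illustrative; layer sizes are arbitrary).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(100,)),  # hidden layer: weights and biases per link
    Dense(1, activation="sigmoid"),                    # output node
])
# Compiling attaches a loss function; fit() would run backpropagation to minimize it.
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()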
AI concepts used in the model
CNN
• A Convolutional Neural Network (CNN) is a deep learning algorithm that can take in an input image, assign learnable weights and biases to different parts of the image, and differentiate one part from another.
• A CNN is able to capture the spatial and temporal dependencies in an image using relevant filters applied through the color channels. If an image is a grayscale image, there is only one channel.
• Because the filters are shared across the image, fewer parameters are involved thanks to the reusability of weights.
• The pooling layer decreases the size of the convolutional features by extracting the dominant feature within each kernel-sized window, operating on each channel independently.
Figures: filter/kernel reduction through channels; the complete CNN architecture; max pooling and average pooling.
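A minimal convolution-plus-pooling stack in Keras, purely to illustrate the layers described above; the filter counts and kernel sizes are arbitrary assumptions:

# Minimal CNN sketch in Keras (illustrative; filter counts and kernel sizes are arbitrary).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Convolution: 32 learnable 3x3 filters shared across the image (weight reuse).
    Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    # Pooling: keeps the dominant feature in each 2x2 window, per channel.
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(10, activation="softmax"),  # example classifier head
])
model.summary()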
RNN(LSTM)
• A recurrent neural network (RNN) processes a sequence of data one element at a time. The current input and the previous hidden state are combined to form a vector carrying information about both. This vector passes through a tanh activation, and the output is a new hidden state, often termed the memory of the network.
• The hidden layer acts as memory, but this memory often vanishes during optimization over long sequences (the vanishing gradient problem).
• Long short-term memory (LSTM) is a solution to this loss of memory. It uses gates as internal mechanisms for maintaining the flow of information. The gates determine which data is important to keep and which to discard, so important information can be preserved across long sequences when making predictions.
• Information is added to or removed from the cell state through the gates; the gates decide what is carried forward on the cell state and what is removed.
Gates of LSTM:
Forget Gate
Input Gate
Output Gate
Figures: LSTM architecture; recurrent neural network.
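A minimal LSTM over a toy sequence in Keras (illustrative only; the sequence length and sizes are arbitrary assumptions):

# Minimal LSTM sketch in Keras (illustrative; sequence length and sizes are arbitrary).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    # 256 memory units; forget/input/output gates are applied internally at each step.
    LSTM(256, input_shape=(10, 64)),  # 10 time steps, 64 features each
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(4, 10, 64)  # a batch of 4 example sequences
print(model.predict(x).shape)  # (4, 1)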
VGG16:
• VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition".
• The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of more than 14 million images belonging to 1000 classes. The input to the first layer is a fixed-size 224 x 224 RGB image. The extracted image features are a 1-dimensional 4,096-element vector.
VGG16 weight model
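A common way to obtain that 4,096-element vector in Keras is to take the output of VGG16's second fully connected layer (fc2); a sketch of this typical approach, where "example.jpg" is a hypothetical image path:

# Extracting 4,096-d image features with a pretrained VGG16 (typical approach; sketch only).
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")  # full model including the 1000-class head
# Re-wire the model to stop at the 4,096-unit fc2 layer instead of the class output.
model = Model(inputs=base.inputs, outputs=base.layers[-2].output)

img = load_img("example.jpg", target_size=(224, 224))  # hypothetical image path
x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
feature = model.predict(x)  # shape: (1, 4096)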
Embeddings:
• Unlike a bag of words, neural network embeddings represent words as dense vectors in a feature space; a dense vector is a continuous vector assigned to each word.
• The weights are initialized randomly and new weights are learned for all the words in the training dataset. Nearest neighbors in the embedding space can be used for recommendation, and embeddings work well with high-cardinality variables.
Vector representation of each token in feature space.
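In Keras, an embedding is a trainable lookup table from word ids to dense vectors; a minimal sketch (the vocabulary size and vector dimension are arbitrary assumptions):

# Minimal word-embedding sketch in Keras (vocabulary size and dimensions are arbitrary).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

model = Sequential([
    # Maps each of 5,000 word ids to a learned 256-d dense vector.
    Embedding(input_dim=5000, output_dim=256),
])

word_ids = np.array([[4, 20, 7, 0, 0, 0, 0, 0, 0, 0]])  # one padded sequence of word ids
print(model.predict(word_ids).shape)  # (1, 10, 256): one dense vector per token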
Research Evidence:
• The model consists of a CNN encoder. Various CNNs were researched to choose the encoder model. VGGNet is preferred for its simplicity and power, although ResNet is the most computationally efficient of the encoders.
• LSTM was developed from the RNN with the intention of working with sequential data. Due to its efficiency in memorizing long-term dependencies through its memory cell, it is the most popular method for image captioning. It generates a caption one word at each time step, conditioning on a context vector together with the previous hidden state and the previously generated words, as sketched below.
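That word-at-a-time generation is typically implemented as a greedy decoding loop; a hedged sketch, where the trained model, the fitted Keras tokenizer, and max_length are assumed to come from the training pipeline described later:

# Greedy caption decoding sketch: emit one word per step until 'endseq' or max_length.
# Assumes a trained `model`, a fitted Keras `tokenizer`, and `max_length` from training.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    text = "startseq"  # seed token added to every caption during training
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]  # words -> integer ids
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_feature, seq], verbose=0)
        word_id = int(np.argmax(yhat))                 # greedy: most probable next word
        word = tokenizer.index_word.get(word_id)
        if word is None or word == "endseq":
            break
        text += " " + word
    return text[len("startseq"):].strip()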
Topic selection reason
• It is clear that only detecting objects is not human-like behavior. There have been many variations and combinations of different techniques since 2014, when neural networks were first applied to image captioning, and many developments have been made toward more advanced human-like behaviour.
• Human-like technologies are more appreciated because they enable computers to interact with humans in specific applications such as child education, health assistants for the elderly or visually impaired people, and many more.
Solution to the engine
• Import the required packages and tools.
• Extract features from the images.
• Initialize the training text dataset.
• Load the unique identifier of each photo and save a dictionary mapping photo identifiers to descriptions.
• Clean the created dataset: remove single-letter words and convert all text to lower case.
• Build the vocabulary from the description dictionaries and save it.
• Load the cleaned dictionary. Add a start word and an end word to each caption, since a caption is generated one word at a time.
• Load the photo features for the entire photo set.
• Create a consistent mapping from the words to unique integer values.
• Compute the maximum sequence length and transform the data into input photos and encoded text.
• Create a sequence processor model for handling the text input.
• Concatenate the feature extractor and the sequence processor (a sketch of the merged model follows this list).
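A hedged sketch of that merged architecture in Keras, following the common merge design for caption models (vocab_size and max_length are placeholders; the real values come from the tokenizer and dataset):

# Merge-style caption model sketch: photo features + encoded text -> next-word probabilities.
# vocab_size and max_length are placeholders; real values come from the tokenizer/dataset.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, concatenate

vocab_size, max_length = 5000, 34

# Feature extractor branch: compresses the 4,096-d VGG16 photo vector.
inputs1 = Input(shape=(4096,))
fe = Dense(256, activation="relu")(Dropout(0.5)(inputs1))

# Sequence processor branch: embeds the partial caption and runs it through an LSTM.
inputs2 = Input(shape=(max_length,))
se = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se = LSTM(256)(Dropout(0.5)(se))

# Decoder: concatenate both branches and predict the next word over the vocabulary.
decoder = Dense(256, activation="relu")(concatenate([fe, se]))
outputs = Dense(vocab_size, activation="softmax")(decoder)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")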
Training and evaluation:
• Set i = 1 and epochs = 20, and get the length of the descriptions.
• For each of the 20 epochs, load the textual descriptions, photo features, tokenizer, and maximum length.
• Get the descriptions and parse through them.
• Create a batch worth of data for a single photo.
• Yield one photo's worth of data per batch, containing the sequences generated for the photo and its set of descriptions (see the generator sketch after this list).
• Fit the model.
• Load the textual descriptions, photo features, tokenizer, and maximum length, and create the reference and candidate lists.
• Calculate the BLEU score and print it.
• Use the best model.
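A hedged sketch of the per-photo data generator and the BLEU calculation (the progressive prefix-to-next-word scheme is the standard one for such models; descriptions, photos, tokenizer, max_length, and vocab_size are assumed to come from the earlier steps):

# Per-photo data generator and BLEU evaluation (sketch; names follow the steps above).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from nltk.translate.bleu_score import corpus_bleu

def data_generator(descriptions, photos, tokenizer, max_length, vocab_size):
    while True:  # loop forever; model.fit draws steps_per_epoch batches per epoch
        for key, desc_list in descriptions.items():
            X1, X2, y = [], [], []
            for desc in desc_list:
                seq = tokenizer.texts_to_sequences([desc])[0]
                # Split 'startseq a dog runs endseq' into (prefix -> next word) pairs.
                for i in range(1, len(seq)):
                    X1.append(photos[key][0])  # the (1, 4096) feature for this photo
                    X2.append(pad_sequences([seq[:i]], maxlen=max_length)[0])
                    y.append(to_categorical([seq[i]], num_classes=vocab_size)[0])
            yield (np.array(X1), np.array(X2)), np.array(y)  # one photo's worth of data

# BLEU: compare generated captions against all reference captions per photo, e.g.
# references = [[r.split() for r in refs] for refs in all_refs]; candidates likewise, then:
# print("BLEU-1:", corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))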
Achieved result
Problems solved
• Human-like interaction can be generated.
• Prediction for an image together with a description of the image.
• Encourages work in the field of NLP, which is still underdeveloped.
• Creates a more advanced field of AI.
Pseudo-code
• IMPORT pandas AS pd
• IMPORT numpy AS np
• READ the “2012-18_teamBoxScore_diff_columns.csv” CSV file
• USE the functions available in the pandas package to sort and validate the rows and columns
• DROP the columns with NaN values for better processing and validate the CSV file
• ASSIGN X (data) and y (target) to train and test the data
• IMPORT train_test_split FROM sklearn.model_selection
• SPLIT the data into training and testing sets
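A runnable Python rendering of that pseudo-code, assuming the named CSV file is available; the "target" column name is a placeholder:

# Runnable rendering of the pseudo-code above ("target" column name is a placeholder).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("2012-18_teamBoxScore_diff_columns.csv")
df = df.sort_index(axis=1)  # sort columns; inspect with df.info() to validate
df = df.dropna(axis=1)      # drop columns containing NaN values

y = df["target"]                   # placeholder: the actual target column name
X = df.drop(columns=["target"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)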
Flowchart
Thank you