Image Captioning
Sujan Babu Shrestha
17031245

Contents
AI concepts used
• CNN
• RNN (LSTM)
• VGG16
• Embeddings
Research Evidence
Solution
• How the program works
• Achieved results
• How it solves the problem in the real world
• Pseudo Code
• Flowchart

What is Artificial Intelligence?
The term artificial intelligence was introduced by John McCarthy at a conference in 1956, but the idea of a machine being able to think is much older. Artificial Intelligence is an approach to making a dumb machine, i.e. a computer or robot, think and respond like humans. Some of the fields using AI are:
• Gaming
• Natural Language Processing
• Expert Systems
• Speech Recognition
• Handwriting Recognition

What is Machine Learning?
Machine learning is a subset of Artificial Intelligence that allows a system to learn and improve from experience automatically, without being explicitly programmed. The term Machine Learning was coined in 1959 by Arthur Samuel, an American pioneer of computing. It works on a set of algorithms and differs from traditional computational methods: the algorithms allow computers to train on input data and use statistical analysis to output values that fall within a specific range.
Types of machine learning:
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

What is Natural Language Processing?
NLP is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language. Rules are applied to identify and extract natural language, such as converting unstructured data into a form that computers can understand. Meaningful text is extracted from every sentence and the essential data is collected from it.
Techniques used in natural language processing, by stage:
• Syntax: lemmatization, morphological segmentation, word segmentation, part-of-speech tagging, parsing, sentence breaking, stemming.
• Semantics: named entity recognition (NER), word sense disambiguation, natural language generation.

Deep Learning
Deep learning is a subset of machine learning concerned with algorithms built on artificial neural networks, which are inspired by the human brain. It is capable of learning, unsupervised, from data that is unstructured or unlabeled. An artificial neural network (ANN) has nodes, called neurons, interconnected by links. Each node receives data, performs an operation, and passes the new data to another node through a link. The links carry weights and biases that influence the next node's operation. Backpropagation is performed to minimize the loss function, and in some cases, such as RNNs, backpropagation through time is used.
(Figure: structure of an ANN)

AI concepts being used in the model
CNN
• A Convolutional Neural Network (CNN) is a deep learning algorithm that takes an input image, assigns learnable weights and biases to different parts of the image, and is able to differentiate one image from another.
• A CNN captures the spatial and temporal dependencies in an image by applying relevant filters across the color channels. If an image is grayscale there is only one channel.
• Because the filters are reused across the image, fewer parameters are involved thanks to the reusability of weights.
• The size of the convolutional feature maps is reduced by the pooling layer, which extracts the dominant features within a fixed kernel size and leaves the number of channels unchanged. Common variants are max pooling and average pooling. A minimal sketch of a CNN with pooling follows below.
(Figures: filter/kernel reduction through channels; complete CNN architecture; max pooling and average pooling)
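As a minimal sketch of the convolution and pooling layers described above (the input shape, filter counts, and 10-class head are illustrative assumptions, not this project's network):

```python
# Minimal illustrative CNN: convolution applies learnable filters to extract local
# features; max pooling keeps the dominant values and halves the spatial size.
# Input shape, filter counts and the 10-class output are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),              # 224 x 224 RGB image (3 channels)
    layers.Conv2D(32, (3, 3), activation="relu"),   # 32 learnable 3x3 filters
    layers.MaxPooling2D((2, 2)),                    # pooling layer: downsample the feature maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),         # example 10-class classifier head
])
model.summary()
```

Each pooling layer halves the spatial size of the feature maps while keeping the number of channels, which is what reduces the parameter count in later layers.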
RNN (LSTM)
• A recurrent neural network (RNN) processes a sequence of data one element at a time. The current input and the previous hidden state are combined into a vector that contains information about both. This vector is passed through a tanh activation, and the output is a new hidden state, often termed the memory of the network.
• The hidden state acts as memory, but it tends to vanish during optimization over long sequences.
• Long short-term memory (LSTM) is a solution to this loss of memory. It uses gates as internal mechanisms for maintaining the flow of information. The gates determine which data is important to keep and which to throw away, so the important information is preserved over long sequences when making predictions.
• Information is added to or removed from the cell state through the gates, which decide what is carried forward on the cell state and what is discarded.
Gates of the LSTM:
• Forget Gate
• Input Gate
• Output Gate
(Figures: LSTM architecture; recurrent neural network)

VGG16
• VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman of the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition".
• The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of more than 14 million images belonging to 1000 classes. The input to the first layer is a fixed-size 224 x 224 RGB image. The extracted image features form a 1-dimensional 4,096-element vector.
(Figure: VGG16 weight model)

Embeddings
• Unlike a bag of words, neural network embeddings represent words as dense vectors in a feature space: each word is assigned a continuous vector.
• The weights are initialized randomly and new weights are learned for all the words in the training dataset. Nearest neighbors in the embedding space can be used for recommendation, and embeddings work well with high-cardinality variables.
(Figure: vector representation of each token in the feature space)

Research Evidence
• The model consists of a CNN encoder. Various CNNs were researched for the choice of encoder. VGGNet is preferred for its simplicity and power, although ResNet is the most computationally efficient of the encoders considered.
• LSTM was developed from the RNN with the intention of working with sequential data. Because of its efficiency in memorizing long-term dependencies through the memory cell, it is the most popular method for image captioning. It generates the caption one word at every time step, conditioning on a context vector together with the previous hidden state and the earlier generated words. A sketch of this encoder-decoder combination follows below.
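As a concrete illustration of the encoder-decoder design above, here is a minimal Keras sketch, assuming VGG16 fc2 features (4,096 elements) as the encoder output and an Embedding + LSTM branch as the sequence processor. The layer sizes (256 units, 0.5 dropout) and helper names are illustrative assumptions, not the exact configuration of this project.

```python
# Sketch of the encoder-decoder design: a VGG16 encoder producing a 4,096-element
# feature vector, and an Embedding + LSTM decoder that predicts the next caption word.
# Layer sizes (256 units, 0.5 dropout) and helper names are illustrative assumptions.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Encoder: VGG16 with the final classification layer removed, so the second-to-last
# (fc2) layer provides the 4,096-element image feature vector.
vgg = VGG16()
feature_extractor = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_features(image_path):
    image = load_img(image_path, target_size=(224, 224))   # VGG16 expects 224 x 224 RGB input
    image = preprocess_input(img_to_array(image)[np.newaxis, ...])
    return feature_extractor.predict(image)                # shape (1, 4096)

# Decoder: merge model combining the photo features with an Embedding + LSTM branch.
def define_model(vocab_size, max_length):
    # photo feature branch
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation="relu")(fe1)
    # sequence processor branch: word indices -> dense embeddings -> LSTM state
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # merge both branches and predict the next word over the vocabulary
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation="relu")(decoder1)
    outputs = Dense(vocab_size, activation="softmax")(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

The two branches are merged by element-wise addition and a softmax layer predicts the next word, which matches the one-word-per-time-step generation described above.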
Topic selection reason
• It is clear that merely detecting objects is not human-like behavior. There have been many variations and combinations of different techniques since 2014, the very first application of neural networks to image captioning, and many developments have been made towards more advanced, human-like behavior.
• Human-like technologies are clearly more appreciated when computers can interact with humans, for specific applications such as child education, health assistants for the elderly or visually impaired people, and many more.

Solution to the engine
• Import the required packages and tools.
• Extract features of the images.
• Initialize the training text dataset.
• Load the unique identifier of each photo and save a dictionary mapping photo identifiers to descriptions.
• Clean the created dataset: remove single-letter words and convert everything to lower case.
• Get the vocabulary from the description dictionaries and save it.
• Load the cleaned dictionary. Add a start word and an end word to each caption, since captions are generated one word at a time.
• Load the photo features for the entire set of photos.
• Create a consistent mapping from words to unique integer values (the tokenizer).
• Compute the maximum sequence length and transform the data into input photos and encoded text.
• Create a sequence processor model for handling text input.
• Combine the feature extractor and the sequence processor (see the encoder-decoder sketch after the Research Evidence section).
• Set i = 1 and epochs = 20, and get the number of descriptions.
• For each of the 20 epochs, load the textual descriptions, photo features, tokenizer, and maximum length.
• Get the descriptions and iterate through them.
• Create a batch worth of data for a single photo.
• Yield one photo's worth of data per batch, containing the sequences generated for that photo and its set of descriptions.
• Fit the model.
• For evaluation, load the textual descriptions, photo features, tokenizer, and maximum length, and create the reference and predicted lists.
• Calculate the BLEU score and print it (a sketch of the generation loop and BLEU scoring appears at the end of this deck).
• Use the best model.

Achieved results

Problems solved
• Human-like interaction can be generated.
• A description of an image can be predicted from the image itself.
• Encourages work in the field of NLP, which is still underdeveloped.
• Helps create more advanced fields of AI.

Pseudo-code (a runnable version appears at the end of this deck)
• IMPORT pandas AS pd
• IMPORT numpy AS np
• READ the "2012-18_teamBoxScore_diff_columns.csv" CSV file
• Use the functions available in the pandas package to sort and to validate the rows and columns.
• DROP the columns with NaN values from the CSV file for better processing, and validate the CSV file.
• ASSIGN X (data) and y (target) to train and test the data.
• IMPORT train_test_split FROM sklearn.model_selection
• SPLIT the data into training and testing sets.

Flowchart

Thank you
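To make the last steps of the solution pipeline concrete, here is a minimal sketch of greedy caption generation and BLEU evaluation. The names `model`, `tokenizer`, `photo_feature`, `max_length` and the "startseq"/"endseq" marker words are illustrative assumptions, not the project's actual identifiers.

```python
# Greedy decoding: generate a caption one word per time step, as in the solution
# steps above. `model`, `tokenizer`, `photo_feature`, `max_length` and the
# 'startseq'/'endseq' markers are illustrative assumptions, not project names.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.translate.bleu_score import corpus_bleu

def generate_caption(model, tokenizer, photo_feature, max_length):
    caption = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]       # encode the words generated so far
        seq = pad_sequences([seq], maxlen=max_length)          # pad to the fixed input length
        yhat = model.predict([photo_feature, seq], verbose=0)  # probabilities over the vocabulary
        word = tokenizer.index_word.get(int(np.argmax(yhat)))  # most probable next word
        if word is None:
            break
        caption += " " + word
        if word == "endseq":                                   # stop at the end marker
            break
    return caption

def evaluate_bleu(references, hypotheses):
    # references: one list of tokenised reference captions per photo;
    # hypotheses: one tokenised generated caption per photo.
    return corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))  # BLEU-1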
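Finally, a runnable sketch of the pseudo-code slide. The CSV filename is taken from the pseudo-code itself; the target column name "result" and the 80/20 split are placeholder assumptions for illustration.

```python
# Runnable sketch of the pseudo-code slide. The target column "result" and the
# 80/20 split ratio are placeholder assumptions, not values from the project.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read the CSV file and inspect the rows and columns
df = pd.read_csv("2012-18_teamBoxScore_diff_columns.csv")
print(df.shape)
print(df.head())

# Drop the columns that contain NaN values for better processing
df = df.dropna(axis=1)

# Assign X (data) and y (target); the target column name is a placeholder
X = df.drop(columns=["result"])
y = df["result"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```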