Yansong Feng and Mirella Lapata
Ashish Bagate

What this paper is about
- Explores the feasibility of automatic caption generation for images in the news domain
- Why the news domain in particular? Training data is easily and abundantly available

Why
- Lots of digital images are available on the Web
- Improved searching
  - Analysis of the image itself
  - Keyword-only searches are ambiguous
  - Targeted queries using longer search strings
- Web accessibility

General Approach
- Two-step process:
  1. Analyze the image and build a representation of it
  2. Run a text generation engine on the image representation to produce a natural language description

Related Work
- Hede et al.: not practical, because it relies on a controlled data set and a manually created database
- Yao et al.: based on the image alone
- Elzer et al.: focus on what the graphic depicts, with little emphasis on graphics generation
- These methods rely on some background information or domain terminology

Problem Formulation
- Given an image I and a document D, generate a caption C
- Training data consists of document–image–caption tuples
- Caption generation is a difficult task even for humans
- A good caption must be succinct and informative, clearly identify the subject of the picture, and draw the reader into the article

Overview of the method
- Similar to the headline generation task
- Training data is easy to collect, although noisy
- Two-stage approach:
  1. Extract keywords from the image (image annotation model)
  2. Generate the caption from these image keywords
- Image features are used to produce faithful and meaningful descriptions of the images

Image Annotation
- Probabilistic model, well suited to noisy data
- Compute SIFT descriptors for the images
- Quantize the descriptors into visual words by k-means clustering
- Obtain the keywords with LDA
- dmix: a bag of words representing the image, document, and caption together

Extractive Caption Generation
- Little linguistic analysis is needed
- The caption is the sentence from the document that is maximally similar to the description keywords

Types of Similarities
- Word overlap
- Cosine similarity
- Probabilistic similarity
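The extractive step (pick the document sentence most similar to the image keywords) and the similarity measures listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword list stands in for the output of the image annotation model, the word-overlap normalization and the KL direction KL(image || sentence) are assumptions, the topic vectors are hypothetical stand-ins for LDA output, and all function names are hypothetical.

```python
import math
from collections import Counter


def word_overlap(keywords, tokens):
    # Assumed normalization: shared distinct words divided by distinct keywords.
    kw = set(keywords)
    return len(kw & set(tokens)) / len(kw) if kw else 0.0


def cosine(keywords, tokens):
    # Cosine similarity between raw term-frequency vectors of the two word lists.
    a, b = Counter(keywords), Counter(tokens)
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def kl_divergence(p, q, eps=1e-10):
    # KL(p || q) between two discrete topic distributions; eps guards against log(0).
    # Lower values mean the distributions are more alike.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def extract_caption(keywords, sentences, similarity=cosine):
    # Extractive caption: the document sentence maximally similar to the image keywords.
    return max(sentences, key=lambda s: similarity(keywords, s.lower().split()))


def extract_caption_by_topics(image_topics, sentence_topics, sentences):
    # Probabilistic variant: the sentence whose topic distribution diverges least
    # from the image's topic distribution (assumed to come from an LDA-style model).
    best = min(range(len(sentences)),
               key=lambda i: kl_divergence(image_topics, sentence_topics[i]))
    return sentences[best]


if __name__ == "__main__":
    keywords = ["president", "speech", "crowd"]  # hypothetical annotation output
    sentences = [
        "The president gave a speech to a large crowd .",
        "Markets fell sharply on Monday .",
    ]
    print(extract_caption(keywords, sentences))
    print(extract_caption(keywords, sentences, similarity=word_overlap))
    # Hypothetical 3-topic distributions, one per sentence.
    print(extract_caption_by_topics([0.7, 0.2, 0.1],
                                    [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]],
                                    sentences))
```

All three calls select the first sentence in this toy example. Note the asymmetry: word overlap and cosine are maximized, whereas KL divergence is a divergence and is therefore minimized.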
- Probabilistic similarity (KL divergence): the similarity between an image and a sentence is measured by the extent to which they share the same topic distributions; a lower divergence means the two are more similar

Issues with Extractive Caption Generation
- No single sentence may adequately represent the image
- Selected caption sentences may be longer than an average caption
- The result may not be catchy

Abstractive Caption Generation: Word-Based Model
- Adapted from headline generation
- Caption = the sequence of words that maximizes the model probability P

Abstractive Caption Generation: Phrase-Based Model
- Caption = the sequence of words that maximizes the model probability P

Evaluation

Thanks!