Slides1 - Tamara L Berg

Yansong Feng and Mirella Lapata
Ashish Bagate
What this paper is about
 Explore the feasibility of automatic caption
generation for images in the news domain
 Why the news domain in particular: training data
is easy to obtain and abundant
 Lots of digital images available on the Web
 Improved searching
 Analysis of the image
 Keyword-only searches are ambiguous
 Targeted queries using longer search strings
 Web accessibility
General Approach
 Two-step process
 Analyze the image and build a representation for the
image
 Run the text generation engine on the image
representation, and come up with a natural language
description
Related Work
 Hede et al. – not practical because of controlled
data set and also manual database creation
 Yao et al. – based on just the image
 Elzer et al. – what the graphic depicts, little
emphasis on graphics generation
 These methods use some background information
Problem Formulation
 For the given image I and the document D,
generate a caption C
 Training data contains document – image –
caption tuples
 Caption generation is a difficult task even for
humans
 A good caption must be succinct and informative,
clearly identify the subject of the picture, and draw
the reader into the article
Overview of the method
 Similar to the headline generation task
 Get the training data (it would be noisy)
 Follows a two-stage approach
 Get the keywords from the image (image annotation)
 Generate the caption from the image keywords
 Uses image features to produce faithful and meaningful
descriptions of the images
Image Annotation
 Probabilistic model – well suited for noisy data
 Calculate SIFT descriptors of images
 Visual words by K-means clustering
 Get the keywords by LDA
 dmix - bag of words representing image –
document – caption
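The visual-word step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the SIFT descriptors are already available as numeric vectors, uses tiny 2-D points in place of real 128-dimensional descriptors, and the function names (kmeans, visual_words) are hypothetical. The LDA step is not shown.

```python
from collections import Counter
from math import dist


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))


def kmeans(points, k, iterations=10):
    """Plain K-means; the first k points seed the centroids (an assumption
    made here so the example is deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def visual_words(descriptors, centroids):
    """Bag of visual words: how many descriptors fall into each cluster."""
    return Counter(nearest(d, centroids) for d in descriptors)


# Toy "descriptors" pooled from training images.
training = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
codebook = kmeans(training, k=2)

# Descriptors of one new image, quantized against the codebook.
histogram = visual_words([(0, 0.5), (10, 10.5), (9, 9)], codebook)
```

Each image then becomes a bag of discrete visual words, which can be combined with the document and caption words into a single bag such as dmix.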
Extractive Caption Generation
 Not much linguistic analysis is needed
 The caption is the sentence from the document
that is maximally similar to the image description
Types of Similarities
 Word Overlap
 Cosine Similarity
 Probabilistic Similarity
 KL divergence – similarity between an image and a
sentence is measured by the extent to which they share
the same topic distributions
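The similarity measures listed above, and the extractive selection they drive, can be sketched as below. This is an illustration under assumptions: sentences and keywords are pre-tokenized lists, word overlap is normalized by the union of the two word sets, the topic distributions passed to kl_divergence are plain probability lists, and all function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def word_overlap(keywords, sentence):
    """Shared words divided by all distinct words in either input."""
    a, b = set(keywords), set(sentence)
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine_similarity(keywords, sentence):
    """Cosine of the angle between the two term-count vectors."""
    u, v = Counter(keywords), Counter(sentence)
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two topic distributions; lower means more similar.
    eps guards against division by zero (an assumption of this sketch)."""
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def select_caption(keywords, sentences, similarity=word_overlap):
    """Extractive caption: the document sentence most similar to the keywords."""
    return max(sentences, key=lambda s: similarity(keywords, s))


keywords = ["minister", "budget"]
sentences = [
    ["the", "weather", "was", "fine"],
    ["the", "minister", "presented", "the", "budget"],
]
caption = select_caption(keywords, sentences)
```

Because KL divergence is a distance rather than a similarity, a selection based on it would pick the sentence with the smallest value, for example by passing a function that returns the negated divergence.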
Issues with Extractive Caption
 No single sentence can represent the image
 Selected caption sentences might be longer than
the average sentence in the document
 May not be catchy
Abstractive Caption Generation
 Word based model
 Adapted from headline generation
 Caption = the sequence of words that maximizes P
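The expression on this slide is cut off. As a hedged sketch only, headline-generation models of this kind usually factor the probability into a content-selection term, a length term, and a language-model term. Using I, D, and C from the problem formulation, one such factorization (an assumption here, with a bigram language model chosen for illustration, and not necessarily the exact form on the original slide) is:

```latex
\[
C^{*} = \arg\max_{w_1,\dots,w_n}
  \prod_{i=1}^{n} P(w_i \in C \mid I, D)
  \cdot P(\mathrm{len}(C) = n)
  \cdot \prod_{i=2}^{n} P(w_i \mid w_{i-1})
\]
```

The first product scores how likely each word is to appear in a caption given the image and document, the middle term prefers captions of typical length, and the last product rewards fluent word order.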
Abstractive Caption Generation
 Phrase based model
 Caption = the sequence of words that maximizes P