visual words - Tamara L Berg

Watch, Listen and Learn
Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney
-Pratiksha Shah
What’s the problem
• We want to automatically annotate and index
images (considering their ever-growing
number)
• Visual cues alone can be ambiguous (depend
on lighting, and variety exhibited even by
objects of the same kind)
Previous methods
• Previous work was focused on learning the
association between visual and textual
information
• Many researchers have worked on activity
recognition in videos using only visual cues
• Some used co-training but in a different
flavour (co-SVM, 2 visual views, 2 text views)
How we propose to solve it
• The two major factors are the features and approach.
• We want to use {visual information + linguistic
information + unlabeled multi-modal data}
for the learning process, using a co-training approach.
• Visual and linguistic information are used as separate
cues and we expect they will complement each other
during training
What is co-training
• Training using 2 different (conditionally independent and
sufficient) views
• First learn a separate classifier for each view
• Most confident predictions of these classifiers are then
used to iteratively construct labeled training data
• Change made: an unlabeled example is only labeled if a
pre-specified confidence threshold for that view is
exceeded
• Used for classifying:
– Webpages (based on content and hyperlink views)
– Email (based on header and body)
– Object detection models
Co-training Algorithm:
– Train a classifier for each view using the labeled data with just the features for that
view.
– Loop over the following steps until there are no more unused unlabeled instances:
1. Compute predictions and confidences of both classifiers for all of the
unlabeled instances.
2. For each view, choose the m unlabeled instances for which its classifier has
the highest confidence. For each such instance, if the confidence value is less
than the threshold for this view, then ignore the instance and stop labeling
instances with this view, else label the instance and add it to the supervised
training set.
3. Retrain the classifiers for both views using the augmented labeled data.
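The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid classifier, the softmax-style confidence, the default values of `m` and `thresholds`, and the early exit when neither view is confident are all choices made for this sketch (the slides compare against SVMs and say the loop runs until the unlabeled pool is used up).

```python
import math

def train_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(centroids, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {c: math.exp(-math.dist(x, mu)) for c, mu in centroids.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())

def co_train(labeled, unlabeled, m=2, thresholds=(0.6, 0.6)):
    """labeled: list of ((view0, view1), label); unlabeled: list of (view0, view1)."""
    labeled, pool = list(labeled), list(unlabeled)
    while True:
        # Train (or retrain, step 3) one classifier per view on that view's features only.
        labels = [y for _, y in labeled]
        models = [train_centroids([x[v] for x, _ in labeled], labels) for v in (0, 1)]
        if not pool:
            return models, labeled
        added = {}  # pool index -> newly assigned label
        for v in (0, 1):
            # Step 1: predictions and confidences for every unlabeled instance.
            scored = [(predict(models[v], x[v]), i) for i, x in enumerate(pool)]
            scored.sort(key=lambda t: t[0][1], reverse=True)
            # Step 2: label the m most confident, stopping at the first one below threshold.
            for (label, conf), i in scored[:m]:
                if conf < thresholds[v]:
                    break
                added.setdefault(i, label)
        if not added:
            # Assumption of this sketch: stop when neither view is confident enough.
            return models, labeled
        labeled += [(pool[i], lab) for i, lab in added.items()]
        pool = [x for i, x in enumerate(pool) if i not in added]

# Toy data: each instance has two 1-D views (e.g. "visual" and "text").
seed = [(([0.0], [0.0]), "a"), (([4.0], [4.0]), "b")]
pool = [([0.2], [0.1]), ([3.9], [4.2]), ([0.1], [0.3]), ([4.1], [3.8])]
models, data = co_train(seed, pool)
```

After the call, `data` holds the two seed examples plus every pool instance that some view labeled with enough confidence, and `models` holds the final per-view classifiers.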
Text Feature
• Preprocess text by removing stop-words
• Stem the remaining words using Porter
stemming
• Frequencies of the resulting word stems
make up the final textual features (“bag
of words” representation)
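As a rough illustration of this pipeline, the sketch below removes stop-words, stems, and counts. The stop-word list is a tiny made-up sample, and `toy_stem` is a crude suffix stripper used only as a stand-in; it is not the Porter algorithm, which a real system would take from an existing library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "in", "of", "and", "to", "it"}

def toy_stem(word):
    """Crude suffix stripper standing in for the Porter stemmer (NOT Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Lowercase, tokenize, drop stop-words, stem, and count stem frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

features = bag_of_words("The players are kicking the ball and the player kicked it")
```

Here "players"/"player" and "kicking"/"kicked" collapse onto shared stems, which is the point of stemming before counting.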
Captioned images
Features used :
• We want to capture overall texture and color
distributions in local regions
• Texture – Gabor filter with 3 scales and 4
orientations
• Color – Mean, standard deviation and
skewness of per-channel RGB and Lab color
pixel values
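The color moments can be sketched as below. Assumptions of this sketch: the population standard deviation is used, skewness is taken as the standardized third moment (other definitions exist), and only the channels passed in are processed (the RGB-to-Lab conversion is omitted). With 6 channels this gives 3 × 6 = 18 values; together with one Gabor response per scale/orientation pair (3 × 4 = 12) that would be consistent with the 30-dimensional vector mentioned on the Image Features slide.

```python
import math

def channel_moments(values):
    """Mean, population standard deviation, and skewness of one color channel."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    third = sum((v - mean) ** 3 for v in values) / n
    skew = third / std ** 3 if std > 0 else 0.0  # flat channel: define skewness as 0
    return mean, std, skew

def color_moments(pixels):
    """pixels: list of per-pixel channel tuples, e.g. (R, G, B)."""
    feats = []
    for channel in zip(*pixels):  # iterate channel by channel
        feats.extend(channel_moments(channel))
    return feats

# Toy "cell" of four RGB pixels (made-up values).
cell = [(10, 0, 5), (20, 0, 5), (30, 0, 5), (100, 0, 5)]
vec = color_moments(cell)  # 3 moments x 3 channels = 9 values
```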
Method (for captioned images)
• Divide each image into a 4 by 6 grid of cells
• Compute the texture feature using a Gabor filter for each cell
• The resulting feature vectors for all regions are then
clustered using k-means
• Each region is then assigned to one of the k clusters
based on its closeness to the cluster centroids
• The final “bag of visual words” represents every image
with a vector of k values, each counting the
regions of the image assigned to that cluster.
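The quantization step can be sketched as follows. This is a toy version under assumptions: plain Lloyd-style k-means with a naive "first k points" initialization, 2-D made-up descriptors instead of real cell features, and k = 2.

```python
import math

def nearest(centroids, x):
    """Index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))

def kmeans(points, k, iters=20):
    """Plain k-means; initializing from the first k points is an assumption."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def visual_word_histogram(cell_vectors, centroids):
    """Count how many cells of one image fall into each cluster (visual word)."""
    hist = [0] * len(centroids)
    for v in cell_vectors:
        hist[nearest(centroids, v)] += 1
    return hist

# Toy 2-D "cell descriptors" for two images (made-up values).
image_a = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1)]
image_b = [(5.2, 4.9), (4.8, 5.0), (0.1, 0.1)]
codebook = kmeans(image_a + image_b, k=2)
hist_a = visual_word_histogram(image_a, codebook)
hist_b = visual_word_histogram(image_b, codebook)
```

The codebook is learned once from the cells of all images; each image is then described only by its histogram over that shared codebook.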
Image Features
Divide images into 4 × 6 grid
…
Capture texture and color
distributions of each cell
into 30-dim vector
N × 30
Cluster the vectors using k-Means
to quantize the features into
a dictionary of visual words
Represent each image as histogram of visual
words
[Fei-Fei et al. ‘05, Bekkerman & Jeon ‘07]
The University of Texas at Austin
Example dataset
Results for captioned images
• Compare co-training to supervised SVM
• Compare co-training to Semi-supervised EM
• Compare co-training to Transductive SVM
Commented videos
Features used:
• We use features that describe both salient
spatial changes and interesting movements.
• Maximize a normalized spatio-temporal Laplacian
operator over spatial and temporal scales
• HOG – 3x3x2 spatio-temporal blocks, 4-bin
HOG descriptor for every block => 72-element
descriptor
Method (for commented videos)
• Use the spatio-temporal motion descriptor of Laptev
• To detect events, use significant local changes in image
values in both space and time.
• Estimate the spatio-temporal extent of the detected events
by maximizing a normalized spatiotemporal Laplacian
operator over both spatial and temporal scales
• A HOG (histogram of oriented gradients) descriptor is calculated at
each interest point. The patch is partitioned into a grid with
3x3x2 spatio-temporal blocks, and four-bin HOG descriptors
are then computed for all blocks and concatenated into a
72-element descriptor
• These descriptors are clustered to form a vocabulary
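The descriptor size follows from 3 × 3 × 2 blocks × 4 bins = 72. The sketch below shows one way such a vector could be assembled. It is only an illustration: the input format (gradient samples with coordinates normalized to the patch), the unsigned orientation binning, and the magnitude-weighted voting are assumptions of this sketch, not details given on the slides.

```python
import math

GRID = (3, 3, 2)  # blocks along x, y, t (from the slide)
BINS = 4          # orientation bins per block (from the slide)

def hog_descriptor(samples):
    """samples: iterable of (x, y, t, gx, gy) with x, y, t normalized to [0, 1)."""
    nx, ny, nt = GRID
    desc = [0.0] * (nx * ny * nt * BINS)
    for x, y, t, gx, gy in samples:
        # Which spatio-temporal block does this sample fall into?
        bx, by, bt = int(x * nx), int(y * ny), int(t * nt)
        block = (bx * ny + by) * nt + bt
        # Unsigned gradient orientation in [0, pi), quantized into BINS bins.
        angle = math.atan2(gy, gx) % math.pi
        ob = min(int(angle / math.pi * BINS), BINS - 1)
        # Magnitude-weighted vote into this block's histogram.
        desc[block * BINS + ob] += math.hypot(gx, gy)
    return desc

# Two made-up gradient samples inside one patch.
desc = hog_descriptor([(0.1, 0.1, 0.1, 1.0, 0.0), (0.9, 0.9, 0.9, 0.0, 2.0)])
```

Each block's four-bin histogram occupies four consecutive entries, so concatenating the 18 blocks yields the 72-element vector.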
Video Features
Detect Interest Points
Harris-Förstner Corner Detector
for both spatial and temporal space
…
Describe Interest Points
Histogram of Oriented Gradients (HoG)
Create Spatio-Temporal Vocabulary
Quantize interest points to create 200
visual words dictionary
N × 72
Represent each video as
histogram of visual words
[Laptev, IJCV ‘05]
Example dataset
Results for commented video
• Compare co-training with supervised SVM for the commented video dataset
• Compare co-training with supervised SVM when commentary is not available
during testing
What does the future look like?
• Larger dataset + more categories for testing
• Labeled data versus associated text
• Already the results show that co-training gives
better results than existing semi-supervised
and supervised methods.
Questions?