BUILDING TEXT FEATURES FOR
OBJECT IMAGE CLASSIFICATION
Gang Wang
Derek Hoiem
David Forsyth
MAIN IDEA
• Text-based image features are built using an auxiliary dataset of Internet images annotated with tags.
• These features support a visual classifier when an object is viewed under novel circumstances.

So, basically: a text classifier and an image classifier are combined into a unified classifier.
WHAT ARE THEY TRYING TO DO?

• Determine which objects are present in an image based on the text that surrounds similar images drawn from large collections.

CHALLENGES

Sounds easy, but images of the same object vary widely in:
• Appearance
• Pose
• Illumination
LOW-LEVEL FEATURES COULD RESCUE US, BUT…
• Color
• Texture
• SIFT features
These could help if we had millions of training samples, but that is unrealistic.

So what can help? There are millions of images on the Internet; they are not labeled for our task, but the text associated with them helps classification.
EUREKA!


• It is easier to determine image content from surrounding text than with currently available image features.
• Given a large enough dataset, we are bound to find images very similar to an input image, so the authors infer likely text for an input image from similar images.
THE COMMON APPROACH
Approach
• Improve annotation quality, or filter spurious search results so they can be used for training.
The Problem
• Noise or ambiguity in the annotations can easily nullify any benefit.
Proposal
• Learn a distance metric that makes images with similar surrounding text similar in visual feature space.
THEIR APPROACH

• Build text features for object image classification, since they are expected to capture the direct semantic meaning of an image.
APPROACH EXPLAINED
• Dataset = training + test images.
• Auxiliary dataset = Internet images (Flickr) that have associated text.

For each training image:
• Extract visual features.
• Find the K nearest-neighbor images in the Internet dataset.
• Use the text associated with these Internet images to build a text feature.
• Train!

Repeat with visual features and combine both (a toy sketch of the pipeline follows).
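A runnable toy sketch of this pipeline (not the authors' code): `aux_feats`, `aux_tags`, and the two-word vocabulary are mock stand-ins for the Flickr auxiliary set, and chi-square distance on normalized histograms is assumed for the nearest-neighbor search.

```python
import numpy as np

rng = np.random.default_rng(0)
aux_feats = rng.random((500, 64))                    # visual features of auxiliary images (mock)
aux_feats /= aux_feats.sum(axis=1, keepdims=True)    # treat them as normalized histograms
aux_tags = [["dog"] if i % 2 else ["cat"] for i in range(500)]
vocab = {"dog": 0, "cat": 1}                         # stands in for 6000 tags/group names

def chi2_to_all(x, Y, eps=1e-10):
    """Chi-square distance from one histogram to each row of Y."""
    return np.sum((x - Y) ** 2 / (x + Y + eps), axis=1)

def text_feature(v, K=150):
    """Histogram the tags of the K auxiliary images nearest to feature v."""
    nn = np.argsort(chi2_to_all(v, aux_feats))[:K]
    hist = np.zeros(len(vocab))
    for i in nn:
        for t in aux_tags[i]:
            hist[vocab[t]] += 1
    return hist / hist.sum()

train_feat = rng.random(64)
train_feat /= train_feat.sum()
print(text_feature(train_feat))                      # feature to train the text classifier on
```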
VISUAL FEATURES

SIFT:
• Used for image matching and object recognition.
• The authors use it to detect and describe local patches.
• About 1000 local patches are extracted from each image.
• The patches are quantized into 1000 clusters, and each patch is assigned a cluster index.
• Finally, each image is represented as a normalized histogram of cluster indices (a bag-of-words sketch follows).
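A minimal bag-of-visual-words sketch of this step: random 128-D vectors stand in for real SIFT descriptors, and MiniBatchKMeans is one plausible quantizer (the slides do not name one).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
pooled = rng.random((50_000, 128))      # stand-in for SIFT descriptors pooled over many images
codebook = MiniBatchKMeans(n_clusters=1000, n_init=3, random_state=0).fit(pooled)

def bow_histogram(patches):
    """Quantize one image's patches; return a normalized 1000-bin histogram."""
    idx = codebook.predict(patches)                        # cluster index per patch
    hist = np.bincount(idx, minlength=1000).astype(float)
    return hist / hist.sum()

patches = rng.random((1000, 128))       # ~1000 local patches per image
feature = bow_histogram(patches)
```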

GIST:
• Powerful for scene categorization and retrieval.
• Each image is represented as a 960-dimensional GIST descriptor.


Color:
• Quantize each RGB channel into 8 bins.
• Each pixel is then represented as an integer between 1 and 512 (8 × 8 × 8 = 512 joint bins).
• A 512-dimensional histogram is built for each image (sketch below).
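A short sketch of the 512-bin color histogram, assuming 8-bit RGB input (with zero-based bin indices rather than the slides' 1 to 512):

```python
import numpy as np

def color_histogram(img):
    """img: H x W x 3 uint8 image. 8 bins per channel -> 512 joint bins, L1-normalized."""
    bins = (img // 32).astype(np.int64)                # 256 intensity levels -> 8 bins
    idx = bins[..., 0] * 64 + bins[..., 1] * 8 + bins[..., 2]
    hist = np.bincount(idx.ravel(), minlength=512).astype(float)
    return hist / hist.sum()

img = np.random.default_rng(0).integers(0, 256, (240, 320, 3), dtype=np.uint8)
feature = color_histogram(img)
```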


Gradient:
• Can be considered a global, coarse SIFT-like feature.
• Divide the image into 4 × 4 cells.
• In each cell, quantize the gradient orientations into 16 bins.
• The whole image is represented as a 4 × 4 × 16 = 256-dimensional vector (sketch below).
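One plausible reading of this feature, sketched below as an unweighted orientation histogram per cell (the slides do not say whether gradient magnitude weights the votes):

```python
import numpy as np

def gradient_feature(gray):
    """gray: H x W array. Returns a normalized 4*4*16 = 256-D orientation histogram."""
    gy, gx = np.gradient(gray.astype(float))
    ori = np.arctan2(gy, gx)                                  # orientation in [-pi, pi]
    bins = np.clip(((ori + np.pi) / (2 * np.pi) * 16).astype(int), 0, 15)
    H, W = gray.shape
    feat = []
    for i in range(4):                                        # 4 x 4 grid of cells
        for j in range(4):
            cell = bins[i * H // 4:(i + 1) * H // 4, j * W // 4:(j + 1) * W // 4]
            feat.append(np.bincount(cell.ravel(), minlength=16))
    feat = np.concatenate(feat).astype(float)
    return feat / feat.sum()

feature = gradient_feature(np.random.default_rng(0).random((240, 320)))
```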


Unified:
• Concatenation of the four features described above.
• Let the features be f1, f2, f3, f4.
• The resultant feature is [w1·f1, w2·f2, w3·f3, w4·f4].

HOW TO FIND WEIGHTS
• Learn the weights from the training images.
• The aim is to force images from the same category to be close in feature space, and images from different categories to be far apart.
• Randomly select N pairs of images from the training set.
• For the i-th pair, set Si = 1 if the two images share at least one object class; otherwise Si = 0.
• For each feature fj, compute the chi-square distance dij between the two images of the i-th pair (the distance is defined on the next slide).
• Learn the weights by minimizing an objective over the N pairs; this can be solved directly with "fmincon" in Matlab (a hedged sketch follows the fmincon slide).
CHI SQUARE?
• Chi-square distance (http://www.stat.lsu.edu/faculty/moser/exst7037/geometry.pdf): the denominator acts as a normalization component for each dimension.
• For n dimensions, one common convention is:

  χ²(x, y) = Σⱼ (xⱼ − yⱼ)² / (xⱼ + yⱼ),  j = 1, …, n

FMINCON?
• Finds the minimum of a constrained nonlinear multivariable function.
• x = fmincon(fun,x0,A,b)
• x = fmincon(fun,x0,A,b,Aeq,beq)
• x = fmincon(fun,x0,A,b,Aeq,beq,lb,ub), …
• http://www.mathworks.com/help/toolbox/optim/ug/fmincon.html
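The slides omit the exact objective, so the sketch below assumes a simple squared loss between Si and exp(−w·di) with w ≥ 0; scipy.optimize.minimize stands in for Matlab's fmincon. Treat it as illustrative, not the paper's formulation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 200
D = rng.random((N, 4))           # d_ij: chi-square distance of feature j on pair i (mock)
S = rng.integers(0, 2, N)        # S_i = 1 if the pair shares an object class

def loss(w):
    sim = np.exp(-D @ w)         # assumed similarity model, not taken from the paper
    return np.sum((sim - S) ** 2)

res = minimize(loss, x0=np.ones(4), bounds=[(0.0, None)] * 4)   # fmincon-style bound constraints
w1, w2, w3, w4 = res.x           # weights for [w1*f1, w2*f2, w3*f3, w4*f4]
```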
AUXILIARY DATASET
• Collected from Flickr.
• 1 million images in total.
• 700,000 of these were collected for 58 object categories whose names come from the PASCAL and Caltech 256 datasets.
• The rest are random images from a Flickr group called "10 million photos".

TEXT FEATURES

For each training/test image:
• Find the K nearest-neighbor images in the auxiliary dataset.
• Extract the text associated with these images.
• Build the text features.

• A group name such as "Dogs! Dogs! Dogs!" is treated as a single item.
• Only frequent tags and group names (6000 of them) in the auxiliary dataset are used.
• The text feature is a normalized histogram of tag and group-name counts (sketch below).
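A minimal sketch of this histogram, assuming the tag lists of the K neighbors are already retrieved; the three-entry vocabulary stands in for the 6000 frequent tags and group names.

```python
import numpy as np

vocab = {"dog": 0, "cat": 1, "Dogs! Dogs! Dogs!": 2}   # a group name counts as one item

def text_feature(neighbor_tags, vocab):
    """neighbor_tags: one tag list per nearest-neighbor image."""
    hist = np.zeros(len(vocab))
    for tags in neighbor_tags:
        for t in tags:
            if t in vocab:                      # keep only frequent tags / group names
                hist[vocab[t]] += 1
    return hist / max(hist.sum(), 1.0)          # normalized histogram of counts

feature = text_feature([["dog", "Dogs! Dogs! Dogs!"], ["cat"], ["sunset"]], vocab)
```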

CLASSIFIER


• An SVM classifier with a chi-squared kernel is used for the text features.
• The same classifier is used for the visual features (sketch below).
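A sketch using scikit-learn's chi-squared kernel with a precomputed-kernel SVM; the paper does not name a toolkit, so this is just one way to realize the classifier.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

rng = np.random.default_rng(0)
X_train = rng.random((100, 512))          # normalized histograms (visual or text features)
y_train = rng.integers(0, 2, 100)
X_test = rng.random((20, 512))

K_train = chi2_kernel(X_train)            # k(x, y) = exp(-gamma * chi2(x, y))
clf = SVC(kernel="precomputed", probability=True).fit(K_train, y_train)
scores = clf.predict_proba(chi2_kernel(X_test, X_train))[:, 1]   # per-image confidence
```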
FUSION
• Build the visual classifier.
• Build the text classifier.
• A third classifier is trained to combine the confidence values of the two and give the final prediction.
• The final classifier is a logistic regression trained on a validation set (sketch below).
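A sketch of this late-fusion step with mock confidence values over a validation set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
conf_visual = rng.random(300)             # visual SVM confidences on the validation set (mock)
conf_text = rng.random(300)               # text SVM confidences on the validation set (mock)
y_val = rng.integers(0, 2, 300)

fusion = LogisticRegression().fit(np.column_stack([conf_visual, conf_text]), y_val)
final = fusion.predict_proba(np.array([[0.7, 0.4]]))[:, 1]       # fused confidence
```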

RESULTS
• PASCAL VOC 2006: 10 object categories.
• PASCAL VOC 2007: 20 object categories.
• Performance is measured with AUC (area under the ROC curve) on the 2006 dataset and with AP (average precision) on the 2007 dataset.
• 150 nearest-neighbor images are used in all experiments.

PERFORMANCE METRICS
• Performance of text features built with different visual features.
• Effect of combining text and visual classifiers.
• Effect of varying the number of training images.
• Performance of text features built with a varying number of Internet images.
• Effect of category names.
For the 2006 dataset: the text classifier outperforms the GIST KNN baseline for each feature, and the unified feature is the best of all. Combination(V), etc. are obtained by training a logistic regression classifier on the validation dataset, using the confidence values returned by the individual classifiers.
VARYING NUMBER OF AUXILIARY IMAGES
EXCLUDING CATEGORY NAMES
QUESTIONS?