Bayesian HC research talk

Advertisement
Bayesian Content-Based Image Retrieval
research with: Katherine A. Heller
based on (Heller and Ghahramani, 2006)
Part IB, Paper 8, Lent
What is Information Retrieval?

Finding material within a large unstructured collection (e.g. the internet) that satisfies the user's information need (e.g. expressed via a query).

Well-known examples…

…but there are many specialist search systems as well.
Universe of items being searched…
Imagine a universe of items:
[figure: scattered points, one per item]
The items could be:
images, music, documents, websites, publications, proteins,
news stories, customer profiles, products, medical records, …
or any other type of item one might want to query.
Illustrative example
Query set: [example images]  Result: [retrieved images]
Query set: [example images]  Result: [retrieved images]
Generalization from a small set…

The query is a set of items.
Our information retrieval method should rank each item x by how well x fits with the query set.
Bayesian Inference & Statistical Models

Statistical model p(x | θ) for data points x, with model parameters θ
Prior on model parameters: p(θ | m)
Dataset D = {x_1, …, x_N} and model class m
Marginal likelihood (evidence) for model m:
  p(D | m) = ∫ p(D | θ, m) p(θ | m) dθ
Illustrative example
Query set: [example images]  Result: [retrieved images]
Query set: [example images]  Result: [retrieved images]
Ranking items

Rank each item in the universe by how well it would "fit into" a set which includes the query set.

Query: [query items]
Ranking: [items ordered from best to worst; output is limited to the top few items]
A Criterion?

Having observed a query set D = {x_1, …, x_N}, belonging to some concept, how probable is it that an item x also belongs to that concept?  p(x | D)

What we really want to know is p(x | D) relative to p(x), the probability of the item before observing the query…
Bayesian Sets Criterion
So we compute:
  score(x) = p(x | D) / p(x)

Assume a simple parameterized model p(x | θ) and a prior on the parameters, p(θ).

Since θ is unknown, to compute the score we need to average over all values of θ:
  p(x | D) = ∫ p(x | θ) p(θ | D) dθ,   p(x) = ∫ p(x | θ) p(θ) dθ
Bayesian Sets Criterion
(A Different Perspective)
We can rewrite this score as:
  score(x) = p(x, D) / ( p(x) p(D) )

This has a nice intuitive interpretation: the score compares the probability that x and the query set D were generated by the same model (with the same, unknown, parameters θ) against the probability that they were generated independently.
Bayesian Sets Algorithm

For simple models, computing the score is tractable.
For sparse binary data, computing all scores can be reduced to a single sparse matrix multiplication.
Even with very simple models and almost no parameter tuning, one can get very competitive retrieval results.
Sparse Binary Data

E.g. each item is a sparse binary vector x = (x_1, …, x_J), with x_j ∈ {0, 1}.

If we use a multivariate Bernoulli model:
  p(x | θ) = ∏_j θ_j^{x_j} (1 − θ_j)^{1 − x_j}
With conjugate Beta prior:
  p(θ_j) = Beta(θ_j | α_j, β_j)
We can compute:
  score(x) = ∏_j [(α_j + β_j) / (α_j + β_j + N)] (α̃_j / α_j)^{x_j} (β̃_j / β_j)^{1 − x_j}
where, summing over the N items of the query set, α̃_j = α_j + Σ_i x_{ij} and β̃_j = β_j + N − Σ_i x_{ij}.
This daunting expression can be dramatically simplified…

Sparse Binary Data
Reduces to:
  log score(x) = c + Σ_j q_j x_j
The log of the score is linear in x, where:
  q_j = log α̃_j − log α_j − log β̃_j + log β_j
and
  c = Σ_j [log(α_j + β_j) − log(α_j + β_j + N) + log β̃_j − log β_j]
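Because the log score is linear in x, every item in the universe can be scored at once with one sparse matrix-vector product. A minimal sketch in Python/SciPy, assuming the empirical prior α_j = κ m_j, β_j = κ(1 − m_j) built from mean feature frequencies (the function name and the κ default are illustrative, not from the talk):

```python
import numpy as np
from scipy import sparse

def bayesian_sets_scores(X, query_idx, kappa=2.0):
    """Score every row of the binary item matrix X against the query set.

    X         : (n_items, n_features) binary matrix (dense or sparse)
    query_idx : row indices forming the query set D
    kappa     : prior strength; alpha_j = kappa*m_j, beta_j = kappa*(1-m_j)

    Returns log score(x) = c + sum_j q_j x_j for every item.
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    m = np.asarray(X.mean(axis=0)).ravel()       # mean feature frequency
    m = np.clip(m, 1e-9, 1 - 1e-9)               # avoid log(0)
    alpha, beta = kappa * m, kappa * (1.0 - m)

    Q = X[query_idx]                             # query set rows
    N = Q.shape[0]
    s = np.asarray(Q.sum(axis=0)).ravel()        # per-feature counts in query
    alpha_t = alpha + s                          # posterior alpha~
    beta_t = beta + N - s                        # posterior beta~

    # constant term c and per-feature weights q
    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
               + np.log(beta_t) - np.log(beta))
    q = (np.log(alpha_t) - np.log(alpha)
         - np.log(beta_t) + np.log(beta))

    # all log scores in a single sparse matrix-vector product
    return c + X @ q
```

Items sharing the query set's active features receive higher scores than dissimilar items, which is the ranking the slides describe.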
Priors

Broad empirical priors are computed from the entire data set, chosen before observing any queries.

The prior is proportional to the mean feature frequency:
  α_j = κ m_j,  β_j = κ (1 − m_j)
where m_j is the mean of feature j across the data set.

Results are robust to changes in the prior strength κ.
Key Advantages of Our Approach

Novel search paradigm for retrieval, based on:
- queries that are a small set of examples
- principled statistical methods (Bayesian machine learning)
- recent psychological research into models of human categorization and generalization

Extremely fast:
- searches >100,000 records per second on a laptop computer
- uses sparse matrix methods
- easy to parallelize and to use inverted indices to search billions of records/sec
Applications

Retrieving movies from a database of movie preferences
  EachMovie dataset: (person, movie) entry is 1 if the person gave the movie a rating above 3 stars out of a possible 0-5 stars

Finding sets of authors who work on similar topics
  NIPS authors dataset: (word, author) entry is 1 if the author uses that word more frequently than twice the mean across all authors

Searching scientific literature
  NIPS dataset: (word, paper) entry is 1 if the paper uses that word more frequently than twice the mean across all papers

Image retrieval based on color and texture features only
  Corel dataset: (image, feature) matrix contains 240 binary features per image: Gabor and Tamura texture features and HSV color features

Searching a protein database
  UniProt database: the "world's most comprehensive catalog of information on proteins". Binary features from GO annotations, PDB structural information, keywords, and primary sequences.

Patent Search (Xyggy.com)
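Several of the datasets above are binarized with the same rule: an entry is 1 when a count exceeds twice its mean. A minimal sketch of that rule (the function name and matrix orientation are illustrative assumptions):

```python
import numpy as np

def binarize_counts(C):
    """Binarize a (word, author) or (word, paper) count matrix:
    entry is 1 if the count exceeds twice that word's mean count
    across all columns, as described for the NIPS datasets."""
    C = np.asarray(C, dtype=float)
    return (C > 2.0 * C.mean(axis=1, keepdims=True)).astype(int)
```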
Retrieving Movies
EachMovie data: 1813 people by 1532 movies
[example queries and retrieved movies]

Retrieving Movies: comparison to Google Sets
[side-by-side results]

Query Times
[table of query times]
Content-Based Image Retrieval
We can use the Bayesian Sets method as the basis of a content-based image retrieval system…

The Image Retrieval Prototype System

A system for searching large collections of unlabelled images: you enter a word, e.g. "penguins", and it retrieves images that match this label, using only color and texture features of the images.

- A database of 32,000 images (from Corel)
- Labelled training images: 10,000 images with about 3-10 text labels per image
- Unlabelled test images: 22,000 images
- For each training and test image we store a vector of 240 binary color and texture features
- A vocabulary of about 2000 keywords
- For each keyword, we can compute a query vector q from the labelled training images, as specified by the Bayesian Sets algorithm.
Image features

Texture features (75):
- 48 Gabor features
- 27 Tamura features

Color features (165):
- HSV histogram (8×5×5)

Binarization:
- compute the skewness of each feature
- assign value 1 to images in the heavier tail
The Image Retrieval Prototype System
The Algorithm:
1. Input query word: w = "penguins"
2. Find all training images with label w
3. Take the binary feature vectors for these training images as the query set and run the Bayesian Sets algorithm: for each image x in the unlabelled test set, compute score(x), which measures the probability that x belongs in the set of images with the label w
4. Return the images with the highest score

The algorithm is very fast: about 0.2 sec on this laptop to query 22,000 test images.
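The four steps above can be sketched end to end in Python. This is a minimal sketch assuming a hypothetical data layout (a list of label sets for the training images plus binary feature matrices), not the prototype's actual code:

```python
import numpy as np

def retrieve_by_keyword(word, labels, F_train, F_test, top_k=3, kappa=2.0):
    """Sketch of the prototype's query loop.

    word    : query keyword, e.g. "penguins"
    labels  : list of label sets, one per training image
    F_train : (n_train, n_features) binary features of labelled images
    F_test  : (n_test, n_features) binary features of unlabelled images
    Returns indices of the top_k test images by Bayesian Sets score.
    """
    # step 2: all training images carrying the label form the query set
    query = F_train[[i for i, ls in enumerate(labels) if word in ls]]
    N = query.shape[0]
    s = query.sum(axis=0)                    # per-feature counts in query

    # broad empirical prior from the whole collection
    m = np.clip(np.vstack([F_train, F_test]).mean(axis=0), 1e-9, 1 - 1e-9)
    alpha, beta = kappa * m, kappa * (1.0 - m)

    # step 3: log score is linear in x, so score all test images at once
    q = (np.log(alpha + s) - np.log(alpha)
         - np.log(beta + N - s) + np.log(beta))
    scores = F_test @ q

    # step 4: return the highest-scoring unlabelled images
    return np.argsort(-scores)[:top_k]
```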
Results on all 50 queries…

Results for Image Retrieval
NNall: nearest neighbors to any member of the query set
NNmean: nearest neighbors to the mean of the query set
BO: Behold Search online, www.beholdsearch.com
    (A. Yavlinsky, E. Schofield and S. Rüger, CIVR 2005)
http://www.inference.phy.cam.ac.uk/vr237/
Conclusions

- Given a query consisting of a small set of items, Bayesian Sets finds additional items that belong in this set
- The score used for ranking items is based on the marginal likelihood of a probabilistic model
- For binary data, the score can be computed exactly and efficiently using sparse matrices (e.g. ~1 sec for over 2 million non-zero entries)
- This approach can be extended to many probabilistic models and other forms of data
- Where applicable, results are competitive with Google Sets:
  Google Sets works well for lists that appear explicitly on the web;
  Bayesian Sets works well for finding more abstract set completions
- We have built prototype movie, author, paper, image and protein search systems
Appendix
Image features
Texture features (75):
We represented images using two types of texture features: 48 Gabor texture features and 27 Tamura texture features. We computed coarseness, contrast and directionality Tamura features for each of 9 (3×3) tiles. We applied 6 scale-sensitive and 4 orientation-sensitive Gabor filters to each image point and computed the mean and standard deviation of the resulting distribution of filter responses.
Color features (165):
We computed an HSV 3D histogram with 8 bins for hue and 5 each for value and saturation. The lowest value bin was not partitioned into hues, since these are hard to distinguish.
Binarization:
Each feature was binarized by computing the skewness of the distribution of that feature and giving a binary value of 1 to the images falling in the 20th percentile of the heavier tail of the feature distribution.
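The binarization step can be sketched as follows, reading "the 20 percentile of the heavier tail" as the most extreme 20% of values on the skewed side (the function name and that reading are assumptions):

```python
import numpy as np
from scipy.stats import skew

def binarize_features(F, tail_fraction=0.20):
    """Binarize real-valued image features per the recipe above:
    find the heavier tail via skewness, then mark the most extreme
    tail_fraction of images in that tail with a 1."""
    F = np.asarray(F, dtype=float)
    B = np.zeros(F.shape, dtype=int)
    for j in range(F.shape[1]):
        col = F[:, j]
        if skew(col) >= 0:
            # right tail heavier: top tail_fraction of values get a 1
            B[:, j] = (col >= np.quantile(col, 1 - tail_fraction)).astype(int)
        else:
            # left tail heavier: bottom tail_fraction of values get a 1
            B[:, j] = (col <= np.quantile(col, tail_fraction)).astype(int)
    return B
```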
Download