790-133
Language and Vision
Tamara Berg
Retrieval
How big is the web?
• The first Google index in 1998 already had 26
million pages
• By 2000 the Google index reached the one
billion mark.
• July 25, 2008 – Google announced that search
had discovered one trillion unique URLs
Slide from Takis Metaxas
How hard is it to go from one page
to another?
• Over 75% of the time there is no directed path
from one random web page to another.
• When a directed path exists its average length
is 16 clicks.
• When an undirected path exists its average
length is 7 clicks.
• Short average path between pairs of nodes is
characteristic of a small-world network (“six
degrees of separation” Stanley Milgram).
Kleinberg: The small-world phenomenon
Information Retrieval
• Information retrieval (IR) is the science of
searching for documents, for information
within documents, and for metadata about
documents, as well as that of searching
relational databases and the World Wide Web
Wikipedia
Slide from Takis Metaxas
The Anatomy of a Large-Scale
Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
“The ultimate search engine would understand
exactly what you mean and give back exactly
what you want.” - Larry Page
Google – misspelling of googol = 10^100
The Google Search Engine
Founded 1998 (1996) by two Stanford students
Originally academic / research project that later became a
commercial tool
Distinguishing features (then!?):
- Special (and better) ranking
- Speed
- Size
Slide from Jeff Dean
The web in 1997
Internet was growing very quickly
• “Junk results often wash out any results that a
user is interested in. In fact, as of November
1997, only one of the top four commercial
search engines finds itself (returns its own
search page in response to its name in the top
ten results).”
Need high precision in
the top results
Google’s first search engine
Components of Web Search
Service
Components
• Web crawler
• Indexing system
• Search system
• Advertising system
Considerations
• Economics
• Scalability
• Legal issues
Slide from William Y. Arms
Web Searching: Architecture
• Documents stored on many Web servers are indexed in a single central index.
• The central index is implemented as a single system on a very large number of computers.
[Diagram: Web servers with Web pages -> (crawl) -> Web pages retrieved by crawler -> (build index) -> Index to all Web pages -> (search)]
Examples: Google, Yahoo!
Slide from William Y. Arms
What is a Web Crawler?
Web Crawler
• A program for downloading web pages.
• Given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set.
• A focused web crawler downloads only those pages whose content satisfies some criterion.
Also known as a web spider.
Slide from William Y. Arms
Simple Web Crawler Algorithm
Basic Algorithm
Let S be the set of URLs to pages waiting to be indexed. Initially S is
a set of known seeds.
Take an element u of S and retrieve the page, p, that it
references.
Parse the page p and extract the set of URLs L it has links to.
Update S = S + L - u
Repeat as many times as necessary.
[Large production crawlers may run continuously]
Slide from William Y. Arms
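A minimal Python sketch of this loop, using only the standard library; `index_page` is a hypothetical hook for whatever indexing system consumes the pages, and a production crawler would also need robots.txt handling, politeness delays, and URL deduplication beyond the simple `seen` set here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(urljoin(self.base_url, href))

def crawl(seeds, index_page, max_pages=100):
    frontier = deque(seeds)      # S: URLs waiting to be indexed
    seen = set(seeds)            # enforces the S = S + L - u update
    while frontier and len(seen) <= max_pages:
        u = frontier.popleft()   # take an element u of S
        try:
            page = urlopen(u, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue             # skip unreachable pages
        index_page(u, page)      # hypothetical indexing hook
        parser = LinkParser(u)
        parser.feed(page)        # parse p and extract the link set L
        for link in parser.links - seen:
            seen.add(link)
            frontier.append(link)
```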
Indexing the Web Goals: Precision
Short queries applied to very large numbers of items
lead to large numbers of hits.
• Goal is that the first 10-100 hits presented should satisfy
the user's information need
-- requires ranking hits in order that fits user's
requirements
• Recall is not an important criterion
Completeness of index is not an important factor.
• Comprehensive crawling is unnecessary
Slide from William Y. Arms
Concept of Relevance and
Importance
Document measures
Relevance, as conventionally defined, is binary (relevant
or not relevant). It is usually estimated by the similarity
between the terms in the query and each document.
Importance measures documents by their likelihood of
being useful to a variety of users. It is usually estimated
by some measure of popularity.
Web search engines rank documents by a weighted
combination of estimates of relevance and importance.
Slide from William Y. Arms
Relevance
• Words in document (stored in inverted index)
• Location information – for use of proximity in
multi-word search.
• In page title, page url?
• Visual presentation details – font size of
words, words in bold.
Relevance
Anchor Text
The source of Document A contains the marked-up text:
<a href="http://www.cis.cornell.edu/">The Faculty of
Computing and Information Science</a>
The anchor text:
The Faculty of Computing and Information Science
can be considered descriptive metadata about the document:
http://www.cis.cornell.edu/
Slide from William Y. Arms
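A small standard-library sketch of harvesting (target URL, anchor text) pairs so the anchor text can be indexed as metadata for the target document; the class name is mine.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []       # (href, anchor text) results
        self._href = None     # href of the currently open <a>, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorTextParser()
p.feed('<a href="http://www.cis.cornell.edu/">The Faculty of '
       'Computing and Information Science</a>')
print(p.pairs)
# [('http://www.cis.cornell.edu/',
#   'The Faculty of Computing and Information Science')]
```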
Importance - PageRank Algorithm
Used to estimate popularity of documents
Concept:
The rank of a web page is higher if many pages link to it.
Links from highly ranked pages are given greater weight than links from
less highly ranked pages.
Slide from William Y. Arms
Intuitive Model (Basic Concept)
Basic (no damping)
A user:
1. Starts at a random page on the web
2. Selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Step 2 a very large number of times
Pages are ranked according to the relative frequency with
which they are visited.
Slide from William Y. Arms
PageRank Example
[Figure: a directed graph of six web pages, numbered 1-6.]
Basic Algorithm: Matrix Representation
Slide from William Y. Arms
Basic Algorithm: Normalize by Number of Links from Page
Slide from William Y. Arms
Basic Algorithm: Weighting of Pages
Initially all pages have weight 1/n:
w0 = (0.17, 0.17, 0.17, 0.17, 0.17, 0.17)
Recalculate weights:
w1 = Bw0 = (0.06, 0.21, 0.29, 0.35, 0.04, 0.06)
If the user starts at a random page, the jth element of w1 is the
probability of reaching page j after one step.
Slide from William Y. Arms
Basic Algorithm: Iterate
Iterate: wk = Bwk-1

          w0     w1     w2     w3    ... converges to ... w
Page 1   0.17   0.06   0.01   0.01         ->            0.00
Page 2   0.17   0.21   0.32   0.47         ->            0.40
Page 3   0.17   0.29   0.46   0.34         ->            0.40
Page 4   0.17   0.35   0.19   0.17         ->            0.20
Page 5   0.17   0.04   0.01   0.00         ->            0.00
Page 6   0.17   0.06   0.01   0.00         ->            0.00

At each iteration, the sum of the weights is 1.
Slide from William Y. Arms
Special Cases of Hyperlinks on the Web
There is no link out of {2, 3, 4}.
[Figure: a directed graph of six pages in which pages 2, 3, and 4 link only among themselves.]
Slide from William Y. Arms
Google PageRank with Damping
A user:
1. Starts at a random page on the web
2a. With probability 1-d, teleports: selects any random page and jumps to it
2b. With probability d, selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Steps 2a and 2b a very large number of times
Pages are ranked according to the relative frequency with which they are visited.
[For dangling nodes, always follow 2a.]
Slide from William Y. Arms
The PageRank Iteration
The basic method iterates using the normalized link matrix, B:
wk = Bwk-1
This w is an eigenvector of B.
PageRank iterates using a damping factor. The method iterates:
wk = (1 - d)w0 + dBwk-1
w0 is a vector with every element equal to 1/n.
Slide from William Y. Arms
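A minimal NumPy sketch of this damped iteration. B is assumed to be the column-normalized link matrix with dangling pages already handled (e.g. their columns replaced by 1/n); the tolerance and iteration cap are arbitrary choices, not values from the slide.

```python
import numpy as np

def pagerank(B, d=0.85, tol=1e-8, max_iter=100):
    """Iterate wk = (1 - d) w0 + d B wk-1 until convergence."""
    n = B.shape[0]
    w0 = np.full(n, 1.0 / n)    # every element equal to 1/n
    w = w0.copy()
    for _ in range(max_iter):
        w_next = (1 - d) * w0 + d * (B @ w)
        if np.abs(w_next - w).sum() < tol:   # L1 change between iterations
            return w_next
        w = w_next
    return w
```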
Choice of d
Conceptually, values of d that are close to 1 are desirable as they emphasize the link structure of the Web graph, but...
• The rate of convergence of the iteration decreases as d approaches 1.
• The sensitivity of PageRank to small variations in data increases as d approaches 1.
It is reported that Google uses a value of d = 0.85 and that the computation converges in about 50 iterations.
Slide from William Y. Arms
Image retrieval
Types of queries
1) Text query based retrieval
2) Image query based retrieval
Content based image retrieval:
• Analyze visual content of images
– Extract features
– Build visual descriptor of each image (query and database
images).
• For a query image, match descriptors between query
and database images.
• Return closest matches in ranked order by similarity.
Image query retrieval
Query Image
Reminder: Image Representation
[Photo by: marielito]
Represent the image as a spatial grid of average pixel colors.
Convert the database of images to this representation.
Represent the query image in this representation.
Find images from the database that are similar to the query (see the sketch below).
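A sketch of this pipeline under assumed details: an 8x8 grid, RGB averages, Euclidean distance for similarity, and images at least 8 pixels on a side.

```python
import numpy as np
from PIL import Image

def descriptor(path, grid=8):
    """Spatial grid of average pixel colors (grid x grid x RGB, flattened)."""
    arr = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    h, w, _ = arr.shape
    cells = np.empty((grid, grid, 3), dtype=np.float32)
    for i in range(grid):
        for j in range(grid):
            cell = arr[i * h // grid:(i + 1) * h // grid,
                       j * w // grid:(j + 1) * w // grid]
            cells[i, j] = cell.mean(axis=(0, 1))   # average color of the cell
    return cells.ravel()

def retrieve(query_path, database_paths):
    """Return database paths ranked by similarity to the query."""
    q = descriptor(query_path)
    dists = [(np.linalg.norm(q - descriptor(p)), p) for p in database_paths]
    return [p for _, p in sorted(dists)]   # closest matches first
```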
Image query retrieval
Query Image
Database Images
Image query retrieval
Query Image
Ranked Results – database images
ranked by similarity to query
Image query retrieval
• What’s easy?
• What’s difficult?
Image Retrieval
• Image relevance
• Image importance
Text info
• Idea – most images have associated text.
• Analyze the words around the picture and associated with it (title, words, links, etc.).
• For a query word, return pictures based on a standard web search over the text associated with each image.
Human info
ESP game
Luis von Ahn, Ruoran Liu and Manuel Blum
Just leave the content analysis/labeling to people.
User data
Watch what people click on!
Text+Image info
Image Retrieval
• Image relevance
• Image importance
PageRank
• For web pages – use links between two pages as a measure of their similarity.
• For images – use the number of matching features between two images as a measure of their similarity (see the sketch below).
  – Features – SIFT features (based on histograms of edges in different directions).
  – Two features are considered matching if their SSD distance is below a threshold.
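A sketch of that graph construction, using OpenCV's SIFT as the feature detector and reusing the `pagerank` power iteration sketched earlier; the SSD threshold here is an arbitrary assumption, not a value from the slide.

```python
import cv2
import numpy as np

def match_count(des_a, des_b, ssd_thresh=40000.0):
    """Number of feature pairs whose SSD distance is below the threshold."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.match(des_a, des_b)
    # BFMatcher reports L2 distance; square it to get SSD.
    return sum(1 for m in matches if m.distance ** 2 < ssd_thresh)

def image_pagerank(image_paths, d=0.85):
    sift = cv2.SIFT_create()
    descs = [sift.detectAndCompute(cv2.imread(p, cv2.IMREAD_GRAYSCALE), None)[1]
             for p in image_paths]
    n = len(descs)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and descs[i] is not None and descs[j] is not None:
                S[i, j] = match_count(descs[j], descs[i])  # "links" into image i
    B = S / np.maximum(S.sum(axis=0), 1)   # column-normalize like the link matrix
    return pagerank(B, d=d)                # power iteration from the earlier sketch
```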
Pros/Cons
• Where will it work well?
• Where will it fail?
• What happens to polysemous queries?
• What about logos?
Text + Image PageRank
How could we extend this algorithm to
incorporate image and text
information?
Animals on the Web
Tamara L. Berg & David Forsyth
I want to find lots of good
pictures of monkeys…
What can I do?
Google Image Search -- monkey
Circa 2006
Words alone won’t work
Flickr Search - monkey
• Even with humans doing the labeling, the data is
extremely noisy -- context, polysemy, photo sets
• Words alone still won’t work!
Our Results
General Approach
- Vision alone won’t solve the problem.
- Text alone won’t solve the problem.
-> Combine the two!
Animals on the Web
Extremely challenging visual
categories.
Free text on web pages.
Take advantage of language advances.
Combine multiple visual and textual
cues.
Goal:
Classify images depicting semantic categories of
animals in a wide range of aspects, configurations
and appearances. Images typically portray multiple
species that differ in appearance.
Animals on the Web Outline:
Harvest pictures of animals from the web using
Google Text Search.
Select visual exemplars using text based
information.
Use visual and textual cues to extend to similar
images.
Harvested Pictures
14,051 images for 10 animal categories.
12,886 additional images for monkey category using related
monkey queries (primate, species, old world, science…)
Text Model
Latent Dirichlet Allocation (LDA) on the words in collected web pages
to discover 10 latent topics for each category.
Each topic defines a distribution over words. Select the 50 most likely
words for each topic.
Example Frog Topics:
1.) frog frogs water tree toad leopard green southern music king irish eggs folk princess river ball
range eyes game species legs golden bullfrog session head spring book deep spotted de am free
mouse information round poison yellow upon collection nature paper pond re lived center talk buy
arrow common prince
2.) frog information january links common red transparent music king water hop tree pictures pond
green people available book call press toad funny pottery toads section eggs bullet photo nature
march movies commercial november re clear eyed survey link news boston list frogs bull sites
butterfly court legs type dot blue
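A hedged sketch of this step using scikit-learn's LDA in place of whatever implementation was used for the slide; `pages` is assumed to be a list of the raw text of the collected web pages for one category.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_words(pages, n_topics=10, n_words=50):
    """Fit LDA on page text; return the n_words most likely words per topic."""
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(pages)             # bag-of-words counts
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    vocab = vec.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[::-1][:n_words]]
            for topic in lda.components_]    # top words per topic
```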
Select Exemplars
Rank images according to whether they have these likely words near
the image in the associated page (word score)
Select up to 30 images per topic as exemplars.
1.) frog frogs water tree toad leopard green
southern music king irish eggs folk princess river
ball range eyes game species legs golden bullfrog
session head ...
2.) frog information january links common
red transparent music king water hop tree
pictures pond green people available book
call press ...
Senses
There are multiple senses of a category within the
Google search results.
Ask the user to identify which of the 10 topics are
relevant to their search. Merge.
Optional second step of supervision – ask user to
mark erroneously labeled exemplars.
Image Model
Match pictures of a category
Geometric Blur Shape Feature
A. Berg & Malik ’01
[Figure: a sparse signal and its geometric blur.]
Captures local shape, but allows for some deformation.
Robust to differences in intra-category object shape.
Used in current best object recognition systems:
Zhang et al, CVPR 2006
Frome et al, NIPS 2006
Image Model (cont.)
Color Features: Histogram of the colors that appear in the image
Texture Features: Histograms of responses to 16 filters
Scoring Images
[Figure: feature space containing features from relevant exemplars, features from irrelevant exemplars, and query features.]
For each query feature apply a 1-nearest-neighbor classifier. Sum
votes for the relevant class. Normalize.
Combine the 4 cue scores (word, shape, color, texture) using a
linear combination.
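A sketch of this scoring rule; the feature arrays, the k-d tree nearest-neighbor structure, and the equal cue weights are all illustrative assumptions, not the paper's learned values.

```python
import numpy as np
from scipy.spatial import cKDTree

def cue_score(query_feats, exemplar_feats, exemplar_labels):
    """Fraction of query features whose nearest exemplar feature is relevant.

    exemplar_labels: 1 for features from relevant exemplars, 0 otherwise.
    The mean over query features is the normalized vote for the relevant class.
    """
    tree = cKDTree(exemplar_feats)
    _, idx = tree.query(query_feats, k=1)   # 1-nearest neighbor per query feature
    return float(exemplar_labels[idx].mean())

def combined_score(word, shape, color, texture,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    """Linear combination of the four cue scores (weights are illustrative)."""
    return float(np.dot([word, shape, color, texture], weights))
```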
Classification Comparison
[Chart: classification performance using words alone vs. words + picture.]
Cue Combination: Monkey
Cue Combination: Giraffe
[Charts: re-ranking precision and classification performance for the frog and monkey categories, compared against Google.]
Ranked Results:
http://tamaraberg.com/google/animals/index.html
Commercial systems
• http://tineye.com/login
• http://labs.systemone.at/retrievr/
• http://www.polarrose.com/
• http://images.google.com
• http://www.picsearch.com/