Windows Live Image Search Hugh Williams Senior Software Design Engineer Windows Live Search

Windows Live Image Search
Hugh Williams
Senior Software Design Engineer
Windows Live Search
Microsoft Corporation
Overview
Windows Live Image Search
Problem Definition and Background
User Interface
Architecture
Why is it a beta?
Questions?
Introduction
Windows Live Image Search is new:
Released in Beta form on March 8, 2006
Architected, designed, and engineered in Redmond
Close relative of MSN/Windows Live web search
Microsoft’s Image search is available only at
Windows Live
The MSN Image Search solution is provided by a third party
Strong partnership between the Windows Live
Search product team and:
Microsoft Research, Cambridge UK
Microsoft Research, Asia (Beijing, China)
Microsoft Research, Redmond
Problem Definition
Find thumbnail images using a text query
There are no web-scale image search engines based on
content-based image retrieval (CBIR)
All modern image search engines share
fundamentals with AltaVista’s original
PhotoFinder (1998)
The thumbnail images represent web pages
“containing” the original image
We crawl web pages and images
More than a billion images
Pages and images regularly refreshed
Large numbers of images enter and leave the
collection daily
More later…
Queries
From a one-month sample of MSN Search
queries:
Most frequent: 65,000+ occurrences
Median: 2 occurrences
Most queries are 1 to 3 words in length
Most popular queries: lindsay lohan, scarlett johansson,
angelina jolie, sex, jessica simpson, kate beckinsale, paris
hilton, britney spears, shakira, sexy, jessica alba,
jennifer lopez
Random queries: bridge, rodolfo font, playboy, douwe
egberts, jesus, tanning, beauty, oakenfold, priyanka
chopra, actors
Around 60 of the top 100 queries are adult
or celebrity
Other popular scenarios are places, animals,
or objects
More On Queries…
In the US, around 10% are spelling errors
Fewer in some languages, more in others
Word forms are extremely common
Tom’s Diner, Toms Diner, Tom Diner
Lots of weirdness:
Math.abs
3/4” Ply
103,5 versus 103.5
www cnn.com
Every conceivable spelling of “Britney”
Navigational queries
Thumbnail Results
Thumbnail Clickthrough
How Users Click Through
[Chart: MSN Result Visits for Web and Image Search. X axis: answer rank (0 to 8000). Y axis: cumulative percentage of sessions (0% to 100%). Series: Web search, Image search.]
Around 75% of Web search result page views are page
one. For image search it is 43%, and the 75% threshold
in image search is reached around page eight
Searching And Ranking
Our ranking process matches queries
to documents
So, what is a document?
We refer to our documents as nodules
A nodule is created for each link between an HTML
document and an image (where we have
retrieved both)
The alternatives are a nodule per image or a nodule per page
A nodule typically contains:
The thumbnail of the image
Text and headers from the HTML page
Image metadata
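A minimal sketch of the nodule idea, for illustration only. The class name, field names, and the `build_nodules` helper are assumptions, not the production schema; the only point taken from the slide is that one nodule exists per page-to-image link where both the page and the image were retrieved.

```python
from dataclasses import dataclass, field


@dataclass
class Nodule:
    """One (HTML page, image) link. All field names are illustrative."""
    page_url: str
    image_url: str
    thumbnail: bytes                      # thumbnail of the image
    page_text: str                        # text and headers from the HTML page
    metadata: dict = field(default_factory=dict)  # image metadata


def build_nodules(page_url, page_text, image_urls, fetched_images):
    """Create one nodule per link from the page to an image we also hold.

    fetched_images maps image URL -> thumbnail bytes for retrieved images.
    Links to images that were never retrieved produce no nodule.
    """
    return [
        Nodule(page_url, url, fetched_images[url], page_text)
        for url in image_urls
        if url in fetched_images
    ]
```

Under this model the same image linked from two pages yields two nodules, each carrying the text of its own page.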
Background: Ranking
So, how do we rank?
We rank using:
Static Rank: Query Independent value
Image and page properties, web link analysis, junk page
probability, and so on
Dynamic Rank: Query Dependent value
TF-IDF, BM25, and so on
The overall rank is a combination of Static and
Dynamic Rank
Broad answer: we compute the similarity
between selected nodules and a query, and
order the results by decreasing similarity
The selected nodules are those that contain all query terms
(Boolean AND to find a filter set, then similarity-based
ordering of the filter set)
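The two-stage process above can be sketched as follows. This is a toy illustration under stated assumptions: BM25 stands in for the dynamic rank, the combination with static rank is a simple weighted sum, and the parameter values (`k1`, `b`, `static_weight`) are conventional defaults rather than anything from the talk. The collection is assumed to be non-empty.

```python
import math
from collections import Counter


def rank(query, nodules, static_rank, k1=1.2, b=0.75, static_weight=1.0):
    """Return nodule ids ordered by decreasing combined score.

    nodules: dict id -> list of terms; static_rank: dict id -> float.
    """
    terms = query.lower().split()
    n = len(nodules)
    avg_len = sum(len(doc) for doc in nodules.values()) / n
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for doc in nodules.values() if t in doc) for t in terms}
    # Boolean AND: the filter set holds nodules containing every query term.
    filter_set = [i for i, doc in nodules.items() if all(t in doc for t in terms)]

    def bm25(doc):
        tf = Counter(doc)
        score = 0.0
        for t in terms:
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        return score

    # Assumed combination: dynamic rank plus weighted static rank.
    scored = {
        i: bm25(nodules[i]) + static_weight * static_rank.get(i, 0.0)
        for i in filter_set
    }
    return sorted(scored, key=scored.get, reverse=True)
```

Note that a nodule with a high static rank never appears unless it passes the AND filter first; static rank only reorders the filter set.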
Algorithmic Search
Traditional Information Retrieval focuses
on Intelligence
Recall
Long queries
Well-formed documents
Small (low millions) index
Image search focuses on
Precision
Short queries
Poor documents
Billions of nodules in the index
Nodule Text
Nodules represent the link between an
HTML page and an image
Nodule text includes elements such as:
The HTML page <title>
Text from the HTML page
Text from near the image is a good start…
ALT or anchor text from the image
Images can be embedded in a page using the
<img> tag or linked-to using the <a> tag
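As a rough illustration of collecting nodule text, the sketch below uses Python's standard `html.parser` to pull out the page `<title>`, the ALT text of `<img>` tags, and the anchor text of `<a>` links that point at images. The class name, the extension list, and the rule for spotting image links are assumptions; text from near the image is not handled here.

```python
from html.parser import HTMLParser

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")  # assumed list


class NoduleTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = {}       # image URL -> ALT or anchor text
        self._in_title = False
        self._link = None      # image URL of the <a> being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attrs.get("src"):
            # Embedded image: keep its ALT text.
            self.images[attrs["src"]] = attrs.get("alt") or ""
        elif tag == "a" and (attrs.get("href") or "").lower().endswith(IMAGE_EXTENSIONS):
            # Linked-to image: anchor text is gathered in handle_data.
            self._link = attrs["href"]
            self.images.setdefault(self._link, "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._link = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._link is not None:
            self.images[self._link] += data
```

Usage: create a parser, call `feed(html)`, then read `title` and `images`; each entry in `images` is a candidate page-to-image link for a nodule.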
Table Parsing
Image Metadata
Ranking uses text and image properties (the
latter are exclusively for image search)
These include:
AspectRatio (the ratio of the X dimension to
the Y dimension)
Pixels (the product of X and Y dimensions)
PhotoGraphic (whether an image is a photograph
or a graphic)
…
Aspect Ratio Extremes
Throwing Out Junk
The Web is full of balls, lines, and Amazon
logos
Right now, we ignore very small images
Some we don’t fetch (HTML width and height
attributes help us), many we drop after fetching
Junk properties help us in ranking:
We lower the rank of images with extreme
aspect ratios
We lower the rank of images with few pixels
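A hedged sketch of how junk properties could feed ranking. The slides say only that very small images are ignored and that extreme aspect ratios and low pixel counts lower rank; the thresholds, the multiplier form, and the function name below are all assumptions chosen for illustration.

```python
MIN_PIXELS = 32 * 32            # assumed cut-off for "very small" images
MAX_ASPECT_SKEW = 4.0           # assumed: beyond 4:1 or 1:4 counts as extreme
FULL_CREDIT_PIXELS = 200 * 200  # assumed size at which no penalty applies


def junk_penalty(width, height):
    """Return a rank multiplier in [0, 1]; 0 means drop the image."""
    pixels = width * height                 # the Pixels property
    if pixels < MIN_PIXELS:
        return 0.0                          # ignore very small images
    aspect = width / height                 # the AspectRatio property
    skew = max(aspect, 1 / aspect)          # treat tall and wide alike
    aspect_factor = min(1.0, MAX_ASPECT_SKEW / skew)
    size_factor = min(1.0, pixels / FULL_CREDIT_PIXELS)
    return aspect_factor * size_factor
```

For example, an 800 by 100 banner keeps half its score under these assumed thresholds, while a 1 by 1 spacer is dropped outright.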
Duplicates And Near Duplicates
Duplication is problematic, particularly for
logos, products, and posters
We compute a hash of all images
All but the highest-ranked exact duplicate
are removed from the filter set at query time
We are working on techniques for
removing near duplicates
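Exact-duplicate removal at query time can be sketched as below. The slides do not say which hash is used; SHA-256 over the raw image bytes is an assumption, as are the function names. Near duplicates (resized or recompressed copies) hash differently and are not caught by this approach.

```python
import hashlib


def image_hash(image_bytes):
    """Hash the raw image bytes (assumed: SHA-256)."""
    return hashlib.sha256(image_bytes).hexdigest()


def drop_exact_duplicates(ranked_results):
    """Keep only the highest-ranked nodule for each image hash.

    ranked_results: list of (nodule_id, image_hash), best first.
    """
    seen = set()
    kept = []
    for nodule_id, digest in ranked_results:
        if digest not in seen:
            seen.add(digest)
            kept.append(nodule_id)
    return kept
```

Because the input is already in rank order, the first nodule seen for each hash is the highest-ranked one, so a single pass is enough.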
User Interface
The Windows Live image search user
interface has five new features:
1. “Infinite scroll” or “smart scroll”
2. Thumbnail size slider
3. Film strip results view
4. Show full image
5. Metadata grow experience
Windows Live Image Search
Infinite Or Smart Scroll
Results are presented in a single page
Removes others’ paging model
Smooths the click curve
Improves browsability
Motivated by click data
As discussed previously, only 43% of image search
result page views are page one
Many sessions show very deep click behaviors
Same motivation for the thumbnail size slider
Other Features…
Motivated and reinforced by usability studies
Film Strip Results View:
Improve results navigation
Remove unnecessary click actions
Make it easy to find a page or image
Show full image feature:
Helps locate original image
Particularly useful for <a> links
Metadata grow
Most users don’t use metadata
Reduce clutter, improve browse experience
Architecture And Design
Crawl and index over a billion nodules
every two weeks
Crawl 750 nodules per second
Answer queries in less than 250ms, with
most answered in less than 50ms
Serve several million queries per day
Peak load of 150+ queries per second
Serve 10,000+ thumbnails per second
at peak
Manage several petabytes of raw storage
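A quick back-of-the-envelope check of how these figures relate, using only the numbers on the slide. The slide values are round numbers, so the results should be read as orders of magnitude rather than exact capacities.

```python
SECONDS_PER_DAY = 24 * 60 * 60

crawl_rate = 750        # nodules per second, from the slide
refresh_days = 14       # "every two weeks"
peak_qps = 150          # peak queries per second, from the slide

# Nodules crawled in one two-week cycle at a steady 750 per second.
nodules_per_cycle = crawl_rate * SECONDS_PER_DAY * refresh_days

# Steady rate needed to cover exactly one billion nodules in two weeks.
rate_for_one_billion = 1_000_000_000 / (SECONDS_PER_DAY * refresh_days)

# Upper bound on daily queries if peak load were sustained all day.
queries_per_day_at_peak = peak_qps * SECONDS_PER_DAY

print(nodules_per_cycle)            # 907200000
print(round(rate_for_one_billion))  # 827
print(queries_per_day_at_peak)      # 12960000
```

A steady 750 nodules per second gives roughly 0.9 billion nodules per cycle, the same order as the stated billion, and a sustained peak would give about 13 million queries per day, which sits above the "several million" daily figure as expected for a peak.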
Architecture: Serving Queries
[Diagram components: Customer Query; Front End Experience (FEX); Federator; Spelling Correction Mid Level Aggregator; Image Search Mid Level Aggregator; Index Serving Node]
Architecture: Index Building
[Diagram components: Index Builder; Crawler; Web Servers; Static Ranker; Index Serving Node]
Indexing: Selection And Crawl
Only way into Search is via our Crawler
We used to have “paid inclusion” but
abandoned it
Google doesn’t have it, Yahoo! does
Crawl is partly prioritized by Static Rank
We crawl the top few billion pages
Biggest issue with crawling: politeness
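One way to picture a crawl that is prioritized by Static Rank while staying polite is a priority queue with a per-host delay. The sketch below is an assumption-laden toy, not the crawler described in the talk: the `Frontier` class, the fixed `min_delay`, and the use of the URL host as the politeness unit are all illustrative choices.

```python
import heapq
from urllib.parse import urlsplit


class Frontier:
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay   # assumed seconds between hits per host
        self._heap = []              # entries: (-static_rank, url)
        self._next_allowed = {}      # host -> earliest next fetch time

    def add(self, url, static_rank):
        # Negate the rank so the highest static rank pops first.
        heapq.heappush(self._heap, (-static_rank, url))

    def next_url(self, now):
        """Pop the best-ranked URL whose host may be fetched at time `now`."""
        deferred = []
        chosen = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[1]).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                self._next_allowed[host] = now + self.min_delay
                chosen = entry[1]
                break
            deferred.append(entry)   # host is cooling down; try later
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return chosen
```

The trade-off is visible even in this toy: a highly ranked site with many pages cannot be crawled faster than its politeness delay allows, so lower-ranked hosts get fetched in the gaps.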
Distributed Searching I: Single Box
[Diagram: Web Server Frontends in front of a single Big Iron box (DEC TurboLaser)]
Monolithic Model (AltaVista, WebCrawler) – the index goes on a
single (big) box.
Advantages:
Easy to scale query volume: just buy more web server frontends and
Big Boxes
Full visibility on results while ranking
Disadvantages:
Hard to scale index size: limited by CPU and memory
Reliability
Distributed Searching II: Word-Striping
[Diagram: Web Server Frontends split the query “Quick brown fox” across index servers holding the terms “Quick”, “brown”, and “fox”]
Stripe the index by term across index servers
Have a central box send the query terms to appropriate servers
Merge the results
Advantages:
Only boxes that have answers get used per query
Have full visibility of results while ranking
Disadvantages:
Some boxes are likely to be more loaded than others
It turns out this creates significant network traffic
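A toy model of word-striping, to make the network-traffic point concrete. Everything here is an assumption for illustration: the class name, routing terms to servers with a CRC32 hash, and counting "shipped" postings as a stand-in for network cost.

```python
import zlib


class TermStripedIndex:
    def __init__(self, num_servers):
        # Each "server" is a dict: term -> set of document ids.
        self.servers = [dict() for _ in range(num_servers)]

    def _server_for(self, term):
        # Assumed routing: a stable hash of the term picks the server.
        return zlib.crc32(term.encode("utf-8")) % len(self.servers)

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            server = self.servers[self._server_for(term)]
            server.setdefault(term, set()).add(doc_id)

    def search(self, query):
        """Return (matching doc ids, number of postings sent to the merger)."""
        postings = []
        for term in query.lower().split():
            server = self.servers[self._server_for(term)]
            postings.append(server.get(term, set()))
        # Whole posting lists travel to the central box before intersection.
        shipped = sum(len(p) for p in postings)
        matches = set.intersection(*postings) if postings else set()
        return matches, shipped
```

In this model a three-term query over three small documents ships six postings to find one match; with common terms and a web-scale index, the shipped lists dwarf the final result, and the server holding a popular term carries more load than the rest.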
Distributed Searching III: Document Striping
[Diagram: Web Server Frontends send the full query “Quick brown fox” to every index server]
Stripe documents randomly across boxes
Send query to all boxes
Merge the results from all boxes
Advantages:
Scales with both index size and query traffic volume
Minimal network traffic, aggregation is easy
Disadvantage:
No visibility on all results while ranking
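Document striping can be sketched as scatter-gather with a top-k merge. This is a simplified illustration: the function names, the seeded random assignment, and the term-frequency score are assumptions, and each shard is just an in-memory dict standing in for a box.

```python
import heapq
import random


def build_shards(documents, num_shards, seed=0):
    """documents: dict id -> text. Assign each document to a random shard."""
    rng = random.Random(seed)
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in documents.items():
        shards[rng.randrange(num_shards)][doc_id] = text.lower().split()
    return shards


def search_shard(shard, terms, k):
    """Local top-k: each box scores only the documents it holds."""
    hits = []
    for doc_id, words in shard.items():
        if terms and all(t in words for t in terms):
            # Assumed toy score: query-term density in the document.
            score = sum(words.count(t) for t in terms) / len(words)
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search(shards, query, k=10):
    """Send the query to every shard, then merge the local top-k lists."""
    terms = query.lower().split()
    partial = [hit for shard in shards for hit in search_shard(shard, terms, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]
```

Each shard returns at most k small (score, id) pairs, which is why aggregation is cheap; the cost is that no shard sees the other shards' documents while scoring, which is the visibility disadvantage noted above.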
Why Is It A Beta?
We are working on multiple features
Continuous improvement of ranking
and relevance
Internationalization and accessibility
Scaling and reliability
Adult filtering
New, thought-leading features
Many of these involve colleagues in
Microsoft Research
© 2006 Microsoft Corporation. All rights reserved.
Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation.
Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft,
and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.