Windows Live Image Search
Hugh Williams
Senior Software Design Engineer
Windows Live Search
Microsoft Corporation

Overview
  Windows Live Image Search
  Problem Definition and Background
  User Interface
  Architecture
  Why is it a beta?
  Questions?

Introduction
  Windows Live Image Search is new:
    Released in Beta form on March 8, 2006
    Architected, designed, and engineered in Redmond
    Close relative of MSN/Windows Live web search
  Microsoft’s Image search is available only at Windows Live
    The MSN Image Search solution is provided by a third-party
  Strong partnership between the Windows Live Search product team and:
    Microsoft Research, Cambridge UK
    Microsoft Research, Asia (Beijing, China)
    Microsoft Research, Redmond

Problem Definition
  Find thumbnail images using a text query
    There are no CBIR-based web-scale image search engines
    All modern image search engines share fundamentals with AltaVista’s original PhotoFinder (1998)
  The thumbnail images represent web pages “containing” the original image
  We crawl web pages and images
    More than a billion images
    Pages and images regularly refreshed
    Large numbers of images enter and leave the collection daily
    More later…

Queries
  From an MSN Search sample drawn from a month:
    Most frequent: 65,000+ occurrences
    Median: 2 occurrences
    Most queries are 1 to 3 words in length
  Most popular queries: lindsay lohan, scarlett johansson, angelina jolie, sex, jessica simpson, kate beckinsale, paris hilton, britney spears, shakira, sexy, jessica alba, jennifer lopez
  Random queries: bridge, rodolfo font, playboy, douwe egberts, jesus, tanning, beauty, oakenfold, priyanka chopra, actors
  Around 60 of the top 100 queries are adult or celebrity
  Other popular scenarios are places, animals, or objects

More On Queries…
  In the US, around 10% are spelling errors
    Less in some languages, more in others
  Word forms are extremely common
    Tom’s Diner, Toms Diner, Tom Diner
  Lots of weirdness:
    Math.abs
    3/4” Ply
    103,5 versus 103.5
    www cnn.com
    Every conceivable spelling of
    “Britney”
  Navigational queries

Thumbnail Results

Thumbnail Clickthrough

How Users Click Through
  [Figure: MSN Result Visits for Web and Image Search. Cumulative percentage of sessions (0% to 100%) against answer rank (0 to 8000), with one curve for Web search and one for Image search.]
  Around 75% of Web search result page views are page one. For image search it is 43%, and the 75% threshold in image search is reached around page eight

Searching And Ranking
  Our ranking process matches queries to documents
  So, what is a document?
    We refer to our documents as nodules
    A nodule is created for each link between an HTML document and an image (where we have retrieved both)
    The alternative is a nodule per image, or a nodule per page
  A nodule typically contains:
    The thumbnail of the image
    Text and headers from the HTML page
    Image metadata

Background: Ranking
  So, how do we rank?
  We rank using:
    Static Rank: Query Independent value
      Image and page properties, web link analysis, junk page probability, and so on
    Dynamic Rank: Query Dependent value
      TF-IDF, BM25, and so on
    The overall rank is a combination of Static and Dynamic Rank
  Broad answer: we compute the similarity between selected nodules and a query, and order the results by decreasing similarity
    The selected nodules are those that contain all query terms (Boolean AND to find a filter set, then similarity-based ordering of the filter set)

Algorithmic Search
  Traditional Information Retrieval focuses on
    Intelligence
    Recall
    Long queries
    Well-formed documents
    Small (low millions) index
  Image search focuses on
    Precision
    Short queries
    Poor documents
    Billions of nodules in the index

Nodule Text
  Nodules represent the link between an HTML page and an image
  Nodule text includes elements such as:
    The HTML page <title>
    Text from the HTML page
      Text from near the image is a good start…
    ALT or anchor text from the image
      Images can be embedded in a page using the <img> tag or linked-to using the <a>
      tag

Table Parsing

Image Metadata
  Ranking uses text and image properties (the latter are exclusively for image search)
  These include:
    AspectRatio (the ratio of the X dimension to the Y dimension)
    Pixels (the product of X and Y dimensions)
    PhotoGraphic (whether an image is a photograph or a graphic)
    …

Aspect Ratio Extremes

Throwing Out Junk
  The Web is full of balls, lines, and Amazon logos
  Right now, we ignore very small images
    Some we don’t fetch (HTML width and height attributes help us), many we drop after fetching
  Junk properties help us in ranking:
    We lower the rank of images with extreme aspect ratios
    We lower the rank of images with few pixels

Duplicates And Near Duplicates
  Duplication is problematic, particularly for logos, products, and posters
  We compute a hash of all images
    All except the highest-ranked exact duplicate are removed from the filter set at query time
  We are working on techniques for removing near duplicates

User Interface
  The Windows Live image search user interface has five new features:
    1. “Infinite scroll” or “smart scroll”
    2. Thumbnail size slider
    3. Film strip results view
    4. Show full image
    5. Metadata grow experience

Windows Live Image Search

Infinite Or Smart Scroll
  Results are presented in a single page
    Removes others’ paging model
    Smooths the click curve
    Improves browsability
  Motivated by click data
    As discussed previously, only 43% of users stay on page one
    Many sessions show very deep click behaviors
  Same motivation for the thumbnail size slider

Other Features…
  Motivated and reinforced by usability studies
  Film Strip Results View:
    Improve results navigation
    Remove unnecessary click actions
    Make it easy to find a page or image
  Show full image feature:
    Helps locate original image
    Particularly useful for <a> links
  Metadata grow
    Most users don’t use metadata
    Reduce clutter, improve browse experience

Architecture And Design
  Crawl and index over a billion nodules every two weeks
    Crawl 750 nodules per second
  Answer queries in less than 250ms, with most answered in less than 50ms
  Serve several million queries per day
    Peak load of 150+ queries per second
  Serve 10,000+ thumbnails per second at peak
  Manage several petabytes of raw storage

Architecture: Serving Queries
  [Diagram labels: Spelling Correction; Mid Level Aggregator; Customer Query; Front End Experience (FEX); Federator; Image Search; Mid Level Aggregator; Index Serving Node]

Architecture: Index Building
  [Diagram labels: Index Builder; Crawler; Web Servers; Static Ranker; Index Serving Node]

Indexing: Selection And Crawl
  Only way into Search is via our Crawler
    We used to have “paid inclusion” but abandoned it
    Google doesn’t have it, Yahoo! does
  Crawl is partly prioritized by Static Rank
  We crawl the top few billion pages
  Biggest issue with crawling: politeness

Distributed Searching I: Single Box
  [Diagram labels: Web Server Frontends; Big Iron (DEC TurboLaser)]
  Monolithic Model (AltaVista, WebCrawler): the index goes on a single (big) box.
  Advantages:
    Easy to scale query volume: just buy more web server frontends and Big Boxes
    Full visibility on results while ranking
  Disadvantages:
    Hard to scale index size: limited by CPU and Memory
    Reliability

Distributed Searching II: Word-Striping
  [Diagram labels: Web Server Frontends; Quick brown fox; Quick; brown; fox]
  Stripe the index by term across index servers
  Have a central box send the query terms to appropriate servers
  Merge the results
  Advantages:
    Only boxes that have answers get used per query
    Have full visibility of results while ranking
  Disadvantages:
    Some boxes are likely to be more loaded than others
    It turns out this creates significant network traffic

Distributed Searching III: Document Striping
  [Diagram labels: Web Server Frontends; Quick brown fox (repeated on each box)]
  Stripe documents randomly across boxes
  Send query to all boxes
  Merge the results from all boxes
  Advantages:
    Scales with both index size and query traffic volume
    Minimal network traffic, aggregation is easy
  Disadvantage:
    No visibility on all results while ranking

Why Is It A Beta?
  We are working on multiple features
    Continuous improvement of ranking and relevance
    Internationalization and accessibility
    Scaling and reliability
    Adult filtering
    New, thought-leading features
  Many of these involve colleagues in Microsoft Research

© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
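
Illustrative sketch: document striping with a Boolean AND filter set. The slides describe striping documents randomly across boxes, sending the query to all boxes, merging the results, and ranking only nodules that contain every query term. The Python below is a minimal sketch of that flow under stated assumptions, not the production design. The names Shard, build_shards, and federated_search are hypothetical. The dynamic rank is a plain term-frequency sum standing in for TF-IDF or BM25, and adding it to the static rank is only one possible "combination".

```python
import heapq
import random
from collections import Counter


class Shard:
    """One index-serving box holding a random subset of the nodules (hypothetical)."""

    def __init__(self):
        self.nodules = {}  # nodule id -> (term counts, static rank)

    def add(self, nodule_id, text, static_rank):
        self.nodules[nodule_id] = (Counter(text.lower().split()), static_rank)

    def search(self, query_terms, k):
        hits = []
        for nodule_id, (counts, static_rank) in self.nodules.items():
            # Boolean AND: the filter set is the nodules containing every term.
            if all(term in counts for term in query_terms):
                # Assumed stand-in for the dynamic rank: summed term frequency.
                dynamic_rank = sum(counts[term] for term in query_terms)
                # Assumed combination: static rank plus dynamic rank.
                hits.append((static_rank + dynamic_rank, nodule_id))
        # A shard sees only its own nodules while ranking.
        return heapq.nlargest(k, hits)


def build_shards(corpus, num_shards, seed=0):
    """Stripe nodules randomly across shards."""
    rng = random.Random(seed)
    shards = [Shard() for _ in range(num_shards)]
    for nodule_id, (text, static_rank) in corpus.items():
        rng.choice(shards).add(nodule_id, text, static_rank)
    return shards


def federated_search(shards, query, k=10):
    """Send the query to every shard and merge the per-shard top-k lists."""
    terms = query.lower().split()
    partial = [hit for shard in shards for hit in shard.search(terms, k)]
    return heapq.nlargest(k, partial)


# Toy corpus: nodule id -> (nodule text, static rank). All values are made up.
corpus = {
    "n1": ("quick brown fox photo", 0.5),
    "n2": ("brown fox fox", 0.1),
    "n3": ("quick brown dog", 0.9),
    "n4": ("brown fox", 2.0),
}
shards = build_shards(corpus, num_shards=3)
results = federated_search(shards, "brown fox", k=2)
print([nodule_id for _, nodule_id in results])
```

Because each shard returns its own top k and the merge keeps the overall top k, the merged list matches what a single box would return for this scoring function. Scores that depend on collection-wide statistics would not have that property, which is one reading of the "no visibility on all results while ranking" disadvantage.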