Distributed Information Retrieval
(c) Maria Indrawan 2004

Challenges in Managing Distributed Information
• No topology of the data organisation.
• Dynamic data.
• The size of the collection.
• No control over quality of the data.
• Multimedia data.

Challenges - Human Factor
• Diversity of users
  – Expert to novice
• Ill-formed queries.
• Specific behaviour
  – Favour precision over recall (85% of users only look at the first screen of results; Lan Huang, A survey on Web Information Technology)

Types of Distributed IR
• Directory
  – Yahoo
• Search Engine
  – Google, AskJeeves, Yahoo, Teoma
• Meta Search
  – Metacrawler, Dogpile
• Distributed Broker
  – Harvest

Directory Listing
• Manually created
  – Yahoo, Google, MSN
  – Open Directory Project
    • www.dmoz.org

Directory Listing
• Automatic classification
• TERENA
  – http://www.terena.nl/tech/projects/portal/isir/reisnews9908seac.html
• Scorpion
  – http://orc.rsch.oclc.org:6109/

Search Engine Architecture
• Crawler (robots)
  – Collects pages from the Web.
• Indexer
  – Indexes the pages collected by the crawler and represents them in an efficient data structure.
• Query Server
  – Accepts the user's query, processes it and returns the results.

Crawler – Design Considerations
• Crawling algorithm
  – Breadth-first vs depth-first
• How do we handle URL aliases?
• How do we reduce server load?
• How do we detect a duplicate page or a mirror site?
• How often do we need to revisit a site?
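The crawler design questions above can be made concrete with a small sketch. It is not the method of any particular search engine: it assumes a breadth-first strategy, a hypothetical `normalise` rule set for URL aliases, and exact-content hashing to catch duplicate pages and mirrors. The `fetch` callback, the toy `WEB` dictionary and the `example` host names are invented for illustration, and no real network access is performed. Reducing server load (for example, a delay between requests to the same host) and revisit scheduling are left out.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Map simple URL aliases to one canonical form (hypothetical rules)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()          # host names are case-insensitive
    path = parts.path or "/"
    if path.endswith("/index.html"):     # treat /index.html as the directory
        path = path[: -len("index.html")]
    # Drop the fragment: it never identifies a different page.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` must return (text, links)."""
    start = normalise(seed)
    queue = deque([start])
    seen_urls = {start}
    seen_hashes = set()
    collected = []
    while queue and len(collected) < max_pages:
        url = queue.popleft()            # FIFO queue gives breadth-first order
        text, links = fetch(url)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                     # exact duplicate or mirror: skip it
        seen_hashes.add(digest)
        collected.append(url)
        for link in links:
            canonical = normalise(link)
            if canonical not in seen_urls:
                seen_urls.add(canonical)
                queue.append(canonical)
    return collected


# Toy in-memory "web" standing in for real HTTP fetches (illustrative only).
WEB = {
    "http://a.example/": ("home", ["http://a.example/x",
                                   "http://A.example/index.html",
                                   "http://mirror.example/"]),
    "http://a.example/x": ("page x", ["http://a.example/y"]),
    "http://a.example/y": ("page y", []),
    "http://mirror.example/": ("home", ["http://a.example/x"]),
}


def fake_fetch(url):
    return WEB.get(url, ("", []))


pages = crawl("http://a.example/", fake_fetch)
print(pages)
# ['http://a.example/', 'http://a.example/x', 'http://a.example/y']
```

In this toy run the alias `http://A.example/index.html` collapses onto the seed URL, and the mirror is dropped because its content hash matches the home page. Swapping `popleft()` for `pop()` would turn the same loop into a depth-first crawl.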
Update Rate
Source: www.searchengineshowdown.com (May 2003)

Search Engine   Newest Page Found   Rough Average   Oldest Page Found
Google          2 days              1 month         165 days
MSN (Ink)       1 day               4 weeks         51 days
HotBot (Ink)    1 day               4 weeks         51 days
AlltheWeb       1 day               1 month         599 days
Gigablast       45 days             7 months        381 days
Teoma           41 days             2.5 months      81 days
WiseNut         133 days            6 months        183 days

Indexer - Design Considerations
• How do we handle typing mistakes?
• Do we use a stop list and a stemming algorithm?
• How much of a given web page do we want to index?
  – Google indexes only the first 101 KB of a web page and the first 120 KB of a PDF file.
• How big do we want the indexed database to be?
  – Response time vs coverage
• Do we want to index PDF and PS files?

Estimated Size
[Figure: chart "Estimated Database Total Size" in millions of pages, comparing estimated and claimed sizes for Google, AlltheWeb, AltaVista, WiseNut, Hotbot, MSN, Teoma, NLResearch and Gigablast. Source: www.searchengineshowdown.com, Dec 31, 2002]

Size Growth
[Figure: size growth chart]

Query Server - Design Considerations
• Retrieval model.
• Complexity of the query syntax.
• HCI – human computer interface.
• Output display.

Retrieval Model
• Traditional approach:
  – Keyword matching returns too many low-quality matches, giving low precision.
• Search engines need VERY high precision output, even at the expense of RECALL.
• How can we achieve this?

Google Retrieval Model
• Utilise the popularity of a page
  – If many other pages point to a page, that page must be very important. We can assign it a high weight during search.
  – If a page is pointed to by a popular page, it can also be considered important, because it is referred to by a reputable source (a popular page).
  – PageRank function.

PageRank Example
[Figure: example link graph annotated with PageRank values]
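The two popularity ideas above (many in-links raise a page's weight, and a link from a popular page counts for more) can be sketched as an iterative computation. The slides do not give the PageRank function itself, so this follows the commonly published formulation and should be read as an assumption: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages with no out-links, and the four-page toy graph are all illustrative choices, not Google's production method.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page in `links`.

    `links` maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}      # start from a uniform rank
    for _ in range(iterations):
        # Every page receives a small base share regardless of its in-links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph (illustrative): A and B both point to C, and C points to D.
toy_links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(toy_links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In the toy graph, C outranks A and B because two pages point to it, and D outranks A and B even though it has a single in-link, because that link comes from the well-ranked page C.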
Google Retrieval Model
• Utilise the anchor text.
  – Anchors often provide more accurate descriptions of web pages than the pages themselves.
  – Anchors may exist for documents which cannot be indexed by a text-based search engine.
• Utilise the appearance of the text.
  – Words in larger or bolder fonts are weighted higher than other words.

Results Overlap
[Figure: results overlap]

Metasearch
• Meta searches do not build their own index.
• They use the indexes of existing search engines.
• When a user posts a query to a meta search engine, it sends the query to a number of search engines and collates the results.

Meta Search
• metacrawler, www.metacrawler.com
  – uses Google, Yahoo, AskJeeves, About, Looksmart, Teoma, Overture, FindWhat.
• dogpile, www.dogpile.com
  – uses Google, Yahoo, AskJeeves, About, Looksmart, Teoma, Overture, FindWhat.
• A list of metacrawlers:
  – http://www.searchenginewatch.com/links/article.php/2156241

Metasearch Design Issue
• Potential problems:
  – Translating the user query into the query syntax of each underlying search engine.
  – Query time is bounded by the least powerful (slowest) underlying system.
  – Combining results into a single ranked list is difficult. Effectiveness depends on heuristics and on the information passed back from the underlying search engines.
    • detecting overlap in the query results
    • different scoring schemes (some engines do not use scores)

Distributed Broker
• Information is indexed locally by geographical location or institutional boundary.
  – Suitable for supporting a community that wants to have a common search database.
• Local indexes are combined to provide wider coverage.
• Document scoring is performed locally by each index server.

Distributed Broker
• Example: Harvest
  – http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html

Distributed Broker
[Figure: broker hierarchy with a Monash broker at the top, brokers for FIT and F. Business below it, and brokers for CSSE, SIMS, ACC and MGM at the lowest level]

General architecture
• Hierarchical vs Flat
• Hierarchical: underlying index servers are connected through a hierarchy of brokers.
  – The broker hierarchy provides efficient and global coverage.
  – Brokers can be geographical, institutional or subject based.

Flat Graph Model
[Figure: queries arriving at interconnected brokers, each broker attached to its own index servers]

Useful site
• www.searchenginewatch.com
  – Provides links to most of the information discovery tools.

Summary
• Types of Distributed Information Discovery
  – Directory Listing
    • Yahoo
  – Search Engines
    • Google, AskJeeves, Teoma
  – Metasearch
    • metacrawler, dogpile
  – Distributed Broker
    • Harvest
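The metasearch design issues listed earlier (detecting overlap, and combining lists from engines whose scores differ or are missing) can be illustrated with a short sketch. It uses reciprocal rank fusion, a rank-only merging technique that is not named in these slides and is swapped in here as one possible heuristic. The constant `k=60`, the trailing-slash rule used to detect overlap, and the engine names and URLs are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_results(result_lists, k=60):
    """Merge ranked URL lists from several engines into one ranked list.

    `result_lists` maps an engine name to its results, best first.
    Only rank positions are used, so engines need not return scores.
    """
    scores = defaultdict(float)
    for results in result_lists.values():
        seen = set()
        for rank, url in enumerate(results, start=1):
            key = url.rstrip("/")        # hypothetical overlap rule
            if key in seen:
                continue                 # count each page once per engine
            seen.add(key)
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Illustrative result lists from two hypothetical engines.
results = {
    "engine_a": ["http://x.example/a", "http://x.example/b", "http://x.example/c"],
    "engine_b": ["http://x.example/b/", "http://x.example/d"],
}
merged = merge_results(results)
print(merged)
# ['http://x.example/b', 'http://x.example/a', 'http://x.example/d', 'http://x.example/c']
```

The page returned by both engines rises to the top because its contributions add up, which is one simple way to reward overlap without comparing incompatible scores.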