Searching the World Wide Web: Meta Crawlers vs. Single Search Engines
By: Voris Tejada

The Beginning
• Since its inception, the internet has grown at a staggering rate, with an extremely large number of pages added every day
• Search engines such as Alta Vista, Excite, HotBot, Infoseek, Lycos, and Northern Light attempted to turn the internet into a “15-billion-word encyclopedia”

Performance Measures For a Search Engine
• Coverage: also called “recall” in IR
• Relevance: also called “precision” in IR
• Freshness of pages in the index
• Speed

What is the main problem with this?
• Coverage
  – Despite their claims, no single search engine could index the entire web
  – Traditional IR systems were really designed for static collections
    • They could not keep up with the growth of the internet

Experiment by Selberg and Etzioni
• They ran an experiment using the result logs from their MetaCrawler web sites
  – “unique documents”
• What were the problems with this experiment?
  – They only took the first X pages returned from each engine
  – The ranking system of each search engine was different

Experiment by Lawrence and Giles
• Produced statistics on the coverage of the major web search engines and the estimated size of the web
• Compared the number of documents returned by each engine and analyzed the results
• Problems
  – They did not know if they were indexing unique URLs or subsets of the same URLs
  – Used only the first X documents returned

They found…
• Using the estimate that the web contains 320 million pages, they calculated the following:
  – HotBot: 34% coverage
  – Alta Vista: 28% coverage
  – Northern Light: 20% coverage
  – Excite: 14% coverage
  – Infoseek: 10% coverage
  – Lycos: 3% coverage
*Note: both experiments were concerned with coverage

What method did they find that could increase coverage?
• Combining results from multiple engines
  – By combining all six search engines, they obtained 3.5 times as many results
  – Selberg and Etzioni had created a MetaCrawler, which gathered a “market share” of the results of each engine
*A solution better than MetaCrawler?

The First MetaCrawler
• Softbot
  – Invented by Selberg and Etzioni at the University of Washington
  – What important qualities did it provide?
    • A single interface for querying multiple search engines, such as Lycos and Alta Vista
    • Higher quality results, as opposed to just combined results

Modular Design
• User Interface
  – Translates user queries and options into appropriate parameters
• Aggregation Engine
  – Obtains references, eliminates duplicates, collates and outputs results
• Parallel Web Interface
  – Downloads HTML pages from the Web, sends queries and obtains results
• Harness
  – Where service-specific information is kept

Motivation
• Growth of the Web
• Difficulty in finding information
• Search engines index different documents and use different ranking algorithms
  – By using a single search engine you could miss over 77% of the most relevant references
• Interfaces of many search engines were difficult to use

Softbot Addresses These Problems
• Aggregates web search services under a unified interface
  – Interface was much easier to use
  – Forwards queries to the individual search engines and ranks the results into one composite list
• Obtains higher quality results
  – Allows users to be more specific
  – Eliminates duplicates using a comparison algorithm
  – Adapts to a rapidly changing environment

Formatting and Ranking
• MetaCrawler translates each query into the appropriate format for each search engine
• Uses a “confidence score” to rank
  – Allows each service to vote on the relevancy of a particular document
  – Higher total score = higher ranking on the final list

Speed
• Has user-modifiable timeouts
• References are downloaded only when needed or when the user chooses
• Shows partial results
  – Doesn’t wait for the full results list to be generated before showing you something

Adaptability, Portability, Scalability
• Modular design allows services to be added, modified, and removed quickly
• Does not require large databases or large amounts of memory; can run on most machines
• Can scale without adding more machines

MetaCrawlers Today
• “The Big Four”
  – Dogpile
  – Metacrawler
  – Excite
  – Webcrawler
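
The “Formatting and Ranking” slide says each service votes on a document and the highest total score ranks first, after duplicates are eliminated. The slides do not give MetaCrawler's actual formulas, so the following is only a minimal Python sketch under stated assumptions: the linear `vote` rule, the `normalize_url` duplicate key, and the engine names are all hypothetical, chosen for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Hypothetical duplicate key: ignore scheme, 'www.' prefix, host case, trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def vote(rank: int, list_length: int) -> float:
    """Assumed voting rule: the top result gets 1.0, falling linearly with rank."""
    return (list_length - rank) / list_length


def aggregate(results_by_engine: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Merge ranked URL lists into one composite list, highest total score first."""
    scores = defaultdict(float)
    first_seen = {}  # duplicate key -> first URL spelling encountered
    for urls in results_by_engine.values():
        voted = set()
        for rank, url in enumerate(urls):
            key = normalize_url(url)
            if key in voted:
                continue  # each engine votes at most once per document
            voted.add(key)
            first_seen.setdefault(key, url)
            scores[key] += vote(rank, len(urls))
    # Sort by descending score; break ties by key so the output is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(first_seen[key], score) for key, score in ranked]


# Usage with made-up result lists from two hypothetical engines.
results = {
    "engine_a": ["http://www.example.com/a/", "http://example.org/b"],
    "engine_b": ["http://example.com/a", "http://example.net/c"],
}
for url, score in aggregate(results):
    print(f"{score:.2f}  {url}")
```

In this sketch a page returned by both engines collects two votes and rises above pages returned by only one, which is the behavior the slide describes. A real metasearch engine would likely weight engines differently and compare page contents, not just URLs, when looking for duplicates.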