Searching the World Wide Web: Meta Crawlers vs. Single Search Engines

By: Voris Tejada
The Beginning
• Since its inception the internet has grown
at a staggering rate with an extremely
large number of pages being added every
day
• Search engines such as Alta Vista, Excite,
HotBot, Infoseek, Lycos, and Northern
Light attempted to turn the internet into a
“15-billion-word encyclopedia”
Performance Measures For a Search
Engine
• Coverage: also called “recall” in IR
• Relevance: also called “precision” in IR
• Freshness of pages in the index
• Speed
What is the main problem with this?
• Coverage
– Despite their claims, no single search engine
could index the entire web
– Traditional IR systems were really designed
for static collections
• They could not keep up with the growth of the
internet
Experiment by Selberg and Etzioni
• They ran an experiment using results recorded in the
logs of their MetaCrawler web sites.
– “unique documents”
• What were the problems with this experiment?
– They only took the first X pages returned from
each engine
– Ranking system of each search engine was
different
Experiment by Lawrence and Giles
• Produced statistics on the coverage of the
major web search engines and the
estimated size of the web
• Compared the number of documents
returned by each and analyzed the results
• Problems
– They did not know if they were indexing
unique URLs or subsets of the same URLs
– Used only the first X documents returned
They found…
• Using an estimate that the web contained 320 million
pages, they calculated the following:
– HotBot: 34% coverage
– Alta Vista: 28% coverage
– Northern Light: 20% coverage
– Excite: 14% coverage
– Infoseek: 10% coverage
– Lycos: 3% coverage
*Note: both experiments were concerned with coverage
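
As a rough illustration of the arithmetic behind these figures, the sketch below converts each coverage percentage back into an implied index size, using the 320 million page estimate quoted above. The function and variable names are hypothetical, and the resulting page counts are simply derived from the slide's percentages, not figures reported by the study.

```python
# Estimated size of the web, as quoted on the slide above.
ESTIMATED_WEB_PAGES = 320_000_000

# Coverage fractions as listed on the slide above.
COVERAGE = {
    "HotBot": 0.34,
    "Alta Vista": 0.28,
    "Northern Light": 0.20,
    "Excite": 0.14,
    "Infoseek": 0.10,
    "Lycos": 0.03,
}


def implied_index_size(coverage: float, web_size: int = ESTIMATED_WEB_PAGES) -> int:
    """Number of pages an engine would hold, given coverage = indexed / web_size."""
    return round(coverage * web_size)


def implied_sizes() -> dict[str, int]:
    """Implied index size for every engine in COVERAGE."""
    return {name: implied_index_size(c) for name, c in COVERAGE.items()}


if __name__ == "__main__":
    for name, pages in implied_sizes().items():
        print(f"{name}: about {pages:,} pages")
```

Under these assumptions, even the engine with the largest coverage leaves roughly two thirds of the estimated web unindexed.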
What method did they find that could
increase coverage?
• Combining results from multiple engines
– Combining all six search engines yielded
3.5 times as many results as a single engine
– Selberg and Etzioni had created a
MetaCrawler which gathered a “market share”
of the results of each engine
*A solution better than MetaCrawler?
The First MetaCrawler
• Softbot
– Invented by Selberg and Etzioni at the
University of Washington
– What important qualities did it provide?
• A single interface to query through multiple search
engines such as Lycos and Alta Vista
• Obtained higher-quality results rather than simply
combining result lists
Modular Design
• User Interface
– Translates user queries and options into appropriate
parameters
• Aggregation Engine
– Obtains references, eliminates duplicates, collates &
outputs results
• Parallel Web Interface
– Downloads HTML pages from the Web, sends
queries and obtains results
• Harness
– Where service-specific information is kept
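
The four modules above can be sketched as a small pipeline. This is only an illustrative outline, not MetaCrawler's actual code: the `Service` class, the `HARNESS` list, the engine names, and the canned result lists are all hypothetical stand-ins, and no real network requests are made.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Service:
    """Harness entry: everything specific to one search service (hypothetical)."""
    name: str
    format_query: Callable[[str], str]   # how this service expects the query
    search: Callable[[str], list[str]]   # stand-in for the real HTTP request


# Harness: two made-up services that return canned results instead of using the network.
HARNESS = [
    Service("engine_a", lambda q: q.replace(" ", "+"),
            lambda q: ["http://a.example/1", "http://shared.example/x"]),
    Service("engine_b", lambda q: f'"{q}"',
            lambda q: ["http://shared.example/x", "http://b.example/2"]),
]


def user_interface(raw_query: str) -> str:
    """Clean up what the user typed before it is translated per service."""
    return " ".join(raw_query.split())


def parallel_web_interface(query: str, services: list[Service]) -> list[list[str]]:
    """Send the query to every service at once and collect each result list."""
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        futures = [pool.submit(s.search, s.format_query(query)) for s in services]
        return [f.result() for f in futures]


def aggregation_engine(result_lists: list[list[str]]) -> list[str]:
    """Collate the result lists, dropping exact duplicates in first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def metasearch(raw_query: str, services: list[Service] = HARNESS) -> list[str]:
    """Run the whole pipeline: interface, parallel fetch, aggregation."""
    query = user_interface(raw_query)
    return aggregation_engine(parallel_web_interface(query, services))
```

Because everything service-specific lives in the harness list, adding or removing a service in this sketch means editing one entry without touching the other modules.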
Motivation
• Growth of the Web
• Difficulty in finding information
• Search engines index different documents and
use different ranking algorithms
– By using a single search engine you could miss over
77% of the most relevant references
• Interfaces of many search engines were difficult
to use
Softbot Addresses These Problems
• Aggregates web search services under a unified
interface
– Interface was much easier to use
– Forwards queries to the individual search engines and
ranks the results into one composite list
• Obtains higher quality results
– Allows users to be more specific
– Eliminates duplicates using a comparison algorithm
– Adapts to a rapidly changing environment
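
The slides do not say which comparison algorithm is used to eliminate duplicates. One plausible approach, shown here purely as an assumption, is to reduce each URL to a canonical key (lower-cased host without a leading "www.", path without a trailing slash, query string) and keep only the first reference for each key. The function names are hypothetical.

```python
from urllib.parse import urlsplit


def canonical_key(url: str) -> tuple[str, str, str]:
    """Reduce a URL to a key that ignores cosmetic differences (heuristic)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]          # assumption: treat www.host and host as the same site
    path = parts.path.rstrip("/") or "/"
    return (host, path, parts.query)


def eliminate_duplicates(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

For example, `http://www.Example.com/page/` and `http://example.com/page` map to the same key in this sketch, so only the first of them is kept.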
Formatting and Ranking
• MetaCrawler translates each query into the
appropriate format for use in each search engine
• Uses a “confidence score” to rank
– Allows each service to vote on relevancy for a
particular document
– Higher total score = higher ranking on final list
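
The voting scheme above can be sketched as a simple sum of per-service scores. This sketch assumes each service's score has already been normalized to a common range (0 to 1 here); the normalization step, the function name, and the sample data are hypothetical and are not taken from MetaCrawler itself.

```python
from collections import defaultdict


def rank_by_confidence(votes_by_service: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Sum each document's scores across services and sort by the total.

    votes_by_service maps a service name to {document URL: score in [0, 1]}.
    Ties are broken alphabetically by URL so the order is deterministic.
    """
    totals = defaultdict(float)
    for votes in votes_by_service.values():
        for url, score in votes.items():
            totals[url] += score
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    votes = {
        "engine_a": {"http://x.example/1": 0.9, "http://x.example/2": 0.5},
        "engine_b": {"http://x.example/2": 0.6, "http://x.example/3": 0.4},
    }
    for url, total in rank_by_confidence(votes):
        print(f"{total:.2f}  {url}")
```

In the sample data, the document returned by both services ends up first even though neither service gave it the single highest score, which is the effect the voting scheme is meant to produce.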
Speed
• Has user-modifiable timeouts
• References are downloaded only when needed or
when the user chooses
• Shows partial results
– Doesn’t wait for full results list to be generated before
showing you something
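
Timeouts and partial results can be sketched with a generator that yields each service's results as soon as they arrive and gives up on services that exceed the user's timeout. The service functions below are hypothetical stand-ins that sleep instead of contacting a real engine, and `cancel_futures` assumes Python 3.9 or later.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError, as_completed


def search_with_timeout(services, query, timeout_s):
    """Yield (service_name, results) as each service answers, until timeout_s elapses."""
    pool = ThreadPoolExecutor(max_workers=len(services))
    futures = {pool.submit(fn, query): name for name, fn in services.items()}
    try:
        for future in as_completed(futures, timeout=timeout_s):
            yield futures[future], future.result()
    except TimeoutError:
        pass  # services that are too slow are left out of the results
    finally:
        # Do not block on slow services; drop any work that has not started yet.
        pool.shutdown(wait=False, cancel_futures=True)


def fast_service(query):
    """Hypothetical service that answers immediately."""
    return [f"http://fast.example/?q={query}"]


def slow_service(query):
    """Hypothetical service that takes longer than the user is willing to wait."""
    time.sleep(1.0)
    return [f"http://slow.example/?q={query}"]


if __name__ == "__main__":
    services = {"fast": fast_service, "slow": slow_service}
    for name, results in search_with_timeout(services, "metacrawler", timeout_s=0.2):
        print(name, results)  # partial results are shown as soon as they arrive
```

Because the function is a generator, the caller can display the first results while slower services are still working, rather than waiting for the complete list.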
Adaptability, Portability, Scalability
• Modular design allows for services to be added,
modified, and removed quickly
• Does not require large databases/large amounts
of memory, can run on most machines
• Has ability to scale without adding more
machines
MetaCrawlers Today
• “The Big Four”
– Dogpile
– Metacrawler
– Excite
– Webcrawler