Metas03 - Hobart and William Smith Colleges

Choosing and Using the Best Metas
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
for Rochester Regional Library Council
Member Libraries’ Staff
Sponsored by the
Rochester Regional Library Council
Supported by Library Services and Technology Act (LSTA) and/or
Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the
New York State Library 2002
For Today …
 Metas: History and Functions
 Search and Retrieval Issues
 Major Players in 2003
 Clustering Technology
 More Good Metas
 Web Search Agents
 Evaluating Metasearch Services
Metasearch defined . . .
 Group of search engines, subject directories
and/or databases made searchable through a
common interface.
 Results may or may not follow the original
source’s rankings
 Today our focus is free metaengines using
subject directories (Yahoo, LII, OD) and
crawler-based engines as sources (Google,
FAST, Teoma)
 We will NOT examine specialized or Deep
Web metas
A GOOD Meta will . . .
 Re-format queries to be compatible with
search syntax of each source
 Enable searchers to use advanced
features (when the sources support them)
 Indicate overlapping results without
repeating them
 Perform additional processing of results,
e.g., ranking for appropriateness,
categorization, etc.
 Use only sources with unique databases
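The first point above, re-formatting a query for each source, can be sketched roughly as follows. This is only an illustration: the engine names and syntax profiles (`engine_a`, `engine_b`, `SourceSyntax`) are hypothetical and do not describe any real search engine's rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSyntax:
    """Hypothetical description of what one source understands."""
    and_operator: str      # text placed between required terms
    supports_phrase: bool  # whether "quoted phrases" are understood


# Assumed profiles for two imaginary sources.
SOURCES = {
    "engine_a": SourceSyntax(and_operator=" AND ", supports_phrase=True),
    "engine_b": SourceSyntax(and_operator=" ", supports_phrase=False),
}


def reformat_query(terms, syntax):
    """Render a list of terms (multi-word terms are phrases) for one source."""
    parts = []
    for term in terms:
        if " " in term and syntax.supports_phrase:
            parts.append(f'"{term}"')
        elif " " in term:
            # The source has no phrase search: fall back to the separate words.
            parts.append(syntax.and_operator.join(term.split()))
        else:
            parts.append(term)
    return syntax.and_operator.join(parts)


for name, syntax in SOURCES.items():
    print(name, "->", reformat_query(["kids of survival", "art"], syntax))
```

Note how the phrase is silently weakened for the source that cannot handle it; a searcher sees one interface but the sources may receive different queries.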
The beginnings of metasearch
 A conceptual descendant of Veronica
 March 1995 – Harvest (later Savvysearch,
now Search.com) developed at Colorado
State by Daniel Dreilinger
 July 1995 – Metacrawler developed at U. of
Washington by Selberg and Etzioni
 “Metacrawler Architecture for Resource
Aggregation on the Web” 1996
The beginnings of metasearch
 1996 - Dogpile
 1998 - Ixquick
 1999 - Kartoo
 2000 - Ithaki
 2001 - Vivisimo
More facts about metas
 “Flavor” determined by choice of sources
 Comprehensive
 Vivisimo, Ixquick, Metacrawler
 General Lifestyle, popular culture
 Dogpile, Profusion
 Commercial
 Search.com, Excite@home
Metas and retrieval
 Metas search quickly but not deeply
 Search time or a quantity of searches are
purchased from sources (typically the top 10–50 hits from each)
 Metas are subject to time-out limits from
their sources
 Each source is usually NOT searched for
each query
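The effect of a time-out can be sketched in a few lines. This is a simplified model, not how any particular meta is built: the two "engines" are stand-in functions that merely sleep, and the names `query_sources`, `fast_engine` and `slow_engine` are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fast_engine(query):
    """Hypothetical source that answers almost immediately."""
    time.sleep(0.01)
    return [f"fast hit for {query}"]


def slow_engine(query):
    """Hypothetical source that answers too slowly for a short time-out."""
    time.sleep(1.0)
    return [f"slow hit for {query}"]


def query_sources(sources, query, time_out):
    """Ask every source in parallel; keep only the answers that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=time_out)
    pool.shutdown(wait=False)  # do not hold the searcher up for slow sources
    results = {futures[f]: f.result() for f in done}
    skipped = sorted(futures[f] for f in not_done)
    return results, skipped


if __name__ == "__main__":
    engines = {"fast_engine": fast_engine, "slow_engine": slow_engine}
    print(query_sources(engines, "heterotropia", time_out=0.3))
```

With a short time-out the slow source contributes nothing, which is one reason the same query can return different totals at different settings.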
Metas and retrieval
 “Dumbing Down the Query”
 Advanced features are often unavailable; when
offered, they are limited to those shared among sources
 Default setting for time-out is the shortest; set to
maximum for more comprehensive searches
(when available)
 For most metas, advertising is the only source
of revenue; software sales are rare
Metas and retrieval
 What is their place in my search strategy?
 Metas best used for simple searches, with little (or
no) syntactic complexity
 Use them to find the top few sites on a topic
 For a quick overview of a topic’s coverage on the
Web in general
 Use them “as a last resort” for highly focused
topics that elude your usual search tools
 As a possible indication of coverage of a topic
among several engines (NOTE: problematic)
Searching the metas
 Results depend on
 Choice of sources
 Query processing speed OF THE
SOURCE
 Length of time spent at each source
A search comparison . . .
 Searched heterotropia (abnormal binocular
vision) on 4/21/03
Metaengine    Shortest time-out   Longest time-out
Vivisimo      77                  126
Ixquick       37 (“from at least 450 results”)
Profusion     30                  39
Metacrawler   42                  61
Webcrawler    31                  80
Dogpile       29 (no time-out option)
Excite        41                  31
Stability of Results
Searched “kids of survival” (modern art group)
as a phrase at 3-minute intervals (time-outs at
default setting) on 4/21/03
Source      Search #1  Search #2  Search #3
Vivisimo    128        137        132
Ixquick     59         61         61
Profusion   27         27         27
iBoogie     128        185        171
Metacrw.**  138        133        137
Webcrw.     45         55         49
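One rough way to compare stability across the services in this table is the range of the three counts divided by their mean. The measure and the name `relative_spread` are choices made for this sketch, not part of the original comparison; the counts are copied from the table.

```python
# Result counts from the three searches of "kids of survival" on 4/21/03.
COUNTS = {
    "Vivisimo": [128, 137, 132],
    "Ixquick": [59, 61, 61],
    "Profusion": [27, 27, 27],
    "iBoogie": [128, 185, 171],
    "Metacrw.": [138, 133, 137],
    "Webcrw.": [45, 55, 49],
}


def relative_spread(values):
    """Range of the counts as a fraction of their mean (0 means identical)."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


# Print the services from most stable to least stable.
for name in sorted(COUNTS, key=lambda n: relative_spread(COUNTS[n])):
    print(f"{name:10s} {relative_spread(COUNTS[name]):.1%}")
```

Three searches are far too few to draw firm conclusions; the figure only makes the differences in the table easier to see at a glance.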
Metas and ranking options
 Listing by SOURCE
 Usually retains ranking of source
 COMBINED Listing options
 Indicate source of each result
 Indicate duplicates without repeating them
 Indicate position in original source’s ranking
 “Most duplicated hits” listed first
 Disclose paid listings (if disclosed by source)
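A combined listing with the options above might be assembled along these lines. This is a minimal sketch under assumed inputs: the source names and URLs are made up, and actual metas use their own (often proprietary) ranking rules.

```python
from collections import defaultdict


def merge_results(per_source):
    """Combine ranked URL lists from several sources into one listing.

    per_source maps a source name to its URLs in that source's ranked order.
    Each URL appears once, with every (source, rank) pair that returned it.
    """
    seen = defaultdict(list)  # url -> [(source, rank_in_source), ...]
    for source, urls in per_source.items():
        for rank, url in enumerate(urls, start=1):
            seen[url].append((source, rank))
    # "Most duplicated hits" first; ties go to the best original rank, then URL.
    return sorted(
        seen.items(),
        key=lambda item: (-len(item[1]), min(r for _, r in item[1]), item[0]),
    )


# Hypothetical sources and results, for illustration only.
example = {
    "A": ["x.com", "y.com"],
    "B": ["y.com", "z.com"],
    "C": ["y.com", "x.com"],
}
for url, hits in merge_results(example):
    print(url, hits)
```

Each merged entry keeps its sources and original positions, so the searcher can still see where a hit came from and how that source ranked it.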
Vivisimo
 http://vivisimo.com
 Sources: Altavista, Yahoo, MSN, Netscape,
Lycos, LookSmart, Gigablast, Vizzavi, BBC,
Librarian’s Index to the Internet plus 11
specialized news sources and 7
specialized business, medical and
governmental sources
 Offers full Boolean and phrase search (if
supported by the source)
Vivisimo
 Offers the following customizations:
 Selection of sources searched
 Total number of results retrieved
 Length of search (“time-out period”)
 Results combined
 Source for each result given
 Ranking data from that source given
 Duplicates noted, but not repeated
Vivisimo
 Other features:
 Results are clustered by keyword
prevalence or website of origin
 Offers a preview of each result in a
separate window
 Offers vertical searches: Top News,
Business News, Tech News, Sports
News
Clustering results (“folders”)
 Automated “subject analysis”
 Facilitates navigation and query refinement
 Can be hierarchical (folders within folders)
 One document may appear in several
folders
 Northern Light first public search engine to
make use of folders
Clustering technology in a
metasearch environment
 Real-time processing of results retrieved
from sources
 Variety of data can be returned from each
source
 URL
 Title
 First few sentences
 Human-created summary
 Folder creation varies according to data
from sources and processing time
available at the moment of the query
Clustering -- Step 1
 Significant terms are identified from all
results based on
 Frequency of term(s)
 Position of term(s)
 Normalization algorithms applied
 Documents analyzed for word variants
(stemming)
 Norms set (“authority control”)
“game downloads” “download games”
“downloading games”
 Folder “labels” created
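The normalization part of Step 1 can be illustrated with a toy example. The suffix-stripping rule below (`crude_stem`) is a deliberately crude stand-in for a real stemmer, the `min_count` threshold is an assumption, and term position is ignored entirely; commercial clustering engines use far more elaborate, proprietary methods.

```python
from collections import Counter, defaultdict


def crude_stem(word):
    """Very rough stemming: drop a trailing 'ing' or 's' from longer words."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(phrase):
    """Reduce a phrase to an order-independent tuple of stems."""
    return tuple(sorted(crude_stem(w) for w in phrase.split()))


def folder_labels(phrases, min_count=2):
    """Group variant phrases and label each group by its most frequent form."""
    variants = defaultdict(Counter)
    for phrase in phrases:
        variants[normalize(phrase)][phrase.lower()] += 1
    return {
        key: forms.most_common(1)[0][0]
        for key, forms in variants.items()
        if sum(forms.values()) >= min_count
    }


found = ["game downloads", "download games", "downloading games",
         "game downloads", "chess rules"]
print(folder_labels(found))
```

Here the three variants from the slide collapse into one norm and receive a single folder label, while a phrase seen only once does not become a folder.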
Clustering -- Step 2
 Each result from the sources is matched
against the set of folder labels and
assigned to one or more folders
 By linguistic analysis (term position,
predictive descriptive importance)
 By statistical analysis (term frequency)
 Final, proprietary analysis combines
these (and more)
 Remember: The full documents are not
available to a meta for this type of
processing
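Step 2 can be sketched in the same spirit. The example below assigns each result to every folder whose label words all occur in its title or snippet, using only the short text a meta actually receives. The matching rule, the `crude_stem` helper and the sample results are assumptions for illustration; the linguistic and statistical weighting described above is not modeled.

```python
def crude_stem(word):
    """Very rough stemming: drop a trailing 'ing' or 's' from longer words."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def assign_to_folders(results, labels):
    """Place each result in every folder whose label stems it contains.

    results: list of dicts with 'title' and 'snippet' (no full documents).
    labels: list of folder label strings.
    """
    folders = {label: [] for label in labels}
    for result in results:
        text = f"{result['title']} {result['snippet']}"
        stems = {crude_stem(w.strip('.,;:!?"()')) for w in text.split()}
        for label in labels:
            if all(crude_stem(w) in stems for w in label.split()):
                folders[label].append(result["title"])
    return folders


# Hypothetical results as a meta might receive them from its sources.
sample = [
    {"title": "Free Game Downloads", "snippet": "Downloading chess games."},
    {"title": "Chess rules", "snippet": "How the game is played."},
]
print(assign_to_folders(sample, ["game downloads", "chess"]))
```

The first result lands in both folders, which matches the earlier point that one document may appear in several folders.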
Profusion
 http://profusion.com
 Sources: Altavista, Yahoo, MSN,
About.com, Adobe PDF, AOL, LookSmart,
Lycos, Netscape, Raging Search, Teoma,
WiseNut
 Offers full Boolean and phrase search (if
supported by the source)
Profusion
 Offers the following customizations:
 Selection of sources searched
 Total number of results retrieved
 Length of search (“time-out period”)
 Offers option of results listed by source or
combined listing
 Source for each result given
 Ranking data from that source given
 Duplicates noted, but not repeated
Profusion
 Other features:
 Results can be sorted by relevance score, title or
URL
 “Similar Result” enhancement
 Profusion Relevance Score shown
 Search terms highlighted in results listing
 “Set Search Alert” feature stores searches and
alerts user to page changes; requires setting up
a (free) account
 Search Analysis available
 Offers vertical searches: Deep Web content in 21
broad categories; News
Ixquick
 http://ixquick.com
 Sources: Altavista, Netscape, Gigablast, Adobe
PDF, Avaya PDF, AskJeeves, Teoma, Go,
Open Directory, Overture, Kanoodle,
LookSmart, WiseNut, FindWhat, Yahoo, MSN
 Offers full Boolean and phrase search (if
supported by the source)
 Offers the following customizations:
 Selection of sources searched
 Length of search (“time-out period”)
Ixquick
 Results combined
 Source for each result given
 Ranking data from that source given
 Duplicates noted, but not repeated
Ixquick
 Other features:
 Offers 7 field searches (when supported
by sources)
 Clusters hits from same site
 Highlights search terms in each hit
 Offers “Related Searches”
 Offers vertical searches: MP3, News,
Pictures
iBoogie
 http://iboogie.com
 Sources: Altavista, Yahoo, MSN, FAST,
FindWhat, Teoma, WiseNut, OpenFind
 Boolean and phrase search somewhat
unreliable
 Offers the following customizations:
 Selection of sources searched
 Total number of results retrieved
 Length of search (“time-out period”)
iBoogie
 Results combined
 Source for each result given
 Duplicates noted, but not repeated
 Other features:
 Adult content filter (when supported by source)
 Language limit (when supported by source)
 Clusters results by keyword and/or website
 Offers “Similar Pages” enhancement
 Offers vertical searches: Newspapers,
Bookstores, Reference, Shopping
Metacrawler
 http://metacrawler.com
 Sources: FAST, Google, About.com, AskJeeves,
FindWhat, LookSmart, Inktomi (?), Open
Directory, Overture, Search Hippo, Sprinks,
Teoma
 Offers Boolean “and”, “or” (no “not”) and
phrase search (if supported by the source)
 Offers the following customizations:
 Selection of sources searched
 Total number of results retrieved
 Length of search (“time-out period”)
Metacrawler
 Offers option of results listed by source or
combined listing
 Source for each result given
 Duplicates noted, but not repeated
 Other features:
 Offers Related Searches
 “More like this” results enhancement
 Offers a wide range of vertical searches: Images,
MP3, Shopping, Subject Directory, Multimedia,
News, Message Boards
Dogpile
 http://dogpile.com
 Sources: Google, FAST, About.com, Ah-ha,
AskJeeves, FindWhat, LookSmart, Open
Directory, Search Hippo, Sprinks, Overture,
Inktomi (?)
 Offers Boolean “and”, “or” (no “not”) and
phrase search (if supported by the source)
 Offers the following customization:
 Selection of sources searched
Dogpile
 Results listed ONLY by source
 Source for each result given
 Other features:
 Offers Related Searches
 Offers a wide range of vertical searches,
similar to Metacrawler: Images, MP3,
Shopping, Subject Directory, Multimedia,
News, Message Boards
Web Search Agents
aka desktop client search programs
 Software must be purchased
 Queries a fixed set of engines, directories,
news and other databases
 Sites that review and feature search agents
 Searchenginewatch.com
 Searchengineshowdown.com
 www.botspot.com
 www.agentland.com
Web Search Agents
typical features
 Queries are re-formulated to follow
syntax of source databases
 Duplicates removed
 Additional ranking performed
 Source given
 Optional sort orders
 Optional grouping of results into “folders”
 Many output options (html, word
processor, xml, e-mail and more)
Web Search Agents
different from other metas?
 Differences from the (good) free metas
 Many more sources queried
 Several output options
 Update option (re-running the search at
specified intervals)
 Customizable search parameters
Web Search Agents
 BullsEye Pro 3.0 $199
 BullsEye Plus $49.99
 Covers 1000+ sources
 Removes dead links
 Multiple language capability
 Government and News search groups
 Customization of sources available for an
additional fee
 All other “typical features”
 Available at intelliseek.com
Web Search Agents
 Copernic Pro 5.02 $79.95
 Copernic 2001 Plus $39.95
 Copernic Plus Basic Free
 Pro version covers 1000+ sources
 Removes dead links
 Post-search refinement and processing of
retrieved results
 Automatic document summarizations (requires
more software)
 All other “typical features”
 Available at www.copernic.com
Ultrabar:
choosing your own sources
 Free download
 Searches a small set of pre-selected
engines and allows more to be added,
including Deep Web resources
 Offers search term highlighting
 Does not re-formulate queries for each
source
 No output options
 Available at ultrabar.com
Evaluating metasearch services
 What are the sources for the results?
 Good general search engines and high-quality
directories? Shopping engines? Do any sources
share the same database?
 What search features are offered?
 Remember, these are only in effect for the sources
that support them.
 What results-based enhancements are
offered?
 Clustering? “More like this”? Highlighting of search
terms? “Related Searches”?
Evaluating metasearch services
 What factors determine the ranking of
results?
 Is there any processing of results after
retrieval from the sources?
 Is the source and/or ranking in that source
given for each hit?
 Can the user expand the number of
sources searched and/or the search
time?
Evaluating metasearch services
 Use your own test-drive questions and
compare with results from other metaengines and good single engines and
directories.
 Search for questions in specialized subject
areas you are familiar with (tests database
depth).
 Search for very recent topics (tests database
freshness)
Evaluating metasearch services
 Check its popularity through an
independent rating or popularity
monitoring service
 Media Metrix http://www.mediametrix.com/
 The oldest user-based rating service on the Web:
lists top 50 most visited sites.
 PC Data Online
http://www.pcdataonline.com/reports/
 Check for information at the site
 About, FAQ, Contact Us
A GOOD meta will . . .
 Re-format queries to be compatible with
search syntax of each source
 Enable searchers to use advanced
features (when the sources support them)
 Indicate overlapping results without
repeating them
 Perform additional processing of results,
e.g., ranking for appropriateness,
categorization, etc.
 Use only sources with unique databases
In conclusion . . .
 How do metas fit into my search strategy?
 Metas best used for simple searches, with little (or
no) syntactic complexity
 Use them to find the top few sites on a topic
 For a quick overview of a topic’s coverage on the
Web in general
 Use them “as a last resort” for highly focused
topics that elude your usual search tools
 As a possible indication of coverage of a topic
among several engines (NOTE: problematic)
 Other uses??
Thank you and
Best of Luck with
Metaengines!
Michael Hunter
Warren Hunting Smith Library
Hobart and William Smith Colleges
Geneva, NY 14507
(315) 781-3552
hunter@hws.edu