Choosing and Using the Best Metas

Michael Hunter
Reference Librarian
Hobart and William Smith Colleges

for Rochester Regional Library Council Member Libraries’ Staff
Sponsored by the Rochester Regional Library Council
Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library
2002

For Today …
- Metas: History and Functions
- Search and Retrieval Issues
- Major Players in 2003
- Clustering Technology
- More Good Metas
- Web Search Agents
- Evaluating Metasearch Services

Metasearch defined . . .
- Group of search engines, subject directories and/or databases made searchable through a common interface
- Results may or may not follow the original source’s rankings
- Today our focus is free metaengines that use subject directories (Yahoo, LII, OD) and crawler-based engines (Google, FAST, Teoma) as sources
- We will NOT examine specialized or Deep Web metas

A GOOD Meta will . . .
- Re-format queries to be compatible with the search syntax of each source
- Enable searchers to use advanced features (when the sources support them)
- Indicate overlapping results without repeating them
- Perform additional processing of results, e.g. ranking for appropriateness, categorization, etc.
- Use only sources with unique databases

The beginnings of metasearch
- A conceptual descendant of Veronica
- March 1995 – Harvest (later Savvysearch, now Search.com) developed at Colorado State by Daniel Dreilinger
- July 1995 – Metacrawler developed at U.
of Washington by Selberg and Etzioni
  - “Metacrawler Architecture for Resource Aggregation on the Web” 1996

The beginnings of metasearch
- 1996 - Dogpile
- 1998 - Ixquick
- 1999 - Kartoo
- 2000 - Ithaki
- 2001 - Vivisimo

More facts about metas
- “Flavor” determined by choice of sources
  - Comprehensive: Vivisimo, Ixquick, Metacrawler
  - General Lifestyle, popular culture: Dogpile, Profusion
  - Commercial: Search.com, Excite@home

Metas and retrieval
- Metas search quickly but not deeply
- Search time or a quantity of searches is purchased from sources (typically the top 10-50 hits from each)
- Metas are subject to time-out limits from their sources
- Each source is usually NOT searched for each query

Metas and retrieval
- “Dumbing Down the Query”
  - Advanced features are often not available; when they are, only those shared among the sources are offered
- Default setting for time-out is the shortest; set it to maximum for more comprehensive searches (when available)
- For most metas, advertising is the only source of revenue; software sales are rare

Metas and retrieval
What is their place in my search strategy?
- Metas are best used for simple searches, with little (or no) syntactic complexity
- Use them to find the top few sites on a topic
- Use them for a quick overview of a topic’s coverage on the Web in general
- Use them “as a last resort” for highly focused topics that elude your usual search tools
- Use them as a possible indication of coverage of a topic among several engines (NOTE: problematic)

Searching the metas
Results depend on
- Choice of sources
- Query processing speed OF THE SOURCE
- Length of time spent at each source

A search comparison . . .
Searched heterotropia (abnormal binocular vision) on 4/21/03

Meta          Shortest   Longest
Vivisimo      77         126
Ixquick       37 (“from at least 450 results”)
Profusion     30         39
Metacrawler   42         61
Webcrawler    31         80
Dogpile       29 (no time-out option)
Excite        41         31

Stability of Results
Searched “kids of survival” (modern art group) as a phrase at 3-minute intervals (time-outs at default setting), 4/21/03

Source       Search #1   Search #2   Search #3
Vivisimo     128         137         132
Ixquick      59          61          61
Profusion    27          27          27
iBoogie      128         185         171
Metacrw.**   138         133         137
Webcrw.      45          55          49

Metas and ranking options
- Listing by SOURCE
  - Usually retains ranking of source
- COMBINED listing options
  - Indicate source of each result
  - Indicate duplicates without repeating them
  - Indicate position in original source’s ranking
  - “Most duplicated hits” listed first
  - Disclose paid listings (if disclosed by source)

Vivisimo
http://vivisimo.com
- Sources: Altavista, Yahoo, MSN, Netscape, Lycos, LookSmart, Gigablast, Vizzavi, BBC, Librarian’s Index to the Internet, plus 11 specialized news sources and 7 specialized business, medical and governmental sources
- Offers full Boolean and phrase search (if supported by the source)

Vivisimo
- Offers the following customizations:
  - Selection of sources searched
  - Total number of results retrieved
  - Length of search (“time-out period”)
- Results combined
  - Source for each result given
  - Ranking data from that source given
  - Duplicates noted, but not repeated

Vivisimo
Other features:
- Results are clustered by keyword prevalence or website of origin
- Offers a preview of each result in a separate window
- Offers vertical searches: Top News, Business News, Tech News, Sports News

Clustering results (“folders”)
- Automated “subject analysis”
- Facilitates navigation and query refinement
- Can be hierarchical (folders within folders)
- One document may appear in several folders
- Northern Light was the first public search engine to make use of folders

Clustering technology in a metasearch environment
- Real-time processing of results retrieved from sources
- Variety of data can be returned from each source
  - URL
  - Title
  - First few sentences
  - Human-created summary
- Folder creation varies according to data from sources and processing time available at the moment of the query

Clustering – Step 1
- Significant terms are identified from all results based on
  - Frequency of term(s)
  - Position of term(s)
- Normalization algorithms applied
  - Documents analyzed for word variants (stemming)
  - Norms set (“authority control”)
    - “game downloads”
    - “download games”
    - “downloading games”
- Folder “labels” created

Clustering – Step 2
- Each result from the sources is matched against the set of folder labels and assigned to one or more folders
  - By linguistic analysis (term position, predictive descriptive importance)
  - By statistical analysis (term frequency)
  - Final, proprietary analysis combines these (and more)
- Remember: The full documents are not available to a meta for this type of processing

Profusion
http://profusion.com
- Sources: Altavista, Yahoo, MSN, About.com, Adobe PDF, AOL, LookSmart, Lycos, Netscape, Raging Search, Teoma, WiseNut
- Offers full Boolean and phrase search (if supported by the source)

Profusion
- Offers the following customizations:
  - Selection of sources searched
  - Total number of results retrieved
  - Length of search (“time-out period”)
- Offers option of results listed by source or combined listing
  - Source for each result given
  - Ranking data from that source given
  - Duplicates noted, but not repeated

Profusion
Other features:
- Results can be sorted by relevance score, title or URL
- “Similar Result” enhancement
- Profusion Relevance Score shown
- Search terms highlighted in results listing
- “Set Search Alert” feature stores searches and alerts user to page changes; requires setting up a (free) account
- Search Analysis available
- Offers vertical searches: Deep Web content in 21 broad categories; News

Ixquick
http://ixquick.com
- Sources: Altavista, Netscape, Gigablast, Adobe PDF, Avaya PDF, AskJeeves, Teoma, Go, Open
Directory, Overture, Kanoodle, LookSmart, WiseNut, FindWhat, Yahoo, MSN
- Offers full Boolean and phrase search (if supported by the source)
- Offers the following customizations:
  - Selection of sources searched
  - Length of search (“time-out period”)

Ixquick
- Results combined
  - Source for each result given
  - Ranking data from that source given
  - Duplicates noted, but not repeated

Ixquick
Other features:
- Offers 7 field searches (when supported by sources)
- Clusters hits from same site
- Highlights search terms in each hit
- Offers “Related Searches”
- Offers vertical searches: MP3, News, Pictures

iBoogie
http://iboogie.com
- Sources: Altavista, Yahoo, MSN, FAST, FindWhat, Teoma, WiseNut, OpenFind
- Boolean and phrase search somewhat unreliable
- Offers the following customizations:
  - Selection of sources searched
  - Total number of results retrieved
  - Length of search (“time-out period”)

iBoogie
- Results combined
  - Source for each result given
  - Duplicates noted, but not repeated
- Other features:
  - Adult content filter (when supported by source)
  - Language limit (when supported by source)
  - Clusters results by keyword and/or website
  - Offers “Similar Pages” enhancement
  - Offers vertical searches: Newspapers, Bookstores, Reference, Shopping

Metacrawler
http://metacrawler.com
- Sources: FAST, Google, About.com, AskJeeves, FindWhat, LookSmart, Inktomi (?), Open Directory, Overture, Search Hippo, Sprinks, Teoma
- Offers Boolean “and”, “or” (no “not”) and phrase search (if supported by the source)
- Offers the following customizations:
  - Selection of sources searched
  - Total number of results retrieved
  - Length of search (“time-out period”)

Metacrawler
- Offers option of results listed by source or combined listing
  - Source for each result given
  - Duplicates noted, but not repeated
- Other features:
  - Offers Related Searches
  - “More like this” results enhancement
  - Offers a wide range of vertical searches: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards

Dogpile
http://dogpile.com
- Sources: Google, FAST, About.com, Ah-ha,
AskJeeves, FindWhat, LookSmart, Open Directory, Search Hippo, Sprinks, Overture, Inktomi (?)
- Offers Boolean “and”, “or” (no “not”) and phrase search (if supported by the source)
- Offers the following customization:
  - Selection of sources searched

Dogpile
- Results listed ONLY by source
  - Source for each result given
- Other features:
  - Offers Related Searches
  - Offers a wide range of vertical searches, similar to Metacrawler: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards

Web Search Agents
aka desktop client search programs
- Software must be purchased
- Queries a fixed set of engines, directories, news and other databases
- Sites that review and feature search agents
  - Searchenginewatch.com
  - Searchengineshowdown.com
  - www.botspot.com
  - www.agentland.com

Web Search Agents
typical features
- Queries are re-formulated to follow syntax of source databases
- Duplicates removed
- Additional ranking performed
- Source given
- Optional sort orders
- Optional grouping of results into “folders”
- Many output options (html, word processor, xml, e-mail and more)

Web Search Agents
different from other metas?
Differences from the (good) free metas
- Many more sources queried
- Several output options
- Update option (re-running the search at specified intervals)
- Customizable search parameters

Web Search Agents
- BullsEye Pro 3.0: $199
- BullsEye Plus: $49.99
- Covers 1000+ sources
- Removes dead links
- Multiple language capability
- Government and News search groups
- Customization of sources available for an additional fee
- All other “typical features”
- Available at intelliseek.com

Web Search Agents
- Copernic Pro 5.02: $79.95
- Copernic 2001 Plus: $39.95
- Copernic Plus Basic: Free
- Pro version covers 1000+ sources
- Removes dead links
- Post-search refinement and processing of retrieved results
- Automatic document summarizations (requires more software)
- All other “typical features”
- Available at www.copernic.com

Ultrabar: choosing your own sources
- Free download
- Searches a small set of pre-selected engines and allows more to be added, including Deep Web resources
- Offers search term highlighting
- Does not re-formulate queries for each source
- No output options
- Available at ultrabar.com

Evaluating metasearch services
- What are the sources for the results?
  - Good general search engines and high-quality directories?
  - Shopping engines?
  - Do any sources share the same database?
- What search features are offered?
  - Remember, these are only in effect for the sources that support them.
- What results-based enhancements are offered?
  - Clustering? “More like this”? Highlighting of search terms? “Related Searches”?

Evaluating metasearch services
- What factors determine the ranking of results?
- Is there any processing of results after retrieval from the sources?
- Is the source and/or ranking in that source given for each hit?
- Can the user expand the number of sources searched and/or the search time?

Evaluating metasearch services
- Use your own test-drive questions and compare with results from other metaengines and good single engines and directories.
- Search for questions in specialized subject areas you are familiar with (tests database depth).
- Search for very recent topics (tests database freshness).

Evaluating metasearch services
- Check its popularity through an independent rating or popularity monitoring service
  - Media Metrix http://www.mediametrix.com/
    The oldest user-based rating service on the Web: lists top 50 most visited sites.
  - PC Data Online http://www.pcdataonline.com/reports/
- Check for information at the site: About, FAQ, Contact Us

A GOOD meta will . . .
- Re-format queries to be compatible with the search syntax of each source
- Enable searchers to use advanced features (when the sources support them)
- Indicate overlapping results without repeating them
- Perform additional processing of results, e.g. ranking for appropriateness, categorization, etc.
- Use only sources with unique databases

In conclusion . . .
How do metas fit into my search strategy?
- Metas are best used for simple searches, with little (or no) syntactic complexity
- Use them to find the top few sites on a topic
- Use them for a quick overview of a topic’s coverage on the Web in general
- Use them “as a last resort” for highly focused topics that elude your usual search tools
- Use them as a possible indication of coverage of a topic among several engines (NOTE: problematic)
- Other uses??

Thank you and Best of Luck with Metaengines!

Michael Hunter
Warren Hunting Smith Library
Hobart and William Smith Colleges
Geneva, NY 14507
(315) 781-3552
hunter@hws.edu
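
Supplementary sketch: combined listings

The slides ask a good meta to indicate overlapping results without repeating them, to give the source and original rank of each hit, and (as one combined-listing option) to list the "most duplicated hits" first. The Python sketch below is one possible way to illustrate that idea, not a description of how any of the services named here work. The function names (normalize_url, merge_results), the source names, the URL normalization rules, and the tie-break on average rank are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparison key (illustrative rules only).

    Assumption: hits that differ only in scheme, host capitalisation,
    a leading "www." or a trailing slash point to the same page.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(results_by_source):
    """Combine per-source ranked URL lists into one listing.

    results_by_source maps a source name to its URLs in rank order.
    Each merged entry keeps the first URL seen plus a dict of
    {source: rank in that source}, so duplicates are noted, not repeated.
    """
    merged = {}
    for source, urls in results_by_source.items():
        for rank, url in enumerate(urls, start=1):
            entry = merged.setdefault(
                normalize_url(url), {"url": url, "sources": {}}
            )
            # Keep the best (first) rank if a source lists a page twice.
            entry["sources"].setdefault(source, rank)

    def sort_key(entry):
        ranks = list(entry["sources"].values())
        # Most duplicated first; ties broken by average source rank
        # (an assumption), then by URL for a stable order.
        return (-len(ranks), sum(ranks) / len(ranks), entry["url"])

    return sorted(merged.values(), key=sort_key)


if __name__ == "__main__":
    # Hypothetical sources and URLs, for illustration only.
    demo = {
        "EngineA": ["http://www.example.org/a/", "http://example.org/b"],
        "EngineB": ["http://example.org/c", "http://Example.org/a"],
        "DirectoryC": ["http://example.org/a", "http://example.org/b"],
    }
    for hit in merge_results(demo):
        print(hit["url"], hit["sources"])
```

A listing by source would simply skip the merge and print each source's list in its own order; the combined listing trades that transparency for a single ranked view, which is why showing the per-source ranks alongside each hit matters.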