Information jungle on the Web: finding and evaluating information sources

Tefko Saracevic, PhD
Rutgers University
tefko@scils.rutgers.edu
http://www.scils.rutgers.edu/people/faculty/tefko.html

Web & information: key problems
  SEARCHING the Web for information
  Retrieving a MANAGEABLE AMOUNT
  Selecting the most RELEVANT sources
  EVALUATING sources & information
  Three laws for information on the Web:
    1. EVALUATE
    2. EVALUATE
    3. EVALUATE

Characteristics of information on the Web
  VARIETY - amazingly rich source on myriad topics & subjects
  DISTRIBUTION - all over, global
    information scattered across a great many sites
  LINKAGE - many hyperlinks, hypertexts
    elaborate web of connections, paths, and mazes
  AMOUNT - huge, growing exponentially
    millions of sites, billions of pages

Characteristics … (cont.)
  CONTENT VALUE NEUTRAL - anything goes
    no control of content
    some accurate, trustworthy, verifiable
    some biased, self-serving, propaganda, promotional
    some false accidentally
    some false deliberately, some even with evil intent
  Thus, the three Web laws

Size of the Web
  Over 16 million web servers; 800 million pages
  83% commercial; 6% scientific or educational; 3% health; 2.5% personal; 2% societies; 1.5% government; about 1% each community, religion; 1.5% pornographic
  Growth 97-99: public sites +179%
  Countries of origin: U.S.
  55% (59% in 1997), Germany 6%, Canada 5%, UK 5%, Japan 3%, Australia, Brazil, France, Italy 2% each, all others 18%
  Languages: 80% English (84% in 1997)
  US sites & English language predominate, but % falling steadily
  Sources:
    Lawrence & Giles, Nature (1999): http://www.wwwmetrics.com/
    OCLC Web Characterization Project http://oclc.org/oclc/research/projects/webstats/index.htm

Organization of Web sites
  Metatags - to enable retrieval by fields - low use
    HTML “keywords”, “description” - 34% of sites use them
    Dublin Core - 0.3% of sites use it
  No standardization across sources
  Classification a predominant approach
    many types used
  Lack of organization a major hindrance to retrieval
    also faked contents to force retrieval

Comparison: Web & library or inf. retrieval searching
  SIMILARITIES in searching
  Basic principles of approach are the same
    human-human interaction - mediated or introspection to determine content, explore information need for a task
    preparation of search concepts, terms, logic
    determination of range, restrictions
    estimation of relevance

Differences
  Vastly different sources
    as to contents, authority, reliability, persistence
    variation in amounts, depth, breadth
  Very different organization
    little standardization, few if any fields
  Quite different search engines
  Differing search strategies needed
  Presence of many links; complex connections
  Evaluation more complex

Needed for Web searching
  Knowledge & competencies about
    great variety of sources
    great variety in their organization
    search engines
    search strategies; search dynamics
    exploring & exploiting links & networks
    keeping up: constant changes, innovations
    Web economics - no such thing as free lunch
  Effectiveness proportional to that knowledge

Criteria for evaluation
  http://www.otterbein.edu/learning/libpages/subeval.htm
  Authority
    Author
    - possible bias?
    Publisher - reputation? Professional society? Academic source?
    Reason on the Web? Vanity pages? Sponsor? Advocacy association?
    Domain name - who put up the site?
  Accuracy - possible independent verification? Sources?
  Currency - verification
  Prior review, experiences - checking review sources
  Critical thinking & constant verification

Ways to search & retrieve
  Most popular: search engines
    global, regional, country, specialized engines
  Following links from major sites & portals
    e.g. from Library of Congress to many libraries
    from newspapers to archives
  Reference sites - growing numbers
  Library sites - becoming ever richer sources
  Web addresses in print sources, newspapers
  Referrals, emails, bookmarks

Web sites & search engines
  Indexed by search engines (public sites)
    by keywords, classification, links, registration
  Hard to find
    most domain sources will not be found
    e.g. digital libraries, online journals, reference sources
    many commercial sites
  Differing approaches to inclusion/selection
    mostly automatic; also generic source providers
    increasingly added human evaluation

Search engine coverage
  No US engine covers more than 16% of the Web
  Very hard to discern coverage
  In respect to combined coverage of 11 top engines:
    Northern Light 38.3%; Snap 37.1%; AltaVista 37.1%; HotBot 27.1%; MS 20.3%; Infoseek 19.2%; Google 18.6%; Yahoo 17.6%; Excite 13.5%; Lycos 5.9%; EuroSeek 5.2%
  HotBot, MS, Snap & Yahoo use Inktomi as search provider, but have different filtering & Inktomi databases
  Large European engines geared to country coverage
    E.g.
    Wanadoo (France), T-online (Germany)
    highest use among engines in their countries

Unique search engines
  Number of specialized engines - looking for niche
    good for scientific, technical, professional searches
    include manual evaluation & selection of sources
  Northern Light has ‘special collections’ not found on publicly indexable Web
    http://www.northernlight.com
  Oingo has word associations, evaluations
    includes elaborate classification
    http://www.oingo.com/

Search features among engines
  Some search features the same across all
    but details differ - particularly in advanced
  Boolean available
    but sometimes AND, sometimes OR is the default
  Differences may be found in:
    phrases, proximity, truncation, case sensitivity, relevance feedback, field searching, special features
  Some have term expansion to concepts & lists of associated terms (e.g. latent semantic indexing)

Search strategies & outputs
  Geared toward very short searches
    big majority of searches 2-3 terms (av. 2.5)
    big majority of users view one page only
  Geared toward limited top outputs
  Ranking output by relevance predominates
    relevance calculations differ & are secret
  Also heavy & increasing use of classification
  Browsing a big component

Meta search engines
  Search engines that cover search engines
  e.g. All4one http://all4one.com/
    four windows - good for comparison
  Savvy Search http://www.savvysearch.com/
    indicates search engine source
  More on the horizon & differing
  Search Engine Watch http://www.searchenginewatch.com/
    listing, reviews, ratings, tests, resources, tutorials

Reference sites - facts
  Reference services & access changing drastically
  Several models in reference services:
    Martindale’s Reference Desk - comprehensive
      http://www-sci.lib.uci.edu/~martindale/Ref.html
    Ask Jeeves!
    - natural language
      http://www.ask.com/
      over 2 million queries per day; growing 46% per quarter
    Electric Library - membership
      http://www.elibrary.com/
    Review of several reference sites
      http://www.libraryjournal.com/articles/multimedia/webwatch/19991101_12593.asp

Reference ... Sources … continued
  Information Please - almanacs http://www.infoplease.com/
  Reference Desk - rich http://www.refdesk.com/
  Encyclopedia Britannica http://www.britannica.com/
    great many cross-references & other sources
  Webhelp - “real people, real answers, real time”
    live conversation with one of the 1000+ “Web wizards”
    www.webhelp.com

Libraries as Web sources
  Libraries providing open collections & services
    growth of digital libraries & Web access
    models vary; parts open to all, parts only to own users
  One example, among a great many:
    Rutgers libraries - large & long term effort
    http://www.libraries.rutgers.edu/
    various sources & links involved
    e.g. for domain information & sources go to: Electronic Ready Reference Shelf; Research Guides; Social Sciences & Law; Library & Information Science

Virtual libraries on the Web
  Libraries emerging only on the Web
  More & more libraries & organizations involved
  Examples of libraries rich in sources & links
    Virtual Library - Switzerland, US, UK & other countries, started by Tim Berners-Lee, the creator of the Web http://vlib.org
    Toronto Public Library http://vrl.tpl.toronto.on.ca/
    Internet Public Library, Michigan http://www.ipl.org/
    Academic Info - “Gateway to Quality Educational Resources.” International http://academicinfo.net/

New modes of access
  Libraries, agencies, companies developing reference & service models - new, rich, innovative
  e.g. For & about children
    Los Angeles Public Library - great fun!
    http://www.lapl.org/kidsweb/
  Parenting: Parenttime http://www.parenttime.com/home/homepage.cgi
  Fathom - consortium of six leading institutions in US & UK
    beta testing - top quality research coverage
    http://www.fathom.com/
  Course on Internet use with links http://www.newbie.org/

Domain sites
  Many domain/issue specific sites
    rich & often unique coverage & services
    different approaches & requirements
  Examples in health related domains:
    Medscape - registration required http://www.medscape.com/
    Rxlist - The Internet Drug Index http://www.rxlist.com/
    Mayo Clinic HealthOasis http://www.mayohealth.org/

Societies, organizations, publishers
  Great many rich sources for searching
    differences in requirements, depth, richness
  Examples from variety of organizations:
    Assoc. for Computing Machinery http://www.acm.org/
      Digital Library; subscription or registration, searchable
    State Department http://www.state.gov/
      about the U.S. & other countries
    R.R. Bowker http://www.bowker.com/
      free sections - Yours for the Asking; Library Resource Guide
    Genealogy: http://www.familysearch.org/

Newspapers
  Various online newspaper models are explored
    beyond having just a print copy on the Web
    subscription; links; archives; more elaborate stories …
  e.g. San Francisco Examiner - http://examiner.com/
    articles, in depth projects, area guide (SF Gate), archive ...
  Finding stories & papers:
    Excite News Tracker http://nt.excite.com/
    Includes: World Newspapers Resources
    Index of some major world newspapers (from New Zealand)
      http://www.ccc.govt.nz/Library/Resources/Newspapers/index.asp

Summary
  Web is:
    rapidly evolving, changing, expanding
    unpredictable, rich, and valuable source
  Knowledge & competencies needed to use it effectively, also common sense & flexibility
  Three Web laws always in effect!
  Web economics rewards big, but costs significant

But … limitations
  The public Web does not have it all
  Many rich resources not accessible without paying
    DIALOG covers many fields & is larger than the Web
    similarly Lexis-Nexis, Data Star etc.
  Majority of content in libraries is NOT on Web
  Majority of archives, old newspapers NOT on Web
  WEB IS RICH, BUT NOT THE BEGINNING & END OF INFORMATION SOURCES
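
Illustration for the "Search features among engines" slide: a minimal Python sketch, not any engine's actual algorithm, of why the default Boolean operator matters. The function `search`, its `default_op` parameter, and the three-page collection `PAGES` are hypothetical names invented for this example. The same two-term query is run once with AND as the default and once with OR.

```python
def search(docs, terms, default_op="AND"):
    """Return the indices of documents that match the query terms.

    default_op is the operator an engine silently places between terms:
    "AND" requires every term to appear, "OR" requires at least one.
    """
    wanted = {t.lower() for t in terms}
    combine = all if default_op == "AND" else any
    hits = []
    for i, text in enumerate(docs):
        words = set(text.lower().split())
        if combine(t in words for t in wanted):
            hits.append(i)
    return hits


# Hypothetical three-page collection, for illustration only.
PAGES = [
    "digital library collections",
    "public library hours",
    "digital camera reviews",
]

and_hits = search(PAGES, ["digital", "library"], "AND")
or_hits = search(PAGES, ["digital", "library"], "OR")
print(and_hits)  # [0]
print(or_hits)   # [0, 1, 2]
```

With an AND default only the page containing both terms is returned; with an OR default every page is returned, which is one reason identical queries can give very different result counts on different engines.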