Web sources and library & information services
Finding, evaluating and using a variety of Web sources for searching and reference
© Tefko Saracevic, Rutgers University

Similarities between Web searching & IR & reference
• Basic principles of the approach are the same
  – human-human interaction - interview
    • social, organizational, cognitive, affective aspects to explore, including task, need …
  – preparation of search concepts, terms, logic
  – determination of range, restrictions
  – estimation of relevance

Differences
• Vastly different sources
  – as to contents, authority, reliability, persistence
  – variation in amounts, depth, breadth
• Very different organization
  – little standardization, few if any fields
• Quite different search engines & capabilities - basic & advanced
  – also different from engine to engine
• Differing search strategies needed

Also: invisible Web
• Materials that general search engines cannot or WILL not include in their collections of Web pages (indexes)
• You cannot find them through general search engines
• Contains a vast amount of information
  – much of it authoritative and of high quality

Why do search engines miss material?
• Size: the Web is huge, engines cannot cover it all
• Economics: associated costs are high
  – also pay per crawl & rank
• Technical: still limited capabilities
• Spam: eliminating the bad also loses some of the good
• Restrictions: some sites do not let crawlers in
• Deep structure: some sites are complex

Needed for Web searching
• Knowledge & competencies
  – variety of Web sources
  – their organization
  – search engines
  – Web search strategies
  – search dynamics, feedback
• Keeping up & up & up
  – constant updates, changes, innovations
  – many domain/subject specific

Web size - who knows?
• Estimated over 16 million web servers (Lawrence & Giles, 1999)
  – but only a fraction of direct search relevance
• Domains of sites
  – 83% commercial; 6% scientific or educational; 3% health
  – 2.5% personal; 2% societies; 1.5% government
  – about 1% each community, religion
  – 1.5% pornographic
• Web Characterization Project - OCLC
  – statistics, trends, reports, links …; for 2001, reports 8.5 million web sites
  – http://wcp.oclc.org/

Organization of sources
• No standardization across sources
• Major approaches in search engines
  – classification: many directory types used
  – statistical analyses of terms, links
• Metatags in sources
  – to enable retrieval by fields
  – HTML “keywords”, “description”
    • 34% of sites use them
  – Dublin Core - 0.3% of sites use it
• Organization: a hindrance to retrieval
  – also faked contents to force retrieval

Sources & search engines
• Indexed by search engines (publicly indexed)
  – by terms, selection, links, registration
• Not publicly indexed
  – many domain sources will not be found, e.g., digital libraries, online journals, reference
  – many commercial sites will hardly be found
• Differing approaches to inclusion/selection
  – mostly automatic; also generic source providers
  – increasingly, human evaluation & selection added

Search engine coverage
• No engine covers more than 16% of the WWW
• Coverage relative to the combined coverage of the top 11 engines:
  – Northern Light 38.3%; Snap 37.1%; AltaVista 37.1%; HotBot 27.1%; MS 20.3%; Infoseek 19.2%; Google 18.6%; Yahoo 17.6%; Excite 13.5%; Lycos 5.9%; EuroSeek 5.2%
  – HotBot, MS, Snap & Yahoo use Inktomi as search provider, but with different filtering & different Inktomi databases
• Northern Light has a ‘special collection’ - documents not part of the publicly indexable web
• Hard to discern & compare coverage
• Many national search engines - own coverage

Search features among engines
• Some search features are
  the same across all engines, but details differ - particularly in advanced search
  – Boolean available
    • but sometimes AND, sometimes OR is the default
  – Differences may be found in:
    • phrases, proximity, truncation, case sensitivity, relevance feedback, field searching, special features
    • term expansion to concepts (latent semantic indexing)

Search strategies & outputs
• Geared toward very short searches
  – big majority of searches 2-3 terms (av. 2.5)
    • in IR av. 7-14 - making a big difference
• Directory browsing a big component - not in IR
• Geared toward limited top outputs
• Ranking output by relevance predominates
  – relevance calculations differ & are proprietary (secret)
  – except Google - they published their method
  – affects search strategy - you have to guess how it is done

Meta search engines
• Search engines that cover search engines
  – many around, e.g.:
  – All4one http://all4one.com/
    • four windows - good for comparison
  – CNET Search.com http://www.search.com/
    • meta engine of meta engines - customization
  – Search Engines Worldwide http://www.twics.com/~takakuwa/search/search.html
    • 174 countries, over 1300 engines
• More on the horizon & differing

Specialized meta engines
• Selective, with directories & a large number of databases & search engines
  – Complete Planet http://completeplanet.com
  – Invisible Web http://invisibleweb.com
• U.S. federal information via Government Printing Office Access http://www.gpo.gov/gpoaccess
  – Federal Bulletin Board (file libraries for download from many agencies): http://fedbbs.access.gpo.gov

Reference (expert) services
• Reference services - several models
  – Q&A, directories, email answers etc.
  – e.g.:
  – Martindale’s Reference Desk - comprehensive http://www-sci.lib.uci.edu/~martindale/Ref.html
  – Ask Jeeves!
    - most popular http://www.ask.com/
  – Ask ERIC - education questions - email answers http://www.askeric.org/Qa/
  – Information Please - almanac type questions http://www.infoplease.com/
• Academic libraries developing reference models - new service area

Libraries as Web sources
• Academic libraries providing open collections & services; models vary
  – Rutgers libraries - big long term effort http://www.libraries.rutgers.edu/
  – various sources & links involved
    • for domain information & sources go to:
      – Electronic Reference Sources; Subject Research Guides: Social Sciences & Law; Library & Information Science
  – University of California, Berkeley - a most elaborate effort, together with Sun Microsystems http://sunsite.berkeley.edu/

Virtual libraries on the Web
• Libraries emerging only on the Web
  – more & more libraries & organizations involved
• Examples of academic & public libraries
  – Virtual Library - Switzerland, US, UK & other countries - ‘oldest virtual library on the Web’
    • http://vlib.org
  – Toronto Public Library
    • http://vrl.tpl.toronto.on.ca/
  – Internet Public Library, Michigan
    • http://www.ipl.org/

Domain sites
• Many domain/issue specific sites
  – rich & often unique coverage & services
  – different approaches & requirements
• Examples in health related domains:
  – Medscape - registration required http://www.medscape.com/
  – Rxlist - The Internet Drug Index http://www.rxlist.com/
  – Mayo Clinic HealthOasis http://www.mayohealth.org/

Societies, organizations, publishers
• A great many rich sources for searching
  – differences in requirements, depth, richness
• Examples from a variety of organizations:
  – Assoc. for Computing Machinery http://www.acm.org/
    • Digital Library; subscription or registration
  – State Department http://www.state.gov/
    • about the U.S. & other countries
  – R.R.
    Bowker http://www.bowker.com/
    • Free Resources from Bowker; Library Resource Guide
  – Genealogy: http://www.familysearch.org/

Language barriers on the Web
• English still the major language
  – but declining, now slightly over 50%
• Multilingual retrieval search engines
  – Euroseek - searches 40 languages http://www.euroseek.com/
  – All the Web - 45 languages http://www.alltheweb.com/
  – in both, a search in a given language covers primarily sources in that language

Language barriers: translations
• A number of translation sites - machine aided
  – i.e., plug in terms, phrases, sentences in one language & review in the other, but effectiveness???
  – Free Translations http://www.freetranslations.com
  – Babel Fish http://babelfish.altavista.com/tr
  – Travlang - great for travelers - phrases http://www.travlang.com

Key professional competencies
• Knowledge of SOURCES in area of interest
  – search engines not enough
  – not too helpful in finding these other sources; structure hard to discern
• Evaluation of sources - a key professional skill!
  – standard criteria: quality, veracity, coverage etc.
  – plus Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability
  – http://www.otterbein.edu/learning/libpages/subeval.htm

Competencies …
• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update
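
Supplementary sketch: AND vs. OR as the default operator
The "Search features among engines" slide notes that Boolean searching is available everywhere, but that the default is sometimes AND and sometimes OR. The short Python sketch below illustrates why this matters for a typical 2-3 term Web query. It is only a toy model under stated assumptions: the three documents, the `build_index` and `search` helpers, and the `default_operator` parameter are all hypothetical and do not describe how any actual engine works.

```python
# Toy collection: document id -> text (made-up data, for illustration only).
DOCS = {
    1: "digital library reference services",
    2: "library catalog search",
    3: "digital music store",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, default_operator="AND"):
    """Combine the query terms with the (hypothetical) engine default."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if default_operator == "AND":
        # Every term must be present: fewer, more focused results.
        return set.intersection(*postings)
    if default_operator == "OR":
        # Any term may be present: more, less focused results.
        return set.union(*postings)
    raise ValueError("default_operator must be 'AND' or 'OR'")


INDEX = build_index(DOCS)
and_hits = search(INDEX, "digital library", "AND")
or_hits = search(INDEX, "digital library", "OR")
print("AND default:", sorted(and_hits))
print("OR default: ", sorted(or_hits))
```

The same two-term query retrieves one document under an AND default and all three under an OR default, which is one reason a searcher needs to know each engine's conventions before judging its output.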