The Invisible Web
Tefko Saracevic, PhD
Rutgers University
http://www.scils.rutgers.edu/~tefko (also contains a list of sites relevant to the topic and this presentation)
© Tefko Saracevic, Rutgers University

What is the invisible Web?
• Materials that general search engines cannot or WILL not include in their collections of Web pages (indexes)
• Materials you cannot find through general search engines
• Contains a vast amount of information
– much of it authoritative and of high quality

Why do search engines miss material?
• Size: the Web is huge; engines cannot cover it all
• Economics: associated costs are high
– also pay per crawl & rank
• Technical: capabilities are still limited
• Spam: eliminating the bad also loses the good
• Restrictions: some sites do not let crawlers in
• Deep structure: some sites are complex

Web size - who knows?
• Estimated at over 16 million web servers (Lawrence & Giles, 1999)
– but only a fraction are of direct search relevance
• Domains of sites
– 83% commercial; 6% scientific or educational; 3% health
– 2.5% personal; 2% societies; 1.5% government
– about 1% each: community, religion
– 1.5% pornographic
• Web Characterization Project - OCLC
– statistics, trends, report, links …
– for 2001, reports 8.5 million web sites
– http://wcp.oclc.org/

Organization of sources
• No standardization across sources
• Major approaches in search engines
– classification: many directory types used
– statistical analyses of terms, links
• Metatags in sources
– to enable retrieval by fields
– HTML “keywords”, “description”: 34% of sites use them
– Dublin Core: 0.3% of sites use it
• Organization: a hindrance to retrieval
– also faked contents to force retrieval

Sources & search engines
• Indexed by search engines (publicly indexed)
– by terms, selection, links, registration
• Not publicly indexed
– many domain sources will not be found, e.g. digital libraries, online journals, reference
– many commercial sites will hardly be found
• Differing approaches to inclusion/selection
– mostly automatic; also generic source providers
– increasingly added human evaluation & selection

Search engine coverage
• No engine covers more than 16% of the WWW
• Share of the combined coverage of the 11 top engines:
– Northern Light 38.3%; Snap 37.1%; AltaVista 37.1%; HotBot 27.1%; MS 20.3%; Infoseek 19.2%; Google 18.6%; Yahoo 17.6%; Excite 13.5%; Lycos 5.9%; EuroSeek 5.2%
– HotBot, MS, Snap & Yahoo use Inktomi as search provider, but have different filtering & Inktomi databases
• Northern Light has a ‘special collection’ - documents not part of the publicly indexable web
• Hard to discern & compare coverage
• Many national search engines - own coverage

Meta search engines
• Search engines that cover search engines
– many around, e.g.
– All4one http://all4one.com/ (four windows - good for comparison)
– CNET Search.com http://www.search.com/ (meta engine of meta engines - customization)
• Search Engines Worldwide http://www.twics.com/~takakuwa/search/search.html
– 174 countries, over 1300 engines
• More on the horizon & differing

Major source for the invisible Web
• Book: Chris Sherman & Gary Price (2001). The Invisible Web: Uncovering information sources search engines can’t see. Information Today
• Site: www.invisible-web.net

Specialized meta engines
• Selective, with directories & a large number of databases & search engines
– Complete Planet http://completeplanet.com
– Invisible Web http://invisibleweb.com
• U.S. federal information via Government Printing Office Access http://www.gpo.gov/gpoaccess
• Federal Bulletin Board (file libraries for download from many agencies): http://fedbbs.access.gpo.gov

Reference (expert) services
• Reference services - several models
– Q&A, directories, email answers etc.
– e.g.
– Martindale’s Reference Desk - comprehensive http://www-sci.lib.uci.edu/~martindale/Ref.html
– Ask Jeeves! - most popular http://www.ask.com/
– Ask ERIC - education questions, email answers http://www.askeric.org/Qa/
– Information Please - almanac type questions http://www.infoplease.com/
• Academic libraries are developing reference models - a new service area

Libraries as Web sources
• Academic libraries providing open collections & services; models vary
– Rutgers libraries - big long-term effort http://www.libraries.rutgers.edu/
– various sources & links involved
– for domain information & sources go to: Electronic Reference Sources; Subject Research Guides: Social Sciences & Law; Library & Information Science
– University of California, Berkeley - a most elaborate effort, together with Sun Microsystems http://sunsite.berkeley.edu/

Virtual libraries on the Web
• Libraries emerging only on the Web
– more & more libraries & organizations involved
• Examples of academic & public libraries
– Virtual Library - Switzerland, US, UK & other countries; ‘oldest virtual library on the Web’ http://vlib.org
– Toronto Public Library
– Internet Public Library, Michigan http://www.ipl.org/

Domain sites
• Many domain/issue specific sites
– rich & often unique coverage & services
– different approaches & requirements
• Examples in health related domains:
– Medscape - registration required http://www.medscape.com/
– Rxlist - The Internet Drug Index http://www.rxlist.com/
– Mayo Clinic HealthOasis http://www.mayohealth.org/

Societies, organizations, publishers
• A great many rich sources for searching
– differences in requirements, depth, richness
• Examples from a variety of organizations:
– Assoc.
for Computing Machinery http://www.acm.org/ (Digital Library; subscription or registration)
– State Department http://www.state.gov/ (about the U.S. & other countries)
– R.R. Bowker http://www.bowker.com/ (Free Resources from Bowker; Library Resource Guide)
– Genealogy: http://www.familysearch.org/

Language barriers on the Web
• English is still the major language
– but declining, now slightly over 50%
• Multilingual retrieval search engines
– Euroseek - searches 40 languages http://www.euroseek.com/
– All the Web - 45 languages http://www.alltheweb.com/
– in both, a search in a given language covers primarily sources in that language

Language barriers: translations
• A number of translation sites - machine aided
– i.e. plug in terms, phrases, sentences in one language & review them in the other, but effectiveness???
– Free Translations http://www.freetranslations.com
– Babel Fish http://babelfish.altavista.com/tr
– Travlang - great for travelers; phrases http://www.travlang.com

News sources about the Web, visible & invisible
– The Virtual Acquisition Shelf & News Desk http://resourceshelf.blogspot.com/
– Free Pint http://www.freepint.com/
– ResearchBuzz http://www.researchbuzz.com/index.shtml
– Internet Resources Newsletter http://www.hw.ac.uk/libwww/irn/
– Search Engine Watch http://www.searchenginewatch.com/

Sample of great sources for the invisible Web
– Direct Search http://gwis2.circ.gwu.edu/~gprice/direct.htm
– eLibrary http://ask.elibrary.com/
– The Scout Report http://scout.cs.wisc.edu/
– Museum of online museums http://www.coudal.com/archives/museum.html
– Librarians’ Index to the Internet http://www.lii.org/
– Profusion http://www.profusion.com/
– Research Index http://www.researchindex.com/
– Cybercafe Search Engine
http://www.cybercaptive.com

Needed for Web searching in general
• Knowledge & competencies
– variety of Web sources
– their organization
– search engines
– Web search strategies
– search dynamics, feedback
• Keeping up & up & up
– constant updates, changes, innovations
– many domain/subject specific

Needed for Web searching by professionals
• Knowledge of SOURCES in area of interest
– search engines are not enough
– they are not too helpful in finding these other sources; structure is hard to discern
• Evaluation of sources - a key professional skill!
– standard criteria: quality, veracity, coverage etc.
– plus Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability

Competencies …
• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update
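
Illustration: metatags as retrieval fields

The slide on organization of sources notes that HTML “keywords” and “description” metatags are meant to enable retrieval by fields. As a rough illustration of what that could look like in practice, the sketch below reads those two fields from a page using only Python's standard library. It is a minimal sketch, not how any particular search engine works: the class name `MetaTagParser`, the function `extract_metatags`, and the sample page are all hypothetical and made up for this example.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects the content of selected <meta name="..."> tags (hypothetical helper)."""

    def __init__(self, wanted=("keywords", "description")):
        super().__init__()
        self.wanted = {name.lower() for name in wanted}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").lower()
        # Keep only the first occurrence of each wanted field.
        if name in self.wanted and name not in self.found:
            self.found[name] = attr_map.get("content", "").strip()


def extract_metatags(html_text):
    """Return a dict of the wanted metatag fields found in html_text."""
    parser = MetaTagParser()
    parser.feed(html_text)
    parser.close()
    return parser.found


# A made-up page, used only for illustration.
SAMPLE_PAGE = """<html><head>
<title>Hypothetical library page</title>
<meta name="keywords" content="invisible web, digital libraries, reference">
<meta name="description" content="A made-up page used only for illustration.">
</head><body><p>Body text.</p></body></html>"""

if __name__ == "__main__":
    for field, value in extract_metatags(SAMPLE_PAGE).items():
        print(f"{field}: {value}")
```

Because the fields are supplied by the page author, a retrieval system that trusts them is also exposed to the faked contents mentioned on the same slide, which is one plausible reason such fields are treated with caution.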