The Invisible Web - finding things that are hard to find
Tefko Saracevic, PhD
Rutgers University
http://www.scils.rutgers.edu/~tefko
(also contains a list of sites relevant to the topic and this presentation)
© Tefko Saracevic, Rutgers University

What is the “Invisible Web”?
• Materials that general search engines cannot or WILL not include in their collections of web pages (indexes)
• Materials you cannot find through general search engines
• Contains a vast amount of information
  – much of it authoritative and of high quality
  – much of it specialized

Why search engines miss material
• Size: Web is huge, cannot cover all
• Economics: associated costs are high
  – also pay per crawl & rank
• Technical: still limited capabilities
• Spam: eliminating the bad also loses the good
• Restrictions: some sites do not let crawlers in
• Deep structure: some sites are complex

Web size - who knows?
• Web Characterization Project - OCLC
  – provides statistics about the web
  – 1998: 2.8 million; 2002: 9.04 million web sites (by IP address)
    • In 2002: 35% public, 29% private, 36% provisional sites
  – Public sites (2002):
    • 55% US, 7% German, 6% Japanese, 3% each French, Spanish, 2% each Italian, Dutch, Chinese, 1% each Korean, Russian, Polish, Portuguese
  – Adult sites (2002): 3.3%
  – IP address volatility - all sites (disappearance pattern):
    • 13% of sites in 2002 were also present in 1998; 51% were also present in 2001

How do search engines work?
• Crawlers, spiders: go out to find
  – new & changed sites; periodic, not for each query
• Databases, caches:
  – gather content; could be submitted, bought
• Indexing: creating appropriate entries
  – various, mostly proprietary algorithms
• Retrieval engine: searching on basis of query
• Interface: gathers query, displays results
  – could be ordered by pay

Search engines differ
• Substantial differences among search engines on each aspect
• Information about search engines:
Search Engine Watch – ratings, news, statistics, charts
Search Engine Showdown – run by a librarian; news links, ratings
Extreme Searcher – update of a popular book

Search engine coverage
• No engine covers more than 16% of WWW
• Hard to discern & compare coverage
• Many national search engines – own coverage
• Many topical search engines – own coverage
• Many comprehensive sources independent of search engines

Specialized sources
• Meta search engines
• Specialized engines & catalogs
• Domain (subject) engines & catalogs
• Reference sources
• Libraries as web sources
• Virtual libraries
• Subject databases
• Societies, organizations

Meta search engines
• Search engines that cover search engines
Search Engine Colossus – international meta engine
Dogpile – results from a number of search engines
Surfwax – gives statistics and text sources
Search Engine Guide – categorized by topic; other engine information

meta engines … (cont.)
Vivisimo – clusters results; innovative
CompletePlanet – over 100,000 databases & search engines
Webbrain – results in tree structure; fun to use

Domain engines & catalogs
• Cover general & specific areas
Open Directory Project – large edited catalog of the web; global, run by volunteers
BUBL LINK – selected Internet resources covering all academic subject areas; UK
Profusion – search in categories

domain engines …
• Exist in many domains & subjects – rich!
Psychcrawler – American Psychological Association; web index for psychology
Entrez PubMed – National Library of Medicine
CiteSeer – NEC Research Center; scientific literature, citations index; free
Think Quest – an international organization; education resources, programs

domain engines …
KIRKE – Katalog der Internetressourcen für die Klassische Philologie aus Erlangen (catalog of Internet resources for classical philology, from Erlangen)
Perseus Digital Library – Tufts University; covers antiquity to Renaissance
School of Slavonic & East European Studies, University College London – a variety of resources; includes country resources, e.g. Croatia
U Mich Document Center – official documents from all over the world

Reference services
• Reference services - several models
  – Q&A, directories, email answers etc.
Ask Jeeves!
– most popular, commercial
Information Please – almanac type questions

reference …
• Digital reference - new service area for libraries
QuestionPoint – Library of Congress & OCLC; project for a global reference network
Virtual Reference Desk – Library of Congress; compilation of web reference sites
LiveRef – maintained at Iowa State U; a registry of real time digital reference services

Libraries as web sources
• Academic libraries providing open collections & services; models vary
Rutgers libraries – big long term effort
University of California, Berkeley – a most elaborate effort, together with Sun Microsystems
Bibliothèque Nationale de France – includes virtual exhibitions, among others

Virtual libraries on the Web
• Libraries emerging only on the Web
Virtual Library – ‘oldest virtual library on the Web’; Switzerland, also US, UK & other countries
Internet Public Library, Michigan – a long term effort
Librarians Index to the Internet – very popular and comprehensive

virtual libraries …
Academic Info Digital Library – many links to digital collections & resources in various subjects
Gabriel – Gateway to European National Libraries
Museum of online museums – a delight

Subject databases
• Many subject specific sites
  – rich & often unique coverage & services
  – different approaches & requirements
• Examples in health related domains:
WebMDHealth – news, medical information
Rxlist – The Internet Drug Index
Mayo Clinic HealthOasis – health advice

Societies, organizations
• Great many rich sources for searching
  – differences in requirements, depth, richness
Examples from a variety of organizations:
Assoc. for Computing Machinery Digital Library – subscription or registration
US State Department – about the U.S. & other countries
Genealogy – Church of Jesus Christ of Latter-day Saints; most comprehensive historical list of records

Language barriers on the Web
• English still the major language
  – but declining, now slightly over 50%
• Multilingual retrieval search engines
Euroseek – searches in a number of languages
All the Web – results in 45 languages

Language barriers: translations
• A number of translation sites
  – machine aided
  – i.e. plug in terms, phrases, sentences in one language & review in the other, but effectiveness???
Free Translations – to & from English, & 8 other languages
Babel Fish – to & from English and 9 languages; translates URLs
Travlang – great for travelers, but annoying commercials

Web news; keeping up
• What is going on on the Web?
Some major sources of news and evaluations:
Free Pint – newsletter, articles, links
Internet Resources Newsletter – UK based
ResearchBuzz – daily updates; many aspects
About.com Web Search – tools, Web Search Forum
Resource Shelf – newsletter with archive

keeping up …
• Information Today – trade & professional monthly newspaper & web site
  – industry news
  – searcher columns
  – general analyses of trends

Evaluations, ratings
• Many sources evaluate web sites:
The Scout Report – librarians’ BIBLE! Annotations. Comprehensive.
Medical Library Assoc.
– ten most useful sites; MLA user guide for health inf., recommendations
Web 100 – commercial, user ratings, news
Evaluating web pages, UC Berkeley – tutorial and guide

Archiving the web
• Internet Archive – a large undertaking
  – includes web archive & lots more, publicly available & free
  – 10 billion web pages archived from 1996 to a few months ago
  – Wayback Machine – search to look at old versions of web pages
• But there is more, e.g.:
  – Million Book Project
  – International Children’s Digital Library

Needed for Web searching
• Knowledge & competencies on
  – variety of web sources & their organization
  – search engines
  – web search strategies
  – search dynamics, feedback
• Keeping up & up & up
  – constant updates, changes, innovations
  – many domain/subject specific

Needed for Web searching by professionals
• Knowledge of SOURCES in area of interest
  – search engines not enough
  – not too helpful in finding these other sources; structure hard to discern
• Evaluation of sources – a key professional skill!
  – standard criteria & Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability

Needed competencies …
• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update
  – keeping up, keeping up, keeping up

But now really: How to do it?
[Diagram labels: information, WWW]

P.S.
a few weird sites…
• SelectSmart.com – all kinds of quizzes for you
• James Dean official web site
• Deaducated – Dead Librarians’ Society
• Livejournal – blogs & authoring tools

Sources
• About.com Web Search http://websearch.about.com
• Academic Info Digital Library http://www.academicinfo.net/digital.html
• All the Web http://www.alltheweb.com/
• Ask Jeeves! http://www.ask.com/
• Assoc. for Computing Machinery http://www.acm.org/
• Babelfish http://babelfish.altavista.com/tr
• Bibliothèque Nationale de France http://www.bnf.fr/
• BUBL LINK http://bubl.ac.uk/link/
• CDNET Search.com http://www.search.com/
• CiteSeer http://citeseer.nj.nec.com/
• CompletePlanet http://completeplanet.com
• Deaducated http://www.geocities.com/deadlibrarians/
• Dogpile http://www.dogpile.com/
• Entrez PubMed http://www.ncbi.nlm.nih.gov/PubMed/
• Extreme Searcher http://www.extremesearcher.com/
• Free Pint http://www.freepint.com/

sources …
• Free Translations http://www.freetranslations.com
• Gabriel http://www.kb.nl/gabriel/
• Genealogy http://www.familysearch.org/
• Information Please http://www.infoplease.com/
• International Children’s Digital Library http://www.icdlbooks.org/
• Internet Archive http://www.archive.org/
• Internet Public Library, Michigan http://www.ipl.org/
• Internet Resources Newsletter http://www.hw.ac.uk/libwww/irn/
• James Dean http://www.jamesdean.com/
• KIRKE http://www.phil.uni-erlangen.de/~p2latein/ressourc/ressourc.html
• Librarians Index to the Internet http://lii.org/
• Live Journal http://www.livejournal.com/
• LiveRef http://www.public.iastate.edu/~CYBERSTACKS/LiveRef.htm
• Mayo Clinic http://www.mayohealth.org/

sources …
• Medical Library Assoc. ten top sites http://www.mlanet.org/resources/medspeak/topten.html
• Medical Library Assoc. user guide for health inf. http://www.mlanet.org/resources/userguide.html
• Medscape http://www.medscape.com/
• Million Book Project http://www.archive.org/texts/collection.php?collection=millionbooks
• Museum of online museums http://www.coudal.com/moom.php
• OCLC Web Characterization Project http://wcp.oclc.org/
• Open Directory Project http://dmoz.org
• Perseus Digital Library http://www.perseus.tufts.edu/
• Profusion http://www.profusion.com/
• Psychcrawler http://www.psychcrawler.com/
• QuestionPoint http://www.questionpoint.org/
• ResearchBuzz http://www.researchbuzz.com/index.shtml
• Resource Shelf http://resourceshelf.blogspot.com/
• Rutgers Libraries http://www.libraries.rutgers.edu/
• RxList http://www.rxlist.com/

sources …
• Sch of East Eur & Slavonic Studies http://www.ssees.ac.uk/dirctory.htm
• Search Engine Colossus http://www.searchenginecolossus.com/
• Search Engine Guide http://www.searchengineguide.com/
• Search Engine Showdown http://searchengineshowdown.com/
• Search Engine Watch http://searchenginewatch.com/
• Select Smart.com http://www.selectsmart.com/home.html
• Surfwax http://www.surfwax.com/
• The Scout Report http://scout.cs.wisc.edu/
• Think Quest http://www.thinkquest.org/
• Travlang http://www.travlang.com
• U California Berkeley http://sunsite.berkeley.edu/
• U Mich Documents Center http://www.lib.umich.edu/govdocs/
• US State department http://www.state.gov/
• Virtual Library http://vlib.org
• Virtual Reference Desk http://www.loc.gov/rr/askalib/virtualref.html
• Vivisimo http://vivisimo.com
• Web 100 http://www.web100.com
• Webbrain http://www.webbrain.com/html/default_win.html
• WebMD http://my.webmd.com/webmd_today/home/default