InvisibleWebALL.ppt

The Invisible Web - finding things that are hard to find
Tefko Saracevic, PhD
Rutgers University
http://www.scils.rutgers.edu/~tefko
(also contains a list of sites relevant to the topic and this presentation)
© Tefko Saracevic, Rutgers University
What is the “Invisible Web”?
• Materials that general search engines cannot or WILL not include in their collection of web pages (indexes)
• Materials you cannot find through general search engines
• Contains a vast amount of information
– much of it authoritative, qualitative
– much of it specialized
Why do search engines miss things?
• Size: Web is huge, cannot cover all
• Economics: associated costs are high
– also pay per crawl & rank
• Technical: still limited capabilities
• Spam: eliminating bad also loses good
• Restrictions: some sites do not let crawlers in
• Deep structure: some sites are complex
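One common form of the "Restrictions" item is the robots exclusion convention: a site publishes a robots.txt file telling crawlers which paths to stay out of. The sketch below uses Python's standard urllib.robotparser module to check such rules. The robots.txt content, the crawler name ExampleBot, and the example.org URLs are invented for illustration and do not come from the presentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /search
"""


def allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/index.html",
                "http://example.org/members/list.html"):
        print(url, "->", "crawl" if allowed(ROBOTS_TXT, url) else "skip")
```

Pages under a disallowed path never reach a well-behaved engine's index, which is one way material ends up in the invisible Web.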
Web size - who knows?
• Web Characterization Project - OCLC
– provides statistics about the web
– 1998: 2.8 million, 2002: 9.04 million web sites (IP addresses)
• In 2002: 35% public, 29% private, 36% provisional sites
– Public sites (2002):
• 55% US, 7% German, 6% Japanese, 3% each French, Spanish, 2% each Italian, Dutch, Chinese, 1% each Korean, Russian, Polish, Portuguese
– Adult sites (2002): 3.3%
– IP address volatility - all sites (disappearance pattern):
• 13% of sites in 2002 were also present in 1998; 51% in 2001
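The percentages above can be turned into rough site counts. The sketch below is plain arithmetic on the figures quoted on this slide. The constant and function names are illustrative only, and the results carry whatever rounding the slide figures already contain.

```python
# Figures quoted on the slide (millions of web sites, by IP address).
SITES_1998_MILLION = 2.8
SITES_2002_MILLION = 9.04
SHARES_2002 = {"public": 0.35, "private": 0.29, "provisional": 0.36}


def growth_factor(start, end):
    """How many times larger `end` is than `start`."""
    return end / start


def counts_by_type(total_million, shares):
    """Split a total (in millions) according to fractional shares."""
    return {kind: total_million * share for kind, share in shares.items()}


if __name__ == "__main__":
    factor = growth_factor(SITES_1998_MILLION, SITES_2002_MILLION)
    print(f"growth 1998 to 2002: about {factor:.1f}x")
    for kind, n in counts_by_type(SITES_2002_MILLION, SHARES_2002).items():
        print(f"{kind}: about {n:.2f} million sites")
```

On these numbers, only about a third of the 2002 total, roughly 3.2 million sites, was public to begin with.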
How do search engines work?
• Crawlers, spiders: go out to find
– new & changed sites; periodic, not for each query
• Databases, caches:
– gather content; could be submitted, bought
• Indexing: creating appropriate entries
– various, mostly proprietary algorithms
• Retrieval engine: searching on basis of query
• Interface: gathers query, displays results
– results could be ordered by payment
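The indexing and retrieval steps listed above can be illustrated with a toy inverted index. This is a minimal sketch, not how any particular engine works: real engines use proprietary algorithms, as the slide notes. The page URLs and texts are invented, and the matching rule (every query word must appear) is an assumption chosen for simplicity.

```python
from collections import defaultdict

# Hypothetical crawled pages (URL -> text), invented for this example.
PAGES = {
    "http://example.org/a": "deep web databases",
    "http://example.org/b": "web crawlers and spiders",
    "http://example.org/c": "deep sea creatures",
}


def build_index(pages):
    """Indexing: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Retrieval: return the sorted URLs that contain every query word."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


if __name__ == "__main__":
    index = build_index(PAGES)
    print(search(index, "deep web"))
```

The query runs against the index built earlier, not against the live Web, which is why a page the crawler never reached cannot be found.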
Search engines differ
• Substantial differences among search engines on each aspect
• Information about search engines:
– Search Engine Watch: ratings, news, statistics, charts
– Search Engine Showdown: run by a librarian, news links, ratings
– Extreme Searcher: update of a popular book
Search engine coverage
• No engine covers more than 16% of the WWW
• Hard to discern & compare coverage
• Many national search engines - own coverage
• Many topical search engines – own coverage
• Many comprehensive sources independent of search engines
Specialized sources
• Meta search engines
• Specialized engines & catalogs
• Domain (subject) engines & catalogs
• Reference sources
• Libraries as web sources
• Virtual libraries
• Subject databases
• Societies, organizations
Meta search engines
• Search engines that cover search engines
– Search Engine Colossus: international meta engine
– Dogpile: results from a number of search engines
– Surfwax: gives statistics and text sources
– Search Engine Guide: categorized by topic; other engine information
meta engines … (cont.)
– Vivisimo: clusters results; innovative
– CompletePlanet: over 100,000 databases & search engines
– Webbrain: results in tree structure – fun to use
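The core idea of a meta search engine, combining ranked results from several engines into one list, can be sketched as follows. This is an illustrative round-robin merge with duplicate removal. It is an assumption made for the example, not the method used by any engine named above, and the result lists are invented.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave ranked result lists, keeping the first copy of each URL."""
    merged, seen = [], set()
    # zip_longest yields one "tier" per rank: every engine's 1st hit,
    # then every engine's 2nd hit, and so on (None pads shorter lists).
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    # Hypothetical result lists from two engines.
    engine_one = ["site-a", "site-b", "site-c"]
    engine_two = ["site-b", "site-d"]
    print(merge_results(engine_one, engine_two))
```

A merge like this can only be as broad as the engines behind it, so it widens coverage but does not by itself reach material none of them index.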
Domain engines & catalogs
• Cover general & specific areas
– Open Directory Project: large edited catalog of the web; global, run by volunteers
– BUBL LINK: selected Internet resources covering all academic subject areas; UK
– Profusion: search in categories
domain engines …
• Exist in many domains & subjects – rich!
– Psychcrawler, American Psychological Association: web index for psychology
– Entrez PubMed, National Library of Medicine
– CiteSeer, NEC Research Center: scientific literature, citations index; free
– Think Quest, an international organization: education resources, programs
domain engines …
– KIRKE - Katalog der Internetressourcen für die Klassische Philologie aus Erlangen (catalog of Internet resources for classical philology, from Erlangen)
– Perseus Digital Library, Tufts University: covers antiquity to renaissance
– Sch of Slavonic & East European Studies, University College London: a variety of resources; includes country resources, e.g. Croatia
– U Mich Document Center: official documents from all over the world
Reference services
• Reference services - several models
– Q&A, directories, email answers etc.
– Ask Jeeves!: most popular, commercial
– Information Please: almanac type questions
reference …
• Digital reference - new service area for libraries
– QuestionPoint, L of Congress & OCLC: project for a global reference network
– Virtual Reference Desk, L of Congress: compilation of web reference sites
– LiveRef, maintained at Iowa State U: a registry of real time digital reference services
Libraries as web sources
• Academic libraries providing open collections & services; models vary
– Rutgers libraries: big long term effort
– University of California, Berkeley: a most elaborate effort together with Sun Corporation
– Bibliothèque Nationale de France: includes virtual exhibitions, among others
Virtual libraries on the Web
• Libraries emerging only on the Web
– Virtual Library: Switzerland, US, UK & other countries – ‘oldest virtual library on the Web’
– Internet Public Library, Michigan: also a long term effort
– Librarians Index to the Internet: very popular and comprehensive
virtual libraries …
– Academic Info Digital Library: many links to digital collections & resources in various subjects
– Gabriel: Gateway to European National Libraries
– Museum of online museums: a delight
Subject databases
• Many subject specific sites
– rich & often unique coverage & services
– different approaches & requirements
• Examples in health related domains:
– WebMDHealth: news, medical information
– Rxlist: The Internet Drug Index
– Mayo Clinic HealthOasis: health advice
Societies, organizations
• Great many rich sources for searching
– differences in requirements, depth, richness
• Examples from a variety of organizations:
– Assoc. for Computing Machinery: Digital Library; subscription or registration
– US State Department: about the U.S. & other countries
– Genealogy, Church of Latter-day Saints: most comprehensive historical list of records
Language barriers on the Web
• English still the major language
– but declining, now slightly over 50%
• Multilingual retrieval search engines
– Euroseek: searches in a number of languages
– All the Web: results in 45 languages
Language barriers: translations
• A number of translation sites
– machine aided, i.e. plug in terms, phrases, sentences in one language & review in the other, but effectiveness???
– Free Translations: from & to English, & 8 other languages
– Babel Fish: from & to English and 9 languages, translates URLs
– Travlang: great for travelers, but annoying commercials
Web news; keeping up
• What is going on on the Web? Some major sources of news and evaluations:
– Free Pint: newsletter, articles, links
– Internet Resources Newsletter: UK based
– ResearchBuzz: daily updates; many aspects
– About.com Web Search: tools, Web Search Forum
– Resource Shelf: newsletter with archive
keeping up …
• Information Today
– trade & professional monthly newspaper & web site
– industry news
– searcher columns
– general analyses of trends
Evaluations, ratings
• Many sources evaluate web sites:
– The Scout Report: librarians’ BIBLE! Annotations. Comprehensive.
– Medical Library Assoc.: ten most useful sites; MLA user guide for health inf., recommendations
– Web 100: commercial, user ratings, news
– Evaluating web pages, UC Berkeley: tutorial and guide
Archiving the web
• Internet Archive – a large undertaking
– includes web archive & lots more, publicly available & free
– 10 billion web pages archived from 1996 to a few months ago
– Wayback Machine – search to look at old versions of web pages
• But there is more, e.g.:
– Million Book Project
– International Children’s Digital Library
Needed for Web searching
• Knowledge & competencies on
– variety of web sources & their organization
– search engines
– web search strategies
– search dynamics, feedback
• Keeping up & up & up
– constant updates, changes, innovations
– many domain/subject specific
Needed for Web searching by professionals
• Knowledge of SOURCES in area of interest
– search engines not enough
– not too helpful in finding these other sources; structure hard to discern
• Evaluation of sources
– a key professional skill!
– standard criteria & Web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability
Needed competencies …
• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update
– keeping up, keeping up, keeping up
But now really: How to do it?
[Figure labels: information, WWW]
P.S. a few weird sites…
• SelectSmart.com
– all kinds of quizzes for you
• James Dean official web site
• Deaducated
– Dead Librarians’ Society
• Livejournal
– blogs & authoring tools
Sources
About.com Web Search http://websearch.about.com
Academic Info Digital Library http://www.academicinfo.net/digital.html
All the Web http://www.alltheweb.com/
Ask Jeeves! http://www.ask.com/
Assoc. for Computing Machinery http://www.acm.org/
Babelfish http://babelfish.altavista.com/tr
Bibliothèque Nationale de France http://www.bnf.fr/
BUBL LINK http://bubl.ac.uk/link/
CDNET Search.com http://www.search.com/
CiteSeer http://citeseer.nj.nec.com/
CompletePlanet http://completeplanet.com
Deaducated http://www.geocities.com/deadlibrarians/
Dogpile http://www.dogpile.com/
Entrez PubMed http://www.ncbi.nlm.nih.gov/PubMed/
Extreme Searcher http://www.extremesearcher.com/
Free Pint http://www.freepint.com/
sources …
Free Translations http://www.freetranslations.com
Gabriel http://www.kb.nl/gabriel/
Genealogy http://www.familysearch.org/
Information Please http://www.infoplease.com/
International Children’s Digital Library http://www.icdlbooks.org/
Internet Archive http://www.archive.org/
Internet Public Library, Michigan http://www.ipl.org/
Internet Resources Newsletter. http://www.hw.ac.uk/libwww/irn/
James Dean http://www.jamesdean.com/
KIRKE http://www.phil.uni-erlangen.de/~p2latein/ressourc/ressourc.html
Librarians Index to the Internet http://lii.org/
Live Journal http://www.livejournal.com/
LiveRef http://www.public.iastate.edu/~CYBERSTACKS/LiveRef.htm
Mayo Clinic http://www.mayohealth.org/
sources …
Medical Library Assoc. ten top sites http://www.mlanet.org/resources/medspeak/topten.html
Medical Library Assoc. user guide for health inf. http://www.mlanet.org/resources/userguide.html
Medscape http://www.medscape.com/
Million Book Project http://www.archive.org/texts/collection.php?collection=millionbooks
Museum of online museums. http://www.coudal.com/moom.php
OCLC Web Characterization Project http://wcp.oclc.org/
Open Directory Project http://dmoz.org
Perseus Digital Library http://www.perseus.tufts.edu/
Profusion http://www.profusion.com/
Psychcrawler http://www.psychcrawler.com/
QuestionPoint http://www.questionpoint.org/
ResearchBuzz. http://www.researchbuzz.com/index.shtml
Resource Shelf http://resourceshelf.blogspot.com/
Rutgers Libraries http://www.libraries.rutgers.edu/
RxList http://www.rxlist.com/
sources …
Sch of East Eur & Slavonic Studies http://www.ssees.ac.uk/dirctory.htm
Search Engine Colossus http://www.searchenginecolossus.com/
Search Engine Guide http://www.searchengineguide.com/
Search Engine Showdown http://searchengineshowdown.com/
Search Engine Watch http://searchenginewatch.com/
Select Smart.com http://www.selectsmart.com/home.html
Surfwax http://www.surfwax.com/
The Scout Report. http://scout.cs.wisc.edu/
Think Quest http://www.thinkquest.org/
Travlang http://www.travlang.com
U California Berkeley http://sunsite.berkeley.edu/
U Mich Documents Center http://www.lib.umich.edu/govdocs/
US State department http://www.state.gov/
Virtual Library http://vlib.org
Virtual Reference Desk http://www.loc.gov/rr/askalib/virtualref.html
Vivisimo http://vivisimo.com
Web 100 http://www.web100.com
Webbrain http://www.webbrain.com/html/default_win.html
WebMD http://my.webmd.com/webmd_today/home/default