Searching and the Web

Information jungle on the Web: finding and evaluating information sources
Tefko Saracevic, PhD
Rutgers University
tefko@scils.rutgers.edu
http://www.scils.rutgers.edu/people/faculty/tefko.html
Web & information: key problems
SEARCHING the Web for information
Retrieving a MANAGEABLE AMOUNT
Selecting the most RELEVANT sources
EVALUATING sources & information
Three laws for information on the Web:
1. EVALUATE
2. EVALUATE
3. EVALUATE
Tefko Saracevic, Rutgers University
Characteristics of information on the Web
VARIETY - amazing
rich source on myriad topics & subjects
DISTRIBUTION - all over, global
information scattered across great many sites
LINKAGE - many hyperlinks, hypertexts
elaborate web of connections, paths, and mazes
AMOUNT - huge, growing exponentially
millions of sites, billions of pages
Characteristics … (cont.)
CONTENT VALUE NEUTRAL - anything goes
no control of content
some accurate, trustworthy, verifiable
some biased, self-serving, propaganda, promotional
some false accidentally
some false deliberately, some even with evil intent
Thus, the three Web laws
Size of the Web
Over 16 million web servers; 800 million pages
83% commercial, 6% scientific or educational; 3% health
2.5% personal; 2% societies; 1.5% government,
about 1% each community, religion; 1.5% pornographic
Growth of public sites, 1997-99: +179%
Countries of origin:
U.S. 55% (59% in 1997), Germany 6%, Canada 5%, UK 5%, Japan 3%,
Australia, Brazil, France, Italy 2% each, all others 18%
Languages: 80% English (84% in 1997)
US sites & English language predominate, but % falling steadily
Sources: Lawrence & Giles, Nature (1999): http://www.wwwmetrics.com/
OCLC Web Characterization Project
http://oclc.org/oclc/research/projects/webstats/index.htm
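The growth and size figures above can be turned into per-year and per-server numbers with simple arithmetic. The sketch below is illustrative only: it assumes that 1997-99 counts as two full years and that growth compounded evenly, neither of which the slide states.

```python
# Figures taken from the slide.
total_growth = 1.79        # +179% growth in public sites, 1997-99
pages = 800_000_000        # 800 million pages
servers = 16_000_000       # "over 16 million" servers, so this is a lower bound

# Assumption: the period is treated as two full years of compound growth.
years = 2

growth_factor = 1 + total_growth              # sites multiplied by 2.79
annual_factor = growth_factor ** (1 / years)  # even compounding per year
annual_rate = annual_factor - 1

# Upper-bound average, because the server count is a lower bound.
pages_per_server = pages / servers

print(f"annualized growth rate: {annual_rate:.1%}")
print(f"average pages per server (at most): {pages_per_server:.0f}")
```

Under these assumptions the number of public sites grew by roughly two thirds per year, and an average server held at most about 50 pages.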
Organization of Web sites
Metatags - to enable retrieval by fields - low use
HTML “keywords”, “description”
34% of sites use them
Dublin Core - 0.3% of sites use it
No standardization across sources
Classification a predominant approach
many types used
Lack of organization major hindrance to retrieval
also faked contents to force retrieval
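To make the HTML “keywords” and “description” metatags concrete, here is a minimal sketch of how a program might read them from a page using Python's standard `html.parser` module. The sample page, the class name `MetaTagParser` and the function `extract_meta` are made up for illustration; real search engines decide for themselves whether and how to use these fields.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the content of the "keywords" and "description" metatags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.fields[name] = attr.get("content") or ""


def extract_meta(html_text):
    """Return a dict with whichever of the two metatags the page contains."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.fields


# Hypothetical page, used only for illustration.
SAMPLE_PAGE = """<html><head>
<title>Sample page</title>
<meta name="keywords" content="libraries, searching, evaluation">
<meta name="description" content="A made-up page used only for illustration.">
</head><body>Hello</body></html>"""

print(extract_meta(SAMPLE_PAGE))
```

A page without metatags simply yields an empty result, which is the common case given the low use reported above. Because the page author writes these fields, they are also an easy place to put faked contents.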
Comparison: Web & library or information retrieval searching
SIMILARITIES in searching
Basic principles of the approach are the same
human-human interaction - mediated or introspection
 to determine content, explore information need for a task
preparation of search concepts, terms, logic
determination of range, restrictions
estimation of relevance
Differences
Vastly different sources
as to contents, authority, reliability, persistence
variation in amounts, depth, breadth
Very different organization
little standardization, few if any fields
Quite different search engines
Differing search strategies needed
Presence of many links; complex connections
Evaluation more complex
Needed for Web searching
Knowledge & competencies
about great variety of sources
great variety in their organization
search engines
search strategies; search dynamics
exploring & exploiting links & networks
keeping up: constant changes, innovations
Web economics - no such thing as free lunch
Effectiveness proportional to that knowledge
Criteria for evaluation
http://www.otterbein.edu/learning/libpages/subeval.htm
Authority
Author - possible bias? Publisher - reputation?
Professional society? Academic source?
Reason on the Web?
Vanity pages? Sponsor? Advocacy association?
Domain name - who put up the site?
Accuracy - possible independent verification? Sources?
Currency - verification
Prior review, experiences - checking review sources
Critical thinking & constant verification
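The domain-name criterion can be partly mechanized as a first check. The sketch below pulls the host and top-level domain out of a Web address with Python's standard `urllib.parse`. The hint table `TLD_HINTS` and the "~ in the path suggests a personal page" rule are rough assumptions for illustration, not an authoritative classification, and no such check replaces the critical thinking called for above.

```python
from urllib.parse import urlparse

# Hypothetical hints, assumed for illustration only.
TLD_HINTS = {
    "edu": "educational institution (mostly U.S.)",
    "gov": "U.S. government",
    "org": "organization, often non-profit or advocacy",
    "com": "commercial",
}


def domain_clues(url):
    """Return simple clues about who may have put up a site."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # The top-level domain is the part after the last dot of the host name.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        "host": host,
        "tld": tld,
        "hint": TLD_HINTS.get(tld, "unknown; check further"),
        # Heuristic: a tilde in the path often marks a personal directory.
        "personal_page": "~" in parsed.path,
    }


for address in (
    "http://www.state.gov/",
    "http://www-sci.lib.uci.edu/~martindale/Ref.html",
):
    print(domain_clues(address))
```

The clues only narrow down where to look next: a personal directory on a university server can hold an excellent reference desk, and a commercial domain can hold a reliable source.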
Ways to search & retrieve
Most popular: search engines
global, regional, country, specialized engines
Following links from major sites & portals
e.g. from Library of Congress to many libraries
from newspapers to archives
Reference sites - growing numbers
Library sites - becoming ever richer sources
Web addresses in print sources, newspapers
Referrals, emails, bookmarks
Web sites & search engines
Indexed by search engines (public sites)
by keywords, classification, links, registration
Hard to find
most domain sources will not be found, e.g. digital
libraries, online journals, reference sources
many commercial sites
Differing approaches to inclusion/selection
mostly automatic; also generic source providers
increasingly added human evaluation
Search engine coverage
No US engine covers more than 16% of the Web
Very hard to discern coverage
In respect to combined coverage of 11 top engines:
Northern Light 38.3%, Snap 37.1%, AltaVista 37.1%, HotBot 27.1%, MS 20.3%,
Infoseek 19.2%, Google 18.6%, Yahoo 17.6%, Excite 13.5%, Lycos 5.9%,
EuroSeek 5.2%
HotBot, MS, Snap & Yahoo use Inktomi as search provider, but have
different filtering & Inktomi databases
Large European engines geared to country coverage
 E.g. Wanadoo (France), T-online (Germany)
highest use among engines in their countries
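The percentages "in respect to combined coverage" are shares of the pages that the engines index together, not shares of the whole Web. A toy sketch of that calculation follows. The engine names and page identifiers are invented, and the function `relative_coverage` is an illustration rather than the method used in the cited studies.

```python
def relative_coverage(indexes):
    """Share of the combined (union) index that each engine covers."""
    combined = set().union(*indexes.values())
    return {name: len(pages) / len(combined) for name, pages in indexes.items()}


# Invented toy data: which pages each engine has indexed.
toy_indexes = {
    "EngineA": {"p1", "p2", "p3", "p4"},
    "EngineB": {"p3", "p4", "p5"},
    "EngineC": {"p6"},
}

for name, share in relative_coverage(toy_indexes).items():
    print(f"{name}: {share:.1%} of combined coverage")
```

Because engines overlap, the shares can add up to more than 100%. And because the combined index is itself only part of the Web, each engine's share of the whole Web is smaller than its share of combined coverage.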
Unique search engines
Number of specialized engines - looking for niche
good for scientific, technical, professional searches
include manual evaluation & selection of sources
Northern Light has ‘special collections’
not found on publicly indexable Web
http://www.northernlight.com
Oingo has word associations, evaluations
includes elaborate classification
http://www.oingo.com/
Search features among engines
Some search features the same across all but
details differ - particularly in advanced
Boolean available
but sometimes AND sometimes OR default
Differences may be found in:
phrases, proximity, truncation, case sensitivity,
relevance feedback, field searching, special features
some have term expansion to concepts & lists of
associated terms ( e.g. latent semantic indexing)
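Why the default operator matters can be shown with a toy example. The sketch below builds a tiny inverted index over three invented documents and runs the same two-word query with AND and with OR as the default; it also handles truncation with a trailing `*`. All names (`build_index`, `lookup`, `search`) and the `*` syntax are assumptions for illustration; each real engine has its own syntax and defaults.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, term):
    """Documents matching one term; a trailing * means truncation."""
    term = term.lower()
    if term.endswith("*"):
        stem = term[:-1]
        hits = set()
        for word, ids in index.items():
            if word.startswith(stem):
                hits |= ids
        return hits
    return set(index.get(term, set()))


def search(index, query, default_op="AND"):
    """Combine the terms of a query with the engine's default operator."""
    results = [lookup(index, term) for term in query.split()]
    if not results:
        return set()
    if default_op == "AND":
        return set.intersection(*results)
    return set.union(*results)


# Invented documents, for illustration only.
DOCS = {
    1: "digital library services",
    2: "public library catalog",
    3: "digital newspapers archive",
}
INDEX = build_index(DOCS)

print(sorted(search(INDEX, "digital library", default_op="AND")))
print(sorted(search(INDEX, "digital library", default_op="OR")))
print(sorted(search(INDEX, "librar*")))
```

The same query retrieves one document under an AND default and all three under an OR default, which is why a searcher needs to know which default a given engine uses.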
Search strategies & outputs
Geared toward very short searches
big majority of searches 2-3 terms (av. 2.5)
big majority of users view one page only
Geared toward limited top outputs
Ranking output by relevance predominates
relevance calculations differ & are kept secret
Also heavy & increasing use of classification
Browsing a big component
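Since engines keep their relevance calculations secret, any example can only be a stand-in. The sketch below ranks invented documents by a deliberately simple score, the number of times the query terms occur, and returns only the top few results, mirroring the short queries and limited top outputs described above. The scoring rule and all names are assumptions, not any engine's actual method.

```python
def score(query, text):
    """Count how often the query terms occur in the text (toy relevance)."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


def rank(query, docs, top_n=10):
    """Return up to top_n document ids, highest score first."""
    scored = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    scored = [(s, d) for s, d in scored if s > 0]    # drop non-matching docs
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # score desc, then id
    return [doc_id for _, doc_id in scored[:top_n]]


# Invented documents, for illustration only.
DOCS = {
    "a": "web search engines index the web",
    "b": "library catalog search",
    "c": "cooking recipes",
}

print(rank("web search", DOCS))
print(rank("web search", DOCS, top_n=1))
```

Two engines using different scoring rules would order the same documents differently, so the first page of results, which is all most users view, can differ considerably between engines.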
Meta search engines
Search engines that cover search engines e.g.
All4one
http://all4one.com/
four windows - good for comparison
Savvy Search
http://www.savvysearch.com/
indicates search engine source
More on the horizon & differing
Search Engine Watch http://www.searchenginewatch.com/
listing, reviews, ratings, tests, resources, tutorials
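The basic idea of a meta search engine, sending one query to several engines and indicating the source of each result, can be sketched as follows. The two "engines" here are hypothetical stand-in functions that return fixed example addresses; a real meta search engine would query live services and would also have to reconcile their different ranking schemes.

```python
def engine_one(query):
    """Hypothetical stand-in for a search engine; returns fixed results."""
    return ["http://example.com/a", "http://example.com/b"]


def engine_two(query):
    """Hypothetical stand-in for a second search engine."""
    return ["http://example.com/b", "http://example.com/c"]


ENGINES = {"EngineOne": engine_one, "EngineTwo": engine_two}


def meta_search(query, engines):
    """Merge results from all engines, recording which engines found each."""
    merged = {}
    for name, run_query in engines.items():
        for url in run_query(query):
            # Duplicates are merged; the source list shows every engine
            # that returned the address.
            merged.setdefault(url, []).append(name)
    return merged


for url, sources in meta_search("digital libraries", ENGINES).items():
    print(url, "found by", ", ".join(sources))
```

Listing the sources next to each result makes overlap between engines visible, which is also what makes side-by-side comparison useful.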
Reference sites - facts
Reference services & access changing drastically
Several models in reference services:
Martindale’s Reference Desk - comprehensive
http://www-sci.lib.uci.edu/~martindale/Ref.html
Ask Jeeves! - natural language http://www.ask.com/
over 2 million queries per day; growing 46% per quarter
Electric Library - membership http://www.elibrary.com/
Review of several reference sites
http://www.libraryjournal.com/articles/multimedia/webwatch/19991101_12593.asp
Reference sources … (cont.)
Information Please - almanacs
http://www.infoplease.com/
Reference Desk - rich http://www.refdesk.com/
Encyclopedia Britannica
http://www.britannica.com/
great many cross-references & other sources
Webhelp - “real people, real answers, real time”
live conversation with one of the 1000+
“Web wizards” www.webhelp.com
Libraries as Web sources
Libraries providing open collections & services
growth of digital libraries & Web access
models vary; parts open to all, parts only to own users
One example, among great many:
Rutgers libraries - large & long term effort
http://www.libraries.rutgers.edu/
various sources & links involved
e.g. for domain information & sources go to:
Electronic Ready Reference Shelf; Research Guides; Social Sciences
& Law; Library & Information Science
Virtual libraries on the Web
Libraries emerging only on the Web
More & more libraries & organizations involved
Examples of libraries rich in sources & links
Virtual Library - Switzerland, US, UK & other countries,
started by Tim Berners-Lee, the creator of the Web http://vlib.org
Toronto Public Library http://vrl.tpl.toronto.on.ca/
Internet Public Library, Michigan http://www.ipl.org/
Academic Info - “Gateway to Quality Educational
Resources.” International http://academicinfo.net/
New modes of access
Libraries, agencies, companies, developing reference
& service models - new, rich, innovative e.g.
For & about children Los Angeles Public Library - great
fun! http://www.lapl.org/kidsweb/
Parenting: Parenttime http://www.parenttime.com/home/homepage.cgi
Fathom - consortium of six leading institutions in US & UK
 beta testing - top quality research coverage http://www.fathom.com/
Course on Internet use with links http://www.newbie.org/
Domain sites
Many domain/issue specific sites
rich & often unique coverage & services
 different approaches & requirements
Examples in health related domains:
Medscape - registration required
http://www.medscape.com/
Rxlist - The Internet Drug Index
http://www.rxlist.com/
Mayo Clinic HealthOasis http://www.mayohealth.org/
Societies, organizations, publishers
Great many rich sources for searching
differences in requirements, depth, richness
Examples from variety of organizations:
Assoc. for Computing Machinery http://www.acm.org/
Digital Library; subscription or registration, searchable
State Department http://www.state.gov/
about the U.S. & other countries
R.R. Bowker http://www.bowker.com/
free sections - Yours for the Asking; Library Resource Guide
Genealogy: http://www.familysearch.org/
Newspapers
Various online newspaper models are being explored
beyond having just a print copy on the Web
subscription; links; archives; more elaborate stories …
e.g. San Francisco Examiner - http://examiner.com/
articles, in depth projects, area guide (SF Gate), archive ...
Finding stories & papers: Excite News Tracker
http://nt.excite.com/
Includes: World Newspapers Resources
Index of some major world newspapers (from New
Zealand) http://www.ccc.govt.nz/Library/Resources/Newspapers/index.asp
Summary
Web is:
 rapidly evolving, changing, expanding
unpredictable, rich, and valuable source
Knowledge & competencies needed to use it
effectively, also common sense & flexibility
 Three Web laws always in effect!
Web economics
rewards big, but costs significant
But … limitations
The public Web does not have it all
 Many rich resources not accessible without paying
DIALOG covers many fields & is larger than the Web
similarly Lexis - Nexis, Data Star etc.
Majority of content in libraries is NOT on Web
Majority of archives, old newspapers NOT on Web
WEB IS RICH, BUT NOT A BEGINNING & END
OF INFORMATION SOURCES