Web searching - School of Communication and Information

Web searching & the invisible web
Finding things that are hard to find
© Tefko Saracevic
Principles of Searching
1
Dictionary definitions
  World Wide Web: Internet-connected files
    the very large set of linked documents and other files located on computers connected through the Internet and used to access, manipulate, and download data and programs
  Invisible - dictionary definition:
    not easily noticed; not noticed or detected readily
  Invisible web – not yet in the dictionary

What is the “invisible web”?
  Materials that general search engines cannot or WILL not include in their collection of web pages (indexes)
  You cannot find it through general search engines
  Contains a vast amount of information resources
    much of it authoritative & higher quality than visible web
      quality becomes a main issue
    much of it specialized
    a lot of it also fluid or streaming or real time
      “You can’t step in the same river twice”
    much of it free
  Many times larger than the visible Web

in other words…
  There is much more to the web than [image not preserved] or [image not preserved]
  Distribution of use: [figure not preserved]

Why search engines do not cover all?
  Size: web is huge, cannot cover all
  Economics: associated costs are high
    engines support themselves mostly by ads
    also a number of engines have rank per pay & crawl update per pay - providing paid listings first & mostly
  Technical: still a challenge & limited capabilities
    also some file formats hard to cover
  Spam: eliminating bad also loses good
  Restrictions: some sites do not let crawlers in
  Deep structure: some sites complex

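One common way a site keeps crawlers out is a robots.txt file. The slide does not name a mechanism, so treating robots.txt as the "restriction" is an assumption here. The sketch below uses Python's standard urllib.robotparser; the robots.txt text, the bot name "ExampleBot" and the example.org URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (not taken from the lecture).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.org/index.html",
        "http://example.org/private/report.pdf",
        "http://example.org/search?q=invisible+web",
    ):
        print(url, crawlable(ROBOTS_TXT, "ExampleBot", url))
```

Pages under the disallowed paths never reach a well-behaved engine's index, which is one way content ends up in the invisible web.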
How do search engines work? Main parts
  Crawlers, spiders: go out to find content
    looking for new & changed sites
    periodic, not for each query
    no search engine works in real time
  Organizing content: labeling, arranging
    indexing for searching or classifying as directory
  Databases, caches: storing content
  Retrieval engine: searching on basis of query
  Interface: handles query, displays results
  All based on various, mostly proprietary algorithms

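The parts listed above can be illustrated with a toy sketch. It is a deliberately simplified model, not how any real engine works (those algorithms are proprietary, as the slide notes). The pages, URLs and function names are hypothetical, and retrieval is a plain boolean AND over an inverted index.

```python
from collections import defaultdict

# Hypothetical "crawled" pages; a real crawler would collect these periodically,
# ahead of time, not at the moment a query arrives.
PAGES = {
    "http://example.org/a": "invisible web searching guide",
    "http://example.org/b": "web search engines and crawlers",
    "http://example.org/c": "library catalogs and databases",
}


def build_index(pages):
    """Organizing content: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def retrieve(index, query):
    """Retrieval engine: URLs containing every query word (boolean AND)."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


def display(query, results):
    """Interface: format the result list for the user."""
    lines = [f"Results for '{query}':"]
    lines += [f"  {i}. {url}" for i, url in enumerate(results, 1)]
    return "\n".join(lines)


# Storing content: the index is built once and reused for every query.
INDEX = build_index(PAGES)

if __name__ == "__main__":
    print(display("web", retrieve(INDEX, "web")))
```

Because the query is answered from the stored index, a page that changed or appeared after the last crawl is not found, which is the sense in which no engine works in real time.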
Search engine coverage
  No engine covers more than a fraction of WWW
    estimates: none more than 16%
    Hard (impossible) to discern & compare coverage
  Many national search engines
    own coverage, orientation, governance
  Many topical or domain search engines
    own coverage geared to subject of interest
  Many comprehensive sources independent of search engines
    some compilations of evaluated web sources

Search engines differ
  Substantial differences among search engines on each of these parts
    Need to know how they work & differ
  Information about search engines:
    Search Engine Watch
      ratings, news, statistics, charts, explanations, tutorials
    Search Engine Showdown
      “The users’ guide to web searching” - run by a librarian, news links, ratings

Invisible web searching: Basic approach
  The first step in determining the best approach for searching the invisible web is to have a clear idea of what you’re seeking
    extensive user modeling
  Limit your search to appropriate resources & tools for the particular type of information you’re looking for
    know your sources
    know how to find appropriate sources
    shades of “Knowledge is of two kinds…”

Specialized sources - particularly for the invisible web
  The rest of the lecture covers:
    1. Meta search engines
    2. Specialized engines & catalogs
    3. Domain (subject) engines & catalogs
    4. Reference sources
    5. Libraries as web sources
    6. Virtual libraries
    7. Subject databases
    8. Societies, organizations
    9. Good old books

Meta search engines
  Meta search engines search multiple engines
    getting combined results from a variety of engines
  Finding a search engine or meta engine:
    SearchEngines.com
      search for engines by topic, geography, reference
    Search Engine Guide
      engines categorized by topic; other engine information
    Search Engine Colossus
      international directory of search engines by country, topic from 198 countries and 61 territories; engines in choice of languages

Sample of meta engines
  Some meta engines provide organized results:
    Dogpile
      results from a number of leading search engines; gives source, so overlap can be compared; (has also a (bad) joke of the day)
    Surfwax
      gives statistics and text sources & linking to sources; for some terms gives related terms to focus
    Teoma
      results with suggestions for narrowing; links resources derived; originated at Rutgers
    Turbo10
      provides results in clusters; engines searched can be edited

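As a rough illustration of what "combined results from a variety of engines" with the source shown might involve, here is a minimal sketch. It is not how Dogpile or any other named meta engine actually merges results. The engine names, URLs and ordering rule (more engines first, then best rank) are assumptions made for the example.

```python
from collections import defaultdict


def merge_results(results_by_engine):
    """Combine ranked URL lists and record which engines returned each URL."""
    sources = defaultdict(list)
    best_rank = {}
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            if engine not in sources[url]:
                sources[url].append(engine)
            best_rank[url] = min(rank, best_rank.get(url, rank))
    # Assumed ordering: URLs found by more engines first, then by best rank,
    # then alphabetically so the output is deterministic.
    ordered = sorted(sources, key=lambda u: (-len(sources[u]), best_rank[u], u))
    return [(url, sources[url]) for url in ordered]


def overlap(merged):
    """URLs returned by more than one engine."""
    return [url for url, engines in merged if len(engines) > 1]


# Hypothetical result lists from two imaginary engines.
SAMPLE = {
    "engine_a": ["http://example.org/1", "http://example.org/2"],
    "engine_b": ["http://example.org/2", "http://example.org/3"],
}

if __name__ == "__main__":
    for url, engines in merge_results(SAMPLE):
        print(url, "from", ", ".join(engines))
```

Keeping the list of source engines per URL is what lets a searcher see how little, or how much, the underlying engines overlap.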
meta search engines (cont.)
  Large directory
    Complete Planet
      directory of over 70,000 databases & specialty engines
  Results with graphical displays
    Vivisimo
      clusters results; innovative
    Webbrain
      results in tree structure – fun to use
    Kartoo
      results in display by topics of query

Domain engines & catalogs
  Cover general & specific subjects
    Open Directory Project
      large edited catalog of the web – global, run by volunteers
    BUBL LINK
      selected Internet resources covering all academic subject areas; organized by Dewey Decimal System – from UK
    Profusion
      search in categories for resources & search engines
    Resource Discovery Network – UK
      “UK's free national gateway to Internet resources for the learning, teaching and research community”

domain engines …
  Available in variety of domains & subjects – rich!
    Think Quest – Oracle Education Foundation
      education resources, programs; web sites created by students
    All Music Guide
      resource about musicians, albums, and songs
    Internet Movie Database
      treasure trove of American and British movies
    Genealogy links and surname search engines
      well.. that is getting really specialized (and popular)

domain engines …
  Scholarship, science
    Psychcrawler - Amer Psychological Association
      web index for psychology
    Entrez PubMed – Nat Library of Medicine
      biomedical literature from MEDLINE & health journals
    CiteSeer - NEC Research Center
      scientific literature, citations index; strong in computer science
    Scholar Google
      searches for scholarly articles & resources
    Infomine
      scholarly internet research collections

Reference services
  Reference services - several models
    Ask Jeeves!
      most popular, commercial
    Information Please
      almanac type questions
    RefDesk
      access to a number of reference tools
    Wikipedia
      web encyclopedia in many languages
    Martindale’s The Reference Desk
      probably the most amazing & versatile reference collection on the web – numerous sections, great to explore

reference …
  Digital reference - new service area for libraries
    QuestionPoint L of Congress & OCLC
      project for a global reference network
    Virtual Reference Desk – L of Congress
      large compilation of web reference sites
    LiveRef - maintained at Iowa State U
      a registry of real time digital reference services
    Martindale’s The Reference Desk
      probably the most amazing & versatile reference collection on the web – numerous sections, great to explore

Libraries as web sources
  Academic, national libraries providing open collections & services; models vary
    Rutgers libraries - big long term effort
    University of California, Berkeley
      a most elaborate effort together with Sun Corporation
    LibWeb U California, Berkeley
      “lists currently over 7200 pages from libraries in over 125 countries”
    Bibliothèque Nationale de France
      includes virtual exhibitions, among others

Virtual libraries on the Web
  Libraries emerging only on the Web
    Virtual Library
      Switzerland, US, UK & other countries – ‘oldest virtual library on the Web’
    Internet Public Library U of Michigan
      also a long term effort
    Librarians Index to the Internet
      very popular and comprehensive
    Digital librarian
      “a librarian's choice of the best of the Web” – compiled and annotated by a librarian

virtual libraries …
  Academic Info Digital Library
    many links to digital collections & resources in various subjects
  Gabriel
    Gateway to European National Libraries
  Museum of online museums
    a delight
  Stanford Encyclopedia of Philosophy
    a comprehensive encyclopedia and library
  The historical New York Times Project
    universal library – ongoing digitization

Subject resources
  Many subject specific sites
    rich & often unique coverage & services
    different approaches & requirements
  Examples in health related domains:
    WebMDHealth
      news, medical information
    Rxlist
      The Internet Drug Index
    Mayo Clinic HealthOasis
      health advice
    Kidshealth
      sites for parents, kids, teens

Subject resources …
  Scholarship, humanities, government
    KIRKE - Katalog der Internetressourcen für die Klassische Philologie aus Erlangen
      German; a variety of resources for classics
    Perseus Digital Library Tufts University
      covers antiquity to renaissance; one of the best subject sites on the web; affected the whole field
    Sch of Slavonic & East European Studies, University College London
      includes country resources, e.g. Croatia
    U Mich Document Center
      official documents from all over the world


Subject resources …
  Growing number of resources in arts, museums
    MuseumStuff.com
      “We have 1000's of museums, zoos, historical societies and related organizations in our database”
    The State Hermitage Museum
      One of the greatest museums in the world, and one of the best museum sites – developed with IBM help
    National Museum of Science and Technology Leonardo da Vinci
      Guess where those pictures came from. A delight!

subject resources …
  Diotima
    Materials for study of women and gender in the Ancient World
  Moving Images Collections
    “MIC documents moving image collections around the world.” Part particularly oriented toward science educators. Now at Library of Congress, but developed at Rutgers.
  And, of course …
    Snoopy
      The Official Peanuts Website

Societies, organizations
  Many societies, agencies developed their sites
    great many rich sources for searching & resources
    differences in requirements, depth, richness
  Assoc. for Computing Machinery
    Digital Library; subscription or registration or through RUL
  US State Department
    about the U.S. & other countries
  FirstGov
    the US government official web portal
  Ocean Planet NASA
    presentation of earth & its vast oceans
  ArXiv Cornell U, National Science Foundation
    e-print service in the fields of physics, mathematics, nonlinear science, computer science, and quantitative biology

Archiving, books on the web
  Internet Archive – a large undertaking
    includes web archive & lots more publicly available & free
    10 billion web pages archived from 1996 to a few months ago
    Wayback Machine – search to look at old versions of web pages
  Books on the web
    Million Book Project
      digitizing books and providing free access
    International Children’s Digital Library
      online children's books
    Digital Book Index
      “links to more than 105,000 title records from more than 1800 commercial and non-commercial publishers, universities, and various private sites”

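Looking up an old version of a page can also be done programmatically. The sketch below assumes the Internet Archive's present-day availability endpoint and its JSON response shape, neither of which the lecture mentions. It only builds the query URL and parses a hand-written sample response, so no network access is needed. The function names and the example.org snapshot are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; it is not part of the original lecture.
API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build a query URL asking for the snapshot closest to timestamp (YYYYMMDD)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{API}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the archived URL from a response, or None if nothing is available."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hand-written sample in the assumed response shape, not real archive data.
SAMPLE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20050101000000/http://example.org/",
            "timestamp": "20050101000000",
            "status": "200",
        }
    }
})

if __name__ == "__main__":
    print(availability_query("example.org", "20050101"))
    print(closest_snapshot(SAMPLE))
```

In real use the query URL would be fetched with an HTTP client and the response body passed to closest_snapshot.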
Language barriers on the Web
  English still the major language
    but declining, now slightly over 50%
  Multilingual retrieval search engines
    Euroseek
      searches in a number of languages
    All the Web
      results in 45 languages

Web news; keeping up
  What is going on on the Web? Some major sources of news and evaluations:
    Free Pint
      newsletter, articles, links; nice & sometimes quirky
    Internet Resources Newsletter
      UK based; monthly newsletter for “academics, students, engineers, scientists and social scientists”
    ResearchBuzz
      daily updates; many aspects; “Collection of items on search engines, online databases, and other information resources”
    About.com Web Search
      tools, Web Search Forum

keeping up …
  Information Today
    trade & professional monthly newspaper & web site; industry news; searcher columns; general analyses of trends
  Keeping up through blogosphere:
    Resource Shelf
      blog about internet (and some other stuff) with archive; it has really good and really bad exchanges & threads
    New York Times blogrunner - The annotated NYT
      blog tracking of NYT articles, topics, authors; thread into discussion of many other weblogs; includes net & web topics

Finding links & listings – back to good old books with a new twist
  A number of books on web searching also have sites with the links in the book, updates, news
    Extreme Searcher Randolph Hock
      update of a popular book; links by chapter topics
    The web library Nicholas G. Tomaiuolo
      spotlights free resources, links by chapter and new topics – done by a librarian
    The invisible web Chris Sherman & Gary Price
      original book on the topic, links organized by subject
  p.s. most, but not all, of the sites in this lecture can be found on those sites – and much, much more

Evaluations, ratings
  Evaluating web sites: a prime responsibility of searchers & all information professionals
  Many sources evaluate web sites:
    The Scout Report
      librarians’ BIBLE! Annotations. Comprehensive.
    Medical Library Association
      ten most useful sites for consumer health
    MLA user guide
      for finding & evaluating health information on the web
    Web 100
      commercial, user ranking & evaluation of web sites
    Evaluating web pages UC Berkeley
      tutorial and guide

Needed for Web searching
  Knowledge & competencies on
    variety of web sources & their organization
    search engines
    web search strategies
    search dynamics, feedback
  Keeping up & up & up
    Why? many reasons, such as:
      constant updates, changes, innovations
      many domain/subject specific
      fluidity very high

Needed for web searching by professionals
  Knowledge of SOURCES in area of interest
    search engines not enough
      not too helpful in finding these other sources; structure hard to discern
    find & use specialized sources
  Evaluation of sources
    a key professional skill!
    application of standard criteria & web criteria: authority; accuracy; currency (timeliness); objectivity; coverage, persistence, usability

Needed competencies …
  Knowledge of users & use
  Knowledge of searching
  Use of technology
  Adaptability, flexibility
  Integration with other resources
  Teaching others
  Constant learning & update
    again: keeping up, keeping up, keeping up
    and again: keeping up, keeping up, keeping up

But now really: How to do it?
  [Diagram not preserved; labels: information, WWW]

Images from the invisible web
  [Images not preserved]

P.S. a nice site
  Poem by Emily Dickinson, 1830-1886: In a library
  Who will write a poem: In a digital library ??????

P.S. a few weird or fun sites…
  SelectSmart.com
    all kinds of quizzes for you
  James Dean official web site
  Deaducated
    Dead Librarians’ Society
  Livejournal
    blogs & authoring tools; and many pathetic entries
  Airline meals
    “the world’s first and leading site about nothing but airline food” … some 12,000 pictures from 447 airlines
    it is not weird, but for real and great fun

Sources
About.com Web Search http://websearch.about.com
Academic Info Digital Library http://www.academicinfo.net/digital.html
Airline meals http://www.airlinemeals.net/
All the Web http://www.alltheweb.com/
Ask Jeeves! http://www.ask.com/
Assoc. for Computing Machinery http://www.acm.org/
Bibliothèque Nationale de France http://www.bnf.fr/
BUBL LINK http://bubl.ac.uk/link/
CNET Search.com http://www.search.com/
CiteSeer http://citeseer.nj.nec.com/
CompletePlanet http://completeplanet.com
Deaducated http://www.geocities.com/deadlibrarians/
Digital book index http://www.digitalbookindex.org/about.htm
Digital librarian http://www.digital-librarian.com/
Diotima http://www.stoa.org/diotima/
Dogpile http://www.dogpile.com/
Entrez PubMed http://www.ncbi.nlm.nih.gov/PubMed/
Extreme Searcher http://www.extremesearcher.com/
Free Pint http://www.freepint.com/
Gabriel http://www.kb.nl/gabriel/
Genealogy http://darcisplace.com/darci/search.htm
Hermitage http://www.hermitagemuseum.org/html_En/index.html
Information Please http://www.infoplease.com/
International Children’s Digital Library http://www.icdlbooks.org/
Internet Archive http://www.archive.org/
Internet Public Library, Michigan http://www.ipl.org/
Internet Resources Newsletter. http://www.hw.ac.uk/libwww/irn/
James Dean http://www.jamesdean.com/
Kartoo http://www.kartoo.com/
KIRKE http://www.phil.uni-erlangen.de/~p2latein/ressourc/ressourc.html
Leonardo da Vinci Museum http://www.museoscienza.org/english/
Librarians Index to the Internet http://lii.org/
Live Journal http://www.livejournal.com/
LiveRef http://www.public.iastate.edu/~CYBERSTACKS/LiveRef.htm
Martindale’s The reference Desk http://www.martindalecenter.com/
Mayo Clinic http://www.mayohealth.org/
Medical Library Assoc. ten top sites http://www.mlanet.org/resources/medspeak/topten.html
Medical Library Assoc. user guide for health inf. http://www.mlanet.org/resources/userguide.html
Medscape http://www.medscape.com/
Million Book Project http://www.archive.org/texts/collection.php?collection=millionbooks
Museum of online museums. http://www.coudal.com/moom.php
MuseumStuff http://www.museumstuff.com/
NYT blogrunner http://nytimes.blogrunner.com/
NYT historical project http://www.nyt.ulib.org/
OCLC Web Characterization Project http://wcp.oclc.org/
Open Directory Project http://dmoz.org
Perseus Digital Library http://www.perseus.tufts.edu/
Profusion http://www.profusion.com/
Psychcrawler http://www.psychcrawler.com/
QuestionPoint http://www.questionpoint.org/
ResearchBuzz. http://www.researchbuzz.com/index.shtml
Resource Shelf http://resourceshelf.blogspot.com/
Rutgers Libraries http://www.libraries.rutgers.edu/
RxList http://www.rxlist.com/
Sch of East Eur & Slavonic Studies http://www.ssees.ac.uk/dirctory.htm
Search Engine Colossus http://www.searchenginecolossus.com/
Search Engine Guide http://www.searchengineguide.com/
Search Engine Showdown http://searchengineshowdown.com/
Search Engine Watch http://searchenginewatch.com/
Select Smart.com http://www.selectsmart.com/home.html
Snoopy http://www.snoopy.com/
Stanford Encyclopedia of Philosophy http://plato.stanford.edu/
Surfwax http://www.surfwax.com/
Teoma http://teoma.com/
The invisible Web http://www.invisible-web.net/
The Scout Report. http://scout.cs.wisc.edu/
The Web Library http://www.ccsu.edu/library/tomaiuolon/theweblibrary.htm
Think Quest http://www.thinkquest.org/
Turbo10 http://turbo10.com/
U California Berkeley http://sunsite.berkeley.edu/
U Mich Documents Center http://www.lib.umich.edu/govdocs/
US State department http://www.state.gov/
Virtual Library http://vlib.org
Virtual Reference Desk http://www.loc.gov/rr/askalib/virtualref.html
Vivisimo http://vivisimo.com
Web 100 http://www.web100.com
Webbrain http://www.webbrain.com/html/default_win.html
WebMD http://my.webmd.com/webmd_today/home/default
Wikipedia http://www.wikipedia.org/