Lecture08 Web searching.ppt

Web searching &
the invisible Web
Finding things that are hard
to find
tefkos@rutgers.edu; http://comminfo.rutgers.edu/~tefko/
Tefko Saracevic
Principles of Searching
1
Central ideas
• The Web has a great many, even indispensable, information resources for searching
– many of these are hard to find, forming the “invisible Web”
• A variety of Web resources have to be accessed directly & not through common search engines – as explored here
ToC
1. Invisible Web
2. Invisible Web searching
3. General sources
4. Domain sources
5. Reference sources and services
6. Digital libraries
1. Invisible Web
Definition, characteristics,
reasons
A few definitions
World Wide Web: Internet-connected files; the very large set of linked documents and other files located on computers connected through the Internet and used to access, manipulate, and download data and programs
Invisible (Encarta): hidden from view; not readily noticed or detected
Invisible Web (also deep Web, hidden Web, as opposed to visible Web or surface Web): "Invisible Web" is the term used to describe all the information available on the World Wide Web that is not found by using general-purpose search engines.
What is “Invisible web?”

• Materials that general search engines cannot or will not include in their collection of web pages
– you cannot find them through general search engines
• Contains a vast amount of information resources
– much of it authoritative & of higher quality than the visible web
  – quality becomes a main issue
– much of it specialized
– a lot of it also fluid or streaming or real time
  – “You can’t step in the same river twice”
– much of it free
• Many times larger than the visible Web
Size & characteristics of the invisible Web (a lot from CUNY library)
• Even the best search engines can access only about 16% of the available information on the WWW
– therefore 84% of the information is excluded = Invisible Web
– by another estimate, the Invisible Web is 500 times larger than the Surface Web
• 95% of the Invisible Web is publicly accessible information
• More than half of the Invisible Web resides in topic-specific databases
• Thus, a lot of professional Web searching concentrates on the invisible Web. So will we in this lecture
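The arithmetic behind these size figures can be checked with a small sketch. It assumes only the 16% coverage figure and the "500 times" figure quoted on this slide; the function names are hypothetical. It suggests that the two figures come from separate estimates: 16% coverage implies an invisible part roughly 5 times the visible part, while a 500-fold ratio would imply coverage of about 0.2%.

```python
def invisible_share(coverage: float) -> float:
    """Fraction of the Web NOT reached, given the fraction that is reached."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    return 1.0 - coverage


def invisible_ratio(coverage: float) -> float:
    """How many times larger the unreached part is than the reached part."""
    return invisible_share(coverage) / coverage


def coverage_for_ratio(ratio: float) -> float:
    """Coverage implied when the unreached part is `ratio` times the reached part."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    return 1.0 / (1.0 + ratio)


if __name__ == "__main__":
    coverage = 0.16  # coverage figure quoted on the slide
    print(f"excluded share: {invisible_share(coverage):.0%}")
    print(f"invisible/visible ratio at 16% coverage: {invisible_ratio(coverage):.2f}")
    print(f"coverage implied by a 500x ratio: {coverage_for_ratio(500):.2%}")
```

Either way the practical conclusion of the slide holds: most of the Web is outside what general engines reach.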
in other words…
There is much more to the Web than [logo] or [logo] (search engine logos not preserved)
• “infobesity”
– refers to the belief that searching Google for information provides a junk information diet
– not concerned about the quality
– coined by James Morris, the Dean of the School of Computer Science at Carnegie Mellon University
Brophy & Bawden (2005)
Why search engines do not
cover all?


• Size: web is huge, cannot cover all
• Economics: associated costs are high
– engines support themselves mostly by ads
– some engines have rank per pay & crawl update per pay - providing paid listings first & mostly
• Technical: still a challenge & limited capabilities
– also some file formats hard to cover
• Spam: eliminating bad also loses good
• Restrictions: some sites do not let crawlers in, e.g. login required
• Deep structure: some sites complex
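One way a site can keep crawlers out can be sketched in code. The slide names login requirements; the sketch below uses a different, widely used mechanism as an illustration, a robots.txt file, which is an assumption not taken from the slide. The robots.txt content, the bot name and the example.org address are all invented. It uses Python's standard `urllib.robotparser` module and needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /database/
"""


def crawlable(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a well-behaved crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())  # parse rules from text, no fetching
    return parser.can_fetch(agent, "http://example.org" + path)


if __name__ == "__main__":
    for path in ("/about.html", "/members/profile", "/database/record/42"):
        status = "crawlable" if crawlable(path) else "excluded (invisible to the engine)"
        print(f"{path}: {status}")
```

Pages under the excluded paths may be perfectly public to a person with a browser, yet they never enter a general engine's index.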
Not found …
From: Characteristics of Invisible Web content
Search engine coverage


• Hard (impossible) to discern & compare coverage
• Many national search engines
– have own coverage, orientation, governance
• Many topical or domain search engines
– have own coverage geared to subject of interest
• But: there are many comprehensive sources independent of search engines
– some are search engines in specialized domains
– others are compilations of evaluated web sources
Search engines differ

• Substantial differences among search engines on coverage
– hard to impossible to discern
• Substantial differences in how they work
– relatively easy to find out, e.g. from Search Engine Showdown
  – “The users’ guide to web searching” - run by a librarian, news links, ratings
Be aware: search engines
are not only about search

• Yes, search is (still) their core, but they are in many other businesses built upon search & these affect the what & how of searching for us
• they are corporations, commercial entities
– have to make money, mostly by ads & placements
• but provide many other services
– selling, licensing software
– email, messenger
– add-on utilities – desktop search functions, toolbars …
• most of the additional stuff is provided free
– but there is no such thing as a free lunch - it is about how search engines can get us to continue to use their service
2. Invisible web searching
A few pointers
Invisible web searching:
Basic approach
User
• The first step in determining the best approach for searching the invisible web is to have a clear idea of what you’re seeking
– extensive user modeling
Resources
• Limit your search to appropriate resources & tools for the particular type of information you’re looking for
– know your sources
– know how to find appropriate sources
– shades of “Knowledge is of two kinds…”
Advanced searching on the Web –
applies to searching of the invisible Web as well

• Needs to be adapted to differences
– coverage not specified; vastly different from one source, engine to another
– no controlled vocabulary
– output ranked by unknown methods & criteria for “relevance”
– building blocks may be indicated by “similar pages” or “more from this site” or some such
– some provide clusters to narrow searches
– features, capabilities, specifics differ
Specialized sources -
particularly for the invisible web





• Large scholarly search engines & directories
• Domain sources, databases
• Reference sources
• Libraries as web sources
• Virtual libraries
A few tips for Web searching,
including the invisible kind





• Advanced web searc – Univ. of California, Berkeley
• Four NETS for better searching – Bernie Dodge
• Web search tutorial – Searchenginez
• Finding information: search engines – Phil Bradley
• Google Guide – Nancy Blachman
3. General sources
A selection of a few (of great
many) sources for invisible web
Characteristics

• Many oriented toward scholarly, research & professional, technical & related information
– include sources mostly not covered by general search engines
– majority of these are trustworthy
– quality much higher, some carefully selected, some edited
• Origins vary widely
– from commercial to voluntary to government sponsored
• Popular in many disciplines
Large scholarly search
engines & directories - sample

• Infomine - a comprehensive virtual library and reference tool for academic and scholarly Internet resources, including Web sites, databases
– covers a wide range of scholarly resources by fields
• Scirus – “it allows researchers to search for not only journal content but also scientists' homepages, courseware, pre-print server material, patents and institutional repository and website information.”
– by Elsevier, run in conjunction with Scopus and Science Direct, but this one free
• Google Scholar “Stand on the shoulders of giants” (but Newton and John of Salisbury said it better)
– searches for scholarly articles & resources, but sources not disclosed (no idea on what it covers)
Large edited sites
• Open Directory Project
– large edited catalog of the web – global, run by volunteers
• BUBL LINK
– selected Internet resources covering all academic subject areas; organized by Dewey Decimal System – from UK
Science, scholarship engines,
not free – a sample

• In addition to freely accessible engines, many provide search free but access to full text paid
– by subscription or per item
– RUL provides access to these & many more
• General
– ScienceDirect
  – Elsevier: “world's largest electronic collection of science, technology and medicine full text and bibliographic information” [available at RUL]
• In a specific domain
– ACM Portal
  – Assoc. for Computing Machinery: access to ACM Digital Library & Guide to Computing [available at RUL]
4. Domain sources
Hardly a field without it
Domain engines

• Cover specific subjects & topics
– from sciences, arts, humanities, to various media & interests – you name it
• Important tool for subject searches
– particularly for subject specialists
– valued by professional searchers
• Selection mostly hand-picked rather than by crawlers, following inclusion criteria
– often not readily discernible
– but content more trustworthy
• Usually well organized
in health & related fields …

• PubMed – Nat Library of Medicine
– biomedical literature from MEDLINE & health journals
• Psychcrawler - Amer. Psychological Association
– web index for psychology
• WebMDHealth
– news, medical information
• Rxlist
– The Internet Drug Index
• Mayo Clinic HealthOasis
– health advice
• Kidshealth
– sites for parents, kids, teens
in science …
• Ocean Planet – NASA
– presentation of earth & its vast oceans
• ArXiv – Cornell U, National Science Foundation
– e-print service in the fields of physics, mathematics, computer science, and quantitative biology
– large, non-reviewed contribution by authors, comments later
• Athena Earth Sciences Resources
– not a search engine but a large well organized directory
in education …
• Intute
– “Intute is a free online service providing you with a database of hand selected Web resources for education and research.”
• Think Quest – Oracle Education Foundation
– education resources, programs; web sites created by students
• Resource Discovery Network – UK
– “UK's free national gateway to Internet resources for the learning, teaching and research community”
in images, movies, video …
• Internet Movie Database
– treasure trove of movies
• Picsearch
– picture searching
• Blinkx
– claims to be the world's largest search engine for videos; it has indexed over 32 million hours' worth of video footage, made searchable by automatically transcribing the speech content.
• Moving Images Collections
– “MIC documents moving image collections around the world.” Part of it is particularly oriented toward science educators. Now at Library of Congress, but developed at Rutgers.
in humanities …
• Shakespeare & Internet Search Tools & Resources
– great fun to navigate
• KIRKE - Katalog der Internetressourcen für die Klassische Philologie aus Erlangen
– German; a variety of resources for classics
• Perseus Digital Library – Tufts University
– covers antiquity to renaissance; one of the best subject sites on the web; affected the whole field
• Sch of Slavonic & East European Studies, University College London
– includes country resources, e.g. Croatia
• Diotima
– materials for study of women and gender in the Ancient World
in music …
• Musipedia
– Not everything is text. This is “a searchable, editable, and expandable collection of tunes, melodies, and musical themes.” Great fun!
• All Music Guide
– resource about musicians, albums, and songs
governments …

• U Mich Document Center
– official documents from all over the world
• FirstGov
– the US government official web portal: “Whatever you want or need from the U.S. government”
• US State Department
– about the U.S. & other countries

Evaluations, ratings


• Evaluating web sites: a prime responsibility of searchers & all information professionals
• Many sources evaluate web sites:
– The Scout Report
  – librarians’ BIBLE! Annotations. Comprehensive.
– Medical Library Association
  – ten most useful sites for consumer health
– MLA user guide
  – for finding & evaluating health information on the web
– Web 100
  – commercial, user ranking & evaluation of web sites
– Evaluating web pages – UC Berkeley
  – tutorial and guide
also a domain resource
And, of course …
Snoopy
The Official Peanuts Website
5. References
a few sources & services
Reference trends
Transactions
• Live reference transactions in libraries falling off dramatically
• But new reference modes emerging
– chat, ask a librarian
– cooperative reference among groups of libraries
• Commercial reference growing strong
Tools
• Most, if not all, reference tools have migrated to digital
– general & in many domains
• Many with free online access
• Others licensed to libraries
• Oriented to end users
– but still an important source for searchers
Reference tools –
open access
• Wikipedia
– web encyclopedia in many languages; user generated; very popular; but uneven & entries at times manipulated
• Stanford Encyclopedia of Philosophy
– a comprehensive encyclopedia; authoritative - maintained and kept up to date by experts in the field
• Bartleby.com “Great books online”
– dozens of reference books; Harvard classics; English usage; and more. Amazing, invaluable collection
At RUL 100s of digital reference
sources in all domains
Reference services

• Reference services - several models
• Commercial – relatively new & successful
– Ask (originally known as Ask Jeeves)
  – most popular, commercial
– Information Please
  – almanac type questions
– RefDesk
  – access to a number of reference tools
– ChaCha - new service
  – direct answers to questions routed to any device. “ChaCha’s advanced technology instantly routes it to the most knowledgeable person on that topic in our Guide community”
Cooperative reference &
real time reference
• Digital reference - new service area for libraries
– QuestionPoint – L of Congress & OCLC
  – project for a global 24/7 reference network
– Virtual Reference Desk – L of Congress
  – large compilation of web reference sites
– LiveRef - maintained at Iowa State U
  – a registry of real time digital reference services
6. Libraries & other institutions as web sources
Digital libraries, virtual libraries, museums, good old books
Libraries as web sources

• Academic, national libraries providing open collections & services; models vary
– Rutgers libraries - big long term effort
– University of California, Berkeley
  – a most elaborate effort together with Sun Corporation
– LibWeb by Webjunction, formerly at U California, Berkeley
  – “lists currently over 7900 Web pages from libraries in 146 countries”
– Bibliothèque Nationale de France
  – includes virtual exhibitions, among others
Virtual libraries

• Libraries “living” only on the Web
– Virtual Library
  – Switzerland, US, UK & other countries – ‘oldest virtual library on the Web’
– Internet Public Library – Drexel, formerly at U of Michigan
  – “the first public library of and for the Internet community”
– Librarians Index of the Internet – Drexel
  – very popular and comprehensive - directory
– Digital librarian – maintained by Margaret Vail Anderson, a librarian in Cortland, New York
  – “a librarian's choice of the best of the Web” – large directory
Museums, societies…

• Growing number of resources in museums & a variety of societies – rich resource for searching
– Museum of online museums
  – a delight
– MuseumStuff.com
  – “We have 1000's of museums, zoos, historical societies and related organizations in our database”
– The State Hermitage Museum
  – one of the greatest museums in the world, and one of the best museum sites – developed with IBM help
– National Museum of Science and Technology Leonardo da Vinci
  – Guess where those pictures came from? A delight!
Archiving, books on the web

• Internet Archive – a large undertaking
– includes web archive & lots more publicly available & free
– 10 billion web pages archived from 1996 to a few months ago
– Wayback Machine – search to look at old versions of web pages
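Looking up an old version of a page can be sketched as building a Wayback Machine address. This is a minimal illustration, assuming the commonly seen address pattern of the archive prefix, a 14-digit timestamp and the original page URL; the function name `wayback_url` and the example.com address are hypothetical, and the sketch only builds the address without contacting the archive.

```python
from datetime import datetime

# Assumed address prefix for archived pages.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, when: datetime) -> str:
    """Build the address of the archived copy of page_url nearest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Hypothetical lookup: how did this page look at the start of 2006?
    print(wayback_url("http://example.com/", datetime(2006, 1, 1)))
```

When no capture exists for the exact moment requested, the archive normally shows the nearest one it holds, so the date works as a rough target.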
Digital books on the Web
a sample of large projects

• Books on the web – searchable
– Million Book Project
  – digitizing books and providing free access
– International Children’s Digital Library
  – online children's books
– Digital books Index
  – “‘Meta-index’ for most major eBook sites, along with thousands of smaller specialized sites.”
– Google Book Search
  – large digitization effort; many large libraries cooperate; agreement reached with publishers; connected with Worldcat
Needed for Web searching

• Knowledge & competencies on
– variety of web sources & their organization
– search engines
– web search strategies
– search dynamics, feedback
• Keeping up & up & up
– Why? many reasons, such as:
  – constant updates, changes, innovations
  – many domain/subject specific
  – fluidity very high
Needed for web searching
by professionals

• Knowledge of SOURCES in area of interest
– search engines not enough
  – not too helpful in finding these other sources; structure hard to discern
– find & use specialized sources
• Evaluation of sources
– a key professional skill!
– application of standard criteria & web criteria: authority; accuracy; currency (timeliness); objectivity; coverage; persistence; usability
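The criteria above can be turned into a simple checklist. The sketch below is one possible way to record an evaluation, not an established instrument: the class name, the 0 to 2 rating scale and the sample source are all assumptions made for illustration. Only the seven criterion names come from the slide.

```python
from dataclasses import dataclass, field

# Criteria named on the slide; the 0-2 rating scale is an assumption.
CRITERIA = (
    "authority", "accuracy", "currency", "objectivity",
    "coverage", "persistence", "usability",
)


@dataclass
class SourceEvaluation:
    """Hypothetical record of one searcher's ratings for one web source."""
    name: str
    ratings: dict = field(default_factory=dict)

    def rate(self, criterion: str, score: int) -> None:
        """Record a score of 0 (poor), 1 (mixed) or 2 (good) for a criterion."""
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if score not in (0, 1, 2):
            raise ValueError("score must be 0, 1 or 2")
        self.ratings[criterion] = score

    def missing(self) -> list:
        """Criteria not yet rated, in the order listed on the slide."""
        return [c for c in CRITERIA if c not in self.ratings]

    def total(self) -> int:
        """Sum of the scores recorded so far."""
        return sum(self.ratings.values())


if __name__ == "__main__":
    evaluation = SourceEvaluation("Example subject gateway")  # invented source
    evaluation.rate("authority", 2)
    evaluation.rate("currency", 1)
    print("still to check:", ", ".join(evaluation.missing()))
    print("score so far:", evaluation.total())
```

The list of missing criteria is the useful part: it keeps an evaluation from stopping at the criteria that are easiest to judge.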
Needed competencies …







• Knowledge of users & use
• Knowledge of searching
• Use of technology
• Adaptability, flexibility
• Integration with other resources
• Teaching others
• Constant learning & update
– again: keeping up, keeping up, keeping up
– and again: keeping up, keeping up, keeping up
P.S. a few weird sites…

• SelectSmart.com
– all kinds of quizzes for you
• James Dean official web site
• Deaducated
• Dead Librarians’ Society
• Livejournal
– blogs & authoring tools; and many pathetic entries
But now really: How to do it?
information
WWW
Images
from the invisible web
and of course…