Lecture05 Search engines.ppt

search engines
digital libraries
tefkos@rutgers.edu; http://comminfo.rutgers.edu/~tefko/
Tefko Saracevic
Central ideas
As a searcher you are also using
Search engines
While the structure & basic operation of search engines is similar
• a great number & variety exists beyond Google
 with their own features
 many of them in specialized domains
Digital libraries
They have rich & varied resources of use in
 accessing & searching a variety of databases & reference tools in many domains
 accessing journals for delivery of full texts in all fields
Knowing searching = also knowing these resources
ToC
1. Search engines
2. Digital libraries
1. Search engines
Definitions. How they work. Diversity
dictionary definitions
search
COMPUTING (transitive verb) to examine a computer file,
disk, database, or network for particular information
engine
something that supplies the driving force or energy to a
movement, system, or trend
search engine
a computer program that searches for particular keywords
and returns a list of documents in which they were found,
especially a commercial service that scans documents on
the Internet
about the definition of search engines
• oh well …
search engines do not search only for
keywords, some search for other stuff as
well
• and they are really not “engines” in the classical sense
 but then a mouse is not a “mouse” either
use of search engines
… among others
How Search Engines Work
(Sherman 2003)
[Figure: a crawler fetches pages (URL1 to URL4) from the Web; an indexer builds the search engine database; the browser sends a query (“Eggs?”) and gets back a ranked list of results with relevance percentages.]
how do search engines work? elaboration
• crawlers, spiders: go out to find content
 in various ways go through the web looking for new & changed sites
 periodic, not for each query: no search engine works in real time
 some search engines do it for themselves, others buy content from other companies
 for a number of reasons crawlers do not cover all of the web – just a fraction
 what is not covered is the “invisible web”
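The crawling idea above can be sketched in a few lines. This is only a toy illustration, not how any real engine crawls: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`), the page names are invented, and the crawler simply follows links breadth-first from one seed page. Pages that nothing links to are never reached, which is one simple way an "invisible" part of the web can arise.

```python
from collections import deque

# Hypothetical toy "web": page -> list of pages it links to (invented names).
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["a.example"],  # no page links here, so a crawler never finds it
}

def crawl(seed, web, max_pages=100):
    """Breadth-first crawl from a seed page; returns pages in visit order."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:      # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return visited

covered = crawl("a.example", TOY_WEB)
invisible = sorted(set(TOY_WEB) - set(covered))
print("covered:", covered)
print("not covered:", invisible)
```

A real crawler would also revisit pages periodically to pick up changes, which is why results reflect the last crawl rather than the live web.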
elaboration …
• organizing content: labeling, arranging
 indexing for searching – automatic: keywords and other fields
 arranging by URL popularity, e.g. PageRank as in Google
 classifying as directory: mostly human handpicked & classified
• as a result of different organization we have basically several kinds of search engines:
 search – input is a query that is then searched & results displayed
 directory – classified content – a class is displayed
 fused: directories now also have search capabilities & vice versa
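Automatic keyword indexing is commonly done with an inverted index: a table from each word to the documents that contain it. The sketch below assumes three invented documents (`DOCS`) and the crudest possible tokenization (lowercase, split on spaces). Real engines add many more fields and much more processing.

```python
from collections import defaultdict

# Hypothetical documents; a real engine would index crawled pages.
DOCS = {
    1: "fresh eggs for sale",
    2: "all about eggs and egg recipes",
    3: "search engines and digital libraries",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_index(DOCS)
print(sorted(index["eggs"]))  # documents that contain the keyword "eggs"
```

Looking up a keyword is then a single dictionary access instead of a scan through every document.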

elaboration (cont.)
• databases, caches: storing content
 humongous files usually distributed over many computers
• query processor: searching, retrieval, display
 takes your query as input: engines have differing rules for how it is handled
 displays ranked output: some engines also cluster output and provide visualization
• at the other end is your browser
 in addition to Explorer a number of others exist: Mozilla Firefox for instance – became quite popular
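A query processor can be pictured as a function that takes a query and returns a ranked list. The sketch below is an assumption-laden toy: the pages in `DOCS` are invented, and the score is just how often the query words occur in a page. Actual engines use their own, mostly secret, ranking rules.

```python
from collections import Counter

# Hypothetical stored pages (invented URLs and text).
DOCS = {
    "eggs.example": "eggs eggs eggs recipes",
    "eggo.example": "all about eggo and eggs",
    "other.example": "green ham",
}

def search(query, docs):
    """Return URLs ranked by how many times the query words occur."""
    terms = query.lower().split()
    scored = []
    for url, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:                      # pages with no match are not shown
            scored.append((score, url))
    # Highest score first; ties broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]

results = search("eggs", DOCS)
print(results)
```

Changing the scoring rule changes the order of results, which is one reason the same query gives different output on different engines.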
elaboration…
similarities, differences
• all search engines have these basic parts in
common
• BUT the actual processes – methods how they do it – are based on various algorithms & they differ
 most are proprietary with details kept secret, but based on well known principles from information retrieval or classification
 to some extent Google is an exception – they published their original method, but nothing further
case of Google
• developed by Sergey Brin and Lawrence Page while students at Stanford
 in the beginning ran on Stanford computers
• basic approach has been described in their famous paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine”
 well written, simple language, has their pictures
 in the acknowledgements they cite the support of NSF’s Digital Library Initiative, i.e. initially Google came out of government sponsored research
 describes their method PageRank, based on ranking hyperlinks as in citation indexing
 “We chose our system name, Google, because it is a common spelling of googol, or 10^100”
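The link-based ranking idea can be illustrated with a small power-iteration sketch. This is a common textbook formulation of PageRank, not Google's production method: the four-page link graph is invented, the damping factor 0.85 and the iteration count are assumptions, and ranks are normalized to sum to 1.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict: page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank, in equal shares, to pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: every linked page must also appear as a key.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up on top because three pages link to it, while D, which nobody links to, gets only the small baseline share. This is the citation-indexing analogy: being cited by others counts.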
coverage differences
• no engine covers more than a fraction of WWW
 estimates: none more than 16%
 hard (even impossible) to discern & compare coverage, but they
differ substantially in what they cover
• in addition:
 many national search engines
 own coverage, orientation, governance
 many specialized or domain search engines
 own coverage geared to subject of interest
 many comprehensive sources independent of search engines
 some have compilations of evaluated web sources
searching differences
• substantial differences among search engines on searching, retrieval, display
 need to know how they work & differ with respect to:
 defaults in searching a query
 searching of phrases, case sensitivity, categories
 searching of different fields, formats, types of resources
 advanced search capabilities and features
 possibilities for refinement, using relevance feedback
 display options
 personalization options

• Greg Notess’ chart & features describe
differences
business model differences
several business models
• public good - have independent budget
 e.g. PubMed, Librarians’ Index to Internet
• earn revenue from provision of information
 all commercial search engines
• using search engines to promote their other activities
 e.g. telephone directories
sponsorship differences
• need to understand treatment of sponsorship – it influences what engines search & how they display results
 some list separately results from sponsored sites
so you are reasonably clear what is there - what
is sponsored & not
 some have display-per-pay - showing first sites
that paid most & do not even tell you that
 some have pay per update of sites
• imperative to find sources that explain these models for different engines to know what is covered & what you are getting
limitations
• every search engine has limitations as to
 coverage: meta engines just follow coverage limitations & have more of their own – have to be careful in their use
 search capabilities
 finding quality information
• some have compromised search with economics
 becoming little more than advertisers
• but search engines are also many times victims of spamdexing
 affecting what is included and how ranked
spamming a search engine
• use of techniques that push rankings higher than they belong is called spamdexing
 methods typically include textual as well as link-based techniques
 like e-mail spam, search engine spam is a form of adversarial information retrieval: the conflicting goals of accurate results for search providers & high positioning for content providers
• search engines are constantly battling this
with their own special (& secret) tools
search engine features, reviews, tutorials
• Search Engine Showdown
 lists, reviews, follows search engines, blog – look at Chart
 by Greg Notess (librarian) – book Teaching Web Search Skills has live links
• Recommended search engines by UC Berkeley
 library workshop; lists features, evaluates
• Search Basics: Web Search Essentials
 among others, has a large section on search engines
• Search features chart
 with explanations
how to find a search engine?
• resources that list or categorize engines
Search Engine Guide
engines categorized by topic; other engine information
Search Engine Colossus
 international directory of search engines by country, topic from 351
countries and territories; engines in many languages
Phil Bradley’s country based search engines
“currently a total of 4,017 search engines and 222 countries,
territories, islands and regions”
all questions are not created equal
• what engine, what resource to use for what
kind of question or information need?
 An exhaustive classification in:
Finding information: search engines by Phil Bradley
 Sources for different topics:
Choose the Best Search for Your Information
Need by NoodleTools
 List of capabilities for major search engines:
Best Search Tools Chart by Infopeople
meta search engines
• meta engines search multiple engines
getting combined results from a variety of
engines
• do not have their own databases
but have their own business models
affecting results
• a number of techniques used
interesting ones: clustering, statistical
analyses
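How a meta engine might combine results can be sketched with reciprocal rank fusion, a simple published merging technique. It is used here purely as an illustration; nothing in these slides says that any of the meta engines named use it. The engine names, result lists and the constant `k=60` are assumptions.

```python
from collections import defaultdict

# Hypothetical ranked results returned by three underlying engines.
ENGINE_RESULTS = {
    "engine_a": ["x.example", "y.example", "z.example"],
    "engine_b": ["y.example", "x.example", "w.example"],
    "engine_c": ["y.example", "z.example"],
}

def fuse(results_by_engine, k=60):
    """Merge ranked lists: each engine adds 1 / (k + position) to a URL's score."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in results_by_engine.items():
        for position, url in enumerate(urls, start=1):
            scores[url] += 1.0 / (k + position)
            sources[url].append(engine)   # remember which engines returned it
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return [(url, sources[url]) for url in ranked]

merged = fuse(ENGINE_RESULTS)
for url, engines in merged:
    print(url, "from", ", ".join(engines))
```

Keeping the source engines next to each result shows the overlap between engines, much as some meta engines display where each hit came from.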
sample of meta engines
- with organized results
Dogpile
results from a number of leading search engines; gives
source, so overlap can be compared; has SearchSpy listing searches that were performed
Surfwax
gives text sources & linking to sources; for some terms
gives related terms to focus
Turbo10
provides results in clusters; engines searched can be
edited
Clusty
results grouped by topics or clusters for further sources
meta search engines (cont.)
• large directory

Complete Planet
directory of over 70,000 databases & specialty engines; classified
• results with graphical displays
Kartoo
results in display by topics of query
• new kid on the block
Cuil
(not a meta engine, but a search engine)
Claim: “Cuil searches more pages on the Web than anyone
else—three times as many as Google and ten times as many
as Microsoft”. Well … I do not know if it holds.
multilingual
• English still the major language
 but declining, now slightly over 50%
• multilingual retrieval search engines
 Euroseek

searches in a number of languages
 All the Web

results in 45 languages
where to find out?
• information about search engines in sources
that have updates, news, tips for searching
and more – a MUST for searchers :

Search Engine Watch


ratings, news, statistics, charts, explanations, tutorials
Search Engine Showdown

“The users’ guide to web searching” - run by a librarian, news
links, ratings
Virtual Chase
a site about “Teaching Legal Professionals How To Do
Research” - this section has very good tips and links for
consideration of quality on the web
where? ….
SiteLines
a blog, written by Rita Vine, a professional librarian, &
web search trainer; many evaluations in archive
ResourceShelf
“Resources and News for Information Professionals,”
edited by Gary Price, a librarian & author of Invisible
Web – has extensive archive
WebsearchAbout
not evaluative, but provides news, capabilities, sources,
articles about web searching
art of searching search engines
part 2: digital libraries
definition
• digital libraries are viewed from several perspectives
 technical: “Digital library is a managed collection of
information, with associated services, where
information is stored in digital format and accessible
over a network.” (Arms, 2000)
 institutional: “Digital libraries are organizations that
provide the resources, including the specialized staff,
to select, structure, offer intellectual access to,
interpret, distribute, preserve the integrity of, and
ensure the persistence over time of collections of
digital works so that they are readily and
economically available for use by a defined
community or set of communities.” (Waters, 1998)
a bit of context
• digital libraries have a short but volatile history
 research & development took off by the start/mid 1990s
 in the next decade phenomenal growth worldwide
 large investment in research, development, keeping up
• number of communities involved
 computer science, primarily in research
 library & information science: operations, studies of
users, use, usability
 many subjects: digital libraries in their domain
• diversity is large
 many institutions, e.g. museums, developed their own
libraries & digital resources
• libraries (particularly research, academic & special)
invested massive & ongoing funding toward
 electronic journals
 databases
 reference sources
 digitization of parts of collection
(RUL has substantial holdings & expenditures in all of these)
• thus becoming in effect digital libraries – or more
accurately hybrid libraries
 with graphic and digital versions or types of resources
emphasis here
• on large academic or research digital
libraries that also are related to searching
including provision of
 search capabilities & access to databases
 electronic journals that provide full text of articles
after a search
 digital reference sources
• such libraries have also become search portals of sorts, essential for their users
 in education, research & related activities
sample
New York Public Library Digital Collections
A gateway to rare and unique collections in digitized form & to databases.
Access to most searchable databases requires library card number
U California Berkeley Digital Library SUNsite
digital collections and services
The British Library
“The world’s knowledge.” Includes “Services for library and information
Professionals.”
Los Angeles Public Library Kids’ Path
resources for children; search through directory
sample …
New Zealand Digital Library
searching of a number of digital collections, incl. humanitarian and UN
collections; provision of free software for digital libraries
Public Library of Science
“PLoS is a nonprofit organization of scientists and physicians committed
to making the world's scientific and medical literature a public
resource.” Publishes open access journals
Closer to home: New Brunswick Free Public Library
has online resources, databases (some require library PIN), historical
archives and more
example of great many public libraries that have databases for
searching
Rutgers libraries – digital components
• strategic planning in developing digital access
• rich & complex content of digital resources
 several hundred indexes & databases for searching
 some 20,000 electronic journals
 thousand & more digital reference sources
 subject research guides
 Searchpath & other tutorials
 electronic reserve
• affected teaching, learning, research by the
whole community
some critical issues for searching
• no way yet to do effective federated
searching in digital libraries (to search several
indexes at the same time)
 RUL has Searchlight – searches only 8 major databases
 each source has to be searched separately

most have very different search features, capabilities
• finding items in indexes does not mean that you are always able to get the full text
• thus, searching is time-consuming, chaotic
where to find out?
• information about digital libraries for searching
LibWeb Webjunction formerly U California, Berkeley
“lists currently over 7900 pages from libraries in over 146 countries”
Digital Library Federation
“a consortium of libraries and related agencies that are pioneering the
use of electronic-information technologies to extend their collections
and services”
D-Lib Magazine
“a solely electronic publication with a primary focus on digital library
research and development, including but not limited to new
technologies, applications, and contextual social and economic issues”
where? …
Ariadne (UK)
“to report on information service developments and
information networking issues worldwide, keeping the busy
practitioner abreast of current digital library initiatives”
Journal of Digital Information
“Publishing papers on the management, presentation and uses
of information in digital environments”
Tool Kit for the Expert Web Searcher
one of the wikis by Library Information and Technology
Association, a division of the American Library
Association
Expert Web Search Tips
one of many informative articles from the Living Internet
in conclusion
• search engines are great but you have to
KNOW what is under the hood
 as to coverage, business model, search features,
outputs …
 they are NOT for every kind of information need
• digital libraries are great for searching but
you have to KNOW requirements for
searching different resources that are
included
 as yet federated searching is limited
art of searching digital libraries
more
and rewards …