Lecture04 Information retrieval.ppt

Information retrieval (IR)
Basics, models, interactions
tefkos@rutgers.edu; http://comminfo.rutgers.edu/~tefko/
Tefko Saracevic
Central ideas
• Information retrieval (IR) is at the heart of ALL
indexing & abstracting databases, information
resources, and search engines
– all work on basis of IR algorithms and procedures
• Contemporary IR is also interactive – to such a
degree that pragmatically IR cannot be separated
from interaction
• As a searcher you will constantly use IR, thus you
have to be knowledgeable about it
ToC
1. Information retrieval (IR)
2. Matching algorithms: Exact match & best match
3. Strengths & weaknesses
4. IR Interaction & interactive models
1. Information retrieval
Definitions. Traditional model
Information retrieval (IR)
- original definition
Calvin Mooers (1919-1994) coined the term
“Information retrieval embraces the intellectual aspects
of the description of information and its specification
for search, and also whatever
systems, techniques,
or machines are employed
to carry out the operation.”
Mooers, 1951
IR:
Objective & problems
Objectives:
Provide users with effective access to & interaction
with information resources.
Retrieve information or information objects that are relevant
Problems addressed:
1. How to organize information intellectually?
2. How to specify search & interaction intellectually?
3. What systems & techniques to use for those
processes?
IR models
• Model depicts, represents what is involved - a choice of
features, processes, things for consideration
• Several IR models used over time
– traditional: oldest, most used, shows basic elements involved
– interactive: more realistic, favored now, shows also
interactions involved; several models proposed
• Each has strengths, weaknesses
• We start with traditional model to illustrate many points
- from general to specific examples
Description of traditional IR model
• It has two streams of activities
– one is the systems side with processes performed by the system
– other is the user side with processes performed by users &
intermediaries (you)
– these two sides led to “system orientation” & “user orientation”
– in system side automatic processing is done; in user side human
processing is done
• They meet at the matching process
– where the query is fed into the system and system looks for
documents that match the query
• Also feedback is involved so that things change based on
results
– e.g. query is modified & new matching done
Traditional IR model
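• System side:
– Acquisition (documents, objects) → Representation (indexing, ...) → File organization (indexed documents) → Matching (searching)
• User side:
– Problem (information need) → Representation (question) → Query (search formulation) → Matching (searching)
• Matching → Retrieved objects
• Feedback: results feed back into earlier steps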
Acquisition
system side
• Content: What is in databases
– In Dialog first part of blue sheets: File Description, Subject
Coverage; in Scopus Subject Areas
• Selection of documents & other objects from various
sources - journals, reports …
– In Blue Sheets: Sources; in Scopus Sources
• Mostly text based documents
– Full texts, titles, abstracts ...
– But also: data, statistics, images (e.g. maps, trade marks) ...
Importance: Determines contents of databases
Key to file selection in searching !!!
Representation
of documents, objects …
system side
• Indexing – many ways :
– free text terms (even in
full texts)
– controlled vocabulary thesaurus
– manual & automatic
techniques
• Abstracting; summarizing
• Bibliographic description:
– author, title, sources, date…
– metadata
• Classifying, clustering
• Organizing in fields & limits
– in Dialog: Basic Index, Additional
Index. Limits
– in Scopus pull down menus
Basic to what is available for searching & displaying
File organization
system side
As mentioned:
• Sequential
– record (document) by record
• Inverted
– term by term; list of records under each term
• Combination: indexes inverted, documents
sequential
• When citation retrieved only, need for document
files or document delivery
Enables searching & interplay between types of files
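A minimal Python sketch of the two file organizations named above. The three sample records and the function name build_inverted_file are made up for illustration; a real system would also handle fields, stop words and word variants.

```python
from collections import defaultdict

# Sequential file: records stored one after another (record number -> text).
# These sample records are hypothetical.
sequential_file = {
    1: "digital library services",
    2: "library catalog design",
    3: "digital image archives",
}

def build_inverted_file(records):
    """Return a mapping: term -> sorted list of record numbers containing it."""
    inverted = defaultdict(set)
    for record_id, text in records.items():
        for term in text.lower().split():
            inverted[term].add(record_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

inverted_file = build_inverted_file(sequential_file)
print(inverted_file["digital"])  # [1, 3]
print(inverted_file["library"])  # [1, 2]
```

In the combination described above, a search would first consult the inverted file for record numbers and then fetch those records from the sequential file.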
Problem
user side
• Related to user’s task, situation, problem at hand
– vary in specificity, clarity
• Produces information need
– ultimate criterion for effectiveness of retrieval
• how well was the need met?
• Inf. need for the same problem may change, evolve,
shift during the IR process - adjustment in searching
– often more than one search for same problem over time
• you will experience this in your term project
Critical for examination in interview
Representation
question – user side
Non-mediated: end user alone
Mediated: intermediary + user
– interviews; human-human interaction
• Question analysis
– selection, elaboration of
terms
– various tools may be
used
• thesaurus, classification
schemes, dictionaries,
textbooks, catalogs …
• Focus toward
– deriving search terms &
logic
– selection of files,
resources
• Subject to feedback
changes
• Critical roles of
intermediary - you
Determines search specification - a dynamic process
Query
search formulation – user side
• Translation into systems
requirements & limits
– start of human-computer
interaction
• Selection of files,
resources
• Search strategy - selection
of:
– search terms & logic
– possible fields, delimiters
– controlled & uncontrolled
vocabulary
– variations in tactics
• Reiterations from feedback
– several feedback types: relevance
feedback, magnitude feedback ...
– query expansion & modification
What & how of actual searching
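One simple way query expansion from relevance feedback could work is sketched below in Python. This is only an illustration under stated assumptions: the function expand_query, its parameters and the sample documents are hypothetical, and the method (add the most frequent new terms from documents the user judged relevant) is one of many possible approaches, not the one used by any particular system.

```python
from collections import Counter

def expand_query(query_terms, relevant_docs, n_new_terms=1):
    """Add the most frequent terms, not already in the query,
    taken from documents the user judged relevant."""
    counts = Counter()
    for text in relevant_docs:
        for term in text.lower().split():
            if term not in query_terms:
                counts[term] += 1
    new_terms = [term for term, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms

# Hypothetical documents judged relevant after a first search.
judged_relevant = [
    "digital library metadata standards",
    "metadata for digital library collections",
    "library metadata quality",
]
expanded = expand_query(["digital", "library"], judged_relevant)
print(expanded)  # ['digital', 'library', 'metadata']
```

The expanded query is then matched again, which is the reiteration from feedback listed above.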
Matching
searching – system side
• Process of comparing
– search: what documents in
the file match the query as
stated?
• Various search algorithms:
– exact match - Boolean
• still available in most, if not all systems
– best match - ranking by relevance
• increasingly used e.g. on the web
– hybrids incorporating both
• e.g. Target, Rank in Dialog
• Each has strengths, weaknesses
– no ‘perfect’ method exists
• and probably never will
Involves many types of search interactions & formulations
Retrieved objects
from system to user
• Various order of output:
– sorted by Last In First Out (LIFO)
– ranked by relevance & then LIFO
– ranked by other characteristics
• Various forms of output
– In Dialog: Output options
– in Scopus title (default), abstract + references, cited by, plus more
• When citations only available: possible links to document delivery
– Scopus View at publisher
– accessing RUL for digital journals
• Base for relevance, utility evaluation by users
What a user (or you) sees, gets, judges – can be specified
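The output orders above can be illustrated with a short Python sketch. The records, their dates and their scores are invented for the example; how a real system stores entry dates or computes relevance scores is not specified here.

```python
from datetime import date

# Hypothetical retrieved records: date added to the file and a relevance score.
records = [
    {"id": 1, "added": date(2008, 3, 1), "score": 0.40},
    {"id": 2, "added": date(2009, 1, 15), "score": 0.90},
    {"id": 3, "added": date(2009, 6, 30), "score": 0.40},
]

# Last In First Out: the most recently added record comes first.
lifo = sorted(records, key=lambda r: r["added"], reverse=True)

# Ranked by relevance, with ties broken by LIFO.
by_relevance_then_lifo = sorted(
    records, key=lambda r: (r["score"], r["added"]), reverse=True
)

print([r["id"] for r in lifo])                    # [3, 2, 1]
print([r["id"] for r in by_relevance_then_lifo])  # [2, 3, 1]
```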
2. Matching algorithms
Exact match & best match searches
Exact match - Boolean search
• You retrieve exactly what you ask for in the query:
– all documents that have the term(s) with logical
connection(s), and possible other restrictions (e.g. to be
in titles) as stated in the query
– exactly: nothing less, nothing more
• Based on matching following rules of Boolean
algebra, or algebra of sets
– ‘new algebra’
– presented by circles in Venn diagrams
Boolean algebra:
operates on sets of documents
• Has four operations (like in algebra):
1. A: retrieves set that has term A
• I want documents that have the term library
2. A AND B: retrieves set that has terms A and B
• often called intersection & labeled A ∩ B
• I want documents that have both terms library and digital someplace within
3. A OR B: retrieves set that has either term A or B
• often called union and labeled A ∪ B
• I want documents that have either term library or term digital someplace within
4. A NOT B: retrieves set that has term A but not B
• often called negation and labeled A – B
• I want documents that have term library but if they also have term digital I do not want those
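The four operations map directly onto Python's built-in set type. In this sketch the document numbers are hypothetical; each set stands for the documents that contain the given term.

```python
# Hypothetical sets of document numbers containing each term.
library = {1, 2, 4, 7}
digital = {2, 3, 7, 9}

only_term = library                      # A
both = library & digital                 # A AND B (intersection)
either = library | digital               # A OR B (union)
library_not_digital = library - digital  # A NOT B (set difference)

print(sorted(both))                 # [2, 7]
print(sorted(either))               # [1, 2, 3, 4, 7, 9]
print(sorted(library_not_digital))  # [1, 4]
```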
Potential problems
• But beware:
– digital AND library will retrieve documents that have digital
library (together as a phrase) but also documents that have
digital in the first paragraph and library in the third section, 5
pages later, and it does not deal with digital libraries at all
• thus in Scopus or Google you will ask for “digital library” and in Dialog
for digital(w)library to retrieve the exact phrase digital library
– digital NOT library will retrieve documents that have digital
and suppress those that along with digital also have library,
but sometimes those suppressed may very well be relevant.
Thus, NOT is also known as the “dangerous operator”
– also beware of order: venetian AND blind will retrieve
documents that have venetian blind and also that have blind
venetian (oldest joke in information retrieval)
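The difference between AND and an exact phrase can be sketched in Python by recording word positions. The helper names and sample texts are hypothetical, and the adjacency test only imitates the idea behind a phrase search; it is not how Dialog, Scopus or Google implement it.

```python
def positions(text):
    """Map each term to the list of word positions where it occurs."""
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, []).append(pos)
    return index

def matches_and(text, a, b):
    """True if both terms occur anywhere in the text, in any order."""
    idx = positions(text)
    return a in idx and b in idx

def matches_phrase(text, a, b):
    """True only if term b occurs immediately after term a."""
    idx = positions(text)
    return any(p + 1 in idx.get(b, []) for p in idx.get(a, []))

doc1 = "the digital library opened today"
doc2 = "digital cameras were donated to the town library"

print(matches_and(doc1, "digital", "library"), matches_phrase(doc1, "digital", "library"))  # True True
print(matches_and(doc2, "digital", "library"), matches_phrase(doc2, "digital", "library"))  # True False
print(matches_and("blind venetian", "venetian", "blind"))     # True
print(matches_phrase("blind venetian", "venetian", "blind"))  # False
```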
Boolean algebra depicted in Venn
diagrams
Four basic operations:
e.g. A = digital B = libraries
[Venn diagrams: two overlapping circles A and B; region 1 = A only, region 2 = overlap of A and B, region 3 = B only]
• A alone. All documents that have A. Shade 1 & 2. digital
• A AND B. Shade 2. digital AND libraries
• A OR B. Shade 1, 2, 3. digital OR libraries
• A NOT B. Shade 1. digital NOT libraries
Venn diagrams … cont.
Complex statements allowed e.g.
[Venn diagram: three overlapping circles A, B and C with regions numbered 1 to 7]
• (A OR B) AND C. Shade 4, 5, 6
(digital OR libraries) AND Rutgers
• (A OR B) NOT C. Shade what?
(digital OR libraries) NOT Rutgers
Venn diagrams cont.
• Complex statements can be made
– as in ordinary algebra e.g. (2+3)x4
• As in ordinary algebra: watch for parentheses:
– 2+(3 x 4) is not the same as (2+3)x4
– (A AND B) OR C is not the same as A AND (B OR C)
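A quick check with Python sets shows that the grouping changes the result. The three sets of document numbers are hypothetical.

```python
# Hypothetical sets of document numbers for three terms.
A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 5}

left = (A & B) | C    # (A AND B) OR C
right = A & (B | C)   # A AND (B OR C)

print(sorted(left))   # [2, 3, 5]
print(sorted(right))  # [2, 3]
print(left == right)  # False
```

Search systems have their own rules for the order in which operators are applied when parentheses are omitted, so writing the parentheses explicitly is the safer habit.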
Adding variations to Boolean searches
• digital AND libraries can be specified to appear in
given fields as present in the given system
– e.g. to appear in titles only
• in Dialog command is s digital AND libraries/TI
• in Scopus pull down menu allows for selection of given field, – so
for digital library specify Article Title in pull down menu
• in Google Advanced Search gets you to a pull down menu for
Where your keywords show up: & then go to in the title of the page
• Various systems have different ways to retrieve
singular and plurals for the same term
• in Scopus term library will retrieve also libraries & vice versa
• in Dialog you have to specify librar? to retrieve variants
• in Google library retrieves library but not libraries
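Truncation such as librar? can be pictured as prefix matching against the terms in the index. The Python sketch below only imitates that idea; the function name and the small vocabulary are hypothetical, and it does not reproduce the actual behavior of Dialog or any other system.

```python
def truncation_search(stem, vocabulary):
    """Return all index terms that begin with the given stem."""
    return sorted(term for term in vocabulary if term.startswith(stem))

# Hypothetical index vocabulary.
vocabulary = {"librarian", "libraries", "library", "liberty", "digital"}

print(truncation_search("librar", vocabulary))
# ['librarian', 'libraries', 'library']
```

Note that the stem also pulls in librarian, which may or may not be wanted; a shorter stem such as lib would retrieve liberty as well.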
Best match searching
• Output is ranked
– it is NOT presented as a Boolean set but in some rank order
• You retrieve documents ranked by how similar (close)
they are to a query (as calculated by the system)
– similarity assumed as relevance
– ranked from highest to lowest relevance to the query
• mind you, as considered by the system
• you change the query, system changes rank
– thus, documents as answers are presented from those that
are most likely relevant downwards to less & less likely
relevant as determined by a given algorithm
– remember: a system algorithm determines relevance ranking
Best match ... cont.
• Best match process deals with PROBABILITY:
• what is the probability that a document is relevant to a query?
– compares the set of query terms with the sets of terms in documents
– calculates a similarity between query & each document based on common terms &/or other aspects
– sorts the documents in order of similarity
– assumes that the higher ranked documents have a higher probability of being relevant
– allows for cut-off at a chosen number e.g. the first 20 documents
• BIG issue: What representation & similarity
measures are better? Subject of IR experiments
– “better” determined by a number of criteria, e.g. relevance, speed …
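The steps of the best match process can be sketched in Python as follows. The similarity measure chosen here (the Jaccard coefficient: shared terms divided by all distinct terms) is an assumption made for illustration, as are the function names and sample documents; actual systems use many different measures.

```python
def similarity(query_terms, doc_terms):
    """Jaccard coefficient: shared terms divided by all distinct terms."""
    union = query_terms | doc_terms
    return len(query_terms & doc_terms) / len(union) if union else 0.0

def best_match(query, documents, cutoff):
    """Score every document against the query, sort, and cut off."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        score = similarity(query_terms, set(text.lower().split()))
        scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:cutoff]

# Hypothetical documents.
documents = {
    1: "digital library",
    2: "digital library services for students",
    3: "public library hours",
    4: "garden tools",
}

ranked = best_match("digital library", documents, cutoff=3)
print(ranked)  # [(1, 1.0), (2, 0.4), (3, 0.25)]
```

Changing the similarity function changes the ranking, which is why the choice of representation and similarity measure is the subject of IR experiments.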
Best match (cont.)
• Variety of algorithms (formulas) used to determine
similarity
– using statistical &/or linguistic properties
• e.g. if digital appears a lot of times in a given document relative to its
size, that document will be ranked higher when the query is digital
– many proposed & tested in IR research
– many developed by commercial organizations
• Google also uses calculations as to number of links to/from a
document & other methods
• many algorithms are now proprietary & not disclosed
– the way a system ranks and you rank may not necessarily be in
agreement
• Web outputs are mostly ranked
– but Dialog allows ranking as well, with special commands
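The statistical example above (a term appearing often relative to document size) can be sketched in Python as a relative term frequency. The function name and the sample documents are hypothetical, and this is far simpler than the proprietary algorithms mentioned above.

```python
def relative_frequency(term, text):
    """Occurrences of the term divided by the number of words in the text."""
    words = text.lower().split()
    return words.count(term) / len(words) if words else 0.0

# Hypothetical documents.
documents = {
    "a": "digital digital archives",
    "b": "digital collections in a large public library system",
    "c": "library hours",
}

ranking = sorted(
    documents,
    key=lambda doc_id: relative_frequency("digital", documents[doc_id]),
    reverse=True,
)
print(ranking)  # ['a', 'b', 'c']
```

Document a ranks first because digital makes up two of its three words, while in document b it is one word of eight.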
3. Strengths & weaknesses
Best vs. exact match
Traditional IR model
Boolean vs. best match
• Boolean
– allows for logic
– provides all that has been
matched
BUT
– has no particular order of
output – usually LIFO
– treats all retrievals equally from the most to least
relevant ones
– often requires examination
of large outputs
• Best match
– allows for free terminology
– provides for a ranked output
– provides for cut-off - any
size output
BUT
– does not include logic
– ranking method (algorithm)
not transparent
• whose relevance?
– where to cut off?
Strengths of traditional IR model
• Lists major components in both system & user
branches
• Suggests:
– What to explain to users about system, if needed
– What to ask of users for more effective searching (problem
...)
• Aids in selection of component(s) for concentration
– mostly ever better representation
• Provides a framework for evaluation of (static) aspects
Weaknesses
• Does not address nor account for interaction &
judgment of results by users
– identifies interaction with matching only
– interaction is a much richer process
• Many types of & variables in interaction not
reflected
• Feedback has many types & functions - also not
shown
• Evaluation thus one-sided
IR is a highly interactive process - thus additional model(s)
needed
4. IR interaction
Models. Implications: what
happens in searching?
Enters interaction
There is MUCH more to searching than knowing
computers, networks & commands, as there
is more to writing than knowing word
processing packages
IR as interaction
• If we consider USER & USE central, then:
Interaction is a dominant feature of contemporary IR
• Interaction has many facets:
– with systems, technology
– with documents, texts viewed/retrieved
– intermediaries with people
• Several interactive IR models
– none as widely accepted as traditional IR model
• Broader area: human-computer interaction (HCI)
studies
HCI: broader concepts
“Any interaction takes place through one or more
interfaces & involves two or more participants who
each have one or more purposes for the interaction”
Storrs, 1994
• Participants: people & ‘computer’ (everything in it –
software, hardware, resources …)
• Interface: a common boundary
• Purposes: people have purposes and ‘computer’ has
purposes built in
• At issue: identification of important aspects, roles of
each
HCI … definitions
“Interaction is the exchange of information between
participants where each has the purpose of using
the exchange to change the state of itself or of
one or more of others”
“An interaction is a dialogue for the purpose of
modifying the state of one or more participants”
• Key concepts: exchange, change
– for user: change the state of knowledge related to a given
problem, tasks, situation
IR interaction is ...
“... the interactive communication processes that occur
during the retrieval of information by involving all the
major participants in IR, i.e. the user, the intermediary,
and the IR system.”
Ingwersen, 1992
• Involved:
– users
– intermediaries (possibly)
– everything in IR system
– communication processes - exchange of
information
Questions
• What variables are involved in interaction?
– models give lists
• How do they affect the process? How to control?
– experiments, experience, observation give answers
• Do given interventions (actions) or communications
improve or degrade the process?
– e.g. searcher’s (intermediaries or end-users) actions
• Can systems be designed so that searcher’s
intervention improves performance?
Interactive IR models
• Several models proposed
– none as widely accepted as the traditional IR model
• They all try to incorporate
– information objects (“texts”)
– IR system & setting
– interface
– intermediary, if present
– user’s characteristics
• cognitive aspects; task; problem; interest; goal; preferences ...
– social environment
– variety of processes between them all.
User modeling
(treated in unit 11, but introduced here to illustrate one of the
important aspects of human-human interaction)
• Identifying elements about a user that impact interaction,
searching, types of retrieval …:
– who is the user (e.g. education)
– what is the problem, task at hand
– what is the need; question
– how much s/he knows about it
– what it will be used for
– how much wanted, how fast
– what environment is involved
• Much more than just analyzing a question posed by user
– related to reference interview
• Used to select resources, specify search concepts and terms,
formulate query, select format and amount of results provided,
follow up with feedback and reiteration, change tactics …
Three interactive models
• Three differing models are presented here, each
concentrates on a different thing:
– Ingwersen concentrates on enumeration of general
elements that enter in interaction
– Belkin on different processes that are involved as
interaction progresses through time
– Saracevic on strata or levels of interaction elements on
computer and user side
• As mentioned, no one interaction model is as widely
accepted as the traditional IR model
Ingwersen’s interactive cognitive model
• Among the first to view IR differently from traditional
model
• Included IR as a system but concentrates also on
elements outside system that interact
– inf. objects – documents, images …
– intermediary – you - & interface
– user cognitive aspects
– user & general environment
– path of request (we call question)
• from environment (problem) to query
– path of cognitive changes
– path of communication
– various other paths of interactions
Ingwersen’s model graphically
[Figure: components shown are Information objects, Interface/Intermediary, Query, IR system setting, User’s cognitive space, Environment and Request; the connections shown are cognitive transformations and interactive communication]
Belkin’s episodes model
• Concentrates on what happens in interaction as process
– Ingwersen concentrated on elements
• Viewed interaction as a series of episodes where a
number of different things happen over time
– depending on user’s goals, tasks
• there is judgment, use, interpretation…
– processes of navigation, comparison, summarization …
– involving different aspects of information & inf. objects
• While interacting we do diverse things, perform various
tasks, & involve different objects
Think: what do you do while searching?
Belkin’s episodes model
[Figure: repeated episodes, each showing USER (goals, tasks ...), INTERACTION (judgment, use, interpretation, modification) and INFORMATION (type, medium, mode, level), together with the processes COMPARISON, REPRESENTATION, SUMMARIZATION, NAVIGATION and VISUALIZATION]
Saracevic’s stratified model
• Interaction: considers it as a sequence of processes/episodes
occurring in several levels or strata
Interaction = INTERPLAY between levels
• Structure:
– Several User levels
– Produce a Query – it has characteristics
– Several Computer levels
– They all meet on the Surface level
– Dialogue enabled by Interface
• user utterances
• computer ‘utterances’
• Adaptation/changes in all
• Geared toward Information use
Saracevic’s stratification model
Context: social, cultural …
Situational: tasks; work context...
Affective: intent; motivation ...
Cognitive: knowledge; structure...
Query: characteristics …
Surface level: INTERFACE
Engineering: hardware; connections...
Processing: software; algorithms …
Content: inf. objects; representations...
Roles of levels or strata
• Defining of what’s involved
– whassup?
• Help in recognition/separation of differing variables
– each stratum or level involves different elements, roles, &
processes
• Observation of interaction between strata - complex dynamics
• On the user side suggests what factors affect query
and judgment of responses
– thus elements for user modeling
Interplay between levels
• Interplay on user side:
– Cognitive: between cognitive structures of texts & users
– Affective: between intentions & other
– Situational: between texts & tasks
• Similar interplay on computer side
• Surface level - interface:
– searching, navigation, browsing, display, visualization, query
characterization …
• Interplay judgments in searching:
– evaluation of results - relevance
– changing of models: situation, need ...
– selection of search terms
– resulting modifications - feedback
Intermediaries - YOU
• Intermediaries could participate as an additional
interface - many roles:
– diagnostic help in problem, query formulation
– system interface handling
– selection, interpretation & manipulation of inf.
resources
– interpretation of results
– education of users
– enablers of end-users
• Basic role: optimizing results
• Act in processes at different levels
Implications
• Interaction central to IR including in searching of the Web
• We see it on the surface level
– But result of MANY variables, levels & their interplay
• IR interaction requires knowledge of these levels &
interplays
– many users have difficulties
– so do many professionals
• Design of interfaces for interaction still lacking
• People compensate in many ways including trial & error,
failures
What happens in searching?
• Highly reiterative process
– back & forth between user modeling & (re)formulating
search strategy
– goes on & on in many feedback loops, twists & turns,
shifts
• Search strategy (the big picture)
– selection/reselection of sources
– stating a query (search statement) from a question
• terms, their expansions, logic, qualifications, limitations
Searching … (cont.)
• Search tactics (action steps)
– what to do first, next
– e.g. from broad to narrow searches
– format of results
• Evaluation of results
– as to magnitude - how much?
– as to relevance - how well?
– feedback to change after that
• user model - e.g. question
• strategy - e.g. files, query
• tactics - e.g. narrowing, broadening
Practical suggestions for searchers
(filched from a source I cannot find anymore)
• Prepare carefully
• Understand your opponent – e.g. Dialog, Scopus, LexisNexis
• Anticipate
– e.g. hidden meaning of terms
• Have a contingency plan
– assessing odds of success or points of diminishing returns
• Avoid ambiguity
– inherent in language
• Stay loose!
Stay loose?
• I copied that, but always wondered what
does it really mean?
• Dictionary says:
not firmly fastened or fixed in place
• ???? well, sounds OK!
• or