IR: traditional model

Announcement
Feb. 3, 2003
1. Discussion
2. Information retrieval (IR) model (the traditional models)
3. The review of the readings
© Tefko Saracevic, Rutgers University
Information retrieval (IR): traditional model
• Definition of IR
• System & user components
• Exact match & best match searches
• Strengths & weaknesses of the two match models
IR: problems addressed - original definition
Calvin Mooers first introduced this term, "information retrieval," into the literature of documentation in 1950. (Swanson, 1988)
"Inf. retrieval embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, techniques, or machines are employed to carry out the operation."
Calvin Mooers, 1951
IR: another definition
• "Information retrieval is often regarded as being synonymous with document retrieval and nowadays, with text retrieval, implying that the task of an IR system is to retrieve documents or texts with information content that is relevant to a user's information need" (Sparck Jones & Willett, 1997)
IR: objective & problems
Provide the users with effective access to & interaction with information resources.
Problems addressed:
1. How to organize information intellectually?
2. How to specify search & interaction intellectually?
3. What systems & techniques to use for those processes?
IR models
• A model depicts, represents what is involved - a choice of features, processes, things for consideration
• Several IR models used over time
– traditional: oldest, most used, shows the basic elements involved
– interactive: more realistic, favored now, also shows the interactions involved; several models proposed
• Each has strengths, weaknesses
• We start with the traditional model to illustrate many points - from general to specific examples
Traditional IR model
• The classic information retrieval model (Bates, 1989)
[Diagram: Document → Document representation → Match ← Query ← Information need]
Traditional IR model
• The "standard" IR model (Belkin, 1993)
[Diagram: Information need → Representation → Query; Texts → Representation → Surrogate; Query vs. Surrogate → Comparison → Retrieved texts → Judgment → Modification of the query]
Traditional IR model
[Diagram, user branch: Problem → information need → Representation → question → Query → search formulation → Matching/searching, with feedback]
[Diagram, system branch: Acquisition → documents, objects → Representation (indexing, ...) → File organization → indexed documents → Matching/searching → Retrieved objects]
A few questions about the traditional models
• 1. What are the similarities and differences between these three models?
• 2. What do you learn about IR from them?
• 3. What are the weaknesses and strengths of the traditional IR model? If possible, critique these models, drawing on your own experience.
Acquisition
(system)
• Content: what is in the databases
– In DIALOG: the first part of the blue sheets (File Description, Subject Coverage)
• Selection of documents & other objects from various sources
– In blue sheets: Sources
• Mostly text-based documents
– Full texts, titles, abstracts ...
– But also: data, statistics, images (e.g. maps, trademarks) ...
Importance: determines contents of databases
Key to file selection!
Representation of documents, objects
(system)
• Indexing:
– controlled vocabulary - thesaurus
– free-text terms (even in full texts)
• Abstracting; annotating
• Bibliographic description:
– author, title, source, date ... metadata
• Classifying, clustering, ranking
– Basic Index, Additional Indexes, Limits
• Organization in fields & limits
• Manual & automatic techniques
– advantages & disadvantages
Basic to what is available for searching & displaying
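A minimal sketch, in Python, of what one such representation might look like, combining bibliographic fields, free-text terms, and controlled vocabulary; the record, stopword list, and thesaurus entries are all invented for illustration, not taken from any real system:

```python
# A hypothetical record representation; every value here is invented.
record = {
    "author": "Smith, J.",
    "title": "Citrus growing in Florida",
    "source": "J. Agric.",
    "date": "2001",
}

# Free-text terms: title words, lowercased, minus a tiny stopword list
stopwords = {"in", "the", "of", "a"}
record["free_text_terms"] = [
    w for w in record["title"].lower().split() if w not in stopwords
]

# Controlled vocabulary: map raw terms to preferred thesaurus descriptors
thesaurus = {"citrus": "CITRUS FRUITS", "florida": "FLORIDA"}
record["descriptors"] = sorted(
    {thesaurus[w] for w in record["free_text_terms"] if w in thesaurus}
)

print(record["free_text_terms"])  # ['citrus', 'growing', 'florida']
print(record["descriptors"])      # ['CITRUS FRUITS', 'FLORIDA']
```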
File organization
(system)
• Sequential
– record (document) by record
• Inverted
– term by term; a list of records under each term
• Combination: indexes inverted, documents sequential
• When only citations are retrieved, need for document files
• Large-file approaches
– for efficient retrieval by computers
Enables searching & interplay
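A minimal sketch of the two organizations in Python; the documents and their IDs are invented for illustration:

```python
from collections import defaultdict

# Sequential organization: record (document) by record
documents = {
    1: "apples and oranges grown in Florida",
    2: "oranges shipped from Florida",
    3: "apples grown in Washington",
}

# Inverted organization: term by term, with the list (here a set)
# of record IDs posted under each term
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["apples"]))   # [1, 3]
print(sorted(inverted_index["oranges"]))  # [1, 2]
```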
Problem
(user)
• Related to the task situation at hand
• Varies in specificity, clarity
• Produces the information need
• Ultimate criterion for effectiveness of retrieval
• Inf. need for the same problem may change, evolve, shift during the IR process - adjustment in searching
• Often more than one search for the same problem over time
Critical for examination in interview
Problem
(user)
• A question:
• Why may the information need for the same problem change? Have you had this experience? Tell us your story.
Representation - question
(user & possibly system)
• Non-mediated: end user alone
• Mediated: intermediary + user
– interviews; human-human interaction
• Question analysis: selection, elaboration of terms
• Focus toward search terms & logic; selection of databases
• Subject to feedback changes
• Various tools: thesaurus ...
• Roles of intermediary
Determines contents of searching - dynamic
Query - search statement
(user & system)
• Translation into system requirements & limits
– start of human-computer interaction
• Selection of databases
• Search strategy - selection of:
– search terms & logic
– possible fields, delimiters
– controlled & uncontrolled vocabulary
– variations in effectiveness tactics
• Reiterations from feedback
– several feedback types: relevance feedback, magnitude feedback ...
– query expansion & modification
What & how of actual searching
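A hypothetical sketch of a search formulation as a data structure, reflecting the choices listed above (database selection, terms & logic, field limits, controlled vocabulary, reiteration from feedback); every name and value is invented for illustration and is not the syntax of DIALOG or any other system:

```python
# A hypothetical search formulation; all names and values are invented.
query = {
    "databases": ["ERIC"],                             # database selection
    "logic": "(retrieval OR searching) AND behavior",  # terms & logic
    "field_limits": {"behavior": "title"},             # fields, delimiters
    "descriptors": ["INFORMATION RETRIEVAL"],          # controlled vocabulary
    "limits": {"language": "EN", "year_from": 1995},
}

# Reiteration from feedback: e.g., expand the query with a related term
query["logic"] = query["logic"].replace("searching", "searching OR seeking")
print(query["logic"])  # (retrieval OR searching OR seeking) AND behavior
```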
Matching - searching
(user & system)
• Process of matching, comparing
– search: what documents in the file match the query as stated?
• Various search algorithms:
– exact match - Boolean
• still most prevalent
– best match - ranking by relevance
• increasingly used, e.g. on the web
– hybrids incorporating both
• e.g. Target, Rank in DIALOG
• Each has strengths, weaknesses
– no 'perfect' method exists
Search interactions
Retrieved documents
(from system to user)
• Various orders of output:
– Last In First Out (LIFO); sorted
– ranked by relevance
– ranked by other characteristics
• Various forms of output
– In DIALOG: output options
• When citations only: linkage to document delivery
• Basis for relevance, utility evaluation by users
• Relevance feedback
What a user sees, gets, judges
Exact match - Boolean search
• You retrieve exactly what you ask for in the query:
– all documents that have the term(s), with the logical connection(s) and possible other restrictions (e.g. to be in titles), as stated in the query
– exactly: nothing less, nothing more
• Based on matching following the rules of Boolean algebra, or the algebra of sets
– 'new algebra'
– presented by circles in Venn diagrams
Boolean algebra & Venn diagrams
Four basic operations (two overlapping circles A and B; region 1 = A only, region 2 = the overlap, region 3 = B only):
• A alone: all documents that have A - shade 1 & 2 (e.g. apples)
• A AND B: shade 2 (apples AND oranges)
• A OR B: shade 1, 2, 3 (apples OR oranges)
• A NOT B: shade 1 (apples NOT oranges)
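The same four operations expressed as set algebra over postings lists like those in the inverted index sketched earlier: AND is intersection, OR is union, NOT is set difference. A minimal Python illustration with invented record IDs:

```python
# Postings (record IDs) for two terms; the values are invented.
apples = {1, 3}
oranges = {1, 2}

print(sorted(apples))            # A alone -> [1, 3]
print(sorted(apples & oranges))  # A AND B -> [1]
print(sorted(apples | oranges))  # A OR B  -> [1, 2, 3]
print(sorted(apples - oranges))  # A NOT B -> [3]
```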
Venn diagrams ... cont.
Complex statements allowed, e.g. (three overlapping circles A, B, C; regions numbered 1-7):
• (A OR B) AND C: shade 4, 5, 6
– (apples OR oranges) AND Florida
• (A OR B) NOT C: shade what?
– (apples OR oranges) NOT Florida
Venn diagrams cont.
• Complex statements can be made
– as in ordinary algebra, e.g. (2+3) x 4
• As in ordinary algebra, watch the parentheses:
– 2 + (3 x 4) is not the same as (2+3) x 4
– (A AND B) OR C is not the same as A AND (B OR C)
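A minimal illustration, with invented postings sets, that the parenthesization really does change the result:

```python
# Invented postings sets for three terms
A, B, C = {1, 2}, {2, 3}, {3, 4}

print(sorted((A & B) | C))  # (A AND B) OR C -> [2, 3, 4]
print(sorted(A & (B | C)))  # A AND (B OR C) -> [2]
```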
Best match searching
• You retrieve documents ranked by how similar (close) they are to the query (as calculated by the system)
– similarity is assumed to indicate relevance
– thus, documents as answers are presented from those most likely relevant downwards to less & less likely relevant - the list can be cut at any desired number, e.g. the first 10
• Algorithms (formulas) used to determine similarity
– using statistical &/or linguistic properties
• Web outputs are mostly ranked
• But DIALOG allows ranking as well, with special commands
Best match ... cont.
• Best match process:
– compares a set of query terms with the sets of terms in documents
– calculates a similarity between the query & each document based on common terms
– sorts the documents in order of similarity
– assumes that the higher-ranked documents have a higher probability of being relevant
– allows for a cut-off at a chosen number
• BIG issue: what representation & similarity measures are best?
– considerable research & many tests
– many proprietary algorithms
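A minimal sketch of that process in Python: similarity here is simply the count of terms shared by query and document, whereas real systems use weighted (often proprietary) formulas; the documents are invented for illustration:

```python
documents = {
    1: "apples and oranges grown in Florida",
    2: "oranges shipped from Florida",
    3: "apples grown in Washington",
}
query = "apples grown in Florida"

# Compare query terms with each document's terms; score = common terms
query_terms = set(query.lower().split())
scores = {
    doc_id: len(query_terms & set(text.lower().split()))
    for doc_id, text in documents.items()
}

# Sort by similarity, most likely relevant first; cut off at a chosen
# number, e.g. the first 10
ranked = sorted(scores, key=scores.get, reverse=True)[:10]
print([(doc_id, scores[doc_id]) for doc_id in ranked])
# [(1, 4), (3, 3), (2, 1)]
```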
Boolean vs. best match
• Boolean
– allows for logic
– provides all that has been matched
BUT
– has no particular order of output
– treats all retrievals equally - from the most to the least relevant ones
– often requires examination of large outputs
• Best match
– allows for free terminology
– provides for a ranked output
– provides for cut-off - any size output
BUT
– does not include logic
– ranking method (algorithm) not transparent
• whose relevance?
– where to cut off?
Boolean vs. best match
• Questions about best match (just to think about):
• 1. As a user, do you trust the algorithm's judgment if you do not read the hits?
• 2. Is a document judged only 10% relevant to your query necessarily less useful for resolving your information problem than one judged 40% relevant?
Strengths of traditional IR model
• Lists major components in both system & user branches
• Suggests:
– what to explain to users about the system, if needed
– what to ask of users for more effective searching (problem ...)
• Selection of component(s) for concentration
– mostly ever-better representation
• Provides a framework for evaluation of (static) aspects
Weaknesses
• Does not address or account for interaction & judgment of results by users
– identifies interaction with search only
– interaction is a much richer process
• Many types of & variables in interaction not reflected
• Feedback has many types & functions - also not shown
• Evaluation thus one-sided
IR is a highly interactive process - thus additional model(s) needed