IR traditional model

Information retrieval (IR): traditional model
1. Why? Rationale for the module. Definition of IR
2. System & user components
3. Exact match & best match searches
4. Strengths & weaknesses
© Tefko Saracevic
1. Why? Rationale for the module. Definition of IR
includes problems addressed in IR
Why?
• Every online database, every
search engine, everything that is
searched online is based in some
way or another on principles
developed in IR
– IR is at the heart of searching used in
systems such as DIALOG, LexisNexis
& others
• Understanding the basics of IR is a
prerequisite for understanding how
searching of online systems works.
You are asking:
• What basic elements and
processes are involved in IR?
• What are the conceptual bases
for searching?
• How are these applied in
practice?
IR:
- original definition
“Information retrieval embraces the
intellectual aspects of the
description of information and its
specification for search, and also
whatever systems, techniques, or
machines are employed to carry
out the operation.”
Calvin Mooers, 1951
IR:
Objective & problems
Provide the users with effective
access to & interaction with
information resources.
Problems addressed:
1. How to organize information
intellectually?
2. How to specify search &
interaction intellectually?
3. What systems & techniques to
use for those processes?
Where do you fit?
With what problems do you deal?
2. System & user
components
Traditional IR model
presented
IR models
• Model depicts, represents what is
involved
– a choice of features, processes, things
for consideration
• Several IR models used over time
– traditional: oldest, most used, shows
basic elements involved
▪ treated in this module
– interactive: more realistic, favored now,
shows also interactions involved
▪ treated in next module (module 5)
– Each has strengths, weaknesses
Description of
traditional IR model
• It has two streams of activities
– one is the systems side with processes
performed by the system
– other is the user side with processes
performed by users & intermediaries (you)
– these two sides led to “system orientation” &
“user orientation”
– on the system side automatic processing is done;
on the user side human processing is done
• They meet at the matching process
– where the query is fed into the system and
system looks for documents that match the
query
• Also feedback is involved so that things
change based on results
– e.g. query is modified & new matching done
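Traditional IR model
[Figure: traditional IR model, with a system stream and a user stream that meet at matching]
System: Acquisition (documents, objects) → Representation (indexing, ...) → File organization (indexed documents)
User: Problem (information need) → Representation (question) → Query (search formulation)
Both streams meet at Matching (searching), which produces Retrieved objects; feedback from the results can change earlier steps (e.g. the query)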
Acquisition
(system)
• Content: What is in files, resources
– in DIALOG first part of blue sheets: File
Description, Subject Coverage
• Selection of documents & other objects
from various sources
– in blue sheets: Sources
• Mostly text based documents
– full texts, titles, abstracts ...
– but also other objects:
▪ data, statistics, images, maps, trade marks,
sounds ...
Importance:
Determines contents – what is in it
Key to file, resource selection !!!
Representation
of documents, objects
(system)
• Indexing – many ways:
– free text terms (even in full texts)
– controlled vocabulary - thesaurus
– manual & automatic techniques
• Abstracting; summarizing
• Bibliographic description:
– author, title, sources, date…
– metadata
• Classifying, clustering
• Organizing in fields & limits
– in DIALOG: Basic Index, Additional
Index, Limits
Basic to what is available
for searching & displaying
File organization
(system)
• Sequential
– record (document) by record
• Inverted
– term by term; list of records under
each term
• Combination: indexes inverted,
documents sequential
• When only citations are retrieved,
document files are needed as well
• Large file approaches
– for efficient retrieval by computers
Enables searching & interplay
between types of files
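The difference between sequential and inverted organization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three records and the function names `sequential_search` and `build_inverted_file` are hypothetical, not taken from DIALOG or any other system.

```python
from collections import defaultdict

# Hypothetical three-record file; the texts are made up for illustration.
records = {
    1: "digital library services",
    2: "history of information retrieval",
    3: "digital information systems",
}

def sequential_search(term):
    """Sequential file: scan record by record and keep the ones with the term."""
    return [rid for rid, text in records.items() if term in text.split()]

def build_inverted_file(recs):
    """Inverted file: term by term, with the list of records under each term."""
    inverted = defaultdict(list)
    for rid, text in recs.items():
        for term in sorted(set(text.split())):
            inverted[term].append(rid)
    return dict(inverted)

inverted_file = build_inverted_file(records)

print(sequential_search("digital"))   # [1, 3]
print(inverted_file["information"])   # [2, 3]
```

Both approaches give the same answer for a term; the inverted file simply has the list of records ready under each term, so no scan of every record is needed at search time.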
Problem
(user)
• Related to user’s task, situation
– vary in specificity, clarity
• Produces information need
– ultimate criterion for effectiveness of
retrieval
▪ how well was the need met?
• Information need for the same problem may
change, evolve, shift during the IR
process - adjustment in searching
– often more than one search for same
problem over time
▪ you will experience this in your term project
Critical for examination
in interview
Representation - question
(user & possibly system)
• Non-mediated: end user alone
• Mediated: intermediary + user
– interviews; human-human interaction
• Question analysis
– selection, elaboration of terms
– various tools may be used
▪ thesaurus, classification schemes,
dictionaries, textbooks, catalogs …
• Focus toward
– deriving search terms & logic
– selection of files, resources
• Subject to feedback changes
• Critical roles of intermediary - you
Determines search specification
- a dynamic process
Query - search statement
(user & system)
• Translation into systems
requirements & limits
– start of human-computer interaction
▪ query is the thing that goes into the computer
• Selection of files, resources
• Search strategy - selection of:
– search terms & logic
– possible fields, delimiters
– controlled & uncontrolled vocabulary
– variations in effectiveness tactics
• Reiterations from feedback
– several feedback types: relevance
feedback, magnitude feedback *...
– query expansion & modification
What & how of actual searching
Clarifying difference
• Question is what user asks and what
you may then have elaborated
• Query is what is asked of computer to
match – what is put in
• Question is transformed into query
• Question:
– I am interested in major historical
developments in the area of information
retrieval.
• Query:
– history information retrieval (in Google)
– history AND information(w)retrieval (in
DIALOG) (plus you have to select which
file(s) to search)
Matching - searching
(user & system)
• Process of matching, comparing
– search: what documents in the file
match the query as stated?
• Various search algorithms:
– exact match - Boolean
▪ still available in most, if not all systems
– best match - ranking by relevance
▪ increasingly used e.g. on the web
– hybrids incorporating both
▪ e.g. Target, Rank in DIALOG
• Each has strengths, weaknesses
– no ‘perfect’ method exists
▪ and probably never will
Involves many types of search
interactions & formulations
Retrieved documents
(from system to user)
• Various orders of output:
– Last In First Out (LIFO); sorted
– ranked by relevance
– ranked by other characteristics
• Various forms of output
– In DIALOG: Output options
• When citations only: possible
links to document delivery
• Base for relevance, utility
evaluation by users
• Relevance feedback
What a user (or you) sees, gets,
judges – can be specified
3. Exact match & best
match searches
Getting to that Boolean and
similar stuff – the nitty-gritty
of matching
which actually affects how
you formulate the query
Exact match Boolean search
• You retrieve exactly what you ask
for in the query:
– all documents that have the term(s)
with logical connection(s), and
possible other restrictions (e.g. to be
in titles) as stated in the query
– exactly: nothing less, nothing more
• Based on matching that follows the rules
of Boolean algebra, or algebra of
sets
– ‘new algebra’
– represented by circles in Venn
diagrams
Boolean algebra
• Operates on sets
– e.g. set of documents
• Has four operations (like in algebra):
1. A: retrieve set A
▪ I want documents that have the term library
2. A AND B: retrieve set that has A and B
▪ often called intersection & labeled A ∩ B
▪ I want documents that have both terms library
and digital someplace within
3. A OR B: retrieve set that has either A or B
▪ often called union and labeled A ∪ B
▪ I want documents that have either term library
or term digital someplace within
4. A NOT B: retrieve set A but not B
▪ often called negation (set difference) and labeled A – B
▪ I want documents that have term library but if
they also have term digital I do not want those
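The four operations map directly onto set operators in Python. A minimal sketch, assuming two hypothetical sets of record numbers for the terms library and digital (the numbers are invented for illustration):

```python
# Hypothetical posting sets: record numbers that contain each term.
library = {1, 2, 4, 7}
digital = {2, 3, 7, 9}

a_alone = library              # 1. A
a_and_b = library & digital    # 2. A AND B (intersection)
a_or_b = library | digital     # 3. A OR B (union)
a_not_b = library - digital    # 4. A NOT B (set difference)

print(sorted(a_alone))  # [1, 2, 4, 7]
print(sorted(a_and_b))  # [2, 7]
print(sorted(a_or_b))   # [1, 2, 3, 4, 7, 9]
print(sorted(a_not_b))  # [1, 4]
```

Note that AND can only narrow a result and OR can only broaden it, which is why they are the usual tools for tightening or widening a search.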
Potential problems
• But beware:
– digital AND library will retrieve documents
that have digital library (together as a
phrase) but also documents that have
digital in the first paragraph and library in
the third section, 5 pages later, and it
does not deal with digital libraries at all
– thus in Google you will ask for “digital
library” and in DIALOG for
digital(w)library to retrieve the exact
phrase digital library
– digital NOT library will retrieve documents
that have digital and suppress those that
along with digital also have library, but
sometimes those suppressed may very
well be relevant. Thus, NOT is also
known as the “dangerous operator”
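The gap between AND and a phrase search can be shown with a small sketch. It only imitates the idea behind quotation marks in Google or (w) in DIALOG and says nothing about how those systems are implemented; the documents and the function names `boolean_and` and `adjacent` are hypothetical.

```python
# Hypothetical documents; the texts are invented for illustration.
docs = {
    1: "the digital library offers online access",
    2: "digital cameras were on display near the public library",
    3: "library hours are posted at the entrance",
}

def boolean_and(term_a, term_b):
    """AND: both terms occur someplace in the document, in any position."""
    return [d for d, text in docs.items()
            if term_a in text.split() and term_b in text.split()]

def adjacent(term_a, term_b):
    """Phrase search: term_b must immediately follow term_a."""
    hits = []
    for d, text in docs.items():
        words = text.split()
        if any(w1 == term_a and w2 == term_b
               for w1, w2 in zip(words, words[1:])):
            hits.append(d)
    return hits

print(boolean_and("digital", "library"))  # [1, 2]
print(adjacent("digital", "library"))     # [1]
```

Document 2 satisfies digital AND library although it is about cameras, which is exactly the kind of unwanted retrieval the phrase search suppresses.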
Boolean algebra depicted
in Venn diagrams
Four basic operations:
e.g. A = digital, B = libraries
[Figure: two overlapping circles A and B; region 1 is A only, region 2 is the overlap of A and B, region 3 is B only]
• A alone. All documents that have A. Shade 1 & 2.
digital
• A AND B. Shade 2.
digital AND libraries
• A OR B. Shade 1, 2, 3.
digital OR libraries
• A NOT B. Shade 1.
digital NOT libraries
Venn diagrams … cont.
Complex statements allowed, e.g.
[Figure: three overlapping circles A, B and C, with regions numbered 1 to 7]
• (A OR B) AND C. Shade 4, 5, 6.
(digital OR libraries) AND Rutgers
• (A OR B) NOT C. Shade what?
(digital OR libraries) NOT Rutgers
Venn diagrams cont.
• Complex statements can be
made
– as in ordinary algebra e.g. (2+3)x4
• As in ordinary algebra: watch for
parentheses:
– 2+(3 x 4)
is not the same as
(2+3)x4
– (A AND B) OR C
is not the same as
A AND (B OR C)
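A quick check with sets shows that the placement of parentheses changes the result. The three sets of record numbers below are hypothetical values chosen for illustration.

```python
# Hypothetical sets of record numbers for three terms.
A = {1, 2, 3}
B = {3, 4}
C = {4, 5}

left = (A & B) | C     # (A AND B) OR C
right = A & (B | C)    # A AND (B OR C)

print(sorted(left))    # [3, 4, 5]
print(sorted(right))   # [3]
```

The first statement lets in everything in C, while the second keeps only what is also in A.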
Best match searching
• Output is ranked
– it is NOT presented as a Boolean set
but in some rank order
• You retrieve documents ranked by
how similar (close) they are to a
query (as calculated by the system)
– similarity assumed as relevance
– ranked from highest to lowest relevance
to the query
▪ mind you, as considered by the system
▪ if you change the query, the system changes the ranking
– thus, documents as answers are
presented from those that are most
likely relevant downwards to less & less
likely relevant
– can be cut at any desired number - e.g.
first 10
Best match ...
cont.
• Best match process deals with
PROBABILITY:
– compares the set of query terms
with the sets of terms in documents
– calculates a similarity between
query & each document based on
common terms &/or other aspects
– sorts the documents in order of
similarity
– assumes that the higher ranked
documents have a higher
probability of being relevant
– allows for cut-off at a chosen number
• BIG issue: What representation &
similarity measures are better?
– “better” determined by a number of
criteria, e.g. relevance, speed …
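The steps listed above (compare term sets, calculate similarity, sort, cut off) can be sketched as follows. The Jaccard coefficient is used here only as one simple, well-known similarity measure picked for illustration; it is an assumption, not the measure of any particular system, and the documents and the function name `best_match` are hypothetical.

```python
def jaccard(query_terms, doc_terms):
    """Similarity = shared terms / all distinct terms (Jaccard coefficient)."""
    union = query_terms | doc_terms
    return len(query_terms & doc_terms) / len(union) if union else 0.0

def best_match(query, docs, cutoff):
    """Rank documents by similarity to the query and keep the top `cutoff`."""
    q = set(query.lower().split())
    scored = [(jaccard(q, set(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    # Highest similarity first; ties broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, round(score, 2)) for score, doc_id in scored[:cutoff]]

# Hypothetical documents, invented for illustration.
docs = {
    "d1": "history of information retrieval",
    "d2": "information retrieval systems",
    "d3": "history of art",
    "d4": "digital library services",
}

print(best_match("history information retrieval", docs, cutoff=3))
# [('d1', 0.75), ('d2', 0.5), ('d3', 0.2)]
```

Unlike a Boolean search, d3 is still returned although it shares only one term with the query; it is simply ranked lower, and d4 falls below the cut-off.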
Best match (cont.)
• Variety of algorithms (formulas) used
to determine similarity
– using statistical &/or linguistic properties
▪ e.g. if digital appears a lot in a given
document relative to its size, that document
will be ranked higher when the query is digital
– many proposed & tested in IR research
– many developed by commercial
organizations
▪ Google also uses calculations as to number
of links to/from a document
▪ many algorithms are now proprietary
– system ranking and your ranking may not
necessarily be in agreement
• Web outputs are mostly ranked
• But DIALOG allows ranking as well,
with special commands
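The statistical example above (a term appearing a lot relative to document size) can be sketched as a relative term frequency. This is a deliberately simple stand-in, not the formula of any real system; the documents and the function name `relative_frequency` are hypothetical.

```python
def relative_frequency(term, text):
    """Occurrences of the term divided by the document length in words."""
    words = text.lower().split()
    return words.count(term) / len(words) if words else 0.0

# Hypothetical documents, invented for illustration.
docs = {
    "short": "digital library digital archive",          # 2 of 4 words
    "long": "digital tools for the study of old maps",   # 1 of 8 words
    "none": "history of information retrieval",          # 0 of 4 words
}

ranking = sorted(docs,
                 key=lambda d: relative_frequency("digital", docs[d]),
                 reverse=True)
print(ranking)  # ['short', 'long', 'none']
```

Dividing by length keeps a long document from outranking a short one merely because it has more words overall.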
4. Strengths &
weaknesses
Boolean vs. best match
• Boolean
– allows for logic
– provides all that has been matched
BUT
– has no particular order of output
– treats all retrievals equally - from the most to least relevant ones
– often requires examination of large outputs
• Best match
– allows for free terminology
– provides for a ranked output
– provides for cutoff - any size output
BUT
– does not include logic
– ranking method (algorithm) not transparent
▪ whose relevance?
– where to cut off?
Strengths of traditional
IR model
• Lists major components in both
system & user branches
• Suggests:
– What to explain to users about
system, if needed
– What to ask of users for more
effective searching (problem ...)
• Selection of component(s) for
concentration
– mostly on ever better representation
• Provides a framework for
evaluation of (static) aspects
Weaknesses
• Does not address or account for
interaction & judgment of results
by users
– identifies interaction with search only
– interaction is a much richer process
• Many types of & variables in
interaction not reflected
• Feedback has many types &
functions - also not shown
• Evaluation thus one-sided
IR is a highly interactive process
- thus additional model(s) needed
Interactive models
• Explored in next module
Module 5