Information Retrieval 3

LIS618 lecture 3
Thomas Krichel
2003-02-13
Structure of talk
• Document Preprocessing
• Basic ingredients of query languages
• Retrieval performance evaluation
document preprocessing
• There are some operations that may be done to
the documents before indexing:
– lexical analysis
– stemming of words
– elimination of stop words
– selection of index terms
– construction of term categorization structures
We will look at those in turn.
• in many cases, document preprocessing is not
well documented by the provider.
• but searchers need to be aware of these
operations…
lexical analysis
• divides a stream of characters into a
stream of words
• seems easy enough, but…
• should we keep numbers?
• hyphens: compare "state-of-the-art" with
"b-52"
• removal of punctuation, but consider "333 B.C."
• casing: compare "bank" and "Bank"
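As an illustration, here is a minimal tokenizer sketch in Python; the rules and the function name are invented for this example, not taken from any real engine:

    import re

    def tokenize(text, keep_numbers=True, split_hyphens=True):
        # Fold case first, so "Bank" and "bank" become the same term.
        text = text.lower()
        if split_hyphens:
            # "state-of-the-art" becomes four tokens; so does "b-52".
            pattern = r"[a-z0-9]+"
        else:
            # Hyphenated forms such as "b-52" survive as single tokens.
            pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"
        tokens = re.findall(pattern, text)
        if not keep_numbers:
            tokens = [t for t in tokens if not t.isdigit()]
        return tokens

    print(tokenize("The state-of-the-art B-52"))
    # ['the', 'state', 'of', 'the', 'art', 'b', '52']
    print(tokenize("The state-of-the-art B-52", split_hyphens=False))
    # ['the', 'state-of-the-art', 'b-52']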
stemming
• in general, users search for the
occurrence of a term irrespective of
grammar
• plural, gerund forms, past tense can be
subject to stemming
• important algorithm by Porter
• evidence about the effect of stemming on
information retrieval is mixed
• stemming is relatively rare these days.
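For instance, Porter's algorithm is available in the nltk Python package; a small demonstration, assuming nltk is installed (the words are just examples):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["connect", "connected", "connecting", "connection"]:
        print(word, "->", stemmer.stem(word))
    # All four forms reduce to the stem "connect", so a query for
    # any one of them matches documents containing the others.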
elimination of stop words
• some words carry no meaning and should
be eliminated
• in fact any word that appears in 80% of all
documents is pretty much useless, but
• consider a searcher for "to be or not to be":
every word of that query is a stop word, so the
query would return nothing.
• It is better to reduce the index weight of
terms that appear very frequently
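A toy stop word filter in Python makes the problem concrete; the stop list here is illustrative, not the one any real engine uses:

    STOP_WORDS = {"to", "be", "or", "not", "the", "a", "an", "of"}

    def remove_stop_words(tokens):
        return [t for t in tokens if t not in STOP_WORDS]

    print(remove_stop_words("to be or not to be".split()))  # []
    print(remove_stop_words("the art of war".split()))      # ['art', 'war']

The Shakespeare query vanishes entirely, which is why down-weighting very frequent terms can be preferable to dropping them.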
index term selection
• some engines try to capture nouns only
• some nouns that appear heavily together
can be considered to be one index term,
such as "computer science"
• Dialog deals with this through phrase
indexing.
• Most web engines, however, index all words,
and index each of them individually.
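One simple way to find candidate phrases such as "computer science" is to count adjacent word pairs; a sketch (the threshold is arbitrary):

    from collections import Counter

    def frequent_bigrams(tokens, min_count=2):
        # Pairs of adjacent words that co-occur often, such as
        # ("computer", "science"), are candidate single index terms.
        pairs = Counter(zip(tokens, tokens[1:]))
        return [pair for pair, n in pairs.items() if n >= min_count]

    text = "computer science departments teach computer science"
    print(frequent_bigrams(text.split()))
    # [('computer', 'science')]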
thesauri
• a list of words and for each word, a list of related
words
– synonyms
– broader terms
– narrower terms
• used
– to provide a consistent vocabulary for indexing and
searching
– to assist users with locating terms for query
formulation
– to allow users to broaden or narrow a query
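A thesaurus can be modelled as a simple mapping; the entry below is invented for illustration:

    THESAURUS = {
        "car": {"synonyms": ["automobile"],
                "broader":  ["vehicle"],
                "narrower": ["hatchback", "sedan"]},
    }

    def broaden(term):
        # Expand a query term with its synonyms and broader terms.
        entry = THESAURUS.get(term, {})
        return [term] + entry.get("synonyms", []) + entry.get("broader", [])

    print(broaden("car"))   # ['car', 'automobile', 'vehicle']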
use of thesauri
• Thesauri are limited to experimental systems or
some high-quality systems; see
http://www.sosig.ac.uk for an example.
• They can be confusing to users.
• Frequently the relationship between terms in the
query is badly served by the relationships in the
thesaurus. Thus thesaurus expansion of an
initial query (if performed automatically) can lead
to bad results.
simple queries
• single-word queries
– one word only
– Hopefully some word combinations are
understood as one word, e.g. on-line
• Context queries
– phrase queries (be aware of stop words)
– proximity queries, generalize phrase queries
• Boolean queries
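Boolean queries map directly onto set operations over an inverted index; a sketch with invented document ids:

    # term -> set of ids of documents containing the term
    index = {
        "information": {1, 2, 4},
        "retrieval":   {1, 4},
        "storage":     {2, 3},
    }

    print(index["information"] & index["retrieval"])  # AND: {1, 4}
    print(index["information"] | index["storage"])    # OR:  {1, 2, 3, 4}
    print(index["information"] - index["retrieval"])  # NOT: {2}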
simple pattern queries
• prefix queries (e.g. "anal" for "analogy")
• suffix queries (e.g. "oral" for "choral")
• substring queries (e.g. "al" for "talk")
• range queries (e.g. from "held" to "hero")
• queries within a distance of the query term,
usually Levenshtein distance (i.e. the minimum
number of insertions, deletions, and
replacements)
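Levenshtein distance has a standard dynamic-programming implementation; a compact sketch in Python:

    def levenshtein(a, b):
        # Minimum number of insertions, deletions, and replacements
        # needed to turn string a into string b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # replacement
            prev = cur
        return prev[-1]

    print(levenshtein("held", "hero"))   # 2 (replace l->r, d->o)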
regular expressions
• come from UNIX computing
• built from strings where certain characters are
metacharacters.
• example: "pro(blem|tein)s?" matches "problem",
"problems", "protein", and "proteins".
• example: "New .*y" matches "New Jersey",
"New York City", and "New Delhy".
• great variety of dialects, usually very powerful.
• Extremely important in digital libraries.
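The two examples above can be checked with Python's re module, one common dialect among many:

    import re

    pattern = re.compile(r"pro(blem|tein)s?")
    for word in ["problem", "problems", "protein", "proteins", "process"]:
        print(word, bool(pattern.fullmatch(word)))
    # True for the first four, False for "process"

    # "New .*y" matches strings starting "New " and ending in "y".
    for name in ["New Jersey", "New York City", "New Delhy"]:
        print(name, bool(re.fullmatch(r"New .*y", name)))
    # all three print True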
structured queries
• make use of document structures
• the simplest example is when the documents
are database records; we can then search for
terms in a certain field only.
• if there is sufficient structure to field
contents, a field can be interpreted as
meaning something different than the words
it contains. example: dates
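A sketch of a fielded search over records, with invented data; note that the date field is compared as a date, not matched as words:

    from datetime import date

    docs = [
        {"title": "IR systems", "author": "smith", "date": date(2001, 5, 1)},
        {"title": "Databases",  "author": "jones", "date": date(2002, 9, 3)},
    ]

    # Search one field only, and treat "date" as a typed value.
    hits = [d for d in docs
            if d["author"] == "smith" and d["date"] >= date(2001, 1, 1)]
    print([d["title"] for d in hits])   # ['IR systems']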
query protocols
• There are some standard languages
– Z39.50 queries
– CCL, "common command language" is a
development of Z39.50
– CD-RDx "compact disk read only data
exchange" is supported by US government
agencies such as CIA and NASA
– SFQL "structured full-text query language",
built on SQL
retrieval performance evaluation
• "Recall" and "Precision" are two classic measures
to measure the performance of information retrieval
in a single query.
• Both assume that there is an answer set of
documents that contain the answer to the query.
• Performance is optimal if
– the database returns all the documents in the answer set
– the database returns only documents in the answer set
• Recall is the fraction of the relevant documents that
the query result has captured.
• Precision is the fraction of the retrieved documents
that is relevant.
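In symbols, writing R for the answer set of relevant documents, A for the set of retrieved documents, and R_a = R ∩ A for the relevant documents retrieved, the usual definitions are:

    \text{recall} = \frac{|R_a|}{|R|},
    \qquad
    \text{precision} = \frac{|R_a|}{|A|}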
recall and precision curves
• Assume that all the retrieved documents arrive
at once and are examined in order.
• During that process, the user discovers more and
more relevant documents. Recall increases.
• During the same process, at least eventually,
there will be fewer and fewer useful documents.
Precision declines (usually).
• This can be represented as a curve.
Example
• Let the answer set be {0,1,2,3,4,5,6,7,8,9}
and non-relevant documents represented
by letters.
• A query reveals the following result:
7,a,3,b,c,9,n,j,l,5,r,o,s,e,4.
• For the first document, (recall, precision) is
(10%,100%), for the third, (20%,66%), for
the sixth (30%,50%), for the tenth
(40%,40%) etc.
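The example figures can be reproduced with a few lines of Python, representing documents as single characters as on the slide:

    relevant = set("0123456789")
    ranking = list("7a3bc9njl5rose4")

    found = 0
    for rank, doc in enumerate(ranking, 1):
        if doc in relevant:
            found += 1
            print(f"rank {rank}: recall {found / len(relevant):.0%}, "
                  f"precision {found / rank:.0%}")
    # rank 1:  recall 10%, precision 100%
    # rank 3:  recall 20%, precision 67%
    # rank 6:  recall 30%, precision 50%
    # rank 10: recall 40%, precision 40%
    # rank 15: recall 50%, precision 33%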
recall/precision curves
• Such curves can be formed for each
query.
• An average curve can be calculated over several
queries by averaging precision at each recall level.
• Recall and precision levels can also be
used to calculate two single-valued
summaries.
– average precision at seen document
– R-precision
average precision at seen
document
• To find it, sum the precision levels observed at
each new relevant document discovered by the
user, and divide by the number of relevant
documents retrieved.
• In our example, it is
(1 + 0.66 + 0.5 + 0.4 + 0.33)/5 ≈ 0.57.
• This measure favors retrieval methods that
get the relevant documents to the top.
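Continuing the example in Python:

    # Precision observed at each newly seen relevant document,
    # divided by the number of relevant documents seen (five here).
    precisions = [1/1, 2/3, 3/6, 4/10, 5/15]
    print(round(sum(precisions) / len(precisions), 2))
    # 0.58 with exact fractions; 0.57 when the precisions are first
    # rounded to two decimals, as in the figure quoted above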
R-precision
• a more ad hoc measure.
• Let R be the size of the answer set.
• Take the first R results of the query.
• Find the number of relevant documents among
them.
• Divide by R.
• In our example, the R-precision is 4/10 = 0.4.
• An average can be calculated for a number of
queries.
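And the R-precision of the same example:

    relevant = set("0123456789")
    ranking = list("7a3bc9njl5rose4")

    R = len(relevant)          # R = 10
    top_R = ranking[:R]        # 7,a,3,b,c,9,n,j,l,5
    print(sum(doc in relevant for doc in top_R) / R)   # 0.4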
critique of recall & precision
• Recall has to be estimated by an expert
• Recall is very difficult to estimate in a large
collection
• They focus on one query only. No serious
user works like this.
• There are some other measures, but that
is more for an advanced course in IR.
http://openlib.org/home/krichel
Thank you for your attention!