Text Mining
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside,CA 92521
eamonn@cs.ucr.edu
Text Mining/Information
Retrieval
• Task Statement:
Build a system that retrieves documents that users
are likely to find relevant to their queries.
• This assumption underlies the field of
Information Retrieval.
[Diagram: overview of an IR system. An information need is turned into a query ("How is the query constructed?"); text input from the collections is parsed, pre-processed, and indexed ("How is the text processed?"); documents are then ranked against the query and the results are evaluated.]
Terminology
Token: a natural language word, e.g., “Swim”, “Simpson”, “92513”, etc.
Document: usually a web page, but more generally any file.
Some IR History
– Roots in the scientific “Information Explosion” following
WWII
– Interest in computer-based IR from mid 1950’s
• H.P. Luhn at IBM (1958)
• Probabilistic models at Rand (Maron & Kuhns) (1960)
• Boolean system development at Lockheed (’60s)
• Vector Space Model (Salton at Cornell 1965)
• Statistical Weighting methods and theoretical advances (’70s)
• Refinements and Advances in application (’80s)
• User Interfaces, Large-scale testing and application (’90s)
Relevance
• In what ways can a document be relevant
to a query?
– Answer a precise question precisely.
  • Who is Homer’s boss? Montgomery Burns.
– Partially answer a question.
  • Where does Homer work? The Power Plant.
– Suggest a source for more information.
  • What is Bart’s middle name? Look in issue 234 of the fanzine.
– Give background information.
– Remind the user of other knowledge.
– Others ...
[Diagram: the IR system overview again, highlighting the parse / pre-process / index path.]

The section that follows is about Content Analysis
(transforming raw text into a computationally more manageable form)
Document Processing Steps
Figure from Baeza-Yates & Ribeiro-Neto
Stemming and Morphological Analysis
• Goal: “normalize” similar words
• Morphology (“form” of words)
– Inflectional Morphology
• E.g., inflect verb endings and noun number
• Never change grammatical class
– dog, dogs
– Bike, Biking
– Swim, Swimmer, Swimming
What about… build, building?
Examples of Stemming (using Porter’s algorithm)
Porter’s algorithm is available in Java, C, Lisp, Perl, Python, etc. from
http://www.tartarus.org/~martin/PorterStemmer/
Original Word   → Stemmed Word
…
consign         → consign
consigned       → consign
consigning      → consign
consignment     → consign
consist         → consist
consisted       → consist
consistency     → consist
consistent      → consist
consistently    → consist
consisting      → consist
consists        → consist
…
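As a quick illustration, the mapping above can be reproduced with an off-the-shelf Porter stemmer. A minimal sketch, assuming the NLTK package is installed:

```python
# Reproduce the Original -> Stemmed mapping with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["consign", "consigned", "consigning", "consignment",
         "consist", "consisted", "consistency", "consistent",
         "consistently", "consisting", "consists"]
for w in words:
    print(f"{w:14} -> {stemmer.stem(w)}")   # e.g. consigned -> consign
```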
Errors Generated by Porter
Stemmer (Krovetz 93)
Too Aggressive           Too Timid
organization / organ     european / europe
policy / police          cylinder / cylindrical
execute / executive      create / creation
arm / army               search / searcher
Statistical Properties of Text
• Token occurrences in text are not
uniformly distributed
• They are also not normally
distributed
• They do exhibit a Zipf distribution
Government documents: 157,734 tokens, 32,259 unique

Most frequent tokens:
8164 the
4771 of
4005 to
2834 a
2827 and
2802 in
1592 The
1370 for
1326 is
1324 s
1194 that
973 by
969 on
915 FT
883 Mr
860 was
855 be
849 Pounds
798 TEXT
798 PUB
798 PROFILE
798 PAGE
798 HEADLINE
798 DOCNO

Tokens occurring only once (a sample):
1 ABC
1 ABFT
1 ABOUT
1 ACFT
1 ACI
1 ACQUI
1 ACQUISITIONS
1 ACSIS
1 ADFT
1 ADVISERS
1 AE
Plotting Word Frequency by Rank
• Main idea: count
– How many times tokens occur in the text
• Over all texts in the collection
• Now order the tokens by how often they occur; a token’s position in this ordering is its rank.
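A minimal sketch of this counting-and-ranking step, using the two toy documents that appear later in these notes as the collection:

```python
# Count token occurrences over a small collection and rank them by frequency.
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())   # crude: lowercase, letters only

collection = [
    "Now is the time for all good men to come to the aid of their country",
    "It was a dark and stormy night in the country manor. The time was past midnight",
]

counts = Counter()
for doc in collection:
    counts.update(tokenize(doc))

# Rank 1 = most frequent token, rank 2 = next most frequent, and so on.
for rank, (token, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, freq, token)
```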
The Corresponding Zipf Curve

Rank   Freq   Term
 1     37     system
 2     32     knowledg
 3     24     base
 4     20     problem
 5     18     abstract
 6     15     model
 7     15     languag
 8     15     implem
 9     13     reason
10     13     inform
11     11     expert
12     11     analysi
13     10     rule
14     10     program
15     10     oper
16     10     evalu
17     10     comput
18     10     case
19      9     gener
20      9     form
Zipf Distribution
• The Important Points:
– a few elements occur very frequently
– a medium number of elements have medium
frequency
– many elements occur very infrequently
Zipf Distribution
• The product of the frequency of words (f) and
their rank (r) is approximately constant
– Rank = order of words’ frequency of occurrence
f ≈ C × (1/r),   where C ≈ N/10

• Another way to state this is with an approximately correct rule of thumb:
  – Say the most common term occurs C times
  – The second most common occurs C/2 times
  – The third most common occurs C/3 times
  – …
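As a quick numeric check of this rule of thumb, one can multiply frequency by rank for the terms in the small sample table above; under an exact Zipf law the product would be constant:

```python
# freq * rank should be roughly constant under Zipf's law.
freqs = [37, 32, 24, 20, 18, 15, 15, 15, 13, 13]   # ranks 1..10 from the sample table
for rank, f in enumerate(freqs, start=1):
    print(rank, f, f * rank)
# On this tiny sample the products drift (37, 64, 72, 80, 90, ...);
# the approximation is much better on large collections.
```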
Zipf Distribution
(linear and log scale)
Illustration by Jacob Nielsen
What Kinds of Data Exhibit a
Zipf Distribution?
• Words in a text collection
– Virtually any language usage
• Library book checkout patterns
• Incoming Web Page Requests
• Outgoing Web Page Requests
• Document Size on Web
• City Sizes
• …
Consequences of Zipf
• There are always a few very frequent tokens
that are not good discriminators.
– Called “stop words” in IR
• English examples: to, from, on, and, the, ...
• There are always a large number of tokens
that occur once and can mess up algorithms.
• Medium frequency words most descriptive
Word Frequency vs. Resolving
Power (from van Rijsbergen 79)
The most frequent words are not the most descriptive.
Statistical Independence
Two events x and y are statistically independent if the product of the
probabilities of their happening individually equals the probability of
their happening together:

P(x) P(y) = P(x, y)
Statistical Independence
and Dependence
• What are examples of things that are
statistically independent?
• What are examples of things that are
statistically dependent?
Lexical Associations
• Subjects write first word that comes to mind
– doctor/nurse; black/white (Palermo & Jenkins 64)
• Text Corpora yield similar associations
• One measure: Mutual Information (Church and Hanks 89)
I(x, y) = log2 [ P(x, y) / ( P(x) P(y) ) ]
• If word occurrences were independent, the numerator and
denominator would be equal (if measured across a large
collection)
Statistical Independence
• Compute over a window of words

P(x) × P(y) = P(x, y) if independent

P(x) = f(x) / N

We’ll approximate P(x, y) as follows:

P(x, y) ≈ (1/N) × Σ_{i=1..N−|w|} w_i(x, y)

where
|w| = length of window w (say 5)
w_i = the window of words starting at position i
w_i(x, y) = number of times x and y co-occur in window w_i
N = number of words in the collection

[Illustration: windows w1, w11, w21 sliding over the text “abcdefghij klmnop …”]
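A minimal sketch of this windowed estimate, with a toy token list standing in for a real corpus; the window length and the word pair are just examples, and windows containing both words are used as a simple version of w_i(x, y):

```python
# Windowed pointwise mutual information estimate for a pair of words.
import math
from collections import Counter

def mutual_information(tokens, x, y, w=5):
    N = len(tokens)
    f = Counter(tokens)                       # unigram counts, f(x)
    pairs = 0
    for i in range(N - w + 1):
        window = tokens[i:i + w]
        if x in window and y in window:       # count windows containing both x and y
            pairs += 1
    p_x, p_y = f[x] / N, f[y] / N
    p_xy = pairs / N                          # P(x, y) approximated over windows
    if p_xy == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

tokens = "the doctor told the nurse that the doctor would treat the patient".split()
print(mutual_information(tokens, "doctor", "nurse"))
```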
Interesting Associations with “Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)
I(x,y)   f(x,y)   f(x)    x          f(y)    y
11.3     12       111     Honorary   621     Doctor
11.3     8        1105    Doctors    44      Dentists
10.7     30       1105    Doctors    241     Nurses
 9.4     8        1105    Doctors    154     Treating
 9.0     6        275     Examined   621     Doctor
 8.9     11       1105    Doctors    317     Treat
 8.7     25       621     Doctor     1407    Bills
Un-Interesting Associations with
“Doctor”
(AP Corpus, N=15 million, Church & Hanks 89)
I(x,y)   f(x,y)   f(x)     x        f(y)    y
0.96     6        621      doctor   73785   with
0.95     41       284690   a        1105    doctors
0.93     12       84716    is       1105    doctors
These associations were likely to happen because
the non-doctor words shown here are very common
and therefore likely to co-occur with any noun.
Associations Are Important Because…
• We may be able to discover phrases that should be treated as a single
word, e.g., “data mining”.
• We may be able to automatically discover synonyms, e.g., “Bike” and
“Bicycle”.
Content Analysis Summary
• Content Analysis: transforming raw text into more
computationally useful forms
• Words in text collections exhibit interesting
statistical properties
– Word frequencies have a Zipf distribution
– Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
– Pre-processing includes tokenization, stemming,
collocations/phrases
[Diagram: the IR system overview again, highlighting the question “How is the index constructed?”]

The section that follows is about Index Construction
Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
– Invert documents into a big index
• Basic steps:
– Make a “dictionary” of all the tokens in the collection
– For each token, list all the docs it occurs in.
– Do a few things to reduce redundancy in the data structure
How Are Inverted Files Created
• Documents are parsed to extract tokens.
These are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term       Doc #
now        1
is         1
the        1
time       1
for        1
all        1
good       1
men        1
to         1
come       1
to         1
the        1
aid        1
of         1
their      1
country    1
it         2
was        2
a          2
dark       2
and        2
stormy     2
night      2
in         2
the        2
country    2
manor      2
the        2
time       2
was        2
past       2
midnight   2
How Inverted
Files are Created
• After all documents
have been parsed the
inverted file is sorted
alphabetically.
(The term list from the previous step is shown again on the left of the slide;
on the right it has been sorted alphabetically:)

Term       Doc #
a          2
aid        1
all        1
and        2
come       1
country    1
country    2
dark       2
for        1
good       1
in         2
is         1
it         2
manor      2
men        1
midnight   2
night      2
now        1
of         1
past       2
stormy     2
the        1
the        1
the        2
the        2
their      1
time       1
time       2
to         1
to         1
was        2
was        2
How Inverted
Files are Created
• Multiple term entries
for a single document
are merged.
• Within-document term
frequency information
is compiled.
(The sorted term list is shown again on the left of the slide; on the right,
duplicate entries within a document have been merged and a frequency column added:)

Term       Doc #   Freq
a          2       1
aid        1       1
all        1       1
and        2       1
come       1       1
country    1       1
country    2       1
dark       2       1
for        1       1
good       1       1
in         2       1
is         1       1
it         2       1
manor      2       1
men        1       1
midnight   2       1
night      2       1
now        1       1
of         1       1
past       2       1
stormy     2       1
the        1       2
the        2       2
their      1       1
time       1       1
time       2       1
to         1       2
was        2       2
How Inverted Files are Created
• Then the file can be split into
– A Dictionary file
and
– A Postings file
How Inverted Files are Created
(The merged term/doc/frequency list is split into a Dictionary file and a Postings file:)

Dictionary file:
Term       N docs   Tot Freq
a          1        1
aid        1        1
all        1        1
and        1        1
come       1        1
country    2        2
dark       1        1
for        1        1
good       1        1
in         1        1
is         1        1
it         1        1
manor      1        1
men        1        1
midnight   1        1
night      1        1
now        1        1
of         1        1
past       1        1
stormy     1        1
the        2        4
their      1        1
time       2        2
to         1        2
was        1        2

Postings file (one entry per term/document pair, in dictionary order):
Doc #   Freq
2       1      (a)
1       1      (aid)
1       1      (all)
2       1      (and)
1       1      (come)
1       1      (country)
2       1      (country)
2       1      (dark)
1       1      (for)
1       1      (good)
2       1      (in)
1       1      (is)
2       1      (it)
2       1      (manor)
1       1      (men)
2       1      (midnight)
2       1      (night)
1       1      (now)
1       1      (of)
2       1      (past)
2       1      (stormy)
1       2      (the)
2       2      (the)
1       1      (their)
1       1      (time)
2       1      (time)
1       2      (to)
2       2      (was)
Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2
• Also used for statistical ranking algorithms
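A minimal sketch of these ideas in code, building a small inverted index (dictionary plus per-term postings) over the two example documents and answering the Boolean query "country AND manor"; the tokenization is deliberately naive:

```python
# Build an inverted index and intersect posting lists for a Boolean AND query.
from collections import defaultdict, Counter

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

index = defaultdict(dict)                 # term -> {doc_id: within-doc frequency}
for doc_id, text in docs.items():
    for term, freq in Counter(text.split()).items():
        index[term][doc_id] = freq

def boolean_and(t1, t2):
    # Intersect the posting lists (sets of doc ids) of the two terms.
    return sorted(set(index.get(t1, {})) & set(index.get(t2, {})))

print(sorted(index["country"]))           # [1, 2]
print(sorted(index["manor"]))             # [2]
print(boolean_and("country", "manor"))    # [2]
```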
How Inverted Files are Used
(The Dictionary and Postings files constructed above are consulted to answer the query.)
Query on
“time” AND “dark”
2 docs have “time” in the dictionary -> IDs 1 and 2 from the postings file
1 doc has “dark” in the dictionary -> ID 2 from the postings file
Therefore, only doc 2 satisfies the query.
[Diagram: the IR system overview again, now focusing on the Query and Rank steps.]

The section that follows is about Querying (and ranking)
Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
• words
• normalized (stemmed) words
• phrases
– connectors
• AND
• OR
• NOT
• NEAR (Pseudo-Boolean)
Word      Doc
Cat        x
Dog
Collar     x
Leash
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
– Each of the following combinations works:
[Table: example combinations of Cat, Dog, Collar, and Leash; each satisfying
combination contains at least one of {Cat, Dog} and at least one of
{Collar, Leash}.]
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
– None of the following combinations work:
[Table: example combinations that lack either a {Cat, Dog} term or a
{Collar, Leash} term, and so do not satisfy the query.]
Boolean Searching
Information need: “Measurement of the width of cracks in prestressed concrete beams”

Concepts: Cracks; Width measurement; Beams; Prestressed concrete

Formal Query:
cracks AND beams AND width_measurement AND prestressed_concrete

Relaxed Query:
(C AND B AND P) OR (C AND B AND W) OR
(C AND W AND P) OR (B AND W AND P)
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
– order chronologically
– order by total number of “hits” on query terms
• What if one term has more hits than the others?
• Is it better to have one hit on each term, or many hits on one term?
Boolean Model
• Advantages
– simple queries are easy to understand
– relatively easy to implement
• Disadvantages
– difficult to specify what is wanted
– too much returned, or too little
– ordering not well determined
• Dominant language in commercial Information
Retrieval systems until the WWW
Since the Boolean model is limited, let’s consider a generalization…
Vector Model
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
– A vector is like an array of floating-point numbers
– It has a direction and a magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse
• Smithers secretly loves Monty Burns
• Monty Burns secretly loves Smithers
Both map to…
[ Burns, loves, Monty, secretly, Smithers]
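A minimal sketch of this bag-of-words mapping: both sentences produce the same vector over the shared vocabulary, because word order is discarded.

```python
# Map two sentences onto count vectors over a shared, alphabetically sorted vocabulary.
from collections import Counter

def to_vector(text, vocabulary):
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

s1 = "Smithers secretly loves Monty Burns"
s2 = "Monty Burns secretly loves Smithers"
vocab = sorted(set(s1.split()) | set(s2.split()), key=str.lower)

print(vocab)                  # [Burns, loves, Monty, secretly, Smithers]
print(to_vector(s1, vocab))   # identical to...
print(to_vector(s2, vocab))   # ...this vector: word order is discarded
```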
Document Vectors
One location for each word

[Table: example term-frequency vectors for documents A–I over the terms
nova, galaxy, heat, h’wood, film, role, diet, fur; most cells are empty,
i.e. the vectors are sparse.]
We Can Plot the Vectors

[Plot: documents placed in a 2-D space with axes “Star” and “Diet”: a doc
about movie stars, a doc about astronomy, and a doc about mammal behavior.]
Documents in 3D Vector Space

[Figure: documents D1–D11 plotted in a three-dimensional vector space with
axes t1, t2, t3. Illustration from Jurafsky & Martin.]
Vector Space Model
docs   Homer   Marge   Bart
D1       *               *
D2       *
D3               *       *
D4       *
D5       *       *       *
D6       *       *
D7               *
D8               *
D9                       *
D10              *       *
D11      *               *
Q                *
Note that the query is projected
into the same vector space as the
documents.
The query here is for “Marge”.
We can use a vector similarity
model to determine the best match
to our query (details in a few slides).
But what weights should we use
for the terms?
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
– Recall the Zipf distribution
– Want to weight terms highly if they are
• frequent in relevant documents … BUT
• infrequent in the collection as a whole
Binary Weights
• Only the presence (1) or absence (0) of a
term is included in the vector
docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1
D11    1    0    1
We have already
seen and discussed
this model.
Raw Term Weights
• The frequency of occurrence for the term in
each document is included in the vector
Counts can be
normalized by
document lengths.
docs   t1   t2   t3
D1     2    0    3
D2     1    0    0
D3     0    4    7
D4     3    0    0
D5     1    6    3
D6     3    5    0
D7     0    8    0
D8     0   10    0
D9     0    0    1
D10    0    3    5
D11    4    0    1
This model is open
to exploitation by
websites…
sex sex sex sex sex
sex sex sex sex sex
sex sex sex sex sex
sex sex sex sex sex
sex sex sex sex sex
tf * idf Weights
• tf * idf measure:
– term frequency (tf)
– inverse document frequency (idf) -- a way to
deal with the problems of the Zipf distribution
• Goal: assign a tf * idf weight to each term
in each document
tf * idf
w_ik = tf_ik × log(N / n_k)

where
T_k   = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C
N     = total number of documents in the collection C
n_k   = the number of documents in C that contain T_k

idf_k = log(N / n_k)
Inverse Document Frequency
• IDF provides high values for rare words and
low values for common words
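A minimal sketch of the idf and tf*idf weights defined above; it also reproduces the example values shown below for a collection of 10,000 documents (base-10 logarithm, as in those examples):

```python
# Inverse document frequency and tf*idf weighting.
import math

def idf(N, n_k):
    return math.log10(N / n_k)

def tf_idf(tf_ik, N, n_k):
    return tf_ik * idf(N, n_k)

N = 10000
for n_k in (10000, 5000, 20, 1):
    print(n_k, round(idf(N, n_k), 3))   # 0.0, 0.301, 2.699, 4.0
```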
idf_k = log(N / n_k)

For a collection of 10,000 documents:
log(10000 / 10000) = 0
log(10000 / 5000)  = 0.301
log(10000 / 20)    = 2.698
log(10000 / 1)     = 4
Similarity Measures
Simple matching (coordination level match):   |Q ∩ D|

Dice’s Coefficient:      2 |Q ∩ D| / ( |Q| + |D| )

Jaccard’s Coefficient:   |Q ∩ D| / |Q ∪ D|

Cosine Coefficient:      |Q ∩ D| / ( |Q|^(1/2) × |D|^(1/2) )

Overlap Coefficient:     |Q ∩ D| / min(|Q|, |D|)
Cosine
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
Q  = (0.4, 0.8)

cos θ1 = 0.74   (angle between Q and D1)
cos θ2 = 0.98   (angle between Q and D2)

[Figure: Q, D1, and D2 plotted as vectors in the unit square; D2 lies much
closer in angle to Q than D1 does.]
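A minimal sketch that reproduces the cosine values in the figure above:

```python
# Cosine similarity between a query vector and two document vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2))   # 0.73 (the slide rounds this to 0.74)
print(round(cosine(Q, D2), 2))   # 0.98
```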
Problems with Vector Space
• There is no real theoretical basis for the
assumption of a term space
– it is more for visualization than having any real
basis
– most similarity measures work about the same
regardless of model
• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms
Probabilistic Models
• Rigorous formal model attempts to predict
the probability that a given document will
be relevant to a given query
• Ranks retrieved documents according to this
probability of relevance (Probability
Ranking Principle)
• Rely on accurate estimates of probabilities
Relevance Feedback
• Main Idea:
– Modify existing query based on relevance judgements
• Query Expansion: extract terms from relevant documents and
add them to the query
• Term Re-weighting: and/or re-weight the terms already in the
query
– Two main approaches:
• Automatic (pseudo-relevance feedback)
• Users select relevant documents
– Users/system select terms from an automatically generated list
Definition: Relevance Feedback is the reformulation of a search query in response
to feedback provided by the user for the results of previous versions of the query.
Suppose you are interested in bovine agriculture on
the banks of the river Jordan…
Term Vector:   [Jordan, Bank, Bull, River]
Term Weights:  [  1  ,   1  ,   1  ,   1  ]

Search → Display Results → Gather Feedback → Update Weights

Term Vector:   [Jordan, Bank, Bull, River]
Term Weights:  [ 1.1 ,  0.1 ,  1.3 ,  1.2 ]
Rocchio Method
Q1 = Q0 + β · (1/n1) Σ_{i=1..n1} R_i  −  γ · (1/n2) Σ_{i=1..n2} S_i

where
Q0  = the vector for the initial query
R_i = the vector for relevant document i
S_i = the vector for non-relevant document i
n1  = the number of relevant documents chosen
n2  = the number of non-relevant documents chosen

β and γ tune the importance of relevant and non-relevant terms
(in some studies it is best to set β to 0.75 and γ to 0.25)
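A minimal sketch of this update with β = 0.75 and γ = 0.25; the query and document vectors below are made up for illustration (loosely following the Jordan/Bank/Bull/River example above):

```python
# Rocchio query update: move the query toward relevant docs, away from non-relevant docs.
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    n1, n2 = len(relevant), len(nonrelevant)
    q1 = list(q0)
    for i in range(len(q0)):
        if n1:
            q1[i] += beta * sum(r[i] for r in relevant) / n1
        if n2:
            q1[i] -= gamma * sum(s[i] for s in nonrelevant) / n2
    return q1

q0 = [1.0, 1.0, 1.0, 1.0]                  # [Jordan, Bank, Bull, River]
relevant = [[0.9, 0.0, 1.0, 0.8]]          # a doc the user marked relevant (made up)
nonrelevant = [[0.1, 1.0, 0.0, 0.2]]       # a doc the user marked non-relevant (made up)
print(rocchio(q0, relevant, nonrelevant))
```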
Rocchio Illustration
Although we usually work in vector space for text, it is
easier to visualize in Euclidean space
Original Query
Term Re-weighting
Note that both the location of
the center, and the shape of
the query have changed
Query Expansion
Rocchio Method
• Rocchio automatically
– re-weights terms
– adds in new terms (from relevant docs)
• have to be careful when using negative terms
• Rocchio is not a machine learning algorithm
• Most methods perform similarly
– results heavily dependent on test collection
• Machine learning methods are proving to work
better than standard IR approaches like Rocchio
Using Relevance Feedback
• Known to improve results
• People don’t seem to like giving feedback!
Relevance Feedback for Time Series
The original query
The weight vector.
Initially, all weights
are the same.
Note: In this example we are using a piecewise linear
approximation of the data. We will learn more about this
representation later.
The initial query is
executed, and the five
best matches are
shown (in the
dendrogram)
One by one the 5 best
matching sequences
will appear, and the
user will rank each of
them from very bad
(-3) to very good (+3).
Based on the user
feedback, both the
shape and the weight
vector of the query are
changed.
The new query can be
executed.
The hope is that the
query shape and
weights will converge
to the optimal query.
Two papers consider relevance feedback for time series.
Query Expansion
L. Wu, C. Faloutsos, K. Sycara, T. Payne: FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306
Term Re-weighting
Keogh, E. & Pazzani, M. Relevance feedback retrieval of time series data. In Proceedings
of SIGIR 99
Document Space has High
Dimensionality
• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many
tokens are shared in common.
• More terms -> harder to understand which
subsets of words are shared among similar
documents.
• One approach to handling high
dimensionality: clustering
Text Clustering
• Finds overall similarities among groups of
documents.
• Finds overall similarities among groups of
tokens.
• Picks out some themes, ignores others.
Scatter/Gather
Hearst & Pedersen 95
• Cluster sets of documents into general “themes”, like
a table of contents (using K-means)
• Display the contents of the clusters by showing topical
terms and typical titles
• User chooses subsets of the clusters and re-clusters the
documents within
• Resulting new groups have different “themes”
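A minimal Scatter/Gather-style sketch: cluster documents into "themes" with K-means over tf*idf vectors. This assumes scikit-learn is installed, and the toy documents are made up for illustration:

```python
# Cluster a few toy documents into themes with K-means over tf*idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "star of the new film plays a leading role",
    "the telescope observed a distant star and galaxy",
    "the film star signed for another hollywood role",
    "astronomers measured the heat of the collapsing star",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster id per document, e.g. film docs vs. astronomy docs
```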
S/G Example: query on “star”
Encyclopedia text
8 symbols
68 film, tv (p)
97 astrophysics
67 astronomy(p)
10 flora/fauna
14 sports
47 film, tv
7 music
12 stellar phenomena
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated
Ego Surfing!
http://vivisimo.com/
[Diagram: the IR system overview again, now focusing on the Evaluate step.]

The section that follows is about Evaluation
Evaluation
• Why Evaluate?
• What to Evaluate?
• How to Evaluate?
Why Evaluate?
• Determine if the system is desirable
• Make comparative assessments
• Others?
What to Evaluate?
• How much of the information need is
satisfied.
• How much was learned about a topic.
• Incidental learning:
– How much was learned about the collection.
– How much was learned about other topics.
• How inviting the system is.
What to Evaluate?
Effectiveness: what can be measured that reflects users’ ability
to use the system? (Cleverdon 66)

– Coverage of Information
– Form of Presentation
– Effort required / Ease of Use
– Time and Space Efficiency
– Recall
  • proportion of relevant material actually retrieved
– Precision
  • proportion of retrieved material actually relevant
Relevant vs. Retrieved
[Venn diagram: within the set of all docs, the set of Retrieved documents
overlaps the set of Relevant documents.]
Precision vs. Recall
Precision = |Relevant ∩ Retrieved| / |Retrieved|

Recall = |Relevant ∩ Retrieved| / |Relevant in Collection|

[Venn diagram: retrieved vs. relevant documents within the set of all docs.]
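A minimal sketch of these two definitions, with made-up sets of retrieved and relevant document ids:

```python
# Precision and recall from sets of retrieved and relevant document ids.
retrieved = {1, 2, 3, 4, 5}
relevant  = {2, 4, 6, 8}

hits = retrieved & relevant                  # relevant AND retrieved
precision = len(hits) / len(retrieved)       # 2/5 = 0.4
recall    = len(hits) / len(relevant)        # 2/4 = 0.5
print(precision, recall)
```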
Why Precision and Recall?
Intuition:
Get as much of the good stuff as possible while, at the same time,
getting as little junk as possible.
Retrieved vs. Relevant Documents

[Venn diagrams of the retrieved set against the relevant set, illustrating:]
• Very high precision, very low recall
• Very low precision, very low recall (0 in fact)
• High recall, but low precision
• High precision, high recall (at last!)
Precision/Recall Curves
• There is a tradeoff between Precision and Recall
• So measure Precision at different levels of Recall
• Note: this is an AVERAGE over MANY queries
[Plot: precision (y-axis) against recall (x-axis), with a precision value
marked at each of several recall levels.]
Precision/Recall Curves
• Difficult to determine which of these two hypothetical
results is better:
[Plot: two hypothetical precision/recall curves.]
Precision/Recall Curves
Recall under various retrieval assumptions

[Plot: recall (0.0–1.0) vs. proportion of documents retrieved, for Perfect,
Tangent Parabolic Recall, Parabolic Recall, random, and Perverse retrieval;
collection of 1000 documents, 100 relevant.]
Precision under various assumptions

[Plot: precision (0.0–1.0) vs. proportion of documents retrieved, for Perfect,
Tangent Parabolic Recall, Parabolic Recall, random, and Perverse retrieval;
collection of 1000 documents, 100 relevant.]
Document Cutoff Levels
• Another way to evaluate:
– Fix the number of documents retrieved at several levels:
• top 5
• top 10
• top 20
• top 50
• top 100
• top 500
– Measure precision at each of these levels
– Take (weighted) average over results
• This is a way to focus on how well the system ranks the
first k documents.
Problems with Precision/Recall
• Can’t know true recall value
– except in small collections
• Precision/Recall are related
– A combined measure sometimes more appropriate
• Assumes batch mode
– Interactive IR is important and has different criteria for
successful searches
– Assumes a strict rank ordering matters.
Relation to Contingency Table
                        Doc is Relevant    Doc is NOT relevant
Doc is retrieved              a                    b
Doc is NOT retrieved          c                    d

• Accuracy:  (a + d) / (a + b + c + d)
• Precision: a / (a + b)
• Recall:    a / (a + c)

• Why don’t we use Accuracy for IR?
  – (Assuming a large collection)
  – Most docs aren’t relevant
  – Most docs aren’t retrieved
  – This inflates the accuracy value
The E-Measure
Combine Precision and Recall into one number (van
Rijsbergen 79)
E = 1 − (1 + b²) / ( b²/R + 1/P )
P = precision
R = recall
b = measure of relative importance of P or R
For example,
b = 0.5 means user is twice as interested in
precision as recall
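A minimal sketch of this measure, reusing the precision and recall values from the earlier example:

```python
# van Rijsbergen's E measure: E = 1 - (1 + b^2) / (b^2/R + 1/P).
def e_measure(precision, recall, b=1.0):
    return 1 - (1 + b**2) / (b**2 / recall + 1 / precision)

# b = 0.5 weights precision more heavily than recall.
print(round(e_measure(0.4, 0.5, b=0.5), 3))
```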
How to Evaluate?
Test Collections
Test Collections
• Cranfield 2
  – 1400 Documents, 221 Queries
  – 200 Documents, 42 Queries
• INSPEC – 542 Documents, 97 Queries
• UKCIS – >10,000 Documents, multiple sets, 193 Queries
• ADI – 82 Documents, 35 Queries
• CACM – 3204 Documents, 50 Queries
• CISI – 1460 Documents, 35 Queries
• MEDLARS (Salton) – 273 Documents, 18 Queries
TREC
• Text REtrieval Conference/Competition
– Run by NIST (National Institute of Standards & Technology)
– 2002 (November) will be 11th year
• Collection: >6 Gigabytes (5 CD-ROMs), >1.5
Million Docs
– Newswire & full text news (AP, WSJ, Ziff, FT)
– Government documents (federal register, Congressional
Record)
– Radio Transcripts (FBIS)
– Web “subsets”
TREC (cont.)
• Queries + Relevance Judgments
– Queries devised and judged by “Information Specialists”
– Relevance judgments done only for those documents
retrieved -- not entire collection!
• Competition
– Various research and commercial groups compete (TREC
6 had 51, TREC 7 had 56, TREC 8 had 66)
– Results judged on precision and recall, going up to a
recall level of 1000 documents
TREC
• Benefits:
– made research systems scale to large collections (pre-WWW)
– allows for somewhat controlled comparisons
• Drawbacks:
– emphasis on high recall, which may be unrealistic for
what most users want
– very long queries, also unrealistic
– comparisons still difficult to make, because systems are
quite different on many dimensions
– focus on batch ranking rather than interaction
– no focus on the WWW
TREC is changing
• Emphasis on specialized “tracks”
– Interactive track
– Natural Language Processing (NLP) track
– Multilingual tracks (Chinese, Spanish)
– Filtering track
– High-Precision
– High-Performance
• http://trec.nist.gov/
What to Evaluate?
• Effectiveness
– Difficult to measure
– Recall and Precision are one way
– What might be others?