http://net.pku.edu.cn/~wbia
黄连恩 hle@net.pku.edu.cn
北京大学信息工程学院
10/21/2014
IR Evaluation Methodology
Measures for a search engine
Speed of index construction
Number of documents/hour
Document size
Search speed
Latency as a function of index size
Throughput as a function of index size
Expressiveness of the query language
Ability to express complex information needs
Speed on complex queries
These criteria are measurable.
But the key measure is user happiness
How do we measure it quantitatively?
Measuring user happiness
Issue: who is the user?
Web search engine: the user finds what they want and returns to the engine
Can measure the rate of returning users
eCommerce site: the user finds what they want and makes a purchase
Is it the end-user, or the eCommerce site, whose happiness we measure?
Measure time to purchase, or fraction of searchers who become buyers?
Enterprise (company/govt/academic): Care about “ user productivity ”
How much time do my users save when looking for information?
Many other criteria having to do with breadth of access, secure access, etc.
Happiness: elusive to measure
Commonest proxy: relevance of search results
But how do you measure relevance?
Methodology: test collection (corpus)
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Some work on more-than-binary, but not the standard
Evaluating an IR system
Note: the information need is translated into a query; relevance is assessed relative to the information need, not the query
E.g.,
Information need : I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query : wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has those words
• TREC - National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
• Test collections consisting of documents , queries , and relevance judgments , e.g.,
– CACM : Titles and abstracts from the Communications of the
ACM from 1958-1979. Queries and relevance judgments generated by computer scientists.
– AP : Associated Press newswire documents from 1988-1990
(from TREC disks 1-3). Queries are the title fields from TREC topics 51-150. Topics and relevance judgments generated by government information analysts.
– GOV2 : Web pages crawled from websites in the .gov domain during early 2004. Queries are the title fields from TREC topics 701-850. Topics and relevance judgments generated by government analysts.
• Obtaining relevance judgments is an expensive, time-consuming process
– who does it?
– what are the instructions?
– what is the level of agreement?
• TREC judgments
– depend on task being evaluated
– generally binary
– agreement good because of “narrative”
• Exhaustive judgments for all documents in a collection are not practical
• Pooling technique is used in TREC
– top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines
(or retrieval algorithms) are merged into a pool
– duplicates are removed
– documents are presented in some random order to the relevance judges
• Produces a large number of relevance judgments for each query, although still incomplete
IR Evaluation Metrics
Unranked retrieval evaluation:
Precision and Recall
Precision: the fraction of retrieved documents that are relevant =
P(relevant|retrieved)
Recall: the fraction of relevant documents that are retrieved =
P(retrieved|relevant)
                Relevant      Not Relevant
Retrieved          tp              fp
Not Retrieved      fn              tn
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
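A minimal Python sketch (assumed, not from the slides) of these definitions, given the four counts for one query:

def precision_recall(tp, fp, fn):
    # Precision: fraction of retrieved documents that are relevant.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 20 documents retrieved, 6 of them relevant, 8 relevant in the whole collection:
# precision_recall(6, 14, 2) -> (0.3, 0.75)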
Accuracy
Given a query, the search engine classifies each document as “Relevant” or “Irrelevant”.
Accuracy of an engine: the fraction of these classifications that are correct.
Accuracy = (tp + tn)/(tp + fp + tn + fn)
Is this a very useful evaluation measure in IR?
                Relevant      Not Relevant
Retrieved          tp              fp
Not Retrieved      fn              tn
Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget…
Search for:
0 matching results found.
People doing information retrieval want to find something and have a certain tolerance for junk.
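For illustration (numbers invented here): in a collection of 10,000 documents with only 8 relevant to a query, an engine that returns nothing has tp = 0, fp = 0, fn = 8, tn = 9,992, so Accuracy = (0 + 9992)/10000 = 99.92%, even though it retrieves nothing relevant (recall = 0).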
Precision and recall when ranked
Extend the set-based definitions to a ranked list
At each document in the ranked list, compute a P/R point
Which of the values computed this way are useful?
Consider a P/R point for each relevant document
Consider value only at fixed rank cutoffs
e.g., precision at rank 20
Consider value only at fixed recall points
e.g., precision at 20% recall
May be more than one precision value at a recall point
Precision and Recall example
Average precision of a query
Often want a single-number effectiveness measure
Average precision is widely used in IR
Calculated by averaging the precision values at the points where recall increases (i.e., at each relevant document retrieved)
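A minimal Python sketch (assumed, not part of the slides) of this computation for one query:

def average_precision(ranked_rel, num_relevant):
    # ranked_rel: 0/1 relevance judgments in rank order, e.g. [1, 1, 0, 1, ...]
    # num_relevant: total number of relevant documents for the query
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant document
    # Relevant documents that are never retrieved contribute a precision of 0.
    return sum(precisions) / num_relevant if num_relevant else 0.0

# average_precision([1, 0, 1, 0, 0], 2) -> (1/1 + 2/3) / 2 ≈ 0.83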
Recall/precision graphs
Average precision vs. P/R graph
AP hides information
The recall/precision graph has an odd sawtooth shape if plotted directly
But P/R graphs are hard to compare
Precision and Recall, toward averaging
Averaging graphs: a false start
How can graphs be averaged?
Different queries have different recall values
What is precision at 25% recall?
Interpolate
How?
Interpolation of graphs
Possible interpolation methods
No interpolation
Not very useful
Connect the dots
Not a function
Connect max
Connect min
Connect average
…
How should 0% recall be handled?
Assume 0?
Assume best?
Constant start?
How to choose?
A good retrieval system has this property: on average, as recall increases, its precision decreases
Verified time and time again
On average
Interpolate so that the function is monotonically decreasing
E.g., moving from left to right, take the largest precision value at or to the right as the interpolated value
where S is the set of observed (R,P) points
The result is a step function
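The rule referred to above is usually written as
P_interp(R) = max { P' : (R', P') ∈ S and R' ≥ R }
i.e. the interpolated precision at recall level R is the highest precision observed at any recall level at or above R.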
Our example, interpolated this way
monotonically decreasing
Handles 0% recall smoothly
Averaging graphs: using interpolation
Asked: what is precision at 25% recall?
Interpolate values
Averaging across queries
Averaging over multiple queries
Micro-average: each relevant document is a point used to compute the average
Macro-average: each query is a point used to compute the average
Average of many queries’ average precision values
Called mean average precision ( MAP )
“Average average precision” sounds weird
Most common
Interpolated average precision
Average precision at standard recall points
For a given query, compute a P/R point for every relevant doc.
Interpolate precision at standard recall levels
11-pt is usually 100%, 90, 80, …, 10, 0% (yes, 0% recall)
3-pt is usually 75%, 50%, 25%
Average over all queries to get average precision at each recall level
Average over the interpolated recall levels to get a single result
Called “interpolated average precision”
Not used much anymore; MAP “mean average precision” more common
Values at specific interpolated points still commonly used
Interpolation and averaging
A combined measure: F
The F measure, a combined P/R metric (weighted harmonic mean):
F = 1 / (α·(1/P) + (1 - α)·(1/R)) = (β² + 1)·P·R / (β²·P + R),  where β² = (1 - α)/α
Usually the balanced F1 measure is used (β = 1, i.e. α = ½)
The harmonic mean is a conservative average; it heavily penalizes low values of P or R
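A minimal Python sketch (assumed, not from the slides) of the weighted F measure:

def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall;
    # beta > 1 weights recall higher, beta < 1 weights precision higher.
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

# Balanced F1: f_measure(0.3, 0.75) -> 0.43 (approximately)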
Averaging F, example
Q-bad has 1 relevant document
Retrieved at rank 1000
(R, P) = (1, 0.001)
F value of 0.2%, so AvgF = 0.2%
Q-perfect has 10 relevant documents
Retrieved at ranks 1-10
(R,P) = (.1,1), (.2,1), …, (1,1)
F values of 18%, 33%, …, 100%, so AvgF = 66.2%
Macro average
(0.2% + 66.2%) / 2 = 33.2%
Micro average
(0.2% + 18% + … + 100%) / 11 = 60.2%
• Users tend to look at only the top part of the ranked result list to find relevant documents
• Some search tasks have only one relevant document
– e.g., navigational search , question answering
• Recall not appropriate
– instead need to measure how well the search engine does at retrieving relevant documents at very high ranks
• Precision at N
• Precision at Rank R
– R typically 5, 10, 20
– easy to compute, average, understand
– not sensitive to rank positions less than R
• Reciprocal Rank
– reciprocal of the rank at which the first relevant document is retrieved
– Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries
– very sensitive to rank position
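A minimal Python sketch (assumed, not from the slides) of these measures, using 0/1 relevance lists in rank order:

def precision_at(ranked_rel, k):
    # Fraction of the top-k results that are relevant (missing ranks count as non-relevant).
    return sum(ranked_rel[:k]) / k

def reciprocal_rank(ranked_rel):
    # 1/rank of the first relevant document; 0 if none is retrieved.
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    # MRR: average of the reciprocal ranks over a set of queries.
    return sum(reciprocal_rank(r) for r in runs) / len(runs)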
• Popular measure for evaluating web search and related tasks
• Two assumptions :
– Highly relevant documents are more useful than marginally relevant documents
– the lower the ranked position of a relevant document, the less useful it is for the user,
• since it is less likely to be examined
• Uses graded relevance as a measure of the usefulness, or gain, from examining a document
• Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
• Typical discount is 1/log(rank)
– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
• DCG is the total gain accumulated at a particular rank p:
• Alternative formulation:
– used by some web search companies
– emphasis on retrieving highly relevant documents
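The two formulations referred to above are presumably the standard ones:
DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)
DCG_p = sum_{i=1..p} (2^rel_i - 1) / log2(i + 1)    (alternative form)
where rel_i is the graded relevance of the document at rank i; the worked example below is consistent with the first form.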
• 10 ranked documents judged on 0-3 relevance scale:
3, 2, 3, 0, 0, 1, 2, 2, 3, 0
• discounted gain:
3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
• DCG:
3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
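A minimal Python sketch (assumed, not from the slides) that reproduces these numbers:

import math

def dcg(gains):
    # Cumulative gain with a 1/log2(rank) discount applied from rank 2 onward.
    total, values = 0.0, []
    for rank, g in enumerate(gains, start=1):
        total += g if rank == 1 else g / math.log2(rank)
        values.append(total)
    return values

# dcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0])
# -> [3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61] (rounded to two decimals)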
• DCG values are often normalized by comparing the DCG at each rank with the
DCG value for the perfect ranking
– makes averaging easier for queries with different numbers of relevant documents
• Perfect ranking:
3, 3, 3, 2, 2, 2, 1, 0, 0, 0
• ideal DCG values:
3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
• NDCG values (divide actual by ideal):
1, 0.83, 0.87, 0.76, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
– NDCG ≤ 1 at any rank position
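In formula form (the standard definition, presumably what the slide uses):
NDCG_p = DCG_p / IDCG_p
where IDCG_p is the DCG value of the perfect ranking at rank p.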
• Two rankings described using preferences can be compared using the Kendall tau coefficient (τ):
– P is the number of preferences that agree and Q is the number that disagree
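Written out, the standard form of the coefficient is
τ = (P - Q) / (P + Q)
so τ ranges from -1 (the rankings disagree on every preference) to +1 (they agree on every preference).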
• For preferences derived from binary relevance judgments, can use BPREF
Testing and Statistics
• Given the results from a number of queries, how can we conclude that ranking algorithm
A is better than algorithm B?
• A significance test enables us to reject the null hypothesis (no difference) in favor of the
alternative hypothesis (B is better than A)
– the power of a test is the probability that the test will reject the null hypothesis correctly
– increasing the number of queries in the experiment also increases power of test
• Distribution for the possible values of a test statistic assuming the null hypothesis
• shaded area is region of rejection
• Assumption is that the difference between the effectiveness values is a sample from a normal distribution
• Null hypothesis is that the mean of the distribution of differences is zero
• Test statistic
– for the example,
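A standard form of the paired t-test statistic (presumably what the slide applies to the example) is
t = mean(d) / ( sd(d) / sqrt(N) )
where d is the vector of per-query differences B - A, sd(d) its sample standard deviation, and N the number of queries.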
• Ignores magnitude of differences
• Null hypothesis for this test is that
– P(B > A) = P(A > B) = ½
– number of pairs where B is “better” than A would be the same as the number of pairs where
A is “better” than B
• Test statistic is number of pairs where B>A
• For example data,
– test statistic is 7, p-value = 0.17
– cannot reject null hypothesis
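A minimal Python sketch (assumed, not from the slides) of the one-sided sign test. Assuming the example has 10 query pairs with B better on 7 of them (consistent with the quoted p-value of 0.17):

from math import comb

def sign_test_p(n_better, n_pairs):
    # One-sided binomial tail: probability of at least n_better successes
    # out of n_pairs under the null hypothesis P(B > A) = 0.5.
    return sum(comb(n_pairs, k) for k in range(n_better, n_pairs + 1)) / 2 ** n_pairs

# sign_test_p(7, 10) -> 0.171875, so the null hypothesis cannot be rejected at the 0.05 level.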
• Test (or even train) using live traffic on a search engine
• Benefits:
– real users, less biased, large amounts of test data
• Drawbacks:
– noisy data, can degrade user experience
• Often done on small proportion (1-5%) of live traffic
Summary of this lecture
IR evaluation
Precision , Recall , F
Interpolation
MAP , interpolated AP
P@N,P@R
DCG,NDCG
BPREF
Q&A
Readings
[1] IIR Ch1,Ch6.2,Ch6.3,Ch8.1,8.2,8.3,8.4
[2] J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Melbourne, Australia: ACM Press, 1998.
Exercise 8-9 [**] In a collection of 10,000 documents, a certain query has 8 relevant documents in total. Below is the relevance (R = relevant, N = non-relevant) of the top 20 ranked results returned by a system for this query; 6 of them are relevant:
RRNNN NNNRN RNNNR NNNNR
a. What is the precision over the top 20 documents?
b. What is the F1 over the top 20 documents?
c. What is the interpolated precision at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Compute its MAP.
#2. Evaluation
Define a precision-recall graph as follows: for a query's ranked result list, compute a precision/recall point at each returned document; the graph consists of these points.
On this graph, define a breakeven point as a point where precision and recall are equal.
Question: Can a graph have more than one breakeven point? If so, give an example; if not, prove that it cannot.