Institute of Network and Information Systems, Peking University


Information Retrieval

Evaluation

http://net.pku.edu.cn/~wbia

黄连恩 hle@net.pku.edu.cn

School of Information Engineering, Peking University

10/21/2014

IR Evaluation Methodology

Measures for a search engine

Speed of index construction

Number of documents/hour

Document size

These criteria are measurable.

But the key measure is user happiness.

How can we measure it quantitatively?

Search speed

Response time: latency as a function of index size

Throughput as a function of index size

Expressiveness of the query language

Ability to express complex information needs

Speed on complex queries


Measuring user happiness

Issue: who is the user?

Web search engine: the user finds what they want and returns to the engine.

Can measure the rate of returning users.

eCommerce site: the user finds what they want and makes a purchase.

Is it the end user, or the eCommerce site, whose happiness we measure?

Measure the time to purchase, or the fraction of searchers who become buyers?

Enterprise (company/govt/academic): care about “user productivity”

How much time do my users save when looking for information?

Many other criteria having to do with breadth of access, secure access, etc.


Happiness: elusive to measure

Most common proxy: relevance of search results

But how do you measure relevance?

Methodology: test collection (corpus)

1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair

Some work uses more-than-binary judgments, but binary is the standard.

Evaluating an IR system

Note: the information need is translated into a query. Relevance is assessed relative to the information need, not the query.

E.g.,

Information need : I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

Query : wine red white heart attack effective

You evaluate whether the doc addresses the information need, not whether it has those words


Evaluation Corpus

• TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years

Test collections consist of documents, queries, and relevance judgments, e.g.:

– CACM: titles and abstracts from the Communications of the ACM from 1958-1979. Queries and relevance judgments generated by computer scientists.

– AP: Associated Press newswire documents from 1988-1990 (from TREC disks 1-3). Queries are the title fields from TREC topics 51-150. Topics and relevance judgments generated by government information analysts.

– GOV2 : Web pages crawled from websites in the .gov domain during early 2004. Queries are the title fields from TREC topics 701-850. Topics and relevance judgments generated by government analysts.


Test Collections

[Table of test collection statistics not reproduced.]

TREC Topic Example

[Example topic not reproduced.]

Relevance Judgments

• Obtaining relevance judgments is an expensive, time-consuming process

– who does it?

– what are the instructions?

– what is the level of agreement?

• TREC judgments

– depend on task being evaluated

– generally binary

– agreement good because of “narrative”


Pooling

• Exhaustive judgment of all documents in a collection is not practical

• The pooling technique is used in TREC instead (see the sketch after this list):

– the top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool

– duplicates are removed

– documents are presented in some random order to the relevance judges

• Produces a large number of relevance judgments for each query, although still incomplete
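To make the procedure concrete, here is a minimal Python sketch of pooling; the function name and input format are hypothetical:

```python
import random

def build_pool(rankings, k=100):
    """Merge the top-k results from several systems into a judging pool:
    deduplicate, then shuffle so judges see documents in no system order.
    `rankings` is a list of ranked lists of document IDs."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])   # top k from each system
    pool = list(pool)
    random.shuffle(pool)           # random order for the relevance judges
    return pool

# Example: three systems' runs over a toy collection
runs = [["d1", "d2", "d3"], ["d2", "d4", "d5"], ["d1", "d5", "d6"]]
print(build_pool(runs, k=2))       # e.g. ['d4', 'd1', 'd2', 'd5'] (order varies)
```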


IR Evaluation Metrics

Unranked retrieval evaluation:

Precision and Recall

Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved)

Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant)

                 Relevant   Not relevant
Retrieved          tp           fp
Not retrieved      fn           tn

Precision P = tp / (tp + fp)

Recall R = tp / (tp + fn)
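A minimal Python sketch of the two set-based measures; the toy counts are made up:

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant: tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that are retrieved: tp / (tp + fn)."""
    return tp / (tp + fn)

# Hypothetical contingency counts: 30 relevant retrieved,
# 10 irrelevant retrieved, 20 relevant missed
tp, fp, fn = 30, 10, 20
print(precision(tp, fp))   # 0.75
print(recall(tp, fn))      # 0.6
```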


Accuracy

Given a query, the search engine classifies each document as “Relevant” or “Irrelevant”.

Accuracy of an engine: the fraction of these classifications that are correct.

Accuracy = (tp + tn) / (tp + fp + tn + fn)

Is this a very useful evaluation measure in IR?

(tp, fp, fn, tn as in the contingency table above)

Why not just use accuracy?

How to build a 99.9999% accurate search engine on a low budget:

Search for:

0 matching results found.

Since nearly every document is irrelevant to any given query, an engine that retrieves nothing classifies almost all documents correctly. But people doing information retrieval want to find something, and have a certain tolerance for junk.

Precision and recall when ranked

Extend the set-based definitions to a ranked list.

At each document in the ranked list, compute a P/R point.

Which of the values computed this way are useful?

Consider a P/R point for each relevant document

Consider value only at fixed rank cutoffs

e.g., precision at rank 20

Consider value only at fixed recall points

e.g., precision at 20% recall

May be more than one precision value at a recall point


Precision and Recall example

[Worked example figure not reproduced.]

Average precision of a query

Often want a single-number effectiveness measure

Average precision is widely used in IR

Calculated by averaging the precision values at the points where recall increases, i.e. at each relevant document retrieved; a minimal sketch follows.
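A minimal Python sketch of average precision for one ranking; the boolean-list input format is an assumption:

```python
def average_precision(ranking, num_relevant):
    """Average the precision values at the ranks where recall increases,
    i.e. at each relevant document. `ranking` is a list of booleans
    (True = the document at that rank is relevant)."""
    hits, precisions = 0, []
    for rank, is_rel in enumerate(ranking, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    # Relevant documents never retrieved contribute precision 0
    return sum(precisions) / num_relevant

# Relevant docs at ranks 1, 3, and 6; 3 relevant docs in total
print(average_precision([True, False, True, False, False, True], 3))
# (1/1 + 2/3 + 3/6) / 3 ~= 0.722
```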


Recall/precision graphs

Average precision vs. the P/R graph:

AP hides information

The recall/precision graph has an odd saw-tooth shape if plotted directly.

But P/R graphs are hard to compare across queries and systems.

Precision and Recall, toward averaging

[Figure not reproduced.]

Averaging graphs: a false start

How can graphs be averaged?

Different queries have different recall values.

What is precision at 25% recall?

Interpolate. But how?

Interpolation of graphs

Possible interpolation methods:

No interpolation

Not very useful

Connect the dots

Not a function

Connect max

Connect min

Connect average

How should 0% recall be handled?

Assume 0?

Assume best?

Constant start?


How to choose?

A good retrieval system has the property that, on average, as recall increases its precision decreases.

This has been verified time and time again (on average).

Interpolate so that the function becomes monotonically decreasing.

E.g., sweeping from left to right, take as the interpolated value the maximum precision at or to the right of each recall level:

P_interp(R) = max { P′ : (R′, P′) ∈ S, R′ ≥ R }, where S is the set of observed (R, P) points

The result is a step function; a minimal sketch follows.
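A minimal Python sketch of this interpolation rule; the observed points are made up:

```python
def interpolate(points, recall_level):
    """Interpolated precision at `recall_level`: the maximum precision
    over all observed (R, P) points with R >= recall_level. This yields
    a monotonically decreasing step function and is well defined at 0%
    recall. `points` is the set S of observed (recall, precision)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates) if candidates else 0.0

observed = [(0.25, 1.0), (0.5, 0.67), (0.75, 0.6), (1.0, 0.5)]
print(interpolate(observed, 0.0))   # 1.0 (0% recall handled smoothly)
print(interpolate(observed, 0.6))   # 0.6
```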


Our example, interpolated this way

Monotonically decreasing.

Handles 0% recall smoothly.

[Figure not reproduced.]

Averaging graphs: using interpolation

Asked: what is precision at 25% recall?

Interpolate values


Averaging across queries

Averaging over multiple queries:

Micro-average: each relevant document is a point in the average.

Macro-average: each query is a point in the average.

The average of many queries' average precision values is called mean average precision (MAP).

“Average average precision” would sound weird.

MAP is the most common single-number measure; a sketch contrasting macro and micro averaging follows.
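A minimal Python sketch contrasting the two averages; the per-query precision points are made up:

```python
def macro_average(per_query_ap):
    """Macro average: each query is one point; the mean of per-query
    average precision values is MAP."""
    return sum(per_query_ap) / len(per_query_ap)

def micro_average(per_query_points):
    """Micro average: every relevant document's precision is one point,
    pooled across all queries before averaging."""
    pooled = [p for query in per_query_points for p in query]
    return sum(pooled) / len(pooled)

# Two hypothetical queries: precision measured at each relevant document
q1 = [1.0, 0.67, 0.5]   # query with 3 relevant docs
q2 = [0.2]              # query with 1 relevant doc
print(macro_average([sum(q1) / len(q1), sum(q2) / len(q2)]))  # MAP ~= 0.46
print(micro_average([q1, q2]))                                # ~= 0.59
```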


Interpolated average precision

Average precision at standard recall points

For a given query, compute a P/R point for every relevant doc.

Interpolate precision at standard recall levels

11-pt is usually 100%, 90, 80, …, 10, 0% (yes, 0% recall)

3-pt is usually 75%, 50%, 25%

Average over all queries to get average precision at each recall level

Average over the interpolated recall levels to get a single result

Called “interpolated average precision”

Not used much anymore; MAP “mean average precision” more common

Values at specific interpolated points are still commonly used; a sketch of the 11-point version follows.
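A minimal Python sketch of 11-point interpolated average precision for one query; the observed points are made up, and the per-collection figure would average this over all queries:

```python
def eleven_point_ap(points):
    """Interpolated average precision over the 11 standard recall levels
    0%, 10%, ..., 100% for one query's observed (recall, precision) points."""
    def interp(level):
        # max precision at or to the right of the recall level
        cands = [p for r, p in points if r >= level]
        return max(cands) if cands else 0.0
    levels = [i / 10 for i in range(11)]
    return sum(interp(l) for l in levels) / 11

observed = [(0.25, 1.0), (0.5, 0.67), (0.75, 0.6), (1.0, 0.5)]
print(round(eleven_point_ap(observed), 3))   # ~= 0.701
```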


Interpolation and averaging

[Figure not reproduced.]

A combined measure: F

A combined measure of P and R: the F measure (weighted harmonic mean):

F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where β² = (1 − α)/α

Usually the balanced F1 measure is used (β = 1, i.e. α = ½).

The harmonic mean is a conservative average; it heavily penalizes low values of P or R. A minimal sketch follows.
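A minimal Python sketch of the F measure; the zero-division guard is an implementation choice:

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall:
    F = (beta^2 + 1) * P * R / (beta^2 * P + R).
    beta = 1 gives the balanced F1."""
    if p == 0 and r == 0:
        return 0.0   # avoid division by zero when both are zero
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

print(f_measure(0.5, 0.5))   # 0.5: harmonic mean of equal values
print(f_measure(0.9, 0.1))   # 0.18: the low recall is heavily penalized
```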


Averaging F, example

Q-bad has 1 relevant document

Retrieved at rank 1000

(R, P) = (1, 0.001)

F value of 0.2%, so AvgF = 0.2%

Q-perfect has 10 relevant documents

Retrieved at ranks 1-10

(R,P) = (.1,1), (.2,1), …, (1,1)

F values of 18%, 33%, …, 100%, so AvgF = 66.2%

Macro average

(0.2% + 66.2%) / 2 = 33.2%

Micro average

(0.2% + 18% + … + 100%) / 11 = 60.2%

Focusing on Top Documents

• Users tend to look at only the top part of the ranked result list to find relevant documents

• Some search tasks have only one relevant document

– e.g., navigational search , question answering

• Recall not appropriate

– instead need to measure how well the search engine does at retrieving relevant documents at very high ranks


Focusing on Top Documents

• Precision at N

• Precision at Rank R

– R typically 5, 10, 20

– easy to compute, average, understand

– not sensitive to rank positions less than R

• Reciprocal Rank

– reciprocal of the rank at which the first relevant document is retrieved

– Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries (see the sketch after this list)

– very sensitive to rank position
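A minimal Python sketch of P@N and MRR; the boolean-ranking format is an assumption:

```python
def precision_at(ranking, n):
    """P@N: fraction of the top n results that are relevant."""
    return sum(ranking[:n]) / n

def mean_reciprocal_rank(rankings):
    """MRR: average over queries of 1 / (rank of first relevant doc)."""
    total = 0.0
    for ranking in rankings:
        for rank, is_rel in enumerate(ranking, start=1):
            if is_rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Two hypothetical queries (True = relevant result)
q1 = [False, True, False]   # first relevant at rank 2 -> RR = 0.5
q2 = [True, False, False]   # first relevant at rank 1 -> RR = 1.0
print(precision_at(q1, 3))              # 1/3
print(mean_reciprocal_rank([q1, q2]))   # 0.75
```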

Discounted Cumulative Gain

• Popular measure for evaluating web search and related tasks

• Two assumptions:

– Highly relevant documents are more useful than marginally relevant documents

– the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined

Discounted Cumulative Gain

• Uses graded relevance as a measure of the usefulness, or gain, from examining a document

• Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks

• Typical discount is 1 / log(rank)

– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3


Discounted Cumulative Gain

DCG is the total gain accumulated at a particular rank p (reconstructed from the example below): DCG_p = rel_1 + Σ_{i=2..p} rel_i / log₂ i

• Alternative formulation: DCG_p = Σ_{i=1..p} (2^{rel_i} − 1) / log₂(i + 1)

– used by some web search companies

– emphasis on retrieving highly relevant documents

DCG Example

• 10 ranked documents judged on 0-3 relevance scale:

3, 2, 3, 0, 0, 1, 2, 2, 3, 0

• discounted gain:

3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0

= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0

• DCG:

3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
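The example can be checked with a minimal Python sketch of the first DCG formulation above:

```python
import math

def dcg(gains):
    """DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i)."""
    return gains[0] + sum(g / math.log2(i)
                          for i, g in enumerate(gains[1:], start=2))

gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(round(dcg(gains), 2))   # 9.61, matching the example at rank 10
```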


Normalized DCG

• DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking

– makes averaging easier for queries with different numbers of relevant documents

NDCG Example

• Perfect ranking:

3, 3, 3, 2, 2, 2, 1, 0, 0, 0

• ideal DCG values:

3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88

• NDCG values (divide actual by ideal):

1, 0.83, 0.87, 0.76, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88

– NDCG ≤ 1 at any rank position; a minimal sketch follows
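A minimal Python sketch of NDCG under the same DCG formulation; small differences from the slide's rounded values may appear:

```python
import math

def ndcg(gains):
    """NDCG at each rank: actual DCG divided by the DCG of the perfect
    (ideal) ranking of the same relevance judgments."""
    ideal = sorted(gains, reverse=True)
    vals = []
    actual = ideal_total = 0.0
    for i, (g, ig) in enumerate(zip(gains, ideal), start=1):
        discount = 1.0 if i == 1 else 1.0 / math.log2(i)
        actual += g * discount
        ideal_total += ig * discount
        vals.append(actual / ideal_total)
    return vals

print([round(v, 2) for v in ndcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0])])
# [1.0, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88]
```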


Using Preferences

• Two rankings described using preferences can be compared using the Kendall tau coefficient (τ):

τ = (P − Q) / (P + Q)

where P is the number of preferences that agree and Q is the number that disagree

• For preferences derived from binary relevance judgments, can use BPREF
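A minimal Python sketch of τ from preference counts; the counts are made up:

```python
def kendall_tau(pref_agree, pref_disagree):
    """Kendall tau over preference pairs:
    tau = (P - Q) / (P + Q), where P pairs agree and Q disagree."""
    return (pref_agree - pref_disagree) / (pref_agree + pref_disagree)

print(kendall_tau(8, 2))   # 0.6: the two rankings mostly agree
```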


Testing and Statistics

Significance Tests

• Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A?

• A significance test enables us to reject the null hypothesis (no difference) in favor of the alternative hypothesis (B is better than A)

– the power of a test is the probability that the test will reject the null hypothesis correctly

– increasing the number of queries in the experiment also increases power of test


One-Sided Test

• Distribution for the possible values of a test statistic assuming the null hypothesis

• the shaded area is the region of rejection [figure not reproduced]

Example Experimental Results

[Table of per-query effectiveness values for rankings A and B not reproduced.]

t-Test

• Assumption is that the difference between the effectiveness values is a sample from a normal distribution

• Null hypothesis is that the mean of the distribution of differences is zero

• Test statistic: t = d̄ · √N / σ_d, where d̄ and σ_d are the mean and standard deviation of the per-query differences over N queries

– for the example, the values can be computed as in the sketch below
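A minimal sketch of the paired t-test using SciPy; the per-query values are made up, and the `alternative` argument requires SciPy 1.6 or later:

```python
from scipy import stats

# Hypothetical per-query effectiveness (e.g., AP) for systems A and B
a = [0.25, 0.43, 0.39, 0.75, 0.43]
b = [0.35, 0.84, 0.15, 0.75, 0.68]

# Paired t-test on the per-query differences, one-sided
# alternative "B is better than A"
t, p = stats.ttest_rel(b, a, alternative="greater")
print(t, p)   # reject the null hypothesis if p is below the chosen level
```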


Sign Test

• Ignores magnitude of differences

• Null hypothesis for this test is that P(B > A) = P(A > B) = ½

– the number of pairs where B is “better” than A would be the same as the number of pairs where A is “better” than B

• Test statistic is the number of pairs where B > A

• For the example data,

– the test statistic is 7, with p-value = 0.17

– so we cannot reject the null hypothesis; a minimal sketch follows
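A minimal sketch of the sign test via the binomial distribution; `binomtest` requires SciPy 1.7 or later:

```python
from scipy.stats import binomtest

# 10 query pairs (hypothetical), B beats A on 7 of them; under the null
# hypothesis the number of wins is Binomial(n=10, p=0.5)
result = binomtest(7, n=10, p=0.5, alternative="greater")
print(result.pvalue)   # ~= 0.17, so we cannot reject the null hypothesis
```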


Online Testing

• Test (or even train) using live traffic on a search engine

• Benefits:

– real users, less biased, large amounts of test data

• Drawbacks:

– noisy data, can degrade user experience

• Often done on a small proportion (1-5%) of live traffic


Summary of this lecture

IR evaluation

Precision , Recall , F

Interpolation

MAP , interpolated AP

P@N, P@R

DCG, NDCG

BPREF


Thank You!

Q&A

Readings

[1] IIR Ch1, Ch6.2, Ch6.3, Ch8.1-8.4

[2] J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval,” in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Melbourne, Australia: ACM Press, 1998.

Exercise 8-9 [**] In a collection of 10,000 documents, a certain query has 8 relevant documents in total. Below is the relevance of the top 20 ranked results returned by some system for this query (R = relevant, N = non-relevant); 6 of the 20 are relevant:

RRNNN NNNRN RNNNR NNNNR

a. What is the precision over the top 20 documents?
b. What is the F1 over the top 20 documents?
c. What is the interpolated precision at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Compute the MAP.

#2. Evaluation

Define the precision-recall graph as follows: for a query's result list, compute a precision/recall point at each returned document; the graph consists of these points.

On this graph, define a breakeven point as a point where precision equals recall.

Question: can a graph have more than one breakeven point? If so, give an example; if not, prove it.
