
Information Retrieval
PengBo
Oct 28, 2010
Outline of This Lecture

Introduction to Information Retrieval

Index Techniques
Scoring and Ranking
Evaluation
Basic Index Techniques
Document Collection

site:pkunews.pku.edu.cn
Baidu reports 12,800 pages; Google reports 6,820 pages.
User Information Need

Within this news site, find articles that talk about the culture of China and Japan, and do not talk about students abroad.

QUERY:
中国 日本 文化 -留学生 site:pkunews.pku.edu.cn

Baidu reports 38 results; Google reports 361 results.
How to do it?

字符串匹配,如使用 grep 所有WebPages,找到
包含 “中国” ,“文化”and “日本” 的页面, 再去
除包含 “留学生”的页面?



Slow (for large corpora)
NOT “留学生” is non-trivial
Other operations (e.g., find “中国” NEAR “日本”)
not feasible
Document Representation

Bag of words model
Document-term incidence matrix:

         D1  D2  D3  D4  D5  D6  …
中国      1   0   1   1   1   0
文化      1   1   0   0   1   0
日本      0   1   1   0   1   1
留学生    0   1   1   1   0   0
教育      1   1   0   0   0   0
北京      1   0   0   1   0   1
…

1 if the page contains the word, 0 otherwise.
Incidence Vector

Transpose the document-term matrix to get the term-document incidence matrix: each term corresponds to a 0/1 vector, its incidence vector.

         D1  D2  D3  D4  D5  D6  …
中国      1   0   1   1   1   0
文化      1   1   0   0   1   0
日本      0   1   1   0   1   1
留学生    0   1   1   1   0   0
教育      1   1   0   0   0   0
北京      1   0   0   1   0   1
…
Retrieval

Information Need: within this news site, find articles that talk about the culture of China and Japan, and do not talk about students abroad.

To answer the query:
Read the term vectors for 中国, 文化, 日本, and 留学生 (complemented).
Bitwise AND them:
101110 AND 110010 AND 011011 AND 100011 = 000010
The answer is D5.
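The bitwise-AND retrieval above can be sketched in Python. The incidence vectors and doc numbering come from the slide's 6-document matrix; the function name and bit layout are illustrative.

```python
# Boolean retrieval over the 6-document incidence matrix from the slide.
# Vectors are written MSB-first: bit 5 is D1, bit 0 is D6.
VECTORS = {
    "中国": 0b101110,
    "文化": 0b110010,
    "日本": 0b011011,
    "留学生": 0b011100,
}

def boolean_and_not(include, exclude, n_docs=6):
    """AND together the include terms, then AND in the complement of each
    exclude term. Returns the matching document numbers (1-based)."""
    mask = (1 << n_docs) - 1
    result = mask
    for t in include:
        result &= VECTORS[t]
    for t in exclude:
        result &= ~VECTORS[t] & mask
    # Bit i corresponds to document n_docs - i.
    return sorted(n_docs - i for i in range(n_docs) if (result >> i) & 1)

print(boolean_and_not(["中国", "文化", "日本"], ["留学生"]))  # [5], i.e. D5
```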
Let's build a search system!

Consider the scale of the system:
Number of documents: N = 1 million, each with about 1K terms.
At an average of 6 bytes/term, that is 6 GB of data in the documents.
Number of distinct terms: M = 500K.

How big is this matrix?
500K x 1M
Extremely sparse: no more than one billion 1's.
What's a better representation?
In 1875, Mary Cowden Clarke compiled a concordance to Shakespeare's works. In the preface she proudly wrote that she had "contributed a reliable guide to the treasures of wisdom…", hoping that her sixteen years of hard labor had lived up to that ideal.

In 1911, Professor Lane Cooper published a concordance to the poems of William Wordsworth. It took 7 months, involved 67 people, and used tools such as index cards, scissors, glue, and postage stamps.

By 1965, compiling such material with a computer took only a few days, and the result was better…
Inverted index

For each term T: store the list of (IDs of) documents that contain T.

Dictionary   Postings
中国     →  2 → 4 → 8 → 16 → 32 → 64 → 128
文化     →  1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
留学生   →  13 → 16

Sorted by docID (more later on why).
Inverted index construction

Documents to be indexed:   Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:              Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:           friend  roman  countryman
        ↓ Indexer
Inverted index:            friend → 2 → 4
                           roman → 1 → 2
                           countryman → 13 → 16
Indexer steps

Output: a sequence of <Modified token, Document ID> pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sort by terms — the core indexing step.

Before sorting (term, doc#):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1,
capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2,
caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2,
was 2, ambitious 2

After sorting (term, doc#):
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2,
caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1,
killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2,
was 1, was 2, with 2, you 2
Merge multiple occurrences within the same document and add the term's frequency:

Term       Doc#  Freq
ambitious   2     1
be          2     1
brutus      1     1
brutus      2     1
capitol     1     1
caesar      1     1
caesar      2     2
did         1     1
enact       1     1
hath        2     1
I           1     2
i'          1     1
it          2     1
julius      1     1
killed      1     2
let         2     1
me          1     1
noble       2     1
so          2     1
the         1     1
the         2     1
told        2     1
was         1     1
was         2     1
with        2     1

The result is split into a Dictionary file and a Postings file.
Why split?

Dictionary (Term, N docs, Tot Freq):
ambitious 1 1, be 1 1, brutus 2 2, capitol 1 1, caesar 2 3, did 1 1,
enact 1 1, hath 1 1, I 1 2, i' 1 1, it 1 1, julius 1 1, killed 1 2,
let 1 1, me 1 1, noble 1 1, so 1 1, the 2 2, told 1 1, was 2 2, with 1 1

Postings (Doc#, Freq):
ambitious → (2,1); be → (2,1); brutus → (1,1)(2,1); capitol → (1,1);
caesar → (1,1)(2,2); did → (1,1); enact → (1,1); hath → (2,1);
I → (1,2); i' → (1,1); it → (2,1); julius → (1,1); killed → (1,2);
let → (2,1); me → (1,1); noble → (2,1); so → (2,1); the → (1,1)(2,1);
told → (2,1); was → (1,1)(2,1); with → (2,1)
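The indexer steps above — emit <token, docID> pairs, sort by term, then merge per-document occurrences into postings with frequencies — can be sketched as follows. The tokenizer regex is a simplifying assumption.

```python
# Sketch of the indexer pipeline on the two Caesar documents.
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: emit <modified token, docID> pairs.
pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):
        pairs.append((token, doc_id))

# Step 2: sort by term -- the core indexing step.
pairs.sort()

# Step 3: merge occurrences within a document, recording term frequency.
# postings maps term -> list of (docID, tf), sorted by docID.
postings = defaultdict(list)
for term, doc_id in pairs:
    if postings[term] and postings[term][-1][0] == doc_id:
        d, tf = postings[term][-1]
        postings[term][-1] = (d, tf + 1)
    else:
        postings[term].append((doc_id, 1))

print(postings["caesar"])  # [(1, 1), (2, 2)]
print(postings["brutus"])  # [(1, 1), (2, 1)]
```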
Boolean Query processing

Query: 中国 AND 文化
Look up 中国 in the Dictionary; retrieve its postings.
Look up 文化 in the Dictionary; retrieve its postings.
"Merge" (AND) the two postings lists:

中国 → 2 → 4 → 8 → 16 → 32 → 64 → 128
文化 → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
The merge

Algorithm for merging the two lists:

中国 → 2 → 4 → 8 → 16 → 32 → 64 → 128
文化 → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Result: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
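The two-pointer merge described above can be sketched directly; it relies on both postings lists being sorted by docID.

```python
# Intersect two sorted postings lists in O(x + y) time.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer on the smaller docID
        else:
            j += 1
    return answer

zhongguo = [2, 4, 8, 16, 32, 64, 128]
wenhua = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(zhongguo, wenhua))  # [2, 8]
```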
Boolean queries: Exact match

Queries use AND, OR and NOT together with query terms.
The primary commercial retrieval tool for 3 decades.
Professional searchers (e.g., lawyers) still like Boolean queries:
you know exactly what you're getting.
Example: WestLaw

Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992).
About 7 terabytes of data; 700,000 users.
The majority of users still use Boolean queries.

Example query:
What is the statute of limitations in cases involving the federal tort claims act?
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

Characteristics: long, precise queries; proximity operators; incrementally developed; not like web search.
Beyond Boolean term search

Phrases: Find "Bill Gates", not "Bill and Gates".
Proximity: Find Gates NEAR Microsoft.
Zones within documents: Find documents with (author = Ullman) AND (text contains automata).

Solution:
Record each term's field property.
Record each term's position information within the docs.
LAST COURSE REVIEW
Bag of words model

Vector representation doesn't consider the ordering of words in a document.
"John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
This is called the bag of words model.
In a sense, this is a step back: the positional index was able to distinguish these two documents.
We will look at "recovering" positional information later in this course.
For now: bag of words model.
Inverted index

For each term T: store the list of (IDs of) documents that contain T.

Dictionary   Postings
中国     →  2 → 4 → 8 → 16 → 32 → 64 → 128
文化     →  1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
留学生   →  13 → 16

Sorted by docID (more later on why).
Simple inverted index
Inverted index with counts: supports better ranking algorithms
Inverted index with positions: supports proximity matches
Query Processing

Document-at-a-time: calculates complete scores for documents by processing all term lists, one document at a time.
Term-at-a-time: accumulates scores for documents by processing term lists one at a time.
Both approaches have optimization techniques that significantly reduce the time required to generate scores.
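The term-at-a-time strategy can be sketched with score accumulators. The index contents and the plain-tf weights here are illustrative, not from the slides.

```python
# Term-at-a-time scoring: walk each query term's postings in turn,
# accumulating a partial score per document, then rank by total score.
from collections import defaultdict

def term_at_a_time(query_terms, index):
    """index maps term -> list of (docID, weight) postings."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Rank documents by accumulated score, highest first.
    return sorted(accumulators.items(), key=lambda kv: -kv[1])

# Hypothetical postings with raw tf as the weight:
index = {
    "中国": [(1, 11), (3, 7), (4, 13), (5, 4)],
    "文化": [(1, 2), (2, 2), (5, 6)],
}
print(term_at_a_time(["中国", "文化"], index))
```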
Scoring and Ranking

Beyond Boolean Search

For most users…
Most users are likely to type bill rights or bill of rights as a query, not
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

How do we interpret and process such full-text queries?
There are no Boolean connectives such as AND, OR, NOT.
A query term need not appear in a result document.
Users expect the results in some order, with the documents most likely to be useful at the front.
Scoring: density-based

Score documents against the query; rank them by score.
Idea:
If a document talks about a topic more, then it is a better match.
A document containing many occurrences of the query terms is relevant.
→ term weighting.
Term frequency vectors

Let tf_{t,d} be the number of occurrences of term t in document d. For a free-text query q:

    Score(q,d) = Σ_{t∈q} tf_{t,d}

         D1  D2  D3  D4  D5  D6
中国     11   0   7  13   4   0
文化      2   2   0   0   6   0
日本      0   5   2   0   1   9
留学生    0   1   2   6   0   0
教育      3   0   0   2   0   0
北京     17   0   0   0  11   8
…
Problems of TF scoring

Word order is not distinguished.
→ positional information index
Long documents have an advantage.
→ normalize for document length: wf_{t,d} = tf_{t,d} / |d|
The importance of an occurrence is not proportional to the raw count: going from 0 occurrences to 1 means far more than going from 100 to 101.
→ smoothing: wf_{t,d} = 0 if tf_{t,d} = 0, otherwise 1 + log(tf_{t,d})
Different terms differ in importance. Consider the query 日本 的 汉字 丼.
→ discrimination of terms
Discrimination of terms

How do we measure how common a term is?
collection frequency (cf): total number of occurrences of the term in the collection
document frequency (df): number of documents in the collection that contain the term

Word        cf     df
try         10422  8760
insurance   10440  3997
tf x idf term weights

The tf x idf weighting formula combines:
term frequency (tf), or wf: some measure of term density in a doc
inverse document frequency (idf): expresses the importance (rarity) of the term
The raw value is idf_t = 1/df_t; as with tf, it is usually smoothed:

    idf_t = log(N / df_t)

The tf.idf weight of each term in a document:

    w_{t,d} = tf_{t,d} × log(N / df_t)
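The weight formula above can be sketched directly. The collection size N here is an assumption for illustration; the df values are the ones from the try/insurance table.

```python
# tf-idf weight: w_{t,d} = tf * log10(N / df), 0 if the term is absent.
import math

def tf_idf(tf, df, n_docs):
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log10(n_docs / df)

N = 1_000_000  # assumed collection size, for illustration only
print(round(tf_idf(1, 8760, N), 2))  # one occurrence of "try"
print(round(tf_idf(1, 3997, N), 2))  # "insurance" is rarer, so it weighs more
```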
Documents as vectors

         D1    D2   D3   D4    D5    D6
中国     4.1   0.0  3.7  5.9   3.1   0.0
文化     4.5   4.5  0    0     11.6  0
日本     0     3.5  2.9  0     2.1   3.9
留学生   0     3.1  5.1  12.8  0     0
教育     2.9   0    0    2.2   0     0
北京     7.1   0    0    0     4.4   3.8
…

Each document j can be viewed as a vector; each term is a dimension, and the value is the tf.idf weight.
So we have a vector space:
terms are axes
docs live in this space
It is high-dimensional: even with stemming, we may have 20,000+ dimensions.
Intuition

[Figure: documents d1–d5 as vectors in the term space t1, t2, t3, with angles θ and φ between them]

Postulate: documents that are "close together" in the vector space talk about the same things.
Use cases: query-by-example; a free-text query treated as a vector.
Sec. 6.3
Formalizing vector space proximity

First cut: distance between two points
( = distance between the end points of the two vectors)
Euclidean distance?
Euclidean distance is a bad idea . . .
. . . because Euclidean distance is large for vectors of different lengths.
Sec. 6.3
Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Cosine similarity

[Figure: vectors d1 and d2 in term space, separated by angle θ]

The "closeness" of vectors d1 and d2 can be measured by the angle between them.
Concretely, use the cosine of the angle to compute vector similarity.
Normalize the vectors by length:

    sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|)
                  = Σ_{i=1}^{M} w_{i,j} w_{i,k} / ( sqrt(Σ_{i=1}^{M} w_{i,j}²) · sqrt(Σ_{i=1}^{M} w_{i,k}²) )
Example

Docs: Austen's Sense and Sensibility (SaS), Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH)

Term counts:
            SaS  PaP  WH
affection   115   58  20
jealous      10    7  11
gossip        2    0   6

Length-normalized:
            SaS    PaP    WH
affection   0.996  0.993  0.847
jealous     0.087  0.120  0.466
gossip      0.017  0.000  0.254

cos(SaS, PaP) = .996 × .993 + .087 × .120 + .017 × 0.0 ≈ 0.999
cos(SaS, WH) = .996 × .847 + .087 × .466 + .017 × .254 ≈ 0.888
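The cosine formula and the three-novel example can be sketched together; the vectors are the raw term counts from the table above.

```python
# Cosine similarity between dense term-count vectors.
import math

def cosine_sim(dj, dk):
    dot = sum(a * b for a, b in zip(dj, dk))
    norm_j = math.sqrt(sum(a * a for a in dj))
    norm_k = math.sqrt(sum(b * b for b in dk))
    if norm_j == 0 or norm_k == 0:
        return 0.0
    return dot / (norm_j * norm_k)

# Counts of (affection, jealous, gossip) in each novel:
sas = [115, 10, 2]
pap = [58, 7, 0]
wh = [20, 11, 6]

print(round(cosine_sim(sas, pap), 3))  # 0.999 -- the two Austen novels
print(round(cosine_sim(sas, wh), 3))   # 0.889 -- Austen vs. Brontë
```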
Notes on Index Structure

How should normalized tf-idf values be stored?
In each postings entry?
Store tf/normalization directly?
Space blowup because of floats.
Usually:
tf is stored as an integer value → index compression
Document length and idf are each stored only once (per document and per term), not in every posting.
Sec. 6.4
tf-idf weighting has many variants

[Table of weighting-scheme variants (SMART notation) not reproduced.]
Columns headed 'n' are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
Sec. 6.4
Weighting may differ in queries vs documents

Many search engines allow for different weightings for queries vs. documents.
SMART notation denotes the combination in use in an engine as ddd.qqq, using the acronyms from the previous table.
A very standard weighting scheme is lnc.ltc:
Document: logarithmic tf (l as first character), no idf, and cosine normalization.
A bad idea?
Query: logarithmic tf (l in leftmost column), idf (t in second column), and cosine normalization.
Sec. 6.4
tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance

            Query                                    Document                     Prod
Term        tf-raw  tf-wt  df     idf  wt   n'lize   tf-raw  tf-wt  wt   n'lize
auto        0       0      5000   2.3  0    0        1       1      1    0.52    0
best        1       1      50000  1.3  1.3  0.34     0       0      0    0       0
car         1       1      10000  2.0  2.0  0.52     1       1      1    0.52    0.27
insurance   1       1      1000   3.0  3.0  0.78     2       1.3    1.3  0.68    0.53

Exercise: what is N, the number of docs?
Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
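The worked example can be sketched end to end. N = 1,000,000 is an assumption chosen so that the idf values match the table (e.g., log10(10⁶/5000) = 2.3); the variable names are illustrative.

```python
# lnc.ltc scoring: document side uses log tf, no idf, cosine normalization;
# query side uses log tf, idf, and cosine normalization.
import math

def log_tf(tf):
    return 0.0 if tf == 0 else 1.0 + math.log10(tf)

N = 1_000_000  # assumed so that idf("auto") = log10(N / 5000) = 2.3

terms = ["auto", "best", "car", "insurance"]
query_tf = [0, 1, 1, 1]       # query: best car insurance
doc_tf = [1, 0, 1, 2]         # doc: car insurance auto insurance
df = [5000, 50000, 10000, 1000]

# ltc query weights: log tf * idf, then cosine-normalize.
q_wt = [log_tf(tf) * math.log10(N / d) for tf, d in zip(query_tf, df)]
q_len = math.sqrt(sum(w * w for w in q_wt))
q_norm = [w / q_len for w in q_wt]

# lnc document weights: log tf only, then cosine-normalize.
d_wt = [log_tf(tf) for tf in doc_tf]
d_len = math.sqrt(sum(w * w for w in d_wt))  # ≈ 1.92, as in the slide
d_norm = [w / d_len for w in d_wt]

score = sum(q * d for q, d in zip(q_norm, d_norm))
print(round(score, 2))  # 0.8, matching the slide
```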
Thus far

We can build an Information Retrieval system that supports:
Boolean queries
Free-text queries
Ranked results
IR Evaluation
Measures for a search engine

Indexing speed:
Number of documents/hour
Document size
Search speed:
Latency as a function of index size
Throughput as a function of index size
Expressiveness of the query language:
Ability to express complex information needs
Speed on complex queries
These criteria are measurable. But the key measure is user happiness. How can we measure it quantitatively?
Measuring user happiness

Issue: who is the user?
Web engine: the user finds what they want and returns to the engine.
Can measure the rate of returning users.
eCommerce site: the user finds what they want and makes a purchase.
Is it the end-user, or the eCommerce site, whose happiness we measure?
Measure time to purchase, or the fraction of searchers who become buyers?
Enterprise (company/govt/academic): care about "user productivity."
How much time do my users save when looking for information?
Many other criteria having to do with breadth of access, secure access, etc.
Happiness: elusive to measure

Commonest proxy: relevance of search results.
But how do you measure relevance?
Methodology: test collection
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Some work uses more-than-binary assessments, but this is not the standard.
Evaluating an IR system

Note: the information need is translated into a query. Relevance is assessed relative to the information need, not the query.
E.g.,
Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has those words.
Standard Test Collections

TREC - the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters, CWT100G/CWT200G, etc.
Human experts mark, for each query and for each doc, Relevant or Irrelevant,
or at least for the subset of docs that some system returned for that query.
Unranked retrieval evaluation:
Precision and Recall

Precision: the fraction of retrieved documents that are relevant = P(relevant|retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved|relevant)

                Relevant  Not Relevant
Retrieved       tp        fp
Not Retrieved   fn        tn

Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
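The definitions above can be sketched over sets of document IDs; the example sets are illustrative.

```python
# Precision and recall from sets of retrieved and relevant docIDs.
def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)           # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 5, 7}
relevant = {2, 5, 8}
print(precision_recall(retrieved, relevant))  # P = 2/4, R = 2/3
```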
Accuracy

Given a query, the search engine classifies each document as "Relevant" or "Irrelevant".
The accuracy of the engine is the fraction of these classifications that are correct:
Accuracy = (tp + tn) / (tp + fp + tn + fn)
Is this a very useful evaluation measure in IR?

                Relevant  Not Relevant
Retrieved       tp        fp
Not Retrieved   fn        tn
Why not just use accuracy?

How to build a 99.9999% accurate search engine on a low budget…
Search for anything: "0 matching results found."
People doing information retrieval want to find something and have a certain tolerance for junk.
Precision and recall when ranked

Extend the set-based definitions to a ranked list:
Compute a P/R point at each document in the ranked list.
Which of the resulting values are useful?
Consider a P/R point for each relevant document.
Consider the value only at fixed rank cutoffs, e.g., precision at rank 20.
Consider the value only at fixed recall points, e.g., precision at 20% recall.
There may be more than one precision value at a recall point.
Precision and Recall example
Average precision of a query

We often want a single-number effectiveness measure.
Average precision is widely used in IR.
Calculate it by averaging the precision whenever recall increases.
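Averaging precision at the ranks where recall increases, as described above, can be sketched as follows; the example ranking is illustrative.

```python
# Average precision: mean of the precision values at each rank where a
# relevant document is retrieved, divided over all relevant documents.
def average_precision(ranking, relevant):
    """ranking: docIDs in ranked order; relevant: set of relevant docIDs."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant docs retrieved at ranks 1, 3, and 5:
print(average_precision(["a", "x", "b", "y", "c"], {"a", "b", "c"}))
```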
Recall/precision graphs

Average precision vs. the P/R graph:
AP hides information.
But P/R graphs are hard to compare.
A recall/precision graph has an odd saw-shape if plotted directly.

Precision and Recall, toward averaging
Averaging graphs: a false start

How can graphs be averaged?
Different queries have different recall values.
What is the precision at 25% recall?
Interpolate. How?
Interpolation of graphs

Possible interpolation methods:
No interpolation — not very useful
Connect the dots — not a function
Connect the max
Connect the min
Connect the average
…
How do we handle 0% recall?
Assume 0?
Assume the best value?
A constant start?
How to choose?

A good retrieval system has the property that, on average, its precision decreases as recall increases.
Verified time and time again.
Interpolate so that the function is monotonically decreasing.
For example, moving from left to right, take the interpolated precision at recall r to be the maximum precision observed at any recall at or beyond r:

    P_interp(r) = max { P : (R, P) ∈ S, R ≥ r }

where S is the set of observed (R,P) points.
The result is a step function.
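The max-to-the-right interpolation rule can be sketched as follows; the observed (R,P) points are illustrative.

```python
# Interpolated precision: at each standard recall level r, take the maximum
# precision at any observed recall >= r. The result is a monotonically
# decreasing step function that also handles 0% recall smoothly.
def interpolate(pr_points, recall_levels):
    """pr_points: observed (recall, precision) pairs;
    recall_levels: e.g. the 11 standard levels 0.0, 0.1, ..., 1.0."""
    out = []
    for r in recall_levels:
        candidates = [p for (rr, p) in pr_points if rr >= r]
        out.append(max(candidates) if candidates else 0.0)
    return out

points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
levels = [i / 10 for i in range(11)]
print(interpolate(points, levels))
```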
Our example, interpolated this way

Monotonically decreasing.
Handles 0% recall smoothly.
Averaging graphs: using interpolation

Asked: what is the precision at 25% recall?
Interpolate the values.
Averaging across queries

Averaging over multiple queries:
Micro-average: each relevant document is one point in the average.
Macro-average: each query is one point in the average.
The macro-average of many queries' average precision values is called mean average precision (MAP) — "average average precision" sounds weird.
MAP is the most common.
Interpolated average precision

Average precision at standard recall points:
For a given query, compute a P/R point for every relevant doc.
Interpolate precision at standard recall levels:
11-pt is usually 100%, 90%, 80%, …, 10%, 0% (yes, 0% recall)
3-pt is usually 75%, 50%, 25%
Average over all queries to get the average precision at each recall level.
Average over the interpolated recall levels to get a single result.
This is called "interpolated average precision."
It is not used much anymore; MAP ("mean average precision") is more common.
Values at specific interpolated points are still commonly used.
Interpolation and averaging
A combined measure: F

The F measure combines P and R as a weighted harmonic mean:

    F = 1 / ( α(1/P) + (1-α)(1/R) ) = (β² + 1)PR / (β²P + R)

The balanced F1 measure (β = 1, i.e. α = ½) is the usual choice:

    F1 = 2PR / (P + R)

The harmonic mean is a conservative average: it heavily penalizes low values of P or R.
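The weighted harmonic mean above can be sketched directly; β = 1 gives the balanced F1.

```python
# F measure: F = (beta^2 + 1) * P * R / (beta^2 * P + R).
def f_measure(p, r, beta=1.0):
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

# A low recall drags F down hard, even with perfect precision:
print(round(f_measure(1.0, 0.1), 3))  # 0.182
print(f_measure(0.5, 0.5))            # 0.5
```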
Averaging F, example

Q-perfect has 10 relevant documents:
Retrieved at ranks 1-10
(R,P) = (.1,1), (.2,1), …, (1,1)
F values of 18%, 33%, …, 100%, so AvgF = 66.2%
Q-bad has 1 relevant document:
Retrieved at rank 1000
(R,P) = (1, 0.001)
F value of 0.2%, so AvgF = 0.2%
Macro average: (0.2% + 66.2%) / 2 = 33.2%
Micro average: (0.2% + 18% + … + 100%) / 11 = 60.2%
Summary of This Lecture

Basic Index Techniques
Inverted index
Dictionary & Postings
Scoring and Ranking
Term weighting
tf·idf
Vector Space Model
Cosine Similarity
IR evaluation
Precision, Recall, F
Interpolation
MAP, interpolated AP
Thank You!
Q&A
Reading

[1] IIR Ch. 1, Ch. 6.2, 6.3, Ch. 8.1, 8.2, 8.3, 8.4
[2] J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Melbourne, Australia: ACM Press, 1998.