Information Retrieval Information Science Patrick Lambrix

advertisement
Patrick Lambrix
Department of Computer and Information Science
Linköpings universitet
Information
Retrieval
1
2
Data source
system
Information
Data source
management
system
Data sources
Physical
Data source
Access to stored data
Processing of
queries/updates
Model
Queries
Answer
3
How is the information stored?
- high level
How is the information retrieved?
Storing and accessing textual
information
4
Text (IR)
Semi-structured data
Data models (DB)
Rules + Facts (KB)
structure
precision
Storing textual information
5
search using words
conceptual models:
boolean vector,
boolean,
vector probabilistic
probabilistic, …
file model:
flat file, inverted file, ...
Storing textual information Text - Information Retrieval
53
…
22
…
cloning
…
receptor
…
6
…
…
32
…
…
adrenergic
HITS
WORD
Inverted file
…
…
…
…
LINK
…
5
2
1
…
5
1
…
…
…
…
DOC# LINK
Postings file
Doc2
Doc1
…
DOCUMENTS
Document file
IR - File model: inverted files
7
Controlled vocabulary
Stop list
Stemming
IR – File model: inverted files
8
Information retrieval model: (D,Q,F,R)
D is a set of document representations
Q is a set of queries
F is a framework for modeling document
representations, queries and their
relationships
R associates a real number to documentquery-pairs (ranking)
IR - formal characterization
9
Boolean model
Vector model
Probabilistic model
Classic information retrieval
IR - conceptual models
yes
yes
yes
no
Doc1
Doc2
10
cloning
adrenergic
Document representation
Boolean model
no
no
receptor
(1
1 0)
--> (0 1 0)
-->
11
DNF: di
DNF
disjunction
j
i off conjunctions
j
i
off terms with
i h or without
ih
‘‘not’’
Rules: not not A --> A
not(A and B) --> not A or not B
not(A or B) --> not A and not B
(A or B) and C --> (A and C) or (B and C)
A and (B or C) --> (A and B) or (A and C)
(A and B) or C --> (A or C) and (B or C)
A or (B and C) --> (A or B) and (A or C)
Queries are translated to disjunctive normal form (DNF)
Q1: cloning and (adrenergic or receptor)
Queries : boolean (and, or, not)
Boolean model
12
(cloning and adrenergic) or (cloning and receptor)
--> (cloning and adrenergic and receptor)
or (cloning and adrenergic and not receptor)
or (cloning and receptor and adrenergic)
or (cloning and receptor and not adrenergic)
--> (1 1 1) or (1 1 0) or (1 1 1) or (0 1 1)
--> (1 1 1) or (1 1 0) or (0 1 1)
DNF is completed
+ translated
l d to same representation
i as d
documents
Q1: cloning and (adrenergic or receptor)
--> (cloning and adrenergic) or (cloning and receptor)
Boolean model
yes
yes
yes
no
Doc1
Doc2
13
--> (0 1 0) or (0 1 1)
--> (1 1 0) or (1 1 1) or (0 1 1)
Q2: cloning and not adrenergic
no
no
(1
1 0)
> (0 1 0)
-->
-->
Result: Doc2
Result: Doc1
receptor
Q1: cloning and (adrenergic or receptor)
cloning
adrenergic
Boolean model
14
Advantages
based on intuitive and simple formal
model (set theory and boolean algebra)
Disadvantages
binary decisions
- words are relevant or not
- document is relevant or not,
no notion of partial match
Boolean model
15
--> (1 0 1) or (1 1 1)
Result: empty
no
yyes
no
Doc2
Q3: adrenergic and receptor
no
yes
yes
Doc1
receptor
cloning
adrenergic
Boolean model
(1
1 0)
--> (0 1 0)
-->
16
receptor
cloning
Q (1,1,1)
Doc2 (0,1,0)
Doc1 (1,1,0)
sim(d,q) = d . q
|d| x |q|
adrenergic
Vector model (simplified)
17
Introduce weights in document vectors
(e.g. Doc3 (0, 0.5, 0))
Weights represent importance of the term
for describing the document contents
Weights are positive real numbers
Term does not occur -> weight = 0
Vector model
18
receptor
cloning
Vector model
sim(d,q) = d . q
|d| x |q|
adrenergic
Q4 (0.5,0.5,0.5)
Doc3 (0,0.5,0)
Doc1 (1,1,0)
19
dj (w1,j
1 j, …, wt,jj)
wi,j = weight for term ki in document dj
= fi,j x idfi
How to define weights? tf-idf
Vector model
How to define weights? tf-idf
20
term frequency freqi,j
i j: how often does term ki
occur in document dj?
normalized term frequency:
fi,j = freqi,j / maxl freql,j
Vector model
21
N = total number of documents
ni = number of documents in which ki occurs
inverse document frequency idfi: log (N / ni)
How to define weights? tf-idf
document frequency : in how many
documents does term ki occur?
Vector model
22
How to define weights for query?
recommendation:
q (w1,q
q=
1 q, …, wt,jj)
wi,q = weight for term ki in q
= (0.5 + 0.5 fi,q) x idfi
Vector model
23
Advantages
- term weighting improves retrieval
performance
- partial matching
- ranking according to similarity
Disadvantage
- assumption of mutually independent
terms?
Vector model
24
sim(dj,q) = P(R|dj) / P(Rc|dj)
weights are binary (wi,j = 0 or wi,j = 1)
R: the set of relevant documents for query q
Rc: the set of non-relevant documents for q
P(R|dj): probability that dj is relevant to q
P(Rc|dj): probability that dj is not relevant to q
Probabilistic model
25
sim(dj,q) = P(R|dj) / P(Rc|dj)
(Bayes’ rule, independence of index terms,
take logarithms
logarithms, P(ki|R) + P(not ki|R) = 1)
--> SIM(dj,q) ==
t
SUM i=1 wi,q x wi,j x
(log(P(ki|R) / (1- P(ki|R))) +
log((1- P(ki|Rc)) / P(ki|Rc)))
Probabilistic model
26
How to compute P(ki|R) and P(ki|Rc)?
- initially: P(ki|R) = 0.5 and P(ki|Rc) = ni/N
p
retrieve documents and rank them
- Repeat:
V: subset of documents (e.g. r best ranked)
Vi: subset of V, elements contain ki
P(ki|R) = |Vi| / |V|
and P(ki|Rc) = (ni-|Vi|) /(N-|V|)
Probabilistic model
27
Advantages:
- ranking of documents with respect to
probability of being relevant
Disadvantages:
- initial guess about relevance
- all weights are binary
- independence assumption?
Probabilistic model
28
Precision =
number of found relevant documents
total number of found documents
Recall =
number of found relevant documents
total number of relevant documents
IR - measures
29
Baeza-Yates, R., Ribeiro-Neto, B., Modern Information
Retrieval, Addison-Wesley, 1999.
Literature
Download