A Hidden Markov Model Information Retrieval System

Mahboob Alam Khalid
Language Model based Information Retrieval: University of Saarland
Overview
 Motivation
 Hidden Markov Model (Introduction)
 HMM for Information Retrieval System
 Probability Model
 Baseline System Experiments
 HMM Refinements
    Blind Feedback
    Bigrams
    Document Priors
 Conclusion
Motivation
 Hidden Markov models have been applied successfully to:
    Speech Recognition
    Named Entity Finding
    Optical Character Recognition
    Topic Identification
 Ad hoc Information Retrieval (now)
Hidden Markov Model (Introduction)
 You observe a sequence of output symbols (words).
 You do not know the sequence of states that generated them.
 An HMM models exactly this situation.
 Two kinds of probabilities are involved in an HMM:
    jumps from one state to another (transition probabilities),
     which sum to 1 for each state
    probabilities of the observations emitted from each state,
     which also sum to 1 per state
A discrete HMM
 A set of output symbols
 A set of states
 A set of transitions between states
 A probability distribution on output symbols for each state
 Observed sampling process:
    start from some initial state
    transition from it to another state
    sample from the output distribution at that state
    repeat the last two steps
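
A minimal sketch of this sampling process in Python; the two states and all
the probabilities below are illustrative stand-ins, not values from the paper:

    import random

    # Illustrative discrete HMM; states and numbers are made up.
    start = {"GE": 0.3, "D": 0.7}                     # initial-state distribution
    trans = {"GE": {"GE": 0.3, "D": 0.7},             # each row sums to 1
             "D":  {"GE": 0.3, "D": 0.7}}
    emit = {"GE": {"the": 0.6, "house": 0.3, "white": 0.1},  # sums to 1 per state
            "D":  {"white": 0.5, "house": 0.4, "the": 0.1}}

    def draw(dist):
        # Sample one key from a {key: probability} dict.
        r, acc = random.random(), 0.0
        for key, p in dist.items():
            acc += p
            if r < acc:
                return key
        return key  # guard against floating-point rounding

    def sample(n):
        # Start in an initial state, then repeatedly transition and emit.
        state, out = draw(start), []
        for _ in range(n):
            state = draw(trans[state])     # transition to another state
            out.append(draw(emit[state]))  # sample from that state's outputs
        return out

    print(sample(5))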
HMM for Information Retrieval System
 Observed data: the query Q
 Unknown key: a relevant document D
 Noisy channel: the mind of the user
    transforms the imagined notion of relevance into the text of Q
 Wanted: P(D is R | Q)
    the probability that D is relevant in the user's mind,
     given that Q was the query produced
Probability Model

                  P(Q | D is R) · P(D is R)
  P(D is R | Q) = -------------------------
                            P(Q)

    P(D is R) is the prior probability of relevance.
    P(Q) is identical for all documents, so documents can be ranked
     by P(Q | D is R) · P(D is R) alone.
 Output symbols: the union of all words in the corpus
 States: the mechanisms of query word generation
    Document
    General English
A simple two-state HMM

 [Figure: from the query-start node, each query word is generated either
  by the General English state (transition probability a0, emitting q with
  P(q | GE)) or by the Document state (transition probability a1, emitting
  q with P(q | D)), returning after each word until the query ends.]

 The choice of which kind of word to generate next is independent of the
 previous such choice.
Why simplify the parameters?
 There is one HMM per document.
 EM could compute its parameters, but it needs training samples:
  documents paired with training queries, which are not available.
 Instead, simple maximum-likelihood estimates are used:

   P(q | Dk) = (number of times q appears in Dk) / (length of Dk)

   P(q | GE) = (Σk number of times q appears in Dk) / (Σk length of Dk)

   P(Q | Dk is R) = ∏_{q ∈ Q} ( a0 · P(q | GE) + a1 · P(q | Dk) )
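
These estimates translate directly into a ranking function; a minimal sketch
in Python, computed in log space for numerical stability (function and
variable names are illustrative):

    import math
    from collections import Counter

    def score(query, doc_tokens, corpus_counts, corpus_len, a1=0.3):
        # log P(Q | Dk is R) under the two-state mixture above.
        # Assumes every query word occurs somewhere in the corpus,
        # so P(q | GE) > 0 and the log is defined.
        a0 = 1.0 - a1
        tf = Counter(doc_tokens)
        log_p = 0.0
        for q in query:
            p_ge = corpus_counts.get(q, 0) / corpus_len   # P(q | GE)
            p_d = tf[q] / len(doc_tokens)                 # P(q | Dk)
            log_p += math.log(a0 * p_ge + a1 * p_d)
        return log_p

Documents are ranked by this score; because log is monotonic, the ordering is
the same as ranking by P(Q | Dk is R) itself, and with a uniform prior also
the same as ranking by P(Dk is R | Q).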
Baseline System Performance
 Number of queries: 50
 An inverted index is created:
    stores tf values (term frequencies)
    case is ignored
    Porter stemmer
    397 stop words replaced with the special token *STOP*
    similarly, 4-digit strings replaced by *YEAR*, other digit strings
     by *NUMBER*
 TREC-6 and TREC-7 test collections
    TREC-6: 556,077 documents (news and government agencies); queries
     average 26.5 unique terms
    TREC-7: 528,155 documents; queries average 17.6 unique terms
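
A sketch of this normalization pipeline, assuming simple regex rules and
NLTK's Porter stemmer; the real 397-word stop list and the system's exact
tokenizer are not reproduced here:

    import re
    from nltk.stem import PorterStemmer   # assumes NLTK is installed

    STOPWORDS = {"the", "of", "and"}      # stand-in for the 397-word list
    stemmer = PorterStemmer()

    def normalize(text):
        tokens = re.findall(r"\w+", text.lower())   # ignore case
        out = []
        for t in tokens:
            if t in STOPWORDS:
                out.append("*STOP*")                # stop words
            elif re.fullmatch(r"\d{4}", t):
                out.append("*YEAR*")                # 4-digit strings
            elif t.isdigit():
                out.append("*NUMBER*")              # other digit strings
            else:
                out.append(stemmer.stem(t))         # Porter stemming
        return out

    print(normalize("The White House budget of 1998 ran to 450 pages"))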
TF.IDF model
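
The formula on this slide did not survive extraction; as a generic reference
point, here is a minimal tf.idf scorer with a plain log idf (an assumption,
not necessarily the exact baseline variant used in the paper):

    import math
    from collections import Counter

    def tfidf_score(query, doc_tokens, df, n_docs):
        # Sum of tf(q, D) * idf(q) over query terms;
        # df maps each term to its document frequency.
        tf = Counter(doc_tokens)
        return sum(tf[q] * math.log(n_docs / df[q])
                   for q in query if tf[q] > 0 and q in df)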
Non-interpolated average precision
HMM Refinements
 Blind Feedback
    a well-known technique for enhancing retrieval performance
 Bigrams
    some words take on a distinctive meaning in the context of another
     word, e.g. "white house", "Pope John Paul II"
 Query Section Weighting
    some portions of the query are more important than others
 Document Priors
    longer documents tend to be more informative than short ones
Blind Feedback
 Construct a new query based on the top-ranked documents.
 Rocchio algorithm
 Whether a word appearing in 90% of the top N retrieved documents is
  useful depends on the word:
    the word "very" is hardly informative
    the word "Nixon" is highly informative
 a0 and a1 can be estimated by the EM algorithm from training queries,
  as sketched below.
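
A minimal sketch of that EM estimate, assuming training queries paired with
judged-relevant documents (names and the fixed iteration count are
illustrative):

    def em_mixture_weight(training, p_ge, p_d, iters=20, a1=0.5):
        # training: (query_words, doc_id) pairs where doc_id is a
        # relevant document for the query (from training judgments).
        # p_ge(q) and p_d(q, d) are the unigram estimates from above.
        for _ in range(iters):
            resp = 0.0   # expected count of words from the Document state
            total = 0
            for query, d in training:
                for q in query:
                    num = a1 * p_d(q, d)
                    den = (1.0 - a1) * p_ge(q) + num
                    if den > 0.0:
                        resp += num / den   # E-step: Document responsibility
                        total += 1
            a1 = resp / total               # M-step: new mixture weight
        return 1.0 - a1, a1                 # (a0, a1)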
Estimating a1
 In equation (5) of the paper:
    Q' = a general query; q' = a general query word
    Qi = one training query; Q = the set of available training queries
    Im,Qi = the top m documents retrieved for Qi
    df(w) = the document frequency of word w
 Example: for Qi = "Germany", the top-ranked documents I0, I1, ..., Im
  contain related words such as "Berlin".
 Negative values of the estimate are avoided by flooring it.
Performance gained
Bigrams
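
The details of this slide were lost in extraction. One plausible reading,
following the bigram motivation on the refinements slide, adds a
document-bigram term with its own transition weight a2 (a0 + a1 + a2 = 1);
the exact form below is an assumption:

    import math

    def bigram_score(query, p_ge, p_d, p_bi, a0=0.2, a1=0.5, a2=0.3):
        # log P(Q | D) with an extra document-bigram component:
        # a0*P(q|GE) + a1*P(q|D) + a2*P(q | q_prev, D).
        # p_bi(prev, q) is estimated from adjacent word pairs in D.
        log_p, prev = 0.0, None
        for q in query:
            p = a0 * p_ge(q) + a1 * p_d(q)
            # the first query word has no bigram context; fall back to unigram
            p += a2 * (p_bi(prev, q) if prev is not None else p_d(q))
            log_p += math.log(p)
            prev = q
        return log_p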
Query Section Weighting
 In the TREC evaluation, the title section of a topic is more important
  than the others.
 vs(q) = the weight of the query section containing q
    vdesc = 1.2, vnarr = 1.9, vtitle = 5.7
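
One natural way to apply these weights is to raise each word's mixture
probability to the power vs(q), which in log space becomes a simple
multiplication; a sketch under that assumption:

    import math

    WEIGHTS = {"title": 5.7, "desc": 1.2, "narr": 1.9}

    def weighted_score(query, p_mix, section_of):
        # query: list of words; p_mix(q) = a0*P(q|GE) + a1*P(q|D);
        # section_of(q) names the topic section the word came from.
        return sum(WEIGHTS[section_of(q)] * math.log(p_mix(q))
                   for q in query)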
Document Priors
 A refereed journal may be more informative than a supermarket tabloid.
 The most predictive features are:
    source
    length
    average word length
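
Following the Bayes decomposition on the Probability Model slide, the prior
simply multiplies the query likelihood; a minimal sketch (how the features
above map to a prior value is left abstract):

    import math

    def posterior_score(log_p_query, prior):
        # log P(D is R | Q) up to the document-independent constant log P(Q):
        # log P(Q | D is R) + log P(D is R).
        return log_p_query + math.log(prior)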
Conclusion
 A novel method in IR using HMMs
 Offers a rich setting to incorporate new and familiar techniques
 Experiments with a system that implements:
    blind feedback
    bigram modeling
    query section weighting
    document priors
 Future work: the HMM can be extended to accommodate
    passage retrieval
    explicit synonym modeling
    concept modeling
Resources
 D. Miller, T. Leek, R. Schwartz. A Hidden Markov Model Information
  Retrieval System. SIGIR '99, Berkeley, CA, USA, 1999.
 L. Rabiner. A tutorial on Hidden Markov Models and selected applications
  in speech recognition. Proceedings of the IEEE 77(2), pp. 257-286, 1989.
Thank you very much!

Questions?