A Hidden Markov Model Information Retrieval System
Mahboob Alam Khalid
Language Model based Information Retrieval, University of Saarland

Overview
- Motivation
- Hidden Markov Model (introduction)
- HMM for an information retrieval system
  - Probability model
  - Baseline system
  - Experiments
- HMM refinements
  - Blind feedback
  - Bigrams
  - Query section weighting
  - Document priors
- Conclusion

Motivation
Hidden Markov models have been applied successfully to:
- Speech recognition
- Named entity finding
- Optical character recognition
- Topic identification
- Ad hoc information retrieval (now)

Hidden Markov Model (Introduction)
- You see a sequence of observations (words), but you do not know the sequence of generators (states); an HMM is a model for exactly this problem.
- Two kinds of probabilities are involved in an HMM:
  - Transition probabilities from one state to the others, which sum to 1.
  - Observation probabilities for each state, which also sum to 1.

A discrete HMM
- A set of output symbols
- A set of states
- A set of transitions between states
- A probability distribution over output symbols for each state

The observed sampling process:
- Start from some initial state
- Transition from it to another state
- Sample from the output distribution at that state
- Repeat these steps

HMM for an Information Retrieval System
- Observed data: the query Q
- Unknown key: a relevant document D
- Noisy channel: the mind of the user, which transforms an imagined notion of relevance into the text of Q
- We want P(D is R | Q): the probability that D is relevant in the user's mind, given that Q was the query produced

Probability Model
By Bayes' rule:

    P(D is R | Q) = P(Q | D is R) · P(D is R) / P(Q)

- P(D is R) is the prior probability of relevance.
- P(Q) is identical for all documents, so it does not affect the ranking.
- Output symbols: the union of all words in the corpus.
- States: the mechanisms of query word generation, here Document and General English.

A simple two-state HMM
[Figure: from "query start", transition a0 leads to the General English state, which emits with P(q | GE), and transition a1 leads to the Document state, which emits with P(q | D); both states lead to "query end".]
The choice of which kind of word to generate next is independent of the previous such choice.

Why simplify the parameters?
- In principle there is one HMM per document, and computing its parameters with EM would require training samples: documents paired with training queries, which are not available.
- The parameters are therefore estimated directly:

    P(q | D_k) = (number of times q appears in D_k) / (length of D_k)

    P(q | GE) = (Σ_k number of times q appears in D_k) / (Σ_k length of D_k)

    P(Q | D_k is R) = ∏_{q ∈ Q} ( a0 · P(q | GE) + a1 · P(q | D_k) )

  (A code sketch of this scoring follows the next slide.)

Baseline System Performance
- Number of queries: 50
- An inverted index is created, storing tf values (term frequencies), ignoring case, and applying the Porter stemmer
- 397 stop words are replaced with the special token *STOP*; similarly, 4-digit strings are replaced by *YEAR* and other digit strings by *NUMBER*
- Test collections: TREC-6 and TREC-7
  - TREC-6: 556,077 documents (news and government agencies), an average of 26.5 unique terms
  - TREC-7: 528,155 documents, an average of 17.6 unique terms
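To make the baseline concrete, here is a minimal sketch of the preprocessing just described. The tiny stop list and the identity stemmer are placeholder assumptions of this sketch (the actual system uses a 397-word stop list and the Porter stemmer).

```python
import re

# Placeholder stand-ins: the real system lowercases the text, applies the
# Porter stemmer, and maps a list of 397 stop words to *STOP*; the tiny
# stop list and identity "stemmer" below only keep the sketch self-contained.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def stem(word: str) -> str:
    # Stand-in for the Porter stemmer (e.g. nltk.stem.PorterStemmer).
    return word

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    out = []
    for tok in tokens:
        if tok.isdigit():
            # 4-digit strings become *YEAR*, other digit strings *NUMBER*.
            out.append("*YEAR*" if len(tok) == 4 else "*NUMBER*")
        elif tok in STOP_WORDS:
            out.append("*STOP*")
        else:
            out.append(stem(tok))
    return out

print(preprocess("The White House was built in 1792 by 100 workers"))
# -> ['*STOP*', 'white', 'house', 'was', 'built', '*STOP*', '*YEAR*',
#     'by', '*NUMBER*', 'workers']
```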
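Continuing the sketch, the estimates from the "Why simplify the parameters?" slide translate almost line for line into a ranking function. The function name hmm_scores and the default a1 = 0.3 are assumptions of this sketch, not values taken from the paper; a1 would in practice be estimated from training queries, as discussed later.

```python
import math
from collections import Counter

def hmm_scores(docs: dict[str, list[str]], query: list[str],
               a1: float = 0.3) -> dict[str, float]:
    """Rank documents by log P(Q | D_k is R) under the two-state HMM:

        P(q | D_k) = count of q in D_k / length of D_k
        P(q | GE)  = corpus count of q / corpus length
        P(Q | D_k) = product over q in Q of (a0 * P(q|GE) + a1 * P(q|D_k))
    """
    a0 = 1.0 - a1
    corpus_len = sum(len(toks) for toks in docs.values())
    corpus_tf = Counter(tok for toks in docs.values() for tok in toks)

    scores = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        log_p = 0.0
        for q in query:
            p_ge = corpus_tf[q] / corpus_len   # General English state
            p_dk = tf[q] / len(toks)           # Document state
            p_mix = a0 * p_ge + a1 * p_dk
            if p_mix > 0.0:                    # a word absent from the whole
                log_p += math.log(p_mix)       # corpus contributes nothing
        scores[name] = log_p
    return scores

docs = {"d1": "white house press office".split(),
        "d2": "house prices rose again".split()}
print(hmm_scores(docs, ["white", "house"]))    # d1 should outrank d2
```

Summing log probabilities instead of multiplying the raw product avoids numerical underflow on long queries while leaving the ranking unchanged.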
TF.IDF model
[Figure: the tf.idf reference model used for comparison; details not recoverable from the slide.]

Non-interpolated average precision
[Figure: non-interpolated average precision results.]

HMM Refinements
- Blind feedback: a well-known technique for enhancing performance
- Bigrams: a word can have a distinctive meaning when used in the context of another word, e.g. "white house", "Pope John Paul II"
- Query section weighting: some portions of the query are more important than others
- Document priors: longer documents tend to be more informative than short ones

Blind Feedback
- Construct a new query based on the top-ranked documents (Rocchio algorithm)
- A word occurring in 90% of the top N retrieved documents may still differ in value: "very" is barely informative, while "Nixon" is highly informative
- a0 and a1 can be estimated with the EM algorithm from training queries

Estimating a1
In equation (5) of the paper:
- Q' = a general query, q' = a general query word
- Q_i = one training query, Q = the available training queries
- I_m,Q_i = the top m documents retrieved for Q_i
- df(w) = the document frequency of w
- Negative values are avoided by taking the floor of the estimate
[Figure: example with Q_i = "Germany"; the top-ranked documents I_0, I_1, ..., I_m contain related words such as "Berlin".]

Performance gained
[Figure: performance gains from blind feedback.]

Bigrams
[Figure: the bigram model; equations and results not recoverable from the slide.]

Query Section Weighting
- In the TREC evaluation, the title section of a query is more important than the others
- v_s(q) = the weight for the section of the query containing q
- v_desc = 1.2, v_narr = 1.9, v_title = 5.7

Document Priors
- A refereed journal may be more informative than a supermarket tabloid
- Most predictive features: source, length, average word length

Conclusion
- A novel method in IR using HMMs
- HMMs offer a rich setting in which to incorporate new and familiar techniques
- Experiments with a system that implements:
  - Blind feedback
  - Bigram modeling
  - Query section weighting
  - Document priors
- Future work: the HMM can be extended to accommodate:
  - Passage retrieval
  - Explicit synonym modeling
  - Concept modeling

Resources
- D. Miller, T. Leek, R. Schwartz. "A Hidden Markov Model Information Retrieval System." Proceedings of SIGIR '99, Berkeley, CA, USA, 1999.
- L. Rabiner. "A tutorial on Hidden Markov Models and selected applications in speech recognition." Proceedings of the IEEE 77(2), pp. 257-286, 1989.

Thank you very much! Questions?