A Hidden Markov Model Information Retrieval System
Mahboob Alam Khalid
Language Model based Information Retrieval, University of Saarland

Overview
- Motivation
- Hidden Markov Model (introduction)
- HMM for an information retrieval system
  - Probability model
  - Baseline system
  - Experiments
- HMM refinements
  - Blind feedback
  - Bigrams
  - Query section weighting
  - Document priors
- Conclusion

Motivation
Hidden Markov models have been applied successfully to:
- Speech recognition
- Named entity finding
- Optical character recognition
- Topic identification
- Ad hoc information retrieval (now)

Hidden Markov Model (Introduction)
- You see a sequence of observations (words), but you do not know the sequence of generators (states); an HMM is a model for exactly this problem.
- Two kinds of probabilities are involved in an HMM:
  - Transition probabilities from one state to the others, which sum to 1.
  - Observation probabilities for each state, which also sum to 1.

A discrete HMM
- A set of output symbols
- A set of states
- A set of transitions between states
- A probability distribution over output symbols for each state

The observed sampling process:
- Start from some initial state
- Transition from it to another state
- Sample from the output distribution at that state
- Repeat these steps

HMM for an Information Retrieval System
- Observed data: the query Q
- Unknown key: a relevant document D
- Noisy channel: the mind of the user, which transforms an imagined notion of relevance into the text of Q
- We want P(D is R | Q): the probability that D is relevant in the user's mind, given that Q was the query produced

Probability Model
By Bayes' rule:

    P(D is R | Q) = P(Q | D is R) · P(D is R) / P(Q)

- P(D is R) is the prior probability of relevance.
- P(Q) is identical for all documents, so it does not affect the ranking.
- Output symbols: the union of all words in the corpus.
- States: the mechanisms of query word generation, here Document and General English.

A simple two-state HMM
[Figure: from "query start", transition a0 leads to the General English state, which emits with P(q | GE), and transition a1 leads to the Document state, which emits with P(q | D); both states lead to "query end".]
The choice of which kind of word to generate next is independent of the previous such choice.

Why simplify the parameters?
- In principle there is one HMM per document, and computing its parameters with EM would require training samples: documents paired with training queries, which are not available.
- The parameters are therefore estimated directly:

    P(q | D_k) = (number of times q appears in D_k) / (length of D_k)

    P(q | GE) = (Σ_k number of times q appears in D_k) / (Σ_k length of D_k)

    P(Q | D_k is R) = ∏_{q ∈ Q} ( a0 · P(q | GE) + a1 · P(q | D_k) )

  (A code sketch of this scoring follows the next slide.)

Baseline System Performance
- Number of queries: 50
- An inverted index is created, storing tf values (term frequencies), ignoring case, and applying the Porter stemmer
- 397 stop words are replaced with the special token *STOP*; similarly, 4-digit strings are replaced by *YEAR* and other digit strings by *NUMBER*
- Test collections: TREC-6 and TREC-7
  - TREC-6: 556,077 documents (news and government agencies), an average of 26.5 unique terms
  - TREC-7: 528,155 documents, an average of 17.6 unique terms
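To make the baseline concrete, here is a minimal sketch of the preprocessing just described. The tiny stop list and the identity stemmer are placeholder assumptions of this sketch (the actual system uses a 397-word stop list and the Porter stemmer).

```python
import re

# Placeholder stand-ins: the real system lowercases the text, applies the
# Porter stemmer, and maps a list of 397 stop words to *STOP*; the tiny
# stop list and identity "stemmer" below only keep the sketch self-contained.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def stem(word: str) -> str:
    # Stand-in for the Porter stemmer (e.g. nltk.stem.PorterStemmer).
    return word

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    out = []
    for tok in tokens:
        if tok.isdigit():
            # 4-digit strings become *YEAR*, other digit strings *NUMBER*.
            out.append("*YEAR*" if len(tok) == 4 else "*NUMBER*")
        elif tok in STOP_WORDS:
            out.append("*STOP*")
        else:
            out.append(stem(tok))
    return out

print(preprocess("The White House was built in 1792 by 100 workers"))
# -> ['*STOP*', 'white', 'house', 'was', 'built', '*STOP*', '*YEAR*',
#     'by', '*NUMBER*', 'workers']
```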
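Continuing the sketch, the estimates from the "Why simplify the parameters?" slide translate almost line for line into a ranking function. The function name hmm_scores and the default a1 = 0.3 are assumptions of this sketch, not values taken from the paper; a1 would in practice be estimated from training queries, as discussed later.

```python
import math
from collections import Counter

def hmm_scores(docs: dict[str, list[str]], query: list[str],
               a1: float = 0.3) -> dict[str, float]:
    """Rank documents by log P(Q | D_k is R) under the two-state HMM:

        P(q | D_k) = count of q in D_k / length of D_k
        P(q | GE)  = corpus count of q / corpus length
        P(Q | D_k) = product over q in Q of (a0 * P(q|GE) + a1 * P(q|D_k))
    """
    a0 = 1.0 - a1
    corpus_len = sum(len(toks) for toks in docs.values())
    corpus_tf = Counter(tok for toks in docs.values() for tok in toks)

    scores = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        log_p = 0.0
        for q in query:
            p_ge = corpus_tf[q] / corpus_len   # General English state
            p_dk = tf[q] / len(toks)           # Document state
            p_mix = a0 * p_ge + a1 * p_dk
            if p_mix > 0.0:                    # a word absent from the whole
                log_p += math.log(p_mix)       # corpus contributes nothing
        scores[name] = log_p
    return scores

docs = {"d1": "white house press office".split(),
        "d2": "house prices rose again".split()}
print(hmm_scores(docs, ["white", "house"]))    # d1 should outrank d2
```

Summing log probabilities instead of multiplying the raw product avoids numerical underflow on long queries while leaving the ranking unchanged.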
TF.IDF model
[Figure: the tf.idf reference model used for comparison; details not recoverable from the slide.]

Non-interpolated average precision
[Figure: non-interpolated average precision results.]

HMM Refinements
- Blind feedback: a well-known technique for enhancing performance
- Bigrams: a word can have a distinctive meaning when used in the context of another word, e.g. "white house", "Pope John Paul II"
- Query section weighting: some portions of the query are more important than others
- Document priors: longer documents tend to be more informative than short ones

Blind Feedback
- Construct a new query based on the top-ranked documents (Rocchio algorithm)
- A word occurring in 90% of the top N retrieved documents may still differ in value: "very" is barely informative, while "Nixon" is highly informative
- a0 and a1 can be estimated with the EM algorithm from training queries

Estimating a1
In equation (5) of the paper:
- Q' = a general query, q' = a general query word
- Q_i = one training query, Q = the available training queries
- I_m,Q_i = the top m documents retrieved for Q_i
- df(w) = the document frequency of w
- Negative values are avoided by taking the floor of the estimate
[Figure: example with Q_i = "Germany"; the top-ranked documents I_0, I_1, ..., I_m contain related words such as "Berlin".]

Performance gained
[Figure: performance gains from blind feedback.]

Bigrams
[Figure: the bigram model; equations and results not recoverable from the slide.]

Query Section Weighting
- In the TREC evaluation, the title section of a query is more important than the others
- v_s(q) = the weight for the section of the query containing q
- v_desc = 1.2, v_narr = 1.9, v_title = 5.7

Document Priors
- A refereed journal may be more informative than a supermarket tabloid
- Most predictive features: source, length, average word length

Conclusion
- A novel method in IR using HMMs
- HMMs offer a rich setting in which to incorporate new and familiar techniques
- Experiments with a system that implements:
  - Blind feedback
  - Bigram modeling
  - Query section weighting
  - Document priors
- Future work: the HMM can be extended to accommodate:
  - Passage retrieval
  - Explicit synonym modeling
  - Concept modeling

Resources
- D. Miller, T. Leek, R. Schwartz. "A Hidden Markov Model Information Retrieval System." Proceedings of SIGIR '99, Berkeley, CA, USA, 1999.
- L. Rabiner. "A tutorial on Hidden Markov Models and selected applications in speech recognition." Proceedings of the IEEE 77(2), pp. 257-286, 1989.

Thank you very much! Questions?