Information Retrieval Modeling

Special Topics in Computer Science
Advanced Topics in Information Retrieval
Chapter 2: Modeling
Alexander Gelbukh
www.Gelbukh.com

Previous chapter
• User Information Need
  o Vague
  o Semantic, not formal
• Document Relevance
  o Order, not retrieve
• Huge amount of information
  o Efficiency concerns
  o Tradeoffs
• Art more than science

Modeling
• Still science: computation is formal
• No good methods to work with (vague) semantics
• Thus, simplify to get a (formal) model
• Develop (precise) math over this (simple) model
Why math if the model is not precise (simplified)?
  With math:    phenomenon ≈ model = step 1 = step 2 = ... = result
  Without math: phenomenon ≈ model ≈ step 1 ≈ step 2 ≈ ... ≈ ?!

Modeling
• Substitute a complex real phenomenon with a simple model, which you can measure and manipulate formally
• Keep only important properties (for this application)
• Do this with text

Modeling in IR: idea
• Tag documents with fields
  o As in a (relational) DB: customer = {name, age, address}
  o Unlike DB, very many fields: individual words!
  o E.g., bag of words: {word1, word2, ...}: {3, 5, 0, 0, 2, ...}
• Define a similarity measure between query and such a record
  o (Unlike DB) Rank (order), not retrieve (yes/no)
  o Justify your model (optional, but nice)
• Develop math and algorithms for fast access
  o as relational algebra in DB

Taxonomy of IR systems

Aspects of an IR system
• IR model
  o Boolean, Vector, Probabilistic
• Logical view of documents
  o Full text, bag of words, ...
• User task
  o Retrieval, browsing
These aspects are independent, though some combinations are more compatible than others.

Appropriate models

Characterization of an IR model
• D = {dj}, collection of formal representations of docs
  o e.g., keyword vectors
• Q = {qi}, possible formal representations of user information need (queries)
• F, framework for modeling these two: the reason for the next item
• R(qi,dj): Q × D → R, ranking function
  o defines ordering

Specific IR models

IR models
• Classical
  o Boolean
  o Vector
  o Probabilistic
  (clear ideas, but some disadvantages)
• Refined
  o Each one with refinements
  o Solve many of the problems of the “basic” models
  o Give good examples of possible developments in the area
  o Not investigated well
• We can work on this

Basic notions
• Document: set of index terms
  o Mainly nouns
  o Maybe all words, then full text logical view
• Term weights
  o Some terms are better than others
  o Terms less frequent in this doc and more frequent in other docs are less useful
• Documents → index term vector {w1j, w2j, ..., wtj}
  o weights of terms in the doc
  o t is the number of terms in all docs
  o weights of different terms are independent (simplification)

Boolean model
• Weights ∈ {0, 1}
  o Doc: set of words
• Query: Boolean expression
  o R(qi,dj) ∈ {0, 1}
• Good:
  o clear semantics, neat formalism, simple
• Bad:
  o no ranking (so it is data retrieval), retrieves too many or too few
  o difficult to translate User Information Need into query
• No term weighting

Vector model
• Weights (non-binary)
• Ranking, much better results (for User Info Need)
• R(qi,dj) = correlation between query vector and doc vector
• E.g., cosine measure: R(qi,dj) = (qi · dj) / (|qi| |dj|)
  (there is a typo in the book)

Projection

Weights
• How are the weights wij obtained? Many variants. One way: TF-IDF balance
• TF: Term frequency
  o How well is the term related to the doc?
  o If it appears many times, it is important
  o Proportional to the number of times it appears
• IDF: Inverse document frequency
  o How important is the term for distinguishing documents?
  o If it appears in many docs, it is not important
  o Inversely proportional to the number of docs where it appears
• Contradictory. How to balance?

TF-IDF ranking
• TF: Term frequency
• IDF: Inverse document frequency
• Balance: TF × IDF
  o Other formulas exist. Art.

Advantages of vector model
One of the best known strategies
• Improves quality (term weighting)
• Allows approximate matching (partial matching)
• Gives ranking by similarity (cosine formula)
• Simple, fast
But:
• Does not consider term dependencies
  o considering them in a bad way hurts quality
  o no known good way
• No logical expressions (e.g., negation: “mouse & NOT cat”)

Probabilistic model
• Assumptions:
  o there is a set of “relevant” docs
  o docs have probabilities of being relevant
  o After Bayes calculation: probabilities of terms being important for defining relevant docs
• Initial idea: interact with the user.
  o Generate an initial set
  o Ask the user to mark some of them as relevant or not
  o Estimate the probabilities of keywords. Repeat
• Can be done without the user
  o Just re-calculate the probabilities, assuming the user’s acceptance is the same as the predicted ranking

(Dis)advantages of Probabilistic model
Advantage:
• Theoretical adequacy: ranks by probabilities
Disadvantages:
• Need to guess the initial ranking
• Binary weights, ignores frequencies
• Independence assumption (not clear if bad)
Does not perform well (?)

Alternative Set Theoretic models
Fuzzy set model
• Takes into account term relationships (thesaurus)
  o Bible is related to Church
• Fuzzy belonging of a term to a document
  o Document containing Bible also contains “a little bit of” Church, but not entirely
• Fuzzy set logic applied to such fuzzy belonging
  o logical expressions with AND, OR, and NOT
• Provides ranking, not just yes/no
• Not investigated well.
  o Why not investigate it?
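
The "a little bit of Church" idea from the fuzzy set model can be made concrete with a small sketch. The slides give no formulas, so the ones below are an assumption: a co-occurrence correlation c(i,l) = n_il / (n_i + n_l - n_il) between terms, and a membership degree mu(i,j) = 1 - product over terms l in doc j of (1 - c(i,l)), which is one formulation found in the IR literature. The function names and the toy collection are hypothetical.

```python
from itertools import combinations


def term_correlations(docs):
    """Term-term correlation c[a][b] = n_ab / (n_a + n_b - n_ab).

    docs is a list of sets of terms; n_a counts docs containing a,
    n_ab counts docs containing both a and b. (Assumed formula.)
    """
    vocab = sorted(set().union(*docs))
    n = {t: sum(t in d for d in docs) for t in vocab}
    corr = {t: {t: 1.0} for t in vocab}  # every term fully correlates with itself
    for a, b in combinations(vocab, 2):
        both = sum(a in d and b in d for d in docs)
        corr[a][b] = corr[b][a] = both / (n[a] + n[b] - both)
    return corr


def membership(term, doc, corr):
    """Fuzzy degree to which doc "contains" term: 1 - prod(1 - c[term][l])."""
    prod = 1.0
    for other in doc:
        prod *= 1.0 - corr[term].get(other, 0.0)
    return 1.0 - prod


# Hypothetical toy collection; each document is a set of index terms.
docs = [
    {"bible", "church"},
    {"bible", "church", "history"},
    {"bible", "translation"},
    {"mouse", "cat"},
]
corr = term_correlations(docs)
doc = docs[2]  # mentions "bible" but never "church"
print("church:", membership("church", doc, corr))
print("bible: ", membership("bible", doc, corr))
print("mouse: ", membership("mouse", doc, corr))
```

In this sketch the third document gets a membership strictly between 0 and 1 for "church" because "bible" and "church" co-occur elsewhere, full membership for "bible", and zero for the unrelated "mouse". A term absent from the whole collection would raise a KeyError here; a fuller version would need to handle that.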

Alternative Set Theoretic models
Extended Boolean model
• Combination of Boolean and Vector
• In comparison with the Boolean model, adds “distance from query”
  o some documents satisfy the query better than others
• In comparison with the Vector model, adds the distinction between AND and OR combinations
• There is a parameter (degree of norm) that adjusts the behavior between Boolean-like and Vector-like
• This can even be different within one query
• Not investigated well. Why not investigate it?

Alternative Algebraic models
Generalized Vector Space model
• Classical independence assumptions:
  o All combinations of terms are possible, none are equivalent (= basis in the vector space)
  o Pair-wise orthogonal: cos ({ki}, {kj}) = 0
• This model relaxes the pair-wise orthogonality: cos ({ki}, {kj}) ≠ 0
• Operates by combinations (co-occurrences) of index terms, not individual terms
• More complex, more expensive, not clear if better
• Not investigated well. Why not investigate it?

Alternative Algebraic models
Latent Semantic Indexing model
• Index by larger units, “concepts” ≈ sets of terms used together
• Retrieve a document that shares concepts with a relevant one (even if it does not contain query terms)
• Group index terms together (map into a lower dimensional space). So some terms are equivalent.
  o Not exactly, but this is the idea
  o Eliminates unimportant details
  o Depends on a parameter (what details are unimportant?)
• Not investigated well. Why not investigate it?

Alternative Algebraic models
Neural Network model
• NNs are good at matching
• Iteratively uses the found documents as auxiliary queries
  o Spreading activation.
  o Terms → docs → terms → docs → terms → docs → ...
• Like a built-in thesaurus
• First round gives the same result as the Vector model
• No evidence if it is good
• Not investigated well. Why not investigate it?
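
The "degree of norm" parameter of the Extended Boolean model slide can be illustrated with a short sketch. The slides do not state the formulas, so this assumes the usual p-norm form, with term weights in [0, 1]: an OR query scores ((x1^p + ... + xm^p) / m)^(1/p) and an AND query scores 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p). The function names and weights are hypothetical.

```python
def sim_or(weights, p):
    """p-norm score of an OR query; weights are the doc's term weights in [0, 1]."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1.0 / p)


def sim_and(weights, p):
    """p-norm score of an AND query; measures distance from the ideal point (1, ..., 1)."""
    m = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / m) ** (1.0 / p)


# Hypothetical document: strong on the first query term, weak on the second.
weights = [0.8, 0.2]
for p in (1, 2, 10, 100):
    print(p, round(sim_or(weights, p), 4), round(sim_and(weights, p), 4))

# The norm may differ inside one query: (k1 OR k2) with p=2, then AND k3 with p=10.
nested = sim_and([sim_or([0.8, 0.2], 2), 0.5], 10)
print("nested:", round(nested, 4))
```

Under these assumptions, p = 1 makes AND and OR coincide (both are the average weight, which is Vector-like), while a large p pushes OR toward the maximum weight and AND toward the minimum (Boolean-like).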

Models for browsing
• Flat browsing: String
  o Just as a list of paper
  o No context cues provided
• Structure guided: Tree
  o Hierarchy
  o Like a directory tree in the computer
• Hypertext (Internet!): Directed graph
  o No limitations of sequential writing
  o Modeled by a directed graph: links from unit A to unit B
    units: docs, chapters, etc.
  o A map (with traversed path) can be helpful

Research issues
• How do people judge relevance?
  o ranking strategies
• How to combine different sources of evidence?
• What interfaces can help users understand and formulate their Information Need?
  o user interfaces: an open issue
• Meta-search engines: combine results from different Web search engines
  o Their results barely overlap
  o How to combine rankings?

Conclusions
• Modeling is needed for formal operations
• Boolean model is the simplest
• Vector model is the best combination of quality and simplicity
  o TF-IDF term weighting
  o This (or similar) weighting is used in all further models
• Many interesting and not well-investigated variations
  o possible future work

Thank you!
Till March 22, 6 pm
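
The TF-IDF weighting and cosine ranking named in the conclusions can be sketched as follows. This is a minimal illustration, not the formulas from the slides (which say that many variants exist): it assumes tf = count / (max count in the doc), idf = log(N / n_i), and the same weighting for the query. The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists -> (list of {term: weight} dicts, idf dict)."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))  # number of docs containing each term
    idf = {t: math.log(n_docs / df[t]) for t in df}
    vectors = []
    for d in docs:
        counts = Counter(d)
        max_count = max(counts.values(), default=1)
        vectors.append({t: (c / max_count) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs):
    """Return (score, doc index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(t for t in query_tokens if t in idf)  # drop unknown terms
    q_max = max(q_counts.values(), default=1)
    q_vec = {t: (c / q_max) * idf[t] for t, c in q_counts.items()}
    scores = [(cosine(q_vec, v), j) for j, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical toy collection of already tokenized documents.
docs = [
    ["mouse", "cat", "cat"],
    ["mouse", "keyboard"],
    ["cat", "dog"],
    ["dog", "bone"],
]
ranking = rank(["cat", "mouse"], docs)
for score, j in ranking:
    print(j, round(score, 3), docs[j])
```

The sketch shows the partial matching mentioned among the advantages of the vector model: documents containing only one of the two query terms still receive a non-zero score and a place in the ranking, while the document with neither term scores zero.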
