
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Chapter 2: Modeling
Alexander Gelbukh
Previous chapter
 User Information Need
o Vague
o Semantic, not formal
 Document Relevance
o Order, not retrieve
 Huge amount of information
o Efficiency concerns
o Tradeoffs
 Art more than science
Still science: computation is formal
No good methods to work with (vague) semantics
Thus, simplify to get a (formal) model
Develop (precise) math over this (simple) model
Why math if the model is not precise (simplified)?
phenomenon ≈ model = step 1 = step 2 = ... = result
o the only approximation is the modeling step; every formal step is
exact, so the result applies to the phenomenon as well as the model does
phenomenon ≈ model → step 1 → step 2 → ... → ?!
o with informal steps, errors accumulate and nothing guarantees the result
 Substitute a complex real phenomenon with a simple
model, which you can measure and manipulate formally
 Keep only important properties (for this application)
 Do this with text:
Modeling in IR: idea
 Tag documents with fields
o As in a (relational) DB: customer = {name, age, address}
o Unlike DB, very many fields: individual words!
o E.g., bag of words: {word1, word2, ...}: {3, 5, 0, 0, 2, ...}
 Define a similarity measure between query and such
a record
o (Unlike DB) Rank (order), not retrieve (yes/no)
o Justify your model (optional, but nice)
 Develop math and algorithms for fast access
o as relational algebra in DB
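The "very many fields: individual words" idea above can be made concrete. A minimal sketch in Python (the vocabulary and document are invented for illustration):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a document as term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["cat", "dog", "mouse"]
doc = "the cat chased the mouse and the cat won"
print(bag_of_words(doc, vocab))  # -> [2, 0, 1]
```

Word order and grammar are discarded; only the per-term counts (the "fields") survive, which is exactly the simplification the model makes.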
Taxonomy of IR systems
Aspects of an IR system
 IR model
o Boolean, Vector, Probabilistic
 Logical view of documents
o Full text, bag of words, ...
 User task
o retrieval, browsing
The aspects are independent, though some combinations are more compatible than others
Appropriate models
Characterization of an IR model
 D = {dj}, collection of formal representations of docs
o e.g., keyword vectors
 Q = {qi}, possible formal representations of user
information need (queries)
 F, framework for modeling these two: the rationale for the
ranking function
 R(qi, dj): Q × D → ℝ, ranking function
o defines ordering
Specific IR models
IR models
 Classical
o Boolean
o Vector
o Probabilistic
(clear ideas, but some disadvantages)
 Refined
o Each classical model has refined variants
o They solve many of the problems of the “basic” models
o They give good examples of possible developments in the area
o Not investigated well
 We can work on this
Basic notions
 Document: Set of index terms
o Mainly nouns
o Maybe all, then full text logical view
 Term weights
o some terms are better than others
o terms less frequent in this doc and more frequent in other
docs are less useful
 Document → index term vector {w1j, w2j, ..., wtj}
o weights of terms in the doc
o t is the number of terms in all docs
o weights of different terms are independent (simplification)
Boolean model
 Weights ∈ {0, 1}
o Doc: set of words
 Query: Boolean expression
o R(qi, dj) ∈ {0, 1}
 Good:
o clear semantics, neat formalism, simple
 Bad:
o no ranking (⇒ data retrieval), retrieves too many or too few
o difficult to translate User Information Need into query
 No term weighting
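The yes/no retrieval of the Boolean model can be sketched in a few lines; the document collection is invented, and the query is expressed as a predicate over a document's set of terms:

```python
def boolean_retrieve(docs, query):
    """Boolean model: a doc either satisfies the query or it does not.

    docs: name -> set of index terms; query: predicate over a term set.
    Returns matching doc names -- an unordered set, no ranking."""
    return [name for name, terms in docs.items() if query(terms)]

docs = {
    "d1": {"cat", "mouse"},
    "d2": {"dog", "mouse"},
    "d3": {"cat", "dog"},
}
# the query "mouse AND NOT cat" as a predicate
hits = boolean_retrieve(docs, lambda t: "mouse" in t and "cat" not in t)
print(hits)  # -> ['d2']
```

Note that the result is a flat set: d2 is not "more relevant" than any other match, which is exactly the "no ranking" limitation listed above.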
Vector model
 Weights (non-binary)
 Ranking, much better results (for User Info Need)
 R(qi, dj) = correlation between query vector and doc vector
 E.g., cosine measure:
sim(dj, q) = (dj · q) / (|dj| |q|) = Σi wij wiq / (√(Σi wij²) √(Σi wiq²))
(there is a typo in this formula in the book)
 How are the weights wij obtained? Many variants.
One way: TF-IDF balance
 TF: Term frequency
o How well is the term related to the doc?
o If it appears many times, it is important
o Proportional to the number of times it appears
 IDF: Inverse document frequency
o How important is the term for distinguishing between docs?
o If it appears in many docs, it is not important
o Inversely proportional to the number of docs where it appears
 Contradictory. How to balance?
TF-IDF ranking
 TF: Term frequency
 IDF: Inverse document frequency
 Balance: TF × IDF
o Other formulas exist. Art.
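One way to put TF, IDF, and the cosine measure together is sketched below. The tf × log(N/df) weighting is only one common variant among the many the slide alludes to, and the toy collection is invented:

```python
import math
from collections import Counter

def build_weights(docs, vocab):
    """TF-IDF weights: w = tf * log(N / df) -- one variant among many.
    Assumes every vocab term occurs in at least one doc."""
    n = len(docs)
    idf = {t: math.log(n / sum(1 for d in docs if t in d)) for t in vocab}
    return [{t: Counter(d)[t] * idf[t] for t in vocab} for d in docs], idf

def cosine(u, v, vocab):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(u[t] * v[t] for t in vocab)
    nu = math.sqrt(sum(u[t] ** 2 for t in vocab))
    nv = math.sqrt(sum(v[t] ** 2 for t in vocab))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["cat", "mouse", "cat"], ["dog", "mouse"], ["cat", "dog", "bird"]]
vocab = ["bird", "cat", "dog", "mouse"]
weights, idf = build_weights(docs, vocab)
# weight the query with the same idf as the collection
query = {t: Counter(["cat"])[t] * idf[t] for t in vocab}
scores = [cosine(query, w, vocab) for w in weights]
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking)  # -> [0, 2, 1]: doc 0 mentions "cat" twice, doc 1 never
```

Unlike the Boolean model, every document gets a graded score, so partial matches (doc 2) rank between exact and non-matches.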
Advantages of vector model
One of the best known strategies
 Improves quality (term weighting)
 Allows approximate matching (partial matching)
 Gives ranking by similarity (cosine formula)
 Simple, fast
Disadvantages of vector model
 Does not consider term dependencies
o considering them in a bad way hurts quality
o no known good way to do it
 No logical expressions (e.g., negation: “mouse AND NOT cat”)
Probabilistic model
 Assumptions:
o set of “relevant” docs,
o probabilities of docs to be relevant
o After Bayes calculation: probabilities of terms to be
important for defining relevant docs
 Initial idea: interact with the user.
o Generate an initial set
o Ask the user to mark some of them as relevant or not
o Estimate the probabilities of keywords. Repeat
 Can be done without the user
o Just re-calculate the probabilities assuming the user’s
acceptance matches the predicted ranking
(Dis)advantages of Probabilistic model
 Theoretical adequacy: ranks by probabilities
 Need to guess the initial ranking
 Binary weights, ignores frequencies
 Independence assumption (not clear if bad)
Does not perform well (?)
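The interaction-free iteration described above can be sketched with the classic Binary Independence Model: guess initial probabilities, rank, then re-estimate as if the user had accepted the top-ranked docs. The collection, query, and smoothing constants below are invented for illustration:

```python
import math

def bim_scores(docs, query, p_rel, p_nonrel):
    """Binary Independence Model: rank docs by the sum of term log-odds.
    Binary weights -- only presence/absence of a term matters."""
    def score(doc):
        return sum(
            math.log(p_rel[t] * (1 - p_nonrel[t]) /
                     ((1 - p_rel[t]) * p_nonrel[t]))
            for t in query if t in doc)
    return [score(d) for d in docs]

docs = [{"cat", "mouse"}, {"dog", "mouse"}, {"dog", "bird"}, {"fish"}]
query = ["cat", "mouse"]
n = len(docs)
df = {t: sum(t in d for d in docs) for t in query}

# Step 1: initial guess -- P(t|relevant) = 0.5, P(t|non-relevant) ~ df/N
p_rel = {t: 0.5 for t in query}
p_nonrel = {t: (df[t] + 0.5) / (n + 1) for t in query}
scores = bim_scores(docs, query, p_rel, p_nonrel)

# Step 2: treat the top-ranked docs as relevant, re-estimate (no user needed)
top = sorted(range(n), key=lambda i: -scores[i])[:2]
v = {t: sum(t in docs[i] for i in top) for t in query}
p_rel = {t: (v[t] + 0.5) / (len(top) + 1) for t in query}
p_nonrel = {t: (df[t] - v[t] + 0.5) / (n - len(top) + 1) for t in query}
scores = bim_scores(docs, query, p_rel, p_nonrel)
```

After the re-estimation step both query terms acquire positive log-odds, and the doc containing both terms stays ahead of the doc containing only one.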
Alternative Set Theoretic models
Fuzzy set model
 Takes into account term relationships (thesaurus)
o Bible is related to Church
 Fuzzy belonging of a term to a document
o Document containing Bible also contains “a little bit of”
Church, but not entirely
 Fuzzy set logic applied to such fuzzy belonging
o logical expressions with AND, OR, and NOT
 Provides ranking, not just yes/no
 Not investigated well.
o Why not investigate it?
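A minimal sketch of fuzzy belonging, assuming the common formulation in which a doc's membership in a term's fuzzy set is 1 − Π(1 − c) over term-term correlations c taken from a thesaurus; the correlation values below are invented:

```python
def membership(term, doc_terms, corr):
    """Fuzzy degree to which the doc belongs to the set of docs "about" term:
    mu = 1 - prod over doc terms l of (1 - c(term, l)).
    A term is fully correlated with itself; unknown pairs get 0."""
    prod = 1.0
    for l in doc_terms:
        prod *= 1 - corr.get((term, l), 1.0 if term == l else 0.0)
    return 1 - prod

# hypothetical thesaurus correlation (invented value)
corr = {("church", "bible"): 0.7}
doc = {"bible", "priest"}
score = membership("church", doc, corr)  # -> approximately 0.7
```

So a document containing "Bible" belongs to the fuzzy set of "Church" documents with degree 0.7: "a little bit of" Church, but not entirely, exactly as described above.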
Alternative Set Theoretic models
Extended Boolean model
 Combination of Boolean and Vector
 In comparison with Boolean model, adds “distance
from query”
o some documents satisfy the query better than others
 In comparison with Vector model, adds the distinction
between AND and OR combinations
 A parameter (the degree p of the norm) allows adjusting the
behavior between Boolean-like and Vector-like
 This can be even different within one query
 Not investigated well. Why not investigate it?
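The adjustable parameter can be illustrated with the p-norm formulas commonly used for the extended Boolean model; the term weights below are invented:

```python
def or_p(weights, p):
    """p-norm OR: p = 1 behaves like the vector model (plain average),
    p -> infinity behaves like Boolean OR (max of the weights)."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def and_p(weights, p):
    """p-norm AND: complement form; p -> infinity gives Boolean min."""
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

w = [0.8, 0.2]        # one doc's weights for the two query terms
print(or_p(w, 1))     # vector-like: 0.5, the plain average
print(or_p(w, 100))   # nearly Boolean: close to max(w) = 0.8
print(and_p(w, 100))  # nearly Boolean: close to min(w) = 0.2
```

A single knob, p, moves the same query smoothly between the two classical behaviors, and nothing prevents using different p values for different operators within one query, as the slide notes.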
Alternative Algebraic models
Generalized Vector Space model
 Classical independence assumptions:
o All combinations of terms are possible, none are
equivalent (= basis in the vector space)
o Pair-wise orthogonal: cos ({ki}, {kj}) = 0
 This model relaxes the pair-wise orthogonality:
cos ({ki}, {kj}) ≠ 0
 Operates by combinations (co-occurrences) of index
terms, not individual terms
 More complex, more expensive, not clear if better
 Not investigated well. Why not investigate it?
Alternative Algebraic models
Latent Semantic Indexing model
 Index by larger units, “concepts” ≈ sets of terms used together
 Retrieve a document that shares concepts with a
relevant one (even if it does not contain query terms)
 Group index terms together (map into lower
dimensional space). So some terms are equivalent.
o Not exactly, but this is the idea
o Eliminates unimportant details
o Depends on a parameter (what details are unimportant?)
 Not investigated well. Why not investigate it?
Alternative Algebraic models
Neural Network model
 NNs are good at matching
 Iteratively uses the found documents as auxiliary evidence
o Spreading activation.
o Terms → docs → terms → docs → terms → docs → ...
Like a built-in thesaurus
First round gives same result as Vector model
No evidence if it is good
Not investigated well. Why not investigate it?
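The terms → docs → terms iteration can be sketched as plain spreading activation over an inverted index (an illustration of the idea only, not the actual neural network formulation); the collection and decay factor are invented:

```python
def spreading_activation(query, index, docs, rounds=2, decay=0.5):
    """index: term -> list of doc ids; docs: doc id -> set of terms.
    Terms activate the docs containing them; docs send decayed activation
    back to all their terms; repeat. Acts like a built-in thesaurus."""
    term_act = {t: 1.0 for t in query}
    doc_act = {}
    for _ in range(rounds):
        doc_act = {}
        for t, a in term_act.items():
            for d in index.get(t, ()):
                doc_act[d] = doc_act.get(d, 0.0) + a
        term_act = {}
        for d, a in doc_act.items():
            for t in docs[d]:
                term_act[t] = term_act.get(t, 0.0) + decay * a
    return doc_act

docs = {"d1": {"cat", "mouse"}, "d2": {"mouse", "cheese"}, "d3": {"dog"}}
index = {}
for d, terms in docs.items():
    for t in terms:
        index.setdefault(t, []).append(d)

# round 1 finds only d1; round 2 also activates d2 via the shared "mouse"
result = spreading_activation(["cat"], index, docs)
```

The first round retrieves exactly the docs containing the query term; later rounds pull in related docs through shared vocabulary, which is the thesaurus-like effect described above.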
Models for browsing
 Flat browsing: String
o Just a flat list of documents
o No context cues provided
 Structure guided: Tree
o Hierarchy
o Like directory tree in the computer
 Hypertext (Internet!): Directed graph
o No limitations of sequential writing
o Modeled by a directed graph: links from unit A to unit B
 units: docs, chapters, etc.
o A map (with traversed path) can be helpful
Research issues
 How do people judge relevance?
o ranking strategies
 How to combine different sources of evidence?
 What interfaces can help users to understand and
formulate their Information Need?
o user interfaces: an open issue
 Meta-search engines: combine results from different
Web search engines
o Their result sets barely intersect
o How to combine ranking?
Conclusions
 Modeling is needed for formal operations
 Boolean model is the simplest
 Vector model is the best combination of quality and simplicity
o TF-IDF term weighting
o This (or similar) weighting is used in all further models
 Many interesting and not well-investigated variations
o possible future work
Thank you!
Till March 22, 6 pm