Information Retrieval
Patrick Lambrix
Department of Computer and Information Science
Linköpings universitet

Storing and accessing textual information

[Figure: a data source management system gives access to physically stored data sources; queries and updates are processed against a model of the data and answers are returned.]

- How is the information stored? (high level)
- How is the information retrieved?

Storing textual information

Ordered by increasing structure and precision:
- Text (IR)
- Semi-structured data
- Data models (DB)
- Rules + facts (KB)

Storing textual information

Text - Information Retrieval:
- search using words
- conceptual models: boolean, vector, probabilistic, ...
- file model: flat file, inverted file, ...

IR - File model: inverted files

[Figure: an inverted file with one entry per WORD, its number of HITS and a LINK into the postings file (e.g. cloning: 53 hits, receptor: 22 hits, adrenergic: 32 hits); each postings-file entry holds a DOC# and a LINK into the document file, which stores the documents themselves (Doc1, Doc2, ...).]

IR - File model: inverted files

Reducing the index vocabulary:
- Controlled vocabulary
- Stop list
- Stemming

IR - formal characterization

An information retrieval model is a tuple (D,Q,F,R):
- D is a set of document representations
- Q is a set of queries
- F is a framework for modeling document representations, queries and their relationships
- R associates a real number with each document-query pair (ranking)

IR - conceptual models

Classic information retrieval:
- Boolean model
- Vector model
- Probabilistic model

Boolean model

Document representation (term order: adrenergic, cloning, receptor):

            Doc1   Doc2
adrenergic  yes    no
cloning     yes    yes
receptor    no     no

Doc1 --> (1 1 0)
Doc2 --> (0 1 0)

Boolean model

Queries: boolean (and, or, not)
Q1: cloning and (adrenergic or receptor)

Queries are translated to disjunctive normal form (DNF): a disjunction of conjunctions of terms, with or without 'not'.

Rules:
not not A --> A
not (A and B) --> not A or not B
not (A or B) --> not A and not B
(A or B) and C --> (A and C) or (B and C)
A and (B or C) --> (A and B) or (A and C)
(A and B) or C --> (A or C) and (B or C)
A or (B and C) --> (A or B) and (A or C)
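As a concrete illustration of boolean retrieval over bit vectors, here is a minimal sketch in Python. The document vectors are the slides' Doc1 and Doc2 (term order adrenergic, cloning, receptor); the helper names are my own, and the sketch checks each conjunction of the query directly instead of first completing the DNF to full term vectors.

```python
# Boolean model sketch: documents as bit vectors over a fixed term order.
docs = {"Doc1": (1, 1, 0), "Doc2": (0, 1, 0)}
TERMS = {"adrenergic": 0, "cloning": 1, "receptor": 2}

def matches(vec, conjunction):
    """conjunction: list of (term, required) pairs, e.g.
    [("cloning", True), ("adrenergic", False)]
    for 'cloning and not adrenergic'."""
    return all(vec[TERMS[t]] == int(required) for t, required in conjunction)

def answer(dnf):
    """dnf: list of conjunctions; a document matches
    if it satisfies at least one of them."""
    return [d for d, vec in docs.items()
            if any(matches(vec, conj) for conj in dnf)]

# Q2: cloning and not adrenergic
q2 = [[("cloning", True), ("adrenergic", False)]]
```

Evaluating `answer(q2)` returns only Doc2, matching the slides' result for Q2.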
Boolean model

Q1: cloning and (adrenergic or receptor)
--> (cloning and adrenergic) or (cloning and receptor)

The DNF is completed and translated to the same representation as the documents:
--> (cloning and adrenergic and receptor) or (cloning and adrenergic and not receptor) or (cloning and receptor and adrenergic) or (cloning and receptor and not adrenergic)
--> (1 1 1) or (1 1 0) or (1 1 1) or (0 1 1)
--> (1 1 1) or (1 1 0) or (0 1 1)

Boolean model

Doc1 --> (1 1 0), Doc2 --> (0 1 0)

Q1: cloning and (adrenergic or receptor)
--> (1 1 0) or (1 1 1) or (0 1 1)
Result: Doc1

Q2: cloning and not adrenergic
--> (0 1 0) or (0 1 1)
Result: Doc2

Boolean model

Q3: adrenergic and receptor
--> (1 0 1) or (1 1 1)
Result: empty

Boolean model

Advantages:
- based on an intuitive and simple formal model (set theory and boolean algebra)
Disadvantages:
- binary decisions: a word is relevant or not, a document is relevant or not; no notion of partial match

Vector model (simplified)

sim(d,q) = (d . q) / (|d| x |q|)

[Figure: Doc1 (1,1,0), Doc2 (0,1,0) and query Q (1,1,1) as vectors in the term space spanned by adrenergic, cloning and receptor.]

Vector model

- Introduce weights in document vectors (e.g. Doc3 (0, 0.5, 0))
- Weights represent the importance of a term for describing the document contents
- Weights are positive real numbers
- If a term does not occur, its weight is 0

Vector model

sim(d,q) = (d . q) / (|d| x |q|)

[Figure: Doc1 (1,1,0), Doc3 (0,0.5,0) and query Q4 (0.5,0.5,0.5) in the same term space.]

Vector model

How to define the weights? tf-idf

dj = (w1,j, …, wt,j)
wi,j = weight for term ki in document dj = fi,j x idfi

term frequency freqi,j: how often does term ki occur in document dj?
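To make freq_i,j concrete before it is normalized, raw term frequencies can be counted per document in a few lines; the two-document mini-corpus here is hypothetical, not from the slides.

```python
# Count raw term frequencies freq_ij: how often each term
# occurs in each document (hypothetical mini-corpus).
from collections import Counter

docs = {
    "Doc1": "adrenergic cloning cloning",
    "Doc2": "cloning receptor",
}
freq = {d: Counter(text.split()) for d, text in docs.items()}
```

Dividing each count by the largest count in the same document then gives the normalized term frequency.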
normalized term frequency: fi,j = freqi,j / maxl freql,j

Vector model

How to define the weights? tf-idf

document frequency: in how many documents does term ki occur?
- N = total number of documents
- ni = number of documents in which ki occurs
inverse document frequency idfi: log(N / ni)

Vector model

How to define the weights for the query? Recommendation:

q = (w1,q, …, wt,q)
wi,q = weight for term ki in q = (0.5 + 0.5 x fi,q) x idfi

Vector model

Advantages:
- term weighting improves retrieval performance
- partial matching
- ranking according to similarity
Disadvantage:
- assumption of mutually independent terms?

Probabilistic model

sim(dj,q) = P(R|dj) / P(Rc|dj)

- weights are binary (wi,j = 0 or wi,j = 1)
- R: the set of relevant documents for query q
- Rc: the set of non-relevant documents for q
- P(R|dj): probability that dj is relevant to q
- P(Rc|dj): probability that dj is not relevant to q

Probabilistic model

sim(dj,q) = P(R|dj) / P(Rc|dj)

Using Bayes' rule, independence of index terms, taking logarithms, and P(ki|R) + P(not ki|R) = 1:

SIM(dj,q) ~ SUM(i=1..t) wi,q x wi,j x ( log(P(ki|R) / (1 - P(ki|R))) + log((1 - P(ki|Rc)) / P(ki|Rc)) )

Probabilistic model

How to compute P(ki|R) and P(ki|Rc)?
- initially: P(ki|R) = 0.5 and P(ki|Rc) = ni / N
- retrieve documents and rank them
- repeat:
  V: subset of the documents (e.g. the r best ranked)
  Vi: subset of V whose elements contain ki
  P(ki|R) = |Vi| / |V| and P(ki|Rc) = (ni - |Vi|) / (N - |V|)

Probabilistic model

Advantages:
- ranking of documents with respect to their probability of being relevant
Disadvantages:
- initial guess about relevance
- all weights are binary
- independence assumption?

IR - measures

Precision = (number of found relevant documents) / (total number of found documents)
Recall = (number of found relevant documents) / (total number of relevant documents)

Literature

Baeza-Yates, R., Ribeiro-Neto, B., Modern Information Retrieval, Addison-Wesley, 1999.
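Putting the vector-model formulas together (document weights fi,j x idfi, query weights (0.5 + 0.5 x fi,q) x idfi, cosine similarity), the following is a self-contained sketch; the corpus, query and function names are illustrative choices of mine, not from the slides.

```python
# Vector model sketch: tf-idf weighting plus cosine ranking.
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: dict doc_id -> token list. Weights w_ij = f_ij * idf_i with
    f_ij = freq_ij / max_l freq_lj and idf_i = log(N / n_i)."""
    N = len(docs)
    df = Counter()                       # document frequency n_i per term
    for tokens in docs.values():
        df.update(set(tokens))
    idf = {t: math.log(N / n) for t, n in df.items()}
    vectors = {}
    for d, tokens in docs.items():
        freq = Counter(tokens)
        max_f = max(freq.values())
        vectors[d] = {t: (f / max_f) * idf[t] for t, f in freq.items()}
    return vectors, idf

def cosine(u, v):
    """Cosine similarity over sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "Doc1": ["adrenergic", "cloning", "cloning"],
    "Doc2": ["cloning", "receptor"],
    "Doc3": ["receptor", "receptor", "gene"],
}
vectors, idf = tf_idf_vectors(docs)

# Query weights use the slides' recommendation (0.5 + 0.5 * f_iq) * idf_i.
qf = Counter(["receptor"])
qvec = {t: (0.5 + 0.5 * f / max(qf.values())) * idf.get(t, 0.0)
        for t, f in qf.items()}
ranked = sorted(docs, key=lambda d: cosine(vectors[d], qvec), reverse=True)
```

For the query "receptor", Doc2 and Doc3 outrank Doc1, which does not contain the term at all; this is the partial matching the boolean model lacks.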