TEXT RETRIEVAL - part I

Text Retrieval
• Text retrieval is the task of returning relevant textual documents from a given collection, according to a user's information need as expressed in a query.
• The main differences from database retrieval concern:
  – Information
    • Unstructured text vs. structured data
    • Ambiguous vs. well-defined semantics
  – Query expression
    • Ambiguous vs. well-defined semantics
    • Incomplete vs. complete specification
  – Answers
    • Relevant documents vs. matched records
• Formally, the elements of the problem are:
  – Vocabulary V = {w1, w2, …, wN}, where the wi are the words of the language
  – Query q = q1,…,qm, where qi ∈ V
  – Document dk = dk1,…,dkm, where dki ∈ V
  – Collection C = {d1, …, dn}
  – Set of relevant documents R(q) ⊆ C
    • Generally unknown and user-dependent
    • The query is only a "hint" on which documents are in R(q)
• Based on this definition, the task is to compute R'(q), an approximation of R(q). This can be done according to two different strategies:
  – Document selection
    • R'(q) = {d∈C | f(d,q)=1}, where f(d,q) ∈ {0,1} is an indicator function that decides whether a document is relevant or not ("absolute relevance")
  – Document ranking
    • R'(q) = {d∈C | f(d,q)>θ}, where θ is a cutoff and f(d,q) ∈ ℝ is a relevance measure function that decides whether one document is more likely to be relevant than another ("relative relevance")

Document Selection vs. Ranking
[Figure: the true relevant set R(q) compared with the R'(q) produced by document selection (binary f(d,q) ∈ {0,1}) and by document ranking (documents ordered by decreasing score, e.g. 0.98, 0.95, 0.83, …, cut off at θ).]
• With document selection, the classifier is inaccurate:
  – "Over-constrained" query (terms are too specific) ⇒ no relevant documents found
  – "Under-constrained" query (terms are too general) ⇒ over-delivery
  – Even if the classifier were accurate, not all relevant documents are equally relevant; document selection is, however, easier to implement.
• Document ranking allows the user to control the boundary θ according to his/her preferences.
• Measures for the evaluation of retrieval sets are needed.

Evaluation of retrieval sets
• The two most frequent and basic measures for information retrieval are precision and recall. These are first defined for the simple case where the retrieval system returns a set of documents for a query:
  Recall = |RelRetrieved| / |Rel in Collection|
  Precision = |RelRetrieved| / |Retrieved|
[Figure: Venn diagram of all documents, the retrieved set and the relevant set.]
• The advantage of having two numbers is that one is often more important than the other. Web surfers would like every result on the first page to be relevant (high precision). In contrast, professional searchers are very concerned with high recall and will tolerate low precision.
[Figure: Venn diagrams illustrating very high precision with very low recall, high recall with low precision, and high precision with high recall.]

F-Measure
• The F-measure is a single measure that trades off precision versus recall: it is the weighted harmonic mean of precision and recall.
  F = 1 / ( α·(1/P) + (1−α)·(1/R) ) = (β² + 1)·P·R / (β²·P + R), with β² = (1−α)/α
  – α close to 1 (β < 1): precision is more important
  – α close to 0 (β > 1): recall is more important
  – Balanced case β = 1: F = 2PR / (P + R)

Evaluation of ranked retrieval results
• Precision and recall figures are appropriate for unranked retrieval sets. In a ranking context, the appropriate sets of retrieved documents are given by the top k retrieved documents.
• For each such set, precision and recall values can be plotted to give a precision-recall curve, as in the sketch below.
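The following Python sketch (not part of the original slides) computes precision, recall and F-measure for each top-k prefix of a ranked list; the ranking and the relevance judgments are toy data invented for the example.

def precision_recall_curve(ranking, relevant):
    """Return (recall, precision) pairs for each top-k prefix of the ranking."""
    points = []
    hits = 0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        precision = hits / k              # |RelRetrieved| / |Retrieved|
        recall = hits / len(relevant)     # |RelRetrieved| / |Rel in Collection|
        points.append((recall, precision))
    return points

def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta = 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# Toy ranked list and toy ground-truth relevant set R(q)
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"]
relevant = {"d1", "d2", "d3", "d5", "d9"}

for r, p in precision_recall_curve(ranking, relevant):
    print(f"recall={r:.2f}  precision={p:.2f}  F1={f_measure(p, r):.2f}")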
• Precision-recall curves have a distinctive saw-tooth shape:
  – if the (k+1)-th document retrieved is non-relevant, then recall is the same as for the top k documents, but precision drops;
  – if it is relevant, then both precision and recall increase, and the curve jags up and to the right.
[Figure: a precision-recall curve with its saw-tooth shape.]

Interpolated Average Precision
• It is often useful to remove these jiggles with an interpolated precision: the interpolated precision at a certain recall level r is defined as the highest precision found for any recall level r' ≥ r:
  p_int(r) = max_{r' ≥ r} p(r')
[Figure: the same precision-recall curve with the interpolated precision superimposed.]
• Precision should be measured at different levels of recall; averaging over many queries gives a single summary measure.

Different solutions (models)
• Boolean model
  – Based on the notion of sets
  – Documents are retrieved only if they satisfy the Boolean conditions specified in the query
  – Does not impose a ranking on retrieved documents
  – Exact match
• Vector space model
  – Based on geometry, the notion of vectors in a high-dimensional space
  – Documents are ranked based on their similarity to the query (ranked retrieval)
  – Best/partial match
• Language models
  – Based on the notion of probabilities and processes for generating text
  – Documents are ranked based on the probability that they generated the query
  – Best/partial match

Relevance models
[Figure: taxonomy of retrieval models. Relevance is modeled either as similarity ∆(Rep(q), Rep(d)) — e.g. the Boolean model and the vector space model (Salton et al., 75) — or as probability of relevance P(r=1|q,d), r ∈ {0,1}: via regression (Fox, 83); via generative models with document generation — classical probabilistic model (Robertson & Sparck Jones, 76), probabilistic distribution model (Wong & Yao, 89) — or query generation — language-model approach (Ponte & Croft, 98; Lafferty & Zhai, 01a); or via probabilistic inference — probabilistic concept space model (Wong & Yao, 95), inference network model (Turtle & Croft, 91).]

Vector Space Model: Relevance(d,q) = Similarity(d,q)
• Assumption: query and document are represented in the same way
• Retrieved documents are such that R'(q) = {d∈C | f(d,q)>θ}, where f(q,d) = ∆(Rep(q), Rep(d)), ∆ being a similarity measure and Rep a chosen representation for queries and documents
• Key issues are:
  – How to represent queries and documents
  – How to define the similarity measure ∆
• In the Vector Space Model, a document/query is represented by a term vector:
  – A term is a basic concept, e.g., a word or phrase
  – Each term defines one dimension
  – Elements of the vector correspond to term weights
  – E.g., d = (x1,…,xN), where xi is the "importance" of term i
• A collection of n documents with t distinct terms is represented by a (sparse) matrix:

        T1   T2   …   Tt
  D1    w11  w21  …   wt1
  D2    w12  w22  …   wt2
  :     :    :        :
  Dn    w1n  w2n  …   wtn

• A query can also be represented as a vector, just like a document.

Dimensions in document space
• Terms are assumed to be orthogonal, so that they form linearly independent basis vectors: they must be non-overlapping in meaning.
• Which terms?
  – Remove stop words (common function words)
  – Stemming (standardize morphological variants by stripping endings, etc.: eat, eating, … ⇒ eat)
  A minimal preprocessing sketch along these lines follows.
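The following Python sketch (not part of the original slides) illustrates this preprocessing step; the stop-word list and the suffix-stripping rules are deliberately tiny stand-ins for a real stop list and a real stemmer such as Porter's.

import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "were"}  # toy list

def crude_stem(word):
    """Strip a few common English endings (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats were eating the fish"))   # -> ['cat', 'eat', 'fish']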
[Figure: documents D1–D11 and a query plotted in a three-dimensional term space with axes "Starbucks", "Microsoft" and "Java".]

Term weighting
• Binary weighting: only the presence (1) or absence (0) of a term is recorded in the vector. Binary weighting is not effective.

  docs:  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11
  t1:    1  1  0  1  1  1  0  0  0  0   1
  t2:    0  0  1  0  1  1  1  1  0  1   0
  t3:    1  0  1  0  1  0  0  0  1  1   1

Empirical distribution of words
• There are stable, language-independent patterns in how people use natural languages: a few words occur very frequently; most occur rarely. E.g., in news articles:
  – Top 4 words: 10~15% of word occurrences
  – Top 50 words: 35~40% of word occurrences
• The most frequent word in one corpus may be rare in another.
• Zipf's law, popularized by the Harvard linguist George Kingsley Zipf, states that in a corpus of natural language utterances the frequency of any word is roughly inversely proportional to its rank in the frequency table. Thus the most frequent word occurs approximately twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word, and so on.
• Zipf's law: rank × frequency ≈ constant
  F(w) = C / r(w)^α, with α ≈ 1, C ≈ 0.1
[Figure: word frequency vs. rank. The highest-frequency words are stop words (the biggest data structure); the most useful words lie in the middle of the curve; the long tail implies that almost all words are rare.]
• Generalized Zipf's law, applicable in many domains:
  F(w) = C / [r(w) + B]^α

• TF (Term Frequency) weighting: accounts for the number of occurrences of t in d (salience of t in d). A term is more important if it occurs more frequently in a document.
• Different formulas exist for the computation of TF. Let f(t,d) be the frequency count of term t in document d:
  – Raw TF: TF(t,d) = f(t,d)
  – Log TF: TF(t,d) = 1 + ln(1 + ln f(t,d))

  docs:  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11
  t1:    2  1  0  3  1  3  0  0  0  0   4
  t2:    0  0  4  0  6  5  8  10 0  3   0
  t3:    3  0  7  0  3  0  0  0  1  5   1

• It is important that TF is normalized, because of:
  – Document length variation
  – Repeated occurrences being less informative than the first occurrence
• Generally, long documents should be penalized, but over-penalization should be avoided (pivoted normalization).
• Two views of document length:
  – A document is long because it uses more words
  – A document is long because it has more content
• Pivoted normalization uses the average document length to regularize normalization:
  1 − b + b · (doclen / avgdoclen), where b varies from 0 to 1
[Figure: normalized TF as a function of raw TF under pivoted normalization.]
• "Okapi/BM25 TF":
  TF(t,d) = k · f(t,d) / [ f(t,d) + k · (1 − b + b · doclen/avgdoclen) ]
  where k and b are parameters.
  – doclen = avgdoclen ⇒ normalizer = 1 ⇒ TF = TF_ref
  – doclen > avgdoclen ⇒ normalizer > 1 ⇒ TF < TF_ref
  – doclen < avgdoclen ⇒ normalizer < 1 ⇒ TF > TF_ref
  A small numeric sketch of this formula follows.
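The following Python sketch (not part of the original slides) implements the Okapi/BM25 TF formula above; the parameter values k = 1.2 and b = 0.75 are common illustrative choices, not values prescribed by the slides.

def bm25_tf(raw_tf, doclen, avgdoclen, k=1.2, b=0.75):
    """Length-normalized TF: k*f / (f + k*(1 - b + b*doclen/avgdoclen))."""
    norm = 1 - b + b * (doclen / avgdoclen)
    return k * raw_tf / (raw_tf + k * norm)

# The same raw count is worth less in a longer-than-average document:
print(bm25_tf(3, doclen=100, avgdoclen=100))   # doclen = avgdoclen -> reference TF
print(bm25_tf(3, doclen=300, avgdoclen=100))   # doclen > avgdoclen -> smaller TF
print(bm25_tf(3, doclen=50,  avgdoclen=100))   # doclen < avgdoclen -> larger TF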
• IDF (Inverse Document Frequency) weighting: accounts for the number of documents in which t appears (informativeness of t). A term is more discriminative if it occurs in fewer documents.
  IDF(t) = 1 + log(n/k)
  where n = total number of documents and k = number of documents containing term t (document frequency)
  – n = k ⇒ log(n/k) = 0 ⇒ IDF = 1
  – n > k ⇒ log(n/k) > 0 ⇒ IDF > 1
• IDF provides high values for rare words and low values for common words. For a collection of 10000 documents:
  – log(10000/10000) = 0
  – log(10000/5000) = 0.301
  – log(10000/20) = 2.698
  – log(10000/1) = 4
• TF-IDF weighting: a more effective weighting is obtained when weights combine the two basic heuristics TF and IDF:
  weight(t,d) = TF(t,d) × IDF(t)
  This implies that a term that is common in the document (high TF) and rare in the collection (high IDF) gets the highest TF-IDF weight.

Similarity measures in vector space
• With the vector space model, similarity has a geometric interpretation. Assumption: documents that are "close together" in space are similar in meaning.
• Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3
[Figure: D1, D2 and Q plotted in the three-dimensional space spanned by T1, T2, T3.]
• One measure of similarity between two vectors is the angle between them:
  – 0°: overlapping vectors ⇒ identical
  – 90°: orthogonal vectors ⇒ totally dissimilar
  – The smaller the angle, the more similar Q is to D
• The cosine of the angle varies monotonically from 0 to 1 as the angle varies from 90° to 0°.
  – For unit-length vectors, the cosine is the dot product:
    cos(x, y) = x · y = Σ_{i=1..n} x_i·y_i
• For non-normalized vectors, the following expression can be used for similarity, also called the normalized correlation coefficient:
  cos(x, y) = (x · y) / (‖x‖·‖y‖) = Σ_{i=1..n} x_i·y_i / ( √(Σ_{i=1..n} x_i²) · √(Σ_{i=1..n} y_i²) )
  where:
  – the dot product measures vector correlation
  – the denominator normalizes for length

Example: TF-IDF & Dot Product
  doc1 = "information retrieval search engine information"
  doc2 = "travel information map travel"
  doc3 = "government president congress …"
  query = "information retrieval"

          info  retrieval  travel  map  search  engine  govern  president  congress
  IDF     2.4   4.5        2.8     3.3  2.1     5.4     2.2     3.2        4.3
  doc1    4.8   4.5                     2.1     5.4
  doc2    2.4              5.6     3.3
  doc3                                                  2.2     3.2        4.3
  query   2.4   4.5

  Sim(q,doc1) = 4.8·2.4 + 4.5·4.5
  Sim(q,doc2) = 2.4·2.4
  Sim(q,doc3) = 0

Most commonly used similarity measures:
  – Simple matching (coordination level match):  |Q ∩ D|
  – Dice's coefficient:                          2|Q ∩ D| / (|Q| + |D|)
  – Jaccard's coefficient:                       |Q ∩ D| / |Q ∪ D|
  – Cosine coefficient:                          Q·D / (|Q|^½ · |D|^½)
  – Overlap coefficient:                         |Q ∩ D| / min(|Q|, |D|)

Criticisms
• Unwarranted orthogonality assumptions
• Reliance on terms:
  – Ambiguity: many terms have more than one meaning (affects precision)
  – Synonymy: many concepts can be expressed by more than one term (affects recall)
• Nevertheless, the vector space model is highly effective.

Vector Space Model extensions: from terms to concepts
• Latent semantic indexing
  – Dimensionality reduction (Singular Value Decomposition)
  – Projects vectors in the document-by-term space onto a lower-dimensionality document-by-concept space
  – Leverages term co-occurrence in documents to approximate "latent concepts" (see the sketch below)
• Blind relevance feedback
  – Adds terms from top-ranked documents to a new query (another way to leverage term co-occurrence)
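As a closing illustration, the following Python sketch (not part of the original slides) shows the SVD projection behind latent semantic indexing. The toy term-document matrix, the number of retained concepts (k = 2) and the fold-in convention (projecting both documents and query with U_k) are illustrative choices, not prescribed by the slides.

import numpy as np

# Term-document matrix A (rows = terms, columns = documents); toy counts.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],   # "information"
    [1.0, 1.0, 0.0, 0.0],   # "retrieval"
    [0.0, 0.0, 3.0, 1.0],   # "travel"
    [0.0, 0.0, 1.0, 2.0],   # "map"
])

k = 2                                         # number of latent concepts to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Documents in the k-dimensional concept space (one column per document):
# U_k^T A equals S_k @ Vt_k.
docs_k = S_k @ Vt_k

# Fold a query ("information retrieval") into the same concept space.
q = np.array([1.0, 1.0, 0.0, 0.0])
q_k = U_k.T @ q

def cosine(x, y):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

# Rank documents by cosine similarity in concept space.
scores = [cosine(q_k, docs_k[:, j]) for j in range(docs_k.shape[1])]
print(scores)   # documents 1 and 2 score highest for this query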