Indexing Implementation and Indexing Models
CSC 575: Intelligent Information Retrieval

[Overview figure: the IR pipeline. A collection is pre-processed (lexical analysis, stop-word removal) and parsed to build the index; an information need is parsed into a query; the query is run against the index to rank documents and produce result sets. The focus of this lecture: how is the index constructed?]

Indexing Implementation
- Inverted files are the primary data structure for text indexes.
  - Source file: the collection, organized by document.
  - Inverted file: the collection organized by term (one record per term, listing the locations where the term occurs).
- Query evaluation traverses the postings lists of the query terms:
  - OR: the union of the component lists.
  - AND: the intersection of the component lists.

The Vector-Space Model for IR
- Based on the view of documents as vectors in n-dimensional space, where n is the number of index terms used for indexing.
- Each document is a bag of words: a vector with a direction and a magnitude.

The Vector Space Model
- Vocabulary V: the set of terms left after pre-processing the text (tokenization, stop-word removal, stemming, ...).
- Each document or query is represented as an |V| = n dimensional vector:
  d_j = [w_1j, w_2j, ..., w_nj]
  where w_ij is the weight of term i in document j.
- The terms in V form the orthogonal dimensions of a vector space.
- Document = bag of words: the vector representation does not consider the ordering of words ("John is quicker than Mary" and "Mary is quicker than John" get identical vectors).

Document Vectors and Indexes
- Conceptually, the index can be viewed as a document-term matrix.
- Each document is represented as an n-dimensional vector (n = number of terms in the dictionary).
- Term weights give the scalar value of each dimension in a document.
- The inverted file structure is an "implementation model" used in practice to store the information captured in this conceptual representation.

[Matrix figure: a document-term matrix with document IDs A-I as columns and the dictionary terms nova, galaxy, heat, hollywood, film, role, diet, fur as rows; the cells hold normalized term weights (e.g., document A: nova = 1.0, galaxy = 0.5, heat = 0.3). Each column is a document vector.]

Example: Documents and Query in 3D Space
- Documents live in term space; terms are usually stems.
- Documents (and the query) are represented as vectors of terms.
- Query and document weights are based on the length and direction of their vectors.
- Why use this representation? A vector distance measure between the query and the documents can be used to rank retrieved documents.

Recall: Inverted Index Construction
- Invert documents into a big index: the vector file is "inverted" so that rows become columns and columns become rows.
- Basic idea:
  - List all the tokens in the collection.
  - For each token, list all the docs it occurs in (together with frequency information).

Document-term matrix (documents as rows):

        t1  t2  t3
  D1     1   0   1
  D2     1   0   0
  D3     0   1   1
  D4     1   0   0
  D5     1   1   1
  D6     1   1   0
  D7     0   1   0
  ...

Inverted view (terms as rows):

        D1  D2  D3  D4  D5  D6  D7  D8  D9  D10
  t1     1   1   0   1   1   1   0   0   0   0
  t2     0   0   1   0   1   1   1   1   0   1
  t3     1   0   1   0   1   0   0   0   1   1

- Sparse matrix representation: in practice this data is very sparse, and we do not need to store all the 0's. Hence the sorted array implementation described next.
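To make the inverted-file idea concrete, here is a minimal sketch (not from the slides; the function names are my own) of the basic scheme: record, for each token, the docs it occurs in with frequencies, and answer an AND query by intersecting the component lists. The naive whitespace tokenizer stands in for real lexical analysis, stop-word removal, and stemming.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict doc_id -> text. Returns term -> sorted [(doc_id, freq), ...]."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():          # naive tokenizer
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

def and_query(index, term1, term2):
    """AND = intersection of the two component postings lists."""
    docs1 = {doc_id for doc_id, _ in index.get(term1, [])}
    docs2 = {doc_id for doc_id, _ in index.get(term2, [])}
    return sorted(docs1 & docs2)

docs = {1: "now is the time for all good men to come to the aid of their country",
        2: "it was a dark and stormy night in the country manor"}
index = build_inverted_index(docs)
print(index["country"])                      # [(1, 1), (2, 1)]
print(and_query(index, "country", "manor"))  # [2]
```

The two toy documents mirror the Doc 1 / Doc 2 example worked through on the following slides.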
How Inverted Files Are Created: Sorted Array Implementation
- Documents are parsed to extract tokens, and each token is saved with its document ID.

  Doc 1: Now is the time for all good men to come to the aid of their country.
  Doc 2: It was a dark and stormy night in the country manor. The time was past midnight.

  (term, doc) pairs in parsing order:
  Doc 1: (now,1) (is,1) (the,1) (time,1) (for,1) (all,1) (good,1) (men,1) (to,1) (come,1) (to,1) (the,1) (aid,1) (of,1) (their,1) (country,1)
  Doc 2: (it,2) (was,2) (a,2) (dark,2) (and,2) (stormy,2) (night,2) (in,2) (the,2) (country,2) (manor,2) (the,2) (time,2) (was,2) (past,2) (midnight,2)

How Inverted Files Are Created (continued)
- After all documents have been parsed, the inverted file is sorted by term, with duplicates retained for within-document frequency statistics:

  (a,2) (aid,1) (all,1) (and,2) (come,1) (country,1) (country,2) (dark,2) (for,1) (good,1) (in,2) (is,1) (it,2) (manor,2) (men,1) (midnight,2) (night,2) (now,1) (of,1) (past,2) (stormy,2) (the,1) (the,1) (the,2) (the,2) (their,1) (time,1) (time,2) (to,1) (to,1) (was,2) (was,2)

- If frequency information is not needed, the inverted file can instead be sorted with duplicates removed.

How Inverted Files Are Created (continued)
- Multiple entries for the same term in a single document are merged, and within-document term frequencies are compiled, giving (term, doc, freq) triples:

  (a,2,1) (aid,1,1) (all,1,1) (and,2,1) (come,1,1) (country,1,1) (country,2,1) (dark,2,1) (for,1,1) (good,1,1) (in,2,1) (is,1,1) (it,2,1) (manor,2,1) (men,1,1) (midnight,2,1) (night,2,1) (now,1,1) (of,1,1) (past,2,1) (stormy,2,1) (the,1,2) (the,2,2) (their,1,1) (time,1,1) (time,2,1) (to,1,2) (was,2,2)

- If proximity operators are needed, the location of each occurrence of the term must also be stored.
- Terms are usually represented by unique integers to fix and minimize storage space.

How Inverted Files Are Created (continued)
- The file can then be split into a dictionary and a postings file:

  Dictionary: term, DF (no. of docs), total freq    Postings: doc:freq
  a          1  1                                   2:1
  aid        1  1                                   1:1
  all        1  1                                   1:1
  and        1  1                                   2:1
  come       1  1                                   1:1
  country    2  2                                   1:1, 2:1
  dark       1  1                                   2:1
  for        1  1                                   1:1
  good       1  1                                   1:1
  in         1  1                                   2:1
  is         1  1                                   1:1
  it         1  1                                   2:1
  manor      1  1                                   2:1
  men        1  1                                   1:1
  midnight   1  1                                   2:1
  night      1  1                                   2:1
  now        1  1                                   1:1
  of         1  1                                   1:1
  past       1  1                                   2:1
  stormy     1  1                                   2:1
  the        2  4                                   1:2, 2:2
  their      1  1                                   1:1
  time       2  2                                   1:1, 2:1
  to         1  2                                   1:2
  was        1  2                                   2:2

Inverted Indexes and Queries
- Inverted indexes permit fast search for individual terms.
- For each term, you get a hit list consisting of:
  - document ID
  - frequency of the term in the doc
  - position of the term in the doc (optional)
- These lists can be used to solve Boolean queries quickly:
  - country ==> {d1, d2}
  - manor ==> {d2}
  - country AND manor ==> {d2}
- Full advantage of this structure can be taken by statistical ranking algorithms such as the vector space model; for Boolean queries, term and document frequency information is not used (just set operations performed on hit lists).
- We will look at the vector model later; for now let's examine Boolean queries more closely.

Scalability Issues: Number of Postings
- An example: the Reuters RCV1 collection.
  - Number of docs: m = 800,000
  - Average tokens per doc: 200
  - Number of distinct terms: n = 400,000
  - About 100 million (non-positional) postings in the inverted index.

Bottleneck
- Parse and build postings entries one doc at a time; then sort postings entries by term (and by doc within each term).
- Doing this with random disk seeks would be too slow: we must sort N = 100M records.
- If every comparison took 2 disk seeks (10 milliseconds each), and N items could be sorted with N log2 N comparisons, how long would this take?
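A back-of-the-envelope answer: log2(10^8) is about 27, so the sort needs roughly 10^8 × 27 ≈ 2.7 × 10^9 comparisons; at 20 ms per comparison (2 seeks × 10 ms), that is about 5.4 × 10^7 seconds, i.e., more than a year and a half. Random disk seeks clearly cannot drive the sort, which motivates the block-based approach below.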
Sorting with Fewer Disk Seeks: Blocked Sort-Based Indexing (BSBI)
- Use 12-byte (4+4+4) records: (term, doc, freq). These are generated as we parse docs.
- We must now sort 100M such 12-byte records by term.
- Define a block of, e.g., ~10M records.
- Sort within blocks first and write each sorted block to disk; then merge the blocks into one long sorted order.

Problem with the Sort-Based Algorithm (Sec. 4.3)
- Assumption: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings...
- ...but then the intermediate files become very large. (We would end up with a scalable but very slow index construction method.)

SPIMI: Single-Pass In-Memory Indexing (Sec. 4.3)
- Key idea 1: generate separate dictionaries for each block; there is no need to maintain a term-termID mapping across blocks.
- Key idea 2: don't sort; accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.

Distributed Indexing
- For web-scale indexing we must use a distributed computing cluster.
- Individual machines are fault-prone: they can unpredictably slow down or fail.
- How do we exploit such a pool of machines?
  - Maintain a master machine directing the indexing job (considered "safe").
  - Break up indexing into sets of (parallel) tasks.
  - The master machine assigns each task to an idle machine from a pool.

Parallel Tasks
- Use two sets of parallel tasks: parsers and inverters.
- Break the input document corpus into splits; each split is a subset of documents (e.g., corresponding to blocks in BSBI).
- The master assigns a split to an idle parser machine.
- A parser reads one document at a time, emits (term, doc) pairs, and writes the pairs into j partitions; each partition covers a range of terms' first letters (e.g., a-f, g-p, q-z, so here j = 3).
- An inverter collects all (term, doc) pairs for one partition, sorts them, and writes the postings lists.

Data Flow (Sec. 4.4)
[Data-flow figure: the master assigns splits to parsers (map phase); each parser writes segment files partitioned by term range (a-f, g-p, q-z); one inverter per term range reads those segment files and writes the postings (reduce phase).]

Dynamic Indexing
- Problem: docs come in over time.
  - Postings updates for terms already in the dictionary.
  - New terms added to the dictionary.
  - Docs get deleted.
- Simplest approach:
  - Maintain a "big" main index; new docs go into a "small" auxiliary index.
  - Search across both and merge the results.
  - Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs in any search result through it.
  - Periodically, re-index into one main index.
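A toy sketch of this simplest dynamic-indexing approach (the class and method names are invented for illustration; sets stand in for on-disk postings and the invalidation bit-vector):

```python
class DynamicIndex:
    """Main + auxiliary index with an invalidation set for deletions."""

    def __init__(self):
        self.main = {}        # term -> set of doc_ids (the "big" index)
        self.aux = {}         # term -> set of doc_ids (recent docs)
        self.deleted = set()  # invalidation "bit-vector", here a set

    def add(self, doc_id, text):
        """New docs go into the small auxiliary index."""
        for term in text.lower().split():
            self.aux.setdefault(term, set()).add(doc_id)

    def delete(self, doc_id):
        """Deletion just marks the doc as invalid."""
        self.deleted.add(doc_id)

    def search(self, term):
        """Search both indexes, merge, and filter by the invalidation set."""
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return hits - self.deleted

    def re_index(self):
        """Periodically fold aux into main and purge deleted docs."""
        for term, docs in self.aux.items():
            self.main.setdefault(term, set()).update(docs)
        self.aux.clear()
        for term in list(self.main):
            self.main[term] -= self.deleted
            if not self.main[term]:
                del self.main[term]
        self.deleted.clear()

idx = DynamicIndex()
idx.add(1, "dark stormy night")
idx.add(2, "stormy weather")
idx.delete(1)
print(idx.search("stormy"))   # {2}
```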
Index on Disk vs. Memory
- Most retrieval systems keep the dictionary in memory and the postings on disk.
- Web search engines frequently keep both in memory.
  - Massive memory requirement.
  - Feasible for large web service installations; less so for commercial installations where query loads are lighter.

Retrieval From Indexes
- Given the large indexes in IR applications, searching for keys in the dictionary becomes a dominant cost.
- Two main choices of dictionary data structure: hashtables or trees.
  - Hashing requires deriving a hash function that maps terms to locations, and may require collision detection and resolution for non-unique hash values.
  - Trees: binary search trees have nice properties and are easy to implement, and enhancements such as B+ trees can improve search effectiveness; but they require storing keys in each internal node.

Hashtables (Sec. 3.1)
- Each vocabulary term is hashed to an integer. (We assume you've seen hashtables before.)
- Pros: lookup is faster than for a tree: O(1).
- Cons:
  - No easy way to find minor variants: judgment/judgement.
  - No prefix search [tolerant retrieval].
  - If the vocabulary keeps growing, we occasionally need the expensive operation of rehashing everything.

Trees (Sec. 3.1)
- Simplest: binary tree. More usual: B-trees.
- Trees require a standard ordering of characters and hence of strings... but we typically have one.
- Pros: solves the prefix problem (e.g., all terms starting with hyp).
- Cons:
  - Slower: O(log M), and this requires a balanced tree.
  - Rebalancing binary trees is expensive, but B-trees mitigate the rebalancing problem.

Tree: Binary Tree (Sec. 3.1)
[Figure: a binary search tree over the dictionary. The root splits the key space into a-m and n-z; a-m splits into a-hu and hy-m, and n-z splits into n-sh and si-z.]

Tree: B-Tree (Sec. 3.1)
[Figure: a B-tree whose root routes keys into the ranges a-hu, hy-m, and n-z.]
- Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].

Recall: Steps in Basic Automatic Indexing
- Parse documents to recognize structure.
- Scan for word tokens.
- Remove stopwords.
- Stem words.
- Weight words.

Indexing Models (aka "Term Weighting")
- Basic issue: which terms should be used to index a document, and how much should each count?
- Some approaches:
  - Binary weights: terms either appear or they don't; no frequency information is used.
  - Term frequency: either raw term counts or (more often) counts normalized by the total number of term occurrences in the document.
  - TF.IDF (inverse document frequency model).
  - Term discrimination model.
  - Signal-to-noise ratio (based on information theory).
  - Probabilistic term weights.

Binary Weights
- Only the presence (1) or absence (0) of a term is recorded in the vector.
- This representation can be particularly useful, since the documents (and the query) can be viewed as simple bit strings; query operations can then be performed using logical bit operations.

        t1  t2  t3
  D1     1   0   1
  D2     1   0   0
  D3     0   1   1
  D4     1   0   0
  D5     1   1   1
  D6     1   1   0
  D7     0   1   0
  D8     0   1   0
  D9     0   0   1
  D10    0   1   1
  D11    1   0   1
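A small sketch of the bit-string view (the three-term vocabulary and its term-to-bit mapping are invented for illustration; Python ints serve as arbitrary-length bit vectors):

```python
# Each document is a bit string over the vocabulary: term i sets bit i.
vocab = {"t1": 0, "t2": 1, "t3": 2}

def to_bits(terms):
    bits = 0
    for t in terms:
        bits |= 1 << vocab[t]
    return bits

d5 = to_bits(["t1", "t2", "t3"])   # D5 above: (1, 1, 1)
d2 = to_bits(["t1"])               # D2 above: (1, 0, 0)
q  = to_bits(["t1", "t2", "t3"])   # a query containing all three terms

# |Q intersect D| via bitwise AND plus a popcount
print(bin(q & d5).count("1"))      # 3
print(bin(q & d2).count("1"))      # 1
```

These popcounts are exactly the ranks computed on the next slide.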
Binary Weights: Matching of Documents and Queries
- With binary weights, matching between documents and queries can be seen as the size of the intersection of the two term sets: |Q ∩ D|.
- This in turn can be used to rank the relevance of documents to a query.

        t1  t2  t3   Rank = Q·Di
  D1     1   0   1   2
  D2     1   0   0   1
  D3     0   1   1   2
  D4     1   0   0   1
  D5     1   1   1   3
  D6     1   1   0   2
  D7     0   1   0   1
  D8     0   1   0   1
  D9     0   0   1   1
  D10    0   1   1   2
  D11    1   0   1   2
  Q      1   1   1

[Figure: a Venn diagram over t1, t2, t3, placing each document in the region for the terms it contains.]

Beyond Binary Weights
- More generally, the similarity between the query and a document can be seen as the dot product of the two vectors, Q·D (this is also called simple matching).
- Note that if both Q and D are binary, this is the same as |Q ∩ D|.
- Given two vectors X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), simple matching measures their similarity as the dot product:

  sim(X, Y) = X·Y = Σ_i x_i y_i

        t1  t2  t3   Rank = Q·Di
  D1     1   0   1   4
  D2     1   0   0   1
  D3     0   1   1   5
  D4     1   0   0   1
  D5     1   1   1   6
  D6     1   1   0   3
  D7     0   1   0   2
  D8     0   1   0   2
  D9     0   0   1   3
  D10    0   1   1   5
  D11    1   0   1   4
  Q      1   2   3

Raw Term Weights
- The frequency of occurrence of each term in a document is recorded in the vector.
- The notion of simple matching (dot product) now incorporates the term weights from both the query and the documents.
- Using raw term weights provides the ability to better distinguish among retrieved documents.
- Note: although "term frequency" is commonly used to mean the raw occurrence count, technically it implies that the raw count is divided by the document length (the total number of term occurrences in the document).

        t1  t2  t3   RSV = Q·Di
  D1     2   0   3   11
  D2     1   0   0    1
  D3     0   4   7   29
  D4     3   0   0    3
  D5     1   6   3   22
  D6     3   5   0   13
  D7     0   8   0   16
  D8     0  10   0   20
  D9     0   0   1    3
  D10    0   3   5   21
  D11    4   0   1    7
  Q      1   2   3

Term Weights: TF
- More frequent terms in a document are more important, i.e., more indicative of its topic.
- Let f_ij be the frequency of term i in document j.
- We may want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:

  tf_ij = f_ij / max_i{f_ij}

- Or use sublinear tf scaling:

  tf_ij = 1 + log f_ij

Normalized Similarity Measures
- With or without normalized weights, normalization can be incorporated directly into the similarity measure.
- Example (vector space model): in simple matching, the dot product of two vectors measures their similarity; normalization is achieved by dividing the dot product by the product of the norms of the two vectors.
- Given a vector X = (x1, ..., xn), the norm of X is:

  ||X|| = sqrt(Σ_i x_i^2)

- The similarity of vectors X and Y is then:

  sim(X, Y) = (X·Y) / (||X|| ||Y||) = Σ_i x_i y_i / (sqrt(Σ_i x_i^2) × sqrt(Σ_i y_i^2))

- Note: this measures the cosine of the angle between the two vectors; it is thus called the normalized cosine similarity measure.

Normalized Similarity Measures: Example
- The same documents and query as before, scored with the plain dot product (RSV) and with normalized cosine similarity:

        t1  t2  t3   RSV = Q·Di   SIM(Q, Di)
  D1     2   0   3   11           0.82
  D2     1   0   0    1           0.27
  D3     0   4   7   29           0.96
  D4     3   0   0    3           0.27
  D5     1   6   3   22           0.87
  D6     3   5   0   13           0.60
  D7     0   8   0   16           0.53
  D8     0  10   0   20           0.53
  D9     0   0   1    3           0.80
  D10    0   3   5   21           0.96
  D11    4   0   1    7           0.45
  Q      1   2   3

- Note that the relative ranking among the documents has changed!
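A minimal cosine-similarity sketch (the function name is my own); with the query and D1 from the table above it reproduces SIM(Q, D1) ≈ 0.82:

```python
import math

def cosine_sim(x, y):
    """Cosine of the angle between x and y: dot product over the norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny) if nx and ny else 0.0

q  = [1, 2, 3]   # the query from the example above
d1 = [2, 0, 3]   # document D1
print(round(cosine_sim(q, d1), 2))   # 0.82, matching SIM(Q, D1) in the table
```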
tf x idf Weighting
- The tf x idf measure combines:
  - term frequency (tf), and
  - inverse document frequency (idf): a way to deal with the problems of the Zipf distribution.
- Recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.
- Goal: assign a tf x idf weight to each term in each document.

tf x idf

  w_ik = tf_ik × log(N / n_k)

where:
  T_k   = term k in document D_i
  tf_ik = frequency of term T_k in document D_i
  idf_k = inverse document frequency of term T_k in collection C = log(N / n_k)
  N     = total number of documents in the collection C
  n_k   = the number of documents in C that contain T_k

Inverse Document Frequency
- IDF provides high values for rare words and low values for common words. With N = 10,000 documents (base-10 logs here):

  log(10000 / 10000) = 0
  log(10000 / 5000)  = 0.301
  log(10000 / 20)    = 2.698
  log(10000 / 1)     = 4

tf x idf Normalization
- Normalize the term weights so that longer documents are not unfairly given more weight.
- "Normalize" usually means forcing all values to fall within a certain range, usually between 0 and 1 inclusive.
- This is more ad hoc than normalization based on vector norms, but the basic idea is the same.

tf x idf Example
- The initial term x document matrix (inverted index), with N = 6 documents:

        Doc1  Doc2  Doc3  Doc4  Doc5  Doc6    df   idf = log2(N/df)
  T1     0     2     4     0     1     0      3    1.00
  T2     1     3     0     0     0     2      3    1.00
  T3     0     1     0     2     0     0      2    1.58
  T4     3     0     1     5     4     0      4    0.58
  T5     0     4     0     0     0     1      2    1.58
  T6     2     7     2     1     3     0      5    0.26
  T7     1     0     0     5     5     1      4    0.58
  T8     0     1     1     0     0     3      3    1.00

- The resulting tf x idf term x document matrix (documents represented as vectors of words):

        Doc1  Doc2  Doc3  Doc4  Doc5  Doc6
  T1    0.00  2.00  4.00  0.00  1.00  0.00
  T2    1.00  3.00  0.00  0.00  0.00  2.00
  T3    0.00  1.58  0.00  3.17  0.00  0.00
  T4    1.75  0.00  0.58  2.92  2.34  0.00
  T5    0.00  6.34  0.00  0.00  0.00  1.58
  T6    0.53  1.84  0.53  0.26  0.79  0.00
  T7    0.58  0.00  0.00  2.92  2.92  0.58
  T8    0.00  1.00  1.00  0.00  0.00  3.00

Alternative TF.IDF Weighting Schemes
- Many search engines allow different weightings for queries vs. documents.
- A very standard weighting scheme is:
  - Document: logarithmic tf, no idf, and cosine normalization.
  - Query: logarithmic tf, idf, no normalization.
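The same computation as a short sketch, using the count matrix from the example above (here the rows are documents and the columns are T1-T8); it reproduces the df row and Doc 1's tf x idf weights:

```python
import math

# Term counts from the worked example above (rows = Doc1..Doc6, cols = T1..T8)
counts = [
    [0, 1, 0, 3, 0, 2, 1, 0],
    [2, 3, 1, 0, 4, 7, 0, 1],
    [4, 0, 0, 1, 0, 2, 0, 1],
    [0, 0, 2, 5, 0, 1, 5, 0],
    [1, 0, 0, 4, 0, 3, 5, 0],
    [0, 2, 0, 0, 1, 0, 1, 3],
]

N = len(counts)
n_terms = len(counts[0])

# df_k = number of documents containing term k; idf_k = log2(N / df_k)
df = [sum(1 for doc in counts if doc[k] > 0) for k in range(n_terms)]
idf = [math.log2(N / d) for d in df]

# w_ik = tf_ik * idf_k, with raw counts as tf (as in the slides)
weights = [[round(doc[k] * idf[k], 2) for k in range(n_terms)] for doc in counts]

print(df)          # [3, 3, 2, 4, 2, 5, 4, 3]
print(weights[0])  # [0.0, 1.0, 0.0, 1.75, 0.0, 0.53, 0.58, 0.0]
```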
Keyword Discrimination Model
- The vector representation of documents can be used as the source of another approach to term weighting.
- Question: what happens if we remove one of the words used as dimensions in the vector space?
  - If the average similarity among documents changes significantly, the word was a good discriminator.
  - If there is little change, the word is not as helpful and should be weighted less.
- Note that the goal is to have a representation that makes it easier for queries to discriminate among documents.
- The average similarity can be measured after removing each word from the matrix; any of the similarity measures can be used (we will look at a variety of other similarity measures later).

Keyword Discrimination
- Measuring average similarity (assume there are N documents), where sim(D_i, D_j) is the similarity score for the pair of documents D_i and D_j:

  SIM = (1 / N^2) × Σ_{i,j} sim(D_i, D_j)
  SIM_k = SIM recomputed with term k removed

- Computing all pairwise similarities is expensive. A better way to calculate AVG-SIM: first compute the centroid D* (the average document vector = sum of the vectors / N); then

  SIM = (1 / N) × Σ_i sim(D_i, D*)

Keyword Discrimination (continued)
- Discrimination value (discriminant) and term weights:

  disc_k = SIM_k − SIM

  disc_k > 0  ==>  term k is a good discriminator
  disc_k < 0  ==>  term k is a poor discriminator
  disc_k = 0  ==>  term k is indifferent

- Computing term weights: the new weight of term k in document i is the original term frequency of k in i times the discrimination value:

  w_ik = tf_ik × disc_k

Keyword Discrimination: Example
- Using normalized cosine similarity against the centroid D*:

        t1     t2    t3      sim(Di, D*)
  D1    10     1      0      0.641
  D2     9     2     10      0.998
  D3     8     1      1      0.731
  D4     8     1     50      0.859
  D5    19     2     15      0.978
  D6     9     2      0      0.640
  D*   10.50  1.50  12.67    AVG-SIM = 0.808

- Recomputing with each term removed in turn (note: D* for each SIM_k is now computed with only the two remaining terms):

        t1 removed   t2 removed   t3 removed
  D1    0.118        0.638        0.999
  D2    0.997        0.999        0.997
  D3    0.785        0.729        1.000
  D4    0.995        0.861        1.000
  D5    1.000        0.978        0.999
  D6    0.118        0.638        0.997
  SIM_k 0.669        0.807        0.999

Keyword Discrimination: Example (continued)
- This shows that t1 tends to be a poor discriminator, while t3 is a good discriminator:

  disc_1 = 0.669 − 0.808 = −0.139
  disc_2 = 0.807 − 0.808 = −0.001
  disc_3 = 0.999 − 0.808 = +0.191

- The new term weights w_ik = tf_ik × disc_k now reflect the discrimination value of these terms (further normalization can be done to make all term weights positive):

        t1      t2      t3
  D1   -1.392  -0.001   0.000
  D2   -1.253  -0.001   1.908
  D3   -1.114  -0.001   0.191
  D4   -1.114  -0.001   9.538
  D5   -2.645  -0.001   2.861
  D6   -1.253  -0.001   0.000
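A sketch of the centroid-based computation, using the six documents from the example above (function names are my own); it reproduces disc_1 ≈ −0.139, disc_2 ≈ −0.001, and disc_3 ≈ +0.191:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def avg_sim(docs, drop=None):
    """Average cosine similarity of the docs to their centroid.

    docs: list of term-frequency vectors. drop: index of a term to remove,
    simulating its deletion from the index (the centroid is recomputed over
    the remaining dimensions, as in the example above)."""
    if drop is not None:
        docs = [v[:drop] + v[drop + 1:] for v in docs]
    n = len(docs)
    centroid = [sum(col) / n for col in zip(*docs)]
    return sum(cosine(d, centroid) for d in docs) / n

docs = [[10, 1, 0], [9, 2, 10], [8, 1, 1], [8, 1, 50], [19, 2, 15], [9, 2, 0]]
base = avg_sim(docs)                       # AVG-SIM ~ 0.808
for k in range(3):
    print(f"disc_{k + 1} = {avg_sim(docs, drop=k) - base:+.3f}")
# disc_1 = -0.139, disc_2 = -0.001, disc_3 = +0.191
```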
Signal-To-Noise Ratio
- Based on Shannon's work in the 1940s on information theory.
  - Shannon developed a model of communication of messages across a noisy channel.
  - The goal is to devise an encoding of messages that is most robust in the face of channel noise.
- In IR, messages describe the content of documents.
- The amount of information a word carries about a document is inversely proportional to its probability of occurrence.
- The least informative words are those that occur approximately uniformly across the corpus of documents: a word that occurs with similar frequency in many documents (e.g., "the", "and") is less informative than one that occurs with high frequency in one or two documents.
- Shannon used entropy (a logarithmic measure) to measure average information gain, with noise defined as its inverse.

Signal-To-Noise Ratio: Definitions
- Let p_k = Prob(term k occurs in document i) = tf_ik / tf_k, where tf_k is the total frequency of term k in the collection. (Logs are always base 2 here.)

  Info_k  = −p_k log2 p_k
  Noise_k = −p_k log2 (1 / p_k)

- Note: NOISE is the negation of AVG-INFO, so only one of the two needs to be computed in practice.

  SIGNAL_k = log2(tf_k) + NOISE_k = log2(tf_k) − AVG-INFO_k

  w_ik = tf_ik × SIGNAL_k    (the weight of term k in document i)

Signal-To-Noise Ratio: Example
- Computing p_k = tf_ik / tf_k and the per-document information values (note: by definition, if term k does not appear in a document, we take Info(k) = 0 for that doc):

        t1   t2   t3     p(t1)  p(t2)  p(t3)    Info(t1) Info(t2) Info(t3)
  D1    10    1    1     0.159  0.111  0.013    0.421    0.352    0.081
  D2     9    2   10     0.143  0.222  0.128    0.401    0.482    0.380
  D3     8    1    1     0.127  0.111  0.013    0.378    0.352    0.081
  D4     8    1   50     0.127  0.111  0.641    0.378    0.352    0.411
  D5    19    2   15     0.302  0.222  0.192    0.522    0.482    0.457
  D6     9    2    1     0.143  0.222  0.013    0.401    0.482    0.081
  tf_k  63    9   78            AVG-INFO:       2.501    2.503    1.490

- AVG-INFO is the "entropy" of the term in the collection.

Signal-To-Noise Ratio: Example (continued)

  Term   AVG-INFO   NOISE     SIGNAL = log2(tf_k) + NOISE
  t1     2.501      −2.501    3.476
  t2     2.503      −2.503    0.667
  t3     1.490      −1.490    4.795

- Term weights w_ik = tf_ik × SIGNAL_k:

        w(t1)    w(t2)    w(t3)
  D1    34.760   0.667     4.795
  D2    31.284   1.333    47.951
  D3    27.808   0.667     4.795
  D4    27.808   0.667   239.753
  D5    66.044   1.333    71.926
  D6    31.284   1.333     4.795

- Additional normalization can be performed to bring the values into the range [0, 1].

Probabilistic Term Weights
- The probabilistic model makes explicit distinctions between occurrences of terms in relevant and non-relevant documents.
- Suppose we know:
  p_i = probability that term x_i appears in a relevant document
  q_i = probability that term x_i appears in a non-relevant document
- With binary weights and an independence assumption, the weight of term x_i in document D_k is:

  w_ik = log [ p_i (1 − q_i) / ( q_i (1 − p_i) ) ]

- Estimating p_i and q_i requires relevance information: e.g., using test queries and test collections to "train" the values of p_i and q_i, or other AI/machine-learning techniques.

Phrase Indexing and Phrase Queries
- Both statistical and syntactic methods have been used to identify "good" phrases.
  - Statistical example: expected mutual information to find collocations.
  - Linguistic approaches: using a part-of-speech tagger to identify simple noun phrases.
- Phrases can have an impact on both effectiveness and efficiency:
  - Phrase indexing speeds up phrase queries.
  - Phrases improve precision by disambiguating word senses: e.g., "grass field" vs. "magnetic field".
  - The effect on effectiveness is not straightforward and depends on the retrieval model: e.g., for "information retrieval", how much do the individual words count?
- For phrase queries, it no longer suffices to store only <term : docs> entries.

Phrase Detection and Weighting
- Typical approach:
  - Compute pairwise co-occurrence for high-frequency words.
  - If the co-occurrence value is less than some threshold a, do not consider the pair any further.
  - For qualifying pairs of terms (t_i, t_j), compute a cohesion value:

    cohesion(t_i, t_j) = s × freq(t_i, t_j) / (totfreq(t_i) × totfreq(t_j))    (Salton and McGill, 1983)

    where s is a size factor determined by the size of the vocabulary; or

    cohesion(t_i, t_j) = freq(t_i, t_j) / (freq(t_i) × freq(t_j))    (Rada, 1986)

- But indexing all pairwise (or longer) frequent co-occurrences can be computationally very expensive.

Better Solution: Positional Indexes (Sec. 2.4.2)
- In the postings, store for each term the position(s) at which its tokens appear:

  <term, number of docs containing term;
   doc1: position1, position2, ...;
   doc2: position1, position2, ...;
   etc.>
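A sketch of the positional merge for phrase queries (a simplified in-memory index; the helper names are invented, and every query term is assumed to be in the index). Each term maps to doc -> sorted positions; we intersect the doc sets, then chain adjacency checks term by term. The toy postings below are the "to"/"be" lists from the example that follows:

```python
def next_positions(prev, cur):
    """Positions in cur that directly follow (by one) a position in prev."""
    prev_set = set(prev)
    return [p for p in cur if p - 1 in prev_set]

def phrase_query(index, terms):
    """Docs containing the terms as an exact phrase, via a positional merge."""
    docs = set(index[terms[0]])
    for t in terms[1:]:
        docs &= set(index[t])          # document-level intersection first
    hits = []
    for d in sorted(docs):
        positions = index[terms[0]][d]
        for t in terms[1:]:            # then position-level adjacency checks
            positions = next_positions(positions, index[t][d])
        if positions:
            hits.append(d)
    return hits

index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_query(index, ["to", "be"]))   # [4]
```

A proximity operator like "/k" generalizes the adjacency check to a window of k positions instead of exactly one.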
Positional Index Example (Sec. 2.4.2)

  <be: 993427;
   1: 7, 18, 33, 72, 86, 231;
   2: 3, 149;
   4: 17, 191, 291, 430, 434;
   5: 363, 367, ...>

- Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
- For phrase queries, we can use a merge algorithm recursively at the document level.

Processing a Phrase Query (Sec. 2.4.2)
- Extract the inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions of "to be or not to be":

  to: 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
  be: 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...

- The same general method works for proximity searches.
- Westlaw example: LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  - /k means "within k words of".
  - Positional indexes can be used for such queries; phrase indexes cannot.

Positional Index Size (Sec. 2.4.2)
- A positional index expands postings storage substantially, even though indexes can be compressed.
- Nevertheless, a positional index is now standard because of the power and usefulness of phrase and proximity queries.
- We need an entry for each occurrence, not just one per document, so index size depends on average document size and the average frequency of each term.
  - The average web page has fewer than 1,000 terms.
  - SEC filings, books, and even some epic poems easily reach 100,000 terms.
- Rule of thumb:
  - A positional index is 2-4 times as large as a non-positional index.
  - Positional index size is 35-50% of the volume of the original text.

Concept Indexing
- More complex indexing can include concept or thesaurus classes.
- One approach is to use a controlled vocabulary (or subject codes) and map specific terms to "concept classes".
- Automatic concept generation can use classification or clustering to determine the concept classes.
- Automatic concept indexing:
  - Words, phrases, synonyms, and linguistic relations can all serve as evidence for inferring the presence of a concept; e.g., the concept "automobile" can be inferred from the presence of the words "vehicle", "transportation", "driving", etc.
  - One approach is to represent each word as a "concept vector": each dimension is a weight for a concept associated with the term, and phrases or index items can be represented as weighted averages of the concept vectors of their terms.
  - Another approach: Latent Semantic Indexing (LSI).

Next: Retrieval Models and Ranking Algorithms
- Boolean matching and Boolean queries
- Vector space model and similarity ranking
- Extended Boolean models
- Basic probabilistic models
- Implementation issues for ranking systems