Information Retrieval: Indexing

Acknowledgements: Dr Mounia Lalmas (QMW), Dr Joemon Jose (Glasgow)

Roadmap
What is a document?
Representing the content of documents
– Luhn's analysis
– Generation of document representatives
– Weighting
Inverted files

Indexing Language
Language used to describe documents and queries: index terms
– a selected subset of words
– derived from the text or arrived at independently
Keyword searching
– statistical analysis of documents based on word occurrence frequency
– automated, efficient and potentially inaccurate
Searching using controlled vocabularies
– more accurate results, but time consuming if documents are manually indexed

Luhn's Analysis
Resolving power of significant words:
– the ability of words to discriminate document content
– peaks at the rank-order position halfway between the two cut-offs

Generating Document Representatives
Input text: full text, abstract, title
Document representative: a list of (weighted) class names, each name representing a class of concepts (words) occurring in the input text
A document is indexed by a class name if one of its significant words occurs as a member of that class
Phases:
– identify words: lexical analysis (tokenising)
– removal of high-frequency words
– suffix stripping (stemming)
– detecting equivalent stems
– thesauri
– others (noun phrase, noun group, logical formula, structure)
– index structure creation

Process View
Document → Lexical Analysis → Stopword Removal → Stemming → Indexing Features

Lexical Analysis
The process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms)
– treating digits, hyphens, punctuation marks, and the case of the letters

Stopword Removal
Removal of high-frequency words
– a list of stop words implements Luhn's upper cut-off
– filters out words with very low discrimination value for retrieval purposes
– examples: "been", "a", "about", "otherwise"
– compare the input text with the stop list
– reduction: between 30 and 50 per cent

Conflation
Conflation reduces word variants into a single form
– similar words generally have similar meaning
– retrieval effectiveness is increased if the query is expanded with words similar in meaning to those it originally contained
A stemming algorithm is a conflation procedure
– it reduces all words with the same root to a single root

Different Forms: Stemming
Stemming
– matching the query term "forests" to "forest" and "forested"
– "choke", "choking", "choked"
Suffix removal
– removal of suffixes, e.g. "worker" → "work"
– Porter algorithm: remove the longest suffix
– errors: "equal" → "eq"; heuristic rules
– more effective than ordinary word forms
Detecting equivalent stems
– example: ABSORB- and ABSORPT-
Stemmers remove affixes
– prefixes too? e.g. "megavolt"

Plural Stemmer
Plurals in English
– if the word ends in "ies" but not "eies" or "aies": "ies" → "y"
– if the word ends in "es" but not "aes", "ees" or "oes": "es" → "e"
– if the word ends in "s" but not "us" or "ss": "s" → ""
– the first applicable rule is the one used

Processing "The destruction of the amazon rain forests"
Case normalisation
Stop word removal (from a fixed list)
– "destruction amazon rain forests"
Suffix removal (stemming)
– "destruct amazon rain forest"
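A minimal sketch of this pipeline in Python, using the plural-stemmer rules above. The stop list is a tiny illustrative subset, not a real one, and the function names are mine:

```python
import re

STOP_WORDS = {"the", "of", "a", "been", "about", "otherwise"}  # illustrative subset

def plural_stem(word):
    """Plural stemmer from the slide: the first applicable rule is used."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ies" -> "y", e.g. "ladies" -> "lady"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # "es" -> "e", e.g. "horses" -> "horse"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "s" -> "", e.g. "forests" -> "forest"
    return word

def document_representative(text):
    tokens = re.findall(r"[a-z]+", text.lower())               # lexical analysis + case normalisation
    significant = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [plural_stem(t) for t in significant]               # conflation

print(document_representative("The destruction of the amazon rain forests"))
# ['destruction', 'amazon', 'rain', 'forest']
# (a full suffix stripper such as Porter's would also give "destruct")
```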
Thesauri
A collection of terms along with some structure or relationships between them; scope notes, etc.
1. provide a standard vocabulary for indexing and searching
2. assist users in locating terms for proper query formulation
3. provide a classification hierarchy for broadening and narrowing the current query according to user need
Relationships:
– Equivalence: synonyms, preferred terms
– Hierarchical: broader/narrower terms (BT/NT)
– Association: related terms across the hierarchy (RT)

Thesauri Examples: WordNet
Faceted Classification
Thesauri Examples: AAT (Art and Architecture Thesaurus)

Hierarchical Classifications
Alphanumeric coding schemes; subject classifications
A taxonomy that represents a classification or kind-of hierarchy
Examples: Dewey Decimal, AAT, SHIC, ICONCLASS

Example (ICONCLASS):
41A32    Door
41A322   Closing the Door         (action associated with a door)
41A323   Monumental Door          (kind of a door)
41A324   Metalwork of a Door
41A3241  Door-Knocker             (something attached to a door)
41A325   Threshold
41A327   Door-keeper, houseguard

Terminology / Controlled Vocabulary
The descriptors from a thesaurus form a controlled vocabulary
– normalises indexing concepts
– identification of indexing concepts with clear semantics
– retrieval based on concepts rather than terms
– good for specific domains (e.g. medical)
– problematic for general domains (large, new, dynamic)

No One Classification

Generating Document Representatives: Outcome
Class: words with the same stem
Class name: the stem
Document representative:
– a list of class names (index terms or keywords)
The same process is applied to the query

Precision and Recall
Precision
– ratio of the number of relevant documents retrieved to the total number of documents retrieved
– the number of hits that are relevant
Recall
– ratio of the number of relevant documents retrieved to the total number of relevant documents
– the number of relevant documents that are hits

[Figure: relevant vs retrieved documents in the document space, illustrating the four cases: low precision/low recall, high precision/low recall, low precision/high recall, high precision/high recall]

Precision and Recall
Let R be the set of relevant documents, A the set of retrieved documents (the answer set), and RA = R ∩ A.

Recall = |RA| / |R|
Precision = |RA| / |A|

• The user isn't usually given the answer set A at once
• The documents in A are sorted by degree of relevance (a ranking), which the user examines
• Recall and precision vary as the user proceeds with the examination of the answer set A
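A small sketch of the two ratios in Python; the relevant and retrieved sets are made-up illustrations:

```python
def precision_recall(retrieved, relevant):
    """Precision = |RA| / |A|,  Recall = |RA| / |R|."""
    A, R = set(retrieved), set(relevant)
    RA = A & R                                 # relevant documents that were retrieved
    return len(RA) / len(A), len(RA) / len(R)

relevant  = {"d1", "d2", "d3", "d4"}           # hypothetical relevance judgements
retrieved = ["d1", "d3", "d5", "d6"]           # hypothetical answer set A
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")     # precision=0.50 recall=0.50
```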
Precision and Recall Trade-Off
Increasing the number of documents retrieved
– is likely to retrieve more of the relevant documents, and thus increase recall
– but typically retrieves more inappropriate documents, and thus decreases precision

[Figure: precision (y-axis, up to 100%) falls as recall (x-axis, up to 100%) rises]

Index Term Weighting
Effectiveness of an indexing language:
Exhaustivity
– the number of different topics indexed
– high exhaustivity: high recall and low precision
Specificity
– the ability of the indexing language to describe topics precisely
– high specificity: high precision and low recall

Index Term Weighting
Exhaustivity
– related to the number of index terms assigned to a given document
Specificity
– the number of documents to which a term is assigned in a collection
– related to the distribution of index terms in the collection
Measures:
– index term frequency: occurrence frequency of a term in a document
– document frequency: number of documents in which a term occurs

IR as Clustering
A query is a vague specification of a set of objects, A
IR is reduced to the problem of determining which documents are in set A and which are not
[Figure: the retrieved set A inside the document collection C]
Intra-cluster similarity:
– what are the features that better describe the objects in A?
Inter-cluster dissimilarity:
– what are the features that better distinguish the objects in A from the remaining objects in C?

Index Term Weighting
Weight(t,d) = tf(t,d) × idf(t)
where:
– N: number of documents in the collection
– n(t): number of documents in which term t occurs
– idf(t): inverse document frequency of term t
– occ(t,d): occurrences of term t in document d
– tmax: the term in document d with the highest occurrence
– tf(t,d): term frequency of t in document d

Index Term Weighting
Intra-cluster similarity
– the raw frequency of a term t inside a document d
– a measure of how well the term describes the document contents
– normalised frequency of term t in document d: tf(t,d) = occ(t,d) / occ(tmax, d)
Inter-cluster dissimilarity
– inverse document frequency: the inverse of the frequency of term t among the documents in the collection
– terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one

Inverse Document Frequency
idf(t) = log(N / n(t))
Weight(t,d) = tf(t,d) × idf(t)
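A direct transcription of these formulas in Python. Log base 10 is assumed here because it matches the numbers in the worked example on the next slide; the function names are mine:

```python
import math

def tf(occ_t_d, occ_tmax_d):
    """Normalised term frequency: occ(t,d) / occ(tmax, d)."""
    return occ_t_d / occ_tmax_d

def idf(N, n_t):
    """Inverse document frequency: log(N / n(t)), base 10 assumed."""
    return math.log10(N / n_t)

def weight(occ_t_d, occ_tmax_d, N, n_t):
    """Weight(t,d) = tf(t,d) * idf(t)."""
    return tf(occ_t_d, occ_tmax_d) * idf(N, n_t)

# "machines" from the worked example that follows:
# occ(t,d) = 19, occ(tmax, d) = 25, N = 100, n(machines) = 50
print(round(weight(19, 25, 100, 50), 4))   # 0.2288
```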
Term Weighting Schemes
Best known:
weight(t,d) = (occ(t,d) / occ(tmax, d)) × log(N / n(t))
Variation for query term weights:
weight(t,q) = (0.5 + 0.5 × occ(t,q) / occ(tmax, q)) × log(N / n(t))
(term frequency × inverse document frequency)

Example
Term occurrences in a document d, with N = 100 documents in the collection:

Term          occ(t,d)
nuclear       7
poverty       5
luddites      3
people        25
computer      9
unemployment  1
machines      19
and           49

With occ(tmax, d) = 25 ("people" is the most frequent index term once the stop word "and" is discarded):

Weight(machine) = 19/25 × log(100/50) = 0.76 × 0.30103 ≈ 0.2288
Weight(luddite) = 3/25 × log(100/2) = 0.12 × 1.69897 ≈ 0.2039
Weight(poverty) = 5/25 × log(100/2) = 0.20 × 1.69897 ≈ 0.3398

Inverted Files
A word-oriented mechanism for indexing text collections to speed up searching
Searching:
– vocabulary search (query terms)
– retrieval of occurrences
– manipulation of occurrences

Original document view:
           cosmonaut  astronaut  moon  car  truck
D1             1          0       1     1     1
D2             0          1       1     0     0
D3             0          0       0     1     1

Inverted view:
           D1  D2  D3
cosmonaut   1   0   0
astronaut   0   1   0
moon        1   1   0
car         1   0   1
truck       1   0   1

Inverted index:
cosmonaut → D1
astronaut → D2
moon      → D1, D2
car       → D1, D3
truck     → D1, D3

Inverted File
The speed of retrieval is maximised by considering only those terms that have been specified in the query
This speed is achieved only at the cost of very substantial storage and processing overheads

Components of an Inverted File
[Diagram: header information — one entry per term (term, field type, frequency, pointer) — pointing into a postings file of (document number, frequency) pairs]

Producing an Inverted File
Example over eight documents (Doc 1 – Doc 8); the document-term incidence matrix reduces to the postings below:

Term    Postings
aid     4, 8
all     2, 4, 6
back    1, 3, 7
brown   1, 3, 5, 7
come    2, 4, 6, 8
dog     3, 5
fox     3, 5, 7
good    2, 4, 6, 8
jump    3
lazy    1, 3, 5, 7
men     2, 4, 8
now     2, 6, 8
over    1, 3, 5, 7, 8
party   6, 8
quick   1, 3
their   1, 5, 7
time    2, 4, 6

Searching Algorithm
For each document D, Score(D) = 0
For each query term:
– search the vocabulary list
– pull out the postings list
– for each document J in the list, Score(J) = Score(J) + 1

What Goes in a Postings File?
Boolean retrieval
– just the document number
Ranked retrieval
– document number and term weight (TF×IDF, ...)
Proximity operators
– word offsets for each occurrence of the term
– example: Doc 3 (t17, t36), Doc 13 (t3, t45)

How Big Is the Postings File?
Very compact for Boolean retrieval
– about 10% of the size of the documents, if an aggressive stop-word list is used
Not much larger for ranked retrieval
– perhaps 20%
Enormous for proximity operators
– sometimes larger than the documents
– but access is fast: you know where to look

[Diagram: retrieval architecture — documents and queries are tokenised, stop words removed, and stemmed to give indexing and query features; matching runs the query features against an inverted index (term → postings of documents di, dj, dk, ...) to give a ranked result list with scores s1 > s2 > s3 > ...]
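A minimal sketch in Python of building the toy cosmonaut/astronaut index above and running the counting search algorithm just described; the data structures are illustrative choices:

```python
from collections import defaultdict

docs = {
    "D1": ["cosmonaut", "moon", "car", "truck"],
    "D2": ["astronaut", "moon"],
    "D3": ["car", "truck"],
}

# Build the inverted index: term -> postings list of document IDs.
index = defaultdict(list)
for doc_id, terms in docs.items():
    for term in set(terms):
        index[term].append(doc_id)

def search(query_terms):
    """One point per matching query term, as in the searching algorithm above."""
    score = defaultdict(int)
    for term in query_terms:                  # vocabulary search
        for doc_id in index.get(term, []):    # pull out the postings list
            score[doc_id] += 1                # Score(J) = Score(J) + 1
    return sorted(score.items(), key=lambda kv: -kv[1])

print(search(["moon", "car"]))   # [('D1', 2), ('D2', 1), ('D3', 1)]
```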
Similarity Matching
The process in which we compute the relevance of a document to a query
A similarity measure comprises:
– a term weighting scheme, which allocates numerical values to each of the index terms in a query or document, reflecting their relative importance
– a similarity coefficient, which uses the term weights to compute the overall degree of similarity between a query and a document
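The slides do not fix a particular similarity coefficient; the cosine of the angle between the weighted term vectors is the classic choice, sketched here over plain dicts of (hypothetical) tf-idf weights:

```python
import math

def cosine(query_weights, doc_weights):
    """Similarity coefficient: cosine between two sparse weight vectors."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Hypothetical term weights for a query and one document.
q = {"amazon": 0.8, "forest": 0.5}
d = {"amazon": 0.3, "forest": 0.6, "rain": 0.4}
print(round(cosine(q, d), 3))   # 0.733
```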