CS336 Lecture 4: Properties of Text

Stop list
• Typically the most frequently occurring words
  – a, about, at, and, etc, it, is, the, or, …
• Among the top 200 are words such as "time", "war", "home", etc.
• May be collection specific
  – e.g. "computer, machine, program, source, language" in a computer science collection
• Removal can be problematic (e.g. "Mr. The", "and-or gates")

Stop lists
• Commercial systems use only a few stop words
• ORBIT uses only 8: "and, an, by, from, of, the, with"
  – it covers patents, scientific and technical (sci-tech) information, trademarks and Internet domain names

Special Cases?
• Name recognition
  – People's names: "Bill Clinton"
  – Company names: IBM & Big Blue
  – Places: New York City, NYC, the Big Apple

Text
• Goal: identify what can be inferred about text based on
  – structural features
  – statistical features of language
• Statistical language characteristics
  – convert text to a form more easily manipulated by computer
  – reduce storage space and processing time
  – store and process in encrypted form
    • text compression

Zipf's Law
• The probability of occurrence of words or other items starts high and tapers off: a few occur very often, while many others occur rarely.
• p_r = (frequency of the word of rank r) / N
  – the probability that a randomly chosen word occurrence is the word of rank r
  – N = total number of word occurrences
  – for D distinct words, Σ p_r = 1 (summing over r = 1 … D)
• r · p_r ≈ A, with A ≈ 0.1
  – i.e. a word's frequency is inversely proportional to its rank
  – (checked empirically in the short sketch after the vocabulary-growth slide below)

Employing Zipf's Law
• Identify significant words and ineffectual words
  – A few words occur very often
    • the 2 most frequent words can account for 10% of occurrences
    • the top 6 words for 20%
    • the top 50 words for 50%
  – Many words are infrequent

Most frequent words (N ≈ 1,000,000)
   r   word     f(r)   r·f(r)/N
   1   the    69,971      0.070
   2   of     36,411      0.073
   3   and    28,852      0.086
   4   to     26,149      0.104
   5   a      23,237      0.116
   6   in     21,341      0.128
   7   that   10,595      0.074
   8   is     10,049      0.081
   9   was     9,816      0.088
  10   he      9,543      0.095

Employing Zipf's Law
• Estimate technical needs
  – Estimate the storage space saved by excluding stop words from an index
    • the 10 most frequently occurring words in English make up about 25%-30% of text
    • deleting very low frequency words from the index yields a large saving
  – Estimate the number of words n(1) that occur once, n(2) that occur twice, etc.
    • words that occur at most twice comprise about two-thirds of the vocabulary
  – Estimate the size of a term's inverted index list
• Zipf is quite accurate except at very high and very low ranks

Modeling Natural Language
• Length of the words
  – defines the total space needed for the vocabulary (each character requires 1 byte)
  – Heaps' Law implies that the length of vocabulary words increases logarithmically with text size

Vocabulary Growth
• New words occur less frequently as the collection grows
• Empirically t = k·N^b (Heaps' Law), where
  – t is the number of unique words
  – k and b are constants: k ≈ 10-20, b ≈ 0.5-0.6
  – this is a sublinear growth rate
• As the total text size grows, the predictions of Heaps' Law become more accurate
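The following is a small sketch, not from the original slides, of how Zipf's law (r · f(r)/N roughly constant) and Heaps' law (vocabulary ≈ k·N^b) can be checked on a text collection. The file name corpus.txt and the lowercase-alphabetic tokenizer are illustrative assumptions.

```python
# Sketch: check Zipf's law (r * f(r) / N roughly constant) and Heaps' law
# (vocabulary size grows roughly as k * N^b) on a plain-text file.
# "corpus.txt" and the simple regex tokenizer are illustrative assumptions.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z]+", f.read().lower())

N = len(words)
counts = Counter(words)

# Zipf: for the top-ranked words, r * f(r) / N should stay near ~0.1.
print("rank  word      f(r)   r*f(r)/N")
for r, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{r:>4}  {word:<8}{freq:>8}   {r * freq / N:.3f}")

# Heaps: vocabulary size after the first n tokens grows sublinearly in n.
seen = set()
for n, w in enumerate(words, start=1):
    seen.add(w)
    if n % 100_000 == 0 or n == N:
        print(f"tokens={n:>9}  vocabulary={len(seen)}")
```

On English text, the r·f(r)/N column should hover near the 0.1 constant the slides mention, and the vocabulary column should grow roughly as a power of the token count rather than linearly.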
Information Theory
• Shannon studied theoretical limits for data compression and transmission rate
  – "… problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
• Compression limits are given by entropy (H)
• Transmission limits are given by channel capacity (C)
• Many language tasks have been formulated as a "noisy channel" problem: determine the most likely input given the noisy output
  – OCR
  – speech recognition
  – machine translation
  – etc.

Shannon Game
• How should we complete the following?
  – The president of the United States is George W. …
  – The winner of the $10K prize is …
  – Mary had a little …
  – The horse raced past the barn …
    • period?
  – etc.

Information Theory
• The information content of a message depends on both
  – the receiver's prior knowledge
  – the message itself
    • how much of the receiver's uncertainty (entropy) is reduced
    • how predictable the message is

Information Theory
• Think of information content, H, as a measure of our ability to guess the rest of a message, given only a portion of it
  – if we can predict it with probability 1, its information content is zero
  – if we can predict it with probability 0, its information content is infinite
  – H(p) = -log p
• With logs in base 2, the unit of information content (entropy) is 1 bit
• If a message is a priori predictable with p = 0.5, its information content is -log2(1/2) = 1 bit

Information Theory
• Given n messages, the average or expected information content gained from receiving one of them is:
  H = -Σ p_i log2 p_i (summing over i = 1 … n)
  where n is the number of symbols in the alphabet and p_i is the probability of symbol i's appearance (freq_i / all occurrences)
• The amount of information in a message is related to the distribution of symbols in the message.

Entropy
• Average entropy is a maximum when messages are equally probable
  – e.g. the average entropy associated with characters, assuming equal probabilities:
    • for the 26-letter alphabet, H = log2 26 ≈ 4.7 bits
    • with actual letter probabilities, H ≈ 4.14 bits
    • with bigram probabilities, H reduces to about 3.56 bits
    • people predict the next letter with ~40% accuracy, giving H ≈ 1.3 bits
• Better models reduce the relative entropy
• In text compression, entropy (H) specifies the limit on how much the text can be compressed
  – the more regularity (i.e. the less uncertainty) in a data sequence, the more it can be compressed

Information Theory
• Let t = the number of unique words in a vocabulary
  – for t = 10,000, 50,000 and 100,000, H ≈ 9.5, 10.9 and 11.4 bits respectively
• Information theory has been used for
  – compression
  – term weighting
  – evaluation measures

Stemming
• Commonly used to conflate morphological variants
  – combine non-identical words referring to the same concept
    • compute, computation, computer, …
• Stemming is used to:
  – enhance query formulation (and improve recall) by providing term variants
  – reduce the size of index files by combining term variants into a single index term

Stemmer correctness
• Two ways to be incorrect
  – Under-stemming
    • prevents related terms from being conflated
    • stemming "consideration" to "considerat" prevents conflating it with "consider"
    • under-stemming hurts recall
  – Over-stemming
    • terms with different meanings are conflated
    • "considerate", "consider" and "consideration" should not be stemmed to "con" and conflated with "contra", "contact", etc.
    • over-stemming can reduce precision

The Concept of Relevance
• Relevant => does the document fulfill the query?
• The relevance of a document D to a query Q is subjective
  – different users will make different judgments
  – the same user may judge differently at different times
  – the degree of relevance of different documents will vary
• In IR system evaluation it is assumed that:
  – a subset of the database (DB) documents is relevant
  – a document is either relevant or not relevant

Recall and precision
• The most common measures for evaluating IR systems
• Recall: % of relevant documents that are retrieved
  – measures the ability to get ALL of the good documents
• Precision: % of retrieved documents that are in fact relevant
  – measures the amount of junk included in the results
• Ideal retrieval results
  – 100% recall (all good documents are retrieved)
  – 100% precision (no bad document is retrieved)
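A minimal sketch, not from the slides, of computing recall and precision for a single query when the retrieved and relevant documents are given as sets of document ids; the function name and the ids below are made up for illustration.

```python
# Sketch: recall and precision for one query, given sets of document ids.
# The example ids are invented for illustration.
def recall_precision(retrieved: set, relevant: set) -> tuple[float, float]:
    hits = len(retrieved & relevant)                    # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

retrieved = {"d1", "d2", "d3", "d7"}
relevant = {"d2", "d3", "d5", "d9", "d11"}
r, p = recall_precision(retrieved, relevant)
print(f"recall = {r:.2f}, precision = {p:.2f}")         # recall = 0.40, precision = 0.50
```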
Evaluating stemmers
• In information retrieval, stemmers are evaluated by their:
  – effect on retrieval
    • improvements in recall or precision
  – compression rate
  – not their linguistic correctness

Stemmers
• 4 basic types
  – affix removal stemmers
  – dictionary lookup stemmers
  – n-gram stemmers
  – corpus analysis
• Studies have shown that stemming has a positive effect on retrieval
• The performance of different algorithms is comparable
• Results vary between test collections

Affix removal stemmers
• Remove suffixes and/or prefixes, leaving a stem
  – in English, remove suffixes
    • what might you remove if you were designing a stemmer?
  – in other languages, e.g. Hebrew, remove both prefix and suffix
    • Keshehalachnu --> halach
    • Nelechna --> halach
  – some languages are more difficult, e.g. Arabic
• Two common strategies:
  – iterative: consideration => considerat => consider
  – longest match: use a set of stemming rules arranged on a "longest match" principle (Lovins)

A simple stemmer (Harman)
  if word ends in "ies" but not "eies" or "aies" then "ies" -> "y"
  else if word ends in "es" but not "aes", "ees" or "oes" then "es" -> "e"
  else if word ends in "s" but not "us" or "ss" then "s" -> NULL
  endif
• The algorithm changes:
  – "skies" to "sky"
  – "retrieves" to "retrieve"
  – "doors" to "door"
  – but not "corpus" or "wellness"
  – "dies" to "dy"?
• (a runnable version of these rules appears after the stemming discussion below)

Stemming with Dictionaries
• Avoid collapsing words with different meanings to the same root
• A word is looked up and replaced by the best stem
• Typical stemmers consist of rules and/or dictionaries
  – the simplest stemmer is "suffix s"
  – the Porter stemmer is a collection of rules
  – KSTEM uses lists of words plus rules for inflectional and derivational morphology

Stemming Examples
• Original text: Document will describe marketing strategies carried out by U.S. companies for their agricultural chemicals, report predictions for market share of such chemicals, or report market statistics for agrochemicals, pesticide, herbicide, fungicide, insecticide, fertilizer, predicted sales, market share, stimulate demand, price cut, volume of sales
• Porter stemmer: market strateg carr compan agricultur chemic report predict market share chemic report market statist agrochem pesticid herbicid fungicid insecticid fertil predict sale stimul demand price cut volum sale
• KSTEM: marketing strategy carry company agriculture chemical report prediction market share chemical report market statistic agrochemic pesticide herbicide fungicide insecticide fertilizer predict sale stimulate demand price cut volume sale

n-grams
• Fixed-length consecutive series of "n" characters
  – bigrams:
    • sea colony -> (se ea co ol lo on ny)
  – trigrams:
    • sea colony -> (sea col olo lon ony), or
    • with word-boundary markers: (#se sea ea# #co col olo lon ony ny#)
• Conflate words based on overlapping series of characters

Problems with Stemming
• Lack of domain specificity and context can lead to occasional serious retrieval failures
• Stemmers are often difficult to understand and modify
• Sometimes too aggressive in conflation (over-stemming)
  – e.g. "execute"/"executive", "university"/"universe", "policy"/"police", "organization"/"organ" are conflated by Porter
• Sometimes good conflations are missed (under-stemming)
  – e.g. "European"/"Europe", "matrices"/"matrix", "machine"/"machinery" are not conflated by Porter
• Stems that are not words are often difficult to interpret
  – e.g. with Porter, "iteration" produces "iter" and "general" produces "gener"
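As promised on the Harman stemmer slide above, here is a minimal Python rendering of those three rules (the function name s_stem is an illustrative choice). It implements exactly the rules stated on the slide and nothing more, so it also produces the questionable "dies" -> "dy".

```python
# Sketch: the simple "S" stemmer (Harman) rules from the slide above.
def s_stem(word: str) -> str:
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"        # "skies" -> "sky", but also "dies" -> "dy"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"        # "retrieves" -> "retrieve"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]              # "doors" -> "door"; "corpus", "wellness" unchanged
    return word

for w in ["skies", "retrieves", "doors", "corpus", "wellness", "dies"]:
    print(w, "->", s_stem(w))
```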
Corpus-Based Stemming
• Corpus analysis can improve or replace a stemmer
• Hypothesis: word variants that should be conflated will co-occur in context
• Modify the stem classes generated by a stemmer or by other "aggressive" techniques such as initial n-grams
  – more aggressive classes mean fewer missed conflations
• Prune each class by removing words that do not co-occur sufficiently often
  – (a small sketch of this pruning idea appears at the end of these notes)
• Language independent

Equivalence Class Examples
• abandon abandoned abandoning abandonment abandonments abandons
• abate abated abatement abatements abates abating
• abrasion abrasions abrasive abrasively abrasiveness abrasives
• absorb absorbable absorbables absorbed absorbencies absorbency absorbent absorbents absorber absorbers absorbing absorbs
• abusable abuse abused abuser abusers abuses abusing abusive abusively
• access accessed accessibility accessible accessing accession

Some Porter Classes for a WSJ Database (refined through corpus analysis)
• abandonment abandonments abated abatements abatement abrasive abrasives absorbable absorbables absorbencies absorbency absorbent absorber absorbers abuse abusing abuses abusive abusers abuser abused accessibility accessible

Corpus-Based Stemming Results
• Both the Porter and KSTEM stemmers are improved slightly by this technique
• The n-gram stemmer gives the same performance as the "linguistic" stemmers for
  – English
  – Spanish
• This has not been shown to be the case for Arabic

Stemmer Summary
• All automatic stemmers are sometimes incorrect
  – over-stemming
  – under-stemming
• In general, stemming improves effectiveness
• Stemmers may use varying levels of language-specific information
  – morphological stemmers use dictionaries
  – affix removal stemmers use information about prefixes, suffixes, etc.
• n-gram and corpus analysis methods can be used for different languages

Generating Document Representations
• Use significant terms to build representations of documents
  – referred to as indexing
• Manual indexing: professional indexers
  – assign terms from a controlled vocabulary
  – typically phrases
• Automatic indexing: the machine selects
  – terms can be single words, phrases, or other features from the text of documents

Index Languages
• The language used to describe documents and queries
• Exhaustivity: the number of different topics indexed; completeness or breadth
  – increased exhaustivity => higher recall / lower precision
    • retrieved output size increases because documents are indexed by any remotely connected content
• Specificity: the accuracy and detail of indexing
  – increased specificity => higher precision / lower recall
    • when a document is represented by fewer terms, content may be lost; a query that refers to the lost content will fail to retrieve the document

Index Languages
• Pre-coordinate indexing: combinations of terms (e.g. phrases) used as an indexing label
• Post-coordinate indexing: combinations generated at search time
• Faceted classification: group terms into facets that describe the basic structure of a domain; less rigid than a predefined hierarchy
• Enumerative classification: an alphabetic listing whose underlying order is less clear
  – e.g. the Library of Congress class for "socialism, communism and anarchism" is at the end of the schedule for social sciences, after social pathology and criminology
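Referring back to the corpus-based stemming slides above, here is a minimal sketch of the class-pruning idea: start from aggressive classes and keep variants together only if they co-occur in enough documents. The grouping by the first four characters, the tiny document list, and the co-occurrence threshold are illustrative assumptions, not the exact method described in the lecture.

```python
# Sketch of corpus-based refinement of stem classes (illustrative assumptions,
# not the exact algorithm from the lecture): group words aggressively, then
# prune pairs of variants that do not co-occur in enough documents.
from collections import defaultdict
from itertools import combinations

docs = [
    "the abandoned mine was abandoned again",
    "plans for abandonment of the mine",
    "an abatement order and further abatements",
    "the abandonment follows years of abandoning the site",
]

# Record which documents each word appears in.
doc_sets = defaultdict(set)
for i, doc in enumerate(docs):
    for word in doc.lower().split():
        doc_sets[word].add(i)

# Aggressive initial classes: group words by their first 4 characters (an assumption).
classes = defaultdict(set)
for word in doc_sets:
    if len(word) > 3:
        classes[word[:4]].add(word)

# Prune: keep a pair of variants only if they co-occur in at least MIN_CO_OCCUR documents.
MIN_CO_OCCUR = 1
for prefix, members in classes.items():
    if len(members) < 2:
        continue
    kept = [
        (a, b) for a, b in combinations(sorted(members), 2)
        if len(doc_sets[a] & doc_sets[b]) >= MIN_CO_OCCUR
    ]
    print(prefix, "->", kept)
```

On this toy corpus the "aban" class keeps only the pair ("abandoning", "abandonment"), since those are the variants that actually co-occur in a document; a real system would use larger windows and an association score rather than a raw count.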