CS5286 Search Engine Technology and Algorithms/Xiaotie Deng

Lecture 6: Linguistic Methods for Searching
  Stemming
  Thesaurus
  Online resources
  Automatic construction of thesaurus

Outline of Stemming Methods
  Goal of Stemming Process
  Algorithm
    Affix Removal (Porter's Algorithm)
    Dictionary Look-up Stemmers
    Successor Variety
    n-Gram Stemming
  Applications

The Advantage
  Stemming was originally designed to improve performance by reducing the demand on system resources.
  With the continued, significant increase in storage and computing power, using stemming for performance reasons is no longer as important.

Other Potentials
  Stemming may improve recall.
  There may be an associated decline in precision.
  System designers make their own choice of whether to include stemming.
    Google does not use stemming.
    Hotbot offers word stemming as a user option.

Porter Stemming Algorithm
  The Porter algorithm is the most commonly accepted stemming algorithm.
  It is based on a set of conditions on the stem, suffix, and prefix, with an associated action for each condition.
  See, e.g., http://www.tartarus.org/~martin/PorterStemmer/

Porter Stemming (Condition)
  m, the measure of a stem, counts the sequences of vowels (a, e, i, o, u, and y when it follows a consonant) that are followed by a sequence of consonants.
  [C](VC)^m[V], where C is a sequence of consonants, V is a sequence of vowels, the initial C and final V are optional, and m is the number of times VC repeats.

  Measure   Examples
  m=0       free, why
  m=1       frees, whose
  m=2       prologue, compute

Porter Stemming (Condition)
  *<X>   the stem ends with the letter X
  *v*    the stem contains a vowel
  *d     the stem ends in a double consonant
  *o     the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y

Rules
  Step   Condition   Suffix   Replacement   Examples
  1a     NULL        sses     ss            stresses -> stress
  1b     *v*         ing      NULL          making -> mak
  1b1    NULL        at       ate           inflat(ed) -> inflate
  1c     *v*         y        i             happy -> happi

Rules (continued)
  Step   Condition               Suffix   Replacement     Examples
  2      m>0                     aliti    al              formaliti -> formal
  3      m>0                     icate    ic              duplicate -> duplic
  4      m>1                     able     NULL            adjustable -> adjust
  5a     m>1                     e        NULL            inflate -> inflat
  5b     m>1 and *d and *<L>     NULL     single letter   controll -> control

Example
  duplicatable -> duplicat    (rule 4)
  duplicat     -> duplicate   (rule 1b1)
  duplicate    -> duplic      (rule 3)

Dictionary Look-Up Stemmer
  A dictionary contains the pairing of a word and its stem, for all the words.
  The structure of the dictionary should be designed well to speed up the search.

  TERM          STEM
  computer      comput
  compute       comput
  computation   comput

Successor Variety Stemming
  Hafer and Weiss (1974), "Word segmentation by letter successor varieties", Information Storage and Retrieval 10, 371-385.
  Main idea: determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.
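
Returning briefly to the Porter measure: the form [C](VC)^m[V] defined above can be checked mechanically. The following is a minimal Python sketch, not Porter's reference implementation. The function names is_consonant and measure are illustrative assumptions, and y is treated as a vowel only when it follows a consonant, as in Porter's definition.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count m, the number of VC repeats in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m


for w in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(w, measure(w))
# free 0, why 0, frees 1, whose 1, prologue 2, compute 2
```

The printed values agree with the measure examples in the table on the condition slide.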

Note
  Morpheme: the smallest meaningful part into which a word can be divided.
    run-s contains two morphemes.
    un-like-ly contains three morphemes.
  Phoneme: a unit of the system of sounds in a language.
    English has 24 consonant phonemes.

Overall Approach
  Hafer and Weiss use
    letters in place of phonemes
    texts in place of phonemically transcribed utterances

Formal Definition
  Let w be a word of length n; w_i is the length-i prefix of w.
  Let D be a collection of words.
  D(w_i) is the subset of D containing the terms whose first i letters match w_i exactly.
  S(w_i), the successor variety of w_i, is the number of distinct letters that occupy the (i+1)st position of the words in D(w_i).
  A test word of length n has n successor varieties S(w_1), S(w_2), ..., S(w_n).

Informal Definition
  The successor variety of a string in a collection D of words is the number of different characters that follow it in D.
  That is, it depends on both the string and the collection D of words under consideration.

An Example
  D = {able, axle, accident, ape, about, be}
  The successor variety for
    a:   4 (b, x, c, p)
    ap:  1 (e)
    app: 0
    ab:  2 (l, o)
    b:   1 (e)
  Using a trie, the successor variety of a string is the number of children of the node that the string reaches in the trie (a terminal node is treated as having one child).

Trie for the Corpus of Data D
  [Figure: trie for D. The root branches on a and b. The a node branches on b, x, c, and p, leading to axle, accident, and ape; the ab node branches on l and o, leading to able and about. The b branch leads to be.]

Segment in Words
  In a large body of text, the successor variety of a substring usually decreases as each character is added, until a segment boundary is reached.
  Consider the following example:
    D = {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}
    r      3 (e, i, o)
    re     2 (a, d)
    rea    1 (d)
    read   3 (a, i, s)
  read is a segment (or stem).

Selecting Segments of Words
  Cutoff method: a boundary is identified when some cutoff value is reached.
  Peak and plateau method: a segment break is made after a character whose successor variety is larger than that of both the character immediately before it and the character immediately after it.
  Complete word method: a break is made after a segment if the segment is a complete word in the corpus.
  Entropy method: the cutoff method applied to an entropy defined for words.

Peak and Plateau Method
  D = {able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe}

  Prefix     Successor variety   Letters
  r          3                   e, i, o
  re         2                   a, d
  rea        1                   d
  read       3                   a, i, s
  reada      1                   b
  readab     1                   l
  readabl    1                   e
  readable   1                   blank

  The successor variety of "read" is 3, larger than that of both "rea" and "reada".

Peak and Plateau Method
  Input: a document of many terms.
  Output: each term is segmented.
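
The successor variety counts and the peak and plateau rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hafer and Weiss's implementation: the names successor_variety and peak_and_plateau are hypothetical, and a complete word with no continuation is counted as having one successor (the blank), following the trie convention stated earlier.

```python
from __future__ import annotations


def successor_variety(prefix: str, corpus: set[str]) -> int:
    """Number of distinct letters that follow `prefix` in the corpus."""
    followers = {w[len(prefix)] for w in corpus
                 if w.startswith(prefix) and len(w) > len(prefix)}
    if not followers and prefix in corpus:
        return 1  # assumption: a leaf (complete word) has one child, the blank
    return len(followers)


def peak_and_plateau(word: str, corpus: set[str]) -> list[str]:
    """Break after each prefix whose variety exceeds both neighbours'."""
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    segments, start = [], 0
    for i in range(1, len(sv) - 1):
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


corpus = {"able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"}
print([successor_variety("readable"[:i], corpus) for i in range(1, 9)])
# [3, 2, 1, 3, 1, 1, 1, 1]
print(peak_and_plateau("readable", corpus))
# ['read', 'able']
```

A linear scan over the corpus is used here for brevity; a trie, as described above, would answer each successor variety query without rescanning every word.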
  For example, the peak and plateau method segments readable as read-able.

Stem Method of Hafer and Weiss
  Determine the successor varieties of a word.
  Use this information to segment the word, using one of the previous methods (say, peak and plateau).
  Choose one of the segments as the stem:
    if (first segment occurs in <= 12 words in the corpus)
        first segment is stem
    else  // the first segment may be a prefix
        second segment is stem

Stem Method of Hafer and Weiss
  Input: a segmented word.
  Output: the stem of the word.
  For example:
    read-able is the input
    read is the output  // able may be the output instead, depending on which branch the algorithm takes

Accessor Variety Method in Chinese
  The notion was introduced by Feng, Chen, Zheng, and Deng for Chinese word extraction.
  The idea is similar to successor variety.
  It is used to determine Chinese text segmentation, since it is difficult to separate words in Chinese text.
  In comparison, English words are separated by a space symbol in text.

Definition: Accessor Variety
  We treat each Chinese character as a letter.
  For each string (a potential word) consisting of several characters, we define the successor variety as in English.
  Symmetrically, we also define a predecessor variety for each string.
  A string is considered a word if it has a large successor variety and a large predecessor variety.

Testing Results
  The accessor variety method turns out to be a very simple yet efficient way to recognize Chinese words when combined with some simple grammar rules.
  For details, look at our paper: http://www.cs.cityu.edu.hk/~deng/5286/feng.pdf

Word Similarity
  N-gram method: break a word of length n into (n-1) digrams, each a substring of two consecutive characters of the word.
  Count the number of distinct digrams in each word.
  Let A (B) be the number of distinct digrams in word 1 (word 2).
  Let C be the number of distinct digrams shared by word 1 and word 2.
  The similarity of the two words is S = 2C/(A+B).

Example of Word Similarity
  statistics: st, ta, at, ti, is, st, ti, ic, cs
    its distinct digrams: at, cs, ic, is, st, ta, ti
  statistical: st, ta, at, ti, is, st, ti, ic, ca, al
    its distinct digrams: al, at, ca, ic, is, st, ta, ti
  A = 7, B = 8, C = 6
  Similarity = 2 x 6/(7+8) = 12/15 = 4/5 = 80%
  One may build a similarity matrix of all the words in a corpus, calculated as above, and complement it with the cutoff value method (set an entry to 0 if it is less than a certain value, and to 1 otherwise).

Thesaurus
  Vocabulary control in an information retrieval system
  Thesaurus construction
    Manual construction
    Automatic construction

Vocabulary Control
  A standard vocabulary for both indexing and searching (for the constructors of the system and the users of the system).

Objectives of Vocabulary Control
  To promote the consistent representation of subject matter by indexers and searchers, thereby avoiding the dispersion of related materials.
  To facilitate the conduct of a comprehensive search on some topic by linking together terms whose meanings are related paradigmatically.

Thesaurus
  A thesaurus is not like a common dictionary, which lists words with their explanations.
  It may contain all the words in a language, or only the words in a specific domain.
  It carries a lot of other information, especially the relationships between words:
    classification of the words in the language
    word relationships such as synonyms and antonyms

On-Line Thesaurus
  http://www.thesaurus.com
  http://www.dictionary.com/
  http://www.cogsci.princeton.edu/~wn/

Dictionary vs. Thesaurus
  Check "information" using http://www.thesaurus.com

  Dictionary
    in·for·ma·tion n.
    Knowledge derived from study, experience, or instruction.
    Knowledge of specific events or situations that has been gathered or received by communication; intelligence or news. See Synonyms at knowledge. ......

  Thesaurus
    [Nouns] information, enlightenment, acquaintance ......
    [Verbs] tell; inform, inform of; acquaint, acquaint with; impart, ......
    [Adjectives] informed; communique; reported; published

Use of Thesaurus
  To control the terms used in indexing: for a specific domain, use only the terms in the thesaurus as indexing terms.
  To assist users in forming proper queries, with the help of the information contained in the thesaurus.

Construction of Thesaurus
  Stemming can be used to reduce the size of the thesaurus.
  A thesaurus can be constructed either manually or automatically.

WordNet: Manually Constructed
  WordNet® is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory.
  English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexical concept.
  Different relations link the synonym sets.

Relations in WordNet
  [Figure: relations in WordNet]

Automatic Thesaurus Construction
  A variety of methods can be used in constructing the thesaurus.
  Term similarity can be used for constructing the thesaurus.

Complete Term Relation Method
         Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
  Doc1   0      4      0      0      0      2      1      3
  Doc2   3      1      4      3      1      2      0      1
  Doc3   3      0      0      0      3      0      3      0
  Doc4   0      1      0      3      0      0      2      0
  Doc5   2      2      2      3      1      4      0      2

  The term-document relationship can be calculated using a variety of methods, such as tf-idf.
  Term similarity can be calculated based on the term-document relationship, for example:

  \[ Sim(Term_i, Term_j) = \sum_{k \in \text{all documents}} (DocTerm_{k,i})(DocTerm_{k,j}) \]

Complete Term Relation Method
          Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
  Term1          7      16     15     14     14     9      7
  Term2   7             8      12     3      18     6      17
  Term3   16     8             18     6      16     0      8
  Term4   15     12     18            6      18     6      9
  Term5   14     3      6      6             6      9      3
  Term6   14     18     16     18     6             2      16
  Term7   9      6      0      6      9      2             3
  Term8   7      17     8      9      3      16     3

  Set the threshold to 10.

Complete Term Relation Method
  [Figure: graph over the terms T1 to T8]
  Groups
    T1, T3, T4, T6
    T1, T5
    T2, T4, T6
    T2, T6, T8
    T7
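
The similarity matrix and the groups above can be reproduced with a short Python sketch. The slides do not name the grouping procedure, so the following rests on assumptions: two terms are linked when their similarity is at least the threshold, and a group is a maximal set of mutually linked terms (a maximal clique), found here with the basic Bron-Kerbosch recursion. The names doc_term, sim, adj, and maximal_cliques are illustrative.

```python
from itertools import combinations

# Document-term weights from the table (rows Doc1..Doc5, columns Term1..Term8).
doc_term = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]
n_terms = len(doc_term[0])


def sim(i: int, j: int) -> int:
    """Sum over all documents k of DocTerm[k][i] * DocTerm[k][j]."""
    return sum(row[i] * row[j] for row in doc_term)


THRESHOLD = 10  # assumption: link two terms when sim >= THRESHOLD
adj = {i: set() for i in range(n_terms)}
for i, j in combinations(range(n_terms), 2):
    if sim(i, j) >= THRESHOLD:
        adj[i].add(j)
        adj[j].add(i)


def maximal_cliques(r: set, p: set, x: set, out: list) -> None:
    """Basic Bron-Kerbosch: r is the current clique, p the candidates, x the excluded."""
    if not p and not x:
        out.append(sorted(r))
        return
    for v in list(p):
        maximal_cliques(r | {v}, p & adj[v], x & adj[v], out)
        p = p - {v}
        x = x | {v}


cliques = []
maximal_cliques(set(), set(range(n_terms)), set(), cliques)
groups = sorted([f"T{i + 1}" for i in c] for c in cliques)
print(groups)
# [['T1', 'T3', 'T4', 'T6'], ['T1', 'T5'], ['T2', 'T4', 'T6'], ['T2', 'T6', 'T8'], ['T7']]
```

No entry of the matrix equals 10, so linking on "at least 10" or on "greater than 10" gives the same graph. T7 has no similarity above the threshold and therefore forms a group on its own.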