This Class
- How stemming is used in IR
- Stemming algorithms
- Frakes: Chapter 8
- Kowalski: pages 67-76

Stemming algorithms
- Affix removing stemmers
- Dictionary lookup stemmers
- n-gram stemmers
- Successor variety stemmers

Stemming
- Conflation: combining morphological term variants
- Done manually or automatically
- Automatic algorithms are called stemmers

Conflation methods
- Manual
- Automatic
  - Affix removal
    - Longest match
    - Simple removal
  - Successor variety
  - Dictionary lookup
  - n-grams

Stemming is used for:
- Enhancing query formulation (and improving recall) by providing term variants
- Reducing the size of index files by combining term variants into a single index term

Stemming during indexing
- Index terms are stemmed words
- Saves dictionary space
- One inverted index list for all variants
- Saves inverted index file space when position information in the document is not included
- Query terms are also stemmed

Index is not stemmed
- In this case the index contains words
- No compression is achieved
- No information is lost
- Enables wild card searches
- Enables long phrase searches when position information is included

Providing term variants during search
- A stemming algorithm generates term variants
- Term variants are added to the query automatically (query expansion), or
- The user is provided with term variants and decides which ones to include

Example
- A user searching for "system users" is provided in the CATALOG system with term variants for "users" and "system"

Example (cont.)
Search term: users

  Term        Occurrences
  1. user     15
  2. users    1
  3. used     3
  4. using    2

- The user selects the variants to include in the query

Stemmer correctness
- A stemmer can be incorrect by either under-stemming or over-stemming
- Over-stemming can reduce precision
- Under-stemming can reduce recall

Over-stemming
- Terms with different meanings are conflated
- "considerate", "consider", and "consideration" should not be stemmed to "con", together with "contra", "contact", etc.

Under-stemming
- Prevents related terms from being conflated
- Under-stemming "consideration" to "considerat" prevents conflating it with "consider"

Evaluating stemmers
- In information retrieval, stemmers are evaluated by their effect on retrieval and their compression rate, not by linguistic correctness
- Studies have shown that stemming has a positive effect on retrieval
- Performance of the algorithms is comparable
- Results vary between test collections

Affix removal stemmers
- Remove suffixes and/or prefixes from terms, leaving a stem
- In English, stemmers are suffix removers
- In other languages, for example Hebrew, both prefixes and suffixes are removed
- Most affix removal stemmers in use are:
  - iterative: for example, "consideration" is stemmed first to "considerat", then to "consider"
  - longest match stemmers using a set of stemming rules

A simple stemmer
- Harman experimented with a simple stemmer and concluded that minimal stemming is helpful
- Her simple stemmer changes:
  - Plural to singular
  - Third person to first person
- The algorithm changes:
  - "skies" to "sky" (ies -> y)
  - "retrieves" to "retrieve" (es -> e)
  - "doors" to "door" (s -> NULL), but leaves "corpus" and "wellness" unchanged
- Rules:
  1. If a word ends in "ies" but not "eies" or "aies", change the ending to "y"
  2. If a word ends in "es" but not "aes", "ees", or "oes", change the ending to "e"
  3. If a word ends in "s" but not "us" or "ss", remove the "s"

The Paice/Husk stemmer
- Uses a table of rules grouped into sections
- One section for each last letter of a suffix (rules for forms ending in a, then b, etc.)
- A form is any word or part of a word considered for stemming
- Each rule specifies a deletion or a replacement of an ending
- The order of the rules in each section is important
- Rules are tried until one can be applied, and the current form is updated

Rule structure
Each rule contains 5 parts (2 are optional):
  1. An ending (one or more characters, in reverse order)
  2. An optional "intact" flag "*", denoting a form not yet stemmed
  3. A digit (>= 0) specifying the number of characters to remove
  4. An optional string to append (after removal)
  5. A continuation symbol: ">" denotes that stemming should continue, "." terminates the stemming process

Examples of rules
- "sei3y>": if the form ends in "ies", replace the last 3 letters by "y" and continue stemming ("-ries" becomes "-ry")
- "mu*2.": if the form ends in "um" and the word is intact, remove the last 2 letters and terminate stemming. "maximum" is stemmed to "maxim", but "presum" (from "presumably") remains unchanged
- "ylp0.": if the word ends in "ply", terminate. The next rule, "yl2>", therefore does not remove "ly" from "multiply"
- "nois4j>": causes "sion" to be replaced by "j", where "j" acts as a dummy ending. "provision" is converted to "provij" and then to "provid"

Acceptability conditions
- A rule is not applied unless the conditions are satisfied
- An attempt to prevent over-stemming
- Without them, "rent", "rant", "rice", "rate", "ration", and "river" all reduce to "r"
- There are 2 rules:
  1. If a form starts with a vowel, at least 2 letters must remain (owed/owing -> ow, but not ear -> e)
  2. If a form starts with a consonant, at least 3 letters must remain, and at least one must be a vowel or "y" (saying -> say, crying -> cry, but not string -> str, meant -> me, or cement -> ce)
- These rules cause errors in the stemming of some short-rooted words (doing, dying, being); these could be dealt with separately with a table lookup

Example with Paice stemming: "separately"
- Use the "y" section: mismatch with "ylb1>", "yli3y>", "ylp0."; match with "yl2>". The form becomes "separate"
- Use rule "e1>" in the "e" section: the form changes to "separat"
- Use the "t" section: mismatch with "tacilp4y."; match with "ta2>". The form changes to "separ"
- Use the "r" section: match with "ra2."
- So the stem is "sep"

Other examples

  Word          Rule        Result
  preparation   nois4j>     fails
                noix4ct.    fails
                noi3>       preparat
                ta2>        prepar
                ra2.        prep
  prepare       e1>         prepar
                ra2.        prep
  prepared      de2>        prepar
                ra2.        prep

n-grams
- Fixed-length consecutive series of "n" characters
- Bigrams: Sea colony -> (se ea co ol lo on ny)
- Trigrams: Sea colony -> (sea col olo lon ony), or -> (#se sea ea# #co col olo lon ony ny#)

Usage of n-grams
- Used in World War II by cryptographers
- Spell checking
- Text compression
- Signature files
- Stemming

n-gram "stemmers"
- Adamson and Boreham (1974)
- Method for grouping term variants
- Language independent
- Each term is transformed to n-grams
- A similarity value is generated between any pair of terms in the database, resulting in a similarity matrix
- A clustering method (single link) groups highly similar terms into clusters
- Most matrix elements had value 0
- Used a cutoff value of 0.6 for their clustering algorithm

Dice coefficient
- There are many formulas for computing set similarity
- Dice coefficient: $S = \frac{2\,|A \cap B|}{|A| + |B|}$, with $0 \le S \le 1$
- $S = 1$ if $A = B$; $S = 0$ if $A \cap B = \emptyset$

Sets of unique bigrams
- Let A and B denote the sets of unique bigrams associated with two terms, and let $C = A \cap B$
- statistics -> (st ta at ti is st ti ic cs)
- Set of unique bigrams for "statistics": A = {at cs ic is st ta ti}, |A| = 7
- statistical -> (st ta at ti is st ti ic ca al)
- Set of unique bigrams for "statistical": B = {al at ca ic is st ta ti}, |B| = 8
- C = {at ic is st ta ti}, |C| = 6
- $S = \frac{2\,|C|}{|A| + |B|} = \frac{2 \times 6}{7 + 8} = 0.8$

Table lookup method
- Ideally, a table is constructed with a stem for every word
- Stemming: look up the word, find the stem
- There is no such data for English
- Systems use a combination of dictionary lookup and conflation rules

Dictionary lookup method
- INQUERY uses Kstem
- Kstem is a morphological analyzer that conflates word variants to a root form
- Tries to avoid collapsing words with different meanings to the same root
- The original word or a stemmed version is looked up in a dictionary and replaced by the best stem

Successor variety stemmer
- Based on work in structural linguistics (Hafer and Weiss)
- Performed less well than affix removing stemmers
- Given a set of words, the successor variety (SV) of a string is the number of different characters that follow it in words in the set

Successor variety stemmers
Terms: {able, axle, accident, ape, about, apply, application, applies}
- The SV of "ap" is 2: "ap" is followed by "e" in "ape" and by "p" in "apply", "application", and "applies"
- The SV of "a" is 4: "a" is followed in the set by "b", "x", "c", and "p"

SVs for "apply" and "applies"
(* denotes a break point at a peak)

  Prefix     SV   Letters
  a          4    b, x, c, p
  ap         2    e, p
  app        1    l
  appl *     2    y, i
  apply      1    blank

  Prefix     SV   Letters
  a          4    b, x, c, p
  ap         2    e, p
  app        1    l
  appl *     2    y, i
  appli      2    e, c
  applie     1    s
  applies    1    blank

SV for "application"

  Prefix        SV   Letters
  a             4    b, x, c, p
  ap            2    e, p
  app           1    l
  appl *        2    y, i
  appli         2    c, e
  applic        1    a
  applica       1    t
  applicat      1    i
  applicati     1    o
  applicatio    1    n
  application   1    blank

Segmenting words
4 ways:
- A cut-off SV is reached
- SV "peaks"
- A substring of a word is equal to another word in the set ("readable" breaks into "read" and "able")
- Entropy based method

Selecting a stem
- The first segment is selected if it occurs in at most 12 words
- Otherwise the second segment is selected (3 segments are unlikely)

Summary
- All automatic stemmers are sometimes incorrect
- The n-gram method can be used for different languages
- In general, affix removing stemmers are more "correct"
- Longest match stemming does not always generate satisfactory word stems
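
Worked sketch: Dice coefficient on bigrams
The bigram similarity computed by hand for "statistics" and "statistical" in the n-gram slides can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic, not Adamson and Boreham's implementation: the function names unique_bigrams and dice are hypothetical, terms are assumed to be lowercase with no padding characters, and the clustering step is not shown.

```python
def unique_bigrams(term: str) -> set:
    """Return the set of unique adjacent character pairs in a term.

    Assumption: no '#' padding and no case folding, matching the
    slide example rather than any particular library.
    """
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(a: set, b: set) -> float:
    """Dice coefficient S = 2|A & B| / (|A| + |B|).

    Returning 0.0 for two empty sets is a choice made here to avoid
    division by zero; the slides do not cover that case.
    """
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


A = unique_bigrams("statistics")    # 7 unique bigrams
B = unique_bigrams("statistical")   # 8 unique bigrams
S = dice(A, B)

print(len(A), len(B), len(A & B), round(S, 2))  # prints: 7 8 6 0.8
```

With the cutoff of 0.6 mentioned in the slides, this pair would count as similar enough to be considered for the same cluster.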