irf97c4

This Class
- How stemming is used in IR
- Stemming algorithms
- Frakes: Chapter 8
- Kowalski: pages 67-76
Stemming algorithms
- Affix removing stemmers
- Dictionary lookup stemmers
- n-gram stemmers
- Successor variety stemmers
Stemming
- Conflation: combining morphological term variants
- Done manually or automatically
- Automatic algorithms are called stemmers
Stemming algorithms
Conflation methods
  Manual
  Automatic
    Affix removal
      Longest match
      Simple removal
    Successor variety
    Dictionary lookup
    n-grams
Stemming is used for:
- Enhancing query formulation (and improving recall) by providing term variants
- Reducing the size of index files by combining term variants into a single index term
Stemming during indexing
- Index terms are stemmed words
- Saves dictionary space
- One inverted index list for all variants
- Saves inverted index file space when position information in the document is not included
- Query terms are also stemmed
Index is not stemmed
- In this case the index contains words
- No compression is achieved
- No information is lost
- Enables wild card searches
- Enables long phrase searches when position information is included
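To make the trade-off between a stemmed and an unstemmed index concrete, here is a minimal Python sketch that builds both kinds of inverted index over three made-up documents. The `toy_stem` function is a hypothetical stand-in written only for this illustration (it is not one of the stemmers discussed in these slides), and the documents are invented.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips one of a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: Dict[int, str],
                normalize: Callable[[str], str]) -> Dict[str, Set[int]]:
    """Map each normalized term to the set of documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return dict(index)


# Invented example documents.
docs = {1: "the user logs in", 2: "users of the system", 3: "indexing systems"}

stemmed = build_index(docs, toy_stem)        # index terms are stems
unstemmed = build_index(docs, lambda w: w)   # index terms are the words themselves

# With a stemmed index the query term is stemmed too, so "users" finds "user".
print(sorted(stemmed.get(toy_stem("users"), set())))
# With an unstemmed index only the exact word matches.
print(sorted(unstemmed.get("users", set())))
# The stemmed dictionary has fewer entries than the unstemmed one.
print(len(stemmed), len(unstemmed))
```

The stemmed index merges "user" and "users" into one posting list, which is where the dictionary saving comes from; the unstemmed index keeps every surface form, so nothing is lost but nothing is compressed.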
Providing term variants during search
- A stemming algorithm generates term variants
- Term variants are added to the query automatically (query expansion), or
- The user is provided with term variants and decides which ones to include
Example
- A user searching for "system users" is provided in the CATALOG system with term variants for "users" and "system"
Example (cont.)
Search term: users
Term        Occurrences
1. user     15
2. users    1
3. used     3
4. using    2
- User selects variants to include in query
Stemmer correctness
- A stemmer can be incorrect by either
  - under-stemming, or
  - over-stemming
- Over-stemming can reduce precision
- Under-stemming can reduce recall
Over-stemming
- Terms with different meanings are conflated
- "considerate", "consider" and "consideration" should not be stemmed to "con" together with "contra", "contact", etc.
Under-stemming
- Prevents related terms from being conflated
- Under-stemming "consideration" to "considerat" prevents conflating it with "consider"
Evaluating stemmers
- In information retrieval, stemmers are evaluated by their
  - effect on retrieval, and
  - compression rate,
  - not linguistic correctness
Evaluating stemmers
- Studies have shown that stemming has a positive effect on retrieval
- Performance of the algorithms is comparable
- Results vary between test collections
Affix removal stemmers
- Remove suffixes and/or prefixes from terms, leaving a stem
Affix removal stemmers
- In English, stemmers are suffix removers
- In other languages, for example Hebrew, both prefixes and suffixes are removed
Affix removal stemmers
- Most affix removal stemmers in use are:
  - iterative: for example, "consideration" is stemmed first to "considerat", then to "consider"
  - longest match stemmers using a set of stemming rules
A simple stemmer
- Harman experimented and concluded that minimal stemming is helpful
- Her simple stemmer changes:
  - plural to singular
  - third person to first person
A simple stemmer
Algorithm changes:
- "skies" to "sky": ies->y
- "retrieves" to "retrieve": es->e, and
- "doors" to "door": s->NULL (leaves "corpus" or "wellness")
 ies?to
y?
A simple stemmer
1. word ends in "ies" but not "eies" or "aies": change end to "y"
2. word ends in "es" but not "aes", "ees" or "oes": change to "e"
3. word ends in "s" but not "us" or "ss": remove "s"
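The three rules of the simple stemmer translate almost directly into code. The Python sketch below is one possible reading, not a definitive implementation: it assumes that rules are tried in order, that only the first applicable rule fires, and that a word blocked by an exception simply falls through to the next rule. The function name `s_stem` is made up for this illustration.

```python
def s_stem(word: str) -> str:
    """Apply the first matching rule of the three-rule plural stemmer."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for term in ("skies", "retrieves", "doors", "corpus", "wellness"):
    print(term, "->", s_stem(term))
```

The exceptions in rule 3 are what leave "corpus" and "wellness" untouched.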
The Paice/Husk stemmer
- Uses a table of rules grouped into sections
- One section for each last letter of a suffix (rules for forms ending in a, then b, etc.)
- A form is any word or part of a word considered for stemming
The Paice/Husk stemmer
- Each rule specifies a deletion or a replacement of an ending
- The order of the rules in each section is important
- Rules are tried until one can be applied, and the current form is updated
Rule structure
- Each rule contains 5 parts (2 are optional):
- An ending (one or more characters in reverse order)
- An optional intact flag "*" denoting a form not yet stemmed
Rule structure
- A digit (>=0) specifying the number of characters to remove
- An optional string to append (after removal)
- A terminator: a rule ending with ">" denotes that stemming should continue; "." terminates the stemming process
Examples of rules
"sei3y>"
- if the form ends in "ies" then replace the last 3 letters by "y" and continue stemming (a form ending in "ries" then ends in "ry")
Examples of rules
"mu*2."
- if the form ends with "um" and the word is intact, remove the last 2 letters and terminate stemming
- "maximum" is stemmed to "maxim", but "presum" from "presumably" remains unchanged
Examples of rules
lp0.?- if word terminates in
ly?terminate. Next rule l2>?does not
remove y?from ultiply
 ois4j>?causes
ion?to be replaced by
?
 ?acts as dummy ending
 rovision?converted to
rovij?and then
to rovid

Acceptability conditions
- A rule is not applied unless the conditions are satisfied
- An attempt to prevent over-stemming
- Without them, "rent", "rant", "rice", "rate", "ration" and "river" reduce to "r"
- There are 2 rules:
Acceptability conditions
- If a form starts with a vowel then at least 2 letters must remain (owed/owing->ow but not ear->e)
- If a form starts with a consonant then at least 3 letters must remain, and at least one must be a vowel or "y" (saying->say, crying->cry, but not string->str, meant->me, or cement->ce)
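The two acceptability conditions can be written as a small predicate on the form that would remain after a rule is applied. This is a minimal Python sketch; the name `is_acceptable` is invented here, and treating only a, e, i, o, u as vowels is an assumption.

```python
VOWELS = set("aeiou")  # assumption: "y" is handled separately, as on the slide


def is_acceptable(stem: str) -> bool:
    """Return True if `stem` may be kept as the result of applying a rule."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Vowel-initial forms need at least 2 letters.
        return len(stem) >= 2
    # Consonant-initial forms need at least 3 letters,
    # at least one of which is a vowel or "y".
    return len(stem) >= 3 and any(ch in VOWELS or ch == "y" for ch in stem)


for candidate in ("ow", "e", "say", "cry", "str", "me", "ce"):
    print(candidate, is_acceptable(candidate))
```

A stemmer would call this check on the candidate form before committing to a rule, and skip the rule when the check fails.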
Acceptability conditions
- These rules cause errors in the stemming of some short-rooted words (doing, dying, being)
- These could be dealt with separately with a table lookup
Example with Paice stemming
- "separately": use the "y" section
- mismatch with "ylb1>", "yli3y>", "ylp0."
- match with "yl2>"; the form becomes "separate"
- use rule "e1>" in the "e" section
- the form changes to "separat": use the "t" section
- mismatch with "tacilp4y.", match with "ta2>"; the form changes to "separ"
- use the "r" section, match with "ra2."; the result is "sep"
Other examples
Word          Rule        Result
preparation   nois4j>     fails
              noix4ct.    fails
              noi3>       preparat
              ta2>        prepar
              ra2.        prep
prepare       e1>         prepar
              ra2.        prep
prepared      de2>        prepar
              ra2.        prep
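The worked examples can be replayed with a small rule interpreter. The Python sketch below is illustrative only: it parses rule strings in the five-part format described on the rule-structure slides, uses just the handful of rules quoted in these slides rather than the full Paice/Husk table, and scans one flat ordered list instead of per-letter sections. The rule for "ion" is written `noi3>` so that three letters are removed. All function and class names are made up for this example.

```python
import re
from typing import List, NamedTuple

VOWELS = set("aeiou")


class Rule(NamedTuple):
    ending: str    # suffix to match, in normal reading order
    intact: bool   # True if the rule only fires on a not-yet-stemmed word
    remove: int    # number of trailing characters to delete
    append: str    # string appended after the deletion (may be empty)
    go_on: bool    # True for ">", False for "."


RULE_RE = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.])$")


def parse_rule(text: str) -> Rule:
    m = RULE_RE.match(text)
    if m is None:
        raise ValueError(f"malformed rule: {text!r}")
    ending, star, digit, append, end = m.groups()
    # The ending is stored reversed in the rule string.
    return Rule(ending[::-1], star == "*", int(digit), append, end == ">")


def acceptable(form: str) -> bool:
    if form[:1] in VOWELS:
        return len(form) >= 2
    return len(form) >= 3 and any(c in VOWELS or c == "y" for c in form)


def stem(word: str, rules: List[Rule]) -> str:
    form, intact = word.lower(), True
    while True:
        for r in rules:
            if not form.endswith(r.ending) or (r.intact and not intact):
                continue
            candidate = form[: len(form) - r.remove] + r.append
            if not acceptable(candidate):
                continue
            intact = intact and candidate == form
            form = candidate
            if not r.go_on:
                return form      # "." terminates stemming
            break                # ">" restarts the scan on the new form
        else:
            return form          # no rule applied


# Only the rules quoted on the slides, in the order they are tried there.
RULES = [parse_rule(t) for t in (
    "ylb1>", "yli3y>", "ylp0.", "yl2>", "e1>", "tacilp4y.", "ta2>", "ra2.",
    "nois4j>", "noix4ct.", "noi3>", "de2>", "sei3y>", "mu*2.")]

for w in ("separately", "preparation", "prepare", "prepared", "maximum"):
    print(w, "->", stem(w, RULES))
```

Because the rule list here is so short, words outside the slide examples will generally stop early; the point is only to show how the ending, intact flag, digit, append string and terminator interact.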
n-grams
- Fixed-length consecutive series of "n" characters
- Bigrams:
  - Sea colony -> (se ea co ol lo on ny)
- Trigrams:
  - Sea colony -> (sea col olo lon ony), or
  - -> (#se sea ea# #co col olo lon ony ny#)
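As a quick illustration of the bigram and trigram examples, here is a short Python sketch that generates n-grams word by word, optionally padding each word with "#" at both ends. The function name `ngrams` and the `pad` parameter are choices made for this example.

```python
from typing import List


def ngrams(text: str, n: int, pad: bool = False) -> List[str]:
    """Return the n-grams of each word in `text`, in order of appearance."""
    grams: List[str] = []
    for word in text.lower().split():
        if pad:
            word = "#" + word + "#"   # mark word boundaries
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(ngrams("Sea colony", 2))
print(ngrams("Sea colony", 3))
print(ngrams("Sea colony", 3, pad=True))
```

N-grams are taken within words only, so no gram spans the space between "sea" and "colony".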
Usage of n-grams
- Used in World War II by cryptographers
- Spell checking
- Text compression
- Signature files
- Stemming
n-gram stemmers
- Adamson and Boreham (1974)
- Method for grouping term variants
- Language independent
n-gram stemmers
- Each term is transformed to n-grams
- A similarity value is generated between any pair of terms in the database, resulting in a similarity matrix
n-gram stemmers
- A clustering method (single link) groups highly similar terms into clusters
- Most matrix elements had value 0
- They used a cutoff value of 0.6 for their clustering algorithm
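The grouping step can be sketched as follows. This Python example assumes the similarity between two terms is the Dice coefficient over their sets of unique bigrams, and it treats single-link clustering at a cutoff as finding connected components among term pairs whose similarity is at least the cutoff (whether the original method used ">=" or ">" at 0.6 is not stated, so ">=" is an assumption). The term list is invented for the illustration.

```python
from itertools import combinations
from typing import Dict, List, Set


def unique_bigrams(term: str) -> Set[str]:
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over the unique bigrams of two terms."""
    A, B = unique_bigrams(a), unique_bigrams(b)
    if not A and not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))


def single_link_clusters(terms: List[str], cutoff: float = 0.6) -> List[List[str]]:
    """Group terms linked by a chain of pairs with similarity >= cutoff."""
    parent: Dict[str, str] = {t: t for t in terms}

    def find(t: str) -> str:
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path compression
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice(a, b) >= cutoff:
            parent[find(a)] = find(b)       # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())


terms = ["statistics", "statistical", "statistician", "colony", "colonies"]
print(single_link_clusters(terms))
```

Each resulting cluster plays the role of a stem class: all terms in it are treated as variants of one another.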
Dice Coefficient
- Many formulas exist for computing set similarity
- Dice coefficient: S = 2|A ∩ B| / (|A| + |B|)
- 0 ≤ S ≤ 1
- S = 1 if A = B, S = 0 if A ∩ B = ∅
Sets of Unique Bigrams
- Let A and B denote the sets of unique bigrams associated with two terms, and let C = A ∩ B
- statistics -> (st ta at ti is st ti ic cs)
- Set of unique bigrams for statistics: A = {at cs ic is st ta ti}, |A| = 7
n-gram stemmers
- statistical -> (st ta at ti is st ti ic ca al)
- Set of unique bigrams for statistical: B = {al at ca ic is st ta ti}, |B| = 8
- C = {at ic is st ta ti}, |C| = 6
- S = 2|C| / (|A| + |B|) = 2 x 6 / (7 + 8) = 0.8
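The worked example for "statistics" and "statistical" can be checked with a few lines of Python. This is a minimal sketch; the helper names are invented, and returning 0.0 when both bigram sets are empty is an assumption made to avoid division by zero.

```python
from typing import Set


def unique_bigrams(term: str) -> Set[str]:
    """Set of distinct adjacent character pairs in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient S = 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


A = unique_bigrams("statistics")
B = unique_bigrams("statistical")
C = A & B

print(sorted(A), len(A))
print(sorted(B), len(B))
print(sorted(C), len(C))
print(dice(A, B))
```

Repeated bigrams such as "st" and "ti" are counted once because the comparison is between sets, not sequences.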
Table lookup method
- Ideally, a table is constructed with a stem for every word
- Stemming: look up the word, find the stem
- There is no such data for English
- Systems use a combination of dictionary lookup and conflation rules
Dictionary lookup method
- INQUERY uses Kstem
- Kstem is a morphological analyzer that conflates word variants to a root form
Dictionary lookup method
- Tries to avoid collapsing words with different meanings to the same root
- The original word or a stemmed version is looked up in a dictionary and replaced by the best stem
Successor variety stemmer
- Based on work in structural linguistics (Hafer and Weiss)
- Performed less well than affix removing stemmers
- Given a set of words, the successor variety (SV) of a string is the number of different characters that follow it in words in the set
Successor variety stemmers
- Terms: {able, axle, accident, ape, about, apply, application, applies}
- The SV of "ap" is 2: "ap" is followed by "e" in "ape" and by "p" in "apply", "application" and "applies"
- The SV of "a" is 4: "a" is followed in the set by "b", "x", "c" and "p"
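The successor variety of a prefix can be computed directly from its definition. In this Python sketch the end of a word is counted as one extra "blank" successor, which is an assumption made so that a complete word such as "apply" gets an SV of 1; the function name is invented.

```python
from typing import Iterable

TERMS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]


def successor_variety(prefix: str, words: Iterable[str]) -> int:
    """Number of distinct characters that follow `prefix` in `words`."""
    successors = set()
    for w in words:
        if w.startswith(prefix):
            # "" stands for the blank after a word that equals the prefix.
            successors.add(w[len(prefix)] if len(w) > len(prefix) else "")
    return len(successors)


for p in ("a", "ap", "app", "appl", "apply"):
    print(p, successor_variety(p, TERMS))
```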
SVs for "apply" and "applies"
Prefix      SV   Letters
a           4    b, x, c, p
ap          2    e, p
app         1    l
appl *      2    y, i
apply       1    blank

Prefix      SV   Letters
a           4    b, x, c, p
ap          2    e, p
app         1    l
appl *      2    y, i
appli       2    e, c
applie      1    s
applies     1    blank
* denotes a break point at peak
SV for "application"
Prefix        SV   Letters
a             4    b, x, c, p
ap            2    e, p
app           1    l
appl          2    y, i
appli *       3    c, y, e
applic        1    a
applica       1    t
applicat      1    i
applicati     1    o
applicatio    1    n
application   1    blank
Segmenting words
- 4 ways:
  - Cut-off SV is reached
  - SV peaks
  - A substring of a word is equal to another word in the set: "readable" breaks into "read" and "able"
  - Entropy based method
Selecting a stem
- The first segment is selected if it occurs in at most 12 words
- Otherwise the second segment is selected (3 segments are unlikely)
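Putting segmentation and stem selection together, the following Python sketch breaks a word at SV peaks and then picks a stem with the 12-word rule. Several details are assumptions made for this illustration: the end of a word counts as a "blank" successor, a break is placed after a prefix whose SV is greater than the previous prefix's SV and not smaller than the next one's (so a plateau also counts as a peak), and "occurs in" is read as "is a prefix of". Successor varieties are computed from the eight-term set listed on the slides.

```python
from typing import List, Sequence

TERMS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]


def successor_variety(prefix: str, words: Sequence[str]) -> int:
    successors = set()
    for w in words:
        if w.startswith(prefix):
            successors.add(w[len(prefix)] if len(w) > len(prefix) else "")
    return len(successors)


def segment(word: str, words: Sequence[str]) -> List[str]:
    """Split `word` after every prefix whose SV forms a peak."""
    svs = [successor_variety(word[:i], words) for i in range(1, len(word) + 1)]
    segments, start = [], 0
    # Only interior positions have both a predecessor and a successor.
    for i in range(1, len(word) - 1):
        if svs[i] > svs[i - 1] and svs[i] >= svs[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


def select_stem(word: str, words: Sequence[str], max_words: int = 12) -> str:
    """First segment if it is rare enough, otherwise the second segment."""
    segs = segment(word, words)
    if len(segs) == 1:
        return word
    count = sum(1 for w in words if w.startswith(segs[0]))
    return segs[0] if count <= max_words else segs[1]


for w in ("apply", "applies"):
    print(w, segment(w, TERMS), select_stem(w, TERMS))
```

The 12-word threshold reflects the idea that a first segment shared by many words is probably a prefix rather than a stem, in which case the second segment is the better choice.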
Summary
- All automatic stemmers are sometimes incorrect
- The n-gram method can be used for different languages
- In general, affix removing stemmers are more "correct"
- Longest match stemming does not always generate satisfactory word stems