
CS336 Lecture 4:
Properties of Text
Stop list
• Typically most frequently occurring words
– a, about, at, and, etc, it, is, the, or, …
• Among the top 200 are words such as “time”,
“war”, “home”, etc.
– May be collection specific
• “computer, machine, program, source, language” in a
computer science collection
• Removal can be problematic
(e.g. “Mr. The”, “and-or gates”)
2
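To make the mechanics concrete, a minimal Python sketch of stop-word removal at indexing time (the tiny stop list here is illustrative only, not any real system’s list):

STOP_WORDS = {"a", "about", "and", "at", "is", "it", "of", "or", "the"}

def remove_stop_words(tokens):
    # Drop any token that appears in the stop list (case-insensitive)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the cat sat at the door".split()))
# -> ['cat', 'sat', 'door']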
Stop lists
• Commercial systems use only a few stop words
• ORBIT uses only 8, including “and, an, by, from,
of, the, with”
– covers patents, scientific and technical (sci-tech)
information, trademarks, and Internet domain
names
3
Special Cases?
• Name Recognition
– People’s names - “Bill Clinton”
– Company names - IBM & big blue
– Places
• New York City, NYC, the big apple
4
Text
• Goal:
– Identify what can be inferred about text based on
• structural features
• statistical features of language
• Statistical Language Characteristics
– convert text to a form more easily manipulated by
computer
– reduce storage space and processing time
– store and process in encrypted form
• text compression
5
Zipf’s Law
The probability of occurrence of words or other items starts high
and tapers off. Thus, a few occur very often while many others
occur rarely.
• p_r = (freq of word of rank r) / N
– probability that a randomly chosen word
occurrence is the word of rank r
– N = total word occurrences
– for D distinct words, Σ p_r = 1 (r = 1 … D)
– r · p_r = A
• A ≈ 0.1
– i.e., a word’s frequency is inversely
proportional to its rank
6
Employing Zipf’s Law
• Identify significant words and ineffectual words
– A few words occur very often
• 2 most frequent words can account for 10% of occurrences
• top 6 words are 20%
• top 50 words are 50%
– Many words are infrequent
7
Most frequent words
r    Word    f(r)      r·f(r)/N
1    the     69,971    0.070
2    of      36,411    0.073
3    and     28,852    0.086
4    to      26,149    0.104
5    a       23,237    0.116
6    in      21,341    0.128
7    that    10,595    0.074
8    is      10,049    0.081
9    was      9,816    0.088
10   he       9,543    0.095
(N ≈ 1,000,000)
8
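A quick sanity check (in Python) of the coverage claims on the previous slide, using the counts in this table and taking N as 1,000,000:

freqs = [69971, 36411, 28852, 26149, 23237, 21341, 10595, 10049, 9816, 9543]
N = 1_000_000
for r, f in enumerate(freqs, start=1):
    print(r, f, round(r * f / N, 3))   # the r*f(r)/N column stays near 0.1
print(sum(freqs[:2]) / N)              # ~0.106: top 2 words ≈ 10% of occurrences
print(sum(freqs[:6]) / N)              # ~0.206: top 6 words ≈ 20% of occurrences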
Employing Zipf’s Law
• Estimate technical needs
– Estimating storage space saved by excluding stop words
from index
• 10 most frequently occurring words in English make up about
25%-30% of text
• Deleting very low frequency words from index yields a large
saving
• Estimate the number of words n(1) that occur once,
n(2) that occur twice, etc.
– Words that occur at most twice comprise about 2/3 of the
distinct words in a text
• Estimating the size of a term’s inverted index
list
• Zipf is quite accurate except at very high and
very low rank
9
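One way to make the “2/3” estimate concrete: a standard corollary of Zipf’s law (not stated on the slide) is that the fraction of distinct words occurring exactly k times is roughly 1/(k(k+1)); a quick check in Python:

# Approximate fraction of distinct words occurring exactly k times: 1 / (k * (k + 1))
for k in (1, 2, 3):
    print(k, round(1 / (k * (k + 1)), 3))         # 0.5, 0.167, 0.083
print(sum(1 / (k * (k + 1)) for k in (1, 2)))     # ~0.667: words occurring at most twice ≈ 2/3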
Modeling Natural Language
• Length of the words
– defines total space needed for vocabulary
• each character requires 1 byte
• Heaps’ Law: word length increases logarithmically
with text size.
10
Vocabulary Growth
• New words occur less frequently
as collection grows
• Empirically t = k·N^b, where
– t is the number of unique words
– N is the total number of word occurrences
– k and b are constants: k ≈ 10–20, b ≈ 0.5–0.6
• As the total text size grows, the
predictions of Heaps’ Law
become more accurate
Sublinear growth rate
11
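A rough illustration of Heaps’ Law in Python, with k and b picked from the ranges quoted above (the exact values are illustrative):

def heaps_vocab(N, k=15, b=0.55):
    # Estimated number of unique words in a text of N word occurrences
    return int(k * N ** b)

for N in (10_000, 100_000, 1_000_000, 10_000_000):
    print(N, heaps_vocab(N))   # sublinear: 10x more text yields far less than 10x more unique words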
Information Theory
• Shannon studied theoretical limits for data compression and
transmission rate
– “…problem of communication is that of reproducing at one point
either exactly or approximately a message selected at another
point."
• Compression limits given by Entropy (H)
• Transmission limits given by Channel Capacity (C)
• Many language tasks have been formulated as a “noisy
channel” problem
– determine most likely input given noisy output
• OCR
• Speech recognition
• Machine translation
• etc.
12
Shannon Game
• How should we complete the following?
– The president of the United States is George W. …
– The winner of the $10K prize is …
– Mary had a little …
– The horse raced past the barn …
• Period?
• etc
13
Information Theory
• Information content of a message is dependent
on both
– the receiver’s prior knowledge
– the message itself
• How much of the receiver’s uncertainty (entropy)
is reduced
• How predictable is the message
14
Information Theory
• Think of information content, H, as a
measurement of our ability to guess rest of
message, given only a portion of the message
– if predict with probability 1, information content is zero
– if predict with probability 0, infinite information content
– H(p) = -log p
• Logs in base 2, unit of information content (entropy) is 1 bit
• If a message is a priori predictable with probability p = 0.5
• Information content = −log₂(1/2) = 1 bit
15
Information Theory
• Given n messages, the average or expected
information content to be gained from receiving one
of the messages is:
H = −Σ_{i=1..n} p_i log₂ p_i
where n: # of symbols in an alphabet,
p_i: probability of a symbol’s appearance
(freq_i / all occurrences)
– Amount of information in a message is related to the
distribution of symbols in the message.
16
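The formula translates directly into Python (probabilities are assumed to sum to 1):

import math

def entropy(probs):
    # H = -sum(p_i * log2(p_i)); zero-probability symbols contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit
print(entropy([1.0]))          # 0.0 bits: a fully predictable message carries no information
print(entropy([1/26] * 26))    # ~4.70 bits: uniform distribution over 26 letters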
Entropy
• Average entropy is a maximum when messages are equally
probable
– e.g. average entropy associated with characters
assuming equal probabilities
• For the English alphabet, H = −log₂(1/26) ≈ 4.7 bits
• With actual probabilities, H = 4.14 bits
• With bigram probabilities, H reduces to 3.56 bits
• People predict next letter with ~ 40% accuracy, H = 1.3 bits
• Better models reduce the relative entropy
• In text compression, entropy (H) specifies the limit on
how much the text can be compressed
– the more regular (i.e. less uncertain) a data sequence is, the more it
can be compressed
17
Information Theory
• Let t = number of unique words in a vocabulary
– For t = 10,000: H = 9.5 bits
– For t = 50,000: H = 10.9 bits
– For t = 100,000: H = 11.4 bits
• Information theory has been used for
– Compression
– Term weighting
– Evaluation measures
18
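These H values are roughly what one gets from a Zipf-distributed vocabulary; a small check (my own, not from the slides):

import math

def zipf_entropy(t):
    # Entropy of a t-word vocabulary whose word probabilities follow Zipf's law
    harmonic = sum(1 / r for r in range(1, t + 1))
    return -sum((1 / (r * harmonic)) * math.log2(1 / (r * harmonic))
                for r in range(1, t + 1))

for t in (10_000, 50_000, 100_000):
    print(t, round(zipf_entropy(t), 1))   # ~9.5, ~10.9, ~11.5 bits, close to the figures above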
Text
• Modeling Natural Language
– Length of the words
• defines total space needed for vocabulary
– each character requires 1 byte
– Heaps’ Law: word length increases logarithmically
with text size.
19
Stemming
• Commonly used to conflate morphological
variants
– combine non-identical words referring to the same concept
• compute, computation, computer, …
• Stemming is used to:
– Enhance query formulation (and improve recall)
by providing term variants
– Reduce size of index files
by combining term variants into single index term
Stemmer correctness
• Two ways to be incorrect
– Under-stemming
• Prevents related terms from being conflated
• “consideration” to “considerat” prevents conflating it with
“consider”
• Under-stemming affects recall
– Over-stemming
• Terms with different meanings are conflated
• “considerate”, “consider” and “consideration”
should not be stemmed to “con” and thereby
conflated with “contra”, “contact”, etc.
• Over-stemming can reduce precision
21
The Concept of Relevance
• Relevant => does the document fulfill the query?
• Relevance of a document D to a query Q is subjective
– Different users will have different judgments
– Same users may judge differently at different times
– Degree of relevance of different documents will vary
• In IR system evaluation it is assumed:
– A subset of the documents in the database (DB) is relevant
– A document is either relevant or not relevant
22
Recall and precision
• Most common measures for evaluating IR systems
• Recall: % of relevant documents retrieved.
– Measures ability to get ALL of the good documents.
• Precision: % of retrieved documents that are in fact
relevant.
– Measures amount of junk that is included in the results.
• Ideal Retrieval Results
– 100% recall (All good documents are retrieved )
– 100% precision (No bad document is retrieved)
23
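In code, treating the retrieved set and the relevance judgments as plain sets of document ids (a sketch, not a full evaluation harness):

def recall(retrieved, relevant):
    # Fraction of the relevant documents that were retrieved
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    # Fraction of the retrieved documents that are relevant
    return len(retrieved & relevant) / len(retrieved)

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}
print(recall(retrieved, relevant), precision(retrieved, relevant))   # 0.4 0.5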
Evaluating stemmers
• In information retrieval stemmers are
evaluated by their:
– effect on retrieval
• improvements in recall or precision
– compression rate
– Not linguistic correctness
24
Stemmers
• 4 basic types
– Affix removing stemmers
– Dictionary lookup stemmers
– n-gram stemmers
– Corpus analysis
• Studies have shown that stemming has a positive effect
on retrieval.
• Performance of different algorithms is comparable
• Results vary between test collections
Affix removal stemmers
• Remove suffixes and/or prefixes leaving a stem
– In English remove suffixes
• What might you remove if you were designing a stemmer?
– In other languages, e.g. Hebrew, remove both prefix and suffix
• Keshehalachnu --> halach
• Nelechna --> halach
– some languages are more difficult, e.g. Arabic
– iterative: consideration => considerat => consider
– longest match: use a set of stemming rules arranged on a ‘longest
match’ principle (Lovins)
26
A simple stemmer (Harman)
if word ends in “ies” but not “eies” or “aies”
then “ies” -> “y”
else if word ends in “es” but not “aes”, “ees” or “oes”
then “es” -> “e”
else if word ends in “s” but not “us” or “ss”
then “s” -> NULL
endif
• Algorithm changes:
– “skies” to “sky”
– “retrieves” to “retrieve”
– “doors” to “door”
– but not “corpus” or “wellness”
– “dies” to “dy”?
27
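A minimal Python version of the three rules above (the function name s_stem is my own):

def s_stem(word):
    # Rule 1: "-ies" -> "-y", unless the word ends in "-eies" or "-aies"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule 2: "-es" -> "-e", unless the word ends in "-aes", "-ees" or "-oes"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    # Rule 3: drop a final "-s", unless the word ends in "-us" or "-ss"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

for w in ("skies", "retrieves", "doors", "corpus", "wellness", "dies"):
    print(w, "->", s_stem(w))   # sky, retrieve, door, corpus, wellness, dy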
Stemming w/ Dictionaries
• Avoid collapsing words with different meaning to
same root
• Word is looked up and replaced by the best stem
• Typical stemmers consist of rules and/or
dictionaries
– simplest stemmer is “suffix s”
– Porter stemmer is a collection of rules
– KSTEM uses lists of words plus rules for inflectional and
derivational morphology
Stemming Examples
• Original text:
Document will describe marketing strategies carried out by U.S.
companies for their agricultural chemicals, report predictions for market
share of such chemicals, or report market statistics for agrochemicals,
pesticide, herbicide, fungicide, insecticide, fertilizer, predicted sales,
market share, stimulate demand, price cut, volume of sales
• Porter Stemmer:
market strateg carr compan agricultur chemic report predict
market share chemic report market statist agrochem pesticid herbicid
fungicid insecticid fertil predict sale stimul demand price cut volum sale
• KSTEM:
marketing strategy carry company agriculture chemical report
prediction market share chemical report market statistic agrochemic
pesticide herbicide fungicide insecticide fertilizer predict sale stimulate
demand price cut volume sale
29
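For comparison, the Porter column above can be approximated with NLTK’s Porter implementation, assuming the nltk package is installed (exact stems can differ slightly across Porter variants):

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
words = "marketing strategies carried out by companies for their agricultural chemicals".split()
print(" ".join(porter.stem(w) for w in words))
# e.g. "marketing" -> "market", "chemicals" -> "chemic", much like the Porter output above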
n-grams
• Fixed length consecutive series of “n” characters
– Bigrams:
• Sea colony -> (se ea co ol lo on ny)
– Trigrams
• Sea colony -> (sea col olo lon ony), or
-> (#se sea ea# #co col olo lon ony ny#)
• Conflate words based on overlapping series of
characters
30
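A small Python helper that produces the two variants shown above (the '#' padding marks word boundaries):

def char_ngrams(word, n, pad=False):
    # Return the overlapping character n-grams of a word, optionally padded with '#'
    if pad:
        word = "#" + word + "#"
    return [word[i:i + n] for i in range(len(word) - n + 1)]

for w in "sea colony".split():
    print(char_ngrams(w, 3), char_ngrams(w, 3, pad=True))
# ['sea'] ['#se', 'sea', 'ea#']
# ['col', 'olo', 'lon', 'ony'] ['#co', 'col', 'olo', 'lon', 'ony', 'ny#']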
Problems with Stemming
• Lack of domain-specificity and context can lead to
occasional serious retrieval failures
• Stemmers are often difficult to understand and modify
• Sometimes too aggressive in conflation (over-stem)
– e.g. “execute”/“executive”, “university”/“universe”,
“policy”/“police”, “organization”/“organ” conflated by Porter
• Miss good conflations (under-stem)
– e.g. “European”/“Europe”, “matrices”/“matrix”,
“machine”/“machinery” are not conflated by Porter
• Stems that are not words are often difficult to interpret
– e.g. with Porter, “iteration” produces “iter” and “general”
produces “gener”
31
Corpus-Based Stemming
• Corpus analysis can improve/replace a stemmer
• Hypothesis: Word variants that should be conflated
will co-occur in context
• Modify stem classes generated by a stemmer or
other “aggressive” techniques such as initial n-grams
– more aggressive classes mean fewer missed conflations
• Prune class by removing words that don’t co-occur
sufficiently often
• Language independent
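A rough sketch of the pruning step in Python, assuming we already have a stem class and per-document vocabularies; the Dice coefficient used here is just one plausible co-occurrence measure, not necessarily the one used in the original work:

from itertools import combinations

def dice(w1, w2, doc_vocabs):
    # doc_vocabs: list of sets, each the set of words in one document
    n1 = sum(w1 in d for d in doc_vocabs)
    n2 = sum(w2 in d for d in doc_vocabs)
    n12 = sum(w1 in d and w2 in d for d in doc_vocabs)
    return 2 * n12 / (n1 + n2) if (n1 + n2) else 0.0

def prune_class(stem_class, doc_vocabs, threshold=0.1):
    # Keep only words that co-occur often enough with another member of the class;
    # the rest would fall back into singleton classes
    keep = set()
    for w1, w2 in combinations(stem_class, 2):
        if dice(w1, w2, doc_vocabs) >= threshold:
            keep.update((w1, w2))
    return keep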
Equivalence Class Examples
abandon abandoned abandoning abandonment abandonments abandons
abate abated abatement abatements abates abating
abrasion abrasions abrasive abrasively abrasiveness abrasives
absorb absorbable absorbables absorbed absorbencies absorbency absorbent
absorbents absorber absorbers absorbing absorbs
abusable abuse abused abuser abusers abuses abusing abusive abusively
access accessed accessibility accessible accessing accession
Some Porter Classes for a WSJ Database
abandonment abandonments
abated abatements abatement
abrasive abrasives
absorbable absorbables
absorbencies absorbency absorbent
absorber absorbers
abuse abusing abuses abusive abusers abuser abused
accessibility accessible
Classes refined through corpus analysis
33
Corpus-Based Stemming Results
• Both Porter and KSTEM stemmers are
improved slightly by this technique
• n-gram stemmer gives the same performance
as “linguistic” stemmers for
– English
– Spanish
– Not shown to be the case for Arabic
Stemmer Summary
• All automatic stemmers are sometimes incorrect
– over-stemming
– under-stemming
• In general, stemming improves retrieval effectiveness
• May use varying levels of language specific
information
– morphological stemmers use dictionaries
– affix removal stemmers use information about
prefixes, suffixes, etc.
• n-gram and corpus analysis methods can be
used for different languages
35
Generating Document Representations
• Use significant terms to build representations of
documents
– referred to as indexing
• Manual indexing: professional indexers
– Assign terms from a controlled vocabulary
– Typically phrases
• Automatic indexing: machine selects
– Terms can be single words, phrases, or other features
from the text of documents
36
Index Languages
• Language used to describe docs and queries
• Exhaustivity - # of different topics indexed,
completeness or breadth
– increased exhaustivity => higher recall / lower precision
• retrieved output size increases because documents are
indexed by any remotely connected content information
• Specificity - accuracy of indexing, detail
– increased specificity => higher precision/lower recall
• When a doc is represented by fewer terms, content may be lost.
A query that refers to the lost content will fail to retrieve
the document
37
Index Languages
• Pre-coordinate indexing – combinations of
terms (e.g. phrases) used as an indexing label
• Post-coordinate indexing - combinations
generated at search time
• Faceted classification - group terms into facets
that describe basic structure of a domain, less
rigid than predefined hierarchy
• Enumerative classification - an alphabetic
listing, underlying order less clear
– e.g. Library of Congress class for “socialism, communism and
anarchism” at end of schedule for social sciences, after social
pathology and criminology
38