Supervised Classification of Feature-based Instances
(Ido Dagan)
Simple Examples for Statistics-based
Classification
• Based on class-feature counts
• Contingency table:
        C    ~C
   f    a     b
  ~f    c     d
• We will see several examples of simple
models based on these statistics
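A minimal counting sketch (illustrative, not from the slides) of how the a/b/c/d cells can be collected, assuming each instance is a set of active features plus a class label:

```python
from collections import Counter

def contingency(instances, feature, cls):
    """Count the 2x2 table for one feature f and one class C.
    instances: iterable of (feature_set, class_label) pairs."""
    table = Counter()
    for feats, label in instances:
        has_f, is_c = feature in feats, label == cls
        if has_f and is_c:
            table["a"] += 1      # f  and  C
        elif has_f:
            table["b"] += 1      # f  and ~C
        elif is_c:
            table["c"] += 1      # ~f and  C
        else:
            table["d"] += 1      # ~f and ~C
    return table

# Toy usage with made-up instances:
data = [({"into", "send"}, "verb-attach"), ({"of", "percent"}, "noun-attach")]
print(contingency(data, "into", "verb-attach"))   # Counter({'a': 1, 'd': 1})
```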
Prepositional-Phrase Attachment
• Simplified version of Hindle & Rooth
(1993)
[MS 8.3]
• Setting: V NP-chunk PP
– Moscow sent soldiers into Afghanistan
– ABC breached an agreement with XYZ
• Motivation for the classification task:
– Attachment is often a problem for (full) parsers
– Augment shallow/chunk parsers
Relevant Probabilities
• P(prep|n) vs. P(prep|v)
– The probability of having the preposition prep attached
to an occurrence of the noun n (the verb v).
– Notice: a single feature for each class
• Example: P(into|send) vs. P(into|soldier)
• Decision measured by the likelihood ratio:
    λ(v, n, p) = log [ P(prep | v) / P(prep | n) ]
• Positive/negative λ → verb/noun attachment
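A minimal sketch of this decision rule (illustrative function and variable names, not Hindle & Rooth's code), using the MLE estimates defined on the next slide:

```python
import math

def pp_lambda(attach_freq_pv, freq_v, attach_freq_pn, freq_n):
    """log2 likelihood ratio; positive -> verb attachment, negative -> noun."""
    p_prep_given_v = attach_freq_pv / freq_v   # P(prep | v)
    p_prep_given_n = attach_freq_pn / freq_n   # P(prep | n)
    return math.log2(p_prep_given_v / p_prep_given_n)

# Numbers from the "Moscow sent soldiers into Afghanistan" example on a later slide:
lam = pp_lambda(86, 1742.5, 1, 1478)
print("verb" if lam > 0 else "noun", "attachment, lambda =", round(lam, 2))
```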
Estimating Probabilities
• Based on attachment counts from a training corpus
• Maximum likelihood estimates:
    P(prep | v) = attach_freq(prep, v) / freq(v)
    P(prep | n) = attach_freq(prep, n) / freq(n)
• How to count from an unlabeled ambiguous
corpus? (Circularity problem)
• Some cases are unambiguous:
– The road to London is long
– Moscow sent him to Afghanistan
Heuristic Bootstrapping and Ambiguous
Counting
1. Produce initial estimates (model) by counting all unambiguous cases
2. Apply the initial model to all ambiguous cases; count each case under the resulting attachment if |λ| is greater than a threshold
   • E.g. |λ| > 2, meaning one attachment is at least 4 times more likely than the other
3. Consider each remaining ambiguous case as a 0.5 count for each attachment
   • Likely n-p and v-p pairs would “pop up” in the ambiguous counts, while incorrect attachments are likely to accumulate low counts
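A rough sketch of this counting scheme (data structures and names are illustrative, not the original implementation); lam() stands for the likelihood ratio computed from the current counts:

```python
from collections import defaultdict

def bootstrap_counts(unambiguous, ambiguous, lam, threshold=2.0):
    """unambiguous: (prep, head, att) triples with att in {'v', 'n'};
    ambiguous: (prep, verb, noun) triples; lam(prep, verb, noun, counts) -> float."""
    counts = defaultdict(float)
    # Step 1: initial model from the unambiguous cases
    for prep, head, att in unambiguous:
        counts[(prep, head, att)] += 1.0
    # Step 2: count confident ambiguous cases under the chosen attachment
    remaining = []
    for prep, verb, noun in ambiguous:
        score = lam(prep, verb, noun, counts)
        if score > threshold:
            counts[(prep, verb, 'v')] += 1.0
        elif score < -threshold:
            counts[(prep, noun, 'n')] += 1.0
        else:
            remaining.append((prep, verb, noun))
    # Step 3: split the remaining ambiguous cases 0.5 / 0.5
    for prep, verb, noun in remaining:
        counts[(prep, verb, 'v')] += 0.5
        counts[(prep, noun, 'n')] += 0.5
    return counts
```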
Example Decision
• Moscow sent soldiers into Afghanistan
    P(into | send)    = attach_freq(into, send) / freq(send)       = 86 / 1742.5 ≈ 0.049
    P(into | soldier) = attach_freq(into, soldier) / freq(soldier) = 1 / 1478    ≈ 0.0007
    λ(send, soldier, into) = log2 (0.049 / 0.0007) ≈ log2 70
• Verb attachment is 70 times more likely
Hindle & Rooth Evaluation
• H&R results for a somewhat richer model:
– 80% correct if we always make a choice
– 91.7% precision for 55.2% recall, when
requiring |λ|>3 for classification.
• Notice that the probability ratio doesn’t
distinguish between decisions made based
on high vs. low frequencies.
Possible Extensions
• Consider a-priori structural preference for “low”
attachment (to noun)
• Consider lexical head of the PP:
– I saw the bird with the telescope
– I met the man with the telescope
• Such additional factors can be incorporated easily,
assuming their independence
• Addressing more complex types of attachments,
such as chains of several PPs
• Similar attachment ambiguities within noun
compounds: [N [N N]] vs. [[N N] N]
Classify by Best Single Feature: Decision List
• Training: for each feature, measure its “entailment score”
for each class, and register the class with the highest score
– Sort all features by decreasing score
• Classification: for a given example, identify the highest
entailment score among all “active” features, and select the
appropriate class
– Test all features for the class in decreasing score order, until first success → output the relevant class
– Default decision: the majority class
• For multiple classes per example: may apply a threshold on
the feature-class entailment score
• Suitable when relatively few strong features indicate class
(compare to manually written rules)
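A minimal decision-list sketch of the training and classification steps above (made-up feature names and scores; score computation follows on a later slide):

```python
def build_decision_list(scores):
    """scores: dict mapping (feature, class) -> entailment score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def classify(decision_list, active_features, default_class):
    for (feature, cls), score in decision_list:    # decreasing score order
        if feature in active_features:
            return cls                             # first active feature wins
    return default_class                           # fall back to the majority class

# Toy usage:
dl = build_decision_list({("w-1=soap", "sense1"): 4.2, ("w+1=window", "sense2"): 3.7})
print(classify(dl, {"w+1=window", "w-2=the"}, default_class="sense1"))   # sense2
```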
Example: Accent Restoration
• (David Yarowsky, 1994): for French and Spanish
• Classes: alternative accent restorations for words in
text without accent marking
• Example: côte (coast) vs. côté (side)
• A variant of the general word sense disambiguation
problem - “one sense per collocation” motivates
using decision lists
• Similar tasks:
– Capitalization restoration in ALL-CAPS text
– Homograph disambiguation in speech synthesis (wind as
noun and verb)
Accent Restoration - Features
• Word form collocation features:
– Single words in window: ±1, ±k (20-50)
– Word pairs at <-1,+1>, <-2,-1>, <+1,+2> (complex
features)
– Easy to implement
Accent Restoration - Features
• Local syntactic-based features (for Spanish)
– Use a morphological analyzer
– Lemmatized features - generalizing over inflections
– POS of adjacent words as features
– Some word classes (primarily time terms, to help with tense ambiguity for unaccented words in Spanish)
Accent Restoration – Decision Score
    score(f, c) = log [ P(c | f) / P(~c | f) ]
      c : class
      f : feature
• Probabilities estimated from training statistics,
taken from a corpus with accents
• Smoothing - add small constant to all counts
• Pruning:
– Remove redundancies for efficiency: remove specific
features that score lower than their generalization
(domingo - WEEKDAY, w1w2 – w1)
– Cross validation: remove features that cause more errors than correct classifications on held-out data
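A sketch of the score computation above with add-constant smoothing (the constant and the data structures are illustrative assumptions, not Yarowsky's exact setup):

```python
import math

def feature_class_scores(counts, classes, alpha=0.1):
    """counts: dict (feature, class) -> training count.
    Returns dict (feature, class) -> log P(c|f) / P(~c|f), smoothed by alpha."""
    features = {f for f, _ in counts}
    scores = {}
    for f in features:
        total = sum(counts.get((f, c), 0) for c in classes)
        for c in classes:
            n_c = counts.get((f, c), 0) + alpha                    # smoothed count of (f, c)
            n_rest = total - counts.get((f, c), 0) + alpha * (len(classes) - 1)
            scores[(f, c)] = math.log(n_c / n_rest)
    return scores

# Toy usage: côte vs. côté decided by the word to the left
counts = {("w-1=la", "côte"): 30, ("w-1=la", "côté"): 2, ("w-1=du", "côté"): 40}
print(feature_class_scores(counts, ["côte", "côté"]))
```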
“Add-1/Add-Constant” Smoothing
    p_MLE(x) = c(x) / N
      c(x) - the count for event x (e.g. word occurrence)
      N    - the total count for all x ∈ X (e.g. corpus length)
      ⇒ p_MLE(x) = 0 for many low-probability events (sparseness)

Smoothing - discounting and redistribution:
    p_S(x) = ( c(x) + λ ) / ( N + λ|X| )
      λ = 1: Laplace, assuming uniform prior
      For natural language events: usually λ < 1
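A tiny sketch of the add-constant estimate (the parameter values in the example are made up):

```python
def smoothed_prob(count_x, total_n, vocab_size, lam=0.5):
    """p_S(x) = (c(x) + lam) / (N + lam * |X|)."""
    return (count_x + lam) / (total_n + lam * vocab_size)

# An unseen event still gets a small non-zero probability:
print(smoothed_prob(0, total_n=1000, vocab_size=10000))   # 0.5 / 6000
```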
Accent Restoration – Results
• Agreement with accented test corpus for
ambiguous words: 98%
– Vs. 93% for baseline of most frequent form
– Accented test corpus also includes errors
• Worked well for most of the highly ambiguous
cases (see random sample in next slide)
• Results slightly better than Naive Bayes (weighting multiple features)
– Consistent with related study on binary homograph
disambiguation, where combining multiple features
almost always agrees with using a single best feature
– Incorporating many low-confidence features may
introduce noise that would override the strong features
Accent Restoration – Tough Examples
Related Application: Anaphora Resolution
(Dagan, Justeson, Lappin, Leass, Ribak 1995)

The terrorist pulled the grenade from his pocket and threw it at the policeman
  - to which antecedent does “it” refer: the grenade or the pocket?

Traditional AI-style approach: manually encoded semantic preferences/constraints
  [Diagram: concept hierarchies Actions → Cause_movement → {throw, drop} and
   Weapon → Bombs → grenade, linked by an <object – verb> preference]
Statistical Approach
“Semantic” judgment via counts from a corpus (text collection):
  <verb–object: throw-grenade>   20 times
  <verb–object: throw-pocket>     1 time
• Statistics can be acquired from unambiguous (non-anaphoric)
occurrences in raw (English) corpus (cf. PP attachment)
• Semantic confidence combined with syntactic preferences ⇒ it = grenade
• “Language modeling” for disambiguation
Word Sense Disambiguation
for Machine Translation
I bought soap bars / I bought window bars
  - which sense of “bar”: sense1 (‘chafisa’) or sense2 (‘sorag’)?

Counts from a corpus (text collection):
  Sense1 (‘chafisa’):
    <noun-noun: soap-bar>        20 times
    <noun-noun: chocolate-bar>   15 times
  Sense2 (‘sorag’):
    <noun-noun: window-bar>      17 times
    <noun-noun: iron-bar>        22 times
• Features: co-occurrence within distinguished syntactic relations
• “Hidden” senses – manual labeling required(?)
Solution: Mapping to Target Language
English(-English)-Hebrew Dictionary:
  bar1 → ‘chafisa’
  bar2 → ‘sorag’
  soap → ‘sabon’
  window → ‘chalon’
Map ambiguous “relations” to the second language (all possibilities) and count them in a Hebrew corpus:
  <noun-noun: soap-bar>
    1: <noun-noun: ‘chafisat-sabon’>    20 times
    2: <noun-noun: ‘sorag-sabon’>        0 times
  <noun-noun: window-bar>
    1: <noun-noun: ‘chafisat-chalon’>    0 times
    2: <noun-noun: ‘sorag-chalon’>      15 times
• Exploiting the difference in ambiguities between the two languages
• Principle – intersecting redundancies (Dagan and Itai 1994)
The Selection Model
• Constructed to choose (classify) the right translation for a complete relation, rather than for each individual word separately
  – since both words in a relation might be ambiguous, with their translations dependent upon each other
• Assuming a multinomial model, under certain
linguistic assumptions
– The multinomial variable: a source relation
– Each alternative translation of the relation is a possible
outcome of the variable
An Example Sentence
• A Hebrew sentence with 3 ambiguous words:
• The alternative translations to English:
Example - Relational Representation
Selection Model
• We would like to use as a classification score the log
of the odds ratio between the most probable relation i
and all other alternatives (in particular, the second
most probable one j):
    ln( p_i / p_j )
• Estimation is based on smoothed counts
• A potential problem: the odds ratio for probabilities
doesn’t reflect the absolute counts from which the
probabilities were estimated.
– E.g., a count of 3 vs. (smoothed) 0
• Solution: using a one-sided confidence interval (lower bound) for the odds ratio
Confidence Interval
(for a proportion)
• Given an estimate, what is the confidence
that the estimate is “correct”, or at least
close enough to the true value?
    p  : the true parameter value (proportion)
    p̂  : the sampled proportion (considered as a variable)
    n  : the sample size
    E(p̂) = p
    σ_p̂ = sqrt( p(1 − p) / n )
Confidence Interval (cont.)
• Approximating by normal distribution: the
distribution of the sampled proportion (across
samples) approaches a normal distribution for
large n.
• z_α : the number of standard deviations such that the probability of obtaining p̂ − p > z_α · σ_p̂ is α
  Popular values: z_.05 = 1.645,  z_.025 = 1.96
Confidence Interval (cont.)
Estimation of a two-sided confidence interval with confidence 1 − α (using p̂ for estimating σ_p̂):
    p = p̂ ± z_{α/2} · sqrt( p̂(1 − p̂) / n )

Estimation of a one-sided confidence interval with confidence 1 − α (upper/lower bound):
    p ≤ p̂ + z_α · sqrt( p̂(1 − p̂) / n )   or   p ≥ p̂ − z_α · sqrt( p̂(1 − p̂) / n )
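A small sketch of these bounds (z values hard-coded for the two α levels quoted above; the sample numbers are made up):

```python
import math

Z = {0.05: 1.645, 0.025: 1.96}

def proportion_bounds(p_hat, n, alpha=0.05):
    """One-sided lower and upper bounds for a proportion, each with confidence 1 - alpha."""
    margin = Z[alpha] * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

print(proportion_bounds(0.8, n=50))   # roughly (0.71, 0.89)
```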
Selection Model (cont.)
• The distribution of the log of the odds ratio (across samples) converges to a normal distribution
• Selection “confidence” score for a single relation - the lower bound for the log odds ratio:

    ln( p_i / p_j )  ≥  Conf(i) = ln( n_i / n_j ) − Z_{1−α} · sqrt( 1/n_i + 1/n_j )

  (the bound holds with confidence 1 − α; n_i, n_j are the counts behind p_i, p_j)
• The most probable translation i for the relation is
selected if Conf(i), the lower bound for the log odds
ratio, exceeds θ.
• Notice roles of θ vs. α, and impact of n1,n2
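A sketch of this selection score (the z value and the toy counts are illustrative):

```python
import math

def conf(n_i, n_j, z_1_minus_alpha=1.645):
    """Lower bound of the log odds ratio ln(p_i / p_j), from counts n_i and n_j."""
    return math.log(n_i / n_j) - z_1_minus_alpha * math.sqrt(1.0 / n_i + 1.0 / n_j)

def select(n_i, n_j, theta=0.5):
    """Pick translation i only if the lower bound exceeds the threshold theta."""
    return conf(n_i, n_j) > theta

# Same 6:1 ratio, very different confidence once the counts are taken into account:
print(round(conf(3, 0.5), 2), round(conf(30, 5), 2))   # about -0.72 vs. 1.0
```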
Handling Multiple Relations in a
Sentence: Constraint Propagation
1. Compute Conf(i) for each ambiguous source relation.
2. Pick the source relation with the highest Conf(i).
3. If Conf(i) < θ, or if no source relations are left, then stop; otherwise, select word translations according to target relation i and remove the source relation from the list.
4. Propagate the translation constraints: remove any target relation that contradicts the selections made; remove source relations that now become unambiguous. Go to step 2.

• Notice similarity to the decision list algorithm
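A rough sketch of this loop (the candidate data structure, the consistency check, and the smoothed-zero value are illustrative assumptions; conf() is the lower-bound score from the previous slide):

```python
import math

def conf(n_i, n_j, z=1.645):
    return math.log(n_i / n_j) - z * math.sqrt(1.0 / n_i + 1.0 / n_j)

def best_two(candidates):
    counts = sorted((count for _, count in candidates), reverse=True)
    return counts[0], (counts[1] if len(counts) > 1 else 0.5)   # 0.5 = smoothed zero

def resolve(source_relations, consistent, theta=0.5):
    """source_relations: dict id -> list of (target_relation, count) candidates;
    consistent(target_a, target_b) -> bool checks word-level translation agreement."""
    remaining, selections = dict(source_relations), {}
    while remaining:
        scores = {rid: conf(*best_two(cands)) for rid, cands in remaining.items()}
        best = max(scores, key=scores.get)            # step 2: highest Conf(i)
        if scores[best] < theta:                      # step 3: not confident enough
            break
        target = max(remaining.pop(best), key=lambda tc: tc[1])[0]
        selections[best] = target
        for rid in list(remaining):                   # step 4: propagate constraints
            remaining[rid] = [tc for tc in remaining[rid] if consistent(tc[0], target)]
            if not remaining[rid]:
                del remaining[rid]                    # every candidate contradicted
            elif len(remaining[rid]) == 1:
                selections[rid] = remaining.pop(rid)[0][0]
        # back to step 2 (the while loop)
    return selections
```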
Selection Algorithm Example
Evaluation Results
• Results - Hebrew→English translation:
  – Coverage: ~70%
  – Precision within coverage: ~90%
– ~20% improvement over choosing most
frequent translation (95% statistical confidence
for an improvement relative to this common
baseline)
Analysis
• Correct selections capture:
– Clear semantic preferences: sign/seal treaty
– Lexical collocation usage: peace treaty/contract
• No selection:
– Mostly: no statistics for any alternative (data
sparseness)
• investigator/researcher of corruption
– Also: similar statistics for several alternatives
– Solutions:
• Consult more features in remote (vs. syntactic) context:
  prime minister … take position/job
• Class/similarity-based generalizations (corruption-crime)
Analysis (cont.)
• Confusing multiple sources (senses) for the same
target relation:
– ‘sikkuy’ (chance/prospect) ‘kattan’ (small/young)
Valid (frequent) target relations:
• small chance - correct
• young prospect – incorrect, due to:
– “Young prospect” is the translation of another Hebrew
expression – ‘tikva’ (hope) ‘zeira’ (young)
• The “soundness” assumption of the multinomial
model is violated:
– Assume counting the generated target relations corresponds
to sampling the source relation, hence assuming a known
1:n mapping (also completeness – another source of errors)
– Potential solutions: bilingual corpus, “reverse” translation
Sense Translation Model: Summary
• Classification instance: a relation with multiple words,
rather than a single word at a time, to capture immediate
(“circular”) dependencies.
• Make local decisions, based on a single feature
• Taking into account statistical confidence of decisions
• Constraint propagation for multiple dependent
classifications (remote dependencies)
• Decision-list-style rationale – classifying by a single piece of high-confidence evidence is simpler, and may work better, than considering all weaker evidence simultaneously
  – Computing statistical confidence for a combination of multiple events is difficult; it is easier to perform for one event at a time
• Statistical classification scenario (model) constructed for
the linguistic setting
– Important to identify explicitly the underlying model assumptions,
and to analyze the resulting errors
Word Sense Disambiguation
• Many words have multiple meanings
– E.g., river bank, financial bank
• Problem: Assign proper sense to each
ambiguous word in text
• Applications:
– Machine translation
– Information retrieval (mixed evidence)
– Semantic interpretation of text
Compare to POS Tagging?
• Idea: Treat sense disambiguation like POS
tagging, just with “semantic tags”
• The problems differ:
– POS tags depend on specific structural cues - mostly neighboring, and thus dependent, tags
– Senses depend on semantic context – less structured, longer distance dependency
⇒ many relatively independent/unstructured features
Approaches
• Supervised learning: learn from a pre-tagged corpus
• Dictionary-based learning: learn to distinguish senses from dictionary entries
• Unsupervised learning: automatically cluster word occurrences into different senses
Using an Aligned Bilingual Corpus
• Goal: get sense tagging cheaply
• Use correlations between phrases in two languages to
disambiguate
E.g., interest:
  ‘legal share’ (acquire an interest)  →  in German: Beteiligung erwerben
  ‘attention’   (show interest)        →  in German: Interesse zeigen
• For each occurrence of an ambiguous word, determine which
sense applies according to the aligned translation
• Limited to senses that are discriminated by the other language;
suitable for disambiguation in translation
• Gale, Church and Yarowsky (1992)
Evaluation
• Train and test on pre-tagged (or bilingual) texts
– Difficult to come by
• Artificial data – cheap to train and test: ‘merge’
two words to form an ‘ambiguous’ word with two
‘senses’
– E.g., replace all occurrences of door and of window with
doorwindow and see if the system figures out which is
which
– Useful to develop sense disambiguation methods
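A small sketch of the pseudoword idea (the word choices and function name are illustrative):

```python
def make_pseudoword_corpus(tokens, w1="door", w2="window"):
    """Merge w1 and w2 into one artificial ambiguous token; keep the original
    word as the gold 'sense' label for evaluation."""
    merged, gold = [], []
    for tok in tokens:
        if tok in (w1, w2):
            merged.append(w1 + w2)   # the artificial ambiguous word
            gold.append(tok)         # its gold sense
        else:
            merged.append(tok)
    return merged, gold

print(make_pseudoword_corpus("open the door near the window".split()))
```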
Performance Bounds
• How good is (say) 83.2%??
• Evaluate performance relative to lower and
upper bounds:
– Baseline performance: how well does the
simplest “reasonable” algorithm do? E.g.,
compare to selecting the most frequent sense
– Human performance: what percentage of the
time do people agree on classification?
• Nature of the senses used impacts accuracy
levels