Supervised Classification of Feature-based Instances

Simple Examples for Statistics-based Classification
• Based on class-feature counts
• Contingency table:
          C    ~C
     f    a     b
    ~f    c     d
• We will see several examples of simple models based on these statistics

Prepositional-Phrase Attachment
• Simplified version of Hindle & Rooth (1993) [MS 8.3]
• Setting: V NP-chunk PP
  – Moscow sent soldiers into Afghanistan
  – ABC breached an agreement with XYZ
• Motivation for the classification task:
  – Attachment is often a problem for (full) parsers
  – Augment shallow/chunk parsers

Relevant Probabilities
• P(prep|n) vs. P(prep|v)
  – The probability of having the preposition prep attached to an occurrence of the noun n (or the verb v)
  – Notice: a single feature for each class
• Example: P(into|send) vs. P(into|soldier)
• Decision measured by the likelihood ratio:
  λ(v, n, p) = log( P(prep|v) / P(prep|n) )
• Positive/negative λ → verb/noun attachment

Estimating Probabilities
• Based on attachment counts from a training corpus
• Maximum likelihood estimates:
  P(prep|v) = attach_freq(prep, v) / freq(v)
  P(prep|n) = attach_freq(prep, n) / freq(n)
• How to count from an unlabeled, ambiguous corpus? (Circularity problem)
• Some cases are unambiguous:
  – The road to London is long
  – Moscow sent him to Afghanistan

Heuristic Bootstrapping and Ambiguous Counting
1. Produce initial estimates (model) by counting all unambiguous cases
2. Apply the initial model to all ambiguous cases; count each case under the resulting attachment if |λ| is greater than a threshold
   • E.g. |λ| > 2, meaning one attachment is at least 4 times more likely than the other
3. Count each remaining ambiguous case as 0.5 for each attachment
• Likely n-p and v-p pairs would "pop up" in the ambiguous counts, while incorrect attachments are likely to accumulate low counts

Example Decision
• Moscow sent soldiers into Afghanistan
  P(into|send) = attach_freq(into, send) / freq(send) = 86 / 1742.5 ≈ 0.049
  P(into|soldier) = attach_freq(into, soldier) / freq(soldier) = 1 / 1478 ≈ 0.0007
  λ(send, soldier, into) = log2( 0.049 / 0.0007 ) = log2 70 ≈ 6.1
• Verb attachment is about 70 times more likely (see the code sketch below)

Hindle & Rooth Evaluation
• H&R results for a somewhat richer model:
  – 80% correct if we always make a choice
  – 91.7% precision at 55.2% recall, when requiring |λ| > 3 for classification
• Notice that the probability ratio doesn't distinguish between decisions made based on high vs. low frequencies

Possible Extensions
• Consider an a-priori structural preference for "low" attachment (to the noun)
• Consider the lexical head of the PP:
  – I saw the bird with the telescope
  – I met the man with the telescope
• Such additional factors can be incorporated easily, assuming their independence
• Addressing more complex types of attachments, such as chains of several PPs
• Similar attachment ambiguities within noun compounds: [N [N N]] vs. [[N N] N]
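To make the attachment decision above concrete, here is a minimal Python sketch of the likelihood-ratio computation from the Example Decision slide. The counts are the rounded toy figures quoted there, and the data structures (attach_freq, freq) are illustrative assumptions rather than Hindle & Rooth's actual implementation; smoothing of zero counts is omitted.

import math
from collections import Counter

# Toy attachment statistics (fractional counts come from the 0.5-counting step).
attach_freq = Counter({("into", "send"): 86, ("into", "soldier"): 1})
freq = Counter({"send": 1742.5, "soldier": 1478})

def p_attach(prep, head):
    """MLE estimate P(prep | head) = attach_freq(prep, head) / freq(head)."""
    return attach_freq[(prep, head)] / freq[head]

def attachment_lambda(verb, noun, prep):
    """Log-2 likelihood ratio: positive -> verb attachment, negative -> noun attachment."""
    return math.log2(p_attach(prep, verb) / p_attach(prep, noun))

lam = attachment_lambda("send", "soldier", "into")
print(f"lambda = {lam:.2f}")   # ~6.2 with these counts; the slide's rounded figures give log2(70) ~ 6.1
print("verb" if lam > 0 else "noun", "attachment")

With a threshold such as |λ| > 2, as in the bootstrapping step, this case would be counted under verb attachment.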
Classify by Best Single Feature: Decision List
• Training: for each feature, measure its "entailment score" for each class, and register the class with the highest score
  – Sort all features by decreasing score
• Classification: for a given example, identify the highest entailment score among all "active" features, and select the corresponding class
  – Test all features in decreasing score order until the first success → output the relevant class
  – Default decision: the majority class
• For multiple classes per example: may apply a threshold on the feature-class entailment score
• Suitable when relatively few strong features indicate the class (compare to manually written rules)

Example: Accent Restoration
• (David Yarowsky, 1994): for French and Spanish
• Classes: alternative accent restorations for words in text without accent marking
• Example: côte (coast) vs. côté (side)
• A variant of the general word sense disambiguation problem – "one sense per collocation" motivates using decision lists
• Similar tasks:
  – Capitalization restoration in ALL-CAPS text
  – Homograph disambiguation in speech synthesis (wind as noun and verb)

Accent Restoration – Features
• Word-form collocation features:
  – Single words in a window: ±1, ±k (k = 20-50)
  – Word pairs at <-1,+1>, <-2,-1>, <+1,+2> (complex features)
  – Easy to implement

Accent Restoration – Features
• Local syntax-based features (for Spanish):
  – Use a morphological analyzer
  – Lemmatized features – generalizing over inflections
  – POS of adjacent words as features
  – Some word classes (primarily time terms, to help with tense ambiguity for unaccented words in Spanish)

Accent Restoration – Decision Score
• score(f, c) = log( P(c|f) / P(~c|f) )    where c is a class and f a feature (see the code sketch below)
• Probabilities estimated from training statistics, taken from a corpus with accents
• Smoothing – add a small constant to all counts
• Pruning:
  – Remove redundancies for efficiency: remove specific features that score lower than their generalization (domingo – WEEKDAY, w1w2 – w1)
  – Cross-validation: remove features that cause more errors than correct classifications on held-out data

"Add-1/Add-Constant" Smoothing
• Maximum likelihood estimate: p_MLE(x) = c(x) / N
  – c(x): the count for event x (e.g. a word occurrence)
  – N: the total count over all x ∈ X (e.g. corpus length)
  – p_MLE(x) = 0 for many low-probability events (sparseness)
• Smoothing – discounting and redistribution:
  p_S(x) = ( c(x) + λ ) / ( N + λ|X| )
  – λ = 1: Laplace, assuming a uniform prior
  – For natural language events: usually λ < 1

Accent Restoration – Results
• Agreement with an accented test corpus for ambiguous words: 98%
  – Vs. 93% for the baseline of the most frequent form
  – The accented test corpus also includes errors
• Worked well for most of the highly ambiguous cases (see the random sample on the next slide)
• Results slightly better than Naive Bayes (which weighs multiple features)
  – Consistent with a related study on binary homograph disambiguation, where combining multiple features almost always agrees with using the single best feature
  – Incorporating many low-confidence features may introduce noise that would override the strong features

Accent Restoration – Tough Examples
(examples table not reproduced)
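Stepping back to the decision-list method itself, here is a minimal sketch of training and classification with the score and add-constant smoothing described above. This is an illustration, not Yarowsky's implementation: the string-valued context features, the smoothing constant and the toy training pairs are all assumptions.

import math
from collections import defaultdict

ALPHA = 0.1  # add-constant smoothing

def train_decision_list(examples):
    """examples: list of (feature_set, label) pairs. Returns rules sorted by score."""
    counts = defaultdict(lambda: defaultdict(float))
    labels = set()
    for feats, label in examples:
        labels.add(label)
        for f in feats:
            counts[f][label] += 1
    rules = []
    for f, by_label in counts.items():
        for c in labels:
            p_c = by_label[c] + ALPHA
            p_not_c = sum(v for l, v in by_label.items() if l != c) + ALPHA
            rules.append((math.log(p_c / p_not_c), f, c))
    rules.sort(reverse=True)          # strongest evidence first
    return rules

def classify(rules, feats, default):
    for score, f, c in rules:         # first matching (active) feature decides
        if f in feats:
            return c
    return default                    # majority class as default

# Toy usage: restoring cote -> "coast" vs. "side" from context-word features.
train = [({"w-1=la", "w+1=atlantique"}, "coast"),
         ({"w-1=du", "w+1=gauche"}, "side"),
         ({"w+1=atlantique"}, "coast")]
rules = train_decision_list(train)
print(classify(rules, {"w+1=atlantique"}, default="coast"))   # -> coast

Pruning (dropping features subsumed by a higher-scoring generalization, or features that misclassify held-out data) would operate on the sorted rules list before classification.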
Related Application: Anaphora Resolution
(Dagan, Justeson, Lappin, Leass, Ribak 1995)
• Example: "The terrorist pulled the grenade from his pocket and threw it at the policeman" – what does it refer to?
• Traditional AI-style approach: manually encoded semantic preferences/constraints, e.g. an <object – verb> constraint linking a Weapon class (bombs, grenade) to Cause_movement actions (throw, drop)

Statistical Approach
• A "semantic" judgment from corpus (text collection) statistics:
  – <verb–object: throw-grenade> 20 times
  – <verb–object: throw-pocket> 1 time
• Statistics can be acquired from unambiguous (non-anaphoric) occurrences in a raw (English) corpus (cf. PP attachment)
• Semantic confidence combined with syntactic preferences → it = grenade
• "Language modeling" for disambiguation

Word Sense Disambiguation for Machine Translation
• Example: "I bought soap bars" vs. "I bought window bars" – which sense of bar?
  – sense1 ('chafisa') vs. sense2 ('sorag')
• Corpus (text collection) counts:
  – Sense1: <noun-noun: soap-bar> 20 times, <noun-noun: chocolate-bar> 15 times
  – Sense2: <noun-noun: window-bar> 17 times, <noun-noun: iron-bar> 22 times
• Features: co-occurrence within distinguished syntactic relations
• "Hidden" senses – manual labeling required(?)

Solution: Mapping to Target Language
• English(-English)-Hebrew dictionary:
  – bar1 → 'chafisa', bar2 → 'sorag', soap → 'sabon', window → 'chalon'
• Map ambiguous "relations" to the second language (all possibilities) and count them in a Hebrew corpus:
  – <noun-noun: soap-bar> →
    1. <noun-noun: 'chafisat-sabon'> 20 times
    2. <noun-noun: 'sorag-sabon'> 0 times
  – <noun-noun: window-bar> →
    1. <noun-noun: 'chafisat-chalon'> 0 times
    2. <noun-noun: 'sorag-chalon'> 15 times
• Exploiting differences in ambiguity between the two languages
• Principle – intersecting redundancies (Dagan and Itai 1994)

The Selection Model
• Constructed to choose (classify) the right translation for a complete relation rather than for each individual word at a time – since both words in a relation might be ambiguous, their translations depend on each other
• Assuming a multinomial model, under certain linguistic assumptions:
  – The multinomial variable: a source relation
  – Each alternative translation of the relation is a possible outcome of the variable

An Example Sentence
• A Hebrew sentence with 3 ambiguous words
• The alternative translations to English
(example not reproduced)

Example – Relational Representation
(figure not reproduced)

Selection Model
• We would like to use as a classification score the log of the odds ratio between the most probable relation i and all other alternatives (in particular, the second most probable one, j):
  ln( p_i / p_j )
• Estimation is based on smoothed counts
• A potential problem: the odds ratio for probabilities doesn't reflect the absolute counts from which the probabilities were estimated
  – E.g., a count of 3 vs. a (smoothed) 0
• Solution: use a one-sided confidence interval (lower bound) for the odds ratio

Confidence Interval (for a proportion)
• Given an estimate, what is the confidence that the estimate is "correct", or at least close enough to the true value?
  – p: the true parameter value (proportion)
  – p̂: the sampled proportion (considered as a random variable)
  – n: the sample size
  – E(p̂) = p,    σ(p̂) = √( p(1−p) / n )

Confidence Interval (cont.)
• Approximating by the normal distribution: the distribution of the sampled proportion (across samples) approaches a normal distribution for large n
• z_α: the number of standard deviations such that the probability of obtaining p̂ ≥ p + z_α·σ(p̂) is α
  – Popular values: z_.05 = 1.645, z_.025 = 1.96

Confidence Interval (cont.)
• Estimation of a two-sided confidence interval with confidence 1−α (using p̂ to estimate σ(p̂)):
  p = p̂ ± z_{α/2} · √( p̂(1−p̂) / n )
• Estimation of a one-sided confidence interval with confidence 1−α (upper/lower bound):
  p ≤ p̂ + z_α · √( p̂(1−p̂) / n )    (or p ≥ p̂ − z_α · √( p̂(1−p̂) / n ) for a lower bound)
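A minimal numeric sketch of these confidence bounds, using the normal approximation with p̂ plugged into the standard deviation; the sample values (k successes out of n) are made up for illustration.

import math

def proportion_bounds(k, n, z_two_sided=1.96, z_one_sided=1.645):
    """Two-sided interval and one-sided lower bound for a proportion,
    using the normal approximation with p-hat in place of p in the std. deviation."""
    p_hat = k / n
    sd = math.sqrt(p_hat * (1 - p_hat) / n)
    two_sided = (p_hat - z_two_sided * sd, p_hat + z_two_sided * sd)
    lower_bound = p_hat - z_one_sided * sd
    return p_hat, two_sided, lower_bound

p_hat, ci, lower = proportion_bounds(k=20, n=25)
print(f"p-hat={p_hat:.2f}, 95% two-sided CI=({ci[0]:.2f}, {ci[1]:.2f}), "
      f"95% one-sided lower bound={lower:.2f}")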
Selection Model (cont.)
• The distribution of the log of the odds ratio (across samples) converges to a normal distribution
• Selection "confidence" score for a single relation = the lower bound on the log odds ratio, estimating ln(p_i/p_j) by ln(n_i/n_j):
  Conf(i) = ln( n_i / n_j ) − Z_{1−α} · √( 1/n_i + 1/n_j )
  (see the code sketch below)
• The most probable translation i for the relation is selected if Conf(i), the lower bound for the log odds ratio, exceeds θ
• Notice the roles of θ vs. α, and the impact of n_i, n_j

Handling Multiple Relations in a Sentence: Constraint Propagation
1. Compute Conf(i) for each ambiguous source relation.
2. Pick the source relation with the highest Conf(i). If Conf(i) < θ, or if no source relations are left, then stop; otherwise, select word translations according to target relation i and remove the source relation from the list.
3. Propagate the translation constraints: remove any target relation that contradicts the selections made; remove source relations that thereby become unambiguous.
4. Go to step 2.
• Notice the similarity to the decision list algorithm

Selection Algorithm Example
(figure not reproduced)

Evaluation Results
• Results for Hebrew→English translation:
  – Coverage: ~70%
  – Precision within coverage: ~90%
  – ~20% improvement over choosing the most frequent translation (95% statistical confidence for an improvement relative to this common baseline)

Analysis
• Correct selections capture:
  – Clear semantic preferences: sign/seal a treaty
  – Lexical collocation usage: peace treaty/contract
• No selection:
  – Mostly: no statistics for any alternative (data sparseness)
    • investigator/researcher of corruption
  – Also: similar statistics for several alternatives
  – Solutions:
    • Consult more features in the remote (vs. syntactic) context: prime minister … take position/job
    • Class/similarity-based generalizations (corruption – crime)

Analysis (cont.)
• Confusing multiple sources (senses) for the same target relation:
  – 'sikkuy' (chance/prospect) 'kattan' (small/young); valid (frequent) target relations:
    • small chance – correct
    • young prospect – incorrect, since "young prospect" is the translation of another Hebrew expression: 'tikva' (hope) 'zeira' (young)
• The "soundness" assumption of the multinomial model is violated:
  – We assume that counting the generated target relations corresponds to sampling the source relation, i.e. a known 1:n mapping (also completeness – another source of errors)
  – Potential solutions: a bilingual corpus, "reverse" translation

Sense Translation Model: Summary
• Classification instance: a relation with multiple words, rather than a single word at a time, to capture immediate ("circular") dependencies
• Make local decisions, based on a single feature
• Take into account the statistical confidence of decisions
• Constraint propagation for multiple dependent classifications (remote dependencies)
• Decision-list style rationale – classifying by a single high-confidence piece of evidence is simpler, and may work better, than considering all weaker evidence simultaneously
  – Computing statistical confidence for a combination of multiple events is difficult; it is easier to do for one event at a time
• A statistical classification scenario (model) constructed for the linguistic setting
  – It is important to identify explicitly the underlying model assumptions, and to analyze the resulting errors

Word Sense Disambiguation
• Many words have multiple meanings
  – E.g., river bank, financial bank
• Problem: assign the proper sense to each ambiguous word in text
• Applications:
  – Machine translation
  – Information retrieval (mixed evidence)
  – Semantic interpretation of text
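Returning to the selection model above, here is a minimal sketch of the Conf(i) score and the threshold-based selection. The smoothed counts, the α level (z = 1.645) and the threshold θ are illustrative assumptions, not the values used by Dagan and Itai.

import math

Z_95 = 1.645  # Z_{1-alpha} for alpha = 0.05

def conf(n_i, n_j, z=Z_95):
    """Lower bound on ln(p_i/p_j): ln(n_i/n_j) - z * sqrt(1/n_i + 1/n_j)."""
    return math.log(n_i / n_j) - z * math.sqrt(1.0 / n_i + 1.0 / n_j)

def select(counts, theta=0.5):
    """counts: smoothed counts per alternative target relation.
    Return the most frequent alternative if its confidence bound exceeds theta,
    otherwise None (no selection)."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    (best, n_i), (_, n_j) = ranked[0], ranked[1]
    return best if conf(n_i, n_j) > theta else None

# Toy usage with the soap-bar counts from the mapping slide
# (the 0 count is smoothed to 0.5 here purely for illustration):
print(select({"chafisat-sabon": 20, "sorag-sabon": 0.5}))   # -> chafisat-sabon

Constraint propagation would repeatedly call select on the remaining ambiguous relations, commit the highest-confidence selection, and prune the target relations that contradict it.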
Compare to POS Tagging?
• Idea: treat sense disambiguation like POS tagging, just with "semantic tags"
• The problems differ:
  – POS tags depend on specific structural cues – mostly neighboring, and thus dependent, tags
  – Senses depend on semantic context – less structured, longer-distance dependencies – many relatively independent/unstructured features

Approaches
• Supervised learning: learn from a pre-tagged corpus
• Dictionary-based learning: learn to distinguish senses from dictionary entries
• Unsupervised learning: automatically cluster word occurrences into different senses

Using an Aligned Bilingual Corpus
• Goal: get sense tagging cheaply
• Use correlations between phrases in the two languages to disambiguate
  – E.g., English interest vs. its German translations:
    • 'legal share' (acquire an interest) – Beteiligung erwerben
    • 'attention' (show interest) – Interesse zeigen
• For each occurrence of an ambiguous word, determine which sense applies according to the aligned translation
• Limited to senses that are discriminated by the other language; suitable for disambiguation in translation
• Gale, Church and Yarowsky (1992)

Evaluation
• Train and test on pre-tagged (or bilingual) texts
  – Difficult to come by
• Artificial data – cheap to train and test: 'merge' two words to form an 'ambiguous' word with two 'senses'
  – E.g., replace all occurrences of door and of window with doorwindow and see if the system figures out which is which
  – Useful for developing sense disambiguation methods (see the code sketch below)

Performance Bounds
• How good is (say) 83.2%?
• Evaluate performance relative to lower and upper bounds:
  – Baseline performance: how well does the simplest "reasonable" algorithm do? E.g., compare to selecting the most frequent sense
  – Human performance: what percentage of the time do people agree on the classification?
• The nature of the senses used impacts the achievable accuracy levels
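As a final illustration, a minimal sketch of the pseudo-word evaluation idea: merge two real words into one artificial ambiguous word, keep the originals as gold 'senses', and score a disambiguator against them. The word pair, the tokenizer and the most-frequent-sense baseline are illustrative assumptions.

import re

PAIR = ("door", "window")

def make_pseudo_corpus(tokens, pair=PAIR):
    """Replace both members of the pair with a merged pseudo-word; record gold senses."""
    merged = "".join(pair)            # "doorwindow"
    pseudo, gold = [], []
    for tok in tokens:
        if tok in pair:
            pseudo.append(merged)
            gold.append(tok)          # the original word serves as the gold 'sense'
        else:
            pseudo.append(tok)
    return pseudo, gold

def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold) if gold else 0.0

tokens = re.findall(r"\w+", "open the door , look out the window , shut the door")
pseudo, gold = make_pseudo_corpus(tokens)
baseline = ["door"] * len(gold)       # most-frequent-sense baseline (a lower bound)
print(pseudo)
print("baseline accuracy:", accuracy(baseline, gold))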