Ambiguity Management in
Deep Grammar Engineering
Tracy Holloway King
Ambiguity: bug or feature?

•  Bug in computer programming languages
•  Feature in natural language
   – People good at resolving ambiguity in context
   – Ambiguity consequently often unperceived
     ("Readjust paper holding clip"), even though
     thousand-fold ambiguities are common
   – Ambiguity promotes conciseness
•  Computers can't resolve ambiguity like humans
•  If we are going to build large-scale, linguistically
   sophisticated grammars, we need ways to handle ambiguity
Talk Outline

•  Sources of ambiguity
•  Grammar engineering approaches
   – Shallow markup
   – (Dis)preference marks
•  Stochastic disambiguation
•  Efficiency in ambiguity management
Sources of Ambiguity

•  Phonetic:
   – "I scream" or "ice cream"
•  Tokenization:
   – "I like Jan." --- |Jan|. or |Jan.|. (abbrev. January)
•  Morphological:
   – "walks" --- plural noun or 3sg verb
   – "untieable knot" --- un(tieable) or (untie)able
•  Lexical:
   – "bank" --- river bank or financial institution
•  Syntactic:
   – "The turkeys are ready to eat." --- fattened or hungry
•  Semantic:
   – "Two boys ate fifteen pizzas." --- 15 each or 15 total
•  Pragmatic:
   – "Sue won. Ed gave her a good luck charm." --- cause or result
PP Attachment
A classic example of syntactic ambiguity

•  PP adjuncts can attach to VPs and NPs
•  Strings of PPs in the VP are ambiguous
   – I see the girl with the telescope.
     I see [the girl with the telescope].
     I see [the girl] [with the telescope].
•  Ambiguities proliferate exponentially (see the sketch
   after this slide)
   – I see the girl with the telescope in the park.
     I see [the girl with [the telescope in the park]]
     I see [the [girl with the telescope] in the park]
     I see the girl [with the [telescope in the park]]
     I see the girl [with the telescope] [in the park]
     I see [the girl with the telescope] [in the park]
   – The syntax has no way to determine the attachment,
     even if humans can.
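
The growth is combinatorial: the number of ways to attach n PPs
after "V NP" is the (n+1)-th Catalan number, matching the 2 readings
for one PP and 5 for two PPs above. A minimal Python sketch (the
function name is ours, for illustration):

   from math import comb

   def pp_readings(n):
       # Attachment readings for "V NP PP_1 ... PP_n":
       # the (n+1)-th Catalan number.
       k = n + 1
       return comb(2 * k, k) // (k + 1)

   for n in range(1, 9):
       print(n, pp_readings(n))
   # -> 2, 5, 14, 42, 132, 429, 1430, 4862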
Coverage entails ambiguity

   I fell in the park.            [PP attaches to VP]
+  I know the girl in the park.   [PP attaches to NP]
⇒  I see the girl in the park.    [both attachments: ambiguous]

A grammar that covers both attachment sites necessarily makes the
third sentence ambiguous.
Ambiguity can be explosive

If alternatives multiply within or across components…

[Diagram: analysis pipeline Tokenize → Morphology → Syntax →
Semantics → Discourse, with alternatives branching at every stage]
Ambiguity figures

•  Deep grammars are massively ambiguous
•  Example: 700 sentences from section 23 of the WSJ
   – average # of words: 19.6
   – average # of optimal parses: 684
     » for 1-10 word sentences: 3.8
     » for 11-20 word sentences: 25.2
     » for 50-60 word sentences: 12,888
Managing Ambiguity

•  Grammar engineering approaches
   – Trim early with shallow markup
   – (Dis)preference marks on rules
•  Choose the most probable parse for applications
   that need a single input
•  Use packing to parse and manipulate the
   ambiguities efficiently
Talk Outline

•  Sources of ambiguity
•  Grammar engineering approaches
   – Shallow markup
   – (Dis)preference marks
•  Stochastic disambiguation
•  Efficiency in ambiguity management
Shallow markup

•  Part-of-speech marking as filter
   – I saw her duck/VB.
   – accuracy of tagger (very good for English)
   – can use partial tagging (verbs and nouns)
•  Named entities
   – <company>Goldman, Sachs & Co.</company> bought IBM.
   – good for proper names and times
   – hard to parse internal structure
•  Fall back to unmarked parsing if the marked-up parse fails
   – slows parsing
   – accuracy vs. speed
Example shallow markup: Named entities

•  Allow tokenizer to accept marked-up input:
   parse {<person>Mr. Thejskt Thejs</person> arrived.}
•  Tokenized string:
   Mr. Thejskt Thejs TB +NEperson
   Mr(TB). TB Thejskt TB Thejs
   TB arrived TB . TB
•  Add lexical entries and rules for NE tags
Resulting C-structure
Resulting F-structure
Results for shallow markup

               % Full    Optimal sol'ns   Best F-score   Time %
               parses    (Full/All)       (Full/All)     (Full/All)
   Unmarked    76        482/1753         82/79          65/100
   Named ent   78        263/1477         86/84          60/91
   POS tag     62        248/1916         76/72          40/48

(Kaplan and King 2003)
(Dis)preference marks (OT marks)

•  Want to (dis)prefer certain constructions
   – prefer: use when possible
   – disprefer: do not use unless there is no other analysis
•  Implementation (see the sketch after this slide)
   – Put marks in rules and lexical entries
   – Rank those marks
     » ranking can be different for different grammars/corpora
   – Use most preferred parse(s)
     » can be used as a two-pass system for robust parsing
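
As an illustration of the selection step (a sketch only, not XLE's
actual algorithm; the mark names come from the slides that follow,
the rest is hypothetical), dispreference-mark counts can be compared
lexicographically, worst-ranked mark first:

   # Illustrative sketch of (dis)preference-mark filtering.
   # Each analysis carries a list of marks; marks are ranked, worst first.

   def mark_profile(marks, ranking):
       # Count each ranked mark; tuples compare lexicographically,
       # so the worst-ranked mark dominates the comparison.
       return tuple(marks.count(m) for m in ranking)

   def most_preferred(analyses, ranking):
       best = min(mark_profile(marks, ranking) for _, marks in analyses)
       return [a for a in analyses if mark_profile(a[1], ranking) == best]

   ranking = ["BadVAgr", "MissingCopularVerb", "InfAdjunct"]
   analyses = [("XCOMP reading", []),
               ("adjunct reading", ["InfAdjunct"])]
   print(most_preferred(analyses, ranking))
   # [('XCOMP reading', [])] -- the unmarked analysis wins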
Ungrammatical input

•  Real-world text contains ungrammatical input
   – Deep grammars tend to cover only grammatical input
•  Common errors can be coded in the rules
   – may want to know that an error occurred
     (e.g., provide feedback in CALL grammars)
•  Disprefer parses of ungrammatical structures
   – tools for the grammar writer to rank rules
   – two+ pass system:
     1. standard rules
     2. rules for known ungrammatical constructions
     3. default fall-back rules
Sample ungrammatical structures

•  Mismatched subject-verb agreement

   Verb3Sg = { SUBJ PERS = 3
               SUBJ NUM = sg
             | BadVAgr }

•  Missing copula

   VPcop ==> { Vcop: ^=!
             | e: (^ PRED)='NullBe<(^ SUBJ)(^ XCOMP)>'
                  MissingCopularVerb }
             { NP: (^ XCOMP)=!
             | AP: (^ XCOMP)=!
             | … }
Dispreferred grammatical structures

•  Prefer subcategorized infinitives to adverbials
   – I want it.  /  I finished up (in order) to leave.
   – I want it to leave.   (ambiguous between the two)

   VP --> V
          (NP: (^ OBJ)=!)
          (VPinf: { (^ XCOMP)=! +InfSubcat
                  | ! $ (^ ADJUNCT) InfAdjunct } ).

•  Post-copular gerunds
   – He is a boy.  /  (His) going is difficult.
   – He is going.   (ambiguous between the two)
OT Mark summary

•  Use (dis)preference marks to (dis)prefer
   constructions or words
•  Allows inclusion of marginal/ungrammatical
   constructions
•  Issues:
   – only works for ambiguities with known
     preferences (not PP attachment)
   – hard to determine the ranking for many marks
   – two-pass parsing can be slow
Talk Outline

•  Sources of ambiguity
•  Grammar engineering approaches
   – Shallow markup
   – (Dis)preference marks
•  Stochastic disambiguation
•  Efficiency in ambiguity management
Packing & Pruning in XLE

•  XLE produces (too) many candidates
   – all valid (with respect to the grammar and OT marks)
   – not all equally likely
   – some applications require a single best parse,
     or at most a handful (n best)
•  The grammar writer can't specify the correct choices
   – many implicit properties of words and structures with
     unclear significance
Pruning in XLE

•  Appeal to a probability model to choose the best parse
•  Assume: previous experience is a good guide for
   future decisions
•  Collect a corpus of training sentences; build a
   probability model that optimizes for previous good
   results
   – partially labelled training data is OK:
     [NP-SBJ They] see [NP-OBJ the girl with the telescope]
•  Apply the model to choose the best analysis of new
   sentences
   – efficient (XLE English grammar: 5% of parse time)
Exponential models are appropriate
(aka Maximum Entropy or log-linear models)

•  Assign probabilities to representations, not to
   choices in a derivation
•  No independence assumption
•  Arithmetic combined with human insight
   – Human:
     » define properties of representations that may be relevant
     » based on any computable configuration of features, trees
   – Arithmetic:
     » train to figure out the weight of each property
Properties employed in WSJ Experiment

•  ~800 property-functions:
   – c-structure nodes and subtrees
   – recursively embedded phrases
   – f-structure attributes (grammatical functions)
   – atomic attribute-value pairs
   – left/right branching
   – (non)parallelism in coordination
   – lexical elements (subcategorization frames)
•  Some end up with no discrimination power after
   training
Stochastic Disambiguation Summary

•  Training:
   – define a set of features by hand
   – train on partially labelled data
   – can train on low-ambiguity data
•  Use:
   – choose just one structure for applications that
     want just one
   – XLE displays the most probable parse first
   – 5% of parse time to disambiguate
   – 30% gain in F-score
Talk Outline

•  Sources of ambiguity
•  Grammar engineering approaches
   – Shallow markup
   – (Dis)preference marks
•  Stochastic disambiguation
•  Efficiency in ambiguity management
Computational consequences of ambiguity

•  Serious problem for computational systems
   – broad-coverage, hand-written grammars frequently produce
     thousands of analyses, sometimes millions
   – machine-learned grammars easily produce hundreds of
     thousands of analyses if allowed to parse to completion
•  Three approaches to ambiguity management:
   – Pruning: block unlikely analysis paths early
   – Procrastination: do not expand analysis paths that will lead
     to ambiguity explosion until something else requires them
     » also known as underspecification
   – Packing: compact representation and computation of all
     possible analyses
The Problem with Pruning:
premature disambiguation

•  The conventional approach: use heuristics to prune as
   soon as possible
•  Strong constraints may reject the so-far-best (= only) option

[Diagram: pipeline Tokenize → Morphology → Syntax → Semantics →
Discourse, with statistics pruning (X) candidates at every stage:
fast computation, wrong result]
The problem with procrastination:
passing the buck

•  Chunk parsing as an example:
   – collect noun groups, verb groups, PP groups
   – leave it to later processing to figure out the
     correct way of putting these together
   – not all combinations are grammatically acceptable
•  Later processing must either
   – call the parser to check grammatical constraints,
   – have its own model of grammatical constraints, or
   – in the best case, solve a set of constraints the
     partial parser includes with its output
The Problem with Packing

•  There may be too many analyses to pack
   efficiently
•  A major problem for relatively unconstrained,
   machine-induced grammars
   – grammars overgenerate massively
   – statistics used to prune out unlikely sub-analyses
•  Less of a problem for carefully hand-coded,
   broad-coverage grammars
Packing

•  Explosion of ambiguity results from a small
   number of sub-analyses combining in
   different ways to produce a large number of
   total analyses (e.g. PP attachment)
•  Compute and represent each sub-analysis
   just once
•  Compute a factored representation of how
   these sub-analyses combine
Generalizing Free Choice Packing

The sheep saw the fish.   (How many sheep? How many fish?)

Options multiplied out:
   The sheep-sg saw the fish-sg.
   The sheep-pl saw the fish-sg.
   The sheep-sg saw the fish-pl.
   The sheep-pl saw the fish-pl.

Options packed:
   The sheep {sg | pl} saw the fish {sg | pl}

In principle, a verb might require agreement of subject and
object: have to check it out. But English doesn't do that:
any combination of choices is OK. (See the sketch after
this slide.)
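
A toy sketch of free-choice packing (an illustrative representation,
not XLE's data structures): each choice is stored once, readings are
counted by multiplying the alternative counts, and the cross-product
is taken only if full enumeration is actually needed.

   from itertools import product

   # Toy packed representation: one entry per independent choice.
   choices = {"sheep_num": ["sg", "pl"],
              "fish_num":  ["sg", "pl"]}

   # Counting readings requires no expansion.
   n_readings = 1
   for alts in choices.values():
       n_readings *= len(alts)
   print(n_readings)  # 4

   # Multiplying out (only when needed) is the cross-product.
   for sheep, fish in product(*choices.values()):
       print(f"The sheep-{sheep} saw the fish-{fish}")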
Dependent choices

Das Mädchen {nom | acc} sah die Katze {nom | acc}
"The girl"               "saw the cat"

Again, packing avoids duplication… but it's wrong:
it doesn't encode the dependencies; choices are not free.

   Das Mädchen-nom sah die Katze-nom   bad
   Das Mädchen-nom sah die Katze-acc   "The girl saw the cat"
   Das Mädchen-acc sah die Katze-nom   "The cat saw the girl"
   Das Mädchen-acc sah die Katze-acc   bad
Solution: Label dependent choices

Das Mädchen {p:nom | ¬p:acc} sah die Katze {q:nom | ¬q:acc}

   Das Mädchen-nom sah die Katze-nom   bad
   Das Mädchen-nom sah die Katze-acc   "The girl saw the cat"
   Das Mädchen-acc sah die Katze-nom   "The cat saw the girl"
   Das Mädchen-acc sah die Katze-acc   bad

Φ = (p ∧ ¬q) ∨ (¬p ∧ q)

• Label each choice with distinct Boolean variables p, q, etc.
• Record acceptable combinations as a Boolean expression Φ
• Each analysis corresponds to a satisfying truth-value assignment
  (a line from Φ's truth table that assigns it "true")
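
The same idea in code (our own encoding, for illustration): each
reading is a truth-value assignment, and only the assignments
satisfying Φ survive.

   from itertools import product

   # p = "Mädchen is nom", q = "Katze is nom".
   # phi records the acceptable combinations.
   def phi(p, q):
       return (p and not q) or (not p and q)

   for p, q in product([True, False], repeat=2):
       girl = "nom" if p else "acc"
       cat = "nom" if q else "acc"
       verdict = "ok" if phi(p, q) else "bad"
       print(f"Das Mädchen-{girl} sah die Katze-{cat}: {verdict}")
   # Only the nom/acc and acc/nom lines come out "ok".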
The Free Choice Gamble

•  Worst case, where everything interacts:
   – as many choice variables as there are readings
   – packing blows up and becomes exponential
•  Best case, no interactions:
   – N completely independent choices represent 2^N
     readings
•  Language interactions are mostly limited and local
   – tends towards the best case
   – free choice packing pays off for linguistic analysis
Conclusions

•  Ambiguity has to be dealt with
•  Deep grammars use a variety of approaches
   – preprocessing
   – grammar engineering
   – stochastic disambiguation
•  Why use deep grammars if they are so
   ambiguous?
Deep analysis matters…
if you care about the answer

Example:
   A delegation led by Vice President Philips, head of the chemical
   division, flew to Chicago a week after the incident.

Question: Who flew to Chicago?

Candidate answers:
   division      --- closest noun                        shallow but wrong
   head          --- next closest                        shallow but wrong
   V.P. Philips  --- next                                shallow but wrong
   delegation    --- furthest away, but subject of flew  deep and right
Applications of Language Engineering

[Chart: applications plotted by Functionality (low → high) against
Domain Coverage (narrow → broad): Manually-tagged Keyword Search,
AltaVista, Google, AskJeeves, Microsoft Paperclip, Restricted
Dialogue, Document Base Management, Post-Search Sifting, Autonomous
Knowledge Filtering, Knowledge Fusion, Good Translation, Useful
Summary, Natural Dialogue]
What to do with them?

•  Define yes-no / 1-0 features, f, that seem important
•  Training determines weights on these features, λ, to
   reflect their actual importance
•  Select parse x: count occurrences of features (0, 1)
   and multiply by the corresponding weights: λ·f(x)
•  Convert weighted feature counts to probabilities
   (see the sketch after this slide):

   P(x) = e^{λ·f(x)} / Σ_{x′} e^{λ·f(x′)}

   numerator: un-normalized probability
   denominator: normalizing factor
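
A minimal runnable sketch of this computation (the feature names and
weights are invented for illustration; real models use ~800 trained
property-functions, as above):

   import math

   # Toy log-linear model: lambda weights for two 0/1 features.
   weights = {"vp_attach": 0.9, "np_attach": -0.3}

   def dot(feats):
       # lambda . f(x): weighted sum of feature counts
       return sum(weights[k] * v for k, v in feats.items())

   parses = {
       "see [the girl] [with the telescope]": {"vp_attach": 1, "np_attach": 0},
       "see [the girl with the telescope]":   {"vp_attach": 0, "np_attach": 1},
   }

   # P(x) = exp(lambda.f(x)) / sum over x' of exp(lambda.f(x'))
   z = sum(math.exp(dot(f)) for f in parses.values())
   for parse, feats in parses.items():
       print(f"{parse}: {math.exp(dot(feats)) / z:.3f}")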
Issues in Stochastic Disambiguation

•  What kind of probability model?
•  What kind of training data?
•  Efficiency of training, efficiency of
   disambiguation?
•  Benefit vs. random choice of parse
Advantages of Free Choice Packing

•  Avoids procrastination
   – nogoods are constraints that the parser sends to other components
   – eliminating nogoods: other components don't do the parser's work
•  Independence between choices allows processing that
   relies on independence assumptions
   – counting the number of readings
     » apparently trivial, but of crucial importance, since statistical
       modelling requires the ability to count
   – hence, statistical processing
•  A general mechanism extending beyond parsing
Simplifying Truth Tables

Before simplification:

   Das Mädchen {p:nom | ¬p:acc} sah die Katze {q:nom | ¬q:acc}

   p  q  Φ
   1  0  1
   0  1  1
   1  1  0
   0  0  0

   Φ = (p ∧ ¬q) ∨ (¬p ∧ q), which is equivalent to (q = ¬p)

After substituting ¬p for q:

   Das Mädchen {p:nom | ¬p:acc} sah die Katze {¬p:nom | p:acc}

   p  Φ
   1  1
   0  1

   Φ = (p ∨ ¬p) = true: freely choose any line
   from the truth table
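
A quick check of the simplification in code (our own encoding, for
illustration): once the Katze choices are relabeled with ¬p, the
constraint holds for every assignment of p.

   # After substituting q = not p, phi collapses to a tautology:
   for p in (True, False):
       q = not p
       phi = (p and not q) or (not p and q)
       print(p, phi)   # True on both lines: free choice restored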