Shuffling Non-Constituents
Jason Eisner
with David A. Smith (syntactically-flavored reordering model)
and Roy Tromble (syntactically-flavored reordering search methods)
ACL SSST Workshop (invited talk), June 2008

Starting point: Synchronous alignment

Synchronous grammars are very pretty.
But does parallel text actually have parallel structure?

Depends on what kind of parallel text:
- Free translations? Noisy translations?
- Were the parsers trained on parallel annotation schemes?

Depends on what kind of parallel structure:
- What kinds of divergences can your synchronous grammar formalism capture?
  E.g., wh-movement versus wh in situ

Synchronous Tree Substitution Grammar

Two training trees, showing a free translation from French to English.
[Figure: dependency trees for the two sentences, with glosses:
beaucoup "lots", d' "of", enfants "kids", donnent "give", un "a",
baiser "kiss", à "to"]

“beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”

Synchronous Tree Substitution Grammar

Two training trees, showing a free translation from French to English.
A possible alignment is shown in orange.
[Figure: the two dependency trees decomposed into aligned fragments, with
NP and Adv substitution sites and some material aligned to null]

“beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”

Synchronous Tree Substitution Grammar

Two training trees, showing a free translation from French to English.
A possible alignment is shown in orange.
A much worse alignment ...
[Figure: the same trees under the worse alignment]

“beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”

Synchronous Grammar = Set of Elementary Trees

[Figure: the elementary tree pairs extracted from the aligned example:
paired French/English fragments built from donnent ("give"), un ("a"),
baiser ("kiss"), à ("to") / kiss; beaucoup ("lots"), d' ("of"),
enfants ("kids") / kids; Sam / Sam; plus null-aligned quite and often;
each pair carries NP and Adv substitution sites]

But many examples are harder

[Figure: aligned dependency trees]
German:  "Auf diese Frage habe ich leider keine Antwort bekommen"
(gloss:  "To this question have I alas no answer received")
English: "I did not unfortunately receive an answer to this question" (plus NULL)

But many examples are harder

[Figure: the same German-English pair]
Displaced modifier (negation): the negation sits on "keine Antwort"
("no answer") in German but on the verb ("did not ... receive") in English.

But many examples are harder

[Figure: the same German-English pair]
Displaced argument (here, because of the projective parser).

But many examples are harder

[Figure: the same German-English pair]
Head-swapping (here, just different annotation conventions)

Free Translation

[Figure: aligned dependency trees]
German:  "Tschernobyl könnte dann etwas später an die Reihe kommen"
(gloss:  "Chernobyl could then something later on the queue come")
English: "Then we could deal with Chernobyl some time later" (plus NULL)

Free Translation

[Figure: the same German-English pair]
Probably not systematic (but the words are correctly aligned)

Free Translation

[Figure: the same German-English pair]
Erroneous parse

What to do?

Current practice:
- Don't try to model all systematic phenomena!
- Just use non-syntactic alignments (Giza++).
- Only care about the fragments that recur often:
  - phrases or gappy phrases
  - sometimes even syntactic constituents
    (can favor these, e.g., Marton & Resnik 2008)
- Use these (gappy) phrases in a decoder
  (phrase-based or hierarchical)

What to do?

Current practice:
- Use non-syntactic alignments (Giza++)
- Keep frequent phrases for a decoder

But could syntax give us better alignments?
- Would have to be "loose" syntax ...

Why do we want better alignments?
1. Throw away less of the parallel training data
2. Help learn a smarter, syntactic reordering model
   (could help decoding: less reliance on the LM)
3. Some applications care about full alignments

Quasi-synchronous grammar

How do we handle "loose" syntax?
Translation story:
- Generate target English by a monolingual grammar
  - Any grammar formalism is okay
  - Pick a dependency grammar formalism for now

[Figure: English dependency tree for "I did not unfortunately receive an
answer to this question", with generation probabilities such as
P(I | did, PRP) and P(PRP | no previous left children of "did")]

Parsing: O(n^3)

Quasi-synchronous grammar

How do we handle "loose" syntax?
Translation story:
- Generate target English by a monolingual grammar
- But probabilities are influenced by the source sentence
  - Each English node is aligned to some source node
  - Prefers to generate children aligned to nearby source nodes

[Figure: the same English dependency tree]

Parsing: O(n^3)

QCFG Generative Story

[Figure: the observed German source "Auf diese Frage habe ich leider keine
Antwort bekommen" (plus NULL), with the English dependency tree generated
alongside it; the generation probabilities now condition on the aligned
source words, e.g. P(parent-child), P(breakage), P(I | did, PRP, ich),
P(PRP | no previous left children of "did", habe)]

Aligned parsing: O(m^2 n^3)

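To make the story concrete, here is a minimal sketch (in Python, not the
authors' code) of how a quasi-synchronous model might score one target
attachment: a monolingual dependency factor multiplied by a factor for the
configuration of the two aligned source nodes, with "breakage" as the
none-of-the-above case. The names (qg_attachment_score, p_mono, p_config)
and the smoothing constant are illustrative assumptions.

def qg_attachment_score(p_mono, p_config, source_parent,
                        child_word, parent_word, child_src, parent_src):
    """Score of attaching child_word under parent_word in the target tree,
    given that they are aligned to source nodes child_src and parent_src.

    p_mono:        dict (child_word, parent_word) -> monolingual probability
    p_config:      dict configuration label -> probability of that shape
    source_parent: dict source node -> its parent in the source tree (None at root)
    """
    # Monolingual dependency factor, as in an ordinary target-language grammar.
    mono = p_mono.get((child_word, parent_word), 1e-6)

    # Alignment factor: are the aligned source nodes in a "nearby"
    # configuration (parent-child, same node, sibling, ...) or a breakage?
    if source_parent.get(child_src) == parent_src:
        config = 'parent-child'        # the strictly synchronous case
    elif child_src == parent_src:
        config = 'same-node'
    elif source_parent.get(child_src) == source_parent.get(parent_src):
        config = 'sibling'
    else:
        config = 'breakage'            # "none of the above"
    return mono * p_config.get(config, 1e-6)
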
What’s a “nearby node”?

Given the parent's alignment, where might the child be aligned?
[Figure: the possible alignment configurations; the parent-child
configuration is the synchronous grammar case, plus "none of the above"]

Quasi-synchronous grammar

How do we handle "loose" syntax?
Translation story:
- Generate target English by a monolingual grammar
- But probabilities are influenced by the source sentence

Useful analogies:
1. Generative grammar with latent word senses
2. MEMM: generate the tag sequence (target) by an n-gram model,
   but the probabilities are influenced by the word sequence (source)

Quasi-synchronous grammar

How do we handle "loose" syntax?
Translation story:
- Generate target English by a monolingual grammar
- But probabilities are influenced by the source sentence

Useful analogies:
1. Generative grammar with latent word senses
2. MEMM
3. IBM Model 1
   - Source nodes can be freely reused or unused
   - Future work: enforce 1-to-1 to allow good decoding
     (NP-hard to do exactly)

Some results: Quasi-synchronous Dependency Grammar

Alignment (D. Smith & Eisner 2006)
- Quasi-synchronous syntax much better than synchronous
- Maybe also better than IBM Model 4

Question answering (Wang et al. 2007)
- Align question with potential answer
- Mean average precision 43% → 48% → 60%
  (previous state of the art → + QG → + lexical features)

Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing)
- Learn how parsed parallel text influences target dependencies
- Along with many other features! (cf. co-training)
- Unsupervised: German 30% → 69%, Spanish 26% → 65%

Summary of part I

Current practice:
- Use non-syntactic alignments (Giza++)
- Some bits align nicely
- Use the frequent bits in a decoder

Suggestion: Let syntax influence alignments.
- So far, loose syntax methods are like IBM Model 1.
- NP-hard to enforce 1-to-1 in any interesting model.

Rest of talk:
- How to enforce 1-to-1 in interesting models?
- Can we do something smarter than beam search?

Shuffling Non-Constituents
Jason Eisner
with David A. Smith (syntactically-flavored reordering model)
and Roy Tromble (syntactically-flavored reordering search methods)
ACL SSST Workshop, June 2008

Motivation

MT is really easy!
Just use a finite-state transducer!
Phrases, morphology, the works!

Permutation search in MT

[Figure: French "Marie ne m’ a pas vu", tagged NNP NEG PRP AUX NEG VBN, in
initial order 1 2 3 4 5 6; permuted to the best order 1 4 2 5 6 3 (French′),
from which "Mary hasn’t seen me" is an easy transduction]

Motivation

MT is really easy!
Just use a finite-state transducer!
Phrases, morphology, the works!

Just have to fix that pesky word order.
- Framing it this way lets us enforce 1-to-1 exactly at the permutation step.
- Deletion and fertility > 1 are still allowed in the subsequent transduction.

Often want to find an optimal permutation …

Machine translation:
- Reorder French to French-prime (Brown et al. 1992)
  so it's easier to align or translate

MT eval:
- How much do you need to rearrange MT output so that it scores well
  under an LM derived from reference translations?

Discourse generation, e.g., multi-doc summarization:
- Order the output sentences (Lapata 2003) so they flow nicely
- Reconstruct the temporal order of events after info extraction

Learn rule ordering or constraint ranking for phonology?

Multi-word anagrams that score well under an LM

Permutation search: The problem

Initial order: 1 2 3 4 5 6
Best order (according to some cost function): 1 4 2 5 6 3

How can we find this needle in the haystack of N! possible permutations?

Traditional approach: Beam search

Approximate the best path through a really big FSA:
- N! paths, one for each permutation
- but only 2^N states: a state remembers what we've generated so far
  (but not in what order)
- arc weight = cost of picking 5 next if we've seen {1,2,4} so far

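As a concrete illustration of this search, here is a small sketch (in
Python; not the talk's implementation) of beam search over that 2^N-state
machine. The caller supplies next_cost(done, j), the arc weight for
emitting position j after the set `done`; the function name and beam width
are assumptions for the example.

import heapq

def beam_search_permutation(n, next_cost, beam=8):
    """Approximately find the lowest-cost permutation of range(n)."""
    # Each beam item is (cost so far, emitted prefix as a tuple of positions).
    frontier = [(0.0, ())]
    for _ in range(n):
        candidates = []
        for cost, prefix in frontier:
            done = set(prefix)
            for j in range(n):
                if j not in done:
                    candidates.append((cost + next_cost(done, j), prefix + (j,)))
        # Keep only the `beam` cheapest partial permutations.
        frontier = heapq.nsmallest(beam, candidates)
    return min(frontier)   # (approximate best cost, permutation)
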
An alternative: Local search (“hill-climbing”)

The SWAP neighborhood
[Figure: from the current permutation 123456 (cost=22), adjacent
transpositions give neighbors 213456 (cost=26), 132456 (cost=20),
124356 (cost=19), 123546 (cost=25), ...]

An alternative: Local search (“hill-climbing”)

The SWAP neighborhood
[Figure: the best swap is taken, moving from 123456 (cost=22) to 124356 (cost=19)]

An alternative: Local search (“hill-climbing”)

Like the “greedy decoder” of Germann et al. 2001.
The SWAP neighborhood: 1 2 3 4 5 6, cost = 22 → 19 → 17 → 16 → ...

Why are the costs always going down?  We pick the best swap.
How long does it take to pick the best swap?  O(N) if you're careful.
How many swaps might you need to reach the answer?  O(N^2)
What if you get stuck in a local min?  Random restarts.

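For concreteness, a minimal sketch (Python, illustrative only) of this
greedy SWAP hill-climbing; it calls a caller-supplied cost() on whole
permutations rather than computing swap deltas incrementally.

def swap_hillclimb(perm, cost):
    """Repeatedly apply the best adjacent transposition until no swap helps."""
    perm = list(perm)
    current = cost(perm)
    while True:
        best_delta, best_i = 0.0, None
        for i in range(len(perm) - 1):
            perm[i], perm[i + 1] = perm[i + 1], perm[i]
            delta = cost(perm) - current
            perm[i], perm[i + 1] = perm[i + 1], perm[i]   # undo the trial swap
            if delta < best_delta:
                best_delta, best_i = delta, i
        if best_i is None:                 # local minimum reached
            return perm, current
        i = best_i
        perm[i], perm[i + 1] = perm[i + 1], perm[i]
        current += best_delta
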
Larger neighborhood

[Figure: the SWAP neighborhood of 123456 (cost=22) again, about to be enlarged]

Larger neighborhood

INSERT neighborhood (well-known in the literature; works well)
[Figure: 1 2 3 4 5 6, cost=22 → cost=17]

Fewer local minima?  Yes – 3 can move past 4 to get past 5.
Graph diameter (max # moves needed)?  O(N) rather than O(N^2)
How many neighbors?  O(N^2) rather than O(N)
How long to find the best neighbor?  O(N^2) rather than O(N)

Even larger neighborhood

BLOCK neighborhood
[Figure: 1 2 3 4 5 6, cost=22 → cost=14]

Fewer local minima?  Yes – 2 can get past 45 without having to cross 3 or move 3 first.
Graph diameter (max # moves needed)?  Still O(N)
How many neighbors?  O(N^3) rather than O(N), O(N^2)
How long to find the best neighbor?  O(N^3) rather than O(N), O(N^2)

Larger yet: Via dynamic programming??

[Figure: 1 2 3 4 5 6, cost=22]

Fewer local minima?
Graph diameter (max # moves needed)?  Logarithmic
How many neighbors?  Exponential
How long to find the best neighbor?  Polynomial

Unifying/generalizing neighborhoods so far

[Figure: positions 1 ... 8 with cut points i, j, k marking two adjacent blocks]

Exchange two adjacent blocks, of max widths w ≤ w'.
A move is defined by an (i,j,k) triple.

SWAP:   w=1, w'=1     runtime = # neighbors = O(ww'N) = O(N)
INSERT: w=1, w'=N     O(N^2)
BLOCK:  w=N, w'=N     O(N^3)

Everything in this talk can be generalized to other values of w, w'.
(A small enumeration sketch of the (i,j,k) moves follows below.)

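Here is that sketch (Python, illustrative only): it enumerates the
block-exchange neighbors of a permutation for given width limits, filtering
all (i,j,k) triples for simplicity rather than generating only the O(ww'N)
triples of interest. SWAP is w = w' = 1, INSERT is w = 1, w' = n, BLOCK is
w = w' = n.

def block_exchange_neighbors(perm, w=1, w_prime=None):
    """Yield every permutation reachable by exchanging two adjacent blocks
    perm[i:j] and perm[j:k] whose widths are at most w and w_prime."""
    n = len(perm)
    if w_prime is None:
        w_prime = n
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                left, right = j - i, k - j          # the two block widths
                if min(left, right) <= w and max(left, right) <= w_prime:
                    yield perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

# Example: the SWAP neighbors of [1, 2, 3, 4] are the 3 adjacent transpositions.
# list(block_exchange_neighbors([1, 2, 3, 4], w=1, w_prime=1))
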
Very large-scale neighborhoods

What if we consider multiple simultaneous exchanges that are "independent"?

The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)
[Figure: a lattice over positions 1 ... 6 in which each arc either keeps a
position or swaps an adjacent pair; the lowest-cost neighbor is the
lowest-cost path, and the cost of a swap arc is the change in cost of
swapping that pair, e.g. swapping (4,5), here < 0]

Very large-scale neighborhoods

[Figure: the DYNASEARCH lattice again; the lowest-cost neighbor is the
lowest-cost path]

Why would this be a good idea?
Help get out of bad local minima?  No; they're still local minima.
Help avoid getting into bad local minima?  Yes – less greedy.

[Example: a small cost matrix B for which greedy SWAP takes the single best
swap (–30), while DYNASEARCH takes two independent swaps (–20 + –20) and
does better]

Very large-scale neighborhoods

[Figure: the DYNASEARCH lattice; the lowest-cost neighbor is the lowest-cost path]

Why would this be a good idea?
Help get out of bad local minima?  No; they're still local minima.
Help avoid getting into bad local minima?  Yes – less greedy.
More efficient?  Yes! A shortest-path algorithm finds the best set of swaps
in O(N) time, as fast as the best single swap.
Up to N moves as fast as 1 move: no penalty for "parallelism"!
Globally optimizes over exponentially many neighbors (paths).

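A compact sketch of that shortest-path idea (Python, illustrative only): the
best set of pairwise-disjoint adjacent swaps is chosen by a simple dynamic
program over positions. The sketch assumes the swap deltas add up when the
swaps are disjoint (true for pairwise costs such as the LOP costs below)
and, for brevity, computes each delta by a trial swap instead of reusing
precomputed values.

def dynasearch_step(perm, cost):
    """Apply the best set of non-overlapping adjacent swaps in one move."""
    n = len(perm)
    base = cost(perm)

    # delta[i] = change in cost from swapping positions i and i+1 in isolation
    delta = []
    for i in range(n - 1):
        q = list(perm)
        q[i], q[i + 1] = q[i + 1], q[i]
        delta.append(cost(q) - base)

    # best[i] = best achievable total delta using swaps within positions i..n-1
    best = [0.0] * (n + 1)
    choose = [False] * n
    for i in range(n - 2, -1, -1):
        skip = best[i + 1]
        take = delta[i] + best[i + 2]
        best[i], choose[i] = (take, True) if take < skip else (skip, False)

    # Read off the chosen, mutually disjoint swaps from left to right.
    q, i = list(perm), 0
    while i < n - 1:
        if choose[i]:
            q[i], q[i + 1] = q[i + 1], q[i]
            i += 2
        else:
            i += 1
    return q, base + best[0]
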
Can we extend this idea – up to N moves in parallel by dynamic programming –
to neighborhoods beyond SWAP?

[Figure: positions 1 ... 8 with cut points i, j, k]

Exchange two adjacent blocks, of max widths w ≤ w'.
A move is defined by an (i,j,k) triple.
SWAP:   w=1, w'=1     runtime = # neighbors = O(ww'N) = O(N)
INSERT: w=1, w'=N     O(N^2)
BLOCK:  w=N, w'=N     O(N^3)

Yes. The asymptotic runtime is always unchanged.

Let’s define each neighbor by a “colored tree”

Just like ITG!  A colored node = swap its children.
[Figure: a binary tree over 1 2 3 4 5 6 with some nodes colored]

Let’s define each neighbor by a “colored tree”

Just like ITG!  A colored node = swap its children.
[Figure: applying the colored nodes yields the reordering 5 6 1 2 3 4]

This is like the BLOCK neighborhood, but with multiple block exchanges,
which may be nested.

If that was the optimal neighbor …
… now look for its optimal neighbor, with a new tree!
[Figure: a new colored tree over the new order 5 6 1 4 2 3]

If that was the optimal neighbor …
… now look for its optimal neighbor
… repeat till you reach a local optimum

Each tree defines a neighbor.
At each step, optimize over all possible trees
by dynamic programming (CKY parsing).
[Figure: the current order 1 4 2 5 6 3]
Use your favorite parsing speedups (pruning, best-first, ...).
(A CKY-style sketch follows below.)

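Here is that CKY-style sketch (Python, illustrative only) for the pairwise
part of the cost: it searches over all colored binary trees on the current
order and returns the lowest-cost reachable permutation, given a matrix B
with B[x][y] = cost of word x preceding word y. For clarity it recomputes
each block-vs-block sum directly; the "Computing LOP cost of a block move"
slide later shows how to reuse work so each combination costs O(1). The
full method also threads WFSA states through the nonterminals, which this
sketch omits.

def best_itg_neighbor(perm, B):
    """Best permutation reachable from `perm` by one colored-tree step,
    scored by summed pairwise costs B[x][y] for x preceding y."""
    n = len(perm)
    # best[i][j] = (cost of the best ordering of the words perm[i:j],
    #               that ordering as a list)
    best = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        best[i][i + 1] = (0.0, [perm[i]])
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            candidates = []
            for k in range(i + 1, j):            # split point
                cl, left = best[i][k]
                cr, right = best[k][j]
                straight = sum(B[x][y] for x in left for y in right)
                inverted = sum(B[y][x] for x in left for y in right)
                candidates.append((cl + cr + straight, left + right))   # keep order
                candidates.append((cl + cr + inverted, right + left))   # swap (colored)
            best[i][j] = min(candidates, key=lambda t: t[0])
    return best[0][n]   # (total pairwise cost, new permutation)
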
Very-large-scale versions of SWAP, INSERT, and BLOCK,
all by the algorithm we just saw ...

[Figure: positions 1 ... 8 with cut points i, j, k]

Exchange two adjacent blocks, of max widths w ≤ w'.
A move is defined by an (i,j,k) triple.

The runtime of the algorithm we just saw was O(N^3),
because we considered O(N^3) distinct (i,j,k) triples.
More generally, restrict to only the O(ww'N) triples of interest
to define a smaller neighborhood with runtime O(ww'N).
(Yes, the dynamic programming recurrences go through.)

How many steps to get from here to there?

Initial order: 6 2 5 8 4 3 7 1
Best order:    1 2 3 4 5 6 7 8

One twisted-tree step?  No: as you probably know,
3 1 4 2 → 1 2 3 4 is impossible.

Can you get to the answer in one step?

[Figure: German-English, Giza++ alignment]
Not always (yay, local search).
Often (yay, big neighborhood).
For longer sentences, usually not.

How many steps to the answer in the worst case?
(What is the diameter of the search space?)

[Figure: initial order 6 2 5 8 4 3 7 1, target order 1 2 3 4 5 6 7 8]

Claim: only log2 N steps at worst (if you know where to step).
Let's sketch the proof!

Quicksort anything into, e.g., 1 2 3 4 5 6 7 8

[Figure: one right-branching colored tree over 6 2 5 8 4 3 7 1; one step
partitions the numbers, pulling the small half in front of the large half]

Quicksort anything into, e.g., 1 2 3 4 5 6 7 8

Only log2 N steps to get to 1 2 3 4 5 6 7 8 ...
... or to anywhere!

[Figure: a sequence of right-branching trees; each step partitions the
remaining blocks further, in parallel, quicksort-style]

Defining “best order”

What class of cost functions can we handle efficiently?
How fast can we compute a subtree's cost from its child subtrees?

Initial order: 1 2 3 4 5 6
Best order (according to some cost function): 1 4 2 5 6 3
How can we find this needle in the haystack of N! possible permutations?

Defining “best order”

What class of cost functions?

[Figure: a 6x6 cost matrix A]
The cost of the order 1 4 2 5 6 3 is the chain
a14 + a42 + a25 + a56 + a63 + a31
(each word's cost depends on the word that precedes it: TSP-style scoring).

How can we find this needle in the haystack of N! possible permutations?

Defining “best order”

What class of cost functions?

[Figure: a 6x6 cost matrix B]
b26 = cost of 2 preceding 6.
Add up n(n-1)/2 such costs; any order will incur either b26 or b62
(the Linear Ordering Problem, LOP).

Best order: 1 4 2 5 6 3
How can we find this needle in the haystack of N! possible permutations?

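For reference, the two cost classes are easy to state in code. A minimal
sketch (Python, illustrative only), with A and B indexed by word labels:

def tsp_cost(perm, A):
    """Chain cost: each word pays A[previous word][this word]."""
    return sum(A[perm[t]][perm[t + 1]] for t in range(len(perm) - 1))

def lop_cost(perm, B):
    """Linear ordering cost: pay B[x][y] for every pair with x anywhere
    before y; any order incurs exactly one of B[x][y], B[y][x] per pair."""
    return sum(B[perm[s]][perm[t]]
               for s in range(len(perm))
               for t in range(s + 1, len(perm)))
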
Defining “best order”

What class of cost functions?
- TSP and LOP are both NP-complete
- In fact, believed to be inapproximable:
  hard even to achieve C * optimal cost (for any C ≥ 1)

Practical approaches:
- correct answer, typically fast: branch-and-bound, ILP, ...
- fast answer, typically close to correct: beam search, this talk, ...

Defining “best order”

What class of cost functions?

Initial order: 1 2 3 4 5 6      Best order: 1 4 2 5 6 3

Cost of this order (can add these all up ...):
1. Does my favorite WFSA like this string of #s?  (generalizes TSP)
2. Non-local pair order OK?  (4 before 3 ...?)  (LOP)
3. Non-local triple order OK?  (1 ... 2 ... 3?)

Costs are derived from source sentence features

[Figure: French "Marie ne m’ a pas vu", tagged NNP NEG PRP AUX NEG VBN, in
initial order, with example cost matrices A and B]

e.g., ne would like to be brought adjacent to the next NEG word.

Costs are derived from source sentence features

[Figure: the same French sentence and cost matrices; one entry of B, the
cost of vu preceding Marie, is built up from weighted features:]
   50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
  +27: words at this distance shouldn't swap order
   -2: words with a PRP between them ought to swap
  ...
  = 75

Can also include phrase boundary symbols in the input!

Costs are derived from source sentence features

[Figure: the same French sentence and cost matrices A and B]

FSA costs:
- Distortion model
- Language model – looks ahead to the next step!
  (→ does the reordering allow a good finite-state translation into good English?)

Dynamic program must pick the tree that leads to the lowest-cost permutation

Initial order: 1 2 3 4 5 6  →  1 4 2 5 6 3

Cost of this order:
1. Does my favorite WFSA like it as a string?

Scoring with a weighted FSA

This particular WFSA implements TSP scoring for N=3:
- after you read 1, you're in state 1
- after you read 2, you're in state 2
- after you read 3, you're in state 3 ...
and this state determines the cost of the next symbol you read.

We'll handle a WFSA with Q states by using a fancier grammar, with
nonterminals. (Now the runtime goes up to O(N^3 Q^3) ...)

Including WFSA costs via nonterminals

A possible preterminal for word 2 is an arc in A that's labeled with 2.
The preterminal 42 rewrites as word 2, with a cost equal to that arc's cost.

[Figure: the string 5 6 1 4 2 3 tagged with the preterminals I5 56 61 14 42 23,
i.e. the WFSA path I → 5 → 6 → 1 → 4 → 2 → 3]

Including WFSA costs via nonterminals

[Figure: a tree over 5 6 1 4 2 3 whose internal nodes carry state-pair
nonterminals such as 63, 13, I6, and I3]

A constituent labeled 63 has a total cost equal to the total cost of the
best 6→3 path through the words it covers; the root constituent (I3) gives
the cost of the new permutation.

Dynamic program must pick the tree that leads to the lowest-cost permutation

Initial order: 1 2 3 4 5 6  →  1 4 2 5 6 3

Cost of this order:
1. Does my favorite WFSA like it as a string?
2. Non-local pair order OK?  (4 before 3 ...?)

Incorporating the pairwise ordering costs

[Figure: a block exchange that puts {5,6,7} before {1,2,3,4}]

So this hypothesis must add the costs
5 < 1, 5 < 2, 5 < 3, 5 < 4,
6 < 1, 6 < 2, 6 < 3, 6 < 4,
7 < 1, 7 < 2, 7 < 3, 7 < 4.

Uh-oh! So now it takes O(N^2) time to combine two subtrees,
instead of O(1) time?
Nope – dynamic programming to the rescue again!

Computing LOP cost of a block move

This puts {5,6,7} before {1,2,3,4}.
So we have to add O(N^2) costs just to consider this single neighbor!

Reuse work from other, "narrower" block moves:
[Figure: the {5,6,7}-before-{1,2,3,4} cost rectangle is assembled by adding
and subtracting rectangles for narrower blocks, all already computed at
earlier steps of parsing]
→ the new cost is computed in O(1)!
(A prefix-sum sketch of this bookkeeping follows below.)

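One concrete way to get that O(1) combination is to keep 2-D running sums of
the pairwise costs in the current order; this sketch (Python, illustrative
only, not necessarily the talk's exact bookkeeping) builds such a table and
reads off the cost change of any block exchange in constant time.

def block_cost_table(perm, B):
    """C[i][j] = sum of B[perm[x]][perm[y]] over all x < i, y < j."""
    n = len(perm)
    C = [[0.0] * (n + 1) for _ in range(n + 1)]
    for x in range(n):
        for y in range(n):
            C[x + 1][y + 1] = (B[perm[x]][perm[y]]
                               + C[x][y + 1] + C[x + 1][y] - C[x][y])
    return C

def block_move_delta(C, i, j, k):
    """Change in total pairwise cost when the block perm[j:k] is moved in
    front of the adjacent block perm[i:j]: every pair (x, y) with x in [i,j)
    and y in [j,k) flips from 'x before y' to 'y before x'."""
    old = C[j][k] - C[i][k] - C[j][j] + C[i][j]   # sum of B[x-word][y-word]
    new = C[k][j] - C[j][j] - C[k][i] + C[j][i]   # sum of B[y-word][x-word]
    return new - old
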
Incorporating 3-way ordering costs

See the initial paper (Eisner & Tromble 2006).

A little tricky, but:
- comes "for free" if you're willing to accept a certain restriction on these costs
- more expensive without that restriction, but possible

Another option: Markov chain Monte Carlo

Random walk in the space of permutations:
- interpret a permutation's cost as a negative log-probability:
  p(π) = exp(–cost(π)) / Z
- sample a permutation from the neighborhood
  instead of always picking the most probable

Why?
- Simulated annealing might beat greedy-with-random-restarts
- When learning the parameters of the distribution, can use
  sampling to compute the feature expectations

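For concreteness, a generic Metropolis step over the simple SWAP
neighborhood (Python, illustrative only; the talk samples from the much
larger tree-shaped neighborhoods):

import math, random

def metropolis_step(perm, cost, temperature=1.0):
    """Propose one adjacent swap and accept it with the Metropolis rule,
    treating p(pi) as proportional to exp(-cost(pi) / temperature)."""
    i = random.randrange(len(perm) - 1)
    proposal = list(perm)
    proposal[i], proposal[i + 1] = proposal[i + 1], proposal[i]
    delta = cost(proposal) - cost(perm)
    if delta <= 0 or random.random() < math.exp(-delta / temperature):
        return proposal        # accept the proposed swap
    return list(perm)          # reject: keep the current permutation
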
Another option: Markov chain Monte Carlo

Random walk in the space of permutations:
- interpret a permutation's cost as a negative log-probability:
  p(π) = exp(–cost(π)) / Z
- sample a permutation from the neighborhood
  instead of always picking the most probable

How?
- Pitfall: sampling a permutation ≠ sampling a tree
  (spurious ambiguity: some permutations have many trees)
- Solution: exclude some trees, leaving 1 per permutation
  - a normal form has long been known for colored trees
  - for restricted colored trees (which limit the size of blocks to swap),
    we've devised a more complicated normal form

Learning the costs

Where do these costs come from?
If we have some examples on which we know the true permutation,
we could try to learn them.

[Figure: the example cost matrices A and B]

Learning the costs

Where do these costs come from?
If we have some examples on which we know the true permutation,
we could try to learn them.
More precisely, try to learn the weights θ
(the knowledge that's reused across examples).

[Figure: entries of the cost matrices decomposed into weighted features, e.g.
   50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
  +27: words at this distance shouldn't swap order
   -2: words with a PRP between them ought to swap
  ...]

Experimenting with training LOP params
(LOP is quite fast: O(n^3) with no grammar constant)

[Figure: German "Das kann ich so aus dem Stand nicht sagen ."
tagged PDS VMFIN PPER ADV APPR ART NN PTKNEG VVINF $.,
with the pairwise cost B[7,9] highlighted]

Feature templates for cost of swapping i, j

[Figure: 22 feature templates, plus versions of all of these conjoined with
the (binned) distance j - i]

Feature templates for cost of swapping i, j

- Only LOP features so far
- And they're unnecessarily simple: 22 features
  (they don't examine syntactic constituency)
- And the input sequence is only words
  (not interspersed with syntactic brackets)

[Figure: the same templates, plus versions conjoined with the binned distance j - i]

Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)

MOSES baseline:  German → MOSES → English
Our pipeline:    German → LOP → German′ → MOSES → English

Define German′ to be German in English word order.
To get German′ for the training data, use Giza++ to align
all German positions to English positions (disallow NULL).

Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)

MOSES baseline:  German → MOSES → English
Our pipeline:    German → LOP → German′ → MOSES → English

Easy first try: Naïve Bayes
- Treat each feature in θ as independent
- Count and normalize over the training data
- No real improvement over baseline

Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)

MOSES baseline:  German → MOSES → English
Our pipeline:    German → LOP → German′ → MOSES → English

Easy second try: Perceptron
[Figure: weights θ0, θ1, ..., θn; under the current weights, local search
reaches a local optimum, which is compared against the gold-standard
(globally optimal) permutation π*, and the error drives the weight update]

Note: Search error can be beneficial, e.g., just take 1 step from the
identity permutation.
(A sketch of the update follows below.)

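Here is a minimal structured-perceptron sketch of that update loop (Python,
illustrative only). The caller supplies local_search (decoding under the
current weights) and features (a sparse feature vector as a dict); both
names, and the sign convention that costs are θ · features and are
minimized, are assumptions for the example.

def perceptron_epoch(examples, theta, features, local_search, lr=1.0):
    """One pass of perceptron training for a cost = theta . features model."""
    for source, gold_perm in examples:
        predicted = local_search(source, theta)     # reach a local optimum
        if predicted == gold_perm:
            continue
        gold_feats = features(source, gold_perm)
        pred_feats = features(source, predicted)
        for f in set(gold_feats) | set(pred_feats):
            # Move theta so the gold permutation becomes cheaper than the
            # current prediction (costs are minimized, hence pred - gold).
            theta[f] = theta.get(f, 0.0) + lr * (pred_feats.get(f, 0.0)
                                                 - gold_feats.get(f, 0.0))
    return theta
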
Benefit from reordering

Learning method             BLEU vs. German′   BLEU vs. English
No reordering               49.65              25.55
Naïve Bayes—POS             49.21
Naïve Bayes—POS+lexical     49.75
Perceptron—POS              50.05              25.92
Perceptron—POS+lexical      51.30              26.34

(obviously, not yet unscrambling German: need more features)

Alternatively, work back from gold standard

Contrastive estimation (Smith & Eisner 2005)
[Figure: the 1-step very-large-scale neighborhood around the gold-standard
permutation π*]

- Maximize the probability of the desired permutation
  relative to its ITG neighborhood
- Requires summing over all permutations in the neighborhood
  - must use normal-form trees here
- Stochastic gradient descent

Alternatively, work back from gold standard

k-best MIRA in the neighborhood
[Figure: the 1-step very-large-scale neighborhood around the gold standard
π*, with the current winners in the neighborhood]

- Make the gold standard beat its local competitors
- Beat the bad ones by a bigger margin
  - Good = close to gold in swap distance?
  - Good = close to gold using BLEU?
  - Good = translates into English that's close to the reference?

Alternatively, train each iterate

[Figure: a sequence of weight vectors θ0, θ1, ..., θn, one per local-search
step; at each step the model's best neighbor of the current permutation is
compared with the oracle permutation in that neighborhood, and the error
drives an update to that step's weights]

Or could do a k-best MIRA version of this, too;
could even use a loss measure based on lookahead to π(n).

Summary of part II

Local search is fun and easy
- Popular elsewhere in AI
- Closely related to MCMC sampling

Probably useful for translation
- Maybe other NP-hard problems too

Can efficiently use huge local neighborhoods
- Algorithms are closely related to parsing and FSMs
- Our community knows that stuff better than anyone!

Download