Automated Discovery of Helpful Structure in Large Text Collections
(advertisement)

Graphical Models Over String-Valued Random Variables
Jason Eisner
with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, and Michael Paul
ASRU, Dec. 2015
Pronunciation Dictionaries:
Probabilistic Inference of Strings
Jason Eisner
with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, and Michael Paul
ASRU, Dec. 2015
[Figure: the landscape of variables we'd like to recover – the lexicon (word types) with its semantics, plus tokens, sentences, resources, speech, and annotation – linked by relationships such as entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, discourse context, misspellings/typos, formatting, and entanglement.]

To recover variables, model and exploit their correlations.
Bayesian View of the World
[Figure: a probability distribution links observed data to hidden data.]
Bayesian NLP

Some good NLP questions:
• Underlying facts or themes that help explain this document collection?
• An underlying parse tree that helps explain this sentence?
• An underlying meaning that helps explain that parse tree?
• An underlying grammar that helps explain why these sentences are structured as they are?
• An underlying grammar or evolutionary history that helps explain why these words are spelled as they are?
Today’s Challenge
Too many words in a language!
Natural Language is Built from Words
Can store info about each word in a table

Index | Spelling | Meaning | Pronunciation    | Syntax
123   | ca       |         | [si.ei]          | NNP (abbrev)
124   | can      |         | [kɛɪn]           | NN
125   | can      |         | [kæn], [kɛn], …  | MD
126   | cane     |         | [keɪn]           | NN (mass)
127   | cane     |         | [keɪn]           | NN
128   | canes    |         | [keɪnz]          | NNS

(other columns would include translations, topics, counts, embeddings, …)
Problem: Too Many Words!

• Google analyzed 1 trillion words of English text
  – Found > 13M distinct words with count ≥ 200
• The problem isn't storing such a big table …
  it's acquiring the info for each row separately
  – Need lots of evidence, or help from human speakers
  – Hard to get for every word of the language
  – Especially hard for complex or “low-resource” languages
• Omit rare words?
  – Maybe, but many sentences contain them (Zipf's Law)
Technically speaking, # words = ∞

• Really the set of (possible) words is ∑*
• Names
• Neologisms
• Typos
• Productive processes:
  – friend → friendless → friendlessness → friendlessnessless → …
  – hand+bag → handbag (sometimes can iterate)
Technically speaking, # words = ∞

• Really the set of (possible) words is ∑*
• Turkish word: uygarlaştiramadiklarimizdanmişsinizcasina
  = uygar+laş+tir+ama+dik+lar+imiz+dan+miş+siniz+casina
  “(behaving) as if you are among those whom we could not cause to become civilized”
• Names
• Neologisms
• Typos
• Productive processes:
  – friend → friendless → friendlessness → friendlessnessless → …
  – hand+bag → handbag (sometimes can iterate)
A typical Polish verb (“to contain”)

• infinitive – Imperfective: zawierać; Perfective: zawrzeć
• present – Imperfective: zawieram, zawierasz, zawiera, zawieramy, zawieracie, zawierają
• past – Imperfective: zawierałem/zawierałam, zawierałeś/zawierałaś, zawierał/zawierała/zawierało, zawieraliśmy/zawierałyśmy, zawieraliście/zawierałyście, zawierali/zawierały; Perfective: zawarłem/zawarłam, zawarłeś/zawarłaś, zawarł/zawarła/zawarło, zawarliśmy/zawarłyśmy, zawarliście/zawarłyście, zawarli/zawarły
• future – Imperfective: będę zawierał/zawierała, będziesz zawierał/zawierała, będzie zawierał/zawierała/zawierało, będziemy zawierali/zawierały, będziecie zawierali/zawierały, będą zawierali/zawierały; Perfective: zawrę, zawrzesz, zawrze, zawrzemy, zawrzecie, zawrą
• conditional – Imperfective: zawierałbym/zawierałabym, zawierałbyś/zawierałabyś, zawierałby/zawierałaby/zawierałoby, zawieralibyśmy/zawierałybyśmy, zawieralibyście/zawierałybyście, zawieraliby/zawierałyby; Perfective: zawarłbym/zawarłabym, zawarłbyś/zawarłabyś, zawarłby/zawarłaby/zawarłoby, zawarlibyśmy/zawarłybyśmy, zawarlibyście/zawarłybyście, zawarliby/zawarłyby
• imperative – Imperfective: zawieraj, zawierajmy, zawierajcie, niech zawiera, niech zawierają; Perfective: zawrzyj, zawrzyjmy, zawrzyjcie, niech zawrze, niech zawrą
• present active participle: zawierający, -a, -e; -y, -e
• present passive participle: zawierany, -a, -e; -, -e
• past passive participle: zawarty, -a, -e; -, -te
• adverbial participle: zawierając

100 inflected forms per verb
Sort of predictable from one another!
(verbs are more or less regular)
Solution: Don't model every cell separately
[Figure: periodic-table analogy, with labels for “positive ions” and “noble gases”.]
Can store info about each word in a table

Index | Spelling | Meaning | Pronunciation    | Syntax
123   | ca       |         | [si.ei]          | NNP (abbrev)
124   | can      |         | [kɛɪn]           | NN
125   | can      |         | [kæn], [kɛn], …  | MD
126   | cane     |         | [keɪn]           | NN (mass)
127   | cane     |         | [keɪn]           | NN
128   | canes    |         | [keɪnz]          | NNS

(other columns would include translations, topics, counts, embeddings, …)
What’s in the table? NLP strings are diverse …

• Use
  – Orthographic (spelling)
  – Phonological (pronunciation)
  – Latent (intermediate steps not observed directly)
• Size
  – Morphemes (meaningful subword units)
  – Words
  – Multi-word phrases, including “named entities”
  – URLs
What’s in the table? NLP strings are diverse …

• Language
  – English, French, Russian, Hebrew, Chinese, …
  – Related languages (Romance langs, Arabic dialects, …)
  – Dead languages (common ancestors) – unobserved?
  – Transliterations into different writing systems
• Medium
  – Misspellings
  – Typos
  – Wordplay
  – Social media
Some relationships within the table

• spelling ↔ pronunciation
• word ↔ noisy word (e.g., with a typo)
• word ↔ related word in another language
  (loanwords, language evolution, cognates)
• singular ↔ plural (for example)
• (root, binyan) ↔ word
• underlying form ↔ surface form
Reconstructing the (multilingual) lexicon

Ultimate goal: probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text.
Needed: exploit the relationships (arrows). May have to discover those relationships.
Approach: linguistics + generative modeling + statistical inference.
Modeling ingredients: finite-state machines, graphical models, CRP.
Inference ingredients: MCMC, BP/EP, DD.

[Shown over the table from before (Index, Spelling, Meaning, Pronunciation, Syntax; other columns would include translations, topics, distributional info such as counts, …).]
Today’s Focus: Phonology
(but methods also apply to other relationships among strings)
What is Phonology?
Orthography: cat
Phonology: [kæt]
• Phonology explains regular sound patterns
What is Phonology?
Orthography: cat
Phonology: [kæt]
Phonetics: (acoustic waveform)
• Phonology explains regular sound patterns
• Not phonetics, which deals with acoustics
Q: What do phonologists do?
A: They find patterns among the pronunciations of words.
A Phonological Exercise

Verbs  | 1P Pres. Sg. | 3P Pres. Sg. | Past Tense | Past Part.
TALK   | [tɔk]        | [tɔks]       | [tɔkt]     | [tɔkt]
THANK  | [θeɪŋk]      | [θeɪŋks]     | [θeɪŋkt]   | [θeɪŋkt]
HACK   | [hæk]        | [hæks]       | [hækt]     | [hækt]
CRACK  |              | [kɹæks]      |            | [kɹækt]
SLAP   | [slæp]       |              |            | [slæpt]
Matrix Completion: Collaborative Filtering
[Figure: a Users × Movies matrix of observed ratings (e.g., -37, -36, -24, 29, 67, 61, -79, 19, 77, 74, 29, 22, 12, -41, -52, -39), with many cells missing.]
Matrix Completion: Collaborative Filtering
[Figure: the same Users × Movies matrix, now annotated with a latent vector for each user (e.g., [4 1 -5], [7 -2 0], [6 -2 3], [-9 1 4], [3 8 -5]) and each movie (e.g., [9 -7 2], [4 3 -2], [9 -2 1], [-6 -3 2]).]
Matrix Completion: Collaborative Filtering
[Figure: the latent vectors fill in missing cells of the matrix (e.g., 59, 6, -80, 46). Prediction!]
Matrix Completion: Collaborative Filtering
[Figure: a user vector [1,-4,3] and a movie vector [-5,2,1] combine by dot product to give -10; Gaussian noise yields the observed rating -11.]
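To make the picture concrete, here is a minimal Python sketch (with illustrative names and values) of the rating model in the figure: each user and each movie has a latent vector, the predicted rating is their dot product, and an observed rating adds Gaussian noise.

```python
import random

def predict_rating(user_vec, movie_vec):
    """Predicted rating = dot product of the latent vectors."""
    return sum(u * m for u, m in zip(user_vec, movie_vec))

def observe_rating(user_vec, movie_vec, noise_sd=1.0):
    """An observed rating is the prediction plus Gaussian noise."""
    return predict_rating(user_vec, movie_vec) + random.gauss(0.0, noise_sd)

# The example from the slide: [1,-4,3] . [-5,2,1] = -10, observed as roughly -11.
print(predict_rating([1, -4, 3], [-5, 2, 1]))   # -10
print(observe_rating([1, -4, 3], [-5, 2, 1]))   # e.g. -11.3
```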
Matrix Completion: Collaborative Filtering
[Figure: as before, dot products of the latent vectors predict the missing ratings. Prediction!]
A Phonological Exercise
[The verb × tense table from before, with the same missing cells.]
A Phonological Exercise

Now each column is a suffix and each row is a stem:

Stems            | /Ø/ (1P Pres. Sg.) | /s/ (3P Pres. Sg.) | /t/ (Past Tense) | /t/ (Past Part.)
/tɔk/   TALK     | [tɔk]              | [tɔks]             | [tɔkt]           | [tɔkt]
/θeɪŋk/ THANK    | [θeɪŋk]            | [θeɪŋks]           | [θeɪŋkt]         | [θeɪŋkt]
/hæk/   HACK     | [hæk]              | [hæks]             | [hækt]           | [hækt]
/kɹæk/  CRACK    |                    | [kɹæks]            |                  | [kɹækt]
/slæp/  SLAP     | [slæp]             |                    |                  | [slæpt]
A Phonological Exercise

Stems            | /Ø/ (1P Pres. Sg.) | /s/ (3P Pres. Sg.) | /t/ (Past Tense) | /t/ (Past Part.)
/tɔk/   TALK     | [tɔk]              | [tɔks]             | [tɔkt]           | [tɔkt]
/θeɪŋk/ THANK    | [θeɪŋk]            | [θeɪŋks]           | [θeɪŋkt]         | [θeɪŋkt]
/hæk/   HACK     | [hæk]              | [hæks]             | [hækt]           | [hækt]
/kɹæk/  CRACK    | [kɹæk]             | [kɹæks]            | [kɹækt]          | [kɹækt]
/slæp/  SLAP     | [slæp]             | [slæps]            | [slæpt]          | [slæpt]

Prediction!
Why “talks” sounds like that

tɔk + s
  ↓ Concatenate
tɔks
“talks”
A Phonological Exercise

Stems            | /Ø/ (1P Pres. Sg.) | /s/ (3P Pres. Sg.) | /t/ (Past Tense) | /t/ (Past Part.)
/tɔk/   TALK     | [tɔk]              | [tɔks]             | [tɔkt]           | [tɔkt]
/θeɪŋk/ THANK    | [θeɪŋk]            | [θeɪŋks]           | [θeɪŋkt]         | [θeɪŋkt]
/hæk/   HACK     | [hæk]              | [hæks]             | [hækt]           | [hækt]
/kɹæk/  CRACK    |                    | [kɹæks]            |                  | [kɹækt]
/slæp/  SLAP     | [slæp]             |                    |                  | [slæpt]
/koʊd/  CODE     |                    | [koʊdz]            |                  | [koʊdɪd]
/bæt/   BAT      | [bæt]              |                    |                  | [bætɪd]
A Phonological Exercise
[Same table as above.] But note: [koʊdz] has z instead of s, and [koʊdɪd], [bætɪd] have ɪd instead of t.
A Phonological Exercise
[Same table, with one more new row:]

/it/  EAT        | [it]               |                    | [eɪt]            | [itən]

The past tense is [eɪt] instead of the expected [itɪd].
Why “codes” sounds like that

koʊd + s
  ↓ Concatenate
koʊd#s
  ↓ Phonology (stochastic)
koʊdz
“codes”

Modeling word forms using latent underlying morphs and phonology. Cotterell et al., TACL 2015.
Why “resignation” sounds like that

rizaign + ation
  ↓ Concatenate
rizaign#ation
  ↓ Phonology (stochastic)
rεzɪgneɪʃn
“resignation”
Fragment of Our Graph for English

1) Morphemes:        rizaign    s    eɪʃən    dæmn
   (the 3rd-person singular suffix s is very common!)
        ↓ Concatenation
2) Underlying words: rizaign#eɪʃən   rizaign#s   dæmn#eɪʃən   dæmn#s
        ↓ Phonology
3) Surface words:    r,εzɪgn’eɪʃn “resignation”   riz’ajnz “resigns”   d,æmn’eɪʃn “damnation”   d’æmz “damns”
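A minimal sketch of this three-layer structure, assuming a placeholder surface-form lookup in place of the real stochastic phonology; the morph spellings and words follow the slide.

```python
# Layer 1: morpheme underlying representations (URs), as on the slide.
morphs = {"RESIGN": "rizaign", "3SG": "s", "ATION": "eɪʃən", "DAMN": "dæmn"}

# Layer 2: each word is a sequence of abstract morphemes;
# its underlying form concatenates their morphs with '#'.
words = {
    "resignation": ["RESIGN", "ATION"],
    "resigns":     ["RESIGN", "3SG"],
    "damnation":   ["DAMN", "ATION"],
    "damns":       ["DAMN", "3SG"],
}

def underlying_form(word):
    return "#".join(morphs[m] for m in words[word])

# Layer 3: the surface form is a stochastic function of the underlying
# form (the phonology). Here just a placeholder lookup for illustration.
surface = {"resignation": "rεzɪgneɪʃn", "resigns": "rizajnz",
           "damnation": "dæmneɪʃn", "damns": "dæmz"}

for w in words:
    print(w, underlying_form(w), "->", surface[w])
```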
Handling Multimorphemic Words

• Matrix completion: each word is built from one stem (row) + one suffix (column). WRONG
• Graphical model: a word can be built from any number of morphemes (parents). RIGHT

Example: gə + liːb + t → gəliːbt → gəliːpt, “geliebt” (German: loved)
Limited to concatenation?
No, could extend to templatic morphology …
A (Simple) Model of Phonology
[Recall collaborative filtering: vectors [1,-4,3] and [-5,2,1] → dot product -10 → Gaussian noise → observed -11.]

Analogously for strings:
rizaign + s
  ↓ Concatenate
rizaign#s
  ↓ Phonology (stochastic) S_θ
rizajnz
“resigns”
Phonology as an Edit Process

[Animation: the upper string r i z a i g n s (the underlying form) is transduced to a lower string one symbol at a time. At each step the model sees the upper left context, the upper right context, and the lower left context, and chooses an edit action – here copying r, i, z, a, i, deleting g (g → ɛ), copying n, and substituting z for s.]
Phonology as an Edit Process

At each position, contextual features (of the upper left, upper right, and lower left contexts) are combined with the learned feature weights to give a probability for each possible edit action:

Action  | Prob
DEL     | .75
COPY    | .01
SUB(A)  | .05
SUB(B)  | .03
...     | ...
INS(A)  | .02
INS(B)  | .01
...     | ...
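A hedged sketch of the scoring idea on this slide: contextual features of each candidate edit action are combined with learned weights, and a softmax turns the scores into action probabilities like those in the table. The feature templates and the toy weight vector below are illustrative, not the paper's exact parameterization.

```python
import math

ACTIONS = ["COPY", "DEL"] + [f"SUB({c})" for c in "abn"] + [f"INS({c})" for c in "abn"]

def features(action, upper_left, upper_right, lower_left):
    """Illustrative features of an action in its context (not the paper's exact templates)."""
    feats = {f"ACTION={action}": 1.0}
    if upper_right:
        feats[f"{action}&NEXT_UPPER={upper_right[0]}"] = 1.0
    if lower_left:
        feats[f"{action}&PREV_LOWER={lower_left[-1]}"] = 1.0
    return feats

def action_probs(weights, upper_left, upper_right, lower_left):
    """Softmax over edit actions, each scored by a dot product of weights and features."""
    scores = {}
    for a in ACTIONS:
        f = features(a, upper_left, upper_right, lower_left)
        scores[a] = sum(weights.get(k, 0.0) * v for k, v in f.items())
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}

# Deciding what to do with the upper symbol 'g' while transducing rizaign#s:
probs = action_probs({"ACTION=DEL": 2.0}, upper_left="rizai",
                     upper_right="gn#s", lower_left="rizai")
print(max(probs, key=probs.get))   # DEL is favored under this toy weight vector
```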
Phonology as an Edit Process

[The same picture, labeled over several slides: the contextual features and the feature weights define a feature function; together they score the transduction that maps the upper (underlying) string to the surface form.]
Phonological Attributes
[Figure: phonemes described by binary attributes (+ and -).]
Phonology as an Edit Process

[The edit-process picture again. The deletion of g fires faithfulness features:]
Faithfulness Features
• EDIT(g, ɛ)
• EDIT(+cons, ɛ)
• EDIT(+voiced, ɛ)
Phonology as an Edit Process

[Markedness features fire on the output string itself, e.g. on the adjacent surface symbols a and i:]
Markedness Features
• BIGRAM(a, i)
• BIGRAM(-high, -low)
• BIGRAM(+back, -back)
Inference for Phonology
Bayesian View of the World
[Figure: a probability distribution links observed data to hidden data.]

Observed: some surface words, e.g. r,εzɪgn’eɪʃn, d,æmn’eɪʃn, d’æmz.
Hidden: the rest of the graph: the morphs rizaign, s, eɪʃən, dæmn; the underlying words rizaign#eɪʃən, rizaign#s, dæmn#eɪʃən, dæmn#s; and unobserved surface words such as riz’ajnz.
Bayesian View of the World

observed data:
• some of the words

hidden data:
• the rest of the words!
• all of the morphs
• the parameter vectors θ, φ
Why this matters

• Phonological grammars are usually hand-engineered by phonologists.
• Linguistics goal: create an automated phonologist?
• Cognitive science goal: model how babies learn phonology?
• Engineering goal: analyze and generate words we haven't heard before?
The Generative Story
(defines which iceberg shapes are likely)

1. Sample the parameters φ and θ from priors.
   These parameters describe the grammar of a new language: what tends to happen in the language.
2. Now choose the lexicon of morphs and words:
   – For each abstract morpheme a ∈ A, sample the morph M(a) ~ M_φ
   – For each abstract word a = a1, a2, ···, sample its surface pronunciation S(a) from S_θ(· | u), where u = M(a1)#M(a2)···
3. This lexicon can now be used to communicate.
   A word's pronunciation is now just looked up, not sampled; so it is the same each time it is used.
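A minimal sketch of steps 2–3 of this story, with toy stand-ins for M_φ and S_θ (a random morph generator and a lossy copier); the point is the control flow: each morph is sampled once, and each word's pronunciation is sampled once and then reused.

```python
import random

ALPHABET = "abdgiklmnrsz"

def sample_morph():
    """Toy stand-in for M_phi: sample a random morph spelling."""
    return "".join(random.choice(ALPHABET) for _ in range(random.randint(1, 5)))

def sample_surface(underlying):
    """Toy stand-in for S_theta(. | u): copy u, dropping '#' and occasional symbols."""
    return "".join(c for c in underlying if c != "#" and random.random() > 0.1)

def generate_lexicon(abstract_words):
    """abstract_words: dict word -> list of abstract morphemes."""
    morph = {}                              # step 2a: one morph per abstract morpheme
    for w, morphemes in abstract_words.items():
        for a in morphemes:
            if a not in morph:
                morph[a] = sample_morph()
    lexicon = {}                            # step 2b: one pronunciation per word
    for w, morphemes in abstract_words.items():
        u = "#".join(morph[a] for a in morphemes)
        lexicon[w] = sample_surface(u)
    return morph, lexicon                   # step 3: reuse lexicon[w] every time w is used

morph, lexicon = generate_lexicon({"resigns": ["RESIGN", "3SG"],
                                   "resignation": ["RESIGN", "ATION"]})
print(morph, lexicon)
```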
Why Probability?

• A language's morphology and phonology are fixed, but probability models the learner's uncertainty about what they are.
• Advantages:
  – Quantification of irregularity (“singed” vs. “sang”)
  – Soft models admit efficient learning and inference
• Our use is orthogonal to the way phonologists currently use probability to explain gradient phenomena.
Basic Methods for Inference and Learning

Train the Parameters using EM (Dempster et al. 1977)

• E-Step (“inference”):
  – Infer the hidden strings (posterior distribution), given the observed surface words r,εzɪgn’eɪʃn, d,æmn’eɪʃn, d’æmz: the morphs rizaign, s, eɪʃən, dæmn; the underlying words rizaign#eɪʃən, rizaign#s, dæmn#eɪʃən, dæmn#s; and the unobserved surface word riz’ajnz.
• M-Step (“learning”):
  – Improve the continuous model parameters θ, φ (gradient descent: the E-step provides supervision, e.g. the inferred character-level alignment of rizaign#s to riz’ajnz).
• Repeat till convergence.
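A skeleton of this EM loop, with the E-step and M-step left as clearly labeled stubs (`infer_posterior` and `gradient_step` are hypothetical placeholders for the inference and learning routines described in the rest of the talk).

```python
def infer_posterior(observed_words, theta, phi):
    """Stub E-step: in the real system this runs inference (BP/EP/DD) over the factor graph."""
    return {"expected_counts": {}}          # placeholder posterior summary

def gradient_step(posterior, theta, phi, lr=0.1):
    """Stub M-step: in the real system this is gradient ascent on the expected
    log-likelihood, using the E-step's posterior as soft supervision."""
    return theta, phi                       # parameters unchanged in this toy stub

def em_train(observed_words, theta, phi, iterations=10):
    for _ in range(iterations):
        posterior = infer_posterior(observed_words, theta, phi)   # E-step ("inference")
        theta, phi = gradient_step(posterior, theta, phi)         # M-step ("learning")
    return theta, phi

theta, phi = em_train(["rεzɪgneɪʃn", "dæmneɪʃn", "dæmz"], theta={}, phi={})
```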
Directed Graphical Model
(defines the probability of a candidate solution)

Inference step: find high-probability reconstructions of the hidden variables.

1) Morphemes:        rizaign    s    eɪʃən    dæmn
        ↓ Concatenation
2) Underlying words: rizaign#eɪʃən   rizaign#s   dæmn#eɪʃən   dæmn#s
        ↓ Phonology
3) Surface words:    r,εzɪgn’eɪʃn   riz’ajnz   d,æmn’eɪʃn   d’æmz

A solution has high probability if each string is likely given its parents.
Equivalent Factor Graph
(defines the probability of a candidate solution)

[The same three layers: morphemes rizaign, s, eɪʃən, dæmn; underlying words rizaign#eɪʃən, rizaign#s, dæmn#eɪʃən, dæmn#s; surface words r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz; with the concatenation and phonology dependencies drawn as factors.]
Directed Graphical Model
[The three-layer graph from before.]
Equivalent Factor Graph
[The same graph, drawn as a factor graph.]
• Each ellipse is a random variable.
• Each square is a “factor” – a function that jointly scores the values of its few neighboring variables.
Dumb Inference by Hill-Climbing

Observed word SRs: r,εzɪgn’eɪʃn, riz’ajnz, riz’ajnd. Hidden: the morpheme URs and word URs.

• Start with arbitrary guesses for the morpheme URs, say foo, bar, s, da.
• The word URs are then determined by concatenation: bar#foo, bar#s, bar#da.
• Score the solution: the morph prior gives probabilities like 0.01, 8e-3, 0.05, 0.02 to the morphs, but the phonology factors give tiny probabilities (7e-1100, 6e-1200, 2e-1300) to producing r,εzɪgn’eɪʃn, riz’ajnz, riz’ajnd from these URs.
• Propose changing one variable at a time, keeping changes that improve the overall probability: bar → far? → size? → … → rizajn. Now the word URs are rizajn#foo, rizajn#s, rizajn#da, and the phonology factors improve to 2e-5, 0.01, 0.008.
• Continue: foo → eɪʃn, da → d, giving rizajn#eɪʃn (0.001), rizajn#s (0.01), rizajn#d (0.015).
• Continue: rizajn → rizajgn, giving rizajgn#eɪʃn (0.008), rizajgn#s (0.008), rizajgn#d (0.013) – a better explanation of r,εzɪgn’eɪʃn.
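A minimal sketch of this kind of “dumb” hill-climbing: repeatedly propose a small random edit to one hidden morph and keep it only if the overall score improves. The scoring function here is a toy stand-in that merely rewards character overlap with the observed surface forms; it is not the actual phonology model.

```python
import random

ALPHABET = list("abdgijlmnrszɪε")

def toy_score(morphs, observed):
    """Toy stand-in for the joint probability: rewards underlying forms
    whose characters overlap with the observed surface forms."""
    score = 0.0
    for (stem, suffix), surface in observed.items():
        u = morphs[stem] + "#" + morphs[suffix]
        score += sum(min(u.count(c), surface.count(c)) for c in set(surface))
        score -= 0.1 * len(u)                  # mild preference for short morphs
    return score

def hill_climb(observed, morph_names, steps=5000):
    morphs = {m: "" for m in morph_names}      # start with (bad) empty guesses
    best = toy_score(morphs, observed)
    for _ in range(steps):
        m = random.choice(morph_names)         # propose editing one morph
        proposal = dict(morphs)
        s = list(proposal[m])
        if s and random.random() < 0.3:
            s.pop(random.randrange(len(s)))    # delete a character
        else:
            s.insert(random.randrange(len(s) + 1), random.choice(ALPHABET))
        proposal[m] = "".join(s)
        new = toy_score(proposal, observed)
        if new > best:                         # keep only improvements
            morphs, best = proposal, new
    return morphs

observed = {("RESIGN", "ATION"): "rεzɪgneɪʃn",
            ("RESIGN", "3SG"): "rizajnz",
            ("RESIGN", "PAST"): "rizajnd"}
print(hill_climb(observed, ["RESIGN", "ATION", "3SG", "PAST"]))
```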
Dumb Inference by Hill-Climbing

• Can we make this any smarter? This naïve method would be very slow. And it could wander around forever, get stuck in local maxima, etc.
• Alas, the problem of finding the best values in a factor graph is undecidable! (Can't even solve by brute force, because strings have unbounded length.)
• Our options:
  – Exact methods that might not terminate (but do in practice)
  – Approximate methods – which try to recover not just the best values, but the posterior distribution of values
• All our methods are based on finite-state automata.
A Generative Model of Phonology
• A Directed Graphical Model of the lexicon
[Figure: the graph with observed surface words dˈæmz, rizˈajnz, rˌɛzɪgnˈeɪʃən.]
About to sell our mathematical soul?
Give up lovely dynamic programming?
[Slide art: “Insight” vs. “Big Models”.]
Give up lovely dynamic programming?

• Not quite! Yes, general algorithms … which call specialized algorithms as subroutines.
• Within a framework such as belief propagation, we may run
  – parsers (Smith & Eisner 2008)
  – finite-state machines (Dreyer & Eisner 2009)
• A step of belief propagation takes time O(k^n) in general
  – To update one message from a factor that coordinates n variables that have k possible values each
  – If that's slow, we can sometimes exploit special structure in the factor!
    • large n: a parser uses dynamic programming to coordinate many variables
    • infinite k: FSMs use dynamic programming to coordinate strings
Distribution Over Surface Form:
[Figure: a distribution over strings, drawn as a weighted automaton and attached to one variable of the graph near the observed words dˈæmz, rizˈajnz, rˌɛzɪgnˈeɪʃən. It assigns probabilities such as:]

UR            | Prob
dæmeɪʃən      | .80
dæmneɪʃən     | .10
dæmineɪʃən    | .001
dæmiineɪʃən   | .0001
…             | …
chomsky       | .000001
…             | …
Experimental Design

Experimental Datasets
• 7 languages from different families
  – Maori, Tangale, Indonesian, Catalan – from homework exercises (43 to 71 observed words per experiment): can we generalize correctly from small data?
  – English, Dutch, German – from CELEX (200 to 800 observed words per experiment): can we scale up to larger datasets? Can we handle naturally occurring datasets that have more irregularity?
Evaluation Setup

Observed at training time: some surface words, e.g. r,εzɪgn’eɪʃn, riz’ajnz, d’æmz.
Held out: other words in the graph, e.g. d,æmn’eɪʃn – did we guess this pronunciation right?
Exploring the Evaluation Metrics

• 1-best error rate
  – Is the 1-best correct?
• Cross Entropy
  – What is the probability of the correct answer?
• Expected Edit Distance
  – How close am I on average?
• Average over many training-test splits

Distribution Over Surface Form:
UR            | Prob
* dæmeɪʃən    | .80
dæmneɪʃən     | .10
dæmineɪʃən    | .001
dæmiineɪʃən   | .0001
…             | …
chomsky       | .000001
…             | …
Evaluation

• Metrics (lower is always better):
  – 1-best error rate (did we get it right?)
  – cross-entropy (what probability did we give the right answer?)
  – expected edit-distance (how far away on average are we?)
  – Average each metric over many training-test splits
• Comparisons:
  – Lower bound: phonology as noisy concatenation
  – Upper bound: oracle URs from linguists
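A small sketch of these three metrics, applied to a predicted distribution over surface forms (a dict of probabilities); the edit distance is the standard Levenshtein DP, and the toy distribution is the one from the earlier slide.

```python
import math

def edit_distance(a, b):
    """Standard Levenshtein distance by dynamic programming."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[len(b)]

def one_best_error(dist, gold):
    """1 if the most probable form is wrong, else 0."""
    return 0 if max(dist, key=dist.get) == gold else 1

def cross_entropy(dist, gold, floor=1e-12):
    """Negative log probability assigned to the correct answer (in bits)."""
    return -math.log2(max(dist.get(gold, 0.0), floor))

def expected_edit_distance(dist, gold):
    """Edit distance to the correct answer, averaged under the prediction."""
    return sum(p * edit_distance(form, gold) for form, p in dist.items())

dist = {"dæmeɪʃən": 0.80, "dæmneɪʃən": 0.10, "dæmineɪʃən": 0.001}
gold = "dæmneɪʃən"
print(one_best_error(dist, gold), cross_entropy(dist, gold),
      expected_edit_distance(dist, gold))
```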
Evaluation Philosophy

• We're evaluating a language learner, on languages we didn't examine when designing the learner.
• We directly evaluate how well our learner predicts held-out words that the learner didn't see.
• No direct evaluation of intermediate steps:
  – Did we get the “right” underlying forms?
  – Did we learn a “simple” or “natural” phonology?
  – It's hard to judge the answers. Anyway, we only want the answers to be “yes” because we suspect that this will give us a more predictive theory. So let's just see if the theory is predictive. Proof is in the pudding!
• Caveat: Linguists and child language learners also have access to other kinds of data that we're not considering yet.
Results
(using Loopy Belief Propagation for inference)

[Charts: German results (error bars via bootstrap resampling); CELEX results; phonological-exercise results; gold UR recovery.]
Formalizing Our Setup

Many scoring functions on strings (e.g., our phonology model) can be represented using FSMs.
What class of functions will we allow for the factors (the black squares)?

[The factor graph from before: morphemes rizaign, s, eɪʃən, dæmn; underlying words rizaign#eɪʃən, rizaign#s, dæmn#eɪʃən, dæmn#s (via Concatenation); surface words r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz (via Phonology).]
Real-Valued Functions on Strings

We'll need to define some nonnegative functions:
• f(x) = score of a string
• f(x,y) = score of a pair of strings

Can represent deterministic processes: f(x,y) ∈ {1, 0}
• Is y the observable result of deleting latent material from x?

Can represent probability distributions, e.g.,
• f(x) = p(x) under some “generating process”
• f(x,y) = p(x,y) under some “joint generating process”
• f(x,y) = p(y | x) under some “transducing process”
Restrict to Finite-State Functions

• One string input, Boolean output: an acceptor with arcs labeled by symbols such as c, a, e.
• Two string inputs (on 2 tapes): a transducer with arcs such as c:z, a:x, e:y.
• Real output: weighted arcs, e.g. c/.7, a/.5, e/.5 with stopping weight .3 (or c:z/.7, a:x/.5, e:y/.5 for two tapes).

Path weight = product of arc weights
Score of input = total weight of accepting paths
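A sketch of the “real output” case: a tiny weighted acceptor whose score for an input string is the total weight of its accepting paths, computed by the forward algorithm. The topology and weights below are illustrative, loosely echoing the figure.

```python
from collections import defaultdict

# A tiny weighted FSA: states 0 (start) and 1 (final).
# Arcs: (source, symbol) -> list of (target, weight). Weights are illustrative.
ARCS = {
    (0, "c"): [(0, 0.7)],
    (0, "a"): [(0, 0.5), (1, 0.3)],   # 'a' can stay in state 0 or move to the final state
    (1, "e"): [(1, 0.5)],
}
START, FINALS = 0, {1}

def score(string):
    """Total weight of all accepting paths for `string` (forward algorithm)."""
    forward = defaultdict(float)
    forward[START] = 1.0
    for symbol in string:
        nxt = defaultdict(float)
        for state, w in forward.items():
            for target, arc_w in ARCS.get((state, symbol), []):
                nxt[target] += w * arc_w      # sum over paths, product along a path
        forward = nxt
    return sum(w for s, w in forward.items() if s in FINALS)

print(score("ca"))    # 0.7 * 0.3 = 0.21
print(score("cae"))   # 0.7 * 0.3 * 0.5 = 0.105
print(score("cc"))    # 0.0 -- no accepting path
```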
Example: Stochastic Edit Distance p(y|x)

[Figure: a one-state transducer with O(k) deletion arcs, O(k²) substitution arcs, O(k) insertion arcs, and O(k) identity arcs, where k is the alphabet size. Likely edits = high-probability arcs.]
Computing p(y|x)

Given (x,y), construct a graph of all accepting paths in the original FSM: a lattice whose axes are the position in the upper string x (e.g. c a c a) and the position in the lower string y (e.g. c l a r a).
The paths through this lattice are different explanations for how x could have been edited to yield y (x-to-y alignments).
Use dynamic programming to find the highest-probability path, or the total probability of all paths.
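A sketch of this lattice computation: sum over all alignments of the upper string to the lower string, where each step copies, substitutes, inserts, or deletes with some probability. For simplicity it uses one global probability per edit type (so it is not a properly normalized channel); it illustrates the dynamic program, not the exact model.

```python
def p_y_given_x(x, y, p_copy=0.8, p_sub=0.05, p_ins=0.05, p_del=0.1):
    """Total probability of all edit sequences (alignments) turning x into y."""
    n, m = len(x), len(y)
    # chart[i][j] = total prob of having produced y[:j] while consuming x[:i]
    chart = [[0.0] * (m + 1) for _ in range(n + 1)]
    chart[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            p = chart[i][j]
            if p == 0.0:
                continue
            if i < n:                                  # delete x[i]
                chart[i + 1][j] += p * p_del
            if j < m:                                  # insert y[j]
                chart[i][j + 1] += p * p_ins
            if i < n and j < m:                        # copy or substitute
                step = p_copy if x[i] == y[j] else p_sub
                chart[i + 1][j + 1] += p * step
    return chart[n][m]

# The alignment lattice from the slide: upper string "caca", lower string "clara".
print(p_y_given_x("caca", "clara"))
```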
Why restrict to finite-state functions?

• Can always compute f(x,y) efficiently.
  – Construct the graph of accepting paths by FSM composition.
  – Sum over them via dynamic programming, or by solving a linear system.
• Finite-state functions are closed under useful operations:
  – Marginalization: h(x) = ∑_y f(x,y)
  – Pointwise product: h(x) = f(x) · g(x)
  – Join: h(x,y,z) = f(x,y) · g(y,z)
Define a function family

• Use finite-state machines (FSMs).
• The arc weights are parameterized.
• We tune the parameters to get weights that predict our training data well.
• The FSM topology defines a function family.
• In practice, generalizations of stochastic edit distance:
  – So, we are learning the edit probabilities.
  – With more states, these can depend on left and right context.
Probabilistic FSTs
[Figures: example probabilistic finite-state transducers.]
Computational Hardness

Ordinary graphical model inference is sometimes easy and sometimes hard, depending on graph topology. But over strings, it can be hard even with simple topologies and simple finite-state factors.

Simple model family can be NP-hard: the multi-sequence alignment problem.
• Generalize edit distance to k strings of length O(n).
• Dynamic programming would seek the best path in a hypercube of size O(n^k).
• Similar to the Steiner string problem (“consensus”).
[Figure: the alignment lattice again.]
Simple model family can be undecidable (!)

Post's Correspondence Problem (1946): given a 2-tape FSM f(x,y) of a certain form, is there a string z for which f(z,z) > 0?

[Figure: an example machine f with arcs such as ε:a, a:a, b:b, a:b, b:a, ε:ε, a:ε, and a witness z = bbaabbbaa, built up as bba / bba+ab / bba+ab+bba / bba+ab+bba+aεε on one tape against bbε / bbε+aa / bbε+aa+bbε / bbε+aa+bbε+baa on the other.]

No Turing machine can decide this in general. So no Turing machine can determine, in general, whether this simple factor graph (with variables x, y, and a shared z) has any positive-weight solutions.
Inference by Belief Propagation
Loopy belief propagation (BP)

• The beautiful observation (Dreyer & Eisner 2009):
  – Each message is a 1-tape FSM that scores all strings.
  – Messages are iteratively updated based on other messages, according to the rules of BP.
  – The BP rules require only operations under which FSMs are closed!
• Achilles' heel:
  – The FSMs may grow large as the algorithm iterates.
  – So the algorithm may be slow, or not converge at all.
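To show the BP arithmetic itself, here is a tiny sketch on a graph with one string variable and two factors, using a small finite candidate set so the messages can be enumerated explicitly; in the real system each message is a weighted FSM over all strings, and on cyclic graphs the updates are iterated. The factors below are toy stand-ins.

```python
# A tiny factor graph: one string variable U (an underlying form) with two factors:
#   prior(u)      - a unary factor on U
#   channel(u, s) - a pairwise factor, with S observed as "dæmz"
# Domains are small finite sets so messages can be enumerated explicitly;
# in the real system each message is a weighted FSM over all strings.
DOMAIN_U = ["dæm", "dæmn", "dæmz"]
OBSERVED_S = "dæmz"

def prior(u):                        # toy unary factor
    return {"dæm": 0.2, "dæmn": 0.5, "dæmz": 0.3}[u]

def channel(u, s):                   # toy phonology-like pairwise factor
    if u == s:
        return 0.6
    if s == u.rstrip("n") + "z":     # crude "final-n deletion + add z" pattern
        return 0.3
    return 0.01

# Each factor sends U a message: its score with the other variables summed out
# (here S is observed, so no sum is needed). The belief at U is the product
# of its incoming messages, then normalized.
msg_from_prior   = {u: prior(u) for u in DOMAIN_U}
msg_from_channel = {u: channel(u, OBSERVED_S) for u in DOMAIN_U}

belief = {u: msg_from_prior[u] * msg_from_channel[u] for u in DOMAIN_U}
z = sum(belief.values())
print({u: round(b / z, 3) for u, b in belief.items()})   # posterior over U
```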
Computing Marginal Beliefs

[Figures: a factor graph over string variables X1, X2, X3, X4, X5, X7, where each incoming message is an FSM. Multiplying the messages to compute the belief at X3 results in a large state space. What a hairball! Approximation required!!!]
BP over String-Valued Variables

• In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex!
[Animation: on a small cyclic graph with variables X1, X2 and factors ψ1, ψ2, the message FSMs (over strings of a's) get larger on each pass around the loop.]
Inference by Expectation Propagation

Expectation Propagation (EP)
[Figure: the factor graph again, with exponential-family approximations inside. Messages to and from X3 will be simple, so the belief at X3 will be simple!]

• EP solves the problem by simplifying each message once it is computed.
  – Projects the message back into a tractable family.
Expectation propagation (EP)

• EP solves the problem by simplifying each message once it is computed.
  – Projects the message back into a tractable family.
  – In our setting, we can use n-gram models:
    • f_approx(x) = product of the weights of the n-grams in x
    • Just need to choose weights that give a good approximation
  – Best to use variable-length n-grams.
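A sketch of the projection step, assuming the exact message can be enumerated (in reality it is an FSM): compute expected bigram counts under the message and fit a relative-frequency bigram model, one simple way to project into the n-gram family; f_approx(x) is then a product of the n-gram weights appearing in x.

```python
from collections import defaultdict

BOS, EOS = "^", "$"

def expected_bigram_counts(weighted_strings):
    """Expected bigram counts under a normalized distribution over strings."""
    z = sum(weighted_strings.values())
    counts = defaultdict(float)
    for s, w in weighted_strings.items():
        padded = BOS + s + EOS
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += w / z
    return counts

def fit_bigram(counts):
    """Relative-frequency bigram model fit to the expected counts."""
    totals = defaultdict(float)
    for (a, _), c in counts.items():
        totals[a] += c
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}

def approx_score(model, s):
    """f_approx(s) = product of the bigram weights appearing in s."""
    p = 1.0
    padded = BOS + s + EOS
    for a, b in zip(padded, padded[1:]):
        p *= model.get((a, b), 1e-9)
    return p

# An exact message, enumerated explicitly here for illustration only.
message = {"rizajnz": 0.6, "rizajn": 0.3, "rizajngz": 0.1}
model = fit_bigram(expected_bigram_counts(message))
for s in ["rizajnz", "rizajn", "rizajnzz"]:
    print(s, approx_score(model, s))
```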
Expectation Propagation (EP) in a Nutshell

[Animation: around variable X3, each incoming FSM message is replaced, one at a time, by a simple table of n-gram weights (e.g. foo 1.2, bar 0.5, baz 4.3), until all messages, and hence the belief at X3, are in the tractable family.]
Variable Order Approximations
• Use only the n-grams you really need!

Approximating beliefs with n-grams – how to optimize this?
• Option 1: Greedily add n-grams by expected count in f. Stop when adding the next batch hurts the objective.
• Option 2: Select n-grams from a large set using a convex relaxation + projected gradient (= tree-structured group lasso). Must incrementally expand the large set (“active set” method).
Results using Expectation Propagation

[Charts: speed ranking (upper graph) and accuracy ranking (lower graph) – essentially opposites.]
• Trigram EP (cyan) – slow, very accurate
• Baseline (black) – slow, very accurate (pruning)
• Penalized EP (red) – pretty fast, very accurate
• Bigram EP (blue) – fast but inaccurate
• Unigram EP (green) – fast but inaccurate
Inference by Dual Decomposition

Exact 1-best inference!
(can't terminate in general because of undecidability, but does terminate in practice)
General Idea of Dual Decomp

[Slide 1: the full graph, in which the stem morph is shared: candidate stems such as rεzɪgn and rizajgn must jointly explain r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz.]

[Slide 2: split the graph into one subproblem per surface word. Subproblem 1 covers rεzɪgn#eɪʃən → r,εzɪgn’eɪʃn; subproblem 2 covers rizajn#z → riz’ajnz; subproblem 3 covers dæmn#eɪʃən → d,æmn’eɪʃn; subproblem 4 covers dæmn#z → d’æmz. Each subproblem gets its own copy of the shared morphs.]

[Slide 3: the copies disagree – subproblem 1 says “I prefer rεzɪgn” while subproblem 2 says “I prefer rizajn”. Dual decomposition must negotiate until the copies agree.]
Substring Features and Active Set

[The four subproblems again. Subproblem 1 prefers rεzɪgn; subproblem 2's copy prefers rizajn. Dual variables on substring features push them together: subproblem 2's copy is told “less i, a, j; more ε, ɪ, g (to match the others)”, while subproblem 1's copy is told “less ε, ɪ, g; more i, a, j (to match the others)”.]
Features: “Active set” method

• How many features? Infinitely many possible n-grams!
• Trick: gradually increase the feature set as needed.
  – Like Paul & Eisner (2012), Cotterell & Eisner (2015)
  1. Only add features on which the strings disagree.
  2. Only add abcd once abc and bcd already agree.
     – Exception: add unigrams and bigrams for free.
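A hedged sketch of the negotiation on the Catalan example that follows: each subproblem picks its favorite stem given the current feature penalties (Lagrange multipliers on unigram/bigram counts), and the penalties take a subgradient step toward making the counts agree. The candidate set, local scores, and step size below are toy choices; the real subproblems are finite-state Viterbi problems, and features are added with the active-set trick.

```python
from collections import Counter

OBSERVED = {"": "gris", "os": "grizos", "e": "grize", "es": "grizes"}  # suffix -> word
CANDIDATES = ["gris", "griz", "grizo", "grize"]                        # candidate stems

def ngram_feats(stem):
    """Unigram and bigram counts of the stem (with an end marker $)."""
    s = stem + "$"
    feats = Counter(s)
    feats.update(s[i:i + 2] for i in range(len(s) - 1))
    return feats

def local_score(stem, suffix, word):
    """Toy subproblem score: penalize symbol mismatches between stem+suffix and word."""
    u, w = Counter(stem + suffix), Counter(word)
    return -sum(((u - w) + (w - u)).values())

def solve_subproblem(suffix, word, penalty):
    """Each subproblem maximizes its local score minus penalty . feature counts."""
    def total(stem):
        return local_score(stem, suffix, word) - sum(
            penalty.get(f, 0.0) * c for f, c in ngram_feats(stem).items())
    return max(CANDIDATES, key=total)

def dual_decomposition(steps=100, rate=0.1):
    penalty = {k: {} for k in OBSERVED}           # one multiplier vector per subproblem
    for _ in range(steps):
        stems = {k: solve_subproblem(k, w, penalty[k]) for k, w in OBSERVED.items()}
        if len(set(stems.values())) == 1:
            return stems                          # all copies agree: exact MAP found
        counts = {k: ngram_feats(s) for k, s in stems.items()}
        feats = set().union(*counts.values())
        mean = {f: sum(c[f] for c in counts.values()) / len(counts) for f in feats}
        for k in penalty:                         # subgradient step toward agreement
            for f in feats:
                penalty[k][f] = penalty[k].get(f, 0.0) + rate * (counts[k][f] - mean[f])
    return stems

print(dual_decomposition())   # e.g. all four copies settle on 'griz'
```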
Fragment of Our Graph for Catalan

[The stem of “grey” is shared by four observed surface words: gris, grizos, grize, grizes; the suffix morphs and underlying words are unknown (?).]

Separate these 4 words into 4 subproblems as before …

Redraw the graph to focus on the stem: the unknown stem connects to the four observed words gris, grizos, grize, grizes.

Separate into 4 subproblems – each gets its own copy of the stem.
[Trace of the negotiation over the shared stem. Each subproblem's copy of the stem is shown, along with the active set of nonzero features (dual variables):]

• Iteration 1 – nonzero features: {}. Copies: ε, ε, ε, ε.
• Iteration 3 – nonzero features: {}. Copies: g, g, g, g.
• Iteration 4 – nonzero features: {s, z, is, iz, s$, z$}. Copies: gris, griz, griz, griz.
• Iteration 5 – nonzero features: {s, z, is, iz, s$, z$, o, zo, o$}. Copies: gris, griz, grizo, griz.
• Iterations 6–13 – same feature set. Copies: gris, griz, grizo, griz.
• Iteration 14 – copies: griz, griz, grizo, griz.
• Iteration 17 – copies: griz, griz, griz, griz.
• Iteration 18 – nonzero features now also include {e, ze, e$}. Copies: griz, griz, griz, grize.
• Iterations 19–29 – copies: griz, griz, griz, grize.
• Iteration 30 – copies: griz, griz, griz, griz. Converged!
Why n-gram features?

• Positional features don't understand insertion. Comparing griz with giz, a positional feature set would try to arrange for r not i at position 2, i not z at position 3, z not e at position 4.
• In contrast, our “z” feature counts the number of “z” phonemes, without regard to position. Comparing griz with giz, the solutions already agree on the “g”, “i”, “z” counts; they're only negotiating over the “r” count: “I need more r's.”
• Adjust the weights λ until the “r” counts match: “I need more r's … somewhere.”
• The next iteration agrees on all our unigram features, but may produce griz vs. girz:
  – Oops! Features matched only counts, not positions.
  – But the bigram counts are still wrong … so bigram features get activated to save the day: “I need more gr, ri, iz; less gi, ir, rz.”
  – If that's not enough, add even longer substrings …
Results using Dual Decomposition

7 Inference Problems (graphs):

EXERCISE (small):
• 4 languages: Catalan, English, Maori, Tangale
• 16 to 55 underlying morphemes
• 55 to 106 surface words

CELEX (large):
• 3 languages: English, German, Dutch
• 341 to 381 underlying morphemes
• 1000 surface words for each language
Experimental Questions
• Is exact inference by DD practical?
• Does it converge?
• Does it get better results than approximate inference methods?
• Does exact inference help EM?
primal (function of strings x) ≤ dual (function of weights λ)

• DD seeks the best λ via a subgradient algorithm
  – reduce the dual objective
  – tighten the upper bound on the primal objective
• If λ gets all sub-problems to agree (x1 = … = xK)
  – constraints satisfied
  – the dual value is also the value of a primal solution
  – which must be the max primal! (and the min dual)
Convergence behavior (full graph)

[Charts for Catalan, Maori, English, Tangale: the dual objective decreases (tightening the upper bound) while the primal objective improves (better strings), until they meet.]
Comparisons

• Compare DD with two types of Belief Propagation (BP) inference:
  – Approximate MAP inference (max-product BP) – the baseline; a Viterbi-style approximation
  – Approximate marginal inference (sum-product BP) – TACL 2015; a variational approximation
  – Exact MAP inference (dual decomposition) – this paper
  – Exact marginal inference – we don't know how!
Inference accuracy

Model 1 – trivial phonology; Model 2S – oracle phonology; Model 2E – learned phonology (inference used within EM).

Method                                             | Model 1, EXERCISE | Model 1, CELEX | Model 2S, CELEX | Model 2E, EXERCISE
Approximate MAP (max-product BP) (baseline)        | 90%               | 84%            | 99%             | 91%
Approximate marginal (sum-product BP) (TACL 2015)  | 95%               | 86%            | 96% (worse)     | 95%
Exact MAP (dual decomposition) (this paper)        | 97%               | 90%            | 99%             | 98%
Conclusion

• A general DD algorithm for MAP inference on graphical models over strings.
• On the phonology problem, it terminates in practice, guaranteeing the exact MAP solution.
• Improved inference for the supervised model; improved EM training for the unsupervised model.
• Try it for your own problems generalizing to new strings!
Future Work
[Figure: the Bayesian picture again: a probability distribution linking observed data to hidden data.]
Future: Which words are related?

• So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.”
• We'd like to figure that out from raw text:
  – Related spellings
  – Related contexts
  → shared morphemes?
Linguistics quiz: Find a morpheme

Blah blah blah snozzcumber blah blah blah.
Blah blah blahdy abla blah blah.
Snozzcumbers blah blah blah abla blah.
Blah blah blah snezzcumbri blah blah snozzcumber.

• Dreyer & Eisner 2011 – “select & mutate”: many possible morphological slots
• Andrews, Dredze, & Eisner 2014 – “select & mutate”: many possible phylogenies
• NEW:
  Did [Taylor swift] just dis harry sytles
  Lets see how bad [T Swift] will be. #grammys
  it’s clear that [T-Swizzle] is on drugs
  [Taylor swift] is apart of the Illuminati
  Ladies STILL love [LL Cool James].
  [LL Cool J] is looking realllll chizzled!
Future: Which words are related?

• So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.”
• We'd like to get that from raw text.
• To infer the abstract morphemes from context, we need to extend our generative story to capture regularity in morpheme sequences.
  – Neural language models …
  – But, must deal with an unknown, unbounded vocabulary of morphemes.
Reconstructing the (multilingual) lexicon

Index | Spelling | Meaning | Pronunciation    | Syntax
123   | ca       |         | [si.ei]          | NNP (abbrev)
124   | can      |         | [kɛɪn]           | NN
125   | can      |         | [kæn], [kɛn], …  | MD
126   | cane     |         | [keɪn]           | NN (mass)
127   | cane     |         | [keɪn]           | NN
128   | canes    |         | [keɪnz]          | NNS

(other columns would include translations, topics, counts, embeddings, …)
Conclusions

• Unsupervised learning of how all the words in a language (or across languages) are interrelated.
  – This is what kids and linguists do.
  – Given data, estimate a posterior distribution over the infinite probabilistic lexicon.
  – While training parameters that model how lexical entries are related (language-specific derivational processes or soft constraints).
• Starting to look feasible!
  – We now have a lot of the ingredients – generative models and algorithms.

Download