Factored Language Models
EE517 Presentation
April 19, 2005
Kevin Duh (duh@ee.washington.edu)
Outline
1. Motivation
2. Factored Word Representation
3. Generalized Parallel Backoff
4. Model Selection Problem
5. Applications
6. Tools
Word-based Language Models
• Standard word-based language models:

  p(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1}) \approx \prod_{t=1}^{T} p(w_t \mid w_{t-1}, w_{t-2})

• How to get robust n-gram estimates p(w_t \mid w_{t-1}, w_{t-2})?
  • Smoothing
    • E.g. Kneser-Ney, Good-Turing
  • Class-based language models:

    p(w_t \mid w_{t-1}) \approx p(w_t \mid C(w_t)) \, p(C(w_t) \mid C(w_{t-1}))
Limitation of Word-based
Language Models
• Words are inseparable whole units.
• E.g. “book” and “books” are distinct vocabulary units
• Especially problematic in morphologically-rich languages
  (e.g. Arabic, Finnish, Russian, Turkish):
  • Many unseen word contexts
  • High out-of-vocabulary rate
  • High perplexity
• Example, Arabic root k-t-b:
  Kitaab (a book), Kitaab-iy (my book), Kitaabu-hum (their book), Kutub (books)
Arabic Morphology
Example: fa- sakan -tu
  (fa- = particle; sakan = root + pattern; -tu = affix)
  LIVE + past + 1st-sg-past + particle: "so I lived"

• ~5000 roots
• several hundred patterns
• dozens of affixes
Vocabulary Growth - full word forms
[Figure: vocabulary size vs. number of word tokens, English vs. Arabic (CallHome)]
Source: K. Kirchhoff, et al., “Novel Approaches to Arabic Speech Recognition
- Final Report from the JHU Summer Workshop 2002”, JHU Tech Report 2002
Vocabulary Growth - stemmed words
[Figure: vocabulary size vs. number of word tokens for English and Arabic, full words vs. stems (CallHome)]
Source: K. Kirchhoff, et al., “Novel Approaches to Arabic Speech Recognition
- Final Report from the JHU Summer Workshop 2002”, JHU Tech Report 2002
Solution: Word as Factors
• Decompose words into “factors” (e.g. stems)
• Build language model over factors: P(w|factors)
• Two approaches for decomposition
  • Linear [e.g. Geutner, 1995]:
    the word string is split into a single stream of morphs,
    e.g. stem suffix | prefix stem suffix ...
  • Parallel
    • [Kirchhoff et al., JHU Workshop 2002]
    • [Bilmes & Kirchhoff, NAACL/HLT 2003]
    [Diagram: each word position t carries parallel factor streams M_t, S_t, W_t, for t-2, t-1, t]
Factored Word Representations
w  { f , f ,..., f }  f
1
2
K
1:K
p(w1 , w2 ,..., wT )  p( f11:K , f 21:K ,..., fT1:K )
T
1:K
  p( ft1:K | ft1:K
,
f
)
1
t2
t 1
• Factors may be any word
 feature. Here we use
Mt-2
morphological features:
• E.g. POS, stem, root, pattern, etc.
P(wt | wt 1 , wt 2 , st 1 , st 2 , mt 1 , mt 2 )
Factored Language Models
Mt-1
Mt
St-2
St-1
St
Wt-2
Wt-1
Wt
8
Advantage of Factored Word
Representations
• Main advantage: allows robust estimation of probabilities
  p(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K}) using backoff
  • Word combinations in context may not be observed in
    training data, but factor combinations are
• Simultaneous class assignment:

  Word                     | word        | stem    | root | tag
  Kitaab-iy (My book)      | kitaab-iy   | kitaab  | ktb  | noun+poss
  Kitaabu-hum (Their book) | kitaabu-hum | kitaabu | ktb  | noun+poss
  Kutub (Books)            | kutub       | kutub   | ktb  | noun (pl.)
Example
• Training sentence: “lAzim tiqra kutubiy bi sorca”
(You have to read my books quickly)
• Test sentence:
“lAzim tiqra kitAbiy bi sorca”
(You have to read my book quickly)
Count(tiqra, kitAbiy, bi) = 0
Count(tiqra, kutubiy, bi) > 0
Count(tiqra, ktb, bi) > 0
P(bi | kitAbiy, tiqra) can back off to
P(bi | ktb, tiqra) to obtain a more robust estimate
=> this is better than backing off to P(bi | <unknown>, tiqra)
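A toy sketch of this example with the hypothetical counts above: when the word-level trigram is unseen, back off to the root factor, which was observed in the same context, instead of treating "kitAbiy" as unknown. The dictionaries and helper name are illustrative only.

```python
# Hypothetical counts mirroring the example above.
trigram_counts = {
    ("tiqra", "kutubiy", "bi"): 1,   # Count(tiqra, kutubiy, bi) > 0
    ("tiqra", "ktb", "bi"): 1,       # Count(tiqra, ktb, bi) > 0
}                                    # Count(tiqra, kitAbiy, bi) = 0 (absent)
root_of = {"kutubiy": "ktb", "kitAbiy": "ktb"}

def backoff_context(w_prev2, w_prev1, w):
    # Prefer the word-level context if it was observed in training...
    if trigram_counts.get((w_prev2, w_prev1, w), 0) > 0:
        return (w_prev2, w_prev1)
    # ...otherwise fall back to the previous word's root factor.
    return (w_prev2, root_of.get(w_prev1, "<unknown>"))

print(backoff_context("tiqra", "kitAbiy", "bi"))  # -> ('tiqra', 'ktb')
```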
Language Model Backoff
• When n-gram count is low, use (n-1)-gram estimate
• Ensures more robust parameter estimation with sparse data:

Word-based LM: backoff path (drop the most distant word at each step):
  P(Wt | Wt-1 Wt-2 Wt-3) -> P(Wt | Wt-1 Wt-2) -> P(Wt | Wt-1) -> P(Wt)

Factored LM: backoff graph (multiple backoff paths possible):
  F | F1 F2 F3
  -> F | F1 F2,  F | F2 F3,  F | F1 F3
  -> F | F1,  F | F2,  F | F3
  -> F
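A small sketch (not from the slides) that enumerates the backoff paths in a graph like the one above by dropping one conditioning factor at a time; the factor names are placeholders.

```python
# Enumerate all backoff paths from the full context down to the
# unconditional node, dropping one parent factor per step.
from itertools import permutations

def backoff_paths(parents):
    for drop_order in permutations(parents):
        remaining = list(parents)
        path = [tuple(remaining)]
        for f in drop_order:
            remaining.remove(f)
            path.append(tuple(remaining))
        yield path

for path in backoff_paths(("F1", "F2", "F3")):
    print(" -> ".join("F | " + " ".join(ctx) if ctx else "F" for ctx in path))
# 6 paths in total, e.g.: F | F1 F2 F3 -> F | F1 F2 -> F | F1 -> F
```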
Choosing Backoff Paths
• Four methods for choosing a backoff path:
  1. Fixed path (a priori)
  2. Choose a path dynamically during training
  3. Choose multiple paths dynamically during training and
     combine the results (Generalized Parallel Backoff)
  4. Constrained versions of (2) or (3)

[Backoff graph over F | F1 F2 F3, as on the previous slide]
Generalized Backoff
• Katz backoff:

  P_{BO}(w_t \mid w_{t-1}, w_{t-2}) =
  \begin{cases}
    d_{N(w_t, w_{t-1}, w_{t-2})} \dfrac{N(w_t, w_{t-1}, w_{t-2})}{N(w_{t-1}, w_{t-2})} & \text{if } N(w_t, w_{t-1}, w_{t-2}) > 0 \\
    \alpha(w_{t-1}, w_{t-2}) \, P_{BO}(w_t \mid w_{t-1}) & \text{otherwise}
  \end{cases}

• Generalized backoff:

  P_{BO}(f \mid f_{P1}, f_{P2}) =
  \begin{cases}
    d_{N(f, f_{P1}, f_{P2})} \dfrac{N(f, f_{P1}, f_{P2})}{N(f_{P1}, f_{P2})} & \text{if } N(f, f_{P1}, f_{P2}) > 0 \\
    \alpha(f_{P1}, f_{P2}) \, g(f, f_{P1}, f_{P2}) & \text{otherwise}
  \end{cases}

g() can be any positive function, but some choices of g() make the backoff
weight computation difficult:

  \alpha(f_{P1}, f_{P2}) =
    \frac{1 - \sum_{f : N(f, f_{P1}, f_{P2}) > 0} d_{N(f, f_{P1}, f_{P2})} \frac{N(f, f_{P1}, f_{P2})}{N(f_{P1}, f_{P2})}}
         {\sum_{f : N(f, f_{P1}, f_{P2}) = 0} g(f, f_{P1}, f_{P2})}
g() functions
• A priori fixed path:

  g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{P1})

• Dynamic path, max counts:

  g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{P_{j^*}}), \quad j^* = \arg\max_j N(f, f_{Pj})

  Based on raw counts => favors robust estimation

• Dynamic path, max normalized counts:

  j^* = \arg\max_j \frac{N(f, f_{Pj})}{N(f_{Pj})}

  Based on maximum likelihood => favors statistical predictability
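A sketch of the two dynamic path-selection rules above, assuming counts are stored in plain dictionaries N_joint[(f, f_Pj)] and N_parent[f_Pj]; the function and variable names are illustrative only.

```python
def argmax_raw_counts(f, parents, N_joint):
    # j* = argmax_j N(f, f_Pj): favors the most robustly estimated parent
    return max(parents, key=lambda fp: N_joint.get((f, fp), 0))

def argmax_normalized_counts(f, parents, N_joint, N_parent):
    # j* = argmax_j N(f, f_Pj) / N(f_Pj): favors maximum-likelihood predictability
    return max(parents,
               key=lambda fp: N_joint.get((f, fp), 0) / max(N_parent.get(fp, 0), 1))
```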
Dynamically Choosing Backoff Paths
During Training
• Choose the backoff path based on g() and the statistics of the data

[Backoff graph: Wt | Wt-1 St-1 Tt-1 -> {Wt | Wt-1 St-1, Wt | Wt-1 Tt-1, Wt | St-1 Tt-1} -> {Wt | Wt-1, Wt | St-1, Wt | Tt-1} -> Wt]
Multiple Backoff Paths:
Generalized Parallel Backoff
• Choose multiple paths during training and combine
probability estimates
[Backoff graph fragment: Wt | Wt-1 St-1 Tt-1 backs off in parallel to
 Wt | Wt-1 St-1, Wt | Wt-1 Tt-1, and Wt | St-1 Tt-1]

  p_{BO}(w_t \mid w_{t-1}, s_{t-1}, t_{t-1}) =
  \begin{cases}
    d_c \, p_{ML}(w_t \mid w_{t-1}, s_{t-1}, t_{t-1}) & \text{if count} > \text{threshold} \\
    \frac{\alpha}{2} \left[ p_{BO}(w_t \mid w_{t-1}, s_{t-1}) + p_{BO}(w_t \mid w_{t-1}, t_{t-1}) \right] & \text{otherwise}
  \end{cases}

Options for combining the parallel estimates: average, sum, product,
geometric mean, weighted mean.
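A sketch of the parallel-backoff equation above with mean combination. The count, discounted-ML, backoff-weight, and lower-order estimators are passed in as hypothetical helper functions, since they are not defined on this slide.

```python
def p_gpb(w, w1, s1, t1, count, threshold, d_c, p_ml, alpha, p_bo_ws, p_bo_wt):
    if count(w, w1, s1, t1) > threshold:
        # Discounted maximum-likelihood estimate at the full node.
        return d_c * p_ml(w, w1, s1, t1)
    # Otherwise average the two parallel backoff paths; alpha keeps the
    # distribution normalized. Other combinations: sum, product,
    # geometric mean, weighted mean.
    return alpha(w1, s1, t1) * 0.5 * (p_bo_ws(w, w1, s1) + p_bo_wt(w, w1, t1))
```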
Summary:
Factored Language Models
FACTORED LANGUAGE MODEL =
Factored Word Representation + Generalized Backoff
• Factored Word Representation
• Allows rich feature set representation of words
• Generalized (Parallel) Backoff
• Enables robust estimation of models with many
conditioning variables
Model Selection Problem
• In n-gram models, choose, e.g.:
  • bigram vs. trigram vs. 4-gram
  => relatively easy search; just try each and note
     perplexity on a development set
• In Factored LM, choose:
• Initial Conditioning Factors
• Backoff Graph
• Smoothing Options
Too many options; an automatic search is needed.
Tradeoff: Factored LMs are more general, but it is harder to
select a good model that fits the data well.
Example: a Factored LM
• Initial Conditioning Factors, Backoff Graph, and Smoothing
parameters completely specify a Factored Language Model
• E.g. 3 factors total:
0. Begin with the full backoff graph structure for 3 factors.
1. The Initial Factors specify the start node.

[Backoff graph: Wt | Wt-1 St-1 Tt-1 -> {Wt | Wt-1 St-1, Wt | Wt-1 Tt-1, Wt | St-1 Tt-1} -> {Wt | Wt-1, Wt | St-1, Wt | Tt-1} -> Wt]
Example: a Factored LM
• Initial Conditioning Factors, Backoff Graph, and Smoothing
parameters completely specify a Factored Language Model
• E.g. 3 factors total:
3. Begin with the subgraph rooted at the new start node.
4. Specify the backoff graph, i.e. what backoff to use at each node.
5. Specify smoothing for each edge.

[Subgraph: Wt | Wt-1 St-1 -> {Wt | Wt-1, Wt | St-1} -> Wt]
Applications for Factored LM
• Modeling of Arabic, Turkish, Finnish, German, and other
morphologically-rich languages
• [Kirchhoff et al., JHU Summer Workshop 2002]
• [Duh & Kirchhoff, COLING 2004], [Vergyri et al., ICSLP 2004]
• Modeling of conversational speech
• [Ji & Bilmes, HLT 2004]
• Applied in Speech Recognition, Machine Translation
• General Factored LM tools can also be used to obtain
various smoothed conditional probability tables for other
applications outside of language modeling (e.g. tagging)
• More possibilities (factors can be anything!)
To explore further…
• Factored Language Models are now part of the standard
  SRI Language Modeling Toolkit (SRILM) distribution (v1.4.1)
• Thanks to Jeff Bilmes (UW) and Andreas Stolcke (SRI)
• Downloadable at:
http://www.speech.sri.com/projects/srilm/
fngram Tools
fngram-count -factor-file my.flmspec -text train.txt
fngram -factor-file my.flmspec -ppl test.txt
train.txt: “Factored LM is fun”
W-Factored:P-adj W-LM:P-noun W-is:P-verb W-fun:P-adj
my.flmspec:
W: 2 W(-1) P(-1) my.count my.lm 3
  W1,P1  W1  kndiscount gtmin 1 interpolate
  P1     P1  kndiscount gtmin 1
  0      0   kndiscount gtmin 1
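A hypothetical helper (not part of SRILM) that produces the "W-word:P-tag" factored-text format shown for train.txt above from (word, tag) pairs; the function name is illustrative only.

```python
def to_factored_line(tagged_sentence):
    # One token per word, factors joined as "W-word:P-tag".
    return " ".join(f"W-{w}:P-{t}" for w, t in tagged_sentence)

print(to_factored_line([("Factored", "adj"), ("LM", "noun"),
                        ("is", "verb"), ("fun", "adj")]))
# -> W-Factored:P-adj W-LM:P-noun W-is:P-verb W-fun:P-adj
```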
Turkish Language Model
• Newspaper text from web [Hakkani-Tür, 2000]
• Train: 400K tokens / Dev: 100K / Test: 90K
• Factors from morphological analyzer
Word: yararmanlak
  word             = yararmanlak
  root             = yarar
  part-of-speech   = NounInf-N:A3sg
  number           = singular
  case             = Nom
  other            = Pnon
  inflection-group = NounA3sgPnonNom+Verb+Acquire+Pos
Turkish: Dev Set Perplexity
N-gram order | Word-based LM | Hand FLM | Random FLM | Genetic FLM | ppl (%)
2            | 593.8         | 555.0    | 556.4      | 539.2       | -2.9
3            | 534.9         | 533.5    | 497.1      | 444.5       | -10.6
4            | 534.8         | 549.7    | 566.5      | 522.2       | -5.0
• Factored Language Models found by Genetic Algorithms perform best
• The poor performance of the higher-order Hand-FLMs reflects the
  difficulty of manual search
Turkish: Eval Set Perplexity
N-gram order | Word-based LM | Hand FLM | Random FLM | Genetic FLM | ppl (%)
2            | 609.8         | 558.7    | 525.5      | 487.8       | -7.2
3            | 545.4         | 583.5    | 509.8      | 452.7       | -11.2
4            | 543.9         | 559.8    | 574.6      | 527.6       | -5.8
• Dev set results generalize to the eval set
  => the Genetic Algorithm did not overfit
• Best models used Word, POS, Case, Root factors
and parallel backoff
Arabic Language Model
• LDC CallHome Conversational Egyptian Arabic
speech transcripts
• Train: 170K words / Dev: 23K / Test: 18K
• Factors from morphological analyzer
• [LDC,1996], [Darwish, 2002]
Word: Il+dOr
  word              = Il+dOr
  root              = dwr
  morphological tag = Noun+masc-sg+article
  stem              = dOr
  pattern           = CCC
Arabic: Dev Set and
Eval Set Perplexity
Dev Set perplexities:

N-gram order | Word-based LM | Hand FLM | Random FLM | Genetic FLM | ppl (%)
2            | 229.9         | 229.6    | 229.9      | 222.9       | -2.9
3            | 229.3         | 226.1    | 230.3      | 212.6       | -6.0

Eval Set perplexities:

N-gram order | Word-based LM | Hand FLM | Random FLM | Genetic FLM | ppl (%)
2            | 249.9         | 230.1    | 239.2      | 223.6       | -2.8
3            | 285.4         | 217.1    | 224.3      | 206.2       | -5.0

The best models used all available factors (Word, Stem, Root, Pattern,
Morph) and various parallel backoffs.
Word Error Rate (WER) Results
            Dev Set                          Eval Set (eval97)
Stage | Word LM Baseline | Factored LM | Word LM Baseline | Factored LM
1     | 57.3             | 56.2        | 61.7             | 61.0
2a    | 54.8             | 52.7        | 58.2             | 56.5
2b    | 54.3             | 52.5        | 58.8             | 57.4
3     | 53.9             | 52.1        | 57.6             | 56.1

Factored language models gave a 1.5% improvement in WER.