
Syntax-based and Factored

Language Models

Rashmi Gangadharaiah

April 16th, 2008

1

Noisy Channel Model
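The figure on this slide is not reproduced in this transcript; the standard noisy-channel decomposition it illustrates (with e the target/English sentence and f the source sentence) is:

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \; \underbrace{P(e)}_{\text{language model}}
```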

2

Why is MT output still bad?

• Strong translation models, weak language models

• Using other knowledge sources in model building?

– Parse trees, taggers etc.

– How much improvement?

• Models can be computationally expensive,

– n-gram models are the least expensive models

– Other models have to be efficiently coded

3

Conventional Language models

• n-gram word-based language model: p(w_i | h) = p(w_i | w_{i-1}, …, w_1)

• Retain only the n-1 most recent words of history to avoid storing a large number of parameters: p(w_i | h) = p(w_i | w_{i-1}, …, w_{i-n+1})

– For n=3: p(S) = p(w_1) p(w_2 | w_1) … p(w_i | w_{i-1}, w_{i-2})

• Estimated using MLE

• Inaccurate probability estimates for higher order n-grams

• Smoothing/discounting to overcome sparseness
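As a concrete illustration of the n-gram estimation and smoothing described above, here is a minimal Python sketch of an MLE trigram with simple linear interpolation standing in for the smoothing step; the interpolation weights are illustrative, not taken from the slides.

```python
from collections import defaultdict

def train_trigram(sentences):
    """Collect trigram/bigram/unigram counts plus their history counts."""
    ngram = defaultdict(int)   # counts of (history..., word) tuples
    hist = defaultdict(int)    # counts of the corresponding histories
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            for n in (3, 2, 1):
                gram = tuple(words[i - n + 1:i + 1])
                ngram[gram] += 1
                hist[gram[:-1]] += 1
    return ngram, hist

def p_interpolated(w, h2, h1, ngram, hist, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated estimate of p(w | h2, h1); lambdas are assumed weights."""
    p = 0.0
    for lam, context in zip(lambdas, ((h2, h1), (h1,), ())):
        denom = hist[context]
        p += lam * (ngram[context + (w,)] / denom if denom else 0.0)
    return p

ngram, hist = train_trigram([["the", "dog", "barks"], ["the", "cat", "sleeps"]])
print(p_interpolated("dog", "<s>", "the", ngram, hist))
```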

4

Problems still present in the n-gram model

• Do not make efficient use of training corpus

– Blindly discards relevant words that lie n positions in the past

– Retains words of little or no value

• Do not generalize well to unseen word sequences

→ main motivation for using class-based LMs and factored LMs

• Lexical dependencies are structurally related rather than sequentially related

→ main motivation for using syntactic/structural LMs

5

Earlier work on incorporating low level syntactic information(1)

Group words into classes(1)

• P. F. Brown et al.:

– Start with each word in a separate class, iteratively combine classes

• Heeman’s (1998) POS LM:

– achieved a perplexity reduction compared to a trigram LM by redefining the speech recognition problem:

6

P.F. Brown et al. 1992. Class-Based n-Gram Models of Natural Language. In Computational Linguistics, 18(4):467-479

P.A. Heeman. 1998. POS tagging versus classes in language modeling. In Proceedings of the 6th Workshop on Very Large Corpora, Montreal.

Earlier work on incorporating low level syntactic information(2)

• Use predictive clustering and conditional clustering

• Predictive (toy sketch at the end of this slide):

P(Tuesday|party on)=P(WEEKDAY|party on)*P(Tuesday|party on WEEKDAY)

• Conditional:

P(Tuesday|party EVENT on PREPOSITION)

Backoff order from P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1}) to

P(w_i | W_{i-2} w_{i-1} W_{i-1}) (= P(Tuesday | EVENT on PREPOSITION)) to

P(w_i | w_{i-1} W_{i-1}) (= P(Tuesday | on PREPOSITION)) to

P(w_i | W_{i-1}) (= P(Tuesday | PREPOSITION)) to

P(w_i) (= P(Tuesday))
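A toy sketch of the predictive-clustering decomposition above, p(w | h) = p(C(w) | h) · p(w | h, C(w)); the class inventory and probability values below are invented purely for illustration.

```python
# Hypothetical class map and probability tables for the slide's example.
word2class = {"Tuesday": "WEEKDAY", "Friday": "WEEKDAY"}

p_class = {(("party", "on"), "WEEKDAY"): 0.4}                         # p(CLASS | history)
p_word_given_class = {(("party", "on"), "WEEKDAY", "Tuesday"): 0.2}   # p(word | history, CLASS)

def p_predictive(w, history):
    """Predictive clustering: p(w | h) = p(C(w) | h) * p(w | h, C(w))."""
    c = word2class[w]
    return (p_class.get((history, c), 0.0)
            * p_word_given_class.get((history, c, w), 0.0))

print(p_predictive("Tuesday", ("party", "on")))   # 0.4 * 0.2 = 0.08
```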

J. Goodman. 2000. Putting it all together: Language model combination. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1647-1650, Istanbul.

7

More Complex Language models that we will look at today…

• LMs that incorporate syntax

• Charniak et al. 2003 Syntax-based LM (in MT)

• LMs that incorporate both syntax and semantics

– Model Headword Dependency only

• N-best rescoring strategy

– Chelba et al. 1997 almost parsing (in ASR)

• Full parsing for decoding word lattices

– Chelba et al. 1998 full parsing in left-to-right fashion with

Dependency LM (in ASR)

– Model both Headword and non-Headword Dependencies

• N-best rescoring strategy

– Wang et al. 2007 SuperARV LMs (in MT)

– Kirchhoff et al. 2005 Factored LMs (in MT)

• Full parsing

– Rens Bod 2001 Data Oriented Parsing (in ASR)

– Wang et al. 2003 (in ASR)

8

link grammar to model long distance dependencies (1)

• Maximum Entropy language model that incorporates both syntax and semantics via dependency grammar

Motivation:

– dependencies are structurally related rather than sequentially related

• Incorporates the predictive power of words that lie outside of bigram or trigram range

• Elements of the model: a disjunct specifies how a word must be connected to other words in a legal parse.

Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Ronald Rosenfeld, Andreas Stolcke, Dekai Wu, 1997, "Structure and performance of a dependency language model", In Eurospeech.

9

link grammar to model long distance dependencies (2)

• Maps of histories

– Mapping retains

• a finite context of 0, 1, or 2 preceding words

• a link stack consisting of the open links at the current position and the identities of the words from which they emerge

10

link grammar to model long distance dependencies (3)

• Maximum entropy formulation (generic form sketched below)

– to treat each of the numerous elements of [h] as a distinct predictor variable

• Link grammar feature function

– “[h] matches d”: d is a legal disjunct to occupy the next position in the parse

– “yLz”: at least one of the links must bear label L and connect to word y
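The model equation itself is not on the slide; the generic maximum-entropy (log-linear) form consistent with this setup, with feature functions f_i such as the two just described, is the usual:

```latex
P(w \mid [h]) = \frac{1}{Z([h])} \exp\!\Big(\sum_i \lambda_i f_i([h], w)\Big),
\qquad
Z([h]) = \sum_{w'} \exp\!\Big(\sum_i \lambda_i f_i([h], w')\Big)
```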

11

link grammar to model long distance dependencies (4)

• Tagging and Parsing

– Dependency parser of Michael Collins (required pos tags).

• P(S,K) = P(S|K) P(K|S)

– The parser did not operate left to right, hence N-best lists were used.

– Training and testing data drawn from Switchboard corpus and from Treebank corpus

• Trained the tagger (Ratnaparkhi) on 1 million words, applied it to a 226,000-word hand-parsed training set, and finally applied this to 1.44 million words; tested on 11 time-marked telephone transcripts

• Dependency model

– Used the Maximum Entropy modeling toolkit

• Generated 100-best hypotheses for each utterance

– P(S)=

– Achieved reduction in WER from 47.4% (adjacency bigram) to

46.4%

12

Syntactic structure to model long distance dependencies (1)

• Language model develops syntactic structure and uses it to extract meaningful information from the word history

Motivation:

– a 3-gram approach would predict "after" from (7, cents)

– the strongest predictor (the exposed headword) when predicting "after" should be "ended", the headword of (ended(with(..)))

– syntactic structure in the past filters out irrelevant words

13

Ciprian Chelba and Frederick Jelinek, 1998, "Exploiting Syntactic Structure for Language Modeling", ACL.

Syntactic structure to model long distance dependencies (2)

• Terminology

– W_k: word k-prefix w_0 … w_k of the sentence

– W_k T_k: the word-parse k-prefix

• A word-parse k-prefix contains, for a given parse, only those binary subtrees whose span is completely included in the word k-prefix, excluding w_0 = <s>

– Single words along with their POS tag can be regarded as root-only trees

14

Syntactic structure to model long distance dependencies (3)

• Model operates by means of three modules

– WORD-PREDICTOR

• Predicts the next word w_{k+1} given the word-parse k-prefix and passes control to the TAGGER

– TAGGER

• Predicts the POS tag t_{k+1} of the next word given the word-parse k-prefix and w_{k+1}, and passes control to the PARSER

– PARSER

• Grows the already existing binary branching structure by repeatedly generating the transitions: (unary, NTlabel),

(adjoin-left, NTlabel) or (adjoin-right, NTlabel) until it passes control to the PREDICTOR by taking a null transition

15

Syntactic structure to model long distance dependencies (4)

• Probabilistic Model (reconstruction below):

• Word Level Perplexity
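The probabilistic model and the perplexity computation on this slide were shown as figures; a reconstruction consistent with the WORD-PREDICTOR / TAGGER / PARSER decomposition above (following Chelba and Jelinek, 1998) is:

```latex
P(W,T) = \prod_{k=1}^{n+1}\Big[\, P(w_k \mid W_{k-1} T_{k-1})\;
         P(t_k \mid W_{k-1} T_{k-1}, w_k)
         \prod_{i=1}^{N_k} P\big(p_i^k \mid W_{k-1} T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k\big) \Big]

% word-level probability used for perplexity: sum over the surviving parses T_k,
% weighted by their normalized scores \rho(W_k, T_k)
P(w_{k+1} \mid W_k) \;\approx\; \sum_{T_k \in S_k} \rho(W_k, T_k)\, P(w_{k+1} \mid W_k T_k)
```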

16

Syntactic structure to model long distance dependencies (5)

• Search strategy

– Synchronous multi-stack search algorithm

– Each stack contains partial parses constructed by the same number of predictor and parser operations

– Hypotheses ranked according to ln(P(W,T)) score

– Width controlled by maximum stack depth and log-probability threshold

• Parameter Estimation

– Solution inspired by an HMM re-estimation technique that works on a pruned N-best trellis (Byrne)

– binarized the UPenn Treebank parse trees and percolated the headwords using a rule-based approach

W. Byrne, A. Gunawardhana, and S. Khudanpur, 1998. “Information geometry and EM variants”. Technical Report CLSP Research Note 17.

17

Syntactic structure to model long distance dependencies (6)

• Setup - Upenn Treebank corpus

– Stack depth=10, log-probability threshold=6.91 nats

– Training data: 1M words, word vocabulary: 10k, POS tag vocabulary: 40, non-terminal tag vocabulary: 52

– Test data: 82430 words

• Results

– Reduced test-set perplexity from 167.14(trigram model) to 158.28

– Interpolating the model with a trigram model resulted in 148.90 (interpolation weight = 0.36)

18

Non-headword dependencies matter : DOP-based LM(1)

• The DOP (Data Oriented Parsing) model learns a stochastic tree-substitution grammar (STSG) from a treebank by

– extracting all subtrees from the treebank

– assigning probabilities to the subtrees

– DOP takes into account both headword and non-headword dependencies

– Subtrees are lexicalized at their frontiers with one or more words

• Motivation

– Head lexicalized grammar is limited

• It cannot capture dependencies between non-headwords

• Eg: “more people than cargo”, “more couples exchanging rings in

1988 than in the previous year” (from WSJ)

– Neither "more" nor "than" is a headword of these phrases

• Dependency between “more” and “than” is captured by a subtree where “more” and “than” are the only frontier words.

Rens Bod, 2000, "Combining Semantic and Syntactic Structure for Language Modeling"

19

Non-headword dependencies matter: DOP-based LM(2)

• DOP learns an STSG from a treebank by taking all subtrees in that treebank

– Eg: Consider a Treebank

20

Non-headword dependencies matter: DOP-based LM(3)

• New sentences may be derived by combining subtrees from the treebank

– Node substitution is left-associative

– Other derivations may yield the same parse tree

21

Non-headword dependencies matter : DOP-based LM(4)

• Model computes the probability of a subtree as follows (r(t): root label of t; see the reconstruction after this list)

• Probability of a derivation

• Probability of a parse tree

• Probability of a word string W

• Note:

– does not maximize the likelihood of the corpus

– implicit assumption that all derivations of a parse tree contribute equally to the total probability of the parse tree.

– There is a hidden component → DOP can be trained using EM
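The probability definitions on this slide were shown as figures; the standard DOP1 forms they correspond to (a reconstruction, with r(t) the root label of subtree t and |t| its frequency in the treebank) are:

```latex
% probability of a subtree t: relative frequency among subtrees with the same root label
P(t) = \frac{|t|}{\sum_{t' : r(t') = r(t)} |t'|}

% probability of a derivation d = t_1 \circ \cdots \circ t_n
P(d) = \prod_i P(t_i)

% probability of a parse tree T: sum over all derivations producing T
P(T) = \sum_{d \Rightarrow T} P(d)

% probability of a word string W: sum over all parse trees yielding W
P(W) = \sum_{T \Rightarrow W} P(T)
```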

22

Non-headword dependencies matter : DOP-based LM(5)

• Combining semantic and syntactic structure

23

Non-headword dependencies matter : DOP-based LM(6)

• Computation of the most probable string

– NP-hard: employed a Viterbi n-best search

– Estimate the most probable string by the 1000 most probable derivations

• OVIS corpus

– 10,000 user utterances about Dutch public transport information, syntactically and semantically annotated

– DOP model obtained by extracting all subtrees of depth up to 4

24

More and More features (A Hybrid): SuperARV LMs (1)

• SuperARV LM is a highly lexicalized probabilistic LM based on the Constraint

Dependency Grammar (CDG)

• CDG represents a parse as assignments of dependency relations to functional variables (roles) associated with each word in a sentence

Motivation

– High levels of word prediction capability can be achieved by tightly integrating knowledge of words, structural constraints, morphological and lexical features at the word level.

25

Wen Wang, Mary P. Harper, 2002 “The SuperARV Language Model: Investigating the

Effectiveness of Tightly Integrating Multiple Knowledge Sources”, ACL

More and More features (A Hybrid): SuperARV LMs (2)

• CDG parse

– Each word in the parse has a lexical category and a set of feature values

– Each word has a governor role (G)

• Comprises a label (indicating the position of the word's head/governor) and a modifiee

• Need roles are used to ensure the grammatical requirements of a word are met

– Mechanism for using non-headword dependencies

26

More and More features (A Hybrid): SuperARV LMs (3)

• ARVs and ARVPs

– Using the relationship between a role value’s position and its modifiee’s position, unary and binary constraints can be represented as a finite set of ARVs and ARVPs

27

More and More features (A Hybrid): SuperARV LMs (4)

• SuperARVs

– Four-tuple for a word <C,F,(R,L,UC,MC)+,DC>

– Abstraction of the joint assignment of dependencies for a word

– a mechanism for lexicalizing CDG parse rules

– Encode lexical information, syntactic and semantic constraints – much more fine-grained than POS

28

More and More features (A Hybrid): SuperARV LMs (5)

• The SuperARV LM estimates the joint probability of N words w_1^N and their SuperARV tags t_1^N (chain-rule sketch at the end of this slide)

• SuperARV LM does not encode the word identity at the data structure level since this can cause serious data sparsity problems

• Estimate the probability distributions

– recursive linear interpolation

• WER on WSJ CSR 20k test sets,

– 3gram=14.74, SARV=14.28, Chelba=14.36
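The joint model factorizes by the chain rule; a generic sketch of the decomposition (the paper's actual estimates condition on a truncated history and are smoothed by the recursive linear interpolation mentioned above):

```latex
P(w_1^N, t_1^N) = \prod_{i=1}^{N} P(t_i \mid w_1^{i-1}, t_1^{i-1})\;
                                   P(w_i \mid t_i,\, w_1^{i-1}, t_1^{i-1})
```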

29

More and More features (A Hybrid): SuperARV LMs (6)

• SCDG Parser

– Probabilistic generative model

– For S, parser returns the parse T that maximizes its probability

• First step:

– N-best SuperARV assignments are generated

– Each SuperARV sequence is represented as: (w_1, s_1), . . . , (w_n, s_n)

• Second step: the modifiees are statistically specified in a left-to-right manner.

– determine the left dependents of w_k from the farthest to the closest

– also determine whether w_k could be the (d+1)-th right dependent of a previously seen word w_p, p = 1, . . . , k-1

» d denotes the number of already assigned right dependents of w_p

Wen Wang and Mary P. Harper, 2004, "A Statistical Constraint Dependency Grammar (CDG) Parser", ACL.

30

More and More features (A Hybrid): SuperARV LMs (7)

• SCDG Parser (contd.)

• Second step (contd.)

– After processing word w_k in each partial parse on the stack, the partial parses are re-ranked according to their updated probabilities.

– parsing algorithm is implemented as a simple best first search

– Two pruning thresholds: maximum stack depth and maximum difference between the log probabilities of the top and bottom partial parses in the stack

• WER

– LM training data for this task is composed of the 1987-1989 files containing 37,243,300 words

– evaluate all LMs on the 1993 20k open vocabulary DARPA WSJ

CSR evaluation set (denoted 93-20K), which consists of 213 utterances and 3,446 words.

– 3gram=13.72, Chelba=13.0, SCDG LM=12.18

31

More and More features (A Hybrid): SuperARV LMs (8)

• Employ LMs for N-best re-ranking in MT

• Two pass decoding

– First pass: generate N-best lists

• Uses a hierarchical phrase decoder with standard 4-gram LM

– Second pass:

• Rescore the N-best lists using several LMs trained on different corpora and estimated in different ways

• Scores are combined in a log-linear modeling framework (sketch after this list)

– Along with other features used in SMT

» Rule probabilities P(f|e), p(e|f); lexical weights pw(f|e) pw(e|f), sentence length and rule counts

» Optimized weights (on GALE dev07) using minimum error rate training to maximize BLEU

» Blind test set NIST MT eval06 GALE portion (eval06)
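A minimal sketch of the second-pass log-linear combination; the feature names, scores, and weights below are invented for illustration, and in practice the weights come from minimum error rate training on the dev set.

```python
def rerank_nbest(nbest, weights):
    """Pick the hypothesis with the highest weighted sum of (log-domain) feature scores."""
    def total(feats):
        return sum(weights.get(name, 0.0) * val for name, val in feats.items())
    return max(nbest, key=lambda hyp: total(hyp[1]))

# Hypothetical 2-best list with first-pass and rescoring-LM features.
nbest = [
    ("hypothesis one", {"tm": -12.3, "lm_4g": -20.1, "lm_sarv": -19.5, "word_pen": -5.0}),
    ("hypothesis two", {"tm": -11.9, "lm_4g": -21.0, "lm_sarv": -18.7, "word_pen": -5.0}),
]
weights = {"tm": 1.0, "lm_4g": 0.5, "lm_sarv": 0.5, "word_pen": -0.2}
print(rerank_nbest(nbest, weights)[0])
```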

Wen Wang, Andreas Stolcke and Jing Zheng, Dec 2007"Reranking machine translation hypothesis with structured and web-based language models“, ASRU. IEEE Workshop

32

More and More features (A Hybrid): SuperARV LMs (9)

• Structured LMs

– Almost parsing LM

– Parsing LM

• Using baseNP model

– Given W, generates the reduced sentence W’ by marking all baseNPs and then reducing all baseNPs to their headwords

• Further simplification of the parser LM

33

More and More features (A Hybrid): SuperARV LMs (10)

• LMs for searching:

– 4-gram (4g) LM

• English side of Arabic-English and Chinese-English from

LDC

• All of English BN and BC, webtexts and translations for

Mandarin and Arabic BC BN released under DARPA EARS and GALE

• LDC2005T12, LDC95T21, LDC98T30

• Webdata collected by SRI and BBN

• LMs for reranking (N-best list size =3000)

– 5-gram count LM: Google LM (google) (1 terawords)

– 5-gram count LM: Yahoo LM (yahoo) (3.4G words)

– first two sources for training the almost-parsing LM (sarv)

– the second source for training the parser LM (plm)

– 5-gram count LM on all BBN webdata (wlm)

34

More and More features (A Hybrid): SuperARV LMs (11)

35

More and More features (A Hybrid): SuperARV LMs (12)

36

Syntax-based LMs (1)

• Performs translation by assuming the target language specifies not just words but a complete parse

• Yamada 2002: incomplete use of syntactic information

– Decoder optimized, language model was not

• Develop a system in which the translation model of [Yamada 2001] is "married" to the syntax-based language model of [Charniak 2001]

Kenji Yamada and Kevin Knight. “A Syntax-based statistical translation model”, 2001

Kenji Yamada and Kevin Knight, ”A decoder for syntax-based statistical mt”, 2002

Eugene Charniak,”Immediate-head parsing for language models” 2001

Eugene Charniak, Kevin Knight and Kenji Yamada, "Syntax-based Language Models for Statistical Machine Translation", 2003

37

Syntax-based LMs (2)

• Translation model has 3 operations:

– Reorders child nodes, inserts an optional word, translates the leaf words

– Θ varies over the possible alignments between F and E

• Decoding algorithm similar to a regular parser

– Build English parse tree from Chinese sentence

• Extract CFG rules from parsed corpus of English

• Supplement each non-lexical English CFG rule (VP → VB NP) with all possible reordered rules (VP → NP PP VB, VP → PP NP VB, etc.)

• Add extra rules "VP → VP X" and "X → word" for insertion operations

• Also add "englishword → chineseword" for translation

38

Syntax-based LMs (3)

• Now we can parse a Chinese sentence and extract the English parse tree by

– removing leaf Chinese words

– recovering the reordered child nodes into English order

• pick the best tree

– the product of the LM probability and the

TM probability is the highest.

39

Syntax-based LMs (4)

• Decoding process

– First build a forest using the bottom-up decoding parser using only P(F|E)

– Pick the best tree from the forest with a LM

• Parser/Language model (Charniak 2001)

– Takes an English sentence and uses two parsing stages

• Simple non-lexical PCFG to create a large parse forest

– Pruning step

• Sophisticated lexicalized PCFG is applied to the sentence

40

Syntax-based LMs (5)

• Evaluation

– 347 previously unseen Chinese newswire sentences

– 780,000 English parse tree-Chinese sentence pairs

– YC: TM of Yamada2001 and LM of Charniak2001

– YT: TM of Yamada2001, trigram LM Yamada2002

– BT: TM of Brown et al., 1993, trigram LM and greedy decoder

Ulrich Germann 1997

Eugene Charniak. A maximum-entropy-inspired Parser, 2000

Ulrich Germann, Michael Jahr, Daniel Marcu, Kevin Knight, and Kenji Yamada. Fast decoding and optimal decoding for machine translation, 2001

41

Factored Language models(1)

• Allow a larger set of conditioning variables for predicting the current word

– morphological, syntactic or semantic word features, etc.

Motivation

– Statistical language modeling is a difficult problem for languages with rich morphology → high perplexity

– Probability estimates are unreliable even with smoothing

– Features mentioned above are shared by many words and hence can be used to obtain better smoothed probability estimates

Katrin Kirchhoff and Mei Yang, 2005, "Improved Language Modeling for Statistical Machine Translation", ACL.

42

Factored Language models(2)

• Factored Word Representation

– Decompose words into sets of features (or factors)

• Probabilistic language models constructed over subsets of word features.

• A word is equivalent to a fixed number of factors: w ≡ f^{1:K} (see the sketch below)
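A sketch of the factored representation and the resulting trigram-style model over factor bundles (notation follows the FLM literature; the particular factors, e.g. word, stem, POS tag, are whatever the modeler chooses):

```latex
w_t \equiv \{f_t^1, f_t^2, \ldots, f_t^K\},
\qquad
p(w_t \mid w_{t-1}, w_{t-2}) \;\approx\; p\big(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K}\big)
```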

43

Factored Language models(3)

• Probability model

• Standard generalized parallel backoff

– c: count of (w_t, w_{t-1}, w_{t-2}); p_ML: ML distribution; d_c: discounting factor; τ_3: count threshold; α: normalization factor

– N-grams whose counts are above the threshold retain their ML estimates, discounted by a factor that redistributes probability mass to the lower-order distribution (reconstruction below)
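A reconstruction of the backoff equations the legend above refers to: the standard backoff case with the slide's symbols, and the generalized form in which the fixed lower-order distribution is replaced by a backoff function g (as in the legend a few slides later):

```latex
p_{BO}(w_t \mid w_{t-1}, w_{t-2}) =
\begin{cases}
d_c\, p_{ML}(w_t \mid w_{t-1}, w_{t-2}) & \text{if } c > \tau_3 \\
\alpha(w_{t-1}, w_{t-2})\, p_{BO}(w_t \mid w_{t-1}) & \text{otherwise}
\end{cases}

p_{GBO}(f \mid f_1, f_2, f_3) =
\begin{cases}
d_c\, p_{ML}(f \mid f_1, f_2, f_3) & \text{if } c > \tau_4 \\
\alpha(f_1, f_2, f_3)\, g(f, f_1, f_2, f_3) & \text{otherwise}
\end{cases}
```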

44

Factored Language models(4)

• Backoff paths

45

Factored Language models(5)

• Backoff paths

– Space of possible models is extremely large

– Ways of choosing among different paths

• Linguistically determined,

– Eg: drop syntactic before morphological variables

– Usually leads to sub-optimal results

• Choose path at runtime based on statistical criteria

• Choose multiple paths and combine their probability estimates

– c: count of (f, f_1, f_2, f_3); p_ML: ML distribution; τ_4: count threshold; α: normalization factor; g: determines the backoff strategy, can be any non-negative function of f, f_1, f_2, f_3

46

Factored Language models(6)

• Learning FLM Structure

– Three types of parameters need to be specified

• Initial conditioning factors, backoff graph, smoothing options

– Model space is extremely large

• Find best model structure automatically

– Genetic algorithms(GA)

• Class of evolution-inspired search/optimization techniques

• Encode problem solutions as strings (genes) and evolve and test successive populations of solutions through the use of genetic operators (selection, crossover, mutation) applied to encoded strings

• Solutions are evaluated according to a fitness function which represents the desired optimization criterion

• No guarantee of finding the optimal solution, but they find good solutions quickly

47

Factored Language models(7)

• Structure Search using GA

– Conditioning factors

• Encoded as binary strings

– Eg: with 3 factors (A, B, C), 6 conditioning variables {A_{-1}, B_{-1}, C_{-1}, A_{-2}, B_{-2}, C_{-2}}

» String 100011 corresponds to F = {A_{-1}, B_{-2}, C_{-2}} (decoding sketch after this list)

– Backoff graph → large number of possible paths

• Encode a binary string in terms of graph grammar rules

– 1 indicating the use of the rule and 0 for non-use
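A minimal sketch of decoding a conditioning-factor gene (bit string) into the selected factor set, assuming the 6-variable ordering from the example above; the function and variable names are hypothetical.

```python
# Ordering of candidate conditioning variables, as in the slide's example.
FACTORS = ["A_-1", "B_-1", "C_-1", "A_-2", "B_-2", "C_-2"]

def decode_conditioning(gene):
    """Return the conditioning set F selected by a 0/1 gene string."""
    assert len(gene) == len(FACTORS), "gene must have one bit per candidate factor"
    return [f for bit, f in zip(gene, FACTORS) if bit == "1"]

print(decode_conditioning("100011"))   # ['A_-1', 'B_-2', 'C_-2']
```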

48

Factored Language models(8)

• Structure Search using GA (contd.)

– Smoothing options

• Encoded as tuples of integers

– First integer → discounting method

– Second integer → backoff threshold

• Integer string consists of successive concatenated tuples each representing the smoothing option at a node in the graph

• GA operators are applied to concatenations of all three substrings describing the set of factors, backoff graph and smoothing options to jointly optimize all parameters

49

Factored Language models(9)

• Data: ACL05 shared MT task website for 4 language pairs

• Finnish, Spanish, French to English

• Development set provided by the website: 2000 sentences

– Trained using GIZA++

– Pharaoh for phrase based decoding

• Trigram word LM trained using SRILM toolkit with Kneser-Ney smoothing and interpolation of higher and lower order n-grams

– Combination weights trained using minimum error weight optimization (Pharaoh)

50

Factored Language models(10)

• First pass

– Extract N-best lists: 2000 hypotheses per sentence

– 7 model scores collected from the outputs

• Distortion model score, the first pass LM score, word and phrase penalties, bidirectional phrase and word translation scores.

• Second pass

– N-best lists rescored with additional LMs.

• Word based 4-gram model,

• factored trigram model: separate FLM for each language

– Features: POS tags (Ratnaparkhi), stems (Porter)

– Optimized to achieve a low perplexity on the oracle 1-best hypothesis (with the best BLEU score) from the first pass

– Resulting scores combined with the above scores in a log-linear fashion

• Combination weights optimized on the development set to maximize the BLEU score

• The weighted, combined scores are used to select the best hypothesis

51

Factored Language models(11)

52

Conclusion(1)

• Chelba et al. 1997 Dependency LM (in ASR)

– Incorporated syntax and semantics

– Predicted words based on their relation to words that lie far in the past

– For practical purposes

• Selected the best output from an N-best list

– For MT - Can be applied on N-best lists

• Rens Bod 2001 DOP (in ASR)

– Used both headword and non-headword dependencies

– Incorporated syntax and semantics

• Showed semantic annotations contribute to performance

– For MT - huge space (reordering)

• Better to use it on N-best lists

53

Conclusion(2)

• Chelba et al. 1998 Dependency model (in ASR)

– Model assigned probability to joint sequences of words-binary-parse structures with headword annotations in L2R manner, improvements in PPL

– For MT

• Can be done but huge space (reordering)

• Charniak et al. 2003 Syntax-based LM (in MT)

– TM "married" to Syntax-based LM

– No improvements in BLEU score

• Blame it on BLEU ?

• Blame it on parse accuracies?

• Obtained fewer translations that were syntactically and semantically wrong, and more perfect translations

54

– In future, can integrate all the knowledge sources at hand

Conclusion(3)

• SuperARV LMs

– Wang et al. 2003 (in ASR)

• Almost parsing, statistical dependency grammar parser

• Enriched tags

• Reduced WER

– Wang et al. 2007 (in MT)

• For reranking N-best hypotheses

• Showed improvements in BLEU score

– by almost a BLEU point

55

Conclusion(4)

• Katrin Kirchhoff 2005 Factored LMs (in MT)

– Used a set of factors to define a word

– Did not show improvements in MT quality

• Could be adding in more noise

• Blame BLEU?

– No study

• Structural learning intuitively makes sense

– Does not find the optimum structure

• Was the list of N-best hypotheses good?

• Context was limited to 3 grams

– Might help with higher values of N

• Features are probably not good/insufficient

– Interpolating with other LMs might help

– Might perform better than a word-based LM on morphologically rich languages

56

Thank You

57
