Major Focuses of Research in This Field:
0 Unification-Based Grammars
0 Probabilistic Approaches
0 Dynamic Programming
0 Stochastic Attribute-Value Grammars, Abney, 1997
0 Dynamic Programming for Parsing and Estimation of
Stochastic Unification-Based Grammars, Geman &
Johnson, 2002
0 Stochastic HPSG Parse Disambiguation using the
Redwoods Corpus, Toutanova & Flickinger, 2005
Stochastic Attribute-Value Grammars,
Abney, 1997
INTRO
0 Motivation for this paper:
0 Insufficiencies of previous probabilistic attribute-value grammars in defining parameter estimation
0 Other than Brew's and Eisele's attempts**, there have been no serious attempts to extend the stochastic models developed for CFGs to constraint-based grammars by assigning weights or preferences, or by introducing 'weighted logic'*
*See Riezler (1996)
**See Brew (1995), Eisele (1994)
Goal:
Abney’s goal in this research:
0 Defining a stochastic attribute-value grammar
0 An algorithm to compute maximum-likelihood estimates of AVG parameters.
0 How?
0 Adapting Gibbs sampling* for the Improved Iterative Scaling (IIS) algorithm, to estimate certain function expectations.
0 For further improvement, use of Metropolis-Hastings
algorithm.
*Gibbs sampling was earlier used for an English orthography application by Della Pietra, Della Pietra & Lafferty (1995).
Overview of Abney’s experiments:
Based on Brew's and Eisele's earlier experiments, Abney experiments with the Empirical Relative Frequency (ERF) method used for SCFGs.
0 Using Expectation-Maximization (EM) Algorithm
0 What to do in case of true context-dependencies?
0 Random Fields (a generalization of Markov chains and stochastic branching processes)
0 For further improvement, Metropolis-Hastings Algorithm
Ideas and Methods for Parameter Estimation:
Abney’s ideas and methods for parameter estimation in AVGs:
1. Empirical Relative Frequency (ERF) Method
0 Expectation-maximization used for CFGs
0 ‘Normalization’ (normalizing constant) in ERF for AVGs
2. Random Fields (generalized Markov Chains) Method
0 Improved Iterative Scaling Algorithm (DL&L, English orthography)
i. Null Fields
ii. Feature Selection
iii. Weight Adjustment
iv. Iteration
3. Random Sampling Method
0 Metropolis-Hastings Algorithm
Probability Estimations in CFG:
For probabilistic estimation in a CFG, assume a grammar in which each rule carries a weight; the model assigns each tree a probability as defined below.
0 x is the tree, and the right-hand side of the equation is built from the rule weights: in the example tree x, rule 1 is used once (contributing 1/3 · 1 to p[f₁]) and rule 3 is used twice (contributing 1/3 · 2 to p[f₃]).
0 The probability distribution of a parse tree is defined as:
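A reconstruction of the equation this bullet refers to, using the βᵢ (rule weight) and fᵢ(x) (rule frequency) notation of the following slides:

```latex
p(x) \;=\; \prod_i \beta_i^{\,f_i(x)}
```

For the example tree above, in which rule 1 is used once and rule 3 twice, this gives p(x) = β₁ · β₃².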
0 Parameter estimation determines the values of the weights needed for the stochastic grammar to be effective, so obtaining the correct weights is crucial.
Empirical Relative Frequency in CFG:
0 The empirical distribution is the relative frequency with which each tree appears in the corpus.
0 A measure of dissimilarity between distributions is needed to compare the model's probability distribution with the empirical distribution.
0 Using the Kullback-Leibler divergence: minimizing the divergence (for q**) is equivalent to maximizing the likelihood.
0 CFGs use the ERF method to find the best weights for probability
distributions.
0 ERF: for each rule i with weight βi and function fi(x) returning the number of times rule i is used in the derivation of tree x, p[f] represents the expectation of f under the probability distribution p:
p[f]=∑ₓp(x)f(x)
0 This way the ERF enables us to compute the expectation of each rule’s frequency
and normalize among rules with the same LHS.
*This is the structure generated by the grammar.
**The distribution q also accounts for missing mass that is not present in the training corpus.
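A minimal sketch of the ERF computation just described, for a toy treebank; the data structures and names here are illustrative, not from Abney's paper. It computes each rule's expected frequency under the empirical distribution and normalizes among rules that share a left-hand side:

```python
from collections import defaultdict

def erf_weights(treebank):
    """
    treebank: list of (empirical_prob, rule_counts) pairs, where rule_counts
    maps a rule (lhs, rhs) to the number of times it is used in the tree's
    derivation, i.e. f_i(x).  Returns ERF weights: the expectation p[f_i]
    of each rule's frequency, normalized among rules with the same LHS.
    """
    expectation = defaultdict(float)              # p[f_i] for each rule i
    for p_x, rule_counts in treebank:
        for rule, count in rule_counts.items():
            expectation[rule] += p_x * count      # p[f] = sum_x p(x) f(x)

    lhs_total = defaultdict(float)                # normalizer per left-hand side
    for (lhs, _rhs), e in expectation.items():
        lhs_total[lhs] += e

    return {rule: e / lhs_total[rule[0]] for rule, e in expectation.items()}

# Toy corpus: two trees with empirical probabilities 0.5 each.
toy = [
    (0.5, {("S", ("A", "A")): 1, ("A", ("a",)): 2}),
    (0.5, {("S", ("A", "A")): 1, ("A", ("a",)): 1, ("A", ("b",)): 1}),
]
print(erf_weights(toy))   # {('S', ('A','A')): 1.0, ('A', ('a',)): 0.75, ('A', ('b',)): 0.25}
```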
ERF weights are the best weights because they are closest to the empirical distribution. If the training corpus is generated by an SCFG, any apparent dependencies between the trees are assumed to be coincidental.
However:
-What if the dependencies are not just coincidental, but true context-dependencies?
-Apply the same method to AVGs, in an attempt to formalize an AVG as a CFG and follow similar rewrite rules and models?
Then we can treat the same context-free grammar as an attribute-value grammar, like the following:
ERF in AVGs?
0 This involves the following:
0 attaching probabilities to the AVG (in the same way as for CFGs)
0 estimating the weights
(Figure: AV structures generated by the grammar.)
(Figure: rule applications in the AV graph generated by the grammar, using ERF.)
** ϕ represents the dag* weight for the tree x1.
*Dag is short for ‘directed acyclic graph’.
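A reconstruction of the dag weight ϕ referred to in the footnote, consistent with the rule-frequency notation used so far: it is the unnormalized product of the weights of the rules applied in the AV structure,

```latex
\phi(x) \;=\; \prod_i \beta_i^{\,f_i(x)}
```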
Probability Estimation Using ERF in AVGs:
However, here the weight ϕ applies to the entire set of AV structures (shown on the previous slide) and does not by itself define a probability distribution, as the weights do in the CFG case.
0 In order to get correct probability distributions and weights, a normalizing constant for ϕ is introduced:
q(x) = (1/Z) ϕ(x), where Z is the normalizing constant Z = ∑_{x ∈ Ω} ϕ(x)
0 However, this normalization violates the conditions under which ERF gives the best weights, by changing the grammar and affecting the probabilities.
***This shows that applying ERF to AVGs requires normalization, but normalization violates the ERF conditions.
Solution to Normalization Problem:
Random Fields Method:
Defines a probability distribution over a set of labeled graphs Ω*, using functions that represent the frequencies of features and rules:
***This normalizes the weight distribution.
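The random field takes the standard log-linear form (a sketch consistent with the feature-frequency functions fᵢ used above, with weights λᵢ = log βᵢ):

```latex
q(x) \;=\; \frac{1}{Z}\,\exp\Big(\sum_i \lambda_i f_i(x)\Big),
\qquad
Z \;=\; \sum_{x \in \Omega} \exp\Big(\sum_i \lambda_i f_i(x)\Big)
```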
0 As a result, the empirical distribution needs fewer features. For parameter estimation with the Random Fields method, the following are needed:
0 Values for the weights, and the features corresponding to those weights
The Improved Iterative Scaling algorithm determines these values, and involves the following:
0 Null field (no features)
0 Feature Selection
0 Weight Adjustment
0 Iteration until the best weights are obtained
IIS improves the AV parameter-estimation weights. However, it does not give the best weights (compared to CFG parameter estimation) because of the number of iterations required for a given tree.
If the grammar is small enough, the results are moderately good; but if the grammar is large (as with AVGs), another alternative gives better results.
*Omegas are actually the AV structures generated by the grammar.
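A minimal sketch of the weight-adjustment and iteration steps of Improved Iterative Scaling, assuming a small, fully enumerable set Ω of structures and a fixed, already-selected feature set (feature selection and the sampling issues discussed above are omitted; all names are illustrative):

```python
import math

def iis_weight_adjustment(omega, p_tilde, features, n_outer=50, n_bisect=40):
    """
    Fit a log-linear field q(x) proportional to exp(sum_i lam_i * f_i(x)) so
    that model feature expectations match the empirical expectations p~[f_i].
    omega    : list of structures (assumed small enough to enumerate)
    p_tilde  : dict structure -> empirical probability
    features : list of functions f_i(x) returning non-negative counts
    """
    lam = [0.0] * len(features)                  # null field: all weights zero

    # empirical expectations p~[f_i], and total feature count f_#(x)
    emp = [sum(p_tilde.get(x, 0.0) * f(x) for x in omega) for f in features]
    f_total = {x: sum(f(x) for f in features) for x in omega}

    for _ in range(n_outer):                     # iteration
        scores = {x: math.exp(sum(l * f(x) for l, f in zip(lam, features)))
                  for x in omega}
        z = sum(scores.values())
        q = {x: s / z for x, s in scores.items()}

        for i, f in enumerate(features):         # weight adjustment
            if emp[i] == 0.0:
                continue
            # IIS update: solve  sum_x q(x) f_i(x) exp(delta * f_#(x)) = p~[f_i]
            # for delta by bisection (the left side is monotone since f_i(x) >= 0).
            lo, hi = -10.0, 10.0
            for _ in range(n_bisect):
                mid = 0.5 * (lo + hi)
                lhs = sum(q[x] * f(x) * math.exp(mid * f_total[x]) for x in omega)
                if lhs < emp[i]:
                    lo = mid
                else:
                    hi = mid
            lam[i] += 0.5 * (lo + hi)
    return lam
```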
Random Sampling (stochastic decision)
Using Metropolis-Hastings Algorithm:
0 Converts a sampler for the initial distribution into a sampler for the field distribution: the existing sampler proposes a new item.
0 If the proposal distribution gives the new item more probability than the field distribution does, the item is overrepresented.
0 A stochastic decision is made: the item is rejected with a probability equal to its degree of overrepresentation.
0 Relative to the field distribution:
0 if the new item is underrepresented, it is accepted with probability 1;
0 if the new item is overrepresented, it is accepted with a probability that diminishes as the overrepresentation increases.
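A minimal sketch of the acceptance rule described above, as an independence-sampler form of Metropolis-Hastings: an existing sampler (e.g. an SCFG approximation) proposes items, and the acceptance test corrects for over- and under-representation relative to the field distribution. All names are illustrative:

```python
import random

def mh_field_sampler(propose, proposal_prob, field_weight, x0, n_steps=1000):
    """
    propose()        : draw an item from the initial (proposal) distribution
    proposal_prob(x) : probability of x under the proposal distribution
    field_weight(x)  : unnormalized field probability phi(x)
    Returns a list of samples approximately distributed like the field.
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        y = propose()
        # If the proposal over-represents y relative to the field, the ratio
        # shrinks and y tends to be rejected; an under-represented y gives a
        # ratio >= 1 and is accepted with probability 1.
        ratio = (field_weight(y) * proposal_prob(x)) / (field_weight(x) * proposal_prob(y))
        if random.random() < min(1.0, ratio):
            x = y
        samples.append(x)
    return samples
```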
Conclusion
0 Stochastic CFG methods are not easily convertible to AVGs.
0 Because of constraints and true context-dependencies in AVGs, the ERF method gives wrong weights for parameter estimation.
0 Obtaining correct weights requires various methods:
0 Random Sampling
0 Feature Selection
0 Weight Adjustment
0 Random Sampling shows promise for future experimentation, but it is slow and the output weights are not optimal because of the number of iterations required. This is even more difficult for larger grammars, where the large number of features multiplies the number of sampling iterations needed, so the results are not optimal.
0 Also, these methods assume completely parsed data, so without additional methods* incomplete data poses major challenges.
*(described in Riezler, 1997)
Dynamic Programming for Parsing and Estimation of
Stochastic Unification-based Grammars,
Geman & Johnson, 2002
INTRO
Because the earlier algorithms for stochastic parsing and estimation, proposed by Abney (1997) and Johnson et al. (1999), required extensive enumeration and did not work well on large grammars, Geman and Johnson discuss another algorithm that performs graph-based dynamic programming without the enumeration step.
Goal
In this research, Geman and Johnson's target is to find the most probable parse by using packed parse-set representations (aka Truth Maintenance Systems, TMS).
*A packed representation is a quadruple R = (F', X, N, α)
Overview:
0 Viterbi Algorithm: finds the most probable parse of a string.
0 Inside-Outside Algorithm: estimates a PCFG from unparsed data.
0 Maxwell III and Kaplan's algorithm: takes a string y and returns a packed representation R such that Ω(R) = Ω(y), i.e., R represents the set of parses of the string y.
Features of Dynamic Programming in SUBGs:
0 Properties must be very local with respect to features.
0 Finding the probability of a parse requires maximizing functions, where each function depends on a subset of the variables.
0 Estimating the property weights requires maximizing the conditional likelihood of the training-corpus parses (given their yields**).
0 Maximizing the conditional likelihood requires calculating conditional expectations; for this, the Conjugate Gradient algorithm is used.
These calculations are generalizations of the Viterbi and forward-backward algorithms for HMMs used in dynamic programming.
*A property is a real-valued function of parses Ω.
**Yield is a string of words associated with well-formed parse Ω.
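A sketch of the objective referred to above, in log-linear form: with properties f_j(ω) and weights θ_j, the conditional likelihood of the training parses ω_i given their yields y_i is

```latex
L(\theta) \;=\; \prod_i P_\theta(\omega_i \mid y_i),
\qquad
P_\theta(\omega \mid y) \;=\;
\frac{\exp\big(\sum_j \theta_j f_j(\omega)\big)}
     {\sum_{\omega' \in \Omega(y)} \exp\big(\sum_j \theta_j f_j(\omega')\big)}
```

where Ω(y) is the set of parses whose yield is y.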
Conclusions:
0 In this paper, Geman and Johnson examine only a few of the algorithms that use Maxwell and Kaplan representations for dynamic programming in parsing.
0 Because these representations provide a very compact encoding of a sentence's set of parses, they eliminate the need to list each parse separately. By the nature of natural language, strings and features occur as substrings or sub-features of other components, and therefore the parsing process can be made faster via dynamic programming.
0 The algorithms presented here show promising results for parsing unification-based grammars compared to earlier experiments that iterate through countless enumerations.
Stochastic HPSG Parse Disambiguation
Using The Redwoods Corpus
INTRO
Toutanova, Flickinger, 2005
Supporting Abney's and other researchers' conclusions about the improvements that result from using conditional stochastic models in parsing, Toutanova and Flickinger experimented with both generative and conditional log-linear models.
Recall:
***Brew & Eisele had found that ERF does not give the best weights for parameter estimation when there are true context-dependencies. This is why Abney experimented with other methods such as normalization constants, random fields, random sampling, etc.
Goal:
0 Using the Redwoods corpus, build a tagger for HPSG lexical
item identifiers to be used for parse disambiguation.
0 Train stochastic models using:
0 Derivation Trees
0 Phrase Structure Trees
0 Semantic Trees (Approximations to Minimal Recursion Semantics)
In order to build:
0 Generative Models
0 Conditional Log-linear Models
HOW?
0 Use the probabilistic models (from training) to rank the possible analyses of unseen test sentences according to the probabilities those models assign to them.
Generative Model vs. Conditional Log-Linear Model:
Generative Models define a probability distribution over the derivation trees corresponding to the HPSG analyses of sentences. The PCFG parameters used for each CFG rule correspond to the schemas of HPSG. The probability weights are obtained using relative-frequency estimation with Witten-Bell smoothing.
Conditional Log-Linear Models have a set of features {f₁, …, f_M} and a set of corresponding weights {λ₁, …, λ_M}. The conditional models define features over derivation and semantic trees. The models are trained by maximizing the conditional likelihood of the preferred analyses, using a Gaussian prior for smoothing. For the models in this project, the Gaussian prior is set to 1.
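A sketch of the conditional log-linear form just described, where t ranges over the analyses T(s) of a sentence s; the Gaussian prior term penalizes large weights during training (reading the slide's "Gaussian prior set to 1" as σ² = 1 is an assumption):

```latex
P_\lambda(t \mid s) \;=\;
\frac{\exp\big(\sum_{m=1}^{M} \lambda_m f_m(t)\big)}
     {\sum_{t' \in T(s)} \exp\big(\sum_{m=1}^{M} \lambda_m f_m(t')\big)},
\qquad
\ell(\lambda) \;=\; \sum_i \log P_\lambda(t_i \mid s_i) \;-\; \sum_{m=1}^{M} \frac{\lambda_m^2}{2\sigma^2}
```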
Overview of Models:
Model Types          Generative Models                        Conditional Models
Tagger               Trigram HMM tagger                       Ltagger: Ltrigram
Derivation Trees     PCFG-1P, PCFG-2P, PCFG-3P, PCFG-A        LPCFG-1P, LPCFG-A
Semantic Trees       PCFG-Sem                                 LPCFG-Sem
Model Combination    PCFG-Combined                            LCombined
Comparison of Models:
Generative Model

HMM Trigram Tagger:
• For probability estimates, uses maximum relative frequency over the pre-terminal sequences and yields of derivation trees.
• Probability estimates are smoothed by linear interpolation, using Witten-Bell smoothing with varying parameter d.
• It does not, unfortunately, take advantage of lexical-type or type-hierarchy information from the HPSG.

Conditional Model

Ltagger (Trigram):
• Includes features for all lexical-item trigrams, bigrams, and unigrams.

***Pre-terminals are lexical item identifiers.
***This is based more on a Naive Bayes model of probability estimation.
Comparison of Models: Derivation Trees
Generative Model

PCFG-1P:
• Parameters: θ_{i,j} = P(α_j | A_i), maximizing the likelihood of the preferred parses in the training set.
• For wider context, extra parameters for additional nodes are set, and models PCFG-2P and PCFG-3P are trained.
• For estimation of probabilities: linear interpolation with linear subsets of the conditioning context.*
*Conditioning coefficients are obtained using Witten-Bell smoothing.

PCFG-A:
• Conditions each node's expansion on up to five of its ancestors. The ancestor selection is similar to context-specific independencies in Bayesian networks.
• Final probability estimates are linear interpolations of relative frequencies, where the coefficients are estimated using Witten-Bell smoothing.

Conditional Model

LPCFG-1P:
• Parameters: λ_{i,j}, with one feature for each expansion A_i → α_j of each nonterminal in the tree.
• Weights are obtained using maximum conditional likelihood.

LPCFG-A:
• For every path and expansion, a feature is added.
• The feature values are the sums of the feature values of local trees.
• Uses a generative component of expansion for feature selection, and is thus not purely discriminative in construction.
• Uses feature parameters for leaves and internal nodes, and multiplies its parameters.
Comparison of Models: Semantic Trees
Generative Model

PCFG-Sem:
• Similar to markovized rule models; uses a decision-tree growing algorithm, with Witten-Bell smoothing for the final tree.
• Adds stop symbols at the right and left ends of the dependents.
• The probability of a dependency is estimated as a product of the probabilities of local trees. Five conditions for dependent generation:
1. The parent of the node
2. The direction (left / right)
3. The number of dependents already generated in the surface string between the head and the dependent
4. The grandparent label
5. The label of the immediately preceding dependent

Semantic dependency tree for 'I am sorry' (generation order from the figure: first the left dependents are created from right to left, given the head, parent, right sister, and number of dependents; second, the right dependents from left to right; third, stop symbols are added at the left and right ends):
P(pron_rel | be_prd_rel, left, 0, top, none) x
P(stop | be_prd_rel, left, 0, top, pron_rel) x
P(sorry_rel | be_prd_rel, right, 0, top, none) x
P(stop | be_prd_rel, right, 1, top, sorry_rel)

Conditional Model

LPCFG-Sem:
• Corresponds to the PCFG-Sem model. Features were defined in the same way as LPCFG-A's were defined from the PCFG-A model (i.e., by adding a feature for every path and expansion occurring in that node).
Comparison of Models: Combination Trees
Generative Model

PCFG-Combined:
• A combination of PCFG-A, the HMM tagger, and PCFG-Sem.
• It computes the scores of analyses by the individual models.
• Uses trigram tag-sequence probabilities (the transition probabilities of the HMM tagging model).
• Uses interpolation weights of λ₁ = 0.5 and λ₂ = 0.4.
***Generative models were combined by taking a weighted log sum of the probabilities they assign to trees.

Conditional Model

LCombined:
• Combines the log-linear models LPCFG-A, LPCFG-Sem, and Ltagger with their features.
• Conditional models were combined by collecting all features into one model.
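A plausible reading of the weighted log-sum combination for the generative models (the slide gives two of the interpolation weights but does not spell out which component each weight attaches to, so the assignment below is left generic):

```latex
\text{score}(t) \;=\; \sum_{k} w_k \log P_k(t),
\qquad k \in \{\text{PCFG-A},\ \text{HMM tagger},\ \text{PCFG-Sem}\}
```

with two of the weights given on the slide as λ₁ = 0.5 and λ₂ = 0.4.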
Results:
-All sentences, ambiguous and unambiguous, may be used for training; only the ambiguous set is used for testing.
-For this work, unambiguous sentences were discarded from the test set.
-Unambiguous sentences were used for training in the generative models. The conditional models did not use any unambiguous sentences.*
Sentences      #      Length   Str. Ambiguity
All            6876   8.0      44.5
Ambiguous      5266   9.1      57.8

Method         PCFG Accuracy   LPCFG Accuracy   Method
Random         22.7            22.7             Random
HMM-Trigram    42.1            43.2             LTrigram
perfect        48.8            48.8             perfect
PCFG-1P        61.6            72.4             LPCFG-1P
PCFG-A         71.0            75.9             LPCFG-A
PCFG-Sem       62.8            65.4             LPCFG-Sem
Combined       73.2            76.7             LCombined
It's also important to note that high accuracy might result from:
• a low ambiguity rate in the corpus
• short sentence length
Also, accuracy decreases as ambiguity increases,** and there is overfitting between testing and training.

• High accuracy is already achieved even by simple statistical models.
• The HMM tagger does not perform well by itself compared with other models that have more information about the parse.
• PCFG-A achieved a 24% error reduction over PCFG-1P.
• PCFG-Sem has good accuracy, but is not better than PCFG-A by itself.
• Model combination has complementary advantages:
  • the tagger adds left-context information to the PCFG-A model
  • PCFG-Sem provides semantic information
• LPCFG-1P achieves a 28% error reduction over PCFG-1P.
• Overall, there is a 13% error reduction from PCFG-Combined to LCombined.
*Unambiguous sentences contribute only a constant to the log-likelihood in conditional models.
**Accuracy is 94.6% for sentences with 2 possible analyses, and 40% with more than 100 analyses.
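As a check on the error-reduction figures quoted above (taking error = 100 − accuracy, in exact-match terms):

```latex
\text{PCFG-1P} \to \text{PCFG-A}:\ \frac{71.0 - 61.6}{100 - 61.6} \approx 24\%,
\qquad
\text{PCFG-1P} \to \text{LPCFG-1P}:\ \frac{72.4 - 61.6}{100 - 61.6} \approx 28\%,
\qquad
\text{Combined} \to \text{LCombined}:\ \frac{76.7 - 73.2}{100 - 73.2} \approx 13\%
```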
Error Analysis & Conclusion:
0 Ancestor annotation gives higher accuracy than the other models, and especially better results on phrase structure trees.
0 Another combined model, based on log-probability combinations of derivation and PS trees, gives even slightly better accuracy on phrase structures. This shows:
0 Node labels (schema names) provide enough information for good accuracy.
0 Semantic models perform worse than expected.
0 Data scarcity, size of corpus
A more detailed analysis of errors shows that, out of 165 sentences in this study:
0 26% were caused by annotation errors in the treebank
0 12% were caused by both the treebank and the model
0 62% were real errors (the correct parse was captured in the treebank***)
For these real errors, 103 out of 165:
0 27 PP-attachments
0 21 wrong lexical-item selection
0 15 modifier attachment
0 13 coordination
0 9 Complement / adjunct
0 18 other misc. errors
***This figure was 20% in 2002, for the 1st growth of the Redwoods corpus, which shows that the corpus was improved in the meantime as well.
Further Discussion & Future Work:
Given the numbers in the results, one can ask whether the reported improvements are good enough, by whose standard, and what the measure of goodness should be in this kind of experiment.
Toutanova & Flickinger's research shows that the performance of the stochastic models was not greatly higher than that of the simpler statistical models.
It is especially surprising that the more work was put into the later models, the less improvement was obtained from them.*
*If you recall, the highest improvement came from the first LPCFG model.
Some of the following ideas could be researched further to improve the current working models and address their inefficiencies:
0 Better use of semantic or lexical info, use of MRS semantic representation
0 Automatic clustering, further examination of existing lexical sources
0 Semantic collocation information
0 Use of other features within the HPSG framework such as:
0 syntactic categories
0 clause finiteness
0 agreement features
0 Increase in corpus size