
Search-Based Structured
Prediction
Hal Daume III, John Langford, Daniel Marcu
ABSTRACT
 This paper proposes SEARN, an algorithm for integrating search and
learning to solve complex structured prediction problems.
 SEARN transforms complex structured prediction problems into simple
binary classification problems to which any simple binary classifier can be applied.
 It is suitable for any loss function and any class of features.
 SEARN guarantees that good performance on the derived
classification problems implies good performance on the whole
structured prediction problem.
What complex internal structure is
 Ordinary classification focuses on simple outputs such as {-1, +1}; more
complex structures include label sequences.
 The output spaces we focus on include:
 The space of English sentences (in a machine translation
application).
 The space of short documents (in an automatic document
summarization application).
 The space of possible assignments of elements in a database.
What structured prediction algorithm do we want
 Not limited to models with bounded treewidth
 Applicable to any loss function
 Handles arbitrary features
 Can cope with imperfect data
SEARN meets all these requirements by transforming structured prediction into
binary prediction and then applying a binary classifier.
Good binary classification performance implies good structured prediction
performance.
Structured Prediction Problem
Definitions
As a simple example, consider a parsing problem under F1 loss. In this
case, D is a distribution over (x, c) where x is an input sequence and, for every
tree y with |x|-many leaves, c_y is the F1 loss of y against the "true" output. The
goal is to find a hypothesis h that minimizes the expected loss given in Eq (1):
L(D, h) = E_{(x,c)~D}[ c_{h(x)} ].
The Searn Algorithm
There are several vital ingredients in the application of SEARN:
 A search space for decomposing the prediction problem.
 A cost-sensitive learning algorithm
 Labeled structured prediction training data
 A known loss function for the structured prediction problem
 A good initial policy (we call the learned classifier a policy)
Search space and cost-sensitive learning algorithm
 For the search space S: in simple applications it can be concrete, e.g.,
all possible part-of-speech tags for each word. In general, though,
structured prediction uses an abstract search space in the algorithm,
together with a function that turns points of the abstract space into
outputs of the correct form.
 SEARN requires a learning algorithm that returns a classifier
h(s) given cost-sensitive training data. The performance of SEARN
depends strongly on how capable the learned classifier is.
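As a minimal sketch (all names here are illustrative, not from the paper), the learner interface SEARN assumes could look like this in Python:

    from dataclasses import dataclass
    from typing import Callable, List, Sequence

    @dataclass
    class CSExample:
        features: List[float]   # representation of the current state s
        costs: List[float]      # costs[a] = cost (regret) of choosing action a

    Classifier = Callable[[List[float]], int]  # h: state features -> action

    def train_cost_sensitive(examples: Sequence[CSExample]) -> Classifier:
        # Any cost-sensitive learner fits this interface; as a trivial
        # stand-in, predict the action with the lowest average cost.
        n_actions = len(examples[0].costs)
        avg = [sum(ex.costs[a] for ex in examples) / len(examples)
               for a in range(n_actions)]
        best = min(range(n_actions), key=lambda a: avg[a])
        return lambda feats: best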
SEARN at train time: initial policy
SEARN operates in an iterative fashion. At each iteration, it uses the current
policy to create new cost-sensitive classification examples and uses them
to train a new classifier. It keeps iterating until the current policy has very
little dependence on the initial policy (sketched below).
SEARN relies on a good initial policy on the training data.
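A minimal sketch of this outer loop, assuming hypothetical helpers make_cs_examples and train_cost_sensitive; beta and the iteration count are free parameters:

    import random

    def searn_train(examples, initial_policy, make_cs_examples,
                    train_cost_sensitive, beta=0.1, iterations=10):
        policy = initial_policy
        for _ in range(iterations):
            # 1. Use the current policy to generate cost-sensitive examples.
            cs_data = make_cs_examples(examples, policy)
            # 2. Learn a new classifier on those examples.
            h = train_cost_sensitive(cs_data)
            # 3. Stochastic interpolation: with probability beta follow h,
            #    otherwise fall back on the previous (mixture) policy, so
            #    dependence on the initial policy decays over iterations.
            def new_policy(state, _old=policy, _h=h):
                return _h(state) if random.random() < beta else _old(state)
            policy = new_policy
        return policy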
SEARN at train time: Cost-sensitive
examples
 SEARN generates one path per structured training example, and creates a
cost-sensitive example for each state along that path.
 The cost associated with taking an action that leads to state s is the regret
of that action given the current policy: at state s, we
take action a and then execute the policy to obtain a complete sequence of
predictions ŷ, whose expected loss we measure.
 The cost of a is the difference in loss between taking action a and taking the
best alternative action a'; see Eq (2): c_a = l(a) - min_{a'} l(a').
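A sketch of this rollout-based cost computation (rollout_loss is a hypothetical helper that takes the action at the given state, runs the policy to completion, and returns the loss of the resulting output):

    def action_costs(state, actions, policy, rollout_loss):
        # For each action a: take a at `state`, then follow `policy` to a
        # complete output y-hat and measure its loss (this is l(a) in Eq (2)).
        losses = [rollout_loss(state, a, policy) for a in actions]
        # The cost-sensitive cost of a is its regret: l(a) - min_a' l(a').
        best = min(losses)
        return [l - best for l in losses]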
Viterbi algorithm as best path
Stochasticity in the policy
 The policy may be stochastic even when the learned base classifier is
deterministic, due to SEARN's stochastic interpolation.
 The paper suggests Monte-Carlo sampling as the best solution: one
draws many paths according to h beginning at s0 and averages over the
costs, as sketched below.
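A minimal sketch of that averaging (sample_rollout_loss is hypothetical and is assumed to draw one stochastic rollout per call):

    def mc_rollout_loss(state, action, policy, sample_rollout_loss, n_samples=20):
        # The interpolated policy is stochastic, so a single rollout is noisy;
        # draw several complete paths after `action` and average their losses.
        total = 0.0
        for _ in range(n_samples):
            total += sample_rollout_loss(state, action, policy)
        return total / n_samples

This averaged estimate can then play the role of rollout_loss in the action_costs sketch above.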
Complete SEARN algorithm
[Figure: cost-sensitive examples are fed into the cost-sensitive classification algorithm to create a new policy]
Computing the initial policy
 We mentioned that SEARN relies upon the ability to start with a good initial policy π.
 Under standard loss functions, computing the initial policy is easy: in a sequence
labeling problem with Hamming loss, it amounts to choosing the correct label at
position i given the true label sequence (see the sketch after this list).
 SEARN is not limited to simple Hamming loss. It can learn under more
complex structures and loss functions, a feature that many other
frameworks do not have.
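A minimal sketch of that Hamming-loss initial policy (the state fields are hypothetical):

    def initial_policy_hamming(state):
        # Under Hamming loss, the optimal policy simply emits the true label
        # at the next position, regardless of earlier (possibly wrong)
        # predictions. `partial_output` and `true_labels` are assumed fields.
        i = len(state.partial_output)   # next position to label
        return state.true_labels[i]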
Search-based policies
 The SEARN algorithm does not require the initial policy to be
optimal; it can train against any policy.
 At any step of SEARN, including the initial one, we need to
compute the best next step. By searching for the shortest
path and taking the first step along this shortest path, we
can obtain a good initial policy. This is useful in cases
where a good closed-form initial policy is not available.
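A sketch of this idea, assuming a hypothetical shortest_path search routine:

    def search_based_initial_policy(state, shortest_path):
        # `shortest_path` stands in for a search routine (e.g. A* or
        # uniform-cost search) that returns a list of actions from `state`
        # to a minimum-loss complete output. Taking only the first step of
        # that path at every state yields a usable initial policy.
        return shortest_path(state)[0]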
Theoretical analysis
 Each iteration of SEARN slightly degrades the current policy. The main theorem
states that the learned policy is not much worse than the starting (optimal)
policy, plus a term related to the average cost-sensitive loss of the learned
classifiers and another term related to the maximum cost-sensitive loss.
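In symbols, the bound has roughly the following shape (a reconstruction from the statement above; check the paper for the exact constants):

    L(\pi_{\mathrm{SEARN}}) \le L(\pi) + 2\,T\,\ell_{\mathrm{avg}}\ln T + (1 + \ln T)\,\frac{c_{\max}}{T}

where T is the number of decisions per example, \ell_{\mathrm{avg}} is the average cost-sensitive loss of the learned classifiers, and c_{\max} is the maximum per-decision cost.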
 The paper notes that the number of iterations and β need not be as extreme
as the theory suggests; β should simply be small enough that, at each
iteration, the new policy is likely to make only one choice differently from
the previous policy.
Comparison: Independent classifiers
 Pro: Independent classifiers make training and testing incredibly
straightforward.
 Con: One cannot define complex features over the output space.
 SEARN is more similar to the MEMM prediction setting, in which the MEMM's
prediction is made on the basis of the k previous predictions. However, MEMMs
make the assumption that all previous predictions are correct, which may not
be the case.
Comparison: Perceptron-style
algorithms
 The incremental perceptron is a variant of the structured perceptron that
deals with the issue that the arg max may not be analytically available. It
replaces the arg max with a beam search algorithm. The key observation is
that it is often possible to detect, while executing the search, whether
it is still possible for the resulting output to be correct (sketched after this list).
 The incremental perceptron is essentially a search-based structured
prediction technique. It is, however, much more limited: it cannot cope
with arbitrary loss functions, and it is tied to beam search.
Moreover, for search problems with a large number of internal decisions,
aborting search at the first error is far from optimal.
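A sketch of that early-abort check (helper names hypothetical; this illustrates the idea, not the exact algorithm):

    def early_update_check(beam_items, can_reach_gold, update):
        # Early-update idea behind the incremental perceptron: if no
        # hypothesis left in the beam can still be extended to the correct
        # output, abort the search and apply a perceptron update against
        # the current partial hypotheses.
        if not any(can_reach_gold(item) for item in beam_items):
            update(beam_items)   # update on the partial (erroneous) path
            return False         # signal the caller to stop searching
        return True              # otherwise keep expanding the beam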
Experimental results: Sequence
Labeling
 Sequence labeling is the task of assigning a label to each element in an
input sequence. Sequence labeling is an attractive test for structured
prediction algorithms because it is the simplest non-trivial structure.
 The paper evaluates four sequence labeling tasks:
 handwriting recognition
 named entity recognition (in Spanish)
 syntactic chunking
 joint chunking and part-of-speech tagging
Sequence labeling properties
 In handwriting recognition and Spanish named entity recognition, the
features are basically the same: each possible letter/entity has a unique
feature that counts how many times it appears in the output, and there are
features counting pair/triple appearances in the output (see the sketch after
this list).
 For syntactic chunking, the paper uses the same set of features across all
models, separated into "base features" and "meta features." The base
features apply to words individually, while meta features apply to entire
chunks.
 The initial policy is to label the next word correctly with respect to the loss
function.
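A minimal sketch of those unigram/bigram count features (illustrative, not the paper's exact feature set):

    from collections import Counter

    def unigram_bigram_features(output_labels):
        # One feature per label counting its occurrences in the output,
        # plus features counting adjacent pairs (triples are analogous).
        feats = Counter()
        for y in output_labels:
            feats[("uni", y)] += 1
        for y1, y2 in zip(output_labels, output_labels[1:]):
            feats[("bi", y1, y2)] += 1
        return feats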
Experimental results: Sequence
Labeling
 Structured techniques consistently outperform their classification
counterparts. For all classifiers, adding SEARN consistently improves
performance.
 Larger data sets give better performance than small ones, and some
simple classification techniques, when applied to large data sets,
outperform structured learning algorithms.
 The only task on which SEARN does not outperform the competing
techniques is the raw chunking task.
 With the exception of the handwriting recognition task, SEARN using logistic
regression as a base learner performs at the top of the pack. SVM-based
models tend to be expensive to train compared to the perceptron.
Experimental results: Automatic
Document Summarization
 Multi-document summarization is the task of creating a summary out of a
collection of documents on a focused topic.
 The metric chosen for automatic summary evaluation is ROUGE, which computes
n-gram overlap between a system summary and a set of human-written
summaries (sketched after this list).
 The features may cover any aspect of the currently generated summary and any
part of the document set, including word and term identity, syntactic relations,
and sentence length.
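A minimal sketch of the n-gram recall idea behind ROUGE-n (the official metric has many more options; this is only the core computation):

    from collections import Counter

    def rouge_n(system_tokens, reference_summaries, n=2):
        # Count how many reference n-grams also occur in the system
        # summary (clipped counts), divided by the total number of
        # reference n-grams.
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
        sys_counts = ngrams(system_tokens)
        matched = total = 0
        for ref in reference_summaries:
            ref_counts = ngrams(ref)
            total += sum(ref_counts.values())
            matched += sum(min(c, sys_counts[g]) for g, c in ref_counts.items())
        return matched / total if total else 0.0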
Experimental results: Automatic
Document Summarization
 The oracle results for the vine-growth method are significantly higher. That is
because, under SEARN, the summaries produced by the vine-growth
technique are uniformly better than those produced by raw extraction.
 The SEARN-based systems uniformly dominate this result, but the comparison
is not fair due to differences in training data.
Discussion and Conclusions about
SEARN
 SEARN is an easy technique for training models for which complex search
algorithms must be used.
 SEARN contrasts with more typical algorithms such as CRFs and M3Ns in
how information is shared at test time: SEARN shares information at
training time, without ever requiring inference over the entire output.
 Limitations of SEARN:
 Although training via SEARN is likely preferable to training against only an initial
policy, it can still be overly optimistic.
 There is a slight disparity between what SEARN does at a theoretical level and
how SEARN functions in practice.
Thank you!