How to compute a conditional random field

Introduction to Conditional
Random Fields
John Osborne
Sept 4, 2009
Overview
• Useful Definitions
• Background
– HMM
– MEMM
• Conditional Random Fields
– Statistical and Graph Definitions
• Computation (Training and Inference)
• Extensions
– Bayesian Conditional Random Fields
– Hierarchical Conditional Random Fields
– Semi-CRFs
• Future Directions
Useful Definitions
• Random Field (wikipedia)
– In probability theory, let S = {X1, ..., Xn}, with the Xi in {0, 1, ..., G − 1} being a set
of random variables on the sample space Ω = {0, 1, ..., G − 1}^n. A probability
measure π is a random field if, for all ω in Ω, π(ω) > 0.
• Markov Process (chain if finite sequence)
– Stochastic process with Markov property
• Markov Property
– The probability that a random variable assumes a value depends on the other
random variables only through the ones that are its immediate neighbors
– “memoryless”
• Hidden Markov Model (HMM)
– Markov Model where the current state is unobserved
• Viterbi Algorithm
– Dynamic programming technique to find the most likely sequence of hidden states
that explains the observations in an HMM
– Determine labels
• Potential Function == Feature Function
– In a CRF the potential function scores the compatibility of y_t, y_{t-1} and w_t(X)
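The Viterbi definition above can be made concrete with a small sketch. The two-state model below (state names, observation names, and all probabilities) is a made-up toy for illustration, not something taken from these slides.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path and its joint probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            # Pick the predecessor r that maximizes the path probability into s
            prob, prev = max((best[t - 1][r][0] * trans_p[r][s], r) for r in states)
            row[s] = (prob * emit_p[s][obs[t]], prev)
        best.append(row)
    # Backtrack from the best final state
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy HMM (all numbers are assumptions for illustration)
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)
```

The table keeps only the best path into each state at each step, so the cost grows linearly with sequence length instead of exponentially.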
Background
• Interest in CRFs arose from Richa’s work with
gene expression
• Current literature shows them performing better
on NLP tasks than other commonly used NLP
approaches like Support Vector Machines (SVM),
neural networks, HMMs and others
– Term coined by Lafferty et al. in 2001
• Predecessors were HMMs and maximum entropy
Markov models (MEMMs)
HMM
– Definition
• Markov Model where the current state is
unobserved
– Generative Model
– Conditioning on the entire input X would be
prohibitive, hence the Markov property:
each step looks only at the current element
of the sequence
– Cannot easily represent multiple interacting
features or long-range dependencies
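To illustrate what "generative" means here, the sketch below computes the HMM joint probability P(X, Y) as a product of transition and emission terms. The labels, words, and probabilities are hypothetical values chosen only for the example.

```python
def hmm_joint(labels, obs, start_p, trans_p, emit_p):
    """Joint probability P(X, Y) of a label sequence and an observation sequence."""
    p = start_p[labels[0]] * emit_p[labels[0]][obs[0]]
    for t in range(1, len(obs)):
        # Each step depends only on the previous label and the current observation
        p *= trans_p[labels[t - 1]][labels[t]] * emit_p[labels[t]][obs[t]]
    return p


# Hypothetical toy parameters (assumptions for illustration)
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"dog": 0.6, "runs": 0.4}, "V": {"dog": 0.1, "runs": 0.9}}

joint = hmm_joint(["N", "V"], ["dog", "runs"], start_p, trans_p, emit_p)
print(joint)
```

Because the model assigns a probability to every (X, Y) pair, summing over all label and observation sequences of a given length gives 1; a discriminative model such as a CRF only normalizes over Y for a fixed X.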
MEMMs
– McCallum et al, 2000
– Non-generative finite-state model based on
next-state classifier
– Directed graph
– P(Y|X) = ∏_t P(y_t | y_{t-1}, w_t(X)), where w_t(X) is a
sliding window over the
X sequence
Label Bias Problem
• Transitions leaving a given state compete only
against each other, rather than against all other
transitions in the model
• Implies “conservation of score mass” (Bottou,
1991)
• Observations can be ignored, Viterbi decoding
can’t downgrade a branch
• CRF will solve this problem by having a single exponential
model for the joint probability of the ENTIRE SEQUENCE OF
LABELS given the observation sequence
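A minimal sketch of why per-state normalization causes label bias: when a state has a single outgoing transition, the locally normalized probability is 1 no matter how well or badly the observation fits. The state names and raw scores below are hypothetical.

```python
import math


def next_state_probs(scores):
    """Normalize raw scores over the successors of ONE state (MEMM-style)."""
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}


# Hypothetical state with a single successor "B": the observation score is irrelevant
good_fit = next_state_probs({"B": 5.0})    # observation strongly supports B
bad_fit = next_state_probs({"B": -5.0})    # observation strongly contradicts B

# A state with two successors still has to split its mass between them
two_way = next_state_probs({"B": 2.0, "C": 0.0})

print(good_fit, bad_fit, two_way)
```

Both single-successor cases give probability 1.0 to "B", so the observation has no influence on that branch. A CRF avoids this by normalizing once over whole label sequences rather than at each state.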
Big Picture Definition
• Wikipedia Definition (Aug 2009)
– A conditional random field (CRF) is a type of discriminative probabilistic
model most often used for the labeling or parsing of sequential data, such as
natural language text or biological sequences.
• Probabilistic model is a statistical model, in math terms “a pair (Y,P) where
Y is the set of possible observations and P the set of possible probability
distributions on Y”
– In statistics terms this means the objective is to infer (or pick) the distinct
element (probability distribution) in the set “P” given your observation Y
• Discriminative model meaning it models the conditional probability
distribution P(y|x) which can predict y given x.
– It cannot go the other way (produce x from y) because it is not a
generative model (capable of generating sample data given a model): it does
not model a joint probability distribution
– Similar to other discriminative models like support vector machines and neural
networks
• When analyzing sequential data a conditional model specifies the
probabilities of possible label sequences given an observation sequence
CRF Graphical Definition
Definition from Lafferty
• Undirected graphical model
• Let G = (V,E) be a graph such
that Y = (Yv), v ∈ V, so that Y is
indexed by the vertices of G.
Then (X,Y) is a conditional
random field in case, when
conditioned on X, the random
variables Yv obey the Markov
property with respect to the
graph:
p(Yv|X,Yw,w≠v)=p(Yv|X,Yw,w~v),
where w~v means that w and
v are neighbors in G
CRF Undirected Graph
Computation of CRF
• Training
– Conditioning
– Calculation of Feature Function
– P(Y|X) = (1/Z(X)) exp( ∑_t Ψ(y_t, y_{t-1}, w_t(X)) )
• Z(X) is the normalizing factor
• Ψ(...) is the potential function
• Inference
– Viterbi Decoding
– Approximate Model Averaging
– Others?
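The formula P(Y|X) = (1/Z(X)) exp(∑_t Ψ(y_t, y_{t-1}, w_t(X))) can be checked by brute force on a tiny example. This is only a sketch: the label set, the two indicator features inside `psi`, their weights, and the choice of a window of width one (w_t(X) is just the current word) are all assumptions made for illustration.

```python
import itertools
import math

LABELS = ["N", "V"]  # hypothetical label set


def psi(y_t, y_prev, x_t):
    """Hypothetical potential: a weighted sum of two toy indicator features."""
    score = 0.0
    if x_t.endswith("s") and y_t == "V":
        score += 1.0   # observation feature
    if y_prev == "N" and y_t == "V":
        score += 0.5   # transition feature
    return score


def sequence_score(ys, xs):
    """Sum of potentials over all positions; the first position has no predecessor."""
    return sum(psi(ys[t], ys[t - 1] if t > 0 else None, xs[t]) for t in range(len(xs)))


def crf_distribution(xs):
    """P(Y|X) for every label sequence, normalizing over ENTIRE sequences."""
    unnorm = {ys: math.exp(sequence_score(ys, xs))
              for ys in itertools.product(LABELS, repeat=len(xs))}
    z = sum(unnorm.values())  # Z(X): one global normalizer
    return {ys: v / z for ys, v in unnorm.items()}


dist = crf_distribution(["dog", "runs"])
best = max(dist, key=dist.get)
print(best, dist[best])
```

Enumerating every label sequence is exponential in the sequence length, so this is only usable for toy inputs; practical implementations compute Z(X) with a forward pass and decode with Viterbi.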
Training Approaches
• CRF training is supervised learning, so it can use
– Maximum Likelihood (original paper)
• Used iterative scaling method, was very slow
– Gradient Ascent
• Also slow when naïve
– Mallet Implementation used BFGS algorithm
• http://en.wikipedia.org/wiki/BFGS
• Broyden-Fletcher-Goldfarb-Shanno
• Approximate 2nd order (quasi-Newton) algorithm
– Stochastic Gradient Method (2006) accelerated via Stochastic Meta Descent
– Gradient Tree Boosting (variant of a 2001
• http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf
• Potential functions are sums of regression trees
– Decision trees using real values
• Published 2008
• Competitive with Mallet
– Bayesian (estimate posterior probability)
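As a rough sketch of the gradient ascent option listed above (not iterative scaling, BFGS, or any of the other methods), the code below maximizes the conditional log-likelihood of a tiny CRF. The gradient for each weight is the observed feature count minus the expected count under the model. The two features, the training data, the learning rate, and the brute-force enumeration are all assumptions chosen to keep the example short.

```python
import itertools
import math

LABELS = ["A", "B"]  # hypothetical label set


def features(ys, xs):
    """Hypothetical global features: [labels matching their observation, repeated labels]."""
    f_match = sum(1.0 for y, x in zip(ys, xs) if y == x.upper())
    f_stay = sum(1.0 for a, b in zip(ys, ys[1:]) if a == b)
    return [f_match, f_stay]


def log_likelihood_and_grad(w, data):
    ll, grad = 0.0, [0.0] * len(w)
    for xs, ys in data:
        # Brute-force enumeration of all label sequences (toy sizes only)
        cands = list(itertools.product(LABELS, repeat=len(xs)))
        feats = [features(c, xs) for c in cands]
        scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]
        log_z = math.log(sum(math.exp(s) for s in scores))
        gold = features(tuple(ys), xs)
        ll += sum(wi * fi for wi, fi in zip(w, gold)) - log_z
        for k in range(len(w)):
            expected = sum(math.exp(s - log_z) * f[k] for s, f in zip(scores, feats))
            grad[k] += gold[k] - expected  # observed minus expected feature count
    return ll, grad


def train(data, steps=100, lr=0.1):
    w, history = [0.0, 0.0], []
    for _ in range(steps):
        ll, grad = log_likelihood_and_grad(w, data)
        history.append(ll)
        w = [wi + lr * gi for wi, gi in zip(w, grad)]  # plain gradient ascent step
    return w, history


# Hypothetical training set: (observations, gold labels)
data = [(["a", "a", "b"], ["A", "A", "B"]),
        (["b", "b", "a"], ["B", "B", "A"])]
w, history = train(data)
print(w, history[0], history[-1])
```

No regularization is used here, so on separable toy data the weights keep growing; a real trainer would add a penalty term and replace the enumeration with forward-backward.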
Conditional Random Field Extensions
Semi-CRF
• Semi-CRF
– Instead of assigning labels to each member of
sequence, labels are assigned to sub-sequences
– Advantage – “features for semi-CRF can measure
properties of segments, and transition within a
segment can be non-Markovian”
– http://www.cs.cmu.edu/~wcohen/postscript/semiCRF.pdf
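The idea of labeling sub-sequences can be sketched as a segment-level variant of Viterbi decoding. This is an illustrative sketch only, working with raw scores rather than a trained model: `seg_score`, the label names, the tokens, and the maximum segment length are hypothetical.

```python
def semi_markov_decode(xs, labels, seg_score, max_len):
    """Return the best segmentation as a list of (start, end, label) tuples."""
    n = len(xs)
    # best[j][y] = (score, start of last segment, previous label) for the best
    # segmentation of xs[:j] whose last segment carries label y
    best = [{} for _ in range(n + 1)]
    best[0] = {None: (0.0, None, None)}
    for j in range(1, n + 1):
        for y in labels:
            cands = []
            for d in range(1, min(max_len, j) + 1):  # try every allowed segment length
                i = j - d
                for y_prev, (s, _, _) in best[i].items():
                    cands.append((s + seg_score(xs[i:j], y, y_prev), i, y_prev))
            best[j][y] = max(cands, key=lambda c: c[0])
    # Backtrack over whole segments
    y = max(best[n], key=lambda lab: best[n][lab][0])
    j, segments = n, []
    while j > 0:
        _, i, y_prev = best[j][y]
        segments.append((i, j, y))
        j, y = i, y_prev
    segments.reverse()
    return segments


def seg_score(seg, y, y_prev):
    """Hypothetical segment-level features; the length bonus looks at the whole segment."""
    if y == "NAME":
        if all(tok[0].isupper() for tok in seg):
            return 1.0 * len(seg) + (1.0 if len(seg) == 2 else 0.0)
        return -10.0
    # Label "O": reward all-lowercase segments
    return 1.0 * len(seg) if all(tok.islower() for tok in seg) else -10.0


tokens = ["met", "Ada", "Lovelace", "today"]
segments = semi_markov_decode(tokens, ["O", "NAME"], seg_score, max_len=2)
print(segments)
```

The length bonus is a feature of the segment as a whole, which a per-token model cannot express directly; the price is an extra loop over segment lengths in the decoder.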
Bayesian CRF
• Qi et al, (2005)
• http://www.cs.purdue.edu/homes/alanqi/papers/Qi-Bayesian-CRF-AIstat05.pdf
• Replacement for ML method of Lafferty
• Reducing over-fitting
• “Power EP Method”
Hierarchical CRF
(HCRF)
• http://www.springerlink.com/content/r84055k2754464v5/
• http://www.cs.washington.edu/homes/fox/postscripts/places-isrr-05.pdf
• GPS motion, for surveillance, tracking, dividing
people’s workday into labels of work, travel,
sleep, etc.
Future Directions
• Less work on conditional random fields in biology
– PubMed hits
• Conditional Random Field - 21
• Conditional Random Fields - 43
– CRF variants & promoter/regulatory element shows
no hits
• CRF and ontology show no hits
• Plan
– Implement CRF in Java, apply to biology problems, try
to find ways to extend?
Useful Papers
• Link to original paper and review paper
– http://www.inference.phy.cam.ac.uk/hmw26/crf/
– Review paper:
• http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
• Another review
– http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
• Review slides
– http://www.cs.pitt.edu/~mrotaru/comp/nlp/Random%20Fields/Tutorial%20CRF%20Lafferty.pdf
• The boosting paper has a nice review
– http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf