Introduction to Conditional Random Fields
John Osborne
Sept 4, 2009

Overview
• Useful Definitions
• Background
  – HMM
  – MEMM
• Conditional Random Fields
  – Statistical and Graph Definitions
• Computation (Training and Inference)
• Extensions
  – Bayesian Conditional Random Fields
  – Hierarchical Conditional Random Fields
  – Semi-CRFs
• Future Directions

Useful Definitions
• Random Field (Wikipedia)
  – In probability theory, let S = {X_1, ..., X_n} be a set of random variables, each X_i taking values in {0, 1, ..., G − 1}, on the sample space Ω = {0, 1, ..., G − 1}^n. A probability measure π is a random field if π(ω) > 0 for all ω in Ω.
• Markov Process (a chain if the sequence is finite)
  – A stochastic process with the Markov property
• Markov Property
  – The probability that a random variable assumes a value depends on the other random variables only through its immediate neighbors
  – "Memoryless"
• Hidden Markov Model (HMM)
  – A Markov model in which the current state is unobserved
• Viterbi Algorithm
  – Dynamic programming technique for finding the most likely sequence of hidden states that explains the observed sequence in an HMM
  – Used to determine the labels
• Potential Function == Feature Function
  – In a CRF the potential function scores the compatibility of y_t, y_{t-1} and w_t(X)

Background
• Interest in CRFs arose from Richa's work with gene expression
• Current literature shows CRFs performing better on NLP tasks than other commonly used NLP approaches such as support vector machines (SVMs), neural networks, HMMs and others
  – Term coined by Lafferty in 2001
• Predecessors were the HMM and the maximum entropy Markov model (MEMM)

HMM – Definition
• A Markov model in which the current state is unobserved
  – Generative model
  – Examining all of the input X would be prohibitive, hence the Markov property: look only at the current element of the sequence
  – No multiple interacting features, no long-range dependencies

MEMMs
• McCallum et al., 2000
• Non-generative finite-state model based on a next-state classifier
• Directed graph
• P(Y|X) = ∏_t P(y_t | y_{t-1}, w_t(X)), where w_t(X) is a sliding window over the X sequence

Label Bias Problem
• Transitions leaving a given state compete only against each other, rather than against all other transitions in the model
• Implies "conservation of score mass" (Bottou, 1991)
• Observations can be ignored; Viterbi decoding can't downgrade a branch
• A CRF solves this problem by having a single exponential model for the joint probability of the ENTIRE SEQUENCE OF LABELS given the observation sequence

Big Picture Definition
• Wikipedia definition (Aug 2009)
  – A conditional random field (CRF) is a type of discriminative probabilistic model most often used for the labeling or parsing of sequential data, such as natural language text or biological sequences.
• A probabilistic model is a statistical model, in mathematical terms "a pair (Y, P) where Y is the set of possible observations and P the set of possible probability distributions on Y"
  – In statistical terms, the objective is to infer (pick) the distinct element (probability distribution) of the set P given your observation Y
• It is a discriminative model, meaning it models the conditional probability distribution P(y|x), which can predict y given x
  – It cannot go the other way around (produce x from y), since it is not a generative model (one capable of generating sample data given a model): it does not model a joint probability distribution
  – Similar in this respect to other discriminative models such as support vector machines and neural networks
• When analyzing sequential data, a conditional model specifies the probabilities of the possible label sequences given an observation sequence

CRF Graphical Definition
• Definition from Lafferty
• Undirected graphical model
• Let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G

CRF Undirected Graph
• (Figure slide: the CRF drawn as an undirected graph; image not reproduced here)

Computation of CRF
• Training
  – Conditioning
  – Calculation of the feature functions
  – P(Y|X) = (1/Z(X)) exp( Σ_t Ψ(y_t, y_{t-1}, w_t(X)) )
    • Z(X) is a normalizing factor
    • The potential function is the term inside the parentheses
    • (A small worked sketch of this formula follows this slide)
• Inference
  – Viterbi decoding
  – Approximate model averaging
  – Others?
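To make the formula above concrete, here is a minimal sketch in Java (chosen because the plan at the end of these slides is a Java implementation). It is only an illustration under assumptions, not the slides' own or Mallet's code: the potential table psi[t][yPrev][yCur] is assumed to be precomputed from the feature functions and the window w_t(X), and the class name and start-label convention are hypothetical.

```java
// Sketch of P(Y|X) = (1/Z(X)) * exp( sum_t Psi(y_t, y_{t-1}, w_t(X)) )
// for a linear-chain CRF. psi[t][yPrev][yCur] holds the precomputed
// potentials; label 0 doubles as an artificial "start" predecessor at t = 0.
public class CrfProbabilitySketch {

    // Unnormalized score of one label sequence: the sum of potentials
    // along the chain (the exponent in the formula above).
    static double sequenceScore(double[][][] psi, int[] labels) {
        double sum = 0.0;
        for (int t = 0; t < labels.length; t++) {
            int yPrev = (t == 0) ? 0 : labels[t - 1];
            sum += psi[t][yPrev][labels[t]];
        }
        return sum;
    }

    // Z(X) via the forward algorithm: sums exp(score) over ALL label
    // sequences in O(T * K^2) time rather than enumerating K^T sequences.
    // (A real implementation would work in log space to avoid overflow.)
    static double partition(double[][][] psi, int numLabels) {
        int T = psi.length;
        double[] alpha = new double[numLabels];
        for (int y = 0; y < numLabels; y++) {
            alpha[y] = Math.exp(psi[0][0][y]);
        }
        for (int t = 1; t < T; t++) {
            double[] next = new double[numLabels];
            for (int y = 0; y < numLabels; y++) {
                for (int yPrev = 0; yPrev < numLabels; yPrev++) {
                    next[y] += alpha[yPrev] * Math.exp(psi[t][yPrev][y]);
                }
            }
            alpha = next;
        }
        double z = 0.0;
        for (double a : alpha) z += a;
        return z;
    }

    // P(Y|X): one global normalization over the whole sequence -- exactly
    // the property that avoids the MEMM label bias problem.
    static double probability(double[][][] psi, int[] labels, int numLabels) {
        return Math.exp(sequenceScore(psi, labels)) / partition(psi, numLabels);
    }
}
```

Note that Z(X) is computed once over every possible label sequence; per-state normalization, as in an MEMM, is what creates the label bias problem described earlier.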
Training Approaches
• CRF training is supervised learning, so one can train using:
  – Maximum likelihood (original paper)
    • Used an iterative scaling method; very slow
  – Gradient ascent
    • Also slow when done naively
  – The Mallet implementation used the BFGS algorithm
    • http://en.wikipedia.org/wiki/BFGS
    • Broyden-Fletcher-Goldfarb-Shanno
    • An approximate 2nd-order algorithm
  – Stochastic gradient method (2006), accelerated via stochastic meta-descent
  – Gradient tree boosting (a variant of gradient boosting, Friedman 2001)
    • http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf
    • Potential functions are sums of regression trees
      – Decision trees using real values
    • Published 2008
    • Competitive with Mallet
  – Bayesian (estimate the posterior probability)

Conditional Random Field Extensions

Semi-CRF
• Instead of assigning labels to each member of the sequence, labels are assigned to sub-sequences
• Advantage: "features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian"
• http://www.cs.cmu.edu/~wcohen/postscript/semiCRF.pdf

Bayesian CRF
• Qi et al., 2005
• http://www.cs.purdue.edu/homes/alanqi/papers/Qi-Bayesian-CRF-AIstat05.pdf
• A replacement for the maximum likelihood method of Lafferty
• Reduces over-fitting
• "Power EP method"

Hierarchical CRF (HCRF)
• http://www.springerlink.com/content/r84055k2754464v5/
• http://www.cs.washington.edu/homes/fox/postscripts/places-isrr-05.pdf
• Applied to GPS motion data for surveillance, tracking, and dividing people's workdays into labels such as work, travel, sleep, etc.
• Less work has been published on this variant

Future Directions
• Less work on conditional random fields in biology
  – PubMed hits
    • "Conditional Random Field" – 21
    • "Conditional Random Fields" – 43
  – CRF variants & promoter/regulatory elements show no hits
  – CRF and ontology show no hits
• Plan
  – Implement a CRF in Java, apply it to biology problems, and try to find ways to extend it (a starter Viterbi sketch appears after the paper list below)

Useful Papers
• Link to the original paper and a review paper
  – http://www.inference.phy.cam.ac.uk/hmw26/crf/
  – Review paper: http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
• Another review
  – http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
• Review slides
  – http://www.cs.pitt.edu/~mrotaru/comp/nlp/Random%20Fields/Tutorial%20CRF%20Lafferty.pdf
• The boosting paper has a nice review
  – http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf
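Since the plan above is to implement a CRF in Java, here is a companion sketch of Viterbi decoding, the inference step from the Computation slide. It reuses the hypothetical psi[t][yPrev][yCur] potential table from the earlier probability sketch and is only a starting point under those assumptions (training is not shown).

```java
// Viterbi decoding for a linear-chain CRF: the highest-scoring label
// sequence. It works on raw potential scores; the argmax is unaffected
// by the normalizer Z(X), so no exponentials are needed.
public class CrfViterbiSketch {

    static int[] decode(double[][][] psi, int numLabels) {
        int T = psi.length;
        double[][] best = new double[T][numLabels]; // best score ending in label y at position t
        int[][] backPtr = new int[T][numLabels];    // predecessor label achieving that score

        for (int y = 0; y < numLabels; y++) {
            best[0][y] = psi[0][0][y];              // label 0 as the artificial start predecessor
        }
        for (int t = 1; t < T; t++) {
            for (int y = 0; y < numLabels; y++) {
                best[t][y] = Double.NEGATIVE_INFINITY;
                for (int yPrev = 0; yPrev < numLabels; yPrev++) {
                    double s = best[t - 1][yPrev] + psi[t][yPrev][y];
                    if (s > best[t][y]) {
                        best[t][y] = s;
                        backPtr[t][y] = yPrev;
                    }
                }
            }
        }
        // Pick the best final label, then follow the back-pointers to
        // recover the full most-likely label sequence.
        int[] labels = new int[T];
        for (int y = 1; y < numLabels; y++) {
            if (best[T - 1][y] > best[T - 1][labels[T - 1]]) labels[T - 1] = y;
        }
        for (int t = T - 1; t > 0; t--) {
            labels[t - 1] = backPtr[t][labels[t]];
        }
        return labels;
    }
}
```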