Conditional Random Fields
Dietrich Klakow

Warning
• I might have to leave during the lecture for a meeting

Overview
• Sequence Labeling
• Bayesian Networks
• Markov Random Fields
• Conditional Random Fields
• Software example

Background Reading
Hanna M. Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21, Department of Computer and Information Science, University of Pennsylvania, 2004.
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf

Sequence Labeling Tasks
Sequence: a sentence
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

POS Labels
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.

Chunking
Task: find phrase boundaries
Pierre/B-NP Vinken/I-NP ,/O 61/B-NP years/I-NP old/B-ADJP ,/O will/B-VP join/I-VP the/B-NP board/I-NP as/B-PP a/B-NP nonexecutive/I-NP director/I-NP Nov./B-NP 29/I-NP ./O

Named Entity Tagging
Pierre/B-PERSON Vinken/I-PERSON ,/O 61/B-DATE:AGE years/I-DATE:AGE old/I-DATE:AGE ,/O will/O join/O the/O board/B-ORG_DESC:OTHER as/O a/O nonexecutive/O director/B-PER_DESC Nov./B-DATE:DATE 29/I-DATE:DATE ./O

Supertagging
Each word receives a CCG lexical category (the i-th category below labels the i-th word of the sentence):
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
N/N N , N/N N (S[adj]\NP)\NP , (S[dcl]\NP)/(S[b]\NP) ((S[b]\NP)/PP)/NP NP[nb]/N N PP/NP NP[nb]/N N/N N ((S\NP)\(S\NP))/N[num] N[num] .

Hidden Markov Model
HMM: just an application of a Bayes classifier
$(\hat{\pi}_1, \hat{\pi}_2, \dots, \hat{\pi}_N) = \arg\max_{\pi_1, \pi_2, \dots, \pi_N} P(x_1, x_2, \dots, x_N, \pi_1, \pi_2, \dots, \pi_N)$
$x_1, x_2, \dots, x_N$: observation/input sequence
$\pi_1, \pi_2, \dots, \pi_N$: label sequence

Decomposition of Probabilities
$P(x_1, x_2, \dots, x_N, \pi_1, \pi_2, \dots, \pi_N) = \prod_{i=1}^{N} P(x_i \mid \pi_i)\, P(\pi_i \mid \pi_{i-1})$
$P(\pi_i \mid \pi_{i-1})$: transition probability
$P(x_i \mid \pi_i)$: emission probability

Graphical View of the HMM
[Figure: observation sequence $X_1, X_2, X_3, \dots, X_N$ on top, label sequence $\pi_1, \pi_2, \pi_3, \dots, \pi_N$ below]

Criticism
• HMMs model only limited dependencies
→ come up with more flexible models
→ come up with a graphical description

Bayesian Networks
Example of a Bayesian network (from Russell and Norvig 1995, AI: A Modern Approach)
[Figure: network over Cloudy (C), Sprinkler (S), Rain (R), WetGrass (W)]
Corresponding joint distribution:
$P(C, S, R, W) = P(W \mid S, R)\, P(S \mid C)\, P(R \mid C)\, P(C)$

Naïve Bayes
Observations $x_1, \dots, x_D$ are assumed to be independent given the class $z$:
$P(x_1, \dots, x_D \mid z) = \prod_{i=1}^{D} P(x_i \mid z)$

Markov Random Fields
• Undirected graphical model
• New terms:
  • clique in an undirected graph: a set of nodes such that every node is connected to every other node
  • maximal clique: no node can be added without destroying the clique property

Example
[Figure: cliques marked in green and blue; the maximal clique is the blue one]

Factorization
$x$: all nodes $x_1 \dots x_N$
$x_C$: nodes in clique $C$
$\mathcal{C}_M$: set of all maximal cliques
$\Psi_C(x_C)$: potential function ($\Psi_C(x_C) \ge 0$)
Joint distribution described by the graph:
$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}_M} \Psi_C(x_C)$
Normalization:
$Z = \sum_x \prod_{C \in \mathcal{C}_M} \Psi_C(x_C)$
$Z$ is sometimes called the partition function.

Example
[Figure: undirected graph over $x_1, \dots, x_5$]
What are the maximal cliques? Write down the joint probability described by this graph → white board

Energy Function
Define
$\Psi_C(x_C) = e^{-E(x_C)}$
Insert into the joint distribution:
$p(x) = \frac{1}{Z}\, e^{-\sum_{C \in \mathcal{C}_M} E(x_C)}$
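Illustration: Factorization in Code
To make the factorization and the partition function concrete, here is a minimal sketch (not part of the original lecture; the three-node chain, the clique energies, and all variable names are assumptions chosen for illustration). It enumerates every configuration of a small binary MRF, computes $Z$ by brute force, and evaluates $p(x)$ as the normalized product of clique potentials.

```python
import itertools
import math

# Minimal sketch: three binary variables x1, x2, x3 in a chain, with
# maximal cliques {x1, x2} and {x2, x3}. Each potential is written as
# Psi_C(x_C) = exp(-E(x_C)), so it is non-negative by construction.

def psi_12(x1, x2):
    # assumed energy: favour x1 == x2
    return math.exp(-(0.0 if x1 == x2 else 1.0))

def psi_23(x2, x3):
    # assumed energy: favour x2 != x3
    return math.exp(-(1.0 if x2 == x3 else 0.0))

def unnormalized(x):
    x1, x2, x3 = x
    return psi_12(x1, x2) * psi_23(x2, x3)

# Partition function: sum the product of clique potentials over all states.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))

def p(x):
    """Joint distribution p(x) = (1/Z) * prod_C Psi_C(x_C)."""
    return unnormalized(x) / Z

print("Z =", Z)
print("p(0,0,1) =", p((0, 0, 1)))
```

The brute-force enumeration is only feasible for toy graphs; it is meant to show what the product over maximal cliques and the normalization $Z$ compute, not how inference is done in practice.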
Conditional Random Fields
Definition
Markov random field where each random variable $y_i$ is conditioned on the complete input sequence $x_1, \dots, x_n$
[Figure: label sequence $y = (y_1, \dots, y_n)$ with nodes $y_1, y_2, y_3, \dots, y_{n-1}, y_n$, each connected to the full input sequence $x = (x_1, \dots, x_n)$]

Distribution
$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{N} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)$
$\lambda_j$: parameters to be trained
$f_j(y_{i-1}, y_i, x, i)$: feature function

Example Feature Functions
Modeling transitions:
$f_1(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_{i-1} = \text{IN and } y_i = \text{NNP} \\ 0 & \text{else} \end{cases}$
Modeling emissions:
$f_2(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_i = \text{NNP and } x_i = \text{September} \\ 0 & \text{else} \end{cases}$

Training
• As in maximum entropy models: generalized iterative scaling
• Convergence: the training objective (the log-likelihood) is concave in the parameters → a unique maximum
• Convergence is slow
• Improved algorithms exist

Decoding: Auxiliary Matrix
Define an additional start symbol $y_0 = \text{START}$ and stop symbol $y_{n+1} = \text{STOP}$.
Define the matrix $M_i(x)$ such that
$[M_i(x)]_{y_{i-1} y_i} = M^i_{y_{i-1} y_i}(x) = \exp\left( \sum_{j=1}^{N} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)$

Reformulate Probability
With that definition we have
$p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M^i_{y_{i-1} y_i}(x)$
with
$Z(x) = \sum_{y_1} \sum_{y_2} \sum_{y_3} \cdots \sum_{y_n} M^1_{y_0 y_1}(x)\, M^2_{y_1 y_2}(x) \cdots M^{n+1}_{y_n y_{n+1}}(x)$

Use Matrix Properties
Use the matrix product
$[M^1(x)\, M^2(x)]_{y_0 y_2} = \sum_{y_1} M^1_{y_0 y_1}(x)\, M^2_{y_1 y_2}(x)$
with
$Z(x) = [M^1(x)\, M^2(x) \cdots M^{n+1}(x)]_{y_0 = \text{START},\; y_{n+1} = \text{STOP}}$

Viterbi Decoding
• The matrix M replaces the product of transition and emission probabilities
• Decoding can be done Viterbi-style
• Effort:
  • linear in the length of the sequence
  • quadratic in the number of labels

Software
CRF++
• See http://crfpp.sourceforge.net/

Summary
• Sequence labeling problems
• CRFs are
  • flexible
  • expensive to train
  • fast to decode
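Worked Example: Matrices and Viterbi Decoding
To make the matrix formulation and the Viterbi-style decoding concrete, here is a minimal self-contained sketch (not part of the original lecture; the toy label set DT/NN, the sentence, the two feature functions, and the weights are invented for illustration, and real weights would come from training). It builds the matrices $M_i(x)$, reads $Z(x)$ off the matrix product, and decodes the best label sequence.

```python
import numpy as np

# Assumed toy setup: two labels plus START/STOP, a two-word sentence,
# two indicator feature functions, and hand-set weights lambda_j.
LABELS = ["DT", "NN"]
STATES = ["START"] + LABELS + ["STOP"]
x = ["the", "dog"]          # observation sequence x = (x_1, ..., x_n)
n = len(x)

def f1(y_prev, y_cur, x, i):
    # transition feature: DT followed by NN
    return 1.0 if (y_prev, y_cur) == ("DT", "NN") else 0.0

def f2(y_prev, y_cur, x, i):
    # emission-style feature: position i holds the word "the" tagged DT
    return 1.0 if (y_cur == "DT" and 1 <= i <= n and x[i - 1] == "the") else 0.0

features = [f1, f2]
lam = [1.0, 2.0]            # assumed weights lambda_j

def M(i):
    """Matrix M_i(x) with entries exp(sum_j lambda_j f_j(y_{i-1}, y_i, x, i))."""
    m = np.zeros((len(STATES), len(STATES)))
    for a, y_prev in enumerate(STATES):
        for b, y_cur in enumerate(STATES):
            score = sum(l * f(y_prev, y_cur, x, i) for l, f in zip(lam, features))
            m[a, b] = np.exp(score)
    return m

# Z(x) as the (START, STOP) entry of the product M_1(x) M_2(x) ... M_{n+1}(x),
# following the matrix-product formula on the "Use Matrix Properties" slide.
prod = np.eye(len(STATES))
for i in range(1, n + 2):
    prod = prod @ M(i)
Z = prod[STATES.index("START"), STATES.index("STOP")]

def viterbi():
    """Best label sequence from START to STOP, computed Viterbi-style."""
    label_ids = [STATES.index(l) for l in LABELS]
    start, stop = STATES.index("START"), STATES.index("STOP")
    paths = {start: (0.0, [])}              # state -> (log-score, labels so far)
    for i in range(1, n + 1):
        log_m = np.log(M(i))
        new_paths = {}
        for b in label_ids:
            best = max(paths, key=lambda a: paths[a][0] + log_m[a, b])
            new_paths[b] = (paths[best][0] + log_m[best, b],
                            paths[best][1] + [STATES[b]])
        paths = new_paths
    log_m = np.log(M(n + 1))
    best = max(paths, key=lambda a: paths[a][0] + log_m[a, stop])
    return paths[best][1]

print("Z(x) =", Z)
print("best label sequence:", viterbi())
```

Running this prints $Z(x)$ and the most probable label sequence (here DT NN). Note that the matrix product follows the slide's formula literally and therefore sums over all states at intermediate positions, while the Viterbi routine restricts intermediate positions to the actual labels; the effort of decoding is linear in the sequence length and quadratic in the number of labels, as stated above.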