Conditional Random Field (CRF)

Presenter: Kuang-Jui Hsu
Date: 2011/6/16 (Thu.)
Outline
 1. Introduction
 2. Graphical Models
 2-1 Definitions
 2-2 Application of Graphical Models
 2-3 Discriminative and Generative Models
 3. Linear-Chain Conditional Random Fields
 3-1 From HMMs to CRFs
 3-2 Parameter Estimation
 3-3 Inference
Introduction
 Relational data:
 Statistical dependencies exist between the entities
 Each entity often has a rich set of features that can aid classification
 Graphical models are a natural formalism for exploiting the
dependence structure among the entities
 We can use two different types of models to model such data
 Markov random field (MRF)
 Conditional random field (CRF)
Two Types of Models
 MRF
 The traditional type of model
 Modeled by the joint distribution $p(\mathbf{y}, \mathbf{x})$, where $\mathbf{y}$ is the data we want to predict and $\mathbf{x}$ is the data we can observe (the features)
 It requires modeling $p(\mathbf{x})$, but $p(\mathbf{x})$ is hard to model when using rich features, which often involve complex dependences
 Ignoring these dependences among the features can reduce performance
 CRF
 Modeled by the conditional distribution $p(\mathbf{y} \mid \mathbf{x})$
 Specifies the probability of the label data given the observation data, so there is no need to model the distribution $p(\mathbf{x})$, even with rich features
 This avoids the problem encountered when using an MRF
Introduction
 The tutorial is divided into two parts
 Present a tutorial on current training and inference techniques for CRFs
   Discuss the important special case of linear-chain CRFs
   Generalize these to arbitrary graphical structures
 Present an example of applying a general CRF to a practical relational learning problem
Outline
 2. Graphical Models
 2-1 Definitions
 2-2 Application of Graphical Models
 2-3 Discriminative and Generative Models
Definitions
 Probability distributions over sets of random variables $V = X \cup Y$
$X$: a set of input variables that we assume are observed
$Y$: a set of output variables we wish to predict
 We denote an assignment to $X$ by $\mathbf{x}$ and an assignment to a set $A \subset X$ by $\mathbf{x}_A$, and similarly for $Y$
 The indicator function:
$$\mathbf{1}_{\{x = x'\}} = \begin{cases} 1, & x = x' \\ 0, & \text{otherwise} \end{cases}$$
Definitions
 A graphical model is a family of probability distributions
that factorize according to an underlying graph.
 There are two types of graphical models
 Undirected graphical models, represented by factors
 Directed graphical models, based on the Bayesian network concept
Undirected Graphical Model
 Given a collection of subsets $A \subset V$, define an undirected graphical model of the form
$$p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{A} \Psi_A(\mathbf{x}_A, \mathbf{y}_A)$$
for any choice of factors $F = \{\Psi_A\}$, where $\Psi_A : \mathcal{V}^n \rightarrow \mathcal{R}^+$
 $Z$: a normalization factor, also called the partition function, defined as
$$Z = \sum_{\mathbf{x}, \mathbf{y}} \prod_{A} \Psi_A(\mathbf{x}_A, \mathbf{y}_A)$$
 Computing $Z$ is difficult, but much work exists on how to approximate it
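To make the factorization and the cost of the normalization concrete, here is a minimal Python sketch, assuming a made-up model with two binary variables and a single pairwise factor (the factor values are arbitrary), that computes Z by brute-force enumeration:

```python
import itertools

# A toy undirected model over two binary variables (y1, y2) with a single
# pairwise factor Psi(y1, y2); the factor values below are arbitrary.
def psi(y1, y2):
    return [[1.0, 0.5],
            [0.5, 2.0]][y1][y2]

# Partition function: sum the product of all factors over every assignment.
Z = sum(psi(y1, y2) for y1, y2 in itertools.product([0, 1], repeat=2))

# Normalized probability of one assignment.
p = psi(1, 1) / Z
print(Z, p)  # 4.0, 0.5
```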
Undirected graphical models
 Present the factorization by a factor graph
$$p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{A} \Psi_A(\mathbf{x}_A, \mathbf{y}_A)$$
 A factor graph is a bipartite graph $G = (V, F, E)$
 Factor node $\Psi_A \in F$
 Variable node $v_s \in V$
Undirected graphical models
 Assume that each $\Psi$ function has the form
$$\Psi_A(\mathbf{x}_A, \mathbf{y}_A) = \exp\left( \sum_{k} \theta_{Ak} f_{Ak}(\mathbf{x}_A, \mathbf{y}_A) \right)$$
for some real-valued parameters $\theta_A$ and some set of feature functions $\{f_{Ak}\}$
Directed graphical models
 Also known as a Bayesian network
 Based on a directed graph $G = (V, E)$
 Factorized as
$$p(\mathbf{y}, \mathbf{x}) = \prod_{v \in V} p(v \mid \pi(v))$$
where $\pi(v)$ denotes the parents of the variable $v$
Generative model
 Generative model: a directed graphical model that satisfies the condition that no $x \in X$ can be a parent of an output $y \in Y$
 A generative model is one that directly describes how the
outputs probabilistically “generate ” the inputs
Outline
 2. Graphical Models
 2-1 Definitions
 2-2 Application of Graphical Models
 2-3 Discriminative and Generative Models
Application of Graphical Models
 Devote special attention to the hidden Markov model (HMM), because of its close relation to the linear-chain CRF
Classification
 Introduce two types of classifiers
 Naïve Bayes classifier: based on a joint probability model
 Logistic regression: based on a conditional probability model
Naïve Bayes Classifier
 Predicting a single class variable y given a vector of features
$\mathbf{x} = (x_1, x_2, \ldots, x_K)$
 Assume that all the features are conditionally independent given the class
 The resulting classifier is called naïve Bayes classifier
 Based on a joint probability model of the form:
$$p(y, \mathbf{x}) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$$
Proof
$$p(y, \mathbf{x}) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$$
Proof:
$$p(y, \mathbf{x}) = p(y, x_1, x_2, \ldots, x_K)$$
$$= p(y)\, p(x_1, x_2, \ldots, x_K \mid y)$$
$$= p(y)\, p(x_1 \mid y)\, p(x_2, \ldots, x_K \mid y, x_1)$$
$$= p(y)\, p(x_1 \mid y)\, p(x_2 \mid y, x_1) \cdots p(x_K \mid y, x_1, x_2, \ldots, x_{K-1})$$
$$= p(y)\, p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_K \mid y) \qquad \text{(features conditionally independent given } y)$$
$$= p(y) \prod_{k=1}^{K} p(x_k \mid y)$$
Naïve Bayes Classifier
 Written as a factor graph
$$p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{A} \Psi_A(\mathbf{x}_A, \mathbf{y}_A), \qquad p(y, \mathbf{x}) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$$
by defining $\Psi(y) = p(y)$ and $\Psi_k(y, x_k) = p(x_k \mid y)$
 However, the independence assumption is often not practical, because the features are not always independent
Ex. Classify a ball using the following features:
color and weight vs. size and weight
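A minimal sketch of the naïve Bayes joint model p(y, x) = p(y) ∏ p(x_k | y); the class prior and per-feature probabilities below are made up, and prediction just takes argmax_y p(y, x):

```python
import numpy as np

# Toy naive Bayes with 2 classes and 3 binary features; all probabilities
# below are made up for illustration.
p_y = np.array([0.6, 0.4])                 # p(y)
p_x_given_y = np.array([[0.8, 0.1, 0.5],   # p(x_k = 1 | y = 0)
                        [0.3, 0.7, 0.5]])  # p(x_k = 1 | y = 1)

def joint(y, x):
    """p(y, x) = p(y) * prod_k p(x_k | y), with x a 0/1 feature vector."""
    px = np.where(np.array(x) == 1, p_x_given_y[y], 1.0 - p_x_given_y[y])
    return p_y[y] * np.prod(px)

x = [1, 0, 1]
scores = [joint(y, x) for y in range(2)]
print(scores, int(np.argmax(scores)))      # predicted class = argmax_y p(y, x)
```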
Logistic Regression
 The naïve Bayes model in log form:
$$\log p(y, \mathbf{x}) = \log p(y) + \sum_{k=1}^{K} \log p(x_k \mid y)$$
 Assumption: the log probability, $\log p(y \mid \mathbf{x})$, of each class is a linear function of $\mathbf{x}$, plus a normalization constant:
$$\log p(y \mid \mathbf{x}) = -\log Z(\mathbf{x}) + \lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j$$
 The conditional distribution:
$$p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j \right)$$
$Z(\mathbf{x}) = \sum_{y} \exp\left( \lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j \right)$ is a normalizing constant
$\lambda_y$: a bias weight that acts as $\log p(y)$ in naïve Bayes
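A minimal sketch of the logistic regression conditional distribution with arbitrary weights λ_y and λ_{y,j}; the max-subtraction is only for numerical stability:

```python
import numpy as np

# Toy multinomial logistic regression: 3 classes, 4 features; the weights
# lam_y (bias) and lam_yj (feature weights) are arbitrary.
rng = np.random.default_rng(0)
lam_y  = rng.normal(size=3)        # bias weight per class
lam_yj = rng.normal(size=(3, 4))   # weight per (class, feature)

def p_y_given_x(x):
    """p(y|x) = exp(lam_y + sum_j lam_{y,j} x_j) / Z(x)."""
    scores = lam_y + lam_yj @ x            # one unnormalized log-score per class
    scores -= scores.max()                 # for numerical stability
    expo = np.exp(scores)
    return expo / expo.sum()               # divide by Z(x)

x = np.array([1.0, 0.0, 2.0, 1.0])
print(p_y_given_x(x), p_y_given_x(x).sum())  # probabilities sum to 1
```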
Different Notation
 Using a set of feature functions
 Defined for the feature weights: $f_{y',j}(y, \mathbf{x}) = \mathbf{1}_{\{y'=y\}} x_j$
 Defined for the bias weights: $f_{y'}(y, \mathbf{x}) = \mathbf{1}_{\{y'=y\}}$
 Use $f_k$ to index each feature function $f_{y',j}$
 Use $\lambda_k$ to index the corresponding weight $\lambda_{y',j}$
$$p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y, \mathbf{x}) \right)$$
Different Notation
$$p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y, \mathbf{x}) \right)
\quad\Longleftrightarrow\quad
p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j \right)$$
where
$$\lambda_1 f_1(y, \mathbf{x}) = \lambda_{y,1} x_1 + \frac{\lambda_y}{K}, \quad
\lambda_2 f_2(y, \mathbf{x}) = \lambda_{y,2} x_2 + \frac{\lambda_y}{K}, \quad \ldots, \quad
\lambda_K f_K(y, \mathbf{x}) = \lambda_{y,K} x_K + \frac{\lambda_y}{K}$$
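A small sanity check, under the same arbitrary weights, that the feature-function form with indicator features f_{y',j}(y, x) = 1{y'=y} x_j and f_{y'}(y, x) = 1{y'=y} gives the same conditional distribution as the direct form:

```python
import numpy as np

# Same toy logistic regression written with indicator feature functions;
# the weights are arbitrary.
n_classes, n_feats = 3, 4
rng = np.random.default_rng(1)
lam_y  = rng.normal(size=n_classes)
lam_yj = rng.normal(size=(n_classes, n_feats))

def direct(x):
    s = lam_y + lam_yj @ x
    e = np.exp(s - s.max())
    return e / e.sum()

def via_features(x):
    # Score of class y as a single weighted sum over all feature functions.
    def score(y):
        s = 0.0
        for yp in range(n_classes):
            s += lam_y[yp] * (yp == y)                     # bias features
            for j in range(n_feats):
                s += lam_yj[yp, j] * (yp == y) * x[j]      # feature weights
        return s
    s = np.array([score(y) for y in range(n_classes)])
    e = np.exp(s - s.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 0.0, 1.0])
print(np.allclose(direct(x), via_features(x)))  # True
```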
Sequence Models
 Discussing the simplest form of dependency, in which the
output variables are arranged in a sequence
 Use the hidden Markov model (HMM)
 An HMM models a sequence of observations $\mathbf{X} = \{x_t\}_{t=1}^{T}$ by assuming that there is an underlying sequence of states $\mathbf{Y} = \{y_t\}_{t=1}^{T}$ drawn from a finite state set $S$
Sequence Models
 HMM makes two independence assumptions:
 First, each state $y_t$ is independent of all its ancestors $y_1, y_2, \ldots, y_{t-2}$ given its previous state $y_{t-1}$
 Second, each observation variable $x_t$ depends only on the current state $y_t$
 With these assumptions, specify an HMM using three probability distributions:
 The distribution $p(y_1)$ over initial states
 The transition distribution $p(y_t \mid y_{t-1})$
 The observation distribution $p(x_t \mid y_t)$
 The form of the HMM:
$$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$
where the initial state distribution is written as $p(y_1) = p(y_1 \mid y_0)$
Proof
$$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$
Proof:
$$p(\mathbf{y}, \mathbf{x}) = p(\mathbf{y})\, p(\mathbf{x} \mid \mathbf{y})$$
$$= p(y_1, y_2, \ldots, y_T)\, p(x_1, x_2, \ldots, x_T \mid \mathbf{y})$$
$$= p(y_1)\, p(y_2, \ldots, y_T \mid y_1)\; p(x_1 \mid \mathbf{y})\, p(x_2, \ldots, x_T \mid \mathbf{y}, x_1)$$
$$= p(y_1)\, p(y_2 \mid y_1) \cdots p(y_T \mid y_{T-1}, \ldots, y_1)\; p(x_1 \mid \mathbf{y})\, p(x_2 \mid \mathbf{y}, x_1) \cdots p(x_T \mid \mathbf{y}, x_{T-1}, \ldots, x_1)$$
Each state $y_t$ is independent of all its ancestors given its previous state $y_{t-1}$, and each observation $x_t$ depends only on the current state $y_t$, so
$$= p(y_1)\, p(y_2 \mid y_1) \cdots p(y_T \mid y_{T-1})\; p(x_1 \mid y_1)\, p(x_2 \mid y_2) \cdots p(x_T \mid y_T)$$
$$= \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$
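A minimal sketch of this joint distribution for a made-up HMM (2 states, 3 observation symbols); pi, A, and B below are arbitrary choices of the three distributions:

```python
import numpy as np

# Toy HMM: pi = p(y_1), A[i, j] = p(y_t = j | y_{t-1} = i),
# B[j, o] = p(x_t = o | y_t = j). All values are made up for illustration.
pi = np.array([0.7, 0.3])
A  = np.array([[0.9, 0.1],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])

def hmm_joint(y, x):
    """p(y, x) = prod_t p(y_t | y_{t-1}) p(x_t | y_t), with p(y_1) = p(y_1 | y_0)."""
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(y)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

print(hmm_joint(y=[0, 0, 1], x=[0, 1, 2]))
```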
Outline
 2. Graphical Models
 2-1 Definitions
 2-2 Application of Graphical Models
 2-3 Discriminative and Generative Models
Discriminative and Generative Models
 Naïve Bayes is generative, meaning that it is based on a
model of the joint distribution 𝑝(𝑦, 𝒙)
 Logistic regression is discriminative, meaning that it is
based on a model of the conditional distribution 𝑝(𝑦|𝒙)
 The main difference is that the conditional distribution
does not include the distribution of 𝑝(𝒙)
 To include interdependent features in a generative model, there are two choices:
 Enhancing the model to represent dependencies among the inputs
 Making simplifying independence assumptions
Discriminative and Generative Models
 If the two models are defined over the same hypothesis space, they can be converted into each other
 Interpret the logistic regression model generatively as
$$p(y, \mathbf{x}) = \frac{\exp\left( \sum_{k} \lambda_k f_k(y, \mathbf{x}) \right)}{\sum_{\tilde{y}, \tilde{\mathbf{x}}} \exp\left( \sum_{k} \lambda_k f_k(\tilde{y}, \tilde{\mathbf{x}}) \right)}$$
 Naïve Bayes and logistic regression form a generative-
discriminative pair
 The principal advantage of discriminative modeling is that
it is better suited to including rich, overlapping features.
Outline
 3. Linear-Chain Conditional Random Fields
 3-1 From HMMs to CRFs
 3-2 Parameter Estimation
 3-3 Inference
From HMMs to CRFs
 Begin by considering the conditional distribution 𝑝(𝐲|𝐱) that
follows from the joint distribution 𝑝 𝒚, 𝒙 of an HMM
 Key point: this can be done with a particular choice of feature functions
 Sequence HMM joint distribution:
$$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$
 Rewriting it more generally:
$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_{t} \sum_{i,j \in S} \lambda_{ij} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{y_{t-1} = j\}} + \sum_{t} \sum_{i \in S} \sum_{o \in O} \mu_{oi} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{x_t = o\}} \right)$$
Reviewing The HMM
 HMM makes two independence assumptions:
 First, each state $y_t$ is independent of all its ancestors $y_1, y_2, \ldots, y_{t-2}$ given its previous state $y_{t-1}$
 Second, each observation variable $x_t$ depends only on the current state $y_t$
 With these assumptions, specify an HMM using three probability distributions:
 The distribution $p(y_1)$ over initial states
 The transition distribution $p(y_t \mid y_{t-1})$
 The observation distribution $p(x_t \mid y_t)$
 The form of the HMM:
$$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$
where the initial state distribution is written as $p(y_1) = p(y_1 \mid y_0)$
From HMMs to CRFs
 HMM joint distribution: $p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
 Rewriting it more generally:
$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_{t} \sum_{i,j \in S} \lambda_{ij} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{y_{t-1} = j\}} + \sum_{t} \sum_{i \in S} \sum_{o \in O} \mu_{oi} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{x_t = o\}} \right)$$
$S$: the set of states, $O$: the set of observations
$\lambda_{ij} = \log p(y_t = i \mid y_{t-1} = j)$, $\mu_{oi} = \log p(x_t = o \mid y_t = i)$
 This form makes it easy to use feature functions
HMM by Using Feature Functions
 Each feature function has the form $f_k(y_t, y_{t-1}, x_t)$
 $f_{ij}(y, y', x) = \mathbf{1}_{\{y = i\}} \mathbf{1}_{\{y' = j\}}$ for each transition $(i, j)$
 $f_{io}(y, y', x) = \mathbf{1}_{\{y = i\}} \mathbf{1}_{\{x = o\}}$ for each state-observation pair $(i, o)$
 Write an HMM as:
$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)$$
Compare
$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)$$
$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_{t} \sum_{i,j \in S} \lambda_{ij} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{y_{t-1} = j\}} + \sum_{t} \sum_{i \in S} \sum_{o \in O} \mu_{oi} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{x_t = o\}} \right)$$
Let
$$\lambda_k = \begin{cases} \lambda_{ij}, & k = 1 \sim T|S|^2 \\ \mu_{oi}, & k = T|S|^2 + 1 \sim T|S|^2 + T|S||O| \end{cases}$$
$$f_k(y_t, y_{t-1}, x_t) = \begin{cases} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{y_{t-1} = j\}}, & k = 1 \sim T|S|^2 \\ \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{x_t = o\}}, & k = T|S|^2 + 1 \sim T|S|^2 + T|S||O| \end{cases}$$
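A small numerical check of this rewrite on the same made-up HMM: the product of probabilities equals the exponential of the weighted indicator features when λ_ij and μ_oi are the log probabilities (and Z = 1), with p(y_1) treated as a transition from a dummy start state:

```python
import numpy as np

# Same toy HMM as before; verify that the product form equals
# exp{ sum_t sum_ij lam_ij 1{y_t=i}1{y_{t-1}=j}
#      + sum_t sum_io mu_oi 1{y_t=i}1{x_t=o} }  with Z = 1.
pi = np.array([0.7, 0.3])
A  = np.array([[0.9, 0.1], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

y, x = [0, 0, 1], [0, 1, 2]

# Product form.
p_prod = pi[y[0]] * B[y[0], x[0]]
for t in range(1, len(y)):
    p_prod *= A[y[t - 1], y[t]] * B[y[t], x[t]]

# Exponential (feature-function) form: the indicator features pick out
# exactly one lam and one mu term per time step.
score = np.log(pi[y[0]]) + np.log(B[y[0], x[0]])          # t = 1, dummy y_0
for t in range(1, len(y)):
    score += np.log(A[y[t - 1], y[t]])                    # lam_{y_t, y_{t-1}}
    score += np.log(B[y[t], x[t]])                        # mu_{x_t, y_t}
p_exp = np.exp(score)

print(np.isclose(p_prod, p_exp))  # True
```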
Linear-Chain CRF
 By the definition of the conditional distribution:
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{p(\mathbf{y}, \mathbf{x})}{p(\mathbf{x})} = \frac{p(\mathbf{y}, \mathbf{x})}{\sum_{\mathbf{y}'} p(\mathbf{y}', \mathbf{x})}$$
 Use the rewritten joint distribution
$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)$$
 The factor $1/Z$ cancels, and we obtain the conditional distribution
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)}{\sum_{\mathbf{y}'} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \right)}$$
Linear-Chain CRF
 Let the argument $x_t$ of the feature function $f_k(y_t, y_{t-1}, x_t)$ be a feature vector $\mathbf{x}_t$
 This leads to the general definition of a linear-chain CRF
 Definition 1.1
 $\mathbf{Y}, \mathbf{X}$: random vectors
 $\Lambda = \{\lambda_k\} \in R^K$: a parameter vector
 $\{f_k(y, y', \mathbf{x}_t)\}_{k=1}^{K}$: a set of real-valued feature functions
 The linear-chain conditional random field is the distribution
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)$$
$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)$$
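A minimal sketch of Definition 1.1 for a toy linear-chain CRF with indicator transition and emission features and arbitrary weights; Z(x) is computed here by brute-force enumeration over all label sequences, which is only feasible for toy sizes:

```python
import itertools
import numpy as np

# A minimal linear-chain CRF with 2 labels, 2 observation symbols, and two
# kinds of feature functions: transition indicators 1{y_t=i}1{y_{t-1}=j} and
# emission indicators 1{y_t=i}1{x_t=o}. The weights are arbitrary.
n_labels, n_obs = 2, 2
rng = np.random.default_rng(0)
w_trans = rng.normal(size=(n_labels, n_labels))  # weight for (y_t, y_{t-1})
w_emit  = rng.normal(size=(n_labels, n_obs))     # weight for (y_t, x_t)

def score(y, x):
    """sum_k lam_k f_k(y_t, y_{t-1}, x_t); no transition feature at t = 1."""
    s = sum(w_emit[y[t], x[t]] for t in range(len(x)))
    s += sum(w_trans[y[t], y[t - 1]] for t in range(1, len(x)))
    return s

def p_y_given_x(y, x):
    Z = sum(np.exp(score(list(yp), x))
            for yp in itertools.product(range(n_labels), repeat=len(x)))
    return np.exp(score(y, x)) / Z

x = [0, 1, 1]
probs = {yp: p_y_given_x(list(yp), x)
         for yp in itertools.product(range(n_labels), repeat=len(x))}
print(sum(probs.values()))  # ~1.0: a distribution over all label sequences
```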
Linear-Chain CRF
 HMM-like CRF:
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)}{\sum_{\mathbf{y}'} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \right)}$$
 Linear-chain CRF:
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)$$
Linear-Chain CRF
 Allow the score of the transition $(i, j)$ to depend on the current observation vector, by adding a feature $\mathbf{1}_{\{y_t = j\}} \mathbf{1}_{\{y_{t-1} = i\}} \mathbf{1}_{\{x_t = o\}}$
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)$$
 However, the normalization constant sums over all possible state sequences, an exponentially large number of terms
 It can be computed efficiently by forward-backward, as explained later
Outline
 3. Linear-Chain Conditional Random Fields
 3-1 From HMMs to CRFs
 3-2 Parameter Estimation
 3-3 Inference
Parameter Estimation
 Discuss how to estimate the parameters $\Lambda = \{\lambda_k\}$
 Given i.i.d. training data $\mathcal{D} = \{\mathbf{x}^{(i)}, \mathbf{y}^{(i)}\}_{i=1}^{N}$, where $\mathbf{x}^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$ is a sequence of inputs and $\mathbf{y}^{(i)} = \{y_1^{(i)}, y_2^{(i)}, \ldots, y_T^{(i)}\}$ is the sequence of desired predictions
 Estimation is performed by penalized maximum likelihood
 Because we model the conditional distribution, the conditional log likelihood is appropriate:
$$\ell(\theta) = \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$$
Parameter Estimation
 After substituting the CRF model into the likelihood:
$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}) - \sum_{i=1}^{N} \log Z(\mathbf{x}^{(i)})$$
 As a measure to avoid overfitting, use regularization, a penalty on weight vectors whose norm is too large
 A common choice of penalty is based on the Euclidean norm of $\theta$ and on a regularization parameter $1/2\sigma^2$
 Regularized log likelihood:
$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}) - \sum_{i=1}^{N} \log Z(\mathbf{x}^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}$$
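A minimal sketch of the regularized conditional log likelihood on the same kind of toy CRF; the two training pairs, the weights, and σ² are all made up, and Z(x) is again brute-forced:

```python
import itertools
import numpy as np

# Regularized conditional log likelihood of a tiny linear-chain CRF.
n_labels, n_obs, sigma2 = 2, 2, 10.0
rng = np.random.default_rng(0)
w_trans = rng.normal(size=(n_labels, n_labels))
w_emit  = rng.normal(size=(n_labels, n_obs))

def score(y, x):
    s = sum(w_emit[y[t], x[t]] for t in range(len(x)))
    return s + sum(w_trans[y[t], y[t - 1]] for t in range(1, len(x)))

def log_Z(x):
    return np.log(sum(np.exp(score(yp, x))
                      for yp in itertools.product(range(n_labels), repeat=len(x))))

data = [([0, 1, 1], [0, 0, 1]),   # (x^(i), y^(i)) pairs, made up
        ([1, 1, 0], [1, 1, 0])]

ll = sum(score(y, x) - log_Z(x) for x, y in data)
ll -= (np.sum(w_trans ** 2) + np.sum(w_emit ** 2)) / (2.0 * sigma2)  # penalty
print(ll)
```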
Parameter Estimation
 The function $\ell(\theta)$ cannot be maximized in closed form. The partial derivative is:
$$\frac{\partial \ell}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y, y', \mathbf{x}_t^{(i)})\, p(y, y' \mid \mathbf{x}^{(i)}) - \frac{\lambda_k}{\sigma^2}$$
 First term: the expected value of $f_k$ under the empirical distribution $\tilde{p}(\mathbf{y}, \mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{\{\mathbf{y} = \mathbf{y}^{(i)}\}} \mathbf{1}_{\{\mathbf{x} = \mathbf{x}^{(i)}\}}$
 Second term: arises from the derivative of $\log Z(\mathbf{x})$; it is the expectation of $f_k$ under the model distribution $p(\mathbf{y} \mid \mathbf{x}; \theta)\, \tilde{p}(\mathbf{x})$
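A minimal sketch of this gradient for the emission weights of the toy CRF: empirical feature counts minus expected counts under p(y|x), minus λ_k/σ²; the posterior is enumerated by brute force here rather than by forward-backward:

```python
import itertools
import numpy as np

# Gradient of the regularized log likelihood w.r.t. the emission weights.
n_labels, n_obs, sigma2 = 2, 2, 10.0
rng = np.random.default_rng(0)
w_trans = rng.normal(size=(n_labels, n_labels))
w_emit  = rng.normal(size=(n_labels, n_obs))

def score(y, x):
    s = sum(w_emit[y[t], x[t]] for t in range(len(x)))
    return s + sum(w_trans[y[t], y[t - 1]] for t in range(1, len(x)))

def posterior(x):
    seqs = list(itertools.product(range(n_labels), repeat=len(x)))
    w = np.array([np.exp(score(y, x)) for y in seqs])
    return seqs, w / w.sum()          # p(y|x) for every label sequence

data = [([0, 1, 1], [0, 0, 1]), ([1, 1, 0], [1, 1, 0])]
grad = np.zeros_like(w_emit)
for x, y in data:
    for t in range(len(x)):
        grad[y[t], x[t]] += 1.0                        # empirical counts
    seqs, p = posterior(x)
    for yp, py in zip(seqs, p):
        for t in range(len(x)):
            grad[yp[t], x[t]] -= py                    # expected counts
grad -= w_emit / sigma2                                # regularization term
print(grad)
```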
Optimize ℓ(θ)
 The function $\ell(\theta)$ is concave, which follows from the convexity of functions of the form $g(\mathbf{x}) = \log \sum_i \exp x_i$
 Every local optimum is also a global optimum in concave
functions
 Adding regularization ensures ℓ is strictly concave, which
implies that it has exactly one global optimum
 The simplest approach to optimize ℓ is steepest ascent along the
gradient.
 Newton’s method converges much faster because it takes into account the curvature of the likelihood, but it requires computing the Hessian
Optimize ℓ(θ)
 Quasi-Newton methods: BFGS [Bertsekas,1999]
 Limited-memory version of BFGS, due to Byrd et al. [1994]
 When such second-order methods are used, gradient-based optimization is much faster than the original approaches based on iterative scaling in Lafferty et al. [2001], as shown experimentally by several authors [Sha and Pereira, 2003, Wallach, 2002, Malouf, 2002, Minka, 2003]
 Computational cost:
 $p(y_t, y_{t-1} \mid \mathbf{x})$: $O(TM^2)$, where $M$ is the number of states of each $y_t$
 Total computational cost: $O(TM^2NG)$, where $N$ is the number of training examples and $G$ is the number of gradient computations
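A minimal training sketch using SciPy's L-BFGS-B on the negative regularized log likelihood of the toy CRF; Z(x) is brute-forced and the gradient is left to SciPy's finite-difference approximation purely to keep the sketch short, whereas a real trainer would use forward-backward and the analytic gradient above:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

# Fit the toy CRF by (limited-memory) BFGS on the negative regularized
# log likelihood; everything here is brute-forced for toy sizes only.
n_labels, n_obs, sigma2 = 2, 2, 10.0
data = [([0, 1, 1], [0, 0, 1]), ([1, 1, 0], [1, 1, 0])]

def unpack(theta):
    w_trans = theta[:n_labels * n_labels].reshape(n_labels, n_labels)
    w_emit  = theta[n_labels * n_labels:].reshape(n_labels, n_obs)
    return w_trans, w_emit

def neg_ll(theta):
    w_trans, w_emit = unpack(theta)
    def score(y, x):
        s = sum(w_emit[y[t], x[t]] for t in range(len(x)))
        return s + sum(w_trans[y[t], y[t - 1]] for t in range(1, len(x)))
    ll = 0.0
    for x, y in data:
        log_Z = np.log(sum(np.exp(score(yp, x))
                           for yp in itertools.product(range(n_labels),
                                                        repeat=len(x))))
        ll += score(y, x) - log_Z
    ll -= np.sum(theta ** 2) / (2.0 * sigma2)        # regularization penalty
    return -ll

theta0 = np.zeros(n_labels * n_labels + n_labels * n_obs)
result = minimize(neg_ll, theta0, method='L-BFGS-B')
print(result.fun, result.success)
```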
Outline
 3. Linear-Chain Conditional Random Fields
 3-1 From HMMs to CRFs
 3-2 Parameter Estimation
 3-3 Inference
Inference
 Two common inference problems for CRFs:
 During training, computing the gradient requires the marginal distribution for each edge $p(y_t, y_{t-1} \mid \mathbf{x})$, and computing the likelihood requires $Z(\mathbf{x})$
 To label an unseen instance, we compute the most likely labeling $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$
 In linear-chain CRFs, inference tasks can be performed
efficiently and exactly by dynamic-programming algorithms for
HMMs
 Here, review the HMM algorithms, and extend them to linear-chain CRFs
Introduce Notations
 HMM: $p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$
 Viewed as a factor graph: $p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{t} \Psi_t(y_t, y_{t-1}, x_t)$
 Define the factors and normalization constant as:
$$Z = 1, \qquad \Psi_t(y_t, y_{t-1}, x_t) \overset{\text{def}}{=} p(y_t = j \mid y_{t-1} = i)\, p(x_t = x \mid y_t = j)$$
 If viewed as a weighted finite-state machine, $\Psi_t(j, i, x)$ is the weight on the transition from state $i$ to state $j$ when the current observation is $x$
HMM Forward Algorithm
 Used to compute the probability $p(\mathbf{x})$ of the observations
 First, using the distributive law:
$$p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y}) = \sum_{\mathbf{y}} \prod_{t} \Psi_t(y_t, y_{t-1}, x_t) = \sum_{y_1, y_2, \ldots, y_T} \prod_{t} \Psi_t(y_t, y_{t-1}, x_t)$$
$$= \sum_{y_T} \sum_{y_{T-1}} \Psi_T(y_T, y_{T-1}, x_T) \sum_{y_{T-2}} \Psi_{T-1}(y_{T-1}, y_{T-2}, x_{T-1}) \cdots$$
HMM Forward Algorithm
$$p(\mathbf{x}) = \sum_{y_T} \sum_{y_{T-1}} \Psi_T(y_T, y_{T-1}, x_T) \sum_{y_{T-2}} \Psi_{T-1}(y_{T-1}, y_{T-2}, x_{T-1}) \cdots$$
 Each of the intermediate sums is reused many times, and we can save an exponential amount of work by caching the inner sums
 Forward variables $\alpha_t$:
 Each is a vector of size $M$ (the number of states)
 They store the intermediate sums
HMM Forward Algorithm
$$p(\mathbf{x}) = \sum_{y_T} \sum_{y_{T-1}} \Psi_T(y_T, y_{T-1}, x_T) \sum_{y_{T-2}} \Psi_{T-1}(y_{T-1}, y_{T-2}, x_{T-1}) \cdots$$
 Defined as:
$$\alpha_t(j) \overset{\text{def}}{=} p(\mathbf{x}_{\langle 1 \ldots t \rangle}, y_t = j) = \sum_{\mathbf{y}_{\langle 1 \ldots t-1 \rangle}} \Psi_t(j, y_{t-1}, x_t) \prod_{t'=1}^{t-1} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'})$$
 Computed by the recursion:
$$\alpha_t(j) = \sum_{i \in S} \Psi_t(j, i, x_t)\, \alpha_{t-1}(i)$$
 Initialization: $\alpha_1(j) = \Psi_1(j, y_0, x_1)$; finally, $p(\mathbf{x}) = \sum_{y_T} \alpha_T(y_T)$
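A minimal sketch of the forward recursion for the made-up HMM used earlier; `alpha[t - 1] @ A` implements the sum over the previous state i:

```python
import numpy as np

# Forward recursion: alpha_t(j) = sum_i Psi_t(j, i, x_t) * alpha_{t-1}(i),
# where Psi_t(j, i, x) = p(y_t = j | y_{t-1} = i) p(x_t = x | y_t = j).
pi = np.array([0.7, 0.3])
A  = np.array([[0.9, 0.1], [0.4, 0.6]])     # A[i, j] = p(y_t = j | y_{t-1} = i)
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def forward(x):
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]              # initialization: p(y_1 = j) p(x_1 | y_1 = j)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha

x = [0, 1, 2]
alpha = forward(x)
print(alpha[-1].sum())                      # p(x) = sum_j alpha_T(j)
```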
HMM Forward Algorithm
$$p(\mathbf{x}) = \sum_{y_T} \sum_{y_{T-1}} \Psi_T(y_T, y_{T-1}, x_T) \sum_{y_{T-2}} \Psi_{T-1}(y_{T-1}, y_{T-2}, x_{T-1}) \cdots$$
 Backward variables:
$$\beta_t(i) \overset{\text{def}}{=} p(\mathbf{x}_{\langle t+1 \ldots T \rangle} \mid y_t = i) = \sum_{\mathbf{y}_{\langle t+1 \ldots T \rangle}} \prod_{t'=t+1}^{T} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'})$$
 Recursion:
$$\beta_t(i) = \sum_{j \in S} \Psi_{t+1}(j, i, x_{t+1})\, \beta_{t+1}(j)$$
 Initialization: $\beta_T(i) = 1$
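A minimal sketch of the backward recursion for the same made-up HMM; combining β_1 with the initial factors recovers the same p(x) as the forward pass:

```python
import numpy as np

# Backward recursion: beta_t(i) = sum_j Psi_{t+1}(j, i, x_{t+1}) * beta_{t+1}(j),
# with beta_T(i) = 1.
pi = np.array([0.7, 0.3])
A  = np.array([[0.9, 0.1], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def backward(x):
    T, M = len(x), len(pi)
    beta = np.ones((T, M))                       # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

x = [0, 1, 2]
beta = backward(x)
# Same p(x) as the forward pass: sum_i p(y_1 = i) p(x_1 | y_1 = i) beta_1(i).
print((pi * B[:, x[0]] * beta[0]).sum())
```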
HMM Forward Algorithm
 Applying the distributive law again:
$$p(y_{t-1}, y_t \mid \mathbf{x}) \propto \Psi_t(y_t, y_{t-1}, x_t) \left( \sum_{\mathbf{y}_{\langle 1 \ldots t-2 \rangle}} \prod_{t'=1}^{t-1} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \right) \left( \sum_{\mathbf{y}_{\langle t+1 \ldots T \rangle}} \prod_{t'=t+1}^{T} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \right)$$
 In terms of the forward and backward variables:
$$p(y_{t-1}, y_t \mid \mathbf{x}) \propto \alpha_{t-1}(y_{t-1})\, \Psi_t(y_t, y_{t-1}, x_t)\, \beta_t(y_t)$$
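A minimal sketch of the edge marginal p(y_{t-1}, y_t | x), combining the forward and backward sketches above and normalizing the α Ψ β product:

```python
import numpy as np

# Edge marginal: p(y_{t-1}, y_t | x) ∝ alpha_{t-1}(y_{t-1}) Psi_t(y_t, y_{t-1}, x_t) beta_t(y_t).
pi = np.array([0.7, 0.3])
A  = np.array([[0.9, 0.1], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def forward(x):
    alpha = [pi * B[:, x[0]]]
    for t in range(1, len(x)):
        alpha.append((alpha[-1] @ A) * B[:, x[t]])
    return np.array(alpha)

def backward(x):
    beta = [np.ones(len(pi))]
    for t in range(len(x) - 2, -1, -1):
        beta.insert(0, A @ (B[:, x[t + 1]] * beta[0]))
    return np.array(beta)

x, t = [0, 1, 2], 2                          # edge (y_{t-1}, y_t), t is 1-indexed
alpha, beta = forward(x), backward(x)
unnorm = alpha[t - 2][:, None] * A * B[:, x[t - 1]][None, :] * beta[t - 1][None, :]
print(unnorm / unnorm.sum())                 # rows: y_{t-1}, columns: y_t
```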
HMM Forward Algorithm
 Finally, compute the globally most probable assignment $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$
 This yields the Viterbi recursion if all the summations are replaced by maximizations:
$$\delta_t(j) = \max_{i \in S} \Psi_t(j, i, x_t)\, \delta_{t-1}(i)$$
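A minimal sketch of the Viterbi recursion for the same made-up HMM, with back-pointers to recover the most probable state sequence:

```python
import numpy as np

# Viterbi: delta_t(j) = max_i Psi_t(j, i, x_t) delta_{t-1}(i), plus back-pointers.
pi = np.array([0.7, 0.3])
A  = np.array([[0.9, 0.1], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def viterbi(x):
    T, M = len(x), len(pi)
    delta = np.zeros((T, M))
    back  = np.zeros((T, M), dtype=int)
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        cand = delta[t - 1][:, None] * A * B[None, :, x[t]]   # cand[i, j]
        back[t]  = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                             # follow back-pointers
        y.insert(0, int(back[t][y[0]]))
    return y

print(viterbi([0, 1, 2]))   # most probable state sequence y*
```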
Forward-Backward Algorithm For CRF
 The forward recursion, the backward recursion, and the Viterbi recursion for a CRF are the same as for an HMM, using the factors $\Psi_t(y_t, y_{t-1}, \mathbf{x}_t) = \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)$
 Use the forward recursion to compute $Z(\mathbf{x})$:
$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right) = \sum_{\mathbf{y}} \prod_{k=1}^{K} \exp\left( \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)$$
just as $p(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{t} \Psi_t(y_t, y_{t-1}, x_t)$ is computed for an HMM
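A minimal sketch of Z(x) for the toy CRF computed with the same forward recursion, where the HMM probabilities are replaced by the unnormalized factors exp(Σ_k λ_k f_k); the result is checked against brute-force enumeration:

```python
import itertools
import numpy as np

# Z(x) via the forward recursion with unnormalized factors
# Psi_t(j, i, x_t) = exp(w_trans[j, i] + w_emit[j, x_t]); weights arbitrary.
n_labels, n_obs = 2, 2
rng = np.random.default_rng(0)
w_trans = rng.normal(size=(n_labels, n_labels))   # w_trans[y_t, y_{t-1}]
w_emit  = rng.normal(size=(n_labels, n_obs))      # w_emit[y_t, x_t]

def Z_forward(x):
    alpha = np.exp(w_emit[:, x[0]])               # no transition feature at t = 1
    for t in range(1, len(x)):
        psi = np.exp(w_trans + w_emit[:, x[t]][:, None])   # psi[j, i]
        alpha = psi @ alpha                        # alpha_t(j) = sum_i psi[j, i] alpha_{t-1}(i)
    return alpha.sum()

def Z_brute(x):
    def score(y):
        s = sum(w_emit[y[t], x[t]] for t in range(len(x)))
        return s + sum(w_trans[y[t], y[t - 1]] for t in range(1, len(x)))
    return sum(np.exp(score(y))
               for y in itertools.product(range(n_labels), repeat=len(x)))

x = [0, 1, 1]
print(np.isclose(Z_forward(x), Z_brute(x)))       # True
```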