Gaussian Mixture models and EM

Introduction to Pattern Recognition for Human ICT
Gaussian Mixture models and EM
2014. 11. 21
Hyunki Hong
Contents
• Supervised vs. unsupervised learning
• Mixture models
• Expectation-Maximization (EM)
Supervised vs. unsupervised learning
• The pattern recognition methods covered in class up to this point have focused on the issue of classification.
– A pattern consisted of a pair of variables {x, ω}, where
1. x was a collection of observations or features (feature vector).
2. ω was the concept behind the observation (label).
– Such pattern recognition problems are called supervised (training with a teacher), since the system is given BOTH the feature vector and the correct answer.
• Next, we investigate a number of methods that operate on unlabeled data.
– Given a collection of feature vectors X = {x^(1), x^(2), …, x^(N)} without class labels ω_i, these methods attempt to build a model that captures the structure of the data.
– These methods are called unsupervised (training without a teacher), since they are not provided with the correct answer.
• Although unsupervised learning methods may appear to have limited capabilities, there are several reasons that make them extremely useful.
– Labeling large data sets can be a costly procedure (e.g., ASR: Airport Surveillance Radar).
– Class labels may not be known beforehand (e.g., data mining: the computational process of discovering patterns in large data sets with methods from AI, database systems, etc., whose overall goal is to extract information from a data set and transform it into an understandable structure for further use).
– Large datasets can be compressed into a small set of prototypes (kNN).
• The supervised and unsupervised paradigms comprise the
vast majority of pattern recognition problems.
– A third approach, known as reinforcement learning, uses a
reward signal (real-valued or binary) to tell the learning system
how well it is performing.
– In reinforcement learning, the goal of the learning system (or
agent) is to learn a mapping from states onto actions (an action
policy) that maximizes the total reward.
Approaches to unsupervised learning
• Parametric (mixture models)
– These methods model the underlying class-conditional
densities with a mixture of parametric densities, and the
objective is to find the model parameters.
– These methods are closely related to parameter
estimation.
• Non-parametric (clustering)
– No assumptions are made about the underlying densities,
instead we seek a partition of the data into clusters.
• There are two reasons why we cover mixture models at this point.
– The solution to the mixture problem (the EM algorithm) is also used for Hidden Markov Models.
– A particular form of the mixture model problem leads to the most widely used clustering method: the k-means algorithm.
Mixture models
• Consider the now familiar problem of modeling a pdf given a dataset X = {x^(1), x^(2), …, x^(N)}.
– If the form of the underlying pdf was known (e.g., Gaussian), the problem could be solved using Maximum Likelihood (ML).
– If the form of the pdf was unknown, the problem had to be solved with non-parametric density estimation (DE) methods such as Parzen windows.
• We will now consider an alternative DE method: modeling the pdf with a mixture of parametric densities.
– These methods are sometimes known as semi-parametric.
– In particular, we will focus on mixture models of Gaussian
densities (surprised?).
• Mixture models can be posed in terms of the ML criterion.
– Given a dataset X = {x^(1), x^(2), …, x^(N)}, find the parameters of the model that maximize the log-likelihood of the data:

  θ* = argmax_θ Σ_{n=1}^{N} log [ Σ_{c=1}^{C} p(x^(n) | θ_c) P(ω_c) ]

where θ_c = {μ_c, Σ_c} and P(ω_c) are the parameters and mixing coefficient of the c-th mixture component, respectively.
• We could try to find the maximum of this function by differentiation.
– For Σ_i = σ_i I, the solution becomes a set of coupled fixed-point equations (see pages 21–23; a sketch is given below).
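The slide defers the actual expressions to later pages; for reference, a standard reconstruction (not copied from the slides) of the resulting ML conditions for the isotropic case Σ_i = σ_i²I, with d the dimensionality of x, is:

\mu_i = \frac{\sum_{n=1}^{N} P(\omega_i \mid x^{(n)})\, x^{(n)}}{\sum_{n=1}^{N} P(\omega_i \mid x^{(n)})},
\qquad
\sigma_i^2 = \frac{1}{d}\,\frac{\sum_{n=1}^{N} P(\omega_i \mid x^{(n)})\,\lVert x^{(n)}-\mu_i\rVert^2}{\sum_{n=1}^{N} P(\omega_i \mid x^{(n)})}

P(\omega_i) = \frac{1}{N}\sum_{n=1}^{N} P(\omega_i \mid x^{(n)}),
\qquad
P(\omega_i \mid x^{(n)}) = \frac{p(x^{(n)} \mid \omega_i, \theta_i)\, P(\omega_i)}{\sum_{c=1}^{C} p(x^{(n)} \mid \omega_c, \theta_c)\, P(\omega_c)}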
Right-hand side
• Notice that the previous equations are not a closed-form solution.
– The model parameters μ_c, Σ_c, and P(ω_c) also appear on the RHS as a result of Bayes rule!
– Therefore, these expressions represent a highly non-linear coupled system of equations.
• However, these expressions suggest that we may be
able to use a fixed-point algorithm to find the maxima.
– Begin with some “old” value of the model parameters.
– Evaluate the RHS of the equations to obtain “new” parameter values.
– Let these “new” values become the “old” ones and repeat the process.
• Surprisingly, an algorithm of this simple form can be
found which is guaranteed to increase the log-likelihood
with every iteration!
– This example represents a particular case of a more general procedure known as the Expectation-Maximization algorithm.
Expectation-Maximization (EM) algorithm
• EM is a general method for finding the ML estimate of
the parameters of a pdf when the data has missing values.
– There are two main applications of the EM algorithm.
1. When the data indeed has incomplete, missing or
corrupted values as a result of a faulty observation
process.
2. Assuming the existence of missing or hidden parameters
can simplify the likelihood function, which would
otherwise lead to an analytically intractable optimization
problem.
• Assume a dataset containing two types of features.
– A set of features 𝑋 whose value is known. We call these the
incomplete data.
– A set of features 𝑍 whose value is unknown. We call these
the missing data.
Expectation-Maximization (EM) algorithm
• We now define a joint pdf p(X, Z|θ), called the complete-data likelihood.
– This function is a random variable, since the features Z are unknown.
– You can think of p(X, Z|θ) = h_{X,θ}(Z), for some function h_{X,θ}(·), where X and θ are constant and Z is a random variable.
• As suggested by its name, the EM algorithm operates by
performing two basic operations over and over.
– An Expectation step
– A Maximization step
• EXPECTATION
– Find the expected value of log p(X, Z|θ) with respect to the unknown data Z, given the observed data X and the current parameter estimates θ^(i−1):

  Q(θ | θ^(i−1)) = E_Z[ log p(X, Z | θ) | X, θ^(i−1) ]

where θ are the new parameters that we seek to optimize to increase Q.
– Note that X and θ^(i−1) are constants, θ is the variable that we wish to adjust, and Z is a random variable defined by p(Z | X, θ^(i−1)).
– Therefore Q(θ|θ^(i−1)) is just a function of θ.
• MAXIMIZATION
– Find the argument θ that maximizes the expected value Q(θ|θ^(i−1)).
• Convergence properties
– It can be shown that
1. each iteration (E+M) is guaranteed to increase the log-likelihood, and
2. the EM algorithm is guaranteed to converge to a local maximum of the log-likelihood.
• The two steps of the EM algorithm are illustrated below
– During the E step, the unknown features Z are integrated out assuming the current values of the parameters θ^(i−1).
– During the M step, the values of the parameters that
maximize the expected value of the log likelihood are
obtained.
  θ^new = argmax_θ Q(θ | θ^old)

  Q(θ | θ^old) = Σ_Z p(Z | X, θ^old) log[ p(X, Z | θ) ]
– IN A NUTSHELL: since Z are unknown, the best we can do
is maximize the average log-likelihood across all possible
values of Z.
EM algorithm and mixture models
• Having formalized the EM algorithm, we are now ready
to find the solution to the mixture model problem.
– To keep things simple, we will assume a univariate mixture model where all the components have the same known standard deviation σ.
• Problem formulation
– As usual, we are given a dataset X = {x^(1), x^(2), …, x^(N)}, and we seek to estimate the model parameters θ = {μ_1, μ_2, …, μ_C}.
– The following process is assumed to generate each random variable x^(n).
1. First, a Gaussian component is selected according to the mixture coefficients P(ω_c).
2. Then, x^(n) is generated according to the likelihood p(x|μ_c) of that particular component.
– We will also use hidden variables Z = {z_1^(n), z_2^(n), …, z_C^(n)} to indicate which of the C Gaussian components generated data point x^(n).
• Solution
– The probability p(x, z|θ) for a specific example is

  p(x^(n), z^(n) | θ) = Π_{c=1}^{C} [ P(ω_c) p(x^(n) | μ_c) ]^{z_c^(n)}

where only one of the z_c^(n) can have a value of 1, and all others are zero.
– The log-likelihood of the entire dataset is then

  log p(X, Z | θ) = Σ_{n=1}^{N} Σ_{c=1}^{C} z_c^(n) log[ P(ω_c) p(x^(n) | μ_c) ]

– To obtain Q(θ|θ^(i−1)) we must then take the expectation over Z,

  Q(θ | θ^(i−1)) = E_Z[ log p(X, Z | θ) ] = Σ_{n=1}^{N} Σ_{c=1}^{C} E[z_c^(n)] log[ P(ω_c) p(x^(n) | μ_c) ]

where we have used the fact that E[f(z)] = f(E[z]) for f(z) linear.
– E[z_c^(n)] is simply the probability that x^(n) was generated by the c-th Gaussian component given the current model parameters θ^(i−1):

  E[z_c^(n)] = P(ω_c) exp(−(x^(n) − μ_c^(i−1))² / 2σ²) / Σ_{k=1}^{C} P(ω_k) exp(−(x^(n) − μ_k^(i−1))² / 2σ²)    (1)

– These two expressions define the Q function.
– The second step (Maximization) consists of finding the values {μ_1, μ_2, …, μ_C} that maximize the Q function, which, computing the zeros of the partial derivative, yields:

  μ_c = Σ_{n=1}^{N} E[z_c^(n)] x^(n) / Σ_{n=1}^{N} E[z_c^(n)]    (2)

– Equations (1) and (2) define a fixed-point algorithm that can be used to converge to a (local) maximum of the log-likelihood function (a MATLAB sketch follows below).
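To make the iteration concrete, here is a minimal MATLAB sketch of equations (1) and (2) for the univariate, known-σ case. It is an illustration only: the synthetic data, the uniform priors P(ω_c), and all variable names (x, Pw, mu, Ez, …) are assumptions of this sketch, not part of the original slides.

% Synthetic 1-D data from two components with a shared, known sigma (illustration only).
x = [randn(1,100)*0.5 + 0.0, randn(1,100)*0.5 + 3.0];
N = numel(x);                      % number of data points
C = 2;                             % number of components
sigma = 0.5;                       % known, shared standard deviation
Pw = ones(1,C)/C;                  % mixing coefficients P(w_c), assumed known and equal here
mu = [min(x) max(x)];              % initial "old" means
for iter = 1:100
    % E-step, Eq. (1): expected value of each indicator z_c^(n)
    lik = zeros(C,N);
    for c = 1:C
        lik(c,:) = Pw(c) * exp(-(x - mu(c)).^2 / (2*sigma^2));  % constant 1/sqrt(2*pi*sigma^2) cancels below
    end
    Ez = lik ./ repmat(sum(lik,1), C, 1);                        % responsibilities
    % M-step, Eq. (2): re-estimate each mean
    for c = 1:C
        mu(c) = sum(Ez(c,:).*x) / sum(Ez(c,:));
    end
end
disp(mu)   % should approach the true component means (0 and 3)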
κ°•μ˜ λ‚΄μš©
κ°€μš°μ‹œμ•ˆ ν˜Όν•© λͺ¨λΈ
ν•„μš”μ„±, μ •μ˜
ν•™μŠ΅ οƒ  μ΅œμš°μΆ”μ •λ²•κ³Ό 문제점
EM μ•Œκ³ λ¦¬μ¦˜
μ€λ‹‰λ³€μˆ˜, EM μ•Œκ³ λ¦¬μ¦˜μ˜ μˆ˜ν–‰ κ³Όμ •
κ°€μš°μ‹œμ•ˆ ν˜Όν•© λͺ¨λΈμ„ μœ„ν•œ EM μ•Œκ³ λ¦¬μ¦˜
μΌλ°˜ν™”λœ EM μ•Œκ³ λ¦¬μ¦˜
λ§€νŠΈλž©μ„ μ΄μš©ν•œ μ‹€ν—˜
1 κ°€μš°μ‹œμ•ˆ ν˜Όν•©λͺ¨λΈ(Gaussian Mixtures)
1-1. κ°€μš°μ‹œμ•ˆ ν˜Όν•© λͺ¨λΈμ˜ ν•„μš”μ„±
ν™•λ₯ λͺ¨λΈ: 데이터 뢄포 νŠΉμ„±μ„ μ•ŒκΈ° μœ„ν•΄ μ μ ˆν•œ
ν™•λ₯ λ°€λ„ν•¨μˆ˜ κ°€μ •ν•˜κ³  이에 λŒ€ν•œ λͺ¨λΈ 생성
Estimated pdf
True pdf
κ°€μš°μ‹œμ•ˆ 뢄포
οƒ  μœ λ‹ˆλͺ¨λ‹¬ ν˜•νƒœ : 데이터 평균 μ€‘μ‹¬μœΌλ‘œ
ν•˜λ‚˜μ˜ 그룹으둜 λ­‰μΉ¨
μœ„ 경우, 3개의
κ°€μš°μ‹œμ•ˆ λΆ„ν¬μ˜ 가쀑합
λ³΅μž‘ν•œ ν˜•νƒœμ˜ ν•¨μˆ˜μ˜ ν‘œν˜„μ΄ κ°€λŠ₯
6개의 κ°€μš°μ‹œμ•ˆ ν˜Όν•©μœΌλ‘œ λ‚˜νƒ€λ‚Ό 수 μžˆλŠ” 데이터 집합
1-2. Definition of a GMM
• The overall probability density function is defined as a mixture of M components (density functions):

  p(x) = Σ_{i=1}^{M} α_i p(x | C_i, θ_i),   with Σ_{i=1}^{M} α_i = 1

– C_i is the random variable indicating that a sample came from the i-th component.
– p(x | C_i, θ_i) is a simple pdf that forms the basic component of the mixture model.
– θ_i are the parameters defining the i-th component pdf; for a Gaussian pdf, θ_i = {μ_i, Σ_i}.
– α_i = P(C_i) is the mixing coefficient: the relative weight of the i-th component within the whole model.
• The full parameter set of a GMM with M components is therefore θ = {α_1, …, α_M, θ_1, …, θ_M}; in the 1-D case used in the following slides, θ_i = {μ_i, σ_i²} (a small numerical illustration follows below).
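As a small illustration of this definition (not part of the original slides; all numbers below are invented), the following MATLAB fragment evaluates a 3-component 1-D mixture density p(x) = Σ_i α_i N(x; μ_i, σ_i²):

% Hypothetical 3-component 1-D GMM (all numbers invented for illustration).
alpha = [0.3 0.5 0.2];             % mixing coefficients, sum to 1
mu    = [-2  1   4 ];              % component means
sigma = [0.7 1.0 0.5];             % component standard deviations
x = linspace(-6, 8, 500);
p = zeros(size(x));
for i = 1:numel(alpha)
    % add alpha_i * N(x; mu_i, sigma_i^2)
    p = p + alpha(i) * exp(-(x - mu(i)).^2 ./ (2*sigma(i)^2)) ./ (sqrt(2*pi)*sigma(i));
end
plot(x, p);                        % a multimodal density that a single Gaussian cannot represent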
1-3. Learning a GMM
• Learning in a GMM means estimating its parameters from data, i.e., a form of parametric density estimation.
• Given X = {x1, x2, …, xN}, the probability density p(xi) of the i-th data point xi is expressed with the Gaussian mixture model

  p(xi | θ) = Σ_{j=1}^{M} α_j p(xi | μ_j, σ_j²)

where p(xi | μ_j, σ_j²) is the j-th Gaussian density (1-D case) and the sum combines the M components into the mixture density.
• Learning: find the parameters that maximize the log-likelihood of the data. The log-likelihood function over the whole training set is

  L(θ) = Σ_{i=1}^{N} ln p(xi | θ)

and the maximum likelihood estimate is θ_ML = argmax_θ L(θ).
1-3. Learning a GMM – estimating the means and covariances
• Setting the derivative of the log-likelihood with respect to μ_j to zero uses Bayes' rule,

  p(C_j | xi, θ) = p(xi | C_j, θ) p(C_j | θ) / p(xi | θ) = α_j p(xi | μ_j, σ_j²) / p(xi | θ),

together with the chain rule d(e^u)/dx = e^u (du/dx).
• p(C_j | xi, θ) is the probability that data point xi was generated by the j-th component, given xi.
• The estimate of σ_j² is derived in the same way as that of μ_j (see the sketch below).
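A standard reconstruction of the resulting 1-D estimates that this slide refers to:

\hat{\mu}_j = \frac{\sum_{i=1}^{N} p(C_j \mid x_i, \theta)\, x_i}{\sum_{i=1}^{N} p(C_j \mid x_i, \theta)},
\qquad
\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{N} p(C_j \mid x_i, \theta)\,(x_i - \hat{\mu}_j)^2}{\sum_{i=1}^{N} p(C_j \mid x_i, \theta)}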
1-4. Learning a GMM – estimating the mixing coefficients
• Estimating α_j is a constrained maximization problem, solved with a Lagrange multiplier λ, since the coefficients must satisfy the condition Σ_{j=1}^{M} α_j = 1.
• Substituting Bayes' rule,

  p(C_j | xi, θ) = p(xi | C_j, θ) p(C_j | θ) / p(xi | θ) = α_j p(xi | μ_j, σ_j²) / p(xi | θ),

into the resulting equation and using the constraint gives λ = N (see the sketch below).
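A standard reconstruction of the derivation this slide summarizes:

\frac{\partial}{\partial \alpha_j}\!\left[\sum_{i=1}^{N}\ln p(x_i \mid \theta) + \lambda\!\left(\sum_{k=1}^{M}\alpha_k - 1\right)\right]=0
\;\;\Rightarrow\;\;
\hat{\alpha}_j = \frac{1}{N}\sum_{i=1}^{N} p(C_j \mid x_i, \theta)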
1-5. Problems with the maximum likelihood estimates
• The right-hand side of each estimate still contains the parameters that we are trying to estimate,

  θ = (μ_1, …, μ_M, σ_1², …, σ_M², α_1, …, α_M),

so the expressions are not a closed-form solution; they are solved with an "iterative algorithm": the EM algorithm.
• At the start, the parameters are initialized with arbitrary values; the iteration finally yields the parameters that maximize the objective function.
1-6. GMMs and hidden variables
• A hidden variable is a variable that cannot be observed from outside.
• Example: from a single group containing both men and women, one person at a time is drawn and their height is measured. Given the measured data X = {x1, x2, …, xN}, estimate the probability density function p(x) as a mixture of the women's group's distribution and the men's group's distribution:

  p(x) = α_1 p(x | μ_1, σ_1²) + α_2 p(x | μ_2, σ_2²)

→ Parameters to be estimated: μ_j, σ_j², α_j (j = 1, 2).
• New variables are introduced to represent the class label of each data point:
  x ∈ women's group → C1 = 1, C2 = 0
  x ∈ men's group → C1 = 0, C2 = 1
• Hidden variable: a variable whose value cannot be measured externally. The probability model is then written taking the hidden variables into account (a small illustration follows below).
1-7. The EM algorithm for a model with hidden variables
• If, for each data point xi, the value zi of the hidden variable were also observed, estimation would be easy (the complete-data estimates are sketched below):
1. Split the data into two groups according to the given z values and estimate each group's parameters, where C_ij is the value of the hidden variable C_j for the i-th data point and N_j is the number of data points belonging to group j.
2. Estimate the mixing coefficient α_j, which expresses the importance of each group: the fraction of the whole data set occupied by the j-th group.
• In reality, however, only xi is observed and zi = [C_i1, C_i2] cannot be known.
→ EM algorithm: alternate between estimating the hidden variables z (E-step) and estimating the parameters θ (M-step).
• In short: "a method for estimating the parameters of a probability model that has hidden random variables."
1-7. The EM algorithm for a model with hidden variables
(1) E (Expectation) step: compute the expected values of the hidden variables z and use them as estimates of z.
• This step computes the expectations of the hidden variables using the current parameters: at the very first iteration the parameters are set to arbitrary values, and afterwards the parameters obtained from the M-step are used.
• Since C1 and C2 each take the value 1 or 0, the expectation is obtained by estimating the probability that C_j = 1, using the relation from before:

  E[C_j | xi] = P(C_j | xi, θ) = p(xi | C_j, θ) p(C_j | θ) / p(xi | θ) = α_j p(xi | μ_j, σ_j²) / p(xi | θ)
1-7. The EM algorithm for a model with hidden variables
(2) M (Maximization) step: estimate the parameters using the expected values of the hidden variables z.
• The expectations found in the E-step are treated as observed values of the hidden variables, and the parameters that maximize the log-likelihood are found, i.e., the complete-data estimates of steps 1–2 are applied with z = [E[C_i1 | xi], E[C_i2 | xi]] in place of the unknown labels.
1-8. The EM algorithm for GMMs
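The 1-8 slides presented the full set of update equations; a standard statement of these updates for a GMM, consistent with sections 1-3 to 1-7 (and with the MATLAB code in section 1-10, whose r(:,j) are the responsibilities γ_ij), is, at iteration τ:

\text{E-step: }\;
\gamma_{ij} = \frac{\alpha_j^{(\tau)}\, p\!\left(x_i \mid \mu_j^{(\tau)}, \Sigma_j^{(\tau)}\right)}
                   {\sum_{k=1}^{M} \alpha_k^{(\tau)}\, p\!\left(x_i \mid \mu_k^{(\tau)}, \Sigma_k^{(\tau)}\right)}

\text{M-step: }\;
\mu_j^{(\tau+1)} = \frac{\sum_{i} \gamma_{ij}\, x_i}{\sum_{i} \gamma_{ij}},\quad
\Sigma_j^{(\tau+1)} = \frac{\sum_{i} \gamma_{ij}\,\bigl(x_i - \mu_j^{(\tau+1)}\bigr)\bigl(x_i - \mu_j^{(\tau+1)}\bigr)^{\top}}{\sum_{i} \gamma_{ij}},\quad
\alpha_j^{(\tau+1)} = \frac{1}{N}\sum_{i} \gamma_{ij}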
1-9. The generalized EM algorithm
• EM is not a learning algorithm restricted to Gaussian mixture models: it applies to any probability model with hidden variables (HMMs, etc.). It is an optimization algorithm that can estimate the hidden variables and the parameters together.
• For an observed random variable x, a hidden random variable z, and model parameters θ, the overall probability density is expressed as the joint density p(x, z | θ) of the two variables.
→ Given observed data X = {x1, x2, …, xN} for x, estimate the parameters that maximize the corresponding log-likelihood ln p(X | θ).
1. p(X | θ) can be computed by integrating the joint distribution p(x, z | θ) over z (the marginal distribution). This incomplete-data likelihood (a likelihood defined only from the data for the random variable x) is what we would like to maximize, but the computation is difficult.
(Marginal distribution: from the joint probability distribution of two discrete random variables (X, Y), the distribution of each individual variable is obtained; summing the joint distribution over all values Y can take gives the marginal probability of X.)
1-9. The generalized EM algorithm
2. If observed data Z = {z1, z2, …, zN} for the hidden variable z were given, a likelihood function over the two data sets X and Z could be defined. Using the independence of the data points, it can be written as the sum of the per-sample log-likelihoods:

  ln p(X, Z | θ) = Σ_{i=1}^{N} ln p(xi, zi | θ)

• A likelihood defined with observed data for all random variables is called the complete-data log-likelihood.
• In general, data Z for the hidden variable z is not given.
→ Instead of concrete data, the expectation with respect to z is used: the expectation of the complete-data log-likelihood, obtained by integrating it against the probability distribution of z. The result is a function of the current parameters θ^(τ) and of θ: the current parameters θ^(τ) and the observed data X are used to obtain the conditional distribution p(z | X, θ^(τ)) of z.
• The above expression is written as the Q function:

  Q(θ | θ^(τ)) = Σ_z p(z | X, θ^(τ)) ln p(X, z | θ)
1-9. The generalized EM algorithm
• When the incomplete-data log-likelihood defined from the observed data cannot be maximized directly, hidden variables are introduced, the complete-data log-likelihood is defined, and the parameters are estimated using the Q function defined as its expectation.
1-10. Experiments with MATLAB
• Change of the parameters with the number of training iterations.
[Figure: the fitted mixture after (a) 0, (b) 1, (c) 10, and (d) 50 iterations]
1-10. Experiments with MATLAB
• Change of the log-likelihood with the number of training iterations.
[Figure: log-likelihood (y-axis) vs. number of iterations (x-axis)]
1-10. Experiments with MATLAB
• EM algorithm for a Gaussian mixture model of 2-D data.

% Initialization for the EM algorithm --------------------------
load data10_2                 % load the data
X = data';                    % X: training data
N = size(X,2);                % N: number of data points
M = 6;                        % M: number of Gaussian components
Mu = rand(M,2)*5;             % parameter initialization (means)
for i=1:M
    Sigma(i,1:2,1:2) = [1 0; 0 1];   % parameter initialization (covariances)
end
alpha = zeros(6,1)+1/6;       % parameter initialization (mixing coefficients)
drawgraph(X, Mu, Sigma, 1);   % call the plotting function
Maxtau = 100;                 % set the maximum number of iterations
1-10. Experiments with MATLAB
• EM algorithm for a Gaussian mixture model of 2-D data (continued).

% E-step of the EM algorithm -----------------------------------
for tau=1:Maxtau
    %%% E-step %%%%%%%%%%%%%%%%%%%%%%%%%%
    for j=1:M
        px(j,:) = gausspdf(X, Mu(j,:), reshape(Sigma(j,:,:),2,2));
    end
    sump = px'*alpha;
    for j=1:M
        r(:,j) = (alpha(j)*px(j,:))'./sump;
    end
    L(tau) = sum(log(sump));  % log-likelihood under the current parameters
    %% M-step %% (next slide) %%
end
1-10. Experiments with MATLAB

% M-step of the EM algorithm -----------------------------------
for tau=1:Maxtau
    %% E-step %% (previous slide) %%
    %%% M-step %%%%%%%%%%%%%%%%%%%%%%%%%
    for j=1:M
        sumr = sum(r(:,j));
        Rj = repmat(r(:,j),1,2)';
        Mu(j,:) = sum(Rj.*X,2)/sumr;                        % new mean
        rxmu = (X-repmat(Mu(j,:),N,1)').*Rj;
        Sigma(j,1:2,1:2) = rxmu*(X-repmat(Mu(j,:),N,1)')'/sumr;   % new covariance
        alpha(j) = sumr/N;                                  % new mixing coefficient
    end
    if (mod(tau,10)==1)
        drawgraph(X, Mu, Sigma, ceil(tau/10)+1);            % call the plotting function
    end
end
drawgraph(X, Mu, Sigma, tau);                               % call the plotting function
figure(tau+1); plot(L);                                     % plot the change of the log-likelihood
1-10. Experiments with MATLAB
• Function gausspdf: computes Gaussian probability density values.

% function gausspdf
function [out] = gausspdf(X, mu, sigma)    % function definition
n = size(X,1);                             % dimension of the input vectors
N = size(X,2);                             % number of data points
Mu = repmat(mu',1,N);                      % prepare for the matrix operations
% compute and return the probability density values
out = (1/((sqrt(2*pi))^n*sqrt(det(sigma)))) ...
      *exp(-diag((X-Mu)'*inv(sigma)*(X-Mu))/2);
1-10. Experiments with MATLAB
• Function drawgraph: plots the data and the parameters.

% function drawgraph
function drawgraph(X, Mu, Sigma, cnt)      % function definition
M = size(Mu,1);                            % number of components
figure(cnt);
plot(X(1,:), X(2,:), '*'); hold on         % plot the data
axis([-0.5 5.5 -0.5 3.5]); grid on
plot(Mu(:,1), Mu(:,2), 'r*');              % plot the mean parameters
for j=1:M
    sigma = reshape(Sigma(j,:,:),2,2);
    % draw an ellipse according to the covariance
    t = [-pi:0.1:pi]';
    A = sqrt(2)*[cos(t) sin(t)]*sqrtm(sigma) + repmat(Mu(j,:), size(t), 1);
    plot(A(:,1), A(:,2), 'r-', 'linewidth', 2);
end