Introduction to Pattern Recognition for Human ICT
Gaussian Mixture Models and EM
2014. 11. 21
Hyunki Hong

Contents
• Supervised vs. unsupervised learning
• Mixture models
• Expectation-Maximization (EM)

Supervised vs. unsupervised learning
• The pattern recognition methods covered in class up to this point have focused on the issue of classification.
– A pattern consisted of a pair of variables {x, ω}, where
1. x was a collection of observations or features (feature vector), and
2. ω was the concept behind the observation (label).
– Such pattern recognition problems are called supervised (training with a teacher), since the system is given BOTH the feature vector and the correct answer.
• Next we investigate a number of methods that operate on unlabeled data.
– Given a collection of feature vectors X = {x(1), x(2), …, x(N)} without class labels ωi, these methods attempt to build a model that captures the structure of the data.
– These methods are called unsupervised (training without a teacher) since they are not provided the correct answer.
• Although unsupervised learning methods may appear to have limited capabilities, there are several reasons that make them extremely useful.
– Labeling large data sets can be a costly procedure (e.g., ASR, airport surveillance radar).
– Class labels may not be known beforehand (e.g., data mining: the computational process of discovering patterns in large data sets with methods from AI, database systems, etc.; its overall goal is to extract information from a data set and transform it into an understandable structure for further use).
– Large data sets can be compressed into a small set of prototypes (kNN).
• The supervised and unsupervised paradigms comprise the vast majority of pattern recognition problems.
– A third approach, known as reinforcement learning, uses a reward signal (real-valued or binary) to tell the learning system how well it is performing.
– In reinforcement learning, the goal of the learning system (or agent) is to learn a mapping from states onto actions (an action policy) that maximizes the total reward.

Approaches to unsupervised learning
• Parametric (mixture models)
– These methods model the underlying class-conditional densities with a mixture of parametric densities, and the objective is to find the model parameters.
– These methods are closely related to parameter estimation.
• Non-parametric (clustering)
– No assumptions are made about the underlying densities; instead we seek a partition of the data into clusters.
• There are two reasons why we cover mixture models at this point.
– The solution to the mixture problem (the EM algorithm) is also used for Hidden Markov Models.
– A particular form of the mixture model problem leads to the most widely used clustering method: the k-means algorithm.

Mixture models
• Consider the now familiar problem of modeling a pdf given a dataset X = {x(1), x(2), …, x(N)}.
– If the form of the underlying pdf were known (e.g., Gaussian), the problem could be solved using Maximum Likelihood.
– If the form of the pdf were unknown, the problem had to be solved with non-parametric density estimation (DE) methods such as Parzen windows.
• We will now consider an alternative DE method: modeling the pdf with a mixture of parametric densities.
– These methods are sometimes known as semi-parametric.
– In particular, we will focus on mixture models of Gaussian densities (surprised?).
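As a small illustration of this semi-parametric idea (this example is not in the original slides, and the component parameters below are arbitrary), the following MATLAB sketch evaluates and plots a one-dimensional mixture of three Gaussians, p(x) = Σj αj N(x; μj, σj²):

% Sketch (not from the original slides): a 1-D Gaussian mixture density
% with arbitrary example parameters chosen only for illustration.
alpha = [0.3 0.5 0.2];          % mixing coefficients (sum to 1)
mu    = [-2  1   4 ];           % component means
sigma = [0.8 1.0 0.6];          % component standard deviations

x = linspace(-6, 8, 500);       % evaluation grid
p = zeros(size(x));
for j = 1:length(alpha)
    % add the weighted univariate Gaussian density of component j
    p = p + alpha(j) * exp(-(x - mu(j)).^2 / (2*sigma(j)^2)) ...
            / (sqrt(2*pi) * sigma(j));
end

plot(x, p); xlabel('x'); ylabel('p(x)');
title('Mixture of three Gaussians (illustrative parameters)');

Even though each component is unimodal, the weighted sum can take a much more complex, multimodal shape, which is the point made in the Korean slides of Section 1-1 below.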
• Mixture models can be posed in terms of the ML criterion.
– Given a dataset X = {x(1), x(2), …, x(N)}, find the parameters of the model that maximize the log-likelihood of the data,

$$\ln p(X \mid \theta) = \sum_{n=1}^{N} \ln \sum_{j=1}^{C} p\big(x^{(n)} \mid \theta_j\big)\, P(\omega_j)$$

where θj = {μj, Σj} and P(ωj) are the parameters and mixing coefficient of the j-th mixture component, respectively.
• We could try to find the maximum of this function by differentiation.
– For Σj = σj²I, the solution becomes the expressions given on pages 21–23 of the lecture notes (not reproduced here).
• Notice that these equations are not a closed-form solution.
– The model parameters μj, Σj, and P(ωj) also appear on the right-hand side (RHS) as a result of Bayes' rule!
– Therefore, these expressions represent a highly non-linear, coupled system of equations.
• However, these expressions suggest that we may be able to use a fixed-point algorithm to find the maxima.
– Begin with some "old" value of the model parameters.
– Evaluate the RHS of the equations to obtain "new" parameter values.
– Let these "new" values become the "old" ones and repeat the process.
• Surprisingly, an algorithm of this simple form can be found which is guaranteed to increase the log-likelihood with every iteration!
– This example represents a particular case of a more general procedure known as the Expectation-Maximization (EM) algorithm.

Expectation-Maximization (EM) algorithm
• EM is a general method for finding the ML estimate of the parameters of a pdf when the data has missing values.
– There are two main applications of the EM algorithm.
1. When the data indeed has incomplete, missing, or corrupted values as a result of a faulty observation process.
2. When assuming the existence of missing or hidden parameters simplifies a likelihood function that would otherwise lead to an analytically intractable optimization problem.
• Assume a dataset containing two types of features.
– A set of features X whose values are known. We call these the incomplete data.
– A set of features Z whose values are unknown. We call these the missing data.
• We now define a joint pdf p(X, Z | θ) called the complete-data likelihood.
– This function is a random variable since the features Z are unknown.
– You can think of p(X, Z | θ) = h_{X,θ}(Z), for some function h_{X,θ}(·), where X and θ are constant and Z is a random variable.
• As suggested by its name, the EM algorithm operates by performing two basic operations over and over:
– an Expectation step, and
– a Maximization step.
• EXPECTATION
– Find the expected value of log p(X, Z | θ) with respect to the unknown data Z, given the data X and the current parameter estimates θ^(i−1),

$$Q\big(\theta \mid \theta^{(i-1)}\big) = E_{Z \mid X,\, \theta^{(i-1)}}\big[\log p(X, Z \mid \theta)\big]$$

where θ are the new parameters that we seek to optimize to increase Q.
– Note that X and θ^(i−1) are constants, θ is the variable that we wish to adjust, and Z is a random variable governed by the distribution p(Z | X, θ^(i−1)).
– Therefore Q(θ | θ^(i−1)) is just a function of θ.
• MAXIMIZATION
– Find the argument θ that maximizes the expected value Q(θ | θ^(i−1)).
• Convergence properties
– It can be shown that
1. each iteration (E+M) is guaranteed to increase the log-likelihood, and
2. the EM algorithm is guaranteed to converge to a local maximum of the log-likelihood.
• The two steps of the EM algorithm are illustrated below.
– During the E step, the unknown features Z are integrated out assuming the current values of the parameters θ^(i−1).
– During the M step, the values of the parameters that maximize the expected value of the log-likelihood are obtained:

$$\theta^{\text{new}} = \arg\max_{\theta} Q\big(\theta \mid \theta^{\text{old}}\big), \qquad Q\big(\theta \mid \theta^{\text{old}}\big) = \sum_{Z} p\big(Z \mid X, \theta^{\text{old}}\big)\, \log p(X, Z \mid \theta)$$

– IN A NUTSHELL: since Z is unknown, the best we can do is maximize the average log-likelihood across all possible values of Z.
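To make the fixed-point iteration concrete, here is a minimal MATLAB sketch (not part of the original slides) of EM for a univariate mixture of two Gaussians with a known, common standard deviation. The data and starting values are arbitrary illustrative choices, and only the means and mixing coefficients are re-estimated; the next section derives these updates formally.

% Minimal EM sketch (illustrative): univariate mixture of C = 2 Gaussians
% with known common sigma; alternate E and M steps as described above.
rng(0);
x     = [randn(1,100) - 2, randn(1,100) + 3];   % synthetic data (N = 200)
sigma = 1;                                      % known common std. deviation
mu    = [0 1];                                  % initial "old" means
alpha = [0.5 0.5];                              % initial mixing coefficients

for iter = 1:50
    % E-step: responsibilities P(omega_j | x(n)) under the "old" parameters
    for j = 1:2
        px(j,:) = alpha(j) * exp(-(x - mu(j)).^2 / (2*sigma^2)) ...
                  / (sqrt(2*pi)*sigma);
    end
    r = px ./ repmat(sum(px,1), 2, 1);          % normalize over the components

    % M-step: re-estimate the parameters from the responsibilities ("new" values)
    for j = 1:2
        mu(j)    = sum(r(j,:) .* x) / sum(r(j,:));
        alpha(j) = mean(r(j,:));
    end

    L(iter) = sum(log(sum(px,1)));              % log-likelihood at the "old" parameters
end

Plotting L after running this sketch should show the convergence property stated above: up to numerical precision, the log-likelihood does not decrease from one iteration to the next.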
EM algorithm and mixture models
• Having formalized the EM algorithm, we are now ready to find the solution to the mixture model problem.
– To keep things simple, we will assume a univariate mixture model where all the components have the same known standard deviation σ.
• Problem formulation
– As usual, we are given a dataset X = {x(1), x(2), …, x(N)}, and we seek to estimate the model parameters θ = {μ1, μ2, …, μC}.
– The following process is assumed to generate each random variable x(n).
1. First, a Gaussian component is selected according to the mixture coefficients P(ωj).
2. Then, x(n) is generated according to the likelihood p(x | μj) of that particular component.
– We will also use hidden variables z(n) = {z1(n), z2(n), …, zC(n)} to indicate which of the C Gaussian components generated data point x(n).
• Solution
– The probability p(x, z | θ) for a specific example is written so that only one of the zj(n) can have a value of 1, and all others are zero (see the reconstruction after this section).
– The log-likelihood of the entire dataset is then the sum of the per-example terms ln p(x(n), z(n) | θ).
– To obtain Q(θ | θ^(i−1)) we must then take the expectation over Z, where we use the fact that E[f(z)] = f(E[z]) for f(z) linear.
– E[zj(n)] is simply the probability that x(n) was generated by the j-th Gaussian component given the current model parameters θ^(i−1).  (1)
– These two expressions define the Q function.
– The second step (Maximization) consists of finding the values {μ1, μ2, …, μC} that maximize the Q function; computing the zeros of the partial derivatives yields the mean updates.  (2)
– Equations (1) and (2) define a fixed-point algorithm that can be used to converge to a (local) maximum of the log-likelihood function.
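The expressions referred to above, including equations (1) and (2), are not reproduced in this text export. For the univariate, known-σ formulation described in the problem statement, the standard forms (a reconstruction consistent with the surrounding text, not a verbatim copy of the slides) are:

$$p\big(x^{(n)}, z^{(n)} \mid \theta\big) = \prod_{j=1}^{C}\Big[P(\omega_j)\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(x^{(n)}-\mu_j)^2}{2\sigma^2}\Big)\Big]^{z_j^{(n)}}$$

$$E\big[z_j^{(n)}\big] = P\big(\omega_j \mid x^{(n)}, \theta^{(i-1)}\big) = \frac{P(\omega_j)\exp\big(-(x^{(n)}-\mu_j)^2/2\sigma^2\big)}{\sum_{k=1}^{C} P(\omega_k)\exp\big(-(x^{(n)}-\mu_k)^2/2\sigma^2\big)} \qquad (1)$$

$$\mu_j = \frac{\sum_{n=1}^{N} E\big[z_j^{(n)}\big]\, x^{(n)}}{\sum_{n=1}^{N} E\big[z_j^{(n)}\big]} \qquad (2)$$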
Lecture contents (Korean slides)
• Gaussian mixture models: why they are needed, definition
• Learning a GMM → maximum likelihood estimation and its problems
• EM algorithm: hidden variables, the steps of the EM algorithm
• EM algorithm for Gaussian mixture models
• The generalized EM algorithm
• Experiments using MATLAB

1. Gaussian Mixtures
1-1. Why Gaussian mixture models are needed
• Probabilistic model: to capture the characteristics of a data distribution, assume an appropriate probability density function and build a model for it (estimated pdf vs. true pdf).
• A single Gaussian distribution is unimodal: the data are treated as one group centered on the mean.
• In the example above, a weighted sum of three Gaussian distributions can represent a function of much more complex shape.
• Example: a data set that can be represented by a mixture of six Gaussians.

1-2. Definition of a GMM
• The overall probability density function is defined as a mixture of M components (density functions).
– Ci: random variable indicating the i-th component.
– Each component is a simple pdf that serves as the basic building block of the mixture model.
– The parameters defining the i-th component pdf: for a Gaussian pdf, μi and Σi.
– αi: the relative weight of the i-th component within the whole (mixing coefficient).
– The full parameter set of a GMM with M components therefore comprises the means μi, the covariance matrices Σi, and the mixing coefficients αi.

1-3. Learning a GMM
• Learning a GMM means estimating its parameters from data, i.e., a form of parametric density estimation.
– Given X = {x1, x2, …, xN}, the density p(xi) of the i-th data point is expressed with the Gaussian mixture model: M univariate Gaussian densities (the j-th component) combined into a mixture density.
– Learning: find the parameters that maximize the log-likelihood over the entire training set → maximum likelihood estimation.

1-3. Learning a GMM: estimating the means and variances
• Bayes' rule gives

$$p(C_j \mid x_i, \theta) = \frac{p(x_i \mid C_j, \theta)\, p(C_j \mid \theta)}{p(x_i \mid \theta)} = \frac{\alpha_j\, p(x_i \mid \mu_j, \sigma_j^2)}{p(x_i \mid \theta)}$$

which, together with d(e^u)/dx = e^u du/dx, is used in the derivation.
– p(Cj | xi, θ) is the probability that data point xi was generated by the j-th component.
– The update for σj² is derived in the same way as the one for μj.

1-4. Learning a GMM: estimating the mixing coefficients
• The αj are subject to a constraint (Σj αj = 1 must be satisfied), so the maximization problem is solved with a Lagrange multiplier.
– Substituting back into the equation above gives λ = N.
– Bayes' rule, in the same form as above, is used again.

1-5. Problems with maximum likelihood estimation
• The right-hand side contains the very parameters that must be estimated,

$$\theta = \big(\mu_1, \ldots, \mu_M,\; \sigma_1^2, \ldots, \sigma_M^2,\; \alpha_1, \ldots, \alpha_M\big)$$

so the problem is solved with an "iterative algorithm": the EM algorithm.
– Initialize the parameters with arbitrary values at the start.
– Iterate until the parameters that maximize the objective function are obtained.

1-6. GMMs and hidden variables
• Hidden variable: a variable that cannot be observed from outside.
– Example: from a single group containing both men and women, one person at a time is picked and their height is measured.
– Given the measured data X = {x1, x2, …, xN}, estimate the probability density function p(x): the distribution of the female group and the distribution of the male group.
– Parameters to be estimated: μj, σj², αj (j = 1, 2).
• A new variable is introduced to represent the class label of each data point:
– x ∈ female group → C1 = 1, C2 = 0
– x ∈ male group → C1 = 0, C2 = 1
• Hidden variable: a variable whose value cannot be measured externally → a probabilistic model that takes the hidden variable into account.

1-7. EM algorithm for a model with hidden variables
• If we assume that the value zi of the hidden variable is observed along with each data point xi:
1. The data can be split into two groups according to the given z values, and the parameters of each group can be estimated directly.
– Cij: value of the hidden variable Cj for the i-th data point.
– Nj: number of data points belonging to group j.
2. The mixing coefficient αj, which represents the weight of each group, is estimated as the proportion of the whole occupied by the j-th group.
• In reality, however, only xi is observed and zi = [Ci1, Ci2] is unknown.
→ EM algorithm: alternate between estimating the hidden variable z (E-step) and estimating the parameters θ (M-step).
• In short: "a method for estimating the parameters of a probabilistic model with hidden random variables."

1-7. EM algorithm for a model with hidden variables: E-step and M-step
(1) E (Expectation) step: compute the expectation of the hidden variable z and use it as an estimate of z.
– The expectation of the hidden variable is computed using the current parameters: initially the parameters are set to arbitrary values; afterwards the parameters obtained from the M-step are used.
– Since C1 and C2 each take the value 1 or 0, the expectation is obtained as the probability that Cj = 1, using the relation

$$P(C_j \mid x_i, \theta) = \frac{p(x_i \mid C_j, \theta)\, p(C_j \mid \theta)}{p(x_i \mid \theta)} = \frac{\alpha_j\, p(x_i \mid \mu_j, \sigma_j^2)}{p(x_i \mid \theta)}$$

(2) M (Maximization) step: estimate the parameters using the expectation of the hidden variable z.
– The expectations found in the E-step are treated as observed values of the hidden variables, and the parameters that maximize the log-likelihood are found.

1-8. EM algorithm for GMMs
(The update equations of this section are implemented in the MATLAB code of Section 1-10; reference forms are given after this section.)

1-9. The generalized EM algorithm
• EM is not a learning algorithm restricted to Gaussian mixture models.
– It applies to probabilistic models with hidden variables (HMMs, etc.).
– It is an optimization algorithm that estimates the hidden variables and the parameters together.
• For the observed random variable x, the hidden random variable z, and the model parameters θ, the full probability density is expressed as the joint density p(x, z | θ).
→ Estimate the parameters that maximize the log-likelihood ln p(X | θ) of the observed data X = {x1, x2, …, xN}.
1. p(X | θ) can be computed by marginalizing the joint distribution p(x, z | θ) over z.
– Maximizing this incomplete-data log-likelihood (the likelihood defined from the data on x alone) is computationally difficult.
– Marginal distribution: from the joint probability distribution of two discrete random variables (X, Y), the distribution of each variable is obtained; summing the joint over all values that Y can take gives the marginal probability of X.
2. If observed data Z = {z1, z2, …, zN} for the hidden variable z were also given, a likelihood function over both data sets X and Z could be defined.
– Using the independence of the data points, it can be written as the sum of the log-likelihoods of the individual points.
– A likelihood defined with data observed for every random variable is called the complete-data log-likelihood.
• In general the data Z for the hidden variable z is not given.
→ Instead of concrete values of z, its expectation is used: the complete-data log-likelihood is integrated against the probability distribution of z.
– Using the current parameters θ(τ) and the observed data X, the conditional distribution p(z | X, θ(τ)) is obtained, and the Q function is defined as this expectation; it is a function of θ, with θ(τ) held fixed.
• Summary: when the incomplete-data log-likelihood defined from the observed data cannot be maximized directly, introduce hidden variables, define the complete-data log-likelihood, and estimate the parameters using the Q function defined as its expectation.

1-10. Experiments using MATLAB
[Figure: change of the parameters over the learning iterations, shown at (a) τ = 0, (b) τ = 1, (c) τ = 10, (d) τ = 50.]
[Figure: change of the log-likelihood versus the number of iterations (y-axis: log-likelihood, x-axis: number of iterations).]
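For reference before the implementation: the slides' update equations for the general (here two-dimensional, full-covariance) Gaussian mixture are not included in this text export. The MATLAB code below follows the standard EM updates, reconstructed here with r_ij denoting the responsibility of component j for data point x_i (the array r(:,j) in the code):

$$r_{ij} = P\big(C_j \mid x_i, \theta^{(\tau)}\big) = \frac{\alpha_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{M}\alpha_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}$$

$$\mu_j^{\text{new}} = \frac{\sum_{i=1}^{N} r_{ij}\, x_i}{\sum_{i=1}^{N} r_{ij}}, \qquad \Sigma_j^{\text{new}} = \frac{\sum_{i=1}^{N} r_{ij}\,\big(x_i-\mu_j^{\text{new}}\big)\big(x_i-\mu_j^{\text{new}}\big)^{\top}}{\sum_{i=1}^{N} r_{ij}}, \qquad \alpha_j^{\text{new}} = \frac{1}{N}\sum_{i=1}^{N} r_{ij}$$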
EM algorithm for a Gaussian mixture model of two-dimensional data (MATLAB implementation)

% Initialization for the EM algorithm -------------------------------------
load data10_2                     % load the data
X = data';                        % X: training data
N = size(X,2);                    % N: number of data points
M = 6;                            % M: number of Gaussian components
Mu = rand(M,2)*5;                 % initialize the parameters (means)
for i=1:M                         % initialize the parameters (covariances)
    Sigma(i,1:2,1:2) = [1 0; 0 1];
end
alpha = zeros(6,1) + 1/6;         % initialize the parameters (mixing coefficients)
drawgraph(X, Mu, Sigma, 1);       % call the plotting function
Maxtau = 100;                     % set the maximum number of iterations

% E-step of the EM algorithm ----------------------------------------------
for tau=1:Maxtau
    %%% E-step %%%
    for j=1:M
        px(j,:) = gausspdf(X, Mu(j,:), reshape(Sigma(j,:,:),2,2));
    end
    sump = px'*alpha;
    for j=1:M
        r(:,j) = (alpha(j)*px(j,:))'./sump;
    end
    L(tau) = sum(log(sump));      % log-likelihood under the current parameters

    %%% M-step %%% (next slide)
end

% M-step of the EM algorithm ----------------------------------------------
for tau=1:Maxtau
    %%% E-step %%% (previous slide)

    %%% M-step %%%
    for j=1:M
        sumr = sum(r(:,j));
        Rj = repmat(r(:,j),1,2)';
        Mu(j,:) = sum(Rj.*X,2)/sumr;                           % new mean
        rxmu = (X - repmat(Mu(j,:),N,1)').*Rj;                 % compute the new covariance
        Sigma(j,1:2,1:2) = rxmu*(X - repmat(Mu(j,:),N,1)')'/sumr;
        alpha(j) = sumr/N;                                     % new mixing coefficient
    end
    if (mod(tau,10)==1)
        drawgraph(X, Mu, Sigma, ceil(tau/10)+1);               % call the plotting function
    end
end
drawgraph(X, Mu, Sigma, tau);                                  % call the plotting function
figure(tau+1); plot(L);                                        % plot the change of the log-likelihood

Function gausspdf: computes Gaussian probability density values

% Function gausspdf --------------------------------------------------------
function [out]=gausspdf(X, mu, sigma)       % function definition
n = size(X,1);                              % dimension of the input vectors
N = size(X,2);                              % number of data points
Mu = repmat(mu',1,N);                       % replicate the mean for matrix operations
% compute and return the density values
out = (1/((sqrt(2*pi))^n*sqrt(det(sigma)))) ...
      * exp(-diag((X-Mu)'*inv(sigma)*(X-Mu))/2);

Function drawgraph: plots the data and the current parameters

% Function drawgraph -------------------------------------------------------
function drawgraph(X, Mu, Sigma, cnt)       % function definition
M = size(Mu,1);                             % number of components
figure(cnt);
plot(X(1,:), X(2,:), '*');                  % plot the data
hold on
axis([-0.5 5.5 -0.5 3.5]); grid on
plot(Mu(:,1), Mu(:,2), 'r*');               % plot the mean parameters
for j=1:M
    sigma = reshape(Sigma(j,:,:),2,2);
    % draw the ellipse corresponding to the covariance
    t = [-pi:0.1:pi]';
    A = sqrt(2)*[cos(t) sin(t)]*sqrtm(sigma) + repmat(Mu(j,:), size(t,1), 1);
    plot(A(:,1), A(:,2), 'r-', 'linewidth', 2);
end
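Usage note: the script above expects the course data file data10_2.mat, which is not included in this text. If the file is unavailable, a purely hypothetical stand-in can be generated as follows; the cluster centers and noise level are arbitrary and chosen only so that the points fall inside the plotting range used by drawgraph.

% Hypothetical stand-in for data10_2.mat (not part of the original lecture):
% draw 2-D points from a few Gaussian clusters and store them in a variable
% named 'data' (N-by-2), which is the format the initialization code expects.
rng(1);
centers = [1 1; 3 1; 2 2.5; 4 2.5; 1 3; 4.5 0.5];   % arbitrary cluster centers
data = [];
for k = 1:size(centers,1)
    data = [data; repmat(centers(k,:), 50, 1) + 0.25*randn(50,2)];
end
save data10_2 data      % creates data10_2.mat so that 'load data10_2' works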