Cheat Sheet: Algorithms for Supervised and Unsupervised Learning

Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.

Each algorithm is summarised under the same headings: Description, Model, Objective, Training, Regularisation, Complexity, Non-linear, and Online learning.

k-nearest neighbour
-------------------
Description: The label of a new point $\hat{x}$ is classified with the most frequent label $\hat{t}$ of the $k$ nearest training instances.

Model:
$\hat{t} = \arg\max_C \sum_{i : x_i \in N_k(x, \hat{x})} \delta(t_i, C)$
where:
• $N_k(x, \hat{x})$ ← the $k$ points in $x$ closest to $\hat{x}$
• Euclidean distance: $\sqrt{\sum_{i=1}^D (x_i - \hat{x}_i)^2}$
• $\delta(a, b)$ ← 1 if $a = b$; 0 otherwise

Objective: No optimisation needed.

Training: Use cross-validation to learn the appropriate $k$; otherwise no training, classification is based on the existing points.

Regularisation: $k$ acts as a regulariser: as $k \to N$ the decision boundary becomes smoother.

Complexity: $O(NM)$ space, since all training instances and all their features need to be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.

Naive Bayes
-----------
Description: Learn $p(C_k|x)$ by modelling $p(x|C_k)$ and $p(C_k)$, using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, hence 'naive'.

Model:
$y(x) = \arg\max_k p(C_k|x) = \arg\max_k p(x|C_k)\,p(C_k) = \arg\max_k \prod_{i=1}^D p(x_i|C_k)\,p(C_k) = \arg\max_k \sum_{i=1}^D \log p(x_i|C_k) + \log p(C_k)$
Likelihoods:
• Multivariate: $p(x|C_k) = \prod_{i=1}^D p(x_i|C_k)$
• Multinomial: $p(x|C_k) = \prod_{i=1}^D p(\mathrm{word}_i|C_k)^{x_i}$
• Gaussian: $p(x|C_k) = \prod_{i=1}^D \mathcal{N}(v; \mu_{ik}, \sigma_{ik})$

Objective: No optimisation needed.

Training (maximum likelihood estimates):
• Multivariate likelihood: $p_{\mathrm{MLE}}(x_i = v|C_k) = \dfrac{\sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^N \delta(t_j = C_k)}$
• Multinomial likelihood: $p_{\mathrm{MLE}}(\mathrm{word}_i = v|C_k) = \dfrac{\sum_{j=1}^N \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k)\,x_{di}}$
where $x_{ji}$ is the count of word $i$ in training example $j$, and $x_{di}$ is the count of feature $d$ in training example $j$.

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate:
• Multivariate likelihood: $p_{\mathrm{MAP}}(x_i = v|C_k) = \dfrac{(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k)}$
• Multinomial likelihood: $p_{\mathrm{MAP}}(\mathrm{word}_i = v|C_k) = \dfrac{(\alpha_i - 1) + \sum_{j=1}^N \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k)\,x_{di} - D + \sum_{d=1}^D \alpha_d}$

Complexity: $O(NM)$, since each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.

Log-linear
----------
Description: Estimate $p(C_k|x)$ directly, by assuming a maximum entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:
$p(C_k|x) = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_m \lambda_m \phi_m(x, C_k)}$, where $Z_\lambda(x) = \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)}$

Objective: Minimise the negative log-likelihood:
$L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = -\sum_{(x,t)\in\mathcal{D}} \log p(t|x) = \sum_{(x,t)\in\mathcal{D}} \Big[\log Z_\lambda(x) - \sum_m \lambda_m \phi_m(x, t)\Big] = \sum_{(x,t)\in\mathcal{D}} \Big[\log \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)} - \sum_m \lambda_m \phi_m(x, t)\Big]$

Training: Gradient descent (or gradient ascent if maximising the objective):
$\lambda^{(n+1)} = \lambda^{(n)} - \eta \nabla L$, where $\eta$ is the step parameter.
$\nabla L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = \sum_{(x,t)\in\mathcal{D}} \mathbb{E}[\phi(x, \cdot)] - \phi(x, t)$
where $\mathbb{E}[\phi(x, \cdot)] = \sum_{C_k} \phi(x, C_k)\, p(C_k|x)$ and $\sum_{(x,t)\in\mathcal{D}} \phi(x, t)$ are the empirical counts.

Regularisation: Penalise large values of the $\lambda$ parameters by introducing a prior distribution over them (typically a Gaussian):
$L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \arg\min_\lambda \Big[-\log p(\lambda) - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big] = \arg\min_\lambda \Big[\sum_m \dfrac{(0 - \lambda_m)^2}{2\sigma^2} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big] = \arg\min_\lambda \Big[\sum_m \dfrac{\lambda_m^2}{2\sigma^2} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big]$
$\nabla L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \dfrac{\lambda}{\sigma^2} + \sum_{(x,t)\in\mathcal{D}} \mathbb{E}[\phi(x, \cdot)] - \phi(x, t)$

Complexity: $O(INMK)$, since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class conditional distribution in terms of a kernel $K(x, x')$, and use a non-linear kernel (for example $K(x, x') = (1 + w^T x)^2$). By the Representer Theorem:
$p(C_k|x) = \dfrac{1}{Z_\lambda(x)}\, e^{\lambda^T \phi(x, C_k)} = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{nk}\, \phi(x_n, C_i)^T \phi(x, C_k)} = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{nk}\, K((x_n, C_i), (x, C_k))} = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \alpha_{nk} K(x_n, x)}$

Online learning: Online Gradient Descent: update the parameters using GD after seeing each training instance.
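As a concrete illustration of the k-nearest neighbour rule in the first row, here is a minimal NumPy sketch; the function name, the toy data and majority voting via np.unique are illustrative choices, not part of the original sheet.

```python
import numpy as np

def knn_predict(X_train, t_train, x_new, k=3):
    """Classify x_new with the most frequent label among the k nearest
    training points, using Euclidean distance (as in the Model column)."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))   # distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                         # majority vote (ties broken arbitrarily)

# Toy usage (illustrative only):
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
t = np.array([0, 0, 1, 1])
print(knn_predict(X, t, np.array([0.95, 1.05]), k=3))        # -> 1
```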
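The multinomial Naive Bayes estimates translate almost directly into code. The sketch below assumes a word-count matrix and uses the Dirichlet MAP estimate from the Regularisation cell with a symmetric prior $\alpha_i = 2$ (equivalent to add-one smoothing); these choices are assumptions for illustration, not prescribed by the sheet.

```python
import numpy as np

def train_multinomial_nb(X, t, num_classes, alpha=2.0):
    """X: (N, D) word-count matrix; t: (N,) class labels in {0..K-1}.
    Returns log class priors and log p(word_i|C_k) under a symmetric
    Dirichlet prior (the MAP estimate from the Regularisation cell)."""
    N, D = X.shape
    log_prior = np.log(np.bincount(t, minlength=num_classes) / N)
    log_lik = np.zeros((num_classes, D))
    for k in range(num_classes):
        counts = X[t == k].sum(axis=0)   # word counts within class k
        # ((alpha_i - 1) + counts_i) / (sum_d counts_d - D + sum_d alpha_d)
        log_lik[k] = np.log((alpha - 1 + counts) /
                            (counts.sum() - D + alpha * D))
    return log_prior, log_lik

def predict_nb(log_prior, log_lik, x):
    """arg max_k  sum_i x_i log p(word_i|C_k) + log p(C_k)."""
    return int(np.argmax(log_lik @ x + log_prior))
```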
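Below is a minimal sketch of gradient-descent training for the log-linear model, following the gradient $\sum (\mathbb{E}[\phi(x,\cdot)] - \phi(x,t))$ given in the Training cell. The block feature map `phi`, the averaged gradient and the fixed step size are illustrative assumptions.

```python
import numpy as np

def phi(x, k, num_classes):
    """Joint feature map: place x in the k-th block of a (K*D)-vector."""
    D = x.shape[0]
    out = np.zeros(num_classes * D)
    out[k * D:(k + 1) * D] = x
    return out

def softmax(scores):
    scores = scores - scores.max()   # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def train_loglinear(X, t, num_classes, eta=0.1, epochs=100):
    """X: (N, D) inputs; t: (N,) integer class labels. Gradient descent on the
    negative log-likelihood: grad = E[phi(x, .)] - phi(x, t), summed over data."""
    N, D = X.shape
    lam = np.zeros(num_classes * D)
    for _ in range(epochs):
        grad = np.zeros_like(lam)
        for x, t_n in zip(X, t):
            p = softmax(np.array([lam @ phi(x, k, num_classes)
                                  for k in range(num_classes)]))   # p(C_k|x)
            expected = sum(p[k] * phi(x, k, num_classes) for k in range(num_classes))
            grad += expected - phi(x, t_n, num_classes)
        lam -= eta * grad / N   # lambda <- lambda - eta * grad (averaged over the batch)
    return lam

def predict_loglinear(lam, x, num_classes):
    return int(np.argmax([lam @ phi(x, k, num_classes) for k in range(num_classes)]))
```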
Perceptron
----------
Description: Directly estimate the linear function $y(x)$ by iteratively updating the weight vector whenever a training instance is incorrectly classified.

Model: Binary, linear classifier:
$y(x) = \mathrm{sign}(w^T x)$, where $\mathrm{sign}(x) = +1$ if $x \ge 0$ and $-1$ if $x < 0$.
Multiclass perceptron:
$y(x) = \arg\max_{C_k} w^T \phi(x, C_k)$

Objective: Minimise the error function, i.e. the number of incorrectly classified input vectors:
$\arg\min_w E_P(w) = \arg\min_w -\sum_{n \in \mathcal{M}} w^T x_n t_n$
where $\mathcal{M}$ is the set of misclassified training vectors.

Training: Iterate over each training example $x_n$, and update the weight vector on misclassification:
$w^{(i+1)} = w^{(i)} - \eta \nabla E_P(w) = w^{(i)} + \eta\, x_n t_n$, where typically $\eta = 1$.
For the multiclass perceptron:
$w^{(i+1)} = w^{(i)} + \phi(x, t) - \phi(x, y(x))$

Regularisation: The Voted Perceptron: run the perceptron $i$ times and store each iteration's weight vector. Then:
$y(x) = \mathrm{sign}\Big(\sum_i c_i \times \mathrm{sign}(w_i^T x)\Big)$
where $c_i$ is the number of correctly classified training instances for $w_i$.

Complexity: $O(INML)$, since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel $K(x, x')$, and one weight per training instance:
$y(x) = \mathrm{sign}\Big(\sum_{n=1}^N w_n t_n K(x, x_n)\Big)$
with the update $w_n^{(i+1)} = w_n^{(i)} + 1$ on misclassification of $x_n$.

Online learning: The perceptron is an online algorithm by default.

Support vector machines
-----------------------
Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:
$y(x) = \sum_{n=1}^N \lambda_n t_n x^T x_n + w_0$

Objective:
Primal: $\arg\min_{w, w_0} \frac{1}{2}\|w\|^2$ s.t. $t_n(w^T x_n + w_0) \ge 1$ $\forall n$
Dual: $\tilde{L}(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m$ s.t. $\lambda_n \ge 0$ $\forall n$ and $\sum_{n=1}^N \lambda_n t_n = 0$

Training:
• Quadratic Programming (QP)
• SMO, Sequential Minimal Optimisation (chunking)

Regularisation: The soft-margin SVM: penalise a hyperplane by the number and distance of misclassified points.
Primal: $\arg\min_{w, w_0} \frac{1}{2}\|w\|^2 + C\sum_{n=1}^N \xi_n$ s.t. $t_n(w^T x_n + w_0) \ge 1 - \xi_n$ and $\xi_n \ge 0$ $\forall n$
Dual: $\tilde{L}(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m$ s.t. $0 \le \lambda_n \le C$ and $\sum_{n=1}^N \lambda_n t_n = 0$

Complexity:
• QP: $O(n^3)$
• SMO: much more efficient than QP, since the computation is based only on the support vectors.

Non-linear: Use a non-linear kernel $K(x, x')$:
$y(x) = \sum_{n=1}^N \lambda_n t_n K(x, x_n) + w_0$
$\tilde{L}(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m K(x_n, x_m)$

Online learning: Online SVM. See, for example:
• The Huller: A Simple and Efficient Online SVM, Bordes & Bottou (2005)
• Pegasos: Primal Estimated sub-Gradient Solver for SVM, Shalev-Shwartz et al. (2007)
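A minimal sketch of the binary perceptron update $w \leftarrow w + \eta\, x_n t_n$ from the Training cell above; the fixed number of epochs and the convergence check are illustrative assumptions.

```python
import numpy as np

def train_perceptron(X, t, epochs=100, eta=1.0):
    """X: (N, D) inputs; t: (N,) labels in {-1, +1}.
    Update w whenever x_n is misclassified: w <- w + eta * x_n * t_n."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:   # misclassified (a zero score is treated as an error here)
                w += eta * x_n * t_n
                errors += 1
        if errors == 0:                # converged: every training point correctly classified
            break
    return w

def predict_perceptron(w, x):
    return 1 if w @ x >= 0 else -1
```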
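The QP and SMO solvers named in the SVM Training cell are too long to sketch here; instead, the following is a rough sub-gradient sketch of the soft-margin objective in the spirit of Pegasos (Shalev-Shwartz et al., 2007), one of the papers cited in the Online learning cell. It omits the bias term $w_0$ and is an assumption-laden illustration, not the sheet's training procedure.

```python
import numpy as np

def pegasos_svm(X, t, lam=0.01, epochs=20, seed=0):
    """X: (N, D) inputs; t: (N,) labels in {-1, +1}. Stochastic sub-gradient
    descent on  lam/2 ||w||^2 + (1/N) sum_n max(0, 1 - t_n w^T x_n)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    step = 0
    for _ in range(epochs):
        for n in rng.permutation(N):
            step += 1
            eta = 1.0 / (lam * step)          # decaying step size
            if t[n] * (w @ X[n]) < 1:         # margin violated: hinge-loss sub-gradient
                w = (1 - eta * lam) * w + eta * t[n] * X[n]
            else:                             # only the regulariser contributes
                w = (1 - eta * lam) * w
    return w

def predict_svm(w, x):
    return 1 if w @ x >= 0 else -1
```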
k-means
-------
Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments $r_{nk} \in \{0, 1\}$ s.t. $\sum_k r_{nk} = 1$ $\forall n$, i.e. each data point is assigned to exactly one cluster $k$.
Geometric distance: the Euclidean distance ($\ell_2$ norm):
$\|x_n - \mu_k\|_2 = \sqrt{\sum_{i=1}^D (x_{ni} - \mu_{ki})^2}$

Objective:
$\arg\min_{r, \mu} \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|_2^2$
i.e. minimise the distance from each cluster centre to each of its points.

Training:
Expectation: $r_{nk} = 1$ if $\|x_n - \mu_k\|^2$ is minimal over $k$; $0$ otherwise.
Maximisation: $\mu^{(k)}_{\mathrm{MLE}} = \dfrac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$, where $\mu^{(k)}$ is the centroid of cluster $k$.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means, as suggested in: Kernel k-means, Spectral Clustering and Normalized Cuts, Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.

Mixture of Gaussians
--------------------
Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters are made by specifying the probabilities
$p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)})\, p(z^{(i)})$
with $z^{(i)} \sim \mathrm{Multinomial}(\gamma)$, and $\gamma_{nk} \equiv p(k|x_n)$ s.t. $\sum_{j=1}^K \gamma_{nj} = 1$. I.e. we want to maximise the probability of the observed data $x$.

Objective:
$L(x, \pi, \mu, \Sigma) = \log p(x|\pi, \mu, \Sigma) = \sum_{n=1}^N \log\Big[\sum_{k=1}^K \pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)\Big]$

Training (EM):
Expectation: for each $n, k$ set
$\gamma_{nk} = p(z^{(i)} = k\,|\,x^{(i)}; \gamma, \mu, \Sigma) \;(= p(k|x_n)) = \dfrac{p(x^{(i)}|z^{(i)} = k; \mu, \Sigma)\, p(z^{(i)} = k; \pi)}{\sum_{j=1}^K p(x^{(i)}|z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \pi)} = \dfrac{\pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(x_n|\mu_j, \Sigma_j)}$
Maximisation:
$\pi_k = \dfrac{1}{N}\sum_{n=1}^N \gamma_{nk}$
$\mu_k = \dfrac{\sum_{n=1}^N \gamma_{nk}\, x_n}{\sum_{n=1}^N \gamma_{nk}}$
$\Sigma_k = \dfrac{\sum_{n=1}^N \gamma_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N \gamma_{nk}}$

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, Neal & Hinton (1998).
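A minimal sketch of the k-means Expectation/Maximisation steps above; initialising the centroids at randomly chosen data points and running for a fixed number of iterations are illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """X: (N, D) data; returns centroids mu (K, D) and hard assignments (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # initialise centroids at random data points
    for _ in range(iters):
        # Expectation: assign each point to its closest centroid (r_nk)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        assign = dists.argmin(axis=1)
        # Maximisation: recompute each centroid as the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                # converged: centroids no longer move
            break
        mu = new_mu
    return mu, assign
```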
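Finally, a minimal sketch of the EM updates for the mixture of Gaussians; the initialisation scheme and the small diagonal term added to each covariance for numerical stability are assumptions, not part of the sheet.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of a multivariate Gaussian N(x | mu, cov)."""
    D = x.shape[0]
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / \
           np.sqrt(((2 * np.pi) ** D) * np.linalg.det(cov))

def em_gmm(X, K, iters=50, seed=0):
    """X: (N, D) data; returns mixing weights pi, means mu, covariances cov
    and responsibilities gamma (N, K)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]        # initialise means at random data points
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: gamma_nk = pi_k N(x_n|mu_k, cov_k) / sum_j pi_j N(x_n|mu_j, cov_j)
        gamma = np.zeros((N, K))
        for n in range(N):
            for k in range(K):
                gamma[n, k] = pi[k] * gaussian_pdf(X[n], mu[k], cov[k])
            gamma[n] /= gamma[n].sum()             # no log-sum-exp; fine for well-scaled toy data
        # M-step: re-estimate pi, mu, cov from the weighted counts
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, cov, gamma
```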