Cheat Sheet: Algorithms for Supervised and Unsupervised Learning

Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.

Each algorithm is summarised under the same headings: Description, Model, Objective, Training, Regularisation, Complexity, Non-linear, and Online learning.

k-nearest neighbour
-------------------
Description: The label of a new point $\hat{x}$ is classified with the most frequent label $\hat{t}$ of the $k$ nearest training instances.

Model:
$\hat{t} = \arg\max_C \sum_{i : x_i \in N_k(x, \hat{x})} \delta(t_i, C)$
where:
• $N_k(x, \hat{x})$ ← the $k$ points in $x$ closest to $\hat{x}$
• Euclidean distance: $\sqrt{\sum_{i=1}^D (x_i - \hat{x}_i)^2}$
• $\delta(a, b)$ ← 1 if $a = b$; 0 otherwise

Objective: No optimisation needed.

Training: Use cross-validation to learn the appropriate $k$; otherwise no training, classification is based on the existing points.

Regularisation: $k$ acts as a regulariser: as $k \to N$ the decision boundary becomes smoother.

Complexity: $O(NM)$ space, since all training instances and all their features need to be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.

Naive Bayes
-----------
Description: Learn $p(C_k|x)$ by modelling $p(x|C_k)$ and $p(C_k)$, using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, hence 'naive'.

Model:
$y(x) = \arg\max_k p(C_k|x) = \arg\max_k p(x|C_k)\,p(C_k) = \arg\max_k \prod_{i=1}^D p(x_i|C_k)\,p(C_k) = \arg\max_k \sum_{i=1}^D \log p(x_i|C_k) + \log p(C_k)$
Likelihoods:
• Multivariate: $p(x|C_k) = \prod_{i=1}^D p(x_i|C_k)$
• Multinomial: $p(x|C_k) = \prod_{i=1}^D p(\mathrm{word}_i|C_k)^{x_i}$
• Gaussian: $p(x|C_k) = \prod_{i=1}^D \mathcal{N}(v; \mu_{ik}, \sigma_{ik})$

Objective: No optimisation needed.

Training (maximum likelihood estimates):
• Multivariate likelihood: $p_{\mathrm{MLE}}(x_i = v|C_k) = \dfrac{\sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^N \delta(t_j = C_k)}$
• Multinomial likelihood: $p_{\mathrm{MLE}}(\mathrm{word}_i = v|C_k) = \dfrac{\sum_{j=1}^N \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k)\,x_{di}}$
where $x_{ji}$ is the count of word $i$ in training example $j$, and $x_{di}$ is the count of feature $d$ in training example $j$.

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate:
• Multivariate likelihood: $p_{\mathrm{MAP}}(x_i = v|C_k) = \dfrac{(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k)}$
• Multinomial likelihood: $p_{\mathrm{MAP}}(\mathrm{word}_i = v|C_k) = \dfrac{(\alpha_i - 1) + \sum_{j=1}^N \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k)\,x_{di} - D + \sum_{d=1}^D \alpha_d}$

Complexity: $O(NM)$, since each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.

Log-linear
----------
Description: Estimate $p(C_k|x)$ directly, by assuming a maximum entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:
$p(C_k|x) = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_m \lambda_m \phi_m(x, C_k)}$, where $Z_\lambda(x) = \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)}$

Objective: Minimise the negative log-likelihood:
$L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = -\sum_{(x,t)\in\mathcal{D}} \log p(t|x) = \sum_{(x,t)\in\mathcal{D}} \Big[\log Z_\lambda(x) - \sum_m \lambda_m \phi_m(x, t)\Big] = \sum_{(x,t)\in\mathcal{D}} \Big[\log \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)} - \sum_m \lambda_m \phi_m(x, t)\Big]$

Training: Gradient descent (or gradient ascent if maximising the objective):
$\lambda^{(n+1)} = \lambda^{(n)} - \eta \nabla L$, where $\eta$ is the step parameter.
$\nabla L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = \sum_{(x,t)\in\mathcal{D}} \mathbb{E}[\phi(x, \cdot)] - \phi(x, t)$
where $\mathbb{E}[\phi(x, \cdot)] = \sum_{C_k} \phi(x, C_k)\, p(C_k|x)$ and $\sum_{(x,t)\in\mathcal{D}} \phi(x, t)$ are the empirical counts.

Regularisation: Penalise large values of the $\lambda$ parameters by introducing a prior distribution over them (typically a Gaussian):
$L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \arg\min_\lambda \Big[-\log p(\lambda) - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big] = \arg\min_\lambda \Big[\sum_m \dfrac{(0 - \lambda_m)^2}{2\sigma^2} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big] = \arg\min_\lambda \Big[\sum_m \dfrac{\lambda_m^2}{2\sigma^2} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x)\Big]$
$\nabla L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \dfrac{\lambda}{\sigma^2} + \sum_{(x,t)\in\mathcal{D}} \mathbb{E}[\phi(x, \cdot)] - \phi(x, t)$

Complexity: $O(INMK)$, since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class conditional distribution in terms of a kernel $K(x, x')$, and use a non-linear kernel (for example $K(x, x') = (1 + w^T x)^2$). By the Representer Theorem:
$p(C_k|x) = \dfrac{1}{Z_\lambda(x)}\, e^{\lambda^T \phi(x, C_k)} = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{nk}\, \phi(x_n, C_i)^T \phi(x, C_k)} = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{nk}\, K((x_n, C_i), (x, C_k))} = \dfrac{1}{Z_\lambda(x)}\, e^{\sum_{n=1}^N \alpha_{nk} K(x_n, x)}$

Online learning: Online Gradient Descent: update the parameters using GD after seeing each training instance.
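As a concrete illustration of the k-nearest neighbour rule in the first row, here is a minimal NumPy sketch; the function name, the toy data and majority voting via np.unique are illustrative choices, not part of the original sheet.

```python
import numpy as np

def knn_predict(X_train, t_train, x_new, k=3):
    """Classify x_new with the most frequent label among the k nearest
    training points, using Euclidean distance (as in the Model column)."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))   # distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                         # majority vote (ties broken arbitrarily)

# Toy usage (illustrative only):
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
t = np.array([0, 0, 1, 1])
print(knn_predict(X, t, np.array([0.95, 1.05]), k=3))        # -> 1
```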
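The multinomial Naive Bayes estimates translate almost directly into code. The sketch below assumes a word-count matrix and uses the Dirichlet MAP estimate from the Regularisation cell with a symmetric prior $\alpha_i = 2$ (equivalent to add-one smoothing); these choices are assumptions for illustration, not prescribed by the sheet.

```python
import numpy as np

def train_multinomial_nb(X, t, num_classes, alpha=2.0):
    """X: (N, D) word-count matrix; t: (N,) class labels in {0..K-1}.
    Returns log class priors and log p(word_i|C_k) under a symmetric
    Dirichlet prior (the MAP estimate from the Regularisation cell)."""
    N, D = X.shape
    log_prior = np.log(np.bincount(t, minlength=num_classes) / N)
    log_lik = np.zeros((num_classes, D))
    for k in range(num_classes):
        counts = X[t == k].sum(axis=0)   # word counts within class k
        # ((alpha_i - 1) + counts_i) / (sum_d counts_d - D + sum_d alpha_d)
        log_lik[k] = np.log((alpha - 1 + counts) /
                            (counts.sum() - D + alpha * D))
    return log_prior, log_lik

def predict_nb(log_prior, log_lik, x):
    """arg max_k  sum_i x_i log p(word_i|C_k) + log p(C_k)."""
    return int(np.argmax(log_lik @ x + log_prior))
```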
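Below is a minimal sketch of gradient-descent training for the log-linear model, following the gradient $\sum (\mathbb{E}[\phi(x,\cdot)] - \phi(x,t))$ given in the Training cell. The block feature map `phi`, the averaged gradient and the fixed step size are illustrative assumptions.

```python
import numpy as np

def phi(x, k, num_classes):
    """Joint feature map: place x in the k-th block of a (K*D)-vector."""
    D = x.shape[0]
    out = np.zeros(num_classes * D)
    out[k * D:(k + 1) * D] = x
    return out

def softmax(scores):
    scores = scores - scores.max()   # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def train_loglinear(X, t, num_classes, eta=0.1, epochs=100):
    """X: (N, D) inputs; t: (N,) integer class labels. Gradient descent on the
    negative log-likelihood: grad = E[phi(x, .)] - phi(x, t), summed over data."""
    N, D = X.shape
    lam = np.zeros(num_classes * D)
    for _ in range(epochs):
        grad = np.zeros_like(lam)
        for x, t_n in zip(X, t):
            p = softmax(np.array([lam @ phi(x, k, num_classes)
                                  for k in range(num_classes)]))   # p(C_k|x)
            expected = sum(p[k] * phi(x, k, num_classes) for k in range(num_classes))
            grad += expected - phi(x, t_n, num_classes)
        lam -= eta * grad / N   # lambda <- lambda - eta * grad (averaged over the batch)
    return lam

def predict_loglinear(lam, x, num_classes):
    return int(np.argmax([lam @ phi(x, k, num_classes) for k in range(num_classes)]))
```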
Perceptron
----------
Description: Directly estimate the linear function $y(x)$ by iteratively updating the weight vector whenever a training instance is incorrectly classified.

Model: Binary, linear classifier:
$y(x) = \mathrm{sign}(w^T x)$, where $\mathrm{sign}(x) = +1$ if $x \ge 0$ and $-1$ if $x < 0$.
Multiclass perceptron:
$y(x) = \arg\max_{C_k} w^T \phi(x, C_k)$

Objective: Minimise the error function, i.e. the number of incorrectly classified input vectors:
$\arg\min_w E_P(w) = \arg\min_w -\sum_{n \in \mathcal{M}} w^T x_n t_n$
where $\mathcal{M}$ is the set of misclassified training vectors.

Training: Iterate over each training example $x_n$, and update the weight vector on misclassification:
$w^{(i+1)} = w^{(i)} - \eta \nabla E_P(w) = w^{(i)} + \eta\, x_n t_n$, where typically $\eta = 1$.
For the multiclass perceptron:
$w^{(i+1)} = w^{(i)} + \phi(x, t) - \phi(x, y(x))$

Regularisation: The Voted Perceptron: run the perceptron $i$ times and store each iteration's weight vector. Then:
$y(x) = \mathrm{sign}\Big(\sum_i c_i \times \mathrm{sign}(w_i^T x)\Big)$
where $c_i$ is the number of correctly classified training instances for $w_i$.

Complexity: $O(INML)$, since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel $K(x, x')$, and one weight per training instance:
$y(x) = \mathrm{sign}\Big(\sum_{n=1}^N w_n t_n K(x, x_n)\Big)$
with the update $w_n^{(i+1)} = w_n^{(i)} + 1$ on misclassification of $x_n$.

Online learning: The perceptron is an online algorithm by default.

Support vector machines
-----------------------
Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:
$y(x) = \sum_{n=1}^N \lambda_n t_n x^T x_n + w_0$

Objective:
Primal: $\arg\min_{w, w_0} \frac{1}{2}\|w\|^2$ s.t. $t_n(w^T x_n + w_0) \ge 1$ $\forall n$
Dual: $\tilde{L}(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m$ s.t. $\lambda_n \ge 0$ $\forall n$ and $\sum_{n=1}^N \lambda_n t_n = 0$

Training:
• Quadratic Programming (QP)
• SMO, Sequential Minimal Optimisation (chunking)

Regularisation: The soft-margin SVM: penalise a hyperplane by the number and distance of misclassified points.
Primal: $\arg\min_{w, w_0} \frac{1}{2}\|w\|^2 + C\sum_{n=1}^N \xi_n$ s.t. $t_n(w^T x_n + w_0) \ge 1 - \xi_n$ and $\xi_n \ge 0$ $\forall n$
Dual: $\tilde{L}(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m$ s.t. $0 \le \lambda_n \le C$ and $\sum_{n=1}^N \lambda_n t_n = 0$

Complexity:
• QP: $O(n^3)$
• SMO: much more efficient than QP, since the computation is based only on the support vectors.

Non-linear: Use a non-linear kernel $K(x, x')$:
$y(x) = \sum_{n=1}^N \lambda_n t_n K(x, x_n) + w_0$
$\tilde{L}(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m K(x_n, x_m)$

Online learning: Online SVM. See, for example:
• The Huller: A Simple and Efficient Online SVM, Bordes & Bottou (2005)
• Pegasos: Primal Estimated sub-Gradient Solver for SVM, Shalev-Shwartz et al. (2007)
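A minimal sketch of the binary perceptron update $w \leftarrow w + \eta\, x_n t_n$ from the Training cell above; the fixed number of epochs and the convergence check are illustrative assumptions.

```python
import numpy as np

def train_perceptron(X, t, epochs=100, eta=1.0):
    """X: (N, D) inputs; t: (N,) labels in {-1, +1}.
    Update w whenever x_n is misclassified: w <- w + eta * x_n * t_n."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:   # misclassified (a zero score is treated as an error here)
                w += eta * x_n * t_n
                errors += 1
        if errors == 0:                # converged: every training point correctly classified
            break
    return w

def predict_perceptron(w, x):
    return 1 if w @ x >= 0 else -1
```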
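The QP and SMO solvers named in the SVM Training cell are too long to sketch here; instead, the following is a rough sub-gradient sketch of the soft-margin objective in the spirit of Pegasos (Shalev-Shwartz et al., 2007), one of the papers cited in the Online learning cell. It omits the bias term $w_0$ and is an assumption-laden illustration, not the sheet's training procedure.

```python
import numpy as np

def pegasos_svm(X, t, lam=0.01, epochs=20, seed=0):
    """X: (N, D) inputs; t: (N,) labels in {-1, +1}. Stochastic sub-gradient
    descent on  lam/2 ||w||^2 + (1/N) sum_n max(0, 1 - t_n w^T x_n)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    step = 0
    for _ in range(epochs):
        for n in rng.permutation(N):
            step += 1
            eta = 1.0 / (lam * step)          # decaying step size
            if t[n] * (w @ X[n]) < 1:         # margin violated: hinge-loss sub-gradient
                w = (1 - eta * lam) * w + eta * t[n] * X[n]
            else:                             # only the regulariser contributes
                w = (1 - eta * lam) * w
    return w

def predict_svm(w, x):
    return 1 if w @ x >= 0 else -1
```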
k-means
-------
Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments $r_{nk} \in \{0, 1\}$ s.t. $\sum_k r_{nk} = 1$ $\forall n$, i.e. each data point is assigned to exactly one cluster $k$.
Geometric distance: the Euclidean distance ($\ell_2$ norm):
$\|x_n - \mu_k\|_2 = \sqrt{\sum_{i=1}^D (x_{ni} - \mu_{ki})^2}$

Objective:
$\arg\min_{r, \mu} \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|_2^2$
i.e. minimise the distance from each cluster centre to each of its points.

Training:
Expectation: $r_{nk} = 1$ if $\|x_n - \mu_k\|^2$ is minimal over $k$; $0$ otherwise.
Maximisation: $\mu^{(k)}_{\mathrm{MLE}} = \dfrac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$, where $\mu^{(k)}$ is the centroid of cluster $k$.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means, as suggested in: Kernel k-means, Spectral Clustering and Normalized Cuts, Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.

Mixture of Gaussians
--------------------
Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters are made by specifying the probabilities
$p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)})\, p(z^{(i)})$
with $z^{(i)} \sim \mathrm{Multinomial}(\gamma)$, and $\gamma_{nk} \equiv p(k|x_n)$ s.t. $\sum_{j=1}^K \gamma_{nj} = 1$. I.e. we want to maximise the probability of the observed data $x$.

Objective:
$L(x, \pi, \mu, \Sigma) = \log p(x|\pi, \mu, \Sigma) = \sum_{n=1}^N \log\Big[\sum_{k=1}^K \pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)\Big]$

Training (EM):
Expectation: for each $n, k$ set
$\gamma_{nk} = p(z^{(i)} = k\,|\,x^{(i)}; \gamma, \mu, \Sigma) \;(= p(k|x_n)) = \dfrac{p(x^{(i)}|z^{(i)} = k; \mu, \Sigma)\, p(z^{(i)} = k; \pi)}{\sum_{j=1}^K p(x^{(i)}|z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \pi)} = \dfrac{\pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(x_n|\mu_j, \Sigma_j)}$
Maximisation:
$\pi_k = \dfrac{1}{N}\sum_{n=1}^N \gamma_{nk}$
$\mu_k = \dfrac{\sum_{n=1}^N \gamma_{nk}\, x_n}{\sum_{n=1}^N \gamma_{nk}}$
$\Sigma_k = \dfrac{\sum_{n=1}^N \gamma_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N \gamma_{nk}}$

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, Neal & Hinton (1998).
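A minimal sketch of the k-means Expectation/Maximisation steps above; initialising the centroids at randomly chosen data points and running for a fixed number of iterations are illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """X: (N, D) data; returns centroids mu (K, D) and hard assignments (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # initialise centroids at random data points
    for _ in range(iters):
        # Expectation: assign each point to its closest centroid (r_nk)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        assign = dists.argmin(axis=1)
        # Maximisation: recompute each centroid as the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                # converged: centroids no longer move
            break
        mu = new_mu
    return mu, assign
```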
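Finally, a minimal sketch of the EM updates for the mixture of Gaussians; the initialisation scheme and the small diagonal term added to each covariance for numerical stability are assumptions, not part of the sheet.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of a multivariate Gaussian N(x | mu, cov)."""
    D = x.shape[0]
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / \
           np.sqrt(((2 * np.pi) ** D) * np.linalg.det(cov))

def em_gmm(X, K, iters=50, seed=0):
    """X: (N, D) data; returns mixing weights pi, means mu, covariances cov
    and responsibilities gamma (N, K)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]        # initialise means at random data points
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: gamma_nk = pi_k N(x_n|mu_k, cov_k) / sum_j pi_j N(x_n|mu_j, cov_j)
        gamma = np.zeros((N, K))
        for n in range(N):
            for k in range(K):
                gamma[n, k] = pi[k] * gaussian_pdf(X[n], mu[k], cov[k])
            gamma[n] /= gamma[n].sum()             # no log-sum-exp; fine for well-scaled toy data
        # M-step: re-estimate pi, mu, cov from the weighted counts
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, cov, gamma
```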