Cheat Sheet: Algorithms for Supervised and Unsupervised Learning
Each entry below lists, for one algorithm: a description, the model, the objective, training, regularisation, complexity, non-linear extensions, and online learning.

k-nearest neighbour

Description: The label of a new point x̂ is classified with the most frequent label t̂ of the k nearest training instances.

Model:
    t̂ = arg max_C Σ_{i : x_i ∈ N_k(x, x̂)} δ(t_i, C)
... where:
• N_k(x, x̂) ← the k points in x closest to x̂
• Euclidean distance formula: √( Σ_{i=1}^D (x_i − x̂_i)² )
• δ(a, b) ← 1 if a = b; 0 otherwise

Objective: No optimisation needed.

Training: Use cross-validation to learn the appropriate k; otherwise no training, classification based on existing points.

Regularisation: k acts to regularise the classifier: as k → N the boundary becomes smoother.

Complexity: O(NM) space complexity, since all training instances and all their features need to be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.
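To make the k-NN prediction rule above concrete, here is a minimal NumPy sketch; the brute-force distance computation and the names X_train, t_train are illustrative choices for the example.

import numpy as np

def knn_predict(X_train, t_train, x_new, k=3):
    """Classify x_new with the most frequent label among its k nearest training points."""
    # Euclidean distances from x_new to every training instance (brute force).
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest points, i.e. N_k(x, x_hat).
    nearest = np.argsort(dists)[:k]
    # Most frequent label t_hat among the neighbours.
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny usage example with two clusters of points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t = np.array([0, 0, 1, 1])
print(knn_predict(X, t, np.array([0.95, 0.9]), k=3))  # -> 1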
Naive Bayes

Description: Learn p(C_k|x) by modelling p(x|C_k) and p(C_k), using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, ergo 'Naive'.

Model:
    y(x) = arg max_k p(C_k|x)
         = arg max_k p(x|C_k) × p(C_k)
         = arg max_k Π_{i=1}^D p(x_i|C_k) × p(C_k)
         = arg max_k Σ_{i=1}^D log p(x_i|C_k) + log p(C_k)
... with one of the following class-conditional likelihoods:
• Multivariate likelihood: p(x|C_k) = Π_{i=1}^D p(x_i|C_k)
• Multinomial likelihood: p(x|C_k) = Π_{i=1}^D p(word_i|C_k)^{x_i}
• Gaussian likelihood: p(x|C_k) = Π_{i=1}^D N(v; µ_ik, σ_ik)

Objective: No optimisation needed.

Training: Maximum likelihood estimates are relative frequency counts:
• Multivariate likelihood:
    p_MLE(x_i = v|C_k) = Σ_{j=1}^N δ(t_j = C_k ∧ x_ji = v) / Σ_{j=1}^N δ(t_j = C_k)
• Multinomial likelihood:
    p_MLE(word_i = v|C_k) = Σ_{j=1}^N δ(t_j = C_k) × x_ji / Σ_{j=1}^N Σ_{d=1}^D δ(t_j = C_k) × x_di
... where x_ji is the count of word i in training example j, and x_di is the count of feature d in training example j.

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate:
• Multivariate likelihood:
    p_MAP(x_i = v|C_k) = ( (β_i − 1) + Σ_{j=1}^N δ(t_j = C_k ∧ x_ji = v) ) / ( |x_i|(β_i − 1) + Σ_{j=1}^N δ(t_j = C_k) )
• Multinomial likelihood:
    p_MAP(word_i = v|C_k) = ( (α_i − 1) + Σ_{j=1}^N δ(t_j = C_k) × x_ji ) / ( Σ_{j=1}^N Σ_{d=1}^D (δ(t_j = C_k) × x_di) − D + Σ_{d=1}^D α_d )

Complexity: O(NM), each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.
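A minimal sketch of multinomial Naive Bayes with Dirichlet (add-α) smoothing, matching the counting estimates above; the fixed α and the dense count-matrix representation are assumptions made for the example.

import numpy as np

def train_multinomial_nb(X, t, n_classes, alpha=1.0):
    """X: (N, D) word-count matrix, t: (N,) class labels. Returns log priors and log likelihoods."""
    N, D = X.shape
    log_prior = np.zeros(n_classes)
    log_lik = np.zeros((n_classes, D))
    for k in range(n_classes):
        X_k = X[t == k]
        log_prior[k] = np.log(len(X_k) / N)          # log p(C_k)
        counts = X_k.sum(axis=0) + alpha             # Dirichlet-smoothed word counts
        log_lik[k] = np.log(counts / counts.sum())   # log p(word_i | C_k)
    return log_prior, log_lik

def predict_nb(x, log_prior, log_lik):
    """arg max_k  sum_i x_i * log p(word_i|C_k) + log p(C_k)."""
    return int(np.argmax(log_lik @ x + log_prior))

# Toy corpus: two classes, three-word vocabulary.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 1], [1, 3, 2]])
t = np.array([0, 0, 1, 1])
lp, ll = train_multinomial_nb(X, t, n_classes=2)
print(predict_nb(np.array([2, 0, 1]), lp, ll))  # -> 0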
Log-linear

Description: Estimate p(C_k|x) directly, by assuming a maximum entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:
    y(x) = arg max_k Σ_m λ_m φ_m(x, C_k)
... where:
    p(C_k|x) = (1/Z_λ(x)) e^{Σ_m λ_m φ_m(x, C_k)}
    Z_λ(x) = Σ_k e^{Σ_m λ_m φ_m(x, C_k)}

Objective: Minimise the negative log-likelihood:
    L_MLE(λ, D) = − Σ_{(x,t)∈D} log p(t|x)
                = Σ_{(x,t)∈D} [ log Z_λ(x) − Σ_m λ_m φ_m(x, t) ]
                = Σ_{(x,t)∈D} [ log Σ_k e^{Σ_m λ_m φ_m(x, C_k)} − Σ_m λ_m φ_m(x, t) ]

Training: Gradient descent (or gradient ascent if maximising the objective):
    λ^{n+1} = λ^n − η ∇L
... where η is the step parameter, and:
    ∇L_MLE(λ, D) = Σ_{(x,t)∈D} ( E[φ(x, ·)] − φ(x, t) )
    E[φ(x, ·)] = Σ_k φ(x, C_k) p(C_k|x)
... where Σ_{(x,t)∈D} φ(x, t) are the empirical counts.

Regularisation: Penalise large values of the λ parameters by introducing a prior distribution over them (typically a Gaussian). Objective function:
    L_MAP(λ, D, σ) = arg min_λ [ − log p(λ) − Σ_{(x,t)∈D} log p(t|x) ]
                   = arg min_λ [ − log e^{−(0−λ)²/(2σ²)} − Σ_{(x,t)∈D} log p(t|x) ]
                   = arg min_λ [ Σ_m λ_m²/(2σ²) − Σ_{(x,t)∈D} log p(t|x) ]
... with gradient:
    ∇L_MAP(λ, D, σ) = λ/σ² + Σ_{(x,t)∈D} ( E[φ(x, ·)] − φ(x, t) )

Complexity: O(INMK), since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class conditional distribution in terms of a kernel K(x, x′), and use a non-linear kernel (for example K(x, x′) = (1 + w^T x)²). By the Representer Theorem:
    p(C_k|x) = (1/Z_λ(x)) e^{λ^T φ(x, C_k)}
             = (1/Z_λ(x)) e^{Σ_{n=1}^N Σ_{i=1}^K α_{nk} φ(x_n, C_i)^T φ(x, C_k)}
             = (1/Z_λ(x)) e^{Σ_{n=1}^N Σ_{i=1}^K α_{nk} K((x_n, C_i), (x, C_k))}
             = (1/Z_λ(x)) e^{Σ_{n=1}^N α_{nk} K(x_n, x)}

Online learning: Online Gradient Descent: update the parameters using GD after seeing each training instance.
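A minimal sketch of the gradient-descent training above, instantiated as multiclass logistic regression where φ(x, C_k) simply places x in class k's block of weights; that choice of feature map, the fixed step size and the Gaussian-prior variance are assumptions made for the example.

import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def train_loglinear(X, t, n_classes, eta=0.1, epochs=200, sigma2=10.0):
    """Batch gradient descent on the MAP objective: Gaussian prior + negative log-likelihood."""
    N, D = X.shape
    W = np.zeros((n_classes, D))         # lambda parameters, one row per class
    for _ in range(epochs):
        grad = W / sigma2                # gradient of the Gaussian prior term, lambda / sigma^2
        for x, target in zip(X, t):
            p = softmax(W @ x)           # p(C_k | x)
            # E[phi(x, .)] - phi(x, t): expected minus empirical feature counts.
            grad += np.outer(p, x)
            grad[target] -= x
        W -= eta * grad / N
    return W

def predict(W, x):
    return int(np.argmax(W @ x))         # arg max_k sum_m lambda_m phi_m(x, C_k)

# Toy data: two blobs with a bias feature appended.
X = np.array([[1.0, 0.1, 1], [0.9, 0.3, 1], [0.1, 1.0, 1], [0.2, 0.8, 1]])
t = np.array([0, 0, 1, 1])
W = train_loglinear(X, t, n_classes=2)
print(predict(W, np.array([0.95, 0.2, 1])))  # -> 0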
Perceptron

Description: Directly estimate the linear function y(x) by iteratively updating the weight vector when incorrectly classifying a training instance.

Model: Binary, linear classifier:
    y(x) = sign(w^T x)
... where:
    sign(x) = +1 if x ≥ 0; −1 if x < 0
Multiclass perceptron:
    y(x) = arg max_{C_k} w^T φ(x, C_k)

Objective: Tries to minimise the error function, i.e. the number of incorrectly classified input vectors:
    arg min_w E_P(w) = arg min_w − Σ_{n∈M} w^T x_n t_n
... where M is the set of misclassified training vectors.

Training: Iterate over each training example x_n, and update the weight vector on misclassification:
    w^{i+1} = w^i + η ∇E_P(w) = w^i + η x_n t_n
... where typically η = 1.
For the multiclass perceptron:
    w^{i+1} = w^i + φ(x, t) − φ(x, y(x))

Regularisation: The Voted Perceptron: run the perceptron i times and store each iteration's weight vector. Then:
    y(x) = sign( Σ_i c_i × sign(w_i^T x) )
... where c_i is the number of correctly classified training instances for w_i.

Complexity: O(INML), since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel K(x, x′), and one weight per training instance:
    y(x) = sign( Σ_{n=1}^N w_n t_n K(x, x_n) )
... and the update: w_n^{i+1} = w_n^i + 1

Online learning: The perceptron is an online algorithm by default.
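A minimal sketch of the binary perceptron update above, with labels in {−1, +1} and η = 1; the epoch cap and the convergence check are illustrative assumptions.

import numpy as np

def train_perceptron(X, t, epochs=100, eta=1.0):
    """Iterate over examples and update w whenever sign(w.x) misclassifies, i.e. t_n * w.x_n <= 0."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:     # misclassified (or on the boundary)
                w = w + eta * x_n * t_n  # w^{i+1} = w^i + eta * x_n * t_n
                mistakes += 1
        if mistakes == 0:                # converged: all training points classified correctly
            break
    return w

# Linearly separable toy data with a bias feature appended as the last column.
X = np.array([[2.0, 1.0, 1], [1.5, 2.0, 1], [-1.0, -1.5, 1], [-2.0, -1.0, 1]])
t = np.array([1, 1, -1, -1])
w = train_perceptron(X, t)
print(np.sign(X @ w))                    # -> [ 1.  1. -1. -1.]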
Support vector machines

Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:
    y(x) = Σ_{n=1}^N λ_n t_n x^T x_n + w_0

Objective:
Primal:
    arg min_{w, w_0} (1/2)||w||²
    s.t. t_n(w^T x_n + w_0) ≥ 1, ∀n
Dual:
    L̃(Λ) = Σ_{n=1}^N λ_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N λ_n λ_m t_n t_m x_n^T x_m
    s.t. λ_n ≥ 0, Σ_{n=1}^N λ_n t_n = 0, ∀n

Training:
• Quadratic Programming (QP)
• SMO, Sequential Minimal Optimisation (chunking)

Regularisation: The soft-margin SVM: penalise a hyperplane by the number and distance of misclassified points.
Primal:
    arg min_{w, w_0} (1/2)||w||² + C Σ_{n=1}^N ξ_n
    s.t. t_n(w^T x_n + w_0) ≥ 1 − ξ_n, ξ_n > 0, ∀n
Dual:
    L̃(Λ) = Σ_{n=1}^N λ_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N λ_n λ_m t_n t_m x_n^T x_m
    s.t. 0 ≤ λ_n ≤ C, Σ_{n=1}^N λ_n t_n = 0, ∀n

Complexity:
• QP: O(n³)
• SMO: much more efficient than QP, since computation is based only on support vectors.

Non-linear: Use a non-linear kernel K(x, x′):
    y(x) = Σ_{n=1}^N λ_n t_n K(x, x_n) + w_0
    L̃(Λ) = Σ_{n=1}^N λ_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N λ_n λ_m t_n t_m K(x_n, x_m)

Online learning: Online SVM. See, for example:
• The Huller: A Simple and Efficient Online SVM, Bordes & Bottou (2005)
• Pegasos: Primal Estimated sub-Gradient Solver for SVM, Shalev-Shwartz et al. (2007)
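QP and SMO are the standard trainers listed above; as a compact illustration in the spirit of the Pegasos reference under Online learning, here is a hedged sketch of sub-gradient descent on the soft-margin primal objective written in hinge-loss form. The decaying step-size schedule and epoch count are assumptions made for the example.

import numpy as np

def train_linear_svm(X, t, C=1.0, epochs=200):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum_n max(0, 1 - t_n (w.x_n + w0))."""
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for epoch in range(1, epochs + 1):
        eta = 1.0 / epoch                          # decaying step size (illustrative schedule)
        grad_w, grad_w0 = w.copy(), 0.0            # gradient of the (1/2)||w||^2 term
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n + w0) < 1:           # margin violated: hinge term is active
                grad_w -= C * t_n * x_n
                grad_w0 -= C * t_n
        w -= eta * grad_w / N
        w0 -= eta * grad_w0 / N
    return w, w0

# Toy separable data, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
t = np.array([1, 1, -1, -1])
w, w0 = train_linear_svm(X, t)
print(np.sign(X @ w + w0))                         # -> [ 1.  1. -1. -1.]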
k-means

Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments r_nk ∈ {0, 1} s.t. Σ_k r_nk = 1 ∀n, i.e. each data point is assigned to exactly one cluster k.
Geometric distance: the Euclidean distance (l2 norm):
    ||x_n − µ_k||_2 = √( Σ_{i=1}^D (x_ni − µ_ki)² )

Objective:
    arg min_{r, µ} Σ_{n=1}^N Σ_{k=1}^K r_nk ||x_n − µ_k||_2^2
... i.e. minimise the distance from each cluster centre to each of its points.

Training:
Expectation:
    r_nk = 1 if ||x_n − µ_k||_2 is minimal for k; 0 otherwise
Maximisation:
    µ^(k)_MLE = Σ_n r_nk x_n / Σ_n r_nk
... where µ^(k) is the centroid of cluster k.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means as suggested in: Kernel k-means, Spectral Clustering and Normalized Cuts, Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.
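A minimal NumPy sketch of the Expectation/Maximisation alternation above; initialising centroids from randomly chosen data points and the fixed iteration cap are assumptions made for the example.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate hard assignments (E-step) and centroid updates (M-step)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # initial centroids from the data
    for _ in range(n_iters):
        # E-step: assign each point to its closest centroid (r_nk = 1 for the arg min k).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = dists.argmin(axis=1)
        # M-step: each centroid becomes the mean of its assigned points (keep old centroid if empty).
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):                           # converged
            break
        mu = new_mu
    return mu, r

# Two well-separated blobs.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
mu, r = kmeans(X, K=2)
print(r)   # e.g. [0 0 1 1] (cluster indices are arbitrary)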
Mixture of Gaussians

Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters are made by specifying probabilities
    p(x^(i), z^(i)) = p(x^(i)|z^(i)) p(z^(i))
... with z^(i) ∼ Multinomial(γ), and γ_nk ≡ p(k|x_n) s.t. Σ_{j=1}^K γ_nj = 1. I.e. we want to maximise the probability of the observed data x.

Objective:
    L(x, π, µ, Σ) = log p(x|π, µ, Σ) = Σ_{n=1}^N log( Σ_{k=1}^K π_k N(x_n|µ_k, Σ_k) )

Training:
Expectation: for each n, k set
    γ_nk = p(z^(i) = k | x^(i); γ, µ, Σ)   (= p(k|x_n))
         = p(x^(i)|z^(i) = k; µ, Σ) p(z^(i) = k; π) / Σ_{l=1}^K p(x^(i)|z^(i) = l; µ, Σ) p(z^(i) = l; π)
         = π_k N(x_n|µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n|µ_j, Σ_j)
Maximisation:
    π_k = (1/N) Σ_{n=1}^N γ_nk
    µ_k = Σ_{n=1}^N γ_nk x_n / Σ_{n=1}^N γ_nk
    Σ_k = Σ_{n=1}^N γ_nk (x_n − µ_k)(x_n − µ_k)^T / Σ_{n=1}^N γ_nk

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, Neal & Hinton (1998).
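A minimal sketch of the EM updates above, using SciPy's multivariate normal density; the random initialisation, the fixed iteration count and the small covariance regulariser are assumptions made for the example.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=50, seed=0):
    """EM for a mixture of Gaussians: E-step computes gamma_nk, M-step updates pi, mu, Sigma."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: gamma_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        gamma = np.array([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)]).T
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of pi_k, mu_k, Sigma_k.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma

# Two separated blobs.
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(4, 0.3, (20, 2))])
pi, mu, Sigma = gmm_em(X, K=2)
print(np.round(mu, 1))   # roughly [[0, 0], [4, 4]] (order arbitrary)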
Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.