
Machine Learning Cheat Sheet

Basics
Fundamental Assumption
Data is iid for unknown P: (x_i, y_i) ∼ P(X, Y)
True risk and estimated error
True risk: R(w) = ∫ P(x,y) (y − w^T x)^2 dx dy = E_{x,y}[(y − w^T x)^2]
Est. error: R̂_D(w) = (1/|D|) Σ_{(x,y)∈D} (y − w^T x)^2
Standardization
Centered data with unit variance: x̃_i = (x_i − µ̂)/σ̂
µ̂ = (1/n) Σ_{i=1}^n x_i,  σ̂^2 = (1/n) Σ_{i=1}^n (x_i − µ̂)^2
Cross-Validation
For all models m, for all i ∈ {1,...,k} do:
1. Split data: D = D_train^(i) ⊎ D_test^(i) (Monte Carlo or k-Fold)
2. Train model: ŵ_{i,m} = argmin_w R̂_train^(i)(w)
3. Estimate error: R̂_m^(i) = R̂_test^(i)(ŵ_{i,m})
Select best model: m̂ = argmin_m (1/k) Σ_{i=1}^k R̂_m^(i)
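A minimal k-fold CV sketch of the procedure above in NumPy; the ridge-regression candidates and the squared-error metric are illustrative assumptions, not prescribed by the sheet:
```python
import numpy as np

def kfold_cv(X, y, lambdas, k=5, seed=0):
    """Select a ridge penalty by k-fold cross-validation (illustrative sketch)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    avg_err = []
    for lam in lambdas:                          # candidate models m
        errs = []
        for i in range(k):                       # folds i = 1..k
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            Xtr, ytr, Xte, yte = X[train], y[train], X[test], y[test]
            # train: ridge closed form w* = (X^T X + lam I)^-1 X^T y
            w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
            errs.append(np.mean((yte - Xte @ w) ** 2))   # R̂_test^(i)
        avg_err.append(np.mean(errs))            # (1/k) Σ_i R̂_m^(i)
    return lambdas[int(np.argmin(avg_err))]      # m̂ = argmin_m ...
```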
Parametric vs. Nonparametric models
Parametric: have a finite set of parameters, e.g. linear regression, linear perceptron
Nonparametric: grow in complexity with the size of the data, more expressive, e.g. k-NN
Gradient Descent
1. Pick arbitrary w_0 ∈ R^d
2. w_{t+1} = w_t − η_t ∇R̂(w_t)
Stochastic Gradient Descent (SGD)
1. Pick arbitrary w_0 ∈ R^d
2. w_{t+1} = w_t − η_t ∇_w l(w_t; x', y'), with u.a.r. data point (x', y') ∈ D
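A sketch of both update rules applied to the squared loss from the regression sections below; the constant step size and the iteration counts are arbitrary placeholder choices:
```python
import numpy as np

def gradient_descent(X, y, eta=1e-3, steps=1000):
    """Full-batch GD on R̂(w) = Σ_i (y_i − w^T x_i)^2 (sketch)."""
    w = np.zeros(X.shape[1])                  # arbitrary w_0
    for _ in range(steps):
        grad = -2 * X.T @ (y - X @ w)         # ∇R̂(w_t)
        w = w - eta * grad                    # w_{t+1} = w_t − η_t ∇R̂(w_t)
    return w

def sgd(X, y, eta=1e-3, steps=10000, seed=0):
    """SGD: one uniformly sampled data point per update (sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))              # (x', y') drawn u.a.r. from D
        grad = -2 * (y[i] - X[i] @ w) * X[i]  # ∇_w l(w_t; x', y')
        w = w - eta * grad
    return w
```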
Regression
Solve w* = argmin_w R̂(w) + λC(w)
Linear Regression
R̂(w) = Σ_{i=1}^n (y_i − w^T x_i)^2 = ||Xw − y||_2^2
∇_w R̂(w) = −2 Σ_{i=1}^n (y_i − w^T x_i) · x_i
w* = (X^T X)^{-1} X^T y
Ridge regression
R̂(w) = Σ_{i=1}^n (y_i − w^T x_i)^2 + λ||w||_2^2
∇_w R̂(w) = −2 Σ_{i=1}^n (y_i − w^T x_i) · x_i + 2λw
w* = (X^T X + λI)^{-1} X^T y
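The two closed forms above as a short NumPy sketch (assumes X^T X, resp. X^T X + λI, is invertible):
```python
import numpy as np

def linear_regression(X, y):
    """w* = (X^T X)^{-1} X^T y (sketch; assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_regression(X, y, lam):
    """w* = (X^T X + λI)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```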
L1-regularized regression (Lasso)
R̂(w) = Σ_{i=1}^n (y_i − w^T x_i)^2 + λ||w||_1

Classification
Solve w* = argmin_w Σ_i l(w; x_i, y_i); loss function l
0/1 loss
l_{0/1}(w; y_i, x_i) = 1 if y_i ≠ sign(w^T x_i), else 0
Perceptron algorithm
Use l_P(w; y_i, x_i) = max(0, −y_i w^T x_i) and SGD
∇_w l_P(w; y_i, x_i) = 0 if y_i w^T x_i ≥ 0, −y_i x_i otherwise
Data lin. separable ⇔ obtains a lin. separator (not necessarily optimal)
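A perceptron training sketch via SGD on l_P as described above; labels are assumed to be in {−1, +1} and the learning rate / epoch count are illustrative:
```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=100, seed=0):
    """SGD on l_P(w; y_i, x_i) = max(0, −y_i w^T x_i); y_i ∈ {−1, +1} (sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            if y[i] * (X[i] @ w) <= 0:   # mispredicted ⇒ gradient is −y_i x_i
                w += eta * y[i] * X[i]   # step towards the mispredicted point
    return w
```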
Support Vector Machine (SVM)
Hinge loss: l_H(w; x_i, y_i) = max(0, 1 − y_i w^T x_i)
∇_w l_H(w; y, x) = 0 if y w^T x ≥ 1, −y x otherwise
w* = argmin_w Σ_{i=1}^n l_H(w; x_i, y_i) + λ||w||_2^2
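A sketch of minimizing the regularized hinge loss with plain SGD; the constant step size and the per-sample split of the regularizer are assumptions of this sketch, not the sheet's prescription:
```python
import numpy as np

def svm_sgd(X, y, lam=0.1, eta=0.01, epochs=200, seed=0):
    """SGD on Σ_i max(0, 1 − y_i w^T x_i) + λ||w||_2^2; y_i ∈ {−1, +1} (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            grad = 2 * lam * w / n           # regularizer, split evenly over samples
            if y[i] * (X[i] @ w) < 1:        # hinge active ⇒ subgradient −y_i x_i
                grad -= y[i] * X[i]
            w -= eta * grad
    return w
```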
k-NN
y = sign(Σ_{i=1}^n y_i [x_i among k nearest neighbours of x]) – no weights ⇒ no training! But depends on all data :(
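A k-NN prediction sketch that evaluates the vote above directly; Euclidean distance is an assumed choice:
```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Predict the sign of the vote of the k nearest neighbours; y ∈ {−1, +1} (sketch)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all points
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbours
    return np.sign(y_train[nearest].sum())        # majority vote
```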
Neural networks
Parameterize feature map with θ: φ(x,θ) = ϕ(θ^T x) = ϕ(z) (activation function ϕ)
⇒ w* = argmin_{w,θ} Σ_{i=1}^n l(y_i; Σ_{j=1}^m w_j φ(x_i, θ_j))
f(x; w, θ) = Σ_{j=1}^m w_j ϕ(θ_j^T x) = w^T ϕ(Θx)
Activation functions
Sigmoid: ϕ(z) = 1/(1+exp(−z)), ϕ'(z) = (1−ϕ(z))·ϕ(z)
tanh: ϕ(z) = tanh(z) = (exp(z)−exp(−z))/(exp(z)+exp(−z))
ReLU: ϕ(z) = max(z, 0)
Predict: forward propagation
v^(0) = x; for l = 1,...,L−1: v^(l) = ϕ(z^(l)), z^(l) = W^(l) v^(l−1)
f = W^(L) v^(L−1)
Predict f for regression, sign(f) for class.
Compute gradient: backpropagation
Output layer: δ_j = l_j'(f_j), ∂/∂w_{j,i} = δ_j v_i
Hidden layer l = L−1,...,1: δ_j = ϕ'(z_j) · Σ_{i∈Layer_{l+1}} w_{i,j} δ_i, ∂/∂w_{j,i} = δ_j v_i
Learning with momentum
a ← m·a + η_t ∇_W l(W; y, x); W_{t+1} ← W_t − a
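A one-hidden-layer sketch of forward propagation and backpropagation for the squared loss; the sigmoid activation, layer sizes and loss are assumptions made for this example:
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(W1, W2, x, y):
    """One-hidden-layer net f = W2 ϕ(W1 x), squared loss l = ||f − y||^2 (sketch).
    Returns the loss and the gradients dl/dW1, dl/dW2."""
    # forward propagation
    z1 = W1 @ x                              # z^(1) = W^(1) v^(0)
    v1 = sigmoid(z1)                         # v^(1) = ϕ(z^(1))
    f = W2 @ v1                              # output layer (linear)
    loss = np.sum((f - y) ** 2)
    # backpropagation
    delta2 = 2 * (f - y)                     # output layer: δ = l'(f)
    dW2 = np.outer(delta2, v1)               # ∂l/∂w_{j,i} = δ_j v_i
    delta1 = (1 - v1) * v1 * (W2.T @ delta2) # hidden layer: δ = ϕ'(z) · Σ w δ
    dW1 = np.outer(delta1, x)
    return loss, dW1, dW2
```
A momentum step would then maintain a ← m·a + η·dW and set W ← W − a, as in the update rule above.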
Kernels
Efficient, implicit inner products.
Properties of kernels
k: X × X → R, k must be some inner product (symmetric, positive-definite, linear) for some Eucl. space V, i.e. k(x,x') = ⟨ϕ(x), ϕ(x')⟩_V = ϕ(x)^T ϕ(x') and k(x,x') = k(x',x)
Kernel matrix
K = [ k(x_1,x_1) ... k(x_1,x_n) ]
    [     ...    ...     ...    ]
    [ k(x_n,x_1) ... k(x_n,x_n) ]
Positive semi-definite matrices ⇔ kernels k
Important kernels
Linear: k(x,y) = x^T y
Polynomial: k(x,y) = (x^T y + 1)^d
Gaussian: k(x,y) = exp(−||x−y||_2^2 / (2h^2))
Laplacian: k(x,y) = exp(−||x−y||_1 / h)
Composition rules
Valid kernels k_1, k_2, also valid kernels: k_1 + k_2; k_1 · k_2; c·k_1, c > 0; f(k_1) if f polynomial with pos. coeffs. or exponential

Reformulating the perceptron
Ansatz: w* ∈ span(X) ⇒ w = Σ_{j=1}^n α_j y_j x_j
α* = argmin_{α∈R^n} Σ_{i=1}^n max(0, −Σ_{j=1}^n α_j y_i y_j x_i^T x_j)
Kernelized perceptron and SVM
Use α^T k_i instead of w^T x_i, use α^T D_y K D_y α instead of ||w||_2^2
k_i = [y_1 k(x_i,x_1), ..., y_n k(x_i,x_n)], D_y = diag(y)
Prediction: ŷ = sign(Σ_{i=1}^n α_i y_i k(x_i, x̂))
SGD update: α_{t+1} = α_t, if mispredicted: α_{t+1,i} = α_{t,i} + η_t (c.f. updating weights towards the mispredicted point)
Kernelized linear regression (KLR)
Ansatz: w* = Σ_{i=1}^n α_i x_i
α* = argmin_α ||α^T K − y||_2^2 + λ α^T K α = (K + λI)^{-1} y
Prediction: ŷ = Σ_{i=1}^n α_i k(x_i, x̂)
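A sketch of building the Gaussian kernel matrix and fitting KLR with the closed form α = (K + λI)^{-1} y from above; the bandwidth h and λ are placeholders:
```python
import numpy as np

def gaussian_kernel_matrix(X, Z, h=1.0):
    """K[i, j] = exp(−||x_i − z_j||_2^2 / (2 h^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * h ** 2))

def klr_fit(X, y, lam=0.1, h=1.0):
    """Kernelized linear regression: α = (K + λI)^{-1} y (sketch)."""
    K = gaussian_kernel_matrix(X, X, h)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def klr_predict(alpha, X_train, X_new, h=1.0):
    """ŷ = Σ_i α_i k(x_i, x̂) for each new point."""
    return gaussian_kernel_matrix(X_new, X_train, h) @ alpha
```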
Imbalance
up-/downsampling
Cost-Sensitive Classification
Scale loss by cost: l_CS(w; x, y) = c_± l(w; x, y)
Metrics
n = n_+ + n_−, n_+ = TP + FN, n_− = TN + FP
Accuracy: (TP + TN)/n, Precision: TP/(TP + FP)
Recall/TPR: TP/n_+, FPR: FP/n_−
F1 score: 2TP/(2TP + FP + FN) = 2/(1/prec + 1/rec)
ROC Curve: y = TPR, x = FPR
Multi-class
one-vs-all (c), one-vs-one (c(c−1)/2), encoding
Multi-class Hinge loss
l_{MC-H}(w^(1),...,w^(c); x, y) = max(0, 1 + max_{j∈{1,...,y−1,y+1,...,c}} (w^(j))^T x − (w^(y))^T x)

Clustering
k-means
R̂(µ) = Σ_{i=1}^n min_{j∈{1,...,k}} ||x_i − µ_j||_2^2
µ̂ = argmin_µ R̂(µ) ... non-convex, NP-hard
Algorithm (Lloyd's heuristic): Choose starting centers, assign points to closest center, update centers to mean of each cluster, repeat
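Lloyd's heuristic as sketched above; initializing the centers from random data points and running a fixed number of iterations are assumptions:
```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's heuristic: assign to closest center, recompute means, repeat (sketch)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]         # starting centers
    for _ in range(iters):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances
        z = d.argmin(axis=1)                                   # closest center per point
        mu = np.array([X[z == j].mean(axis=0) if np.any(z == j) else mu[j]
                       for j in range(k)])                     # update centers
    return mu, z
```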
Dimension reduction
PCA
D = {x_1,...,x_n} ⊂ R^d, Σ = (1/n) Σ_{i=1}^n x_i x_i^T, µ = 0
(W, z_1,...,z_n) = argmin Σ_{i=1}^n ||W z_i − x_i||_2^2, W = (v_1|...|v_k) ∈ R^{d×k} orthogonal; z_i = W^T x_i
v_i are the eigenvectors of Σ
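A PCA sketch via the eigendecomposition of the empirical covariance; the data are centered explicitly since the formulas above assume µ = 0:
```python
import numpy as np

def pca(X, k):
    """Project onto the top-k eigenvectors of Σ = (1/n) Σ_i x_i x_i^T (sketch)."""
    Xc = X - X.mean(axis=0)                    # enforce µ = 0
    Sigma = Xc.T @ Xc / len(Xc)
    lams, V = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
    W = V[:, np.argsort(lams)[::-1][:k]]       # W = (v_1 | ... | v_k)
    Z = Xc @ W                                 # z_i = W^T x_i
    return W, Z
```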
Kernel PCA
Kernel PC: α^(1),...,α^(k) ∈ R^n, α^(i) = (1/√λ_i) v_i,
K = Σ_{i=1}^n λ_i v_i v_i^T, λ_1 ≥ ... ≥ λ_n ≥ 0
New point: ẑ_i = f_i(x̂) = Σ_{j=1}^n α_j^(i) k(x̂, x_j)
Autoencoders
Find identity function: x ≈ f(x; θ)
f(x; θ) = f_decode(f_encode(x; θ_encode); θ_decode)

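A short usage note on the kernel PCA projection above as code; the Gaussian kernel and its bandwidth are assumed choices (the kernel matrix helper from the Kernels section is reused):
```python
import numpy as np

def kernel_pca(K, k):
    """Kernel PCA components α^(i) = v_i / sqrt(λ_i) from the kernel matrix K (sketch)."""
    lams, V = np.linalg.eigh(K)                          # ascending eigenvalues
    order = np.argsort(lams)[::-1][:k]                   # top-k λ_1 ≥ ... ≥ λ_k
    return V[:, order] / np.sqrt(lams[order])            # columns are α^(1),...,α^(k)

def kernel_pca_project(alphas, k_new):
    """ẑ_i = Σ_j α_j^(i) k(x̂, x_j), where k_new[j] = k(x̂, x_j)."""
    return k_new @ alphas
```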
Probability modeling
Find h: X → Y that min. pred. error: R(h) = ∫ P(x,y) l(y; h(x)) dx dy = E_{x,y}[l(y; h(x))]
For least squares regression
Best h: h*(x) = E[Y | X = x]
Pred.: ŷ = Ê[Y | X = x̂] = ∫ P̂(y | X = x̂) y dy

Maximum Likelihood Estimation (MLE)
θ* = argmax_θ P̂(y_1,...,y_n | x_1,...,x_n, θ)
E.g. lin. + Gauss: y_i = w^T x_i + ε_i, ε_i ∼ N(0, σ^2), i.e. y_i ∼ N(w^T x_i, σ^2)
With MLE (use argmin −log): w* = argmin_w Σ_i (y_i − w^T x_i)^2
Bias/Variance/Noise
Prediction error = Bias^2 + Variance + Noise
Maximum a posteriori estimate (MAP)
Assume bias on parameters, e.g. w_i ∼ N(0, β^2)
Bayes: P(w | x, y) = P(w|x) P(y|x,w) / P(y|x) = P(w) P(y|x,w) / P(y|x)
Logistic regression
Link func.: σ(w^T x) = 1/(1+exp(−w^T x)) (Sigmoid)
P(y|x,w) = Ber(y; σ(w^T x)) = 1/(1+exp(−y w^T x))
Classification: Use P(y|x,w), predict most likely class label.
MLE: argmax_w P(y_{1:n} | w, x_{1:n})
⇒ w* = argmin_w Σ_{i=1}^n log(1+exp(−y_i w^T x_i))
SGD update: w = w + η_t y x P̂(Y=−y | w, x), where P̂(Y=−y | w, x) = 1/(1+exp(y w^T x))
MAP: Gauss. prior ⇒ ||w||_2^2, Lap. prior ⇒ ||w||_1
SGD: w = w(1−2λη_t) + η_t y x P̂(Y=−y | w, x)
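The SGD update above as a sketch; labels in {−1, +1} and a constant step size are assumptions (set lam > 0 for the regularized/MAP variant):
```python
import numpy as np

def logistic_sgd(X, y, eta=0.1, lam=0.0, epochs=100, seed=0):
    """SGD for logistic regression: w ← w(1 − 2λη) + η y x P̂(Y=−y | w, x) (sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            p_wrong = 1.0 / (1.0 + np.exp(y[i] * (X[i] @ w)))  # P̂(Y = −y_i | w, x_i)
            w = w * (1 - 2 * lam * eta) + eta * y[i] * X[i] * p_wrong
    return w
```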
Bayesian decision theory
- Conditional distribution over labels P(y|x)
- Set of actions A
- Cost function C: Y × A → R
a* = argmin_{a∈A} E[C(y,a) | x]
Calculate E via sum/integral.
Classification: C(y,a) = [y ≠ a]; asymmetric:
C(y,a) = c_FP if y = −1, a = +1; c_FN if y = +1, a = −1; 0 otherwise
E.g. y ∈ {−1,+1}: predict + if c_+ < c_−, where c_+ = E[C(y,+1) | x] = P(y=1|x)·0 + P(y=−1|x)·c_FP, c_− likewise
Regression: C(y,a) = (y − a)^2; asymmetric: C(y,a) = c_1 max(y−a, 0) + c_2 max(a−y, 0)
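A sketch of the asymmetric-cost decision rule above: compute the expected cost of each action from an estimate of P(y = +1 | x) and pick the cheaper action (how that estimate is obtained is left open):
```python
def decide(p_pos, c_fp, c_fn):
    """Pick the action with smaller expected cost; p_pos = P̂(y = +1 | x) (sketch)."""
    c_plus = (1 - p_pos) * c_fp    # E[C(y, +1) | x]: only a false positive costs
    c_minus = p_pos * c_fn         # E[C(y, −1) | x]: only a false negative costs
    return +1 if c_plus < c_minus else -1
```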
Discriminative / generative modeling
Discr. estimate P(y|x), generative P(y,x)
Approach (generative): P(x,y) = P(x|y)·P(y)
- Estimate prior on labels P(y)
- Estimate cond. distr. P(x|y) for each class y
- Pred. using Bayes: P(y|x) = P(y)P(x|y)/P(x), with P(x) = Σ_y P(x,y)

Outlier Detection
P(x) ≤ τ

Categorical Naive Bayes Classifier
MLE for feature distr.: P̂(X_i = c | Y = y) = θ^(i)_{c|y}
θ^(i)_{c|y} = Count(X_i = c, Y = y) / Count(Y = y)
Prediction: y* = argmax_y P̂(y|x)
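A counting sketch of the categorical naive Bayes fit and the argmax prediction; no smoothing is applied (not discussed in the sheet), so unseen feature values get probability zero:
```python
import numpy as np
from collections import Counter, defaultdict

def fit_categorical_nb(X, y):
    """Estimate P̂(y) and θ^(i)_{c|y} = Count(X_i=c, Y=y)/Count(Y=y) by counting (sketch)."""
    n = len(y)
    label_count = Counter(y)
    prior = {lab: cnt / n for lab, cnt in label_count.items()}
    counts = defaultdict(float)                   # (feature i, value c, label) -> count
    for xi, yi in zip(X, y):
        for i, c in enumerate(xi):
            counts[(i, c, yi)] += 1.0
    theta = {k: v / label_count[k[2]] for k, v in counts.items()}
    return prior, theta

def predict_nb(prior, theta, x):
    """y* = argmax_y P̂(y) Π_i P̂(x_i | y)  (∝ P̂(y|x))."""
    scores = {lab: p * np.prod([theta.get((i, c, lab), 0.0) for i, c in enumerate(x)])
              for lab, p in prior.items()}
    return max(scores, key=scores.get)
```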
Examples
MLE for P(y) = p = n_+/n
MLE for P(x_i|y) = N(x_i; µ̂_{i,y}, σ̂^2_{i,y}):
µ̂_{i,y} = (1/n_y) Σ_{x∈D_{x|y}} x_i,  σ̂^2_{i,y} = (1/n_y) Σ_{x∈D_{x|y}} (x_i − µ̂_{i,y})^2
MLE for Poi.: λ = avg(x_i); in R^d: P(X = x | Y = y) = Π_{i=1}^d Pois(λ_y^(i), x^(i))
Deriving decision rule
P(y|x) = (1/Z) P(y) P(x|y), Z = Σ_y P(y) P(x|y)
y* = argmax_y P(y|x) = argmax_y P(y) Π_{i=1}^d P(x_i|y)

Gaussian Bayes Classifier
P̂(x|y) = N(x; µ̂_y, Σ̂_y)
P̂(Y=y) = p̂_y = n_y/n
µ̂_y = (1/n_y) Σ_{i: y_i=y} x_i ∈ R^d
Σ̂_y = (1/n_y) Σ_{i: y_i=y} (x_i − µ̂_y)(x_i − µ̂_y)^T ∈ R^{d×d}
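A fit/predict sketch of the Gaussian Bayes classifier, scoring each class by log p̂_y + log N(x; µ̂_y, Σ̂_y) with the multivariate Gaussian density from the Useful math section:
```python
import numpy as np

def fit_gbc(X, y):
    """MLE of class priors, means, and covariances (sketch)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)
        params[c] = (len(Xc) / len(X), mu, Sigma)          # (p̂_y, µ̂_y, Σ̂_y)
    return params

def log_gaussian(x, mu, Sigma):
    """log N(x; µ, Σ) = −½[d log(2π) + log|Σ| + (x−µ)^T Σ^{-1} (x−µ)]."""
    diff = x - mu
    return -0.5 * (len(x) * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
                   + diff @ np.linalg.solve(Sigma, diff))

def predict_gbc(params, x):
    """y* = argmax_y log p̂_y + log N(x; µ̂_y, Σ̂_y)."""
    return max(params, key=lambda c: np.log(params[c][0])
               + log_gaussian(x, params[c][1], params[c][2]))
```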
Fisher's lin. discrim. analysis (LDA, c=2)
Assume: p = 0.5; Σ̂_− = Σ̂_+ = Σ̂
Discriminant function: f(x) = log(p/(1−p)) + (1/2)[log(|Σ̂_−|/|Σ̂_+|) + (x−µ̂_−)^T Σ̂_−^{-1} (x−µ̂_−) − (x−µ̂_+)^T Σ̂_+^{-1} (x−µ̂_+)]
Predict: y = sign(f(x)) = sign(w^T x + w_0)
w = Σ̂^{-1}(µ̂_+ − µ̂_−); w_0 = (1/2)(µ̂_−^T Σ̂^{-1} µ̂_− − µ̂_+^T Σ̂^{-1} µ̂_+)
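A sketch computing w and w_0 for the two-class LDA rule above; estimating the shared Σ̂ as the pooled empirical covariance is an assumption of this sketch:
```python
import numpy as np

def lda_fit(X, y):
    """Two-class LDA: w = Σ̂^{-1}(µ̂_+ − µ̂_−), w_0 = ½(µ̂_−^T Σ̂^{-1} µ̂_− − µ̂_+^T Σ̂^{-1} µ̂_+)."""
    Xp, Xm = X[y == +1], X[y == -1]
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
    # shared covariance Σ̂, pooled over both classes (an assumption of this sketch)
    Sigma = ((Xp - mu_p).T @ (Xp - mu_p) + (Xm - mu_m).T @ (Xm - mu_m)) / len(X)
    w = np.linalg.solve(Sigma, mu_p - mu_m)
    w0 = 0.5 * (mu_m @ np.linalg.solve(Sigma, mu_m) - mu_p @ np.linalg.solve(Sigma, mu_p))
    return w, w0   # predict sign(w^T x + w_0)
```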
Gaussian-Mixture Bayes classifiers
Estimate prior P(y); est. cond. distr. for each class: P(x|y) = Σ_{j=1}^{k_y} w_j^(y) N(x; µ_j^(y), Σ_j^(y))
Mixture modeling
Model each cluster as a probability distr. P(x|θ_j)
P(D|θ) = Π_{i=1}^n Σ_{j=1}^k w_j P(x_i|θ_j)
L(w,θ) = −Σ_{i=1}^n log Σ_{j=1}^k w_j P(x_i|θ_j)
Missing data
Hard-EM algorithm
Initialize parameters θ^(0)
E-step: Predict most likely class for each point: z_i^(t) = argmax_z P(z | x_i, θ^(t−1)) = argmax_z P(z | θ^(t−1)) P(x_i | z, θ^(t−1))
M-step: Compute the MLE: θ^(t) = argmax_θ P(D^(t) | θ), i.e. µ_j^(t) = (1/n_j) Σ_{i: z_i=j} x_i
Soft-EM algorithm
E-step: Calc. p for each point and cls.: γ_j^(t)(x_i) = P(Z=j | x_i, θ^(t−1))
M-step: Fit clusters to weighted data points:
w_j^(t) = (1/n) Σ_{i=1}^n γ_j^(t)(x_i);  µ_j^(t) = Σ_{i=1}^n γ_j^(t)(x_i) x_i / Σ_{i=1}^n γ_j^(t)(x_i)
σ_j^(t) = Σ_{i=1}^n γ_j^(t)(x_i) (x_i − µ_j^(t))^T (x_i − µ_j^(t)) / Σ_{i=1}^n γ_j^(t)(x_i)
Soft-EM for semi-supervised learning
Labeled y_i: γ_j^(t)(x_i) = [j = y_i]; unlabeled: γ_j^(t)(x_i) = P(Z=j | x_i, µ^(t−1), Σ^(t−1), w^(t−1))
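A Soft-EM sketch for a Gaussian mixture with isotropic clusters (matching the scalar σ_j update above); random initialization from data points and a fixed iteration count are assumptions:
```python
import numpy as np

def soft_em(X, k, iters=100, seed=0):
    """Soft-EM for an isotropic Gaussian mixture (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(k, 1.0 / k)                        # mixture weights w_j
    mu = X[rng.choice(n, size=k, replace=False)]   # initial means
    var = np.full(k, X.var())                      # initial variances σ_j^2
    for _ in range(iters):
        # E-step: responsibilities γ_j(x_i) ∝ w_j N(x_i; µ_j, σ_j^2 I)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)          # (n, k)
        logp = np.log(w) - 0.5 * (d * np.log(2 * np.pi * var) + sq / var)
        logp -= logp.max(axis=1, keepdims=True)    # for numerical stability
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: fit clusters to weighted data points
        Nj = gamma.sum(axis=0)                     # effective cluster sizes
        w = Nj / n
        mu = (gamma.T @ X) / Nj[:, None]
        var = (gamma * sq).sum(axis=0) / (d * Nj)  # per-dimension variance
    return w, mu, var
```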
Useful math
Probabilities
E_x[X] = ∫ x·p(x) dx if continuous, Σ_x x·p(x) otherwise
Var[X] = E[(X − µ_X)^2] = E[X^2] − E[X]^2
Bayes Rule
P(A|B) = P(B|A)·P(A)/P(B)
P(x,y) = P(y|x)·P(x) = P(x|y)·P(y)
p(Z|X,θ) = p(X,Z|θ)/p(X|θ)
P-Norm
||x||_p = (Σ_{i=1}^n |x_i|^p)^{1/p}, 1 ≤ p < ∞
Some gradients
∇_x ||x||_2^2 = 2x
f(x) = x^T A x; ∇_x f(x) = (A + A^T) x
E.g. ∇_w log(1+exp(−y w^T x)) = (1/(1+exp(−y w^T x))) · exp(−y w^T x) · (−yx) = (1/(1+exp(y w^T x))) · (−yx)
Convex / Jensen's inequality
g(x) convex ⇔ g''(x) ≥ 0 ⇔ ∀x_1, x_2 ∈ R, λ ∈ [0,1]: g(λx_1 + (1−λ)x_2) ≤ λg(x_1) + (1−λ)g(x_2)
Gaussian / Normal Distribution
f(x) = (1/√(2πσ^2)) exp(−(x−µ)^2/(2σ^2))
Multivariate Gaussian
Σ = covariance matrix, µ = mean
f(x) = (1/√((2π)^d |Σ|)) exp(−(1/2)(x−µ)^T Σ^{-1} (x−µ))
Empirical: Σ̂ = (1/n) Σ_{i=1}^n x_i x_i^T (needs centered data points)
Positive semi-definite matrices
M ∈ R^{n×n} is psd ⇔ ∀x ∈ R^n: x^T M x ≥ 0 ⇔ all eigenvalues of M are non-negative: λ_i ≥ 0