Basics

Fundamental Assumption
Data is iid from an unknown distribution P: (x_i, y_i) ~ P(X, Y)

True risk and estimated error
True risk: R(w) = ∫ P(x,y) (y − wᵀx)² dx dy = E_{x,y}[(y − wᵀx)²]
Estimated error: R̂_D(w) = (1/|D|) Σ_{(x,y)∈D} (y − wᵀx)²

Standardization
Centered data with unit variance: x̃_i = (x_i − µ̂)/σ̂
µ̂ = (1/n) Σ_{i=1..n} x_i,  σ̂² = (1/n) Σ_{i=1..n} (x_i − µ̂)²

Cross-Validation
For all models m, for all i ∈ {1,...,k}:
1. Split data: D = D_train^(i) ⊎ D_test^(i) (Monte Carlo or k-fold)
2. Train model: ŵ_{i,m} = argmin_w R̂_train^(i)(w)
3. Estimate error: R̂_m^(i) = R̂_test^(i)(ŵ_{i,m})
Select the best model: m̂ = argmin_m (1/k) Σ_{i=1..k} R̂_m^(i)

Parametric vs. Nonparametric models
Parametric: finite set of parameters, e.g. linear regression, linear perceptron.
Nonparametric: grow in complexity with the size of the data, more expressive, e.g. k-NN.

Gradient Descent
1. Pick arbitrary w_0 ∈ R^d
2. w_{t+1} = w_t − η_t ∇R̂(w_t)

Stochastic Gradient Descent (SGD)
1. Pick arbitrary w_0 ∈ R^d
2. w_{t+1} = w_t − η_t ∇_w ℓ(w_t; x′, y′), with data point (x′, y′) ∈ D drawn uniformly at random

Regression
Solve w* = argmin_w R̂(w) + λ C(w)

Linear Regression
R̂(w) = Σ_i (y_i − wᵀx_i)² = ||Xw − y||₂²
∇_w R̂(w) = −2 Σ_i (y_i − wᵀx_i) x_i
w* = (XᵀX)⁻¹ Xᵀy

Ridge regression
R̂(w) = Σ_i (y_i − wᵀx_i)² + λ||w||₂²
∇_w R̂(w) = −2 Σ_i (y_i − wᵀx_i) x_i + 2λw
w* = (XᵀX + λI)⁻¹ Xᵀy

L1-regularized regression (Lasso)
R̂(w) = Σ_i (y_i − wᵀx_i)² + λ||w||₁

Classification
Solve w* = argmin_w Σ_i ℓ(w; x_i, y_i) for a loss function ℓ.

0/1 loss
ℓ_{0/1}(w; y_i, x_i) = 1 if y_i ≠ sign(wᵀx_i), else 0

Perceptron algorithm
Use ℓ_P(w; y_i, x_i) = max(0, −y_i wᵀx_i) and SGD.
∇_w ℓ_P(w; y_i, x_i) = 0 if y_i wᵀx_i ≥ 0, and −y_i x_i otherwise (i.e. the weights are updated towards the mispredicted point).
Data linearly separable ⇔ the perceptron obtains a linear separator (not necessarily an optimal one).

Support Vector Machine (SVM)
Hinge loss: ℓ_H(w; x_i, y_i) = max(0, 1 − y_i wᵀx_i)
∇_w ℓ_H(w; y, x) = 0 if y wᵀx ≥ 1, and −y x otherwise
w* = argmin_w Σ_i ℓ_H(w; x_i, y_i) + λ||w||₂²

Imbalance
Up-/downsampling.
Cost-Sensitive Classification: scale the loss by a class-dependent cost: ℓ_CS(w; x, y) = c_± ℓ(w; x, y)

k-NN
y = sign(Σ_i y_i [x_i among the k nearest neighbours of x])
No weights ⇒ no training! But prediction depends on all the data :(

Neural networks
Parameterize the feature map with θ: φ(x, θ) = ϕ(θᵀx) = ϕ(z) (activation function ϕ)
w*, θ* = argmin_{w,θ} Σ_i ℓ(y_i; Σ_j w_j φ(x_i, θ_j))
f(x; w, θ_{1:m}) = Σ_{j=1..m} w_j ϕ(θ_jᵀx) = wᵀϕ(Θx)

Activation functions
Sigmoid: ϕ(z) = 1/(1 + exp(−z)), ϕ′(z) = (1 − ϕ(z))·ϕ(z)
tanh: ϕ(z) = tanh(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z))
ReLU: ϕ(z) = max(z, 0)

Predict: forward propagation
v^(0) = x; for l = 1,...,L−1: z^(l) = W^(l) v^(l−1), v^(l) = ϕ(z^(l))
f = W^(L) v^(L−1)
Predict f for regression, sign(f) for classification.

Compute gradient: backpropagation
Output layer: δ_j = ℓ′_j(f_j), ∂ℓ/∂w_{j,i} = δ_j v_i
Hidden layers l = L−1,...,1: δ_j = ϕ′(z_j) · Σ_{i∈Layer l+1} w_{i,j} δ_i, ∂ℓ/∂w_{j,i} = δ_j v_i

Learning with momentum
a ← m·a + η_t ∇_W ℓ(W; y, x);  W_{t+1} ← W_t − a

Clustering

k-means
R̂(µ) = Σ_i min_{j∈{1,...,k}} ||x_i − µ_j||₂²
µ̂ = argmin_µ R̂(µ) ... non-convex, NP-hard
Algorithm (Lloyd's heuristic): choose starting centers, assign each point to its closest center, update each center to the mean of its cluster, repeat.

Kernels
Efficient, implicit inner products.

Reformulating the perceptron
Ansatz: w* ∈ span(X) ⇒ w = Σ_j α_j y_j x_j
α* = argmin_{α∈Rⁿ} Σ_i max(0, −Σ_j α_j y_i y_j x_iᵀx_j)

Kernelized perceptron and SVM
Use αᵀk_i instead of wᵀx_i, and αᵀ D_y K D_y α instead of ||w||₂²
k_i = [y_1 k(x_i, x_1), ..., y_n k(x_i, x_n)]ᵀ,  D_y = diag(y)
Prediction: ŷ = sign(Σ_i α_i y_i k(x_i, x̂))
SGD update: α_{t+1} = α_t; if point i is mispredicted: α_{t+1,i} = α_{t,i} + η_t (cf. updating the weights towards the mispredicted point)

Kernelized linear regression (KLR)
Ansatz: w* = Σ_i α_i x_i
α* = argmin_α ||αᵀK − y||₂² + λ αᵀKα = (K + λI)⁻¹ y
Prediction: ŷ = Σ_i α_i k(x_i, x̂)

Properties of kernels
k: X × X → R must be an inner product (symmetric, positive-definite) in some Euclidean space V, i.e. k(x, x′) = ⟨φ(x), φ(x′)⟩_V = φ(x)ᵀφ(x′) for some feature map φ, and k(x, x′) = k(x′, x).
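As a concrete companion to the KLR recipe above, here is a minimal NumPy sketch, assuming a Gaussian kernel (listed under "Important kernels" below) with illustrative choices of bandwidth h and regularization strength lam; the helper names are ours, not from the sheet. It fits α = (K + λI)⁻¹y and predicts ŷ(x̂) = Σ_i α_i k(x_i, x̂).

import numpy as np

def gaussian_kernel(A, B, h=1.0):
    # Pairwise Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 h^2)); h is an illustrative bandwidth.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * h ** 2))

def fit_klr(X, y, lam=0.1, h=1.0):
    # Kernelized (ridge) regression coefficients: alpha = (K + lam*I)^-1 y.
    K = gaussian_kernel(X, X, h)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_klr(X_train, alpha, X_new, h=1.0):
    # Prediction y_hat(x) = sum_i alpha_i k(x_i, x).
    return gaussian_kernel(X_new, X_train, h) @ alpha

# Toy usage on arbitrary synthetic 1-D data (values chosen only for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_klr(X, y, lam=0.1, h=0.5)
print(predict_klr(X, alpha, np.array([[0.0], [1.5]]), h=0.5))

Solving the linear system (np.linalg.solve) rather than forming an explicit inverse is the usual numerically safer way to evaluate (K + λI)⁻¹y.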
Kernel matrix
K = (k(x_i, x_j))_{i,j=1..n}, i.e. K_{ij} = k(x_i, x_j)
Positive semi-definite matrices ⇔ kernels k

Important kernels
Linear: k(x, y) = xᵀy
Polynomial: k(x, y) = (xᵀy + 1)^d
Gaussian: k(x, y) = exp(−||x − y||₂² / (2h²))
Laplacian: k(x, y) = exp(−||x − y||₁ / h)

Composition rules
For valid kernels k₁, k₂, the following are also valid kernels: k₁ + k₂; k₁·k₂; c·k₁ for c > 0; f(k₁) if f is a polynomial with positive coefficients or the exponential function.

Metrics
n = n₊ + n₋,  n₊ = TP + FN,  n₋ = TN + FP
Accuracy: (TP + TN)/n,  Precision: TP/(TP + FP)
Recall/TPR: TP/n₊,  FPR: FP/n₋
F1 score: 2TP/(2TP + FP + FN) = 2/(1/precision + 1/recall)
ROC curve: y = TPR, x = FPR

Multi-class
One-vs-all (c classifiers), one-vs-one (c(c−1)/2 classifiers), encodings.

Multi-class Hinge loss
ℓ_{MC-H}(w^(1),...,w^(c); x, y) = max(0, 1 + max_{j∈{1,...,c}, j≠y} w^(j)ᵀx − w^(y)ᵀx)

Dimension reduction

PCA
D = {x_1,...,x_n} ⊂ R^d, Σ = (1/n) Σ_i x_i x_iᵀ, µ = 0
(W, z_1,...,z_n) = argmin Σ_i ||W z_i − x_i||₂²
W = (v_1 | ... | v_k) ∈ R^{d×k} orthogonal; z_i = Wᵀx_i
The v_i are the eigenvectors of Σ (corresponding to the k largest eigenvalues).

Kernel PCA
Kernel principal components: α^(1),...,α^(k) ∈ Rⁿ, α^(i) = (1/√λ_i) v_i, where K = Σ_i λ_i v_i v_iᵀ, λ_1 ≥ ... ≥ λ_d ≥ 0
New point: ẑ_i = f(x̂) = Σ_j α_j^(i) k(x̂, x_j)

Autoencoders
Find an identity function: x ≈ f(x; θ)
f(x; θ) = f_decode(f_encode(x; θ_encode); θ_decode)

Probability modeling
Find h: X → Y that minimizes the prediction error: R(h) = ∫ P(x,y) ℓ(y; h(x)) dx dy = E_{x,y}[ℓ(y; h(x))]
For least squares regression, the best h is h*(x) = E[Y | X = x]
Prediction: ŷ = Ê[Y | X = x̂] = ∫ P̂(y | X = x̂) y dy

Maximum Likelihood Estimation (MLE)
θ* = argmax_θ P̂(y_1,...,y_n | x_1,...,x_n, θ)
E.g. linear model + Gaussian noise: y_i = wᵀx_i + ε_i, ε_i ~ N(0, σ²), i.e. y_i ~ N(wᵀx_i, σ²).
With MLE (take argmin of the negative log-likelihood): w* = argmin_w Σ_i (y_i − wᵀx_i)²

Bias/Variance/Noise
Prediction error = Bias² + Variance + Noise

Maximum a posteriori estimate (MAP)
Assume a prior on the parameters, e.g. w_i ~ N(0, β²)
Bayes: P(w|x,y) = P(w|x) P(y|x,w) / P(y|x) = P(w) P(y|x,w) / P(y|x)

Logistic regression
Link function: σ(wᵀx) = 1/(1 + exp(−wᵀx)) (sigmoid)
P(y|x,w) = Ber(y; σ(wᵀx)) = 1/(1 + exp(−y wᵀx))
Classification: use P(y|x,w) and predict the most likely class label.
MLE: argmax_w P(y_{1:n}|w, x_{1:n}) ⇒ w* = argmin_w Σ_i log(1 + exp(−y_i wᵀx_i))
SGD update: w ← w + η_t y x P̂(Y = −y|w,x), with P̂(Y = −y|w,x) = 1/(1 + exp(y wᵀx))
MAP: Gaussian prior ⇒ ||w||₂² penalty, Laplace prior ⇒ ||w||₁ penalty
SGD with L2 penalty: w ← w(1 − 2λη_t) + η_t y x P̂(Y = −y|w,x)
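The L2-regularized SGD update above translates almost line by line into code. A minimal sketch, assuming labels in {−1, +1} and illustrative values for the step size eta and regularization strength lam (the helper name logistic_sgd is ours, not from the sheet):

import numpy as np

def logistic_sgd(X, y, lam=0.01, eta=0.1, epochs=100, seed=0):
    # SGD for L2-regularized logistic regression, labels y in {-1, +1}.
    # Implements w <- w*(1 - 2*lam*eta) + eta*y*x*P(Y = -y | w, x),
    # where P(Y = -y | w, x) = 1 / (1 + exp(y * w^T x)).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p_wrong = 1.0 / (1.0 + np.exp(y[i] * (w @ X[i])))  # P(Y = -y_i | w, x_i)
            w = w * (1 - 2 * lam * eta) + eta * y[i] * X[i] * p_wrong
    return w

# Toy usage on arbitrary synthetic data: two Gaussian blobs around -1 and +1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
w = logistic_sgd(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))

Note that this sketch omits a bias term, matching the wᵀx form used throughout the sheet.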
Bayesian decision theory
- Conditional distribution over labels P(y|x)
- Set of actions A
- Cost function C: Y × A → R
a* = argmin_{a∈A} E[C(y,a)|x]; compute the expectation via a sum or integral.
Classification: C(y,a) = [y ≠ a]; asymmetric: C(y,a) = c_FP if y = −1, a = +1; c_FN if y = +1, a = −1; 0 otherwise.
Regression: C(y,a) = (y − a)²; asymmetric: C(y,a) = c_1 max(y − a, 0) + c_2 max(a − y, 0)
E.g. y ∈ {−1, +1}: predict + if c₊ < c₋, where c₊ = E[C(y,+1)|x] = P(y = 1|x)·0 + P(y = −1|x)·c_FP, and c₋ likewise.

Discriminative / generative modeling
Discriminative models estimate P(y|x), generative models estimate P(y,x).
Approach (generative): P(x,y) = P(x|y)·P(y)
- Estimate the prior on labels P(y)
- Estimate the conditional distribution P(x|y) for each class y
- Predict using Bayes: P(y|x) = P(y)P(x|y)/P(x), with P(x) = Σ_y P(x,y)

Examples
MLE for P(y) = p = n₊/n
MLE for P(x_i|y) = N(x_i; µ_{i,y}, σ²_{i,y}):
µ̂_{i,y} = (1/n_y) Σ_{x: label y} x_i,  σ̂²_{i,y} = (1/n_y) Σ_{x: label y} (x_i − µ̂_{i,y})²
MLE for Poisson: λ = avg(x_i);  P(X = x|Y = y) = Π_{i=1..d} Pois(λ_y^(i), x^(i))

Deriving decision rule
P(y|x) = (1/Z) P(y) P(x|y),  Z = Σ_y P(y) P(x|y)
y* = argmax_y P(y|x) = argmax_y P(y) Π_i P(x_i|y)

Categorical Naive Bayes Classifier
MLE for the feature distribution: P̂(X_i = c|Y = y) = θ_{c|y}^(i)
θ_{c|y}^(i) = Count(X_i = c, Y = y) / Count(Y = y)
Prediction: y* = argmax_y P̂(y|x)

Gaussian Bayes Classifier
P̂(x|y) = N(x; µ̂_y, Σ̂_y),  P̂(Y = y) = p̂_y = n_y/n
µ̂_y = (1/n_y) Σ_{i: y_i=y} x_i ∈ R^d
Σ̂_y = (1/n_y) Σ_{i: y_i=y} (x_i − µ̂_y)(x_i − µ̂_y)ᵀ ∈ R^{d×d}

Fisher's linear discriminant analysis (LDA, c = 2)
Assume p = 0.5 and Σ̂₋ = Σ̂₊ = Σ̂.
Discriminant function: f(x) = log(p/(1−p)) + ½[ log(|Σ̂₋|/|Σ̂₊|) + (x − µ̂₋)ᵀΣ̂₋⁻¹(x − µ̂₋) − (x − µ̂₊)ᵀΣ̂₊⁻¹(x − µ̂₊) ]
Predict: y = sign(f(x)) = sign(wᵀx + w_0)
w = Σ̂⁻¹(µ̂₊ − µ̂₋);  w_0 = ½(µ̂₋ᵀΣ̂⁻¹µ̂₋ − µ̂₊ᵀΣ̂⁻¹µ̂₊)

Missing data

Gaussian-Mixture Bayes classifiers
Estimate the prior P(y); estimate the conditional distribution for each class:
P(x|y) = Σ_{j=1..k_y} w_j^(y) N(x; µ_j^(y), Σ_j^(y))

Mixture modeling
Model each cluster as a probability distribution P(x|θ_j)
P(D|θ) = Π_i Σ_j w_j P(x_i|θ_j)
L(w,θ) = −Σ_i log Σ_j w_j P(x_i|θ_j)

Hard-EM algorithm
Initialize parameters θ^(0).
E-step: predict the most likely class for each point:
z_i^(t) = argmax_z P(z|x_i, θ^(t−1)) = argmax_z P(z|θ^(t−1)) P(x_i|z, θ^(t−1))
M-step: compute the MLE θ^(t) = argmax_θ P(D^(t)|θ), e.g. µ_j^(t) = (1/n_j) Σ_{i: z_i=j} x_i

Soft-EM algorithm
E-step: calculate the cluster membership probability for each point and each cluster:
γ_j^(t)(x_i) = P(Z = j|x_i, θ^(t−1)) = w_j^(t−1) P(x_i|θ_j^(t−1)) / Σ_l w_l^(t−1) P(x_i|θ_l^(t−1))
M-step: fit the clusters to the weighted data points:
w_j^(t) = (1/n) Σ_i γ_j^(t)(x_i);  µ_j^(t) = Σ_i γ_j^(t)(x_i) x_i / Σ_i γ_j^(t)(x_i)
Σ_j^(t) = Σ_i γ_j^(t)(x_i)(x_i − µ_j^(t))(x_i − µ_j^(t))ᵀ / Σ_i γ_j^(t)(x_i)

Soft-EM for semi-supervised learning
Labeled y_i: γ_j^(t)(x_i) = [j = y_i]; unlabeled: γ_j^(t)(x_i) = P(Z = j|x_i, µ^(t−1), Σ^(t−1), w^(t−1))

Outlier Detection
Flag x as an outlier if P(x) ≤ τ.

Useful math

Probabilities
E_x[X] = ∫ x·p(x) dx if continuous, Σ_x x·p(x) otherwise
Var[X] = E[(X − µ_X)²] = E[X²] − E[X]²
Bayes rule: P(A|B) = P(B|A)·P(A)/P(B)
p(Z|X,θ) = p(X,Z|θ)/p(X|θ);  P(x,y) = P(y|x)·P(x) = P(x|y)·P(y)

P-Norm
||x||_p = (Σ_{i=1..n} |x_i|^p)^{1/p}, 1 ≤ p < ∞

Convex / Jensen's inequality
g convex ⇔ g″(x) ≥ 0 ⇔ for all x_1, x_2 ∈ R, λ ∈ [0,1]: g(λx_1 + (1−λ)x_2) ≤ λg(x_1) + (1−λ)g(x_2)

Gaussian / Normal Distribution
f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
Multivariate Gaussian on R^d (covariance matrix Σ, mean µ):
f(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−½(x − µ)ᵀΣ⁻¹(x − µ))
Empirical covariance: Σ̂ = (1/n) Σ_i x_i x_iᵀ (needs centered data points)

Positive semi-definite matrices
M ∈ R^{n×n} is psd ⇔ ∀x ∈ Rⁿ: xᵀMx ≥ 0 ⇔ all eigenvalues of M are nonnegative: λ_i ≥ 0

Some gradients
∇_x ||x||₂² = 2x
f(x) = xᵀAx ⇒ ∇_x f(x) = (A + Aᵀ)x
E.g. ∇_w log(1 + exp(−y wᵀx)) = (1/(1 + exp(−y wᵀx))) · exp(−y wᵀx) · (−yx) = (1/(1 + exp(y wᵀx))) · (−yx)
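As a quick numerical sanity check of the logistic-loss gradient listed under "Some gradients", the following sketch compares the closed-form expression with a central finite-difference approximation at a random point; the dimension, test point, and tolerance are arbitrary illustrative choices.

import numpy as np

def loss(w, x, y):
    # Logistic loss log(1 + exp(-y * w^T x)).
    return np.log1p(np.exp(-y * (w @ x)))

def grad(w, x, y):
    # Closed-form gradient from the sheet: -y * x / (1 + exp(y * w^T x)).
    return -y * x / (1.0 + np.exp(y * (w @ x)))

# Numerical gradient via central differences at a random test point.
rng = np.random.default_rng(0)
w, x, y, eps = rng.standard_normal(5), rng.standard_normal(5), 1.0, 1e-6
num = np.array([
    (loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(5)
])
print(np.allclose(num, grad(w, x, y), atol=1e-6))  # expected: True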