An Introductory Crash-Course in Machine Learning
Tony Jebara, Columbia University

Machine Learning: What/Why

• Statistical Data-Driven Computational Models
• Real domains (vision, speech, climate):
  have no E=mc²,
  are noisy, complex, nonlinear,
  have many variables,
  may be non-deterministic,
  have incomplete/approximate models
• Bottom-Up: build models from data
• Why? Big data is everywhere:
  sensors, audio, video, internet…

[Figure: bottom-up pipeline from Sensors → Data → Representation → Model → Criterion → Algorithm → Inference → Application]

Machine Learning Applications

• ML is interdisciplinary (CS, Math, Stats, Physics, OR, Neuro)
• Data-driven approach
• Useful when domain knowledge is too hard to encode manually
• Example applications:
  Speech Recognition (HMMs, ICA)
  Computer Vision (face recognition, digits, MRFs, super-resolution)
  Time Series Prediction (weather, finance)
  Genomics (micro-arrays, SVMs, splice-sites)
  NLP and Parsing (HMMs, CRFs, Google)
  Text and Information Retrieval (documents, Google, spam, TSVMs)
  Medical (QMR-DT, informatics, ICA)
  Behavior/Games (reinforcement learning, backgammon, gaming)

Machine Learning Tasks

SUPERVISED:
• Classification: y = sign(f(x))
• Regression: y = f(x)

UNSUPERVISED:
• Clustering
• Dimensionality Reduction
• Probabilistic Modeling: p(x)
• Detection: p(x) < t

[Figure: scatter-plot illustrations of each task]

Regression

• Given N (input, output) training pairs
• Start with an iid training dataset
  X = {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)},  x ∈ R^D = [x(1); x(2); …; x(D)],  y ∈ R
• Find a function f(x) to predict y from x
• Example: predict the price of a house in dollars (y)
  given input x = [#rooms; latitude; longitude; …]
• Need: a) a way to evaluate how good a fit we have
        b) a class of functions in which to search for f(x)

[Figure: scatter of (x, y) points with a fitted regression curve]

Empirical Risk Minimization (ERM)

• Idea: minimize 'loss' on the training data set
• Empirical: use the training set to find the best fit
• Define a loss function measuring how well we fit a single point: L(y, f(x))
• Empirical Risk = the average loss over the dataset:
  R_emp = (1/N) ∑_{i=1}^N L(y_i, f(x_i))
• Simplest loss: squared error from the y value
  L_sqr(y_i, f(x_i)) = ½ (y_i − f(x_i))²
• Another example loss: absolute error
  L_abs(y_i, f(x_i)) = |y_i − f(x_i)|
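
As a minimal illustration (not from the slides), the two losses and the empirical risk can be written in NumPy; the prediction function f and the arrays X, y below are placeholders:

```python
import numpy as np

def squared_loss(y, y_hat):
    # L_sqr(y, f(x)) = 1/2 (y - f(x))^2
    return 0.5 * (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    # L_abs(y, f(x)) = |y - f(x)|
    return np.abs(y - y_hat)

def empirical_risk(f, X, y, loss=squared_loss):
    # R_emp = (1/N) * sum_i L(y_i, f(x_i))
    return np.mean(loss(y, f(X)))
```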

Linear Function Classes

• The simplest class of functions to search over:
  f(x; θ) = ∑_{d=1}^D θ_d x(d) + θ_0 = θᵀx   (with a constant 1 appended to x)
• Plug into the Empirical Risk:
  R_emp(θ) = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ))
• If R_emp(θ) is convex, its minimum occurs where it is flat: ∇_θ R_emp = 0
• If we can't solve gradient = 0 in closed form, use numerical methods
  like gradient descent

Least Squares Regression

• Have N apartments, each with D measurements
• Form a data matrix X (with a leading column of 1s) and an output vector y of prices:
  X = [1 x_1(1) … x_1(D); … ; 1 x_N(1) … x_N(D)],   y = [y_1; … ; y_N]
• Empirical risk is
  R_emp(θ) = (1/2N) ∑_{i=1}^N (y_i − θᵀx_i)² = (1/2N) ||y − Xθ||²
• Setting the gradient to zero: θ* = (XᵀX)⁻¹ Xᵀy ≈ pinv(X) y ≈ X \ y
• Use f(x; θ*) = xᵀθ*
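
A minimal NumPy sketch of this closed-form fit (not from the slides; the data matrix X is assumed to already contain a leading column of ones, and the example numbers are made up):

```python
import numpy as np

def fit_least_squares(X, y):
    # theta* = (X^T X)^(-1) X^T y, computed via the pseudo-inverse for stability
    return np.linalg.pinv(X) @ y

def predict(X, theta):
    # f(x; theta*) = x^T theta*
    return X @ theta

# Example: N=5 apartments, D=1 feature (size), plus a column of ones for the bias term
X = np.array([[1.0, 30], [1.0, 45], [1.0, 60], [1.0, 80], [1.0, 100]])
y = np.array([300.0, 420.0, 580.0, 760.0, 990.0])
theta_star = fit_least_squares(X, y)
print(predict(X, theta_star))
```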

Evaluating Our Learned Function

• We minimized empirical risk to get θ*
• How well does f(x; θ*) perform on future data?
• It should generalize and have low True Risk:
  R_true(θ) = ∫ P(x,y) L(y, f(x; θ)) dx dy
• We can't compute the true risk, so instead use the testing empirical risk
• Split the data into training and testing portions:
  training set {(x_1,y_1), …, (x_N,y_N)},  testing set {(x_{N+1},y_{N+1}), …, (x_{N+M},y_{N+M})}
• Find θ* with the training data:   R_train(θ) = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ))
• Evaluate it with the testing data: R_test(θ) = (1/M) ∑_{i=N+1}^{N+M} L(y_i, f(x_i; θ))
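
A hedged sketch of this split-and-evaluate procedure for a linear model (illustrative only; it is written to be compatible with the fit_least_squares and empirical_risk helpers sketched above, which are not part of the slides):

```python
import numpy as np

def train_test_risks(X, y, n_train, fit, risk):
    # Fit on the first n_train points, evaluate on the remaining M points
    X_tr, y_tr = X[:n_train], y[:n_train]
    X_te, y_te = X[n_train:], y[n_train:]
    theta = fit(X_tr, y_tr)
    r_train = risk(lambda Z: Z @ theta, X_tr, y_tr)   # R_train(theta*)
    r_test = risk(lambda Z: Z @ theta, X_te, y_te)    # R_test(theta*)
    return r_train, r_test
```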

Overfitting

• Overfitting: misleadingly low R_train(θ*) but high R_test(θ*)
• High-dimensional problems often lead to overfitting
• Adding more features (extra columns in X) will always
  reduce R_train(θ*) and eventually cause overfitting
• If we have more dimensions than data points, we can get R_train(θ*) = 0

[Figure: risk vs. dimensionality D — R_train(θ*) decreases monotonically while R_test(θ*) falls then rises; underfitting on the left, overfitting on the right]

Regularized Risk Minimization

• Empirical Risk Minimization gave overfitting & underfitting
• Apply a penalty for using too many theta values
• This gives the Regularized Risk:
  R_regularized(θ) = R_empirical(θ) + Penalty(θ)
                   = (1/N) ∑_{i=1}^N L(y_i, f(x_i; θ)) + (λ/2) ||θ||²
• Solution for the l2-regularized risk with least squares loss:
  ∇_θ R_regularized = 0  ⇒  ∇_θ [ (1/2N) ||y − Xθ||² + (λ/2) ||θ||² ] = 0
  ⇒  θ* = (XᵀX + λI)⁻¹ Xᵀy   (absorbing the factor of N into λ)
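
A minimal ridge-regression sketch of this closed form (illustrative only; lam plays the role of λ with the factor of N absorbed):

```python
import numpy as np

def fit_ridge(X, y, lam):
    # theta* = (X^T X + lam * I)^(-1) X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```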

Regularized Risk Minimization

• Have D = 16 features; try minimizing R_regularized(θ) to get θ*
• Try minimizing with different λ; note that λ = 0 recovers ERM

[Figure: fitted curves for several values of λ]

Crossvalidation

• Try fitting with different lambda regularization levels
• Select the lambda which gives the lowest R_test(θ*)
• Lambda measures the simplicity of the model
• Models with low lambda are more flexible

[Figure: risk vs. 1/λ — R_train(θ*) and R_test(θ*) curves; underfitting on the left, overfitting on the right, best lambda at the R_test minimum]
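
A sketch of this lambda sweep (illustrative; it reuses the fit_ridge, empirical_risk, and train_test_risks helpers introduced above, and the candidate grid is an arbitrary choice):

```python
import numpy as np

def choose_lambda(X, y, n_train, lambdas):
    # Pick the lambda with the lowest testing empirical risk
    best_lam, best_risk = None, np.inf
    for lam in lambdas:
        _, r_test = train_test_risks(
            X, y, n_train,
            fit=lambda Xtr, ytr, lam=lam: fit_ridge(Xtr, ytr, lam),
            risk=empirical_risk)
        if r_test < best_risk:
            best_lam, best_risk = lam, r_test
    return best_lam

# e.g. choose_lambda(X, y, n_train=80, lambdas=np.logspace(-4, 2, 13))
```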

From Regression To Classification

• Classification is another important learning problem
• Regression:     X = {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)},  x ∈ R^D,  y ∈ R
• Classification: X = {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)},  x ∈ R^D,  y ∈ {0,1}
• E.g. given x = [tumor size, tumor density],
  predict y in {benign, malignant}
• Should we solve this as a least squares regression problem?

[Figure: 2D scatter of x's and O's separated by a decision boundary]

Logistic Regression

• Given a classification problem with binary outputs:
  X = {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)},  x ∈ R^D,  y ∈ {0,1}
• Use this function and output 1 if f(x) > 0.5 and 0 otherwise:
  f(x; θ) = (1 + exp(−θᵀx))⁻¹
• Instead of squared loss, use the logistic loss (the negative log-likelihood):
  L_log(y_i, f(x_i; θ)) = −[ y_i log f(x_i; θ) + (1 − y_i) log(1 − f(x_i; θ)) ]
• The resulting method is called Logistic Regression
• But this Empirical Risk Minimization has no closed-form solution:
  R_emp(θ) = −(1/N) ∑_{i=1}^N [ y_i log f(x_i; θ) + (1 − y_i) log(1 − f(x_i; θ)) ]

Gradient Descent

• Useful when we can't get the minimum solution in closed form
• The gradient points in the direction of fastest increase
• Take a step in the opposite direction!
• Gradient Descent Algorithm:
  choose a scalar step size η
  initialize t = 0 and θ_t = small random vector
  while not converged {
    θ_{t+1} = θ_t − η ∇_θ R_emp
    t = t + 1
  }
• For an appropriate η, this converges to a local minimum

Logistic Regression

• Logistic regression gives better classification performance than least squares regression
• Its empirical risk is convex, so gradient descent always
  converges to the same (global) solution
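
A hedged NumPy sketch of logistic regression trained by gradient descent (illustrative only; the step size eta and iteration count are arbitrary choices, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, eta=0.1, n_steps=1000):
    # Minimize the average negative log-likelihood by gradient descent
    N, D = X.shape
    theta = 0.01 * np.random.randn(D)       # small random initialization
    for _ in range(n_steps):
        p = sigmoid(X @ theta)              # f(x_i; theta) for all i
        grad = X.T @ (p - y) / N            # gradient of R_emp(theta)
        theta -= eta * grad                 # step against the gradient
    return theta

# Predict class 1 when f(x; theta) > 0.5:
# y_hat = (sigmoid(X @ theta) > 0.5).astype(int)
```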

Perceptrons

• Classification scenario once again, but consider +1/−1 labels:
  X = {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)},  x ∈ R^D,  y ∈ {−1, +1}
• Use a linear function; output +1 if f(x) > 0 and −1 otherwise:
  f(x; θ) = θᵀx
• The classification loss is natural but makes ERM hard:
  L_cls(y, f(x; θ)) = step(−y f(x; θ))
• So instead consider the Perceptron Loss:
  L_per(y, f(x; θ)) = −y f(x; θ)  if y f(x; θ) < 0,  and 0 otherwise

[Figure: both losses plotted against y f(x; θ)]

Perceptrons

• Perceptron empirical risk: R_emp^per(θ) = −(1/N) ∑_{i ∈ misclassified} y_i θᵀx_i
• Use gradient descent... or approximate the gradient
  on one random point via Stochastic Gradient Descent:
  initialize t = 0 and θ_0 = 0
  while not converged {
    pick i ∈ {1, …, N}
    if y_i x_iᵀθ_t < 0 {
      θ_{t+1} = θ_t + y_i x_i
      t = t + 1
    }
  }
• SGD only needs O(D) work per iteration
• If all ||x_i|| < r and an ideal θ* achieves
  a margin γ, then SGD converges in t ≤ r² ||θ*||² / γ² updates
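
A minimal perceptron SGD sketch (illustrative; labels are assumed to be ±1, and a maximum pass count is added so the loop terminates even on non-separable data):

```python
import numpy as np

def fit_perceptron_sgd(X, y, max_passes=100):
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(max_passes):
        mistakes = 0
        for i in np.random.permutation(N):
            if y[i] * (X[i] @ theta) <= 0:    # misclassified point
                theta += y[i] * X[i]          # perceptron update
                mistakes += 1
        if mistakes == 0:                     # converged: all points classified correctly
            break
    return theta
```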

Perceptrons

• Perceptrons were developed in the 1950s

Regularized Risk for Classification

• As with regression, Empirical Risk Minimization can overfit
• ERM finds the minimum training (not testing) error
• For non-separable problems, the Perceptron does not converge
• The Perceptron can stop at many different solutions:
  … a better solution ⇒ SVMs

[Figure: several separating hyperplanes consistent with the same training data]

Bounding the True Risk

• ERM may do better on training than on testing:  R(θ̂) ≥ R_emp(θ̂)
• Idea: add to R_emp(θ) a penalty or regularizer which favors simpler models:
  J(θ) = R_emp(θ) + C(θ)
• If R(θ) ≤ J(θ), we have a bound called the guaranteed risk
• After training, this guarantees the future error rate is ≤ min_θ J(θ)

Bound the True Risk with VC

• Vapnik: with probability 1 − η the following bound holds:
  R(θ) ≤ J(θ) = R_emp(θ) + [2h log(2eN/h) + 2 log(4/η)] / N · ( 1 + √( 1 + N R_emp(θ) / (h log(2eN/h) + log(4/η)) ) )
  where h = Vapnik-Chervonenkis (VC) dimension
• For margin classifiers,
  h ≤ min{ ceil(4r²/m²), d } + 1
  where m = margin, r = radius of the data, d = dimensionality
• Instead of ERM, do Structural Risk Minimization:
  minimize a bound J(θ) on the true risk to find θ*

[Figure: data of radius r separated with margin m]

Support Vector Machines

• An SVM maximizes the margin m while classifying perfectly
• This reduces the VC dimension while minimizing empirical risk
• These constraints ensure we get zero empirical risk:
  wᵀx_i + b ≥ +1  ∀ y_i = +1
  wᵀx_i + b ≤ −1  ∀ y_i = −1
• The margin of an SVM is m = 2 ||w||⁻¹
• So, solve this quadratic program (QP):
  min_{w,b} ½ ||w||²   subject to   y_i (wᵀx_i + b) − 1 ≥ 0

[Figure: maximum-margin separating hyperplane; loss plotted against y f(x; θ)]

Support Vector Machines

• For non-separable problems, add slack variables to the QP
• Penalize the slacks with a scalar C (found via crossvalidation):
  L_P:  min_{w,b,ξ} ½ ||w||² + C ∑_i ξ_i   s.t.  y_i (wᵀx_i + b) − 1 + ξ_i ≥ 0
• Instead of the primal QP, we can solve the dual QP:
  L_D:  max_α ∑_i α_i − ½ ∑_{i,j} α_i α_j y_i y_j x_iᵀx_j
        s.t.  ∑_i α_i y_i = 0  and  α_i ∈ [0, C]
• The classifier is the sign of f(x) = ∑_i α_i y_i xᵀx_i + b
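
In practice one rarely solves this QP by hand; a hedged usage example with scikit-learn's SVC (an outside library, not part of the slides), where C is the same slack penalty as above and the toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with labels in {-1, +1}
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0)   # C penalizes the slack variables
clf.fit(X, y)
print(clf.predict([[2.5, 2.5]]))
```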

Kernelized SVMs

• The dual only involves inner products between data points:
  L_D:  max_α ∑_i α_i − ½ ∑_{i,j} α_i α_j y_i y_j x_iᵀx_j   s.t.  ∑_i α_i y_i = 0  and  α_i ∈ [0, C]
  f(x) = ∑_i α_i y_i xᵀx_i + b
• To make the SVM handle non-linear decision boundaries,
  replace any inner product with a general kernel function:
  x_iᵀx_j → k(x_i, x_j),   for example  k(x_i, x_j) = (x_iᵀx_j + 1)³
• We can use (almost) any function which accepts two data
  points and outputs a scalar; this gives the kernelized SVM

Kernelized SVMs

• Polynomial kernel:            k(x_i, x_j) = (x_iᵀx_j + 1)^p
• Radial basis function kernel: k(x_i, x_j) = exp( −(1/2σ²) ||x_i − x_j||² )
• Least squares, logistic regression, and the perceptron are also
  kernelizable

[Figure: decision boundaries with a polynomial kernel and an RBF kernel]
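
A minimal sketch of these two kernels and of the kernelized decision function f(x) = ∑_i α_i y_i k(x_i, x) + b (illustrative only; the alphas and b would come from solving the dual QP and are treated here as given placeholders):

```python
import numpy as np

def poly_kernel(xi, xj, p=3):
    # k(x_i, x_j) = (x_i^T x_j + 1)^p
    return (xi @ xj + 1.0) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i y_i k(x_i, x) + b; classify by its sign
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha, y_train, X_train)) + b
```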

Relative Margin Machines

• Instead of max margin, can do max relative margin:
  SVM:  min_{w,b,ξ} ½ ||w||²   s.t.  y_i (wᵀx_i + b) ≥ 1
  RMM:  min_{w,b,ξ} ½ ||w||²   s.t.  B ≥ y_i (wᵀx_i + b) ≥ 1
• Similar QP algorithm
• Gives better generalization
  (Shivaswamy & Jebara 2010)

Classifying Digits

• Classify MNIST digits
• Train on 60,000 pairs
• Test on 10,000

  Algorithm              Error Rate
  Logistic Regression    12.0%
  SVM (RBF)              1.46%
  SVM (Poly 4)           1.83%
  SVM (Poly 7)           1.36%
  RMM (RBF)              1.44%
  RMM (Poly 4)           1.56%
  RMM (Poly 7)           1.15%

Dimensionality Reduction

• An unsupervised learning problem
• Given only high-dimensional input data (no target outputs):
  X = {x_1, x_2, …, x_N},  x ∈ R^D
• Learn low-dimensional outputs (where d << D):
  Y = {y_1, y_2, …, y_N},  y ∈ R^d

[Figure: high-dimensional data mapped to a low-dimensional embedding]

Principal Components Analysis

• Idea: instead of writing data in all its dimensions,
  only write it as the mean plus steps along one direction:
  [x_i(1); x_i(2)] ≈ [µ(1); µ(2)] + y_i [v(1); v(2)]
• More generally, keep a subset of d dimensions from D (e.g. 2 of 3):
  x_i ≈ µ + ∑_{j=1}^d y_i(j) v_j
• We reduce dimensionality since y_i (d numbers) is shorter than x_i (D numbers)
• Optimal directions: along the eigenvectors of the covariance matrix
• Which directions to keep: those with the highest eigenvalues (variances)

Principal Components Analysis

• With the eigenvectors, mean, and coefficients, reconstruct x:
  x_i ≈ µ + ∑_{j=1}^d y_i(j) v_j
• Get the eigenvectors and eigenvalues of the D×D covariance matrix:
  Σ = V Λ Vᵀ
• Higher eigenvalues mean higher variance; use the top d of them:
  λ_1 ≥ λ_2 ≥ λ_3 ≥ λ_4 ≥ … ≥ 0
• To compute the coefficients use:
  y_i(j) = (x_i − µ)ᵀ v_j
• Can also get the eigenvectors from the N×N Gram matrix A_ij = x_iᵀx_j
  …these inner products can be kernelized
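
A minimal PCA sketch via eigendecomposition of the covariance matrix (illustrative only; X is an N×D data matrix and d the number of components to keep):

```python
import numpy as np

def pca(X, d):
    mu = X.mean(axis=0)
    Xc = X - mu                                  # center the data
    cov = Xc.T @ Xc / X.shape[0]                 # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]        # keep the top-d eigenvalues
    V = eigvecs[:, order]                        # principal directions v_j
    Y = Xc @ V                                   # coefficients y_i(j) = (x_i - mu)^T v_j
    return Y, V, mu

def reconstruct(Y, V, mu):
    # x_i ~= mu + sum_j y_i(j) v_j
    return mu + Y @ V.T
```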

Principal Components Analysis

• Related to spectral graph embedding (Hall 70, Schölkopf 98)
• Just embed with Y = the leading eigenvectors of A_ij = x_iᵀx_j
• But for nonlinear data, PCA fails
• Problem: PCA preserves all pairwise distances
• This requires many dimensions for Y
• Solution: sparsify the graph of pairwise distances between points
• Preserve only local distance edges (Weinberger & Saul 04)

[Figure: a nonlinear manifold where PCA gives a poor embedding]

Sparsifying graphs

• Given the distance graph between all points, find a sparse subgraph
• There are 2^{n(n−1)/2} possible subgraphs!

[Figure: a fully connected distance graph]

Sparsifying graphs

• In O(n²) we could find k nearest neighbors

[Figure: the distance graph pruned to k-nearest-neighbor edges]

Sparsifying graphs

• In O(n²) we could find k nearest neighbors
• Get a binary connectivity matrix C from the affinity matrix A:

  A = [ 4 3 1 0 1 3          C = [ 1 1 0 0 0 1
        3 4 3 1 0 1                1 1 1 0 0 0
        1 3 4 3 1 0      →         0 1 1 1 0 0
        0 1 3 4 3 1                0 0 1 1 1 0
        1 0 1 3 4 3                0 0 0 1 1 1
        3 1 0 1 3 4 ]              1 0 0 0 1 1 ]
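
A sketch of building such a k-nearest-neighbor connectivity matrix C from an affinity matrix A (illustrative only; here the k largest off-diagonal affinities per row are kept, the result is symmetrized, and self-connections are retained as in the slide's example):

```python
import numpy as np

def knn_connectivity(A, k):
    # C_ij = 1 if j is among the k most similar points to i (or vice versa)
    n = A.shape[0]
    C = np.zeros((n, n), dtype=int)
    for i in range(n):
        order = np.argsort(A[i])[::-1]                 # most similar first
        neighbors = [j for j in order if j != i][:k]   # k nearest (excluding self)
        C[i, neighbors] = 1
    C = np.maximum(C, C.T)       # symmetrize
    np.fill_diagonal(C, 1)       # keep self-connections
    return C
```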

Semidefinite embedding

• Semidefinite embedding (SDE) improves spectral embedding
• It unfolds the sparse graph by finding a large weight matrix K
• But K must preserve distances on the sparse graph
• Given connectivity C, solve the SDP:
  max_K  tr(K)
  subject to  K ⪰ 0,  ∑_ij K_ij = 0,
              K_ii + K_jj − K_ij − K_ji = A_ii + A_jj − A_ij − A_ji  whenever C_ij = 1
• Embed with the eigenvectors of K instead of A
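
A hedged sketch of this SDP using the cvxpy modeling library (an outside dependency, not part of the slides; solver choice and scalability to large graphs are glossed over):

```python
import numpy as np
import cvxpy as cp

def sde_embedding(A, C, d=2):
    n = A.shape[0]
    K = cp.Variable((n, n), PSD=True)           # the learned weight matrix K >= 0
    cons = [cp.sum(K) == 0]                     # centering constraint sum_ij K_ij = 0
    for i in range(n):
        for j in range(n):
            if C[i, j] == 1:                    # preserve distances on kept edges
                cons.append(K[i, i] + K[j, j] - K[i, j] - K[j, i]
                            == A[i, i] + A[j, j] - A[i, j] - A[j, i])
    cp.Problem(cp.Maximize(cp.trace(K)), cons).solve()
    # Embed with the top-d eigenvectors of K, scaled by the root eigenvalues
    w, V = np.linalg.eigh(K.value)
    top = np.argsort(w)[::-1][:d]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0))
```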

Semidefinite embedding

• SDE pulls the pruned graph apart maximally
  while preserving pairwise distances wherever C_ij = 1
• SDE's stretching of the data improves spectral embedding
• Are we done? Is SDE the solution for all sparse graphs?

Semidefinite Embedding

• On general graphs, SDE unfolding can worsen the visualization!
• Example: the star graph…
• SDE is increasing the dimensionality!
• Can we do better?

[Figure: star-graph embeddings and eigenvalue spectra for PCA and SDE]

Minimum volume embedding

• Maximizing the trace in SDE can be bad…
  it fans out some manifolds
• Minimum Volume Embedding (MVE) (Shaw & Jebara 07):
  unfold only in the two preserved dimensions,
  squash down in the others

Minimum volume embedding

• To reduce dimension, drive the energy into the top (e.g. 2) dimensions
• Maximize the fidelity F(K) = (λ₁ + λ₂) / ∑_i λ_i, the % of energy in the top dimensions

[Figure: two embeddings with F(K) = 0.98 vs. F(K) = 0.29]
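
A small sketch of this fidelity score for a learned kernel matrix K (illustrative only):

```python
import numpy as np

def fidelity(K, d=2):
    # F(K) = (lambda_1 + ... + lambda_d) / sum_i lambda_i
    w = np.sort(np.linalg.eigvalsh(K))[::-1]   # eigenvalues, largest first
    w = np.maximum(w, 0)                       # K should be PSD; clip numerical noise
    return w[:d].sum() / w.sum()
```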

Minimum volume embedding

• An iterative SDP:
  Fidelity = max_{K ∈ κ} ∑_i α_i λ_i
           = max_{K ∈ κ} max_V tr( K ∑_i α_i v_i v_iᵀ )   s.t.  v_iᵀ v_j = δ_ij
• Alternate: SVD for V, then SDE's SDP for K; iterate SDP and SVD
• Maximizes a variational bound on the original problem
• Fast monotonic convergence; code available

Minimum volume embedding

• Star graph experiment: visualization and spectra
• MVE converges in 5 iterations

[Figure: star-graph embeddings and spectra for PCA, SDE, and MVE]

Minimum volume embedding

• Collaboration network embedding with spectra

[Figure: collaboration-network embeddings and spectra for PCA, SDE, and MVE]

Comparing Embeddings

• Evaluate fidelity (% of energy in the top 2 visualized dimensions):

             PCA     SDE      MVE
  Star      95.0%   29.9%   100.0%
  Spiral    45.8%   99.9%    99.9%
  Twos      18.4%   88.4%    97.8%
  Faces     31.4%   83.6%    99.2%
  Network    5.9%   95.3%    99.2%

Comparing Embeddings

• ENSO Data (Lima, Lall, Jebara & Barston 2009)

[Figure: PCA vs. SDE embeddings of the ENSO data]