Advanced Machine Learning & Perception

Instructor: Tony Jebara, Columbia University
Topic 1
• Introduction, researchy course, latest papers
• Going beyond simple machine learning
• Perception, strange spaces, images, time, behavior
• Info, policies, texts, web page
• Syllabus, Overview, Review
• Gaussian Distributions
• Representation & Appearance Based Methods
• Least Squares, Correlation, Gaussians, Bases
• Principal Components Analysis and its Shortcomings
About me
• Tony Jebara, Associate Professor in Computer Science
• Started at Columbia in 2002
• PhD from MIT in Machine Learning
• Thesis: Discriminative, Generative and Imitative Learning (2001)
• Research Areas (Columbia Machine Learning Lab, CEPSR 6LE5):
• Machine Learning
• Some Computer Vision
• www.cs.columbia.edu/learning
Course Web Page
http://www.cs.columbia.edu/~jebara/4772
http://www.cs.columbia.edu/~jebara/6772
Some material & announcements will be online, but many things will be handed out
in class, such as photocopies of papers for readings.
Check the NEWS link for deadlines, homework, TA info, etc.
Please follow the policies, homework deadlines, and readings closely!
Syllabus
Week 1: Introduction, Review of Basic Concepts, Representation Issues, Vector
and Appearance-Based Models, Correlation and Least Squared Error Methods,
Bases, Eigenspace Recognition, Principal Components Analysis
Week 2: Nonlinear Dimensionality Reduction, Manifolds, Kernel PCA, Locally
Linear Embedding, Maximum Variance Unfolding, Minimum Volume Embedding
Week 3: Maximum Entropy, Exponential Families, Maximum Entropy
Discrimination, Large Margin Probability Models
Week 4: Conditional Random Fields and Linear Models, Iterative Scaling and
Majorization
Week 5: Graphical Models, Multi-Class Support Vector Machines, Structured
Support Vector Machines, Cutting Plane Algorithms
Week 6: Kernels and Probabilistic Kernels
Syllabus
Week 7: Feature Selection and Kernel Selection, Support Vector Machine
Extensions
Week 8: Meta-Learning and Multi-Task Support Vector Machines
Week 9: Semi-Supervised Learning and Graph-Based Semi-Supervised Learning
Week 10: High Tree-Width Graphical Models, Approximate Inference, Graph
Structure Learning
Week 11: Clustering, Spectral Clustering, Normalized Cuts.
Week 12: Boosting, Mixtures of Experts, AdaBoost, Online Learning
Week 13: Project Presentations
Week 14: Project Presentations
Beyond Canned Learning
• Latest research methods, high dimensions, nonlinearities,
dynamics, manifolds, invariance, unlabeled, feature selection
• Modeling Images / People / Activity / Time Series / MoCAP
Representation & Vectorization
• How to represent our data? Images, time series, genes…
• Vectorization: the poor man’s representation
• Almost anything can be written as a long vector
• E.g. an image is read lexicographically: the RGB value of each pixel is appended to a long vector
$$x \rightarrow \begin{bmatrix} 1 \\ 3 \\ \vdots \\ 0 \\ 2 \end{bmatrix}$$
• Or, a gene sequence can be written as a binary vector:
GATACAC = [0100 0001 1000 0001 0010 0001 0010]
• For images, this is called “Appearance Based” representation
• But, we lose many important properties this way
• We will fix this in later lectures
• For now, it is an easy way to proceed
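As a concrete illustration, here is a minimal NumPy sketch of both vectorizations; the image size, the alphabet ordering within each 4-bit block, and the function names are my own choices, not from the slides.

```python
import numpy as np

# Hypothetical RGB image: an H x W x 3 array of pixel intensities.
image = np.random.randint(0, 256, size=(48, 48, 3))

# Appearance-based representation: read the image lexicographically and
# stack every pixel's RGB values into one long vector.
x = image.reshape(-1).astype(float)        # shape (48*48*3,) = (6912,)

# One-hot ("binary vector") encoding of a gene sequence.  The bit ordering
# inside each block is a convention and may differ from the slide's.
alphabet = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def vectorize_sequence(seq):
    v = np.zeros((len(seq), 4))
    v[np.arange(len(seq)), [alphabet[s] for s in seq]] = 1.0
    return v.reshape(-1)                   # one long binary vector

print(x.shape)
print(vectorize_sequence("GATACAC").reshape(-1, 4))
```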
Least Squares Detection
• How to find a face in an image given a template?
• Template Matching and Sum-Squares-Difference (SSD)
• Naïve: try all positions/scales, find least squares distance
$$i^* = \arg\min_{i \in [1,T]} \tfrac{1}{2}\,\|\mu - x_i\|^2 = \arg\min_{i \in [1,T]} \tfrac{1}{2}\,(\mu - x_i)^T(\mu - x_i)$$
• Correlation and Normalized Correlation
Could normalize length of all vectors (fixes lighting)
$$i^* = \arg\min_{i \in [1,T]} \tfrac{1}{2}\left(\hat{\mu}^T\hat{\mu} - 2\,\hat{\mu}^T\hat{x}_i + \hat{x}_i^T\hat{x}_i\right) = \arg\max_{i \in [1,T]} \hat{\mu}^T \hat{x}_i, \qquad \hat{\mu} = \frac{\mu}{\|\mu\|},\quad \hat{x}_i = \frac{x_i}{\|x_i\|}$$
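A minimal sketch of the two matching criteria above, assuming the candidate sub-images have already been extracted and flattened into vectors; the function names and toy data are mine.

```python
import numpy as np

def ssd_match(template, candidates):
    """Index minimizing (1/2)||mu - x_i||^2 over the candidate vectors."""
    diffs = candidates - template                       # shape (T, D)
    return np.argmin(0.5 * np.sum(diffs ** 2, axis=1))

def normalized_correlation_match(template, candidates):
    """Index maximizing mu_hat^T x_hat_i after length normalization."""
    mu_hat = template / np.linalg.norm(template)
    x_hat = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argmax(x_hat @ mu_hat)

# Toy usage: T=50 candidate windows of D=100 pixels each, already vectorized.
rng = np.random.default_rng(0)
mu = rng.normal(size=100)                               # template vector
X = rng.normal(size=(50, 100))                          # candidate sub-images
print(ssd_match(mu, X), normalized_correlation_match(mu, X))
```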
Least Squares as Gaussian Model
• Minimum squared error is equivalent to
maximum likelihood under a Gaussian
$$i^* = \arg\min_{i \in [1,T]} \tfrac{1}{2}\,\|\mu - x_i\|^2 = \arg\max_{i \in [1,T]} \log\!\left(\frac{1}{(2\pi)^{D/2}} \exp\!\left(-\tfrac{1}{2}\,\|\mu - x_i\|^2\right)\right)$$
• Can now treat it as a probabilistic problem
• Trying to find the most likely position (or sub-image) in the search image given the Gaussian model of the template
• Define the log-likelihood as: $\log p(x \mid \theta)$
• For a Gaussian, the probability (likelihood) is:
$$p(x_i \mid \theta) = \frac{1}{(2\pi)^{D/2}} \exp\!\left(-\tfrac{1}{2}\,\|\mu - x_i\|^2\right)$$
Multivariate Gaussian
• Gaussians extend to D dimensions and have 2 parameters:
• Mean vector µ vector (translates)
• Covariance matrix Σ (stretches and rotates)
$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^T \Sigma^{-1} (x-\mu)\right), \qquad x, \mu \in \mathbb{R}^D,\ \Sigma \in \mathbb{R}^{D \times D}$$
• Max and expectation = µ
• Mean is any real vector, variance is now Σ matrix
• Covariance matrix is positive semi-definite
• Covariance matrix is symmetric
• Need matrix inverse (inv)
• Need matrix determinant (det)
• Need matrix trace operator (trace)
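A minimal sketch of evaluating this density in log space; using a Cholesky factor instead of an explicit inverse and determinant is my numerical choice, not something the slide prescribes.

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """log N(x | mu, Sigma) for a D-dimensional Gaussian."""
    D = mu.shape[0]
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T
    z = np.linalg.solve(L, x - mu)              # z^T z = (x-mu)^T Sigma^{-1} (x-mu)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log |Sigma|
    return -0.5 * (D * np.log(2 * np.pi) + log_det + z @ z)

mu = np.zeros(3)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
print(gaussian_log_density(np.array([0.5, -1.0, 0.2]), mu, Sigma))
```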
Max Likelihood Gaussian
• How to make the face detector work on all faces?
• Why use one template? How can we use many templates?
• Have IID samples of template vectors, i = 1…N: $x_1, x_2, x_3, \ldots, x_N$
• Represent the IID samples with parameters as a network, drawn more efficiently using a replicator plate over i = 1…N
• Let us get a good Gaussian from these many templates.
• Standard approach: Max Likelihood
$$\sum_{i=1}^{N} \log p(x_i \mid \mu, \Sigma) = \sum_{i=1}^{N} \log \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x_i-\mu)^T \Sigma^{-1}(x_i-\mu)\right)$$
Max Likelihood Gaussian
• Max over µ:
$$\frac{\partial}{\partial \mu} \sum_{i=1}^N \log \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x_i - \mu)^T\Sigma^{-1}(x_i - \mu)\right) = 0$$
$$\frac{\partial}{\partial \mu} \sum_{i=1}^N \left(-\tfrac{D}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}\,(x_i - \mu)^T\Sigma^{-1}(x_i - \mu)\right) = 0$$
using $\frac{\partial\, x^T x}{\partial x} = 2x^T$:
$$\sum_{i=1}^N (x_i - \mu)^T \Sigma^{-1} = 0$$
see Jordan Ch. 12, get the sample mean:
$$\mu = \frac{1}{N}\sum_{i=1}^N x_i$$
• For Σ, need the trace operator $\mathrm{tr}(A) = \mathrm{tr}(A^T) = \sum_{d=1}^D A_{dd}$ and several properties:
$$\mathrm{tr}(AB) = \mathrm{tr}(BA), \qquad \mathrm{tr}\!\left(BAB^{-1}\right) = \mathrm{tr}(A), \qquad \mathrm{tr}\!\left(xx^T A\right) = \mathrm{tr}\!\left(x^T A x\right) = x^T A x$$
Max Likelihood Gaussian
• Likelihood rewritten in trace notation:
$$\ell = \sum_{i=1}^N \left(-\tfrac{D}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}\,(x_i - \mu)^T\Sigma^{-1}(x_i - \mu)\right)$$
$$= -\tfrac{ND}{2}\log 2\pi - \tfrac{N}{2}\log|\Sigma| - \tfrac{1}{2}\sum_{i=1}^N \mathrm{tr}\!\left[(x_i - \mu)(x_i - \mu)^T\,\Sigma^{-1}\right]$$
$$= -\tfrac{ND}{2}\log 2\pi + \tfrac{N}{2}\log|A| - \tfrac{1}{2}\sum_{i=1}^N \mathrm{tr}\!\left[(x_i - \mu)(x_i - \mu)^T A\right]$$
• Max over $A = \Sigma^{-1}$, using the properties:
$$\frac{\partial \log|A|}{\partial A} = \left(A^{-1}\right)^T = \Sigma^T, \qquad \frac{\partial\, \mathrm{tr}\!\left[BA\right]}{\partial A} = B^T$$
$$\frac{\partial \ell}{\partial A} = \tfrac{N}{2}\,A^{-1} - \tfrac{1}{2}\sum_{i=1}^N (x_i - \mu)(x_i - \mu)^T$$
• Get the sample covariance:
$$\frac{\partial \ell}{\partial A} = 0 \;\;\Rightarrow\;\; \Sigma = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)(x_i - \mu)^T$$
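A minimal sketch of these closed-form maximum-likelihood estimates; note the 1/N normalization, which matches the derivation (NumPy's default covariance uses 1/(N-1) instead).

```python
import numpy as np

def fit_gaussian_ml(X):
    """Maximum-likelihood mean and covariance for the rows of X (N x D)."""
    mu = X.mean(axis=0)                            # mu = (1/N) sum_i x_i
    centered = X - mu
    Sigma = (centered.T @ centered) / X.shape[0]   # (1/N) sum_i (x_i-mu)(x_i-mu)^T
    return mu, Sigma

X = np.random.default_rng(1).normal(size=(500, 4))
mu, Sigma = fit_gaussian_ml(X)
# Agrees with NumPy's population covariance up to floating-point error.
print(np.allclose(Sigma, np.cov(X.T, bias=True)))
```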
Principal Components Analysis
• Problem: for high dimensional data, D is large
• Storing $\Sigma$, computing $\Sigma^{-1}$ and determining $|\Sigma|$ are expensive!
• Idea: limit Gaussian model to directions of high variance
• Use Principal Components Analysis to mimic Σ
Principal Components Analysis
• PCA approximates each datapoint as a mean vector plus steps along eigenvector directions, e.g. $c_i$ steps along $v$:
$$\begin{bmatrix} x_i(1) \\ x_i(2) \end{bmatrix} \approx \begin{bmatrix} \mu(1) \\ \mu(2) \end{bmatrix} + c_i \begin{bmatrix} v(1) \\ v(2) \end{bmatrix}$$
• More generally, PCA uses a set of M eigenvectors (where M << D):
$$x_i \approx \hat{x}_i = \mu + \sum_{j=1}^M c_{ij}\, v_j$$
• PCA selects $\{\mu, c_{ij}, v_j\}$ to minimize $\sum_{i=1}^N \|x_i - \hat{x}_i\|^2$
• The optimal directions are eigenvectors of covariance
• Which directions to keep: highest eigenvalues (variances)
Principal Components Analysis
• Use eigenvectors, mean & coefficients to approximate data:
$$x_i \approx \mu + \sum_{j=1}^M c_{ij}\, v_j$$
• PCA finds eigenvectors by decomposing the covariance matrix:
$$\Sigma = V \Lambda V^T = \sum_{i=1}^D \lambda_i v_i v_i^T$$
$$\begin{bmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{12} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{13} & \Sigma_{23} & \Sigma_{33} \end{bmatrix} = \begin{bmatrix} v_1 & v_2 & v_3 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \\ v_3^T \end{bmatrix}$$
• Eigenvectors are orthonormal: $v_i^T v_j = \delta_{ij}$
• Eigenvalues are non-negative and non-increasing: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D \ge 0$
• PCA selects the M eigenvectors with the largest eigenvalues
• Truncating gives an approximate covariance: $\hat{\Sigma} = \sum_{i=1}^M \lambda_i v_i v_i^T$
• PCA finds coefficients by: $c_{ij} = (x_i - \mu)^T v_j$
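A minimal sketch of PCA exactly as described here: eigendecompose the sample covariance, keep the top-M eigenvectors, project to get the coefficients, and reconstruct. The function and variable names are mine.

```python
import numpy as np

def pca_fit(X, M):
    """Top-M principal directions of the rows of X via the covariance matrix."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X.T, bias=True)             # (1/N) sum (x_i - mu)(x_i - mu)^T
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]      # keep the M largest
    return mu, eigvals[order], eigvecs[:, order]

def pca_coefficients(X, mu, V):
    """c_ij = (x_i - mu)^T v_j"""
    return (X - mu) @ V

def pca_reconstruct(C, mu, V):
    """x_hat_i = mu + sum_j c_ij v_j"""
    return mu + C @ V.T

X = np.random.default_rng(2).normal(size=(200, 10))
mu, lam, V = pca_fit(X, M=3)
X_hat = pca_reconstruct(pca_coefficients(X, mu, V), mu, V)
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))   # mean squared reconstruction error
```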
PCA via the Snapshot Method
• Careful… how big is the covariance matrix?
• Assume 1000 images each containing D=20,000 pixels
• The covariance matrix is D×D, that's unstorable!
• Also, finding the eigenvectors or inverting a D×D matrix requires O(D³)!
• First compute the mean of all data (easy) and subtract it from each point
Instead of $\Sigma = \frac{1}{N}\sum_{i=1}^N x_i x_i^T$, compute the N×N Gram matrix $\Phi$ where $\Phi_{ij} = x_i^T x_j$
Then find the eigendecomposition of the Gram matrix: $\Phi = V \Lambda V^T$
Eigenvectors of Σ are then: $v_i \propto \sum_{j=1}^N x_j\, v_i(j)$
PCA via the Snapshot Method
$$\{x_1, \ldots, x_N\} = \left\{\mu + \textstyle\sum_j c_{1j}\, v_j,\ \ldots\right\}$$
(Figure: each image in the dataset is written as the mean image plus a weighted combination of the eigenvector images.)
Truncated Gaussian Detection
• Approximate Σ with PCA plus a spherical term via $p(x \mid \mu, \hat{\Sigma})$:
$$\Sigma = \sum_{k=1}^D \lambda_k v_k v_k^T \;\approx\; \hat{\Sigma} = \underbrace{\sum_{k=1}^M \lambda_k v_k v_k^T}_{\text{specific eigenvalues/eigenvectors}} + \underbrace{\sum_{k=M+1}^D \rho\, v_k v_k^T}_{\text{spherical eigenvalues}}$$
$$\hat{\Sigma}^{-1} = \sum_{k=1}^M \tfrac{1}{\lambda_k}\, v_k v_k^T + \sum_{k=M+1}^D \tfrac{1}{\rho}\, v_k v_k^T = \sum_{k=1}^M \left(\tfrac{1}{\lambda_k} - \tfrac{1}{\rho}\right) v_k v_k^T + \tfrac{1}{\rho} I$$
$$|\hat{\Sigma}| = \prod_{k=1}^M \lambda_k \prod_{k=M+1}^D \rho$$
$$\rho = \frac{1}{D-M}\sum_{k=M+1}^D \lambda_k = \frac{1}{D-M}\left(\sum_{k=1}^D \Sigma_{kk} - \sum_{k=1}^M \lambda_k\right) = \frac{1}{D-M}\left(\sum_{k=1}^D \lambda_k - \sum_{k=1}^M \lambda_k\right), \qquad \Sigma_{kk} = \frac{1}{N}\sum_{i=1}^N \left(x_i(k) - \mu(k)\right)^2$$
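A minimal sketch of building and evaluating this truncated model: keep the top-M eigenpairs and set ρ to the average of the discarded eigenvalues via the trace, as on the slide. The function names and toy data are mine, and the snapshot-style eigendecomposition is reused for convenience.

```python
import numpy as np

def fit_truncated_gaussian(X, M):
    """PCA-plus-spherical approximation of a high-dimensional Gaussian."""
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    w, U = np.linalg.eigh((Xc @ Xc.T) / N)    # eigenvalues of the covariance via Gram
    order = np.argsort(w)[::-1][:M]
    lam = w[order]                            # top-M "specific" eigenvalues
    V = Xc.T @ U[:, order]
    V /= np.linalg.norm(V, axis=0)            # top-M eigenvectors
    total_variance = np.sum(Xc ** 2) / N      # trace(Sigma) = sum_k Sigma_kk
    rho = (total_variance - lam.sum()) / (D - M)   # average discarded eigenvalue
    return mu, lam, V, rho

def truncated_log_density(x, mu, lam, V, rho, D):
    """log p(x | mu, Sigma_hat) using the closed forms for Sigma_hat^{-1} and |Sigma_hat|."""
    d = x - mu
    c = V.T @ d                               # coefficients along kept eigenvectors
    maha = np.sum((1.0 / lam - 1.0 / rho) * c ** 2) + (d @ d) / rho
    log_det = np.sum(np.log(lam)) + (D - len(lam)) * np.log(rho)
    return -0.5 * (D * np.log(2 * np.pi) + log_det + maha)

X = np.random.default_rng(4).normal(size=(200, 500))
mu, lam, V, rho = fit_truncated_gaussian(X, M=20)
print(truncated_log_density(X[0], mu, lam, V, rho, D=500))
```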
Truncated Gaussian Detection
• Instead of minimizing squared error, use Gaussian model
$$p(x \mid \mu, \hat{\Sigma}) = \frac{1}{(2\pi)^{D/2}\,|\hat{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^T \hat{\Sigma}^{-1} (x-\mu)\right)$$
• Use Snapshot PCA to efficiently store the big covariance
• This maximum likelihood Gaussian model achieved
state of the art face finding as evaluated by NIST/DARPA
(Moghaddam et al., 2000)
• Top performer after 2000 is Viola-Jones (boosted cascade)
Gaussian Face Recognition
• Instead of modeling face images with a Gaussian, model the difference of two face images with a Gaussian
• The difference of each pair of images in our data is represented as a D-dimensional vector x
• Also have a binary label y: y = 1 means the pair is the same person, y = 0 means it is not
$$x \in \mathbb{R}^D, \qquad y \in \{0,1\}$$
• One Gaussian for same-person deltas ($y_i = 1$: $x_i$ is the difference of two images of the same face), another for different-person deltas ($y_j = 0$: $x_j$ is the difference of two images of different faces)
Gaussian Classification
• Have two classes, each with their own Gaussian:
$$\{(x_1, y_1), \ldots, (x_N, y_N)\}, \qquad x \in \mathbb{R}^D,\ y \in \{0,1\}$$
$$p(\alpha)\, p(\mu)\, p(\Sigma)\, p(y \mid \alpha)\, p(x \mid y, \mu, \Sigma) \qquad \text{(plate replicated over } i = 1 \ldots N\text{)}$$
$$p(y \mid \alpha) = \alpha^y (1-\alpha)^{1-y}, \qquad p(x \mid \mu, \Sigma, y) = N\!\left(x \mid \mu_y, \Sigma_y\right)$$
• Generation: 1) flip a coin, get y; 2) pick Gaussian y, sample x from it
• Maximum Likelihood:
$$\ell = \sum_{i=1}^N \log p(x_i, y_i \mid \alpha, \mu, \Sigma) = \sum_{i=1}^N \log p(y_i \mid \alpha) + \sum_{i=1}^N \log p(x_i \mid y_i, \mu, \Sigma)$$
$$= \sum_{i=1}^N \log p(y_i \mid \alpha) + \sum_{y_i \in 0} \log p(x_i \mid \mu_0, \Sigma_0) + \sum_{y_i \in 1} \log p(x_i \mid \mu_1, \Sigma_1)$$
Gaussian Classification
• Max Likelihood can be done separately for the 3 terms:
$$\ell = \sum_{i=1}^N \log p(y_i \mid \alpha) + \sum_{y_i \in 0} \log p(x_i \mid \mu_0, \Sigma_0) + \sum_{y_i \in 1} \log p(x_i \mid \mu_1, \Sigma_1)$$
• Count # of pos & neg examples (class prior): $\alpha = \frac{N_1}{N_0 + N_1}$
• Get the mean & covariance of the negatives and of the positives:
$$\mu_0 = \frac{1}{N_0}\sum_{y_i \in 0} x_i, \qquad \Sigma_0 = \frac{1}{N_0}\sum_{y_i \in 0} (x_i - \mu_0)(x_i - \mu_0)^T$$
$$\mu_1 = \frac{1}{N_1}\sum_{y_i \in 1} x_i, \qquad \Sigma_1 = \frac{1}{N_1}\sum_{y_i \in 1} (x_i - \mu_1)(x_i - \mu_1)^T$$
• Given an (x, y) pair, can now compute the likelihood $p(x, y)$
• To make a classification, a bit of Decision Theory
• Without x, can compute the prior guess for y: $p(y)$
• Given x and wanting y, need the posterior $p(y \mid x)$
• Bayes Optimal Decision: $\hat{y} = \arg\max_{y \in \{0,1\}} p(y \mid x)$
• Optimal iff we have the true probability
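A minimal sketch of the two-Gaussian classifier on this and the following slide: closed-form fits for α and the per-class means and covariances, then Bayes' rule for the posterior. Using scipy for the Gaussian density, the toy data, and all names are my choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_two_gaussians(X, y):
    """Closed-form ML: class prior alpha, per-class means and covariances."""
    X0, X1 = X[y == 0], X[y == 1]
    alpha = len(X1) / len(X)                       # alpha = N1 / (N0 + N1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sigma0 = np.cov(X0.T, bias=True)               # (1/N0) sum (x_i - mu0)(x_i - mu0)^T
    Sigma1 = np.cov(X1.T, bias=True)
    return alpha, (mu0, Sigma0), (mu1, Sigma1)

def posterior_same(x, alpha, g0, g1):
    """p(y=1 | x) by Bayes' rule over the two class-conditional Gaussians."""
    p0 = (1 - alpha) * multivariate_normal(g0[0], g0[1]).pdf(x)
    p1 = alpha * multivariate_normal(g1[0], g1[1]).pdf(x)
    return p1 / (p0 + p1)

# Toy usage with 2-D synthetic data.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.r_[np.zeros(100), np.ones(100)]
alpha, g0, g1 = fit_two_gaussians(X, y)
print(posterior_same(np.array([1.0, 1.0]), alpha, g0, g1))   # Bayes decision: > 0.5 -> y=1
```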
Gaussian Classification
• Example cases, plotting the decision boundary where $p(y=1 \mid x) = 0.5$:
$$p(y=1 \mid x) = \frac{p(x, y=1)}{p(x, y=0) + p(x, y=1)} = \frac{\alpha\, N(x \mid \mu_1, \Sigma_1)}{(1-\alpha)\, N(x \mid \mu_0, \Sigma_0) + \alpha\, N(x \mid \mu_1, \Sigma_1)}$$
• If covariances are equal: linear decision boundary
• If covariances are different: quadratic decision boundary
Intra-Extra Personal Gaussians
• Intrapersonal Gaussian model $N(x \mid \mu_1, \Sigma_1)$: covariance is approximated by these eigenvectors:
• Extrapersonal Gaussian model $N(x \mid \mu_0, \Sigma_0)$: covariance is approximated by these eigenvectors:
• Question: what are the Gaussian means?
• Probability a pair is the same person:
$$p(y=1 \mid x) = \frac{\alpha\, N(x \mid \mu_1, \Sigma_1)}{(1-\alpha)\, N(x \mid \mu_0, \Sigma_0) + \alpha\, N(x \mid \mu_1, \Sigma_1)}$$
Other Standard Bases
• There are other choices for the eigenvectors, not just PCA
• Could pick eigenvectors without looking at the data
• Just for their interesting properties:
$$x_i \approx \mu + \sum_{j=1}^C c_{ij}\, v_j$$
• Fourier basis: denoises, only keeps smooth parts of image
• Wavelet basis: localized or windowed Fourier
• PCA: optimal least squares linear dataset reconstruction
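A small illustrative sketch of the "keeps smooth parts" property of a data-independent Fourier basis: reconstruct an image from only its low spatial frequencies. The cutoff, the random image, and the function name are arbitrary choices of mine.

```python
import numpy as np

def fourier_lowpass(image, keep):
    """Reconstruct an image from only its `keep` lowest spatial frequencies."""
    F = np.fft.fft2(image)
    Fs = np.fft.fftshift(F)                   # put low frequencies at the center
    H, W = image.shape
    mask = np.zeros_like(Fs)
    cy, cx = H // 2, W // 2
    mask[cy - keep:cy + keep, cx - keep:cx + keep] = 1.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(Fs * mask)))

img = np.random.default_rng(6).normal(size=(64, 64))
smooth = fourier_lowpass(img, keep=8)         # keeps only the smooth structure
print(np.linalg.norm(img - smooth) / np.linalg.norm(img))
```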
Basis for Natural Images
• What happens if we do PCA on all natural images instead of just faces?
• Get difference-of-Gaussian bases
• Like a Gabor or Wavelet basis
• Not specific like faces
• Multi-scale & orientation
• Also called steerable filters
• Similar to visual cortex
Problems with Linear Bases
• Coefficient representation changes wildly if the image rotates, and so does p(x)
• The eigenspace is sensitive to rotations, translations and transformations of the image
• Simple linear/Gaussian/PCA models are not enough
• What worked for aligned faces breaks for general image datasets
• Most of the PCA eigenvectors and spectrum energy is wasted due to NONLINEAR EFFECTS…
$$c_{ij} = (x_i - \mu)^T v_j$$