Learning representations
By Yang Zechao
Based on the Machine Learning course (CMU 10-701, Spring 2011):
www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml
Neural Nets for Face Recognition
Learning Lower Dimensional Representations
• Supervised learning of a lower-dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower-dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)
Principal Components Analysis
• Idea:
  – Given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible
    • E.g., find the best planar approximation to 3D data
    • E.g., find the best planar approximation to 10^4-dimensional data
  – In particular, choose the projection that minimizes the squared error in reconstructing the original data
Principal Components Analysis
• Like auto-encoding neural networks, PCA learns a re-representation of the input data that can best reconstruct it.
Principal Components Analysis
• The learned encoding is a linear function of the inputs (not logistic)
• No local-minimum problems when training!
• Given d-dimensional data X, PCA learns a d-dimensional representation, where
  – the dimensions are orthogonal
  – the top k dimensions are the k-dimensional linear re-representation that minimizes reconstruction error (sum of squared errors)
Principal Components Analysis
Assume the data is a set of d-dimensional vectors, where the nth vector is $x^n = \langle x_1^n, \ldots, x_d^n \rangle$.

We can represent these in terms of any d orthogonal unit vectors $u_1, \ldots, u_d$:

$$x^n = \sum_{i=1}^{d} z_i^n u_i, \qquad u_i^T u_j = \delta_{ij}$$

So, PCA: given M < d, find $\langle u_1, \ldots, u_M \rangle$ that minimizes

$$E_M \equiv \frac{1}{N} \sum_{n=1}^{N} \left\| x^n - \hat{x}^n \right\|^2$$

where $\hat{x}^n = \bar{x} + \sum_{i=1}^{M} z_i^n u_i$ and $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x^n$.
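A minimal NumPy sketch of this objective, on synthetic data and with an arbitrary orthonormal basis (both illustrative, not from the slides): it forms $\hat{x}^n$ from the mean plus M retained directions and evaluates $E_M$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # N = 200 points in d = 5 dimensions
x_bar = X.mean(axis=0)

# Any orthonormal basis u_1 ... u_d satisfies the identities above; PCA will
# later choose the M directions that make E_M smallest.
U, _ = np.linalg.qr(rng.normal(size=(5, 5)))     # columns are orthonormal vectors
M = 2
U_M = U[:, :M]                                   # keep the first M directions

Z = (X - x_bar) @ U_M                            # coefficients z_i^n for the kept directions
X_hat = x_bar + Z @ U_M.T                        # reconstructions x_hat^n = x_bar + sum_i z_i^n u_i
E_M = np.mean(np.sum((X - X_hat) ** 2, axis=1))  # (1/N) * sum_n ||x^n - x_hat^n||^2
print(E_M)
```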
Principal Components Analysis
• Note we get zero error if M = d, so all of the error is due to the missing components.
• Therefore,

$$E_M = \frac{1}{N} \sum_{i=M+1}^{d} \sum_{n=1}^{N} \left( u_i^T (x^n - \bar{x}) \right)^2 = \sum_{i=M+1}^{d} u_i^T \Sigma u_i$$

where the covariance matrix is $\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x^n - \bar{x})(x^n - \bar{x})^T$.
Principal Components Analysis
• Minimize $E_M = \sum_{i=M+1}^{d} u_i^T \Sigma u_i$ subject to $u_i^T u_i = 1$.
• Using a Lagrange multiplier for each constraint, minimize
$$u_i^T \Sigma u_i + \lambda_i \left( 1 - u_i^T u_i \right)$$
• Setting the derivative to zero gives $\Sigma u_i = \lambda_i u_i$, i.e., each $u_i$ is an eigenvector of $\Sigma$.
• So $E_M = \sum_{i=M+1}^{d} \lambda_i$: the error is minimized by discarding the directions with the smallest eigenvalues.
PCA algorithm 1
1. X ← create the N × d data matrix, with one row vector $x^n$ per data point
2. X ← subtract the mean $\bar{x}$ from each row vector $x^n$ in X
3. Σ ← covariance matrix of X
4. Find the eigenvectors and eigenvalues of Σ
5. PCs ← the M eigenvectors with the largest eigenvalues
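A minimal NumPy sketch of the five steps above (the data, function name, and variable names are illustrative):

```python
import numpy as np

def pca_eig(X, M):
    """Return the top-M principal components and the projected data."""
    X = np.asarray(X, dtype=float)            # 1. N x d data matrix
    X_centered = X - X.mean(axis=0)           # 2. subtract the mean from each row
    Sigma = np.cov(X_centered, rowvar=False)  # 3. d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # 4. eigenvalues/eigenvectors (Sigma is symmetric)
    order = np.argsort(eigvals)[::-1]         #    sort eigenvalues, largest first
    components = eigvecs[:, order[:M]]        # 5. the M eigenvectors with largest eigenvalues
    return components, X_centered @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
U_M, Z = pca_eig(X, M=3)                      # Z holds the 3-D re-representation
```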
Very Nice When Initial Dimension Not Too Big
• What about very high-dimensional data?
  – e.g., images (d ≥ 10^4)
• Problem:
  – The covariance matrix Σ is d × d
  – d = 10^4 ⇒ Σ has 10^8 entries
• Singular Value Decomposition (SVD) to the rescue!
  – Pretty efficient algorithms are available, including Matlab's SVD
  – Some implementations find just the top N eigenvectors
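A short sketch of that point, assuming scikit-learn is available: a randomized SVD solver extracts just the top-M components from the N × d data matrix without ever forming the d × d covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

# e.g. 1000 images with d = 10^4 pixels; building the 10^4 x 10^4 covariance
# matrix (10^8 entries) is avoided entirely.
X = np.random.default_rng(0).normal(size=(1000, 10_000))
pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
Z = pca.fit_transform(X)    # coordinates in the top-50 principal components
print(Z.shape)              # (1000, 50)
```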
SVD
Decompose the (zero-centered) data matrix as $X = U S V^T$:
• X is the data matrix, one row per data point
• U gives the coordinates of the rows of X in the space of principal components
• S is diagonal, with $s_k > s_{k+1}$; $s_k^2$ is the kth eigenvalue
• The rows of $V^T$ are unit-length eigenvectors of $X^T X$
Singular Value Decomposition
To generate the principal components:
• Subtract the mean $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x^n$ from each data point to create zero-centered data
• Create the matrix X with one row vector per (zero-centered) data point
• Solve the SVD: $X = U S V^T$
• Output the principal components: the columns of V (= rows of $V^T$)
  – The eigenvectors in V are sorted from largest to smallest eigenvalue
  – S is diagonal, with $s_k^2$ giving the eigenvalue for the kth eigenvector
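A minimal NumPy sketch of this recipe (synthetic data; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                    # N = 300 points, d = 8

x_bar = X.mean(axis=0)
Xc = X - x_bar                                   # zero-centered data, one row per point

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
# Rows of Vt are unit-length eigenvectors of Xc^T Xc, already sorted so that
# s[k] >= s[k+1]; s[k]**2 is the k-th eigenvalue of Xc^T Xc (divide by N for
# the covariance-matrix convention used on the earlier slides).
principal_components = Vt                        # row k = k-th principal component
eigenvalues = s ** 2 / len(Xc)
pc_coordinates = U * s                           # row n = PC coordinates of the n-th point
```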
Singular Value Decomposition
• To project a point (column vector x) into PC coordinates: $V^T x$
• If $x_i$ is the ith row of the data matrix X, then $V^T x_i^T$, i.e. the ith row of $U S$, gives the PC coordinates of $x_i$
• To project a column vector x onto the M-dimensional principal-components subspace, take just the first M coordinates of $V^T x$
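A minimal NumPy sketch of this projection step (the data and the new point x are illustrative; the point is centered with the training mean before projecting, matching the zero-centering in the recipe above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
x_bar = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)

M = 3
x = rng.normal(size=8)                 # a new point, as a length-d vector
z = Vt[:M] @ (x - x_bar)               # first M coordinates of V^T (x - x_bar)
x_approx = x_bar + Vt[:M].T @ z        # reconstruction from the M-dim subspace
```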
Independent Components Analysis
• PCA seeks orthogonal directions $\langle Y_1, \ldots, Y_M \rangle$ in the feature space X that minimize reconstruction error
• ICA instead seeks directions $\langle Y_1, \ldots, Y_M \rangle$ that are maximally statistically independent, i.e., that minimize I(Y), the mutual information between the $Y_j$:
$$I(Y) = \sum_{j=1}^{M} H(Y_j) - H(Y_1, \ldots, Y_M)$$
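A minimal sketch of the contrast, assuming scikit-learn's FastICA is available: two independent source signals are mixed linearly, ICA recovers maximally independent directions, while PCA only finds orthogonal directions of maximum variance. The signals and mixing matrix are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                      # independent source 1 (sinusoid)
s2 = np.sign(np.sin(3 * t))             # independent source 2 (square wave)
S = np.c_[s1, s2]
A = np.array([[1.0, 0.5], [0.5, 2.0]])  # mixing matrix
X = S @ A.T                             # observed linear mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)            # ICA: maximally independent directions
Z_pca = PCA(n_components=2).fit_transform(X)  # PCA: orthogonal, min. reconstruction error
```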
Learning Lower Dimensional Representations
• Supervised learning of a lower-dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower-dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)
Dimensionality reduction across multiple datasets
• Given data sets A and B, find linear projections of each into a common lower-dimensional space!
  – Generalized SVD: minimize the squared reconstruction errors of both
  – Canonical correlation analysis: maximize the correlation of A and B in the projected space
Canonical Correlation Analysis
• Measures the linear relationship between two multidimensional variables
• Finds two sets of basis vectors such that the correlation between the projections of the variables onto these basis vectors is maximized
• Determines the correlation coefficients
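A minimal sketch, assuming scikit-learn's CCA: two synthetic "views" A and B share a latent signal, and CCA finds paired projections whose correlation is maximal. All data here is illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                  # shared underlying signal
A = latent @ rng.normal(size=(2, 6)) + 0.2 * rng.normal(size=(500, 6))
B = latent @ rng.normal(size=(2, 4)) + 0.2 * rng.normal(size=(500, 4))

cca = CCA(n_components=2)
cca.fit(A, B)
A_c, B_c = cca.transform(A, B)                      # projections into the common space

# The correlation of the paired canonical variates is what CCA maximizes.
for k in range(2):
    r = np.corrcoef(A_c[:, k], B_c[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.2f}")
```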
Learning Lower Dimensional Representations
• Supervised learning of a lower-dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower-dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)
Deep Belief Networks
• Problem: training networks with many hidden layers doesn't work very well
  – local minima; very slow training if initialized with zero weights
• Deep belief networks
  – use autoencoder networks to learn low-dimensional encodings
  – but with more layers, to learn better encodings
Deep Belief Networks
[Figure: face image reconstructions. Second row: reconstructed by a 2000-1000-500-30 DBN; third row: reconstructed by 2000-300 linear PCA.]
[Figure: encoding of digit images in two dimensions, comparing a 784-2 linear encoding (PCA) with a 784-1000-500-250-2 DBN.]
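A minimal sketch (assuming PyTorch is available) of a deep autoencoder with the 784-1000-500-250-2 shape mentioned above; the layer-wise RBM pretraining used in deep belief networks is omitted here, and the network is simply trained end to end to minimize reconstruction error.

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 784 -> 1000 -> 500 -> 250 -> 2
        self.encoder = nn.Sequential(
            nn.Linear(784, 1000), nn.Sigmoid(),
            nn.Linear(1000, 500), nn.Sigmoid(),
            nn.Linear(500, 250), nn.Sigmoid(),
            nn.Linear(250, 2),                   # 2-D code layer (linear)
        )
        # Decoder mirrors the encoder: 2 -> 250 -> 500 -> 1000 -> 784
        self.decoder = nn.Sequential(
            nn.Linear(2, 250), nn.Sigmoid(),
            nn.Linear(250, 500), nn.Sigmoid(),
            nn.Linear(500, 1000), nn.Sigmoid(),
            nn.Linear(1000, 784), nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = DeepAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)            # stand-in for a batch of flattened digit images
reconstruction, code = model(x)
loss = loss_fn(reconstruction, x)  # reconstruction error drives the 2-D encoding
loss.backward()
optimizer.step()
```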
Learning Lower Dimensional Representations
• Supervised learning of a lower-dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower-dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)
Fisher linear discriminant
Objective:
• LDA seeks to reduce dimensionality while preserving as much of the class-discriminatory information as possible
• We seek to obtain a scalar y by projecting the samples x onto a line:
$$y = w^T x$$
Fisher linear discriminant
• Of all the possible lines, we would like to select the one that maximizes the separability of the projected scalars
• In order to find a good projection vector, we need to define a measure of separation
Fisher linear discriminant
• Define the class means (for two classes $C_1, C_2$ with $N_1, N_2$ samples) and their projections:
$$\mu_i = \frac{1}{N_i} \sum_{x \in C_i} x, \qquad \tilde{\mu}_i = w^T \mu_i$$
• We could choose w to maximize the separation of the projected means:
$$J(w) = \left| \tilde{\mu}_1 - \tilde{\mu}_2 \right| = \left| w^T (\mu_1 - \mu_2) \right|$$
Fisher Linear Discriminant
• For each class we define the scatter, an equivalent of the variance, as
$$\tilde{s}_i^2 = \sum_{y \in \text{class } i} \left( y - \tilde{\mu}_i \right)^2$$
• The Fisher linear discriminant chooses the w that maximizes
$$J(w) = \frac{\left( \tilde{\mu}_1 - \tilde{\mu}_2 \right)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
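A minimal NumPy sketch of the two-class case: the w maximizing J(w) is proportional to $S_W^{-1} (\mu_1 - \mu_2)$, where $S_W$ is the within-class scatter matrix. The synthetic two-class data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))   # class 1 samples
X2 = rng.normal(loc=[3, 1], scale=1.0, size=(100, 2))   # class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix S_W = S_1 + S_2
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
S_W = S1 + S2

# Fisher direction and the resulting 1-D projections y = w^T x
w = np.linalg.solve(S_W, mu1 - mu2)
y1, y2 = X1 @ w, X2 @ w

# The projected class means are now well separated relative to the
# projected within-class scatter, which is exactly what J(w) measures.
```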
Fisher Linear Discriminant
• Choose an (n−1)-dimensional projection for an n-class classification problem
• Use the within-class covariances to determine the projection
• Minimizes a different error function than PCA (the projected within-class variances)
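A short sketch of the multi-class case, assuming scikit-learn is available: with n classes at most n − 1 discriminant directions exist, so the 3-class Iris data can be projected to at most 2 dimensions.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                 # 3 classes, 4 features
lda = LinearDiscriminantAnalysis(n_components=2)  # at most n_classes - 1 = 2 directions
Z = lda.fit_transform(X, y)                       # supervised 2-D projection
print(Z.shape)                                    # (150, 2)
```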
Thank You!