Learning representations
By Yang Zechao
The Machine Learning course: www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml

Neural Nets for Face Recognition

Learning Lower Dimensional Representations
• Supervised learning of a lower dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)

Principal Components Analysis
• Idea:
  – Given data points in d-dimensional space, project them into a lower dimensional space while preserving as much information as possible
    • E.g., find the best planar approximation to 3D data
    • E.g., find the best planar approximation to 10^4-D data
  – In particular, choose the projection that minimizes the squared error in reconstructing the original data

Principal Components Analysis
• Like auto-encoding neural networks, PCA learns a re-representation of the input data that can best reconstruct it.

Principal Components Analysis
• The learned encoding is a linear function of the inputs (not logistic)
• No local-minimum problems when training!
• Given d-dimensional data X, PCA learns a d-dimensional representation, where
  – the dimensions are orthogonal
  – the top k dimensions form the k-dimensional linear re-representation that minimizes reconstruction error (sum of squared errors)

Principal Components Analysis
Assume the data is a set of d-dimensional vectors, where the nth vector is $x^n = \langle x_1^n, \dots, x_d^n \rangle$.
We can represent these in terms of any d orthonormal vectors $u_1, \dots, u_d$:
$$x^n = \sum_{i=1}^{d} z_i^n u_i, \qquad u_i^T u_j = \delta_{ij}$$
So, PCA: given M < d, find $\langle u_1, \dots, u_M \rangle$ that minimizes
$$E_M \equiv \sum_{n=1}^{N} \left\| x^n - \hat{x}^n \right\|^2$$
where $\hat{x}^n = \bar{x} + \sum_{i=1}^{M} z_i^n u_i$ and $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x^n$ is the mean.

Principal Components Analysis
• Note we get zero error if M = d, so all of the error is due to the missing components.
• Therefore,
$$E_M = \sum_{i=M+1}^{d} \sum_{n=1}^{N} \left( u_i^T (x^n - \bar{x}) \right)^2 = \sum_{i=M+1}^{d} u_i^T \Sigma u_i$$
where the covariance matrix is $\Sigma = \sum_{n=1}^{N} (x^n - \bar{x})(x^n - \bar{x})^T$.

Principal Components Analysis
• Minimize $E_M = \sum_{i=M+1}^{d} u_i^T \Sigma u_i$
• Using a Lagrange multiplier for each constraint $u_i^T u_i = 1$, minimize $u_i^T \Sigma u_i + \lambda_i (1 - u_i^T u_i)$
• Setting the derivative to zero gives $\Sigma u_i = \lambda_i u_i$, so each $u_i$ is an eigenvector of $\Sigma$
• Then $E_M = \sum_{i=M+1}^{d} \lambda_i$: the error is the sum of the discarded eigenvalues, so it is minimized by keeping the top-M eigenvectors

PCA algorithm
1. X ← create the N x d data matrix, with one row vector x^n per data point
2. X ← subtract the mean $\bar{x}$ from each row vector x^n in X
3. Σ ← covariance matrix of X
4. Find the eigenvectors and eigenvalues of Σ
5. PCs ← the M eigenvectors with the largest eigenvalues

Very Nice When the Initial Dimension Is Not Too Big
• What if the data is very high dimensional?
  – e.g., images (d ≥ 10^4)
• Problem:
  – the covariance matrix Σ is of size d x d
  – d = 10^4 ⇒ Σ has 10^8 entries
• Singular Value Decomposition (SVD) to the rescue!
  – pretty efficient algorithms available, including Matlab's SVD
  – some implementations find just the top N eigenvectors
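To make the two routes concrete, here is a minimal NumPy sketch (not part of the original slides) of PCA computed both ways: via the eigendecomposition of the d x d covariance matrix, as in the algorithm above, and via an SVD of the zero-centered data matrix, which never forms the covariance matrix and is the practical choice when d is large. The function names `pca_eig` and `pca_svd` and the self-check at the end are illustrative.

```python
# Minimal sketch of the two PCA routes discussed above:
# (1) eigendecomposition of the d x d covariance matrix, and
# (2) SVD of the zero-centered data matrix, which avoids ever forming
#     the covariance matrix when d is large.
import numpy as np

def pca_eig(X, M):
    """PCA via the covariance matrix (steps 1-5 of the algorithm above)."""
    Xc = X - X.mean(axis=0)                   # step 2: subtract the mean
    Sigma = np.cov(Xc, rowvar=False)          # step 3: d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # step 4: eigen-decomposition
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    return eigvecs[:, order[:M]], eigvals[order[:M]]  # step 5: top-M PCs

def pca_svd(X, M):
    """PCA via SVD of the centered data; never builds the d x d covariance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # rows of Vt are the principal directions; s_k**2 / (N-1) match the
    # eigenvalues of np.cov, which normalizes by N-1
    return Vt[:M].T, (s[:M] ** 2) / (X.shape[0] - 1)

# Example: the two routes agree up to sign on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
W1, _ = pca_eig(X, M=3)
W2, _ = pca_svd(X, M=3)
print(np.allclose(np.abs(W1.T @ W2), np.eye(3), atol=1e-6))
```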
SVD
$$X = U S V^T$$
• X: the data, one row per data point
• U gives the coordinates of the rows of X in the space of principal components
• S is diagonal with $s_k > s_{k+1}$; $s_k^2$ is the kth eigenvalue
• The rows of $V^T$ are unit-length eigenvectors of $X^T X$

Singular Value Decomposition
To generate the principal components:
• Subtract the mean $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x^n$ from each data point, to create zero-centered data
• Create the matrix X with one row vector per (zero-centered) data point
• Solve the SVD: $X = U S V^T$
• Output the principal components: the columns of V (= the rows of $V^T$)
  – the eigenvectors in V are sorted from largest to smallest eigenvalue
  – S is diagonal, with $s_k^2$ giving the eigenvalue for the kth eigenvector

Singular Value Decomposition
• To project a point (column vector x) into PC coordinates: $V^T x$
• If $x_i$ is the ith row of the data matrix X, then the ith row of $XV = US$ gives the PC coordinates of $x_i$
• To project a column vector x onto the M-dimensional principal-components subspace, take just the first M coordinates of $V^T x$

Independent Components Analysis
• PCA seeks orthogonal directions $\langle Y_1, \dots, Y_M \rangle$ in the feature space X that minimize reconstruction error
• ICA seeks directions $\langle Y_1, \dots, Y_M \rangle$ that are most statistically independent, i.e., that minimize I(Y), the mutual information between the $Y_j$:
$$I(Y) = \sum_{j} H(Y_j) - H(Y)$$

Learning Lower Dimensional Representations
• Supervised learning of a lower dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)

Dimensionality reduction across multiple datasets
• Given data sets A and B, find linear projections of each into a common lower dimensional space!
  – Generalized SVD: minimize the squared reconstruction errors of both
  – Canonical correlation analysis: maximize the correlation of A and B in the projected space
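Before the next slide describes CCA in words, here is a minimal NumPy sketch of the idea, assuming the standard whitening-plus-SVD construction of the canonical directions; the function name `cca` and the small ridge term `eps` are my own additions for numerical safety, not part of the slides.

```python
# Minimal sketch of canonical correlation analysis (CCA): find direction
# pairs (a, b) so that the correlation between X a and Y b is maximized.
import numpy as np

def cca(X, Y, n_components=1, eps=1e-8):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    N = X.shape[0]
    Cxx = X.T @ X / (N - 1) + eps * np.eye(X.shape[1])   # covariance of X
    Cyy = Y.T @ Y / (N - 1) + eps * np.eye(Y.shape[1])   # covariance of Y
    Cxy = X.T @ Y / (N - 1)                              # cross-covariance

    def inv_sqrt(C):
        # inverse matrix square root of a symmetric positive-definite matrix
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    # whiten both views, then SVD of the whitened cross-covariance
    K = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(K)
    A = inv_sqrt(Cxx) @ U[:, :n_components]    # projection directions for X
    B = inv_sqrt(Cyy) @ Vt[:n_components].T    # projection directions for Y
    return A, B, s[:n_components]              # s holds the canonical correlations

# Example: two noisy views of a shared 1-D latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(500, 5))
Y = z @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(500, 4))
A, B, corr = cca(X, Y)
print(corr)  # the first canonical correlation should be close to 1
```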
Canonical Correlation Analysis
• Measures the linear relationship between two multi-dimensional variables
• Finds two sets of basis vectors such that the correlation between the projections of the variables onto these basis vectors is maximized
• Determines the correlation coefficients

Learning Lower Dimensional Representations
• Supervised learning of a lower dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)

Deep Belief Networks
• Problem: training networks with many hidden layers doesn't work very well
  – local minima, and very slow training if initialized with zero weights
• Deep belief networks
  – autoencoder networks that learn low dimensional encodings
  – but with more layers, to learn better encodings

Deep Belief Networks
[Figure: image reconstructions. Second row: reconstructed from a 2000-1000-500-30 DBN; third row: reconstructed with linear PCA.]

[Figure: encoding of digit images in two dimensions, comparing a 784-2 linear encoding (PCA) with a 784-1000-500-250-2 DBN.]

Learning Lower Dimensional Representations
• Supervised learning of a lower dimensional representation
  – Hidden layers in neural networks
  – Fisher linear discriminant
• Unsupervised learning of a lower dimensional representation
  – Principal Components Analysis (PCA)
  – Independent Components Analysis (ICA)
  – Canonical Correlation Analysis (CCA)
  – Deep Belief Networks (DBN)

Fisher Linear Discriminant
Objective:
• LDA seeks to reduce dimensionality while preserving as much of the class-discriminatory information as possible
• We seek to obtain a scalar y by projecting the samples x onto a line: $y = w^T x$

Fisher Linear Discriminant
• Of all the possible lines, we would like to select the one that maximizes the separability of the projected scalars
• In order to find a good projection vector, we need to define a measure of separation

Fisher Linear Discriminant
• Define the class means $\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$, with projected means $\tilde{\mu}_i = w^T \mu_i$
• We could choose w to maximize the distance between the projected means, $|\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$

Fisher Linear Discriminant
• For each class we define the scatter, an equivalent of the variance, as $\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$
• The Fisher linear discriminant chooses the w that maximizes
$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

Fisher Linear Discriminant
• Choose an (n-1)-dimensional projection for an n-class classification problem
• Use the within-class covariances to determine the projection
• Minimizes a different error function (the projected within-class variances)

Thank You!
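As a small appendix to the Fisher linear discriminant slides above, here is a minimal two-class sketch in NumPy, assuming the standard closed-form solution $w \propto S_W^{-1}(\mu_1 - \mu_2)$ that maximizes the criterion J(w); the function name, variable names, and the synthetic example are illustrative only.

```python
# Minimal sketch of the two-class Fisher linear discriminant: choose w
# maximizing J(w) = |mu1~ - mu2~|^2 / (s1~^2 + s2~^2), whose solution is
# w proportional to S_W^{-1} (mu1 - mu2).
import numpy as np

def fisher_lda_direction(X1, X2):
    """X1, X2: arrays of shape (N_i, d) holding the samples of each class."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter matrix S_W = S_1 + S_2
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, mu1 - mu2)   # w proportional to S_W^{-1}(mu1 - mu2)
    return w / np.linalg.norm(w)

# Example: project two Gaussian classes onto the discriminant direction.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=[1.0, 3.0], size=(200, 2))
X2 = rng.normal(loc=[3, 1], scale=[1.0, 3.0], size=(200, 2))
w = fisher_lda_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w                  # projected scalars y = w^T x
print(abs(y1.mean() - y2.mean()) / np.sqrt(y1.var() + y2.var()))
```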