Learning Near-Isometric Linear Embeddings
Chinmay Hegde (MIT), Richard Baraniuk (Rice University), Aswin Sankaranarayanan (CMU), Wotao Yin (UCLA)

[Opening slides: "Edward Snowden, Ex-NSA"; images of the NSA PRISM program handling 4972 Gbps. Source: Wikipedia.org]

Large-Scale Datasets and Intrinsic Dimensionality
• Intrinsic dimension << extrinsic dimension!
• Why? Geometry, that's why
• Exploit this to perform more efficient analysis and processing of large-scale data

Dimensionality Reduction
• Goal: create a (linear) mapping Φ from R^N to R^M with M < N that preserves the key geometric properties of the data (e.g., the configuration of the data points)
• Given a training set of signals, find the "best" Φ that preserves its geometry
• Approach 1: PCA via the SVD of the training signals
  – finds the best-fitting subspace on average, in the least-squares sense
  – the averaged error metric can distort point-cloud geometry

Isometric Embedding
• Approach 2: inspired by the RIP (the Restricted Isometry Property, not the Restricted Itinerary Property [Maduro, Snowden '13]) and by Whitney
  – design Φ to preserve inter-point distances (secants)
  – more faithful to the training data

Near-Isometric Embedding
• Exact isometry, however, can be too much to ask; settle for near-isometry

Why Near-Isometry?
• Sensing
  – guarantees the existence of a recovery algorithm
• Machine learning applications
  – the kernel matrix depends only on pairwise distances
• Approximate nearest neighbors for classification
  – efficient dimensionality reduction

Existence of Near-Isometries
• Johnson-Lindenstrauss Lemma: given a set of Q points, there exists a Lipschitz map that achieves near-isometry (with constant δ) provided M = O(δ^-2 log Q)
• Random matrices with iid sub-Gaussian entries work
  – cf. so-called "compressive sensing"
• So a solution exists! (a quick empirical check follows below)
  – but the constants are poor
  – and the construction is oblivious to the structure of the data
[J-L '84] [Frankl and Maehara '88] [Indyk and Motwani '99] [Achlioptas '01] [Dasgupta and Gupta '02]
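The JL guarantee is easy to probe empirically. Below is a minimal Python sketch (not part of the original talk; the toy point cloud, dimensions, and scaling are illustrative assumptions) that draws an iid Gaussian matrix and measures the worst-case distortion it induces on the normalized secants of a random dataset:

```python
# Empirical check of the JL-style guarantee: a scaled iid Gaussian matrix is a
# near-isometry on the normalized secants of a point cloud (toy sizes assumed).
import numpy as np

rng = np.random.default_rng(0)
N, M, Q = 256, 40, 100                      # ambient dim, embedding dim, #points (assumed)

X = rng.standard_normal((Q, N))             # toy point cloud
iu = np.triu_indices(Q, k=1)
secants = X[iu[0]] - X[iu[1]]               # Q(Q-1)/2 pairwise difference vectors
secants /= np.linalg.norm(secants, axis=1, keepdims=True)    # normalize to unit length

Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # iid Gaussian map, scaled so E||Phi v||^2 = 1
lengths_sq = np.sum((secants @ Phi.T) ** 2, axis=1)
delta = np.max(np.abs(lengths_sq - 1.0))         # empirical isometry constant
print(f"M = {M}: empirical isometry constant delta = {delta:.3f}")
```

In line with the bullets above, the measured δ for a data-oblivious random map is typically far larger than what a projection tailored to the training data can achieve at the same M, which motivates the optimization-based approach that follows.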
Near-Isometric Embedding
• Q. Can we beat random projections?
• A. …
  – on the one hand: there are lower bounds for JL [Alon '03]
  – on the other hand: carefully constructed linear projections can often do better
• Our quest: an optimization-based approach for learning "good" linear embeddings

Normalized Secants [Whitney; Kirby; Wakin, B '09]
• Normalized pairwise difference vectors: v = (x_i − x_j) / ||x_i − x_j||_2
• Goal: approximately preserve the length of every secant under the embedding
• Obviously, a projection that collapses a secant (nearly) to zero is a bad idea
• Note: the total number of secants is large: Q points yield Q(Q−1)/2 = O(Q^2) secants

"Good" Linear Embedding Design
• Given: the normalized secants v_1, ..., v_Q
• Seek: the "shortest" matrix Φ (fewest rows M) such that 1 − δ ≤ ||Φ v_k||_2^2 ≤ 1 + δ for every secant v_k
• Erratum alert: we will use Q to denote both the number of data points and the number of secants

Lifting Trick
• Convert the quadratic constraints in Φ into linear constraints in the lifted variable P = Φ^T Φ: |v_k^T P v_k − 1| ≤ δ
• The number of rows of Φ equals the rank of P, so seek the P of minimum rank
• After designing P, obtain Φ via a matrix square root (e.g., an eigendecomposition of P)

Relaxation
• Rank minimization is intractable in general; relax rank minimization to nuclear-norm minimization

NuMax
• The result is a Semi-Definite Program (SDP): Nuclear-norm minimization with Max-norm constraints (NuMax)
      minimize trace(P) subject to |v_k^T P v_k − 1| ≤ δ for all k, P ⪰ 0
  (a small worked sketch of this program appears after the Squares experiments below)
• Solvable by standard interior-point techniques
• The rank of the solution, and hence the embedding dimension M, is determined by the choice of δ

Practical Considerations
• In practice N is large and Q is very large!
• The computational cost per iteration of an off-the-shelf solver scales poorly in both N and Q

Solving NuMax
• Alternating Direction Method of Multipliers (ADMM):
  – solve for P using spectral thresholding
  – solve for L using least-squares
  – solve for q using "clipping"
• Computational/memory cost per iteration: dominated by a least-squares solve involving matrices with Q^2 rows and an SVD of an N×N matrix

Accelerating NuMax
• Poor scaling with N and Q
  – the least-squares step involves matrices with Q^2 rows
  – the thresholding step requires an SVD of an N×N matrix
• Observation 1
  – intermediate estimates of P are low-rank
  – use a low-rank representation to reduce memory and accelerate computations
  – use an incremental SVD for faster computations
• Observation 2
  – by the KKT conditions (complementary slackness), only the constraints that are satisfied with equality determine the solution ("active constraints")
  – Analogy: recall support vector machines (SVMs), where we solve
        minimize ||w||_2^2 subject to y_i (w^T x_i + b) ≥ 1 for all i
    The solution is determined only by the support vectors – those training points for which y_i (w^T x_i + b) = 1

NuMax-CG
• Hence, given a feasible solution P*, only the secants v_k for which |v_k^T P* v_k − 1| = δ determine the value of P*
• Key: the number of "support secants" << the total number of secants
  – so we only need to track the support secants
  – this yields a "column generation" approach to solving NuMax (NuMax-CG)

Computation Time
• Can solve for datasets with Q = 100k points in N = 1000 dimensions in a few hours

Squares – Near-Isometry (N = 16×16 = 256)
• Images of translating blurred squares live on a K = 2 dimensional smooth manifold in N = 256 dimensional space
• Project a collection of these images into M-dimensional space while preserving structure (as measured by the isometry constant δ)
• M = 40 linear measurements are enough to ensure an isometry constant of δ = 0.01
[Several figure-only slides: Squares – Near Isometry]

Squares – CS Recovery (N = 16×16 = 256)
• Signal recovery in AWGN
[Figure: recovery MSE vs. number of measurements M (15–35) for NuMax vs. random embeddings at SNRs of 20 dB, 6 dB, and 0 dB]
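Before moving on to the MNIST experiments, here is a small worked sketch of the lifted, trace-relaxed program described earlier. It is for exposition only: it uses a generic SDP solver (cvxpy) on a toy secant set rather than the ADMM / column-generation machinery the talk relies on, and all sizes and data are assumptions.

```python
# Minimal sketch of the lifted NuMax program: minimize trace(P) subject to
# |v_k^T P v_k - 1| <= delta and P PSD, on a toy dataset lying on a 2-D subspace.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, Q, delta = 20, 50, 0.1                    # toy sizes and target isometry constant

X = rng.standard_normal((Q, 2)) @ rng.standard_normal((2, N))   # points on a 2-D subspace of R^N
iu = np.triu_indices(Q, k=1)
V = X[iu[0]] - X[iu[1]]
V /= np.linalg.norm(V, axis=1, keepdims=True)                   # normalized secants

P = cp.Variable((N, N), PSD=True)                # lifted variable P = Phi^T Phi
dists = cp.sum(cp.multiply(V @ P, V), axis=1)    # v_k^T P v_k for every secant
prob = cp.Problem(cp.Minimize(cp.trace(P)),      # nuclear norm = trace for PSD matrices
                  [cp.abs(dists - 1) <= delta])
prob.solve()

eigvals, eigvecs = np.linalg.eigh(P.value)
M = int(np.sum(eigvals > 1e-6 * eigvals.max()))                 # numerical rank ~ embedding dim
Phi = (eigvecs[:, -M:] * np.sqrt(eigvals[-M:])).T               # matrix square root: P ~ Phi^T Phi
print("embedding dimension M (rank of P):", M)
```

For data lying exactly on a low-dimensional subspace one expects a correspondingly low-rank solution; the interesting regime in the talk is large N and Q, where the ADMM and column-generation accelerations become essential.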
MNIST (8) – Near-Isometry (N = 20×20 = 400)
• M = 14 basis functions achieve δ = 0.05

MNIST – NN Classification
• MNIST dataset
  – N = 20×20 = 400-dimensional images
  – 10 classes: digits 0–9
  – Q = 60000 training images
• Nearest-neighbor (NN) classifier, tested on 10000 images
• Misclassification rate of the NN classifier: 3.63%

MNIST – Naïve NuMax Classification
• MNIST dataset
  – N = 20×20 = 400-dimensional images
  – 10 classes: digits 0–9
  – Q = 60000 training images, so >1.8 billion secants!
  – NuMax-CG took 3–4 hours to process
• Misclassification rates (in %) of the NN classifier after embedding (baseline without embedding: 3.63%):

    δ      Rank of NuMax solution    NuMax    Gaussian    PCA
    0.40   72                        2.99     5.79        4.40
    0.25   97                        3.11     4.51        4.38
    0.10   167                       3.31     3.88        4.41

• NuMax provides the best NN-classification rates

Task Adaptivity
• Prune the secants according to the task at hand
  – if the goal is signal reconstruction, then preserve all secants
  – if the goal is signal classification, then preserve inter-class secants differently from intra-class secants
• Can preferentially weight the training-set vectors according to their importance (connections with boosting)

Optimized Classification
• Intra-class secants are constrained only to not expand; inter-class secants are constrained only to not shrink (a sketch of this modified constraint set appears after the LabelMe slide below)
• This simple modification improves NN classification rates while using even fewer measurements
• The optimized-classification formulation is otherwise the same as NuMax
• Can expect a smaller-rank solution (smaller M)
  – a consequence of having fewer constraints than NuMax
• Can expect improved NN classification
  – intra-class secants will "shrink" while inter-class secants will "expand"
  – after embedding, (on average) each data point will have more neighbors from its own class
• MNIST results (same setup as above: >1.8 billion secants; NuMax-CG took 3–4 hours):

    δ      Algorithm      Rank    Misclassification rate (%)
    0.40   NuMax          72      2.99
    0.40   NuMax-Class    52      2.68
    0.25   NuMax          97      3.11
    0.25   NuMax-Class    69      2.72
    0.10   NuMax          167     3.31
    0.10   NuMax-Class    116     3.09

  1. Significant reduction in the number of measurements (M)
  2. Significant improvement in classification rate

CVDomes Radar Signals
• Training data: 2000 secants (inter-class, joint)
• Test data: 100 signatures from each class

Image Retrieval on LabelMe
• Goal: preserve neighborhood structure
• N = 512, Q = 4000; M = 45 suffices
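The task-adaptive "Optimized Classification" variant above changes only the constraint set of the program sketched earlier: intra-class secants get just an upper bound (they are allowed to shrink) and inter-class secants get just a lower bound (they are allowed to expand). The snippet below is a hedged illustration under assumed toy data and labels, not the authors' implementation.

```python
# Sketch of the classification-oriented constraint set: one-sided constraints per secant type.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
N, Q, delta = 20, 40, 0.1
X = rng.standard_normal((Q, N))                 # toy data (assumed)
labels = rng.integers(0, 2, size=Q)             # two hypothetical classes

iu = np.triu_indices(Q, k=1)
V = X[iu[0]] - X[iu[1]]
V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalized secants
intra = np.flatnonzero(labels[iu[0]] == labels[iu[1]])   # intra-class secant indices
inter = np.flatnonzero(labels[iu[0]] != labels[iu[1]])   # inter-class secant indices

P = cp.Variable((N, N), PSD=True)
d = cp.sum(cp.multiply(V @ P, V), axis=1)       # v_k^T P v_k
constraints = [d[intra] <= 1 + delta,           # intra-class secants are not expanded
               d[inter] >= 1 - delta]           # inter-class secants are not shrunk
cp.Problem(cp.Minimize(cp.trace(P)), constraints).solve()

eigvals = np.linalg.eigvalsh(P.value)
print("embedding dimension (numerical rank):", int(np.sum(eigvals > 1e-6 * eigvals.max())))
```

Because roughly half of the two-sided constraints are dropped, the solution tends to have lower rank than the plain program at the same δ, which is the behavior reported in the MNIST table above.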
NuMax: Analysis
• The performance of NuMax depends upon the tightness of the convex relaxation
  – Q. When is this relaxation tight?
  – A. Open problem, likely very hard
• However, we can rigorously analyze the case where Φ is further constrained to be orthonormal
  – essentially enforces that the rows of Φ are (i) unit-norm and (ii) pairwise orthogonal
• Upshot: models a per-sample energy constraint of a CS acquisition system
  – different measurements necessarily probe "new" portions of the signal space
  – measurements remain uncorrelated, so noise/perturbations in the input data are not amplified

Slight Refinement
1. Look at the converse problem: fix the embedding dimension M and solve for the linear embedding with minimum distortion δ, as a function of M
   – does not change the problem qualitatively
2. Restrict the problem to the space of orthonormal embeddings (i.e., impose the orthonormality constraint Φ Φ^T = I)
• As in NuMax: lifting + trace-norm relaxation
• Efficient solution algorithms (NuMax, NuMax-CG) remain essentially unchanged
• However, solutions now come with guarantees …

Analytical Guarantee
• Theorem [Grant, Hegde, Indyk '13]: Denote the optimal distortion obtained by a rank-M orthonormal embedding as δ*_M. Then, by solving an SDP, we can efficiently construct a rank-2M embedding with distortion at most δ*_M.
• i.e., one can get close to the optimal distortion by paying an additional price in the measurement budget (M)

Conclusions
• Never trust your system administrator!
• NuMax
  – a new adaptive data representation that is linear and near-isometric
  – optimizes the RIP constant to preserve the geometric information in a set of training signals
• Posed as a rank-minimization problem
  – relaxed to a semi-definite program (SDP)
  – NuMax solves it very efficiently via ADMM and column generation (CG)
• Applications: compressive sensing, classification, retrieval, and more
• A nontrivial extension from signal recovery to signal inference

Extensions
• [Grant, Hegde, Indyk] Specialize to orthonormal projections; a similar algorithm to NuMax achieves near-optimal analytical performance
• [Sadeghian, Bah, Cevher] Place time and energy constraints on the projection; can impose sparsity on the projection; "digital fountain" property
• [H, S, Y, B] Extension to dictionaries

Open Problems
• Equivalence between the solutions of the min-rank and min-trace problems?
• Convergence rate of NuMax
  – preliminary studies show an o(1/k) rate of convergence
• Scaling of the algorithm
  – given a dataset of Q points, the number of secants is O(Q^2)
  – are there alternate formulations that scale linearly/sub-linearly in Q?
• Understanding how the RIP properties weaken from the training dataset to the test dataset

Software
• GNuMax software package at dsp.rice.edu
• PneuMax, the French-version software package, coming soon

References
• C. Hegde, A. C. Sankaranarayanan, W. Yin, and R. G. Baraniuk, "A Convex Approach for Learning Near-Isometric Linear Embeddings," submitted to the Journal of Machine Learning Research, 2012.
• C. Hegde, A. C. Sankaranarayanan, and R. G. Baraniuk, "Near-Isometric Linear Embeddings of Manifolds," IEEE Statistical Signal Processing Workshop (SSP), August 2012.