Random Projections in Dimensionality Reduction: Applications to Image and Text Data
Based on the paper by Ella Bingham and Heikki Mannila
Presented by Ângelo Cardoso, IST/UTL, November 2009

Outline (slide 2)
1. Dimensionality Reduction: Motivation
2. Methods for dimensionality reduction
   1. PCA
   2. DCT
   3. Random Projection
3. Results on Image Data
4. Results on Text Data
5. Conclusions

Dimensionality Reduction: Motivation (slide 3)
- Many applications have high-dimensional data
  - Market basket analysis: wealth of alternative products
  - Text: large vocabulary
  - Image: large image window
- We want to process the data, but its high dimensionality restricts the choice of processing methods
  - The time needed to run the processing methods is too long
  - Memory requirements make it impossible to use some methods

Dimensionality Reduction: Motivation (slide 4)
- We want to visualize high-dimensional data
- Some features may be irrelevant
- Some dimensions may be highly correlated with others, e.g. height and foot size
- The "intrinsic" dimensionality may be smaller than the number of features: the data can be best described and understood by a smaller number of dimensions

Methods for dimensionality reduction (slide 5)
- Main idea: project the high-dimensional (d) space into a lower-dimensional (k) space
- A statistically optimal way is to project onto a lower-dimensional orthogonal subspace that captures as much of the variation of the data as possible for the chosen k
- The best (in terms of mean squared error) and most widely used way to do this is PCA
- How to compare different methods?
  - Amount of distortion caused
  - Computational complexity

Principal Components Analysis (PCA): Intuition (slide 6)
- Given an original space in 2-d, how can we represent the points in a k-dimensional space (k <= d) while preserving as much information as possible?
- [Figure: 2-d scatter of data points, showing the first and second principal components and the original axes]

Principal Components Analysis (PCA): Algorithm (slide 7)
Algorithm:
1. X <- N x d data matrix, with one row vector x_n per data point
2. X <- subtract the mean x̄ from each dimension of X
3. Σ <- covariance matrix of X
4. Find the eigenvectors and eigenvalues of Σ
5. PCs <- the k eigenvectors with the largest eigenvalues
Notes:
- Singular Value Decomposition (SVD) can be used to find the eigenvectors and eigenvalues of the covariance matrix
- The eigenvalues are a measure of how much of the data variance is explained by each eigenvector
- To project into the lower-dimensional space: subtract the mean of X in each dimension and multiply by the principal components (PCs)
- To restore into the original space: multiply the projection by the principal components and add the mean of X in each dimension

Random Projection (RP): Idea (slide 8)
- PCA, even when computed using SVD, is computationally expensive
  - Complexity is O(dcN), where d is the number of dimensions, c is the average number of non-zero entries per column and N is the number of points
- Idea: what if we constructed the "principal component" vectors randomly?
- Johnson-Lindenstrauss lemma: if points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved

Random Projection (RP): Idea (slide 9)
- Use a random matrix R in place of the principal components matrix; R is usually Gaussian distributed
- Complexity is O(kcN)
- The generated random matrix R is usually not orthogonal, and making it orthogonal is computationally expensive
- However, we can rely on a result by Hecht-Nielsen: in a high-dimensional space there exists a much larger number of almost orthogonal than exactly orthogonal directions, so vectors with random directions are close enough to orthogonal
- The Euclidean distance in the projected space can be scaled back to the original space by √(d/k) (see the sketch below)
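A minimal sketch (not part of the original deck) of the idea on slide 9: project toy data with a Gaussian random matrix whose columns are normalised to unit length, then check that a pairwise Euclidean distance is approximately recovered after scaling by √(d/k). The data, dimensions and variable names are illustrative assumptions, not the paper's experimental setup.

```python
# Gaussian random projection sketch: Euclidean distances are approximately
# preserved once the projected distance is rescaled by sqrt(d/k).
import numpy as np

rng = np.random.default_rng(0)

d, k, N = 2500, 50, 100            # original dim, reduced dim, number of points (toy values)
X = rng.normal(size=(N, d))        # toy data matrix, one row per data point

R = rng.normal(size=(d, k))        # i.i.d. N(0, 1) entries
R /= np.linalg.norm(R, axis=0)     # normalise each random direction to unit length (not orthogonalised)
X_rp = X @ R                       # projected data, N x k

orig = np.linalg.norm(X[0] - X[1])                         # distance in the original space
proj = np.sqrt(d / k) * np.linalg.norm(X_rp[0] - X_rp[1])  # scaled distance in the projected space
print(f"original: {orig:.2f}  projected (scaled by sqrt(d/k)): {proj:.2f}")
```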
Random Projection: Simplified Random Projection (SRP) (slide 10)
- The random matrix is usually Gaussian distributed (mean 0, standard deviation 1)
- Achlioptas showed that a much simpler distribution can be used: entries take the values +√3, 0 and -√3 with probabilities 1/6, 2/3 and 1/6
- This implies further computational savings, since the matrix is sparse and the computations can be performed using integer arithmetic (a small sketch is given after the last slide)

Discrete Cosine Transform (DCT) (slide 11)
- A widely used method for image compression
- Optimal for the human eye: the distortions are introduced at the highest frequencies, which humans tend to neglect as noise
- DCT is not data-dependent, in contrast to PCA, which needs the eigenvalue decomposition; this makes DCT orders of magnitude cheaper to compute

Results: Noiseless Images (slides 12-14)
- Original space: 2500-d (50x50-pixel image windows)
- Error measurement: average error in the Euclidean distance between 100 pairs of images, computed in the original and in the reduced space
- Amount of distortion
  - RP and SRP give accurate results already for very small k (k > 10); the √(d/k) distance scaling might be an explanation for this success
  - PCA gives accurate results only for k > 600; in PCA such scaling is not straightforward
  - DCT still has a significant error even for k > 600
- Computational complexity
  - The number of floating point operations for RP and SRP is on the order of 100 times smaller than for PCA
  - RP and SRP clearly outperform PCA and DCT at the smallest dimensions

Results: Noisy Images (slide 15)
- Images were corrupted by salt-and-pepper impulse noise with probability 0.2
- The error is computed in the high-dimensional noiseless space
- RP, SRP, PCA and DCT all perform quite similarly to the noiseless case

Results: Text Data (slides 16-18)
- Data set: Newsgroups corpus (sci.crypt, sci.med, sci.space, soc.religion)
- Pre-processing
  - Term frequency vectors; some common terms were removed but no stemming was used
  - Document vectors normalised to unit length
  - The data was not made zero mean
- Size: 5000 terms, 2262 newsgroup documents
- Error measurement: 100 pairs of documents were randomly selected and the error between their cosine similarity before and after the dimensionality reduction was computed
- The cosine was used as the similarity measure since it is more common for this task
- RP is not as accurate as SVD; the Johnson-Lindenstrauss result states that Euclidean distances, not cosines, are well preserved under random projection
- Still, the RP error can be neglected in most applications
- RP can be used on large document collections at a lower computational cost than SVD

Conclusion (slide 19)
- Random projection is an effective dimensionality reduction method for high-dimensional real-world data sets
- RP preserves similarities well even when the data is projected into a moderate number of dimensions
- RP is beneficial in applications where the distances of the original space are meaningful
- RP is a good alternative to traditional dimensionality reduction methods, which become infeasible for high-dimensional data; RP does not suffer from the curse of dimensionality

Questions (slide 20)
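Supplementary sketch (not a slide from the original deck): a simplified random projection in the spirit of the Achlioptas distribution referenced on slide 10, applied to toy sparse "term frequency" vectors, followed by the kind of cosine comparison used in the text-data experiments. The data, dimensions and names are illustrative assumptions.

```python
# Simplified (sparse) random projection: entries are sqrt(3) * {+1, 0, -1}
# with probabilities 1/6, 2/3, 1/6, so most of the matrix is zero.
import numpy as np

rng = np.random.default_rng(0)

d, k, N = 5000, 100, 200                               # vocabulary size, reduced dim, documents (toy values)
X = rng.random((N, d)) * (rng.random((N, d)) < 0.05)   # toy sparse term-frequency vectors
X /= np.linalg.norm(X, axis=1, keepdims=True)          # normalise documents to unit length

R = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
X_srp = X @ R                                          # projected documents, N x k

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine similarity of two documents before and after the projection.
print(f"original: {cosine(X[0], X[1]):.3f}  projected: {cosine(X_srp[0], X_srp[1]):.3f}")
```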