RANDOM PROJECTIONS IN DIMENSIONALITY REDUCTION
APPLICATIONS TO IMAGE AND TEXT DATA

Ângelo Cardoso
IST/UTL
November 2009

Based on the paper by Ella Bingham and Heikki Mannila
Outline

1. Dimensionality Reduction – Motivation
2. Methods for dimensionality reduction
   1. PCA
   2. DCT
   3. Random Projection
3. Results on Image Data
4. Results on Text Data
5. Conclusions
Dimensionality Reduction
Motivation

- Many applications have high dimensional data
  - Market basket analysis: wealth of alternative products
  - Text: large vocabulary
  - Image: large image window
- We want to process the data
- High dimensionality of data restricts the choice of data processing methods
  - Time needed to use processing methods is too long
  - Memory requirements make it impossible to use some methods
Dimensionality Reduction
Motivation

- We want to visualize high dimensional data
- Some features may be irrelevant
- Some dimensions may be highly correlated with others, e.g. height and foot size
- The "intrinsic" dimensionality may be smaller than the number of features
  - The data can be best described and understood by a smaller number of dimensions
Methods for dimensionality reduction

- The main idea is to project the high-dimensional (d) space into a lower-dimensional (k) space
- A statistically optimal way is to project into a lower-dimensional orthogonal subspace that captures as much variation of the data as possible for the chosen k
- The best (in terms of mean squared error) and most widely used way to do this is PCA
- How to compare different methods?
  - Amount of distortion caused
  - Computational complexity
Principal Components Analysis (PCA)
Intuition

- Given an original space in 2-D
- How can we represent the points in a k-dimensional space (k <= d) while preserving as much information as possible?
[Figure: scatter of data points on the original axes, with the first and second principal components overlaid]
Principal Components Analysis (PCA)
Algorithm

Algorithm (a code sketch follows below):
1. X ← N x d data matrix, with one row vector x_n per data point
2. Subtract the mean from each dimension in X
3. Σ ← covariance matrix of X
4. Find the eigenvectors and eigenvalues of Σ
5. PCs ← the k eigenvectors with the largest eigenvalues

- Eigenvalues: a measure of how much data variance is explained by each eigenvector
- Singular Value Decomposition (SVD): can be used to find the eigenvectors and eigenvalues of the covariance matrix
- To project into the lower-dimensional space: subtract the mean of X in each dimension and multiply by the principal components (PCs)
- To restore into the original space: multiply the projection by the principal components and add the mean of X in each dimension
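A minimal NumPy sketch of these steps (the function name and structure are my own, not from the slides):

import numpy as np

def pca_project(X, k):
    """Project the N x d data matrix X onto its first k principal components."""
    mean = X.mean(axis=0)                    # per-dimension mean
    Xc = X - mean                            # step 2: subtract the mean
    cov = np.cov(Xc, rowvar=False)           # step 3: d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # step 4: eigen-decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
    pcs = eigvecs[:, order[:k]]              # step 5: the k eigenvectors with largest eigenvalues (d x k)
    Y = Xc @ pcs                             # projection into the k-dimensional space (N x k)
    X_restored = Y @ pcs.T + mean            # approximate restoration into the original space
    return Y, X_restored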
Random Projection (RP)
Idea

- PCA, even when calculated using SVD, is computationally expensive
  - Complexity is O(dcN), where d is the number of dimensions, c is the average number of non-zero entries per column and N the number of points
- Idea: what if we randomly constructed the principal component vectors?
- Johnson-Lindenstrauss lemma (one quantitative form is given below)
  - If points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved
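One standard quantitative form of the lemma (the Dasgupta–Gupta bound; not stated on the slide itself): for any 0 < ε < 1 and any n points in R^d, if

\[
k \;\ge\; \frac{4\ln n}{\varepsilon^{2}/2-\varepsilon^{3}/3},
\]

then there is a map f: R^d → R^k (a suitably scaled random projection works with high probability) such that for every pair of points u, v

\[
(1-\varepsilon)\,\lVert u-v\rVert^{2}\;\le\;\lVert f(u)-f(v)\rVert^{2}\;\le\;(1+\varepsilon)\,\lVert u-v\rVert^{2}.
\]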
Random Projection (RP)
Idea

- Use a random matrix (R) in place of the principal components matrix
  - R is usually Gaussian distributed
  - Complexity is O(kcN)
- The generated random matrix (R) is usually not orthogonal
  - Making R orthogonal is computationally expensive
  - However, we can rely on a result by Hecht-Nielsen: in a high-dimensional space there exists a much larger number of almost orthogonal directions than truly orthogonal directions, so vectors with random directions are close enough to orthogonal
- Euclidean distances in the projected space can be scaled back to the original space by a factor of sqrt(d/k) (see the sketch below)
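A minimal sketch of a Gaussian random projection (illustrative code, not from the paper; normalizing the random directions to unit length is my choice, and it is what makes the sqrt(d/k) scaling on this slide apply):

import numpy as np

def random_projection(X, k, seed=0):
    """Project the N x d matrix X onto k dimensions using a random matrix R."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, k))      # Gaussian entries: mean 0, standard deviation 1
    R /= np.linalg.norm(R, axis=0)       # unit-length random directions (almost orthogonal for large d)
    return X @ R                         # N x k projection

# Pairwise Euclidean distances in the projected space approximate the original ones
# after scaling:  ||x1 - x2||  ~  sqrt(d / k) * ||y1 - y2||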
Random Projection
Simplified Random Projection (SRP)

- The random matrix is usually Gaussian distributed (mean 0, standard deviation 1)
- Achlioptas showed that a much simpler distribution can be used (sketched below)
  - This implies further computational savings, since the matrix is sparse and the computations can be performed using integer arithmetic
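A minimal sketch of one of the sparse distributions Achlioptas proposed (the ±sqrt(3)/0 entries are from his result; the code itself is only illustrative):

import numpy as np

def sparse_random_matrix(d, k, seed=0):
    """d x k matrix with entries sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
    return np.sqrt(3) * signs            # two thirds of the entries are exactly zero

The projection is again X @ R; because most entries are zero and the rest are ±1 up to the final sqrt(3) factor, the bulk of the work reduces to integer additions and subtractions.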
Discrete Cosine Transform (DCT)

- Widely used method for image compression (a usage sketch follows below)
- Optimal for the human eye
  - Distortions are introduced at the highest frequencies, which humans tend to neglect as noise
- DCT is not data-dependent, in contrast to PCA, which needs the eigenvalue decomposition
  - This makes DCT orders of magnitude cheaper to compute
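A minimal sketch of dimensionality reduction with the DCT: transform a flattened image window and keep only the k lowest-frequency coefficients (keeping the low frequencies is one common choice; the slides do not specify which coefficients are kept). Uses SciPy's scipy.fft.dct/idct routines.

import numpy as np
from scipy.fft import dct, idct

def dct_reduce(x, k):
    """Keep the first k coefficients of the orthonormal DCT-II of a 1-D signal."""
    return dct(x, norm='ortho')[:k]

def dct_restore(coeffs, d):
    """Map back to the original d-dimensional space, zero-filling the dropped frequencies."""
    full = np.zeros(d)
    full[:len(coeffs)] = coeffs
    return idct(full, norm='ortho')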
Results
Noiseless Images

[Two figure-only slides of result plots precede the summary below.]
- Original space: 2500-d (100 image pairs with 50x50 pixels)
- Error measurement: average error in the Euclidean distance between 100 pairs of images in the original and the reduced space (sketched in code after this list)
- Amount of distortion
  - RP and SRP give accurate results for very small k (k > 10)
    - The distance scaling might be an explanation for this success; in PCA such scaling is not straightforward
  - PCA gives accurate results for k > 600
  - DCT still has a significant error even for k > 600
- Computational complexity
  - The number of floating point operations for RP and SRP is on the order of 100 times less than for PCA
- RP and SRP clearly outperform PCA and DCT at the smallest dimensions
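One plausible reading of this error measure, as an illustrative sketch (the exact definition, e.g. absolute vs. signed difference, is not given on the slide; for RP and SRP the projected distances would first be scaled back by sqrt(d/k)):

import numpy as np

def average_distance_error(X_orig, X_reduced, pairs):
    """Average |Euclidean distance in original space - distance in reduced space| over index pairs."""
    errors = [abs(np.linalg.norm(X_orig[i] - X_orig[j]) - np.linalg.norm(X_reduced[i] - X_reduced[j]))
              for i, j in pairs]
    return float(np.mean(errors))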
Results
Noisy Images

- Images were corrupted by salt and pepper impulse noise with probability 0.2
- Error is computed in the high-dimensional noiseless space
- RP, SRP, PCA and DCT perform quite similarly to the noiseless case
Results
Text Data

- Data set
  - Newsgroups corpus: sci.crypt, sci.med, sci.space, soc.religion
- Pre-processing
  - Term frequency vectors
  - Some common terms were removed, but no stemming was used
  - Document vectors were normalized to unit length
  - Data was not made zero mean
- Size
  - 5000 terms
  - 2262 newsgroup documents
- Error measurement (sketched below)
  - 100 pairs of documents were randomly selected and the error in their cosine similarity before and after the dimensionality reduction was calculated
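A sketch of this cosine-based error measure (the absolute-difference form is my reading, not a quote from the paper):

import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def average_cosine_error(X_orig, X_reduced, pairs):
    """Average |cosine in original space - cosine in reduced space| over the selected document pairs."""
    return float(np.mean([abs(cosine(X_orig[i], X_orig[j]) - cosine(X_reduced[i], X_reduced[j]))
                          for i, j in pairs]))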
Results
Text Data

[A figure-only slide of result plots precedes this summary.]

- The cosine was used as the similarity measure since it is more common for this task
- RP is not as accurate as SVD
  - The Johnson-Lindenstrauss result states that Euclidean distances, not the cosine, are retained well under random projection
  - The RP error may nevertheless be neglected in most applications
- RP can be used on large document collections with less computational complexity than SVD
Conclusion

- Random Projection is an effective dimensionality reduction method for high-dimensional real-world data sets
- RP preserves the similarities even if the data is projected into a moderate number of dimensions
- RP is beneficial in applications where the distances of the original space are meaningful
- RP is a good alternative to traditional dimensionality reduction methods, which are infeasible for high-dimensional data, since RP does not suffer from the curse of dimensionality
Questions