Project Report: Analysis of JL Transform Implementations
Abhishek Kumar (u0660208),
Manasi Datar (u0564527)
Dec 21, 2009
1 Objective
The aim of this project was to experiment with implementations of the Johnson-Lindenstrauss (JL) Lemma and to present empirical results for a couple of applications. We implemented two of the most commonly used constructions of the JL transform and evaluated their performance in a classification application.
2 Introduction
Most scientific research today has become data-intensive [1]. Vast amounts of data are being collected from different sources by researchers in virtually every area imaginable, ranging from basic sciences such as nuclear physics (the Large Hadron Collider being a recent example), astronomy, and biology, to more applied areas such as web mining, natural language processing, and biomedical applications. This huge amount of data (both structured and unstructured) presents a big challenge in terms of finding efficient ways to do dimensionality reduction (what counts as 'efficient' is dictated by the application) in order to achieve one or more of the following goals: visualizing the data effectively, looking for interesting patterns in the data, reducing the computational power needed to process the data without compromising much on the end goal, and sometimes providing algorithms with only the relevant information so that they are not confused by irrelevant features; there may be many other goals depending on the application.
The most popular way of doing dimensionality reduction over the past few decades has been principal component analysis (PCA). PCA attempts to reduce the dimensionality of data such that the mean squared error between the original data and the data reconstructed from a lower-rank representation is minimized. PCA, however, does not provide any guarantees on preserving local structures present in the data (such as pairwise distances). Although PCA does a decent job of dimensionality reduction for certain types of applications (e.g., pattern classification) when given sufficient samples of noise-free data, it can overfit the current data set when the amount of data is insufficient and may not generalize well to future unseen data from the same distribution. It may also not be robust in the presence of noise in the data.
Johnson and Lindenstrauss [2] presented a seminal result which paves the way for a dimensionality reduction technique that may be either dependent on or independent of the actual data points presented to the algorithm. The formal statement of the Lemma is given below.
Lemma 1. Given $\epsilon > 0$ and an integer $n$, let $k$ be a positive integer such that $k \ge k_0 = O(\epsilon^{-2} \log n)$. For every set $P$ of $n$ points in $\mathbb{R}^d$ there exists $f : \mathbb{R}^d \to \mathbb{R}^k$ such that for all $u, v \in P$,
$$(1 - \epsilon)\,\|u - v\|^2 \;\le\; \|f(u) - f(v)\|^2 \;\le\; (1 + \epsilon)\,\|u - v\|^2.$$
In plain words, the lemma says that there exists a transformation $f$ that, when applied to the data, preserves Euclidean distances in the new space up to an arbitrary precision. The JL Lemma guarantees that there exists an approximately isometric lower-dimensional embedding of the data; it does not, however, give a way of finding that mapping. The lemma has been used in many problems in computer science. For a reasonably comprehensive survey of the JL Lemma and its uses in various computer science problems, the reader is referred to [3].
After the lemma was presented and became popular in the community, researchers tried to come up with ways to construct mappings that have this distance-preserving property. In this project, we experiment with two such constructions [4] and measure their performance on two different metrics: distance preservation and effectiveness for pattern classification problems.
3 Methods and Applications
Achlioptas [4] provides two constructions for JL embeddings in which the elements of the projection matrix belong to the set {−1, 0, 1}. The main contribution of that paper is to change the nature of the projection matrix from a dense matrix of real numbers to a sparse sign matrix, which allows the projection to be computed with selection and aggregation operations in place of an actual matrix multiplication; the latter is a non-trivial cost in practical situations.
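For illustration, the following minimal NumPy sketch computes the (unscaled) projection by a {−1, 0, +1} sign matrix using only column selection and summation; the function name and structure are illustrative and are not taken from the implementation reported here.

```python
import numpy as np

def project_by_aggregation(A, R_signs):
    """Project the rows of A (n x d) with a d x k matrix whose entries lie in {-1, 0, +1}.

    Instead of a dense matrix multiply, each output coordinate is obtained by
    adding the input coordinates whose sign is +1 and subtracting those whose
    sign is -1; zero entries are skipped entirely.  Any scaling (1/sqrt(k), and
    sqrt(3) for the sparse construction) can be applied afterwards.
    """
    n, d = A.shape
    k = R_signs.shape[1]
    E = np.zeros((n, k))
    for j in range(k):
        plus = R_signs[:, j] == +1      # columns of A to add
        minus = R_signs[:, j] == -1     # columns of A to subtract
        E[:, j] = A[:, plus].sum(axis=1) - A[:, minus].sum(axis=1)
    return E
```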
The method presented in [4] is stated in Theorem 1 below.
Theorem 1 (Achlioptas [4]). Let $P$ be an arbitrary set of $n$ points in $\mathbb{R}^d$, represented as an $n \times d$ matrix $A$. Given $\epsilon, \beta > 0$, let
$$k_0 = \frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \log n = O(\epsilon^{-2} \log n).$$
For an integer $k \ge k_0$, let $R$ be a $d \times k$ random matrix with $R(i,j) = r_{ij}$, where $\{r_{ij}\}$ are independent random variables drawn from either one of the following two probability distributions:
$$r_{ij} = \begin{cases} +1 & \text{with probability } 1/2, \\ -1 & \text{with probability } 1/2, \end{cases} \qquad (1)$$
$$r_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6, \\ \phantom{+}0 & \text{with probability } 2/3, \\ -1 & \text{with probability } 1/6. \end{cases} \qquad (2)$$
Let
$$E = \frac{1}{\sqrt{k}} A R$$
and let $f : \mathbb{R}^d \to \mathbb{R}^k$ map the $i$-th row of $A$ to the $i$-th row of $E$. With probability at least $1 - n^{-\beta}$, for all $u, v \in P$,
$$(1 - \epsilon)\,\|u - v\|^2 \;\le\; \|f(u) - f(v)\|^2 \;\le\; (1 + \epsilon)\,\|u - v\|^2.$$
We implement Equation (1) as method-1, using a single coin toss to set each value of the projection matrix, and Equation (2) as method-2, using two coin tosses to set each value of the projection matrix. For both implementations, β was set to 1 and k was automatically computed as the smallest integer greater than the k0 given in Theorem 1.
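As a concrete sketch of the two constructions, the following NumPy code draws the projection matrices and computes E = AR/√k; the helper names, the use of the natural logarithm in k0, and the random-number generator are our assumptions here, not details specified in the report.

```python
import numpy as np

def jl_dim(n, eps, beta=1.0):
    """Smallest integer k exceeding k0 = (4 + 2*beta) / (eps^2/2 - eps^3/3) * log n."""
    k0 = (4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * np.log(n)
    return int(np.ceil(k0))

def random_matrix_method1(d, k, rng):
    """Method-1: each entry is +1 or -1 with probability 1/2 (one coin toss per entry)."""
    return rng.choice([-1.0, 1.0], size=(d, k))

def random_matrix_method2(d, k, rng):
    """Method-2: sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6) (two coin tosses per entry)."""
    return np.sqrt(3) * rng.choice([-1.0, 0.0, 1.0], size=(d, k), p=[1/6, 2/3, 1/6])

def jl_project(A, R):
    """E = A R / sqrt(k); row i of E is the JL embedding of row i of A."""
    return A @ R / np.sqrt(R.shape[1])
```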
The next section presents and discusses the various results obtained using these mappings. Both randomly generated data and real data from a practical application are used in the evaluations.
4 Results and Discussion
This section details experiments designed to illustrate and validate the JL implementations described in Section 3. First, we present an experiment with synthetically generated data to validate the implementations by checking for preservation of distances as predicted by the JL Lemma. Next, we present results on a binary classification problem from the biomedical domain.
4.1 Random Data
Synthetic data points were generated from a Gaussian distribution with µ = 100.0 and σ = 1.0. A 100 × 10000 matrix was generated and used as input data for the experiment; this matrix is the row-wise concatenation of 100 data points, each represented by a 10000-dimensional vector.
For ε = 0.1 and ε = 0.05, JL projection matrices were generated using each of the two methods described in Section 3. These matrices were used to project the input data into the JL space, and pairwise distances were used to validate the fidelity of the projections. The number of point pairs considered was n_p = 1000 and n_p = 10000.
Randomly picked point-pairs were evaluated and the following measures were
computed:
p-value This statistic is the fraction of point-pairs for which the distance in the JL space falls outside the range prescribed by Theorem 1. It can be viewed as analogous to a statistical p-value describing the significance of the projection and is expected to be close to 1/n, where n is the number of points (rows) in the input data.
distortion (δ) Let $u, v$ be points in the original space and $f(u), f(v)$ the corresponding projections in the JL space. We define the distortion as
$$\delta = \bigl|\,\|u - v\|^2 - \|f(u) - f(v)\|^2\,\bigr|.$$
The mean value of δ, computed over the n_p point pairs, is reported, along with the minimum and maximum values.
epsilon (ε) The per-pair value of ε is computed as
$$\epsilon = \left|\,1 - \frac{\|f(u) - f(v)\|^2}{\|u - v\|^2}\,\right|.$$
The minimum and maximum values are reported.
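The following sketch shows how these measures could be computed for randomly sampled point pairs, assuming NumPy and the projection helpers sketched in Section 3; the function name and return format are illustrative only.

```python
import numpy as np

def validate_projection(A, E, eps, n_pairs, rng):
    """Sample random point pairs and compute the measures defined above.

    A : original data, one point per row; E : its JL projection.
    Returns the empirical p-value (fraction of pairs whose squared distance in
    the JL space falls outside the (1 +/- eps) band), along with the
    mean/min/max distortion and the min/max per-pair epsilon.
    """
    n = A.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                                    # drop degenerate pairs
    i, j = i[keep], j[keep]

    d_orig = np.sum((A[i] - A[j]) ** 2, axis=1)      # ||u - v||^2
    d_proj = np.sum((E[i] - E[j]) ** 2, axis=1)      # ||f(u) - f(v)||^2

    delta = np.abs(d_orig - d_proj)                  # distortion
    pair_eps = np.abs(1.0 - d_proj / d_orig)         # observed per-pair epsilon
    outside = (d_proj < (1 - eps) * d_orig) | (d_proj > (1 + eps) * d_orig)

    return {"p_value": outside.mean(),
            "delta_mean": delta.mean(), "delta_min": delta.min(), "delta_max": delta.max(),
            "eps_min": pair_eps.min(), "eps_max": pair_eps.max()}

# Illustrative usage with the synthetic data of Section 4.1, reusing the helpers
# sketched in Section 3:
#   rng = np.random.default_rng(0)
#   A = rng.normal(loc=100.0, scale=1.0, size=(100, 10000))
#   E = jl_project(A, random_matrix_method2(10000, jl_dim(100, 0.1), rng))
#   print(validate_projection(A, E, eps=0.1, n_pairs=1000, rng=rng))
```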
ε = 0.1, k = 665

#pairs   method   p-value   δ_mean    δ_min   δ_max     ε_min          ε_max
1000     1        0.084     868.426   0       4274.32   1.90891e-005   0.213059
1000     2        0.039     807.506   0       3389.1    0.000115442    0.168979
10000    1        0.0903    948.157   0       5691.81   9.7865e-006    0.28161
10000    2        0.0529    835.691   0       3727.15   6.22849e-006   0.186965

ε = 0.05, k = 2658

#pairs   method   p-value   δ_mean    δ_min   δ_max     ε_min          ε_max
1000     1        0.077     438.214   0       1987.58   1.90197e-005   0.0984301
1000     2        0.055     427.033   0       1747.62   1.60807e-006   0.0884449
10000    1        0.0697    432.72    0       2175      1.29553e-005   0.108482
10000    2        0.0691    440.845   0       2029.11   7.25565e-006   0.0990592

Table 1: Results on synthetic random data
Table 1 shows the results of the validation experiment for the two JL implementations on synthetic, randomly generated data. The p-values indicate that both implementations perform reasonably well in terms of preserving distances after projection. The low values of ε validate both methods, and also indicate that method-2 performs better than method-1; this can be attributed to the sparse nature of the projection matrix produced by method-2.
4.2 Real Data
We use Positron Emission Tomography (PET) brain image data collected from two populations: people who suffer from Alzheimer's disease and normal subjects. The goal is to classify the test population into the correct categories using the PET images. The main challenges in this type of classification task are the large dimensionality of the data (each sample has dimension 15964) and the very small number of samples (152 in total). We use 110 samples for training the model and the remaining 42 samples for testing. A Support Vector Machine (SVM) is used as the classifier.
The classification accuracy with all 15964 dimensions is 88.10%.
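For concreteness, a minimal sketch of this classification pipeline is given below, assuming NumPy and scikit-learn. The placeholder arrays stand in for the PET data (which is not distributed with this report), and the linear kernel, the random seed, and the fixed k = 725 method-2 projection are illustrative assumptions rather than details specified above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder arrays standing in for the PET data described above: one
# 15964-dimensional feature vector per subject, with a binary label
# (Alzheimer's vs. normal).  Real data would be loaded here instead.
X = rng.normal(size=(152, 15964))
y = rng.integers(0, 2, size=152)

X_train, y_train = X[:110], y[:110]      # 110 training samples
X_test, y_test = X[110:], y[110:]        # remaining 42 test samples

# Project with a method-2 matrix (eps = 0.1, k = 725 as in Table 2),
# then train and evaluate an SVM in the projected space.
k = 725
R = np.sqrt(3) * rng.choice([-1.0, 0.0, 1.0], size=(15964, k), p=[1/6, 2/3, 1/6])
E_train = X_train @ R / np.sqrt(k)
E_test = X_test @ R / np.sqrt(k)

clf = SVC(kernel="linear").fit(E_train, y_train)
print("test accuracy:", clf.score(E_test, y_test))
```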
ε = 0.1, k = 725

method   Accuracy (%)   p-value   ε_min      ε_max
1        78.57          0.057     2.327e-6   0.217
2        78.57          0.070     5.36e-5    0.206

ε = 0.05, k = 2900

method   Accuracy (%)   p-value   ε_min      ε_max
1        80.95          0.051     1.353e-7   0.109
2        80.95          0.046     3.452e-6   0.095

Table 2: Results on PET brain image data: classification accuracy
As tabulated above, the mapping results in a non-trivial reduction in accuracy, but the advantages in terms of dimensionality reduction are large: with an almost 20-fold reduction in dimensionality, the accuracy drops by almost 10% (absolute).
5 Conclusions and Future Work
We implemented two simple methods for computing JL embeddings of data using coin tosses. The advantage of these methods lies in the fact that the generated projection matrices are sparse, so the projection operation can be implemented efficiently as a series of selection and aggregation operations. We first validated the implementations on synthetic data using several measures, and then tested them on real data in a practical classification application. The classification accuracy results suggest that JL mappings may not be an ideal method for pattern recognition problems; however, the reduction in dimensionality they provide is huge if one is willing to compromise somewhat on prediction accuracy.
In the future, we hope to improve the current implementations, and also
include some of the more recent variants like the FJLT proposed in [5]. The
next step will also include a more thorough analysis of the time-complexity in
major applications.
References
[1] Gray, J.: The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research (2009)
[2] Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26 (1984) 189–206
[3] Saha, A.: A survey of the Johnson-Lindenstrauss transform: methods, extensions and applications. CS6160, Class Project Report (2008)
[4] Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66 (2003) 671–687
[5] Ailon, N., Chazelle, B.: Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In: Proceedings of the 38th Annual Symposium on the Theory of Computing (STOC) (2006) 557–563