Project Report: Analysis of JL Transform Implementations

Abhishek Kumar (u0660208), Manasi Datar (u0564527)

Dec 21, 2009

1 Objective

The aim of this project was to experiment with implementations of the Johnson-Lindenstrauss (JL) Lemma and present empirical results for a couple of applications. We implemented two of the most commonly used constructions of the JL transform and tested their performance in a classification application.

2 Introduction

Most scientific research today has become data-intensive [1]. Vast amounts of data are being collected by researchers in virtually every area imaginable, ranging from basic sciences such as nuclear physics (the Large Hadron Collider is a recent example), astronomy, and biology to more applied areas such as web mining, natural language processing, and biomedical applications. This huge amount of data, both structured and unstructured, presents a major challenge: finding ways to perform efficient dimensionality reduction (what counts as 'efficient' is dictated by the application). The goals include visualizing the data effectively, looking for interesting patterns, reducing the computational power needed to process the data without compromising much on the end goal, and supplying algorithms with only the relevant information so that they are not confused by irrelevant features; there may be other goals depending on the application.

The most popular dimensionality reduction technique of the past few decades has been principal component analysis (PCA). PCA reduces the dimensionality of the data so that the mean squared error between the original data and its reconstruction from the lower-rank representation is minimized. PCA, however, provides no guarantees on preserving local structure in the data, such as pairwise distances. Given sufficiently many noise-free samples, PCA does a decent job of dimensionality reduction for certain types of applications (e.g. pattern classification), but when data are scarce it can overfit the current data set and may not generalize well to future unseen data from the same distribution. It may also not be robust in the presence of noise.

Johnson and Lindenstrauss [2] presented a seminal result which paves the way for a dimensionality reduction technique that may be either dependent on or independent of the actual data points presented to the algorithm. The formal statement of the lemma is given below.

Lemma 1. Given ε > 0 and an integer n, let k be a positive integer such that k ≥ k0 = O(ε⁻² log n). For every set P of n points in R^d there exists f : R^d → R^k such that for all u, v ∈ P,

    (1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².

In plain words, the lemma says that there exists a transformation f which, when applied to the data, preserves Euclidean distances in the new space up to arbitrary precision. The JL Lemma guarantees that an approximately isometric lower-dimensional embedding of the data exists; it does not, however, give a way of finding that mapping. The lemma has been used in many problems in computer science. For a reasonably comprehensive survey of the JL Lemma and its uses in various computer science problems, the reader is referred to [3].
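As a concrete illustration of this guarantee (the example below is not part of the original experiments), the following sketch projects a set of random points with a dense Gaussian matrix, the classical construction used in many proofs of the lemma, and measures how far the pairwise squared distances move. It assumes numpy and scipy; the point count, dimensionality, ε, and the constant 8 in the choice of k are arbitrary illustrative values. The sparse constructions actually studied in this project are described in Section 3.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

n, d, eps = 100, 10000, 0.1                # illustrative point count, dimension, distortion
k = int(np.ceil(8 * np.log(n) / eps**2))   # one typical O(eps^-2 log n) target dimension

X = rng.normal(size=(n, d))                # n arbitrary points in R^d
G = rng.normal(size=(d, k)) / np.sqrt(k)   # dense Gaussian projection, scaled by 1/sqrt(k)
Y = X @ G                                  # projected points in R^k

orig = pdist(X, "sqeuclidean")             # squared pairwise distances before projection
proj = pdist(Y, "sqeuclidean")             # squared pairwise distances after projection
ratio = proj / orig

print("k =", k)
print("worst relative distortion:", np.max(np.abs(ratio - 1)))
print("fraction of pairs within (1 +/- eps):", np.mean(np.abs(ratio - 1) <= eps))
```

With this choice of k, the large majority of pairs should land inside the (1 ± ε) band, and the worst-case distortion should stay close to ε.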
After the lemma was presented and became popular in the community, researchers tried to come up with ways to construct mappings that have this distance-preserving property. In this project, we experiment with two such constructions [4] and measure their performance on two different criteria: distance preservation and effectiveness for pattern classification problems.

3 Methods and Applications

Achlioptas [4] provides two constructions for JL embeddings in which the elements of the projection matrix belong to the set {−1, 0, 1}. The main contribution of that paper is changing the nature of the projection matrix from a dense matrix of real numbers to a sparse matrix, which allows selection and aggregation operations to be used in place of an actual matrix multiplication, a non-trivial cost in practical situations. The method presented in [4] is stated in the following theorem.

Theorem. Let P be an arbitrary set of n points in R^d, represented as an n × d matrix A. Given ε, β > 0, let

    k0 = (4 + 2β) log n / (ε²/2 − ε³/3) = O(ε⁻² log n).

For integer k ≥ k0, let R be a d × k random matrix with R(i, j) = r_ij, where the {r_ij} are independent random variables drawn from either one of the following two probability distributions:

    r_ij = +1 with probability 1/2,
           −1 with probability 1/2;                          (1)

    r_ij = √3 × ( +1 with probability 1/6,
                   0 with probability 2/3,
                  −1 with probability 1/6 ).                  (2)

Let

    E = (1/√k) A R

and let f : R^d → R^k map the i-th row of A to the i-th row of E. Then, with probability at least 1 − n^(−β), for all u, v ∈ P,

    (1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².

We implement Equation (1) as method-1, using a single coin toss to set each entry of the projection matrix, and Equation (2) as method-2, using two coin tosses to set each entry. For both implementations, β was set to 1 and k was automatically computed as the smallest integer greater than the k0 given in the theorem above. The next section presents and discusses the results obtained using these mappings. Both randomly generated data and real data from a practical application are used in the evaluations.
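The constructions above require only a few lines of code. The sketch below is a minimal illustration rather than the original implementation: it assumes numpy, the names jl_dimension, projection_matrix, and jl_project are ours, and jl_dimension evaluates the k0 formula of the theorem with the natural logarithm, so the resulting k need not match the k values quoted alongside Tables 1 and 2.

```python
import numpy as np

def jl_dimension(n, eps, beta=1.0):
    """Smallest integer exceeding k0 = (4 + 2*beta) * log(n) / (eps^2/2 - eps^3/3)."""
    k0 = (4 + 2 * beta) * np.log(n) / (eps**2 / 2 - eps**3 / 3)
    return int(np.floor(k0)) + 1

def projection_matrix(d, k, method, rng):
    """Random d x k matrix R with entries drawn according to Equation (1) or (2)."""
    if method == 1:
        # method-1: +1 or -1, each with probability 1/2 (one coin toss per entry)
        return rng.choice([-1.0, 1.0], size=(d, k))
    if method == 2:
        # method-2: sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6} (two coin tosses per entry)
        return np.sqrt(3) * rng.choice([-1.0, 0.0, 1.0], size=(d, k), p=[1/6, 2/3, 1/6])
    raise ValueError("method must be 1 or 2")

def jl_project(A, eps, method=2, beta=1.0, seed=0):
    """Map the rows of the n x d matrix A into R^k via E = A R / sqrt(k)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    k = jl_dimension(n, eps, beta)
    R = projection_matrix(d, k, method, rng)
    return A @ R / np.sqrt(k)

# Example in the spirit of the synthetic experiment of Section 4.1:
# 100 Gaussian points in 10000 dimensions, projected with method-2 at eps = 0.1.
rng = np.random.default_rng(1)
A = rng.normal(loc=100.0, scale=1.0, size=(100, 10000))
E = jl_project(A, eps=0.1, method=2)
print(A.shape, "->", E.shape)
```

Since roughly two-thirds of the entries of R are zero under method-2, the product AR can also be realized by selecting a random subset of the columns of A and aggregating them with the appropriate signs instead of performing a dense matrix multiplication; the dense product is used above only for brevity.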
4 Results and Discussion

This section details experiments designed to illustrate and validate the JL implementations described in Section 3. First, we present an experiment with synthetically generated data to validate the implementations by checking for the preservation of distances predicted by the JL Lemma. Next, we present results on a binary classification problem from the biomedical domain.

4.1 Random Data

Synthetic data points were generated using a Gaussian distribution with µ = 100.0 and σ = 1.0. A 100 × 10000 matrix was generated and used as input data for the experiment; this matrix is a row-wise concatenation of 100 data points, each represented by a 10000-dimensional vector. For ε = 0.1 and ε = 0.05, JL projection matrices were generated using each of the two methods described in Section 3. These matrices were used to project the input data into the JL space, and pairwise distances were used to validate the projections. The number of point pairs considered was np = 1000 and np = 10000. Randomly picked point pairs were evaluated and the following measures were computed:

p-value: This statistic is the fraction of point pairs for which the distance in the JL space falls outside the range prescribed by the theorem in Section 3. It can be viewed as analogous to a statistical p-value describing the significance of the projection and is expected to be close to 1/n, where n is the number of points (rows) in the input data.

distortion (δ): Let u, v be points in the original space and f(u), f(v) the corresponding projections in the JL space. The distortion is defined as

    δ = | ‖u − v‖² − ‖f(u) − f(v)‖² |.

The mean value of δ over the np point pairs is reported, along with the minimum and maximum values.

epsilon (ε): The value of ε for a point pair is computed as

    ε = | 1 − ‖f(u) − f(v)‖² / ‖u − v‖² |.

The minimum and maximum values are reported.

    ε = 0.1, k = 665
    #pairs  method  p-value  δ_mean   δ_min  δ_max    ε_min        ε_max
    1000    1       0.084    868.426  0      4274.32  1.90891e-05  0.213059
    1000    2       0.039    807.506  0      3389.1   0.000115442  0.168979
    10000   1       0.0903   948.157  0      5691.81  9.7865e-06   0.28161
    10000   2       0.0529   835.691  0      3727.15  6.22849e-06  0.186965

    ε = 0.05, k = 2658
    #pairs  method  p-value  δ_mean   δ_min  δ_max    ε_min        ε_max
    1000    1       0.077    438.214  0      1987.58  1.90197e-05  0.0984301
    1000    2       0.055    427.033  0      1747.62  1.60807e-06  0.0884449
    10000   1       0.0697   432.72   0      2175     1.29553e-05  0.108482
    10000   2       0.0691   440.845  0      2029.11  7.25565e-06  0.0990592

Table 1: Results on synthetic random data

Table 1 shows the results of the validation experiment on synthetic, randomly generated data. The p-values indicate that both implementations perform reasonably in terms of preserving distances after projection. The low values of ε validate both methods and also indicate that method-2 performs better than method-1; this can be attributed to the sparse nature of the projection matrix produced by method-2.

4.2 Real Data

We use Positron Emission Tomography (PET) brain image data collected from two populations: people who suffer from Alzheimer's disease and normal controls. The goal is to classify the test population into the correct categories using the PET images. The main challenges in this type of classification task are the large dimensionality of the data (each sample has dimension 15964) and the very small number of samples (152 in total). We use 110 samples for training the model and the remaining 42 samples for testing. A Support Vector Machine (SVM) is used as the classifier. The classification accuracy with all 15964 dimensions is 88.10%.

    ε = 0.1, k = 725
    method  Accuracy (%)  p-value  ε_min     ε_max
    1       78.57         0.057    2.327e-6  0.217
    2       78.57         0.070    5.36e-5   0.206

    ε = 0.05, k = 2900
    method  Accuracy (%)  p-value  ε_min     ε_max
    1       80.95         0.051    1.353e-7  0.109
    2       80.95         0.046    3.452e-6  0.095

Table 2: Results on PET brain image data: classification accuracy

As tabulated above, the mapping results in a non-trivial reduction in accuracy, but the advantage in terms of dimensionality reduction is large: with an almost 20-fold reduction in dimensionality, the accuracy drops by almost 10% absolute.
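The classification pipeline itself is not reproduced in this report, so the sketch below is only schematic: random arrays stand in for the PET feature matrix and labels, scikit-learn is assumed for the SVM, and the linear kernel and the ordering of the 110/42 train/test split are our assumptions rather than the exact configuration behind Table 2.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-ins for the PET feature matrix (152 samples x 15964 features) and the
# binary labels (Alzheimer's vs. normal); random data is used here only to make
# the sketch runnable.
X = rng.normal(size=(152, 15964))
y = rng.integers(0, 2, size=152)

# Data-independent JL projection using method-2 of Section 3 with eps = 0.1, beta = 1.
n, d = X.shape
eps, beta = 0.1, 1.0
k = int((4 + 2 * beta) * np.log(n) / (eps**2 / 2 - eps**3 / 3)) + 1
R = np.sqrt(3) * rng.choice([-1.0, 0.0, 1.0], size=(d, k), p=[1/6, 2/3, 1/6])
E = X @ R / np.sqrt(k)

# 110 samples for training and the remaining 42 for testing, as in the report.
X_train, X_test = E[:110], E[110:]
y_train, y_test = y[:110], y[110:]

clf = SVC(kernel="linear")   # the report does not specify the kernel; linear is assumed
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Because the projection matrix does not depend on the data, the same R is applied to both the training and the test samples.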
5 Conclusions and Future Work

We implemented two simple methods for computing JL embeddings of data using coin tosses. The advantage of these methods lies in the fact that the projection matrices generated are sparse, so the projection operation can be implemented efficiently as a series of selection and aggregation operations. We first validated the implementations on synthetic data using several measures, and then tested them on real data in a practical classification application. The classification accuracy results show that JL mappings may not be an ideal choice for pattern recognition problems; however, the reduction in dimensionality obtained is substantial if one can accept some loss in prediction accuracy. In the future, we hope to improve the current implementations and to include some of the more recent variants such as the FJLT proposed in [5]. The next step will also include a more thorough analysis of the time complexity in major applications.

References

[1] Gray, J.: The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research (2009)

[2] Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26 (1984) 189-206

[3] Saha, A.: A survey of Johnson-Lindenstrauss transform - methods, extensions and applications. CS6160, Class Project Report (2008)

[4] Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66 (2003) 671-687

[5] Ailon, N., Chazelle, B.: Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In: Proceedings of the 38th Annual Symposium on the Theory of Computing (STOC). (2006) 557-563