CS6160 Class Project Report
A survey of Johnson Lindenstrauss transform methods, extensions and applications
by
Avishek Saha
School Of Computing
University of Utah
Contents
1 Introduction
2 Brief Lineage of the JL transform
3 Related Work
4 Survey of proof techniques for the JL transform
4.1 Proof Techniques
4.2 High level proof idea
5 Extensions of the JL transform
6 Applications of the JL transform
7 Discussions
8 Appendix
8.1 Definition of Sphericity
8.2 Structural Lemma
References
1. INTRODUCTION
Dimensionality Reduction: Advances in data collection and storage capabilities have enabled researchers in diverse domains to observe and collect huge
amounts of data. However, these large datasets present substantial challenges to existing data analysis tools. One major bottleneck in this regard is the large number
of features or dimensions associated with each measured quantity, a problem frequently termed the "curse of dimensionality". Existing algorithms usually scale
very poorly with an increase in the number of dimensions of the data. This motivates
mapping the data from the high-dimensional space to a lower dimensional space in
a manner such that the mapping preserves (or almost preserves) the structure of
the data. Substantial research efforts have been made (and are still being made) to
overcome the aforementioned curse and tame high-dimensional data. Dimensionality reduction encompasses all such techniques which aim to reduce the number
of random variables (dimensions or features) associated with some observable or
measurable quantity with the hope that the data in lower dimensions would be
much more amenable to efficient exploration and analysis.
The Johnson-Lindenstrauss Lemma: In the process of extending Lipschitz
mappings to Hilbert spaces [Johnson and Lindenstrauss 1984], the authors formulated a key geometric lemma. This lemma (Lemma 1 of [Johnson and Lindenstrauss 1984]) was thereafter referred to as the Johnson-Lindenstrauss Lemma. The
Johnson-Lindenstrauss Lemma states that a set of points in high-dimensional
space can be mapped to a space of much lower dimension such that the pairwise distances
of the points in the higher-dimensional space are almost preserved. The dimension
of the lower-dimensional space depends on the number of input points and the degree
(approximation factor) to which the pairwise distances need to be preserved. More
formally:
Definition 1.1 (Johnson-Lindenstrauss Lemma). For any ε such that 1/2 > ε > 0, and any set of points S ⊂ R^d with |S| = n, upon projection to a uniform
random k-dimensional subspace where k = O(ε^{-2} log n), the following property holds
with probability at least 1/2 for every pair u, v ∈ S:

(1 − ε)||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε)||u − v||²,

where f(u), f(v) are the projections of u, v.
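To make the statement concrete, the following minimal sketch (in Python/NumPy) projects a small synthetic point set with a Gaussian random matrix scaled by 1/sqrt(k) (a construction in the spirit of the Gaussian-projection proofs surveyed later, rather than an exact uniformly random subspace) and reports the worst observed distortion; all parameters (n, d, ε, the data) are illustrative choices, not values taken from any of the cited papers.

```python
# A minimal numerical sketch of the JL guarantee. We use a Gaussian random
# projection scaled by 1/sqrt(k) as a stand-in for a uniformly random subspace;
# n, d, eps and the data below are illustrative choices only.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, eps = 200, 1000, 0.3
# Dasgupta-Gupta style target dimension, k = 4 / (eps^2/2 - eps^3/3) * ln(n).
k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))

X = rng.normal(size=(n, d))                # n points in R^d
P = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection matrix
Y = X @ P                                  # projected points in R^k

worst = 0.0                                # worst relative distortion of squared distances
for i, j in combinations(range(n), 2):
    orig = np.sum((X[i] - X[j]) ** 2)
    proj = np.sum((Y[i] - Y[j]) ** 2)
    worst = max(worst, abs(proj / orig - 1.0))
print(f"k = {k}, worst relative distortion = {worst:.3f} (target eps = {eps})")
```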
The lemma finds use in approximate nearest neighbor search, clustering, machine
learning, compressed sensing, graph embedding - to name a few. A more detailed
exposition of its numerous applications is deferred to future sections.
Outline: This survey is not a self-contained introduction to the Johnson-Lindenstrauss
Lemma; it assumes basic familiarity with embeddings of metric spaces and with common dimensionality reduction techniques. In
addition, it takes for granted that the reader is comfortable with the applications that motivate dimensionality reduction - approximate
nearest neighbors, manifolds, compressed sensing, to name a few.
The literature cited in this work is far from exhaustive and we make no attempt
to present a comprehensive review of the existing work, any effort of which would be
overwhelming given the huge interest and the vast amount of work being done in
this rapidly evolving area of research. We devote the first two sections (section 2
and section 3) to briefly cover the (so-called) lineage of, and related work on, the Johnson-Lindenstrauss Lemma. Section 4 surveys each of the existing proof techniques for the
flattening lemma and then abstracts out a high-level proof intuition. In section 5, we introduce the reader to the variety of ways
in which the Johnson-Lindenstrauss Lemma has been extended. The applications
of the lemma are covered in section 6. At this point it is worth mentioning
that section 4 presents the main emphasis of this work. The next two sections
have been added to provide the reader with a reasonably complete picture of this much
celebrated lemma. Finally, section 7 identifies an interesting direction of future work
and concludes this report.
2. BRIEF LINEAGE OF THE JL TRANSFORM
This section traces a brief lineage of the Johnson-Lindenstrauss Lemma. The
aim is to identify papers which have previously proposed or used results similar in
essence to the Johnson-Lindenstrauss Lemma. We present two papers in this
regard:
(1) [A.Dvoretzky 1961] - this paper shows that convex bodies in high dimensions
have low-dimensional sections which are almost spherical. The dimension of the
almost spherical section is given by Dvoretzky's theorem and is about the logarithm of
the dimension of the space.
(2) [T. Figiel and Milman 1976] - this paper uses [A.Dvoretzky 1961] to present
results related to the isoperimetric inequality for the n-sphere.
Finally, [Johnson and Lindenstrauss 1984] was the first work to formulate the JL
lemma in its current form and use it as an elementary geometric lemma to prove
results related to the extension of Lipschitz mappings to Hilbert spaces. The problem
discussed in their paper is outside the scope of the current work and we refer the
reader to [Johnson and Lindenstrauss 1984] for an in-depth treatment.
3. RELATED WORK
Although a section on related work might seem somewhat redundant in the context
of a survey, the main aim of this section is to present a brief overview of the papers
covered and the context in which they are discussed.
As mentioned earlier, this survey is divided into three parts - (a) proof techniques, (b) extensions, and (c) applications. For a survey of proof techniques we
mainly discuss the papers [Johnson and Lindenstrauss 1984], [Frankl and Maehara
1987], [Indyk and Motwani 1998], and [Dasgupta and Gupta 1999]. In the original proof [Johnson and Lindenstrauss 1984], the concentration bound uses heavy
geometric approximation machinery for the projection (onto a uniformly random
hyperplane through the origin). [Frankl and Maehara 1987] simplified the previous proof by considering a direct projection onto k random orthonormal vectors.
Subsequently, [Indyk and Motwani 1998] and [Dasgupta and Gupta 1999] independently proposed further simplifications using different distributions and elementary
probabilistic reasoning.
The work in [Achlioptas 2001], [Ailon and Chazelle 2006], [Baraniuk et al.], [Clarkson 2008], [Magen 2002], [Sarlos 2006], [Indyk and Naor 2007] and [Johnson and Naor 2009] provides a glimpse into the diverse ways in which the Johnson-Lindenstrauss Lemma has been extended. A brief survey of these papers can
also be found in [Matoušek 2008].
Finally, with respect to the applications of the Johnson-Lindenstrauss Lemma
the following papers are worth mentioning: [Linial et al. 1994], [Indyk and Motwani
1998], [Schulman 2000], [Achlioptas 2001], [Ailon and Chazelle 2006], [Clarkson
2008], [Baraniuk et al.], [Baraniuk et al. 2008]. In addition, the monograph [Vempala 2004] on random projections presents an in-depth treatment of the various
applications of the Johnson-Lindenstrauss Lemma.
4. SURVEY OF PROOF TECHNIQUES FOR THE JL TRANSFORM
This section is divided into two parts. In the first part we discuss each of the
relevant papers in detail. In the second part, we abstract out the key technique
which underlies the different proof approaches in the papers discussed earlier.
4.1 Proof Techniques
[Johnson and Lindenstrauss 1984]: We start with the definition of the Lipschitz constant of a function. The following definition is from [def].
Definition: Lipschitz Constant. Let f be a function from one metric space
into another. The nonnegative real number k acts as a Lipschitz constant for the
function f if, for every x and y in the domain, the distance from f(x) to f(y) is no
larger than k times the distance from x to y. We can write this relationship as

d(f(x), f(y)) ≤ k · d(x, y)
The Johnson-Lindenstrauss Lemma shows that for some small τ
(0 < τ < 1) there exists a mapping f : R^n_2 → R^k_2 (k < n) such that the product of the
Lipschitz constants of f and f^{-1} is upper-bounded by the ratio (1 + τ)/(1 − τ).
Here, R_2 denotes the Euclidean metric space. This upper bound on the product
essentially implies that the mapping f of points from R^n_2 to R^k_2 (and the
inverse mapping) introduces only a small relative distortion in the mapped (and
inverse-mapped) pairwise distances.
The proof presented in this paper draws on sophisticated techniques to achieve
its results. Here, we provide a high-level overview.
Let Q be the projection matrix which projects data from R^n onto its first k
coordinates. Let σ be the measure on the orthogonal group O(n) of all rotations
of R^n about the origin, so that (O(n), σ) forms a measure space. A random projection of
rank k can then be thought of as a random variable F (drawn from some distribution)
on the probability space (O(n), σ), where σ governs the random choice of F. Now that we have defined F, how do we construct
the mapping F? Let F = U*QU where U ∈ O(n). The multiplication of Q
with U ensures that the projection matrix is orthogonal. However, in addition to
orthogonality we also need to ensure randomness (the need for randomness arises in the context of an adversarial mapping and is discussed in detail in section 4.2). Hence, we pre-multiply QU
with U*. The rest of the proof uses heavy machinery from geometry to prove that
for any unit vector mapped using F, the squared length of the projected
vector stays within a (good) region around its expectation with high probability.
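As an illustration of the F = U*QU construction sketched above, the following short snippet realizes the random rotation U by orthonormalizing a Gaussian matrix via a QR decomposition (an assumption of this sketch, a common way to sample a random rotation, not the machinery used in [Johnson and Lindenstrauss 1984]) and checks that F is indeed a rank-k orthogonal projection.

```python
# A sketch of the F = U*QU construction described above. The random rotation U
# is obtained here by orthonormalizing a Gaussian matrix (QR decomposition),
# which is an illustrative choice; we then verify F is a rank-k projection.
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3

U, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random rotation of R^n
Q = np.zeros((n, n))
Q[:k, :k] = np.eye(k)                          # projection onto the first k coordinates

F = U.T @ Q @ U                                # candidate random rank-k projection

assert np.allclose(F, F.T)                     # symmetric ...
assert np.allclose(F @ F, F)                   # ... and idempotent, hence a projection
assert np.linalg.matrix_rank(F) == k
print("F is an orthogonal projection of rank", k)
```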
[Frankl and Maehara 1987]: The authors in this paper provide a much
shorter and simpler proof of the Johnson-Lindenstrauss Lemma. Thereafter,
they apply this lemma to compute the sphericity (see appendix for the definition)
of a graph. In their considerably simpler and shorter proof they consider the projection of a fixed unit vector onto a randomly chosen k-dimensional subspace. This
is the same as choosing a random unit vector (obtained by normalizing a draw from a zero-mean,
unit-variance multivariate Gaussian) and projecting it onto a fixed k-dimensional subspace. It is required that the projected length be within some factor of the length
of the original vector which implies that not all projections are permissible. They
bound the region of forbidden projection (in terms of the projection angle between
the unit vector and the subspace) and show that the probability of this forbidden projection region is bounded. Subsequently, they show that the probability
of the permissible projection region is non-zero and hence by simple probabilistic
arguments such a region (and hence such a mapping) exists.
In the process of their substantially simpler proof they also present improved results on the dimension of the projection subspace. Specifically, they show that
⌈9(ε² − 2ε³/3)^{-1} log n⌉ + 1 projected dimensions are sufficient. The constant 9 was not
specified in the previous results. In addition, the authors showed results for the
case when the vectors were assumed to be drawn uniformly from the surface of an
n-dimensional hypersphere (if we generate an n-dimensional vector from a zero-mean, unit-variance Gaussian and normalize it by its length, the resulting vectors are uniformly distributed over the surface of the n-ball); a point whose connection will become evident later.
[Matousek 1990]: In this context it is worth mentioning that this paper
presents a proof of the Johnson-Lindenstrauss Lemma along the same lines as the
original paper [Johnson and Lindenstrauss 1984], but the calculations are slightly
different, resulting in k = O(n log n), which is considerably worse than the previous
results of k = O(log n), where k is the dimension of the low-dimensional subspace.
[Indyk and Motwani 1998]: All the previous proofs assume orthogonality of
the projection matrix, spherical symmetry of the input points and randomization
(i.e., random choice of the projection subspace). This work shows that we do not
need orthogonality of the projection matrix but only need spherical symmetry and
randomness.
Instead of choosing vectors from the surface of an n-dimensional sphere, the
authors propose to choose random vectors independently from the d-dimensional
Gaussian distribution N^d(0, 1). By the 2-stability property of Gaussians, this is the
same as choosing each component of the random vector independently from the
standard normal distribution N(0, 1). Let X_i denote the projection of a unit vector onto the i-th of the k dimensions of the low-dimensional subspace. The authors
note that the square of each X_i is closely related to the exponential distribution, and that the squared length of the projected vector, the sum of the X_i², follows a gamma
distribution (which is closely related to the Poisson distribution). They then invoke known
concentration bounds readily available for gamma distributions to bound the squared length of the projected vector around its expectation. The key contribution of this proof
is the use of the Gaussian distribution for ascertaining spherical symmetry. The use of
a different distribution results in a much simpler proof and improved complexity
results. [Har-Peled 2005] provides a detailed and thorough exposition of the lemma
and proofs presented in [Indyk and Motwani 1998].
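The following small experiment illustrates the gamma/chi-square structure described above: for a fixed unit vector x and a projection matrix with i.i.d. N(0, 1) entries, each projected coordinate is N(0, 1) by 2-stability, so the squared length of the projection follows a chi-square (gamma) distribution with k degrees of freedom. The parameters and the use of a Kolmogorov-Smirnov test are illustrative choices, not part of the cited proof.

```python
# A small empirical check of the gamma/chi-square structure described above.
# For a fixed unit vector x and i.i.d. N(0,1) projection rows R_i, each R_i . x
# is N(0,1) by 2-stability, so sum_i (R_i . x)^2 follows chi2(k), a gamma law.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d, k, trials = 128, 16, 2000

x = rng.normal(size=d)
x /= np.linalg.norm(x)                      # fixed unit vector

R = rng.normal(size=(trials, k, d))         # one Gaussian projection matrix per trial
sq_len = np.sum((R @ x) ** 2, axis=1)       # squared length of each projected vector

ks_stat, p_value = stats.kstest(sq_len, "chi2", args=(k,))
print(f"mean = {sq_len.mean():.2f} (expected {k}), KS p-value = {p_value:.3f}")
```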
[Dasgupta and Gupta 1999]: As the title suggests, this paper presents an
elementary proof and substantially simplifies the proof techniques for the Johnson-Lindenstrauss Lemma. As discussed earlier, in [Frankl and Maehara 1987], the
authors use geometric insights to achieve their results whereas this work relies purely
on probabilistic techniques to achieve the same results. The key contribution is a
structural lemma which has been re-stated (in the appendix) for completeness.
The lemma essentially implies that the squared length of a random unit
vector (drawn from the surface of the n-sphere) projected onto some k-dimensional
subspace deviates from its mean only with very small probability. Subsequently, the
authors show that for a fixed pair of points this probability of deviation from the
mean is very small (at most 2/n²) and invoke the union bound to claim that the
probability that some pair of points suffers a large distortion is bounded by (n choose 2) × 2/n²,
which equals 1 − 1/n. Hence, using probabilistic reasoning the authors conclude that a mapping with low distortion exists with probability at least 1/n,
and this probability can be improved by using standard probability amplification
techniques. They also improve the bound on the target dimension to k ≥ 4(ε²/2 − ε³/3)^{-1} log n.
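As a quick worked example of this bound (an illustrative calculation of ours, not code from the paper), the following snippet evaluates the required target dimension k for a few values of n at ε = 0.1.

```python
# A quick worked example of the bound above (illustrative calculation only):
# the target dimension k as a function of n for eps = 0.1.
import math

def jl_dim(n, eps):
    """Dasgupta-Gupta style bound: k >= 4 / (eps**2 / 2 - eps**3 / 3) * ln(n)."""
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

for n in (1_000, 100_000, 10_000_000):
    print(n, jl_dim(n, eps=0.1))   # e.g. n = 1000 gives k around 5900
```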
4.2 High level proof idea
As we have already seen in the previous section, the probabilistic method has greatly
simplified the original proof of the Johnson-Lindenstrauss lemma. In addition,
the construction of the projection matrices was made considerably easier using
simple randomized algorithms. We also note that the simpler the
construction, the more involved the proof technique. In the following, we intend
to connect the aforementioned approaches and abstract out the commonalities in
their proof techniques.
First, we provide some intuition for the standard practice of randomly rotating
the projection matrix. Consider two points in n dimensions that we want to project
onto k (< n) dimensions while preserving their pairwise distance as much as possible. It might so happen that the two points are close in all coordinates and
very far apart in a single coordinate (say, r). Now if an adversary chooses the k coordinates
in a manner such that they are all perpendicular to r, then we have a bad case (since dropping this one coordinate results in a very high
distortion). So, in order to keep things good (and bound the worst-case distortion
for all pairs of points) we require that all coordinates contribute similar amounts to
the total pairwise distortion. In other words, the total distortion for any particular pair of points should be spread out over all the coordinates. But how do we
"always" achieve this? We use the known techniques of randomization and rotation
(of axes) to the rescue. The key idea is to rotate the coordinate axes in a manner
such that all coordinate axes contribute equally to the pairwise distortions. This
is equivalent to a smoothing effect whereby we smooth away any spikes caused by
unbalanced contribution to the pairwise distortion by some particular coordinate
axis. This rotation of the projection matrix is done in a randomized manner so
that the adversary cannot identify the bad axes. Now, all we need to show is that
despite the randomly chosen matrix elements and the rotation of the projection axes,
the squared length of the projected vector is sharply concentrated;
in other words, the squared length of the projected vector (say, v) does
not deviate much from that of the original vector (say, u), whose value it matches in expectation:

E[||v||²] = ||u||²
In all methods for producing JL-embeddings, the main idea is to show that for any
vector, the squared length of the projected vector is sharply concentrated
around its expected value by using standard concentration bounds. Thereafter, one
invokes the union bound over the (n choose 2) events (corresponding to all pairs of points) to
show that the probability that no (projected) pairwise distance is distorted by
more than a (1 ± ε) factor is sufficiently high. The steps of this high-level idea are
summarized below and illustrated in the small numerical sketch that follows the list:
(1) good event - squared length of projected vector highly concentrated
(2) bad event - squared length of projected vector is far away from its mean value
(3) the probability of this bad event is very small
(4) for all pairs, by union bound the probability of this bad event is bounded
(5) hence, probability of good event is non-zero
(6) and by probabilistic arguments, such a low dimension mapping exists
(7) by probability amplification techniques, we can make the probability of existence of this mapping arbitrarily high
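The sketch below walks through these steps on a small synthetic instance: it draws a Gaussian random projection, checks the good event over all pairs, and amplifies the success probability simply by redrawing on failure. The parameters and the Gaussian construction are illustrative choices among the methods surveyed above, not the specific procedure of any single cited paper.

```python
# A small sketch of the steps above on a synthetic instance: draw a Gaussian
# projection, test the "good event" over all pairs, and amplify the success
# probability by redrawing on failure. Parameters are illustrative only.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, d, eps = 100, 1000, 0.25
k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))   # target dimension
X = rng.normal(size=(n, d))

def good_event(X, P, eps):
    """True if no pairwise squared distance is distorted by more than (1 +/- eps)."""
    Y = X @ P
    for i, j in combinations(range(len(X)), 2):
        ratio = np.sum((Y[i] - Y[j]) ** 2) / np.sum((X[i] - X[j]) ** 2)
        if not (1 - eps <= ratio <= 1 + eps):
            return False
    return True

attempts = 0
while True:                                  # probability amplification: retry until success
    attempts += 1
    P = rng.normal(size=(d, k)) / np.sqrt(k)
    if good_event(X, P, eps):
        break
print(f"found a low-distortion projection after {attempts} attempt(s), k = {k}")
```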
5. EXTENSIONS OF THE JL TRANSFORM
As has been shown by the authors of [Indyk and Motwani 1998], it is sufficient
to work with a Gaussian distribution, which implies that we can construct the projection matrix by choosing each element of the matrix independently from a N(0, 1)
distribution. This also suggests that we no longer need the initial assumption of orthogonality but only need spherical symmetry and randomness. With this in mind
it is natural to ask whether this is the best we can do or whether we can do even
better. It turns out that we can construct even simpler and more efficient
projection matrices, as will be shown below.
[Achlioptas 2001]: In an attempt to make random projections easier to use
in practice, [Achlioptas 2001] presented a much simpler scheme for constructing
the projection matrix. Instead of choosing each element from a N(0, 1) Gaussian,
a matrix whose elements are chosen randomly from the set {−1, +1} can be used
for JL projection. In addition, he also showed that we can do as well by making
roughly two-thirds of the matrix entries zero (i.e., choose elements from {−1, 0, +1} where 0
is chosen with probability 2/3 and +1 or −1 with probability 1/6 each). Moreover,
he shows that this improvement in efficiency results in no loss in the quality of the
embedding. This is inspired by prior work on projecting onto random lines for nearest
neighbor search [Kleinberg 1997] and for learning intersections of halfspaces [Vempala 2004]. Such projections have also been used in learning mixtures of Gaussians
[Dasgupta 1999], [Arora and Kannan 2001].
In order to prove this result, the author uses sophisticated concentration inequalities that bound the moments of a random variable. The key contribution of
this work is the observation that spherical symmetry is no longer essential
and that randomness alone is enough to warrant such low-distortion mappings.
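The following snippet sketches the sparse {−1, 0, +1} construction described above; the sqrt(3/k) scaling is the usual normalization so that squared norms are preserved in expectation, and the dimensions chosen are illustrative assumptions of this sketch.

```python
# A sketch of the sparse "database-friendly" projection described above: entries
# from {+1, 0, -1} with probabilities 1/6, 2/3, 1/6, scaled by sqrt(3/k) so that
# squared norms are preserved in expectation. Dimensions here are illustrative.
import numpy as np

rng = np.random.default_rng(4)
d, k = 1000, 200

signs = rng.choice([-1.0, 0.0, 1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
P = np.sqrt(3.0 / k) * signs                 # sparse Achlioptas-style projection matrix

x = rng.normal(size=d)
print(np.sum(x ** 2), np.sum((x @ P) ** 2))  # the two squared norms should be close
```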
Now that we have done away with both orthogonality and spherical symmetry,
it is imperative to again ask the question - "Can we do better?"
[Ailon and Chazelle 2006]: The authors of this paper contend that the
improvement by [Achlioptas 2001] cannot be stretched beyond a point, since a
sparse matrix will typically distort a sparse vector and the resulting projected vector
might not be sufficiently concentrated. To counter this, the authors show
that for a sparse vector the resulting projected vector can still be concentrated if
the mass of the vector is first spread out over many components. The mass
can be spread by applying a Fourier transform. This results in the Fast Johnson-Lindenstrauss Transform (FJLT), which is a randomized FFT followed by a sparse
projection.
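The snippet below gives a rough structural sketch of the FJLT idea (random sign flips, a Hadamard mixing step, then a sparse Gaussian projection). The sparsity level q and the scalings used here are simplifying assumptions for illustration, not the exact parameter choices of [Ailon and Chazelle 2006].

```python
# A rough structural sketch of the FJLT idea: random sign flips, a Hadamard
# mixing step, then a sparse Gaussian projection. The sparsity q and the scalings
# are simplifying assumptions, not the exact choices of the cited paper.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(5)
d, k, q = 256, 32, 0.2                       # d must be a power of two for hadamard()

D = rng.choice([-1.0, 1.0], size=d)          # random sign flips (diagonal of D)
H = hadamard(d) / np.sqrt(d)                 # normalized Walsh-Hadamard matrix
mask = rng.random(size=(k, d)) < q           # sparse projection: ~q fraction nonzero
P = np.where(mask, rng.normal(scale=1.0 / np.sqrt(q * k), size=(k, d)), 0.0)

def fjlt(x):
    """Apply the three stages: sign flip, Hadamard mixing, sparse projection."""
    return P @ (H @ (D * x))

x = np.zeros(d)
x[0] = 1.0                                   # a maximally sparse input vector
print(np.linalg.norm(x) ** 2, np.linalg.norm(fjlt(x)) ** 2)   # roughly preserved
```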
Thus, this work improves on the previous approach and at the same time reduces
the amount of randomness required. We again ask the question: “Can we do any
better?".
Although the results might not be better, it turns out that we can completely remove randomness while constructing the projection matrix. This approach has been
suggested in derandomized constructions of the Johnson-Lindenstrauss Lemma.
We refer the interested reader to [Engebretsen et al. 2002], [Bhargava and Kosaraju
2005].
[Matoušek 2008]: Similar to the improvement proposed by [Achlioptas 2001],
[Matoušek 2008] improved on [Ailon and Chazelle 2006] by showing that we need
not draw the elements of the projection matrix from a zero-mean, unit-variance Gaussian and can do as well by choosing elements randomly from {+1, −1}. The paper uses
sophisticated concentration-bound techniques to establish these claims.
[Baraniuk et al. 2008]: The authors extend the notion of random projections
to manifolds. They show that random linear projections can preserve key information about manifold-modeled signals, recover faithful approximations to manifold-modeled signals, identify key properties of the manifold, and preserve
the pairwise geodesic distances of the mapped points.
[Clarkson 2008]: This paper tightens the theoretical bounds presented in
[Baraniuk et al. 2008].
[Sarlos 2006]: Until now, the Johnson-Lindenstrauss Lemma was considered primarily for Euclidean spaces. [Sarlos 2006] was the first work to extend the
Johnson-Lindenstrauss Lemma to affine subspaces. They showed that a d-flat
(a d-dimensional affine subspace) can be ε-embedded by an O(d/ε²)-map.
[Indyk and Naor 2007]: In this work, the Johnson-Lindenstrauss Lemma
was extended beyond linear subspaces by showing that a set of points with bounded
doubling dimension can be additively embedded (i.e., the distances are preserved up to additive error terms). The main approach undertaken
was to approximate the set of points using ε-nets and then extend the embedding
results from the ε-nets back to the full set of points.
[Johnson and Naor 2009]: The authors extend the JL dimensionality reduction lemma to other normed spaces. They show that if a normed space satisfies
the Johnson-Lindenstrauss Lemma then all the n-dimensional subspaces of the
normed space are isomorphic to Hilbert space with some distortion. Thus any
normed space which satisfies the Johnson-Lindenstrauss Lemma is very similar
to a Hilbert space.
[Magen 2002]: Until now we assumed that the Johnson-Lindenstrauss
Lemma preserves only pairwise distances of points. However, the fact that JL preserves angles and can be used to preserve any "k-dimensional angle" by projecting
down to dimension O(k log n/ε²) was first observed by [Magen 2002]. This mapping can preserve the volume of sets of size s < k within a factor of (1 + ε)^{s−1}. In
addition, they also showed that JL preserves the distances of points to affine hulls of
(k − 1) points to within a factor of (1 + ε).
6. APPLICATIONS OF THE JL TRANSFORM
The Johnson-Lindenstrauss Lemma has diverse uses in a wide multitude of
application areas. The monograph [Vempala 2004] on random projections presents
a nice overview of the applications of the Johnson-Lindenstrauss Lemma, some
of which have been listed below:
Combinatorial Optimization:
1. Rounding via Random Projection: Approximation algorithms usually proceed by first relaxing the original problem, then obtaining a solution to the
relaxed version of the problem, and finally rounding that solution. The Johnson-Lindenstrauss Lemma can be used as an important tool in the rounding phase
of such approximation schemes.
2. Embedding metrics in Euclidean Space: The distance-preserving properties of
the Johnson-Lindenstrauss Lemma can be used to construct low-dimensional
embeddings efficiently. [Bourgain 1984] was the first to observe that the lemma can
be used to embed any n-point metric space in an O(log n)-dimensional space
with O(log n) distortion.
3. Beyond Distance Preservation: We know that the Johnson-Lindenstrauss
Lemma can be used for embeddings that preserve the pairwise distances between
points in a set. Can we use it to preserve other properties of the point set, for
example volume? In order to preserve volume we have to preserve the point-to-subset distance, which is a natural generalization of the point-to-point distance. It
has been shown [Vempala 2004] that the Johnson-Lindenstrauss Lemma can be
used to preserve such other properties (for example, volume, by preserving point-to-subset
distances) as well.
Learning Theory:
1. Neuron-friendly random projections: Robust concept learning is a paradigm
in machine learning which quantifies the degree to which the attributes of an example can be altered without affecting the concept. It is closely related to large-margin classifiers, which in turn form the basis of Support Vector Machines. It
turns out that random projections can be used to reduce the dimensionality of the examples without affecting the concept class. In other
words, random projections preserve the properties of the classifier in the low-dimensional space. Based on this observation, "neuronal" versions of random projections
were proposed [Arriaga and Vempala 1999].
2. Robust half-spaces: Random projections can be used to reduce the dimensionality of the space in which half-spaces are used to classify a set of points. Suppose
a set of points lies in an n-dimensional space. The Johnson-Lindenstrauss
Lemma can be used to find a set of k half-spaces in a lower-dimensional space such that the intersection of the
half-spaces yields the same classification of the point set as can be obtained in the
original n-dimensional space.
Information Retrieval:
1. Random projection for NN in hypercubes: This problem deals with finding
the nearest neighbor of a query in a point set on the d-dimensional hypercube,
where coordinates take values 0 and 1 and the distance between points is measured as the number of coordinates on
which they differ (the Hamming distance). One approach to solving this NN problem is to project the points into a low-dimensional space such that the Hamming distance is preserved and then use data
structures which exploit the low dimensionality of the projected space. It has been
shown [Vempala 2004] that random projections can be used for such projections
while preserving the Hamming distance.
2. Fast low-rank approximation: Latent semantic indexing (LSI) [Dumais et al.
1988] is a well-known information retrieval technique based on low-rank approximation of the term-document matrix. Standard methods (such as the SVD) compute a
low-rank approximation of an m × n matrix with a complexity of O(mn²). It turns out that the Johnson-Lindenstrauss Lemma can be used to make LSI even faster, with a complexity of
O(mn log n), at the cost of a slight loss in accuracy.
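The following sketch illustrates the generic project-then-factor recipe behind such speedups (in the spirit of random-projection-based low-rank approximation, e.g. [Sarlos 2006]); it is an illustrative construction of ours, not the specific LSI algorithm referred to here.

```python
# An illustrative project-then-factor sketch of randomized low-rank approximation
# (in the spirit of the discussion above), not the specific LSI algorithm cited.
import numpy as np

rng = np.random.default_rng(6)
m, n, true_rank, k = 2000, 500, 10, 20       # sketch size k somewhat above the rank

A = rng.normal(size=(m, true_rank)) @ rng.normal(size=(true_rank, n))  # low-rank test matrix

Omega = rng.normal(size=(n, k))              # random projection of the columns of A
Q, _ = np.linalg.qr(A @ Omega)               # orthonormal basis for the sketched range
B = Q.T @ A                                  # small k x n matrix
U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
A_approx = Q @ (U_small * s) @ Vt            # rank-k approximation of A

print(np.linalg.norm(A - A_approx) / np.linalg.norm(A))   # relative error, near zero here
```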
3. Geometric p-median: Once again, we can use the Johnson-Lindenstrauss
Lemma to speed up clustering problems. One approach is to project the points
onto a hypercube such that the Hamming distances between pairs of points are preserved. Subsequently, the clustering can be performed on the hypercube.
Other Applications:
1. JL+geometry of graphs: This seminal paper [Linial et al. 1994] discusses the
algorithmic applications of embeddings which respect local properties of the
embedded graphs, such as pairwise distances and volumes. The authors show that
random projections can be used to embed finite metric spaces into Euclidean spaces
with low distortion. They improve on the results of [Bourgain 1984] and show
that any n-point metric can be embedded in an O(log n)-dimensional Euclidean space
with logarithmic distortion.
2. JL+data streams: [Indyk 2006] showed that JL-embeddings can be performed in a "data-stream" setting where one has limited memory and is allowed
only a single pass over the entire data (stream). They compare JL-type dimensionality reduction with their prior work on sketching data streams and observe
that the space of sketching operators is not a normed space, which restricts the use
of sketches as a dimensionality reduction technique. Subsequently, they propose
modified sketching algorithms for normed spaces which can be viewed as streaming
variants of the Johnson-Lindenstrauss Lemma.
3. JL+clustering: [Schulman 2000] seeks to cluster a set of points so as to minimize
the sum of squared intracluster distances. JL-embeddings are used
as part of the proposed approximation algorithm for the clustering. The author
contends that to cluster a set of n points lying in a high-dimensional space we never
need to consider a space of dimension greater than (n − 1). Hence, near-isometric
dimensionality reduction techniques can be used to project the set of points onto
an affine subspace and then perform clustering in that subspace.
4. JL+approximate nearest neighbor search: Exact nearest neighbor search is
computationally expensive, particularly in high dimensions and for large datasets.
An alternative is to find an approximate nearest neighbor, since most applications
do not require the closest point to the query and settle for an answer which is quite
close. However, in almost all algorithms for approximate nearest neighbor
search the bottleneck is the high dimensionality of the dataset. Most algorithms
proceed by mapping the points to a low-dimensional space and then solving the approximate
nearest neighbor problem there. In [Ailon and Chazelle 2006], the
authors propose and use the FJLT (Fast JL Transform) to map points to l1 space
and subsequently perform a nearest neighbor search in the mapped space.
5. JL+compressed sensing: The paradigm of Compressed Sensing (CS) posits
that if a signal can be approximated using a sparse representation then it can also
be accurately reconstructed from a small collection of linear measurements. CS
makes use of randomness in constructing its transform matrices. On the other
hand, we have already seen that the Johnson-Lindenstrauss Lemma relies on
random matrices for near-isometric projection into low-dimensional subspaces. This
suggests a connection between the two. [Baraniuk et al. 2008] presents key
results which highlight the intimate relation between the JL transform and compressed
sensing. This connection is established using the Restricted Isometry
Property (RIP), a key property of the CS projection matrix which guarantees tractable and stable signal recovery. Although JL preserves the pairwise distances
of a finite set of points whereas RIP ensures a near-isometric embedding of an infinite set
of points (the k-sparse signals in R^n, k << n), it has been shown that the RIP can be derived by applying
JL-based near-isometric embeddings to the set of k-sparse signals. We point
the reader to [Baraniuk et al. 2008] for full details.
7. DISCUSSIONS
In this report, we have covered much ground on the Johnson-Lindenstrauss
Lemma. We started with a survey of the proof techniques. In addition to the
survey of existing literature, an attempt was made to connect the existing proofs and
abstract out the key ideas underlying both the initially involved and the much simplified
later approaches. The high-level view of the proof techniques exposes the fact that
the Johnson-Lindenstrauss transform is yet another instance of the
concentration of measure phenomenon (like the Chernoff bound). We also notice
that the right choice of distribution leads to considerably simpler proofs. Another
observation was that the complexity of the proof techniques gradually
increases as the construction of the projection matrix is simplified.
The original Johnson-Lindenstrauss Lemma was primarily defined for Hilbert
(more specifically, Euclidean) spaces. Obvious supplements were to propose extensions for non-Hilbert spaces and non-linear spaces. We have discussed a few representative works in this regard.
No survey is complete without a discussion of the applications that motivate
its topic. As can be seen, the JL lemma finds applications in diverse
areas which rely on dimensionality reduction techniques to design better and more
efficient algorithms.
Bregman divergences are a class of distance functions which, in general, are neither symmetric nor obey the triangle inequality. Hence, they are not metrics and so
in a way generalize distance functions. A few examples include the Kullback-Leibler
divergence used in machine learning, the Itakura-Saito divergence used in speech processing,
the Mahalanobis divergence used in computer vision, etc. As in the case of
Euclidean distances, algorithms that deal with Bregman divergences face the "curse
of dimensionality". One interesting avenue of future work would be to explore the
possibility of JL-like results for this class of divergence functions.
8. APPENDIX
8.1 Definition of Sphericity
The following definition of sphericity is due to [Frankl and Maehara 1987].
Definition: Sphericity. The sphericity of a graph G(V, E), denoted sph(G), is the
smallest integer n such that there is an embedding f : V → R^n satisfying
0 < ||f(u) − f(v)|| < 1 iff uv ∈ E.
8.2 Structural Lemma
Lemma: 8.1. Let k < d, and let L be the squared length of the projection of a random unit vector in R^d onto its first k coordinates. Then, if β < 1,

Pr[L ≤ βk/d] ≤ β^{k/2} (1 + (1 − β)k/(d − k))^{(d−k)/2} ≤ exp((k/2)(1 − β + ln β)),

and, if β > 1,

Pr[L ≥ βk/d] ≤ β^{k/2} (1 + (1 − β)k/(d − k))^{(d−k)/2} ≤ exp((k/2)(1 − β + ln β)).
The proofs follow from those of Chernoff-Hoeffding bounds and we refer the reader
to [Dasgupta and Gupta 1999].
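As a sanity check, the following short Monte Carlo experiment (with illustrative parameters of our choosing) estimates Pr[L ≤ βk/d] for random unit vectors and compares it against the exponential bound above.

```python
# A Monte Carlo sanity check of the tail bound above, with illustrative parameters:
# estimate Pr[L <= beta*k/d] for random unit vectors and compare with the bound.
import numpy as np

rng = np.random.default_rng(7)
d, k, beta, trials = 100, 20, 0.5, 100_000

V = rng.normal(size=(trials, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)     # random unit vectors in R^d
L = np.sum(V[:, :k] ** 2, axis=1)                 # squared length of first k coordinates

empirical = np.mean(L <= beta * k / d)
bound = np.exp(0.5 * k * (1 - beta + np.log(beta)))
print(f"empirical = {empirical:.4f}, bound = {bound:.4f}")
```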
REFERENCES
Metric spaces, the lipschitz constant. http://www.mathreference.com/top-ms,lip.html.
Achlioptas, D. 2001. Database-friendly random projections. In PODS ’01: Proceedings of the
twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
ACM, New York, NY, USA, 274–281.
A.Dvoretzky. 1961. Some results on convex bodies and banach spaces. 1961 Proc. Internat.
Sympos. Linear Spaces (Jerusalem, 1960), 123–160.
Ailon, N. and Chazelle, B. 2006. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In STOC ’06: Proceedings of the thirty-eighth annual ACM symposium
on theory of computing. ACM, New York, NY, USA, 557–563.
Arora, S. and Kannan, R. 2001. Learning mixtures of arbitrary gaussians. In STOC ’01:
Proceedings of the thirty-third annual ACM symposium on Theory of computing. ACM, New
York, NY, USA, 247–257.
Arriaga, R. I. and Vempala, S. 1999. An algorithmic theory of learning: Robust concepts and
random projection. Foundations of Computer Science, Annual IEEE Symposium on 0, 616.
Baraniuk, R., Davenport, M., Devore, R., and Wakin, M. The johnson-lindenstrauss lemma
meets compressed sensing.
Baraniuk, R., Davenport, M., Devore, R., and Wakin, M. 2008. A simple proof of the restricted
isometry property for random matrices. Constr. Approx.
Bhargava, A. and Kosaraju, S. R. 2005. Derandomization of dimensionality reduction and sdp
based algorithms. In Algorithms and Data Structures. 396–408.
Bourgain, J. 1984. On lipschitz embedding of finite metric spaces in hilbert space. In Israel
Journal of Mathematics. Vol. 52. 46–52.
Clarkson, K. L. 2008. Tighter bounds for random projections of manifolds. In SCG ’08: Proceedings of the twenty-fourth annual symposium on Computational geometry. ACM, New York,
NY, USA, 39–48.
Dasgupta, S. 1999. Learning mixtures of gaussians. In FOCS ’99: Proceedings of the 40th Annual
Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, DC,
USA, 634.
Dasgupta, S. and Gupta, A. 1999. An elementary proof of the johnson-lindenstrauss lemma.
Technical Report 99-006, U. C. Berkeley.
Dumais, S. T., Furnas, G. W., Landauer, T. K., and Deerwester, S. 1988. Using latent semantic analysis to improve information retrieval. In Conference on Human Factors in Computing.
281–285.
Engebretsen, L., Indyk, P., and O’Donnell, R. 2002. Derandomized dimensionality reduction
with applications. In SODA ’02: Proceedings of the thirteenth annual ACM-SIAM symposium
on Discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA,
USA, 705–712.
Frankl, P. and Maehara, H. 1987. The johnson-lindenstrauss lemma and the sphericity of some
graphs. J. Comb. Theory Ser. A 44, 3, 355–362.
Har-Peled, S. 2005. JL Notes.
Indyk, P. 2006. Stable distributions, pseudorandom generators, embeddings, and data stream
computation. J. ACM 53, 3, 307–323.
Indyk, P. and Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of
dimensionality. In STOC ’98: Proceedings of the thirtieth annual ACM symposium on Theory
of computing. ACM, New York, NY, USA, 604–613.
Indyk, P. and Naor, A. 2007. Nearest-neighbor-preserving embeddings. ACM Trans. Algorithms 3, 3, 31.
Johnson, W. and Lindenstrauss, J. 1984. Extensions of lipschitz maps into a hilbert space. In
Contemporary Mathematics. Vol. 26. 189–206.
Johnson, W. B. and Naor, A. 2009. The johnson-lindenstrauss lemma almost characterizes
hilbert space, but not quite. In SODA ’09: Proceedings of the Nineteenth Annual ACM SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 885–891.
Kleinberg, J. M. 1997. Two algorithms for nearest-neighbor search in high dimensions. In STOC
’97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM,
New York, NY, USA, 599–608.
Linial, N., London, E., and Rabinovich, Y. 1994. The geometry of graphs and some of its algorithmic applications. In SFCS ’94: Proceedings of the 35th Annual Symposium on Foundations
of Computer Science. IEEE Computer Society, Washington, DC, USA, 577–591.
Magen, A. 2002. Dimensionality reductions that preserve volumes and distance to affine spaces,
and their algorithmic applications. In RANDOM ’02: Proceedings of the 6th International
Workshop on Randomization and Approximation Techniques. Springer-Verlag, London, UK,
239–253.
Matousek, J. 1990. Bi-lipschitz embeddings into low dimensional euclidean spaces. In Comment.
Math. Univ. Carolinae. Vol. 31. 589–600.
Matoušek, J. 2008. On variants of the johnson–lindenstrauss lemma. Random Struct. Algorithms 33, 2, 142–156.
Sarlos, T. 2006. Improved approximation algorithms for large matrices via random projections.
In FOCS ’06: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer
Science. IEEE Computer Society, Washington, DC, USA, 143–152.
Schulman, L. J. 2000. Clustering for edge-cost minimization (extended abstract). In STOC ’00:
Proceedings of the thirty-second annual ACM symposium on Theory of computing. ACM, New
York, NY, USA, 547–555.
T. Figiel, J. L. and Milman, V. D. 1976. The dimension of almost spherical sections of convex
bodies. Bull. Amer. Math. Soc. 82, 4, 575–578.
Vempala, S. 2004. The random projection method. Vol. 65. AMS.