Estimating Intrinsic Dimension
by
Justin Eberhardt
Department of Mathematics and Statistics
University of Minnesota Duluth
Duluth, MN 55812
June 2007
Estimating Intrinsic Dimension
A project
submitted to the faculty of the Graduate School
of the University of Minnesota
by
Justin Eberhardt
In partial fulfillment of the requirements
for the degree of
Master of Science
In Applied and Computational Mathematics
June 2007
Abstract
The intrinsic dimension of a dataset is often much lower than the dimension of the space in which the data are given. It is valuable to know the intrinsic dimension of a dataset so that the high-dimensional dataset can be replaced by a lower-dimensional dataset that is easier to manipulate. Traditional intrinsic dimension estimators, such as principal component analysis, can only be used on linear spaces. Non-linear manifolds require other methods, such as nearest-neighbor estimators. We compare three nearest-neighbor estimators based on several criteria and show that two of the estimators perform well on a wide range of non-linear datasets.
I would like to thank my advisors, Dr. Kang James and Dr. Barry James,
for their guidance and support throughout this entire project.
Contents
1. Introduction
2. Nearest Neighbor Estimators
   2.1 Nearest-Neighbor Information
   2.2 Nearest-Neighbor Regression Estimator Overview
   2.3 Nearest-Neighbor Maximum Likelihood Estimator Overview
   2.4 Nearest-Neighbor Regression Estimator Derivation
   2.5 Nearest-Neighbor Maximum Likelihood Estimator
   2.6 Revised Nearest-Neighbor Maximum Likelihood Estimator
3. Datasets
   3.1 Gaussian Sphere
   3.2 Swiss Roll
   3.3 Double Swiss Roll
   3.4 Artificial Face
   3.5 25-Dimensional Gaussian Sphere (ID = 15)
4. Results
   4.1 Accuracy
   4.2 Dependence on Number of Neighbors
   4.3 Dependence on Distribution Type
   4.4 Effectiveness on Datasets with High Intrinsic Dimension
   4.5 Summary
References
A. Appendix
   A.1 Program Overview
   A.2 Code
      int_dim.cpp
      int_dim_reg.h
      int_dim_mle.h
      random_gen.h
      gauss_gen.h
1. Introduction
High dimensionality limits the usefulness of practical data, but it is often possible to represent high-dimensional data in a low-dimensional space. The smallest dimension that describes a dataset without significant loss of features is the intrinsic dimension of the dataset [1]. Examples of high-dimensional datasets with low intrinsic dimensionality include biometric datasets such as face images, fingerprints, and iris scans. Genetic information is also believed to have low intrinsic dimension [2]. Intrinsic dimension estimators are a useful tool for determining whether or not a dataset can be represented in a lower-dimensional space.
Several intrinsic dimension estimators are currently available. One traditional method for determining
intrinsic dimension is principal component analysis (PCA). More recently, nearest-neighbor (NN)
methods, including NN regression estimation and NN maximum likelihood estimation have been
proposed and preliminary results on real and simulated datasets are promising [1], [2].
PCA can be used as an intrinsic dimension estimator, but its effectiveness is limited in many practical applications. Implementing PCA as an intrinsic dimension estimator requires the covariance matrix of the input dataset. The eigenvalues of the covariance matrix are computed, and the number of eigenvalues greater than a specified threshold is taken as the intrinsic dimension.
PCA is useful for certain applications, including data compression algorithms; however, when working with high-dimensional matrices the process is computationally expensive. The greatest limitation of PCA is that it is only useful for linear manifolds, that is, manifolds that can be represented by linear transformations of Euclidean spaces. Biometric and genetic data are often non-linear, so PCA is not useful for determining their actual intrinsic dimension.
Consider the data manifold pictured below, typically referred to as a Swiss Roll. The manifold is a two-dimensional plane that has been "rolled" so that the shape occupies three-dimensional space. Since the underlying manifold is non-linear, traditional methods for calculating intrinsic dimension fail.
Figure 1
A typical Swiss Roll manifold.
Data containing more than three dimensions cannot be plotted directly in Cartesian space; however, it is often useful to visualize such data. Multi-Dimensional Scaling (MDS) is a method that seeks to reduce dimensionality while preserving the Euclidean distances between data points. Through MDS, data can be "flattened" without significant loss of information.
MDS flattens data in a way that preserves distances between observations; however, it requires
computationally-expensive iteration. The quality of the flattening is determined by the amount of stress
in the dataset. Stress is a measure of the difference between distances in the original dataset and
corresponding distances in the flattened dataset. Low stress implies that the data in the reduced set is
similar to the data in the original set.
Traditional Estimator    Limitation
PCA                      Fails on non-linear manifolds
MDS                      Computationally expensive
The limitations of PCA and MDS applied to high-dimensional, non-linear datasets led to the
development of a new type of estimator that uses nearest-neighbor information. The nearest-neighbor
estimators rely on the assumption that the density of observations in a small neighborhood around each
observation in the dataset is constant. The literature shows that, on both practical and simulated
datasets, the nearest-neighbor estimators are very accurate [1], [2], [3]. We give a detailed explanation
of the theory behind nearest neighbor estimators in Section 2 including a full derivation of a regression
and maximum likelihood estimator. Section 3 describes the datasets used in our simulations, and in
section 4, we compare the general accuracy and characteristics of the estimators.
Proposed Estimators
Nearest-Neighbor Regression Estimator (NN REG)
Nearest-Neighbor Maximum Likelihood Estimator (NN MLE)
2. Nearest Neighbor Estimators
2.1 Nearest-Neighbor Information
Nearest-neighbor information is extracted from a dataset by calculating the Euclidean distance between each pair of observations; the distances are stored in an n x n matrix, where n is the number of observations. The nearest-neighbor estimators require that each row of the distance matrix be sorted so that the distance from each observation to its kth neighbor is known. This sorted matrix is called the nearest-neighbor matrix.
Distance Matrix (row i, column j holds d_{i,j}):

       1        2        3       ...    n
  1    0        d_{1,2}  d_{1,3} ...    d_{1,n}
  2    d_{2,1}  0        d_{2,3} ...    d_{2,n}
  3    d_{3,1}  d_{3,2}  0       ...    d_{3,n}
  ...
  n    d_{n,1}  d_{n,2}  d_{n,3} ...    0

Nearest-Neighbor Matrix (row i, column k holds t_{i,k}; the first column is zero, the distance from each observation to itself):

       1    2        3       ...    n
  1    0    t_{1,2}  t_{1,3} ...    t_{1,n}
  2    0    t_{2,2}  t_{2,3} ...    t_{2,n}
  3    0    t_{3,2}  t_{3,3} ...    t_{3,n}
  ...
  n    0    t_{n,2}  t_{n,3} ...    t_{n,n}

Figure 2
Distance matrix and nearest-neighbor matrix. Row i, column j in the distance matrix is the distance from observation i to observation j. Row i, column k in the nearest-neighbor matrix is the distance from observation i to its kth NN.
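As a concrete illustration, a minimal C++ sketch of this step is shown below. It is not the program used in our simulations (that listing appears in Appendix A.2), and the function and variable names are illustrative only.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Build the n x n distance matrix and sort each row in place to obtain the
// nearest-neighbor matrix: nn[i][k] is the distance from observation i to its
// kth nearest neighbor (column 0 holds the zero distance from i to itself).
std::vector<std::vector<double> >
nearest_neighbor_matrix(const std::vector<std::vector<double> >& data)
{
    const std::size_t n = data.size();
    std::vector<std::vector<double> > nn(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t d = 0; d < data[i].size(); ++d) {
                const double diff = data[i][d] - data[j][d];
                sum += diff * diff;                // squared Euclidean distance
            }
            nn[i][j] = std::sqrt(sum);             // d(i, j)
        }
        std::sort(nn[i].begin(), nn[i].end());     // row i becomes t_{i,1} <= t_{i,2} <= ...
    }
    return nn;
}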
2.2 Nearest-Neighbor Regression Estimator Overview
Distance to kth NN pdf (approximated as Poisson) -> Expected distance to the kth NN -> Expected sample-averaged distance to the kth NN -> NN REG estimator
Figure 3
Outline of the nearest-neighbor regression estimator.
The nearest-neighbor regression estimator (NN REG) is derived by finding the density of the distance to the kth nearest neighbor for an individual observation in the dataset. From this density, the expected distance to the kth nearest neighbor can be obtained. We will show that the natural log of the expected sample-averaged distance to the kth NN is approximately linear in the natural log of k, with slope equal to the inverse of the intrinsic dimension. By selecting a range of values for k, we can therefore estimate m with a simple linear regression model.
2.3 Nearest-Neighbor Maximum Likelihood Estimator Overview
Counting process: binomial (approximated as Poisson) -> Joint occurrence probability -> Joint occurrence density -> Log-likelihood function -> NN MLE estimator
Figure 4
Outline of the nearest-neighbor maximum likelihood estimator.
The derivation of the nearest-neighbor maximum likelihood estimator is composed of three steps. It is assumed that the density of observations in a small neighborhood around each observation is constant. Therefore, the number of observations within distance t of observation x is binomially distributed with probability of success equal to f(x) times the volume of the sphere of radius t. The joint counting probability and joint occurrence density are calculated based on the Poisson approximation to this binomial. The final formula for the estimator is obtained by maximizing the log-likelihood function derived from the joint occurrence density.
2.4 Nearest-Neighbor Regression Estimator Derivation
Pettis, Bailey, Jain, & Dubes (1979)
The following derivation is based on [1] and [4]. We have added additional explanation when needed.
Let:
x : an observation in the input dataset, the high-dimensional dataset (HDD)
p : the dimension of the HDD
m : the intrinsic dimension of the dataset
T_k, T_{x,k} : the distance from x to its kth NN
N(t,x) : the number of observations within distance t of observation x
N_{r,s} : the number of observations between the distances r and s from observation x
V_t = V(m) t^m : the volume of a sphere with radius t
V(m) : the volume of the m-dimensional unit sphere
Counting the number of observations within distance t of x is a binomial counting process with n-1 trials (the remaining observations) and probability of success equal to the local density times the volume of a sphere with radius t. We will assume that the density is constant in a small neighborhood and depends only on x. Thus, the probability that there are k observations within distance t of x can be written as:

P[N(t,x) = k] = \binom{n-1}{k} [f(x)V_t]^k [1 - f(x)V_t]^{n-1-k}    (1)
We begin by finding the probability density function (pdf) of the distance from x to its kth NN. Writing P[t \le T_{x,k} \le t + \Delta t] \approx f_{x,k}(t) \Delta t, where f_{x,k} is the pdf of the distance from x to its kth NN,

f_{x,k}(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P[t \le T_{x,k} \le t + \Delta t]
           = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P[N(t,x) = k-1 \,\cap\, N(t+\Delta t, x) = k]    (2)
This probability has a trinomial distribution: of the n-1 remaining observations, k-1 lie within distance t of x, one lies in the shell between t and t + \Delta t, and n-k-1 lie beyond t + \Delta t.
Figure 5

f_{x,k}(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \frac{(n-1)!}{(k-1)!\,1!\,(n-k-1)!} [f(x)V_t]^{k-1} [f(x)\Delta V_t] [1 - f(x)V_{t+\Delta t}]^{n-k-1}

           = \lim_{\Delta t \to 0} \frac{\Delta V_t}{\Delta t} (n-1) f(x) \frac{(n-2)!}{(k-1)!\,(n-k-1)!} [f(x)V_t]^{k-1} [1 - f(x)V_t]^{n-k-1}

where

\lim_{\Delta t \to 0} \frac{\Delta V_t}{\Delta t} = V_t' = V(m)\,m\,t^{m-1}    (3)

\frac{(n-2)!}{(k-1)!\,(n-k-1)!} = \binom{n-2}{k-1}    (4)

so that

f_{x,k}(t) = (n-1) f(x) V(m)\,m\,t^{m-1} \binom{n-2}{k-1} [f(x)V_t]^{k-1} [1 - f(x)V_t]^{n-k-1}    (5)
The factor \binom{n-2}{k-1} [f(x)V_t]^{k-1} [1 - f(x)V_t]^{n-k-1} is a binomial probability with n-2 trials and probability of success equal to f(x)V_t.
Approximating this binomial distribution as a Poisson with mean \lambda = (n-2) f(x) V_t, and noting that (k-1)! = \Gamma(k), we obtain the following pdf. Let c = (n-2) f(x) V(m). Then

f_{x,k}(t) = \frac{(n-1) f(x) V(m)\,m\,t^{m-1}\,(c\,t^m)^{k-1}\,e^{-c\,t^m}}{\Gamma(k)}, \quad t > 0,    (6)

and f_{x,k}(t) = 0 elsewhere.
Now we find the expected value of T_{x,k} by integrating over all values of t:

E(T_{x,k}) = \int_0^{\infty} t\,f_{x,k}(t)\,dt = \frac{(n-1) f(x) V(m)}{c\,\Gamma(k)} \int_0^{\infty} t\,(c\,t^m)^{k-1}\,c\,m\,t^{m-1}\,e^{-c\,t^m}\,dt    (7)

Let u = c\,t^m, so that du = c\,m\,t^{m-1}\,dt and t = (u/c)^{1/m}. Then

E(T_{x,k}) = \frac{(n-1) f(x) V(m)}{c\,\Gamma(k)} \int_0^{\infty} \left(\frac{u}{c}\right)^{1/m} u^{k-1}\,e^{-u}\,du    (8)

           = \frac{(n-1) f(x) V(m)}{c^{1+1/m}\,\Gamma(k)} \int_0^{\infty} u^{(k+\frac{1}{m})-1}\,e^{-u}\,du    (9)

Since \int_0^{\infty} u^{(k+\frac{1}{m})-1}\,e^{-u}\,du = \Gamma(k+\frac{1}{m}),

E(T_{x,k}) = \frac{\Gamma(k+\frac{1}{m})}{\Gamma(k)} \cdot \frac{(n-1) f(x) V(m)}{c^{1+1/m}} \approx \frac{\Gamma(k+\frac{1}{m})}{\Gamma(k)} \cdot \frac{1}{[(n-1) f(x) V(m)]^{1/m}},    (10)

where the approximation replaces the n-2 in c by n-1.
We define the sample-averaged distance from the observations to their kth nearest neighbors as

\bar{T}_k = \frac{1}{n} \sum_{i=1}^{n} T_{x_i,k}    (11)
and, combining the preceding two equations, we have the following expected value of \bar{T}_k:

E(\bar{T}_k) = \frac{1}{n} \sum_{i=1}^{n} E(T_{x_i,k}) \approx \frac{1}{n} \sum_{i=1}^{n} \frac{\Gamma(k+\frac{1}{m})}{\Gamma(k)} \cdot \frac{1}{[(n-1) f(x_i) V(m)]^{1/m}}    (12)

             = \frac{\Gamma(k+\frac{1}{m})}{k^{1/m}\,\Gamma(k)}\; k^{1/m}\; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{[(n-1) f(x_i) V(m)]^{1/m}}    (13)

Let

G_{k,m} = \frac{k^{1/m}\,\Gamma(k)}{\Gamma(k+\frac{1}{m})}  and  C_n = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{[(n-1) f(x_i) V(m)]^{1/m}},

so that

E(\bar{T}_k) = \frac{1}{G_{k,m}}\; k^{1/m}\; C_n    (14)

Taking the logarithm of both sides and estimating E(\bar{T}_k) by \bar{T}_k gives the result

\log(G_{k,m}) + \log(\bar{T}_k) \approx \frac{1}{m} \log(k) + \log(C_n)    (15)
This equation has the form Y = \beta_0 + \beta_1 X + \epsilon, where

Y = \log(G_{k,m}) + \log(\bar{T}_k),  \beta_1 = \frac{1}{m},  X = \log(k),  \beta_0 = \log(C_n), which is independent of k.

\log(G_{k,m}) is close to zero for all k and m. Thus, plotting X = \log(k) against Y \approx \log(\bar{t}_k) for values of k ranging from 1 to an arbitrarily chosen K, we can estimate the intrinsic dimension m as the inverse slope of a regression line through these points. K should be chosen so that the density is relatively constant out to the distance to the Kth NN. The resulting least-squares estimate is
\hat{m} = \left[ \frac{K \sum_{k=1}^{K} X_k Y_k - \left(\sum_{k=1}^{K} X_k\right)\left(\sum_{k=1}^{K} Y_k\right)}{K \sum_{k=1}^{K} X_k^2 - \left(\sum_{k=1}^{K} X_k\right)^2} \right]^{-1}    (16)
A slightly better estimate can be obtained by writing \log(G_{k,m}) as a Taylor series in k and m and iterating:

\log(G_{k,m}) \approx \frac{m-1}{2km^2} + \frac{(m-1)(m-2)}{12k^2m^3} - \frac{(m-1)^2}{12k^3m^4} - \frac{(m-1)(m-2)(m^2+3m-3)}{120k^4m^5} + O(k^{-5})    (17)
NN REG
  Calculate the NN matrix
  Initial estimate:
    X_k = \log(k),  Y_k = \log(\bar{t}_k),  k = 1, ..., K
    \hat{m}_0 = \left[ \frac{K \sum X_k Y_k - \sum X_k \sum Y_k}{K \sum X_k^2 - (\sum X_k)^2} \right]^{-1}
  Calculate \log(G_{k,\hat{m}_0}) with the Taylor series approximation (17)
  While |\hat{m}_i - \hat{m}_{i-1}| > \epsilon:
    X_k = \log(k),  Y_k = \log(\bar{t}_k) + \log(G_{k,\hat{m}_{i-1}})
    \hat{m}_i = \left[ \frac{K \sum X_k Y_k - \sum X_k \sum Y_k}{K \sum X_k^2 - (\sum X_k)^2} \right]^{-1}
    Calculate \log(G_{k,\hat{m}_i})
    Increment i
Figure 6
Pseudo code for the NN REG program.
■
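A compact C++ sketch of the iteration outlined in Figure 6 is shown below. It assumes the sample-averaged distances \bar{t}_1, ..., \bar{t}_K have already been computed from the nearest-neighbor matrix; the function and variable names are illustrative, and the program actually used in our simulations is the listing int_dim_reg.h in Appendix A.2.

#include <cmath>
#include <cstddef>
#include <vector>

// Taylor-series approximation (17) to log(G_{k,m}).
double log_G(int k, double m)
{
    const double kk = k;
    return (m - 1) / (2 * kk * m * m)
         + (m - 1) * (m - 2) / (12 * kk * kk * m * m * m)
         - (m - 1) * (m - 1) / (12 * kk * kk * kk * m * m * m * m)
         - (m - 1) * (m - 2) * (m * m + 3 * m - 3) / (120 * std::pow(kk, 4) * std::pow(m, 5));
}

// Inverse slope of the least-squares line through (X_k, Y_k), equation (16).
double inverse_slope(const std::vector<double>& X, const std::vector<double>& Y)
{
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    const double K = X.size();
    for (std::size_t k = 0; k < X.size(); ++k) {
        sx += X[k]; sy += Y[k]; sxy += X[k] * Y[k]; sxx += X[k] * X[k];
    }
    return (K * sxx - sx * sx) / (K * sxy - sx * sy);
}

// NN REG: t_bar[k-1] holds the sample-averaged distance to the kth NN, k = 1..K.
double nn_reg(const std::vector<double>& t_bar, double eps = 0.01, int maxiter = 10)
{
    const int K = static_cast<int>(t_bar.size());
    std::vector<double> X(K), Y(K);
    for (int k = 1; k <= K; ++k) {
        X[k-1] = std::log(static_cast<double>(k));
        Y[k-1] = std::log(t_bar[k-1]);
    }
    double m = inverse_slope(X, Y);                            // initial estimate, eq. (16)
    for (int it = 0; it < maxiter; ++it) {
        const double m_prev = m;
        for (int k = 1; k <= K; ++k)
            Y[k-1] = std::log(t_bar[k-1]) + log_G(k, m_prev);  // corrected response, eq. (15)
        m = inverse_slope(X, Y);
        if (std::fabs(m - m_prev) < eps) break;                // convergence test from Figure 6
    }
    return m;
}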
2.5 Nearest-Neighbor Maximum Likelihood Estimator
Levina & Bickel (2004)
Similar to the estimator in [1], NN MLE exploits neighborhood information. The following derivation
is based on [2], [4], and [5]. We have added additional explanation when needed.
The number of observations within distance t of observation x is

N(t,x) = \sum_{i=1}^{n} 1\{X_i \in S_x(t)\}, where S_x(r) is a sphere of radius r about x    (18)

To find the joint occurrence density, we will model the binomial counting process N(t,x) as a Poisson counting process. N(t,x) has mean \lambda_t = n f(x) V_t. We approximate N(t,x) by N_t \sim \mathrm{POI}(\lambda_t) with rate \lambda_t' = n f(x) V(m)\,m\,t^{m-1}, so that for 0 < r < s

P[N_{r,s} = n] = \frac{e^{-(\lambda_s - \lambda_r)} (\lambda_s - \lambda_r)^n}{n!}    (19)

The joint density of the distances T_1, ..., T_K from x to its first K nearest neighbors is

f_{T_1,...,T_K}(t_1, ..., t_K) = \lim_{\Delta t \to 0} \frac{1}{\Delta t^K} P[N_{0,t_1} = 0,\ N_{t_1,t_1+\Delta t} = 1,\ N_{t_1+\Delta t,t_2} = 0,\ N_{t_2,t_2+\Delta t} = 1,\ ...,\ N_{t_K,t_K+\Delta t} = 1]    (20)

Because the increments of a nonhomogeneous Poisson process are independent, this probability factors:
= \lim_{\Delta t \to 0} \frac{1}{\Delta t^K}\, P[N_{0,t_1} = 0] \prod_{i=1}^{K-1} P[N_{t_i,t_i+\Delta t} = 1]\,P[N_{t_i+\Delta t,t_{i+1}} = 0]\; P[N_{t_K,t_K+\Delta t} = 1]

= \lim_{\Delta t \to 0} \frac{1}{\Delta t^K}\, e^{-(\lambda_{t_1} - \lambda_0)} \prod_{i=1}^{K} \frac{e^{-(\lambda_{t_i+\Delta t} - \lambda_{t_i})} (\lambda_{t_i+\Delta t} - \lambda_{t_i})^1}{1!} \prod_{i=2}^{K} e^{-(\lambda_{t_i} - \lambda_{t_{i-1}+\Delta t})}    (21)

The exponential factors telescope:

e^{-(\lambda_{t_1} - \lambda_0)} \prod_{i=1}^{K} e^{-(\lambda_{t_i+\Delta t} - \lambda_{t_i})} \prod_{i=2}^{K} e^{-(\lambda_{t_i} - \lambda_{t_{i-1}+\Delta t})} \to e^{-\lambda_{t_K}} \quad \text{as } \Delta t \to 0    (22)

so

f_{T_1,...,T_K}(t_1, ..., t_K) = \lim_{\Delta t \to 0} \frac{1}{\Delta t^K}\, e^{-\lambda_{t_K}} \prod_{i=1}^{K} (\lambda_{t_i+\Delta t} - \lambda_{t_i})    (23)

Since \lambda_{t_i+\Delta t} - \lambda_{t_i} \approx \lambda_{t_i}' \Delta t,

= e^{-\lambda_{t_K}} \prod_{i=1}^{K} \lambda_{t_i}'    (24)

= e^{-\int_0^{t_K} \lambda_t'\,dt} \prod_{i=1}^{K} \lambda_{t_i}'    (25)
Taking logarithms of both sides, we have the following log-likelihood equation:

L_x = \sum_{i=1}^{K} \ln \lambda_{t_i}' - \int_0^{t_K} \lambda_t'\,dt    (26)

    = \int_0^{t_K} \ln \lambda_t'\,dN_t - \int_0^{t_K} \lambda_t'\,dt = \int_0^{t_K} \ln[n f(x) V(m)\,m\,t^{m-1}]\,dN_t - \int_0^{t_K} n f(x) V(m)\,m\,t^{m-1}\,dt    (27)

    = \int_0^{t_K} [\ln(n f(x)) + \ln V(m) + \ln m]\,dN_t + \int_0^{t_K} (m-1)\ln t\,dN_t - n f(x) V(m) \int_0^{t_K} m\,t^{m-1}\,dt    (28)

    = [\ln(n f(x)) + \ln V(m) + \ln m]\,N_{t_K} + (m-1) \int_0^{t_K} \ln t\,dN_t - n f(x) V(m) \int_0^{t_K} m\,t^{m-1}\,dt    (29)

    = [\ln(n f(x)) + \ln V(m) + \ln m]\,N_{t_K} + (m-1) \sum_{j=1}^{N_{t_K}} \ln T_j - n f(x) V(m) \int_0^{t_K} m\,t^{m-1}\,dt    (30)

where T_j is the distance from observation x to its jth nearest neighbor.
Let \theta = \ln f(x) and N(R,x) = N(t_K,x), and note that \int_0^{t_K} m\,t^{m-1}\,dt = t_K^m = R^m. Then

L_x = [\ln n + \theta + \ln V(m) + \ln m]\,N(R,x) + (m-1) \sum_{j=1}^{N(R,x)} \ln T_j - n\,V(m)\,e^{\theta} R^m    (31)

\frac{\partial L_x}{\partial \theta} = N(R,x) - n\,V(m)\,e^{\theta} R^m = 0    (32)

\frac{\partial L_x}{\partial m} = \left[ \frac{V'(m)}{V(m)} + \frac{1}{m} \right] N(R,x) + \sum_{j=1}^{N(R,x)} \ln T_j - n\,V'(m)\,e^{\theta} R^m - n\,V(m)\,e^{\theta} R^m \ln R = 0    (33)

From the previous two equations:

\frac{N(R,x)}{m} + \sum_{j=1}^{N(R,x)} \ln T_j - N(R,x) \ln R = 0    (34)

\frac{N(R,x)}{m} = \sum_{j=1}^{N(R,x)} \ln\frac{R}{T_j}    (35)

\hat{m}_x = \left[ \frac{1}{N(R,x)} \sum_{j=1}^{N(R,x)} \ln\frac{R}{T_j} \right]^{-1}    (36)
NN MLE
  Calculate the NN matrix
  For each observation x:
    \hat{m}_x = \left[ \frac{1}{K} \sum_{j=1}^{K} \ln\frac{T_K}{T_j} \right]^{-1}
  \hat{m} = \frac{1}{n} \sum_{x} \hat{m}_x
Figure 7
Pseudo code for the NN MLE program.
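For reference, a minimal C++ sketch of the estimator in Figure 7 follows. It assumes nn is the sorted nearest-neighbor matrix of Section 2.1, so that nn[x][k] is the distance from observation x to its kth NN, and, like the appendix listing int_dim_mle.h, it sums the K-1 nonzero log ratios; the function name is illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

// Standard NN MLE (Figure 7): a point estimate m_x for each observation,
// then the arithmetic mean over all observations.
double nn_mle(const std::vector<std::vector<double> >& nn, int K)
{
    double total = 0.0;
    for (std::size_t x = 0; x < nn.size(); ++x) {
        double s = 0.0;                          // sum of ln(T_K / T_j)
        for (int j = 1; j < K; ++j)
            s += std::log(nn[x][K] / nn[x][j]);
        total += (K - 1) / s;                    // m_x, equation (36) with R = T_K
    }
    return total / nn.size();                    // average of the m_x
}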
2.6 Revised Nearest-Neighbor Maximum Likelihood Estimator
MacKay and Ghahramani noted in 2005 [3] that the standard NN MLE is biased for low values of K. As shown in Figure 7, the standard NN MLE is calculated by averaging the point estimates over all observations. MacKay and Ghahramani argued that the likelihood equation for the entire dataset results in an estimator identical to that of Levina and Bickel, except that the final estimate must be obtained by averaging the inverses of the point estimates over all observations and then inverting the result. As our simulations show, the revised NN MLE does not appear to be biased at small values of K and is a much improved estimator.
NN MLE (Revised)
  Calculate the NN matrix
  For each observation x:
    \hat{m}_x = \left[ \frac{1}{K} \sum_{j=1}^{K} \ln\frac{T_K}{T_j} \right]^{-1}
  \hat{m} = \left[ \frac{1}{n} \sum_{x} \frac{1}{\hat{m}_x} \right]^{-1}
Figure 8
Pseudo code for the revised NN MLE program.
■
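The only change from Figure 7 is the final aggregation step: the revised estimator averages the inverses 1/\hat{m}_x over all observations and then inverts the result. A minimal sketch under the same assumptions as the previous one:

#include <cmath>
#include <cstddef>
#include <vector>

// Revised NN MLE (MacKay & Ghahramani, Figure 8): average 1/m_x over all
// observations and invert the average.
double nn_mle_revised(const std::vector<std::vector<double> >& nn, int K)
{
    double inv_sum = 0.0;
    for (std::size_t x = 0; x < nn.size(); ++x) {
        double s = 0.0;                          // sum of ln(T_K / T_j)
        for (int j = 1; j < K; ++j)
            s += std::log(nn[x][K] / nn[x][j]);
        inv_sum += s / (K - 1);                  // 1 / m_x
    }
    return nn.size() / inv_sum;                  // [ (1/n) sum 1/m_x ]^{-1}
}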
3. Datasets
Several datasets have been developed to test the efficacy of intrinsic dimension estimators. These
datasets range from the practical, such as human faces, to mathematically generated random datasets
such as the Swiss roll. Several common datasets are listed below.
3.1 Gaussian Sphere
The Gaussian sphere is a three-dimensional dataset whose intrinsic dimension is also three. Due to this characteristic, the dataset is useful as a baseline test for intrinsic dimension estimators. Each observation is generated from three random numbers that are Gaussian distributed with mean zero and variance one.
3-D Gaussian Sphere w/ ID=3
Xi = (xi0, xi1, xi2)
xi0 – xi2 = Random Numbers: ~N(0,1) * 6.28
Figure 9
Pseudo code for generating Gaussian sphere datasets.
3.2 Swiss Roll
The Swiss Roll is a three-dimensional dataset with a non-linear, two-dimensional manifold. The Swiss Roll name is due to the shape of the object when viewed in three-dimensional space (see Figure 1). Each observation has the form (r cos(r), r sin(r), z), where r and z are random numbers.
Two main features make this dataset a popular test set for intrinsic dimension estimation: first, it is easily generated, and second, traditional methods of dimension estimation, such as PCA, fail on it.
Swiss Roll Procedure
Xi = (xi0, xi1, xi2)
r1 : Random Number: ~U(0,1) * 6.28
r2 : Random Number: ~N(0,1) * 6.28
xi0 = r1 * cos(r1)
xi1 = r1 * sin(r1)
xi2 = r2
Figure 10
Pseudo code for generating the Swiss Roll dataset.
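A minimal C++ sketch of this generator is shown below. It uses the standard <random> facilities rather than the Numerical Recipes routines of the appendix; following the appendix listing, the angle parameter is drawn uniformly and the height from a Gaussian, each scaled by roughly 2*pi, and the function name is illustrative.

#include <cmath>
#include <random>
#include <vector>

// Generate n observations on the Swiss Roll manifold: (r cos r, r sin r, z).
std::vector<std::vector<double> > make_swiss_roll(int n, unsigned seed = 0)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unif(0.0, 6.28);   // angle/radius parameter r
    std::normal_distribution<double> norm(0.0, 1.0);          // height z (scaled below)
    std::vector<std::vector<double> > data(n, std::vector<double>(3));
    for (int i = 0; i < n; ++i) {
        const double r = unif(gen);
        const double z = 6.28 * norm(gen);
        data[i][0] = r * std::cos(r);
        data[i][1] = r * std::sin(r);
        data[i][2] = z;
    }
    return data;
}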
Figure 11
2000 observations plotted on the Swiss Roll manifold.
3.3 Double Swiss Roll
Two nested Swiss Rolls make up the Double Swiss Roll dataset. Each roll has intrinsic dimension two; however, where the two Swiss Roll manifolds meet near the center of the Double Swiss Roll, the intrinsic dimension appears to be greater than two due to the high density of observations in that region.
Figure 12
2000 observations plotted on the double Swiss Roll manifold.
3.4 Artificial Face
This face database includes many pictures of a single artificial face under various lighting conditions and in various horizontal and vertical orientations. Due to the three changing conditions, the intrinsic dimensionality should be three, but since the pictures are two-dimensional projections, we do not know the exact intrinsic dimension. Levina and Bickel propose that the intrinsic dimension of this dataset is approximately four [2].
Figure 13
Images of an artificial face under various lighting conditions and various poses. [2]
3.5 25-Dimensional Gaussian Sphere (ID = 15)
25-D Gaussian Sphere w/ ID=15
Xi = (xi0, xi1, ..., xi24)
xi0 – xi14 : Random Numbers: ~N(0,1) * 10
xi15 – xi24 : xij = sin(xi(j-15)) for j = 15, ..., 24
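A C++ sketch of this construction, under the same conventions as the previous sketches (names are illustrative, and <random> replaces the Numerical Recipes generators):

#include <cmath>
#include <random>
#include <vector>

// 25-dimensional dataset with intrinsic dimension 15: the first 15 coordinates
// are independent scaled Gaussians, the last 10 are deterministic sin() images of them.
std::vector<std::vector<double> > make_gauss25(int n, unsigned seed = 0)
{
    std::mt19937 gen(seed);
    std::normal_distribution<double> norm(0.0, 1.0);
    std::vector<std::vector<double> > data(n, std::vector<double>(25));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < 15; ++j) data[i][j] = 10.0 * norm(gen);
        for (int j = 15; j < 25; ++j) data[i][j] = std::sin(data[i][j - 15]);
    }
    return data;
}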
4. Results
Previous studies have shown that nearest-neighbor estimators produce accurate results when tested on simulated datasets with moderate to low intrinsic dimension. Levina & Bickel noted that the estimators depend on K, the number of nearest neighbors used in the estimate. MacKay and Ghahramani stated that the choice of K was less of an issue with their revised NN MLE. Our preliminary simulations have shown that a dependence on K exists even in the revised NN MLE. To further investigate the characteristics of the estimators with respect to K, our simulations report the estimates for a wide range of K. Levina & Bickel have reported simulation results that indicate the estimators are less accurate at high dimensions due to underestimation. The 25-dimensional Gaussian sphere will be used to test the high intrinsic dimension characteristics of the estimators. Finally, we would like to know what effect the distribution type has on estimation, so we will test the estimators on a Gaussian sphere and a three-dimensional cube of similar size.
4.1 Accuracy
Input data: three-dimensional Gaussian sphere, 2000 observations, intrinsic dimension 3.
[Plot: intrinsic dimension estimate (0 to 6) versus number of neighbors K (1 to 1000, log scale) for NN MLE, Rev NN MLE, and NN REG.]
Figure 14
Three-dimensional Gaussian Sphere. We expect the intrinsic dimension estimate to be three for the three-dimensional Gaussian sphere. This graph shows that the MacKay & Ghahramani estimator (revised NN MLE) and the Jain et al. estimator (NN REG) both provide accurate estimates for a wide range of K values.
Input data: three-dimensional Swiss Roll, 2000 observations, intrinsic dimension 2.
[Plot: intrinsic dimension estimate (0 to 4) versus number of neighbors K (1 to 1000, log scale) for NN MLE, Rev NN MLE, and NN REG.]
Figure 15
Swiss Roll. The Swiss Roll has an intrinsic dimension of two, which is correctly estimated by the revised NN MLE and NN REG estimators.
Input data: three-dimensional Double Swiss Roll, 2000 observations, intrinsic dimension 2.
[Plot: intrinsic dimension estimate (0 to 4) versus number of neighbors K (1 to 1000, log scale) for NN MLE, Rev NN MLE, and NN REG.]
Figure 16
Double Swiss Roll. The estimates of intrinsic dimension for the Double Swiss Roll are not accurate for values of K greater than 100 due to the high concentration of observations near the center of the dataset.
Input data: artificial face images, 128 data points (128 images).
[Plot: intrinsic dimension estimate (0 to 10) versus number of neighbors K (1 to 100, log scale) for NN MLE, Rev NN MLE, and NN REG.]
Figure 17
Artificial Face. The estimate of 3.5 is expected for the artificial face data, since the dataset is a single 3-D face projected onto a 2-D plane under varying lighting conditions and varying horizontal and vertical orientation.
Both the MacKay & Ghahramani estimator and the Jain et al. estimator perform exceptionally well on these four datasets. The best estimates occur when K, the number of nearest neighbors used in the estimate, is small relative to the total number of observations. This is reasonable, since we assume that the density near each observation is constant, which is a good assumption when K is small. As expected, the Levina & Bickel estimator is not useful when K is small [3].
4.2 Dependence on Number of Neighbors
The artificial face and double Swiss Roll results show that the estimators may be heavily dependent on
the number of nearest neighbors used in the estimate. This dependence on K is exaggerated when the
intrinsic dimension is large and N is moderate, as shown in Figure 19.
4.3 Dependence on Distribution Type
Input data: three-dimensional Gaussian sphere / uniform cube, 1000 observations, intrinsic dimension 3.
[Plot: intrinsic dimension estimate (0 to 6) versus number of neighbors K (1 to 1000, log scale) for the Gaussian-distributed and uniformly distributed datasets.]
Figure 18
Distribution Comparison. This graph shows that the estimators are not highly dependent on the distribution type in the case of three dimensions and 1000 observations.
Figure 18 shows similar estimates for a Gaussian-distributed sphere and a uniformly distributed cube in three-dimensional space. There is a slight overestimate for the Gaussian-distributed sphere and a slight underestimate for the uniformly distributed cube; however, each estimate would round to the same integer.
4.4 Effectiveness on Datasets with High Intrinsic Dimension
Input data: 25-dimensional Gaussian dataset, intrinsic dimension 15; 250, 500, 1000, and 2000 observations.
[Plot: intrinsic dimension estimate (10 to 15) versus number of neighbors K (0 to 20) for N = 250, 500, 1000, and 2000.]
Figure 19
25-dimensional dataset with 250, 500, 1000, and 2000 observations. As N increases, the estimators become less dependent on K.
Previous studies have shown that nearest-neighbor estimators perform poorly at high intrinsic dimensions, although the dependence on K at high intrinsic dimensions has not been fully explored [2]. Each curve in Figure 19 represents the average of the MacKay & Ghahramani estimates when the number of observations in the simulated dataset is set at 250, 500, 1000, or 2000. Our results agree with the findings of Levina and Bickel, which show an estimate of approximately 12-13 when the true intrinsic dimension is 15. We also found that at high intrinsic dimensions the estimate becomes highly dependent on K. For example, when N is equal to 1000, the intrinsic dimension estimate is 13.3 for K equal to 2 but quickly falls off to 12.1 when K is increased to 20. Increasing N partially alleviates this problem; however, the estimate is still better when K is small.
4.5 Summary
Our simulations have shown that nearest-neighbor intrinsic dimension estimators are effective on datasets with non-linear manifolds and intrinsic dimensions less than ten. The results of the artificial face simulations are encouraging for biometric applications such as facial and iris recognition. In each of our simulations, the best estimates occur when the number of nearest neighbors is small. In general, K less than ten appears to be the most accurate, although the exact choice of K is dictated by the specific dataset. When the intrinsic dimension is greater than 15, the estimators begin to underestimate the true intrinsic dimension, and the problem worsens as the intrinsic dimension increases.
References
[1] K. W. Pettis, T. A. Bailey, A. K. Jain, and R. C. Dubes. An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans. Patt. Anal. Machine Intell., vol. 1, pp. 25-37, 1979.
[2] E. Levina and P. J. Bickel. Maximum Likelihood Estimation of Intrinsic Dimension. Advances in NIPS 17, 2005.
[3] D. J. C. MacKay and Z. Ghahramani. Comments on 'Maximum Likelihood Estimation of Intrinsic Dimension' by E. Levina and P. Bickel. Available online: http://www.inference.phy.cam.ac.uk/mackay/dimension/, 2005.
[4] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[5] D. L. Snyder. Random Point Processes. Wiley, New York, 1975.
[6] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[7] W. H. Press et al. Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press, New York, 2002.
A. Appendix
A.1 Program Overview
•  Create/Import dataset
•  Start loop
   o  Set K (the number of NN to use in the estimate)
   o  Execute the Levina & Bickel MLE
   o  Execute the Jain regression estimator
•  End loop
•  Output results in .csv table format (dimension estimate vs. number of NN)
A.2 Code
int_dim.cpp
/* Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */
/*
Research Project
Justin Eberhardt
Project: Intrinsic Dimension Estimation
Advisor: Dr. Kang James
*/
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <iomanip>
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <ctime>
#include <cstdlib>
#include <vector>
#include "random_gen.h"
#include "gauss_gen.h"
#include "int_dim_mle.h"
#include "int_dim_reg.h"
#include "int_dim_rev.h"
#include "timer.h"
using namespace std;
float ran1(int &idum);
float gasdev(int &idum);
void int_dim_mle(int &n, int &p, float &levina_set_estimate, float &mackay_set_estimate, vector<vector<float> >
&nearest_neighbor, int &neighbor_k);
float int_dim_reg(int &n, int &p, vector<vector<float> > &nearest_neighbor , int &neighbor_k);
float int_dim_rev(int &n, int &p, vector<vector<float> > &nearest_neighbor);
int main(void)
{
/* VARIABLES */
int p; // high-dimension (of original input_data)
float m; // low-dimension (trying to find)
int n; // number of observations (data points)
int idum;
int sim_m;
int dist_type;
int function;
float r_a; float r_b; float r_c; float r_noise;
int neighbor_k; // k (notation from paper)
float levina_set_estimate;
float mackay_set_estimate;
int trials;
float d_denominator; // d_i denominator estimator from paper
float d_this;
/* INITIALIZATION */
m = 0;
n = 0;
sim_m = 0;
// open files
ofstream outfile;
outfile.open("output.csv");
outfile << "Neighbors, Levina & Bickel, MacKay & Ghahramani, Jain et. al." << endl;
ofstream indatafile;
indatafile.open("input_data.csv");
/* DATA GENERATION */
int which_input;
which_input = 0;
//prompt user for input_data
cout << endl << "Please select a dataset: " << endl << " 1) From File (input_data.txt)" << endl << " 2) Simulated Data" <<
endl << " 3) 1D Swiss Roll" << endl << " 4) 2D Swiss Roll" << endl << " 5) 2D Double Swiss Roll" << endl <<
" 6) 2D Swiss Roll with Noise" << endl;
which_input = 2;
cout << "Please enter 1 - 10: ";
cin >> which_input;
cout << endl << "Number of Observations: ";
cin >> n;
cout << endl << "Number of Parameters: ";
cin >> p;
cout << endl << "Simulated Int Dim: ";
cin >> sim_m;
cout << endl;
//initialize the input_data vector
vector<float> rows(p);
vector<vector<float> > input_data(n, rows);
/* USE input_data.txt */
if(which_input == 1) {
cout << "How many dimensions does the data in 'input_data.txt' contain? " << endl << "Please enter an integer: ";
cin >> p;
cout << endl;
float data_point;
string row; // one observation with 'p' dimensions
ifstream myfile ("input_data.txt");
if ( myfile.is_open() ) {
int current_row;
current_row = 0;
int current_column;
current_column = 0;
// If the last set of observations does not contain a full set of dimensions, the program will fill those spots with the last known entry
while (! myfile.eof() ) {
while(current_column < p) {
myfile >> data_point;
input_data[current_row][current_column] = data_point;
current_column++;
}
current_column = 0;
current_row++;
}
myfile.close();
n = current_row; // number of observations
}
else { cout << "Unable to open file"; }
}
/* GENERATE A UNIFORM OR GAUSSIAN DISTRIBUTED DATASET */
if(which_input == 2) {
cout << "(1) Gaussian, (2) Uniform: ";
cin >> dist_type;
cout << endl;
cout << "Function (1) sin(x), (2) ln(x^2), (3) x^(1/3): ";
cin >> function;
cout << endl;
idum = time(0);
for(int i=0; i < n; i++) {
for(int j=0; j < p; j++) {
if(j < sim_m) {
if(dist_type == 1) { input_data[i][j] = gasdev(idum); }
if(dist_type == 2) { input_data[i][j] = ran1(idum); }
}
else {
if(function == 1) { input_data[i][j] = sin( input_data[i][j-sim_m]); }
if(function == 2) { input_data[i][j] = log( pow(input_data[i][j-sim_m],2) ); }
if(function == 3) { input_data[i][j] = pow(input_data[i][j-sim_m],(1.0/3.0)); }
}
}
}
}
/* GENERATE A 3-D DATASET W/ ID=1 */
if(which_input == 3) {
if(p !=3 ) cout << "ERROR Parameter # does not Match" ;
idum = time(0);
for(int i=0; i < n; i++) {
r_a = ran1(idum) * 6.14159;
input_data[i][0] = r_a * cos(r_a);
input_data[i][1] = r_a * sin(r_a);
input_data[i][2] = r_a;
}
}
/* GENERATE A 3-D DATASET W/ ID=2 */
if(which_input == 4) {
if(p !=3 ) cout << "ERROR Parameter # does not Match" ;
idum = time(0);
for(int i=0; i < n; i++) {
r_a = ran1(idum) * 6.14159;
r_b = gasdev(idum) * 6.14159;
input_data[i][0] = r_a * cos(r_a);
input_data[i][1] = r_a * sin(r_a);
input_data[i][2] = r_b;
}
}
/* GENERATE A 3-D DATASET W/ ID=2 (DOUBLE SWISS ROLL) */
if(which_input == 5) {
if(p !=3 ) cout << "ERROR Parameter # does not Match" ;
idum = time(0);
for(int i=0; i < n; i++) {
r_a = ran1(idum) * 6.14159;
r_b = ran1(idum) * 6.14159;
if( i % 2 == 0 ) {
input_data[i][0] = r_a * cos(r_a);
input_data[i][1] = r_a * sin(r_a);
}
else {
input_data[i][0] = r_a * .5 * cos(r_a);
input_data[i][1] = r_a * .5 * sin(r_a);
}
input_data[i][2] = r_b;
}
}
/* GENERATE A 3-D DATASET W/ ID=2 & NOISE */
if(which_input == 6) {
if(p !=3 ) cout << "ERROR Parameter # does not Match" ;
float r_a; float r_b;
idum = time(0);
for(int i=0; i < n; i++) {
r_a = ran1(idum) * 6.14159;
r_b = gasdev(idum) * 6.14159;
r_noise = gasdev(idum) * .25;
input_data[i][0] = r_a * cos(r_a) + r_noise;
input_data[i][1] = r_a * sin(r_a) + r_noise;
input_data[i][2] = r_b;
}
}
/* SHOW DATA */
char show_data;
cout << "Would you like to view the dataset? y/n: " << endl;
cin >> show_data;
cout << endl;
if(show_data == 'y') {
cout << endl;
for(int i = 0; i < n; i++) {
for(int j = 0; j < p; j++) {
cout << setw(10) << input_data[i][j];
}
cout << endl;
}
}
/* RECORD INPUT DATA IN input_data.csv */
show_data = 'n';
cout << "Would you like to store the dataset to input_data.csv? y/n: " << endl;
cin >> show_data;
cout << endl;
if(show_data == 'y') {
indatafile << endl;
for(int i = 0; i < n; i++) {
for(int j = 0; j < p; j++) {
indatafile << setw(10) << input_data[i][j] << ", ";
}
indatafile << endl;
}
}
//The data to be processed is now stored in input_data
/* NEAREST NEIGHBOR MATRIX */
vector< vector<float> > nearest_neighbor(n, vector<float>(n+1,0)); // matrix size: n x n
float sum;
float sum_dist; //sum of distances to all NN
float sort_vector[n];
/* for loop
Description: Produces an n x n matrix that contains the distance between each pair of observations in the dataset.
Output: nearest_neighbor (n x n+1 matrix) & nearest_neighbor[i][n] = sum of all distances
*/
for(int i = 0; i < n; i++) {
sum_dist = 0;
for(int j = 0; j < n; j++) {
// Distances
sum = 0;
for(int k = 0; k < p; k++) {
sum += pow(float(input_data[i][k] - input_data[j][k]), 2);
}
nearest_neighbor[i][j] = sqrt(sum);
// End Distances
// Start Sum of Distances
sum_dist += nearest_neighbor[i][j];
}
nearest_neighbor[i][n] = sum_dist;
}
/*
Description: In the following 'for' loop, the nearest_neighbor matrix is sorted. Each row corresponds to an observation, and
each row is a sorted vector of distances to each neighbor.
Output: nearest_neighbor (n x n sorted matrix)
*/
for(int i = 0; i < n; i++) {
for(int j = 0; j < n; j++) {
sort_vector[j] = nearest_neighbor[i][j];
}
sort(sort_vector, ( sort_vector + n ));
for(int j = 0; j < n; j++) {
nearest_neighbor[i][j] = sort_vector[j];
}
}
/*
Run the Estimators
*/
for( neighbor_k = n; neighbor_k > 1; neighbor_k--) {
if( neighbor_k > 40 ) {
neighbor_k = neighbor_k - 20;
}
/* LEVINA & BICKEL MLE */
int_dim_mle(n, p, levina_set_estimate, mackay_set_estimate, nearest_neighbor, neighbor_k);
// output results to the outfile
cout << " " << neighbor_k - 1;
outfile << neighbor_k - 1 << "," << levina_set_estimate << "," << mackay_set_estimate << ",";
/*
JAIN ESTIMATOR */
d_this = int_dim_reg(n, p, nearest_neighbor, neighbor_k);
// output results to the outfile
outfile << d_this << "," << endl;
}
cout << endl << "Estimates are in: output.csv" << endl;
// close the files
outfile.close();
indatafile.close();
}
int_dim_reg.h
/* Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */
#include <cmath>
#include <math.h>
#include <algorithm>
using namespace std;
float int_dim_reg(int &n, int &p, vector< vector<float> >& nearest_neighbor, int &neighbor_k) {
int count;
float k_float;
float epsilon = 0.01; // (same notation as paper)
int maxiter = 10; // iteration maximum
float mmax; // notation from paper
float s2max; // notation from paper
float log_t_hat[neighbor_k + 1]; // sample-averaged log distance to the kth NN, indexed 1..neighbor_k
int n_mod;
float log_g[neighbor_k + 1]; // Taylor-series correction log(G_k,m), indexed 1..neighbor_k
float d_this; // estimated dimension at THIS iteration
float d_previous; // estimated dimension at the PREVIOUS iteration
float sum1;
float sum2;
float sum_log_k;
float sum;
float d_denominator;
sum_log_k = 0;
// use "neighbor_k" variable from above for k (notation from paper)
/* following two for loops
Description: Calculates values required to remove outliers
Output: mmax, s2max
*/
sum = 0;
for(int j = 0; j < n; j++) {
sum += nearest_neighbor[j][neighbor_k];
}
mmax = ( 1 / float( n ) ) * sum;
sum = 0;
for(int j = 0; j < n; j++) {
sum += pow( ( nearest_neighbor[j][neighbor_k] - mmax ) , 2);
}
s2max = (1 / (float( n ) - 1) ) * sum;
/* for loop
Description: Remove outlier and calculate the sample-average distance to kth NN
Output: log_t_hat[k] (sample-averaged distance to kth NN)
*/
for(int k = 1; k <= neighbor_k; k++) {
sum = 0;
n_mod = n;
for(int j = 0; j < n; j++) {
if( nearest_neighbor[j][neighbor_k] <= (mmax + sqrt(s2max) ) ) {
sum += nearest_neighbor[j][k];
}
else { n_mod--; }
}
log_t_hat[k] = log( (1/float( n_mod ) ) * sum );
}
/*
Description: Find the initial estimate for the intrinsic dimension of the dataset.
Output: d_this (contains the initial estimate of ID)
*/
d_previous = 0;
d_this = 0;
for(int k = 0; k <= neighbor_k; k++) {
log_g[k] = 0; // initialize log_g to zero
}
sum1 = 0;
sum2 = 0;
d_denominator = 0;
for(int k = 1; k <= neighbor_k; k++) {
sum_log_k += log( float(k) );
d_denominator += pow( log( float(k) ) , 2);
sum1 += log( float(k) ) * log_t_hat[k];
sum2 += log_t_hat[k];
}
d_denominator = float( neighbor_k ) * d_denominator - pow( sum_log_k, 2 );
d_this = d_denominator / ( float( neighbor_k ) * sum1 - ( sum_log_k * sum2 ) );
/* while loop
Description: Iterate to better estimate ID. At each iteration, log_g[k] is calculated based on previous dimension estimate.
Output: d_this (final ID estimate)
*/
count = 0;
while( d_this - d_previous > epsilon && count < maxiter ) {
d_previous = d_this;
count ++;
for(int k = 1; k <= neighbor_k; k++) {
k_float = float( k );
log_g[k] = ( ( d_previous - 1 ) / ( 2 * k_float * pow(d_previous, 2) ) )
+ ( ( d_previous - 1 ) * ( d_previous - 2 ) / ( 12 * pow(k_float, 2) * pow(d_previous, 3) ) )
- ( pow( ( d_previous - 1 ), 2 )/ ( 12 * pow(k_float, 3) * pow(d_previous, 4) ) )
- ( (d_previous - 1)*(d_previous - 2)*( pow(d_previous, 2) + 3*d_previous - 3 ) /
( 120 * pow(k_float, 4) * pow(d_previous, 5) ) );
}
sum1=0;
sum2=0;
sum_log_k=0;
d_denominator=0;
for(int k = 1; k <= neighbor_k; k++) {
sum_log_k += log( float(k) );
d_denominator += pow( log( float(k) ) , 2);
sum1 += log( float(k) ) * (log_t_hat[k] + log_g[k]);
sum2 += (log_t_hat[k] + log_g[k]);
}
d_denominator = float( neighbor_k ) * d_denominator - pow( sum_log_k, 2 );
d_this = d_denominator / ( float( neighbor_k ) * sum1 - ( sum_log_k * sum2 ) );
}
return d_this;
}
int_dim_mle.h
/* Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */
#include <cmath>
using namespace std;
void int_dim_mle(int &n, int &p, float &levina_set_estimate, float &mackay_set_estimate, vector< vector<float> >&
nearest_neighbor, int &neighbor_k) {
float mackay_point_estimate;
float levina_point_estimate;
float sum; // running total of a SUM(...)
int x; //observation #
//initializing variables for levina estimate
levina_set_estimate = 0;
mackay_set_estimate = 0;
/* for loop
Description: Provides a point estimate of intrinsic dimension for each observation in the dataset.
Number of Iterations: n
Output: levina_set_estimate (sum of all levina point estimates)
Output: mackay_set_estimate (sum of all mackay point estimates)
*/
for(x = 0; x < n; x++ ) {
levina_point_estimate = 0;
sum = 0;
for(int i = 1; i < neighbor_k; i++) {
if(nearest_neighbor[x][neighbor_k] != 0 && nearest_neighbor[x][i] != 0 ) {
sum += log( nearest_neighbor[x][neighbor_k] / nearest_neighbor[x][i] );
}
}
levina_point_estimate = (neighbor_k - 1) * 1/sum;
mackay_point_estimate = sum;
levina_set_estimate += levina_point_estimate;
mackay_set_estimate += mackay_point_estimate;
}
levina_set_estimate = levina_set_estimate / n;
mackay_set_estimate = mackay_set_estimate / (n * (neighbor_k - 1) );
mackay_set_estimate = 1 / mackay_set_estimate;
}
random_gen.h
/* Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */
/* from "Numerical Recipes in C++ Second Edition"
Press, Teukolsky, Vetterling, Flannery
ISBN 0-521-75033-4
[7]
*/
float ran1(int &idum) {
const int IA=16807,IM=2147483647, IQ=127773, IR=2836, NTAB=32;
const int NDIV = (1+(IM-1)/NTAB);
const double EPS=3.0e-16, AM = 1.0/IM, RNMX=(1.0-EPS);
static int iy=0;
static int iv[NTAB];
int j,k;
double temp;
if (idum <= 0 || !iy) {
if(-idum < 1) idum = 1;
else idum = -idum;
for( j=NTAB+7; j>=0; j--) {
k=idum/IQ;
idum = IA*(idum-k*IQ)-IR*k;
if (idum < 0) idum += IM;
if (j < NTAB) iv[j] = idum;
}
iy=iv[0];
}
k = idum/IQ;
idum = IA*(idum-k*IQ)-IR*k;
if (idum < 0) idum += IM;
j=iy/NDIV;
iy=iv[j];
iv[j] = idum;
if ((temp=AM*iy) > RNMX) return float( RNMX );
else return float( temp );
}
gauss_gen.h
/* Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */
/* from "Numerical Recipes in C++ Second Edition"
Press, Teukolsky, Vetterling, Flannery
ISBN 0-521-75033-4
[7]
*/
#include <cmath>
using namespace std;
float gasdev(int &idum) {
static int iset =0;
static double gset;
double fac, rsq, v1, v2;
if (idum <0) iset =0;
if (iset == 0) {
do {
v1 = 2.0*ran1(idum)-1.0;
v2 = 2.0*ran1(idum)-1.0;
rsq = v1*v1+v2*v2;
} while (rsq >= 1.0 || rsq == 0.0);
fac=sqrt(-2.0*log(rsq)/rsq);
gset = v1*fac;
iset=1;
return float( v2*fac );
} else {
iset = 0;
return float( gset );
}
}