Intrinsic Dimensionality Estimation

Intrinsic Dimension Estimation
Research Project
Graduate Student: Justin Eberhardt
Advisor: Dr. Kang James
University of Minnesota Duluth
Department of Mathematics and Statistics
Abstract
A variety of estimators have been proposed for estimating the intrinsic dimensionality of
a dataset. These estimators include principal component analysis (PCA),
multidimensional scaling (MDS), near neighborhood density estimation, and maximum
likelihood estimation (MLE). This project provides a comparison of the mathematical
and computational properties of each estimator and closely examines the MLE process
proposed by Levina and Bickel in 2004.
Introduction
High dimensionality can limit the usefulness of practical data, but such data can often be represented effectively in a low-dimensional space. The smallest dimension that describes a dataset without significant loss of its features is the intrinsic dimension of the data. It is therefore important to determine the intrinsic dimension in order to optimize dimension reduction. Several estimators have been proposed, including principal component analysis (PCA), multidimensional scaling (MDS), near neighborhood density estimation, and maximum likelihood estimation (MLE).
The following is an overview of these methods, with more detail given in the sections on near neighborhood density estimation and MLE.
PCA
Principal Component Analysis
Implementing PCA requires computing the covariance matrix of the input data and then the eigenvalues of that covariance matrix. The number of eigenvalues greater than a specified threshold value is taken as the intrinsic dimension.
The method does well in image compression applications, but it is computationally very expensive.
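A minimal sketch of this procedure, assuming the data are held in a NumPy array X with one observation per row and using an arbitrary illustrative threshold (the threshold choice is an assumption for illustration, not part of the method described above):

import numpy as np

def pca_intrinsic_dim(X, tau=0.05):
    # Count the covariance eigenvalues that exceed a fraction tau of the
    # total variance; that count is taken as the intrinsic dimension.
    Xc = X - X.mean(axis=0)            # center the data
    cov = np.cov(Xc, rowvar=False)     # covariance matrix of the input
    eigvals = np.linalg.eigvalsh(cov)  # eigenvalues of the covariance matrix
    return int(np.sum(eigvals > tau * eigvals.sum()))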
MDS
Multidimensional Scaling
Data containing more than three dimensions cannot be visualized in Cartesian space;
however, it would often be useful to “visualize” such data. MDS seeks to reduce
dimensionality by preserving Euclidean distances between data points. Given certain
data, the data can be “flattened” using a steepest descent algorithm without significant
loss of information.
This method ensures that the data is flattened in a way that preserves distances as well as possible; however, it requires computationally expensive iteration. The quality of the flattening is measured by the stress of the embedding: stress compares the distances in the original dataset to the corresponding distances in the flattened dataset. Low stress implies that the data in the reduced set is similar to the data in the original set.
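A minimal sketch of how stress can guide the choice of dimension, using scikit-learn's metric MDS (the library choice and the loop over candidate dimensions are illustrative assumptions, not part of the project):

from sklearn.manifold import MDS

def mds_stress_profile(X, max_dim=5):
    # Embed X into 1..max_dim dimensions and record the final stress of each
    # embedding; the dimension at which stress stops dropping sharply is a
    # reasonable estimate of the intrinsic dimension.
    stresses = {}
    for m in range(1, max_dim + 1):
        mds = MDS(n_components=m, dissimilarity="euclidean", random_state=0)
        mds.fit(X)
        stresses[m] = mds.stress_   # raw stress of the m-dimensional embedding
    return stresses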
Near Neighborhood Method
Pettis, Bailey, Jain, & Dubes (1979)
The estimator is derived from the following density function (parts of the derivation below are taken from the paper by Pettis, Bailey, Jain & Dubes (1979), An Intrinsic Dimensionality Estimator from Near-Neighbor Information):

x : an observation in the input dataset, a high-dimensional dataset (HDD)
L : dimension of the HDD
R_k : radius of a hypersphere in L-dimensional space
k : the number of observations in the dataset within a distance R_k of observation x
r_{k,x} : the distance from x to its k-th nearest neighbor
d : the intrinsic dimensionality of the dataset
Let:

f_{k,x}(r) = \frac{c\, d\, r^{d-1}\, (c\, r^{d})^{k-1}\, e^{-c\, r^{d}}}{\Gamma(k)} \quad \text{if } r \ge 0, \qquad f_{k,x}(r) = 0 \text{ elsewhere,}

where c = n\, p(x)\, V_d and V_d is the volume of the unit hypersphere in d dimensions. Under the approximation that the number of observations within a distance r of x is Poisson with mean c\, r^{d}, f_{k,x}(r) is the probability density of r_{k,x}, the distance from x to its k-th nearest neighbor.
Find the expected value of r_{k,x} by integrating over all values of r:

E(r_{k,x}) = \int_0^{\infty} r\, f_{k,x}(r)\, dr = \frac{1}{k^{1/d}} \cdot \frac{\Gamma\!\left(k + \tfrac{1}{d}\right)}{\Gamma(k)} \cdot \left[ \frac{k}{n\, p(x)\, V_d} \right]^{1/d}
Now, define a sample-averaged distance from the observations X_i to their k-th nearest neighbors:

\bar{r}_k = \frac{1}{n} \sum_{i=1}^{n} r_{k, X_i}
Combining the preceding two equations, the expected value of \bar{r}_k is:

E(\bar{r}_k) = \frac{1}{n} \sum_{i=1}^{n} E(r_{k, X_i}) = \frac{1}{G_{k,d}}\, k^{1/d}\, C_n

where

G_{k,d} = \frac{k^{1/d}\, \Gamma(k)}{\Gamma\!\left(k + \tfrac{1}{d}\right)}
\qquad \text{and} \qquad
C_n = \frac{1}{n} \sum_{i=1}^{n} \left[ n\, p(X_i)\, V_d \right]^{-1/d}
Now, taking the logarithm of E(\bar{r}_k) results in the following:

\log(G_{k,d}) + \log E(\bar{r}_k) = \frac{1}{d}\log(k) + \log(C_n)

Or,

\log E(\bar{r}_k) = \frac{1}{d}\log(k) + \log(C_n) - \log(G_{k,d})
The above equation has the form y = mx + b + \epsilon, where:

y = \log E(\bar{r}_k)
m = \frac{1}{d}
x = \log(k)
b = \log(C_n), which is independent of k
\epsilon = -\log(G_{k,d}), which is close to zero for all k and d

Thus, a plot of y = \log E(\bar{r}_k) vs. x = \log(k) should have a slope of m = \frac{1}{d}. Then, using the observed value of \bar{r}_k as an estimator for E(\bar{r}_k) results in:

\log(\bar{r}_k) \approx \frac{1}{d}\log(k) + \log(C_n) - \log(G_{k,d})

\bar{r}_k and \log(k) can be computed from the given dataset of observations, thus the inverse slope of the plot of \log(\bar{r}_k) vs. \log(k) is an estimate of the intrinsic dimensionality, d.
A slightly better estimate can be obtained by writing \log(G_{k,d}) as a Taylor series in terms of k and d and iterating.
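As a minimal computational sketch of the regression step just described (the scikit-learn neighbor search, the range k = 1, ..., k_max, and the least-squares fit are illustrative assumptions; the iterative Taylor-series refinement is omitted):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def near_neighbor_dim(X, k_max=10):
    # Pettis-style estimate: regress log(r_bar_k) on log(k) for k = 1..k_max
    # and return the inverse of the fitted slope as the dimension estimate.
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(X)
    dists, _ = nn.kneighbors(X)          # column 0 is each point's distance to itself
    r_bar = dists[:, 1:].mean(axis=0)    # sample-averaged k-th neighbor distances
    ks = np.arange(1, k_max + 1)
    slope, _ = np.polyfit(np.log(ks), np.log(r_bar), 1)
    return 1.0 / slope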
MLE
Maximum Likelihood Estimation
Levina & Bickel (2004)
Similar to the estimator proposed by Pettis, Bailey, Jain & Dubes, MLE also makes use
of near neighborhood information. (Parts of the following derivation are taken from the paper by Levina & Bickel (2004), Maximum Likelihood Estimation of Intrinsic Dimension.)
Let:

X_1, X_2, X_3, \ldots, X_n be observations in a high-dimensional dataset (HDD) in R^p
Y_1, Y_2, Y_3, \ldots, Y_n be observations in a low-dimensional dataset (LDD) in R^m with density f
m \le p, \quad X_i = g(Y_i), \quad Y_i = g^{-1}(X_i) = h(X_i), \quad h = g^{-1}
Let:

N(t, x) = \sum_{i=1}^{n} 1\{X_i \in S_x(t)\}, \quad 0 \le t \le R

the number of observations falling within the sphere S_x(t) of radius t centered at x.

Note: the surface area of S_x(t) is \frac{d}{dt}\left[ V(m)\, t^{m} \right] = V(m)\, m\, t^{m-1}, so the counting process N(t, x) is approximated by a Poisson process with rate

\lambda(t) = f(x)\, V(m)\, m\, t^{m-1}
E(X_t) = \int_0^{t} \lambda(u)\, du \stackrel{\text{def}}{=} \mu(t)

X_t \sim \mathrm{POI}(\mu(t)), \qquad f_{X(t)}(x) = \frac{e^{-\mu(t)}\, (\mu(t))^{x}}{x!}

Let 0 < t_1 < t_2 < t_3 < \ldots < t_k \le t, with t_0 = 0, \mu(0) = 0, and \Delta t_i = t_i - t_{i-1}. Then:

f_{X(t_1) \ldots X(t_k)}(x_1, \ldots, x_k)
= \prod_{i=1}^{k} \frac{e^{-(\mu(t_i) - \mu(t_{i-1}))}\, \left( \mu(t_i) - \mu(t_{i-1}) \right)^{x_i - x_{i-1}}}{(x_i - x_{i-1})!}

= \frac{e^{-\mu(t)} \prod_{i=1}^{k} \left( \mu(t_i) - \mu(t_{i-1}) \right)^{x_i - x_{i-1}}}{\prod_{i=1}^{k} (x_i - x_{i-1})!}

\approx \frac{e^{-\mu(t)} \prod_{i=1}^{k} \left( \lambda(t_i)\, \Delta t_i \right)^{x_i - x_{i-1}}}{\prod_{i=1}^{k} (x_i - x_{i-1})!}
Let 0 < T_1 < T_2 < T_3 < \ldots < T_N \le t be the ordered distances from x at which observations occur (the arrival times of the process). Then:

f_{T_1 \ldots T_N}(t_1, \ldots, t_N)\, \Delta t_1 \cdots \Delta t_N
\approx e^{-\mu(t_1)}\, \lambda(t_1)\Delta t_1\, e^{-\lambda(t_1)\Delta t_1}
\cdot e^{-(\mu(t_2) - \mu(t_1))}\, \lambda(t_2)\Delta t_2\, e^{-\lambda(t_2)\Delta t_2}
\cdots e^{-(\mu(t_N) - \mu(t_{N-1}))}\, \lambda(t_N)\Delta t_N\, e^{-\lambda(t_N)\Delta t_N}
\cdot e^{-(\mu(t) - \mu(t_N))}

= e^{-\mu(t)} \prod_{i=1}^{N} \lambda(t_i)\, e^{-\lambda(t_i)\Delta t_i}\, \Delta t_i

Taking logarithms, with the observation radius t = R and letting \Delta t_i \to 0:

\ln(f) = -\mu(t) + \sum_{i=1}^{N} \left[ \ln \lambda(t_i) - \lambda(t_i)\Delta t_i \right]
\;\to\; -\int_0^{R} \lambda(u)\, du + \sum_{i=1}^{N} \ln(\lambda(t_i))
= -\int_0^{R} \lambda(u)\, du + \int_0^{R} \ln \lambda(u)\, dN(u)
Let:

L_x = \int_0^{R} \ln \lambda(t)\, dN(t) - \int_0^{R} \lambda(t)\, dt

= \int_0^{R} \ln\!\left( f(x)\, V(m)\, m\, t^{m-1} \right) dN(t) - \int_0^{R} f(x)\, V(m)\, m\, t^{m-1}\, dt

= \int_0^{R} \ln f(x)\, dN(t) + \int_0^{R} \left[ \ln V(m) + \ln\!\left( m\, t^{m-1} \right) \right] dN(t) - f(x)\, V(m) \int_0^{R} m\, t^{m-1}\, dt
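Maximizing L_x over both m and f(x) leads, as shown in the cited Levina & Bickel paper, to the closed-form estimator

\hat{m}_k(x) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \ln \frac{T_k(x)}{T_j(x)} \right]^{-1}

where T_j(x) is the distance from x to its j-th nearest neighbor. A minimal computational sketch follows (the scikit-learn neighbor search and the simple averaging of the per-point estimates over the dataset are illustrative assumptions, not requirements of the method):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dim(X, k=10):
    # Levina-Bickel-style MLE: for each observation, invert the mean log
    # ratio of its k-th neighbor distance to its closer neighbor distances,
    # then average the per-point estimates over the dataset.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)      # column 0 is each point's distance to itself
    T = dists[:, 1:]                 # T[:, j-1] = distance to the j-th nearest neighbor
    log_ratios = np.log(T[:, -1][:, None] / T[:, :-1])   # ln(T_k / T_j), j = 1..k-1
    m_hat = 1.0 / log_ratios.mean(axis=1)                # per-point estimates
    return float(m_hat.mean())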
Datasets
Several datasets have been developed to test the efficacy of intrinsic dimension
estimators. These datasets range from the practical, such as human faces, to
mathematically generated random datasets such as the Swiss roll. Several common
datasets are listed below.
Swiss Roll
The dataset is three-dimensional, with each observation based on two random numbers. Observations have the form (phi*cos(phi), phi*sin(phi), z), where phi and z are random numbers. The dataset was given its name due to the shape of the object when viewed in three-dimensional space. Since the dataset can be "unrolled" onto a flat surface, the intrinsic dimension should be two.
Two main features make this dataset popular. First, it is easily generated, and second, traditional methods of dimension estimation, such as PCA, fail on it.
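A minimal sketch of how such a dataset might be generated (the ranges used for phi and z are illustrative assumptions):

import numpy as np

def swiss_roll(n=1000, seed=0):
    # Generate n three-dimensional observations of the form
    # (phi*cos(phi), phi*sin(phi), z) with random phi and z.
    rng = np.random.default_rng(seed)
    phi = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)   # angle along the roll
    z = rng.uniform(0.0, 10.0, n)                    # position along the roll's axis
    return np.column_stack((phi * np.cos(phi), phi * np.sin(phi), z))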
Artificial Face
This face database includes many pictures of a single artificial face under various lighting
and in various horizontal and vertical orientations. Due to the three changing conditions,
the intrinsic dimensionality should be three, but since the pictures are two-dimensional
projections, we do not know what the exact intrinsic dimension is. Levina and Bickel
propose that the intrinsic dimension of the dataset is approximately four. (citation)
Rotating Hand
Similar to the Artificial Face set, this dataset is a video sequence of a hand rotating
through a one-dimensional curve in space. (citation) Here, Levina and Bickel propose
the dimension to be approximately three (due to the number of “basis” images required).
Hand-Written Twos
This dataset is a collection of images of the number two written by many different people; some have large loops and some small, some have extended tops and some do not. This dataset has an intrinsic dimension of about two.
Maximum Likelihood Equations
Example: Flipping Coins
Assume that the probability of heads is 0.5 (p_H = 0.5).
The probability of getting two heads in sequence is P(HH | p_H = 0.5) = 0.25.
The likelihood that the probability of heads is 0.5, given that two heads are observed in sequence, is L(p_H = 0.5 | HH) = 0.25.
So, a likelihood function takes the observed values and determines how likely it is that the event has a certain underlying probability. When the likelihood is plotted as a function of that probability, the maximum value is referred to as the maximum likelihood, and the equation solved to find that maximum is a maximum likelihood equation.
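As a short worked version of this example (treating the probability of heads p_H as the unknown parameter and using only the two observed flips):

L(p_H \mid HH) = p_H^{2}, \qquad 0 \le p_H \le 1

This function is increasing on [0, 1], so it is maximized at \hat{p}_H = 1, where L(1 \mid HH) = 1; evaluating the same function at p_H = 0.5 recovers the value 0.25 quoted above.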
Results from Simulation