Intrinsic Dimension Estimation Research Project
Graduate Student: Justin Eberhardt
Advisor: Dr. Kang James
University of Minnesota Duluth, Department of Mathematics and Statistics

Abstract
A variety of estimators have been proposed for estimating the intrinsic dimensionality of a dataset. These estimators include principal component analysis (PCA), multidimensional scaling (MDS), near-neighbor density estimation, and maximum likelihood estimation (MLE). This project compares the mathematical and computational properties of each estimator and closely examines the MLE procedure proposed by Levina and Bickel in 2004.

Introduction
High dimensionality can limit the usefulness of practical data, but such data can often be represented effectively in a low-dimensional space. The smallest dimension that describes a dataset without significant loss of its features is the intrinsic dimension of the data. Determining the intrinsic dimension is therefore important for optimizing dimension reduction. Several estimators have been proposed, including principal component analysis (PCA), multidimensional scaling (MDS), near-neighbor density estimation, and maximum likelihood estimation (MLE). The following is an overview of these methods, with more detail given in the sections on the near-neighbor method and MLE.

PCA: Principal Component Analysis
Implementing PCA requires calculating the covariance matrix of the input data and then its eigenvalues. The number of eigenvalues greater than a specified threshold is taken as the intrinsic dimension. The method does well in image compression applications, but it is computationally very expensive. A short sketch of the eigenvalue-threshold rule follows.
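The sketch below illustrates the eigenvalue-threshold rule just described. It is not code from the project; the function name, the relative cutoff of 0.05, and the example data are assumptions made for illustration.

import numpy as np

def pca_intrinsic_dim(X, threshold=0.05):
    """Count covariance eigenvalues exceeding a cutoff, expressed here
    as a fraction of the largest eigenvalue (an assumed convention)."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)                      # covariance matrix of the input
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, largest first
    return int(np.sum(eigvals > threshold * eigvals[0]))

# Example: a two-dimensional plane embedded in R^5 with slight noise.
rng = np.random.default_rng(0)
Y = rng.normal(size=(500, 2))
B = np.linalg.qr(rng.normal(size=(5, 2)))[0]           # orthonormal basis for the plane
X = Y @ B.T + 0.01 * rng.normal(size=(500, 5))
print(pca_intrinsic_dim(X))                            # expected output: 2

Applied to a nonlinear manifold such as the Swiss roll (described in the Datasets section), the same threshold rule overestimates the dimension, which is one motivation for the near-neighbor and MLE approaches below.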
MDS: Multidimensional Scaling
Data containing more than three dimensions cannot be visualized in Cartesian space; however, it is often useful to "visualize" such data. MDS seeks to reduce dimensionality while preserving the Euclidean distances between data points. Given suitable data, the data can be "flattened" using a steepest-descent algorithm without significant loss of information. The method flattens the data in the way that best preserves distances, but it requires computationally expensive iteration. The quality of the flattening is measured by the stress of the configuration: stress compares the pairwise distances in the original dataset to the corresponding distances in the flattened dataset. Low stress implies that the reduced dataset is similar to the original dataset.

Near-Neighbor Method (Pettis, Bailey, Jain, and Dubes, 1979)
The estimator is derived from the density of near-neighbor distances given below. (Parts of the following derivation are taken from Pettis, Bailey, Jain, and Dubes (1979), "An Intrinsic Dimensionality Estimator from Near-Neighbor Information".)

Notation:
x : an observation in the input dataset, a high-dimensional dataset (HDD)
L : dimension of the HDD
R_k : radius of a hypersphere in L-dimensional space
k : the number of observations in the dataset within a distance R_k of observation x
r_{k,x} : the distance from x to its k-th nearest neighbor
d : the intrinsic dimensionality of the dataset

Let
f_{k,x}(r) = \frac{c \, d \, r^{d-1} (c r^d)^{k-1} e^{-c r^d}}{\Gamma(k)} for r \ge 0, and 0 elsewhere,
where c = n \, p(x) \, V_d and V_d is the volume of the unit sphere in d dimensions. Here f_{k,x}(r) is the probability density of r_{k,x}, the distance from x to its k-th nearest neighbor.

The expected value of r_{k,x} is found by integrating over all values of r:
E(r_{k,x}) = \int_0^{\infty} r \, f_{k,x}(r) \, dr = \frac{\Gamma(k + \frac{1}{d})}{\Gamma(k)} \cdot \frac{1}{[n \, p(x) \, V_d]^{1/d}}.

Now define a sample-averaged distance from each observation X_i to its k-th nearest neighbor:
\bar{r}_k = \frac{1}{n} \sum_{i=1}^{n} r_{k, X_i}.

Combining the preceding two equations, the expected value of \bar{r}_k is
E(\bar{r}_k) = \frac{1}{n} \sum_{i=1}^{n} E(r_{k, X_i}) = \frac{1}{G_{k,d}} \, k^{1/d} \, C_n,
where
G_{k,d} = \frac{k^{1/d} \, \Gamma(k)}{\Gamma(k + \frac{1}{d})}
and
C_n = \frac{1}{n} \sum_{i=1}^{n} [n \, p(X_i) \, V_d]^{-1/d}.

Taking the logarithm of E(\bar{r}_k) results in
\log(G_{k,d}) + \log E(\bar{r}_k) = \frac{1}{d} \log(k) + \log(C_n),
or
\log E(\bar{r}_k) = \frac{1}{d} \log(k) + \log(C_n) - \log(G_{k,d}).

This equation has the form y = mx + b, where
y = \log E(\bar{r}_k),
m = 1/d,
x = \log(k),
b = \log(C_n), which is independent of k,
and \log(G_{k,d}) is close to zero for all k and d \ge 1.

Thus a plot of y = \log E(\bar{r}_k) versus x = \log(k) should have slope m = 1/d. Using the observed value of \bar{r}_k as an estimator for E(\bar{r}_k) gives
\log(\bar{r}_k) \approx \frac{1}{d} \log(k) + \log(C_n) - \log(G_{k,d}).
Both \bar{r}_k and \log(k) are known from the given dataset of observations, so the inverse of the slope of the plot of \log(\bar{r}_k) versus \log(k) is an estimate of the intrinsic dimensionality d. A slightly better estimate can be obtained by writing \log(G_{k,d}) as a Taylor series in terms of k and d and iterating.

MLE: Maximum Likelihood Estimation (Levina and Bickel, 2004)
Like the estimator proposed by Pettis, Bailey, Jain, and Dubes, the MLE approach also makes use of near-neighbor information. (Parts of the following derivation are taken from Levina and Bickel (2004), "Maximum Likelihood Estimation of Intrinsic Dimension".)

Let X_1, X_2, ..., X_n be observations in a high-dimensional dataset (HDD) in R^p, and let Y_1, Y_2, ..., Y_n be observations in a low-dimensional dataset (LDD) in R^m with density f, where m \le p, X_i = g(Y_i), Y_i = g^{-1}(X_i) = h(X_i), and h = g^{-1}.

Let
N(t, x) = \sum_{i=1}^{n} 1\{X_i \in S_x(t)\}, \quad 0 \le t \le R,
count the observations within distance t of x, where S_x(t) is the sphere of radius t centered at x.

Since \frac{d}{dt}[V(m) t^m] = V(m) m t^{m-1}, the surface area of S_x(t) is V(m) m t^{m-1}, where V(m) is the volume of the unit sphere in R^m. The counting process N(t, x) is approximated by an inhomogeneous Poisson process with rate
\lambda(t) = f(x) V(m) m t^{m-1}.

Define
\mu(t) \stackrel{\mathrm{def}}{=} E[X(t)] = \int_0^{t} \lambda(u) \, du,
so that the count X(t) = N(t, x) satisfies X(t) \sim \mathrm{Poisson}(\mu(t)):
f_{X(t)}(x) = \frac{e^{-\mu(t)} (\mu(t))^{x}}{x!}.

Partition [0, t] as 0 = t_0 < t_1 < t_2 < \cdots < t_k = t. With \mu(t_0) = 0 and X(t_0) = 0, the independent increments of the Poisson process give
f_{X(t_1), ..., X(t_k)}(x_1, ..., x_k) = \prod_{i=1}^{k} \frac{e^{-(\mu(t_i) - \mu(t_{i-1}))} (\mu(t_i) - \mu(t_{i-1}))^{x_i - x_{i-1}}}{(x_i - x_{i-1})!}
= e^{-\mu(t)} \prod_{i=1}^{k} \frac{(\mu(t_i) - \mu(t_{i-1}))^{x_i - x_{i-1}}}{(x_i - x_{i-1})!}
\approx e^{-\mu(t)} \prod_{i=1}^{k} \frac{(\lambda(t_i) \, \Delta t_i)^{x_i - x_{i-1}}}{(x_i - x_{i-1})!}.

Now let 0 < T_1 < T_2 < \cdots < T_N \le t denote the ordered distances from x to the observations falling in S_x(t) (the arrival times of the process). Dividing by \prod_i \Delta t_i and letting each \Delta t_i \to 0 gives the joint density of the arrival times,
f_{T_1, ..., T_N}(t_1, ..., t_N) = e^{-\mu(t)} \prod_{i=1}^{N} \lambda(t_i),
so that
\ln f = -\mu(t) + \sum_{i=1}^{N} \ln \lambda(t_i) = -\int_0^{t} \lambda(u) \, du + \int_0^{t} \ln \lambda(u) \, dN(u).

Taking t = R, the log-likelihood of the process observed within S_x(R) is
L_x = \int_0^{R} \ln \lambda(t) \, dN(t) - \int_0^{R} \lambda(t) \, dt
= \int_0^{R} \ln( f(x) V(m) m t^{m-1} ) \, dN(t) - \int_0^{R} f(x) V(m) m t^{m-1} \, dt
= \int_0^{R} \ln f(x) \, dN(t) + \int_0^{R} [\ln V(m) + \ln(m t^{m-1})] \, dN(t) - f(x) \int_0^{R} V(m) m t^{m-1} \, dt.

Maximizing L_x jointly over \theta = \ln f(x) and m yields Levina and Bickel's estimator. Written in terms of the k nearest neighbors of x, with T_j(x) the distance from x to its j-th nearest neighbor,
\hat{m}_k(x) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \ln \frac{T_k(x)}{T_j(x)} \right]^{-1},
and the final estimate averages \hat{m}_k(x) over all observations (and, in practice, over a range of k). A computational sketch of this estimator follows.
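The following is a minimal sketch of the k-nearest-neighbor form of the estimator, written in Python with NumPy. It is not the project's code; the function name, the default k = 10, the brute-force distance computation, and the example data are assumptions made for illustration.

import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel k-nearest-neighbor MLE of intrinsic dimension:
    m_hat_k(x) = [ (1/(k-1)) * sum_{j<k} log( T_k(x)/T_j(x) ) ]^{-1},
    averaged over all observations x."""
    X = np.asarray(X, dtype=float)
    # Brute-force pairwise Euclidean distances (fine for small n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort each row; column 0 is the zero distance from a point to itself,
    # so columns 1..k hold T_1(x), ..., T_k(x).
    T = np.sort(dist, axis=1)[:, 1:k + 1]
    # Per-observation estimate, then the average over the dataset.
    m_hat = (k - 1) / np.log(T[:, -1:] / T[:, :-1]).sum(axis=1)
    return float(np.mean(m_hat))

# Example: 500 points from a 5-dimensional cube linearly embedded in R^10.
rng = np.random.default_rng(1)
Y = rng.uniform(size=(500, 5))
X = Y @ rng.normal(size=(5, 10))
print(mle_intrinsic_dim(X, k=10))   # typically prints a value near 5

The O(n^2) distance matrix keeps the sketch short; for large n a tree-based nearest-neighbor search would be preferable, and averaging the estimate over a range of k, as Levina and Bickel suggest, reduces its sensitivity to the neighborhood size.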
Datasets
Several datasets have been developed to test the efficacy of intrinsic dimension estimators. These datasets range from the practical, such as human faces, to mathematically generated random datasets such as the Swiss roll. Several common datasets are listed below.

Swiss Roll
The dataset is three-dimensional, with each observation of the form (phi*cos(phi), phi*sin(phi), z), where phi and z are random numbers. The dataset was given its name because of the shape of the object when viewed in three-dimensional space. Since the dataset can be "unrolled" onto a flat surface, the intrinsic dimension should be two. Two main features make this dataset popular: it is easily generated, and traditional methods of dimension estimation, such as PCA, fail on it.

Artificial Face
This face database contains many pictures of a single artificial face under varying lighting and in varying horizontal and vertical orientations. Because of the three changing conditions, the intrinsic dimensionality should be three, but since the pictures are two-dimensional projections, the exact intrinsic dimension is not known. Levina and Bickel propose that the intrinsic dimension of the dataset is approximately four. (citation)

Rotating Hand
Similar to the Artificial Face set, this dataset is a video sequence of a hand rotating along a one-dimensional curve in space. (citation) Here, Levina and Bickel propose the dimension to be approximately three (based on the number of "basis" images required).

Hand-Written Twos
This dataset is a collection of images of the number two written by many different people; some have large loops and some small, some have extended tops and some do not. The dataset has an intrinsic dimension of about two.

Maximum Likelihood Equations
Example: Flipping Coins
Assume the probability of heads is 0.5 (p_H = 0.5). The probability of getting two heads in sequence is then P(HH | p_H = 0.5) = 0.25, and the likelihood of p_H = 0.5 given that two heads are observed in sequence is L(p_H = 0.5 | HH) = 0.25. A likelihood equation therefore takes the values that are observed and gives the likelihood that the event has a particular underlying probability. When the likelihood is plotted as a function of the probability, its maximum value is referred to as the maximum likelihood, and the equation for that maximum is a maximum likelihood equation. For two observed heads, for example, L(p_H | HH) = p_H^2, which is maximized at p_H = 1.

Results from Simulation
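As an illustration of how such a simulation might be set up (a sketch only, not the project's actual experiment; the ranges of phi and z, the sample size, and the choice k = 10 are assumptions), the Swiss roll described above can be generated and passed to the mle_intrinsic_dim sketch from the MLE section:

import numpy as np

def swiss_roll(n, seed=0):
    """Generate n Swiss-roll observations (phi*cos(phi), phi*sin(phi), z).
    The ranges of phi and z below are assumed, not taken from the project."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n)
    z = rng.uniform(0.0, 10.0, size=n)
    return np.column_stack((phi * np.cos(phi), phi * np.sin(phi), z))

X = swiss_roll(1000)
# Reuses the mle_intrinsic_dim function sketched in the MLE section above.
print(mle_intrinsic_dim(X, k=10))   # expected to be near 2, the true intrinsic dimension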